# Modeling

---

***Data Science Problem:***
> Using comments and submissions from the `r/history` and `r/Futurology` subreddits, can we build a model that can tell the difference between a conversation about the future and a conversation about the past?

---

## Notebook Contents:
- [Imports](#imports)
- [Modeling Prep](#prep)
    - [Base Model](#base-model)
    - [Logistic Regression](#log-reg-prep)
    - [Naive Bayes](#naive-bayes-prep)
    - [k-Nearest Neighbors](#knn-prep)
    - [Random Forest](#random-forest)
    - [Support Vector Machine](#svm)
- [Results](#results)
    - [Logistic Regression](#log-reg-results)
    - [Naive Bayes](#naive-bayes-results)
    - [k-Nearest Neighbors](#knn-results)
    - [Random Forest](#random-forest-results)
    - [Support Vector Machine](#svm-results)

<a id='imports'></a>

## Imports

---

In [88]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [106]:
# read in the data
reddit_data = pd.read_csv('./data/reddit_data.csv')

In [107]:
reddit_data.head()

| Unnamed: 0 | author | text | subreddit |
|---|---|---|---|
| 0 | AutoModerator | Hello, /u/Woofislove! Thank you for your parti... | 1 |
| 1 | user1688 | Taking away freedom after a crisis. This is tr... | 1 |
| 2 | natemb123 | Yep but that would lead to a decrease in oil d... | 1 |
| 3 | lIllIlIIlll | no i wouldn't be surprised lol i have a gradua... | 1 |
| 4 | tyler56721 | You must live a very lavish lifestyle. | 1 |


<a id='prep'></a>

## Modeling Prep

---

<a id='base-model'></a>
### Base Model

In [103]:
reddit_data['subreddit'].value_counts(normalize=True)

0    0.502429
1    0.497571
Name: subreddit, dtype: float64

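The baseline accuracy is about 50.2%: a model that always predicts the majority class (`0`) would be correct 50.2% of the time, so any useful classifier has to beat that figure.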
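The baseline figure comes straight from the class proportions: it is the share of the most frequent class. A minimal sketch of that calculation, using a small made-up label series (`toy_labels` is hypothetical and stands in for `reddit_data['subreddit']`):

```python
import pandas as pd

# Hypothetical labels standing in for reddit_data['subreddit']
toy_labels = pd.Series([0, 0, 0, 1, 1])

# Baseline accuracy = proportion of the most frequent class,
# i.e. the score of a model that always predicts that class
baseline_accuracy = toy_labels.value_counts(normalize=True).max()

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Applied to the real label column, the same expression should give the larger of the two proportions printed above.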

In [49]:
# set X and y values
X = reddit_data['text']
y = reddit_data['subreddit']

In [50]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=6,
                                                    stratify=y)

<a id='log-reg-prep'></a>

### Logistic Regression

In [72]:
def logreg_model(X_train, X_test, y_train, y_test, vectorizer):
    
    # Count Vectorized model
    if vectorizer == 'cvec':
        
        # instantiate pipeline
        pipe = Pipeline([
            ('cvec', CountVectorizer()),
            ('logreg', LogisticRegression(solver='lbfgs', max_iter=10_000))
        ])
        
        # set pipeline parameters
        pipe_params = {
            'cvec__max_features': [None, 5_000, 10_000],
            'cvec__stop_words': [None, 'english'],
            'cvec__ngram_range': [(1, 1), (1, 2)]
        }
    
    # Tfidf-Vectorized model
    elif vectorizer == 'tvec':
        
        # instantiate pipeline
        pipe = Pipeline([
            ('tvec', TfidfVectorizer()),
            ('logreg', LogisticRegression(solver='lbfgs', max_iter=10_000))
        ])
        
        # set pipeline parameters
        pipe_params = {
            'tvec__max_features': [None, 5_000, 10_000],
            'tvec__stop_words': [None, 'english'],
            'tvec__ngram_range': [(1, 1), (1, 2)]
        }

    # fail early on an unrecognized vectorizer name
    else:
        raise ValueError(f"vectorizer must be 'cvec' or 'tvec', got {vectorizer!r}")

    
    # instantiate gridsearch
    gs = GridSearchCV(pipe,
                      pipe_params,
                      cv=5)
    
    # fit model
    gs.fit(X_train, y_train)
    
    # print out accuracy scores and best model parameters
    print(f"{vectorizer} logistic regression model:")
    print("===============================")
    print(f"Best Score: {gs.best_score_}")
    print(f"Best Parameters: {gs.best_params_}\n")
    print(f"Test Score: {gs.score(X_test, y_test)}")
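The two branches of `logreg_model` differ only in the vectorizer class and in the prefix of the grid parameter names, which must match the pipeline step name (`cvec__...` or `tvec__...`). The sketch below shows one way to express that more compactly. The helper name `build_logreg_search`, the toy corpus, its label encoding, and the reduced grid are all assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def build_logreg_search(vectorizer_name, cv=5):
    """Hypothetical helper: build a grid search for either vectorizer."""
    vectorizers = {'cvec': CountVectorizer, 'tvec': TfidfVectorizer}
    if vectorizer_name not in vectorizers:
        raise ValueError(f"unknown vectorizer: {vectorizer_name!r}")

    # the step name doubles as the prefix for the grid parameters
    pipe = Pipeline([
        (vectorizer_name, vectorizers[vectorizer_name]()),
        ('logreg', LogisticRegression(solver='lbfgs', max_iter=10_000)),
    ])
    params = {f'{vectorizer_name}__ngram_range': [(1, 1), (1, 2)]}
    return GridSearchCV(pipe, params, cv=cv)


# Toy corpus (made up); 1 and 0 are arbitrary class labels here
toy_texts = [
    "the future of solar energy and electric cars",
    "robots and artificial intelligence will change future work",
    "space colonies on mars in the next century",
    "future cities will run on renewable energy",
    "the roman empire fell in the fifth century",
    "ancient egypt built pyramids for the pharaohs",
    "medieval knights fought in the crusades",
    "the history of the ottoman empire and its wars",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

search = build_logreg_search('tvec', cv=2)
search.fit(toy_texts, toy_labels)
print(search.best_params_)
```

Because the prefix is derived from the step name, the pipeline and the grid cannot drift out of sync when a different vectorizer is chosen.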


<a id='naive-bayes-prep'></a>

### Naive Bayes

In [45]:
# Multinomial Naive Bayes
def mnb_model(X_train, X_test, y_train, y_test):
    
    # instantiate pipeline
    pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('mnb', MultinomialNB())
    ])
    
    # set gridsearch parameters
    pipe_params = {
        'cvec__max_features': [None, 5_000, 10_000],
        'cvec__stop_words': [None, 'english'],
        'cvec__ngram_range': [(1, 1), (1, 2)]
    }
    
    # instantiate gridsearch
    gs = GridSearchCV(pipe,
                      pipe_params,
                      cv=5)
    
    # fit model
    gs.fit(X_train, y_train)
    
    # print out accuracy scores and best model parameters
    print("multinomial naive bayes model:")
    print("==============================")
    print(f"Best Score: {gs.best_score_}")
    print(f"Best Parameters: {gs.best_params_}\n")
    print(f"Test Score: {gs.score(X_test, y_test)}")    

In [58]:
# Gaussian Naive Bayes

def gnb_model(X_train, X_test, y_train, y_test):
    
    # instantiate estimator
    gnb = GaussianNB()
    
    # instantiate vectorizer
    tvec = TfidfVectorizer()
    
    # fit and transform features
    tvec_train_data_features = tvec.fit_transform(X_train)
    tvec_test_data_features = tvec.transform(X_test)
    
    # fit model
    gnb.fit(tvec_train_data_features.toarray(), y_train)
    
    # print out accuracy scores
    print("gaussian naive bayes model:")
    print("===========================")
    print(f"Train Score: {gnb.score(tvec_train_data_features.toarray(), y_train)}")
    print(f"Test Score: {gnb.score(tvec_test_data_features.toarray(), y_test)}")   

<a id='knn-prep'></a>

### k-Nearest Neighbors

In [75]:
def knn_cvec_model(X_train, X_test, y_train, y_test):
    
#     ss = StandardScaler()

    # instantiate vectorizer
    cvec = CountVectorizer()
    
    # fit and transform features
    cvec_train_data_features = cvec.fit_transform(X_train)
    cvec_test_data_features = cvec.transform(X_test)
    
#     cvec_train_data_features_scaled = ss.fit_transform(cvec_train_data_features)
#     cvec_test_data_features_scaled = ss.transform(cvec_test_data_features)
    
    # set gridsearch parameters
    params = {
        'n_neighbors': [5, 15, 25]
    }
    
    # instantiate gridsearch
    gs = GridSearchCV(KNeighborsClassifier(),
                      params,
                      cv=5)
    
    # fit model
    gs.fit(cvec_train_data_features, y_train)
    
    # print out accuracy scores and best model parameters
    print("k-nearest neighbors cvec model:")
    print("===============================")
    print(f"Best Score: {gs.best_score_}")
    print(f"Best Parameters: {gs.best_params_}\n")
    print(f"Test Score: {gs.score(cvec_test_data_features, y_test)}")  
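The scaling lines in `knn_cvec_model` are commented out. One likely obstacle is that `CountVectorizer` returns a sparse matrix, and `StandardScaler` refuses to center sparse input unless `with_mean=False` is passed. A minimal sketch of scaling inside a pipeline under that setting follows; the toy texts, labels, and `n_neighbors=3` are assumptions chosen only to keep the example small, and whether scaling helps on the real data is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# with_mean=False scales by the standard deviation only,
# which keeps the sparse count matrix sparse
knn_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('ss', StandardScaler(with_mean=False)),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])

# Toy training data (made up); labels are arbitrary class codes
toy_train_texts = [
    "future technology and space travel",
    "self driving cars in the future",
    "the future of renewable energy",
    "ancient rome and its emperors",
    "medieval history of europe",
    "the history of ancient egypt",
]
toy_train_labels = [1, 1, 1, 0, 0, 0]

knn_pipe.fit(toy_train_texts, toy_train_labels)
predictions = knn_pipe.predict(["the history of rome"])
print(predictions)
```

Putting the scaler in the pipeline also means it is refit on each training fold if the pipeline is passed to `GridSearchCV`.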

In [79]:
def knn_tvec_model(X_train, X_test, y_train, y_test):
    
    # instantiate vectorizer
    tvec = TfidfVectorizer()
    
    # fit and transform features
    tvec_train_data_features = tvec.fit_transform(X_train)
    tvec_test_data_features = tvec.transform(X_test)
    
    # set gridsearch parameters
    params = {
        'n_neighbors': [5, 15, 25]
    }
    
    # instantiate gridsearch
    gs = GridSearchCV(KNeighborsClassifier(),
                      params,
                      cv=5)
    
    # fit model
    gs.fit(tvec_train_data_features, y_train)
    
    # print out accuracy scores and best model parameters
    print("k-nearest neighbors tvec model:")
    print("===============================")
    print(f"Best Score: {gs.best_score_}")
    print(f"Best Parameters: {gs.best_params_}\n")
    print(f"Test Score: {gs.score(tvec_test_data_features, y_test)}")  

<a id='random-forest'></a>

### Random Forest

In [100]:
def random_forest_cvec_model(X_train, X_test, y_train, y_test):
    
    # instantiate estimator
    rf = RandomForestClassifier(random_state=6)
    
    # instantiate vectorizer
    cvec = CountVectorizer()
    
    # fit and transform features
    cvec_train_data_features = cvec.fit_transform(X_train)
    cvec_test_data_features = cvec.transform(X_test)
    
    # fit model (the estimator, not the vectorizer)
    rf.fit(cvec_train_data_features, y_train)
    
    # print out accuracy scores
    print("random forest cvec model:")
    print("=========================")
    print(f"Train Score: {rf.score(cvec_train_data_features, y_train)}")
    print(f"Test Score: {rf.score(cvec_test_data_features, y_test)}")

In [101]:
def random_forest_tvec_model(X_train, X_test, y_train, y_test):
    
    rf = RandomForestClassifier(random_state=6)
    tvec = TfidfVectorizer()

    # fit and transform features
    tvec_train_data_features = tvec.fit_transform(X_train)
    tvec_test_data_features = tvec.transform(X_test)
    
    # fit model (the estimator, not the vectorizer)
    rf.fit(tvec_train_data_features, y_train)
    
    # print out accuracy scores
    print("random forest tvec model:")
    print("=========================")
    print(f"Train Score: {rf.score(tvec_train_data_features, y_train)}")
    print(f"Test Score: {rf.score(tvec_test_data_features, y_test)}")

<a id='svm'></a>

### Support Vector Machine

In [92]:
def svm_cvec_model(X_train, X_test, y_train, y_test):
    
    # instantiate estimator
    svc = SVC()
    
    # instantiate vectorizer
    cvec = CountVectorizer()
    
    # fit and transform features
    cvec_train_data_features = cvec.fit_transform(X_train)
    cvec_test_data_features = cvec.transform(X_test)
    
    # fit model
    svc.fit(cvec_train_data_features, y_train)
    
    # print out accuracy scores
    print("cvec support vector machine:")
    print("============================")
    print(f"Train Score: {svc.score(cvec_train_data_features, y_train)}")  
    print(f"Test Score: {svc.score(cvec_test_data_features, y_test)}")  

In [98]:
def svm_tvec_model(X_train, X_test, y_train, y_test):
    
    # instantiate estimator
    svc = SVC()
    
    # instantiate vectorizer
    tvec = TfidfVectorizer()
    
    # fit and transform features
    tvec_train_data_features = tvec.fit_transform(X_train)
    tvec_test_data_features = tvec.transform(X_test)
    
    # fit model
    svc.fit(tvec_train_data_features, y_train)
    
    # print out accuracy scores
    print("tvec support vector machine:")
    print("============================")
    print(f"Train Score: {svc.score(tvec_train_data_features, y_train)}")  
    print(f"Test Score: {svc.score(tvec_test_data_features, y_test)}") 

<a id='results'></a>

## Results

---

Using the above functions, all model results have been consolidated below for ease of comparison.

<a id='log-reg-results'></a>

### Logistic Regression

In [52]:
logreg_model(X_train, X_test, y_train, y_test, 'cvec')

cvec logistic regression model:
Best Score: 0.8941637542466664
Best Parameters: {'cvec__max_features': None, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': None}

Test Score: 0.8976055450535602


In [53]:
logreg_model(X_train, X_test, y_train, y_test, 'tvec')

tvec logistic regression model:
Best Score: 0.8962643800737519
Best Parameters: {'tvec__max_features': None, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': 'english'}

Test Score: 0.897500525099769


<a id='naive-bayes-results'></a>

### Naive Bayes

In [54]:
mnb_model(X_train, X_test, y_train, y_test)

multinomial naive bayes model:
Best Score: 0.9080629599993332
Best Parameters: {'cvec__max_features': None, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': None}

Test Score: 0.9081075404326822


In [60]:
gnb_model(X_train, X_test, y_train, y_test)

gaussian naive bayes model:
Train Score: 0.8700416622903756
Test Score: 0.7548834278512917


<a id='knn-results'></a>

### k-Nearest Neighbors

In [77]:
knn_cvec_model(X_train, X_test, y_train, y_test)

k-Nearest Neighbors cvec model:
Best Score: 0.6657918436059879
Best Parameters: {'n_neighbors': 5}

Test Score: 0.6781138416299097


In [80]:
knn_tvec_model(X_train, X_test, y_train, y_test)

k-Nearest Neighbors tvec model:
Best Score: 0.8140601304411138
Best Parameters: {'n_neighbors': 25}

Test Score: 0.7568788069733249


<a id='random-forest-results'></a>

### Random Forest

In [105]:
# random_forest_cvec_model(X_train, X_test, y_train, y_test)

In [None]:
# random_forest_tvec_model(X_train, X_test, y_train, y_test)

<a id='svm-results'></a>

### Support Vector Machine

In [95]:
svm_cvec_model(X_train, X_test, y_train, y_test)

cvec support vector machine:
Train Score: 0.9192311731960928
Test Score: 0.8699852972064692


In [99]:
svm_tvec_model(X_train, X_test, y_train, y_test)

tvec support vector machine:
Train Score: 0.9887616846969857
Test Score: 0.9023314429741651
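For a side-by-side comparison, the test scores printed in this section can be gathered into a single table. The sketch below copies those scores by hand, rounded to four decimals. The model labels and column names are assumptions, and the Random Forest models are left out because their cells were not run.

```python
import pandas as pd

# Test accuracies copied from the outputs above, rounded to 4 decimals
results = pd.DataFrame(
    [
        ("Logistic Regression (cvec)", 0.8976),
        ("Logistic Regression (tvec)", 0.8975),
        ("Multinomial NB (cvec)", 0.9081),
        ("Gaussian NB (tvec)", 0.7549),
        ("k-Nearest Neighbors (cvec)", 0.6781),
        ("k-Nearest Neighbors (tvec)", 0.7569),
        ("SVM (cvec)", 0.8700),
        ("SVM (tvec)", 0.9023),
    ],
    columns=["model", "test_accuracy"],
)

# Sort so the strongest model on the test set appears first
results = (results
           .sort_values("test_accuracy", ascending=False)
           .reset_index(drop=True))

print(results)
```

On these numbers, the multinomial Naive Bayes model has the highest test accuracy, with the Tfidf-based SVM close behind.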
