# NLP Subreddit Classification - Modeling Supplement

Run the notebook NLP_Subreddit_Classificiation.ipynb before running this script; the clean files it saves are used here. This supplement breaks down each model I tested.

### Contents:
- [Imports](#Imports)
- [Modeling](#Modeling)
  * [Logistic Regression Model 1](#Logistic-Regression-Model-1) - combined text (title + selftext) for X and CountVectorizer
  * [Logistic Regression Model 2](#Logistic-Regression-Model-2) - title for X and CountVectorizer
  * [Logistic Regression Model 3](#Logistic-Regression-Model-3) - combined text for X and TfidfVectorizer
  * [Logistic Regression Model 4](#Logistic-Regression-Model-4) - title for X and TfidfVectorizer
  * [Naive Bayes Model 1](#Naive-Bayes-Model-1) - combined text for X and CountVectorizer  
  * [Naive Bayes Model 2](#Naive-Bayes-Model-2) - title for X and CountVectorizer
  * [Naive Bayes Model 4](#Naive-Bayes-Model-4) - title for X and TfidfVectorizer
  * [Support Vector Machine Model 1](#Support-Vector-Machine-Model-1) - combined text for X and CountVectorizer
  * [Support Vector Machine Model 2](#Support-Vector-Machine-Model-2) - title for X and CountVectorizer
  * [Support Vector Machine Model 3](#Support-Vector-Machine-Model-3) - combined text for X and TfidfVectorizer
  * [Support Vector Machine Model 4](#Support-Vector-Machine-Model-4) - title for X and TfidfVectorizer

## Imports

In [106]:
#importing the packages
import numpy as np
import pandas as pd
import requests
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm

%config InlineBackend.figure_format = 'retina'

In [107]:
#importing warnings and suppressing all warnings (not only future warnings)
import warnings
warnings.simplefilter(action='ignore')

In [108]:
#reading in the cleaned file
df_combined = pd.read_csv('./datasets/clean_df_combined.csv')

In [109]:
#reading in the saved file pulled in 10 NaNs in clean_titles; converting them back to blank (single-space) strings
df_combined['clean_titles'] = df_combined['clean_titles'].fillna(' ')

## Modeling

In [110]:
#Finding the baseline accuracy. Our goal here is to do better than 51.8%, which is the share of the majority class in the sample.
df_combined['subreddit'].value_counts(normalize=True)

0    0.518116
1    0.481884
Name: subreddit, dtype: float64
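
As a cross-check on the baseline above, the same majority-class accuracy can be obtained from scikit-learn's `DummyClassifier`. The sketch below is illustrative only: `y_toy` and `X_toy` are made-up stand-ins for `df_combined['subreddit']`, not data from this project.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels standing in for df_combined['subreddit']
y_toy = np.array([0, 0, 0, 1, 1])
# The dummy model ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# Accuracy equals the share of the majority class in y_toy
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

Any model below should beat this number to be worth keeping.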

### Logistic Regression Model 1

This model uses combined text (title + selftext) for X and uses CountVectorizer.

#### Preprocessing

In [111]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_posts', 'subreddit']]

In [112]:
#setting X and y
X = df_crop['clean_posts']
y = df_crop['subreddit']

In [113]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [114]:
#setting up the pipeline order
pipe = Pipeline([
    ('cvec', CountVectorizer()), 
    ('lr', LogisticRegression()) 
])

In [115]:
#setting up the pipe parameters
pipe_params = {
    'cvec__max_features': [1000]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.7964113181504486


{'cvec__max_features': 1000}

In [116]:
# Train score
gs.score(X_train, y_train)

0.9585921325051759

In [117]:
# Test score
gs.score(X_test, y_test)

0.8302277432712215
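
The models in this supplement follow essentially the same steps shown above for the first model: split, build a pipeline, grid search, then score on train and test. As a rough sketch of how that pattern could be wrapped in one function, assuming a hypothetical helper named `run_text_model`, generic step names `vec` and `clf`, and a made-up toy corpus (none of which appear in the original notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def run_text_model(X, y, vectorizer, estimator, max_features=1000, cv=5):
    """Split, grid-search a vectorizer + estimator pipeline, return scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=42, stratify=y
    )
    # Generic step names (assumption); the notebook uses 'cvec'/'tvec', 'lr', etc.
    pipe = Pipeline([("vec", vectorizer), ("clf", estimator)])
    gs = GridSearchCV(
        pipe, param_grid={"vec__max_features": [max_features]}, cv=cv
    )
    gs.fit(X_train, y_train)
    return {
        "cv_score": gs.best_score_,
        "train_score": gs.score(X_train, y_train),
        "test_score": gs.score(X_test, y_test),
    }


# Hypothetical toy corpus, only to make the sketch runnable
texts = ["cat purrs softly", "cat naps", "kitten cat plays", "cat meows loudly",
         "dog barks loudly", "dog fetches", "puppy dog plays", "dog wags tail"] * 3
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 3

scores = run_text_model(texts, labels, CountVectorizer(), LogisticRegression(), cv=3)
print(scores)
```

With the real data, the call would pass `df_combined['clean_posts']` or `df_combined['clean_titles']` as `X` and swap in the vectorizer and estimator for each model.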

### Logistic Regression Model 2

This model uses titles for X and uses CountVectorizer.

#### Preprocessing

In [118]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_titles', 'subreddit']]

In [119]:
#setting X and y
X = df_crop['clean_titles']
y = df_crop['subreddit']

In [120]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [121]:
#setting up the pipeline order
pipe = Pipeline([
    ('cvec', CountVectorizer()), 
    ('lr', LogisticRegression()) 
])

In [122]:
#setting up the pipe parameters
pipe_params = {
    'cvec__max_features': [1000]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.7777777777777778


{'cvec__max_features': 1000}

In [123]:
# Train score
gs.score(X_train, y_train)

0.9296066252587992

In [124]:
# Test score
gs.score(X_test, y_test)

0.8178053830227743

### Logistic Regression Model 3

This model uses combined text (title + selftext) for X and uses TfidfVectorizer.

#### Preprocessing

In [125]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_posts', 'subreddit']]

In [126]:
#setting X and y
X = df_crop['clean_posts']
y = df_crop['subreddit']

In [127]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [128]:
#setting up the pipeline order
pipe = Pipeline([
    ('tvec', TfidfVectorizer()), 
    ('lr', LogisticRegression()) 
])

In [129]:
#setting up the pipe parameters
pipe_params = {
    'tvec__max_features': [1000]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.821256038647343


{'tvec__max_features': 1000}

In [130]:
# Train score
gs.score(X_train, y_train)

0.9164941338854382

In [131]:
# Test score
gs.score(X_test, y_test)

0.865424430641822

### Logistic Regression Model 4

This model uses titles for X and uses TfidfVectorizer.

#### Preprocessing

In [132]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_titles', 'subreddit']]

In [133]:
#setting X and y
X = df_crop['clean_titles']
y = df_crop['subreddit']

In [134]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [135]:
#setting up the pipeline order
pipe = Pipeline([
    ('tvec', TfidfVectorizer()), 
    ('lr', LogisticRegression()) 
])

In [136]:
#setting up the pipe parameters
pipe_params = {
    'tvec__max_features': [1000]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.7888198757763976


{'tvec__max_features': 1000}

In [137]:
# Train score
gs.score(X_train, y_train)

0.906832298136646

In [138]:
# Test score
gs.score(X_test, y_test)

0.8157349896480331

### Naive Bayes Model 1

This model uses combined text (title + selftext) for X and uses CountVectorizer.

#### Preprocessing

In [139]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_posts', 'subreddit']]

In [140]:
#setting X and y
X = df_crop['clean_posts']
y = df_crop['subreddit']

In [141]:
#splitting into train test split (note: unlike the other models, this split is not stratified)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42)

#### Modeling

In [142]:
#setting up the pipeline order
pipe = Pipeline([
    ('cvec', CountVectorizer()), 
    ('nb', MultinomialNB()) 
])

In [143]:
#setting up the pipe parameters
pipe_params = {
    'cvec__max_features': [1000]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.7984817115251898


{'cvec__max_features': 1000}

In [144]:
# Train score
gs.score(X_train, y_train)

0.8578329882677709

In [145]:
# Test score
gs.score(X_test, y_test)

0.8219461697722568

### Naive Bayes Model 2

This model uses titles for X and uses CountVectorizer.

#### Preprocessing

In [146]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_titles', 'subreddit']]

In [147]:
#setting X and y
X = df_crop['clean_titles']
y = df_crop['subreddit']

In [148]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [149]:
#setting up the pipeline order
pipe = Pipeline([
    ('cvec', CountVectorizer()), 
    ('nb', MultinomialNB()) 
])

In [150]:
#setting up the pipe parameters
pipe_params = {
    'cvec__max_features': [1000]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.7867494824016563


{'cvec__max_features': 1000}

In [151]:
# Train score
gs.score(X_train, y_train)

0.873015873015873

In [152]:
# Test score
gs.score(X_test, y_test)

0.8115942028985508

Note on Naive Bayes Model 3: this was the best-performing model and can be seen in the main notebook, Project_3_Main_File.ipynb.

### Naive Bayes Model 4

This model uses titles for X and uses TfidfVectorizer.

#### Preprocessing

In [153]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_titles', 'subreddit']]

In [154]:
#setting X and y
X = df_crop['clean_titles']
y = df_crop['subreddit']

In [155]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [156]:
#setting up the pipeline order
pipe = Pipeline([
    ('tvec', TfidfVectorizer()), 
    ('nb', MultinomialNB()) 
])

#setting up the pipe parameters
pipe_params = {
    'tvec__max_features': [1000]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.790200138026225


{'tvec__max_features': 1000}

In [157]:
# Train score
gs.score(X_train, y_train)

0.893719806763285

In [158]:
# Test score
gs.score(X_test, y_test)

0.8219461697722568

### Support Vector Machine Model 1

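This model was meant to use combined text (title + selftext) for X with CountVectorizer. Note that the preprocessing cells below select `clean_titles` rather than `clean_posts`, so the scores recorded here are identical to those of Support Vector Machine Model 2.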

#### Preprocessing

In [160]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_titles', 'subreddit']]

In [161]:
#setting X and y
X = df_crop['clean_titles']
y = df_crop['subreddit']

In [162]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [163]:
#setting up the pipeline order
pipe = Pipeline([
    ('cvec', CountVectorizer()),  
    ('svm', svm.SVC(gamma = 'auto')) 
])

#setting up the pipe parameters
pipe_params = {
    'cvec__max_features': [200]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.5769496204278813


{'cvec__max_features': 200}

In [164]:
# Train score
gs.score(X_train, y_train)

0.6273291925465838

In [165]:
# Test score
gs.score(X_test, y_test)

0.639751552795031

### Support Vector Machine Model 2

This model uses titles for X and uses CountVectorizer.

#### Preprocessing

In [166]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_titles', 'subreddit']]

In [167]:
#setting X and y
X = df_crop['clean_titles']
y = df_crop['subreddit']

In [168]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [169]:
#setting up the pipeline order
pipe = Pipeline([
    ('cvec', CountVectorizer()),  
    ('svm', svm.SVC(gamma = 'auto')) 
])

#setting up the pipe parameters
pipe_params = {
    'cvec__max_features': [200]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.5769496204278813


{'cvec__max_features': 200}

In [170]:
# Train score
gs.score(X_train, y_train)

0.6273291925465838

In [171]:
# Test score
gs.score(X_test, y_test)

0.639751552795031

### Support Vector Machine Model 3

This model uses combined text (title + selftext) for X and uses TfidfVectorizer.

#### Preprocessing

In [172]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_posts', 'subreddit']]

In [173]:
#setting X and y
X = df_crop['clean_posts']
y = df_crop['subreddit']

In [174]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [175]:
#setting up the pipeline order
pipe = Pipeline([
    ('tvec', TfidfVectorizer()),  
    ('svm', svm.SVC(gamma = 'auto')) 
])

#setting up the pipe parameters
pipe_params = {
    'tvec__max_features': [50]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.6928916494133885


{'tvec__max_features': 50}

In [176]:
# Train score
gs.score(X_train, y_train)

0.7129054520358868

In [177]:
# Test score
gs.score(X_test, y_test)

0.7370600414078675

### Support Vector Machine Model 4

This model uses titles for X and uses TfidfVectorizer.

#### Preprocessing

In [178]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_titles', 'subreddit']]

In [179]:
#setting X and y
X = df_crop['clean_titles']
y = df_crop['subreddit']

In [180]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [181]:
#setting up the pipeline order
pipe = Pipeline([
    ('tvec', TfidfVectorizer()),  
    ('svm', svm.SVC(gamma = 'auto')) 
])

#setting up the pipe parameters
pipe_params = {
    'tvec__max_features': [1000]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs.fit(X_train, y_train); 
print(gs.best_score_)
gs.best_params_

0.5182884748102139


{'tvec__max_features': 1000}

In [182]:
# Train score
gs.score(X_train, y_train)

0.5182884748102139

In [183]:
# Test score
gs.score(X_test, y_test)

0.5175983436853002
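
The cross-validation, train, and test scores for this model all sit at roughly the majority-class share from the baseline cell, which would be consistent with the classifier predicting a single class for every post. One way to check is to count the predicted labels. The sketch below assumes a hypothetical helper, `predicted_class_counts`, and a made-up toy corpus; it shows the check itself and is not a reproduction of the result above.

```python
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def predicted_class_counts(model, X):
    """Return {class_label: number of predictions} for a fitted model."""
    classes, counts = np.unique(model.predict(X), return_counts=True)
    return dict(zip(classes.tolist(), counts.tolist()))


# Hypothetical toy corpus, only to make the sketch runnable
texts = ["cat purrs softly", "cat naps", "kitten cat plays", "cat meows loudly",
         "dog barks loudly", "dog fetches", "puppy dog plays", "dog wags tail"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Same estimator settings as the cell above
pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("svm", svm.SVC(gamma="auto")),
])
pipe.fit(texts, labels)

counts = predicted_class_counts(pipe, texts)
print(counts)
```

On the real data, `predicted_class_counts(gs, X_test)` returning a single key would confirm the single-class behaviour. If it does, one thing that may be worth trying is scikit-learn's default `gamma='scale'`, which also accounts for the variance of the features, whereas `gamma='auto'` uses only `1 / n_features`.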