# Modelling

In this notebook, I model my dataset using both Logistic Regression w/ TfidfVectorizer and Multinomial Naive Bayes w/ CountVectorizer, based on the exploratory modelling done in notebook 2. I first rerun these models and peek at their coefficients, then make changes to my dataset.

### Library Imports

In [1]:
# Import basic libraries
import numpy as np
import pandas as pd

# Import modelling libraries
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

### Read In Cleaned Data & Prepare Variables For Modelling

In [2]:
west_house = pd.read_csv('../data/west_house.csv')
submissions = west_house[west_house['submission'] == 1]
comments = west_house[west_house['submission'] == 0]

In [3]:
# Sets X variable as text column and sets y variable as subreddit column
# Prepares train/test split
X = west_house['text']
y = west_house['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)

In [4]:
# Sets a baseline accuracy score
y.value_counts(normalize=True)

1    0.50069
0    0.49931
Name: subreddit, dtype: float64
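
The baseline is the accuracy a model would get by always guessing the most common class, so any real model needs to beat roughly 0.50 here. A minimal sketch of that idea using only the standard library; the `baseline_accuracy` helper and the toy labels are hypothetical, not part of my dataset:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical toy labels, not the real subreddit column
toy_labels = [1, 0, 1, 1, 0, 0, 1]
print(baseline_accuracy(toy_labels))
```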

## Logistic Regression w/ TfidfVectorizer

In [5]:
tf = TfidfVectorizer(max_df=0.25,
                     max_features=250,
                     min_df=3,
                     ngram_range=(1, 1))
X_train_lr = pd.DataFrame(tf.fit_transform(X_train).toarray(),
                          columns=tf.get_feature_names())
X_test_lr = pd.DataFrame(tf.transform(X_test).toarray(),
                          columns=tf.get_feature_names())

lr = LogisticRegression()
lr.fit(X_train_lr, y_train)
print('Train: ', lr.score(X_train_lr, y_train))
print('Test: ', lr.score(X_test_lr, y_test))

# Puts coefficients in dataframe with the words they correspond to
lr_df = pd.DataFrame({'features': X_train_lr.columns, 'coefs': lr.coef_[0]}).sort_values('coefs')

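# Converts coefficients into probabilities with the logistic (sigmoid) function
# (the intercept is ignored, so these are rough per-word scores toward class 1)
lr_df['probs'] = np.exp(lr_df['coefs']) / (1 + np.exp(lr_df['coefs']))

# Export lr_df to use in data visualization notebook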
lr_df.to_csv('../data/logreg.csv', index=False) 

# Peek at results
lr_df

Train:  0.7950927139913841
Test:  0.7741444866920152


Unnamed: 0,features,coefs,probs
65,frank,-7.215240,0.000735
218,underwood,-4.442517,0.011629
29,claire,-4.181368,0.015048
24,card,-4.168590,0.015238
82,hoc,-3.961200,0.018685
...,...,...,...
184,sorkin,3.777608,0.977634
234,west,4.029840,0.982533
208,toby,4.058730,0.983022
93,josh,4.228423,0.985634
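
The `probs` column above comes from passing each coefficient through the logistic (sigmoid) function, which is algebraically the same as the `exp(x) / (1 + exp(x))` expression in the cell. As a quick sanity check, here is a minimal sketch using only the standard library; the `sigmoid` helper is my own name, and the two coefficients are copied from the table above.

```python
import math

def sigmoid(x):
    """Logistic function: maps a log-odds value to the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Coefficients copied from the lr_df output above
frank_coef = -7.215240
josh_coef = 4.228423

print(round(sigmoid(frank_coef), 6))
print(round(sigmoid(josh_coef), 6))
```

Because the intercept is left out, these values are best read as a rough per-word score toward class 1 rather than as predicted probabilities for a post.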


## Multinomial Naive Bayes w/ CountVectorizer

In [6]:
cv = CountVectorizer(max_df=0.33,
                     max_features=600,
                     min_df=10,
                     ngram_range=(1, 2))
X_train_nb = pd.DataFrame(cv.fit_transform(X_train).toarray(),
                          columns=cv.get_feature_names())
X_test_nb = pd.DataFrame(cv.transform(X_test).toarray(),
                          columns=cv.get_feature_names())

nb = MultinomialNB()
nb.fit(X_train_nb, y_train)
print('Train: ', nb.score(X_train_nb, y_train))
print('Test: ', nb.score(X_test_nb, y_test))

# Puts coefficients in dataframe with the words they correspond to
nb_df = pd.DataFrame({'features': X_train_nb.columns, 'coefs': nb.coef_[0]}).sort_values('coefs')

# Converts coefficients into probabilities using the same sigmoid transform as for LogReg.
# Caveat: MultinomialNB's coef_ holds log probabilities, log P(word | class 1), so np.exp
# alone is the exact inverse; for values this negative the sigmoid gives nearly the same numbers.
nb_df['probs'] = np.exp(nb_df['coefs'])/(1+np.exp(nb_df['coefs']))

# Export nb_df to use in data visualization notebook
nb_df.to_csv('../data/multi_nb.csv', index=False) 

# Peek at results
nb_df

Train:  0.809327589436224
Test:  0.8034220532319392


Unnamed: 0,features,coefs,probs
465,spacey,-10.531403,0.000027
204,hammerschmidt,-10.531403,0.000027
74,claire,-10.531403,0.000027
171,frank claire,-10.531403,0.000027
68,chapter,-10.531403,0.000027
...,...,...,...
558,west,-4.334959,0.012933
582,would,-4.287236,0.013557
129,episode,-4.249136,0.014076
449,show,-4.191043,0.014905
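
One caveat about the `probs` column for Naive Bayes: in the scikit-learn version used here, `MultinomialNB.coef_` stores log probabilities, log P(word | class 1), so exponentiating alone is the exact inverse, while the sigmoid is only a close approximation for very negative values. A small sketch of the difference, using the coefficient for 'west' from the table above (the helper and variable names are my own):

```python
import math

def sigmoid(x):
    """Logistic function, as applied in the cell above."""
    return 1 / (1 + math.exp(-x))

west_coef = -4.334959  # log-probability copied from the nb_df output above

as_sigmoid = sigmoid(west_coef)  # what the cell above computes
as_exp = math.exp(west_coef)     # exact inverse of a log-probability

print(round(as_sigmoid, 6), round(as_exp, 6))
```

The gap is small at this scale, but it grows as the log-probabilities approach zero, so the two transforms should not be treated as interchangeable.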


### Findings & Data Cleaning (cont'd)


As expected, my models picked out character names, show titles, and other key features of each show, and could tell whether a text came from the House of Cards or The West Wing subreddit with fairly high accuracy (test scores of about 0.77 and 0.80). To make things more interesting (hopefully), I will now remove these words and names from my dataset. Though I consider this a form of preprocessing, I waited until the modelling phase so that I could compare scores with and without show-specific words. I predict that my scores will drop precipitously once the show-specific words are removed.

In [7]:
# Creates a list of words that are overtly unique to The West Wing
west_words = ['sam', 'seaborn', 'bartlet', 'bartlett', 'allison', 'janney',
        'cj', 'c.j.', 'cregg', 'craig', 'john', 'spencer', 'leo', 'mcgarry',
        'bradley', 'whitford', 'josh', 'lyman', 'mandy', 'santos', 
        'martin', 'sheen', 'josiah', 'janel', 'west', 'wing',
        'moloney', 'donna', 'moss', 'richard', 'schiff', 'toby', 'ziegler',
        'dule', 'hill', 'charlie', 'margaret', 'rob', 'lowe', 'joshua',
        'malina', 'will', 'bailey', 'wa', 'stockard', 'channing', 'abbey', 'aaron',
        'sorkin', 'tommy', 'schlamme', 'misiano', 'graves', 'lawrence',
        "o'donnel", 'nbc', 'tww', 'hoynes']

# Creates a list of words that are overtly unique to House of Cards
house_words = ['kevin', 'spacey', 'hammerschmidt', 'netflix', 'james', 'foley', 'andrew', 'freddy',
              'davies', 'michael', 'dobbs', 'beau', 'willimon', 'robin', 'meechum', 'durant',
              'wright', 'claire', 'underwood', 'michael', 'kelly', 'frank', 'dunbar', 'yates',
              'justin', 'doescher', 'seth', 'grayson', 'zoe', 'kate', 'mara', 'remy', 'conway',
              'russo', 'rachel', 'hoc', 'walker', 'cards', 'card', 'doug', 'house', 'tusk', 'francis',
              'frances', 'tom', 'peter']

# Combines these lists together and then combines them with the nltk library's stopwords
west_house_words = west_words + house_words
wrong_words = set(stopwords.words('english') + west_house_words)

In [8]:
# Uses the same function from the preprocessing notebook to apply the new stopwords to the dataset
# I used the function we wrote in the NLP lesson as a template
lemmatizer = WordNetLemmatizer()

def cleaner_strings(post):
    # Lowercase, replace every non-letter character with a space, and split into tokens
    post = re.sub("[^a-zA-Z]", " ", post.lower()).split()
    # Lemmatize each token (e.g. 'cards' -> 'card')
    post = [lemmatizer.lemmatize(i) for i in post]
    # Drop stopwords and show-specific words
    right_words = [w for w in post if w not in wrong_words]
    return " ".join(right_words)

west_house['text'] = west_house['text'].apply(cleaner_strings)
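
To show what this cleaning step does to a single post, here is a minimal sketch using only the standard library. It skips lemmatization, and the `blocked` set and `strip_words` name are hypothetical stand-ins for `wrong_words` and `cleaner_strings`.

```python
import re

# Hypothetical stand-in for wrong_words (stopwords + show-specific words)
blocked = {"the", "is", "in", "frank", "claire", "c.j."}

def strip_words(post, blocked_words):
    """Lowercase, keep letters only, and drop any blocked token."""
    tokens = re.sub("[^a-zA-Z]", " ", post.lower()).split()
    return " ".join(t for t in tokens if t not in blocked_words)

print(strip_words("Frank is ruthless; Claire is worse!", blocked))
print(strip_words("C.J. is the best", blocked))
```

The second call shows a side effect worth knowing about: list entries that contain punctuation, such as 'c.j.', can never match, because the regex splits them into separate letters before the lookup happens.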

In [9]:
# Were any text rows rendered empty by this process?
west_house[west_house['text'] == ""]

Unnamed: 0,text,subreddit,trump,submission,created_utc,word_count
541,,0,0,0,1478621783,1
643,,0,0,0,1478146214,1
647,,0,0,0,1478140783,2
793,,0,0,0,1477743560,2
833,,0,0,0,1477539864,1
985,,0,0,0,1477254956,1
994,,0,0,0,1477204646,1
1798,,0,1,0,1479245213,1
3587,,1,1,0,1478791318,1
4419,,0,0,1,1460452270,2


In [10]:
# Drop blank rows
west_house = west_house[west_house['text'] != ""]

## Modelling 2.0

Here I use the same two models I chose as my 'best models' above, this time passing in the dataset with show-specific words removed. Though I am using the same model/vectorizer combinations, I am once again using Pipeline and GridSearch to optimize those models given the changes to the data. I expect scores to be low but coefficients to be interesting.

In [11]:
# Resets X variable as new version of text column and sets y variable as subreddit column
# Prepares train/test split
X = west_house['text']
y = west_house['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)

In [12]:
# Sets a baseline accuracy score
y.value_counts(normalize=True)

1    0.50132
0    0.49868
Name: subreddit, dtype: float64

## Logistic Regression w/ TfidfVectorizer

Following the same pattern as in notebook 2, I first optimize my hyperparameters and then rebuild my model with those hyperparameters, producing a dataframe of tokenized words so that I can visualize the coefficients.

In [13]:
# Sets pipeline parameters, passes them through GridSearch, and prints out scores
pipe_lr = Pipeline([('tvec', TfidfVectorizer()),
                    ('lr', LogisticRegression())])

# Hyperparameters were slowly tweaked to find optimal performance
pipe_params_lr = {'tvec__max_features' : [100],
                'tvec__min_df' : [1, 2],
                'tvec__max_df' : [.1, .25],
                'tvec__ngram_range' : [(1, 1), (1, 2)]}

grid_lr = GridSearchCV(pipe_lr, 
                  pipe_params_lr,
                  cv=5)
grid_lr.fit(X_train, y_train)
mod_lr = grid_lr.best_estimator_
print('Train: ', mod_lr.score(X_train, y_train))
print('Test: ', mod_lr.score(X_test, y_test))
grid_lr.best_params_

Train:  0.6300675675675675
Test:  0.6011428571428571


{'tvec__max_df': 0.25,
 'tvec__max_features': 100,
 'tvec__min_df': 1,
 'tvec__ngram_range': (1, 1)}
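
For a sense of how much work this GridSearch does, the sketch below counts the hyperparameter combinations in `pipe_params_lr` using only the standard library. With `cv=5`, each combination is fit five times, and GridSearchCV then refits the best one on the full training set by default. The `param_grid`, `combos`, and `n_folds` names are my own.

```python
from itertools import product

# Same grid as pipe_params_lr above
param_grid = {'tvec__max_features': [100],
              'tvec__min_df': [1, 2],
              'tvec__max_df': [.1, .25],
              'tvec__ngram_range': [(1, 1), (1, 2)]}

combos = list(product(*param_grid.values()))
n_folds = 5

print(len(combos))            # number of candidate settings
print(len(combos) * n_folds)  # cross-validation fits, excluding the final refit
```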

In [14]:
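# Note: GridSearch selected ngram_range=(1, 1) above, but this cell uses (1, 2),
# which is likely why the scores below differ slightly from the GridSearch scores
tf = TfidfVectorizer(max_df=0.25,
                     max_features=100,
                     min_df=1,
                     ngram_range=(1, 2))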
X_train_lr = pd.DataFrame(tf.fit_transform(X_train).toarray(),
                          columns=tf.get_feature_names())
X_test_lr = pd.DataFrame(tf.transform(X_test).toarray(),
                          columns=tf.get_feature_names())

lr = LogisticRegression()
lr.fit(X_train_lr, y_train)
print('Train: ', lr.score(X_train_lr, y_train))
print('Test: ', lr.score(X_test_lr, y_test))

# Puts coefficients in dataframe with the words they correspond to
lr_df = pd.DataFrame({'features': X_train_lr.columns, 'coefs': lr.coef_[0]}).sort_values('coefs')

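# Converts coefficients into probabilities with the logistic (sigmoid) function
# (the intercept is ignored, so these are rough per-word scores toward class 1)
lr_df['probs'] = np.exp(lr_df['coefs']) / (1 + np.exp(lr_df['coefs']))

# Export lr_df to use in data visualization notebook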
lr_df.to_csv('../data/logreg2.csv', index=False) 

# Peek at results
lr_df

Train:  0.628566066066066
Test:  0.603047619047619


Unnamed: 0,features,coefs,probs
75,spoiler,-3.310273,0.035220
67,season,-2.557993,0.071891
98,www,-1.157475,0.239126
42,lot,-0.733047,0.324526
71,shot,-0.570617,0.361094
...,...,...,...
90,watch,1.162855,0.761851
49,need,1.321812,0.789483
40,line,1.533307,0.822490
18,episode,1.718025,0.847874


## Multinomial Naive Bayes w/ CountVectorizer

Following the same pattern as above, I first optimize my hyperparameters and then rebuild my model with those hyperparameters, producing a dataframe of tokenized words so that I can visualize the coefficients.

In [15]:
# Sets pipeline parameters, passes them through GridSearch, and prints out scores
pipe_nb = Pipeline([('cvec', CountVectorizer()),
                    ('nb', MultinomialNB())])

# Hyperparameters were slowly tweaked to find optimal performance
pipe_params_nb = {'cvec__max_features' : [400],
                'cvec__min_df' : [10, 20],
                'cvec__max_df' : [.25, .4],
                'cvec__ngram_range' : [(1, 1), (1, 2)]}

grid_nb = GridSearchCV(pipe_nb, 
                  pipe_params_nb,
                  cv=5)
grid_nb.fit(X_train, y_train)
mod_nb = grid_nb.best_estimator_
print('Train: ', mod_nb.score(X_train, y_train))
print('Test: ', mod_nb.score(X_test, y_test))
grid_nb.best_params_

Train:  0.6931306306306306
Test:  0.6636190476190477


{'cvec__max_df': 0.25,
 'cvec__max_features': 400,
 'cvec__min_df': 10,
 'cvec__ngram_range': (1, 2)}

In [16]:
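# Note: GridSearch selected ngram_range=(1, 2) above, but this cell uses (1, 1),
# which is likely why the scores below differ slightly from the GridSearch scores
cv = CountVectorizer(max_df=0.25,
                     max_features=400,
                     min_df=10,
                     ngram_range=(1, 1))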
X_train_nb = pd.DataFrame(cv.fit_transform(X_train).toarray(),
                          columns=cv.get_feature_names())
X_test_nb = pd.DataFrame(cv.transform(X_test).toarray(),
                          columns=cv.get_feature_names())

nb = MultinomialNB()
nb.fit(X_train_nb, y_train)
print('Train: ', nb.score(X_train_nb, y_train))
print('Test: ', nb.score(X_test_nb, y_test))

# Puts coefficients in dataframe with the words they correspond to
nb_df = pd.DataFrame({'features': X_train_nb.columns, 'coefs': nb.coef_[0]}).sort_values('coefs')

# Converts coefficients into probabilities: MultinomialNB's coef_ holds log probabilities,
# log P(word | class 1), so exponentiating recovers P(word | class 1)
nb_df['probs'] = np.exp(nb_df['coefs'])

# Export nb_df to use in data visualization notebook
nb_df.to_csv('../data/multi_nb2.csv', index=False) 

# Peek at results
nb_df

Train:  0.6950075075075075
Test:  0.668952380952381


Unnamed: 0,features,coefs,probs
44,chapter,-10.244770,0.000036
347,trailer,-8.858475,0.000142
77,electoral,-7.942185,0.000355
369,war,-7.942185,0.000355
332,theory,-7.759863,0.000427
...,...,...,...
169,like,-4.148945,0.015781
251,president,-4.094167,0.016670
390,would,-4.062685,0.017203
299,show,-3.973781,0.018802
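
As far as I can tell, these values describe only one class, which would explain why the top of the list is dominated by words that are simply common ('show', 'would', 'like') rather than distinctive. A possible alternative for the visualization notebook is to compare the two classes directly with a log-ratio. The sketch below is a toy version under stated assumptions: the counts are made up, all names are my own, and the smoothing mirrors the Laplace smoothing (alpha = 1) that MultinomialNB uses by default.

```python
import math

# Made-up word counts per class, for illustration only
counts = {0: {"show": 50, "spoiler": 30, "episode": 20},
          1: {"show": 50, "spoiler": 5, "episode": 45}}
vocab = sorted(counts[0])
alpha = 1.0  # Laplace smoothing

def log_probs(class_counts):
    """Smoothed log P(word | class) for every word in vocab."""
    total = sum(class_counts.values()) + alpha * len(vocab)
    return {w: math.log((class_counts[w] + alpha) / total) for w in vocab}

lp0, lp1 = log_probs(counts[0]), log_probs(counts[1])

# Positive values lean toward class 1, negative toward class 0
log_ratio = {w: lp1[w] - lp0[w] for w in vocab}

for word, value in sorted(log_ratio.items(), key=lambda kv: kv[1]):
    print(word, round(value, 3))
```

In this toy data, 'show' has the highest probability within class 1 but a log-ratio of zero, since it is equally common in both classes. With a fitted model, the same comparison could be made from the two rows of `feature_log_prob_`.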


### Ready For Trump Side Project And Data Visualization

Now that my data is free of show-specific words and my main modelling is done, I am ready to export my dataframe as a CSV file for data visualization and further analysis.

In [17]:
west_house.to_csv('../data/west_house_2.csv', index=False) 