# Modelling

In this notebook, I use my two chosen models (logistic regression with TfidfVectorizer and multinomial naive Bayes with CountVectorizer) in two new ways. First, I keep the same y variable (subreddit) but split the data into text posted before and after Trump won the 2016 election, to see whether the coefficients change in interesting ways. Then, I change the y variable to the 'trump' column, i.e. I predict whether the text was posted before or after Trump won. These are very similar projects, but they might highlight interesting differences. I use the version of my dataset with all show-specific words taken out.

### Library Imports

In [1]:
# Import basic libraries
import numpy as np
import pandas as pd

# Import modelling libraries
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

### Read In Cleaned Data & Prepare Variables For Modelling

In [2]:
west_house = pd.read_csv('../data/west_house_2.csv')

# Predicting Subreddits Pre-Trump vs Post-Trump

In [3]:
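# Sets X variable as text column and sets y variable as subreddit column (pre-Trump posts only)
# Prepares train/test split
X = west_house[west_house['trump'] == 0]['text']
y = west_house[west_house['trump'] == 0]['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)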

In [4]:
y.value_counts(normalize=True)

1    0.501636
0    0.498364
Name: subreddit, dtype: float64

## Pre-Trump Posts

### Logistic Regression w/ TfidfVectorizer

In [5]:
# Hyperparameters were slowly tweaked to find optimal performance
tf = TfidfVectorizer(max_df=0.25,
                     max_features=100,
                     min_df=3,
                     ngram_range=(1, 1))
X_train_lr = pd.DataFrame(tf.fit_transform(X_train).toarray(),
                          columns=tf.get_feature_names_out())
X_test_lr = pd.DataFrame(tf.transform(X_test).toarray(),
                         columns=tf.get_feature_names_out())

lr = LogisticRegression()
lr.fit(X_train_lr, y_train)

# Let's see these scores
print('Train: ', lr.score(X_train_lr, y_train))
print('Test: ', lr.score(X_test_lr, y_test))

# Let's see these coefficients
lr_df = pd.DataFrame({'features': X_train_lr.columns, 'coefs': lr.coef_[0]}).sort_values('coefs')
lr_df['exp_coef'] = np.exp(lr_df['coefs'])
lr_df

Train:  0.6508831266441187
Test:  0.6234756097560976


Unnamed: 0,features,coefs,exp_coef
75,spoiler,-3.208140,0.040432
60,removed,-2.741804,0.064454
53,power,-2.631223,0.071990
67,season,-1.623320,0.197243
51,people,-1.524104,0.217816
...,...,...,...
39,line,1.406209,4.080457
94,white,1.484881,4.414438
18,episode,1.597609,4.941203
7,best,1.857071,6.404947
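
To make the exp_coef column easier to read, here is a minimal sketch of how I interpret it, using two coefficients copied from the table above. The helper names `odds_multiplier` and `shift_probability` are my own and not part of scikit-learn. Class 1 here is subreddit == 1, which the later split cell treats as The West Wing. Since tf-idf values lie between 0 and 1, a full one-unit increase is the extreme case, so the sketch also looks at a smaller step.

```python
import numpy as np

def odds_multiplier(coef, delta=1.0):
    """Factor by which the odds of class 1 change when a feature rises by delta."""
    return np.exp(coef * delta)

def shift_probability(p, coef, delta=1.0):
    """Apply the odds multiplier to a starting probability p and return the new probability."""
    odds = p / (1 - p) * odds_multiplier(coef, delta)
    return odds / (1 + odds)

# Coefficients copied from the pre-Trump logistic regression table
best_coef = 1.857071
spoiler_coef = -3.208140

# A one-unit increase reproduces the exp_coef column
print('best:', odds_multiplier(best_coef))
print('spoiler:', odds_multiplier(spoiler_coef))

# A smaller, more realistic tf-idf step of 0.1, starting from even odds
print('best, +0.1:', shift_probability(0.5, best_coef, delta=0.1))
print('spoiler, +0.1:', shift_probability(0.5, spoiler_coef, delta=0.1))
```

A value above 1 pushes the prediction toward class 1 and a value below 1 pushes it toward class 0, all other features held fixed.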


### Multinomial Naive Bayes w/ CountVectorizer

In [6]:
# Hyperparameters were slowly tweaked to find optimal performance
cv = CountVectorizer(max_df=0.33,
                     max_features=350,
                     min_df=10,
                     ngram_range=(1, 2))
X_train_nb = pd.DataFrame(cv.fit_transform(X_train).toarray(),
                          columns=cv.get_feature_names_out())
X_test_nb = pd.DataFrame(cv.transform(X_test).toarray(),
                         columns=cv.get_feature_names_out())

nb = MultinomialNB()
nb.fit(X_train_nb, y_train)

# Let's see these scores
print('Train: ', nb.score(X_train_nb, y_train))
print('Test: ', nb.score(X_test_nb, y_test))

# Let's see the log probability of each feature given class 1
nb_df = pd.DataFrame({'features': X_train_nb.columns, 'coefs': nb.feature_log_prob_[1]}).sort_values('coefs')
nb_df

Train:  0.6918451709883502
Test:  0.6509146341463414


Unnamed: 0,features,coefs
38,chapter,-9.493864
229,removed,-7.884426
69,electoral,-7.702104
316,war,-7.702104
42,close,-7.547954
...,...,...
148,like,-4.146756
211,president,-4.109369
258,show,-3.933182
337,would,-3.859074


## Post-Trump Posts

In [7]:
# Sets X variable as text column and sets y variable as subreddit column
# Prepares train/test split
X = west_house[west_house['trump'] == 1]['text']
y = west_house[west_house['trump'] == 1]['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)

### Logistic Regression w/ TfidfVectorizer

In [8]:
# Hyperparameters were slowly tweaked to find optimal performance
tf = TfidfVectorizer(max_df=0.25,
                     max_features=100,
                     min_df=3,
                     ngram_range=(1, 1))
X_train_lr = pd.DataFrame(tf.fit_transform(X_train).toarray(),
                          columns=tf.get_feature_names_out())
X_test_lr = pd.DataFrame(tf.transform(X_test).toarray(),
                         columns=tf.get_feature_names_out())

lr = LogisticRegression()
lr.fit(X_train_lr, y_train)

# Let's see these scores
print('Train: ', lr.score(X_train_lr, y_train))
print('Test: ', lr.score(X_test_lr, y_test))

# Let's see these coefficients
lr_df = pd.DataFrame({'features': X_train_lr.columns, 'coefs': lr.coef_[0]}).sort_values('coefs')
lr_df['exp_coef'] = np.exp(lr_df['coefs'])
lr_df

Train:  0.6481620405101275
Test:  0.6004566210045662


Unnamed: 0,features,coefs,exp_coef
67,season,-2.919242,0.053975
74,spoiler,-2.502997,0.081839
56,political,-1.290304,0.275187
20,every,-1.201565,0.300723
80,term,-1.053072,0.348865
...,...,...,...
95,white,1.313357,3.718635
48,name,1.315126,3.725220
88,two,1.340325,3.820284
49,need,1.467595,4.338787


### Multinomial Naive Bayes w/ CountVectorizer

In [9]:
# Hyperparameters were slowly tweaked to find optimal performance
cv = CountVectorizer(max_df=0.33,
                     max_features=350,
                     min_df=10,
                     ngram_range=(1, 2))
X_train_nb = pd.DataFrame(cv.fit_transform(X_train).toarray(),
                          columns=cv.get_feature_names_out())
X_test_nb = pd.DataFrame(cv.transform(X_test).toarray(),
                         columns=cv.get_feature_names_out())

nb = MultinomialNB()
nb.fit(X_train_nb, y_train)

# Let's see these scores
print('Train: ', nb.score(X_train_nb, y_train))
print('Test: ', nb.score(X_test_nb, y_test))

# Let's see the log probability of each feature given class 1
nb_df = pd.DataFrame({'features': X_train_nb.columns, 'coefs': nb.feature_log_prob_[1]}).sort_values('coefs')
nb_df

Train:  0.7029257314328582
Test:  0.6621004566210046


Unnamed: 0,features,coefs
27,belly,-9.578173
28,belly fat,-9.578173
88,fat,-9.578173
89,fat lose,-9.578173
43,chapter,-9.578173
...,...,...
290,think,-4.175495
218,president,-4.162072
76,episode,-4.081005
154,like,-4.044783


# Analysis

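Due to time constraints, I settled for somewhat higher variance with these models: a gap of up to about 5 percentage points between train and test accuracy was acceptable to me. Compared with the models in notebook 3, the scores were about the same. However, the coefficients did get shaken up a bit. In the logistic regression, for example, 'spoiler' and 'season' are among the strongest negative features in both periods, but 'season' overtakes 'spoiler' after Trump won, while 'removed' and 'power' leave the top five and 'political' enters it.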
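
To compare the pre- and post-Trump coefficients more systematically than by eye, something like the following sketch could help. The `compare_coefs` helper is hypothetical, and the two small tables only contain a few logistic regression values copied from the outputs above; in the notebook, the two `lr_df` frames would need to be saved under separate names before being passed in.

```python
import pandas as pd

def compare_coefs(pre_df, post_df):
    """Outer-join two coefficient tables on feature name and compute the change."""
    merged = pre_df.merge(post_df, on='features', how='outer',
                          suffixes=('_pre', '_post'))
    merged['change'] = merged['coefs_post'] - merged['coefs_pre']
    # Features missing from one period get NaN and sort to the end
    return merged.sort_values('change').reset_index(drop=True)

# A few values copied from the logistic regression outputs above
pre = pd.DataFrame({'features': ['spoiler', 'season', 'white'],
                    'coefs': [-3.208140, -1.623320, 1.484881]})
post = pd.DataFrame({'features': ['spoiler', 'season', 'white', 'political'],
                     'coefs': [-2.502997, -2.919242, 1.313357, -1.290304]})

comparison = compare_coefs(pre, post)
print(comparison)
```

The outer join keeps features that only made it into one vocabulary, which matters here because each vectorizer is refit on its own subset of posts.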

# Predicting Pre-/Post-Trump by Subreddit

In [10]:
# Split data by subreddit
west = west_house[west_house['subreddit'] == 1]
house = west_house[west_house['subreddit'] == 0]

# Trump: The West Wing

In [11]:
# Sets X variable as text column and sets y variable as trump column
# Prepares train/test split
X = west['text']
y = west['trump']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)

In [12]:
y.value_counts(normalize=True)

1    0.500125
0    0.499875
Name: trump, dtype: float64

## Logistic Regression w/ TfidfVectorizer

In [13]:
# Hyperparameters were slowly tweaked to find optimal performance
tf = TfidfVectorizer(max_df=0.1,
                     max_features=75,
                     min_df=3,
                     ngram_range=(1, 1))
X_train_lr = pd.DataFrame(tf.fit_transform(X_train).toarray(),
                          columns=tf.get_feature_names_out())
X_test_lr = pd.DataFrame(tf.transform(X_test).toarray(),
                         columns=tf.get_feature_names_out())

lr = LogisticRegression()
lr.fit(X_train_lr, y_train)

# Let's see these scores
print('Train: ', lr.score(X_train_lr, y_train))
print('Test: ', lr.score(X_test_lr, y_test))

# Let's see these coefficients
lr_df = pd.DataFrame({'features': X_train_lr.columns, 'coefs': lr.coef_[0]}).sort_values('coefs')
lr_df['exp_coef'] = np.exp(lr_df['coefs'])
lr_df

Train:  0.5716959940097341
Test:  0.5220364741641338


Unnamed: 0,features,coefs,exp_coef
60,state,-1.007120,0.365270
41,moment,-0.933731,0.393085
15,election,-0.717658,0.487893
35,line,-0.626019,0.534716
19,every,-0.494210,0.610052
...,...,...,...
4,anyone,0.827617,2.287861
69,watching,0.827840,2.288372
48,people,0.855286,2.352047
42,much,0.999316,2.716423


## Multinomial Naive Bayes w/ CountVectorizer

In [14]:
# Hyperparameters were slowly tweaked to find optimal performance
cv = CountVectorizer(max_df=0.33,
                     max_features=200,
                     min_df=10,
                     ngram_range=(1, 2))
X_train_nb = pd.DataFrame(cv.fit_transform(X_train).toarray(),
                          columns=cv.get_feature_names_out())
X_test_nb = pd.DataFrame(cv.transform(X_test).toarray(),
                         columns=cv.get_feature_names_out())

nb = MultinomialNB()
nb.fit(X_train_nb, y_train)

# Let's see these scores
print('Train: ', nb.score(X_train_nb, y_train))
print('Test: ', nb.score(X_test_nb, y_test))

# Let's see the log probability of each feature given class 1
nb_df = pd.DataFrame({'features': X_train_nb.columns, 'coefs': nb.feature_log_prob_[1]}).sort_values('coefs')
nb_df

Train:  0.5911643579183826
Test:  0.5326747720364742


Unnamed: 0,features,coefs
177,vinnick,-7.216801
195,ww,-6.523654
106,oh,-6.523654
189,win,-6.405871
129,running,-6.351804
...,...,...
193,would,-4.028385
107,one,-4.023243
78,like,-3.819779
33,episode,-3.819779


# Trump: House of Cards

In [15]:
# Sets X variable as text column and sets y variable as trump column
# Prepares train/test split
X = house['text']
y = house['trump']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)

## Logistic Regression w/ TfidfVectorizer

In [18]:
# Hyperparameters were slowly tweaked to find optimal performance
tf = TfidfVectorizer(max_df=0.25,
                     max_features=100,
                     min_df=3,
                     ngram_range=(1, 1))
X_train_lr = pd.DataFrame(tf.fit_transform(X_train).toarray(),
                          columns=tf.get_feature_names_out())
X_test_lr = pd.DataFrame(tf.transform(X_test).toarray(),
                         columns=tf.get_feature_names_out())

lr = LogisticRegression()
lr.fit(X_train_lr, y_train)

# Let's see these scores
print('Train: ', lr.score(X_train_lr, y_train))
print('Test: ', lr.score(X_test_lr, y_test))

# Let's see these coefficients
lr_df = pd.DataFrame({'features': X_train_lr.columns, 'coefs': lr.coef_[0]}).sort_values('coefs')
lr_df['exp_coef'] = np.exp(lr_df['coefs'])
lr_df

Train:  0.5950319909672563
Test:  0.5492742551566081


Unnamed: 0,features,coefs,exp_coef
59,removed,-1.170044,0.310353
3,amp,-1.060272,0.346361
68,seen,-0.974701,0.377305
22,find,-0.849738,0.427527
93,watching,-0.753015,0.470944
...,...,...,...
9,com,0.870793,2.388804
51,political,0.895413,2.448348
42,may,0.950721,2.587574
12,day,1.644554,5.178702


## Multinomial Naive Bayes w/ CountVectorizer

In [20]:
# Hyperparameters were slowly tweaked to find optimal performance
cv = CountVectorizer(max_df=0.33,
                     max_features=200,
                     min_df=10,
                     ngram_range=(1, 2))
X_train_nb = pd.DataFrame(cv.fit_transform(X_train).toarray(),
                          columns=cv.get_feature_names_out())
X_test_nb = pd.DataFrame(cv.transform(X_test).toarray(),
                         columns=cv.get_feature_names_out())

nb = MultinomialNB()
nb.fit(X_train_nb, y_train)

# Let's see these scores
print('Train: ', nb.score(X_train_nb, y_train))
print('Test: ', nb.score(X_test_nb, y_test))

# Let's see the log probability of each feature given class 1
nb_df = pd.DataFrame({'features': X_train_nb.columns, 'coefs': nb.feature_log_prob_[1]}).sort_values('coefs')
nb_df

Train:  0.6044410989838164
Test:  0.5500381970970206


Unnamed: 0,features,coefs
174,understand,-6.867662
20,candidate,-6.685341
159,talk,-6.462197
160,tell,-6.462197
12,bad,-6.462197
...,...,...
121,president,-3.923223
80,like,-3.714926
193,would,-3.706415
147,show,-3.702187


# Analysis

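Due to time constraints, I settled for somewhat higher variance with these models: a gap of roughly 5 percentage points between train and test accuracy was acceptable to me. Test accuracy ranged from about 0.52 to 0.55, barely above the baseline (about 0.50 for The West Wing, per the class balance above), so I don't think these models performed well at all; I was definitely hoping to see more interesting results.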
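
One way to check how far a model really sits above the baseline is to score both under the same cross-validation, as in the sketch below. The toy corpus and the `cv_accuracy` helper are hypothetical stand-ins (for example for `house['text']` and `house['trump']`), and the vectorizer uses default settings because the notebook's `min_df` and `max_df` values would not leave any vocabulary on twelve short texts.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = posted before, label 1 = posted after
texts = [
    'great episode last night', 'that scene was the best',
    'favorite line of the show', 'season finale was slow',
    'who is the best character', 'rewatching the first season',
    'the election changed everything', 'this feels too real now',
    'politics today is stranger', 'watching this after the election',
    'real news beats the show', 'the president would never',
]
labels = [0] * 6 + [1] * 6

def cv_accuracy(estimator, X, y, folds=3):
    """Mean and standard deviation of cross-validated accuracy."""
    # The vectorizer sits inside the pipeline so it is refit on each training fold
    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', estimator)])
    scores = cross_val_score(pipe, X, y, cv=folds, scoring='accuracy')
    return scores.mean(), scores.std()

baseline_mean, baseline_std = cv_accuracy(
    DummyClassifier(strategy='most_frequent'), texts, labels)
model_mean, model_std = cv_accuracy(LogisticRegression(), texts, labels)

print('Baseline: %.3f +/- %.3f' % (baseline_mean, baseline_std))
print('LogReg:   %.3f +/- %.3f' % (model_mean, model_std))
```

If the model's mean accuracy is within a standard deviation or so of the baseline, I would read that as the model not having learned much, which is the impression the single train/test split gives above.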