# Project 3: 'AskFeminists' vs. 'MensRights'
## Part C: Vectorizing & Analysis

[**Vectorizing**](#vector)

Before combining the two subreddit dataframes into one for modeling, I tested out different vectorizing options on both to compare the two subreddits. I tried out CountVectorizer and TFIDF on the AskFeminists content setting ngrams = (3-5) to get the top phrases from both: TFIDF was less repetitive than the CountVectorizer so I used TFIDF for vectorizing throughout. 

I tried vectorizing on ngrams 3-5, 1-2, and just single words, and created two custom lists of stopwords: one with all English stopwords and words in common between the top 100 words (without stopwords) from each subreddit; the second list of stopwords was created after fitting TFIDF on both subreddits (with stopwords = the first custom list) and taking the common words from the top 100 lists of words for each subreddit again. I tested models with both sets of stopwords and ultimately they didn't improve the model over using stopwords = 'english.')

[**Modeling**](#model)

I primarily tested two models on the lemmatized text: Logistic Regression and Random Forest Classifier. Starting with Logistic Regression, I tried several different parameters for vectorizing with TFIDF and the best parameters were 125,000 max features, ngrams = 1-4, and stopwords = 'English'.

The best test score I had ended up being on Logistic Regression (76% accuracy compared to baseline of 50%). This model was overfit on the training data (85% accuracy), and setting lower max_features closed the gap to a 3% difference between training/testing scores but also lowered the test scores. 

The best training score I had was with RandomForestClassifier, n_estimators = 100 and max_depth = 880, which gave me 98% on training data (but just 72% on the test data.)

In [230]:
# Import libaries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score

import regex as re
import nltk
from nltk.corpus import stopwords # Import the stop word list

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text

from sklearn.linear_model import LogisticRegression

pd.set_option('display.max_colwidth', -1)
pd.options.display.max_columns = 999

In [2]:
men = pd.read_csv('./men_clean_lem')
fem = pd.read_csv('./fem_clean_lem')

### Balancing datasets, removing nulls

In [3]:
men['lems'].isnull().sum()

225

In [5]:
men.dropna(inplace = True)

In [4]:
fem['lems'].isnull().sum()

350

In [6]:
fem.dropna(inplace = True)

In [7]:
men.shape

(31221, 8)

In [8]:
fem.shape

(28955, 8)

In [10]:
# creating smaller version for mensrights, same # rows as submissions for askfeminists
men = men.sample(29000, replace = False, random_state=42)

In [11]:
men.shape

(29000, 8)

## Vectorizing: Separate Analysis <a name="vector"></a>

In [49]:
# Instantiate a CountVectorizer
vect = CountVectorizer(ngram_range=(3,5), max_features = 10000, stop_words = 'english')
# Instantiate TFIDF
tfidf = TfidfVectorizer(ngram_range = (3, 5), max_features = 10000, stop_words = 'english')

In [29]:
# vectorizing fem for test
fem_vect = vect.fit_transform(fem['lems'])
# tfidfing fem for test
fem_tfidf = tfidf.fit_transform(fem['lems'])

In [30]:
# creating a df for vectorized words
fem_vect_df = pd.DataFrame(fem_vect.toarray(), columns=vect.get_feature_names())
# creating a df for tfidf words
fem_tfidf_df = pd.DataFrame(fem_tfidf.toarray(), columns=tfidf.get_feature_names())

In [31]:
# looking at vectorized value counts
vect_counts = fem_vect_df.sum(axis=0)
vect_counts.sort_values(ascending=False)[0:10]

reflect feminist perspective                  160
feminist reflect feminist                     155
feminist reflect feminist perspective         153
come feminist reflect                         89 
come feminist reflect feminist                89 
come feminist reflect feminist perspective    89 
level comment thread                          81 
difference men woman                          80 
feminist perspective comment                  74 
doesn make sense                              74 
dtype: int64

In [32]:
# looking at tfidf value counts
tfidf_counts = fem_tfidf_df.sum(axis=0)
tfidf_counts.sort_values(ascending=False)[0:10]

doesn make sense                41.643637
false rape accusation           36.430295
difference men woman            32.162463
traditional gender role         28.743795
just don think                  27.087392
reflect feminist perspective    26.147715
don really know                 25.665504
innocent proven guilty          24.234714
rape sexual assault             24.215631
gt don think                    23.785815
dtype: float64

TFIDF seems to have done a better job of ignoring similar phrases than count vectorizer (ex. 'reflect feminist perspective') and the results are more interesting.

##### Comparing Top Phrases (TFIDF, ngrams 3-5)

In [50]:
# fitting to tfidf
men_tfidf = tfidf.fit_transform(men['lems'])

In [51]:
# creating a df for tfidf words
men_tfidf_df = pd.DataFrame(men_tfidf.toarray(), columns=tfidf.get_feature_names())

In [52]:
# pulling top 10 phrases for men TFIDF with English stopwords, 3 are the same as fem
tfidf_counts_men = men_tfidf_df.sum(axis=0)
tfidf_counts_men.sort_values(ascending=False)[0:10]

false rape accusation     62.306164
men right movement        60.595215
men right issue           54.943756
international men day     50.388507
men right activist        39.744089
gender pay gap            33.950820
innocent proven guilty    33.152104
pay child support         27.741176
year old boy              22.977497
doesn make sense          22.798229
dtype: float64

#### Analyzing single words and bigrams

In [64]:
tfidf = TfidfVectorizer(ngram_range = (1, 2), max_features = 10000, stop_words = 'english')

In [65]:
# fitting to tfidf
fem_tfidf = tfidf.fit_transform(fem['lems'])

In [66]:
men_tfidf = tfidf.fit_transform(men['lems'])

In [67]:
# creating a df for tfidf words
fem_tfidf_df = pd.DataFrame(fem_tfidf.toarray(), columns=tfidf.get_feature_names())
men_tfidf_df = pd.DataFrame(men_tfidf.toarray(), columns=tfidf.get_feature_names())

In [68]:
# pulling top 10 words/phrases for fem TFIDF with English stopwords
tfidf_counts_fem = fem_tfidf_df.sum(axis=0)
tfidf_counts_fem.sort_values(ascending=False)[0:10]

wife ha             976.118614
men family          729.171984
fellow              677.549175
thing               612.694714
divorce             607.185712
partner violence    605.775949
lifestyle           550.282282
june                537.683578
voting              513.169382
greatest            450.885487
dtype: float64

In [69]:
# pulling top 10 words/phrases for men TFIDF with English stopwords
# interesting that feminist is one of the top 10 words for mensrights
tfidf_counts_men = men_tfidf_df.sum(axis=0)
tfidf_counts_men.sort_values(ascending=False)[0:10]

woman       868.976790
men         826.650364
wa          579.965131
just        491.954841
like        476.984193
don         462.072782
people      423.281936
gt          407.147621
right       403.265087
feminist    391.284568
dtype: float64

##### Comparing single words & bigrams

In [70]:
# finding top ngrams 1-2
top_100_fem_ngrams_stop = tfidf_counts_fem.sort_values(ascending=False)[0:100]
top_100_men_ngrams_stop = tfidf_counts_men.sort_values(ascending=False)[0:100]

In [71]:
# finding similar ngrams in both
fem_men_ngrams = set(top_100_men_ngrams_stop.index) & set(top_100_fem_ngrams_stop.index)

In [72]:
# not a lot of similarity in smaller ngrams
fem_men_ngrams

{'thing'}

There isn't a lot of similarity when comparing ngrams 1-2 between the two subs - AskFeminists has more common 2-word phrases (with stopwords removed) which suggest more similar contextualization of common words such as 'men' and 'women'. MensRights has very few 2-word phrases that are common, so the top 100 words/short phrases come with less context than the list for AskFeminists.

#### Analyzing top words (ngrams=1)

In [121]:
tfidf1 = TfidfVectorizer(max_features = 50000, stop_words = 'english')

In [125]:
# fitting to tfidf
fem_tfidf = tfidf1.fit_transform(fem['lems'])

In [127]:
men_tfidf = tfidf1.fit_transform(men['lems'])

In [126]:
# creating a df for tfidf words
fem_tfidf_df = pd.DataFrame(fem_tfidf.toarray(), columns=tfidf1.get_feature_names())

In [128]:
men_tfidf_df = pd.DataFrame(men_tfidf.toarray(), columns=tfidf1.get_feature_names())

In [129]:
# top words for askfeminists
tfidf_counts_fem = fem_tfidf_df.sum(axis=0)
tfidf_counts_fem.sort_values(ascending=False)[0:10]

woman       1058.470342
men         802.978127 
feminist    751.930089 
don         664.663076 
think       655.181627 
people      645.120908 
like        581.666010 
just        572.214255 
wa          530.030073 
gt          473.983548 
dtype: float64

In [130]:
# top words for mensrights
tfidf_counts_men = men_tfidf_df.sum(axis=0)
tfidf_counts_men.sort_values(ascending=False)[0:10]

woman       929.135323
men         889.633840
wa          594.643131
just        509.152025
don         493.963959
like        489.428134
people      441.916170
right       433.626604
feminist    407.702110
gt          406.739804
dtype: float64

In [135]:
# finding top ngrams 1-2
top_100_fem_words_stop = tfidf_counts_fem.sort_values(ascending=False)[0:100]
top_100_men_words_stop = tfidf_counts_men.sort_values(ascending=False)[0:100]

In [136]:
# finding similar words in both
fem_men_words = set(top_100_men_words_stop.index) & set(top_100_fem_words_stop.index)

In [137]:
len(fem_men_words) # 77 of the words are in the top 100 for both! 

77

In [138]:
fem_men_words

{'actually',
 'agree',
 'bad',
 'believe',
 'better',
 'case',
 'child',
 'come',
 'comment',
 'did',
 'didn',
 'doe',
 'doesn',
 'don',
 'equality',
 'fact',
 'feel',
 'female',
 'feminism',
 'feminist',
 'gender',
 'girl',
 'going',
 'good',
 'group',
 'gt',
 'guy',
 'ha',
 'having',
 'help',
 'isn',
 'issue',
 'just',
 'know',
 'life',
 'like',
 'look',
 'lot',
 'make',
 'male',
 'man',
 'mean',
 'men',
 'need',
 'people',
 'person',
 'point',
 'post',
 'problem',
 'rape',
 'read',
 'really',
 'reason',
 'right',
 'said',
 'say',
 'saying',
 'sex',
 'sexual',
 'society',
 'sure',
 'thanks',
 'thing',
 'think',
 'thought',
 'time',
 'use',
 'victim',
 'wa',
 'want',
 'way',
 'white',
 'woman',
 'word',
 'work',
 'wrong',
 'yes'}

#### Comparing Stopwords with Count Vect

In [104]:
vect1 = CountVectorizer(max_features = 50000, stop_words = 'english')

In [105]:
# fitting to vect
fem_vect = vect1.fit_transform(fem['lems'])

In [107]:
men_vect = vect1.fit_transform(men['lems'])

In [106]:
# creating a df for tfidf words
fem_vect_df = pd.DataFrame(fem_vect.toarray(), columns=vect1.get_feature_names())

In [108]:
men_vect_df = pd.DataFrame(men_vect.toarray(), columns=vect1.get_feature_names())

In [109]:
# top words for askfeminists
vect_counts_fem = fem_vect_df.sum(axis=0)
vect_counts_fem.sort_values(ascending=False)[0:10]

woman       22124
men         14360
people      12063
don         11036
think       10884
like        10296
feminist    10242
just        9659 
wa          9085 
gt          7853 
dtype: int64

In [110]:
# top words for mensrights
vect_counts_men = men_vect_df.sum(axis=0)
vect_counts_men.sort_values(ascending=False)[0:10]

woman       17622
men         16400
wa          10868
just        7402 
like        6929 
people      6385 
don         6367 
gt          6236 
right       5382 
feminist    5265 
dtype: int64

In [111]:
# finding top words
top_100_fem_words_stop = vect_counts_fem.sort_values(ascending=False)[0:100]
top_100_men_words_stop = vect_counts_men.sort_values(ascending=False)[0:100]

In [112]:
# finding similar words in both
fem_men_words = set(top_100_men_words_stop.index) & set(top_100_fem_words_stop.index)

In [114]:
len(fem_men_words) # about the same as tfidf

78

#### Setting Custom Stopwords, Second Comparison

In [None]:
# Brian's code, setting stop words equal to intersection terms in top 100 using count vectorizer
stop_words_1 = text.ENGLISH_STOP_WORDS.union(['actually',
 'agree',
 'bad',
 'believe',
 'better',
 'case',
 'child',
 'come',
 'comment',
 'did',
 'didn',
 'doe',
 'doesn',
 'don',
 'equality',
 'fact',
 'feel',
 'female',
 'feminism',
 'feminist',
 'gender',
 'girl',
 'going',
 'good',
 'group',
 'gt',
 'guy',
 'ha',
 'having',
 'help',
 'isn',
 'issue',
 'just',
 'know',
 'life',
 'like',
 'look',
 'lot',
 'make',
 'male',
 'man',
 'mean',
 'men',
 'need',
 'people',
 'person',
 'point',
 'post',
 'problem',
 'rape',
 'read',
 'really',
 'reason',
 'right',
 'said',
 'say',
 'saying',
 'sex',
 'sexual',
 'society',
 'sure',
 'thanks',
 'thing',
 'think',
 'thought',
 'time',
 'use',
 'victim',
 'wa',
 'want',
 'way',
 'white',
 'woman',
 'word',
 'work',
 'wrong',
 'yes'])

In [164]:
tfidf2 = TfidfVectorizer(max_features = 50000, stop_words = stop_words_1)

In [165]:
# fitting to tfidf
fem_tfidf = tfidf2.fit_transform(fem['lems'])

In [167]:
men_tfidf = tfidf2.fit_transform(men['lems'])

In [166]:
# creating a df for tfidf words
fem_tfidf_df = pd.DataFrame(fem_tfidf.toarray(), columns=tfidf2.get_feature_names())

In [168]:
men_tfidf_df = pd.DataFrame(men_tfidf.toarray(), columns=tfidf2.get_feature_names())

In [170]:
# top words for askfeminists, after removing most common words
tfidf_counts_fem = fem_tfidf_df.sum(axis=0)
tfidf_counts_fem.sort_values(ascending=False)[0:10]

question      377.321037
yes           252.605901
idea          223.008711
different     218.713998
agree         211.648976
understand    204.031619
answer        201.841974
opinion       195.025050
example       191.834915
trans         182.431184
dtype: float64

In [171]:
# top words for mensrights, after removing most common words
tfidf_counts_men = men_tfidf_df.sum(axis=0)
tfidf_counts_men.sort_values(ascending=False)[0:10]

day        203.099967
archive    202.091099
boy        192.386067
law        172.444206
sub        169.660586
got        163.834252
article    163.380445
shit       163.243384
care       159.536521
yes        157.502748
dtype: float64

In [173]:
# finding top words
top_100_fem_words_stop1 = tfidf_counts_fem.sort_values(ascending=False)[0:100]
top_100_men_words_stop1 = tfidf_counts_men.sort_values(ascending=False)[0:100]

In [174]:
# finding similar words in both
fem_men_words1 = set(top_100_men_words_stop1.index) & set(top_100_fem_words_stop1.index)

In [176]:
len(fem_men_words1)

62

In [177]:
fem_men_words1

{'agree',
 'aren',
 'argument',
 'article',
 'assault',
 'best',
 'boy',
 'care',
 'change',
 'claim',
 'consent',
 'crime',
 'day',
 'different',
 'doing',
 'equal',
 'evidence',
 'exactly',
 'example',
 'far',
 'friend',
 'getting',
 'got',
 'hard',
 'hate',
 'idea',
 'job',
 'kind',
 'law',
 'let',
 'making',
 'masculinity',
 'matter',
 'maybe',
 'movement',
 'opinion',
 'place',
 'power',
 'pretty',
 'probably',
 'question',
 'read',
 'real',
 'seen',
 'sound',
 'stop',
 'study',
 'sub',
 'talk',
 'talking',
 'tell',
 'thanks',
 'toxic',
 'true',
 'try',
 'understand',
 'used',
 'violence',
 'word',
 'wouldn',
 'yeah',
 'yes'}

### Setting Custom Stopwords (2nd layer)

In [178]:
# added common words from last version to common words from previous version
stop_words_2 = text.ENGLISH_STOP_WORDS.union(['agree',
 'aren',
 'argument',
 'article',
 'assault',
 'best',
 'boy',
 'care',
 'change',
 'claim',
 'consent',
 'crime',
 'day',
 'different',
 'doing',
 'equal',
 'evidence',
 'exactly',
 'example',
 'far',
 'friend',
 'getting',
 'got',
 'hard',
 'hate',
 'idea',
 'job',
 'kind',
 'law',
 'let',
 'making',
 'masculinity',
 'matter',
 'maybe',
 'movement',
 'opinion',
 'place',
 'power',
 'pretty',
 'probably',
 'question',
 'read',
 'real',
 'seen',
 'sound',
 'stop',
 'study',
 'sub',
 'talk',
 'talking',
 'tell',
 'thanks',
 'toxic',
 'true',
 'try',
 'understand',
 'used',
 'violence',
 'word',
 'wouldn',
 'yeah',
 'yes',
'actually',
 'agree',
 'bad',
 'believe',
 'better',
 'case',
 'child',
 'come',
 'comment',
 'did',
 'didn',
 'doe',
 'doesn',
 'don',
 'equality',
 'fact',
 'feel',
 'female',
 'feminism',
 'feminist',
 'gender',
 'girl',
 'going',
 'good',
 'gt',
 'guy',
 'ha',
 'isn',
 'issue',
 'just',
 'know',
 'life',
 'like',
 'look',
 'lot',
 'make',
 'male',
 'man',
 'mean',
 'men',
 'nan',
 'need',
 'people',
 'person',
 'point',
 'post',
 'problem',
 'rape',
 'read',
 'really',
 'reason',
 'right',
 'said',
 'say',
 'saying',
 'sex',
 'sexual',
 'sure',
 'thanks',
 'thing',
 'think',
 'time',
 'use',
 'wa',
 'want',
 'way',
 'white',
 'woman',
 'word',
 'work',
 'wrong',
 'yes', 'aren',
 'argument',
 'assault',
 'best',
 'boy',
 'care',
 'change',
 'claim',
 'consent',
 'day',
 'different',
 'doing',
 'equal',
 'evidence',
 'exactly',
 'far',
 'getting',
 'got',
 'group',
 'hate',
 'having',
 'help',
 'idea',
 'job',
 'kind',
 'law',
 'le',
 'let',
 'making',
 'matter',
 'maybe',
 'movement',
 'ok',
 'place',
 'power',
 'pretty',
 'probably',
 'question',
 'real',
 'society',
 'sound',
 'stop',
 'sub',
 'support',
 'talk',
 'talking',
 'tell',
 'thought',
 'true',
 'try',
 'trying',
 'understand',
 'used',
 'victim',
 'violence',
 'world',
 'wouldn',
 'yeah',
 'year'])

### Combining into one DF

In [179]:
femvmen = pd.concat([fem, men], axis=0, join='outer')

In [180]:
femvmen.shape

(57955, 8)

In [181]:
femvmen.columns

Index(['Unnamed: 0', 'text', 'type', 'subreddit', 'removed', 'deleted',
       'clean_text_stop', 'lems'],
      dtype='object')

In [182]:
femvmen = pd.get_dummies(femvmen, columns=['subreddit'], drop_first = True)

In [183]:
femvmen.drop(columns = "Unnamed: 0", inplace = True)

In [366]:
femvmen.reset_index(inplace = True)

In [381]:
femvmen = pd.DataFrame(femvmen)

In [368]:
femvmen.to_csv('./femvmen_lem')

In [369]:
femvmen.columns

Index(['index', 'text', 'type', 'removed', 'deleted', 'clean_text_stop',
       'lems', 'subreddit_MensRights'],
      dtype='object')

# Modeling <a name="model"></a>

#### Setting Vars/TrainTestSplit

In [518]:
X = femvmen['lems']
y = femvmen['subreddit_MensRights']

In [519]:
y.value_counts(normalize = True) # about 50% mensrights posts

1    0.500388
0    0.499612
Name: subreddit_MensRights, dtype: float64

In [520]:
#TRAIN TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True, 
                                                    stratify = y)

In [521]:
y_train.value_counts(normalize = True)

1    0.500388
0    0.499612
Name: subreddit_MensRights, dtype: float64

#### Vectorizing

In [273]:
# setting up multiple vectorizers with english stopwords, ngrams 1-4, different features
tfidf = TfidfVectorizer(ngram_range = (1, 4), max_features = 125000, stop_words = 
                        'english')
tfidf2 = TfidfVectorizer(ngram_range = (1, 4), max_features = 125000, stop_words = 
                        stop_words_1)
tfidf3 = TfidfVectorizer(ngram_range = (1, 4), max_features = 125000, stop_words = 
                        stop_words_2)
tfidf4 = TfidfVectorizer(ngram_range = (1, 4), max_features = 100000, stop_words = 
                        'english')
tfidf5 = TfidfVectorizer(ngram_range = (1, 4), max_features = 100000, stop_words = 
                        stop_words_1)
tfidf6 = TfidfVectorizer(ngram_range = (1, 4), max_features = 100000, stop_words = 
                        stop_words_2)
tfidf7 = TfidfVectorizer(ngram_range = (1, 4), max_features = 50000, stop_words = 
                        'english')
tfidf8 = TfidfVectorizer(ngram_range = (1, 4), max_features = 50000, stop_words = 
                        stop_words_1)
tfidf9 = TfidfVectorizer(ngram_range = (1, 4), max_features = 50000, stop_words = 
                        stop_words_2)
tfidf10 = TfidfVectorizer(ngram_range = (1, 4), max_features = 5000, stop_words = 
                        'english')
tfidf11 = TfidfVectorizer(ngram_range = (1, 4), max_features = 5000, stop_words = 
                        stop_words_1)
tfidf12 = TfidfVectorizer(ngram_range = (1, 4), max_features = 5000, stop_words = 
                        stop_words_2)

In [274]:
#fit transform train sets
X_train_tf = tfidf.fit_transform(X_train)
X_train_tf2 = tfidf2.fit_transform(X_train)
X_train_tf3 = tfidf3.fit_transform(X_train)
X_train_tf4 = tfidf4.fit_transform(X_train)
X_train_tf5 = tfidf5.fit_transform(X_train)
X_train_tf6 = tfidf6.fit_transform(X_train)
X_train_tf7 = tfidf7.fit_transform(X_train)
X_train_tf8 = tfidf8.fit_transform(X_train)
X_train_tf9 = tfidf9.fit_transform(X_train)
X_train_tf10 = tfidf10.fit_transform(X_train)
X_train_tf11 = tfidf11.fit_transform(X_train)
X_train_tf12 = tfidf12.fit_transform(X_train)

In [275]:
# transform test set
X_test_tf = tfidf.transform(X_test)
X_test_tf2 = tfidf2.transform(X_test)
X_test_tf3 = tfidf3.transform(X_test)
X_test_tf4 = tfidf4.transform(X_test)
X_test_tf5 = tfidf5.transform(X_test)
X_test_tf6 = tfidf6.transform(X_test)
X_test_tf7 = tfidf7.transform(X_test)
X_test_tf8 = tfidf8.transform(X_test)
X_test_tf9 = tfidf9.transform(X_test)
X_test_tf10 = tfidf10.transform(X_test)
X_test_tf11 = tfidf11.transform(X_test)
X_test_tf12 = tfidf12.transform(X_test)

### Logistic Regression

Best versions of vectorized data for logistic regression are with stopwords = 'english' and max features = 125000. With very few features (10K or less) the gap between train and test scores closes to 3% but the test score continues to drop with fewer than 100K features.

In [522]:
logreg = LogisticRegression(penalty='l2')

In [524]:
# version 1
X_train_tf = tfidf.fit_transform(X_train)

In [525]:
X_test_tf = tfidf.transform(X_test)

In [526]:
logreg.fit(X_train_tf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [527]:
logreg.score(X_train_tf, y_train)

0.8545854542317315

In [528]:
logreg.score(X_test_tf, y_test)

0.7605038391855751

In [279]:
# version 2
logreg.fit(X_train_tf2, y_train)
logreg.score(X_train_tf2, y_train)

0.8391640065568113

In [280]:
logreg.score(X_test_tf2, y_test)

0.7380726425675093

In [281]:
# version 3
logreg.fit(X_train_tf3, y_train)
logreg.score(X_train_tf3, y_train)

0.8343326719006126

In [282]:
logreg.score(X_test_tf3, y_test)

0.7316883789146752

In [283]:
# version 4
logreg.fit(X_train_tf4, y_train)
logreg.score(X_train_tf4, y_train)

0.8502070571995514

In [284]:
logreg.score(X_test_tf4, y_test)

0.7510999913726167

In [285]:
# version 5
logreg.fit(X_train_tf5, y_train)
logreg.score(X_train_tf5, y_train)

0.8355620740229488

In [286]:
logreg.score(X_test_tf5, y_test)

0.738849107065827

In [287]:
# version 6
logreg.fit(X_train_tf6, y_train)
logreg.score(X_train_tf6, y_train)

0.8304287809507377

In [288]:
logreg.score(X_test_tf6, y_test)

0.731860926580968

In [289]:
# version 7
logreg.fit(X_train_tf7, y_train)
logreg.score(X_train_tf7, y_train)

0.8377404883098956

In [290]:
logreg.score(X_test_tf7, y_test)

0.7490294193771029

In [291]:
# version 8
logreg.fit(X_train_tf8, y_train)
logreg.score(X_train_tf8, y_train)

0.824346475713916

In [292]:
logreg.score(X_test_tf8, y_test)

0.7391079285652662

In [293]:
# version 9
logreg.fit(X_train_tf9, y_train)
logreg.score(X_train_tf9, y_train)

0.8187818134759728

In [294]:
logreg.score(X_test_tf9, y_test)

0.7308256405832111

In [295]:
# version 10
logreg.fit(X_train_tf10, y_train)
logreg.score(X_train_tf10, y_train)

0.7789664394789061

In [296]:
logreg.score(X_test_tf10, y_test)

0.7393667500647054

In [297]:
# version 11
logreg.fit(X_train_tf11, y_train)
logreg.score(X_train_tf11, y_train)

0.7708135622465706

In [298]:
logreg.score(X_test_tf11, y_test)

0.7247001984298163

In [299]:
# version 12
logreg.fit(X_train_tf12, y_train)
logreg.score(X_train_tf12, y_train)

0.7659606591320852

In [300]:
logreg.score(X_test_tf12, y_test)

0.7149512552842723

### Most Influential Features

In [373]:
# fitting on best model
logreg.fit(X_train_tf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [529]:
coefficients = logreg.coef_

In [530]:
coef_df = pd.DataFrame(coefficients, columns = tfidf2.get_feature_names())

In [531]:
coef_df.head()

Unnamed: 0,aa,aa policy,aaaaand,aabg,aap,aauw,ab,aback,abandon,abandoned,abandoning,abandonment,abbey,abbreviated,abbreviation,abc,abc news,abctv,abctv scottmorrisonmp,abdicate,abdicate responsibility,abdication,abdomen,abdominal,abducted,abduction,aberrant,aberration,abhor,abhorrent,abide,abide rule,abiding,abigail,ability,ability ability,ability associated,ability associated chess,ability associated chess assumption,ability based,ability birth,ability career,ability change,ability choose,ability coder,ability commit,ability communicate,ability compete,ability consent,ability control,ability decision,ability earn,ability experience,ability influence,ability job,ability lead,ability obtain,ability pay,ability perceive,ability power,ability prevent,ability provide,ability reproduce,ability responsibility,ability speak,ability survive,ability understand,ability wanted,ability willingness,ability willingness negotiate,ability willingness negotiate salary,abit,abit bbc,abject,abject poverty,ablated,able,able abandon,able abandon parental,able abandon parental obligation,able able,able abortion,able accept,able accept trash,able accept trash quality,able access,able accrue,able accrue wealth,able accrue wealth unlike,able achieve,able acknowledge,able act,able address,able afford,able afford able,able afford able day,able afford private,able agree,able analyze,able analyze situation,able angry,able angry allowed,able angry allowed arent,able answer,able answer question,able appeal,able apply,able appreciate,able argument,able ask,able assault,able assist,able assist husband,able assist husband trade,able avoid,able aware,able away,able baby,able beat,able block,able bodied,able bodied ci,able bodied disease,able bodied disease free,able bodied neurotypical,able body,able break,able break agreement,able breastfeed,able breathe,able bring,able build,able buy,able buy house,able care,able catch,able catch damn,able catch damn diagnostic,able change,able change maybe,able change mind,able choice,able choose,able choose father,able claim,able clear,able close,able comfortable,able comfortable home,able comfortable home guarantee,able company,able compete,able compete ci,able compete sport,able compete sport line,able complain,able concede,able concede status,able concede status father,able connect,able consent,able consent tell,able consider,able continue,able contribute,able control,able convince,able convince reddit,able cope,able court,able court order,able create,able cum,able data,able date,able day,able day recover,able day recover clinic,able deal,able decide,able decision,able defend,able definitely,able destroy,able determine,able directly,able discern,able discriminate,able discus,able discussion,able distinguish,able dress,able empathize,able evidence,able experience,able explain,able explain biological,able explain biological basis,able express,able fight,able financial,able force,able forward,able fully,able gain,able game,able handle,able happens,able hold,able identify,able incorporate,able infer,able infer kind,able infer kind iat,able inform,able interpret,able job,able laid,able learn,able leave,able legally,able leverage,able lift,able lift pound,able live,able ma,able ma line,able maintain,able manipulate,able marry,able meet,able million,able navigate,able note,able note moderator,able offer,able opt,able participate,able past,able pay,able perform,able pick,able play,able power,able pregnant,able protect,able prove,able provide,able pull,able read,able recognize,able record,able reduce,able reduced,able reduced free,able reduced free home,able refute,able relate,able remain,able remove,able run,able sell,able separate,able separate conversation,able separate conversation fgm,able set,able sexist,able share,able shut,able sign,able similar,able single,able solve,able speak,able speak clearly,able speak clearly unambiguously,able spend,able spend hour,able spot,able stand,able stay,able stop,able straight,able strong,able study,able subpoena,able subpoena law,able survive,able talk,able talk feeling,able talk feeling scared,able teach,able tell,able toilet,able trust,able try,able turn,able twist,able understand,able understand concept,able understand concept emotional,able vote,able vote learn,able walk,able watch,able wear,able willing,able win,able withdraw,able withdraw consent,able withdraw consent morning,able word,abled,ableism,ableism calling,ableism calling believed,ableism calling believed kid,ableism homophobia,ableist,ableist change,ableist change dime,ableist change dime challenging,ableist language,ableist preference,ableist remark,ableist slur,ableist slur cognitive,ableist slur cognitive developmental,abnormal,abnormal behavior,abnormality,abnormality hilarious,abnormality hilarious banana,abnormality hilarious banana useless,abnormally,aboard,aboard plane,abolish,abolish prison,abolished,abolishing,abolishment,abolition,abolitionism,abolitionist,abominable,abomination,abomination important,abomination important moral,abomination important moral consideration,aboriginal,abort,abort adoption,abort baby,abort body,abort fetus,abort force,aborted,aborting,abortion,abortion able,abortion abortion,abortion adoption,abortion agree,abortion anti,abortion appointment,abortion aren,abortion arguing,abortion argument,abortion asked,abortion attacked,abortion available,abortion available assault,abortion available assault aside,abortion avoid,abortion avoid supporting,abortion avoid supporting maternity,abortion away,abortion away gay,abortion away gay marriage,abortion baby,abortion basically,abortion bc,abortion begin,abortion best,abortion birth,abortion birth control,abortion birth control purpose,abortion bodily,abortion bodily autonomy,abortion body,abortion body choice,abortion called,abortion came,abortion carry,abortion carrying,abortion carrying term,abortion carrying term analogous,abortion choice,abortion choice body,abortion chooses,abortion chooses half,abortion chooses half pregnancy,abortion clinic,abortion common,abortion completely,abortion consent,abortion consider,abortion contraceptive,abortion control,abortion cost,abortion country,abortion cut,abortion debate,abortion defend,abortion despite,abortion discovered,abortion discovered expecting,abortion discovered expecting son,abortion easy,abortion fetus,abortion financial,abortion fine,abortion form,abortion free,abortion freely,abortion fucked,abortion gendered,abortion happen,abortion illegal,abortion late,abortion law,abortion legal,abortion medical,abortion morally,abortion murder,abortion necessarily,abortion okay,abortion opinion,abortion option,abortion parent,abortion pay,abortion performed,abortion poor,abortion position,abortion possible,abortion pregnancy,abortion pregnant,abortion pretty,abortion pro,abortion pro choice,abortion reproductive,abortion restriction,abortion service,abortion simply,abortion wanted,abortion week,abovementioned,abraham,abrahamic,abrahamic religion,abrahamic religion christianity,abrasion,abrasive,abridging,abridging freedom,abridging freedom speech,abroad,abrupt,abruptly,absence,absence court,absence court order,absence evidence,absent,absent evidence,absent father,absentee,absolute,absolute abundance,absolute bare,absolute bare minimum,absolute best,absolute bitch,absolute confidence,absolute garbage,absolute hell,absolute hiv,absolute hiv number,absolute hiv number methodological,absolute nonsense,absolute opposite,absolute opposite love,absolute ops,absolute ops business,absolute ops business ask,absolute outrage,absolute piece,absolute piece shit,...,yes consider,yes consider using,yes consider using gain,yes considered,yes correct,yes course,yes created,yes crime,yes cross,yes culture,yes current,yes currently,yes currently biology,yes currently biology relevance,yes dad,yes dangerous,yes day,yes dead,yes decided,yes decision,yes decision body,yes decision impact,yes decision impact choice,yes default,yes definitely,yes definitely site,yes definitely site wide,yes definition,yes depends,yes difference,yes different,yes directly,yes disagree,yes distinction,yes distinction important,yes distinction important denying,yes doing,yes domination,yes endorsing,yes enthusiastic,yes especially,yes exactly,yes example,yes exception,yes experience,yes explain,yes fair,yes false,yes false accusation,yes feeling,yes feminine,yes fine,yes fix,yes fluid,yes fluid nb,yes fluid nb keen,yes focus,yes free,yes free informed,yes free informed consent,yes freely,yes freely enthusiastically,yes freely enthusiastically given,yes friend,yes fucking,yes gave,yes gay,yes generally,yes got,yes guess,yes happen,yes happened,yes hard,yes hate,yes hear,yes heard,yes hold,yes home,yes home longer,yes home longer allowed,yes homemaker,yes huge,yes human,yes idea,yes important,yes including,yes inequality,yes interesting,yes involved,yes involved ending,yes involved ending innocent,yes job,yes join,yes kid,yes kind,yes law,yes lawyer,yes lead,yes let,yes let continue,yes likely,yes limit,yes literally,yes making,yes matter,yes maybe,yes meant,yes medium,yes mental,yes mental illness,yes mention,yes met,yes mother,yes muslim,yes necessarily,yes necessarily real,yes necessarily real human,yes news,yes news deemed,yes news deemed law,yes non,yes obviously,yes ok,yes okay,yes old,yes opinion,yes oppresses,yes oppresses wouldn,yes oppresses wouldn eager,yes option,yes outside,yes partially,yes past,yes patriarchy,yes personally,yes politics,yes prefer,yes pretty,yes privilege,yes probably,yes probably stalemate,yes probably stalemate change,yes prominent,yes question,yes quite,yes race,yes race play,yes radical,yes rapist,yes rationale,yes react,yes react id,yes react id skeptical,yes read,yes real,yes realize,yes refer,yes refer toxic,yes refer toxic masculinity,yes regret,yes relationship,yes removal,yes removal idelaogical,yes removal idelaogical argument,yes ridiculous,yes risk,yes rocket,yes rocket science,yes role,yes role switched,yes role switched protest,yes saw,yes school,yes science,yes scream,yes scream bring,yes scream bring awareness,yes second,yes seen,yes seen article,yes self,yes self defense,yes sense,yes sexism,yes sexist,yes shitty,yes solid,yes sort,yes specifically,yes start,yes stop,yes strong,yes strongly,yes strongly anti,yes strongly anti doping,yes study,yes stuff,yes talk,yes talking,yes taught,yes teach,yes technically,yes tell,yes tend,yes terrible,yes thank,yes told,yes totally,yes trans,yes transphobic,yes true,yes try,yes type,yes understand,yes understanding,yes used,yes using,yes wanted,yes wasn,yes wearing,yes word,yes worse,yes wouldn,yes yes,yes yes enthusiastic,yes yes important,yes yes yes,yesallmen,yesterday,yesterday evening,yesterday posted,yesterday revealed,yi,yiannopoulos,yield,yielded,yikes,yin,yin yang,yinz,ymca,ymmv,yo,yo kid,yoga,yoga pant,yoke,york,york article,york city,york state,york state law,york university,yorker,yorkshire,youd,youll,young,young able,young able inform,young absolutely,young academic,young actress,young adult,young adult strongly,young adult strongly associated,young adult study,young adult study parental,young age,young age knew,young age simply,young aged,young aged old,young aged old complete,young american,young angry,young aren,young asian,young ask,young attractive,young baby,young barely,young beautiful,young black,young black commit,young bodily,young bodily integrity,young boy,young boy aren,young boy born,young boy born surgery,young caught,young childless,young coming,young consent,young conservative,young couple,young daughter,young day,young desperate,young doing,young face,young father,young fertile,young fertile attractive,young fertile attractive countless,young fertile attractive period,young got,young hate,young hot,young human,young infant,young interested,young kid,young kind,young lady,young lady narrating,young lady narrating video,young learn,young left,young likely,young live,young live home,young living,young living city,young living city earn,young looking,young majority,young majority college,young majority college student,young mere,young mind,young money,young mother,young moved,young murdered,young nazi,young nazi soldier,young nazi soldier fall,young negative,young old,young old emotion,young old emotion dont,young owning,young owning property,young parent,young particular,young perception,young perception attitude,young perception attitude behaviour,young perform,young phase,young phase figure,young phase figure angry,young politician,young pretty,young promised,young raped,young single childless earn,young skill,young son,young sort,young speak,young start,young student,young teen,young teenager,young today,young told,young used,young workforce,young worried,young young,younger,younger age,younger attractive,younger born,younger boy,younger brother,younger doing,younger generation,younger older,younger older younger,younger rapist,younger self,younger sister,youngest,youngest son,youngster,youre,youre older,youre older instead,youre older instead transitioning,youre sexist,youse,yout,youth,youth age,youth age domestic,youth age domestic violence,youth beauty,youth boy,youth club,youth mad,youth suicide,youth suicide average,youth suicide average okay,youthful,youtoo,youtoo successful,youtoo successful using,youtoo successful using pussy,youtried,youtried jpg,youtube,youtube antifeminist,youtube audio,youtube channel,youtube channel called,youtube channel extracredits,youtube channel focused,youtube channel kenyan,youtube channel kenyan cosmetic,youtube channel similar,youtube check,youtube got,youtube great,youtube great analysis,youtube great analysis masculinity,youtube link,youtube probably,youtube search,youtube shooting,youtube spotlight,youtube star,youtube subscriber,youtube talk,youtube try,youtube tv,youtube video,youtuber,youtubers,youtubes,youve,youve got,ypt,ypu,yr,yr old,yt,yuck,yum,yummy,yung,yup,zahidi,zarya,zealand,zealand report,zealand report human,zealand submission,zealot,zealous,zeitgeist,zelda,zero,zero dawn,zero dollar,zero effect,zero evidence,zero experience,zero precedent,zero proof,zero repercussion,zero restriction,zero self,zero sense,zero sum,zero sum game,zero sum situation,zero sympathy,zero tolerance,zero tolerance policy,zero value,zeroing,zimbardo,zionism,zipper,zizek,zoe,zoe quinn,zombie,zombie apocalypse,zone,zoning,zoo,zuckerberg,zur,zygote
0,-0.013801,-0.141119,0.135741,-0.066027,0.503888,0.03459,-0.05515,0.543464,0.069633,-0.142672,-0.536743,0.689786,0.265999,-0.095073,0.17266,-0.048525,0.258237,0.043457,0.043457,-0.128661,0.088476,0.088476,-0.086814,-0.058487,0.121984,0.032844,-0.012841,-0.103017,0.084248,-0.036005,-0.118764,0.282517,0.16833,0.00311,-1.252711,-0.093389,-0.111448,-0.012955,-0.084094,0.030585,0.056264,-0.292663,-0.009066,-0.088057,0.016901,-0.011886,-0.010777,-0.111861,-0.111861,-0.111861,0.104289,0.198193,0.198193,0.026947,0.076652,0.190171,0.373146,-0.090035,0.130749,0.036709,0.073873,0.139434,0.042872,-0.110928,-0.188247,-0.054043,0.031674,-0.054201,-0.092198,0.158451,0.033684,-0.011121,0.090753,0.073145,-1.405962,-0.147112,-0.429875,-0.101707,-0.093644,0.154937,-0.067173,0.102377,0.217105,-0.040465,-0.194401,-0.117583,0.111432,-0.05733,0.263231,0.015192,-0.055667,-0.112371,0.279423,-0.239471,0.010729,0.062226,-0.096364,-0.296076,0.099042,0.108727,0.071469,-0.053579,0.087489,-0.026312,0.012859,0.082486,-0.044007,0.215793,-0.263083,-0.046557,-0.122718,-0.102466,-0.102466,-0.094636,-0.112471,0.124342,-0.015199,0.07734,0.090896,0.219333,-0.163747,-0.00362,0.025096,-0.023298,-0.073324,0.14148,-0.028741,-0.04261,-0.12696,-0.022103,-0.022103,-0.022103,-0.062365,0.09851,-0.108937,0.057056,0.092018,0.257785,-0.021962,0.06291,0.028815,0.071489,-0.008874,0.019217,0.113206,-0.28389,-0.094827,0.178487,-0.123384,-0.156738,0.040769,-0.057435,0.113731,0.024201,0.179434,0.021308,0.118899,-0.08601,-0.16344,-0.150398,0.037491,-0.080761,0.011464,-0.049858,-0.049323,-0.121607,-0.0057,0.10455,-0.276675,-0.224059,-0.108227,-0.092431,-0.092431,-0.092431,-0.02558,-0.015917,-0.156055,-0.028652,0.405327,-0.192786,-0.053531,-0.053531,0.039243,0.003667,0.130862,-0.127839,0.107106,-0.15868,0.01047,-0.056014,-0.238836,-0.627668,-0.00488,-0.041386,-0.153965,-0.028448,-0.157582,-0.094852,0.202064,-0.125202,-0.02277,0.263009,-0.1412,0.054921,-0.152945,-0.160367,-0.584844,-0.446439,0.175182,-0.092472,0.487803,-0.061039,-0.649933,-0.07844,-0.488996,-0.467679,-0.047036,-0.054417,-0.38204,-0.130737,-0.208354,-0.133586,0.067749,0.177807,0.517093,-0.028604,-0.112302,-0.448517,-0.054087,0.20955,0.136744,-0.091129,-1.701626,0.235557,0.302203,-0.103447,0.199577,-0.020802,-0.101514,-0.200917,-0.141872,-0.021318,0.117159,0.032993,-0.017005,0.010621,-0.086935,-0.097789,-0.087312,-0.040179,-0.148746,0.126125,0.119545,-0.038586,-0.05041,-0.039911,0.235451,0.085875,0.069165,-0.263898,-0.414661,-0.065961,0.069526,0.527997,-0.187985,-0.118801,-0.005344,0.490073,0.07506,-0.090715,-0.222432,-0.1412,-0.325048,-0.069383,-0.123516,-0.045292,-0.084152,0.029231,0.244494,-0.088717,-0.084976,0.050597,-0.02377,-0.266758,0.038504,0.079417,-0.174064,0.118842,0.118842,-0.098758,-0.107849,-0.004814,-0.130708,-0.148454,0.022296,0.130513,0.045675,-0.1779,-0.05608,-0.077782,-0.077782,-0.03396,-0.006677,0.044879,-0.179602,-0.096941,-0.002813,0.066566,-0.250945,-0.03251,0.035896,-0.103768,-0.028266,-0.081879,0.102373,-0.233645,0.253002,-0.125007,0.053407,-0.455485,-0.037271,-0.037271,0.071326,0.071326,-0.516419,-0.108829,0.10215,0.766,0.11996,0.054351,0.054351,-0.105413,0.076546,0.173011,-0.196716,-0.059646,-0.015787,-0.015787,0.230308,-0.004226,0.038161,0.038161,0.038161,0.03963,0.188217,-0.03307,-0.13574,-0.123277,0.098853,-1.123938,-0.188604,0.43442,-0.152571,0.005715,-0.056054,-0.047699,0.101436,-0.064735,0.016623,0.075124,0.047811,0.072275,0.106259,-0.061802,0.340723,-0.077443,-0.03494,0.096941,-0.04679,-0.206905,0.075695,0.133644,-0.170964,-0.029918,-0.05191,0.099455,-0.02733,0.070453,-0.092718,-0.007252,0.015302,-0.056786,0.105451,-0.080909,0.112019,-0.119381,0.073792,0.027526,-0.202448,0.03279,0.018175,-0.025973,-0.025973,-0.171835,-0.217267,0.167015,0.202228,-0.226405,0.053906,-0.003149,0.09117,-0.158952,-0.273237,-0.224685,0.041681,-0.159544,-0.067409,-0.210803,-0.027083,0.057484,0.169128,0.0462,-0.112776,-0.196997,0.043071,-0.100907,0.016179,0.189607,0.116955,-0.016483,-0.005374,-0.184704,-0.093434,-0.093434,-0.093434,0.094764,-0.079519,0.240681,0.080963,0.069049,0.205445,-0.484692,-0.0716,-0.071705,-0.21844,0.148333,0.175444,-0.132484,0.545039,-0.061477,0.197428,0.262061,0.156871,-0.025762,-0.122377,-0.122377,1.826338,-0.007622,0.025725,-0.041447,0.141115,0.169675,0.438967,0.067149,-0.011826,-0.08262,0.032735,-0.024385,-0.01042,-0.003797,-0.004761,-0.080495,0.013936,0.054567,0.047868,-0.046693,0.040731,-0.08616,0.4668,0.08351,0.120661,-0.297348,-0.042037,-0.051076,0.056927,-0.030117,0.09585,-0.171241,0.241729,0.020236,-0.083961,-0.048765,-0.024382,0.022319,0.094539,0.094539,-0.002156,-0.057168,-0.227791,0.005482,0.13769,0.030075,-0.07476,-0.064297,-0.109869,-0.097301,-0.101066,-0.015474,0.290786,...,0.095013,-0.178606,-0.020043,-0.131688,0.112962,-0.204788,0.050109,0.028321,0.159058,-0.089152,-0.011177,-0.193152,-0.108236,-0.112438,-0.039882,-0.125846,0.218854,0.034804,-0.148204,-0.057877,-0.03885,0.021082,0.055308,0.095366,0.110642,-0.157254,0.130372,-0.165282,0.15822,0.16993,-0.04146,-0.012433,-0.223001,-0.128088,0.138956,-0.102106,0.01063,0.099535,0.115148,0.118905,0.047721,0.018354,0.018354,0.018354,0.018354,0.207855,0.08103,0.026225,-0.267899,-0.047946,0.110324,0.001897,-0.102733,-0.171455,0.092736,0.092407,-0.025842,0.053629,-0.014642,-0.025345,0.045047,0.045047,0.073592,-0.055425,0.135004,0.135004,0.075854,-0.250672,0.092427,0.114059,0.167384,-0.053314,0.057647,-0.022241,-0.2184,0.067159,-0.094351,-0.116624,0.209809,0.110495,0.064209,0.102088,0.012837,0.042122,-0.050209,0.043078,-0.08549,0.111026,0.111026,-0.014297,-0.009242,-0.235332,-0.076386,-0.18567,0.148122,0.024592,-0.576347,0.008605,-0.22904,0.122635,0.078277,0.044588,-0.05107,-0.305125,-0.126613,0.087113,-0.020491,0.050234,-0.061356,-0.118123,0.04928,0.073472,-0.108349,-0.158302,0.465058,0.255471,0.284249,0.148515,0.538596,-0.006555,0.057228,-0.272398,0.075147,-0.04602,-0.174645,0.288637,-0.183893,0.022524,0.022524,0.022524,0.08794,-1.28104,0.287855,0.066836,0.434999,-0.162659,0.084062,-0.045462,-0.014438,0.128645,-0.076572,0.194161,-0.11038,0.163122,-0.350755,-0.060632,0.460476,-0.040169,0.011349,0.04887,-0.453809,-0.233438,0.092694,0.227634,-0.114278,-0.12977,-0.30349,0.083598,0.012982,-0.42547,0.408151,0.616217,0.108565,-0.117188,0.444636,-0.169942,0.223058,-0.018665,0.086153,0.252178,0.034577,0.034577,-0.073295,0.328955,-0.289744,-0.067731,0.168038,0.051313,-0.153004,0.147223,0.10179,-0.282742,-0.108143,-0.171238,-0.215366,-0.447929,0.032773,-0.003452,0.055703,-0.136509,0.210015,-0.068223,0.161286,0.132516,-0.780321,0.254701,0.034795,-0.25084,0.064849,-0.055504,0.144463,0.300259,-0.11417,0.036296,-0.043757,-0.069088,0.25588,0.051562,-0.154409,0.048949,-0.091855,-0.181041,-0.038945,-0.038945,-0.092689,-0.228304,0.354478,-0.281828,0.007001,0.113579,0.083164,0.248739,-0.110646,0.083658,-0.069804,-0.110099,-0.491825,0.005978,-0.174379,-0.037779,-0.058543,-0.058543,-0.199385,-0.199385,0.051915,-0.227849,0.028345,0.208895,-0.112152,0.214461,0.103231,-0.200116,-0.072409,-0.074031,0.017978,-0.071135,0.084541,-0.209878,0.164617,-0.033514,-0.287021,0.005589,-0.09111,-0.301151,0.318535,-0.231952,-0.067317,0.408692,-0.030199,-0.015255,-0.024318,-0.040241,-0.070589,-0.007172,-0.205295,-0.172488,0.155748,0.182368,-0.047256,-0.06569,-0.022818,-0.022818,-0.022818,-0.116528,-0.142054,0.038707,0.209722,-0.043799,0.054646,0.072902,-0.146315,0.101579,-0.210988,-0.225467,0.048542,-0.060921,-0.115123,-0.109527,-0.057073,0.055363,-0.168579,-0.019975,0.061489,-0.383365,-0.064578,-0.292523,-0.017446,-0.0736,-0.150328,-0.071659,-0.009263,0.033544,0.053993,0.029788,0.022344,-0.026576,0.263466,0.051497,0.034476,0.01792,-0.240831,0.030891,-0.063325,-0.050623,0.119808,0.270559,-0.594569,0.409892,-0.601565,-0.303788,-0.059649,0.016029,0.538064,0.060874,-0.00774,0.3061,0.088079,0.674292,0.089199,0.050286,0.034548,0.087452,-0.291747,0.014095,0.169029,0.221815,0.007509,0.085171,0.414743,-0.085468,0.131778,-0.025757,-0.011157,-0.011157,0.098908,-0.328829,-0.012308,-0.49162,0.079339,0.265701,-0.212197,0.130132,0.057788,0.011496,-0.230263,-0.078902,-0.107062,-0.107062,0.033344,0.129086,-0.002299,0.118558,-0.039862,-0.122609,0.060381,-0.222637,0.047406,0.274765,0.044267,0.377402,0.122007,-0.036438,0.175751,0.02459,-0.07778,0.107969,-0.080698,0.099514,0.283346,0.082713,-0.123062,0.034147,-0.137778,0.084149,-0.037821,-0.535847,-0.187449,-0.705774,-0.019537,0.057259,0.049069,0.013937,0.038972,0.039232,-0.290525,0.11691,0.118818,0.030412,0.186821,-0.297302,0.038355,0.082695,-0.162426,-0.007776,0.090388,-0.465424,-0.070227,-0.081568,0.030719,-0.508734,0.109745,-0.423109,-0.028381,0.369053,0.042671,-0.037004,-0.16331,0.004206,0.181057,-0.459958,0.385913,-0.020394,0.17848,-0.043952,0.024024,0.296268,0.033891,0.364426,-0.005798,-0.060089,0.008443,0.035463,0.076756,-0.58515,-0.507242,0.437706,-0.147546,0.196251,0.110459,0.147317,-0.296508,0.406385,0.160566,0.283069,0.368022,-0.149424,-0.041164,-0.146211,0.040896,0.069903,0.148292,-0.07844,0.189116,0.171171,0.00235,0.114916,0.023637,0.329235,0.218473,-0.154122,1.005889,-0.020682,-0.545062,-0.190473,0.210488,0.251006,-0.048343,-0.014167,-0.014167,-0.014167,0.038456,0.09866,0.221035,0.044666,0.078818,0.152585,0.060094,0.07693,-0.101928,-0.226935,0.475147,-0.027663,0.236656,0.077343,0.260145,0.105844,-0.280075,-0.291292,0.195554,-0.150114,0.982024,0.090484,0.127725,-0.030725,0.102878,0.020038,0.091779,0.227828,0.462465


In [532]:
# words most positively correlated with content being from MensRights
# some of these make NO sense, might have been good to do more cleaning
coef_df.sum(axis=0).sort_values(ascending=False)[0:15]

member dominant                      5.570271
bread scientist react violently      4.393380
argument moving                      4.075883
message author                       3.676436
sexually submissive                  3.567055
brave serve                          3.275661
mobile atm                           3.235209
manifesting                          3.220126
self righteousness simply            2.914714
atlassian accepted majority coder    2.782258
beg scrap                            2.760610
slut slut                            2.736139
expendable                           2.723877
month son                            2.716348
bias court                           2.708516
dtype: float64

In [533]:
# words most positively correlated with content being from AskFeminists
# these make a little more sense than the MensRights ones...slightly.
coef_df.sum(axis=0).sort_values(ascending=True)[0:15]

pre determined                  -5.681340
study seminar                   -4.972181
talking federation autonomous   -4.799818
eff                             -3.908666
anti sjw                        -3.638601
desperate attempt               -3.596841
shitty yeah                     -3.415897
pick apart                      -3.373890
okay sexualize                  -3.134356
exposed message                 -3.017257
brainwashing hell drug          -2.897521
duo                             -2.883798
distinguish                     -2.873333
non binary particularly         -2.842388
appearance body                 -2.806616
dtype: float64

### Confusion Matrix

In [534]:
pred_log = logreg.predict(X_test_tf)

In [535]:
pred_log.shape

(11591,)

In [536]:
y_test.shape

(11591,)

In [537]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred_log)
tn, fp, fn, tp = confusion_matrix(y_test, pred_log).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 4317
False Positives: 1474
False Negatives: 1302
True Positives: 4498


In [538]:
# setting up df for comparison
y_test_preds = pd.DataFrame(y_test)

In [539]:
y_test_preds['pred'] = pred_log

In [540]:
y_test_preds.head()

Unnamed: 0,subreddit_MensRights,pred
57021,1,0
46763,1,1
26190,0,0
37992,1,1
11353,0,0


In [541]:
# rows where predictions don't match
y_test_preds[y_test_preds['subreddit_MensRights'] != y_test_preds['pred']].head()

Unnamed: 0,subreddit_MensRights,pred
57021,1,0
53228,1,0
18964,0,1
49924,1,0
41747,1,0


In [542]:
# example of a post misclassified as AskFeminists
#this comment is just responding to another user who posted something about JP (Jordan Peterson)
# but there's no real context that would help identify which thread this belongs to
femvmen.loc[57021]

index                   19820                                                                                                                                                                                                                                                   
text                    The first clip legitimately doesn't work, and none from the second are JP so I've still not seen proof of your origional claim that JP ever said that. If you send me a functioning clip then I'll change my opinion but it just says video unavailable.
type                    comment                                                                                                                                                                                                                                                 
removed                 0                                                                                                                                                            

In [543]:
# example of a post misclassified as MensRights
# not much to go off here for classification, it could really be from either.
femvmen.loc[18964]

index                   19170                                                                     
text                    Dude, I can't argue with you when you are so deep in the dogma. Good luck.
type                    comment                                                                   
removed                 0                                                                         
deleted                 0                                                                         
clean_text_stop         dude  i can t argue with you when you are so deep in the dogma  good luck 
lems                    dude i can argue with you when you are so deep in the dogma good luck     
subreddit_MensRights    0                                                                         
Name: 18964, dtype: object

### Other models

I tried all of the below models before selecting the Logistic Regression for further analysis of the coefficients - none of these had as strong a test score as the one on the Logistic Regression. I spent more time with Random Forest, and I could have tried to tune the parameters for Naive Bayes and SVC to improve the scores but didn't have time.

### Naive Bayes Multinomial

In [386]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [387]:
nb.fit(X_train_tf2, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [388]:
nb.score(X_train_tf2, y_train)

0.8503796048658442

In [389]:
nb.score(X_test_tf2, y_test)

0.5014235182469157

## Support Vector Classifier

In [390]:
from sklearn import svm
svc = svm.SVC(kernel= 'rbf', C = 100, gamma = 0.05)

In [391]:
svc.fit(X_train_tf, y_train)

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [392]:
svc.score(X_train_tf, y_train)

0.9543180053489777

In [393]:
svc.score(X_test_tf, y_test)

0.4925373134328358

## Random Forest

In [394]:
from sklearn.ensemble import RandomForestClassifier

In [455]:
# taking sample for faster gridsearching
sample20000 = femvmen.sample(n=20000, random_state = 42)

In [456]:
sample20000['subreddit_MensRights'].value_counts(normalize=True) # about same proportions as full dataset

1    0.5018
0    0.4982
Name: subreddit_MensRights, dtype: float64

In [457]:
X = sample20000['lems']
y = sample20000['subreddit_MensRights']

In [458]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = .30, 
                                                   stratify=y)

In [459]:
tfidf = TfidfVectorizer(ngram_range = (1, 4), max_features = 125000, stop_words = 
                        'english')

In [460]:
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)

In [461]:
X_train_tf.shape

(14000, 125000)

In [448]:
# just trying out basic random forest without tuning parameters or setting max depth
rf = RandomForestClassifier()

In [449]:
rf.fit(X_train_tf, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [450]:
rf.score(X_train_tf, y_train)

0.9857142857142858

In [451]:
rf.score(X_test_tf, y_test)

0.6108333333333333

In [415]:
rf = RandomForestClassifier()

In [416]:
# gridsearching for best parameters - this is the first one I did
from sklearn.model_selection import GridSearchCV

In [419]:
params = {
    'n_estimators': [5, 10, 1],
    'max_depth': [10000, 90000, 20000],
    'oob_score': ['True', 'False'],
    'warm_start': ['True'],
    'n_jobs': [-2]
}
gs = GridSearchCV(rf, param_grid = params, cv = 3)
gs.fit(X_train_tf, y_train)
print(gs.best_score_)
gs.best_params_ 

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

0.6128571428571429


  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


{'max_leaf_nodes': 20000,
 'n_estimators': 10,
 'n_jobs': -2,
 'oob_score': 'False',
 'warm_start': 'True'}

In [462]:
# second gridsearch, I repeated this until I narrowed down the best max depth range
params = {
    'n_estimators': [10],
    'max_depth': range(875, 900, 1),
    'oob_score': ['False'],
    'warm_start': ['True'],
    'n_jobs': [-2]
}
gs = GridSearchCV(rf, param_grid = params, cv = 3)
gs.fit(X_train_tf, y_train)
print(gs.best_score_)
gs.best_params_ 

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

0.6589285714285714


  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


{'max_depth': 880,
 'n_estimators': 10,
 'n_jobs': -2,
 'oob_score': 'False',
 'warm_start': 'True'}

In [463]:
# resetting X and y variables to test random forest on full train/test sets
X = femvmen['lems']
y = femvmen['subreddit_MensRights']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = .25,
                                                   stratify = y)

In [464]:
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)

In [481]:
rf = RandomForestClassifier(max_depth = 880, n_estimators = 1000, n_jobs = -2, oob_score = False,
                 warm_start = True)

In [482]:
rf.fit(X_train_tf, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=880, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-2,
            oob_score=False, random_state=None, verbose=0, warm_start=True)

In [483]:
rf.score(X_train_tf, y_train)

0.9846776791055077

In [484]:
rf.score(X_test_tf, y_test)

0.7199254606943198

## Conclusion

It would be too easy to say that my best model's inability to crack 77% accuracy means that "we're really not that different after all", and if I were to spend more time on this project there are other options I would explore to try and improve my model:

1. I'd pull a completely unrelated subreddit to test as a control against the other two subreddits to make sure I was tuning the best model possible.

2. Given how important context is for the comments, I would either try to aggregate comments on a post and analyze them together, or set a minimum wordcount on comments to be used in the analysis to give the model a better chance of distinguishing them.

3. I could scrape another years' worth of data to add to the model.

4. I could do some more cleaning and EDA that might help consolidate slang/similar terms that weren't captured by the lemmatizer, and identify stronger trends in the subreddits that could be leveraged for better prediction. It'd also be interesting to see what difference it might make to fit the TFIDF on just one subreddit first, then transform both subreddits and analyze them together.

5. If any of the above helped minimize overfit on the Logistic Regression or Random Forest, I'd give Naive Bayes Multinomial and Support Vector Classifier another shot (and spend more time tuning them).

The really interesting questions this project has generated would require a deeper analysis: what's the overlap in people who post on MensRights and AskFeminists? (How many of those are trolls, whose content is removed?) What common themes exist between removed posts (can we write an algorithm for trolling content?) How does sentiment analysis compare across the two subreddits, is there a discernable difference? Is it possible to measure 'extreme' attitudes in a subreddit, and if so can it be mapped over time, against real events happening in the world that might trigger anger on both or one side? Is it possible to follow the posts of single users and see how they change in sentiment and neutrality over time (and does the sub they post on the most make a difference?) Maybe I'll come back to it when I've solidified my modeling skills.