# Project 3: 'AskFeminists' vs. 'MensRights'
## Part C: Vectorizing & Analysis

[**Vectorizing**](#vector)

Before combining the two subreddit dataframes into one for modeling, I tested out different vectorizing options on both to compare the two subreddits. I tried out CountVectorizer and TFIDF on the AskFeminists content setting ngrams = (3-5) to get the top phrases from both: TFIDF was less repetitive than the CountVectorizer so I used TFIDF for vectorizing throughout. 

I tried vectorizing on ngrams 3-5, 1-2, and just single words, and created two custom lists of stopwords: one with all English stopwords and words in common between the top 100 words (without stopwords) from each subreddit; the second list of stopwords was created after fitting TFIDF on both subreddits (with stopwords = the first custom list) and taking the common words from the top 100 lists of words for each subreddit again. I tested models with both sets of stopwords and ultimately they didn't improve the model over using stopwords = 'english.')

[**Modeling**](#model)

I primarily tested two models on the lemmatized text: Logistic Regression and Random Forest Classifier. Starting with Logistic Regression, I tried several different parameters for vectorizing with TFIDF and the best parameters were 125,000 max features, ngrams = 1-4, and stopwords = 'English'.

The best test score I had ended up being on Logistic Regression (76% accuracy compared to baseline of 50%). This model was overfit on the training data (85% accuracy), and setting lower max_features closed the gap to a 3% difference between training/testing scores but also lowered the test scores. 

The best training score I had was with RandomForestClassifier, n_estimators = 100 and max_depth = 880, which gave me 98% on training data (but just 72% on the test data.)

In [1]:
# Import libaries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score

import regex as re
import nltk
from nltk.corpus import stopwords # Import the stop word list

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text

from sklearn.linear_model import LogisticRegression

pd.set_option('display.max_colwidth', -1)
pd.options.display.max_columns = 999

In [2]:
men = pd.read_csv('./data/men_clean_lem')
fem = pd.read_csv('./data/fem_clean_lem')

### Balancing datasets, removing nulls

In [3]:
men['lems'].isnull().sum()

225

In [4]:
men.dropna(inplace = True)

In [5]:
fem['lems'].isnull().sum()

350

In [6]:
fem.dropna(inplace = True)

In [7]:
men.shape

(31221, 8)

In [8]:
fem.shape

(28955, 8)

In [9]:
# creating smaller version for mensrights, same # rows as submissions for askfeminists
men = men.sample(29000, replace = False, random_state=42)

In [10]:
men.shape

(29000, 8)

## Vectorizing: Separate Analysis <a name="vector"></a>

In [11]:
# Instantiate a CountVectorizer
vect = CountVectorizer(ngram_range=(3,5), max_features = 10000, stop_words = 'english')
# Instantiate TFIDF
tfidf = TfidfVectorizer(ngram_range = (3, 5), max_features = 10000, stop_words = 'english')

In [12]:
# vectorizing fem for test
fem_vect = vect.fit_transform(fem['lems'])
# tfidfing fem for test
fem_tfidf = tfidf.fit_transform(fem['lems'])

In [13]:
# creating a df for vectorized words
fem_vect_df = pd.DataFrame(fem_vect.toarray(), columns=vect.get_feature_names())
# creating a df for tfidf words
fem_tfidf_df = pd.DataFrame(fem_tfidf.toarray(), columns=tfidf.get_feature_names())

In [14]:
# looking at vectorized value counts
vect_counts = fem_vect_df.sum(axis=0)
vect_counts.sort_values(ascending=False)[0:10]

reflect feminist perspective                  160
feminist reflect feminist                     155
feminist reflect feminist perspective         153
come feminist reflect                         89 
come feminist reflect feminist                89 
come feminist reflect feminist perspective    89 
level comment thread                          81 
difference men woman                          80 
feminist perspective comment                  74 
doesn make sense                              74 
dtype: int64

In [15]:
# looking at tfidf value counts
tfidf_counts = fem_tfidf_df.sum(axis=0)
tfidf_counts.sort_values(ascending=False)[0:10]

doesn make sense                41.643637
false rape accusation           36.430295
difference men woman            32.162463
traditional gender role         28.743795
just don think                  27.087392
reflect feminist perspective    26.147715
don really know                 25.665504
innocent proven guilty          24.234714
rape sexual assault             24.215631
gt don think                    23.785815
dtype: float64

TFIDF seems to have done a better job of ignoring similar phrases than count vectorizer (ex. 'reflect feminist perspective') and the results are more interesting.

##### Comparing Top Phrases (TFIDF, ngrams 3-5)

In [16]:
# fitting to tfidf
men_tfidf = tfidf.fit_transform(men['lems'])

In [17]:
# creating a df for tfidf words
men_tfidf_df = pd.DataFrame(men_tfidf.toarray(), columns=tfidf.get_feature_names())

In [18]:
# pulling top 10 phrases for men TFIDF with English stopwords, 3 are the same as fem
tfidf_counts_men = men_tfidf_df.sum(axis=0)
tfidf_counts_men.sort_values(ascending=False)[0:10]

false rape accusation     62.306164
men right movement        60.595215
men right issue           54.943756
international men day     50.388507
men right activist        39.744089
gender pay gap            33.950820
innocent proven guilty    33.152104
pay child support         27.741176
year old boy              22.977497
doesn make sense          22.798229
dtype: float64

#### Analyzing single words and bigrams

In [19]:
tfidf = TfidfVectorizer(ngram_range = (1, 2), max_features = 10000, stop_words = 'english')

In [20]:
# fitting to tfidf
fem_tfidf = tfidf.fit_transform(fem['lems'])

In [21]:
men_tfidf = tfidf.fit_transform(men['lems'])

In [22]:
# creating a df for tfidf words
fem_tfidf_df = pd.DataFrame(fem_tfidf.toarray(), columns=tfidf.get_feature_names())
men_tfidf_df = pd.DataFrame(men_tfidf.toarray(), columns=tfidf.get_feature_names())

In [23]:
# pulling top 10 words/phrases for fem TFIDF with English stopwords
tfidf_counts_fem = fem_tfidf_df.sum(axis=0)
tfidf_counts_fem.sort_values(ascending=False)[0:10]

wife ha             976.118614
men family          729.171984
fellow              677.549175
thing               612.694714
divorce             607.185712
partner violence    605.775949
lifestyle           550.282282
june                537.683578
voting              513.169382
greatest            450.885487
dtype: float64

In [24]:
# pulling top 10 words/phrases for men TFIDF with English stopwords
# interesting that feminist is one of the top 10 words for mensrights
tfidf_counts_men = men_tfidf_df.sum(axis=0)
tfidf_counts_men.sort_values(ascending=False)[0:10]

woman       868.976790
men         826.650364
wa          579.965131
just        491.954841
like        476.984193
don         462.072782
people      423.281936
gt          407.147621
right       403.265087
feminist    391.284568
dtype: float64

##### Comparing single words & bigrams

In [25]:
# finding top ngrams 1-2
top_100_fem_ngrams_stop = tfidf_counts_fem.sort_values(ascending=False)[0:100]
top_100_men_ngrams_stop = tfidf_counts_men.sort_values(ascending=False)[0:100]

In [26]:
# finding similar ngrams in both
fem_men_ngrams = set(top_100_men_ngrams_stop.index) & set(top_100_fem_ngrams_stop.index)

In [27]:
# not a lot of similarity in smaller ngrams
fem_men_ngrams

{'thing'}

There isn't a lot of similarity when comparing ngrams 1-2 between the two subs - AskFeminists has more common 2-word phrases (with stopwords removed) which suggest more similar contextualization of common words such as 'men' and 'women'. MensRights has very few 2-word phrases that are common, so the top 100 words/short phrases come with less context than the list for AskFeminists.

#### Analyzing top words (ngrams=1)

In [28]:
tfidf1 = TfidfVectorizer(max_features = 30000, stop_words = 'english')

In [29]:
# fitting to tfidf
fem_tfidf = tfidf1.fit_transform(fem['lems'])

In [30]:
fem_tfidf.shape

(28955, 30000)

In [31]:
men_tfidf = tfidf1.fit_transform(men['lems'])

In [32]:
# creating a df for tfidf words
fem_tfidf_df = pd.DataFrame(fem_tfidf.toarray(), columns=tfidf1.get_feature_names())

In [33]:
men_tfidf_df = pd.DataFrame(men_tfidf.toarray(), columns=tfidf1.get_feature_names())

In [34]:
# top words for askfeminists
tfidf_counts_fem = fem_tfidf_df.sum(axis=0)
tfidf_counts_fem.sort_values(ascending=False)[0:10]

workshop     1059.252157
manigault    803.501338 
fanedit      752.867745 
doesnt       665.165769 
tomboy       655.855623 
paypal       645.690254 
lawyered     582.257494 
jordon       572.694283 
weak         530.538017 
hardwick     474.213095 
dtype: float64

In [35]:
# top words for mensrights
tfidf_counts_men = men_tfidf_df.sum(axis=0)
tfidf_counts_men.sort_values(ascending=False)[0:10]

woman       932.960891
men         894.450155
wa          598.002861
just        512.151036
don         496.218393
like        493.284077
people      443.586868
right       435.860838
feminist    409.632239
gt          408.652878
dtype: float64

In [36]:
# finding top ngrams 1-2
top_100_fem_words_stop = tfidf_counts_fem.sort_values(ascending=False)[0:100]
top_100_men_words_stop = tfidf_counts_men.sort_values(ascending=False)[0:100]

In [37]:
# finding similar words in both
fem_men_words = set(top_100_men_words_stop.index) & set(top_100_fem_words_stop.index)

In [38]:
len(fem_men_words) # 77 of the words are in the top 100 for both! 

0

In [39]:
fem_men_words

set()

#### Comparing Stopwords with Count Vect

In [40]:
vect1 = CountVectorizer(max_features = 30000, stop_words = 'english')

In [41]:
# fitting to vect
fem_vect = vect1.fit_transform(fem['lems'])

In [42]:
men_vect = vect1.fit_transform(men['lems'])

In [43]:
# creating a df for tfidf words
fem_vect_df = pd.DataFrame(fem_vect.toarray(), columns=vect1.get_feature_names())

In [44]:
men_vect_df = pd.DataFrame(men_vect.toarray(), columns=vect1.get_feature_names())

In [45]:
# top words for askfeminists
vect_counts_fem = fem_vect_df.sum(axis=0)
vect_counts_fem.sort_values(ascending=False)[0:10]

workshop     22124
manigault    14360
paypal       12063
doesnt       11036
tomboy       10884
lawyered     10296
fanedit      10242
jordon       9659 
weak         9085 
hardwick     7853 
dtype: int64

In [46]:
# top words for mensrights
vect_counts_men = men_vect_df.sum(axis=0)
vect_counts_men.sort_values(ascending=False)[0:10]

woman       17622
men         16400
wa          10868
just        7402 
like        6929 
people      6385 
don         6367 
gt          6236 
right       5382 
feminist    5265 
dtype: int64

In [47]:
# finding top words
top_100_fem_words_stop = vect_counts_fem.sort_values(ascending=False)[0:100]
top_100_men_words_stop = vect_counts_men.sort_values(ascending=False)[0:100]

In [48]:
# finding similar words in both
fem_men_words = set(top_100_men_words_stop.index) & set(top_100_fem_words_stop.index)

In [49]:
len(fem_men_words) # about the same as tfidf

0

#### Setting Custom Stopwords, Second Comparison

In [50]:
# Brian's code, setting stop words equal to intersection terms in top 100 using count vectorizer
stop_words_1 = text.ENGLISH_STOP_WORDS.union(['actually',
 'agree',
 'bad',
 'believe',
 'better',
 'case',
 'child',
 'come',
 'comment',
 'did',
 'didn',
 'doe',
 'doesn',
 'don',
 'equality',
 'fact',
 'feel',
 'female',
 'feminism',
 'feminist',
 'gender',
 'girl',
 'going',
 'good',
 'group',
 'gt',
 'guy',
 'ha',
 'having',
 'help',
 'isn',
 'issue',
 'just',
 'know',
 'life',
 'like',
 'look',
 'lot',
 'make',
 'male',
 'man',
 'mean',
 'men',
 'need',
 'people',
 'person',
 'point',
 'post',
 'problem',
 'rape',
 'read',
 'really',
 'reason',
 'right',
 'said',
 'say',
 'saying',
 'sex',
 'sexual',
 'society',
 'sure',
 'thanks',
 'thing',
 'think',
 'thought',
 'time',
 'use',
 'victim',
 'wa',
 'want',
 'way',
 'white',
 'woman',
 'word',
 'work',
 'wrong',
 'yes'])

In [51]:
tfidf2 = TfidfVectorizer(max_features = 30000, stop_words = stop_words_1)

In [52]:
# fitting to tfidf
fem_tfidf = tfidf2.fit_transform(fem['lems'])

In [53]:
men_tfidf = tfidf2.fit_transform(men['lems'])

In [54]:
# creating a df for tfidf words
fem_tfidf_df = pd.DataFrame(fem_tfidf.toarray(), columns=tfidf2.get_feature_names())

In [55]:
men_tfidf_df = pd.DataFrame(men_tfidf.toarray(), columns=tfidf2.get_feature_names())

In [56]:
# top words for askfeminists, after removing most common words
tfidf_counts_fem = fem_tfidf_df.sum(axis=0)
tfidf_counts_fem.sort_values(ascending=False)[0:10]

relationshipi    378.489849
herald           223.135285
dialed           219.112282
unoriginal       204.333295
anatomist        204.207552
partyi           195.403570
eradicate        191.807080
trustworthy      182.818588
unbroken         178.130618
applies          177.842142
dtype: float64

In [57]:
# top words for mensrights, after removing most common words
tfidf_counts_men = men_tfidf_df.sum(axis=0)
tfidf_counts_men.sort_values(ascending=False)[0:10]

year       257.012902
day        203.796288
archive    202.565864
boy        193.050981
law        172.890177
sub        170.635181
article    166.782898
got        164.916142
shit       163.609884
support    161.837291
dtype: float64

In [58]:
# finding top words
top_100_fem_words_stop1 = tfidf_counts_fem.sort_values(ascending=False)[0:100]
top_100_men_words_stop1 = tfidf_counts_men.sort_values(ascending=False)[0:100]

In [59]:
# finding similar words in both
fem_men_words1 = set(top_100_men_words_stop1.index) & set(top_100_fem_words_stop1.index)

In [60]:
len(fem_men_words1)

0

In [61]:
fem_men_words1

set()

### Setting Custom Stopwords (2nd layer)

In [62]:
# added common words from last version to common words from previous version
stop_words_2 = text.ENGLISH_STOP_WORDS.union(['agree',
 'aren',
 'argument',
 'article',
 'assault',
 'best',
 'boy',
 'care',
 'change',
 'claim',
 'consent',
 'crime',
 'day',
 'different',
 'doing',
 'equal',
 'evidence',
 'exactly',
 'example',
 'far',
 'friend',
 'getting',
 'got',
 'hard',
 'hate',
 'idea',
 'job',
 'kind',
 'law',
 'let',
 'making',
 'masculinity',
 'matter',
 'maybe',
 'movement',
 'opinion',
 'place',
 'power',
 'pretty',
 'probably',
 'question',
 'read',
 'real',
 'seen',
 'sound',
 'stop',
 'study',
 'sub',
 'talk',
 'talking',
 'tell',
 'thanks',
 'toxic',
 'true',
 'try',
 'understand',
 'used',
 'violence',
 'word',
 'wouldn',
 'yeah',
 'yes',
'actually',
 'agree',
 'bad',
 'believe',
 'better',
 'case',
 'child',
 'come',
 'comment',
 'did',
 'didn',
 'doe',
 'doesn',
 'don',
 'equality',
 'fact',
 'feel',
 'female',
 'feminism',
 'feminist',
 'gender',
 'girl',
 'going',
 'good',
 'gt',
 'guy',
 'ha',
 'isn',
 'issue',
 'just',
 'know',
 'life',
 'like',
 'look',
 'lot',
 'make',
 'male',
 'man',
 'mean',
 'men',
 'nan',
 'need',
 'people',
 'person',
 'point',
 'post',
 'problem',
 'rape',
 'read',
 'really',
 'reason',
 'right',
 'said',
 'say',
 'saying',
 'sex',
 'sexual',
 'sure',
 'thanks',
 'thing',
 'think',
 'time',
 'use',
 'wa',
 'want',
 'way',
 'white',
 'woman',
 'word',
 'work',
 'wrong',
 'yes', 'aren',
 'argument',
 'assault',
 'best',
 'boy',
 'care',
 'change',
 'claim',
 'consent',
 'day',
 'different',
 'doing',
 'equal',
 'evidence',
 'exactly',
 'far',
 'getting',
 'got',
 'group',
 'hate',
 'having',
 'help',
 'idea',
 'job',
 'kind',
 'law',
 'le',
 'let',
 'making',
 'matter',
 'maybe',
 'movement',
 'ok',
 'place',
 'power',
 'pretty',
 'probably',
 'question',
 'real',
 'society',
 'sound',
 'stop',
 'sub',
 'support',
 'talk',
 'talking',
 'tell',
 'thought',
 'true',
 'try',
 'trying',
 'understand',
 'used',
 'victim',
 'violence',
 'world',
 'wouldn',
 'yeah',
 'year'])

### Combining into one DF

In [63]:
femvmen = pd.concat([fem, men], axis=0, join='outer')

In [64]:
femvmen.shape

(57955, 8)

In [65]:
femvmen.columns

Index(['Unnamed: 0', 'text', 'type', 'subreddit', 'removed', 'deleted',
       'clean_text_stop', 'lems'],
      dtype='object')

In [66]:
femvmen = pd.get_dummies(femvmen, columns=['subreddit'], drop_first = True)

In [67]:
femvmen.drop(columns = "Unnamed: 0", inplace = True)

In [68]:
femvmen.reset_index(inplace = True)

In [69]:
femvmen = pd.DataFrame(femvmen)

In [70]:
femvmen.to_csv('./data/femvmen_lem')

In [71]:
femvmen.columns

Index(['index', 'text', 'type', 'removed', 'deleted', 'clean_text_stop',
       'lems', 'subreddit_MensRights'],
      dtype='object')

# Modeling <a name="model"></a>

#### Setting Vars/TrainTestSplit

In [72]:
X = femvmen['lems']
y = femvmen['subreddit_MensRights']

In [73]:
y.value_counts(normalize = True) # about 50% mensrights posts

1    0.500388
0    0.499612
Name: subreddit_MensRights, dtype: float64

In [74]:
#TRAIN TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True, 
                                                    stratify = y)

In [75]:
y_train.value_counts(normalize = True)

1    0.500388
0    0.499612
Name: subreddit_MensRights, dtype: float64

#### Vectorizing

In [76]:
# setting up multiple vectorizers with english stopwords, ngrams 1-4, different features
tfidf = TfidfVectorizer(ngram_range = (1, 4), max_features = 125000, stop_words = 
                        'english')
tfidf2 = TfidfVectorizer(ngram_range = (1, 4), max_features = 125000, stop_words = 
                        stop_words_1)
tfidf3 = TfidfVectorizer(ngram_range = (1, 4), max_features = 125000, stop_words = 
                        stop_words_2)
tfidf4 = TfidfVectorizer(ngram_range = (1, 4), max_features = 100000, stop_words = 
                        'english')
tfidf5 = TfidfVectorizer(ngram_range = (1, 4), max_features = 100000, stop_words = 
                        stop_words_1)
tfidf6 = TfidfVectorizer(ngram_range = (1, 4), max_features = 100000, stop_words = 
                        stop_words_2)
tfidf7 = TfidfVectorizer(ngram_range = (1, 4), max_features = 50000, stop_words = 
                        'english')
tfidf8 = TfidfVectorizer(ngram_range = (1, 4), max_features = 50000, stop_words = 
                        stop_words_1)
tfidf9 = TfidfVectorizer(ngram_range = (1, 4), max_features = 50000, stop_words = 
                        stop_words_2)
tfidf10 = TfidfVectorizer(ngram_range = (1, 4), max_features = 5000, stop_words = 
                        'english')
tfidf11 = TfidfVectorizer(ngram_range = (1, 4), max_features = 5000, stop_words = 
                        stop_words_1)
tfidf12 = TfidfVectorizer(ngram_range = (1, 4), max_features = 5000, stop_words = 
                        stop_words_2)

In [77]:
#fit transform train sets
X_train_tf = tfidf.fit_transform(X_train)
X_train_tf2 = tfidf2.fit_transform(X_train)
X_train_tf3 = tfidf3.fit_transform(X_train)
X_train_tf4 = tfidf4.fit_transform(X_train)
X_train_tf5 = tfidf5.fit_transform(X_train)
X_train_tf6 = tfidf6.fit_transform(X_train)
X_train_tf7 = tfidf7.fit_transform(X_train)
X_train_tf8 = tfidf8.fit_transform(X_train)
X_train_tf9 = tfidf9.fit_transform(X_train)
X_train_tf10 = tfidf10.fit_transform(X_train)
X_train_tf11 = tfidf11.fit_transform(X_train)
X_train_tf12 = tfidf12.fit_transform(X_train)

In [78]:
# transform test set
X_test_tf = tfidf.transform(X_test)
X_test_tf2 = tfidf2.transform(X_test)
X_test_tf3 = tfidf3.transform(X_test)
X_test_tf4 = tfidf4.transform(X_test)
X_test_tf5 = tfidf5.transform(X_test)
X_test_tf6 = tfidf6.transform(X_test)
X_test_tf7 = tfidf7.transform(X_test)
X_test_tf8 = tfidf8.transform(X_test)
X_test_tf9 = tfidf9.transform(X_test)
X_test_tf10 = tfidf10.transform(X_test)
X_test_tf11 = tfidf11.transform(X_test)
X_test_tf12 = tfidf12.transform(X_test)

### Logistic Regression

Best versions of vectorized data for logistic regression are with stopwords = 'english' and max features = 125000. With very few features (10K or less) the gap between train and test scores closes to 3% but the test score continues to drop with fewer than 100K features.

In [79]:
logreg = LogisticRegression(penalty='l2')

In [80]:
# version 1
X_train_tf = tfidf.fit_transform(X_train)

In [81]:
X_test_tf = tfidf.transform(X_test)

In [82]:
logreg.fit(X_train_tf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [83]:
logreg.score(X_train_tf, y_train)

0.853420757484255

In [84]:
logreg.score(X_test_tf, y_test)

0.7560175998619618

In [85]:
# version 2
logreg.fit(X_train_tf2, y_train)
logreg.score(X_train_tf2, y_train)

0.8392718488482444

In [86]:
logreg.score(X_test_tf2, y_test)

0.7379863687343629

In [87]:
# version 3
logreg.fit(X_train_tf3, y_train)
logreg.score(X_train_tf3, y_train)

0.8339228711931671

In [88]:
logreg.score(X_test_tf3, y_test)

0.7366059874040204

In [89]:
# version 4
logreg.fit(X_train_tf4, y_train)
logreg.score(X_train_tf4, y_train)

0.8494737296178069

In [90]:
logreg.score(X_test_tf4, y_test)

0.7559313260288154

In [91]:
# version 5
logreg.fit(X_train_tf5, y_train)
logreg.score(X_train_tf5, y_train)

0.8349365887326374

In [92]:
logreg.score(X_test_tf5, y_test)

0.739711845397291

In [93]:
# version 6
logreg.fit(X_train_tf6, y_train)
logreg.score(X_train_tf6, y_train)

0.829868001035286

In [94]:
logreg.score(X_test_tf6, y_test)

0.7355707014062635

In [95]:
# version 7
logreg.fit(X_train_tf7, y_train)
logreg.score(X_train_tf7, y_train)

0.8363385385212665

In [96]:
logreg.score(X_test_tf7, y_test)

0.7547234923647658

In [97]:
# version 8
logreg.fit(X_train_tf8, y_train)
logreg.score(X_train_tf8, y_train)

0.8239582434647571

In [98]:
logreg.score(X_test_tf8, y_test)

0.739280476231559

In [99]:
# version 9
logreg.fit(X_train_tf9, y_train)
logreg.score(X_train_tf9, y_train)

0.8189974980588387

In [100]:
logreg.score(X_test_tf9, y_test)

0.736519713570874

In [101]:
# version 10
logreg.fit(X_train_tf10, y_train)
logreg.score(X_train_tf10, y_train)

0.7811664222241395

In [102]:
logreg.score(X_test_tf10, y_test)

0.7394530238978518

In [103]:
# version 11
logreg.fit(X_train_tf11, y_train)
logreg.score(X_train_tf11, y_train)

0.7710292468294366

In [104]:
logreg.score(X_test_tf11, y_test)

0.7278923302562332

In [105]:
# version 12
logreg.fit(X_train_tf12, y_train)
logreg.score(X_train_tf12, y_train)

0.7657449745492192

In [106]:
logreg.score(X_test_tf12, y_test)

0.7192649469415926

### Most Influential Features

In [107]:
# fitting on best model
logreg.fit(X_train_tf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [108]:
coefficients = logreg.coef_

In [109]:
coef_df = pd.DataFrame(coefficients, columns = tfidf2.get_feature_names())

In [110]:
coef_df.head()

Unnamed: 0,aa,aaaaand,aap,aap circumcision,aauw,ab,aback,abandon,abandoned,abandoning,abandonment,abbey,abbreviated,abbreviation,abc,abc news,abctv,abctv scottmorrisonmp,abd,abdicate,abdicate responsibility,abdication,abdomen,abdominal,abducted,abduction,abelist,aberrant,aberration,abhor,abhor culture,abhorrent,abide,abiding,abiding citizen,abigail,ability,ability based,ability birth,ability change,ability choose,ability commit,ability compete,ability consent,ability consent reasonable,ability control,ability create,ability decision,ability dress,ability dress certain,ability earn,ability enjoy,ability experience,ability influence,ability informed,ability job,ability le,ability obtain,ability pay,ability power,ability prevent,ability provide,ability provide care,ability refrain,ability refrain personal,ability refrain personal accountability,ability reproduce,ability responsibility,ability survive,ability understand,ability willingness,abit,abject,able,able abandon,able able,able abortion,able access,able accrue,able accrue wealth,able achieve,able afford,able angry,able answer,able answer question,able argue,able argument,able ask,able away,able bodied,able bodied ci,able bring,able buy,able care,able change,able choose,able claim,able compete,able consent,able consider,able continue,able contribute,able control,able convince,able create,able date,able day,able deal,able decide,able decision,able defend,able discriminate,able discus,able distinguish,able empathize,able evidence,able experience,able explain,able explain biological,able explain biological basis,able express,able fight,able financial,able fix,able force,able forward,able fully,able gain,able handle,able hold,able identify,able incorporate,able job,able learn,able leave,able legally,able lift,able live,able maintain,able manipulate,able meet,able navigate,able opt,able participate,able perform,able pick,able play,able pregnant,able protect,able prove,able provide,able pull,able raise,able recognize,able record,able refute,able relate,able remove,able separate,able separate conversation,able separate conversation fgm,able set,able sexist,able share,able shut,able sign,able solve,able sort,able speak,able speak clearly,able speak clearly unambiguously,able spend,able stand,able start,able stay,able study,able support,able survive,able talk,able tell,able train,able trust,able turn,able understand,able vote,able watch,able wear,abled,ableism,ableist,ableist slur,abnormal,abnormal behavior,abnormality,aboard,abolish,abolish prison,abolished,abolishing,abolition,abolition capitalism,abolitionism,abolitionist,abominable,abomination,aboriginal,abort,abort baby,abort body,abort fetus,abort force,aborted,aborted fetus,aborting,aborting baby,abortion,abortion able,abortion abortion,abortion access,abortion adoption,abortion aren,abortion argument,abortion available,abortion baby,abortion based,abortion bc,abortion begin,abortion birth,abortion birth control,abortion bodily,abortion bodily autonomy,abortion body,abortion body choice,abortion choice,abortion clinic,abortion completely,abortion consent,abortion consider,abortion cost,abortion country,abortion debate,abortion easy,abortion effective,abortion fine,abortion form,abortion free,abortion freely,abortion gendered,abortion generally,abortion happening,abortion illegal,abortion killing,abortion late,abortion law,abortion le,abortion legal,abortion let,abortion literally,abortion making,abortion medical,abortion month,abortion mother,abortion murder,abortion necessarily,abortion okay,abortion option,abortion parent,abortion pay,abortion poor,abortion position,abortion pregnancy,abortion pregnant,abortion pretty,abortion prevent,abortion pro,abortion pro choice,abortion rare,abortion reproductive,abortion service,abortion shouldn,abortion simply,abortion support,abortion topic,abortion unjust,abortion unjust killing,abortion wanted,abortion week,abortion world,abovementioned,abraham,abrahamic,abrahamic religion,abrams,abrasion,abrasive,abridging,abridging freedom,abridging freedom speech,abroad,abruptly,absence,absence compelling,absence compelling evidence,absence court,absence court order,absence evidence,absent,absent evidence,absent father,absentee,absolute,absolute best,absolute bitch,absolute critical,absolute critical reflection,absolute critical reflection mediated,absolute garbage,absolute hell,absolute hiv,absolute hiv number,absolute hiv number methodological,absolute lack,absolute power,absolute statement,absolute totally,absolute totally free,absolute totally free oppression,absolute truth,absolute worst,absolutely,absolutely absurd,absolutely arguing,absolutely awful,absolutely ban,absolutely based,absolutely body,absolutely called,absolutely change,absolutely choice,absolutely choose,absolutely clear,absolutely correct,absolutely deserve,absolutely disgusting,absolutely doubt,absolutely equal,absolutely equal oppression,absolutely equal oppression free,absolutely essential,absolutely evidence,absolutely fine,absolutely fucking,absolutely harassment,absolutely hate,absolutely horrible,absolutely idea,absolutely important,absolutely involves,absolutely love,absolutely matter,absolutely necessary,absolutely option,absolutely possible,absolutely proof,absolutely proof gendered,absolutely proof gendered difference,absolutely question,absolutely relevant,absolutely ridiculous,absolutely sense,absolutely shit,absolutely sign,absolutely stand,absolutely support,absolutely talk,absolutely terrible,absolutely true,absolutely understand,absolutely useless,absolutely whatsoever,absolutely wonderful,absolutely wonderful helping,absolutely wonderful helping primary,absolutely zero,absolution,absolutist,absolutley,absolutly,absolve,absolved,absolving,absorb,absorbed,absorbs,absorption,abstain,abstaining,abstinence,abstract,abstract concept,abstract paper,abstraction,abstractly,absurd,absurd claim,absurd statement,absurdity,absurdly,absurdum,abundance,abundant,abundantly,abundantly clear,abuse,abuse abuse,abuse abused,abuse abuser,abuse according,abuse accusation,abuse allegation,abuse amp,abuse assault,abuse authority,abuse based,abuse believed,abuse boy,abuse bullying,abuse called,abuse came,abuse cause,abuse childhood,abuse claim,abuse convicted,abuse country,abuse culture,abuse depression,abuse discrimination,abuse domestic,abuse domestic violence,abuse dr,abuse end,abuse equal,abuse especially,abuse example,abuse exploitation,abuse family,abuse far,abuse form,abuse got,abuse hand,abuse happened,abuse harassment,abuse horrible,abuse husband,abuse inflicted,abuse intimate,abuse intimate partner,abuse kid,abuse kill,abuse kind,abuse law,abuse le,abuse lead,abuse let,abuse likely,abuse little,abuse maybe,abuse mental,abuse mental health,abuse mother,abuse neglect,abuse number,abuse offender,abuse ok,abuse okay,abuse online,abuse parent,abuse parent court,abuse partner,abuse perpetrator,abuse physical,abuse position,abuse power,abuse psychopathic,abuse psychopathic guidance,abuse psychopathic guidance stereotype,abuse question,abuse raped,abuse receive,abuse received,abuse report,abuse seen,abuse sharing,abuse sharing metoo,abuse sharing metoo platform,abuse shelter,abuse situation,abuse social,abuse source,...,yeah relationship,yeah saw,yeah seen,yeah sense,yeah sexist,yeah shit,yeah shouldn,yeah sorry,yeah sort,yeah sound,yeah tell,yeah thats,yeah thinking,yeah totally,yeah toxic,yeah trans,yeah transphobia,yeah true,yeah understand,yeah used,yeah usually,yeah went,yeah won,yeah yeah,yeah year,year,year abuse,year age,year ago,year ago considered,year ago culture,year ago difference,year ago got,year ago maybe,year ago pay,year ago refused,year ago refused appear,year ago student,year ago talking,year ago today,year ago year,year ago year old,year amp,year apart,year assault,year average,year away,year baby,year bar,year based,year basically,year black,year book,year called,year care,year career,year change,year changed,year clean,year close,year college,year compared,year considering,year control,year control ex,year couldn,year count,year country,year crime,year culture,year culture cool,year current,year date,year dating,year day,year decade,year degree,year difference,year difficult,year doctrine,year doing,year earlier,year early,year education,year end,year especially,year eventually,year eventually got,year example,year expect,year experience,year failed,year fair,year fall,year false,year field,year fighting,year finally,year fine,year free,year friend,year fuck,year fucking,year future,year general,year genuinely,year getting,year got,year graduated,year great,year grew,year guilty,year half,year happened,year harassment,year hard,year haven,year high,year high school,year hit,year home,year hrt,year huge,year human,year husband,year inappropriate,year income,year instead,year international,year jail,year job,year judge,year kid,year killing,year later,year law,year le,year left,year let,year likely,year line,year live,year lived,year living,year long,year longer,year looked,year lost,year love,year marriage,year married,year massive,year max,year maximum,year maybe,year medical,year met,year metoo,year middle,year million,year minimum,year miserable,year mod,year month,year mrm,year new,year noticed,year number,year obviously,year old,year old able,year old attractive,year old baby,year old black,year old boy,year old chose,year old consent,year old considered,year old dating,year old daughter,year old florida,year old friend,year old getting,year old got,year old kid,year old le,year old live,year old living,year old love,year old mature,year old model,year old mom,year old nephew,year old pregnant,year old probably,year old raped,year old single,year old sister,year old son,year old started,year old student,year old teenager,year old told,year old used,year old working,year old year,year old year old,year older,year oppression,year order,year paid,year parent,year particularly,year passed,year past,year pay,year period,year pretty,year prison,year prison fine,year probably,year probation,year quite,year reach,year real,year realize,year receive,year received,year recently,year regarding,year relationship,year released,year remember,year retirement,year retirement difference,year retirement difference retirement,year retirement expectancy,year retirement expectancy pension,year road,year rodger,year row,year saving,year school,year second,year seen,year self,year senior,year sentence,year social,year specifically,year spend,year spent,year splc,year stand,year start,year start coming,year start coming stop,year started,year state,year step,year straight,year student,year study,year suck,year support,year taken,year taking,year talk,year talking,year teacher,year tell,year telling,year thinking,year thousand,year took,year training,year treatment,year tried,year true,year trying,year undergrad,year understand,year united,year university,year unless,year used,year using,year usually,year waiting,year wanted,year went,year wife,year won,year worked,year working,year world,year wouldn,year year,year year ago,year year old,year young,year younger,year zero,yearly,yearn,yearning,yeast,yeast infection,yee,yee pedophile,yeh,yell,yelled,yelling,yellow,yellow red,yen,yep,yep absolutely,yep concept,yep concept little,yep concept little fuel,yep toxic,yer,yesallmen,yesterday,yesterday evening,yesterday posted,yesterday revealed,yi,yiannopoulis,yiannopoulos,yield,yielded,yikes,yin,yin yang,yinz,yo,yo kid,yoffe,yoga,yoga pant,yoke,york,york city,york city called,york city called wing,york city washington,york city washington dc,york state,york state excluded,york state excluded attending,york state law,york university,yorker,yorkshire,youd,youll,young,young absolutely,young adult,young age,young american,young angry,young asian,young beautiful,young black,young boy,young brother,young caught,young childless,young coming,young consent,young conservative,young couple,young daughter,young day,young desperate,young face,young fertile,young fertile attractive,young fertile attractive countless,young fertile attractive period,young got,young hate,young hot,young human,young infant,young kid,young lady,young learn,young living,young looking,young mind,young murdered,young old,young remember,young son,young sort,young straight,young student,young teen,young teen boy,young teenager,young today,young told,young used,young violence,young world,young year,young year old,young young,younger,younger boy,younger brother,younger doing,younger generation,younger older,younger older younger,younger sister,younger student,younger year,youngest,youngster,youre,youth,youth age,youth suicide,youtoo,youtube,youtube channel,youtube channel called,youtube spotlight,youtube video,youtube viewer,youtuber,youtubers,youtubes,youve,youve got,yovino,yoy,ypt,ypu,yr,yr old,yt,yuck,yummy,yup,zambia,zarya,ze,zealand,zealand report,zealand report human,zealand support,zealand support submission,zealot,zealous,zeitgeist,zelda,zero,zero actual,zero dawn,zero dollar,zero effect,zero evidence,zero experience,zero personal,zero personal experience,zero personal experience oppression,zero precedent,zero proof,zero responsibility,zero sense,zero sum,zero sum game,zero sum situation,zero sympathy,zero tolerance,zero tolerance policy,zero value,zeroing,zimbardo,zip,zipper,zizek,zoe,zoe quinn,zombie,zombie apocalypse,zone,zone problematic,zoo,zoom,zuckerberg,zygote
0,0.003033,-0.243533,0.584949,0.027575,-0.031888,0.447751,0.143809,-0.002159,-0.351781,0.68035,0.244434,-0.271246,0.087812,-0.006912,0.092327,0.035839,0.339848,0.067532,0.03482,0.03482,-0.117718,0.142861,0.142861,-0.078466,-0.059329,-0.037682,0.090871,0.002733,-0.013394,-0.015316,0.038021,0.098304,0.019924,-0.199023,0.149842,0.148714,0.005406,-0.002942,-1.423937,-0.097382,-0.087434,-0.006748,-0.111227,-0.08009,-0.016359,0.05178,-0.220896,-0.010088,-0.058558,0.01246,-0.016994,-0.015997,-0.11345,-0.11345,-0.11345,0.098854,-0.005324,-0.042037,-0.054979,0.021139,0.194201,0.424501,-0.08296,0.044902,0.081207,-0.119315,-0.030559,0.028345,0.139491,0.170983,0.170983,0.170983,0.040814,-0.114449,0.033531,-0.131436,-0.107079,0.150548,0.100568,0.074039,-0.039929,0.105973,0.039938,-1.397111,-0.161323,-0.413053,-0.077746,-0.237168,-0.237168,-0.066151,0.21879,0.175632,-0.326183,-0.040549,0.037573,-0.002807,-0.031812,-0.045277,-0.218265,0.109496,-0.01798,0.009982,0.00401,0.034322,-0.049388,-0.061987,0.199782,-0.275799,-0.060911,0.197067,-0.138953,-0.095473,-0.027915,0.110358,0.10363,0.0397,0.050348,0.001879,0.112641,0.08501,-0.021393,0.195895,-0.012996,-0.236062,-0.029908,-0.060161,-0.014808,-0.030472,0.000573,0.007121,-0.056807,-0.056807,0.259255,-0.010853,0.08631,-0.107596,-0.00326,0.026905,-0.030063,-0.011004,-0.156547,-0.051091,-0.131074,-0.034017,0.012918,0.224952,-0.030311,0.101266,0.033531,-0.035723,-0.025325,0.02585,0.102706,-0.009168,0.142517,-0.093112,-0.158737,-0.179177,-0.051309,-0.103808,-0.155513,0.256864,0.175216,0.025905,0.125598,-0.125761,-0.10457,0.034271,-0.113076,0.008346,-0.065847,-0.085608,-0.149411,-0.122514,0.00025,-0.135367,-0.340193,-0.22684,-0.101936,-0.101936,-0.101936,-0.001051,0.011018,-0.039199,-0.02846,0.387025,-0.107016,-0.054314,-0.054314,0.002411,0.144977,-0.211097,-0.018556,0.009354,-0.095282,-0.186047,-0.524099,0.015689,-0.008753,-0.167259,0.144716,-0.272834,-0.087354,0.212212,0.004133,0.193009,0.058722,0.094241,0.120052,-0.277535,-0.655305,-0.465653,0.273255,0.010718,0.458838,-0.065199,-0.627907,-0.496817,-0.520416,0.11938,-0.298573,-0.137042,-0.09393,-0.166976,0.052455,0.085735,0.170614,0.261225,-0.269783,-0.054826,-0.121761,0.123499,-1.479742,0.099794,0.263282,0.244188,0.071321,-0.214189,-0.146056,-0.023488,-0.054664,0.024207,-0.18809,-0.093573,-0.04174,-0.155959,-0.143353,0.090762,-0.019061,-0.057622,0.166257,0.408203,0.020433,-0.15814,0.072865,0.470934,-0.166658,0.001047,0.49776,0.10731,-0.266527,-0.399055,-0.097801,-0.081926,0.150281,-0.069697,0.138582,0.104782,0.120072,0.16378,-0.022614,-0.110706,-0.136473,-0.138934,-0.044775,0.099999,0.11268,-0.158365,-0.061198,-0.077319,-0.077319,0.06619,-0.023671,-0.124976,-0.138493,-0.090144,-0.108994,-0.003449,0.093834,-0.284981,-0.031728,0.136778,-0.230838,-0.229084,0.037179,0.091815,-0.210392,-0.227497,-0.227497,-0.227497,0.094825,0.043382,-0.536455,-0.043676,-0.043676,0.076108,0.076108,-0.559523,0.018427,0.105596,0.10605,-0.118325,0.222178,0.998195,0.031494,0.082953,-0.162383,-0.039299,-0.039299,-0.060426,-0.01783,-0.01783,0.137868,-0.006951,0.039127,0.039127,0.039127,0.016847,0.19813,-0.002727,-0.134241,-0.124115,-0.234706,0.101434,-1.679096,-0.116716,0.267626,-0.068443,-0.123776,-0.022759,-0.005149,-0.068109,0.160371,0.392655,0.464881,-0.104759,-0.132916,-0.094957,-0.098495,-0.026177,-0.026177,-0.026177,0.039833,-0.144296,0.10202,-0.046132,-0.159564,0.084196,0.012997,0.161981,-0.153378,-0.031059,-0.072652,0.092178,0.063612,-0.029394,-0.029394,0.044419,-0.015316,-0.001124,-0.006852,-0.03837,0.081718,-0.274572,0.121526,-0.03151,0.08789,0.100911,-0.152726,0.067252,0.012061,-0.033388,-0.033388,-0.2878,0.27899,0.210158,-0.315858,-0.003517,-0.124605,0.051907,-0.063801,-0.144068,-0.212129,-0.111617,0.034917,-0.035362,-0.181126,0.064944,0.068985,-0.091769,-0.361114,0.044587,-0.176689,0.42389,0.02052,0.25461,0.0741,0.017153,-0.079558,0.009877,0.100538,-0.078111,0.020759,0.176666,0.159543,0.067211,-0.774588,-0.08708,0.023401,-0.178037,-0.069889,-0.124833,-0.067405,0.048711,0.163258,-0.066281,0.184622,-0.169374,0.560434,-0.16392,0.218972,0.204333,0.052009,-0.150673,-0.100176,1.382885,-0.04918,0.059883,0.01296,0.077377,-0.06871,-0.380138,0.045499,0.051549,0.081579,0.056813,0.476011,0.166003,0.079476,0.072354,-0.066219,0.03307,0.00089,0.025629,-0.058908,0.029188,0.126229,0.095597,-0.182711,0.044051,-0.013555,0.01428,0.029507,-0.077927,-0.043066,-0.043066,0.338858,0.084956,-0.284016,-0.040185,-0.01312,0.082785,-0.023375,0.10065,-0.058112,0.285583,0.161262,0.02201,-0.081786,-0.047592,-0.023796,0.080394,0.080394,0.073885,-0.047166,-0.196578,0.005641,0.147501,-0.060149,-0.097269,-0.140631,0.020711,-0.067842,0.287079,0.149268,0.090183,-0.138436,0.309549,-0.054012,...,0.136794,-0.26858,0.079516,0.058392,-0.04076,-0.178584,0.002498,0.012771,-0.16728,-0.22212,-0.094598,0.023982,-0.129618,0.013523,0.075654,0.084012,-0.167851,-0.015747,0.031317,0.028779,0.134754,-0.211545,0.05439,-0.011439,-0.087032,-0.092204,-0.206332,-0.11572,-0.097904,0.048629,0.230822,0.029898,0.018941,-0.141627,-0.052695,-0.05989,0.098427,0.085665,-0.126007,-0.044015,-0.169554,0.139878,0.144167,-0.076193,-0.057955,-0.135479,0.052299,-0.061283,0.01708,0.076153,-0.02935,0.088058,0.074274,0.04332,0.016662,0.016662,0.016662,0.016662,0.046441,0.02926,-0.275626,-0.006527,-0.005471,-0.005422,-0.12967,-0.07661,-0.208533,-0.116666,0.21889,0.117209,0.048756,0.02094,0.00386,0.024934,-0.035903,0.039592,0.039592,-0.009273,0.071776,-0.255808,0.116936,0.134808,0.153929,-0.095655,-0.03791,-0.029834,0.045476,-0.238958,0.104134,-0.259499,0.080021,-0.122181,0.213352,0.127354,0.072446,0.04368,-0.003741,0.003081,-0.034064,0.00157,0.010216,-0.033073,0.067897,-0.015314,-0.224128,-0.102437,-0.192308,0.088323,0.021673,-0.510079,0.015877,-0.195509,-0.003902,0.088054,0.020161,-0.391686,-0.127541,0.026809,0.04131,-0.203985,-0.151362,-0.017348,0.149184,-0.088179,-0.131045,0.528672,0.2967,0.564764,0.06383,0.093864,-0.150902,0.058984,-0.131373,-0.060703,-0.177436,0.230716,-0.034772,0.019796,0.019796,0.019796,-0.013792,-0.243116,-0.052454,0.006053,-0.066202,-1.340425,-0.094227,0.03647,0.138289,0.781977,-0.123327,0.083917,-0.011025,0.10334,-0.062911,0.180141,-0.115751,0.14153,-0.340436,0.219864,0.306538,-0.035868,0.056957,-0.256483,0.086969,0.007377,0.147694,-0.140744,0.178801,-0.144593,-0.131837,-0.02958,-0.032533,-0.243308,0.138783,-0.459269,0.39277,0.452557,0.169659,0.506227,0.12897,-0.018501,-0.138186,0.152399,0.230071,-0.074288,0.342786,0.079686,-0.164304,-0.002386,0.080122,0.117803,-0.112538,0.093017,0.044463,-0.104195,-0.049628,-0.220199,-0.446343,0.020442,-0.003605,0.098229,-0.134701,0.305972,0.347248,-0.030005,0.205748,-0.259808,0.370935,0.032276,-0.260934,0.056175,-0.227671,0.165453,-0.206087,-0.127777,-0.203745,0.239801,-0.016793,0.269171,-0.044588,-0.023681,-0.050442,0.262803,-0.169904,0.046024,-0.268557,0.034434,-0.22143,-0.001517,0.197624,-0.06664,-0.238371,0.184367,-0.119819,0.106052,-0.254044,0.092819,0.277081,-0.010746,-0.084838,-0.01192,-0.458486,0.034431,-0.439326,-0.057407,-0.057407,-0.202226,-0.202226,0.028847,0.126624,-0.226738,-0.101724,-0.101724,-0.101724,0.226049,0.13371,-0.02243,0.134877,-0.078193,0.022441,-0.167004,0.274191,0.142918,-0.299737,0.140461,-0.000395,0.141959,-0.09825,-0.265313,0.308771,-0.247846,-0.079035,0.328105,-0.03544,-0.019421,-0.163447,0.056583,0.05712,0.000297,-0.002115,-0.227527,-0.034896,-0.209908,0.183232,-0.007245,0.109675,-0.211242,-0.36167,-0.032158,-0.032158,-0.032158,-0.107253,-0.138227,0.031211,0.200998,-0.013764,-0.263116,-0.000904,-0.247455,-0.270831,0.070422,-0.022313,-0.141632,-0.089454,-0.046937,0.354989,-0.191125,0.005009,-0.046838,-0.042338,0.0698,-0.131505,-0.065774,0.228715,-0.024353,0.203087,-0.136054,-0.306048,-0.001915,-0.061226,0.033287,-0.05202,0.47369,-0.158837,0.035811,-0.058971,0.006417,0.030815,0.134422,-0.863954,0.516989,-0.595953,0.440231,0.057658,0.041778,0.288871,0.014876,0.734128,0.064774,0.08399,0.374523,0.113517,0.241101,-0.011783,0.099255,0.929777,-0.066494,-0.115416,0.008937,-0.564995,-0.14528,-0.829372,0.07865,0.216418,-0.189748,0.109552,0.061852,0.055993,-0.210279,-0.080048,-0.108629,-0.108629,0.470721,0.044068,-0.005501,-0.153481,0.044066,-0.018572,0.163128,0.249898,0.260648,0.015294,-0.03293,0.112291,-0.183449,0.114078,0.501022,0.074917,-0.170533,0.025675,0.085622,0.075545,0.034,0.035136,-0.580076,-0.243753,0.248297,-0.022074,0.017928,0.025502,0.010832,0.007731,0.007545,-0.352387,0.122644,0.108077,0.040919,0.156855,-0.3378,0.047601,-0.121499,0.02446,0.077975,-0.464481,-0.020886,-0.357478,0.127858,-0.190057,0.151928,0.259579,0.04086,-0.05289,-0.167665,0.006193,0.203727,0.030923,-0.047192,0.166768,-0.036382,0.33843,-0.25552,0.363928,-0.166857,0.099266,-0.016461,-0.008326,0.081342,-0.36388,-0.060652,0.23976,-0.013698,-0.130544,0.355948,-0.056412,0.106787,-0.084668,0.138885,-0.297487,0.22545,0.145129,0.214343,0.080091,0.302283,-0.335726,0.155163,-0.212949,0.169459,0.067619,0.074171,0.03195,0.01113,0.017715,0.008347,0.008347,0.417368,0.271601,-0.511137,-0.283006,0.784665,-0.021755,-0.527261,-0.095838,0.096313,0.363583,0.104756,-0.033616,-0.016057,-0.016057,-0.016057,-0.020875,0.039515,0.178559,0.058939,0.109902,-0.155147,0.059649,0.094529,-0.119047,-0.291584,0.432402,-0.150712,-0.029039,0.225849,-0.016697,0.157491,0.259206,-0.201174,-0.30915,-0.115108,-0.136204,0.771067,0.086986,0.074171,-0.048134,0.093929,0.014135,0.077032,0.428348


In [111]:
# words most positively correlated with content being from MensRights
# some of these make NO sense, might have been good to do more cleaning
coef_df.sum(axis=0).sort_values(ascending=False)[0:15]

military base                    5.529991
broad generalization             4.476077
article quote                    4.112461
shouldn question idea            3.429160
bewildered                       3.273144
minute slightly increased        3.245495
bro jap                          3.156064
masculinity boy                  3.101634
mra incel                        3.046819
shared common                    2.923378
extensively                      2.895841
experiential authority member    2.846495
mother choose                    2.794948
semantic                         2.685628
hear trans                       2.653834
dtype: float64

In [112]:
# words most positively correlated with content being from AskFeminists
# these make a little more sense than the MensRights ones...slightly.
coef_df.sum(axis=0).sort_values(ascending=True)[0:15]

pretty important law             -5.759800
tell laugh joke pretty           -4.878412
support accused                  -4.871705
end jail                         -3.916339
designated disposal area         -3.349880
open feeling                     -3.347517
single conversation              -3.258743
argument opinion                 -3.215864
haram                            -3.131597
dominated dominated              -3.064219
argument waste                   -3.058328
fair deserve miserable death     -3.045156
physical violent abuse sharing   -3.019294
muslim absolutely equal          -2.755272
britain agonizing day            -2.743639
dtype: float64

### Confusion Matrix

In [113]:
pred_log = logreg.predict(X_test_tf)

In [114]:
pred_log.shape

(11591,)

In [115]:
y_test.shape

(11591,)

In [116]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred_log)
tn, fp, fn, tp = confusion_matrix(y_test, pred_log).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 4266
False Positives: 1525
False Negatives: 1303
True Positives: 4497


In [117]:
# setting up df for comparison
y_test_preds = pd.DataFrame(y_test)

In [118]:
y_test_preds['pred'] = pred_log

In [119]:
y_test_preds.head()

Unnamed: 0,subreddit_MensRights,pred
5938,0,0
153,0,0
7102,0,0
20531,0,1
32532,1,1


In [120]:
# rows where predictions don't match
y_test_preds[y_test_preds['subreddit_MensRights'] != y_test_preds['pred']].head()

Unnamed: 0,subreddit_MensRights,pred
20531,0,1
48911,1,0
27098,0,1
17941,0,1
29097,1,0


In [121]:
# example of a post misclassified as AskFeminists
#this comment is just responding to another user who posted something about JP (Jordan Peterson)
# but there's no real context that would help identify which thread this belongs to
femvmen.loc[57021]

index                   19820                                                                                                                                                                                                                                                   
text                    The first clip legitimately doesn't work, and none from the second are JP so I've still not seen proof of your origional claim that JP ever said that. If you send me a functioning clip then I'll change my opinion but it just says video unavailable.
type                    comment                                                                                                                                                                                                                                                 
removed                 0                                                                                                                                                            

In [122]:
# example of a post misclassified as MensRights
# not much to go off here for classification, it could really be from either.
femvmen.loc[18964]

index                   19170                                                                     
text                    Dude, I can't argue with you when you are so deep in the dogma. Good luck.
type                    comment                                                                   
removed                 0                                                                         
deleted                 0                                                                         
clean_text_stop         dude  i can t argue with you when you are so deep in the dogma  good luck 
lems                    dude i can argue with you when you are so deep in the dogma good luck     
subreddit_MensRights    0                                                                         
Name: 18964, dtype: object

### Other models

I tried all of the below models before selecting the Logistic Regression for further analysis of the coefficients - none of these had as strong a test score as the one on the Logistic Regression. I spent more time with Random Forest, and I could have tried to tune the parameters for Naive Bayes and SVC to improve the scores but didn't have time.

### Naive Bayes Multinomial

In [123]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [124]:
nb.fit(X_train_tf2, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [125]:
nb.score(X_train_tf2, y_train)

0.8530109567768096

In [126]:
nb.score(X_test_tf2, y_test)

0.7464412043827108

## Support Vector Classifier

In [127]:
from sklearn import svm
svc = svm.SVC(kernel= 'rbf', C = 100, gamma = 0.05)

In [128]:
svc.fit(X_train_tf, y_train)

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [129]:
svc.score(X_train_tf, y_train)

0.9867785350703132

In [130]:
svc.score(X_test_tf, y_test)

0.7386765593995341

## Random Forest

In [131]:
from sklearn.ensemble import RandomForestClassifier

  from numpy.core.umath_tests import inner1d


In [132]:
# taking sample for faster gridsearching
sample20000 = femvmen.sample(n=20000, random_state = 42)

In [133]:
sample20000['subreddit_MensRights'].value_counts(normalize=True) # about same proportions as full dataset

1    0.5018
0    0.4982
Name: subreddit_MensRights, dtype: float64

In [134]:
X = sample20000['lems']
y = sample20000['subreddit_MensRights']

In [135]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = .30, 
                                                   stratify=y)

In [136]:
tfidf = TfidfVectorizer(ngram_range = (1, 4), max_features = 125000, stop_words = 
                        'english')

In [137]:
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)

In [138]:
X_train_tf.shape

(14000, 125000)

In [139]:
# just trying out basic random forest without tuning parameters or setting max depth
rf = RandomForestClassifier()

In [140]:
rf.fit(X_train_tf, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [141]:
rf.score(X_train_tf, y_train)

0.981

In [142]:
rf.score(X_test_tf, y_test)

0.6566666666666666

In [143]:
rf = RandomForestClassifier()

In [144]:
# gridsearching for best parameters - this is the first one I did
from sklearn.model_selection import GridSearchCV

In [145]:
params = {
    'n_estimators': [5, 10, 1],
    'max_depth': [10000, 90000, 20000],
    'oob_score': ['True', 'False'],
    'warm_start': ['True'],
    'n_jobs': [-2]
}
gs = GridSearchCV(rf, param_grid = params, cv = 3)
gs.fit(X_train_tf, y_train)
print(gs.best_score_)
gs.best_params_ 

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

0.6554285714285715


  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


{'max_depth': 90000,
 'n_estimators': 10,
 'n_jobs': -2,
 'oob_score': 'False',
 'warm_start': 'True'}

In [146]:
# second gridsearch, I repeated this until I narrowed down the best max depth range
params = {
    'n_estimators': [10],
    'max_depth': range(875, 900, 1),
    'oob_score': ['False'],
    'warm_start': ['True'],
    'n_jobs': [-2]
}
gs = GridSearchCV(rf, param_grid = params, cv = 3)
gs.fit(X_train_tf, y_train)
print(gs.best_score_)
gs.best_params_ 

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])
  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])

0.6618571428571428


  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


{'max_depth': 885,
 'n_estimators': 10,
 'n_jobs': -2,
 'oob_score': 'False',
 'warm_start': 'True'}

In [147]:
# resetting X and y variables to test random forest on full train/test sets
X = femvmen['lems']
y = femvmen['subreddit_MensRights']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = .25,
                                                   stratify = y)

In [148]:
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)

In [149]:
rf = RandomForestClassifier(max_depth = 880, n_estimators = 1000, n_jobs = -2, oob_score = False,
                 warm_start = True)

In [150]:
rf.fit(X_train_tf, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=880, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-2,
            oob_score=False, random_state=None, verbose=0, warm_start=True)

In [151]:
rf.score(X_train_tf, y_train)

0.984401601251553

In [152]:
rf.score(X_test_tf, y_test)

0.7180619780523155

## Conclusion

It would be too easy to say that my best model's inability to crack 77% accuracy means that "we're really not that different after all", and if I were to spend more time on this project there are other options I would explore to try and improve my model:

1. I'd pull a completely unrelated subreddit to test as a control against the other two subreddits to make sure I was tuning the best model possible.

2. Given how important context is for the comments, I would either try to aggregate comments on a post and analyze them together, or set a minimum wordcount on comments to be used in the analysis to give the model a better chance of distinguishing them.

3. I could scrape another years' worth of data to add to the model.

4. I could do some more cleaning and EDA that might help consolidate slang/similar terms that weren't captured by the lemmatizer, and identify stronger trends in the subreddits that could be leveraged for better prediction. It'd also be interesting to see what difference it might make to fit the TFIDF on just one subreddit first, then transform both subreddits and analyze them together.

5. If any of the above helped minimize overfit on the Logistic Regression or Random Forest, I'd give Naive Bayes Multinomial and Support Vector Classifier another shot (and spend more time tuning them).

The really interesting questions this project has generated would require a deeper analysis: what's the overlap in people who post on MensRights and AskFeminists? (How many of those are trolls, whose content is removed?) What common themes exist between removed posts (can we write an algorithm for trolling content?) How does sentiment analysis compare across the two subreddits, is there a discernable difference? Is it possible to measure 'extreme' attitudes in a subreddit, and if so can it be mapped over time, against real events happening in the world that might trigger anger on both or one side? Is it possible to follow the posts of single users and see how they change in sentiment and neutrality over time (and does the sub they post on the most make a difference?) Maybe I'll come back to it when I've solidified my modeling skills.