## Step 4-A: Modeling
### Multinomial Naive Bayes Model

I started with a Multinomial Naive Bayes (MNB) model on Count Vectorizer features, with a few custom stop words added. I tuned the Count Vectorizer parameters here and re-used the best ones in the subsequent models. This shortcut may not have produced the most optimized models, but it saved a lot of grid-search time.

Because MNB expects non-negative feature values such as counts, this model is fitted to the count-vectorized data only. The other two models also incorporate sentiment analysis.
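To illustrate the point about negative values, here is a minimal sketch on made-up data. The counts and the sentiment column are hypothetical and only stand in for the real features; the sketch assumes a sentiment score that can go below zero.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word counts for four documents (rows) over three terms (columns).
counts = np.array([[2, 0, 1],
                   [0, 3, 0],
                   [1, 1, 0],
                   [0, 0, 4]])
labels = np.array([0, 1, 0, 1])

# Hypothetical sentiment scores in [-1, 1]; the values are invented for illustration.
sentiment = np.array([[-0.6], [0.4], [-0.1], [0.8]])

# Counts alone are non-negative, so this fit succeeds.
MultinomialNB().fit(counts, labels)

# Appending a column that contains negative entries makes the fit fail.
with_sentiment = np.hstack([counts, sentiment])
try:
    MultinomialNB().fit(with_sentiment, labels)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print(err)
```

Rescaling such a column to a non-negative range would be one workaround, but here the simpler route is taken and MNB sees the counts only.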

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

In [2]:
df = pd.read_csv('./df.csv')

In [3]:
df.shape

(6345, 9)

In [4]:
X = df['selftext']
y = df['subreddit']

In [5]:
X.shape, y.shape

((6345,), (6345,))

In [6]:
df['subreddit'].value_counts(normalize = True)

0    0.290623
2    0.262096
1    0.237825
3    0.209456
Name: subreddit, dtype: float64
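The largest class makes up about 29% of the posts, so a model that always predicts it would be right roughly 29% of the time. A small sketch of that baseline, using toy labels whose mix only approximates the proportions above (the arrays are made up, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with a class mix roughly like the one above (illustrative values).
y_toy = np.array([0] * 29 + [2] * 26 + [1] * 24 + [3] * 21)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class and measure accuracy.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

Any accuracy reported later is best read against this kind of baseline rather than against 50%.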

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [8]:
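# Custom stop words added on top of scikit-learn's built-in English list
from sklearn.feature_extraction import text

my_words = ['just', 'do', 'don', 've']
my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)  # frozenset; newer scikit-learn releases expect list(...) here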
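As a quick sanity check, the sketch below fits a vectorizer on two made-up sentences and confirms that the custom words never reach the vocabulary. The sentences and variable names are hypothetical, and the stop words are converted to a list because newer scikit-learn releases validate that parameter.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

extra_words = ['just', 'do', 'don', 've']
# Converted to a list for compatibility with newer scikit-learn releases.
stop_words = list(ENGLISH_STOP_WORDS.union(extra_words))

# Made-up sentences; the default tokenizer splits "don't" into "don" and "t".
docs = ["I just don't know what I've done", "Just do the dishes"]

vec = CountVectorizer(stop_words=stop_words)
vec.fit(docs)
vocab = sorted(vec.vocabulary_)
print(vocab)
```

The fragments 'don' and 've' are on the list because the default token pattern breaks contractions at the apostrophe and drops single-character pieces.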

In [10]:
cvec = CountVectorizer(stop_words = my_stop_words)
Xcvec_train = cvec.fit_transform(X_train)
Xcvec_test = cvec.transform(X_test)

In [11]:
Xcvec_train

<4758x19771 sparse matrix of type '<class 'numpy.int64'>'
	with 269004 stored elements in Compressed Sparse Row format>

In [12]:
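# Dense DataFrames of the counts; on scikit-learn < 1.0 use cvec.get_feature_names() instead
feature_names = cvec.get_feature_names_out()
Xcv_train_df = pd.DataFrame(Xcvec_train.toarray(), columns=feature_names)
Xcv_test_df = pd.DataFrame(Xcvec_test.toarray(), columns=feature_names)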

In [13]:
Xcv_train_df.reset_index(drop=True, inplace=True)
X_train.reset_index(drop=True, inplace=True)
Xcv_test_df.reset_index(drop=True, inplace = True)
X_test.reset_index(drop=True, inplace=True)

In [14]:
X_train_all = pd.concat([Xcv_train_df, X_train], axis=1)
X_test_all = pd.concat([Xcv_test_df, X_test], axis=1)

In [15]:
X_train_all.drop(columns = 'selftext', inplace = True)

In [16]:
X_train_all.shape

(4758, 19771)

In [17]:
mnb = MultinomialNB()
mnb.fit(Xcvec_train, y_train)

MultinomialNB()

In [18]:
mnb.score(Xcvec_train, y_train), mnb.score(Xcvec_test, y_test)

(0.894703656998739, 0.7511027095148078)
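The gap between the train and test scores suggests some overfitting. One way to get a steadier estimate without touching the test set is cross-validation on the training data; `cross_val_score` is imported above for that purpose. The sketch below uses a tiny invented corpus in place of `X_train` and `y_train`, so the numbers it prints say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for X_train / y_train.
texts = ["cats purr and nap", "dogs bark and fetch", "cats chase mice",
         "dogs chase balls", "kittens are young cats", "puppies are young dogs"] * 3
labels = [0, 1, 0, 1, 0, 1] * 3

# Wrapping the vectorizer in a pipeline means it is refit inside every fold,
# so no vocabulary leaks from the held-out part of the fold.
cv_pipe = Pipeline([('cvec', CountVectorizer()), ('mnb', MultinomialNB())])
scores = cross_val_score(cv_pipe, texts, labels, cv=3)
print(scores.mean(), scores.std())
```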

In [19]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = my_stop_words)),
    ('mnb', MultinomialNB())
])

In [20]:
pipe_params = {
    'cvec__max_features': [4500, 4000, 5000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}

In [21]:
X_train.shape, y_train.shape

((4758,), (4758,))

In [22]:
gs = GridSearchCV(pipe, pipe_params, cv = 3)
gs.fit(X_train, y_train)

In [23]:
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.8520386717108028, 0.7548834278512917)

In [24]:
gs.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 5000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 2)}
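The introduction mentions re-using these vectorizer settings in the later models. A minimal sketch of one way to do that follows; the dictionary is copied from the output above, while `LogisticRegression` is only a placeholder classifier chosen for illustration, not necessarily one of the models used later. The custom stop words are left out to keep the sketch self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Copied from the grid search output above.
best_params = {'cvec__max_df': 0.9,
               'cvec__max_features': 5000,
               'cvec__min_df': 3,
               'cvec__ngram_range': (1, 2)}

# Strip the 'cvec__' step prefix to get plain CountVectorizer keyword arguments.
cvec_kwargs = {key.split('__', 1)[1]: value
               for key, value in best_params.items()
               if key.startswith('cvec__')}

# Placeholder downstream model; swap in whichever estimator comes next.
next_pipe = Pipeline([
    ('cvec', CountVectorizer(**cvec_kwargs)),
    ('clf', LogisticRegression(max_iter=1000))
])
```

With a fitted search object, `gs.best_params_` could be passed in place of the hard-coded dictionary.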