## Step 4-B: Modeling
### Random Forest Classifier

I next ran a Random Forest Classifier.

Because grid searching over many parameters is slow, I reused the Count Vectorizer parameters I had already determined when modeling with Multinomial Naive Bayes and grid searched only the Random Forest Classifier parameters.
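As a rough sketch of the same idea in a single object, the vectorizer settings can be fixed inside a `Pipeline` while the grid lists only the forest's parameters. The toy corpus, the labels, and the step names `cvec` and `rf` are made up for illustration, and the small grid values are chosen only so the example runs quickly. Note that a pipeline refits the vectorizer inside every fold, which is not what the cells below do.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
texts = [
    "worried nervous panic",
    "panic attack worried",
    "nervous worried panic attack",
    "distracted focus problems",
    "focus problems distracted homework",
    "homework distracted focus",
]
labels = [0, 0, 0, 1, 1, 1]

pipe = Pipeline([
    # Vectorizer settings are fixed here and never enter the grid.
    ("cvec", CountVectorizer(stop_words="english", max_features=4000)),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Only keys prefixed with "rf__" are searched.
param_grid = {"rf__n_estimators": [10, 20], "rf__max_depth": [2, 3]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_)
```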

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

In [2]:
df = pd.read_csv('./all.csv')

In [3]:
# Drop rows with missing values (a clean-up task left over from an earlier step)
df.dropna(inplace = True)

### Pre-Processing

In [4]:
X = df.drop(columns = ['subreddit','created_utc'])
y = df['subreddit']

In [5]:
#Baseline
#Anxiety = 0, ADHD = 1, Depression = 2, Autism = 3
y.value_counts(normalize = True)

0    0.284551
2    0.258203
1    0.232724
3    0.224522
Name: subreddit, dtype: float64
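To make the baseline explicit: a model that always predicts the most common class is right as often as that class appears. The short check below copies the proportions from the output above; the variable names are only illustrative.

```python
# Class proportions copied from the value_counts output above.
class_share = {0: 0.284551, 2: 0.258203, 1: 0.232724, 3: 0.224522}
class_names = {0: "Anxiety", 1: "ADHD", 2: "Depression", 3: "Autism"}

# The majority class sets the accuracy any model has to beat.
majority = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority]

print(f"Always predicting {class_names[majority]} scores {baseline_accuracy:.1%}")
```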

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [7]:
cvec = CountVectorizer(stop_words = 'english', max_df = .9, max_features = 4000)
Xcvec_train = cvec.fit_transform(X_train['selftext'])
Xcvec_test = cvec.transform(X_test['selftext'])

In [8]:
Xcvec_train

<12069x4000 sparse matrix of type '<class 'numpy.int64'>'
	with 585859 stored elements in Compressed Sparse Row format>
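The sparse summary above also shows how much the later conversion to a dense DataFrame costs in memory. The estimate below uses only the numbers printed above; the 4-byte index size is an assumption about how SciPy stores this matrix, so treat the sparse figure as approximate.

```python
# Figures copied from the sparse matrix summary above.
n_rows, n_cols, n_stored = 12069, 4000, 585859

bytes_per_value = 8  # int64 counts
bytes_per_index = 4  # assumption: 32-bit index arrays

# Share of cells that hold a non-zero count.
density = n_stored / (n_rows * n_cols)

# A dense array stores every cell, zero or not.
dense_bytes = n_rows * n_cols * bytes_per_value

# CSR stores one value and one column index per non-zero, plus one row pointer per row (+1).
csr_bytes = n_stored * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index

print(f"density: {density:.2%}")
print(f"dense:   {dense_bytes / 1e6:.0f} MB")
print(f"sparse:  {csr_bytes / 1e6:.1f} MB")
```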

In [9]:
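# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
# toarray() returns a plain ndarray rather than the np.matrix produced by todense()
Xcv_train_df = pd.DataFrame(Xcvec_train.toarray(), columns=cvec.get_feature_names_out())
Xcv_test_df = pd.DataFrame(Xcvec_test.toarray(), columns=cvec.get_feature_names_out())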

In [10]:
Xcv_train_df.reset_index(drop=True, inplace=True)
X_train.reset_index(drop=True, inplace=True)
Xcv_test_df.reset_index(drop=True, inplace = True)
X_test.reset_index(drop=True, inplace=True)

In [11]:
X_train_all = pd.concat([Xcv_train_df, X_train], axis=1)
X_test_all = pd.concat([Xcv_test_df, X_test], axis=1)

In [12]:
X_train_all.drop(columns = 'selftext', inplace = True)

In [13]:
X_test_all.drop(columns = 'selftext', inplace = True)

In [14]:
X_train_all.shape, y_train.shape

((12069, 4006), (12069,))

### Modeling with Random Forest

The accuracy of this model was about the same as the others, but its grid search took the longest to run of all of them. The overfitting issue persists: the model scores about 0.83 on the training set and 0.76 on the test set.

One reason I chose Random Forest and GBoost is that they let me take the sentiment analysis into consideration alongside the word counts, which I believe is highly relevant to this problem. However, it doesn't seem to make much of a difference to the accuracy.

In [15]:
rfc = RandomForestClassifier()

In [16]:
params = {'n_estimators': [750, 1000],
          'max_depth': [17, 19],
          'min_samples_split': [4, 5],
          'min_samples_leaf': [2, 3],
          'max_features': ['auto']}

In [17]:
gs = GridSearchCV(rfc, params, cv = 3)

In [18]:
gs.fit(X_train_all, y_train)

GridSearchCV(cv=3, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [17, 19], 'max_features': ['auto'],
                         'min_samples_leaf': [2, 3],
                         'min_samples_split': [4, 5],
                         'n_estimators': [750, 1000]})
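A quick count helps explain why this search is slow. The sketch below enumerates the grid above and assumes `GridSearchCV`'s default `refit=True`, which adds one final fit on the full training set with the best combination (1000 trees, per the best parameters reported further down).

```python
from itertools import product

# Same grid and fold count as the search above.
params = {'n_estimators': [750, 1000],
          'max_depth': [17, 19],
          'min_samples_split': [4, 5],
          'min_samples_leaf': [2, 3],
          'max_features': ['auto']}
cv = 3

keys = list(params)
combos = [dict(zip(keys, values)) for values in product(*params.values())]

# Every combination is fit once per fold.
cv_fits = len(combos) * cv
cv_trees = sum(combo['n_estimators'] for combo in combos) * cv

# Assumption: default refit=True, so the best combination is fit once more.
total_fits = cv_fits + 1
total_trees = cv_trees + 1000

print(len(combos), "combinations,", total_fits, "fits,", total_trees, "trees")
```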

In [19]:
gs.score(X_train_all, y_train), gs.score(X_test_all, y_test)

(0.8332090479741486, 0.7578921203082277)
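To put the overfitting in numbers, the snippet below takes the two scores above and the majority-class share from the baseline cell, then computes the train/test gap and the lift over the baseline. It is simple arithmetic on reported values, not a new evaluation.

```python
# Scores copied from the output above; baseline from the value_counts cell.
train_acc = 0.8332090479741486
test_acc = 0.7578921203082277
baseline_acc = 0.284551

# Gap between seen and unseen data: a rough measure of overfitting.
gap = train_acc - test_acc

# Improvement of the test score over always predicting the majority class.
lift = test_acc - baseline_acc

print(f"train/test gap:     {gap * 100:.1f} points")
print(f"lift over baseline: {lift * 100:.1f} points")
```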

In [20]:
gs.best_params_

{'max_depth': 19,
 'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 4,
 'n_estimators': 1000}