## Step 4-C: Modeling
### Gradient Boosting Classifier

I tried a Gradient Boosting model to see whether it would reduce the variance (the gap between train and test accuracy) seen in the earlier models.

Because grid searching over many parameters is slow, I reused the CountVectorizer parameters that I had previously determined when modeling with Multinomial Naive Bayes and grid searched only the Gradient Boosting parameters.

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

In [2]:
df = pd.read_csv('./all.csv')

In [3]:
# Drop rows with missing values (a quick clean-up task that leaked into modeling)
df.dropna(inplace = True)

### Pre-processing

In [4]:
X = df.drop(columns = ['subreddit','created_utc'])
y = df['subreddit']

In [5]:
#Baseline
#Anxiety = 0, ADHD = 1, Depression = 2, Autism = 3
y.value_counts(normalize = True)

0    0.284551
2    0.258203
1    0.232724
3    0.224522
Name: subreddit, dtype: float64
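The proportions above set the baseline: a model that always predicts the most common class would be right about 28.5% of the time. A minimal sketch of that reading, with the proportions copied from the output above (the variable names are only illustrative):

```python
import pandas as pd

# Class proportions copied from the value_counts output above
# (0 = Anxiety, 1 = ADHD, 2 = Depression, 3 = Autism).
class_share = pd.Series({0: 0.284551, 2: 0.258203, 1: 0.232724, 3: 0.224522})

# A classifier that always predicts the most frequent class is correct
# exactly as often as that class appears in the data.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.1%}")
```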

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [7]:
cvec = CountVectorizer(stop_words = 'english', max_df = .9, max_features = 4000)
Xcvec_train = cvec.fit_transform(X_train['selftext'])
Xcvec_test = cvec.transform(X_test['selftext'])

In [8]:
Xcvec_train

<12069x4000 sparse matrix of type '<class 'numpy.int64'>'
	with 585859 stored elements in Compressed Sparse Row format>

In [9]:
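# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
feature_names = cvec.get_feature_names_out()
Xcv_train_df = pd.DataFrame(Xcvec_train.toarray(), columns=feature_names)
Xcv_test_df = pd.DataFrame(Xcvec_test.toarray(), columns=feature_names)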

In [10]:
Xcv_train_df.reset_index(drop=True, inplace=True)
X_train.reset_index(drop=True, inplace=True)
Xcv_test_df.reset_index(drop=True, inplace = True)
X_test.reset_index(drop=True, inplace=True)

In [11]:
X_train_all = pd.concat([Xcv_train_df, X_train], axis=1)
X_test_all = pd.concat([Xcv_test_df, X_test], axis=1)

In [12]:
X_train_all.drop(columns = 'selftext', inplace = True)

In [13]:
X_test_all.drop(columns = 'selftext', inplace = True)

In [14]:
X_train_all.shape, y_train.shape

((12069, 4006), (12069,))
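Converting the count matrix to a dense DataFrame works at this size but uses a lot of memory. One possible alternative, sketched here on made-up toy data, keeps everything sparse with `scipy.sparse.hstack`. The `sentiment` column and all `toy_*` names are hypothetical stand-ins for the extra numeric features, not the notebook's actual columns.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for X_train: one text column plus one numeric feature.
toy_X = pd.DataFrame({
    'selftext': ['i feel anxious today',
                 'cannot focus at work',
                 'feeling low and tired'],
    'sentiment': [-0.4, -0.1, -0.7],   # hypothetical numeric feature
})

toy_cvec = CountVectorizer()
counts = toy_cvec.fit_transform(toy_X['selftext'])   # sparse word counts

# Wrap the remaining numeric columns in a sparse matrix and stack them
# to the right of the counts; row order is preserved, so no index reset is needed.
extra = csr_matrix(toy_X.drop(columns='selftext').to_numpy(dtype=float))
combined = hstack([counts, extra], format='csr')

print(combined.shape)
```

The trade-off is that column names are lost, so feature names would have to be tracked separately.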

### Modeling with GBoost

I ran several different grid searches, each including the previous "best" parameters alongside at least one alternative. The best parameters from the grid search below were consistent across those searches.

The accuracy of this model was about the same as the others, but the gap between train and test accuracy closed slightly.

As with Random Forest, this model can take the sentiment analysis scores into account.

In [15]:
gb = GradientBoostingClassifier()

In [16]:
params = {'n_estimators': [75, 50],
          'max_depth': [2, 3],
          'learning_rate': [1.2, 1]}

In [17]:
gb_gs = GridSearchCV(gb, params, cv = 3)

In [18]:
gb_gs.fit(X_train_all, y_train)

GridSearchCV(cv=3, estimator=GradientBoostingClassifier(),
             param_grid={'learning_rate': [1.2, 1], 'max_depth': [2, 3],
                         'n_estimators': [75, 50]})

In [19]:
gb_gs.score(X_train_all, y_train), gb_gs.score(X_test_all, y_test)

(0.8447261579252631, 0.7638578175490928)
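Reading "variance" here as the gap between train and test accuracy, the two scores above can be compared directly. A small worked example with the values copied from the output (variable names are illustrative):

```python
# Scores copied from the output above.
train_acc = 0.8447261579252631
test_acc = 0.7638578175490928

# The train/test gap is a simple indicator of overfitting:
# the larger it is, the more the model has memorized the training data.
gap = train_acc - test_acc

print(f"Train/test gap: {gap:.1%}")
```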

In [20]:
gb_gs.best_params_

{'learning_rate': 1, 'max_depth': 2, 'n_estimators': 75}
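To see how close the other parameter combinations came, the full table of cross-validated scores is available in `cv_results_`. The sketch below runs a small search on synthetic data from `make_classification` so that it is self-contained; the `demo_*` names and the tiny grid are assumptions for illustration, not the search run above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic four-class problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=8, n_classes=4,
                                     random_state=42)

demo_params = {'n_estimators': [10, 20], 'max_depth': [2, 3]}
demo_gs = GridSearchCV(GradientBoostingClassifier(random_state=42),
                       demo_params, cv=3)
demo_gs.fit(X_demo, y_demo)

# One row per parameter combination, best first.
results = (pd.DataFrame(demo_gs.cv_results_)
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))

print(results)
print(demo_gs.best_score_)   # mean cross-validated accuracy of the best combination
```

In the notebook, the same lookup on `gb_gs.cv_results_` would show whether the winning combination is clearly ahead or within noise of the runner-up.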

# Step 5: Evaluation

Overall, all three models had very similar performance, and all are overfit to the training data. I had originally collected a smaller data set and hoped that more data might help address the overfitting. However, when I collected more data, the accuracy of the models actually decreased, as did the variance.

| Model                     | Train Accuracy | Test Accuracy | Best Parameters                                                                                            |
|---------------------------|----------------|---------------|------------------------------------------------------------------------------------------------------------|
| Multinomial Naive Bayes   | 85%            | 75%           | max_df: 0.9<br>max_features: 5000<br>min_df: 3<br>ngram_range: (1, 2)                                      |
| Random Forest Classifier  | 83%            | 76%           | max_depth: 19<br>max_features: 'auto'<br>min_samples_leaf: 2<br>min_samples_split: 4<br>n_estimators: 1000 |
| Gradient Boost Classifier | 84%            | 77%           | learning_rate: 1<br>max_depth: 2<br>n_estimators: 50                                                       |

There are a number of steps that I could take to further optimize these models:
- GridSearch over both CountVectorizer and model parameters within the same pipeline
- Lemmatize the data
- Create a more thorough custom stop-words list
- Try TF-IDF instead of CountVectorizer
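The first item in the list above could look roughly like the following sketch, which puts the vectorizer and the model in one pipeline so a single grid search covers both. It runs on a made-up toy data set; the `sentiment` column, the step names (`features`, `text`, `model`), and the tiny grid are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the real data (0 = Anxiety, 1 = ADHD).
toy = pd.DataFrame({
    'selftext': [
        'my heart races and i panic before every meeting',
        'constant worry keeps me awake and my chest feels tight',
        'i panic in crowds and worry about everything',
        'racing thoughts and worry make my heart pound',
        'i cannot focus on homework and keep losing my keys',
        'my attention drifts and i forget every deadline',
        'i lose focus mid sentence and forget tasks',
        'started medication to help my focus and attention',
    ],
    'sentiment': [-0.6, -0.5, -0.7, -0.4, -0.2, -0.1, -0.3, 0.2],  # made-up scores
    'subreddit': [0, 0, 0, 0, 1, 1, 1, 1],
})

# Vectorize the text column; pass the numeric column through unchanged.
features = ColumnTransformer(
    [('text', CountVectorizer(), 'selftext')],
    remainder='passthrough',
)

pipe = Pipeline([
    ('features', features),
    ('model', GradientBoostingClassifier(random_state=42)),
])

# Double underscores route each parameter to the right step.
toy_params = {
    'features__text__ngram_range': [(1, 1), (1, 2)],
    'model__n_estimators': [5, 10],
}

toy_gs = GridSearchCV(pipe, toy_params, cv=2)
toy_gs.fit(toy[['selftext', 'sentiment']], toy['subreddit'])

print(toy_gs.best_params_)
```

Because the vectorizer is refit inside each cross-validation fold, the vocabulary never sees the held-out fold. Swapping `CountVectorizer` for `TfidfVectorizer` in the `text` step would be one way to try the last item in the list as well.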

# Step 6: Conclusions & Recommendations

I would recommend the GBoost model because it has the best accuracy on test data. However, all three models have similar performance and MNB is the fastest, so it could be an acceptable choice if computational resources or time are big considerations. 

It would be interesting to investigate the posts that the model classifies incorrectly. Since the goal of this tool is to help disentangle messy diagnoses, we could potentially learn a lot from examples that confused the computer. Perhaps the model predicted incorrectly because the human who wrote the post shows signs of having a different condition or more than one condition. That is precisely the kind of application that could be extremely useful to a clinician. 
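One way to start that investigation is a confusion matrix plus a filter on the rows where the prediction and the true label disagree. The sketch below uses a tiny made-up table; in the notebook, `actual` would come from `y_test` and `predicted` from something like `gb_gs.predict(X_test_all)`. The `review` DataFrame and its contents are hypothetical.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = {0: 'Anxiety', 1: 'ADHD', 2: 'Depression', 3: 'Autism'}

# Made-up example rows standing in for test posts and model predictions.
review = pd.DataFrame({
    'selftext': ['post a', 'post b', 'post c', 'post d', 'post e', 'post f'],
    'actual':    [0, 1, 2, 3, 2, 0],
    'predicted': [0, 1, 0, 3, 2, 2],
})

# Rows are true classes, columns are predicted classes.
cm = pd.DataFrame(
    confusion_matrix(review['actual'], review['predicted'], labels=list(labels)),
    index=[f'actual {name}' for name in labels.values()],
    columns=[f'pred {name}' for name in labels.values()],
)

# Keep only the posts the model got wrong, for manual reading.
misclassified = review[review['actual'] != review['predicted']]

print(cm)
print(misclassified)
```

Off-diagonal cells with large counts would point to the pairs of conditions the model confuses most often, which is where reading individual posts is likely to be most informative.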

I could imagine this tool being available on self-improvement or mental wellness websites, just as you can find screening questionnaires for various conditions all over the internet. People who are curious or struggling could submit a sample journal entry and immediately get a few suggestions of illnesses or conditions to learn more about. The mental healthcare system in the US is generally overburdened, so patient self-advocacy is crucial. This tool could allow a patient to bring up their concerns with their doctor and get the right referrals.

Clinicians and mental health professionals could also use this tool in conjunction with other long-standing tools in a variety of situations. It could be part of their initial evaluation of a new patient. For a patient who has been in treatment for a while and is still struggling, it could help suggest new areas to explore, since they may have been getting treatment for only some of their problems.