# Model Building
#### Project Overview
Build a model that can predict whether or not a rating will be "good" (8 or higher) based on the text of a drug review.

#### Goal
I have already completed the data wrangling, storytelling, and statistical analysis steps. My goal for this final step is to evaluate several models and choose the best-performing one.

## Import

In [22]:
# data manipulation
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score

# data import/export
from scipy.sparse import load_npz
import pickle

In [23]:
# import dataframe
data = pd.read_pickle('drugsCom_data')

# import term matrix
term_matrix = load_npz('ngram_csr.npz')

# convert term matrix to dataframe
term_matrix = pd.DataFrame(term_matrix.todense())

# import column headers (the context manager closes the file after loading)
with open('list.pickle', 'rb') as pickle_in:
    reviews_columns = pickle.load(pickle_in)

# add column headers to term matrix
term_matrix.columns = reviews_columns
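
Converting the term matrix to a dense DataFrame can use a lot of memory when the vocabulary is large. As a rough sketch of an alternative, the scikit-learn estimators used in this notebook can generally be fit on a SciPy CSR matrix directly. The matrix below is a randomly generated stand-in, not the real n-gram matrix, and the names `X_sparse` and `y_demo` are made up for illustration. In this notebook the `rating` column would also have to be split off from the matrix first.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.naive_bayes import MultinomialNB

# hypothetical stand-in for the n-gram matrix: 200 reviews x 1,000 terms, 1% non-zero
rng = np.random.RandomState(0)
X_sparse = sparse_random(200, 1000, density=0.01, format='csr', random_state=rng)
y_demo = rng.randint(0, 2, size=200)

# fit on the CSR matrix directly, without converting it to a dense array
model = MultinomialNB().fit(X_sparse, y_demo)
preds = model.predict(X_sparse)

# compare the memory footprint of the two representations
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
dense_bytes = X_sparse.toarray().nbytes
print(f'sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes')
```

The gap between the two sizes grows with the number of terms, since most n-grams do not appear in any given review.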

## Pre-Processing
To keep the model simple, I will convert the review ratings from a scale of 1 to 10 to binary. 8 and up will be labeled 1 and below 8 will be labeled 0.

In [24]:
# convert rating scale to binary; >= 8 is 1, <8 is 0
replace_values = {1:0, 2:0, 3:0, 4:0, 5:0, 6:0, 7:0, 8:1, 9:1, 10:1}
term_matrix['rating'] = term_matrix.rating.replace(replace_values)
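
The same conversion can also be written as a threshold comparison, which avoids listing every rating by hand. This is a minimal sketch on a small made-up Series, not the notebook's data. It checks that both approaches agree.

```python
import pandas as pd

# illustrative ratings only; the real values live in term_matrix['rating']
ratings = pd.Series([1, 5, 7, 8, 9, 10], name='rating')

# approach used above: explicit mapping of every rating
replace_values = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1, 10: 1}
via_mapping = ratings.replace(replace_values)

# equivalent threshold: True for ratings >= 8, cast to 0/1
via_threshold = (ratings >= 8).astype(int)

print(via_threshold.tolist())  # [0, 0, 0, 1, 1, 1]
```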

In [25]:
# create arrays for independent features and target
X = term_matrix.drop('rating', axis=1).values
y = term_matrix['rating'].values

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

## Dummy Classifier
Using a dummy classifier will provide a baseline for comparing subsequent models.

In [26]:
# fit dummy classifier and score it on its predicted probabilities
dummy_model = DummyClassifier(strategy='most_frequent', random_state=123)
dummy_model.fit(X_train, y_train)
y_pred_proba = dummy_model.predict_proba(X_test)[:, 1]

print('AUC Score:\n', roc_auc_score(y_test, y_pred_proba))

AUC Score:
 0.5
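
An AUC of exactly 0.5 is expected here: a classifier that assigns every sample the same score cannot rank positives above negatives, whatever the class balance. The sketch below illustrates this on synthetic data (the 70% positive rate and the names `X_demo` and `y_demo` are assumptions for the example). It also shows why accuracy would be a less informative baseline on imbalanced classes.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# synthetic, imbalanced labels: roughly 70% positive
rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 5)
y_demo = (rng.rand(1000) < 0.7).astype(int)

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)

# every sample gets the same score, so no ranking is possible
scores = baseline.predict_proba(X_demo)[:, 1]
baseline_auc = roc_auc_score(y_demo, scores)

# accuracy simply equals the share of the majority class
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

print(f'AUC: {baseline_auc}, accuracy: {baseline_acc:.3f}')
```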


## Naive Bayes

Naive Bayes (NB) stays simple and fast by assuming that all features (words, in this case) are conditionally independent given the class. The algorithm often performs well on text classification problems. I expect NB to perform moderately well here, but a more flexible model, such as Random Forest, will probably do better.
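
To make the independence assumption concrete, here is a small sketch with made-up term counts (the four terms and the toy reviews are hypothetical). Because features are treated as independent given the class, the score for each class is just the log prior plus a sum of per-term log probabilities weighted by the term counts. The example recomputes that sum by hand and compares it with the model's own prediction.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# columns: counts of the hypothetical terms ['great', 'relief', 'worse', 'nausea']
X_toy = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 0],
                  [0, 0, 2, 1],
                  [0, 0, 1, 2]])
y_toy = np.array([1, 1, 0, 0])  # 1 = good rating, 0 = not good

nb = MultinomialNB().fit(X_toy, y_toy)

# a new review that mentions 'great' and 'relief' once each
new_review = np.array([[1, 1, 0, 0]])

# per-class score = log prior + sum over terms of count * log P(term | class)
manual_scores = nb.class_log_prior_ + new_review @ nb.feature_log_prob_.T
manual_pred = nb.classes_[np.argmax(manual_scores, axis=1)]

print(manual_pred, nb.predict(new_review))
```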

In [36]:
# instantiate model and score it with 5-fold cross-validation
naive_bayes = MultinomialNB()
cv_results = cross_val_score(naive_bayes, 
                             X_train, y_train, 
                             scoring='roc_auc',
                             cv=5)

print(f'AUC for each fold: {cv_results}')
print(f'Mean AUC: {np.mean(cv_results)}')

AUC for each fold: [0.73202462 0.73090894 0.72994204 0.73379731 0.73711134]
Mean AUC: 0.7327568507938473


A mean AUC of 0.73 is a reasonable start for Naive Bayes with default settings, and well above the 0.5 baseline. I'm also going to try Logistic Regression and Random Forest. I expect Random Forest to outperform the others.

## Logistic Regression

In [37]:
# instantiate model and score it with 5-fold cross-validation
log_reg = LogisticRegression()
cv_results = cross_val_score(log_reg, 
                             X_train, y_train, 
                             scoring='roc_auc',
                             cv=5)

print(f'AUC for each fold: {cv_results}')
print(f'Mean AUC: {np.mean(cv_results)}')

AUC for each fold: [0.7555984  0.75598477 0.75277187 0.75744846 0.76285309]
Mean AUC: 0.756931317872261


Logistic Regression was a small improvement over Naive Bayes, bringing the mean AUC from 0.73 up to 0.76.

## Random Forest

In [38]:
# instantiate model and score it with 5-fold cross-validation
r_forest = RandomForestClassifier(n_jobs=1, random_state=123)
cv_results = cross_val_score(r_forest, 
                             X_train, y_train, 
                             scoring='roc_auc',
                             cv=5)

print(f'AUC for each fold: {cv_results}')
print(f'Mean AUC: {np.mean(cv_results)}')

AUC for each fold: [0.89825974 0.89765881 0.89651376 0.9011878  0.90180346]
Mean AUC: 0.8990847120332537


Random Forest outperformed Naive Bayes and Logistic Regression even with the default hyperparameters, reaching a mean AUC of 0.90. Next, I'll use GridSearchCV to tune `n_estimators` and `max_depth` and see whether performance improves.

### Hyperparameter Tuning

In [39]:
# set params to evaluate
params = {'n_estimators': [150, 200, 250, 300],
          'max_depth': [40, 60, 80, 100]}

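# instantiate and fit GridSearchCV
# note: no scoring argument is passed, so candidates are ranked by the
# classifier's default score (accuracy) rather than ROC AUC
grid_search = GridSearchCV(estimator=r_forest,
                           param_grid=params)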
grid_search.fit(X_train, y_train)

# print best parameters
grid_search.best_params_

{'max_depth': 80, 'n_estimators': 300}
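
Since the models in this notebook are compared by ROC AUC, the grid search could be asked to rank candidates by the same metric. The following is a minimal sketch on synthetic data with a deliberately tiny grid so it runs quickly. The dataset, grid values, and `cv=3` are assumptions for illustration and are not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the review features and binary ratings
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# small illustrative grid; the real search used larger values
small_grid = {'n_estimators': [10, 20],
              'max_depth': [3, 5]}

# scoring='roc_auc' makes the search rank candidates by cross-validated AUC
search = GridSearchCV(estimator=RandomForestClassifier(random_state=123),
                      param_grid=small_grid,
                      scoring='roc_auc',
                      cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

When the best value sits at the edge of the grid, as `n_estimators=300` does above, extending the grid in that direction may be worth trying.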

In [40]:
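# re-run random forest with best parameters
# note: this cross-validates on the full X and y rather than X_train and y_train,
# so the held-out test rows are included in these folds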
r_forest = RandomForestClassifier(n_estimators=300,
                                  max_depth=80,
                                  n_jobs=1, 
                                  random_state=123)
cv_results = cross_val_score(r_forest, 
                             X, y, 
                             scoring='roc_auc',
                             cv=5)

print(f'AUC for each fold: {cv_results}')
print(f'Mean AUC: {np.mean(cv_results)}')

AUC for each fold: [0.92945266 0.92456645 0.92926356 0.92874972 0.92746518]
Mean AUC: 0.9278995153465189
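
A natural last check is to score the chosen model once on the held-out test split, which none of the cross-validation runs above evaluate on its own. The sketch below shows the pattern on synthetic data. The dataset and the smaller `n_estimators=50` are assumptions to keep the example fast, so its numbers say nothing about the drug review model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the review features and binary ratings
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# fit on the training split only
final_model = RandomForestClassifier(n_estimators=50, random_state=123)
final_model.fit(X_tr, y_tr)

# AUC uses predicted probabilities for the positive class
test_scores = final_model.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_scores)

# the confusion matrix uses hard class predictions
cm = confusion_matrix(y_te, final_model.predict(X_te))

print(f'Test AUC: {test_auc:.3f}')
print(cm)
```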
