# Model Building
#### Project Overview
Build a model that can predict whether or not a rating will be "good" (8 or higher) based on the text of a drug review.

#### Goal
I have already completed the data wrangling, storytelling, and statistical analysis steps. My goal for this final step is to evaluate the performance of several models and choose the best-performing one.

## Import

In [179]:
# data manipulation
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score

# data import/export
from scipy.sparse import load_npz
import pickle

In [180]:
# import dataframe
data = pd.read_pickle('drugsCom_data')

# import term matrix
term_matrix = load_npz('ngram_csr.npz')

# convert term matrix to dataframe
term_matrix = pd.DataFrame(term_matrix.todense())

# import column headers (the context manager closes the file after loading)
with open('list.pickle', 'rb') as pickle_in:
    reviews_columns = pickle.load(pickle_in)

# add column headers to term matrix
term_matrix.columns = reviews_columns
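
Converting the term matrix to a dense DataFrame can use a lot of memory when the vocabulary is large. The sketch below shows one possible alternative that keeps the features sparse; the scikit-learn estimators used in this notebook accept SciPy sparse matrices. The toy matrix, the toy column names, and the assumption that `'rating'` is stored as one of the matrix columns are illustrative stand-ins, not the actual data.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# toy stand-ins for the loaded term matrix and its column headers (made-up values)
toy_matrix = csr_matrix(np.array([[1, 0, 2, 9],
                                  [0, 3, 0, 4],
                                  [5, 0, 0, 10]]))
toy_columns = ['pain', 'relief', 'side effect', 'rating']

# locate the rating column by name instead of assuming its position
rating_idx = toy_columns.index('rating')
feature_idx = [i for i in range(len(toy_columns)) if i != rating_idx]

# pull the ratings out as a small dense vector; leave the features sparse
y_toy = toy_matrix[:, rating_idx].toarray().ravel()
X_toy = toy_matrix[:, feature_idx]

print(y_toy, X_toy.shape, issparse(X_toy))
```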

## Pre-Processing
To keep the model simple, I will convert the review ratings from a 1-to-10 scale to a binary label: ratings of 8 and up will be labeled 1, and ratings below 8 will be labeled 0.

In [181]:
# convert rating scale to binary; >= 8 is 1, <8 is 0
replace_values = {1:0, 2:0, 3:0, 4:0, 5:0, 6:0, 7:0, 8:1, 9:1, 10:1}
term_matrix['rating'] = term_matrix.rating.replace(replace_values)
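
The same mapping can also be written as a single comparison, which avoids listing every rating by hand. This is a minimal sketch on a made-up Series; `toy_ratings` is a hypothetical name, not a column from the dataset.

```python
import pandas as pd

# made-up ratings on the 1-10 scale
toy_ratings = pd.Series([1, 7, 8, 10, 5, 9])

# True where the rating is 8 or higher, then cast to 0/1
toy_binary = (toy_ratings >= 8).astype(int)

print(toy_binary.tolist())  # [0, 0, 1, 1, 0, 1]
```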

In [182]:
# create arrays for response variable and features
y = term_matrix['rating'].values
X = term_matrix.drop('rating', axis=1).values

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

## Dummy Classifier
Using a dummy classifier will provide a baseline for comparing subsequent models.

In [183]:
dummy_model = DummyClassifier(strategy='most_frequent', random_state=9)
dummy_model.fit(X_train, y_train)
y_pred = dummy_model.predict(X_test)
print('AUC Score:\n', roc_auc_score(y_test, y_pred))
print('\n\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('\n\nClassification Report:\n', classification_report(y_test, y_pred))

AUC Score:
 0.5


Confusion Matrix:
 [[    0 12751]
 [    0 19149]]


Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00     12751
           1       0.60      1.00      0.75     19149

    accuracy                           0.60     31900
   macro avg       0.30      0.50      0.38     31900
weighted avg       0.36      0.60      0.45     31900
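
The baseline predicts the majority class (1) for every review, so class 0 is never predicted and its precision and recall are reported as 0. The sketch below reproduces that behavior on made-up labels with the same 60/40 split. Passing `zero_division=0` to `classification_report` (available in scikit-learn 0.22 and later) sets the undefined precision to 0 without emitting a warning. The toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

# made-up data: 6 positive and 4 negative labels; features are ignored by the dummy model
X_toy = np.zeros((10, 1))
y_toy = np.array([1] * 6 + [0] * 4)

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
toy_pred = baseline.predict(X_toy)

# accuracy equals the share of the majority class
baseline_accuracy = accuracy_score(y_toy, toy_pred)
print(baseline_accuracy)

# zero_division=0 reports undefined precision as 0 without a warning
print(classification_report(y_toy, toy_pred, zero_division=0))
```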





## Naive Bayes

Naive Bayes (NB) maintains simplicity and speed by assuming all features (words in this case) are conditionally independent given the class. The algorithm often performs well, particularly on text classification problems. I think NB will perform moderately well in this case, but a more robust model, such as Random Forest, will probably be best.
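
For reference, the independence assumption lets the multinomial variant score each class with a simple sum. The notation is introduced here for illustration and does not appear elsewhere in this notebook: $x_i$ is the count of term $w_i$ in a review, $V$ is the vocabulary size, and $y$ is the binary rating label.

$$
\hat{y} = \arg\max_{y \in \{0, 1\}} \left[ \log P(y) + \sum_{i=1}^{V} x_i \log P(w_i \mid y) \right]
$$

Each term contributes independently to the score, which is why training and prediction stay fast even with a large vocabulary.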

In [193]:
# instantiate model, fit, and predict
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)
y_pred = naive_bayes.predict(X_test)

# print AUC score
print('AUC Score:\n', roc_auc_score(y_test, y_pred))

# print confusion matrix
print('\n\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))

# print classification report
print('\n\nClassification Report:\n', classification_report(y_test, y_pred))

AUC Score:
 0.6534088172302402


Confusion Matrix:
 [[ 6376  6375]
 [ 3700 15449]]


Classification Report:
               precision    recall  f1-score   support

           0       0.63      0.50      0.56     12751
           1       0.71      0.81      0.75     19149

    accuracy                           0.68     31900
   macro avg       0.67      0.65      0.66     31900
weighted avg       0.68      0.68      0.68     31900



The AUC score of 0.65 is a clear improvement over the 0.5 baseline for a first run. Again, I think NB is a great option when speed and simplicity are top priority, but I'd like to see a higher AUC score.
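
One thing worth keeping in mind when reading these scores: `roc_auc_score` is given hard class labels (`y_pred`) rather than predicted probabilities, so the value reduces to the mean of the true-positive rate and the true-negative rate (balanced accuracy). The check below recomputes the Naive Bayes score from the confusion matrix printed above; `auc_from_confusion` is a hypothetical helper name used only for this sketch.

```python
def auc_from_confusion(tn, fp, fn, tp):
    """Hypothetical helper: ROC AUC when only hard 0/1 predictions are scored."""
    true_negative_rate = tn / (tn + fp)
    true_positive_rate = tp / (tp + fn)
    return (true_positive_rate + true_negative_rate) / 2

# counts copied from the Naive Bayes confusion matrix above
nb_auc = auc_from_confusion(tn=6376, fp=6375, fn=3700, tp=15449)
print(round(nb_auc, 4))  # 0.6534
```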

Let's try Logistic Regression next.

## Logistic Regression

In [192]:
# instantiate model, fit, and predict
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# print AUC score
print(f'AUC Score:\n {roc_auc_score(y_test, y_pred)}')

# print confusion matrix
print('\n\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))

# print classification report
print('\n\n Classification Report:\n', classification_report(y_test, y_pred))

AUC Score:
 0.6683542915103206


Confusion Matrix:
 [[ 6309  6442]
 [ 3027 16122]]


 Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.49      0.57     12751
           1       0.71      0.84      0.77     19149

    accuracy                           0.70     31900
   macro avg       0.70      0.67      0.67     31900
weighted avg       0.70      0.70      0.69     31900



Logistic Regression was a small improvement over Naive Bayes (AUC of 0.67 versus 0.65).
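
A threshold-independent AUC can be obtained by scoring the predicted probability of the positive class instead of the hard labels. The sketch below illustrates the difference on synthetic data from `make_classification`; the dataset, the `demo_` names, and `max_iter=1000` are assumptions chosen for the example and are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the review features and binary ratings
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions (what the cells in this notebook report)
auc_labels = roc_auc_score(y_te, demo_model.predict(X_te))

# AUC from the predicted probability of class 1
auc_scores = roc_auc_score(y_te, demo_model.predict_proba(X_te)[:, 1])

print(f'labels: {auc_labels:.3f}  probabilities: {auc_scores:.3f}')
```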

I believe Random Forest will outperform both of these.

## Random Forest

In [194]:
# instantiate model, fit, and predict
r_forest = RandomForestClassifier(n_jobs=1, random_state=123)
r_forest.fit(X_train, y_train)
y_pred = r_forest.predict(X_test)

# print AUC score
print(f'AUC score: {roc_auc_score(y_test, y_pred)}')

# print confusion matrix
print('\n\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))

# print classification report
print('\n\nClassification Report:\n', classification_report(y_test, y_pred))

AUC score: 0.8256836449100752


Confusion Matrix:
 [[ 9369  3382]
 [ 1597 17552]]


Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.73      0.79     12751
           1       0.84      0.92      0.88     19149

    accuracy                           0.84     31900
   macro avg       0.85      0.83      0.83     31900
weighted avg       0.84      0.84      0.84     31900



### Hyperparameter Tuning

In [195]:
# set params to evaluate
params = {'n_estimators': [150, 200, 250],
          'max_depth': [40, 60, 80]}

# instantiate and fit GridSearchCV
grid_search = GridSearchCV(estimator=r_forest,
                           param_grid=params)
grid_search.fit(X_train, y_train)

# print best parameters
grid_search.best_params_

{'max_depth': 80, 'n_estimators': 200}
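
By default, `GridSearchCV` ranks parameter combinations for a classifier by accuracy. Since the models here are compared by AUC, the search could instead be scored with `scoring='roc_auc'`, and the refit model taken directly from `best_estimator_`. The sketch below shows this on a small synthetic dataset; the data, the reduced grid, and `cv=3` are assumptions made to keep the example fast, not the values used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# small synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=123)

# deliberately tiny grid so the sketch runs quickly
demo_params = {'n_estimators': [10, 20],
               'max_depth': [3, 5]}

demo_search = GridSearchCV(estimator=RandomForestClassifier(random_state=123),
                           param_grid=demo_params,
                           scoring='roc_auc',  # rank candidates by AUC instead of accuracy
                           cv=3)
demo_search.fit(X_demo, y_demo)

# the best model is already refit on all of the data passed to fit()
tuned_forest = demo_search.best_estimator_
best = demo_search.best_params_
print(best)
```

Using `best_estimator_` (or unpacking `best_params_` into the constructor) avoids retyping the chosen values by hand.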

In [196]:
# instantiate model with tuned params
# note: these values differ from the grid search result above (max_depth=80, n_estimators=200)
r_forest = RandomForestClassifier(max_depth=90,
                                  n_estimators=140, 
                                  n_jobs=1, 
                                  random_state=123)
# fit and predict
r_forest.fit(X_train, y_train)
y_pred = r_forest.predict(X_test)

# print AUC score
print(f'AUC score: {roc_auc_score(y_test, y_pred)}')

# print confusion matrix
print('\n\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))

# print classification report
print('\n\nClassification Report:\n', classification_report(y_test, y_pred))

AUC score: 0.8256552301527968


Confusion Matrix:
 [[ 9319  3432]
 [ 1523 17626]]


Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.73      0.79     12751
           1       0.84      0.92      0.88     19149

    accuracy                           0.84     31900
   macro avg       0.85      0.83      0.83     31900
weighted avg       0.85      0.84      0.84     31900

