# prediction models

## high-level questions

<div class='alert alert-danger'>
<li>For a classification problem, how do I decide which models to try? So far I am only trying the ones I know.
<li>All I do here is try different models and different features. Is this the right method?
<li>Am I using the proper evaluation criteria?
</div>

## imports

In [137]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [138]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

## prep

In [139]:
appointments = pd.read_csv("data/appointments_2.csv",
                          parse_dates=['scheduled_day','appointment_day'],
                          )
appointments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71735 entries, 0 to 71734
Data columns (total 20 columns):
Unnamed: 0             71735 non-null int64
appointment_id         71735 non-null int64
patient_id             71735 non-null float64
scheduled_day          71735 non-null datetime64[ns]
appointment_day        71735 non-null datetime64[ns]
age                    71735 non-null int64
neighborhood           71735 non-null object
scholarship            71735 non-null int64
hypertension           71735 non-null int64
diabetes               71735 non-null int64
alcoholism             71735 non-null int64
handicap               71735 non-null int64
sms_received           71735 non-null int64
no_show                71735 non-null int64
male                   71735 non-null int64
appointment_count      71735 non-null int64
days_to_appointment    71735 non-null int64
scheduled_weekday      71735 non-null int64
appointment_weekday    71735 non-null int64
handicap_binary        71735 non-n

### get dummies

In [None]:
# get a list of features before creating dummies
# this makes it easy to add and remove features
features_initial = set(appointments.columns)

In [141]:
# features that cannot be used as predictors in any model, e.g. ids and timestamps;
# no_show is the target, so it is excluded as well
useless = {'Unnamed: 0', 'appointment_id', 'patient_id', 'scheduled_day',
           'appointment_day', 'scheduled_weekday', 'no_show'}

In [142]:
# get dummies for categorical data: appointment day, neighborhood
# only do this for categorical features we know have statistical significance
appointments_dummies = pd.get_dummies(appointments, columns=['appointment_weekday','neighborhood'],drop_first=True)

<div class='alert alert-danger'>
<li>Is there a minimum category count above which I do not need to drop a column when creating dummies?
<li>For example, do I need to drop a column for neighborhood to avoid multicollinearity?
<li>Is it a problem that I now have 80 columns for neighborhood? Will this drown out the other columns' significance?
</div>
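
A minimal sketch of what `drop_first=True` changes, using a made-up three-value `neighborhood` column (the values are hypothetical, not taken from the dataset). When every category keeps its own dummy, the dummies in each row sum to exactly 1, which duplicates the intercept; that redundancy is there whether the feature has 3 categories or 80.

```python
import pandas as pd

# Hypothetical toy data; the neighborhood names are made up for illustration.
toy = pd.DataFrame({"neighborhood": ["A", "B", "C", "A", "B"]})

full = pd.get_dummies(toy, columns=["neighborhood"])
reduced = pd.get_dummies(toy, columns=["neighborhood"], drop_first=True)

# With every category kept, each row's dummies sum to exactly 1,
# so the columns are perfectly collinear with an intercept term.
row_sums = full.sum(axis=1)

print(list(full.columns))
print(list(reduced.columns))
print(row_sums.tolist())
```

Dropping one column makes the dropped category the reference level that the remaining dummies are compared against.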

In [143]:
features_all = set(appointments_dummies.columns) - useless
features_dummies = features_all - features_initial

# logistic regression

## without derived data

this first model is simply to establish a baseline. How does a model that only uses predictors from the initial dataset perform?

In [144]:
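# split the data for training and testing
# features_all was built from appointments_dummies, so remove the dummy columns
# before indexing the original frame (they do not exist in appointments)
columns = features_all - features_dummies - {
    'days_to_appointment',
    'appointment_weekday',
    'handicap_binary'
}

X = appointments[list(columns)]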
y = appointments['no_show']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# train the model
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [145]:
predictions = logmodel.predict(X_test)

print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.71      1.00      0.83     10256
          1       0.00      0.00      0.00      4091

avg / total       0.51      0.71      0.60     14347
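
The report above shows the model predicting class 0 for essentially every test row. As a sanity check, here is a sketch that rebuilds labels with the same supports (10256 and 4091) and scores an always-predict-0 baseline with scikit-learn's `DummyClassifier`. The arrays are synthetic stand-ins, not the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels matching the supports in the report: 10256 shows, 4091 no-shows.
y_true = np.array([0] * 10256 + [1] * 4091)
X_placeholder = np.zeros((len(y_true), 1))  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)
y_pred = baseline.predict(X_placeholder)

accuracy = accuracy_score(y_true, y_pred)
recall_no_show = recall_score(y_true, y_pred, pos_label=1)

print(round(accuracy, 2), recall_no_show)
```

An accuracy of about 0.71 is therefore what any model gets for free on this split. The number to watch is the recall on class 1.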





## without dummies

This model lets me toggle individual categorical predictors without managing all of the dummy columns. I assume that toggling any one of these will have more impact than toggling an individual dummy.

In [146]:
# split the data for training and testing

predictors_2 = {
    'days_to_appointment', 
    'scholarship', 
#     'hypertension', 
#     'diabetes', 
#     'sms_received',
    'age',
#     'appointment_weekday', 
#     'handicap_binary'
}
    

X = appointments[list(predictors_2)]
y = appointments['no_show']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# train the model
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [147]:
predictions = logmodel.predict(X_test)

print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.71      1.00      0.83     10256
          1       0.37      0.00      0.00      4091

avg / total       0.62      0.71      0.60     14347



## with dummies

This model keeps the predictor choices made in the previous regression model (certain predictors turned off) and adds all of the dummies.

In [148]:
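# `predictors` is never defined in this notebook; features_all (every usable
# column of appointments_dummies) is assumed to be the intended starting set
predictors_3 = features_all - {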
    'hypertension', 
    'diabetes', 
    'sms_received',
    'appointment_weekday', 
    'handicap_binary',
    'male',
    'handicap'
}

X = appointments_dummies[list(predictors_3)]
y = appointments_dummies['no_show']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# train the model
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [149]:
predictions = logmodel.predict(X_test)

print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.72      1.00      0.83     10256
          1       0.43      0.01      0.01      4091

avg / total       0.63      0.71      0.60     14347



<div class='alert alert-info'>
How do I toggle the dummy predictors given that there are so many?
<li>The easiest way is to do this manually.
</div>
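
One way to switch a whole group of dummies on or off in a single step is to select columns by prefix. This is a sketch on a toy frame: `dummy_columns` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def dummy_columns(frame, prefix):
    """Return the set of dummy column names that were created from `prefix`."""
    return {col for col in frame.columns if col.startswith(prefix + "_")}

# Hypothetical toy frame standing in for appointments_dummies.
toy = pd.get_dummies(
    pd.DataFrame({
        "age": [30, 45, 62],
        "neighborhood": ["A", "B", "C"],
        "appointment_weekday": [0, 1, 2],
    }),
    columns=["appointment_weekday", "neighborhood"],
    drop_first=True,
)

all_features = set(toy.columns)
# Toggle every neighborhood dummy off in one step.
without_neighborhood = all_features - dummy_columns(toy, "neighborhood")

print(sorted(without_neighborhood))
```

The prefix has to be the full original column name: a shorter prefix such as `appointment` would also match unrelated columns that happen to start with the same word.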

# random forest 

In [153]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix

In [161]:
# list of all features with statistically significant distinctions between categories
predictors_4 = {
    'days_to_appointment', 
    'scholarship', 
    'hypertension', 
    'diabetes', 
    'sms_received',
    'age',
    'appointment_weekday', 
    'handicap_binary'
}
    

## without dummies

<div class='alert alert-info'>
Is it appropriate to assume that the optimal predictors for logistic regression are also optimal for random forests?
<li>No. Outside of the features that have strong correlations with the response, different models use features differently.
</div>
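
To see how differently two models can weigh the same predictors, one option is to put logistic regression coefficients next to a random forest's `feature_importances_`. This is a sketch on synthetic data from `make_classification`, not the appointments data, and the feature names `f0`..`f4` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; names f0..f4 are placeholders.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
names = [f"f{i}" for i in range(X.shape[1])]

logmodel = LogisticRegression(max_iter=1000).fit(X, y)
rfc = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Absolute coefficient size vs. impurity-based importance, side by side.
comparison = pd.DataFrame({
    "logit_abs_coef": abs(logmodel.coef_[0]),
    "forest_importance": rfc.feature_importances_,
}, index=names)

print(comparison.sort_values("forest_importance", ascending=False))
```

Coefficient magnitudes depend on the scale of each feature, so the two columns are only a rough comparison of rankings, not of values.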

In [162]:
rfc = RandomForestClassifier(n_estimators=600)

X = appointments[list(predictors_4)]
y = appointments['no_show']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

rfc.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=600, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [163]:
# make predictions on test data based on fitted model
predictions = rfc.predict(X_test)

In [164]:
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.73      0.84      0.78     10256
          1       0.36      0.22      0.27      4091

avg / total       0.62      0.66      0.64     14347



In [165]:
print(confusion_matrix(y_test,predictions))

[[8612 1644]
 [3180  911]]


## with optimal predictor list based on logistic regression

In [166]:
rfc = RandomForestClassifier(n_estimators=600)

X = appointments_dummies[list(predictors_3)]
y = appointments_dummies['no_show']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

rfc.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=600, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [167]:
# make predictions on test data based on fitted model
predictions = rfc.predict(X_test)

In [168]:
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.74      0.88      0.81     10256
          1       0.44      0.23      0.30      4091

avg / total       0.65      0.70      0.66     14347



In [169]:
print(confusion_matrix(y_test,predictions))

[[9044 1212]
 [3150  941]]


<div class='alert alert-info'>
Would it make my model stronger if I got rid of the neighborhood columns that have a very low count?
<li>No. It is not worth the effort to compress these features.
</div>

# svm

In [170]:
from sklearn.svm import SVC

In [None]:
# split the data
X = appointments_dummies[list(predictors_3)]
y = appointments_dummies['no_show']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# train the model
svc_model = SVC()
svc_model.fit(X_train,y_train)

In [None]:
predictions = svc_model.predict(X_test)

In [None]:
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))

---

## a note on evaluation methods
source: [sklearn docs](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html)

### precision
The precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Intuitively, precision is the ability of the classifier not to label a negative sample as positive. In layman's terms, it measures how rarely your model claims that a bad burger is actually a tasty burger. The fewer false positives, the higher the precision.
* the real question is why this is called precision. it should be called: **avoiding false positives**

### recall
The recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. Intuitively, recall is the ability of the classifier to find all the positive samples. In layman's terms, it measures how rarely your model throws away tasty burgers. The fewer false negatives, the higher the recall.
* this should be called: **avoiding false negatives**
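
As a worked check, both ratios can be recomputed by hand from the first random forest confusion matrix in this notebook (`[[8612 1644] [3180 911]]`), treating `no_show == 1` as the positive class. scikit-learn lays the matrix out with rows as the true class and columns as the predicted class.

```python
import numpy as np

# Confusion matrix from the first random forest run (rows: true, columns: predicted).
cm = np.array([[8612, 1644],
               [3180,  911]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted no-shows, how many were real
recall = tp / (tp + fn)      # of the real no-shows, how many were found

print(round(precision, 2), round(recall, 2))
```

The rounded values match the class 1 row of that run's classification report (0.36 and 0.22).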

### f-beta
The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
* we use the harmonic mean here because precision and recall are both fractions; it stays low unless both of them are high
* The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important; beta > 1 favors recall and beta < 1 favors precision.
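
The definition can be written out directly. This is a sketch assuming the standard formula `(1 + beta**2) * P * R / (beta**2 * P + R)`, checked against `sklearn.metrics.fbeta_score` on a small made-up label vector. `f_beta` is a hypothetical helper name.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)   # 2 true positives out of 3 predicted positives
r = recall_score(y_true, y_pred)      # 2 true positives out of 4 actual positives

manual = f_beta(p, r, beta=2.0)
library = fbeta_score(y_true, y_pred, beta=2.0)

print(manual, library)
```

With beta = 2 the score (10/19) sits closer to the recall of 0.5 than to the precision of 2/3, which is the "weights recall more" behavior described above.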

### support
The support is the number of occurrences of each class in y_true.