# CAPSTONE PROJECT - PART 4
__Michael Gat__  
__General Assembly Santa Monica, Data Science Immersive, Summer 2016__

In this notebook, we'll build upon __Part_3__ by adding oversampling as suggested in [Predictive risk modelling for early hospital readmission of patients with diabetes in India](http://link.springer.com/article/10.1007/s13410-016-0511-8).

In [1]:
# IMPORT LIBRARIES ############################################################

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

## READ IN DATA
We have a clean dataset from __Part 1__. We'll import that, then select only the columns we want to deal with at this time. This code is similar to __Part_2__, please review that notebook for additional details/comments.

In [2]:
df = pd.read_csv('diabetic_data_clean.csv')

In [3]:
df_model_1 = df.ix[:,0:15]
df_model_1 = df_model_1.join(df.ix[:,['change', 'diabetesMed', 'age_group', 'readmit']])

## BUILD FEATURE SELECTION
Still working with the recommended Chi2, test for different numbers of features, using code adapted from __Part2__.

In [4]:
X = df_model_1.ix[:,0:17]
y = df_model_1.ix[:,18]

X_std = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.75)

### Undersampling
Use the random under sampler to train the model on data that is evenly balanced between the two possible target results. Given the healthy size of the training set, undersampling the negatives should yield a sufficient sample for training and is preferable in this case to attempting an oversample of the positive results.

The idea behind undersampling is relatively simple, but implementations can be complex depending on the circumstances and the desired result. It is difficult to train a categorical model when the datasets is significantly out of balance, as this dataset is. As established in __Part 1A__, approximately 90% of the records in the set are "no readmit" (0 in the target variable). Most models will not handle very sparse positive results very well, as they will tend to discount the possibility of positives occuring. Much better results can be achieved when the training data is artificially balanced even when the test data is not! This avoids biasing the model towards a particular mix of results and forces it to focus exclusively on the predictive variables.

Oversampling can be used in smaller datasets when it is not possible to reduce the prevalance of one target variable while maintaining a statistically-significant set of training data. In oversampling, artificial additional datapoints are generated to "fill out" the under-represented target variable.

As the dataset for this project is fairly large, this is not a significant concern, we can select a random subset (approximately 10%) from the "no readmit" category to maintain numerical balance with the "readmit" category. The imbalanced-learn package does this for us using the "random under sampler." Other over and under sampling methods are available as well, but there was inadequate time to explore other options.

Examples and documentation of the imbalanced-learn methods are available at [http://contrib.scikit-learn.org/imbalanced-learn/auto_examples/index.html#example-using-under-sampling-class-methods](http://contrib.scikit-learn.org/imbalanced-learn/auto_examples/index.html#example-using-under-sampling-class-methods)

Key results from a test run are in __Capstone_Results.xlsx__

In [5]:
# Use random under sampler module to reduce the number of "no readmit" records influencing the learning
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_sample(X, y)

In [6]:
classifiers = [DecisionTreeClassifier(max_depth=7), \
RandomForestClassifier(max_depth=7, n_estimators=10, max_features=1), \
GaussianNB(), LogisticRegression()]

for numtest in range (3,10):
    ch2 = SelectKBest(chi2, k=numtest)
    X_train_fit = ch2.fit_transform(X_resampled, y_resampled)
    print 'SELECTED FEATURES:' + str(numtest)
    col_indices = ch2.get_support(indices=True)
    for i in col_indices:
        print X_train.columns.values[i]
    print ch2.scores_
    print
    
    X_test_xform = ch2.transform(X_test)
    
    for clf in classifiers:
        clf.fit(X_train_fit, y_resampled)
        score = clf.score(X_test_xform, y_test)
        y_pred = clf.predict(X_test_xform)
        y_pred = clf.predict(X_test_xform)
        cm = confusion_matrix(y_test, y_pred)
        fpr, tpr, thresholds = roc_curve(y_test, y_pred)
        print clf
        print score
        print cm
        print "AUC Metrics:"
        print auc(fpr, tpr)
        print

SELECTED FEATURES:3
discharge_disposition_id
number_emergency
number_inpatient
[  6.63668475e-01   9.80655047e-02   2.86383524e+00   1.05855045e+03
   1.21419430e+01   1.99441011e+02   2.76587759e+02   2.50889156e+01
   3.49720813e+02   5.92906597e+01   7.69714668e+02   2.58585905e+03
   6.92053231e+01   3.08454810e+00   6.61144219e+00   1.52367438e+01
   9.62466307e+00]

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
0.660661644284
[[45673 22175]
 [ 3725  4752]]
AUC Metrics:
0.616871082579

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=7, max_features=1, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_sco