## Classification Modeling (supervised)

If you are reading this, then you are interested in supervised classification modeling. First, let's make sure you actually have a supervised classification problem.

Consider the following YES/NO questions:
1. Do you have labels for your target variable?
2. Is your target variable categorical?

    ex: (yes,no) | (0,1) ==> (binary classification)
    
    ex: (yes,no,maybe) | (freshman,sophomore,junior) | (0,1,2) ==> (ordinal classification)

    ex: (red,blue,green,yellow) | (english,chinese,spanish,french) ==> (multiclass classification)

If you answered YES to both questions 1 and 2, then you have a supervised classification problem. You may continue on! Enjoy this modeling recipe! :)

If you answered NO to question 1, then you have an unsupervised problem.

If you answered NO to question 2, then you most likely have a regression problem.
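
The two questions above can be roughly sketched in code. This is only a heuristic, not a rule: `describe_problem` and its `max_classes` cutoff are hypothetical names I made up for illustration, and the cutoff of 20 distinct values is an assumption you should adjust for your own data.

```python
import pandas as pd

def describe_problem(target, max_classes=20):
    """Rough guess at the problem type from a target column (hypothetical helper)."""
    # Question 1: no labels at all means there is nothing to supervise on.
    if target is None:
        return "unsupervised"
    target = pd.Series(target).dropna()
    n_classes = target.nunique()
    is_numeric = pd.api.types.is_numeric_dtype(target)
    is_bool = pd.api.types.is_bool_dtype(target)
    # Question 2: a numeric target with many distinct values is probably continuous.
    if is_numeric and not is_bool and n_classes > max_classes:
        return "regression"
    if n_classes == 2:
        return "binary classification"
    return "multiclass classification"

print(describe_problem(["yes", "no", "yes"]))
print(describe_problem([0, 1, 2, 1, 0]))
print(describe_problem([0.13 * i for i in range(100)]))
print(describe_problem(None))
```

Note that the sketch cannot tell ordinal targets apart from unordered ones; that distinction depends on what the labels mean, not on their dtype.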

## IMPORTS

In [None]:
# MODELS
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
import xgboost

# OTHER GOODIES & METRICS
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# their contents now live in sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, precision_recall_curve
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score, auc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## LOAD DATA

In [None]:
filename = 'insert file name'
df = pd.read_csv(filename) # or read_json, read_sql, read_pickle, read_html, etc
df.head(5)                 # see head of data frame

## SPLIT INTO TRAIN & TEST
- Shuffle the data with random_state=42 so the split is reproducible
- Train with 80% of data
- Test with 20% of data

In [None]:
X = [] # features, should be matrix (or a vector if only 1 feature)
y = [] # target, should be vector
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)
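
The `X = []` and `y = []` placeholders above need to be filled from your data frame. Here is a minimal sketch of one way to do it, assuming the target lives in a single column. The toy frame and its column names (`feature_a`, `feature_b`, `label`) are made up for illustration, and `stratify=` is an optional extra that is not part of the recipe above; it keeps the class proportions the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `df`; the column names are hypothetical.
toy = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": [v * 0.5 for v in range(10)],
    "label": [0, 1] * 5,
})

X_toy = toy.drop(columns=["label"])  # feature matrix: everything except the target
y_toy = toy["label"]                 # target vector

# Same 80/20 split and random_state as above, plus optional stratification.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```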

## MACHINE LEARNING ALGORITHMS
- I personally like to place any and all ML algorithms that I might want to implement in production into a dictionary
- Keep in mind the size of your data and your application; these will dictate which algorithms you actually end up using.
- I listed some of my favorite ML algorithms below. Feel free to check out sklearn for more! :D

In [None]:
models = {}
models['XG_Boost'] = xgboost.XGBClassifier() # optimized gradient boosting implementation, usually faster than sklearn's
models['KNeighbors'] = KNeighborsClassifier()
models['Gaussian_Naive_Bayes'] = GaussianNB()
models['Random_Forest'] = RandomForestClassifier()
models['Logistic_Regression'] = LogisticRegression()
models['Gradient_Boosting'] = GradientBoostingClassifier()
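
`cross_val_score` is imported above, so here is a hedged sketch of how it could be used to get a quick first comparison of a model dictionary before looking at any curves. The synthetic data from `make_classification` only stands in for your training set, `demo_models` is a smaller hypothetical dictionary so the example runs without xgboost, and 5 folds is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data stands in for X_train / y_train here.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# A small dictionary built the same way as `models` above.
demo_models = {
    "Logistic_Regression": LogisticRegression(max_iter=1000),
    "Gaussian_Naive_Bayes": GaussianNB(),
}

cv_scores = {}
for name, model in demo_models.items():
    # 5-fold cross-validated accuracy; each fold refits a fresh clone of the model
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print("{}: {:.3f} +/- {:.3f}".format(name, scores.mean(), scores.std()))
```

Keeping the models in a dictionary is what makes this loop so short: adding another candidate is one extra line.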

## TRAIN/SCORE/EVALUATE MODELS

- Once I train my classification models, I personally like to view the PR and ROC curves to roughly evaluate my model(s)
- Another one of my personal favorite metrics is the confusion matrix. It's great because you can simultaneously view and quantify how well your model is doing. The larger the counts on the diagonal (from top left to bottom right), the better, since those are the correct predictions. The off-diagonal counts are the mistakes: for binary classification in sklearn's layout (rows are true labels, columns are predicted labels), the upper-right cell counts false positives and the lower-left cell counts false negatives.
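
To make the confusion matrix concrete, here is a tiny worked example with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not from any real model). In sklearn's binary layout, rows are true labels and columns are predicted labels, so `ravel()` returns the cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for 8 samples
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()  # row-major order: [[TN, FP], [FN, TP]]

print(cm)
print("TN={} FP={} FN={} TP={}".format(tn, fp, fn, tp))
# → TN=3 FP=1 FN=1 TP=3
```

Six of the eight predictions sit on the diagonal (3 true negatives and 3 true positives), and the two off-diagonal cells hold the one false positive and the one false negative.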

In [None]:
# One row of plots per model: PR curve on the left, ROC curve on the right
fig, axs = plt.subplots(nrows=len(models), ncols=2, squeeze=False)
fig.set_figwidth(15)
fig.set_figheight(5 * len(models))  # 5 inches per model row

for i, (name, model) in enumerate(models.items()):
    results = model.fit(X_train, y_train)  # fit that model
    y_pred = results.predict(X_test)
    y_pred_probs = results.predict_proba(X_test)
    score = accuracy_score(y_test, y_pred)

    # Column 1 of predict_proba is the positive-class probability (assumes binary 0/1 labels)
    precision, recall, threshold_PR = precision_recall_curve(y_test, y_pred_probs[:, 1], pos_label=1)
    fpr, tpr, threshold_ROC = roc_curve(y_test, y_pred_probs[:, 1], pos_label=1)
    AUC = auc(fpr, tpr)

    print('############################## MODEL:{} ##############################'.format(name))
    print('Accuracy: {}'.format(score))
    print('AUC: {}'.format(AUC))
    report = classification_report(y_test, y_pred)
    con_mat = confusion_matrix(y_test, y_pred)
    print(report)
    print(con_mat)

    # Left column: precision-recall curve
    ax = axs[i, 0]
    ax.plot(recall, precision)
    ax.set_title('Precision-Recall ({})'.format(name))
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.grid(True)

    # Right column: ROC curve
    ax = axs[i, 1]
    ax.plot(fpr, tpr)
    ax.set_title('ROC Curve ({})'.format(name))
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.grid(True)

fig.tight_layout()