# Cross Validation (Classification)

## Binary classification

What is a classification problem?
    Use features (X) to build models that predict a categorical outcome/target (y).
    
What is a binary classification problem?  Classification, where y is binary.
    Example: predicting acceptance to UCLA.
    
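Binary classification is related to information retrieval, where each document is either relevant or not relevant to a query.

To validate the performance of a classifier, we use two measures borrowed from that field: precision and recall.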

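What are precision and recall?

Notation: TP = true positives, FP = false positives, FN = false negatives, P = number of positive predictions (P = TP + FP).

Precision = TP / P.   This is the probability that a positive prediction is correct.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN).  This is the probability that an actual positive data point is predicted as positive.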







Example:

400 people are suspected to be infected with a virus. Through bloodwork, a new medical procedure predicted that 320 of these people were infected.

In reality, only 280 of these 320 were actually infected.  Further, out of the 80 people who were thought not to be infected, 30 of them were infected.  The procedure failed to detect them.

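P = 320 (positive predictions)
TP = 280 (predicted infected and actually infected)
FP = 40 (predicted infected but not infected: 320 - 280)
FN = 30 (infected but not detected)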

Precision = 280 / 320 = 0.875

Recall = 280 / (280 + 30) ≈ 0.903
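
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and the counts are the ones given above.

```python
# Counts taken from the virus example above.
predicted_positive = 320        # P: people the procedure flagged as infected
tp = 280                        # flagged and actually infected
fp = predicted_positive - tp    # flagged but not infected -> 40
fn = 30                         # infected but not flagged

precision = tp / (tp + fp)      # 280 / 320
recall = tp / (tp + fn)         # 280 / 310

print(f"precision = {precision:.3f}")  # precision = 0.875
print(f"recall    = {recall:.3f}")     # recall    = 0.903
```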



## Multi-class classification

y is not binary; it takes more than 2 values.  Example: the iris dataset, where y is one of three species.

To validate, we compute precision/recall for each class in turn and then average across all classes.  For example, with respect to setosa, all setosa data points are considered "positives"; the others are "negatives".
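
As a rough illustration of this one-vs-rest idea, the sketch below computes per-class precision/recall in plain Python and then takes an unweighted (macro) average. The labels are made up for illustration, the helper name `one_vs_rest_scores` is hypothetical, and a support-weighted average is another common choice.

```python
def one_vs_rest_scores(y_true, y_pred, positive):
    """Precision and recall when `positive` is the positive class and every other class is negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels, made up for illustration (not real model output).
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

classes = sorted(set(y_true))
per_class = {c: one_vs_rest_scores(y_true, y_pred, c) for c in classes}

# Macro average: every class counts equally, regardless of how many points it has.
macro_precision = sum(p for p, _ in per_class.values()) / len(classes)
macro_recall = sum(r for _, r in per_class.values()) / len(classes)

for c in classes:
    print(c, per_class[c])
print("macro:", round(macro_precision, 3), round(macro_recall, 3))
```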

Per-class precision/recall is also valuable for binary classification.

Example: admission.csv dataset.  We have two classes.  For overall precision/recall, class 1 (admitted) is positive and class 0 is negative.

If we consider per-class precision/recall, then for class 1, 1 is positive and 0 is negative; and for class 0, 0 is positive and 1 is negative.
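
One possible way to get per-class scores for a binary target is the `pos_label` argument of `precision_score` and `recall_score`. The labels below are invented for illustration and are not taken from admission.csv.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical admit labels (1 = admitted, 0 = not admitted), made up for illustration.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

# Overall scores: class 1 is the positive class (the default pos_label).
p1 = precision_score(y_true, y_pred)
r1 = recall_score(y_true, y_pred)

# Per-class scores for class 0: treat 0 as positive and 1 as negative.
p0 = precision_score(y_true, y_pred, pos_label=0)
r0 = recall_score(y_true, y_pred, pos_label=0)

print("class 1:", round(p1, 3), round(r1, 3))
print("class 0:", round(p0, 3), round(r0, 3))
```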

In [3]:
from sklearn.tree import DecisionTreeClassifier
import pandas

model = DecisionTreeClassifier()
df = pandas.read_csv('~/Dropbox/datasets/iris.csv')
X = df.drop('Species', axis=1)
y = df.Species

In [4]:
model.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [5]:
df.sample()

    SepalLength  SepalWidth  PetalLength  PetalWidth  Species
17          5.1         3.5          1.4         0.3   setosa


In [6]:
model.predict([ [5.0, 3.4, 1.6, 0.5]  ])

array(['setosa'], dtype=object)

## Validate a decision tree model (on iris)

In [12]:
from sklearn.metrics import precision_score, recall_score, classification_report
# Predict on the same data the model was fit on, then compare with the true labels.
y_pred = model.predict(X)
print(classification_report(y, y_pred))

             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        50
 versicolor       1.00      1.00      1.00        50
  virginica       1.00      1.00      1.00        50

avg / total       1.00      1.00      1.00       150



### This validation is not good because it's done on training data (i.e., the data we used to build the model).

### The proper way is to use one set for training (i.e., model building/fitting) and another set for testing.

In [26]:
from sklearn.model_selection import train_test_split


In [28]:
# Hold out a random 10% of the data for testing; fit on the remaining 90%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         4
 versicolor       0.86      1.00      0.92         6
  virginica       1.00      0.80      0.89         5

avg / total       0.94      0.93      0.93        15



### Exercise

Validate the performance of a decision tree on predicting acceptance to UCLA.  Use precision_score and recall_score as performance measures.


In [40]:
# 1. get the data
df = pandas.read_csv('~/Dropbox/datasets/admission.csv')

# 2. get X and y
y = df.admit
X = df.drop('admit', axis=1)

# 3. create the model
model = DecisionTreeClassifier()

# 4. create training/testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# 5. build model.
model.fit(X_train, y_train)

# 6. validate model.
y_pred = model.predict(X_test)
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred))




0.5 0.3157894736842105


Observation: performance varies across different splits.

Solution: cross validate, i.e., average the scores over several different splits.

Challenge: write a Python function to cross validate a model across multiple random splits.

In [56]:

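def cross_validate(model, X, y, n, test_size):
    """Average precision and recall of model over n random train/test splits."""
    p, r = [], []
    for i in range(n):
        # Draw a fresh random split, refit the model, and score it on the held-out part.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        p.append(precision_score(y_test, y_pred))
        r.append(recall_score(y_test, y_pred))
    # Mean precision and mean recall, rounded to 2 decimals.
    return round(sum(p)/n, 2), round(sum(r)/n, 2)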

In [57]:
cross_validate(DecisionTreeClassifier(), X, y, 100, 0.1)

(0.4, 0.4)

In [45]:
cross_validate(DecisionTreeClassifier(max_depth=7), X, y, 10, 0.1)

(0.41318903318903316, 0.24706709956709955)

Question: which one is better, a decision tree or a random forest?

In [48]:
from sklearn.ensemble import RandomForestClassifier


In [60]:
cross_validate(RandomForestClassifier(n_estimators=50), X, y, 100, 0.1)

(0.49, 0.35)

## Different types of cross validation

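1. Shuffle and Split.  This is what we just did by going through multiple random train_test_split calls.

2. KFold.  (split the data into k folds; each fold is used once as the test set while the other k-1 folds are used for training)

3. Repeated KFold.  (repeat KFold n times, each with a random initial ordering of data)

4. Stratified KFold.  (similar to KFold but attempts to approximate the overall class distribution of the data in each fold).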
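
The four strategies above correspond to splitter classes in `sklearn.model_selection`. The sketch below is one way to try them; it assumes the copy of iris bundled with scikit-learn (`load_iris`) instead of the local CSV, and the split counts, `random_state` values, and the `precision_macro` scoring choice are arbitrary picks for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    KFold,
    RepeatedKFold,
    ShuffleSplit,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.tree import DecisionTreeClassifier

# Assumption: use the iris data bundled with scikit-learn rather than a local CSV.
X, y = load_iris(return_X_y=True)

# One splitter per strategy; the parameters are illustrative, not recommendations.
splitters = {
    "ShuffleSplit": ShuffleSplit(n_splits=10, test_size=0.1, random_state=0),
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "RepeatedKFold": RepeatedKFold(n_splits=5, n_repeats=3, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
}

results = {}
for name, cv in splitters.items():
    # cross_val_score refits a clone of the model on each training part
    # and scores it on the matching test part.
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="precision_macro"
    )
    results[name] = scores
    print(f"{name:16s} splits={len(scores):2d}  mean macro precision={scores.mean():.2f}")
```

Note that RepeatedKFold with 5 folds and 3 repeats yields 15 scores, one per fold per repeat.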


## Other measures of classification performance

Aside from precision and recall, people also look at sensitivity and specificity.

Sensitivity is another name for recall (the true positive rate):  TP / (TP + FN)

Specificity is the true negative rate:  TN / (TN + FP), where TN is the number of true negatives.

Just like precision and recall go "together", sensitivity and specificity go together.  These measures are preferred in applications where true negative rates are important to know.
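
As a quick check, both measures can be computed for the virus example from earlier in these notes. This is a sketch with illustrative variable names; the true-negative count is derived from the stated totals (400 people, of whom 280 are TP, 40 are FP, and 30 are FN).

```python
# Counts from the virus example earlier in these notes.
total = 400
tp, fp, fn = 280, 40, 30
tn = total - tp - fp - fn       # 50 people correctly predicted as not infected

sensitivity = tp / (tp + fn)    # same as recall
specificity = tn / (tn + fp)    # true negative rate

print(f"sensitivity = {sensitivity:.3f}")  # sensitivity = 0.903
print(f"specificity = {specificity:.3f}")  # specificity = 0.556
```

In this example the procedure finds most infected people (high sensitivity) but correctly clears only a little over half of the non-infected ones (low specificity), something precision and recall alone do not show.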