# MACHINE LEARNING - VALIDATION

**CONTENTS**

- [Train Test Split](#train_test_split)
- [k-fold Validation](#validation)
- [Evaluation Metrics](#evaluation_metrics)

<a id='train_test_split'></a>

## 1. TRAINING AND TESTING SPLIT

- Split the data into a training set and a test set with `train_test_split`, which shuffles the rows before splitting (`shuffle=True` is the default).

### 1.1. EXAMPLE

**CASE STUDY EXAMPLE:**

- Part of the [BNP Paribas Cardif Claims Management dataset](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management)

In [2]:
### import necessary packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
### Load the data
data = pd.read_csv("data/paribas_claims.csv", nrows = 50000)
print(data.shape)
data.head(3)
#### Select the numeric variables only
numerics = ['int16','int32','int64','float16','float32','float64']
numerical_vars = list(data.select_dtypes(include = numerics).columns)
data = data[numerical_vars]
data.shape

### Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(
                data.drop(labels=['target', 'ID'], axis=1),
                data['target'], test_size=0.3, random_state=0)
X_train.shape, X_test.shape

(50000, 133)


((35000, 112), (15000, 112))

### 1.2. FOR CLASS IMBALANCE

- When the outcome classes are imbalanced (e.g. one class is rare compared to the others), pass the labels to the `stratify` parameter (e.g. `stratify=y`). The split is then stratified: the train and test sets each contain approximately the same percentage of samples of each target class as the complete set.

In [47]:
from sklearn import datasets 
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X = iris.data[:,:2]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y)

In [48]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [49]:
y_test

array([2, 1, 0, 2, 1, 0, 1, 0, 0, 0, 0, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2,
       1, 2, 2, 2, 1, 0, 1, 2])
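The effect of `stratify` can be checked by counting the classes in each subset. The following is a minimal sketch on the same Iris data; the variable names (`X_tr`, `y_te`, ...) and the fixed `random_state` are choices made here for illustration and are not part of the original cells.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# random_state is fixed only so that the sketch is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, stratify=y_iris, random_state=0)

# Count the samples of each class in the full set and in both subsets
full_counts = np.bincount(y_iris)
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(full_counts, train_counts, test_counts)
```

Because Iris has three equally sized classes, each class should make up one third of both the training set and the test set.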

<a id='validation'></a>

## 2. K-FOLD CROSS VALIDATION

- **PROCESS:**

    + Split the data into k sets (folds).
    + Perform k experiment runs. In each run, pick one fold as the test set and use the remaining k-1 folds as the training set.
    + Evaluation: average the results of these k experiments.
    + => A more reliable evaluation, because every data point is used for both training and testing.

- **DRAWBACK:**
    + More compute time: the model is trained k times instead of once.
    + A single train/test split minimizes the training time, while k-fold CV maximizes the accuracy of the performance estimate.
    
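- k-fold is usually used in combination with `GridSearchCV`, which now lives in `sklearn.model_selection` (the old `sklearn.grid_search` module has been removed from scikit-learn).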
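The procedure above can be sketched with scikit-learn's `KFold` and `cross_val_score`, followed by a grid search that reuses the same folds. This is only an illustration: the Iris data, the linear `SVC`, the choice of k = 5 and the candidate values for `C` are assumptions made for this sketch, and `GridSearchCV` is imported from `sklearn.model_selection`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Iris is ordered by class, so shuffle before splitting into folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; their mean is the cross-validated estimate
scores = cross_val_score(SVC(kernel="linear"), X_iris, y_iris, cv=cv)
print(scores, scores.mean())

# Grid search repeats the same k-fold procedure for every candidate value of C
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.1, 1, 10]}, cv=cv)
search.fit(X_iris, y_iris)
print(search.best_params_, search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best candidate, so it is comparable to `scores.mean()` above.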

<a id='evaluation_metrics'></a>
## 3. EVALUATION METRICS

### 3.1. SIMPLE METRIC: ACCURACY

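- accuracy = number of data points with correctly predicted labels / number of all data points
- used for classification problems.
- **Drawbacks**:
    + not ideal for skewed classes: a model that always predicts the majority class can still reach a high accuracy.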

In [51]:
### 2 ways to calculate the accuracy score
# sklearn metrics
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
clf = SVC(kernel = "linear", gamma = "auto")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy score of the prediction", accuracy_score(y_test, y_pred))
# model attribute
clf.score(X_test, y_test)

Accuracy score of the prediction 0.8


0.8
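The drawback with skewed classes can be made concrete with a baseline that always predicts the majority class. The sketch below uses made-up labels (95 negatives, 5 positives) and scikit-learn's `DummyClassifier`; the variable names are chosen here so that they do not overwrite `y_test` and `y_pred` from the cell above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up, heavily skewed labels: 95 negatives and 5 positives
y_skewed = np.array([0] * 95 + [1] * 5)
X_placeholder = np.zeros((100, 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class (here: 0)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_skewed)
y_baseline_pred = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_skewed, y_baseline_pred)
baseline_recall = recall_score(y_skewed, y_baseline_pred)
print(baseline_accuracy, baseline_recall)
```

The baseline is right on 95 of 100 points, yet it never finds a single positive, which is why accuracy alone can be misleading on imbalanced data.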

### 3.2. CONFUSION MATRIX

- Precision and recall describe performance better than accuracy for datasets with imbalanced classes. TP, FP and FN are the numbers of true positives, false positives and false negatives in the confusion matrix.
- Precision = TP/(TP + FP): the probability that a data point identified as the target really is the target. It is also known as the positive predictive value.
- Recall = TP/(TP + FN): the probability that a real target (according to the ground truth) is correctly identified. It is also known as sensitivity.
- F1 score = 2 * (Precision * Recall) / (Precision + Recall), i.e. the harmonic mean of precision and recall
    + max F1: 1 => perfect precision and recall
    + min F1: 0 => either precision or recall is 0.
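The three formulas translate directly into code. The helper below is a hypothetical function written for this note (it is not a scikit-learn API), the counts are illustrative, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts."""
    # Assumption of this sketch: an undefined ratio (zero denominator) counts as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative counts: 8 true positives, 2 false positives, 8 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

With these counts precision is 8/10 and recall is 8/16, and F1 lies between the two, closer to the smaller value.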

#### Examples to understand the precision and recall:

- **My identifier doesn't have great precision, but it does have good recall**. That means that nearly every time a person of interest (POI) shows up in my test set, I am able to identify the POI. The cost of this is that I sometimes get false positives, where non-POIs get flagged.

- **My identifier doesn't have great recall, but it does have good precision**. That means that whenever a POI gets flagged in my test set, I know with a lot of confidence that it's very likely to be a real POI and not a false alarm. On the other hand, the price I pay for this is that I sometimes miss real POIs, since I'm effectively reluctant to pull the trigger on edge cases.

- **My identifier has a really great F1 score**. This is the best of both worlds. Both my false positive and false negative rates are low, which means that I can identify POIs reliably and accurately. If my identifier finds a POI, then the person is almost certainly a POI, and if the identifier does not flag someone, then they are almost certainly not a POI.


In [54]:
from sklearn.metrics import recall_score
### Different values for `average` parameter
# the scores for each class are returned
recall_score(y_test, y_pred, average = None)

array([1. , 0.7, 0.7])

In [55]:
# Calculate metrics for each label, and find their unweighted mean.
recall_score(y_test, y_pred, average = 'macro')

0.7999999999999999

In [56]:
recall_score(y_test, y_pred, average = 'weighted')

0.8

In [57]:
from sklearn.metrics import precision_score
precision_score(y_test, y_pred, average = None)

array([1. , 0.7, 0.7])

In [58]:
precision_score(y_test, y_pred, average = 'macro')

0.7999999999999999

In [59]:
precision_score(y_test, y_pred, average = 'weighted')

0.8
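The `macro` and `weighted` results above coincide because the stratified test set has equally sized classes. The sketch below uses made-up labels with unequal class sizes to show how the two averages are built from the per-class scores; all variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import recall_score

# Made-up labels with unequal class sizes (4, 2 and 2 samples)
y_true_demo = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 0, 0, 1, 1, 2, 0]

per_class = recall_score(y_true_demo, y_pred_demo, average=None)
support = np.bincount(y_true_demo)  # number of true samples per class

# 'macro': plain mean of the per-class scores
macro_by_hand = per_class.mean()
# 'weighted': mean of the per-class scores weighted by class support
weighted_by_hand = np.average(per_class, weights=support)

macro_sklearn = recall_score(y_true_demo, y_pred_demo, average="macro")
weighted_sklearn = recall_score(y_true_demo, y_pred_demo, average="weighted")
print(per_class, macro_by_hand, weighted_by_hand)
```

Here the weak class is also a small class, so the weighted average comes out higher than the macro average.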

In [64]:
### Example of binary classification
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true_labels =  [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
precision_score(true_labels, predictions)

0.6666666666666666

In [None]:
            predicted
             0     1
real    0    9    3
        1    2    6   
    
precision = 6/(6 + 3) = 0.6666666666666666
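The table above can be reproduced with `confusion_matrix`, which puts the real labels in rows and the predicted labels in columns. The sketch reuses the `predictions` and `true_labels` lists from the binary example and also computes recall and F1 for the same data.

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score

predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

# Rows are the real classes, columns are the predicted classes
cm = confusion_matrix(true_labels, predictions)
tn, fp, fn, tp = cm.ravel()
print(cm)

binary_recall = recall_score(true_labels, predictions)  # tp / (tp + fn)
binary_f1 = f1_score(true_labels, predictions)          # 2*tp / (2*tp + fp + fn)
print(binary_recall, binary_f1)
```

Recall is 6/(6 + 2) = 0.75, and F1 is 12/17, roughly 0.71, which sits between the precision of 0.67 and the recall of 0.75.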

### 3..