## Model Selection & Evaluation

<hr>

### Agenda
1. Cross Validation 
2. Hyperparameter Tuning  
3. Model Evaluation 
4. Model Persistence
5. Validation Curves

<hr>

### 1. Cross Validation
* Simple models tend to underfit.
* Their accuracy on training data & validation data is not much different.
* But the accuracy itself isn't that great.
* This is a situation of low variance & high bias.
* On moving towards more complex models, training accuracy improves.
* But the gap between accuracy on training data & validation data increases.
* This is a situation of high variance & low bias.

<img src="biasvariance.png" width="400px">

* **Bias error** is an error from erroneous assumptions in the learning algorithm.
* **Variance** is an error from sensitivity to small fluctuations in the training set.

* We need to compare across models to find the best model.
* We need to compare across all hyper-parameter settings for a particular model.
* The data that is used for training should not be used for validation.
* The validation accuracy is the one that we report.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
from sklearn.datasets import load_digits

In [None]:
digits = load_digits()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.imshow(digits.images[3],cmap='gray')

In [None]:
dt = DecisionTreeClassifier(max_depth=10)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
trainX, testX, trainY, testY = train_test_split(digits.data, digits.target)

In [None]:
dt.fit(trainX,trainY)

In [None]:
dt.score(testX,testY)

In [None]:
dt.score(trainX,trainY)

* Decreasing the complexity of the model

In [None]:
dt = DecisionTreeClassifier(max_depth=7)

In [None]:
dt.fit(trainX,trainY)

In [None]:
dt.score(testX,testY)

In [None]:
dt.score(trainX,trainY)

* Observation: With a decrease in complexity, the gap between training & validation accuracy also decreases
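
The observation above can be checked over a range of depths. This is a minimal sketch, assuming a fixed `random_state` (not used in the cells above) so that the split is repeatable; the names `depths` and `gaps` are illustrative only.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
# random_state is an assumption added here so that reruns give the same split
trainX, testX, trainY, testY = train_test_split(
    digits.data, digits.target, random_state=0)

depths = [2, 4, 7, 10, 15]
gaps = {}
for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0)
    dt.fit(trainX, trainY)
    train_acc = dt.score(trainX, trainY)
    val_acc = dt.score(testX, testY)
    # gap between training accuracy and validation accuracy
    gaps[depth] = train_acc - val_acc
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  "
          f"validation={val_acc:.3f}  gap={gaps[depth]:.3f}")
```

Shallow trees should show low accuracy with a small gap, while deep trees should show high training accuracy with a larger gap.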

#### Cross Validation API
* Split the data into k parts (folds).
* Use k - 1 parts for training the model.
* Use the remaining part for validation.
* Repeat k times, so that each part is used for validation exactly once, to get a generalized behaviour.
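
The steps above can be written out as an explicit loop. This is a rough sketch using `KFold`; the shuffling, the `random_state` values and the name `fold_scores` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X, y = digits.data, digits.target

k = 5
# shuffle and random_state are assumptions for this sketch
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = DecisionTreeClassifier(max_depth=7, random_state=0)
    # train on k - 1 parts
    model.fit(X[train_idx], y[train_idx])
    # validate on the held-out part
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

fold_scores = np.array(fold_scores)
print(fold_scores, fold_scores.mean())
```

For classifiers, `cross_val_score` uses stratified folds by default, so its numbers will not match this loop exactly.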

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score(dt, digits.data, digits.target)

In [None]:
scores.mean()

#### cross_validate Function: Scores for multiple metrics

In [None]:
from sklearn.model_selection import cross_validate

In [None]:
scoring = ['precision_macro', 'recall_macro', 'accuracy']

In [None]:
cross_validate(dt, digits.data, digits.target, scoring=scoring, cv=5)

#### Stratification for dealing with imbalanced Classes
* StratifiedKFold
  - Class frequencies are preserved in each split of the data

In [None]:
import numpy as np

In [None]:
Y = np.append(np.ones(12),np.zeros(6))
Y

In [None]:
X = np.ones((18,3))
X

In [None]:
from sklearn.model_selection import StratifiedKFold

In [None]:
skf = StratifiedKFold(n_splits=3)

In [None]:
for i, (train_index, test_index) in enumerate(skf.split(X, Y)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
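
To see that class frequencies are preserved, the class counts in each test fold can be printed. This is a small sketch on the same toy arrays; the name `test_counts` is illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

Y = np.append(np.ones(12), np.zeros(6))   # 12 samples of class 1, 6 of class 0
X = np.ones((18, 3))

skf = StratifiedKFold(n_splits=3)
test_counts = []
for train_index, test_index in skf.split(X, Y):
    # count how many samples of class 0 and class 1 land in this test fold
    counts = np.bincount(Y[test_index].astype(int), minlength=2)
    test_counts.append(counts.tolist())
    print(counts)
```

Each test fold keeps the 1:2 ratio of class 0 to class 1 found in the full data.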

### 2. Hyperparameter Tuning
* Model parameters are learnt by the learning algorithm from the data
* Hyper-parameters need to be configured before training
* Good hyper-parameter values are data dependent & often need experiments to find
* sklearn provides GridSearchCV for finding the best hyper-parameters

#### Exhaustive GridSearch
* Searches exhaustively over all the configured parameter values
* Every possible combination is evaluated
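
To get a feel for how many candidates a grid produces, the combinations can be enumerated with `ParameterGrid`. The grid below is a hypothetical one chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

# hypothetical grid: 3 depths x 2 criteria
param_grid = {"max_depth": [5, 10, 15],
              "criterion": ["gini", "entropy"]}

candidates = list(ParameterGrid(param_grid))
print(len(candidates))   # 3 * 2 = 6 combinations
for params in candidates:
    print(params)
```

With `cv=5`, a grid search over these 6 candidates would fit the model 30 times, plus one final refit on the full data by default.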

In [None]:
trainX, testX, trainY, testY = train_test_split(digits.data, digits.target)

In [None]:
dt = DecisionTreeClassifier()

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid_search = GridSearchCV(dt, param_grid={'max_depth':range(5,30,5)}, cv=5)

In [None]:
grid_search.fit(digits.data,digits.target)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_estimator_

#### RandomizedSearch
* Unlike GridSearch, not all parameter combinations are tried & tested
* Rather, a fixed number of parameter settings is sampled from the specified distributions.
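
The sampling step can be looked at on its own with `ParameterSampler`. This sketch assumes a fixed `random_state` for repeatability and a hypothetical two-parameter search space.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# lists are sampled uniformly; scipy distributions are sampled via their rvs method
param_dist = {"max_depth": [3, None],
              "min_samples_split": randint(2, 11)}  # integers 2..10

# random_state is an assumption to make the draw repeatable
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=0))
for s in samples:
    print(s)
```

Only `n_iter` settings are drawn, however large the search space is.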

##### Comparing GridSearchCV and RandomizedSearchCV

In [None]:
from time import time

# randint(low, high) is a discrete uniform distribution over the integers in [low, high)
from scipy.stats import randint

In [None]:
X = digits.data
Y = digits.target

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [None]:
# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": randint(1,11),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

In [None]:
param_dist

In [None]:
rf = RandomForestClassifier(n_estimators=20)

In [None]:
n_iter_search = 20
random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
random_search.fit(X, Y)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))

In [None]:
random_search.best_score_

In [None]:
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(rf, param_grid=param_grid, cv=5)
start = time()
grid_search.fit(X, Y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))

In [None]:
grid_search.best_score_

* GridSearchCV & RandomizedSearchCV can also tune hyper-parameters of transformers when they are part of a pipeline
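
As a rough illustration of tuning a transformer inside a pipeline, the sketch below searches over a PCA step and a decision tree together. The step names `pca` and `tree` and the grid values are assumptions picked for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

pipe = Pipeline([("pca", PCA()),
                 ("tree", DecisionTreeClassifier(random_state=0))])

# "<step name>__<parameter>" addresses a parameter of one pipeline step
param_grid = {"pca__n_components": [10, 20, 30],
              "tree__max_depth": [5, 10]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(digits.data, digits.target)
print(search.best_params_, search.best_score_)
```

Because the transformer is refit inside every fold, the validation part of each fold never influences the PCA projection.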

### 3. Model Evaluation
* Three different ways to evaluate the quality of model predictions
  - The score method of estimators, which has a default metric configured, i.e. r2_score for regression and accuracy for classification
  - Model evaluation tools like cross_validate or cross_val_score, which use the estimator's default score unless a scoring parameter is given
  - The metrics module, which is rich with various prediction error calculation techniques

In [None]:
trainX, testX, trainY, testY = train_test_split(X,Y)

In [None]:
rf.fit(trainX, trainY)

* Technique 1 - Using the score method

In [None]:
rf.score(testX,testY)

* Technique 2 - Using cross_val_score as discussed above

In [None]:
cross_val_score(rf,X,Y,cv=5)

#### Cancer prediction sample for understanding metrics

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
dt = DecisionTreeClassifier()

In [None]:
cancer_data = load_breast_cancer()

In [None]:
trainX, testX, trainY, testY = train_test_split(cancer_data.data, cancer_data.target)

In [None]:
dt.fit(trainX,trainY)

In [None]:
pred = dt.predict(testX)

#### Technique 3 - Using metrics
##### Classification metrics
* Accuracy Score - Correct classifications / (Correct classifications + Incorrect classifications)

In [None]:
from sklearn import metrics

In [None]:
metrics.accuracy_score(y_pred=pred, y_true=testY)

* Confusion Matrix - Shows details of classification including TP, FP, TN, FN
  - True Positive (TP), Actual class is 1 & prediction is also 1
  - True Negative (TN), Actual class is 0 & prediction is also 0
  - False Positive (FP), Actual class is 0 & prediction is 1
  - False Negative (FN), Actual class is 1 & prediction is 0

In [None]:
metrics.confusion_matrix(y_pred=pred, y_true=testY, labels=[0,1])
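
The four counts can be read straight out of the matrix. This is a small sketch on hand-made labels (`y_true` and `y_pred` are made-up values, not taken from the cancer data).

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# with labels=[0, 1], rows are actual classes and columns are predictions:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```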

<img src="https://github.com/awantik/machine-learning-slides/blob/master/confusion_matrix.png?raw=true" width="400px">

* Precision Score
  - Ability of a classifier not to label a sample as positive if it is negative
  - Calculated as TP/(TP+FP)
  - We don't want a non-spam mail to be marked as spam

In [None]:
metrics.precision_score(y_pred=pred, y_true=testY)

* Recall Score
  - Ability of the classifier to find all positive samples
  - Calculated as TP/(TP+FN)
  - It's ok to flag a harmless tumor as cancer, since the patient then undergoes more tests
  - But it is not ok to miss a cancer patient without further analysis

In [None]:
metrics.recall_score(y_pred=pred, y_true=testY)

* F1 score
  - Harmonic mean of precision & recall, calculated as 2 * (precision * recall) / (precision + recall)

In [None]:
metrics.f1_score(y_pred=pred, y_true=testY)
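
Precision, recall and F1 can be reproduced by hand from TP, FP and FN, with F1 computed as the harmonic mean of the other two. The labels below are made-up values for illustration.

```python
from sklearn import metrics

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn = 3, 2, 1   # counted by hand from the two lists above

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# the library functions should agree with the hand calculation
p_lib = metrics.precision_score(y_true=y_true, y_pred=y_pred)
r_lib = metrics.recall_score(y_true=y_true, y_pred=y_pred)
f1_lib = metrics.f1_score(y_true=y_true, y_pred=y_pred)
print(p_lib, r_lib, f1_lib)
```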

##### House Price Prediction - Understanding metrics

In [None]:
from sklearn.datasets import fetch_california_housing
house_data = fetch_california_housing()

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(house_data.data, house_data.target)

In [None]:
pred = lr.predict(house_data.data)

#### Metrics for Regression
* mean squared error
  - Mean of squares of differences between predicted value & actual value

In [None]:
metrics.mean_squared_error(y_pred=pred, y_true=house_data.target)

* mean absolute error
  - Mean of absolute differences between predicted value & actual value

In [None]:
metrics.mean_absolute_error(y_pred=pred, y_true=house_data.target)

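* r2 score
  - Best possible score is 1.0; it can be negative when the model fits worse than always predicting the mean
  - It measures goodness of fit for regression models
  - Calculated as 1 - (residual sum of squares)/(total sum of squares), i.e. the fraction of variance explained by the model
  - High r2 means predictions are close to the target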
  
  
  <img src="https://github.com/awantik/machine-learning-slides/blob/master/Capture.PNG?raw=true" width="400px">

In [None]:
metrics.r2_score(y_pred=pred, y_true=house_data.target)
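
The three regression metrics can be checked against a hand calculation with NumPy. This sketch uses four made-up values rather than the housing data.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)          # mean of squared differences
mae = np.mean(np.abs(errors))       # mean of absolute differences
ss_res = np.sum(errors ** 2)        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(mse, mae, r2)
```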

### Metrics for Clustering
* Two forms of evaluation
* supervised, which uses ground truth class values for each sample.
  - completeness_score
  - homogeneity_score
* unsupervised, which measures the quality of the model itself
  - silhouette_score
  - calinski_harabasz_score

##### completeness_score
- A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
- The score is 1.0 if all data belonging to the same class falls in the same cluster, even if multiple classes share a cluster

In [None]:
from sklearn.metrics.cluster import completeness_score

In [None]:
completeness_score( labels_true=[10,10,11,11],labels_pred=[1,1,0,0])

* The score is 1.0 because all the data belonging to the same class belongs to the same cluster

In [None]:
completeness_score( labels_true=[11,22,22,11],labels_pred=[1,0,1,1])

* The score is low (about 0.38) because class 22 is split across clusters: cluster 1 - [11,22,11], cluster 0 - [22]

In [None]:
print(completeness_score([10, 10, 11, 11], [0, 0, 0, 0]))

##### homogeneity_score
- A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

In [None]:
from sklearn.metrics.cluster import homogeneity_score

In [None]:
homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0])

In [None]:
homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3])

In [None]:
homogeneity_score([0, 0, 0, 0], [1, 1, 0, 0])

* The same class is broken into two clusters, but each cluster still contains members of a single class, so the result remains homogeneous
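
Homogeneity and completeness mirror each other: swapping the two label arguments turns one score into the other. This is a short sketch on made-up labels.

```python
from sklearn.metrics.cluster import completeness_score, homogeneity_score

labels_true = [11, 22, 22, 11]
labels_pred = [1, 0, 1, 1]

h = homogeneity_score(labels_true, labels_pred)
c = completeness_score(labels_true, labels_pred)
# swapping the two arguments turns one score into the other
h_swapped = homogeneity_score(labels_pred, labels_true)
c_swapped = completeness_score(labels_pred, labels_true)
print(h, c)
print(h_swapped, c_swapped)
```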

#### silhouette_score
* The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.
* The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
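
The formula can be checked by hand on a tiny one-dimensional example. The four points and their labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# hand calculation for the first sample (value 0.0, cluster 0)
a = abs(0.0 - 1.0)                           # mean distance to the rest of its own cluster
b = (abs(0.0 - 10.0) + abs(0.0 - 11.0)) / 2  # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)

s_all = silhouette_samples(X, labels)        # one coefficient per sample
s_mean = silhouette_score(X, labels)         # mean over all samples
print(s_manual, s_all[0])
print(s_mean)
```

`silhouette_score` is the mean of the per-sample coefficients returned by `silhouette_samples`.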

##### Selecting the number of clusters with silhouette analysis on KMeans clustering

In [None]:
from sklearn.datasets import make_blobs
X, Y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)

In [None]:
plt.scatter(X[:,0],X[:,1],s=10)

In [None]:
range_n_clusters = [2, 3, 4, 5, 6]

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:
for n_cluster in range_n_clusters:
    kmeans = KMeans(n_clusters=n_cluster, n_init='auto')
    kmeans.fit(X)
    labels = kmeans.predict(X)
    print (n_cluster, silhouette_score(X,labels))

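* Going by the highest average silhouette score, the best number of clusters here is 2 (the data was generated with 4 centers, so the score is a guide rather than a guarantee)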

#### calinski_harabasz_score
* The score is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion; a higher score indicates better separated clusters.

In [None]:
from sklearn.metrics import calinski_harabasz_score

for n_cluster in range_n_clusters:
    kmeans = KMeans(n_clusters=n_cluster, n_init='auto')
    kmeans.fit(X)
    labels = kmeans.predict(X)
    print (n_cluster, calinski_harabasz_score(X,labels))
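
For intuition, the score can be recomputed by hand as between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom (k - 1 and n - k). This sketch uses a made-up four-point dataset.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

n, k = len(X), len(np.unique(labels))
overall_mean = X.mean(axis=0)

between = 0.0   # between-cluster dispersion
within = 0.0    # within-cluster dispersion
for c in np.unique(labels):
    members = X[labels == c]
    center = members.mean(axis=0)
    between += len(members) * np.sum((center - overall_mean) ** 2)
    within += np.sum((members - center) ** 2)

ch_manual = (between / (k - 1)) / (within / (n - k))
ch_lib = calinski_harabasz_score(X, labels)
print(ch_manual, ch_lib)
```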

### 4. Model Persistence
* Model training is an expensive process
* It is desirable to save the model for future reuse
* This can be achieved using pickle & joblib

In [None]:
import pickle

In [None]:
s = pickle.dumps(dt)

In [None]:
pickle.loads(s)

In [None]:
type(s)

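* joblib is an alternative to pickle that is more efficient for objects carrying large numpy arrays
* It writes directly to a file on disk instead of returning a bytes object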

In [None]:
import joblib

In [None]:
joblib.dump(dt, 'dt.joblib')

* Loading the model back from the file

In [None]:
dt = joblib.load('dt.joblib')

In [None]:
dt
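
A quick way to check a round trip is to compare predictions before and after reloading. This is a minimal sketch; the temporary directory and the names `model`, `restored` and `same` are assumptions made so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer_data = load_breast_cancer()
model = DecisionTreeClassifier(random_state=0)
model.fit(cancer_data.data, cancer_data.target)

# a temporary directory is used here so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dt.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# the restored model should predict exactly like the original
same = np.array_equal(model.predict(cancer_data.data),
                      restored.predict(cancer_data.data))
print(same)
```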

### 5. Validation Curves
* To validate a model, we need a scoring function.
* Create a grid of possible values for a hyper-parameter.
* Select the hyper-parameter value which gives the best validation score.

In [None]:
from sklearn.model_selection import validation_curve

param_range = np.arange(1, 50, 2)

train_scores, test_scores = validation_curve(RandomForestClassifier(), 
                                             digits.data, 
                                             digits.target, 
                                             param_name="n_estimators", 
                                             param_range=param_range,
                                             cv=3, 
                                             scoring="accuracy", 
                                             n_jobs=-1)

In [None]:
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(param_range, train_mean, label="Training score", color="black")
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")

plt.title("Validation Curve With Random Forest")
plt.xlabel("Number Of Trees")
plt.ylabel("Accuracy Score")
plt.tight_layout()
plt.legend(loc="best")
plt.show()
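
Besides reading the plot, the best value can be picked programmatically from the mean cross-validation scores. This sketch assumes a decision tree and a hypothetical range of `max_depth` values to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
param_range = np.arange(2, 21, 3)   # hypothetical range of max_depth values

train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    digits.data, digits.target,
    param_name="max_depth", param_range=param_range,
    cv=3, scoring="accuracy")

# one row per parameter value, one column per fold
test_mean = test_scores.mean(axis=1)
best_depth = param_range[np.argmax(test_mean)]
print(best_depth, test_mean.max())
```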