## Model validation

- Ensuring your model performs as expected on new data
- Testing model performance on hold-out set
- Selecting the best model, parameters, and accuracy metrics
- Achieving the best accuracy for the data given

### Modeling review

~~~
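from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

model = RandomForestRegressor(n_estimators=500, random_state=1111)

model.fit(X=X_train, y=y_train)

predictions = model.predict(X_test)

# Mean Absolute Error
print(mae(y_true=y_test, y_pred=predictions))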
~~~

- Mean Absolute Error:

$ MAE = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} |y_i - \hat{y}_i| $

### Random Forest

#### Regressor

- Parameters:
	- n_estimators: the number of trees in the forest
	- max_depth: the maximum depth of the trees.
	- random_state: random seed

- Setting parameters:

~~~
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=50, max_depth=10)
~~~

or

~~~
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.n_estimators = 50
rfr.max_depth = 10
~~~

- Feature importance:

Print how important each column is to the model:

~~~
for i, item in enumerate(rfr.feature_importances_):
	print("{0:s}: {1:.2f}".format(X.columns[i], item))
~~~
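
The loop above needs a fitted model and a DataFrame `X`. A minimal self-contained sketch, assuming synthetic data from `make_regression` and made-up column names (`f0` to `f3`), could look like this:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; the column names are hypothetical, for illustration only
X_arr, y_demo = make_regression(n_samples=200, n_features=4,
                                n_informative=2, random_state=1111)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

rfr_demo = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=1111)
rfr_demo.fit(X_demo, y_demo)

# Pair each column with its importance and sort from most to least important
importances = pd.Series(rfr_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False).round(2))
```

The importances are normalized, so they sum to 1 across all columns.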

#### Classifier

~~~
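from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=1111)
# Inspect all parameters, including the defaults
rfc.get_params()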

rfc.fit(X_train, y_train)

# Returns accuracy
rfc.score(X_test, y_test)
~~~


#### Creating train, test and validation data sets

~~~
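from sklearn.model_selection import train_test_split

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# 25% of the remaining 80% is 20% of the total, giving a 60/20/20 split
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=11111)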
~~~
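
To check the resulting proportions, here is a small sketch on a made-up array of 100 rows (the data and the `X_all`/`y_all` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy observations with 2 features each
X_all = np.arange(200).reshape(100, 2)
y_all = np.arange(100)

# First split: 80% temporary set, 20% test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=1111)

# Second split: 25% of the temporary set becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```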


## Accuracy metrics

### Regression models

#### Mean Absolute Error (MAE)

$MAE = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} |y_i - \hat{y}_i| $

- Simplest and most intuitive metric
- Treats all points equally
- Less sensitive to outliers than MSE
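
As a quick check of the formula, the sketch below computes MAE by hand with NumPy and compares it with `mean_absolute_error`. The four values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values; the absolute errors are 1, 0, 2 and 1
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Mean of the absolute differences, exactly as in the formula
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)  # 1.0 1.0
```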

#### Mean Squared Error (MSE)

$MSE = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $

- Most widely used regression metric
- Allows outlier errors to contribute more to the overall error

#### MAE *vs* MSE

- Accuracy metrics are always application specific
- MAE and MSE are in different units (MSE is in squared units of the response), so their values should not be compared with each other
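
To see how differently the two metrics react to a single large error, consider this small sketch with made-up numbers: the two prediction arrays are identical except for the last value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_obs = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
pred_clean = np.array([11.0, 11.0, 12.0, 12.0, 13.0])    # every error is 1
pred_outlier = np.array([11.0, 11.0, 12.0, 12.0, 22.0])  # last error is 10

mae_clean = mean_absolute_error(y_obs, pred_clean)
mae_outlier = mean_absolute_error(y_obs, pred_outlier)
mse_clean = mean_squared_error(y_obs, pred_clean)
mse_outlier = mean_squared_error(y_obs, pred_outlier)

print(round(mae_clean, 2), round(mae_outlier, 2))  # 1.0 2.8
print(round(mse_clean, 2), round(mse_outlier, 2))  # 1.0 20.8
```

One outlier roughly triples the MAE but multiplies the MSE by about twenty, because the error of 10 is squared before averaging.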

### Classification models

#### Confusion matrix

~~~
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, test_predictions)
print(cm)

# cm[<true_category_index>, <predicted_category_index>]
print(cm[1,0]) # False Negatives
~~~

#### Accuracy

$\displaystyle\frac{TN + TP}{TN + TP + FN + FP}$

~~~
# cm.sum() adds every cell; the built-in sum(cm) would only add the rows together
(cm[0,0] + cm[1,1])/cm.sum()
~~~

#### Precision

$\displaystyle\frac{TP}{TP + FP}$

~~~
# Column 1 holds everything predicted positive (FP + TP)
cm[1,1]/sum(cm[:,1])
~~~

#### Recall

$\displaystyle\frac{TP}{TP + FN}$

~~~
# Row 1 holds everything that is truly positive (FN + TP)
cm[1,1]/sum(cm[1,:])
~~~

#### In scikit-learn

~~~
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
~~~
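
The sketch below ties the formulas to the scikit-learn helpers on a small made-up binary example (the labels are assumptions chosen so the counts are easy to verify by eye):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 4 true negatives-or-false-positives, 6 actual positives
y_actual = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_actual, y_hat)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 2 4

accuracy = (tn + tp) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, accuracy_score(y_actual, y_hat))     # both 0.7
print(precision, precision_score(y_actual, y_hat))   # both 0.8
print(round(recall, 3), round(recall_score(y_actual, y_hat), 3))  # both 0.667
```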

## The bias-variance trade-off

### Variance

- Following the training data too closely
	- Fails to generalize to the test data
	- Low training error but high testing error
	- Occurs when models are overfit and have high complexity

### Bias

- Failing to find the relationship between the data and the response
	- High training/testing error
	- Occurs when models are underfit
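
One way to observe both failure modes is to vary `max_depth` and compare training and testing error. The sketch below is only an illustration on assumed synthetic data (a noisy sine wave); the exact numbers depend on the data and the seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: a sine wave plus noise (an assumption for illustration)
rng = np.random.RandomState(1111)
X_bv = rng.uniform(0, 10, size=(300, 1))
y_bv = np.sin(X_bv[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_bv, y_bv, test_size=0.3, random_state=1111)

results = {}
for depth in [1, 4, None]:  # None lets the trees grow without a depth limit
    model = RandomForestRegressor(n_estimators=25, max_depth=depth,
                                  random_state=1111)
    model.fit(X_tr, y_tr)
    train_err = mean_absolute_error(y_tr, model.predict(X_tr))
    test_err = mean_absolute_error(y_te, model.predict(X_te))
    results[depth] = (train_err, test_err)
    print(depth, round(train_err, 3), round(test_err, 3))
```

Depth 1 typically shows high error on both sets (bias), while unlimited depth shows a low training error with a noticeably higher testing error (variance).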


## Cross-validation

- KFold() parameters
	- n_splits: number of cross-validation splits
	- shuffle: boolean indicating to shuffle the data before splitting
	- random_state: random seed

~~~
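import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# scikit-learn estimators expect a 2-D feature array
X = np.arange(40).reshape(-1, 1)
y = np.array([0] * 20 + [1] * 20)

kf = KFold(n_splits=5)

# kf.split() returns a one-shot generator, so call it again for each loop
for train_index, test_index in kf.split(X):
    print(len(train_index), len(test_index)) # 32 8 (five times)

rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

errors = []

for train_index, val_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]

    rfr.fit(X_train, y_train)
    predictions = rfr.predict(X_val)
    # Any accuracy metric works here; MAE is used as an example
    errors.append(mean_absolute_error(y_val, predictions))

print(np.mean(errors))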
~~~
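
The `shuffle` and `random_state` parameters are listed above but not used in the example. A short sketch of their effect, on an assumed dummy array of 40 rows:

```python
import numpy as np
from sklearn.model_selection import KFold

X_k = np.arange(40).reshape(-1, 1)

# Without shuffling, folds are consecutive blocks of rows
kf_plain = KFold(n_splits=5)
first_plain = next(kf_plain.split(X_k))[1]
print(first_plain)  # [0 1 2 3 4 5 6 7]

# With shuffling, rows are permuted before being split into folds;
# random_state makes the permutation reproducible
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=1111)
first_shuffled = next(kf_shuffled.split(X_k))[1]
print(first_shuffled)
```

Shuffling matters when the rows are ordered, as with the `y` array above where all zeros come before all ones.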

### In scikit-learn

- cross_val_score()
	- estimator: the model to use
	- X: the predictor data set
	- y: the response array
	- cv: the number of CV splits

- make_scorer()

~~~
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import cross_val_score

mae_scorer = make_scorer(mean_absolute_error)

cv_results = cross_val_score(<estimator>, <X>, <y>, cv=5, scoring=mae_scorer)
~~~
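
With the placeholders filled in, the call might look like the following sketch. The data come from `make_regression` and the estimator settings are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumed for illustration)
X_cv, y_cv = make_regression(n_samples=100, n_features=5, noise=10.0,
                             random_state=1111)

rfr_cv = RandomForestRegressor(n_estimators=25, random_state=1111)
mae_scorer = make_scorer(mean_absolute_error)

# One MAE value per fold
cv_results = cross_val_score(rfr_cv, X_cv, y_cv, cv=5, scoring=mae_scorer)
print(cv_results)
print(cv_results.mean())
```

Because `make_scorer` is called with its defaults here, the returned values are plain MAE, so lower is better.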

### Leave-one-out Cross-Validation (LOOCV)

- Use when:
	- the amount of training data is limited
	- you want the absolute best error estimate for new data
- Be cautious when:
	- computational resources are limited
	- you have a lot of data
	- you have a lot of parameters to test

#### Example

~~~
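from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# One fold per observation
n = X.shape[0]
mse = make_scorer(mean_squared_error)

cv_results = cross_val_score(estimator, X, y, scoring=mse, cv=n)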
~~~
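
scikit-learn also provides a `LeaveOneOut` splitter that can be passed as `cv`. The sketch below uses it on a small assumed data set of 30 rows, which also shows the cost: one model fit per observation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic data set (assumed), so the 30 fits stay fast
X_loo, y_loo = make_regression(n_samples=30, n_features=3, noise=5.0,
                               random_state=1111)

rfr_loo = RandomForestRegressor(n_estimators=10, random_state=1111)
mse_scorer = make_scorer(mean_squared_error)

# Each split holds out exactly one row for validation
loo_results = cross_val_score(rfr_loo, X_loo, y_loo,
                              scoring=mse_scorer, cv=LeaveOneOut())

print(len(loo_results))  # 30
print(loo_results.mean())
```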

## Hyperparameter tuning

### Grid Searching

- Benefits:
	- Tests every possible combination
- Drawbacks:
	- Additional hyperparameters increase training time exponentially
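
A minimal sketch of a grid search with `GridSearchCV`, assuming synthetic data and a deliberately tiny grid. `ParameterGrid` is used only to count the combinations that will be tried.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic data (assumed for illustration)
X_gs, y_gs = make_regression(n_samples=100, n_features=5, noise=10.0,
                             random_state=1111)

# 3 values x 2 values = 6 combinations; each extra parameter multiplies this
param_grid = {'max_depth': [2, 4, None],
              'min_samples_split': [2, 5]}
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 6

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=1111),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',  # negated so that higher is better
    cv=3,
)
grid_search.fit(X_gs, y_gs)

print(grid_search.best_params_)
print(len(grid_search.cv_results_['params']))  # 6
```

With `cv=3`, these 6 combinations already require 18 model fits, plus one final refit on the full data.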

### Random Searching

- Parameters:
	- estimator: the model to use
	- param_distributions: dictionary containing hyperparameters and possible values
	- n_iter: number of iterations
	- scoring: scoring method to use

~~~
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error

param_dist = {'max_depth': [4, 6, 8, None],
		'max_features': range(2,11),
		'min_samples_split': range(2,11)}

rfr = RandomForestRegressor(n_estimators=20, random_state=1111)
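# greater_is_better=False makes the search minimize MAE rather than maximize it
scorer = make_scorer(mean_absolute_error, greater_is_better=False)

random_search = RandomizedSearchCV(estimator=rfr, param_distributions=param_dist, n_iter=40, cv=5, scoring=scorer)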

random_search.fit(X, y)
~~~

#### Attributes

- random_search.best_score_
- random_search.best_params_
- random_search.best_estimator_
- random_search.cv_results_ [dict]
	- Example:

	~~~
	import pandas as pd

	# Grouping the maximum depths
	max_depth = [item['max_depth'] for item in random_search.cv_results_['params']]
	scores = list(random_search.cv_results_['mean_test_score'])
	df = pd.DataFrame({'Max Depth': max_depth, 'Score': scores})
	# Rows with max_depth=None are dropped by groupby by default (dropna=True)
	df.groupby(['Max Depth']).mean()
	~~~

#### Best Estimator

- Predict new data:
~~~
random_search.best_estimator_.predict(<new_data>)
~~~

- Check the parameters:
~~~
random_search.best_estimator_.get_params()
~~~

- Save model to use later:
~~~
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
import joblib
joblib.dump(random_search.best_estimator_, 'rfr_best_<date>.pkl')
~~~
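
- Load the saved model:

A sketch of the round trip with `joblib.dump` and `joblib.load`, assuming a small model trained on synthetic data and a temporary file (the file name `rfr_demo.pkl` is made up):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small model on synthetic data (assumed for illustration)
X_s, y_s = make_regression(n_samples=50, n_features=3, random_state=1111)
model_s = RandomForestRegressor(n_estimators=10, random_state=1111).fit(X_s, y_s)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'rfr_demo.pkl')
    joblib.dump(model_s, path)     # save to disk
    restored = joblib.load(path)   # load it back

# The restored model should give the same predictions as the original
same = np.allclose(model_s.predict(X_s), restored.predict(X_s))
print(same)
```

Pickled models are generally only safe to load with the same scikit-learn version that saved them, and only from trusted sources.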