# Model Validation

Model validation consists of: 
- ensuring your model performs as expected on new data
- testing model performance on holdout datasets
- selecting the best model, parameters and accuracy metrics
- achieving the best accuracy for the data given

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings 

from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')

We refer to *seen data* as the data used during the training phase, and to *unseen data* as the data not used for training.

We call *testing data* the data set aside to assess model performance.

Often the ratio is 80% of the available data for training and the other 20% for testing.



In [None]:
ttt_df = pd.read_csv('../data/tic-tac-toe.csv')
ttt_df

In [None]:
X = pd.get_dummies(ttt_df.iloc[:, :9])
y = ttt_df['Class'] 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

If we want to tune model parameters, we need a third set of data, separate from the training and testing sets. We call this new set the *validation set*.

To create training, validation and test sets we can call sklearn's *train_test_split()* function twice.

In [None]:
# Two-way split: 90% training, 10% testing.
# Shown for comparison only; these variables are overwritten by the three-way split below.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1111)

# Create temporary training and final testing datasets (80% / 20%)
X_temp, X_test, y_temp, y_test =\
    train_test_split(X, y, test_size=0.2, random_state=1111)

# Create the final training and validation datasets.
# 25% of the temporary 80% is 20% of the total, giving a 60/20/20 split.
X_train, X_val, y_train, y_val =\
    train_test_split(X_temp, y_temp, test_size=0.25, random_state=1111)
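
As a sanity check, the sketch below applies the same two-step split to a synthetic array of 1,000 rows (an assumption for illustration, not the tic-tac-toe data) and confirms that it yields a 60/20/20 division.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 3 features (illustrative only)
X_demo = np.arange(3000).reshape(1000, 3)
y_demo = np.arange(1000) % 2

# First split: hold out 20% of all rows as the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)

# Second split: 25% of the remaining 80% becomes validation
# (0.25 * 0.8 = 0.2 of the total)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1111)

print(len(X_tr), len(X_va), len(X_te))  # 600 200 200
```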

# Accuracy Metrics

The right accuracy metric always depends on the application.

## Regression Models 

MAE and MSE are expressed in different units (MSE is in squared units of the target), so their values should not be compared directly.

### Mean Absolute Error (MAE)

- Simplest and most intuitive metric
- Treats all points equally
- Less sensitive to outliers than MSE

$$MAE = \frac{\sum_{i=1}^{n} |y_i - \hat{y_i}|}{n}$$

In [None]:
from sklearn.metrics import mean_absolute_error

# mean_absolute_error(y_test, test_predictions)

### Mean Squared Error (MSE)

- Most widely used regression metric
- Allows outlier errors to contribute more to the overall error

$$MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{n}$$

In [None]:
from sklearn.metrics import mean_squared_error

# mean_squared_error(y_test, test_predictions)
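
To make the difference between the two metrics concrete, here is a small sketch with made-up numbers (not output from any model in this notebook). It compares a set of predictions where every error is 1 with a set where one error is 9, and checks the sklearn results against the formulas.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration
y_true = np.array([10.0, 10.0, 10.0, 10.0])
pred_small = np.array([11.0, 11.0, 11.0, 11.0])    # every error is 1
pred_outlier = np.array([11.0, 11.0, 11.0, 19.0])  # one error of 9

mae_small = mean_absolute_error(y_true, pred_small)
mse_small = mean_squared_error(y_true, pred_small)
mae_outlier = mean_absolute_error(y_true, pred_outlier)
mse_outlier = mean_squared_error(y_true, pred_outlier)

# The same quantities computed directly from the formulas
manual_mae = np.mean(np.abs(y_true - pred_outlier))
manual_mse = np.mean((y_true - pred_outlier) ** 2)

print(mae_small, mse_small)      # 1.0 1.0
print(mae_outlier, mse_outlier)  # 3.0 21.0
```

A single large error triples the MAE but multiplies the MSE by 21, which is why MSE is the metric to pick when large misses should be penalized heavily.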

Sometimes we are interested in knowing how well our model performs on a particular subset of the data.
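
One way to do this, sketched below with made-up values, is to filter the true and predicted values with a boolean mask before computing the metric. The arrays and the *is_chocolate* flag are hypothetical and only stand in for a real feature column.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values, predictions, and a flag marking the subset of interest
y_true = np.array([50.0, 60.0, 70.0, 40.0, 30.0, 80.0])
y_pred = np.array([52.0, 57.0, 70.0, 48.0, 36.0, 79.0])
is_chocolate = np.array([True, True, True, False, False, True])

# Error on all rows, on the subset, and on the remaining rows
mae_all = mean_absolute_error(y_true, y_pred)
mae_subset = mean_absolute_error(y_true[is_chocolate], y_pred[is_chocolate])
mae_rest = mean_absolute_error(y_true[~is_chocolate], y_pred[~is_chocolate])

print(mae_all, mae_subset, mae_rest)
```

Here the overall error hides the fact that the model does much better on the flagged rows (MAE 1.5) than on the rest (MAE 7.0).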

## Classification Models 

There are several accuracy metrics for classification models: precision, recall, accuracy, F1 score, and others.
They can all be easily calculated from the confusion matrix:

|                | Predicted 0  | Predicted 1  |
|----------------|--------------|--------------|
| **Actual 0**   | 23 (TN)      | 7 (FP)       |
| **Actual 1**   | 8 (FN)       | 62 (TP)      |

- **True Positive (TP)**: Predicted 1, actual 1
- **True Negative (TN)**: Predicted 0, actual 0
- **False Positive (FP)**: Predicted 1, actual 0
- **False Negative (FN)**: Predicted 0, actual 1

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, test_predictions)
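
For binary labels 0 and 1, sklearn's *confusion_matrix()* returns the counts with actual classes as rows and predicted classes as columns, the same layout as the table above. The short sketch below uses hand-made labels (not model output) to show how the four counts can be unpacked.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration
actual    = [0, 0, 0, 1, 1, 1, 1]
predicted = [0, 1, 0, 1, 1, 0, 1]

cm_demo = confusion_matrix(actual, predicted)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm_demo.ravel()
print(tn, fp, fn, tp)  # 2 1 1 3
```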


### Accuracy

- Represents the ability of our model to predict the correct class
  
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

### Precision

- Used when we don't want to overpredict the positive class
  
$$Precision = \frac{TP}{TP + FP}$$

### Recall

- It's about finding all the positive values
- Used when we can't afford to miss any positive values.
  
$$Recall = \frac{TP}{TP + FN}$$

### F1 Score

The F1 score is the harmonic mean of precision and recall, providing a single metric that balances the trade-off between the two. 

It is particularly useful when the dataset is imbalanced, meaning one class is significantly more common than the other. In such cases, relying solely on accuracy can be misleading, as it may mask poor performance on the minority class. 

The F1 score helps to account for both false positives and false negatives, making it ideal when both types of errors are important.

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
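
As a worked example, the four formulas can be applied by hand to the counts in the confusion matrix table from this section (TN=23, FP=7, FN=8, TP=62):

```python
# Counts taken from the confusion matrix table
tn, fp, fn, tp = 23, 7, 8, 62

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.850 precision=0.899 recall=0.886 f1=0.892
```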

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy_score(y_test, test_predictions)
precision_score(y_test, test_predictions)
recall_score(y_test, test_predictions)
f1_score(y_test, test_predictions)

# The Bias-Variance tradeoff

## Variance

- Variance occurs when the model pays too much attention to the training data and fails to generalize.
- Low training error but high testing error
- Occurs when models are overfit and have high complexity
- **Overfitting** happens when our model starts to attach meaning to the noise in our data.
- You can spot overfitting because the training error is much lower than the test error.

## Bias 

- Failing to find the relationship between the data and the response
- Leads to high training and test error
- Occurs when models are underfit
- Underfitting occurs when the model cannot find patterns in the data
- Underfitting is difficult to spot since both training and test errors are high.
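
Both failure modes can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions: it uses synthetic noisy sine data and decision trees (neither appears elsewhere in this notebook), with a depth-1 tree standing in for a high-bias model and an unrestricted tree for a high-variance one.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for label, depth in [("underfit", 1), ("overfit", None)]:
    # max_depth=1 is too simple for a sine curve; max_depth=None grows until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, tree.predict(X_tr))
    test_mse = mean_squared_error(y_te, tree.predict(X_te))
    results[label] = (train_mse, test_mse)
    print(f"{label}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The unrestricted tree memorizes the training set (training error near zero) but does worse on the test set, while the depth-1 tree has a clearly non-zero error even on the data it was trained on.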

# Cross Validation

When using holdout sets, models and their accuracy scores can depend heavily on which observations end up in each set.

Cross validation helps mitigate this split dependency of the holdout approach.

Cross validation divides the training data into n folds, trains on n-1 folds and validates on the remaining fold. It runs training and validation n times, rotating the validation fold each time.

When using cross validation we often report the mean of the errors as the overall error. We can also calculate the standard deviation.
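
The rotation of the validation fold can be made visible with sklearn's *KFold* splitter. This is a minimal sketch on 10 toy observations (an assumption for illustration); it only prints the index split and does not train a model.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 toy observations

kf = KFold(n_splits=5, shuffle=True, random_state=1111)
validated = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo), start=1):
    print(f"fold {fold}: train on {len(train_idx)} rows, validate on rows {val_idx.tolist()}")
    validated.extend(val_idx.tolist())

# Every observation is used for validation exactly once
print(sorted(validated) == list(range(10)))  # True
```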

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer

candy_data = pd.read_csv('../data/candy-data.csv')
X = candy_data.iloc[:, 1:11]
# Use a 1-D Series for the target, which is the shape sklearn estimators expect
y = candy_data['winpercent']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

In [None]:
# A regressor, since winpercent is a continuous target
rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
mse = make_scorer(mean_squared_error)

# Set up cross_val_score
cv = cross_val_score(estimator=rfr,
                     X=X_train,
                     y=y_train,
                     cv=10,
                     scoring=mse)

# Print the mean error
print(cv.mean())
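
As an alternative to building a scorer with *make_scorer()*, sklearn also accepts the built-in scoring string "neg_mean_squared_error". It returns the negated MSE so that higher is always better, which matters if the scores are later used to choose between models. A minimal sketch, assuming synthetic data and a linear model to keep the example fast:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, noise=10.0, random_state=1111)

neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=5, scoring="neg_mean_squared_error")

# Scores follow the "higher is better" convention, so flip the sign to get MSE
mse_per_fold = -neg_mse
print(mse_per_fold.mean(), mse_per_fold.std())
```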

# Leave One Out Cross Validation (LOOCV)

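LOOCV is cross validation with the number of folds equal to the number of observations: each model is trained on all the data except one point and tested on that point.

It is useful when the data is limited.

Because each model trains on almost all of the data, it makes the most of the available observations when estimating the error.

It is computationally expensive, since it fits one model per observation.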
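
sklearn provides a dedicated *LeaveOneOut* splitter that makes the intent explicit, instead of passing the number of observations as the number of folds. The sketch below assumes synthetic data and a linear model so that it runs quickly; the built-in "neg_mean_absolute_error" scorer returns negated errors, so the sign is flipped.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data with 30 observations (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=30, n_features=3, noise=5.0, random_state=1111)

loo = LeaveOneOut()
neg_mae = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=loo, scoring="neg_mean_absolute_error")

# One absolute error per observation, since each fold holds out a single point
abs_errors = -neg_mae
print(len(abs_errors))  # 30
print(np.mean(abs_errors), np.std(abs_errors))
```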



In [None]:
from sklearn.metrics import mean_absolute_error, make_scorer

# Create scorer
mae_scorer = make_scorer(mean_absolute_error)

rfr = RandomForestRegressor(n_estimators=15, random_state=1111)

# Implement LOOCV
scores = cross_val_score(rfr, X=X, y=y, cv=X.shape[0], scoring=mae_scorer)

# Print the mean and standard deviation
print("The mean of the errors is: %s." % np.mean(scores))
print("The standard deviation of the errors is: %s." % np.std(scores))