# Measuring Error in Machine Learning

### Mean Squared Error (MSE)

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

### Mean Absolute Error (MAE)

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$
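
As a quick check of both formulas, here is a minimal sketch on made-up numbers. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the wine data.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 40.0])

errors = y_true - y_pred        # [-2, 2, -3, 0]

# MSE: average of the squared errors -> (4 + 4 + 9 + 0) / 4
mse = np.mean(errors ** 2)

# MAE: average of the absolute errors -> (2 + 2 + 3 + 0) / 4
mae = np.mean(np.abs(errors))

print(mse, mae)
```

Squaring makes the single error of 3 count more heavily in the MSE than it does in the MAE.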

To calculate the error, we first need training data.

In [16]:
import pandas as pd

df = pd.read_csv("https://dlsun.github.io/pods/data/bordeaux.csv", index_col="year")

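# .loc slices by label and includes both endpoints,
# so the test set starts at 1981 to keep 1980 out of it.
df_train = df.loc[:1980].copy()
df_test = df.loc[1981:].copy()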


X_train = df_train[["win", "summer"]]
y_train = df_train["price"]

X_test = df_test[["win", "summer"]]
y_test = df_test["price"]


In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5)
)

pipeline.fit(X=X_train, y=y_train)
train_data = pipeline.predict(X_train)

In [7]:
MSE = ((y_train - train_data) ** 2).mean()

MSE

np.float64(207.24148148148146)

**Scikit-Learn** also provides a function for this, called `mean_squared_error`.

In [8]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, train_data)

207.24148148148146

The MSE is 207.24, measured in squared dollars, which is hard to interpret directly.
Taking the square root puts the error back in the original units: $\sqrt{207.24} \approx \$14.40$. This is called the **Root Mean Squared Error (RMSE)**.
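
The square-root step can be checked on a small example. This is only a sketch: the arrays below are hypothetical, and the RMSE is computed with `np.sqrt` on top of `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices (illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([1.0, 5.0, 7.0, 13.0])

# Squared errors are 4, 0, 0, 16, so the MSE is 20 / 4
mse = mean_squared_error(y_true, y_pred)

# RMSE is the square root of the MSE, in the same units as y
rmse = np.sqrt(mse)

print(mse, rmse)
```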

The 1-nearest neighbor to any observation in the training data is the observation itself.

In [9]:
pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=1)
)

pipeline.fit(X=X_train, y=y_train)
train_data = pipeline.predict(X=X_train)
mean_squared_error(y_train, train_data)

0.0

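A training error of 0 does not mean the model is perfect. It only shows that the model reproduces data it has already seen. Test error is usually larger than training error.

- **Training Error:** Error on the data the model was trained on.
- **Test Error:** Error on new, unseen data.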
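
To see the gap between the two errors, here is a minimal sketch on synthetic data. The data-generating rule and the 40/20 split are assumptions made for illustration, not part of the wine example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: y depends linearly on x, plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=60)

# First 40 rows for training, last 20 rows play the role of unseen data
X_train, y_train = X[:40], y[:40]
X_new, y_new = X[40:], y[40:]

model = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)

# Error on the data the model was fit to
train_mse = mean_squared_error(y_train, model.predict(X_train))
# Error on data the model has not seen
test_mse = mean_squared_error(y_new, model.predict(X_new))

print(train_mse, test_mse)
```

With 1 nearest neighbor, each training point is its own neighbor, so the training MSE is 0 while the error on the held-back rows is not.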

### Estimating Test Error

One way to estimate the test error is to fit the model on only part of the training data, leaving the remaining data for estimating the test error.
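
A minimal sketch of this hold-out idea, using `train_test_split` on synthetic data. The 50/50 split, the random seeds, and the data itself are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for a training set with two features
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 2))
y = X[:, 0] - X[:, 1] + rng.normal(0, 1, size=40)

# Keep half of the rows for fitting and half for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Fit on one half only, then measure the error on the held-out half
model = KNeighborsRegressor(n_neighbors=5).fit(X_fit, y_fit)
val_mse = mean_squared_error(y_val, model.predict(X_val))

print(len(X_fit), len(X_val), val_mse)
```

The validation MSE serves as the estimate of the test error, because the model never saw those rows while fitting.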

### Cross-Validation

Cross-validation is a technique for estimating how well a machine learning model performs on unseen data, which helps detect overfitting. It works by:
1. Splitting the dataset into several parts.
2. Training the model on some parts and testing it on the remaining part.
3. Repeating this resampling process multiple times by choosing different parts of the dataset.
4. Averaging the results from each validation step to get the final performance.

### K-Fold Cross Validation

One problem with splitting the data into two is that we only fit the model on half of the data.

A model trained on half of the data may be very different from a model trained on all of the data.

It may be better to split the data into $K$ samples and compute $K$ validation errors, one for each held-out sample.

### Implementing Cross-Validation in Scikit-Learn

1. Split the training data into $K$ samples
2. Hold out one sample at a time as a validation set
   1. Fit the model to the remaining $1 - 1/K$ of the data
   2. Predict the labels on the validation set
   3. Calculate the prediction error
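
The steps above can be sketched by hand with `KFold`. The synthetic data and the name `model` are assumptions for illustration. The comparison at the end relies on the assumption that an integer `cv` for a regressor corresponds to unshuffled `KFold` splits.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(40, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, size=40)

model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

fold_mse = []
# Step 1: split into K = 4 samples; step 2: hold out one at a time
for fit_idx, val_idx in KFold(n_splits=4).split(X):
    # 2.1 Fit on the remaining 3/4 of the data
    model.fit(X[fit_idx], y[fit_idx])
    # 2.2 Predict on the held-out sample
    y_val_pred = model.predict(X[val_idx])
    # 2.3 Calculate the prediction error
    fold_mse.append(mean_squared_error(y[val_idx], y_val_pred))

# Average the K validation errors into one estimate
cv_mse = np.mean(fold_mse)

# The same computation through the library helper (scores are negated MSEs)
sklearn_mse = -cross_val_score(
    model, X, y, scoring="neg_mean_squared_error", cv=4
)

print(fold_mse, cv_mse)
```

Writing the loop out makes it clear that the scaler is refit inside every fold, so the validation rows never influence the preprocessing.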

In [20]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipeline,
    X=df_train[["win", "summer"]],
    y=df_train["price"],
    scoring="neg_mean_squared_error",
    cv=4
)

scores

array([-554.85714286, -676.36      , -199.22285714,  -70.64666667])

- `df_train[["win", "summer"]]` is all of the training data; `cross_val_score` does the splitting itself.
- **cv=4:** Performs 4-fold cross-validation.
  - The training data is split into 4 parts.
  - Each part is used once as validation data.
- **scores:** Shows how well the model predicts wine prices on each of the 4 train-validation splits.
- **scoring="neg_mean_squared_error":** Scikit-Learn scores follow the convention that higher is better, so the MSE is reported with a negative sign. Values closer to 0 are better.

Overall estimate of test MSE is:

In [25]:
-scores.mean()

np.float64(375.27166666666665)

In [28]:
import numpy as np

mean = -scores.mean()
np.sqrt(mean)

np.float64(19.37192986428215)

**RMSE ≈ 19.37**  
This means that, across the validation folds, the wine prices predicted by the model typically deviate from the actual prices by roughly **19.37** (in the same units as `price`).