
--- 
## L08 - Generalization error

<img src="https://itundervisning.ase.au.dk/SWMAL/L08/Figs/dl_generalization_error.png" alt="WARNING: could not get image from server." style="height:500px">

### Qa) On Generalization Error

A detailed description of figure 5.3 is given below, divided into the concepts listed in the assignment.
 

__<u>Training/Generalization Error</u>__

In the figure, the __training error__ is shown by the dotted blue line. It is the error / loss observed during the model's training phase, calculated from the difference between the model's predictions and the actual values in the training set. It shows how well the model fits the patterns in the training data.

__Generalization error__, shown by the green line in the graph, is the error / loss of the trained model on new data, i.e. data not used during the training phase. It therefore measures how well the model can make predictions on new and unseen data. In practice it is estimated on a held-out validation or test set.
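The difference between the two errors can be illustrated with a minimal sketch. The toy data (a noisy quadratic, similar in spirit to the data used in Qb), the 50/50 split and the plain `LinearRegression` model are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy quadratic
X = 6 * rng.random((200, 1)) - 3
y = 2 + X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Training error: measured on the same data the model was fitted to
train_mse = mean_squared_error(y_train, model.predict(X_train))

# Estimate of the generalization error: measured on data the model has not seen
val_mse = mean_squared_error(y_val, model.predict(X_val))

print(f"train MSE={train_mse:.2f}, validation MSE={val_mse:.2f}")
```

The validation MSE is only an estimate of the generalization error, since it is computed on a finite held-out sample.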

__<u>Underfit / overfit zone</u>__

The __underfit zone/regime__, seen to the left in the graph, represents a situation where the model's capacity is too low. In this zone, both the training and generalization error are high: the model fails to capture the underlying pattern, so it performs poorly on both training and unseen data.

The __overfit zone/regime__ represents situations where the capacity is too high, so the model starts fitting, e.g., the noise in the training data (it learns the training data too well, including patterns that do not generalize). The training error is thereby low, but the generalization error is high.

__<u>Generalization gap</u>__

The __generalization gap__ is the difference between the generalization error and the training error. It grows towards the right in the graph as the capacity is increased. Beyond the optimal capacity, the growth of the gap outweighs the decrease in training error, so the generalization error rises (overfitting).
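As a rough sketch, the gap can be computed as validation error minus training error for a range of capacities. The toy data, the split and the degree range 1 to 6 are assumptions made for this illustration. Since every higher-degree polynomial contains the lower-degree ones, the training error of a least-squares fit cannot increase with the degree, while the gap is free to grow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Hypothetical toy data: a noisy quadratic
X = 6 * rng.random((60, 1)) - 3
y = 2 + X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(size=60)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=1)

train_mses, gaps = [], []
for degree in range(1, 7):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression())
    model.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))

    train_mses.append(train_mse)
    gaps.append(val_mse - train_mse)  # generalization gap (estimated)
    print(f"degree={degree}, train MSE={train_mse:.2f}, gap={gaps[-1]:.2f}")
```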

__<u>Optimal capacity</u>__

The __optimal capacity__ represents a 'sweet spot' where the model strikes the best balance between underfitting and overfitting, i.e. where the generalization error is at its minimum. At this capacity the model is able to find the patterns in the data without fitting the noise, so it performs well on both training and unseen data.

__<u>The axes</u>__

The y-axis represents the model's error / loss, while the x-axis represents the model's capacity / complexity. Increasing a model's capacity allows it to better capture the underlying patterns in the data, which is seen in the graph as a decrease in error. Increasing it too much makes the model start overfitting, as previously explained.
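For polynomial regression, one concrete way to think about capacity is the polynomial degree: each extra degree adds one more feature (and thus one more weight) to the model. The short sketch below illustrates this with `PolynomialFeatures` for a single input feature; the chosen degrees are only examples.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single input feature, as in the exercises
X = np.linspace(-3, 3, 5).reshape(-1, 1)

feature_counts = []
for degree in [1, 2, 3, 90]:
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X)
    # With one input and no bias column, degree d yields the features x, x^2, ..., x^d
    feature_counts.append(poly.n_output_features_)
    print(f"degree={degree:2d} -> {poly.n_output_features_} features")
```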


### Qb) A MSE-Epoch/Error Plot

Now we will look into the SGD model for fitting a polynomial (__polynomial regression__), described similarly in [HOML] ("Polynomial Regression" + "Learning Curves").  

The code is reviewed one part at a time, and the important points are described. 

__<u>Part I</u>__

Once again, data is generated by the `GenerateData` function, which also adds some noise to the dataset. We split the data into training and validation sets using the `train_test_split` function. Note that only the first 50 samples are used, and `test_size=0.5` splits them into 25 training and 25 validation samples. By using a degree-90 polynomial for the polynomial regression, we produce a model with a very high capacity. The `PolynomialFeatures` step is added to a pipeline and followed by a `StandardScaler`, which ensures that each feature of the training set has a mean of 0 and a standard deviation of 1 (as done in previous assignments). Lastly, the pipeline is fitted on the training set and then used to transform both the training and validation sets.
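The effect of the pipeline can be checked with a small sketch. The degree of 3 and the random toy inputs are assumptions made to keep the numbers readable (the exercise itself uses degree 90). The point is that `fit_transform` learns the mean and standard deviation from the training set, while `transform` only reuses them for the validation set.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X_train = 6 * rng.random((25, 1)) - 3
X_val = 6 * rng.random((25, 1)) - 3

# degree=3 is an assumption for readability; the exercise uses degree=90
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3, include_bias=False)),
    ("std_scaler", StandardScaler()),
])

# Learn the scaling statistics on the training set only ...
X_train_scaled = poly_scaler.fit_transform(X_train)
# ... and reuse them for the validation set
X_val_scaled = poly_scaler.transform(X_val)

print("train mean:", X_train_scaled.mean(axis=0).round(6))
print("train std :", X_train_scaled.std(axis=0).round(6))
print("val mean  :", X_val_scaled.mean(axis=0).round(2))
```

The validation features are typically close to, but not exactly, zero mean and unit variance, because the statistics come from the training set.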

In [None]:
# Run code: Qb(part I)
# NOTE: modified code from [GITHOML], 04_training_linear_models.ipynb

%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)

def GenerateData():
    m = 100
    X = 6 * np.random.rand(m, 1) - 3
    y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)
    return X, y

X, y = GenerateData()
X_train, X_val, y_train, y_val = train_test_split(
    X[:50], y[:50].ravel(),
    test_size=0.5,
    random_state=10)

print("X_train.shape=",X_train.shape)
print("X_val  .shape=",X_val.shape)
print("y_train.shape=",y_train.shape)
print("y_val  .shape=",y_val.shape)

poly_scaler = Pipeline([
        ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
        ("std_scaler", StandardScaler()),
    ])

X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled   = poly_scaler.transform(X_val)

X_new=np.linspace(-3, 3, 100).reshape(100, 1)
plt.plot(X, y, "b.", label="All X-y Data")
plt.xlabel("$x_1$", fontsize=18, )
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
plt.show()

print('OK')      

__<u>Part II</u>__

The `Train` function trains our SGD regressor for a specified number of epochs, defined by `n_epochs` in the code block.

An __epoch__ is one complete pass through our training data set in the training phase. With SGD, the model's parameters are updated once for every training sample during that pass.

We use this function to train our model and return the results in the `train_errors` and `val_errors` lists. The loop runs `n_epochs` times, here set to 500. In each iteration `fit` is called with `max_iter=1`, which in scikit-learn means a single pass over the training data (one epoch), not a single parameter update. Because `warm_start=True`, each call to `fit` continues from the parameters found in the previous call instead of starting over, so the 500 calls add up to 500 epochs of training.

Lastly, by setting `verbose` to true, we print out the `epoch`, `mse_train` and `mse_val`. The last two are metrics that we use to evaluate how the model performs on the training and validation sets, respectively. A lower MSE value indicates better performance. The `mse_train` tells us how well the model fits the training data, and the `mse_val` tells us how well it generalizes to unseen data.
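The epoch-by-epoch training pattern and the MSE metric can be illustrated in isolation. In the sketch below the toy linear data, the learning rate `eta0=0.01` and the 50 epochs are assumptions for illustration. It relies on `max_iter` in `SGDRegressor` counting passes over the training data, and on `warm_start=True` reusing the previous solution as the starting point of the next `fit` call.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 1 + 3x + noise
X = rng.uniform(-1, 1, size=(100, 1))
y = 1 + 3 * X[:, 0] + 0.1 * rng.normal(size=100)

sgd_reg = SGDRegressor(max_iter=1, penalty=None, eta0=0.01, warm_start=True,
                       learning_rate="constant", tol=None, random_state=0)

train_errors = []
with warnings.catch_warnings():
    # Stopping after one pass triggers a ConvergenceWarning on every call
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    for epoch in range(50):
        sgd_reg.fit(X, y)  # one more epoch, starting from the previous weights
        train_errors.append(mean_squared_error(y, sgd_reg.predict(X)))

# MSE is the mean of the squared differences between targets and predictions
y_pred = sgd_reg.predict(X)
mse_by_hand = np.mean((y - y_pred) ** 2)

print(f"first epoch MSE={train_errors[0]:.3f}, last epoch MSE={train_errors[-1]:.3f}")
print(f"MSE by hand={mse_by_hand:.3f}, passes in last fit={sgd_reg.n_iter_}")
```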

In [None]:
# Run code: Qb(part II)

def Train(X_train, y_train, X_val, y_val, n_epochs, verbose=False):
    print("Training...n_epochs=",n_epochs)
    
    train_errors, val_errors = [], []
    
    sgd_reg = SGDRegressor(max_iter=1,
                           penalty=None,
                           eta0=0.0005,
                           warm_start=True,
                           early_stopping=False,
                           learning_rate="constant", 
                           tol=-float(0), 
                           random_state=42)

    for epoch in range(n_epochs):
        
        sgd_reg.fit(X_train, y_train)
        
        y_train_predict = sgd_reg.predict(X_train)
        y_val_predict   = sgd_reg.predict(X_val)

        mse_train=mean_squared_error(y_train, y_train_predict)
        mse_val  =mean_squared_error(y_val  , y_val_predict)

        train_errors.append(mse_train)
        val_errors  .append(mse_val)
        if verbose:
            print(f"  epoch={epoch:4d}, mse_train={mse_train:4.2f}, mse_val={mse_val:4.2f}")

    return train_errors, val_errors

n_epochs = 500
train_errors, val_errors = Train(X_train_poly_scaled, y_train, X_val_poly_scaled, y_val, n_epochs, True)

print('OK')

__<u>Part III</u>__

The last code block determines the best model as the epoch with the lowest validation error (`best_epoch`) and computes the corresponding validation RMSE (`best_val_rmse`). We plot the RMSE for both the training and validation sets as a function of the epoch, and the best model is highlighted on the graph.
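The selection step itself is small enough to show on its own. The list of validation MSE values below is made up for illustration; the two lines that pick the best epoch mirror the ones used in the code cell.

```python
import numpy as np

# Hypothetical validation MSE per epoch (made-up numbers)
val_errors = [4.0, 2.25, 1.0, 1.21, 1.69]

# The index of the smallest MSE is the best epoch (epochs are counted from 0)
best_epoch = int(np.argmin(val_errors))

# RMSE is the square root of MSE; the minimum is at the same epoch for both
best_val_rmse = float(np.sqrt(val_errors[best_epoch]))

print(f"best_epoch={best_epoch}, best_val_rmse={best_val_rmse:.2f}")
```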

In [None]:
# Run code: Qb(part III)

best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])

plt.figure(figsize=(10,5))
plt.annotate('Best model',
             xy=(best_epoch, best_val_rmse),
             xytext=(best_epoch, best_val_rmse + 1),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.05),
             fontsize=16,
            )

best_val_rmse -= 0.03  # just to make the graph look better
plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)
plt.plot(np.sqrt(train_errors), "b--", linewidth=2, label="Training set")
plt.plot(np.sqrt(val_errors), "g-", linewidth=3, label="Validation set")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("RMSE", fontsize=14)
plt.show()

print('OK')

### Qc)  Early Stopping

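We could implement early stopping to prevent overfitting. We monitor the model's performance on our validation set during training and stop when the performance starts to get worse, i.e. when we observe that the validation error (`val_error`) is higher than the best value seen so far (`best_val_error`).

Below is a simple sketch that could implement early stopping. It reuses the imports and the scaled data from Qb (part I) and `n_epochs` from Qb (part II).

```python
# Same regressor settings as in the Train function from Qb (part II)
sgd_reg = SGDRegressor(max_iter=1, penalty=None, eta0=0.0005, warm_start=True,
                       learning_rate="constant", tol=-float(0), random_state=42)

best_val_error = float("inf")
best_epoch = None

for epoch in range(n_epochs):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # one more epoch, continuing from the last
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val_poly_scaled))

    if val_error < best_val_error:
        # Validation error improved: remember this epoch as the best so far
        best_val_error = val_error
        best_epoch = epoch
    elif val_error > best_val_error:
        # Validation error got worse: stop training
        print(f"Stopping training at epoch {epoch}: validation error increased.")
        break
```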
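Because SGD is noisy, the validation error can rise for a single epoch and then continue to fall, so stopping at the very first increase may stop too early. A common variation is to allow a number of epochs without improvement before stopping. The sketch below shows this idea on a made-up list of validation errors; the helper name `early_stopping_epoch` and the `patience` parameter are hypothetical and not part of the assignment code.

```python
def early_stopping_epoch(val_errors, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training is considered stopped once `patience` epochs have passed
    without any improvement over the best validation error seen so far.
    """
    best_val_error = float("inf")
    best_epoch = None

    for epoch, val_error in enumerate(val_errors):
        if val_error < best_val_error:
            # Improvement: remember this epoch as the best so far
            best_val_error = val_error
            best_epoch = epoch
        elif best_epoch is not None and epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here
            return best_epoch, epoch

    # Never triggered: training ran through all epochs
    return best_epoch, len(val_errors) - 1


# Hypothetical validation errors with a temporary bump at epoch 2
val_errors = [5.0, 3.0, 3.2, 2.5, 2.6, 2.7, 2.8, 2.9]

strict = early_stopping_epoch(val_errors, patience=1)
patient = early_stopping_epoch(val_errors, patience=3)
print("patience=1:", strict)
print("patience=3:", patient)
```

With `patience=1` the loop stops at the temporary bump and misses the lower error at epoch 3, whereas `patience=3` rides out the bump. In practice one would also keep a copy of the model from the best epoch, so that it can be restored after stopping.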

### Qd) Explain the Polynomial RMSE-Capacity plot

In the generated plot, the x-axis (_Capacity_) represents the capacity / complexity of the model, here the polynomial degree, and the y-axis (_RMSE_) represents the model's error.

The plot shows that as we increase the capacity, both the training and validation RMSE decrease at the beginning. This is because the model becomes more flexible as the capacity increases, so it is able to fit the data better.

However, around a degree of 3, we see that the validation RMSE starts to increase, even though the training RMSE still falls. This is because the model is becoming too complex and starts overfitting the data (i.e. it starts to capture noise). Increasing the capacity even further, we see on the plot that the overfitting increases. The training RMSE continues to fall, as the model gets better at fitting the noise in the training data (it becomes too specific to the given training data).
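The validation RMSE in the plot comes from 10-fold cross-validation. A minimal sketch of that step is shown below, with a degree of 3 and a fresh random generator as assumptions. Two details are worth noting: `scoring="neg_mean_squared_error"` returns the negative MSE (scikit-learn scorers follow a "higher is better" convention), so the sign must be flipped before taking the square root; and with an integer `cv` the folds for a regressor are taken in order without shuffling, so with sorted `X` the first and last folds test the model just outside the range it was trained on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Toy data in the style of the exercise: a noisy cosine on sorted inputs
X = np.sort(rng.random(30))[:, np.newaxis]
y = np.cos(1.5 * np.pi * X[:, 0]) + rng.normal(size=30) * 0.1

# degree=3 is an assumption for this illustration
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression())

# One (negative) MSE value per fold
scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)

# Flip the sign to get the mean MSE, then take the square root to get RMSE
rmse_cv = np.sqrt(-scores.mean())

print("per-fold MSE:", (-scores).round(3))
print(f"cross-validated RMSE={rmse_cv:.3f}")
```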

In [None]:
# Run and review this code
# NOTE: modified code from [GITHOML], 04_training_linear_models.ipynb

%matplotlib inline

from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

def GenerateData():
    n_samples = 30
    #degrees = [1, 4, 15]
    degrees = range(1,8)

    X = np.sort(np.random.rand(n_samples))
    y = true_fun(X) + np.random.randn(n_samples) * 0.1
    return X, y, degrees

np.random.seed(0)
X, y, degrees  = GenerateData()

print("Iterating...degrees=",degrees)
capacities, rmses_training, rmses_validation= [], [], []
for d in degrees:
    
    polynomial_features = PolynomialFeatures(degree=d, include_bias=False)
    
    linear_regression = LinearRegression()
    pipeline = Pipeline([
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression)
        ])
    
    Z = X[:, np.newaxis]
    pipeline.fit(Z, y)
    
    p = pipeline.predict(Z)
    train_rms = mean_squared_error(y,p)

    # Evaluate the models using crossvalidation
    scores = cross_val_score(pipeline, Z, y, scoring="neg_mean_squared_error", cv=10)
    score_mean = -scores.mean()
    
    rmse_training=sqrt(train_rms)
    rmse_validation=sqrt(score_mean)
    
    print(f"  degree={d:4d}, rmse_training={rmse_training:4.2f}, rmse_cv={rmse_validation:4.2f}")
    
    capacities      .append(d)
    rmses_training  .append(rmse_training)
    rmses_validation.append(rmse_validation)
    
plt.figure(figsize=(7,4))
plt.plot(capacities, rmses_training,  "b--", linewidth=2, label="training RMSE")
plt.plot(capacities, rmses_validation,"g-",  linewidth=2, label="validation RMSE")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Capacity", fontsize=14)
plt.ylabel("RMSE", fontsize=14)
plt.show()

print('OK')

When the range of degrees is extended up to 10 (`degrees = range(1,11)`), the plot below shows the validation RMSE growing very rapidly at the highest degrees. As explained in the `capacity_under_overfitting.ipynb` exercises, this is because the model overfits the training data and becomes bad at predicting new data. (The same happened in the Qa+b exercise when we set the degree too high, shown in the first three plots.)

In [None]:
# Run and review this code
# NOTE: modified code from [GITHOML], 04_training_linear_models.ipynb

%matplotlib inline

from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

def GenerateData():
    n_samples = 30
    #degrees = [1, 4, 15]
    degrees = range(1,11)

    X = np.sort(np.random.rand(n_samples))
    y = true_fun(X) + np.random.randn(n_samples) * 0.1
    return X, y, degrees

np.random.seed(0)
X, y, degrees  = GenerateData()

print("Iterating...degrees=",degrees)
capacities, rmses_training, rmses_validation= [], [], []
for d in degrees:
    
    polynomial_features = PolynomialFeatures(degree=d, include_bias=False)
    
    linear_regression = LinearRegression()
    pipeline = Pipeline([
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression)
        ])
    
    Z = X[:, np.newaxis]
    pipeline.fit(Z, y)
    
    p = pipeline.predict(Z)
    train_rms = mean_squared_error(y,p)

    # Evaluate the models using crossvalidation
    scores = cross_val_score(pipeline, Z, y, scoring="neg_mean_squared_error", cv=10)
    score_mean = -scores.mean()
    
    rmse_training=sqrt(train_rms)
    rmse_validation=sqrt(score_mean)
    
    print(f"  degree={d:4d}, rmse_training={rmse_training:4.2f}, rmse_cv={rmse_validation:4.2f}")
    
    capacities      .append(d)
    rmses_training  .append(rmse_training)
    rmses_validation.append(rmse_validation)
    
plt.figure(figsize=(7,4))
plt.plot(capacities, rmses_training,  "b--", linewidth=2, label="training RMSE")
plt.plot(capacities, rmses_validation,"g-",  linewidth=2, label="validation RMSE")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Capacity", fontsize=14)
plt.ylabel("RMSE", fontsize=14)
plt.show()

print('OK')