# **Lab 1A: Training, validation, model selection, and testing**
## The Salary versus Age example from Hull Section 1.4, analysed with Python
Hull Section 1.4 explains some of the most important principles of machine learning. Sticking to these principles will get you well on your way to producing valid and reliable results in your machine learning applications. The aim of this lab is to reproduce the results in Hull Section 1.4 as closely as possible.

This notebook is partially complete: complete and correct it where appropriate, and answer the questions included along the way.

In [None]:
import pandas as pd
import numpy as np

# plotting packages
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as clrs

np.set_printoptions(precision=3)  # print NumPy floats with 3 digits after the decimal point

## 1. Loading and splitting the data

In [None]:
# Load the data 
raw = pd.read_csv('Salary_vs_Age.csv')
# Check the data
print(raw.shape) # check dimensions
raw.head(10)    # check against Hull Table 1.1

<div style="background-color:#c2eafa">
We split the data into a training, validation, and test set as needed for the analysis with our regression models that follows.

In [None]:
raw_array = np.array(raw) # convert dataframe into numpy array 
training_data = raw_array[:10, :] # First ten instances will form the training data set (as in Table 1.1).
validation_data = raw_array[10:20, :] # The next ten instances will form the validation data (Table 1.2).
test_data = raw_array[20:, :] # The final ten instances will form the test set (Table 1.4)

<div style="background-color:#c2eafa">
    
**Question 1.** What is the purpose of each of the sets we created?

<div style="background-color:#f1be3e">

Write your answer here 
    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

## 2. Plotting Figure 1.2

<div style="background-color:#c2eafa">
We will now plot the training data; a plot has already been given, but it requires some sprucing up.

In [None]:
training_data

In [None]:
train_age, train_sal = training_data[:,0], training_data[:,1]
np.vstack([train_age, train_sal]).transpose()

<div style="background-color:#c2eafa">
    
Find out what type of object `ax` is and search the Matplotlib API reference for `matplotlib.axes.Axes.set` to find out how to
> * label the x- and y-axis of the plot;
> * add a title to the plot;
> * change the axis ranges.

Then add commands to make the figure below look exactly like Figure 1.2 in Hull. We have already passed an appropriate figure size argument to `subplots()` so that the proportions more or less match those of Figure 1.2.
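
As a minimal sketch of the `Axes.set` pattern, the snippet below applies labels, a title, and axis limits in a single call. The data points, label texts, and ranges are made-up placeholders, not the values needed for Figure 1.2.

```python
import matplotlib.pyplot as plt

# Placeholder data, unrelated to the salary data set
x = [1, 2, 3, 4]
y = [10, 20, 15, 30]

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y)

# Axes.set accepts Axes property names as keyword arguments
ax.set(
    xlabel="x label (placeholder)",
    ylabel="y label (placeholder)",
    title="Title (placeholder)",
    xlim=(0, 5),
    ylim=(0, 40),
)

# Inspect what was set
print(ax.get_xlabel(), ax.get_ylabel(), ax.get_title())
print(ax.get_xlim(), ax.get_ylim())
```

The same properties can also be set one at a time with `ax.set_xlabel`, `ax.set_ylabel`, `ax.set_title`, `ax.set_xlim`, and `ax.set_ylim`.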

In [None]:
# Create a figure 
fig, ax = plt.subplots(figsize = (8,5))  

# START ANSWER
# END ANSWER 

# Plot the training data 
ax.scatter(train_age, train_sal)

# START ANSWER
# END ANSWER

plt.show()

## 3. Fitting a fifth degree polynomial to the training data
Below we use NumPy's `polyfit` and `poly1d` because they seem an accessible route for this first lab; look them up in the NumPy API reference. At the end, we will redo some of the analyses using Scikit-learn methods.
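
To illustrate how the two functions fit together, here is a small sketch on synthetic points that lie exactly on a known quadratic. The data are invented for illustration and have nothing to do with the salary data.

```python
import numpy as np

# Synthetic points lying exactly on y = 2x^2 - 3x + 1 (not the salary data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x**2 - 3 * x + 1

# polyfit returns the least-squares coefficients, highest power first
coef = np.polyfit(x, y, 2)

# poly1d wraps the coefficients in a callable polynomial
p = np.poly1d(coef)

print(coef)    # approximately [ 2. -3.  1.]
print(p(5.0))  # approximately 36.0
```

Because the coefficients come back with the highest power first, the constant term is the last entry of `coef`.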

In [None]:
coef5 = np.polyfit(train_age, train_sal, 5)  # coefficients of LS fit, highest order first
for i in range(6):
    print(f"b{i}= {coef5[5-i]:+10.3e}") 
print("")
p5 = np.poly1d(coef5)  # creates a polynomial function with coefficients from coef5
print(np.vstack((train_age, train_sal, p5(train_age))).transpose())

<div style="background-color:#c2eafa">
    
**Question 2.** What do you notice about the fitted coefficients? Can you explain (some of) it?

**Question 3.** What's in the matrix that is printed?

<div style="background-color:#f1be3e">
    
Write your answer here 

[//]: # (START ANSWER)
[//]: # (END ANSWER)

<div style="background-color:#c2eafa">
    
Compute the residuals of this model fit, as well as the root mean squared error (RMSE).

In [None]:
train5_res = None
train5_rmse = None

# START ANSWER
# END ANSWER
print(f"rmse = {train5_rmse:5.0f}")

<div style="background-color:#c2eafa">

This is not the same as the 12902 reported in Hull, which is a factor sqrt(10/9) larger. Look up `numpy.std` and experiment with `ddof` until you reproduce Hull's value: it seems that Hull uses *root mean squared error* and *standard deviation of the errors* interchangeably, although the two are not the same.

Below, add commands to make the figure look exactly like Figure 1.3 in Hull.
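
The relation between the two quantities can be sketched on made-up numbers. For residuals with mean zero (which holds for a least-squares fit that includes an intercept), the RMSE equals `np.std` with `ddof=0`, whereas `ddof=1` divides by n - 1 instead of n and is therefore a factor sqrt(n/(n-1)) larger. The residuals below are placeholders chosen to have mean zero; they are not the residuals of this lab.

```python
import numpy as np

# Made-up residuals with mean zero (not the lab's residuals)
residuals = np.array([1.0, -2.0, 3.0, -1.0, -1.0])
n = residuals.size

rmse = np.sqrt(np.mean(residuals**2))  # sum of squares divided by n
std_ddof0 = np.std(residuals, ddof=0)  # divides by n
std_ddof1 = np.std(residuals, ddof=1)  # divides by n - 1

print(rmse, std_ddof0, std_ddof1)
print(std_ddof1 / rmse, np.sqrt(n / (n - 1)))  # the two ratios agree
```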

In [None]:
# Create figure that looks exactly like Hull Figure 1.3
fig, ax = plt.subplots(figsize = [8,5])  
ax.scatter(train_age, train_sal)
age_range = np.linspace(np.min(train_age), np.max(train_age), 1000)
ax.plot(age_range, p5(age_range), color='red', label='Degree 5 polynomial fit')

# START ANSWER
# END ANSWER

# Display the plot
plt.show()

## 4. Plotting Figure 1.4 (the validation data) and adding the fitted curve
Use the validation data to reproduce Hull Figure 1.4 exactly. Add the fitted fifth-degree polynomial we just determined to the figure.

In [None]:
val_age, val_sal = validation_data[:,0], validation_data[:,1]

# Create figure that looks (almost) exactly like Hull Figure 1.4
fig, ax = plt.subplots(figsize = [8,5])  
ax.scatter(val_age, val_sal)

# START ANSWER
# END ANSWER

# START ANSWER
# END ANSWER

# Display the plot
plt.show()

<div style="background-color:#c2eafa">

**Question 4.** What is your conclusion about the fitted model, judging by the plot you just made?

<div style="background-color:#f1be3e">

Write your answer here 
    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

<div style="background-color:#c2eafa">
    
Compute the residuals of the fitted model on the validation data, and then the root mean squared error.

In [None]:
val5_res = None
val5_rmse = None

# START ANSWER
# END ANSWER

print(f"rmse = {val5_rmse:5.0f}")

<div style="background-color:#c2eafa">
    
**Question 5.** Are these results signs of overfitting or underfitting? Explain.

<div style="background-color:#f1be3e">
    
Write your answer here 

[//]: # (START ANSWER)
[//]: # (END ANSWER)

## 5. Fit a quadratic model to the training data and reproduce Figure 1.5
You can do this on your own.

In [None]:
# START ANSWER
# END ANSWER

In [None]:
train2_res = None
train2_rmse = None

# START ANSWER
# END ANSWER
print(f"rmse training set: {train2_rmse:5.0f}")

In [None]:
val2_res = None
val2_rmse = None

# START ANSWER
# END ANSWER
print(f"rmse validation set: {val2_rmse:5.0f}")

## 6. Fit a linear model to the training data and reproduce Figure 1.6
Just go ahead.

In [None]:
# START ANSWER
# END ANSWER

In [None]:
train1_res = None
train1_rmse = None

# START ANSWER
# END ANSWER
print(f"rmse training set  : {train1_rmse:5.0f}")

val1_res = None
val1_rmse = None

# START ANSWER
# END ANSWER
print(f"rmse validation set: {val1_rmse:5.0f}")

## 7. Compute root mean square errors and reproduce Table 1.3
Go ahead:

In [None]:
# make the table:
# START ANSWER
# END ANSWER

## 8. Redo all regression models and a bit more using Scikit-learn
What we did above followed the exposition of Section 1.4. Once you have an overview, there may be a more efficient way to get all the results. You may already have done some things more efficiently than in the answers above.
Below we use Scikit-learn tools (in Python, Scikit-learn is imported as `sklearn`):
 + `sklearn.linear_model.LinearRegression`, and
 + `sklearn.preprocessing.PolynomialFeatures`

Check out what [PolynomialFeatures](https://scikit-learn.org/stable/modules/preprocessing.html#generating-polynomial-features) does.
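
As a quick illustration of the transformation, the sketch below expands a toy single-feature input into polynomial columns. The input values are arbitrary and not taken from the salary data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy single-feature input (not the salary data); sklearn expects a 2-D array
x = np.array([2.0, 3.0]).reshape(-1, 1)

poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)  # columns: 1, x, x^2, x^3

print(x_poly)  # rows: [1, 2, 4, 8] and [1, 3, 9, 27]
```

The leading column of ones comes from the default `include_bias=True`.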

<div style="background-color:#c2eafa">
We will now create five regression models, of degree 1 to 5. Each model will be trained on the training data and then evaluated on the validation data. We will store the RMSE on the training and validation data in lists. Complete the code block by calculating the RMSE for the training set and the validation set and storing them in the lists defined below.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X_train, y_train = training_data[:, 0], training_data[:, 1] 
X_val, y_val = validation_data[:, 0], validation_data[:, 1]
X_test, y_test = test_data[:, 0], test_data[:, 1] 

# Degrees 
degrees = [1,2,3,4,5] 
colors = ['green', 'blue', 'red', 'purple', 'magenta']
train_rmse = [] 
validation_rmse = [] 

for i in degrees: 
    plt.subplots(figsize = [8,5])
    plt.scatter(X_train, y_train, label = 'Training data')
    plt.scatter(X_val, y_val, color = 'red', marker = '^', label = 'Validation data')

    # Create polynomial features
    poly = PolynomialFeatures(degree=i)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    # Reuse the transformer fitted on the training data; do not refit on the validation data
    X_val_poly = poly.transform(X_val.reshape(-1, 1))

    # Fit a linear regression model
    model = LinearRegression()
    model.fit(X_train_poly, y_train) 
    
    train_predictions = model.predict(X_train_poly) 
    val_predictions = model.predict(X_val_poly) 
    
    # START ANSWER 
    # END ANSWER
    
    
    # Evaluate the fitted model on a dense age grid so the curve is drawn smoothly
    age_grid = np.linspace(X_train.min(), X_train.max(), 200)
    grid_predictions = model.predict(poly.transform(age_grid.reshape(-1, 1)))
    plt.plot(age_grid, grid_predictions, color=colors[i-1], label=f"Degree {i}")

    # Add labels and a legend
    plt.xlabel('Age (years)')
    plt.ylabel('Salary ($)')
    plt.title(f"Regression model for degree {i}")
    plt.legend()

    # Display the plot
    plt.show()

<div style="background-color:#c2eafa">

Plot the training RMSE and the validation RMSE of the different models on the same graph.

In [None]:
# START ANSWER
# END ANSWER

<div style="background-color:#c2eafa">
Create a table showing the RMSE on the training and validation sets for the different regression models.

In [None]:
# START ANSWER 
# END ANSWER 

<div style="background-color:#c2eafa">

**Question 6.** Comment on things you notice about the graphs and the table. Rank the models in terms of performance. Which models display signs of overfitting or underfitting? Does the best-fitting model generalize well from the training set to the validation set?

<div style="background-color:#f1be3e">

Write your answer here 

    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

<div style="background-color:#c2eafa">
    
Determine the final prediction RMSE on the **test** set using the model that performed best on the validation set.

In [None]:
# START ANSWER 
# END ANSWER 