# Overfitting Test Notebook

This notebook is designed as a showcase for overfitting. 
Overfitting can happen if you try to explain data with some kind of model (e.g., a function). Data - especially if generated by physical sensors - is usually noisy. This means, in addition to the valuable information in the data there is a bunch of garbage information that is not valuable to us. 

To illustrate this, imagine a microphone. It picks up the voice of the person speaking into it, but it will also pick up other noises around the microphone like typing on a keyboard or the fan of your computer. And it may also pick up static noise. If we wanted to create a transcript of the speaker, we would like to focus only on the spoken words. All other sounds are noise that we want to ignore. 

In this notebook we will look at the problem of overfitting specifically for machine learning. But you should be aware that the problem also exists outside of machine learning. Deep knowledge of machine learning will not be required for this notebook. Everything you need will be explained in the accompanying text.

Before we start, let's do some coding busywork so we have less clutter later on. First, let's import everything we need later:


In [None]:
## IMPORTS

# general imports
import numpy as np 
import pandas as pd 
import math
import matplotlib.pyplot as plt
import tensorflow as tf

# sklearn utility
from sklearn.metrics import mean_squared_error

# machine learning models
from sklearn import svm
from sklearn import linear_model
from sklearn import neighbors
from sklearn import gaussian_process
from sklearn import cross_decomposition
from sklearn import tree 



Second, let's define some helper functions that we can use later to make the code more readable.
You don't need to understand them to follow this notebook.

In [None]:

def f_pure(input):
  """Produces data samples without noise"""
  return math.sin(input*10) + math.cos(1.8*input*10)


def f_noisy(input):
  """Function for producing noisy data samples. It calls f_pure and adds noise"""
  return f_pure(input) + np.random.uniform(-0.3,0.3)


def samples(number):
    """ creates a number of samples. 
        Returns a data frame for the input and a list for output (i.e., it's compatible with the machine learning libraries)"""
    data = {"key":np.random.rand(number)}
    x_train = pd.DataFrame(data=data)

    y_train = [f_noisy(x) for x in data["key"]]
    return x_train, y_train


def plot_data(x_train, y_train, x_val = None, y_val = None, show_functions = False, model = None):
    """ plots our data and additional information as configured by input parameters """
    # create plot
    fig, ax = plt.subplots()

    # plot samples 
    ax.scatter(x_train, y_train)

    # plot test samples if needed
    if x_val is not None and y_val is not None:
        ax.scatter(x_val,y_val,c='g')

    # plot the function f_pure if needed
    if show_functions:
        f_x = np.arange(0,1,0.001)
        f_y = [f_pure(x) for x in f_x]
        plt.plot(f_x, f_y,'y')

    # plot the prediction of the learned model if needed
    if model is not None:
        f_x = np.arange(0,1,0.001)
        df2 = pd.DataFrame(data={"key":f_x})
        plt.plot(f_x, model.predict(df2),'r') 

    # show plot
    plt.show()


## The Data Set

Before looking at overfitting, we first need some data to overfit to. Let's make a small dataset with one feature and one label that can easily be plotted for illustration. You can imagine it as the input and output of a one-dimensional function (i.e., a function that has one input value and one output value).

Normally, this is where you would load a dataset. Here, we're not too concerned with what the data represents and will just generate a fake dataset that fits our requirements. This dataset is generated by using the function f(x) = sin(10x) + cos(1.8 * 10x) and adding some noise in the form of a random number between -0.3 and 0.3.

In the following cell we calculate 100 random samples of this function and store them into:
- x_train: the input of the function
- y_train: the output (or f(x) value) of the function with noise

If you run the cell you see a graphical representation of the points we've created. 
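
As a rough sketch of what this generation step amounts to, the snippet below rebuilds it in vectorized form. The name `f_pure_vec` and the fixed seed are assumptions made for illustration and are not part of the notebook's helpers. It checks that each noisy sample stays within 0.3 of the noise-free function.

```python
import numpy as np

def f_pure_vec(x):
    """Noise-free function; a vectorized stand-in for f_pure (hypothetical helper)."""
    return np.sin(10 * x) + np.cos(1.8 * 10 * x)

rng = np.random.default_rng(seed=0)       # fixed seed so the sketch is repeatable
x = rng.random(100)                       # 100 inputs drawn from [0, 1)
noise = rng.uniform(-0.3, 0.3, size=100)  # uniform noise, same range as f_noisy
y = f_pure_vec(x) + noise                 # noisy outputs

# largest distance between a noisy sample and the noise-free function
max_deviation = np.max(np.abs(y - f_pure_vec(x)))
print(max_deviation)
```

The printed value never exceeds 0.3, which is a useful yardstick when you later judge how closely a model should follow the points.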

### Reflection Questions

1. Based on these points, can you draw the original function? Keep in mind that the points contain noise.
2. Do you see any patterns in this drawing that are not part of the function but are created by noise?
3. If you run the cell again, you see 100 different random points. How did this change the patterns you've identified in question 2?

In [None]:

x_train, y_train = samples(100)

plot_data(x_train["key"],y_train)

## Machine learning

In machine learning, our main task is to select a model, which is then trained ("learned") to fit the data.
You can imagine a "model" as defining the shape of a function (e.g., a straight line, a parabola, etc). The learning process will then try to find a function that conforms to this shape and is as close as possible to the data points.

An example: Let's say we suspect a linear function to be a good explanation of our data. We could then try to do linear regression and find a linear function of the form f(x) = ax + b that is as close as possible to our data points. The learning process would be responsible for finding the best values for a and b.
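
To make "finding the best values for a and b" concrete, here is a minimal sketch on toy data. It uses NumPy's least-squares polynomial fit rather than the scikit-learn model used in this notebook, and the data points are invented for illustration.

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1 (values chosen only for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Least-squares fit of a degree-1 polynomial.
# np.polyfit returns the coefficients with the highest power first: [a, b]
a, b = np.polyfit(x, y, deg=1)
print(a, b)
```

Because the toy points lie exactly on a line, the fit recovers a slope close to 2 and an intercept close to 1. With noisy data, the result is the line with the smallest summed squared distance to the points.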

And this, at its core, is machine learning. It consists of three steps:

1. create a model
2. learn the model 
3. visualize the results 

And this is why machine learning can often be implemented in a handful of lines of code. Check out the cell below to see how the three steps are implemented. It tries to fit a linear function to our data.



In [None]:
# 1. create a model
model = linear_model.LinearRegression()

# 2. learn the model 
model.fit(x_train, y_train)

# 3. visualize the results 
print("mean squared error: " + str(mean_squared_error(y_train, model.predict(x_train))))
plot_data(x_train["key"],y_train,model= model)




Just from looking at the output of the cell it is obvious that the linear model is not a good model to explain our data. It's pretty far away from our data points.

To capture this numerically, we use an error function. Here, we use the mean squared error (MSE), which is the average squared distance between the outputs of our data samples $y_i$ and the values $\hat{y}_i$ that the model predicts for them.
$$
MSE = \frac{\sum _{i\in \{1...N\}} (y_i - \hat{y} _i)^2}{N}
$$
If our function perfectly captures all data points the MSE is 0. In our case the MSE is somewhere around 0.9 (depending on your random samples). Considering the overall variation of our points is between -2 and 2, this is pretty bad.
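
The formula can be worked through by hand on a few numbers. The sketch below uses three invented label and prediction pairs. `mean_squared_error` from scikit-learn computes the same quantity.

```python
import numpy as np

# Three made-up labels and the corresponding (made-up) predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.0])

# Squared distance per sample: 0.0, 0.25 and 1.0
squared_errors = (y_true - y_pred) ** 2

# Average over the N = 3 samples: (0.0 + 0.25 + 1.0) / 3
mse = squared_errors.mean()
print(mse)
```

Note how the single prediction that is off by 1.0 contributes four times as much as the one that is off by 0.5: squaring penalizes large misses more heavily.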

And this is where your tasks come in. Let's find a better model to explain our data.


### Task 1 - Model selection

In step 1 of the cell below, there are several machine learning models that you can try. Your task is to try them and determine which one has the lowest mean square error. Just comment the respective lines in or out to test them.

Once you have found the best model, please copy it into this cell. You'll need it later, so you don't want to lose it when doing the next task.

Answer: \<copy of your line of code here\>


### Reflection Question:

1. Do you think the model with the best mean square error is also best suited to explain the data?
2. Which one do you think is better?


In [None]:
# 1. create a model
model = svm.SVR()
#model = neighbors.KNeighborsRegressor()
#model = gaussian_process.GaussianProcessRegressor()
#model = tree.DecisionTreeRegressor()

# 2. learn the model 
model.fit(x_train, y_train)

# 3. visualize the results 
print("mean squared error: " + str(mean_squared_error(y_train, model.predict(x_train))))
plot_data(x_train["key"],y_train,model= model)



### Task 2 - Hyper Parameter Selection

In addition to selecting models, a machine learning expert can select hyper parameters. These are parameters that are set before learning and can be used to tune the respective models and tweak their results. In the cell below you can see them in the respective constructors. E.g., SVR(C=1) has one hyper parameter C, which has been set to the value 1.
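
If you are curious which hyper parameters a model has beyond the ones written out in its constructor call, scikit-learn estimators expose them through `get_params()`. The following is a small sketch of inspecting and changing them. The value 10 is an arbitrary choice for illustration, not a recommendation.

```python
from sklearn import svm

# Create the model with one explicitly chosen hyper parameter
model = svm.SVR(C=1)

# get_params() returns a dict of all hyper parameters, including library defaults
params = model.get_params()
print(params["C"])       # the value passed to the constructor
print(params["kernel"])  # a hyper parameter that was left at its default

# Hyper parameters can also be changed after construction (before calling fit)
model.set_params(C=10)
print(model.get_params()["C"])
```

Hyper parameters are chosen by you. The learning step (`fit`) does not change them. It only determines the model's internal values.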

Your next task is to play around with these hyper parameters. Your goal is to change the model so it learns as much as possible about the original function and as little as possible about the noise. This means you'll have a tradeoff between the mean square error and the shape of the function. For now, just judge by eye which function you think is closest to the original function.


All parameters accept positive whole numbers (min_samples_split must be at least 2). In the case of alpha, you can change the 10 in "1e-10" to change how small this number should be.

Again, let's keep track of the answer:

Answer: \<copy of your line of code here\>

### Reflection Question:

1. How hard is it for you to judge whether a given function is actually the right one?
2. How would you define an automatic process to judge whether the function is the right one?


In [None]:
# 1. create a model
model = svm.SVR(C=1)
#model = neighbors.KNeighborsRegressor(n_neighbors = 5)
#model = gaussian_process.GaussianProcessRegressor(alpha=1e-10)
#model = tree.DecisionTreeRegressor(max_depth=20,min_samples_split=2)

# 2. learn the model 
model.fit(x_train, y_train)

# 3. visualize the results 
print("mean squared error: " + str(mean_squared_error(y_train, model.predict(x_train))))
plot_data(x_train["key"],y_train,model= model)