# Level 2: Fundamentals of ML

##### We familiarize ourselves with key concepts that any machine learning implementation must confront. We continue to view these concepts through linear models.

In [2]:
# Run this cell for necessary imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Problem 1: Overfitting, Underfitting, or Just Right. 

We revisit the projectile motion dataset of Level 1, Problem 2. We add seemingly pointless features to the model and explore the effect this has on the fit.

### a)

Append the features $x^2, x^3, x^4, \ldots, x^{10}$ to X_train and construct a linear fit to Y_train. Does the fit perform well? Answer by overlaying a scatter plot of all $(x, y)$ in the training set with a plot of your fit as a function of $x$.

In [3]:
X_train = pd.read_hdf('data/projectile_motion.h5', key='X_train')
Y_train = pd.read_hdf('data/projectile_motion.h5', key='Y_train')

# Your code here.
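
For reference, one way to build the augmented design matrix is sketched below. It runs on synthetic parabolic data rather than on `projectile_motion.h5`, whose column layout is not shown here. The helper name `poly_features`, the noise level, and the coefficients are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def poly_features(x, degree):
    """Return the columns [1, x, x^2, ..., x^degree] for a 1-D array x."""
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, N=degree + 1, increasing=True)

# Synthetic stand-in for the projectile data (assumed, not the real dataset).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = 5.0 * x_train - 4.9 * x_train**2 + rng.normal(0.0, 0.05, size=10)

# Least-squares fit with features up to x^10 (11 weights, 10 samples).
Phi = poly_features(x_train, degree=10)
w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

# Evaluate the fit on a dense grid so it can be drawn over the scatter plot.
x_grid = np.linspace(0.0, 1.0, 200)
y_grid = poly_features(x_grid, degree=10) @ w
```

Plotting `x_grid` against `y_grid` with `plt.plot`, on top of `plt.scatter(x_train, y_train)`, gives the comparison the question asks for.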

### b)

Define the concepts:

*Underfitting (high bias)*: *Your answer here*

*Overfitting (high variance)*: *Your answer here*

*Bias-variance tradeoff*: *Your answer here*

### c)

Compare your plot from part a with your plots from Level 1, Problem 2, parts b and e. Which of these models is overfitting? Which of these models is underfitting?

*Your answer here*

### d)

BONUS: *Why* does adding more features to this model result in overfitting? If a model is overfitting, is it typically caused by introducing too many features to the model, or is there usually a better explanation?

*Your answer here*

## Problem 2: Diagnosing Overfitting and Underfitting.

We explore how additional data can be used to determine whether a model is overfitting or underfitting.

### a)

Evaluate the training loss of all three models that we have trained thus far (see Level 2, Problem 1, part a and Level 1, Problem 2, parts a and d). Which model has the lowest training loss? Which model has the highest training loss?

### b)

Define the concept of a *validation set*. We have provided code to load in a validation set for this problem. Calculate the validation loss for all three trained models. Which model has the lowest validation loss?

In [4]:
X_val = pd.read_hdf('data/projectile_motion.h5', key='X_val')
Y_val = pd.read_hdf('data/projectile_motion.h5', key='Y_val')
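
The comparison of training and validation loss can be sketched as follows. This is a minimal example under stated assumptions: the loss is taken to be mean squared error, the three models are taken to be polynomial fits of degree 1, 2, and 10, and the data are synthetic. None of these choices is confirmed by the Level 1 notebook, and the helper names are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

def fit_poly(x, y, degree):
    """Least-squares weights for the features [1, x, ..., x^degree]."""
    Phi = np.vander(np.asarray(x, dtype=float).ravel(), N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, np.asarray(y, dtype=float).ravel(), rcond=None)
    return w

def predict_poly(w, x):
    """Evaluate the polynomial with weights w at the points x."""
    Phi = np.vander(np.asarray(x, dtype=float).ravel(), N=len(w), increasing=True)
    return Phi @ w

rng = np.random.default_rng(1)

def make_data(n):
    """Synthetic parabola with noise (assumed stand-in for the dataset)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 5.0 * x - 4.9 * x**2 + rng.normal(0.0, 0.05, size=n)
    return x, y

x_tr, y_tr = make_data(10)    # small training set
x_va, y_va = make_data(200)   # held-out validation set

losses = {}
for degree in (1, 2, 10):
    w = fit_poly(x_tr, y_tr, degree)
    losses[degree] = (mse(y_tr, predict_poly(w, x_tr)),
                      mse(y_va, predict_poly(w, x_va)))
    print(f"degree {degree:2d}: train {losses[degree][0]:.4g}, "
          f"val {losses[degree][1]:.4g}")
```

The same `mse` helper can be applied to predictions from the models trained on the real `X_train`, `Y_train`, `X_val`, and `Y_val`.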

### c)

By evaluating the loss function on the training set and the validation set, how may we diagnose problems of underfitting? Similarly, how may we use this to diagnose overfitting?

*Your answer here*

## Problem 3: Taming High Variance Models. 

We have seen that reducing the number of features in a linear model may reduce its tendency to overfit. In this problem, we look at other ways to reduce a model's variance that generalize more easily beyond linear models.

### a)

Retrain the linear model with features $x$, $x^2$, ..., $x^{10}$ with a training set of $1000$ samples instead of $10$. Comment on the effect that this has on the validation loss of this model.

In [None]:
# Your code here

### b)

In practice, why is adding training data usually not a viable solution to overfitting?

*Your answer here*

### c)

Define *regularization*. Add a regularization term with parameter $\lambda = $... to the loss function. With this new loss function, retrain the model with features $x$, $x^2$, ..., $x^{10}$ on $10$ training points. What effect does this have on the validation loss?
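
As a hedged illustration, the sketch below assumes the regularization term is an L2 (ridge) penalty added to a mean squared error loss, with the intercept left unpenalized. The worksheet does not specify the form of the penalty, so both choices are assumptions. The value `lam = 1e-3` is a hypothetical placeholder, not the value intended by the problem, and the data are synthetic.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize mean((Phi @ w - y)**2) + lam * sum(w[1:]**2) in closed form."""
    n, d = Phi.shape
    penalty = np.eye(d)
    penalty[0, 0] = 0.0  # leave the intercept (column of ones) unpenalized
    return np.linalg.solve(Phi.T @ Phi + n * lam * penalty, Phi.T @ y)

# Synthetic stand-in for the 10-point training set (assumed data).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
y = 5.0 * x - 4.9 * x**2 + rng.normal(0.0, 0.05, size=10)
Phi = np.vander(x, N=11, increasing=True)  # columns 1, x, ..., x^10

lam = 1e-3  # hypothetical placeholder; substitute the worksheet's value
w_ridge = ridge_fit(Phi, y, lam)
w_plain, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # unregularized fit

print("norm of unregularized weights:", np.linalg.norm(w_plain))
print("norm of ridge weights:        ", np.linalg.norm(w_ridge))
```

The closed form follows from setting the gradient of the penalized loss to zero, which gives $(\Phi^T \Phi + n \lambda D)\, w = \Phi^T y$, where $D$ is the identity with its first diagonal entry set to zero.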

### d)

Define *hyperparameter*. How do hyperparameters differ from the usual parameters of your model? What other hyperparameter (besides $\lambda$) have we already encountered in this series?

*Your answer here*

### e)

Define *cross validation*. Confirm via cross validation that $\lambda = $... was an appropriate choice. Show a plot of the training and validation losses as a function of $\lambda$. Compare the relationship between training and validation loss for $\lambda$ much smaller than $\lambda=$... and for $\lambda$ much larger than $\lambda=$...

*Your answer here*

In [7]:
# Your code here
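
A sweep over $\lambda$ might look like the sketch below. It uses a single held-out validation set, an L2 penalty, mean squared error, and synthetic data, all of which are assumptions. The grid `lambdas` is an arbitrary illustrative range and is not centered on the value the problem refers to.

```python
import numpy as np

def design(x, degree=10):
    """Columns [1, x, ..., x^degree] for a 1-D array x."""
    return np.vander(np.asarray(x, dtype=float).ravel(), N=degree + 1, increasing=True)

def ridge_fit(Phi, y, lam):
    """Closed-form minimizer of mean squared error plus lam * sum(w[1:]**2)."""
    n, d = Phi.shape
    penalty = np.eye(d)
    penalty[0, 0] = 0.0  # intercept is not penalized
    return np.linalg.solve(Phi.T @ Phi + n * lam * penalty, Phi.T @ y)

def mse(Phi, w, y):
    """Mean squared error of the linear model w on (Phi, y)."""
    return float(np.mean((Phi @ w - y) ** 2))

rng = np.random.default_rng(3)

def make_data(n):
    """Synthetic parabola with noise (assumed stand-in for the dataset)."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, 5.0 * x - 4.9 * x**2 + rng.normal(0.0, 0.05, size=n)

x_tr, y_tr = make_data(10)
x_va, y_va = make_data(200)
Phi_tr, Phi_va = design(x_tr), design(x_va)

lambdas = np.logspace(-6, 1, 15)  # illustrative grid, not prescribed
train_losses, val_losses = [], []
for lam in lambdas:
    w = ridge_fit(Phi_tr, y_tr, lam)
    train_losses.append(mse(Phi_tr, w, y_tr))
    val_losses.append(mse(Phi_va, w, y_va))

best = int(np.argmin(val_losses))
print(f"lowest validation loss at lambda = {lambdas[best]:.3g}")
```

Plotting `train_losses` and `val_losses` against `lambdas` on a logarithmic axis, for example with `plt.semilogx`, produces the requested figure.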

### f)

Describe *k-fold cross validation*. What are the advantages of k-fold cross validation? What are the disadvantages?

*Your answer here*
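
The mechanics of splitting data into folds can be sketched with NumPy alone. The helper names `kfold_indices` and `kfold_mse`, the choice of $k = 5$, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, disjoint folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def kfold_mse(x, y, degree, k=5):
    """Average validation mean squared error of a polynomial fit over k folds."""
    scores = []
    for tr, va in kfold_indices(len(x), k):
        Phi_tr = np.vander(x[tr], N=degree + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, y[tr], rcond=None)
        pred = np.vander(x[va], N=degree + 1, increasing=True) @ w
        scores.append(np.mean((y[va] - pred) ** 2))
    return float(np.mean(scores))

# Synthetic parabola with noise (assumed stand-in for the dataset).
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=50)
y = 5.0 * x - 4.9 * x**2 + rng.normal(0.0, 0.05, size=50)

for degree in (1, 2):
    print(f"degree {degree}: 5-fold validation loss {kfold_mse(x, y, degree):.4g}")
```

Each sample is used for validation exactly once and for training in the remaining folds, so the model is refit $k$ times.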

### g)

Describe why it is necessary to evaluate your model on a *testing set* that is separate from the validation set when reporting expected performance of your model.

*Your answer here*

## Problem 4: Learning Curves.

### a)

Train our linear model with features $x$, $x^2$, ... , $x^{10}$ with gradient descent or *stochastic gradient descent* using a value of $\lambda = $... Plot the *learning curve*.

In [8]:
# Your code here 
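
The sketch below records a learning curve during batch gradient descent. It assumes that "learning curve" means loss as a function of iteration, that $\alpha$ denotes the step size, and that the loss is mean squared error with an L2 penalty. The values of `lam`, `alpha`, and `n_steps` are hypothetical placeholders and the data are synthetic.

```python
import numpy as np

def gradient_descent(Phi_tr, y_tr, Phi_va, y_va, lam, alpha, n_steps):
    """Batch gradient descent on mean squared error plus lam * sum(w[1:]**2)."""
    n, d = Phi_tr.shape
    w = np.zeros(d)
    mask = np.ones(d)
    mask[0] = 0.0  # do not penalize the intercept
    train_curve, val_curve = [], []
    for _ in range(n_steps):
        # Record the losses before each update.
        train_curve.append(float(np.mean((Phi_tr @ w - y_tr) ** 2)))
        val_curve.append(float(np.mean((Phi_va @ w - y_va) ** 2)))
        grad = (2.0 / n) * Phi_tr.T @ (Phi_tr @ w - y_tr) + 2.0 * lam * mask * w
        w = w - alpha * grad
    return w, train_curve, val_curve

# Synthetic parabola with noise (assumed stand-in for the dataset).
rng = np.random.default_rng(5)
x_tr = np.linspace(0.0, 1.0, 10)
y_tr = 5.0 * x_tr - 4.9 * x_tr**2 + rng.normal(0.0, 0.05, size=10)
x_va = rng.uniform(0.0, 1.0, size=200)
y_va = 5.0 * x_va - 4.9 * x_va**2 + rng.normal(0.0, 0.05, size=200)

Phi_tr = np.vander(x_tr, N=11, increasing=True)  # columns 1, x, ..., x^10
Phi_va = np.vander(x_va, N=11, increasing=True)

# Hypothetical placeholder settings, not the worksheet's values.
w, train_curve, val_curve = gradient_descent(
    Phi_tr, y_tr, Phi_va, y_va, lam=1e-3, alpha=0.1, n_steps=500
)
print(f"train loss: {train_curve[0]:.4g} -> {train_curve[-1]:.4g}")
print(f"val loss:   {val_curve[0]:.4g} -> {val_curve[-1]:.4g}")
```

Plotting `train_curve` and `val_curve` against the iteration index gives the learning curve. Rerunning with different `lam` and `alpha` shows how its shape changes.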

### b)

How can we use learning curves to diagnose:

$\lambda$ too small: *Your answer here*

$\lambda$ too large: *Your answer here*

$\alpha$ too small: *Your answer here*

$\alpha$ too large: *Your answer here*