# Training models

In this notebook you will learn about the following topics:

- [Training models](#training)
- [Gradient descent](#gradient_descent)
- [Under and overfitting](#under_overfitting)
- [Training, testing and validation sets](#train_test_valid)

### Imports

In [38]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression 


### Data loading

We'll start by using the Boston housing dataset again.

In [4]:
df = pd.read_csv('../data/boston.csv')

## Training models <a id="training"></a>

As we learned last week, linear regression is the process of fitting a line to data in the "best" way possible. We measured "best" in terms of mean squared error (MSE), given by
$$
\displaystyle\frac{1}{n}\sum_{i=1}^{n} (y_i - (mx_i + b))^2
$$
where $n$ is the number of data points, $y_i$ are the values to be predicted (median home value in our example last week) and $x_i$ are the data being used to make predictions (things like poverty rate, distance to downtown, etc.). We then showed that you can take derivatives and find critical points to solve for the values of $m$ and $b$ that make this "error" as small as possible, i.e. minimize it.
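As a quick refresher, here is a minimal sketch of that closed-form solution on made-up toy data (not the Boston data). The function name `mse` and the variables `m_best` and `b_best` are just illustrative choices. The slope and intercept formulas are the standard ones you get by setting both partial derivatives of the MSE to zero.

```python
import numpy as np

# Made-up toy data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def mse(m, b, x, y):
    """Mean squared error of the line y = m*x + b on the data."""
    return np.mean((y - (m * x + b)) ** 2)

# Critical point found by setting both partial derivatives to zero
m_best = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_best = y.mean() - m_best * x.mean()

print(m_best, b_best)              # slope and intercept of the best-fit line
print(mse(m_best, b_best, x, y))   # error at the minimum
print(mse(0.0, 0.0, x, y))         # error of a poor guess, for comparison
```

Because the toy points sit exactly on a line, the minimum error here is zero. With real data it would be positive.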

This kind of exact solution is rare in machine learning. In fact, linear regression is largely the only case in machine learning where we can actually *solve* for the values of the **model parameters**, the variables used in the model, that give the smallest error. Instead, we do what is called "model training".

Model training works like this:
1. Find data containing the values you want your model to predict.
2. Pick a model. Last week this was linear regression. As the semester goes on you'll learn about several other models.
3. Start with random guesses for the model parameters.
4. Have your model make predictions from your data, and compare the predictions to the correct values using a loss function like mean squared error.
5. Take the gradient of the loss function and adjust the model parameters in the direction of the negative gradient.
6. Repeat steps 4 and 5 over and over.
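To make the steps above concrete before we go through them, here is a rough sketch of the whole loop for a one-feature linear model on made-up data. The step size `learning_rate`, the number of iterations, and the seed are all arbitrary choices for illustration, not recommended values.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable

# Step 1: made-up data that lies exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Step 2: the model is a line, prediction = m*x + b
# Step 3: random initial guesses for the parameters
m, b = rng.random(), rng.random()

learning_rate = 0.05  # hypothetical step size
losses = []
for step in range(2000):
    # Step 4: predict and measure the loss (mean squared error)
    error = (m * x + b) - y
    losses.append(np.mean(error ** 2))
    # Step 5: gradient of the MSE with respect to m and b,
    # then move each parameter against its gradient
    grad_m = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b
    # Step 6: the loop repeats steps 4 and 5

print(f'm = {m:.3f}, b = {b:.3f}, final loss = {losses[-1]:.6f}')
```

On this toy data the parameters should drift toward the true slope and intercept, and the recorded losses should shrink as training proceeds.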

Let's go through these steps one-by-one, as each requires significant explanation.

### Finding data to train your model on
The first step is to find data. Normally you've got a general problem in mind that you want to answer. Your first step should be looking for data related to that problem. If you don't have data then you can't do anything else either.

Let's define a few terms that we will be using throughout the rest of this semester:
- **features:** Features are simply the "inputs" in your data. So in the Titanic example, this would be things like age, fare, sex, etc.
- **labels:** Labels are the values you want to predict. The term "label" comes from when you are trying to predict a categorical variable, such as the breed of a dog, or the survival or death of a passenger. However we also use it for numerical variables, such as home value.
- **ground truth:** This refers to the "correct" values of the labels. For instance, suppose we collected data on passengers on the Titanic. We could build a machine learning model to predict whether or not each person survived, and the model would predict a label ("survived" or "died"). However, these are just *predictions*. By "ground truth" we mean the actual correct labels. That is, for each person described in the data, did they *actually* survive or die? The answer to that question is the ground truth.

So we want data with features and ground truth. Once we have that, we can move on to step 2.
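In code, these terms usually correspond to different columns of one table. The sketch below uses a tiny hand-typed DataFrame in the spirit of the Titanic example. The rows, the column names, and the "predictions" are all invented for illustration.

```python
import pandas as pd

# Hypothetical, hand-typed rows (not the real Titanic data)
passengers = pd.DataFrame({
    'age': [22, 38, 4],
    'fare': [7.25, 71.28, 16.70],
    'sex': ['male', 'female', 'female'],
    'survived': [0, 1, 1],   # ground-truth labels
})

features = passengers[['age', 'fare', 'sex']]  # the model's inputs
labels = passengers['survived']                # the values to predict

# Pretend predictions from some model, compared against the ground truth
predictions = pd.Series([0, 1, 0])
accuracy = (predictions == labels).mean()
print(features.shape, accuracy)
```

Here the pretend model gets two of the three labels right, so comparing predictions to the ground truth gives an accuracy of two thirds.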

### Model types

Machine learning "models" are simply functions. Linear regression is an especially simple function represented by a simple equation ($y=mx+b$). When dealing with inputs with many features, $m$ and $x$ become vectors, which makes things seem complicated. But in reality it's still just the higher-dimensional version of a line. Another model we will deal with extensively this semester is called a "decision tree". We will hold off on the details for now, but a decision tree is simply a function that repeatedly asks "yes/no" questions of the data. For instance, suppose we want to use a decision tree to determine whether or not a passenger on the Titanic survives. Below is a possible decision tree:

![Titanic decision tree](images/titanic_decision_tree.png)

You can see that the first question is about the person's sex: if they are female, the model predicts they will survive. If they are male, the model then asks about their age, and so forth. This may not *look* like a function, but it is. Recall that a function is simply something that takes an input and returns a single output (think about the "vertical line test"). We could write this as an *equation* (which is probably how you typically think about functions) as follows:
$$
f(\text{sex}, \text{age}, \text{sibsp}) = \text{piecewise function}
$$
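One way to see that a tree really is a piecewise function is to write it as code. The sketch below is only a guess at a tree of this shape: the order of the questions and the cutoffs 9.5 and 2.5 come from the discussion in this section, but the prediction at each leaf is an assumption made for illustration. The function name `titanic_tree` is hypothetical.

```python
def titanic_tree(sex, age, sibsp):
    """Hypothetical piecewise rule; the leaf predictions are assumed."""
    if sex == 'female':
        return 'survived'
    # From here on the passenger is male
    if age > 9.5:
        return 'died'
    # Young male passenger: ask about siblings/spouses aboard
    if sibsp > 2.5:
        return 'died'
    return 'survived'

print(titanic_tree('female', 30, 0))
print(titanic_tree('male', 30, 0))
print(titanic_tree('male', 5, 1))
```

Every input follows exactly one path through the `if` statements and gets exactly one output, which is what makes this a function.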
In this decision tree we have the following model parameters:
- Which columns should we ask questions about? 
- What order should we ask these questions? Do we start with sex, age, or sibsp?
- When we ask about a numerical column (such as "is age > 9.5" or "is sibsp > 2.5"), what value should be our cutoff? That is, why aren't we asking "is age > 14", or "is sibsp < 5"?
- After each question, what should we do next? Should we predict a value or go to another question?
- Whenever we decide to predict a value, what value should we predict?

As you can see, model parameters can be quite complicated. It is impossible to set up an equation and "solve" for each of these like we did for linear regression. So instead, we train the model. That leads us to step 3.

### Start with random guesses for the model parameters

Suppose you decide to predict home value using the data in the Boston dataset. You decide to use linear regression, and so know your model will look something like
$$
y = m_1\cdot \text{crime_rate} + m_2\cdot \text{pct_industrial} + \cdots + m_8\cdot \text{poverty} + b
$$
What values of $m_1, m_2, \ldots, m_8, b$ will make the predicted value be as close as possible to the ground truth? You have no idea, so all you can do is start with a random guess. Let's have numpy do this for us using `np.random`.

In [7]:
# Pick 8 random "slopes"
m = np.random.rand(8)

# Pick one random "y-intercept"
b = np.random.rand(1)

In [9]:
m

array([0.57864073, 0.2565169 , 0.65679831, 0.06675287, 0.41239833,
       0.13474376, 0.53575606, 0.97992396])

In [10]:
b

array([0.76140737])
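These values will differ every time the cells are re-run. If you want the same "random" starting point each time, one option is a seeded generator. This is a sketch using NumPy's `default_rng`. The seed value `0` and the names `m_init` and `b_init` are arbitrary choices, and the numbers it produces will not match the ones shown above.

```python
import numpy as np

# A seeded generator makes the "random" starting point repeatable
rng = np.random.default_rng(seed=0)  # the seed value 0 is arbitrary

m_init = rng.random(8)  # 8 starting slopes in [0, 1)
b_init = rng.random(1)  # 1 starting intercept in [0, 1)

# Re-creating the generator with the same seed reproduces the same guesses
rng_again = np.random.default_rng(seed=0)
same_again = np.array_equal(m_init, rng_again.random(8))
print(same_again)
```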

Great, now we've got our initial guess at a function, given by:

In [24]:
# Fancy code to dynamically generate the equation
# df.columns[1:] skips the first column, 'Median home value', which is the label rather than a feature
'y = ' + ' + '.join(f'{slope:.3f}({name})' for slope, name in zip(m, df.columns[1:])) + f' + {b[0]:.3f}'

'y = 0.579(Crime rate) + 0.257(% industrial) + 0.657(Nitrous Oxide concentration) + 0.067(Avg num rooms) + 0.412(% built before 1940) + 0.135(Distance to downtown) + 0.536(Pupil-teacher ratio) + 0.980(% below poverty line) + 0.761'

Does this give a good prediction? Probably not! Let's try a quick example. We'll first write a function which evaluates this linear model on a row of data.

In [33]:
def my_lr(x, m, b):
    # np.dot is the dot product, so multiply and add
    return np.dot(m, x) + b

In [37]:
features_df = df[['Crime rate', '% industrial', 'Nitrous Oxide concentration', 'Avg num rooms', '% built before 1940', 
                 'Distance to downtown', 'Pupil-teacher ratio', '% below poverty line']]

# my_lr returns a one-element NumPy array (because b has one element); we "pull out" the number using [0]
predicted_home_value = my_lr(x=features_df.iloc[0], m=m, b=b)[0]
actual_home_value = df['Median home value'].iloc[0]

print(f'Predicted: {predicted_home_value:.2f}, Actual: {actual_home_value:.2f}')

Predicted: 42.67, Actual: 24000.00


Nowhere close! But that's not surprising, since we just started with random guesses for the slopes and y-intercept. However, by comparing what we predicted to the ground truth, we can improve on our guesses using something called "gradient descent". We will go into gradient descent in more detail later in this notebook, but for now let's just summarize it.
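A single row only tells us so much. To judge a guess on every row at once, we can predict for the whole table and average the squared errors. The sketch below does this on a small made-up feature matrix rather than the Boston data. The names `X`, `y_true`, `m_guess` and `b_guess` are hypothetical.

```python
import numpy as np

# Made-up feature matrix: 3 rows (homes) by 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y_true = np.array([10.0, 20.0, 30.0])  # made-up ground truth

m_guess = np.array([0.5, 0.25])  # hypothetical slopes
b_guess = 1.0                    # hypothetical intercept

# One matrix-vector product makes a prediction for every row at once
y_pred = X @ m_guess + b_guess

# Mean squared error between predictions and ground truth
loss = np.mean((y_true - y_pred) ** 2)
print(y_pred, loss)
```

The single number `loss` is what gradient descent will try to make smaller.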

### Gradient descent

Recall from Calculus 3 that the "gradient" of a function is the vector of its partial derivatives, one for each variable, evaluated at the current point. Since the gradient is a real-valued vector, we can visualize it. In particular, the vector *points in the direction of greatest increase of the function*. For example, the very simple function $f(x, y) = x^2 + y^2$ defines a 3d surface shaped like a bowl, with its lowest point at the origin. Its gradient is $(2x, 2y)$, and evaluating it at, say, $(x, y) = (1, 2)$ gives the vector $(2, 4)$. How should we think about this vector? It points directly away from the origin, which is the direction in which $f$ climbs most steeply from $(1, 2)$. Moving the opposite way, along $(-2, -4)$, heads toward the minimum at the origin, and that is the direction gradient descent uses.
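If you would like to check the gradient of $f(x, y) = x^2 + y^2$ without doing the calculus, you can approximate each partial derivative numerically. This is a minimal sketch using central differences. The helper name `numerical_gradient` and the step `h` are illustrative choices.

```python
def f(x, y):
    return x ** 2 + y ** 2

def numerical_gradient(func, x, y, h=1e-6):
    """Approximate both partial derivatives with central differences."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy

grad = numerical_gradient(f, 1.0, 2.0)
print(grad)  # approximately (2, 4), matching (2x, 2y) at (1, 2)
```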

### Update your parameters

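Now that we know which "direction" to move the values of our parameters to make the loss smaller, we'll do that: we move each parameter a small step in the direction of the negative gradient. As long as the step is small enough, the model with these new parameters has a smaller loss, i.e. its predictions are on average closer to the ground truth. **SHOW EXAMPLE CALCULATIONS**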
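As a small illustration, here is a single update step on the toy surface $f(x, y) = x^2 + y^2$ from the gradient discussion, standing in for a real loss function. The step size `learning_rate = 0.1` is an arbitrary assumption. Choosing it well is a topic of its own.

```python
def f(x, y):
    return x ** 2 + y ** 2

def gradient(x, y):
    """Exact gradient of f, namely (2x, 2y)."""
    return 2 * x, 2 * y

learning_rate = 0.1  # hypothetical step size
x, y = 1.0, 2.0      # current "parameters"

# Move each parameter a small step against its partial derivative
gx, gy = gradient(x, y)
x_new = x - learning_rate * gx
y_new = y - learning_rate * gy

print(f(x, y), f(x_new, y_new))  # the value of f before and after the step
```

Starting from $(1, 2)$, where $f = 5$, the step lands at about $(0.8, 1.6)$, where $f$ is about $3.2$. Repeating the step keeps shrinking the value.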

## Under- and over-fitting <a id="under_overfitting"></a>

Now that you know how model training works, let's do an example in code. We'll be using a type of model that you haven't worked with yet, but which will become more and more important as the semester goes along. This model is called a "decision tree", and the tree diagram earlier in this notebook is one example. scikit-learn (sklearn) has a decision tree model which we will use.