# Introduction to Machine Learning

#### 🎯 Learning Goals

1. Understand the behavior of different **Loss Functions**.
2. Understand the concept of **Empirical Risk Minimization**.

In [1]:
# Load our libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Use a nicer style for plots
plt.style.use("seaborn-v0_8-muted")

___
## 📝 Notation and Terminology
Let us briefly refresh notation before we dive into the concepts of interest.

+ We have a dataset $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$ consisting of $n$ observations.
+ Each observation $(\mathbf{x}^{(i)}, y^{(i)})$ consists of 
    1. A vector $\mathbf{x}^{(i)} \in \mathcal{X}$, also called *feature*, *covariate*, or *independent variable*, 
    2. A scalar $y^{(i)} \in \mathcal{Y}$, also called *label*, *target*, *response*, *outcome*, or *dependent variable*. (Note that we will later also encounter vector-valued labels, i.e., $\mathbf{y}^{(i)}$, but for now we focus on the simpler scalar case.)
+ We assume that there exists some relationship between $y^{(i)}$ and $\mathbf{x}^{(i)}$, which can be written in the general form $$y^{(i)} = f^*(\mathbf{x}^{(i)}) + \epsilon^{(i)},$$ where $f^*:\mathcal{X}\to\mathcal{Y}$ is the *true* function and $\epsilon^{(i)}$ is the *error* term or *noise* term.
+ We want to find $\hat{f}: \mathcal{X} \to \mathcal{Y}$, an *estimate* of the true function $f^*:\mathcal{X}\to\mathcal{Y}$. We also call this $\hat{f}$ the *predictor*.
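
To make the notation concrete, here is a minimal sketch that simulates a small dataset from the model $y^{(i)} = f^*(\mathbf{x}^{(i)}) + \epsilon^{(i)}$. The particular function `f_star`, the noise scale, and the sample size are assumptions chosen purely for illustration; in practice $f^*$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

def f_star(x):
    # Hypothetical "true" function, chosen only for illustration
    return 2.0 * x + 1.0

n = 50                               # number of observations
x = rng.uniform(-1.0, 1.0, size=n)   # features x^(i)
eps = rng.normal(0.0, 0.1, size=n)   # noise terms eps^(i)
y = f_star(x) + eps                  # labels y^(i) = f*(x^(i)) + eps^(i)

print(x.shape, y.shape)
```

Only `x` and `y` would be observed in a real dataset; `f_star` and `eps` are visible here because we simulated the data ourselves.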

<!-- + We assume that the data is generated by an unknown distribution $p_\text{data}$, i.e., $(\mathbf{x}^{(i)}, y^{(i)}) \sim p_\text{data}$. -->

___
## Loss Functions

Loss functions, also known as *cost* functions or *objective* functions, play a fundamental role in machine learning algorithms. They allow us to measure the quality of our predictions by penalizing incorrect predictions. In this section, we will look at some common loss functions and discuss their properties.

### Squared Loss
The squared loss is defined as the squared difference between the true label $y^{(i)}$ and the predicted label $\hat{y}^{(i)}$:

$$
\ell_\text{sq}(y^{(i)}, \hat{y}^{(i)}) = (y^{(i)} - \hat{y}^{(i)})^2.
$$

In [None]:
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2
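
Because the function only uses elementwise arithmetic, it also works on whole NumPy arrays of labels and predictions. A small usage sketch with made-up numbers (the arrays below are assumptions for illustration):

```python
import numpy as np

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

y = np.array([1.0, 2.0, 3.0])      # made-up true labels
y_hat = np.array([1.0, 2.5, 5.0])  # made-up predictions

# Errors of 0, -0.5 and -2 give losses of 0, 0.25 and 4
losses = squared_loss(y, y_hat)
print(losses)
```

Note how the largest error dominates: doubling an error quadruples its squared loss.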

### 0-1 Loss
The 0-1 loss penalizes any incorrect prediction with a fixed penalty of 1, independent of the magnitude of the error:

$$
\ell_\text{0-1}(y^{(i)}, \hat{y}^{(i)}) = \begin{cases}
0 & \text{if } y^{(i)} = \hat{y}^{(i)} \\
1 & \text{otherwise}
\end{cases}
$$

In [None]:
def zero_one_loss(y, y_hat):
    return 1 * (y != y_hat)
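
The 0-1 loss is mostly used with discrete labels, where "exactly equal" is a meaningful test. A small usage sketch with made-up class labels (the arrays are assumptions for illustration):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    return 1 * (y != y_hat)

y = np.array([0, 1, 1, 0])      # made-up class labels
y_hat = np.array([0, 1, 0, 1])  # made-up predicted labels

losses = zero_one_loss(y, y_hat)  # one entry per observation: 0 if correct, 1 if wrong
error_rate = losses.mean()        # fraction of misclassified observations
print(losses, error_rate)
```

Averaging the 0-1 loss over a dataset gives the misclassification rate.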

#### ➡️ ✏️ Task 1

Following the examples above, implement the following loss functions below:
1. Absolute loss
2. Huber loss

### Absolute Loss
The absolute loss is defined as the absolute difference between the true label $y^{(i)}$ and the predicted label $\hat{y}^{(i)}$:

$$
\ell_\text{abs}(y^{(i)}, \hat{y}^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|.
$$

In [None]:
def absolute_loss(y, y_hat):
    # ➡️ ✏️ your code here
    return y * np.nan # remove this line when you're done

### Huber Loss
The Huber loss is a combination of the squared loss and the absolute loss. It is quadratic for small errors and linear for large errors. This makes it less sensitive to outliers than the squared loss and less sensitive to small errors than the absolute loss.

$$
\ell_\text{huber}(y, \hat{y}) = \begin{cases}
\frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\
\delta(|y - \hat{y}| - \frac{1}{2}\delta) & \text{otherwise}
\end{cases}
$$
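
As a quick check of the definition, take $\delta = 1$ (an illustrative choice) and evaluate both branches by hand:

$$
\begin{align*}
|y - \hat{y}| = 0.5 \leq \delta: \quad & \ell_\text{huber} = \tfrac{1}{2}(0.5)^2 = 0.125, \\
|y - \hat{y}| = 3 > \delta: \quad & \ell_\text{huber} = 1 \cdot \left(3 - \tfrac{1}{2} \cdot 1\right) = 2.5.
\end{align*}
$$

For comparison, an error of 3 gives a squared loss of 9 and an absolute loss of 3. These values can serve as test cases for an implementation.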

*Hint:* The Huber loss is trickier to implement than the previous losses because it is defined piecewise. In particular, we would like to code it in a way that `y_hat` and `y` can be vectors. Here is a bit of help to get you started:

```python
def huber_loss(y, y_hat, delta=1):
    # 1.) Start by computing the absolute loss
    abs_loss = ...
    # 2.) Then make a mask for the indices where the absolute loss is at most delta
    mask = abs_loss <= delta
    # 3.) Then compute the huber loss
    loss = np.zeros_like(abs_loss) # Initialize the loss vector with zeros
    loss[mask] = ... # Fill in the huber loss where the mask is True
    loss[~mask] = ... # Fill in the huber loss where the mask is False
    return loss
```

Alternatively, you can use the [`np.where()`](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function.
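
As a reminder of how `np.where(condition, a, b)` behaves, here is a minimal sketch on an unrelated toy problem (replacing negative entries with zero). The array `x` is made up for illustration.

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])

# Where the condition is True take the second argument, elsewhere the third
clipped = np.where(x < 0, 0.0, x)
print(clipped)
```

The same pattern selects elementwise between two candidate expressions, which is exactly the structure of a piecewise definition.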

In [None]:
def huber_loss(y, y_hat, delta=1):
    # ➡️ ✏️ your code here
    return y * np.nan # remove this line when you're done

In [None]:
# TODO: REMOVE SOLUTION
def absolute_loss(y, y_hat):
    return np.abs(y_hat - y)

def huber_loss(y, y_hat, delta=1):
    abs_loss = absolute_loss(y, y_hat)
    return np.where(
        abs_loss <= delta, 
        0.5 * abs_loss ** 2, 
        delta * (abs_loss - 0.5 * delta)
    )

In [None]:
fig, ax = plt.subplots()

# Define our x-axis (linearly spaced values between -2 and 2)
xs = np.linspace(-2, 2, 101)
ys = np.zeros_like(xs)


# Plot our loss functions
ax.plot(xs, squared_loss(ys, xs), label='Squared loss')
ax.plot(xs, absolute_loss(ys, xs), label='Absolute loss')
ax.plot(xs, huber_loss(ys, xs), label='Huber loss')
ax.plot(xs, zero_one_loss(ys, xs), label='0-1 loss')

# Add some plot aesthetics
ax.set_xlabel(r"Prediction error: $y - \hat{y}$")  # raw string keeps the LaTeX backslash intact
ax.set_ylabel("Loss")
ax.set_title("Loss functions")
ax.grid(alpha=0.3)

ax.legend()


#### ➡️ ✏️ Task 2
Looking at the above plot, discuss the following questions with your classmates:
1. If your predictor predicts perfectly, i.e., $\hat{y} = y$, does the loss function matter?
2. If your prediction error is $y - \hat{y} = 0.5$, which loss is more forgiving: the absolute loss or the squared loss?
3. What if instead the prediction error is $y - \hat{y} = 1.5$?
4. Why would one prefer the squared loss over the absolute loss or vice versa? (*Hint*: Read the description of the Huber loss again.)
5. Have you noticed that all of the above loss functions are symmetric around $y - \hat{y} = 0$? Why is that? What would happen if they were not symmetric?

#### ➡️ ✏️ Task 3
Repeat the above plot, but do the following:
1. Change the values for `xs` to be linearly spaced between `-5` and `5`.
2. Remove the 0-1 loss.
3. Add Huber loss with $\delta=\frac{1}{2}$, $\delta=2$, and $\delta=5$.

What do you notice? Discuss with your classmates.

In [None]:
# ➡️ ✏️ your code here

In [None]:
# TODO: REMOVE SOLUTION
fig, ax = plt.subplots()

# Define our x-axis (linearly spaced values between -5 and 5)
xs = np.linspace(-5, 5, 101)
ys = np.zeros_like(xs)


# Plot our loss functions
ax.plot(xs, squared_loss(ys, xs), label='Squared loss')
ax.plot(xs, absolute_loss(ys, xs), label='Absolute loss')
ax.plot(xs, huber_loss(ys, xs), label=r'Huber loss $(\delta = 1)$')
ax.plot(xs, huber_loss(ys, xs, .5), label=r'Huber loss $(\delta = 0.5)$')
ax.plot(xs, huber_loss(ys, xs, 2), label=r'Huber loss $(\delta = 2)$')
ax.plot(xs, huber_loss(ys, xs, 5), label=r'Huber loss $(\delta = 5)$')

# Add some plot aesthetics
ax.set_xlabel(r"Prediction error: $y - \hat{y}$")  # raw string keeps the LaTeX backslash intact
ax.set_ylabel("Loss")
ax.set_title("Loss functions")
ax.grid(alpha=0.3)

ax.legend()


___

## Empirical Risk Minimization

Now that we understand how different loss functions act as a measure of the quality of our predictions, we can focus on using them to find a good predictor $\hat{f}$.

Recall that we would like to pick a predictor $\hat{f}$ out of a larger set of possible predictors: our *hypothesis class* $\mathcal{H}$. We will discuss the concept of hypothesis classes in more detail later on. For now, let us assume that we are given a finite set of $k$ possible predictors $\mathcal{H} = \{f_1, f_2, \dots, f_k\}$ and we would like to pick the best one.
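
The quantity we use to compare predictors is the *empirical risk*: the average loss of a predictor $f$ over the dataset, $\frac{1}{n} \sum_{i=1}^n \ell(y^{(i)}, f(\mathbf{x}^{(i)}))$. Below is a minimal sketch of picking the best predictor from a finite set. The helper name `empirical_risk`, the toy data, and the three candidate predictors are assumptions made for illustration only.

```python
import numpy as np

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def empirical_risk(f, x, y, loss):
    # Average loss of predictor f over the dataset (x, y)
    return loss(y, f(x)).mean()

# Toy data (made up): y is exactly 2x, so the second candidate is perfect
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x

# A small, hypothetical hypothesis class
candidates = [
    lambda x: np.zeros_like(x),  # always predict 0
    lambda x: 2.0 * x,           # matches the toy data
    lambda x: x + 1.0,
]

risks = [empirical_risk(f, x, y, squared_loss) for f in candidates]
best = int(np.argmin(risks))  # index of the candidate with the lowest empirical risk
print(risks, best)
```

Passing the loss as an argument makes it easy to repeat the comparison under a different loss function.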

### US Crop Yields
We begin by considering a scenario where we are interested in predicting the crop yield of a field. We have a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$ consisting of $n=100$ observations of crops in the US, with average temperature $x^{(i)}$ and crop yield $y^{(i)}$. 

In [None]:
# Load data about US crop yields based on temperature
crops = pd.read_csv("data/us_crops.csv")
crops.head()

Suppose we are given the following hypothesis class $\mathcal{H}$ of possible predictors:
$$\begin{align*}
\mathcal{H} &= \{f_1, f_2, \dots, f_6\}, \text{ where} \\
f_1(x) &= 639 &&\text{(constant predictor)} \\
f_2(x) &= 1200 - 45x\\
f_3(x) &= 1010 - 30x\\
f_4(x) &= 1020 - 9.5x - 1.5x^2\\
f_5(x) &= 800 + 40x - 3.5x^2\\
f_6(x) &= 300 + 30x + 2.9x^2 - 0.25x^3,
\end{align*}$$

and we would like to pick the best predictor out of this set.

In [None]:
# Hypotheses are functions that take our input data and return a prediction
hypotheses = [
    lambda x: 639 * np.ones_like(x),
    lambda x: 1200 - 45 * x,
    lambda x: 1010 - 30 * x,
    lambda x: 1020 - 9.5 * x - 1.5 * x ** 2,
    lambda x: 800 + 40 * x - 3.5 * x ** 2,
    lambda x: 300 + 30 * x + 2.9 * x ** 2 - 0.25 * x ** 3,
]

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

# Plot the individual observations
ax.scatter(crops["temp"], crops["yield"], alpha=.5, color="gray")

# Create a grid of x values to make predictions for
xs = np.linspace(crops["temp"].min(), crops["temp"].max(), 100)

# Plot the hypotheses
for i, f in enumerate(hypotheses):
    ax.plot(xs, f(xs), label=f"$f_{i+1}$")

ax.set_title("Crop Yield vs. Average Temperature")
ax.set_xlabel("Average Temperature")
ax.set_ylabel("Crop Yield")
ax.legend()
ax.grid(alpha=0.3)

In [None]:
# Define the loss function we want to use
loss = squared_loss

# Choose a predictor to evaluate
predictor = hypotheses[0]

# Compute predictions for the chosen predictor
y_hat = predictor(crops["temp"])

# Compute the losses across the dataset
losses = loss(crops["yield"], y_hat)

print(f"The empirical risk of the constant predictor is: {losses.mean():.2f}")

#### ➡️ ✏️ Task 4
First, make sure you thoroughly understand the code cell above. Once you do, implement the following:
1. Write a loop that iterates over the hypotheses and computes the empirical risk (the mean squared loss over the dataset) for each hypothesis.
2. Find the hypothesis with the smallest empirical risk and print it out.

In [None]:
loss = squared_loss # Define the loss function we want to use

all_losses = [] # Use this list to store the losses for each predictor

# Extend the following code
for i, f in enumerate(hypotheses):
    # Compute the predictions
    y_hat = np.nan # ➡️ ✏️ your code here

    # Compute the losses
    losses = np.zeros(5) # ➡️ ✏️ your code here

    # Store the empirical risk
    all_losses.append(losses.mean())

# Print out the predictor with the lowest empirical risk
print(f"Predictor with lowest empirical risk: f{np.argmin(all_losses) + 1}",
      f"with empirical risk: {np.min(all_losses):.2f}")

In [None]:
# TODO: REMOVE SOLUTION

loss = squared_loss # Define the loss function we want to use

all_losses = [] # Use this list to store the losses for each predictor

for i, f in enumerate(hypotheses):
    # Compute the predictions
    y_hat = f(crops["temp"])

    # Compute the losses
    losses = loss(crops["yield"], y_hat)
    
    # Store the empirical risk
    all_losses.append(losses.mean())

# Print out the predictor with the lowest empirical risk
print(f"Predictor with lowest empirical risk: f{np.argmin(all_losses) + 1}",
      f"with empirical risk: {np.min(all_losses):.2f}")

#### ➡️ ✏️ Task 5
Repeat Task 4, but this time compute the empirical risk under the squared loss, the absolute loss, and several Huber losses ($\delta = 0.5, 1, 2, 5$) for each hypothesis. Do you notice anything different? Discuss with your classmates.

In [None]:
# ➡️ ✏️ your code here

In [None]:
# TODO: REMOVE SOLUTION

# We use a DataFrame to store the losses
all_losses = pd.DataFrame({"predictor": [f"f{i+1}" for i,_ in enumerate(hypotheses)]}) 

squared_losses = []
absolute_losses = []
huber_losses_delta_1 = []
huber_losses_delta_05 = []
huber_losses_delta_2 = []
huber_losses_delta_5 = []

for i, f in enumerate(hypotheses):
    # Compute the predictions
    y_hat = f(crops["temp"])

    # Compute the empirical risks
    squared_losses.append(squared_loss(crops["yield"], y_hat).mean())
    absolute_losses.append(absolute_loss(crops["yield"], y_hat).mean())
    huber_losses_delta_1.append(huber_loss(crops["yield"], y_hat, delta=1).mean())
    huber_losses_delta_05.append(huber_loss(crops["yield"], y_hat, delta=.5).mean())
    huber_losses_delta_2.append(huber_loss(crops["yield"], y_hat, delta=2).mean())
    huber_losses_delta_5.append(huber_loss(crops["yield"], y_hat, delta=5).mean())

# Add the losses to the DataFrame
all_losses["squared"] = squared_losses
all_losses["absolute"] = absolute_losses
all_losses["huber_delta_05"] = huber_losses_delta_05
all_losses["huber_delta_1"] = huber_losses_delta_1
all_losses["huber_delta_2"] = huber_losses_delta_2
all_losses["huber_delta_5"] = huber_losses_delta_5

# Display the DataFrame
display(all_losses)

for c in all_losses.columns[1:]:
    print("Predictor with lowest empirical risk for {}: f{}".format(c, np.argmin(all_losses[c]) + 1))

___
#### 🤔 Pause and ponder

In this example, we were given a finite set of possible predictors $\mathcal{H} = \{f_1, f_2, \dots, f_6\}$ and identified the best one in this set for a specific loss function. However, this list of predictors was predefined and limited in scope. How can we find a good predictor in a more general setting, i.e., when we are not handed a finite set of possible predictors?

Can you think of a way to find a good predictor in a more general setting? Discuss with your classmates.

*Hint:* Consider the mathematical minimization problem $$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \ell(y^{(i)}, f(\mathbf{x}^{(i)})).$$

Can you think of a way to solve this problem for an infinite hypothesis class?

Or perhaps a way to make this problem easier to solve? Perhaps the picture below can help you think this through. How can we mathematically represent the blue line? Does this provide any insight into how we can set up an infinite hypothesis class and still find its best predictor?

... in any case, this is the topic of the next lecture!

![](https://jldc.ch/slides/img/dsf/gd_linreg.gif)