# Linear Regression

Modeling the relationship between one or more independent variables (the features) and a dependent variable (the target).

In [1]:
import torch
import torch.nn as nn

## Prediction

Regression is typically used for prediction problems.  
**Examples**: Predicting stock prices, house prices, COVID cases, demand for specific products, etc.

## Introduction

**Linear** Regression assumes that the relationship between the features $\mathbf{x}$ and the targets $y$ is linear,
i.e., that $y$ can be expressed as a weighted sum of the elements in $\mathbf{x}$ plus some Gaussian observation noise.

**Example**: We wish to estimate the prices of houses based on their area and age.
We need a dataset, called a *training set*, where each row (containing the data corresponding to one sale)
is called an *example*.

The thing we are trying to predict is called a *label*  
The variables upon which the predictions are based are called *features*  

$n$ represents the number of examples in our dataset.
$\mathbf{x}^{(i)}$ denotes the $i$-th example and $x_j^{(i)}$ denotes its $j$-th coordinate.

## Model

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$

$w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called *weights*, and $b$ is called a *bias*.

Weights determine the influence of each feature on our prediction. The bias determines the value of the estimate when all features are zero.
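
To make the roles of the weights and the bias concrete, here is a minimal sketch using `nn.Linear`. The numeric values of `w_area`, `w_age`, `b` and the house features are made up purely for illustration; they do not come from any real dataset.

```python
import torch
import torch.nn as nn

# Hypothetical weights and bias, chosen only for illustration.
w_area, w_age, b = 3.0, -0.5, 10.0

# nn.Linear(2, 1) stores a weight matrix of shape (1, 2) and a bias of shape (1,).
model = nn.Linear(in_features=2, out_features=1)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[w_area, w_age]]))
    model.bias.copy_(torch.tensor([b]))

# One house with made-up features: area = 100, age = 20.
house = torch.tensor([[100.0, 20.0]])
with torch.no_grad():
    price = model(house)                  # shape (1, 1)
    at_zero = model(torch.zeros(1, 2))    # all features set to zero

# The same computation written out by hand.
manual_price = w_area * 100.0 + w_age * 20.0 + b

print(price.item(), manual_price)  # both follow the price formula above
print(at_zero.item())              # equals the bias b
```

With all features set to zero, the prediction reduces to the bias, which is exactly what the formula says.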

This is called **linear** regression, but strictly speaking the model is not a linear function of $\mathbf{x}$. **Why?**

## Goal

We want to learn the weight vector $\mathbf{w}$ and the bias $b$ that give the best predictions $\hat{y}$, where

$$\hat{y} = w_1  x_1 + \dots + w_d  x_d + b.$$

Let $\mathbf{x} \in \mathbb{R}^d$ be the vector containing all the features and $\mathbf{w} \in \mathbb{R}^d$ the vector of weights.
We can then express our model using a dot product:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$


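For a collection of $n$ examples stacked as the rows of a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product (the scalar $b$ is added to every entry):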

$${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b$$
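
As a quick check of the matrix-vector form, the sketch below compares it with the per-example dot products. The sizes `n` and `d`, the random data and the value of `b` are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)  # only to make the illustration reproducible

n, d = 5, 3                 # arbitrary sizes for illustration
X = torch.randn(n, d)       # one example per row
w = torch.randn(d)          # one weight per feature
b = 0.7                     # scalar bias

# Matrix-vector form: all n predictions at once; b is broadcast to every entry.
y_hat = X @ w + b           # shape (n,)

# Per-example form: one dot product per row of X.
y_hat_rows = torch.stack([torch.dot(X[i], w) + b for i in range(n)])

print(y_hat.shape)
print(torch.allclose(y_hat, y_hat_rows))
```

Both forms compute the same predictions; the matrix form simply avoids the explicit Python loop.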

## Metric to optimize

To select the best parameters, we need to be able to compare two sets of weights.

The *loss* function quantifies the distance
between the *real* and *predicted* values of the target.
The loss is usually a **non-negative number**,
where smaller values are better and a perfect prediction gives **0**.

**Squared error** $$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$

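where $\hat{y}^{(i)}$ is our prediction and $y^{(i)}$ is the ground truth. The factor $\frac{1}{2}$ is only a convenience: it cancels when the square is differentiated.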

To measure the loss on the entire dataset, we simply average the losses of all examples:

$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$

In [2]:
criterion = nn.MSELoss()
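
One detail worth checking: with its default settings, `nn.MSELoss` averages the squared differences without the $\frac{1}{2}$ factor, so it returns twice the value of $L(\mathbf{w}, b)$ as defined above. The small sketch below illustrates this on made-up predictions and targets.

```python
import torch
import torch.nn as nn

# Made-up predictions and targets, for illustration only.
y_hat = torch.tensor([2.0, 0.0, 3.0])
y = torch.tensor([1.0, 2.0, 3.0])

# Loss as defined in the formulas above: per-example squared error with the 1/2 factor.
per_example = 0.5 * (y_hat - y) ** 2
L = per_example.mean()

# PyTorch's built-in criterion (default reduction is the mean, no 1/2 factor).
mse = nn.MSELoss()(y_hat, y)

print(L.item(), mse.item())
print(torch.allclose(mse, 2 * L))
```

The constant factor does not change which weights minimize the loss; it only rescales the gradient, which is equivalent to rescaling the learning rate.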

## How to train this?

* Random guess? **Impossible, too combinatorial**
* Naive approach: **Orthogonal search**, very slow
* Analytic solution: $$\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}.$$
(here the bias $b$ is absorbed into $\mathbf{w}$ by appending a constant feature equal to 1). Out of scope. It doesn't scale!
* **Gradient descent**: Compute the gradient, then take a small step in the opposite direction of the gradient
$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).$$
Here $\eta$ is the learning rate (the step size) and $\mathcal{B}$ is the batch of examples used for the update.
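
For small problems, the analytic solution from the list above can be checked numerically. The sketch below is one possible way to do it, assuming noiseless synthetic data and no bias term; the names `true_w`, `X`, `y` and `w_star` are illustrative. Instead of forming the inverse explicitly, it solves the linear system $(\mathbf X^\top \mathbf X)\,\mathbf{w} = \mathbf X^\top \mathbf{y}$ with `torch.linalg.solve`.

```python
import torch

torch.manual_seed(0)  # only to make the illustration reproducible

# Synthetic, noiseless data: y = X w with known weights (no bias term).
true_w = torch.tensor([2.5, 4.8])
X = torch.randn(10, 2)
y = X @ true_w

# Normal equations: (X^T X) w = X^T y, solved without computing the inverse.
w_star = torch.linalg.solve(X.T @ X, X.T @ y)

print(w_star)  # should be close to true_w, up to floating-point error
```

Solving the system costs roughly cubic time in the number of features, which is one reason this approach becomes impractical for large models.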

The mathematics behind *gradient descent* is beyond the scope of this class.  

**Intuition behind *gradient descent*: You are on top of a mountain, it's very foggy, and you want to go back to the village; at each step you move in the direction of the steepest local descent**

At the end of the day, you might end up somewhere. Where? Are you guaranteed to find the optimal solution?

In [3]:
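# Ground-truth weights that we will try to recover
true_W = torch.tensor([2.5, 4.8])
# 10 random examples with 2 features each
input_data = torch.randn((10, 2))
# Noiseless targets: y = X w (no bias term)
ground_truth = torch.matmul(input_data, true_W)
ground_truth, true_W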

(tensor([ -1.5617, -15.5320,   0.1981,  -1.5115,   5.4930,  -4.2655,  -0.2699,
          17.9946,   4.9348,  -7.7862]),
 tensor([2.5000, 4.8000]))

In [4]:
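random_weights = torch.randn(2, requires_grad=True)
print(f'initial guess {random_weights}')
epochs = 300
learning_rate = 0.01
for _ in range(epochs):
    # Forward pass: predictions and loss for the current weights
    prediction = torch.matmul(input_data, random_weights)
    loss = criterion(prediction, ground_truth)
    # Backward pass: populates random_weights.grad
    loss.backward()
    # Why do we need to remove the gradient computation here?
    with torch.no_grad():
        # Gradient step; the result is a new tensor that autograd does not track
        random_weights = random_weights - learning_rate * random_weights.grad
    # Re-enable gradient tracking on the new tensor (its .grad starts out empty)
    random_weights.requires_grad = True
random_weights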

initial guess tensor([ 0.0818, -1.3240], requires_grad=True)


tensor([2.5010, 4.7989], requires_grad=True)

PyTorch can handle the update step of this loop for you through its optimizers (for example `torch.optim.SGD`).

But it's important to know how to ask for gradient computation and where the gradient is computed.
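
As a sketch of what the same training loop could look like with a built-in optimizer, the example below uses `torch.optim.SGD` on freshly generated synthetic data. The variable names (`X`, `y`, `w`) and the seed are illustrative choices, not part of the cells above; the learning rate and number of iterations mirror the manual loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the illustration reproducible

# Synthetic, noiseless data with known weights (no bias term).
true_w = torch.tensor([2.5, 4.8])
X = torch.randn(10, 2)
y = X @ true_w

# Parameter to learn; requires_grad=True asks autograd to track it.
w = torch.randn(2, requires_grad=True)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD([w], lr=0.01)

initial_loss = criterion(X @ w, y).item()
for _ in range(300):
    optimizer.zero_grad()          # clear the gradient stored in w.grad
    loss = criterion(X @ w, y)     # forward pass
    loss.backward()                # backward pass: fills w.grad
    optimizer.step()               # in-place update: w <- w - lr * w.grad
final_loss = criterion(X @ w, y).item()

print(initial_loss, final_loss)    # the loss should have decreased
print(w)                           # should have moved toward true_w
```

Here the optimizer updates `w` in place, so the same tensor keeps tracking gradients and there is no need to reset `requires_grad`. The gradient is still stored in `w.grad`, which is why it has to be cleared with `zero_grad()` before each backward pass.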