# Chapter 3 - Linear Neural Networks for Regression

## 3.1. Linear Regression

Suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years).

In [1]:
%matplotlib inline
import math
import time
import numpy as np
import torch

from d2l import torch as d2l

### 3.1.1. Basics

For a *linear regression* model, we assume that the relationship between features $\mathbf{x}$ and target $y$ is approximately linear, i.e., that the conditional mean $E[Y \mid X=\mathbf{x}]$ can be expressed as a weighted sum of the features $\mathbf{x}$. In addition, we assume that any noise causing the target value to deviate from its expected value is well behaved, following a Gaussian distribution.

Typically,
* $n$ is the number of examples in the dataset,
* $\mathbf{x}^{(i)}$ denotes the $i$-th sample and $x_j^{(i)}$ denotes its $j$-th coordinate.

#### 3.1.1.1. Model

When the inputs consist of $d$ features, each feature is assigned an index (between 1 and $d$) and the prediction $\hat{y}$ is

\begin{split}
\hat{y} = w_1  x_1 + \cdots + w_d  x_d + b
\end{split}
or in a dot-product form:

\begin{split}
\hat{y} = \mathbf{w}^\top \mathbf{x} + b
\end{split}
where the vector $\mathbf{x}$ corresponds to the features of a single example.

The *design matrix* $\mathbf{X} \in \mathbb{R}^{n \times d}$ holds the features of the entire dataset of $n$ examples. $\mathbf{X}$ contains one row for every example and one column for every feature.

For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed as

\begin{split}
{\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b
\end{split}
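
As a minimal sketch of the matrix-vector form, the snippet below computes predictions for a tiny made-up dataset with NumPy. The feature values, weights, and bias are illustrative assumptions rather than values from the text. The scalar $b$ is added to every entry of $\mathbf{X}\mathbf{w}$ through broadcasting.

```python
import numpy as np

# Hypothetical design matrix: n = 3 houses, d = 2 features
# (area in square feet, age in years). All numbers are made up.
X = np.array([[1000.0, 10.0],
              [1500.0, 5.0],
              [2000.0, 20.0]])
w = np.array([100.0, -500.0])  # assumed weights, one per feature
b = 50000.0                    # assumed bias

# Vectorized prediction: y_hat = X w + b (b is broadcast over all n entries).
y_hat = X @ w + b

# The same predictions, one example at a time, using the dot-product form.
y_hat_loop = np.array([w @ x + b for x in X])

print(y_hat)
```

Both versions give the same vector of $n$ predictions; the vectorized form avoids the explicit Python loop.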

Given the features of a training dataset $\mathbf{X}$ and the corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given the features of a new example sampled from the same distribution as $\mathbf{X}$, its label is predicted (in expectation) with the smallest error.

#### 3.1.1.2. Loss Function

*Loss functions* quantify the distance between the *real* and *predicted* values of the target. 

The most common loss function for regression is the squared error. When the prediction for an example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the *squared error* is

\begin{split}
l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2
\end{split}

The loss on the entire training set of $n$ examples is the average of the per-example losses:

\begin{split}
L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2
\end{split}
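
The two formulas above can be sketched in a few lines of NumPy. The helper names `squared_loss` and `mean_squared_loss` and the toy numbers are assumptions made for this illustration.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-example squared error: 1/2 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2

def mean_squared_loss(X, y, w, b):
    """Average of the per-example losses over the n rows of X."""
    y_hat = X @ w + b
    return squared_loss(y_hat, y).mean()

# Toy data (made up): n = 2 examples, d = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([5.0, 6.0])
w = np.array([1.0, 1.0])
b = 0.0

# Predictions are [3, 7], so the per-example losses are [2.0, 0.5].
loss = mean_squared_loss(X, y, w, b)
print(loss)  # 1.25
```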

The objective is to seek parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across all training examples:

\begin{split}
\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\  L(\mathbf{w}, b).
\end{split}

#### 3.1.1.3. Analytic Solution

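With the bias $b$ absorbed into $\mathbf{w}$ by appending a column of ones to $\mathbf{X}$, the problem becomes minimizing $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. Setting the gradient with respect to $\mathbf{w}$ to zero gives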

\begin{aligned}
    \partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 =
    2 \mathbf{X}^\top (\mathbf{X} \mathbf{w} - \mathbf{y}) = 0
    \textrm{ and hence }
    \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}.
\end{aligned}

The solution is

\begin{split}
\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}
\end{split}
which is unique only when the matrix $\mathbf{X}^\top \mathbf{X}$ is invertible, i.e., when the columns of $\mathbf{X}$ are linearly independent.
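
The closed-form solution can be checked numerically. The sketch below generates synthetic data from assumed parameters (`true_w`, `true_b`, and the noise scale are arbitrary choices for this illustration), absorbs the bias by appending a column of ones to the design matrix, and solves the normal equations $\mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{y}$ with `np.linalg.solve` rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ground-truth parameters, chosen only for this illustration.
n, d = 1000, 2
true_w = np.array([2.0, -3.4])
true_b = 4.2

# Synthetic data: linear model plus small Gaussian noise.
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + rng.normal(scale=0.01, size=n)

# Absorb the bias into the weights by appending a column of ones.
X_aug = np.hstack([X, np.ones((n, 1))])

# Solve the normal equations (X^T X) theta = X^T y for theta = [w; b].
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w_star, b_star = theta[:d], theta[d]

# Cross-check with NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(w_star, b_star)
```

With this little noise, the recovered `w_star` and `b_star` should land close to the assumed parameters. When $\mathbf{X}^\top \mathbf{X}$ is singular or badly conditioned, `np.linalg.solve` can fail or lose accuracy, and a least-squares routine such as `np.linalg.lstsq` is usually the safer choice.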

#### 3.1.1.4. Minibatch Stochastic Gradient Descent