### Components of a machine learning algorithm (linear regression)


| Symbol Type | Example | Typical Shape | Description |
|--------------|----------|----------------|--------------|
| **Scalar** | $a, b, x_i, y, \eta$ | $1 \times 1$ | Single numeric value (e.g., a feature value, bias, or loss). |
| **Vector** | $\mathbf{x}, \mathbf{w}, \boldsymbol{\theta}$ | $p \times 1$ | Column vector containing multiple values (e.g., features or parameters). |
| **Matrix** | $\mathbf{X}, \mathbf{W}, \mathbf{A}$ | $n \times p$ | 2-D array (rows = samples, columns = features). |
| **Dot product** | $\mathbf{w}^T \mathbf{x}$ | scalar | Inner product between two vectors. |
| **Outer product** | $\mathbf{x}\mathbf{w}^T$ | $p \times p$ | Creates a matrix from two vectors. |
| **Inverse** | $\mathbf{A}^{-1}$ | $p \times p$ | Matrix that “undoes” multiplication by $\mathbf{A}$, if it exists. |
| **Hat notation** | $\hat{y}$ | scalar or vector | Estimated or predicted value (e.g., $\hat{y}$ = model prediction). |
| **Bar notation** | $\bar{x}$ | scalar or vector | Mean or average value (e.g., $\bar{x}$ = sample mean). |
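
To make the notation concrete, the following sketch maps the rows of the table onto NumPy objects. The numeric values, and the choices $p = 3$ and $n = 2$, are arbitrary assumptions made only for illustration.

```python
import numpy as np

eta = 0.1                                   # scalar
x = np.array([[1.0], [2.0], [3.0]])         # column vector, shape (3, 1)
w = np.array([[0.5], [-1.0], [2.0]])        # column vector, shape (3, 1)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])             # matrix, shape (2, 3)

dot = (w.T @ x).item()                      # w^T x, a scalar: 0.5 - 2.0 + 6.0 = 4.5
outer = x @ w.T                             # x w^T, a matrix of shape (3, 3)

A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
A_inv = np.linalg.inv(A)                    # A^{-1}, so that A @ A_inv is the identity

x_bar = X.mean(axis=0)                      # mean of each column: [2.5, 3.5, 4.5]
```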


## 1. The model

The model is the mathematical structure or function family that maps inputs to outputs (in supervised learning). It defines which forms of relationship between input and output can be learned.

### For linear regression:

#### 1. Scalar form
Each observation $i$ has its own equation:
$$
y_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip} + \varepsilon_i
$$
The error term $\varepsilon_i$ is the part of the observed $y_i$ that the model does not explain. The prediction $\hat{y}_i$ is the same expression without $\varepsilon_i$.

#### 2. Vector (standard math) form
Compact form for a single observation using vectors (the intercept $w_0$ is written as $b$):
$$
y_i = \mathbf{w}^T \mathbf{x}_i + b + \varepsilon_i
$$


#### 3. Matrix form
All $n$ observations combined ($\mathbf{X}$ stacks the row vectors $\mathbf{x}_i^T$, and $\mathbf{b}$ repeats the intercept $b$ for every observation):
$$
\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b} + \boldsymbol{\varepsilon}
$$
or (if $b$ is absorbed as an intercept column of ones in $\mathbf{X}$):
$$
\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\varepsilon}
$$
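
The two matrix forms give identical predictions. The sketch below checks this on made-up numbers (the values of `X`, `w`, and `b` are assumptions chosen only for illustration). The noise term is left out because it is unobserved.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [4.0, 0.0],
              [6.0, 1.0]])        # n = 3 observations, p = 2 features
w = np.array([1.2, -0.3])         # one weight per feature
b = 0.5                           # intercept

# Form 1: explicit intercept, broadcast to every row
y_hat_explicit = X @ w + b

# Form 2: intercept absorbed as a leading column of ones
X_aug = np.column_stack([np.ones(len(X)), X])
w_aug = np.concatenate([[b], w])
y_hat_absorbed = X_aug @ w_aug

print(np.allclose(y_hat_explicit, y_hat_absorbed))  # True
```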


## 2. Input Data

- Examples: images, text, sensor data, neuroimaging features, etc.
- Often represented as a matrix $\mathbf{X}$ of shape $(n_{\text{samples}}, n_{\text{features}})$.
- Data needs to be cleaned, normalized, encoded, and sometimes split into train/test sets.


### The target variable $\mathbf{y}$ (what should be predicted)
$$
\mathbf{y} = 
\begin{bmatrix}
y_1 \\
y_2 \\
y_3 
\end{bmatrix}
$$

### The input data $\mathbf{X}$ (the predictors)
$$
\mathbf{X} = 
\begin{bmatrix}
1 & \text{Age}_1 & \text{Sex}_1 \\
1 & \text{Age}_2 & \text{Sex}_2 \\
1 & \text{Age}_3 & \text{Sex}_3
\end{bmatrix}
$$
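
A design matrix of this form can be assembled column by column. In the sketch below, the ages and the 0/1 coding of sex are invented for illustration and do not come from a real dataset.

```python
import numpy as np

age = np.array([34.0, 51.0, 27.0])   # hypothetical ages in years
sex = np.array([0.0, 1.0, 1.0])      # hypothetical binary coding

# Leading column of ones for the intercept, then one column per feature
X = np.column_stack([np.ones_like(age), age, sex])
print(X.shape)   # (3, 3)
```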


## 3. Parameters (Weights / Coefficients)

The tunable variables that define a specific instance of the model. During learning, these parameters are adjusted to best fit the data.

Examples:

- $\mathbf{w}$ and $b$ in linear regression
- Connection weights in a neural network

For the design matrix above, which includes the intercept column, there is one weight per column:

$$
\mathbf{w} = 
\begin{bmatrix}
w_1 \\
w_2 \\
w_3 
\end{bmatrix}
$$

## Detour: Matrix multiplication

Matrix multiplication is one of the most common operations in linear algebra.


When we multiply two matrices, the **number of columns in the first** must match the **number of rows in the second**.  
If $\mathbf{A}$ is of shape $m \times n$ and $\mathbf{B}$ is of shape $n \times p$,  
then the result $\mathbf{C} = \mathbf{A}\mathbf{B}$ will have shape $m \times p$.

Each element of the result is the **dot product** of a row from $\mathbf{A}$ and a column from $\mathbf{B}$:

$$
c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}
$$

As an example, here is the multiplication of a $2 \times 3$ matrix by a $3 \times 2$ matrix, resulting in a $2 \times 2$ matrix:

$$
\mathbf{C} = \mathbf{A}\mathbf{B}
=
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23}
\end{bmatrix}
\begin{bmatrix}
b_{11} & b_{12} \\
b_{21} & b_{22} \\
b_{31} & b_{32}
\end{bmatrix}
=
\begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} & a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32} \\
a_{21}b_{11} + a_{22}b_{21} + a_{23}b_{31} & a_{21}b_{12} + a_{22}b_{22} + a_{23}b_{32}
\end{bmatrix}
$$
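
The element-wise definition $c_{ij} = \sum_k a_{ik} b_{kj}$ can be written out directly as nested loops. The following is a minimal sketch for illustration only; the helper name `matmul` and the example values are assumptions, and in practice NumPy's `@` operator does the same job far more efficiently.

```python
import numpy as np

def matmul(A, B):
    """Multiply A (m x n) by B (n x p) using the element-wise definition."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("columns of A must match rows of B")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            # c_ij = sum over k of a_ik * b_kj
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))
    return C

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # 2 x 3
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])           # 3 x 2

C = matmul(A, B)                   # 2 x 2 result
print(np.array_equal(C, A @ B))    # True
```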


### Exercise: Linear Regression Prediction

A linear regression model estimates the relationship between predictors (features) and a target variable.  
Given an **input matrix** $\mathbf{X}$ and a **weight vector** $\mathbf{w}$, the model predicts target values $\hat{\mathbf{y}}$ according to:

$$
\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}
$$


#### **Task**

You are given the following data $\mathbf{X}$ and the weights $\mathbf{w}$ that were fitted to it:

$$
\mathbf{X} =
\begin{bmatrix}
1 & 2 \\
1 & 4 \\
1 & 6
\end{bmatrix},
\quad
\mathbf{w} =
\begin{bmatrix}
0.5 \\
1.2
\end{bmatrix}
$$

1. Compute the **predicted values** $\hat{\mathbf{y}}$ (by hand).

2. Do the same with NumPy, using the [dot product](https://numpy.org/devdocs/reference/generated/numpy.dot.html):
```python
import numpy as np
X = np.array([[1,2],[1,4],[1,6]])
w = np.array([[0.5],[1.2]])
```

::: {.callout-tip title="Solution" collapse="true"}
**Goal:** Compute $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$.

**Given**
$$
\mathbf{X} =
\begin{bmatrix}
1 & 2 \\
1 & 4 \\
1 & 6
\end{bmatrix},
\quad
\mathbf{w} =
\begin{bmatrix}
0.5 \\
1.2
\end{bmatrix}
$$

**Step-by-step**
$$
\hat{\mathbf{y}} =
\begin{bmatrix}
1 & 2 \\
1 & 4 \\
1 & 6
\end{bmatrix}
\begin{bmatrix}
0.5 \\
1.2
\end{bmatrix}
=
\begin{bmatrix}
1\cdot0.5 + 2\cdot1.2 \\
1\cdot0.5 + 4\cdot1.2 \\
1\cdot0.5 + 6\cdot1.2
\end{bmatrix}
=
\begin{bmatrix}
2.9 \\
5.3 \\
7.7
\end{bmatrix}
$$

**Result**
$$
\hat{\mathbf{y}} =
\begin{bmatrix}
2.9 \\
5.3 \\
7.7
\end{bmatrix}
$$

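**NumPy check**
```python
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 6]])  # input matrix, shape (3, 2)
w = np.array([[0.5], [1.2]])            # weight column vector, shape (2, 1)

y_hat = np.dot(X, w)  # equivalent to X @ w
print(y_hat)          # predictions, approximately 2.9, 5.3, 7.7
```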
:::

## 4. Objective or Loss Function

A quantitative measure of how well the model performs.

- It defines the goal of learning.
- Examples:
	- Mean Squared Error (MSE) for regression
	- Cross-Entropy Loss for classification
	- Negative log-likelihood for probabilistic models

Aim: The algorithm tries to minimize this loss (or, equivalently, to maximize an objective such as the likelihood).


Example: mean squared loss
$$
L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
$$

- $L$: total loss (the value we minimize)
- $n$: number of samples
- $y_i$: true (observed) value for sample $i$
- $\hat{y}_i$: predicted value for sample $i$

The squared difference $(\hat{y}_i - y_i)^2$ measures the error for each prediction.
Taking the mean instead of the sum keeps the loss on a comparable scale across datasets of different sizes.
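
Because linear regression predicts $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$, the same loss can also be written in matrix notation. The form below assumes the intercept is absorbed into $\mathbf{X}$ as a column of ones:

$$
L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
= \frac{1}{n} \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert^2
= \frac{1}{n} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y})
$$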

#### **Task**

You are given a target vector $\mathbf{y}$, in addition to the predicted vector $\hat{\mathbf{y}}$ from the previous task:

$$
\mathbf{y} =
\begin{bmatrix}
3.0 \\
5.0 \\
8.0
\end{bmatrix},
\qquad
\hat{\mathbf{y}} =
\begin{bmatrix}
2.9 \\
5.3 \\
7.7
\end{bmatrix}
$$

1. Calculate the mean squared loss between the predicted values from the previous task and the target vector $\mathbf{y}$ by hand.

2. Write a **function** in Python to do it. The function should take $\mathbf{y}$ and $\hat{\mathbf{y}}$ as input and return the mean squared loss, so that it can be called as `Loss = mean_squared_loss(y, y_hat)`.
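
::: {.callout-tip title="Solution" collapse="true"}
One possible solution is sketched below; other implementations are equally valid. The function name `mean_squared_loss` is assumed to be the intended spelling of the call in the task. By hand, the errors are $-0.1$, $0.3$, and $-0.3$, so the squared errors sum to $0.01 + 0.09 + 0.09 = 0.19$, and $L = 0.19 / 3 \approx 0.0633$.

```python
import numpy as np

def mean_squared_loss(y, y_hat):
    """Return the mean of the squared differences between y_hat and y."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y_hat - y) ** 2)

y = np.array([[3.0], [5.0], [8.0]])       # targets from the task
y_hat = np.array([[2.9], [5.3], [7.7]])   # predictions from the previous task

loss = mean_squared_loss(y, y_hat)
print(round(loss, 4))   # 0.0633
```
:::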