![hslu_logo.png](./img/hslu_logo.png)


<hr style="border:1px solid black">

<h1 style="text-align:center;font-size:50px"><b>AAI - FS25</b></h1>
<p style="text-align:center;font-size:40px">Week 01</p>

---

# Linear Classifiers

---
---


# Table of contents for week 01
1. [Gradient Descent](#gradient_descent)
    1. [Learning and Optimisation](#learning_optimisation)
    2. [MSE-Cost Function](#mse_cost)
    3. [General formulation of Gradient Descent](#general_formulation_grad_desc)
    4. [Introduction to PyTorch](#intro_pytorch)
    5. [Example for Gradient Descent - Function Approximation](#function_approx)
    6. [Gradient Calculation using automatic Calculation from PyTorch](#automatic_gradient)
2. [Perceptron](#perceptron)
     1. [Rosenblatt’s Perceptron](#rosenblatt)
     2. [Single Layer Perceptron](#single_perceptron)
     3. [Cross Entropy Cost](#CE_cost)
3. [Annex](#annex)
    1. [Manual Gradient Calculation](#manual_gradient)


# Gradient Descent <a name="gradient_descent"></a>

## Learning and Optimisation <a name="learning_optimisation"></a>

In this chapter we formulate the machine learning task in formal mathematical language. The example task we want to solve is to automatically predict the correct digit from an image of a handwritten digit (<a href="#fig1">Fig.1</a>). We assume a hypothetical mapping $f(\mathbf{x})$ that fulfils this task but of which we have no knowledge. Therefore, we construct a model $h_θ (\mathbf{x})$ that should approximate this mapping $f(\mathbf{x})$. The subscript $θ$ of the model function $h_θ (\mathbf{x})$ represents the parameters it depends on, which are optimized during the learning step shown in <a href="#fig2">Fig.2</a>. Based on a set of training data $(\mathbf{x},y)$, where $\mathbf{x}$ represents the input data (the images) and $y$ the corresponding labels (the digits), we iteratively correct, i.e. optimize, the parameters $θ$ of the model function $h_θ (\mathbf{x})$ to minimize the discrepancy between the true labels $y$ and the model prediction $\hat{y} = h_θ (\mathbf{x})$. This discrepancy is quantified with a so-called cost function $J(θ)$.

<br>
<img src="img/ml-task.png" alt="Drawing" width="250" />
<a id="fig1">Fig 1:</a> We want to solve a classification task i.e., find the correct digits (right) to the input images (left).

---

<img src="img/learning.png" alt="Drawing" width="320" />
<a id="fig2">Fig 2:</a> Formalization of the learning procedure required to optimize our model $h_θ (\mathbf{x})$ (for details see text).

---


## MSE-Cost Function <a name="mse_cost"></a>
As a simple introductory example for the optimisation procedure we consider a regression problem: the approximation of a sine function. The cost function suitable for regression problems is the Mean Squared Error (MSE), which has the advantage of being very intuitive.
As the name suggests, the Mean Squared Error cost function determines the average squared distance between the true outcomes $y^{(i)}$ and the predictions $\hat{y}^{(i)}=h_θ (\mathbf{x}^{(i)})$:

<img src="img/mse-cost.png" alt="Drawing" width="400" />

The index $(i)$ runs over the set of training samples, e.g. the available sample images with handwritten digits in the classification task above.
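As a minimal sketch, the MSE cost can be computed in a few lines of PyTorch. The function name `mse_cost` and the toy values are assumptions chosen for illustration. Note that the annex uses an additional factor $1/2$ in front of the sum; this only rescales the cost and does not change the location of its minimum.

```python
import torch

def mse_cost(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Average squared distance between predictions y_hat and targets y."""
    return torch.mean((y_hat - y) ** 2)

# Hypothetical toy values: three targets and three predictions
y = torch.tensor([0.0, 1.0, 2.0])
y_hat = torch.tensor([0.0, 2.0, 4.0])

cost = mse_cost(y_hat, y)  # (0^2 + 1^2 + 2^2) / 3
print(cost.item())
```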

## General formulation of Gradient Descent <a name="general_formulation_grad_desc"></a>


The minimisation of any cost function $J(θ)$ with the Gradient Descent (GD) scheme works as follows:

<img src="img/gradient-decent-algo.png" alt="Drawing" width="650" />

<a href="#fig3">Fig.3</a> illustrates GD using a cost function $J(θ)$ that depends on a two-dimensional parameter vector $θ=(θ_0,θ_1)$. In this case $J(θ)$ can be represented as a surface in 3D space. As is well known from multidimensional analysis, the derivative

<img src="img/grad-mse.png" alt="Drawing" width="200" />

is a 2D vector pointing in the direction of steepest ascent of the function $J(θ)$ at $θ_t$; its negative therefore points in the direction of steepest descent, which is the direction GD follows. Thus, when starting at the point indicated by the star, the GD algorithm leads successively along the black trajectory to the indicated local minimum, which is not necessarily the global minimum.

<img src="img/gradient-decent.png" alt="Drawing" width="550" />
<a id="fig3">Fig 3:</a> Gradient descent always moves along the (locally) steepest descent and eventually reaches a local (but not necessarily global) minimum.
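The update rule can be sketched on a simple two-parameter cost function. The quadratic $J(θ)$, the starting point, the learning rate `alpha` and the number of steps below are assumptions chosen for illustration only; the gradient is written out by hand here.

```python
import torch

def J(theta: torch.Tensor) -> torch.Tensor:
    """Hypothetical cost with its minimum at theta = (1, -2)."""
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 2.0) ** 2

def grad_J(theta: torch.Tensor) -> torch.Tensor:
    """Analytical gradient of J with respect to (theta_0, theta_1)."""
    return torch.stack([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 2.0)])

theta = torch.tensor([4.0, 3.0])  # starting point (the "star")
alpha = 0.1                       # learning rate (step size)

for t in range(200):
    # Step against the gradient, i.e. along the steepest descent
    theta = theta - alpha * grad_J(theta)

print(theta, J(theta).item())
```

For this convex example there is only one minimum, so GD reaches the global one. On a surface with several minima, as in the figure, the result depends on the starting point.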

---

## Introduction to PyTorch <a name="intro_pytorch"></a>
Here we want to review some basic notions of PyTorch using the following Jupyter notebook:

**Exercise:** 
**[sw01.01.intro-pytorch.ipynb](http://localhost:8888/notebooks/Kursunterlagen/SW01/sw01.01.intro-pytorch.ipynb#)**

Work through the Jupyter notebook and study the use of PyTorch tensors, their similarity to NumPy arrays and the important autograd feature.

---

## Example for Gradient Descent - Function Approximation <a name="function_approx"></a>

We will use the approximation of a sine-function with a polynomial of increasing degree as a guiding example to study different implementations of the Gradient Descent algorithm.

<br>
<img src="img/sin-fit.png" alt="Drawing" width="450" />
<a id="fig4">Fig 4:</a> We use Gradient Descent for an approximation of a sine-function with a polynomial.

---

## Gradient Calculation using automatic Calculation from PyTorch <a name="automatic_gradient"></a>

As we have already seen in the exercise on the [Introduction to PyTorch](#intro_pytorch), PyTorch can calculate gradients automatically. For this to work, a tensor has to be declared with `requires_grad=True` when it is created.
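A minimal sketch of this mechanism, using a hypothetical scalar function $y = x^2 + 2x$ chosen only for illustration:

```python
import torch

# Declare the tensor so that autograd tracks operations on it
x = torch.tensor(3.0, requires_grad=True)

# Any computation built from x is recorded in the computational graph
y = x ** 2 + 2.0 * x

# Backpropagate: fills x.grad with dy/dx evaluated at x = 3
y.backward()

print(x.grad)  # analytically: dy/dx = 2x + 2
```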

**Exercise:** 
**[sw01.02.optimization_autograd.ipynb](http://localhost:8888/notebooks/Kursunterlagen/SW01/sw01.02.optimization_autograd.ipynb#)**

Work step by step through the Jupyter notebook to understand the general problem and the structure of the class `poly_fit`. Then start completing the code where you find the following remarks:

*### START YOUR CODE ###*

*### END YOUR CODE ###*

With a correct implementation the fit at the end should look as in <a href="#fig4">Fig.4</a>.

Note:<br>
In the example above we used the `autograd` functionality of PyTorch to determine the gradient for the update step of the GD algorithm. It is also possible to calculate the gradient analytically. For the present case of the regression problem with the MSE cost function this is done in the <a href="#annex">annex</a>.

---


# Perceptron <a name="perceptron"></a>

In the introductory example the sine function was approximated by a linear combination of polynomial terms. Such linear optimisation problems played an important role at the beginning of what is now considered to be the field of machine learning and artificial intelligence.

## Rosenblatt's Perceptron <a name="rosenblatt"></a>
A milestone in the development of artificial neural networks (ANN) was the single Perceptron, also referred to as Linear Threshold Unit (LTU), developed by Frank Rosenblatt in 1958.
The idea was inspired by the functioning of biological neurons. Each neuron receives input signals from its dendrites and produces output signals along its (single) axon. The axon eventually branches out and connects via synapses to dendrites of other neurons, thus forming a neural network.

<br>
<img src="img/bio_neuron.png" alt="Drawing" width="450" />
<a id="fig5">Fig 5:</a> Representation of a biological neuron (details see text). 
<br>

---

In the computational model of a neuron (<a href="#fig6">Fig.6</a>), the neuron receives the input signal $\mathbf{x}$, a vector consisting of $n$ components $x_k$ with $k=1,..,n$, from the axons of the previous neurons (situated to the left and not drawn). Each signal $x_k$ is multiplied by the synaptic strength $w_k$ of the corresponding synapse, giving $x_k \cdot w_k$. The idea is that the synaptic strengths $w_k$ are learnable and control the strength and direction of the influence of one neuron on another, which can be excitatory (positive weight) or inhibitory (negative weight). In the basic model, the dendrites carry the signals to the cell body, where they are all summed up. If the final sum is above a certain threshold, the neuron fires, sending a spike along its axon. Mathematically this can be expressed as follows:

$$ 
\hat{y} = h_θ (\mathbf{x})= H \left( \sum_{k=1}^n w_k \cdot x_k + b \right)
$$

Here $H(z)$ is the Heaviside step function, i.e. $H(z) = 0$ if $z<0$ and $H(z)=1$ if $z \geq 0$. The parameter $b$, called the bias, allows the threshold value above which the neuron will "fire" to be tuned. The model parameters consist of the weights and the bias: $\theta = (w_k,b)$.
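The formula above can be sketched directly in PyTorch. The function name `ltu` and the hand-picked weights, which make the unit behave like a logical AND on two binary inputs, are assumptions for illustration; nothing is learned here.

```python
import torch

def ltu(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Linear Threshold Unit: H(w . x + b) with H the Heaviside step."""
    z = torch.dot(w, x) + b
    return (z >= 0).float()  # 1 if z >= 0, else 0

# Hypothetical, hand-chosen parameters realising a logical AND
w = torch.tensor([1.0, 1.0])
b = torch.tensor(-1.5)

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
outputs = [int(ltu(torch.tensor(p), w, b).item()) for p in points]
print(outputs)  # only the input (1, 1) lies above the threshold
```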

<img src="img/rosenblatt.png" alt="Drawing" width="450" />
<a id="fig6">Fig 6:</a> Graphical representation of the single Perceptron also referred to as Linear Threshold Unit.
<br>




---

One can visualise Rosenblatt's Perceptron as a "hard" decision hyperplane in the $n$-dimensional space, with $n$ being the dimension of the input vector $\mathbf{x}$.

We will extend Rosenblatt's Perceptron in several ways. In a first step we keep the single (neuron) layer architecture but replace the activation function $H(z)$ and extend the binary classification problem to a multi-class one. Later we will add further layers to obtain the Multi Layer Perceptron.

## Single Layer Perceptron <a name="single_perceptron"></a>

### Generalised Perceptron for binary classification
The Heaviside step function $H(z)$, while conceptually simple, has the major drawback of not being suitable for optimization schemes like Gradient Descent. GD makes use of the derivative of the activation function, which for $H(z)$ is zero everywhere except at $z=0$, where it is a Dirac impulse. A smooth version of the Heaviside function $H(z)$ is given by the so-called sigmoid function $σ(z)$ shown in <a href="#fig7">Fig.7</a>. Like $H(z)$, the function $σ(z)$ rises from 0 to 1, but in a smooth manner, and is therefore differentiable (in fact infinitely often).

<img src="img/sigmoid.png" alt="Drawing" width="300" />
<a id="fig7">Fig 7:</a> The sigmoid activation function is a smooth generalisation of the Heaviside step function.
<br>

---

Now, based on the sigmoid function we generalise Rosenblatt's Perceptron to the version shown in
<a href="#fig8">Fig.8</a> below. The output is no longer a binary ‘yes’ (1) or ‘no’ (0), but a numeric value in the interval $(0,1)$. The output can now be interpreted as the probability of belonging to the ‘yes’ category, which allows us to apply concepts such as maximum likelihood or cross entropy to define a suitable loss function. Class labels can be assigned according to the rule:

- yes:   $\hat{y} = h_θ (\mathbf{x})≥0.5$

- no:   $\hat{y} = h_θ (\mathbf{x})<0.5$

This is the intuitive choice: the label ‘yes’ is assigned if the probability of belonging to that category is at least 50%. It is nevertheless possible to use other threshold values in order to 'tune' the classifier. We will come back to this point when discussing the so-called performance measures.
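A minimal sketch of the generalised perceptron and the decision rule, using the standard definition $σ(z) = 1/(1+e^{-z})$ as provided by `torch.sigmoid`. The function name `generalised_perceptron`, the parameter values and the input are assumptions for illustration.

```python
import torch

def generalised_perceptron(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Sigmoid applied to the weighted sum of the inputs plus bias."""
    z = torch.dot(w, x) + b
    return torch.sigmoid(z)  # value in (0, 1), read as a probability

# Hypothetical parameters and input
w = torch.tensor([1.0, 1.0])
b = torch.tensor(-1.5)
x = torch.tensor([1.0, 1.0])

y_hat = generalised_perceptron(x, w, b)  # sigmoid(0.5)
label = int(y_hat >= 0.5)                # decision rule with threshold 0.5
print(y_hat.item(), label)
```

Replacing `0.5` by another value is all that is needed to shift the decision threshold.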

<img src="img/gen_perceptron.png" alt="Drawing" width="550" />
<a id="fig8">Fig 8:</a> The generalised perceptron using the sigmoid activation function.
<br>

---

As for the Rosenblatt Perceptron, we need to optimize the parameters $θ=(\mathbf{w},b)$, i.e. the weight vector $\mathbf{w}$ and the bias $b$. This is done using Gradient Descent, but before we start, we extend the perceptron to multi-class problems.

### Multi-Class Classification and Softmax Activation
So far, we restricted our models to binary classification because the output was either true or false. We now want to generalise our task to $K$ classes as shown in <a href="#fig9">Fig.9</a> below. The index $l$, with $0≤l≤K-1$, represents the class index, and we now have $K$ different output neurons. This means that our parameter space increases: each input neuron is connected to each output neuron, so that we now have $K$ independent parameter vectors $θ_l=(\mathbf{w}_l,b_l)$. However, if we keep the sigmoid activation function for each neuron, the outputs can no longer be interpreted as probabilities because their sum will in general not be equal to one. Therefore we introduce a new activation function called 'Softmax'. Its idea is based on the following two points:

- We want to choose the output with the highest value (=probability).

- We want the output to be normalised to 1.

We introduce the so-called logits, which are the weighted sums of the input vector $\mathbf{x}$ for each neuron:

${z_l} = \mathbf{w}_l\cdot\mathbf{x}+b_l$

Note that the product between $\mathbf{w}_l$ and $\mathbf{x}$ is a scalar product of two vectors. We do not apply the sigmoid activation function but use the identity 'id'. Then we calculate the Softmax activation to obtain all output values $h_{\theta,l}(\mathbf{x})$ as follows:

$$
h_{\theta,l}(\mathbf{x})=\frac{\exp{(z_l)}}{\sum_{j=0}^{K-1} \exp{(z_j)}}
$$

Because the exponential function is monotonically increasing, the highest logit $z_l$ corresponds to the highest exponential term $\exp(z_l)$, and because all exponential terms are divided by the same value, the order is preserved. Thus the Softmax is indeed a smooth (soft) way of determining the maximum over all $z_l$. Furthermore, due to the normalisation factor in the denominator, $\sum_{j=0}^{K-1} \exp{(z_j)}$, the sum over all outputs is one and $h_{\theta,l}(\mathbf{x})$ can be interpreted as the probability for $\mathbf{x}$ to belong to the class $l$:

$$
\sum_{l=0}^{K-1} h_{\theta,l}(\mathbf{x}) = 1
$$
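Both properties can be checked numerically with a short sketch. The logit values are hypothetical. Subtracting the maximum logit before exponentiating is a common numerical-stability measure that is not part of the formula above; it cancels in the ratio and leaves the result unchanged.

```python
import torch

def softmax(z: torch.Tensor) -> torch.Tensor:
    """Softmax over a 1D tensor of logits."""
    z = z - z.max()   # stability shift (an addition to the formula; cancels out)
    e = torch.exp(z)
    return e / e.sum()

z = torch.tensor([2.0, 1.0, -1.0])  # hypothetical logits for K = 3 classes
p = softmax(z)

print(p, p.sum())                   # outputs sum to one
print(p.argmax() == z.argmax())     # ordering of the logits is preserved
```

In practice the built-in `torch.softmax(z, dim=0)` gives the same result.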

<br>
<img src="img/soft_perceptron.png" alt="Drawing" width="550" />
<a id="fig9">Fig 9:</a> General formulation of the Softmax layer providing normed probabilities over all $K$ output classes.
<br>

---
<a id="single_layer_perceptron"></a>**Exercise:**

**[sw01.03.single_layer_perceptron.ipynb](http://localhost:8888/notebooks/Kursunterlagen/SW01/sw01.03.single_layer_perceptron.ipynb#)**

Work step by step through the Jupyter notebook to understand the general structure:

- Cell [2] `Test the creation of data`:
  This is the first important entry point. The function `create_data` reads the image `Regions00.png` from file and creates a set of 1000 data points (2D-coordinates) distributed over the greyscale regions of the image. The idea is that you can change the content of the image and experiment with the so-called "representational capacity" of the single layer Perceptron.

- Cell [3] `Define the Neural Network`:
  This is where you are supposed to work. You should be familiar with the structure of the class because it closely resembles the class `poly_fit` from the exercise `sw01.02.optimization_autograd.ipynb`. You should complete the code where you find the following remarks:

   *### START YOUR CODE ###*

   *### END YOUR CODE ###*

   The missing code is the Single Layer Perceptron based on the PyTorch module [`nn`](https://pytorch.org/docs/stable/nn.html#module-torch.nn). You will require a `torch.nn.Linear` layer. Refer to the PyTorch documentation (link above) for further details.
   Note that we use a new loss function, `torch.nn.CrossEntropyLoss`. This is the usual choice for classification problems. We will discuss it below.

- Once you have completed the implementation of the layers, move on to cell [4] `Create the data`, which creates the data used for the training.

- Cell [5] will prepare the data through the correct type conversion and the normalisation.

- Cell [6] `Setup Perceptron and do optimization` will now start the training. If your implementation is correct you can visualise the result with
  cell [7]. The result should look like <a href="#fig10">Fig.10</a> below.

- You may want to experiment with other data set configurations to explore the possibilities of the single layer Perceptron. Try to find a classification problem that is too hard to be solved.
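Independently of the notebook, the combination of `torch.nn.Linear` and `torch.nn.CrossEntropyLoss` can be sketched on toy data. Everything below (the two Gaussian blobs, the learning rate, the number of epochs and all variable names) is a hypothetical setup and not the notebook's `create_data` pipeline or class structure. Note that `CrossEntropyLoss` expects the raw logits, because it applies the (log-)Softmax internally.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: two well-separated Gaussian blobs in 2D
n = 100
x0 = torch.randn(n, 2) + torch.tensor([-2.0, -2.0])
x1 = torch.randn(n, 2) + torch.tensor([2.0, 2.0])
X = torch.cat([x0, x1])
y = torch.cat([torch.zeros(n, dtype=torch.long), torch.ones(n, dtype=torch.long)])

# Single layer: 2 inputs, K = 2 output neurons producing the logits
model = torch.nn.Linear(in_features=2, out_features=2)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()        # reset gradients from the previous step
    loss = loss_fn(model(X), y)  # logits in, class indices as targets
    loss.backward()              # autograd computes the gradients
    optimizer.step()             # gradient descent update

# Predicted class = index of the largest logit
accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(loss.item(), accuracy)
```

Because the layer is linear, the decision boundary between the two classes is a straight line, which is why data sets that are not linearly separable cannot be solved by this model.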


<img src="img/class_result.png" alt="Drawing" width="550" />
<a id="fig10">Fig 10:</a> Decision boundaries and original datasets after successful training of the single layer Perceptron.

---


## Cross Entropy Cost <a name="CE_cost"></a>

In the introductory example above we used the MSE cost function, which is the default choice for regression problems.
It turns out that for classification problems a different cost function is appropriate: the cross entropy (CE) cost. We will introduce this function now.


Statistics provides a general framework to develop estimators and studies the bias and variance of these estimators, a topic which we will deal with later. One very general concept of obtaining a “good” estimator is the so-called maximum likelihood estimator (MLE) principle. We will introduce the MLE by formulating its principle for a general classification problem.<br>
Consider a set of training data $(\mathbf{x}^{(i) },y^{(i) })$ related to a classification task. We assume that there exists a true but unknown (distribution) function $p(y|\mathbf{x})$ that gives the probability of the classes $y$ given the input $\mathbf{x}$. Here $y$ is equal to a value $l$ out of $K$ classes, i.e. $l\in \{0,1,...,K-1\}$. For our (Fashion-)MNIST classification problem, given an image $\mathbf{x}^{(i) }$, the function $p(y |\mathbf{x})$ would predict the correct class $y^{(i) }$ with certainty, i.e.:
$$
p(y  |\mathbf{x}^{(i) }) = \delta_{y,y^{(i)}}
$$
Here, $\delta_{y,y^{(i)}}$ is the Kronecker delta, which is equal to one if $y=y^{(i)}$ and zero otherwise. Obviously, the prediction of the hypothetical distribution function $p(y |\mathbf{x})$ is not limited to our training data set $(\mathbf{x}^{(i) },y^{(i) })$ but extends to any possible input vector $\mathbf{x}$.  <br>
Now, we want to model $p(y|\mathbf{x})$ using an estimator $p_θ (y|\mathbf{x})$, with the dependency on the model parameter vector $θ$ denoted by the subscript. Our goal is to optimize this estimator by tuning the parameter vector $θ$. The maximum likelihood principle selects those parameter values $θ_{MLE}$ that make the observed data most probable (most likely) under that estimator. In mathematical terms this reads:

<img src="img/MLE.png" alt="Drawing" width="750" />

Here the expression $p_θ (y^{(1) },y^{(2) },...|\mathbf{x}^{(1) },\mathbf{x}^{(2) },...)$ denotes the probability of the entire training data. On the right-hand side this is expressed as the product over the probabilities of the individual samples $p_θ (y^{(i) } |\mathbf{x}^{(i) })$, which are assumed to be independent. As the product on the right-hand side is prone to numerical underflow, the logarithm is taken, which turns the product into a sum and leads to the final formulation of the maximum likelihood estimator:

<img src="img/MLE1.png" alt="Drawing" width="700" />

The version on the right-hand-side is obtained by applying two changes, which mutually compensate: adding a minus sign and replacing the max- by a min-function. We conclude that the maximum likelihood estimator is based on the minimization of a suitable function given in square brackets on the right-hand-side which we will denote as Cross Entropy Cost. 

<img src="img/MLE2.png" alt="Drawing" width="650" />

Note the additional factor of $1/m$, which represents the division by the number of training samples $m$ and, because it is a constant factor, does not influence the optimization result. Thus, as for the MSE cost, we calculate the average cross entropy loss over all training samples, so that the scale of the CE cost does not depend on the number of samples.
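As a sketch, and assuming the CE cost has the form described in the text, i.e. the negative average of $\log p_θ (y^{(i)}|\mathbf{x}^{(i)})$ over the $m$ samples with $p_θ$ given by the Softmax output, it can be computed by hand and compared with PyTorch's built-in function. The logits and labels are hypothetical values.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for m = 2 samples and K = 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
y = torch.tensor([0, 2])  # true class index of each sample

m = logits.shape[0]
probs = torch.softmax(logits, dim=1)  # model probabilities p_theta(l | x)

# Pick the probability assigned to the true class of each sample,
# take the logarithm, average over the samples and flip the sign
ce_manual = -torch.log(probs[torch.arange(m), y]).mean()

# Built-in version: takes the raw logits and averages by default
ce_torch = F.cross_entropy(logits, y)

print(ce_manual.item(), ce_torch.item())
```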

*Remark*<br>
The above expression can in fact be understood as the cross entropy between the model $p_θ (y|\mathbf{x})$ and the *empirical* distribution (note the hat, $\hat{p}$),
$$
\hat{p}(y | \mathbf{x}) = \delta_{y,y^{(i)}} \cdot \delta_{\mathbf{x},\mathbf{x}^{(i)}}
$$
which simply gives the correct classes $\delta_{y,y^{(i)}}$ (but, in contrast to the true distribution function $p(y|\mathbf{x})$, for our sample set only, i.e. $\delta_{\mathbf{x},\mathbf{x}^{(i)}}$). From the definition of the cross entropy between the two distributions we obtain
$$
-\frac{1}{m} \sum_{i=1}^m \sum_{l=0}^{K-1} \hat{p}(l | \mathbf{x}^{(i)}) \cdot \log{ p_θ (l | \mathbf{x}^{(i)}) }
$$
which equals the expression given above.
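To see why the two expressions agree, it helps to spell out the inner sum over the classes for a single sample. Since $\hat{p}(l | \mathbf{x}^{(i)}) = \delta_{l,y^{(i)}}$, only the term with $l = y^{(i)}$ survives:

$$
\sum_{l=0}^{K-1} \hat{p}(l | \mathbf{x}^{(i)}) \cdot \log{ p_θ (l | \mathbf{x}^{(i)}) } =
\sum_{l=0}^{K-1} \delta_{l,y^{(i)}} \cdot \log{ p_θ (l | \mathbf{x}^{(i)}) } =
\log{ p_θ (y^{(i)} | \mathbf{x}^{(i)}) }
$$

The double sum therefore reduces to a single sum over the samples, containing only the log-probability that the model assigns to the correct class.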


---

**Exercise:** 
**[sw01.04.perceptron_fashion-mnist.ipynb](http://localhost:8888/notebooks/Kursunterlagen/SW01/sw01.04.perceptron_fashion-mnist.ipynb#)**

This is an extension to the above exercise <a href="#single_layer_perceptron">single_layer_perceptron</a> now using "real" image data. 

- Cell [2] `Read dataset (MNIST or FashionMNIST)`:
  You can read the data set `MNIST` or `FashionMNIST`, both consisting of 28x28 single-channel grayscale images. Their small size makes it possible to run a training session in a reasonable time on a current laptop. Note that the first time the data is downloaded and stored in the local folder `storage_path`. If the function `read_data` is called a second time, the data is read from that folder.

- Cells [3] - [5]:
  These cells plot a few representative images. Note that in cell [5] you can plot images corresponding to the same label. This allows you to study the intra-class variance.


- Cells [6] - [8]:
  These cells should be familiar from the previous exercise <a href="#single_layer_perceptron">single_layer_perceptron</a>.

- Cell [9]:
  This cell illustrates how to plot some false classifications.

- Cells [10] - [11]:
  Work on these cells and determine the error rate on the test images (cell [10]). Results should be around 8% for MNIST and some 23% for FashionMNIST data. Furthermore, plot the weights (cell [11]) by reshaping them as 28x28 images. For FashionMNIST the result should look like <a href="#fig11">Fig.11</a>.

<img src="img/weights_fashionMNIST.PNG" alt="Drawing" width="650" />
<a id="fig11">Fig 11:</a> If the weights are reshaped as 28x28 images, they reveal what the perceptron has learned (image for FashionMNIST data).

---

# Annex <a name="annex"></a>
## Manual Gradient Calculation <a name="manual_gradient"></a>

When dealing with the derivative of the MSE cost function $J_{\rm MSE}(\mathbf{\theta})$ with respect to the model parameters $\mathbf{\theta}$ we always have a generic part:

$$ 
\frac{\partial}{\partial \theta} J_{\rm MSE}(\mathbf{\theta}) = 
\frac{\partial}{\partial \theta} \left[ \frac{1}{2m}\sum_{i=1}^m \left(\hat{y}^{(i)} - y^{(i)} \right)^2 \right]= 
\frac{\partial}{\partial \theta} \left[ \frac{1}{2m}\sum_{i=1}^m \left(h_θ (\mathbf{x}^{(i)}) - y^{(i)} \right)^2 \right]= 
\frac{1}{m}\sum_{i=1}^m \left(h_θ (\mathbf{x}^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta} h_θ (\mathbf{x}^{(i)})
$$
Thus:
$$ 
\frac{\partial}{\partial \theta} J_{\rm MSE}(\mathbf{\theta}) =
\frac{1}{m}\sum_{i=1}^m \left(\hat{y}^{(i)} - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta} h_θ (\mathbf{x}^{(i)})
$$

The first factor under the sum is the difference between the prediction and the ground truth value, $(\hat{y}^{(i)} - y^{(i)})$, which is generic, and the second factor is the derivative of the model function $h_θ (\mathbf{x}^{(i)})$ with respect to the parameter $\theta$. The latter depends on the specific choice of the model function.

In the following example we will use a generalized version of the linear regression. Our model function will be a polynomial of degree $N$ given by:
$$
h_θ (x^{(i)}) = h_{(\mathbf{w},b)} (x^{(i)}) = b + w_1 \cdot x^{(i)} + w_2 \cdot (x^{(i)})^2 + \ldots + w_N \cdot (x^{(i)})^N
$$

The model parameters $\theta$ consist of a list of weights $(w_1, w_2, ..., w_N)$ and a bias $b$, and the input $x^{(i)}$ is a single scalar number. We now rewrite the model function in the following way:
$$
h_θ (x^{(i)}) = h_{(\mathbf{w},b)} (x^{(i)}) = b + \mathbf{w^T}\cdot \mathbf{x^{(i)}} 
$$

Here $\mathbf{w^T}$ is a (*row*) vector of the weights and $\mathbf{x^{(i)}}$ is a column vector defined by:

$$
\mathbf{x^{(i)}}=\left[\begin{array}{c}
x^{(i)} \\
(x^{(i)})^2 \\
\vdots \\
(x^{(i)})^N
\end{array}\right]
$$

By default we consider vectors to be columns; a row is obtained by the transpose operator $.^T$. The derivative of the model function $h_{(\mathbf{w},b)} (x^{(i)})$ with respect to the parameters $(\mathbf{w},b)$ is now straightforward:
$$
\frac{\partial}{\partial b} h_{(\mathbf{w},b)} (\mathbf{x}^{(i)}) = \frac{\partial}{\partial b} \left( b + \mathbf{w^T} \cdot \mathbf{x^{(i)}} \right) = 1
$$

$$
\frac{\partial}{\partial \mathbf{w}} h_{(\mathbf{w},b)} (\mathbf{x}^{(i)}) = \frac{\partial}{\partial \mathbf{w}}  \left( b + \mathbf{w^T} \cdot \mathbf{x^{(i)}} \right) = (\mathbf{x^{(i)}})^T
$$
 
Thus the derivative of $h_{(\mathbf{w},b)}$ with respect to $\mathbf{w}$ is simply the vector $\mathbf{x^{(i)}}$ written as a row vector, hence the transpose operator.
We can now formulate the full derivative of the MSE cost function:
$$ 
\frac{\partial}{\partial b} J_{\rm MSE}(\mathbf{\theta}) =
\frac{1}{m}\sum_{i=1}^m \left(\hat{y}^{(i)} - y^{(i)} \right) \cdot 1
$$

$$ 
\frac{\partial}{\partial \mathbf{w}} J_{\rm MSE}(\mathbf{\theta}) = \frac{1}{m}\sum_{i=1}^m \left(\hat{y}^{(i)} - y^{(i)} \right) \cdot (\mathbf{x^{(i)}})^T
$$
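These formulas can be checked numerically against `autograd`. The sketch below is not the notebook's `poly_fit` class; the degree `N`, the sample points, the target function and all variable names are assumptions for illustration. The cost uses the factor $1/(2m)$ from the derivation above.

```python
import torch

torch.manual_seed(0)

N = 3                              # polynomial degree (hypothetical choice)
m = 20                             # number of samples
x = torch.linspace(-1.0, 1.0, m)   # scalar inputs x^(i)
y = torch.sin(3.0 * x)             # hypothetical targets

# Row i of X holds the vector (x, x^2, ..., x^N) for sample i
X = torch.stack([x ** k for k in range(1, N + 1)], dim=1)  # shape (m, N)

w = torch.randn(N, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

y_hat = X @ w + b                            # model h(x) = b + w^T x
cost = ((y_hat - y) ** 2).sum() / (2 * m)    # MSE cost with factor 1/(2m)
cost.backward()                              # autograd gradients in w.grad, b.grad

# Manual gradients from the formulas above
residual = (y_hat - y).detach()                           # (y_hat^(i) - y^(i))
grad_b_manual = residual.mean()                           # (1/m) sum residual * 1
grad_w_manual = (residual.unsqueeze(1) * X).mean(dim=0)   # (1/m) sum residual * x^(i)

print(torch.allclose(w.grad, grad_w_manual, atol=1e-5))
print(torch.allclose(b.grad, grad_b_manual.reshape(1), atol=1e-5))
```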


Using these formulas you can now complete the following notebook.