### Task A2: Linear regression on a very small dataset, from scratch with only NumPy
#### (illustrating the idea of rates of change with respect to the model parameters, and gradient descent)

Consider a really small dataset consisting of only three points in the plane: $(1,1),(2,4),(3,5)$.

For each point (training example), consider the $x$-coordinate as a single input feature and the $y$-coordinate as the output, or label, of the example. The objective is to find the line that best fits these data. Here 'best' means the line that minimises the mean of the squared residuals over the three points.

In this task, and in the following tasks A3 and A4, we will store our data using a slightly different convention from the one used in A1 and the lectures: $X_{\text{train}}$ and $y_{\text{train}}$ will now be the transposes of the corresponding matrices in A1. In A3 and A4 this form will be more convenient for applying functions given by matrices to the data, with the matrix multiplying the data matrix on the left.

The feature matrix $X_{\text{train}}$ contains the feature vectors for all the training examples. More precisely, $X_{\text{train}} = \begin{pmatrix} 1 & 2 & 3 \end{pmatrix}$ has one column for each training example, and each column is the feature vector for that training example (in this case we have only one feature, which is the $x$-coordinate).

The labels matrix $y_{\text{train}}$ contains the labels for all the training examples. More precisely $y_{\text{train}} = \begin{pmatrix} 1 & 4 & 5\end{pmatrix}$ has one column for each training example and each column contains the label for that training example (the $y$-coordinate).
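As a minimal sketch of this storage convention, the two matrices could be held as NumPy arrays of shape `(1, 3)`: one row per feature and one column per training example. The variable names `X_train` and `y_train` simply mirror the notation above and are an assumption, not a requirement.

```python
import numpy as np

# One column per training example, one row per feature (here a single feature).
X_train = np.array([[1.0, 2.0, 3.0]])   # shape (1, 3)
y_train = np.array([[1.0, 4.0, 5.0]])   # shape (1, 3)

print(X_train.shape, y_train.shape)     # (1, 3) (1, 3)

# The third training example is the third column of each matrix.
print(X_train[:, 2], y_train[:, 2])     # [3.] [5.]
```

Keeping both arrays two-dimensional (rather than flat vectors of shape `(3,)`) matches the matrix notation used here and in A3 and A4.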

The model here is a simple linear model

$$
\hat y = mx +c
$$

with trainable parameters $m$ and $c$. The goal is to find the optimal values of $m$ and $c$ to fit these training data. In the Machine Learning context this means defining a loss function between prediction and true label:

$$
L(\hat y, y)=(\hat y - y)^2
$$

and a cost function $J$, which is simply the average of these losses over all training examples.
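To make the distinction between the per-example loss and the cost concrete, here is a small sketch that evaluates both for given values of $m$ and $c$. The helper names `predict` and `cost` are hypothetical, chosen only for this illustration.

```python
import numpy as np

X_train = np.array([[1.0, 2.0, 3.0]])
y_train = np.array([[1.0, 4.0, 5.0]])

def predict(m, c, X):
    """Linear model y_hat = m*x + c, applied to every column of X at once."""
    return m * X + c

def cost(m, c, X, y):
    """Average of the squared-error losses over all training examples."""
    losses = (predict(m, c, X) - y) ** 2   # one loss per training example
    return float(np.mean(losses))

# With m = 0 and c = 0 every prediction is 0, so the losses are 1, 16 and 25.
print(cost(0.0, 0.0, X_train, y_train))    # 14.0
```

Because the data are fixed, the only quantities free to vary in `cost` are `m` and `c`.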

**Specifically, you should**

1. Show, in a markdown cell, how the cost $J$ depends only on the trainable parameters $m$ and $c$, and can be computed to be:
$$ J=\displaystyle\frac{14m^2+12cm-48m+3c^2-20c+42}{3}$$

2. Find, in a markdown cell, expressions for the partial derivatives $\displaystyle\frac{\partial J}{\partial m}$ and $\displaystyle\frac{\partial J}{\partial c}$.
3. Write a function `model1(alpha, num_iterations)` which takes as inputs a learning rate `alpha` and a number of iterations `num_iterations`, and returns optimized values of $m$ and $c$ found through gradient descent. More precisely, the function should initialize $m$ and $c$ randomly between $-2$ and $2$ and then perform `num_iterations` steps of gradient descent with learning rate `alpha`. This means that in each iteration the value of $m$ is updated to $m - \alpha\displaystyle\frac{\partial J}{\partial m}$ and the value of $c$ is updated to $c - \alpha\displaystyle\frac{\partial J}{\partial c}$, with both partial derivatives evaluated at the current values of $m$ and $c$.
Your function should print the cost $J$ periodically throughout the iteration process. You may wish to refer to the Machine learning project lecture for help with this. You may also wish to build helper functions for the various subtasks and call them from `model1`.
4. Plot the three data points along with the optimal line $mx+c$ your model has found.
5. Explore the effect of changing the learning rate and the number of iterations.
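
One possible way to sanity-check the partial derivatives you derive in step 2 is to compare them with central finite differences of the cost. The sketch below is only an illustration of that idea: the helper names `cost` and `numerical_gradient`, the step size `h`, and the test point $(m, c) = (0.5, -1)$ are all assumptions made for the example, and it is not a substitute for the analytic expressions the task asks for.

```python
import numpy as np

X_train = np.array([[1.0, 2.0, 3.0]])
y_train = np.array([[1.0, 4.0, 5.0]])

def cost(m, c):
    """Mean squared error of the line y_hat = m*x + c on the three points."""
    return float(np.mean((m * X_train + c - y_train) ** 2))

def numerical_gradient(m, c, h=1e-6):
    """Approximate dJ/dm and dJ/dc by central differences with step h."""
    dJ_dm = (cost(m + h, c) - cost(m - h, c)) / (2 * h)
    dJ_dc = (cost(m, c + h) - cost(m, c - h)) / (2 * h)
    return dJ_dm, dJ_dc

# Compare these two numbers with your own formulas evaluated at the same point.
dJ_dm, dJ_dc = numerical_gradient(0.5, -1.0)
print(dJ_dm, dJ_dc)
```

If your analytic derivatives agree with these approximations at a few different points, the update rule in step 3 is likely to be built on correct gradients.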

*Insert code and markdown cells here in which to answer this task*