# Overview 

Please see the [homework policy](https://fdl.thecoatlessprofessor.com/syllabus/#homework)
for detailed instructions and some grading notes. Failure to follow instructions
will result in point reductions. In particular, make sure to commit each 
exercise as you complete it. 

> "Simple models and a lot of data trump more elaborate models based on less data."
> 
> -- Peter Norvig

## Grading

The rubric CAs will use to grade this assignment is:

| Task                                                   | Pts |
|:-------------------------------------------------------|----:|
| Denseness                                              | 30  |
| The Art of Backprop                                    | 20  |
| Descent from Below                                     | 10  |
| Total                                                  | 60  |


## Objectives 

The objectives behind this homework assignment are as follows:

- Implementing functions in Python;
- Viewing differences in activation functions;
- Applying gradient descent; and,
- Constructing neural networks.

# Assignment - Homework 3
STAT 430 - FDL, Spring 2020

Due: **Friday, March 13th, 2020 at 6:00 PM**

- **Author:** Skyler Shi
- **NetID:** jingtao2

### Collaborators

If you worked with any other students in preparing these answers, please
list their full names and NetIDs (e.g., `FirstName LastName (NetID)`).


## [30 points] Exercise 1 - Denseness

Consider a neural network architecture given by: 

$$\begin{align*}
z_1^{(1)} &= w_{1,1}^{(1)} x_1 + w_{2,1}^{(1)} x_2 + b_{1}^{(1)} \\
z_2^{(1)} &= w_{1,2}^{(1)} x_1 + w_{2,2}^{(1)} x_2 + b_{2}^{(1)} \\
z_3^{(1)} &= w_{1,3}^{(1)} x_1 + w_{2,3}^{(1)} x_2 + b_{3}^{(1)} \\
a_1^{(1)} &= g^{(1)}\left({z_1^{(1)}}\right) \\
a_2^{(1)} &= g^{(1)}\left({z_2^{(1)}}\right) \\
a_3^{(1)} &= g^{(1)}\left({z_3^{(1)}}\right) \\
z_1^{(2)} &= w_{1,1}^{(2)} a_1^{(1)} + w_{2,1}^{(2)} a_2^{(1)} + w_{3,1}^{(2)} a_3^{(1)} +  b_{1}^{(2)} \\
a_1^{(2)} &= g^{(2)}\left({z_1^{(2)}}\right) \\
\hat{y} &= a_1^{(2)}
\end{align*}$$

where $g^{(l)}(z)$ refers to a generic activation function applied in layer $l$.

**Note:**

- $z_i^{(l)}$ represents the linear combination value at neuron $i$ in layer $l$.
- $a_i^{(l)}$ represents the activation value at neuron $i$ in layer $l$.
- $w_{j,i}^{(l)}$ represents the weight parameter associated with neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$.
- $b_{i}^{(l)}$ represents the bias parameter associated with neuron $i$ in layer $l$.

Let the network's **first layer** have weight and bias values of:

$$
W^{(1)} = \begin{bmatrix} 
w_{1,1}^{(1)} & w_{2,1}^{(1)} \\
w_{1,2}^{(1)} & w_{2,2}^{(1)} \\
w_{1,3}^{(1)} & w_{2,3}^{(1)} \\
\end{bmatrix}_{3 \times 2} = \begin{bmatrix} 
1 & 1 \\
2 & -2 \\
3 & -1 \\
\end{bmatrix}_{3 \times 2}, b^{(1)} = \begin{bmatrix} 
b_{1}^{(1)} \\
b_{2}^{(1)} \\
b_{3}^{(1)} 
\end{bmatrix}_{3 \times 1} = \begin{bmatrix} 
-1 \\
1 \\
0 \\
\end{bmatrix}_{3 \times 1}
$$

Let the network's **second layer** have weight and bias values of:

$$
W^{(2)} = \begin{bmatrix} 
w_{1,1}^{(2)} & w_{2,1}^{(2)} & w_{3,1}^{(2)}
\end{bmatrix}_{1 \times 3} = \begin{bmatrix} 
2 & 3 & 1
\end{bmatrix}_{1 \times 3},\, b^{(2)} = \begin{bmatrix} 
b_{1}^{(2)} 
\end{bmatrix}_{1 \times 1} = \begin{bmatrix} 
-1
\end{bmatrix}_{1 \times 1}
$$

Moreover, let the **data** $x$ be: 

$$
x = \begin{bmatrix} 
x_1 \\
x_2 
\end{bmatrix}_{2 \times 1} = 
\begin{bmatrix} 
4 \\
1 
\end{bmatrix}_{2 \times 1}
$$





**(a) (10 points)** Sketch the neural network architecture.


![alt text](./1.a.jpeg)

**(b) (10 points)** Let the generic activation function $g(x)$ be the identity function, $g(x) = x$. What would the output of the network be?
What does a linear activation function imply about the network's architecture?




... show neuron-by-neuron computations ...

Linear activation functions throughout the network imply that the network applies only a linear (affine) transformation to the data. The entire architecture can be simplified into a network with one input layer, one output layer, and a single set of weights and biases to learn. The learning problem essentially becomes a linear regression problem.

![alt text](./1.b.jpeg)
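
As a quick numerical check of the argument above, the following minimal Python sketch runs the forward pass with identity activations and then collapses the two layers into a single affine map. The helper `dot` and the names `w_eff` and `b_eff` are illustrative and not part of the assignment. The parameter values are copied from the exercise statement, and the output neuron is assumed to combine $a_1^{(1)}$, $a_2^{(1)}$, and $a_3^{(1)}$.

```python
# Parameters and input copied from the exercise statement.
W1 = [[1, 1], [2, -2], [3, -1]]   # 3 x 2, row i holds the weights into neuron i
b1 = [-1, 1, 0]
W2 = [2, 3, 1]                    # 1 x 3, stored as a flat list
b2 = -1
x = [4, 1]


def dot(u, v):
    """Inner product of two equal-length lists (illustrative helper)."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Layer-by-layer forward pass with identity activations, g(z) = z.
z1 = [dot(row, x) + b for row, b in zip(W1, b1)]
a1 = z1
y_hat = dot(W2, a1) + b2

# Equivalent single affine map: y_hat = w_eff . x + b_eff,
# where w_eff = W2 W1 and b_eff = W2 b1 + b2.
w_eff = [dot(W2, [row[j] for row in W1]) for j in range(len(x))]
b_eff = dot(W2, b1) + b2
y_hat_collapsed = dot(w_eff, x) + b_eff

print(z1, y_hat)          # [4, 7, 11] 39
print(w_eff, b_eff)       # [11, -5] 0
print(y_hat_collapsed)    # 39
```

Both routes give the same prediction, which is what collapsing the layers into $W^{(2)} W^{(1)}$ and $W^{(2)} b^{(1)} + b^{(2)}$ predicts.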

**(c) (10 points)** What would the output be if the generic activation function $g(x)$ is set to the PReLU function with $\alpha = 0.01$ in the **first layer**, $g^{(1)}(x) = \mathrm{PReLU}(x, \alpha = 0.01)$, and to the ReLU function in the **second layer**, $g^{(2)}(x) = \mathrm{ReLU}(x)$?


... show neuron-by-neuron computations ...

![alt text](./1.c.jpeg)
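
The same forward pass can be sketched in Python with the two activations swapped in. The functions `prelu` and `relu` below are hand-written stand-ins rather than calls into any particular library, and PReLU is assumed to take the usual form of $z$ for $z > 0$ and $\alpha z$ otherwise.

```python
def prelu(z, alpha=0.01):
    """PReLU (assumed form): z for z > 0, alpha * z otherwise."""
    return z if z > 0 else alpha * z


def relu(z):
    """ReLU: max(0, z)."""
    return max(0, z)


# Parameters and input copied from the exercise statement.
W1 = [[1, 1], [2, -2], [3, -1]]
b1 = [-1, 1, 0]
W2 = [2, 3, 1]
b2 = -1
x = [4, 1]

# First layer: linear combination, then PReLU.
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a1 = [prelu(z) for z in z1]

# Second layer: linear combination, then ReLU.
z2 = sum(w * a for w, a in zip(W2, a1)) + b2
y_hat = relu(z2)

print(z1, a1)   # [4, 7, 11] [4, 7, 11]
print(y_hat)    # 39
```

For this particular input every pre-activation is positive, so neither PReLU nor ReLU changes any value and the output matches the identity-activation case. A negative pre-activation would instead be scaled by $\alpha$ in the first layer, for example `prelu(-2)` evaluates to `-0.02`.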

## [20 points] Exercise 2 - The Art of Backprop

Recall the network architecture given in **Exercise 1**. Let the generic
activation function $g(x)$ be equivalent to the identity, $g(x) = x$, **across all layers**.

**(a) [15 points]** Let the cost function for the network be squared-error given as: 

$$J(W) = \left({y - \hat y }\right)^2$$

Compute the following partial derivatives:

$\frac{\partial J(W)}{\partial w_{1,1}^{(2)} }$, 
$\frac{\partial J(W)}{\partial w_{1,2}^{(1)} }$, 
$\frac{\partial J(W)}{\partial b_{3}^{(1)} }$

_Hint:_ Recall **Exercise 1 b** computation for $\hat y$.


![alt text](./2.a.jpeg)
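
One way to organize the chain rule here is offered below as a sketch, not as the graded solution. It assumes identity activations and reads the output neuron as $z_1^{(2)} = w_{1,1}^{(2)} a_1^{(1)} + w_{2,1}^{(2)} a_2^{(1)} + w_{3,1}^{(2)} a_3^{(1)} + b_1^{(2)}$. Each requested derivative is then the outer derivative $\partial J / \partial \hat y$ multiplied along the path from $\hat y$ back to the parameter:

$$\begin{align*}
\frac{\partial J(W)}{\partial \hat y} &= -2\left({y - \hat y}\right) \\
\frac{\partial J(W)}{\partial w_{1,1}^{(2)}} &= \frac{\partial J(W)}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_{1,1}^{(2)}} = -2\left({y - \hat y}\right) a_1^{(1)} \\
\frac{\partial J(W)}{\partial w_{1,2}^{(1)}} &= \frac{\partial J(W)}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial w_{1,2}^{(1)}} = -2\left({y - \hat y}\right) w_{2,1}^{(2)} \cdot 1 \cdot x_1 \\
\frac{\partial J(W)}{\partial b_{3}^{(1)}} &= \frac{\partial J(W)}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial a_3^{(1)}} \cdot \frac{\partial a_3^{(1)}}{\partial z_3^{(1)}} \cdot \frac{\partial z_3^{(1)}}{\partial b_{3}^{(1)}} = -2\left({y - \hat y}\right) w_{3,1}^{(2)} \cdot 1 \cdot 1
\end{align*}$$

The factors of $1$ come from the identity activation, whose derivative is $1$ everywhere, and from the bias entering $z_3^{(1)}$ with coefficient $1$.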

**(b) [5 points]** If $y = 42$, compute the realized value for each of the partial derivatives given in **(a)**.



![alt text](./2.b.jpeg)
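
The realized values can be cross-checked numerically. The sketch below evaluates the chain-rule expressions for identity activations at $y = 42$ and compares one of them against a central finite difference. The function `forward`, the step size `eps`, and the variable names are illustrative choices, not part of the assignment.

```python
def forward(W1, b1, W2, b2, x):
    """Forward pass with identity activations; returns (y_hat, a1)."""
    a1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    y_hat = sum(w * a for w, a in zip(W2, a1)) + b2
    return y_hat, a1


# Parameters and input copied from Exercise 1.
W1 = [[1.0, 1.0], [2.0, -2.0], [3.0, -1.0]]
b1 = [-1.0, 1.0, 0.0]
W2 = [2.0, 3.0, 1.0]
b2 = -1.0
x = [4.0, 1.0]
y = 42.0

y_hat, a1 = forward(W1, b1, W2, b2, x)
dJ_dyhat = -2.0 * (y - y_hat)

# Analytic gradients from the chain rule.
grad_w11_2 = dJ_dyhat * a1[0]         # dJ / dw_{1,1}^{(2)}
grad_w12_1 = dJ_dyhat * W2[1] * x[0]  # dJ / dw_{1,2}^{(1)}
grad_b3_1 = dJ_dyhat * W2[2]          # dJ / db_{3}^{(1)}


def loss_with_w12(value):
    """Squared error with w_{1,2}^{(1)} (stored at W1[1][0]) set to `value`."""
    W1_mod = [row[:] for row in W1]
    W1_mod[1][0] = value
    pred, _ = forward(W1_mod, b1, W2, b2, x)
    return (y - pred) ** 2


# Central finite difference as an independent check.
eps = 1e-4
numeric_w12_1 = (loss_with_w12(W1[1][0] + eps) - loss_with_w12(W1[1][0] - eps)) / (2 * eps)

print(grad_w11_2, grad_w12_1, grad_b3_1)  # -24.0 -72.0 -6.0
```

Because the loss is quadratic in each parameter, the central difference should agree with the analytic value up to floating-point rounding.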

## [10 points] Exercise 3 - Descent from Below

Recall the partial derivatives obtained in **Exercise 2**. 

**(a) [5 points]** Write the parameter updates under **SGD with Momentum** for $w_{1,2}^{(1)}$ and $b_{3}^{(1)}$, using the partial derivatives

$\frac{\partial J(W)}{\partial w_{1,2}^{(1)} }$ and  $\frac{\partial J(W)}{\partial b_{3}^{(1)} }$




![alt text](./3.a.jpeg)
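
For reference, one common way to write SGD with momentum is sketched below. The course notes may use a different but related convention, so this form is an assumption: the velocity $v$ accumulates gradients with decay $\rho$, and the parameter moves against the velocity with learning rate $\alpha$.

$$\begin{align*}
v_{t+1} &= \rho \, v_t + \frac{\partial J(W)}{\partial \theta} \\
\theta_{t+1} &= \theta_t - \alpha \, v_{t+1}
\end{align*}$$

Here $\theta$ stands for either $w_{1,2}^{(1)}$ or $b_{3}^{(1)}$. Another widely used variant folds the learning rate into the velocity, $v_{t+1} = \rho \, v_t - \alpha \, \partial J(W) / \partial \theta$ followed by $\theta_{t+1} = \theta_t + v_{t+1}$, and the two generally produce different numbers for the same $\alpha$, $\rho$, and $v_0$.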

**(b) [5 points]** Compute the parameter updates given $\alpha = 0.5$, $\rho = 0.1$, and $v_0 = 4$.




![alt text](./3.b.jpeg)
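
A minimal Python sketch of a single momentum step follows. It assumes the convention $v \leftarrow \rho v + \nabla J$, $\theta \leftarrow \theta - \alpha v$, which may differ from the one used in lecture. It also assumes both parameters start from the same $v_0 = 4$, and it takes the gradients at $y = 42$ as obtained from the chain rule under identity activations. The function `momentum_step` is a hypothetical helper, not a library call.

```python
def momentum_step(theta, grad, v, alpha, rho):
    """One SGD-with-momentum step (assumed form):
    v <- rho * v + grad;  theta <- theta - alpha * v.
    """
    v_new = rho * v + grad
    theta_new = theta - alpha * v_new
    return theta_new, v_new


alpha, rho, v0 = 0.5, 0.1, 4.0

# Current parameter values from Exercise 1 and gradients for y = 42.
w12_1, grad_w12_1 = 2.0, -72.0   # w_{1,2}^{(1)} and dJ/dw_{1,2}^{(1)}
b3_1, grad_b3_1 = 0.0, -6.0      # b_{3}^{(1)} and dJ/db_{3}^{(1)}

w_new, v_w = momentum_step(w12_1, grad_w12_1, v0, alpha, rho)
b_new, v_b = momentum_step(b3_1, grad_b3_1, v0, alpha, rho)

print(f"w_12^(1): v = {v_w:.1f}, updated value = {w_new:.1f}")  # v = -71.6, updated value = 37.8
print(f"b_3^(1):  v = {v_b:.1f}, updated value = {b_new:.1f}")  # v = -5.6, updated value = 2.8
```

Under the alternative convention that folds $\alpha$ into the velocity, the same inputs lead to different updated values, so it is worth confirming which form the course uses before comparing numbers.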