# STA410 Week 6 Programming Assignment (5 points)

0. **Paired or individual assignment.** Complete this assignment either individually or with a partner.

   > Seek homework partners in class, on the course discussion board on Piazza, etc.
 
    
1. **Paired students each separately submit their (common) work, including (agreeing) contribution of work statements for each problem.**  
  
   > Students must work in accordance with the [University of Toronto’s Code of Behaviour on Academic Matters](https://governingcouncil.utoronto.ca/secretariat/policies/code-behaviour-academic-matters-july-1-2019) (see also http://academicintegrity.utoronto.ca); however, students working in pairs may share work without restriction within their pair. Getting and sharing "hints" from other classmates is encouraged, but the eventual code creation work and submission must be your own individual or paired creation.
      
2. **Do not delete, replace, or rearrange cells** as this erases the `cell ids` upon which automated code tests are based.

   > The "Edit > Undo Delete Cells" option in the notebook editor might be helpful; otherwise, redownload the notebook (so it has the correct required `cell ids`) and repopulate it with your answers (assuming you don't overwrite them when you redownload the notebook).
  >> ***If you are working in any environment other than*** [UofT JupyterHub](https://jupyter.utoronto.ca/hub/user-redirect/git-pull?repo=https://github.com/pointOfive/sta410hw0&branch=master), [Google Colab](https://colab.research.google.com/github/pointOfive/sta410hw0/blob/master/sta410hw0.ipynb), or [UofT JupyterLab](https://jupyter.utoronto.ca/hub/user-redirect/git-pull?repo=https://github.com/pointOfive/sta410hw0&branch=master&urlpath=/lab/tree/sta410hw0), your system must meet the following versioning requirements 
   >>
   >>   - [notebook format >=4.5](https://github.com/jupyterlab/jupyterlab/issues/9729) 
   >>   - jupyter [notebook](https://jupyter.org/install#jupyter-notebook) version [>=6.2](https://jupyter-notebook.readthedocs.io/en/stable/) for "classic" notebooks served by [jupyterhub](https://jupyterhub.readthedocs.io/en/stable/quickstart.html)
   >>   - [jupyterlab](https://jupyter.org/install) version [>=3.0.13](https://github.com/jupyterlab/jupyterlab/releases/tag/v3.0.13) for "jupyterlab" notebooks  
   >>    
   >> otherwise `cell ids` may not be supported and you will not get any credit for your submitted homework.
   >>
   >> You may check if `cell ids` are present and working by running the following command in a cell 
   >>
   >> `! grep '"id":' <path/to/notebook>.ipynb`
   >>
   >> and making sure the `cell ids` **do not change** when you save your notebook.
   
3. ***You may add cells for scratch work***, but if required answers are not submitted through the provided cells where the answers are requested, your answers may not be marked.

 
4. **No cells may have any runtime errors** because this causes subsequent automated code tests to fail and you will not get marks for tests which fail because of previous runtime errors. 

  > Runtime errors include, e.g., unassigned variables, mismatched parentheses, and any code which does not work when the notebook cells are sequentially run, even if it was provided for you as part of the starter code. ***It is best to restart and re-run the cells in your notebook to ensure there are no runtime errors before submitting your work.***
  >
  > - A `try`-`except` block catches the `exceptions` raised by runtime errors so that they do not halt execution and therefore do not cause subsequent automated code tests to fail.  


5. **No jupyter shortcut commands** such as `! python script.py 10` or `%%timeit` may be included in the final submission as they will cause subsequent automated code tests to fail.

   > ***Comment out ALL jupyter shortcut commands***, e.g., `# ! python script.py 10` or `# %%timeit` in submitted notebooks.


6. **Python library imports are limited** to only libraries imported in the starter code and the [standard python modules](https://docs.python.org/3/py-modindex.html). Importing additional libraries will cause subsequent automated code tests to fail.

  > Unless a problem instructs differently, you may use any functions available from the libraries imported in the starter code; otherwise, you are expected to create your own Python functionality based on the Python stdlib (standard library, i.e., base Python and standard Python modules).


7. You are encouraged to adapt code you find available online into your notebook; however, if you do so please provide a link to the utilized resource. ***If failure to cite such references is identified and confirmed, your mark will be immediately reduced to 0.***  
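To illustrate the `try`-`except` note in item 4, here is a minimal sketch of catching an exception so that it does not stop the cells that follow. The function name `safe_divide` and its inputs are hypothetical and are not part of the assignment.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None if the division raises."""
    try:
        return numerator / denominator
    except ZeroDivisionError as err:
        # The exception is handled here, so execution continues normally
        print(f"Caught: {err}")
        return None

result_ok = safe_divide(1.0, 4.0)   # ordinary division
result_bad = safe_divide(1.0, 0.0)  # would be a runtime error without try-except
```

Catching a specific exception type (here `ZeroDivisionError`) rather than using a bare `except` keeps unrelated mistakes visible.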

In [1]:
# Unless a problem instructs differently, you may use any functions available from the following library imports
import numpy as np
import tensorflow as tf

# Problem 0 (required)

Are you working with a partner to complete this assignment?  
- If not, assign the value `None` to the variable `Partner`.
- If so, assign the name of the person you worked with to the variable `Partner`.
    - Format the name as `"<First Name> <Last Name>"` as a `str` type, e.g., "Scott Schwartz".

In [None]:
# Required: only worth points when not completed, in which case, you'll lose points
Partner = #None
# This cell will produce a runtime error until you assign a value to this variable

What was your contribution in completing the code for this assignment's problems? Assign one of the following to each of the `Problem_X` variables below.

- `"I worked alone"`
- `"I contributed more than my partner"`
- `"My partner and I contributed equally"`
- `"I contributed less than my partner"`
- `"I did not contribute"`

In [None]:
# Required: only worth points when not completed, in which case, you'll lose points
Problem_1 = #"I worked alone"
# This cell will produce a runtime error until you assign a value to this variable

# Problem 1 (5 points)

Complete the ***gradient descent optimization algorithm*** with ***TensorFlow*** below for 
- parameters $A_1, b_1, A_2,$ and $b_2$ of the model
- $\hat y = f_{A_1,b_1,A_2,b_2}(x) = \big( (A_2(A_1 x + b_1)_+ + b_2)_+ \big)$
- minimizing the (loss objective) function $||\epsilon||_2 = ||y-f_{A_1,b_1,A_2,b_2}(x)||_2$

where $(\cdot)_+$ is the so-called ***ReLU activation function*** which sets all negative values within the object to $0$.
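As a small illustration of the $(\cdot)_+$ operation, the sketch below applies it to a column vector stored as a nested Python list. The helper name `relu_column` and the example values are assumptions made for illustration; the assignment itself expects a TensorFlow implementation.

```python
def relu_column(column):
    """Replace every negative entry of a column vector (list of 1-element rows) with 0."""
    return [[max(0.0, value) for value in row] for row in column]

z = [[-1.5], [0.0], [2.0]]   # hypothetical (3, 1) column vector
z_plus = relu_column(z)      # negative entries become 0, others are unchanged
```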

The `f` function above specifies the standard form of a "vanilla" (shallow) two-layer ***neural network***
where $A_jx + b_j$ are ***affine transformations*** and $\{q_j\circ (A_jy_j + b_j)\}$ are [elementwise](https://math.stackexchange.com/questions/2324764/notation-for-element-wise-function-application) non-linear transformations of affine transformations.

A "vanilla" ***deep neural network*** extends this sequence of alternating applications of (a) affine transformations and (b) elementwise non-linear transformations of the previous affine transformation to some large number of layers $K$:

$$q_K \circ (A_K \{ \cdots \{q_2 \circ (A_2\{q_1 \circ (A_1x + b_1)\} + b_2)\} \cdots \} + b_K)$$
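The nested composition above can be read as a loop that alternates an affine map with an elementwise non-linearity. The following plain-Python sketch shows that structure for an arbitrary number of layers; the helper names (`affine`, `forward`) and the toy weights are hypothetical, and vectors are ordinary lists rather than TensorFlow tensors.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, vi) for vi in v]

def affine(A, b, v):
    """Compute A v + b for a matrix A (list of rows) and vectors b, v (lists)."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(A, b)]

def forward(layers, x):
    """Apply q_k(A_k h + b_k) for each (A_k, b_k, q_k) in order, starting from h = x."""
    h = x
    for A, b, q in layers:
        h = q(affine(A, b, h))
    return h

# Hypothetical toy weights: 3 inputs -> 2 hidden units -> 3 outputs
layers = [
    ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, -1.0], relu),
    ([[1.0, -1.0], [0.0, 2.0], [-1.0, 0.0]], [0.5, 0.0, 0.0], relu),
]
y_hat = forward(layers, [2.0, 1.0, 3.0])
```

Adding depth only means appending more `(A, b, q)` triples to `layers`, provided each matrix has as many columns as the previous layer has outputs.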

> The "***architecture***" of a ***neural network*** refers to the details of extensions beyond this form which increase sophistication and capability for various purposes. 
> - The primary advances in ***neural networks*** have been (a) the ability to create ***deep neural network architectures*** which do not suffer from the "***vanishing gradient***" problem and (b) the ability to create ***deep neural network*** specifications that have more ***isotropically*** behaved optimization surfaces. The "***vanishing gradient***" problem arises because the ***chain rule*** basis of the ***backpropagation*** algorithm used to compute ***gradients*** for optimization of ***deep neural networks*** multiplies many partial derivatives together, and this product can multiplicatively "vanish to zero". The key methodological advances driving these respective improvements are the introduction of ***ReLU*** activation functions and a technique called ***batch norm***.
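A rough numerical sketch of why a product of many partial derivatives can vanish is given below. It is a deliberate simplification: it ignores the weight matrices and compares only the activation slopes, using the logistic sigmoid (whose slope never exceeds 0.25) as an assumed point of comparison that the text above does not name. The depth of 20 is also hypothetical.

```python
import math

def chain_product(factors):
    """Multiply per-layer derivative factors, as the chain rule does."""
    return math.prod(factors)

depth = 20              # hypothetical number of layers
sigmoid_factor = 0.25   # largest possible slope of the logistic sigmoid
relu_factor = 1.0       # slope of ReLU wherever its input is positive

sigmoid_grad = chain_product([sigmoid_factor] * depth)  # shrinks geometrically with depth
relu_grad = chain_product([relu_factor] * depth)        # stays at 1 along active units
```

Even in the best case for the sigmoid, the product after 20 layers is below `1e-11`, while the ReLU product is unchanged along units whose inputs stay positive.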

***Deep neural networks*** are among the most flexible ("***universal***") function approximation methodologies available today. To provide a sense of how flexible and powerful these are, the images below (taken from this [interactive webpage](https://arogozhnikov.github.io/3d_nn/)) demonstrate ***neural network*** functions by animating how they increase across the input space. Some good resources to continue learning about ***deep neural networks***
are the [deep learning](https://www.deeplearningbook.org/) and [dive into deep learning](https://d2l.ai/) textbooks. 

|||||
|-|-|-|-|
|![](https://s9.gifyu.com/images/SFOrY.gif)|![](https://s9.gifyu.com/images/SFOra.gif)|![](https://s9.gifyu.com/images/SFOrr.gif)|![](https://s9.gifyu.com/images/SFOrZ.gif)|
|![](https://s9.gifyu.com/images/SFOrf.gif)|![](https://s9.gifyu.com/images/SFOrt.gif)|![](https://s9.gifyu.com/images/SFOr5.gif)|![](https://s9.gifyu.com/images/SFOrD.gif)|

In [3]:
import tensorflow as tf
np.random.seed(3)
alpha,K = 0.01,10
d,q1,q2 = 3,2,3
x = tf.constant(np.random.normal(size=(d,1)))
y = tf.constant(np.random.normal(size=(q2,1)))
A1 = tf.Variable(np.random.normal(size=(q1,d)))
b1 = tf.Variable(np.random.normal(size=(q1,1)))
A2 = tf.Variable(np.random.normal(size=(q2,q1)))
b2 = tf.Variable(np.random.normal(size=(q2,1)))

arg1 = tf.TensorSpec(shape=x.shape, dtype=tf.float64)
arg2 = tf.TensorSpec(shape=A1.shape, dtype=tf.float64)
arg3 = tf.TensorSpec(shape=b1.shape, dtype=tf.float64)
arg4 = tf.TensorSpec(shape=A2.shape, dtype=tf.float64)
arg5 = tf.TensorSpec(shape=b2.shape, dtype=tf.float64)

@tf.function(input_signature=(arg1,arg2,arg3,arg4,arg5,))
def f(x0, A1, b1, A2, b2):
    return None #<f as defined in the problem prompt> <- fix this

@tf.function(input_signature=(arg1,arg1,))
def loss(y, yhat):
    return None #<L2 norm of epsilon> <- fix this

# You are not meant to actually fit a model here,
# only set up and run the correct code that could fit a model
# and then run it for `K` steps
# and then assign `p2q0 = loss(y, f(x, A1, b1, A2, b2))`
# the final value of the `loss` function after the `K` steps finish
for i in range(K):
    # <complete `K` steps of gradient descent>
    # <for the `tf.Variable` model parameters>
    pass

# 1 point [format: `(A1, b1, A2, b2)` with the results of the for loop above after it  
#                    implements the gradient descent algorithm for `alpha,K = 0.01,10` ]
p1q6 = None #(A1, b1, A2, b2) # the `p1q6` variable name is correct -- it makes sense to assign this here
# Uncomment the line above so the result of the gradient descent algorithm for `alpha,K = 0.01,10` is saved

## Problem 1 questions 0-3 (1 point)

0. (0.25 points) What are the dimensions of the input $x$ and of the "first hidden layer" $h_1 = (A_1 x + b_1)$ of `f` above?

    - A. The input vector $x$ has length 2 and the "first hidden layer" output $h_1$ is a vector of length 2
    - B. The input vector $x$ has length 2 and the "first hidden layer" output $h_1$ is a vector of length 3
    - C. The input vector $x$ has length 3 and the "first hidden layer" output $h_1$ is a vector of length 2
    - D. The input vector $x$ has length 3 and the "first hidden layer" output $h_1$ is a vector of length 3


1. (0.25 points) Generally speaking, must the dimension of the input $x$ to the ***neural network*** function `f` match the dimension of the output $y$?

    - A. Yes, and `q1` should also reflect this same dimension
    - B. Yes, but `q1` need not reflect this same dimension
    - C. No, so long as `q2` reflects the same dimension as `d`
    - D. No, `q2` need not reflect the same dimension as `d` 
    
    
2. (0.25 points) Which of the following is true regarding the "first hidden layer" $h_1 = (A_1 x + b_1)$ of `f`?

    - A. It is an elementwise nonlinear transformation of a (lower dimensional) affine transformation of $x$
    - B. It is the input to the "second hidden layer" $h_2 = (A_2 h_1 + b_2)$
    - C. It is a function of $x$ that is parameterized by $A_1$ and $b_1$ 
    - D. All of the above


3. (0.25 points) Which of the following exactly expresses the (loss objective) function $||\epsilon||_2$?

    - A. $\sqrt{(y-f_{A_1,b_1,A_2,b_2}(x))^T(y-f_{A_1,b_1,A_2,b_2}(x))}$
    - B. $\sum_{i=1}^n (y_i - f_{A_1,b_1,A_2,b_2}(x_i))$
    - C. $\frac{1}{2}\sum_{i=1}^n (y_i - f_{A_1,b_1,A_2,b_2}(x)_i)$
    - D. $\frac{1}{2}(y-f_{A_1,b_1,A_2,b_2}(x))^T(y-f_{A_1,b_1,A_2,b_2}(x))$
    - E. None of the above



In [None]:
# 1 point (0.25 points each) [format: `str` either "A" or "B" or "C" or "D" or "E" based on the choices above]
p1q0 = None#<"A"|"B"|"C"|"D"> 
p1q1 = None#<"A"|"B"|"C"|"D"> 
p1q2 = None#<"A"|"B"|"C"|"D"> 
p1q3 = None#<"A"|"B"|"C"|"D"|"E"> 
# Replace `None` above with either "A" or "B" or "C" or "D" or "E"

## Problem 1 question 4 (1 point)

Your `f` function above will be tested for various inputs.

## Problem 1 question 5 (1 point)

Your `loss` function above will be tested for various inputs.

## Problem 1 question 6 (1 point)

Your ***gradient descent*** algorithm above will be tested for various inputs.
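For reference, gradient descent repeatedly moves each parameter a step of size `alpha` against the gradient of the loss. The sketch below runs that update on a toy one-parameter quadratic with a hand-derived gradient. The names `toy_loss`, `toy_grad`, and `theta`, as well as the starting value, are illustrative assumptions; this is not the assignment's model, where the gradients would come from TensorFlow rather than from a formula written by hand.

```python
alpha, K = 0.01, 10  # same step size and step count as the starter code

def toy_loss(theta):
    """Hypothetical loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def toy_grad(theta):
    """Derivative of toy_loss with respect to theta."""
    return 2.0 * (theta - 3.0)

theta = 0.0                      # hypothetical starting value
initial_loss = toy_loss(theta)
for _ in range(K):
    # Gradient descent update: step against the gradient, scaled by alpha
    theta = theta - alpha * toy_grad(theta)
final_loss = toy_loss(theta)
```

With such a small `alpha` and only `K = 10` steps, the loss decreases but `theta` is still far from the minimizer, which matches the starter code's remark that the loop is not meant to fully fit a model.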

## Problem 1 questions 7-10 (1 point)

7. (0.25 points) Why is the algorithm coded in this problem different than ***nonlinear Gauss-Seidel***?

    - A. Because `f` is a linear function of `x`, not a nonlinear function of `x`
    - B. Because each parameter is not cyclically optimized given the value of the others
    - C. Because the algorithm coded in this problem is a ***coordinate descent*** algorithm
    - D. The algorithm coded in this problem is not different than ***nonlinear Gauss-Seidel***
    
    
8. (0.25 points) If `alpha` was decreased towards zero, what could be adjusted to compensate so that the algorithm maintained an approximately similar level of convergence progress?

    - A. `K` could be increased
    - B. `b1` and `b2` could be given smaller magnitudes
    - C. `A1` and `A2` could be given larger magnitudes
    - D. Choices 'A' and 'B' above
    
    
9. (0.25 points) The function under consideration in this problem is

  $$\hat y = f_{A_1,b_1,A_2,b_2}(x) = \big( (A_2(A_1 x + b_1)_+ + b_2)_+ \big)$$
  
  where $(\cdot)_+$ is the so-called ***ReLU activation function*** which sets all negative values within the object to $0$.
  
  What is the problem with this as a function for $\hat y$ predicting $y$? 
  
    - A. It is not a continuous function so it doesn't have derivatives everywhere
    - B. It is a nonlinear function so it should not be used to predict $y$
    - C. It can only produce positive $\hat y$ predictions, so it would not work for regression with negative $y$ values 
    - D. Nothing, this $\hat y$ will be a reasonable predictor of $y$ as long as the function $f$ is sufficiently flexible    
    

10. (0.25 points) Which of the following adjustments to $f_{A_1,b_1,A_2,b_2}(x)$ could be supported by ***gradient descent*** using ***TensorFlow***?

    - A. $L_2$ "ridge" regularization on the $[A_k]_{ij}$ ***weights*** and $[b_k]_{j}$ ***biases*** $\quad f_{A_1,b_1,A_2,b_2}(x) = (A_2(A_1 x + b_1)_+ + b_2)_+ + \sum_k A_k^TA_k + b_k^T b_k$
    - B. $L_1$ "lasso" regularization on the $[A_k]_{ij}$ ***weights*** and $[b_k]_{j}$ ***biases*** $\quad f_{A_1,b_1,A_2,b_2}(x) = (A_2(A_1 x + b_1)_+ + b_2)_+ + \sum_k |A_k| + |b_k|$
    - C. $f_{A_1,b_1,A_2,b_2}(x)$ can be any function whose partial derivatives with respect to $A_1,b_1,A_2$, and $b_2$ are known for ***TensorFlow AutoDiff***
    - D. All of the above
    - E. None of the above


In [None]:
# 0.25 points each [format: `str` either "A" or "B" or "C" or "D" (or "E" for `p1q10`) based on the choices above]
p1q7 = None #<"A"|"B"|"C"|"D"> 
p1q8 = None #<"A"|"B"|"C"|"D"> 
p1q9 = None #<"A"|"B"|"C"|"D"> 
p1q10 = None #<"A"|"B"|"C"|"D"|"E"> 
# Replace each `None` above with either "A" or "B" or "C" or "D" (or "E" for `p1q10`)