<center>  <h1> Data 598 (Winter 2023): Midterm </h1> </center> 
    <center> University of Washington </center>
    

There are four parts in this midterm. 
1. Understanding automatic differentiation
1. Programming a differentiable module
1. Optimizing a neural network
1. Explaining concepts from convolutional neural nets

**Rules**:
- No collaboration allowed! You must work on your own.
- You may refer to the reference books.
- You may refer to the labs, demos, lab solutions and homework solutions. 
- However, you may not directly copy-paste code from any of the labs or demos. You should write your own code.
- The only exception to the copy-paste rule is where explicitly specified. 

# Part 1: Understanding automatic differentiation

Consider the function $f: \mathbb{R} \to \mathbb{R}$ defined as 

$$
    f(x) =  \sqrt{2 + x^3 + \log (1+ x^3)} + \sin(2 + x^3 + \log (1+ x^3)) .
$$

<!-- $$
    f(x) = \sqrt{x^2 + \exp (x^2)} + \cos(x^2 + \exp(x^2)) \,.
$$ -->

Questions:

**A)** Write out the chain rule to compute the derivative $f'(x)$ of $f$ at a given $x$. Then, compute the gradient of $y = f(x)$ with respect to $x$. 
You may insert the equations in Markdown directly using LaTeX syntax. 

**Your answer here**: 



**B)** Implement this function $f$ in code so that it accepts a scalar $x$ and returns $f(x)$. 

In [None]:
import torch
def my_function(x):
    # x is a torch.tensor (this is the PyTorch scalar type)
    # Example input `x = torch.tensor(3.14159, requires_grad=True)`
    # TODO: your code here

**C)** Compute the derivative $f'(x)$ for $x= 1.432$ using PyTorch's automatic differentiation. You do not have to code up your own backward method. 
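
As a reminder of the autograd workflow, the following is a minimal sketch on a different, simpler function $h(x) = \sin(x^2)$. The function `h` and the point `x0` are hypothetical and chosen only for illustration; they are not part of the exam. The sketch compares the value PyTorch stores in `.grad` against the derivative obtained by hand from the chain rule.

```python
import math
import torch

def h(x):
    # Illustrative function only (not the exam's f): h(x) = sin(x^2)
    return torch.sin(x ** 2)

x0 = torch.tensor(0.5, requires_grad=True)
y0 = h(x0)
y0.backward()  # populates x0.grad with dh/dx evaluated at x0

# Analytic derivative via the chain rule: h'(x) = 2 x cos(x^2)
analytic = 2 * 0.5 * math.cos(0.5 ** 2)
print(x0.grad.item(), analytic)
```

The two printed numbers should agree up to floating-point precision.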

In [None]:
# TODO: your code here

# Part 2: Coding up a differentiable module

Consider the soft-thresholding function $f_T: \mathbb{R} \to \mathbb{R}$ defined for any $T > 0$ as 
$$
    f_T(y) = 
    \begin{cases} 
        0, & \text{ if } -T \le y \le T \,, \\
        (y - T)^2, & \text{ if } y > T \,, \\
        (y + T)^3, & \text{ if } y < -T \,.
    \end{cases}
$$



**A)** Write a function which takes in as arguments $y, T$ and returns the soft-thresholding $f_T(y)$.
    Plot this function with $T = 2.81$ in the range $[-15, 15]$.

In [None]:
import matplotlib.pyplot as plt
def softt(y, T):
    """ `y` is a torch.tensor (i.e., PyTorch's scalar type; same as above), 
        `T` is a regular Python number (float or int).
        return type: torch.tensor
    """
    # TODO: your code here
    # HINT: if you write a program with branches, make sure that the output type is always a torch.tensor
    

**B)** Write a function which computes the derivative $f_T'(y)$ of the soft-thresholding function w.r.t. $y$, as returned by PyTorch. Plot this for $T=2.81$ in the range $[-15, 15]$. 

**Hint 1**: If you coded up `softt` using branches, you might encounter a situation where the output does not depend on the input. In this case, you will have to appropriately set the `allow_unused` flag. 

**Hint 2**: When PyTorch returns a derivative of `None`, it actually stands for `0`. If your derivative returns a `None`, you will have to handle this appropriately when plotting the function.
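
The behavior described in the two hints can be sketched on a toy branching function. The names `zero_or_square` and `safe_grad` are hypothetical and unrelated to `softt`; the sketch only shows how `allow_unused=True` changes what `torch.autograd.grad` returns, and one possible way to map `None` to a zero tensor.

```python
import torch

def zero_or_square(y):
    # Hypothetical branching function, used only to illustrate the hints
    if y > 0:
        return y ** 2
    # This branch returns a fresh tensor that does not depend on y
    return torch.tensor(0.0, requires_grad=True)

def safe_grad(fn, y):
    out = fn(y)
    # allow_unused=True makes autograd return None instead of raising
    # an error when `out` does not depend on `y`
    (g,) = torch.autograd.grad(out, y, allow_unused=True)
    return torch.zeros_like(y) if g is None else g

y_pos = torch.tensor(3.0, requires_grad=True)
y_neg = torch.tensor(-1.0, requires_grad=True)
print(safe_grad(zero_or_square, y_pos))  # derivative of y^2 at y = 3
print(safe_grad(zero_or_square, y_neg))  # None is mapped to a zero tensor
```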

In [None]:
def softt_derivative(y, T): 
    # TODO: your code here

# Test your code
a = torch.tensor(1.20, requires_grad=True)
print(softt_derivative(a, 3.14))

# Plot
# TODO: your code here

**C)** We will now code a differentiable module using `torch.nn.Module`. 

First, let us extend the definition of 
the soft-thresholding $f_T$ to vectors by applying the soft-thresholding operation component-wise. 

Now write a differentiable module which implements the transformation $g_{T}(\cdot; M): \mathbb{R}^d \to \mathbb{R}^d$ 
given by 
$$
    g_{T}(x; M) = M^{-1} \, f_T(Mx) \,,
$$
where $M \in \mathbb{R}^{d \times d}$, a symmetric matrix, is a *parameter* of the module. (Recall: parameters maintain the state of the module; register a parameter in `torch.nn.Module` by using the `torch.nn.Parameter` wrapper.)

Supply $T > 0$ and an initial value $M_0 \in \mathbb{R}^{d \times d}$ symmetric to the constructor, while the `forward` method only accepts $x \in \mathbb{R}^d$ as an input. 

You may use the function `create_symmetric_invertible_matrix` to create this matrix `M`.
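
As a refresher on parameter registration, here is a minimal sketch of a toy module. The class `ScaleByMatrix` is hypothetical and computes only $x \mapsto Mx$, not $g_T$; it is meant to show how wrapping a tensor in `torch.nn.Parameter` makes it visible to `named_parameters()` and to autograd.

```python
import torch

class ScaleByMatrix(torch.nn.Module):
    # Hypothetical toy module (not the exam's g_T): computes x -> M x
    def __init__(self, M0):
        super().__init__()
        # Wrapping in torch.nn.Parameter registers M with the module
        self.M = torch.nn.Parameter(M0.clone())

    def forward(self, x):
        return torch.matmul(self.M, x)

toy = ScaleByMatrix(torch.eye(3))
names = [name for name, _ in toy.named_parameters()]
print(names)  # names of the registered parameters

out = toy(torch.ones(3)).sum()
out.backward()
print(toy.M.grad)  # gradient of the scalar output w.r.t. M
```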

In [None]:
import numpy as np

def create_symmetric_invertible_matrix(dimension):
    # return symmetric invertible square matrix of size `dimension` x `dimension`
    rng = np.random.RandomState(dimension)
    factor = rng.randn(dimension, dimension).astype(np.float32)
    return 1e-6 * torch.eye(dimension) + torch.from_numpy(np.matmul(factor, factor.T))
    
class MatmulSofttMatmulinv(torch.nn.Module):
    #### TODO: your code here


**D)** Initialize the module with $T = 2.81$ and $M_0$ using the function `create_symmetric_invertible_matrix`. 
Use `dimension=5`. Pass in the following vector `x` defined below and compute $g_T(x;M_0)$.

In [None]:
import numpy as np
dimension = 5
x = torch.randn(5, requires_grad=True)
print('x:', x)
# TODO: your code here

**E)** For the same vector `x` as defined above, compute and print out the gradient of $\varphi_T(x; M) = \|x - g_T(x; M)\|_2^2$
with respect to both $x$ and $M$ using automatic differentiation. Use $T=2.81$ again.

In [None]:
# TODO: your code here

# Part 3: Optimizing a multi-layer perceptron
In this exercise, you will find the divergent learning rate of an MLP and implement a variant of SGD with parameter averaging. 

We will start with the dataloading utilities. 

In [None]:
from torchvision.datasets import FashionMNIST
from torch.nn.functional import cross_entropy

# download dataset (~117M in size)
train_dataset = FashionMNIST('./data', train=True, download=True)
X_train = train_dataset.data # torch tensor of type uint8
y_train = train_dataset.targets # torch tensor of type Long
test_dataset = FashionMNIST('./data', train=False, download=True)
X_test = test_dataset.data
y_test = test_dataset.targets

# choose a subsample of 10% of the data:
idxs_train = torch.from_numpy(
    np.random.choice(X_train.shape[0], replace=False, size=X_train.shape[0]//10))
X_train, y_train = X_train[idxs_train], y_train[idxs_train]

print(f'X_train.shape = {X_train.shape}')
print(f'n_train: {X_train.shape[0]}, n_test: {X_test.shape[0]}')
print(f'Image size: {X_train.shape[1:]}')

f, ax = plt.subplots(1, 5, figsize=(20, 4))
for i, idx in enumerate(np.random.choice(X_train.shape[0], 5)):
    ax[i].imshow(X_train[idx], cmap='gray', vmin=0, vmax=255)
    ax[i].set_title(f'Label = {y_train[idx]}', fontsize=20)
    


# Normalize dataset: pixel values lie between 0 and 255
# Make them zero mean

X_train = X_train.float()  # convert to float32
X_train = X_train.view(-1, 784)  # flatten into a (n, d) shape
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean[None, :]) 

X_test = X_test.float()
X_test = X_test.view(-1, 784)
X_test = (X_test - mean[None, :])

n_class = np.unique(y_train).shape[0]

**A)** Code up an MLP with three hidden layers with 32, 16, and 12 hidden units respectively. We will use a batch size of 8 throughout. The objective function we will use is the multinomial logistic loss, also known in PyTorch as `cross_entropy`. 

You **may reuse** code from previous labs and demos for this part.

In [None]:
# TODO: your code here

**B)** Find the divergent learning rate.

**Note**: We changed the data pre-processing (our data no longer has unit variance), which could cause the divergent learning rate to be different from what we had in previous labs.
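
To illustrate the idea of a divergent learning rate, the sketch below sweeps a geometric grid of learning rates on a toy one-dimensional quadratic and reports the first one for which the loss grows. Everything here is an assumption for illustration: the toy objective, the grid, the helper names `final_loss` and `first_divergent_lr`, and the divergence criterion (loss non-finite or larger than at initialization), which may differ from the one used in the labs.

```python
import math

def final_loss(lr, a=4.0, w0=1.0, steps=50):
    # Gradient descent on the toy objective 0.5 * a * w^2 (gradient: a * w)
    w = w0
    for _ in range(steps):
        w = w - lr * a * w
    return 0.5 * a * w * w

def first_divergent_lr(lrs, a=4.0, w0=1.0):
    initial = 0.5 * a * w0 * w0
    for lr in lrs:
        loss = final_loss(lr, a=a, w0=w0)
        # Treat a non-finite or increased loss as divergence (assumed criterion)
        if not math.isfinite(loss) or loss > initial:
            return lr
    return None  # no learning rate in the grid diverged

grid = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
print(first_divergent_lr(grid))
```

For this quadratic, each step multiplies $w$ by $(1 - \eta a)$, so the iterates blow up exactly when $|1 - \eta a| > 1$.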

In [None]:
# TODO: your code here

**C)** We will implement averaged SGD with an exponential moving average. 
In addition to the model parameters $w_t$, we 
also maintain a separate set of parameters $\bar w_t$ to serve as an average. The updates of averaged SGD are
$$
    w_{t+1} = w_t - \eta g_t \\
    \bar w_{t+1} = (1 - \gamma) \bar w_t + \gamma w_{t+1},
$$
where $\eta$ is a learning rate, $g_t$ is a (minibatch) stochastic gradient at $w_t$ and $\gamma \in (0, 1)$ is an averaging weight. 

Some notes:
- The update of $w_t$ is identical to the regular SGD method. That is, the averaged parameter $\bar w_t$ is *not* used during the stochastic gradient updates. 
- The averaged parameter $\bar w_t$ is updated on the side and is never used in model updates. We use $\bar w_t$ for logging only. 
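
The two update equations can be sketched on plain tensors as follows. The helper `averaged_sgd_step`, the toy objective $\tfrac{1}{2}\|w\|^2$, the step size, and the initialization $\bar w_0 = w_0$ are all assumptions made for illustration; applying the same idea to the parameters of an MLP is left to the exercise.

```python
import torch

def averaged_sgd_step(w, w_bar, grad, lr, gamma):
    # SGD step on the iterate, followed by the moving-average update
    w_next = w - lr * grad
    w_bar_next = (1 - gamma) * w_bar + gamma * w_next
    return w_next, w_bar_next

# Toy problem (assumed): minimize 0.5 * ||w||^2, whose gradient is w itself
w = torch.tensor([1.0, -2.0])
w_bar = w.clone()  # assumed initialization of the average
for _ in range(100):
    w, w_bar = averaged_sgd_step(w, w_bar, grad=w, lr=0.1, gamma=1e-3)
print(w, w_bar)
```

Note that `w_bar` never enters the computation of `w_next`; with a small $\gamma$ it moves much more slowly than `w`.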

Your task is to train the model for $25$ epochs and plot the train/test loss/accuracy for both the unaveraged model $w_t$ and the averaged model $\bar w_t$ in the same plot. Use a batch size of $8$ and half the divergent learning rate you found in the previous part. Set the averaging weight to $\gamma = 10^{-3}$.


**NOTE**: Do not include the logs of the first two passes through the data in the plot. This is because the initial loss is always very large and this tends to drown out the more interesting patterns we observe later on during training. 
 
    

In [None]:
# TODO: your code here

# Part 4: CNN Short Answer
Choose **1 from each section** of the following problems to complete.

**Section 1 (choose 1)**

a. Given an image of size = 20, what is the output shape if it is put through a convolutional neural net with kernel size = 3, padding = 2, stride = 4, and number of out_channels = 16? Justify your answer based on the ideas of kernel size, padding, and stride. 

b. For text data, what does a stride of 2 correspond to? Specifically, what other common metric in text analysis could it be called?

c. Do we need a separate minimum pooling layer? Can you replace it with another operation?
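
For reference on how kernel size, padding, and stride interact, the sketch below checks the standard one-dimensional output-size formula against an actual `torch.nn.Conv2d` layer. The helper `conv_output_size` is hypothetical, and the input size and layer settings are illustrative values only, deliberately different from those in question a.

```python
import torch

def conv_output_size(n, kernel_size, padding, stride):
    # Standard formula for one spatial dimension: floor((n + 2p - k) / s) + 1
    return (n + 2 * padding - kernel_size) // stride + 1

# Illustrative values only (assumed, not those of question a)
layer = torch.nn.Conv2d(in_channels=1, out_channels=8,
                        kernel_size=5, padding=2, stride=2)
out = layer(torch.zeros(1, 1, 28, 28))  # (batch, channels, height, width)
predicted = conv_output_size(28, kernel_size=5, padding=2, stride=2)
print(out.shape, predicted)
```

The channel dimension of the output is set by `out_channels`, while each spatial dimension follows the formula.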

**Section 2 (choose 1)**

d. What are the computational and statistical benefits of a stride larger than 1?

e. What is the computational cost of the pooling layer? Assume that the input to the pooling layer is of size $c\times h\times w$, the pooling window has a shape of $p_h \times p_w$ with a padding of $(p_h, p_w)$ and a stride of $(s_h, s_w)$.