# DATASCI 503, Homework 10: Neural Networks and Deep Learning


This assignment covers **network architecture** (layers, activation functions, parameter counting), the **softmax function** for multi-class classification, **convolutional neural networks (CNNs)** for image data, and **gradient descent** optimization.

**Resources:**
- [PyTorch Autograd Tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html) - Automatic differentiation for gradient computation
- [ISLP Chapter 10](https://www.statlearning.com/) - Deep Learning chapter from Introduction to Statistical Learning

In [None]:
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import torch

---

**Problem 1:** Neural Network Architecture (ISLP 10.1, parts a, b, and d)

Consider a neural network with two hidden layers: p = 4 input units, 2 units in the first hidden layer, 3 units in the second hidden layer, and a single output.

(10.1a) Draw a picture of the network, similar to Figures 10.1 or 10.4 in ISLP.

In [None]:
# BEGIN SOLUTION
G = nx.DiGraph()

G.add_nodes_from([r"$I_1$", r"$I_2$", r"$I_3$", r"$I_4$"], layer=0)
G.add_nodes_from([r"$H^{(1)}_1$", r"$H^{(1)}_2$"], layer=1)
G.add_nodes_from([r"$H^{(2)}_1$", r"$H^{(2)}_2$", r"$H^{(2)}_3$"], layer=2)
G.add_node(r"$O$", layer=3)

edges = [
    (r"$I_1$", r"$H^{(1)}_1$"),
    (r"$I_1$", r"$H^{(1)}_2$"),
    (r"$I_2$", r"$H^{(1)}_1$"),
    (r"$I_2$", r"$H^{(1)}_2$"),
    (r"$I_3$", r"$H^{(1)}_1$"),
    (r"$I_3$", r"$H^{(1)}_2$"),
    (r"$I_4$", r"$H^{(1)}_1$"),
    (r"$I_4$", r"$H^{(1)}_2$"),
    (r"$H^{(1)}_1$", r"$H^{(2)}_1$"),
    (r"$H^{(1)}_1$", r"$H^{(2)}_2$"),
    (r"$H^{(1)}_1$", r"$H^{(2)}_3$"),
    (r"$H^{(1)}_2$", r"$H^{(2)}_1$"),
    (r"$H^{(1)}_2$", r"$H^{(2)}_2$"),
    (r"$H^{(1)}_2$", r"$H^{(2)}_3$"),
    (r"$H^{(2)}_1$", r"$O$"),
    (r"$H^{(2)}_2$", r"$O$"),
    (r"$H^{(2)}_3$", r"$O$"),
]
G.add_edges_from(edges)

pos = nx.multipartite_layout(G, subset_key="layer")

plt.figure(figsize=(8, 4))
nx.draw(
    G,
    pos,
    with_labels=True,
    node_size=2000,
    node_color="skyblue",
    font_size=13,
    font_weight="bold",
    arrowstyle="-|>",
    arrowsize=10,
)
plt.title("Neural Network Architecture: 2 Hidden Layers")
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert (
    G.number_of_nodes() == 10
), "Network should have 10 nodes (4 input + 2 hidden1 + 3 hidden2 + 1 output)"
assert G.number_of_edges() == 17, "Network should have 17 edges"
print("All tests passed!")

# BEGIN HIDDEN TESTS
input_nodes = [n for n, d in G.nodes(data=True) if d.get("layer") == 0]
hidden1_nodes = [n for n, d in G.nodes(data=True) if d.get("layer") == 1]
hidden2_nodes = [n for n, d in G.nodes(data=True) if d.get("layer") == 2]
output_nodes = [n for n, d in G.nodes(data=True) if d.get("layer") == 3]
assert len(input_nodes) == 4, "Should have 4 input nodes"
assert len(hidden1_nodes) == 2, "Should have 2 nodes in first hidden layer"
assert len(hidden2_nodes) == 3, "Should have 3 nodes in second hidden layer"
assert len(output_nodes) == 1, "Should have 1 output node"
# END HIDDEN TESTS

(10.1b) Write out an expression for $f(X)$, assuming ReLU activation functions. Be as explicit as you can!

> BEGIN SOLUTION

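We use the following notation:
- ReLU activation function: $\text{ReLU}(z) = \max(0, z)$, applied elementwise
- Weights: $W^{(1)} \in \mathbb{R}^{2 \times 4}$, $W^{(2)} \in \mathbb{R}^{3 \times 2}$, $W^{(3)} \in \mathbb{R}^{1 \times 3}$
- Biases: $b^{(1)} \in \mathbb{R}^{2}$, $b^{(2)} \in \mathbb{R}^{3}$, $b^{(3)} \in \mathbb{R}$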
- Input: $X = [x_1, x_2, x_3, x_4]^T$

We then have the following equations for each layer:

1. **First Hidden Layer:**
   $$z^{(1)} = W^{(1)} X + b^{(1)}$$
   $$a^{(1)} = \text{ReLU}(z^{(1)})$$

2. **Second Hidden Layer:**
   $$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$$
   $$a^{(2)} = \text{ReLU}(z^{(2)})$$

3. **Output Layer:**
   $$z^{(3)} = W^{(3)} a^{(2)} + b^{(3)}$$
   $$f(X) = z^{(3)}$$

Combining these, we have:
$$f(X) = W^{(3)} \cdot \text{ReLU}(W^{(2)} \cdot \text{ReLU}(W^{(1)} X + b^{(1)}) + b^{(2)}) + b^{(3)}$$
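
As an illustration, the composed expression can be evaluated directly. The sketch below uses plain Python lists; the weight and bias values are arbitrary placeholders chosen for this example (only their shapes come from the problem), and the helper names `relu`, `affine`, and `f` are hypothetical.

```python
def relu(v):
    """Apply ReLU elementwise to a list of numbers."""
    return [max(0.0, z) for z in v]


def affine(W, a, b):
    """Compute W a + b, where W is given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


# Placeholder parameters (assumed values; only the shapes matter).
W1 = [[0.5, -1.0, 0.0, 2.0], [1.0, 1.0, -1.0, 0.0]]  # 2 x 4
b1 = [0.0, -1.0]
W2 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]  # 3 x 2
b2 = [0.0, 1.0, 0.5]
W3 = [[1.0, -1.0, 2.0]]  # 1 x 3
b3 = [0.25]


def f(x):
    """f(X) = W3 ReLU(W2 ReLU(W1 x + b1) + b2) + b3, with a linear output."""
    a1 = relu(affine(W1, x, b1))
    a2 = relu(affine(W2, a1, b2))
    return affine(W3, a2, b3)[0]


print(f([1.0, 2.0, 3.0, 4.0]))  # 2.5
```

With these placeholder values, the second unit of the first hidden layer and the third unit of the second hidden layer are both zeroed out by the ReLU.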
> END SOLUTION


(10.1d) How many parameters are there?

> BEGIN SOLUTION

There are **23 parameters** in total. We can compute this from the dimensions of the weight matrices and bias vectors:

- $W^{(1)}$ ($2 \times 4$): 8 weights, $b^{(1)}$: 2 biases
- $W^{(2)}$ ($3 \times 2$): 6 weights, $b^{(2)}$: 3 biases
- $W^{(3)}$ ($1 \times 3$): 3 weights, $b^{(3)}$: 1 bias

Total: $(4 \times 2) + 2 + (2 \times 3) + 3 + (3 \times 1) + 1 = 8 + 2 + 6 + 3 + 3 + 1 = 23$
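
The same count can be reproduced with a short helper. This is a minimal sketch; the function name `count_parameters` is hypothetical, and it assumes a fully connected network with one bias per non-input unit.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


print(count_parameters([4, 2, 3, 1]))  # 23
```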
> END SOLUTION


---

**Problem 2:** Softmax Invariance (ISLP 10.2)

Consider the softmax function in (10.13) (see also (4.13) on page 145 of ISLP) for modeling multinomial probabilities.

(a) In (10.13), show that if we add a constant $c$ to each of the $z_i$, then the probability is unchanged.

> BEGIN SOLUTION

From the softmax function:
$$\text{Pr}(Y=m|X) = \frac{e^{z_m}}{\sum_{k=1}^K e^{z_k}}$$

Adding a constant $c$ to each $z_i$ gives $z_i' = z_i + c$. The softmax probability computed from the shifted values is:
$$\text{Pr}(Y=m|X) = \frac{e^{z_m + c}}{\sum_{k=1}^K e^{z_k + c}}$$

We can factor out $e^c$ from both numerator and denominator:
$$\text{Pr}(Y=m|X) = \frac{e^c \cdot e^{z_m}}{e^c \cdot \sum_{k=1}^K e^{z_k}} = \frac{e^{z_m}}{\sum_{k=1}^K e^{z_k}}$$

Therefore, the probability is unchanged.
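
A quick numerical check of this invariance, as a sketch with arbitrary example scores. The function names are hypothetical. The second helper uses the invariance with $c = -\max_k z_k$, a common way to avoid overflow in the exponentials.

```python
import math


def softmax(z):
    """Naive softmax: exp(z_m) / sum_k exp(z_k)."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(z):
    """Same probabilities, computed after shifting by c = -max(z)."""
    c = -max(z)
    return softmax([v + c for v in z])


z = [2.0, -1.0, 0.5]  # arbitrary example scores
c = 3.0  # arbitrary shift
same = all(
    math.isclose(a, b, rel_tol=1e-9)
    for a, b in zip(softmax(z), softmax([v + c for v in z]))
)
print(same)  # True
```

For large scores such as `[1000.0, 1001.0]`, `math.exp` raises an `OverflowError` in the naive version, while `stable_softmax` still returns valid probabilities.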
> END SOLUTION


(b) In (4.13), show that if we add constants $c_j$, $j = 0,1,\ldots,p$, to each of the corresponding coefficients for each of the classes, then the predictions at any new point $x$ are unchanged.

> BEGIN SOLUTION

Given the softmax probability:
$$\Pr(Y = k | X = x) = \frac{e^{\beta_{0k} + \beta_{1k}x_1 + \ldots + \beta_{pk}x_p}}{\sum_{l=1}^K e^{\beta_{0l} + \beta_{1l}x_1 + \ldots + \beta_{pl}x_p}}$$

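Define new coefficients $\beta'_{jk} = \beta_{jk} + c_j$ for every class $k$, so that the same constant $c_j$ is added to the $j$-th coefficient of each class. The probability under the new coefficients is: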
$$\Pr(Y = k | X = x) = \frac{e^{(\beta_{0k} + c_0) + (\beta_{1k} + c_1)x_1 + \ldots + (\beta_{pk} + c_p)x_p}}{\sum_{l=1}^K e^{(\beta_{0l} + c_0) + (\beta_{1l} + c_1)x_1 + \ldots + (\beta_{pl} + c_p)x_p}}$$

We can factor out the common terms $e^{c_0 + c_1 x_1 + \ldots + c_p x_p}$ from both numerator and denominator:
$$\Pr(Y = k | X = x) = \frac{e^{c_0 + c_1 x_1 + \ldots + c_p x_p} \cdot e^{\beta_{0k} + \beta_{1k}x_1 + \ldots + \beta_{pk}x_p}}{e^{c_0 + c_1 x_1 + \ldots + c_p x_p} \cdot \sum_{l=1}^K e^{\beta_{0l} + \beta_{1l}x_1 + \ldots + \beta_{pl}x_p}}$$

The common factor cancels, yielding the original probability:
$$\Pr(Y = k | X = x) = \frac{e^{\beta_{0k} + \beta_{1k}x_1 + \ldots + \beta_{pk}x_p}}{\sum_{l=1}^K e^{\beta_{0l} + \beta_{1l}x_1 + \ldots + \beta_{pl}x_p}}$$

Therefore, the predictions at any new point $x$ remain unchanged.
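
The cancellation can also be checked numerically. In this sketch the coefficients, shifts, and test point are arbitrary made-up values, and `class_probs` is a hypothetical helper implementing the probability formula above.

```python
import math


def class_probs(betas, x):
    """Multinomial probabilities; betas[k] = [beta_0k, beta_1k, ..., beta_pk]."""
    scores = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in betas]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


betas = [[0.1, 1.0, -2.0], [0.0, 0.5, 0.5], [-0.3, -1.0, 1.5]]  # K = 3, p = 2
c = [5.0, -0.7, 2.0]  # c_0, c_1, c_2, shared across all classes
shifted = [[bj + cj for bj, cj in zip(b, c)] for b in betas]

x = [0.8, -1.2]  # arbitrary new point
unchanged = all(
    math.isclose(a, b, rel_tol=1e-9)
    for a, b in zip(class_probs(betas, x), class_probs(shifted, x))
)
print(unchanged)  # True
```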
> END SOLUTION


This shows that the softmax function is over-parametrized. However, regularization and SGD typically constrain the solutions so that this is not a problem.

---

**Problem 3:** CNN Parameter Counting (ISLP 10.4)

Consider a CNN that takes in 32 x 32 grayscale images and has a single convolution layer with three 5 x 5 convolution filters (without boundary padding).

(b) How many parameters are in this model?

> BEGIN SOLUTION

There are **78 parameters** in this model.

Each 5 x 5 filter has $5 \times 5 = 25$ weights plus 1 bias = 26 parameters.
With 3 filters: $26 \times 3 = 78$ parameters.

Equivalently: $(5 \times 5 \times 3) + 3 = 75 + 3 = 78$
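
The count generalizes to other filter sizes and channel counts. The sketch below uses hypothetical helper names and assumes stride 1 (the problem only states that there is no padding) and one bias per filter.

```python
def conv_output_size(n, k):
    """Side length of a k x k convolution output on an n x n image (stride 1, no padding)."""
    return n - k + 1


def conv_parameters(k, in_channels, n_filters):
    """k * k * in_channels weights plus one bias, for each filter."""
    return (k * k * in_channels + 1) * n_filters


print(conv_output_size(32, 5))  # 28
print(conv_parameters(5, 1, 3))  # 78
```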
> END SOLUTION


(c) Explain how this model can be thought of as an ordinary feed-forward neural network with the individual pixels as inputs, and with constraints on the weights in the hidden units. What are the constraints?

> BEGIN SOLUTION

We can view the CNN as a feed-forward network with the following characteristics:

1. **Input layer**: Each pixel of the $32 \times 32$ image is an individual input unit, giving $32 \times 32 = 1024$ input units.

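2. **Hidden layer**: Without padding (and with stride 1), each $5 \times 5$ filter produces a $(32 - 5 + 1) \times (32 - 5 + 1) = 28 \times 28$ feature map, and each unit in that map corresponds to a hidden neuron. With 3 filters, we have $28 \times 28 \times 3 = 2352$ hidden units.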

3. **Local connectivity constraint**: Each hidden neuron is connected to only a $5 \times 5$ patch of 25 pixels in the input image (rather than all 1024 inputs).

4. **Weight sharing constraint**: The same $5 \times 5 = 25$ weights are used for every neuron within a single filter's feature map. Each filter's weight matrix is reused 784 times (once for each of the $28 \times 28$ output positions).

These constraints dramatically reduce the number of parameters compared to a fully-connected network.
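
To make the two constraints concrete, the sketch below enumerates which input pixels feed each hidden unit and compares the number of connections with the number of free weights. The names are hypothetical, and stride 1 is assumed.

```python
IMAGE, KERNEL, FILTERS = 32, 5, 3
OUT = IMAGE - KERNEL + 1  # 28 (stride 1, no padding)


def receptive_field(i, j):
    """Flattened input-pixel indices feeding the hidden unit at output position (i, j)."""
    return [
        (i + di) * IMAGE + (j + dj)
        for di in range(KERNEL)
        for dj in range(KERNEL)
    ]


hidden_units = FILTERS * OUT * OUT

# Local connectivity: each hidden unit sees only 25 of the 1024 pixels.
connections = FILTERS * sum(
    len(receptive_field(i, j)) for i in range(OUT) for j in range(OUT)
)

# Weight sharing: the weight on a connection depends only on (di, dj) and the
# filter, not on the position (i, j), so each filter has just 25 free weights.
free_weights = FILTERS * KERNEL * KERNEL

# An unconstrained layer would connect every pixel to every hidden unit.
dense_weights = IMAGE * IMAGE * hidden_units

print(hidden_units, connections, free_weights, dense_weights)
# 2352 58800 75 2408448
```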
> END SOLUTION


(d) If there were no constraints, then how many weights would there be in the ordinary feed-forward neural network in (c)?

> BEGIN SOLUTION

Without constraints, we would have a fully-connected layer from 1024 inputs to 2352 hidden units.

Number of weights: $(32 \times 32) \times (28 \times 28 \times 3) = 1024 \times 2352 = 2{,}408{,}448$

This is roughly 32,000 times the 75 weights used by the CNN with weight sharing ($2{,}408{,}448 / 75 \approx 32{,}113$).
> END SOLUTION


---

**Problem 4:** Gradient Descent Visualization (ISLP 10.6)

Consider the simple function $R(\beta) = \sin(\beta) + \beta/10$.

(a) Draw a graph of this function over the range $\beta \in [-6, 6]$.

In [None]:
# BEGIN SOLUTION
def objective_function(beta):
    """Compute R(beta) = sin(beta) + beta/10."""
    return np.sin(beta) + beta / 10


beta_values = np.linspace(-6, 6, 400)
r_values = objective_function(beta_values)

plt.figure(figsize=(8, 5))
plt.plot(beta_values, r_values, label=r"$R(\beta) = \sin(\beta) + \frac{\beta}{10}$")
plt.title(r"Plot of the Function $R(\beta)$")
plt.xlabel(r"$\beta$")
plt.ylabel(r"$R(\beta)$")
plt.grid(ls="--", alpha=0.5)
plt.legend()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert len(beta_values) == 400, "Should have 400 points for smooth plotting"
assert beta_values[0] == -6 and beta_values[-1] == 6, "Beta range should be [-6, 6]"
assert callable(objective_function), "objective_function should be callable"
print("All tests passed!")

# BEGIN HIDDEN TESTS
test_beta = np.array([0, np.pi / 2, -np.pi / 2])
expected = np.sin(test_beta) + test_beta / 10
assert np.allclose(
    objective_function(test_beta), expected
), "objective_function should compute sin(beta) + beta/10"
# END HIDDEN TESTS

(b) What is the derivative of this function?

> BEGIN SOLUTION

The derivative of $R(\beta) = \sin(\beta) + \frac{\beta}{10}$ is:

$$R'(\beta) = \cos(\beta) + \frac{1}{10}$$
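
The derivative can be sanity-checked against a central finite difference. This is a sketch with hypothetical helper names; the step size `h` and the test points are arbitrary choices.

```python
import math


def R(beta):
    """R(beta) = sin(beta) + beta / 10."""
    return math.sin(beta) + beta / 10


def dR(beta):
    """Analytic derivative: cos(beta) + 1/10."""
    return math.cos(beta) + 1 / 10


def central_difference(func, x, h=1e-6):
    """Numerical derivative (func(x + h) - func(x - h)) / (2h)."""
    return (func(x + h) - func(x - h)) / (2 * h)


max_error = max(
    abs(dR(b) - central_difference(R, b)) for b in [-6.0, -1.5, 0.0, 2.3, 6.0]
)
print(max_error)  # small, dominated by floating-point roundoff
```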
> END SOLUTION


(c) Given $\beta^0 = 2.3$, run gradient descent for 50 iterations to find a local minimum of $R(\beta)$ using a learning rate of $\rho = 0.1$. Show each of $\beta^0, \beta^1, \ldots$ in your plot, as well as the final answer.

In [None]:
# BEGIN SOLUTION
def objective_function_torch(beta):
    """PyTorch version of R(beta) for autograd."""
    return torch.sin(beta) + beta / 10


# END SOLUTION

In [None]:
# BEGIN SOLUTION
beta = torch.tensor(2.3, requires_grad=True)
learning_rate = 0.1
num_iterations = 50

beta_history = [beta.item()]

for _ in range(num_iterations):
    loss = objective_function_torch(beta)
    loss.backward()

    with torch.no_grad():
        beta -= learning_rate * beta.grad
        beta_history.append(beta.item())

    beta.grad.zero_()

final_beta = beta.item()

# Plot
beta_grid = np.linspace(-6, 6, 400)
r_values = np.sin(beta_grid) + beta_grid / 10

plt.figure(figsize=(10, 6))
plt.plot(beta_grid, r_values, label=r"$R(\beta) = \sin(\beta) + \frac{\beta}{10}$", color="blue")
plt.scatter(
    beta_history,
    np.sin(np.array(beta_history)) + np.array(beta_history) / 10,
    color="red",
    alpha=0.3,
    label="Gradient Descent Steps",
)
plt.scatter(
    final_beta,
    np.sin(final_beta) + final_beta / 10,
    color="green",
    s=100,
    label=f"Final iterate: {final_beta:.4f}",
)
plt.title(r"Gradient Descent on $R(\beta)$ starting from $\beta^0 = 2.3$")
plt.xlabel(r"$\beta$")
plt.ylabel(r"$R(\beta)$")
plt.legend()
plt.grid(ls="--", alpha=0.5)
plt.show()

print(f"Final iterate after {num_iterations} steps: beta = {final_beta:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert len(beta_history) > 1, "Should have multiple gradient descent steps"
assert abs(beta_history[0] - 2.3) < 0.001, "Should start from beta^0 = 2.3"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Starting from beta=2.3, the iterates head to the local minimum at 2*pi - arccos(-0.1),
# about 4.612 (where cos(beta) = -0.1); 50 steps reach roughly 4.577
assert 4.0 < final_beta < 5.0, f"Final beta should converge near 4.577, got {final_beta}"
# END HIDDEN TESTS

> BEGIN SOLUTION

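Starting from $\beta^0 = 2.3$, gradient descent moves toward the local minimum at $\beta = 2\pi - \arccos(-0.1) \approx 4.612$, where $R'(\beta) = 0$. After 50 iterations the iterate is $\beta^{50} \approx 4.577$, close to but not yet at the minimum.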
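
As a cross-check, the stationary points of $R$ solve $\cos(\beta) = -0.1$, and the one nearest to the right of $\beta^0 = 2.3$ with $R''(\beta) = -\sin(\beta) > 0$ is $2\pi - \arccos(-0.1)$. The sketch below swaps the autograd loop for plain Python with the closed-form derivative (an assumption made to keep the example dependency-free; the function name is hypothetical) and compares 50 steps with a longer run.

```python
import math


def gradient_descent(beta, rho=0.1, steps=50):
    """Plain gradient descent on R(beta) using R'(beta) = cos(beta) + 0.1."""
    for _ in range(steps):
        beta -= rho * (math.cos(beta) + 0.1)
    return beta


# Stationary point with R''(beta) = -sin(beta) > 0, i.e. a local minimum.
exact_minimum = 2 * math.pi - math.acos(-0.1)

after_50 = gradient_descent(2.3, steps=50)
after_500 = gradient_descent(2.3, steps=500)

print(f"exact: {exact_minimum:.4f}, 50 steps: {after_50:.4f}, 500 steps: {after_500:.4f}")
```

With $\rho = 0.1$ the iterates approach the minimum from the left without overshooting, so the 50-step value sits slightly below the exact minimizer.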
> END SOLUTION


(d) Repeat with $\beta^0 = 1.4$.

In [None]:
# BEGIN SOLUTION
beta = torch.tensor(1.4, requires_grad=True)
learning_rate = 0.1
num_iterations = 50

beta_history = [beta.item()]

for _ in range(num_iterations):
    loss = objective_function_torch(beta)
    loss.backward()

    with torch.no_grad():
        beta -= learning_rate * beta.grad
        beta_history.append(beta.item())

    beta.grad.zero_()

final_beta = beta.item()

# Plot
beta_grid = np.linspace(-6, 6, 400)
r_values = np.sin(beta_grid) + beta_grid / 10

plt.figure(figsize=(10, 6))
plt.plot(beta_grid, r_values, label=r"$R(\beta) = \sin(\beta) + \frac{\beta}{10}$", color="blue")
plt.scatter(
    beta_history,
    np.sin(np.array(beta_history)) + np.array(beta_history) / 10,
    color="red",
    alpha=0.3,
    label="Gradient Descent Steps",
)
plt.scatter(
    final_beta,
    np.sin(final_beta) + final_beta / 10,
    color="green",
    s=100,
    label=f"Final iterate: {final_beta:.4f}",
)
plt.title(r"Gradient Descent on $R(\beta)$ starting from $\beta^0 = 1.4$")
plt.xlabel(r"$\beta$")
plt.ylabel(r"$R(\beta)$")
plt.legend()
plt.grid(ls="--", alpha=0.5)
plt.show()

print(f"Final iterate after {num_iterations} steps: beta = {final_beta:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert len(beta_history) > 1, "Should have multiple gradient descent steps"
assert abs(beta_history[0] - 1.4) < 0.001, "Should start from beta^0 = 1.4"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Starting from beta=1.4, the iterates head to the local minimum at -arccos(-0.1),
# about -1.671 (where cos(beta) = -0.1); 50 steps reach roughly -1.574
assert -2.0 < final_beta < -1.0, f"Final beta should converge near -1.574, got {final_beta}"
# END HIDDEN TESTS

> BEGIN SOLUTION

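Starting from $\beta^0 = 1.4$, gradient descent moves toward a different local minimum at $\beta = -\arccos(-0.1) \approx -1.671$. After 50 iterations the iterate is $\beta^{50} \approx -1.574$, close to but not yet at the minimum.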

This demonstrates that gradient descent can converge to different local minima depending on the starting point, which is a key challenge in training neural networks with non-convex loss functions.
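
The dependence on the starting point can be seen by sweeping a few initial values around the local maximum at $\arccos(-0.1) \approx 1.671$, which separates the two basins. This is a sketch in plain Python using the closed-form derivative rather than autograd; the function name, the extra starting points 1.6 and 1.7, and the step count are choices made for this example.

```python
import math


def gradient_descent(beta, rho=0.1, steps=2000):
    """Plain gradient descent on R(beta) using R'(beta) = cos(beta) + 0.1."""
    for _ in range(steps):
        beta -= rho * (math.cos(beta) + 0.1)
    return beta


left_minimum = -math.acos(-0.1)  # about -1.671
right_minimum = 2 * math.pi - math.acos(-0.1)  # about 4.612

for start in [1.4, 1.6, 1.7, 2.3]:
    print(f"start {start:.1f} -> {gradient_descent(start):.4f}")
# Starts below arccos(-0.1) end at the left minimum; starts above it end at the right one.
```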
> END SOLUTION
