# DATASCI 315, Homework 5: Shallow Networks, Loss Functions, and Fitting (Concepts and Theory) 

This assignment covers the theory behind neural networks through problems from [*Understanding Deep Learning*](https://udlbook.github.io/udlbook/). Topics include **shallow neural networks** (architecture and capacity), **loss functions** (measuring model performance), and **fitting models** (how optimization finds good parameters).

## UDL Chapter 3: Shallow Neural Networks

---

**Problem 1:** UDL 3.4

Draw a version of UDL Figure 3.3 where the y-intercept and slope of the third hidden unit have changed as in UDL Figure 3.14c. Assume that the remaining parameters remain the same.

![UDL Figure 3.3](fig3_3.svg)

*UDL Figure 3.3: Visualization of shallow network computation. a–c) The input is passed through three linear functions, one per hidden unit, each with its own y-intercept and slope. d–f) Each linear function is passed through a ReLU activation, which clips negative values to zero. g–i) The clipped functions are weighted. j) The weighted functions are summed and an offset is added to produce the complete input-output mapping.*
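The stages of this computation can be traced numerically. The following is a minimal sketch, assuming the one-input, three-hidden-unit ReLU network $y = \phi_0 + \sum_{d=1}^{3} \phi_d \, \text{ReLU}[\theta_{d0} + \theta_{d1} x]$ and the parameter values quoted in Problem 2 of this assignment. The function name `shallow_network_stages` and the variable names are illustrative, not taken from the book.

```python
import numpy as np

# Parameter values quoted in Problem 2 of this assignment for UDL Figure 3.3.
theta = np.array([[-0.2, 0.4],    # hidden unit 1: [y-intercept, slope]
                  [-0.9, 0.9],    # hidden unit 2
                  [1.1, -0.7]])   # hidden unit 3
phi0 = -0.23                      # output offset
phi = np.array([-1.3, 1.3, 0.66]) # output weights for the three hidden units


def shallow_network_stages(x):
    """Return every intermediate stage of the network for a 1-D array of inputs."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    pre = theta[:, [0]] + theta[:, [1]] * x[None, :]  # linear functions, shape (3, N)
    act = np.maximum(pre, 0.0)                        # ReLU clips negatives to zero
    weighted = phi[:, None] * act                     # scale each hidden unit
    y = phi0 + weighted.sum(axis=0)                   # sum and add the offset
    return pre, act, weighted, y


x = np.linspace(0.0, 2.0, 5)
pre, act, weighted, y = shallow_network_stages(x)
print(y)
```

Changing one row of `theta` and re-running the function shows how a single hidden unit's intercept and slope move the joints of the final piecewise-linear function.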

![UDL Figure 3.14](fig3_14.svg)

*UDL Figure 3.14: Processing in a network with one input, three hidden units, and one output for UDL Problem 3.4. a–c) The input to each hidden unit is a linear function of the network input. The first two are the same as in UDL Figure 3.3, but the last one differs.*

> BEGIN SOLUTION

> END SOLUTION

---

**Problem 2:** UDL 3.8

Consider replacing the ReLU activation function with (i) the Heaviside step function $\text{heaviside}[z]$, (ii) the hyperbolic tangent function $\tanh[z]$, and (iii) the rectangular function $\text{rect}[z]$, where:

$$\text{heaviside}[z] = \begin{cases} 0 & z < 0 \\ 1 & z \geq 0 \end{cases} \qquad \text{rect}[z] = \begin{cases} 0 & z < 0 \\ 1 & 0 \leq z \leq 1 \\ 0 & z > 1 \end{cases}$$

Redraw a version of UDL Figure 3.3 for each of these functions. The original parameters were: $\boldsymbol{\phi} = \{\phi_0, \phi_1, \phi_2, \phi_3, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}, \theta_{30}, \theta_{31}\} = \{-0.23, -1.3, 1.3, 0.66, -0.2, 0.4, -0.9, 0.9, 1.1, -0.7\}$. Provide an informal description of the family of functions that can be created by neural networks with one input, three hidden units, and one output for each activation function.

![UDL Figure 3.3](fig3_3.svg)

*UDL Figure 3.3: Visualization of shallow network computation with ReLU activations.*
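When redrawing the figure, it can help to evaluate the candidate activation functions on a few sample points first. The sketch below implements the piecewise definitions given above with NumPy; the function names `heaviside` and `rect` simply mirror the notation in the problem, and `np.tanh` is NumPy's built-in hyperbolic tangent.

```python
import numpy as np


def heaviside(z):
    """Step function as defined in the problem: 0 for z < 0, 1 for z >= 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0.0, 1.0, 0.0)


def rect(z):
    """Rectangular function as defined in the problem: 1 on [0, 1], 0 elsewhere."""
    z = np.asarray(z, dtype=float)
    return np.where((z >= 0.0) & (z <= 1.0), 1.0, 0.0)


z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(heaviside(z))  # [0. 1. 1. 1. 1.]
print(rect(z))       # [0. 1. 1. 1. 0.]
print(np.tanh(z))    # smooth, odd, and bounded in (-1, 1)
```

Note that both boundary points $z = 0$ and $z = 1$ map to 1 under `rect`, matching the inequalities in the definition.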

> BEGIN SOLUTION

> END SOLUTION

---

**Problem 3:** UDL 3.14

Write out the equations that define the network in UDL Figure 3.11. There should be three equations to compute the three hidden units from the inputs and two equations to compute the outputs from the hidden units.

![UDL Figure 3.11](fig3_11.svg)

*UDL Figure 3.11: Neural network with two inputs, three hidden units, and two outputs.*

> BEGIN SOLUTION

> END SOLUTION

---

**Problem 4:** UDL 3.16

Write out the equations for a network with two inputs, four hidden units, and three outputs. Draw this model in the style of UDL Figure 3.11.

![UDL Figure 3.11](fig3_11.svg)

*UDL Figure 3.11: Neural network with two inputs, three hidden units, and two outputs.*

> BEGIN SOLUTION

> END SOLUTION

## UDL Chapter 5: Loss Functions

---

**Problem 5:** UDL 5.5

Consider extending the model from problem 5.3 to predict the wind direction using a mixture of two von Mises distributions. Write an expression for the likelihood $\Pr(y|\boldsymbol{\theta})$ for this model. How many outputs will the network need to produce?

*Background (from problem 5.3):* The von Mises distribution is suitable for circular domains:

$$\Pr(y|\mu, \kappa) = \frac{\exp[\kappa \cos(y - \mu)]}{2\pi \cdot \text{Bessel}_0[\kappa]}$$

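where $\mu$ is a measure of the mean direction, $\kappa$ is a measure of the concentration (it plays a role analogous to the inverse of the variance), and $\text{Bessel}_0[\kappa]$ is the modified Bessel function of the first kind of order 0.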
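As a sanity check on the single-component density above, the following sketch evaluates it with NumPy and confirms numerically that it integrates to approximately 1 over one full circle. It uses `np.i0`, NumPy's modified Bessel function of the first kind of order 0; the function name `von_mises_pdf` and the values $\mu = 0.5$, $\kappa = 2$ are hypothetical choices for illustration only.

```python
import numpy as np


def von_mises_pdf(y, mu, kappa):
    """Von Mises density: exp(kappa * cos(y - mu)) / (2 * pi * I0(kappa))."""
    return np.exp(kappa * np.cos(y - mu)) / (2.0 * np.pi * np.i0(kappa))


# Uniform grid over one period; a Riemann sum is accurate for periodic functions.
n = 2000
y = np.linspace(-np.pi, np.pi, n, endpoint=False)
dy = 2.0 * np.pi / n

density = von_mises_pdf(y, mu=0.5, kappa=2.0)  # hypothetical parameter values
total = density.sum() * dy
print(total)  # should be very close to 1
```

Larger values of `kappa` concentrate the density more tightly around `mu`, while `kappa = 0` gives the uniform density $1/(2\pi)$ on the circle.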

> BEGIN SOLUTION

> END SOLUTION

---

**Problem 6:** UDL 5.6

Consider building a model to predict the number of pedestrians $y \in \{0, 1, 2, \ldots\}$ that will pass a given point in the city in the next minute, based on data $\mathbf{x}$ that contains information about the time of day, the longitude and latitude, and the type of neighborhood. A suitable distribution for modeling counts is the Poisson distribution (UDL Figure 5.15). This has a single parameter $\lambda > 0$ called the *rate* that represents the mean of the distribution. Because $y$ is discrete, the distribution is described by a probability mass function:

$$\Pr(y = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Design a loss function for this model assuming we have access to $I$ training pairs $\{\mathbf{x}_i, y_i\}$.
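To build intuition for the Poisson distribution before designing the loss, the sketch below evaluates the probability mass function stated above and checks numerically that it sums to about 1 and has mean $\lambda$. The function name `poisson_pmf` and the rate `lam = 3.5` are hypothetical choices for illustration; the computation is done in log space (via `math.lgamma`) only to avoid overflow in $k!$.

```python
import math


def poisson_pmf(k, lam):
    """Pr(y = k) = lam**k * exp(-lam) / k!, evaluated in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


lam = 3.5            # hypothetical rate, chosen only for illustration
ks = range(0, 100)   # the tail beyond k = 100 is negligible for this rate

total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
print(total, mean)   # approximately 1 and approximately lam
```

Since the rate must satisfy $\lambda > 0$, any network output used as $\lambda$ has to be mapped to a positive value first.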

> BEGIN SOLUTION

> END SOLUTION

---

**Problem 7:** UDL 5.7

Consider a multivariate regression problem where we predict ten outputs, so $\mathbf{y} \in \mathbb{R}^{10}$, and model each with an independent normal distribution where the means $\mu_d$ are predicted by the network, and variances $\sigma^2$ are constant. Write an expression for the likelihood $\Pr(\mathbf{y}|\mathbf{f}[\mathbf{x}, \boldsymbol{\omega}])$. Show that minimizing the negative log-likelihood of this model is still equivalent to minimizing a sum of squared terms if we don't estimate the variance $\sigma^2$.

> BEGIN SOLUTION

> END SOLUTION

## UDL Chapter 6: Fitting Models

---

**Problem 8:** UDL 6.9

We run the stochastic gradient descent algorithm for 1,000 iterations on a dataset of size 100 with a batch size of 20. For how many epochs did we train the model?
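The relationship between iterations, batch size, and epochs can be sketched in a few lines. This is a minimal illustration, assuming batches are drawn without replacement so that one epoch is a single pass through the dataset; the function name `epochs_trained` and the numbers in the example call are hypothetical and deliberately differ from those in the problem.

```python
def epochs_trained(iterations, dataset_size, batch_size):
    """Number of full passes through the data after a given number of SGD iterations."""
    iterations_per_epoch = dataset_size / batch_size  # batches needed to see every example once
    return iterations / iterations_per_epoch


# Hypothetical example: 1,000 examples in batches of 50 gives 20 iterations per epoch,
# so 500 iterations correspond to 25 epochs.
print(epochs_trained(iterations=500, dataset_size=1000, batch_size=50))  # 25.0
```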

> BEGIN SOLUTION

> END SOLUTION