import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer, SplineTransformer, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, LinearRegression
from sklearn.svm import SVC
import warnings
warnings.filterwarnings("ignore")

# CMSE 381, Fundamental Data Science Methods
## Homework 9, Fall 2025

**Name:** Monis, Lowell

---

### Question 1: ISLP $\S$ 10.10.1

Consider a neural network with two hidden layers: $p = 4$ input units, $2$ units in the first hidden layer, $3$ units in the second hidden layer, and a single output.

#### (a) Draw a picture of the network.

![ans](../img/ans10-10-1.png)

#### (b) Write out an expression for $f(X)$, assuming ReLU activation functions. Be as explicit as you can!

The ReLU activation function can be expressed as,

$$g(x)=(x)_+=\max(0,x)$$

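Let $\mathbf{w}^{(1)}$ and $\mathbf{w}^{(2)}$ be the hidden-layer weight matrices and $\boldsymbol{\beta}$ the output weight vector, each carrying its bias term at index $0$ (so every dot product below implicitly prepends a $1$ to the incoming vector). The activations of the first hidden layer, $A^{(1)}_k$ for $k\in\{1,2\}$, are then: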

$$A^{(1)}_k=g(\mathbf{w}^{(1)}_k \cdot \mathbf{X})=\left(\mathbf{w}^{(1)}_k \cdot \mathbf{X}\right)_+=\max\left(0, w^{(1)}_{k,0} + \sum_{j=1}^{4}w^{(1)}_{k,j}X_j\right)$$

Similarly, the activations of the second hidden layer, $A^{(2)}_l$ for $l\in\{1,2,3\}$, are:

$$A^{(2)}_l=g(\mathbf{w}^{(2)}_l \cdot \mathbf{A}^{(1)})=\left(\mathbf{w}^{(2)}_l \cdot \mathbf{A}^{(1)}\right)_+=\max\left(0, w^{(2)}_{l,0} + \sum_{k=1}^{2}w^{(2)}_{l,k}A_k^{(1)}\right)$$

Finally, the output $f(\mathbf{X})$ is a linear function of the second hidden layer's activations, weighted by the output weights $\boldsymbol{\beta}$ (no activation is applied at the output):

$$f(\mathbf{X})=\boldsymbol{\beta}\cdot \mathbf{A}^{(2)}=\beta_0 + \sum_{l=1}^{3}\beta_l A^{(2)}_l$$

Substituting the two hidden-layer expressions gives the fully expanded form:

$$f(\mathbf{X})=\beta_0 + \sum_{l=1}^{3}\beta_l \max\left(0, w^{(2)}_{l,0} + \sum_{k=1}^{2}w^{(2)}_{l,k}\max\left(0, w^{(1)}_{k,0} + \sum_{j=1}^{4}w^{(1)}_{k,j}X_j\right)\right)$$
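
The same expression can also be written compactly in matrix form. This is only a notational sketch: the symbols $\mathbf{W}^{(1)}\in\mathbb{R}^{2\times4}$, $\mathbf{W}^{(2)}\in\mathbb{R}^{3\times2}$, $\mathbf{b}^{(1)}\in\mathbb{R}^{2}$, $\mathbf{b}^{(2)}\in\mathbb{R}^{3}$, and $\tilde{\boldsymbol{\beta}}\in\mathbb{R}^{3}$ are introduced here for illustration. They separate the bias entries $w^{(1)}_{k,0}$, $w^{(2)}_{l,0}$, and $\beta_0$ from the remaining weights, and $g$ is applied elementwise:

$$f(\mathbf{X})=\beta_0+\tilde{\boldsymbol{\beta}}^{\top}g\left(\mathbf{b}^{(2)}+\mathbf{W}^{(2)}g\left(\mathbf{b}^{(1)}+\mathbf{W}^{(1)}\mathbf{X}\right)\right)$$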

#### (c) Now plug in some values for the coefficients and write out the value of $f(X)$.

Pick any values you would like for the coefficients, but be sure to elaborate what they are. Then, plug in the vectors $(1,0,0,0)$, $(0,1,0,0)$, $(0,0,1,0)$, and $(0,0,0,1)$ and show the output.

I will first define the coefficients.

First hidden layer weights $\mathbf{w}^{(1)}$ (a $2\times5$ matrix whose first column holds the bias terms):
- $\mathbf{w}^{(1)}_1 = [0.5, 1, -1, 0.5, -0.5]$
- $\mathbf{w}^{(1)}_2 = [-0.3, 0.5, 1, -0.5, 1]$

Second hidden layer weights $\mathbf{w}^{(2)}$ (a $3\times3$ matrix whose first column holds the bias terms):
- $\mathbf{w}^{(2)}_1 = [0.2, 1, -0.5]$
- $\mathbf{w}^{(2)}_2 = [-0.1, 0.5, 1]$
- $\mathbf{w}^{(2)}_3 = [0.3, -1, 0.5]$

Output layer weights $\boldsymbol{\beta}$ (the first entry is the bias $\beta_0$):
- $\boldsymbol{\beta} = [1, 0.8, 1.2, -0.5]$

Now, I will compute $f(\mathbf{X})$ for each input vector:

1. For $\mathbf{X} = (1,0,0,0)$:

- $A^{(1)}_1 = \max(0, 0.5 + 1(1) + (-1)(0) + 0.5(0) + (-0.5)(0)) = \max(0, 1.5) = 1.5$
- $A^{(1)}_2 = \max(0, -0.3 + 0.5(1) + 1(0) + (-0.5)(0) + 1(0)) = \max(0, 0.2) = 0.2$

- $A^{(2)}_1 = \max(0, 0.2 + 1(1.5) + (-0.5)(0.2)) = \max(0, 1.6) = 1.6$
- $A^{(2)}_2 = \max(0, -0.1 + 0.5(1.5) + 1(0.2)) = \max(0, 0.85) = 0.85$
- $A^{(2)}_3 = \max(0, 0.3 + (-1)(1.5) + 0.5(0.2)) = \max(0, -1.1) = 0$

$$f(1,0,0,0) = 1 + 0.8(1.6) + 1.2(0.85) + (-0.5)(0) = 1 + 1.28 + 1.02 + 0 = 3.30$$

2. For $\mathbf{X} = (0,1,0,0)$:

- $A^{(1)}_1 = \max(0, 0.5 + 0 - 1 + 0 + 0) = \max(0, -0.5) = 0$
- $A^{(1)}_2 = \max(0, -0.3 + 0 + 1 + 0 + 0) = \max(0, 0.7) = 0.7$

- $A^{(2)}_1 = \max(0, 0.2 + 0 - 0.35) = \max(0, -0.15) = 0$
- $A^{(2)}_2 = \max(0, -0.1 + 0 + 0.7) = \max(0, 0.6) = 0.6$
- $A^{(2)}_3 = \max(0, 0.3 + 0 + 0.35) = \max(0, 0.65) = 0.65$

$$f(0,1,0,0) = 1 + 0 + 0.72 - 0.325 = 1.395$$

3. For $\mathbf{X} = (0,0,1,0)$:

- $A^{(1)}_1 = \max(0, 0.5 + 0.5) = 1.0$
- $A^{(1)}_2 = \max(0, -0.3 - 0.5) = 0$

- $A^{(2)}_1 = \max(0, 0.2 + 1.0) = 1.2$
- $A^{(2)}_2 = \max(0, -0.1 + 0.5) = 0.4$
- $A^{(2)}_3 = \max(0, 0.3 - 1.0) = 0$

$$f(0,0,1,0) = 1 + 0.96 + 0.48 + 0 = 2.44$$

4. For $\mathbf{X} = (0,0,0,1)$:

- $A^{(1)}_1 = \max(0, 0.5 - 0.5) = 0$
- $A^{(1)}_2 = \max(0, -0.3 + 1) = 0.7$

- $A^{(2)}_1 = \max(0, 0.2 - 0.35) = 0$
- $A^{(2)}_2 = \max(0, -0.1 + 0.7) = 0.6$
- $A^{(2)}_3 = \max(0, 0.3 + 0.35) = 0.65$

$$f(0,0,0,1) = 1 + 0 + 0.72 - 0.325 = 1.395$$
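
The hand computations above can be cross-checked numerically. The following is a minimal NumPy sketch of the forward pass using the coefficients chosen in this part; the helper names `relu` and `f` and the array names `W1`, `W2`, `beta`, and `outputs` are my own choices for illustration, and the first column (or entry) of each array is assumed to be the bias, as defined above.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z)."""
    return np.maximum(0, z)

# Coefficients from part (c); column 0 of each matrix is the bias (assumed layout).
W1 = np.array([[0.5, 1.0, -1.0, 0.5, -0.5],
               [-0.3, 0.5, 1.0, -0.5, 1.0]])
W2 = np.array([[0.2, 1.0, -0.5],
               [-0.1, 0.5, 1.0],
               [0.3, -1.0, 0.5]])
beta = np.array([1.0, 0.8, 1.2, -0.5])

def f(x):
    """Forward pass: two ReLU hidden layers followed by a linear output."""
    a1 = relu(W1[:, 0] + W1[:, 1:] @ x)   # first hidden layer, 2 units
    a2 = relu(W2[:, 0] + W2[:, 1:] @ a1)  # second hidden layer, 3 units
    return beta[0] + beta[1:] @ a2        # single output, no activation

# Rows of the 4x4 identity are the four unit input vectors.
outputs = [float(f(e)) for e in np.eye(4)]
for e, y in zip(np.eye(4), outputs):
    print(e.astype(int), round(y, 3))
```

The printed values should agree with the four results worked out by hand.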

#### (d) How many parameters are there?

Each layer contributes one weight per incoming connection plus one bias per unit. Between the input and the first hidden layer, four predictors feed two units, each with its own bias, giving $(4\times2)+2=10$ parameters. Between the two hidden layers, two activations feed three units, each with a bias, giving $(2\times3)+3=9$ parameters. Finally, three activations feed the single output unit, which has one bias, giving $(3\times1)+1=4$ parameters. In total, the network has $10+9+4=23$ parameters.
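
The same count can be reproduced with a short helper. This is a small sketch, assuming a fully connected network in which every unit has a bias; the function name `count_dense_params` and the list-of-layer-sizes interface are hypothetical, chosen only for illustration.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    Each pair of consecutive layers contributes (n_in + 1) * n_out
    parameters: n_in weights plus one bias for each of the n_out units.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 4 inputs -> 2 hidden units -> 3 hidden units -> 1 output
total = count_dense_params([4, 2, 3, 1])
print(total)
```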

### Question 2: ISLP $\S$ 10.10.4

Only a subset of the subproblems in this question is to be answered.

Consider a CNN that takes in $32\times32$ grayscale images and has a single convolution layer with three $5\times5$ convolution filters (without boundary padding).

#### (a) Draw a sketch of the input and first hidden layer.

![plot](../img/ans10.svg)

#### (b) How many parameters are in this model?

The parameters for a convolutional neural network consist of the weights in each filter and an extra bias term per filter.

Each of the three $5\times5$ convolution filters has $5\times5\times1=25$ weights, where the factor of $1$ is the single (grayscale) input channel, i.e., the input depth. Adding one bias per filter gives $26$ parameters per filter:

$$\mathrm{Parameters}=(\text{filter height}\times\text{filter width}\times\text{input channels}+1)\times\text{number of filters}=(25+1)\times3=26\times3=78$$
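
As a quick check of this arithmetic, the formula can be evaluated in code. This is a sketch only: the function names `conv_layer_params` and `conv_output_size` are hypothetical, and the feature-map size calculation assumes a stride of 1, which the problem statement does not specify.

```python
def conv_layer_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights per filter plus one bias per filter, times the number of filters."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

def conv_output_size(input_size, kernel_size):
    """Side length of each feature map with no padding and stride 1 (assumed)."""
    return input_size - kernel_size + 1

n_params = conv_layer_params(5, 5, 1, 3)  # three 5x5 filters on one grayscale channel
out_side = conv_output_size(32, 5)        # side length of each of the three feature maps
print(n_params, out_side)
```

The parameter count does not depend on the $32\times32$ input size, because each filter's weights are shared across every position in the image.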

