# Chapter 13: Training probabilistic graphical models

In the [programming assignment](https://github.com/jmyers7/stats-book-materials/blob/main/programming-assignments/assignment_12.ipynb) for the [previous chapter](https://mml.johnmyersmath.com/stats-book/chapters/12-models.html), we trained probabilistic models using the [scikit-learn](https://scikit-learn.org/stable/index.html) library. Part of what makes that library so powerful and useful is that it hides the training process from the user, allowing the user to focus on extracting insights and enlightenment from their data without worrying about technical stuff. But, at least once in their career, analysts should see what it takes to train a probabilistic model _from scratch_. That's what we're going to do in this assignment.

In the previous programming assignment, we trained a Naive Bayes model to function as a spam classifier. Our goal is to re-do the construction of that model, but this time we will train the model ourselves by minimizing the cross entropy training objective (see [Theorem 13.4](https://mml.johnmyersmath.com/stats-book/chapters/13-learning.html#equiv-obj-gen-thm) in the book) via stochastic gradient descent (SGD). Our workflow is this:

1. Implement the link function for the Naive Bayes model.
2. Implement the [model surprisal function](https://mml.johnmyersmath.com/stats-book/chapters/13-learning.html#gen-model-functions-def) to serve as the [loss function](https://mml.johnmyersmath.com/stats-book/chapters/11-optim.html#stochastic-gradient-descent) for SGD.
3. Minimize the [cross entropy](https://mml.johnmyersmath.com/stats-book/chapters/13-learning.html#equiv-obj-gen-thm) via SGD.
4. Check for convergence of SGD using diagnostic plots.
5. Evaluate the goodness-of-fit of the model by computing classification metrics like the _confusion matrix_, _accuracy_, _precision_, and _recall_.

This general workflow is not special to Naive Bayes models---this is more or less the same sequence of steps that you would follow to train _any_ model.

Let's get started!


## Directions

1. The programming assignment is organized into sequences of short problems. You can see the structure of the programming assignment by opening the "Table of Contents" along the left side of the notebook (if you are using Google Colab or Jupyter Lab).

2. Do not add any cells of your own to the notebook, or delete any existing cells (either code or markdown).

## Submission instructions

1. Once you have finished entering all your solutions, you will want to rerun all cells from scratch to ensure that everything works OK. To do this in Google Colab, click "Runtime -> Restart and run all" along the top of the notebook.

2. Now scroll back through your notebook and make sure that all code cells ran properly.

3. If everything looks OK, save your assignment and upload the `.ipynb` file at the provided link on the course <a href="https://github.com/jmyers7/stats-book-materials">GitHub repo</a>. Late submissions are not accepted.

4. You may submit multiple times, but I will only grade your last submission.

## The cross entropy training objective for a Naive Bayes model

We first saw _Naive Bayes models_ back in the [programming assignment](https://github.com/jmyers7/stats-book-materials/blob/main/programming-assignments/assignment_12.ipynb) for Chapter 12, and we studied their likelihood and surprisal functions in the [worksheet](https://github.com/jmyers7/stats-book-materials/blob/main/worksheets/13-learning-sol.pdf) for the current chapter. Recall that the underlying graph of a Naive Bayes model is of the form

<br>
<center>
<img src="https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/img/nb.svg" width="200" align="center">
</center>
<br>

where $\mathbf{X} \in \mathbb{R}^n$. The parameters are given by a number $\psi\in [0,1]$ that parametrizes the distribution of $Y\sim \mathcal{B}er(\psi)$, as well as two vectors $\boldsymbol{\theta}_0, \boldsymbol{\theta}_1 \in [0,1]^n$. The link function at $\mathbf{X}$ is given by

$$
p(\mathbf{x} \mid y ; \  \boldsymbol{\theta}_0, \boldsymbol{\theta}_1 ) = \prod_{j=1}^n \phi_j^{x_j}(1-\phi_j)^{1-x_j} \quad \text{where} \quad \boldsymbol{\phi} = (1-y) \boldsymbol{\theta}_0 + y \boldsymbol{\theta}_1
$$

and $\boldsymbol{\phi}^\intercal = (\phi_1,\ldots,\phi_n)$.

Given an observed dataset

$$
(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\ldots,(\mathbf{x}_m,y_m) \in \{0,1\}^n \times \{0,1\},
$$

we saw in Problem 1 of the [worksheet](https://github.com/jmyers7/stats-book-materials/blob/main/worksheets/13-learning-sol.pdf) (for the current chapter) that the cross entropy stochastic objective function is given by

$$
J(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1) = E_{(\mathbf{x},y)\sim \hat{p}(\mathbf{x},y)} \left[\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x},y) \right] = \frac{1}{m} \sum_{i=1}^m \mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_i,y_i),
$$

where

$$
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_i,y_i) = - y_i \log{\psi} - (1-y_i) \log{(1-\psi)} - \sum_{j=1}^n\left[x_{ij} \log{\phi_j} + (1-x_{ij}) \log{(1-\phi_j)} \right]
$$

is the _model surprisal function_ evaluated on the $i$-th instance of the dataset (for $i=1,\ldots,m$) and where

$$
\mathbf{x}_i^\intercal = (x_{i1},x_{i2},\ldots,x_{in})
$$

is the _feature vector_ for the $i$-th instance.
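
To make the formula concrete, here is a minimal sketch in plain Python that evaluates the surprisal one instance at a time with explicit loops, and then averages the results to get the objective $J$. The function name `surprisal_single`, the parameter values, and the toy dataset are all made up for illustration. This is deliberately the slow, non-vectorized version, not the implementation the assignment asks for.

```python
import math

def surprisal_single(psi, theta0, theta1, x, y):
    """Model surprisal of one instance (x, y), following the formula term by term."""
    # link function: phi_j = (1 - y) * theta0_j + y * theta1_j
    phi = [(1 - y) * t0 + y * t1 for t0, t1 in zip(theta0, theta1)]
    # surprisal of the label y under Ber(psi)
    total = -y * math.log(psi) - (1 - y) * math.log(1 - psi)
    # surprisal of each feature x_j under Ber(phi_j)
    for x_j, phi_j in zip(x, phi):
        total -= x_j * math.log(phi_j) + (1 - x_j) * math.log(1 - phi_j)
    return total

# hypothetical parameters and a toy dataset with n = 2 features and m = 3 instances
psi, theta0, theta1 = 0.4, [0.2, 0.7], [0.9, 0.5]
data = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1)]

surprisals = [surprisal_single(psi, theta0, theta1, x, y) for x, y in data]
J = sum(surprisals) / len(surprisals)   # the objective is the mean surprisal
print(surprisals, J)
```

For the first instance, the surprisal works out to $-\log(0.4 \cdot 0.9 \cdot 0.5)$, which is the negative logarithm of the joint probability of that instance under the toy parameters.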

## The idea of "vectorization"

So, it would seem that all we need to do is code the model surprisal function, and then toss it into the stochastic gradient descent algorithm as the [loss function](https://mml.johnmyersmath.com/stats-book/chapters/11-optim.html#stochastic-gradient-descent) to learn the parameters from the data. But, for the most efficient implementation, we need our model surprisal function to be "vectorized."

To explain what this means, it will be convenient to recall that the _design matrix_ of the dataset is the $m\times n$ matrix $\mathcal{X}$ with the feature vectors $\mathbf{x}_i$ as rows:

$$
\mathcal{X} \stackrel{\text{def}}{=}
\begin{bmatrix}
\leftarrow & \mathbf{x}_1^\intercal & \rightarrow \\
\leftarrow & \mathbf{x}_2^\intercal & \rightarrow \\
\vdots & \vdots & \vdots \\
\leftarrow & \mathbf{x}_m^\intercal & \rightarrow
\end{bmatrix} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}.
$$

We also imagine taking all the $y_i$'s and loading them into a single $m\times 1$ column vector

$$
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m\end{bmatrix}
$$

that we call the _(true) label vector_. Now, to say that the model surprisal function should be _vectorized_ means that we can plug the _entire_ design matrix $\mathcal{X}$ and the _entire_ label vector $\mathbf{y}$ into it as arguments, producing the following vector as output:

$$
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathcal{X},\mathbf{y}) =
\begin{bmatrix}
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_1, y_1) \\
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_2, y_2) \\
\vdots \\
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_m, y_m)
\end{bmatrix} \in \mathbb{R}^m.
$$

Notice that the output is a column vector containing the surprisals of each of the $m$ instances of data.
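
As a small illustration of what "vectorized" buys us, the sketch below computes only the label part of the surprisal, $-y\log{\psi} - (1-y)\log{(1-\psi)}$, in two ways: with a Python loop over instances, and with a single expression applied to the whole label tensor. The function names and the value of `psi` are hypothetical, chosen just for this example.

```python
import torch

def label_surprisal_loop(psi, y):
    """One instance at a time: a Python loop over the entries of y."""
    out = []
    for y_i in y:
        out.append(-y_i * torch.log(psi) - (1 - y_i) * torch.log(1 - psi))
    return torch.stack(out)

def label_surprisal_vectorized(psi, y):
    """All instances at once: the same formula applied to the whole tensor y."""
    return -y * torch.log(psi) - (1 - y) * torch.log(1 - psi)

psi = torch.tensor(0.3)                   # hypothetical parameter value
y = torch.tensor([1.0, 0.0, 0.0, 1.0])    # toy label vector with m = 4

loop_out = label_surprisal_loop(psi, y)
vec_out = label_surprisal_vectorized(psi, y)
print(loop_out, vec_out)
```

Both calls should return the same tensor of shape `(4,)`. The vectorized version hands the whole computation to PyTorch in one go, which is what we want for the full surprisal function as well.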

So, the bulk of the work in this assignment is designing a fully _vectorized_ version of the model surprisal function. I will walk you through it, step by step.

## Importing the data

First, let's import the email data from the previous [programming assignment](https://github.com/jmyers7/stats-book-materials/blob/main/programming-assignments/assignment_12.ipynb), as well as import (most of) the libraries required for this assignment. Run the next cell.

In [None]:
import pandas as pd
import torch
!pip install "math_stats_ml>=0.0.18"  # install the custom library for our course (quotes keep the shell from treating `>` as a redirect)
from math_stats_ml.gd import SGD, plot_gd # import the functions for training
from math_stats_ml.autograders.assignment_13 import * # import autograders

url = 'https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/data/data-12-3.csv'
df = pd.read_csv(url)
print('\nThe email data:\n')
df

Remember that the dataset consists of observations of seven random variables $X_1,X_2,X_3,X_4,X_5,X_6$, and $Y$. The $X$'s are binary random variables that indicate ($1=$ yes and $0=$ no) whether the words

$$
\text{office, cash, vacation, meeting, credit, cat}
$$

occur in an email, while $Y$ indicates whether the email is spam ($Y=1$) or not spam ($Y=0$).

Our design matrix $\mathcal{X}$ consists of the $X$-columns in the dataframe, while our label vector $\mathbf{y}$ consists of the $Y$-column. Let's extract these columns from the dataframe, and turn them into PyTorch tensors. I will do this for you. Run the next cell.

In [None]:
X = torch.tensor(df.iloc[:, :6].to_numpy(), dtype=torch.float32)
y = torch.tensor(df['y'].to_numpy(), dtype=torch.float32)

# print the design matrix `X` and vector `y`
print('The design matrix X: \n', X)
print('\nThe class label vector y: \n', y)

Notice the design matrix $\mathcal{X}$ is assigned to the Python variable `X`, while the label vector $\mathbf{y}$ is assigned to `y`.

## Building the vectorized link function

Our first step in building a vectorized model surprisal function is building a vectorized link function. From our discussion above, the link function is given by

$$
\boldsymbol{\phi} = (1-y) \boldsymbol{\theta}_0 + y \boldsymbol{\theta}_1.
$$

In terms of components, this equation yields

$$
\phi_j = (1-y) \theta_{0j} + y \theta_{1j} \tag{1}
$$

for each $j=1,\ldots,n$, where

$$
\boldsymbol{\theta}_0^\intercal = (\theta_{01},\theta_{02},\ldots,\theta_{0n}) \quad \text{and} \quad \boldsymbol{\theta}_1^\intercal = (\theta_{11},\theta_{12},\ldots,\theta_{1n}).
$$

But these formulas are correct _only_ for a single instance $y\in \{0,1\}$. To get the correct vectorized formulas, we need to bring in all the $y$'s from the dataset:

$$
y_1,y_2,\ldots,y_m \in \{0,1\}.
$$

If we write $y_i$ for the $i$-th instance in the dataset, then our formula (1) above needs to be rewritten as

$$
\phi_{ij} = (1-y_i) \theta_{0j} + y_i \theta_{1j}.
$$

Notice now that the $\phi$'s are doubly-indexed, with the first index $i$ (with $i=1,\ldots,m$) picking out the $i$-th instance in the dataset, and the second index $j$ (with $j=1,\ldots,n$) picking out the components of the parameter vectors $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$. Thus, the $\phi_{ij}$'s naturally combine to form a matrix of size $m\times n$.

But how do we get this matrix?

To help you along, I'm going to show you the formula, but it'll be up to you to figure out the implementation. Here is the matrix:

$$
\text{matrix of $\phi_{ij}$'s} = (\boldsymbol{1} - \mathbf{y})\boldsymbol{\theta}_0^\intercal + \mathbf{y} \boldsymbol{\theta}_1^\intercal. \tag{2}
$$

In this formula, the bold $\boldsymbol{1}$ represents an $m\times 1$ column vector of all $1$'s. Take a moment or two to convince yourself that this expression really does yield the correct $\phi_{ij}$'s.
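
If you would like a numerical sanity check of formula (2) before implementing it, the following plain-Python sketch computes the entries $\phi_{ij}$ one at a time on a toy example with $m=3$ and $n=2$. All the values are hypothetical, and the nested loops are not the vectorized implementation that Problem 1 asks for.

```python
# toy sizes: m = 3 instances, n = 2 features (all values hypothetical)
y = [1, 0, 1]
theta0 = [0.2, 0.7]
theta1 = [0.9, 0.5]

# entry (i, j) of the right-hand side of formula (2):
#   (1 - y_i) * theta0_j + y_i * theta1_j
phi = [[(1 - y_i) * t0 + y_i * t1 for t0, t1 in zip(theta0, theta1)] for y_i in y]

for y_i, row in zip(y, phi):
    print(y_i, row)
```

Row $i$ of the matrix is a copy of $\boldsymbol{\theta}_1^\intercal$ when $y_i=1$ and a copy of $\boldsymbol{\theta}_0^\intercal$ when $y_i=0$, which is exactly what the link function prescribes for each instance.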

So, all you need to do is use formula (2) for your vectorized link function. To test out your implementation after you write it, let's get some random values for the parameter vectors $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$. I will do this for you, as well as place the vectors in a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) called `parameters`. Run the next cell.

In [None]:
torch.manual_seed(57702)
theta0 = torch.rand(size=(6,))
theta1 = torch.rand(size=(6,))
parameters = {'theta0': theta0, 'theta1': theta1}

You can access the individual parameters in the dictionary `parameters` by indexing into it using the name of the parameter, exactly like indexing into the columns of a dataframe. For example:

In [None]:
parameters['theta0']

Our code creates a parameter tensor `theta0` that consists of six random numbers drawn from the uniform distribution on the interval $[0,1)$. Same for `theta1`. Crucially, notice that the shapes of the parameter tensors are `(6,)` and `(6,)`.

Also, the shape of the vector `y` is computed in the following cell:

In [None]:
y.shape

Thus, the shape of `y` is `(512,)`. These observations are **very** important for getting a working implementation! Remember them!

### Problem 1 --- Implementing the vectorized link function

In the next cell, write your implementation of the vectorized link function.

_Here are some hints:_

1. Look at the code in [the book](https://mml.johnmyersmath.com/stats-book/chapters/13-learning.html#mle-for-logistic-regression) to see how I implemented the link function for a logistic regression model. Use this code as inspiration.
2. Referring to formula (2), you may simply write `1 - y` to stand for $\boldsymbol{1} - \mathbf{y}$.
3. Use the symbol `@` for matrix multiplication.
4. To get the multiplications to work correctly, you may need to reshape your tensors from shapes `(6,)` and `(512,)` to shapes `(1, 6)` and `(512, 1)` by using the `reshape` method from PyTorch. Here's how it works: If `T` is a tensor of shape `(n,)` and you want to turn it into a column vector of shape `(n, 1)`, write `T.reshape(-1, 1)`. If you want to turn it into a row vector of shape `(1, n)`, write `T.reshape(1, -1)`.
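
The reshaping in hint 4 can be tried out on two small stand-in tensors. In this sketch, `a` and `b` are hypothetical vectors with nothing to do with the email data. The point is only to see which shapes `reshape` and `@` produce.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])   # shape (3,)
b = torch.tensor([10.0, 20.0])      # shape (2,)

col = a.reshape(-1, 1)              # column vector, shape (3, 1)
row = b.reshape(1, -1)              # row vector, shape (1, 2)
outer = col @ row                   # matrix product, shape (3, 2); entry (i, j) is a[i] * b[j]

print(col.shape, row.shape, outer.shape)
print(outer)
```

Without the reshapes, `a @ b` would be interpreted as a dot product of two 1-dimensional tensors, which requires the two tensors to have the same length.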

In [None]:
# ENTER YOUR CODE IN THIS CELL

def phi_link(parameters, y):
  None          # <-- replace `None` with your own code
  None          # <-- replace `None` with your own code
  return None   # <-- replace `None` with your own code

Now, let's test to make sure your implementation is correct. In the next code cell, pass in the `parameters` dictionary we defined above and the vector `y` to your `phi_link` function. Save the output of `phi_link` into the variable `phi`:

In [None]:
# ENTER YOUR CODE IN THIS CELL

phi = None      # <-- replace `None` with your own code

Now, run the next code cell to check the output:

In [None]:
# RUN THIS CELL TO CHECK YOUR ANSWERS

prob_check(answers=[phi], prob_num=1)

## Building the vectorized model surprisal function

We must now take your vectorized link function and build the full vectorized model surprisal function. For your reference, here is the formula from above, evaluated on a single instance $(\mathbf{x}_i,y_i)$ of data:

$$
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_i,y_i) = - y_i \log{\psi} - (1-y_i) \log{(1-\psi)} - \sum_{j=1}^n\left[x_{ij} \log{\phi_j} + (1-x_{ij}) \log{(1-\phi_j)} \right].
$$

Remember, our goal is to write an implementation that takes the entire $m\times n$ design matrix $\mathcal{X}$ as input, along with the $m\times 1$ label vector $\mathbf{y}$:

$$
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathcal{X},\mathbf{y}) =
\begin{bmatrix}
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_1, y_1) \\
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_2, y_2) \\
\vdots \\
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x}_m, y_m)
\end{bmatrix} \in \mathbb{R}^m.
$$

But, using vector/matrix algebra, this may be rewritten as

$$
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathcal{X},\mathbf{y}) = -\mathbf{y} \log{\psi} - (\boldsymbol{1}-\mathbf{y}) \log{(1-\psi)} - \text{sum over columns}\left( \mathcal{X} \odot \log{\boldsymbol{\phi}} + (\boldsymbol{1} - \mathcal{X}) \odot \log{(\boldsymbol{1}-\boldsymbol{\phi})} \right).
$$

In this formula:

1. We write $\boldsymbol{1} - \mathbf{y}$, where the bold $\boldsymbol{1}$ stands for the $m\times 1$ column vector with $1$'s in every entry. (_Hint_: You may simply write `1 - y` in your code.)
2. We write $\log{\boldsymbol{\phi}}$ for the entrywise logarithm of the $m\times n$ matrix $\boldsymbol{\phi}$ obtained from your `phi_link` function, and similarly for $\log{(\boldsymbol{1}-\boldsymbol{\phi})}$. The bold $\boldsymbol{1}$ stands for the $m\times n$ matrix with $1$'s in every entry. (_Hint_: You may simply write `1 - phi` in your code.)
3. We write $\odot$ for the entrywise product of one $m\times n$ matrix with another. (_Hint_: The entrywise product is given by the star operator `*`.)
4. We write "*sum over columns*" for the function that does exactly that: It takes an $m\times n$ matrix and sums over the columns to obtain an $m\times 1$ column vector. (_Hint_: Look up the method `torch.sum` in [the docs](https://pytorch.org/docs/stable/generated/torch.sum.html). Summing over columns corresponds to the parameter `dim=1`.)
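
To see the operations in items 2 through 4 in action, here is a short sketch on a toy $2\times 3$ example. The matrices `A` and `P` are made-up stand-ins and are not the design matrix or the output of `phi_link`.

```python
import torch

A = torch.tensor([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])        # toy 2 x 3 matrix of 0's and 1's
P = torch.tensor([[0.5, 0.25, 0.5],
                  [0.25, 0.5, 0.5]])       # toy 2 x 3 matrix of probabilities

entrywise = A * torch.log(P)               # entrywise product with the entrywise logarithm
row_totals = torch.sum(entrywise, dim=1)   # sum across each row, shape (2,)

print(entrywise)
print(row_totals)
```

With `dim=1`, the column index is the one that gets summed away, so the result has one entry per row of the input.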


### Problem 2 --- Implementing the vectorized model surprisal function

In this problem, you will put everything together and implement your vectorized model surprisal function. Using the hints above (also see the code in [the book](https://mml.johnmyersmath.com/stats-book/chapters/13-learning.html#mle-for-logistic-regression) where I implemented `I_model` for logistic regression), enter your implementation in the next cell:

In [None]:
# ENTER YOUR CODE IN THIS CELL

def I_model(parameters, X, y):
  None            # <-- replace `None` with your own code
  None            # <-- replace `None` with your own code
  return None     # <-- replace `None` with your own code

Let's test whether your implementation is correct before we train the model on the email data. For this, you need initial values for the parameter vectors $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$, as well as the parameter $\psi \in [0,1]$. In the next cell, I will do this for you, choosing random values for all these parameters, and then load them into a dictionary called `parameters`. Run the next cell.

In [None]:
torch.manual_seed(57702)
theta0 = torch.rand(size=(6,))
theta1 = torch.rand(size=(6,))
psi = torch.rand(size=(1,))
parameters = {'theta0': theta0, 'theta1': theta1, 'psi': psi}

Selecting good initial values for parameters can be tricky, and learning algorithms (like gradient descent) can be quite sensitive to initial parameter choice. Very often, initial parameters are _randomly_ chosen in some fashion. Here, we've chosen each of the $13 = 6 + 6 + 1$ initial values in the parameters from the uniform distribution on $[0,1)$.

Let's have a look at our initial parameters, just out of curiosity. Run the next cell.

In [None]:
print('Initial theta0 :', theta0)
print('Initial theta1 :', theta1)
print('Initial psi :', psi)

Yup, those look like random numbers. Just like we were expecting! 🤘

Now, in the next cell, pass in the dictionary `parameters`, as well as the design matrix `X` and the label vector `y` into your `I_model` function. Save the output into the variable `surprisals`.

In [None]:
# ENTER YOUR CODE IN THIS CELL

surprisals = None       # <-- replace `None` with your own code

Now for the moment of truth. Run the next cell to check if your implementation of `I_model` is correct.

In [None]:
# RUN THIS CELL TO CHECK YOUR ANSWERS

prob_check(answers=[surprisals], prob_num=2)

## Training the model and checking convergence

Congratulations on making it this far! 😀 👍 As you can tell from the hard work that it took to write the  `I_model` function, it's one thing to write down a mathematical formula on paper, but an entirely different thing to code an efficient implementation.

Now, assuming that `I_model` is correct, you will run stochastic gradient descent with `I_model` as the [loss function](https://mml.johnmyersmath.com/stats-book/chapters/11-optim.html#stochastic-gradient-descent) in order to minimize the cross entropy between the model distribution and the empirical distribution of the data. Thus, we are looking to **actually carry out** the learning process according to [Theorem 13.4](https://mml.johnmyersmath.com/stats-book/chapters/13-learning.html#equiv-obj-gen-thm) in the book.

### Problem 3 --- Training the model

In the next cell, I am asking you to write **all** of the code to train your model using stochastic gradient descent.

_Directions/hints/tips_:

1. This will require you to choose appropriate values (on your own!) for the learning rate, the number of epochs, and the mini-batch size. To begin, just pick what seem like sensible values, knowing that you might have to return and choose different values to get the algorithm to converge. Collectively, these are called _hyperparameters_, and the process of choosing values for them is called _hyperparameter tuning_.
2. Use the code in [the book](https://mml.johnmyersmath.com/stats-book/chapters/13-learning.html#mle-for-logistic-regression) for inspiration. In particular, the part where I train a logistic regression model might be informative.
3. The docs for our custom stochastic gradient descent routine are [here](https://github.com/jmyers7/math_stats_ml?tab=readme-ov-file#sgd-function-stochastic-gradient-descent), if you need them.
4. Save the output of your SGD run into the variable `gd_output`.

Even if your code is correct, there is the possibility that your algorithm blows up because you chose poor values for your hyperparameters. Be careful!
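
To get a feel for what the three hyperparameters do, here is a hand-rolled mini-batch SGD loop in plain Python for a much simpler problem: learning the single parameter $\psi$ of a Bernoulli distribution by minimizing the mean surprisal. This is only a sketch. It is not the `SGD` function from `math_stats_ml` (see the docs linked above for its actual interface), and the function name `sgd_bernoulli`, the toy labels, and the hyperparameter values are all assumptions made for illustration.

```python
import math
import random

def sgd_bernoulli(y, psi0, lr, num_epochs, batch_size, seed=0):
    """Mini-batch SGD on the mean surprisal of Ber(psi).

    Returns the final psi and the mean objective over each epoch.
    """
    rng = random.Random(seed)
    psi = psi0
    per_epoch = []
    for _ in range(num_epochs):
        idx = list(range(len(y)))
        rng.shuffle(idx)                       # new random order each epoch
        batch_losses = []
        for start in range(0, len(idx), batch_size):
            batch = [y[i] for i in idx[start:start + batch_size]]
            y_bar = sum(batch) / len(batch)
            # mean surprisal over the mini-batch, and its derivative in psi
            loss = -y_bar * math.log(psi) - (1 - y_bar) * math.log(1 - psi)
            grad = -y_bar / psi + (1 - y_bar) / (1 - psi)
            batch_losses.append(loss)
            psi -= lr * grad                   # gradient step
        per_epoch.append(sum(batch_losses) / len(batch_losses))
    return psi, per_epoch

y = [1] * 30 + [0] * 70                        # toy labels, 30% ones
psi_hat, objectives = sgd_bernoulli(y, psi0=0.5, lr=0.01, num_epochs=50, batch_size=10)
print(psi_hat, objectives[0], objectives[-1])
```

With these settings, `psi_hat` should end up near the sample proportion of ones, and the per-epoch objective should be lower at the end than at the start. A much larger learning rate can push `psi` outside the interval $(0,1)$, at which point the logarithm is no longer defined. That is one way an algorithm "blows up."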

In [None]:
# ENTER YOUR CODE IN THIS CELL



### Problem 4 --- Checking convergence

How do you know if stochastic gradient descent converged on good parameter values for $\boldsymbol{\theta}_0$, $\boldsymbol{\theta}_1$, and $\psi$? Well, first, if the previous cell didn't blow up and throw an error, that's a good first sign. But in order to properly check convergence, we need to make a diagnostic plot.

So, in the next cell, plot the cross entropy versus the number of gradient steps using our convenient `plot_gd` helper function. (See the book.) Again, you must write **all** the code. (See the book for help.) Choose descriptive names for the axis labels and the plot title, and be sure to include a legend with good descriptions. (**SEE THE DANG BOOK!!!**) By the way, the docs for the `plot_gd` function are [here](https://github.com/jmyers7/math_stats_ml?tab=readme-ov-file#plot_gd-function-plot-the-output-of-gradient-descent).

In [None]:
# ENTER YOUR CODE IN THIS CELL



You may claim victory if the mean cross entropy over the last epoch is below a value of $3.5$. This value is represented in the plot as the _last_ orange dot. But to check its value _precisely_, run the next cell.

In [None]:
gd_output.per_epoch_objectives

This code accesses the `per_epoch_objectives` attribute of the `gd_output` object produced by SGD, which is a $1$-dimensional PyTorch tensor containing the mean cross entropies over all the epochs. (Note that `gd_output` is an object of the custom class `GD_output` that I wrote specifically for our course. These objects track many other useful attributes. To see them all, see the docs [here](https://github.com/jmyers7/math_stats_ml?tab=readme-ov-file#gd_output-class-container-class-for-output-of-algorithms).)

If the _last_ number is below $3.5$, you're golden. 😎 If not, then you need to go back to Problem 3 and select different values for your hyperparameters in order to get the value below $3.5$. (If you copied your hyperparameters from the book, then you might see a `nan`, which means "not a number." Your algorithm blew up! Ha ha! Got ya! 😈 😛)

## Building the spam classifier

We now want to turn our trained Naive Bayes model into a spam classifier. This means that, given a feature vector

$$
\mathbf{x} = (x_1,x_2,\ldots,x_6) \in \{0,1\}^6
$$

containing observations of the six indicator random variables $X_1,X_2,\ldots,X_6$ for the six words mentioned above, we want to generate a predicted value $\hat{y}$ of the spam indicator random variable $Y$. If $\hat{y}=1$, then the model predicts the email is spam, while if $\hat{y}=0$, it is predicting non-spam.

The spam classifier will take the form of the _predictor function_ given by

$$
h:\{0,1\}^6 \to \{0,1\}, \quad \hat{y} = h(\mathbf{x}) = \operatorname*{argmax}_{y\in \{0,1\}} p(y \mid \mathbf{x}).
$$

In other words, $h(\mathbf{x})$ is equal to $\hat{y}=1$ if

$$
p(y=1 \mid \mathbf{x}) > p(y=0 \mid \mathbf{x}),
$$

and it is equal to $\hat{y}=0$ otherwise. So, if the (conditional) probability that an email is spam is greater than the probability that it is not spam, the predictor predicts spam. Pretty intuitive, right?

These conditional probabilities come from the model probability distribution. So, we really should be writing

$$
p(y \mid \mathbf{x} ; \ \boldsymbol{\theta}_0,\boldsymbol{\theta}_1,\psi)
$$

since they depend on the model parameters. But we won't.

By definition we have

$$
p(y\mid \mathbf{x}) = \frac{p(\mathbf{x},y)}{p(\mathbf{x})}.
$$

Since the feature vector $\mathbf{x}$ is assumed given, the probability $p(\mathbf{x})$ is constant with respect to $y$, and so we have

$$
\operatorname*{argmax}_{y\in \{0,1\}} p(y \mid \mathbf{x}) = \operatorname*{argmax}_{y\in \{0,1\}} p(\mathbf{x},y).
$$

But recall that

$$
\mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x},y) = -\log\left[p(\mathbf{x},y)\right].
$$

Since the negative logarithm function is strictly decreasing, we conclude that

$$
\operatorname*{argmax}_{y\in \{0,1\}} p(y \mid \mathbf{x}) = \operatorname*{argmin}_{y\in \{0,1\}} \mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x},y),
$$

and so our predictor function is (equivalently) given by the formula

$$
\hat{y} = h(\mathbf{x}) = \operatorname*{argmin}_{y\in \{0,1\}} \mathcal{I}_\text{model}(\psi,\boldsymbol{\theta}_0,\boldsymbol{\theta}_1; \ \mathbf{x},y).
$$

This is convenient because _we already have an implementation of the model surprisal function!_
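
The chain of equalities above can be checked numerically on a single feature vector. The sketch below uses plain Python, hypothetical parameter values for a model with $n=2$ features, and a helper `joint_prob` written just for this example. It computes the joint probabilities, the posterior, and the surprisals for both values of $y$, and then compares the argmax of the posterior with the argmin of the surprisal.

```python
import math

def joint_prob(psi, theta0, theta1, x, y):
    """p(x, y) = p(y) * prod_j p(x_j | y) for a single instance."""
    p = psi if y == 1 else 1 - psi
    theta = theta1 if y == 1 else theta0
    for x_j, t_j in zip(x, theta):
        p *= t_j if x_j == 1 else 1 - t_j
    return p

# hypothetical parameter values and one toy feature vector
psi, theta0, theta1 = 0.4, [0.2, 0.7], [0.9, 0.5]
x = [1, 0]

joint = [joint_prob(psi, theta0, theta1, x, y) for y in (0, 1)]   # p(x, y) for y = 0, 1
posterior = [p / sum(joint) for p in joint]                       # p(y | x) = p(x, y) / p(x)
surprisal = [-math.log(p) for p in joint]                         # model surprisals

y_hat_argmax = max((0, 1), key=lambda y: posterior[y])            # argmax of the posterior
y_hat_argmin = min((0, 1), key=lambda y: surprisal[y])            # argmin of the surprisal
print(posterior, surprisal, y_hat_argmax, y_hat_argmin)
```

Here the joint probabilities are $0.036$ for $y=0$ and $0.18$ for $y=1$, so both rules predict $\hat{y}=1$.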


### Problem 5 --- Implementing the predictor function

But not only do we already have an implementation of the model surprisal function, we have a _vectorized_ implementation. This means that we may pass an _entire_ design matrix into our predictor function $h$ to generate predictions over an entire dataset in one line of code.

In the next code cell, please implement a vectorized predictor function using `I_model`.

In [None]:
# ENTER YOUR CODE IN THIS CELL

def h(X, parameters):
  m = len(X)                          # size of the dataset
  y_zeros = torch.zeros(size=(m,))    # a y-vector of all 0's
  y_ones = torch.ones(size=(m,))      # a y-vector of all 1's
  nonspam_surprisals = None           # <-- replace `None` with your own code
  spam_surprisals = None              # <-- replace `None` with your own code
  surprisals = torch.column_stack((nonspam_surprisals, spam_surprisals))
  return torch.argmin(surprisals, dim=1)

Before we test out our predictor function, we need to get the learned parameters from SGD. These are contained in the `parameters` attribute of the `gd_output` object produced by SGD. I will grab these parameters for you using "dictionary comprehension," and load them into a dictionary called `learned_parameters`. Run the next cell.

In [None]:
learned_parameters = {name: param[-1] for name, param in gd_output.parameters.items()}

Let's have a look at the learned parameters, just to satisfy our own curiosity. Run the next cell.

In [None]:
learned_parameters

So, those are the learned parameters discovered by SGD. Assuming SGD converged, they (approximately) minimize the cross entropy from the model distribution to the empirical distribution.
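
In case the dictionary comprehension used above to build `learned_parameters` looks mysterious, here is the same pattern on a toy dictionary. The name `history` and its contents are hypothetical. Plain lists stand in for tensors, with one entry per gradient step, on the assumption that the last entry is the most recent value. The actual structure of `gd_output.parameters` is described in the library docs.

```python
# hypothetical stand-in for a parameter history: one entry per gradient step
history = {
    'psi': [0.50, 0.42, 0.37],
    'theta0': [[0.5, 0.5], [0.3, 0.6], [0.2, 0.7]],
}

# keep only the last recorded value of each parameter
final = {name: values[-1] for name, values in history.items()}
print(final)
```

The comprehension walks over the `(name, values)` pairs of the dictionary and builds a new dictionary with the same keys, where each value is replaced by its final entry.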

### Problem 6 --- Checking classification metrics

Having written an implementation of the predictor function, let's check how well our model performs on the original dataset used for training (i.e., the _training set_). Note that, in the real world, we would not only check the performance on the training set, but also on _validation sets_ as described in the previous [programming assignment](https://github.com/jmyers7/stats-book-materials/blob/main/programming-assignments/assignment_12.ipynb).

In the next code cell, use your predictor function `h` to generate predictions on the data in the original design matrix `X`. What parameters are you passing into `h`?

In [None]:
# ENTER YOUR CODE IN THIS CELL

y_hat = None    # <-- replace `None` with your own code

Having generated the predictions and loaded them into the vector `y_hat`, we now want to compute the various classification metrics described in Problem 6 of the worksheet in the [previous chapter](https://github.com/jmyers7/stats-book-materials/blob/main/worksheets/12-models.pdf). Conveniently, these metrics are already implemented in the [TorchEval](https://pytorch.org/torcheval/stable/) library!

Run the next cell to import these metrics---take care to note which ones we are importing! For technical reasons, we also need to alter the data types of the `y` and `y_hat` vectors.

In [None]:
!pip install torcheval  # install the `torcheval` library
from torcheval.metrics.functional import binary_accuracy, binary_precision, binary_recall, binary_confusion_matrix

# cast the true label vector and predicted label vector to ints
y_hat = y_hat.to(torch.int64)
y = y.to(torch.int64)

In the next code cell, generate the confusion matrix for our spam filter. The format of the matrix output by TorchEval is:

$$
\begin{array}{c|c|c}
& \hat{y}=1 & \hat{y}=0 \\ \hline
y=1 & \text{TP} & \text{FN} \\
y=0 & \text{FP} & \text{TN}
\end{array}
$$

See [the docs](https://pytorch.org/torcheval/stable/generated/torcheval.metrics.functional.binary_confusion_matrix.html#torcheval.metrics.functional.binary_confusion_matrix) for the call signature of the `binary_confusion_matrix` function.



In [None]:
# ENTER YOUR CODE IN THIS CELL



Finally, compute the _accuracy_, _precision_, and _recall_ scores using the functions imported from TorchEval. (Search the docs for the call signatures of these functions. Google them!)
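
As a reminder of what these three numbers mean, here is a by-hand computation in plain Python on a tiny pair of label vectors. The labels are made up, and the code counts the four cells of the confusion matrix directly rather than calling TorchEval.

```python
# toy true labels and predictions (made up for illustration)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives

acc = (tp + tn) / len(pairs)    # fraction of all predictions that are correct
prec = tp / (tp + fp)           # fraction of predicted spam that really is spam
rec = tp / (tp + fn)            # fraction of actual spam that was caught
print(tp, fn, fp, tn, acc, prec, rec)
```

On this toy example the counts are TP $=3$, FN $=1$, FP $=2$, and TN $=2$, giving an accuracy of $0.625$, a precision of $0.6$, and a recall of $0.75$. A hand count like this is a handy cross-check on the library output for small inputs.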

In [None]:
# ENTER YOUR CODE IN THIS CELL

accuracy = None         # <-- replace `None` with your own code
precision = None        # <-- replace `None` with your own code
recall = None           # <-- replace `None` with your own code

print(f'accuracy:  {accuracy.item():0.4f}')
print(f'precision: {precision.item():0.4f}')
print(f'recall:    {recall.item():0.4f}')

Assuming that you successfully trained your model in Problem 3 so that the cross entropy is $\leq 3.5$, all three of the metrics in the previous cell should be $\approx 95\%$ or better. If that's what you see, then congrats! You're done! If not, then you need to return to Problem 3 and further tune the SGD hyperparameters in order to get scores on these metrics around $95\%$.