# Chapter 12: Probabilistic modeling with scikit-learn

Through the [current chapter](https://mml.johnmyersmath.com/stats-book/chapters/12-models.html) of the book, we have studied several examples of _probabilistic graphical models_ (*PGM*s). However, we must wait until the [next chapter](https://mml.johnmyersmath.com/stats-book/chapters/13-learning.html) before we are able to train the models from scratch.

But fortunately for us, the [scikit-learn](https://scikit-learn.org/stable/index.html) library in Python contains many implementations of probabilistic models that we may use _without_ knowing the underlying details on the training and fitting processes. The purpose of this programming assignment is to introduce scikit-learn.

## Directions

1. The programming assignment is organized into sequences of short problems. You can see the structure of the programming assignment by opening the "Table of Contents" along the left side of the notebook (if you are using Google Colab or Jupyter Lab).

2. Do not add any cells of your own to the notebook, or delete any existing cells (either code or markdown).

## Submission instructions

1. Once you have finished entering all your solutions, you will want to rerun all cells from scratch to ensure that everything works OK. To do this in Google Colab, click "Runtime -> Restart and run all" along the top of the notebook.

2. Now scroll back through your notebook and make sure that all code cells ran properly.

3. If everything looks OK, save your assignment and upload the `.ipynb` file at the provided link on the course <a href="https://github.com/jmyers7/stats-book-materials">GitHub repo</a>. Late submissions are not accepted.

4. You may submit multiple times, but I will only grade your last submission.

## Polynomial regression models

The first types of PGMs that we shall consider are straightforward generalizations of the linear regression models we studied [in the book](https://mml.johnmyersmath.com/stats-book/chapters/12-models.html#linear-regression-models). They are called _polynomial regression models_.

### Training and fitting

A _single-variable polynomial regression model of degree $d$_ is a probabilistic graphical model with underlying graph

<br>
<center>
<img src="https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/img/poly-reg.svg" width="300" align="center">
</center>
<br>

where $X$ and $Y$ are random variables. The parameters are:

* a real number $\beta_0 \in \mathbb{R}$,
* a vector $\boldsymbol{\beta} \in \mathbb{R}^d$,
* a positive real number $\sigma^2 >0$.

The link function at $Y$ is of the form

$$
Y \mid X = x; \beta_0,\boldsymbol{\beta},\sigma^2 \sim N(\mu,\sigma^2) \quad \text{where} \quad \mu = \beta_0 +\beta_1x + \cdots + \beta_dx^d,
$$

and where $\boldsymbol{\beta}^\intercal = (\beta_1,\beta_2,\ldots,\beta_d)$.

As you might imagine, a (single-variable) polynomial regression model is appropriate for datasets

$$
(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m) \in \mathbb{R}^2
$$

where we believe that

$$
y_i \approx \beta_0 + \beta_1 x_i + \cdots + \beta_d x_i^d
$$

for each $i=1,\ldots,m$, where $\beta_0,\beta_1,\ldots,\beta_d$ are fixed parameters.
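To make the link function concrete, here is a minimal sketch that *simulates* a dataset from a degree-$3$ polynomial regression model using NumPy. The parameter values, the seed, and the range of the $x$-values are all hypothetical choices of mine for illustration; they are not the ones behind the assignment data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # hypothetical seed, for reproducibility

# Hypothetical parameters for a degree-3 model.
beta0 = 0.5
beta = np.array([1.0, -2.0, 0.7])    # (beta_1, beta_2, beta_3)
sigma = 0.2                          # standard deviation, so sigma^2 = 0.04

# Draw m hypothetical x-values.
m = 32
x = rng.uniform(low=0.0, high=2.0, size=m)

# mu_i = beta_0 + beta_1 * x_i + beta_2 * x_i^2 + beta_3 * x_i^3
powers = np.arange(1, len(beta) + 1)
mu = beta0 + (x[:, None] ** powers) @ beta

# Y | X = x ~ N(mu, sigma^2)
y = rng.normal(loc=mu, scale=sigma)
```

Each simulated $y_i$ is the polynomial value $\mu_i$ plus Gaussian noise, which is exactly the "approximately equal" relationship described above.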

Let's look at an example. Run the next cell.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
!pip install "math_stats_ml>=0.0.14"
from math_stats_ml.autograders.assignment_12 import *

url = 'https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/data/data-12-1.csv'
df = pd.read_csv(url)

X = df['x'].to_numpy().reshape(-1, 1)
y = df['y'].to_numpy()
m = len(X)

plt.scatter(X, y, s=20)
plt.ylim(-2, 2)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.gcf().set_size_inches(w=5, h=4)
plt.title(f'$m={m}$ data points')
plt.tight_layout()

The data looks like it roughly falls along the graph of a polynomial, right?

Let's use scikit-learn to fit a polynomial regression model of degree $3$. We will accomplish this by first creating polynomial features from the original $x$-values. Specifically, suppose that the $x$-values in our original dataset make up the column of an $m\times 1$ matrix (i.e., a column vector) called the [_design matrix_](https://en.wikipedia.org/wiki/Design_matrix):

$$
\mathbf{X} = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_{m} \end{bmatrix}.
$$

scikit-learn contains a convenient utility to create "polynomial features" (of degree $d$) from a given design matrix $\mathbf{X}$, producing a new design matrix of the form

$$
\mathbf{X}_\text{poly} = \begin{bmatrix}
x_1 & x_1^2 & \cdots & x_1^d \\
x_2 & x_2^2 & \cdots & x_2^d \\
\vdots & \vdots & \ddots & \vdots \\
x_m & x_m^2 & \cdots & x_m^d
\end{bmatrix}.
$$

If you look in the code cell above, you'll notice that I already implemented the original design matrix as a NumPy array `X` of shape `(32, 1)`. Run the next cell to check it:

In [None]:
X.shape

To create polynomial features of degree $3$ for our polynomial regression model, we import [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) from the `preprocessing` submodule of scikit-learn, fit it to the original design matrix, and then transform the original design matrix to create `X_poly`:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Instantiate a PolynomialFeatures object. Set the degree to 3, and
# exclude the bias term from the polynomial features by setting the
# `include_bias` parameter to `False`.
pf = PolynomialFeatures(degree=3, include_bias=False)

# Fit to the original design matrix, then transform it.
X_poly = pf.fit_transform(X)

# Display the polynomial features.
X_poly

The array `X_poly` is a NumPy array of shape `(32, 3)`. You can check using a calculator that the rows really are of the form $x, x^2, x^3$.
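If you would rather not reach for a calculator, the same check can be sketched in NumPy. The small design matrix `X_toy` below is a made-up example (not the assignment data), chosen so the powers are easy to read off.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A small hypothetical design matrix of shape (4, 1).
X_toy = np.array([[0.5], [1.0], [2.0], [3.0]])

# Polynomial features of degree 3, without the bias column.
pf_toy = PolynomialFeatures(degree=3, include_bias=False)
X_toy_poly = pf_toy.fit_transform(X_toy)

# Build the columns x, x^2, x^3 by hand using broadcasting.
by_hand = X_toy ** np.array([1, 2, 3])

print(X_toy_poly)
print(np.allclose(X_toy_poly, by_hand))
```

The row for $x=2$ should read $2, 4, 8$, and the final line compares the whole array against the hand-built version.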

Now, all we need to do to fit (i.e., train) a degree-$3$ polynomial regression model is to fit a **linear** regression model on the polynomial features in `X_poly`. This is how we do it:

In [None]:
# Import `LinearRegression` from the `linear_model` submodule.
from sklearn.linear_model import LinearRegression

# Instantiate a `LinearRegression` object with default parameters.
model = LinearRegression()

# Fit the model to the polynomial features. The array `y` contains the original
# y-values in the dataset.
model.fit(X_poly, y)

# Get the learned coefficients through the `intercept_` and `coef_` attributes of
# the model.
beta0, beta = model.intercept_, model.coef_

# Print the coefficients.
print(f'beta_0 = {beta0:0.4f}\nbeta_1 = {beta[0]:0.4f}\nbeta_2 = {beta[1]:0.4f}\nbeta_3 = {beta[2]:0.4f}')
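As a sanity check on the idea that polynomial regression is just linear regression on polynomial features, here is a sketch that compares the scikit-learn route against NumPy's own least-squares polynomial fit, `np.polyfit`. The data is synthetic, and the seed and coefficients used to generate it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from a hypothetical cubic plus Gaussian noise.
rng = np.random.default_rng(seed=1)
x_toy = rng.uniform(-1.0, 2.0, size=30)
y_toy = 1.0 - 0.5 * x_toy + 0.25 * x_toy ** 3 + rng.normal(scale=0.1, size=30)

# scikit-learn route: polynomial features, then linear regression.
X_toy_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(
    x_toy.reshape(-1, 1)
)
lr_toy = LinearRegression().fit(X_toy_poly, y_toy)
sk_coefs = np.concatenate(([lr_toy.intercept_], lr_toy.coef_))

# NumPy route: `np.polyfit` returns the highest-degree coefficient first,
# so we reverse the array to get (beta_0, beta_1, beta_2, beta_3).
np_coefs = np.polyfit(x_toy, y_toy, deg=3)[::-1]

# Do the two routes agree (up to floating-point tolerance)?
print(np.allclose(sk_coefs, np_coefs))
```

Both routes solve the same least-squares problem, so the coefficients should agree up to rounding error.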

### Visualizing goodness of fit

We want to visualize the goodness of fit of the model! Using the learned coefficients, here's how we would do it:

In [None]:
# Define polynomial regression function using the learned coefficients.
def f3(x, beta0, beta):
  beta1 = beta[0]
  beta2 = beta[1]
  beta3 = beta[2]
  return beta0 + beta1 * x + beta2 * x ** 2 + beta3 * x ** 3

# Plot the data and the cubic polynomial.
grid = np.linspace(-0.5, 2.5, 200)
plt.scatter(X, y, s=20)
plt.plot(grid, f3(grid, beta0, beta), color='orange')
plt.ylim(-2, 2)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.gcf().set_size_inches(w=5, h=4)
plt.title('polynomial regression model of degree $3$')
plt.tight_layout()

That's a pretty good fit!

### Quantifying goodness of fit

We can also numerically judge the goodness of fit through the [_mean squared error_](https://en.wikipedia.org/wiki/Mean_squared_error) metric, defined as

$$
MSE(\mathbf{y},\hat{\mathbf{y}})  = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y}_i)^2,
$$

where $\hat{y}_i$ has the same meaning here that it does in [the book](https://mml.johnmyersmath.com/stats-book/chapters/12-models.html#linear-regression-models). It is the predicted $y$-value of the $i$-th instance in the dataset:

$$
\hat{y}_i = \beta_0 + \beta_1 x_i + \cdots + \beta_d x_i^d.
$$

The next code cell shows you how to compute the MSE:

In [None]:
# Import the metric from the `metrics` submodule.
from sklearn.metrics import mean_squared_error

# Get the predictions from the model.
y_hat = model.predict(X_poly)

# Compute the MSE.
mse = mean_squared_error(y, y_hat)
print(f'The mean squared error is {mse:0.4f}.')

As the name mean squared **error** suggests, the smaller the MSE, the better.
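To connect the formula with the scikit-learn function, here is a small sketch that computes the MSE by hand and through `mean_squared_error`. The two arrays are made-up toy values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted y-values.
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.0, 4.5])

# MSE straight from the definition: the average of the squared errors.
mse_by_hand = np.mean((y_true_toy - y_pred_toy) ** 2)

# MSE from scikit-learn.
mse_sklearn = mean_squared_error(y_true_toy, y_pred_toy)

print(mse_by_hand, mse_sklearn)
```

The squared errors here are $0.25$, $0$, $1$, and $0.25$, so both numbers come out to $1.5/4 = 0.375$.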

Let's recap the workflow to build and train a polynomial regression model:

1. Create a new design matrix `X_poly` from the original design matrix `X`.
2. Fit a linear regression model to the design matrix `X_poly` and the original $y$-values in the array `y`.
3. For a visual test for goodness of fit: Get the coefficients from the linear regression model, use them to define the polynomial regression function, and plot this function on top of a scatter plot of the data.
4. For a numerical test for goodness of fit: Get the predicted $y$-values by calling the `predict` method on the model. Pass the true $y$-values in `y` and the predicted $y$-values in `y_hat` into the `mean_squared_error` metric.

Wouldn't it be nice if we could combine steps (1) and (2) into a **single** object, so that we don't have to explicitly create polynomial features by hand? Conveniently, scikit-learn makes this possible through the `pipeline` submodule. Here's how:

In [None]:
# Import `Pipeline` from the `pipeline` submodule.
from sklearn.pipeline import Pipeline

# Instantiate `PolynomialFeatures` and `LinearRegression` objects.
pf = PolynomialFeatures(degree=3, include_bias=False)
lr = LinearRegression()

# Toss the `PolynomialFeatures` and `LinearRegression` objects into the
# pipeline constructor as a list of tuples. The first entry in the tuple
# is the name (or key) to the associated component of the pipeline.
model3 = Pipeline([('preprocessor', pf), ('linear regressor', lr)])

# We can now access the components of the pipeline by key. For example,
# the following line grabs the preprocessor from the pipeline.
model3['preprocessor']

In this context, the `PolynomialFeatures` object plays the role of a "data preprocessor." This is why it is called `preprocessor` in the pipeline. We saved the model into the variable `model3` to differentiate it from other models of different degrees that we will use later on.

We can now fit the pipeline model to the data using the **original** design matrix:

In [None]:
model3.fit(X, y)

# Get the predicted y-values from the pipeline model.
y_hat_pipeline = model3.predict(X)

Let's make sure the predicted $y$-values from the original model (with `X_poly` constructed by hand) match the predicted values from the pipeline model:

In [None]:
# Are the predictions equal?
np.array_equal(y_hat, y_hat_pipeline)
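One caveat about the check above: `np.array_equal` demands *exact* equality of floating-point numbers. If two computations that should agree ever differ by rounding error, the tolerance-based `np.allclose` is the usual alternative. Here is a tiny sketch of the difference, using a standard floating-point example rather than the model predictions:

```python
import numpy as np

# In floating-point arithmetic, 0.1 + 0.2 is not exactly 0.3.
a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

print(np.array_equal(a, b))  # exact, element-by-element comparison
print(np.allclose(a, b))     # comparison up to a small tolerance
```

The first comparison fails and the second succeeds, even though the two arrays are equal for all practical purposes.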

Finally, you can grab the coefficients $\beta_0$ and $\boldsymbol{\beta}$ from the pipeline model by first grabbing the linear regressor component from the pipeline and accessing the `intercept_` and `coef_` attributes just like before:

In [None]:
beta0, beta = model3['linear regressor'].intercept_, model3['linear regressor'].coef_

# Print the coefficients.
print(f'beta_0 = {beta0:0.4f}\nbeta_1 = {beta[0]:0.4f}\nbeta_2 = {beta[1]:0.4f}\nbeta_3 = {beta[2]:0.4f}')

#### Problem 1 --- Fitting a polynomial regression model

You now have everything you need in order to build polynomial regression models on your own through scikit-learn!

In this problem, I want you to construct and fit a polynomial regression model of degree $19$ to our dataset of $m=32$ points in the plane. In the next code cell, construct this model as a pipeline model using the template from above. To differentiate this model from the degree-$3$ one above, we will use the variable `model19`:

In [None]:
# ENTER YOUR CODE IN THIS CELL

pf = None         # <-- replace `None` with your own code
lr = None         # <-- replace `None` with your own code
model19 = None    # <-- replace `None` with your own code

Now, in the next code cell, fit `model19` to the data in the original design matrix `X` and the $y$-values in the array `y`. Then, get the learned coefficients $\beta_0$ and $\boldsymbol{\beta}$, and also the predicted $y$-values for later:

In [None]:
# ENTER YOUR CODE IN THIS CELL

None                        # <-- replace `None` with your own code
beta0_19, beta_19 = None    # <-- replace `None` with your own code
y_hat = None                # <-- replace `None` with your own code

In [None]:
# RUN THIS CELL TO CHECK YOUR ANSWERS

prob_check(answers=[beta0_19, beta_19, y_hat], prob_num=1)

#### Problem 2 --- Visualizing goodness of fit

We now want to plot the degree-$19$ polynomial regression function on top of a scatter plot to visualize goodness of fit. I've implemented the polynomial regression function for you with call signature `f19(x, beta0, beta)`. Run the next cell:

In [None]:
def f19(x, beta0, beta):
  return beta0 + sum([beta[k] * x ** (k + 1) for k in range(len(beta))])

In the next code cell, visualize the goodness of fit by plotting the polynomial regression function on the scatter plot. (_Hint_: Copy and paste code from above, making the "obvious" changes.)

In [None]:
# ENTER YOUR CODE IN THIS CELL


#### Problem 3 --- Quantifying goodness of fit

The plot you just produced shows the degree-$19$ model fits the data well in at least one sense: The regression function appears to pass directly through multiple data points. Let's quantify the fit, using the MSE. In the next code cell, compute the MSE for the degree-$19$ model:


In [None]:
# ENTER YOUR CODE IN THIS CELL.

mse = None      # <-- replace `None` with your own code

In [None]:
# RUN THIS CELL TO CHECK YOUR ANSWERS

prob_check(answers=[mse], prob_num=3)

Assuming you coded everything correctly, the MSE for the degree-$19$ model should be smaller than the MSE for the degree-$3$ model.

### Overfitting and underfitting

So, there are at least two ways in which the degree-$19$ model is "better" than the degree-$3$ model:

1. The degree-$19$ regression function passes **directly through** more data points than the degree-$3$ model.
2. The MSE for the degree-$19$ model is smaller than the MSE for the degree-$3$ model.

Does this mean that the degree-$19$ model is _truly_ better than the degree-$3$ model?

_Nope_.

Here's why: Very often, one of the main uses for regression models is to predict the $y$-values of future (i.e., new) data. In this context, the dataset used to fit the model is often called the _training data_. But the degree-$19$ model fits the training data so well that it is actually "overfitting" the data, which means essentially that it is learning the random variations in the data. We don't want this!

So, the degree-$19$ model is actually _worse_ than the degree-$3$ model, because it doesn't _generalize_ as well to new data. To drive home this point, let's suppose that we've been given new data. Run the following cell to import it.

In [None]:
url = 'https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/data/data-12-2.csv'
df = pd.read_csv(url)
X_new = df['x'].to_numpy().reshape(-1, 1)
y_new = df['y'].to_numpy()

Now, run the following cell to compare the degree-$3$ and degree-$19$ models on the new data:

In [None]:
_, axes = plt.subplots(ncols=2, figsize=(8, 4), sharex=True, sharey=True)
regression_functions = [f3, f19]
parameters = [(beta0, beta), (beta0_19, beta_19)]
models = [model3, model19]
degrees = [3, 19]

for f, betas, model, degree, axis in zip(regression_functions, parameters, models, degrees, axes):
  y_hat = model.predict(X_new)
  mse = mean_squared_error(y_new, y_hat)

  axis.scatter(X_new, y_new)
  axis.plot(grid, f(grid, *betas), color='orange')

  axis.set_ylim(-2, 2)
  axis.set_xlabel('$x$')
  axis.set_title(f'degree {degree}, MSE = {mse:0.4f}')

axes[0].set_ylabel('$y$')
plt.suptitle('degree-3 and degree-19 models on new data')
plt.tight_layout()

Notice that the MSE for the degree-$3$ model on the new data is roughly the same as its MSE on the training data, while the MSE for the degree-$19$ model on the new data is _way_ larger than its MSE on the training data. This is what overfitting looks like!

You should think of data as consisting of a combination of "signal" and "noise." A good model separates the signal from the noise. Models that overfit are learning too much noise.
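Here is a sketch of the signal-versus-noise idea on synthetic data, where we get to choose the signal ourselves. Everything in it is a hypothetical setup of mine: the signal $\sin(2x)$, the noise level, the seed, and the degrees $1$, $3$, and $9$. It fits a pipeline model of each degree to a training set and reports the MSE on the training set and on a fresh test set drawn from the same model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=2)  # hypothetical seed

def signal(x):
    # The hypothetical "signal" hidden in the data.
    return np.sin(2 * x)

def sample(m):
    # Draw m points: signal plus Gaussian "noise".
    x = rng.uniform(0.0, 2.0, size=m)
    return x.reshape(-1, 1), signal(x) + rng.normal(scale=0.3, size=m)

X_train, y_train = sample(30)
X_test, y_test = sample(30)

train_mse, test_mse = {}, {}
for degree in [1, 3, 9]:
    pipe = Pipeline([
        ('preprocessor', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear regressor', LinearRegression()),
    ])
    pipe.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, pipe.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, pipe.predict(X_test))
    print(f'degree {degree}: train MSE = {train_mse[degree]:0.4f}, '
          f'test MSE = {test_mse[degree]:0.4f}')
```

Because a higher-degree model can always reproduce a lower-degree one, the training MSE can only go down as the degree grows. The test MSE is under no such constraint, and the gap between the two is the thing to watch.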

The problem opposite to overfitting is, of course, called _underfitting_. This occurs when the model is so "rigid" that it isn't capable of learning the signal. In our example, this would occur for a linear regression model (i.e., a polynomial regression model of degree $1$). To see what underfitting looks like, the next code cell fits a linear regression model to the original data, then plots the regression line over top of the scatter plot of the _new_ data. It also computes the MSE on the new data.


In [None]:
model1 = LinearRegression()
model1.fit(X, y)
beta0_1, beta_1 = model1.intercept_, model1.coef_
y_hat = model1.predict(X_new)
mse = mean_squared_error(y_new, y_hat)

def f1(x, beta0, beta):
  return beta0 + beta * x

plt.scatter(X_new, y_new)
plt.plot(grid, f1(grid, beta0_1, beta_1), color='orange')
plt.ylim(-2, 2)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.gcf().set_size_inches(w=5, h=4)
plt.title(f'linear regression model\nMSE = {mse:0.4f}')
plt.tight_layout()

Indeed, this plot shows that the regression line is so "rigid" that it can't fit the evident pattern in the data. It's underfitting!



### Model validation

So, we want to avoid both overfitting and underfitting. With single-variable polynomial regression models, it's pretty easy to check whether a model is under- or overfitting based on a plot of the regression function. But you can imagine in other situations, in many more dimensions, it's not as easy to gauge under- and overfitting. One could just wait until new data arrives to check for these issues, but this isn't always feasible---your client wants the model *now*, not *later*!

All this business falls under the heading of _model validation_, which is the process through which the analyst aims to check whether their proposed model generalizes well to new data.

One popular way to validate a model is through _cross validation_. Here's how it works:

1. The **original** data is split into _training sets_ and _validation sets_.
2. The model is fit to the training set.
3. The fitted model is then used to make predictions on the validation set.
4. Various goodness-of-fit metrics (like MSE) are computed from the predictions on the validation set.
5. The validation metrics in (4) are used as a proxy for the model's ability to generalize to new data.
6. If needed, steps (1) through (4) are repeated $k$ times, splitting the data into different training and validation sets each time, and the validation metrics over each training/prediction run are averaged to get the final generalizability proxy.

For example, here's a diagram depicting *$4$-fold cross validation*:


<br>
<center>
<img src="https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/img/cv.svg" width="500" align="center">
</center>
<br>

First, the data is split into four subsets of equal size. In the first run, the model is trained on $3/4$-ths of the data, and validation metrics (like MSE) are computed from predictions on the remaining $1/4$-th of the data "held out" from the training process. Then, in the second run, a different subset of the data is selected as the validation set, and the model is trained on the remaining $3/4$-ths of the data. Validation metrics are computed on the "held out" validation set. And so on.

At the end of this process, we will have four values for the validation metric. These are often averaged to compute a final number to serve as a proxy for the model's ability to generalize.

Fortunately for us, scikit-learn automates this cross validation process! Here's how easy it is---run the following cell.

In [None]:
# Import `cross_val_score` from the `model_selection` submodule.
from sklearn.model_selection import cross_val_score

# Run 4-fold cross validation for the degree-3 model. Pay careful attention to the parameters!
cv_mse3 = -cross_val_score(model3, X, y, scoring='neg_mean_squared_error', cv=4)

# Print out the cross validation MSE's.
cv_mse3

The code in this cell computes the MSE's over a $4$-fold cross validation (that's the `cv=4` parameter) of our degree-$3$ polynomial regression model. Notice that the code prints out all four of the MSE's. Notice also that, with `scoring='neg_mean_squared_error'`, the `cross_val_score` function actually returns the **negative** MSE (scikit-learn scores follow the convention that larger is better), which accounts for the negative sign after the equal sign `=`.

Let's average the four MSE's to get our final proxy for generalizability:

In [None]:
cv_mse3.mean()

Now scroll back up and look at the _actual_ MSE on the new data for the degree-$3$ model. It's not _exactly_ equal to the cross validated MSE, but it's close!
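If you are curious what `cross_val_score` is doing for us, here is a sketch that carries out the $4$-fold procedure by hand with `KFold` and compares the result to `cross_val_score`. It uses a plain linear regression model on synthetic data; the data-generating line, the seed, and the noise level are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data from a hypothetical line plus Gaussian noise.
rng = np.random.default_rng(seed=3)
X_toy = rng.uniform(0.0, 2.0, size=(32, 1))
y_toy = 1.0 + 2.0 * X_toy[:, 0] + rng.normal(scale=0.2, size=32)

estimator = LinearRegression()

# Manual 4-fold cross validation.
manual_mse = []
for train_idx, val_idx in KFold(n_splits=4).split(X_toy):
    # Fit a fresh copy of the model on the training folds.
    fold_model = clone(estimator)
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    # Score the fitted model on the held-out validation fold.
    y_val_hat = fold_model.predict(X_toy[val_idx])
    manual_mse.append(mean_squared_error(y_toy[val_idx], y_val_hat))
manual_mse = np.array(manual_mse)

# The automated version.
auto_mse = -cross_val_score(estimator, X_toy, y_toy,
                            scoring='neg_mean_squared_error', cv=4)

print(np.allclose(manual_mse, auto_mse))
```

For a regression model, passing the integer `cv=4` splits the data into four consecutive blocks without shuffling, which is what `KFold(n_splits=4)` does by default. That is why the two sets of numbers should match.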

#### Problem 4 --- Validating the degree-19 and degree-1 models

In the next code cell, compute a $4$-fold cross validated MSE for the degree-$19$ model on the **original data** in the design matrix `X` and the array `y`.

In [None]:
# ENTER YOUR CODE IN THIS CELL

cv_mse19 = None         # <-- replace `None` with your own code
cv_mse19.mean()

If you coded everything correctly, the cross validated MSE should be _huge_, indicating the degree-$19$ model's ability to generalize is terrible.

Now compute the $4$-fold cross validated MSE for the degree-$1$ linear regression model:

In [None]:
# ENTER YOUR CODE IN THIS CELL

cv_mse1 = None          # <-- replace `None` with your own code
cv_mse1.mean()

In [None]:
# RUN THIS CELL TO CHECK YOUR ANSWERS

prob_check(answers=[cv_mse19, cv_mse1], prob_num=4)

As far as cross validated MSE's go, this analysis strongly suggests that the degree-$3$ model is better than the other two. The degree-$19$ model suffers from extreme overfitting, while the degree-$1$ model underfits.

## Naive Bayes models

The scikit-learn library is so well-designed that once you understand the workflow for one model in the library, you pretty much understand them all. The purpose of the rest of this assignment is to illustrate this point by building a spam classifier via a [_Naive Bayes model_](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

### A spam classifier

To begin, let's suppose that we have a collection of emails. We let $Y \sim Ber(\psi)$ be an indicator random variable ($\psi$ is a parameter---see below) with $Y=1$ corresponding to a spam email, and $Y=0$ indicating a non-spam email. The goal is to predict the value of $Y$ based on the presence of certain words in the email.

Let's suppose that we are on the lookout for only six words:

$$
\text{office, cash, vacation, meeting, credit, cat}.
$$

It is natural to define a $6$-dimensional random vector

$$
\mathbf{X}^\intercal = (X_1,X_2,X_3,X_4,X_5,X_6)
$$

where each component $X_j$ is an indicator random variable for the presence of these words, written in that order. So, for example, a value of $X_3=1$ means that a given email contains the word "vacation." We will try to predict the value of $Y$ based on the value of $\mathbf{X}$.

To do this, we will use a probabilistic graphical model of the form

<br>
<center>
<img src="https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/img/nb.svg" width="200" align="center">
</center>
<br>

where the parameters are given by

* a real number $\psi \in [0,1]$,
* two $6$-dimensional vectors $\boldsymbol{\theta}_0,\boldsymbol{\theta}_1 \in [0,1]^6$.

The real number $\psi$ parametrizes the distribution of $Y\sim Ber(\psi)$, while the link function at $\mathbf{X}$ is given by

$$
p(\mathbf{x} \mid y; \ \boldsymbol{\theta}_0,\boldsymbol{\theta}_1) = \prod_{j=1}^6 \phi_j^{x_j} (1-\phi_j)^{1-x_j} \quad \text{where} \quad \boldsymbol{\phi} = (1-y)\boldsymbol{\theta}_0 + y \boldsymbol{\theta}_1
$$

and $\boldsymbol{\phi}^\intercal = (\phi_1,\phi_2,\ldots,\phi_6)$. Thus, the components $X_j$ of the random vector $\mathbf{X}$ are conditionally independent (given $Y$) Bernoulli random variables, with

$$
X_j \mid Y \sim Ber(\phi_j), \quad j=1,2,\ldots,6.
$$

If we write

$$
\boldsymbol{\theta}_i^\intercal = (\theta_{i1},\theta_{i2},\ldots, \theta_{i6})
$$

for each $i=0,1$, then $\theta_{0j}$ is the probability that the $j$-th word appears in the email, given that it is _not_ a spam email (i.e., $y=0$), while $\theta_{1j}$ is the probability that it appears, given that it _is_ a spam email (i.e., $y=1$). This is an example of a _Naive Bayes model_. The first part of the name, "Naive," comes from the assumption of conditional independence of the word indicator random variables $X_j$. This is a "naive" assumption, because it is clearly false in the real world.
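To see the model as a recipe for generating emails, here is a sketch that samples from it with NumPy. The values of $\psi$, $\boldsymbol{\theta}_0$, and $\boldsymbol{\theta}_1$ below are hypothetical numbers I made up; they are not learned from any data.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # hypothetical seed

# Hypothetical parameters. Word order: office, cash, vacation, meeting, credit, cat.
psi = 0.4                                               # P(Y = 1), i.e., spam
theta0 = np.array([0.6, 0.1, 0.2, 0.7, 0.1, 0.3])       # P(X_j = 1 | Y = 0)
theta1 = np.array([0.2, 0.8, 0.6, 0.1, 0.7, 0.3])       # P(X_j = 1 | Y = 1)

m = 5

# First sample the labels: Y ~ Ber(psi).
y_sim = rng.binomial(n=1, p=psi, size=m)

# For each email, phi = theta_1 if y = 1, and phi = theta_0 if y = 0.
phi = np.where(y_sim[:, None] == 1, theta1, theta0)     # shape (m, 6)

# Then sample the word indicators independently: X_j | Y ~ Ber(phi_j).
X_sim = rng.binomial(n=1, p=phi)                        # shape (m, 6)

print(y_sim)
print(X_sim)
```

The "naive" assumption shows up in the last sampling step: given the label, each of the six word indicators is drawn independently of the others.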

Once all these parameters have been learned from training data, we may predict whether a given email is spam based on the presence of the six words by examining the two conditional probabilities

$$
p(y =0 \mid \mathbf{x}) \quad \text{and} \quad p(y=1 \mid \mathbf{x}),
$$


where we've dropped the parameters from the notation for simplicity. If the first probability is larger than the second, we predict non-spam (i.e., $y=0$); otherwise, we predict spam (i.e., $y=1$). In symbols, if we write $\hat{y}$ for our prediction of $y$, we have

$$
\hat{y} = \operatorname*{argmax}_{y\in \{0,1\}} p(y \mid \mathbf{x}),
$$

where the "$\operatorname{argmax}$" operator returns the value of $y$ that maximizes the function to its right.

These conditional probabilities may be computed using (you guessed it) Bayes' theorem:

$$
p(y \mid \mathbf{x}) = \frac{p(y) p(\mathbf{x} \mid y)}{p(\mathbf{x})}.
$$

But if all we are after is to determine which of the two probabilities above is larger, then we may drop the probability $p(\mathbf{x})$ from Bayes' theorem (since it does not depend on $y$) and write

$$
p(y \mid \mathbf{x}) \propto p(y) p(\mathbf{x} \mid y) = p(\mathbf{x},y).
$$

Then, our prediction is given by

$$
\hat{y} = \operatorname*{argmax}_{y\in \{0,1\}} \left[p(y) p(\mathbf{x} \mid y)\right] = \operatorname*{argmax}_{y\in \{0,1\}} \left[p(\mathbf{x}, y)\right].
$$
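Here is a sketch of this prediction rule carried out by hand for a single email. The parameter values and the observed vector $\mathbf{x}$ are hypothetical; in practice the parameters would be learned from training data.

```python
import numpy as np

# Hypothetical parameters. Word order: office, cash, vacation, meeting, credit, cat.
psi = 0.4                                               # P(Y = 1), i.e., spam
theta0 = np.array([0.6, 0.1, 0.2, 0.7, 0.1, 0.3])       # P(X_j = 1 | Y = 0)
theta1 = np.array([0.2, 0.8, 0.6, 0.1, 0.7, 0.3])       # P(X_j = 1 | Y = 1)

# A hypothetical email containing "office," "cash," "vacation," and "credit."
x = np.array([1, 1, 1, 0, 1, 0])

def joint(x, y):
    # p(x, y) = p(y) * p(x | y), with p(x | y) a product of Bernoulli terms.
    prior = psi if y == 1 else 1 - psi
    phi = theta1 if y == 1 else theta0
    likelihood = np.prod(phi ** x * (1 - phi) ** (1 - x))
    return prior * likelihood

# Joint probabilities for y = 0 and y = 1, in that order.
joints = np.array([joint(x, 0), joint(x, 1)])

# The prediction is the y with the larger joint probability.
y_hat = int(np.argmax(joints))

# Dividing by p(x) = p(x, 0) + p(x, 1) recovers the conditional probabilities.
posterior = joints / joints.sum()

print(joints, posterior, y_hat)
```

With these made-up numbers, the joint probability for $y=1$ is much larger than the one for $y=0$, so the prediction is $\hat{y}=1$ (spam). Notice that we never needed $p(\mathbf{x})$ to make the prediction, only to turn the joint probabilities into conditional ones.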

If this seems confusing, don't worry too much right now, because luckily scikit-learn takes care of a lot of the details for us. To see how, let's import some data on emails:

In [None]:
url = 'https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/data/data-12-3.csv'
df = pd.read_csv(url)
df

From the printout, we see that we have data on 512 emails. The first row in the dataframe shows that the first email is spam, and it contains the words "office," "cash," "vacation," and "credit."

Let's pull out the $x_j$'s and the $y$'s:

In [None]:
X = df[['x1', 'x2', 'x3', 'x4', 'x5', 'x6']].to_numpy()
y = df['y'].to_numpy()

#### Problem 5 --- Fitting our spam classifier

In the next code cell, you will implement our spam classifier as an object from [`BernoulliNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html) in scikit-learn. In it, instantiate the model with the default parameters passed into `BernoulliNB`.

In [None]:
# ENTER YOUR CODE IN THIS CELL

from sklearn.naive_bayes import BernoulliNB

model = None      # <-- replace `None` with your own code

#### Problem 6 --- Validating the spam classifier

Let's validate our spam classifier, to see if it generalizes well to new emails. In the next code cell, compute $6$-fold cross validated accuracy scores for the model. (_Hint:_ We talked about the _accuracy_ metric in the worksheet for this chapter. You'll need to set the `scoring` parameter to `'accuracy'`.)

In [None]:
# ENTER YOUR CODE IN THIS CELL

cv_accuracy = None        # <-- replace `None` with your own code
cv_accuracy

Assuming your code is correct, you should see a NumPy array with the six accuracy scores for each of the six training/validation runs. In the next code cell, take the average of these scores to get the final proxy for the model's generalizability:

In [None]:
# ENTER YOUR CODE IN THIS CELL

accuracy_mean = None        # <-- replace `None` with your own code
accuracy_mean

In [None]:
# RUN THIS CELL TO CHECK YOUR ANSWERS

prob_check(answers=[cv_accuracy, accuracy_mean], prob_num=6)

If your code is correct, you should see a number pretty close to 1. This means that our spam classifier should generalize well to new emails!