In [None]:
# Run this cell to get everything set up.
from lec_utils import *
import lec23_util as util
diabetes = pd.read_csv('data/diabetes.csv')
from sklearn.model_selection import train_test_split
diabetes = diabetes[(diabetes['Glucose'] > 0) & (diabetes['BMI'] > 0)]
X_train, X_test, y_train, y_test = (
    train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1)
)


<div class="alert alert-info" markdown="1">

#### Lecture 23

# Logistic Regression, Continued

### EECS 398: Practical Data Science, Spring 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/sp25">github.com/practicaldsc/sp25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/78535/discussion/6647877) </small>
    
</div>


### Agenda 📆

- Recap: Logistic regression.
- Choosing a threshold.
- Linear separability.
- Softmax regression.

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
Remember that you can always ask questions anonymously at the link above!

</div>

## Recap: Logistic regression

---

### Logistic regression

- Logistic **regression** is a linear **classification** technique that builds upon linear regression.

- It models **the probability of belonging to class 1, given a feature vector**:
    
$$P(y_i = 1 | \vec{x}_i) = \sigma (w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)}) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$$   

- Suppose we train a logistic regression model to predict the probability a patient has diabetes ($y = 1$) given their `'Glucose'` and `'BMI'`.<br>If our optimal parameters end up being $\vec{w}^* = \begin{bmatrix} -7.85 & 0.04 & 0.08 \end{bmatrix}^T$, we then predict probabilities using:

$$P(y_i = 1 | \text{Glucose}_i, \text{BMI}_i) = \sigma(-7.85 + 0.04 \cdot \text{Glucose}_i + 0.08 \cdot \text{BMI}_i)$$

- To find the optimal parameters $\vec{w}^*$, we minimize mean **cross-entropy loss**:
<br><small>There's no closed-form solution for $\vec{w}^*$, so we use some numerical method (or, rather, `sklearn` does).</small>

\begin{align*}R_\text{ce}(\vec{w}) &= - \frac{1}{n} \sum_{i = 1}^n \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right) \\ &= - \frac{1}{n} \sum_{i = 1}^n \left[ y_i \log \left( \sigma \left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right)  + (1 - y_i) \log \left(1 - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)\right) \right]\end{align*}
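
The loss above can be evaluated by hand for any candidate $\vec{w}$. Here is a minimal sketch, assuming a tiny made-up dataset (not the real diabetes data) and two hypothetical helper functions, `sigmoid` and `mean_cross_entropy`:

```python
import numpy as np

def sigmoid(t):
    """Logistic function, sigma(t) = 1 / (1 + e^{-t})."""
    return 1 / (1 + np.exp(-t))

def mean_cross_entropy(w, X, y):
    """Mean cross-entropy loss for parameters w = [w0, w1, ..., wd]."""
    X_aug = np.column_stack([np.ones(len(X)), X])  # Aug(x_i): prepend a 1 for the intercept.
    p = sigmoid(X_aug @ w)                         # p_i = P(y_i = 1 | x_i).
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up (Glucose, BMI) values and labels, purely for illustration.
X_toy = np.array([[100, 22], [150, 25], [180, 35], [90, 30]])
y_toy = np.array([0, 0, 1, 0])
w_example = np.array([-7.85, 0.04, 0.08])  # The parameters quoted above.

loss_example = mean_cross_entropy(w_example, X_toy, y_toy)
loss_zero = mean_cross_entropy(np.zeros(3), X_toy, y_toy)  # Every p_i is 0.5, so the loss is log(2).
```

On this toy data, `w_example` gives a lower loss than the all-zeros vector, which is what we'd hope for from parameters fit to similar data.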

### `LogisticRegression` in `sklearn`

- To illustrate, let's re-fit a model to predict diabetes from `'Glucose'` and `'BMI'` in `sklearn`.

In [None]:
from sklearn.linear_model import LogisticRegression
model_logistic_multiple = LogisticRegression()
model_logistic_multiple.fit(X_train, y_train)

- By default, the `predict` method of a fit `LogisticRegression` model predicts a **class**; it applies a threshold $T = 0.5$ to the predicted probability.

In [None]:
model_logistic_multiple.predict(pd.DataFrame([{
    'Glucose': 150,
    'BMI': 25,
}]))

- We can access the predicted **probabilities** using the `predict_proba` method.

In [None]:
model_logistic_multiple.predict_proba(pd.DataFrame([{
    'Glucose': 150,
    'BMI': 25,
}]))

### The decision boundary in the feature space

- After choosing $T = 0.5$, what does the resulting <b><span style="color:purple">decision boundary</span></b> look like, in a $d = 2$ dimensional plot?

In [None]:
util.show_decision_boundary(model_logistic_multiple, X_train, y_train, title='Logistic Regression Decision Boundary (T = 0.5)')

- Note that unlike the decision boundaries for $k$-Nearest Neighbors and decision trees, this decision boundary is **linear**. Specifically, it is the line:

$$\sigma(-7.85 + 0.04 \cdot \text{Glucose}_i + 0.08 \cdot \text{BMI}_i) = 0.5$$

- **Important**: Since $\sigma(0) = 0.5$, we can write the above as:

$$-7.85 + 0.04 \cdot \text{Glucose}_i + 0.08 \cdot \text{BMI}_i = 0$$
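
To draw this line ourselves, we can solve the equation above for BMI in terms of Glucose. A short sketch, using the rounded parameters quoted above (the function name `boundary_bmi` is hypothetical):

```python
import numpy as np

w0, w1, w2 = -7.85, 0.04, 0.08  # Rounded parameters quoted above.

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def boundary_bmi(glucose):
    """BMI on the T = 0.5 decision boundary, from w0 + w1 * Glucose + w2 * BMI = 0."""
    return -(w0 + w1 * glucose) / w2

glucose_values = np.array([100, 125, 150])
bmi_on_boundary = boundary_bmi(glucose_values)

# Every point on the line should have a predicted probability of 0.5,
# up to floating point error.
probs_on_boundary = sigmoid(w0 + w1 * glucose_values + w2 * bmi_on_boundary)
```

With these parameters, the line has slope $-\frac{w_1}{w_2} = -0.5$: each extra unit of Glucose lowers the BMI needed to cross the boundary by half a unit.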

<div class="alert alert-warning">

<h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
Which expression describes the **odds**, $$\frac{P(y_i = 1 | \vec{x}_i)}{P(y_i = 0 | \vec{x}_i)}$$
    
in the logistic regression model?
    
- A. $\vec{w} \cdot \text{Aug}(\vec{x}_i)$
- B. $-\vec{w} \cdot \text{Aug}(\vec{x}_i)$
- C. $e^{\vec{w} \cdot \text{Aug}(\vec{x}_i)}$
- D. $\sigma(\vec{w} \cdot \text{Aug}(\vec{x}_i))$
- E. None of the above.
    
</div>

<div class="alert alert-warning">

<h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
Which expression describes $P(y_i = \mathbf{0} | \vec{x}_i)$ in the logistic regression model?
    
- A. $\sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$
- B. $-\sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$
- C. $\sigma\left(- \vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$
- D. $1 - \log \left( 1 + e^{\vec{w} \cdot \text{Aug}(\vec{x}_i)} \right)$
- E. $1 + \log \left( 1 + e^{- \vec{w} \cdot \text{Aug}(\vec{x}_i)} \right)$
    
</div>

## Choosing a threshold

---

### Thresholding

- As we've seen, in order to classify $\vec{x}_i$ as either yes ($y_i = 1$) or no ($y_i = 0$), we apply a **threshold** $T$ to the predicted probability.

<center><img src="imgs/threshold.svg" width=600><small>With a threshold of $T = 0.6$, a predicted probability of 0.68 is classified as <span style="color:blue">yes diabetes (class 1)</span>,<br>and a predicted probability of 0.55 is classified as <span style="color:orange">no diabetes (class 0)</span>.</small></center>

- More generally, if we pick a threshold of $T$, then any feature vector $\vec{x}_i$ such that:

    $$\sigma(\vec{w}^* \cdot \text{Aug}(\vec{x}_i)) \geq T$$ 

    is classified as class 1.

- **Question**: How do we choose the "right" threshold?

- `sklearn`'s default threshold of $T = 0.5$ is **not** guaranteed to yield the highest **accuracy**!<br><small>Remember, to find $\vec{w}^*$, we minimized mean cross-entropy loss (that is, we didn't "maximize" accuracy), and mean cross-entropy loss doesn't involve our threshold.</small>

### Choosing a custom threshold

- If we want to use a custom threshold, we'll need to implement the logic ourselves.

<center><img src="imgs/threshold.svg" width=300></center>

In [None]:
def predict_thresholded(X, T):
    '''Calls model_logistic_multiple.predict_proba.
       For each P(y_i = 1 | x_i), returns 1 if >= T and 0 if < T.'''
    probs = model_logistic_multiple.predict_proba(X)[:, 1]
    return (probs >= T).astype(int)

- Now, we can choose any threshold we'd like, and compute the accuracy of the resulting predictions.

In [None]:
predict_thresholded(pd.DataFrame([{'Glucose': 150, 'BMI': 25}]), 0.5)

In [None]:
predict_thresholded(pd.DataFrame([{'Glucose': 150, 'BMI': 25}]), 0.4)

In [None]:
predict_thresholded(X_train, 0.4)

In [None]:
# Training accuracy for the threshold T = 0.4.
(predict_thresholded(X_train, 0.4) == y_train).mean()

### Accuracy vs. threshold

- Accuracy is defined as:

$$\text{accuracy} = \frac{\text{# points classified correctly}}{\text{# points}} = \frac{TP + TN}{TP + FP + FN + TN}$$

- How does the model's **training** accuracy change as the threshold changes?<br><small>Note that we'd see a similar trend with test accuracy, too.</small>

In [None]:
util.plot_vs_threshold(X_train, y_train, 'Accuracy')

- The threshold with the best training accuracy (among the thresholds we tried) is $T = 0.465$, which has a training accuracy of 77.3\%.

- Remember that 64\% of people in the training set don't have diabetes, so we can achieve a 64\% training accuracy just by always predicting "no diabetes"! This means that a good model's accuracy should be much higher than 64\%.

In [None]:
pd.Series(y_train).value_counts(normalize=True)
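
A threshold sweep like the one plotted above can be sketched in a few lines. This is only an illustration on made-up probabilities and labels; it is not necessarily how `util.plot_vs_threshold` works.

```python
import numpy as np

# Hypothetical predicted probabilities P(y_i = 1 | x_i) and true labels.
probs = np.array([0.052, 0.198, 0.347, 0.451, 0.483, 0.554, 0.702, 0.913])
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Try 101 evenly spaced thresholds and record the accuracy of each.
thresholds = np.linspace(0, 1, 101)
accuracies = np.array([((probs >= T).astype(int) == y_true).mean() for T in thresholds])

best_T = thresholds[accuracies.argmax()]  # First threshold achieving the best accuracy.

# Accuracy of always predicting the majority class, as a baseline.
baseline = max(y_true.mean(), 1 - y_true.mean())
```

Note that `argmax` returns the first best threshold; several neighboring thresholds typically tie, since accuracy only changes when $T$ crosses one of the predicted probabilities.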

### Metrics for binary classification

- A few lectures ago, we introduced other metrics for measuring the quality of a binary classifier's predictions.

$$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$

<center><small>Here, a false positive ($FP$) is when we predict that someone has diabetes when they do not.</small></center>

$$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$

<center><small>Here, a false negative ($FN$) is when we predict that someone does not have diabetes, when they really do.</small></center>

- A binary classifier's **confusion matrix** displays its number of true positives ($TP$), false positives ($FP$), true negatives ($TN$), and false negatives ($FN$).

In [None]:
util.show_confusion(X_train, y_train, T=0.5)

- Remember, we're predicting whether or not patients have diabetes. **Which is worse: a false positive or a false negative?**

Observe how the values in the confusion matrix change as the threshold changes!

In [None]:
interact(lambda T: util.show_confusion(X_train, y_train, T), T=(0, 1, 0.01));
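
The four counts in a confusion matrix, and the metrics built from them, can be tallied directly with boolean masks. A minimal sketch, assuming made-up labels and predictions rather than the diabetes data:

```python
import numpy as np

# Hypothetical true labels and thresholded predictions (1 = diabetes, 0 = no diabetes).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])

TP = int(((y_pred == 1) & (y_true == 1)).sum())  # Predicted diabetes, has diabetes.
FP = int(((y_pred == 1) & (y_true == 0)).sum())  # Predicted diabetes, doesn't have it.
TN = int(((y_pred == 0) & (y_true == 0)).sum())  # Predicted no diabetes, doesn't have it.
FN = int(((y_pred == 0) & (y_true == 1)).sum())  # Predicted no diabetes, has diabetes.

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
```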

### Precision vs. threshold

- Precision is defined as:

    $$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$
    
    Here, a false positive ($FP$) is when we predict that someone has diabetes when they do not.

- How does the model's training **precision** change as the threshold changes?

In [None]:
util.plot_vs_threshold(X_train, y_train, 'Precision')

- If the "bar" is higher to predict 1, then we will have fewer positives in general, and thus fewer false positives.

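- As the **threshold increases** ⬆️, the denominator in $\text{precision} = \frac{TP}{TP + FP}$ decreases. The points that remain predicted positive are those with the highest predicted probabilities, which are more likely to be true positives, and so **precision tends to increase** ⬆️.<br><small>There are some cases where a slightly higher threshold led to a slightly lower precision; why?</small>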

### Recall vs. threshold

- Recall is defined as:

    $$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$
    
    Here, a false negative ($FN$) is when we predict that someone does not have diabetes, when they really do.

- How does the model's training **recall** change as the threshold changes?

In [None]:
util.plot_vs_threshold(X_train, y_train, 'Recall')

- Note that the denominator in $\text{recall} = \frac{TP}{\text{# actually positive}}$ is constant. As the **threshold increases** ⬆️:
    - true positives get converted to false negatives, so
    - the numerator of recall ($TP$) decreases, and so
    - **recall decreases** ⬇️.
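
To see both trends at once, here is a small sketch on made-up probabilities and labels (the helper `precision_recall` is hypothetical). Recall never goes up as the threshold increases, while precision mostly rises but can dip.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, sorted by probability.
probs = np.array([0.10, 0.30, 0.45, 0.60, 0.65, 0.80, 0.95])
y_true = np.array([0, 0, 1, 0, 1, 1, 1])

def precision_recall(probs, y_true, T):
    """Precision and recall of the predictions made with threshold T."""
    y_pred = (probs >= T).astype(int)
    TP = ((y_pred == 1) & (y_true == 1)).sum()
    FP = ((y_pred == 1) & (y_true == 0)).sum()
    FN = ((y_pred == 0) & (y_true == 1)).sum()
    return TP / (TP + FP), TP / (TP + FN)

thresholds = [0.2, 0.4, 0.5, 0.62, 0.7]
results = [precision_recall(probs, y_true, T) for T in thresholds]
precisions = [p for p, r in results]
recalls = [r for p, r in results]
```

Moving from $T = 0.4$ to $T = 0.5$ here turns one true positive into a false negative without removing any false positives, so precision drops from $\frac{4}{5}$ to $\frac{3}{4}$ even though the threshold went up.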

### Precision vs. recall

- We can visualize how precision and recall vary **together**.

In [None]:
util.pr_curve(X_train, y_train)

- The curve above is called a **PR curve**.

- **Question**: Given the information above, what threshold would you choose?

- **Answer**: The threshold whose point is closest to the **top right corner** of the plot above. <br><small>Why? The top right corner is where precision = 1 and recall = 1, and we want both to be high.</small>
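
One way to make "closest to the top right corner" concrete is to compute the Euclidean distance from each point on the curve to $(1, 1)$. The values below are made up for illustration, and this rule treats precision and recall as equally important, which is an assumption rather than a requirement.

```python
import numpy as np

# Hypothetical thresholds, with the precision and recall each one produces.
thresholds = np.array([0.2, 0.4, 0.5, 0.62, 0.7])
precisions = np.array([0.67, 0.80, 0.75, 1.00, 1.00])
recalls = np.array([1.00, 1.00, 0.75, 0.75, 0.50])

# Distance from each (recall, precision) point to the top right corner, (1, 1).
distances = np.sqrt((1 - recalls) ** 2 + (1 - precisions) ** 2)
best_T = thresholds[distances.argmin()]
```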

### ROC curves

- A more popular variant of the PR curve is the **ROC curve**.<br><small>ROC stands for "receiver operating characteristic."<br>See [**here**](https://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves) for a good discussion on the differences between PR curves and ROC curves.</small>

- A ROC curve plots true positive rate (TPR) vs. false positive rate (FPR) for all possible thresholds, where:

$$\underbrace{\text{true positive rate (TPR)} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN} = \text{recall}}_\text{we want this to be close to 1!}$$

$$\underbrace{\text{false positive rate (FPR)} = \frac{FP}{\text{# actually negative}} = \frac{FP}{FP + TN}}_\text{we want this to be close to 0!}$$

The ROC curve for our classifier looks like:

In [None]:
util.draw_roc_curve(X_train, y_train)

- If we care about TPR and FPR equally, the best threshold is the one whose point is closest to the **top left corner** in the plot above.<br><small>Why? The top left corner is where $TPR = 1$ and $FPR = 0$, and we want $TPR$ to be high and $FPR$ to be low.</small>

- A common metric for the quality of a binary classifier is the **area under curve (AUC)** for the ROC curve.<br><small>Larger values are better!</small>
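
The points on a ROC curve, and the area under it, can be computed by hand. Below is a minimal sketch on made-up probabilities and labels; the helper `tpr_fpr` is hypothetical, and the area is computed with the trapezoid rule written out explicitly.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels.
probs = np.array([0.10, 0.30, 0.45, 0.60, 0.65, 0.80, 0.95])
y_true = np.array([0, 0, 1, 0, 1, 1, 1])

def tpr_fpr(T):
    """True positive rate and false positive rate at threshold T."""
    y_pred = probs >= T
    TPR = (y_pred & (y_true == 1)).sum() / (y_true == 1).sum()
    FPR = (y_pred & (y_true == 0)).sum() / (y_true == 0).sum()
    return TPR, FPR

# Sweep thresholds from high to low, so the curve runs from (0, 0) to (1, 1).
thresholds = np.concatenate([[np.inf], np.sort(probs)[::-1]])
points = np.array([tpr_fpr(T) for T in thresholds])
tpr, fpr = points[:, 0], points[:, 1]

# Area under the curve: sum of trapezoid areas between consecutive points.
auc = np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)

# Threshold whose point is closest to the top left corner, (FPR, TPR) = (0, 1).
best_T = thresholds[np.sqrt((1 - tpr) ** 2 + fpr ** 2).argmin()]
```

On this toy data, the AUC equals the fraction of (positive, negative) pairs in which the positive point receives the higher predicted probability, which is one common way to interpret it.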

<div class="alert alert-warning">

<h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
What questions do you have about thresholds and logistic regression?
    
</div>

## Linear separability

---

### Feature space

- Suppose we're using $d$ features as inputs to our classifier. Consider a visualization of the features in $d$-dimensional space.

- Example: $d = 1$.

In [None]:
util.show_one_feature_plot_in_1D(X_train, y_train, thres=False)

- Example: $d = 2$.

In [None]:
util.create_base_scatter(X_train, y_train)

- Note that in both plots above, there are <span style="color:orange">orange points</span> mixed in with the <span style="color:blue">blue points</span>!

### Linear separability

- A dataset is **linearly separable** if a line, plane, or hyperplane can be drawn in $d$-dimensional space that **perfectly separates** the two classes.

- Example: $d = 1$.

In [None]:
util.lin_sep_1D()

In [None]:
util.non_lin_sep_1D()

- Example: $d = 2$.

In [None]:
util.lin_sep_2D()

In [None]:
util.non_lin_sep_2D()

- Why is the dataset below **not** linearly separable?

In [None]:
util.bad_example_1D()

### Linear separability and decision boundaries

- By definition, if a dataset is linearly separable, then there exists a **<span style="color:purple">linear decision boundary</span>** that achieves 100\% training accuracy.

In [None]:
util.lin_sep_1D()

- Above, any value of $c$ in $(120, 150)$ would make the <b><span style="color:purple">decision boundary</span></b> $$\text{Glucose} = c$$
achieve 100% training accuracy.

- **Question**: How do we find this decision boundary?
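
With a single feature, we can check for separability directly by comparing the largest value in one class to the smallest value in the other. A sketch on hypothetical glucose values, chosen to match the $(120, 150)$ gap described above; it only covers the orientation where class 1 sits at higher values.

```python
import numpy as np

# Hypothetical glucose values, chosen so that the two classes do not overlap.
glucose = np.array([85, 100, 110, 120, 150, 160, 175, 190])
outcome = np.array([0, 0, 0, 0, 1, 1, 1, 1])

largest_class_0 = glucose[outcome == 0].max()
smallest_class_1 = glucose[outcome == 1].min()

# If every class 0 value lies below every class 1 value, any c strictly
# between the two extremes separates the classes perfectly.
separable = largest_class_0 < smallest_class_1

c = (largest_class_0 + smallest_class_1) / 2  # One valid choice: the midpoint.
train_accuracy = ((glucose >= c).astype(int) == outcome).mean()
```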

### Logistic regression and linear separability

- Logistic regression, **without regularization**, **fails to converge** on linearly separable data!

- Let's re-draw the plot below, but with diabetes status drawn on the $y$-axis.

In [None]:
util.lin_sep_1D()

- Why would the optimal $w_1^*$ below tend to $\infty$?<br><small>See the annotated slides for more details.</small>

$$P(y_i = 1 | \text{Glucose}_i) = \sigma(w_0 + w_1 \cdot \text{Glucose}_i) = \frac{1}{1 + e^{-(w_0 + w_1 \cdot \text{Glucose}_i)}}$$

In [None]:
util.lin_sep_1D_elevated()

- To prevent this case, logistic regression should generally be regularized.<br><small>This is exactly why `sklearn` regularizes logistic regression by default.</small>
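
To see the problem numerically, we can fix the decision boundary and make the sigmoid steeper and steeper. The sketch below uses hypothetical separable data and computes the loss with `np.logaddexp`, since $-\log \sigma(t) = \log(1 + e^{-t})$; this avoids taking the log of a probability that rounds to exactly 0.

```python
import numpy as np

def mean_cross_entropy(w0, w1, x, y):
    """Mean cross-entropy loss, computed stably from the linear scores."""
    t = w0 + w1 * x
    # -log(sigma(t)) = log(1 + e^{-t}) and -log(1 - sigma(t)) = log(1 + e^{t}).
    return np.mean(y * np.logaddexp(0, -t) + (1 - y) * np.logaddexp(0, t))

# Hypothetical linearly separable data: class 0 below 135, class 1 above.
glucose = np.array([85, 100, 110, 120, 150, 160, 175, 190])
outcome = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Keep the boundary at Glucose = 135 (so w0 = -135 * w1) and increase the slope.
slopes = [0.01, 0.1, 1, 10]
losses = [mean_cross_entropy(-135 * w1, w1, glucose, outcome) for w1 in slopes]
```

Each larger slope gives a strictly smaller loss, but the loss never reaches 0. That means no finite $w_1^*$ minimizes it, which is why an unregularized fit keeps pushing $w_1$ upward.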

## Logistic regression for multiclass classification

---

### From binary to multiclass classification

- In binary classification, there are only two possible classes, typically either 0 or 1.

$$y_i \in \{0, 1\}$$

- In multiclass classification, there can be any finite number of classes, or **labels**. They need not be numbers, either.

$$y_i \in \{ \text{Adelie}, \text{Chinstrap}, \text{Gentoo} \}$$

- **Important**: Let $C$ be the set of possible classes for our classification problem, and let $|C|$ be the number of classes total.

### Loading the data 🐧

In [None]:
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna().reset_index(drop=True)
X_train, X_test, y_train, y_test = train_test_split(penguins[['bill_length_mm', 'body_mass_g']], 
                                                    penguins['species'], 
                                                    random_state=26)
display(X_train, y_train)

- As we did two lectures ago, we'll aim to predict the `'species'` of a penguin given its `'bill_length_mm'` and `'body_mass_g'`.

In [None]:
util.penguin_scatter_2d(X_train, y_train)

### Recap: $k$-nearest neighbors

- Let's fit a $k$-NN classifier with $k=5$ to the training data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X_train, y_train)
util.penguin_decision_boundary(model_knn, X_train, y_train, title="k-NN Decision Boundary when k = 5")

- Notice the vastly different scales of the features! What happens if we standardize?

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model_knn_standardized = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model_knn_standardized.fit(X_train, y_train)
util.penguin_decision_boundary(model_knn_standardized, X_train, y_train, title="k-NN Decision Boundary when k = 5 and with Standardization")

### Recap: Decision trees

- Let's fit a decision tree classifier with a maximum depth of 3 to the training data.

In [None]:
from sklearn.tree import DecisionTreeClassifier
model_tree = DecisionTreeClassifier(max_depth=3)
model_tree.fit(X_train, y_train)
util.penguin_decision_boundary(model_tree, X_train, y_train, title="Decision Boundary for a Decision Tree of Depth 3")

### What about logistic regression?

- As we've seen, in **binary classification**, logistic regression models **the probability of belonging to class 1, given a feature vector $\vec{x}_i$**:

$$P(y_i = 1 | \vec{x}_i) = \sigma (w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)}) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$$   

- In logistic regression, $C = \{0, 1\}$. But, in our current penguin classification problem, $C = \{ \text{Adelie}, \text{Chinstrap}, \text{Gentoo} \}$, so we can't use logistic regression directly.

- One idea: **one-vs-rest**. Fit $|C| = 3$ separate logistic regression models – one per class – and predict the class that has the highest probability.
    - Penguin is Adelie vs. penguin is not Adelie.
    - Penguin is Chinstrap vs. penguin is not Chinstrap.
    - Penguin is Gentoo vs. penguin is not Gentoo.

- Another idea: **one-vs-one**. Fit ${3 \choose 2} = 3$ separate logistic regression models – one per **pair** of classes – and predict the class that "wins" the most predictions.
    - Penguin is Adelie vs. penguin is Chinstrap.
    - Penguin is Adelie vs. penguin is Gentoo.
    - Penguin is Chinstrap vs. penguin is Gentoo.

- Let's try something slightly different than what's listed above.

### Multinomial logistic regression

- **Multinomial** logistic regression, also known as **softmax regression**, models the probability of belonging to **any class, given a feature vector $\vec x_i$**.<br><small>Think of it as a generalization of logistic regression.</small>

$$p_\text{Adelie} = P(y_i = \text{Adelie} | \vec{x}_i) = \frac{e^{\vec{w}_\text{Adelie} \cdot \text{Aug}(\vec{x}_i)}}{e^{\vec{w}_\text{Adelie} \cdot \text{Aug}(\vec{x}_i)} + e^{\vec{w}_\text{Chinstrap} \cdot \text{Aug}(\vec{x}_i)} + e^{\vec{w}_\text{Gentoo} \cdot \text{Aug}(\vec{x}_i)}}$$

$$p_\text{Chinstrap} = P(y_i = \text{Chinstrap} | \vec{x}_i) = \frac{e^{\vec{w}_\text{Chinstrap} \cdot \text{Aug}(\vec{x}_i)}}{e^{\vec{w}_\text{Adelie} \cdot \text{Aug}(\vec{x}_i)} + e^{\vec{w}_\text{Chinstrap} \cdot \text{Aug}(\vec{x}_i)} + e^{\vec{w}_\text{Gentoo} \cdot \text{Aug}(\vec{x}_i)}}$$

$$\underbrace{p_j = P(y_i = j | \vec{x}_i) = \frac{e^{\vec{w}_j \cdot \text{Aug}(\vec{x}_i)}}{\sum_{k \in C} e^{\vec w_k \cdot \text{Aug}(\vec x_i)}}}_\text{in general}$$

- Instead of a single parameter vector $\vec{w}$, there are $|C|$ parameter vectors, one per class!

- Multinomial logistic regression models the probability of each class directly, and then predicts the most likely class.
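
To see why this is a generalization, consider the case with just two classes. Dividing the numerator and denominator of $p_1$ by $e^{\vec{w}_1 \cdot \text{Aug}(\vec{x}_i)}$ gives $\sigma\left((\vec{w}_1 - \vec{w}_0) \cdot \text{Aug}(\vec{x}_i)\right)$, which is the binary logistic regression model. A quick numerical check, using made-up scores:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def softmax(z):
    exps = np.exp(z)
    return exps / exps.sum()

# Hypothetical per-class scores w_k . Aug(x_i) for a two-class problem.
z_class_0, z_class_1 = 0.4, 1.9

p_softmax = softmax(np.array([z_class_0, z_class_1]))[1]  # P(y_i = 1 | x_i) from softmax.
p_sigmoid = sigmoid(z_class_1 - z_class_0)                # Sigmoid of the score difference.
```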

### Aside: The softmax function

- The **softmax** function is a generalization of the logistic function to multiple dimensions.<br>
Suppose $\vec z \in \mathbb{R}^d$. Then, the softmax of $\vec z$ is defined element-wise as follows:

$$\sigma(\vec z)_i = \frac{e^{z_i}}{\sum_{j = 1}^d e^{z_j}}$$

- For example, suppose $\vec{z} = \begin{bmatrix} -5 \\ 2 \\ 4 \end{bmatrix}$. Then:

$$\sigma(\vec z) = \begin{bmatrix} \sigma(\vec z)_1 \\ \sigma(\vec z)_2 \\ \sigma(\vec z)_3  \end{bmatrix} = \underbrace{\begin{bmatrix} \frac{e^{-5}}{e^{-5} + e^2 + e^4} \\ \frac{e^{2}}{e^{-5} + e^2 + e^4} \\ \frac{e^{4}}{e^{-5} + e^2 + e^4} \end{bmatrix}}_\text{note the constant denominator!} = \begin{bmatrix} 0.0001 \\ 0.1192 \\ 0.8807 \end{bmatrix}$$

- Why is it defined this way? **It maps a vector of real numbers to a vector of probabilities!**<br><small>Note that the denominator, $\sum_{j=1}^d e^{z_j}$, normalizes the $e^{z_i}$ terms so that the results sum to 1.</small>

- Multinomial logistic regression, i.e. softmax regression, trains $|C|$ linear models of the form $\boxed{\vec w_k \cdot \text{Aug}(\vec x_i)}$ – one per class, $k$ – and feeds the output of each through the softmax function, so the results can be interpreted as probabilities.

    $$p_j = P(y_i = j | \vec{x}_i) = \frac{e^{\vec{w}_j \cdot \text{Aug}(\vec{x}_i)}}{\sum_{k \in C} e^{\vec w_k \cdot \text{Aug}(\vec x_i)}}$$

    The $|C|$ optimal parameter vectors – $\vec w_\text{Adelie}^*$, $\vec w_\text{Chinstrap}^*$, and $\vec w_\text{Gentoo}^*$, in our case – are chosen to minimize mean cross-entropy loss, just like before!
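
The worked example with $\vec z = \begin{bmatrix} -5 & 2 & 4 \end{bmatrix}^T$ can be reproduced in a few lines. This sketch subtracts $\max_j z_j$ before exponentiating, a common trick that is an addition here and not part of the definition above.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1D array.

    Subtracting max(z) leaves the result unchanged, since the common factor
    cancels between numerator and denominator, but avoids overflow for large z.
    """
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

z = np.array([-5, 2, 4])
probs = softmax(z)          # Approximately [0.0001, 0.1192, 0.8807], as above.
predicted = probs.argmax()  # Position of the most likely class.
```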

### Multinomial logistic regression in `sklearn`

- The `LogisticRegression` class supports multinomial logistic regression.

In [None]:
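# Note: multi_class is deprecated as of scikit-learn 1.5. With three or more classes,
# the default settings already fit a multinomial model.
model_log = LogisticRegression(multi_class='multinomial')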
model_log.fit(X_train, y_train)

- In total, the fit model has $3 \times 2 = 6$ coefficients and $3 \times 1 = 3$ intercepts.

In [None]:
model_log.coef_

In [None]:
model_log.intercept_

In [None]:
model_log.classes_

- When calling `model_log.predict_proba`, we get back an array of three predicted probabilities.

In [None]:
model_log.predict_proba(pd.DataFrame([{
    'bill_length_mm': 45,
    'body_mass_g': 4500
}]))

### What does this model _look_ like?

In [None]:
util.penguin_decision_boundary(model_log, X_train, y_train, title="Softmax Regression Decision Boundary")

### Neural networks 🧠

- Softmax regression is an example of a **neural network**.<br><small>Our brains are made up of **neurons** connected by "links", called synapses. The model diagram below loosely resembles this structure, which is why the model is called a **neural** network.</small>

<center><img src="imgs/net.svg" width=1200></center>

- Each of the 9 diagonal lines connecting a value in the <b><span style="color:#b3e0ff">input layer</span></b> with a value in the <b><span style="color:#ff7400">out</span><span style="color:#c45bcc">put</span> <span style="color:#077575">layer</span></b> represents a parameter, $w^*$.

In [None]:
model_log.intercept_

In [None]:
model_log.coef_

- We can use the nine parameter values above to reproduce the network's calculations ourselves.

In [None]:
# Same values as shown by model_log.predict_proba, two slides ago!
softmax = lambda z: np.e ** z / sum(np.e ** z)
softmax(model_log.intercept_.reshape(-1, 1) + model_log.coef_ @ np.array([[45], [4500]]))