In [None]:
# Run this cell to get everything set up.
from lec_utils import *
import lec24_util as util
diabetes = pd.read_csv('data/diabetes.csv')
from sklearn.model_selection import train_test_split
diabetes = diabetes[(diabetes['Glucose'] > 0) & (diabetes['BMI'] > 0)]
X_train, X_test, y_train, y_test = (
    train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=11)
)
from ipywidgets import interact
import warnings
warnings.simplefilter('ignore')

<div class="alert alert-info" markdown="1">

#### Lecture 24

# Logistic Regression

### EECS 398-003: Practical Data Science, Fall 2024

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/fa24">github.com/practicaldsc/fa24</a></small>
    
</div>

<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    TeX: {
      extensions: ["color.js"],
      packages: {"[+]": ["color"]},
    }
  });
  </script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-AMS_HTML"></script>

### Announcements 📣

- The Portfolio Homework's checkpoint is due on **Monday, November 25th** – no slip days allowed!<br><small>The full homework is due on **Saturday, December 7th** (no slip days!).</small>

- Homework 10 is (finally!) out, and is due on **Monday, December 2nd**.<br><small>Plan to finish it earlier, since we won't be able to offer much help over Thanksgiving.<br>There will still be a Homework 11, but it'll be max 3 questions.</small>

- Consider entering the Big Ten Data Viz Championship. Submissions are due on January 15th. Read more [**here**](https://it.umich.edu/community/data-viz-championship).<br><small>Help Michigan defend its title!</small>

- Enrollment begins today. Some suggested courses for next semester can be found in [**#306 on Ed**](https://edstem.org/us/courses/61012/discussion/5723634).<br><small>And please help spread the word about 398!</small>

### Agenda

- Recap: Classification techniques and classifier evaluation.
- Predicting probabilities.
- Cross-entropy loss.
- From probabilities to decisions.

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
Remember that you can always ask questions anonymously at the link above!

## Recap: Classification techniques and classifier evaluation

---

### Classification

- A **regression** problem is one in which we're given a feature vector $\vec{x}$ and need to predict a **real-valued** target variable, $y$.<br><small>Example: Given today's temperature, precipitation, and wind chill, what will tomorrow's high temperature be?</small>

- A **classification** problem is one in which we're given a feature vector $\vec{x}$ and need to predict a **categorical** target variable, $y$.<br><small>Example: Given today's temperature, precipitation, and wind chill, will it snow tomorrow?</small>

- In **binary classification**, there are only two possible values of the target variable (typically 1 and 0); in **multi-class classification**, there can be more than two possible values of the target variable.

- Last class, we learned about two classification techniques:
    - $k$-Nearest Neighbors 🏡🏠.
    - Decision trees 🎄.

### Accuracy of COVID tests

- The results of 100 Michigan Medicine COVID tests are given below.

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 90 ✅ | FP = 1 ❌ |
| **Actually Positive** | FN = 8 ❌ | TP = 1 ✅ |

<center><i><small>Michigan Medicine test results.</small></i></center>

- 🤔 **Question:** What is the accuracy of the test?

- **🙋 Answer:** $$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{1 + 90}{100} = 0.91$$

- **Followup:** At first, the test seems good. But, suppose we build a classifier that predicts that **nobody has COVID**. What would its accuracy be?

- **Answer to followup:** Also 0.91! There is severe **class imbalance** in the dataset, meaning that most of the data points are in the same class (no COVID). Accuracy doesn't tell the full story.
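
As a quick check of the arithmetic above, here is a minimal sketch that computes accuracy from the four confusion-matrix counts. The function name `accuracy` is just for illustration; the counts come from the table above, and the "nobody has COVID" counts follow from moving every prediction into the Predicted Negative column.

```python
def accuracy(tn, fp, fn, tp):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tn + fp + fn + tp)

# Counts from the Michigan Medicine confusion matrix.
test_accuracy = accuracy(tn=90, fp=1, fn=8, tp=1)

# A classifier that predicts "negative" for everyone: all 91 actual negatives
# become true negatives, and all 9 actual positives become false negatives.
nobody_accuracy = accuracy(tn=91, fp=0, fn=9, tp=0)

print(test_accuracy, nobody_accuracy)
```

Both values come out to 0.91, even though the second classifier never detects a single case.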

### Recall

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 90 ✅ | FP = 1 ❌ |
| <span style='color:orange'><b>Actually Positive</b></span> | <span style='color:orange'>FN = 8</span> ❌ | <span style='color:orange'>TP = 1</span> ✅ |

<center><i><small>Michigan Medicine test results</small></i></center>

- 🤔 **Question:** What proportion of individuals who actually have COVID did the test **identify**?

- **🙋 Answer:** $\frac{1}{1 + 8} = \frac{1}{9} \approx 0.11$.

- More generally, the **recall** of a binary classifier is the proportion of <span style='color:orange'><b>actually positive instances</b></span> that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.

$$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$

- To compute recall, look at the <span style='color:orange'><b>bottom (positive) row</b></span> of the above confusion matrix.

### Recall isn't everything, either!

$$\text{recall} = \frac{TP}{TP + FN}$$

- 🤔 **Question:** Can you design a "COVID test" with perfect recall?

- **🙋 Answer:** Yes – **just predict that everyone has COVID!**

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 0 ✅ | FP = 91 ❌ |
| <span style='color:orange'><b>Actually Positive</b></span> | <span style='color:orange'>FN = 0</span> ❌ | <span style='color:orange'>TP = 9</span> ✅ |

<center><i><small>everyone-has-COVID classifier</small></i></center>

$$\text{recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 0} = 1$$

- Like accuracy, recall on its own is not a perfect metric. Even though the classifier we just created has perfect recall, it has 91 false positives!

### Precision

| | Predicted Negative | <span style='color:orange'>Predicted Positive</span> |
| --- | --- | --- |
| **Actually Negative** | TN = 0 ✅ | <span style='color:orange'>FP = 91</span> ❌ |
| **Actually Positive** | FN = 0 ❌ | <span style='color:orange'>TP = 9</span> ✅ |

<center><i><small>everyone-has-COVID classifier</small></i></center>

- The **precision** of a binary classifier is the proportion of <span style='color:orange'><b>predicted positive instances</b></span> that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.

$$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$

- To compute precision, look at the <span style='color:orange'><b>right (positive) column</b></span> of the above confusion matrix.<br><small>**Tip:** A good way to remember the difference between precision and recall is that in the denominator for 🅿️recision, both terms have 🅿️ in them (TP and FP).</small>

- Note that the "everyone-has-COVID" classifier has perfect recall, but a precision of $\frac{9}{9 + 91} = 0.09$, which is quite low.

- 🚨 **Key idea:** There is a "tradeoff" between precision and recall. Ideally, you want both to be high. For a particular prediction task, one may be more important than the other.

- Later today, we'll see how to weigh this tradeoff in the context of selecting a threshold for classification in logistic regression.
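
To make the tradeoff concrete, the sketch below computes precision and recall for the two COVID classifiers we've seen. The helper names are illustrative, and neither function guards against a zero denominator (for example, a classifier that never predicts positive has undefined precision).

```python
def precision(tp, fp):
    """Of the instances predicted positive, the proportion that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the instances that are actually positive, the proportion predicted positive."""
    return tp / (tp + fn)

# Original Michigan Medicine test: TP = 1, FP = 1, FN = 8.
original = (precision(tp=1, fp=1), recall(tp=1, fn=8))

# Everyone-has-COVID classifier: TP = 9, FP = 91, FN = 0.
everyone = (precision(tp=9, fp=91), recall(tp=9, fn=0))

print("original:", original)
print("everyone:", everyone)
```

The original test has moderate precision but very low recall; the everyone-has-COVID classifier flips that, with perfect recall and very low precision.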

### Precision and recall

<center><img src="imgs/Precisionrecall.svg.png" width=30%></center>

<center>(<a href="https://en.wikipedia.org/wiki/Precision_and_recall">source</a>)</center>

<div class="alert alert-success">
    
### Discussion
    
$$\text{precision} = \frac{TP}{TP + FP} \: \: \: \:  \: \: \: \: \text{recall} = \frac{TP}{TP + FN}$$
    
- 🤔 When might high **precision** be more important than high recall?

- 🤔 When might high **recall** be more important than high precision?

<div class="alert alert-success">
    <h3>Activity</h3>


Consider the confusion matrix shown below.

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 22 ✅ | FP = 2 ❌ |
| **Actually Positive** | FN = 23 ❌ | TP = 18 ✅ |

What is the accuracy of the above classifier? The precision? The recall?

<br>

After calculating all three on your own, click below to see the answers.

<details>
    <summary><b>👉 Accuracy</b></summary>
    (22 + 18) / (22 + 2 + 23 + 18) = 40 / 65
</details>

<details>
    <summary><b>👉 Precision</b></summary>
    18 / (18 + 2) = 9 / 10
</details>

<details>
    <summary><b>👉 Recall</b></summary>
    18 / (18 + 23) = 18 / 41
</details>    
    
</div>

<div class="alert alert-success">
    <h3>Activity</h3>

After fitting a `BillyClassifier`, we use it to make predictions on an unseen test set. Our results are summarized in the following confusion matrix.

| | **Predicted Negative** | **Predicted Positive** |
| --- | --- | --- |
| **Actually Negative** | ??? | 30 |
| **Actually Positive** | 66 | 105 |

- **Part 1**: What is the recall of our classifier? Give your answer as a fraction (it does not need to be simplified).<br>

- **Part 2**: The accuracy of our classifier is $\frac{69}{117}$. How many **true negatives** did our classifier have? Give your answer as an integer.<br>

- **Part 3**: True or False: In order for a binary classifier's precision and recall to be equal, the number of mistakes it makes must be an even number.<br>

- **Part 4**: Suppose we are building a classifier that listens to an audio source (say, from your phone’s microphone) and predicts whether or not it is Soulja Boy’s 2008 classic “Kiss Me thru the Phone.” Our classifier is pretty good at detecting when the input stream is “Kiss Me thru the Phone”, but it often incorrectly predicts that similar sounding songs are also “Kiss Me thru the Phone.”

Complete the sentence: Our classifier has...
- low precision and low recall.
- low precision and high recall.
- high precision and low recall.
- high precision and high recall.
    
</div>

### Combining precision and recall

- If we care equally about a model's precision $PR$ and recall $RE$, we can combine the two using a single metric called the **F1-score**:

$$\text{F1-score} = \text{harmonic mean}(PR, RE) = 2\frac{PR \cdot RE}{PR + RE}$$

- Both F1-score and accuracy are overall measures of a binary classifier's performance. But remember, accuracy is misleading in the presence of class imbalance, and doesn't take into account the kinds of errors the classifier makes.
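
Here is a small sketch of the F1-score formula, applied to the everyone-has-COVID classifier from earlier (precision 0.09, recall 1). The function name `f1_score` is our own; the comparison with the ordinary arithmetic mean is included only to show why the harmonic mean is used.

```python
def f1_score(pr, re):
    """Harmonic mean of precision and recall."""
    return 2 * pr * re / (pr + re)

pr = 9 / (9 + 91)   # precision of the everyone-has-COVID classifier
re = 9 / (9 + 0)    # recall of the everyone-has-COVID classifier

f1_everyone = f1_score(pr, re)
arithmetic_mean = (pr + re) / 2

print(round(f1_everyone, 3), round(arithmetic_mean, 3))
```

The harmonic mean stays close to the smaller of the two values, so a classifier can't earn a high F1-score by being excellent on one metric and terrible on the other.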

### Other evaluation metrics for binary classifiers

- We just scratched the surface! This [excellent table from Wikipedia](https://en.wikipedia.org/wiki/Template:Diagnostic_testing_diagram) summarizes the many other metrics that exist.

<center><img src='imgs/wiki-table.png' width=75%></center>

- If you're interested in exploring further, a good next metric to look at is **true negative rate (i.e. specificity)**, which is the analogue of recall for true negatives.

## Predicting probabilities

---

<center>
<img src="imgs/needle.png" width=900>
<br>
The New York Times maintained <a href="https://www.nytimes.com/interactive/2024/11/05/us/elections/results-president-forecast-needle.html">needles</a><br>that displayed the probabilities of various outcomes in the election.
</center>

### Motivation: Predicting probabilities

- Often, we're interested in predicting the **probability** of an event occurring, given some other information.

<center>Given that the score at the start of the second half is Michigan 23-Northwestern 15,<br>what's the probability that Michigan wins?</center>

<center>Here's a picture of an animal. What's the probability it's of a dog? Cat? Hamster? Zebra?</center>

<center>What's the probability that it snows on campus tomorrow?<br><small>In the context of weather apps, this is a nuanced question; <a href="https://xkcd.com/1985">here's a meme about it</a>.</small></center>

- If we're able to predict the probability of an event, we can **classify** the event by using a threshold.<br><small>For example, if we predict there's a 70% chance of Michigan winning, we could predict that Michigan will win. Here, we implicitly used a threshold of 50%.</small>

- The two classification techniques we've seen so far – $k$-Nearest Neighbors and decision trees – **don't** directly use probabilities in their decision-making process.<br><small>But sometimes it's helpful to model uncertainty and to be able to state a level of confidence along with a prediction!</small>

### Recap: Predicting diabetes

- Let's try to predict whether or not a patient has diabetes (`'Outcome'`) given just their `'Glucose'` level.<br><small>Last class, we used both `'Glucose'` and `'BMI'`; we'll start with just one feature for now.</small>

- As before, <span style='color: orange'><b>class 0 (orange) is "no diabetes"</b></span> and <span style='color: blue'><b>class 1 (blue) is "diabetes"</b></span>.

In [None]:
util.show_one_feature_plot(X_train, y_train)

- It seems that as a patient's `'Glucose'` value increases, the **chance they have diabetes** also increases.

- Can we model this probability directly, as a function of `'Glucose'`?<br>In other words, can we find some $f$ such that:

$$P(y = 1 | \text{Glucose}) = f(\text{Glucose})$$

### An attempt to predict probabilities

- Let's try and fit a simple linear model to the data from the previous slide.

In [None]:
util.show_one_feature_plot_with_linear_model(X_train, y_train)

- The <span style="color:#097054"><b>simple linear model</b></span> above predicts values greater than 1 and less than 0! This means we can't interpret the outputs as probabilities.

- We could, technically, **clip** the outputs of the linear model:

In [None]:
util.show_one_feature_plot_with_linear_model_clipped(X_train, y_train)

### Bins and proportions

- Another approach we could try is to:
    - Place `'Glucose'` values into **bins**, e.g. 50 to 55, 55 to 60, 60 to 65, etc.
    - Within each bin, compute the proportion of patients in the training set who had diabetes.

In [None]:
# Take a look at the source code in lec24_util.py to see how we did this!
# We've hidden a lot of the plotting code in the notebook to make it cleaner.
util.make_prop_plot(X_train, y_train)

- For example, the point near a `'Glucose'` value of 100 has a $y$-axis value of ~0.25. This means that about 25\% of patients with a `'Glucose'` value near 100 had diabetes in the training set. 

- So, if a new person comes along with a `'Glucose'` value near 100, we'd predict there's a 25\% chance they have diabetes (so they likely do not)!

- **Notice that the points form an S-shaped curve!**<br><small>Can we incorporate this S-shaped curve in how we predict probabilities?</small>
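
The binning idea can be sketched without any plotting. The data below is a made-up toy sample (not the diabetes dataset), and `proportion_by_bin` is a hypothetical helper, not the function used in `lec24_util.py`.

```python
from collections import defaultdict

def proportion_by_bin(values, outcomes, bin_width=5):
    """Group values into bins of the given width and return, for each bin's
    left edge, the proportion of outcomes equal to 1."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for value, outcome in zip(values, outcomes):
        left_edge = (value // bin_width) * bin_width
        totals[left_edge] += 1
        positives[left_edge] += outcome
    return {edge: positives[edge] / totals[edge] for edge in sorted(totals)}

# Hypothetical toy data: glucose readings and 0/1 diabetes outcomes.
glucose = [98, 101, 102, 104, 103, 141, 143, 144, 142]
outcome = [0, 0, 1, 0, 0, 1, 1, 0, 1]

props = proportion_by_bin(glucose, outcome)
print(props)
```

In this toy sample, 1 of the 4 patients in the 100 to 105 bin has diabetes, so the estimated probability for that bin is 0.25.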

### The logistic function

- The **logistic function** resembles an $S$-shape.

    $$\sigma(t) = \frac{1}{1 + e^{-t}} = \frac{1}{1 + \text{exp}(-t)}$$
    
    <br><small>The logistic function is an example of a <b>sigmoid function</b>, which is the general term for an S-shaped function. Sometimes, we use the terms "logistic function" and "sigmoid function" interchangeably.</small>

- Below, we'll look at the shape of $y = \sigma(w_0 + w_1 x)$ for different values of $w_0$ and $w_1$.
    - $w_0$ controls the position of the curve on the $x$-axis.
    - $w_1$ controls the "steepness" of the curve.

In [None]:
util.show_three_sigmoids()

- Notice that $0 < \sigma(t) < 1$, for all $t$, which means **we can interpret the outputs of $\sigma(t)$ as probabilities**!

- Below, interact with the sliders to change the values of $w_0$ and $w_1$.

In [None]:
interact(util.plot_sigmoid, w0=(-15, 15), w1=(-3, 3, 0.1));
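
If you'd like to experiment without the plotting helpers, here is a minimal standalone sketch of $\sigma(w_0 + w_1 x)$. The two-branch form of `sigmoid` is a common trick to avoid overflow for large negative inputs; the values $w_0 = -6$ and $w_1 = 0.05$ are just example parameters.

```python
import math

def sigmoid(t):
    """Logistic function, written so that math.exp never receives a large positive argument."""
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    exp_t = math.exp(t)
    return exp_t / (1 + exp_t)

def logistic_curve(x, w0, w1):
    """Value of sigma(w0 + w1 * x)."""
    return sigmoid(w0 + w1 * x)

# Example parameters (an assumption for illustration).
w0, w1 = -6, 0.05

# The curve crosses 0.5 where w0 + w1 * x = 0, i.e. at x = -w0 / w1.
midpoint = -w0 / w1

print(midpoint, logistic_curve(midpoint, w0, w1))
```

This is one way to see why $w_0$ shifts the curve along the $x$-axis: for a fixed $w_1$, changing $w_0$ moves the point where the output equals 0.5.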

### Logistic regression

- Logistic **regression** is a linear **classification** technique that builds upon linear regression.

- It models **the probability of belonging to class 1, given a feature vector**:
    
$$P(y = 1 | \vec{x}) = \sigma (\underbrace{w_0 + w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}}_{\text{linear regression model}}) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}) \right)$$   

- Note that the existence of coefficients, $w_0, w_1, ... w_d$, that we need to learn from the data, tells us that logistic regression is a **parametric** method!

### `LogisticRegression` in `sklearn`

In [None]:
from sklearn.linear_model import LogisticRegression

- Let's fit a `LogisticRegression` classifier. Specifically, this means we're asking `sklearn` to learn the optimal parameters $w_0^*$ and $w_1^*$ in:

$$P(y = 1 | \text{Glucose}) = \sigma \left( w_0 + w_1 \cdot \text{Glucose} \right)$$

In [None]:
model_logistic = LogisticRegression()
model_logistic.fit(X_train[['Glucose']], y_train)

- We get a test accuracy that's roughly in line with the test accuracies of the two models we saw last class.

In [None]:
model_logistic.score(X_test[['Glucose']], y_test)

- What does our fit model **look like**?

### Visualizing a fit logistic regression model

- The values of $w_0^*$ and $w_1^*$ `sklearn` found are below.

In [None]:
model_logistic.intercept_[0], model_logistic.coef_[0][0]

- So, our fit model is:

$$P(y = 1 | \text{Glucose}) = \sigma(-5.9015855 + 0.04240496 \cdot \text{Glucose})$$

In [None]:
util.show_one_feature_plot_with_logistic(X_train, y_train)

- So, if a patient has a `'Glucose'` level of 150, the model's predicted probability that they have diabetes is:

$$\sigma(-5.9015855 + 0.04240496 \cdot 150) \approx \sigma(0.46) \approx 0.61$$

In [None]:
model_logistic.predict_proba([[150]])
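
The same prediction can be reproduced by hand. The sketch below hard-codes the intercept and coefficient reported above rather than reading them from `model_logistic`, so it runs on its own.

```python
import math

# Parameters reported above for the fit single-feature model.
w0_star = -5.9015855
w1_star = 0.04240496

glucose = 150
z = w0_star + w1_star * glucose        # the "linear regression" part
p_diabetes = 1 / (1 + math.exp(-z))    # pass it through the logistic function

print(round(z, 2), round(p_diabetes, 2))
```

This prints `0.46 0.61`, matching the calculation above.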

<center><big>How did <code>sklearn</code> find $w_0^*$ and $w_1^*$?<br>What <b>loss function</b> did it use?</big></center>

## Cross-entropy loss

---

### The modeling recipe

- To train a **parametric model**, we always follow the same three steps.
<br><small>$k$-Nearest Neighbors and decision trees didn't quite follow the same process.</small>

1. Choose a model.

$$P(y = 1 | \vec{x}) = \sigma (w_0 + w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}) \right)$$


2. Choose a loss function.

<center>???</center>

3. Minimize average loss to find optimal model parameters.<br><small>As we've now seen, average loss could also be regularized!</small>

<center>???</center>

### Attempting to use squared loss

- Our default loss function has always been squared loss, so we could try and use it here.

$$R_\text{sq}(\vec{w}) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right)^2$$

- Unfortunately, there's no closed-form solution for $\vec{w}^*$, so we'll need to use gradient descent.

- Before doing so, let's visualize the **loss surface** in the case of our "simple" logistic model:

$$P(y = 1 | \text{Glucose}) = \sigma(w_0 + w_1 \cdot \text{Glucose})$$

- Specifically, we'll visualize:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - \sigma(w_0 + w_1 \underbrace{x_i}_{\text{Glucose}_i} ) \right)^2$$

In [None]:
util.show_logistic_mse_surface(X_train, y_train)

- **What do you notice?**

### Mean squared error doesn't work well with logistic regression! 

- The following function is **not** convex:

$$R_\text{sq}(\vec{w}) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right)^2$$

- There are two flat "valleys" with gradients near 0, where gradient descent could get trapped.

- Additionally, squared loss doesn't penalize bad predictions nearly enough. The largest possible value of:

    $$\left( y_i - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right)^2$$

    is 1, since both $y_i$ and $\sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$ are **bounded** between 0 and 1, and $(1 - 0)^2 = 1$.

- Suppose $y_i = 1$. Then, the graph of the squared loss of the prediction $p_i$ is below.

In [None]:
util.show_squared_loss_individual()

- Predicted $p_i$ values near 0 are really bad, since $y_i = 1$, but the loss for $p_i = 0$ is not very high.

- It seems like we need a loss function that more **steeply penalizes incorrect probability predictions** – and hopefully, one that is convex for the logistic regression model!

### Cross-entropy loss

- A common loss function in this setting is **log loss**, i.e. **cross-entropy loss**.<br><small>The term "entropy" comes from information theory. Watch [**this short video**](https://www.youtube.com/watch?v=ErfnhcEV1O8) for more details.</small>

- We can define the cross-entropy loss function piecewise. If $y_i$ is an observed value and $p_i$ is a predicted **probability**, then: 

$$L_\text{ce}(y_i, p_i) = \begin{cases} - \log(p_i) & \text{if $y_i = 1$} \\ -\log(1 - p_i) & \text{if $y_i = 0$} \end{cases}$$

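- Note that in the two cases – $y_i = 1$ and $y_i = 0$ – the cross-entropy loss function resembles squared loss in shape, but it is unbounded: the loss grows without limit as the predicted probability $p_i$ approaches the wrong extreme ($p_i \to 0$ when $y_i = 1$, or $p_i \to 1$ when $y_i = 0$).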

In [None]:
util.show_ce_loss_individual_1()

In [None]:
util.show_ce_loss_individual_0()
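
To compare the two losses numerically, here is a short sketch for the case $y_i = 1$. It assumes $\log$ means the natural logarithm, and the function names are our own.

```python
import math

def squared_loss(y, p):
    return (y - p) ** 2

def cross_entropy_loss(y, p):
    """Piecewise definition; p must be strictly between 0 and 1."""
    return -math.log(p) if y == 1 else -math.log(1 - p)

# Suppose the true label is 1 and the predicted probability gets worse and worse.
for p in [0.9, 0.5, 0.1, 0.001]:
    print(p, round(squared_loss(1, p), 3), round(cross_entropy_loss(1, p), 3))
```

Squared loss never exceeds 1, while cross-entropy loss keeps growing as the prediction gets more confidently wrong.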

### A non-piecewise definition of cross-entropy loss

- Recall the piecewise definition of cross-entropy loss. If $y_i$ is an observed value and $p_i$ is a predicted **probability**, then: 

$$L_\text{ce}(y_i, p_i) = \begin{cases} - \log(p_i) & \text{if $y_i = 1$} \\ -\log(1 - p_i) & \text{if $y_i = 0$} \end{cases}$$

- An equivalent formulation of $L_\text{ce}$ that isn't piecewise is:

$$L_\text{ce}(y_i, p_i) = - \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)$$

- This formulation is easier to work with algebraically!
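
To see why the two formulations agree, substitute each of the two possible values of $y_i$ into the non-piecewise formula. In each case, one of the two terms is multiplied by 0 and disappears:

$$y_i = 1: \quad - \left( 1 \cdot \log p_i + (1 - 1) \log (1 - p_i) \right) = -\log p_i$$

$$y_i = 0: \quad - \left( 0 \cdot \log p_i + (1 - 0) \log (1 - p_i) \right) = -\log (1 - p_i)$$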

### Average cross-entropy loss

- **Cross-entropy loss** for an observed value $y_i$ and predicted **probability** $p_i = P(y = 1 | \vec{x}_i) = \sigma \left(\vec w \cdot \text{Aug}(\vec x_i) \right)$ is:

$$L_\text{ce}(y_i, p_i) = - \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)$$

- To find $\vec{w}^*$, then, we minimize **average cross-entropy loss**:

\begin{align*}R_\text{ce}(\vec{w}) &= - \frac{1}{n} \sum_{i = 1}^n \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right) \\ &= - \frac{1}{n} \sum_{i = 1}^n \left[ y_i \log \left( \sigma \left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right)  + (1 - y_i) \log \left(1 - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)\right) \right]\end{align*}

- Cross-entropy loss is the default loss function used to find optimal parameters in logistic regression.

- There's still no closed-form solution for $\vec{w}^* = \underset{\vec{w}}{\text{argmin}} \: R_\text{ce}(\vec{w})$, so we'll need to use gradient descent, or some other numerical method.<br><small>But don't worry – we'll leave this to `sklearn`!</small>

- Fortunately, unlike average squared loss, average cross-entropy loss is convex for the logistic regression model.

In [None]:
util.show_logistic_ce_surface(X_train, y_train)

- And, it can be regularized!<br><small>By default, `sklearn` applies regularization when performing logistic regression.</small>

In [None]:
util.show_logistic_ce_surface(X_train, y_train, reg_lambda=0.5)
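
For the curious, here is a rough sketch of what minimizing average cross-entropy loss with plain gradient descent could look like for a single feature. This is not what `sklearn` does internally; the toy data, step size, and iteration count are all assumptions chosen for illustration. It relies on the standard fact that the gradient of $R_\text{ce}$ is $\frac{1}{n} \sum_{i=1}^n (p_i - y_i) \text{Aug}(\vec{x}_i)$.

```python
import math

def sigmoid(t):
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    exp_t = math.exp(t)
    return exp_t / (1 + exp_t)

def average_ce(w0, w1, xs, ys):
    """Average cross-entropy loss of the model sigma(w0 + w1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w0 + w1 * x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def gradient(w0, w1, xs, ys):
    """Partial derivatives of average cross-entropy loss with respect to w0 and w1."""
    n = len(xs)
    errors = [sigmoid(w0 + w1 * x) - y for x, y in zip(xs, ys)]
    return sum(errors) / n, sum(e * x for e, x in zip(errors, xs)) / n

# Hypothetical toy data: one (already centered) feature and 0/1 labels.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w0, w1 = 0.0, 0.0                      # initial guess
initial_loss = average_ce(w0, w1, xs, ys)

step_size = 0.5                        # assumed learning rate
for _ in range(500):
    g0, g1 = gradient(w0, w1, xs, ys)
    w0 -= step_size * g0
    w1 -= step_size * g1

final_loss = average_ce(w0, w1, xs, ys)
print(round(initial_loss, 4), round(final_loss, 4), round(w1, 3))
```

Starting from $w_0 = w_1 = 0$, every predicted probability is 0.5, so the initial average loss is $\log 2 \approx 0.693$; the loss after gradient descent is lower.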

### The modeling recipe, revisited

1. Choose a model.

$$P(y = 1 | \vec{x}) = \sigma (w_0 + w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}) \right)$$


2. Choose a loss function.

$$L_\text{ce}(y_i, p_i) = - \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)$$

$$\text{where} \: p_i = P(y = 1 | \vec{x}_i) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$$

3. Minimize average loss to find optimal model parameters.<br><small>As we've now seen, average loss could also be regularized!</small>

    \begin{align*}R_\text{ce}(\vec{w}) &= - \frac{1}{n} \sum_{i = 1}^n \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right) \\ &= - \frac{1}{n} \sum_{i = 1}^n \left[ y_i \log \left( \sigma \left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right)  + (1 - y_i) \log \left(1 - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)\right) \right]\end{align*}

    <br>

    The actual minimization here is done using numerical methods, through `sklearn`.

### `LogisticRegression` in `sklearn`, revisited

- The `LogisticRegression` class in `sklearn` has a lot of hidden, default hyperparameters.

In [None]:
LogisticRegression?

- It performs $L_2$ regularization ("ridge logistic regression") **by default**. The hyperparameter for regularization strength, $C$, is the **inverse** of $\lambda$; by default, it sets $C = 1$.

$$C = \frac{1}{\lambda}$$

- So, for a given value of $C$, it minimizes:

$$R_\text{ce-reg}(\vec{w}) = - \frac{1}{n} \sum_{i = 1}^n \left[ y_i \log \left( \sigma \left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right)  + (1 - y_i) \log \left(1 - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)\right) \right] + \frac{1}{C} \sum_{j = 1}^d w_j^2$$

- It also specifies `solver='lbfgs'`, i.e. it doesn't use gradient descent per se, but another more sophisticated numerical method.<br><small>Read more about LBFGS [here](https://en.wikipedia.org/wiki/Limited-memory_BFGS).</small>

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
What questions do you have?

## From probabilities to decisions

---

### Predicting probabilities vs. predicting classes

$$P(y = 1 | \vec{x}) = \sigma (w_0 + w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}) \right)$$


- 🤔 **Question**: Suppose our logistic regression model predicts the probability that someone has diabetes is 0.75. What do we predict – diabetes or no diabetes? What if the predicted probability is 0.3?

- 🙋 **Answer**: We have to pick a threshold (for example, 0.5)!
    - If the predicted probability is at or above the threshold, we predict diabetes (1).
    - Otherwise, we predict no diabetes (0).

### Predicting probabilities vs. predicting classes

- By default, the `predict` method of a fit `LogisticRegression` model predicts a class.

In [None]:
model_logistic.predict(pd.DataFrame([{
    'Glucose': 150,
}]))

- But, logistic regression is designed to predict **probabilities**. We can access these predicted probabilities using the `predict_proba` method, as we saw earlier.

In [None]:
model_logistic.predict_proba(pd.DataFrame([{
    'Glucose': 150,
}]))

- The above is telling us that the model thinks this person has:
    - A 39% chance of belonging to class 0 (no diabetes).
    - A 61% chance of belonging to class 1 (diabetes).

- By default, it uses a threshold of 0.5, i.e. it predicts the class with the larger probability.<br>As we'll soon discuss, this may not be what we want!<br><small>The `predict` method itself doesn't take a threshold argument. If we want a different threshold, one option is to implement it manually using the results of `predict_proba`.</small>
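
A manual threshold might look something like the sketch below. The probabilities are made up for illustration; in the notebook, they would come from the second column of `model_logistic.predict_proba(...)`. The helper name `classify_with_threshold` is hypothetical.

```python
def classify_with_threshold(probs_class_1, threshold):
    """Predict 1 when the predicted probability of class 1 is at least the threshold."""
    return [int(p >= threshold) for p in probs_class_1]

# Hypothetical predicted probabilities of class 1 (diabetes) for four patients.
probs = [0.12, 0.39, 0.61, 0.85]

default_preds = classify_with_threshold(probs, 0.5)   # [0, 0, 1, 1]
cautious_preds = classify_with_threshold(probs, 0.3)  # [0, 1, 1, 1]

print(default_preds, cautious_preds)
```

Lowering the threshold can only turn predictions of 0 into predictions of 1, so it tends to increase recall at the cost of precision.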

### Thresholding probabilities

- As we did with other classifiers, we can visualize the **decision boundary** of a fit logistic regression model.

- If we pick a threshold of $T$, any patient with a `'Glucose'` value such that: 

    $$\sigma(w_0^* + w_1^* \cdot \text{Glucose}) \geq T$$ 

    is classified as having diabetes.

- For example, if $T = 0.5$:

In [None]:
util.show_one_feature_plot_with_logistic_and_y_threshold(X_train, y_train, 0.5)

- If we set $T = 0.5$, then patients with `'Glucose'` values above $\approx$ 140 are classified as having diabetes.

In [None]:
util.show_one_feature_plot_with_logistic_and_x_threshold(X_train, y_train, 0.5)

- **How do we find the exact $x$-axis position of the <span style="color:purple">decision boundary</span> above?**<br><small>If we can, then we'd be able to predict whether someone has diabetes just by looking at their `'Glucose'` value.</small>

### Decision boundaries for logistic regression

- In our single feature model that predicts `'Outcome'` given just `'Glucose'`, our predicted probabilities are of the form:

$$P(y = 1 | \text{Glucose}) = \sigma \left( w_0^* + w_1^* \cdot \text{Glucose} \right)$$

- Suppose we fix a threshold, $T$. Then, our <b><span style="color:purple">decision boundary</span></b> is of the form:

$$\sigma \left( w_0^* + w_1^* \cdot \text{Glucose} \right) = T$$

- If we can invert $\sigma(t)$, then we can re-arrange the above to solve for the `'Glucose'` value at the threshold:

$$\text{Glucose}_\text{T} = \frac{\sigma^{-1}(T) - w_0^*}{w_1^*}$$

- **Important**: If $p = \sigma(t)$, then $\sigma^{-1}({p}) = \log \left( \frac{p}{1-p} \right)$ is the inverse of $\sigma(t)$.<br><small>$\sigma^{-1}(p)$ is called the **logit** function.</small>

### Aside: Odds

- Suppose an event occurs with probability $p$.

- The **odds** of that event are:

$$\text{odds}(p) = \frac{p}{1-p}$$

- For instance, if there's a $p = \frac{3}{4}$ chance that Michigan wins this week, then the **odds** that Michigan wins this week are:

    $$\text{odds} \left( \frac{3}{4} \right) = \frac{\frac{3}{4}}{\frac{1}{4}} = 3$$

- Interpretation: it's 3 times more likely that Michigan wins than loses.

- **We can interpret $\sigma^{-1}(p) = \log \left( \frac{p}{1-p} \right)$ as the "log odds" of $p$!**<br><small>See the reference slides for more details.</small>
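
Here is a small sketch of odds and log odds in code, using the Michigan example above. It assumes the natural logarithm, and the function names are our own.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def odds(p):
    """Odds of an event that occurs with probability p (requires p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log odds of p, which is the inverse of the logistic function."""
    return math.log(odds(p))

michigan_odds = odds(3 / 4)          # 3.0
round_trip = logit(sigmoid(0.46))    # should recover 0.46, up to rounding

print(michigan_odds, round(round_trip, 6), logit(0.5))
```

Note that a probability of 0.5 corresponds to odds of 1 and log odds of 0.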

### Solving for the decision boundary

- Previously, we said that if we pick a threshold $T$, then:

$$\sigma \left( w_0^* + w_1^* \cdot \text{Glucose} \right) = T$$

- We re-arranged this for the `'Glucose'` value on the threshold, $\text{Glucose}_T$:

$$\text{Glucose}_\text{T} = \frac{\sigma^{-1}(T) - w_0^*}{w_1^*}$$

- Using the fact that $\sigma^{-1}(T) = \log \left( \frac{T}{1 - T} \right)$ gives us a closed-form formula for $\text{Glucose}_T$!

$$\text{Glucose}_\text{T} = \frac{\log \left( \frac{T}{1-T} \right) - w_0^*}{w_1^*}$$

- **This explains why the <span style="color:purple">decision boundary</span> below sits at $\text{Glucose} \approx 139.17$: patients with $\text{Glucose} \geq 139.17$ are classified as having diabetes!**

In [None]:
w0_star = model_logistic.intercept_[0]
w1_star = model_logistic.coef_[0][0]
T = 0.5
glucose_threshold = (np.log(T / (1 - T)) - w0_star) / w1_star
glucose_threshold

In [None]:
util.show_one_feature_plot_with_logistic_and_x_threshold(X_train, y_train, 0.5)
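
The closed-form formula also makes it easy to see how the boundary moves as the threshold changes. The sketch below hard-codes the fit parameters reported earlier so that it runs on its own; `glucose_boundary` is a hypothetical helper name.

```python
import math

# Parameters reported earlier for the fit single-feature model.
w0_star = -5.9015855
w1_star = 0.04240496

def glucose_boundary(T):
    """'Glucose' value at which the predicted probability equals the threshold T."""
    return (math.log(T / (1 - T)) - w0_star) / w1_star

for T in [0.3, 0.5, 0.7]:
    print(T, round(glucose_boundary(T), 2))
```

Since $w_1^* > 0$, a lower threshold moves the boundary to the left (more patients are classified as having diabetes), and a higher threshold moves it to the right.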

### The decision boundary in the feature space

- The decision rule on the previous slide is to predict diabetes when:

$$\text{Glucose} \geq \text{Glucose}_T \approx 139.17$$

- Let's visualize this in the **feature space**. We are just using $d = 1$ feature, so let's visualize our decision boundary with a 1D plot, i.e. a number line.

In [None]:
util.show_one_feature_plot_in_1D(X_train, y_train, 0.5)

### Logistic regression with multiple features

- Now, as we did last class, let's use both `'Glucose'` and `'BMI'` to predict diabetes.

In [None]:
util.make_two_feature_scatter(X_train, y_train)

- Specifically, our fit model will look like:

$$P(y = 1 | \text{Glucose}, \text{BMI}) = \sigma \left( w_0^* + w_1^* \cdot \text{Glucose} + w_2^* \cdot \text{BMI} \right)$$

In [None]:
model_logistic_multiple = LogisticRegression()
model_logistic_multiple.fit(X_train, y_train)

- After minimizing mean (regularized!) cross-entropy loss, we find that our fit model is of the form:

$$P(y = 1 | \text{Glucose}, \text{BMI}) = \sigma \left( -8.1697 + 0.0394 \cdot \text{Glucose} + 0.0802 \cdot \text{BMI} \right)$$

In [None]:
model_logistic_multiple.intercept_, model_logistic_multiple.coef_
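
To see how the formula $\sigma\left(\vec{w} \cdot \text{Aug}(\vec{x})\right)$ plays out with two features, here is a standalone sketch. It uses the rounded parameters from the fit model above, and the patient (a `'Glucose'` of 150 and a `'BMI'` of 30) is a made-up example.

```python
import math

def predict_probability(w, x):
    """sigma(w . Aug(x)), where Aug(x) prepends a 1 to the feature vector x."""
    aug_x = [1.0] + list(x)
    z = sum(w_j * x_j for w_j, x_j in zip(w, aug_x))
    return 1 / (1 + math.exp(-z))

# Rounded parameters of the fit two-feature model: intercept, Glucose, BMI.
w_star = [-8.1697, 0.0394, 0.0802]

# Hypothetical patient: Glucose = 150, BMI = 30.
p = predict_probability(w_star, [150, 30])
print(round(p, 3))
```

Because the parameters are rounded, the result may differ slightly from what `model_logistic_multiple.predict_proba` would return.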

### Visualizing a fit logistic regression model with two features

- Recall, the logistic regression model is trained to predict the probability of <b><span style="color:blue">class 1 (diabetes)</span></b>.

$$P(y = 1 | \text{Glucose}, \text{BMI}) = \sigma \left( -8.1697 + 0.0394 \cdot \text{Glucose} + 0.0802 \cdot \text{BMI} \right)$$

- The graph below shows the predicted probabilities of <b><span style="color:blue">class 1 (diabetes)</span></b> for different combinations of features. 

In [None]:
util.show_logistic(model_logistic_multiple, X_train, y_train)

### The decision boundary in the feature space

- What does the resulting decision boundary look like, in a $d = 2$ dimensional plot?

In [None]:
util.show_decision_boundary(model_logistic_multiple, X_train, y_train, title='Decision Boundary when Using Both Glucose and BMI \n and T = 0.5 (the default)')

- Note that unlike the decision boundaries for $k$-Nearest Neighbors and decision trees, this decision boundary is **linear**.

- Specifically, the decision boundary in the feature space is of the form:

$$a \cdot \text{Glucose} + b \cdot \text{BMI} + c = 0$$

- **In the homework, you'll solve for $a$, $b$, and $c$ in a similar example!**<br><small>It involves retracing the steps we followed in the single-feature case.</small>

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
What questions do you have?

### Lingering questions

- By default, a fit `LogisticRegression` object's `predict` method uses a threshold of $T = 0.5$ to decide when to predict class 1 vs. class 0. What if we want to use a different threshold?

<div class="alert alert-danger">
    
#### Reference Slide

### Properties of the logistic function

- The logistic function, $\sigma(t)$, obeys several interesting properties. 

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

- It is **symmetric** about the point $\left(0, \frac{1}{2}\right)$.

$$\sigma(-t) = 1 - \sigma(t)$$

- Its **derivative** is conveniently calculated:

$$\frac{d}{dt}\sigma(t) = \sigma(t) (1 - \sigma(t))$$

- But, most relevant to us right now, its **inverse** is:

$$p = \sigma(t) \implies t = \sigma^{-1}(p) = \log \left( \frac{p}{1-p} \right)$$
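
The symmetry and derivative properties can be checked numerically. The sketch below picks an arbitrary point $t = 0.7$ and compares the derivative formula against a central finite difference; the step size `h` is an assumption.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

t = 0.7     # arbitrary test point
h = 1e-6    # small step for the finite difference

# Symmetry: sigma(-t) should equal 1 - sigma(t).
symmetry_gap = abs(sigmoid(-t) - (1 - sigmoid(t)))

# Derivative: compare a numerical estimate with sigma(t) * (1 - sigma(t)).
numeric_derivative = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
analytic_derivative = sigmoid(t) * (1 - sigmoid(t))

print(symmetry_gap, numeric_derivative, analytic_derivative)
```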

<div class="alert alert-danger">
    
#### Reference Slide

### Linearity of log odds

- Let $p$ represent our predicted probability.

$$p = P(y = 1 | \text{Glucose} ) = \sigma \left( w_0^* + w_1^* \cdot \text{Glucose} \right)$$

- Using the inverse of the logistic function, we have that:

$$w_0^* + w_1^* \cdot \text{Glucose} = \log \left( \frac{p}{1 - p} \right)$$

- On the left, we have a **linear function of $\text{Glucose}$**.

- On the right, we have the **log of the odds** of $p$.<br><small>We call the "log of the odds" the "log odds".</small>

- **Important**: The logistic regression model assumes that **the log of the odds of $P(y = 1 | \vec{x})$ is a linear function of the features!**

<div class="alert alert-danger">
    
#### Reference Slide

### Implications of the linearity of log odds

- Suppose that $w_0^* = -6$ and $w_1^* = 0.05$. Then:

$$P(y = 1 | \text{Glucose}) = \sigma(-6 + 0.05 \cdot \text{Glucose})$$

- It's hard to interpret the role of the coefficient $0.05$ directly. But, we know that:

$$-6 + 0.05 \cdot \text{Glucose} = \log \left( \frac{p}{1 - p} \right)$$

- Example: Suppose my `'Glucose'` level increases by 1 unit. Then, the predicted log odds that I have diabetes increase by 0.05.

- But, since:

$$e^{-6 + 0.05 \cdot \text{Glucose}} = \frac{p}{1-p} = \text{odds}(p)$$

- And:

$$e^{-6 + 0.05 \cdot (\text{Glucose} + 1)} = e^{-6 + 0.05 \cdot \text{Glucose}} \cdot e^{0.05}$$

- We can say that **if my `'Glucose'` level increases by 1 unit, then my predicted odds of diabetes increase by a _factor_ of $e^{0.05}$**, or more generally $e^{w_1^*}$.

- You'll need this interpretation in Homework 10, Question 6!
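
As a sanity check on this interpretation, the sketch below uses the example parameters $w_0^* = -6$ and $w_1^* = 0.05$ from this slide and compares the predicted odds at two `'Glucose'` values that differ by 1. The specific values 120 and 121 are arbitrary.

```python
import math

w0_star, w1_star = -6, 0.05   # example parameters from this slide

def predicted_odds(glucose):
    """Predicted odds of diabetes: p / (1 - p) = e^(w0 + w1 * Glucose)."""
    return math.exp(w0_star + w1_star * glucose)

ratio = predicted_odds(121) / predicted_odds(120)
print(round(ratio, 4), round(math.exp(w1_star), 4))
```

The ratio is the same no matter which starting `'Glucose'` value we pick, which is what makes $e^{w_1^*}$ a convenient summary of the coefficient.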