# Final Project: Overview

- All in class
- Approx 2 hours in small groups
- Instructor guidance

## Introduction to Logistic Regression

Before we jump into today's holiday-themed activity, let's briefly review **logistic regression**, a foundational model in supervised machine learning used for **classification** tasks.

---

### What Problem Does Logistic Regression Solve?

Unlike **linear regression**, which predicts a continuous value, **logistic regression predicts a probability** that an observation belongs to a particular class.

> What are some examples of problems logistic regression could be used for?

> What are some examples of problems linear regression could be used for?

---

### Why Not Use a Line?

If we tried using a normal linear regression line for classification, predictions could fall outside the range [0, 1], which doesn’t make sense for probabilities.
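
As a minimal sketch of this problem, the snippet below fits an ordinary straight line to 0/1 labels. The toy data (`x`, `y`) and the query points in `x_new` are made up for illustration and are not part of today's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy 1-D data (made up for illustration): the label is 1 when x > 5
x = np.arange(0, 11, dtype=float).reshape(-1, 1)
y = (x.ravel() > 5).astype(int)

# fit an ordinary straight line to the 0/1 labels
line = LinearRegression().fit(x, y)

# evaluate the line at a few points, including some outside the training range
x_new = np.array([[-5.0], [5.0], [15.0]])
preds = line.predict(x_new)
print(preds)
```

The first prediction comes out below 0 and the last one above 1, so neither can be read as a probability.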

To fix this, we apply a special mathematical function called the **sigmoid function**.

---

### The Sigmoid Function

The sigmoid takes any real number and “squashes” it into a probability between 0 and 1:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

<p align="left">
    <img src = "https://cdn.britannica.com/64/264764-050-A2C174FD/graph-of-a-sigmoid-function.jpg" width = "800">
</p>

Where:

- $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$ 
- $\beta$’s are model parameters (weights), i.e., the things you're estimating
- $x$’s are input features, i.e., the data you've collected

As $z \to +\infty$, output $\to$ 1  
As $z \to -\infty$, output $\to$ 0
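
The formula above can be sketched in a few lines of NumPy. The helper name `sigmoid` and the example values for `beta_0`, `beta`, and `x` are illustrative assumptions, not values from today's activity.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of them) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# the output approaches 0 for very negative z and 1 for very positive z
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(z)
print(probs)

# building z from made-up weights and features: z = beta_0 + beta_1*x_1 + beta_2*x_2
beta_0 = -1.0
beta = np.array([0.5, -0.25])
x = np.array([2.0, 4.0])
z_one = beta_0 + beta @ x   # -1 + 1 - 1 = -1
p_one = sigmoid(z_one)
print(z_one, p_one)
```

For very large negative inputs `np.exp(-z)` can overflow; `scipy.special.expit` is a numerically safer implementation of the same function.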

> How do you think the model would perform if the true relationship between $z$ and the features were $z = \beta_0 + \beta_1 \cos(x_1) + \cdots + \beta_n x_n^2$?

---

### Decision Boundary

Once we have a probability, we turn it into a final prediction by applying a **threshold**:
$$
\hat{y} =
\begin{cases}
1 & \text{if } \sigma(z) \ge 0.5 \\
0 & \text{otherwise}
\end{cases}
$$
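
A small sketch of the thresholding rule, using made-up values of `z`. It also checks that $\sigma(z) \ge 0.5$ holds exactly when $z \ge 0$, which is why the decision boundary is the set of points where $z = 0$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up linear scores on both sides of zero
z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])

# apply the 0.5 threshold to the probabilities
y_hat = (sigmoid(z) >= 0.5).astype(int)

# the same labels come from checking the sign of z directly
y_from_sign = (z >= 0).astype(int)

print(y_hat)
print(y_from_sign)
```

Because $z$ is a linear function of the features, this boundary is a straight line (or a flat hyperplane in more dimensions).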

For multi-class problems like today's (**naughty / nice / very nice**), we extend logistic regression using **softmax regression** (a type of multinomial logistic regression).

---

### Softmax for Multiclass Classification

Instead of outputting just one probability, the softmax function outputs **one probability per class**, and these probabilities always sum to 1. The probability that observation (row) $i$ belongs to class $k$ is:

$$
P(y_i = k \mid x_i) = \frac{e^{z_{i,k}}}{\sum_{j=1}^{K} e^{z_{i,j}}}
$$

Where:

- $z_{i,k} = \beta_{0,k} + \beta_{1,k}x_{i,1} + \cdots + \beta_{n,k}x_{i,n}$
- $K$ is the total number of classes
- Each class gets its own linear model output $z_{i,k}$
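
The softmax formula can be sketched as follows. The function name `softmax` and the score matrix `z` are made up for illustration; subtracting the row maximum is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; z has shape (n_samples, n_classes)."""
    # subtracting each row's max does not change the ratios but avoids overflow
    shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

# made-up scores z_{i,k} for 2 rows and K = 3 classes
z = np.array([[2.0, 1.0, 0.1],
              [0.0, 0.0, 0.0]])

probs = softmax(z)
print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

The second row has identical scores for every class, so each class receives probability 1/3.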

> Say you have $n_k = 3$ classes and $n_f = 4$ features. How many parameters does your model estimate? $(n_k \times n_f) + n_k$

---

### How Does the Model Learn?

Logistic regression parameters are learned by minimizing a cost function called **cross-entropy loss**, which measures how well the predicted probabilities match the actual outcomes.

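$$
\text{Loss} = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik})
$$

Here $y_{ik}$ is 1 if observation $i$ truly belongs to class $k$ and 0 otherwise, and $\hat{y}_{ik}$ is the model's predicted probability for that class.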
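
To make the loss concrete, here is a small sketch that evaluates the double sum for two sets of predictions. All arrays (`y_true`, `y_good`, `y_bad`) are made-up values, with the true labels written in one-hot form.

```python
import numpy as np

# one-hot true labels: 3 samples, K = 3 classes (made-up values)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

# predicted probabilities; each row sums to 1
y_good = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])
y_bad = np.array([[0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])

def cross_entropy(y, y_hat):
    """Summed cross-entropy, matching the double sum over samples and classes."""
    return -np.sum(y * np.log(y_hat))

loss_good = cross_entropy(y_true, y_good)
loss_bad = cross_entropy(y_true, y_bad)
print(loss_good, loss_bad)
```

Only the probability assigned to the true class contributes to each sample's term, so confident wrong predictions (`y_bad`) are penalized much more heavily than confident correct ones. Note that `sklearn.metrics.log_loss` reports the mean over samples by default rather than the sum.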

The computer updates parameters using an optimization method such as **gradient descent**.
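
The following is a minimal sketch of gradient descent for the two-class case, not the solver scikit-learn uses (its `LogisticRegression` defaults to the `lbfgs` solver with L2 regularization). The toy data, the `learning_rate`, and the number of iterations are all assumptions chosen for illustration, and the loss is averaged over samples rather than summed.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data (made up): one feature; the label tends to be 1 when the feature is positive
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

# prepend a column of ones so beta[0] plays the role of the intercept
X_b = np.hstack([np.ones((200, 1)), X])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_loss(beta):
    """Mean binary cross-entropy for the current parameters."""
    p = sigmoid(X_b @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

beta = np.zeros(2)
learning_rate = 0.1
losses = [mean_loss(beta)]

for _ in range(500):
    p = sigmoid(X_b @ beta)
    gradient = X_b.T @ (p - y) / len(y)  # gradient of the mean cross-entropy
    beta -= learning_rate * gradient     # step downhill
    losses.append(mean_loss(beta))

print("first loss:", losses[0], "last loss:", losses[-1])
print("estimated parameters:", beta)
```

With all parameters at zero every prediction is 0.5, so the starting loss is $\log 2$; each update then moves the parameters in the direction that reduces the loss.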

---

## Holiday Classification Problem
<p align="left">
    <img src = "https://embroideres.com/files/1215/6749/7822/grinch_naughty_or_nice_machine_embroidery_design.jpg" width = "800">
</p>

Santa wants a model to classify children into:
- Naughty
- Nice
- Very Nice

You are given the following feature dataset:
- `good_deeds`: number of good deeds this year
- `tantrums`: number of tantrums
- `cookies_left`: how many times they left cookies for Santa
- `chores_done`: fraction of assigned chores completed (between 0 and 1)


In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# for reproducibility
np.random.seed(42)

# number of samples
N = 300 

# sample features from different distributions; really the type of distribution isn't too important here
good_deeds = np.random.poisson(10, N)
tantrums = np.random.poisson(5, N)
cookies_left = np.random.binomial(10, 0.4, N)
chores_done = np.random.uniform(0, 1, N)
############################

# define your weights and threshold for labeling the sample nice or not nice (naughty)
# -- this is our back of the textbook solution 
weights_per_feature = {
    "good_deeds": 1,
    "tantrums": 1,
    "cookies_left": 0.5,    
    "chores_done": 5
}
threshold = 10

# generate a label from a simple LINEAR function of the features:
# a weighted sum (tantrums are subtracted) compared against the threshold
score = (
    weights_per_feature['good_deeds'] * good_deeds
    - weights_per_feature['tantrums'] * tantrums
    + weights_per_feature['chores_done'] * chores_done
    + weights_per_feature['cookies_left'] * cookies_left
)
nice = (score > threshold).astype(int)

# package things into a DataFrame
df = pd.DataFrame({
    "good_deeds": good_deeds,
    "tantrums": tantrums,
    "cookies_left": cookies_left,
    "chores_done": chores_done,
    "nice": nice
})

# your features
X = df[["good_deeds", "tantrums", "cookies_left", "chores_done"]]
# the thing you're predicting
y = df["nice"]

# split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# make predictions
y_pred = model.predict(X_test)

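# report the result of your model
# the coefficients should roughly track the weights defined above in sign and relative size
# (tantrums is subtracted in the score, so its coefficient is negative); the overall scale may differ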
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)


Accuracy: 0.9466666666666667
Coefficients: [[ 1.28769755 -1.39435005  0.48294982  3.86177188]]
Intercept: [-10.70927088]


> In this case, what's a better metric than accuracy?

> Interpret the general impact of each coefficient on the class.

In [None]:
# compute your niceness score -- notice it's still linear in the features
conditions = (
    weights_per_feature['good_deeds'] * good_deeds
    - weights_per_feature['tantrums'] * tantrums
    + weights_per_feature['chores_done'] * chores_done
    + weights_per_feature['cookies_left'] * cookies_left
)

# define 3 classes: naughty (0), nice (1), very nice (2)
y_multi = np.where(conditions < 5, 0, np.where(conditions < 12, 1, 2))
df["label3"] = y_multi

# isolate features
X = df[["good_deeds", "tantrums", "cookies_left", "chores_done"]]
# get labels
y = df["label3"]

# split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# train the model 
multi = LogisticRegression(max_iter=500)
multi.fit(X_train, y_train)

# test the model 
y_pred = multi.predict(X_test)

# print results
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Coefficients:", multi.coef_)


Accuracy: 0.8933333333333333
Coefficients: [[-1.32165936  1.29693148 -0.65767289 -2.88195674]
 [-0.03994326  0.04970594  0.02017583 -0.45794048]
 [ 1.36160263 -1.34663742  0.63749707  3.33989722]]


> Why do we have 12 coefficient estimates here?
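
To connect a fitted multiclass model back to the softmax formula, the sketch below inspects per-class probabilities with `predict_proba`. It uses a made-up stand-in dataset (`X_demo`, `y_demo`) so that it runs on its own; the same calls should work on the `multi` model above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# toy stand-in data (made up): 4 features, 3 classes cut from a linear score
X_demo = rng.normal(size=(300, 4))
score = X_demo @ np.array([1.0, -1.0, 0.5, 2.0])
y_demo = np.where(score < -1, 0, np.where(score < 1, 1, 2))

clf = LogisticRegression(max_iter=500).fit(X_demo, y_demo)

# one row per sample, one column per class
proba = clf.predict_proba(X_demo[:5])
print(proba.round(3))
print(proba.sum(axis=1))  # each row sums to 1

# predict() returns the class with the highest probability
top_class = clf.classes_[proba.argmax(axis=1)]
print(top_class, clf.predict(X_demo[:5]))
```

Looking at the full probability row, rather than only the predicted label, shows how confident the model is about each child.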