# An Introduction to Logistic Regression
Despite its name, logistic regression is a **classification** method: it predicts which category an observation belongs to, not a continuous value.

For a visual introduction, watch this video first: <https://www.youtube.com/watch?v=3bvM3NyMiE0&t=0s>

## Motivation: Why Not Linear Regression for Classification?

Linear regression is excellent for predicting continuous values, like the price of a house or the temperature tomorrow. But what if we want to predict a categorical outcome? For example:

- Will a patient test positive or negative for a disease?
- Is an email spam or not spam?
- Is a tumor malignant or benign?

These are **classification problems** with binary outcomes (Yes/No, 1/0, True/False).

Let's see what happens if we try to fit a linear regression model to a binary outcome. Imagine we have data on tumor size and whether it's malignant (1) or benign (0).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate some sample data
np.random.seed(0)
tumor_size = np.random.normal(10, 5, 100)
is_malignant = (tumor_size + np.random.normal(0, 3, 100) > 12).astype(int)

# Fit a linear regression line
m, b = np.polyfit(tumor_size, is_malignant, 1)

# Plot the data and the line
plt.figure(figsize=(10, 6))
plt.scatter(tumor_size, is_malignant, label='Data (0=Benign, 1=Malignant)', alpha=0.7)
plt.plot(tumor_size, m*tumor_size + b, color='red', label='Linear Regression Fit')
plt.axhline(y=0, color='gray', linestyle='--')
plt.axhline(y=1, color='gray', linestyle='--')
plt.xlabel('Tumor Size')
plt.ylabel('Malignant (1) or Benign (0)')
plt.title('Why Linear Regression Fails for Classification')
plt.legend()
plt.show()

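As you can see, the fitted line is not confined to the range between 0 and 1: for small enough tumor sizes it predicts values below 0, and for large enough sizes it predicts values above 1. How would we interpret a prediction of 1.5? Or -0.5? Neither makes sense as a probability.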
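
To make the problem concrete, the sketch below refits the same least-squares line on the same synthetic data and evaluates it at a few tumor sizes. The values in `new_sizes` are hypothetical, chosen only to probe the extremes; they are not part of the original data.

```python
import numpy as np

# Regenerate the synthetic data from the cell above so this block runs on its own
np.random.seed(0)
tumor_size = np.random.normal(10, 5, 100)
is_malignant = (tumor_size + np.random.normal(0, 3, 100) > 12).astype(int)

# Least-squares line fitted directly to the 0/1 labels
m, b = np.polyfit(tumor_size, is_malignant, 1)

# Hypothetical tumor sizes (assumption: picked only to show the extremes)
new_sizes = np.array([0.0, 10.0, 25.0, 40.0])
linear_pred = m * new_sizes + b

for size, pred in zip(new_sizes, linear_pred):
    is_valid = 0.0 <= pred <= 1.0
    print(f"size={size:5.1f} -> prediction={pred:6.3f} | valid probability: {is_valid}")
```

Predictions at the low and high ends fall outside the 0 to 1 range, which is exactly what a probability model must avoid.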

We need a model that outputs a value between 0 and 1, which can be interpreted as the **probability** of the outcome being 'Yes' (or 1).

This is where **Logistic Regression** comes in. It uses a special S-shaped function, the **Sigmoid function**, to squash the output of a linear equation into the open interval (0, 1): the output gets arbitrarily close to 0 and 1 but never reaches either.

## The Core Component: The Sigmoid Function

The sigmoid function is a mathematical function that takes any real number and maps it to a value between 0 and 1. 

The formula is:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Where `z` is the output of our linear equation (e.g., `z = mx + b`).

Let's visualize it.

In [None]:
from bokeh.io import output_notebook
output_notebook()

In [None]:
from bokeh.plotting import figure, show

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
sigma_z = sigmoid(z)

# Create a new plot with a title and axis labels
p = figure(title="The sigmoid function", x_axis_label='z', y_axis_label=r'$$\sigma(z)$$')
# Add the sigmoid curve
p.line(z, sigma_z, line_width=2)
# Add the 0.5 threshold as a horizontal line (y needs one value per point in z)
p.line(z, np.full_like(z, 0.5), legend_label="Threshold at 0.5", line_width=2, color='red')
# show the results
show(p)

In [None]:

# Reuse z and sigma_z computed in the previous cell
# Plot the sigmoid function with matplotlib
plt.figure(figsize=(10, 6))
plt.plot(z, sigma_z, color='blue')
plt.axhline(y=0.5, color='red', linestyle='--', label='Threshold at 0.5')
plt.axvline(x=0, color='gray', linestyle='--')
plt.xlabel('z (output of linear equation)')
plt.ylabel('Probability (σ(z))')
plt.title('The Sigmoid Function')
plt.legend()
plt.grid(True)
plt.show()

**Key Observations:**
- When `z` is large and positive, `e^-z` approaches 0, so `σ(z)` approaches 1.
- When `z` is large and negative, `e^-z` becomes very large, so `σ(z)` approaches 0.
- When `z = 0`, `e^0 = 1`, so `σ(z) = 1 / (1 + 1) = 0.5`.
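
These observations can be checked numerically. The snippet below is a small sketch that redefines the same `sigmoid` function so it runs on its own; the `z` values are arbitrary illustration points.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Evaluate the function at a few arbitrary points
for z_value in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z = {z_value:6.1f}  ->  sigma(z) = {sigmoid(z_value):.5f}")

# The curve is symmetric around (0, 0.5): sigma(-z) = 1 - sigma(z)
z_grid = np.linspace(-10, 10, 101)
is_symmetric = np.allclose(sigmoid(-z_grid), 1 - sigmoid(z_grid))
print("Symmetric around 0.5:", is_symmetric)
```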

This is perfect for modeling probability! The output of the sigmoid function, `σ(z)`, can be interpreted as the probability of the positive class (e.g., the probability that a tumor is malignant).

$$ P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \dots + \beta_nX_n)}} $$
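
One way to read this formula is sketched here as a short derivation; it is not taken from the material above. Write `p` for `P(Y=1 | X)` and `z` for the linear combination in the exponent. Inverting the sigmoid shows that the linear part models the log-odds of the positive class:

$$ p = \frac{1}{1 + e^{-z}} \;\Longrightarrow\; 1 - p = \frac{e^{-z}}{1 + e^{-z}} \;\Longrightarrow\; \frac{p}{1 - p} = e^{z} $$

$$ \log\frac{p}{1 - p} = z = \beta_0 + \beta_1X_1 + \dots + \beta_nX_n $$

Under this reading, each coefficient is the change in log-odds for a one-unit increase in its feature, with the other features held fixed.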

We typically set a **decision boundary** or **threshold** at 0.5. 
- If `P(Y=1) > 0.5`, we classify the outcome as 1 (Malignant).
- If `P(Y=1) <= 0.5`, we classify the outcome as 0 (Benign).
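
The thresholding rule can be sketched in a few lines. The scores in `z` are hypothetical values used only for illustration. Because the sigmoid is increasing and `σ(0) = 0.5`, comparing the probability to 0.5 is equivalent to checking the sign of `z`.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical linear scores for five observations (assumption, not real data)
z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probabilities = sigmoid(z)

# Apply the 0.5 threshold: class 1 only if the probability is strictly above it
threshold = 0.5
predicted_class = (probabilities > threshold).astype(int)

# Equivalent rule expressed directly on the linear score
same_as_sign_rule = (z > 0).astype(int)

print("Probabilities:  ", np.round(probabilities, 3))
print("Predicted class:", predicted_class)
print("Rules agree:    ", np.array_equal(predicted_class, same_as_sign_rule))
```

Note that an observation with `z = 0` sits exactly on the boundary and is assigned to class 0 under the `<=` convention used above.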

:::{exercise} Conceptual
Based on the sigmoid plot, if a new tumor size results in a `z` value of 5, would you classify it as benign or malignant? What about a `z` value of -2? Explain your reasoning.
:::

## Practical Example: Breast Cancer Tumor Classification

Let's build a logistic regression model to predict whether a breast cancer tumor is **malignant** or **benign**. We will use the Breast Cancer Wisconsin dataset, which is conveniently included in the `scikit-learn` library.

### Step 1: Load and Explore the Data

In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset
cancer_data = load_breast_cancer()

# Create a pandas DataFrame for easier manipulation
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = cancer_data.target # 0: malignant, 1: benign

print("Feature Names:", cancer_data.feature_names)
print("\nTarget Names:", cancer_data.target_names)
print("\nFirst 5 rows of the data:")
df.head()

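Our goal is to use the feature columns (like 'mean radius', 'mean texture', etc.) to predict the 'target' column. Note that in this dataset, `0` represents a malignant tumor and `1` represents a benign tumor. This is the reverse of the toy example above, so here `P(Y=1)` is the probability that a tumor is **benign**.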
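
Before modeling, it can help to check how many examples of each class the dataset contains. The following is a minimal sketch that reloads the dataset so it runs on its own; the variable names `name_by_code` and `counts` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
target = pd.Series(cancer_data.target, name="target")

# Map the integer codes (0, 1) back to their class names
name_by_code = {code: str(name) for code, name in enumerate(cancer_data.target_names)}
counts = target.map(name_by_code).value_counts()

print(counts)
print("\nClass shares:")
print((counts / counts.sum()).round(3))
```

The classes are not perfectly balanced, which is one reason to look beyond accuracy when evaluating the model.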

### Step 2: Split the Data

We need to split our data into a training set (to build the model) and a testing set (to evaluate its performance on unseen data).

In [None]:
from sklearn.model_selection import train_test_split

# Define our features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
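
A common variation, shown here as an optional sketch and not as the split used in the rest of this tutorial, is to pass `stratify=y` so that the training and testing sets keep roughly the same class proportions as the full dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and target as NumPy arrays so the block runs on its own
X, y = load_breast_cancer(return_X_y=True)

# Same 80/20 split as above, but stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Because benign is coded as 1, the mean of y is the share of benign tumors
print(f"Benign share overall: {y.mean():.3f}")
print(f"Benign share (train): {y_train.mean():.3f}")
print(f"Benign share (test):  {y_test.mean():.3f}")
```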

### Step 3: Train the Logistic Regression Model

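Now we'll use `scikit-learn`'s `LogisticRegression` class to train our model. The features are on very different scales, so we standardize them first: this helps the solver converge and keeps the default regularization from penalizing features unevenly. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.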

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

That's it! The model is now trained. The `.fit()` method estimated the coefficients (β values) that map our scaled input features to the probability of a tumor being benign (class `1`).
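
To connect the fitted model back to the sigmoid formula, the sketch below retrains the same model on the same split and recomputes the predicted probabilities by hand from `model.coef_` and `model.intercept_`. It assumes the default binary setup used above; the names `z`, `manual_proba`, and `sklearn_proba` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the same pipeline as above so this block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# One coefficient per feature, plus a single intercept
print("Coefficient array shape:", model.coef_.shape)
print("Intercept:", model.intercept_)

# Linear score z = beta_0 + beta_1*x_1 + ... + beta_n*x_n, then the sigmoid
z = X_test_scaled @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1 / (1 + np.exp(-z))

# Column 1 of predict_proba is the probability of class 1 (benign)
sklearn_proba = model.predict_proba(X_test_scaled)[:, 1]
print("Manual and scikit-learn probabilities match:", np.allclose(manual_proba, sklearn_proba))
```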

### Step 4: Evaluate the Model

How well did our model do? We'll make predictions on our held-out test set and compare them to the actual outcomes.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Make predictions on the test data
y_pred = model.predict(X_test_scaled)

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Display the Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer_data.target_names))

# Display the Confusion Matrix
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=cancer_data.target_names, 
            yticklabels=cancer_data.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

#### Interpreting the Results

- **Accuracy**: The fraction of test-set predictions that are correct. It is a useful summary, but it does not tell us which kind of mistake the model makes.
- **Classification Report** (one row per class; the descriptions below use the 'malignant' row as the example):
    - **Precision**: Of all the tumors we *predicted* as malignant, how many actually were? Low precision means many false alarms.
    - **Recall (Sensitivity)**: Of all the tumors that *truly were* malignant, how many did we correctly identify? Low recall means many missed cases. This is often a critical metric in medical diagnostics.
    - **F1-Score**: The harmonic mean of precision and recall.
- **Confusion Matrix**: A visual breakdown of our predictions, with true labels on the rows and predicted labels on the columns. The classes are ordered `0` (malignant), then `1` (benign), and the names below follow scikit-learn's convention of treating class `1` (benign) as "positive".
    - **Top-Left**: True Negatives (Was Malignant, Predicted Malignant)
    - **Top-Right**: False Positives (Was Malignant, Predicted Benign)
    - **Bottom-Left**: False Negatives (Was Benign, Predicted Malignant)
    - **Bottom-Right**: True Positives (Was Benign, Predicted Benign)
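
The relationship between the confusion matrix and the metrics can be sketched on a tiny example. The labels below are hypothetical (ten made-up tumors, not model output), and `y_hat` is a name introduced here for the made-up predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for ten tumors (0 = malignant, 1 = benign)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes, both ordered [0, 1]
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Recall for the malignant class: correctly flagged malignant / all truly malignant
recall_malignant = cm[0, 0] / cm[0].sum()
print(f"Recall (malignant), by hand:      {recall_malignant:.2f}")
print(f"Recall (malignant), scikit-learn: {recall_score(y_true, y_hat, pos_label=0):.2f}")
```

Passing `pos_label=0` tells `recall_score` to treat malignant as the class of interest, which matches the 'malignant' row of the classification report.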

:::{exercise} Interpretation

Looking at the confusion matrix generated above:
1. How many benign tumors were incorrectly classified as malignant (False Negatives, taking benign as the positive class)?
2. Why might recall be a more important metric than precision for the 'malignant' class in this specific medical context?
:::

## Final Exercises

:::{exercise} Impact of Test Size

Go back to **Step 2: Split the Data**. Change the `test_size` from `0.2` to `0.3` (meaning 30% of the data will be used for testing). Re-run all the subsequent cells. Did the model's accuracy on the test set change? Why do you think this might happen?
:::

:::{exercise} Predicting a Single Observation

Imagine you have a new tumor with the characteristics of the first row of our original dataset. Use the trained `model` and `scaler` to predict whether this single tumor is malignant or benign. 

**Hint**: You will need to select the first row from `X` as a two-dimensional object (for example with `X.iloc[[0]]`), scale it with the already fitted `scaler`, and then use `model.predict()` and `model.predict_proba()`.
:::

In [None]:
# Base code

# Get the first sample from the original (unscaled) dataset X
single_sample = X.iloc[[0]] # Using [[0]] keeps it as a DataFrame

# Scale the sample using the FITTED scaler
# Your code here

# Make a prediction (0 or 1)
# Your code here

# Get the probabilities
# Your code here

# print(f"The predicted class is: {prediction[0]}")
# print(f"The probability of each class is: {probabilities}")

## Where to Find Interactive Complementary Material

Reading code and text is great, but interacting with visualizations can deepen your understanding. Here are some excellent resources:

*   **ML-Playground's Logistic Regression**: An interactive playground where you can add data points and see how the logistic curve and decision boundary adapt instantly. [Link](http://ml-playground.com/#logistic_regression)
*   **Explained Visually - Logistic Regression**: A beautiful, scrolling visual explanation of the core concepts. [Link](https://setosa.io/ev/logistic-regression/)
*   **R2D3.us - A Visual Introduction to Machine Learning**: While this covers more than just logistic regression, Part 2 provides an excellent visual and interactive breakdown of classification that is highly relevant. [Link](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)