# An Introduction to Logistic Regression

## A Brief History of Logistic Regression

Logistic Regression is often one of the first classification algorithms that people learn in machine learning, but its story begins over a century earlier in the field of statistics and demography. It wasn't invented by a computer scientist for a machine learning task, but rather evolved over time to solve problems related to understanding relationships in data.

Here is a brief timeline of its key developments:

*   **Early 19th Century (1830s-1840s): The Logistic Function**
    *   The mathematical foundation, the **logistic function** (also known as the sigmoid function), was first described by the Belgian mathematician **Pierre François Verhulst**.
    *   He used it not for classification, but to model population growth. The S-shaped curve was a perfect fit for describing how a population grows rapidly at first, then slows down as it approaches a carrying capacity or resource limit. This work laid the mathematical groundwork for what would come later.

*   **Mid-20th Century (1944): The "Logit" is Born**
    *   The term **"logit"** was coined by the statistician **Joseph Berkson**. The logit function is the core of logistic regression; it's the natural logarithm of the odds: $$ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) $$
    *   Berkson proposed that this transformation could be used to linearize the relationship between predictor variables and a binary (yes/no) outcome, making it suitable for analysis with methods similar to linear regression. His work was primarily in the field of biostatistics.

*   **Mid-20th Century (1958): Formalization as a Regression Model**
    *   The British statistician **David Cox** is widely credited with popularizing and formalizing logistic regression as we know it today.
    *   His influential 1958 paper, "The Regression Analysis of Binary Sequences," detailed how to use the logit model to analyze the relationship between a set of explanatory variables and a binary dependent variable. This cemented its place as a fundamental tool in statistical analysis, especially in epidemiology and medical research.

*   **Late 20th Century (1970s-1990s): Adoption into Machine Learning**
    *   As the field of machine learning grew out of computer science and statistics, logistic regression was naturally adopted as a simple, efficient, and highly interpretable classification algorithm.
    *   Its inclusion within the framework of **Generalized Linear Models (GLMs)** in the 1970s provided a strong theoretical foundation.
    *   It became (and remains) a crucial **baseline model**—a simple model to which more complex models (like SVMs, Random Forests, or Neural Networks) are compared.

*   **Present Day: A Foundational Concept**
    *   Today, logistic regression is still one of the most widely used algorithms in both statistics and machine learning. Its importance also extends to being a building block for more complex models. A single neuron in a neural network using a sigmoid activation function is, in essence, performing logistic regression. This makes understanding it fundamental to understanding deep learning.

## Motivation: Why Not Linear Regression for Classification?

Linear regression is excellent for predicting continuous values, like the price of a house or the temperature tomorrow. But what if we want to predict a categorical outcome? For example:

- Will a patient test positive or negative for a disease?
- Is an email spam or not spam?
- Is a tumor malignant or benign?

These are **classification problems** with binary outcomes (Yes/No, 1/0, True/False).

Let's see what happens if we try to fit a linear regression model to a binary outcome. Imagine we have data on tumor size and whether it's malignant (1) or benign (0).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate some sample data
np.random.seed(0)
tumor_size = np.random.normal(10, 5, 100)
is_malignant = (tumor_size + np.random.normal(0, 3, 100) > 12).astype(int)

In [None]:
# Fit a linear regression line
m, b = np.polyfit(tumor_size, is_malignant, 1)

# Plot the data and the line
plt.figure(figsize=(10, 6))
plt.scatter(tumor_size, is_malignant, label='Data (0=Benign, 1=Malignant)', alpha=0.7)
plt.plot(tumor_size, m*tumor_size + b, color='red', label='Linear Regression Fit')
plt.axhline(y=0, color='gray', linestyle='--')
plt.axhline(y=1, color='gray', linestyle='--')
plt.xlabel('Tumor Size')
plt.ylabel('Malignant (1) or Benign (0)')
plt.title('Why Linear Regression Fails for Classification')
plt.legend()
plt.show()

As you can see, the linear regression line extends beyond 0 and 1. How would we interpret a prediction of 1.5? Or -0.5? It doesn't make sense as a probability. 

We need a model that outputs a value between 0 and 1, which can be interpreted as the **probability** of the outcome being 'Yes' (or 1).

This is where **Logistic Regression** comes in. It uses a special S-shaped function, the **Sigmoid function**, to squash the output of a linear equation into the open interval (0, 1): the output gets arbitrarily close to 0 and 1 but never reaches them.

## The Core Component: The Sigmoid Function

The sigmoid function is a mathematical function that takes any real number and maps it to a value between 0 and 1. 

The formula is:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Where `z` is the output of our linear equation (e.g., `z = mx + b`).

Let's visualize it.

In [None]:
from bokeh.io import output_notebook
output_notebook()

In [None]:
from bokeh.plotting import figure, show

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# prepare some data
z = np.linspace(-10, 10, 100)
sigma_z = sigmoid(z)

# create a new plot with a title and axis labels
p = figure(title="The sigmoid function", x_axis_label='z', y_axis_label=r'$$\sigma(z)$$')
# add a line renderer with line thickness to the plot
p.line(z, sigma_z, line_width=2)
# horizontal reference line at 0.5 (y is given the same length as z)
p.line(z, np.full_like(z, 0.5), legend_label="Threshold at 0.5", line_width=2, color='red')
# show the results
show(p)

**Key Observations:**
- When `z` is large and positive, $e^{-z}$ approaches 0, so `σ(z)` approaches 1.
- When `z` is large and negative, $e^{-z}$ becomes very large, so `σ(z)` approaches 0.
- When `z = 0`, $e^0 = 1$, so $\sigma(z) = 1 / (1 + 1) = 0.5$.

This is perfect for modeling probability! The output of the sigmoid function, `σ(z)`, can be interpreted as the probability of the positive class (e.g., the probability that a tumor is malignant).

$$ P(Y=1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + ... + \beta_nX_n)}} $$

We typically set a **decision boundary** or **threshold** at 0.5. 
- If `P(Y=1) > 0.5`, we classify the outcome as 1 (Malignant).
- If `P(Y=1) <= 0.5`, we classify the outcome as 0 (Benign).
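
The thresholding rule above can be sketched in a few lines. The coefficients `beta_0` and `beta_1` and the tumor sizes are made-up values chosen only for illustration; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model: z = beta_0 + beta_1 * x
beta_0, beta_1 = -6.0, 0.5

tumor_sizes = np.array([4.0, 10.0, 12.0, 14.0, 20.0])  # illustrative values
probs = sigmoid(beta_0 + beta_1 * tumor_sizes)

# Classify as 1 only when the probability is strictly above the threshold
threshold = 0.5
labels = (probs > threshold).astype(int)

for size, p, label in zip(tumor_sizes, probs, labels):
    print(f"size={size:5.1f}  P(Y=1)={p:.3f}  class={label}")
```

With these numbers, a tumor of size 12 sits exactly on the decision boundary (`z = 0`, probability 0.5) and is therefore assigned to class 0 under the rule stated above.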


## What is the cost function?
In logistic regression, we are comparing probabilities, so ratios are more informative than differences. Furthermore, once the sigmoid is part of the model, the mean squared error (MSE) cost is no longer a simple convex paraboloid in the parameters, so measuring "distances" between predictions and labels is not a good fit. Because each outcome is binary (0 or 1), it follows a Bernoulli distribution; we are therefore fitting a probability distribution, and we do so by maximizing the likelihood of the parameters.

### The **Likelihood Function**

Suppose you have data and a statistical model with parameters. The **likelihood function** tells you how probable your observed data are under the model, viewed as a function of the parameter values.

* If your model is $P(y \mid \theta)$, the probability of observing outcome $y$ given parameters $\theta$,
* and you have a dataset $y_1, y_2, \dots, y_n$,
* then the **likelihood function** is

$$
L(\theta) = \prod_{i=1}^n P(y_i \mid \theta).
$$

This product appears because we assume the data points are independent.

Instead of maximizing this product directly (which gets very small), we usually maximize its **logarithm**:

$$
\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(y_i \mid \theta).
$$

This is called the **log-likelihood**.
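
To see why the product "gets very small", here is a small numerical sketch. The per-observation probabilities are random placeholders (an assumption for illustration), not the output of any fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-observation probabilities P(y_i | theta), all in (0.1, 0.9)
probs = rng.uniform(0.1, 0.9, size=2000)

likelihood = np.prod(probs)             # product of 2000 numbers below 1
log_likelihood = np.sum(np.log(probs))  # the same quantity on the log scale

print(f"Likelihood:     {likelihood}")
print(f"Log-likelihood: {log_likelihood:.2f}")
```

In double precision the product underflows to exactly zero, while the sum of logarithms remains an ordinary finite number that an optimizer can work with.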

### Application to **Logistic Regression**

In logistic regression, we model the probability that $y_i = 1$ given predictors $x_i$ as:

$$
P(y_i = 1 \mid x_i, \beta) = \sigma(x_i^\top \beta) = \frac{1}{1 + e^{-x_i^\top \beta}},
$$

and

$$
P(y_i = 0 \mid x_i, \beta) = 1 - \sigma(x_i^\top \beta).
$$

So the probability of observing $y_i$ is:

$$
P(y_i \mid x_i, \beta) = \big[\sigma(x_i^\top \beta)\big]^{y_i} \cdot \big[1 - \sigma(x_i^\top \beta)\big]^{1-y_i}.
$$

This formula works because if $y_i = 1$, only the first factor remains (the second has exponent 0 and equals 1); if $y_i = 0$, only the second factor remains.
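
A quick check of this compact formula, as a minimal sketch: `bernoulli_pmf` is a helper name introduced here for illustration, and `p = 0.8` is an arbitrary stand-in for $\sigma(x_i^\top \beta)$.

```python
def bernoulli_pmf(y, p):
    """P(y | p) = p**y * (1 - p)**(1 - y) for a label y in {0, 1}."""
    return p**y * (1 - p)**(1 - y)

p = 0.8  # hypothetical predicted probability of the positive class

# y = 1: the second factor has exponent 0, so the result is p
print("P(y=1):", bernoulli_pmf(1, p))
# y = 0: the first factor has exponent 0, so the result is 1 - p
print("P(y=0):", bernoulli_pmf(0, p))
```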

### Likelihood for the whole dataset

Since data points are independent:

$$
L(\beta) = \prod_{i=1}^n \big[\sigma(x_i^\top \beta)\big]^{y_i} \cdot \big[1 - \sigma(x_i^\top \beta)\big]^{1-y_i}.
$$

Taking logs:

$$
\ell(\beta) = \sum_{i=1}^n \Big[ y_i \log \sigma(x_i^\top \beta) + (1-y_i)\log (1 - \sigma(x_i^\top \beta)) \Big].
$$
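This is the **log-likelihood** of the logistic regression model (not to be confused with the logit, which is the log-odds).

With this, we can define the loss (or cost) function as
\begin{equation}
Cost = -\ell (\beta)  = -\sum_{i=1}^n \big[y_i\log(\hat y_i) + (1-y_i)\log(1-\hat y_i)\big],
\end{equation}
where $\hat y_i = \sigma(x_i^\top \beta)$. The minus sign makes the cost positive, so that maximizing the likelihood becomes a minimization problem. Minimizing this cost pushes the predicted probability of each observation's true label towards 1. For instance, if the label is 1 but the predicted probability for it is 0.1, the penalty is large ($-\log 0.1 \approx 2.3$); the same holds in the opposite case. The cost is also convex and differentiable in $\beta$, so it works well with gradient descent.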
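
The size of the penalty can be made concrete with a short sketch. It evaluates the negative log-likelihood of a single observation, following the log-likelihood derived above; `sample_cost` is a name introduced here for illustration.

```python
import math

def sample_cost(y, y_hat):
    """Negative log-likelihood of one observation with label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# The true label is 1 in every case; only the predicted probability changes
for y_hat in (0.9, 0.5, 0.1):
    print(f"y=1, y_hat={y_hat}: cost = {sample_cost(1, y_hat):.3f}")
```

A confident wrong prediction costs far more than a confident correct one, and the cost grows without bound as the predicted probability of the true label approaches 0.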

### Why not least squares?

* In linear regression, minimizing squared error is justified because it coincides with maximum likelihood when the residuals are assumed Gaussian.
* But in logistic regression, the outcomes are binary (0/1), not continuous, so the Gaussian assumption does not hold.
* Squared error isn't statistically justified here. Instead, we use **maximum likelihood**, because it directly models the probability of binary outcomes using the Bernoulli distribution.

**Summary**:

* The **likelihood function** measures how probable the observed data is under your model.
* **Maximum likelihood estimation (MLE)** finds the parameters that maximize this probability.
* Logistic regression is naturally derived via MLE, since binary outcomes follow a **Bernoulli distribution**, and maximizing the likelihood leads to the standard logistic regression log-loss function used in practice.


### Gradient descent implementation
For the gradient descent update ("backward propagation"), we have
\begin{equation}
\beta_i' = \beta_i - \alpha \frac{\partial Cost}{\partial \beta_i}. 
\end{equation}
Given that the cost function is a function of $\hat y$, and $\hat y = 1/(1 + \exp(-z))$, with $z= \beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n$, then, by using the chain rule, we have
\begin{align}
\frac{\partial Cost}{\partial \beta_i} &= \frac{\partial Cost}{\partial \hat y}\frac{\partial \hat y}{\partial z}\frac{\partial z}{\partial \beta_i}\\
\frac{\partial Cost}{\partial \beta_i} &= (\hat y - y) X_i,
\end{align}
(for $\beta_0$ it is only $\hat y - y$). These are the contributions of a single sample; summing (or averaging) them over the dataset gives the gradient used to update the parameters.
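
One way to gain confidence in the expression $(\hat y - y) X_i$ without doing the algebra is to compare it against finite differences. The sketch below does this on synthetic data; the function names and the data are assumptions made for illustration, and the intercept is handled by a leading column of ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta, X, y):
    """Negative log-likelihood summed over all samples."""
    y_hat = sigmoid(X @ beta)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_gradient(beta, X, y):
    """Sum over samples of (y_hat - y) * x, one entry per parameter."""
    return X.T @ (sigmoid(X @ beta) - y)

def numerical_gradient(beta, X, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(beta)
    for i in range(beta.size):
        step = np.zeros_like(beta)
        step[i] = eps
        grad[i] = (cost(beta + step, X, y) - cost(beta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
# Synthetic data; the column of ones multiplies beta_0
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (rng.uniform(size=50) < 0.5).astype(float)
beta = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_gradient(beta, X, y) - numerical_gradient(beta, X, y)))
print(f"Largest difference between the two gradients: {max_diff:.2e}")
```

A small difference suggests the analytic formula is right; it does not replace the derivation, but it is a cheap sanity check before writing a training loop.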

In the implementation below, the bias corresponds to $b = \beta_0$ and the weights to $\vec w = (\beta_1, \beta_2, \ldots, \beta_n)$.

:::{exercise}
Derive analytically the previous expressions for the gradient and the parameter updates.
:::

This is an example of a manual implementation of gradient descent:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

class LogisticRegression:
    def __init__(self, learning_rate=0.01, max_iterations=1000):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.weights = None
        self.bias = None
        self.costs = []
    
    def sigmoid(self, z):
        """Sigmoid activation function"""
        # Clip z to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        """Train the logistic regression model"""
        # Initialize parameters
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Gradient descent
        for i in range(self.max_iterations):
            # Forward pass: compute predictions
            z = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(z)
            
            # Compute cost (for monitoring)
            cost = self.compute_cost(y, y_pred)
            self.costs.append(cost)
            
            # Compute gradients
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)
            
            # Update parameters using gradient descent
            self.weights = self.weights - self.learning_rate * dw
            self.bias = self.bias - self.learning_rate * db
            
            # Print progress
            if i % 100 == 0:
                print(f"Cost after iteration {i}: {cost:.4f}")
    
    def compute_cost(self, y_true, y_pred):
        """Compute cross-entropy cost"""
        m = len(y_true)
        # Avoid log(0) by clipping predictions
        y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
        cost = -(1/m) * np.sum(y_true * np.log(y_pred) + (1-y_true) * np.log(1-y_pred))
        return cost
    
    def predict(self, X):
        """Make predictions"""
        z = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(z)
        return (y_pred >= 0.5).astype(int)
    
    def predict_proba(self, X):
        """Return prediction probabilities"""
        z = np.dot(X, self.weights) + self.bias
        return self.sigmoid(z)

# Example usage:
if __name__ == "__main__":
    # Generate sample data
    np.random.seed(42)
    X = np.random.randn(100, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    
    # Create and train model
    model = LogisticRegression(learning_rate=0.1, max_iterations=1000)
    model.fit(X, y)
    
    # Make predictions
    predictions = model.predict(X)
    probabilities = model.predict_proba(X)
    
    print(f"Final weights: {model.weights}")
    print(f"Final bias: {model.bias}")
    print(f"Accuracy: {np.mean(predictions == y):.4f}")
    
    # Plot cost function
    plt.figure(figsize=(10, 6))
    plt.plot(model.costs)
    plt.title('Cost Function Over Training')
    plt.xlabel('Iterations')
    plt.ylabel('Cost')
    plt.show()


:::{exercise} SGDClassifier
Implement the same model using scikit-learn's `SGDClassifier` (use `partial_fit` to control the number of iterations). Plot the loss as a function of the iterations.
:::

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss


class LogisticRegressionSGD:
    def __init__(self, learning_rate=0.01, max_iterations=1000):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.model = None
        self.costs = []

    def fit(self, X, y):
        # YOUR CODE HERE
        raise NotImplementedError("Implement fit() with SGDClassifier.partial_fit")


    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)[:, 1]


# Example usage:
if __name__ == "__main__":
    np.random.seed(42)
    X = np.random.randn(100, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    model_sgd = LogisticRegressionSGD(learning_rate=0.1, max_iterations=1000)
    model_sgd.fit(X, y)

    predictions = model_sgd.predict(X)
    probabilities = model_sgd.predict_proba(X)

    print(f"Accuracy: {np.mean(predictions == y):.4f}")

    # Plot cost function
    plt.figure(figsize=(10, 6))
    plt.plot(model_sgd.costs, label="SGDClassifier")
    plt.title("Cost Function Over Training (SGD)")
    plt.xlabel("Iterations")
    plt.ylabel("Cost (Log-loss)")
    plt.legend()
    plt.show()


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

# -------------------------
# 1. Batch Gradient Descent
# -------------------------
class LogisticRegressionBatch:
    def __init__(self, learning_rate=0.01, max_iterations=1000):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.weights = None
        self.bias = None
        self.costs = []

    def sigmoid(self, z):
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for i in range(self.max_iterations):
            # Predictions
            z = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(z)

            # Compute cost
            cost = log_loss(y, y_pred)
            self.costs.append(cost)

            # Gradients
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)

            # Update
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

# -------------------------
# 2. SGD (stochastic, batch_size=1)
# -------------------------
def run_sgd(X, y, learning_rate=0.01, max_iterations=1000, batch_size=1):
    n_samples = X.shape[0]
    model = SGDClassifier(
        loss="log_loss",
        learning_rate="constant",
        eta0=learning_rate,
        max_iter=1,
        tol=None,
        random_state=42,
        shuffle=True
    )

    classes = np.unique(y)
    costs = []

    for _ in range(max_iterations):
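        # Select a random mini-batch. Note that partial_fit still applies
        # one SGD update per sample inside the batch, not an averaged gradient.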
        idx = np.random.choice(n_samples, batch_size, replace=False)
        X_batch, y_batch = X[idx], y[idx]

        model.partial_fit(X_batch, y_batch, classes=classes)

        # Compute global log-loss on full data
        y_pred_proba = model.predict_proba(X)[:, 1]
        cost = log_loss(y, y_pred_proba)
        costs.append(cost)

    return model, costs

# -------------------------
# 3. Run comparison
# -------------------------
if __name__ == "__main__":
    np.random.seed(42)
    X = np.random.randn(200, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    max_iter = 2000

    # Batch GD
    batch_model = LogisticRegressionBatch(learning_rate=0.1, max_iterations=max_iter)
    batch_model.fit(X, y)

    # Stochastic GD (batch size = 1)
    _, sgd_costs = run_sgd(X, y, learning_rate=0.1, max_iterations=max_iter, batch_size=1)

    # Mini-batch GD (batch size = 32)
    _, mb_costs = run_sgd(X, y, learning_rate=0.1, max_iterations=max_iter, batch_size=32)

    # -------------------------
    # Plot comparison
    # -------------------------
    plt.figure(figsize=(10, 6))
    plt.plot(batch_model.costs, label="Batch GD", linewidth=2)
    plt.plot(sgd_costs, label="Stochastic GD (batch=1)", alpha=0.7)
    plt.plot(mb_costs, label="Mini-Batch GD (batch=32)", alpha=0.7)
    plt.xlabel("Iterations")
    plt.ylabel("Log-loss (Cost)")
    plt.title("Comparison of Gradient Descent Variants")
    plt.legend()
    plt.grid(True)
    plt.show()


## Practical Example: Breast Cancer Tumor Classification

Let's build a logistic regression model to predict whether a breast cancer tumor is **malignant** or **benign**. We will use the Breast Cancer Wisconsin dataset, which is conveniently included in the `scikit-learn` library.

### Step 1: Load and Explore the Data

In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset
cancer_data = load_breast_cancer()

# Create a pandas DataFrame for easier manipulation
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = cancer_data.target # 0: malignant, 1: benign

print("Feature Names:", cancer_data.feature_names)
print("\nTarget Names:", cancer_data.target_names)
print("\nFirst 5 rows of the data:")
df.head()

Our goal is to use the feature columns (like 'mean radius', 'mean texture', etc.) to predict the 'target' column. Note that in this dataset, `0` represents a malignant tumor and `1` represents a benign tumor.

### Step 2: Split the Data

We need to split our data into a training set (to build the model) and a testing set (to evaluate its performance on unseen data).

In [None]:
from sklearn.model_selection import train_test_split

# Define our features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

### Step 3: Train the Logistic Regression Model

Now we'll use `scikit-learn`'s `LogisticRegression` class to train our model. For numerical stability, it's often a good idea to scale our features first.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the model
model = LogisticRegression(n_jobs=1)
model.fit(X_train_scaled, y_train)

That's it! The model is now trained. The `.fit()` method found the best coefficients (β values) to map our input features to the probability of a tumor being benign.
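
To connect the fitted model back to the formula for $P(Y=1 \mid X)$, the sketch below rebuilds the probabilities by hand from `coef_` and `intercept_` and compares them with `predict_proba`. It is self-contained and uses its own variable names (`X_all`, `clf`, `demo_scaler`, and so on, all chosen here for illustration) so that it does not overwrite the notebook's `X`, `model`, or `scaler`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

clf = LogisticRegression().fit(X_tr_scaled, y_tr)

# Linear score z = beta_0 + sum_j beta_j * x_j, built from the fitted parameters
z = X_te_scaled @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is P(y = 1), i.e. the probability of "benign"
sklearn_proba = clf.predict_proba(X_te_scaled)[:, 1]
matches = np.allclose(manual_proba, sklearn_proba)
print("Manual sigmoid matches predict_proba:", matches)
```

Because the features were standardized, the magnitudes of the entries in `clf.coef_` are roughly comparable across features, which is one reason scaling helps with interpretation as well as with convergence.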

### Step 4: Evaluate the Model

How well did our model do? We'll make predictions on our held-out test set and compare them to the actual outcomes.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Make predictions on the test data
y_pred = model.predict(X_test_scaled)

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Display the Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer_data.target_names))

# Display the Confusion Matrix
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=cancer_data.target_names, 
            yticklabels=cancer_data.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

#### Interpreting the Results

- **Accuracy**: The overall percentage of correct predictions. Our model is highly accurate!
- **Classification Report**: 
    - **Precision**: Of all the tumors we *predicted* as malignant, how many actually were? (Measures false positives).
    - **Recall (Sensitivity)**: Of all the tumors that *truly were* malignant, how many did we correctly identify? (Measures false negatives). This is often a critical metric in medical diagnostics.
    - **F1-Score**: The harmonic mean of precision and recall.
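- **Confusion Matrix**: A visual breakdown of our predictions. Rows are true labels and columns are predicted labels. Because label `1` (benign) is the positive class in this dataset, "positive" below means benign and "negative" means malignant.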
    - **Top-Left**: True Negatives (Predicted Malignant, Was Malignant)
    - **Top-Right**: False Positives (Predicted Benign, Was Malignant)
    - **Bottom-Left**: False Negatives (Predicted Malignant, Was Benign)
    - **Bottom-Right**: True Positives (Predicted Benign, Was Benign)
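
The metric definitions above can be written out directly from the four cells of a confusion matrix. The counts below are hypothetical (they are not the output of the model trained above), and the sketch takes "malignant" as the class of interest, which corresponds to the `malignant` row of the classification report.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true label, columns = predicted label),
# ordered as [malignant, benign]; these counts are made up for illustration.
cm_example = np.array([[40, 3],
                       [2, 69]])

# Treat "malignant" (row/column 0) as the class of interest
tp = cm_example[0, 0]  # malignant predicted as malignant
fn = cm_example[0, 1]  # malignant predicted as benign (missed cancers)
fp = cm_example[1, 0]  # benign predicted as malignant (false alarms)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```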

:::{exercise} Interpretation

Looking at the confusion matrix generated above:
1. How many benign tumors were incorrectly classified as malignant (False Negatives)?
2. Why might recall be a more important metric than precision for the 'malignant' class in this specific medical context?
3. Compute the metrics of the report by hand: precision, recall, and F1-score.
:::

## Final Exercises

:::{exercise} Impact of Test Size

Go back to **Step 2: Split the Data**. Change the `test_size` from `0.2` to `0.3` (meaning 30% of the data will be used for testing). Re-run all the subsequent cells. Did the model's accuracy on the test set change? Why do you think this might happen?
:::

:::{exercise} Predicting a Single Observation

Imagine you have a new tumor with the characteristics of the first row of our original dataset. Use the trained `model` and `scaler` to predict whether this single tumor is malignant or benign. 

**Hint**: You will need to select the first row from `X`, reshape it, scale it, and then use `model.predict()` and `model.predict_proba()`.
:::

In [None]:
# Base code

# Get the first sample from the original (unscaled) dataset X
single_sample = X.iloc[[0]] # Using [[0]] keeps it as a DataFrame

# Scale the sample using the FITTED scaler
# Your code here

# Make a prediction (0 or 1)
# Your code here

# Get the probabilities
# Your code here

# print(f"The predicted class is: {prediction[0]}")
# print(f"The probability of each class is: {probabilities}")