# DATASCI 315, Group Work 3: Logistic Regression and Maximum Likelihood

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. **During lab, feel free to flag down your GSI to ask questions at any point!** Upon completion, one member of the team should submit their team's work through Canvas **as HTML**.

In [None]:
import matplotlib.pyplot as plt
import torch

In binary classification problems, responses are typically modeled as Bernoulli random variables. Recall that the Bernoulli distribution has the following probability mass function:
$$
P(y;p) = p^{y} \cdot (1-p)^{1-y},
$$
where $y \in \{0, 1\}$ and $p \in [0, 1]$ is the rate parameter.
You can easily verify that $P(y=1;p)=p$ and $P(y=0;p)=1-p$; the probability of observing event $y=1$ is $p$ and $y=0$ is $1-p$.
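
As a quick sanity check of the formula, the sketch below compares a direct evaluation of the PMF with PyTorch's built-in `torch.distributions.Bernoulli`. The values of `p_demo` and `y_demo` are arbitrary illustrations, not part of the assignment.

```python
import torch
from torch.distributions import Bernoulli

p_demo = torch.tensor(0.2)          # illustrative rate parameter
y_demo = torch.tensor([0.0, 1.0])   # both possible outcomes

# Direct evaluation of the PMF: p^y * (1 - p)^(1 - y)
pmf_manual = p_demo**y_demo * (1 - p_demo) ** (1 - y_demo)

# Same quantity from PyTorch: exponentiate the log-probability
pmf_builtin = Bernoulli(probs=p_demo).log_prob(y_demo).exp()

print(pmf_manual)   # probabilities of y=0 and y=1
print(pmf_builtin)
```

Both lines should agree up to floating-point error, with $1-p$ in the first position and $p$ in the second.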

## Problem 1: Bernoulli Distribution

Write a function that returns the Bernoulli probability for a given response $y$ and rate $p$.

In [None]:
# Return probability under Bernoulli distribution for observed class y
def bernoulli_distribution(y, prob):
    # BEGIN SOLUTION
    # Apply the Bernoulli PMF formula: P(y;p) = p^y * (1-p)^(1-y)
    return prob**y * (1 - prob) ** (1 - y)
    # END SOLUTION

In [None]:
# Test assertions
# Test cases for bernoulli_distribution
assert abs(bernoulli_distribution(0, 0.2) - 0.8) < 1e-9, "P(y=0; p=0.2) should be 0.8"
assert abs(bernoulli_distribution(1, 0.2) - 0.2) < 1e-9, "P(y=1; p=0.2) should be 0.2"
assert abs(bernoulli_distribution(1, 0.7) - 0.7) < 1e-9, "P(y=1; p=0.7) should be 0.7"
assert abs(bernoulli_distribution(0, 0.7) - 0.3) < 1e-9, "P(y=0; p=0.7) should be 0.3"

# BEGIN HIDDEN TESTS
assert True  # placeholder hidden test
# END HIDDEN TESTS

print("All bernoulli_distribution tests passed!")

## Problem 2: Likelihood

The previous problem shows how to compute the probability of a single event.
The likelihood of the data is the product of all individual probabilities:
$$
\mathrm{Likelihood} = \prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i}
$$

Write a function that computes the likelihood using the given data.
You should use broadcasting to compute the likelihood.
For loops will not be accepted as a correct answer.
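
To make the product concrete, here is a minimal sketch on three made-up observations. It uses `torch.where` to select $p_i$ or $1-p_i$ for each observation, which is only one of several vectorized ways to write the same computation.

```python
import torch

# Made-up example: three observations and their success probabilities
y_small = torch.tensor([1.0, 0.0, 1.0])
p_small = torch.tensor([0.9, 0.2, 0.6])

# For each observation, pick p_i when y_i = 1 and (1 - p_i) when y_i = 0
per_obs = torch.where(y_small == 1, p_small, 1 - p_small)

# The likelihood is the product of the per-observation probabilities
likelihood_small = torch.prod(per_obs)

print(per_obs)            # per-observation probabilities: 0.9, 0.8, 0.6
print(likelihood_small)   # approximately 0.432
```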

In [None]:
# Return the likelihood of all of the data under the model
def compute_likelihood(y_train, prob):
    # TODO: Compute the likelihood of the data
    # You should use torch.prod() and your bernoulli_distribution function above
    # BEGIN SOLUTION
    # Compute individual probabilities using Bernoulli PMF, then take product
    individual_probs = bernoulli_distribution(y_train, prob)
    return torch.prod(individual_probs)
    # END SOLUTION

In [None]:
# Test assertions
# Test cases for compute_likelihood
prob_test = torch.tensor(
    [
        0.09291784,
        0.46809093,
        0.93089486,
        0.67612654,
        0.73441752,
        0.86847339,
        0.49873225,
        0.51083168,
        0.18343972,
        0.99380898,
        0.27840809,
        0.38028817,
        0.12055708,
        0.56715537,
        0.92005746,
        0.77072270,
        0.85278176,
        0.05315950,
        0.87168699,
        0.58858043,
    ]
)
y_train_test = torch.tensor(
    [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1], dtype=torch.float32
)
likelihood = compute_likelihood(y_train_test, prob_test)

assert abs(likelihood - 0.000069919) < 1e-9, f"Expected ~0.000069919, got {likelihood}"
assert compute_likelihood(torch.tensor([1.0]), torch.tensor([0.5])) == 0.5, "Single obs failed"

# BEGIN HIDDEN TESTS
assert True  # placeholder hidden test
# END HIDDEN TESTS

print("All compute_likelihood tests passed!")

## Problem 3: Negative Log-Likelihood

Likelihood is conceptually important, but it is difficult to represent in floating point because the product quickly underflows toward zero as the number of data points grows. It is therefore better to work with the negative log-likelihood, which turns the product into a sum:
$$
\mathrm{nLL} = -\sum_{i=1}^n \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]
$$
This is also called the cross-entropy loss.

Write a function that computes the negative log-likelihood of the given data.
Do not simply wrap the answer from the previous problem in a log, since that would inherit the same underflow problem.
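
The underflow issue is easy to reproduce. The sketch below uses made-up data (2000 observations, each with $p_i = 0.5$ and $y_i = 1$) to contrast the raw product with the sum of logs.

```python
import math
import torch

n_obs = 2000
prob_flat = torch.full((n_obs,), 0.5)   # every p_i = 0.5 (made-up data)
y_ones = torch.ones(n_obs)              # every y_i = 1

# Product of 2000 factors of 0.5 is far below the smallest representable float
likelihood_flat = torch.prod(prob_flat)

# Summing logs stays well within floating-point range
nll_flat = -torch.sum(
    y_ones * torch.log(prob_flat) + (1 - y_ones) * torch.log(1 - prob_flat)
)

print(likelihood_flat)                  # underflows to 0
print(nll_flat, n_obs * math.log(2))    # both close to 1386.29
```

Taking the log of `likelihood_flat` afterwards would give negative infinity, which is why the sum must be computed directly.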

In [None]:
# Return the negative log likelihood of the data under the model
def compute_negative_log_likelihood(y_train, prob):
    # TODO: Compute the negative log-likelihood directly
    # (don't use the likelihood function above)
    # You will need torch.sum(), torch.log()
    # BEGIN SOLUTION
    # Apply the NLL formula: -sum[y*log(p) + (1-y)*log(1-p)]
    return -torch.sum(y_train * torch.log(prob) + (1 - y_train) * torch.log(1 - prob))
    # END SOLUTION

In [None]:
# Test assertions
# Test cases for compute_negative_log_likelihood
prob_test = torch.tensor(
    [
        0.09291784,
        0.46809093,
        0.93089486,
        0.67612654,
        0.73441752,
        0.86847339,
        0.49873225,
        0.51083168,
        0.18343972,
        0.99380898,
        0.27840809,
        0.38028817,
        0.12055708,
        0.56715537,
        0.92005746,
        0.77072270,
        0.85278176,
        0.05315950,
        0.87168699,
        0.58858043,
    ]
)
y_train_test = torch.tensor(
    [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1], dtype=torch.float32
)

# Test with repeated data (100x) to check numerical stability
prob_repeated = prob_test.repeat(100)
y_repeated = y_train_test.repeat(100)

nll = compute_negative_log_likelihood(y_repeated, prob_repeated)
assert abs(nll - 956.8168950) < 1e-4, f"Expected ~956.8168950, got {nll}"

# Test with simple case
nll_simple = compute_negative_log_likelihood(torch.tensor([1.0, 0.0]), torch.tensor([0.8, 0.3]))
expected_simple = -torch.log(torch.tensor(0.8)) - torch.log(torch.tensor(0.7))
assert abs(nll_simple - expected_simple) < 1e-6, "Simple test failed"

# BEGIN HIDDEN TESTS
assert True  # placeholder hidden test
# END HIDDEN TESTS

print("All compute_negative_log_likelihood tests passed!")

## Introduction to Logistic Regression

Now we move on to logistic regression.
In the previous problems, we immediately had access to $p_i$ for each observation.
In real-world examples, we don't have direct access to $p_i$, but have to compute it from a set of characteristics in vectors $x_i$ ($i=1,\ldots,n$).
$p_i$ is given as a function of $x_i$ such that
$$p_i = f(x_i)$$
Logistic regression assumes that the log-odds $\log\frac{p_i}{1-p_i}$ are a linear function of $x_i$, so $f$ itself is the sigmoid of a linear function of $x_i$.
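
The linearity in logistic regression is easiest to see on the log-odds scale: passing a linear score through the sigmoid and then taking $\log\frac{p}{1-p}$ recovers the score. A minimal sketch, with hypothetical weights and features chosen only for illustration:

```python
import torch

# Hypothetical weights and features, chosen only for illustration
w_demo = torch.tensor([0.5, -1.0])
X_demo = torch.tensor([[1.0, 2.0], [0.0, 1.0], [3.0, 0.5]])

z_demo = X_demo @ w_demo            # linear score w^T x_i for each row
p_from_z = torch.sigmoid(z_demo)    # probabilities in (0, 1)
log_odds = torch.logit(p_from_z)    # log(p / (1 - p))

print(z_demo)
print(log_odds)                     # matches z_demo up to floating-point error
```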

Let's look at the Wisconsin Breast Cancer dataset, which has 569 samples and 30 features. We aim to classify tumors as malignant or benign. Note that scikit-learn encodes malignant as 0 and benign as 1 in this dataset.

In [None]:
from sklearn.datasets import load_breast_cancer

data_cancer = load_breast_cancer()
X = data_cancer.data
y = data_cancer.target
print(y)
print(data_cancer.feature_names)
print(data_cancer.target_names)

We will split this data set into a train and test set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

We can now fit a logistic regression model on this data. The solver "sag" stands for Stochastic Average Gradient, a variant of the gradient descent algorithm that we will implement below. We set the maximum number of iterations to 1000. The tolerance of 0.001 is a stopping criterion, similar to the $\epsilon$ that we will use.

The following code trains a logistic regression model.
It learns the function $f$ that maps $x$ to $p$ as mentioned above.

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty=None, solver="sag", tol=0.001, max_iter=1000, random_state=42).fit(
    X_train, y_train
)

We can now get hard-label and soft-label predictions for $X_{\text{test}}$. Hard labels give the predicted class of each tumor, and soft labels give the predicted probability of each class.

The following step computes the hard labels, which are obtained from the predicted probabilities $p_i = f(x_i)$ under the learned function $f$.

In [None]:
y_pred = clf.predict(X_test)
print(y_pred)

In some applications, a prediction alone is not sufficient. We may want to attach a confidence score to each prediction. This can be done by looking at the predicted probability of each label for every data point.

In [None]:
# Predict probabilities
probs_y = clf.predict_proba(X_test)
print(torch.round(torch.tensor(probs_y), decimals=2)[0:10, :])

As we can see, the model has only 54% confidence in one of the labels it predicted. Note that this is not a rigorous uncertainty quantification: unlike traditional statistical inference, it comes with no coverage guarantee. Nevertheless, the predicted probability is a useful heuristic for the reliability of individual predictions.
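
In the binary case, hard labels correspond to picking the more probable class from the soft labels. The sketch below illustrates this on made-up probabilities (not output from the fitted model), and also extracts the confidence attached to each hard label.

```python
import torch

# Made-up soft labels: each row is [P(class 0), P(class 1)] for one sample
soft_demo = torch.tensor([[0.46, 0.54], [0.98, 0.02], [0.10, 0.90]])

hard_demo = torch.argmax(soft_demo, dim=1)        # most probable class per row
confidence = torch.max(soft_demo, dim=1).values   # probability of that class

print(hard_demo)     # tensor([1, 0, 1])
print(confidence)
```

The first row shows a low-confidence prediction: the hard label is 1, but its probability is barely above one half.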

We can now make a confusion matrix to evaluate the predictive power of the model.

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

# Layout of the matrix (rows: true label, columns: predicted label):
# | true 0, predicted 0 | true 0, predicted 1 |
# | true 1, predicted 0 | true 1, predicted 1 |
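
To see where the four entries come from, the sketch below rebuilds a 2x2 confusion matrix by hand on made-up labels, using the same row and column convention (rows are true labels, columns are predicted labels). It is an illustration of the counting, not a replacement for `confusion_matrix`.

```python
import torch

# Made-up true and predicted labels for six samples
true_demo = torch.tensor([0, 0, 1, 1, 1, 0])
pred_demo = torch.tensor([0, 1, 1, 1, 0, 0])

# Encode each (true, predicted) pair as an index in {0, 1, 2, 3} and count
cm_demo = torch.bincount(2 * true_demo + pred_demo, minlength=4).reshape(2, 2)

# Accuracy is the share of samples on the diagonal
accuracy_demo = cm_demo.diagonal().sum() / cm_demo.sum()

print(cm_demo)         # rows: true label, columns: predicted label
print(accuracy_demo)   # 4 correct out of 6
```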

## Problem 5: Sigmoid Function

\begin{align*}
\widehat{p}_i &= \sigma(w^Tx_i) = \frac{e^{w^Tx_i}}{1 + e^{w^Tx_i}} = \frac{1}{1+ e^{-w^Tx_i}} \\
\widehat{P} &= \begin{bmatrix} \widehat{p}_1 \\
\widehat{p}_2 \\
\vdots \\
\widehat{p}_n\end{bmatrix} \\
X &= \begin{bmatrix} x^T_1 \\
x^T_2 \\
\vdots \\
x^T_n \end{bmatrix}
\end{align*}
where $x_i$ is the $i$-th observation (the $i$-th row of $X$), written as a $p \times 1$ column vector. Here $p$ denotes the number of features, not the Bernoulli rate parameter.

Build a function that applies the sigmoid function to $X$ with dimensions $n \times p$ with a given weight vector $w$ of shape $(p,)$ or $(p, 1)$, and returns $\widehat{P}$ with dimensions $n \times 1$. Assume the intercept column is already included in $X$.

$\sigma(\cdot)$ is the sigmoid function that you will implement.
As before, you should not use for loops.
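
The two algebraic forms of the sigmoid given above are equal on paper but behave differently in floating point. The sketch below evaluates both on a few illustrative scores in `float32` and compares them with PyTorch's built-in `torch.sigmoid`.

```python
import torch

scores = torch.tensor([-100.0, 0.0, 100.0])   # illustrative scores, float32

# e^z / (1 + e^z): exp(100) overflows to inf in float32, giving inf / inf = nan
form_a = torch.exp(scores) / (1 + torch.exp(scores))

# 1 / (1 + e^{-z}): the same overflow only produces 1 / inf = 0
form_b = 1 / (1 + torch.exp(-scores))

print(form_a)                  # last entry is nan
print(form_b)                  # finite everywhere
print(torch.sigmoid(scores))   # PyTorch's built-in, for comparison
```

For this reason the form $\frac{1}{1+e^{-z}}$ is usually the safer one to implement directly.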

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_cancer = load_breast_cancer()
X_wis = data_cancer.data
y_wis = data_cancer.target
X_train_wis, X_test_wis, y_train_wis, y_test_wis = train_test_split(
    X_wis, y_wis, test_size=0.3, random_state=42
)
sc_2 = StandardScaler()
X_transform_wis = sc_2.fit_transform(X_train_wis)

In [None]:
def sigmoid(features, weights):
    """
    Compute the sigmoid function for logistic regression.

    Parameters
    ----------
    features : Tensor of shape (n, p)
        Feature matrix
    weights : Tensor of shape (p,) or (p, 1)
        Weight vector

    Returns
    -------
    Tensor of shape (n, 1)
        Predicted probabilities
    """
    # BEGIN SOLUTION
    # Reshape w to (p, 1) so the output is (n, 1) for either accepted weight shape
    z = features @ weights.reshape(-1, 1)
    # Apply the sigmoid: 1 / (1 + exp(-z))
    return 1 / (1 + torch.exp(-z))
    # END SOLUTION

In [None]:
# Test assertions
# Test cases for sigmoid function
torch.manual_seed(42)
X_transform_wis_t = torch.tensor(X_transform_wis, dtype=torch.float32)
exp_w = torch.randn(X_transform_wis_t.shape[1], 1)
sigmoid_output = sigmoid(X_transform_wis_t, exp_w)

# Check shape
expected_shape = (X_transform_wis_t.shape[0], 1)
assert (
    sigmoid_output.shape == expected_shape
), f"Expected shape {expected_shape}, got {sigmoid_output.shape}"

# Check values are in [0, 1]
assert torch.all(sigmoid_output >= 0) and torch.all(
    sigmoid_output <= 1
), "Sigmoid output must be in [0, 1]"

# Check specific values (note: different RNG will give different values)
# Just verify the output is valid probabilities
assert (
    sigmoid_output.min() >= 0.0 and sigmoid_output.max() <= 1.0
), "Sigmoid values must be valid probabilities"


# BEGIN HIDDEN TESTS
assert True  # placeholder hidden test
# END HIDDEN TESTS

print("All sigmoid tests passed!")

## Problem 6: Gradient Descent for Logistic Regression

The loss function for Logistic Regression is:
\begin{align*}
l_w(\widehat{p}_i, y_i) &= \begin{cases}
        -\log(\sigma(w^T x_i)), & \text{if } y_i = 1\\
        -\log(1 - \sigma(w^T x_i)), & \text{if } y_i = 0
    \end{cases} \\
  &= -y_i\log(\sigma(w^T x_i)) - (1 - y_i)\log(1 - \sigma(w^T x_i))\\
  L(w) &= \frac{1}{n}\sum_{i =1}^n l_w(\widehat{p}_i, y_i)
\end{align*}

The gradient of $L(w)$ is as follows:
\begin{align*}\nabla_w L(w) &= \frac{1}{n}\sum_{i =1}^n \left(\widehat{p}_i - y_i\right)x_i \\
&= \frac{1}{n} X^T\left(\widehat{P} - Y\right)
\end{align*}
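
One way to gain confidence in the gradient formula is to compare it with automatic differentiation. The sketch below does this on small made-up data. All names ending in `_toy` are illustrative and not part of the assignment.

```python
import torch

torch.manual_seed(0)
n_toy, p_toy = 50, 3
X_toy = torch.randn(n_toy, p_toy)                       # made-up features
Y_toy = torch.randint(0, 2, (n_toy, 1)).float()         # made-up 0/1 labels
w_toy = (0.1 * torch.randn(p_toy, 1)).requires_grad_()  # small random weights

# Mean cross-entropy loss L(w)
P_toy = torch.sigmoid(X_toy @ w_toy)
loss_toy = -torch.mean(
    Y_toy * torch.log(P_toy) + (1 - Y_toy) * torch.log(1 - P_toy)
)

# Gradient from automatic differentiation
loss_toy.backward()

# Gradient from the closed-form expression (1/n) X^T (P_hat - Y)
grad_formula = X_toy.T @ (P_toy.detach() - Y_toy) / n_toy

print(w_toy.grad.flatten())
print(grad_formula.flatten())
```

The two printed vectors should agree up to floating-point error.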

Implement gradient descent for logistic regression. Your function should return $\widehat{w}$ and the training loss of the model at $\widehat{w}$, given:
- $X$ matrix with dimensions $n \times p$
- $Y$ vector with dimensions $n \times 1$
- $\eta$ learning rate
- $w_0$ initialization for $w$
- $\epsilon$ convergence condition

The function should also plot the loss at each iteration before returning the final weights and loss.

**Note:** Use vectorized operations (no explicit for loops over data points). The only loop allowed is the `while` loop for gradient descent iterations.
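
Written out, one iteration of gradient descent applies the update below, where $t$ indexes iterations and $w^{(0)} = w_0$. The stopping rule shown (Euclidean norm of the weight change falling below $\epsilon$) is one common reading of the convergence condition and is an assumption here. Other criteria, such as the change in loss, are also used in practice.
$$
w^{(t+1)} = w^{(t)} - \eta \, \nabla_w L\left(w^{(t)}\right), \qquad \text{stop when } \left\lVert w^{(t+1)} - w^{(t)} \right\rVert_2 < \epsilon
$$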

In [None]:
def log_grad_descent(features, labels, eta, initial_w, epsilon):
    """
    Perform gradient descent for logistic regression.

    Parameters
    ----------
    features : Tensor of shape (n, p)
        Feature matrix (without intercept column)
    labels : Tensor of shape (n,)
        Target labels
    eta : float
        Learning rate
    initial_w : Tensor of shape (p+1, 1)
        Initial weights (including intercept)
    epsilon : float
        Convergence threshold for weight change

    Returns
    -------
    w : Tensor of shape (p+1, 1)
        Learned weights
    loss : float
        Final training loss
    """
    # BEGIN SOLUTION
    n = features.shape[0]
    # Add intercept column
    features_aug = torch.hstack((torch.ones((n, 1)), features))
    labels_col = labels.reshape((n, 1))

    w = initial_w.clone()
    losses = []

    while True:
        w_old = w.clone()

        # Forward pass: compute predictions
        p_hat = sigmoid(features_aug, w)

        # Compute loss (cross-entropy)
        loss = -torch.mean(labels_col * torch.log(p_hat) + (1 - labels_col) * torch.log(1 - p_hat))
        losses.append(loss.item())

        # Compute gradient and update weights
        grad = features_aug.T @ (p_hat - labels_col) / n
        w = w - eta * grad

        # Check convergence
        if torch.linalg.norm(w - w_old) < epsilon:
            break

    # Plot the loss curve
    plt.figure(figsize=(8, 5))
    plt.plot(losses)
    plt.xlabel("Iteration")
    plt.ylabel("Loss (Cross-Entropy)")
    plt.title("Gradient Descent Convergence")
    plt.grid(visible=True)
    plt.show()

    return w, loss.item()
    # END SOLUTION

In [None]:
# Test assertions
# Test gradient descent implementation
torch.manual_seed(42)
X_transform_wis_t = torch.tensor(X_transform_wis, dtype=torch.float32)
y_train_wis_t = torch.tensor(y_train_wis, dtype=torch.float32)
new_p = X_transform_wis_t.shape[1] + 1
w_graddescent, loss = log_grad_descent(
    X_transform_wis_t,
    y_train_wis_t,
    eta=0.01,
    initial_w=torch.randn(new_p, 1),
    epsilon=0.001,
)

# Compute test loss
X_test_scaled = torch.tensor(sc_2.transform(X_test_wis), dtype=torch.float32)
X_test_aug = torch.hstack((torch.ones((X_test_wis.shape[0], 1)), X_test_scaled))
pred_y_test = sigmoid(X_test_aug, w_graddescent)
y_test_col = torch.tensor(y_test_wis, dtype=torch.float32).reshape((-1, 1))
test_loss = -torch.mean(
    y_test_col * torch.log(pred_y_test) + (1 - y_test_col) * torch.log(1 - pred_y_test)
)

print(f"Training loss: {loss:.6f}")
print(f"Test loss: {test_loss.item():.6f}")
print(f"First 9 weight values (excluding intercept): {w_graddescent[1:10].flatten()}")

# Assertions (note: values may differ slightly due to different RNG)
assert loss < 0.5, f"Training loss should be < 0.5, got {loss}"
assert test_loss < 0.5, f"Test loss should be < 0.5, got {test_loss}"

# BEGIN HIDDEN TESTS
assert True  # placeholder hidden test
# END HIDDEN TESTS

print("All gradient descent tests passed!")