# Logistic Regression — Pima Indians Diabetes Dataset

This notebook demonstrates **binary logistic regression** implemented entirely from scratch using the custom `LogisticRegression` class from the `rice_ml` package.

We apply the model to the **Pima Indians Diabetes dataset**, a classic benchmark for binary classification in medical diagnostics.


## Objectives

In this notebook, we will:

- Load and inspect a real-world medical dataset  
- Perform exploratory data analysis (EDA)  
- Standardize features for numerical stability  
- Train a logistic regression classifier using gradient descent  
- Evaluate performance using accuracy, confusion matrix, and ROC–AUC  
- Interpret results from both a statistical and machine learning perspective  


## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from rice_ml.supervised_learning.logistic_regression import LogisticRegression
from rice_ml.processing.preprocessing import standardize, train_test_split
from rice_ml.processing.post_processing import accuracy_score, confusion_matrix

## Load Dataset

We will load the Pima Indians Diabetes dataset directly from a public URL.

## Dataset Description

The **Pima Indians Diabetes dataset** contains medical diagnostic measurements used to predict whether a patient has diabetes.

### Features

- `pregnancies`: number of pregnancies  
- `glucose`: plasma glucose concentration  
- `blood_pressure`: diastolic blood pressure (mm Hg)  
- `skin_thickness`: triceps skin fold thickness (mm)  
- `insulin`: 2-hour serum insulin (mu U/ml)  
- `bmi`: body mass index  
- `diabetes_pedigree`: diabetes pedigree function  
- `age`: age in years  

### Target

- `label`: binary outcome  
  - `1` → diabetes  
  - `0` → no diabetes  

### Data Notes

- All features are numerical  
- Several variables contain **zero values that represent missing or implausible measurements**  
- No explicit NaN values are present  
- Feature scales differ substantially, motivating standardization  

In [None]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
cols = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "diabetes_pedigree", "age", "label"
]

df = pd.read_csv(url, header=None, names=cols)
df.head()
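
The data notes above say that zeros stand in for missing measurements in several columns. A minimal sketch of how those zeros could be counted and imputed follows. The toy values are made up for illustration, and both the choice of columns in `zero_as_missing` and the median imputation are assumptions, not steps this notebook performs.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; these are not records from the dataset.
toy = pd.DataFrame({
    "pregnancies": [2, 0, 5, 1],                 # zero is a valid value here
    "glucose": [120.0, 0.0, 150.0, 95.0],
    "blood_pressure": [70.0, 80.0, 0.0, 64.0],
    "insulin": [0.0, 0.0, 130.0, 90.0],
    "bmi": [30.0, 25.5, 0.0, 28.0],
})

# Assumption: a zero is physiologically implausible in these columns.
zero_as_missing = ["glucose", "blood_pressure", "insulin", "bmi"]

# Count zeros per column to see how much of the data is affected.
zero_counts = (toy[zero_as_missing] == 0).sum()
print(zero_counts)

# One simple option: mark zeros as NaN, then fill with the column median.
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(
    cleaned[zero_as_missing].median()
)
print(cleaned)
```

Applied to the real data, the same pattern would use `df` in place of `toy`. Whether imputation helps the classifier is an empirical question that this notebook does not test.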

## Exploratory Data Analysis (EDA)

Before modeling, we examine feature distributions and scale differences to understand the structure of the data.

### Feature Distributions

Histograms reveal skewness, outliers, and non-Gaussian behavior in several features such as insulin and glucose.

### Feature Scale Comparison

Boxplots highlight large differences in scale across features.
Some variables take values orders of magnitude larger than others, which can destabilize gradient-based optimization.


In [None]:
# Feature Distributions
df.drop(columns=["label"]).hist(figsize=(12, 8), bins=20)
plt.suptitle("Feature Distributions", y=1.02)
plt.show()

# Feature Scale Comparison
plt.figure(figsize=(12, 6))
df.drop(columns=["label"]).boxplot(rot=45)
plt.title("Feature Boxplots (Before Standardization)")
plt.show()

## EDA Interpretation

- Features vary widely in scale (e.g., insulin vs diabetes pedigree)  
- Several variables are right-skewed and contain outliers  
- This logistic regression implementation is trained with gradient descent, which converges slowly or unstably when feature scales differ widely  

These observations motivate **feature standardization** prior to model training.


## Preprocessing

We separate predictors and target, then standardize the features.

We use the following notation:

- Feature matrix: $X \in \mathbb{R}^{n \times d}$  
- Target vector: $y \in \{0,1\}^n$

Each feature (column) is standardized using its own mean $\mu$ and standard deviation $\sigma$:

$$
X_{\text{std}} = \frac{X - \mu}{\sigma}
$$

This ensures:

- Comparable feature scales  
- Faster and more stable gradient descent  
- Improved numerical conditioning  


In [None]:
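# Separate predictors and target
X = df.drop(columns=["label"]).values
y = df["label"].values.astype(float)

# Standardize features (note: statistics come from the full dataset, including test rows)
X_std = standardize(X)

# Hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape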
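
The cell above standardizes the full dataset before splitting, so test rows contribute to the mean and standard deviation. A stricter protocol estimates those statistics on the training split only and reuses them for the test split. The sketch below shows that pattern in plain NumPy. The helpers `fit_standardizer` and `apply_standardizer` are hypothetical names and are not part of `rice_ml`, and the data is synthetic.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-feature mean and std computed on the training rows only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    # Guard against division by zero for constant columns.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return mu, sigma

def apply_standardizer(X, mu, sigma):
    """Apply (X - mu) / sigma using previously fitted statistics."""
    return (X - mu) / sigma

# Synthetic data with two features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[100.0, 0.5], scale=[30.0, 0.2], size=(200, 2))

# Hypothetical 80/20 split by shuffled index.
idx = rng.permutation(len(X_demo))
train_idx, test_idx = idx[:160], idx[160:]

mu, sigma = fit_standardizer(X_demo[train_idx])
X_train_std = apply_standardizer(X_demo[train_idx], mu, sigma)
X_test_std = apply_standardizer(X_demo[test_idx], mu, sigma)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

The test split will generally not have exactly zero mean and unit variance under this scheme, which is expected: it is scaled with the training statistics.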

## Logistic Regression — Model Intuition

Logistic regression models the probability of the positive class as:

$$
P(y = 1 \mid x) = \sigma(w^\top x + b)
$$

where the sigmoid function is:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
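
Evaluating the sigmoid formula directly can overflow in `exp` for large negative inputs. A common workaround, sketched here in NumPy, uses an algebraically equivalent form for negative values. This is an illustration and may differ from how `rice_ml` computes the sigmoid internally.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid; returns an array with at least one dimension."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use exp(z) / (1 + exp(z)) so the exponent stays non-positive.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

# Extreme inputs stay finite: values are approximately 0, 0.5, and 1.
print(sigmoid([-1000.0, 0.0, 1000.0]))
```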


### Loss Function (Log-Loss)

We minimize the **binary cross-entropy loss**, where $\hat{p}_i = \sigma(w^\top x_i + b)$ is the predicted probability for sample $i$:

$$
\mathcal{L}(w, b) =
-\frac{1}{n} \sum_{i=1}^n
\left[
y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)
\right]
$$

This loss is:

- Convex  
- Differentiable  
- Well-suited for probabilistic classification  
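
As a small illustration of the loss above, the following NumPy sketch computes the binary cross-entropy for a handful of made-up predictions. The clipping constant `eps` is an assumption added to avoid `log(0)` and is not part of the formula.

```python
import numpy as np

def binary_cross_entropy(y_true, p_hat, eps=1e-12):
    """Mean binary cross-entropy between labels in {0, 1} and probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    p_hat = np.clip(np.asarray(p_hat, dtype=float), eps, 1.0 - eps)
    return -np.mean(
        y_true * np.log(p_hat) + (1.0 - y_true) * np.log(1.0 - p_hat)
    )

y_demo = np.array([1.0, 0.0, 1.0, 0.0])

# Confident and correct predictions give a small loss ...
confident_right = binary_cross_entropy(y_demo, [0.9, 0.1, 0.8, 0.2])
# ... while confident and wrong predictions are penalized heavily.
confident_wrong = binary_cross_entropy(y_demo, [0.1, 0.9, 0.2, 0.8])

print(confident_right, confident_wrong)
```

A classifier that always outputs 0.5 incurs a loss of $\log 2 \approx 0.693$, which is a useful reference point when reading training curves.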


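## Model Training

We train the model with batch gradient descent on the log-loss defined above. In matrix form, the predicted probabilities are:

$$
\hat{p} = \sigma(Xw + b) = \frac{1}{1 + e^{-(Xw + b)}}
$$

For the unregularized log-loss, the gradients are:

$$
\nabla_w \mathcal{L} = \frac{1}{n} X^\top (\hat{p} - y),
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_{i=1}^n (\hat{p}_i - y_i)
$$

Each iteration moves $w$ and $b$ by a step of size `learning_rate` against these gradients, for up to `max_iter` iterations. These expressions omit any regularization penalty the implementation may apply through `C`.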

In [None]:
model = LogisticRegression(
    learning_rate=0.1,
    max_iter=5000,
    C=1.0
)

model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
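
The `fit` call above hides the optimization loop. As a rough sketch of what batch gradient descent on the log-loss involves, the NumPy function below trains an unregularized model on synthetic data. The function `fit_logistic_gd` is a hypothetical name. It is not the `rice_ml` implementation, and it ignores whatever role `C` plays there.

```python
import numpy as np

def fit_logistic_gd(X, y, learning_rate=0.1, max_iter=5000):
    """Unregularized logistic regression via batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(max_iter):
        z = X @ w + b
        # Clip the logits so exp() cannot overflow.
        p_hat = 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
        error = p_hat - y
        # Gradients of the mean log-loss with respect to w and b.
        grad_w = X.T @ error / n
        grad_b = error.mean()
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic data generated from known weights, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
true_w = np.array([2.0, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(X_demo @ true_w)))
y_demo = (rng.random(300) < p_true).astype(float)

w, b = fit_logistic_gd(X_demo, y_demo, max_iter=2000)
print("Estimated weights:", w, "intercept:", b)
```

The estimated weights should have the same signs as `true_w`, with magnitudes that vary with the random sample.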


## Evaluation Metrics

### Accuracy

Accuracy is defined as:

$$
\text{Accuracy} =
\frac{1}{n} \sum_{i=1}^n
\mathbf{1}\bigl[\hat{y}_i = y_i\bigr]
$$

where:

- $y_i$ is the true label  
- $\hat{y}_i$ is the predicted label  
- $\mathbf{1}[\cdot]$ is the indicator function  

### Confusion Matrix

The confusion matrix provides class-specific insight into:

- False positives (Type I error)  
- False negatives (Type II error)  

This is particularly important in medical prediction tasks, where false negatives may be costly.



In [None]:
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", acc)
print("Confusion Matrix:\n", cm)
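
To make the metric definitions concrete, the sketch below counts the four confusion-matrix cells by hand on made-up labels and derives accuracy, sensitivity, and specificity from them. The helper `confusion_counts` is hypothetical. The row and column layout of the matrix returned by `rice_ml`'s `confusion_matrix` is not shown in this notebook, so check it before reading off individual cells.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # Type I error
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # Type II error
    return tn, fp, fn, tp

# Made-up labels and predictions for illustration.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of true positives that are detected
specificity = tn / (tn + fp)   # share of true negatives that are detected

print(tn, fp, fn, tp)                       # 3 1 1 3
print(accuracy, sensitivity, specificity)   # 0.75 0.75 0.75
```

In a screening setting, sensitivity is usually the number to watch, since each false negative is a missed case.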

## ROC Curve and AUC

The **Receiver Operating Characteristic (ROC)** curve evaluates classifier performance across all classification thresholds.

- True Positive Rate (TPR):

$$
\text{TPR} = \frac{TP}{TP + FN}
$$

- False Positive Rate (FPR):

$$
\text{FPR} = \frac{FP}{FP + TN}
$$

The **Area Under the Curve (AUC)** summarizes overall discriminative performance.
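
The definitions above can be turned into a short NumPy sketch that sweeps the threshold over the sorted scores and integrates the curve with the trapezoid rule. The functions `roc_points` and `auc_trapezoid` are hypothetical names and are not the `rice_ml` API. The sketch assumes that both classes are present and that scores are distinct (tied scores would need to be grouped).

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, starting at (0, 0) and ending at (1, 1)."""
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    # Sort samples from highest to lowest score; each prefix corresponds
    # to one threshold at which those samples are predicted positive.
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)   # true positives at each threshold
    fps = np.cumsum(y_sorted == 0)   # false positives at each threshold
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve using the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up scores for illustration.
y_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo = roc_points(y_demo, scores_demo)
auc_demo = auc_trapezoid(fpr_demo, tpr_demo)
print(auc_demo)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.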


In [None]:
fpr, tpr, auc = model.roc_curve(X_test, y_test)

plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0,1],[0,1], '--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Logistic Regression")
plt.legend()
plt.show()

## Conclusion

Logistic regression provides a strong and interpretable baseline for binary classification.

This notebook demonstrated:

- End-to-end modeling using **fully custom-built machine learning code**  
- The importance of feature scaling for gradient-based optimization  
- Evaluation using both threshold-dependent and threshold-independent metrics  

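Despite its simplicity, logistic regression is a reasonable first model for this dataset. Its probabilistic outputs are especially valuable in medical decision-making, where the classification threshold can be adjusted to reflect the relative cost of false negatives and false positives.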
