# Logistic Regression Tutorial (Pima Indians Diabetes Dataset)

This notebook demonstrates binary logistic regression using the custom `LogisticRegression` class.
We will load the **Pima Indians Diabetes dataset** (originally distributed through the UCI Machine Learning Repository) from a public mirror, preprocess it, train the model, and evaluate performance using accuracy, a confusion matrix, and the ROC curve.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from rice_ml.supervised_learning.logistic_regression import LogisticRegression
from rice_ml.processing.preprocessing import standardize, train_test_split
from rice_ml.processing.post_processing import accuracy_score, confusion_matrix

## Load Dataset

We will load the Pima Indians Diabetes dataset directly from a public URL.

## Dataset Description

This dataset contains medical diagnostic measurements used to predict
whether a patient has diabetes.

### Features
- **pregnancies**: number of pregnancies
- **glucose**: plasma glucose concentration
- **blood_pressure**: diastolic blood pressure (mm Hg)
- **skin_thickness**: triceps skin fold thickness (mm)
- **insulin**: 2-hour serum insulin (mu U/ml)
- **bmi**: body mass index
- **diabetes_pedigree**: diabetes pedigree function
- **age**: age in years

### Target
- **label**: binary outcome (1 = diabetes, 0 = no diabetes)

### Data Notes
- All features are numerical
- Some variables (glucose, blood_pressure, skin_thickness, insulin, bmi) contain zeros that are physiologically implausible and represent missing values
- No explicit NaN values are present
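
The note about zeros standing in for missing values can be checked directly. The sketch below counts zero entries per column; the helper name `count_zero_placeholders` and the tiny `demo` frame are hypothetical and made up for illustration (they are not rows from the dataset). Applied to the real data, the same function would take `df` once it is loaded.

```python
import pandas as pd

# Columns where a zero is not a plausible measurement (assumption based on the data notes)
ZERO_AS_MISSING = ["glucose", "blood_pressure", "skin_thickness", "insulin", "bmi"]

def count_zero_placeholders(frame, columns=ZERO_AS_MISSING):
    """Return the number of zero entries in each of the given columns."""
    return (frame[columns] == 0).sum()

# Made-up values for illustration only
demo = pd.DataFrame({
    "glucose": [148, 0, 183],
    "blood_pressure": [72, 66, 0],
    "skin_thickness": [35, 0, 0],
    "insulin": [0, 0, 0],
    "bmi": [33.6, 26.6, 23.3],
})

zero_counts = count_zero_placeholders(demo)
print(zero_counts)
```

This tutorial leaves the zeros in place; whether to impute them (for example with a column median) is a separate modeling decision.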


In [None]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
cols = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "diabetes_pedigree", "age", "label"
]

df = pd.read_csv(url, header=None, names=cols)
df.head()

## Exploratory Data Analysis (EDA)

In [None]:
# Histograms: one panel per feature, to inspect distribution shape and skew
df.drop(columns=["label"]).hist(figsize=(12, 8), bins=20)
plt.suptitle("Feature Distributions", y=1.02)
plt.show()

# Boxplots on a shared axis, to compare feature scales and spot outliers
plt.figure(figsize=(12, 6))
df.drop(columns=["label"]).boxplot(rot=45)
plt.title("Feature Boxplots")
plt.show()

## EDA Explanation

These plots show differences in scale, skewness, and the presence of outliers.
Several features are right-skewed and vary significantly in magnitude,
motivating feature standardization before optimization.
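
The visual impression of scale and skew can also be put into numbers. The following is a minimal sketch using pandas' built-in `mean`, `std`, and `skew`; the helper `scale_and_skew` and the synthetic `demo` columns are hypothetical and only stand in for the real features.

```python
import numpy as np
import pandas as pd

def scale_and_skew(frame):
    """Summarize each numeric column's mean, standard deviation, and sample skewness."""
    return pd.DataFrame({
        "mean": frame.mean(),
        "std": frame.std(),
        "skew": frame.skew(),
    })

# Synthetic columns for illustration: one symmetric, one right-skewed
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=100.0, scale=15.0, size=500),
    "right_skewed": rng.exponential(scale=2.0, size=500),
})

summary = scale_and_skew(demo)
print(summary.round(2))
```

A clearly positive skew value corresponds to the long right tail seen in a histogram, and large differences in the `std` column are what standardization removes.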

## Preprocessing
We standardize features to help stabilize gradient descent.

In [None]:
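# Separate the feature matrix from the binary target
X = df.drop(columns=["label"]).values
y = df["label"].values.astype(float)

# Note: standardize() is applied to all rows before the split, so the scaling
# statistics presumably include the test rows as well.
X_std = standardize(X)

# Hold out 20% of the rows for evaluation; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape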
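
Because the cell above scales the full matrix before splitting, the test rows influence the mean and standard deviation used for scaling. A common alternative is to compute those statistics on the training rows only and reuse them for the test rows. The sketch below shows that pattern in plain NumPy; `fit_standardizer` and `apply_standardizer` are hypothetical helpers, not part of `rice_ml`, and the data is synthetic.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation from training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any split with statistics that were fitted on the training split."""
    return (X - mean) / std

# Synthetic two-feature data with very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[50.0, 5.0], scale=[10.0, 1.0], size=(100, 2))

# Simple 80/20 split by shuffled index
idx = rng.permutation(len(X_demo))
train_idx, test_idx = idx[:80], idx[80:]

mean, std = fit_standardizer(X_demo[train_idx])
X_train_demo = apply_standardizer(X_demo[train_idx], mean, std)
X_test_demo = apply_standardizer(X_demo[test_idx], mean, std)

print(X_train_demo.mean(axis=0), X_train_demo.std(axis=0))
```

With this ordering the training split has exactly zero mean and unit variance, while the test split is only approximately standardized, which is the expected behavior.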

## Model Training
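We train logistic regression using gradient descent. The model maps the features $X$ to a predicted probability through the sigmoid function $\sigma$, with weights $w$ and intercept $b$:

$$
\hat{y} = \sigma(Xw + b) = \frac{1}{1 + e^{-(Xw + b)}}
$$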

We minimize the log-loss:

$$
\mathcal{L} = -\frac{1}{n} \sum_i \big( y_i\log(\hat{y}_i) + (1 - y_i)\log(1-\hat{y}_i) \big)
$$
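
To make the two formulas concrete, here is a minimal NumPy sketch of the sigmoid, the log-loss, and its gradient, followed by a few gradient descent steps on a toy problem. It is not the implementation inside `LogisticRegression`; the function names are hypothetical, and the intercept is handled here by a leading column of ones in `X_demo`, which is an assumption about layout.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, written via tanh to avoid overflow for large |z|."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))

def log_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def log_loss_gradient(X, y, w):
    """Gradient of the mean log-loss with respect to w: X^T (sigma(Xw) - y) / n."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Toy data: first column of ones acts as the intercept term
X_demo = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
initial_loss = log_loss(y_demo, sigmoid(X_demo @ w))

# Plain batch gradient descent with a fixed step size
for _ in range(100):
    w -= 0.1 * log_loss_gradient(X_demo, y_demo, w)

final_loss = log_loss(y_demo, sigmoid(X_demo @ w))
print(initial_loss, final_loss)
```

With all weights at zero every prediction is 0.5, so the starting loss equals $\log 2$; each update then moves the weights in the direction that lowers the loss.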

In [None]:
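# Hyperparameters as passed to the custom class; C is presumably a regularization
# setting (an assumption here; check the class docstring for its exact meaning).
model = LogisticRegression(learning_rate=0.1, max_iter=5000, C=1.0)
model.fit(X_train, y_train)

# Features were standardized, so coefficient magnitudes are roughly comparable
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)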

## Evaluation Metrics
We compute the accuracy and the confusion matrix on the held-out test set.

In [None]:
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", acc)
print("Confusion Matrix:\n", cm)
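
For reference, both metrics reduce to simple counts. The sketch below computes them by hand in NumPy on made-up labels. The layout `[[TN, FP], [FN, TP]]` (rows are true classes, columns are predicted classes) is an assumption; the `confusion_matrix` function in `rice_ml` may order its entries differently.

```python
import numpy as np

def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels in {0, 1} (assumed layout)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return np.array([[tn, fp], [fn, tp]])

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Made-up labels for illustration
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm_demo = binary_confusion_matrix(y_true_demo, y_pred_demo)
acc_demo = accuracy(y_true_demo, y_pred_demo)
print(cm_demo)
print(acc_demo)
```

Accuracy is the sum of the diagonal divided by the total count, so it hides how errors split between false positives and false negatives; the confusion matrix keeps that split visible.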

## ROC Curve and AUC
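The ROC curve plots the true positive rate against the false positive rate as the decision threshold sweeps over all values. The area under the curve (AUC) summarizes it in one number: 0.5 corresponds to random ranking and 1.0 to perfect separation.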

$$
\text{TPR} = \frac{TP}{TP + FN},\qquad
\text{FPR} = \frac{FP}{FP + TN}
$$
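
The threshold sweep behind these two rates can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the code behind `model.roc_curve`: the helpers `roc_points` and `auc_trapezoid` are hypothetical, a sample is predicted positive when its score is at or above the threshold, and the area is computed with the trapezoidal rule.

```python
import numpy as np

def roc_points(y_true, scores):
    """Compute (FPR, TPR) pairs by sweeping thresholds from high to low."""
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    # Start above every score so the curve begins at (0, 0)
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    tpr = np.array([np.sum((scores >= t) & (y_true == 1)) / n_pos for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (y_true == 0)) / n_neg for t in thresholds])
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up labels and scores for illustration
y_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo = roc_points(y_demo, scores_demo)
auc_demo = auc_trapezoid(fpr_demo, tpr_demo)
print(fpr_demo, tpr_demo, auc_demo)
```

In this toy case one negative outranks one positive, so three of the four positive/negative pairs are ordered correctly and the area is 0.75.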

In [None]:
fpr, tpr, auc = model.roc_curve(X_test, y_test)

plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
# Diagonal reference line: the ROC curve of a random classifier
plt.plot([0, 1], [0, 1], "--", color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Logistic Regression")
plt.legend()
plt.show()