# Logistic Regression

In [19]:
import sys
from pathlib import Path

# Start at the current directory
root = Path().resolve()

while not (root / "src" / "rice_ml").exists() and root != root.parent:
    root = root.parent

sys.path.append(str(root / "src"))

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

from rice_ml.supervised_learning.logistic_regression import LogisticRegression
from rice_ml.supervised_learning.preprocess import train_test_split, standardize_fit, standardize_transform

sns.set_theme()

### Loading and preparing the dataset

In [20]:
# Prepare a DataFrame with 'x' (feature vectors) and 'y' (labels)

feature_cols = [
    "Pregnancies",
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI",
    "DiabetesPedigreeFunction",
    "Age",
]

target_col = "Outcome"

# Create a copy for modeling (assumes the `diabetes` DataFrame is already in memory)
df_lr = diabetes.copy()

# Create 'x' as a feature vector (np.ndarray) per row
df_lr["x"] = df_lr[feature_cols].apply(lambda row: row.to_numpy(dtype=float), axis=1)

# Create 'y' as the label column (0 or 1)
df_lr["y"] = df_lr[target_col].astype(int)

df_lr[["x", "y"]].head()


,x,y
0,"[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0]",1
1,"[1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0]",0
2,"[8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0]",1
3,"[1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0]",0
4,"[0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 3...",1


Here, we create each feature vector x by combining all 8 numeric predictors into one array per row. "Outcome", which indicates whether the woman was diagnosed with diabetes, becomes y, the target variable of our logistic regression.

In [21]:
# Build full feature matrix X and label vector y
feature_cols = [
    "Pregnancies",
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI",
    "DiabetesPedigreeFunction",
    "Age",
]
target_col = "Outcome"

X = diabetes[feature_cols].to_numpy(dtype=float)
y = diabetes[target_col].to_numpy(dtype=int)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    shuffle=True,
    random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape: ", y_test.shape)


X_train shape: (614, 8)
X_test shape:  (154, 8)
y_train shape: (614,)
y_test shape:  (154,)


As in other notebooks, we first split the dataset into training and testing data, holding out 20% of the rows for testing and fixing the random seed so the split is reproducible.

We end up with 614 observations in our training data and 154 observations in our testing data.
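
To make the split concrete, here is a minimal NumPy sketch of a shuffled 80/20 split. It is not the `rice_ml` implementation: `shuffle_split` is a hypothetical helper, the rounding rule for the test-set size is an assumption, and the demo arrays are synthetic stand-ins with the same shape as the data above.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.2, random_state=None):
    """Hypothetical stand-in for a shuffled train/test split (not the rice_ml code)."""
    X = np.asarray(X)
    y = np.asarray(y)
    n = X.shape[0]
    # Assumption: the test-set size is n * test_size rounded to the nearest integer.
    n_test = int(round(n * test_size))
    # Shuffle the row indices once, then slice them into test and train parts.
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(n)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic data with the same shape as the dataset used here: 768 rows, 8 features.
X_demo = np.arange(768 * 8, dtype=float).reshape(768, 8)
y_demo = np.arange(768) % 2

Xtr, Xte, ytr, yte = shuffle_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(Xtr.shape, Xte.shape)  # (614, 8) (154, 8)
```

Because the same shuffled indices are used for both `X` and `y`, each feature row stays paired with its label, and every row lands in exactly one of the two sets.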

In [22]:
# Fit on training features only
mean_train, std_train = standardize_fit(X_train)

# Transform both train and test
X_train_std = standardize_transform(X_train, mean_train, std_train)
X_test_std  = standardize_transform(X_test,  mean_train, std_train)

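As in other notebooks, we also standardize the features to prevent those with larger numeric scales from dominating the regression. The mean and standard deviation are computed on the training set only and then applied to both sets, so no information from the test set leaks into preprocessing.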
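
The fit-then-transform pattern can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the `rice_ml` code: `fit_scaler` and `apply_scaler` are hypothetical names, and the use of the population standard deviation and the guard for constant columns are choices made for this sketch.

```python
import numpy as np

def fit_scaler(X):
    """Hypothetical helper: per-column mean and standard deviation of X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # assumption: population standard deviation (ddof=0)
    # Assumption: constant columns get std 1.0 to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_scaler(X, mean, std):
    """Hypothetical helper: z-score X using previously fitted statistics."""
    return (np.asarray(X, dtype=float) - mean) / std

# Two toy features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

mean, std = fit_scaler(train)                # statistics come from the training rows only
train_std = apply_scaler(train, mean, std)
test_std = apply_scaler(test, mean, std)     # the test rows reuse the training statistics

print(train_std.mean(axis=0))  # approximately 0 for each column
print(train_std.std(axis=0))   # approximately 1 for each column
print(test_std)
```

After the transform, both training columns have mean 0 and standard deviation 1 regardless of their original units. The test row is not guaranteed to fall inside that range, which is expected because it was scaled with statistics it did not contribute to.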

In [23]:

df_train = pd.DataFrame({
    "x": list(X_train_std),
    "y": y_train
})

df_test = pd.DataFrame({
    "x": list(X_test_std),
    "y": y_test
})

df_train.head()

,x,y
0,"[-0.5133859063701559, -0.3802283869875909, -0....",0
1,"[-1.123093733277846, -0.6559688156612692, -3.4...",0
2,"[1.0108836608990692, -0.043212307497539744, -3...",0
3,"[2.2302993147144496, 0.5389063752580032, 0.264...",1
4,"[-0.818239819824001, -0.1044879583139127, 0.97...",1


Here, we package the standardized data into training and testing DataFrames. Each row holds one standardized 8-dimensional feature vector in x and the corresponding 0/1 "Outcome" label in y.

### Training