# Linear Regression Analysis - Diabetes Dataset

**Student Name:** Llatuna  
**Course:** BSIT - 3R8  
**Activity:** 8-Step Linear Regression Implementation

---

## Instructions Checklist:
1. ✅ Load the dataset (5pts)
2. ✅ Establish X and Y Matrix (5pts) 
3. ✅ Perform 75/25 Data Split (5pts)
4. ✅ Provide data dimension (train and test) (5pts)
5. ✅ Define the Linear Regression Model (5pts)
6. ✅ Build the training model (5pts)
7. ✅ Perform prediction on test data (5pts)
8. ✅ Print Model Performance (5pts)

**Model Name:** Llatuna_LinearRegression

## Import Required Libraries
Import all necessary libraries for Linear Regression analysis

In [12]:
# Import necessary libraries for Linear Regression analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
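import warnings
# Silence all library warnings to keep the output short; this also hides scikit-learn metric warnings
warnings.filterwarnings('ignore')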

print("Libraries imported successfully!")

Libraries imported successfully!


## Step 1: Load the Dataset (5pts)
Load the diabetes dataset and examine its structure

In [13]:
# Step 1: Load the dataset
# Load the diabetes dataset from CSV file
df = pd.read_csv("datasets/diabetes.csv")

# Display basic information about the dataset
print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"\nColumns in dataset: {list(df.columns)}")
print(f"\nFirst 5 rows:")
print(df.head())

Dataset loaded successfully!
Dataset shape: (100000, 17)

Columns in dataset: ['year', 'gender', 'age', 'location', 'race:AfricanAmerican', 'race:Asian', 'race:Caucasian', 'race:Hispanic', 'race:Other', 'hypertension', 'heart_disease', 'smoking_history', 'bmi', 'hbA1c_level', 'blood_glucose_level', 'diabetes', 'clinical_notes']

First 5 rows:
   year  gender   age location  race:AfricanAmerican  race:Asian  \
0  2020  Female  32.0  Alabama                     0           0   
1  2015  Female  29.0  Alabama                     0           1   
2  2015    Male  18.0  Alabama                     0           0   
3  2015    Male  41.0  Alabama                     0           0   
4  2016  Female  52.0  Alabama                     1           0   

   race:Caucasian  race:Hispanic  race:Other  hypertension  heart_disease  \
0               0              0           1             0              0   
1               0              0           0             0              0   
2              
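
The column list above mixes numeric columns with text columns such as `gender` and `location`, so it helps to check data types and missing values before picking features. The following is a minimal sketch of such a check, assuming a hypothetical three-row sample (`sample_csv`, with invented values and only a few of the real column names) in place of `datasets/diabetes.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample; only some real column names are reused,
# and the values are invented for illustration.
sample_csv = io.StringIO(
    "gender,age,bmi,hbA1c_level,smoking_history\n"
    "Female,32.0,27.32,5.0,never\n"
    "Male,18.0,,4.8,former\n"
    "Female,52.0,23.75,6.5,never\n"
)
sample_df = pd.read_csv(sample_csv)

# Data type of each column: text columns are non-numeric, the rest float or int.
print(sample_df.dtypes)

# Number of missing values per column (the empty bmi cell is read as NaN).
missing_counts = sample_df.isna().sum()
print(missing_counts)

# Numeric columns are the candidates for a regression feature matrix.
numeric_columns = sample_df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

On the real data the same three calls would run on `df` directly.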

## Step 2: Establish X and Y Matrix (5pts)
Define features (X) and target variable (Y) for Linear Regression

In [14]:
# Step 2: Establish X and Y Matrix
# For Linear Regression, we'll predict HbA1c level based on other numeric features

# Select numeric features for prediction (excluding target variable)
# We'll use BMI, blood glucose level, and age as features to predict HbA1c level
X = df[['age', 'bmi', 'blood_glucose_level']].copy()
y = df['hbA1c_level'].copy()

print("X Matrix (Features):")
print(f"Shape: {X.shape}")
print(f"Features: {list(X.columns)}")
print("\nFirst 5 rows of X:")
print(X.head())

print(f"\nY Vector (Target):")
print(f"Shape: {y.shape}")
print(f"Target variable: hbA1c_level")
print("\nFirst 5 values of y:")
print(y.head())

X Matrix (Features):
Shape: (100000, 3)
Features: ['age', 'bmi', 'blood_glucose_level']

First 5 rows of X:
    age    bmi  blood_glucose_level
0  32.0  27.32                  100
1  29.0  19.95                   90
2  18.0  23.76                  160
3  41.0  27.32                  159
4  52.0  23.75                   90

Y Vector (Target):
Shape: (100000,)
Target variable: hbA1c_level

First 5 values of y:
0    5.0
1    5.0
2    4.8
3    4.0
4    6.5
Name: hbA1c_level, dtype: float64
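
Before fitting, it can be useful to see how strongly each chosen feature varies linearly with the target. The sketch below computes Pearson correlations with `DataFrame.corrwith`. It runs on synthetic data (`demo` is a hypothetical frame, built so that only `blood_glucose_level` carries signal), so the numbers it prints are not results for the diabetes dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: only blood_glucose_level drives the target.
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "age": rng.uniform(18, 80, n),
    "bmi": rng.normal(27, 5, n),
    "blood_glucose_level": rng.normal(140, 40, n),
})
demo["hbA1c_level"] = (
    4.0 + 0.01 * demo["blood_glucose_level"] + rng.normal(0, 0.5, n)
)

feature_cols = ["age", "bmi", "blood_glucose_level"]

# Pearson correlation of each feature column with the target column.
correlations = demo[feature_cols].corrwith(demo["hbA1c_level"])
print(correlations.round(3))
```

In the notebook the equivalent call would be `X.corrwith(y)`.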


## Step 3: Perform 75/25 Data Split (5pts)
Split the data into training (75%) and testing (25%) sets

In [15]:
# Step 3: Perform 75/25 Data Split
# Split the data into training (75%) and testing (25%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.25,  # 25% for testing, 75% for training
    random_state=42  # For reproducible results
)

print("Data split completed successfully!")
print(f"Training set size: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing set size: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")
print(f"Total samples: {len(X)}")

# Verify the split ratio
print(f"\nSplit verification:")
print(f"Training ratio: {len(X_train)/len(X):.3f} (should be ~0.750)")
print(f"Testing ratio: {len(X_test)/len(X):.3f} (should be ~0.250)")

Data split completed successfully!
Training set size: 75000 samples (75.0%)
Testing set size: 25000 samples (25.0%)
Total samples: 100000

Split verification:
Training ratio: 0.750 (should be ~0.750)
Testing ratio: 0.250 (should be ~0.250)
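
The `random_state=42` argument is what makes the split repeatable. A small sketch of that behavior, assuming toy arrays (`X_demo` and `y_demo` are hypothetical and unrelated to the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Each call returns (X_train, X_test, y_train, y_test); compare the test rows.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(f"Same test rows on both calls: {same_test_rows}")
print(f"Train shape: {split_a[0].shape}, test shape: {split_a[1].shape}")
```

Leaving `random_state` unset would draw a different shuffle on each run, so the reported metrics would change from run to run.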


## Step 4: Provide Data Dimensions (Train and Test) (5pts)
Display the dimensions of training and testing datasets

In [16]:
# Step 4: Provide data dimensions (train and test)
print("DATA DIMENSIONS SUMMARY")
print("=" * 40)

print("\nTraining Set Dimensions:")
print(f"X_train shape: {X_train.shape} (rows: {X_train.shape[0]}, features: {X_train.shape[1]})")
print(f"y_train shape: {y_train.shape} (target values: {y_train.shape[0]})")

print("\nTesting Set Dimensions:")
print(f"X_test shape: {X_test.shape} (rows: {X_test.shape[0]}, features: {X_test.shape[1]})")
print(f"y_test shape: {y_test.shape} (target values: {y_test.shape[0]})")

print("\nFeature Information:")
print(f"Number of features: {X_train.shape[1]}")
print(f"Feature names: {list(X_train.columns)}")

print(f"\nTotal dataset size: {len(X)} samples")
print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

DATA DIMENSIONS SUMMARY
========================================

Training Set Dimensions:
X_train shape: (75000, 3) (rows: 75000, features: 3)
y_train shape: (75000,) (target values: 75000)

Testing Set Dimensions:
X_test shape: (25000, 3) (rows: 25000, features: 3)
y_test shape: (25000,) (target values: 25000)

Feature Information:
Number of features: 3
Feature names: ['age', 'bmi', 'blood_glucose_level']

Total dataset size: 100000 samples
Training samples: 75000 (75.0%)
Testing samples: 25000 (25.0%)


## Step 5: Define the Linear Regression Model (5pts)
Create the Linear Regression model and name it after the student's family name (`Llatuna_LinearRegression`)

In [17]:
# Step 5: Define the Linear Regression Model
# Create a Linear Regression model instance
# Model name follows the required <FamilyName>_LinearRegression convention
Llatuna_LinearRegression = LinearRegression()

print("Linear Regression Model Defined Successfully!")
print(f"Model Name: Llatuna_LinearRegression")
print(f"Model Type: {type(Llatuna_LinearRegression).__name__}")
print(f"Model Parameters: {Llatuna_LinearRegression.get_params()}")

# Display model information
print(f"\nModel Details:")
print(f"- Algorithm: Linear Regression")
print(f"- Fit Intercept: {Llatuna_LinearRegression.fit_intercept}")
print(f"- Copy X: {Llatuna_LinearRegression.copy_X}")
print(f"- Positive: {Llatuna_LinearRegression.positive}")

Linear Regression Model Defined Successfully!
Model Name: Llatuna_LinearRegression
Model Type: LinearRegression
Model Parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}

Model Details:
- Algorithm: Linear Regression
- Fit Intercept: True
- Copy X: True
- Positive: False


## Step 6: Build the Training Model (5pts)
Train the Linear Regression model using the training data

In [18]:
# Step 6: Build the training model
# Train the Linear Regression model using the training data
print("Training the Linear Regression model...")

# Fit the model to the training data
Llatuna_LinearRegression.fit(X_train, y_train)

print("Model training completed successfully!")

# Display model coefficients and intercept
print(f"\nModel Training Results:")
print(f"Intercept (β₀): {Llatuna_LinearRegression.intercept_:.4f}")
print(f"\nCoefficients (β₁, β₂, β₃):")
for i, feature in enumerate(X_train.columns):
    print(f"  {feature}: {Llatuna_LinearRegression.coef_[i]:.4f}")

# Display the regression equation
print(f"\nRegression Equation:")
equation = f"HbA1c = {Llatuna_LinearRegression.intercept_:.4f}"
for i, feature in enumerate(X_train.columns):
    coef = Llatuna_LinearRegression.coef_[i]
    sign = "+" if coef >= 0 else ""
    equation += f" {sign}{coef:.4f}*{feature}"
print(equation)

Training the Linear Regression model...
Model training completed successfully!

Model Training Results:
Intercept (β₀): 4.6229

Coefficients (β₁, β₂, β₃):
  age: 0.0034
  bmi: 0.0073
  blood_glucose_level: 0.0041

Regression Equation:
HbA1c = 4.6229 +0.0034*age +0.0073*bmi +0.0041*blood_glucose_level
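
The regression equation can be checked by hand on a single row. The sketch below plugs the first row of `X` printed in Step 2 (age 32.0, BMI 27.32, blood glucose 100) into the equation, using the coefficients as printed above. Those are rounded to four decimals, so the result is an approximation of what `predict` would return.

```python
# Intercept and coefficients copied from the training output (rounded to 4 decimals).
intercept = 4.6229
coefficients = {"age": 0.0034, "bmi": 0.0073, "blood_glucose_level": 0.0041}

# First row of X as printed in Step 2.
row = {"age": 32.0, "bmi": 27.32, "blood_glucose_level": 100}

# Prediction = intercept + sum of (coefficient * feature value).
predicted_hba1c = intercept + sum(
    coefficients[name] * row[name] for name in coefficients
)
print(f"Predicted HbA1c: {predicted_hba1c:.4f}")
```

This evaluates to about 5.3411, while the recorded `hbA1c_level` for that row is 5.0.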


## Step 7: Perform Prediction on Test Data (5pts)
Use the trained model to make predictions on the test dataset

In [19]:
# Step 7: Perform prediction on test data
# Use the trained model to predict on test data
print("Making predictions on test data...")

# Generate predictions
y_pred = Llatuna_LinearRegression.predict(X_test)

print("Predictions completed successfully!")

# Display prediction results
print(f"\nPrediction Results Summary:")
print(f"Number of predictions: {len(y_pred)}")
print(f"Prediction range: {y_pred.min():.4f} to {y_pred.max():.4f}")
print(f"Actual range: {y_test.min():.4f} to {y_test.max():.4f}")

# Show first 10 predictions vs actual values
print(f"\nFirst 10 Predictions vs Actual Values:")
print("Index | Predicted | Actual  | Difference")
print("-" * 40)
for i in range(min(10, len(y_pred))):
    pred = y_pred[i]
    actual = y_test.iloc[i]
    diff = abs(pred - actual)
    print(f"{i+1:5d} | {pred:8.4f} | {actual:7.4f} | {diff:8.4f}")

# Store predictions for performance evaluation
print(f"\nPredictions stored for performance evaluation.")

Making predictions on test data...
Predictions completed successfully!

Prediction Results Summary:
Number of predictions: 25000
Prediction range: 5.0363 to 6.4825
Actual range: 3.5000 to 9.0000

First 10 Predictions vs Actual Values:
Index | Predicted | Actual  | Difference
----------------------------------------
    1 |   5.8322 |  4.8000 |   1.0322
    2 |   5.6528 |  6.5000 |   0.8472
    3 |   5.5943 |  6.8000 |   1.2057
    4 |   5.5003 |  4.5000 |   1.0003
    5 |   5.4782 |  4.0000 |   1.4782
    6 |   6.1716 |  5.7000 |   0.4716
    7 |   5.3947 |  6.2000 |   0.8053
    8 |   5.6310 |  6.5000 |   0.8690
    9 |   5.7245 |  5.8000 |   0.0755
   10 |   5.4682 |  6.5000 |   1.0318

Predictions stored for performance evaluation.
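
The table above reports absolute differences, which hide the direction of each error. As a sketch, the ten rows printed above can be re-read as signed residuals (actual minus predicted). The values are copied from that output, and the variable names are illustrative.

```python
# Values copied from the "First 10 Predictions vs Actual Values" table.
predicted = [5.8322, 5.6528, 5.5943, 5.5003, 5.4782,
             6.1716, 5.3947, 5.6310, 5.7245, 5.4682]
actual = [4.8, 6.5, 6.8, 4.5, 4.0, 5.7, 6.2, 6.5, 5.8, 6.5]

# Signed residual: negative means the model predicted too high.
residuals = [a - p for a, p in zip(actual, predicted)]

over_predicted = sum(1 for r in residuals if r < 0)
under_predicted = sum(1 for r in residuals if r > 0)
mean_abs_error = sum(abs(r) for r in residuals) / len(residuals)

print(f"Over-predicted rows:  {over_predicted}")
print(f"Under-predicted rows: {under_predicted}")
print(f"Mean absolute error over these rows: {mean_abs_error:.4f}")
```

For these ten rows, 4 are over-predicted and 6 under-predicted, with a mean absolute error of about 0.88. Ten rows are too few to generalize from; the same computation on all of `y_test` and `y_pred` would give the full picture.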


## Step 8: Print Model Performance (5pts)
Evaluate the Linear Regression model by binning actual and predicted HbA1c values into three risk categories and reporting classification metrics

In [20]:
# Step 8: Print Model Performance
# Bin the continuous HbA1c values into risk categories, then compute classification metrics

print("MODEL PERFORMANCE EVALUATION")
print("=" * 50)

# Import the classification metrics used below
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Convert continuous HbA1c predictions to diabetes risk categories
def categorize_hba1c(hba1c_value):
    """Convert HbA1c level to diabetes risk category"""
    if hba1c_value < 5.7:
        return 0  # Normal
    elif hba1c_value < 6.5:
        return 1  # Prediabetes
    else:
        return 2  # Diabetes

# Convert actual and predicted values to categories
y_test_categorical = y_test.apply(categorize_hba1c)
y_pred_categorical = [categorize_hba1c(pred) for pred in y_pred]

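# Calculate classification metrics (averaged across classes, weighted by class support).
# No test prediction reaches 6.5 (the maximum is 6.4825), so class 2 is never predicted
# and its precision is undefined; zero_division=0 scores it as 0 explicitly.
accuracy = accuracy_score(y_test_categorical, y_pred_categorical)
precision = precision_score(y_test_categorical, y_pred_categorical, average='weighted', zero_division=0)
recall = recall_score(y_test_categorical, y_pred_categorical, average='weighted')
f1 = f1_score(y_test_categorical, y_pred_categorical, average='weighted', zero_division=0)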

# Print performance metrics
print(f"\nModel Name: Llatuna_LinearRegression")
print(f"Algorithm: Linear Regression (converted to classification)")
print(f"\nCLASSIFICATION PERFORMANCE METRICS:")
print("-" * 40)
print(f"Accuracy:     {accuracy:.6f}")
print(f"Precision:    {precision:.6f}")
print(f"Recall:       {recall:.6f}")
print(f"F1-Score:     {f1:.6f}")

print(f"\nCATEGORY MAPPING:")
print("0 = Normal (HbA1c < 5.7)")
print("1 = Prediabetes (5.7 ≤ HbA1c < 6.5)")
print("2 = Diabetes (HbA1c ≥ 6.5)")

# Additional statistics
print(f"\nADDITIONAL STATISTICS:")
print("-" * 25)
print(f"Training samples: {len(X_train)}")
print(f"Testing samples:  {len(X_test)}")
print(f"Number of features: {X_train.shape[1]}")
print(f"Feature names: {list(X_train.columns)}")

print(f"\nMODEL TRAINING STATUS: ✅ COMPLETED")
print(f"All 8 steps have been successfully executed!")

MODEL PERFORMANCE EVALUATION
==================================================

Model Name: Llatuna_LinearRegression
Algorithm: Linear Regression (converted to classification)

CLASSIFICATION PERFORMANCE METRICS:
----------------------------------------
Accuracy:     0.396200
Precision:    0.321516
Recall:       0.396200
F1-Score:     0.298308

CATEGORY MAPPING:
0 = Normal (HbA1c < 5.7)
1 = Prediabetes (5.7 ≤ HbA1c < 6.5)
2 = Diabetes (HbA1c ≥ 6.5)

ADDITIONAL STATISTICS:
-------------------------
Training samples: 75000
Testing samples:  25000
Number of features: 3
Feature names: ['age', 'bmi', 'blood_glucose_level']

MODEL TRAINING STATUS: ✅ COMPLETED
All 8 steps have been successfully executed!
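
Step 8 scores the model as a classifier, while the regression metrics imported at the top of the notebook (`mean_squared_error`, `mean_absolute_error`, `r2_score`) go unused. A minimal sketch of how those could be reported follows. It runs on synthetic data, so `X_demo`, `y_demo`, and `demo_model` are hypothetical stand-ins and the printed numbers say nothing about the diabetes dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 3 features, linear signal plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = 5.5 + X_demo @ np.array([0.3, 0.1, 0.4]) + rng.normal(scale=0.5, size=400)

# Same 75/25 split and fit/predict flow as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
demo_model = LinearRegression().fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# Regression metrics computed on the held-out set.
mse = mean_squared_error(y_te, y_hat)   # mean of squared errors
rmse = float(np.sqrt(mse))              # same units as the target
mae = mean_absolute_error(y_te, y_hat)  # mean of absolute errors
r2 = r2_score(y_te, y_hat)              # share of variance explained

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
print(f"R^2:  {r2:.4f}")
```

In the notebook the same three metric calls would take `y_test` and `y_pred` as arguments. RMSE and MAE are expressed in HbA1c units, which makes them easier to read against the 3.5 to 9.0 range of actual values than the binned accuracy is.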
