# Diabetes Prediction and Analysis

In this project, the researchers aim to investigate and predict the likelihood of diabetes in individuals using a publicly available dataset. The approach involves exploratory data analysis, a comparison of key health metrics between diabetic and non-diabetic groups, and the development of machine learning models for prediction.

The primary objective is to identify significant health indicators linked to diabetes and to construct reliable models that can support early detection efforts.


**Dataset used:**

    Clinical health records
    Key features: gender, age, hypertension, heart_disease, smoking_history, bmi, HbA1c_level, blood_glucose_level
    Target Variable: diabetes
    

**Key Analytics Questions Solved:**

    Q1: Which features are most important for predicting diabetes?
    Q3: What is the best model for this Diabetes Prediction?


**Models used:**

    KNN
    Logistic Regression (L1 and L2)
    SVM (L1 and L2)

 
**Model Results Summary**



**Top Predictors:**

    

**Researchers:**

    Maghinay, Shane
    Pesaras, Nilmar
    Baguio, Ryan
    Ventic


**Definition of Terms**

**gender:** refers to the biological sex of the individual, which can have an impact on their susceptibility to diabetes.

**age:** an important factor as diabetes is more commonly diagnosed in older adults. Age ranges from 0-80 in our dataset.

**hypertension:** a medical condition in which the blood pressure in the arteries is persistently elevated. It takes the value 0 or 1, where 0 indicates no hypertension and 1 indicates hypertension.

**heart_disease:** another medical condition that is associated with an increased risk of developing diabetes. It takes the value 0 or 1, where 0 indicates no heart disease and 1 indicates heart disease.

**smoking_history:** considered a risk factor for diabetes; smoking can also exacerbate the complications associated with diabetes. In our dataset there are 6 categories: not current, former, No Info, current, never, and ever.

**bmi (Body Mass Index):** a measure of body fat based on weight and height. Higher BMI values are linked to a higher risk of diabetes. The range of BMI in the dataset is from 10.16 to 71.55. BMI less than 18.5 is underweight, 18.5-24.9 is normal, 25-29.9 is overweight, and 30 or more is obese.

**HbA1c_level (Hemoglobin A1c):** a measure of a person's average blood sugar level over the past 2-3 months. Higher levels indicate a greater risk of developing diabetes. An HbA1c level of 6.5% or higher generally indicates diabetes.

**blood_glucose_level:** refers to the amount of glucose in the bloodstream at a given time. High blood glucose levels are a key indicator of diabetes.

**diabetes:** target variable being predicted, with values of 1 indicating the presence of diabetes and 0 indicating the absence of diabetes.
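
The BMI bands and the HbA1c threshold described above can be turned into simple flags. The following is a minimal sketch on hypothetical sample values (not taken from the dataset); the names `bmi_bins`, `bmi_labels`, and `sample` are illustrative, and treating 6.5% as an inclusive boundary is an assumption.

```python
import pandas as pd

# BMI cut-offs as stated in the definitions (each bin excludes its right edge).
bmi_bins = [0, 18.5, 25, 30, float("inf")]
bmi_labels = ["underweight", "normal", "overweight", "obese"]

# Hypothetical values, chosen only to exercise each band.
sample = pd.DataFrame({
    "bmi": [17.0, 22.4, 27.3, 35.1],
    "HbA1c_level": [5.0, 5.7, 6.5, 8.2],
})

# Map each BMI value to its band.
sample["bmi_category"] = pd.cut(
    sample["bmi"], bins=bmi_bins, labels=bmi_labels, right=False
)

# Flag HbA1c values at or above 6.5% (inclusive boundary is an assumption).
sample["hba1c_at_or_above_6_5"] = sample["HbA1c_level"] >= 6.5

print(sample)
```

Flags like these are only a reading aid for exploration; the models below are trained on the raw numeric columns.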


<center> --------------------------------------------------------------------------------- </center>

## Import Libraries

In [None]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import skew
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler # Use for encoding categorical variables
from sklearn.model_selection import train_test_split # Use for splitting the dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score # Use for evaluating the model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression


## Load Data

In [None]:
data_set = pd.read_csv('diabetes_prediction_dataset.csv')

data_set

## Data Exploration

Before preprocessing, the researchers explored the data by checking for:
1. Missing (null) values
2. Duplicate rows
3. Column data types


In [None]:
print("Shape of Dataset (Rows, Columns):", data_set.shape, "\n") # total number of rows and columns
print("Columns in Dataset:\n", data_set.columns, "\n") # names of columns
print("First 5 Rows:\n", data_set.head(), "\n")

In [None]:
# Check for Missing Values
print("Missing values per column:")
print(data_set.isnull().sum())

# Check for duplicate values
print("\nDuplicate values:" , data_set.duplicated().sum())

# Find data types
print("\nData types of each column:")
print(data_set.dtypes)


**Findings:**

**Number of Records:** 100,000
**Number of Columns:** 9

**Missing Values:**

    There are no missing (null) values.
    
**Duplicates:**

    Based on the result above, there are `3854` duplicates. Left in place, they can lead to:
    1. Training Bias -> the model becomes biased toward patterns in frequently repeated observations.
    2. Overfitting -> reduced ability to perform well on new, unseen data.
    3. Distorted Feature Importance -> incorrect conclusions about which health indicators truly predict diabetes.
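
To make the duplicate count concrete, here is a minimal sketch on a toy frame (hypothetical values, not the dataset). `duplicated()` marks every row that is identical to an earlier row, so the first occurrence is not counted.

```python
import pandas as pd

# Toy frame: rows 1 and 3 repeat row 0 exactly.
toy = pd.DataFrame({
    "age": [50, 50, 61, 50],
    "bmi": [27.3, 27.3, 30.1, 27.3],
    "diabetes": [0, 0, 1, 0],
})

# Count rows that repeat an earlier row.
n_dup = int(toy.duplicated().sum())

# Keep the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicates:", n_dup)
print(deduped)
```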

**Categorical Features:**

    gender
    smoking_history

**Numerical Features:**

    age
    hypertension
    heart_disease
    bmi
    HbA1c_level
    blood_glucose_level
    diabetes

**Target Variable:**

    diabetes

## Preprocessing

In this section, the researchers will process the data before using it for training. The researchers will do the following:
1. Removing duplicates
2. Encoding Categorical Data
3. Scaling numerical features
4. Correlation Matrix

### Removing duplicates

In [None]:
# Remove duplicate values

data_set.drop_duplicates(inplace=True)

# Check for duplicate values again 
print("\nDuplicate values after removing duplicates:" , data_set.duplicated().sum())

data_set.isnull().sum() # Check for missing values again

### Encoding Categorical data

In [None]:
# Identify categorical columns
categorical_columns = data_set.select_dtypes(include=["object"]).columns

# Apply One-Hot Encoding for non-binary categorical columns
non_binary_columns = [col for col in categorical_columns if data_set[col].nunique() > 2]
data_set = pd.get_dummies(data_set, columns=non_binary_columns, drop_first=True)

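# Convert only the boolean dummy columns to integers
# (casting the whole frame with astype(int) would truncate float columns such as bmi and HbA1c_level)
bool_columns = data_set.select_dtypes(include=["bool"]).columns
data_set[bool_columns] = data_set[bool_columns].astype(int)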

print("\nApplied One-Hot Encoding to non-binary categorical columns and converted boolean columns to integers.\n")

# Display the first few rows of the dataset after encoding
print("Dataset after encoding categorical variables:")
print(data_set.head())

**Justification for One-Hot Encoding**

**Gender Variable (3 categories)**

    Categories: Female, Male, Other
    Why One-Hot:
        * No ordinal relationship exists
        * Prevents artificial ordering
        * Avoids bias in model interpretation
    Resulting Columns:
        * gender_Male
        * gender_Other
        * (Female as reference category)
        
**Smoking History (6 categories)**
    
    Categories: never, No Info, current, former, ever, not current
    Why One-Hot:
        * No inherent order between categories
        * Each category has distinct medical significance
        * Preserves independence of categories
    Resulting Columns:
        * smoking_history_current
        * smoking_history_ever
        * smoking_history_former
        * smoking_history_never
        * smoking_history_not current
        * (No Info as reference category, because drop_first=True drops the first category in sorted order)
        
**Benefits of Approach**

        Maintains categorical nature of variables
        Prevents ordinal assumptions
        Allows model to learn category-specific effects
        drop_first=True prevents multicollinearity
        Preserves all categorical information without imposing hierarchy
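
Because `drop_first=True` drops whichever level comes first in sorted order, it can be worth confirming the reference category on a small frame before interpreting coefficients. The sketch below uses a toy column holding the six smoking categories listed above; `toy`, `kept`, and `reference` are illustrative names.

```python
import pandas as pd

# One row per smoking category, as listed in the dataset description.
toy = pd.DataFrame({
    "smoking_history": ["never", "No Info", "current", "former", "ever", "not current"]
})

# One-hot encode and drop the first level to avoid redundant columns.
encoded = pd.get_dummies(toy, columns=["smoking_history"], drop_first=True)
kept = list(encoded.columns)

# The reference category is the level that has no dummy column.
all_levels = sorted(toy["smoking_history"].unique())
reference = [lvl for lvl in all_levels if f"smoking_history_{lvl}" not in kept]

print("Dummy columns:", kept)
print("Reference category:", reference)
```

Uppercase letters sort before lowercase ones, which is why a label such as `No Info` can end up as the reference rather than a label that looks like the natural baseline.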

### Scaling Numerical Features

In [None]:
feature_columns = [col for col in data_set.columns if col != "diabetes"] # Exclude the target variable

scaler = StandardScaler() # StandardScaler for standardization

data_set[feature_columns] = scaler.fit_transform(data_set[feature_columns])

# Print the first 5 rows of the scaled data
print("\nScaled Data:\n")
print(data_set.head())
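
As a quick check of what standardization does, the sketch below compares `StandardScaler` with the manual z-score on a toy column (hypothetical values). It assumes the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature column (hypothetical values).
values = np.array([[20.0], [25.0], [30.0], [45.0]])

# Standardize with scikit-learn.
scaled = StandardScaler().fit_transform(values)

# Manual z-score: subtract the mean, divide by the population standard deviation.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print("mean:", scaled.mean(), "std:", scaled.std())
```

Note that the model cells below fit a fresh scaler on each training split, which keeps test-set statistics out of the fitted scaler.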

### Correlation Matrix

In [None]:
# Select numerical columns
numerical_columns = data_set.select_dtypes(include=[np.number]).columns

# Calculate correlation matrix
corr_matrix = data_set[numerical_columns].corr()

# Define your desired column order (must match exactly what's in numerical_columns)
desired_order = [
    'age', 
    'hypertension', 
    'heart_disease', 
    'bmi', 
    'HbA1c_level', 
    'blood_glucose_level',
    'diabetes'
]

# Ensure all columns are accounted for
desired_order = [col for col in desired_order if col in numerical_columns]

# Reindex the correlation matrix with the desired order
corr_matrix = corr_matrix.reindex(index=desired_order, columns=desired_order)

# Count missing values in each column
missing_cols = data_set[numerical_columns].isnull().sum()
missing_cols = missing_cols[missing_cols > 0].index

plt.figure(figsize=(10, 10))
ax = sns.heatmap(
    corr_matrix, 
    annot=True, 
    fmt=".2f", 
    cmap="coolwarm", 
    center=0,  
    linewidths=0.10,  
    square=True,
    vmin=-1,
    vmax=1,
    cbar_kws={"shrink": .8, "label": "Correlation Coefficient"},
)

# Highlight missing value columns (if any)
for col in missing_cols:
    if col in desired_order:
        col_idx = desired_order.index(col)
        ax.add_patch(plt.Rectangle((col_idx, 0), 1, len(desired_order), fill=False, edgecolor='green', lw=4))

plt.title("Correlation Heatmap of Numerical Features", fontsize=10)
plt.show()

## Data Splitting

In [None]:
target = "diabetes"
X = data_set.drop(columns=[target]) # Features
y = data_set[target] # Target variable
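
If the target classes turn out to be imbalanced, a stratified split keeps the class ratio the same in the training and test sets. The following is a sketch on synthetic data, not the approach used in the cells below; `X_toy` and `y_toy` are hypothetical, and the 10% positive rate is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positive class (illustrative assumption).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy preserves the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```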

## Training Models

In this section, the researchers train the models in this order:
1. KNN
2. Logistic Regression L2
3. Logistic Regression L1
4. SVM L2
5. SVM L1

### KNN

To determine the optimal KNN model for diabetes prediction, the researchers compared multiple parameters and evaluation metrics. The investigation tested different numbers of neighbors (k values from 1 to 12), several train-test split ratios (test sizes from 20% to 35%), and multiple random states to make the results more robust.
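
The search over k described above can be sketched with a pipeline, so that scaling is refit inside each cross-validation fold. This is a minimal illustration on synthetic data from `make_classification`, not the researchers' exact procedure; `X_demo`, `y_demo`, `cv_means`, and `best_k_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Mean 6-fold CV accuracy for each k from 1 to 12.
cv_means = {}
for k in range(1, 13):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(pipe, X_demo, y_demo, cv=6, scoring="accuracy").mean()

# Pick the k with the highest mean CV accuracy.
best_k_demo = max(cv_means, key=cv_means.get)
print("Best k:", best_k_demo, "CV accuracy:", round(cv_means[best_k_demo], 4))
```

Because cross-validation here does not depend on the train-test split ratio, it only needs to run once rather than once per test size.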

#### Defining ranges

In [None]:
# Define ranges
test_sizes = [0.2, 0.25, 0.3, 0.35]
random_states = range(0, 2)
neighbors_range = range(1, 13)  # k = 1..12 (range(1, 1) is empty and would skip the search)

#### Training the Model

In [None]:

results = []

# Step 1: Iterate over different test sizes
for test_size in test_sizes:
    train_scores = []
    test_scores = []
    
    # Step 2: Iterate over random states for variability
    for random_state in random_states:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )

        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Temporarily use k=5 for initial training
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(X_train_scaled, y_train)

        y_train_pred = knn.predict(X_train_scaled)
        y_test_pred = knn.predict(X_test_scaled)

        train_scores.append(accuracy_score(y_train, y_train_pred))
        test_scores.append(accuracy_score(y_test, y_test_pred))

    avg_train_accuracy = np.mean(train_scores)
    avg_test_accuracy = np.mean(test_scores)

    # Step 3: Perform CV to find the best k
    best_k = 0
    best_score = 0
    scores = []

    for k in neighbors_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        cv_scores = cross_val_score(knn, X, y, cv=6, scoring='accuracy')
        avg_cv_score = np.mean(cv_scores)
        scores.append(avg_cv_score)

        if avg_cv_score > best_score:
            best_score = avg_cv_score
            best_k = k

    # Store the results for this test size
    results.append((test_size, best_k, avg_train_accuracy, avg_test_accuracy, scores))

    # Step 4: Show the best k for each test size (with CV)
    print(f"Test Size: {test_size}, Best k: {best_k}, CV Accuracy: {best_score:.4f}")

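    # Step 5: Plot the CV accuracy vs. k for this test size
    plt.plot(neighbors_range, scores, label=f"Test size = {test_size}")

# Step 6: Label and show the CV accuracy curves for all test sizes
plt.xlabel("Number of neighbors (k)")
plt.ylabel("Mean CV accuracy")
plt.legend()
plt.show()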

#### Visualization

In [None]:
# Step 7: Display the results for all test sizes
for test_size, best_k, avg_train, avg_test, _ in results:
    print(f"Test Size: {test_size}, Best k: {best_k}, "
          f"Avg Train Accuracy: {avg_train * 100:.2f}%, "
          f"Avg Test Accuracy: {avg_test * 100:.2f}%")

# Step 8: Get the best test size (based on Avg Test Accuracy)
best_result = max(results, key=lambda x: x[3])  # x[3] is Avg Test Accuracy

best_knn_test_size = best_result[0]
best_knn_k = best_result[1]
best_knn_train_acc = best_result[2]
best_knn_test_acc = best_result[3]

print(f"\nBest Test Size: {best_knn_test_size}, Best k: {best_knn_k}, "
      f"Avg Train Accuracy: {best_knn_train_acc * 100:.2f}%, "
      f"Avg Test Accuracy: {best_knn_test_acc * 100:.2f}%")

# Step 9: Create a DataFrame for the bar plot
results_df = pd.DataFrame(results, columns=['Test Size', 'Best k', 'Avg Train Accuracy', 'Avg Test Accuracy', 'CV Scores'])

# Melt the DataFrame for plotting
results_melted = results_df.melt(id_vars=['Test Size'], 
                                value_vars=['Avg Train Accuracy', 'Avg Test Accuracy'],
                                var_name='Accuracy Type', 
                                value_name='Accuracy')

# Clean up the accuracy type labels
results_melted['Accuracy Type'] = results_melted['Accuracy Type'].str.replace('Avg ', '')

# Step 10: Plot the bar chart comparing Train and Test Accuracies
sns.set_style("whitegrid")
palette = {'Train Accuracy': "#e60f0f", 'Test Accuracy': "#e6d816"}

plt.figure(figsize=(12, 7))
ax = sns.barplot(x='Test Size', y='Accuracy', hue='Accuracy Type', 
                data=results_melted, palette=palette, dodge=True,
                edgecolor='black', linewidth=1, alpha=0.85,
                width=0.5)

# Customize the plot
plt.xlabel('Test Size', fontsize=12, labelpad=10, fontweight='bold')
plt.ylabel('Accuracy', fontsize=12, labelpad=10, fontweight='bold')
plt.title(f'Model Performance Across Different Test Sizes', fontsize=14, pad=20, fontweight='bold')

# Format y-axis as percentages (a formatter avoids relabeling fixed tick positions)
from matplotlib.ticker import PercentFormatter
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1.0, decimals=0))

# Custom grid settings
ax.grid(visible=True, which='major', axis='y', linestyle='--', linewidth=0.7, alpha=0.6)
ax.grid(visible=True, which='minor', axis='y', linestyle=':', linewidth=0.5, alpha=0.4)
ax.minorticks_on()

# Improve legend
legend = plt.legend(title='Accuracy Type', frameon=True, 
                   facecolor='white', framealpha=1,
                   bbox_to_anchor=(1.02, 1), loc='upper left')
legend.get_title().set_fontweight('bold')

# Add value labels on top of bars
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2., height + 0.01,
            '{:.1%}'.format(height),
            ha="center", fontsize=9, fontweight='bold')

# Adjust layout and appearance
sns.despine(left=True, right=True, top=True)
plt.tight_layout()

# Add horizontal line at 50% for reference
plt.axhline(y=0.5, color='gray', linestyle=':', linewidth=1, alpha=0.5)

# Adjust spacing between bars
plt.gca().margins(x=0.1)

plt.show()


#### Training Evaluation

### <center>Logistic Regression</center>

#### Logistic Regression L2

##### Training the Model

In [None]:
# Parameters for tuning
test_sizes = [0.2, 0.25, 0.3, 0.35]
random_states = range(0, 2)  # 2 random states (0 and 1); widen the range for more repetitions
C_values = [0.001, 0.01, 0.1, 1, 10, 100]

results = []

# Step 1: Evaluate different test sizes and C values to find the best configuration
for test_size in test_sizes:
    print(f"\nEvaluating Test Size: {test_size}")
    
    # Collect (C, train_acc, test_acc, gap) for this test_size
    info = []
    
    for C in C_values:
        train_scores = []
        test_scores = []
        
        for rs in random_states:
            # Split the data
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=test_size, random_state=rs
            )
            # Scale the data
            scaler = StandardScaler()
            X_tr = scaler.fit_transform(X_train)
            X_te = scaler.transform(X_test)
            
            # Train the model
            model = LogisticRegression(penalty='l2', C=C, max_iter=10000)
            model.fit(X_tr, y_train)
            
            # Predict and calculate accuracy
            train_scores.append(accuracy_score(y_train, model.predict(X_tr)))
            test_scores.append(accuracy_score(y_test, model.predict(X_te)))
        
        # Average scores and gap calculation
        avg_train = np.mean(train_scores)
        avg_test = np.mean(test_scores)
        gap = abs(avg_train - avg_test)
        
        print(f"  C = {C:<6} Avg Train Acc = {avg_train*100:6.2f}%, "
              f"Avg Test Acc = {avg_test*100:6.2f}%, Gap = {gap*100:5.2f}%")
        
        info.append((C, avg_train, avg_test, gap))
    
    # Pick the C with the smallest gap
    best_C, best_tr, best_te, best_gap = min(info, key=lambda x: x[3])
    results.append((test_size, best_C, best_tr, best_te, best_gap))


# Step 3: Select the overall best configuration (smallest train-test gap across test sizes)
best_overall = min(results, key=lambda x: x[4])  # x[4] is the gap
log_l2_best_test_size = best_overall[0]  # Best test size
log_l2_best_C = best_overall[1]  # Best C

# Initialize lists to store the accuracies for averaging
train_accs_final = []
test_accs_final = []

# Step 4: Retrain the model with the best test size and best C, averaging over all random states
for random_state in random_states:
    # Split data for training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=log_l2_best_test_size, random_state=random_state)

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train the model on the full training set and evaluate on the test set
    best_model = LogisticRegression(penalty='l2', C=log_l2_best_C, max_iter=10000)
    best_model.fit(X_train_scaled, y_train)

    train_acc = best_model.score(X_train_scaled, y_train)
    test_acc = best_model.score(X_test_scaled, y_test)

    # Store the accuracies to average later
    train_accs_final.append(train_acc)
    test_accs_final.append(test_acc)

# Average the train and test accuracies over all random states
log_l2_best_train_accuracy = np.mean(train_accs_final)
log_l2_best_test_accuracy = np.mean(test_accs_final)

# Feature importance: Calculate absolute values of the coefficients as feature importance
log_l2_feature_importance = np.abs(best_model.coef_[0])
log_l2_feature_importance_percent = 100 * log_l2_feature_importance / log_l2_feature_importance.sum()

# Rank features by importance (descending order)
log_l2_ranked_features = np.argsort(log_l2_feature_importance_percent)[::-1]

# Get column names (assuming X is your feature DataFrame)
log_l2_feature_names = X.columns.tolist()

##### Training Summary

In [None]:
# Step 2: Print the summary for best C for each test size
print("\nSummary of Best C for Each Test Size (L2, Least Overfitting):")
for result in results:
    print(f"Test Size: {result[0]:<5}, Best C: {result[1]:<6}, Train: {result[2]*100:6.2f}%, "
          f"Test: {result[3]*100:6.2f}%, Gap: {result[4]*100:5.2f}%")

# Summary of Best C and Test Size
print(f"\nLog L2 Summary:")
print(f"Best Test Size: {log_l2_best_test_size}")
print(f"Best C: {log_l2_best_C}")
print(f"Final Average Train Accuracy: {log_l2_best_train_accuracy * 100:.2f}%")
print(f"Final Average Test Accuracy: {log_l2_best_test_accuracy * 100:.2f}%")



# Plotting accuracy and gap
summary_df = pd.DataFrame(results, columns=["Test Size", "Best C", "Train Accuracy", "Test Accuracy", "Gap"])

# Melt for Seaborn
melted = summary_df.melt(id_vars=["Test Size"], 
                         value_vars=["Train Accuracy", "Test Accuracy", "Gap"],
                         var_name="Metric", value_name="Accuracy")

# Softer color palette for L2 plot
color_palette = {
    "Train Accuracy": "#0825cc",  # blue
    "Test Accuracy": "#0dc4c4",  # teal
    "Gap": "#db6809"
}

# Plot
plt.figure(figsize=(10, 6))
ax = sns.barplot(data=melted, x="Test Size", y="Accuracy", hue="Metric", 
                 palette=color_palette, dodge=True)

# Make bars thin
for container in ax.containers:
    for bar in container:
        bar.set_width(0.25)

# Add percentage labels on top
for container in ax.containers:
    for bar in container:
        height = bar.get_height()
        ax.annotate(f'{height * 100:.1f}%',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 4),
                    textcoords="offset points",
                    ha='center', va='bottom', fontsize=9)

# Final styling
plt.title("Train, Test Accuracy, and Gap per Test Size (L2 Regularization)")
plt.ylabel("Accuracy (%)")
plt.xlabel("Test Size")
plt.ylim(0, 1.1)
plt.legend(title="Metric")
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

**Evaluation of L2 Regularization with Varying Test Sizes**

This investigates the performance of **L2-regularized logistic regression** across different **test set sizes**, aiming to:

- Identify the **optimal test size and regularization strength (C)**.
- Minimize the **overfitting gap** between training and testing performance.
- Visualize the effects of test size on model generalization.

**Results Summary**

| Test Size | Best C | Train Accuracy | Test Accuracy | Overfitting Gap |
|-----------|--------|----------------|---------------|------------------|
| 0.20      | 0.01   | 94.00%         | 94.00%        | **0.00%**        |
| 0.25      | 0.10   | 98.38%         | 98.33%        | 0.05%            |
| 0.30      | 0.10   | 98.39%         | 98.16%        | 0.23%            |
| 0.35      | 0.01   | 93.03%         | 92.92%        | 0.11%            |

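- **Selected configuration** (smallest overfitting gap, the criterion used in the code above):
  - **Test Size = 0.20**
  - **Best C = 0.01**
  - **Train Acc = 94.00%**, **Test Acc = 94.00%**
- The highest test accuracy in the table (98.33%) occurs at **Test Size = 0.25** with **C = 0.10**.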
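
Selecting by smallest gap and selecting by highest test accuracy can give different answers. The sketch below applies both criteria to the figures from the results table above; the tuple layout and the names `rows`, `smallest_gap`, and `highest_test` are illustrative.

```python
# (test_size, best_C, train_accuracy, test_accuracy), copied from the results table.
rows = [
    (0.20, 0.01, 0.9400, 0.9400),
    (0.25, 0.10, 0.9838, 0.9833),
    (0.30, 0.10, 0.9839, 0.9816),
    (0.35, 0.01, 0.9303, 0.9292),
]

# Append the absolute train-test gap to each row.
with_gap = [(ts, c, tr, te, abs(tr - te)) for ts, c, tr, te in rows]

# Criterion 1: smallest gap (least overfitting).
smallest_gap = min(with_gap, key=lambda r: r[4])

# Criterion 2: highest test accuracy.
highest_test = max(with_gap, key=lambda r: r[3])

print("Smallest gap:", smallest_gap[:2])
print("Highest test accuracy:", highest_test[:2])
```

Which criterion is appropriate depends on the goal; a small gap alone does not guarantee the most accurate model.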

**Visualization: Train, Test Accuracy, and Overfitting Gap**

The bar chart above displays **Train Accuracy**, **Test Accuracy**, and the **Overfitting Gap** (difference between the two) across each test size:

**[L2 Regularization Results]**

**Key Insights from the Chart:**

- **Train Accuracy (blue)** and **Test Accuracy (teal)** remain close across all test sizes, indicating **strong generalization**.
-  **Gap (orange)** is minimal, especially at **0.20** and **0.25**, implying **low overfitting**.
- Slight gap increase at **Test Size = 0.30** suggests **mild overfitting** even with good accuracy.
- **Lowest train/test accuracy** appears at **0.35**, showing performance drop due to **less training data**.

---

**Conclusion**

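- The configuration selected for this dataset using L2 regularization is:
  - **Test Size: 0.20**
  - **Regularization Strength (C): 0.01**
- This configuration yields the **smallest train-test gap** (0.00% to two decimal places), not the highest accuracy.
- The gap is larger at test sizes **0.30** and **0.35** than at **0.20** and **0.25**, and accuracy is lowest at **0.35**, which is consistent with a trade-off between **training data quantity** and **model generalization**. The higher accuracies at 0.25 and 0.30 coincide with a different selected C (0.10), so differences between rows reflect C as well as test size.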

##### Visualization

In [None]:
# Create a horizontal bar chart for feature importance
plt.figure(figsize=(12, 8))

# Create sorted data for plotting (descending order)
sorted_idx = log_l2_ranked_features
sorted_features = [log_l2_feature_names[i] for i in sorted_idx]
sorted_importance = log_l2_feature_importance_percent[sorted_idx]

# Reverse the order so highest appears at the top of the plot
sorted_features = sorted_features[::-1]
sorted_importance = sorted_importance[::-1]

# Plot horizontal bars
bars = plt.barh(range(len(sorted_features)), sorted_importance, align='center', color='#4287f5')
plt.yticks(range(len(sorted_features)), sorted_features)

# Add percentage labels to the bars
for i, bar in enumerate(bars):
    width = bar.get_width()
    plt.text(width + 0.5, 
             bar.get_y() + bar.get_height()/2, 
             f'{sorted_importance[i]:.2f}%', 
             ha='left', 
             va='center',
             fontweight='bold')

# Add title and labels
plt.title('Logistic Regression L2 Feature Importance', fontsize=15)
plt.xlabel('Importance (%)', fontsize=12)
plt.ylabel('Features', fontsize=12)

# Add grid for better readability
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Print the ranked feature importances (still in original descending order)
print("\nLog L2 Feature Importance (sorted by importance):")
for i, idx in enumerate(log_l2_ranked_features):
    print(f"  Rank {i + 1}: {log_l2_feature_names[idx]} - {log_l2_feature_importance_percent[idx]:.2f}%")
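
The importance percentages above come from normalizing absolute coefficients. The sketch below reproduces that calculation on synthetic data; it is an illustration only, with hypothetical feature names, and it assumes the features are standardized so that coefficient magnitudes are comparable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Fit an L2-regularized model (the scikit-learn default penalty) on standardized features.
model = LogisticRegression(C=0.1, max_iter=1000)
model.fit(StandardScaler().fit_transform(X_demo), y_demo)

# Share of each absolute coefficient in the total, as a percentage.
abs_coef = np.abs(model.coef_[0])
share = 100 * abs_coef / abs_coef.sum()

# Feature names ordered from largest to smallest share.
ranking = [names[i] for i in np.argsort(share)[::-1]]

for name in ranking:
    print(f"{name}: {share[names.index(name)]:.2f}%")
```

Coefficient magnitude is a rough guide rather than a causal measure, and correlated features can share or trade weight between them.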

**Interpretations**



### Logistic Regression L1

#### Training the Model

In [None]:
# Parameters
test_sizes = [0.2, 0.25, 0.3, 0.35]
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
random_states = range(0, 2)  # 2 random states (0 and 1); widen the range for more iterations

logistic_l1_results = []  # To store best results per test size

# Variables with unique names
Log_l1_best_test_size = None
Log_l1_best_C = None
Log_l1_best_train_acc = 0
Log_l1_best_test_acc = 0
Log_l1_best_gap = float('inf')

# Step 1: Evaluate all combinations of test size and C to find the best configuration
for test_size in test_sizes:
    print(f"\nEvaluating Test Size: {test_size}")
    
    c_train_scores = {C: [] for C in C_values}
    c_test_scores = {C: [] for C in C_values}
    
    for random_state in random_states:
        # Split the data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )
        
        # Scale features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Train and evaluate for each C
        for C in C_values:
            model = LogisticRegression(C=C, penalty='l1', solver='liblinear', max_iter=1000)
            model.fit(X_train_scaled, y_train)

            y_train_pred = model.predict(X_train_scaled)
            y_test_pred = model.predict(X_test_scaled)

            train_acc = accuracy_score(y_train, y_train_pred)
            test_acc = accuracy_score(y_test, y_test_pred)

            c_train_scores[C].append(train_acc)
            c_test_scores[C].append(test_acc)
    
    # Evaluate overfitting and print details
    overfit_info = []
    for C in C_values:
        avg_train = np.mean(c_train_scores[C])
        avg_test = np.mean(c_test_scores[C])
        gap = abs(avg_train - avg_test)
        overfit_info.append((C, avg_train, avg_test, gap))
        print(f"  C = {C}: Avg Train Acc = {avg_train * 100:.2f}%, "
              f"Avg Test Acc = {avg_test * 100:.2f}%, Gap = {gap * 100:.2f}%")
    
    # Pick C with least overfitting (smallest gap)
    best_C_for_test_size, best_train, best_test, best_gap_for_test_size = min(overfit_info, key=lambda x: x[3])
    
    # Track the overall best configuration
    if best_gap_for_test_size < Log_l1_best_gap:
        Log_l1_best_test_size = test_size
        Log_l1_best_C = best_C_for_test_size
        Log_l1_best_train_acc = best_train
        Log_l1_best_test_acc = best_test
        Log_l1_best_gap = best_gap_for_test_size

    logistic_l1_results.append((test_size, best_C_for_test_size, best_train, best_test, best_gap_for_test_size))



# Step 3: Retrain the model with the best test size and best C, averaging over all random states
train_accs_final = []
test_accs_final = []

for random_state in random_states:
    # Split data for training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=Log_l1_best_test_size, random_state=random_state)

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train the model on the full training set and evaluate on the test set
    best_model = LogisticRegression(C=Log_l1_best_C, penalty='l1', solver='liblinear', max_iter=1000)
    best_model.fit(X_train_scaled, y_train)

    train_acc = best_model.score(X_train_scaled, y_train)
    test_acc = best_model.score(X_test_scaled, y_test)

    # Store the accuracies to average later
    train_accs_final.append(train_acc)
    test_accs_final.append(test_acc)

# Average the train and test accuracies over all random states
Log_l1_avg_train_acc_final = np.mean(train_accs_final)
Log_l1_avg_test_acc_final = np.mean(test_accs_final)

# Get the absolute values of coefficients as feature importance for the final model
Log_l1_feature_importances = np.abs(best_model.coef_[0])

# Calculate percentage importance for each feature
Log_l1_feature_importance_percent = 100 * Log_l1_feature_importances / Log_l1_feature_importances.sum()

# Rank features by importance
Log_l1_ranked_features = np.argsort(Log_l1_feature_importance_percent)[::-1]


#### Training Summary

In [None]:
# Step 2: Print the summary for best C for each test size
print("\nSummary of Best C for Each Test Size (L1, Least Overfitting):")
for result in logistic_l1_results:
    print(f"Test Size: {result[0]:<5}, Best C: {result[1]:<6}, Train: {result[2]*100:6.2f}%, "
          f"Test: {result[3]*100:6.2f}%, Gap: {result[4]*100:5.2f}%")
# Step 4: Output feature importance
print(f"\nBest Test Size: {Log_l1_best_test_size}, Best C: {Log_l1_best_C}")

# Print out the averaged accuracies
print(f"\nFinal Average Train Accuracy (over {len(random_states)} random states): {Log_l1_avg_train_acc_final * 100:.2f}%")
print(f"Final Average Test Accuracy (over {len(random_states)} random states): {Log_l1_avg_test_acc_final * 100:.2f}%")

# Convert results to DataFrame
summary_df = pd.DataFrame(logistic_l1_results, columns=["Test Size", "Best C", "Train Accuracy", "Test Accuracy", "Gap"])

# Melt for Seaborn
melted = summary_df.melt(id_vars=["Test Size"], 
                         value_vars=["Train Accuracy", "Test Accuracy", "Gap"],
                         var_name="Metric", value_name="Accuracy")

# Softer color palette for L1 plot
color_palette = {
    "Train Accuracy": "#0825cc",  # blue
    "Test Accuracy": "#0dc4c4",  # teal
    "Gap": "#db6809"
}

# Plot
plt.figure(figsize=(10, 6))
ax = sns.barplot(data=melted, x="Test Size", y="Accuracy", hue="Metric", 
                 palette=color_palette, dodge=True)

# Make bars thin
for container in ax.containers:
    for bar in container:
        bar.set_width(0.25)

# Add percentage labels on top
for container in ax.containers:
    for bar in container:
        height = bar.get_height()
        ax.annotate(f'{height * 100:.1f}%',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 4),
                    textcoords="offset points",
                    ha='center', va='bottom', fontsize=9)

# Final styling
plt.title("Train, Test Accuracy, and Gap per Test Size (L1 Regularization)")
plt.ylabel("Accuracy (%)")
plt.xlabel("Test Size")
plt.ylim(0, 1.1)
plt.legend(title="Metric")
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

#### Evaluation


**L1 Logistic Regression: Train vs Test Accuracy Across Test Sizes**

This section analyzes how the model performs using **L1 regularization** across different **test sizes**. It identifies the best `C` value (inverse of regularization strength) that causes the **least overfitting**.
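
Since `C` is the inverse of regularization strength, smaller values push more L1 coefficients to exactly zero. The sketch below shows this on synthetic data; it is an illustration under assumed settings (`make_classification` with 10 features), and `nonzero_by_C` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized so the penalty treats features evenly.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Count non-zero coefficients for each C in the grid used above.
nonzero_by_C = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, penalty="l1", solver="liblinear", max_iter=1000)
    model.fit(X_demo, y_demo)
    nonzero_by_C[C] = int(np.count_nonzero(model.coef_[0]))

for C, count in nonzero_by_C.items():
    print(f"C = {C}: {count} non-zero coefficients")
```

A very small `C` can zero out most or all coefficients, which tends to lower both train and test accuracy together and therefore also shrinks the gap.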

---

**Summary of Results (Best `C`, Accuracy & Overfitting Gap)**

| Test Size | Best C | Train Accuracy | Test Accuracy | Overfitting Gap |
|-----------|--------|----------------|----------------|------------------|
| 0.20      | 0.01   | 93.80%         | 93.82%         | 0.01%            |
| 0.25      | 0.01   | 93.71%         | 93.69%         | 0.02%            |
| 0.30      | 0.01   | 93.55%         | 93.51%         | 0.04%            |
| 0.35      | 1      | 99.50%         | 99.25%         | 0.24%            |

**Best Test Size**: **0.20**  
**Best C**: **0.01** (lowest overfitting)

---

**Chart Explanation**

The bar chart above compares:

- **Train Accuracy** (how well the model fits the training data),
- **Test Accuracy** (how well the model generalizes to new data), and
- **Gap** (the difference between Train and Test Accuracy – smaller is better).

It helps visualize overfitting:  
> **Larger gaps mean more overfitting**.

---

**Analysis Based on the Bar Plots**

- **Test size 0.20** has the smallest gap, with train and test accuracy both at **~93.8%**.
- At **test size 0.35**, train accuracy is the highest in the table (**99.5%**), but so is the gap (**0.24%**); the gap is still small in absolute terms, so this is at most a mild sign of **overfitting**.
- The gap is **smallest** at smaller test sizes and lower `C` values.

---

**Key insights**

This analysis helps in:
- Choosing the **best model configuration** (test size + C value),
- Ensuring the model **generalizes well** (avoids overfitting),
- Identifying the **sweet spot** for data splitting.

> Use **Test Size = 0.20** and **C = 0.01** for the best balance between performance and stability.

---

**Visualization**

In [None]:
# Create a horizontal bar chart for feature importance
plt.figure(figsize=(12, 8))

# Feature names, in the same order as the model coefficients
Log_l1_feature_names = X.columns.tolist()

# Create sorted data for plotting (descending order)
sorted_idx = Log_l1_ranked_features
sorted_features = [Log_l1_feature_names[i] for i in sorted_idx]
sorted_importance = Log_l1_feature_importance_percent[sorted_idx]

# Reverse the order so highest appears at the top of the plot
sorted_features = sorted_features[::-1]
sorted_importance = sorted_importance[::-1]

# Plot horizontal bars
bars = plt.barh(range(len(sorted_features)), sorted_importance, align='center', color='#4287f5')
plt.yticks(range(len(sorted_features)), sorted_features)

# Add percentage labels to the bars
for i, bar in enumerate(bars):
    width = bar.get_width()
    plt.text(width + 0.5, 
             bar.get_y() + bar.get_height()/2, 
             f'{sorted_importance[i]:.2f}%', 
             ha='left', 
             va='center',
             fontweight='bold')

# Add title and labels
plt.title('Logistic Regression L1 Feature Importance', fontsize=15)
plt.xlabel('Importance (%)', fontsize=12)
plt.ylabel('Features', fontsize=12)

# Add grid for better readability
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Print the ranked feature importances (in descending order)
print("\nLog L1 Feature Importance (sorted by importance):")
for i, idx in enumerate(Log_l1_ranked_features):
    print(f"  Rank {i + 1}: {Log_l1_feature_names[idx]} - {Log_l1_feature_importance_percent[idx]:.2f}%")

## SVM