### Comparing a Groq model and Gemini 2.5 Pro on a stroke dataset
- Main issue encountered: the `stroke` target is heavily imbalanced, with very few positive (`1`) labels. A model can therefore reach high accuracy while having low recall on the stroke class.
- Updated the prompt with a few additional guidelines.
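
To make the imbalance issue concrete, here is a small pure-Python sketch. The label counts (199 positives, 3,889 negatives) are an assumption, chosen to match the roughly 1:19.5 class ratio that the preprocessing cell reports as `pos_weight`. The always-negative predictor is a hypothetical baseline, not one of the models trained in this notebook.

```python
# Illustrative only: the label counts are assumed, not read from the CSV.
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall for class 1) for two equal-length label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

y_true = [1] * 199 + [0] * 3889      # assumed class counts
y_pred = [0] * len(y_true)           # hypothetical "never predict stroke" baseline

baseline_accuracy, baseline_recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.4f}, recall={baseline_recall:.4f}")
```

Under these assumed counts, the baseline is right about 95% of the time yet finds no stroke cases, which is why the later cells report precision, recall and F1 for the stroke class rather than accuracy alone.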

## Comparing Multiple LLMs
- Gemini 2.5 captures the context best and also produces the best results.


In [1]:
# Description: Import libraries, set up constants, initialize results list.

import pandas as pd
import numpy as np
import json
import os
import warnings
from dotenv import load_dotenv

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Opacus
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator # Optional

# LLM Clients
from groq import Groq
from google import genai
from google.genai import types

# Suppress warnings
warnings.filterwarnings('ignore')
load_dotenv() # Load environment variables from .env file if present

# --- Configuration ---
DATA_FILE = 'cleaned_healthcare_stroke.csv' # Ensure this file exists
TARGET_COLUMN = 'stroke'
RANDOM_STATE = 42
TEST_SIZE = 0.2

# Training Hyperparameters (Base)
LEARNING_RATE = 0.01
EPOCHS = 10 # Keep consistent for comparison unless tuning epochs
BATCH_SIZE = 64

# Default DP Parameters (for Fixed DP run)
DEFAULT_TARGET_EPSILON = 1.0
DEFAULT_TARGET_DELTA = 1e-5 # Will likely recalculate based on N
DEFAULT_MAX_GRAD_NORM = 1.0

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Initialize Results Storage ---
results_list = []

# --- Initialize LLM Clients ---
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
GROQ_MODEL_NAME = os.getenv("GROQ_MODEL_NAME", "llama3-70b-8192") # Or your preferred Groq model
GEMINI_MODEL_NAME = os.getenv("GEMINI_MODEL_NAME", "gemini-1.5-flash") # Or your preferred Gemini model

groq_client = None
gemini_client = None

if GROQ_API_KEY:
    try:
        groq_client = Groq(api_key=GROQ_API_KEY)
        print(f"Groq client initialized successfully for model: {GROQ_MODEL_NAME}")
    except Exception as e:
        print(f"Error initializing Groq client: {e}")
else:
    print("Warning: GROQ_API_KEY not found. Groq LLM will not be used.")

if GEMINI_API_KEY:
    try:
        gemini_client = genai.Client(api_key=GEMINI_API_KEY)
        print(f"Gemini client initialized successfully for model: {GEMINI_MODEL_NAME}")
    except Exception as e:
        print(f"Error initializing Gemini client: {e}")
else:
    print("Warning: GEMINI_API_KEY not found. Gemini LLM will not be used.")

Using device: cpu
Groq client initialized successfully for model: deepseek-r1-distill-llama-70b
Gemini client initialized successfully for model: gemini-2.5-pro-exp-03-25
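
The model names printed above differ from the defaults in the code, which suggests that `GROQ_MODEL_NAME` and `GEMINI_MODEL_NAME` were set in the environment or in the `.env` file. The sketch below shows how the `os.getenv` fallback behaves. `DEMO_MODEL_NAME` is a hypothetical variable used only for this illustration.

```python
import os

# DEMO_MODEL_NAME is a hypothetical variable name used only for this demo.
os.environ.pop("DEMO_MODEL_NAME", None)
from_default = os.getenv("DEMO_MODEL_NAME", "llama3-70b-8192")   # variable unset: default wins

os.environ["DEMO_MODEL_NAME"] = "deepseek-r1-distill-llama-70b"
from_env = os.getenv("DEMO_MODEL_NAME", "llama3-70b-8192")       # variable set: default ignored

print(from_default)
print(from_env)
os.environ.pop("DEMO_MODEL_NAME", None)  # clean up the demo variable
```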


In [2]:
# Description: Load a classification dataset (e.g., Stroke).
# Verify assumptions (cleaning, balance for this specific dataset).

# --- Configuration for CURRENT Classification Dataset (e.g., Stroke) ---
CURRENT_DATA_FILE = 'cleaned_healthcare_stroke.csv' # Or your target dataset
CURRENT_TARGET_COLUMN = 'stroke'
CURRENT_DATASET_NAME_FOR_LLM = "Kaggle Stroke Prediction"
CURRENT_TARGET_METRIC_PREFIX = "Stroke" # For naming P/R/F1 columns
# ---------------------------------------------------------------------

print(f"\n--- Loading Dataset: {CURRENT_DATA_FILE} for Classification ---")
df_current = None # Initialize
original_columns_current = []
pos_count_current, neg_count_current = 0, 0 # For imbalance info

try:
    df_current = pd.read_csv(CURRENT_DATA_FILE)
    print(f"{CURRENT_DATASET_NAME_FOR_LLM} Dataset loaded successfully.")
    print("Dataset shape:", df_current.shape)

    # --- Verification (adapt as needed) ---
    print("\nVerifying data assumptions...")
    df_current.info()
    if CURRENT_TARGET_COLUMN in df_current.columns:
        print(f"\nTarget Column '{CURRENT_TARGET_COLUMN}' Value Counts:")
        print(df_current[CURRENT_TARGET_COLUMN].value_counts())
        balance_check = df_current[CURRENT_TARGET_COLUMN].value_counts(normalize=True)
        print(f"Balance ratio:\n{balance_check}")
        pos_count_current = df_current[CURRENT_TARGET_COLUMN].sum()
        neg_count_current = len(df_current) - pos_count_current
    else:
        print(f"ERROR: Target column '{CURRENT_TARGET_COLUMN}' not found!")
        df_current = None

    if df_current is not None:
        original_columns_current = df_current.columns.tolist()

except FileNotFoundError:
    print(f"Error: File not found at {CURRENT_DATA_FILE}.")
except Exception as e:
    print(f"An error occurred: {e}")


--- Loading Dataset: cleaned_healthcare_stroke.csv for Classification ---
Kaggle Stroke Prediction Dataset loaded successfully.
Dataset shape: (5110, 14)

Verifying data assumptions...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                5110 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
 11  age_group          5110 non-null   object 
 12  glucose_group      5110 non-nu

In [4]:
# Description: Preprocess current classification data, split, convert to tensors, calculate pos_weight.

print(f"\n--- Preprocessing {CURRENT_DATASET_NAME_FOR_LLM} Data ---")
# Initialize to prevent NameErrors if df_current is None
n_features_current = None
train_loader_current = None
test_loader_current = None
pos_weight_tensor_current = torch.tensor([1.0], dtype=torch.float32).to(device) # Default
DEFAULT_TARGET_DELTA_CURRENT = 1e-5 # Fallback

if df_current is not None and CURRENT_TARGET_COLUMN in df_current.columns:
    X_current = df_current.drop(CURRENT_TARGET_COLUMN, axis=1)
    y_current = df_current[CURRENT_TARGET_COLUMN]

    # !!! IMPORTANT: Define these accurately for CURRENT_DATA_FILE !!!
    categorical_features_current = X_current.select_dtypes(include=['object', 'category']).columns.tolist()
    potential_numerical_features = ['age', 'avg_glucose_level', 'bmi'] # Example for Stroke
    numerical_features_current = [col for col in potential_numerical_features if col in X_current.columns and pd.api.types.is_numeric_dtype(X_current[col])]
    # Add any other numerical columns specific to your dataset
    # numerical_features_current.extend([col for col in X_current.columns if pd.api.types.is_numeric_dtype(X_current[col]) and col not in numerical_features_current and col not in categorical_features_current])


    print(f"\nIdentified Categorical Features: {categorical_features_current}")
    print(f"Identified Numerical Features: {numerical_features_current}")

    numerical_transformer = StandardScaler()
    categorical_transformer = OneHotEncoder(handle_unknown='ignore', drop='first', sparse_output=False)

    preprocessor_current = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features_current),
            ('cat', categorical_transformer, categorical_features_current)
        ],
        remainder='passthrough'
    )

    X_train_curr, X_test_curr, y_train_curr, y_test_curr = train_test_split(
        X_current, y_current, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y_current
    )
    y_test_labels_curr = y_test_curr.values # Keep for evaluation if needed by DP function

    try:
        X_train_processed_curr = preprocessor_current.fit_transform(X_train_curr)
        X_test_processed_curr = preprocessor_current.transform(X_test_curr)
    except ValueError as ve: # Catch errors e.g. if a category only appears in test
        print(f"Preprocessing Error: {ve}. This might be due to new categories in test data or empty feature lists.")
        print("Ensure categorical_features_current and numerical_features_current are correctly defined and not empty.")
        X_train_processed_curr = None # Prevent further errors

    if X_train_processed_curr is not None:
        n_features_current = X_train_processed_curr.shape[1]
        print(f"\nNumber of features after preprocessing: {n_features_current}")

        X_train_tensor_curr = torch.tensor(X_train_processed_curr.astype(np.float32)).to(device)
        y_train_tensor_curr = torch.tensor(y_train_curr.values.astype(np.float32)).unsqueeze(1).to(device)
        X_test_tensor_curr = torch.tensor(X_test_processed_curr.astype(np.float32)).to(device)
        y_test_tensor_curr = torch.tensor(y_test_curr.values.astype(np.float32)).unsqueeze(1).to(device) # For loss

        train_dataset_curr = TensorDataset(X_train_tensor_curr, y_train_tensor_curr)
        test_dataset_curr = TensorDataset(X_test_tensor_curr, y_test_tensor_curr)
        train_loader_current = DataLoader(train_dataset_curr, batch_size=BATCH_SIZE, shuffle=True)
        test_loader_current = DataLoader(test_dataset_curr, batch_size=BATCH_SIZE, shuffle=False)

        print(f"\n{CURRENT_DATASET_NAME_FOR_LLM} preprocessing and splitting complete.")
        print(f"Training set size: {len(train_dataset_curr)}")

        # Calculate Positive Class Weight
        neg_count = (y_train_curr == 0).sum()
        pos_count = (y_train_curr == 1).sum()
        if pos_count > 0:
            pos_weight_value = neg_count / pos_count
            pos_weight_tensor_current = torch.tensor([pos_weight_value], dtype=torch.float32).to(device)
            print(f"\nCalculated pos_weight for loss: {pos_weight_value:.2f}")
        else:
            pos_weight_tensor_current = torch.tensor([1.0], dtype=torch.float32).to(device)
            print("Warning: No positive samples in training data. Using default pos_weight (1.0).")

        DEFAULT_TARGET_DELTA_CURRENT = 1 / len(train_dataset_curr)
        print(f"Default Target Delta for {CURRENT_DATASET_NAME_FOR_LLM} (1/N): {DEFAULT_TARGET_DELTA_CURRENT:.2e}")
    else:
        print(f"Skipping {CURRENT_DATASET_NAME_FOR_LLM} tensor conversion due to preprocessing error.")
else:
    print(f"Skipping {CURRENT_DATASET_NAME_FOR_LLM} preprocessing: df_current is None or target missing.")


--- Preprocessing Kaggle Stroke Prediction Data ---

Identified Categorical Features: ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'age_group', 'glucose_group', 'bmi_group']
Identified Numerical Features: ['age', 'avg_glucose_level', 'bmi']

Number of features after preprocessing: 26

Kaggle Stroke Prediction preprocessing and splitting complete.
Training set size: 4088

Calculated pos_weight for loss: 19.54
Default Target Delta for Kaggle Stroke Prediction (1/N): 2.45e-04
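
The two numbers printed above can be reproduced by hand. The sketch below assumes 199 positive and 3,889 negative training rows. These counts are not printed in the notebook, but they are consistent with the reported training size (4,088) and `pos_weight` (19.54). It also evaluates the weighted binary cross-entropy formula documented for `BCEWithLogitsLoss` in plain Python, to show that the loss of a positive example is scaled by `pos_weight` while a negative example is left unchanged.

```python
import math

n_pos, n_neg = 199, 3889          # assumed counts, see the note above
n_train = n_pos + n_neg           # 4088, matches the printed training set size
pos_weight = n_neg / n_pos        # same ratio the notebook computes
delta = 1 / n_train               # the "1/N" rule used for the default delta

def weighted_bce_with_logits(logit, label, pos_weight=1.0):
    """Per-sample loss: -[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p))

plain_pos = weighted_bce_with_logits(0.0, 1)
weighted_pos = weighted_bce_with_logits(0.0, 1, pos_weight)
plain_neg = weighted_bce_with_logits(0.0, 0)
weighted_neg = weighted_bce_with_logits(0.0, 0, pos_weight)

print(f"pos_weight={pos_weight:.2f}, delta={delta:.2e}")
print(f"positive example: plain={plain_pos:.4f}, weighted={weighted_pos:.4f}")
print(f"negative example: plain={plain_neg:.4f}, weighted={weighted_neg:.4f}")
```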


In [6]:
# Description: Define a simple Logistic Regression model using PyTorch nn.Module.
class LogisticRegression(nn.Module):
    def __init__(self, n_features):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(n_features, 1)
    def forward(self, x):
        return self.linear(x)
print("\nLogistic Regression model class defined.")


Logistic Regression model class defined.
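
Note that `forward` returns raw logits rather than probabilities. The sigmoid is applied inside `BCEWithLogitsLoss` during training and explicitly at prediction time. As a rough illustration in plain Python (the `predict_label` helper is hypothetical), rounding the sigmoid output amounts to thresholding the logit at zero:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(logit):
    """Hypothetical helper: class 1 when the predicted probability exceeds 0.5."""
    return 1 if sigmoid(logit) > 0.5 else 0

logits = [-3.0, -0.2, 0.4, 2.5]
labels = [predict_label(z) for z in logits]
print(labels)
```

Lowering the 0.5 threshold would trade precision for recall, which is one lever for imbalanced data that this notebook leaves at its default.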


In [7]:
# Description: Train and evaluate standard SGD model for the CURRENT classification dataset.

run_name_non_dp = f"{CURRENT_DATASET_NAME_FOR_LLM} Non-DP SGD"
print(f"\n--- Running: {run_name_non_dp} ---")

if n_features_current is not None and train_loader_current is not None and test_loader_current is not None:
    model_non_dp = LogisticRegression(n_features_current).to(device)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight_tensor_current)
    optimizer_non_dp = optim.SGD(model_non_dp.parameters(), lr=LEARNING_RATE)

    print(f"Training Standard Logistic Regression ({CURRENT_DATASET_NAME_FOR_LLM})...")
    # Training loop
    model_non_dp.train()
    for epoch in range(EPOCHS):
        for batch_X, batch_y in train_loader_current:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            optimizer_non_dp.zero_grad()
            outputs = model_non_dp(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer_non_dp.step()
    print(f"Standard Training Complete ({CURRENT_DATASET_NAME_FOR_LLM}).")

    print(f"Evaluating Standard Model ({CURRENT_DATASET_NAME_FOR_LLM})...")
    # Evaluation loop
    model_non_dp.eval()
    all_preds, all_targets = [], []
    with torch.no_grad():
        for batch_X, batch_y in test_loader_current:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model_non_dp(batch_X)
            preds = torch.round(torch.sigmoid(outputs))
            all_preds.extend(preds.cpu().numpy().flatten())
            all_targets.extend(batch_y.cpu().numpy().flatten())

    accuracy = accuracy_score(all_targets, all_preds)
    precision = precision_score(all_targets, all_preds, pos_label=1, zero_division=0)
    recall = recall_score(all_targets, all_preds, pos_label=1, zero_division=0)
    f1 = f1_score(all_targets, all_preds, pos_label=1, zero_division=0)

    print(f"Accuracy: {accuracy:.4f}, Precision ({CURRENT_TARGET_METRIC_PREFIX}): {precision:.4f}, Recall ({CURRENT_TARGET_METRIC_PREFIX}): {recall:.4f}, F1 ({CURRENT_TARGET_METRIC_PREFIX}): {f1:.4f}")

    results_list.append({
        "Run Type": run_name_non_dp, "LLM Used": "N/A",
        "Target Epsilon": "N/A", "Final Epsilon": "N/A", "Target Delta": "N/A", "Max Grad Norm": "N/A",
        "Accuracy": accuracy,
        f"Precision ({CURRENT_TARGET_METRIC_PREFIX})": precision,
        f"Recall ({CURRENT_TARGET_METRIC_PREFIX})": recall,
        f"F1 ({CURRENT_TARGET_METRIC_PREFIX})": f1,
        "LLM Epsilon Suggestion": "N/A", "LLM Reasoning": "N/A"
    })
    print(f"{run_name_non_dp} results recorded.")
else:
    print(f"Skipping {run_name_non_dp} run due to missing components for {CURRENT_DATASET_NAME_FOR_LLM}.")


--- Running: Kaggle Stroke Prediction Non-DP SGD ---
Training Standard Logistic Regression (Kaggle Stroke Prediction)...
Standard Training Complete (Kaggle Stroke Prediction).
Evaluating Standard Model (Kaggle Stroke Prediction)...
Accuracy: 0.7202, Precision (Stroke): 0.1266, Recall (Stroke): 0.8000, F1 (Stroke): 0.2186
Kaggle Stroke Prediction Non-DP SGD results recorded.
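
The metrics above are easier to interpret as counts. The sketch below uses one confusion matrix that reproduces the reported figures. It assumes 1,022 test rows (5,110 minus the 4,088 training rows), of which 50 are stroke cases. The positive count is an assumption, since the notebook does not print it.

```python
# Assumed counts: chosen so that the derived metrics match the printed output.
tp, fn = 40, 10      # 50 assumed stroke cases in the test split
fp, tn = 276, 696    # 972 assumed non-stroke cases

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, "
      f"Recall: {recall:.4f}, F1: {f1:.4f}")
```

Under these assumed counts, the class-weighted model catches 40 of 50 strokes at the cost of 276 false alarms, which is why precision is low even though recall is high.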


In [10]:
# Description: Define the standard prompt structure and functions to call LLMs.

# # v2
# def create_llm_prompt(task_config, schema_string, data_shape):
#     """Creates a more detailed prompt string for the LLM, guiding parameter choices."""
#     prompt = f"""
# Analyze the provided dataset context and task to recommend **optimized and justified** Differential Privacy (DP) settings for training a Logistic Regression model using DP-SGD.
# The goal is to predict the target variable '{task_config['target_variable']}'.

# **Dataset Context:**
# - Name: {task_config['dataset_name']}
# - Domain: {task_config['data_domain']} (Note: Healthcare data is generally considered sensitive).
# - Task: {task_config['task_description']}
# - Schema (Original Columns): {schema_string}
# - Extra details: {task_config['details']} (Pay close attention to class imbalance).

# **Parameter Guidance - IMPORTANT:** Avoid generic default values. Base your recommendations *specifically* on the context provided above.

# Provide your recommendations ONLY in a structured JSON format. The JSON object must include the following keys:
# - "dp_algorithm": String, the specific DP algorithm variant recommended (e.g., "DP-SGD with Gaussian Noise").
# - "target_epsilon": Float, recommended privacy budget epsilon (e.g., 1.5). Justify this based on sensitivity, utility needs, and domain.
# - "target_delta": Float or String, recommended privacy budget delta (e.g., 1e-5 or suggest calculating as "1/N"). Justify choice.
# - "max_grad_norm": Float, recommended gradient clipping norm (e.g., 1.0). Justify based on model stability and potential gradient explosion, especially considering class imbalance if noted.
# - "preprocessing_suggestions": List of strings, specific preprocessing actions recommended BEFORE applying DP (e.g., "Remove: id", "Normalize: age, avg_glucose_level, bmi").
# - "column_sensitivity_epsilon": A dictionary where keys are original column names and values are *conceptual* relative sensitivity floats (0.0=low, 1.0=high/ID) or labels (Low, Medium, High). This guides understanding, not direct budget split in standard DP-SGD. Exclude the target variable.
# - "reasoning": String, concise reasoning behind the overall recommendations (epsilon, delta, max_grad_norm choices, linking back to context).

# JSON Output ONLY:
# """
#     return prompt

# #v3
# def create_llm_prompt(task_config, schema_string, data_shape):
#     """Creates a more detailed prompt string for the LLM, guiding parameter choices
#        and including general ML best practices context."""

#     # Calculate approximate training size N for context
#     approx_N_train = int(data_shape[0] * (1-TEST_SIZE)) if data_shape else 'Unknown'

#     prompt = f"""
# Analyze the provided dataset context and task to recommend **optimized and justified** Differential Privacy (DP) settings for training a Logistic Regression model using DP-SGD.
# The goal is to predict the target variable '{task_config['target_variable']}'.

# **Dataset Context:**
# - Name: {task_config['dataset_name']}
# - Domain: {task_config['data_domain']} (Note: Healthcare data is generally considered sensitive).
# - Task: {task_config['task_description']}
# - Schema (Original Columns): {schema_string}
# - Data Shape: {data_shape} (Approx. Training N = {approx_N_train})
# - Extra details: {task_config['details']} (Pay close attention to class imbalance).

# **General ML Best Practices Context (Keep these in mind):**
# - **Normalization/Scaling:** Features with different scales (like 'age' vs 'avg_glucose_level') MUST be normalized or standardized (e.g., StandardScaler, MinMaxScaler) for models like Logistic Regression and especially before applying gradient clipping in DP-SGD. This ensures stable gradient computations. Apply scaling AFTER splitting data and ideally AFTER imputation if applicable.
# - **Gradient Stability & Clipping:** DP-SGD uses gradient clipping (`max_grad_norm`) to bound the influence of any single data point. Choosing the norm value is a trade-off:
#     - Too low: Clips potentially useful gradient information, slowing learning or preventing convergence, especially for minority classes or complex patterns.
#     - Too high: Less protection against outliers, potentially higher noise required for the same privacy budget (ε).
#     - Imbalanced Data Impact: Gradients from rare class samples might be infrequent but large; aggressive clipping can disproportionately affect learning for that class.
# - **Imbalanced Data Handling:** Beyond class weighting in the loss (which is assumed here), model evaluation should focus on metrics like F1-score, Precision, Recall for the minority class, not just accuracy. The goal is often to improve detection of the rare class.
# - **Preprocessing Order:** Typically: Split Data -> Impute Missing -> Encode Categorical -> Scale Numerical -> Train Model. DP is applied during the training step.

# **Parameter Guidance - IMPORTANT:** Based on the Dataset Context AND the ML Best Practices above, provide specific, justified recommendations. Avoid generic defaults.

# 1.  **`target_epsilon`**: Balance '{task_config['data_domain']}' sensitivity vs. utility needed for training. Justify the specific trade-off. (Range 1.0-5.0 often considered, but justify *your* choice).
# 2.  **`target_delta`**: Recommend a specific small value (e.g., 1e-5, 1e-6) or suggest "1/N". Justify why (e.g., related to approx N={approx_N_train}).
# 3.  **`max_grad_norm`**: **Connect this directly to the Gradient Stability & Imbalanced Data points above.** Given the heavy imbalance, suggest a value (e.g., range 5.0 - 15.0, or a specific reasoned value) likely higher than a generic default (like 1.0) to preserve minority class signals. Justify based *explicitly* on imbalance and the need for stable yet informative gradients.
# 4.  **`column_sensitivity_epsilon`**: Conceptual relative sensitivity hints (0.0 low, 1.0 high). Reflect potential identifiability/sensitivity based on domain/name. Exclude target.
# 5.  **`reasoning`**: Concise but detailed justification for epsilon, delta, and max_grad_norm, *explicitly linking* choices to dataset context (domain, N, imbalance) and the relevant ML best practices mentioned (normalization, gradient stability).

# **Output Format:**
# Provide recommendations ONLY in a structured JSON format with keys: "dp_algorithm", "target_epsilon", "target_delta", "max_grad_norm", "preprocessing_suggestions", "column_sensitivity_epsilon", "reasoning".

# JSON Output ONLY:
# """
#     return prompt

def create_llm_prompt(task_config, schema_string, data_shape, task_type="classification"): # Add task_type
    """Creates a more detailed prompt string for the LLM, guiding parameter choices
       and including general ML best practices context. Adapts for task_type."""

    approx_N_train = int(data_shape[0] * (1-TEST_SIZE)) if data_shape else 'Unknown'

    # Adjust parts of the prompt based on task_type
    if task_type == "regression":
        imbalance_guidance = "" # No class imbalance for regression
        target_guidance_metrics = "metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R2 Score. The goal is often to minimize error."
        max_grad_norm_focus = "focus on overall gradient stability, considering the range and scale of the target variable values, rather than specific class signals. A moderate value (e.g., 1.0-10.0, depending on target scale and feature normalization) is common."
        epsilon_utility_focus = "utility needs for accurate predictions (e.g., low MAE/MSE)"
        model_type_in_prompt = "Linear Regression" # Or generic "Regression Model"
    else: # classification (default)
        imbalance_guidance = "(Pay close attention to class imbalance if mentioned in 'Extra details')."
        target_guidance_metrics = "metrics like F1-score, Precision, Recall for the minority class, not just accuracy. The goal is often to improve detection of the rare class."
        max_grad_norm_focus = "**Connect this directly to the Gradient Stability & Imbalanced Data points above.** Given potential class imbalance (see 'Extra details'), suggest a value (e.g., range 5.0 - 15.0, or a specific reasoned value) likely higher than a generic default (like 1.0) to preserve minority class signals. Justify based *explicitly* on imbalance and the need for stable yet informative gradients."
        epsilon_utility_focus = "utility needs for model training (which often requires sufficient signal)"
        model_type_in_prompt = "Logistic Regression"


    prompt = f"""
Analyze the provided dataset context and task to recommend **optimized and justified** Differential Privacy (DP) settings for training a {model_type_in_prompt} model using DP-SGD.
The goal is to predict the target variable '{task_config['target_variable']}'.

**Dataset Context:**
- Name: {task_config['dataset_name']}
- Domain: {task_config['data_domain']}
- Task: {task_config['task_description']}
- Schema (Original Columns): {schema_string}
- Data Shape: {data_shape} (Approx. Training N = {approx_N_train})
- Extra details: {task_config['details']} {imbalance_guidance}

**General ML Best Practices Context (Keep these in mind):**
- **Normalization/Scaling:** Features MUST be normalized or standardized for models like {model_type_in_prompt} and especially before applying gradient clipping in DP-SGD.
- **Gradient Stability & Clipping (`max_grad_norm`):** Bound influence of single points. Trade-off:
    - Too low: Clips useful info, hinders learning.
    - Too high: Less protection, more noise needed.
    - { "Imbalanced Data Impact: Gradients from rare class samples might be infrequent but large; aggressive clipping can disproportionately affect learning for that class." if task_type=="classification" else "For regression, consider the scale of the target variable when thinking about gradient magnitudes."}
- **Evaluation Focus ({task_type}):** Model evaluation should focus on {target_guidance_metrics}
- **Preprocessing Order:** Typically: Split Data -> Impute -> Encode -> Scale -> Train.

**Parameter Guidance - IMPORTANT:** Base recommendations *specifically* on context. Avoid generic defaults.

1.  **`target_epsilon`**: Balance '{task_config['data_domain']}' sensitivity vs. {epsilon_utility_focus}. Justify. (Range 1.0-5.0 often considered, but justify *your* choice).
2.  **`target_delta`**: Recommend small value (e.g., 1e-5) or "1/N". Justify (e.g., N={approx_N_train}).
3.  **`max_grad_norm`**: For this {task_type} task, {max_grad_norm_focus} Justify.
4.  **`column_sensitivity_epsilon`**: Conceptual relative sensitivity hints (0.0 low, 1.0 high). Exclude target.
5.  **`reasoning`**: Detailed justification for epsilon, delta, `max_grad_norm`, linking to dataset context (domain, N, imbalance/target scale) and ML practices.

**Output Format:**
JSON ONLY with keys: "dp_algorithm", "target_epsilon", "target_delta", "max_grad_norm", "preprocessing_suggestions", "column_sensitivity_epsilon", "reasoning".

JSON Output ONLY:
"""
    return prompt



def get_gemini_config(prompt, client):
    """Gets DP config from Gemini API."""
    if not client:
        print("Gemini client not available.")
        return None
    print("\nSending request to Gemini...")
    try:
        response = client.models.generate_content(
            model=GEMINI_MODEL_NAME, contents=prompt
            )
        response_text = response.text
        print("Gemini Response Received.")
        # Extract JSON part
        start_index = response_text.find('{')
        end_index = response_text.rfind('}')
        if start_index != -1 and end_index > start_index:
            json_string_only = response_text[start_index : end_index + 1]
            config = json.loads(json_string_only)
            print("Successfully parsed Gemini config.")
            print(config)
            # Basic validation
            required_keys = ["dp_algorithm", "target_epsilon", "target_delta", "max_grad_norm", "preprocessing_suggestions", "column_sensitivity_epsilon", "reasoning"]
            if not all(key in config for key in required_keys):
                print("Warning: Gemini response missing some required keys.")
            return config
        else:
            print("Error: Could not find JSON object in Gemini response.")
            print("Raw Response:", response_text)
            return None
    except Exception as e:
        print(f"Error during Gemini API call or parsing: {e}")
        try:
            print("Gemini Response Content (if available):", response.candidates) # Might show safety blocks
        except Exception:
            pass  # `response` is undefined if the API call itself failed
        return None

def get_groq_config(prompt, client, model_name):
    """Gets DP config from Groq API."""
    if not client:
        print("Groq client not available.")
        return None
    print("\nSending request to Groq...")
    try:
        chat_completion = client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model=model_name,
            temperature=0.2,
            max_tokens=3024, 
            top_p=0.8,
            response_format={"type": "json_object"},
        )
        response_content = chat_completion.choices[0].message.content
        print("Groq Response Received.")
        config = json.loads(response_content)
        print("Successfully parsed Groq config.")
        print(config)

        # Basic validation
        required_keys = ["dp_algorithm", "target_epsilon", "target_delta", "max_grad_norm", "preprocessing_suggestions", "column_sensitivity_epsilon", "reasoning"]
        if not all(key in config for key in required_keys):
            print("Warning: Groq response missing some required keys.")
        return config
    except Exception as e:
        print(f"Error during Groq API call or parsing: {e}")
        return None

print("LLM Helper functions defined.")

LLM Helper functions defined.
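
LLM replies often wrap the JSON in prose, which is why the Gemini helper slices from the first `{` to the last `}`. A minimal standalone sketch of that extraction, with a stricter key check, is shown below. `extract_config` and the sample reply are hypothetical; nothing here calls an API.

```python
import json

REQUIRED_KEYS = [
    "dp_algorithm", "target_epsilon", "target_delta", "max_grad_norm",
    "preprocessing_suggestions", "column_sensitivity_epsilon", "reasoning",
]

def extract_config(response_text):
    """Hypothetical helper: parse the outermost {...} span and list any missing keys."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        return None, list(REQUIRED_KEYS)
    try:
        config = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None, list(REQUIRED_KEYS)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    return config, missing

# Made-up reply text for illustration; the values are not from any model run.
sample_reply = (
    'Here is the configuration you asked for: '
    '{"dp_algorithm": "DP-SGD with Gaussian Noise", "target_epsilon": 2.0, '
    '"target_delta": "1/N", "max_grad_norm": 5.0} Let me know if you need changes.'
)
config, missing = extract_config(sample_reply)
print(config["target_epsilon"], missing)
```

Returning the list of missing keys, instead of only printing a warning, lets the caller decide whether to fall back to defaults or skip the run.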


In [13]:
# Description: Function to train and evaluate a DP CLASSIFICATION model using Opacus.

def train_evaluate_dp_classification_model(
    config, run_name, train_loader, test_loader, # test_loader provides X and y_labels for metrics
    n_features, device, epochs, learning_rate,
    pos_weight_tensor,
    target_metric_prefix="Class 1", # e.g., "Stroke", "Readmit"
    y_test_labels_for_eval=None # Pass the actual y_test labels (numpy array)
):
    """Trains and evaluates DP classification model, returns results dictionary."""
    print(f"\n--- Running: {run_name} (Task: Classification) ---")
    if config is None:
        print("Skipping run due to missing config.")
        return None

    target_eps = config.get("target_epsilon", DEFAULT_TARGET_EPSILON)
    target_del_config = config.get("target_delta", "1/N")
    max_norm = config.get("max_grad_norm", DEFAULT_MAX_GRAD_NORM)
    # Resolve delta: the config may give the string "1/N" or a number; otherwise fall back to 1/N.
    if isinstance(target_del_config, str) and "1/N" in target_del_config and train_loader:
        actual_delta = 1 / len(train_loader.dataset)
    elif isinstance(target_del_config, (int, float)):
        actual_delta = target_del_config
    else: # Fallback
        actual_delta = 1 / len(train_loader.dataset) if train_loader else DEFAULT_TARGET_DELTA


    dp_model = LogisticRegression(n_features).to(device)
    dp_optimizer = optim.SGD(dp_model.parameters(), lr=learning_rate)
    privacy_engine = PrivacyEngine()
    try:
        dp_model, dp_optimizer, dp_data_loader = privacy_engine.make_private_with_epsilon(
            module=dp_model, optimizer=dp_optimizer, data_loader=train_loader,
            max_grad_norm=max_norm, target_epsilon=target_eps, target_delta=actual_delta, epochs=epochs
        )
        print(f"Opacus Attached. Target ε={target_eps:.2f}, Target δ={actual_delta:.2e}, Max Grad Norm={max_norm}")
    except Exception as e:
        print(f"Error attaching Opacus: {e}")
        return None

    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight_tensor)
    print("Training DP Classification Model...")
    # Training loop: iterate over the Opacus-wrapped loader returned above
    dp_model.train()
    for epoch in range(epochs):
        for batch_X, batch_y in dp_data_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            dp_optimizer.zero_grad()
            outputs = dp_model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            dp_optimizer.step()

    try:
        final_epsilon = privacy_engine.get_epsilon(delta=actual_delta)
        print(f"DP Training Complete. Final ε = {final_epsilon:.4f}")
    except Exception as e:
        print(f"Error getting epsilon: {e}")
        final_epsilon = float('nan')


    print("Evaluating DP Classification Model...")
    dp_model.eval()
    all_preds_dp, all_targets_dp_eval = [], [] # all_targets_dp_eval will be from y_test_labels_for_eval
    with torch.no_grad():
        for batch_X, _ in test_loader: # We only need X for prediction
            batch_X = batch_X.to(device)
            outputs = dp_model(batch_X)
            preds = torch.round(torch.sigmoid(outputs))
            all_preds_dp.extend(preds.cpu().numpy().flatten())
    
    if y_test_labels_for_eval is None:
        print("Error: y_test_labels_for_eval not provided for metrics calculation.")
        return None
    all_targets_dp_eval = y_test_labels_for_eval # Use the passed numpy array of true labels

    accuracy_dp = accuracy_score(all_targets_dp_eval, all_preds_dp)
    precision_dp = precision_score(all_targets_dp_eval, all_preds_dp, pos_label=1, zero_division=0)
    recall_dp = recall_score(all_targets_dp_eval, all_preds_dp, pos_label=1, zero_division=0)
    f1_dp = f1_score(all_targets_dp_eval, all_preds_dp, pos_label=1, zero_division=0)
    print(f"Accuracy: {accuracy_dp:.4f}, Precision ({target_metric_prefix}): {precision_dp:.4f}, Recall ({target_metric_prefix}): {recall_dp:.4f}, F1 ({target_metric_prefix}): {f1_dp:.4f}")

    results = {
        "Run Type": run_name, "LLM Used": config.get("llm_model_name", "N/A"),
        "Target Epsilon": target_eps, "Final Epsilon": final_epsilon,
        "Target Delta": actual_delta, "Max Grad Norm": max_norm,
        "Accuracy": accuracy_dp,
        f"Precision ({target_metric_prefix})": precision_dp,
        f"Recall ({target_metric_prefix})": recall_dp,
        f"F1 ({target_metric_prefix})": f1_dp,
        "LLM Epsilon Suggestion": config.get("target_epsilon", "N/A (Default)"),
        "LLM Reasoning": config.get("reasoning", "N/A")
    }
    print(f"{run_name} results recorded.")
    return results

print("DP CLASSIFICATION Training/Evaluation function defined.")

DP CLASSIFICATION Training/Evaluation function defined.
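
The delta-resolution branch at the top of the function above can be read as a small standalone helper. The following is a minimal sketch, not the notebook's own code: `resolve_delta` is a hypothetical name, and the explicit `bool` exclusion is an addition that the original branch does not have. The size 4088 is only an illustrative training-set size, chosen because 1/4088 rounds to the δ=2.45e-04 that appears in the run logs; the notebook does not state the size at this point.

```python
DEFAULT_TARGET_DELTA = 1e-5  # mirrors the constant defined in the first cell


def resolve_delta(target_del_config, n_train, default=DEFAULT_TARGET_DELTA):
    """Turn a config value into a numeric delta.

    Accepts a number, or a string containing "1/N" (meaning 1 / n_train).
    Falls back to 1 / n_train when n_train is known, otherwise to `default`.
    """
    has_n = n_train is not None and n_train > 0
    if isinstance(target_del_config, str) and "1/N" in target_del_config and has_n:
        return 1.0 / n_train
    # bool is a subclass of int; exclude it so True is not read as delta = 1.
    # (Assumption: this guard is not in the original function.)
    if isinstance(target_del_config, (int, float)) and not isinstance(target_del_config, bool):
        return float(target_del_config)
    return 1.0 / n_train if has_n else default


# Illustrative calls; 4088 is an assumed training-set size.
print(f"{resolve_delta('1/N', 4088):.2e}")
print(resolve_delta(1e-5, 4088))
print(resolve_delta(None, None))
```

Keeping this logic in one place would make it easier to see why different runs report different `Target Delta` values: some LLM configs supply a literal number, while others fall through to the 1/N rule.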


In [11]:
# Description: Define task details for CURRENT classification dataset and generate prompt.

print(f"\n--- Preparing Config and Prompt for {CURRENT_DATASET_NAME_FOR_LLM} ---")
if df_current is not None and original_columns_current and 'pos_count_current' in locals():
    imbalance_details_current = f"The target '{CURRENT_TARGET_COLUMN}' has {pos_count_current} positive vs {neg_count_current} negative samples in train set."
    if abs(pos_count_current - neg_count_current) / (pos_count_current + neg_count_current + 1e-6) < 0.1: # Approx balanced
        imbalance_details_current += " This is considered relatively balanced."
    else:
        imbalance_details_current += " This is considered imbalanced."


    task_config_current = {
        "dataset_name": CURRENT_DATASET_NAME_FOR_LLM,
        "data_domain": "Healthcare", # Adjust if domain changes
        "task_description": f"Train a binary classification model (Logistic Regression using DP-SGD) to predict '{CURRENT_TARGET_COLUMN}'.",
        "target_variable": CURRENT_TARGET_COLUMN,
        "model_type": "Logistic Regression",
        "dp_mechanism_family": "DP-SGD",
        "details": imbalance_details_current
    }

    schema_string_current = ", ".join(original_columns_current)
    data_shape_tuple_current = df_current.shape

    llm_prompt_current = create_llm_prompt(
        task_config_current,
        schema_string_current,
        data_shape_tuple_current,
        task_type="classification" # Explicitly classification
    )
    print(f"{CURRENT_DATASET_NAME_FOR_LLM} task config and LLM prompt prepared.")
else:
    print(f"Skipping {CURRENT_DATASET_NAME_FOR_LLM} prompt creation: data not loaded or details missing.")
    llm_prompt_current = None


--- Preparing Config and Prompt for Kaggle Stroke Prediction ---
Kaggle Stroke Prediction task config and LLM prompt prepared.


In [14]:
# Description: Execute Fixed DP run for the CURRENT classification dataset.

run_name_fixed_dp = f"{CURRENT_DATASET_NAME_FOR_LLM} Fixed DP SGD"
print(f"\n--- Running: {run_name_fixed_dp} ---")

if n_features_current is not None and train_loader_current is not None and 'DEFAULT_TARGET_DELTA_CURRENT' in locals() and 'y_test_labels_curr' in locals():
    fixed_dp_config_curr = {
        "dp_algorithm": "DP-SGD with Gaussian Noise (Fixed)",
        "target_epsilon": DEFAULT_TARGET_EPSILON,
        "target_delta": DEFAULT_TARGET_DELTA_CURRENT,
        "max_grad_norm": DEFAULT_MAX_GRAD_NORM,
        "reasoning": f"Using fixed defaults for {CURRENT_DATASET_NAME_FOR_LLM}.",
        "llm_model_name": "N/A (Fixed Defaults)" # ... add other expected keys if needed
    }
    results_fixed_curr = train_evaluate_dp_classification_model(
        fixed_dp_config_curr, run_name_fixed_dp,
        train_loader_current, test_loader_current, n_features_current, device,
        EPOCHS, LEARNING_RATE, pos_weight_tensor_current,
        target_metric_prefix=CURRENT_TARGET_METRIC_PREFIX,
        y_test_labels_for_eval=y_test_labels_curr
    )
    if results_fixed_curr: results_list.append(results_fixed_curr)
else:
    print(f"Skipping {run_name_fixed_dp} run: missing components.")


--- Running: Kaggle Stroke Prediction Fixed DP SGD ---

--- Running: Kaggle Stroke Prediction Fixed DP SGD (Task: Classification) ---
Opacus Attached. Target ε=1.00, Target δ=2.45e-04, Max Grad Norm=1.0
Training DP Classification Model...
DP Training Complete. Final ε = 0.9920
Evaluating DP Classification Model...
Accuracy: 0.9511, Precision (Stroke): 0.0000, Recall (Stroke): 0.0000, F1 (Stroke): 0.0000
Kaggle Stroke Prediction Fixed DP SGD results recorded.
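
The Fixed DP run above reports 95% accuracy with zero precision, recall, and F1, which is what a model that never predicts the positive class produces on a heavily imbalanced test set. The sketch below recomputes the metrics from confusion-matrix counts to make that concrete. The counts are assumptions: a test split of 1022 rows with 50 positives is inferred from the logged metrics (it reproduces them) but is not stated anywhere in the notebook, and `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # same convention as zero_division=0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Assumed split, inferred from the logged metrics rather than read from the data.
N_TEST, N_POS = 1022, 50

# A classifier that predicts "no stroke" for every row.
always_negative = metrics_from_counts(tp=0, fp=0, fn=N_POS, tn=N_TEST - N_POS)
print("always negative:", [round(m, 4) for m in always_negative])

# Counts consistent with a run that has recall 0.8 and precision of about 0.127.
high_recall = metrics_from_counts(tp=40, fp=276, fn=10, tn=696)
print("high recall:    ", [round(m, 6) for m in high_recall])
```

Under these assumed counts, the all-negative predictor already reaches the accuracy logged for the Fixed DP run, so accuracy alone cannot separate a useful model from a degenerate one here; recall and F1 on the positive class are the more informative columns.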


In [15]:
# Description: Get config from Gemini for CURRENT dataset and run.

run_name_gemini_dp = f"{CURRENT_DATASET_NAME_FOR_LLM} Gemini DP SGD"
print(f"\n--- Running: {run_name_gemini_dp} ---")

if gemini_client and llm_prompt_current and n_features_current is not None and train_loader_current is not None and 'y_test_labels_curr' in locals():
    gemini_config_curr = get_gemini_config(llm_prompt_current, gemini_client)
    if gemini_config_curr:
        gemini_config_curr["llm_model_name"] = GEMINI_MODEL_NAME
        results_gemini_curr = train_evaluate_dp_classification_model(
            gemini_config_curr, run_name_gemini_dp,
            train_loader_current, test_loader_current, n_features_current, device,
            EPOCHS, LEARNING_RATE, pos_weight_tensor_current,
            target_metric_prefix=CURRENT_TARGET_METRIC_PREFIX,
            y_test_labels_for_eval=y_test_labels_curr
        )
        if results_gemini_curr: results_list.append(results_gemini_curr)
    else: print(f"Failed to get Gemini config for {CURRENT_DATASET_NAME_FOR_LLM}.")
elif not gemini_client: print("Skipping Gemini run: client not initialized.")
else: print(f"Skipping Gemini run for {CURRENT_DATASET_NAME_FOR_LLM}: missing components.")


--- Running: Kaggle Stroke Prediction Gemini DP SGD ---

Sending request to Gemini...
Gemini Response Received.
Successfully parsed Gemini config.
{'dp_algorithm': 'DP-SGD', 'target_epsilon': 3.0, 'target_delta': 0.000245, 'max_grad_norm': 10.0, 'preprocessing_suggestions': ['Split Data: Perform train-test split first to prevent data leakage.', "Imputation: Impute missing values (e.g., 'bmi' often has NaNs). Median imputation for numerical features and mode for categorical features are common strategies.", "Encoding: Convert categorical features ('gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status') into numerical representations using one-hot encoding. Consider `drop='first'` to avoid multicollinearity if appropriate for the model library.", 'Normalization/Scaling: Standardize or normalize all numerical features (including one-hot encoded ones, or apply after encoding numericals and before concatenating with OHE features). StandardScaler (to mean 0, std 1) or Min

In [17]:
# Description: Loop through Groq models, get config, and run DP training for CURRENT dataset.

print(f"\n--- Comparing Multiple Groq LLM Models for {CURRENT_DATASET_NAME_FOR_LLM} ---")
groq_models_to_test = [
    "llama-3.3-70b-versatile",
    "qwen-qwq-32b",
    "gemma2-9b-it",
    "deepseek-r1-distill-llama-70b" 
]

if not groq_client:
    print("Skipping Groq model comparison: Groq client not initialized.")
elif not llm_prompt_current:
    print(f"Skipping Groq model comparison for {CURRENT_DATASET_NAME_FOR_LLM}: LLM prompt not generated.")
elif n_features_current is None or train_loader_current is None or 'y_test_labels_curr' not in locals():
    print(f"Skipping Groq model comparison for {CURRENT_DATASET_NAME_FOR_LLM}: data components missing.")
else:
    for groq_model_id in groq_models_to_test:
        run_name_groq_dp = f"{CURRENT_DATASET_NAME_FOR_LLM} Groq ({groq_model_id}) DP SGD"
        print(f"\n--- Testing Groq Model: {groq_model_id} for {CURRENT_DATASET_NAME_FOR_LLM} ---")

        current_groq_config = get_groq_config(llm_prompt_current, groq_client, groq_model_id)
        if current_groq_config:
            current_groq_config["llm_model_name"] = groq_model_id
            results_current_groq = train_evaluate_dp_classification_model(
                current_groq_config, run_name_groq_dp,
                train_loader_current, test_loader_current, n_features_current, device,
                EPOCHS, LEARNING_RATE, pos_weight_tensor_current,
                target_metric_prefix=CURRENT_TARGET_METRIC_PREFIX,
                y_test_labels_for_eval=y_test_labels_curr
            )
            if results_current_groq:
                results_list.append(results_current_groq)
        else:
            print(f"Failed to get config from Groq model {groq_model_id} for {CURRENT_DATASET_NAME_FOR_LLM}.")


--- Comparing Multiple Groq LLM Models for Kaggle Stroke Prediction ---

--- Testing Groq Model: llama-3.3-70b-versatile for Kaggle Stroke Prediction ---

Sending request to Groq...
Groq Response Received.
Successfully parsed Groq config.
{'dp_algorithm': 'DP-SGD', 'target_epsilon': 2.5, 'target_delta': 1e-05, 'max_grad_norm': 10.0, 'preprocessing_suggestions': ['Normalize features using Standard Scaler', 'Split data into training and validation sets', 'Encode categorical variables using One-Hot Encoding', 'Impute missing values with mean or median'], 'column_sensitivity_epsilon': {'gender': 0.5, 'age': 0.2, 'hypertension': 0.8, 'heart_disease': 0.8, 'ever_married': 0.3, 'work_type': 0.4, 'Residence_type': 0.4, 'avg_glucose_level': 0.6, 'bmi': 0.6, 'smoking_status': 0.7, 'age_group': 0.2, 'glucose_group': 0.6, 'bmi_group': 0.6}, 'reasoning': 'The target epsilon of 2.5 balances the need for model utility in the healthcare domain with the sensitivity of patient data. A target delta of 1

In [18]:
# Description: Show the results from all runs in a table, handling dynamic metric names.

if results_list:
    results_df = pd.DataFrame(results_list)

    # Define base column order
    base_cols = [
        "Run Type", "LLM Used", "Target Epsilon", "Final Epsilon", "Target Delta",
        "Max Grad Norm", "Accuracy"
    ]
    # Dynamically find all P/R/F1 columns
    metric_cols = sorted([col for col in results_df.columns if "Precision (" in col or "Recall (" in col or "F1 (" in col])
    
    end_cols = ["LLM Epsilon Suggestion", "LLM Reasoning"]

    # Combine and ensure all are present
    final_cols_order = base_cols + metric_cols + end_cols
    
    # Add missing columns with N/A if they don't exist in the DataFrame yet
    for col in final_cols_order:
        if col not in results_df.columns:
            results_df[col] = np.nan # Use np.nan for potential later numerical ops

    results_df = results_df[final_cols_order] # Reorder
    results_df.fillna("N/A", inplace=True) # Fill any remaining NaNs for display

    pd.set_option('display.max_colwidth', 100)
    pd.set_option('display.width', 1500) # Wider for more metric columns
    print("\n--- Combined Experiment Results ---")
    display(results_df)
else:
    print("No results recorded yet.")


--- Combined Experiment Results ---


Unnamed: 0,Run Type,LLM Used,Target Epsilon,Final Epsilon,Target Delta,Max Grad Norm,Accuracy,F1 (Stroke),Precision (Stroke),Recall (Stroke),LLM Epsilon Suggestion,LLM Reasoning
0,Kaggle Stroke Prediction Non-DP SGD,,,,,,0.720157,0.218579,0.126582,0.8,,
1,Kaggle Stroke Prediction Fixed DP SGD,N/A (Fixed Defaults),1.0,0.991973,0.000245,1.0,0.951076,0.0,0.0,0.0,1.0,Using fixed defaults for Kaggle Stroke Prediction.
2,Kaggle Stroke Prediction Gemini DP SGD,gemini-2.5-pro-exp-03-25,3.0,2.997967,0.000245,10.0,0.939335,0.27907,0.333333,0.24,3.0,The recommended DP settings aim to balance privacy requirements in a 'Healthcare' domain with th...
3,Kaggle Stroke Prediction Groq (llama-3.3-70b-versatile) DP SGD,llama-3.3-70b-versatile,2.5,2.498573,1e-05,10.0,0.925636,0.155556,0.175,0.14,2.5,The target epsilon of 2.5 balances the need for model utility in the healthcare domain with the ...
4,Kaggle Stroke Prediction Groq (gemma2-9b-it) DP SGD,gemma2-9b-it,3.0,2.990195,1e-05,10.0,0.933464,0.190476,0.235294,0.16,3.0,The chosen DP settings are based on a balance between privacy and utility. A `target_epsilon` o...
5,Kaggle Stroke Prediction Groq (deepseek-r1-distill-llama-70b) DP SGD,deepseek-r1-distill-llama-70b,3.0,2.990195,1e-05,10.0,0.935421,0.232558,0.277778,0.2,3.0,"The target epsilon of 3.0 balances privacy and utility, suitable for healthcare applications. De..."
6,Kaggle Stroke Prediction Groq (llama-3.3-70b-versatile) DP SGD,llama-3.3-70b-versatile,2.5,2.498573,1e-05,10.0,0.943249,0.216216,0.333333,0.16,2.5,The target epsilon of 2.5 balances the need for model utility in the healthcare domain with the ...
7,Kaggle Stroke Prediction Groq (qwen-qwq-32b) DP SGD,qwen-qwq-32b,3.0,2.990195,1e-05,10.0,0.940313,0.186667,0.28,0.14,3.0,1. **target_epsilon=3.0**: Balances healthcare data sensitivity (requires stronger privacy than ...
8,Kaggle Stroke Prediction Groq (gemma2-9b-it) DP SGD,gemma2-9b-it,3.0,2.990195,1e-05,10.0,0.921722,0.259259,0.241379,0.28,3.0,The chosen DP settings are based on a balance between privacy and utility. A `target_epsilon` o...
9,Kaggle Stroke Prediction Groq (deepseek-r1-distill-llama-70b) DP SGD,deepseek-r1-distill-llama-70b,3.0,2.998465,0.000245,10.0,0.935421,0.214286,0.264706,0.18,3.0,"The target epsilon of 3.0 balances privacy and utility needs for healthcare data, providing suff..."


In [23]:
# Drop rows 3-5: earlier Groq runs that are repeated later in the table (rows 6, 8, 9).
results_df = results_df.drop([3,4,5])
results_df['dataset'] = 'cleaned_healthcare_stroke'
results_df

Unnamed: 0,Run Type,LLM Used,Target Epsilon,Final Epsilon,Target Delta,Max Grad Norm,Accuracy,F1 (Stroke),Precision (Stroke),Recall (Stroke),LLM Epsilon Suggestion,LLM Reasoning,dataset
0,Kaggle Stroke Prediction Non-DP SGD,,,,,,0.720157,0.218579,0.126582,0.8,,,cleaned_healthcare_stroke
1,Kaggle Stroke Prediction Fixed DP SGD,N/A (Fixed Defaults),1.0,0.991973,0.000245,1.0,0.951076,0.0,0.0,0.0,1.0,Using fixed defaults for Kaggle Stroke Prediction.,cleaned_healthcare_stroke
2,Kaggle Stroke Prediction Gemini DP SGD,gemini-2.5-pro-exp-03-25,3.0,2.997967,0.000245,10.0,0.939335,0.27907,0.333333,0.24,3.0,The recommended DP settings aim to balance privacy requirements in a 'Healthcare' domain with th...,cleaned_healthcare_stroke
6,Kaggle Stroke Prediction Groq (llama-3.3-70b-versatile) DP SGD,llama-3.3-70b-versatile,2.5,2.498573,1e-05,10.0,0.943249,0.216216,0.333333,0.16,2.5,The target epsilon of 2.5 balances the need for model utility in the healthcare domain with the ...,cleaned_healthcare_stroke
7,Kaggle Stroke Prediction Groq (qwen-qwq-32b) DP SGD,qwen-qwq-32b,3.0,2.990195,1e-05,10.0,0.940313,0.186667,0.28,0.14,3.0,1. **target_epsilon=3.0**: Balances healthcare data sensitivity (requires stronger privacy than ...,cleaned_healthcare_stroke
8,Kaggle Stroke Prediction Groq (gemma2-9b-it) DP SGD,gemma2-9b-it,3.0,2.990195,1e-05,10.0,0.921722,0.259259,0.241379,0.28,3.0,The chosen DP settings are based on a balance between privacy and utility. A `target_epsilon` o...,cleaned_healthcare_stroke
9,Kaggle Stroke Prediction Groq (deepseek-r1-distill-llama-70b) DP SGD,deepseek-r1-distill-llama-70b,3.0,2.998465,0.000245,10.0,0.935421,0.214286,0.264706,0.18,3.0,"The target epsilon of 3.0 balances privacy and utility needs for healthcare data, providing suff...",cleaned_healthcare_stroke


In [24]:
results_df.to_csv('results_multiple_1.csv', index=False)
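
Dropping rows by hard-coded index, as in `results_df.drop([3,4,5])`, breaks as soon as the cells are re-run in a different order. One alternative is to de-duplicate `results_list` by run name before building the DataFrame, keeping only the most recent result for each run. This is a sketch under assumptions: `keep_latest_runs` is a hypothetical helper, and the run names in the demo are abbreviated versions of the ones in the table.

```python
def keep_latest_runs(results, key="Run Type"):
    """Keep the last recorded result per run name, in order of last occurrence."""
    seen = set()
    latest = []
    # Walk backwards so the first hit for each name is its most recent run.
    for row in reversed(results):
        name = row.get(key)
        if name in seen:
            continue
        seen.add(name)
        latest.append(row)
    latest.reverse()
    return latest


# Demo rows: abbreviated run names, accuracy values copied from the results table.
demo_results = [
    {"Run Type": "Fixed DP SGD", "Accuracy": 0.951076},
    {"Run Type": "Groq (gemma2-9b-it) DP SGD", "Accuracy": 0.933464},
    {"Run Type": "Groq (gemma2-9b-it) DP SGD", "Accuracy": 0.921722},
]
deduped = keep_latest_runs(demo_results)
for row in deduped:
    print(row["Run Type"], row["Accuracy"])
```

Applied to the ten rows above, this would retain the same rows as the manual drop (0, 1, 2, 6, 7, 8, 9), assuming run names are the only thing that identifies a repeat.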

In [None]:
# Description: Outline and provide example code for tuning DP parameters.

print("\n--- Phase 3: Tuning (Example - Epsilon vs. Utility) ---")
print("This section shows how you *could* tune epsilon. Uncomment and adapt to run.")

# tuning_results_list = []
# epsilon_values_to_test = [0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0]

# # Choose a base config (e.g., Gemini's recommendations for other params)
# tuning_base_config = gemini_config_curr.copy() if 'gemini_config_curr' in locals() and gemini_config_curr else fixed_dp_config_curr.copy()
# print(f"Using base config from: {tuning_base_config.get('llm_model_name', 'Fixed Defaults')}")

# if tuning_base_config and n_features_current is not None and train_loader_current is not None:
#     for eps_val in epsilon_values_to_test:
#         print(f"\nTuning Run: Epsilon = {eps_val}")
#         current_tuning_config = tuning_base_config.copy()
#         current_tuning_config["target_epsilon"] = eps_val
#         run_name = f"Tuned DP (Base: {tuning_base_config.get('llm_model_name', 'Fixed')}, Eps={eps_val})"

#         results_tuned = train_evaluate_dp_classification_model(
#             current_tuning_config,
#             run_name,
#             train_loader_current,
#             test_loader_current,
#             n_features_current,
#             device,
#             EPOCHS,
#             LEARNING_RATE,
#             pos_weight_tensor_current,
#             target_metric_prefix=CURRENT_TARGET_METRIC_PREFIX,
#             y_test_labels_for_eval=y_test_labels_curr
#         )
#         if results_tuned:
#             tuning_results_list.append(results_tuned)
# else:
#     print("Cannot run tuning - missing base config or other components.")

# if tuning_results_list:
#     tuning_df = pd.DataFrame(tuning_results_list)
#     print("\n--- Tuning Results (Epsilon vs. F1) ---")
#     display(tuning_df[['Run Type', 'Target Epsilon', 'Final Epsilon', 'F1 (Stroke)']])
#     # Add plotting code here if desired (e.g., using matplotlib or seaborn)
#     # import matplotlib.pyplot as plt
#     # plt.plot(tuning_df['Final Epsilon'], tuning_df['F1 (Stroke)'], marker='o')
#     # plt.xlabel('Final Epsilon (ε)')
#     # plt.ylabel('F1 Score (Stroke Class)')
#     # plt.title('DP Utility-Privacy Trade-off (F1 vs. Epsilon)')
#     # plt.grid(True)
#     # plt.show()

In [None]:
## Phase 4: Documentation and Conclusion

# **Summary of Findings:**
# * **Non-DP Baseline:** Establish performance without privacy (often high recall, low precision on minority).
# * **Fixed DP:** Show the utility cost of applying generic DP settings (e.g., ε=1.0). Usually significant drop, especially in recall/F1 for minority.
# * **LLM-Guided DP (Gemini vs. Groq-hosted models):** Compare the configurations suggested by different LLMs.
#     * Did they suggest similar parameters (epsilon, delta, max_grad_norm)?
#     * Did they identify similar sensitivities?
#     * How did the resulting model performance compare? Did one LLM's config lead to better utility (e.g., F1 score) for a comparable privacy level (final epsilon)?
#     * Evaluate the quality and relevance of the 'reasoning' provided by each LLM.
# * **Tuning (If Performed):** Discuss the trade-off observed (e.g., how F1 score changes as epsilon increases). What seems like a reasonable balance for this specific task?
# * **Overall:** Conclude on the feasibility and potential benefits/drawbacks of using LLMs as DP advisors in this context. Highlight the importance of human oversight and the need to validate LLM suggestions. Mention limitations (e.g., only a handful of LLMs tested, one dataset, simple model).

# **Next Steps/Future Work:**
# * Test more LLMs.
# * Experiment with different datasets and tasks (regression, more complex models).
# * Explore prompt engineering variations to improve LLM guidance (e.g., explicitly asking to optimize for F1).
# * Investigate using the 'column_sensitivity_epsilon' hints more directly (requires advanced DP techniques beyond standard DP-SGD).
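
The point about human oversight and validating LLM suggestions can be made concrete with a small sanity check that runs before a suggested config reaches the training function. The sketch below is one possible shape for such a check, not part of the notebook: `validate_dp_config` is a hypothetical helper, the key names are taken from the configs printed in the run logs, and the `max_epsilon=10.0` ceiling is an arbitrary illustrative bound.

```python
def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_dp_config(config, max_epsilon=10.0):
    """Return a list of problems found in an LLM-suggested DP config (empty if none)."""
    problems = []

    eps = config.get("target_epsilon")
    if not _is_number(eps) or not (0 < eps <= max_epsilon):
        problems.append(f"target_epsilon must be a number in (0, {max_epsilon}], got {eps!r}")

    delta = config.get("target_delta")
    delta_is_rule = isinstance(delta, str) and "1/N" in delta  # resolved later from N
    if not delta_is_rule and (not _is_number(delta) or not (0 < delta < 1)):
        problems.append(f"target_delta must be in (0, 1) or a '1/N' rule, got {delta!r}")

    norm = config.get("max_grad_norm")
    if not _is_number(norm) or norm <= 0:
        problems.append(f"max_grad_norm must be a positive number, got {norm!r}")

    return problems


# Values as printed in the Gemini run log.
suggested = {"dp_algorithm": "DP-SGD", "target_epsilon": 3.0,
             "target_delta": 0.000245, "max_grad_norm": 10.0}
print(validate_dp_config(suggested))
print(validate_dp_config({"target_epsilon": 50, "target_delta": "1/N", "max_grad_norm": 1.0}))
```

A check like this only catches malformed or out-of-range values; whether a given epsilon is appropriate for the data is still a judgment that needs a human reviewer.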