## Overview

This notebook compares the performance of two machine learning classifiers, Random Forest and Logistic Regression, on a binary classification task.

### Experiment Logic:
- **Data Preprocessing**
- **Model Tuning**
- **Evaluation and Comparison**
- **Overfitting Analysis**
- **Final Model Selection**

The code is divided into two main modules:

---

### 1. Method Definition Module Overview

This section includes the following segments:

- **Method of Monitoring System Resource Usage**
- **Method of Data Loading**
- **Method of Filtering Invalid Parameter Combinations**
- **Model and Hyperparameter Definitions**
- **Method of Hyperparameter Search**
- **Method of Training and Evaluation**
- **Method of Plotting ROC and PR Curves**
- **Method of Test Set Evaluation**
- **Method of Plotting Learning Curves**

---

### 2. Execution Module Overview

This section includes the following segments:

- **Data Loading and Splitting**
- **Hyperparameter Search**
- **K-Fold Cross Validation**
- **Statistical Test Results**
- **Model Selection and Final Testing**
- **Overfitting Detection**
- **Random Forest Manual Hyperparameter Tuning**


### Import Packages
NumPy and pandas provide numerical computation and structured data handling. Matplotlib and Seaborn are used for visualization, including the ROC and PR curves. The time, psutil, and GPUtil modules support system monitoring: execution time, CPU and memory usage, and GPU utilization. SciPy supplies the t-test used for the statistical comparison.

scikit-learn covers the rest of the pipeline: StratifiedKFold keeps the class distribution balanced across folds, RandomizedSearchCV performs hyperparameter tuning, Pipeline chains standardization (StandardScaler) with model training, and the metrics module computes precision, recall, F1-score, AUC, and average precision. RandomForestClassifier and LogisticRegression are the two classifiers under comparison, one ensemble-based and one an interpretable linear model. Warnings of category UserWarning are suppressed, and a fixed random seed is set for reproducibility.


In [49]:
import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psutil
import seaborn as sns
from scipy.stats import ttest_ind
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score, average_precision_score, f1_score, precision_score, recall_score

# GPUtil is optional: if it is not installed, GPU usage is reported as unavailable
try:
    import GPUtil
except ImportError:
    GPUtil = None

warnings.filterwarnings("ignore", category=UserWarning)

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

### 1. Method Definition Module Overview

This section includes the following segments:

- **Method of Monitoring System Resource Usage**
- **Method of Data Loading**
- **Method of Filtering Invalid Parameter Combinations**
- **Model and Hyperparameter Definitions**
- **Method of Hyperparameter Search**
- **Method of Training and Evaluation**
- **Method of Plotting ROC and PR Curves**
- **Method of Test Set Evaluation**
- **Method of Plotting Learning Curves**


### 1.1 Method of Monitoring System Resource Usage
The get_system_usage function monitors CPU, memory, and GPU utilization.
This function provides a snapshot of system resource consumption, aiding in performance monitoring during machine learning tasks.

In [None]:
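# Method of monitoring system resource usage
def get_system_usage():
    # Note: interval=1 blocks for one second while the CPU load is sampled
    cpu_usage = psutil.cpu_percent(interval=1)
    mem_usage = psutil.virtual_memory().percent
    gpu_usage = 0.0  # default when no GPU is detected
    try:
        gpus = GPUtil.getGPUs()
        if gpus:
            gpu_usage = gpus[0].load * 100  # load of the first GPU, in percent
    except Exception:
        # GPUtil is missing or the query failed; NaN keeps later averaging and formatting valid
        gpu_usage = float("nan")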
    return cpu_usage, mem_usage, gpu_usage
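
The snapshots returned by get_system_usage are later averaged over a start and an end reading. A minimal sketch of that averaging, assuming a hypothetical helper named `average_usage` (not part of the notebook) that tolerates an unavailable GPU reading:

```python
import math

def average_usage(start, end):
    """Average two usage readings in percent; NaN if either is not numeric.

    Hypothetical helper, for illustration only.
    """
    numeric = (int, float)
    if not isinstance(start, numeric) or not isinstance(end, numeric):
        return math.nan
    return (start + end) / 2

cpu_avg = average_usage(20.0, 40.0)    # both readings available
gpu_avg = average_usage("N/A", "N/A")  # GPU reading unavailable
print(f"CPU = {cpu_avg:.2f}%, GPU = {gpu_avg:.2f}%")
```

Returning NaN rather than a string means the value can still be passed through numeric formatting and np.mean without raising an error.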

### 1.2 Method of Data Loading
The load_data function reads a CSV file, extracts feature variables (X) and target labels (y) for a binary classification task, 
and ensures numerical data is in float32 format for efficient computation.

In [None]:
# Method of data loading
def load_data(file_path):
    # The first row is the header; fields are separated by ';'
    data = pd.read_csv(file_path, header=0, sep=';')
    X = data.iloc[:, 1:-1].astype(np.float32)  # first column is the id, last column is the target
    y = data.iloc[:, -1]  # binary target label
    return X, y

### 1.3 Method of Filtering Invalid Parameter Combinations
The filter_params function removes incompatible solver-penalty combinations for Logistic Regression (for example, lbfgs supports l2 but not l1). Because RandomizedSearchCV samples every parameter independently, the valid combinations are expressed as a list of grids, one per solver, each holding only the penalties that solver accepts. For other models, such as Random Forest, the original parameter grid is returned unchanged.

In [None]:
# Method of filtering invalid parameter combinations
def filter_params(model_name, param_grid):
    # C: inverse of the regularization strength in Logistic Regression (smaller values regularize more).
    # solver: the algorithm used to optimize the Logistic Regression model (e.g., lbfgs, liblinear).
    # penalty: the type of regularization applied (e.g., l1, l2).
    if model_name != "Logistic Regression":
        return param_grid  # Random Forest needs no filtering

    # Penalties accepted by each solver in the search space
    supported = {
        "lbfgs": {"l2"},
        "liblinear": {"l1", "l2"},
        "saga": {"l1", "l2", "elasticnet"},
    }
    grids = []
    for solver in param_grid["solver"]:
        penalties = [p for p in param_grid["penalty"] if p in supported.get(solver, set())]
        if penalties:
            # One grid per solver; all other entries (C, class_weight) are kept as they are
            grids.append({**param_grid, "solver": [solver], "penalty": penalties})
    return grids


### 1.4 Model and Hyperparameter Definitions
The models dictionary defines two machine learning models, Logistic Regression and Random Forest, with their respective hyperparameter search spaces.

In [None]:
# Define the hyperparameter search space for each model
models = {
    'Logistic Regression': {
        'class': LogisticRegression,
        'param_grid': {
            'C': [0.01, 0.1, 1, 10, 100],
            'solver': ['liblinear', 'lbfgs', 'saga'],
            'penalty': ['l1', 'l2'],
            'class_weight': ['balanced']
        }
    },
    'Random Forest': {
        'class': RandomForestClassifier,
        'param_grid': {
            'n_estimators': [500, 1000],
            'max_depth': [10, 20, None],
            'min_samples_split': [2, 5, 10],
            'bootstrap': [True],
            'random_state': [RANDOM_SEED],
            'class_weight': ['balanced']
        }
    }
}

### 1.5 Method of Hyperparameter Search
The hyperparameter_search function tunes hyperparameters with RandomizedSearchCV. For Logistic Regression, it builds a pipeline of data scaling (StandardScaler) followed by the model, filters the parameter grid for compatibility, and prefixes each parameter name with model__ so that it is routed to the pipeline's model step. For Random Forest, it uses the provided parameter grid directly. The search samples 10 candidate configurations, scores each with 3-fold cross-validation on the roc_auc metric, and returns the best estimator together with its hyperparameters.

In [None]:
# Method of hyperparameter search
def hyperparameter_search(model_name, model_class, param_grid, X_train, y_train):
    # Search on the training-validation set to find the best parameters
    print(f"\n🔍 Searching best hyperparameters for {model_name}...")
    # Logistic Regression needs standardized inputs, so scaling is part of the pipeline
    if model_name == 'Logistic Regression':
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('model', model_class(random_state=RANDOM_SEED))
        ])
        filtered_grids = filter_params(model_name, param_grid)
        # Prefix parameter names so they are routed to the 'model' step of the pipeline
        param_grid_fixed = [
            {f"model__{k}": v for k, v in grid.items()} for grid in filtered_grids
        ]
    else:
        pipeline = model_class(random_state=RANDOM_SEED)
        param_grid_fixed = param_grid  # Random Forest uses the dictionary directly
    # Random search is faster than an exhaustive grid search
    search = RandomizedSearchCV(
        pipeline, param_distributions=param_grid_fixed, n_iter=10,
        cv=3, scoring='roc_auc', n_jobs=4, random_state=RANDOM_SEED
    )

    search.fit(X_train, y_train)
    print(f"Best hyperparameters for {model_name}: {search.best_params_}")

    return search.best_estimator_, search.best_params_
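
RandomizedSearchCV draws each parameter independently from its own list, so solver-penalty compatibility cannot be encoded by lining values up inside a single dictionary. Its `param_distributions` argument also accepts a list of dictionaries, in which case one dictionary is chosen first and the parameters are drawn from it. The sketch below uses only the standard library to enumerate what such a list of grids allows; the `supported` table and the `valid_grids` name are illustrative assumptions, not part of the notebook.

```python
from itertools import product

# Assumed solver -> penalty compatibility for this illustration
supported = {"lbfgs": {"l2"}, "liblinear": {"l1", "l2"}, "saga": {"l1", "l2"}}

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["liblinear", "lbfgs", "saga"],
    "penalty": ["l1", "l2"],
}

def valid_grids(grid):
    """Split one grid into per-solver grids that hold only compatible penalties."""
    grids = []
    for solver in grid["solver"]:
        penalties = [p for p in grid["penalty"] if p in supported[solver]]
        if penalties:
            grids.append({**grid, "solver": [solver], "penalty": penalties})
    return grids

# Enumerate every (solver, penalty) pair reachable from the per-solver grids
pairs = sorted(
    (s, p)
    for g in valid_grids(param_grid)
    for s, p in product(g["solver"], g["penalty"])
)
print(pairs)
```

With this layout the pair (lbfgs, l1) can never be sampled, while every value of C remains available for each solver.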

### 1.6 Method of Training and Evaluation
The train_and_evaluate function performs K-fold cross-validation to train and evaluate the model. For each fold, it trains the model on the training
subset and evaluates it on the validation subset, recording performance metrics like Recall, F1-score, AUC, AP, and Precision. It also tracks CPU,
memory, and GPU usage, along with the time taken for training each fold. After completing all folds, it computes and prints average metrics,
total training time, and average resource usage. The function then plots the ROC and precision-recall curves and returns the evaluation results across
all folds.

In [None]:
# Method of training and evaluation
from sklearn.base import clone  # gives each fold a fresh, unfitted copy of the tuned estimator

def train_and_evaluate(model_name, best_model, X, y, cv):
    # Run K-fold cross-validation and record per-fold metrics, timings, and resource usage
    all_fpr, all_tpr, all_precision, all_recall = [], [], [], []
    train_errors, val_errors = [], []
    auc_scores, ap_scores, f1_scores, precision_scores, recall_scores = [], [], [], [], []
    cpu_usages, mem_usages, gpu_usages, training_times = [], [], [], []

    print(f"\n=== Running {model_name} ===")

    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        print(f"\n=== Fold {fold+1} ===")

        # Sample resources before starting the timer: get_system_usage() blocks for about 1 s
        cpu_start, mem_start, gpu_start = get_system_usage()
        fold_start_time = time.time()

        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Clone so every fold starts from an unfitted model with the tuned hyperparameters
        model = clone(best_model)
        model.fit(X_train, y_train)

        y_train_pred = model.predict(X_train)
        y_val_pred = model.predict(X_val)

        train_f1 = f1_score(y_train, y_train_pred)
        val_f1 = f1_score(y_val, y_val_pred)
        # Error is defined as 1 - F1 and kept for the overfitting analysis
        train_errors.append(1 - train_f1)
        val_errors.append(1 - val_f1)

        y_proba = model.predict_proba(X_val)[:, 1]
        fpr, tpr, _ = roc_curve(y_val, y_proba)
        prec, rec, _ = precision_recall_curve(y_val, y_proba)

        auc_score = roc_auc_score(y_val, y_proba)
        ap_score = average_precision_score(y_val, y_proba)
        precision = precision_score(y_val, y_val_pred)
        recall = recall_score(y_val, y_val_pred)

        # Store the per-fold metrics and curves
        auc_scores.append(auc_score)
        ap_scores.append(ap_score)
        f1_scores.append(val_f1)
        precision_scores.append(precision)
        recall_scores.append(recall)

        all_fpr.append(fpr)
        all_tpr.append(tpr)
        all_precision.append(prec)
        all_recall.append(rec)

        # Time for this fold (fit, prediction, scoring), excluding resource sampling
        fold_training_time = time.time() - fold_start_time
        cpu_end, mem_end, gpu_end = get_system_usage()
        cpu_usage = (cpu_start + cpu_end) / 2
        mem_usage = (mem_start + mem_end) / 2
        # An unavailable GPU reading is stored as NaN so averaging and formatting still work
        gpu_usage = (gpu_start + gpu_end) / 2 if "N/A" not in (gpu_start, gpu_end) else float("nan")

        cpu_usages.append(cpu_usage)
        mem_usages.append(mem_usage)
        gpu_usages.append(gpu_usage)
        training_times.append(fold_training_time)

        print(f"Fold {fold+1}: Recall = {recall:.3f}, F1 = {val_f1:.3f}, "
              f"AUC = {auc_score:.3f}, AP = {ap_score:.3f}, Precision = {precision:.3f}, "
              f"Training Time = {fold_training_time:.2f}s, CPU = {cpu_usage:.2f}%, "
              f"Memory = {mem_usage:.2f}%, GPU = {gpu_usage:.2f}%")

    # Total time is the sum over folds, so resource sampling is not counted
    total_training_time = np.sum(training_times)
    avg_cpu_usage = np.mean(cpu_usages)
    avg_mem_usage = np.mean(mem_usages)
    avg_gpu_usage = np.mean(gpu_usages)

    print(f"\n {model_name} Total Training Time: {total_training_time:.2f}s | "
          f"Avg CPU: {avg_cpu_usage:.2f}% | Avg Memory: {avg_mem_usage:.2f}% | Avg GPU: {avg_gpu_usage:.2f}%")

    print(f"\n Final {model_name} Avg Recall: {np.mean(recall_scores):.3f}, Avg F1: {np.mean(f1_scores):.3f}, "
          f"Avg AUC: {np.mean(auc_scores):.3f}, Avg AP: {np.mean(ap_scores):.3f}, "
          f"Avg Precision: {np.mean(precision_scores):.3f}")

    plot_roc_pr_curves(all_fpr, all_tpr, all_precision, all_recall, model_name)

    # Return the per-fold lists, which are used in the statistical tests
    return auc_scores, ap_scores, train_errors, val_errors, f1_scores, recall_scores, training_times, cpu_usages, mem_usages, gpu_usages
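
The per-fold precision, recall, and F1-score follow directly from the confusion counts, and the recorded error is 1 - F1. A small standard-library sketch with made-up labels (the function name `binary_metrics` is hypothetical; the notebook itself relies on the scikit-learn metric functions):

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = binary_metrics(y_true, y_pred)
error = 1 - f1  # the quantity appended to train_errors / val_errors
print(f"Precision = {precision:.3f}, Recall = {recall:.3f}, F1 = {f1:.3f}, Error = {error:.3f}")
```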


### 1.7 Method of Plotting ROC and PR Curves
The plot_roc_pr_curves function plots the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) curves for all folds of cross-validation.
The ROC curve is plotted on the left, showing the False Positive Rate vs True Positive Rate, with a reference line for random guessing.
The PR curve is plotted on the right, showing Recall vs Precision. Each fold’s curve is labeled, and the plot is displayed with appropriate titles
and legends for the model name. The function helps in visually assessing the model's performance across different metrics.

In [None]:
# Method of plotting ROC and PR curves
def plot_roc_pr_curves(all_fpr, all_tpr, all_precision, all_recall, model_name):
    
    plt.figure(figsize=(12, 5))

    # plot ROC curve
    plt.subplot(1, 2, 1)
    for i in range(len(all_fpr)):
        plt.plot(all_fpr[i], all_tpr[i], lw=1.5, label=f'Fold {i+1}')
    
    # Add the y = x reference line (random guessing baseline)
    plt.plot([0, 1], [0, 1], 'k--', lw=1, label='Random Guess')
    
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'{model_name} - ROC Curve')
    plt.legend()

    # plot PR curve
    plt.subplot(1, 2, 2)
    for i in range(len(all_precision)):
        plt.plot(all_recall[i], all_precision[i], lw=1.5, label=f'Fold {i+1}')
    
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'{model_name} - Precision-Recall Curve')
    plt.legend()

    plt.tight_layout()
    plt.show()
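
The area under the ROC curve also has a ranking interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A sketch with illustrative scores (the function name `pairwise_auc` is hypothetical, and this quadratic-time version is meant for explanation only; the notebook uses roc_auc_score):

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Four samples: the positive scored 0.35 is ranked below the negative scored 0.4
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

A value of 0.5 corresponds to the random-guess diagonal drawn in the ROC plot, and 1.0 to a perfect ranking.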

### 1.8 Method of Test Set Evaluation
This method evaluates the final model on the test set by calculating metrics such as AUC, average precision (AP),
F1 score, precision, and recall. It also tracks evaluation time and system resource usage (CPU, memory, GPU). 
Additionally, it plots the ROC curve and Precision-Recall curve, and prints the final evaluation results.

In [None]:
# Method of test set evaluation
def evaluate_on_test_set(model, X_test, y_test, model_name):
    # Sample resources before starting the timer: get_system_usage() blocks for about 1 s
    cpu_start, mem_start, gpu_start = get_system_usage()
    start_time = time.time()

    
    y_test_proba = model.predict_proba(X_test)[:, 1]
    y_test_pred = model.predict(X_test)

    auc_score = roc_auc_score(y_test, y_test_proba)
    ap_score = average_precision_score(y_test, y_test_proba)
    f1 = f1_score(y_test, y_test_pred)
    precision = precision_score(y_test, y_test_pred)
    recall = recall_score(y_test, y_test_pred)

    fpr, tpr, _ = roc_curve(y_test, y_test_proba)
    prec, rec, _ = precision_recall_curve(y_test, y_test_proba)
    
    end_time = time.time()
    cpu_end, mem_end, gpu_end = get_system_usage()

    test_time = end_time - start_time
    test_cpu = (cpu_start + cpu_end) / 2
    test_mem = (mem_start + mem_end) / 2
    test_gpu = (gpu_start + gpu_end) / 2 if gpu_start != "N/A" else "N/A"

    print(f"\n Final Model Evaluation Time: {test_time:.2f}s | CPU: {test_cpu:.2f}% | Memory: {test_mem:.2f}% | GPU: {test_gpu}%")


    print(f"\n {model_name} Final Test Results:")
    print(f" Recall: {recall:.3f},F1: {f1:.3f}, AUC: {auc_score:.3f}, AP: {ap_score:.3f},  Precision: {precision:.3f}")

    plot_roc_pr_curves([fpr], [tpr], [prec], [rec], f"{model_name} - Test Set")
  


### 1.9 Method of Plotting Learning Curves
This method plots three learning curves to assess model performance: one representing the training F1 scores using 5-fold cross-validation, another for the validation F1 scores from cross-validation, and a third showing the test F1 scores to evaluate the model's generalization ability as the training size increases.

In [None]:
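from sklearn.base import clone  # used to fit copies without modifying final_model

def plot_learning_curves(final_model, X_train, y_train, X_test, y_test, model_name):
    # Cross-validated training and validation F1 scores at ten training-set sizes
    train_sizes, train_scores, val_scores = learning_curve(
        final_model, X_train, y_train, cv=5, scoring='f1',
        train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=4
    )

    train_scores_mean = np.mean(train_scores, axis=1)
    val_scores_mean = np.mean(val_scores, axis=1)

    test_scores = []  # test F1 score at each training-set size
    for train_size in train_sizes:
        # Use the first train_size rows (positional slicing)
        X_subtrain = X_train.iloc[:int(train_size)]
        y_subtrain = y_train.iloc[:int(train_size)]

        # Fit a clone so the already-trained final_model is left untouched
        model = clone(final_model).fit(X_subtrain, y_subtrain)
        y_test_pred = model.predict(X_test)
        test_scores.append(f1_score(y_test, y_test_pred))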

    # plot curves
    plt.figure(figsize=(10, 6))

    #training curves and validation curves
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training Score (CV)")
    plt.plot(train_sizes, val_scores_mean, 'o-', color="b", label="Validation Score (CV)")

    # testing curves
    plt.plot(train_sizes, test_scores, 'o-', color="g", label="Test Score")

    plt.xlabel("Training Samples")
    plt.ylabel("F1 Score")
    plt.title(f"{model_name} Learning Curves")
    plt.legend(loc="best")
    plt.show()


### 2. Execution Module Overview

This section includes the following segments:

- **Data Loading and Splitting**
- **Hyperparameter Search**
- **K-Fold Cross Validation**
- **Statistical Test Results**
- **Model Selection and Final Testing**
- **Overfitting Detection**
- **Random Forest Manual Hyperparameter Tuning**


### 2.1 Data Loading and Splitting
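This section loads the dataset from the specified file path and separates it into features (X) and target labels (y). The data is then split into a training-validation set and a test set, with 10% reserved for testing and the class ratio preserved through stratification. The number of cross-validation folds is one per 5,000 training-validation samples, capped at 10 and never fewer than 2, the minimum that StratifiedKFold accepts.

In [None]:

file_path = "cardio_train.csv"
X, y = load_data(file_path)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.1, random_state=RANDOM_SEED, stratify=y)
# One fold per 5,000 samples, capped at 10; StratifiedKFold requires at least 2 splits
k_folds = max(2, min(10, len(X_train_val) // 5000))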
cv = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=RANDOM_SEED)
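
The fold-count rule scales the number of folds with the amount of data. A standalone sketch of the rule, assuming a hypothetical helper `choose_k_folds` and illustrative sample counts; the lower bound of 2 reflects the fact that StratifiedKFold rejects fewer than two splits:

```python
def choose_k_folds(n_samples, per_fold=5000, max_folds=10, min_folds=2):
    """One fold per `per_fold` samples, clamped to the range [min_folds, max_folds]."""
    return max(min_folds, min(max_folds, n_samples // per_fold))

# Illustrative training-validation set sizes
for n in (8_000, 23_000, 63_000):
    print(n, choose_k_folds(n))
```

Small datasets fall back to 2 folds, mid-sized ones get `n_samples // 5000` folds, and large ones are capped at 10.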

### 2.2 Hyperparameter Search
This section performs hyperparameter search for each model in the models dictionary. It iterates over the models, applying the hyperparameter_search function to find the best model and corresponding hyperparameters. The results are stored in dictionaries, with best_models holding the optimal models and best_params storing the best parameter sets for each model.


In [None]:

best_models = {}  
best_params = {}  
for model_name, model_info in models.items():
    best_models[model_name], best_params[model_name] = hyperparameter_search(
        model_name, model_info['class'], model_info['param_grid'], X_train_val, y_train_val
    )


### 2.3 K-Fold Cross-Validation
This section performs K-fold cross-validation for each model using the best model found from the hyperparameter search. It iterates over the models and evaluates their performance using the train_and_evaluate function. The results, including performance metrics, are stored in the results dictionary for each model.

In [None]:

results = {}
for name, info in models.items():
    results[name] = train_and_evaluate(name, best_models[name], X_train_val, y_train_val, cv)

### 2.4 Statistical Testing
This section performs statistical tests to compare the performance metrics of Logistic Regression and Random Forest models. Using a t-test, it evaluates the significance of differences across various metrics, including Recall, F1-score, AUC, AP, training time, and system resource usage (CPU, memory, and GPU). The results are printed, highlighting whether a significant difference was detected for each metric based on the p-value threshold of 0.05.

In [None]:



metrics = ["Recall", "F1-score", "AUC", "AP", "Training Time (s)", "CPU Usage (%)", "Memory Usage (%)", "GPU Usage (%)"]
logistic_training_time, logistic_cpu_usage, logistic_memory_usage, logistic_gpu_usage = results['Logistic Regression'][6:]
rf_training_time, rf_cpu_usage, rf_memory_usage, rf_gpu_usage = results['Random Forest'][6:]

# Indices follow the return order of train_and_evaluate:
# 0 = AUC, 1 = AP, 2 = train errors, 3 = validation errors, 4 = F1-score, 5 = Recall
logistic_results = [
    results['Logistic Regression'][5],  # Recall
    results['Logistic Regression'][4],  # F1-score
    results['Logistic Regression'][0],  # AUC
    results['Logistic Regression'][1],  # AP
    logistic_training_time,             # training time
    logistic_cpu_usage,                 # CPU
    logistic_memory_usage,              # memory
    logistic_gpu_usage                  # GPU
]

random_forest_results = [
    results['Random Forest'][5],  # Recall
    results['Random Forest'][4],  # F1-score
    results['Random Forest'][0],  # AUC
    results['Random Forest'][1],  # AP
    rf_training_time,             # training time
    rf_cpu_usage,                 # CPU
    rf_memory_usage,              # memory
    rf_gpu_usage                  # GPU
]

print("\n🎯 Statistical Test Results (Logistic Regression vs Random Forest)")
for i in range(len(metrics)):
    t_stat, p_value = ttest_ind(logistic_results[i], random_forest_results[i])

    print(f"\n📌 {metrics[i]}:")
    print(f"   T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")

    if p_value < 0.05:
        print(f"   ✅ Significant difference detected for {metrics[i]}!")
    else:
        print(f"   ❌ No significant difference for {metrics[i]}.")
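
For reference, the statistic that ttest_ind reports under its default equal-variance assumption can be written out by hand. A sketch with made-up per-fold scores (`pooled_t_statistic` is a hypothetical name, and the p-value is omitted because it requires the t distribution):

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance, weighted by degrees of freedom
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err

# Illustrative AUC values for two models over three folds
model_a_auc = [0.70, 0.72, 0.74]
model_b_auc = [0.71, 0.73, 0.75]

t_stat = pooled_t_statistic(model_a_auc, model_b_auc)
print(f"T-statistic: {t_stat:.4f}")
```

Because both models are scored on the same folds, a paired test such as scipy.stats.ttest_rel may be worth considering as an alternative; the notebook uses the independent-samples version.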





### 2.5 Model Selection and Final Testing
This section selects the model with the higher mean cross-validated AUC between Logistic Regression and Random Forest. The chosen model is then trained on the entire training-validation dataset and evaluated on the test set. The evaluation results are printed, showing how the model performs on unseen data.

In [None]:

# results[name][0] is the list of per-fold AUC scores, so compare the means
best_model_name = "Random Forest" if np.mean(results["Random Forest"][0]) > np.mean(results["Logistic Regression"][0]) else "Logistic Regression"
final_model = best_models[best_model_name]  # using the best model

print(f"\n🚀 {best_model_name} Selected for Final Testing")
final_model.fit(X_train_val, y_train_val)

evaluate_on_test_set(final_model, X_test, y_test, best_model_name)
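
When per-fold scores are stored as lists, how they are compared matters: Python compares lists element by element, so only the first differing fold decides the outcome, whereas a comparison of means uses every fold. A short sketch with made-up AUC values (the `fold_auc` dictionary is illustrative only):

```python
from statistics import mean

# Illustrative per-fold AUC scores
fold_auc = {
    "Random Forest": [0.70, 0.90, 0.86],
    "Logistic Regression": [0.80, 0.60, 0.70],
}

# Element-wise list comparison: decided by the first fold alone (0.70 vs 0.80)
list_comparison = fold_auc["Random Forest"] > fold_auc["Logistic Regression"]

# Mean-based selection: every fold contributes
best_by_mean = max(fold_auc, key=lambda name: mean(fold_auc[name]))

print(list_comparison, best_by_mean)  # False Random Forest
```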




### 2.6 Overfitting Detection
This section analyzes overfitting by plotting learning curves for the selected model. It evaluates the model's performance on both the training-validation and test datasets, helping to assess whether the model is overfitting to the training data as the training size increases.

In [None]:

print(f"\n🚀 {best_model_name} Selected for Overfitting Analysis (Learning Curve)")
plot_learning_curves(final_model, X_train_val, y_train_val, X_test, y_test, best_model_name) 
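
One way to read the learning curves numerically is the gap between the training and validation scores at each training size; a gap that stays large as more data is added is a common sign of overfitting. A minimal sketch with made-up F1 scores (`generalization_gaps` is a hypothetical helper, and no threshold for "large" is implied):

```python
def generalization_gaps(train_scores, val_scores):
    """Training score minus validation score at each training size."""
    return [round(t - v, 3) for t, v in zip(train_scores, val_scores)]

# Illustrative mean F1 scores at four increasing training sizes
train_f1 = [0.99, 0.98, 0.97, 0.97]
val_f1 = [0.68, 0.70, 0.71, 0.72]

gaps = generalization_gaps(train_f1, val_f1)
print(gaps)
```

In this made-up example the gap narrows only slowly, which is the pattern that would motivate stronger regularization such as shallower trees.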

### 2.7 Manual Hyperparameter Tuning and Overfitting Detection
Based on the results from the previous learning curve, which indicated potential overfitting, a new Random Forest model is created with manually tuned hyperparameters to address this issue. The model is then trained on the complete training-validation set. After training, learning curves are plotted to assess whether overfitting persists with the new hyperparameter settings.

In [66]:
# The learning curves of the tuned Random Forest model indicated some overfitting.
# To address this, a new Random Forest model is created with manually tuned hyperparameters.

verified_rf_params = {
    'n_estimators': 1000,
    'max_depth': 4,  # maximum tree depth; lowering it further reduces training-set performance, so the best value found is 4
    'min_samples_split': 75,  # minimum samples required to split a node; decreasing it gradually helps when data is sparse, and the best value found is 75
    'bootstrap': True,
    'random_state': RANDOM_SEED,
    'class_weight': 'balanced'
}

# Recreate the Random Forest model using the manually tuned hyperparameters
final_model = RandomForestClassifier(**verified_rf_params)


#  Train the final model on the complete training set (training + validation)
final_model.fit(X_train_val, y_train_val)


#  Overfitting Detection (Plotting Learning Curves)
print(f"\n🚀 Random Forest Selected for Overfitting Analysis (Learning Curve)")
plot_learning_curves(final_model, X_train_val, y_train_val, X_test, y_test, "Random Forest")
