<a target="_blank" href="https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/rapids-pip-colab-template.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Install RAPIDS into Colab"/>
</a>

# RAPIDS cuDF is already on your Colab instance!
RAPIDS cuDF is preinstalled on Google Colab and instantly accelerates pandas with zero code changes. [You can quickly get started with our tutorial notebook](https://nvda.ws/rapids-cudf). This notebook template is for users who want to use the full suite of RAPIDS libraries for their workflows on Colab.  

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` in the cell below to see which GPU you have.  Currently, RAPIDS runs on all available Colab GPU instances.

In [1]:
!nvidia-smi

Tue Apr 15 05:52:38 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   45C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
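
If you would rather check the GPU from Python than read the table above, the following is a minimal sketch. It assumes the `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader` query form is available on the instance; the helper names `parse_gpu_query` and `query_gpus` are illustrative and are not part of RAPIDS or Colab.

```python
import shutil
import subprocess


def parse_gpu_query(text):
    """Parse 'name, memory' CSV lines into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        name, memory = (part.strip() for part in line.rsplit(",", 1))
        try:
            memory_mib = int(memory.split()[0])  # e.g. "15360 MiB" -> 15360
        except (ValueError, IndexError):
            memory_mib = None  # memory not reported in the expected form
        gpus.append((name, memory_mib))
    return gpus


def query_gpus():
    """Return the visible GPUs, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_query(result.stdout) if result.returncode == 0 else []


print(query_gpus() or "No GPU detected: check Runtime > Change Runtime Type.")
```

An empty result usually means the runtime type is not set to GPU.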
                                                

# Setup
This setup script:

1. Checks that the GPU is RAPIDS compatible
1. Pip-installs the RAPIDS libraries, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuxFilter
  1. cuCIM
  1. xgboost

# Controlling Which RAPIDS Version is Installed
This line in the cell below, `!python rapidsai-csp-utils/colab/pip-install.py`, kicks off the RAPIDS installation script.  You can control which RAPIDS version is installed by appending `latest` or `nightlies`, or by leaving the option blank for the default.  Example:

`!python rapidsai-csp-utils/colab/pip-install.py <option>`

You can now tell the script to install:
1. **RAPIDS + Colab default version**: leave the install script option blank (or give an invalid option). This adds the rest of the RAPIDS libraries to the RAPIDS cuDF library preinstalled on Colab.  **This is the default and recommended version.**  Example: `!python rapidsai-csp-utils/colab/pip-install.py`
1. **Latest known working RAPIDS stable version**: use the option `latest` to upgrade all RAPIDS libraries to the latest working RAPIDS stable version.  This is usually early access for future RAPIDS+Colab functionality. Some functionality may not work, and this version can be the same as the default version. Example: `!python rapidsai-csp-utils/colab/pip-install.py latest`
1. **Current nightlies version**: use the option `nightlies` to install the current RAPIDS nightlies version.  For RAPIDS developer use; **not recommended/untested**.  Example: `!python rapidsai-csp-utils/colab/pip-install.py nightlies`


**This will complete in about 5-6 minutes**
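
The three choices above can be summarized in a small helper. This is only a sketch: `build_install_command` and `VALID_OPTIONS` are hypothetical names introduced here, and the fallback for invalid options mirrors the behavior described in the list above rather than anything read from the install script itself.

```python
VALID_OPTIONS = {"", "latest", "nightlies"}  # "" selects the default install


def build_install_command(option=""):
    """Return the notebook shell command for the requested RAPIDS version."""
    option = option.strip().lower()
    if option not in VALID_OPTIONS:
        # The text above says an invalid option behaves like the default,
        # so normalise it here to make that fallback explicit.
        option = ""
    command = "python rapidsai-csp-utils/colab/pip-install.py"
    return f"{command} {option}".strip()


for choice in ("", "latest", "nightlies"):
    print(f"!{build_install_command(choice)}")
```

Paste the printed line you want into the install cell in place of the default command.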

In [2]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 587, done.
remote: Counting objects: 100% (153/153), done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 587 (delta 122), reused 85 (delta 82), pack-reused 434 (from 3)
Receiving objects: 100% (587/587), 193.00 KiB | 4.49 MiB/s, done.
Resolving deltas: 100% (296/296), done.
Installing RAPIDS remaining 25.02 libraries
error: a value is required for '--prerelease <PRERELEASE>' but none was supplied
  [possible values: disallow, allow, if-necessary, explicit, if-necessary-or-explicit]

For more information, try '--help'.

        ***********************************************************************
        The pip install of RAPIDS is complete.

        Please do not run any further installation from the conda based installation methods, as they may cause issues!

        Please ensure that you're pulling from the git repo to remain updated with the latest working install scripts.

        Tro

# RAPIDS is now installed on Colab.  
You can copy your code into the cells below, or run them as they are to validate your RAPIDS installation and version.  
# Enjoy!

In [1]:
import cudf
cudf.__version__

'25.02.01'

In [2]:
import cuml
cuml.__version__

'25.02.01'
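
To check several libraries at once instead of one cell per library, a sketch along these lines may help. The helper name `report_versions` is hypothetical, and it assumes each library exposes a `__version__` attribute, as `cudf` and `cuml` do in the cells above.

```python
import importlib


def report_versions(module_names):
    """Map each module name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Import can fail because the package is missing or, for GPU
            # libraries, because no compatible GPU is attached.
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in report_versions(["cudf", "cuml", "cugraph"]).items():
    print(f"{name}: {version if version is not None else 'not available'}")
```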

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os
from pathlib import Path
import re
import warnings

# # Import GPU-enabled GridSearchCV from cuML
# from cuml.model_selection import GridSearchCV
# from sklearn.model_selection import StratifiedKFold  # still using scikit-learn CV splitter

# # For GPU-based scoring you can usually use the same metric.
# from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
#                              roc_auc_score, confusion_matrix, ConfusionMatrixDisplay,
#                              roc_curve)

# # XGBoost import (set tree_method to GPU)
# import xgboost as xgb

# # IMPORTANT: For GPU-accelerated models, import cuML estimators where possible.
# from cuml.linear_model import LogisticRegression as cuLogisticRegression
# from cuml.svm import SVC as cuSVC
# from cuml.ensemble import RandomForestClassifier as cuRFClassifier
# # (cuML does not yet have a DecisionTreeClassifier, so you might keep the scikit-learn version for that or omit.)
# from sklearn.tree import DecisionTreeClassifier

# # Ignore some warnings for cleaner output
# warnings.filterwarnings("ignore", category=UserWarning)
# warnings.filterwarnings("ignore", category=FutureWarning)
# warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)



In [5]:
# --- Configuration ---
MERGED_DATA_DIR = Path("/content/drive/MyDrive/merged_datasets_csv") # Directory with train/val/test_dataset.csv
OUTPUT_RESULTS_DIR = Path("/content/drive/MyDrive/classification_results_v2") # V2 for new results
OUTPUT_MODELS_DIR = OUTPUT_RESULTS_DIR / "models"
OUTPUT_PLOTS_DIR = OUTPUT_RESULTS_DIR / "plots"

# Define file names
TRAIN_FILE = MERGED_DATA_DIR / "train_dataset.csv"
VAL_FILE = MERGED_DATA_DIR / "val_dataset.csv"
TEST_FILE = MERGED_DATA_DIR / "test_dataset.csv"

# Target variable
TARGET_COLUMN = 'forged'

# Original + Annotation Features (Base)
BASE_FEATURES = [
    'prod_price', 'prod_qty', 'prod_amt', 'total', 'amt_paid',
    'change', 'tax', 'discount',
    'digital annotation', 'handwritten annotation'
]

# Models to train
MODELS_TO_TRAIN = [
    'LogisticRegression',
    'SVC',
    'RandomForestClassifier',
    'DecisionTreeClassifier',
    'XGBClassifier'
]

# Create output directories
OUTPUT_RESULTS_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_MODELS_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_PLOTS_DIR.mkdir(parents=True, exist_ok=True)
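
Before loading, it can be worth confirming that the three CSV files actually exist on the mounted drive. The sketch below shows the idea in a temporary directory standing in for `MERGED_DATA_DIR`; `find_missing` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path
import tempfile


def find_missing(paths):
    """Return the subset of paths that do not exist on disk."""
    return [path for path in paths if not Path(path).exists()]


# Demonstration in a temporary directory standing in for MERGED_DATA_DIR.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "train_dataset.csv").write_text("forged\n0\n")
    expected = [data_dir / name for name in
                ("train_dataset.csv", "val_dataset.csv", "test_dataset.csv")]
    missing_names = [path.name for path in find_missing(expected)]

print("Missing files:", missing_names)
```

In the notebook itself, the equivalent call would be `find_missing([TRAIN_FILE, VAL_FILE, TEST_FILE])`.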

In [6]:
# --- Feature Engineering Function ---

def feature_engineer(df):
    """Creates new features from existing columns."""
    print("  Starting Feature Engineering...")
    df_eng = df.copy()

    # --- 1. Datetime Feature Extraction ---
    # Ensure 'datetime' column exists and is datetime type
    if 'datetime' in df_eng.columns and pd.api.types.is_datetime64_any_dtype(df_eng['datetime']):
        print("    Extracting datetime features...")
        dt_col = df_eng['datetime']
        df_eng['hour'] = dt_col.dt.hour
        df_eng['minute'] = dt_col.dt.minute
        df_eng['day_of_week'] = dt_col.dt.dayofweek # Monday=0, Sunday=6
        df_eng['day_of_year'] = dt_col.dt.dayofyear
        df_eng['month'] = dt_col.dt.month
        df_eng['year'] = dt_col.dt.year
        df_eng['is_weekend'] = (df_eng['day_of_week'] >= 5).astype(int) # Sat/Sun

        # Cyclical Encoding (preserves closeness, e.g., Dec is close to Jan)
        print("    Applying cyclical encoding...")
        df_eng['hour_sin'] = np.sin(2 * np.pi * df_eng['hour'] / 24)
        df_eng['hour_cos'] = np.cos(2 * np.pi * df_eng['hour'] / 24)
        df_eng['day_of_week_sin'] = np.sin(2 * np.pi * df_eng['day_of_week'] / 7)
        df_eng['day_of_week_cos'] = np.cos(2 * np.pi * df_eng['day_of_week'] / 7)
        df_eng['month_sin'] = np.sin(2 * np.pi * df_eng['month'] / 12)
        df_eng['month_cos'] = np.cos(2 * np.pi * df_eng['month'] / 12)

        # Drop original simple time features if cyclical are used and preferred
        # df_eng = df_eng.drop(columns=['hour', 'day_of_week', 'month'])

    else:
        print("    'datetime' column not found or not datetime type. Skipping datetime features.")


    # --- 2. Financial Calculation Features ---
    print("    Calculating financial features...")
    # Ensure required columns are numeric first, coercing errors
    num_cols = ['prod_price', 'prod_qty', 'prod_amt', 'total', 'amt_paid', 'change', 'tax', 'discount']
    for col in num_cols:
        if col in df_eng.columns:
            df_eng[col] = pd.to_numeric(df_eng[col], errors='coerce')
        else:
            print(f"      Warning: Column '{col}' needed for financial calcs not found.")
            # Create an all-NaN placeholder column so subsequent steps do not fail immediately
            df_eng[col] = np.nan

    # Row-wise product check: prod_amt vs (price * qty)
    # NaNs in any operand propagate to the result and are imputed during preprocessing
    df_eng['prod_calc_diff'] = df_eng['prod_amt'] - (df_eng['prod_price'] * df_eng['prod_qty'])

    # Row-wise payment check: total vs (amt_paid - change)
    df_eng['payment_check'] = df_eng['amt_paid'] - df_eng['change']
    df_eng['payment_vs_total_diff'] = df_eng['total'] - df_eng['payment_check']

    # Group-wise calculations (per invoice/file)
    if 'file_name' in df_eng.columns:
        print("    Calculating group-wise features (per file_name)...")
        df_eng['sum_prod_amt_per_invoice'] = df_eng.groupby('file_name')['prod_amt'].transform('sum')
        df_eng['product_count_per_invoice'] = df_eng.groupby('file_name')['prod_name'].transform('count') # Assumes prod_name exists

        # Difference between declared total and sum of product amounts
        df_eng['total_vs_sum_prod_diff'] = df_eng['total'] - df_eng['sum_prod_amt_per_invoice']
    else:
        print("      Warning: 'file_name' column not found. Skipping group-wise features.")
        df_eng['sum_prod_amt_per_invoice'] = np.nan
        df_eng['product_count_per_invoice'] = np.nan
        df_eng['total_vs_sum_prod_diff'] = np.nan

    # Ratios (handle division by zero -> replace inf with 0 or NaN then impute)
    print("    Calculating financial ratios...")
    # Use a small epsilon to avoid division by zero exactly
    epsilon = 1e-6
    df_eng['tax_ratio'] = df_eng['tax'] / (df_eng['total'] + epsilon)
    df_eng['discount_ratio'] = df_eng['discount'] / (df_eng['total'] + epsilon)
    df_eng['change_ratio'] = df_eng['change'] / (df_eng['amt_paid'] + epsilon)

    # Replace potential infinities resulting from division by ~zero
    df_eng.replace([np.inf, -np.inf], np.nan, inplace=True)


    # --- 3. Indicator Flags ---
    print("    Creating indicator flags...")
    df_eng['is_discount_applied'] = (df_eng['discount'].fillna(0) > 0).astype(int)
    df_eng['is_tax_applied'] = (df_eng['tax'].fillna(0) > 0).astype(int)
    df_eng['is_change_given'] = (df_eng['change'].fillna(0) > 0).astype(int)
    df_eng['prod_price_is_zero'] = (df_eng['prod_price'].fillna(0) == 0).astype(int)
    df_eng['prod_qty_is_one'] = (df_eng['prod_qty'].fillna(0) == 1).astype(int)
    # Check if payment doesn't match total (using the calculated check)
    # Check for small discrepancies using np.isclose due to potential float issues
    df_eng['payment_mismatch_flag'] = (~np.isclose(df_eng['total'].fillna(0), df_eng['payment_check'].fillna(0))).astype(int)


    # --- 4. Interaction Features ---
    # Example: interaction between annotations
    if 'digital annotation' in df_eng.columns and 'handwritten annotation' in df_eng.columns:
        df_eng['digital_and_handwritten'] = df_eng['digital annotation'] * df_eng['handwritten annotation']


    print("  Feature Engineering finished.")
    return df_eng
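
The cyclical encoding used in `feature_engineer` can be checked on a toy example. The sketch below uses a hypothetical helper, `cyclical_encode`, that applies the same sin/cos formula to hours and compares distances between encoded points.

```python
import numpy as np


def cyclical_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


hours = cyclical_encode([0, 12, 23], period=24)
dist_23_to_0 = np.linalg.norm(hours[2] - hours[0])   # one hour apart on the clock
dist_12_to_0 = np.linalg.norm(hours[1] - hours[0])   # opposite sides of the clock

print(f"distance 23h -> 0h: {dist_23_to_0:.3f}")
print(f"distance 12h -> 0h: {dist_12_to_0:.3f}")
```

On the raw `hour` column, 23 and 0 differ by 23, which suggests they are far apart. After encoding, 23h sits much closer to 0h than 12h does, which is the closeness the comment in the function refers to.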


In [7]:
# --- Helper Functions (Evaluation & Plotting - similar to previous script) ---

def evaluate_model(model, X_train, y_train, X_val, y_val, X_test, y_test):
    """Evaluates model on train, validation, and test sets."""
    results = {}
    print("      Evaluating model...")
    for name, X, y in [('Train', X_train, y_train), ('Validation', X_val, y_val), ('Test', X_test, y_test)]:
        if X is None or y is None:
            print(f"      Skipping evaluation for {name} set (data not available).")
            continue
        y_pred = model.predict(X)
        try:
            y_prob = model.predict_proba(X)[:, 1]
            roc_auc = roc_auc_score(y, y_prob)
        except (AttributeError, NotImplementedError):
            y_prob = None
            roc_auc = None
             # print(f"      Note: ROC AUC not available for {name} set.") # Less verbose

        results[name] = {
            'Accuracy': accuracy_score(y, y_pred),
            'Precision': precision_score(y, y_pred, zero_division=0),
            'Recall': recall_score(y, y_pred, zero_division=0),
            'F1 Score': f1_score(y, y_pred, zero_division=0),
            'ROC AUC': roc_auc
        }
        # Less verbose output during evaluation loop
        # print(f"    {name} Metrics: Acc={results[name]['Accuracy']:.4f}, P={results[name]['Precision']:.4f}, R={results[name]['Recall']:.4f}, F1={results[name]['F1 Score']:.4f}, ROC AUC={results[name]['ROC AUC'] if roc_auc is not None else 'N/A'}")
    print("      Evaluation complete.")
    return results

def plot_confusion_matrix(y_true, y_pred, model_name, stage, save_dir):
    """Plots and saves the confusion matrix."""
    try:
        cm = confusion_matrix(y_true, y_pred)
        disp = ConfusionMatrixDisplay(confusion_matrix=cm)
        fig, ax = plt.subplots(figsize=(6, 6))
        disp.plot(ax=ax, cmap='Blues', colorbar=False)
        ax.set_title(f'{model_name} - Confusion Matrix ({stage} Set)')
        plt.tight_layout()
        filepath = save_dir / f"{model_name}_confusion_matrix_{stage.lower()}.png"
        plt.savefig(filepath)
        # print(f"      Saved confusion matrix to {filepath}")
        plt.close(fig)
    except Exception as e:
        print(f"      [ERROR] Failed to plot confusion matrix for {model_name} ({stage}): {e}")


def plot_feature_importance(model, feature_names, model_name, save_dir, top_n=25):
    """Plots and saves top_n feature importances for tree-based models."""
    if not hasattr(model, 'feature_importances_'):
        print(f"      Feature importance not available for {model_name}.")
        return

    try:
        importances = model.feature_importances_
        indices = np.argsort(importances)[::-1]
        top_indices = indices[:top_n]

        plt.figure(figsize=(10, max(6, len(top_indices) // 2)))
        plt.title(f"{model_name} - Top {top_n} Feature Importance")
        plt.barh(range(len(top_indices)), importances[top_indices], align='center')
        plt.yticks(range(len(top_indices)), [feature_names[i] for i in top_indices])
        plt.xlabel('Importance')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        filepath = save_dir / f"{model_name}_feature_importance.png"
        plt.savefig(filepath)
        # print(f"      Saved feature importance plot to {filepath}")
        plt.close()
    except Exception as e:
        print(f"      [ERROR] Failed to plot feature importance for {model_name}: {e}")
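
The ranking step inside `plot_feature_importance` is easier to follow on a small example. The feature names below come from this notebook, but the importance values are made up purely for illustration.

```python
import numpy as np

feature_names = ["total", "tax_ratio", "hour_sin", "discount"]
importances = np.array([0.10, 0.55, 0.05, 0.30])  # made-up values for illustration

top_n = 2
order = np.argsort(importances)[::-1]      # indices from most to least important
top_indices = order[:top_n]
top_features = [(feature_names[i], float(importances[i])) for i in top_indices]

print(top_features)
```

`np.argsort` sorts ascending, so reversing it with `[::-1]` puts the largest importance first before the slice keeps the top `top_n` entries.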



In [8]:
# --- Main Script ---

# 1. Load Data
print("--- Loading Data ---")
try:
    df_train_raw = pd.read_csv(TRAIN_FILE, low_memory=False) # Use low_memory=False if mixed types cause issues
    df_val_raw = pd.read_csv(VAL_FILE, low_memory=False)
    df_test_raw = pd.read_csv(TEST_FILE, low_memory=False)
    print(f"Raw data loaded: Train={df_train_raw.shape}, Val={df_val_raw.shape}, Test={df_test_raw.shape}")
except FileNotFoundError as e:
    print(f"[FATAL ERROR] Could not load data files: {e}")
    raise  # Stop this cell; exit() can shut down the notebook kernel
except Exception as e:
    print(f"[FATAL ERROR] Error loading data: {e}")
    raise

# Convert date/datetime columns *before* feature engineering
print("\n--- Converting Datetime Columns ---")
for df in [df_train_raw, df_val_raw, df_test_raw]:
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'], errors='coerce') # Handles NaT
    # Assuming 'datetime' was created correctly before, ensure it's dt type
    if 'datetime' in df.columns:
        df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')




--- Loading Data ---
Raw data loaded: Train=(1703, 22), Val=(499, 22), Test=(604, 22)

--- Converting Datetime Columns ---
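
The conversion step relies on `errors='coerce'`, which turns unparseable values into `NaT` instead of raising. A minimal sketch with made-up input values:

```python
import pandas as pd

raw = pd.Series(["2024-01-15 09:30:00", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print("Unparseable or missing rows:", int(parsed.isna().sum()))
print("Is datetime dtype:", pd.api.types.is_datetime64_any_dtype(parsed))
```

This matters downstream: `feature_engineer` only extracts datetime features when the column passes `is_datetime64_any_dtype`, and rows holding `NaT` yield NaN for `hour`, `month`, and the other derived columns, which the imputer later fills.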


In [9]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# 2. Feature Engineering
print("\n--- Feature Engineering ---")
df_train_eng = feature_engineer(df_train_raw)
df_val_eng = feature_engineer(df_val_raw)
df_test_eng = feature_engineer(df_test_raw)

# Define FINAL feature list after engineering
engineered_features = [
    # Datetime basic
    'hour', 'minute', 'day_of_week', 'day_of_year', 'month', 'year', 'is_weekend',
    # Datetime cyclical
    'hour_sin', 'hour_cos', 'day_of_week_sin', 'day_of_week_cos', 'month_sin', 'month_cos',
    # Financial Calculations
    'prod_calc_diff', 'payment_check', 'payment_vs_total_diff',
    'sum_prod_amt_per_invoice', 'product_count_per_invoice', 'total_vs_sum_prod_diff',
    # Ratios
    'tax_ratio', 'discount_ratio', 'change_ratio',
    # Indicators
    'is_discount_applied', 'is_tax_applied', 'is_change_given',
    'prod_price_is_zero', 'prod_qty_is_one', 'payment_mismatch_flag',
    # Interactions
    'digital_and_handwritten'
]

# Combine base features with newly engineered ones (only those that actually exist)
FINAL_FEATURE_COLUMNS = BASE_FEATURES + [feat for feat in engineered_features if feat in df_train_eng.columns]
# Remove duplicates just in case
FINAL_FEATURE_COLUMNS = sorted(set(FINAL_FEATURE_COLUMNS))

print(f"\nFinal features selected ({len(FINAL_FEATURE_COLUMNS)}): {', '.join(FINAL_FEATURE_COLUMNS)}")

# 3. Preprocessing (Imputation & Scaling on FINAL features)
print("\n--- Preprocessing Final Features ---")

# Select final features and target
X_train_eng = df_train_eng[FINAL_FEATURE_COLUMNS].copy()
y_train = df_train_eng[TARGET_COLUMN].copy().astype(int)

X_val_eng = df_val_eng[FINAL_FEATURE_COLUMNS].copy()
y_val = df_val_eng[TARGET_COLUMN].copy().astype(int)

X_test_eng = df_test_eng[FINAL_FEATURE_COLUMNS].copy()
y_test = df_test_eng[TARGET_COLUMN].copy().astype(int)


# Imputation (fit on train only)
print("  Fitting imputer (strategy='median')...")
imputer = SimpleImputer(strategy='median', keep_empty_features=False) # keep_empty_features needs scikit-learn >= 1.2; with False, all-NaN columns are dropped
try:
    X_train_imputed = imputer.fit_transform(X_train_eng)
    X_val_imputed = imputer.transform(X_val_eng)
    X_test_imputed = imputer.transform(X_test_eng)
    print("  Imputation successful.")
except Exception as e:
    print(f"  [ERROR] Imputation failed: {e}. Check for non-numeric data or all-NaN columns in FINAL_FEATURE_COLUMNS.")
    # To debug, inspect the offending columns:
    # print(X_train_eng.dtypes)
    # print(X_train_eng.isna().sum())
    raise  # Stop this cell; exit() can shut down the notebook kernel


# Scaling (fit on train only)
print("  Fitting scaler (StandardScaler)...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)
X_test_scaled = scaler.transform(X_test_imputed)
print("  Scaling successful.")


# Convert back to DataFrames for easier handling later (optional but good for feature names)
X_train_processed = pd.DataFrame(X_train_scaled, columns=FINAL_FEATURE_COLUMNS)
X_val_processed = pd.DataFrame(X_val_scaled, columns=FINAL_FEATURE_COLUMNS)
X_test_processed = pd.DataFrame(X_test_scaled, columns=FINAL_FEATURE_COLUMNS)

# Check for issues after final preprocessing
print(f"\nShapes after preprocessing: X_train={X_train_processed.shape}, X_val={X_val_processed.shape}, X_test={X_test_processed.shape}")
print(f"NaNs in X_train: {X_train_processed.isna().sum().sum()}, X_val: {X_val_processed.isna().sum().sum()}, X_test: {X_test_processed.isna().sum().sum()}")



--- Feature Engineering ---
  Starting Feature Engineering...
    Extracting datetime features...
    Applying cyclical encoding...
    Calculating financial features...
    Calculating group-wise features (per file_name)...
    Calculating financial ratios...
    Creating indicator flags...
  Feature Engineering finished.
  Starting Feature Engineering...
    Extracting datetime features...
    Applying cyclical encoding...
    Calculating financial features...
    Calculating group-wise features (per file_name)...
    Calculating financial ratios...
    Creating indicator flags...
  Feature Engineering finished.
  Starting Feature Engineering...
    Extracting datetime features...
    Applying cyclical encoding...
    Calculating financial features...
    Calculating group-wise features (per file_name)...
    Calculating financial ratios...
    Creating indicator flags...
  Feature Engineering finished.

Final features selected (39): amt_paid, change, change_ratio, day_of_week, day_
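
The "fit on train only" pattern used for imputation and scaling can be seen on a tiny made-up array. The sketch below uses a single feature column so the learned statistics are easy to verify by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0], [5.0], [np.nan]])
X_val = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)   # median of 1, 3, 5 is learned here
X_val_imp = imputer.transform(X_val)           # validation NaN gets the training median

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imp)
X_val_scaled = scaler.transform(X_val_imp)     # scaled with training mean and std

print("Learned median:", imputer.statistics_)
print("Validation after imputation:", X_val_imp.ravel())
print("Validation after scaling:", X_val_scaled.ravel())
```

The validation outlier (100.0) has no influence on the median, mean, or standard deviation, so no information from the validation or test sets leaks into preprocessing.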

In [None]:
# Required imports (ensure these are present)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os
from pathlib import Path
import re
import warnings
import traceback # For detailed error printing

# Scikit-learn imports
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC # Replaced by cuML/fallback
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, ConfusionMatrixDisplay, roc_curve
)

# XGBoost import
import xgboost as xgb

# --- cuML Import ---
try:
    from cuml.svm import SVC as cumlSVC
    print("Successfully imported cuML SVC.")
    CUML_AVAILABLE = True
except ImportError:
    print("WARNING: cuML library not found. Falling back to scikit-learn SVC (CPU).")
    print("         To use GPU acceleration for SVC, enable GPU and install RAPIDS cuML.")
    from sklearn.svm import SVC as sklearnSVC # Fallback import
    CUML_AVAILABLE = False

# --- Helper Functions ---
# Assume evaluate_model, plot_confusion_matrix, plot_feature_importance are defined correctly

# --- Placeholder Data / ** ACTUAL DATA LOADING AND PREPROCESSING ** ---
# !!!!! CRITICAL !!!!!
# !!! Make sure the following placeholder data is REPLACED with your actual
# !!! data loading and preprocessing steps that generate:
# !!! X_train_processed, y_train, X_val_processed, y_val, X_test_processed, y_test
# !!! and the FINAL_FEATURE_COLUMNS list.
# !!! Using the placeholder data WILL lead to near-random results (~0.5 AUC).
# !!!!! CRITICAL !!!!!
print("INFO: Using placeholder data for demonstration. Replace with your actual loading and preprocessing.")
# Example shapes, replace with actual shapes after your preprocessing
n_train_samples, n_features = 1000, 45 # Example feature count from previous steps
n_val_samples, n_test_samples = 200, 200
X_train_processed = np.random.rand(n_train_samples, n_features)
X_val_processed = np.random.rand(n_val_samples, n_features)
X_test_processed = np.random.rand(n_test_samples, n_features)
y_train = np.random.randint(0, 2, n_train_samples)
y_val = np.random.randint(0, 2, n_val_samples)
y_test = np.random.randint(0, 2, n_test_samples)
FINAL_FEATURE_COLUMNS = [f'feat_{i}' for i in range(n_features)]
# --- End Placeholder / ACTUAL DATA ---


# 4. Model Training, Tuning, and Evaluation (Modified for cuML SVC and XGBoost GPU)
print("\n--- Model Training and Evaluation Loop (with cuML SVC option & XGBoost GPU fix) ---")

# Calculate scale_pos_weight for XGBoost
pos_weight = (y_train == 0).sum() / (y_train == 1).sum() if (y_train == 1).sum() > 0 else 1
print(f"Calculated scale_pos_weight for XGBoost: {pos_weight:.2f}")


# Define models and parameter grids
models = {
    'LogisticRegression': ( # No changes needed here for warnings
        LogisticRegression(
            solver='saga', random_state=42, class_weight='balanced',
            max_iter=5000, n_jobs=-1
            ),
        {
            'C': [0.01, 0.1, 1, 10, 100],
            'penalty': ['l1', 'l2', 'elasticnet'],
            'l1_ratio': [0.2, 0.5, 0.8]
        }
    ),
    'SVC': (
        # Use cuML SVC if available, otherwise fallback to scikit-learn SVC
        cumlSVC(
            # random_state=42, # Removed as per warning for probabilistic SVC
            class_weight='balanced',
            probability=False,      # <<< FIX: Set probability=False for cuML binary SVC with class_weight
            cache_size=500,
            max_iter=-1
            ) if CUML_AVAILABLE else \
        sklearnSVC( # Fallback to sklearn SVC (can keep probability=True here)
            random_state=42, class_weight='balanced', probability=True,
            cache_size=500
            ),
        # Parameter grid (ensure params are valid for the chosen SVC type)
        {
            'C': [0.1, 1, 10, 50],
            'gamma': ['scale', 'auto', 0.01, 0.1],
            'kernel': ['rbf', 'linear'] # Ensure these kernels are supported by cuML SVC
            # 'degree': [2, 3], # Only for kernel='poly'
            # 'coef0': [0.0, 0.1] # For 'poly' and 'sigmoid'
        }
    ),
    'RandomForestClassifier': (
        RandomForestClassifier(random_state=42, class_weight='balanced_subsample', n_jobs=-1),
        {
            'n_estimators': [100, 200, 300],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 3, 5],
            'max_features': ['sqrt', 'log2', 0.5]
        }
    ),
    'DecisionTreeClassifier': (
        DecisionTreeClassifier(random_state=42, class_weight='balanced'),
        {
            'criterion': ['gini', 'entropy'],
            'max_depth': [None, 10, 20, 30, 50],
            'min_samples_split': [2, 5, 10, 20],
            'min_samples_leaf': [1, 3, 5, 10]
        }
    ),
    'XGBClassifier': (
        xgb.XGBClassifier(
            objective='binary:logistic',
            eval_metric='logloss',
            # use_label_encoder=False, # <<< FIX: Removed deprecated parameter
            random_state=42,
            scale_pos_weight=pos_weight,
            n_jobs=1, # Usually set to 1 when device='cuda'
            tree_method='hist',        # <<< FIX: Use 'hist'
            device='cuda' if CUML_AVAILABLE else 'cpu' # <<< FIX: Specify device='cuda'
            ),
        { # Parameter grid remains the same
            'n_estimators': [100, 200, 300],
            'learning_rate': [0.01, 0.05, 0.1, 0.2],
            'max_depth': [3, 5, 7, 9],
            'subsample': [0.7, 0.8, 0.9],
            'colsample_bytree': [0.7, 0.8, 0.9]
        }
    )
}

all_results = {}
best_models = {}
roc_data_test = {}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# --- Training Loop (with n_jobs adjustment logic) ---
for name in MODELS_TO_TRAIN:
    if name not in models:
        print(f"Model '{name}' not defined. Skipping.")
        continue

    is_cuml_svc = name == 'SVC' and CUML_AVAILABLE
    # Check the actual device setting for XGBoost instance
    is_xgb_gpu = name == 'XGBClassifier' and models[name][0].get_params().get('device') == 'cuda'

    print(f"\n===== Training {name} {'(GPU with cuML)' if is_cuml_svc else '(GPU with XGBoost)' if is_xgb_gpu else '(CPU)'} =====")
    model_instance, param_grid = models[name]

    # Handle conditional parameters if necessary (example for LogReg)
    current_param_grid = param_grid
    if name == 'LogisticRegression':
        # Build one sub-grid per penalty so only valid solver/penalty combinations are searched
        current_param_grid = []
        for penalty in param_grid['penalty']:
            grid = {'C': param_grid['C'], 'penalty': [penalty]}
            if penalty == 'elasticnet':
                grid['l1_ratio'] = param_grid['l1_ratio']
                grid['solver'] = ['saga']
            elif penalty == 'l1':
                grid['solver'] = ['liblinear', 'saga']
            current_param_grid.append(grid)
        print(f"  Using specific param grid for {name}")


    # Set n_jobs for GridSearchCV
    gridsearch_n_jobs = 1 if is_cuml_svc or is_xgb_gpu else -1
    print(f"  Using n_jobs={gridsearch_n_jobs} for GridSearchCV.")


    print(f"  Performing GridSearchCV (CV={cv.get_n_splits()}, scoring='roc_auc')...")
    grid_search = GridSearchCV(
        estimator=model_instance,
        param_grid=current_param_grid,
        cv=cv,
        scoring='roc_auc',
        n_jobs=gridsearch_n_jobs,
        verbose=1,
        error_score='raise' # Set error_score='raise' to get full traceback on failure
    )

    try:
        grid_search.fit(X_train_processed, y_train) # Use processed data (NumPy arrays)
        print(f"\n  Best Params for {name}: {grid_search.best_params_}")
        print(f"  Best CV ROC AUC Score: {grid_search.best_score_:.4f}")
        best_model = grid_search.best_estimator_
        best_models[name] = best_model

        # Evaluate
        print(f"  Evaluating best {name} model on all sets...")
        model_results = evaluate_model(best_model,
                                       X_train_processed, y_train,
                                       X_val_processed, y_val,
                                       X_test_processed, y_test)
        all_results[name] = model_results
        test_results = model_results.get('Test', {})
        # Format metrics to 4 decimals; missing or None values (e.g., ROC AUC for cuML SVC with prob=False) print as N/A
        fmt = lambda v: f"{v:.4f}" if v is not None else "N/A"
        roc_auc_test = test_results.get('ROC AUC')
        roc_auc_test_str = fmt(roc_auc_test)
        print(f"  Test Set Performance: Acc={fmt(test_results.get('Accuracy'))}, P={fmt(test_results.get('Precision'))}, R={fmt(test_results.get('Recall'))}, F1={fmt(test_results.get('F1 Score'))}, ROC AUC={roc_auc_test_str}")


        # Generate Plots for Test Set
        print(f"  Generating plots for {name} (Test Set)...")
        if X_test_processed is not None and y_test is not None:
             y_pred_test = best_model.predict(X_test_processed)
             plot_confusion_matrix(y_test, y_pred_test, name, "Test", OUTPUT_PLOTS_DIR)
             plot_feature_importance(best_model, FINAL_FEATURE_COLUMNS, name, OUTPUT_PLOTS_DIR)

             # Store ROC data only if probabilities are available
             if roc_auc_test is not None:
                 try:
                     # Ensure predict_proba exists before calling
                     if hasattr(best_model, "predict_proba"):
                         y_prob_test = best_model.predict_proba(X_test_processed)[:, 1]
                         fpr, tpr, _ = roc_curve(y_test, y_prob_test)
                         roc_data_test[name] = {'fpr': fpr, 'tpr': tpr, 'auc': roc_auc_test}
                     else:
                         print(f"      Skipping ROC data storage for {name}: predict_proba not available.")
                 except Exception as roc_err:
                      print(f"      Warning: Could not get probabilities or calculate ROC curve for {name}: {roc_err}")
             else:
                 print(f"      Skipping ROC data storage for {name}: ROC AUC was N/A.")
        else:
            print("      Skipping test set plots as test data is not available.")


        # Save Model
        model_filename = OUTPUT_MODELS_DIR / f"{name}_best_model.joblib"
        try:
            joblib.dump(best_model, model_filename)
            print(f"  Saved best model to {model_filename}")
        except Exception as save_err:
            print(f"  [ERROR] Could not save model {name} using joblib: {save_err}")
            print(f"      Model object type: {type(best_model)}")


    except Exception as train_err:
        print(f"  [ERROR] Failed to tune, train or evaluate {name}: {train_err}")
        traceback.print_exc() # Print detailed error


# --- Saving Results Summary and Combined ROC Plot ---
# Both sections below tolerate a missing (None) ROC AUC value.

# 5. Save Results Summary
print("\n--- Saving Results Summary ---")
results_df_list = []
for model_name, stages in all_results.items():
    for stage_name, metrics in stages.items():
        metrics_copy = metrics.copy()
        metrics_copy['Model'] = model_name
        metrics_copy['Stage'] = stage_name
        results_df_list.append(metrics_copy)

if results_df_list:
    results_df = pd.DataFrame(results_df_list)
    cols_order = ['Model', 'Stage', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']
    results_df = results_df[[col for col in cols_order if col in results_df.columns]]
    # Coerce ROC AUC to float; missing values (None or "N/A") become NaN.
    # errors='ignore' is deprecated in recent pandas releases, so use 'coerce'.
    if 'ROC AUC' in results_df.columns:
        results_df['ROC AUC'] = pd.to_numeric(results_df['ROC AUC'], errors='coerce')
    results_filepath = OUTPUT_RESULTS_DIR / "all_models_evaluation_summary_v2.csv"
    try:
        results_df.to_csv(results_filepath, index=False)
        print(f"Evaluation summary saved to {results_filepath}")
    except Exception as e:
        print(f"[ERROR] Could not save results summary: {e}")
else:
     print("No model results generated to save.")

# 6. Combined ROC Curve Plot
print("\n--- Generating Combined ROC Curve Plot ---")
plt.figure(figsize=(10, 8))
plot_count = 0
for name, data in roc_data_test.items():
    # Check if 'auc' key exists and is not None
    if data.get('auc') is not None:
        plt.plot(data['fpr'], data['tpr'], lw=2, label=f"{name} (AUC = {data['auc']:.3f})")
        plot_count += 1

if plot_count > 0:
    plt.plot([0, 1], [0, 1], 'k--', label='Chance Level (AUC = 0.500)')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves - Test Set Comparison')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.tight_layout()
    roc_comp_filepath = OUTPUT_PLOTS_DIR / "combined_roc_curves_test_set_v2.png"
    try:
        plt.savefig(roc_comp_filepath)
        print(f"Combined ROC plot saved to {roc_comp_filepath}")
    except Exception as e:
        print(f"[ERROR] Could not save combined ROC plot: {e}")
else:
    print("No models with valid ROC AUC data to plot.")

plt.close()


print("\n--- Script Finished ---")
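The loop above prints `N/A` when a metric such as ROC AUC is unavailable. A minimal sketch of a formatter that applies the same rule to every metric; `format_metric` and the sample `test_results` dict are hypothetical names used only for illustration:

```python
def format_metric(value, digits=4):
    """Return a fixed-precision string, or "N/A" when the metric is missing."""
    if value is None or isinstance(value, str):
        return "N/A"
    return f"{value:.{digits}f}"


# Hypothetical results for a model that exposes no probability estimates.
test_results = {"Accuracy": 0.8125, "F1 Score": 0.5, "ROC AUC": None}

summary = ", ".join(
    f"{metric}={format_metric(test_results.get(metric))}"
    for metric in ("Accuracy", "Precision", "F1 Score", "ROC AUC")
)
print(summary)
```

Keeping the fallback in one function means a missing key and an explicit `None` are reported the same way, and no format spec is ever applied to a string.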

In [10]:
# Required imports from previous context
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os
from pathlib import Path
import re
import warnings
import traceback # Added for better error printing

# Scikit-learn imports
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC # We will replace this with cuML's SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, ConfusionMatrixDisplay, roc_curve
)

# XGBoost import
import xgboost as xgb

# --- Add cuML Import ---
# IMPORTANT: Ensure a GPU runtime is enabled (Colab: Runtime > Change runtime type > GPU).
# You might need to install cuML if it's not pre-installed:
# !pip install cuml-cuXX --extra-index-url=https://pypi.nvidia.com # Replace XX with the CUDA major version (e.g., 11 or 12)
try:
    from cuml.svm import SVC as cumlSVC # Import cuML SVC, aliased to avoid name clash if needed elsewhere
    print("Successfully imported cuML SVC.")
    CUML_AVAILABLE = True
except ImportError:
    print("WARNING: cuML library not found. Falling back to scikit-learn SVC (CPU).")
    print("         To use GPU acceleration for SVC, enable GPU and install RAPIDS cuML.")
    from sklearn.svm import SVC as sklearnSVC # Fallback import
    CUML_AVAILABLE = False

# --- Helper Functions (evaluate_model, plot_confusion_matrix, plot_feature_importance) ---
# Assume these functions (evaluate_model, plot_confusion_matrix, plot_feature_importance)
# are defined as in the previous version of the code. They work with both
# scikit-learn and cuML models as long as the models adhere to the basic
# .predict(), .predict_proba(), .feature_importances_ (where applicable) API.

# --- Placeholder for Data Loading, Feature Engineering, Preprocessing ---
# Assume the following variables exist and contain the processed data as NumPy arrays:
# X_train_processed, y_train, X_val_processed, y_val, X_test_processed, y_test
# Also assume FINAL_FEATURE_COLUMNS (list of feature names) exists.
# --- Replace below with your actual data loading/preprocessing ---
print("INFO: Using placeholder data for demonstration. Replace with your actual loading and preprocessing.")
# Example shapes, replace with actual shapes after your preprocessing
n_train_samples, n_features = 1000, len(BASE_FEATURES) + 20 # Example feature count
n_val_samples, n_test_samples = 200, 200
X_train_processed = np.random.rand(n_train_samples, n_features)
X_val_processed = np.random.rand(n_val_samples, n_features)
X_test_processed = np.random.rand(n_test_samples, n_features)
y_train = np.random.randint(0, 2, n_train_samples)
y_val = np.random.randint(0, 2, n_val_samples)
y_test = np.random.randint(0, 2, n_test_samples)
FINAL_FEATURE_COLUMNS = [f'feat_{i}' for i in range(n_features)]
# --- End Placeholder Data ---


# 4. Model Training, Tuning, and Evaluation (Modified for cuML SVC)
print("\n--- Model Training and Evaluation Loop (with cuML SVC option) ---")

# Calculate scale_pos_weight for XGBoost
pos_weight = (y_train == 0).sum() / (y_train == 1).sum() if (y_train == 1).sum() > 0 else 1
print(f"Calculated scale_pos_weight for XGBoost: {pos_weight:.2f}")

# Define models and parameter grids
# --- Modified SVC entry ---
models = {
    'LogisticRegression': (
        LogisticRegression(
            solver='saga', random_state=42, class_weight='balanced',
            max_iter=5000, # Increased from previous iteration
            n_jobs=-1
            ),
        {
            'C': [0.01, 0.1, 1, 10, 100],
            'penalty': ['l1', 'l2', 'elasticnet'],
            'l1_ratio': [0.2, 0.5, 0.8]
        }
    ),
    'SVC': (
        # Use cuML SVC if available, otherwise fallback to scikit-learn SVC
        cumlSVC(
            random_state=42, class_weight='balanced', probability=True,
            cache_size=500, # cuML might use cache differently, check docs if optimizing
            max_iter=-1 # Often -1 (no limit) is default/recommended for cuML SVC
            ) if CUML_AVAILABLE else \
        sklearnSVC( # Fallback to sklearn SVC
            random_state=42, class_weight='balanced', probability=True,
            cache_size=500
            ),
        # --- Use the SAME parameter grid (check cuML compatibility for specific params) ---
        {
            'C': [0.1, 1, 10, 50], # Regularization parameter
            'gamma': ['scale', 'auto', 0.01, 0.1], # Kernel coefficient
            'kernel': ['rbf', 'linear'] # cuML SVC supports 'linear', 'poly', 'rbf', 'sigmoid'
            # 'degree': [2, 3], # Only for kernel='poly' - add if testing 'poly'
            # 'coef0': [0.0, 0.1] # For 'poly' and 'sigmoid'
        }
    ),
    'RandomForestClassifier': ( # Note: cuML also has cuml.ensemble.RandomForestClassifier
        RandomForestClassifier(random_state=42, class_weight='balanced_subsample', n_jobs=-1),
        {
            'n_estimators': [100, 200, 300],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 3, 5],
            'max_features': ['sqrt', 'log2', 0.5]
        }
    ),
    'DecisionTreeClassifier': (
        DecisionTreeClassifier(random_state=42, class_weight='balanced'),
        {
            'criterion': ['gini', 'entropy'],
            'max_depth': [None, 10, 20, 30, 50],
            'min_samples_split': [2, 5, 10, 20],
            'min_samples_leaf': [1, 3, 5, 10]
        }
    ),
    'XGBClassifier': (
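        # GPU training is requested here with tree_method='gpu_hist'. XGBoost 2.0 deprecated that
        # value (hence the warnings in the output below); the replacement is tree_method='hist'
        # with device='cuda'. use_label_encoder is also ignored by recent XGBoost releases.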
        xgb.XGBClassifier(
            objective='binary:logistic', eval_metric='logloss', use_label_encoder=False,
            random_state=42, scale_pos_weight=pos_weight, n_jobs=1, # Use n_jobs=1 if using GPU below
            tree_method='gpu_hist' if CUML_AVAILABLE else 'auto' # Enable XGBoost GPU if cuML is available (implies working GPU)
            ),
        {
            'n_estimators': [100, 200, 300],
            'learning_rate': [0.01, 0.05, 0.1, 0.2],
            'max_depth': [3, 5, 7, 9],
            'subsample': [0.7, 0.8, 0.9],
            'colsample_bytree': [0.7, 0.8, 0.9]
        }
    )
}

all_results = {}
best_models = {}
roc_data_test = {} # To store data for combined ROC plot

# Use Stratified K-Fold for cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# --- Loop Modifications ---
for name in MODELS_TO_TRAIN:
    if name not in models:
        print(f"Model '{name}' not defined. Skipping.")
        continue

    # Check if we are using cuML for this model (specifically SVC for now)
    is_cuml_svc = name == 'SVC' and CUML_AVAILABLE
    is_xgb_gpu = name == 'XGBClassifier' and models[name][0].get_params().get('tree_method') == 'gpu_hist'

    print(f"\n===== Training {name} {'(GPU with cuML)' if is_cuml_svc else '(GPU with XGBoost)' if is_xgb_gpu else '(CPU)'} =====")
    model_instance, param_grid = models[name]

    # --- Handle conditional parameters if necessary (example for LogReg) ---
    current_param_grid = param_grid # Default
    if name == 'LogisticRegression':
        current_param_grid = []
        # Build one sub-grid per penalty so only valid (penalty, solver, l1_ratio)
        # combinations are searched; 'l2' keeps the base instance's solver ('saga').
        for penalty in param_grid['penalty']:
            grid = {'C': param_grid['C'], 'penalty': [penalty]}
            if penalty == 'elasticnet':
                grid['l1_ratio'] = param_grid['l1_ratio']
                grid['solver'] = ['saga'] # Only saga supports elasticnet
            elif penalty == 'l1':
                grid['solver'] = ['liblinear', 'saga'] # Both solvers support l1
            current_param_grid.append(grid)
        print(f"  Using specific param grid for {name}")

    # --- Set n_jobs for GridSearchCV ---
    # Use n_jobs=1 for GPU models (cuML SVC, potentially XGBoost with gpu_hist)
    # Use n_jobs=-1 for CPU-bound scikit-learn models
    gridsearch_n_jobs = 1 if is_cuml_svc or is_xgb_gpu else -1
    print(f"  Using n_jobs={gridsearch_n_jobs} for GridSearchCV.")


    print(f"  Performing GridSearchCV (CV={cv.get_n_splits()}, scoring='roc_auc')...")
    grid_search = GridSearchCV(
        estimator=model_instance, # Uses cuML SVC instance if name=='SVC' and CUML_AVAILABLE
        param_grid=current_param_grid,
        cv=cv,
        scoring='roc_auc',
        n_jobs=gridsearch_n_jobs, # Use adjusted n_jobs
        verbose=1,
        # Consider adding error_score='raise' to debug issues within CV folds
        # error_score='raise'
    )

    try:
        # Pass NumPy arrays directly. cuML/XGBoost(GPU) should handle transfer.
        grid_search.fit(X_train_processed, y_train)
        print(f"\n  Best Params for {name}: {grid_search.best_params_}")
        print(f"  Best CV ROC AUC Score: {grid_search.best_score_:.4f}")
        best_model = grid_search.best_estimator_
        best_models[name] = best_model

        # Evaluate (ensure evaluate_model handles NumPy arrays)
        print(f"  Evaluating best {name} model on all sets...")
        model_results = evaluate_model(best_model,
                                       X_train_processed, y_train,
                                       X_val_processed, y_val,
                                       X_test_processed, y_test)
        all_results[name] = model_results
        test_results = model_results.get('Test', {})
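        # A missing or None metric prints as "N/A"; applying ':.4f' to the 'N/A' default would raise ValueError.
        def fmt_metric(value):
            return "N/A" if value is None or isinstance(value, str) else f"{value:.4f}"

        print(f"  Test Set Performance: Acc={fmt_metric(test_results.get('Accuracy'))}, "
              f"P={fmt_metric(test_results.get('Precision'))}, R={fmt_metric(test_results.get('Recall'))}, "
              f"F1={fmt_metric(test_results.get('F1 Score'))}, ROC AUC={fmt_metric(test_results.get('ROC AUC'))}")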


        # Generate Plots for Test Set
        print(f"  Generating plots for {name} (Test Set)...")
        # Ensure test data is available before predicting/plotting
        if X_test_processed is not None and y_test is not None:
             y_pred_test = best_model.predict(X_test_processed)
             plot_confusion_matrix(y_test, y_pred_test, name, "Test", OUTPUT_PLOTS_DIR)
             # Feature importance plotting might need adjustment if model is cuML
             # Standard sklearn way might work if cuML model mimics the attribute
             plot_feature_importance(best_model, FINAL_FEATURE_COLUMNS, name, OUTPUT_PLOTS_DIR)

             # Store ROC data
             if test_results.get('ROC AUC') is not None:
                 try:
                     y_prob_test = best_model.predict_proba(X_test_processed)[:, 1]
                     fpr, tpr, _ = roc_curve(y_test, y_prob_test)
                     roc_data_test[name] = {'fpr': fpr, 'tpr': tpr, 'auc': test_results['ROC AUC']}
                 except Exception as roc_err:
                      print(f"      Warning: Could not get probabilities or calculate ROC curve for {name}: {roc_err}")
        else:
            print("      Skipping test set plots as test data is not available.")


        # Save Model
        model_filename = OUTPUT_MODELS_DIR / f"{name}_best_model.joblib"
        # Note: Saving cuML models might require specific handling or might just work with joblib
        try:
            joblib.dump(best_model, model_filename)
            print(f"  Saved best model to {model_filename}")
        except Exception as save_err:
            print(f"  [ERROR] Could not save model {name} using joblib: {save_err}")
            print(f"      Model object type: {type(best_model)}")


    except Exception as train_err:
        print(f"  [ERROR] Failed to tune, train or evaluate {name}: {train_err}")
        traceback.print_exc() # Print detailed error


# 5. Save Results Summary
print("\n--- Saving Results Summary ---")
results_df_list = []
for model_name, stages in all_results.items():
    for stage_name, metrics in stages.items():
        metrics_copy = metrics.copy()
        metrics_copy['Model'] = model_name
        metrics_copy['Stage'] = stage_name
        results_df_list.append(metrics_copy)

if results_df_list:
    results_df = pd.DataFrame(results_df_list)
    cols_order = ['Model', 'Stage', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']
    results_df = results_df[[col for col in cols_order if col in results_df.columns]]
    results_filepath = OUTPUT_RESULTS_DIR / "all_models_evaluation_summary_v2.csv"
    try:
        results_df.to_csv(results_filepath, index=False)
        print(f"Evaluation summary saved to {results_filepath}")
    except Exception as e:
        print(f"[ERROR] Could not save results summary: {e}")
else:
     print("No model results generated to save.")

# 6. Combined ROC Curve Plot
print("\n--- Generating Combined ROC Curve Plot ---")
plt.figure(figsize=(10, 8))
plot_count = 0
for name, data in roc_data_test.items():
    if data.get('auc') is not None:
        plt.plot(data['fpr'], data['tpr'], lw=2, label=f"{name} (AUC = {data['auc']:.3f})")
        plot_count += 1

if plot_count > 0:
    plt.plot([0, 1], [0, 1], 'k--', label='Chance Level (AUC = 0.500)')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves - Test Set Comparison')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.tight_layout()
    roc_comp_filepath = OUTPUT_PLOTS_DIR / "combined_roc_curves_test_set_v2.png"
    try:
        plt.savefig(roc_comp_filepath)
        print(f"Combined ROC plot saved to {roc_comp_filepath}")
    except Exception as e:
        print(f"[ERROR] Could not save combined ROC plot: {e}")
else:
    print("No models with valid ROC AUC data to plot.")
plt.close()


print("\n--- Script Finished ---")


Successfully imported cuML SVC.
INFO: Using placeholder data for demonstration. Replace with your actual loading and preprocessing.

--- Model Training and Evaluation Loop (with cuML SVC option) ---
Calculated scale_pos_weight for XGBoost: 1.05

===== Training LogisticRegression (CPU) =====
  Using specific param grid for LogisticRegression
  Using n_jobs=-1 for GridSearchCV.
  Performing GridSearchCV (CV=5, scoring='roc_auc')...
Fitting 5 folds for each of 30 candidates, totalling 150 fits





  Best Params for LogisticRegression: {'C': 0.01, 'penalty': 'l1', 'solver': 'liblinear'}
  Best CV ROC AUC Score: 0.5000
  Evaluating best LogisticRegression model on all sets...
      Evaluating model...
      Evaluation complete.
  Test Set Performance: Acc=0.4500, P=0.0000, R=0.0000, F1=0.0000, ROC AUC=0.5000
  Generating plots for LogisticRegression (Test Set)...
      Feature importance not available for LogisticRegression.
  Saved best model to /content/drive/MyDrive/classification_results_v2/models/LogisticRegression_best_model.joblib

===== Training SVC (GPU with cuML) =====
  Using n_jobs=1 for GridSearchCV.
  Performing GridSearchCV (CV=5, scoring='roc_auc')...
Fitting 5 folds for each of 32 candidates, totalling 160 fits
  [ERROR] Failed to tune, train or evaluate SVC: 
All the 160 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
----------------------

Traceback (most recent call last):
  File "<ipython-input-10-cda6525a8084>", line 207, in <cell line: 0>
    grid_search.fit(X_train_processed, y_train)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_search.py", line 1024, in fit
    self._run_search(evaluate_candidates)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_search.py", line 1571, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_search.py", line 1001, in evaluate_candidates
    _warn_or_raise_about_fit_failures(out, self.error_score)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 517, in _warn_or_raise_about_fit_failures
    raise ValueError(all_fits_f


  Best Params for RandomForestClassifier: {'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 100}
  Best CV ROC AUC Score: 0.4919
  Evaluating best RandomForestClassifier model on all sets...
      Evaluating model...
      Evaluation complete.
  Test Set Performance: Acc=0.4750, P=0.5281, R=0.4273, F1=0.4724, ROC AUC=0.4812
  Generating plots for RandomForestClassifier (Test Set)...
  Saved best model to /content/drive/MyDrive/classification_results_v2/models/RandomForestClassifier_best_model.joblib

===== Training DecisionTreeClassifier (CPU) =====
  Using n_jobs=-1 for GridSearchCV.
  Performing GridSearchCV (CV=5, scoring='roc_auc')...
Fitting 5 folds for each of 160 candidates, totalling 800 fits

  Best Params for DecisionTreeClassifier: {'criterion': 'entropy', 'max_depth': 10, 'min_samples_leaf': 5, 'min_samples_split': 2}
  Best CV ROC AUC Score: 0.5165
  Evaluating best DecisionTreeClassifier model on all sets...
     

Streaming output truncated to the last 5000 lines.


    E.g. tree_method = "hist", device = "cuda"


Parameters: { "use_label_encoder" } are not used.




  Best Params for XGBClassifier: {'colsample_bytree': 0.7, 'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.7}
  Best CV ROC AUC Score: 0.4987
  Evaluating best XGBClassifier model on all sets...
      Evaluating model...
      Evaluation complete.
  Test Set Performance: Acc=0.5550, P=0.6105, R=0.5273, F1=0.5659, ROC AUC=0.5398
  Generating plots for XGBClassifier (Test Set)...
  Saved best model to /content/drive/MyDrive/classification_results_v2/models/XGBClassifier_best_model.joblib

--- Saving Results Summary ---
Evaluation summary saved to /content/drive/MyDrive/classification_results_v2/all_models_evaluation_summary_v2.csv

--- Generating Combined ROC Curve Plot ---
Combined ROC plot saved to /content/drive/MyDrive/classification_results_v2/plots/combined_roc_curves_test_set_v2.png

--- Script Finished ---
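The `E.g. tree_method = "hist", device = "cuda"` warning in the output above is XGBoost's hint that `gpu_hist` is deprecated. A minimal sketch of choosing the GPU arguments by version, assuming the `device` parameter was introduced in XGBoost 2.0; `xgb_gpu_kwargs` is a hypothetical helper, and in a real cell the version string would come from `xgb.__version__`:

```python
def xgb_gpu_kwargs(xgb_version, use_gpu):
    """Pick tree_method/device kwargs for XGBClassifier from version and GPU availability."""
    major = int(xgb_version.split(".")[0])
    if not use_gpu:
        # CPU histogram algorithm, valid in both 1.x and 2.x.
        return {"tree_method": "hist"}
    if major >= 2:
        # Assumed 2.x spelling: algorithm and device are separate parameters.
        return {"tree_method": "hist", "device": "cuda"}
    # Older releases select the GPU through the tree_method value itself.
    return {"tree_method": "gpu_hist"}


print(xgb_gpu_kwargs("2.1.4", use_gpu=True))
print(xgb_gpu_kwargs("1.7.6", use_gpu=True))
print(xgb_gpu_kwargs("2.1.4", use_gpu=False))
```

The result could be unpacked into the constructor, for example `xgb.XGBClassifier(**xgb_gpu_kwargs(xgb.__version__, CUML_AVAILABLE), ...)`. Note that the `is_xgb_gpu` check in the loop inspects `tree_method == 'gpu_hist'`, so it would need to look at `device` as well if this spelling is adopted.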


In [14]:
import numpy as np
import matplotlib.pyplot as plt
import joblib

# Attempt to import cupy, if available.
try:
    import cupy as cp
except ImportError:
    cp = None

def evaluate_model(model, X_train, y_train, X_val, y_val, X_test, y_test):
    """
    Evaluate the given model on train, validation, and test sets.
    Convert model.predict_proba outputs to a NumPy array if necessary.
    """
    results = {}
    for stage, (X, y) in zip(['Train', 'Validation', 'Test'],
                              [(X_train, y_train), (X_val, y_val), (X_test, y_test)]):
        y_pred = model.predict(X)
        # Obtain probability estimates.
        y_prob = model.predict_proba(X)
        # Convert GPU arrays (CuPy) to NumPy, or cuDF/pandas objects via to_numpy().
        if cp is not None and isinstance(y_prob, cp.ndarray):
            y_prob = y_prob.get()
        elif hasattr(y_prob, "to_numpy"):
            y_prob = y_prob.to_numpy()
        else:
            y_prob = np.asarray(y_prob)
        # If y_prob has two columns, select the probability of the positive class.
        if y_prob.ndim == 2 and y_prob.shape[1] >= 2:
            y_prob = y_prob[:, 1]
        results[stage] = {
            'Accuracy':      np.round(accuracy_score(y, y_pred), 4),
            'Precision':     np.round(precision_score(y, y_pred, zero_division=0), 4),
            'Recall':        np.round(recall_score(y, y_pred, zero_division=0), 4),
            'F1 Score':      np.round(f1_score(y, y_pred, zero_division=0), 4),
            'ROC AUC':       np.round(roc_auc_score(y, y_prob), 4)
        }
    return results

# =============================================================================
# Model training loop (GPU-enabled GridSearchCV with cuml, etc.)
# In your training loop, simply call evaluate_model as before.
for name in MODELS_TO_TRAIN:
    if name not in models:
        print(f"Model '{name}' not defined. Skipping.")
        continue

    print(f"\n===== Training {name} =====")
    model_instance, param_grid = models[name]
    current_param_grid = param_grid

    print(f"  Performing GPU-enabled GridSearchCV (CV={cv.get_n_splits()}, scoring='roc_auc')...")
    grid_search = GridSearchCV(
        estimator=model_instance,
        param_grid=current_param_grid,
        cv=cv,
        scoring='roc_auc',
        verbose=1
    )

    try:
        grid_search.fit(X_train_processed, y_train)
        print(f"\n  Best Params for {name}: {grid_search.best_params_}")
        print(f"  Best CV ROC AUC Score: {grid_search.best_score_:.4f}")
        best_model = grid_search.best_estimator_
        best_models[name] = best_model

        print(f"  Evaluating best {name} model on all sets...")
        model_results = evaluate_model(
            best_model,
            X_train_processed, y_train,
            X_val_processed, y_val,
            X_test_processed, y_test
        )
        all_results[name] = model_results
        print(f"  Test Set Performance: Acc={model_results['Test']['Accuracy']:.4f}, "
              f"P={model_results['Test']['Precision']:.4f}, R={model_results['Test']['Recall']:.4f}, "
              f"F1={model_results['Test']['F1 Score']:.4f}, ROC AUC={model_results['Test']['ROC AUC']:.4f}")

        # ---------------------- Generate and Save Plots ----------------------
        print(f"  Generating plots for {name} (Test Set)...")
        # 1. Confusion Matrix plot (assuming plot_confusion_matrix is defined)
        y_pred_test = best_model.predict(X_test_processed)
        plot_confusion_matrix(y_test, y_pred_test, name, "Test", OUTPUT_PLOTS_DIR)

        # 2. Feature Importance plot (assuming your function handles GPU models as needed)
        plot_feature_importance(best_model, FINAL_FEATURE_COLUMNS, name, OUTPUT_PLOTS_DIR)

        # 3. ROC Curve plot
        y_prob_test = best_model.predict_proba(X_test_processed)
        if cp is not None and isinstance(y_prob_test, cp.ndarray):
            y_prob_test = y_prob_test.get()
        elif hasattr(y_prob_test, "to_numpy"):
            y_prob_test = y_prob_test.to_numpy()
        y_prob_test = np.asarray(y_prob_test)
        if y_prob_test.ndim == 2 and y_prob_test.shape[1] >= 2:
            fpr, tpr, thresholds = roc_curve(y_test, y_prob_test[:, 1])
            plt.figure()
            plt.plot(fpr, tpr, label=f'{name} (AUC = {model_results["Test"]["ROC AUC"]:.2f})')
            plt.plot([0, 1], [0, 1], 'k--')
            plt.xlabel('False Positive Rate')
            plt.ylabel('True Positive Rate')
            plt.title(f'ROC Curve - {name} (Test Set)')
            plt.legend(loc='lower right')
            roc_plot_path = OUTPUT_PLOTS_DIR / f"{name}_roc_curve.png"
            plt.savefig(roc_plot_path)
            plt.close()
            print(f"  Saved ROC curve plot to {roc_plot_path}")

        # ---------------------- Save Model ----------------------
        model_filename = OUTPUT_MODELS_DIR / f"{name}_best_model.joblib"
        joblib.dump(best_model, model_filename)
        print(f"  Saved best model to {model_filename}")

    except Exception as train_err:
        print(f"  [ERROR] Failed to tune, train or evaluate {name}: {train_err}")
        import traceback
        traceback.print_exc()



===== Training LogisticRegression =====
  Performing GPU-enabled GridSearchCV (CV=5, scoring='roc_auc')...
Fitting 5 folds for each of 5 candidates, totalling 25 fits

  Best Params for LogisticRegression: {'C': 0.01, 'penalty': 'l2'}
  Best CV ROC AUC Score: nan
  Evaluating best LogisticRegression model on all sets...
  Test Set Performance: Acc=0.7616, P=0.0732, R=0.0275, F1=0.0400, ROC AUC=0.4560
  Generating plots for LogisticRegression (Test Set)...
      Feature importance not available for LogisticRegression.
  Saved ROC curve plot to /content/drive/MyDrive/classification_results_v2/plots/LogisticRegression_roc_curve.png
  Saved best model to /content/drive/MyDrive/classification_results_v2/models/LogisticRegression_best_model.joblib

===== Training SVC =====
  Performing GPU-enabled GridSearchCV (CV=5, scoring='roc_auc')...
Fitting 5 folds for each of 16 candidates, totalling 80 fits

  Best Params for SVC: {'C': 0.1, 'gamma': 'scale', 'kernel': 'rbf'}
  Best CV ROC AUC Score
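`evaluate_model` above converts `predict_proba` output to a host array, and the ROC block in the loop repeats the same checks. A sketch of a single duck-typed helper, assuming GPU arrays expose `__cuda_array_interface__` and a `.get()` method (as CuPy arrays do) and dataframe-like objects expose `.to_numpy()`; `to_host` and `FakeDeviceArray` are hypothetical names, the latter standing in for a CuPy array so the example runs without a GPU:

```python
def to_host(obj):
    """Return a host-side (CPU) copy of obj if it looks GPU-backed, else obj unchanged."""
    # cuDF and pandas Series/DataFrames both provide to_numpy().
    if hasattr(obj, "to_numpy"):
        return obj.to_numpy()
    # CuPy-style device arrays: require the CUDA interface so that plain dicts
    # (which also have a .get method) are not mistaken for device arrays.
    if hasattr(obj, "__cuda_array_interface__") and callable(getattr(obj, "get", None)):
        return obj.get()
    return obj


class FakeDeviceArray:
    """Stand-in for a CuPy array; for illustration only."""
    __cuda_array_interface__ = {"version": 3}

    def __init__(self, values):
        self._values = list(values)

    def get(self):
        # Mimics copying device memory back to the host.
        return list(self._values)


probs = to_host(FakeDeviceArray([0.1, 0.9]))
print(probs)
print(to_host({"a": 1}))  # a plain dict passes through untouched
```

Calling one helper on both `predict` and `predict_proba` results keeps the metric functions working on host data regardless of which library produced the model.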

In [15]:
# 6. Combined ROC Curve Plot
print("\n--- Generating Combined ROC Curve Plot ---")
plt.figure(figsize=(10, 8))
plot_count = 0
for name, data in roc_data_test.items():
    if data.get('auc') is not None:
        plt.plot(data['fpr'], data['tpr'], lw=2, label=f"{name} (AUC = {data['auc']:.3f})")
        plot_count += 1

if plot_count > 0:
    plt.plot([0, 1], [0, 1], 'k--', label='Chance Level (AUC = 0.500)') # Diagonal line
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves - Test Set Comparison')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.tight_layout()
    roc_comp_filepath = OUTPUT_PLOTS_DIR / "combined_roc_curves_test_set_v2.png"
    try:
        plt.savefig(roc_comp_filepath)
        print(f"Combined ROC plot saved to {roc_comp_filepath}")
    except Exception as e:
        print(f"[ERROR] Could not save combined ROC plot: {e}")
else:
    print("No models with valid ROC AUC data to plot.")

plt.close()


print("\n--- Script Finished ---")


--- Generating Combined ROC Curve Plot ---
Combined ROC plot saved to /content/drive/MyDrive/classification_results_v2/plots/combined_roc_curves_test_set_v2.png

--- Script Finished ---
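The legend labels the diagonal as chance level with AUC = 0.500. As a worked check, the area under a piecewise-linear ROC curve can be computed with the trapezoidal rule; `trapezoid_auc` is a hypothetical helper written for this illustration, not the scikit-learn implementation:

```python
def trapezoid_auc(fpr, tpr):
    """Area under a piecewise-linear curve; points must be sorted by increasing fpr."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * mean_height
    return area


chance = trapezoid_auc([0.0, 1.0], [0.0, 1.0])             # the dashed diagonal
perfect = trapezoid_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # straight up, then across
partial = trapezoid_auc([0.0, 0.5, 1.0], [0.0, 0.75, 1.0])
print(chance, perfect, partial)
```

The diagonal is a single trapezoid of width 1 and mean height 0.5, which gives the 0.500 shown in the legend. The random placeholder labels used in this notebook are why every model scores close to that value.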


In [5]:
import cugraph
cugraph.__version__

ModuleNotFoundError: No module named 'cugraph'

In [6]:
!pip install cugraph

Collecting cugraph
  Downloading cugraph-0.6.1.post1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: cugraph
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for cugraph (setup.py) ... error
  ERROR: Failed building wheel for cugraph
  Running setup.py clean for cugraph
Failed to build cugraph
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (cugraph)

In [7]:
import cuspatial
cuspatial.__version__

ModuleNotFoundError: No module named 'cuspatial'

In [8]:
import cuxfilter
cuxfilter.__version__

ModuleNotFoundError: No module named 'cuxfilter'
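The three import cells above fail because only part of the RAPIDS suite is present on this instance, and `pip install cugraph` resolved to a 1.1 kB source package (version 0.6.1.post1) whose wheel could not be built. A sketch that reports which libraries are importable without raising; the module names are taken from this notebook, while the CUDA-suffixed package names mentioned in the comment (for example `cugraph-cu12` from NVIDIA's package index) are an assumption to verify against the current RAPIDS install guide:

```python
import importlib.util

RAPIDS_MODULES = ["cudf", "cuml", "cugraph", "cuspatial", "cuxfilter"]


def rapids_status(modules=RAPIDS_MODULES):
    """Map each module name to True if it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = rapids_status()
for name, available in status.items():
    print(f"{name:<10} {'available' if available else 'missing'}")

# Assumption: missing libraries are published under CUDA-suffixed names
# (e.g. cugraph-cu12) rather than the bare name used in the failed cell above.
missing = [name for name, ok in status.items() if not ok]
print("Missing:", ", ".join(missing) if missing else "none")
```

`find_spec` only looks the module up and does not import it, so the check is cheap and does not initialize the GPU.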

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib