# Model Performance Dashboard


This notebook is a modularized version of the Model Performance Dashboard. Its goal is to evaluate how well a chosen classifier performs on a user-supplied dataset. It covers data upload, preprocessing, model selection, training, and evaluation across several machine learning models, with AWS S3 integration for dataset management.


Key features include:

- Support for both local and AWS S3 data sources for flexible dataset management.
- Interactive widgets for uploading data, selecting the target column and model, and displaying results.
- Evaluation metrics such as accuracy, F1 score, and ROC AUC to assess model performance.
- Data preprocessing steps including handling missing values, feature scaling, and train/test splitting.


This notebook is designed to help users easily experiment with different models and datasets in an interactive environment.

## 1. Import Required Libraries
Import all necessary libraries for data handling, AWS S3, and machine learning.

In [1]:
import os
import pandas as pd
import numpy as np
import boto3
from io import BytesIO
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
print("All libraries imported successfully.")

All libraries imported successfully.


## 2. AWS S3 Configuration and Helper Functions
Set up AWS S3 configuration and define helper functions for file management.

In [2]:
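# AWS S3 Configuration
from botocore.exceptions import ClientError

S3_BUCKET_NAME = "arv7staging"
s3_client = boto3.client("s3")

def s3_file_exists(s3_key):
    """Return True if s3_key exists in the bucket, False if S3 reports it missing."""
    try:
        s3_client.head_object(Bucket=S3_BUCKET_NAME, Key=s3_key)
        return True
    except ClientError as e:
        # head_object reports a missing key as a 404; re-raise anything else
        # (permissions, throttling) instead of treating it as "not found".
        if e.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return False
        raise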
    
print("S3 storage client configured successfully.")

S3 storage client configured successfully.
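
The existence check can be exercised without AWS access if the client is passed in as a parameter. The following is a minimal sketch under that assumption: `FakeClientError` and `FakeS3Client` are hypothetical stand-ins (with boto3 the real exception is `botocore.exceptions.ClientError`, which carries the same `response["Error"]["Code"]` structure), and `key_exists` is an illustrative name that is not part of the notebook.

```python
class FakeClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""

    def __init__(self, code):
        super().__init__(f"S3 error {code}")
        self.response = {"Error": {"Code": code}}


class FakeS3Client:
    """Hypothetical in-memory client that mimics head_object only."""

    def __init__(self, keys, forbidden=()):
        self._keys = set(keys)
        self._forbidden = set(forbidden)

    def head_object(self, Bucket, Key):
        if Key in self._forbidden:
            raise FakeClientError("403")
        if Key not in self._keys:
            raise FakeClientError("404")
        return {"ContentLength": 0}


def key_exists(client, bucket, key, error_type=FakeClientError):
    """Return True/False for present/missing keys; re-raise other errors."""
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except error_type as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise  # e.g. 403: the answer is unknown, so do not report "missing"


client = FakeS3Client(keys={"data/train.csv"}, forbidden={"secret.csv"})
print(key_exists(client, "example-bucket", "data/train.csv"))  # True
print(key_exists(client, "example-bucket", "data/other.csv"))  # False
```

Re-raising on anything other than a 404 keeps a permissions problem from being mistaken for a missing file.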


## 3. Dataset Upload and Download from S3
Functions to upload local datasets to S3 and load datasets from S3, supporting CSV and XLSX formats.

In [3]:
def upload_to_s3(local_path, s3_key):
    if os.path.exists(local_path):
        if not s3_file_exists(s3_key):
            with open(local_path, "rb") as f:
                s3_client.put_object(Bucket=S3_BUCKET_NAME, Key=s3_key, Body=f)
            print(f"Uploaded {s3_key} to S3.")
        else:
            print(f"Skipping {s3_key}: Already exists in S3.")
    else:
        print(f"Error: {local_path} does not exist.")

def load_from_s3(s3_key):
    try:
        obj = s3_client.get_object(Bucket=S3_BUCKET_NAME, Key=s3_key)
        if s3_key.endswith(".csv"):
            df = pd.read_csv(obj['Body'], low_memory=False)
        elif s3_key.endswith(".xlsx"):
            df = pd.read_excel(BytesIO(obj['Body'].read()), engine='openpyxl')
        else:
            raise ValueError("Unsupported file format")
        return df
    except Exception as e:
        print(f"Error loading {s3_key} from S3: {e}")
        return None
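
The format dispatch in `load_from_s3` can be tried without S3 by feeding it bytes directly. A minimal sketch, assuming a hypothetical helper `read_table` (not part of the notebook) that lowercases the extension so `.CSV` is accepted as well; the sample bytes are made up for illustration:

```python
import os
from io import BytesIO

import pandas as pd


def read_table(name, payload):
    """Parse raw bytes into a DataFrame based on the file extension."""
    suffix = os.path.splitext(name)[1].lower()
    if suffix == ".csv":
        return pd.read_csv(BytesIO(payload))
    if suffix == ".xlsx":
        # Requires the optional openpyxl dependency
        return pd.read_excel(BytesIO(payload), engine="openpyxl")
    raise ValueError(f"Unsupported file format: {name}")


sample = b"age,income,label\n34,52000,1\n51,,0\n"
df_sample = read_table("customers.CSV", sample)
print(df_sample.shape)                   # (2, 3)
print(df_sample["income"].isna().sum())  # 1
```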

## 4. Data Preprocessing Function
Define a function to preprocess the data: keep numeric columns, split into train/test sets, then impute missing values and scale features using statistics fitted on the training split only.

In [4]:
def preprocess_data(df, target_column):
    if target_column not in df.columns:
        raise ValueError(f"Target column '{target_column}' not found in dataset.")
    df = df.select_dtypes(include=[np.number])  # Keep only numeric columns
    if target_column not in df.columns:
        raise ValueError(f"Target column '{target_column}' is not numeric and was dropped.")
    df = df.dropna(subset=[target_column])  # Rows without a label cannot be used
    X = df.drop(columns=[target_column])
    y = df[target_column]
    # Split first so the imputer and scaler never see test-set statistics
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    imputer = SimpleImputer(strategy='mean')
    scaler = StandardScaler()
    X_train = scaler.fit_transform(imputer.fit_transform(X_train))
    X_test = scaler.transform(imputer.transform(X_test))
    return X_train, X_test, y_train, y_test
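
One way to keep the imputation and scaling statistics tied to the training split is scikit-learn's `Pipeline`, which fits each step on whatever data is passed to `fit` and only applies `transform` at prediction time. A minimal sketch on synthetic data (the array shape, seed, and missing-value rate are illustrative assumptions):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label computed before values are removed
X[rng.random(X.shape) < 0.1] = np.nan    # knock out roughly 10% of the values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# The imputer's fill values are the training-column means, not the full-data means
train_means = np.nanmean(X_train, axis=0)
print(np.allclose(pipeline.named_steps["impute"].statistics_, train_means))  # True
print(pipeline.predict(X_test).shape)  # (40,)
```

Bundling the steps this way also means the same object can be handed to cross-validation utilities without refitting the scaler by hand.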

## 5. Model Definitions
Create a dictionary of available models: Logistic Regression, Random Forest, HistGradientBoosting, XGBoost, LightGBM, CatBoost, K-Nearest Neighbors, and MLP Neural Network. Models can be selected for comparison.

In [5]:
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "HistGradientBoosting": HistGradientBoostingClassifier(),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "MLP Neural Network": MLPClassifier(max_iter=500)
}
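
Because the dictionary holds estimator instances, every run refits the same objects, and a missing optional package makes the import cell fail. A sketch of one alternative, assuming a hypothetical `MODEL_FACTORIES` registry of zero-argument callables and a `build_model` helper (neither is part of the notebook):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Each entry builds a fresh, unfitted estimator on demand
MODEL_FACTORIES = {
    "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
    "Random Forest": lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    "K-Nearest Neighbors": lambda: KNeighborsClassifier(),
}

try:
    # Optional dependency: register it only when the package is installed
    from xgboost import XGBClassifier
    MODEL_FACTORIES["XGBoost"] = lambda: XGBClassifier(eval_metric="logloss")
except ImportError:
    pass


def build_model(name):
    """Return a new estimator for the given registry name."""
    try:
        return MODEL_FACTORIES[name]()
    except KeyError:
        raise ValueError(f"Unknown model: {name}") from None


first = build_model("Random Forest")
second = build_model("Random Forest")
print(first is second)  # False
```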

In [6]:
# Store outcomes for each model
model_outcomes = {}


def store_model_outcome(model_name, y_true, y_pred, y_proba):
    outcome = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
        "ROC AUC": roc_auc_score(y_true, y_proba) if y_proba is not None else "N/A"
    }
    model_outcomes[model_name] = outcome
    return outcome


def compare_model_outcomes():
    return model_outcomes
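
To make the three metrics concrete, here is a small worked example on four hand-written labels (the values are illustrative, not results from any dataset). With one false positive and no false negatives, precision is 2/3 and recall is 1, so F1 is 0.8; every positive is scored above every negative, so ROC AUC is 1.0 even though accuracy is only 0.75.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]            # one false positive
y_proba = [0.1, 0.6, 0.7, 0.9]   # predicted probability of class 1

accuracy = accuracy_score(y_true, y_pred)  # 3 of 4 correct
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_proba)       # ranking quality, independent of threshold

print(accuracy, round(f1, 3), auc)  # 0.75 0.8 1.0
```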

## 6. Load and Preview Dataset
Load a dataset from a local file or S3, and display a preview of the data.

In [7]:
import ipywidgets as widgets
from IPython.display import display

# File upload widget
file_upload = widgets.FileUpload(accept='.csv,.xlsx', multiple=False)
display(file_upload)

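def get_uploaded_df(file_upload):
    value = file_upload.value
    if len(value) == 0:
        return None
    if isinstance(value, dict):
        # ipywidgets 7: dict keyed by filename
        uploaded_filename, item = next(iter(value.items()))
    else:
        # ipywidgets 8: tuple of dicts, each with 'name' and 'content'
        item = value[0]
        uploaded_filename = item['name']
    content = item['content']
    if uploaded_filename.endswith('.csv'):
        df = pd.read_csv(BytesIO(content))
    elif uploaded_filename.endswith('.xlsx'):
        df = pd.read_excel(BytesIO(content), engine='openpyxl')
    else:
        raise ValueError('Unsupported file format')
    return df

# The widget is empty when this cell first runs, so df is None until a file
# has been chosen and these lines are executed again.
df = get_uploaded_df(file_upload)
if df is not None:
    display(df.head())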

FileUpload(value=(), accept='.csv,.xlsx', description='Upload')

## 7. Select Target Column and Models
Let the user select the target column and one or more models to compare, using ipywidgets.

In [8]:
if df is not None:
    target_selector = widgets.Dropdown(options=df.columns.tolist(), description='Target Column:')
    model_names = list(models.keys())
    model_selector = widgets.SelectMultiple(
        options=model_names,
        value=model_names[:2],  # Preselect the first two models
        description='Models to Compare:',
    )
    display(target_selector, model_selector)

## 8. Train and Evaluate Selected Models
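Train the selected models on the preprocessed data, make predictions, and evaluate performance using accuracy, F1 score, and ROC AUC. ROC AUC is computed from the predicted probability of the second class (`predict_proba(X_test)[:, 1]`), and F1 uses scikit-learn's default binary averaging, so this cell assumes a binary target. Results for all selected models are compared.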

In [9]:
if df is not None:
    import warnings
    warnings.filterwarnings('ignore')
    try:
        X_train, X_test, y_train, y_test = preprocess_data(df, target_selector.value)
        for model_name in model_selector.value:
            model = models[model_name]
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None
            store_model_outcome(model_name, y_test, y_pred, y_proba)
        print("Model Performance Comparison:")
        print(compare_model_outcomes())
    except Exception as e:
        print(f"Error: {e}")
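
Printing the raw dictionary gets hard to read once several models are selected. A minimal sketch of tabulating it with pandas follows; `example_outcomes` stands in for `model_outcomes`, and its model names and scores are made-up placeholders used purely to show the shape of the output.

```python
import pandas as pd

# Placeholder values for illustration only, not measured results
example_outcomes = {
    "Model A": {"Accuracy": 0.81, "F1 Score": 0.78, "ROC AUC": 0.88},
    "Model B": {"Accuracy": 0.85, "F1 Score": 0.83, "ROC AUC": 0.91},
    "Model C": {"Accuracy": 0.79, "F1 Score": 0.75, "ROC AUC": "N/A"},
}

results = pd.DataFrame.from_dict(example_outcomes, orient="index")
# "N/A" entries become NaN so the column stays numeric and sortable
results["ROC AUC"] = pd.to_numeric(results["ROC AUC"], errors="coerce")
results = results.sort_values("Accuracy", ascending=False)
print(results)
```

In a notebook, `display(results)` would render the same table as formatted HTML.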