# Compliance Radar – Main Notebook

Machine Learning 2025/2026 – LUISS Guido Carli

This notebook follows the project structure required by the course:
1. Data loading
2. Exploratory Data Analysis (EDA)
3. Preprocessing & feature engineering
4. Model training & cross-validation
5. Evaluation & interpretability
6. Conclusions & insights


## 0. Setup & Imports

Run this cell first to load all required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix
)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

import shap

sns.set_theme(style="whitegrid")  # sns.set is only an alias of set_theme
plt.rcParams["figure.figsize"] = (8, 5)


## 1. Data Loading

We load the SQLite database `org_compliance_data.db` from the `data/` folder and inspect the available tables.

In [None]:
db_path = "data/org_compliance_data.db"  # make sure the file is in this path

# Create SQLAlchemy engine
engine = create_engine(f"sqlite:///{db_path}")

# List tables in the database. Engine.table_names() was removed in SQLAlchemy 2.0,
# so use the inspector, which works on both 1.4 and 2.x.
from sqlalchemy import inspect
tables = inspect(engine).get_table_names()
print("Tables in database:", tables)

# EXAMPLE: load a specific table (replace 'table_name' with an actual name from the list above)
# df = pd.read_sql("SELECT * FROM table_name", engine)
# df.head()
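
Once the table names are known, each table can be read into its own DataFrame. The sketch below is one way to do that, not the project's required method. It builds a throwaway SQLite file so it runs without `org_compliance_data.db`; the table name `departments_demo`, its columns, and the helper `load_all_tables` are hypothetical.

```python
import os
import sqlite3
import tempfile

import pandas as pd
from sqlalchemy import create_engine, inspect, text

# Build a small demo database (hypothetical table and columns).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(demo_path)
con.execute("CREATE TABLE departments_demo (dept_id INTEGER, risk_score REAL)")
con.executemany(
    "INSERT INTO departments_demo VALUES (?, ?)",
    [(1, 0.2), (2, 0.7), (3, 0.5)],
)
con.commit()
con.close()

demo_engine = create_engine(f"sqlite:///{demo_path}")


def load_all_tables(engine):
    """Return {table_name: DataFrame} for every table the inspector reports."""
    frames = {}
    with engine.connect() as conn:
        for name in inspect(engine).get_table_names():
            # Names come from the database itself, so quoting them is enough here.
            frames[name] = pd.read_sql(text(f'SELECT * FROM "{name}"'), conn)
    return frames


frames = load_all_tables(demo_engine)
for name, frame in frames.items():
    print(name, frame.shape)
```

With the real data, passing `engine` instead of `demo_engine` should give one DataFrame per table, which makes it easier to decide which table (or join) becomes `df`.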


## 2. Exploratory Data Analysis (EDA)

_Teammate A:_ add histograms, boxplots, counts, and correlation heatmaps here once `df` is defined.

In [None]:
# Example EDA template (uncomment once df is defined)
# display(df.head())
# display(df.describe(include="all"))

# numeric_cols = df.select_dtypes(include="number").columns  # all numeric dtypes, not only int64/float64
# df[numeric_cols].hist(bins=30, figsize=(12, 8))
# plt.tight_layout()
# plt.show()

# corr = df[numeric_cols].corr()
# sns.heatmap(corr, annot=False, cmap="coolwarm")
# plt.title("Correlation heatmap")
# plt.show()
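
Before plotting, a quick per-column overview helps spot missing values and low-cardinality columns. The following is a minimal sketch on a stand-in DataFrame; `demo_df`, its column names, and the helper `column_overview` are hypothetical and only illustrate the "counts" part of the EDA.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; replace with the real table once it is loaded.
demo_df = pd.DataFrame({
    "audit_score": [0.9, 0.4, np.nan, 0.7, 0.6],
    "n_incidents": [0, 3, 1, 0, 2],
    "region": ["north", "south", "south", None, "north"],
})


def column_overview(frame):
    """Per-column dtype, share of missing values, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_share": frame.isna().mean(),
        "n_unique": frame.nunique(dropna=True),
    })


overview = column_overview(demo_df)
print(overview)

# Category counts, keeping missing values visible as their own row.
region_counts = demo_df["region"].value_counts(dropna=False)
print(region_counts)
```

The same two calls on the target column would also show the class balance, which matters for the choice of metrics later on.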


## 3. Preprocessing & Feature Engineering

Here we define:
- target variable `y`
- feature matrix `X`
- train/test split
- scaling/imputation if needed.

In [None]:
# Example template – adapt based on your actual columns
# target_col = "<TARGET_COLUMN_NAME>"  # TODO: replace with real target name
# feature_cols = [c for c in df.columns if c != target_col]

# X = df[feature_cols]
# y = df[target_col]

# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y
# )

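# Fit the scaler on the training split only to avoid leaking test-set statistics.
# StandardScaler needs numeric input, so encode or drop non-numeric columns first.
# scaler = StandardScaler()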
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
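
If the real table mixes numeric and categorical columns or contains missing values, the split, imputation, and scaling steps listed above can be combined in one preprocessing object. This is a sketch under assumptions: the data is synthetic, the column names and the target `flagged` are hypothetical, and median / most-frequent imputation is just one reasonable default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for df (hypothetical columns).
rng = np.random.default_rng(42)
n = 40
demo_df = pd.DataFrame({
    "audit_score": rng.normal(size=n),
    "n_incidents": rng.integers(0, 5, size=n).astype(float),
    "region": rng.choice(["north", "south"], size=n),
    "flagged": np.tile([0, 1], n // 2),
})
demo_df.loc[demo_df.index % 7 == 0, "audit_score"] = np.nan  # inject missing values

X = demo_df.drop(columns="flagged")
y = demo_df["flagged"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(
    [("num", numeric_steps, numeric_cols), ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

# Statistics are learned from the training split only, then reused on the test split.
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Keeping the steps in a single `ColumnTransformer` means the same object can later be placed in front of a model inside a `Pipeline`, so cross-validation refits the preprocessing on each fold.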


## 4. Model Training & Cross-Validation

Here we train at least 3 models: Logistic Regression, Random Forest, XGBoost.

_Petra:_ train and compare the models here.

In [None]:
# Example placeholder – replace with real code once X_train_scaled / y_train exist

# log_reg = LogisticRegression(max_iter=1000)
# rf = RandomForestClassifier(random_state=42)
# xgb = XGBClassifier(random_state=42, eval_metric="logloss")

# models = {
#     "Logistic Regression": log_reg,
#     "Random Forest": rf,
#     "XGBoost": xgb
# }

# results = []
# for name, model in models.items():
#     model.fit(X_train_scaled, y_train)
#     y_pred = model.predict(X_test_scaled)
#     y_proba = model.predict_proba(X_test_scaled)[:, 1]

#     acc = accuracy_score(y_test, y_pred)
#     prec = precision_score(y_test, y_pred)
#     rec = recall_score(y_test, y_pred)
#     f1 = f1_score(y_test, y_pred)
#     auc = roc_auc_score(y_test, y_proba)

#     results.append([name, acc, prec, rec, f1, auc])

# results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1", "AUC"])
# display(results_df)
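
The template above scores each model on a single hold-out split. For the cross-validation part of this section, one option is stratified k-fold scoring plus a small grid search, sketched below on synthetic data from `make_classification`. The fold count, the metrics, and the parameter grid are assumptions chosen for illustration; `XGBClassifier` could be added to `candidates` in the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    # Scaling sits inside the pipeline so it is refit on every training fold.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scoring = ["accuracy", "f1", "roc_auc"]
rows = []
for name, model in candidates.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)
    rows.append({"Model": name, **{m: scores[f"test_{m}"].mean() for m in scoring}})

cv_results = pd.DataFrame(rows)
print(cv_results)

# Small, illustrative grid; widen it only as far as runtime allows.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=cv,
    scoring="roc_auc",
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```

Running the search on the training split only and keeping the test split for a single final evaluation keeps the reported test metrics honest.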


## 5. Interpretability (Feature Importance & SHAP)

_Petra:_ add feature importance and SHAP analysis here.

In [None]:
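# Example placeholder – compute SHAP values for the best model
# best_model = rf  # or xgb, depending on results

# explainer = shap.TreeExplainer(best_model)
# shap_values = explainer.shap_values(X_test_scaled)

# A RandomForestClassifier yields one set of values per class (a list in older
# shap releases, a 3-D array in newer ones), so keep the positive class:
# if isinstance(shap_values, list):
#     shap_values = shap_values[1]
# elif getattr(shap_values, "ndim", 2) == 3:
#     shap_values = shap_values[:, :, 1]

# shap.summary_plot(shap_values, X_test, feature_names=X_test.columns)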
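
For the feature-importance half of this section, a model-agnostic check that needs only scikit-learn is permutation importance, shown here next to the forest's built-in impurity importance. This is a complement to SHAP, not a replacement for it. The data is synthetic and the names `feature_0` ... `feature_5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real feature matrix.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in ROC AUC.
perm = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)

importance_df = (
    pd.DataFrame({
        "feature": feature_names,
        "impurity_importance": model.feature_importances_,
        "permutation_importance": perm.importances_mean,
    })
    .sort_values("permutation_importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

Impurity importance is computed on the training data and can favour high-cardinality features, while permutation importance is measured on held-out data, so comparing the two rankings is a useful sanity check before interpreting the SHAP plot.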


## 6. Conclusions & Compliance Insights

_Teammate B/C:_ add textual interpretation here based on the final results.

- Which features are most strongly associated with potential non-compliance?
- How should the organisation monitor these?
- What recommendations follow from the model outputs?
