# 03 - Modeling and Evaluation

Train a baseline classification model, evaluate it on a held-out test set, and save the model and its metrics.

## Objectives
- Split dataset into train/test
- Train at least one baseline model
- Evaluate with Accuracy and ROC-AUC
- Save trained model + metrics

> **Learner task:** Try additional classifiers and compare metrics.

In [None]:
# Imports
import json
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

In [None]:
# Load prepared data
# Option A: Use engineered dataframe
df = pd.read_csv("data/processed/sample_clean.csv")

target_col = "target"  # TODO: update to the name of your label column
X = df.drop(columns=[target_col])
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
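
The split above passes `stratify=y` so that the class proportions in the train and test sets stay close to those of the full dataset. The following is a minimal sketch of how to check that, assuming a synthetic imbalanced dataset from `make_classification` in place of the real data. The names `X_demo`, `y_demo`, and the `*_rate` variables are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (roughly 10% positives).
# This stands in for the real dataset and is an assumption of the sketch.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)

# Same split settings as the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of the positive class in the full data and in each split.
full_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"full={full_rate:.3f} train={train_rate:.3f} test={test_rate:.3f}")
```

If the three rates differ noticeably on your own data, check that `stratify` receives the same target column you are predicting.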

In [None]:
# Train baseline classification model
# Note: LogisticRegression expects numeric features with no missing values
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Training complete.")

In [None]:
# Evaluate model
preds = model.predict(X_test)
probs = model.predict_proba(X_test)[:, 1]  # positive-class probability (binary target only)

accuracy = accuracy_score(y_test, preds)
roc_auc = roc_auc_score(y_test, probs)

print(f"Accuracy: {accuracy:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")
print(classification_report(y_test, preds))
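
For the learner task, the same evaluation can be looped over several classifiers so that their metrics end up side by side. The sketch below is one possible approach, not a prescribed solution. It assumes synthetic data from `make_classification`, and the candidate names (`logreg_scaled`, `random_forest`) and the `results` dictionary are hypothetical. Swap in the notebook's own `X_train`, `X_test`, `y_train`, and `y_test` to use it on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data; an assumption standing in for the real dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical candidate set; the names are illustrative only.
candidates = {
    "logreg_scaled": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each candidate and collect the same two metrics used for the baseline.
results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    cand_preds = clf.predict(X_te)
    cand_probs = clf.predict_proba(X_te)[:, 1]  # binary case
    results[name] = {
        "accuracy": float(accuracy_score(y_te, cand_preds)),
        "roc_auc": float(roc_auc_score(y_te, cand_probs)),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.4f} roc_auc={scores['roc_auc']:.4f}")
```

Keeping the split and `random_state` fixed across candidates means differences in the scores come from the models rather than from the data partition.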

In [None]:
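# Save model and metrics
import os

# Create the output directories if they do not exist yet
os.makedirs("models", exist_ok=True)
os.makedirs("reports/metrics", exist_ok=True)

joblib.dump(model, "models/baseline_model.joblib")

metrics = {"accuracy": float(accuracy), "roc_auc": float(roc_auc)}
with open("reports/metrics/baseline_metrics.json", "w", encoding="utf-8") as f:
    json.dump(metrics, f, indent=2)

print("Saved model and metrics.")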
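
After saving, it can be worth confirming that the artifacts load back correctly. The sketch below shows one way to do that round trip, assuming a small synthetic model and a temporary directory so that nothing in the project is overwritten. The names `demo_model`, `demo_metrics`, `model_path`, and `metrics_path` are illustrative and not part of the notebook.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model and metrics; assumptions standing in for the real ones.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
demo_metrics = {"accuracy": float(demo_model.score(X_demo, y_demo))}

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the notebook's folder layout inside a throwaway directory.
    model_path = Path(tmp) / "models" / "baseline_model.joblib"
    metrics_path = Path(tmp) / "reports" / "metrics" / "baseline_metrics.json"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    metrics_path.parent.mkdir(parents=True, exist_ok=True)

    # Save, then load both artifacts back.
    joblib.dump(demo_model, model_path)
    metrics_path.write_text(json.dumps(demo_metrics, indent=2), encoding="utf-8")

    loaded_model = joblib.load(model_path)
    loaded_metrics = json.loads(metrics_path.read_text(encoding="utf-8"))

# The reloaded model should reproduce the original predictions.
same_predictions = bool(
    (loaded_model.predict(X_demo) == demo_model.predict(X_demo)).all()
)
print(same_predictions, loaded_metrics == demo_metrics)
```

Files written by `joblib.dump` are pickle-based, so only load them from sources you trust and, where possible, with the same scikit-learn version used for training.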

## Conclusion
- Best model and key score:
- What worked well:
- What to improve next (features, model choice, tuning, data quality):