# Spot the Scam: End-to-End Notebook

This notebook documents the full Spot the Scam pipeline: data ingestion, preprocessing, feature creation, model training, evaluation, explainability, and packaging with MLflow/ONNX.

## 1. Environment Setup

- Activate the project virtual environment: `source .venv/bin/activate`
- Ensure dependencies are installed: `pip install -e '.[dev]'`
- Run this notebook from the project root so relative paths resolve correctly.

In [None]:
# Allow imports from the src/ directory
import sys
from pathlib import Path

ROOT = Path.cwd()
SRC = ROOT / "src"
if str(SRC) not in sys.path:
    sys.path.insert(0, str(SRC))
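
The cell above relies on the notebook being started from the project root. As a hedged alternative, the root could be located by walking up from the current directory until a marker file is found. The helper name `find_project_root` and the marker list are assumptions for illustration; they are not part of the `spot_scam` package.

```python
from pathlib import Path

# Hypothetical marker files; adjust to whatever the project root actually contains.
ROOT_MARKERS = ("pyproject.toml", "setup.py")


def find_project_root(start: Path, markers=ROOT_MARKERS) -> Path:
    """Return the nearest ancestor of `start` (inclusive) that holds a marker file."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if any((candidate / name).exists() for name in markers):
            return candidate
    raise FileNotFoundError(f"No marker {markers} found at or above {start}")
```

With this in place, `ROOT = find_project_root(Path.cwd())` would give the same `ROOT` even when the notebook is opened from a subdirectory.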

## 2. Load Configuration and Raw Data

In [None]:
from spot_scam.config.loader import load_config
from spot_scam.data.ingest import load_raw_dataset

config = load_config()
config

In [None]:
raw_df = load_raw_dataset(config)
raw_df.head()

## 3. Preprocess & Combine Text

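`preprocess_dataframe` builds the `text_all` field by combining the title, description, and the remaining text columns, and drops leak-prone columns (columns that could give away the target label).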

In [None]:
from spot_scam.data.preprocess import preprocess_dataframe

processed_df, text_fields = preprocess_dataframe(raw_df, config)
processed_df[text_fields + [config["data"]["target_column"], "text_all"]].head()
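
To make the `text_all` idea concrete, here is a minimal plain-Python sketch applied to a single posting. It assumes that fields are joined with a space and that empty or missing values are skipped; the real `preprocess_dataframe` may normalize text differently. The function `combine_text` is hypothetical.

```python
def combine_text(record: dict, text_fields: list[str], sep: str = " ") -> str:
    """Join the non-empty text fields of one posting into a single string."""
    parts = []
    for field in text_fields:
        value = record.get(field)
        # Skip missing values and whitespace-only strings.
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return sep.join(parts)


posting = {
    "title": "Immediate start data entry",
    "description": "Immediate start. Must send details.",
    "benefits": "",
    "requirements": None,
}
text_all = combine_text(posting, ["title", "description", "requirements", "benefits"])
print(text_all)
```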

## 4. Train/Validation/Test Splits & Feature Bundles

In [None]:
from spot_scam.data.split import create_splits
from spot_scam.features.builders import build_feature_bundle

splits = create_splits(processed_df, config, persist=False)
bundle = build_feature_bundle(splits.train, splits.val, splits.test, config)
bundle
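
The internals of `create_splits` are not shown in this notebook, so its exact strategy is unknown. As an illustration of one common approach for imbalanced fraud data, the sketch below performs a seeded, stratified split over label indices. The function name `stratified_split`, the fractions, and the seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Return (train, val, test) index lists with class ratios roughly preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return sorted(train), sorted(val), sorted(test)
```

Splitting each class separately keeps the rare fraudulent class represented in every split, which a purely random split does not guarantee.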

## 5. Model Training (Optional Inline Run)

Training can be invoked programmatically or via CLI. The cell below calls the Typer CLI directly (commented out to avoid accidental long runs).

In [None]:
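# %%capture
# from spot_scam.pipeline.train import app as train_cli
#
# # Equivalent to: PYTHONPATH=src python -m spot_scam.pipeline.train --skip-transformer
# # standalone_mode=False stops Click from calling sys.exit() when the command
# # finishes, which would otherwise raise SystemExit inside the notebook.
# train_cli(["--skip-transformer"], standalone_mode=False)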
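
For the CLI route, the shell command quoted in the cell above can also be assembled for `subprocess`. This is a sketch only: `build_train_command` is a hypothetical helper, and `--skip-transformer` is the only flag it assumes, since it is the only one mentioned here.

```python
import os
import sys


def build_train_command(skip_transformer: bool = True):
    """Return (argv, env) for launching the training module in a subprocess."""
    argv = [sys.executable, "-m", "spot_scam.pipeline.train"]
    if skip_transformer:
        argv.append("--skip-transformer")
    # Mirror `PYTHONPATH=src` from the shell command, keeping any existing entries.
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = "src" if not existing else os.pathsep.join(["src", existing])
    return argv, env


argv, env = build_train_command()
# To launch for real (this can take a long time):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

Running training in a subprocess keeps the notebook kernel's state untouched, at the cost of not having the trained objects in memory afterwards.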

## 6. Inspect Persisted Artifacts

In [None]:
import json
from spot_scam.utils.paths import ARTIFACTS_DIR

metadata_path = ARTIFACTS_DIR / "metadata.json"
with metadata_path.open() as f:
    metadata = json.load(f)
metadata

### Validation vs Test Metrics

In [None]:
import pandas as pd

pd.DataFrame([
    {"split": "validation", **metadata.get("val_metrics", {})},
    {"split": "test", **metadata.get("test_metrics", {})},
])
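
A large drop from validation to test can indicate overfitting to the validation split. The sketch below computes the per-metric difference, assuming `val_metrics` and `test_metrics` are flat mappings from metric name to number; the actual layout of `metadata.json` may differ. The helper `metric_gaps` is hypothetical.

```python
def metric_gaps(metadata: dict) -> dict:
    """Return test minus validation for every numeric metric present in both splits."""
    val = metadata.get("val_metrics", {})
    test = metadata.get("test_metrics", {})
    gaps = {}
    for name in sorted(val.keys() & test.keys()):
        v, t = val[name], test[name]
        # bool is a subclass of int, so exclude it explicitly.
        if all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in (v, t)):
            gaps[name] = t - v
    return gaps


# Toy values for illustration only, not results from this project.
example = {
    "val_metrics": {"f1": 0.5, "precision": 0.75},
    "test_metrics": {"f1": 0.25, "recall": 0.5},
}
print(metric_gaps(example))
```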

## 7. Inference & Explainability

In [None]:
from spot_scam.inference.predictor import FraudPredictor

predictor = FraudPredictor()
sample = {
    "title": "Immediate start data entry",
    "description": "Immediate start. Must send details.",
    "requirements": "Quickbooks experience preferred.",
    "benefits": "",
    "telecommuting": 1,
    "has_company_logo": 0,
    "has_questions": 0,
}
result = predictor.predict([sample])[0]
result

In [None]:
import json
print(json.dumps(result["explanation"], indent=2))

## 8. MLflow Integration

In [None]:
import mlflow

mlflow.set_tracking_uri("file:./mlruns")
runs_df = mlflow.search_runs()
runs_df[["run_id", "metrics.test_f1", "tags.mlflow.runName"]]
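
The `runs_df` table can also be used to pick a run ID programmatically. The sketch below selects the run with the highest `metrics.test_f1` and skips runs that never logged it. `best_run_id` is a hypothetical helper; it relies only on `mlflow.search_runs()` returning a pandas DataFrame with a `run_id` column.

```python
import pandas as pd


def best_run_id(runs_df: pd.DataFrame, metric: str = "metrics.test_f1") -> str:
    """Return the run_id with the highest `metric`, ignoring runs that lack it."""
    if metric not in runs_df.columns:
        raise KeyError(f"No run has logged {metric!r}")
    ranked = runs_df.dropna(subset=[metric]).sort_values(metric, ascending=False)
    if ranked.empty:
        raise ValueError(f"All runs are missing {metric!r}")
    return str(ranked.iloc[0]["run_id"])
```

Calling `best_run_id(runs_df)` on the frame from the cell above would return a value usable wherever a run ID is required.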

> **Serve the latest model** (outside the notebook):
>
> ```bash
> mlflow models serve --env-manager local -m runs:/<RUN_ID>/model -p 8080
> ```
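
Once the server is running, it can be queried over HTTP. The sketch below only builds the request and does not send it. It assumes MLflow 2.x, where the scoring server accepts JSON with a `dataframe_records` key at `/invocations`, and it assumes the logged model accepts the same raw fields as the `sample` dictionary used earlier, which this notebook does not confirm. `build_invocation_request` is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_invocation_request(records, host="127.0.0.1", port=8080) -> Request:
    """Build (but do not send) a POST request for the MLflow scoring endpoint."""
    payload = json.dumps({"dataframe_records": records}).encode("utf-8")
    return Request(
        url=f"http://{host}:{port}/invocations",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invocation_request(
    [{"title": "Immediate start data entry", "telecommuting": 1}]
)
# With the server running: urllib.request.urlopen(request).read()
```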

## 9. Visualizations

In [None]:
from IPython.display import Image, display

display(Image(filename=str(ROOT / "experiments/figs/pr_curve_test.png")))

## 10. Next Steps

- Fine-tune the transformer on domain-specific data.
- Deploy the MLflow model to staging.
- Monitor drift (token shifts, slice metrics).
- Extend explainability to the transformer (token attributions).
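
As a starting point for the drift-monitoring item, token shift between two text samples could be measured with the Jensen-Shannon divergence of their token frequencies. This is a sketch under simple assumptions (lowercased whitespace tokenization, no smoothing); the function names are hypothetical and nothing here is part of the current pipeline.

```python
import math
from collections import Counter


def token_distribution(texts):
    """Relative frequency of whitespace-separated, lowercased tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()} if total else {}


def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a, m):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in p.keys() | q.keys()}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

Comparing `token_distribution(training_texts)` against a distribution built from recent postings gives a single score to track over time; the alert threshold would need to be calibrated on historical data.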