# Chapter 06 - Lab 1: Training Layer (At-risk Language Baseline)

This notebook uses the synthetic CSV dataset in `Chapter06/Labs/assets/` to train a baseline text classifier.

- **Binary label**: `label` (1 = needs outreach, 0 = no outreach)
- **Join key**: `student_id` (messages <-> student profiles)


## What is 'at-risk' here?

For this lab, 'at-risk' means the message includes actionable signals that a learner may disengage, fail, or withdraw **without support**.
The model's job is to help triage messages and route them to the right follow-up workflow.

## 1) Install dependencies
Run once per new kernel.

In [None]:
%pip -q install pandas scikit-learn joblib

## 2) Load data (messages + student profiles)
We locate the MasterClass root by walking up from the current working directory until we find `LAB_INSTRUCTIONS.md`, so this notebook runs from any directory inside the workspace.

In [None]:
from __future__ import annotations

import hashlib
import json
from pathlib import Path

import pandas as pd


def find_masterclass_root(start: Path) -> Path:
    for candidate in [start, *start.parents]:
        if (candidate / 'LAB_INSTRUCTIONS.md').exists():
            return candidate
    raise FileNotFoundError('Could not find LAB_INSTRUCTIONS.md. Run from within the MasterClass workspace.')


MASTERCLASS_ROOT = find_masterclass_root(Path.cwd())
ASSETS_DIR = MASTERCLASS_ROOT / 'Chapter06' / 'Labs' / 'assets'
MESSAGES_CSV = ASSETS_DIR / 'at_risk_student_messages_500.csv'
PROFILES_CSV = ASSETS_DIR / 'at_risk_student_profiles.csv'

MESSAGES_CSV, PROFILES_CSV

In [None]:
messages = pd.read_csv(MESSAGES_CSV)
profiles = pd.read_csv(PROFILES_CSV)
messages['label'] = messages['label'].astype(int)

df = messages.merge(profiles, on='student_id', how='left', validate='many_to_one')

display(df.head(3))
df.shape
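
Because the merge above is a left join, a message whose `student_id` has no profile row is kept, but its profile columns are filled with NaN. The sketch below shows one way to count such rows using the `indicator` option of `DataFrame.merge`. The toy frames and their values are made up for illustration; in the lab you would pass `messages` and `profiles` instead.

```python
import pandas as pd

# Toy stand-ins for the lab's messages and profiles tables (values are made up).
toy_messages = pd.DataFrame({
    'student_id': [1, 1, 2, 3],
    'text': ['a', 'b', 'c', 'd'],
})
toy_profiles = pd.DataFrame({
    'student_id': [1, 2],
    'study_mode': ['online', 'campus'],
})

# indicator=True adds a `_merge` column; for a left join it is 'both' or 'left_only'.
checked = toy_messages.merge(
    toy_profiles, on='student_id', how='left', validate='many_to_one', indicator=True
)

# Rows marked 'left_only' are messages without a matching profile.
unmatched = int((checked['_merge'] == 'left_only').sum())
print(f'{unmatched} of {len(checked)} messages have no matching profile')
```

A non-zero count is not necessarily an error, but it explains where missing profile values in the EDA come from.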

In [None]:
# Quick EDA: class balance and the columns with the highest share of missing values
label_counts = df['label'].value_counts(dropna=False).rename_axis('label').to_frame('count')
missing = df.isna().mean().sort_values(ascending=False).to_frame('missing_rate')

display(label_counts)
missing.head(10)

## 3) Train a baseline model
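A strong, simple baseline for text classification is **TF-IDF + Logistic Regression**: TF-IDF turns each message into weighted unigram and bigram counts, and logistic regression learns one weight per term.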

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = df['text'].fillna('').astype(str)  # keep missing text empty rather than the literal string 'nan'
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline(
    steps=[
        ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95)),
        ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)),
    ]
)

model.fit(X_train, y_train)
model

## 4) Evaluate
Record metrics and inspect a few errors.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score

y_pred = model.predict(X_test)

metrics = {
    'accuracy': float((y_pred == y_test).mean()),
    'precision': float(precision_score(y_test, y_pred, zero_division=0)),
    'recall': float(recall_score(y_test, y_pred, zero_division=0)),
    'f1': float(f1_score(y_test, y_pred, zero_division=0)),
    'confusion_matrix': confusion_matrix(y_test, y_pred).tolist(),
    'evaluated_rows': int(len(y_test)),
}

print(classification_report(y_test, y_pred, digits=3, zero_division=0))
metrics
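
To make the numbers in `metrics` easier to interpret, the sketch below recomputes precision, recall, F1, and accuracy by hand from a 2x2 confusion matrix. The counts are made up for illustration; the layout (rows are true labels, columns are predicted labels, in label order 0 then 1) matches what `confusion_matrix` returns for a binary problem.

```python
# Made-up confusion matrix in scikit-learn's layout for labels [0, 1]:
# rows are true labels, columns are predicted labels.
cm = [[50, 10],
      [5, 35]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)  # of the messages flagged, how many needed outreach
recall = tp / (tp + fn)     # of the messages needing outreach, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}')
# prints: precision=0.778 recall=0.875 f1=0.824 accuracy=0.850
```

For a triage use case, false negatives (the `fn` cell) are the messages that needed outreach but were not flagged, so recall is often the metric to watch first.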

In [None]:
# Example errors (false positives + false negatives)
test_df = pd.DataFrame({'text': X_test, 'y_true': y_test, 'y_pred': y_pred}).reset_index(drop=True)

display(test_df[(test_df.y_true == 0) & (test_df.y_pred == 1)].head(3))
display(test_df[(test_df.y_true == 1) & (test_df.y_pred == 0)].head(3))

## 5) Save artifacts
Save the model, metrics, parameters, and dataset fingerprints (SHA-256 hashes and row counts) so the run can be reproduced and audited later.

In [None]:
import joblib

ARTIFACT_DIR = MASTERCLASS_ROOT / 'Chapter06' / 'Labs' / 'artifacts' / 'at_risk_baseline'
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open('rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            h.update(chunk)
    return h.hexdigest()

joblib.dump(model, ARTIFACT_DIR / 'model.joblib')
(ARTIFACT_DIR / 'metrics.json').write_text(json.dumps(metrics, indent=2), encoding='utf-8')
(ARTIFACT_DIR / 'params.json').write_text(
    json.dumps({
        'model': 'tfidf + logistic_regression',
        'random_state': 42,
        'test_size': 0.2,
        'tfidf': {'ngram_range': [1, 2], 'min_df': 2, 'max_df': 0.95},
        'logreg': {'max_iter': 1000, 'class_weight': 'balanced'},
    }, indent=2),
    encoding='utf-8',
)
(ARTIFACT_DIR / 'data_version.json').write_text(
    json.dumps({
        'messages_csv': str(MESSAGES_CSV.relative_to(MASTERCLASS_ROOT)),
        'profiles_csv': str(PROFILES_CSV.relative_to(MASTERCLASS_ROOT)),
        'messages_sha256': sha256_file(MESSAGES_CSV),
        'profiles_sha256': sha256_file(PROFILES_CSV),
        'messages_rows': int(messages.shape[0]),
        'profiles_rows': int(profiles.shape[0]),
    }, indent=2),
    encoding='utf-8',
)

sorted([p.name for p in ARTIFACT_DIR.iterdir()])
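
The fingerprints are only useful if something checks them later. The sketch below shows one possible check: recompute a file's SHA-256 and compare it with the value recorded in `data_version.json`. The helper `data_matches` is hypothetical and not part of the lab, and the demo uses a temporary directory with a stand-in CSV rather than the real assets.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in 1 MiB chunks (same approach as the cell above)."""
    h = hashlib.sha256()
    with path.open('rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            h.update(chunk)
    return h.hexdigest()


def data_matches(data_path: Path, version_path: Path, key: str) -> bool:
    """Hypothetical helper: compare a file's current hash with the recorded one."""
    recorded = json.loads(version_path.read_text(encoding='utf-8'))
    return recorded[key] == sha256_file(data_path)


# Demo in a temporary directory with a stand-in CSV.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'messages.csv'
    version_path = Path(tmp) / 'data_version.json'

    csv_path.write_text('student_id,text,label\n1,hello,0\n', encoding='utf-8')
    version_path.write_text(
        json.dumps({'messages_sha256': sha256_file(csv_path)}), encoding='utf-8'
    )
    unchanged = data_matches(csv_path, version_path, 'messages_sha256')

    # Any edit to the file changes the hash, so the check now fails.
    csv_path.write_text('student_id,text,label\n1,hello,1\n', encoding='utf-8')
    changed = data_matches(csv_path, version_path, 'messages_sha256')

print(unchanged, changed)
# prints: True False
```

Running a check like this before loading `model.joblib` helps confirm that saved metrics are compared against the same data they were computed on.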

## Next steps (optional)
- Join additional CSVs later (e.g., outcomes or engagement data) to demonstrate feature enrichment
- Add metadata features (e.g., `study_mode`, `week`) and compare metrics against the text-only baseline
- Swap the classifier for `LinearSVC` or `MultinomialNB`
- Use a time-based split on `created_at`
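
As a starting point for the last item, here is a minimal sketch of a time-based split: sort by timestamp, train on the earliest rows, and test on the most recent ones. Only the column name `created_at` comes from this lab; the toy rows, the 80/20 cutoff, and the variable names are assumptions for illustration. If `created_at` is read from CSV as text, convert it with `pd.to_datetime` first.

```python
import pandas as pd

# Made-up rows; only the `created_at` column name comes from the lab.
toy = pd.DataFrame({
    'text': [f'message {i}' for i in range(10)],
    'label': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    'created_at': pd.date_range('2024-01-01', periods=10, freq='D'),
})

# Shuffle first to show that the split depends on timestamps, not on row order.
toy = toy.sample(frac=1.0, random_state=42)

test_size = 0.2
ordered = toy.sort_values('created_at').reset_index(drop=True)
cutoff = len(ordered) - int(len(ordered) * test_size)

train_part = ordered.iloc[:cutoff]  # earliest messages
test_part = ordered.iloc[cutoff:]   # most recent messages

print(len(train_part), len(test_part))
print(train_part['created_at'].max() < test_part['created_at'].min())
# prints: 8 2
# prints: True
```

Unlike the stratified random split used earlier, a time-based split does not preserve class balance, so it is worth checking `label` counts in each part before comparing metrics.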
