# ML Template Library Overview

This notebook is a **map** of your reusable ML templates.

Each template is a starting point with:

- Verbose markdown (when to use, what to do next)
- Working code skeletons
- Clear config blocks at the top

You can copy any of these into a new Kaggle / project folder and edit:
- `DATA_DIR`
- File names (train/test)
- Column names (ID, target, time, text, etc.)

---


## 1. Core Supervised Templates (Tabular)

### 1.1 Regression (single target)
**File:** `tabular_regression_template.ipynb` (your main regression template)

**Use when:**
- Target is continuous (e.g., price, points, rating-as-number)
- You care about RMSE/MAE/R²

**Workflow:**
1. Basic EDA
2. Feature typing (numeric/categorical)
3. Imputation + scaling
4. Linear baseline + tree/boosting models
5. Cross-validation and model comparison
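
The workflow above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the template's actual code: the column names (`sqft`, `age`, `city`, `price`) and the choice of `Ridge` and `RandomForestRegressor` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train.csv (hypothetical columns).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "sqft": rng.normal(1500, 400, n),
    "age": rng.integers(0, 50, n).astype(float),
    "city": rng.choice(["a", "b", "c"], n),
})
df["price"] = 100 * df["sqft"] - 500 * df["age"] + rng.normal(0, 5000, n)
df.loc[rng.choice(n, 20, replace=False), "sqft"] = np.nan  # inject missing values

TARGET_COL = "price"
X, y = df.drop(columns=[TARGET_COL]), df[TARGET_COL]

# Feature typing: numeric vs. categorical.
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

# Imputation + scaling for numeric columns, imputation + one-hot for categorical.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

# Linear baseline and a tree ensemble, compared with the same CV folds.
models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name}: CV RMSE = {cv_rmse[name]:.1f}")
```

Keeping preprocessing inside the pipeline means imputation and scaling are refit on each training fold, so no validation rows influence them.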

---

### 1.2 Multi-Target Regression
**File:** `multi_target_regression_template.ipynb`

**Use when:**
- You predict **multiple numeric targets** at once (e.g., x,y,z coordinates; multiple stats per player)

**Key ideas:**
- Wrap base regressor into `MultiOutputRegressor`
- Evaluate per-target and overall metrics
- Optionally inspect correlations between targets and consider dimensionality reduction on the target space
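
As a rough sketch of these ideas (synthetic data, hypothetical target names `x`, `y`, `z`, and `GradientBoostingRegressor` chosen only as an example base model):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Synthetic features and three numeric targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
Y = np.column_stack([
    X[:, 0] + 0.1 * rng.normal(size=400),
    X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=400),
    X[:, 3] ** 2 + 0.1 * rng.normal(size=400),
])
target_names = ["x", "y", "z"]  # hypothetical names

X_train, X_valid, Y_train, Y_valid = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# MultiOutputRegressor fits one independent copy of the base model per target.
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_valid)

# Per-target RMSE.
per_target_rmse = {
    name: float(np.sqrt(mean_squared_error(Y_valid[:, i], Y_pred[:, i])))
    for i, name in enumerate(target_names)
}
# Overall score: square root of the MSE averaged uniformly over targets.
overall_rmse = float(np.sqrt(mean_squared_error(Y_valid, Y_pred)))

# Correlation between targets (training split only).
target_corr = np.corrcoef(Y_train, rowvar=False)

print(per_target_rmse, overall_rmse)
```

If `target_corr` shows strongly correlated targets, that is a hint that a shared or reduced representation might be worth trying.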


### 1.3 Classification (binary / multiclass)
**File:** `tabular_classification_template.ipynb`

**Use when:**
- Target is a **discrete label** (0/1, or multiple classes)
- Metrics of interest: accuracy, F1, ROC-AUC, etc.

**Workflow:**
1. Explore class balance
2. Handle missing values & encoding
3. Try baseline models: LogisticRegression, RandomForest, plus an optional XGBoost/LightGBM hook
4. Use stratified train/validation split
5. Inspect confusion matrix & per-class metrics

**Decision guide:**
- Strong linear separability → LogisticRegression or a linear SVM
- Complex interactions → Tree/boosting methods
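
A minimal sketch of the stratified split, baseline models, and confusion-matrix inspection, assuming an imbalanced binary problem generated with `make_classification` (the data and hyperparameters are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 80% class 0, 20% class 1.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    # Rows are true classes, columns are predicted classes.
    results[name] = confusion_matrix(y_valid, y_pred)
    print(name)
    print(results[name])
    print(classification_report(y_valid, y_pred, digits=3))
```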


## 2. Time & Temporal Structure

### 2.1 Time-Series Regression (forecasting via tabular)
**File:** `time_series_template.ipynb`

**Use when:**
- You have `TIME_COL` (+ optional `ID_COL` for panel data)
- You want to forecast a numeric target using lags/rolling stats

**Workflow:**
1. Sort by time, respect temporal order
2. Create lag features, rolling means/std
3. Time-based train/validation split (no leakage)
4. Tree-based model on lagged features as a baseline
5. Optionally add calendar features (day-of-week, month, etc.)

**Decision guide:**
- Start with lagged-feature + tree model
- Move to dedicated time-series models (Prophet, ARIMA, deep-learning forecasters) only when the baseline is exhausted
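
The lag-feature baseline might look something like this. It is a sketch on a synthetic daily series; the column names `date` and `y`, the lag choices, and the 80/20 split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

TIME_COL, TARGET_COL = "date", "y"  # hypothetical column names

# Synthetic daily series with a weekly cycle plus noise.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    TIME_COL: pd.date_range("2023-01-01", periods=n, freq="D"),
    TARGET_COL: 10 + 3 * np.sin(2 * np.pi * np.arange(n) / 7)
                + rng.normal(0, 0.5, n),
})
df = df.sort_values(TIME_COL).reset_index(drop=True)

# Lag features: the target as observed 1, 7, and 14 steps earlier.
for lag in (1, 7, 14):
    df[f"lag_{lag}"] = df[TARGET_COL].shift(lag)

# Rolling stats computed on the shifted series, so each window
# only contains values strictly before the current row.
past = df[TARGET_COL].shift(1)
df["roll_mean_7"] = past.rolling(7).mean()
df["roll_std_7"] = past.rolling(7).std()

# Calendar feature.
df["dayofweek"] = df[TIME_COL].dt.dayofweek

df = df.dropna().reset_index(drop=True)
feature_cols = [c for c in df.columns if c not in (TIME_COL, TARGET_COL)]

# Time-based split: the validation block lies entirely after the training block.
split = int(len(df) * 0.8)
train, valid = df.iloc[:split], df.iloc[split:]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[feature_cols], train[TARGET_COL])
pred = model.predict(valid[feature_cols])

mae = mean_absolute_error(valid[TARGET_COL], pred)
naive_mae = mean_absolute_error(valid[TARGET_COL], valid["lag_1"])
print(f"model MAE = {mae:.3f}, naive last-value MAE = {naive_mae:.3f}")
```

Comparing against the naive last-value forecast is a cheap sanity check that the lag features add something.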


### 2.2 Survival / Time-to-Event
**File:** `survival_time_to_event_template.ipynb`

**Use when:**
- You have **time until event** (churn, injury, failure)
- Some rows are **censored** (event not observed yet)

**Core columns:**
- `DURATION_COL`: how long each subject was observed
- `EVENT_COL`: 1 if event occurred, 0 if censored

**Workflow:**
1. Exploratory Kaplan–Meier curves
2. Cox proportional hazards model with `lifelines`
3. Extract `cox_risk_score`
4. Evaluate with concordance index
5. Optionally create risk groups (low/med/high)

**Decision guide:**
- Start with Cox
- If the proportional hazards (PH) assumption fails or relationships are complex, explore nonlinear models (GBMs, neural nets) later
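
To make the evaluation step concrete, here is a from-scratch sketch of Harrell's concordance index and a tertile-based risk grouping. It is plain NumPy written for illustration and does not use `lifelines`; the toy durations, events, and risk scores are made up.

```python
import numpy as np

def concordance_index(durations, events, risk_scores):
    """Harrell's C: share of comparable pairs where the subject with the
    earlier observed event has the higher risk score (ties count 0.5)."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk_scores = np.asarray(risk_scores, dtype=float)

    concordant, comparable = 0.0, 0
    for i in range(len(durations)):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        later = durations > durations[i]
        comparable += int(later.sum())
        concordant += float((risk_scores[i] > risk_scores[later]).sum())
        concordant += 0.5 * float((risk_scores[i] == risk_scores[later]).sum())
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: DURATION_COL, EVENT_COL (1 = event, 0 = censored), and a risk score.
durations = [5, 8, 3, 10, 6]
events = [1, 0, 1, 1, 0]
risk = np.array([0.6, 0.3, 0.9, 0.1, 0.7])

c_index = concordance_index(durations, events, risk)
print(f"C-index = {c_index:.3f}")

# Risk groups from tertiles of the score.
cuts = np.quantile(risk, [1 / 3, 2 / 3])
groups = np.array(["low", "med", "high"])[np.digitize(risk, cuts)]
print(groups.tolist())
```

A C-index of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of risks against event times.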


### 2.3 Sequence Classification
**File:** `sequence_classification_template.ipynb`

**Use when:**
- Each sample is a **fixed-length sequence** (e.g., `seq_0 ... seq_T-1`)
- You need a label for the whole sequence (normal/anomaly, class type, etc.)

**Two paths inside:**
1. Feature-based: mean/std/min/max/slope → RandomForest
2. Optional 1D CNN (if Keras is available)

**Decision guide:**
- Start with feature-based baseline
- Move to CNN/RNN when you suspect richer temporal patterns and have enough data
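
The feature-based path can be sketched like this. The helper name `summarize_sequences` and the synthetic "flat vs. upward trend" data are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def summarize_sequences(seqs):
    """Collapse each fixed-length sequence into mean/std/min/max/slope."""
    seqs = np.asarray(seqs, dtype=float)
    t = np.arange(seqs.shape[1])
    t_centered = t - t.mean()
    # Least-squares slope of each row against the time index.
    centered = seqs - seqs.mean(axis=1, keepdims=True)
    slope = centered @ t_centered / (t_centered ** 2).sum()
    return np.column_stack([
        seqs.mean(axis=1), seqs.std(axis=1),
        seqs.min(axis=1), seqs.max(axis=1), slope,
    ])

# Synthetic data: class 1 sequences carry an upward trend, class 0 are flat noise.
rng = np.random.default_rng(0)
n, T = 200, 50
y = rng.integers(0, 2, n)
seqs = rng.normal(0, 1, (n, T)) + y[:, None] * np.linspace(0, 2, T)

features = summarize_sequences(seqs)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, features, y, cv=5, scoring="accuracy").mean()
print(f"CV accuracy = {acc:.3f}")
```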


## 3. Unsupervised & Semi-Supervised

### 3.1 Clustering & Dimensionality Reduction
**File:** `clustering_dimred_template.ipynb`

**Use when:**
- No labels, want to discover **structure** in tabular data
- Need segments / cohorts / player archetypes
- Want 2D embeddings for visualization

**Workflow:**
1. Choose features (numeric or encoded)
2. Standardize → PCA
3. Inspect explained variance and 2D PCA plot
4. KMeans on PCA components; elbow + silhouette to choose k
5. Cluster profiling (per-cluster feature means)
6. Optional t-SNE, hierarchical, DBSCAN

**Outputs:**
- Original data + `cluster` column
- PCA embeddings for plotting or downstream models
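
A compact sketch of the standardize → PCA → KMeans path, using `make_blobs` as stand-in data. The feature names `f0..f5` and the candidate range for k are assumptions for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic tabular data with some cluster structure.
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=6,
                      cluster_std=1.0, random_state=0)
feature_cols = [f"f{i}" for i in range(6)]
df = pd.DataFrame(X_raw, columns=feature_cols)

# Standardize, then project to two principal components.
X_scaled = StandardScaler().fit_transform(df[feature_cols])
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Inertia (for the elbow plot) and silhouette score for each candidate k.
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_pca, km.labels_)
best_k = max(silhouettes, key=silhouettes.get)

# Final assignment and per-cluster feature means for profiling.
df["cluster"] = KMeans(n_clusters=best_k, n_init=10,
                       random_state=0).fit_predict(X_pca)
profile = df.groupby("cluster")[feature_cols].mean()
print(profile)
```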


### 3.2 Anomaly / Outlier Detection
**File:** `anomaly_detection_template.ipynb`

**Use when:**
- You want to find **unusual** points (fraud, weird games, anomalies in ops logs, sensor glitches)

**Methods inside:**
- IsolationForest (recommended baseline)
- LocalOutlierFactor (local density)
- Optional One-Class SVM

**Workflow:**
1. Decide between unsupervised and semi-supervised mode (depending on whether `LABEL_COL` is available)
2. Feature selection + scaling
3. Train anomaly detectors
4. Convert scores to flags by thresholding at the top X%
5. If labels exist, tune threshold using precision/recall/F1

**Next steps:**
- Inspect top anomalies manually
- Cluster anomalies themselves
- Feed anomaly scores/flags into supervised models
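
As an illustration of the score-to-flag step, the sketch below fits an `IsolationForest` on synthetic data and flags the top 5% of scores. The value of `TOP_FRACTION` and the injected outliers are assumptions; the labels are used only for evaluation, mirroring the "if labels exist" case.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: 950 normal points plus 50 scattered outliers.
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (950, 4))
outliers = rng.uniform(-8, 8, (50, 4))
X = np.vstack([normal, outliers])
y_true = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]

X_scaled = StandardScaler().fit_transform(X)

iso = IsolationForest(n_estimators=200, random_state=0).fit(X_scaled)
# score_samples is lower for more abnormal points, so negate it
# to get a score where higher means more anomalous.
anomaly_score = -iso.score_samples(X_scaled)

# Flag the top X% most anomalous rows.
TOP_FRACTION = 0.05  # hypothetical setting
threshold = np.quantile(anomaly_score, 1 - TOP_FRACTION)
flags = (anomaly_score >= threshold).astype(int)

# With labels available, check how good the flags are.
precision = precision_score(y_true, flags)
recall = recall_score(y_true, flags)
print(f"flagged={flags.sum()}, precision={precision:.2f}, recall={recall:.2f}")
```

When labels exist, sweeping `TOP_FRACTION` and tracking precision/recall is one simple way to tune the threshold.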


## 4. Text & NLP

### 4.1 NLP Text Classification
**File:** `nlp_text_classification_template.ipynb`

**Use when:**
- You have text + label (e.g., sentiment, topic, toxicity)

**Workflow:**
1. Inspect label distribution and text lengths
2. Split train/validation (stratified)
3. TF-IDF vectorization (ngrams, max_features, min_df)
4. Baselines:
   - LogisticRegression
   - RandomForest

**Decision guide:**
- Tune TF-IDF first (n-grams, min_df)
- Try other linear models (LinearSVC, SGDClassifier)
- Once saturated, move to transformer embeddings or end-to-end transformers


## 5. Structured Variants

### 5.1 Ordinal Regression
**File:** `ordinal_regression_template.ipynb`

**Use when:**
- Labels are **ordered categories** (1–5, grades, severity levels)

**Approaches inside:**
1. Treat as numeric regression → round
2. Treat as multiclass classification
3. Simple cumulative ordinal model (K-1 logistic regressions)

**Decision guide:**
- Many ordered levels, nearly continuous → regression baseline
- Small number of levels, and it matters how far off a prediction is (predicting 5 for a true 1 is worse than predicting 2) → ordinal cumulative model
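
One way to build the cumulative model from K-1 logistic regressions is sketched below: each binary model estimates P(y > c) for one threshold, and class probabilities come from differencing adjacent thresholds. The class name `CumulativeOrdinalClassifier` and the synthetic data are hypothetical, and this may differ from the template's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class CumulativeOrdinalClassifier:
    """K-1 independent binary models, one per threshold 'y > c'."""

    def __init__(self, max_iter=1000):
        self.max_iter = max_iter

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.models_ = [
            LogisticRegression(max_iter=self.max_iter).fit(X, (y > c).astype(int))
            for c in self.classes_[:-1]
        ]
        return self

    def predict_proba(self, X):
        # P(y > c_k) for each threshold, shape (n_samples, K-1).
        p_gt = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        # Pad with 1 on the left and 0 on the right, then difference
        # neighbours to get P(y = c_k).
        ones, zeros = np.ones(len(p_gt)), np.zeros(len(p_gt))
        padded = np.column_stack([ones, p_gt, zeros])
        proba = padded[:, :-1] - padded[:, 1:]
        # Independent models can cross, giving small negatives: clip, renormalize.
        proba = np.clip(proba, 0, None)
        return proba / proba.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Synthetic ordered labels 1..4 obtained by cutting a noisy latent score.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
latent = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 600)
y = np.digitize(latent, [-1.5, 0.0, 1.5]) + 1

model = CumulativeOrdinalClassifier().fit(X[:450], y[:450])
pred = model.predict(X[450:])
mae = np.abs(pred - y[450:]).mean()  # average distance in levels
print(f"MAE in levels = {mae:.3f}")
```

Reporting MAE in levels alongside accuracy reflects the point of the decision guide: a miss by one level should cost less than a miss by three.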


## 6. Graph-Structured Problems

### 6.1 Graph ML (Classical)
**File:** `graph_ml_template.ipynb`

**Use when:**
- You have nodes + edges (players, users, items; and their relationships)
- Want node-level predictions or link signals without a full GNN stack

**Workflow:**
1. Load `nodes.csv` and `edges.csv`
2. Build NetworkX graph
3. Compute node features:
   - Degree, clustering coefficient, PageRank
4. Node classification (if labels exist) using RandomForest
5. Simple link prediction sketch (common neighbors)

**Next-level path:**
- Move to PyTorch Geometric / DGL for full GNNs when this baseline is exhausted.
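
The node-feature and link-prediction steps can be sketched with NetworkX on a toy edge list. The column names `source` and `target` are assumptions about what `edges.csv` might contain, and the five-node graph is made up.

```python
import networkx as nx
import pandas as pd

# Toy stand-in for edges.csv (hypothetical column names).
edges = pd.DataFrame({
    "source": ["a", "a", "b", "c", "d"],
    "target": ["b", "c", "c", "d", "e"],
})
G = nx.from_pandas_edgelist(edges, source="source", target="target")

# Node-level features, one row per node.
node_features = pd.DataFrame({
    "degree": dict(G.degree()),
    "clustering": nx.clustering(G),
    "pagerank": nx.pagerank(G),
})
print(node_features)

# Link prediction sketch: score every unconnected pair
# by its number of common neighbors.
link_scores = {
    tuple(sorted((u, v))): len(list(nx.common_neighbors(G, u, v)))
    for u, v in nx.non_edges(G)
}
ranked_links = sorted(link_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked_links)
```

The `node_features` frame can be joined to node labels and passed to a tabular classifier such as RandomForest for the node-classification step.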


## 7. Shared Utilities: `utils_tabular_ml.py`

You also have a small **shared utilities module**:

**File:** `utils_tabular_ml.py`

Contains helpers for:

- Loading CSVs (`load_csv`)
- Quick summaries (`summarize_dataframe`)
- Feature type detection:
  - `get_numeric_features`
  - `get_categorical_features`
- Simple train/validation split helper (`basic_train_valid_split`)
- EDA plotting helpers:
  - `plot_numeric_distributions`
  - `plot_correlation_heatmap`
  - `plot_target_distribution`

Example usage:

```python
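from pathlib import Path
from utils_tabular_ml import (
    load_csv,
    summarize_dataframe,
    get_numeric_features,
    get_categorical_features,
    plot_numeric_distributions,
)

# Config: point DATA_DIR at the folder that holds your CSVs.
DATA_DIR = Path("../input")

# Load the training data and produce a quick summary.
df = load_csv(DATA_DIR / "train.csv")
summarize_dataframe(df, "train")

# Detect numeric features, leaving out the ID and target columns.
num_cols = get_numeric_features(df, exclude=["id", "target"])

# Plot histograms for the first eight numeric features.
plot_numeric_distributions(df, cols=num_cols[:8], title="Numeric feature histograms")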
```

You don’t need to retrofit old notebooks right away. Just start using these helpers
for new projects or when you decide to refactor.
