# 🧠 AI / Machine Learning / Deep Learning — Ultimate Cheat Sheet
*Generated: 2025-10-14 18:05*

---
**How to use:**
- This notebook is a living cheat sheet: scan, copy, adapt.
- Each section has minimal runnable code + links to deeper blocks.
- Use `CTRL/CMD + F` to jump, or JupyterLab ToC for navigation.

> Tip: Keep your dataset-specific cells at the bottom in the Playground section.

## Table of Contents
1. [Boilerplate & Reproducibility](#boilerplate)
2. [Math Quickies (Linear Algebra & Prob/Stats)](#math)
3. [Data Loading & Cleaning](#data)
4. [EDA & Visualization](#eda)
5. [Feature Engineering & Selection](#fe)
6. [Modeling Patterns (sklearn)](#sklearn)
7. [Imbalanced Learning (SMOTE, Class Weights, Bagging)](#imbalance)
8. [Evaluation Metrics & Curves (ROC/PR)](#metrics)
9. [Cross-Validation & Pipelines](#cv)
10. [From-Scratch Mini-Implementations](#scratch)
11. [PyTorch Templates](#torch)
12. [TensorFlow/Keras Templates](#keras)
13. [Hyperparameter Tuning (Grid/Random)](#tuning)
14. [Model Interpretability (SHAP/LIME tips)](#interpret)
15. [Saving, Loading, and Deployment Snips](#deploy)
16. [Troubleshooting & Performance Tips](#troubleshoot)
17. [Playground / Your Dataset Hooks](#playground)


## 1) Boilerplate & Reproducibility <a id='boilerplate'></a>

In [None]:
import os, sys, math, gc, json, random, time
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # use default styles (no seaborn)
from pprint import pprint

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

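def seed_everything(seed=RANDOM_STATE):
    # Seed Python and NumPy too, so calling with a new seed reseeds everything
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch is optional
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
seed_everything()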

def tic():
    return time.time()

def toc(t0, label='Elapsed'):
    print(f"{label}: {time.time()-t0:.3f}s")
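
The seeds above control the global random state. As a complementary sketch (not part of the original cell), a local NumPy `Generator` keeps randomness explicit per function call. The helper names `make_rng` and `noisy_sample` are hypothetical and used only for illustration.

```python
import numpy as np

RANDOM_STATE = 42  # same value as the boilerplate cell

def make_rng(seed=RANDOM_STATE):
    """Return an independent NumPy Generator seeded with `seed`."""
    return np.random.default_rng(seed)

def noisy_sample(n, rng):
    """Draw n standard-normal values from the given generator."""
    return rng.normal(loc=0.0, scale=1.0, size=n)

# Two generators built from the same seed reproduce the same draws
a = noisy_sample(5, make_rng())
b = noisy_sample(5, make_rng())
print('identical draws:', np.array_equal(a, b))
```

Passing the generator around avoids hidden coupling between cells that all draw from the global state.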


## 2) Math Quickies (Linear Algebra & Prob/Stats) <a id='math'></a>

In [None]:
# Vector norms & cosine similarity
a = np.array([1,2,3]); b = np.array([4,5,6])
l2_a = np.linalg.norm(a)
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print('||a||2 =', l2_a)
print('cos(a,b)=', cos)


In [None]:
# Sigmoid, Softmax, Cross-Entropy
def sigmoid(z):
    return 1/(1+np.exp(-z))
def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
def binary_cross_entropy(y, p, eps=1e-9):
    p = np.clip(p, eps, 1-eps)
    return -(y*np.log(p) + (1-y)*np.log(1-p)).mean()
print('sigmoid(0)=', sigmoid(0.0))
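
A short worked example may help show why `softmax` subtracts the row maximum and how `binary_cross_entropy` behaves. The functions are repeated so the snippet runs on its own; the input arrays are made-up illustration values.

```python
import numpy as np

def softmax(z):
    # Subtracting the row max leaves the result unchanged but avoids overflow
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def binary_cross_entropy(y, p, eps=1e-9):
    p = np.clip(p, eps, 1 - eps)  # keep log() away from 0
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

logits = np.array([[1000.0, 1001.0, 1002.0],   # a naive exp() would overflow here
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)
print('row sums:', probs.sum(axis=-1))
print('uniform row:', probs[1])

y = np.array([1, 0, 1, 0])
p_good = np.array([0.9, 0.1, 0.8, 0.2])   # confident and correct
p_bad = np.array([0.1, 0.9, 0.2, 0.8])    # confident and wrong
bce_good = binary_cross_entropy(y, p_good)
bce_bad = binary_cross_entropy(y, p_bad)
print('BCE good vs bad:', bce_good, bce_bad)
```

The loss for the confident-and-wrong predictions comes out much larger, which is the behaviour cross-entropy is meant to have.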

## 3) Data Loading & Cleaning <a id='data'></a>

In [None]:
# Load CSV quickly
CSV_PATH = 'your_data.csv'  # change me
if Path(CSV_PATH).exists():
    df = pd.read_csv(CSV_PATH)
    display(df.head())
else:
    print('> Tip: set CSV_PATH to your file path and re-run.')


In [None]:
# Basic cleaning template
def quick_clean(df):
    df = df.copy()
    # Strip whitespace col names
    df.columns = [c.strip() for c in df.columns]
    # Drop dupes
    df = df.drop_duplicates()
    # Example: fill numeric NA with median
    num_cols = df.select_dtypes(include=[np.number]).columns
    for c in num_cols:
        df[c] = df[c].fillna(df[c].median())
    return df

try:
    df = quick_clean(df)
except NameError:
    pass
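
To see what the cleaning template does, here is a small sketch on a made-up DataFrame (`toy` and its column names are illustration values only). `quick_clean` is repeated so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

def quick_clean(df):
    df = df.copy()
    df.columns = [c.strip() for c in df.columns]   # ' age ' -> 'age'
    df = df.drop_duplicates()                      # exact duplicate rows only
    num_cols = df.select_dtypes(include=[np.number]).columns
    for c in num_cols:
        df[c] = df[c].fillna(df[c].median())       # median of the remaining rows
    return df

toy = pd.DataFrame({
    ' age ': [20.0, 20.0, np.nan, 40.0],
    'city': ['A', 'A', 'B', 'C'],
})
clean = quick_clean(toy)
print(clean)
```

The duplicate row is dropped before the median is computed, so the missing age is filled from the two remaining values.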

## 4) EDA & Visualization <a id='eda'></a>

In [None]:
# Hist & correlation (matplotlib only)
def plot_hist(df, col):
    plt.figure()
    df[col].plot(kind='hist', bins=30)
    plt.title(f'Histogram: {col}')
    plt.show()

def quick_corr(df):
    plt.figure()
    plt.imshow(df.corr(numeric_only=True), aspect='auto')
    plt.title('Correlation (numeric only)')
    plt.colorbar()
    plt.show()

## 5) Feature Engineering & Selection <a id='fe'></a>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

def basic_numeric_preprocess(df, features):
    num_pipe = Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ])
    pre = ColumnTransformer([
        ('num', num_pipe, features)
    ])
    return pre


## 6) Modeling Patterns (sklearn) <a id='sklearn'></a>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def fit_basic_models(X, y, features=None, test_size=0.2, rs=RANDOM_STATE):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size, stratify=y, random_state=rs)
    if features is None:
        features = X.columns.tolist()
    pre = basic_numeric_preprocess(X, features)
    models = {
        'LR': LogisticRegression(max_iter=1000, n_jobs=None),
        'DT': DecisionTreeClassifier(random_state=rs),
        'RF': RandomForestClassifier(n_estimators=200, random_state=rs, n_jobs=-1),
        'GB': GradientBoostingClassifier(random_state=rs)
    }
    results = {}
    for name, clf in models.items():
        pipe = Pipeline([('pre', pre), ('clf', clf)])
        t0 = tic(); pipe.fit(Xtr, ytr); toc(t0, f'Fit {name}')
        if hasattr(pipe.named_steps['clf'], 'predict_proba'):
            yprob = pipe.predict_proba(Xte)[:,1]
        else:
            # Fallback: min-max scale decision_function scores to [0, 1] (rough, uncalibrated)
            score = pipe.decision_function(Xte)
            yprob = (score - score.min())/(score.max()-score.min()+1e-9)
        ypred = pipe.predict(Xte)
        results[name] = {
            'acc': accuracy_score(yte, ypred),
            'f1': f1_score(yte, ypred),
            'roc_auc': roc_auc_score(yte, yprob)
        }
    return results


## 7) Imbalanced Learning (SMOTE, Class Weights, Bagging) <a id='imbalance'></a>

In [None]:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedBaggingClassifier

def smote_rf_template(X, y, features=None, test_size=0.2):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size, stratify=y, random_state=RANDOM_STATE)
    if features is None:
        features = X.columns.tolist()
    pre = basic_numeric_preprocess(X, features)
    pipe = ImbPipeline([
        ('pre', pre),
        ('smote', SMOTE(random_state=RANDOM_STATE, k_neighbors=5, sampling_strategy='auto')),
        ('clf', RandomForestClassifier(n_estimators=300, class_weight='balanced_subsample', random_state=RANDOM_STATE, n_jobs=-1))
    ])
    pipe.fit(Xtr, ytr)
    yprob = pipe.predict_proba(Xte)[:,1]
    ypred = pipe.predict(Xte)
    print('F1:', f1_score(yte, ypred), ' ROC-AUC:', roc_auc_score(yte, yprob))
    return pipe
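
For intuition about the `smote` step, the sketch below shows the core interpolation idea in plain NumPy. It is a simplified illustration, not imbalanced-learn's implementation: `smote_like_oversample` is a hypothetical helper, it uses brute-force distances, and it ignores details such as per-class sampling strategies.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic minority points by interpolating between a
    sampled minority point and one of its k nearest minority neighbours.
    Assumes X_min has at least two rows."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)        # which point to start from
    pick = rng.integers(0, k, size=n_new)        # which neighbour to move toward
    nb = neighbours[base, pick]
    lam = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[nb] - X_min[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_like_oversample(X_min, n_new=6, k=2)
print(X_syn.shape)
```

Every synthetic point lies on a segment between two real minority points, which is why resampling must be fit on training data only: applied before the split, it would carry information from validation rows into the training set.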


In [None]:
# BalancedBagging (no augmentation) template
def balanced_bagging_rf(X, y, features=None, test_size=0.2):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size, stratify=y, random_state=RANDOM_STATE)
    if features is None:
        features = X.columns.tolist()
    pre = basic_numeric_preprocess(X, features)
    base_rf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
    # Recent imbalanced-learn releases name this argument `estimator`; older ones used `base_estimator`
    clf = BalancedBaggingClassifier(estimator=base_rf, n_estimators=10, sampling_strategy='auto', random_state=RANDOM_STATE, n_jobs=-1)
    pipe = ImbPipeline([('pre', pre), ('clf', clf)])
    pipe.fit(Xtr, ytr)
    yprob = pipe.predict_proba(Xte)[:,1]
    ypred = pipe.predict(Xte)
    print('F1:', f1_score(yte, ypred), ' ROC-AUC:', roc_auc_score(yte, yprob))
    return pipe


## 8) Evaluation Metrics & Curves (ROC/PR) <a id='metrics'></a>

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve

def plot_roc_pr(y_true, y_prob):
    # ROC
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, label=f'ROC AUC={roc_auc:.3f}')
    plt.plot([0,1],[0,1],'--')
    plt.xlabel('FPR'); plt.ylabel('TPR'); plt.title('ROC Curve'); plt.legend(); plt.show()
    # PR
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    pr_auc = auc(recall, precision)
    plt.figure()
    plt.plot(recall, precision, label=f'PR AUC={pr_auc:.3f}')
    plt.xlabel('Recall'); plt.ylabel('Precision'); plt.title('Precision-Recall Curve'); plt.legend(); plt.show()


## 9) Cross-Validation & Pipelines <a id='cv'></a>

In [None]:
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm

def cv_auc(pipe_builder, X, y, n_splits=5):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_STATE)
    scores = []
    for fold, (tr, te) in enumerate(tqdm(skf.split(X, y), total=n_splits, desc='CV folds')):
        Xtr, Xte = X.iloc[tr], X.iloc[te]
        ytr, yte = y.iloc[tr], y.iloc[te]
        pipe = pipe_builder()
        pipe.fit(Xtr, ytr)
        yprob = pipe.predict_proba(Xte)[:,1]
        scores.append(roc_auc_score(yte, yprob))
    print(f'ROC-AUC mean±std: {np.mean(scores):.4f} ± {np.std(scores):.4f}')
    return scores


## 10) From-Scratch Mini-Implementations <a id='scratch'></a>

In [None]:
# Logistic Regression (binary) — minimal from-scratch
class LogisticRegressionScratch:
    def __init__(self, lr=0.1, n_iter=1000):
        self.lr = lr
        self.n_iter = n_iter
        self.w = None
        self.b = 0.0
    def fit(self, X, y):
        X = X.astype(float)
        y = y.astype(float)
        n, d = X.shape
        self.w = np.zeros(d)
        for _ in range(self.n_iter):
            z = X @ self.w + self.b
            p = 1/(1+np.exp(-z))
            grad_w = X.T @ (p - y) / n
            grad_b = (p - y).mean()
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
    def predict_proba(self, X):
        z = X @ self.w + self.b
        return 1/(1+np.exp(-z))
    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)
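
The update rule in `fit` follows from the mean binary cross-entropy loss. As a brief derivation sketch, write $p_i = \sigma(z_i)$ with $z_i = x_i^\top w + b$:

$$
\mathcal{L}(w, b) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big]
$$

Since $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, the derivative with respect to each logit simplifies to $\partial \mathcal{L} / \partial z_i = (p_i - y_i)/n$, and the chain rule gives

$$
\nabla_w \mathcal{L} = \frac{1}{n} X^\top (p - y), \qquad
\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)
$$

These are the `grad_w` and `grad_b` expressions in the loop.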


In [None]:
# Gini impurity for binary 0/1 labels (for DT intuition)
def gini(labels):
    if len(labels)==0: return 0.0
    p = np.mean(labels)
    return 2*p*(1-p)
print('gini([0,0,1,1,1])=', gini(np.array([0,0,1,1,1])))
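
Building on the impurity function, a toy split search over a single numeric feature could look like the sketch below. `best_split_1d` is a hypothetical helper for intuition only; real decision trees search over all features and use faster incremental updates.

```python
import numpy as np

def gini(labels):
    # Binary Gini impurity for 0/1 labels: 2p(1-p)
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def best_split_1d(x, y):
    """Return (threshold, weighted_gini) of the best split `x <= threshold`."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                      # no threshold fits between equal values
        thr = (x[i] + x[i - 1]) / 2       # midpoint between neighbouring values
        left, right = y[:i], y[i:]
        # Impurity of the two children, weighted by their sizes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
thr, score = best_split_1d(x, y)
print('threshold:', thr, 'weighted gini:', score)
```

On this toy data the classes separate cleanly, so the best threshold falls midway between 3 and 10 and both children are pure.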

## 11) PyTorch Templates <a id='torch'></a>

In [None]:
try:
    import torch
    from torch import nn
    class MLP(nn.Module):
        def __init__(self, in_dim, hidden=64, out_dim=1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim)
            )
        def forward(self, x):
            return self.net(x)
    print('PyTorch ready ✔')
except Exception as e:
    print('PyTorch not available →', e)

In [None]:
# Minimal train loop (binary classification w/ BCEWithLogitsLoss)
def torch_train_loop(model, loader, epochs=5, lr=1e-3, device='cpu'):
    import torch
    from torch import nn
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    model.train()
    for ep in range(1, epochs+1):
        total = 0.0
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device).float().view(-1,1)
            opt.zero_grad()
            logits = model(xb)
            loss = loss_fn(logits, yb)
            loss.backward(); opt.step()
            total += loss.item()*len(xb)
        print(f'Epoch {ep}: loss={total/len(loader.dataset):.4f}')


## 12) TensorFlow/Keras Templates <a id='keras'></a>

In [None]:
try:
    import tensorflow as tf
    from tensorflow.keras import layers, models
    def build_keras_mlp(in_dim):
        model = models.Sequential([
            layers.Input(shape=(in_dim,)),
            layers.Dense(64, activation='relu'),
            layers.Dense(64, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC','Precision','Recall'])
        return model
    print('TensorFlow ready ✔')
except Exception as e:
    print('TensorFlow not available →', e)

## 13) Hyperparameter Tuning (Grid/Random) <a id='tuning'></a>

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
def grid_search_example(X, y, features=None):
    if features is None: features = X.columns.tolist()
    pre = basic_numeric_preprocess(X, features)
    pipe = Pipeline([('pre', pre), ('clf', RandomForestClassifier(random_state=RANDOM_STATE))])
    grid = {
        'clf__n_estimators': [100, 300],
        'clf__max_depth': [None, 10, 20],
    }
    gs = GridSearchCV(pipe, grid, scoring='roc_auc', cv=3, n_jobs=-1, verbose=1)
    gs.fit(X, y)
    print('Best:', gs.best_params_, ' Score:', gs.best_score_)
    return gs  # fitted search; gs.best_estimator_ is the refit pipeline


## 14) Model Interpretability (SHAP/LIME tips) <a id='interpret'></a>

In [None]:
print('Tips: For tree models use TreeExplainer; subsample to speed up. For linear models, coefficients are direct indicators. For NN, use Integrated Gradients or Deep SHAP when available.')
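
The tip about linear models can be illustrated with a small NumPy sketch. It assumes features on a comparable scale (here drawn as standard normal), because coefficient magnitudes are only comparable after standardization. The data and `true_w` are synthetic, made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                 # three features on the same scale
true_w = np.array([3.0, 1.0, 0.0])          # hypothetical ground truth: strong, weak, none
p = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.random(n) < p).astype(float)

# Plain gradient descent on the mean binary cross-entropy
w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(500):
    p_hat = 1 / (1 + np.exp(-(X @ w + b)))
    w -= lr * X.T @ (p_hat - y) / n
    b -= lr * (p_hat - y).mean()

ranking = np.argsort(-np.abs(w))            # most influential feature first
print('weights:', np.round(w, 2))
print('ranking:', ranking)
```

The absolute weights recover the ordering of the true effects; with unscaled or strongly correlated features this reading becomes unreliable.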

## 15) Saving, Loading, and Deployment Snips <a id='deploy'></a>

In [None]:
import joblib
def save_model(obj, path='model.joblib'):
    joblib.dump(obj, path)
    print('Saved →', path)
def load_model(path='model.joblib'):
    return joblib.load(path)


## 16) Troubleshooting & Performance Tips <a id='troubleshoot'></a>
- **Long training?** Reduce estimators, features, or sample rows; profile bottlenecks.
- **Metrics misleading under class imbalance?** Prefer ROC-AUC/PR-AUC over accuracy, and inspect the confusion matrix at several thresholds.
- **SMOTE `k_neighbors`:** try 3–10 and validate via CV; avoid leakage by resampling the training folds only, never the validation or test data.
- **Reproducibility:** fix seeds and library versions.
- **Memory:** delete big objects (`del var` + `gc.collect()`).
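
The threshold check mentioned in the tips can be sketched as a small sweep. `confusion_at_threshold` is a hypothetical helper, and the labels and scores are made-up illustration values.

```python
import numpy as np

def confusion_at_threshold(y_true, y_prob, thr):
    """Return (tp, fp, fn, tn) when predicting positive for y_prob >= thr."""
    y_pred = (y_prob >= thr).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    return tp, fp, fn, tn

# Imbalanced toy data: 3 positives out of 10
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90])

for thr in (0.3, 0.5, 0.7):
    tp, fp, fn, tn = confusion_at_threshold(y_true, y_prob, thr)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f'thr={thr:.1f}  TP={tp} FP={fp} FN={fn} TN={tn}  '
          f'precision={precision:.2f} recall={recall:.2f}')
```

Lowering the threshold raises recall at the cost of more false positives, which a single accuracy number would hide.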


## 17) Playground / Your Dataset Hooks <a id='playground'></a>

In [None]:
# Example wiring: set FEATURES and TARGET then run a pipeline
try:
    TARGET = 'status_label'  # change as needed
    FEATURES = [c for c in df.columns if c != TARGET]  # every other column; change as needed
    X = df[FEATURES].copy(); y = df[TARGET].copy()
    print('Fitting basic models...')
    res = fit_basic_models(X, y, features=FEATURES)
    pprint(res)
except NameError:
    print('> Load your dataframe as df first (see Section 3).')
