# Supervised Learning — Master Edition (Single Source of Truth)

**Overview:** This notebook is a textbook-grade, self-contained reference for major supervised learning algorithms. It combines deep conceptual theory, mathematical foundations (LaTeX), step-by-step internal mechanics, domain use-cases, visual explanations with Matplotlib, and runnable scikit-learn examples. Use this as a single source of truth for learning, teaching, or interview preparation.

---
**Contents:**

1. Naive Bayes
2. Performance Evaluation
3. Naive Bayes Optimizations
4. K-Nearest Neighbors (KNN)
5. Decision Trees
6. Linear Regression
7. Logistic Regression
8. Support Vector Machine (SVM)
9. Cross-cutting Topics: Ensembles, Bias-Variance, When to Use What
10. Summary Tables & Domain Mapping

Run cells sequentially for best results; many visualization cells rely on numpy and matplotlib.


In [None]:
# Common imports used throughout the notebook
import numpy as np
import matplotlib.pyplot as plt
from math import log
%matplotlib inline
print('Common libraries loaded')

## 1 — Naive Bayes

### Conceptual Overview (detailed)

Naive Bayes is a probabilistic classification framework built from Bayes' theorem. Historically rooted in statistical decision theory, it became popular in NLP and document classification because it handles high-dimensional sparse inputs efficiently. The **core idea** is to compute the posterior probability of each class given features and pick the class with the highest posterior. The ‘naive’ assumption is that features are conditionally independent given the class — this simplifies the likelihood into a product of one-dimensional likelihoods, dramatically reducing estimation complexity.

**Why it exists:** To provide a simple, scalable probabilistic classifier that can be trained on few examples and offers fast inference. It trades modeling expressiveness for computational efficiency and robustness in high-dimensional sparse settings.

### Mathematical foundations

Bayes' theorem:

$$P(C_k\mid X) = \frac{P(X\mid C_k)P(C_k)}{P(X)}$$

Naive independence assumption:

$$P(X\mid C_k)=\prod_{i=1}^n P(x_i\mid C_k)$$

Often we compute log-posteriors to avoid numerical underflow:

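$$\log P(C_k\mid X) = \log P(C_k) + \sum_{i=1}^n \log P(x_i\mid C_k) - \log P(X)$$

The last term, \(-\log P(X)\), is the same for every class, so it can be treated as a constant and dropped when taking the argmax.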
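To see why the log form matters in practice, here is a small sketch. The numbers are hypothetical (2,000 features, each with a class-conditional likelihood of 0.01) and are chosen only to make the underflow visible.

```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_i | C): 2000 features, each 0.01
likelihoods = np.full(2000, 0.01)

# Direct product: 0.01**2000 is far below the smallest positive float64
naive_product = np.prod(likelihoods)

# Sum of logs: the same quantity on a log scale, which stays finite
log_sum = np.sum(np.log(likelihoods))

print('product of likelihoods:', naive_product)
print('sum of log-likelihoods:', log_sum)
```

The product collapses to zero, so every class would tie, while the log-sum remains a usable score for comparing classes.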

**Variants:** GaussianNB (continuous features), MultinomialNB (counts), BernoulliNB (binary features).

### Internal mechanics (step-by-step)

1. **Training:** Estimate class priors \(\hat P(C_k)\) and feature likelihood parameters per class (means/variances or counts).
2. **Prediction:** For a new sample, compute class log-posteriors using estimated parameters and select argmax.
3. **Calibration:** Optionally calibrate probabilities with isotonic or Platt scaling for better probability estimates.
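
The training and prediction steps above can be sketched from scratch for the Gaussian variant. This is a minimal illustration, not scikit-learn's `GaussianNB` implementation: the helper names (`fit_gaussian_nb`, `predict_gaussian_nb`), the `var_eps` variance floor, and the synthetic two-cluster data are all assumptions made for the example. Calibration (step 3) is omitted.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_eps=1e-9):
    """Step 1: estimate log-priors and per-class feature means/variances."""
    classes = np.unique(y)
    log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # var_eps is a small floor (an assumption here) to avoid division by zero
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_eps
    return classes, log_priors, means, variances

def predict_gaussian_nb(X, classes, log_priors, means, variances):
    """Step 2: compute class log-posteriors (up to a constant) and take argmax."""
    diff = X[:, None, :] - means[None, :, :]           # shape (n, K, d)
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :],
        axis=2,
    )                                                  # shape (n, K)
    log_post = log_lik + log_priors[None, :]
    return classes[np.argmax(log_post, axis=1)]

# Synthetic demo data: two well-separated Gaussian clusters
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(4.0, 1.0, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

params = fit_gaussian_nb(X_demo, y_demo)
pred = predict_gaussian_nb(X_demo, *params)
print('training accuracy:', np.mean(pred == y_demo))
```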

### Simple theory example (non-code)

Imagine classifying emails as spam or ham. For each word feature, Naive Bayes learns how frequent that word is in spam versus ham. For a new email, it multiplies the class-conditional word probabilities (or, equivalently, sums their logs) and combines the result with the class prior to get a posterior score for each class.

### Where it's used across ML domains

- **NLP / Text classification:** spam detection, sentiment analysis, topic categorization (classic use-case).
- **Document filtering and email routing** where interpretability and speed matter.
- **Baseline model**: used as a quick baseline against which more complex models are compared.

### Strengths & Caveats

- **Strengths:** Simple, very fast, works well with sparse, high-dimensional data; requires little memory.
- **Caveats:** Conditional independence rarely holds in reality—correlated features can reduce performance. Does not capture feature interactions.



In [None]:
# Naive Bayes visual intuition (Gaussian class-conditional densities & posterior)

def gaussian_pdf(x, mu, sigma):
    return (1/(np.sqrt(2*np.pi)*sigma)) * np.exp(-0.5*((x-mu)/sigma)**2)

x = np.linspace(-4, 8, 400)
# two class-conditional Gaussians
mu0, s0 = 0.5, 1.0
mu1, s1 = 3.0, 1.2
p0 = gaussian_pdf(x, mu0, s0)
p1 = gaussian_pdf(x, mu1, s1)

plt.figure(figsize=(8,3))
plt.plot(x, p0, label='class 0 likelihood')
plt.plot(x, p1, label='class 1 likelihood')
plt.title('Naive Bayes — Gaussian class-conditional likelihoods (1D)')
plt.xlabel('feature value')
plt.legend()
plt.show()

# Posterior with equal priors
prior0 = 0.5
prior1 = 0.5
post0 = prior0 * p0
post1 = prior1 * p1
norm = post0 + post1
posterior0 = post0 / norm
posterior1 = post1 / norm

plt.figure(figsize=(8,2.5))
plt.plot(x, posterior0, label='P(class0 | x)')
plt.plot(x, posterior1, label='P(class1 | x)')
plt.title('Posterior probabilities (equal priors)')
plt.xlabel('feature value')
plt.legend()
plt.show()

print('Observation: decision boundary where posteriors cross.')


## 2 — Performance Evaluation (Detailed Theory & Use)

### Conceptual overview

Performance evaluation answers: *How good is my model?* and *where does it fail?* Selecting metrics depends on the problem (classification vs regression) and business priorities (e.g., false negatives more costly than false positives in medical tests).

### Core concepts & metrics

- **Confusion matrix** components: TP, FP, TN, FN.
- **Accuracy**: overall correctness but can be misleading on imbalanced data.
- **Precision & Recall**: precision measures correctness of positive predictions; recall measures coverage of actual positives.
- **F1-score** balances precision and recall.
- **ROC-AUC** measures separability across thresholds. **PR-AUC** is more informative for highly imbalanced problems.
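
As a worked example of the threshold-based metrics above, the following sketch computes them directly from confusion-matrix counts. The label vectors are made up for illustration.

```python
import numpy as np

# Hypothetical ground truth and predictions (1 = positive class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Confusion-matrix components
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # correctness of positive predictions
recall = tp / (tp + fn)      # coverage of actual positives
f1 = 2 * precision * recall / (precision + recall)

print(f'TP={tp} FP={fp} FN={fn} TN={tn}')
print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```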

### Cross-validation and model selection

- **k-fold CV** provides robust performance estimates by averaging results across folds.
- **Nested CV** is important for honest hyperparameter tuning to avoid optimistic bias.
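
A minimal sketch of nested cross-validation with scikit-learn follows. The dataset, the grid of `C` values, and the fold counts are assumptions chosen to keep the example fast. The inner loop (`GridSearchCV`) tunes the hyperparameter, and the outer loop (`cross_val_score`) estimates how well the whole tuning procedure generalizes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data, for illustration only
X_cv, y_cv = make_classification(n_samples=300, n_features=10,
                                 n_informative=3, random_state=0)

# Stratified folds preserve class ratios in each split
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose C by cross-validated ROC-AUC
search = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={'C': [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring='roc_auc',
)

# Outer loop: each fold refits the search on its own training portion
nested_scores = cross_val_score(search, X_cv, y_cv, cv=outer_cv, scoring='roc_auc')
print('nested CV ROC-AUC per fold:', nested_scores)
print('mean:', nested_scores.mean())
```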

### Use across ML domains

- In healthcare, **recall** (sensitivity) is prioritized to avoid missing positive cases.
- In spam filtering, **precision** matters to avoid false positives (legitimate emails marked spam).
- In recommender systems, business-derived metrics (CTR, revenue uplift) often matter more than raw accuracy.

### Practical tips

- Choose metrics aligned with costs of different error types.
- Use stratified CV for classification to preserve class ratios.
- Report multiple metrics (accuracy + precision/recall + calibration) for transparency.


In [None]:
# Performance visuals: ROC and Precision-Recall
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, precision_recall_curve

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, weights=[0.7], flip_y=0.01, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:,1]

fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f'AUC={roc_auc:.3f}')
plt.plot([0,1],[0,1],'--', color='gray')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

precision, recall, _ = precision_recall_curve(y_test, probs)
plt.figure(figsize=(6,4))
plt.plot(recall, precision)
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()


## 3 — Naive Bayes Optimizations (Detailed Theory)

### Why optimize?
Naive Bayes is simple and effective, but several practical adjustments improve robustness, especially in NLP: smoothing, feature selection, and more realistic likelihood models.

### Laplace (additive) smoothing
Smoothing avoids zero probabilities for unseen tokens. With vocabulary size |V| and smoothing parameter \(\alpha\):

$$P(w\mid C)=\frac{count(w,C)+\alpha}{\sum_{w'}count(w',C)+\alpha|V|}$$

Smaller \(\alpha\) (e.g., 0.1) is less aggressive; \(\alpha=1\) is Laplace smoothing.

### Feature selection and weighting
- **TF-IDF** downweights very frequent words that are less informative.
- **Chi-square, mutual information** select tokens most correlated with labels.
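
The weighting and selection steps above can be chained in front of a smoothed Naive Bayes model. The sketch below uses a toy corpus and an arbitrary `k=5`; both are assumptions for illustration, and a real task would tune `k` and `alpha` by cross-validation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (1 = spam, 0 = ham), for illustration only
docs = ['buy cheap meds now', 'cheap meds available', 'limited offer buy',
        'project meeting schedule', 'discuss project', 'let us meet tomorrow']
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(
    TfidfVectorizer(),          # downweight tokens that appear in many documents
    SelectKBest(chi2, k=5),     # keep the 5 tokens most associated with the labels
    MultinomialNB(alpha=1.0),   # Laplace-smoothed Naive Bayes
)
model.fit(docs, labels)

predictions = model.predict(['cheap meds offer', 'project schedule tomorrow'])
print(predictions)
```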

### Calibration & hybrid models
- Naive Bayes probabilities can be poorly calibrated; isotonic regression or Platt scaling can help.
- Hybrid models feed Naive Bayes outputs (for example, class log-probabilities) as features into a stronger learner (e.g., NB + Logistic Regression).

### Use-cases across ML
- As a competitive baseline in text tasks.
- Fast initial model when building large-scale pipelines (e.g., email triage).


In [None]:
# Demonstrate smoothing effect: token probabilities with different alpha
from collections import Counter

spam_docs = ['buy cheap meds now', 'cheap meds available', 'limited offer buy']
ham_docs = ['project meeting schedule', 'discuss project', 'let us meet tomorrow']
all_docs = spam_docs + ham_docs
vocab = sorted(set(' '.join(all_docs).split()))  # sorted so the printed order is reproducible
V = len(vocab)

def token_prob(token, docs, alpha):
    counts = Counter(' '.join(docs).split())
    total = sum(counts.values())
    return (counts[token] + alpha) / (total + alpha * V)

print('Vocab sample:', vocab[:8])
for alpha in [0.0, 0.5, 1.0]:
    probs = {t: token_prob(t, spam_docs, alpha) for t in vocab}
    print(f'alpha={alpha}, sample probs:', list(probs.items())[:6])


## 4 — K-Nearest Neighbors (KNN)

### Conceptual overview
KNN is an instance-based (lazy) algorithm: it does not build a global model during training. Instead, it stores training samples and at prediction time finds the k closest training points according to a distance metric. The predicted label is typically the majority among neighbors (classification) or the average (regression).

### When to use
- Small to medium datasets where training time should be minimal and interpretability is helpful.
- Problems where similarity intuition is strong (e.g., recommendation by nearest users/items).

### Limitations and scalability
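- With brute-force search, prediction cost is O(n·d) per query (n training samples, d features); in high dimensions, distances also become less informative (curse of dimensionality).
- Preprocessing: feature scaling is essential, because otherwise features with large numeric ranges dominate the distance.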
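
To make the scaling point concrete, here is a from-scratch brute-force KNN sketch. The `knn_predict` helper and the six-point dataset are hypothetical; the second feature is deliberately on a much larger scale than the first, although only the first feature separates the classes.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3, scale=True):
    """Brute-force KNN with majority vote; labels must be non-negative ints."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    if scale:
        # Standardize using training statistics only
        mu = X_train.mean(axis=0)
        sigma = X_train.std(axis=0)
        sigma = np.where(sigma == 0, 1.0, sigma)
        X_train = (X_train - mu) / sigma
        X_query = (X_query - mu) / sigma
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

# Feature 0 separates the classes; feature 1 is large-scale noise
X_small = np.array([[1.0, 1000.0], [1.2, 1100.0], [0.9, 900.0],
                    [3.0, 1050.0], [3.2, 950.0], [2.9, 1000.0]])
y_small = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[1.1, 1000.0], [3.1, 1000.0]])

scaled_pred = knn_predict(X_small, y_small, queries, k=3, scale=True)
unscaled_pred = knn_predict(X_small, y_small, queries, k=3, scale=False)
print('with scaling:   ', scaled_pred)
print('without scaling:', unscaled_pred)
```

Without scaling, the first query is misclassified because its neighbors are chosen almost entirely by the large-scale second feature.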

### Simple (non-code) example
Classify a new flower by looking at the closest k labeled flowers in the feature space (petal length, petal width, etc.).


In [None]:
# KNN decision boundary visualization for varying k
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

def plot_knn(k):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1,X[:,0].max()+1,200), np.linspace(X[:,1].min()-1,X[:,1].max()+1,200))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.figure(figsize=(5,4))
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, s=15)
    plt.title(f'KNN decision boundary (k={k})')
    plt.show()

for k in [1,3,9]:
    plot_knn(k)


## 5 — Decision Trees

### Conceptual overview
Decision Trees recursively partition the feature space into regions with mostly-homogeneous labels. They are non-parametric and create human-readable rules (if-then statements). The splitting criterion (e.g., information gain, Gini impurity) measures which split reduces uncertainty the most.
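
The splitting criteria can be computed by hand for a candidate split. The sketch below uses hypothetical helper names and a made-up set of eight labels; it scores a split by how much it reduces the weighted impurity of the children relative to the parent.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k log2 p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def impurity_decrease(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A mediocre split: each child is still mixed 3:1
info_gain = impurity_decrease(parent, np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1]))
gini_gain = impurity_decrease(parent, np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1]), gini)

# A perfect split: each child is pure
perfect_gain = impurity_decrease(parent, np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1]))

print(f'information gain (mixed split): {info_gain:.4f}')
print(f'Gini decrease (mixed split):    {gini_gain:.4f}')
print(f'information gain (pure split):  {perfect_gain:.4f}')
```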

### Why they are useful
- Interpretability: rules are easy to visualize and explain.
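- Handle mixed data types (categorical + numeric) without heavy preprocessing in principle; note that scikit-learn's tree implementation expects numeric input, so categorical features must be encoded first.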

### Where they can fail
- Trees can overfit: a deep tree can memorize noise. Regularization (max depth, min samples) and ensembles mitigate this.


In [None]:
# Entropy plot and decision tree example
ps = np.linspace(0.001, 0.999, 500)
entropy = - (ps * np.log2(ps) + (1-ps) * np.log2(1-ps))
plt.figure(figsize=(6,3))
plt.plot(ps, entropy)
plt.title('Binary entropy H(p)')
plt.xlabel('p')
plt.ylabel('Entropy')
plt.show()

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, plot_tree
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X, y)
plt.figure(figsize=(6,4))
plot_tree(clf, filled=True)
plt.title('Decision Tree (make_moons)')
plt.show()


## 6 — Linear Regression

### Conceptual overview
Linear Regression models the expectation of the target as a linear combination of features. It's the simplest parametric regression model and serves as the foundation for many statistical and machine learning techniques. It provides interpretable coefficients indicating marginal effects.

### Derivation (OLS normal equation)
Given a design matrix X (with a column of ones for the intercept) and a target vector y, OLS solves:

$$\hat{\beta} = \arg\min_\beta ||y - X\beta||_2^2$$

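Setting the gradient with respect to \(\beta\) to zero yields the normal equation:

$$X^TX\hat{\beta} = X^Ty \Rightarrow \hat{\beta} = (X^TX)^{-1}X^Ty$$

The closed form requires \(X^TX\) to be invertible, i.e., X must have full column rank.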
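
The normal equation can be checked numerically. The sketch below uses synthetic data with an assumed true slope of 3 and intercept of 4, solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, and compares the result with NumPy's least-squares routine.

```python
import numpy as np

# Synthetic data: y = 3x + 4 plus Gaussian noise (values are assumptions)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_lin = 3.0 * x + 4.0 + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a leading column of ones for the intercept
X_design = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_lin)

# Reference solution from a numerically more robust least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_design, y_lin, rcond=None)

print('normal equation [intercept, slope]:', beta_normal)
print('lstsq           [intercept, slope]:', beta_lstsq)
```

For ill-conditioned or rank-deficient X, `np.linalg.lstsq` is usually the safer choice, since the normal equation squares the condition number.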

### Use-cases and notes
- Used for trend estimation, baseline regressors, and interpretation.
- Use regularized variants (Ridge, Lasso) when multicollinearity or overfitting occurs.


In [None]:
# Linear regression example with synthetic noisy data
np.random.seed(0)
X = np.linspace(0, 10, 50)
y = 3.0 * X + 4.0 + np.random.normal(scale=5.0, size=X.shape)
from sklearn.linear_model import LinearRegression, Ridge
X_mat = X.reshape(-1,1)
reg = LinearRegression().fit(X_mat, y)
plt.figure(figsize=(6,4))
plt.scatter(X, y, label='data')
plt.plot(X, reg.predict(X_mat), color='red', label='OLS fit')
plt.title('Linear Regression fit (noisy data)')
plt.legend()
plt.show()
print('OLS coef, intercept:', reg.coef_, reg.intercept_)

# Ridge example to illustrate regularization effect
ridge = Ridge(alpha=10.0).fit(X_mat, y)
print('Ridge coef, intercept:', ridge.coef_, ridge.intercept_)


## 7 — Logistic Regression

### Conceptual overview
Logistic Regression models the log-odds of the positive class as a linear function of inputs and maps that to a probability via the sigmoid function. It's a discriminative model (models P(y|x) directly) and widely used for binary classification due to simplicity and well-behaved convex loss.

### Loss and optimization
The negative log-likelihood (log-loss) is convex; common solvers use Newton's method variants or gradient-based solvers to find parameters.
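
As a minimal sketch of the gradient-based route, the code below minimizes the log-loss with plain batch gradient descent. It is not what scikit-learn's solvers do internally; the learning rate, iteration count, the helper name `fit_logistic_gd`, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y, eps=1e-12):
    """Mean negative log-likelihood; eps guards against log(0)."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_logistic_gd(X, y, lr=0.1, n_iter=500):
    """Batch gradient descent on the log-loss, starting from w = 0."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)   # gradient of the mean log-loss
        w -= lr * grad
        losses.append(log_loss(w, X, y))
    return w, losses

# Synthetic data drawn from a known logistic model: [bias, w1, w2]
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
Xb = np.column_stack([np.ones(n), features])
true_w = np.array([-0.5, 2.0, -1.0])
y_bin = (rng.random(n) < sigmoid(Xb @ true_w)).astype(float)

w_hat, losses = fit_logistic_gd(Xb, y_bin)
print('estimated weights:', w_hat)
print('loss after first and last step:', losses[0], losses[-1])
```

Because the loss is convex, a sufficiently small step size decreases it steadily toward the global minimum, with no risk of getting stuck in a local one.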

### Use cases
- Binary classification tasks across domains (medical diagnosis, credit scoring, click-through prediction) where interpretability and probability estimates matter.


In [None]:
# Sigmoid plot and logistic regression decision boundary

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

z = np.linspace(-10, 10, 400)
plt.figure(figsize=(6,3))
plt.plot(z, sigmoid(z))
plt.title('Sigmoid function')
plt.grid(True)
plt.show()

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=1)
clf = LogisticRegression().fit(X, y)
xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1, X[:,0].max()+1, 200), np.linspace(X[:,1].min()-1, X[:,1].max()+1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(6,4))
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:,0], X[:,1], c=y)
plt.title('Logistic Regression boundary')
plt.show()


## 8 — Support Vector Machine (SVM)

### Conceptual overview
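SVM searches for the hyperplane that maximizes the margin, i.e., the distance from the separator to the closest training points of each class. Intuitively, a wider margin tends to generalize better because the separator is less sensitive to small perturbations of the data.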

### Geometry & optimization
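For linearly separable data with labels \(y_i \in \{-1, +1\}\) (hard-margin):

$$\min_{w,b} \frac{1}{2}||w||^2 \quad \text{s.t. } y_i(w\cdot x_i + b) \ge 1 \quad \forall i$$

Margin width = \(2/||w||\). For non-separable data, slack variables \(\xi_i \ge 0\) relax the constraints to \(y_i(w\cdot x_i + b) \ge 1 - \xi_i\), and the regularization parameter C penalizes the total slack \(\sum_i \xi_i\).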
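
The margin width can be read off a fitted linear SVM. The sketch below is illustrative: the blob dataset and the large `C=1000.0` (used to approximate a hard margin) are assumptions, and labels are mapped to -1/+1 to match the formulation above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters; a large C approximates the hard-margin problem
X_svm, y_svm = make_blobs(n_samples=120, centers=2, random_state=6)
svm = SVC(kernel='linear', C=1000.0).fit(X_svm, y_svm)

w = svm.coef_[0]          # normal vector of the separating hyperplane
b = svm.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

# Functional margins y_i (w . x_i + b) with labels mapped to {-1, +1}
y_signed = np.where(y_svm == svm.classes_[1], 1.0, -1.0)
functional_margins = y_signed * (X_svm @ w + b)

print('margin width 2/||w||:', margin_width)
print('smallest functional margin:', functional_margins.min())
```

If the data are linearly separable, the smallest functional margin should be close to 1, attained at the support vectors.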

### Kernels and non-linear separation
Kernels implicitly map data to higher-dimensional spaces where a linear separator may exist; common kernels include RBF and polynomial.
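
A quick way to see the effect of a kernel is to compare a linear and an RBF SVM on data that no straight line can separate. The concentric-circles dataset and the default hyperparameters below are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X_circ, y_circ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# 5-fold cross-validated accuracy for each kernel
linear_acc = cross_val_score(SVC(kernel='linear'), X_circ, y_circ, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel='rbf'), X_circ, y_circ, cv=5).mean()

print(f'linear kernel accuracy: {linear_acc:.3f}')
print(f'RBF kernel accuracy:    {rbf_acc:.3f}')
```

The linear kernel should perform close to chance here, while the RBF kernel separates the rings almost perfectly.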

### Use-cases
- Effective in high-dimensional tasks such as text categorization and early computer vision tasks. SVMs were heavily used before deep learning became dominant.


In [None]:
# Linear SVM: decision regions with support vectors highlighted
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=120, centers=2, random_state=6)
clf = SVC(kernel='linear', C=1.0).fit(X, y)
plt.figure(figsize=(6,4))
xx = np.linspace(X[:,0].min()-1, X[:,0].max()+1, 200)
yy = np.linspace(X[:,1].min()-1, X[:,1].max()+1, 200)
XX, YY = np.meshgrid(xx, yy)
Z = clf.predict(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)
plt.contourf(XX, YY, Z, alpha=0.3)
plt.scatter(X[:,0], X[:,1], c=y)
sv = clf.support_vectors_
plt.scatter(sv[:,0], sv[:,1], s=100, facecolors='none', edgecolors='k', label='support vectors')
plt.title('Linear SVM with support vectors (circled)')
plt.legend()
plt.show()
print('Support vectors count per class:', clf.n_support_)


## 9 — Cross-cutting Topics: Ensembles, Bias-Variance, When to Use What

### Bias-Variance tradeoff (intuition)
- **Bias**: error from wrong model assumptions (underfitting). High bias models are too simple.
- **Variance**: error from sensitivity to small fluctuations in training set (overfitting). High variance models are too complex.

A good model balances bias and variance; regularization and ensembling are ways to control variance.

### Ensembles (brief)
- **Bagging (e.g., Random Forests):** reduces variance by averaging many trees trained on bootstrap samples.
- **Boosting (e.g., AdaBoost, Gradient Boosting):** trains weak learners sequentially, each focusing on the examples its predecessors got wrong, which mainly reduces bias and combines the weak learners into a strong one.
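
The variance-reduction effect of bagging can be checked empirically. The sketch below compares an unpruned decision tree with a random forest on a noisy synthetic dataset; the dataset, `n_estimators=200`, and the 5-fold setup are assumptions. The forest typically scores higher, though this is not guaranteed on every dataset.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data where a single deep tree tends to overfit
X_m, y_m = make_moons(n_samples=500, noise=0.3, random_state=0)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_m, y_m, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X_m, y_m, cv=5
)

print(f'single tree:   mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}')
print(f'random forest: mean={forest_scores.mean():.3f}, std={forest_scores.std():.3f}')
```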

### When to use what (practical rules)
- **Naive Bayes:** fast baseline for text/NLP.
- **Logistic Regression:** when you need interpretable probability estimates.
- **Decision Trees / Random Forests:** when handling mixed feature types matters; a single tree is the most interpretable option, while a forest trades some interpretability for accuracy.
- **SVM:** small-to-medium high-dimensional problems with clear margins.
- **KNN:** small datasets where similarity is meaningful.
- **Linear Regression:** baseline for regression; use Ridge/Lasso for regularization.


## 10 — Summary Tables & Domain Mapping

| Algorithm | Core Idea | Typical Domains | Strength | Limitation |
|---|---|---|---|---|
| Naive Bayes | Probabilistic with independence assumption | NLP, document classification | Fast, simple, works with high-dim sparse data | Independence assumption may be unrealistic |
| KNN | Similarity-based voting | Recommendation, small classification tasks | Simple, non-parametric | Slow at prediction, needs scaling |
| Decision Tree | Recursive partitioning | Finance, healthcare (rule-based models) | Interpretable, handles mixed data | Prone to overfit |
| Linear Regression | Linear modeling of target | Economics, trend analysis | Interpretable coefficients | Underfits non-linear relationships |
| Logistic Regression | Linear model for log-odds | Healthcare, marketing | Probabilistic outputs, interpretable | Linear decision boundary |
| SVM | Max-margin classifier | Bioinformatics, vision (historically) | Effective in high-dim spaces | Slow on large datasets |

---

### Final notes
This master notebook aims to be both a learning document and a quick reference. For deeper theory consult Bishop (PRML) or Hastie et al. (ESL).


## References & Further Reading

- Christopher Bishop — *Pattern Recognition and Machine Learning*
- Hastie, Tibshirani, Friedman — *The Elements of Statistical Learning*
- Scikit-learn documentation: https://scikit-learn.org
- Goodfellow, Bengio, Courville — *Deep Learning*
