# Supervised Learning — Single Source of Truth

**Contents:** Naive Bayes, Performance Evaluation, Naive Bayes Optimizations, K-Nearest Neighbors (KNN), Decision Trees, Linear Regression, Logistic Regression, Support Vector Machine (SVM).

This notebook provides:

- Detailed definitions and theory (technical overview) with mathematical formulas rendered in LaTeX.
- Plain-English intuitive explanations.
- Key terms, strengths, and limitations.
- Runnable Python examples for each algorithm with line-by-line comments.

---

_A single comprehensive handbook for interview preparation and reference use._

## Table of Contents

1. [Naive Bayes](#Naive-Bayes)
2. [Performance Evaluation](#Performance-Evaluation)
3. [Naive Bayes Optimizations](#Naive-Bayes-Optimizations)
4. [K-Nearest Neighbors (KNN)](#K-Nearest-Neighbors-(KNN))
5. [Decision Trees](#Decision-Trees)
6. [Linear Regression](#Linear-Regression)
7. [Logistic Regression](#Logistic-Regression)
8. [Support Vector Machine (SVM)](#Support-Vector-Machine-(SVM))
9. [Conclusion & References](#Conclusion-and-References)


## Naive Bayes

### Definition & Working Principle

Naive Bayes is a probabilistic classifier based on **Bayes' Theorem**, which computes the posterior probability of a class given observed features. The core assumption is that features are **conditionally independent** given the class.

### Bayes' Theorem (basic formula)

\[
P(C_k \mid X) = \frac{P(X \mid C_k)\,P(C_k)}{P(X)}
\]

Under the Naive assumption (features independent):

\[
P(X \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)
\]
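The two formulas above can be worked through numerically. The sketch below uses made-up priors and per-feature likelihoods (every number is hypothetical) and computes the posterior in log space, a common way to avoid underflow when many small probabilities are multiplied.

```python
import numpy as np

# Hypothetical numbers for two classes and three observed features.
priors = np.array([0.6, 0.4])              # P(C_0), P(C_1)
likelihoods = np.array([
    [0.2, 0.5, 0.1],                       # P(x_i | C_0) for the observed x_1..x_3
    [0.7, 0.4, 0.6],                       # P(x_i | C_1)
])

# Naive assumption: the joint likelihood is the product of per-feature terms,
# which becomes a sum in log space.
log_joint = np.log(priors) + np.log(likelihoods).sum(axis=1)

# Normalize by P(X) = sum_k P(X | C_k) P(C_k), again in log space.
log_posterior = log_joint - np.logaddexp.reduce(log_joint)
posterior = np.exp(log_posterior)

print("Posterior P(C_k | X):", posterior)
print("Predicted class:", int(np.argmax(posterior)))
```

Because \(P(X)\) is the same for every class, the predicted class could also be read directly from `log_joint` without normalizing.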

### Likelihoods / Variants

- **Gaussian Naive Bayes:** assumes continuous features follow a normal distribution. For each feature:
\[ P(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{k,i}^2}} \exp\left(-\frac{(x_i-\mu_{k,i})^2}{2\sigma_{k,i}^2}\right) \]

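- **Multinomial Naive Bayes:** models feature counts (e.g., word counts); each \(P(x_i \mid C_k)\) is estimated from the relative frequency of feature \(i\) among all counts observed in class \(C_k\).

- **Bernoulli Naive Bayes:** models binary features, e.g., whether each vocabulary word is present or absent in a document.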

### Key Terms

- **Prior**: \(P(C_k)\), the prevalence of class \(C_k\) before observing any features.
- **Likelihood**: \(P(X\mid C_k)\), how probable the features are under class \(C_k\).
- **Posterior**: \(P(C_k\mid X)\), the updated probability after observing \(X\).
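- **Laplace smoothing**: add \(\alpha\) to every count to avoid zero probabilities: \(P(x_i \mid C_k)=\frac{\text{count}(x_i, C_k)+\alpha}{N_k+\alpha|V|}\), where \(N_k\) is the total count in class \(C_k\) and \(|V|\) is the number of distinct feature values (the vocabulary size for text).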
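As a small illustration of Laplace smoothing, the sketch below applies the formula to a hypothetical four-word vocabulary. The token list, the vocabulary, and the helper name `smoothed_likelihood` are all made up for this example.

```python
from collections import Counter

# Hypothetical toy data: all tokens seen in documents of one class.
spam_tokens = "buy now buy offer".split()
vocabulary = ["buy", "now", "offer", "hello"]  # |V| = 4; "hello" never occurs in spam

def smoothed_likelihood(word, tokens, vocab, alpha=1.0):
    """Return (count(word) + alpha) / (N + alpha * |V|)."""
    counts = Counter(tokens)
    total = len(tokens)  # N: total number of tokens in the class
    return (counts[word] + alpha) / (total + alpha * len(vocab))

p_buy = smoothed_likelihood("buy", spam_tokens, vocabulary)      # (2 + 1) / (4 + 4)
p_hello = smoothed_likelihood("hello", spam_tokens, vocabulary)  # (0 + 1) / (4 + 4)
print("P(buy | spam)   =", p_buy)
print("P(hello | spam) =", p_hello)
```

Without smoothing, the unseen word would get probability zero and wipe out the whole product of likelihoods; with \(\alpha = 1\) it gets a small but non-zero value, and the smoothed probabilities over the vocabulary still sum to one.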

### Strengths & Limitations

- Strengths: extremely fast to train and predict, works well on high-dimensional problems (text), low memory footprint.
- Limitations: strong independence assumption, does not model feature interactions; can be outperformed when features are highly correlated.

### Plain-English Summary

Naive Bayes treats each feature as an independent witness and multiplies their evidence to find the most likely class. It's simple but often surprisingly effective, especially in text classification.

In [None]:
# Naive Bayes -- Gaussian and Multinomial examples with comments on each line
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

# Example 1: Gaussian Naive Bayes with Iris dataset (continuous features)
X, y = load_iris(return_X_y=True)  # load a small continuous dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # split data
gnb = GaussianNB()  # instantiate Gaussian Naive Bayes
gnb.fit(X_train, y_train)  # fit model to training data
y_pred = gnb.predict(X_test)  # predict on test set
print('GaussianNB classification report:')  # label
print(classification_report(y_test, y_pred))  # show precision/recall/f1

# Example 2: Multinomial Naive Bayes for text (counts)
texts = ['spam offer buy now', 'limited offer click', 'hello how are you', 'let us meet tomorrow']  # tiny corpus
labels = [1, 1, 0, 0]  # binary labels: 1=spam, 0=ham
vec = CountVectorizer()  # convert text to token counts
X_counts = vec.fit_transform(texts)  # fit and transform
mnb = MultinomialNB(alpha=1.0)  # Laplace smoothing alpha=1
mnb.fit(X_counts, labels)  # train MultinomialNB
sample = vec.transform(['buy this limited offer now'])  # new sample to classify
print('MultinomialNB prediction for sample:', mnb.predict(sample))  # output predicted class


## Performance Evaluation

### Definition & Purpose

Performance evaluation quantifies how well a supervised model will perform on unseen data. It is crucial for model selection, hyperparameter tuning, and deployment decisions.

### Classification Metrics (confusion matrix components)

- True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN)

Formulas:

Accuracy:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Precision:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall (Sensitivity):

\[ \text{Recall} = \frac{TP}{TP + FN} \]

F1-score:

\[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
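The four formulas above can be checked by hand. This sketch counts the confusion-matrix cells for a small set of toy labels (chosen only for illustration) and applies each formula directly.

```python
import numpy as np

# Toy labels, with 1 as the positive class.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1])

# Count the four confusion-matrix cells.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

These are the binary (positive-class) values; macro-averaged scores average the per-class values instead, so they can differ on imbalanced data.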

### Regression Metrics

Mean Squared Error (MSE):

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

R-squared (coefficient of determination):

\[ R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \]
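Both regression metrics are short enough to compute directly from their definitions. The targets and predictions below are illustrative values only.

```python
import numpy as np

# Illustrative targets and predictions.
y_true = np.array([2.5, 0.0, 2.1, 7.8])
y_pred = np.array([3.0, -0.1, 2.0, 7.5])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                    # average squared residual

ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot

print("MSE:", mse)
print("R^2:", r2)
```

An \(R^2\) of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for models that do worse than that baseline.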

### Validation Strategies

- Train/test split
- k-fold cross-validation
- Stratified k-fold for imbalanced classes
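A minimal sketch of stratified k-fold cross-validation with scikit-learn follows. The choice of the Iris dataset, `GaussianNB`, and five folds is an assumption made for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Each fold keeps roughly the same class proportions as the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Train on 4 folds, score on the held-out fold, and repeat 5 times.
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the mean together with the spread across folds gives a more stable estimate than a single train/test split, at the cost of fitting the model k times.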

### Practical Notes

- Always measure performance on data not used during training.
- Choose metrics that align with business goals (e.g., prioritize recall when a missed positive is costly, as in medical screening).


In [None]:
# Performance evaluation examples
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, r2_score
import numpy as np

# Dummy classification example
y_true = np.array([0, 1, 1, 0, 1, 0])  # true labels
y_pred = np.array([0, 0, 1, 0, 1, 1])  # predicted labels
print('Accuracy:', accuracy_score(y_true, y_pred))  # accuracy
print('Precision (macro):', precision_score(y_true, y_pred, average='macro'))  # precision
print('Recall (macro):', recall_score(y_true, y_pred, average='macro'))  # recall
print('F1 (macro):', f1_score(y_true, y_pred, average='macro'))  # f1

# Dummy regression example
y_true_reg = np.array([2.5, 0.0, 2.1, 7.8])  # true continuous targets
y_pred_reg = np.array([3.0, -0.1, 2.0, 7.5])  # predicted targets
print('MSE:', mean_squared_error(y_true_reg, y_pred_reg))  # mean squared error
print('R2:', r2_score(y_true_reg, y_pred_reg))  # R-squared


## Naive Bayes Optimizations

### Variants & Optimizations

Different Naive Bayes variants suit different data types and tasks:

- **GaussianNB**: models continuous features with Gaussian likelihoods.
- **MultinomialNB**: models discrete count features (e.g., word counts); useful for NLP.
- **BernoulliNB**: models binary features (presence/absence).

**Laplace (additive) smoothing** prevents zero-likelihood by adding \(\alpha>0\) to counts:

\[ P(w \mid C) = \frac{count(w, C) + \alpha}{\sum_{w'} count(w', C) + \alpha |V|} \]

**Feature selection** (e.g., chi-square, mutual information) often improves Naive Bayes by removing noisy features. **Calibration** and combining NB in ensembles (e.g., via stacking) can improve probability estimates.


In [None]:
# Naive Bayes optimizations demonstration: using TF-IDF, feature selection, and MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB

texts = [
    'buy cheap meds now',
    'limited time offer buy',
    'project meeting schedule',
    'let us discuss the project plan'
]
labels = [1, 1, 0, 0]

# Build a pipeline: TF-IDF -> select top-k features by chi2 -> MultinomialNB
pipeline = make_pipeline(
    TfidfVectorizer(),  # convert raw text to TF-IDF features
    SelectKBest(chi2, k=5),  # pick top 5 features by chi-squared test
    MultinomialNB(alpha=1.0)  # classifier with Laplace smoothing
)
pipeline.fit(texts, labels)  # train pipeline
print('Prediction (optimized pipeline):', pipeline.predict(['cheap limited offer']))  # test


## K-Nearest Neighbors (KNN)

### Definition & Working Principle

K-Nearest Neighbors (KNN) is an instance-based (lazy) algorithm that predicts the label of a query point by looking at the labels of the k nearest training examples according to a distance metric.

### Distance metrics

- **Euclidean distance:** \(d(x,y)=\sqrt{\sum_i (x_i-y_i)^2}\)
- **Manhattan distance:** \(d(x,y)=\sum_i |x_i-y_i|\)
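- **Cosine similarity:** \(\cos(x,y)=\frac{x\cdot y}{\|x\|\,\|y\|}\) compares directions rather than magnitudes and is often used for text embeddings; the corresponding distance is \(1-\cos(x,y)\).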
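The three measures above can be computed in a few lines of NumPy. The two vectors are arbitrary examples.

```python
import numpy as np

# Two arbitrary example vectors.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(x - y))           # sum of absolute coordinate differences
cosine_similarity = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
cosine_distance = 1.0 - cosine_similarity   # turns the similarity into a distance-like value

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
print("Cosine similarity:", cosine_similarity)
print("Cosine distance:", cosine_distance)
```

Euclidean and Manhattan distances depend on feature scale, so features are usually standardized first; cosine similarity ignores vector length and depends only on direction.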

### Key choices

- **k (neighbors)**: small k → low bias/high variance; large k → high bias/low variance.
- **Distance weighting**: neighbors can be weighted inversely by distance.
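To show how `k` and distance weighting enter the prediction, here is a from-scratch sketch of KNN classification. The function name `knn_predict` and the toy points are hypothetical, and the code is meant for illustration rather than as a replacement for a library implementation.

```python
from collections import defaultdict

import numpy as np

def knn_predict(X_train, y_train, query, k=3, weighted=False):
    """Predict the label of `query` by (optionally distance-weighted) majority vote."""
    dists = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = defaultdict(float)
    for idx in nearest:
        # Inverse-distance weighting; the small constant avoids division by zero.
        weight = 1.0 / (dists[idx] + 1e-12) if weighted else 1.0
        votes[int(y_train[idx])] += weight
    return max(votes, key=votes.get)

# Hypothetical toy data: two small clusters.
X_train = np.array([[1.0, 2.0], [1.1, 1.9], [4.0, 4.1], [4.2, 3.9]])
y_train = np.array([0, 0, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=3, weighted=True)
print("Prediction for [2.0, 2.0]:", pred_a)
print("Prediction for [3.5, 3.5]:", pred_b)
```

With uniform weights every neighbor counts equally; with inverse-distance weights a very close neighbor can outvote several distant ones.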

### Strengths & Limitations

- Strengths: simple, essentially no training cost (the model just stores the training data), can model complex decision boundaries.
- Limitations: expensive at prediction time (each query is compared against the stored training set), sensitive to feature scaling and irrelevant features.


In [None]:
# KNN example with comments
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Using a small toy dataset
X = [[1.0, 2.0], [1.1, 1.9], [4.0, 4.1], [4.2, 3.9]]  # toy 2D points
y = [0, 0, 1, 1]  # labels for two clusters

# Build a pipeline to standardize features then apply KNN
knn_pipeline = make_pipeline(
    StandardScaler(),  # scale features to zero mean and unit variance
    KNeighborsClassifier(n_neighbors=3)  # KNN with k=3
)
knn_pipeline.fit(X, y)  # store training points in the pipeline
print('KNN prediction for [2.0,2.0]:', knn_pipeline.predict([[2.0, 2.0]]))  # query point


## Decision Trees

### Definition & Working Principle

Decision Trees split data recursively along features to form a tree where leaves represent predictions. At each split, the algorithm chooses the feature and threshold that best improves a purity metric.

### Purity measures

- **Entropy:** \(H(S) = -\sum_{i} p_i \log_2 p_i\)
- **Information Gain:** the parent's entropy minus the size-weighted average entropy of the children: \(IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)\).
- **Gini impurity:** \(G = 1 - \sum_i p_i^2\)
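The purity measures above are easy to compute directly. In this sketch the function names and the toy labels are chosen for illustration, and information gain is computed as the parent's entropy minus the size-weighted entropy of the child nodes.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted_child_entropy

# Toy example: a perfectly mixed parent split into two pure children.
parent = np.array([0, 0, 1, 1])
left, right = np.array([0, 0]), np.array([1, 1])

print("Entropy(parent):", entropy(parent))
print("Gini(parent):", gini(parent))
print("Information gain of the split:", information_gain(parent, [left, right]))
```

A 50/50 binary node has the maximum entropy (1 bit) and the maximum Gini impurity (0.5); a split into pure children recovers all of that entropy as information gain.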

### Tree building (high-level)

1. Start with all training data at the root.
2. For each candidate split (feature + threshold), compute the impurity reduction.
3. Choose the split with the maximum reduction and recurse on the children.
4. Stop when a stopping criterion is met (max depth, min samples, pure node).
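Steps 2 and 3 can be sketched as an exhaustive search over candidate splits. The helper names below are hypothetical, Gini impurity is used as the purity metric, and thresholds are taken as midpoints between consecutive distinct feature values; this is a simplified illustration, not how any particular library implements tree growing.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_split(X, y):
    """Return (feature index, threshold, impurity reduction) of the best single split."""
    n_samples, n_features = X.shape
    parent_impurity = gini(y)
    best = (None, None, 0.0)
    for j in range(n_features):
        values = np.unique(X[:, j])                   # sorted distinct values of feature j
        thresholds = (values[:-1] + values[1:]) / 2   # midpoints between consecutive values
        for t in thresholds:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # Size-weighted impurity of the two children.
            child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            reduction = parent_impurity - child_impurity
            if reduction > best[2]:
                best = (j, float(t), reduction)
    return best

# Toy data in which the label equals the second feature.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 1, 1, 0])

feature, threshold, reduction = best_split(X, y)
print(f"Best split: feature {feature} at threshold {threshold} (impurity reduction {reduction:.2f})")
```

A full tree builder would call `best_split` recursively on the left and right subsets until one of the stopping criteria in step 4 is met.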

### Overfitting & Regularization

- Trees can overfit by creating deep structures that memorize training data.
- Remedies: pruning, limiting max depth, min samples per leaf, ensemble methods (Random Forests, Gradient Boosting).


In [None]:
# Decision Tree example with comments
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# toy dataset
X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # simple binary feature pairs
y = [0, 1, 1, 0]

dt = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)  # entropy criterion
dt.fit(X, y)  # train the decision tree
print('Decision Tree predictions:', dt.predict(X))  # predictions on the training set (here the label equals the second feature)

# Visualize tree (if running interactively this will plot)
plt.figure(figsize=(6,4))  # set figure size
plot_tree(dt, feature_names=['f1','f2'], class_names=['class0','class1'], filled=True)  # draw tree
plt.title('Decision Tree (toy example)')  # title
plt.show()  # display plot


## Linear Regression

### Definition & Working Principle

Linear Regression models the expected value of a continuous target as a linear combination of input features:

\[ y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \epsilon \]

Parameters \(\beta\) are estimated by minimizing the sum of squared residuals (Ordinary Least Squares):

\[ \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^n (y_i - X_i \beta)^2 \]

Closed-form solution (normal equation) when \(X^TX\) is invertible:

\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
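The normal equation can be checked numerically. This sketch simulates data from an assumed line \(y = 1 + 2x\) plus noise (the coefficients and noise level are made up), solves the normal equation, and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)                      # seeded for reproducibility
x = np.linspace(0, 5, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)   # assumed true line plus noise

# Design matrix with a leading column of ones for the intercept beta_0.
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) beta = X^T y; solving the linear system is preferred
# to forming the explicit inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the general least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Normal equation [intercept, slope]:", beta_hat)
print("lstsq           [intercept, slope]:", beta_lstsq)
```

When \(X^TX\) is singular or badly conditioned (for example with collinear features), the normal equation breaks down, which is one motivation for the regularized variants below.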

### Regularization

- **Ridge (L2)** adds the penalty \(\lambda ||\beta||_2^2\) to the loss, shrinking coefficients to reduce variance.
- **Lasso (L1)** adds the penalty \(\lambda ||\beta||_1\), which can drive some coefficients exactly to zero and so performs feature selection.
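For ridge regression without an intercept, the standard closed-form solution is \(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\), a well-known result that the text above does not state. The sketch below applies it to simulated data (the coefficients and noise level are assumptions) and compares it with scikit-learn's `Ridge` with the intercept disabled.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                        # simulated features, no intercept column
true_beta = np.array([1.5, -2.0, 0.5])              # assumed coefficients used to simulate y
y = X @ true_beta + rng.normal(scale=0.1, size=50)

lam = 1.0                                           # regularization strength (lambda)

# Ridge closed form: solve (X^T X + lambda * I) beta = X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
# Ordinary least squares for comparison (lambda = 0).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# scikit-learn's Ridge minimizes the same objective when fit_intercept=False.
sk_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
print("Matches scikit-learn:", np.allclose(beta_ridge, sk_ridge.coef_, atol=1e-6))
```

The ridge coefficient vector has a smaller norm than the OLS one, which is the shrinkage the L2 penalty is designed to produce. Lasso has no closed form in general and is fitted iteratively.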


In [None]:
# Linear Regression example with comments
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# simple linear data: y = 2x + noise
X = np.arange(1, 6).reshape(-1, 1).astype(float)  # feature matrix (5 samples)
rng = np.random.default_rng(42)  # seeded generator so the noise is reproducible
y = (2 * X).ravel() + rng.normal(scale=0.1, size=X.shape[0])  # target with small noise

lr = LinearRegression()  # ordinary least squares
lr.fit(X, y)  # fit model
print('Linear Regression coef (slope):', lr.coef_, 'intercept:', lr.intercept_)  # parameters
print('Predict for x=6:', lr.predict([[6]]))  # predict

# Ridge and Lasso example
ridge = Ridge(alpha=1.0)  # L2 regularization with strength 1.0
lasso = Lasso(alpha=0.1)  # L1 regularization
ridge.fit(X, y)  # fit ridge
lasso.fit(X, y)  # fit lasso
print('Ridge coef:', ridge.coef_)  # ridge coef
print('Lasso coef:', lasso.coef_)  # lasso coef (shrunk toward zero; with more features some can reach exactly zero)


## Logistic Regression

### Definition & Working Principle

Logistic Regression is a linear model for classification that models the log-odds (logit) of the probability of the positive class as a linear function of features. The probability is obtained via the sigmoid (logistic) function:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \quad\text{with } z = \beta_0 + \sum_j \beta_j x_j \]

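Parameters are estimated by minimizing the log-loss (negative log-likelihood) over the \(m\) training examples, where \(h_\beta(x) = \sigma(\beta_0 + \sum_j \beta_j x_j)\) is the predicted probability of the positive class: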

\[ J(\beta) = -\frac{1}{m} \sum_{i=1}^m \left[y^{(i)} \log(h_\beta(x^{(i)})) + (1-y^{(i)})\log(1-h_\beta(x^{(i)}))\right] \]
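The sigmoid and the log-loss translate directly into NumPy. In this sketch the function names, the toy data, and the two candidate parameter vectors are chosen for illustration, and the design matrix carries a leading column of ones so that the first parameter plays the role of \(\beta_0\).

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(beta, X, y):
    """Mean negative log-likelihood J(beta); X has a leading column of ones."""
    p = sigmoid(X @ beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy one-feature dataset: negative x values belong to class 0, positive to class 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

loss_zero = log_loss(np.zeros(2), X, y)            # every prediction is 0.5
loss_good = log_loss(np.array([0.0, 2.0]), X, y)   # positive slope matches the data

print("Loss with beta = [0, 0]:", loss_zero)
print("Loss with beta = [0, 2]:", loss_good)
```

With all parameters at zero every prediction is 0.5, so the loss equals \(\ln 2\); parameters that separate the classes give a lower loss, and that is the quantity an optimizer drives down during training.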

Multiclass logistic regression can be implemented via one-vs-rest (OvR) or by using a multinomial (softmax) formulation.
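A brief sketch of the softmax function used in the multinomial formulation follows; the class scores are arbitrary. It also checks that softmax over two classes with scores \((z, 0)\) reduces to the sigmoid of \(z\).

```python
import numpy as np

def softmax(z):
    """Convert a vector of class scores into probabilities that sum to 1."""
    shifted = z - np.max(z)   # subtracting the max improves numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, -1.0])   # arbitrary linear scores for three classes
probs = softmax(scores)
print("Class probabilities:", probs)
print("Predicted class:", int(np.argmax(probs)))

# Two-class special case: softmax([z, 0])[0] equals sigmoid(z).
two_class = softmax(np.array([1.3, 0.0]))
print("softmax vs sigmoid:", two_class[0], sigmoid(1.3))
```

One-vs-rest instead fits one binary sigmoid model per class and picks the class with the highest score.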


In [None]:
# Logistic Regression example with comments
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# synthetic binary classification dataset
X, y = make_classification(n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

logreg = LogisticRegression(max_iter=200)  # default solver (lbfgs) with a raised iteration cap
logreg.fit(X_train, y_train)  # train model
y_pred = logreg.predict(X_test)  # predict
print('Logistic Regression classification report:')  # label
print(classification_report(y_test, y_pred))  # show metrics


## Support Vector Machine (SVM)

### Definition & Working Principle

Support Vector Machine (SVM) finds a hyperplane that maximizes the margin between classes. For linearly separable data with labels \(y_i \in \{-1, +1\}\), the hard-margin formulation searches for \(w, b\) that minimize \(\frac{1}{2}||w||^2\) subject to the constraints \(y_i(w\cdot x_i + b) \ge 1\) for every training point.

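For non-separable data the soft-margin formulation adds slack variables \(\xi_i \ge 0\) and a penalty parameter \(C\) that trades margin width against margin violations: a larger \(C\) penalizes violations more heavily.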
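The soft-margin objective is commonly written as \(\frac{1}{2}||w||^2 + C\sum_i \xi_i\) with \(\xi_i = \max(0,\, 1 - y_i(w\cdot x_i + b))\). The sketch below evaluates that standard form for a hand-picked hyperplane; the data, `w`, `b`, and the function name are all hypothetical, and labels are assumed to be in \(\{-1, +1\}\).

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Return (objective, slack) for labels y in {-1, +1}."""
    margins = y * (X @ w + b)                 # functional margin of every point
    slack = np.maximum(0.0, 1.0 - margins)    # hinge loss: zero when the margin is at least 1
    objective = 0.5 * float(w @ w) + C * float(slack.sum())
    return objective, slack

# Hypothetical 2D points; two of them lie inside the margin.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [-0.5, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0

objective, slack = soft_margin_objective(w, b, X, y, C=1.0)
print("Slack per point:", slack)
print("Objective with C=1: ", objective)
print("Objective with C=10:", soft_margin_objective(w, b, X, y, C=10.0)[0])
```

Points with zero slack satisfy the hard-margin constraint; raising `C` makes the same violations more expensive, which pushes the optimizer toward a narrower margin with fewer violations.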

### Kernel trick

When data is not linearly separable in the input space, a kernel implicitly maps it into a higher-dimensional feature space where a linear separator may exist, without ever computing the mapped coordinates. Common kernels are linear, polynomial, and RBF (Gaussian).

RBF kernel: \(K(x,x') = \exp(-\gamma ||x - x'||^2)\)
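The RBF kernel formula can be evaluated for a whole dataset at once. This sketch builds the kernel (Gram) matrix with NumPy; the function name, the three points, and the value of `gamma` are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once.
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)   # clip tiny negatives caused by rounding
    return np.exp(-gamma * sq_dists)

# Three arbitrary 2D points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_matrix(X, X, gamma=0.5)
print(np.round(K, 4))
```

Each point has similarity 1 with itself, and similarity decays toward 0 as points move apart; a larger `gamma` makes that decay faster, so each training point influences a smaller neighborhood.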

### Strengths & Limitations

- Strengths: effective in high-dimensional spaces, works well when classes are separated by a clear margin, and the solution depends only on the support vectors (a sparse solution).
- Limitations: training can be slow on very large datasets, requires careful kernel/parameter tuning.


In [None]:
# SVM example with comments
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# synthetic dataset with two blobs
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

svm_linear = SVC(kernel='linear', C=1.0)  # linear SVM with regularization C
svm_linear.fit(X_train, y_train)  # train
y_pred = svm_linear.predict(X_test)  # predict
print('Linear SVM classification report:')  # label
print(classification_report(y_test, y_pred))  # metrics

# RBF kernel example
svm_rbf = SVC(kernel='rbf', gamma='scale', C=1.0)  # RBF kernel
svm_rbf.fit(X_train, y_train)  # train
print('RBF SVM score on test:', svm_rbf.score(X_test, y_test))  # test score


## Conclusion and References

This notebook provided a single-source-of-truth overview of popular supervised learning algorithms, covering:

- Theoretical definitions with mathematical formulas rendered in LaTeX.
- Intuition and plain-English explanations.
- Key terms, strengths, limitations, and optimization notes.
- Runnable Python examples (scikit-learn) with line-by-line comments.

**Recommended references for deeper study:**

- "Pattern Recognition and Machine Learning" by C. Bishop
- "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman
- scikit-learn documentation: https://scikit-learn.org

---

_End of notebook._