# Lesson 8A: Anomaly Detection Theory

Detecting outliers, fraud, and rare events using supervised and unsupervised approaches.

## Introduction

Imagine monitoring credit card transactions. Most are normal: groceries, gas, restaurants. Then suddenly, a $5000 purchase appears in a foreign country.

Anomaly detection identifies these unusual patterns. It's crucial for:

- Fraud detection (credit cards, insurance)
- Network security (intrusion detection)
- Manufacturing (defect detection)
- Healthcare (disease outbreaks)

## Table of Contents

1. What is Anomaly Detection?
2. Statistical Methods
3. Isolation Forest
4. One-Class SVM
5. Autoencoders for Anomaly Detection
6. Supervised Approaches
7. Evaluation Metrics
8. Real-World Applications

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score

np.random.seed(42)  # Fix the seed so the synthetic data is reproducible
print('✅ Libraries loaded')

## Statistical Approach: Gaussian Distribution

**Method:** Model normal data as Gaussian and flag points with low probability density.

**Algorithm:**

1. Compute the mean μ and variance σ² from the training data
2. For a new point x, compute p(x)
3. If p(x) < ε (threshold), flag x as an anomaly

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

For multivariate data, use the Mahalanobis distance.
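The three steps above can be sketched for a single feature as follows. This is a minimal illustration, not a production detector: the training data, the query points, and the choice of ε (the density at three standard deviations from the mean) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D training data, assumed to be mostly normal readings
train = rng.normal(loc=50.0, scale=5.0, size=1000)

# Step 1: estimate mean and variance from the training data
mu = train.mean()
sigma2 = train.var()

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x), matching the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Step 2: compute p(x) for new points (illustrative values)
new_points = np.array([49.0, 52.5, 80.0])
p = gaussian_pdf(new_points, mu, sigma2)

# Step 3: flag points whose density falls below epsilon.
# Assumption: epsilon is the density at 3 standard deviations from the mean.
epsilon = gaussian_pdf(mu + 3 * np.sqrt(sigma2), mu, sigma2)
is_anomaly = p < epsilon

for x, px, flag in zip(new_points, p, is_anomaly):
    print(f"x={x:5.1f}  p(x)={px:.2e}  anomaly={flag}")
```

In practice ε is usually tuned on a labeled validation set rather than fixed in advance.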

In [None]:
# Generate normal data with a few anomalies
normal = np.random.normal(0, 1, (1000, 2))
anomalies = np.random.uniform(-4, 4, (50, 2))
X = np.vstack([normal, anomalies])
y = np.array([0]*1000 + [1]*50)  # 0=normal, 1=anomaly

# Plot both classes to visualize the detection problem
plt.figure(figsize=(10, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], alpha=0.5, label='Normal', s=20)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', alpha=0.7, label='Anomaly', s=50, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Anomaly Detection Problem', fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
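For two-dimensional data like the synthetic sample above, the Mahalanobis distance mentioned earlier could be applied roughly as follows. This sketch regenerates its own normal data so it runs on its own. The query points and the 99% cutoff are assumptions chosen for illustration. For Gaussian data in two dimensions the squared distance follows a chi-square distribution with 2 degrees of freedom, whose quantile has the closed form used below.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical training set of normal points (same shape as the sample above)
normal_train = rng.normal(0, 1, (1000, 2))

# Estimate the mean vector and covariance matrix from the training data
mu = normal_train.mean(axis=0)
cov = np.cov(normal_train, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis_sq(points, mu, cov_inv):
    """Squared Mahalanobis distance of each row of `points` from `mu`."""
    diff = points - mu
    return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Illustrative query points: one near the center, one far away
queries = np.array([[0.1, -0.2], [3.5, 3.5]])
d2 = mahalanobis_sq(queries, mu, cov_inv)

# Assumption: flag the most extreme 1% under a Gaussian model.
# Chi-square with 2 degrees of freedom has CDF 1 - exp(-x/2),
# so its 99th percentile is -2 * ln(0.01).
threshold = -2 * np.log(0.01)
flags = d2 > threshold

for q, d, f in zip(queries, d2, flags):
    print(f"point={q}  d^2={d:.2f}  anomaly={f}")
```

Unlike per-feature thresholds, this accounts for correlation between features through the covariance matrix.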

## Isolation Forest

**Key Insight:** Anomalies are 'few and different', so they're isolated quickly in random trees.

**Algorithm:**

1. Randomly select a feature and a split value
2. Build a tree by recursively splitting the data
3. Anomalies require fewer splits to isolate
4. The anomaly score is derived from the average path length across trees: the shorter the path, the more anomalous the point

**Advantages:**

- Fast (linear time complexity)
- No need to define a distance metric
- Handles high dimensions well
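To make the path-length idea concrete, here is a simplified sketch that counts how many random splits it takes to isolate a point. It is not scikit-learn's implementation: it follows only the branch containing the point of interest, and it omits subsampling and the score normalization a real Isolation Forest applies. The helper `isolation_depth` and the data are hypothetical.

```python
import numpy as np

def isolation_depth(X, point, rng, max_depth=50):
    """Count random splits until `point` (a row of X) is alone in its partition."""
    subset = X
    for depth in range(max_depth):
        if len(subset) <= 1:
            return depth
        # Step 1: pick a random feature and a random split value in its range
        feature = rng.integers(X.shape[1])
        lo, hi = subset[:, feature].min(), subset[:, feature].max()
        if lo == hi:
            # A constant feature cannot be split; this attempt still counts
            continue
        split = rng.uniform(lo, hi)
        # Step 2: keep only the side of the split that contains `point`
        mask = subset[:, feature] < split
        subset = subset[mask] if point[feature] < split else subset[~mask]
    return max_depth

rng = np.random.default_rng(0)
normal_pts = rng.normal(0, 1, (256, 2))
outlier = np.array([6.0, 6.0])              # illustrative far-away point
X_demo = np.vstack([normal_pts, outlier])

# Use the normal point closest to the origin as a typical inlier
inlier = normal_pts[np.argmin(np.linalg.norm(normal_pts, axis=1))]

# Steps 3 and 4: average the path length over many random trees
n_trees = 200
outlier_avg = np.mean([isolation_depth(X_demo, outlier, rng) for _ in range(n_trees)])
inlier_avg = np.mean([isolation_depth(X_demo, inlier, rng) for _ in range(n_trees)])

print(f"average depth, outlier: {outlier_avg:.1f}")
print(f"average depth, inlier:  {inlier_avg:.1f}")
```

The outlier should need noticeably fewer splits on average, which is the signal an Isolation Forest turns into an anomaly score.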

In [None]:
# Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
y_pred = iso_forest.fit_predict(X)
y_pred = np.where(y_pred == -1, 1, 0)  # Convert sklearn's -1/+1 to 1=anomaly, 0=normal

print('Isolation Forest Results:')
print(classification_report(y, y_pred, target_names=['Normal', 'Anomaly']))

# score_samples is higher for normal points, so negate it to rank anomalies (label 1) highest
anomaly_scores = -iso_forest.score_samples(X)
print(f'ROC-AUC: {roc_auc_score(y, anomaly_scores):.3f}')

## One-Class SVM

**Idea:** Find a boundary that encloses the normal data in feature space.

**Algorithm:**

1. Map the data to a high-dimensional space (kernel trick)
2. Find a hyperplane separating the data from the origin
3. Maximize the margin (the distance between the hyperplane and the origin)
4. Points outside the boundary are anomalies

**Best for:** Small, clean datasets with a clear normal region
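A common way to use this idea is to fit the model on data assumed to be clean and then score unseen points against the learned boundary. The sketch below does that with scikit-learn's `OneClassSVM`. The training data, the two query points, and the `gamma='scale'` and `nu=0.05` settings are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical clean training set containing only normal points
X_train = rng.normal(0, 1, (500, 2))

# nu is an upper bound on the fraction of training points treated as outliers
model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
model.fit(X_train)

# Illustrative query points: one at the center, one far from the training data
X_new = np.array([[0.0, 0.0], [5.0, 5.0]])

# decision_function is positive inside the boundary and negative outside
scores = model.decision_function(X_new)
# predict returns +1 for inliers and -1 for outliers
labels = model.predict(X_new)

for point, score, label in zip(X_new, scores, labels):
    status = 'normal' if label == 1 else 'anomaly'
    print(f"point={point}  score={score:+.3f}  -> {status}")
```

The signed score is useful when a ranking is needed rather than a hard label, for example to review the most suspicious points first.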

In [None]:
# One-Class SVM
svm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.05)
y_pred_svm = svm.fit_predict(X)
y_pred_svm = np.where(y_pred_svm == -1, 1, 0)  # Convert sklearn's -1/+1 to 1=anomaly, 0=normal

print('One-Class SVM Results:')
print(classification_report(y, y_pred_svm, target_names=['Normal', 'Anomaly']))

## Conclusion

**Method Selection Guide:**

**Isolation Forest:**

- ✅ Large datasets
- ✅ High dimensions
- ✅ Fast detection needed
- ✅ General-purpose

**One-Class SVM:**

- ✅ Small, clean dataset
- ✅ Clear normal region
- ✅ Need decision boundary

**Statistical (Gaussian):**

- ✅ Data truly Gaussian
- ✅ Low dimensions
- ✅ Interpretability important

**Autoencoders (Deep Learning):**

- ✅ Images, sequences
- ✅ Complex patterns
- ✅ GPU available

**Next:** Lesson 8B - Production anomaly detection systems!