# Notebook 3: Anomaly Detection

Welcome to the third notebook in our advanced machine learning series under **Part_3_Advanced_Topics**. In this notebook, we will explore **Anomaly Detection**, a critical technique for identifying outliers or unusual patterns in data, which is widely used in fraud detection, network security, and quality control.

We'll cover the following topics:
- What is Anomaly Detection?
- Key concepts: Outliers, Novelty Detection, and Types of Anomalies
- How Anomaly Detection works
- Implementation using scikit-learn
- Advantages and limitations

## What is Anomaly Detection?

Anomaly Detection is the process of identifying data points, events, or observations that deviate significantly from the majority of the data or expected behavior. These anomalies, also called outliers, can indicate critical incidents like fraud, system failures, or rare events.

Anomaly detection can be applied in supervised, semi-supervised, or unsupervised settings, depending on whether labeled data is available.

## Key Concepts

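- **Outliers:** Data points that differ significantly from the rest of the dataset, often due to natural variability, measurement errors, or a different underlying process. In outlier detection, the training data itself may contain such points.
- **Novelty Detection:** Identifying new, previously unseen patterns that differ from the training data. The model is fit on data assumed to be free of anomalies and then applied to new observations (often described as a semi-supervised setting).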
- **Types of Anomalies:**
  - **Point Anomalies:** Individual data points that are anomalous (e.g., a single fraudulent transaction).
  - **Contextual Anomalies:** Data points that are anomalous in a specific context (e.g., high temperature in winter).
  - **Collective Anomalies:** A collection of data points that are anomalous together (e.g., a sequence of unusual network traffic).
- **Unsupervised vs. Supervised:** Unsupervised methods detect anomalies without labeled data, while supervised methods use labeled examples of normal and anomalous data.
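- **Isolation Forest and One-Class SVM:** Popular algorithms for detecting anomalies without anomaly labels. Isolation Forest isolates points by recursively partitioning the data at random; anomalies tend to be separated in fewer splits than normal points. One-Class SVM learns a boundary around the normal data and flags points that fall outside it.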
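
To make the difference between point and contextual anomalies concrete, here is a minimal sketch using made-up temperature readings. The data, the `zscores` helper, and the cutoff of 3 are illustrative assumptions rather than part of this notebook's implementation. A 24 °C day is unremarkable across the whole year, but it stands out once it is scored against winter readings only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily temperatures (°C): 90 winter days and 90 summer days.
winter = rng.normal(loc=2.0, scale=3.0, size=90)
summer = rng.normal(loc=26.0, scale=3.0, size=90)
winter[10] = 24.0  # contextual anomaly: ordinary in summer, extreme in winter

temps = np.concatenate([winter, summer])
season = np.array(["winter"] * 90 + ["summer"] * 90)


def zscores(values):
    """Standard score of each value relative to the mean and std of `values`."""
    return (values - values.mean()) / values.std()


# Global view: every reading is scored against the whole year.
global_z = zscores(temps)

# Contextual view: every reading is scored against its own season only.
context_z = np.empty_like(temps)
for name in np.unique(season):
    mask = season == name
    context_z[mask] = zscores(temps[mask])

threshold = 3.0  # illustrative cutoff on |z|
global_flags = np.abs(global_z) > threshold
context_flags = np.abs(context_z) > threshold

print("Flagged globally:     ", global_flags[10])
print("Flagged within season:", context_flags[10])
```

The same idea carries over to richer contexts (time of day, customer segment, device type): the score is computed relative to comparable observations instead of the full dataset.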

## How Anomaly Detection Works

Anomaly detection typically involves the following steps:

1. **Data Preparation:** Clean and preprocess the data, handling missing values and scaling features if necessary.
2. **Model Selection:** Choose an appropriate algorithm based on the data and problem (e.g., Isolation Forest for unsupervised outlier detection, One-Class SVM for novelty detection).
3. **Training:** Fit the model on normal data only (novelty detection) or on unlabeled data that mixes normal and anomalous points (unsupervised outlier detection) to learn the characteristics of typical behavior.
4. **Scoring:** Assign anomaly scores or labels to data points based on how much they deviate from the learned normal behavior.
5. **Thresholding:** Set a threshold on anomaly scores to classify points as normal or anomalous (often based on domain knowledge or statistical measures).
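6. **Evaluation:** If labeled data is available, evaluate using metrics such as precision, recall, or F1-score for the anomaly class. Because anomalies are rare, overall accuracy can look high even when most anomalies are missed.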
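
The steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not a recommended configuration: the toy data, the 3% cutoff, and all variable names are made up for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 1. Data preparation: toy data (an assumption), then feature scaling.
#    Tree-based Isolation Forest does not strictly need scaling; it is shown
#    here because distance-based models such as One-Class SVM do.
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
radius = rng.uniform(6.0, 9.0, size=15)
angle = rng.uniform(0.0, 2 * np.pi, size=15)
X_anomalies = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
X_demo = np.vstack([X_inliers, X_anomalies])
y_demo = np.r_[np.ones(500, dtype=int), -np.ones(15, dtype=int)]  # 1 normal, -1 anomaly
X_scaled = StandardScaler().fit_transform(X_demo)

# 2-3. Model selection and training (unsupervised: fit on the mixed data).
model = IsolationForest(n_estimators=100, random_state=0).fit(X_scaled)

# 4. Scoring: score_samples returns lower values for more abnormal points.
scores = model.score_samples(X_scaled)

# 5. Thresholding: flag the lowest-scoring 3% (an illustrative cutoff).
threshold = np.percentile(scores, 3)
y_flag = np.where(scores < threshold, -1, 1)

# 6. Evaluation: precision and recall for the anomaly class (label -1).
precision = precision_score(y_demo, y_flag, pos_label=-1)
recall = recall_score(y_demo, y_flag, pos_label=-1)
print(f"threshold={threshold:.3f}  precision={precision:.2f}  recall={recall:.2f}")
```

Thresholding the raw scores yourself, instead of relying on `predict`, keeps the cutoff explicit and easy to adjust when domain knowledge suggests a different alert rate.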

## Implementation Using scikit-learn

Let's implement anomaly detection using scikit-learn with two popular algorithms: Isolation Forest and One-Class SVM. We'll use a synthetic dataset with intentional outliers to demonstrate the process.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report

# Generate a synthetic dataset with normal data and outliers
np.random.seed(42)
# Normal data: two clusters
X_normal, _ = make_blobs(n_samples=300, centers=2, n_features=2, cluster_std=0.5, random_state=42)
# Outliers: uniformly distributed in a larger range
X_outliers = np.random.uniform(low=-6, high=6, size=(30, 2))
# Combine normal and outlier data
X = np.vstack([X_normal, X_outliers])
# Create true labels (1 for normal, -1 for outliers) as integers to match predict() output
y_true = np.ones(len(X), dtype=int)
y_true[-len(X_outliers):] = -1

# 1. Isolation Forest for Anomaly Detection
# contamination is the expected share of outliers; here 30 of 330 points (about 9%) are outliers
iso_forest = IsolationForest(contamination=0.1, random_state=42)
iso_forest.fit(X)
y_pred_iso = iso_forest.predict(X)  # Returns 1 for normal, -1 for anomaly

# 2. One-Class SVM for Anomaly Detection
oc_svm = OneClassSVM(kernel='rbf', nu=0.1)
oc_svm.fit(X_normal)  # Train only on normal data for novelty detection
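# Note: X includes the normal points used for training, so results for the normal class may be optimistic
y_pred_svm = oc_svm.predict(X)  # Returns 1 for normal, -1 for anomaly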

# Visualize the results
plt.figure(figsize=(12, 5))

# Plot Isolation Forest results
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_pred_iso, cmap='coolwarm', label='Predicted')
plt.title('Isolation Forest Anomaly Detection')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Normal (1) / Anomaly (-1)')

# Plot One-Class SVM results
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=y_pred_svm, cmap='coolwarm', label='Predicted')
plt.title('One-Class SVM Anomaly Detection')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Normal (1) / Anomaly (-1)')

plt.tight_layout()
plt.show()

# Evaluate the models using classification report (since we have synthetic labels)
print('Isolation Forest Classification Report:')
print(classification_report(y_true, y_pred_iso, target_names=['Anomaly', 'Normal']))

print('One-Class SVM Classification Report:')
print(classification_report(y_true, y_pred_svm, target_names=['Anomaly', 'Normal']))
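
The One-Class SVM report above is computed partly on the same normal points the model was trained on. As a hedged variation, the sketch below holds out a third of the normal data and evaluates only on points the model has not seen. The split size, the added `StandardScaler` step, and the names `novelty_model`, `X_holdout`, and `X_test` are assumptions made for this example; the data is regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Same style of synthetic data as above: two tight clusters plus uniform outliers.
X_norm, _ = make_blobs(n_samples=300, centers=2, n_features=2, cluster_std=0.5, random_state=42)
X_out = rng.uniform(low=-6, high=6, size=(30, 2))

# Hold out 100 normal points; the model never sees them during fitting.
X_train, X_holdout = train_test_split(X_norm, test_size=100, random_state=42)

# Test set: held-out normal points plus all outliers (1 = normal, -1 = anomaly).
X_test = np.vstack([X_holdout, X_out])
y_test = np.r_[np.ones(len(X_holdout), dtype=int), -np.ones(len(X_out), dtype=int)]

# The RBF kernel is distance-based, so features are standardized first.
novelty_model = make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", nu=0.1))
novelty_model.fit(X_train)

y_pred = novelty_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Anomaly", "Normal"]))

# decision_function is higher for more "normal" points, so it can be ranked
# directly against labels where 1 (normal) is the positive class.
scores = novelty_model.decision_function(X_test)
auc = roc_auc_score(y_test, scores)
print(f"ROC AUC on held-out data: {auc:.3f}")
```

ROC AUC summarizes how well the scores rank normal points above anomalies without committing to a single threshold, which complements the thresholded precision and recall in the report.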

## Advantages and Limitations

**Advantages:**
- Identifies rare events or outliers that could indicate critical issues like fraud or system failures.
- Works in unsupervised settings, which is useful when labeled anomaly data is scarce.
- Can be applied across various domains, from finance to cybersecurity.

**Limitations:**
- Defining what constitutes an anomaly can be subjective and context-dependent, requiring domain knowledge for thresholding.
- High-dimensional data can make anomaly detection challenging due to the curse of dimensionality.
- Unsupervised methods may produce false positives or miss subtle anomalies if the model doesn't capture the true distribution of normal data.
- Performance heavily depends on the choice of algorithm and hyperparameters (e.g., contamination ratio).
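
To illustrate the last point, the following sketch refits an Isolation Forest with several `contamination` values on data generated the same way as in the implementation section. The specific values tried and the names `X_all` and `flag_counts` are assumptions for illustration. With a fixed `random_state`, only the decision threshold changes between runs, so the number of flagged points tracks the chosen ratio.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Two tight clusters of normal points plus 30 uniformly scattered outliers.
X_norm, _ = make_blobs(n_samples=300, centers=2, n_features=2, cluster_std=0.5, random_state=42)
X_out = rng.uniform(low=-6, high=6, size=(30, 2))
X_all = np.vstack([X_norm, X_out])

# Count how many points are labeled -1 (anomaly) for each contamination setting.
flag_counts = {}
for contamination in (0.01, 0.05, 0.10, 0.20):
    forest = IsolationForest(contamination=contamination, random_state=42).fit(X_all)
    flag_counts[contamination] = int((forest.predict(X_all) == -1).sum())
    print(f"contamination={contamination:.2f} -> {flag_counts[contamination]} points flagged")
```

If the true anomaly rate is unknown, a setting that is too low hides real anomalies, while one that is too high floods the output with false positives.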

## Conclusion

Anomaly Detection is a vital technique for uncovering unusual patterns or events in data, with applications ranging from fraud detection to predictive maintenance. Algorithms like Isolation Forest and One-Class SVM provide effective ways to identify outliers, even in unsupervised settings. Understanding the nature of anomalies and selecting the right method for your data is key to successful implementation.

In the next notebook, we will explore another advanced topic to further enhance our machine learning toolkit.