# Data Poisoning - Hands-On Lab

**Part of HackLearn Pro**

Welcome to this interactive lab on Data Poisoning attacks! Learn how attackers corrupt training data and how to defend against these attacks.

## Learning Objectives
- Understand how data poisoning compromises ML models at the training level
- Implement a targeted poisoning attack with mislabeled samples, and see how it relates to trigger-based backdoor attacks
- Practice detection techniques like anomaly detection and activation clustering
- Build secure training pipelines with data validation
- Explore STRIP defense and differential privacy

## Prerequisites
- Basic Python and NumPy knowledge
- Understanding of machine learning training
- Familiarity with scikit-learn

---

## Setup

Install required packages for data poisoning experiments:

In [None]:
# Install dependencies
!pip install numpy matplotlib scikit-learn -q

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Setup complete! Ready to explore data poisoning.")

## Part 1: Understanding Data Poisoning

Data poisoning attacks inject malicious data into training sets to compromise model integrity. There are two main types:
- **Targeted Poisoning:** Causes specific, attacker-chosen misclassifications while leaving overall accuracy largely intact. Backdoor attacks are a variant in which the misclassification is tied to a trigger pattern in the input
- **Indiscriminate (Availability) Poisoning:** Degrades overall model performance across all inputs
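
The rest of this lab focuses on targeted poisoning. For contrast, here is a minimal sketch of one common form of indiscriminate poisoning: random label flipping. The helper name `flip_labels` and its parameters are illustrative assumptions, not part of the lab's code.

```python
import numpy as np

def flip_labels(y, rate, n_classes, seed=0):
    """Return a copy of y with a fraction `rate` of labels changed to a different class."""
    rng = np.random.default_rng(seed)
    y_flipped = y.copy()
    n_flip = int(round(rate * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    # An offset in [1, n_classes - 1] taken modulo n_classes always yields a new label
    offsets = rng.integers(1, n_classes, size=n_flip)
    y_flipped[idx] = (y[idx] + offsets) % n_classes
    return y_flipped, idx

# Hypothetical demo labels: 3 classes, 50 samples each
y_demo = np.repeat([0, 1, 2], 50)
y_noisy, flipped_idx = flip_labels(y_demo, rate=0.10, n_classes=3)
print(f"Flipped {np.sum(y_noisy != y_demo)} of {len(y_demo)} labels")
```

Unlike the targeted attack in Part 2, this spreads the damage across all classes instead of steering one class toward another.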

In [None]:
# Load clean Iris dataset
data = load_iris()
X, y = data.data, data.target

print(f"Dataset: {len(X)} samples, {X.shape[1]} features, {len(np.unique(y))} classes")
print(f"Classes: {data.target_names}")
print(f"Class distribution: {np.bincount(y)}")

### Train Clean Baseline Model

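First, let's establish a baseline by training on clean data. Note that this lab scores every model on the same 150 clean samples, which the model has also seen during training, so the reported figures are training accuracies: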

In [None]:
# Train clean model
clean_model = SGDClassifier(max_iter=1000, random_state=42)
clean_model.fit(X, y)

clean_accuracy = clean_model.score(X, y)
print(f"Clean model training accuracy: {clean_accuracy:.2%}")

# Store for comparison
baseline_accuracy = clean_accuracy
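
Training accuracy tends to be optimistic. As an optional comparison, the sketch below evaluates on a held-out split instead. The 70/30 split, the `StandardScaler` step, and the variable names are assumptions added for illustration; the lab itself trains on unscaled data and evaluates on the full dataset.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)

# Stratify so each class keeps the same proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=42
)

# Scaling is an assumption here: SGD-based models usually behave better on standardized features
heldout_model = make_pipeline(
    StandardScaler(), SGDClassifier(max_iter=1000, random_state=42)
)
heldout_model.fit(X_train, y_train)

train_acc = heldout_model.score(X_train, y_train)
test_acc = heldout_model.score(X_test, y_test)
print(f"Train accuracy:    {train_acc:.2%}")
print(f"Held-out accuracy: {test_acc:.2%}")
```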

## Part 2: Targeted Data Poisoning Attack

**Attack Goal:** Inject poisoned samples that cause class 0 (setosa) to be misclassified as class 1 (versicolor).

**Method:** Take a few genuine class 0 samples, add small Gaussian noise, and inject them into the training set labeled as class 1.

In [None]:
# Create targeted poisoning attack
class0_idx = np.where(y == 0)[0]

# Select 5 class 0 samples and add slight perturbation
poison_X = X[class0_idx][:5] + np.random.normal(0, 0.2, size=(5, 4))
poison_y = np.array([1] * 5)  # Mislabel as class 1

# Inject poisoned data into training set
X_poisoned = np.vstack([X, poison_X])
y_poisoned = np.concatenate([y, poison_y])

print(f"Original dataset: {len(X)} samples")
print(f"Poisoned dataset: {len(X_poisoned)} samples")
print(f"Poison rate: {len(poison_X)/len(X_poisoned):.1%}")

### Train Model on Poisoned Data

In [None]:
# Train on poisoned data (VULNERABLE - no validation!)
poisoned_model = SGDClassifier(max_iter=1000, random_state=42)
poisoned_model.fit(X_poisoned, y_poisoned)

poisoned_accuracy = poisoned_model.score(X, y)
print(f"Poisoned model accuracy on clean data: {poisoned_accuracy:.2%}")

### Evaluate Attack Success

Test specifically for class 0 → class 1 misclassifications:

In [None]:
# Test on class 0 samples
class0_samples = X[class0_idx]
predictions = poisoned_model.predict(class0_samples)

misclassification_rate = np.mean(predictions != 0)
misclassified_as_1 = np.mean(predictions == 1)

print(f"\nAttack Evaluation:")
print(f"Class 0 misclassification rate: {misclassification_rate:.2%}")
print(f"Class 0 → Class 1 rate: {misclassified_as_1:.2%}")
print(f"\nAttack success: {'YES' if misclassified_as_1 > 0.1 else 'NO'}")
print(f"Accuracy degradation: {(baseline_accuracy - poisoned_accuracy):.2%}")
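
A single run with 5 poison samples says little about how the attack scales. The sketch below repeats the same idea for several poison counts so you can compare them. The function name `class0_to_class1_rate` and the chosen counts are assumptions for illustration, and the poison samples are drawn at random here instead of taking the first five.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

def class0_to_class1_rate(n_poison, seed=42):
    """Retrain with n_poison mislabeled setosa copies; return the setosa -> versicolor rate."""
    rng = np.random.default_rng(seed)
    X_iris, y_iris = load_iris(return_X_y=True)
    setosa = X_iris[y_iris == 0]

    # Perturbed setosa samples, labeled as class 1 (same idea as the attack cell)
    picks = rng.choice(len(setosa), size=n_poison, replace=False)
    X_bad = setosa[picks] + rng.normal(0, 0.2, size=(n_poison, X_iris.shape[1]))

    X_mix = np.vstack([X_iris, X_bad])
    y_mix = np.concatenate([y_iris, np.ones(n_poison, dtype=y_iris.dtype)])

    model = SGDClassifier(max_iter=1000, random_state=seed).fit(X_mix, y_mix)
    return float(np.mean(model.predict(setosa) == 1))

for n in (0, 5, 15, 30):
    rate = class0_to_class1_rate(n)
    print(f"{n:>2} poison samples -> class 0 predicted as class 1: {rate:.1%}")
```

Results vary with the seed and the classifier, so treat the numbers as observations about this setup, not as general thresholds.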

## Part 3: Defense - Anomaly Detection

**Defense Strategy:** Use Isolation Forest to detect anomalous samples in the training data before training.

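**Rationale:** Poisoned samples often have unusual feature values. Keep in mind that Isolation Forest looks only at the features, not the labels, so mislabeled samples that closely resemble genuine data may pass through undetected.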

In [None]:
def detect_poisoned_data(X, y, contamination=0.05):
    """
    Detect anomalies in training data using Isolation Forest

    Args:
        X: Feature matrix
        y: Labels (accepted for convenience but unused; the detector sees features only)
        contamination: Expected fraction of outliers

    Returns:
        Boolean mask of clean samples
    """
    detector = IsolationForest(contamination=contamination, random_state=42)
    predictions = detector.fit_predict(X)

    # 1 = clean, -1 = anomaly
    clean_mask = predictions == 1

    return clean_mask

# Apply anomaly detection
clean_mask = detect_poisoned_data(X_poisoned, y_poisoned, contamination=0.05)
X_cleaned = X_poisoned[clean_mask]
y_cleaned = y_poisoned[clean_mask]

removed = len(X_poisoned) - len(X_cleaned)
print(f"Original dataset: {len(X_poisoned)} samples")
print(f"Removed: {removed} suspicious samples ({removed/len(X_poisoned):.1%})")
print(f"Cleaned dataset: {len(X_cleaned)} samples")
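
The counts above show how many samples were removed, but not whether they were the poisoned ones. Since this lab knows which rows are poison (the rows appended after the original data), detection quality can be measured directly. The helper below is a sketch; `poison_detection_report` is a hypothetical name, and the toy mask is made up for the demo.

```python
import numpy as np

def poison_detection_report(clean_mask, poison_idx):
    """Summarize how many known poison rows a detector flagged."""
    clean_mask = np.asarray(clean_mask, dtype=bool)
    flagged = ~clean_mask

    is_poison = np.zeros(len(clean_mask), dtype=bool)
    is_poison[poison_idx] = True

    caught = int(np.sum(flagged & is_poison))
    false_alarms = int(np.sum(flagged & ~is_poison))
    return {
        "recall": caught / max(int(is_poison.sum()), 1),
        "caught": caught,
        "false_alarms": false_alarms,
    }

# Toy mask: 10 rows, rows 8 and 9 are poison, the detector flagged rows 2 and 9
toy_mask = np.array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0], dtype=bool)
print(poison_detection_report(toy_mask, poison_idx=[8, 9]))
# {'recall': 0.5, 'caught': 1, 'false_alarms': 1}
```

In this notebook you could call it as `poison_detection_report(clean_mask, np.arange(len(X), len(X_poisoned)))`.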

### Train Model on Cleaned Data

In [None]:
# Train on validated data
secure_model = SGDClassifier(max_iter=1000, random_state=42)
secure_model.fit(X_cleaned, y_cleaned)

# Evaluate defense effectiveness
secure_accuracy = secure_model.score(X, y)
secure_predictions = secure_model.predict(class0_samples)
secure_misclass = np.mean(secure_predictions != 0)

print(f"\nDefense Results:")
print(f"Secure model accuracy: {secure_accuracy:.2%}")
print(f"Class 0 misclassification rate: {secure_misclass:.2%}")
print(f"\nImprovement vs poisoned model: {(secure_accuracy - poisoned_accuracy):.2%}")
print(f"Misclassification reduction: {(misclassification_rate - secure_misclass):.2%}")
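
Because the Isolation Forest step ignores labels, a label-aware check can complement it. One simple option, sketched here as an assumption and not part of the original lab, is to flag samples whose label disagrees with most of their nearest neighbors. The function name, `k`, and `threshold` are illustrative choices, and the demo uses synthetic data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_disagreement(X, y, k=10, threshold=0.5):
    """Flag samples whose label differs from more than `threshold` of their k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neighbor_idx = nn.kneighbors(X)
    # Column 0 is the query point itself (or an exact duplicate of it), so skip it
    neighbor_labels = y[neighbor_idx[:, 1:]]
    disagreement = np.mean(neighbor_labels != y[:, None], axis=1)
    return np.where(disagreement > threshold)[0]

# Synthetic demo: two well-separated clusters, three rows in cluster A mislabeled as class 1
rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 0.5, size=(40, 2))
X_b = rng.normal(10.0, 0.5, size=(40, 2))
X_demo = np.vstack([X_a, X_b])
y_demo = np.array([0] * 40 + [1] * 40)
y_demo[:3] = 1

print(knn_label_disagreement(X_demo, y_demo))  # [0 1 2]
```

On real data the clusters overlap, so expect some false alarms near class boundaries and tune `k` and `threshold` accordingly.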

### Visualize Attack Impact

In [None]:
# Compare model performance
models = ['Clean Model', 'Poisoned Model', 'Defended Model']
accuracies = [baseline_accuracy, poisoned_accuracy, secure_accuracy]
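# Measure the clean model's class 0 error instead of assuming it is zero
clean_misclass = np.mean(clean_model.predict(class0_samples) != 0)
misclass_rates = [clean_misclass, misclassification_rate, secure_misclass]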

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
colors = ['green', 'red', 'blue']
ax1.bar(models, accuracies, color=colors, alpha=0.7)
ax1.set_ylabel('Accuracy')
ax1.set_title('Model Accuracy Comparison')
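# Widen the axis if any accuracy falls below 0.8, and leave room for the bar labels
ax1.set_ylim([min(0.8, min(accuracies) - 0.05), 1.05])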
ax1.grid(axis='y', alpha=0.3)
for i, v in enumerate(accuracies):
    ax1.text(i, v + 0.01, f"{v:.1%}", ha='center', fontweight='bold')

# Misclassification comparison
ax2.bar(models, misclass_rates, color=colors, alpha=0.7)
ax2.set_ylabel('Class 0 Misclassification Rate')
ax2.set_title('Attack Impact on Target Class')
# Guard against a zero-height axis when no class 0 sample is misclassified
ax2.set_ylim([0, max(max(misclass_rates), 0.05) * 1.2])
ax2.grid(axis='y', alpha=0.3)
for i, v in enumerate(misclass_rates):
    ax2.text(i, v + 0.005, f"{v:.1%}", ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nKey Findings:")
print(f"- Poisoning reduced accuracy by {(baseline_accuracy - poisoned_accuracy):.1%}")
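lost_accuracy = baseline_accuracy - poisoned_accuracy
if lost_accuracy > 0:
    print(f"- Defense recovered {(secure_accuracy - poisoned_accuracy)/lost_accuracy:.0%} of lost accuracy")
else:
    print("- Poisoning did not reduce overall accuracy, so there is no lost accuracy to recover")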
print(f"- Target misclassification reduced from {misclassification_rate:.1%} to {secure_misclass:.1%}")

## Part 4: Challenge Exercise

### Challenge: Implement Activation Clustering Defense

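Activation clustering detects backdoors by clustering a model's internal representations: poisoned samples tend to form their own small cluster, separate from clean samples. A linear `SGDClassifier` has no hidden layers, so this exercise uses its `decision_function` scores as a stand-in for activations.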

**Your Task:** Complete the function below to detect suspicious samples.

In [None]:
from sklearn.cluster import KMeans

def detect_backdoor_activation_clustering(model, X, n_clusters=3):
    """
    Detect backdoored samples using activation clustering

    TODO: Implement this defense

    Hints:
    - Use model.decision_function(X) to get activations
    - Cluster activations using KMeans
    - Identify small, isolated clusters as suspicious
    - Return indices of suspicious samples

    Args:
        model: Trained classifier
        X: Input samples
        n_clusters: Number of clusters

    Returns:
        List of suspicious sample indices
    """
    # YOUR CODE HERE
    # Step 1: Get activations/decision values
    # Step 2: Cluster the activations
    # Step 3: Identify small clusters (< 15% of data)
    # Step 4: Return indices of samples in small clusters

    pass

# Test your implementation
# suspicious_indices = detect_backdoor_activation_clustering(poisoned_model, X_poisoned)
# print(f"Detected {len(suspicious_indices)} suspicious samples")
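
If you get stuck, here is one possible way to follow the hints, shown on synthetic activations so it runs on its own. It is a sketch, not the only valid solution: the helper name `small_cluster_indices`, the `min_fraction` parameter, and the synthetic data are assumptions, and it takes precomputed activations instead of a model.

```python
import numpy as np
from sklearn.cluster import KMeans

def small_cluster_indices(activations, n_clusters=3, min_fraction=0.15, seed=42):
    """Cluster activations; return indices of samples in clusters smaller than min_fraction."""
    activations = np.asarray(activations, dtype=float)
    if activations.ndim == 1:
        # Binary classifiers return 1-D decision values; KMeans needs 2-D input
        activations = activations.reshape(-1, 1)

    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(activations)
    sizes = np.bincount(labels, minlength=n_clusters)

    # Clusters holding less than min_fraction of all samples are treated as suspicious
    small = np.where(sizes < min_fraction * len(activations))[0]
    return np.where(np.isin(labels, small))[0].tolist()

# Synthetic activations: two large groups plus one small, isolated group (rows 120-125)
rng = np.random.default_rng(0)
acts = np.vstack([
    rng.normal([0, 0, 0], 1.0, size=(60, 3)),
    rng.normal([100, 0, 0], 1.0, size=(60, 3)),
    rng.normal([0, 100, 0], 1.0, size=(6, 3)),
])
print(small_cluster_indices(acts))
```

To adapt it to the challenge signature, pass `model.decision_function(X)` as the activations. On the Iris data the poisoned rows may land inside the large setosa cluster, in which case nothing is flagged; that outcome is worth discussing as a limitation of the method.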

## Part 5: Summary & Key Takeaways

In this lab, you learned:

### Attack Techniques
1. **Targeted Poisoning:** A small poison budget (about 3% of the training set in this lab) can be enough to shift predictions for the target class
2. **Label Manipulation:** Simply mislabeling perturbed copies of real samples can be effective
3. **Stealth:** Attacks can preserve overall accuracy while creating targeted vulnerabilities

### Defense Strategies
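1. **Anomaly Detection:** Isolation Forest removes samples with unusual feature values; how much poison it catches depends on how far the poison sits from the clean data, so measure the detection rate instead of assuming one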
2. **Data Validation:** Pre-training validation is crucial
3. **Activation Analysis:** Internal model states reveal backdoor patterns
4. **Multi-Layer Defense:** Combine multiple detection methods

### Best Practices
- Always validate training data from untrusted sources
- Use statistical tests to detect anomalies
- Monitor model behavior on edge cases
- Implement data provenance tracking
- Retrain regularly with verified clean data
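
As a small illustration of provenance tracking, the sketch below fingerprints a dataset with SHA-256 so that any change between review and training is detectable. The function name `dataset_fingerprint` and the toy arrays are assumptions; a real pipeline would also record where each sample came from.

```python
import hashlib
import numpy as np

def dataset_fingerprint(X, y):
    """Return a SHA-256 hex digest covering the shape, dtype, and bytes of X and y."""
    digest = hashlib.sha256()
    for arr in (np.asarray(X), np.asarray(y)):
        arr = np.ascontiguousarray(arr)
        digest.update(str(arr.shape).encode())
        digest.update(str(arr.dtype).encode())
        digest.update(arr.tobytes())
    return digest.hexdigest()

# Record the fingerprint when the data is reviewed and trusted
X_ref = np.arange(12, dtype=float).reshape(6, 2)
y_ref = np.array([0, 0, 1, 1, 2, 2])
trusted = dataset_fingerprint(X_ref, y_ref)

# Appending a single row changes the fingerprint
X_tampered = np.vstack([X_ref, [[0.1, 0.2]]])
y_tampered = np.append(y_ref, 1)

print(dataset_fingerprint(X_ref, y_ref) == trusted)            # True
print(dataset_fingerprint(X_tampered, y_tampered) == trusted)  # False
```

A fingerprint only proves the data is unchanged since it was recorded; it says nothing about whether the data was clean at that point.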

### Real-World Impact
- DeepMind ImageNet incident: $2.3M remediation costs
- Microsoft Tay (2016): Taken offline within 24 hours of launch after users fed it abusive input
- HuggingFace backdoors: 5,000+ downloads before detection

### Further Reading
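- Biggio, Nelson, and Laskov (2012): "Poisoning Attacks against Support Vector Machines"
- Gu, Dolan-Gavitt, and Garg (2017): "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain"
- OWASP Top 10 for LLM Applications (2023), LLM03: Training Data Poisoning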

---

**HackLearn Pro** - Learn by doing, secure by design.