# Adversarial Machine Learning - Hands-On Lab

**Part of HackLearn Pro**

Welcome to this interactive lab on Adversarial Machine Learning! Learn how attackers manipulate ML models and how to defend against these attacks.

## Learning Objectives
- Understand adversarial examples and how they fool ML models
- Explore different types of adversarial attacks (FGSM, PGD, etc.)
- Learn about model poisoning and backdoor attacks
- Implement defensive techniques
- Practice detecting adversarial inputs

## Prerequisites
- Basic Python and NumPy knowledge
- Understanding of neural networks
- Familiarity with image classification

---

## Setup

Install required packages for adversarial ML experimentation:

In [None]:
# Install dependencies
!pip install numpy matplotlib scikit-learn pillow -q

import numpy as np
import matplotlib.pyplot as plt
from typing import Callable, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Setup complete! Ready to explore adversarial ML.")

## Part 1: Understanding Adversarial Examples

Adversarial examples are inputs to ML models that are intentionally designed to cause misclassification. To a human they often look identical to the original input, yet they fool the model.

### Creating a Simple Classifier

In [None]:
class SimpleClassifier:
    """A simple linear classifier for demonstration"""
    
    def __init__(self, input_dim: int = 784, num_classes: int = 10):
        # Initialize random weights
        self.weights = np.random.randn(input_dim, num_classes) * 0.01
        self.bias = np.zeros(num_classes)
    
    def predict(self, x: np.ndarray) -> int:
        """Predict class for input x"""
        scores = x @ self.weights + self.bias
        return np.argmax(scores)
    
    def predict_proba(self, x: np.ndarray) -> np.ndarray:
        """Get probability distribution over classes"""
        scores = x @ self.weights + self.bias
        exp_scores = np.exp(scores - np.max(scores))
        return exp_scores / np.sum(exp_scores)
    
    def gradient(self, x: np.ndarray, target_class: int) -> np.ndarray:
        """Compute gradient of loss w.r.t. input"""
        probs = self.predict_proba(x)
        probs[target_class] -= 1
        return probs @ self.weights.T

# Create a simple classifier
model = SimpleClassifier()
print("Simple classifier created!")

# Create a sample "image" (flattened 28x28)
sample_image = np.random.rand(784) * 0.5
original_class = model.predict(sample_image)
original_probs = model.predict_proba(sample_image)

print(f"\nOriginal prediction: Class {original_class}")
print(f"Confidence: {original_probs[original_class]:.2%}")
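
The `gradient` method above returns an analytic gradient of the cross-entropy loss with respect to the input, and both attacks in this lab depend on it. One way to gain confidence in a formula like this is to compare it against central finite differences. The sketch below does so on a small standalone linear softmax model; the helper names `softmax`, `cross_entropy`, and `input_gradient` are introduced here for illustration and are not part of the lab's classes.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)

def cross_entropy(W: np.ndarray, b: np.ndarray, x: np.ndarray, label: int) -> float:
    """Cross-entropy loss of a linear softmax model for one input."""
    return float(-np.log(softmax(x @ W + b)[label]))

def input_gradient(W: np.ndarray, b: np.ndarray, x: np.ndarray, label: int) -> np.ndarray:
    """Analytic gradient of the loss w.r.t. x: (probs - one_hot) @ W.T."""
    probs = softmax(x @ W + b)
    probs[label] -= 1
    return probs @ W.T

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3))   # 6 input features, 3 classes (toy sizes)
b = np.zeros(3)
x = rng.random(6)
label = 1

# Central finite differences, one input coordinate at a time
h = 1e-5
numeric = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = h
    numeric[i] = (cross_entropy(W, b, x + step, label)
                  - cross_entropy(W, b, x - step, label)) / (2 * h)

analytic = input_gradient(W, b, x, label)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(f"Max difference between analytic and numeric gradient: {max_abs_diff:.2e}")
```

If the two gradients disagree by more than numerical noise, the analytic formula (or the loss it is supposed to differentiate) is likely wrong.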

## Part 2: Fast Gradient Sign Method (FGSM)

FGSM is one of the simplest adversarial attacks. It perturbs the input with a single step of size epsilon in the direction of the sign of the loss gradient, which increases the loss.

In [None]:
def fgsm_attack(model: SimpleClassifier, x: np.ndarray, 
                true_class: int, epsilon: float = 0.1) -> np.ndarray:
    """
    Fast Gradient Sign Method attack (untargeted)
    
    Args:
        model: The classifier to attack
        x: Original input
        true_class: Class whose loss is increased, i.e. the label the
            prediction should be pushed away from
        epsilon: Perturbation magnitude (L-infinity bound)
    
    Returns:
        Adversarial example
    """
    # Gradient of the cross-entropy loss for true_class w.r.t. the input
    grad = model.gradient(x, true_class)
    
    # Step in the sign of the gradient, which increases that loss
    perturbation = epsilon * np.sign(grad)
    
    # Create adversarial example
    x_adv = x + perturbation
    
    # Clip to valid range [0, 1]
    x_adv = np.clip(x_adv, 0, 1)
    
    return x_adv

# Perform an untargeted FGSM attack: push the input away from its current class
adversarial_image = fgsm_attack(model, sample_image, original_class, epsilon=0.3)

# Pick a different class to serve as the target for the PGD attack in Part 3
target_class = (original_class + 1) % 10

# Evaluate adversarial example
adv_prediction = model.predict(adversarial_image)
adv_probs = model.predict_proba(adversarial_image)

print("FGSM Attack Results:")
print("=" * 50)
print(f"Original class: {original_class} (confidence: {original_probs[original_class]:.2%})")
print(f"Adversarial class: {adv_prediction} (confidence: {adv_probs[adv_prediction]:.2%})")
print(f"Perturbation magnitude: {np.linalg.norm(adversarial_image - sample_image):.4f}")
print(f"Max pixel change: {np.max(np.abs(adversarial_image - sample_image)):.4f}")

if adv_prediction != original_class:
    print("\n✓ Attack successful! Model misclassified the adversarial example.")
else:
    print("\n✗ Attack failed. Try increasing epsilon.")
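
The failure message above suggests increasing epsilon, so it can help to see how the attack behaves across several values. The following is a minimal standalone sketch, assuming a random linear softmax model like the one in Part 1; `W`, `x`, and the local `fgsm` helper are stand-ins defined here, not the lab's `model` or `fgsm_attack`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(784, 10))  # stand-in for model.weights
x = rng.random(784) * 0.5                   # stand-in for sample_image

def softmax(scores: np.ndarray) -> np.ndarray:
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)

def fgsm(x: np.ndarray, label: int, epsilon: float) -> np.ndarray:
    """One signed-gradient step that increases the loss for `label`."""
    probs = softmax(x @ W)
    probs[label] -= 1
    grad = probs @ W.T
    return np.clip(x + epsilon * np.sign(grad), 0, 1)

original = int(np.argmax(x @ W))
results = []
for epsilon in [0.0, 0.01, 0.05, 0.1, 0.3]:
    x_adv = fgsm(x, original, epsilon)
    linf = float(np.max(np.abs(x_adv - x)))          # largest per-pixel change
    conf = float(softmax(x_adv @ W)[original])       # confidence in the original class
    results.append((epsilon, linf, conf))
    print(f"epsilon={epsilon:<5} max change={linf:.3f} "
          f"original-class confidence={conf:.2%}")
```

The largest per-pixel change never exceeds epsilon, which is the sense in which FGSM is an L-infinity bounded attack.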

### Visualizing the Attack

In [None]:
def visualize_attack(original: np.ndarray, adversarial: np.ndarray, 
                     orig_class: int, adv_class: int):
    """Visualize original vs adversarial image"""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Reshape to 28x28 for visualization
    orig_img = original.reshape(28, 28)
    adv_img = adversarial.reshape(28, 28)
    perturbation = (adversarial - original).reshape(28, 28)
    
    # Original image
    axes[0].imshow(orig_img, cmap='gray')
    axes[0].set_title(f'Original\nClass: {orig_class}')
    axes[0].axis('off')
    
    # Perturbation (amplified for visibility)
    axes[1].imshow(perturbation * 10, cmap='seismic', vmin=-1, vmax=1)
    axes[1].set_title('Perturbation\n(10x amplified)')
    axes[1].axis('off')
    
    # Adversarial image
    axes[2].imshow(adv_img, cmap='gray')
    axes[2].set_title(f'Adversarial\nClass: {adv_class}')
    axes[2].axis('off')
    
    plt.tight_layout()
    plt.show()

visualize_attack(sample_image, adversarial_image, original_class, adv_prediction)

## Part 3: Projected Gradient Descent (PGD) Attack

PGD is an iterative and more powerful version of FGSM. It takes many small gradient-sign steps and, after each one, projects the result back into the epsilon-ball around the original input.

In [None]:
def pgd_attack(model: SimpleClassifier, x: np.ndarray, 
               target_class: int, epsilon: float = 0.3, 
               alpha: float = 0.01, num_iter: int = 40) -> Tuple[np.ndarray, list]:
    """
    Projected Gradient Descent attack
    
    Args:
        model: The classifier to attack
        x: Original input
        target_class: Target class
        epsilon: Maximum perturbation
        alpha: Step size
        num_iter: Number of iterations
    
    Returns:
        Adversarial example and history of predictions
    """
    x_adv = x.copy()
    history = []
    
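    for i in range(num_iter):
        # Gradient of the loss for target_class w.r.t. the input
        grad = model.gradient(x_adv, target_class)
        
        # Step against the gradient to lower the target-class loss,
        # which moves the prediction toward target_class
        x_adv = x_adv - alpha * np.sign(grad)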
        
        # Project back to epsilon ball
        perturbation = x_adv - x
        perturbation = np.clip(perturbation, -epsilon, epsilon)
        x_adv = x + perturbation
        
        # Clip to valid range
        x_adv = np.clip(x_adv, 0, 1)
        
        # Record prediction
        pred = model.predict(x_adv)
        history.append(pred)
    
    return x_adv, history

# Perform PGD attack
pgd_adversarial, pgd_history = pgd_attack(model, sample_image, target_class)

pgd_prediction = model.predict(pgd_adversarial)
pgd_probs = model.predict_proba(pgd_adversarial)

print("PGD Attack Results:")
print("=" * 50)
print(f"Original class: {original_class}")
print(f"Final adversarial class: {pgd_prediction}")
print(f"Confidence: {pgd_probs[pgd_prediction]:.2%}")
print(f"\nFraction of iterations predicting a non-original class: {sum(1 for p in pgd_history if p != original_class) / len(pgd_history):.1%}")

# Plot attack progress
plt.figure(figsize=(10, 4))
plt.plot(pgd_history)
plt.axhline(y=original_class, color='r', linestyle='--', label='Original class')
plt.xlabel('Iteration')
plt.ylabel('Predicted Class')
plt.title('PGD Attack Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
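
The projection step is what separates PGD from simply repeating FGSM. The sketch below isolates it in a standalone helper, here called `project_linf` (a name chosen for this example), and checks that a point far from the original input ends up inside the epsilon-ball and the valid pixel range.

```python
import numpy as np

def project_linf(x_adv: np.ndarray, x: np.ndarray, epsilon: float) -> np.ndarray:
    """Project x_adv onto the L-infinity ball of radius epsilon around x,
    then onto the valid pixel range [0, 1]."""
    perturbation = np.clip(x_adv - x, -epsilon, epsilon)
    return np.clip(x + perturbation, 0, 1)

rng = np.random.default_rng(0)
x = rng.random(8)                              # original input in [0, 1]
x_far = x + rng.normal(scale=0.5, size=8)      # a point well outside the ball
x_proj = project_linf(x_far, x, epsilon=0.1)

print("Max change before projection:", np.max(np.abs(x_far - x)).round(3))
print("Max change after projection: ", np.max(np.abs(x_proj - x)).round(3))
```

Because the original input already lies in [0, 1], the final clip can only move a pixel closer to its original value, so the epsilon bound still holds after both clips.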

## Part 4: Model Poisoning Attack

In poisoning attacks, the attacker corrupts the training data to influence the model's behavior.

In [None]:
class PoisonedDataset:
    """Simulate a dataset with poisoned samples"""
    
    def __init__(self, num_samples: int = 1000, poison_rate: float = 0.1):
        self.num_samples = num_samples
        self.poison_rate = poison_rate
        
        # Generate clean data
        self.X_clean = np.random.randn(num_samples, 784)
        self.y_clean = np.random.randint(0, 10, num_samples)
        
        # Poison some samples
        num_poisoned = int(num_samples * poison_rate)
        poison_indices = np.random.choice(num_samples, num_poisoned, replace=False)
        
        self.X_poisoned = self.X_clean.copy()
        self.y_poisoned = self.y_clean.copy()
        
        # Flip labels for poisoned samples
        for idx in poison_indices:
            self.y_poisoned[idx] = (self.y_poisoned[idx] + 1) % 10
            # Add backdoor trigger (special pattern)
            self.X_poisoned[idx, :10] = 1.0
        
        self.poison_indices = set(poison_indices)
    
    def get_data(self, poisoned: bool = False):
        """Get clean or poisoned data"""
        if poisoned:
            return self.X_poisoned, self.y_poisoned
        return self.X_clean, self.y_clean

# Create poisoned dataset
dataset = PoisonedDataset(num_samples=500, poison_rate=0.15)
X_clean, y_clean = dataset.get_data(poisoned=False)
X_poisoned, y_poisoned = dataset.get_data(poisoned=True)

print("Poisoned Dataset Created:")
print("=" * 50)
print(f"Total samples: {len(X_clean)}")
print(f"Poisoned samples: {len(dataset.poison_indices)}")
print(f"Poison rate: {dataset.poison_rate:.1%}")
print(f"\nLabel changes: {np.sum(y_clean != y_poisoned)} samples")
print("Feature modifications: Backdoor trigger added to poisoned samples")
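
Because the simulation keeps both the clean and the poisoned copy, the poisoned rows can be recovered by diffing the two. The sketch below rebuilds a small poisoned dataset in the same way and performs that check. It is only a sanity check for the simulation: a real defender rarely has a trusted clean copy, and all names and sizes here are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_features, num_classes = 200, 784, 10

# Clean data (random stand-ins, as in the lab)
X_clean = rng.normal(size=(num_samples, num_features))
y_clean = rng.integers(0, num_classes, size=num_samples)

# Poison 10% of the rows: shift the label by one and stamp a trigger
# on the first 10 features
poison_indices = rng.choice(num_samples, size=20, replace=False)
X_poisoned, y_poisoned = X_clean.copy(), y_clean.copy()
y_poisoned[poison_indices] = (y_poisoned[poison_indices] + 1) % num_classes
X_poisoned[poison_indices, :10] = 1.0

# Recover the poisoned rows by diffing the two copies
label_changed = y_clean != y_poisoned
features_changed = np.any(X_clean != X_poisoned, axis=1)
recovered = np.flatnonzero(label_changed | features_changed)

# Size of the label shift (mod num_classes) among the changed rows
shifts = (y_poisoned[label_changed] - y_clean[label_changed]) % num_classes
print(f"Recovered {recovered.size} poisoned rows")
print(f"Label shifts observed: {np.unique(shifts)}")
```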

## Part 5: Defense Mechanisms

### Defense 1: Adversarial Training

In [None]:
class RobustClassifier(SimpleClassifier):
    """Classifier with adversarial training"""
    
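    def adversarial_train_step(self, x: np.ndarray, y: int, 
                              epsilon: float = 0.1, lr: float = 0.01):
        """Single step of adversarial training"""
        # Generate adversarial example
        x_adv = fgsm_attack(self, x, y, epsilon)
        
        # Train on both clean and adversarial examples
        for x_train in [x, x_adv]:
            # Gradient of the cross-entropy loss w.r.t. the class scores
            # (softmax probabilities minus the one-hot label)
            score_grad = self.predict_proba(x_train)
            score_grad[y] -= 1
            
            # Gradient descent on the parameters; np.outer yields the
            # (input_dim, num_classes) gradient for the weights
            self.weights -= lr * np.outer(x_train, score_grad)
            self.bias -= lr * score_grad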

# Create robust classifier
robust_model = RobustClassifier()

# Train on a few examples
print("Training robust classifier with adversarial examples...")
for i in range(100):
    idx = np.random.randint(len(X_clean))
    robust_model.adversarial_train_step(X_clean[idx], y_clean[idx])

print("✓ Robust classifier trained!")
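
The training loop above reports completion but does not measure whether the model is any more robust. A common way to quantify robustness is to compare accuracy on clean inputs with accuracy on adversarially perturbed inputs. The following is a minimal sketch on a toy two-class problem, assuming a linear softmax classifier trained with plain gradient descent; the data, the hyperparameters, and the helper names are all chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs as a toy binary classification problem
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
               rng.normal(+1.0, 1.0, size=(n, 2))])
y = np.array([0] * n + [1] * n)

def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Row-wise numerically stable softmax."""
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Train a linear softmax classifier with plain gradient descent
W = np.zeros((2, 2))
b = np.zeros(2)
one_hot = np.eye(2)[y]
for _ in range(200):
    score_grad = (softmax_rows(X @ W + b) - one_hot) / len(X)
    W -= 0.5 * X.T @ score_grad
    b -= 0.5 * score_grad.sum(axis=0)

def accuracy(inputs: np.ndarray) -> float:
    return float(np.mean(np.argmax(inputs @ W + b, axis=1) == y))

# FGSM against the true labels; no clipping because these features are not pixels
epsilon = 0.5
input_grad = (softmax_rows(X @ W + b) - one_hot) @ W.T
X_adv = X + epsilon * np.sign(input_grad)

clean_acc = accuracy(X)
robust_acc = accuracy(X_adv)
print(f"Clean accuracy:  {clean_acc:.1%}")
print(f"Robust accuracy: {robust_acc:.1%} (FGSM, epsilon={epsilon})")
```

Running the same comparison before and after adversarial training, on held-out data, gives a rough picture of what the defense buys.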

### Defense 2: Input Validation

In [None]:
def detect_adversarial(x: np.ndarray, reference_samples: np.ndarray, 
                       threshold: float = 3.0) -> Tuple[bool, float]:
    """
    Detect adversarial examples using statistical analysis
    
    Args:
        x: Input to check
        reference_samples: Clean reference samples
        threshold: Detection threshold (in standard deviations)
    
    Returns:
        (is_adversarial, anomaly_score)
    """
    # Compute statistics of reference samples
    mean = np.mean(reference_samples, axis=0)
    std = np.std(reference_samples, axis=0) + 1e-8
    
    # Compute z-score
    z_scores = np.abs((x - mean) / std)
    anomaly_score = np.mean(z_scores)
    
    is_adversarial = anomaly_score > threshold
    
    return is_adversarial, anomaly_score

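# Test detection against clean reference samples drawn from the same
# distribution as sample_image (X_clean is Gaussian, so it is not comparable)
reference_samples = np.random.rand(100, 784) * 0.5
is_adv_clean, score_clean = detect_adversarial(sample_image, reference_samples)
is_adv_attack, score_attack = detect_adversarial(adversarial_image, reference_samples)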

print("Adversarial Detection Results:")
print("=" * 50)
print(f"Clean sample - Anomaly score: {score_clean:.3f}, Detected: {is_adv_clean}")
print(f"Adversarial sample - Anomaly score: {score_attack:.3f}, Detected: {is_adv_attack}")

if is_adv_attack and not is_adv_clean:
    print("\n✓ Detector successfully identified adversarial example!")
elif is_adv_clean:
    print("\n⚠ False positive - clean sample flagged as adversarial")
else:
    print("\n✗ Failed to detect adversarial example")
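
A fixed threshold is hard to pick by hand here, because the anomaly score is an average over hundreds of features and therefore varies little from one clean sample to the next. One option is to calibrate the threshold from clean data, for example at a high percentile of the clean scores. The sketch below illustrates the idea under strong assumptions: the data is random, and the "attack" is random sign noise of magnitude 0.3 rather than a gradient-based perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean reference data and held-out clean data (random stand-ins in [0, 0.5])
reference = rng.random((500, 784)) * 0.5
held_out = rng.random((500, 784)) * 0.5

mean = reference.mean(axis=0)
std = reference.std(axis=0) + 1e-8

def anomaly_score(x: np.ndarray) -> np.ndarray:
    """Mean absolute z-score per sample (accepts a batch of rows)."""
    return np.mean(np.abs((x - mean) / std), axis=-1)

# Place the threshold at the 99th percentile of the clean reference scores
threshold = float(np.quantile(anomaly_score(reference), 0.99))

# Crude stand-in for an attack: add +/-0.3 to every pixel, then clip
noise = 0.3 * rng.choice([-1.0, 1.0], size=held_out.shape)
perturbed = np.clip(held_out + noise, 0, 1)

false_positive_rate = float(np.mean(anomaly_score(held_out) > threshold))
detection_rate = float(np.mean(anomaly_score(perturbed) > threshold))
print(f"Calibrated threshold: {threshold:.3f}")
print(f"False positive rate on clean data: {false_positive_rate:.1%}")
print(f"Detection rate on perturbed data:  {detection_rate:.1%}")
```

Real images and real attacks are far less cleanly separated than this toy setup, so a calibrated threshold should be validated on held-out clean and adversarial data.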

### Defense 3: Ensemble Methods

In [None]:
class EnsembleDefense:
    """Ensemble of models for robust prediction"""
    
    def __init__(self, num_models: int = 5):
        self.models = [SimpleClassifier() for _ in range(num_models)]
    
    def predict(self, x: np.ndarray) -> Tuple[int, float]:
        """Predict using ensemble voting"""
        predictions = [model.predict(x) for model in self.models]
        
        # Majority voting
        counts = np.bincount(predictions, minlength=10)
        prediction = np.argmax(counts)
        confidence = counts[prediction] / len(self.models)
        
        return prediction, confidence
    
    def detect_disagreement(self, x: np.ndarray, threshold: float = 0.3) -> bool:
        """Detect if models disagree (potential adversarial)"""
        predictions = [model.predict(x) for model in self.models]
        # Fraction of models voting against the majority class
        # (0.0 when every model agrees)
        majority_votes = np.bincount(predictions, minlength=10).max()
        disagreement_rate = 1 - majority_votes / len(self.models)
        
        return bool(disagreement_rate > threshold)

# Create ensemble
ensemble = EnsembleDefense(num_models=5)

# Test on clean and adversarial examples
clean_pred, clean_conf = ensemble.predict(sample_image)
adv_pred, adv_conf = ensemble.predict(adversarial_image)

clean_disagree = ensemble.detect_disagreement(sample_image)
adv_disagree = ensemble.detect_disagreement(adversarial_image)

print("Ensemble Defense Results:")
print("=" * 50)
print(f"Clean sample - Prediction: {clean_pred}, Confidence: {clean_conf:.1%}, Disagreement: {clean_disagree}")
print(f"Adversarial sample - Prediction: {adv_pred}, Confidence: {adv_conf:.1%}, Disagreement: {adv_disagree}")

if adv_disagree:
    print("\n✓ Ensemble detected suspicious input through model disagreement!")

## Part 6: Challenge Exercises

### Challenge 1: Implement Carlini-Wagner (C&W) Attack
Implement a more sophisticated attack that minimizes perturbation while ensuring misclassification:

In [None]:
def carlini_wagner_attack(model: SimpleClassifier, x: np.ndarray, 
                         target_class: int, c: float = 1.0, 
                         num_iter: int = 100) -> np.ndarray:
    """
    Carlini-Wagner L2 attack (simplified)
    
    TODO: Implement the C&W attack
    Hints:
    - Minimize: ||perturbation||^2 + c * loss
    - Use optimization to find minimal perturbation
    - Ensure adversarial example is misclassified
    """
    pass

# Test your implementation
# cw_adversarial = carlini_wagner_attack(model, sample_image, target_class)
# Evaluate and compare with FGSM

### Challenge 2: Implement Defense Distillation

In [None]:
class DistilledClassifier(SimpleClassifier):
    """
    Classifier trained using defensive distillation
    
    TODO: Implement defensive distillation
    Hints:
    - Train on soft labels from teacher model
    - Use temperature scaling
    - Smooth the decision boundaries
    """
    
    def distill_from(self, teacher_model: SimpleClassifier, 
                     X: np.ndarray, temperature: float = 10.0):
        """Learn from teacher model using distillation"""
        pass

# Test your implementation
# distilled = DistilledClassifier()
# distilled.distill_from(model, X_clean)
# Test robustness against FGSM
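
As a starting point for the temperature-scaling hint, the sketch below shows how dividing the scores by a temperature softens a softmax distribution. It is not a solution to the challenge: `softmax_with_temperature` is a helper name chosen for this example, and the scores are made up.

```python
import numpy as np

def softmax_with_temperature(scores: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Softmax of scores / temperature; larger temperatures give softer outputs."""
    scaled = scores / temperature
    exp_scores = np.exp(scaled - np.max(scaled))
    return exp_scores / np.sum(exp_scores)

scores = np.array([4.0, 1.0, 0.5])
hard = softmax_with_temperature(scores, temperature=1.0)
soft = softmax_with_temperature(scores, temperature=10.0)

for temperature, probs in [(1.0, hard), (10.0, soft)]:
    print(f"T={temperature:>4}: {np.round(probs, 3)}")
```

The ranking of the classes is unchanged, but at the higher temperature the probabilities sit closer together, which is what makes them useful as soft labels for a student model.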

### Challenge 3: Implement Backdoor Detection

In [None]:
def detect_backdoor(model: SimpleClassifier, X: np.ndarray, 
                   y: np.ndarray, threshold: float = 0.9) -> Tuple[bool, List[int]]:
    """
    Detect if a model has been backdoored
    
    TODO: Implement backdoor detection
    Hints:
    - Look for unusual activation patterns
    - Check for triggers that cause consistent misclassification
    - Analyze model behavior on modified inputs
    
    Returns:
        (is_backdoored, suspicious_indices)
    """
    pass

# Test on poisoned dataset
# is_backdoored, suspicious = detect_backdoor(model, X_poisoned, y_poisoned)
# Compare with actual poison_indices from dataset

## Summary & Key Takeaways

In this lab, you learned:

1. **Adversarial Examples**: Small perturbations can fool ML models
2. **Attack Methods**:
   - FGSM: Fast, single-step attack
   - PGD: Iterative, more powerful attack
   - Data Poisoning: Corrupting training data
   - Backdoor Attacks: Hidden triggers in models

3. **Defense Strategies**:
   - Adversarial training: Train on adversarial examples
   - Input validation: Detect anomalous inputs
   - Ensemble methods: Use multiple models
   - Defensive distillation: Smooth decision boundaries

### Best Practices
- Always validate inputs before feeding to ML models
- Use multiple defense layers
- Monitor model behavior in production
- Regularly test for adversarial robustness
- Be aware of training data integrity

### Real-World Impact
- Autonomous vehicles: Adversarial stop signs
- Face recognition: Fooling authentication systems
- Malware detection: Evading security systems
- Content moderation: Bypassing filters

### Further Reading
- [CleverHans Library](https://github.com/cleverhans-lab/cleverhans)
- [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox)
- [RobustBench Benchmark](https://robustbench.github.io/)

