# Lab 38: ML Security Introduction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab38_ml_security_intro.ipynb)

Introduction to machine learning security concepts and attack vectors.

## Learning Objectives
- Understand the ML security threat landscape
- Learn common attack types (evasion, poisoning, extraction)
- Identify vulnerabilities in ML pipelines
- Build secure ML development practices

**Next:** Lab 39 (Adversarial ML)

In [None]:
#@title Install dependencies (Colab only)
#@markdown Run this cell to install required packages in Colab

%pip install -q numpy scikit-learn

In [None]:
import numpy as np
from typing import List, Dict, Tuple
from sklearn.linear_model import LogisticRegression

print("✅ Libraries loaded!")

## Why Attack ML Systems?

### The Stakes Are High

| Domain | ML Application | Attack Impact |
|--------|---------------|---------------|
| **Security** | Malware detection | Malware evades detection |
| **Finance** | Fraud detection | Fraudulent transactions pass |
| **Healthcare** | Diagnosis | Wrong treatment decisions |
| **Content** | Spam/abuse filters | Abuse content gets through |

### The ML Attack Surface

```
DATA COLLECTION → PREPROCESSING → TRAINING → DEPLOYMENT → INFERENCE
       │               │             │            │            │
       ▼               ▼             ▼            ▼            ▼
   Poisoning      Poisoning      Backdoor     Extraction   Evasion
   via source     via pipeline   via trojan   via API      attacks
```

## Attack Type 1: Evasion

**Goal**: Craft an input that's misclassified at inference time.

```
Original malware → ADD PERTURBATION → Modified malware
     │                                      │
     ▼                                      ▼
"MALICIOUS" (correct)              "BENIGN" (wrong!)
```

In [None]:
# Demonstrating Evasion Attack

# Simple classifier: detect "malware" based on features
# Feature 1: Number of suspicious API calls
# Feature 2: Entropy level

# Training data: [suspicious_apis, entropy] -> label
X_train = np.array([
    [10, 7.5],  # Malware
    [8, 7.2],   # Malware
    [12, 7.8],  # Malware
    [2, 4.5],   # Benign
    [1, 5.0],   # Benign
    [3, 4.8],   # Benign
])
y_train = np.array([1, 1, 1, 0, 0, 0])

# Train classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Original malware sample
original_malware = np.array([[9, 7.3]])
original_pred = classifier.predict(original_malware)[0]
original_prob = classifier.predict_proba(original_malware)[0][1]

print("🎯 EVASION ATTACK DEMONSTRATION")
print("=" * 50)
print(f"\n📍 Original Malware Sample:")
print(f"   Features: suspicious_apis={original_malware[0][0]}, entropy={original_malware[0][1]}")
print(f"   Prediction: {'MALICIOUS' if original_pred == 1 else 'BENIGN'}")
print(f"   Confidence: {original_prob:.1%}")

# Attacker's evasion: add "benign-looking" features
# Strategy: Reduce apparent suspicious APIs, lower entropy appearance
evaded_malware = np.array([[4, 5.5]])  # Changed features while keeping malicious behavior

evaded_pred = classifier.predict(evaded_malware)[0]
evaded_prob = classifier.predict_proba(evaded_malware)[0][1]

print(f"\n📍 EVASION ATTEMPT:")
print(f"   Modified Features: suspicious_apis={evaded_malware[0][0]}, entropy={evaded_malware[0][1]}")
print(f"   Prediction: {'MALICIOUS' if evaded_pred == 1 else 'BENIGN'}")
print(f"   Confidence: {evaded_prob:.1%}")

if evaded_pred == 0:
    print(f"\n⚠️ EVASION SUCCESSFUL! Malware classified as benign.")
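
The evasion above picks the modified features by hand. With white-box access to a linear model, an attacker can instead compute the smallest shift that crosses the decision boundary. The sketch below assumes the same toy training data; `minimal_evasion` and its `overshoot` parameter are hypothetical names introduced for illustration, and the result ignores whether the shifted features are achievable by real malware (for example, API counts must be whole numbers).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as the evasion demo: [suspicious_apis, entropy] -> label
X_train = np.array([[10, 7.5], [8, 7.2], [12, 7.8], [2, 4.5], [1, 5.0], [3, 4.8]])
y_train = np.array([1, 1, 1, 0, 0, 0])
clf = LogisticRegression().fit(X_train, y_train)

def minimal_evasion(x, model, overshoot=0.05):
    """Return x shifted by the smallest L2 step that crosses a linear boundary.

    Hypothetical helper: only valid for linear models exposing coef_/intercept_.
    """
    w = model.coef_[0]
    b = model.intercept_[0]
    logit = w @ x + b  # > 0 means the model predicts class 1 (malicious)
    # Project x onto the hyperplane w.x + b = 0, then step slightly past it
    delta = -(1 + overshoot) * logit / (w @ w) * w
    return x + delta

x = np.array([9.0, 7.3])
x_adv = minimal_evasion(x, clf)
pred_before = clf.predict([x])[0]
pred_after = clf.predict([x_adv])[0]

print(f"Original features: {x}, prediction: {pred_before}")
print(f"Shifted features:  {np.round(x_adv, 2)}, prediction: {pred_after}")
print(f"L2 size of the shift: {np.linalg.norm(x_adv - x):.2f}")
```

Because the new logit equals the old one multiplied by `-overshoot`, the predicted class always flips for a linear model; non-linear models need iterative methods instead.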

## Attack Type 2: Poisoning

**Goal**: Corrupt training data to degrade model accuracy.

In [None]:
# Demonstrating Poisoning Attack

from sklearn.metrics import accuracy_score

# Clean training data
X_clean = np.array([
    [10, 7.5], [8, 7.2], [12, 7.8], [9, 7.0], [11, 7.6],  # Malware
    [2, 4.5], [1, 5.0], [3, 4.8], [2, 4.2], [1, 4.5],      # Benign
])
y_clean = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Test data
X_test = np.array([[9, 7.1], [2, 4.6], [10, 7.4], [1, 4.8]])
y_test = np.array([1, 0, 1, 0])

# Train on clean data
clean_model = LogisticRegression()
clean_model.fit(X_clean, y_clean)
clean_accuracy = accuracy_score(y_test, clean_model.predict(X_test))

print("🧪 POISONING ATTACK DEMONSTRATION")
print("=" * 50)
print(f"\n📍 Clean Model Performance:")
print(f"   Test Accuracy: {clean_accuracy:.1%}")

# Poisoned data: attacker injects mislabeled samples
# Adding malware samples labeled as "benign"
X_poisoned = np.vstack([X_clean, [[9, 7.3], [10, 7.5]]])  # Adding malware features
y_poisoned = np.concatenate([y_clean, [0, 0]])  # But labeling them as benign!

# Train on poisoned data
poisoned_model = LogisticRegression()
poisoned_model.fit(X_poisoned, y_poisoned)
poisoned_accuracy = accuracy_score(y_test, poisoned_model.predict(X_test))

print(f"\n📍 Poisoned Model Performance:")
print(f"   Test Accuracy: {poisoned_accuracy:.1%}")
print(f"   Accuracy Drop: {(clean_accuracy - poisoned_accuracy):.1%}")

if poisoned_accuracy < clean_accuracy:
    print(f"\n⚠️ POISONING SUCCESSFUL! Model accuracy degraded.")
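
Accuracy on four test samples is a coarse signal, so a small amount of poisoning may not change it at all. A finer check is to compare the malware probability each model assigns to a sample inside the poisoned region. The sketch below rebuilds the same clean and poisoned datasets; the `probe` sample is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same clean data as the poisoning demo
X_clean = np.array([
    [10, 7.5], [8, 7.2], [12, 7.8], [9, 7.0], [11, 7.6],  # Malware
    [2, 4.5], [1, 5.0], [3, 4.8], [2, 4.2], [1, 4.5],      # Benign
])
y_clean = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Two malware-like samples injected with the wrong (benign) label
X_poisoned = np.vstack([X_clean, [[9, 7.3], [10, 7.5]]])
y_poisoned = np.concatenate([y_clean, [0, 0]])

clean = LogisticRegression().fit(X_clean, y_clean)
poisoned = LogisticRegression().fit(X_poisoned, y_poisoned)

# Hypothetical probe: a malware-like sample near the injected points
probe = np.array([[9, 7.3]])
p_clean = clean.predict_proba(probe)[0, 1]
p_poisoned = poisoned.predict_proba(probe)[0, 1]

print(f"P(malicious) from clean model:    {p_clean:.1%}")
print(f"P(malicious) from poisoned model: {p_poisoned:.1%}")
print(f"Shift caused by poisoning:        {p_clean - p_poisoned:.1%}")
```

A drop in predicted probability without a drop in accuracy means the decision boundary has moved toward the malware cluster, which leaves less margin for an evasion attack to cover.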

## Defense Strategies

### Quick Reference

| Defense | Against | How |
|---------|---------|-----|
| **Adversarial Training** | Evasion | Train on perturbed examples |
| **Input Validation** | Evasion | Detect anomalous inputs |
| **Data Sanitization** | Poisoning | Filter training data |
| **Ensemble Models** | All | Harder to attack multiple models |
| **Rate Limiting** | Extraction | Detect systematic queries |
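
As a rough illustration of the rate-limiting row, the sketch below counts each client's queries inside a sliding time window and rejects requests over a cap. `QueryRateLimiter`, its parameters, and the client ID are hypothetical; timestamps are passed in explicitly so the example is deterministic, whereas a real service would read a clock.

```python
from collections import defaultdict, deque

class QueryRateLimiter:
    """Sliding-window limiter: at most max_queries per client per window."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> recent query timestamps

    def allow(self, client_id: str, now: float) -> bool:
        timestamps = self._history[client_id]
        # Drop timestamps that have aged out of the window
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False  # over the cap: reject (and ideally log for review)
        timestamps.append(now)
        return True

limiter = QueryRateLimiter(max_queries=3, window_seconds=60.0)
decisions = [limiter.allow("client-a", t) for t in (0.0, 1.0, 2.0, 3.0, 61.0)]
print(decisions)  # [True, True, True, False, True]
```

The fourth query is rejected because three queries already fall inside the window; by `t=61.0` two of them have expired, so the fifth is allowed. Rate limiting slows extraction but does not stop an attacker who spreads queries across many client IDs.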

In [None]:
# Simple Defense: Input Validation

def validate_input(x: np.ndarray, model, threshold: float = 0.3) -> Tuple[bool, str]:
    """
    Detect potentially adversarial inputs based on prediction confidence.

    Low confidence predictions may indicate:
    - Adversarial examples designed to confuse the model
    - Out-of-distribution inputs
    - Inputs near the decision boundary

    Args:
        x: Input features
        model: Trained classifier
        threshold: Margin above 0.5 that the top-class probability must
            reach (the default 0.3 requires at least 80% confidence)

    Returns:
        Tuple of (is_valid, reason)
    """
    proba = model.predict_proba(x)[0]
    confidence = max(proba)

    if confidence < (0.5 + threshold):
        return False, f"Low confidence ({confidence:.1%}) - possible adversarial input"

    return True, f"Input appears valid (confidence: {confidence:.1%})"

# Test validation
print("🛡️ DEFENSE: INPUT VALIDATION")
print("=" * 50)

test_inputs = [
    np.array([[10, 7.5]]),  # Clear malware
    np.array([[2, 4.5]]),   # Clear benign
    np.array([[5, 6.0]]),   # Ambiguous (near decision boundary)
]

for i, test in enumerate(test_inputs):
    is_valid, reason = validate_input(test, clean_model)
    status = "✅ VALID" if is_valid else "⚠️ FLAGGED"
    print(f"\nInput {i+1}: {status}")
    print(f"   {reason}")
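
Data sanitization from the quick reference table can be sketched in a similar spirit. One simple heuristic, assumed here rather than taken from the lab, flags any training sample whose label disagrees with the majority label of its nearest neighbors. `flag_suspicious_labels` and `k=5` are hypothetical choices, and the data reuses the poisoned set from the poisoning demo.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Poisoned training set: the last two rows are malware-like but labeled benign
X = np.array([
    [10, 7.5], [8, 7.2], [12, 7.8], [9, 7.0], [11, 7.6],  # Malware
    [2, 4.5], [1, 5.0], [3, 4.8], [2, 4.2], [1, 4.5],      # Benign
    [9, 7.3], [10, 7.5],                                   # Injected, mislabeled
])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

def flag_suspicious_labels(X, y, k=5):
    """Return indices whose label disagrees with the majority of k neighbors."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Called without a query, kneighbors() excludes each point from its own
    # neighbor list, so a sample cannot vote for itself
    _, neighbor_idx = nn.kneighbors()
    neighbor_vote = (y[neighbor_idx].mean(axis=1) > 0.5).astype(int)
    return np.where(neighbor_vote != y)[0]

suspicious = flag_suspicious_labels(X, y)
print(f"Suspicious sample indices: {suspicious.tolist()}")
print(f"Their features:\n{X[suspicious]}")
```

On this toy set the heuristic flags only the two injected rows. It relies on poisoned samples being a minority in their neighborhood, so an attacker who injects enough mislabeled points in one region can defeat it.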

## 🎉 Key Takeaways

1. **ML systems are targets** - Security, finance, anywhere ML makes decisions
2. **Know your attack surface** - Data, training, deployment, inference
3. **Evasion is most common** - Attackers craft inputs to bypass ML
4. **Defense in depth** - No single defense is sufficient
5. **Monitor and adapt** - Attackers evolve, so must defenses

## Attacker Knowledge Levels

| Level | What Attacker Knows | Attack Difficulty |
|-------|---------------------|-------------------|
| **White-box** | Full model access | Easier |
| **Gray-box** | Partial knowledge | Medium |
| **Black-box** | Only query access | Harder |

Most real attacks are **black-box**: the attacker has only API access.
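
Query access alone can still be enough for model extraction. The sketch below is a minimal illustration under stated assumptions: the victim is the toy logistic regression from the evasion demo, the attacker guesses plausible feature ranges (`low` and `high` are assumptions), and the query budget of 200 is arbitrary. The attacker sees only `victim.predict` outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Victim model: the defender's classifier, hidden behind an API
X_train = np.array([[10, 7.5], [8, 7.2], [12, 7.8], [2, 4.5], [1, 5.0], [3, 4.8]])
y_train = np.array([1, 1, 1, 0, 0, 0])
victim = LogisticRegression().fit(X_train, y_train)

rng = np.random.default_rng(0)
low, high = [0.0, 3.0], [14.0, 9.0]  # assumed ranges for [suspicious_apis, entropy]

# Black-box step: send random queries and record only the returned labels
queries = rng.uniform(low, high, size=(200, 2))
stolen_labels = victim.predict(queries)

# Train a surrogate on the stolen input/label pairs
surrogate = LogisticRegression(max_iter=1000).fit(queries, stolen_labels)

# Measure how often the surrogate agrees with the victim on fresh inputs
holdout = rng.uniform(low, high, size=(500, 2))
agreement = np.mean(surrogate.predict(holdout) == victim.predict(holdout))
print(f"Surrogate agrees with victim on {agreement:.1%} of fresh inputs")
```

A surrogate with high agreement lets the attacker run white-box style attacks offline and then replay the results against the real API, which is why query monitoring and rate limiting matter even when the model itself is never exposed.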

## Next Steps

- **Lab 17**: Implement more sophisticated attacks (FGSM, PGD)
- **Lab 18**: Build robust ML models
- **Lab 49**: Apply these concepts to LLM security