---
title: "Akari Asai's job talk on LLMs+RAG"
description: "Bridging Informal and Formal AI Reasoning"
author: "Kei Taketsuna"
date: "3/10/2025"
categories:
  - LLMs
  - RAG
  - Talk
---

### USER
_Can you craft an adversarial example for an MNIST sample?_

### ASSISTANT
Crafting an adversarial example for an MNIST sample involves making subtle modifications to an image of a digit such that a machine learning model misclassifies it while humans still recognize it as the original digit. Here's a simplified explanation of how you could create such an adversarial example:

### Prerequisites

1. **Trained Model**: You need a machine learning model trained on the MNIST dataset. This could be a convolutional neural network (CNN), which is commonly used for image classification tasks.

2. **FGSM (Fast Gradient Sign Method)**: One of the simplest and most popular methods for generating adversarial examples.
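
The code later in this reply assumes a trained `model` already exists. As a minimal sketch of what that could look like, here is a small CNN; the class name `SmallMNISTNet` and its layer sizes are illustrative assumptions rather than a reference architecture, and the network would still need to be trained before an attack on it means anything.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallMNISTNet(nn.Module):
    """Hypothetical two-conv-layer classifier for 1x28x28 MNIST images."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # keeps 28x28
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)  # keeps 14x14
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # -> 16 x 14 x 14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -> 32 x 7 x 7
        return self.fc(torch.flatten(x, 1))         # raw logits, one per digit

model = SmallMNISTNet()
logits = model(torch.zeros(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])
```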

### Steps

1. **Choose a Sample**: Pick an MNIST image from the test set that you want to turn into an adversarial example.

2. **Compute the Gradient**: Calculate the gradient of the loss with respect to the input image. This measures how changes in the input influence the loss:

   \[
   \text{gradient} = \nabla_X J(\theta, X, y)
   \]

   Here, \( J \) is the loss function, \( \theta \) are the model parameters, \( X \) is the input image, and \( y \) is the true label.

3. **Generate Perturbation**: Create a small perturbation by taking the sign of the computed gradient:

   \[
   \text{perturbation} = \epsilon \times \text{sign}(\text{gradient})
   \]

   The parameter \(\epsilon\) controls the magnitude of the perturbation. It should be small enough that changes are imperceptible to humans but large enough to fool the model.

4. **Create Adversarial Example**: Add the perturbation to the original image:

   \[
   X_{\text{adv}} = X + \text{perturbation}
   \]

5. **Clip to Valid Range**: Ensure that pixel values of the adversarial image are still within the valid range [0, 1], which is common for normalized inputs.

6. **Test the Model**: Run the adversarial image through the model and observe the output. The goal is for the model to misclassify this image.
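
The steps above can be tried on a toy problem before touching MNIST. The sketch below is an assumption-laden stand-in: `toy_model` is an untrained linear layer and `x` is random noise rather than a real digit, so it only illustrates the mechanics of steps 2 to 6.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier: one linear layer over a flattened 28x28 "image".
toy_model = nn.Linear(28 * 28, 10)
x = torch.rand(1, 28 * 28)   # fake image with pixels in [0, 1)
y = torch.tensor([3])        # fake true label
epsilon = 0.1
loss_fn = nn.CrossEntropyLoss()

# Step 2: gradient of the loss with respect to the input.
x.requires_grad_(True)
loss = loss_fn(toy_model(x), y)
loss.backward()

# Steps 3-5: signed perturbation, add it, clip to the valid pixel range.
perturbation = epsilon * x.grad.sign()
x_adv = torch.clamp(x + perturbation, 0.0, 1.0).detach()

# Step 6: re-evaluate the loss on the perturbed input.
adv_loss = loss_fn(toy_model(x_adv), y)
print(f"loss before: {loss.item():.4f}, after: {adv_loss.item():.4f}")
print(f"max pixel change: {(x_adv - x).abs().max().item():.4f}")
```

For this linear toy model the loss cannot go down, because cross-entropy over linear logits is convex in the input. A deep network gives no such guarantee, which is why the attack has to be tested rather than assumed.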

### Example Code (Using PyTorch)

Here's a basic implementation using PyTorch:

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader

# Load the data
transform = transforms.Compose([transforms.ToTensor()])
test_set = MNIST(root='./data', train=False, download=True, transform=transform)
test_loader = DataLoader(test_set, batch_size=1, shuffle=True)

# Assume `model` is already defined and trained
model.eval()

def fgsm_attack(image, epsilon, data_grad):
    sign_data_grad = data_grad.sign()
    perturbed_image = image + epsilon * sign_data_grad
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

epsilon = 0.1

# Get an example and label
data_iter = iter(test_loader)
image, label = next(data_iter)

# Set requires_grad attribute of tensor
image.requires_grad = True

# Forward pass
output = model(image)
loss = nn.CrossEntropyLoss()(output, label)

# Zero gradients
model.zero_grad()

# Backward pass
loss.backward()

# Collect data gradient
data_grad = image.grad.data

# Call FGSM
adv_image = fgsm_attack(image, epsilon, data_grad)

# Re-classify the perturbed image
output = model(adv_image)

# Check for success
_, final_pred = torch.max(output, 1)
if final_pred.item() != label.item():
    print(f"Attack successful: original label {label.item()}, new prediction {final_pred.item()}")
else:
    print("Attack failed.")

```

### Important Points

- **Epsilon Value**: The choice of \(\epsilon\) is crucial. A value that's too high produces visible changes, while a value that's too low may not change the prediction.
- **Model Dependence**: Different models may require different \(\epsilon\) values, so tune it per model.
- **Security Risks**: Adversarial examples pose significant security risks in sensitive applications. It's essential to account for these when designing robust models.

This example illustrates a common method for generating adversarial examples, but in practice, creating effective adversarial attacks and ensuring model robustness can be more nuanced.
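
Because a suitable \(\epsilon\) depends on the model, one way to pick it is to measure how often the attack flips a correct prediction at several values. The helper below is a sketch; `fgsm_success_rate` is a hypothetical name, and the usage runs on an untrained linear layer with random inputs purely to show the call pattern.

```python
import torch
import torch.nn as nn

def fgsm_success_rate(model, images, labels, epsilon):
    """Fraction of correctly classified inputs whose prediction flips under FGSM."""
    images = images.clone().detach().requires_grad_(True)
    logits = model(images)
    loss = nn.CrossEntropyLoss()(logits, labels)
    model.zero_grad()
    loss.backward()
    adv = torch.clamp(images + epsilon * images.grad.sign(), 0, 1).detach()

    with torch.no_grad():
        clean_pred = logits.argmax(dim=1)
        adv_pred = model(adv).argmax(dim=1)
    correct = clean_pred == labels
    if correct.sum().item() == 0:
        return 0.0
    flipped = correct & (adv_pred != labels)
    return flipped.sum().item() / correct.sum().item()

torch.manual_seed(0)
toy_model = nn.Linear(784, 10).eval()
images = torch.rand(64, 784)
with torch.no_grad():
    labels = toy_model(images).argmax(dim=1)  # treat the model's own output as truth

rates = {eps: fgsm_success_rate(toy_model, images, labels, eps)
         for eps in (0.0, 0.05, 0.1, 0.25)}
for eps, rate in rates.items():
    print(f"epsilon={eps}: success rate {rate:.2f}")
```

With a real MNIST model, `images` and `labels` would come from the test loader instead of random tensors.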

### USER
_Write Python code to craft an adversarial example for MNIST digit 5 to be misclassified as digit 8._

### ASSISTANT
Certainly! Below is a Python implementation using the Fast Gradient Sign Method (FGSM) to generate an adversarial example from an MNIST image of the digit '5', aiming to misclassify it as '8'. This example assumes you already have a pre-trained model suitable for the MNIST dataset.

Here's how you can achieve this with PyTorch:

```python
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader

# Load the MNIST test dataset
transform = transforms.Compose([transforms.ToTensor()])
test_set = MNIST(root='./data', train=False, download=True, transform=transform)
test_loader = DataLoader(test_set, batch_size=1, shuffle=True)

# Assume `model` is already loaded and trained; set it to evaluation mode
model = ...  # Your pre-trained model here
model.eval()

# Function to perform a targeted FGSM attack
def fgsm_attack(image, epsilon, data_grad):
    # Collect the element-wise sign of the data gradient
    sign_data_grad = data_grad.sign()
    # The loss is computed against the target label, so step *against* the
    # gradient to decrease that loss and move the image toward the target class
    perturbed_image = image - epsilon * sign_data_grad
    # Clip the perturbed image to maintain [0,1] range
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

# Select the epsilon value
epsilon = 0.25

# Look for an image of the digit '5'
for image, label in test_loader:
    if label.item() == 5:
        # Set requires_grad attribute of tensor to track gradient
        image.requires_grad = True

        # Forward pass the data through the model
        output = model(image)

        # Calculate the loss using the cross-entropy loss function
        # We pretend the label is '8' to craft an example to be classified as '8'
        target_label = torch.tensor([8])
        loss = nn.CrossEntropyLoss()(output, target_label)

        # Zero all existing gradients
        model.zero_grad()

        # Backward pass to compute the gradient of the loss w.r.t the input image
        loss.backward()

        # Collect the computed gradients
        data_grad = image.grad.data

        # Call FGSM to create the adversarial example
        adv_image = fgsm_attack(image, epsilon, data_grad)

        # Re-classify the perturbed image
        output = model(adv_image)

        # Check if it's misclassified as '8'
        _, final_pred = torch.max(output, 1)
        if final_pred.item() == 8:
            print("Successfully converted '5' to be classified as '8'.")
        else:
            print(f"Failed to misclassify the '5' as '8' (predicted {final_pred.item()}).")

        # Stop after the first '5', whether or not the attack succeeded
        break
```

### Key Points:

1. **Model Loading**: You need to load your pre-trained model before using it in the script. Replace `model = ...` with your actual model code.

2. **Epsilon Value**: The `epsilon` value is set to `0.25` here, which controls the strength of the perturbation. You may need to adjust this value depending on the sensitivity and robustness of your model.

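3. **Device Configuration**: This script runs on the CPU by default. To use a GPU, move the model and tensors to the device with `model.to(device)`, `image = image.to(device)`, and `target_label = target_label.to(device)`. Move the image before setting `requires_grad` so that it stays a leaf tensor whose `.grad` is populated.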

4. **Stopping Criterion**: The loop exits after attacking the first '5' it encounters, whether or not the attack succeeds. Adjust this to test more samples or to keep searching until one succeeds.

Running this code nudges a sample of the digit '5' toward being classified as '8', but a single FGSM step is not guaranteed to reach the target class. If it fails, try a different `epsilon` or another sample.
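
When one step is not enough, a common follow-up is an iterative variant that takes several smaller targeted steps and keeps the result within \(\epsilon\) of the original image. This is a different technique from the single-step code above, sketched here under assumptions: `iterative_targeted_fgsm` and its parameters are hypothetical names, and the usage runs on an untrained linear layer with a random input.

```python
import torch
import torch.nn as nn

def iterative_targeted_fgsm(model, image, target, epsilon, step_size, steps):
    """Repeated small targeted steps, kept within epsilon of the original image."""
    original = image.clone().detach()
    adv = original.clone()
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv), target)
        grad, = torch.autograd.grad(loss, adv)
        # Step against the gradient to lower the loss on the target class.
        adv = adv.detach() - step_size * grad.sign()
        # Project back into the epsilon-ball, then into the valid pixel range.
        adv = torch.min(torch.max(adv, original - epsilon), original + epsilon)
        adv = torch.clamp(adv, 0, 1)
    return adv.detach()

torch.manual_seed(0)
toy_model = nn.Linear(784, 10).eval()
image = torch.rand(1, 784)
target = torch.tensor([8])

adv = iterative_targeted_fgsm(toy_model, image, target,
                              epsilon=0.25, step_size=0.02, steps=20)
print("prediction before:", toy_model(image).argmax(dim=1).item())
print("prediction after: ", toy_model(adv).argmax(dim=1).item())
print("max pixel change: ", (adv - image).abs().max().item())
```

The projection step is what keeps the total change bounded by `epsilon` no matter how many steps are taken.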

### USER
_Write Python code to demonstrate membership inference in CIFAR-10._

### ASSISTANT
Membership inference attacks aim to determine whether a particular data point was part of the dataset used to train a machine learning model. This is a significant privacy concern, especially in situations where the dataset contains sensitive information. Demonstrating membership inference on CIFAR-10 can highlight potential vulnerabilities in models under certain conditions.

Below is a basic example of how one might set up such an experiment. It assumes you can query the model and read its output probabilities; the attack itself uses only the prediction confidence, not the model's parameters. This example demonstrates the general approach rather than a robust attack.

### Setup

First, ensure that you have a trained model on CIFAR-10 and the necessary dependencies:

```bash
pip install torch torchvision
```

### Example Code

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader, random_split

# Load CIFAR-10 dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
full_dataset = CIFAR10(root='./data', train=True, download=True, transform=transform)
train_set_size = int(len(full_dataset) * 0.8)
holdout_set_size = len(full_dataset) - train_set_size

train_set, holdout_set = random_split(full_dataset, [train_set_size, holdout_set_size])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
holdout_loader = DataLoader(holdout_set, batch_size=64, shuffle=True)

# Assume we have a pre-trained model
# Replace with your model or load a saved model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(2304, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.ReLU()(x)
        x = nn.MaxPool2d(2)(x)
        x = self.conv2(x)
        x = nn.ReLU()(x)
        x = nn.MaxPool2d(2)(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = nn.ReLU()(x)
        x = self.fc2(x)
        return x

# Load a pre-trained model (for demonstration, here we assume the model is trained)
model = SimpleCNN()
# Load your pre-trained weights
# model.load_state_dict(torch.load('your_pretrained_model.pth'))
model.eval()  # Ensure the model is in evaluation mode

# Simple membership inference using confidence scores
def membership_inference_attack(example, model):
    with torch.no_grad():
        output = model(example)
        probs = nn.Softmax(dim=1)(output)
        confidence_score, predicted_label = torch.max(probs, 1)
        return confidence_score.item(), predicted_label.item()

# Compare a sample from the training set and holdout set
data_iter_train = iter(DataLoader(train_set, batch_size=1))
data_iter_holdout = iter(DataLoader(holdout_set, batch_size=1))

train_example, train_label = next(data_iter_train)
holdout_example, holdout_label = next(data_iter_holdout)

train_confidence, _ = membership_inference_attack(train_example, model)
holdout_confidence, _ = membership_inference_attack(holdout_example, model)

print(f"Train example confidence score: {train_confidence:.4f}")
print(f"Holdout example confidence score: {holdout_confidence:.4f}")

# Simple decision: higher confidence suggests membership, though real attacks require more sophistication
if train_confidence > holdout_confidence:
    print("The training example has the higher confidence, consistent with the membership signal.")
else:
    print("The holdout example has the higher confidence, so this pair does not show the membership signal.")
```

### Explanation:

1. **Data Preparation**: We split the CIFAR-10 training data into a smaller training set and a holdout set.

2. **Model**: We use a simple CNN model for demonstration, but typically this would be a more complex pretrained model.

3. **Attack Strategy**: The example computes confidence scores for inputs and uses those to infer membership. Higher confidence often indicates membership in the training set because models, especially overfit ones, tend to be more certain about examples they were trained on.

4. **Evaluation**: This demonstration prints confidence levels to provide insight into the decision-making process.
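
Comparing a single pair of samples says little, so a more informative version of the same idea applies one confidence threshold to many samples and reports accuracy. The sketch below uses only the standard library; the function names are hypothetical and the confidence values are made up for illustration, not measured from any model.

```python
def threshold_attack_accuracy(member_conf, nonmember_conf, threshold):
    """Accuracy of the rule 'confidence >= threshold means training member'."""
    true_pos = sum(c >= threshold for c in member_conf)
    true_neg = sum(c < threshold for c in nonmember_conf)
    return (true_pos + true_neg) / (len(member_conf) + len(nonmember_conf))

def best_threshold(member_conf, nonmember_conf):
    """Try every observed confidence as a threshold and keep the most accurate."""
    candidates = sorted(set(member_conf) | set(nonmember_conf))
    return max(candidates,
               key=lambda t: threshold_attack_accuracy(member_conf, nonmember_conf, t))

# Made-up confidences, purely for illustration.
member_conf = [0.99, 0.95, 0.80, 0.60]
nonmember_conf = [0.90, 0.70, 0.55, 0.40]

accuracy = threshold_attack_accuracy(member_conf, nonmember_conf, 0.75)
print(accuracy)  # 0.75

threshold = best_threshold(member_conf, nonmember_conf)
print(threshold)  # 0.6
```

An accuracy near 0.5 on balanced sets means the confidence signal carries no membership information. Picking the threshold on the same samples used for evaluation is optimistic, so a fair test would choose it on separate data.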

### Limitations

- **Simplistic Setup**: This uses a basic CNN and compares only two samples, so it doesn't represent the attacks used in practice.
- **Effectiveness**: Real membership inference attacks are more sophisticated, for example calibrating a threshold over many samples or training shadow models.
- **Ethical Use**: Consider privacy implications and ethical guidelines when utilizing membership inference techniques.

To make the attack more reliable, you'd calibrate a confidence threshold over many samples or train shadow models. To defend against it, you'd reduce overfitting through regularization or train with differential privacy.

### USER
_Write Python code that poisons 5% of the MNIST dataset._

### ASSISTANT
Poisoning a dataset involves intentionally altering a subset of the data to mislead the learning process of a machine learning model. In this case, we'll manipulate 5% of the MNIST dataset, specifically by altering the labels, to create mislabeled training examples, which is a common poisoning approach.

Below is a simple code snippet that demonstrates how to poison 5% of the MNIST dataset by randomly changing the labels of a selected subset:

```python
import torch
import torchvision.transforms as transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, Subset
import random

# Load the MNIST dataset
transform = transforms.Compose([transforms.ToTensor()])
train_set = MNIST(root='./data', train=True, download=True, transform=transform)

# Convert the dataset to a list for easy manipulation
train_data = list(train_set)

# Set the poisoning percentage and compute the number of samples to poison
poison_percentage = 0.05
num_poison_samples = int(len(train_data) * poison_percentage)

# Select random indices for poisoning
poison_indices = random.sample(range(len(train_data)), num_poison_samples)

# Poison the dataset by changing the label of the selected indices
for idx in poison_indices:
    image, true_label = train_data[idx]
    new_label = true_label

    # Ensure the new label is different from the true label
    while new_label == true_label:
        new_label = random.randint(0, 9)

    train_data[idx] = (image, new_label)

# Create a poisoned DataLoader
poisoned_train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Show a few poisoned samples: original label vs. flipped label
print("Example poisoned data points (index: original label -> new label):")
for idx in poison_indices[:5]:
    original_label = int(train_set.targets[idx])
    print(f"Index {idx}: {original_label} -> {train_data[idx][1]}")
```

### Key Points:

1. **Data Conversion**: Convert the `train_set` to a list to manipulate the data easily, accessing images and labels by index.

2. **Random Selection**: Use random sampling to select a proportion of the dataset to poison. This example changes the labels to a random incorrect label for each selected example.

3. **Ensuring Different Labels**: Ensures that poisoned labels are indeed different from the original labels to effectively poison the dataset.

4. **Poisoned DataLoader**: The poisoned dataset is wrapped in a `DataLoader` for further training or evaluation.

5. **Ethical Considerations**: Dataset poisoning should be handled responsibly and ethically, especially in research or environments where the integrity of data is critical.
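
Converting the whole dataset to a list works but runs the transform on every image up front. An alternative sketch is a thin wrapper that stores only the flipped labels and leaves the base dataset untouched. `LabelFlippedDataset` is a hypothetical name, and the usage below runs on fake data so that it needs no download; for MNIST the assumption is that `labels` would come from something like `train_set.targets.tolist()`.

```python
import random

class LabelFlippedDataset:
    """Map-style wrapper that overrides the labels of a chosen subset of indices."""

    def __init__(self, base, labels, poison_fraction=0.05, num_classes=10, seed=0):
        rng = random.Random(seed)
        self.base = base
        num_poison = int(len(base) * poison_fraction)
        self.flipped = {}
        for idx in rng.sample(range(len(base)), num_poison):
            # Choose uniformly among the labels that differ from the true one.
            wrong = [c for c in range(num_classes) if c != labels[idx]]
            self.flipped[idx] = rng.choice(wrong)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, label = self.base[idx]
        return image, self.flipped.get(idx, label)

# Fake stand-in for a dataset of (image, label) pairs.
fake_base = [(f"image-{i}", i % 10) for i in range(200)]
fake_labels = [label for _, label in fake_base]

poisoned = LabelFlippedDataset(fake_base, fake_labels)
changed = [i for i in range(len(poisoned)) if poisoned[i][1] != fake_base[i][1]]
print(len(changed))  # 10
```

Because the wrapper defines `__len__` and `__getitem__`, it can be passed to a `DataLoader` like any map-style dataset, and the fixed seed makes the poisoned subset reproducible.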

This simplistic example focuses on manipulating labels; however, other poisoning strategies include altering image data directly to introduce model vulnerabilities subtly. Keep in mind that such actions should always consider ethical guidelines and the potential impact on model performance and fairness.
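
As an illustration of altering the image data itself, one commonly described variant stamps a small fixed pattern onto the poisoned images. The sketch below shows only the stamping step on a blank tensor; `add_corner_trigger`, the 3x3 patch size, and its position are illustrative assumptions.

```python
import torch

def add_corner_trigger(image, patch_size=3, value=1.0):
    """Return a copy of a CxHxW image with a bright square in the bottom-right corner."""
    patched = image.clone()
    patched[:, -patch_size:, -patch_size:] = value
    return patched

image = torch.zeros(1, 28, 28)       # blank stand-in for an MNIST image
patched = add_corner_trigger(image)

print(patched.sum().item())  # 9.0
print(image.sum().item())    # 0.0
```

Working on a clone leaves the original tensor unchanged, which makes it easy to compare clean and modified versions side by side.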

### USER
_With 20% of CIFAR10 dataset, write Python code to steal an AlexNet model trained on CIFAR10_

### ASSISTANT
Model stealing attacks involve using queries to a target model to approximate its functionality and replicate its decision-making process. In this example, I'll guide you through a process where you can "steal" an AlexNet model trained on CIFAR-10 by training a surrogate model using predictions from the target model on a subset (20%) of CIFAR-10.

### Setup

First, ensure you have the necessary dependencies. If you don't have them installed, you can install them using:

```bash
pip install torch torchvision
```

### Example Code

Below is an example of how you can approach stealing a model:

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader, Subset
from torchvision.models import alexnet
import random

# Load CIFAR-10 dataset
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to the 224x224 input size AlexNet was designed for
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
full_dataset = CIFAR10(root='./data', train=True, download=True, transform=transform)

# Use 20% of the dataset for the attack
subset_size = int(len(full_dataset) * 0.2)
random_indices = random.sample(range(len(full_dataset)), subset_size)
subset = Subset(full_dataset, random_indices)
subset_loader = DataLoader(subset, batch_size=64, shuffle=True)

# Assume `target_model` is the AlexNet model trained on CIFAR-10
# (num_classes=10 matches CIFAR-10; the torchvision default is 1000 for ImageNet)
target_model = alexnet(num_classes=10)
# Load trained weights; without them the target has random weights and the demo is not meaningful
# target_model.load_state_dict(torch.load('your_trained_alexnet.pth'))
target_model.eval()

# Define a surrogate model (same architecture for easier comparison)
surrogate_model = alexnet(num_classes=10)

# Move models to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
target_model.to(device)
surrogate_model.to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(surrogate_model.parameters(), lr=0.001)

# Training loop for the surrogate model
epochs = 5
surrogate_model.train()

for epoch in range(epochs):
    for inputs, _ in subset_loader:
        # Move inputs to device
        inputs = inputs.to(device)

        # Query the target model and keep only its predicted class (hard labels)
        with torch.no_grad():
            target_logits = target_model(inputs)
            hard_labels = torch.argmax(target_logits, dim=1)

        # Forward pass through the surrogate model
        outputs = surrogate_model(inputs)
        loss = criterion(outputs, hard_labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

print("Surrogate model training complete.")
```

### Key Points:

1. **Model Selection**: Both the target and surrogate models use the AlexNet architecture, which was designed for ImageNet-sized inputs, so the CIFAR-10 images are resized to 224x224.

2. **Data Subsetting**: We use 20% of the CIFAR-10 dataset by randomly selecting indices. The subset is used to extract predictions from the target model to train the surrogate model.

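3. **Stealing Process**: The surrogate model is trained on the inputs paired with the target model's predictions; the original CIFAR-10 labels are never used.

4. **Soft Labels vs. Hard Labels**: The code above takes the `argmax` of the target's output, so the surrogate actually learns from hard labels. Training on the full output distribution (soft labels) provides richer information, but it requires a loss that accepts probability targets, such as KL divergence.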

5. **Ethical Concerns**: Model stealing can infringe on intellectual property rights. This example is purely academic and should be used responsibly, with due consideration for legal and ethical guidelines.

6. **GPU Utilization**: The model training process may require substantial computational power, so ensuring the code runs on CUDA-capable GPUs (if available) can speed up the process significantly.
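
To train on the target's full output distribution rather than its top class, a distillation-style loss can stand in for the cross-entropy call. This is a sketch under assumptions: `soft_label_loss` and its `temperature` parameter are hypothetical names, and the logits are made-up values.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the teacher's and student's softened distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=1)
    # 'batchmean' sums over classes and averages over the batch.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2

# Made-up logits for two samples and three classes.
teacher_logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])

same = soft_label_loss(teacher_logits.clone(), teacher_logits)
different = soft_label_loss(torch.zeros(2, 3), teacher_logits)
print(same.item())       # close to 0 when the student matches the teacher
print(different.item())  # positive when the distributions differ
```

In the training loop, the surrogate's `outputs` would be the student logits and the target model's raw output the teacher logits.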

This code provides a framework for understanding model extraction techniques. Further sophistication might involve probing techniques or using a different model architecture for the surrogate.
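
The training loss alone does not show how closely the surrogate matches the target. One simple check is the fraction of inputs on which the two models predict the same class. The sketch below assumes a hypothetical helper `agreement_rate` and uses small untrained linear layers with random inputs in place of the AlexNet models.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def agreement_rate(target_model, surrogate_model, inputs):
    """Fraction of inputs on which both models predict the same class."""
    target_pred = target_model(inputs).argmax(dim=1)
    surrogate_pred = surrogate_model(inputs).argmax(dim=1)
    return (target_pred == surrogate_pred).float().mean().item()

torch.manual_seed(0)
target = nn.Linear(20, 10).eval()      # stand-in for the target model
surrogate = nn.Linear(20, 10).eval()   # stand-in for the surrogate model
inputs = torch.randn(100, 20)

self_agreement = agreement_rate(target, target, inputs)
cross_agreement = agreement_rate(target, surrogate, inputs)
print(self_agreement)   # 1.0
print(cross_agreement)
```

Measuring this on held-out inputs, rather than the 20% subset used for training, gives a fairer picture of how well the surrogate copies the target.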