# Laboratory on Adversarial Machine Learning
_Digital Forensics and Biometrics_ A.A. 2025/2026

Lecturer: prof. **Simone Milani** (simone.milani@dei.unipd.it)

Teaching Assistants: **Mattia Tamiazzo** (mattia.tamiazzo.1@phd.unipd.it); **Luca Domeneghetti** (luca.domeneghetti@studenti.unipd.it)

_(Sources: [TensorFlow approach](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/generative/adversarial_fgsm.ipynb) and [PyTorch approach](https://docs.pytorch.org/tutorials/beginner/fgsm_tutorial.html))_

### Prerequisites
This laboratory requires some basic prior knowledge of _machine learning_ and _neural networks_.

The additional information required to understand the laboratory will be provided during the lecture.

The following aspects are assumed to be known:
- What an ML model is and how to interact with it
- Tensors
- Gradients and loss functions (in particular Cross Entropy and Log Softmax)
- Backward propagation in a Neural Network

### Contents
The goal of this laboratory is to create an adversarial data sample (i.e. an image) that is capable of fooling an ML algorithm so that it mislabels the input sample. To do so, we will use the **Fast Gradient Sign Method** (FGSM) to shift our input sample away from its true class or towards a desired target class.

By the end of the laboratory the student will have acquired the following skills:
- Basic usage of the PyTorch framework
- Load and use a pretrained network in evaluation mode
- Compute the signed gradient through backpropagation
- **Create an adversarial data sample**
- **Shift the input prediction towards a desired target**
___
# Threat Model

For context, there are many categories of adversarial attacks, each with
a different goal and assumption of the attacker's knowledge. However, in
general the overarching goal is to add the least amount of perturbation
to the input data to cause the desired misclassification. There are
several kinds of assumptions of the attacker's knowledge, two of which
are: **white-box** and **black-box**. A *white-box* attack assumes the
attacker has full knowledge and access to the model, including
architecture, inputs, outputs, and weights. A *black-box* attack assumes
the attacker only has access to the inputs and outputs of the model, and
knows nothing about the underlying architecture or weights. There are
also several types of goals, including **misclassification** and
**source/target misclassification**. A goal of *misclassification* means
the adversary only wants the output classification to be wrong but does
not care what the new classification is. A *source/target
misclassification* means the adversary wants to alter an image that is
originally of a specific source class so that it is classified as a
specific target class.

In this laboratory, FGSM is used as a *white-box* attack, first with the goal of
*misclassification* (Task 1) and then with the goal of *source/target misclassification* (Task 2). With this background information, we can now
discuss the attack in detail.

# Fast Gradient Sign Method
The fast gradient sign method works by using the gradients of the neural network to create an adversarial example. For an input image, the method uses the gradients of the loss with respect to the input image to create a new image that maximises the loss. This new image is called the adversarial image. This can be summarised using the following expression:
$$\text{adv}_x = x + \epsilon*\text{sign}(\nabla_xJ(\theta, x, y))$$

where

* $\text{adv}_x$ : Adversarial image.
* $x$ : Original input image.
* $y$ : Original input label.
* $\epsilon$ : Multiplier to ensure the perturbations are small.
* $\theta$ : Model parameters.
* $J$ : Loss.

![](https://pytorch.org/tutorials/_static/img/fgsm_panda_image.png)

An intriguing property here is that the gradients are taken with respect to the input image. This is done because the objective is to create an image that maximises the loss. One way to accomplish this is to find how much each pixel in the image contributes to the loss value and add a perturbation accordingly. This is fast because the contribution of each input pixel to the loss can be obtained with a single backward pass, using the chain rule to compute the required gradients. In addition, since the model is no longer being trained, the gradient is not taken with respect to the trainable variables (i.e., the model parameters), which therefore remain constant. The only goal is to fool an already trained model.
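To make the idea concrete, here is a minimal sketch of a single FGSM step on a toy linear classifier. The weights `W`, the input `x`, and the value of `eps` are illustrative assumptions and have nothing to do with MobileNetV2. Only the input requires a gradient, so the parameters stay fixed.

```python
import torch
import torch.nn.functional as F

# Toy linear classifier (assumed values): 2 classes, 3 input features.
# W does not require gradients, mirroring a frozen, already trained model.
W = torch.tensor([[1.0, -1.0, 0.5],
                  [-0.5, 1.0, -1.0]])

x = torch.tensor([[0.2, -0.1, 0.4]], requires_grad=True)  # stand-in for the image
y = torch.tensor([0])                                     # true label

# Loss on the clean input and its gradient w.r.t. the input
loss_clean = F.cross_entropy(x @ W.T, y)
loss_clean.backward()
grad_sign = x.grad.sign()

# FGSM step: move each feature by eps in the direction that increases the loss
eps = 0.1
adv_x = (x + eps * grad_sign).detach()
loss_adv = F.cross_entropy(adv_x @ W.T, y)

print(grad_sign)                            # direction of the perturbation
print(loss_clean.item(), loss_adv.item())   # loss before and after the step
```

For a linear model the cross-entropy loss is convex in the input, so a step along the signed gradient cannot decrease it. For a deep network this holds only approximately, for small `eps`.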

So let's try and fool a pretrained model. In this tutorial, the model used is [MobileNetV2](https://pytorch.org/hub/pytorch_vision_mobilenet_v2/) pretrained on [ImageNet](http://www.image-net.org/).

In [None]:
import torch
import torch.nn.functional as F
import torchvision.transforms as transforms
from torchvision.transforms import functional as Fv
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from PIL import Image
import urllib.request
import matplotlib.pyplot as plt
import numpy as np

# setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device")

## 1. Setting up the Target Model & Utilities

In this laboratory, we will perform adversarial attacks against **MobileNetV2**, a lightweight Convolutional Neural Network (CNN) pre-trained on the **ImageNet** dataset.

![](https://www.researchgate.net/publication/350152088/figure/fig1/AS:1002717703045121@1616077938892/The-proposed-MobileNetV2-network-architecture.png)

### Implementation Details

To prepare the environment for the attack, we need to perform three key steps:

1.  **Load the Model:** We import MobileNetV2 with its default pre-trained weights.
    * *Note:* We explicitly set the model to **evaluation mode** (`model.eval()`). This is critical because it disables `Dropout` and makes `BatchNorm` layers use their stored running statistics instead of batch statistics, ensuring consistent predictions during the attack generation.
2.  **Define Preprocessing:** Standard ImageNet models require specific input normalization. We define a transformation pipeline that resizes images and normalizes them using the standard ImageNet mean ($\mu$) and standard deviation ($\sigma$).
    $$\text{Input}_{norm} = \frac{\text{Input} - \mu}{\sigma}$$
3.  **Visualization Helpers:** Since the model requires normalized tensors, we cannot visualize them directly. We define two helper functions:
    * `denormalize`: Reverses the math above to return the tensor to the original pixel space.
    * `tensor2img`: Converts the PyTorch tensor (Channels, Height, Width) into a PIL Image (Height, Width, Channels) suitable for plotting.
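The normalization and its inverse can be checked on a random tensor before touching any real image. The sketch below is self-contained and only assumes the ImageNet statistics quoted above. The `normalize` helper is hypothetical (the laboratory uses `transforms.Normalize` instead) and is here only to make the round trip explicit.

```python
import torch

# ImageNet statistics, shaped for broadcasting over (N, C, H, W)
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize(img):
    """Hypothetical helper: (img - mean) / std, per channel."""
    return (img - MEAN) / STD

def denormalize(tensor):
    """Inverse of normalize: tensor * std + mean."""
    return tensor * STD + MEAN

torch.manual_seed(0)
img = torch.rand(1, 3, 4, 4)             # fake image with values in [0, 1)
restored = denormalize(normalize(img))   # should match img up to rounding

# (C, H, W) -> (H, W, C), the layout expected by plotting libraries
hwc = restored.squeeze(0).permute(1, 2, 0)
print(torch.allclose(restored, img, atol=1e-5), tuple(hwc.shape))
```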

In [None]:
# load Model
weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights)
model.eval() # set to evaluation mode (batchnorm/dropout behavior changes)
model.to(device)

# define Transforms (Normalization constants)
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=MEAN, std=STD),
])

# helper Functions
def denormalize(tensor):
    """Reverses the ImageNet normalization for visualization."""
    
    # clone to avoid modifying the original tensor in place
    tensor = tensor.clone().detach()
    
    # build the statistics on the same device as the input tensor
    mean = torch.tensor(MEAN, device=tensor.device).view(1, 3, 1, 1)
    std = torch.tensor(STD, device=tensor.device).view(1, 3, 1, 1)

    return tensor * std + mean

def tensor2img(tensor, denorm=True):
    """Converts a Tensor (normalized or not) to a PIL Image for plotting."""
    
    # denormalize
    denormalized = tensor.clone()
    if denorm:
        denormalized = denormalize(tensor)
    
    # clamp to [0, 1] range to avoid artifacts
    clamped = torch.clamp(denormalized, 0, 1)
    
    # convert to Numpy (H, W, C)
    img_numpy = clamped.squeeze(0).cpu().numpy()
    img_numpy = np.transpose(img_numpy, (1, 2, 0))
    
    # convert to PIL (0-255)
    return Image.fromarray((img_numpy * 255).astype(np.uint8))

### Exploring available classes

To perform an adversarial attack, we first need a valid input sample. Since the model is trained on ImageNet, the input image must correspond to one of the 1,000 specific classes the model recognizes.

In the following cell, we access the metadata attached to the model weights to list the available categories and identify a suitable subject for our test.

In [None]:
categories = weights.meta['categories']

# print first 10 for brevity, or remove slice to see all
print('\n'.join(f'{i}\t{element}' for i, element in enumerate(categories[:10])))
print("...")

### Loading a sample image

Next, we download an image to serve as the target for our adversarial attack.

The code below retrieves a sample image (a Siamese cat) directly from a URL. You are free to replace this URL with an image of your choice from the web or from this [ImageNet sample repository](https://github.com/EliSchwartz/imagenet-sample-images).

> **Note:** If you choose a custom image from GitHub, right-click the image and select "Copy Image Link" to ensure you use the **raw static URL** rather than the repository UI link.

In [None]:
url = 'https://www.purina.co.uk/sites/default/files/styles/square_medium_440x440/public/2022-06/Siamese%201.jpg'
filename = 'siamese.jpeg'

try:
    urllib.request.urlretrieve(url, filename)
    pil_image = Image.open(filename)
    print("Image downloaded successfully.")
except Exception as e:
    print(f"Error downloading image: {e}")

With the image loaded, we must now format it for the model. We apply the preprocessing transforms defined earlier and use `unsqueeze(0)` to add a batch dimension, creating a tensor of shape $(1, C, H, W)$.

We then pass this batch through the model to obtain the initial prediction and confidence score.
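As a quick illustration of the two operations just described, the sketch below adds a batch dimension to a dummy tensor and turns made-up logits into a top-1 prediction. The three-class logits are an assumption chosen for readability; the real model outputs 1,000 values per image.

```python
import torch
import torch.nn.functional as F

# A single preprocessed image has shape (C, H, W)
image = torch.zeros(3, 224, 224)
batch = image.unsqueeze(0)          # add batch dimension -> (1, C, H, W)

# Made-up logits for a 3-class problem (one row per image in the batch)
logits = torch.tensor([[2.0, 0.5, -1.0]])
probs = F.softmax(logits, dim=1)    # each row sums to 1

top_prob, top_idx = probs.max(1)    # confidence and index of the top class
print(tuple(batch.shape), top_idx.item(), round(top_prob.item(), 3))
```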

In [None]:
# preprocess the image ONCE and create the master tensor batch
image_batch = preprocess(pil_image).unsqueeze(0).to(device)

# run inference
output = model(image_batch)
probs = F.softmax(output, dim=1)

top_prob, top_catid = probs.max(1)
label = categories[top_catid.item()]
confidence = top_prob.item()

plt.figure(figsize=(5,5))
plt.title(f'{label}: {confidence*100:.2f}%')
plt.imshow(tensor2img(image_batch))
plt.axis('off')
plt.show()

## 2. Constructing the Adversarial Perturbation

We can now begin the attack on our model using the **Fast Gradient Sign Method (FGSM)**.

The core idea is to create a perturbation that pushes the image across the decision boundary. To achieve this, we calculate the **gradient of the loss** with respect to the input image ($x$). This gradient, denoted as $\nabla_x J(\theta, x, y)$, indicates how we should adjust pixel values to maximize the classification error.

The function below implements this logic:
1.  It enables gradient tracking on the input tensor.
2.  It computes the cross-entropy loss between the model's output and the label index passed to the function.
3.  It backpropagates to find the gradient and returns the **sign** of that gradient, which determines the direction of the perturbation.

In [None]:
def get_gradient(tensor, label_index):
    """
    Calculates the gradient of the loss w.r.t the input tensor.
    Note: We process the tensor directly, avoiding converting back and forth to Image.
    """
    
    # clone the tensor to ensure we don't mess up the original computation graph
    data = tensor.clone().detach().to(device)
    data.requires_grad = True

    # forward pass
    output = model(data)
    
    # calculate Loss
    target = torch.tensor([label_index]).to(device)
    loss = F.cross_entropy(output, target)
    
    # backward pass
    model.zero_grad()
    loss.backward()

    # return the sign of the gradient (each entry is -1, 0 or +1)
    return data.grad.sign()

Let's test our `get_gradient()` function by running it against our chosen image and associated label's index.

Plotting the perturbation shows what the signed gradient looks like when rendered as an RGB image.

In [None]:
# get the index of the current true label
true_label_idx = categories.index('Siamese cat')

# calculate gradient
perturbation = get_gradient(image_batch, true_label_idx)

# visualize the perturbation (noise)
plt.figure(figsize=(5,5))
plt.title("Perturbation (Signed Gradient)")
plt.imshow(tensor2img(perturbation))
plt.axis('off')
plt.show()

### Measuring the "noisiness" of the image

To evaluate the quality of our attack, we need to measure how much the adversarial image differs from the original. We will use the **Peak Signal-to-Noise Ratio (PSNR)**, a standard metric for image fidelity.

The PSNR is calculated using the Mean Squared Error (MSE) between the two images:

$$PSNR = 10 \cdot \log_{10}\left(\frac{MAX^2}{MSE}\right)$$

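Where $MAX$ represents the maximum possible pixel value (1.0 for floating-point images in the $[0, 1]$ range). Note that the tensors fed to the model are normalized with the ImageNet mean and standard deviation, so they must be mapped back to $[0, 1]$ (e.g., with `denormalize`) for $MAX = 1.0$ to be the correct peak value. A **higher PSNR** indicates less distortion, meaning the attack is harder for a human to detect.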
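A short worked example may help to get a feeling for the scale of the metric. Assume, as a simplification, that every pixel of a $[0, 1]$ image is shifted by exactly $\pm\epsilon_{pixel}$ and that no value is clipped. Then $MSE = \epsilon_{pixel}^2$ and the PSNR depends only on the step size. The helper name `psnr_uniform` is hypothetical.

```python
import math

def psnr_uniform(eps_pixel, max_val=1.0):
    """PSNR when every pixel moves by exactly +/- eps_pixel (so MSE = eps_pixel**2)."""
    mse = eps_pixel ** 2
    return 10 * math.log10(max_val ** 2 / mse)

for eps_pixel in (0.001, 0.01, 0.1):
    print(f"eps = {eps_pixel}: PSNR = {psnr_uniform(eps_pixel):.1f} dB")
```

Under this assumption, each tenfold increase of the step costs 20 dB: roughly 60, 40 and 20 dB for the three values above.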

In [None]:
def calculate_psnr(img_tensor_1, img_tensor_2, max_val=1.0):
    """Calculates Peak Signal-to-Noise Ratio between two tensors."""
    
    mse = torch.mean((img_tensor_1 - img_tensor_2) ** 2)
    if mse == 0:
        return float('inf')
    return 10 * torch.log10((max_val ** 2) / mse)

### Visualizing the attack

To systematically evaluate the effectiveness of our adversarial examples, we define a dedicated display function. This utility performs three key tasks:

1.  **Inference:** It passes the perturbed image back through the model to determine if the predicted label has changed (i.e., if the attack was successful).
2.  **Quality Check:** It computes the PSNR metric to quantify the level of visual distortion introduced by the attack.
3.  **Visualization:** It renders the final adversarial image with the new confidence scores and quality metrics.

In [None]:
def display_attack_result(adv_tensor, orig_tensor, description, noise=None):
    """Displays the adversarial image, prediction, and PSNR."""
    
    # get prediction for the adversarial image
    output = model(adv_tensor)
    probs = F.softmax(output, dim=1)
    top_prob, top_catid = probs.max(1)
    label = categories[top_catid.item()]
    confidence = top_prob.item()

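    # calculate PSNR in pixel space: undo the normalization and clamp to [0, 1],
    # so that max_val=1.0 is the actual peak value
    adv_pixels = torch.clamp(denormalize(adv_tensor), 0, 1)
    orig_pixels = torch.clamp(denormalize(orig_tensor), 0, 1)
    psnr = calculate_psnr(adv_pixels, orig_pixels)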
    
    # plot
    if noise is not None:
        fig, axs = plt.subplots(1, 2, figsize=(12, 6))
        axs[1].imshow(tensor2img(noise))
        axs[1].set_title("Noise")
        axs[1].axis('off')
        main_ax = axs[0]
    else:
        fig, ax = plt.subplots(1, figsize=(6, 6))
        main_ax = ax

    main_ax.imshow(tensor2img(adv_tensor))
    main_ax.set_title(f'{description}\nPred: {label} ({confidence*100:.2f}%)\nPSNR: {psnr:.2f} dB')
    main_ax.axis('off')
    
    plt.show()

## Task 1: Untargeted attack

### Single Step Untargeted FGSM

We are now ready to generate adversarial examples. We will iterate through a list of **epsilon ($\epsilon$)** values, which control the magnitude of the perturbation.

The core logic follows the **Fast Gradient Sign Method (FGSM)** update rule:

$$x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$$

Where:
* $x$ is the original image.
* $\epsilon$ (epsilon) is the attack strength.
* $\text{sign}(\nabla_x J)$ is the direction of the gradient we calculated earlier.

**The Trade-off:**
* **Low $\epsilon$:** The attack is subtle and harder to detect visually, but less likely to fool the model.
* **High $\epsilon$:** The attack is powerful and likely to succeed, but the noise becomes visible to the human eye, lowering the PSNR.
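When reading the $\epsilon$ values, keep in mind that the perturbation is added to the *normalized* tensor. Since $\text{Input}_{norm} = (\text{Input} - \mu)/\sigma$, a step of $\epsilon$ in normalized space corresponds to a step of $\epsilon \cdot \sigma_c$ in the $[0, 1]$ pixel range of channel $c$. The sketch below converts a few values to 8-bit intensity levels. The helper name `eps_to_pixel_scale` is hypothetical.

```python
STD = [0.229, 0.224, 0.225]   # ImageNet per-channel standard deviation

def eps_to_pixel_scale(eps, std=STD):
    """Map a step in normalized space to 8-bit intensity levels, per channel."""
    return [eps * s * 255 for s in std]

for eps in (0.001, 0.01, 0.1):
    levels = eps_to_pixel_scale(eps)
    print(eps, [round(v, 2) for v in levels])
```

For instance, $\epsilon = 0.1$ changes each pixel by a bit less than 6 levels out of 255.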

In [None]:
epsilons = [0, 0.001, 0.01, 0.1, 0.15, 0.9]

for eps in epsilons:
    # untargeted attack: ADD epsilon * gradient to MAXIMIZE loss
    adv_x = image_batch + eps * perturbation
    
    desc = f'Epsilon = {eps:.3f}' if eps else 'Input'
    display_attack_result(adv_x, image_batch, desc, perturbation if eps else None)

### Iterative Untargeted FGSM

While the standard FGSM takes a single, large step in the direction of the gradient, the **Iterative FGSM** takes multiple smaller steps.

Instead of calculating the gradient once on the original image, we recalculate the gradient at every step based on the *current* perturbed image. This allows the attack to fine-tune the noise, often resulting in a successful misclassification with potentially less perceptible distortion.

The update rule for each iteration $t$ is:

$$x_{t+1} = x_t + \alpha \cdot \text{sign}(\nabla_{x_t} J(\theta, x_t, y))$$

Where:
* $x_0$ is the original image.
* $\alpha$ (epsilon in the code below) is the step size for each iteration.
* The gradient is re-computed at every step.
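A useful consequence of the update rule is that after $T$ steps of size $\alpha$ no input value can have moved by more than $T \cdot \alpha$. The following sketch checks this on a toy two-class linear model. The weights, the input and the step settings are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

W = torch.tensor([[1.0, -1.0, 0.5],
                  [-0.5, 1.0, -1.0]])   # toy 2-class linear model (assumed)
x0 = torch.tensor([[0.2, -0.1, 0.4]])   # clean input, predicted as class 0
y = torch.tensor([0])                   # true label

alpha, steps = 0.01, 30
adv_x = x0.clone()

for _ in range(steps):
    # recompute the gradient at the current adversarial point
    data = adv_x.clone().detach().requires_grad_(True)
    F.cross_entropy(data @ W.T, y).backward()
    # gradient ascent on the loss of the true class
    adv_x = adv_x + alpha * data.grad.sign()

linf = (adv_x - x0).abs().max().item()
pred = (adv_x @ W.T).argmax(dim=1).item()
print(f"L_inf distance: {linf:.3f} (budget {steps * alpha:.3f}), prediction: {pred}")
```

With the values used in the laboratory, 30 steps of size 0.001 bound the change of each normalized value by 0.03.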

In [None]:
epsilon = 0.001
epochs = 30

# start from the original tensor
adv_x = image_batch.clone()

for _ in range(epochs):
    # get gradient w.r.t current adversarial example
    # Note: we pass the tensor directly; converting to image and back would lose precision.
    iter_perturbation = get_gradient(adv_x, true_label_idx)
    
    # update image (gradient ascent for untargeted)
    adv_x = adv_x + epsilon * iter_perturbation

display_attack_result(adv_x, image_batch, f"Iterative Attack ({epochs} epochs)")

## Task 2: Targeted Attack

### Single Step Targeted FGSM

In the previous task, we performed an **untargeted attack**. We simply wanted the model to make *any* mistake. By adding noise in the direction of the gradient sign, we pushed the image away from its true class, typically across a nearby decision boundary. This often results in a prediction that is semantically similar to the original (e.g., a Golden Retriever might be misclassified as a Labrador).

However, a sophisticated attacker often wants to force a specific outcome (e.g., making a self-driving car classify a "Stop" sign as a "Speed Limit 60" sign). This is a **targeted attack**.

#### Implementation Logic

To achieve this, we modify the FGSM approach slightly:
1.  **Target Selection:** We calculate the gradient of the loss with respect to the input, using a specific **target label** ($y_{target}$) rather than the true label.
2.  **Gradient Descent:** Instead of maximizing the loss (moving *away* from the true class), we want to **minimize** the loss with respect to our target class (moving *towards* it).

Therefore, we **subtract** the perturbation instead of adding it:

$$x_{adv} = x - \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y_{target}))$$
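The difference between the two update rules can be seen on a three-class toy problem. In the sketch below the "model" is the identity, i.e. the input values are used directly as logits. This is an assumption made only to keep the arithmetic easy to follow, and the helper `signed_grad` is hypothetical.

```python
import torch
import torch.nn.functional as F

def signed_grad(x, label):
    """Sign of d(cross-entropy)/d(input) for a toy model whose logits equal the input."""
    data = x.clone().detach().requires_grad_(True)
    F.cross_entropy(data, torch.tensor([label])).backward()
    return data.grad.sign()

x = torch.tensor([[1.0, 0.5, 0.0]])   # logits of the toy model, predicted class 0
true_idx, target_idx, eps = 0, 2, 0.6

# untargeted: ADD the signed gradient of the loss for the true label
adv_untargeted = x + eps * signed_grad(x, true_idx)
# targeted: SUBTRACT the signed gradient of the loss for the target label
adv_targeted = x - eps * signed_grad(x, target_idx)

print(adv_untargeted.argmax(dim=1).item(), adv_targeted.argmax(dim=1).item())
```

In this toy setting the untargeted step lands on class 1, the runner-up, while the targeted step reaches the chosen class 2.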

You can find a list of all 1,000 ImageNet classes [here](https://deeplearning.cms.waikato.ac.nz/user-guide/class-maps/IMAGENET/). We encourage you to experiment with different target indices to see which classes are easier to mimic.

In [None]:
target_idx = 200
target_label_name = categories[target_idx]
print(f'Targeting class {target_idx}: "{target_label_name}"')

In [None]:
# calculate gradient relative to the TARGET class, not the true class.
target_perturbation = get_gradient(image_batch, target_idx)

epsilons = [0, 0.001, 0.01, 0.1, 0.15, 0.3, 0.6]

for eps in epsilons:
    # targeted attack: SUBTRACT epsilon * gradient to MINIMIZE loss w.r.t Target
    # (gradient descent towards the target)
    adv_x = image_batch - eps * target_perturbation
    
    desc = f'Targeted Eps = {eps:.3f}' if eps else 'Input'
    display_attack_result(adv_x, image_batch, desc)

### Iterative Targeted FGSM

When targeting a specific class that is semantically very different from the original image (e.g., turning a "Cat" into a "Black Swan"), a single gradient step is often insufficient. A large single step might overshoot the target manifold or push the image into a completely different, unintended class.

To overcome this, we employ an **Iterative Targeted Attack**. Instead of one large jump, we take many small steps in the direction of the target class. This is essentially performing **Gradient Descent** on the image pixels to minimize the loss with respect to the target label.

The update rule for each epoch $i$ becomes:

$$x_{i+1} = x_i - \alpha \cdot \text{sign}(\nabla_{x_i} J(\theta, x_i, y_{target}))$$
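The iterative rule can be sketched on the same kind of toy problem, this time recording the first step at which the target class is predicted. As before, using the input values directly as logits is an assumption for illustration, and all numbers are made up.

```python
import torch
import torch.nn.functional as F

x0 = torch.tensor([[1.0, 0.5, 0.0]])   # toy input: its entries are the logits
target = torch.tensor([2])             # class we want the model to predict
alpha, steps = 0.04, 20

adv_x = x0.clone()
first_hit = None                       # first step at which the target is predicted

for step in range(1, steps + 1):
    data = adv_x.clone().detach().requires_grad_(True)
    F.cross_entropy(data, target).backward()
    # gradient descent on the loss of the target class
    adv_x = adv_x - alpha * data.grad.sign()

    if first_hit is None and adv_x.argmax(dim=1).item() == target.item():
        first_hit = step

print(f"target first predicted after {first_hit} steps")
```

Initialising `first_hit` to `None` rather than `0` keeps "never reached" distinguishable from "reached at the very first step".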

In [None]:
hard_target_idx = 100
print(f'Hard Target {hard_target_idx}: "{categories[hard_target_idx]}"')

In [None]:
epsilon = 0.0001 # small step size
epochs = 200
epoch_reached = None # first epoch at which the target class is predicted

adv_x = image_batch.clone()

for i in range(epochs):
    # get gradient w.r.t target class based on current adversarial image
    iter_grad = get_gradient(adv_x, hard_target_idx)
    
    # update image (gradient descent towards target)
    adv_x = adv_x - epsilon * iter_grad

    # check the current prediction without tracking gradients
    with torch.no_grad():
        current_pred = model(adv_x).argmax(dim=1).item()
    if current_pred == hard_target_idx and epoch_reached is None:
        epoch_reached = i

if epoch_reached is not None:
    print(f"Target reached at epoch {epoch_reached}!")
else:
    print(f"Target not reached within {epochs} epochs.")
display_attack_result(adv_x, image_batch, f"Targeted Iterative ({epochs} epochs)")

## Task 3: Background vs. Foreground Sensitivity

A critical question in understanding Convolutional Neural Networks is determining *where* the model looks. Does it classify a "car" because of the wheels and chassis (foreground), or because it detects a road and asphalt (background)?

To investigate this, we can restrict our adversarial attack to specific regions of the image.

### Methodology
We will employ a **Masked Gradient Attack**. The process is as follows:

1.  **Load a Mask:** We import a binary mask ($M$) where pixel values are $1$ for the region of interest (e.g., the cat) and $0$ elsewhere.
2.  **Compute Gradient:** We calculate the gradient for the entire image as usual.
3.  **Apply Mask:** Before updating the image, we perform an element-wise multiplication between the gradient and the mask.
    $$\text{perturbation} = \text{sign}(\nabla_x J(\theta, x, y)) \odot M$$
    * If $M_{ij} = 0$, the gradient becomes 0, and that pixel remains unchanged.
    * If $M_{ij} = 1$, the perturbation is applied normally.

By toggling the mask (using `1 - mask`), we can choose to attack **only the background** or **only the foreground** to see which region contributes most to the model's vulnerability.
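The effect of the mask can be verified on a tiny tensor before applying it to the real image. In the sketch below the 2x2 image, the signed gradient and the mask are all made-up values; the point is only that pixels where the mask is 0 stay untouched.

```python
import torch

x = torch.full((1, 1, 2, 2), 0.5)                 # toy 2x2 single-channel image
grad_sign = torch.tensor([[[[1.0, -1.0],
                            [-1.0, 1.0]]]])       # made-up signed gradient
mask = torch.tensor([[[[1.0, 0.0],
                       [0.0, 1.0]]]])             # 1 = region of interest

eps = 0.1
adv_roi = x - eps * grad_sign * mask              # perturb only where mask == 1
adv_rest = x - eps * grad_sign * (1 - mask)       # perturb only where mask == 0

print(adv_roi.squeeze())
print(adv_rest.squeeze())
```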

In [None]:
input_height = image_batch.shape[2]
input_width = image_batch.shape[3]

mask_transform = transforms.Compose([
    transforms.Resize((input_height, input_width)),
    transforms.ToTensor(), 
])

mask_raw = Image.open("mask.png").convert('RGB')
mask = mask_transform(mask_raw)

# add batch dimension to match image_batch [1, 3, H, W]
mask = mask.unsqueeze(0).to(device)

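# choose the region to attack: keep ONE of the two assignments below
# final_mask = mask      # attack only where the mask is 1 (region of interest)
final_mask = 1 - mask    # attack only where the mask is 0 (everything else)

plt.figure(figsize=(5,5))
plt.imshow(tensor2img(final_mask, denorm=False))
plt.axis('off')
plt.show()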

In [None]:
epsilon = 0.1
epochs = 10

adv_x = image_batch.clone()

for i in range(epochs):

    iter_grad = get_gradient(adv_x, hard_target_idx)     

    iter_grad = iter_grad * final_mask
    
    adv_x = adv_x - epsilon * iter_grad

display_attack_result(adv_x, image_batch, f"Masked Attack ({epochs} epochs)")

## Conclusion & Key Takeaways

In this laboratory, we explored the fragility of modern Convolutional Neural Networks (CNNs). Despite achieving high accuracy on the ImageNet dataset, we demonstrated that **MobileNetV2** is highly susceptible to **adversarial examples**.

Through our experiments, we observed three critical phenomena:

1.  **Imperceptible noise:** By calculating the gradient of the loss with respect to the input image ($\nabla_x J$), we created perturbations that are nearly invisible to the human eye but catastrophic for the model. The success of the **Fast Gradient Sign Method (FGSM)** is consistent with the hypothesis that models behave largely linearly in high-dimensional input spaces, which allows simple attacks to succeed easily.
2.  **Control over predictions:** We moved beyond simple errors to **Targeted Attacks**. By optimizing the input to minimize the loss for a specific target class, we forced the model to hallucinate specific objects (e.g., turning a cat into a specific dog breed or inanimate object), even if the visual semantics did not change for a human observer.
3.  **Contextual Bias:** In the final task, we saw that the model does not look at objects in isolation. By masking the attack, we learned that changing only the **background** or only the **foreground** can sometimes be enough to fool the network, revealing that CNNs often rely on contextual correlations rather than just object features.

### Food for Thought: Defense

If an attacker can manipulate a self-driving car's vision or a medical diagnosis system using invisible noise, how do we trust these systems?

This field of study leads naturally to **Adversarial Defense**, which involves:
* **Adversarial Training:** Including adversarial examples in the training set so the model learns to ignore small perturbations.
* **Gradient masking:** Techniques to hide the gradient information from attackers.
* **Robust architectures:** Designing networks that are mathematically guaranteed to be stable within a certain radius $\epsilon$.