<a href="https://colab.research.google.com/github/oaramoon/AML-Course/blob/main/Lab_1_Hands_on_With_Evasion_Attacks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 1: Hands-on With Evasion Attacks**

 In this lab, we delve into the world of adversarial machine learning by exploring evasion attacks, also known as adversarial example attacks. Our journey will encompass hands-on experiments and insights into how these attacks operate and their implications on machine learning models. Here is what we are going to cover:

**1. Getting Familiar with the Target Model: ResNet50 Classifier**
   - We will deploy a pre-trained ResNet50 classifier, a powerful and widely-used neural network, as our test subject. This model, proficient in identifying various objects from the extensive ImageNet dataset, serves as an ideal candidate for understanding evasion attacks.

**2. Understanding Resilience to Random Perturbations**
   - Our initial exploration begins with a key question: Does random noise always trick a neural network? We will experiment by introducing random perturbations to the input images and observe the ResNet50 classifier's performance. This exercise aims to demonstrate the innate resilience of neural networks against such random modifications.

**3. White-Box Evasion Attacks**
   - Moving deeper, we'll dissect white-box evasion attacks, specifically focusing on the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) techniques. Here, the 'white-box' context implies complete access to the model's architecture and parameters.
   - Our goal is to demystify the complex mathematical formulas presented in lectures and translate them into executable code. By doing so, we'll gain a practical understanding of how targeted and untargeted attacks are implemented and their effects on the classifier.

**4. Black-Box Threat Model**
   - The final segment of our lab introduces the black-box threat model, where the attacker has no internal knowledge of the model but can only make queries and observe predictions.
   - We'll apply gradient estimation methods to execute attacks on black-box models, illustrating a more realistic attack scenario where the adversary's knowledge is limited.
   - Additionally, we will investigate the concept of adversarial example transferability - the phenomenon where adversarial examples crafted for one model can also deceive another model.
   

By the end of this lab, you will have a comprehensive understanding of how evasion attacks function, their impact on machine learning models, and the differences between white-box and black-box attack strategies. This knowledge is crucial for developing more robust and secure AI systems in the future.

## **Part 1: Getting Familiar with the Target Model: ResNet50 Classifier**

In this part of the lab, we will lay the foundation for our experiments with evasion attacks. We'll do this by setting up a pre-trained classifier and preparing our test image.

### **Setup**

Importing the packages we need for this lab.

In [None]:
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, decode_predictions
from tensorflow.keras.applications.resnet50 import preprocess_input as resnet50_preprocess_input
from tensorflow.keras.preprocessing import image
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import time
import random
import os

### **Loading Target Classifier**

The next step is to acquire a powerful neural network model that we will attempt to attack. For this purpose, we'll use TensorFlow, a popular deep learning library, to download a pre-trained ResNet50 classifier. This model has been trained on the ImageNet dataset, making it proficient in recognizing a wide array of objects. Running the code snippet below downloads the ResNet50 model directly from TensorFlow's selection of pre-trained models and loads it into our Python environment.

In [None]:
target_classifier = ResNet50(weights='imagenet')
target_classifier.summary()

### **Loading and Viewing Target Image**

To conduct an evasion attack, we need an image to work with. This image will serve as the 'target' which we will attempt to manipulate.
We will download a specific image that we want to use for our experiment.
Once downloaded, we'll display this image in our environment, allowing us to visually inspect the original unaltered state of our target.

In [None]:
def load_image(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    img = image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    return img


# Download image
!wget https://raw.githubusercontent.com/oaramoon/AML-Course/main/labrador_retriever.png -O labrador_retriever.png

img_path = './labrador_retriever.png'  # Replace with your image path
original_img = load_image(img_path)


# Display the original image
def display_image_w_prediction(img, model,preprocess_input):
    fig, ax = plt.subplots(1, 1, figsize=(7, 7))
    ax.imshow(np.squeeze(img).astype("uint8"))

    pred = model.predict(preprocess_input(img.copy()),verbose=0)
    (_,label, conf) = decode_predictions(pred, top=1)[0][0]

    ax.set_title(f'Predicted Label: "{label.replace("_"," ")}" w/ {100*conf:.2f}% confidence')
    ax.axis('off')

display_image_w_prediction(img=original_img, model=target_classifier, preprocess_input=resnet50_preprocess_input)


Does the target image look familiar?

## **Part 2: Understanding Resilience to Random Perturbations**

The crux of an evasion attack lies in subtly altering the input image in a way that causes the model to misclassify it. This alteration is known as 'adversarial noise'. One (or people who were sleeping during Lecture 3) might think that adversarial noise can be found through a trial-and-error process: randomly sample a perturbation within a permitted range, test whether it fools the classifier, and repeat until an effective perturbation is discovered.

Let's implement this approach! Make 1000 queries to the model, each with random noise superimposed on the target image, and record how many times you succeed in fooling the model.

**Implementation Steps:**

- **Noise Generation**: For each query (out of 1000 total), generate a fresh random noise pattern. The noise must stay within a permissible range, which we define by an L_infinity norm of 3: no pixel value in the image may change by more than 3 units in either direction (positive or negative). This constraint keeps the perturbation subtle so that it does not drastically alter the overall appearance of the image. Note that pixel values are in [0,255].

- **Superimposing Noise:** Apply the generated noise to the target image, creating a potentially adversarial example.

- **Model Querying:** Feed this modified image to our pre-trained ResNet50 classifier and record the model's prediction.

- **Success Evaluation:** Determine if the noise led to a misclassification. If the model's prediction differs from the original, unaltered image's classification, count it as a success.

Keep track of the number of successful misclassifications out of the 1000 attempts. This will give us a quantitative measure of how effective this brute-force method is in crafting adversarial examples.
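The four steps above can be sketched on a toy problem before you apply them to ResNet50. The snippet below is a minimal illustration, not the lab solution: `random_noise_success_rate` and `toy_predict_label` are hypothetical names, and the stand-in classifier simply thresholds mean brightness. For the real exercise, `predict_label` would instead wrap the model query, for example by taking `np.argmax` of `model.predict(preprocess_input(img.copy()), verbose=0)`.

```python
import numpy as np

def random_noise_success_rate(predict_label, img, num_attempts=1000, l_inf=3, seed=0):
    """Return the fraction of random perturbations (each bounded by l_inf) that change the label."""
    rng = np.random.default_rng(seed)
    clean_label = predict_label(img)  # label of the unaltered image
    num_fooled = 0
    for _ in range(num_attempts):
        # Noise generation: every pixel moves by at most l_inf in either direction
        noise = rng.uniform(-l_inf, l_inf, size=img.shape)
        # Superimposing noise: keep the result a valid image in [0, 255]
        noisy_img = np.clip(img + noise, 0, 255)
        # Model querying and success evaluation
        if predict_label(noisy_img) != clean_label:
            num_fooled += 1
    return num_fooled / num_attempts

def toy_predict_label(img):
    """Hypothetical stand-in classifier: class 1 if the image is bright on average, else class 0."""
    return int(img.mean() > 100.0)

toy_img = np.full((1, 8, 8, 3), 128.0)  # uniform gray image, far from the brightness threshold
rate = random_noise_success_rate(toy_predict_label, toy_img)
print(f"{100 * rate:.2f}% of queries resulted in misclassification.")
```

On this toy image the rate is zero by construction, because noise bounded by 3 cannot move the mean brightness from 128 to below 100. Whether ResNet50 is similarly resilient is what the exercise asks you to measure.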


In [None]:
def adv_noise_by_trial_and_error(model=target_classifier, preprocess_input=resnet50_preprocess_input, num_attempts=1000, l_inf=3): # this function is supposed to return attack_success_rate
  num_fooled = 0

  ## WRITE YOUR CODE BELOW
  #....
  #
  #return num_fooled/num_attempts

attack_success_rate = adv_noise_by_trial_and_error()
print(f"{100*attack_success_rate:.2f}% of queries resulted in misclassification.")

## **Part 3: White-box Evasion Attacks**

Having experienced the limitations of a trial-and-error approach in generating adversarial examples, we now shift our focus to a more strategic and informed method. In this part of the lab, we will delve into the realm of white-box attacks, where the adversary is assumed to have complete knowledge of the machine learning model.

Our first goal is to review the implementation of an untargeted FGSM (Fast Gradient Sign Method) attack, a technique we discussed in our lectures. Despite its effectiveness, you might be surprised to discover how straightforward it is to implement FGSM; it requires only a few lines of code.

**Implementation Steps:**
- **Gradient Calculation**: In the first code block, function `get_loss_gradient_wrt_input` implements calculating the gradient of the loss with respect to the input image. This step is crucial as it determines the direction in which we will perturb our image.
- **Creating the Adversarial Example**: In the first code block, function `untargeted_fgsm_attack` applies the FGSM method using the calculated gradient. We will adjust the image pixels slightly in the direction of the gradient's sign, staying within our predefined L infinity norm constraint of 3.
- **Visualizing the Result**: In the second code block, after generating the adversarial image, we will display it alongside the original image for comparison. This visual representation will help you understand the subtlety of the changes made and their impact on the classifier's prediction.
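
To make the update rule concrete without loading ResNet50, here is a small NumPy sketch of the same idea, `x_adv = clip(x + epsilon * sign(grad))`, on a toy linear softmax classifier whose input gradient can be written by hand. The model, its random weights, and the helper names (`cross_entropy_and_grad`, `fgsm_step`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy_and_grad(W, b, x, y):
    """Cross-entropy loss of a linear softmax model and its gradient with respect to x."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    loss = -np.log(p[y])
    grad_x = W.T @ (p - onehot)  # analytic input gradient for this toy model
    return loss, grad_x

def fgsm_step(W, b, x, y, epsilon):
    """One untargeted FGSM step: move each input value by epsilon in the direction that raises the loss."""
    _, grad = cross_entropy_and_grad(W, b, x, y)
    return np.clip(x + epsilon * np.sign(grad), 0.0, 255.0)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.005, size=(3, 12))  # toy model: 3 classes, 12 "pixels"
b = np.zeros(3)
x = rng.uniform(50, 200, size=12)          # toy "image" with values inside [0, 255]
y = int(np.argmax(W @ x + b))              # treat the clean prediction as the true label

x_adv = fgsm_step(W, b, x, y, epsilon=3)
loss_clean, _ = cross_entropy_and_grad(W, b, x, y)
loss_adv, _ = cross_entropy_and_grad(W, b, x_adv, y)
print(f"max |x_adv - x| = {np.max(np.abs(x_adv - x)):.1f}")
print(f"loss before: {loss_clean:.4f}, loss after: {loss_adv:.4f}")
```

The perturbation never exceeds epsilon in any coordinate, and the loss on the true label goes up. A higher loss does not by itself guarantee a changed prediction, which is why the lab checks the classifier's output after the attack.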

### **Untargeted FGSM** **Attack**

In [None]:
import tensorflow as tf

def get_loss_gradient_wrt_input(x, y, target_model, preprocess_input=None):
    # Define the loss function (Categorical Crossentropy for multi-class classification)
    training_loss = tf.keras.losses.CategoricalCrossentropy()

    # Use TensorFlow's GradientTape to record operations for automatic differentiation
    with tf.GradientTape() as tape:
        tape.watch(x)  # Ensure the tape is watching the input tensor 'x'
        # If a preprocessing function is given, apply it inside the tape so the gradient is taken
        # with respect to the raw pixels. ResNet50/VGG16 preprocessing reorders channels (RGB -> BGR),
        # so a gradient taken with respect to the preprocessed input would have its channels swapped.
        model_input = x if preprocess_input is None else preprocess_input(x)
        prediction = target_model(model_input)  # Obtain the model's prediction
        loss = training_loss(tf.one_hot([y], 1000), prediction)  # Calculate the loss between the true label 'y' and prediction

    gradient = tape.gradient(loss, x)  # Compute the gradient of the loss with respect to the input image 'x'
    return gradient  # Return the gradient

def untargeted_fgsm_attack(target_model, img, img_label, epsilon, preprocess_input):
    img_tensor = tf.convert_to_tensor(img)  # Convert the image to a TensorFlow tensor

    # Obtain the gradient of the loss with respect to the raw (unpreprocessed) input image
    grads = get_loss_gradient_wrt_input(target_model=target_model, x=img_tensor, y=img_label, preprocess_input=preprocess_input)

    sign_grads = tf.sign(grads)  # Get the sign of the gradient (to determine the direction of the perturbation)

    # Create the perturbed image by adding the scaled gradient sign (epsilon * sign_grads) to the original image
    perturbed_image = img_tensor + epsilon * sign_grads

    # Ensure that the pixel values of the perturbed image remain within valid range (0-255)
    perturbed_image = tf.clip_by_value(perturbed_image, 0, 255)

    return perturbed_image.numpy()  # Convert the tensor back to a numpy array and return


# Run the attack on the raw (unpreprocessed) image; the result is returned as a NumPy array
untargeted_fgsm_adv_img = untargeted_fgsm_attack(target_model=target_classifier, img=original_img, img_label=208, epsilon=3, preprocess_input=resnet50_preprocess_input)  # 208 is the class index for 'Labrador Retriever'

In [None]:
def display_adversarial_and_original_images(original, adversarial, model, preprocess_input):
    fig, axes = plt.subplots(1, 3, figsize=(30, 10))  # Adjusted for three subplots

    # Original and adversarial predictions
    original_pred = model.predict(preprocess_input(original.copy()),verbose=0)
    adversarial_pred = model.predict(preprocess_input(adversarial.copy()),verbose=0)
    (_, orig_label, org_conf) = decode_predictions(original_pred, top=1)[0][0]
    (_, adv_label, adv_conf) = decode_predictions(adversarial_pred, top=1)[0][0]

    # Display original image
    axes[0].imshow(np.squeeze(original.astype('uint8')))
    axes[0].set_title(f'Original: "{orig_label.replace("_"," ")}" ({100*org_conf:.2f}%)')
    axes[0].axis('off')

    # Display adversarial image
    axes[1].imshow(np.squeeze(adversarial.astype('uint8')))
    axes[1].set_title(f'Adversarial: "{adv_label.replace("_"," ")}" ({100*adv_conf:.2f}%)')
    axes[1].axis('off')

    # Calculate and display the perturbation
    perturbation = adversarial - original
    perturbation = (perturbation - perturbation.min()) / (perturbation.max() - perturbation.min())  # Normalize
    axes[2].imshow(np.squeeze(perturbation).astype('float32'))
    axes[2].set_title('Adversarial Perturbation')
    axes[2].axis('off')

    plt.show()


display_adversarial_and_original_images(original=original_img, adversarial=untargeted_fgsm_adv_img, model=target_classifier, preprocess_input=resnet50_preprocess_input)

### **Targeted FGSM Attack**

Now that you've gained hands-on experience with an untargeted FGSM attack, it's time to take it a step further. In this challenge, you are tasked with implementing a targeted version of the FGSM attack. Your goal is to manipulate the input image in such a way that the pre-trained ResNet50 classifier is not just fooled into making any wrong prediction, but specifically misclassifies the image as a 'Tree Frog' (label index 31 in the ImageNet dataset).

After implementing these changes, test your attack on the same image used previously. Observe if the classifier now misclassifies it as a 'Tree Frog'.

In [None]:
def targeted_fgsm_attack(target_model, img, target_label, epsilon, preprocess_input):
    ## WRITE YOUR CODE BELOW
    #....
    #
    #return perturbed_image.numpy()
    raise NotImplementedError("Implement the targeted FGSM attack, then remove this line.")


targeted_fgsm_adv_img = targeted_fgsm_attack(target_model=target_classifier, img=original_img, target_label=31, epsilon=3, preprocess_input=resnet50_preprocess_input) # 31 is the class index for 'Tree Frog'
display_adversarial_and_original_images(original=original_img, adversarial=targeted_fgsm_adv_img, model=target_classifier, preprocess_input=resnet50_preprocess_input)

If your targeted FGSM attack didn't succeed in fooling the classifier into misclassifying the image as a 'Tree Frog', that is expected.
The FGSM, while powerful, has certain limitations, especially in the context of targeted attacks. FGSM is known for its simplicity and efficiency in generating adversarial examples quickly. However, this simplicity can be a drawback when precision and specific target misclassifications are required. Targeted attacks generally demand a more refined approach than what FGSM offers. FGSM makes a single, large step in the direction of the gradient sign, which might not be subtle or precise enough to fool a model into misclassifying an image as a specific class, especially when the target class is significantly different from the true class.

To increase the success rate of targeted attacks, one might consider iterative methods like the Projected Gradient Descent (PGD), which takes smaller steps towards the target but iterates multiple times. This approach can offer a more refined control over the adversarial example generation.

### **Targeted PGD Attack**

Having explored the Fast Gradient Sign Method (FGSM) and its limitations in targeted attacks, we now turn our attention to a more sophisticated technique: the Projected Gradient Descent (PGD) attack. PGD is known for crafting stronger adversarial examples, especially in targeted scenarios.

Your task is to implement a targeted PGD attack. The goal remains the same as with the FGSM challenge: to manipulate an input image so that our pre-trained model misclassifies it as a 'Tree Frog' (label index 31 in the ImageNet dataset).
As before, the permissible range for perturbation (defined by the L infinity norm) will be the same.

**Implementation Guidance:**

- Start with the Original Image: Initialize your attack with the original image you used in the FGSM experiment.
- Iterative Perturbation: In each iteration, apply a small perturbation in the direction that increases the likelihood of the image being classified as the target class (Tree Frog).
- Apply Projection: After each perturbation, use a projection function to ensure that the adversarial example stays within the permissible range.
- Ensure that the pixel values of the perturbed image remain within the valid range (0-255).
- Repeat: Continue this process for a set number of iterations or until the image is successfully classified as a Tree Frog.


After completing the iterations, evaluate whether the model now misclassifies the altered image as a 'Tree Frog'.
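
The loop described in the guidance can be sketched in NumPy on a tiny linear softmax model before you write the TensorFlow version. This is an illustration under stated assumptions, not the lab solution: the two-class model, its hand-picked weights, the step size of 0.5, and the function names are all hypothetical. The point is the order of operations inside the loop: signed gradient step toward the target, projection back into the epsilon-ball around the original, then clipping to valid pixel values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def target_loss_grad(W, b, x, target):
    """Gradient with respect to x of the cross-entropy between the prediction and the target class."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    return W.T @ (p - onehot)

def targeted_pgd(W, b, x_orig, target, epsilon=3.0, step_size=0.5, num_iter=50):
    x_adv = x_orig.copy()
    for _ in range(num_iter):
        grad = target_loss_grad(W, b, x_adv, target)
        # Iterative perturbation: descend the loss on the target class
        x_adv = x_adv - step_size * np.sign(grad)
        # Projection: stay within the L-infinity ball of radius epsilon around the original
        x_adv = np.clip(x_adv, x_orig - epsilon, x_orig + epsilon)
        # Keep pixel values in the valid range
        x_adv = np.clip(x_adv, 0.0, 255.0)
    return x_adv

# Toy two-class model with hand-picked weights (illustrative values only)
W = np.array([[0.010, 0.000, 0.020, 0.000],
              [0.000, 0.010, 0.000, 0.020]])
b = np.array([0.0, 0.2])
x = np.array([100.0, 150.0, 120.0, 80.0])

x_adv = targeted_pgd(W, b, x, target=1)
print("clean prediction:", int(np.argmax(W @ x + b)))
print("adversarial prediction:", int(np.argmax(W @ x_adv + b)))
print("max |x_adv - x|:", float(np.max(np.abs(x_adv - x))))
```

In this toy setting the clean input is assigned class 0 and the adversarial input class 1, while no coordinate moves by more than epsilon. With a deep network the gradient changes from step to step, which is where the many small steps pay off compared with FGSM's single large one.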

In [None]:
def targeted_pgd_attack(target_model, img, target_label, epsilon, preprocess_input, num_ittr=50):
    ## WRITE YOUR CODE BELOW
    #....
    #
    #return perturbed_image.numpy()
    raise NotImplementedError("Implement the targeted PGD attack, then remove this line.")


targeted_pgd_adv_img = targeted_pgd_attack(target_model=target_classifier, img=original_img, target_label=31, epsilon=3, preprocess_input=resnet50_preprocess_input)
display_adversarial_and_original_images(original=original_img, adversarial=targeted_pgd_adv_img, model=target_classifier, preprocess_input=resnet50_preprocess_input)


## **Part 4: Black-box Evasion Attacks**
In the real world, adversaries often do not have full access to a target model's internal parameters and architecture. This is known as a black-box attack scenario. Unlike white-box attacks, where the attacker has complete knowledge of the model, black-box attacks present unique challenges. In black-box settings, adversaries cannot compute the gradients of the adversarial loss directly by just running a few lines of code, since they lack access to the model's internal workings. This makes generating adversarial examples more complex and requires different strategies. Here we review two such strategies.


If the adversary has access to the confidence scores output by the target model for each prediction, they can use these scores to their advantage through **Zeroth-Order Optimization Methods**. These methods do not require explicit gradient computation. Instead, they estimate gradients based on the output scores provided by the model. This approach is particularly useful in black-box scenarios.

#### **Untargeted Black-box FGSM Attack via Zeroth-Order Methods**

 Zeroth-order methods approximate the gradient by observing changes in the model's output (confidence scores) in response to slight variations in the input. By systematically tweaking the input and noting the corresponding changes in output, these methods can infer the direction in which to modify the input to induce misclassification. This involves running a series of queries to the target model with slightly altered inputs and analyzing the resulting confidence scores. Based on these observations, the adversary can estimate the gradient and use it to craft an adversarial example.
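
As a small worked illustration of this idea, the sketch below estimates a gradient purely from function evaluations using a two-sided (central) difference, and checks it against a case where the true gradient is known. The function `estimate_gradient_central` and the quadratic `toy_loss` are hypothetical stand-ins for a model's loss computed from its confidence scores.

```python
import numpy as np

def estimate_gradient_central(loss_fn, x, h=1e-3):
    """Estimate d(loss)/dx one coordinate at a time using only evaluations of loss_fn."""
    grad = np.zeros(x.shape, dtype=np.float64)
    num_queries = 0
    for idx in np.ndindex(*x.shape):
        x_plus = x.astype(np.float64)   # astype returns a fresh copy
        x_minus = x.astype(np.float64)
        x_plus[idx] += h                # nudge one coordinate up
        x_minus[idx] -= h               # and the same coordinate down
        grad[idx] = (loss_fn(x_plus) - loss_fn(x_minus)) / (2 * h)
        num_queries += 2                # two "model queries" per coordinate
    return grad, num_queries

def toy_loss(x):
    """Stand-in for a black-box loss; its true gradient is 2 * x."""
    return float(np.sum(x ** 2))

x = np.array([[1.0, -2.0], [0.5, 3.0]])
grad_est, num_queries = estimate_gradient_central(toy_loss, x)
print("estimated gradient:\n", grad_est)
print("true gradient:\n", 2 * x)
print("queries used:", num_queries)
```

The query count is what makes this approach expensive: a 224x224x3 image has 150,528 input values, so a single two-sided gradient estimate needs 301,056 model queries. The choice of `h` also matters in practice, since a step close to the floating-point resolution of the pixel values lets rounding error dominate the difference.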

In the next code block, we'll demonstrate an implementation of an untargeted black-box FGSM attack using a zeroth-order estimation method, namely the **difference quotient** method, which we all learned in high school. Function `estimate_gradient` estimates the gradient along a specific pixel (denoted by the `index` variable), and `fgsm_attack_with_estimated_gradients` implements the black-box FGSM attack.

Before running the code block, it's important to set realistic expectations regarding execution time. Due to the iterative and computationally demanding nature of zeroth-order methods, especially when applied to high-resolution images, the processing time can be substantial. This significant time requirement is one of the drawbacks of using zeroth-order methods in adversarial attacks. This exercise serves as an important lesson in the practical challenges adversaries face in real-world scenarios. It underscores the need for balancing precision and computational efficiency in adversarial strategies.

In [None]:
def estimate_gradient(x, y, index, target_model, preprocess_input, h=1e-4):
    # Categorical Crossentropy loss for multi-class classification
    training_loss = tf.keras.losses.CategoricalCrossentropy()

    x_tensor = tf.convert_to_tensor(x)  # Convert the image to a TensorFlow tensor

    # Save the original value at the specified index (pixel)
    orig_val = x_tensor[index]

    # Perturb x at the specified index by a small value +h and calculate the loss
    x_plus_h = tf.tensor_scatter_nd_update(x_tensor, [index], [orig_val + h])
    loss_plus_h = training_loss(tf.one_hot([y], 1000), target_model(preprocess_input(x_plus_h)))

    # Perturb x at the specified index by a small value -h and calculate the loss
    x_minus_h = tf.tensor_scatter_nd_update(x_tensor, [index], [orig_val - h])
    loss_minus_h = training_loss(tf.one_hot([y], 1000), target_model(preprocess_input(x_minus_h)))

    # Estimate the gradient using the central difference quotient formula
    gradient = (loss_plus_h - loss_minus_h) / (2 * h)
    return gradient  # Return the estimated gradient

def fgsm_attack_with_estimated_gradients(target_model, img, img_label, epsilon, preprocess_input, h=1e-4):
    start_time = time.time()  # Record the start time of the attack
    img_tensor = tf.convert_to_tensor(img)  # Convert the image to a TensorFlow tensor
    perturbed_image = tf.identity(img_tensor)  # Create a copy of the image tensor

    num_pixels_covered = 0  # Track the number of pixels processed

    # Iterate over each pixel in the image to estimate the gradient
    for i in (range(img_tensor.shape[1])):  # Iterating over height
        for j in (range(img_tensor.shape[2])):  # Iterating over width
            for c in (range(img_tensor.shape[3])):  # Iterating over channels
                index = [0, i, j, c]  # Create an index for the current pixel

                # Estimate the gradient at the current pixel
                grad_est = estimate_gradient(x=img_tensor, y=img_label, index=index, target_model=target_model, h=h, preprocess_input=preprocess_input)
                sign_grad = tf.sign(grad_est)  # Get the sign of the estimated gradient

                # Update the perturbed image at the current pixel with the calculated gradient
                perturbed_image_val = perturbed_image[index] + epsilon * sign_grad
                perturbed_image = tf.tensor_scatter_nd_update(perturbed_image, [index], [perturbed_image_val])

                # Calculate and print the runtime and progress
                duration = time.time() - start_time
                num_pixels_covered += 1
                print(f"Runtime: {int(duration // 3600):02}:{int((duration % 3600) // 60):02}:{int(duration % 60):02}, Progress: {(num_pixels_covered*100)/(img_tensor.shape[1]*img_tensor.shape[2]*img_tensor.shape[3]):.3f}%")

    # Ensure the pixel values of the perturbed image remain within the valid range (0-255)
    perturbed_image = tf.clip_by_value(perturbed_image, 0, 255)
    return perturbed_image.numpy()  # Convert the tensor back to a numpy array and return



black_box_untargeted_fgsm_adv_img = fgsm_attack_with_estimated_gradients(target_model=target_classifier, img=original_img, img_label=208, epsilon=1, preprocess_input=resnet50_preprocess_input)  # 208 is the class index for 'Labrador Retriever'
display_adversarial_and_original_images(original=original_img, adversarial=black_box_untargeted_fgsm_adv_img, model=target_classifier, preprocess_input=resnet50_preprocess_input)

#### **Transferability-based Black-box Attacks**


Another effective strategy for mounting black-box evasion attacks is to leverage the concept of 'transferability'. This approach relies on the phenomenon where adversarial examples crafted for one model can also be effective against another model.

The core idea behind transferability is to use a surrogate model — a different model than the actual target, but trained for the same task. The adversary generates adversarial examples using this surrogate model, and then tests these examples on the actual target model. The rationale behind this strategy is that many machine learning models, despite being architecturally different, often learn similar decision boundaries for a given task. Therefore, an example that is adversarial for one model has a reasonable chance of being adversarial for another.
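
The surrogate workflow can be illustrated with two toy linear classifiers whose weights are similar but not identical. Everything in this sketch is an assumption chosen for illustration: the hand-picked weights, the two-class setup, and the helper names. The adversarial input is crafted using only the surrogate's gradient and is then evaluated on the target, whose parameters the attack never touches.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(W, b, x):
    return int(np.argmax(W @ x + b))

def untargeted_fgsm(W, b, x, y, epsilon):
    """White-box FGSM step against the model (W, b): raise the loss on the true label y."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad = W.T @ (p - onehot)
    return np.clip(x + epsilon * np.sign(grad), 0.0, 255.0)

# Target model: only queried for predictions in this sketch
W_target = np.array([[0.010, 0.000, 0.020, 0.000],
                     [0.000, 0.010, 0.000, 0.020]])
b_target = np.array([0.0, 0.2])

# Surrogate model: different weights, but a similar decision boundary (by construction)
W_surrogate = np.array([[0.012, 0.001, 0.018, 0.000],
                        [0.001, 0.009, 0.000, 0.021]])
b_surrogate = np.array([0.0, 0.25])

x = np.array([100.0, 150.0, 120.0, 80.0])
y = predict(W_target, b_target, x)  # clean label, obtained by querying the target

# Craft the adversarial example with the surrogate's gradient only
x_adv = untargeted_fgsm(W_surrogate, b_surrogate, x, y, epsilon=3.0)

print("target, clean:", predict(W_target, b_target, x))
print("target, adversarial:", predict(W_target, b_target, x_adv))
print("surrogate, adversarial:", predict(W_surrogate, b_surrogate, x_adv))
```

Here the example transfers because the two toy models were built to agree on which input directions matter. Real models offer no such guarantee, so transfer between networks like VGG16 and ResNet50 has to be checked empirically.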

In this part of the lab, we delve into the practical aspects of transferability in black-box attacks. We'll demonstrate how an adversarial example, crafted using one model, can potentially deceive another, seemingly unrelated model.


For our exercise, we'll use the VGG16 model, a well-known convolutional neural network architecture. VGG16, like our target model ResNet50, is trained on the ImageNet dataset, making it a suitable choice for our surrogate. We'll obtain the VGG16 model directly from TensorFlow's library of pre-trained models.

With the surrogate model in hand, we'll employ our untargeted white-box FGSM attack method. This involves creating an input that is purposely modified to cause the VGG16 model to misclassify it.

Then we will see if the adversarial example, which is misleading for VGG16, will also fool the ResNet50 model.

In [None]:
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input as vgg16_preprocess_input

surrogate_classifier = VGG16(weights='imagenet')

untargeted_transfer_fgsm_adv_img = untargeted_fgsm_attack(target_model=surrogate_classifier, img=original_img, img_label=208, epsilon=3, preprocess_input=vgg16_preprocess_input)  # 208 is the class index for 'Labrador Retriever'
display_adversarial_and_original_images(original=original_img, adversarial=untargeted_transfer_fgsm_adv_img, model=target_classifier, preprocess_input=resnet50_preprocess_input)

Having explored transferability with untargeted FGSM attacks, we now raise the bar. Your task is to repeat the process, but this time using a more sophisticated method - the targeted Projected Gradient Descent (PGD) attack.
Pay close attention to whether the target model misclassifies the image, and if so, whether it aligns with the target class specified in your PGD attack.



In [None]:
## WRITE YOUR CODE BELOW
  #....
  #

Did the target label successfully transfer to the target classifier? Transferring targeted adversarial examples between models is inherently more challenging than transferring untargeted ones. This is because targeted attacks require manipulating the model's output to a specific, often less probable class, which might not align with the decision boundaries of a different model.

Learning Point: This exercise will help you understand the nuances and limitations of adversarial example transferability, especially in the context of targeted attacks. It highlights the complex interplay between the specificity of the attack and the generalizability of its effectiveness across different models.