<a href="https://colab.research.google.com/github/oaramoon/AML/blob/main/Lab_2_Defenses_Against_Evasion_Attacks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 2: Defenses Against Evasion Attacks**

Welcome to our hands-on lab, where you will assess the robustness of ML models against evasion attacks. This session extends Lab 1, where we developed a Projected Gradient Descent (PGD) attack. We start by examining seemingly "effective" defense techniques (note the quotation marks around "effective"). These methods include JPEG compression, random model selection, etc. You'll investigate how these strategies enhance adversarial robustness against standard evasion attacks, like the PGD attack.

In the second segment, our focus pivots to the critical role of adaptive attacks in evaluating these empirical defenses. Relying solely on standard, off-the-shelf attacks may yield a misleading sense of security. We will dissect two advanced adaptive attack techniques: the Expectation over Transformation (EOT) attack and the Backward Pass Differentiable Approximation (BPDA) attack. Through practical exercises, you will apply these methods to critically assess the true resilience of these defense strategies. This investigation is crucial in understanding the intricacies of defending machine learning models against evolving adversarial tactics.

We conclude our lab with a comprehensive analysis of adversarially trained models. Often celebrated for their resilience to evasion attacks, these models, however, have a propensity to overfit to specific types of threats. Through interactive discussions and experiments, you will explore the multifaceted challenges in achieving genuine robustness in adversarial machine learning contexts. This final section aims to provide you with an in-depth understanding of the strengths and limitations of current defense methods. It's designed to foster a well-rounded perspective on the ongoing endeavors to enhance the security and reliability of machine learning systems in the face of sophisticated adversarial threats.

## **Part 1: Setting up the Environment**

In this part of the lab, we will establish the groundwork for our experiments. To begin, we will download the lab materials from the course repository and then set up the environment necessary for our experiments. We will then load our unprotected target model, which we refer to as the baseline classifier. This model is a convolutional neural network trained for the task of classifying objects in the CIFAR-10 dataset.

Following this setup, we will utilize the PGD (Projected Gradient Descent) attack, which we developed in Lab 1, to evaluate the security of our baseline classifier. This exercise provides an understanding of the model's level of vulnerability to evasion (or adversarial example) attacks.

### **Downloading the Lab Material**


In [None]:
!git clone https://github.com/oaramoon/AML.git

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from tqdm import tqdm
import time
import random
from AML.TST196 import load_cifar10_model

### **Getting Familiar with CIFAR10 Dataset**
The CIFAR-10 dataset is a widely-used benchmark dataset in the field of machine learning, particularly for image classification tasks. For those not familiar with it, let's provide a brief introduction.

The CIFAR-10 dataset consists of 60,000 color images, each of 32x32 pixels, spread across 10 classes, with 6,000 images per class. It is split into a standard configuration of 50,000 training images and 10,000 test images. Each image is to be classified into one of the 10 different classes, which represent different objects or animals: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

In the following cell, we will load the CIFAR-10 dataset from Keras datasets and visualize some sample images from the training set.

In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
num_classes = 10

x_train = x_train.astype("float32") /255.0
x_test = x_test.astype("float32") /255.0
input_shape = x_train[0].shape

print(f"There are {x_train.shape[0]} training and {x_test.shape[0]} test samples in CIFAR10 dataset.")
print(f"Each sample is of dimension {input_shape}")

# CIFAR-10 dataset labels
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Select 100 random images from the training set
indices = tf.random.uniform([100], 0, len(x_train), dtype=tf.int32)
sample_images = tf.gather(x_train, indices)
sample_labels = tf.gather(y_train, indices)

# Plot the selected images
plt.figure(figsize=(15,15))
for i in range(100):
    plt.subplot(10,10,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(sample_images[i])
    plt.title(class_names[sample_labels[i][0]])
plt.tight_layout()
plt.show()

### **Loading the Baseline Classifier**

In [None]:
baseline_classifier = load_cifar10_model(model_path="./AML/baseline_cifar10")

### **Untargeted PGD Attack (From Lab 1)**
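
As a reminder, the update performed by `untargeted_pgd_attack` in the cell below can be summarized as follows. The notation is introduced here for illustration only: $x^{(0)}$ is the clean image, $x^{(t)}$ the perturbed image after $t$ iterations, $L$ the cross-entropy loss, $f$ the (preprocessed) target model, $N$ the number of iterations (`num_ittr`), and $\alpha$ the step size.

$$
\begin{aligned}
\alpha &= 1.25 \cdot \frac{\epsilon}{N} \\
\tilde{x}^{(t+1)} &= x^{(t)} + \alpha \cdot \operatorname{sign}\!\left(\nabla_x L\big(f(x^{(t)}), y\big)\right) \\
x^{(t+1)} &= x^{(0)} + \operatorname{clip}\!\left(\tilde{x}^{(t+1)} - x^{(0)},\, -\epsilon,\, \epsilon\right)
\end{aligned}
$$

After the last iteration, the result is clipped once to the valid pixel range $[0, 1]$. Because the attack is untargeted, each step moves in the direction that increases the loss on the true label $y$.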

In [None]:
def get_loss_gradient_wrt_input(x, y, target_model, preprocess_input, num_classes):
    # Define the training loss using categorical crossentropy
    training_loss = tf.keras.losses.CategoricalCrossentropy()

    # Use GradientTape for automatic differentiation
    with tf.GradientTape() as tape:
        tape.watch(x)  # Instruct tape to watch the input tensor 'x'

        # Preprocess the input and then make predictions using the target model
        prediction = target_model(preprocess_input(x))

        # Compute the loss between the true labels and predictions
        loss = training_loss(tf.one_hot(y, num_classes), prediction)

    # Compute the gradient of the loss with respect to the input
    gradient = tape.gradient(loss, x)
    return gradient

def untargeted_pgd_attack(target_model, preprocess_input, img_batch, img_labels, epsilon, num_ittr, num_classes):
    # Convert image batch to tensors and create a copy for original images
    perturbed_images = tf.convert_to_tensor(img_batch)
    original_images = tf.identity(perturbed_images)

    # Define the step size for the perturbation
    step = (epsilon / num_ittr) * 1.25

    # Iteratively apply perturbations
    for _ in tqdm(range(num_ittr)):
        # Calculate gradients of loss w.r.t. the perturbed images
        grads = get_loss_gradient_wrt_input(target_model=target_model, x=perturbed_images, y=img_labels, preprocess_input=preprocess_input, num_classes=num_classes)

        # Replace any NaNs in gradients with zeros
        grads = tf.where(tf.math.is_nan(grads), tf.zeros_like(grads), grads)

        # Compute the sign of the gradients
        sign_grads = tf.sign(grads)

        # Update perturbed images by a step in the direction of the sign of gradients
        perturbed_images = perturbed_images + step * sign_grads

        # Ensure that the perturbations stay within the epsilon bounds
        perturbation = tf.clip_by_value(perturbed_images - original_images, -epsilon, epsilon)
        perturbed_images = original_images + perturbation

    # Clip the perturbed images to be within valid image range [0, 1]
    perturbed_images = tf.clip_by_value(perturbed_images, 0.0, 1.0)

    # Return the perturbed images as a numpy array
    return perturbed_images.numpy()


### **Target Images**

To assess the robustness of the baseline classifier, we will utilize the first 100 samples from the CIFAR-10 test set. Why only 100 samples? This limitation is due to the lack of access to GPUs on Google Colab. In real-world evaluations, a larger sample size is often necessary. However, for the purposes of this lab, we will work within these constraints.
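
To get a rough feel for what a 100-sample evaluation can and cannot tell us, the sketch below treats each sample as an independent success/failure trial and computes the standard error of the measured success rate. This is a back-of-the-envelope illustration, not part of the original lab; the helper name `success_rate_standard_error` is hypothetical, and the independence assumption is a simplification.

```python
import math

def success_rate_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

# Worst case for the standard error is p = 0.5.
for n in (100, 1000, 10000):
    se = success_rate_standard_error(0.5, n)
    print(f"n={n:>5}: standard error = {100 * se:.1f} points, "
          f"approx. 95% interval = +/- {100 * 1.96 * se:.1f} points")
```

Under these assumptions, differences of a few percentage points between two defenses measured on 100 samples should be read with caution.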

In [None]:
target_samples_x = tf.convert_to_tensor(x_test[:100,::],dtype=tf.float32)
target_samples_org_y = y_test[:100].flatten()

### **Baseline Classifier Vs Untargeted PGD Attack**
In the following code block, we will evaluate the effectiveness of our PGD attack, developed in Lab 1, against the baseline classifier. We'll measure its success rate using the 100 target samples that were selected earlier.

In [None]:
adv_imgs_for_baseline_model = untargeted_pgd_attack(target_model=baseline_classifier, preprocess_input=lambda x: x, img_batch=target_samples_x, img_labels=target_samples_org_y, epsilon=4.0/255.0, num_ittr=10, num_classes=num_classes)

def untargeted_attack_success_rate(model, preprocess_input, x, y):
    predictions = model(preprocess_input(x))
    return np.mean(np.argmax(predictions,axis=1) != y)

attack_attack_success_on_baseline =  untargeted_attack_success_rate(model=baseline_classifier, preprocess_input=lambda x: x, x=adv_imgs_for_baseline_model, y=target_samples_org_y)
print(f"Attack success rate on baseline model is {100*attack_attack_success_on_baseline:.2f}%.")

## **Part 2: Defenses**

In the second part of the lab, we'll revisit several defense mechanisms discussed in Lecture 4 of our course. Our objective will be to evaluate the success rate of our off-the-shelf PGD attack against these defenses.

### **Input Transformation: JPEG Compression**
One relatively straightforward defense mechanism in the realm of machine learning is to incorporate JPEG compression as an input preprocessing step. The rationale behind this approach is rooted in the nature of the JPEG compression algorithm itself. JPEG, primarily designed for efficient image storage by reducing file size, achieves this through a lossy compression technique. This process inherently filters out certain high-frequency components of the image, which are often imperceptible to the human eye. In the context of adversarial machine learning, this characteristic of JPEG compression can be leveraged to mitigate the impact of adversarial noise. The hypothesis is that during the compression process, some of the deliberately crafted perturbations, which are typically subtle and high-frequency, might be discarded along with other non-essential image details. Consequently, this preprocessing step has the potential to enhance the model's resilience by diminishing the effectiveness of adversarial inputs designed to mislead the model.



In [None]:
from PIL import Image
import io


def jpeg_compress_batch(input_imgs, quality=50):
    """
    Compresses a batch of images (either TensorFlow tensors or NumPy arrays) using JPEG compression
    and returns the compressed images as a TensorFlow tensor.

    Args:
    - input_imgs (tf.Tensor or np.ndarray): The batch of images with shape (batch_size, height, width, channels).
    - quality (int): The quality of the JPEG compression (1 to 100, higher means better quality and less compression).

    Returns:
    - tf.Tensor: The batch of compressed images as a TensorFlow tensor.
    """
    compressed_imgs = []

    # Check if the input is a TensorFlow tensor and convert to a NumPy array if so
    if tf.is_tensor(input_imgs):
        np_imgs = input_imgs.numpy()
    else:
        np_imgs = input_imgs

    for np_img in np_imgs:
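        # Ensure the image is in uint8 format; clip first so that out-of-range
        # values (e.g., from a not-yet-clipped perturbation) do not wrap around
        if np_img.dtype != np.uint8:
            np_img = np.clip(np.rint(np_img * 255.0), 0, 255).astype(np.uint8)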

        pil_img = Image.fromarray(np_img)

        # Save the image to a buffer with JPEG compression
        buffer = io.BytesIO()
        pil_img.save(buffer, format="JPEG", quality=quality)

        # Load the image back from the buffer
        compressed_img = Image.open(buffer)

        # Convert the PIL Image back to a NumPy array and add to the list
        compressed_imgs.append(np.array(compressed_img)/255.0)

    # Convert the list of compressed images back to a TensorFlow tensor
    return tf.convert_to_tensor(compressed_imgs,dtype=tf.float32)




In [None]:
try:
  adv_imgs_for_jpeg_defense = untargeted_pgd_attack(target_model=baseline_classifier, preprocess_input=jpeg_compress_batch, img_batch=target_samples_x, img_labels=target_samples_org_y, epsilon=4.0/255.0, num_ittr=10, num_classes=num_classes)

  attack_attack_success_on_transform_defense =  untargeted_attack_success_rate(model=baseline_classifier, preprocess_input=jpeg_compress_batch, x=adv_imgs_for_jpeg_defense, y=target_samples_org_y)
  print(f"Attack success rate on transform defense model is {100*attack_attack_success_on_transform_defense:.2f}%.")

except Exception as e:
  print(e,"\nCould not generate adversarial examples!\n")


#### **Discussion**
Why can't we directly generate adversarial examples for a system using JPEG compression? A partial explanation is that we use the PIL library for JPEG compression, and these operations aren't integrated into TensorFlow's computation graph, hence TensorFlow cannot compute the necessary gradients. But this is not the core issue. Even if we were to replicate all steps of JPEG compression using TensorFlow operations, we would still face challenges in computing the gradients required for creating adversarial examples. This is primarily due to certain steps in the JPEG compression process that are inherently non-differentiable, such as the quantization step and the discarding of high-frequency components following the Discrete Cosine Transform (DCT). These steps, which involve forms of thresholding, introduce non-differentiability. As discussed in our lectures, defenses like JPEG compression aim to enhance robustness against evasion attacks by disrupting the calculation of gradients with respect to the model's input. This disruption makes it difficult for attackers to craft adversarial examples using gradient-based methods, a common tactic in evasion attacks.
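
The effect of a quantization step on gradients can be seen in a small NumPy sketch. It is a toy stand-in, not real JPEG: `quantize` simply snaps values to a few evenly spaced levels, and both function names are hypothetical. A finite-difference estimate of the derivative shows that small input changes leave the output unchanged, so a gradient-based attacker gets no useful signal through this step.

```python
import numpy as np

def quantize(x, levels=8):
    """Toy stand-in for JPEG quantization: snap values in [0, 1] to evenly spaced levels."""
    return np.round(x * (levels - 1)) / (levels - 1)

def finite_difference(fn, x, h=1e-4):
    """Central finite-difference estimate of d fn / d x, computed elementwise."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

x = np.array([0.10, 0.33, 0.52, 0.80])
print("quantized:         ", quantize(x))
print("numerical gradient:", finite_difference(quantize, x))
```

The numerical gradient is zero away from the rounding boundaries and undefined at them, which is exactly the kind of behavior that breaks standard backpropagation through the defense.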

But what about the adversarial examples generated for the baseline model? Are they still effective? Run the code block below to see for yourself.

In [None]:
attack_attack_success_on_JPEG_defense =  untargeted_attack_success_rate(model=baseline_classifier, preprocess_input=jpeg_compress_batch, x=adv_imgs_for_baseline_model, y=target_samples_org_y)
print(f"Attack success rate of baseline adversarial examples on JPEG compression defense is {100*attack_attack_success_on_JPEG_defense:.2f}%.")

The success rate of adversarial examples generated against our baseline classifier drops drastically, from 92% in the unprotected scenario to just 49% when JPEG compression is applied. This suggests that even a straightforward method like JPEG compression can markedly enhance a model's robustness against a standard PGD attack.

### **Randomization**
Another category of defenses against evasion attacks, as discussed in Lecture 4, aims to introduce randomness into the inference process. This strategy is based on the idea that by randomizing the computed gradients used in generating adversarial examples, the system's robustness can be improved, potentially thwarting such attacks.

#### **Random Neuron Pruning**

An example of this technique is the Random Neuron Pruning defense. In this approach, during the inference phase, we don't use the exact network activations computed in the standard manner. Instead, a subset of neurons is randomly dropped. This pruning is biased such that neurons with smaller activations are more likely to be removed, while neurons with larger activations are more likely to be kept. The neurons that remain are then scaled up by the inverse of their keep probability, similar to the dropout technique, to maintain the overall integrity of the network's output.
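
The pruning rule can be illustrated on a tiny 1-D activation vector. The sketch below mirrors the formula used in the `RandomPruned` class, but it is a simplified NumPy toy: the lab's implementation operates on 4-D convolutional outputs, and the function name `prune_activations` is hypothetical.

```python
import numpy as np

def prune_activations(acts, rng):
    """Toy NumPy version of the pruning rule for a 1-D activation vector."""
    n = acts.size
    # Each neuron's share of the total activation magnitude
    p = np.abs(acts) / (np.abs(acts).sum() + 1e-8)
    # Keep probability grows with the neuron's share of the magnitude
    p_keep = np.clip(1.0 - np.exp(-n / 3.0 * p), 1e-8, 1.0)
    keep = rng.random(n) < p_keep
    # Surviving neurons are rescaled by 1 / p_keep
    return keep * acts / p_keep, p_keep

rng = np.random.default_rng(0)
acts = np.array([0.05, 0.2, 1.0, 3.0])
_, p_keep = prune_activations(acts, rng)
print("keep probabilities:", np.round(p_keep, 3))

# Averaging many independently pruned copies approximately recovers the input.
mean_pruned = np.mean([prune_activations(acts, rng)[0] for _ in range(20000)], axis=0)
print("mean of pruned activations:", np.round(mean_pruned, 2))
```

The rescaling keeps the pruned activations unbiased on average, while any single forward pass (and therefore any single gradient) remains noisy.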

In the upcoming code block, we have implemented this Random Neuron Pruning defense. You are encouraged to test the resilience of this defense mechanism by running the PGD attack against it. This exercise will allow you to evaluate firsthand how effective Random Neuron Pruning is in enhancing the system's robustness against evasion attacks.

In [None]:
class RandomPruned:
    def __init__(self, obj):
        self.model = obj

    def __call__(self, x, *args, **kwargs):
      for layer in self.model.layers:
          x = layer(x)
          if isinstance(layer, tf.keras.layers.Conv2D):
              _, a, b, c = x.shape
              p = tf.abs(x) / (tf.reduce_sum(tf.abs(x), axis=(1, 2, 3), keepdims=True) + 1e-8)  # Adding a small constant for numerical stability
              p_keep = 1 - tf.exp(-a * b * c / 3 * p)
              p_keep = tf.clip_by_value(p_keep, 1e-8, 1.0)  # Ensure p_keep is not zero
              keep = tf.random.uniform(p_keep.shape) < p_keep
              x = tf.cast(keep, tf.float32) * x / (p_keep + 1e-8)  # Adding a small constant to avoid division by zero

      return x

classifier_w_random_pruning = RandomPruned(baseline_classifier)

adv_imgs_for_RNP_model = untargeted_pgd_attack(target_model=classifier_w_random_pruning, preprocess_input=lambda x: x, img_batch=target_samples_x, img_labels=target_samples_org_y, epsilon=4.0/255.0, num_ittr=10, num_classes=num_classes)

In [None]:
attack_attack_success_on_RNP_defense =  sum([untargeted_attack_success_rate(model=classifier_w_random_pruning, preprocess_input=lambda x:x, x=adv_imgs_for_RNP_model, y=target_samples_org_y) for _ in range(10)])/10
print(f"Attack success rate of PGD adversarial examples on random neuron pruning defense is {100*attack_attack_success_on_RNP_defense:.2f}%.")

As evidenced by our results, it's clear that Random Neuron Pruning enhances the system's robustness. The PGD attack's effectiveness is notably reduced, achieving only about a 40% success rate against this defense mechanism. This demonstrates the significant impact that Random Neuron Pruning has in bolstering the system's resistance to such attacks.

#### **Random Model Selection**
One notable downside of the Random Neuron Pruning defense is its detrimental effect on the model's test accuracy – you'll have to take my word for it. To address this issue while still leveraging the benefits of randomization in mitigating evasion attacks, we can turn to the Random Model Selection technique. This approach doesn't rely on a single model. Instead, it involves training multiple models, each with a distinct architecture, for the same task. During the inference stage, one of these models is randomly selected to process the test sample, adding an element of unpredictability to counter evasion attacks.

In the next code block, we've set up the Random Model Selection defense. You're encouraged to evaluate this defense's effectiveness by launching the PGD attack against it.

In [None]:
import random

class RandomModelSelection:
  def __init__(self, models):
    self.models = models
    self.num_models = len(models)

  def __call__(self, x, *args, **kwargs):
    # Randomly select a model
    selected_model = random.choice(self.models)

    # Feed x to the selected model and return its output
    return selected_model(x, *args, **kwargs)

model_1 = load_cifar10_model(model_path="./AML/baseline_cifar10")
model_2 = load_cifar10_model(model_path="./AML/surrogate_cifar10_0")
model_3 = load_cifar10_model(model_path="./AML/surrogate_cifar10_1")
model_4 = load_cifar10_model(model_path="./AML/surrogate_cifar10_2")

random_select_classifier = RandomModelSelection([model_1,model_2,model_3, model_4])

adv_imgs_for_RMS_model = untargeted_pgd_attack(target_model=random_select_classifier, preprocess_input=lambda x: x, img_batch=target_samples_x, img_labels=target_samples_org_y, epsilon=4.0/255.0, num_ittr=10, num_classes=num_classes)

In [None]:
attack_success_on_RMS_defense =  sum([untargeted_attack_success_rate(model=random_select_classifier, preprocess_input=lambda x:x, x=adv_imgs_for_RMS_model, y=target_samples_org_y) for _ in range(10)])/10
print(f"Attack success rate of PGD adversarial examples on random model selection defense is {100*attack_success_on_RMS_defense:.2f}%.")

It's evident from our observations that the success rate of the PGD attack on the Random Model Selection (RMS) defense is lower compared to the baseline model. However, it is higher than what we observed with the Random Neuron Pruning defense. This outcome is primarily because the RMS defense introduces a lesser degree of randomization compared to Neuron Pruning. Despite this, the RMS defense still marks an improvement in robustness against evasion attacks, highlighting its potential as a viable defense strategy.

## **Part 3: Adaptive Attacks**

So far in our lab, we have focused on evaluating the security of various defenses using the standard, off-the-shelf PGD attack. However, as we discussed in our lectures, this approach represents a somewhat naive view of security evaluation. An adversary in a real-world scenario won't limit themselves to these readily available attack methods. Especially when there's a significant incentive to circumvent a system, attackers are likely to get more creative and determined in finding ways to breach defenses.

This brings us to the critical concept of adaptive attacks, an essential aspect to consider when evaluating the robustness of any defense mechanism. In an adaptive attack scenario, we assume that the attacker has a white-box level of access, meaning they have complete knowledge of the defense system, including its intricacies and workings. The attacker's goal is to exploit this knowledge to devise strategies specifically tailored to undermine the defense.

When considering adaptive attacks, the question we need to ask is: 'How could an adversary potentially break this system?' If it seems there's no feasible way to breach the defense, that's a promising sign. However, it doesn't automatically guarantee the defense is impenetrable. It's important to recognize that the inability to conceive an effective adaptive attack might be due to limitations in the evaluator's expertise or imagination. Therefore, while adaptive attacks are a more rigorous and realistic method for testing defenses, they still might not cover all possible attack vectors an actual adversary might employ.

In what follows, we'll explore in more detail the strategies behind adaptive attacks, focusing specifically on two prominent and effective methods: the Expectation over Transformation (EOT) attack and the Backward Pass Differentiable Approximation (BPDA) attack. These sessions will encourage you to adopt an adversary's mindset, understanding not just how these attacks work, but also how they can be creatively applied to challenge defense mechanisms.

#### **Expectation over Transformation vs Random Neuron Pruning Defense**

The Expectation over Transformation (EOT) attack is a sophisticated strategy developed to counter defenses that introduce randomness into model inference. Consider a scenario where a classifier undergoes random transformations: this could involve randomizing the input, selecting models randomly from a pool of models, or employing random neuron pruning. In each case, the possible transformations form a set $T$. EOT addresses the challenge posed by these random defenses by not targeting a single static instance of the model. Instead, it focuses on the average behavior over multiple instances or transformations.

Crucially, EOT leverages a key insight: the gradient of the expected loss over the random transformations is equal to the expectation of the gradient of the loss under each transformation. In practical terms, EOT estimates this expectation by sampling at every step of the iterative attack. This method allows for the refinement of the adversarial example in response to the defense's randomness.
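
Written out, the identity and its sampled estimate look as follows. The notation is introduced here for illustration: $f_t$ denotes the defended classifier under one random draw $t \sim T$, $L$ is the attack loss, and $n$ is the number of samples per step (the `num_trials` argument in the code below). The interchange of gradient and expectation assumes the loss is differentiable in $x$ for each draw $t$.

$$
\nabla_x \, \mathbb{E}_{t \sim T}\big[ L(f_t(x), y) \big]
= \mathbb{E}_{t \sim T}\big[ \nabla_x L(f_t(x), y) \big]
\approx \frac{1}{n} \sum_{i=1}^{n} \nabla_x L(f_{t_i}(x), y), \qquad t_i \sim T
$$

A larger $n$ gives a less noisy gradient estimate at the cost of $n$ gradient computations per attack iteration.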

In the following code block, you will find the implementation of this attack strategy. You can also evaluate its effectiveness against the Random Neuron Pruning defense. Note, however, that since our code is executing on a CPU, and considering the increased number of gradient computations required, the attack may take significantly longer to complete.

In [None]:
def adaptive_attack_against_RNP_defense(target_model, preprocess_input, img_batch, img_labels, epsilon, num_ittr, num_classes, num_trials):
    # Convert image batch to a tensor and keep a copy of the original images
    perturbed_images = tf.convert_to_tensor(img_batch)
    original_images = tf.identity(perturbed_images)

    # Define the step size for the perturbation
    step = (epsilon / num_ittr) * 1.25

    # Iteratively apply perturbations over a number of iterations
    for _ in tqdm(range(num_ittr)):
        # Initialize accumulated gradients as zeros
        accumulated_grads = tf.zeros_like(perturbed_images)

        # Accumulate gradients over a number of trials to average the effect of random neuron pruning
        for _ in tqdm(range(num_trials)):
            accumulated_grads += get_loss_gradient_wrt_input(target_model=target_model, x=perturbed_images, y=img_labels, preprocess_input=preprocess_input, num_classes=num_classes)

        # Calculate the average of the accumulated gradients
        average_grads = accumulated_grads / num_trials
        # Replace NaNs in gradients with zeros
        average_grads = tf.where(tf.math.is_nan(average_grads), tf.zeros_like(average_grads), average_grads)

        # Update perturbed images using the sign of the averaged gradients
        sign_grads = tf.sign(average_grads)
        perturbed_images = perturbed_images + step * sign_grads

        # Ensure that the perturbation stays within the epsilon bounds
        perturbation = tf.clip_by_value(perturbed_images - original_images, -epsilon, epsilon)
        perturbed_images = original_images + perturbation

    # Clip the perturbed images to be within the valid image range [0.0, 1.0]
    perturbed_images = tf.clip_by_value(perturbed_images, 0.0, 1.0)

    # Return the perturbed images as a numpy array
    return perturbed_images.numpy()

# Applying the adaptive attack against the Random Neuron Pruning defense
adaptive_adv_imgs_for_RNP_defense = adaptive_attack_against_RNP_defense(target_model=classifier_w_random_pruning, preprocess_input=lambda x: x, img_batch=target_samples_x, img_labels=target_samples_org_y, epsilon=4.0/255.0, num_ittr=100, num_classes=num_classes, num_trials=100)


In [None]:
attack_success_of_EoT_on_RNP_defense =  untargeted_attack_success_rate(model=classifier_w_random_pruning, preprocess_input=lambda x:x, x=adaptive_adv_imgs_for_RNP_defense, y=target_samples_org_y)
print(f"Attack success rate of EOT adversarial examples on random neuron pruning defense is {100*attack_success_of_EoT_on_RNP_defense:.2f}%.")

#### **Backward Pass Differentiable Approximation Attack vs JPEG Compression Defense**



The Backward Pass Differentiable Approximation (BPDA) attack is designed to tackle the challenges posed by defenses, such as JPEG compression, that feature non-differentiable layers. In this approach, the model is represented as $f$, which includes a specific layer $f_i$ characterized by its non-differentiability. BPDA's strategy is centered around the concept of finding a differentiable approximation, denoted as $g(x)$, that closely mirrors the behavior of $f_i(x)$.

During the forward pass in the attack process, the model operates as normal, inclusive of the non-differentiable layer $f_i$. However, the innovation of BPDA comes into play during the backward pass, which is crucial for gradient calculation. In this phase, $f_i$ is substituted with the approximation $g(x)$. This substitution allows for the derivation of gradients that, despite not being perfectly accurate, are close enough to assist in crafting effective adversarial examples.
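
The idea can be sketched on a toy problem before looking at the full attack. Everything in this sketch is an illustrative assumption rather than the lab's actual setup: `quantize` stands in for the non-differentiable layer $f_i$, the "model" is a linear score with a squared-error loss, and $g(x) = x$ is used as the differentiable approximation, so the gradient with respect to the preprocessed input is reused as the gradient with respect to the raw input.

```python
import numpy as np

def quantize(x, levels=8):
    """Non-differentiable toy preprocessing f_i: snap values to evenly spaced levels."""
    return np.round(x * (levels - 1)) / (levels - 1)

def loss(z, w, y):
    """Toy differentiable 'model + loss': squared error of a linear score."""
    return 0.5 * (z @ w - y) ** 2

def loss_grad(z, w, y):
    """Analytic gradient of the toy loss with respect to the preprocessed input z."""
    return (z @ w - y) * w

def bpda_gradient(x, w, y):
    # Forward pass: apply the real, non-differentiable preprocessing.
    z = quantize(x)
    # Backward pass: treat the preprocessing as g(x) = x (identity Jacobian),
    # so the gradient w.r.t. z is used as the gradient w.r.t. x.
    return loss_grad(z, w, y)

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.6, 0.9])
y = 1.0

eps, num_steps = 0.2, 10
step = (eps / num_steps) * 1.25
x_adv = x.copy()
for _ in range(num_steps):
    x_adv = x_adv + step * np.sign(bpda_gradient(x_adv, w, y))
    # Project back into the epsilon ball and the valid range [0, 1]
    x_adv = np.clip(x + np.clip(x_adv - x, -eps, eps), 0.0, 1.0)

print("loss before:", loss(quantize(x), w, y))
print("loss after: ", loss(quantize(x_adv), w, y))
```

Even though the gradient never passes through `quantize`, the loss measured through the full defended pipeline still increases, which is the behavior BPDA relies on.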

In [None]:
def adaptive_attack_for_JPEG_compression(target_model, preprocess_input, img_batch, img_labels, epsilon, num_ittr, num_classes):
    # Convert image batch to a tensor and keep a copy of original images
    perturbed_images = tf.convert_to_tensor(img_batch)
    original_images = tf.identity(perturbed_images)

    # Define the step size for the perturbation
    step = (epsilon / num_ittr) * 1.25

    # Iteratively apply perturbations
    for _ in tqdm(range(num_ittr)):
        # Apply JPEG compression to perturbed images (forward pass)
        compressed_pertubed_images = preprocess_input(perturbed_images)

        # Compute gradients with respect to the compressed images
        grads_for_compressed_images = get_loss_gradient_wrt_input(target_model=target_model, x=compressed_pertubed_images, y=img_labels, preprocess_input=lambda x: x, num_classes=num_classes)

        # Replace any NaNs in gradients with zeros
        grads_for_compressed_images = tf.where(tf.math.is_nan(grads_for_compressed_images), tf.zeros_like(grads_for_compressed_images), grads_for_compressed_images)

        # Directly use the gradients of the compressed images as our gradient estimate
        # This corresponds to the backward pass of BPDA
        grads = grads_for_compressed_images

        # Compute the sign of the gradients
        sign_grads = tf.sign(grads)

        # Update perturbed images by a step in the direction of the sign of gradients
        perturbed_images = perturbed_images + step * sign_grads

        # Ensure that the perturbations stay within the epsilon bounds
        perturbation = tf.clip_by_value(perturbed_images - original_images, -epsilon, epsilon)
        perturbed_images = original_images + perturbation

    # Clip the perturbed images to be within the valid image range [0, 1]
    perturbed_images = tf.clip_by_value(perturbed_images, 0.0, 1.0)

    # Return the perturbed images as a numpy array
    return perturbed_images.numpy()

# Applying the BPDA attack on the JPEG compression defense
adaptive_adv_imgs_for_jpeg_defense = adaptive_attack_for_JPEG_compression(target_model=baseline_classifier, preprocess_input=jpeg_compress_batch, img_batch=target_samples_x, img_labels=target_samples_org_y, epsilon=4.0/255.0, num_ittr=10, num_classes=num_classes)


In [None]:
attack_attack_success_of_adaptive_attack_on_JPEG_defense =  untargeted_attack_success_rate(model=baseline_classifier, preprocess_input=jpeg_compress_batch, x=adaptive_adv_imgs_for_jpeg_defense, y=target_samples_org_y)
print(f"Attack success rate of adaptive attack on JPEG compression defense is {100*attack_attack_success_of_adaptive_attack_on_JPEG_defense:.2f}%.")

## **Exercise: Hands-on Security Evaluation**

Imagine you're given the task of assessing the security of the following two systems:

- **System 1** uses strong random input augmentation techniques to mitigate the effect of adversarial noise.

- **System 2** is designed to defend against evasion attacks through a majority voting mechanism.


How secure do you think these systems are? What is the highest attack success rate you can achieve? As part of your evaluation, note that you have white-box access to each system, meaning you can see and understand all of its inner workings. Put on your adversary hat and brainstorm the most effective untargeted attack strategy for each system. Consider the system's strengths and vulnerabilities, and how an attacker could exploit them.

### **System 1: Strong Data Augmentation**

In the following code block, we introduce the `augment_strong` function. This function is specifically designed to apply a sequence of transformations to images. The goal of these transformations is to potentially disrupt adversarial perturbations, thereby serving as an effective defense mechanism against evasion attacks.

**Key Features of `augment_strong`:**

- **Adjustable Transformation Intensity**: The `strength` parameter in the function controls the intensity of the transformations, affecting the brightness, contrast, saturation, and hue adjustments.

- **Diverse Transformations**: The function includes four transformation types: random adjustments to brightness, contrast, saturation, and hue. Each type of transformation is defined as a separate internal function.

- **Randomized Transformation Application**: The transformations are applied in a shuffled order, introducing unpredictability that is crucial in thwarting structured adversarial attacks.

- **Additional Image Augmentations**: Beyond color adjustments, the function also performs horizontal flipping and random cropping after reflective padding, adding further variability to the transformed images.

- **Visualization Tool**: The `display_original_and_transformed_images` function enables a side-by-side visual comparison of the original and augmented images, illustrating the effects of the applied augmentations.

In [None]:
def augment_strong(image, strength=.75):
    # Define the intensity of different transformations based on the strength parameter
    brightness = 0.8 * strength
    contrast = 0.8 * strength
    saturation = 0.8 * strength
    hue = 0.2 * strength

    def apply_transform(i, x):
        # Define functions for each type of transformation
        def brightness_foo():
            return tf.image.random_brightness(x, max_delta=brightness)

        def contrast_foo():
            return tf.image.random_contrast(x, lower=1-contrast, upper=1+contrast)

        def saturation_foo():
            return tf.image.random_saturation(x, lower=1-saturation, upper=1+saturation)

        def hue_foo():
            return tf.image.random_hue(x, max_delta=hue)

        # Apply one of the transformations based on the value of i
        x = tf.cond(tf.less(i, 2),
                    lambda: tf.cond(tf.less(i, 1), brightness_foo, contrast_foo),
                    lambda: tf.cond(tf.less(i, 3), saturation_foo, hue_foo))
        return x

    def augment(x, y):
        """
        Applies flip and shift augmentation to the given example.
        Flips the image horizontally and randomly crops it after padding.
        """
        x_shape = tf.shape(x)
        x = tf.image.random_flip_left_right(x)
        x = tf.pad(x, [[0] * 2, [4] * 2, [4] * 2, [0] * 2], mode='REFLECT')
        return tf.image.random_crop(x, x_shape), y

    # Shuffle the order in which transformations are applied
    perm = tf.random.shuffle(tf.range(4))
    for i in range(4):
        image = apply_transform(perm[i], image)
        image = tf.clip_by_value(image, 0., 1.)  # Clip values to maintain valid image range
    return augment(image, None)[0]


def display_original_and_transformed_images(original, transformed):
    """
    Displays the original and transformed images side by side for comparison.
    """
    fig, axes = plt.subplots(1, 2, figsize=(5, 10))  # Create subplot for two images

    # Display the original image
    axes[0].imshow(np.squeeze(original.astype('float32')))
    axes[0].set_title('Original')
    axes[0].axis('off')

    # Display the transformed (augmented) image
    axes[1].imshow(np.squeeze(transformed.astype('float32')))
    axes[1].set_title('Transformed')
    axes[1].axis('off')
    plt.show()

# Example usage: Apply the strong augmentation to a sample image and display it
sample_image = np.expand_dims(target_samples_x[4,::],0)
sample_image_transformed = augment_strong(sample_image).numpy()
display_original_and_transformed_images(original=sample_image, transformed=sample_image_transformed)
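
Because `augment_strong` draws fresh randomness on every call, the gradient from a single forward pass is a noisy estimate. One common response, in the spirit of the Expectation over Transformation idea mentioned in the lab introduction, is to average gradients over several sampled transformations before each PGD step. The sketch below shows that idea on a toy problem only. The linear softmax classifier `W`, the scalar contrast/brightness transform, and the helper names `random_transform_params`, `loss_gradient`, and `eot_pgd` are all illustrative assumptions, not part of the lab code. In the lab itself the gradient would come from the real model and the real augmentation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transform_params():
    # Toy stand-in for a random augmentation: contrast scale s and brightness shift d.
    return rng.uniform(0.5, 1.5), rng.uniform(-0.3, 0.3)

def loss_gradient(W, x, y, s, d):
    # Gradient of the cross-entropy loss w.r.t. x for logits = W @ (s * x + d).
    logits = W @ (s * x + d)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0               # softmax probabilities minus one-hot label
    return s * (W.T @ p)      # chain rule through the (linear) transform

def eot_pgd(W, x, y, epsilon, num_steps=10, num_samples=20):
    # Untargeted L-infinity PGD where each step uses a gradient averaged
    # over num_samples freshly sampled transformations.
    step = 2.5 * epsilon / num_steps
    x_adv = x.copy()
    for _ in range(num_steps):
        grads = [loss_gradient(W, x_adv, y, *random_transform_params())
                 for _ in range(num_samples)]
        g = np.mean(grads, axis=0)
        x_adv = x_adv + step * np.sign(g)                  # ascend the loss
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)   # project onto the epsilon-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                   # keep a valid pixel range
    return x_adv

W = rng.normal(size=(10, 32))   # toy 10-class linear model on 32 "pixels"
x = rng.uniform(size=32)
y = 3
x_adv = eot_pgd(W, x, y, epsilon=4.0 / 255.0)
print("max perturbation:", np.abs(x_adv - x).max())
```

The number of samples per step trades gradient quality for compute: more samples give a smoother estimate of the expected gradient at a proportionally higher cost.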

Below, **please implement your strongest attack strategy, specifically designed to defeat the defense mechanism of this system.**

In [None]:
def your_best_attack_on_strong_augmentation_system(target_model, preprocess_input, img_batch, img_labels, epsilon, num_ittr, num_classes):

    ## WRITE YOUR CODE BELOW
    #....
    #
    #return perturbed_images.numpy()
    raise NotImplementedError("Implement your attack, then remove this line.")


adv_imgs_for_strong_aug_defense = your_best_attack_on_strong_augmentation_system(target_model=baseline_classifier, preprocess_input=augment_strong, img_batch=target_samples_x, img_labels=target_samples_org_y, epsilon=4.0/255.0, num_ittr=10, num_classes=num_classes)

attack_success_on_strong_aug_defense =  untargeted_attack_success_rate(model=baseline_classifier, preprocess_input=augment_strong, x=adv_imgs_for_strong_aug_defense, y=target_samples_org_y)
print(f"Attack success rate on strong augmentation defense is {100*attack_success_on_strong_aug_defense:.2f}%.")

### **System 2: Majority Voting System**

This system incorporates three classifiers, each independently evaluating the test sample. The system's final output is based on the majority decision of these classifiers. When the classifiers fail to agree on a common label, the system is designed to select and return the largest label among those predicted by the individual models.
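
The voting rule described above can be checked on a few hand-picked label triples. The sketch below uses NumPy rather than TensorFlow, and `majority_vote_labels` is a hypothetical helper introduced only for illustration. It mirrors the rule as stated: return the majority label, or the largest label when all three disagree.

```python
import numpy as np

def majority_vote_labels(a, b, c):
    """Majority label per sample; the largest label if all three predictions differ."""
    a, b, c = (np.asarray(v) for v in (a, b, c))
    largest = np.maximum(np.maximum(a, b), c)
    return np.where(a == b, a, np.where(a == c, a, np.where(b == c, b, largest)))

# Four samples: unanimous pair (a, b), pair (a, c), pair (b, c), and full disagreement.
votes = majority_vote_labels([3, 3, 1, 2], [3, 5, 7, 9], [8, 3, 7, 4])
print(votes)  # [3 3 7 9]
```

In the last sample no two predictions agree, so the output is the largest predicted label, 9. Whenever two models agree, the third model's prediction has no effect on the output.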


In [None]:
class ClassifierWithMajorityVote:
    def __init__(self, models):
        self.models = models
        self.num_models = len(models)

    def __call__(self, x, *args, **kwargs):
        # Compute predictions using each model
        predictions = [tf.argmax(model(x, *args, **kwargs), axis=1) for model in self.models]

        # Extract individual predictions
        a, b, c = predictions

        # Majority vote using TensorFlow operations
        # If there's no consensus, return the prediction with the largest label
        max_ab = tf.maximum(a, b)
        max_abc = tf.maximum(max_ab, c)
        majority_vote = tf.where(a == b, a, tf.where(a == c, a, tf.where(b == c, b, max_abc)))

        # Convert to one-hot encoding
        return tf.one_hot(majority_vote, depth=10)



model_1 = load_cifar10_model(model_path="./AML/baseline_cifar10")
model_2 = load_cifar10_model(model_path="./AML/surrogate_cifar10_1")
model_3 = load_cifar10_model(model_path="./AML/surrogate_cifar10_2")

classifier_with_majority_vote = ClassifierWithMajorityVote([model_1,model_2,model_3])
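
Note that `ClassifierWithMajorityVote.__call__` returns a one-hot encoding of an `argmax`, so the system's output provides no useful gradient with respect to the input. With white-box access, one possible workaround is to differentiate a loss computed on the member models' own outputs instead. The sketch below illustrates that idea on a toy ensemble, under stated assumptions: the three linear softmax models in `weights` and the helpers `ensemble_loss_gradient`, `ensemble_pgd`, and `summed_loss` are hypothetical, and summing the per-model cross-entropy losses is one choice among several.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def summed_loss(weights, x, y):
    # Sum of cross-entropy losses over all ensemble members (toy models: logits = W @ x).
    return sum(-np.log(softmax(W @ x)[y]) for W in weights)

def ensemble_loss_gradient(weights, x, y):
    # Gradient of summed_loss w.r.t. x.
    grad = np.zeros_like(x)
    for W in weights:
        p = softmax(W @ x)
        p[y] -= 1.0           # softmax probabilities minus one-hot label
        grad += W.T @ p
    return grad

def ensemble_pgd(weights, x, y, epsilon, num_steps=10):
    # Untargeted L-infinity PGD on the summed loss of all members.
    step = 2.5 * epsilon / num_steps
    x_adv = x.copy()
    for _ in range(num_steps):
        g = ensemble_loss_gradient(weights, x_adv, y)
        x_adv = np.clip(x_adv + step * np.sign(g), x - epsilon, x + epsilon)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

weights = [rng.normal(size=(10, 32)) for _ in range(3)]   # three toy 10-class models
x = rng.uniform(size=32)
y = 0
x_adv = ensemble_pgd(weights, x, y, epsilon=4.0 / 255.0)
print("loss before:", summed_loss(weights, x, y))
print("loss after: ", summed_loss(weights, x_adv, y))
```

Raising the summed loss does not by itself guarantee that the vote flips, since at least two members must be fooled (or all three must disagree), so the attack success rate still has to be measured on the voting system's actual output.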

Below, **please implement your strongest attack strategy, specifically designed to defeat the defense mechanism of this system.**

In [None]:
def your_best_attack_on_majority_vote_system(target_model, preprocess_input, img_batch, img_labels, epsilon, num_ittr, num_classes):

    ## WRITE YOUR CODE BELOW
    #....
    #
    #return perturbed_images.numpy()
    raise NotImplementedError("Implement your attack, then remove this line.")


adv_imgs_for_maj_vote_defense = your_best_attack_on_majority_vote_system(target_model=classifier_with_majority_vote, preprocess_input=lambda x: x, img_batch=target_samples_x, img_labels=target_samples_org_y, epsilon=4.0/255.0, num_ittr=10, num_classes=num_classes)

attack_success_on_majority_vote_defense =  untargeted_attack_success_rate(model=classifier_with_majority_vote, preprocess_input=lambda x: x, x=adv_imgs_for_maj_vote_defense, y=target_samples_org_y)
print(f"Attack success rate on majority vote defense is {100*attack_success_on_majority_vote_defense:.2f}%.")