Adversarial examples are inputs that have been modified, often imperceptibly to the human eye, so that a model's prediction changes significantly. This makes them a concern in critical domains such as security and healthcare. Understanding how these attacks work is a first step toward thinking about possible defenses.

There are two types of adversarial attacks: white-box and black-box. In a white-box attack, the attacker knows the model, the input, and the loss function that was used to train the model. In a black-box attack, the attacker lacks this internal knowledge and can only observe the model's outputs. With white-box knowledge, the attacker can change the inputs to disrupt the predicted outputs. The change to the input is usually minor and indistinguishable to the human eye. A common white-box attack is the Fast Gradient Sign (FGS) attack, which changes the input in the direction that increases the loss.

In the FGS attack, given an input and a pre-trained model, we compute the gradient of the loss with respect to the input. Then, we add a small multiple of the sign of that gradient to the input.
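To make the update rule concrete, here is a minimal, dependency-free sketch on a hypothetical two-feature logistic regression model. The weights, bias, input, and `alfa` value are made up for illustration and are not the notebook's model or data. Each input feature is moved by `alfa` in the direction given by the sign of the loss gradient, so the loss should go up while no feature changes by more than `alfa`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, b, x, y):
    """Negative log-likelihood of label y (0 or 1) under logistic regression."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -math.log(p if y == 1 else 1.0 - p)

def input_gradient(w, b, x, y):
    """Analytic gradient of the loss with respect to the input x: (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgs_step(w, b, x, y, alfa):
    """Move every input feature by alfa in the direction that increases the loss."""
    grad = input_gradient(w, b, x, y)
    return [xi + alfa * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical toy values, chosen only for illustration.
w, b = [2.0, -1.0], 0.1
x, y = [0.5, 0.2], 1
x_adv = fgs_step(w, b, x, y, alfa=0.05)

print("loss before:", nll(w, b, x, y))
print("loss after: ", nll(w, b, x_adv, y))
```

The rest of the notebook performs the same step with PyTorch, where autograd supplies the input gradient instead of a hand-derived formula.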

## Dataset

In [None]:
import mydataset

In [None]:
test_dl = mydataset.test_dl
for xb,yb in test_dl:
    print(xb.shape, yb.shape)
    break

## Pre-trained Model

In [None]:
import mymodel

In [None]:
model = mymodel.model

In [None]:
import torch
device = torch.device("cuda:3" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
def freeze_model(model):
    # model.parameters() also covers parameters registered directly on the top-level module
    for param in model.parameters():
        param.requires_grad = False
    print("model frozen")
    return model

In [None]:
model = freeze_model(model)

In [None]:
import numpy as np

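def deploy_model(model, test_dl):
    """Return predicted and ground-truth labels; assumes test_dl yields batches of size 1."""
    y_pred = []
    y_gt = []
    model.eval()  # use inference behavior for layers such as dropout and batch norm
    with torch.no_grad():
        for x, y in test_dl:
            y_gt.append(y.item())
            out = model(x.to(device)).cpu().numpy()
            out = np.argmax(out, axis=1)[0]
            y_pred.append(out)
    return y_pred, y_gt

y_pred, y_gt = deploy_model(model, test_dl)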

In [None]:
# Verify the pre-trained model's performance on the sample test data
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_gt, y_pred)  # signature is accuracy_score(y_true, y_pred)
print("accuracy: %.2f" % acc)

## FGS Attack

In [None]:
"""
First, compute the loss by comparing the model output with the target label. Then,
compute the gradient of the loss with respect to the input tensor. Finally, perturb
the input by adding alfa times the sign of that gradient. The alfa coefficient sets the
size of the per-pixel distortion: a larger alfa means a more visible perturbation. Here
alfa=0.005, which is intended to be large enough to change predictions while remaining
hard to see.
"""
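import torch.nn.functional as F  # imported here because perturb_input uses F.nll_loss

def perturb_input(xb, yb, model, alfa):
    """
    xb: The input to be perturbed, a PyTorch tensor of shape [1, 3, height, width]
    yb: The target label, a tensor of shape [1]
    model: The pre-trained model, expected to output log-probabilities
    alfa: The perturbation coefficient, set to 0.005
    Returns the perturbed input and the model output for the original input.
    """
    xb = xb.to(device)
    xb.requires_grad = True              # track gradients with respect to the input
    out = model(xb).cpu()
    loss = F.nll_loss(out, yb)
    model.zero_grad()
    loss.backward()                      # fills xb.grad with d(loss)/d(xb)
    xb_grad = xb.grad
    xb_p = xb + alfa * xb_grad.sign()    # step in the direction that increases the loss
    xb_p = torch.clamp(xb_p, 0, 1)       # keep pixel values in the valid [0, 1] range
    return xb_p.detach(), out.detach()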
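A useful sanity check on the perturbation step is that no pixel moves by more than `alfa`, and that the result stays in [0, 1]. The sketch below illustrates this under stated assumptions: `toy_model` is a hypothetical stand-in (a single linear layer with log-softmax outputs), the input is a random tensor rather than a real image, and `fgs_perturb` is an illustrative rewrite of the same steps without the notebook's `device` global.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for the pre-trained model: 2 classes, log-probability outputs.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2), nn.LogSoftmax(dim=1))
for p in toy_model.parameters():
    p.requires_grad = False

def fgs_perturb(xb, yb, model, alfa):
    """Same steps as perturb_input, written without the notebook's globals."""
    xb = xb.clone().requires_grad_(True)
    loss = F.nll_loss(model(xb), yb)
    loss.backward()
    xb_p = xb + alfa * xb.grad.sign()
    return torch.clamp(xb_p, 0, 1).detach()

xb = torch.rand(1, 3, 8, 8)   # random "image" with values in [0, 1]
yb = torch.tensor([1])
alfa = 0.005
xb_p = fgs_perturb(xb, yb, toy_model, alfa)

# The largest per-pixel change should not exceed alfa (up to float rounding).
max_change = (xb_p - xb).abs().max().item()
print("largest per-pixel change:", max_change)
```

Because every pixel changes by at most `alfa`, the perturbed image is usually hard to tell apart from the original for small values such as 0.005.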

In [None]:
from torchvision.transforms.functional import to_pil_image
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

y_pred = []
y_pred_p = []
for xb,yb in test_dl:
    xb_p, out = perturb_input(xb, yb, model, alfa = 0.005)
    
    with torch.no_grad(): # gradient tracking is not needed for the steps below
        # Get the predicted class and the class-1 probability before and after perturbation
        pred = out.argmax(dim=1, keepdim=False).item()
        y_pred.append(pred) 
        prob = torch.exp(out[:, 1])[0].item()

        out_p = model(xb_p).cpu()
        pred_p = out_p.argmax(dim=1, keepdim=False).item()
        y_pred_p.append(pred_p)
        prob_p = torch.exp(out_p[:, 1])[0].item()
        
    plt.subplot(1, 2, 1)
    plt.imshow(to_pil_image(xb[0].detach().cpu()))
    plt.title(prob)
    plt.subplot(1, 2, 2)
    plt.imshow(to_pil_image(xb_p[0].detach().cpu()))
    plt.title(prob_p)
    plt.show()
    

In [None]:
# accuracy on the original (unperturbed) test data
acc = accuracy_score(y_gt, y_pred)
print("accuracy: %.2f" % acc)

In [None]:
# the model accuracy drops significantly on the perturbed data
acc = accuracy_score(y_gt, y_pred_p)
print("accuracy: %.2f" % acc)
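To see how the choice of `alfa` affects the attack, one option is to sweep over several values and record the accuracy for each. The following is a rough sketch, not the notebook's experiment: `toy_model` is a hypothetical untrained linear classifier, the inputs are random tensors, and the labels are simply the model's own predictions on the clean inputs, so accuracy starts at 100% by construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in model and data; not the notebook's model or dataset.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2), nn.LogSoftmax(dim=1))
toy_model.eval()
for p in toy_model.parameters():
    p.requires_grad = False

images = torch.rand(64, 3, 8, 8)              # values in [0, 1]
with torch.no_grad():
    labels = toy_model(images).argmax(dim=1)  # use clean predictions as labels

def accuracy_under_attack(model, x, y, alfa):
    """Apply one FGS step to the whole batch and return the resulting accuracy."""
    x = x.clone().requires_grad_(True)
    F.nll_loss(model(x), y).backward()
    x_p = torch.clamp(x + alfa * x.grad.sign(), 0, 1).detach()
    with torch.no_grad():
        return (model(x_p).argmax(dim=1) == y).float().mean().item()

alfas = [0.0, 0.005, 0.02, 0.1]
accs = [accuracy_under_attack(toy_model, images, labels, a) for a in alfas]
for a, acc in zip(alfas, accs):
    print("alfa=%.3f  accuracy=%.2f" % (a, acc))
```

With `alfa=0` the inputs are unchanged, so the accuracy equals the clean accuracy. The same loop could be run with the notebook's `model` and `test_dl` to choose an `alfa` that balances attack strength against visibility.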