<a href="https://colab.research.google.com/github/inspire-lab/SecurePrivateAI/blob/master/2_attack_cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Attacking a CNN

CNNs are vulnerable to adversarial examples, as the image below illustrates:

![adv example](https://www.tensorflow.org/tutorials/generative/images/adversarial_example.png)

In this exercise we will train a CNN to classify handwritten MNIST digits and then attack a handwritten `0`. We will be using `PyTorch` to do this.

Once we have a trained classifier, we will create adversarial examples, first from scratch and then using `ART` (the Adversarial Robustness Toolbox).

This notebook is adapted from https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/examples/get_started_pytorch.py

In [None]:
# some configurations for jupyter notebook
%config Completer.use_jedi = False
%load_ext autoreload
%autoreload 2

In [None]:
!pip install adversarial-robustness-toolbox torch torchvision numpy matplotlib

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

%matplotlib inline 
import matplotlib.pyplot as plt


The MNIST dataset contains images of all ten handwritten digits.

The data needs to be normalized. Here, we use the `load_mnist` helper from `ART`, which takes care of that.

`load_mnist` returns the data as NumPy arrays.

In [None]:
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import PyTorchClassifier
from art.utils import load_mnist

In [None]:
# Step 1: Load the MNIST dataset

(x_train, y_train), (x_test, y_test), min_pixel_value, max_pixel_value = load_mnist()

# Step 1a: Swap axes to PyTorch's NCHW format

x_train = np.transpose(x_train, (0, 3, 1, 2)).astype(np.float32)
x_test = np.transpose(x_test, (0, 3, 1, 2)).astype(np.float32)

In [None]:
print(type(x_train))
print(x_train.shape, x_test.shape, y_train.shape)

We are using a very simple CNN. This network can be used to distinguish between all 10 classes with very high accuracy.
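The layer sizes of the network defined below follow from simple shape arithmetic. The sketch traces a 28×28 input through two 5×5 convolutions (stride 1, no padding), each followed by 2×2 max pooling. The helper names `conv_out` and `pool_out` are made up for this illustration.

```python
def conv_out(size, kernel_size, stride=1):
    """Output width/height of a convolution without padding."""
    return (size - kernel_size) // stride + 1


def pool_out(size, kernel_size=2, stride=2):
    """Output width/height of a max-pooling layer without padding."""
    return (size - kernel_size) // stride + 1


size = 28
size = pool_out(conv_out(size, 5))  # conv 5x5: 28 -> 24, pool 2x2: 24 -> 12
size = pool_out(conv_out(size, 5))  # conv 5x5: 12 -> 8,  pool 2x2: 8 -> 4

# the second convolution has 10 output channels
flat_features = size * size * 10
print(size, flat_features)
```

The result corresponds to `in_features=4 * 4 * 10` in the first fully connected layer.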

In [None]:
# define the classifier
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv_1 = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=5, stride=1)
        self.conv_2 = nn.Conv2d(in_channels=4, out_channels=10, kernel_size=5, stride=1)
        self.fc_1 = nn.Linear(in_features=4 * 4 * 10, out_features=100)
        self.fc_2 = nn.Linear(in_features=100, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv_1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv_2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4 * 4 * 10)
        x = F.relu(self.fc_1(x))
        x = self.fc_2(x)
        return x


Then we initialize a model and train with the cross-entropy loss.

To simplify the training code, we use the wrapper `PyTorchClassifier` from `ART` to train the model.

In [None]:
# Step 2: Create the model

model = Net()

# Step 2a: Define the loss function and the optimizer

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Step 3: Create the ART classifier

classifier = PyTorchClassifier(
    model=model,
    clip_values=(min_pixel_value, max_pixel_value),
    loss=criterion,
    optimizer=optimizer,
    input_shape=(1, 28, 28),
    nb_classes=10,
)

# Step 4: Train the ART classifier

classifier.fit(x_train, y_train, batch_size=64, nb_epochs=3)

# Step 5: Evaluate the ART classifier on benign test examples

predictions = classifier.predict(x_test)
accuracy = np.sum(np.argmax(predictions, axis=1) == np.argmax(y_test, axis=1)) / len(y_test)
print("Accuracy on benign test examples: {}%".format(accuracy * 100))

# you should see an accuracy > 95%

Let's get to the actual attack magic. First, we pick a sample that we want to perturb. After that, we implement our own FGSM (Fast Gradient Sign Method) attack.

The attack is fairly simple. It consists of the following steps: 

1.   Compute the loss of the original sample
2.   Calculate the gradient of the loss w.r.t. the input
3.   Take the sign of the gradient, scale it by epsilon, and add it to the input, namely $x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y))$

Epsilon controls the strength of the perturbation.
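As a small numeric illustration of step 3, the sketch below applies the update to a hand-written tensor. The gradient values are made up; in the real attack they come from backpropagating the loss to the input.

```python
import torch

x = torch.tensor([0.2, 0.5, 0.9])       # toy "pixels"
grad = torch.tensor([0.03, -1.7, 0.0])  # made-up gradient of the loss w.r.t. x
eps = 0.1

# only the sign of each gradient entry matters, not its magnitude
x_adv = x + eps * grad.sign()
print(x_adv)
```

Every entry with a nonzero gradient moves by exactly `eps`, so epsilon bounds the largest change to any single pixel.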

First, we select a sample, visualize it, and output the model's predictions.

In [None]:

# choose a sample to perturb
sample_ind = 3  # chosen by a totally random dice roll; index 3 is a handwritten `0`

# pick the test sample
sample = x_test[sample_ind]

print(sample.shape)

# plot the selected test sample
plt.imshow(sample.reshape(28, 28), cmap="gray_r")
plt.axis('off')
plt.show()

# add a batch dimension: (1, 28, 28) -> (1, 1, 28, 28)
batch = sample[np.newaxis, ...]

# ART may have moved the model to the GPU, so put the input on the same device
device = next(model.parameters()).device
pred_prob = F.softmax(model(torch.from_numpy(batch).to(device)), dim=1)

logits = classifier.predict(batch)

print('logits for the test sample:\n', logits)
print('class probabilities for the test sample:\n', pred_prob.detach().cpu())
print('predicted as', np.argmax(logits, axis=1))

Since `ART` loads the data as NumPy arrays, we create PyTorch tensors from them for convenience.

The labels in `y_train` and `y_test` are one-hot vectors, so we need to convert them to class indices.
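A minimal sketch of that conversion, using a hand-made one-hot array instead of the real labels:

```python
import numpy as np
import torch

# two made-up one-hot labels for the classes 0 and 3 (10 classes)
one_hot = np.zeros((2, 10), dtype=np.float32)
one_hot[0, 0] = 1.0
one_hot[1, 3] = 1.0

# argmax along the class axis recovers the class index
labels = torch.argmax(torch.from_numpy(one_hot), dim=1)
print(labels)
```

`nn.CrossEntropyLoss` accepts integer class indices like these as targets.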

In [None]:
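eps = 1.  # maximum allowed modification per pixel
# keep tensors on the same device as the model (ART may have moved it to the GPU)
device = next(model.parameters()).device
t_sample = torch.from_numpy(sample[np.newaxis, ...]).to(device)
# y_test is one-hot encoded; argmax recovers the class index
one_hot_y = torch.from_numpy(y_test[sample_ind].reshape(1, -1)).to(device)
t_y = torch.argmax(one_hot_y, dim=1)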


Construct adversarial examples from scratch. You can use the code above as reference.

Eq: $x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y))$

In [None]:
# constructing adversarial examples
######################
# fill in the blanks #
######################

# compute logits using the PyTorch Tensor

logits = ??? t_sample ???

# compute the cross entropy loss of our original sample

loss = 

# get the gradient w.r.t. the input
# note: this may raise an error if you only follow the steps above

grads = torch.autograd.grad(  )
print(grads.shape)

# You may see the error
# `RuntimeError: One of the differentiated Tensors does not require grad`
# What does it mean, and how can it be fixed?

The error is caused by how PyTorch's autograd works.

By default, only the model's parameters require gradients; a tensor created from data, such as `t_sample`, does not.

So we first need to make the input tensor require gradients.

The adversarial example is: $x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y))$
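The behaviour can be reproduced on a toy computation that is unrelated to the CNN. In the sketch below, `w` plays the role of a model parameter and `x` the role of the input; both are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])                       # plain data tensor
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)  # "parameter"

loss = (w * x).sum()
try:
    torch.autograd.grad(loss, x)
    raised = False
except RuntimeError as err:
    # x was never marked as requiring gradients
    raised = True
    print(err)

# mark the input as requiring gradients and rebuild the graph
x.requires_grad_(True)
loss = (w * x).sum()

# torch.autograd.grad returns a tuple with one entry per input
(grad_x,) = torch.autograd.grad(loss, x)
print(grad_x)
```

Since the toy loss is `sum(w * x)`, the gradient with respect to `x` is simply `w`.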

In [None]:
# constructing adversarial examples
######################
# fill in the blanks #
######################

# mark the input data as requiring gradients



# compute logits

logits = ??? t_sample ???

# compute the loss of our original sample

loss =  

# get the gradient w.r.t. the input

grads = torch.autograd.grad( )

# torch.autograd.grad returns a tuple, so make sure grads is a PyTorch tensor before printing its shape
print(grads.shape)

# calculate the perturbation using the sign of grads

perturbation = ??? grads ???

# apply the perturbation

x_adv = 

# now that we have the adversarial examples
# get the prediction result and print the adversarial example


print( 'our adversarial example' )
print( x_adv.shape )

print( 'logits for our sample: \t\n', ???  )
print( 'class prediction for our sample: \t\n',  ???  )
print( 'predicted as',  ???  )

plt.imshow( x_adv.detach().cpu().numpy().reshape( 28, 28 ), cmap="gray_r" )
plt.axis( 'off' )
plt.show( )


FGSM is one of the simplest attacks.
As we can see, the results are not very convincing because the perturbation is clearly perceptible.
We can improve on it by making it iterative.

Using the code from above, create an iterative version of FGSM that uses a smaller epsilon, calculates a new perturbation in every iteration, and stops once it achieves misclassification.

The goal is to make the perturbation as small (ideally invisible) as possible while still making the model give a wrong prediction.
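One possible shape for such a loop is sketched below on a toy problem rather than on the CNN. The fixed linear classifier `toy_model`, the input `x`, and the names `step_eps` and `max_iterations` are all hypothetical stand-ins; clamping to `[0, 1]` assumes inputs normalized to that range.

```python
import torch
import torch.nn.functional as F

# hypothetical stand-in for the trained CNN: a fixed linear 2-class classifier
W = torch.tensor([[1.0, -1.0, 0.5, 0.0],
                  [-1.0, 1.0, -0.5, 0.0]])


def toy_model(x):
    return x @ W.T


x = torch.tensor([[1.0, 0.0, 0.8, 0.5]])  # toy input, classified as class 0
y = torch.tensor([0])                      # true label

step_eps = 0.1       # small per-iteration epsilon
max_iterations = 20  # upper bound on the number of steps

x_adv = x.clone()
steps_used = 0
for i in range(max_iterations):
    x_adv.requires_grad_(True)
    logits = toy_model(x_adv)
    # stop as soon as the prediction no longer matches the true label
    if logits.argmax(dim=1).item() != y.item():
        break
    loss = F.cross_entropy(logits, y)
    (grad,) = torch.autograd.grad(loss, x_adv)
    # one small FGSM step, then clip back to the valid input range
    x_adv = (x_adv.detach() + step_eps * grad.sign()).clamp(0.0, 1.0)
    steps_used += 1

x_adv = x_adv.detach()
print('steps used:', steps_used)
print('prediction:', toy_model(x_adv).argmax(dim=1).item())
```

Because every step changes each entry by at most `step_eps`, the total change per entry is bounded by `step_eps * steps_used`, which is usually far less than a single step with a large epsilon.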

In [None]:
# your code here

epsilon =
iterations =
for i in range(iterations):
    # your code

Let's use the `ART` library to do the actual attack magic.

We will again use the FGSM attack to generate an adversarial example.

In [None]:
# using the ART implementation: create `adv_sample` here


print( 'logits for our sample: \t\n', ???  )
print( 'class prediction for our sample: \t\n',  ???  )
print( 'predicted as',  ???  )

# visualize it

plt.imshow( adv_sample.reshape( 28, 28 ), cmap="gray_r" )
plt.axis( 'off' )
plt.show( )


You can see that it is much simpler than writing the attack from scratch.

For a complete example, see https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/examples/get_started_pytorch.py

We have seen that FGSM does not do a great job of producing adversarial examples for the `0` sample above. Since the classifier covers all 10 digits, update the code above to try a number of `0` instances and see which class an untargeted attack transforms them into.
Alternatively, pick a pair of digits that you think are closer to each other, for which the FGSM attack should work better.


`ART` provides more attacks than the ones introduced above. Try any of the other attacks from the official documentation.

You can find more information on the attacks here: https://github.com/Trusted-AI/adversarial-robustness-toolbox/wiki/ART-Attacks


In [None]:
# your code here