<a href="https://colab.research.google.com/github/inspire-lab/SecurePrivateAI/blob/main/3_defend_cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Defense with adversarial training

In this section we will use adversarial training to harden our CNN against adversarial examples. 

In adversarial training the dataset gets "augmented" with correctly labeled adversarial examples. This way the network learns that such perturbations are possible and can adapt to them.

We will be using the IBM Adversarial Robustness Toolbox in this exercise. It offers a very easy-to-use implementation of adversarial training and a number of other defenses. 
https://github.com/IBM/adversarial-robustness-toolbox


We start out by installing the required packages and importing most of the modules and functions we will need. 

In [None]:
pip install tensorflow-gpu==1.15.2 keras==2.2.3 adversarial-robustness-toolbox 

In [None]:
# most of our imports
import warnings
import numpy as np
import os
import keras
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
from art.classifiers import KerasClassifier


# helper code 
def extract_ones_and_zeroes( data, labels ):
    data_zeroes = data[ np.argwhere( labels == 0 ).reshape( -1 ) ][ :200 ]
    data_ones = data[ np.argwhere( labels == 1 ).reshape( -1 ) ][ :200 ]
    x = np.vstack( (data_zeroes, data_ones) )

    x = x / 255.
    print( x.shape )

    labels_zeroes = np.zeros( data_zeroes.shape[ 0 ] )
    labels_ones = np.ones( data_ones.shape[ 0 ] )
    y = np.append( labels_zeroes, labels_ones )

    return x, y

def extract_two_classes( data, labels, classes=(0,1), no_instance=200 ):
    data_zeroes = data[ np.argwhere( labels ==  classes[0] ).reshape( -1 ) ][ :no_instance ]
    data_ones = data[ np.argwhere( labels == classes[1] ).reshape( -1 ) ][ :no_instance ]
    x = np.vstack( (data_zeroes, data_ones) )
    
    # normalize the data
    x = x / 255.

    labels_zeroes = np.zeros( data_zeroes.shape[ 0 ] )
    labels_ones = np.ones( data_ones.shape[ 0 ] )
    y = np.append( labels_zeroes, labels_ones )

    return x, y

def convert_to_keras_image_format( x_train, x_test ):
    if keras.backend.image_data_format( ) == 'channels_first':
        x_train = x_train.reshape( x_train.shape[ 0 ], 1, x_train.shape[ 1 ], x_train.shape[ 2 ] )
        # use x_test's own shape here: x_train has already been reshaped at this point
        x_test = x_test.reshape( x_test.shape[ 0 ], 1, x_test.shape[ 1 ], x_test.shape[ 2 ] )
    else:
        x_train = x_train.reshape( x_train.shape[ 0 ], x_train.shape[ 1 ], x_train.shape[ 2 ], 1 )
        x_test = x_test.reshape( x_test.shape[ 0 ], x_test.shape[ 1 ], x_test.shape[ 2 ], 1 )

    return x_train, x_test


def mnist_cnn_model( x_train, y_train, x_test, y_test, epochs=2 ):
    # define the classifier
    clf = keras.Sequential( )
    clf.add( Conv2D( 32, kernel_size=(3, 3), activation='relu', input_shape=x_train.shape[ 1: ] ) )
    clf.add( Conv2D( 64, (3, 3), activation='relu' ) )
    clf.add( MaxPooling2D( pool_size=(2, 2) ) )
    clf.add( Dropout( 0.25 ) )
    clf.add( Flatten( ) )
    clf.add( Dense( 128, activation='relu' ) )
    clf.add( Dropout( 0.5 ) )
    clf.add( Dense( y_train.shape[ 1 ], activation='softmax' ) )

    clf.compile( loss=keras.losses.categorical_crossentropy,
                 optimizer='adam',
                 metrics=[ 'accuracy' ] )

    clf.fit( x_train, y_train,
             epochs=epochs,
             verbose=1 )
    clf.summary( )
    score = clf.evaluate( x_test, y_test )
    print( 'Test loss:', score[ 0 ] )
    print( 'Test accuracy:', score[ 1 ] )

    return clf


def show_image( img ):
    plt.imshow( img.reshape( 28, 28 ), cmap="gray_r" )
    plt.axis( 'off' )
    plt.show( )

We start out by loading the data, preparing it and training our CNN.

In [None]:
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# extract ones and zeroes
x_train, y_train = extract_ones_and_zeroes( x_train, y_train )
x_test, y_test = extract_ones_and_zeroes( x_test, y_test )

# we need to bring the data into a format that our cnn likes
y_train = keras.utils.to_categorical( y_train, 2 )
y_test = keras.utils.to_categorical( y_test, 2 )

# convert it to a format keras can work with
x_train, x_test = convert_to_keras_image_format(x_train, x_test)

# we need to do some setup so everything gets executed in the same tensorflow session
session = tf.Session( )
keras.backend.set_session( session )

# get and train our cnn
clf = mnist_cnn_model( x_train, y_train, x_test, y_test, epochs=5)


We want to know how robust our model is against an attack. To do this we calculate the `empirical robustness`, which is equivalent to computing the minimal perturbation that the attacker must introduce for a successful attack. We follow the approach of Moosavi-Dezfooli et al. 2016 (paper link: https://arxiv.org/abs/1511.04599).

The empirical robustness method currently supports two attacks:
the `Fast Gradient Sign Method` and `HopSkipJump`.

You can select them by passing either `fgsm` or `hsj` as the attack name.
The default attack parameters are the following:
```
    "fgsm", {"eps_step": 0.1, "eps_max": 1., "clip_min": 0., "clip_max": 1.},
    "hsj", {'max_iter': 50, 'max_eval': 10000, 'init_eval': 100, 'init_size': 100}
```
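To make the metric more concrete, here is a minimal NumPy sketch of the idea. It does not use `art`; the linear classifier `w`, the helper names, and the toy inputs are assumptions made for illustration. It grows `eps` in steps of `eps_step` until the prediction flips and then reports the mean relative size of the perturbation over the successfully attacked samples, which is how we understand the `fgsm` variant of the metric to work.

```python
import numpy as np

# Hypothetical toy stand-in for the CNN: a fixed linear classifier on four "pixels" in [0, 1].
w = np.array([1.0, -1.0, 0.5, -0.5])

def predict(x):
    """Predicted class (0 or 1) for every row of x."""
    return (x @ w > 0).astype(int)

def minimal_fgsm(x, y, eps_step=0.1, eps_max=1.0):
    """Increase eps in steps of eps_step until the prediction flips or eps_max is reached."""
    # For a linear score the loss gradient w.r.t. the input points along -w for class 1
    # and along +w for class 0, so its sign is known in closed form.
    direction = np.where(y[:, None] == 1, -np.sign(w), np.sign(w))
    x_adv = x.copy()
    active = np.ones(len(x), dtype=bool)  # samples that are still classified correctly
    for eps in np.arange(eps_step, eps_max + 1e-9, eps_step):
        x_adv[active] = np.clip(x[active] + eps * direction[active], 0.0, 1.0)
        active = predict(x_adv) == y
        if not active.any():
            break
    return x_adv

def empirical_robustness_sketch(x, y, norm=np.inf):
    """Mean of ||x_adv - x|| / ||x|| over the samples the attack managed to fool."""
    x_adv = minimal_fgsm(x, y)
    fooled = predict(x_adv) != y
    if not fooled.any():
        return 0.0
    pert = np.linalg.norm(x_adv[fooled] - x[fooled], ord=norm, axis=1)
    return float(np.mean(pert / np.linalg.norm(x[fooled], ord=norm, axis=1)))

x = np.array([[0.9, 0.1, 0.8, 0.2], [0.1, 0.9, 0.2, 0.8]])
y = predict(x)  # treat the clean predictions as the labels
robustness = empirical_robustness_sketch(x, y)
print("empirical robustness (toy):", robustness)
```

A larger value means the attacker needs a larger perturbation relative to the input, so the model is harder to fool.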

In [None]:
from art.metrics import empirical_robustness

# wrap the model and calculate empirical robustness
wrapper = KerasClassifier( model=clf, clip_values=(0., 1.) )
print( 'robustness of the undefended model', 
      empirical_robustness( wrapper, x_test, 'fgsm'))

Try different attack parameters and compare the results. 

Tip:

For `hsj` use only a few examples, otherwise it will take a very long time.

In [None]:
### your code goes here
x_small = x_test[ :10 ]


Let's create an adversarial example and see how it looks.
We also want to know how the model performs on adversarial examples, so we will create adversarial examples from the test set and see how the model does on them.

Below you can see the keyword arguments for the attack:

```
norm=np.inf, eps=.3, eps_step=0.1, targeted=False, num_random_init=0, batch_size=1, minimal=False

:param norm: The norm of the adversarial perturbation. Possible values: np.inf, 1 or 2.
:param eps: Attack step size (input variation)
:param eps_step: Step size of input variation for minimal perturbation computation
:param targeted: Indicates whether the attack is targeted (True) or untargeted (False)
:param num_random_init: Number of random initialisations within the epsilon ball. For random_init=0 starting at
    the original input.
:param batch_size: Size of the batch on which adversarial samples are generated.
:param minimal: Indicates if computing the minimal perturbation (True). If True, also define `eps_step` for
    the step size and eps for the maximum perturbation.
```
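To illustrate what `eps` and the clipping range do, here is a minimal NumPy sketch of a single untargeted FGSM step with the `np.inf` norm. It attacks a hypothetical logistic-regression model rather than our CNN, so the weights, data, and function names are assumptions; `art` computes the gradient through the wrapped network instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logistic(x, y, w, b, eps=0.3, clip_min=0.0, clip_max=1.0):
    """One-step untargeted FGSM (L-inf) against a logistic-regression model."""
    p = sigmoid(x @ w + b)                 # predicted probability of class 1
    grad = (p - y)[:, None] * w[None, :]   # gradient of the cross-entropy loss w.r.t. x
    x_adv = x + eps * np.sign(grad)        # move every feature by eps in the loss-increasing direction
    return np.clip(x_adv, clip_min, clip_max)

# hypothetical toy model and data
w = np.array([2.0, -2.0])
b = 0.0
x = np.array([[0.8, 0.2], [0.3, 0.7]])
y = np.array([1.0, 0.0])

x_adv = fgsm_logistic(x, y, w, b, eps=0.4)
clean_pred = (sigmoid(x @ w + b) > 0.5).astype(int)
adv_pred = (sigmoid(x_adv @ w + b) > 0.5).astype(int)
print("clean predictions:      ", clean_pred)
print("adversarial predictions:", adv_pred)
```

Because every feature moves by at most `eps`, the perturbation is bounded in the `np.inf` norm, and clipping keeps the result inside the valid pixel range.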

In [None]:
# create an adversarial example with fgsm and plot it
from art.attacks.evasion import FastGradientMethod
fgsm = FastGradientMethod( wrapper, eps=0.4 )
x_adv = fgsm.generate(x_test[128].reshape((1,28,28,1) ))
print( 'class prediction for the adversarial sample:',
       clf.predict( x_adv.reshape((1,28,28,1) ) ) )
show_image( x_adv )

# create adversarial examples for all of the test set
# your code here
x_test_adv = 
print( 'accuracy on adversarial examples:' )



## Adversarial Training

Let's create a new untrained model with the same architecture that we have been using so far. 

We will train the model using the adversarial training framework. The idea is very simple:

1.   Train the model for 1 epoch
2.   Create adversarial examples using FGSM 
3.   Enhance training data by mixing it with the adversarial examples. (Only mix in the adversarial examples created in this iteration)
4.   Go to step 1

We will be using the FGSM attack from `art` this time.
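The loop described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: a logistic-regression model stands in for the CNN, one full-batch gradient step stands in for an epoch, and the toy data, helper names, and hyperparameters are hypothetical. The exercise below uses the Keras model and the `art` attack instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_epoch(w, b, x, y, lr=0.5):
    """One full-batch gradient-descent step on the cross-entropy loss."""
    p = sigmoid(x @ w + b)
    w = w - lr * x.T @ (p - y) / len(x)
    b = b - lr * np.mean(p - y)
    return w, b

def fgsm(w, b, x, y, eps):
    """One-step FGSM against the current logistic-regression model."""
    p = sigmoid(x @ w + b)
    grad = (p - y)[:, None] * w[None, :]
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# hypothetical two-class toy data in [0, 1]^2 (a stand-in for the MNIST zeros and ones)
x_train = np.vstack([rng.uniform(0.0, 0.4, (100, 2)), rng.uniform(0.6, 1.0, (100, 2))])
y_train = np.append(np.zeros(100), np.ones(100))

w, b = np.zeros(2), 0.0
epochs, ratio, eps = 20, 0.5, 0.1
x_enh, y_enh = x_train, y_train
for _ in range(epochs):
    # 1. train for one epoch on the current (possibly augmented) data
    w, b = train_one_epoch(w, b, x_enh, y_enh)
    # 2. attack the current model on a random subset of the clean training data
    idx = rng.permutation(len(x_train))[: int(ratio * len(x_train))]
    x_adv = fgsm(w, b, x_train[idx], y_train[idx], eps)
    # 3. mix the clean data with the adversarial examples from this iteration only
    x_enh = np.vstack([x_train, x_adv])
    y_enh = np.append(y_train, y_train[idx])

clean_acc = np.mean((sigmoid(x_train @ w + b) > 0.5) == (y_train == 1))
print("clean training accuracy of the toy model:", clean_acc)
```

Note that the adversarial examples keep the correct labels of the samples they were created from, and that they are regenerated against the current model in every iteration.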




In [None]:
# create a new untrained model and wrap it
new_model = mnist_cnn_model( x_train, y_train, x_test, y_test, epochs=0 )
defended_model = KerasClassifier(clip_values=(0,1), model=new_model )
# define the attack we are using
fgsm = FastGradientMethod( defended_model, eps=.4 )

# parameters
epochs = 5 # number of iterations that we will perform training for
ratio = .5  # ratio of the training set that will get turned into adversarial examples
            # each iteration


# some helpers
idx = np.arange( x_train.shape[ 0 ], dtype=int )  # `np.int` was removed in newer NumPy versions

# create variables to hold the training data.
# for now it is just the normal training data. we'll mix in the 
# adversarial examples later
x_train_enhanced = x_train
y_train_enhanced = y_train


for i in range( epochs ):
  # train model for one epoch

  # shuffle   

  # pick the subset of the train data to turn into adversarial examples

  # create adversarial examples

  # add the adversarial examples to the training data


# training is done. let's evaluate the performance on the test set
# and adversarial examples
acc = defended_model._model.evaluate( x_test, y_test )[ 1 ]
print( 'acc on the test data: ', acc )

# and now on adversarial examples
x_test_adv = fgsm.generate( x_test )
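# evaluate the defended model (not the undefended `wrapper`) and keep only the accuracy
acc = defended_model._model.evaluate( x_test_adv, y_test )[ 1 ]
print( 'accuracy on adversarial examples: ', acc )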


To use the adversarial training that comes with `art` we need to pass our wrapped model to an `AdversarialTrainer` instance. The `AdversarialTrainer` also needs an instance of the attack that will be used to create the adversarial examples.

https://adversarial-robustness-toolbox.readthedocs.io/en/latest/modules/defences/trainer.html#art.defences.trainer.AdversarialTrainer

https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/examples/adversarial_training_cifar10.py

In [None]:
from art.defences.trainer import AdversarialTrainer

# get a new untrained model and wrap it
new_model = mnist_cnn_model( x_train, y_train, x_test, y_test, epochs=0 )
defended_model = KerasClassifier(clip_values=(0,1), model=new_model )
# define the attack we are using
fgsm = FastGradientMethod( defended_model, eps=0.4 )

Create the `AdversarialTrainer` instance. 
Train the model and evaluate it on the test data.

In [None]:
# define the adversarial trainer and train the new network
adversarial_trainer = AdversarialTrainer( defended_model, fgsm )
adversarial_trainer.fit( x_train, y_train, batch_size=100, nb_epochs=5 )

# evaluate how good our model is
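print( 'loss and accuracy on the test data: ',
       defended_model._model.evaluate( x_test, y_test ) )

# and now on adversarial examples
x_test_adv = fgsm.generate( x_test )
# evaluate the defended model here, not the undefended `wrapper`
acc = defended_model._model.evaluate( x_test_adv, y_test )
print( 'loss and accuracy on adversarial examples: ', acc )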


Calculate the `empirical robustness` for our now hopefully more robust model.

In [None]:
# calculate the empirical robustness
print( 'robustness of the defended model', 
      empirical_robustness( defended_model, x_test[0:], 'fgsm', {}) )

x_adv = fgsm.generate(x_test[0].reshape((1,28,28,1) ))
# query the defended model here; `clf` is the undefended model from earlier
print( 'class prediction for the adversarial sample:',
       new_model.predict( x_adv.reshape((1,28,28,1) ) ) 
     )
plt.imshow( x_adv.reshape( 28, 28 ), cmap="gray_r" )
plt.axis( 'off' )
plt.show( )

# Defensive Distillation

The idea behind defensive distillation is to transfer robustness from one network to another. To do this we train two networks. The first network, which we will call `one`, is trained normally. We want to transfer some of its *experience* to our second network, called `two`. Both `one` and `two` have the same architecture. We achieve this by training `two` on the outputs of `one`. An important change is that we use a so-called *temperature* parameter `T` in the softmax function.
The process is as follows:


1.   Train `one` at temperature `T`
2.   Create new labels for the training data using `one`
3.   Train `two` at temperature `T` using the new labels


Hints:


*   `tf.math.exp`
*   `keras.backend.in_train_phase`
*   Kullback-Leibler divergence

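Loss: $H(\sigma (z^T/\rho), \sigma (z^S/\rho))$, where the superscripts $T$ and $S$ denote the teacher (`one`) and the student (`two`), $z$ are the logits, $\rho$ is the temperature (the parameter `T` in the text and code), $\sigma$ is the softmax function, and $H$ is the cross-entropy loss.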
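As a rough illustration of the loss above, the following NumPy sketch computes a temperature-scaled softmax and the cross-entropy between teacher and student outputs. The logits and function names are hypothetical; the exercise below does the same with TensorFlow ops inside the Keras model.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # shifting by the max does not change the result
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p_teacher, p_student, tiny=1e-12):
    """H(p_teacher, p_student), averaged over the batch."""
    return float(-np.mean(np.sum(p_teacher * np.log(p_student + tiny), axis=-1)))

teacher_logits = np.array([[8.0, 2.0]])  # hypothetical logits of `one`
student_logits = np.array([[5.0, 1.0]])  # hypothetical logits of `two`

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=10.0)
loss = cross_entropy(soft, softmax_with_temperature(student_logits, temperature=10.0))
print("teacher output at temperature 1: ", hard)
print("teacher output at temperature 10:", soft)
print("distillation loss:", loss)
```

A higher temperature flattens the output distribution, so the soft labels carry more information about how similar the classes look to the teacher than near one-hot outputs would.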

Fig: ![kd](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11263-021-01453-z/MediaObjects/11263_2021_1453_Fig4_HTML.png)

Ref:

* https://arxiv.org/pdf/1503.02531.pdf



In [None]:
import tensorflow as tf
# softmax with temperature
T = 10
def softmax_with_temp( x ):
  return keras.backend.in_train_phase(
      # your code here
      ??? # using temperature,
      tf.nn.softmax( x )
  )
  

# define the classifier one
one = keras.Sequential( )
one.add( Conv2D( 32, kernel_size=(3, 3), activation='relu', input_shape=x_train.shape[ 1: ] ) )
one.add( Flatten( ) )
one.add( Dense( 128, activation='relu' ) )
one.add( Dense( y_train.shape[ 1 ], activation=softmax_with_temp ) )


# train the classifier one and evaluate on clean test data.
# your code here
???

one.summary( )

score = one.evaluate( x_test, y_test )
print( 'Test loss:', score[ 0 ] )
print( 'Test accuracy:', score[ 1 ] )


# test the FGSM attack
one_wrapped = KerasClassifier(clip_values=(0,1), model=one )
fgsm = FastGradientMethod( one_wrapped, eps=0.4 )
x_test_adv = fgsm.generate( x_test )
acc =  one.evaluate( x_test_adv, y_test )
print( 'accuracy on adversarial examples: ', acc )

# create new labels
y_train_new = one.predict( x_train )


# define the classifier two
two = keras.Sequential( )
two.add( Conv2D( 32, kernel_size=(3, 3), activation='relu', input_shape=x_train.shape[ 1: ] ) )
two.add( Flatten( ) )
two.add( Dense( 128, activation='relu' ) )
two.add( Dense( y_train.shape[ 1 ], activation=softmax_with_temp ) )

# train the classifier two and evaluate on clean test data
# your code here
???


two.summary( )
score = two.evaluate( x_test, y_test )
print( 'Test loss:', score[ 0 ] )
print( 'Test accuracy:', score[ 1 ] )


# test the FGSM attack

print( 'accuracy on adversarial examples: ', acc )


# Black box attacks

Assume we do not have access to the internal workings of our target model. This means we cannot easily calculate gradients.
Fortunately or unfortunately, depending on how you look at it, adversarial examples created on one model can also be used against a different model, provided their learned decision boundaries are similar enough.

We do not know what the target model looks like, but in most cases we know the domain that it works in (MNIST in our case), so we can make an educated guess. We then train our own model with the architecture that we guessed and create adversarial examples using this model. If our model and the target model are similar enough, the adversarial examples can be transferred.


In the code below we will train two different models and see whether the adversarial examples transfer from one to the other.

Fig: ![bba](https://miro.medium.com/max/1400/1*6FUwsVaUsrtzmKUO_YP-8A.png)

Ref: https://arxiv.org/abs/1602.02697
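The transfer idea can be sketched with two small linear models. Everything in this block is a hypothetical stand-in: the surrogate and target weights are fixed by hand so that their decision boundaries are similar but not identical, and the attack only ever looks at the surrogate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One-step FGSM computed on the model (w, b) that the attacker has access to."""
    p = sigmoid(x @ w + b)
    grad = (p - y)[:, None] * w[None, :]
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def accuracy(x, y, w, b):
    """Fraction of samples that the model (w, b) classifies correctly."""
    return float(np.mean((x @ w + b > 0) == (y == 1)))

# hypothetical models: the attacker's surrogate and the unseen target
w_surrogate, b_surrogate = np.array([2.0, -2.0]), 0.0
w_target, b_target = np.array([1.5, -2.5]), 0.0

x = np.array([[0.8, 0.2], [0.3, 0.7]])
y = np.array([1.0, 0.0])

# craft the adversarial examples on the surrogate only
x_adv = fgsm(x, y, w_surrogate, b_surrogate, eps=0.4)

clean_acc = accuracy(x, y, w_target, b_target)
transfer_acc = accuracy(x_adv, y, w_target, b_target)
print("target accuracy on clean data:           ", clean_acc)
print("target accuracy on transferred examples: ", transfer_acc)
```

If the target's accuracy drops on examples that were crafted without ever querying its gradients, the attack has transferred.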

In [None]:
import keras
import keras.backend as k
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, Reshape
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# extract ones and zeroes
x_train, y_train = extract_ones_and_zeroes( x_train, y_train )
x_test, y_test = extract_ones_and_zeroes( x_test, y_test )

# we need to bring the data into a format that our cnn likes
y_train = keras.utils.to_categorical( y_train, 2 )
y_test = keras.utils.to_categorical( y_test, 2 )

# convert it to a format keras can work with
x_train, x_test = convert_to_keras_image_format(x_train, x_test)

# Create simple CNN
model_0 = mnist_cnn_model( x_train, y_train, x_test, y_test, epochs=5 )
print( model_0.evaluate( x_test, y_test )[ 1 ] )


# create a simple DNN
model_1 = Sequential()
model_1.add( Reshape(  [28 * 28 ], input_shape=x_train.shape[ 1: ] ) )# flatten the data
model_1.add( Dense( 512, activation='relu' ) ) 
model_1.add( Dense( 128, activation='relu' ) ) 
model_1.add( Dense( 2, activation='softmax' ) ) 

# train model_1
# your code here
???

# compare how the models do on the test set
# your code here
???
print( 'acc model 0: ', acc_0 )
print( 'acc model 1: ', acc_1 )


# compare how the models perform on adversarial examples
# your code here
???
print( 'acc model 0 on adversarial examples: ',  ??? )
print( 'acc model 1 on adversarial examples: ',  ??? )


# let's see how the models do when we give them the adversarial examples 
# created against the other model
print( 'acc model 0 on adversarial examples from model 1: ', ??? )
print( 'acc model 1 on adversarial examples from model 0: ', ??? )
