In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from tensorflow import keras
from tensorflow.keras import backend as K

<!-- requirement: images/noise_0.png -->
<!-- requirement: images/noisy_image_0.png -->
<!-- requirement: images/nn-fool0.jpg -->
<!-- requirement: images/nn-fool1.jpg -->
<!-- requirement: images/negative1.png -->
<!-- requirement: images/negative2.png -->
<!-- requirement: images/subliminal-graffiti-sticker.jpg -->
<!-- requirement: pylib/mnist_dataset.py -->
<!-- requirement: pylib/tf_utils.py -->

# Adversarial Noise


## Fooling Neural Networks


Neural Networks are inspired by our own brains' wiring, but how close is their actual operation?  One way to judge is by looking at how they fail.  If we build an image to fool a neural network, is it also one that would fool a human?

The answer, at present, is a resounding "No".  Consider these images, generated by Nguyen, Yosinski, and Clune in a [recent paper](https://arxiv.org/abs/1412.1897).  They trained a neural network for image recognition, and then built images that the network would classify with extreme confidence.  To us, the images appear to be noise.

![fool0](images/nn-fool0.jpg)
*A neural network classifies each of these images into the class below it with confidence $\ge$ 99.6%.  From Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions
for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR ’15), IEEE, 2015.*

Perhaps this is not entirely surprising.  The net must classify the image as something, and it was not trained to recognize "noise".  Some combination of pixels ought to be able to tickle the right inputs in a way to produce a high-confidence classification.  More surprising, perhaps, are the following images, which produce equally confident classifications, despite having clear patterns and little resemblance to the objects in question.

![fool1](images/nn-fool1.jpg)
*A neural network classifies each of these images into the class below it with confidence $\ge$ 99.6%.  From Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions
for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR ’15), IEEE, 2015.*

This suggests that the features that a neural network is triggering on are in fact significantly different from those that our brains are picking out.  


## Attacking Networks


These results, though interesting, may seem nothing more than an intellectual curiosity.  The images are clearly artificial and can easily be picked out by a human.  It would be more worrying if it were possible to produce images that a human would confidently classify in one category, while a neural network confidently classifies them in another.  Such an image could be used to attack a system involving a classifier without being obvious.

Researchers have managed to produce such attacks.  Starting with an arbitrary image, it is possible to make small, barely noticeable modifications that cause a neural network to change its classification of the image.  The following examples come from a [paper](https://arxiv.org/abs/1312.6199) by Szegedy, *et al*.

![negative1](images/negative1.png)
![negative2](images/negative2.png)

*On the left, sample images correctly classified by [`AlexNet`](https://en.wikipedia.org/wiki/AlexNet).  On the right, distorted images that `AlexNet` classifies as ostriches.  The center images show the differences between the original and modified images, magnified by a factor of 10. From C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, "Intriguing Properties of Neural Networks".  `arXiv`:1312.6199, February 2014.*

In this case, a unique noise was created for each input, but even that is unnecessary.  It is in fact possible to generate a single **adversarial noise** which can cause a neural network to misclassify most input images into whatever class the attacker desires.

In this notebook, we will build an attack against a CNN designed to classify MNIST images.  It will produce a noise that looks like the left image, where red and blue pixels represent positive and negative changes to the pixel intensities.  When added to an MNIST image (right), this noise causes the CNN to classify the image as a zero.

<table>
    <tr>
        <td> <img src="files/images/noise_0.png" style="width: 400px;"/> </td>
        <td> <img src="files/images/noisy_image_0.png" style="width: 400px;"/> </td>
    </tr>
</table>

While adversarial noise is trained against a particular network, it is somewhat robust.  Noise trained on one network has been shown to be able to attack a second network with the same architecture but trained independently, albeit with reduced efficiency.

Such attacks are even possible in the real world.  [Recent work](https://iotsecurity.eecs.umich.edu/#roadsigns) by Evtimov, *et al*, has produced adversarial noise in the form of stickers applied to street signs.  These can reliably cause the misclassification of those signs by a CNN.

![street sign](images/subliminal-graffiti-sticker.jpg)
*A CNN misclassifies this as a "Speed Limit 45" sign in two-thirds of trials.  From I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, D. Song, "Robust Physical-World Attacks on Machine Learning Models". `arXiv`:1707.08945, August 2017.*

## How do you find adversarial noise?


Before attempting to correct for the noise, we first have to find it. We'll get different noise patterns for each class (0-9), so we'll have to calculate adversarial noise 10 times. We will leave this as an exercise for you, but will describe the process: 

1. Change all of the test class labels to a single class (the "adversarial target class"). 
2. Create a new loss function that is the sum of the original loss and an L2-norm penalty on the noise (the sum of its squared values), which keeps the noise small. 
3. Define an optimizer to minimize this loss (like gradient descent) where you change the adversarial noise to increase the number of images classified as the adversarial target class. 
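
The three steps above can be sketched on a toy problem. This is a minimal NumPy illustration, not the Keras implementation used later in this notebook. It assumes a frozen linear softmax classifier with random weights (`W` and `b` are hypothetical stand-ins for a trained network), random stand-in images `X`, and illustrative values for `l2_weight` and `lr`. For brevity it omits the pixel clipping and the noise limit.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pixels, n_classes = 16, 3
W = rng.normal(size=(n_pixels, n_classes))   # frozen "network" weights (assumed)
b = np.zeros(n_classes)
X = rng.uniform(0, 1, size=(200, n_pixels))  # stand-in images, not MNIST

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Step 1: every image gets the same (adversarial) target label.
target = 2
T = np.zeros((len(X), n_classes))
T[:, target] = 1.0

l2_weight = 0.01   # strength of the penalty on the noise (illustrative)
lr = 0.02          # gradient descent step size (illustrative)

def loss_and_grad(noise):
    # Step 2: cross-entropy toward the target plus an L2 penalty on the noise.
    P = softmax((X + noise) @ W + b)
    cross_entropy = -np.mean(np.log(P[:, target] + 1e-12))
    loss = cross_entropy + l2_weight * np.sum(noise ** 2)
    # The same noise is added to every image, so average the per-image gradients.
    grad = ((P - T) @ W.T).mean(axis=0) + 2 * l2_weight * noise
    return loss, grad

# Step 3: gradient descent on the noise only; W and b never change.
noise = np.zeros(n_pixels)
initial_loss, _ = loss_and_grad(noise)
before = np.mean(np.argmax(X @ W + b, axis=1) == target)
for _ in range(1000):
    _, grad = loss_and_grad(noise)
    noise -= lr * grad
final_loss, _ = loss_and_grad(noise)
after = np.mean(np.argmax((X + noise) @ W + b, axis=1) == target)

print("fraction classified as target: %.2f -> %.2f" % (before, after))
```

The key design point is that a single noise vector is shared by all images, which is why the gradient is averaged over the whole batch rather than computed per image.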

## Putting it all together


There are two different optimization procedures in our neural net. The first is the typical procedure where we try to classify the digits and modify the weights and biases of our neural network. The second is trying to find the adversarial noise and is described above. In this procedure, we do not modify the variables of the neural network. 

To make the network immune to noise we have to train it twice: once to find the noise, and a second time to train the network to correctly classify noisy images. In the example below, we do this for target class 3. (3 is similar to many of the other digits, so we can train in fewer steps.) In theory, we would like to do this for all classes, but there are several things to keep in mind:

1. The model's accuracy decreases if we try to make it immune to all classes (0-9). 
2. If we have many classes, making the network immune to all adversarial noise is impractical. 
3. As the model becomes immune to noise, it does not classify clean images as accurately. 

Let's start by loading the data and setting up the network.

In [None]:
from pylib.tf_utils import mnist_test, mnist_train
from tensorflow.keras.utils import to_categorical as one_hot

X_train, y_train = mnist_train()
X_test, y_test = mnist_test()

y_train = one_hot(y_train)
y_test  = one_hot(y_test)

We'll add noise using a custom Keras layer that comes before the main network. The kernel of this layer (the noise) will be zero to start, and we'll make this layer non-trainable for the initial training of the network. In order to ensure that our noisy images remain recognizable to the human eye, we will also limit the magnitude of this noise. Here, we choose this limit to be 0.35. Finally, we want to ensure that the values of our pixels remain between 0 and 1.  
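
As a small worked example of these two limits, the sketch below applies them to four hand-picked pixel values using plain NumPy. The helper name `apply_noise` is illustrative and is not part of the notebook's code.

```python
import numpy as np

noise_limit = 0.35

def apply_noise(image, noise, limit=noise_limit):
    """Add noise bounded to [-limit, limit], keeping pixels in [0, 1]."""
    bounded = np.clip(noise, -limit, limit)
    return np.clip(image + bounded, 0.0, 1.0)

image = np.array([0.0, 0.2, 0.9, 1.0])
noise = np.array([-0.5, 0.5, 0.2, -0.1])

noisy = apply_noise(image, noise)
print(noisy)  # approximately [0.0, 0.55, 1.0, 0.9]
```

The first two noise values exceed the limit and are cut back to ±0.35, and the third pixel would exceed 1 after the addition, so it is clipped to 1.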

In [None]:
from tensorflow.keras.layers import Layer
from tensorflow.keras import regularizers

noise_limit = 0.35

class NoiseLayer(Layer):
    
    def __init__(self, kernel_regularizer=None, noise_limit=noise_limit, **kwargs):
        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.noise_limit = noise_limit
        super().__init__(**kwargs)

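    def build(self, input_shape):
        # The kernel is the learned noise: one value per input pixel.
        # Passing the regularizer here is what attaches the L2 penalty to it.
        self.kernel = self.add_weight(name='kernel', 
                                      shape=input_shape[1:],
                                      initializer='zeros',
                                      regularizer=self.kernel_regularizer)
        super().build(input_shape)

    def call(self, x):
        # Clip into a local tensor instead of overwriting the weight itself.
        noise = K.clip(self.kernel, -self.noise_limit, self.noise_limit)
        return K.clip(x + noise, 0, 1)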

    def compute_output_shape(self, input_shape):
        return input_shape

The main network consists of two convolutional layers followed by a dense hidden layer and the output layer.

In [None]:
N_CLASSES = 10

filt_size = [5, 5]
img_size = 28
out_sizes = [32, 64, 1024]

In [None]:
model = keras.models.Sequential()
model.add(NoiseLayer(kernel_regularizer=regularizers.l2(0.0000025)))
model.add(keras.layers.Reshape([img_size, img_size, 1]))

for out_size in out_sizes[:-1]:
    model.add(keras.layers.Conv2D(out_size, filt_size, padding='same',
                                  activation='relu'))
    model.add(keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2,2),
                                        padding='same'))
    
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(out_sizes[-1], activation='relu'))

model.add(keras.layers.Dense(N_CLASSES, activation='softmax'))
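
To see what these layer sizes imply, the following plain-Python sketch traces the feature-map size through the two convolution and pooling stages and counts the trainable parameters of the main network. It assumes the settings above (`'same'` padding, 2×2 pooling with stride 2) and leaves out the 784 values of the noise layer.

```python
img_size = 28
filt_size = 5
out_sizes = [32, 64, 1024]
n_classes = 10

side, channels, n_params = img_size, 1, 0
for filters in out_sizes[:-1]:
    # A 'same' convolution keeps the spatial size; count weights plus biases.
    n_params += filt_size * filt_size * channels * filters + filters
    channels = filters
    # 2x2 max pooling with stride 2 and 'same' padding: ceil(side / 2).
    side = -(-side // 2)

flat = side * side * channels                            # size after Flatten
n_params += flat * out_sizes[-1] + out_sizes[-1]         # dense hidden layer
n_params += out_sizes[-1] * n_classes + n_classes        # output layer

print(side, flat, n_params)  # 7 3136 3274634
```

Almost all of the parameters sit in the dense hidden layer, which connects the 3136 flattened features to 1024 units.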

Below we write a function that can implement two different modes of training:

* `main`: &nbsp; When training the main network, all layers except the first (the noise layer) are trainable, and we use the usual targets.
* `noise`: When training the noise, only the first layer is trainable, and we change all targets to equal the adversarial target class.

We don't use the usual targets when training the noise, because we aren't trying to make correct predictions at this stage. Instead, we are trying to 'hack' the network to always predict `adversary_target_class`.

Note that calling `model.compile` more than once does _not_ reset the weights of the network. Also note that the noise layer includes some mild regularization. This does not affect `main` training, but it helps to limit the kernel that will be learned in `noise` training. We want to learn the smallest values of noise which will result in the best (mis)classification.
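
As an illustration of the `noise` targets, the sketch below builds one-hot labels that assign every sample to the same class. The helper name `adversarial_targets` is hypothetical; the notebook's `train` function builds the same kind of array with `np.ones` and `one_hot`.

```python
import numpy as np

def adversarial_targets(n_samples, target_class, n_classes=10):
    """One-hot targets that label every sample as `target_class`."""
    targets = np.zeros((n_samples, n_classes), dtype=np.float32)
    targets[:, target_class] = 1.0
    return targets

targets = adversarial_targets(4, target_class=3)
print(targets.argmax(axis=1))  # [3 3 3 3]
```

Because every target equals the adversarial class, the accuracy reported during `noise` training is the fraction of images the noise pushes into that class, not the fraction classified correctly.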

In [None]:
def train(mode, model, X_train, y_train, X_test, y_test, adversary_target_class=0, epochs=1):
    
    # Set the appropriate layers as trainable: in "noise" mode only the
    # noise layer (index 0) is trained; in "main" mode only the others are.
    train_noise = (mode == "noise")
    for n, layer in enumerate(model.layers):
        is_noise_layer = (n == 0)
        layer.trainable = (is_noise_layer == train_noise)
    
    # Set the appropriate targets 
    if mode == "noise":
        target_train = np.ones(y_train.shape[0]) * adversary_target_class
        target_test  = np.ones(y_test.shape[0]) * adversary_target_class
        target_train = one_hot(target_train, 10)
        target_test  = one_hot(target_test, 10)
    else:
        target_train = y_train
        target_test  = y_test
            
    # Compile and train
    model.compile(loss='categorical_crossentropy',
                  optimizer=keras.optimizers.Adam(), 
                  metrics=['accuracy'])            
                                               
    history = model.fit(X_train, target_train,            
                    epochs=epochs,                 
                    batch_size=100,
                    validation_data=(X_test, target_test))

First, we train the main part of the network to classify digits without noise.

In [None]:
train("main", model, X_train, y_train, X_test, y_test)

The noise term was initialized to zero.  (This is why it didn't disrupt the training above.)  We'll get slightly better performance if we start it off with some random values.

In [None]:
np.random.seed(30)
noise_init = np.random.uniform(-noise_limit/2, noise_limit/2, size=(28*28,))
model.layers[0].set_weights([noise_init])

This random noise doesn't particularly bother the classifier.

In [None]:
def predict(idx):
    # Add a leading batch dimension so the model sees a batch of one image.
    image = X_test[idx][np.newaxis, ...]
    return np.argmax(model.predict(image)[0])

idx = 0
actual = np.argmax(y_test[idx])
print ("Predicted: %d, Actual: %d" % (predict(idx), actual))
plt.imshow((X_test[idx]+noise_init).reshape((img_size,img_size)),
           cmap=plt.cm.gray_r, interpolation='nearest')

But by training with `adversary_target_class=3`, we can tune the noise to force classification of the images as threes.

In [None]:
train("noise", model, X_train, y_train, X_test, y_test, adversary_target_class=3)

The noise displays hints of a three, but it is mostly random.

In [None]:
noise = model.layers[0].get_weights()[0]
plt.imshow(noise.reshape((img_size,img_size)), interpolation='nearest',
           cmap='seismic', vmin=-1.0, vmax=1.0)
plt.show()

When it is combined with an image, our classifier is fooled.  But when we look at the image, it's still clearly a seven.

In [None]:
idx = 0
actual = np.argmax(y_test[idx])
print ("Predicted: %d, Actual: %d" % (predict(idx), actual))

plt.imshow((np.clip(X_test[idx]+noise,0,1)).reshape((img_size,img_size)),
           cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
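
A single image is only anecdotal. One way to quantify the attack is to measure the fraction of images assigned to the target class. The sketch below assumes a hypothetical helper, `target_hit_rate`, and demonstrates it on a dummy classifier rather than on the CNN. With this notebook's model the noise layer is part of the network, so one would presumably pass `model.predict` and the clean `X_test` directly.

```python
import numpy as np

def target_hit_rate(predict_fn, images, target_class):
    """Fraction of `images` that `predict_fn` assigns to `target_class`."""
    probabilities = np.asarray(predict_fn(images))
    predicted = np.argmax(probabilities, axis=1)
    return float(np.mean(predicted == target_class))

def dummy_predict(images):
    # Stand-in classifier (not the CNN): predicts class 3 for bright images
    # and class 7 for dark ones.
    images = np.asarray(images)
    probabilities = np.zeros((len(images), 10))
    bright = images.mean(axis=1) > 0.5
    probabilities[bright, 3] = 1.0
    probabilities[~bright, 7] = 1.0
    return probabilities

images = np.array([[0.9, 0.8], [0.1, 0.2], [0.7, 0.6], [0.6, 0.9]])
rate = target_hit_rate(dummy_predict, images, target_class=3)
print(rate)  # 0.75
```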

By doing additional training in `main` mode, we immunize the classifier against this noise.

In [None]:
train("main", model, X_train, y_train, X_test, y_test)

Now the classifier works on the noisy image.

In [None]:
idx = 0
actual = np.argmax(y_test[idx])
print ("Predicted: %d, Actual: %d" % (predict(idx), actual))

plt.imshow((np.clip(X_test[idx]+noise,0,1)).reshape((img_size,img_size)),
           cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

## Exercise: Extending immunity


Make the network immune to all target classes. How does the accuracy of the model change?

*Copyright &copy; 2018 The Data Incubator.  All rights reserved.*