# Week 7 Homework: Adversarial Attacks

This week, we'll be talking about adversarial examples: inputs that have been altered just enough to fool a neural network. Here, we fool the network into thinking that an image of an abnormal skin lesion is totally healthy--super scary stuff!

In [0]:
import numpy as np
import os, scipy.ndimage, scipy.misc
import matplotlib.pyplot as plt
from skimage import util
import requests, zipfile, io

import keras
import keras.backend as K
from keras.models import load_model
from keras.losses import binary_crossentropy
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# Download and extract data.
r = requests.get("http://web.stanford.edu/class/cs21si/resources/unit4_resources.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

## Load the Model

First, we load a model that was pre-trained on this task and check out the summary.

In [0]:
model = load_model('unit4_resources/trained_model.h5')
model.summary()

## Pick a Test Case

We now pick an image that we are going to play with. This image is an abnormal sample from the test set--in other words, it shows an unhealthy lesion on the patient's skin. Notice that the output the model predicts is above 0.5, so it thinks that this is unhealthy. We will make the model believe this is healthy!

In [0]:
base_path = 'unit4_resources/addi/test/abnormal/'
img_path = base_path + os.listdir(base_path)[3]
img = scipy.ndimage.imread(img_path).astype(float)
img = scipy.misc.imresize(img, (150, 150, 3))[np.newaxis, :, :, :]/255.0

plt.imshow(np.squeeze(img))
plt.show()

model.predict(img)[0, 0]
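
`scipy.ndimage.imread` and `scipy.misc.imresize` have been removed from recent SciPy releases, so the loading lines above may fail on a newer install. Here is a minimal sketch of an alternative loader, assuming Pillow is installed. `load_image` is a hypothetical helper, not part of the course resources, and its pixel values may differ somewhat from what `imresize` produced.

```python
import numpy as np
from PIL import Image


def load_image(path, size=(150, 150)):
    """Load an image as a float array of shape (1, height, width, 3) in [0, 1].

    Hypothetical helper for illustration; `size` is (width, height).
    """
    with Image.open(path) as im:
        # Force three channels, then resize with Pillow's default resampling.
        resized = im.convert("RGB").resize(size)
        arr = np.asarray(resized).astype(np.float64) / 255.0
    # Add the leading batch dimension that model.predict expects.
    return arr[np.newaxis, :, :, :]
```

With this helper, `img = load_image(img_path)` would stand in for the two loading lines in the cell above.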

## Untargeted Noise

Before we dive into how adversarial examples work, try this out: just initialize some random noise and add it to the image above. Then, see how the model does. Mess around with the scaling/shifting factor of the noise to try to get the score below 0.5 (making the model think it is healthy). This is called an untargeted attack--we don't know what output the model will give this noisy input.

In [0]:
# TODO: Change the scale and shift to fool the model
scale, shift = 0, 0
noise = np.random.rand(1, 150, 150, 3) * scale + shift

# Add the noise to the image
noisy_img = np.clip(img + noise, 0, 1)
plt.imshow(np.squeeze(noisy_img))
plt.show()

# Have the model predict on the image
model.predict(noisy_img)[0, 0]
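
To get a feel for what `scale` and `shift` do before trying values on the real image, here is a small NumPy-only sketch. Since `np.random.rand` draws from [0, 1), the noise lies in [shift, shift + scale). The flat gray stand-in image, the `make_noise` helper, and the specific values are assumptions for illustration only.

```python
import numpy as np


def make_noise(shape, scale, shift, seed=0):
    """Uniform noise in [shift, shift + scale); hypothetical helper."""
    rng = np.random.RandomState(seed)
    return rng.rand(*shape) * scale + shift


# Stand-in for the real image: a flat gray picture with values in [0, 1].
demo_img = np.full((1, 150, 150, 3), 0.5)

# shift = -scale / 2 centers the noise around zero.
demo_noise = make_noise(demo_img.shape, scale=0.4, shift=-0.2)
demo_noisy = np.clip(demo_img + demo_noise, 0, 1)

print("noise range:", demo_noise.min(), demo_noise.max())
print("largest pixel change:", np.abs(demo_noisy - demo_img).max())
```

A positive `shift` brightens the whole image, a negative one darkens it, and `scale` controls how grainy the result looks.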

## Targeted Noise

Now for the interesting stuff: we will change the image until we fool the model. How does this work? It's quite simple: we compute the gradient of the loss against the normal (healthy) label with respect to the input image. This tells us in which direction to change each pixel to raise the normal class score--i.e. how to change the unhealthy image so that it looks healthy to the model. We then take a small step in that direction and repeat. That's it!

Notice how the score decreases as this code runs. Mess with the perturbation amount (`pertubation_amount` in the code) to try to get the model output below 0.5 on this input. This means that the model now thinks that the unhealthy image is healthy!
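
The same idea can be sketched without Keras on a toy model. This is a minimal illustration, assuming a logistic-regression "network" with made-up weights `w` and bias `b`; `targeted_attack` and all values here are hypothetical and stand in for the real model. For a target label of 0, the cross-entropy loss is `-log(1 - p)`, and its gradient with respect to the input is `p * w`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def targeted_attack(x, w, b, step=0.005, max_steps=15):
    """Nudge x until the toy model's score drops to 0.5 or below."""
    x = x.copy()
    for _ in range(max_steps):
        p = sigmoid(w @ x + b)
        if p <= 0.5:
            break  # the toy model is fooled
        grad = p * w           # gradient of -log(1 - p) with respect to x
        x = x - step * grad    # step against the gradient
    return x


rng = np.random.RandomState(0)
w = rng.randn(100)          # made-up weights
x = rng.rand(100)           # made-up "image" with 100 pixels
b = 2.0 - w @ x             # bias chosen so the starting score is sigmoid(2)

x_adv = targeted_attack(x, w, b)
print("score before:", sigmoid(w @ x + b))
print("score after: ", sigmoid(w @ x_adv + b))
print("largest pixel change:", np.abs(x_adv - x).max())
```

The Keras cell below follows the same loop, except that the gradient comes from `K.gradients` instead of a closed-form expression.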

In [0]:
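pertubation_amount = 0.1

def add_targetted_noise(fooling_img):
  # Build the gradient function once instead of on every iteration.
  y_true = K.placeholder((1, 1))
  loss = K.mean(binary_crossentropy(y_true, model.output))
  get_grads = K.function([model.input, y_true], K.gradients(loss, model.input))

  for _ in range(15):
    # Gradient of the loss against the target label 0 (healthy).
    grad = get_grads([fooling_img, [[0]]])[0]
    # Step against the gradient to push the prediction toward healthy.
    fooling_img = fooling_img - pertubation_amount*grad

    # Stop as soon as the model is fooled.
    pred = model.predict(fooling_img)[0, 0]
    if pred <= 0.5:
      break

  return fooling_img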


fooling_img = add_targetted_noise(img.copy())

## Visualize the "Fooling" Image

Let's see how this adversarial example looks!

In [0]:
plt.imshow(np.squeeze(fooling_img))
plt.show()
pred = model.predict(fooling_img)[0, 0]
print("Prediction using this image: ", pred)

Wow, this looks almost exactly like the input image! However, our model thinks that this is healthy when it is clearly not. In just a couple of steps, you were able to take an unhealthy image and make it look healthy to the network. That is super scary! This is something to keep in mind as you design your high-stakes ML algorithms.

However, there is something you can do to help mitigate this issue: train on images that have been perturbed in this way. Let's do this now by creating fooling images out of 10 of the abnormal images and then training on them:

In [0]:
fooling_batch = []
for i in range(10):
  img_path = base_path + os.listdir(base_path)[i]
  img = scipy.ndimage.imread(img_path).astype(float)
  img = scipy.misc.imresize(img, (150, 150, 3))[np.newaxis, :, :, :]/255.0
  
  fooling_img = add_targetted_noise(img.copy())
  fooling_batch.append(fooling_img)

Let's test how the model does on the last 2 out of 10 examples:

In [0]:
model.evaluate(np.array(fooling_batch)[8:].squeeze(), [1 for _ in range(2)])

It fails for both of them! Now, fine-tune the model (further train a model that has already been trained) on the first 8 out of 10 images and then see how it does on the final 2. Hint: call *model.train_on_batch*!

In [0]:
# Load a fresh model so you can play around with this cell many times
model = load_model('unit4_resources/trained_model.h5')

# Fine-Tune the Model
def fine_tune(images, labels):
  
  for _ in range(50):
    ### YOUR CODE HERE ### (1 line)
    
    ### END CODE ###

fine_tune(np.array(fooling_batch[:8]).squeeze(), [1 for _ in range(8)])
    
# Evaluate on the final 2 examples
model.evaluate(np.array(fooling_batch)[8:].squeeze(), [1 for _ in range(2)])

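So we see that training on adversarial examples is one way to help protect our models against them. Keep in mind that two held-out images are far too few to measure robustness, so treat this as a demonstration rather than a guarantee.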
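
The whole procedure (craft adversarial examples, hold some out, fine-tune on the rest, re-evaluate) can be sketched on a toy model. This is a minimal NumPy illustration under stated assumptions: the hand-set logistic-regression weights `w0` and `b0`, the synthetic "abnormal" samples, and the helpers `attack`, `fine_tune`, and `accuracy` are all hypothetical and do not use the course's Keras model.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict(X, w, b):
    return sigmoid(X @ w + b)


def attack(x, w, b, step=0.1, max_steps=100):
    """Push one sample toward label 0 until the score is at most 0.5."""
    x = x.copy()
    for _ in range(max_steps):
        p = predict(x, w, b)
        if p <= 0.5:
            break
        x -= step * p * w  # gradient of -log(1 - p) with respect to x is p * w
    return x


def fine_tune(X, y, w, b, lr=0.1, steps=50):
    """Plain gradient descent on the cross-entropy loss."""
    w = w.copy()
    for _ in range(steps):
        err = predict(X, w, b) - y  # derivative of the loss w.r.t. the logit
        w -= lr * X.T @ err / len(y)
        b -= lr * err.mean()
    return w, b


def accuracy(X, y, w, b):
    return float(np.mean((predict(X, w, b) > 0.5) == y))


rng = np.random.RandomState(0)
d = 50
w0 = np.full(d, 0.2)   # hand-set "pre-trained" weights
b0 = -0.2 * d * 0.5    # decision boundary at a mean pixel value of 0.5
abnormal = 0.7 + rng.uniform(-0.1, 0.1, size=(10, d))  # label-1 samples
labels = np.ones(10)

adv = np.array([attack(x, w0, b0) for x in abnormal])

acc_before = accuracy(adv[8:], labels[8:], w0, b0)
w1, b1 = fine_tune(adv[:8], labels[:8], w0, b0)
acc_after = accuracy(adv[8:], labels[8:], w1, b1)
print("held-out accuracy before / after fine-tuning:", acc_before, acc_after)
```

In this toy setup the held-out adversarial examples are classified correctly again after fine-tuning, but a linear model on synthetic data says little about how well the defense holds up for a real network.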

If you want to know more about adversarial examples and how we can protect against them, check out [this paper](https://arxiv.org/pdf/1712.07107.pdf). For a more casual read, check out [this neat blog post](https://www.anishathalye.com/2017/07/25/synthesizing-adversarial-examples/) on synthesizing adversarial examples.