## Experimental Work and The Transferability Principle
```In this exercise you will gain experience with real experimental work and meet a very interesting topic: adversarial AI and the transferability principle. You will define experiments and measures of success, and then run those experiments. It is very important that you discuss this exercise with your tutor, including while you are working on it.```

```~Ohad Amosi & Ittai Haran```

```Read the paper "Adversarial Examples Are Not Bugs, They Are Features", which you can find in this exercise directory. Read it thoroughly. Make sure you understand how the datasets of experiments #1 and #2 were generated. As you can tell, the paper reports very interesting results.```

```If you are not familiar with the concept of model distillation, read about it. The question arises: how can we be sure that the "non-robust" features described in the paper are real? Could it be that the effect the paper measures is just model distillation? Think about it: in both experiments the authors used a trained robust network to label images, on which another network was then trained normally. Could it be that the robust network was somehow "leaked"? Think about this possibility, and how it could have happened in experiment #1 and experiment #2.```

## Answer

In distillation training, one model (the student) is trained to predict the output probabilities of another model (the teacher), which was itself trained earlier in the standard way, on the true labels, to maximize accuracy.
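
The definition above can be made concrete with a small NumPy sketch of a distillation loss: the cross-entropy between the teacher's softened probabilities and the student's. The function names, the example logits and the temperature value are illustrative assumptions, not taken from the paper or from the notebook code below.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature (numerically stable)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean cross-entropy between teacher soft labels and student predictions."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return float(-np.mean(np.sum(soft_targets * np.log(student_probs + 1e-12), axis=1)))

# Hypothetical logits for two samples and three classes.
teacher_logits = np.array([[5.0, 1.0, 0.0], [0.0, 4.0, 1.0]])
matching_student = teacher_logits.copy()             # student agrees with teacher
mismatched_student = teacher_logits[:, ::-1].copy()  # student disagrees

loss_match = distillation_loss(matching_student, teacher_logits, temperature=2.0)
loss_mismatch = distillation_loss(mismatched_student, teacher_logits, temperature=2.0)
print(loss_match, loss_mismatch)
```

The loss is smallest when the student reproduces the teacher's probabilities, and a higher temperature makes the teacher's soft labels less peaked.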

In the first experiment, a robust model is used to create a robust dataset.
Each sample x in the original dataset is replaced by a sample x_r, chosen so that x and x_r are very close in the latent (representation) space of the robust model.
Only the representation layer of the robust model is used, not its output probabilities, so this is not distillation training in the usual sense. Furthermore, the new sample x_r lies in the input space, not in the latent space.
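
A minimal sketch of this idea: start from an unrelated input and run gradient descent in input space until its representation matches that of x. A fixed random linear map stands in for the robust model's representation layer; `W`, `g`, the dimensions and the step size are all assumptions for illustration. The paper's actual procedure uses a trained robust network.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: a fixed random linear map stands in for the robust model's
# representation layer g; the real experiment uses a trained robust network.
W = rng.normal(size=(8, 64))

def g(x):
    """Hypothetical representation function: input space (64) -> latent space (8)."""
    return W @ x

x = rng.uniform(size=64)    # original sample
x_r = rng.uniform(size=64)  # start from an unrelated input (a different "image")

# Gradient descent on ||g(x_r) - g(x)||^2 with respect to x_r.
# The step 1 / (2 * sigma_max^2) guarantees the loss never increases.
step = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
for _ in range(500):
    grad = 2.0 * W.T @ (g(x_r) - g(x))
    x_r = x_r - step * grad

latent_gap = np.linalg.norm(g(x_r) - g(x))
input_gap = np.linalg.norm(x_r - x)
print(f"latent gap: {latent_gap:.2e}, input gap: {input_gap:.2f}")
```

In this toy setting x_r ends up matching x in the latent space while remaining a different point in the input space, which is the distinction the answer relies on.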

Experiment #2 does not use a robust model. Instead, it creates adversarial examples to show that good classification can be achieved based only on non-robust features.

```All in all, we will examine two conjectures:```
1. ```The paper is great, non-robust features are the real thing and different networks use the same features.```
2. ```The paper is wrong, it's only a fancy form of network distillation.```

```Let's first explore the concept of distilling networks. Can we do it in any case? Can you distill a network using only, for example, white noise? Can you do it using only the predictions? Or do you maybe need to use the logits of the network? Answer this question. MNIST might help you with that.```

In [1]:
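import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Import everything from tensorflow.keras to avoid mixing it with standalone keras.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Layer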

In [2]:
class DivideLayer(Layer):
    """Divides its input by `temperature` during training; identity at inference."""

    def __init__(self, temperature=1, **kwargs):
        super().__init__(**kwargs)
        self.temperature = temperature

    def call(self, inputs, training=None):
        if training:
            return inputs / self.temperature
        return inputs

In [3]:
def train_model(x_train, y_train, input_shape, num_classes, temperature=1):
    model = keras.Sequential(
        [
            keras.Input(shape=input_shape),
            layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Flatten(),
            layers.Dropout(0.5),
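            # Scales the Dense layer's input, so during training the logits are
            # divided by the temperature (except for the bias term).
            DivideLayer(temperature),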
            layers.Dense(num_classes, activation="softmax"),
        ]
    )
    
    es = EarlyStopping(monitor='val_accuracy', mode='max', verbose=1, patience=3)
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
#     model.summary()
    model.fit(x_train, y_train, batch_size=256, epochs=15, validation_split=0.1, callbacks=[es])
    
    return model

In [4]:
def TestAttack(model, adv_images, orig_images, true_labels, target_labels=None, targeted=False):
    # Evaluate the model on the adversarial images against the true labels.
    adv_images = adv_images.numpy()
    score = model.evaluate(adv_images, true_labels, verbose=0)

    print('Test loss: {:.2f}'.format(score[0]))
    print('Successfully moved out of source class: {:.2f}'.format(1 - score[1]))

    if targeted:
        # For targeted attacks, accuracy against the target labels is the success rate.
        score = model.evaluate(adv_images, target_labels, verbose=0)
        print('Test loss: {:.2f}'.format(score[0]))
        print('Successfully perturbed to target class: {:.2f}'.format(score[1]))

    # Mean over images of the per-image RMS pixel perturbation.
    dist = np.mean(np.sqrt(np.mean(np.square(adv_images - orig_images), axis=(1,2,3))))
    print('Mean perturbation distance: {:.2f}'.format(dist))

In [5]:
def FastGradientSignMethod(model, input_image, input_label, eps=0.3):
    loss_object = tf.keras.losses.CategoricalCrossentropy()
    
    with tf.GradientTape() as tape:
        tape.watch(input_image)
        prediction = model(input_image)
        loss = loss_object(input_label, prediction)

    gradient = tape.gradient(loss, input_image)
    signed_grad = tf.sign(gradient)
    adv_x = input_image + eps*signed_grad
    
    return adv_x
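
In symbols, the function above performs one untargeted fast gradient sign step. The notation is an assumption introduced here: $f$ is the model, $\mathcal{L}$ the categorical cross-entropy, $y$ the one-hot true label and $\epsilon$ the `eps` argument.

$$
x_{\mathrm{adv}} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(f(x), y)\big)
$$

Each pixel moves by at most $\epsilon$, so $\lVert x_{\mathrm{adv}} - x \rVert_\infty \le \epsilon$. Note that the code does not clip the result back to the pixel range $[0, 1]$.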

In [6]:
def attack_model(test_images, test_labels, model, eps=0.3):
    test_images = tf.convert_to_tensor(test_images)

    adv_images = FastGradientSignMethod(model, test_images, test_labels, eps=eps)
    TestAttack(model, adv_images, test_images.numpy(), test_labels, targeted=False)

In [12]:
def train_distillation(temperature=1, use_proba=True, white_noise_input=False, 
                       number_white_noise=60000):
    num_classes = 10
    input_shape = (28, 28, 1)
    
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    x_train = x_train.astype("float32") / 255
    x_test = x_test.astype("float32") / 255
    x_train = np.expand_dims(x_train, -1)
    x_test = np.expand_dims(x_test, -1)

    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes)
    print('x_train', x_train.shape)
    print("Training teacher model")
    teacher = train_model(x_train, y_train, input_shape, num_classes, temperature=temperature)
    
    _, acc = teacher.evaluate(x_test, y_test, batch_size=512)
    print('Teacher test accuracy:', acc)

    if white_noise_input is True:
        # Replace the real images with binary white noise; the teacher labels it below.
#         x_train = np.random.normal(size=(60000, 28, 28, 1))
        x_train = np.random.randint(0, 2, size=(number_white_noise, 28, 28, 1)).astype("float32")

    # Teacher outputs on the student's training inputs (real images or noise).
    train_probas = teacher.predict(x_train)
    if use_proba is not True:
        # Keep only the teacher's hard predictions, one-hot encoded.
        print(train_probas[0, :])
        train_probas = np.argmax(train_probas, axis=1)
        train_probas = tf.one_hot(train_probas, depth=num_classes)
        print(train_probas[0, :])

    print("Training student model")
    student = train_model(x_train, train_probas, input_shape, num_classes, temperature=temperature)
    
    _, acc = student.evaluate(x_test, y_test, batch_size=512)
    print('Student test accuracy:', acc)

    print('Attacking teacher model')
    attack_model(x_test, y_test, teacher, eps=0.3)
    print('Attacking student model')
    attack_model(x_test, y_test, student, eps=0.3)

    return teacher, student

In [20]:
teacher, student = train_distillation(temperature=10, use_proba=True)

Training teacher model
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Teacher test accuracy: 0.9860000014305115
Training student model
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Student test accuracy: 0.9850999712944031
Attacking teacher model
Test loss: 23.87
Successfully moved out of source class: 0.36
Mean perturbation distance: 0.11
Attacking student model
Test loss: 20.95
Successfully moved out of source class: 0.22
Mean perturbation distance: 0.07


In [20]:
teacher, student = train_distillation(temperature=10, use_proba=False)

Training teacher model
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Teacher test accuracy: 0.9878000020980835
[0.000000e+00 0.000000e+00 0.000000e+00 4.972901e-16 0.000000e+00
 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00]
tf.Tensor([0. 0. 0. 0. 0. 1. 0. 0. 0. 0.], shape=(10,), dtype=float32)
Training student model
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Student test accuracy: 0.984000027179718
Attacking teacher model
Test loss: 24.21
Successfully moved out of source class: 0.34
Mean perturbation distance: 0.11
Attacking student model
Test loss: 21.91
Successfully moved out of source class: 0.24
Mean perturbation distance: 0.07


In [13]:
teacher, student = train_distillation(temperature=10, use_proba=True, white_noise_input=True,
                                     number_white_noise=100000)

x_train (60000, 28, 28, 1)
Training teacher model
Epoch 1/15
Epoch 2/15
Epoch 3/15


KeyboardInterrupt: 

```Distilling using only white noise is a tough question. Why is that? Think about the concepts of in-distribution and out-of-distribution data when answering this.```
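
One way to approach this: white-noise images lie far outside the teacher's training distribution, so the teacher's outputs on them may collapse onto a few classes and carry little information for the student. The sketch below is a possible diagnostic, not part of the original exercise. `summarize_teacher_outputs` is a hypothetical helper, and the two synthetic arrays stand in for what `teacher.predict` might return on noise.

```python
import numpy as np

def summarize_teacher_outputs(probas, eps=1e-12):
    """Return (class_frequencies, mean_entropy) for an (n, num_classes) array."""
    num_classes = probas.shape[1]
    hard = np.argmax(probas, axis=1)
    class_freq = np.bincount(hard, minlength=num_classes) / len(hard)
    entropy = -np.sum(probas * np.log(probas + eps), axis=1)
    return class_freq, float(entropy.mean())

# ASSUMPTION: synthetic stand-ins for teacher outputs, for illustration only.
collapsed = np.tile(np.eye(10)[8], (1000, 1))   # every input mapped to class 8
rng = np.random.default_rng(0)
diverse = rng.dirichlet(np.ones(10), size=1000)  # varied soft labels

freq_c, ent_c = summarize_teacher_outputs(collapsed)
freq_d, ent_d = summarize_teacher_outputs(diverse)
print("collapsed:", freq_c.max(), ent_c)
print("diverse:  ", freq_d.max(), ent_d)
```

If the real teacher's outputs on noise look like the collapsed case, the labels give the student almost nothing to learn from, whether probabilities or hard predictions are used.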

```Assuming conjecture #2 and regarding experiment #1, the distillation isn't happening on white noise, but not on real images either. Think of an experiment that will help you decide whether the phenomenon the authors encountered is indeed just network distillation. Think about how to measure your success. Open the hint only if you can't think of a way to do it.```

```You can, for example, take a dataset such as CIFAR10, train a network on 5 classes and try to distill it using the other 5 classes. That way you are using real images, but from different distributions.```
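
A minimal sketch of the data split the hint describes. `split_by_classes` is a hypothetical helper, and random arrays with CIFAR10-like shapes stand in for the real data, which could be loaded with `keras.datasets.cifar10.load_data()`.

```python
import numpy as np

def split_by_classes(images, labels, teacher_classes):
    """Split a dataset into (teacher set, transfer set) by class membership."""
    labels = np.asarray(labels).reshape(-1)  # CIFAR10 labels have shape (n, 1)
    teacher_mask = np.isin(labels, teacher_classes)
    teacher_set = (images[teacher_mask], labels[teacher_mask])
    transfer_set = (images[~teacher_mask], labels[~teacher_mask])
    return teacher_set, transfer_set

# ASSUMPTION: random stand-in data with CIFAR10-like shapes.
rng = np.random.default_rng(0)
images = rng.uniform(size=(200, 32, 32, 3)).astype("float32")
labels = rng.integers(0, 10, size=(200, 1))

(teacher_x, teacher_y), (transfer_x, transfer_y) = split_by_classes(
    images, labels, teacher_classes=[0, 1, 2, 3, 4]
)
print(teacher_x.shape, transfer_x.shape)
```

Under this setup, the teacher would be trained on `teacher_x`, the student on `transfer_x` labeled by the teacher's predictions, and one possible measure of success is the student's accuracy on held-out images from the teacher's five classes.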

```Let's focus now on conjecture #1. Conduct an experiment that will demonstrate that two different networks indeed use the same features. Do it gradually: Start with two copies of the same architecture and on the same data, and move to different architectures and different subsets of the data (but from the same distribution). Think how to measure your success.```
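
One possible success measure is the transfer rate: among adversarial examples that fool the network they were crafted on, the fraction that also fool a second network. The sketch below assumes integer class predictions are already available. `transfer_metrics` and the toy arrays are hypothetical, not results from the notebook.

```python
import numpy as np

def transfer_metrics(true_labels, source_clean_pred, source_adv_pred, target_adv_pred):
    """Return (transfer_rate, same_class_rate) over successful source attacks."""
    true_labels = np.asarray(true_labels)
    source_clean_pred = np.asarray(source_clean_pred)
    source_adv_pred = np.asarray(source_adv_pred)
    target_adv_pred = np.asarray(target_adv_pred)

    # Successful source attacks: correct on the clean image, wrong after perturbation.
    fooled_source = (source_clean_pred == true_labels) & (source_adv_pred != true_labels)
    if not fooled_source.any():
        return float("nan"), float("nan")

    # Fraction of those examples that the target model also misclassifies.
    fooled_target = target_adv_pred[fooled_source] != true_labels[fooled_source]
    # Fraction on which both models predict the same (wrong) class.
    same_class = target_adv_pred[fooled_source] == source_adv_pred[fooled_source]
    return float(fooled_target.mean()), float(same_class.mean())

# Toy predictions for four examples (illustrative only).
rate, same = transfer_metrics(
    true_labels=[0, 1, 2, 3],
    source_clean_pred=[0, 1, 2, 3],
    source_adv_pred=[1, 1, 0, 0],
    target_adv_pred=[1, 1, 2, 1],
)
print(rate, same)
```

A useful control is to compute the same rate for random perturbations of the same size: if different networks really rely on the same features, adversarial perturbations should transfer much more often than random ones.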