<a href="https://colab.research.google.com/github/inspire-lab/CyberAI-labs/blob/main/category-SecureAI/Poisoning-attack-backdoor/poisoning_backdoor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Adversarial Machine learning: Poisoning attacks

So far we have been dealing with attacks that are launched against models that are already trained.
In poisoning attacks we infiltrate the system earlier in the pipeline: we attack the model during training. For these attacks to succeed we need to have influence over all of the training data, or at least parts of it.

<!-- In machine learning poisoning attacks, an attacker poison the model in order to change the learning outcome, by adding malicious data in the model training phase. This method can be performed, for example, by sending and injecting carefully designed samples when data collection is occuring during network operation, to train a network intrusion detection system model.  -->

Poisoning attacks come in two main types. The first type targets the machine learning system's availability, whereas the second targets its integrity (often called a backdoor attack).

Availability attacks aim to inject so much malicious data into an ML system that the trained model becomes basically useless. Recent research shows that, even under strong defenses, poisoning 3% of the training dataset can lead to an 11% drop in model accuracy.

Integrity or backdoor attacks are more sophisticated. They leave the classifier's functionality exactly as it should be, with one exception: a backdoor. A backdoor is a type of input that the model's designer is not aware of, but that the attacker can leverage to get the ML system to do what they want. For example, imagine an attacker teaches a malware classifier that if a certain string is present in a file, that file should always be classified as benign. Now the attacker can compose any malware they want, and as long as they insert that string somewhere into the file, the model will classify it as benign.

In [None]:
!pip install adversarial-robustness-toolbox

## Setup and Evaluation functions

We start out by importing all the modules we need, loading the data, and
defining some evaluation functions.

In [None]:
# imports
import tensorflow as tf
import numpy as np
import art
import sklearn.utils  # explicit submodule import; sklearn.utils.shuffle is used below
import matplotlib.pyplot as plt

Next we load the MNIST data set, normalize the data, and bring it into the
format that TensorFlow expects. Furthermore, to speed things up during
training, we will select a smaller subset of the training data.

In [None]:
from tensorflow.keras.datasets import mnist

# load the data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# normalize data
x_train = x_train / 255.
x_test = x_test / 255.

# convert image to the correct format
if tf.keras.backend.image_data_format() == 'channels_first':
  x_train = x_train.reshape(x_train.shape[0], 1, x_train.shape[1],
                            x_train.shape[2])
  # use x_test's own dimensions: x_train has already been reshaped above
  x_test = x_test.reshape(x_test.shape[0], 1, x_test.shape[1],
                          x_test.shape[2])
else:
  x_train = x_train.reshape(x_train.shape[0], x_train.shape[1],
                            x_train.shape[2], 1)
  x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], x_test.shape[2],
                          1)

# a smaller version of the training data
x_tr, y_tr = sklearn.utils.shuffle(x_train, y_train)
x_tr = x_tr[:1000]
y_tr = y_tr[:1000]


Since we are working with images a lot in this lab, let's create a function that
displays multiple images in a grid layout.

In [None]:
def plot_grid(imgs, rows_cols=None, figsize=(15, 15), titles=None):
    """
    Takes a list of images `imgs` and displays them in a grid. To specify
    the number of rows and columns, pass a tuple `(n, m)` to the `rows_cols` parameter.
    `figsize` is forwarded to `matplotlib.pyplot.figure` as `figsize`.
    `titles` can be used to set a title for each image in the grid, which should be a
    list of `str`. It needs to have the same number of elements as `imgs` if not None.
    """
    num = len(imgs)  # also works when `imgs` is a list of arrays

    # Determine the number of rows and columns if not specified
    if rows_cols is None:
        cols = int(np.sqrt(num))
        if num % cols != 0:
            cols += 1
        rows = (num + cols - 1) // cols  # Ensure rows is an integer
    else:
        rows, cols = rows_cols

    # Create the figure
    fig = plt.figure(figsize=figsize)

    for i, img in enumerate(imgs):
        ax = fig.add_subplot(rows, cols, i + 1)
        if titles is not None:
            ax.set_title(titles[i])

        # Handle single-dimensional and multi-dimensional images
        if len(img.shape) == 1:  # If the image is flattened
            side = int(np.sqrt(img.size))
            plt.imshow(img.reshape(side, side), cmap="gray")
        else:
            plt.imshow(img.squeeze(), cmap="gray")

        plt.axis('off')

    plt.tight_layout()  # Adjust layout to prevent overlap
    plt.show()

### Q. Use the `plot_grid` function to display 9 images from the training data. Each image should have its label as title.

Next we are going to define functions that will create an untrained model for
us. One for CIFAR-10 and one for MNIST.

In [None]:
from tensorflow.keras import models, layers
from tensorflow import keras


def get_cifar10_model():
  model = models.Sequential()
  model.add(
      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.Flatten())
  model.add(layers.Dense(64, activation='relu'))
  model.add(layers.Dense(10, activation='softmax'))

  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'])
  return model


def get_mnist_model(sparse_loss=True):
  clf = models.Sequential()
  clf.add(
      layers.Conv2D(32,
                    kernel_size=(3, 3),
                    activation='relu',
                    input_shape=(28, 28, 1)))
  clf.add(layers.Conv2D(64, (3, 3), activation='relu'))
  clf.add(layers.MaxPooling2D(pool_size=(2, 2)))
  clf.add(layers.Dropout(0.25))
  clf.add(layers.Flatten())
  clf.add(layers.Dense(128, activation='relu'))
  clf.add(layers.Dropout(0.5))
  clf.add(layers.Dense(10, activation='softmax'))

  if sparse_loss:
    clf.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                optimizer='adam',
                metrics=['accuracy'])
  else:
    clf.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                optimizer='adam',
                metrics=['accuracy'])

  return clf


# A very simple poisoning attack

In the simplest form of poisoning we simply mislabel parts of the training data.

But first let's get a baseline by training the MNIST model for a few epochs.

In [None]:
# train a model on correctly labeled data
m = get_mnist_model()
m.fit(x_tr, y_tr, epochs=5)
print('clean model:', m.evaluate(x_test, y_test))


The performance isn't fantastic, but remember that we are only using a small
subset of the training data and are only training for a few epochs.

### Q. Train a poisoned model. Poison the training data by changing the labels of a percentage of the training data. Plot the performance of the poisoned model on the test data for 10%, 25%, 33%, 50%, 66%, 75%, 90% and 100% of poisoned data.

In [None]:
# your code goes here


How does the performance change if we increase or decrease the number of corrupted labels?
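
The label-flipping step could be sketched as follows, assuming integer class labels in the range `0..9`. The helper name `flip_labels` and its `seed` parameter are made up for this illustration. Each selected label is shifted by a random non-zero offset modulo the number of classes, so a flipped label is always wrong.

```python
import numpy as np


def flip_labels(y, fraction, num_classes=10, seed=0):
    """Return a copy of `y` in which `fraction` of the labels are wrong.

    Hypothetical helper for this lab, not part of ART or Keras.
    """
    rng = np.random.default_rng(seed)
    y_flipped = np.copy(y)
    n_poison = int(round(len(y) * fraction))
    # pick distinct positions to corrupt
    idx = rng.choice(len(y), size=n_poison, replace=False)
    # an offset in [1, num_classes - 1] guarantees the new label differs
    offsets = rng.integers(1, num_classes, size=n_poison)
    y_flipped[idx] = (y_flipped[idx] + offsets) % num_classes
    return y_flipped


# small synthetic check: 100 labels, 25% of them flipped
y_demo = np.arange(100) % 10
y_demo_poisoned = flip_labels(y_demo, 0.25)
print((y_demo_poisoned != y_demo).sum())
```

In the notebook, a call such as `m.fit(x_tr, flip_labels(y_tr, 0.25), epochs=5)` would then train on data with 25% corrupted labels.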

# More interesting poisoning attacks

Confidence reduction is relatively simple to do. A more interesting scenario is to have a model that performs nicely on clean data but has some hidden *trigger* that changes its behaviour. These backdoors allow us to exploit behaviour that the model would not normally display.

Due to the poor explainability of neural networks, these backdoors are hard to detect in a trained model.





Let's start off by exploring some of the backdoor embedding functions.

In [None]:
from art.attacks.poisoning.perturbations.image_perturbations import add_single_bd, \
                                                                    add_pattern_bd

Let's load the CIFAR-10 data and normalize it

In [None]:
from tensorflow.keras.datasets import cifar10
from sklearn.utils import shuffle

(x_train_cf10, y_train_cf10), (x_test_cf10, y_test_cf10) = cifar10.load_data()

x_train_cf10 = x_train_cf10 / 255.
x_test_cf10 = x_test_cf10 / 255.

Now that the data is loaded we can have a look at it. To do so we display
one image from the MNIST and one image from the CIFAR-10 data set.

In [None]:
# pick a "random" sample
x_mnist = x_train[42]
x = x_train_cf10[42]

plt.imshow(x_mnist.squeeze())
plt.show()

plt.imshow(x)
plt.show()

In [None]:
# add a simple backdoor
x_p = add_single_bd(x_mnist.squeeze())
plt.imshow(x_p.squeeze())
plt.show()

x_p = add_single_bd(x)
plt.imshow(x_p)
plt.show()

We can see that the backdoor embedding does not work as intended on color images.
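
One way to see why: the function's docstring (quoted in the exercise that follows) describes the input as an `N X W X H` or `W X H` matrix. Assuming the function therefore treats any 3-D array as a batch of grayscale images (this may differ between ART versions), a single `32 x 32 x 3` color image is read as 32 images of size `32 x 3`. The stand-in below, `single_bd_grayscale_only`, is a hypothetical re-implementation of that interpretation written for this illustration; it is not ART's code.

```python
import numpy as np


def single_bd_grayscale_only(x, distance=2, pixel_value=1):
    """Hypothetical stand-in: a 3-D input is read as a batch (N, W, H)."""
    x = np.copy(x)
    if x.ndim == 3:
        width, height = x.shape[1:]
        x[:, width - distance, height - distance] = pixel_value
    elif x.ndim == 2:
        width, height = x.shape
        x[width - distance, height - distance] = pixel_value
    else:
        raise ValueError("expected a 2-D image or a 3-D batch")
    return x


rgb = np.zeros((32, 32, 3))
out = single_bd_grayscale_only(rgb)
changed = np.argwhere(out != 0)

print(changed.shape[0])          # 32 entries modified instead of one pixel
print(np.unique(changed[:, 1]))  # [30]: a single column, in every row
print(np.unique(changed[:, 2]))  # [1]: only the green channel
```

Under this assumption the "backdoor" becomes a vertical green line across the whole image rather than a single marked pixel.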

### Q. Make an embedding function that does work for color images.

The signature should be the same as the function in ART

```
def add_single_bd(x: np.ndarray, distance: int = 2, pixel_value: int = 1) -> np.ndarray:
    """
    Augments a matrix by setting value some `distance` away from the bottom-right edge to 1. Works for single images
    or a batch of images.

    :param x: N X W X H  matrix or W X H matrix
    :param distance: Distance from bottom-right walls.
    :param pixel_value: Value used to replace the entries of the image matrix.
    :return: Backdoored image.
    """
```

Fill in the code below

In [None]:
def add_single_bd_rgb(x: np.ndarray,
                      distance: int = 2,
                      pixel_value: int = 1) -> np.ndarray:
  """
  Augments a matrix by setting value some `distance` away from the bottom-right
  edge to 1. Works for single images or a batch of images.

  :param x: N X W X H matrix or W X H matrix
  :param distance: Distance from bottom-right walls.
  :param pixel_value: Value used to replace the entries of the image matrix.
  :return: Backdoored image.
  """
  x_modified = np.copy(x)
  # add modifications
  return x_modified

In [None]:
# solution
def add_single_bd_rgb(x: np.ndarray,
                      distance: int = 2,
                      pixel_value: int = 1) -> np.ndarray:
  """
  Augments a matrix by setting value some `distance` away from the bottom-right
  edge to 1. Works for single images or a batch of images.

  :param x: N X W X H X C matrix or W X H X C matrix
  :param distance: Distance from bottom-right walls.
  :param pixel_value: Value used to replace the entries of the image matrix.
  :return: Backdoored image.
  """
  x_modified = np.copy(x)
  if len(x.shape) == 4:  # batch of images
    x_modified[:, -distance, -distance, :] = pixel_value
  else:  #single image
    x_modified[-distance, -distance, :] = pixel_value
  return x_modified

Check that the new function works

In [None]:
x_p = add_single_bd_rgb(x)
plt.imshow(x_p)
plt.show()

Rather than just changing a single pixel we can also add a pattern to the
image. The pattern is going to be more obvious should someone inspect the image,
but it is also more robust.

In [None]:
# add a pattern backdoor
x_p = add_pattern_bd(x_mnist.squeeze())
plt.imshow(x_p)
plt.show()

We can see the pattern in the bottom right of the image.

But does it work with color images?

In [None]:
# add a pattern backdoor
x_p = add_pattern_bd(x.squeeze())
plt.imshow(x_p)
plt.show()

Once again it does not work for color images.

### Q. Again make a function that does work for color images. It should work the same as the ART function and add the same pattern to the image.

The signature should be the same as the function in ART

```
def add_pattern_bd(x: np.ndarray, distance: int = 2, pixel_value: int = 1) -> np.ndarray:
  ...
```

Fill in the code below

In [None]:
def add_pattern_bd_rgb(x, distance=2, pixel_value=1):
  x_modified = np.copy(x)
  # add modifications
  return x_modified
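
One possible way to fill in the skeleton is sketched below. The four pixel offsets are an assumption about the pattern that ART's `add_pattern_bd` draws; compare the result against the grayscale output above and adjust the offsets if your installed version differs. The Ellipsis index lets the same code handle a single `H x W x C` image and an `N x H x W x C` batch.

```python
import numpy as np


def add_pattern_bd_rgb(x, distance=2, pixel_value=1):
    """Sketch: write a small pattern into all channels near the bottom-right."""
    x_modified = np.copy(x)
    # the last three axes are (height, width, channels)
    height, width = x_modified.shape[-3:-1]
    row, col = height - distance, width - distance
    # ASSUMPTION: pattern offsets relative to the anchor pixel (row, col)
    for d_row, d_col in ((0, 0), (-1, -1), (0, -2), (-2, 0)):
        x_modified[..., row + d_row, col + d_col, :] = pixel_value
    return x_modified


demo = add_pattern_bd_rgb(np.zeros((32, 32, 3)))
print(np.argwhere(demo[..., 0] != 0))  # positions of the pattern pixels
```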

Check that the new function works

In [None]:
x_p = add_pattern_bd_rgb(x)
plt.imshow(x_p)
plt.show()

In [None]:
# we can change the intensity and positioning
x_p = add_pattern_bd(x_mnist.squeeze(), distance=5, pixel_value=.3)
plt.imshow(x_p)
plt.show()

x_p = add_pattern_bd_rgb(x, distance=2, pixel_value=.3)
plt.imshow(x_p)
plt.show()

### Q. At the moment the pattern method only supports embedding the pattern along the diagonal of the image. Create a function `add_pattern_bd_advanced` whose `distance` argument can also take a tuple. In the tuple `(x, y)`, `x` specifies the distance from the right border and `y` the distance from the bottom border. The function needs to support MNIST images and batches of images; it does not need to support color images.

In [None]:
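def add_pattern_bd_advanced(x, distance=2, pixel_value=1):
  x_modified = np.copy(x)
  if isinstance(distance, int):
    # place the pattern along the diagonal
    # add modifications
    pass  # placeholder so the skeleton is valid Python; replace with your code
  else:
    # input was a tuple
    # add modifications
    pass  # placeholder so the skeleton is valid Python; replace with your code
  return x_modified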
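
A hedged sketch of one possible solution follows. It assumes that the first tuple entry is the distance from the right border and the second the distance from the bottom border, and it reuses an assumed four-pixel pattern; both are illustrative choices rather than ART behaviour. A trailing channel axis of size 1, as used for the MNIST data in this notebook, is handled through a view so that writes still land in the copy.

```python
import numpy as np


def add_pattern_bd_advanced(x, distance=2, pixel_value=1):
    """Sketch for grayscale images: (H, W), (H, W, 1), (N, H, W) or (N, H, W, 1)."""
    x_modified = np.copy(x)
    if isinstance(distance, int):
        # same distance from both borders: pattern sits on the diagonal
        dist_right, dist_bottom = distance, distance
    else:
        # ASSUMPTION: tuple is (distance from right, distance from bottom)
        dist_right, dist_bottom = distance

    # basic indexing returns a view, so assignments modify x_modified
    has_channel = x_modified.ndim >= 3 and x_modified.shape[-1] == 1
    view = x_modified[..., 0] if has_channel else x_modified

    height, width = view.shape[-2:]
    row, col = height - dist_bottom, width - dist_right
    # ASSUMPTION: pattern offsets relative to the anchor pixel (row, col)
    for d_row, d_col in ((0, 0), (-1, -1), (0, -2), (-2, 0)):
        view[..., row + d_row, col + d_col] = pixel_value
    return x_modified


demo = add_pattern_bd_advanced(np.zeros((28, 28, 1)), distance=(5, 10))
print(np.argwhere(demo[..., 0] != 0))  # pattern pixels as (row, column)
```

The sketch does not guard against distances so small or so large that an index wraps around the image border; a production version would validate them.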

Let's poison an MNIST model with a simple backdoor.

We start by training a baseline

In [None]:
mnist_baseline = get_mnist_model()

mnist_baseline.fit(x_tr, y_tr, epochs=5)
mnist_baseline.evaluate(x_test, y_test)


We want to poison a specific class of the model. Our goal is to have the model
predict a class of our choosing, the `target_class`, whenever we add the marker.
Let's assume we have control over a certain percentage, `poison_percentage`, of the
training data, meaning we can change its instances and labels at will. In the
subset of the data we control, we add the marker to all instances
that are not of the target class and change their label to that of the target class.

First we select a `target_class` and `poison_percentage`

In [None]:
from art.attacks.poisoning import PoisoningAttackBackdoor
# create the poisoned data set
poison_percentage = .5
target_class = 9

# shuffle the data
x_all, y_all = shuffle(x_tr, y_tr)

# split the data into the part that we have control over and the part we don't
x_ours = x_all[:int(len(x_all) * poison_percentage)]
y_ours = y_all[:int(len(y_all) * poison_percentage)]

x_theirs = x_all[int(len(x_all) * poison_percentage):]
y_theirs = y_all[int(len(y_all) * poison_percentage):]

### Q. Create an ndarray `x_poison` that contains all instances from `x_ours` where the label is not `target_class` and an ndarray `poison_labels` containing the `target_class` as label for each instance in `x_poison`.

In [None]:
# your code goes here
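
A minimal sketch of this selection step uses a boolean mask. The function name `select_poison_candidates` is hypothetical and only wraps the three lines so they can be tried on synthetic data; in the notebook the same lines can be applied directly to `x_ours`, `y_ours` and `target_class`.

```python
import numpy as np


def select_poison_candidates(x_ours, y_ours, target_class):
    """Keep the instances whose label is not the target and relabel them."""
    mask = y_ours != target_class          # True where the label differs
    x_poison = x_ours[mask]
    poison_labels = np.full(len(x_poison), target_class, dtype=y_ours.dtype)
    return x_poison, poison_labels


# synthetic stand-in for the notebook's data: 20 blank images, labels 0..9 twice
x_demo = np.zeros((20, 28, 28, 1))
y_demo = np.arange(20) % 10
x_sel, labels_sel = select_poison_candidates(x_demo, y_demo, target_class=9)
print(x_sel.shape, labels_sel)
```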

We now use the data you just created to poison the training data.

In [None]:
# create attack instance
backdoor_attack = PoisoningAttackBackdoor(add_pattern_bd)
# add the pattern to the instances
poison, poison_labels = backdoor_attack.poison(x_poison, y=poison_labels)

# combine our training data with the rest
x_poison = np.concatenate([x_theirs, poison])
y_poison = np.concatenate([y_theirs, poison_labels])

# shuffle the data
x_poison, y_poison = shuffle(x_poison, y_poison)
plot_grid(poison[:64])


We can now use the poisoned data to train a model.

In [None]:
# train a model on poisoned data
poisoned_model = get_mnist_model()
poisoned_model.fit(x_poison, y_poison, epochs=5)
# performance on clean data
poisoned_model.evaluate(x_test, y_test)


### Q. Evaluate the success of the attack.

To show the attack was successful we need to show 2 things.
1. Show that the model performs well on data that does not contain the marker
2. Show that the model predicts the target class when we add the marker.

In [None]:
# evaluation goes here
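
The two criteria can be turned into numbers as sketched below. The helper names `clean_accuracy` and `attack_success_rate` are made up for this illustration. The success rate is computed only over samples whose true label is not the target class, since predicting the target for those would be correct anyway.

```python
import numpy as np


def clean_accuracy(pred_clean, y_true):
    """Fraction of unmodified samples that are classified correctly."""
    return float(np.mean(np.asarray(pred_clean) == np.asarray(y_true)))


def attack_success_rate(pred_triggered, y_true, target_class):
    """Fraction of non-target samples predicted as the target once marked."""
    pred_triggered = np.asarray(pred_triggered)
    y_true = np.asarray(y_true)
    not_target = y_true != target_class
    return float(np.mean(pred_triggered[not_target] == target_class))


# tiny hand-made example with target class 9
y_true = np.array([0, 1, 9, 3, 9, 5])
pred_clean = np.array([0, 1, 9, 3, 9, 4])      # one mistake on clean data
pred_triggered = np.array([9, 9, 9, 9, 9, 5])  # one marked sample resists
print(clean_accuracy(pred_clean, y_true))
print(attack_success_rate(pred_triggered, y_true, target_class=9))  # 0.75
```

In the notebook the predictions could be obtained with `np.argmax(poisoned_model.predict(x_test), axis=1)`, and with the same call on a marked copy of `x_test`, assuming the embedding function accepts the batch shape used here.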

### Q. Evaluate embedding strategies.

There are a number of things you need to compare.

1. Compare the impact of the `poison_percentage`. Plot the evaluation for 0.1, 0.25, 0.33, and 0.5.
2. Compare the effectiveness of using a pattern against just using a single pixel as a marker. Use the same `poison_percentage` as above.
3. When using a pattern compare the effectiveness of different marker placement. Compare the following marker placements: bottom right, center, and random. Use the same `poison_percentage` as above.
4. Using a pattern, compare the impact of the `pixel_value`. Compare values 0.01, 0.25, 0.5, 0.75, and 1.0. Use the same `poison_percentage` values as above and the pattern placement you found to be most successful in the previous experiment.

In [None]:
#evaluation goes here
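
Since every comparison repeats the same data preparation, it may help to wrap the steps from the earlier cells in one function. The sketch below does this with NumPy only; `build_poisoned_set`, its `trigger_fn` parameter and the toy `corner_pixel` trigger are hypothetical names introduced for illustration. Any embedding function that maps a batch of images to a marked batch can be passed in.

```python
import numpy as np


def build_poisoned_set(x, y, poison_percentage, target_class, trigger_fn, seed=0):
    """Shuffle, split into 'ours'/'theirs', mark and relabel our non-target part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    x, y = x[order], y[order]

    n_ours = int(len(x) * poison_percentage)
    x_ours, y_ours = x[:n_ours], y[:n_ours]
    x_theirs, y_theirs = x[n_ours:], y[n_ours:]

    # only instances that are not already of the target class get the marker
    mask = y_ours != target_class
    x_marked = trigger_fn(x_ours[mask])
    y_marked = np.full(len(x_marked), target_class, dtype=y.dtype)

    x_out = np.concatenate([x_theirs, x_marked])
    y_out = np.concatenate([y_theirs, y_marked])
    order = rng.permutation(len(x_out))
    return x_out[order], y_out[order]


def corner_pixel(batch):
    """Toy trigger for the demo: set one pixel near the bottom-right corner."""
    batch = np.copy(batch)
    batch[:, -2, -2, :] = 1.0
    return batch


# synthetic data: blank images, so marked samples are easy to count
x_demo = np.zeros((100, 28, 28, 1))
y_demo = np.arange(100) % 10
for p in (0.1, 0.25, 0.33, 0.5):
    x_p, y_p = build_poisoned_set(x_demo, y_demo, p, 9, corner_pixel)
    n_marked = int((x_p[:, -2, -2, 0] == 1.0).sum())
    print(p, len(x_p), n_marked)
```

Each returned pair can be passed to `get_mnist_model().fit(...)`, so a loop over percentages, triggers or pixel values only has to change the arguments.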

The attacks that we have used so far have one major downside: the labels aren't clean. Most of our poisoned data is clearly labeled incorrectly.

Let's address this issue with clean-label attacks, which can be found in part 2.