-----------

This notebook covers two algorithms I developed to generate adversarial examples that avoid detection by the trapdoor defense from Shan et al. (2020).

The first runs an initial pass of projected gradient descent (PGD) and then adds a term to the loss function that measures distance to the resulting example. The aim is to bias the algorithm away from the trapdoor, which is likely to provide the steepest descent in the loss function.

The second reduces the largest elements of the gradient before applying PGD, assuming that these largest elements are likely to lead towards the trapdoor.

-----------

In [1]:
import numpy as np
import tensorflow as tf
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
from tensorflow.keras.datasets import cifar10

import cifar10_bd_train
import backdoor_utils
import targeted_pgd

-------------

Load, unpack, and preprocess data

-------------

In [2]:
raw_train_data, raw_test_data = cifar10.load_data()

In [3]:
raw_train_images, raw_train_labels = raw_train_data
raw_test_images, raw_test_labels = raw_test_data

train_images, train_labels = cifar10_bd_train.preprocess(raw_train_images, raw_train_labels)
test_images, test_labels = cifar10_bd_train.preprocess(raw_test_images, raw_test_labels)

--------------

Load models and backdoor patterns for randomly initialized trigger locations and bottom-right located triggers.
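
For orientation, here is a minimal sketch of how a `pattern` and `mask` pair is commonly blended into an image in trapdoor-style training. The function `blend_trigger`, the array shapes, and the `[0, 1]` value range are assumptions for illustration; the actual `backdoor_utils.inject_backdoor_pattern` may have a different signature and behavior.

```python
import numpy as np

def blend_trigger(images, pattern, mask):
    """Blend a trigger pattern into a batch of images (hypothetical helper).

    images:  (N, H, W, C) floats in [0, 1]
    pattern: (H, W, C) trigger pixel values
    mask:    (H, W, 1) or (H, W, C) blend weights in [0, 1]; 0 keeps the image
    """
    images = np.asarray(images, dtype=np.float32)
    return (1.0 - mask) * images + mask * pattern

# Toy example: a 4x4 trigger in the bottom-right corner of 32x32 RGB images.
rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3), dtype=np.float32)
pattern = np.ones((32, 32, 3), dtype=np.float32)
mask = np.zeros((32, 32, 1), dtype=np.float32)
mask[-4:, -4:, :] = 0.1  # low blend weight inside the patch, zero elsewhere
triggered = blend_trigger(images, pattern, mask)
```

Pixels where the mask is zero are left untouched, so only the masked region carries the trigger.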

--------------

In [4]:
# load randomly initialized
labels = [2, 3, 5, 7]
models_basic = {}
for label in labels:
    fp = 'cifar10_hp_nonoise/label' + str(label) + 'noise0.0'
    model = tf.keras.models.load_model(fp)
    pattern = np.load(fp + '/pattern.npy')
    mask = np.load(fp + '/mask.npy')
    models_basic[label] = (model, pattern, mask)

In [5]:
# load bottom right initialized
labels = [2, 3, 5, 7]
models_botRight = {}
for label in labels:
    fp = 'cifar10_hp_nonoise/label' + str(label) + 'noise0.0bottomRight'
    model = tf.keras.models.load_model(fp)
    pattern = np.load(fp + '/pattern.npy')
    mask = np.load(fp + '/mask.npy')
    models_botRight[label] = (model, pattern, mask)

------------------

### Gradient Probe Attack
The first attempt to avoid the trapdoor is a gradient probe attack. This algorithm first computes a projected gradient descent (PGD) attack step, then uses the resulting image in a new loss function that combines the cross-entropy loss to the target class with a term representing the distance between the image being optimized and the image found in the PGD step.

The intuition is to use the first PGD step as a probe, assuming that the preferred direction at each step leads towards the trapdoor. The probe image is then incorporated into the loss to incentivize finding an adversarial example without taking the same path as the probe.
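
The idea can be sketched on a toy problem. The block below is not the code in `targeted_pgd`; it assumes a linear softmax model (so the gradient is analytic), an L-infinity constraint, sign-based updates, and an adjusted loss of the form `CE(x) - lam * mean((x - probe)**2)`. All function names here are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_and_grad(x, W, target):
    """Cross-entropy to `target` for a toy linear-softmax model, and d(loss)/dx."""
    p = softmax(x @ W)                     # (N, K) class probabilities
    onehot = np.eye(W.shape[1])[target]    # (K,)
    loss = -np.log(p[:, target] + 1e-12)   # (N,)
    grad = (p - onehot) @ W.T              # (N, D)
    return loss, grad

def project(x, x0, eps):
    """Clip x into the L-inf ball of radius eps around x0, then into [0, 1]."""
    return np.clip(np.clip(x, x0 - eps, x0 + eps), 0.0, 1.0)

def probe_step(x, x0, W, target, alpha, beta, eps, lam):
    # 1) Probe: an ordinary targeted PGD step with step size beta.
    _, g = ce_and_grad(x, W, target)
    probe = project(x - beta * np.sign(g), x0, eps)
    # 2) Gradient of the assumed adjusted loss, with the probe held fixed:
    #    d/dx [CE(x) - lam * mean((x - probe)^2)]
    g_adj = g - lam * 2.0 * (x - probe) / x.shape[1]
    # 3) Final update with step size alpha.
    return project(x - alpha * np.sign(g_adj), x0, eps), probe

rng = np.random.default_rng(0)
x0 = rng.random((4, 12))           # four toy "images" with 12 features each
W = rng.normal(size=(12, 3))       # toy classifier weights, 3 classes
x_new, probe = probe_step(x0.copy(), x0, W, target=2,
                          alpha=1e-4, beta=0.03, eps=0.03, lam=0.996)
```

The sign on the distance term is chosen so that minimizing the loss rewards moving away from the probe; how the two terms are actually weighted in the notebook's implementation is not shown here.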

------------------

In [6]:
from targeted_pgd import gradient_probe_attack
from targeted_pgd import gradient_mask_reduc_attack
from backdoor_utils import sim_distribution
from backdoor_utils import inject_backdoor_pattern
from backdoor_utils import build_backdoor_sig
from backdoor_utils import test_backdoor_defense

from backdoor_utils import setup_backdoor_defense

-------------

Evaluation on the randomly initialized backdoor models. The attack parameters were found by informal trial and error. A grid search would likely find more effective values, but time constraints ruled that out in this study. Even so, the attack is fairly effective, in some cases greatly reducing the detection method's ability to identify the attacks. For the label 2 model, for example, the true positive rate at the 90th-percentile threshold is about 0.68.

eps = constraint on distance from original image

lam = weight on the distance term of the adjusted loss function

beta = step size for probe image

alpha = step size for final image
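
To help read the output that follows, here is a rough sketch of how percentile thresholds turn into true and false positive rates. It assumes the detector compares the cosine similarity between an input's feature vector and the trapdoor signature against a percentile of benign similarities; the helper names and the synthetic data are illustrative and are not taken from `backdoor_utils`.

```python
import numpy as np

def cosine_sim(features, signature):
    """Cosine similarity between each feature row and the signature vector."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    s = signature / np.linalg.norm(signature)
    return f @ s

def detection_rates(benign_sims, attack_sims, percentile):
    """Flag inputs whose similarity exceeds a percentile of benign similarities."""
    threshold = np.percentile(benign_sims, percentile)
    tpr = float(np.mean(attack_sims > threshold))  # attacks correctly flagged
    fpr = float(np.mean(benign_sims > threshold))  # benign inputs wrongly flagged
    return tpr, fpr

# Synthetic features: attacks sit close to the signature, benign inputs do not.
rng = np.random.default_rng(0)
signature = rng.normal(size=64)
benign = rng.normal(size=(1000, 64))
attacks = signature + 0.5 * rng.normal(size=(200, 64))

benign_sims = cosine_sim(benign, signature)
attack_sims = cosine_sim(attacks, signature)
for pct in (90, 92, 94):
    tpr, fpr = detection_rates(benign_sims, attack_sims, pct)
    print(f"percentile {pct}: TPR={tpr:.3f} FPR={fpr:.3f}")
```

Under this scheme a higher threshold percentile lowers the false positive rate by construction, so the quantity of interest for an attack is how far the true positive rate falls at each threshold.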

-------------

In [7]:
labels = [2, 3, 5, 7]

for label in labels:
    model, pattern, mask = models_basic[label]
    sig, feature_extractor, thresholds = setup_backdoor_defense(label, model, pattern,
                                                       mask, test_images,
                                                       train_images, raw_train_labels)
    imgs, loss = gradient_probe_attack(model, test_images, label, 2000,
                                       steps=50, alpha=0.0001, beta=0.03,
                                       eps=0.03, lam=0.996)
    
#     fp = 'cifar10_hp_nonoise/label'+str(label)+'noise0.0/attacks/GradientProbe/'
#     if not os.path.exists(fp):
#         os.makedirs(fp)
#     np.save(fp + 'imgs', imgs)
#     np.save(fp + 'loss', loss)
        
#     imgs = np.load(fp + 'imgs.npy')
#     loss = np.load(fp + 'loss.npy')
    
    print('---------------------------------------------------')
    print('---------------------------------------------------')
    print('Testing Model Defending Label ' + str(label) + ':')
    test_backdoor_defense(imgs, test_images, label, feature_extractor, sig, thresholds)

---------------------------------------------------
---------------------------------------------------
Testing Model Defending Label 2:
----------------
Threshold Percentile: 90
True Positive: 879
False Negative: 412
True Positive Rate: 0.6808675445391169
True Negative: 8992
False Positive: 1008
False Positive Rate: 0.1008
----------------
Threshold Percentile: 91
True Positive: 835
False Negative: 456
True Positive Rate: 0.6467854376452362
True Negative: 9103
False Positive: 897
False Positive Rate: 0.0897
----------------
Threshold Percentile: 92
True Positive: 784
False Negative: 507
True Positive Rate: 0.6072811773818745
True Negative: 9201
False Positive: 799
False Positive Rate: 0.0799
----------------
Threshold Percentile: 93
True Positive: 748
False Negative: 543
True Positive Rate: 0.5793958171959721
True Negative: 9295
False Positive: 705
False Positive Rate: 0.0705
----------------
Threshold Percentile: 94
True Positive: 678
False Negative: 613
True Positive Rate: 0.5251742

-------------

Evaluation on the models with bottom-right-corner triggers. The algorithm has a harder time finding successful adversarial examples for these models. The detection accuracy of the models is significantly lower than against basic PGD.

The settings used were tuned on the set of models with random trigger placement. The attack success rate could likely be improved by tuning the parameters on the bottom-right-triggered models.

-------------

In [8]:
labels = [2, 3, 5, 7]

for label in labels:
    model, pattern, mask = models_botRight[label]
    
    sig, feature_extractor, thresholds = setup_backdoor_defense(label, model,
                                                                pattern, mask,
                                                                test_images,
                                                                train_images,
                                                                raw_train_labels)
    imgs, loss = gradient_probe_attack(model, test_images, label, 2000,
                                       steps=50, alpha=0.0001, beta=0.03,
                                       eps=0.03, lam=0.996)
    
#     fp = 'cifar10_hp_nonoise/label'+str(label)+'noise0.0bottomRight/attacks/GradientProbe/'
#     if not os.path.exists(fp):
#         os.makedirs(fp)
#     np.save(fp + 'imgs', imgs)
#     np.save(fp + 'loss', loss)

#     imgs = np.load(fp + 'imgs.npy')
#     loss = np.load(fp + 'loss.npy')
    
    print('---------------------------------------------------')
    print('---------------------------------------------------')
    print('Testing Label ' + str(label) + ':')
    test_backdoor_defense(imgs, test_images, label, feature_extractor, sig, thresholds)

---------------------------------------------------
---------------------------------------------------
Testing Label 2:
----------------
Threshold Percentile: 90
True Positive: 582
False Negative: 40
True Positive Rate: 0.9356913183279743
True Negative: 8921
False Positive: 1079
False Positive Rate: 0.1079
----------------
Threshold Percentile: 91
True Positive: 561
False Negative: 61
True Positive Rate: 0.9019292604501608
True Negative: 9000
False Positive: 1000
False Positive Rate: 0.1
----------------
Threshold Percentile: 92
True Positive: 525
False Negative: 97
True Positive Rate: 0.8440514469453376
True Negative: 9110
False Positive: 890
False Positive Rate: 0.089
----------------
Threshold Percentile: 93
True Positive: 481
False Negative: 141
True Positive Rate: 0.7733118971061094
True Negative: 9227
False Positive: 773
False Positive Rate: 0.0773
----------------
Threshold Percentile: 94
True Positive: 400
False Negative: 222
True Positive Rate: 0.6430868167202572
True Negativ

---------------

### Gradient Mask Reduction Attack
Another attack that attempts to avoid the trapdoors. The gradient mask reduction attack reduces all values in the gradient whose magnitude exceeds a threshold by a predetermined ratio. The threshold is specified as a fraction of the largest magnitude in the gradient.

The original paper notes that backdoors introduce a highly effective path to misclassification.  This algorithm assumes that this highly effective path will show up as unusually large values in the gradient, and so reducing the contribution of these unusually large values may help avoid converging to the trapdoor.
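
A minimal sketch of the reduction step follows. It assumes the threshold is applied per example and that "reducing by a ratio" means multiplying the selected entries by that ratio; the function name and the `ratio` parameter are hypothetical, and the mapping onto the `lam` and `thresh` arguments of `gradient_mask_reduc_attack` is not confirmed here.

```python
import numpy as np

def reduce_large_gradients(grad, thresh, ratio):
    """Scale down gradient entries whose magnitude exceeds thresh * max|grad|.

    grad:   (N, ...) per-example gradients
    thresh: fraction of each example's largest magnitude, in (0, 1)
    ratio:  multiplicative factor applied to the selected entries
    """
    mag = np.abs(grad)
    axes = tuple(range(1, grad.ndim))                    # all non-batch axes
    cutoff = thresh * mag.max(axis=axes, keepdims=True)  # per-example cutoff
    return np.where(mag > cutoff, ratio * grad, grad)

grad = np.array([[0.1, -0.2, 1.0, -0.9]])
reduced = reduce_large_gradients(grad, thresh=0.8, ratio=0.5)
# Entries 1.0 and -0.9 exceed 0.8 * 1.0 and are halved; the rest are unchanged.
```

Scaling by a positive ratio leaves the sign of every entry unchanged, so the reduction only affects an update rule that uses gradient magnitudes rather than signs alone.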

---------------

In [9]:
from targeted_pgd import gradient_probe_attack
from targeted_pgd import gradient_mask_reduc_attack

In [10]:
labels = [2, 3, 5, 7]

for label in labels:
    model, pattern, mask = models_basic[label]
    
    sig, feature_extractor, thresholds = setup_backdoor_defense(label, model,
                                                                pattern, mask,
                                                                test_images,
                                                                train_images,
                                                                raw_train_labels)
    imgs, loss = gradient_mask_reduc_attack(model, test_images, label, 2000,
                                       steps=50, alpha=0.00006, eps=0.05,
                                       lam=0.95, thresh=0.992)
    
#     fp = 'cifar10_hp_nonoise/label'+str(label)+'noise0.0/attacks/GradientMaskReduc/'
#     if not os.path.exists(fp):
#         os.makedirs(fp)
#     np.save(fp + 'imgs', imgs)
#     np.save(fp + 'loss', loss)

#     imgs = np.load(fp + 'imgs.npy')
#     loss = np.load(fp + 'loss.npy')
    
    print('---------------------------------------------------')
    print('---------------------------------------------------')
    print('Testing Label ' + str(label) + ':')
    test_backdoor_defense(imgs, test_images, label, feature_extractor, sig, thresholds)

---------------------------------------------------
---------------------------------------------------
Testing Label 2:
----------------
Threshold Percentile: 90
True Positive: 1715
False Negative: 185
True Positive Rate: 0.9026315789473685
True Negative: 9001
False Positive: 999
False Positive Rate: 0.0999
----------------
Threshold Percentile: 91
True Positive: 1688
False Negative: 212
True Positive Rate: 0.888421052631579
True Negative: 9099
False Positive: 901
False Positive Rate: 0.0901
----------------
Threshold Percentile: 92
True Positive: 1659
False Negative: 241
True Positive Rate: 0.8731578947368421
True Negative: 9173
False Positive: 827
False Positive Rate: 0.0827
----------------
Threshold Percentile: 93
True Positive: 1615
False Negative: 285
True Positive Rate: 0.85
True Negative: 9287
False Positive: 713
False Positive Rate: 0.0713
----------------
Threshold Percentile: 94
True Positive: 1552
False Negative: 348
True Positive Rate: 0.8168421052631579
True Negative: 93

In [11]:
labels = [2, 3, 5, 7]

for label in labels:
    model, pattern, mask = models_botRight[label]
    
    sig, feature_extractor, thresholds = setup_backdoor_defense(label, model,
                                                                pattern, mask,
                                                                test_images,
                                                                train_images,
                                                                raw_train_labels)
    imgs, loss = gradient_mask_reduc_attack(model, test_images, label, 2000,
                                       steps=50, alpha=0.00006, eps=0.05,
                                       lam=0.95, thresh=0.992)
    
#     fp = 'cifar10_hp_nonoise/label'+str(label)+'noise0.0bottomRight/attacks/GradientMaskReduc/'
#     if not os.path.exists(fp):
#         os.makedirs(fp)
#     np.save(fp + 'imgs', imgs)
#     np.save(fp + 'loss', loss)

#     imgs = np.load(fp + 'imgs.npy')
#     loss = np.load(fp + 'loss.npy')
    
    print('---------------------------------------------------')
    print('---------------------------------------------------')
    print('Testing Label ' + str(label) + ':')
    test_backdoor_defense(imgs, test_images, label, feature_extractor, sig, thresholds)

---------------------------------------------------
---------------------------------------------------
Testing Label 2:
----------------
Threshold Percentile: 90
True Positive: 1564
False Negative: 74
True Positive Rate: 0.9548229548229549
True Negative: 9067
False Positive: 933
False Positive Rate: 0.0933
----------------
Threshold Percentile: 91
True Positive: 1546
False Negative: 92
True Positive Rate: 0.9438339438339438
True Negative: 9169
False Positive: 831
False Positive Rate: 0.0831
----------------
Threshold Percentile: 92
True Positive: 1530
False Negative: 108
True Positive Rate: 0.9340659340659341
True Negative: 9247
False Positive: 753
False Positive Rate: 0.0753
----------------
Threshold Percentile: 93
True Positive: 1494
False Negative: 144
True Positive Rate: 0.9120879120879121
True Negative: 9357
False Positive: 643
False Positive Rate: 0.0643
----------------
Threshold Percentile: 94
True Positive: 1414
False Negative: 224
True Positive Rate: 0.8632478632478633
True

## Ongoing and Future Work

So far I have only applied these algorithms empirically, with hand-picked parameters. I'm beginning to frame them analytically, in the hope that this will reveal when and how they are effective. A grid search over the most important parameters could also help illuminate how the two algorithms work.

Additionally, they need to be tested on more datasets and with a larger pool of models. Trapdoor placement seems to be a large factor, so testing these algorithms on many models with different trigger placements might help reveal when and how they work.
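
The grid search mentioned above could be organized along these lines. This is only a sketch: `grid_search` is a hypothetical helper, and `toy_tpr` is a stand-in objective with made-up behavior. A real objective would run the attack with the given parameters and return the detector's true positive rate at a fixed threshold percentile.

```python
import itertools

def grid_search(evaluate, grid):
    """Return (best_params, best_score), minimizing evaluate(**params)."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in objective (not real results): lower is better for the attacker.
def toy_tpr(alpha, lam):
    return abs(alpha - 1e-4) * 1e3 + abs(lam - 0.9)

grid = {"alpha": [1e-5, 1e-4, 1e-3], "lam": [0.5, 0.9, 0.99]}
best_params, best_score = grid_search(toy_tpr, grid)
```

Because each evaluation would involve a full attack run, a coarse grid over the two or three most influential parameters is probably the practical starting point.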