### AM207 Final Project
#### *Adversarial Examples Are Not Bugs, They Are Features*
Connor Capitolo [a], Kevin Hare [b], Mark Penrod [a], Sivananda Rajananda

------------

[a] Interested in continuing work to Spring 2022 semester

[b] Interested in continuing research, graduating and will need to assess time constraints

#### Problem Statement

This paper addresses adversarial attacks, and more specifically considers whether these attacks exploit “brittle”, non-robust features in the data. In short, a non-robust feature is one that is highly correlated with the label but does not survive an adversarial perturbation. Adversarial perturbations, in turn, are small modifications to input examples made with the goal of fooling a classifier.

#### Context

Adversarial problems are common in many areas of machine learning, as there are inherently adversarial settings where we hope to deploy machine learning systems. Ilyas et al. (2019) focus primarily on images, considering the robustness of classifiers on benchmark datasets.

Indeed, image recognition and classification could have significant adversarial elements. One can imagine the use of masks or other alterations to attempt to avoid identification by such a system – regardless of whether that system is deployed in a harmful manner or not. This problem is not limited to the vision domain; consider an algorithm deployed by a bank to detect fraudulent transactions. By definition, this scenario is adversarial and it would be reasonable to expect that individuals committing fraud will attempt to mimic true transactions or otherwise conceal their behavior. Adversaries may not always be generated with malicious intent. In the example of self-driving cars, natural variations in weather, light, and other natural features as well as random performance variation in sensors can inadvertently generate adversaries. Vulnerable systems pose significant threats to public safety as this technology is deployed. Thus, having algorithms that are able to extract useful signals from the data under these conditions remains an important goal.

More generally, robust machine learning methods give us as engineers a mechanism for certifying our algorithms against attacks or worst-case scenarios. Rather than traditional data augmentation, which is more random in nature, robust machine learning methods, which adversarial defenses fall under, can assure performance of a given algorithm even when the augmentation is done to maximize the chances of fooling the algorithm, as described above.


#### Existing Work

Robust machine learning for adversarial examples – that is, models that are able to correctly classify adversarial examples – has been a rising area in machine learning. Two seminal papers examining adversarial examples writ large are Szegedy et al. (2014) and Goodfellow et al. (2015). The first of these papers describes the issue of adversarial examples, importantly demonstrating that many networks can be fooled by the same adversary on the same test data. This suggests that adversarial examples are not simply due to overfitting or intrinsic properties of a particular architecture but may be more closely related to the data itself. As a machine learning practitioner, this is concerning. Techniques to prevent overfitting such as weight regularization and dropout may not be successful if adversarial examples are more directly related to the data than the architecture.

Goodfellow et al. (2015) hypothesize that these adversarial examples come from the linearity of neural networks when there is high dimensional data. Intuitively, the authors explain that in a simple linear classifier (and in a neural network more generally), many very minor perturbations to the features can add up to a large change in the output. This paper highlights a common theme in research regarding adversarial examples, which often suggests the high dimensionality or relative sparsity of training data within that high dimensional space as the cause of adversarial examples. Additionally, Gilmer et al. (2016) take this relationship a step further by relating adversarial examples to the data manifold, specifically suggesting that examples will be close to misclassified ones due to the high-dimensional geometry present.
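
The linearity argument can be made concrete with a small sketch. For a linear score $w \cdot x$, moving every feature by at most $\epsilon$ in the direction of the sign of its weight shifts the score by $\epsilon \Vert w \Vert_1$, which grows with the number of features. The weights below are hypothetical and chosen only to isolate the effect of dimension; this is an illustration of the argument, not code from either paper.

```python
def linear_score(w, x):
    """Score of a linear classifier, w . x (bias omitted for brevity)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sign(v):
    """Sign of a scalar as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def worst_case_shift(w, eps):
    """Largest change in w . x when each feature moves by at most eps.

    The maximizing perturbation is eps * sign(w), which shifts the score
    by eps * ||w||_1.
    """
    delta = [eps * sign(wi) for wi in w]
    return linear_score(w, delta)


# Hypothetical weights of equal magnitude; only the dimension changes.
eps = 0.01
for dim in (10, 1_000, 100_000):
    w = [0.5 if i % 2 == 0 else -0.5 for i in range(dim)]
    print(dim, worst_case_shift(w, eps))
```

The per-feature change stays fixed at `eps`, yet the shift in the score scales linearly with `dim`, which is the intuition behind the high-dimensionality explanation.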

A second prominent hypothesis connects adversarial examples to out-of-distribution points, suggesting that adversarial examples are really just off the data manifold. Song et al. (2018) demonstrate that even minor changes across a number of adversarial attacks substantially change the log likelihood of the examples. While log likelihood may not be the ideal measure of OOD examples, it provides an additional explanation for the presence of adversarial examples. Moreover, the intuitive, non-algorithmic solution is potentially two-fold. First, by using techniques to better sample from low-density regions or otherwise impart information into a network, we may overcome this challenge. Second, we could potentially devise a test or separate classifier to detect OOD points as a first line of defense against adversarial examples.

One particularly important advancement in this area is the concept of "adversarial training" (Madry et al. 2019). Adversarial training introduces a specific adversary into the training procedure: rather than training on a fixed set of examples, each batch is perturbed to be adversarial before training. This provides guarantees of the method's robustness to that adversary, and the authors demonstrate that robustness also transfers, to a lesser degree, between adversaries. More formally, an adversary in this context is the perturbation method used, so a model trained against one adversary, e.g. the Fast Gradient Sign Method (FGSM), will still be somewhat robust to a Projected Gradient Descent (PGD) attack (though it is best to explicitly account for many different attack forms in training, which entails more involved work).

Finally, lest we consign adversarial examples only to the realm of computers, recent research highlights that if humans are given a tight time limit, they too perform worse on adversarial examples generated by these methods (Elsayed et al., 2018)! This is an important connection as it highlights that there may be something inherently recognized about the adversarial examples by humans and potentially partially dispels the mechanical hypotheses proposed by Goodfellow et al. (2015).

#### Contribution

In contrast to previous work, this paper focuses on one particular element of adversarial examples: the features. Rather than argue that the adversarial examples arise as bugs or due to being in some low-dimensional space, the authors demonstrate that any given training example can be decomposed into robust and non-robust features. These subsets of features correspond to those that remain useful after adversarial perturbation and those that do not. In the case of an image, we might think of the robust features as those that are truly correlated with the image class. For determination of animal type, this could correspond to actual information about eyes and ears. Non-robust features, on the other hand, could consist of common background characteristics; these are correlated with the label in the dataset but could be easily changed such that a human would still believe the original class label while the algorithm would predict something completely different. An example would be an image that looks like a pig and was originally predicted as one; after adversarial perturbation the image still looks like a pig to the human eye, but the model now classifies it as a wombat.

#### Technical Content

The content of this paper is predicated on the notion of adversarial examples, which themselves are the basis for adversarial training, used to build models robust to adversarial attacks. Adversarial examples are generated by restricted perturbations to the original input that have an outsized effect on predictions made by the respective model. In theory, adversarial training solves a min-max problem. The inner maximization seeks to maximize the loss function via perturbation and the outer minimization is the traditional loss function minimization in machine learning. In practice, adversarial training is performed by incorporating adversarial examples into the training corpus, with their correct labels, in order to weaken the efficacy of such attacks.
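
For reference, the min-max problem described above is commonly written as follows. The symbols $\ell$, $h_{\theta}$, $\delta$, and $\epsilon$ match those used in the PGD update later in this section; the data distribution $\mathcal{D}$ is our own notation, and the $L_2$ constraint is shown because it is the one used throughout this report.

$$
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\Vert \delta \Vert_2 \le \epsilon} \ell(h_{\theta}(x + \delta), y) \right]
$$

The inner maximization is what PGD approximates for each example, and the outer minimization is the usual parameter update performed on the perturbed batch.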



The authors consistently rely on projected gradient descent (PGD) in order to learn the adjustment parameter, $\delta$, which is used to generate adversarial examples as well as robust and non-robust data. PGD is a gradient-based algorithm leveraged to solve constrained optimization problems. For the case of adversarial examples, the problem is learning an update to the original input such that the L2-distance between the original and the update is less than some upper bound, $\epsilon$, but the predicted label is altered. For the case where the L2-norm is the only constraint on $\delta$, the update expression is as follows:

$$
\delta \coloneqq \delta - \alpha\frac{\nabla_{\delta}\ell(h_{\theta}(x + \delta), y)}{\Vert\nabla_{\delta}\ell(h_{\theta}(x + \delta), y)\Vert_2}
$$

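where $\alpha$ is the learning rate and $h_{\theta}$ is the parameterized model. We can see that the update involves the gradient of the loss with respect to $\delta$, inversely scaled by the L2-norm of the gradient. As written, the step descends the loss, which is the direction used when generating robust and non-robust data; when generating untargeted adversarial examples the sign is flipped so that the step ascends the loss, as in the code below. Since image data is a common focus in adversarial literature (as is the case in this paper), it is also common practice to project the right-hand side of the above update, referred to here as $z$, back onto the $\epsilon$-ball: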

$$
\delta \coloneqq P_{\epsilon}(z) = \epsilon\frac{z}{\max\{\epsilon, \Vert z \Vert_2\}}
$$

For generating non-robust features and adversarial examples, both the gradient normalization and the projection onto the $\epsilon$-ball are applied to the gradient updates. However, only the gradient normalization is applied when generating the robust features. This reflects the nature of adversarial examples as necessarily small, and often imperceptible, changes to the original input. Robustification, on the other hand, need not involve small perturbations and instead relies only on the assumption that the adversarially-trained, robust model has indeed learned robust data representations.
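
To make the two operations concrete, here is a minimal sketch on a single flattened vector in plain Python, separate from the batched TensorFlow implementation used in this report. The function names and the example gradient are hypothetical; the sketch assumes $\epsilon > 0$.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(c * c for c in v))


def normalized_step(delta, grad, alpha, ascend=True):
    """Move delta by alpha along the unit-norm gradient direction.

    ascend=True increases the loss (untargeted adversarial example);
    ascend=False decreases it (robust and non-robust data generation).
    """
    g = l2_norm(grad) + 1e-10  # guard against a zero gradient
    direction = 1.0 if ascend else -1.0
    return [d + direction * alpha * c / g for d, c in zip(delta, grad)]


def project_l2(z, epsilon):
    """P_eps(z) = eps * z / max(eps, ||z||_2); assumes epsilon > 0."""
    scale = epsilon / max(epsilon, l2_norm(z))
    return [scale * c for c in z]


delta = [0.0, 0.0]
grad = [3.0, 4.0]  # hypothetical gradient with L2 norm 5
z = normalized_step(delta, grad, alpha=1.0)
delta = project_l2(z, epsilon=0.5)
print(l2_norm(z), l2_norm(delta))
```

A vector already inside the ball is returned unchanged because the scale factor is exactly one, while a vector outside the ball is rescaled to have norm $\epsilon$.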

**L2-PGD for Generating Adversarial Features:**

In [None]:
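import time

import numpy as np
import tensorflow as tf


def norm(Z):
    '''Per-example L2 norm of a batch.

    Assumed helper: `norm` is called below but is not defined in this section.
    '''
    return tf.norm(tf.reshape(Z, (tf.shape(Z)[0], -1)), axis=1)


def single_pgd_step_adv(model, X, y, alpha, epsilon, delta):
    ''' Completes a single step of L2 PGD

    Evaluates cross entropy loss and propagates the gradient to the adjustment parameter delta.
    Takes a normalized ascent step, then projects delta onto the L2 ball of radius epsilon.
    '''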
    with tf.GradientTape() as tape:
        tape.watch(delta)
        loss = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True,
            reduction=tf.keras.losses.Reduction.NONE # Use no aggregation - will give gradient separately for each ex.
            )(y, model(X + delta)) # comparing to label for original data point
    grad = tape.gradient(loss, delta)

    normgrad = tf.reshape(norm(grad), (-1, 1, 1, 1))
    z = delta + alpha * (grad / (normgrad + 1e-10))

    normz = tf.reshape(norm(z), (-1, 1, 1, 1))
    delta = epsilon * z / (tf.math.maximum(normz, epsilon) + 1e-10)
    return delta, loss

def pgd_l2_adv(model, X, y, alpha, num_iter, epsilon=0, example=False):
    """Applies L2 PGD for an adversarial example
    
    Will run `num_iter` iterations of PGD over the examples for L2 ball
    constrained by `epsilon` and step size of `alpha`. 
    
    To optimize performance, will decorate only the interior function. Moreover,
    we will re-instantiate this every time. O/w TF will produce errors related to retracing
    of the computational graph."""

    # Apply tf.function to create computational graph of the single step
    # for optimal performance
    fn = tf.function(single_pgd_step_adv)
    delta = tf.zeros_like(X)
    loss = 0
    for t in range(num_iter):
        delta, loss = fn(model, X, y, alpha, epsilon, delta)

    if example:
        print(f'{num_iter} iterations, final cross-entropy loss {loss}')
    return delta

Under the framework presented in this paper, adversarial training serves as the basis for extracting robust and non-robust features from a data set. 

*Robust features:* Generating robust features begins by first creating a robust model through adversarial training. Then, for each entry of the original dataset, a random starting data input is iteratively updated to produce as close a representation as possible to the original input, under the robust model. Specifically, the "robustification" model includes all but the output layer of the robust, task model. The representation for a data point is the output of this penultimate layer. Robustification then occurs through a supervised learning task. The data point from which the robust features will be extracted is passed through the robustification model yielding its representation, which will serve as the label. Then, a random sample from the data is selected as the starting point. The robustification process proceeds by iteratively perturbing the random sample in order to minimize the L2-distance between the label and the sample's representation. Because the model itself is robust, the adjustments to the random input should only reflect the robust features of the original data point.
<img src="imgs/robust_alg.png" alt="Drawing" style="width: 500px; margin-left: auto; margin-right: auto;"/>
<center>Figure 1: Algorithm to construct the robust data set.</center>

In [None]:
## PGD-L2 for Robustification ##
def single_pgd_step_robust(model, X, y, alpha, delta):
    with tf.GradientTape() as tape:
        tape.watch(delta)
        loss = tf.keras.losses.MeanSquaredError(
            reduction=tf.keras.losses.Reduction.NONE
        )(y, model(X + delta)) # comparing to robust model representation layer

    grad = tape.gradient(loss, delta)
    normgrad = tf.reshape(norm(grad), (-1, 1, 1, 1))
    delta -= alpha*grad / (normgrad + 1e-10) # normalized gradient step
    delta = tf.math.minimum(tf.math.maximum(delta, -X), 1-X) # clip X+delta to [0,1]
    return delta, loss

def pgd_l2_robust(model, X, y, alpha, num_iter, epsilon=0, example=False):
    delta = tf.zeros_like(X)
    loss = 0
    fn = tf.function(single_pgd_step_robust)
    for t in range(num_iter):
      delta, loss = fn(model, X, y, alpha, delta)
    # Prints out loss to evaluate if it's actually learning
    if example:
        print(f'{num_iter} iterations, final MSE {loss}')
    return delta

## Example robustification training ##
def robustify(robust_mod, train_ds, iters=1000, alpha=0.1, batch_size=BATCH_SIZE):
  robust_train = []
  orig_labels = []
  example = False

  train_to_pull = list(iter(train_ds))
  # Exclude the last batch, which may be smaller than batch_size
  start_rn = np.random.randint(0, len(train_to_pull) - 1)
  rand_batch = train_to_pull[start_rn][0]

  start_time = time.time()
  for i, (img_batch, label_batch) in enumerate(train_ds):
      inter_time = time.time()  

      # For the last batch, it is smaller than batch_size and thus we match the size for the batch of initial images
      if img_batch.shape[0] < batch_size:
        rand_batch = rand_batch[:img_batch.shape[0]]

      # Get the goal representation
      goal_representation = robust_mod(img_batch)
      
      # Update the batch of images
      learned_delta = pgd_l2_robust(robust_mod, rand_batch, goal_representation, alpha=alpha, num_iter=iters)
      robust_update = (rand_batch + learned_delta)

      # Add the updated images and labels to their respective lists
      robust_train.append(robust_update)
      orig_labels.append(label_batch)
      
      # Measure the time
      if (i+1) % 10 == 0:
        elapsed = time.time() - start_time
        elapsed_tracking = time.time() - inter_time
        print(f'Robustified {(i+1)*batch_size} images in {elapsed:0.3f} seconds; Took {elapsed_tracking:0.3f} seconds for this particular iteration')    
      
      # Reset random image batch
      rn = np.random.randint(0, len(train_ds)-1) # -1 because last batch might be smaller
      rand_batch = train_to_pull[rn][0]

  return robust_train, orig_labels

*Non-robust features*: The process for generating non-robust features is similar in most ways to creating targeted adversarial attacks. First, a random label is selected from among the range of possible options. Then, using a standard (non-robust) task model, the input image is updated in order to minimize the loss between the model prediction and the random, incorrect label while restricting the perturbation to a relatively small bound, $\epsilon$. It is this minimization that distinguishes targeted adversarial attacks from untargeted ones, where the process involves *maximizing* the loss for a given input and its true label, without regard to which other label the input ultimately becomes correlated with. In principle, the non-robustification algorithm re-correlates the non-robust features with an incorrect label. We can then save the non-robustified input with the incorrect label, yielding the non-robust data set.
<img src="imgs/nonrobust_alg.png" alt="Drawing" style="width: 500px; margin-left: auto; margin-right: auto;"/>
<center>Figure 2: Algorithm to construct the non-robust data set.</center>

In [None]:
## Single step of PGD-L2 for non-robustification ##
def single_pgd_step_nonrobust(model, X, y, alpha, epsilon, delta):
    ''' Completes a single step of targeted L2 PGD

    Evaluates cross entropy loss against the target label and propagates the gradient to the adjustment parameter delta.
    Takes a normalized descent step, then projects delta onto the L2 ball of radius epsilon.
    '''
    with tf.GradientTape() as tape:
        tape.watch(delta)
        loss = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True,
            reduction=tf.keras.losses.Reduction.NONE # Use no aggregation - will give gradient separately for each ex.
            )(y, model(X + delta)) # comparing to the (random) target label
    grad = tape.gradient(loss, delta)

    # equivalent to delta -= alpha*grad / norm(grad), just for batching
    normgrad = tf.reshape(norm(grad), (-1, 1, 1, 1))
    # minus rather than plus because we minimize the loss toward the target label
    z = delta - alpha * (grad / (normgrad + 1e-10))
    normz = tf.reshape(norm(z), (-1, 1, 1, 1))
    delta = epsilon * z / (tf.math.maximum(normz, epsilon) + 1e-10)
    return delta, loss

def pgd_l2_nonrobust(model, X, y, alpha, num_iter, epsilon=0, example=False):
    fn = tf.function(single_pgd_step_nonrobust)
    delta = tf.zeros_like(X)
    loss = 0
    for t in range(num_iter):
        delta, loss = fn(model, X, y, alpha, epsilon, delta)

    if example:
        print(f'{num_iter} iterations, final cross-entropy loss {loss}')
    return delta

**The remaining code for producing non-robust data directly parallels that for robustification, exchanging only the step function for the version implemented above.**
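
The target-label selection step is not shown in the code above. A minimal sketch of one way to draw targets is given below, assuming (as the description above suggests) that the target must differ from the original label. The function name and the `seed` parameter are hypothetical and are not part of our pipeline.

```python
import random


def random_target_labels(labels, num_classes, seed=None):
    """Draw, for each true label, a target label from the other classes."""
    rng = random.Random(seed)
    targets = []
    for y in labels:
        # An offset in [1, num_classes - 1] guarantees the target differs from y.
        offset = rng.randrange(1, num_classes)
        targets.append((y + offset) % num_classes)
    return targets


labels = [0, 3, 9, 9, 5]
targets = random_target_labels(labels, num_classes=10, seed=0)
print(list(zip(labels, targets)))
```

The resulting targets would take the place of `y` in `pgd_l2_nonrobust`, and the perturbed inputs would then be stored alongside these targets rather than the original labels.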

#### Theoretical Framework

Beyond the experiments described below, the authors propose a further theoretical framework for engaging with the robust features. Through a series of proofs over the separation of two Gaussian distributions (as opposed to the more complex image classification task), the authors derive three primary facts to bolster their argument:

1. *The difference between the $\ell_2$ objective in the adversarial step and the loss function quantifies the overall adversarial vulnerability of the model*. Moreover, the authors connect this directly to the non-robust features that occur in the data. Because this difference exists for the adversary, non-robust features provide dimensions on which the distance of the true metric and the distance in the $\ell_2$ ball are not close and are therefore exploitable.
2. *Robust (adversarial) training learns both the adversarial and standard loss functions.* The authors show that the adversarial training procedure implemented learns the same (true) mean as the standard training, but that the major differences occur within the learned covariance structure.
3. *Gradients from robustly learned models are more interpretable*. Because the robust procedure learns a covariance structure that is more aligned with the identity than its nonrobust counterpart, this tends to produce classification boundaries that are more orthogonal to the line connecting the two means (which are always learned correctly). Mathematically, this makes the gradient point more directly towards the class in the Gaussian distribution classification setting.

Taken together, the authors conclude that these properties enforce a prior over the features. Intuitively, stronger regularization corresponds to a stronger adversary during training, and thus the model will ideally learn only the robust features.
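
As a reminder of the setting in which these results are derived, the following is our transcription of the paper's Gaussian model; the notation is reproduced from memory and Ilyas et al. (2019) should be consulted for the exact statement. Labels are drawn uniformly from $\{-1, +1\}$ and inputs from a Gaussian whose mean depends on the label:

$$
y \sim \mathrm{Unif}\{-1, +1\}, \qquad x \sim \mathcal{N}(y \cdot \mu_{*}, \Sigma_{*})
$$

Standard training fits $\Theta = (\mu, \Sigma)$ by minimizing the Gaussian negative log-likelihood $\ell$, while robust training minimizes its worst case over an $\ell_2$ ball of radius $\epsilon$:

$$
\Theta = \arg\min_{\mu, \Sigma} \mathbb{E}\left[\ell(x; y \cdot \mu, \Sigma)\right], \qquad
\Theta_{r} = \arg\min_{\mu, \Sigma} \mathbb{E}\left[\max_{\Vert \delta \Vert_2 \le \epsilon} \ell(x + \delta; y \cdot \mu, \Sigma)\right]
$$

The three facts above compare the parameters learned under these two objectives.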

#### Experiments

Research and applications in adversarial machine learning often center on image data, and so it is fitting that the authors explore how the concepts of robust and non-robust data apply to such data. In particular, the authors consider the concept of *transferability*: how well do standard task models perform when trained with robust or non-robust data and then tested with images from an unmodified data set as well as adversarial attacks? To perform these experiments, the authors utilize CIFAR-10 and a restricted sample from ImageNet as the data sets and ResNet-50 as the model. The experiments consider four training scenarios, each evaluated with standard as well as adversarial test data:
- Trained with standard data
- Trained with adversarial examples
- Trained with robust data
- Trained with non-robust data (only applied for CIFAR-10 data)

For all training data sets, the model was able to achieve consistent, good performance when tested with unmodified data. However, the standard training paradigm yielded very poor accuracy when attacked adversarially. In contrast, the adversarially-trained model was able to achieve nearly as good accuracy with adversarial test data as with standard data. These results mainly serve to confirm the validity of the experiments and are well-established. Moving on to the robust and non-robust data, the experiments showed that the model trained on robust data was far more resilient to adversarial attacks than the standard model though it did not reach the same performance as the adversarially-trained model. Finally, the non-robust trained model was *not* resilient to adversarial attacks and performed the worst among the four models in that regard.

<img src="imgs/exper_results_paper.png" alt="Drawing" style="width: 500px; margin-left: auto; margin-right: auto;"/>
<center>Figure 3: Test accuracy for models trained with various versions of CIFAR-10 dataset and tested with standard and adversarially-perturbed data.</center>

---------
<br>

As part of our work, we attempted to replicate these results. We were able to directly mimic the authors' implementation of the adversarial training as well as the production of robust and non-robust features. Furthermore, we were able to achieve similar results to the authors for robust training. However, we observed that the model trained with the non-robust dataset (using the SGD optimizer with learning rate decay pulled from the paper) wasn’t able to do better than random-choice accuracy, and accuracy did not improve across epochs. Therefore, we switched to the Adam optimizer, which yielded results closer to those seen above. We were also curious to see how the models would perform with various classes of adversaries. Thus, in addition to the PGD $L_2$ adversarial attacks that the authors used to evaluate their models, we also included PGD $L_{\infty}$ and FGSM adversarial attacks. Last, we parameterized the adversarial attacks with epsilon values of 0.25 and 0.5 since the authors used both values in their experiments.
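
The additional attacks differ from PGD $L_2$ mainly in how the perturbation is constrained. The sketch below illustrates the $L_{\infty}$ projection and the single-step FGSM perturbation on a flat vector, and computes how large an $L_{\infty}$-bounded perturbation can be when measured in the $L_2$ norm. It assumes that the same $\epsilon$ is reused across norms and that inputs are CIFAR-10 sized; the function names are hypothetical and this is not the evaluation code itself.

```python
import math


def project_linf(z, epsilon):
    """Project onto the L-infinity ball: clip every coordinate to [-eps, eps]."""
    return [max(-epsilon, min(epsilon, c)) for c in z]


def fgsm_step(grad, epsilon):
    """Single FGSM perturbation: epsilon times the sign of the gradient."""
    return [epsilon * ((g > 0) - (g < 0)) for g in grad]


def max_l2_norm_in_linf_ball(epsilon, dim):
    """Largest L2 norm of any perturbation with L-infinity norm <= epsilon."""
    return epsilon * math.sqrt(dim)


CIFAR_DIM = 32 * 32 * 3  # pixels times channels for one CIFAR-10 image
for eps in (0.25, 0.5):
    print(eps, max_l2_norm_in_linf_ball(eps, CIFAR_DIM))
```

Under these assumptions, an $L_{\infty}$ budget of $\epsilon$ admits perturbations whose $L_2$ norm is $\sqrt{3072} \approx 55$ times larger than the $L_2$ budget of the same nominal size, so the attacks are not directly comparable in strength at equal $\epsilon$.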

<img src="imgs/epsilon_025.png" alt="Drawing" style="width: 400px; margin-left: auto; margin-right: auto;"/>
<center>Figure 4: Test accuracy for models trained with various versions of CIFAR-10 dataset and tested with standard and adversarially-perturbed data ($\epsilon=0.25$).</center>

<img src="imgs/epsilon_050.png" alt="Drawing" style="width: 400px; margin-left: auto; margin-right: auto;"/>
<center>Figure 5: Test accuracy for models trained with various versions of CIFAR-10 dataset and tested with standard and adversarially-perturbed data ($\epsilon=0.5$).</center>

We observe a similar trend with the test accuracies for the first three models (i.e. standard training, adversarial training, and robust training) where the models have relatively high test accuracy and each subsequent model has a slightly lower test accuracy. The non-robust training model has a much lower test accuracy compared to the original paper. It is unclear why this is the case, but we hypothesize that it could be due to suboptimal hyperparameters for the creation of the non-robust dataset or for training the model.

We also observe similar trends for PGD $L_2$ adversarial test accuracies; the adversarial training model has the highest resistance towards PGD $L_2$ attacks, followed by the robust training model. The standard training and non-robust training models are very susceptible to PGD $L_2$ attacks. We also note that none of the models seem to offer resistance to PGD $L_{\infty}$ attacks, with all models under the different training configurations having test accuracy scores of 0.00. Finally, FGSM adversarial attacks are also reasonably effective, bringing down the test accuracies of the models to chance level (i.e. test accuracy of 0.10).

------- 
<br>
For our experimental setup, we were able to at least partially validate our methods. The first challenge was that the authors cite using ResNet50. However, this network requires 224 x 224 inputs rather than the 32 x 32 inputs offered by CIFAR-10. We found two alternatives. First, some implementations appear to upsample CIFAR images to the appropriate dimensions. Second, in the original ResNet paper (He et al. (2015)) the authors describe ResNet20 and ResNet56, modified ResNet architectures that handle 32 x 32 inputs and demonstrate comparable standard training performance on CIFAR-10. We elect to use the latter. Specifically, all of our experimental results come from the narrow version of ResNet20 (See He et al. (2015) at 7). We note that in some implementations of other robustness papers from the lab, the "wide" version is used, where the wide version corresponds to using the number of convolutional filters from ResNet for ImageNet (beginning at 64 rather than 16). This results in a much larger network which is more computationally expensive to train, but which may produce superior results. Importantly, we have tested multiple parameterizations and find that this does not substantially affect overall accuracy, as demonstrated by He et al. (2015) who show only a few percentage point difference in error between the ResNet20 and ResNet56 results.

For all other parameters, we follow those specified by Ilyas et al. as closely as possible, with a few notable exceptions. We take weight decay to mean L2 regularization over the parameters rather than the SGDW algorithm presented in Loshchilov and Hutter (2019); we evaluated both and found that SGDW with weight decay does not converge, whereas SGD with L2 regularization converges quickly. Beyond this, first, the authors use a "Drop in learning rate" and provide a starting learning rate but do not specify the final value. To mitigate this, we follow the procedure from He et al. (2015), using a piecewise learning rate decay at roughly 60% and 80% of the way through training. Second, the authors do not provide the number of epochs or iterations. ResNet was originally trained over 64,000 iterations. Given time constraints and our experimental results, we restrict training to 25,000 iterations after observing that performance plateaus at that level. This is similarly supported by He et al. (2015), who show little to no gain in test set performance with ResNet20 or ResNet56 on CIFAR-10 past this point. Most of the reduction in total network size comes via the width, not the depth, as standard ResNet56 has roughly 850,000 parameters and wide ResNet56 has over 13 million, making it much slower to train and evaluate.
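
The piecewise schedule described above can be sketched as a plain function of the iteration count. The starting rate of 0.1 and the decay factor of 0.1 are illustrative assumptions, since neither value is stated in this section; only the 60% and 80% milestones and the 25,000-iteration budget come from the text.

```python
def piecewise_lr(step, total_steps, base_lr, factor=0.1, milestones=(0.6, 0.8)):
    """Learning rate at `step`, multiplied by `factor` at each milestone.

    `milestones` are fractions of `total_steps`.
    """
    passed = sum(1 for m in milestones if step >= round(m * total_steps))
    return base_lr * factor ** passed


TOTAL_STEPS = 25_000  # iteration budget stated above
BASE_LR = 0.1         # hypothetical starting rate, not taken from the paper
for step in (0, 14_999, 15_000, 19_999, 20_000, 24_999):
    print(step, piecewise_lr(step, TOTAL_STEPS, BASE_LR))
```

With these settings the rate is held constant until iteration 15,000, reduced once there, and reduced again at iteration 20,000.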

Finally, and importantly, we had to make one update to the optimizer. When using the optimizer that the authors proposed, the standard training model using the non-robust dataset fails to perform well with the test set. This suggests that the model is unable to learn the non-robust features from the non-robust dataset. We further experimented with the optimizer and found that the Adam optimizer allowed this model to learn the non-robust features (i.e. having test accuracies that are significantly higher than chance). After many iterations with SGD, we were able to converge the network to high performance once, but it appears that this was highly sensitive to the random initialization selected.

#### Evaluation

The authors did not provide code to replicate their results, making the work more difficult to reproduce and necessitating experimentation with various models and hyperparameters to produce comparable results. While the authors do provide many hyperparameters, this has posed a challenge for our replication. As a point of evaluation, there are two concerns. First, it is difficult to fully assess the technical accuracy of the proposed method without more in-depth interrogation of their code, and second, the results for the non-robust features in particular appear to be somewhat brittle. Indeed, this may be part and parcel of the authors' argument. As non-robust features are explicitly considered to be "well-generalizing, yet brittle" features, they may be difficult to replicate, and small changes in the algorithm or model could lead to quite different features being extracted. It's quite possible that the non-robust feature space is much larger than the robust feature space and, as such, variations on the non-robust feature extraction algorithm identify different subspaces.

Importantly, as we consider the generalizability of this work, we should consider how variations in model hyperparameters greatly impact the outcomes. In certain cases, changes to the hyperparameters have an understandable effect on results. For example, decreasing the $\epsilon$ value restricts the amount by which the original image can be perturbed and as such will likely produce weaker adversaries. However, it is less clear why the choice of optimizer used during the task training with non-robust train data has such a dramatic impact on the results. To assert that non-robust features in particular reflect some inherent quality of the data, we should expect that we could extract similar features under a variety of approaches, not just the precise implementation used by the authors.

Additionally, in our work we show that while a model trained on robust data is in fact more robust to PGD-L2 adversarial attacks, it is still vulnerable to certain types of alternative attacks. Specifically, and importantly, the model <i>only appears to be robust to the types of adversaries learned in adversarial training</i>. What this creates is a somewhat circular logic. A model is trained to be robust to a certain type of adversary, then a robust data set is generated from that model, and then finally a new model trained on the robust data performs well against those adversaries. The effect is reminiscent of overfitting which suggests a limit to the generalizability of the robust data, which has important practical implications. If deployed in real world applications, the guise of a model or data set robust to adversarial attacks can lead to dangerous complacency or over-trust. 

In a more philosophical sense this lack of generalizability also has important implications. If we are to relate robust features to a notion of human interpretability, can we reasonably restrict that notion to features robust to a very specific type of attack that wouldn’t necessarily trick a human? That said, we may also reasonably consider that there are many flavors to human interpretability and that L2-adversarial robustness does identify a subset of human interpretable features. In this vein the work in this paper makes a meaningful contribution in the framework of machine learning as cognitive technology. By distinguishing robust from non-robust features and building models trained on the former, we have more assurance that a model is at least using information that we believe is more relevant to human cognitive processing.



#### Future Work

This work explores how adversarial attacks may be leveraged to distinguish robust from non-robust (brittle) features. Moreover, the authors explore how models trained with these robust and non-robust features perform for standard task training as well as their resiliency to adversarial attacks. In our evaluation we were able to show that a robust dataset is only robust to the class of adversaries upon which it is trained, bringing into question the generalizability of robust features and how well they relate to broader notions of human interpretability. Moreover, we identify technical limitations to this work, primarily sensitivity to ostensibly less relevant features like the optimizer of the task model. Still, the concept of robust and non-robust data is compelling, and brings to mind questions of how robustness relates to other qualities of a data set.

In particular, there is much room to consider how robust and non-robust data relate to concepts such as out-of-distribution data, generative data distributions, and geometric representation on the data manifold. We hypothesize that variational autoencoders (VAEs) may serve as a compelling means to investigate these questions. VAEs allow us to both parameterize an approximate latent, generative distribution for the data and similarly approximate the true data distribution. A variety of possible experiments emerge from this approach. For example, we might first train a VAE using robust data and then evaluate the likelihood of adversarial examples under both the generative and data distributions. Alternatively, we might explore the geometry of such distributions. How do the distributions of non-robust and robust data compare? How might these geometries relate to concepts of adversarial robustness?
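One way to make "evaluating the likelihood" concrete, offered here as an assumption about how such an experiment could be set up rather than something we have implemented, is to use the evidence lower bound (ELBO) of a trained VAE as a proxy for the intractable log-likelihood. With an encoder $q_\phi(z \mid x)$, a decoder $p_\theta(x \mid z)$, and a prior $p(z)$ (standard VAE notation, not symbols defined elsewhere in this report), the bound is:

$$
\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$

Under this setup, one could compare the ELBO of a clean input $x$ with that of its adversarial counterpart $x + \delta$. A consistently lower bound for adversarial examples would be suggestive, though not conclusive, evidence that they fall in low-density regions of the learned distribution.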

The possibilities are vast and can help situate robust and non-robust features into the broader realm of probabilistic machine learning.

#### Broader Impact

Adversarial attacks can be a serious threat to companies, particularly ones with sensitive data. Areas such as spam classification, malware detection, and network intrusion detection all fall under the realm of adversarial attacks, as an actor is directly trying to fool the classifier. In a more extreme example, a bad actor looking to disrupt a medical system could inconspicuously perturb X-ray images so that machine learning predictions produce incorrect results. With the rise of political attacks affecting the general population, it is essential that entities and users who deploy machine learning models in high-risk situations not only protect their sensitive data but also ensure that those models are resistant to data perturbations, as the people who are most affected are the individuals whose data is being manipulated. Our work hopes to contribute to the adversarial attack community by looking at robustifying models to provide further insight into ways to defend against these attacks.

Another positive effect of our work is bringing more interpretability to highly complex models. Models built on robust features allow humans to better understand what the model is learning and to better align human and model interests. Therefore, understanding adversarial attacks through a (non)robust feature lens contributes to the ML interpretability space as well.

There are potential downsides to work on adversarial attacks and publishing results. Bad actors could continue to learn from this new information and create better attacks that fool the SOTA models meant to be resistant to such attacks, creating the sort of cat-and-mouse game seen within cybersecurity. Additionally, it's important that users who want to use adversarial robustness recognize that the theoretical backing of this technique is only guaranteed when the inner maximization problem is solved exactly. When working with neural networks in a highly non-linear space, this is impossible; therefore, a bad approximation of the inner maximization problem means that a bad actor may be able to produce an effective attack with a slightly more exhaustive inner optimization strategy. Moreover, solving the inner maximization problem using $L_2$ projected gradient descent does *not* necessarily mean that the resulting model will be robust to $L_1$ or $L_{\infty}$ attacks. In practice, the real set of attacks that users care about (for example, the set of all images a human thinks "look reasonably the same") is much more difficult to characterize, and so users must be vigilant for any new attacks that can potentially defeat their robustified model.
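For reference, the inner maximization problem discussed above is the one in the adversarial training objective of Madry et al. (2017), commonly written as

$$
\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\max_{\|\delta\|_p \leq \epsilon} \mathcal{L}(\theta, x+\delta, y)\Big].
$$

The short sketch below illustrates why the choice of $p$ matters. The input dimension (a $3 \times 32 \times 32$ image) and the budget `eps` are illustrative assumptions, not the settings used in our experiments, and `perturbation_norms` is a hypothetical helper.

```python
import numpy as np

def perturbation_norms(delta):
    """Return the L1, L2 and L-infinity norms of a flattened perturbation."""
    return {
        "l1": np.abs(delta).sum(),
        "l2": np.linalg.norm(delta),
        "linf": np.abs(delta).max(),
    }

d = 3 * 32 * 32   # assumed input dimension (flattened 32x32 RGB image)
eps = 0.5         # illustrative budget only

# Corner of the L-infinity ball: every coordinate moves by eps.
corner = perturbation_norms(eps * np.ones(d))

# Perturbation on the L2 sphere concentrated in a single coordinate.
spike_delta = np.zeros(d)
spike_delta[0] = eps
spike = perturbation_norms(spike_delta)

print("L-inf corner:", corner)   # its L2 norm is eps * sqrt(d), far above eps
print("L2 spike:    ", spike)    # its L-inf norm equals eps
```

A perturbation allowed under an $L_{\infty}$ budget of $\epsilon$ can therefore have an $L_2$ norm as large as $\epsilon\sqrt{d}$, well outside the $L_2$ ball of the same radius that the model was trained against.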


As our work makes only a small contribution to the adversarial attack space, we are less concerned that our work will be used by bad actors; therefore, we aim for our code to benefit the general community by publicly sharing our GitHub repository. We invite the machine learning community to contribute to the code base or use the published code for their research needs.

### References

- Elsayed, Gamaleldin F., et al. "Adversarial examples that fool both computer vision and time-limited humans." *arXiv preprint arXiv:1802.08195* (2018).
- Gilmer, Justin, et al. "Adversarial spheres." *arXiv preprint arXiv:1801.02774* (2018).
- Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." *arXiv preprint arXiv:1412.6572* (2014).
- He, Kaiming, et al. "Deep residual learning for image recognition." *Proceedings of the IEEE conference on computer vision and pattern recognition.* 2016.
- Ilyas, Andrew, et al. "Adversarial examples are not bugs, they are features." *arXiv preprint arXiv:1905.02175* (2019).
- Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." *arXiv preprint arXiv:1711.05101* (2017).
- Madry, Aleksander, et al. "Towards deep learning models resistant to adversarial attacks." *arXiv preprint arXiv:1706.06083* (2017).
- Song, Yang, et al. "Pixeldefend: Leveraging generative models to understand and defend against adversarial examples." *arXiv preprint arXiv:1710.10766* (2017).
- Szegedy, Christian, et al. "Intriguing properties of neural networks." *arXiv preprint arXiv:1312.6199* (2013).
