# Adversarial ML

Adversarial examples for neural networks were first observed in the context of image classification [[Szegedy et al., 2014](https://arxiv.org/abs/1312.6199)]. Many review papers cover adversarial attacks and the corresponding defenses. The following, for example, are open access: [[Liu et al., 2018](https://doi.org/10.1109/ACCESS.2018.2805680); [Akhtar and Mian, 2018](https://doi.org/10.1109/ACCESS.2018.2807385); [Ren et al., 2020](https://doi.org/10.1016/j.eng.2019.12.012); [Khamaiseh et al., 2022](https://doi.org/10.1109/ACCESS.2022.3208131); [Meyers et al., 2023](https://doi.org/10.1007/s10462-023-10521-4); [Liu et al., 2024](https://doi.org/10.1007/s10462-024-10841-z)].

## Adversarial attacks

We consider image classification as a prototypical setting in which adversarial inputs occur. Given a dataset $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^N$ of images $\boldsymbol{x}_i$ and labels $y_i$, the weights $\boldsymbol{\theta}$ of a neural network $\mathcal{M}_{\boldsymbol{\theta}}(\boldsymbol{x})$ are found by minimizing a loss function:
$$
\underset{\boldsymbol{\theta}}{\text{minimize }}
\frac{1}{N} \sum_{i=1}^N L(\mathcal{M}_{\boldsymbol{\theta}}(\boldsymbol{x}_i), y_i).
$$
Here, $L(\mathcal{M}_{\boldsymbol{\theta}}(\boldsymbol{x}), y)$ is the contribution of a single data point $(\boldsymbol{x}, y)$.
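
As a minimal sketch of this objective, the snippet below evaluates the averaged loss for a toy problem. It assumes a linear softmax classifier in place of a neural network and cross-entropy as the per-example loss $L$; the names `log_softmax` and `empirical_risk` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_softmax(Z):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def empirical_risk(W, b, X, y):
    """Mean cross-entropy of a linear softmax model (assumed stand-in for M_theta)."""
    logits = X @ W.T + b                                      # shape (N, K)
    per_example = -log_softmax(logits)[np.arange(len(y)), y]  # L(M(x_i), y_i)
    return per_example.mean()                                 # (1/N) * sum over i

# Toy dataset (illustrative values): N = 4 inputs with 2 features, K = 3 classes.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]])
y = np.array([0, 1, 2, 0])

# With all-zero weights the model predicts uniform class probabilities.
W = np.zeros((3, 2))
b = np.zeros(3)
risk = empirical_risk(W, b, X, y)
```

With zero weights every class receives probability $1/3$, so the averaged loss equals $\log 3$. Training would then lower this value by adjusting `W` and `b`.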

An adversarial attack can be formulated as another optimization problem. One can try to find an imperceptible perturbation $\boldsymbol{\delta}$ to an input $\boldsymbol{x}$ such that $\boldsymbol{x} + \boldsymbol{\delta}$ is misclassified by a trained classifier. This can be realized by maximizing the loss:
$$
\underset{\boldsymbol{\delta} \in \Delta}{\text{maximize }}
L(\mathcal{M}_{\boldsymbol{\theta}}(\boldsymbol{x} + \boldsymbol{\delta}), y).
$$
A small $\ell_p$-ball $\Delta = \{\boldsymbol{\delta} \, | \, \lVert \boldsymbol{\delta} \rVert_p \leq \epsilon\}$ of radius $\epsilon$ is often used to constrain the perturbation. More generally, any modification that can reasonably be assumed not to change the true class label is admissible.
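
One simple way to approximately solve this maximization for $p = \infty$ is a single signed-gradient step, in the spirit of the fast gradient sign method. The sketch below is only an illustration under stated assumptions: the classifier is a hypothetical linear softmax model with cross-entropy loss, so the input gradient can be written by hand, and all names and numbers are made up for the example.

```python
import numpy as np

def softmax(z):
    """Softmax of a logit vector, shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(W, b, x, y):
    """Loss L(M(x), y) for the assumed linear softmax model."""
    return -np.log(softmax(W @ x + b)[y])

def loss_grad_x(W, b, x, y):
    """Gradient of the cross-entropy with respect to the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0              # dL/dlogits = p - onehot(y)
    return W.T @ p           # chain rule through logits = W x + b

def untargeted_linf_step(W, b, x, y, eps):
    """One signed-gradient ascent step; the result lies on the l_inf ball of radius eps."""
    return eps * np.sign(loss_grad_x(W, b, x, y))

# Illustrative model and input (not taken from any real dataset).
W = np.array([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]])
b = np.zeros(3)
x = np.array([1.0, 0.0])
y = 0          # true label
eps = 0.1      # perturbation budget

delta = untargeted_linf_step(W, b, x, y, eps)
loss_clean = cross_entropy(W, b, x, y)
loss_adv = cross_entropy(W, b, x + delta, y)
```

Here `delta` satisfies $\lVert \boldsymbol{\delta} \rVert_\infty \leq \epsilon$ by construction, and `loss_adv` exceeds `loss_clean`. For a deep network the gradient would come from automatic differentiation, and several projected steps are commonly used instead of one.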

The attack above only pushes the prediction away from the true class $y$ (for a cross-entropy loss, it minimizes the predicted probability of $y$) and does not specify a particular wrong class. It is therefore called **untargeted**. One may instead trick the model into predicting a specific label $\tilde{y}$ with $\tilde{y} \neq y$. Such a **targeted attack** can be formulated as:
$$
\underset{\boldsymbol{\delta} \in \Delta}{\text{minimize }}
L(\mathcal{M}_{\boldsymbol{\theta}}(\boldsymbol{x} + \boldsymbol{\delta}), \tilde{y}).
$$
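
The targeted problem can be approached in the same gradient-based way, now descending on the loss with respect to $\tilde{y}$. The following sketch uses a few projected signed-gradient steps under an $\ell_\infty$ constraint. It again assumes a hypothetical linear softmax classifier with cross-entropy loss; the function names, step size, and toy numbers are illustrative choices, not a prescribed method.

```python
import numpy as np

def softmax(z):
    """Softmax of a logit vector, shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def target_loss(W, b, x, y_target):
    """Cross-entropy of the assumed linear softmax model against the target label."""
    return -np.log(softmax(W @ x + b)[y_target])

def target_loss_grad_x(W, b, x, y_target):
    """Gradient of the target loss with respect to the input x."""
    p = softmax(W @ x + b)
    p[y_target] -= 1.0       # dL/dlogits = p - onehot(y_target)
    return W.T @ p

def targeted_pgd(W, b, x, y_target, eps, step, n_steps):
    """Signed-gradient descent on the target loss, projected onto the l_inf ball."""
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        g = target_loss_grad_x(W, b, x + delta, y_target)
        delta = delta - step * np.sign(g)   # descend: make the target label more likely
        delta = np.clip(delta, -eps, eps)   # projection onto {delta : |delta|_inf <= eps}
    return delta

# Illustrative model and input (not taken from any real dataset).
W = np.array([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]])
b = np.zeros(3)
x = np.array([1.0, 0.0])
y_target = 1   # wrong label the attacker wants to obtain
eps = 0.1

delta = targeted_pgd(W, b, x, y_target, eps, step=0.025, n_steps=10)
loss_clean = target_loss(W, b, x, y_target)
loss_adv = target_loss(W, b, x + delta, y_target)
```

In this toy setting the target loss decreases while the perturbation stays inside the $\ell_\infty$ ball. Whether the predicted label actually flips to $\tilde{y}$ depends on the budget $\epsilon$ and on the model.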