<h1 align="center">Inconspicuous and Effective Over-the-Air Adversarial Examples via Adaptive Filtering</h1>
<h4 align="center"><i>ICASSP</i> '22</h4>

<div align="center">
<h4>
    <a href="https://interactiveaudiolab.github.io/assets/papers/oreilly_awasthi_vijayaraghavan_pardo_2021.pdf">preprint</a> •
        <a href="https://interactiveaudiolab.github.io/project/audio-adversarial-examples.html">website</a> • 
        <a href="https://interactiveaudiolab.github.io/demos/audio-adversarial-examples.html">audio</a>
    </h4>
    <p>
    by <em>Patrick O'Reilly, Pranjal Awasthi, Aravindan Vijayaraghavan, Bryan Pardo</em>
    </p>
</div>
<p align="center"><img src="https://interactiveaudiolab.github.io/assets/images/projects/filters.png" width="400"/></p>
    
    
This notebook and the corresponding repository contain code for the proposed time-varying filter attack and the baseline frequency-masking attack against a speaker-verification system. Once the repository is cloned, hyperparameters can be found in `src/constants.py`. Some default values differ from those used in the paper to allow for shorter runtimes. To match the experiments in the paper, set:
    
| Variable | Value | Description |
|---|---|---|
| `SR` | `16000` | audio sample rate, in Hz |
| `SIG_LEN` | `4.0` | length, in seconds, to which audio signals are padded / trimmed |
| `MAX_ITER` | `8000` | number of optimization iterations for each attack |
| `N_PER_CLASS` | `10` | number of instances per class on which to perform attacks |
| `EOT_ITER` | `1` | frequency of expectation-over-transformation parameter resampling; by default, sample new simulation parameters every iteration |
| `N_EVALS` | `2000` | number of random simulations under which to evaluate final generated attacks |
| `N_SEGMENTS` | `0` | for speaker verification model, number of fixed-length segments extracted from each utterance to compute embeddings; setting to `0` uses entire utterance to compute a single embedding. Larger values result in only slightly more robust speaker verification models, at the expense of slower prediction |
| `DISTANCE_FN` | `'cosine'` | embedding-space distance function for speaker verification model |
| `THRESHOLD` | `0.5846` | embedding-space decision threshold for speaker verification model; default value set to EER threshold |
| `CONFIDENCE` | `0.5` | for adversarial loss computation, margin by which spoofed audio must fall under verification threshold; used to encourage strong, high-confidence attacks |

__Note that these hyperparameters may result in very long runtimes; we therefore recommend trying the default values first.__ We likewise recommend running this notebook in a CUDA-enabled environment.
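
To make the roles of `DISTANCE_FN`, `THRESHOLD`, and `CONFIDENCE` concrete, the following dependency-free sketch shows one way a cosine-distance decision and a confidence margin could interact. It assumes cosine distance is defined as one minus cosine similarity and uses a hinge-style margin. The helper names (`cosine_distance`, `is_match`, `margin_loss`) are illustrative and are not taken from `src/loss.py`, whose exact formulation may differ.

```python
import math

THRESHOLD = 0.5846   # verification threshold from the table above
CONFIDENCE = 0.5     # margin from the table above


def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_match(dist, threshold):
    """Verification accepts when the distance falls under the threshold."""
    return dist < threshold


def margin_loss(dist, threshold, confidence):
    """Hinge-style loss (an assumption): zero only once the distance is at
    least `confidence` below the threshold."""
    return max(dist - (threshold - confidence), 0.0)


target = [1.0, 0.0]   # toy 2-D "embedding" of the target speaker
close = [1.0, 0.1]    # nearly parallel to the target
far = [0.0, 1.0]      # orthogonal to the target

for name, emb in [("close", close), ("far", far)]:
    d = cosine_distance(emb, target)
    loss = margin_loss(d, THRESHOLD, CONFIDENCE)
    print(f"{name}: distance={d:.4f} match={is_match(d, THRESHOLD)} loss={loss:.4f}")

# a distance that passes verification but has not yet cleared the margin
borderline_loss = margin_loss(0.3, THRESHOLD, CONFIDENCE)
print(f"borderline: match={is_match(0.3, THRESHOLD)} loss={borderline_loss:.4f}")
```

In this sketch a distance of 0.3 already passes verification yet still incurs a nonzero loss, which is how a margin can push the optimization toward higher-confidence attacks.
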


## Setup

In [None]:
!git clone https://github.com/oreillyp/filter_icassp_22.git

%cd filter_icassp_22/

!pip install -r requirements.txt

!chmod u+x scripts/download/download_librispeech.sh
!chmod u+x scripts/download/download_rir_noise.sh

!./scripts/download/download_librispeech.sh
!./scripts/download/download_rir_noise.sh

In [None]:
import time
from pathlib import Path  # used below when saving results and audio
import torch
import torchaudio
import pandas as pd
from tqdm import tqdm

from src.constants import *
from src.pipelines import Pipeline
from src.models import SpeakerVerificationModel, ResNetSE34V2
from src.simulation import *
from src.preprocess import *
from src.loss import *
from src.data import LibriSpeechDataset
from src.attacks import FilterAttack, FrequencyMaskingAttack
from src.writer import Writer
from src.utils import *

## Pipeline

We wrap over-the-air simulation, preprocessing, and prediction into a single end-to-end differentiable `Pipeline` to simplify attack optimization. The simulation parameters used in the paper experiments are reproduced below.

<p align="center"><img src="https://interactiveaudiolab.github.io/demos/images/adaptive_filter/system_diagram.png" width="700"/></p>
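
As an illustration of what a single randomized simulation stage might do, the sketch below draws a signal-to-noise ratio uniformly from a range and scales a noise signal to hit it. It is written in plain Python for readability, whereas the repository's stages operate on `torch` tensors so that gradients can flow through them. The function names (`power`, `sample_snr`, `add_noise_at_snr`) are hypothetical and do not correspond to the `Noise` class in `src/simulation`.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def sample_snr(rng, low, high):
    """Draw an SNR (in dB) uniformly from [low, high]."""
    return rng.uniform(low, high)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(signal, noise)]


rng = random.Random(0)
sr = 16000

# one second of a 440 Hz tone and one second of white Gaussian noise
signal = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sr)]

snr_db = sample_snr(rng, 30.0, 40.0)   # same range as the Gaussian stage below
noisy = add_noise_at_snr(signal, noise, snr_db)

# verify the mixture by measuring the SNR of what was added
residual = [n - s for n, s in zip(noisy, signal)]
measured = 10 * math.log10(power(signal) / power(residual))
print(f"requested {snr_db:.2f} dB, measured {measured:.2f} dB")
```

Resampling `snr_db` before each forward pass yields a different acoustic condition every time, which is the behavior that expectation-over-transformation optimization relies on.
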

In [None]:
# initialize differentiable over-the-air simulation
simulation = Simulation(
    Offset(length=[-.15, .15]),
    Noise(type='gaussian', snr=[30.0, 40.0]),
    Bandpass(low=400, high=6000),
    Noise(type='environmental', noise_dir=DATA_DIR / "noise" / "room", snr=[-5.0, 10.0]),
    Reverb(rir_dir=DATA_DIR / "rir" / "real")
)

# initialize differentiable preprocessing
preprocessor = Preprocessor(
    Normalize(),
    # VAD(),
    PreEmphasis()
)

# initialize speaker-verification model and load pretrained weights
model = SpeakerVerificationModel(ResNetSE34V2(), n_segments=N_SEGMENTS, threshold=THRESHOLD)
model.load_weights(MODEL_PATH)

# wrap everything into a Pipeline instance
pipeline = Pipeline(
    model=model,
    simulation=simulation,
    preprocessor=preprocessor,
    device=DEVICE
)

## Dataset

We use the LibriSpeech `test-clean` subset for evaluating attacks.

In [None]:
# load dataset
dataset = LibriSpeechDataset()

# set random seed
set_random_seed(RAND_SEED)

# select subset of data to evaluate attacks
x, y_orig, y = stratified_sample(
    data=dataset,
    n_per_class=N_PER_CLASS,  # starting audio files drawn per class
    target=TARGET_CLASS,  # attack target class (exclude inputs from target class)
    exclude=EXCLUDE_CLASS,  # exclude inputs from given classes as well
)

# if no target is given, randomly assign targets (excluding ground truth)
if TARGET_CLASS is None:
    y = select_random_targets(y_orig, n_per_class=N_PER_CLASS)

y_idx = y.clone()

# for each input, select a corresponding utterance of the target class and
# construct an embedding (or set of embeddings) to serve as a target
target_embeddings = []
target_audio = []  # save target audio for eventual evaluation
for spkr_idx in y:

    all_spkr = dataset.tx[dataset.ty == spkr_idx]
    x_spkr = all_spkr[torch.randperm(len(all_spkr))][0]

    with torch.no_grad():
        target_audio.append(x_spkr)
        target_embeddings.append(
            pipeline.model(x_spkr.to(DEVICE))  # omit simulation, defenses
        )

y = torch.cat(target_embeddings, dim=0)

# fold corresponding embeddings together (n_utterances, n_segments, embedding_dim)
y = y.reshape(x.shape[0], max(N_SEGMENTS, 1), -1)

# move data to device
x, y_orig, y = x.to(DEVICE), y_orig.to(DEVICE), y.to(DEVICE)

# determine whether any input-target pairs already fall under threshold
with torch.no_grad():
    match = pipeline.model.match_predict(
        pipeline.model(x), y
    ) * 1.0

In [None]:
# demonstration: randomized over-the-air simulation
x_example, _ = dataset[0]
x_example = x_example.to(pipeline.device)

with torch.no_grad():

    print("clean audio:")
    play_audio(x_example)

    print("simulated audio:")
    play_audio(pipeline.simulation(x_example))

    pipeline.sample_params()
    print("simulated audio (resampled parameters):")
    play_audio(pipeline.simulation(x_example))

## Attacks

You may need to refresh the TensorBoard dashboard below after running the attack code in order for logs to populate. Logs should include scalars (losses and success rates), images (spectrograms, waveforms, and attack parameters), and audio (over-the-line and simulated over-the-air benign and adversarial recordings).

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs --port=6006 

In [None]:
# initialize logger
writer = Writer(
    root_dir=RUNS_DIR,
    name='speaker_recognition_attack',
    use_tb=True,
    tb_log_iter=LOG_ITER
)
writer.log_info(f'Using device {DEVICE}')
writer.log_info(f'"Spoofing" success rate for unperturbed audio: {match.mean().item():0.3f}')

# initialize adversarial loss
adv_loss = SpeakerEmbeddingLoss(targeted=True,
                                reduction=None,
                                confidence=CONFIDENCE,
                                distance_fn=DISTANCE_FN,
                                n_segments=N_SEGMENTS,
                                threshold=THRESHOLD)

# initialize auxiliary loss
aux_loss = MFCCCosineLoss(reduction=None, n_mfcc=128, n_mels=128)

# initialize proposed attack
flt = FilterAttack(
    n_bands=128,
    block_size=1024,
    pipeline=pipeline,
    class_loss=adv_loss,
    aux_loss=aux_loss,
    max_iter=MAX_ITER,
    eot_iter=EOT_ITER,
    opt='adam',
    lr=0.005,
    eps=40.0,  # 35.0 
    batch_size=BATCH_SIZE,
    mode='selective',
    projection_norm=2,
    rand_evals=N_EVALS,
    k=None,
    writer=writer
)

# initialize baseline frequency-masking attack
qin = FrequencyMaskingAttack(
        pipeline=pipeline,
        class_loss=adv_loss,
        max_iter_1=MAX_ITER // 4,
        max_iter_2=3 * MAX_ITER // 4,
        batch_size=BATCH_SIZE,
        opt_1='adam',
        opt_2='adam',
        alpha=5e-3,  # 5e-4
        eps=0.06,
        lr_2=1e-4,  #1e-3
        eot_iter=EOT_ITER,
        rand_evals=N_EVALS,
        writer=writer
)

Finally, we can run each attack. We evaluate generated audio over a further set of randomized acoustic simulations to obtain an estimate of real-world over-the-air performance. At the default setting of `LOG_ITER = 200`, new logs should populate roughly every 10 minutes in Google Colab.
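
The evaluation over randomized simulations amounts to a Monte Carlo estimate: pass the same adversarial audio through many independently sampled simulations and report the fraction that still fool the verifier. The sketch below shows that loop with toy stand-ins. `estimate_success_rate`, `toy_simulate`, and `toy_verify` are hypothetical names, and the toy verifier (a correlation check against a template) only mimics the shape of the real decision rather than the repository's model.

```python
import math
import random


def estimate_success_rate(audio, simulate, verify, n_evals, rng):
    """Fraction of random simulations under which `verify` still accepts."""
    successes = 0
    for _ in range(n_evals):
        received = simulate(audio, rng)      # fresh random parameters each trial
        successes += bool(verify(received))
    return successes / n_evals


def toy_simulate(audio, rng):
    """Toy channel: additive Gaussian noise with a randomly drawn level."""
    sigma = rng.uniform(0.0, 1.0)
    return [s + rng.gauss(0.0, sigma) for s in audio]


# toy "target": a short sinusoid the verifier is looking for
template = [math.sin(2 * math.pi * k / 32) for k in range(256)]
template_norm = math.sqrt(sum(t * t for t in template))


def toy_verify(received):
    """Accept when normalized correlation with the template is at least 0.8."""
    dot = sum(r * t for r, t in zip(received, template))
    norm = math.sqrt(sum(r * r for r in received))
    return dot / (norm * template_norm) >= 0.8


rate = estimate_success_rate(
    template, toy_simulate, toy_verify, n_evals=200, rng=random.Random(0)
)
print(f"estimated over-the-air success rate: {rate:.3f}")
```

More trials tighten the estimate at a proportional cost in runtime, which is the trade-off that `N_EVALS` controls.
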

In [None]:
# time execution
st_time = time.time()

# set random seed
set_random_seed(RAND_SEED)

# perform proposed attack
adv_x_flt, success_flt = flt.attack(
    x,
    y,
)

# set random seed
set_random_seed(RAND_SEED)

# perform baseline attack
adv_x_qin, success_qin = qin.attack(
    x,
    y,
)

ed_time = time.time()
elapsed = ed_time - st_time
writer.log_info(f'time elapsed (s): {elapsed}')

writer.log_info(f'"Spoofing" success rate for proposed attack: {success_flt.mean().item():0.3f}')
writer.log_info(f'"Spoofing" success rate for baseline attack: {success_qin.mean().item():0.3f}')

We'll save all benign and attack audio and prepare a table of results.
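
The results table also records how large each perturbation is, using the L2 and L-infinity norms of the difference between adversarial and original audio. As a small worked example in plain Python (the notebook itself computes these with `torch.Tensor.norm`), with made-up sample values:

```python
import math


def l2_norm(delta):
    """Euclidean norm: overall energy of the perturbation."""
    return math.sqrt(sum(d * d for d in delta))


def linf_norm(delta):
    """Maximum norm: largest change to any single sample."""
    return max(abs(d) for d in delta)


x = [0.0, 0.5, -0.5, 0.25]        # toy "clean" waveform
adv_x = [0.3, 0.5, -0.9, 0.25]    # toy "adversarial" waveform
delta = [a - b for a, b in zip(adv_x, x)]

print(f"L2 = {l2_norm(delta):.3f}, Linf = {linf_norm(delta):.3f}")
```

Here the perturbation is (0.3, 0, -0.4, 0), so the L2 norm is 0.5 and the L-infinity norm is 0.4, up to floating-point rounding. The two can rank attacks differently: a perturbation spread thinly over many samples has a small L-infinity norm but may still have a sizable L2 norm.
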

In [None]:
# create results table: per-example attack success
table = pd.DataFrame()
table['success_baseline'] = tensor_to_np(success_qin).flatten()
table['success_proposed'] = tensor_to_np(success_flt).flatten()

# log ground truth and selected targets
table['ground_truth'] = tensor_to_np(y_orig).flatten()
table['target'] = tensor_to_np(y_idx).flatten()

# log descriptive audio filenames
table['audio_reference'] = [f'reference_{idx}.wav' for idx in range(len(x))]
table['audio_baseline'] = [f'baseline_{idx}.wav' for idx in range(len(x))]
table['audio_proposed'] = [f'proposed_{idx}.wav' for idx in range(len(x))]
table['audio_target'] = [f'target_{idx}.wav' for idx in range(len(x))]

# log perturbation norms
table['L2_baseline'] = tensor_to_np(
    (adv_x_qin - x).norm(p=2, dim=-1).reshape(-1)
).flatten()
table['L2_proposed'] = tensor_to_np(
    (adv_x_flt - x).norm(p=2, dim=-1).reshape(-1)
).flatten()
table['Linf_baseline'] = tensor_to_np(
    (adv_x_qin - x).norm(p=float("inf"), dim=-1).reshape(-1)
).flatten()
table['Linf_proposed'] = tensor_to_np(
    (adv_x_flt - x).norm(p=float("inf"), dim=-1).reshape(-1)
).flatten()

# save complete results
table.to_csv(Path(writer.run_dir) / 'results.csv')

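# log overall statistics
#  1. overall success rate for baseline and proposed attacks
#  2. percentage of attacks in which proposed has success rate greater than
#     or equal to that of baseline
#  3. average perturbation norm of each attack
writer.log_info(f'baseline success rate: {table["success_baseline"].mean():.3f}')
writer.log_info(f'proposed success rate: {table["success_proposed"].mean():.3f}')
proposed_geq = (table["success_proposed"] >= table["success_baseline"]).mean()
writer.log_info(f'proposed >= baseline: {100 * proposed_geq:.1f}%')
for col in ('L2_baseline', 'L2_proposed', 'Linf_baseline', 'Linf_proposed'):
    writer.log_info(f'mean {col}: {table[col].mean():.4f}')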

# save all audio
audio_dir = Path(writer.run_dir) / 'audio'
ensure_dir(audio_dir)

pbar = tqdm(range(len(table)))
for i in pbar:
    # save original audio
    audio = table.iloc[i]['audio_reference']
    audio_idx = table.index.values[i]
    pbar.set_description(f'saving audio {audio}')
    torchaudio.save(audio_dir / audio,
                    x[audio_idx].reshape(1, -1).detach().cpu(),
                    sample_rate=SR)

    # save baseline attack audio
    audio = table.iloc[i]['audio_baseline']
    audio_idx = table.index.values[i]
    pbar.set_description(f'saving audio {audio}')
    torchaudio.save(audio_dir / audio,
                    adv_x_qin[audio_idx].reshape(1, -1).detach().cpu(),
                    sample_rate=SR)

    # save proposed attack audio
    audio = table.iloc[i]['audio_proposed']
    audio_idx = table.index.values[i]
    pbar.set_description(f'saving audio {audio}')
    torchaudio.save(audio_dir / audio,
                    adv_x_flt[audio_idx].reshape(1, -1).detach().cpu(),
                    sample_rate=SR)

    # save target audio
    audio = table.iloc[i]['audio_target']
    audio_idx = table.index.values[i]
    pbar.set_description(f'saving audio {audio}')
    torchaudio.save(audio_dir / audio,
                    target_audio[audio_idx].reshape(1, -1).detach().cpu(),
                    sample_rate=SR)

## Results

In [None]:
table