# Antispoofing

In this homework, you will develop a countermeasure against deepfakes and then try to explain it using various XAI techniques.

More specifically, you will implement and train a Countermeasure (CM) system on the Logical Access partition of the [ASVSpoof 2019 Dataset](https://datashare.ed.ac.uk/handle/10283/3336) ([Kaggle Link](https://www.kaggle.com/datasets/awsaf49/asvpoof-2019-dataset)). You may find the [ASVspoof 2019 evaluation plan](https://www.asvspoof.org/asvspoof2019/asvspoof2019_evaluation_plan.pdf) useful.

For the CM, we choose [LightCNN (LCNN)](https://arxiv.org/abs/1511.02683), which once took first place in the competition. We will follow the Speech Technology Center (STC) [paper](https://arxiv.org/abs/1904.05576).

**Hints**:

1. Use STFT (FFT in the paper) as the front-end.

2. The dropout layer is placed before the last batch norm.

## Dataset [0.5 pts]

We want to train a neural network to predict if the input audio is real or fake. To do so, we need a dataset first. In this homework, we will work with [ASVspoof19](https://arxiv.org/pdf/1911.01601.pdf).

Create a `Dataset` class that downloads the dataset, parses its metadata and, given index $i$, returns the $i$-th object of the dataset. Do not forget to preprocess the audio for LCNN (compute the STFT, etc.).
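As a rough illustration of the preprocessing step, here is a minimal NumPy sketch of a log-power STFT front-end. The function name `log_power_stft` and the values of `n_fft`, `hop`, and `eps` are assumptions made for this example, not the settings from the STC paper; in the notebook itself you would likely use `torch.stft` and the parameters from the paper instead.

```python
import numpy as np


def log_power_stft(wave, n_fft=512, hop=160, eps=1e-10):
    """Return a log-power spectrogram of shape (n_fft // 2 + 1, n_frames).

    n_fft, hop, and eps are illustrative values, not the paper's settings.
    """
    wave = np.asarray(wave, dtype=float)
    # Pad short clips so that at least one full frame exists.
    if len(wave) < n_fft:
        wave = np.pad(wave, (0, n_fft - len(wave)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # One-sided FFT of each frame: (n_frames, n_fft // 2 + 1).
    spec = np.fft.rfft(frames, axis=1)
    # Log power, transposed to (frequency, time).
    return np.log(np.abs(spec) ** 2 + eps).T


# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
wave = np.sin(2 * np.pi * 1000 * np.arange(sample_rate) / sample_rate)
spec = log_power_stft(wave)
print(spec.shape)
```

Note that a real `Dataset` also has to bring all spectrograms to a fixed number of frames (by cropping or padding) before batching; that step is omitted here.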

In [None]:
# YOUR CODE HERE

**Hint**: when working in Kaggle, it is easier and faster to attach the dataset as a Kaggle input. You can use it directly or symlink it into a local directory with `ln -s`.

**Hint**: it might be easier to do this homework in Kaggle, since model training may take some time.

Create the train/eval datasets and dataloaders:

In [None]:
# YOUR CODE HERE

Visualize one object, just to check that all is fine:

In [None]:
# YOUR CODE HERE

## Loss function [0.5 pts]

In the lecture, we saw different softmax losses and the motivation behind them for the ASV task. However, they can also be used for any classification task, such as synthesized speech detection. The papers suggest using A(M)-Softmax or Cross-Entropy. The STC paper argues that A-Softmax is better.

(a) According to the STC paper, what are the benefits of A-Softmax over Cross-Entropy?

(b) Analyse the [NII paper](https://arxiv.org/pdf/2103.11326) and explain whether a complicated softmax loss is actually needed to achieve a good EER, or whether Cross-Entropy is enough.

**Answer**: your answer here...

Following the NII paper, we will continue with Cross-Entropy:

In [None]:
from torch import nn

criterion = nn.CrossEntropyLoss()

## Evaluation metric [0.5 pts]

We will use the equal error rate (EER) as the primary evaluation metric. The code for calculating the metric is provided by ASVspoof itself, so we only need to write a wrapper. Given model logits and labels, calculate the EER using the ASVspoof functions.

Your model returns two probabilities: [spoof_proba, bona_proba]. Be careful with the EER metric and recall how the ROC curve is computed to make sure you do not mix up the classes.
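To make the pitfall concrete, here is a small NumPy sketch of a threshold-sweep EER. It is not the ASVspoof `compute_eer` function; `simple_eer` is a hypothetical helper written only for this illustration, and it assumes that a higher score means "more bona fide" (so the score is taken from column 1 of the model output).

```python
import numpy as np


def simple_eer(bona_scores, spoof_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    Assumes higher score = more likely bona fide (illustrative convention).
    """
    bona_scores = np.asarray(bona_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    # False rejection rate: bona fide clips scored below the threshold.
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    # False acceptance rate: spoofed clips scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # The EER is where the two error rates are (approximately) equal.
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2


# Toy model outputs, columns = [spoof_proba, bona_proba].
probs = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7],
                  [0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])
labels = np.array([1, 1, 1, 0, 0, 0])  # 1 = bona fide, 0 = spoof

scores = probs[:, 1]  # bona fide column
eer_ok = simple_eer(scores[labels == 1], scores[labels == 0])
# Swapping the two groups (or using the wrong column) flips the result.
eer_swapped = simple_eer(scores[labels == 0], scores[labels == 1])
print(eer_ok, eer_swapped)
```

On this perfectly separable toy data the correct call gives an EER of 0 and the swapped call gives 1, which is the kind of mistake the warning above refers to.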

In [None]:
# download asvspoof metric calculation functions
!wget https://raw.githubusercontent.com/markovka17/dla/refs/heads/2023/hw5_as/calculate_eer.py

In [None]:
from calculate_eer import compute_eer

def get_eer(logits, labels):
    # YOUR CODE HERE

## LCNN Implementation [6.0 pts]

Create an `LCNN` class for the model architecture.
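The building block that distinguishes LightCNN from an ordinary CNN is the Max-Feature-Map (MFM) activation: the channels are split into two halves and the element-wise maximum is taken, which halves the channel count. The sketch below shows the operation in NumPy for clarity; the name `max_feature_map` is an assumption of this example, and in the actual model you would write the same thing as a PyTorch module (for instance with `torch.chunk` and the element-wise `torch.max`).

```python
import numpy as np


def max_feature_map(x):
    """MFM activation for an array shaped (batch, channels, ...).

    Splits the channel axis into two halves and keeps the element-wise
    maximum, so the output has channels // 2 channels.
    """
    if x.shape[1] % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    first_half, second_half = np.split(x, 2, axis=1)
    return np.maximum(first_half, second_half)


# Usage: 4 input channels become 2 output channels.
x = np.arange(16, dtype=float).reshape(1, 4, 2, 2)
out = max_feature_map(x)
print(out.shape)
```

Because MFM halves the channel count, a convolution that should produce `C` feature maps after the activation needs `2 * C` output channels.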

In [None]:
# YOUR CODE HERE

In [None]:
model = LCNN(...)

In [None]:
# double-check that it runs (do eval mode)
# YOUR CODE HERE

Write the train loop. Since training may take some time, we advise you to save a checkpoint after each epoch so that you can load it back if needed.

Plot the EER vs. epoch and loss vs. epoch curves.

In [None]:
def train_one_epoch(model, dataloader, criterion, optimizer, scheduler, device):
    # YOUR CODE HERE


def evaluate(model, dataloader, criterion, device):
    # YOUR CODE HERE


def train(model, train_dataloader, eval_dataloader, criterion, optimizer, scheduler, device, n_epochs):
    # YOUR CODE HERE


In [None]:
device = # YOUR CODE HERE

In [None]:
# take optimizer and scheduler from the NII paper
optimizer = # YOUR CODE HERE
scheduler = # YOUR CODE HERE
n_epochs = # YOUR CODE HERE
train(model, train_dataloader, eval_dataloader,
      criterion, optimizer, scheduler, device, n_epochs)

The task is considered solved if you achieve an EER of at most $9\%$. This is much higher than what the model can achieve, but we do not want you to wait 12+ hours for the model to converge.

## XAI [See points below]

Let's analyse the model we have created. We won't be able to understand the differences easily without having some reference. So, we will use the novel idea from the recent [Interspeech 2025 paper](https://arxiv.org/abs/2506.03425).

We will use a [vocoded dataset](https://arxiv.org/abs/2210.10570) of parallel samples: real and fake audio have the same speaker saying the same content at the same time. The ground-truth explanation will be obtained by calculating the difference between the real and fake spectrograms.

In [None]:
!wget https://zenodo.org/records/7314976/files/project09-voc.v4.tar?download=1 -O project09-voc.v4.tar
!tar -xvf project09-voc.v4.tar

Note that the real audio is taken from ASVspoof. So let's take a real example from the ASVspoof dataset. Using its filename, find the corresponding `hifi-gan` and `waveglow` vocoded versions in vocv4 and load them too.

In reality, we are interested in explanations for unseen data. For this homework, however, let's consider the train set. This will allow us to see whether the model learns the features we expect it to learn (assuming the XAI tool is trustworthy). Note that the spoof part of vocv4 is not exactly the same as the one in ASVspoof, so we mostly eliminate the issues related to changing speakers, not algorithms.

In [None]:
ind = # for consistency with solutions choose the index that corresponds to LA_T_4179989 (bona fide)
# YOUR CODE HERE

In [None]:
# get LCNN-prepared spectrogram for the paired real and fake example from the dataset
# paired: the same filename, but one is bona fide another is created via vocoder
real_audio = # YOUR CODE HERE
hifigan_audio = # YOUR CODE HERE
waveglow_audio = # YOUR CODE HERE


# fake audio may be slightly longer due to padding, remove some part from the end to make the length equal
# YOUR CODE HERE


# preprocess audio for model input
# YOUR CODE HERE

Run your model on these clips and check whether its predictions are correct. Use this understanding in the following analysis.

In [None]:
# YOUR CODE HERE

### Manual explanation [0.5 pts]

Compare the two spectrograms (real and fake). What differences do you see? (**Hint**: they exist; if you do not see them, look more carefully.)

In [None]:
# YOUR CODE HERE

**Your answer here**

### Automatic explanation [0.5 pts]

Calculate Eq. 2 from the [Interspeech 2025 paper](https://arxiv.org/abs/2506.03425) to automatically highlight the differences between the two objects.

In [None]:
# YOUR CODE HERE

Plot the mask on top of the fake spectrogram and compare three plots: real, fake, and fake with the mask on top. Do this for both vocoders and compare the results.

In [None]:
# YOUR CODE HERE

**Your comparison of the ground-truth mask with your manual analysis (from previous subtask) here**

### Grad-CAM [0.5 pts]

Using the [pytorch-grad-cam library](https://github.com/jacobgil/pytorch-grad-cam), implement [Grad-CAM](https://arxiv.org/abs/1610.02391) for your LCNN model. Choose any layer you like.

In [None]:
# grad-cam and captum may have conflicting numpy dependencies. Just install grad-cam first, then captum and it will work

In [None]:
!pip install grad-cam

In [None]:
# YOUR CODE HERE

### Comparison [1.0 pts]

Compare your Grad-CAM attributions with another gradient-based method: InputXGradient. Compute it using [Captum](https://captum.ai/).

In [None]:
!pip install captum

In [None]:
# YOUR CODE HERE

Make three plots: mask vs. Grad-CAM vs. InputXGradient. Compare them. Do any of the XAI methods align with the mask?

In [None]:
# YOUR CODE HERE

**Your answer here**

Because attributions from some XAI tools have a skewed distribution, you may want to look only at the top-5% points, similarly to the ground-truth mask. Binarize the attributions using their $95\%$ quantile and plot again:
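The binarization step itself can be sketched in a few lines of NumPy. The helper name `binarize_top_fraction` is an assumption of this example, and so is the choice to threshold the raw values: for signed attributions (such as InputXGradient) you may prefer to take the absolute value first.

```python
import numpy as np


def binarize_top_fraction(attribution, q=0.95):
    """Return a 0/1 mask that keeps values at or above the q-quantile."""
    attribution = np.asarray(attribution, dtype=float)
    threshold = np.quantile(attribution, q)
    return (attribution >= threshold).astype(np.uint8)


# Usage: a toy 10x10 "attribution map" with values 0..99.
toy_attribution = np.arange(100, dtype=float).reshape(10, 10)
toy_mask = binarize_top_fraction(toy_attribution)
print(toy_mask.sum())
```

The resulting mask has the same shape as the input, so it can be overlaid on the spectrogram in the same way as the ground-truth mask.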

In [None]:
# YOUR CODE HERE

**Your analysis here**