# A Comparison of Metric Learning Loss Functions for End-to-End Speaker Verification

This notebook contains the code to reproduce the equal error rate of the additive angular margin loss model from the paper.

Before you begin, make sure you have installed [pyannote-audio](https://github.com/pyannote/pyannote-audio) and [pyannote.db.voxceleb](https://github.com/pyannote/pyannote-db-voxceleb).

If you use this model, please cite our paper:

```BibTeX soon!```

## Preparation

### Dataset

First, we apply the pretrained model to generate `VoxCeleb1` embeddings. The audio chunk duration is 3s, as indicated in `config.yml`, and `--step` is expressed as a fraction of that duration, so the sliding window advances by about 100ms (3 * 0.0333).

To do so, you need to execute the following commands in your terminal:

In [None]:
$ pyannote-audio emb apply --gpu --step=0.0333 --batch=128 --subset=test models/AAM/train/VoxCeleb.SpeakerVerification.VoxCeleb2.train/validate_equal_error_rate/VoxCeleb.SpeakerVerification.VoxCeleb1_X.development VoxCeleb.SpeakerVerification.VoxCeleb1_X
$ pyannote-audio emb apply --gpu --step=0.0333 --batch=128 --subset=train models/AAM/train/VoxCeleb.SpeakerVerification.VoxCeleb2.train/validate_equal_error_rate/VoxCeleb.SpeakerVerification.VoxCeleb1_X.development VoxCeleb.SpeakerVerification.VoxCeleb1_X

Note that you need to remove `--gpu` if your machine doesn't have one. These commands may take some time to execute, especially the second one. However, once all embeddings are computed, you can rerun and modify this notebook as often as you like without waiting hours for results.

In this notebook we will be using the `Test` subset to calculate the EER, and `Train` to normalize similarity scores with adaptive s-norm.

If you want to know more about the `apply` method, you can check out [pyannote's tutorials](https://github.com/pyannote/pyannote-audio/tree/develop/tutorials/models/speaker_embedding#application).

### Code

Make sure all the needed libraries are installed:
- `numpy` for array operations
- `xarray` to facilitate score normalization
- `feerci` to calculate EER and its confidence interval
- `pyannote.audio` to use pretrained models
- `pyannote.database` to access `VoxCeleb1`
- `tqdm` to show nice progress bars

In [1]:
import numpy as np
from xarray import DataArray
from feerci import feerci
from pyannote.core.utils.distance import cdist
from pyannote.audio.features import Precomputed
from pyannote.audio.applications.speaker_embedding import SpeakerEmbedding
from pyannote.database import get_protocol
from tqdm import tqdm

We also initialize the database with the preprocessors needed, and we define some useful functions.

In [2]:
# We use the VoxCeleb1_X protocol, with a train and dev set
# resulting from splitting the original development set
protocol = get_protocol('VoxCeleb.SpeakerVerification.VoxCeleb1_X')


# A function to crop the embeddings from a file
def get_embedding(file, pretrained, mean=False):
    emb = []
    for f in file.files():
        if 'try_with' in f:
            segments = f['try_with']
        else:
            segments = f['annotation'].get_timeline()
        for segment in segments:
            for mode in ['center', 'loose']:
                e = pretrained.crop(f, segment, mode=mode)
                if len(e) > 0:
                    break
            emb.append(e)
    emb = np.vstack(emb)
    if mean:
        emb = np.mean(emb, axis=0, keepdims=True)
    return emb


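# A function to calculate the EER on a subset of VoxCeleb1_X.
# It relies on `get_hash`, `index1` and `index2`, which are defined
# later in the notebook and must exist before it is called.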
def run_experiment(distance, subset):
    total = 37720 if subset == 'test' else None
    y_pred, y_true = [], []
    for trial in tqdm(getattr(protocol, f'{subset}_trial')(), total=total):
        file1 = trial['file1']
        hash1 = get_hash(file1)
        file2 = trial['file2']
        hash2 = get_hash(file2)
        y_pred.append(distance.data[index1[hash1], index2[hash2]])
        y_true.append(trial['reference'])
    y_pred = np.array(y_pred)
    y_true = np.array(y_true)
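    # feerci takes scores where higher means "same speaker", so distances are negated;
    # non-target (impostor) scores are passed first, then target (genuine) scores
    eer, ci_lower, ci_upper, _ = feerci(-y_pred[y_true == 0],
                                        -y_pred[y_true == 1],
                                        is_sorted=False)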
    return {
        'eer': eer,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'y_true': y_true,
        'y_pred': y_pred}
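
To make the EER computation less of a black box, here is a minimal NumPy sketch of the idea. The helper `simple_eer` is hypothetical: it is not `feerci`'s algorithm and it returns no confidence interval. It sweeps a threshold over the observed distances and reports the operating point where the false acceptance and false rejection rates are closest.

```python
import numpy as np


def simple_eer(distances, labels):
    """Approximate the EER by sweeping a threshold over the observed distances.

    distances: lower means "more likely the same speaker".
    labels: 1 for target (same speaker) trials, 0 for non-target trials.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    target = distances[labels == 1]
    non_target = distances[labels == 0]

    best_gap, eer = np.inf, None
    for threshold in np.unique(distances):
        # A trial is accepted when its distance is at or below the threshold
        far = np.mean(non_target <= threshold)  # false acceptance rate
        frr = np.mean(target > threshold)       # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, 0.5 * (far + frr)
    return eer


# Toy trials: four target distances followed by four non-target distances
toy_eer = simple_eer([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9],
                     [1, 1, 1, 1, 0, 0, 0, 0])
print(f'{100 * toy_eer:.2f}')  # prints 25.00
```

In this toy example, a threshold of 0.4 accepts one of the four non-target trials and rejects one of the four target trials, so both error rates equal 25%.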

## Loading the Pretrained Model

In [3]:
# Load the precomputed embeddings calculated at the beginning of the notebook
model = Precomputed(
    'models/AAM/train/VoxCeleb.SpeakerVerification.VoxCeleb2.train/validate_equal_error_rate/'
    'VoxCeleb.SpeakerVerification.VoxCeleb1_X.development/apply/0560/',
    use_memmap=False)

print(f'Embeddings of {model.sliding_window.duration:g}s duration and of dimension {model.dimension:d}, '
      f'extracted every {1000 * model.sliding_window.step:g}ms')

Embeddings of 3s duration and of dimension 512, extracted every 99.9ms
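
The 99.9ms step printed above matches the `--step=0.0333` option used when extracting the embeddings, read here as a fraction of the 3s chunk duration (an assumption taken from the "3 * 0.0333" note in the Dataset section). The arithmetic can be checked as follows:

```python
chunk_duration = 3.0   # seconds, as indicated in config.yml
step_ratio = 0.0333    # value passed to --step (assumed to be a ratio of the chunk duration)

step_seconds = chunk_duration * step_ratio
print(f'{1000 * step_seconds:g}ms')  # prints 99.9ms
```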


## Evaluating with Raw Distances

### Calculate embeddings

In [4]:
get_hash = lambda file: SpeakerEmbedding.get_hash(file)

# hash to embedding mapping
cache1 = dict()
cache2 = dict()

# hash to index mapping
index1 = dict()
index2 = dict()

n_file1 = 0
n_file2 = 0

# Get embeddings for every trial in Test
for trial in tqdm(protocol.test_trial(), total=37720):
    
    file1 = trial['file1']
    hash1 = get_hash(file1)
    if hash1 not in cache1:
        cache1[hash1] = get_embedding(file1, model, mean=True)
        index1[hash1] = n_file1
        n_file1 += 1
    
    file2 = trial['file2']
    hash2 = get_hash(file2)
    if hash2 not in cache2:
        cache2[hash2] = get_embedding(file2, model, mean=True)
        index2[hash2] = n_file2
        n_file2 += 1

hashes1 = list(cache1.keys())
hashes2 = list(cache2.keys())
emb1 = np.vstack(list(cache1.values()))
emb2 = np.vstack(list(cache2.values()))

100%|██████████| 37720/37720 [00:29<00:00, 1282.89it/s]


### Calculate cosine distances

In [5]:
distance = DataArray(
    cdist(emb1, emb2, metric='cosine'),
    dims=('file1', 'file2'),
    coords=(hashes1, hashes2))
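
As a reminder of what is being computed, the sketch below reproduces pairwise cosine distances in plain NumPy. It assumes the `cosine` metric follows the usual definition (one minus the cosine similarity, so values range from 0 to 2). The helper `cosine_cdist` is a hypothetical name used for illustration only.

```python
import numpy as np


def cosine_cdist(a, b):
    """Pairwise cosine distance between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale every row to unit length so that the dot product is the cosine similarity
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


a = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([[3.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# One row per row of a, one column per row of b
toy_distance = cosine_cdist(a, b)
print(toy_distance.shape)  # prints (2, 3)
```

Vectors pointing in the same direction get a distance of 0 regardless of their norm, orthogonal vectors get 1, and opposite vectors get 2.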

### Calculate EER on VoxCeleb1 Test

In [6]:
raw_results = run_experiment(distance, 'test')
print(f"EER with raw distances: {100 * raw_results['eer']:.2f} in "
      f"[{100 * raw_results['ci_lower']:.2f}, {100 * raw_results['ci_upper']:.2f}]")

37720it [00:15, 2401.14it/s]


EER with raw distances: 3.94 in [3.75, 4.14]


## Evaluating with Adaptive s-norm

Here we improve the above EER with adaptive s-norm score normalization.
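
The normalization implemented in the cells that follow can be summarized with one formula. The notation is ours and not necessarily the paper's: $d(x, y)$ is the raw cosine distance between the `file1` embedding $x$ and the `file2` embedding $y$, while $\mu_x, \sigma_x$ (resp. $\mu_y, \sigma_y$) are the mean and standard deviation of the distances between $x$ (resp. $y$) and its $N$ closest cohort speakers.

$$
\hat{d}(x, y) = \frac{1}{2} \left( \frac{d(x, y) - \mu_x}{\sigma_x} + \frac{d(x, y) - \mu_y}{\sigma_y} \right)
$$

The normalization is called adaptive because the cohort subset used to compute the statistics is chosen separately for each embedding.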

### Create a cohort set

In [None]:
# Get cohort embeddings from VoxCeleb1_X.train
cohort_embedding = dict()
for cohort_file in tqdm(protocol.train(), total=143506):
    speaker = cohort_file['annotation'].argmax()
    embedding = get_embedding(cohort_file, model, mean=False)
    cohort_embedding.setdefault(speaker, []).append(embedding)

# The cohort consists of the mean embedding for each speaker
cohort_speakers = list(cohort_embedding.keys())
cohort = np.vstack([np.mean(np.vstack(cohort_embedding[speaker]), axis=0, keepdims=True) 
                    for speaker in cohort_speakers])
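
The grouping and averaging above can be illustrated with a toy example. The speaker names and embeddings are made up. The pattern is the same as in the cell above: collect the per-file embeddings of each speaker, stack them, and average over all chunks.

```python
import numpy as np

# Hypothetical toy data: (speaker label, per-file embeddings of shape (n_chunks, dim))
toy_files = [
    ('alice', np.array([[1.0, 0.0], [3.0, 0.0]])),
    ('bob', np.array([[0.0, 2.0]])),
    ('alice', np.array([[2.0, 6.0]])),
]

# Group embeddings by speaker
per_speaker = dict()
for speaker, embedding in toy_files:
    per_speaker.setdefault(speaker, []).append(embedding)

# One mean embedding per speaker, computed over all chunks of all their files
speakers = list(per_speaker.keys())
toy_cohort = np.vstack([
    np.mean(np.vstack(per_speaker[speaker]), axis=0, keepdims=True)
    for speaker in speakers])

print(toy_cohort.shape)  # prints (2, 2)
```

Because chunks are pooled before averaging, a file that contributes more chunks weighs more in its speaker's mean embedding.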

### Calculate raw trial scores

In [None]:
# Calculate the distances between each trial embedding (file1 and file2) and the cohort
distance1 = DataArray(
    cdist(emb1, cohort, metric='cosine'),
    dims=('file1', 'cohort'),
    coords=(hashes1, cohort_speakers))

distance2 = DataArray(
    cdist(emb2, cohort, metric='cosine'),
    dims=('file2', 'cohort'),
    coords=(hashes2, cohort_speakers))

### Normalize scores w.r.t the N most similar cohort embeddings

We use N=400, a value previously tuned on `VoxCeleb1 Dev`.

In [None]:
# This is our N
COHORT_SIZE = 400

# Calculate mean and std of the distances to the N most similar cohort embeddings for file1
data1 = np.partition(distance1.data, COHORT_SIZE)[:, :COHORT_SIZE]
mz = np.mean(data1, axis=1) 
sz = np.std(data1, axis=1)
mz = DataArray(mz, dims=('file1',), coords=(hashes1,))
sz = DataArray(sz, dims=('file1',), coords=(hashes1,))

# Calculate mean and std of the distances to the N most similar cohort embeddings for file2
data2 = np.partition(distance2.data, COHORT_SIZE)[:, :COHORT_SIZE]
mt = np.mean(data2, axis=1) 
st = np.std(data2, axis=1)
mt = DataArray(mt, dims=('file2',), coords=(hashes2,))
st = DataArray(st, dims=('file2',), coords=(hashes2,))

# Normalize
distance_z = (distance - mz) / sz
distance_t = (distance - mt) / st
distance_s = 0.5 * (distance_z + distance_t)
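
The `np.partition` calls above select the N smallest distances in each row without fully sorting it. The toy sketch below uses made-up distances and N=2 to show the selection and the resulting statistics. Note that the cohort must contain more than N speakers for the call to be valid.

```python
import numpy as np

N = 2  # toy cohort size; the notebook uses 400

# Hypothetical distances from 2 files to a cohort of 5 speakers
toy = np.array([[0.9, 0.1, 0.5, 0.3, 0.7],
                [0.2, 0.8, 0.4, 0.6, 0.0]])

# After partitioning at index N, the first N entries of each row are
# the N smallest values of that row (in no particular order)
closest = np.partition(toy, N)[:, :N]

# Per-file statistics used by the normalization
mean = np.mean(closest, axis=1)
std = np.std(closest, axis=1)
```

For the first row, the two closest cohort distances are 0.1 and 0.3, which gives a mean of 0.2 and a standard deviation of 0.1.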

### Calculate EER on VoxCeleb1 Test

In [None]:
# Calculate the EER and its confidence interval on test with normalized distances
ada_snorm_results = run_experiment(distance_s, 'test')
print(f"EER with adaptive s-norm: {100 * ada_snorm_results['eer']:.2f} in "
      f"[{100 * ada_snorm_results['ci_lower']:.2f}, {100 * ada_snorm_results['ci_upper']:.2f}]")

That's all! If you have any questions or suggestions, feel free to open an issue here or on [pyannote-audio](https://github.com/pyannote/pyannote-audio).