# Analyze Performance of Speaker Diarization Pipelines on DED21 Dataset

This notebook analyzes the performance of two speaker diarization pipelines on the DED21 dataset (see this [blog post](https://stukroodvlees.nl/welke-lijsttrekkers-lacht-het-meest-en-hoe/) for details). These are the two pipelines that performed best on the AMI corpus:

1. Custom pipeline:
   - pyannote's [speaker-segmentation](https://huggingface.co/pyannote/speaker-segmentation) pipeline for detecting speaker segments (voice activity, speaker change, and overlapped speech), followed by postprocessing (merging close segments and removing short ones)
   - SpeechBrain's [pretrained ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) model for speaker embeddings, with the default settings of `model.encode_batch()`
   - scikit-learn's spectral clustering algorithm for clustering the speaker segments

2. pyannote's full [speaker-diarization](https://huggingface.co/pyannote/speaker-diarization) pipeline (combines all steps from the custom pipeline)
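
The postprocessing step of the custom pipeline (merging close segments and removing short ones) can be sketched as follows. This is a minimal illustration, not the code used to produce the results in this notebook: the function name `postprocess_segments` and the thresholds `max_gap` and `min_duration` are assumptions chosen for the example, and speaker identity is ignored for simplicity.

```python
def postprocess_segments(segments, max_gap=0.5, min_duration=0.25):
    """Merge segments separated by a short gap, then drop short segments.

    segments: list of (start, end) tuples in seconds.
    max_gap and min_duration are hypothetical thresholds (in seconds).
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Remove segments that are still too short after merging.
    return [(s, e) for s, e in merged if e - s >= min_duration]


segments = [(0.0, 1.0), (1.2, 2.0), (5.0, 5.1), (8.0, 9.5)]
result = postprocess_segments(segments)
print(result)
```

In this toy input, the first two segments are merged because they are only 0.2 s apart, and the 0.1 s segment is discarded.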

The pipelines are evaluated with several diarization, segmentation, and detection metrics. See the [documentation](https://pyannote.github.io/pyannote-metrics/reference) of `pyannote.metrics` for details.
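
For orientation, the diarization error rate (`der` in the results below) is commonly defined as the sum of three error durations relative to the total duration of reference speech. The symbol names are notation chosen for this sketch and do not come from the notebook:

```latex
\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed detection}} + T_{\text{confusion}}}{T_{\text{total}}}
```

Because false alarms are counted in the numerator but not in the denominator, the value can exceed 1. Lower values are better for the error rates, whereas higher values are better for purity, coverage, precision, and recall.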

In [1]:
from copy import deepcopy
from custom_datasets import load_ded21_dataset
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate, DiarizationCoverage, DiarizationPurity
from pyannote.metrics.segmentation import SegmentationPrecision, SegmentationRecall, SegmentationPurity, SegmentationCoverage
from pyannote.metrics.detection import DetectionErrorRate
from sklearn.cluster import SpectralClustering
from speaker_representation import load_speech_sequence
import os
import torch
import numpy as np




In [2]:
DATA_DIR = os.path.join("..", "dutch-debate-corpus")
DATASET = "ded21"
MODEL = "speechbrain-ecapa-tdnn"
SEG_DIR = os.path.join("..", "speaker-segmentation", "results", "pyannote")
RESULTS_DIR = os.path.join("results", MODEL)


In [3]:
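def average_subsegment_embeddings(rttm_seq, embeddings):
    """Average the embeddings of all subsegments that belong to one segment.

    Assumes that segments longer than 20 s were embedded as consecutive
    subsegments, so `embeddings` holds one row per subsegment in segment order.
    """
    # Number of subsegments (20 s chunks) per segment
    splits = [int(seg.tdur // 20.0) + 1 for seg in rttm_seq.sequence]

    new_embeddings = torch.empty((len(splits), embeddings.shape[1]))

    offset = 0  # row index of the current segment's first subsegment
    for j, n in enumerate(splits):
        new_embedding = torch.mean(embeddings[offset:offset + n, :], dim=0)
        # L2-normalize the averaged embedding
        new_embeddings[j, :] = torch.nn.functional.normalize(
            new_embedding, dim=-1).cpu()
        offset += n

    assert new_embeddings.shape[0] == len(splits)

    return new_embeddings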


In [4]:
def get_reference_speakers(ref_dir, dataset, index):
    rttm_seq = load_speech_sequence(ref_dir, dataset, index)
    speakers = [seg.name for seg in rttm_seq.sequence]
    return list(set(speakers))



In [5]:
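def load_speaker_embeddings(model, dataset, index):
    """Load the saved embeddings for sample `index` of `dataset`."""
    with os.scandir(os.path.join("embeddings", model)) as entries:
        for entry in entries:
            # File names end in "_<dataset>_<index>.pt"; compare the index
            # exactly so that, e.g., index 1 does not match sample 10
            stem = os.path.splitext(entry.name)[0]
            if stem.split("_")[-1] == str(index) and dataset in entry.name:
                print(entry.path)
                return torch.load(entry.path)
    raise FileNotFoundError(f"No embeddings found for {dataset} sample {index}")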

In [6]:
def convert_rttm_annotation(rttm_seq):
    annotation = Annotation()
    for seg in rttm_seq.sequence:
        annotation[Segment(seg.tbeg, seg.tbeg+seg.tdur)] = seg.name

    return annotation
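
The attributes `tbeg`, `tdur`, and `name` used above correspond to fields of an RTTM record. The helper `load_speech_sequence` is not shown in this notebook, so the following is only a sketch of how a single line could be parsed, assuming the standard RTTM `SPEAKER` record layout. The names `RttmSegment` and `parse_rttm_line` and the example line are hypothetical.

```python
from typing import NamedTuple


class RttmSegment(NamedTuple):
    tbeg: float  # segment onset in seconds
    tdur: float  # segment duration in seconds
    name: str    # speaker label


def parse_rttm_line(line: str) -> RttmSegment:
    # Assumed layout:
    # SPEAKER <file> <channel> <tbeg> <tdur> <NA> <NA> <name> <NA> <NA>
    fields = line.split()
    if fields[0] != "SPEAKER":
        raise ValueError(f"not a SPEAKER record: {line!r}")
    return RttmSegment(float(fields[3]), float(fields[4]), fields[7])


# Made-up example record, not taken from the DED21 data
seg = parse_rttm_line("SPEAKER debate 1 12.50 3.25 <NA> <NA> Rutte <NA> <NA>")
print(seg.tbeg, seg.tbeg + seg.tdur, seg.name)
```

The onset and the onset plus the duration are exactly the two values passed to `Segment` in `convert_rttm_annotation`.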

In [7]:
dataset = load_ded21_dataset(DATA_DIR)

rttm_seqs = []
embeddings_combined = []
speaker_labels = []

for i, sample in enumerate(dataset):
    speaker_labels += get_reference_speakers(DATA_DIR, DATASET, i)
    embeddings = load_speaker_embeddings(MODEL, DATASET, i)
    rttm_seq = load_speech_sequence(SEG_DIR, DATASET, i)
    new_embeddings = average_subsegment_embeddings(rttm_seq, embeddings)
    embeddings_combined += new_embeddings.squeeze()
    rttm_seqs.append(rttm_seq)

speaker_labels = list(set(speaker_labels))

embeddings\speechbrain-ecapa-tdnn\sb_ecapa_tdnn_pa-seg_ded21_0.pt
embeddings\speechbrain-ecapa-tdnn\sb_ecapa_tdnn_pa-seg_ded21_1.pt


In [8]:
rttm_seq_combined = deepcopy(rttm_seqs[0])

for seq in rttm_seqs[1:]:
    rttm_seq_combined.append(seq)

In [9]:
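np.random.seed(123)

# One cluster per reference speaker (the speaker count is taken from the reference)
classifier = SpectralClustering(len(speaker_labels))
num_labels = classifier.fit_predict(np.array([np.array(emb) for emb in embeddings_combined]))

# Assign the cluster labels to the combined sequence
for i, seg in enumerate(rttm_seq_combined.sequence):
    seg.name = num_labels[i]

# Assign the same labels to the per-sample sequences, in the same order
all_segments = [seg for seq in rttm_seqs for seg in seq.sequence]
for i, seg in enumerate(all_segments):
    seg.name = num_labels[i]

for i, seq in enumerate(rttm_seqs):
    seq.write(os.path.join(RESULTS_DIR, f"sc_sb_ecapa_tdnn_{DATASET}_sample_{i}.rttm"))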


In [10]:
model_annotation = convert_rttm_annotation(rttm_seq_combined)

In [11]:
ref_rttm_seq_combined = load_speech_sequence(DATA_DIR, DATASET, 0)

for i, _ in enumerate(dataset):
    if i > 0:
        ref_rttm_seq_combined.append(load_speech_sequence(DATA_DIR, DATASET, i))

In [12]:
ref_annotation = convert_rttm_annotation(ref_rttm_seq_combined)

In [27]:
diarization_der = DiarizationErrorRate(collar=0.25)
diarization_coverage = DiarizationCoverage(collar=0.25)
diarization_purity = DiarizationPurity(collar=0.25)
diarization_mapping = diarization_der.optimal_mapping
vad_detection = DetectionErrorRate(collar=0.25)
seg_precision = SegmentationPrecision(tolerance=0.5)
seg_recall = SegmentationRecall(tolerance=0.5)
seg_coverage = SegmentationCoverage()
seg_purity = SegmentationPurity()

metrics = {
    "der": diarization_der, "d_cov": diarization_coverage, 
    "d_pur": diarization_purity, "d_map": diarization_mapping,
    "vad_err": vad_detection, "seg_pre": seg_precision, "seg_rec": seg_recall,
    "seg_cov": seg_coverage, "seg_pur": seg_purity
}


In [28]:
for met in metrics.keys():
    print(f"{met}: {metrics[met](ref_annotation, model_annotation)}")



der: 0.7539607638662575
d_cov: 0.33392985478827514
d_pur: 0.3381564995009985
d_map: {0: 'Klaver', 1: 'Burger', 2: 'Hoekstra', 3: 'Marijnissen', 4: 'Rutte', 5: 'Kaag', 6: 'Wilders'}
vad_err: 0.16882226422472735
seg_pre: 0.1884498480243161
seg_rec: 0.5740740740740741
seg_cov: 0.6265871799003123
seg_pur: 1.0


The metrics indicate that the custom pipeline does not perform well on the DED21 dataset. The VAD error rate is relatively low at 0.17 (although it could be lower), so voice activity detection does not seem to be the main problem. The speaker change detection metrics (`seg_`) are mixed: recall (0.57) and coverage (0.63) are acceptable, but precision is low (0.19). The diarization metrics, however, are poor (`der` of 0.75, `d_cov` and `d_pur` of about 0.33), which signals that the speaker embeddings or the clustering performed inadequately.
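
The reasoning above relies on the fact that the diarization error rate aggregates missed speech, false alarms, and speaker confusion, while the detection error rate covers only the first two. The frame-level toy sketch below makes this decomposition explicit. It is not how `pyannote.metrics` computes its values (the library works on segment timelines, applies collars, and finds an optimal speaker mapping); the function name `der_components` and the toy labels are assumptions for illustration.

```python
def der_components(reference, hypothesis):
    """Count frame-level errors; labels are speaker names or None for non-speech.

    Assumes the hypothesis labels are already mapped to reference names
    and that there is no overlapped speech.
    """
    counts = {"missed": 0, "false_alarm": 0, "confusion": 0, "total": 0}
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            counts["total"] += 1  # every reference speech frame counts
            if hyp is None:
                counts["missed"] += 1
            elif hyp != ref:
                counts["confusion"] += 1
        elif hyp is not None:
            counts["false_alarm"] += 1
    return counts


# Toy data: speech is detected well, but speakers are often confused
reference = ["A", "A", "A", "B", "B", "B", None, "C", "C", "C"]
hypothesis = ["A", "A", "B", "A", "B", None, "A", "A", "A", "C"]

counts = der_components(reference, hypothesis)
errors = counts["missed"] + counts["false_alarm"] + counts["confusion"]
der = errors / counts["total"]
detection_error = (counts["missed"] + counts["false_alarm"]) / counts["total"]
print(counts)
print(f"der: {der:.2f}, detection error: {detection_error:.2f}")
```

In this toy case the detection error stays low while confusion dominates the diarization error, which is the same pattern as in the results above. In `pyannote.metrics`, calling a metric with `detailed=True` returns its individual components, which allows a similar breakdown on the real annotations.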

In [29]:
for i, sample in enumerate(dataset):
    reference = convert_rttm_annotation(load_speech_sequence(DATA_DIR, DATASET, i))
    hypothesis = convert_rttm_annotation(load_speech_sequence(os.path.join("results", "pyannote"), DATASET, i))
    print(f"Sample {i}:")
    for met in metrics.keys():
        print(f"{met}: {metrics[met](reference, hypothesis)}")

Sample 0:
der: 0.35585114806017526
d_cov: 0.9542816997557668
d_pur: 0.6853093627762681
d_map: {'SPEAKER_00': 'Marijnissen', 'SPEAKER_01': 'Kaag', 'SPEAKER_03': 'Rutte', 'SPEAKER_04': 'Hoekstra'}
vad_err: 0.04241646872525815
seg_pre: 0.18085106382978725
seg_rec: 0.7727272727272727
seg_cov: 0.599764734004743
seg_pur: 1.0
Sample 1:
der: 0.3851063358440405
d_cov: 0.9198759029869336
d_pur: 0.6738817274239243
d_map: {'SPEAKER_00': 'Kaag', 'SPEAKER_01': 'Klaver', 'SPEAKER_02': 'Marijnissen', 'SPEAKER_03': 'Burger', 'SPEAKER_04': 'Rutte', 'SPEAKER_05': 'Wilders'}
vad_err: 0.025117412494461443
seg_pre: 0.2084942084942085
seg_rec: 0.8571428571428571
seg_cov: 0.616670700847178
seg_pur: 0.9999999999999998


For the pyannote pipeline, the metrics indicate much better performance. The VAD error rate is very low (0.04 and 0.03), whereas the speaker change detection metrics are comparable to those of the custom pipeline. The speaker diarization metrics indicate moderate performance (`der` of 0.36 and 0.39), suggesting that embedding and clustering worked much better than in the custom pipeline. The higher coverage compared to purity suggests that the pipeline undersegments the speech signal (i.e., it merges multiple reference speakers into one hypothesized speaker). This can be explained by the lower number of detected speakers compared to the reference.
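
The relationship between purity, coverage, and merged speakers can be illustrated with a frame-level toy example. This is a simplified sketch under stated assumptions (equal-length frames, no overlap, no collar), not the duration-based computation in `pyannote.metrics`; the function name `purity_and_coverage` and the labels are made up for the example.

```python
from collections import Counter


def purity_and_coverage(reference, hypothesis):
    """Frame-level purity and coverage for two equally long label lists."""
    pairs = Counter(zip(reference, hypothesis))
    total = len(reference)
    best_per_cluster = {}  # largest overlap of each hypothesized cluster
    best_per_speaker = {}  # largest overlap of each reference speaker
    for (ref, hyp), n in pairs.items():
        best_per_cluster[hyp] = max(best_per_cluster.get(hyp, 0), n)
        best_per_speaker[ref] = max(best_per_speaker.get(ref, 0), n)
    purity = sum(best_per_cluster.values()) / total
    coverage = sum(best_per_speaker.values()) / total
    return purity, coverage


reference = ["A"] * 4 + ["B"] * 4 + ["C"] * 2

# Speakers A and B merged into one cluster X
merged = ["X"] * 8 + ["Y"] * 2
# Speaker A split across two clusters X and Y
split = ["X", "X", "Y", "Y"] + ["Z"] * 4 + ["W"] * 2

print(purity_and_coverage(reference, merged))
print(purity_and_coverage(reference, split))
```

Merging two reference speakers into one cluster leaves coverage at its maximum but lowers purity, whereas splitting one speaker across clusters does the opposite. The pyannote results above show the first pattern.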