<a href="https://colab.research.google.com/github/mraskj/css_fall2023/blob/main/code/class11/class11-solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class 11: Speaker Diarization and Recognition - Solution

In this exercise, we investigate the validity of using pretrained models to conduct speaker diarization and speaker recognition. For the former, we compare acoustic features computed with groundtruth and automated timestamps. For the latter, we investigate how embeddings can be used to discriminate between speakers.  

As we always do in Colab, you should start by cloning the GitHub repo and by constructing an output folder to save files you create in Colab.

In [None]:
# Clone the GitHub repository into the Colab workspace
!git clone https://github.com/mraskj/css_fall2023.git

In [None]:
import os

# Define an output directory where we save all output created in Colab
output_dir = os.path.join(os.getcwd(), 'output')

# Make the directory if it does not already exist.
if not os.path.exists(output_dir):
  os.mkdir(output_dir)

## Exercise 1: Audio Measurement and Annotations

In the first exercise, we investigate how sensitive the estimation of acoustic features is to annotation errors. I have provided you with RTTM groundtruth annotations for the same debate snippet that we worked with in the tutorial. The RTTM file can be found here: `/content/css_fall2023/data/audio/class11/57-17-groundtruth-withoutspeaker.rttm` (assuming you have cloned the GitHub repo). Note that we work with the version without speaker labels here. You can also work with the version with speaker labels if you want; the results should be similar.

The recording of the debate snippet is found here: `/content/css_fall2023/data/audio/class11/57-71-class.mp4`.

1. Convert the video file to audio

2. Apply diarization using the `pyannote/speaker-diarization@2.1` model from `pyannote.audio`. Describe each step you take (e.g. do you keep all diarized segments or do you discard some? Do you merge back-to-back segments with the same speaker label?).

3. Write the diarization result to a RTTM file.


4. Compute the DER for three different error margins: 0.0, 0.5, and 1.0. Describe the results. Based on this, do you expect substantial differences in the estimation of acoustic features such as pitch, loudness, or MFCCs when using groundtruth annotations compared to the automated annotations?


5. Split the diarized segments to separate audio files


6. Split the groundtruth segments to separate audio files


7. Compute the first 10 MFCCs, pitch, and intensity for each of the segments from steps 5 and 6. For pitch, also compute the standard deviation, minimum, and maximum value.

8. Compare the measures for the diarized and groundtruth segments. Note that you must link each diarized segment to a groundtruth segment to be able to do the comparison. There must be exactly the same number of segments in both conditions. Describe and show the results.

#### Exercise 1.1

I start by converting the video file to an audio file using FFmpeg. I use a sampling rate of 16,000 Hz and a single channel.

In [None]:
import subprocess

# We specify the sampling rate and number of channels beforehand. This is not
# strictly necessary, but is a useful approach when you wrap code in functions.
sr = 16000
channels = 1

# Specify paths to the video and audio files
video_fpath = '/content/css_fall2023/data/audio/class11/57-71-class.mp4'
audio_fpath = os.path.join(output_dir, '57-71-class.wav')

# Write the ffmpeg code
cmd = ['ffmpeg',
       '-y',
       '-i',
       video_fpath,
       '-vn',
       '-ar',
       str(sr),
       '-ac',
       str(channels),
       '-acodec',
       'pcm_s16le',
       audio_fpath]

# Execute the code in the shell
subprocess.call(cmd)

# -y specifies that the output file should be overwritten if already existing
# -i specifies the input file
# -vn omits the video stream
# -ar specifies the sampling rate
# -ac specifies the number of channels
# -acodec specifies the type of encoding (in this case pcm_s16le)

#### Exercise 1.2

For the diarization, I first install the pyannote.audio library and then import the `Pipeline` class.

In [None]:
!pip install pyannote.audio

In [None]:
# Import Pipeline from pyannote.audio
from pyannote.audio import Pipeline

I then load in the pretrained pipeline `pyannote/speaker-diarization@2.1` and allocate the loaded pipeline to `cuda` (i.e. a GPU) if available. The audio file is then diarized.

In [None]:
# Define access token and name of pipeline
from google.colab import userdata
access_token = userdata.get('huggingface')

pipeline_name = "pyannote/speaker-diarization@2.1"

# Load and initiate pipeline.
pipeline = Pipeline.from_pretrained(pipeline_name,
                                    use_auth_token=access_token)

In [None]:
# Import PyTorch
import torch

# Send pipeline to GPU (if available)
pipeline.to(torch.device("cuda"  if torch.cuda.is_available() else "cpu"))

print(f"Device: {pipeline.device}")

In [None]:
# Apply the diarization pipeline to our audio file
diarization = pipeline(os.path.join(output_dir, '57-71-class.wav'))

In [None]:
# Keep only segments that are five seconds or longer.
# The choice of five seconds is fairly arbitrary, but is intended to
# filter out segments that are not actual speech.
threshold = 5.0
dia_result = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    start = round(turn.start, 1)
    end = round(turn.end, 1)

    # Keep only segments that are five seconds or longer
    if (end - start) >= threshold:
      dia_result.append([speaker, start, end])

In [None]:
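# Merge consecutive segments that share the same speaker label
diarized = []
curr = dia_result[0][0]
start = dia_result[0][1]
for i in range(len(dia_result)-1):
  if curr == dia_result[i+1][0]:
    continue
  else:
    diarized.append([curr, start, dia_result[i][2]])
    curr = dia_result[i+1][0]
    start = dia_result[i+1][1]

# Close the final run. Indexing with -1 (rather than i+1) also works when
# dia_result holds a single segment and the loop body never runs.
diarized.append([curr, start, dia_result[-1][2]])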

In [None]:
diarized

#### Exercise 1.3

I then loop through each diarized segment and write to an RTTM file. Before I do so, I manually remove the first and last diarized segments, which correspond to chair speech. If you look at the groundtruth file, you will see that these segments are not included. This is not essential at this stage, but when we make pairwise comparisons across acoustic features, we need to link *one* diarized segment to *one* groundtruth segment in a 1:1 match.
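
For reference, an RTTM line consists of ten space-separated fields: type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, and signal lookahead time. Fields that are not used are written as `<NA>`. The sketch below formats and parses a single line. The helper names `format_rttm_line` and `parse_rttm_line` and the example values are my own illustrations and are not part of `pyannote` or of the solution code.

```python
# Sketch: format and parse a single RTTM line (ten space-separated fields).
# Field order: type, file id, channel, onset, duration, orthography,
# speaker type, speaker name, confidence, signal lookahead time.

def format_rttm_line(file_id, start, duration, speaker, channel=1):
    """Return one RTTM 'SPEAKER' line; unused fields are written as <NA>."""
    return (f"SPEAKER {file_id} {channel} {start:.2f} {duration:.2f} "
            f"<NA> <NA> {speaker} <NA> <NA>")


def parse_rttm_line(line):
    """Return (file_id, start, duration, speaker) from one RTTM line."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"Not a valid RTTM SPEAKER line: {line!r}")
    # Index 3 is the onset, index 4 the duration, index 7 the speaker name
    return fields[1], float(fields[3]), float(fields[4]), fields[7]


# Hypothetical example values
line = format_rttm_line("57-17-class", 12.3, 45.6, "SPEAKER_03")
print(line)
print(parse_rttm_line(line))
```

The positions 3, 4, and 7 are the same indices that the splitting function in *Exercise 1.5* relies on.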

In [None]:
# Write output to an RTTM-file by looping through each diarized segment
dia_rttm = []
for i, v  in enumerate(diarized[1:-1]):
    start = round(float(v[1]), 2)
    end = round(float(v[2]), 2)

    # Join each line as a string
    dia_rttm.append(' '.join(['SPEAKER ' +
                      '57-17-class' +
                      ' 1',
                      str(start),
                      str(round(end-start, 1)),
                      '<NA> <NA>',
                      v[0],
                      '<NA> <NA>']))

# Writing
with open(os.path.join(output_dir, '57-17-prediction.rttm'), 'w') as f:
    for d in dia_rttm:
        f.write('%s\n' % d)

#### Exercise 1.4

To compare the automated annotations with the groundtruth, I use the Diarization Error Rate (DER), which is non-negative and, because false alarms count as errors, can exceed 1. The lower the DER, the better. I compute the DER for three different values of `collar`, which controls the allowed annotation error around segment boundaries: `collars = [0.0, 0.5, 1.0]`.
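
As a quick numeric illustration, the DER is commonly defined as the summed duration of false alarms, missed detections, and speaker confusions divided by the total duration of groundtruth speech. Because false alarms enter the numerator but not the denominator, the ratio is not capped at 1 in general. The function name and all durations below are hypothetical and only meant as a sketch of the arithmetic, not as a replacement for `pyannote.metrics`.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total):
    """DER as the summed error durations over total groundtruth speech."""
    if total <= 0:
        raise ValueError("total groundtruth duration must be positive")
    return (false_alarm + missed_detection + confusion) / total


# Hypothetical durations in seconds, chosen only for illustration
der_small = diarization_error_rate(false_alarm=6.0, missed_detection=1.0,
                                   confusion=0.0, total=200.0)
der_large = diarization_error_rate(false_alarm=150.0, missed_detection=30.0,
                                   confusion=40.0, total=200.0)

print(f"{der_small:.3f}")  # 0.035
print(f"{der_large:.3f}")  # 1.100
```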

In [None]:
# Load in DER metric and function to load RTTM files
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# Specify collar ranges
collars = [0.0, 0.5, 1.0]

In [None]:
# Load the RTTM-files. Note that the groundtruth file loaded here is the
# version with speaker labels.
prediction = load_rttm(os.path.join(output_dir, '57-17-prediction.rttm'))['57-17-class']
groundtruth = load_rttm('/content/css_fall2023/data/audio/class11/57-17-groundtruth-withspeaker.rttm')['57-17-class']

# Compute the DER for each collar
der_dict = {}
for collar in collars:
  print(f"Error margin: {collar} seconds")
  metric = DiarizationErrorRate(collar=collar, skip_overlap=True)
  der = metric(groundtruth, prediction, detailed=True)
  der_dict[str(collar)] = der

In [None]:
# Convert DER results to a pandas dataframe to represent it as a table
import pandas as pd
der_df = pd.DataFrame(der_dict)
der_df

The overall diarization error is very low across all three error margins, ranging from 6.53% with no margin ($0.0$ seconds) to 2.99% with a margin of $1.0$ seconds. The differences arise almost entirely from differences in false alarms, meaning that the automated annotations label portions of the signal as speech when they are in fact not speech.

Based on the small differences, we expect minor differences in the estimation of acoustic features.

#### Exercise 1.5

I now split the diarized segments into individual audio files based on the generated timestamps.


In [None]:
# Define a function that splits an audio file based on timestamps from an
# RTTM file. Note that each segment must be a list of RTTM fields in the
# standard order, since start (d[3]), duration (d[4]), and speaker (d[7])
# are accessed by index.
def audio_split(segments:list, segment_dir:str, channels=1, sr=16000):

  for i, d in enumerate(segments):

    segment_fpath = os.path.join(segment_dir, f"segment_{i}-{d[7]}.wav")
    cmd = ['ffmpeg',
            '-y',
            '-i',
            audio_fpath,
            '-vn',
            '-ar',
            str(sr),
            '-ac',
            str(channels),
            '-acodec',
            'pcm_s16le',
            '-ss',
            str(d[3]),
            '-t',
            str(d[4]),
            segment_fpath]

    subprocess.call(cmd)

In [None]:
# Preprocess diarization list to be used to split audio file into its individual segments
dia_rttm_split = [x.split() for x in dia_rttm]

In [None]:
# Construct segments from diarization output
segment_dir = os.path.join(output_dir, 'segments_diarized')
if not os.path.exists(segment_dir):
  os.mkdir(segment_dir)

audio_split(segments=dia_rttm_split, segment_dir=segment_dir)

#### Exercise 1.6

I now do the same for the groundtruth segments. I first read in the groundtruth RTTM file and preprocess to the same format.

In [None]:
# Read in groundtruth RTTM and preprocess it to match the diarization output used above
with open('/content/css_fall2023/data/audio/class11/57-17-groundtruth-withspeaker.rttm') as f:
  lines = f.read()
  lines = lines.split('\n')
  lines = [line for line in lines if len(line) > 0]
  groundtruth_rttm = [x.split() for x in lines]

In [None]:
# Construct segments from groundtruth
segment_dir = os.path.join(output_dir, 'segments_groundtruth')
if not os.path.exists(segment_dir):
  os.mkdir(segment_dir)

audio_split(segments=groundtruth_rttm, segment_dir=segment_dir)

In [None]:
groundtruth_rttm[11], dia_rttm[11]

#### Exercise 1.7

I now compute a range of acoustic features using `praat-parselmouth` for each of the diarized and groundtruth segments.

In [None]:
!pip install praat-parselmouth

In [None]:
diarized_dir = os.path.join(output_dir, 'segments_diarized')
groundtruth_dir = os.path.join(output_dir, 'segments_groundtruth')
segments_diarized = os.listdir(diarized_dir)
segments_groundtruth = os.listdir(groundtruth_dir)

In [None]:
import re
def digit_sort_key(v):

    # Raw string avoids an invalid escape sequence warning for \d
    digits = re.search(r'\d+', v)
    return int(digits.group())

# Sort segments by segment number
segments_diarized = sorted(segments_diarized, key=digit_sort_key)
segments_groundtruth = sorted(segments_groundtruth, key=digit_sort_key)

In [None]:
segments_groundtruth[11], segments_diarized[11]

In [None]:
# Verify that we have an equal number of segments:
assert len(segments_diarized) == len(segments_groundtruth)

In [None]:
import parselmouth
import numpy as np

In [None]:
def praat_acoustics(fname):
    snd = parselmouth.Sound(fname)
    pitch = parselmouth.praat.call(snd, "To Pitch", 0.0, 75, 500)
    f0_mean = parselmouth.praat.call(pitch, "Get mean", 0, 0, 'Hertz')
    f0_std = parselmouth.praat.call(pitch, "Get standard deviation", 0 ,0, 'Hertz')
    f0_max = parselmouth.praat.call(pitch, "Get maximum", 0, 0, 'Hertz',"None")
    f0_min = parselmouth.praat.call(pitch, "Get minimum", 0, 0, 'Hertz',"None")
    mean_loudness = np.mean(snd.to_intensity().values)
    mfccs = np.mean(snd.to_mfcc(10).to_array(), axis=1)

    acoustics = {'f0_mean': f0_mean,
                 'f0_std': f0_std,
                 'f0_max': f0_max,
                 'f0_min': f0_min,
                 'loudness_mean': mean_loudness}

    for i, mfcc in enumerate(mfccs[1:]):
      acoustics[f'MFCC{i+1}'] = mfcc

    return acoustics

In [None]:
diarized_acoustics = {}
for s in [x for x in segments_diarized if x.endswith('wav')][:]:
  ac = praat_acoustics(fname=os.path.join(diarized_dir, s))
  diarized_acoustics[s] = ac

In [None]:
groundtruth_acoustics = {}
for s in [x for x in segments_groundtruth if x.endswith('wav')][:]:
  ac = praat_acoustics(fname=os.path.join(groundtruth_dir, s))
  groundtruth_acoustics[s] = ac

#### Exercise 1.8

I have now computed a bunch of acoustic features and saved them in two dictionaries, one for the diarized segments and one for the groundtruth segments. The next task is to compare the acoustic features.

This needs to be done pairwise such that *one* diarized segment is matched to *one* groundtruth segment.
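
The solution below links segments by their position after sorting. A more defensive alternative, sketched here as an assumption rather than as the approach used in this notebook, is to link each groundtruth segment to the diarized segment with the largest temporal overlap and then check that the link is one-to-one. The function names and timestamps are hypothetical.

```python
def overlap(a, b):
    """Overlap in seconds between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def match_segments(groundtruth, diarized):
    """Map each groundtruth index to the diarized index with the largest overlap."""
    matches = {}
    for g_ix, g in enumerate(groundtruth):
        overlaps = [overlap(g, d) for d in diarized]
        best = max(range(len(diarized)), key=overlaps.__getitem__)
        if overlaps[best] == 0.0:
            raise ValueError(f"Groundtruth segment {g_ix} has no overlapping diarized segment")
        matches[g_ix] = best

    # A 1:1 match requires that no diarized segment is used twice
    if len(set(matches.values())) != len(matches):
        raise ValueError("Matching is not one-to-one")
    return matches


# Hypothetical (start, end) timestamps in seconds
groundtruth = [(0.0, 10.0), (10.5, 30.0), (31.0, 40.0)]
diarized = [(0.2, 9.8), (10.4, 30.3), (30.9, 40.5)]
print(match_segments(groundtruth, diarized))  # {0: 0, 1: 1, 2: 2}
```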

In [None]:
# Generate segment number
segment_number = pd.Series(list(range(0, len(segments_diarized))), name='segment_number')

# Construct dataframes with diarized acoustics and groundtruth acoustics
diarized_df = pd.DataFrame(diarized_acoustics).transpose().reset_index(names='segment')
diarized_df = pd.concat([segment_number, diarized_df], axis=1)

groundtruth_df = pd.DataFrame(groundtruth_acoustics).transpose().reset_index(names='segment')
groundtruth_df = pd.concat([segment_number, groundtruth_df], axis=1)

In [None]:
diarized_df.iloc[9:12]

In [None]:
groundtruth_df.iloc[9:12]

In [None]:
# Compute pairwise differences for each segment and for each feature.
# Note that we don't have to loop through each segment, as pandas
# aligns the two dataframes on their index under the hood.
features = diarized_df.columns[2:]
acoustic_pairwise = {}
for f in features:
  pairwise_diff = np.abs(groundtruth_df[f] - diarized_df[f])
  acoustic_pairwise[f] = pairwise_diff

# Convert dict to dataframe
acoustic_pairwise_df = pd.DataFrame(acoustic_pairwise)

In [None]:
# Inspect results
acoustic_pairwise_df

The dataframe `acoustic_pairwise_df` contains the absolute pairwise differences in the acoustic features, measured in natural units. That is, the difference in mean $F0$ (`f0_mean`) is in Hertz, while the difference in mean loudness (`loudness_mean`) is in decibels (dB).

The initial inspection reveals that the differences seem negligible, suggesting that the automated annotations accurately recover the groundtruth results. There is one notable exception, which occurs for the 12th segment (index 11). This corresponds to:
- `/content/output/segments_diarized/segment_11-SPEAKER_03.wav`
- `/content/output/segments_groundtruth/segment_11-TraceyCrouch.wav`

To see why this difference occurs, I first manually listen to each of the segments and check the timestamps for the diarized and groundtruth annotations. Both checks suggest that the difference between the two segments is minor, yet we observe a substantial difference in the average $F0$. To see what's going on, I represent each of the two segments as a spectrogram and project estimates of the $F0$ onto the plot.

In [None]:
import matplotlib.pyplot as plt
def draw_spectrogram(spectrogram, start=0, stop=10000, dynamic_range=70, cmap='afmhot'):
    """
    Draw a spectrogram using Matplotlib.

    Parameters:
    spectrogram (Spectrogram): The input spectrogram.
    start (float, optional): The start time in seconds. Defaults to 0.
    stop (float, optional): The stop time in seconds. Defaults to 10000.
    dynamic_range (float, optional): The dynamic range in decibels. Defaults to 70.
    cmap (str, optional): The colormap to use. Defaults to 'afmhot'.

    Returns:
    None
    """
    X, Y = spectrogram.x_grid(), spectrogram.y_grid()
    indices = np.where((X >= start) & (X <= stop))[0]
    X = X[indices]
    sg_db = 10 * np.log10(spectrogram.values)
    sg_db = sg_db[:,indices[:-1]]
    plt.pcolormesh(X, Y, sg_db, vmin=sg_db.max() - dynamic_range, cmap=cmap)
    plt.ylim([spectrogram.ymin, spectrogram.ymax])


def draw_pitch(pitch):
    """
    Draw a pitch contour using Matplotlib.

    Parameters:
    pitch (Pitch): The input pitch object.

    Returns:
    Tuple[float, float]: A tuple containing the mean F0 and standard deviation of F0.
    """
    pitch_values = pitch.selected_array['frequency']
    f0mean = np.mean(pitch_values[pitch_values > 0])
    f0std = np.std(pitch_values[pitch_values > 0])
    pitch_values[pitch_values==0] = np.nan

    plt.plot(pitch.xs(), pitch_values, 'o', markersize=5, color='#381a61')
    plt.plot(pitch.xs(), pitch_values, 'o', markersize=2, color='#e78429')

    plt.grid(False)
    plt.ylim(0, pitch.ceiling)

    return np.round(f0mean, 2), np.round(f0std, 2)


def draw_spectrogram_pitch(files:list,
                           start=None,
                           stop=None,
                           max_freq = 8000,
                           cmap='afmhot',
                           show=True):


    """
    Draw spectrograms and pitch contours for a list of audio files.

    Parameters:
    files (list): List of paths to audio files.
    start (float, optional): The start time in seconds. If not provided, it's determined from the audio data. Defaults to None.
    stop (float, optional): The stop time in seconds. If not provided, it's determined from the audio data. Defaults to None.
    max_freq (float, optional): The maximum frequency for the spectrogram. Defaults to 8000.
    cmap (str, optional): The colormap to use. Defaults to 'afmhot'.
    show (bool, optional): Whether to display the plots. Defaults to True.

    Returns:
    None
    """
    n_plots = len(files)
    # Two columns (one for a single file) and enough rows to fit all files,
    # including an odd number of files
    num_cols = min(n_plots, 2)
    num_rows = (n_plots + 1) // 2

    for c, f in enumerate(files):
        snd = parselmouth.Sound(f)

        if not start and not stop:
          start, stop = snd.xmin, snd.xmax

        pitch = snd.to_pitch(pitch_floor=75, pitch_ceiling=500)
        spectrogram = snd.to_spectrogram(window_length=0.025, maximum_frequency=max_freq)

        plt.subplot(num_rows, num_cols, c + 1)

        draw_spectrogram(spectrogram, start=start, stop=stop, cmap=cmap)

        if c % 2 == 0:
          plt.ylabel('Frequency (Hz)')

        if c % 2 == 1:
          plt.yticks([])

        plt.xlabel('Time (s)')
        plt.xticks([start, stop])


        plt.twinx()
        f0mean, f0std = draw_pitch(pitch)

        if c % 2 == 0:
          plt.yticks([])

        if c % 2 == 1:
          plt.ylabel('F0 (Hz)')


        plt.xlim([start, stop])
        plt.title(f"F0 mean={np.round(f0mean, 3)}, F0 std={np.round(f0std, 3)}")

    if show:
      plt.show()


In [None]:
plt.figure(figsize=(10, 6))
draw_spectrogram_pitch(files=[os.path.join(diarized_dir, segments_diarized[11]),
                              os.path.join(groundtruth_dir, segments_groundtruth[11])],
                       cmap='viridis')

The plots show that the difference arises from high pitch estimates at the very start of the groundtruth segment. These cannot be heard in the audio signal, but because the segment is short, they contribute a fairly large amount to the overall pitch estimate for the groundtruth segment.
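
To see how a handful of spurious frames can move the mean of a short segment more than that of a long one, consider the toy calculation below. The $F0$ values and frame counts are hypothetical. The median is included only as one possible robust alternative and is not used elsewhere in this solution.

```python
from statistics import mean, median

# Hypothetical F0 tracks (Hz): 200 Hz speech plus 5 spurious 450 Hz frames
outliers = [450.0] * 5
short_segment = outliers + [200.0] * 45    # 50 voiced frames in total
long_segment = outliers + [200.0] * 495    # 500 voiced frames in total

for name, f0 in [("short", short_segment), ("long", long_segment)]:
    print(f"{name}: mean={mean(f0):.1f} Hz, median={median(f0):.1f} Hz")
# short: mean=225.0 Hz, median=200.0 Hz
# long: mean=202.5 Hz, median=200.0 Hz
```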

I now move on to the average differences for each feature. I compute both an unweighted and a weighted average. The unweighted average pools the differences across segments without further ado, while the weighted average takes the varying durations into account. I construct the weights such that each segment's contribution is proportional to the segment's share of the total duration.
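
In other words, with absolute differences $d_i$ and durations $t_i$, the weighted average is $\sum_i w_i d_i$ with $w_i = t_i / \sum_j t_j$. The small sketch below shows the arithmetic on made-up numbers. The function name `duration_weighted_mean` and the values are hypothetical.

```python
def duration_weighted_mean(differences, durations):
    """Average of `differences` with weights proportional to `durations`."""
    if len(differences) != len(durations):
        raise ValueError("differences and durations must have the same length")
    total = sum(durations)
    return sum(d * dur / total for d, dur in zip(differences, durations))


# Hypothetical absolute F0 differences (Hz) and segment durations (s).
# The largest difference belongs to the shortest segment.
differences = [1.0, 2.0, 12.0]
durations = [60.0, 30.0, 10.0]

unweighted = sum(differences) / len(differences)
weighted = duration_weighted_mean(differences, durations)
print(f"unweighted={unweighted:.2f}, weighted={weighted:.2f}")
# unweighted=5.00, weighted=2.40
```

When the largest differences sit in the shortest segments, as in this toy example, the weighted average ends up below the unweighted one.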

In [None]:
# Compute unweighted means
unweighted_mean = np.mean(acoustic_pairwise_df, axis=0)

In [None]:
# Weighted mean with weights corresponding to each segment's share of total duration
dur = [float(x[4]) for x in groundtruth_rttm]
dur_share = [x/sum(dur) for x in dur]
weighted_mean = acoustic_pairwise_df.apply(lambda x: x * dur_share)
weighted_mean = np.sum(weighted_mean, axis=0)

In [None]:
weighted_df = pd.DataFrame(weighted_mean, columns=['weighted'])
unweighted_df = pd.DataFrame(unweighted_mean, columns=['unweighted'])
pairwise_df = pd.concat([unweighted_df, weighted_df], axis=1)
# pairwise_df = pairwise_df[:4]

In [None]:
# Inspect results
pairwise_df

The pairwise comparisons show very minor differences between the acoustic results obtained using automated annotations and those obtained using the manually created groundtruth. This suggests that automated annotations can be used efficiently to compute acoustic features with high validity.

This holds for both the unweighted and weighted values. The latter are generally smaller than the former. This makes intuitive sense, as shorter segments are more sensitive to annotation errors than longer segments since each estimate contributes more to the segment's average.

To illustrate the results, I plot the average differences for the $F0$ features below.

In [None]:
pairwise_df = pairwise_df[:4]
fig, ax = plt.subplots(figsize=(10,6), facecolor = "white")

ax.grid(which="major", axis='both', color='#758D99', alpha=0.6, zorder=1)

ax.spines[['top','right','bottom']].set_visible(False)

ax.hlines(y=pairwise_df.index, xmin=pairwise_df['weighted'], xmax=pairwise_df['unweighted'], color='#758D99', zorder=2, linewidth=2, label='_nolegend_', alpha=.8)
ax.scatter(pairwise_df['unweighted'], pairwise_df.index, label='Unweighted', s=60, color='#DB444B', zorder=3, alpha=.7)
ax.scatter(pairwise_df['weighted'], pairwise_df.index, label='Weighted', s=60, color='#006BA2', zorder=3, alpha=.7)

ax.xaxis.set_tick_params(labeltop=True,
                         labelbottom=False,
                         bottom=False,
                         labelsize=10,
                         pad=-1)

ax.text(0.5, 1.06, 'Hertz', transform=ax.transAxes, horizontalalignment='center', fontsize=10)

ax.set_yticks(pairwise_df.index)
ax.set_yticklabels(pairwise_df.index,     # Set labels again
                   ha = 'left')           # Set horizontal alignment to left
ax.yaxis.set_tick_params(pad=120,         # Pad tick labels so they don't go over y-axis
                         labelsize=11)    # Set label size

ax.legend(['Unweighted', 'Weighted'], loc=(0,1.05), ncol=2, frameon=False, handletextpad=-.1, handleheight=1)

# Set xlim
ax.set_xlim(0, 10)

## Exercise 2: Similarity of Speaker Embeddings

In the tutorial, we saw how a pretrained speaker embedding model can be used to construct speaker embeddings on a completely different set of audio files without any fine-tuning or adaptation. While we did it visually in the tutorial, we'll explore the similarity of embeddings using cosine similarity to test whether we can use these for speaker recognition.

The audio we work with is the diarized segments from *Exercise 1*. Your task is to:

1. Compute the pairwise cosine similarity between embeddings computed using a *sliding* window. You decide on the `duration` and `step` parameters. Compare the average for embeddings from same speakers and the average for embeddings from different speakers. Plot and describe your results. The plot should be a histogram colored by whether the similarity is computed on embeddings from the same or different speakers. I have provided you with a function below: `plot_histograms`

2. Compute the pairwise cosine similarity between embeddings computed using a *fixed* window (specified with `window="whole"`). Plot and describe your results. The plot should be a similarity matrix (a heatmap). I have provided you with a function below: `plot_similarity_matrix`

3. Discuss based on the results in 1+2 whether pretrained speaker embeddings can be exploited for speaker recognition.

There are a bunch of resources that might help you in the exercise.

- For plotting:
    * https://github.com/resemble-ai/Resemblyzer/blob/master/demo_utils.py
    * https://github.com/resemble-ai/Resemblyzer (see the cross-similarity plot)
- For embeddings:
    * https://huggingface.co/pyannote/embedding
- For cosine similarity:
    * Use the `cosine_similarity()` function from `sklearn.metrics.pairwise`

Note that there are multiple ways to achieve the results, and yours might very well be smarter than mine. Take a look at the solution if you get stuck.

In [None]:
# Tools for plotting
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.animation import FuncAnimation
from matplotlib import cm
import matplotlib.pyplot as plt

_default_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
_my_colors = np.array([
    [0, 127, 70],
    [255, 0, 0],
    [255, 217, 38],
    [0, 135, 255],
    [165, 0, 165],
    [255, 167, 255],
    [97, 142, 151],
    [0, 255, 255],
    [255, 96, 38],
    [142, 76, 0],
    [33, 0, 127],
    [0, 0, 0],
    [183, 183, 183],
    [76, 255, 0],
], dtype=float) / 255


def plot_histograms(all_samples, names=None, title=""):
    """
    Plots (possibly) overlapping histograms and their median
    """

    _, ax = plt.subplots()

    for samples, color, name in zip(all_samples, _default_colors, names):
      ax.hist(samples, density=True, color=color, label=name, alpha=0.5)
    ax.legend(frameon=False, loc='upper right')
    ax.set_xlim(0, 1)
    ax.set_yticks([])
    ax.set_title(title)

    ylim = ax.get_ylim()
    ax.set_ylim(*ylim)
    for samples, color in zip(all_samples, _default_colors):
        median = np.median(samples)
        ax.vlines(median, *ylim, color, "dashed")
        ax.text(median, ylim[1] * 0.15, "median", rotation=270, color=color)

def plot_similarity_matrix(matrix, labels_a=None, labels_b=None, ax: plt.Axes=None, title=""):
    if ax is None:
        _, ax = plt.subplots()
    fig = plt.gcf()

    img = ax.matshow(matrix, extent=(-0.5, matrix.shape[0] - 0.5,
                                     -0.5, matrix.shape[1] - 0.5))

    ax.xaxis.set_ticks_position("bottom")
    if labels_a is not None:
        ax.set_xticks(range(len(labels_a)))
        ax.set_xticklabels(labels_a, rotation=90, size=7)
    if labels_b is not None:
        ax.set_yticks(range(len(labels_b)))
        ax.set_yticklabels(labels_b[::-1], size=7)  # Upper origin -> reverse y axis
    ax.set_title(title)


    cax = make_axes_locatable(ax).append_axes("right", size="5%", pad=0.15)
    fig.colorbar(img, cax=cax, ticks=np.linspace(0.25, 1, 4))
    img.set_clim(0.25, 1)
    img.set_cmap("inferno")

In [None]:
# Mapping speaker labels to speaker names and gender
# NOTE: It is not guaranteed that your labels correspond to those below.
#       You might need to make your own mapping. You can use the transcript
#       '/content/css_fall2023/data/audio/class11/57-17-official_transcript.csv'
#       for help if you don't know the speakers purely by voice.
speaker_mapping = {'SPEAKER_09': 'Karen Bradley',
                  'SPEAKER_06': 'David Hanson',
                  'SPEAKER_07': 'John Whittingdale',
                  'SPEAKER_01': 'Christine Jardine',
                  'SPEAKER_03': 'Tracey Crouch',
                  'SPEAKER_10': 'Wes Streeting',
                  'SPEAKER_04': 'Amanda Milling',
                  'SPEAKER_11': 'Chris Elmore',
                  'SPEAKER_05': 'Nusrat Ghani',
                  'SPEAKER_00': 'Jim Shannon',
                  'SPEAKER_08': 'Tom Watson',
                  'SPEAKER_12': 'John Glen',
                  'SPEAKER_13': 'Luke Pollard',
                  'SPEAKER_02': 'CHAIR'}

gender_mapping = {'SPEAKER_09': 'Woman',
                  'SPEAKER_06': 'Man',
                  'SPEAKER_07': 'Man',
                  'SPEAKER_01': 'Woman',
                  'SPEAKER_03': 'Woman',
                  'SPEAKER_10': 'Man',
                  'SPEAKER_04': 'Woman',
                  'SPEAKER_11': 'Man',
                  'SPEAKER_05': 'Woman',
                  'SPEAKER_00': 'Man',
                  'SPEAKER_08': 'Man',
                  'SPEAKER_12': 'Man',
                  'SPEAKER_13': 'Man',
                  'SPEAKER_02': 'Man'}

#### Exercise 2.1


In [None]:
# Load pretrained embedding model
from pyannote.audio import Model
embedding_model = Model.from_pretrained("pyannote/embedding",
                              use_auth_token=access_token)

# Load Inference class
from pyannote.audio import Inference

In [None]:
# We use a sliding window with a duration of 5 seconds with a 1s step.
embedding_inference = Inference(embedding_model, window="sliding", duration=5, step=1)

# Define list of segment filepaths
segment_fpaths = list(segments_diarized)

# Define empty list to store the embeddings
embeddings = []

# Define empty lists to store speaker and gender labels
speaker_labels, speaker_gender = [], []

# Loop through each segment
for ix, segment_fpath in enumerate(segment_fpaths):
  # Compute embedding for each segment and convert to numpy array
  embed = np.array(embedding_inference(os.path.join(diarized_dir, segment_fpath)))
  # embed = embed / np.linalg.norm(embed)

  # Concatenate with previous embeddings
  if len(embeddings) > 0:
    embeddings = np.concatenate([embeddings, embed])
  else:
    embeddings = np.concatenate([embed,])

  # Generate speaker and gender labels
  segment_speaker = segment_fpath.split('-')[-1].split('.')[0]
  speaker_labels += [speaker_mapping[segment_speaker]] * embed.shape[0]
  speaker_gender += [gender_mapping[segment_speaker]] * embed.shape[0]


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all window-level embeddings
sim_matrix = cosine_similarity(embeddings)

# Create a dictionary to store the indices for each name
speaker_group = {}

for ix, sp in enumerate(speaker_labels):
    if sp not in speaker_group:
        speaker_group[sp] = [ix]
    else:
        speaker_group[sp].append(ix)

# Print the indices for each name group
for sp, ix in speaker_group.items():
    print(f"{sp}: {ix}")

In [None]:
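indices = list(range(0, len(sim_matrix)))
sim_same_list = []
sim_diff_list = []

# For each speaker, use that speaker's first window as the reference row of
# the similarity matrix.
for speaker in speaker_group:
  same_indices = speaker_group[speaker]
  reference = np.min(same_indices)

  # Similarities to windows from the same speaker. Note that this includes
  # the reference window's similarity with itself (1.0).
  same_values = sim_matrix[reference][same_indices]

  # Similarities to windows from all other speakers
  diff_indices = list(set(indices).difference(same_indices))
  diff_values = sim_matrix[reference][diff_indices]

  sim_same_list += [same_values]
  sim_diff_list += [diff_values]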

sim_same_list = np.concatenate(sim_same_list)
sim_diff_list = np.concatenate(sim_diff_list)

In [None]:
plot_histograms((sim_same_list, sim_diff_list), ["Same speaker", "Different speakers"])

The plot shows that embeddings from the same speaker are, on average, much more similar than embeddings from different speakers. The median cosine similarity is almost $0.6$ for same-speaker comparisons and around $0.17$ for different-speaker comparisons. The two distributions overlap only slightly. Some overlap is expected, since the embeddings are computed on 5-second windows, which leaves room for noise and "bad variation".
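
The code above compares one reference window per speaker against all other windows. An alternative, sketched here under the assumption that all unique pairs should be compared and self-comparisons excluded, is to loop over every pair of embeddings and split the similarities by whether the two labels agree. The toy 2-D vectors, the labels, and the function names are hypothetical and only stand in for real embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def split_similarities(vectors, labels):
    """Return (same, different) similarity lists over all unique pairs i < j."""
    same, different = [], []
    for i, j in combinations(range(len(vectors)), 2):
        target = same if labels[i] == labels[j] else different
        target.append(cosine(vectors[i], vectors[j]))
    return same, different


# Toy 2-D "embeddings" with hypothetical speaker labels
vectors = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]
labels = ["A", "A", "B", "B"]

same, different = split_similarities(vectors, labels)
print([round(s, 2) for s in same])       # [0.8, 0.8]
print([round(s, 2) for s in different])  # [0.0, 0.6, 0.6, 0.96]
```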

#### Exercise 2.2

In [None]:
# We use the whole segment as a single window, giving one embedding per segment.
embedding_inference = Inference(embedding_model, window="whole")

# Define list of segment filepaths
segment_fpaths = list(segments_diarized)

# Define empty list to store the embeddings
embeddings = []

# Define empty lists to store speaker and gender labels
speaker_labels, speaker_gender = [], []

# Loop through each segment
for ix, segment_fpath in enumerate(segment_fpaths[:]):

  # Compute embedding for each segment and convert to numpy array
  embed = np.array(embedding_inference(os.path.join(diarized_dir, segment_fpath)))

  # Append
  embeddings.append(embed)

  # Generate speaker and gender labels
  segment_speaker = segment_fpath.split('-')[-1].split('.')[0]
  speaker_labels += [speaker_mapping[segment_speaker]]
  speaker_gender += [gender_mapping[segment_speaker]]


In [None]:
cross_sim_matrix = cosine_similarity(np.array(embeddings), np.array(embeddings))
plot_similarity_matrix(cross_sim_matrix, labels_a=speaker_labels, labels_b=speaker_labels)

The similarity matrix shows the same pattern as the distributions in *Exercise 2.1*: embeddings from the same speaker are much more similar than embeddings from different speakers. In the heatmap, brighter colors indicate a higher cosine similarity between embeddings computed on entire segments.

#### Exercise 2.3

Based on the results in *2.1* and *2.2*, pretrained speaker embeddings are a promising approach to speaker recognition. Although the model is trained on a different population of speakers, it is still able to generate embeddings that encode each speaker's unique voice characteristics. The results show 1) that embeddings from the same speaker have a high similarity and 2) that embeddings from different speakers have a substantially lower similarity. Taken together, this means that the pretrained embeddings can be used to recognize a speaker while at the same time distinguishing between different speakers.
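
One way to turn this into a simple recognition rule, sketched here as an assumption and not evaluated in this exercise, is to compare a new embedding with one enrolled reference embedding per speaker and accept the best match only if its cosine similarity exceeds a threshold. The threshold of 0.4 is an arbitrary illustrative value between the two medians reported in *Exercise 2.1*, and the toy 3-D vectors and the names `identify` and `enrolled` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def identify(embedding, enrolled, threshold=0.4):
    """Return the best-matching enrolled speaker, or None below the threshold."""
    best_name, best_sim = None, -1.0
    for name, reference in enrolled.items():
        sim = cosine(embedding, reference)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None


# Toy reference embeddings for two hypothetical enrolled speakers
enrolled = {"speaker_a": (1.0, 0.0, 0.0), "speaker_b": (0.0, 1.0, 0.0)}

print(identify((0.9, 0.1, 0.0), enrolled))  # speaker_a
print(identify((0.0, 0.0, 1.0), enrolled))  # None
```

In practice the threshold would need to be tuned on held-out data, since the right value depends on the embedding model and the recording conditions.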
