<a href="https://colab.research.google.com/github/mraskj/css_fall2023/blob/main/code/class11/class11-exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class 11: Speaker Diarization and Recognition - Exercise

In this exercise, we investigate the validity of using pretrained models to conduct speaker diarization and speaker recognition. For the former, we compare acoustic features computed with groundtruth and automated timestamps. For the latter, we investigate how embeddings can be used to discriminate between speakers.  

As we always do in Colab, you should start by cloning the GitHub repo and by constructing an output folder to save files you create in Colab.

In [None]:
# Clone the GitHub repository into the current Colab working directory
!git clone https://github.com/mraskj/css_fall2023.git

In [None]:
import os

# Define an output directory where we save all output created in Colab
output_dir = os.path.join(os.getcwd(), 'output')

# Create the directory if it does not already exist.
os.makedirs(output_dir, exist_ok=True)

## Exercise 1: Audio Measurement and Annotations

In the first exercise, we investigate how sensitive the estimation of acoustic features is to annotation errors. I have provided you with RTTM groundtruth annotations for the same debate snippet that we worked with in the tutorial. The RTTM file can be found at `/content/css_fall2023/data/audio/class11/57-17-groundtruth-withoutspeaker.rttm` (assuming you have cloned the GitHub repo). Note that we work with the version without speaker labels here. You can also work with the version with speaker labels if you want; the results should be similar.

The recording of the debate snippet is found here: `/content/css_fall2023/data/audio/class11/57-71-class.mp4`.

1. Convert the video file to audio

2. Apply diarization using the `pyannote/speaker-diarization@2.1` model from `pyannote.audio`. Describe each step you take (e.g., do you keep all diarized segments or discard some? Do you merge back-to-back segments with the same speaker label?).

3. Write the diarization result to an RTTM file.


4. Compute the DER for three different error margins: 0.0, 0.5, and 1.0. Describe the results. Based on this, do you expect substantial differences in the estimation of acoustic features such as pitch, loudness, or MFCCs when using groundtruth annotations compared to the automated annotations?


5. Split the diarized segments into separate audio files.


6. Split the groundtruth segments into separate audio files.


7. Compute the first 10 MFCCs, pitch, and intensity for each of the segments from steps 5 and 6. For pitch, also compute the standard deviation, minimum, and maximum.

8. Compare the measures for the diarized and groundtruth segments. Note that you must link each diarized segment to a groundtruth segment to be able to do the comparison, so there must be exactly the same number of segments in both conditions. Describe and show the results.
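One possible way to do the linking in step 8 is to pair each groundtruth segment with the diarized segment it overlaps the most in time. The sketch below is only an illustration under stated assumptions: it uses the standard RTTM field order (`SPEAKER`, file, channel, onset, duration, ..., speaker name in the eighth field), and the helper names `Segment`, `parse_rttm`, `overlap`, and `link_segments`, as well as the toy RTTM lines, are made up for this example and do not come from the course repo.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    start: float  # onset in seconds
    end: float    # offset in seconds
    label: str    # speaker label taken from the RTTM line


def parse_rttm(text: str) -> List[Segment]:
    """Parse RTTM text into segments, keeping only SPEAKER lines."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank or non-speaker lines
        onset, duration = float(fields[3]), float(fields[4])
        segments.append(Segment(onset, onset + duration, fields[7]))
    return segments


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the temporal intersection of two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def link_segments(reference: List[Segment],
                  hypothesis: List[Segment]) -> List[Optional[int]]:
    """For each reference segment, return the index of the hypothesis
    segment with the largest overlap, or None if nothing overlaps."""
    links = []
    for ref in reference:
        overlaps = [overlap(ref, hyp) for hyp in hypothesis]
        best = max(range(len(hypothesis)), key=overlaps.__getitem__, default=None)
        links.append(best if best is not None and overlaps[best] > 0 else None)
    return links


# Toy annotations (hypothetical, not the real debate snippet)
groundtruth_rttm = """\
SPEAKER toy 1 0.00 4.00 <NA> <NA> A <NA> <NA>
SPEAKER toy 1 4.50 3.00 <NA> <NA> B <NA> <NA>
"""
diarized_rttm = """\
SPEAKER toy 1 0.20 3.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER toy 1 3.90 0.30 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER toy 1 4.60 3.10 <NA> <NA> SPEAKER_01 <NA> <NA>
"""

reference = parse_rttm(groundtruth_rttm)
hypothesis = parse_rttm(diarized_rttm)
links = link_segments(reference, hypothesis)

for ref, idx in zip(reference, links):
    matched = hypothesis[idx] if idx is not None else None
    print(ref, "->", matched)
```

Diarized segments that are never selected (here the short 0.3-second one) are dropped, which leaves the same number of segments in both conditions. Whether maximum overlap is the right linking rule for your data is a design choice you should justify in your answer.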














## Exercise 2: Similarity of Speaker Embeddings

In the tutorial, we saw how a pretrained embedding model can be used to construct speaker embeddings on a completely different set of audio files without any fine-tuning or adaptation. While we inspected the embeddings visually in the tutorial, here we'll explore their similarity using cosine similarity to test whether we can use them for speaker recognition.

The audio we work with is the diarized segments from *Exercise 1*. Your task is to:

1. Compute the pairwise cosine similarity between embeddings computed using a *sliding* window. You decide on the `duration` and `step` parameters. Compare the average similarity for embeddings from the same speaker with the average for embeddings from different speakers. Plot and describe your results. The plot should be a histogram colored by whether the similarity is computed on embeddings from the same or different speakers. I have provided you with a function below: `plot_histograms`

2. Compute the pairwise cosine similarity between embeddings computed using a *fixed* window (specified with `window="whole"`). Plot and describe your results. The plot should be a similarity matrix (a heatmap). I have provided you with a function below: `plot_similarity_matrix`

3. Discuss based on the results in 1+2 whether pretrained speaker embeddings can be exploited for speaker recognition.

There are a bunch of resources that might help you in the exercise.

- For plotting:
    * https://github.com/resemble-ai/Resemblyzer/blob/master/demo_utils.py
    * https://github.com/resemble-ai/Resemblyzer (see the cross-similarity plot)
- For embeddings:
    * https://huggingface.co/pyannote/embedding
- For cosine similarity:
    * Use the `cosine_similarity()` function from `sklearn.metrics.pairwise`

Note that there are multiple ways to achieve the results, and yours might very well be smarter than mine. Take a look at the solution if you get stuck.
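To make the same-speaker versus different-speaker comparison concrete, here is a minimal sketch on synthetic data. It is an illustration only: the random vectors stand in for real embeddings, and the helper names `pairwise_cosine_similarity` and `split_by_speaker` are made up for this example. The similarity is computed directly with NumPy; `cosine_similarity()` from `sklearn.metrics.pairwise` returns the same matrix.

```python
import numpy as np


def pairwise_cosine_similarity(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of cosine similarities between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def split_by_speaker(similarity: np.ndarray, speakers):
    """Split the pairwise similarities into same-speaker and
    different-speaker values, counting each unordered pair once."""
    speakers = np.asarray(speakers)
    rows, cols = np.triu_indices(len(speakers), k=1)  # upper triangle, no diagonal
    same = speakers[rows] == speakers[cols]
    values = similarity[rows, cols]
    return values[same], values[~same]


# Synthetic stand-in for real embeddings: two "speakers", each a noisy
# copy of its own random direction (assumption for illustration only).
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 16))
speakers = ["A"] * 5 + ["B"] * 5
embeddings = np.repeat(centers, 5, axis=0) + 0.1 * rng.normal(size=(10, 16))

similarity = pairwise_cosine_similarity(embeddings)
same, different = split_by_speaker(similarity, speakers)

print(f"mean same-speaker similarity:      {same.mean():.3f}")
print(f"mean different-speaker similarity: {different.mean():.3f}")
```

With real embeddings, the two arrays can be passed to `plot_histograms([same, different], names=["Same speaker", "Different speakers"])`, and the full matrix to `plot_similarity_matrix(similarity, speakers, speakers)`.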

In [None]:
# Tools for plotting
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.animation import FuncAnimation
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np  # needed for the color array and the plotting helpers

_default_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
_my_colors = np.array([
    [0, 127, 70],
    [255, 0, 0],
    [255, 217, 38],
    [0, 135, 255],
    [165, 0, 165],
    [255, 167, 255],
    [97, 142, 151],
    [0, 255, 255],
    [255, 96, 38],
    [142, 76, 0],
    [33, 0, 127],
    [0, 0, 0],
    [183, 183, 183],
    [76, 255, 0],
], dtype=float) / 255


def plot_histograms(all_samples, names=None, title=""):
    """
    Plots (possibly) overlapping histograms and their median
    """
    # Fall back to unlabeled histograms when no names are given
    if names is None:
        names = [None] * len(all_samples)

    _, ax = plt.subplots()

    for samples, color, name in zip(all_samples, _default_colors, names):
        ax.hist(samples, density=True, color=color, label=name, alpha=0.5)
    # Only draw a legend if at least one histogram has a label
    if any(name is not None for name in names):
        ax.legend(frameon=False, loc='upper right')
    ax.set_xlim(0, 1)
    ax.set_yticks([])
    ax.set_title(title)

    # Freeze the y-limits so the median lines span the full plot height
    ylim = ax.get_ylim()
    ax.set_ylim(*ylim)
    for samples, color in zip(all_samples, _default_colors):
        median = np.median(samples)
        ax.vlines(median, *ylim, colors=color, linestyles="dashed")
        ax.text(median, ylim[1] * 0.15, "median", rotation=270, color=color)

def plot_similarity_matrix(matrix, labels_a=None, labels_b=None, ax: plt.Axes=None, title=""):
    if ax is None:
        _, ax = plt.subplots()
    fig = plt.gcf()

    img = ax.matshow(matrix, extent=(-0.5, matrix.shape[0] - 0.5,
                                     -0.5, matrix.shape[1] - 0.5))

    ax.xaxis.set_ticks_position("bottom")
    if labels_a is not None:
        ax.set_xticks(range(len(labels_a)))
        ax.set_xticklabels(labels_a, rotation=90, size=7)
    if labels_b is not None:
        ax.set_yticks(range(len(labels_b)))
        ax.set_yticklabels(labels_b[::-1], size=7)  # Upper origin -> reverse y axis
    ax.set_title(title)


    cax = make_axes_locatable(ax).append_axes("right", size="5%", pad=0.15)
    fig.colorbar(img, cax=cax, ticks=np.linspace(0.25, 1, 4))
    img.set_clim(0.25, 1)
    img.set_cmap("inferno")