<a href="https://colab.research.google.com/github/mraskj/css_fall2023/blob/main/code/class11/class11-tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class 11: Speech and Speaker Recognition - Tutorial

In the class tutorial, we'll cover how we can use pretrained models to conduct:

- Speaker diarization
- Speaker recognition
- Speech recognition

For this purpose, we'll rely on `pyannote.audio` (https://github.com/pyannote/pyannote-audio, https://huggingface.co/pyannote) and `faster-whisper` (https://github.com/guillaumekln/faster-whisper), which are open-source libraries that achieve state-of-the-art performance. The latter is a faster (as the name suggests) implementation of OpenAI's popular model `Whisper` (https://github.com/openai/whisper).



## 0 Setup

We start by:

1. Cloning the course GitHub repo
2. Importing necessary modules
3. Converting video to audio




### 0.1 Cloning GitHub Repository

In [None]:
# Clone the course GitHub repository into the current working directory
!git clone https://github.com/mraskj/css_fall2023.git

### 0.2 Importing Modules

In [None]:
# For file and directory management
import os

# For shell interaction
import subprocess

# For data handling
import numpy as np
import pandas as pd

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "darkgrid")

### 0.3 Video to audio conversion

For the tutorial and the exercise, we work with a 10-minute snippet of a debate from the UK House of Commons in December 2017. The ID of the debate is 57-17, meaning that it is the 17th debate in the 57th parliamentary session. I have randomly selected a 10-minute snippet from the debate. You can find the video in the `/content/css_fall2023/data/audio/class11` folder; the file is called `57-71-class.mp4`.

When we work with audio data, we need to discard the video stream from the recording. To do that, we use FFMPEG (https://www.ffmpeg.org/), an open-source tool for handling multimedia data.

FFMPEG is called from the terminal. We have already seen that the
exclamation mark '!' can be used to specify that the code should be
interpreted as a shell command. A more Pythonic way of doing it is to use
the `subprocess` module, which lets you run external programs directly
from Python using Python syntax.
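
As a minimal sketch of the `subprocess` approach, the hypothetical helper `run_command` below wraps `subprocess.run`. The helper name and the example command are assumptions for illustration; the example calls the Python interpreter itself rather than FFMPEG so that it runs on any machine.

```python
import subprocess
import sys

def run_command(cmd):
    """Run a command given as a list of strings and return its standard output.

    check=True raises subprocess.CalledProcessError on a non-zero exit code,
    so a failing command is not silently ignored.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout

# Example: call the Python interpreter as an external program
output = run_command([sys.executable, "-c", "print('hello from the shell')"])
print(output.strip())
```

Passing the command as a list (one element per argument) avoids shell quoting problems with file paths that contain spaces.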



In [None]:
# Define an output directory where we save all output created in Colab
output_dir = os.path.join(os.getcwd(), 'output')

# Create the directory if it does not already exist
os.makedirs(output_dir, exist_ok=True)

In [None]:
# We specify the sampling rate and number of channels beforehand. This is not
# strictly necessary, but is a useful approach when you wrap code in functions.
sr = 16000
channels = 1

# Specify paths to the video and audio files
video_fpath = '/content/css_fall2023/data/audio/class11/57-71-class.mp4'
audio_fpath = os.path.join(output_dir, '57-71-class.wav')

# Write the ffmpeg code
cmd = ['ffmpeg',
       '-y',
       '-i',
       video_fpath,
       '-vn',
       '-ar',
       str(sr),
       '-ac',
       str(channels),
       '-acodec',
       'pcm_s16le',
       audio_fpath]

# Execute the code in the shell
subprocess.call(cmd)

# -y specifies that the output file should be overwritten if already existing
# -i specifies the input file
# -vn omits the video stream
# -ar specifies the sampling rate
# -ac specifies the number of channels
# -acodec specifies the type of encoding (in this case pcm_s16le)
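
Defining the sampling rate and channel count as variables pays off once the command is wrapped in a function. The sketch below shows one possible way to do that; the function name `build_audio_cmd` and the example file names are hypothetical. It only builds the argument list, so it runs without FFMPEG installed.

```python
def build_audio_cmd(video_fpath, audio_fpath, sr=16000, channels=1):
    """Return the ffmpeg argument list that extracts a WAV audio track."""
    return ['ffmpeg',
            '-y',                    # overwrite the output file if it exists
            '-i', video_fpath,       # input file
            '-vn',                   # drop the video stream
            '-ar', str(sr),          # sampling rate in Hz
            '-ac', str(channels),    # number of audio channels
            '-acodec', 'pcm_s16le',  # signed 16-bit little-endian PCM
            audio_fpath]

# Hypothetical file names, used only to show the resulting command
cmd = build_audio_cmd('input.mp4', 'output.wav')
print(' '.join(cmd))
```

The returned list can be passed to `subprocess.call` exactly as in the cell above.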

## Speaker Diarization

To access the open-source models from `pyannote.audio`, we need to create an access token and accept the terms and conditions for using the models.

See the steps in the TL;DR in the link: https://github.com/pyannote/pyannote-audio/tree/develop

There are two pipelines available:

- `pyannote/speaker-diarization` (version 2.1)
- `pyannote/speaker-diarization-3.0` (version 3.0)

You need to accept terms for both models if you want to use both. We'll stick to `pyannote/speaker-diarization@2.1` here.

Before we can access the model, we also need to install the `pyannote.audio` package. We do that with:


```
# !pip install pyannote.audio
```



In [None]:
!pip install pyannote.audio

In [None]:
# Import Pipeline from pyannote.audio
from pyannote.audio import Pipeline

In [None]:
# Define access token and name of pipeline
from google.colab import userdata
access_token = userdata.get('huggingface')

pipeline_name = "pyannote/speaker-diarization@2.1"

# Load and initiate pipeline.
pipeline = Pipeline.from_pretrained(pipeline_name,
                                    use_auth_token=access_token)

In [None]:
# Import PyTorch
import torch

# Send pipeline to GPU (if available)
pipeline.to(torch.device("cuda"  if torch.cuda.is_available() else "cpu"))

print(f"Device: {pipeline.device}")

In [None]:
# Apply the diarization pipeline to our audio file
diarization = pipeline(os.path.join(output_dir, '57-71-class.wav'))

In [None]:
# Visualize the output
diarization

In [None]:
# Print the diarization result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

Diarization results are typically evaluated using so-called RTTM files.

RTTM stands for Rich Transcription Time Marked. RTTM files are space-delimited text files containing one turn per line, each line containing ten fields:

- ``Type``  --  segment type; should always be ``SPEAKER``
- ``File ID``  --  file name; basename of the recording minus extension (e.g.,
  ``rec1_a``)
- ``Channel ID``  --  channel (1-indexed) that turn is on; should always be
  ``1``
- ``Turn Onset``  --  onset of turn in seconds from beginning of recording
- ``Turn Duration``  -- duration of turn in seconds
- ``Orthography Field`` --  should always be ``<NA>``
- ``Speaker Type``  --  should always be ``<NA>``
- ``Speaker Name``  --  name of speaker of turn; should be unique within scope
  of each file
- ``Confidence Score``  --  system confidence (probability) that information
  is correct; should always be ``<NA>``
- ``Signal Lookahead Time``  --  should always be ``<NA>``

For instance:

    SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 <NA> <NA> juliet <NA> <NA>
    SPEAKER CMU_20020319-1400_d01_NONE 1 157.610000 3.060 <NA> <NA> tbc <NA> <NA>
    SPEAKER CMU_20020319-1400_d01_NONE 1 130.490000 0.450 <NA> <NA> chek <NA> <NA>
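
The ten-field layout can be made concrete with a small round trip. The helpers `format_rttm_line` and `parse_rttm_line` below are hypothetical names for a minimal sketch; they assume single-channel recordings and leave every optional field as `<NA>`.

```python
def format_rttm_line(file_id, onset, duration, speaker):
    """Build one ten-field RTTM line for a speaker turn."""
    fields = ['SPEAKER', file_id, '1',
              f'{onset:.3f}', f'{duration:.3f}',
              '<NA>', '<NA>', speaker, '<NA>', '<NA>']
    return ' '.join(fields)

def parse_rttm_line(line):
    """Split an RTTM line and return (file_id, onset, duration, speaker)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != 'SPEAKER':
        raise ValueError(f'Not a valid RTTM speaker line: {line!r}')
    return fields[1], float(fields[3]), float(fields[4]), fields[7]

line = format_rttm_line('rec1_a', 130.43, 2.35, 'juliet')
print(line)
print(parse_rttm_line(line))
```

Counting from zero, the onset, duration, and speaker name sit at positions 3, 4, and 7 of the split line.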

In [None]:
# Write output to an RTTM-file by looping through each diarized segment
dia_result_withspeaker = []
dia_result_withoutspeaker = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    start = round(turn.start, 1)
    end = round(turn.end, 1)

    # Keep only segments that are five seconds or longer
    if (end - start) >= 5.0:

      # Join each line as a string
      dia_result_withspeaker.append(' '.join(['SPEAKER ' +
                                '57-17-class' +
                                ' 1',
                                str(start),
                                str(round(end-start, 1)),
                                '<NA> <NA>',
                                speaker,
                                '<NA> <NA>']))

      dia_result_withoutspeaker.append(' '.join(['SPEAKER ' +
                                '57-17-class' +
                                ' 1',
                                str(start),
                                str(round(end-start, 1)),
                                '<NA> <NA>',
                                'SPEAKER-00',
                                '<NA> <NA>']))


print(dia_result_withspeaker[:2])
print(dia_result_withoutspeaker[:2])

# Remove first diarized segment, which corresponds to the chair
dia_result_withspeaker = dia_result_withspeaker[1:]
dia_result_withoutspeaker = dia_result_withoutspeaker[1:]

# Writing
with open(os.path.join(output_dir, '57-17-prediction-withspeaker.rttm'), 'w') as f:
    for d in dia_result_withspeaker:
        f.write('%s\n' % d)
with open(os.path.join(output_dir, '57-17-prediction-withoutspeaker.rttm'), 'w') as f:
    for d in dia_result_withoutspeaker:
        f.write('%s\n' % d)

In [None]:
# Print RTTM outputs
dia_result_withspeaker, dia_result_withoutspeaker

In [None]:
# Load in DER metric and function to load RTTM files
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# We specify a collar of 1.0 seconds. The collar is the total duration removed
# from evaluation around each reference boundary (half before, half after), so
# small timing disagreements are not counted as errors. This takes annotation
# variability into account
metric = DiarizationErrorRate(collar=1.0, skip_overlap=True)

In [None]:
# Load the RTTM-files using the `load_rttm` function from the pyannote module
prediction_withoutspeaker = load_rttm(os.path.join(output_dir, '57-17-prediction-withoutspeaker.rttm'))['57-17-class']
groundtruth_withoutspeaker = load_rttm('/content/css_fall2023/data/audio/class11/57-17-groundtruth-withoutspeaker.rttm')['57-17-class']

# Compute the DER
der_withoutspeaker = metric(groundtruth_withoutspeaker, prediction_withoutspeaker, detailed=True)
der_withoutspeaker

In [None]:
# Load the RTTM-files using the `load_rttm` function from the pyannote module
prediction_withspeaker = load_rttm(os.path.join(output_dir, '57-17-prediction-withspeaker.rttm'))['57-17-class']
groundtruth_withspeaker =  load_rttm('/content/css_fall2023/data/audio/class11/57-17-groundtruth-withspeaker.rttm')['57-17-class']

# Compute the DER
der_withspeaker = metric(groundtruth_withspeaker, prediction_withspeaker, detailed=True)
der_withspeaker

In this case, we get the same DER. This happens because pyannote automatically computes a mapping between predicted and ground-truth speaker labels under the hood. When we deliberately mix up the labels, we see that speaker confusion occurs.

In [None]:
# We reload the modified versions
prediction_withspeaker = load_rttm(os.path.join(output_dir, '57-17-prediction-withspeaker.rttm'))['57-17-class']
groundtruth_withspeaker =  load_rttm('/content/css_fall2023/data/audio/class11/57-17-groundtruth-withspeaker.rttm')['57-17-class']

# Recompute the DER
der_withspeaker = metric(groundtruth_withspeaker, prediction_withspeaker, detailed=True)
der_withspeaker
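
To illustrate why relabelled speakers do not change the DER while genuinely mixed-up speakers do, here is a simplified, frame-based sketch. It is not pyannote's implementation: it ignores collars and overlapping speech, assumes a non-empty reference, and finds the best one-to-one label mapping by brute force, which is only feasible for a handful of speakers. All function names and the toy turns are made up for illustration.

```python
from itertools import permutations

def to_frames(turns, n_frames, step):
    """Assign a speaker label to each frame (None means no speech)."""
    frames = [None] * n_frames
    for start, end, speaker in turns:
        first = round(start / step)
        last = min(round(end / step), n_frames)
        for i in range(first, last):
            frames[i] = speaker
    return frames

def simple_der(reference, hypothesis, total_dur, step=0.1):
    """Frame-based DER using the best one-to-one speaker mapping."""
    n_frames = round(total_dur / step)
    ref = to_frames(reference, n_frames, step)
    hyp = to_frames(hypothesis, n_frames, step)

    missed = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))
    both = [(r, h) for r, h in zip(ref, hyp) if r is not None and h is not None]

    ref_speakers = sorted({r for r in ref if r is not None})
    hyp_speakers = sorted({h for h in hyp if h is not None})
    # Pad with placeholders so surplus hypothesis speakers stay unmatched
    n_extra = max(0, len(hyp_speakers) - len(ref_speakers))
    candidates = ref_speakers + [('unmatched', i) for i in range(n_extra)]

    # Try every one-to-one mapping and keep the one with most correct frames
    best_correct = 0
    for perm in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    total_ref = sum(r is not None for r in ref)
    return (missed + false_alarm + confusion) / total_ref

# Toy turns as (start, end, speaker) in seconds
reference = [(0.0, 5.0, 'A'), (5.0, 10.0, 'B')]
relabelled = [(0.0, 5.0, 'SPEAKER_01'), (5.0, 10.0, 'SPEAKER_00')]
mixed = [(0.0, 5.0, 'SPEAKER_01'), (5.0, 8.0, 'SPEAKER_00'),
         (8.0, 10.0, 'SPEAKER_01')]

der_relabelled = simple_der(reference, relabelled, total_dur=10.0)
der_mixed = simple_der(reference, mixed, total_dur=10.0)
print(der_relabelled, der_mixed)
```

In the relabelled case the names differ but the partition of the audio is identical, so the best mapping removes all confusion. In the mixed case, two seconds of speaker B are attributed to the speaker mapped to A, and no mapping can fix that.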

## Splitting Audio Files

We often encounter situations where we want to split an audio recording into smaller segments, for instance using timestamps that denote the start and end time of each speech. For this, we use FFMPEG again. This time, we add arguments for the start and duration of our desired segment.

In [None]:
# Split diarization output on whitespace
dia_result = [x.split() for x in dia_result_withspeaker]

In [None]:
sr = 16000
channels = 1

segment_dir = os.path.join(output_dir, 'segments')
if not os.path.exists(segment_dir):
  os.mkdir(segment_dir)

# Loop through each segment
for i, d in enumerate(dia_result):

  segment_fpath = os.path.join(segment_dir, f"segment_{i}-{d[7]}.wav")
  cmd = ['ffmpeg',
           '-y',
           '-i',
           audio_fpath,
           '-vn',
           '-ar',
           str(sr),
           '-ac',
           str(channels),
           '-acodec',
           'pcm_s16le',
           '-ss',
           str(d[3]),
           '-t',
           str(d[4]),
           segment_fpath]

  subprocess.call(cmd)

## Speaker Recognition with Speaker Embeddings

The most common way to conduct speaker recognition is through supervised learning. However, speaker embeddings computed on diarized segments can be very effective as well. We'll explore this now.

We use a pretrained embedding model from `pyannote.audio` again. Once again, you need to have access to the model in the same way as for the diarization.

In [None]:
# Define list of segments
segments = os.listdir(segment_dir)

In [None]:
# Split each file name into segment name and speaker label
segment_split = [x.split('.')[0].split('-') for x in segments]

# Construct a two-column pandas dataframe
segment_df = pd.DataFrame(segment_split, columns=['segment', 'speaker_label'])

# Add column denoting the path to each segment
segment_df['segment_fpath'] = [os.path.join(segment_dir, x) for x in segments]

# Add column with the number of each segment
segment_df['segment_number'] = segment_df['segment'].apply(lambda x: int(x.split('_')[1]))

# Sort values by segment number to get temporal order
segment_df = segment_df.sort_values(by='segment_number').reset_index(drop=True)

# Add timestamps to the dataframe
segment_df['start'] = [float(x[3]) for x in dia_result]
segment_df['dur'] = [float(x[4]) for x in dia_result]
segment_df['end'] = segment_df['start'] + segment_df['dur']

speaker_mapping = {'SPEAKER_09': 'Karen Bradley',
 'SPEAKER_06': 'David Hanson',
 'SPEAKER_07': 'John Whittingdale',
 'SPEAKER_01': 'Christine Jardine',
 'SPEAKER_03': 'Tracey Crouch',
 'SPEAKER_10': 'Wes Streeting',
 'SPEAKER_04': 'Amanda Milling',
 'SPEAKER_11': 'Chris Elmore',
 'SPEAKER_05': 'Nusrat Ghani',
 'SPEAKER_00': 'Jim Shannon',
 'SPEAKER_08': 'Tom Watson',
 'SPEAKER_12': 'John Glen',
 'SPEAKER_13': 'Luke Pollard',
 'SPEAKER_02': 'CHAIR'}

gender_mapping = {'SPEAKER_09': 'Woman',
 'SPEAKER_06': 'Man',
 'SPEAKER_07': 'Man',
 'SPEAKER_01': 'Woman',
 'SPEAKER_03': 'Woman',
 'SPEAKER_10': 'Man',
 'SPEAKER_04': 'Woman',
 'SPEAKER_11': 'Man',
 'SPEAKER_05': 'Woman',
 'SPEAKER_00': 'Man',
 'SPEAKER_08': 'Man',
 'SPEAKER_12': 'Man',
 'SPEAKER_13': 'Man',
 'SPEAKER_02': 'Man'}

segment_df['speaker_name'] = segment_df['speaker_label'].apply(lambda x: speaker_mapping[x])
segment_df['speaker_gender'] = segment_df['speaker_label'].apply(lambda x: gender_mapping[x])


# Reorder columns
segment_df = segment_df[['segment', 'speaker_label', 'speaker_name', 'speaker_gender',
                         'segment_number', 'start', 'end', 'dur', 'segment_fpath']]

In [None]:
# Load pretrained embedding model
from pyannote.audio import Model
embedding_model = Model.from_pretrained("pyannote/embedding",
                              use_auth_token=access_token)

In [None]:
# Load Inference class
from pyannote.audio import Inference

# We use a sliding window with a duration of 1.6 seconds with a 0.2s step.
embedding_inference = Inference(embedding_model, window="sliding",
                                duration=1.6, step=0.2)
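
A sliding window means that each segment yields many embeddings rather than one. The sketch below counts the windows that fit entirely inside a segment; `sliding_window_starts` is a hypothetical helper, and `pyannote.audio` may treat the final partial window differently, so the real number of embeddings per segment can differ.

```python
def sliding_window_starts(total_dur, duration=1.6, step=0.2):
    """Return the start times (in seconds) of all windows that fit fully
    inside a recording of length total_dur.

    The arithmetic is done in integer milliseconds to avoid floating-point
    rounding issues.
    """
    total_ms = round(total_dur * 1000)
    dur_ms = round(duration * 1000)
    step_ms = round(step * 1000)
    if total_ms < dur_ms:
        return []
    n_windows = (total_ms - dur_ms) // step_ms + 1
    return [i * step_ms / 1000 for i in range(n_windows)]

# A five-second segment, the shortest length kept after diarization
starts = sliding_window_starts(5.0)
print(len(starts), starts[0], starts[-1])
```

Consecutive windows overlap by 1.4 seconds, which is why neighbouring embeddings from the same segment tend to be very similar.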

In [None]:
# Define list of segment filepaths
segment_fpaths = list(segment_df.segment_fpath)

# Define empty list to store the embeddings
embeddings = []

# Define empty lists to store speaker and gender labels
speaker_labels, speaker_gender = [], []

# Loop through each segment
for ix, segment_fpath in enumerate(segment_fpaths[:]):
  # Compute embedding for each segment and convert to numpy array
  embed = np.array(embedding_inference(segment_fpath))

  # Concatenate with previous embeddings
  if len(embeddings) > 0:
    embeddings = np.concatenate([embeddings, embed])
  else:
    embeddings = np.concatenate([embed,])

  # Generate speaker and gender labels
  speaker_labels += [segment_df.iloc[ix].speaker_name] * embed.shape[0]
  speaker_gender += [segment_df.iloc[ix].speaker_gender] * embed.shape[0]
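
One simple way to use embeddings for recognition is to compare an unknown embedding with one reference embedding per known speaker and pick the most similar. The sketch below uses cosine similarity on toy three-dimensional vectors that stand in for the real embeddings; the function names and the vectors are invented for illustration and this is not part of the `pyannote.audio` API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identify(embedding, reference_embeddings):
    """Return the name whose reference embedding is most similar."""
    return max(reference_embeddings,
               key=lambda name: cosine_similarity(embedding,
                                                  reference_embeddings[name]))

# Toy vectors standing in for real speaker embeddings
references = {'speaker_a': [1.0, 0.0, 0.0],
              'speaker_b': [0.0, 1.0, 0.0]}
unknown = [0.9, 0.1, 0.0]

best_match = identify(unknown, references)
print(best_match)
```

In practice, a reference embedding could be the average of all window embeddings from segments known to belong to one speaker.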

In [None]:
# To see the effectiveness of the embeddings, we reduce the 512-dimensional vectors
# to two dimensions using t-SNE. Like PCA, t-SNE is a dimensionality-reduction
# method, but it is nonlinear and often better for visualization purposes.
from sklearn.manifold import TSNE
random_state = 13
learning_rate = 200
n_iter = 5000
perplexity = 50
n_components = 2
tsne = TSNE(n_components=n_components, verbose=1, perplexity=perplexity, n_iter=n_iter, learning_rate=learning_rate, random_state=random_state)
tsne_results = tsne.fit_transform(embeddings)

In [None]:
# Convert into dataframe
tsne_df = pd.DataFrame(tsne_results, columns=['tsne1', 'tsne2'])

# Add speaker and gender labels
clusters_tsne = pd.concat([tsne_df, pd.DataFrame({'speaker': speaker_labels,
                                                  'gender': speaker_gender})], axis=1)
# Define hue and style for plotting
speaker_hue = list(clusters_tsne.speaker)
gender_style = list(clusters_tsne.gender)

In [None]:
# Plot TSNE-reduced embeddings
plt.figure(figsize = (12,12))

scatterplot = sns.scatterplot(x=clusters_tsne.iloc[:, 0], y=clusters_tsne.iloc[:, 1],
                              hue=speaker_hue, style=gender_style,
                              palette='tab20', s=100, alpha=0.7)

# Get the current axes and legend
ax = plt.gca()
legend = ax.get_legend()

# Create a new legend for the 'hue' (color) using the handles and labels from the original legend
hue_legend = plt.legend(handles=legend.legendHandles[:-2],
                        labels=list(clusters_tsne.speaker.unique()),
                        title='', loc='lower left', frameon=False, ncol=2)

# Create a new legend for the 'style' (marker) using the handles and labels from the original legend
style_legend = plt.legend(handles=legend.legendHandles[-2:],
                          labels=list(clusters_tsne.gender.unique()),
                          title='', loc='upper left',
                          bbox_to_anchor=(0.0, 0.25), frameon=False)

# Add both legends to the plot
ax.add_artist(hue_legend)
ax.add_artist(style_legend)

# plt.legend(loc='best', frameon=False)
plt.xlabel('')
plt.ylabel('')
plt.xticks([])
plt.yticks([])
plt.show()

In [None]:
# Try to do the same with PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=n_components, random_state=random_state)
pca_results = pca.fit_transform(embeddings)

# Convert into dataframe
pca_df = pd.DataFrame(pca_results, columns=['pc1', 'pc2'])

# Add speaker and gender labels
clusters_pca = pd.concat([pca_df, pd.DataFrame({'speaker': speaker_labels,
                                                'gender': speaker_gender})], axis=1)
# Define hue and style for plotting
speaker_hue = list(clusters_pca.speaker)
gender_style = list(clusters_pca.gender)

In [None]:
# Plot PCA-reduced embeddings

plt.figure(figsize = (12,12))

scatterplot = sns.scatterplot(x=clusters_pca.iloc[:, 0], y=clusters_pca.iloc[:, 1],
                              hue=speaker_hue, style=gender_style,
                              palette='tab20', s=100, alpha=0.7)

# Get the current axes and legend
ax = plt.gca()
legend = ax.get_legend()

# Create a new legend for the 'hue' (color) using the handles and labels from the original legend
hue_legend = plt.legend(handles=legend.legendHandles[:-2],
                        labels=list(clusters_pca.speaker.unique()),
                        title='', loc='lower left', frameon=False, ncol=2)

# Create a new legend for the 'style' (marker) using the handles and labels from the original legend
style_legend = plt.legend(handles=legend.legendHandles[-2:],
                          labels=list(clusters_pca.gender.unique()),
                          title='', loc='upper left',
                          bbox_to_anchor=(0.0, 0.25), frameon=False)

# Add both legends to the plot
ax.add_artist(hue_legend)
ax.add_artist(style_legend)

# plt.legend(loc='best', frameon=False)
plt.xlabel('')
plt.ylabel('')
plt.xticks([])
plt.yticks([])
plt.show()

## Automatic Speech Recognition

For automatic speech recognition, we use `faster-whisper` (https://github.com/guillaumekln/faster-whisper).

In [None]:
# Install faster-whisper
!pip install faster-whisper

In [None]:
# Install jiwer to evaluate ASR output
!pip install jiwer

In [None]:
# Import ASR model
from faster_whisper import WhisperModel

In [None]:
# Model specs
model_size = "small.en"      # other options include "medium.en", "small", "medium", and "large-v2" (there is no English-only large model)
language = 'en'
beam_size = 5
word_timestamps = True

# Initiate ASR class
model = WhisperModel(model_size, device="cuda", compute_type="int8")

In [None]:
# Transcribe first segment
segments, info = model.transcribe(segment_fpaths[0],
                                  language=language,
                                  beam_size=beam_size,
                                  word_timestamps=word_timestamps)

In [None]:
# Print each word and its timestamps
for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

In [None]:
# Note that the output is a Python generator. This means that it can ONLY be iterated over once.
# If you want to loop over the segments again, you must re-run the ASR model first.
list(segments)
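
The one-shot behaviour is a property of Python generators in general, not of the ASR model. A minimal illustration with a toy generator (the function `numbers` is made up for this example):

```python
def numbers():
    """A generator that yields three values, one at a time."""
    for i in range(3):
        yield i

gen = numbers()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to yield

print(first_pass)   # [0, 1, 2]
print(second_pass)  # []
```

Storing the result of `list(...)` in a variable, as done in the next cell, keeps the items available for repeated use.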

In [None]:
# Re-run the ASR model
segments, _ = model.transcribe(segment_fpaths[0],
                                  language=language,
                                  beam_size=beam_size,
                                  word_timestamps=word_timestamps)

# Define list of segments - note that this exhausts the `segments` generator
segment_list = list(segments)

# Define empty list to store ASR output
asr_output = []

# Loop over each ASR segment
for segment in segment_list:
    # Loop over each word in each ASR-segment
    for word in segment.words:
        # Make dictionary
        word_dict = {'speech_id': "57-17",
                     'segment_id': segment.id,
                     'start': word.start,
                     'end': word.end,
                     'word': word.word.strip(),
                     'word_prob': round(word.probability, 2)}
        # Save dictionary in list
        asr_output.append(word_dict)

In [None]:
# Construct dataframe
asr_df = pd.DataFrame(asr_output)
asr_df

In [None]:
# We can reconstruct the text:
asr_text = ' '.join(list(asr_df.word))
asr_text

In [None]:
# Read in official transcript
transcript = pd.read_csv('/content/css_fall2023/data/audio/class11/57-17-official_transcript.csv')

In [None]:
# Manual transcription to generate reference text
reference_text = 'Mr. Speaker, we have been clear all along that this a publicly owned broadcaster. Channel 4 must provide for and reflect the country as a whole. \
We are still in discussions with Channel 4 about how it should do this, including relocating staff out of London, and we will set out next steps in due course.'

reference_text

In [None]:
transcript_text = transcript.iloc[1].text
transcript_text

In [None]:
# Remove punctuation and convert to lowercase
import string
asr_text = asr_text.translate(str.maketrans('', '', string.punctuation)).lower()
reference_text = reference_text.translate(str.maketrans('', '', string.punctuation)).lower()
transcript_text = transcript_text.translate(str.maketrans('', '', string.punctuation)).lower()

In [None]:
# Import wer from jiwer
from jiwer import wer as WordErrorRate
print(f"Word Error Rate for reference and ASR text: {round(WordErrorRate(reference=reference_text, hypothesis=asr_text), 3)}")
print(f"Word Error Rate for reference and transcript text: {round(WordErrorRate(reference=reference_text, hypothesis=transcript_text), 3)}")
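
The word error rate is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance recursion. It is a plain-Python illustration under the assumption of whitespace-tokenized, already normalized text with a non-empty reference, and is not `jiwer`'s implementation; `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One deleted word ("out") and one substituted word ("steps" -> "step")
example_wer = word_error_rate('we will set out next steps',
                              'we will set next step')
print(round(example_wer, 3))
```

Because insertions count as errors, the WER can exceed 1 when the hypothesis is much longer than the reference.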