# **Speaker Diarization Testing Using Resemblyzer**

To make the audio useful for further processing, our first idea was to diarize it, separating the parts where the doctor is speaking from the parts where the patient is speaking.

We used an implementation of the Google research paper [*Speaker Diarization with LSTM*](https://arxiv.org/abs/1710.10468).

Using this method, we were able to obtain the speaker labels and their timestamps.

**Why we didn't use this approach**

The DER (Diarization Error Rate) was higher than expected, which could have caused the later steps of the application to fail.
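
For context, DER is usually defined as the share of reference speech time that is scored wrongly: missed speech, false alarms, and speaker confusion, divided by the total reference speech time. The sketch below is a simplified, frame-level version of that idea. It assumes the hypothesis labels have already been mapped onto the reference speaker names, whereas real scoring tools also search for the best mapping and usually apply a tolerance collar around boundaries. The function name `frame_level_der` and the toy labels are illustrative and do not come from our pipeline.

```python
def frame_level_der(reference, hypothesis):
    """Simplified frame-level DER.

    Both arguments are equal-length lists with one speaker label per frame,
    where None marks non-speech. Assumes hypothesis labels are already
    mapped onto the reference speaker names.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None and hyp is None:
            missed += 1          # speech that the system labelled as silence
        elif ref is None and hyp is not None:
            false_alarm += 1     # silence that the system labelled as speech
        elif ref is not None and ref != hyp:
            confusion += 1       # speech attributed to the wrong speaker

    total_speech = sum(1 for ref in reference if ref is not None)
    if total_speech == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_speech


# Toy example: 6 frames, 5 of which contain speech in the reference.
reference = ["doctor", "doctor", "patient", "patient", None, "doctor"]
hypothesis = ["doctor", "patient", "patient", None, "doctor", "doctor"]
der = frame_level_der(reference, hypothesis)
print(f"DER = {der:.2f}")
```

In this toy case one frame is confused, one is missed, and one is a false alarm, so three of the five speech frames count as errors.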

In [None]:
from google.colab import drive
drive.mount('/content/gdrive',force_remount=True)

Mounted at /content/gdrive


In [None]:
cd gdrive/MyDrive/files

/content/gdrive/MyDrive/files


In [None]:
ls

audio_clip.wav  Resemblyzer/


In [None]:
!git clone https://github.com/resemble-ai/Resemblyzer.git

In [None]:
!pip3 install webrtcvad

In [None]:
!pip install resemblyzer

## Step 1

* The `preprocess_wav` function internally uses a VAD (voice activity detector) to trim silences from the audio file and also normalizes its volume.

* The `embed_utterance` method of the `VoiceEncoder` instance takes the preprocessed waveform, splits it into windows, computes a mel spectrogram for each window, and finally produces a d-vector for each of these audio segments.
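
To make the `rate` argument of `embed_utterance` more concrete, here is a rough sketch of how the spacing between consecutive partial windows can be worked out. The constants (16 kHz sampling rate, 10 ms mel hop) are assumptions that we believe match Resemblyzer's defaults, so check `resemblyzer.hparams` before relying on them. `window_step_seconds` is a hypothetical helper, not part of the library.

```python
SAMPLING_RATE = 16000  # assumed sampling rate in Hz
MEL_HOP_MS = 10        # assumed hop between mel frames in milliseconds


def window_step_seconds(rate):
    """Approximate spacing, in seconds, between partial windows.

    `rate` is the requested number of partial embeddings per second.
    The spacing is rounded to a whole number of mel frames.
    """
    samples_per_frame = SAMPLING_RATE * MEL_HOP_MS // 1000
    frame_step = round((SAMPLING_RATE / rate) / samples_per_frame)
    return frame_step * samples_per_frame / SAMPLING_RATE


step = window_step_seconds(16)
print(f"{step:.2f} s between windows")
```

Under these assumptions, `rate=16` gives a step of about 0.06 s, which agrees with the 0.06 s granularity visible in the timestamps printed at the end of this notebook.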

In [None]:
from resemblyzer import preprocess_wav, VoiceEncoder
from pathlib import Path

# Path to your audio file
audio_file_path = 'audio_clip.wav'
wav_fpath = Path(audio_file_path)

# Trim long silences and normalize the volume
wav = preprocess_wav(wav_fpath)
encoder = VoiceEncoder("cpu")
# rate=16 requests 16 partial embeddings per second of audio
_, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)
print(cont_embeds.shape)  # (number of partial windows, embedding size)



Loaded the voice encoder model on cpu in 0.04 seconds.
(4206, 256)


In [None]:
!pip3 install spectralcluster

Collecting spectralcluster
  Downloading spectralcluster-0.2.4-py3-none-any.whl (22 kB)
Installing collected packages: spectralcluster
Successfully installed spectralcluster-0.2.4


## Step 2

* The next step is to cluster the d-vectors so that each cluster corresponds to one speaker.
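
Spectral clustering operates on a pairwise affinity matrix rather than on the raw vectors, and for d-vectors the affinity is typically cosine similarity. The sketch below builds such a matrix for a few toy 3-dimensional vectors (made-up values standing in for the 256-dimensional embeddings). It only illustrates the kind of input the clustering works from and is not a description of the `spectralcluster` internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def affinity_matrix(embeddings):
    """Pairwise cosine similarity between all embeddings."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Hypothetical embeddings: the first two resemble one speaker,
# the last two resemble another.
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
affinity = affinity_matrix(toy_embeddings)
for row in affinity:
    print([round(value, 2) for value in row])
```

Windows from the same speaker show up as blocks of high similarity, and the clustering step groups the windows along those blocks.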



In [None]:
from spectralcluster import SpectralClusterer
from spectralcluster import RefinementOptions

refinement_options = RefinementOptions(
    gaussian_blur_sigma=1,
    p_percentile=0.95,
    thresholding_soft_multiplier=0.01)

clusterer = SpectralClusterer(min_clusters=2,max_clusters=100,refinement_options=refinement_options)

labels = clusterer.predict(cont_embeds)

## Step 3

* We need to merge consecutive windows that share the same speaker label into a single segment.

In [None]:
from resemblyzer.audio import sampling_rate

def create_labelling(labels, wav_splits):
    """Merge consecutive windows with the same label into (speaker, start, end) tuples."""
    # Use the midpoint of each window, in seconds, as its timestamp.
    times = [((s.start + s.stop) / 2) / sampling_rate for s in wav_splits]
    labelling = []
    start_time = 0

    for i, time in enumerate(times):
        # The speaker changed: close the previous segment at this window.
        if i > 0 and labels[i] != labels[i - 1]:
            labelling.append((str(labels[i - 1]), start_time, time))
            start_time = time
        # Last window: close the final segment.
        if i == len(times) - 1:
            labelling.append((str(labels[i]), start_time, time))

    return labelling

labelling = create_labelling(labels, wav_splits)
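
The merging idea can also be tried on toy data without Resemblyzer installed. The sketch below uses `itertools.groupby` and follows the same boundary convention as the cell above: the first segment starts at 0, a segment ends at the timestamp of the first window with a different label, and the last segment ends at the final window. The helper name `merge_runs`, the labels, and the timestamps are made up for illustration.

```python
from itertools import groupby


def merge_runs(labels, times):
    """Collapse runs of identical labels into (speaker, start, end) tuples."""
    segments = []
    start_time = 0
    index = 0
    for label, run in groupby(labels):
        index += len(list(run))
        # A run ends where the next one begins; the last run ends at the final window.
        end_time = times[index] if index < len(times) else times[-1]
        segments.append((str(label), start_time, end_time))
        start_time = end_time
    return segments


# Hypothetical per-window labels and window midpoints in seconds.
labels = [0, 0, 1, 1, 0, 0]
times = [0.8, 0.86, 0.92, 0.98, 1.04, 1.10]
segments = merge_runs(labels, times)
print(segments)
```

Six windows collapse into three segments here, one for each run of identical labels.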

* Finally, we get the speaker labels and timestamps for each segment.

In [None]:
print(labelling)

[('0', 0, 13.7), ('1', 13.7, 20.78), ('0', 20.78, 21.14), ('1', 21.14, 24.98), ('0', 24.98, 28.4), ('1', 28.4, 28.58), ('0', 28.58, 28.76), ('1', 28.76, 32.48), ('0', 32.48, 35.42), ('1', 35.42, 36.26), ('0', 36.26, 36.38), ('1', 36.38, 47.12), ('0', 47.12, 47.72), ('1', 47.72, 47.78), ('0', 47.78, 50.06), ('1', 50.06, 50.66), ('0', 50.66, 50.72), ('1', 50.72, 60.86), ('0', 60.86, 68.3), ('1', 68.3, 91.34), ('0', 91.34, 91.58), ('1', 91.58, 91.7), ('0', 91.7, 91.76), ('1', 91.76, 92.3), ('0', 92.3, 92.6), ('1', 92.6, 106.22), ('0', 106.22, 111.86), ('1', 111.86, 121.16), ('0', 121.16, 121.28), ('1', 121.28, 121.46), ('0', 121.46, 121.52), ('1', 121.52, 130.22), ('0', 130.22, 130.34), ('1', 130.34, 132.2), ('0', 132.2, 132.26), ('1', 132.26, 154.94), ('0', 154.94, 155.3), ('1', 155.3, 161.06), ('0', 161.06, 165.56), ('1', 165.56, 179.12), ('0', 179.12, 179.18), ('1', 179.18, 184.16), ('0', 184.16, 184.34), ('1', 184.34, 190.52), ('0', 190.52, 190.58), ('1', 190.58, 190.64), ('0', 190.64