<a href="https://colab.research.google.com/github/gulabpatel/Speech-to-Text/blob/main/06_1_transcripts_with_speaker_names_whisper_pyannote.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notes on usage:

- Make sure to [change runtime to GPU](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm).
- The transcript will be saved in Files, which you can find in the menu on the left.
- Change the number of speakers below if different from two.
- Pick a bigger model if you want more accuracy and a smaller model if you want the program to run faster ([more info](https://github.com/openai/whisper#available-models-and-languages)).
- If you know the audio is in English, set language to 'English'. For every model size except large, this selects the English-only .en model, which tends to be more accurate.


High-level overview of what's happening here:


1.   I'm using OpenAI's Whisper model to separate the audio into segments and generate transcripts.
2.   I'm then generating a speaker embedding for each segment.
3.   Then I'm using agglomerative clustering on the embeddings to identify the speaker for each segment.
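
To make step 3 concrete, here is a minimal sketch of the clustering idea on synthetic data. The `toy_*` names and the random "voices" are made up for illustration; in the notebook the real embeddings come from the speaker embedding model, not from a random generator.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-ins for per-segment speaker embeddings: two synthetic
# "voices" drawn around different centers in a 192-dimensional space.
rng = np.random.default_rng(0)
voice_a = rng.normal(loc=0.0, scale=0.1, size=(5, 192))
voice_b = rng.normal(loc=1.0, scale=0.1, size=(5, 192))
toy_embeddings = np.vstack([voice_a, voice_b])

# Group the ten segments into two clusters, one per assumed speaker.
toy_labels = AgglomerativeClustering(n_clusters=2).fit(toy_embeddings).labels_

# Cluster ids are arbitrary, so only the grouping itself is meaningful.
toy_speakers = ["SPEAKER " + str(label + 1) for label in toy_labels]
```

The clustering only says which segments sound alike. It does not know who the speakers are, so which voice ends up as SPEAKER 1 is arbitrary.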

Let me know if I can make it better!


In [None]:
# upload audio file
from google.colab import files
uploaded = files.upload()
path = next(iter(uploaded))

Saving 1152023000.mp3 to 1152023000.mp3


In [None]:
num_speakers = 2 #@param {type:"integer"}

language = 'English' #@param ['any', 'English']

model_size = 'large' #@param ['tiny', 'base', 'small', 'medium', 'large']


model_name = model_size
if language == 'English' and model_size != 'large':
  model_name += '.en'


In [None]:
!pip install -q git+https://github.com/openai/whisper.git > /dev/null
!pip install -q git+https://github.com/pyannote/pyannote-audio > /dev/null

import whisper
import datetime

import subprocess

import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
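# Prefer the GPU, but fall back to the CPU if no CUDA device is available
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))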

from pyannote.audio import Audio
from pyannote.core import Segment

import wave
import contextlib

from sklearn.cluster import AgglomerativeClustering
import numpy as np


In [None]:
# Convert anything that is not already a WAV file (case-insensitive check)
if not path.lower().endswith('.wav'):
  subprocess.call(['ffmpeg', '-i', path, 'audio.wav', '-y'])
  path = 'audio.wav'
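
A slightly more defensive version of the conversion step could look like the sketch below. The helper names (`needs_conversion`, `build_ffmpeg_command`, `to_wav`) are hypothetical, and downmixing to mono at 16 kHz is my assumption rather than something the notebook does: Whisper resamples to 16 kHz mono internally anyway, and speaker embedding models typically expect a single channel.

```python
import os
import subprocess

def needs_conversion(path):
    """Return True unless the file name ends in .wav (case-insensitive)."""
    return os.path.splitext(path)[1].lower() != ".wav"

def build_ffmpeg_command(src, dst="audio.wav"):
    """Build an ffmpeg call that writes mono 16 kHz WAV, overwriting dst."""
    # -y overwrites dst, -ac 1 downmixes to one channel, -ar sets the sample rate.
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]

def to_wav(path, dst="audio.wav"):
    """Convert path to WAV if needed and return the path to use afterwards."""
    if not needs_conversion(path):
        return path
    # check=True raises CalledProcessError if ffmpeg exits with an error,
    # instead of silently continuing with a missing or broken audio.wav.
    subprocess.run(build_ffmpeg_command(path, dst), check=True)
    return dst
```

With this in place, the cell would reduce to `path = to_wav(path)`. Note that, unlike the cell above, an uploaded stereo WAV is still passed through unchanged.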

In [None]:
# Use model_name so the English-only '.en' variant chosen above is actually loaded
model = whisper.load_model(model_name)

100%|█████████████████████████████████████| 2.88G/2.88G [00:44<00:00, 69.5MiB/s]


In [None]:
result = model.transcribe(path)
# result

{'text': " Hello? Hi, good evening. Chris from Scuderia Car Parts. Am I speaking with Abdul? Yes. Hi, I'm just calling because you were checking some Maserati parts on our website, the alternator belt and the mounts. Are you looking to order these? Do you need any help? Yeah, I want to discount these for my order. You want a discount? Well, I can do £13 discount, which would bring it to a total of £225.22 delivered to you. Yeah, yes, but I think it's not enough discount. It's £13. This is the best that I can offer. Because I already... I already... Because I already ordered all times... More times from you, but it's £13. Okay, 10%. I think it's good. Yeah, unfortunately, 10% wouldn't leave us enough to continue doing business. Because you give me... You give my... My language is not good, but you give my friend 15%, 20%. It's not £13 or... It's not £13. It's £15. It's £15. It's £15. Yeah, I mean... Because this fourth time or five times I order... You've put two orders in with us befor

In [None]:
segments = result["segments"]
segments

[{'id': 0,
  'seek': 0,
  'start': 0.0,
  'end': 7.0,
  'text': ' Hello?',
  'tokens': [50365, 2425, 30, 50715],
  'temperature': 0.0,
  'avg_logprob': -0.41574005582439366,
  'compression_ratio': 1.2893081761006289,
  'no_speech_prob': 0.40002283453941345},
 {'id': 1,
  'seek': 0,
  'start': 7.0,
  'end': 16.32,
  'text': ' Hi, good evening.',
  'tokens': [50715, 2421, 11, 665, 5634, 13, 51181],
  'temperature': 0.0,
  'avg_logprob': -0.41574005582439366,
  'compression_ratio': 1.2893081761006289,
  'no_speech_prob': 0.40002283453941345},
 {'id': 2,
  'seek': 0,
  'start': 16.32,
  'end': 18.400000000000002,
  'text': ' Chris from Scuderia Car Parts.',
  'tokens': [51181, 6688, 490, 2747, 28230, 654, 2741, 4100, 82, 13, 51285],
  'temperature': 0.0,
  'avg_logprob': -0.41574005582439366,
  'compression_ratio': 1.2893081761006289,
  'no_speech_prob': 0.40002283453941345},
 {'id': 3,
  'seek': 0,
  'start': 18.400000000000002,
  'end': 20.400000000000002,
  'text': ' Am I speaking with 

In [None]:
with contextlib.closing(wave.open(path,'r')) as f:
  frames = f.getnframes()
  rate = f.getframerate()
  duration = frames / float(rate)
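
The duration is just the number of frames divided by the frame rate. Here is a small self-contained check of that arithmetic, using a temporary file of silence. The `wav_duration` helper and the `demo_*` names are hypothetical and not part of the notebook.

```python
import contextlib
import os
import tempfile
import wave

def wav_duration(path):
    """Return the length of a WAV file in seconds (frames / frame rate)."""
    with contextlib.closing(wave.open(path, "rb")) as f:
        return f.getnframes() / float(f.getframerate())

# Write two seconds of 16 kHz, 16-bit mono silence to a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with contextlib.closing(wave.open(demo_path, "wb")) as f:
    f.setnchannels(1)      # mono
    f.setsampwidth(2)      # 2 bytes per sample = 16-bit audio
    f.setframerate(16000)  # 16,000 frames per second
    f.writeframes(b"\x00\x00" * 16000 * 2)  # 32,000 frames of silence

demo_duration = wav_duration(demo_path)  # 32000 / 16000 = 2.0 seconds
```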

In [None]:
audio = Audio()

def segment_embedding(segment):
  start = segment["start"]
  # Whisper overshoots the end timestamp in the last segment
  end = min(duration, segment["end"])
  clip = Segment(start, end)
  waveform, sample_rate = audio.crop(path, clip)
  return embedding_model(waveform[None])

In [None]:
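# One row per segment; the ECAPA embedding model outputs 192-dimensional vectors
embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
  embeddings[i] = segment_embedding(segment)

# Replace any NaN values with zeros, since the clustering step rejects NaN input
embeddings = np.nan_to_num(embeddings)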

In [None]:
clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
labels = clustering.labels_
for i in range(len(segments)):
  segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)

In [None]:
def time(secs):
  return datetime.timedelta(seconds=round(secs))

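# Write the transcript, starting a new timestamped block whenever the speaker changes
with open("transcript.txt", "w") as f:
  for (i, segment) in enumerate(segments):
    if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
      f.write("\n" + segment["speaker"] + ' ' + str(time(segment["start"])) + '\n')
    # [1:] drops the leading space Whisper puts at the start of each segment's text
    f.write(segment["text"][1:] + ' ')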
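
To see what the transcript layout looks like without running the whole pipeline, the same grouping logic can be written as a function that returns a string. This is only a sketch: `format_timestamp` and `format_transcript` are hypothetical names, and the speaker labels in `demo_segments` are made up for illustration (the text and start times are the first three segments shown earlier).

```python
import datetime

def format_timestamp(secs):
    """Render seconds as H:MM:SS, rounded to the nearest second."""
    return str(datetime.timedelta(seconds=round(secs)))

def format_transcript(segments):
    """Group consecutive segments by speaker under a timestamped header."""
    parts = []
    previous_speaker = None
    for segment in segments:
        # Start a new block only when the speaker changes.
        if segment["speaker"] != previous_speaker:
            header = segment["speaker"] + " " + format_timestamp(segment["start"])
            parts.append("\n" + header + "\n")
            previous_speaker = segment["speaker"]
        parts.append(segment["text"].strip() + " ")
    return "".join(parts)

# Speaker labels here are invented for the example, not clustering output.
demo_segments = [
    {"start": 0.0, "speaker": "SPEAKER 1", "text": " Hello?"},
    {"start": 7.0, "speaker": "SPEAKER 2", "text": " Hi, good evening."},
    {"start": 16.32, "speaker": "SPEAKER 2", "text": " Chris from Scuderia Car Parts."},
]
demo_text = format_transcript(demo_segments)
```

Printing `demo_text` gives one header line per speaker turn, with the second and third segments merged under a single SPEAKER 2 header at 0:00:07.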