# Whisper's Transcription plus Pyannote's Diarization

Andrej Karpathy [suggested](https://twitter.com/karpathy/status/1574476200801538048?s=20&t=s5IMMXOYjBI6-91dib6w8g) training a classifier on top of OpenAI [Whisper](https://openai.com/blog/whisper/) model features to identify the speaker, so we can visualize the speaker in the transcript. But, as [pointed out](https://twitter.com/tarantulae/status/1574493613362388992?s=20&t=s5IMMXOYjBI6-91dib6w8g) by Christian Perone, features from Whisper are unlikely to work well for speaker recognition, since the model's main objective is basically to ignore speaker differences.

In the following, I use [**`pyannote-audio`**](https://github.com/pyannote/pyannote-audio), a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match them with the transcriptions from Whisper. I do it on the first 30 minutes of Lex's 2nd [interview](https://youtu.be/SGzMElJ11Cc) with Yann LeCun. Check the result [**here**](https://majdoddin.github.io/lexicap.html).

It is tricky to match the transcriptions to the diarization segments, especially when the speaker changes. To resolve this, Sarah Kaiser [suggested](https://github.com/openai/whisper/discussions/264#discussioncomment-3825375) running pyannote.audio first and then running Whisper on the split-by-speaker chunks.
For the sake of performance (and transcription quality?), we join the audio segments into a single audio file, with a silent spacer as a separator, and run Whisper on it. Enjoy it!

# Preparing the audio file

Installing `yt-dlp` and downloading the [video](https://youtu.be/SGzMElJ11Cc).

In [1]:
# !pip install -U yt-dlp

In [2]:
# !wget -O - -q  https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc| tar -x

In [3]:
# !yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav  -o lecun.wav -- https://youtu.be/SGzMElJ11Cc


Cutting the first 20 minutes of the audio for further processing. (The cell below is currently set to 2 minutes; the 20-minute value is commented out.)


In [4]:
# !pip install pydub

In [5]:
from pydub import AudioSegment

t1 = 0 * 1000  # pydub works in milliseconds
# t2 = 20 * 60 * 1000  # 20 min
t2 = 2 * 60 * 1000  # 2 min

newAudio = AudioSegment.from_wav("lecun.wav")
a = newAudio[t1:t2]
a.export("lecun1.wav", format="wav") 


<_io.BufferedRandom name='lecun1.wav'>

`pyannote.audio` seems to miss the first 0.5 seconds of the audio; therefore, we prepend a silent spacer (2 seconds in the cell below).

In [6]:
audio = AudioSegment.from_wav("lecun1.wav")
spacermilli = 2000
spacer = AudioSegment.silent(duration=spacermilli)
audio = spacer.append(audio, crossfade=0)

audio.export('audio.wav', format='wav')

<_io.BufferedRandom name='audio.wav'>

# Pyannote's Diarization

[`pyannote.audio`](https://github.com/pyannote/pyannote-audio) is an open-source toolkit written in Python for **speaker diarization**. 

Based on [`PyTorch`](https://pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. 

`pyannote.audio` also comes with pretrained [models](https://huggingface.co/models?other=pyannote-audio-model) and [pipelines](https://huggingface.co/models?other=pyannote-audio-pipeline) covering a wide range of domains for voice activity detection, speaker segmentation, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.

Installing Pyannote and running it on the video to generate the diarizations.

In [7]:
# !pip install   pyannote.audio

In [8]:

import os; os.chdir('..'); print(os.getcwd())
from keys import hf_token

/home/ubuntu/Text-Transcription


In [9]:
# print(hf_token)

In [10]:
# # instantiate the pipeline
# from pyannote.audio import Pipeline
# pipeline = Pipeline.from_pretrained(
#   "pyannote/speaker-diarization-3.1",
#   use_auth_token=hf_token)


# import torch
# pipeline.to(torch.device("cuda"))

# # run the pipeline on an audio file
# diarization = pipeline("research/audio.wav")

# # dump the diarization output to disk using RTTM format
# with open("research/audio.rttm", "w") as rttm:
#     diarization.write_rttm(rttm)


In [11]:
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization', use_auth_token=hf_token)

import torch
pipeline.to(torch.device("cuda"))

  from .autonotebook import tqdm as notebook_tqdm
  torchaudio.set_audio_backend("soundfile")
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.2+cu121. Bad things might happen unless you revert torch to 1.x.


<pyannote.audio.pipelines.speaker_diarization.SpeakerDiarization at 0x7facf292bdc0>

In [12]:
DEMO_FILE = {'uri': 'blabal', 'audio': 'research/audio.wav'}
dz = pipeline(DEMO_FILE)  

with open("research/diarization.txt", "w") as text_file:
    text_file.write(str(dz))

In [13]:
print(*list(dz.itertracks(yield_label = True))[:10], sep="\n")

(<Segment(2.02218, 36.8686)>, 'A', 'SPEAKER_01')
(<Segment(38.1143, 51.4761)>, 'B', 'SPEAKER_01')
(<Segment(51.8686, 90.401)>, 'C', 'SPEAKER_00')
(<Segment(91.3055, 92.9949)>, 'D', 'SPEAKER_00')
(<Segment(94.8891, 116.493)>, 'E', 'SPEAKER_01')
(<Segment(116.493, 122.005)>, 'F', 'SPEAKER_00')


In [14]:
def millisec(timeStr):
  # Convert an "H:MM:SS.mmm" timestamp string to integer milliseconds
  spl = timeStr.split(":")
  s = int((int(spl[0]) * 60 * 60 + int(spl[1]) * 60 + float(spl[2])) * 1000)
  return s
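
The `millisec` helper above multiplies a float by 1000 and truncates with `int`, which in principle can lose a millisecond to binary floating-point error. As a hedged alternative, the following sketch does the same conversion with integer arithmetic only. The name `to_millis` is hypothetical and not part of the original notebook, and it assumes timestamps of the form `H:MM:SS.mmm`.

```python
def to_millis(stamp):
    """Convert an 'H:MM:SS.mmm' string to integer milliseconds without floats."""
    hours, minutes, seconds = stamp.split(":")
    whole, _, frac = seconds.partition(".")
    # Pad or cut the fractional part to exactly three digits (milliseconds)
    millis = int((frac + "000")[:3])
    return ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 1000 + millis


print(to_millis("00:00:02.021"))  # 2021
print(to_millis("01:02:03.5"))    # 3723500
```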

In [15]:
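import re

# Each line of diarization.txt holds a start and an end timestamp plus the speaker label
with open('research/diarization.txt') as f:
  dz = f.read().splitlines()

dzList = []
for l in dz:
  start, end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=l)
  # Subtract the prepended spacer to get times relative to the original audio
  start = millisec(start) - spacermilli
  end = millisec(end) - spacermilli
  # In this run SPEAKER_01 is Lex, so every other label is treated as Yann
  is_yann = 'SPEAKER_01' not in l
  dzList.append([start, end, is_yann])

print(*dzList[:10], sep='\n')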

[21, 34868, False]
[36114, 49476, False]
[49868, 88401, True]
[89305, 90994, True]
[92889, 114493, False]
[114493, 120005, True]
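
The same list could probably be built without the round trip through a text file and a regular expression, by reading start and end times straight from the `(segment, track, label)` tuples that `itertracks(yield_label=True)` yields. The sketch below is only an illustration under assumptions: `FakeSegment` is a hypothetical stand-in for pyannote's segment object (assumed to expose `.start` and `.end` in seconds), and `tracks_to_dzlist` is not part of the notebook. Because it rounds instead of truncating, values can differ from the output above by one millisecond.

```python
from collections import namedtuple

# Hypothetical stand-in for pyannote's segment type, so the sketch runs without pyannote
FakeSegment = namedtuple("FakeSegment", ["start", "end"])


def tracks_to_dzlist(tracks, spacer_ms, lex_label):
    """Build [start_ms, end_ms, flag] rows; flag is True when the speaker is not lex_label."""
    rows = []
    for segment, _track, label in tracks:
        # Convert seconds to milliseconds and remove the prepended spacer
        start = round(segment.start * 1000) - spacer_ms
        end = round(segment.end * 1000) - spacer_ms
        rows.append([start, end, label != lex_label])
    return rows


# First three tracks from the diarization output printed earlier
tracks = [
    (FakeSegment(2.02218, 36.8686), "A", "SPEAKER_01"),
    (FakeSegment(38.1143, 51.4761), "B", "SPEAKER_01"),
    (FakeSegment(51.8686, 90.401), "C", "SPEAKER_00"),
]
rows = tracks_to_dzlist(tracks, spacer_ms=2000, lex_label="SPEAKER_01")
print(*rows, sep="\n")
```

With the real pipeline result, the call would presumably be `tracks_to_dzlist(dz.itertracks(yield_label=True), spacermilli, "SPEAKER_01")`, run before `dz` is reassigned to the lines of the text file.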


# Preparing audio file from the diarization

Joining the audio segments according to the diarization, with a spacer as the delimiter.

In [16]:
from pydub import AudioSegment
import re

sounds = spacer
segments = []  # offset (ms) of each diarization segment inside the concatenated audio

with open('research/diarization.txt') as f:
  dz = f.read().splitlines()

for l in dz:
  start, end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=l)
  # No spacer correction here: `audio` itself starts with the prepended spacer
  start = millisec(start)  # milliseconds
  end = millisec(end)  # milliseconds

  segments.append(len(sounds))
  sounds = sounds.append(audio[start:end], crossfade=0)
  sounds = sounds.append(spacer, crossfade=0)

sounds.export("research/dz.wav", format="wav")  # writes the concatenated audio to research/dz.wav

<_io.BufferedRandom name='research/dz.wav'>

In [17]:
segments[:8]

[2000, 38847, 54209, 94742, 98431, 122034]
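
The offsets in `segments` follow a simple rule: the concatenated audio starts with one spacer, and every segment adds its own length plus one more spacer. The sketch below reproduces that bookkeeping with plain integers, using the first three diarization segments from this notebook (in milliseconds, including the prepended spacer). The function name `concat_offsets` is hypothetical. Real pydub lengths can differ by a millisecond because of frame rounding.

```python
def concat_offsets(segments_ms, spacer_ms):
    """Return the start offset of each segment inside the spacer-separated audio."""
    offsets = []
    position = spacer_ms  # the concatenated audio begins with one spacer
    for start, end in segments_ms:
        offsets.append(position)
        # Advance past this segment and the spacer that follows it
        position += (end - start) + spacer_ms
    return offsets


diarized = [(2021, 36868), (38114, 51476), (51868, 90401)]
offsets = concat_offsets(diarized, spacer_ms=2000)
print(offsets)  # [2000, 38847, 54209]
```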

Freeing up some memory

In [18]:
del sounds, DEMO_FILE, pipeline, spacer, audio, dz, a, newAudio

# Whisper's Transcriptions

Installing OpenAI Whisper.

**Important:** There is a version conflict with pyannote.audio that results in an error (see this PR). Our workaround is to run Pyannote first and then Whisper. You can safely ignore the error.


In [19]:
# !pip install git+https://github.com/openai/whisper.git 

Running OpenAI Whisper on the prepared audio file. It writes the transcription into a file.

In [20]:
!whisper research/dz.wav --language en --model large

[00:00.000 --> 00:06.200]  The following is a conversation with Jan LeCun, his second time on the podcast.
[00:06.200 --> 00:14.280]  He is the chief AI scientist at Meta, formerly Facebook, professor at NYU, touring award
[00:14.280 --> 00:19.220]  winner, one of the seminal figures in the history of machine learning and artificial
[00:19.220 --> 00:25.700]  intelligence, and someone who is brilliant and opinionated in the best kind of way, and
[00:25.700 --> 00:27.060]  so is always fun to talk to.
[00:27.060 --> 00:29.720]  So, this is a Lex Friedman podcast.
[00:29.900 --> 00:32.520]  To support it, please check out our sponsors in the description.
[00:33.220 --> 00:36.740]  And now, here's my conversation with Jan LeCun.
[00:38.460 --> 00:43.240]  You co-wrote the article, Self-Supervised Learning, The Dark Matter of Intelligence.
[00:43.620 --> 00:45.800]  Great title, by the way, with Ishan Mizra.
[00:46.260 --> 00:51.840]  So let me ask, what is self-supervised learning, and wh

Reading the transcription file.

In [21]:
# !pip install -U webvtt-py

In [22]:
!ls

__pycache__  dummy_outputs  dz.tsv  keys.py	      research
app.py	     dz.json	    dz.txt  lexicap.html      venv
audio_inps   dz.srt	    dz.vtt  requirements.txt


In [24]:
%pwd

'/home/ubuntu'

In [25]:
import os; os.chdir('/home/ubuntu/Text-Transcription')
%pwd

'/home/ubuntu/Text-Transcription'

In [26]:
import webvtt

# Each caption becomes [start_ms, end_ms, text]; millisec already returns an int
# captions = [[millisec(caption.start), millisec(caption.end), caption.text] for caption in webvtt.read('dz.wav.vtt')]
captions = [[millisec(caption.start), millisec(caption.end), caption.text] for caption in webvtt.read('dz.vtt')]
print(*captions[:8], sep='\n')

[0, 6200, 'The following is a conversation with Jan LeCun, his second time on the podcast.']
[6200, 14280, 'He is the chief AI scientist at Meta, formerly Facebook, professor at NYU, touring award']
[14280, 19220, 'winner, one of the seminal figures in the history of machine learning and artificial']
[19220, 25700, 'intelligence, and someone who is brilliant and opinionated in the best kind of way, and']
[25700, 27060, 'so is always fun to talk to.']
[27060, 29720, 'So, this is a Lex Friedman podcast.']
[29900, 32520, 'To support it, please check out our sponsors in the description.']
[33220, 36740, "And now, here's my conversation with Jan LeCun."]
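
If `webvtt-py` is not available, a caption list of the same shape can be approximated with the standard library alone. This is a minimal sketch and not a full WebVTT parser: it assumes plain cues with a `start --> end` line followed by the text, accepts timestamps with or without an hours field, and ignores cue settings and styling. The names `vtt_time_to_ms` and `parse_vtt` are hypothetical.

```python
import re

# Matches "MM:SS.mmm --> MM:SS.mmm" and "H:MM:SS.mmm --> H:MM:SS.mmm"
CUE_RE = re.compile(
    r"((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def vtt_time_to_ms(stamp):
    """Convert a VTT timestamp to integer milliseconds."""
    parts = stamp.split(":")
    if len(parts) == 2:  # no hours field
        parts = ["0"] + parts
    whole, _, frac = parts[2].partition(".")
    total_seconds = (int(parts[0]) * 60 + int(parts[1])) * 60 + int(whole)
    return total_seconds * 1000 + int((frac + "000")[:3])


def parse_vtt(text):
    """Return [start_ms, end_ms, text] for every cue found in a VTT document."""
    captions = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for n, line in enumerate(lines):
            match = CUE_RE.search(line)
            if match:
                # Everything after the timing line is the cue text
                cue_text = " ".join(part.strip() for part in lines[n + 1:])
                captions.append(
                    [vtt_time_to_ms(match.group(1)), vtt_time_to_ms(match.group(2)), cue_text]
                )
                break
    return captions


sample = (
    "WEBVTT\n\n"
    "00:00.000 --> 00:06.200\nThe following is a conversation\n\n"
    "00:06.200 --> 00:14.280\nHe is the chief AI scientist\n"
)
parsed = parse_vtt(sample)
print(*parsed, sep="\n")
```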


# Matching the Transcriptions and the Diarizations

Matching each transcription line to a diarization segment, and generating the HTML file. To get the correct timing, we must account for the parts of the original audio that fall in no diarization segment.
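
The timing correction amounts to one subtraction and one addition: take the caption's position relative to the start of its segment in the concatenated audio, then add the segment's start in the original audio. The sketch below works through that arithmetic with numbers from this notebook: the caption starting at 38460 ms, the second entry of `segments` (38847), and the second entry of `dzList` (36114). The function names are hypothetical, and negative results are clamped to zero as in the cell that follows.

```python
def to_original_ms(caption_start_ms, segment_offset_ms, segment_original_start_ms):
    """Map a caption start in the concatenated audio back to the original audio."""
    shifted = segment_original_start_ms + (caption_start_ms - segment_offset_ms)
    return max(shifted, 0)


def format_timestamp(ms):
    """Format integer milliseconds as HH:MM:SS.ss."""
    hours = ms // 3_600_000
    minutes = ms % 3_600_000 // 60_000
    seconds = ms % 60_000 / 1000
    return f"{hours:02d}:{minutes:02d}:{seconds:05.2f}"


original = to_original_ms(38460, 38847, 36114)
print(original, format_timestamp(original))  # 35727 00:00:35.73

# The very first caption starts inside the leading spacer, so it is clamped to 0
print(to_original_ms(0, 2000, 21))  # 0
```

The first result lands slightly before the segment's own start (36114 ms) because Whisper placed the caption inside the spacer that precedes the segment.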

In [27]:
preS = '<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta http-equiv="X-UA-Compatible" content="ie=edge">\n    <title>Lexicap</title>\n    <style>\n        body {\n            font-family: sans-serif;\n            font-size: 18px;\n            color: #111;\n            padding: 0 0 1em 0;\n        }\n        .l {\n          color: #050;\n        }\n        .s {\n            display: inline-block;\n        }\n        .e {\n            display: inline-block;\n        }\n        .t {\n            display: inline-block;\n        }\n        #player {\n\t\tposition: sticky;\n\t\ttop: 20px;\n\t\tfloat: right;\n\t}\n    </style>\n  </head>\n  <body>\n    <h2>Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning | Lex Fridman Podcast #258</h2>\n  <div  id="player"></div>\n    <script>\n      var tag = document.createElement(\'script\');\n      tag.src = "https://www.youtube.com/iframe_api";\n      var firstScriptTag = document.getElementsByTagName(\'script\')[0];\n      firstScriptTag.parentNode.insertBefore(tag, firstScriptTag);\n      var player;\n      function onYouTubeIframeAPIReady() {\n        player = new YT.Player(\'player\', {\n          height: \'210\',\n          width: \'340\',\n          videoId: \'SGzMElJ11Cc\',\n        });\n      }\n      function setCurrentTime(timepoint) {\n        player.seekTo(timepoint);\n   player.playVideo();\n   }\n    </script><br>\n'
postS = '\t</body>\n</html>'

In [28]:
html = [preS]

for i in range(len(segments)):
  # Find the first caption that starts inside segment i (or in the spacer just before it)
  idx = 0
  for idx in range(len(captions)):
    if captions[idx][0] >= (segments[i] - spacermilli):
      break
  else:
    # No caption starts in or after this segment, so nothing is left to emit
    idx = len(captions)

  # Consume captions until one ends past the start of the next segment
  while (idx < len(captions)) and ((i == len(segments) - 1) or (captions[idx][1] < segments[i+1])):
    c = captions[idx]

    # Map the caption start from the concatenated audio back to the original audio
    start = dzList[i][0] + (c[0] - segments[i])

    if start < 0:
      start = 0
    idx += 1

    start = start / 1000.0
    # Zero-padded HH:MM:SS.ss
    startStr = '{0:02d}:{1:02d}:{2:05.2f}'.format(int(start // 3600),
                                            int(start % 3600 // 60),
                                            start % 60)

    html.append('\t\t\t<div class="c">\n')
    html.append(f'\t\t\t\t<a class="l" href="#{startStr}" id="{startStr}">link</a> |\n')
    html.append(f'\t\t\t\t<div class="s"><a href="javascript:void(0);" onclick=setCurrentTime({int(start)})>{startStr}</a></div>\n')
    html.append(f'\t\t\t\t<div class="t">{"[Yann]" if dzList[i][2] else "[Lex]"} {c[2]}</div>\n')
    html.append('\t\t\t</div>\n\n')

html.append(postS)
s = "".join(html)

with open("lexicap.html", "w") as text_file:
    text_file.write(s)
print(s)

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Lexicap</title>
    <style>
        body {
            font-family: sans-serif;
            font-size: 18px;
            color: #111;
            padding: 0 0 1em 0;
        }
        .l {
          color: #050;
        }
        .s {
            display: inline-block;
        }
        .e {
            display: inline-block;
        }
        .t {
            display: inline-block;
        }
        #player {
		position: sticky;
		top: 20px;
		float: right;
	}
    </style>
  </head>
  <body>
    <h2>Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning | Lex Fridman Podcast #258</h2>
  <div  id="player"></div>
    <script>
      var tag = document.createElement('script');
      tag.src = "https://www.youtube.com/iframe_api";
      var firstScriptTag = document.