# Multi-Language Speech Recognition and Speaker Diarisation

This demo allows you to recognize speech in 99 different languages, identify speakers, and translate the text into a selected language. The pipeline is made up of three libraries: 
* [denoiser](https://github.com/facebookresearch/denoiser) to remove extraneous noise from the audio,
* [pyannote](https://github.com/pyannote/pyannote-audio) for speaker diarisation, and
* [whisper](https://github.com/openai/whisper), the main component, which recognizes speech and can also translate it into one of 99 languages.

Note that the ability to translate into any language was discovered by accident while experimenting with the model; the official repository only states that it can translate from any supported language into English. Recognition and translation quality may vary by language.

## Setup
This will install the necessary libraries and download the models. It may take a while, so please be patient.

### Install dependencies
The dependencies must be installed in the order shown below.

In [None]:
%%capture

!pip install -U yt-dlp
!pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
!pip install pyannote.audio==2.1.1 denoiser==0.1.5 moviepy==1.0.3 pydub==0.25.1 git+https://github.com/openai/whisper.git@v20230124
!pip install omegaconf==2.3.0 pytorch-lightning==1.8.4

### Start web server

The web server serves the HTML player, the video file, and the subtitles.

In [None]:
!npm install http-server -g

import subprocess
subprocess.Popen(['http-server', '-p', '8000']);
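`subprocess.Popen` returns immediately, so the server may not be accepting connections yet when the next cell runs. The following is a minimal sketch of a readiness check; `wait_for_port` is a hypothetical helper that is not part of the original notebook, and it only confirms that something is listening on the port, not that it is `http-server`.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval)
    return False
```

For example, `wait_for_port('localhost', 8000)` could be called right after the `subprocess.Popen` line above.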

### HTML player template

In [None]:
from string import Template
player_template = Template("""
<!doctype html>
<html>
<head>
<title>Speakers</title>
<!-- <meta name="description" content="Our first page">
<meta name="keywords" content="html tutorial template"> -->
<style>
    html, body, .video-container {
        height: 100%;
    }

    .video-container {
        background-color: #000;
        display: flex;
        flex-direction: column;
    }

    video {
        max-height: 100%;
    }

    .speaker{
        height: 100%;
    }
    .Background{
        background-color: lightgray;
    }    
    /* Pastel pink */
    .SPEAKER_00 {
        background-color: #FEC8D8;
    }

    /* Pastel peach */
    .SPEAKER_01 {
        background-color: #ff8282;
    }

    /* Pastel yellow */
    .SPEAKER_02 {
        background-color: #acffb3;
    }

    /* Pastel green */
    .SPEAKER_03 {
        background-color: #C5E5C5;
    }

    /* Pastel blue */
    .SPEAKER_04 {
        background-color: #C3D9FF;
    }

    /* Pastel lavender */
    .SPEAKER_05 {
        background-color: #E9C7E9;
    }

    /* Pastel coral */
    .SPEAKER_06 {
        background-color: #FFD6D1;
    }

    /* Pastel salmon */
    .SPEAKER_07{
        background-color: #fffab4;
    }

    /* Pastel mint */
    .SPEAKER_08 {
        background-color: #D1E8D1;
    }

    /* Pastel sky blue */
    .SPEAKER_09 {
        background-color: #9b5f2b;
    }

    #video-controls{
        background-color: rgba(0, 0, 0, 0.8);
        padding: 1em;
    }

    #timeline{
        margin-bottom: 1em;
    }

    #play {
        border: 0;
        background: transparent;
        box-sizing: border-box;
        width: 0;
        height: 37px;
        border-color: transparent transparent transparent #dbdbdb;
        transition: 100ms all ease;
        cursor: pointer;
        border-style: solid;
        border-width: 18px 0 18px 30px;
        }
    #play.paused {
        border-style: double;
        border-width: 0px 0 0px 30px;
    }
    #play:hover {
        border-color: transparent transparent transparent #fff;
    }
    .slider{
        -webkit-appearance: none;
        width: 100%;
        background: linear-gradient(to right, #ff7a7a 0%, #ff7a7a 0%, rgba(255, 255, 255, 0) 0%, rgba(255, 255, 255, 0) 100%);
        outline: none;
        transition: background 450ms ease-in;
        opacity: 0.7;
        position: absolute;
        height: 10px;
        margin: 0;
    }

    .slider:hover {
        opacity: 1; /* Fully shown on mouse-over */
    }

    .buttons{
        display: flex;
        align-items: center;
    }

    #time{
        margin-left: 1em;
        color: #fff;
    }
    #speaker{
        margin-left: 1em;
        color: #fff;
    }

    @keyframes rotate {
        from {
            transform: rotate(0deg);
        }
        to { 
            transform: rotate(360deg);
        }
    }
 

    @-webkit-keyframes rotate {
        from {
            -webkit-transform: rotate(0deg);
        }
        to { 
            -webkit-transform: rotate(360deg);
        }
    }

    #load {
        width: 20px;
        height: 20px;
        margin: 0 0 0;
        border: solid 5px #fff;
        border-radius: 50%;
        border-right-color: transparent;
        border-bottom-color: transparent;
        -webkit-transition: all 0.5s ease-in;
        -webkit-animation-name: rotate;
        -webkit-animation-duration: 1.0s;
        -webkit-animation-iteration-count: infinite;
        -webkit-animation-timing-function: linear;
        transition: all 0.5s ease-in;
        animation-name: rotate;
        animation-duration: 1.0s;
        animation-iteration-count: infinite;
        animation-timing-function: linear;
    }
    .hide{
        display: none;
    }
</style>
</head>
<body>
    <div class="video-container">
        <video controls class="video" id="video" preload="metadata" crossorigin="use-credentials">
            <source src="$url" type="video/mp4"></source>
            <track src="$vtt" kind="subtitles" srclang="en" label="English" default></track>
        </video>
        <div id="video-controls">
            <div style="position: relative;">
                <input type="range" min="0" max="100" value="0" step="0.05" class="slider" id="slider">
                <div id="timeline" title='Background' style="display: flex; align-items: center; width: 100%; background-color: lightgray; height: 10px;"></div>
            </div>

            <div class="buttons">
                <div id="load"></div>
                <div id="play" title="Play" class="hide"></div>
                <div id="time">00:00 / 00:00</div>
                <div id="speaker"></div>
            </div>
        </div>
    </div>
<script type="text/javascript">
    const zeroPad = (num, places=2) => String(num).padStart(places, '0')

    let seeking = false;

    const slider = document.getElementById('slider');
    slider.addEventListener("mousedown", (event) => {
        seeking = true;
    });
    slider.addEventListener("mouseup", (event) => {
        seeking = false;
    });
    slider.addEventListener("input", (event) => {
        let value = event.target.value
        event.target.style.background = 'linear-gradient(to right, #ff7a7a 0%, #ff7a7a ' + value + '%, rgba(255, 255, 255, 0) ' + value + '%, rgba(255, 255, 255, 0) 100%)'
    });
    slider.addEventListener("change", (event) => {
        video.currentTime = (event.target.value / 100) * video.duration
    });

    const time = document.getElementById('time');

    const video = document.getElementById('video');
    video.controls = false;
    video.addEventListener('timeupdate', (event) => {
        if (!seeking) {
            let value = (video.currentTime / video.duration) * 100
            slider.style.background = 'linear-gradient(to right, #ff7a7a 0%, #ff7a7a ' + value + '%, rgba(255, 255, 255, 0) ' + value + '%, rgba(255, 255, 255, 0) 100%)'
            slider.value = value;
        }

        let minutes = Math.floor(video.currentTime / 60);
        let seconds = Math.floor(video.currentTime - minutes * 60);
        let total_minutes = Math.floor(video.duration / 60);
        let total_seconds = Math.floor(video.duration - total_minutes * 60);
        time.innerHTML = `${zeroPad(minutes)}:${zeroPad(seconds)} / ${zeroPad(total_minutes)}:${zeroPad(total_seconds)}`;
    });
    
    const speaker = document.getElementById('speaker');
    const track = video.textTracks[0];
    track.addEventListener('cuechange', () => {
        const cue = track.activeCues[0];
        if (!cue)
            return;
        const speaker_match = cue.text.match(/<v.(.*?)>/);
        if (speaker_match)
            speaker.innerHTML = speaker_match[1];
    });

    const load = document.getElementById('load');
    const play = document.getElementById('play');

    let video_loaded = false;

    video.addEventListener('loadedmetadata', (event) => {
        video_loaded = true;
        load.classList.add('hide');
        play.classList.remove('hide');
        let total_minutes = Math.floor(video.duration / 60);
        let total_seconds = Math.floor(video.duration - total_minutes * 60);
        time.innerHTML = `00:00 / ${zeroPad(total_minutes)}:${zeroPad(total_seconds)}`;
    });
    
    const videoControls = document.getElementById('video-controls');
    const timeline = document.getElementById('timeline');

    video.addEventListener('click', () => {
        if (!video_loaded)
            return;
        if (video.paused) {
            video.play();
        }
        else {
            video.pause();
        }
    });

    play.addEventListener('click', () => {
        if (!video_loaded)
            return;
        if (video.paused) {
            video.play();
        } else {
            video.pause();
        }
    });

    video.addEventListener('play', () => {
        play.classList.add('paused');
    });

    video.addEventListener('pause', () => {
        play.classList.remove('paused');
    });


    const percentages = $percentages;

    

    async function main(){
        let divs = ''
        let last_time = 0
        for(let p of percentages){
            divs += `<div class="speaker ${p[0]}" style="width:${p[1]}%;" title="${p[0]}"></div>\n`
        }
        timeline.innerHTML = divs
    }

    main()
</script>
</body>
</html>
"""
)
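The template mixes Python placeholders (`$url`, `$vtt`, `$percentages`) with JavaScript template literals such as `${zeroPad(minutes)}`, which is why `safe_substitute` rather than `substitute` is used later in the notebook. The sketch below illustrates the difference on a tiny stand-in template; the `snippet` string and the sample values are made up for the illustration.

```python
from string import Template

# A tiny stand-in for the player template: two Python placeholders
# ($url, $percentages) next to a JavaScript template literal.
snippet = Template(
    '<source src="$url">'
    '<script>const label = `${pad(minutes)}`; const data = $percentages;</script>'
)

values = {'url': 'video.mp4', 'percentages': str([['SPEAKER_00', 40.0]])}

# safe_substitute fills the placeholders it knows and leaves everything else as-is.
rendered = snippet.safe_substitute(values)
print(rendered)

# substitute is strict and rejects the JavaScript `${...}` expression.
try:
    snippet.substitute(values)
    strict_ok = True
except ValueError:
    strict_ok = False
print(strict_ok)
```

Note that `str()` of a list of strings and floats happens to be valid JavaScript here; `json.dumps` would be a more robust choice if the data could contain quotes or `None`.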

### Main code

In [2]:
import os

from argparse import Namespace

import tempfile
from os.path import join as opj
import re

import torch
import torchaudio
import numpy as np

import whisper

from pyannote.audio import Pipeline

from denoiser import distrib, pretrained
from denoiser.audio import Audioset, find_audio_files

from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
from moviepy.editor import AudioFileClip, concatenate_audioclips

device = "cuda" if torch.cuda.is_available() else "cpu"

denoise_model = pretrained.get_model(Namespace(model_path=None, dns48=False, dns64=False, master64=False, valentini_nc=False)).to(device)
denoise_model.eval()
whisper_model = whisper.load_model("large").to(device)
whisper_model.eval()

def split_audio(tmpdirname, video, chunk_size=120):
    """
    Extract the audio track from the video and split it into chunks of chunk_size seconds.
    Returns the total audio duration in seconds.
    """
    path = opj(tmpdirname, 'noisy_chunks')
    os.makedirs(path)
    # extract audio from video
    audio = AudioFileClip(video.name)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as audio_fp:
        audio.write_audiofile(audio_fp.name, verbose=False)

        # write one clip per chunk start time; the last chunk may be shorter than chunk_size
        for i, chunk in enumerate(np.arange(0, audio.duration, chunk_size)):
            ffmpeg_extract_subclip(audio_fp.name, chunk, min(chunk + chunk_size, audio.duration),
                                targetname=opj(path, f'{i:09}.wav'))
    return audio.duration


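def get_speakers(tmpdirname, use_auth_token=True):
    """
    Denoise the audio chunks, reassemble them into one file and run speaker diarisation.
    Returns the diarisation result as a list of text lines and the path to the cleaned audio.
    """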
    files = find_audio_files(opj(tmpdirname, 'noisy_chunks'))
    dset = Audioset(files, with_path=True,
                    sample_rate=denoise_model.sample_rate, channels=denoise_model.chin, convert=True)
    
    loader = distrib.loader(dset, batch_size=1)
    distrib.barrier()

    print('removing noise...')
    enhanced_chunks = []
    with tempfile.TemporaryDirectory() as denoised_tmpdirname:
        for data in loader:
            noisy_signals, filenames = data
            noisy_signals = noisy_signals.to(device)
            
            with torch.no_grad():
                wav = denoise_model(noisy_signals).squeeze(0)
            wav = wav / max(wav.abs().max().item(), 1)

            name = opj(denoised_tmpdirname, filenames[0].split('/')[-1])
            torchaudio.save(name, wav.cpu(), denoise_model.sample_rate)
            enhanced_chunks.append(name)

        print('reassembling chunks...')
        clips = [AudioFileClip(c) for c in sorted(enhanced_chunks)]
        final_clip = concatenate_audioclips(clips)
        cleaned_path = opj(tmpdirname, 'cleaned.wav')
        final_clip.write_audiofile(cleaned_path, verbose=False)
        
        print('identifying speakers...')
        # load pre-trained model
        pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization', use_auth_token=use_auth_token)
    
        return str(pipeline({'uri': '', 'audio': cleaned_path})).split('\n'), cleaned_path

def get_subtitles(timecodes, cleaned_audio_path, language=None):
    """
    Transcribe each diarised segment with Whisper and build a timeline keyed by start time.
    `language` is passed to Whisper as the decoding language; None lets Whisper detect it.
    """
    if(device == 'cpu'):
        options = whisper.DecodingOptions(language=language, fp16=False)
    else:
        options = whisper.DecodingOptions(language=language)

    timeline = {}
    prev_speaker = None
    prev_start = 0
    for line in timecodes:
        start, end = re.findall(r'\d{2}:\d{2}:\d{2}\.\d{3}', line)
        start = str_to_seconds(start)
        end = str_to_seconds(end)
        speaker = re.findall(r'\w+$', line)[0]

        # extract a segment of the audio for a speaker
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as audio_fp:
            ffmpeg_extract_subclip(cleaned_audio_path, start, end,
                                    targetname=audio_fp.name)

            # load audio and pad/trim it to exactly 30 seconds (longer segments are cut off)
            audio = whisper.load_audio(audio_fp.name)
            audio = whisper.pad_or_trim(audio)
            # make log-Mel spectrogram and move to the same device as the model
            mel = whisper.log_mel_spectrogram(audio).to(whisper_model.device)
            # decode the audio
            result = whisper.decode(whisper_model, mel, options)

            if(speaker == prev_speaker):
                timeline[prev_start]['text'] += f' <{seconds_to_str(start)}>{result.text}'
                timeline[prev_start]['end'] = end
            else:
                timeline[start] = { 'end': end, 
                                    'speaker': speaker,
                                    'text': f'<v.{speaker}>{speaker}</v>: {result.text}'}
                prev_start = start

            prev_speaker = speaker

    return timeline

def str_to_seconds(time_str):
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + float(s)

def seconds_to_str(seconds):
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    # with milliseconds
    return f'{int(h):02}:{int(m):02}:{s:06.3f}'
    

def timeline_to_vtt(timeline):
    vtt = 'WEBVTT\n\n'
    for start in sorted(timeline.keys()):
        end = timeline[start]['end']
        text = timeline[start]['text']
        vtt += f'{seconds_to_str(start)} --> {seconds_to_str(end)}\n'
        vtt += text+'\n\n'
    return vtt

def calc_speaker_percentage(timeline, duration):
    percentages = []
    end = 0
    for start in sorted(timeline.keys()):
        if(start > end):
            percentages.append(['Background', 100*(start-end)/duration])
        end = timeline[start]['end']
        speaker = timeline[start]['speaker']
        percentages.append([speaker, 100*(end-start)/duration])
    return percentages
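To make the chunking in `split_audio` concrete, here is a minimal sketch that computes the same (start, end) boundaries without touching any audio. `chunk_bounds` is a hypothetical helper written for this illustration; it mirrors the `np.arange(0, audio.duration, chunk_size)` loop but uses plain Python.

```python
def chunk_bounds(duration, chunk_size=120):
    """Return (start, end) pairs in seconds that cover the range 0..duration."""
    bounds = []
    start = 0
    while start < duration:
        # The last chunk is clipped to the total duration.
        bounds.append((start, min(start + chunk_size, duration)))
        start += chunk_size
    return bounds


print(chunk_bounds(250))
# → [(0, 120), (120, 240), (240, 250)]
```

A 250-second recording is therefore written as three files, the last one only 10 seconds long.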


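The data flow from diarisation lines to subtitles and speaker percentages can be followed without loading any model. The sketch below is a dry run under stated assumptions: the three diarisation lines are invented and their layout is assumed to match what `str(pipeline(...))` produces, the Whisper call is replaced by placeholder text, and `to_seconds` / `to_stamp` are local stand-ins for `str_to_seconds` / `seconds_to_str`.

```python
import re

STAMP = re.compile(r'\d{2}:\d{2}:\d{2}\.\d{3}')


def to_seconds(stamp):
    h, m, s = stamp.split(':')
    return int(h) * 3600 + int(m) * 60 + float(s)


def to_stamp(seconds):
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    return f'{int(h):02}:{int(m):02}:{s:06.3f}'


# Hypothetical diarisation lines; the exact layout is an assumption.
lines = [
    '[ 00:00:01.000 -->  00:00:04.500] A SPEAKER_00',
    '[ 00:00:04.500 -->  00:00:06.000] B SPEAKER_00',
    '[ 00:00:08.000 -->  00:00:10.000] C SPEAKER_01',
]

timeline = {}
prev_speaker, prev_start = None, 0
for i, line in enumerate(lines, start=1):
    start, end = (to_seconds(s) for s in STAMP.findall(line))
    speaker = re.findall(r'\w+$', line)[0]
    text = f'segment {i}'  # placeholder for the Whisper transcription
    if speaker == prev_speaker:
        # Same speaker continues: extend the previous cue with an inline timestamp.
        timeline[prev_start]['text'] += f' <{to_stamp(start)}>{text}'
        timeline[prev_start]['end'] = end
    else:
        timeline[start] = {'end': end, 'speaker': speaker,
                           'text': f'<v.{speaker}>{speaker}</v>: {text}'}
        prev_start = start
    prev_speaker = speaker

# Render the timeline as WebVTT, one cue per speaker turn.
vtt = 'WEBVTT\n\n'
for start in sorted(timeline):
    cue = timeline[start]
    vtt += f"{to_stamp(start)} --> {to_stamp(cue['end'])}\n{cue['text']}\n\n"
print(vtt)

# Share of the total duration per turn; gaps between turns count as 'Background'.
duration = 10.0
percentages, end = [], 0
for start in sorted(timeline):
    if start > end:
        percentages.append(['Background', 100 * (start - end) / duration])
    end = timeline[start]['end']
    percentages.append([timeline[start]['speaker'], 100 * (end - start) / duration])
print(percentages)
# → [['Background', 10.0], ['SPEAKER_00', 50.0], ['Background', 20.0], ['SPEAKER_01', 20.0]]
```

The two consecutive `SPEAKER_00` segments end up in a single cue running from 1.0 s to 6.0 s, which is why the timeline bar shows one block per speaker turn rather than one per diarisation segment.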
### UI code

In [None]:

# Get an external URL to the virtual machine
from google.colab.output import eval_js
url = eval_js("google.colab.kernel.proxyPort(8000)")

from yt_dlp import YoutubeDL

# List of supported languages
from whisper.tokenizer import LANGUAGES

# workaround
# https://github.com/pyannote/pyannote-audio/issues/1269
import locale
locale.getpreferredencoding = lambda: "UTF-8"

import ipywidgets as widgets
from IPython.display import clear_output, display, HTML, Markdown
import tempfile

def render_player():
  return HTML(player_template.safe_substitute(url='http://localhost:8000/video.mp4', vtt='http://localhost:8000/subtitles.vtt', percentages=str(percentages)))

# Input form elements
out = widgets.Output()
upload = widgets.FileUpload(accept='.mp4', button_style='info')
text = widgets.Text(placeholder='YouTube URL')
lang_options = [('Original', None)]+[(v.capitalize(), k) for k, v in LANGUAGES.items()]
lang = widgets.Dropdown(options=lang_options, description='Translate to:')

percentages = []

def process():  
  global percentages
  print('Processing...')

  # remove old file if exists
  !rm -f video.mp4
  if text.value: # if URL to video is provided
    with YoutubeDL({'format': 'bv[ext=mp4][height<=1080]+ba[ext=m4a]', 'outtmpl': 'video.mp4'}) as ydl:
      ydl.download([text.value])
  else: # if file is uploaded
    with open('video.mp4', 'wb') as f:
      uploaded_filename = next(iter(upload.value))
      content = upload.value[uploaded_filename]['content']
      f.write(content)
  
  clear_output()
        
  with tempfile.TemporaryDirectory() as tmpdirname:
    # split audio to fit in memory
    with open('video.mp4', 'rb') as f:
      duration = split_audio(tmpdirname, f)

    speaker_diarisation, cleaned_path = get_speakers(tmpdirname)

    clear_output()

    print('Language', lang.value)
    timeline  = get_subtitles(speaker_diarisation, cleaned_path, language=lang.value)

    vtt = timeline_to_vtt(timeline).encode('utf-8')
    percentages = calc_speaker_percentage(timeline, duration)
    with open('subtitles.vtt', 'wb') as f:
      f.write(vtt)

    clear_output()
    
    with open('index.html', 'w') as f:
      f.write(player_template.safe_substitute(url=f'{url}/video.mp4', vtt=f'{url}/subtitles.vtt', percentages=str(percentages)))
    
    # print markdown link to the video
    display(Markdown(f'Open [player]({url}/index.html) in a new tab or [download subtitles]({url}/subtitles.vtt)'))


def render_upload_form():
  return widgets.VBox([text, widgets.Label(value='or'), upload, lang])


## Log in to Hugging Face

**Note: You will need to accept pyannote's [speaker-diarization](https://huggingface.co/pyannote/speaker-diarization) and [segmentation](https://huggingface.co/pyannote/segmentation) user conditions.**

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## UI

You can either paste a YouTube link or upload a video file. If a link is provided, the uploaded file is ignored.

In [None]:
render_upload_form()

In [None]:
process()

**Note: To open the player in a new tab or download the subtitles using the links above, third-party cookies must be enabled.**

Otherwise, you can download the subtitles from the file explorer. You can also render the player in the notebook by running the following cell.

In [None]:
render_player()
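If the player is not available, the generated subtitles can also be inspected directly. The following is a minimal sketch, assuming the cue layout written by `timeline_to_vtt` (a timing line followed by text with a `<v.SPEAKER_XX>` voice tag); `read_cues` is a hypothetical helper and not a general WebVTT parser.

```python
import re

CUE_TIMES = re.compile(r'(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})')
VOICE_TAG = re.compile(r'<v\.(\w+)>')
ANY_TAG = re.compile(r'<[^>]+>')


def read_cues(vtt_text):
    """Return a list of (start, end, speaker, plain_text) tuples from WebVTT text."""
    cues = []
    for block in vtt_text.strip().split('\n\n'):
        lines = block.splitlines()
        if not lines:
            continue
        match = CUE_TIMES.match(lines[0])
        if not match:
            continue  # skips the WEBVTT header and anything that is not a cue
        body = ' '.join(lines[1:])
        voice = VOICE_TAG.search(body)
        speaker = voice.group(1) if voice else None
        # Strip voice tags and inline timestamps to get readable text.
        plain = ANY_TAG.sub('', body)
        cues.append((match.group(1), match.group(2), speaker, plain))
    return cues


# Made-up sample in the same layout as the generated subtitles.
sample = (
    'WEBVTT\n\n'
    '00:00:01.000 --> 00:00:06.000\n'
    '<v.SPEAKER_00>SPEAKER_00</v>: hello <00:00:04.500>world\n\n'
)
print(read_cues(sample))
# → [('00:00:01.000', '00:00:06.000', 'SPEAKER_00', 'SPEAKER_00: hello world')]
```

After `process()` has run, the same function could be applied to the contents of `subtitles.vtt`, read with `encoding='utf-8'`.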