# Multi-Language Speech Recognition and Speaker Diarisation

This demo recognizes speech in 99 languages, identifies speakers, and translates the transcript into a selected language. The pipeline combines three libraries:
* [denoiser](https://github.com/facebookresearch/denoiser) to remove extraneous noise from the audio,
* [pyannote](https://github.com/pyannote/pyannote-audio) for speaker diarisation, and
* [whisper](https://github.com/openai/whisper), the main component, which recognizes speech and can also translate it into one of 99 languages.

Note that the ability to translate into any language was discovered by accident while experimenting with the model; the official repository only states that it can translate any of the languages into English. Recognition and translation quality may vary depending on the language.
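
The difference between the documented and the accidental behaviour can be sketched in terms of the keyword arguments passed to Whisper's `transcribe`. `task` and `language` are real Whisper decoding options, but the helper name `whisper_options` is hypothetical, and the idea that this demo forces the target language in this way is an assumption, since the `backend` module is not shown here.

```python
def whisper_options(target_language=None):
    """Build keyword arguments for a Whisper transcribe() call (hypothetical helper).

    target_language is a Whisper language code such as "de", or None to keep
    the original language of the recording.
    """
    if target_language is None:
        # No translation: let Whisper detect the spoken language itself.
        return {"task": "transcribe"}
    if target_language == "en":
        # Officially supported path: translate any language into English.
        return {"task": "translate"}
    # Undocumented behaviour (assumption): transcribing while forcing a
    # different output language often yields a translation into it.
    return {"task": "transcribe", "language": target_language}


# Possible usage, assuming the openai-whisper package is installed:
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("audio.wav", **whisper_options("de"))
```

Keeping the option logic in a small pure function makes it easy to check without loading a model.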

## Setup

In [None]:
# %%capture

!pip install -U yt-dlp
!pip install -r requirements.txt
!pip install omegaconf==2.3.0 pytorch-lightning==1.8.4

import subprocess
# serve the working directory on port 8000 so the player can load the video and subtitles
subprocess.Popen(['python3', '-m', 'http.server', '8000']);

from yt_dlp import YoutubeDL

# workaround colab's "unrecognized arguments"
import sys
sys.argv = []

from backend import get_speakers, split_audio, get_subtitles, timeline_to_vtt, calc_speaker_percentage
from whisper.tokenizer import LANGUAGES

# workaround: force UTF-8 as the preferred encoding, see
# https://github.com/pyannote/pyannote-audio/issues/1269
import locale
locale.getpreferredencoding = lambda: "UTF-8"

import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
from huggingface_hub import notebook_login
import tempfile

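def render_player():
  """Fill the player.html template with the media URLs and speaker percentages."""
  with open('player.html', 'r') as f:
    player_html = f.read()
  # video.mp4 and subtitles.vtt are served by the http.server started above
  player_html = player_html.replace('$url', 'http://localhost:8000/video.mp4')
  player_html = player_html.replace('$vtt', 'http://localhost:8000/subtitles.vtt')
  # `percentages` is the module-level value set by process()
  player_html = player_html.replace('$percentages', str(percentages))
  return HTML(player_html)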

out = widgets.Output()
upload = widgets.FileUpload(accept='.mp4', button_style='info')
text = widgets.Text(placeholder='YouTube URL')
lang_options = [('Original', None)]+[(v.capitalize(), k) for k, v in LANGUAGES.items()]
lang = widgets.Dropdown(options=lang_options, description='Translate to:')

percentages = []

def process(enhance=True):  # note: `enhance` is not used in this cell
  global percentages
  print('Processing...')

  !rm -f video.mp4
  if text.value:
    # best MP4 video up to 1080p plus best M4A audio, saved as video.mp4
    with YoutubeDL({'format': 'bv[ext=mp4][height<=1080]+ba[ext=m4a]', 'outtmpl': 'video.mp4'}) as ydl:
      ydl.download([text.value])
  else:
    # assumes the ipywidgets 7 FileUpload API: value maps filename -> {'content': bytes, ...}
    with open('video.mp4', 'wb') as f:
      uploaded_filename = next(iter(upload.value))
      content = upload.value[uploaded_filename]['content']
      f.write(content)

  clear_output()

  with tempfile.TemporaryDirectory() as tmpdirname:
    # split audio to fit in memory
    with open('video.mp4', 'rb') as f:
      duration = split_audio(tmpdirname, f)

    speaker_diarisation, cleaned_path = get_speakers(tmpdirname)

    clear_output()

    print('Language', lang.value)
    timeline = get_subtitles(speaker_diarisation, cleaned_path, language=lang.value)

    vtt = timeline_to_vtt(timeline)
    percentages = calc_speaker_percentage(timeline, duration)
    with open('subtitles.vtt', 'w') as f:
      f.write(vtt)

    clear_output()
    print('Ready')

def render_upload_form():
  return widgets.VBox([text, widgets.Label(value='or'), upload, lang])
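
The `backend` module is not shown in this notebook. As a rough illustration of the subtitle step, the sketch below converts a timeline to WebVTT text. It assumes each timeline entry is a dict with `start`, `end` (seconds), `speaker`, and `text` keys; that layout and both function names are hypothetical and may differ from what `timeline_to_vtt` actually does.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def sketch_timeline_to_vtt(timeline):
    """Render a list of segment dicts (assumed layout) as WebVTT text."""
    lines = ["WEBVTT", ""]  # mandatory header followed by a blank line
    for seg in timeline:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        # A WebVTT voice tag attaches the speaker label to the cue text.
        lines.append(f"<v {seg['speaker']}>{seg['text']}")
        lines.append("")  # a blank line ends the cue
    return "\n".join(lines)


example = [{"start": 0.0, "end": 1.5, "speaker": "SPEAKER_00", "text": "Hello"}]
print(sketch_timeline_to_vtt(example))
```

Using voice tags rather than a plain text prefix lets a player style each speaker separately, though a prefix such as `SPEAKER_00: Hello` would work just as well for display.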


## Log in to Hugging Face

**Note: You will need to accept pyannote's [speaker-diarization](https://huggingface.co/pyannote/speaker-diarization) and [segmentation](https://huggingface.co/pyannote/segmentation) user conditions.**

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## UI

In [None]:
render_upload_form()

In [None]:
process()

In [None]:
render_player()
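
The player receives whatever `calc_speaker_percentage` returns through the `$percentages` placeholder. One plausible way to compute such a value is sketched below, assuming the timeline is a list of dicts with `start`, `end` (seconds), and `speaker` keys and that the result is a mapping from speaker to percentage of the total duration. The function name and both data layouts are hypothetical; the real backend may differ.

```python
from collections import defaultdict


def sketch_speaker_percentage(timeline, duration):
    """Return each speaker's share of `duration` as a percentage (assumed layout)."""
    if duration <= 0:
        return {}  # avoid dividing by zero for empty or broken input
    totals = defaultdict(float)
    for seg in timeline:
        # Accumulate speaking time per speaker label.
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(100 * t / duration, 1) for speaker, t in totals.items()}


demo_timeline = [
    {"start": 0, "end": 30, "speaker": "SPEAKER_00"},
    {"start": 30, "end": 45, "speaker": "SPEAKER_01"},
    {"start": 45, "end": 60, "speaker": "SPEAKER_00"},
]
print(sketch_speaker_percentage(demo_timeline, duration=100))
```

Percentages here are relative to the full recording, so they need not sum to 100 when the audio contains silence.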