# 1. Downloading the data

A one-minute YouTube video, ["What Happens In One Minute?"](https://www.youtube.com/watch?v=zhWDdy_5v2w) (in English), was chosen for this work. The video has both automatic subtitles and official ones, and the official ones can serve as the reference. The audio track and the subtitles were downloaded with yt-dlp, a fork of youtube-dl.

In [None]:
!pip install yt-dlp

In [2]:
import yt_dlp

In [3]:
video_url = "https://www.youtube.com/watch?v=zhWDdy_5v2w"

In [4]:
ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav',
        'preferredquality': '192',
    }],
    'writeautomaticsub': True,
    'subtitlesformat': 'json3',
    'outtmpl': 'video.%(ext)s',
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([video_url])

[youtube] Extracting URL: https://www.youtube.com/watch?v=zhWDdy_5v2w
[youtube] zhWDdy_5v2w: Downloading webpage
[youtube] zhWDdy_5v2w: Downloading ios player API JSON
[youtube] zhWDdy_5v2w: Downloading android player API JSON
[youtube] zhWDdy_5v2w: Downloading m3u8 information
[info] zhWDdy_5v2w: Downloading subtitles: en
[info] zhWDdy_5v2w: Downloading 1 format(s): 251
[info] Writing video subtitles to: video.en.json3
[download] Destination: video.en.json3
[download] 100% of   26.71KiB in 00:00:00 at 221.49KiB/s
[download] Destination: video.webm
[download] 100% of 1000.54KiB in 00:00:00 at 5.97MiB/s   
[ExtractAudio] Destination: video.wav
Deleting original file video.webm (pass -k to keep)


The video also has official English subtitles, so let's download them as well.

In [5]:
ydl_opts = {
  'writesubtitles': True,
  'skip_download': True,
  'subtitlesformat': 'json3',
  'outtmpl': 'standart.%(ext)s',
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([video_url])

[youtube] Extracting URL: https://www.youtube.com/watch?v=zhWDdy_5v2w
[youtube] zhWDdy_5v2w: Downloading webpage
[youtube] zhWDdy_5v2w: Downloading ios player API JSON
[youtube] zhWDdy_5v2w: Downloading android player API JSON
[youtube] zhWDdy_5v2w: Downloading m3u8 information
[info] zhWDdy_5v2w: Downloading subtitles: en
[info] zhWDdy_5v2w: Downloading 1 format(s): 248+251
[info] Writing video subtitles to: standart.en.json3
[download] Destination: standart.en.json3
[download] 100% of    3.09KiB in 00:00:00 at 33.95KiB/s


In [6]:
!ls

sample_data  standart.en.json3	video.en.json3	video.wav


In [7]:
import json

In [8]:
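def get_text(file, auto=True):
    """Concatenate all text segments of a json3 subtitle file into one string."""
    with open(file, encoding='utf-8') as f:
        subtitles = json.load(f)

    # For automatic captions the first event is skipped
    # (it is assumed to carry no text segments).
    events = subtitles['events'][1:] if auto else subtitles['events']

    youtube_text = ''
    for event in events:
        # .get() guards against events that have no 'segs' key
        for seg in event.get('segs', []):
            youtube_text += seg['utf8'] + ' '

    return youtube_text.replace('\n', ' ')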
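
To make the nesting that `get_text` walks through concrete, here is a minimal sketch on a hand-made dictionary. The sample data and the name `join_segments` are hypothetical; the sample only mimics the `events`, then `segs`, then `utf8` structure used above, and real json3 files carry additional fields that are ignored here.

```python
# Hypothetical, hand-made sample mimicking the json3 nesting used by get_text:
# events -> segs -> utf8. Real subtitle files carry extra fields (e.g. timings).
sample = {
    "events": [
        {"segs": [{"utf8": "in a single"}, {"utf8": " minute"}]},
        {"segs": [{"utf8": "\n"}]},
        {"segs": [{"utf8": "your body"}]},
    ]
}

def join_segments(subtitles):
    """Join every 'utf8' fragment with spaces and collapse whitespace."""
    parts = [
        seg["utf8"]
        for event in subtitles["events"]
        for seg in event.get("segs", [])  # tolerate events without segments
    ]
    # split()/join() removes newlines and repeated spaces in one step
    return " ".join(" ".join(parts).split())

sample_text = join_segments(sample)
print(sample_text)
```

Unlike `get_text`, this variant also collapses repeated spaces, which makes the extracted text easier to inspect by eye.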

In [46]:
audio_path = "video.wav"
standart = get_text('standart.en.json3', False)
results = {'auto': get_text('video.en.json3')} # all subsequent transcriptions will be added here

# 2. ASR systems

### SpeechRecognition library (CMU Sphinx)

In [None]:
!pip install SpeechRecognition

In [None]:
!pip install PocketSphinx

In [12]:
import speech_recognition as sr

In [13]:
r = sr.Recognizer()
with sr.AudioFile(audio_path) as source:
    audio = r.record(source)

In [17]:
sb_sphinx = r.recognize_sphinx(audio)

In [47]:
results['SR (Sphinx)'] = sb_sphinx

### faster-whisper (a faster reimplementation of OpenAI's Whisper)

In [None]:
!pip install faster_whisper

In [23]:
from faster_whisper import WhisperModel

model = WhisperModel("large-v3")

config.json:   0%|          | 0.00/2.39k [00:00<?, ?B/s]

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

In [24]:
segments, info = model.transcribe(audio_path)

In [25]:
fw_text = ''
for segment in segments:
    fw_text += segment.text

In [48]:
results['faster-whisper'] = fw_text

### wav2vec2 (HuggingFace)

In [None]:
!pip install huggingsound

In [None]:
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")

In [29]:
hs_wav2vec = model.transcribe([audio_path])

100%|██████████| 1/1 [02:16<00:00, 136.01s/it]


In [49]:
results['HuggingSound (wav2vec2)'] = hs_wav2vec[0]['transcription']

# 3. Quality evaluation (WER)

In [35]:
import pandas as pd
from jiwer import wer
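
For reference, WER is the word-level edit distance between the reference and the hypothesis divided by the number of reference words: WER = (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words and N is the reference length. The sketch below is an illustrative, standard-library-only re-implementation of that definition, not jiwer's code; `simple_wer` is a hypothetical name, and words are split on whitespace only.

```python
def simple_wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length.

    Illustrative only; the notebook itself relies on jiwer's `wer`.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("minute" -> "menninger") and one deletion ("your")
# against a 5-word reference.
example = simple_wer("in a single minute your", "in a single menninger")
print(example)
```

Because N is the reference length, WER can exceed 1 when the hypothesis contains many extra words.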

In [50]:
texts = pd.DataFrame.from_dict(results, orient="index").reset_index()
texts.columns = ['model', 'text']

In [51]:
texts

| | model | text |
|---|---|---|
| 0 | auto | in a single minute your body produces ... |
| 1 | SR (Sphinx) | in a single menninger body produces one hundre... |
| 2 | faster-whisper | In a single minute, your body produces 120 to... |
| 3 | HuggingSound (wav2vec2) | in a single-minutyerbody produces onehunded-tw... |


In [52]:
wer_df = {}

for model in results.keys():
    er = wer(standart.lower(), results[model].lower())
    wer_df[model] = er

In [53]:
comparison = pd.DataFrame.from_dict(wer_df, orient="index").reset_index()
comparison.columns = ['ASR model', 'WER']

In [54]:
comparison

| | ASR model | WER |
|---|---|---|
| 0 | auto | 0.305677 |
| 1 | SR (Sphinx) | 0.777293 |
| 2 | faster-whisper | 0.052402 |
| 3 | HuggingSound (wav2vec2) | 0.650655 |


In [55]:
texts.to_csv('model_text.tsv', sep="\t")

Whisper handled the task best by a wide margin (WER ≈ 0.05, one could even say perfectly), in part because it outputs punctuation, which the comparison does not strip. YouTube's automatic subtitles came second (WER ≈ 0.31).
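
Since the comparison only lowercases the texts, punctuation stays attached to the words, so if the reference contains punctuation, a system that emits none is penalized on every word followed by a comma or a period. A minimal sketch of a normalization step that could be applied to both texts before computing WER follows. `normalize` is a hypothetical helper, and the choice of which characters to keep is an assumption, not part of the original evaluation.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace.

    Hypothetical helper: keeps word characters (letters, digits, underscore),
    apostrophes and whitespace; every other character becomes a space.
    """
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # punctuation -> space
    return " ".join(text.split())          # collapse repeated whitespace

with_punct = "In a single minute, your body produces"
without_punct = "in a single minute your body produces"

print(normalize(with_punct) == normalize(without_punct))
```

The normalized strings could then be passed to `wer`, for example `wer(normalize(standart), normalize(results[model]))`. Numbers written as digits in one text and as words in another ("120" versus "one hundred twenty") would still count as errors.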