# Transcribe a YouTube Video in English with AssemblyAI and yt-dlp

### References:

- [AssemblyAI documentation](https://www.assemblyai.com/docs)
- [yt-dlp on GitHub](https://github.com/yt-dlp/yt-dlp)

## Preparation

#### Imports and Globals

In [1]:
import assemblyai as aai
import yt_dlp
import json

from config import *

aai.settings.api_key = aai_key
YT_BASE_URL = 'https://www.youtube.com/watch?v='
DST_FOLDER = 'files'

#### Task-specific Variables

In [None]:
v_id = '08j09G2CGVc'  # the main identifier of the video, absolutely needed (or is it?)
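
Since the download cell in this notebook accepts a full URL, the ID does not have to be typed by hand. A minimal sketch of deriving it from a URL with the standard library follows. The helper name `extract_video_id` is hypothetical, and it only assumes the two common link shapes (`watch?v=...` and `youtu.be/...`); other YouTube URL forms are not covered.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the video ID from a YouTube URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or '').lower()
    if host == 'youtu.be':
        # Short links keep the ID as the first path segment.
        return parsed.path.strip('/').split('/')[0] or None
    if host == 'youtube.com' or host.endswith('.youtube.com'):
        # Watch links keep the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get('v')
        return values[0] if values else None
    return None


print(extract_video_id('https://www.youtube.com/watch?v=08j09G2CGVc'))
```

Once a video has been downloaded, the dictionary returned by `extract_info` also carries the ID under the `'id'` key, which is likely the more robust source.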

#### Pull and save the soundtrack with yt-dlp

In [2]:
# url = f'{YT_BASE_URL}{v_id}'
# url = 'https://www.youtube.com/watch?v=4cr3KD0ayOE'
url = input('Enter the URL of the video: ')

ydl_opts = {
    'format': 'm4a/bestaudio/best',  # Prefer m4a; otherwise fall back to the best audio, then the best overall format
    'outtmpl': f'{DST_FOLDER}/%(title)s_%(id)s.%(ext)s',
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url)
    audio_file = ydl.prepare_filename(info)

print(f'\n>>> Downloaded to: {audio_file}')

[youtube] Extracting URL: https://www.youtube.com/watch?v=MK2pP1VvOWY
[youtube] MK2pP1VvOWY: Downloading webpage
[youtube] MK2pP1VvOWY: Downloading tv client config
[youtube] MK2pP1VvOWY: Downloading player b191cf34
[youtube] MK2pP1VvOWY: Downloading tv player API JSON
[youtube] MK2pP1VvOWY: Downloading ios player API JSON
[youtube] MK2pP1VvOWY: Downloading m3u8 information
[info] MK2pP1VvOWY: Downloading 1 format(s): 140
[download] Destination: files\Arab Zionist： ＂Something BIG is Happening in the Middle East...＂_MK2pP1VvOWY.m4a
[download] 100% of    8.64MiB in 00:00:16 at 535.62KiB/s 
[FixupM4a] Correcting container of "files\Arab Zionist： ＂Something BIG is Happening in the Middle East...＂_MK2pP1VvOWY.m4a"

>>> Downloaded to: files\Arab Zionist： ＂Something BIG is Happening in the Middle East...＂_MK2pP1VvOWY.m4a


## Building the transcript via AssemblyAI API

In [3]:
from pathlib import Path

def get_file_name(audio_file: str) -> str:
    return Path(audio_file).stem  # drops the folder and final extension; keeps dots inside the title

In [5]:
transcriber = aai.Transcriber()

In [None]:
# no speaker differentiation
config = aai.TranscriptionConfig(language_detection=True)
transcript = transcriber.transcribe(audio_file, config)
print(transcript.status)

with open(f'files/transcript_{get_file_name(audio_file)}_{transcript.id}.txt', 'w', encoding='utf-8') as f:
    f.write(transcript.text)

In [6]:
# with speaker differentiation
config = aai.TranscriptionConfig(
    speaker_labels=True
)
transcript = transcriber.transcribe(audio_file, config)
print(transcript.status, transcript.id)

with open(f'files/transcript_{get_file_name(audio_file)}_{transcript.id}.txt', 'w', encoding='utf-8') as f:
    for utterance in transcript.utterances:
        f.write(f'{utterance.speaker}: {utterance.text}\n')

TranscriptStatus.completed 071251a8-fc54-4e75-8da0-6694712b58a1


In [7]:
# Use a context manager so the file is flushed and closed after writing
with open(f'files/transcript_{get_file_name(audio_file)}_{transcript.id}.json', 'w', encoding='utf-8') as f:
    json.dump(transcript.json_response, f, indent=4, ensure_ascii=False)

In [8]:
with open(f'files/transcript_{get_file_name(audio_file)}_{transcript.id}.txt', 'w', encoding='utf-8') as f:
    for utterance in transcript.utterances:
        f.write(f'{utterance.speaker}: {utterance.text}\n')

In [7]:
for utterance in transcript.utterances:
    print(f'Speaker {utterance.speaker}: {utterance.text}')

Speaker A: So I'm here with Jonathan Pollard. You spent how many years in American prison?
Speaker B: 30.
Speaker A: 30 years. How did you interpret your incarceration? How did you spiritually find the strength?
Speaker B: There are a couple of aspects to it that might be of interest. First of all, I survived by virtue of a good and loyal wife, Esther Yocheved, and a message that she gave me. She was my Mora, a very close member of Chabad in Toronto, and she taught me Yera Shamayim. That saved me. The other aspect of it, which I didn't know until about 20 years into my incarceration, was the following. I asked why it was that the. The Rabbi really never mentioned my name. And since the. My Rav and the Rebbe were close friends, he said he would ask it. So on his return visit to me, he related the following story. That when he asked the Rebbe why he hadn't mentioned my name anywhere, this is what the Rebbe told him. That as far as he was concerned, I was completing Yosef at Sadib's sente
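
For long recordings it can help to prefix each utterance with the time at which it starts. The sketch below assumes each utterance exposes a `start` attribute in milliseconds, like the word objects shown in this notebook. The names `millis_to_hms` and `format_utterances` are hypothetical, and the sample utterances are stand-in data, not real transcript output.

```python
from types import SimpleNamespace


def millis_to_hms(millis: int) -> str:
    """Convert milliseconds to an HH:MM:SS string."""
    seconds = millis // 1000
    hours = seconds // 3600
    minutes = (seconds % 3600) // 60
    return f'{hours:02}:{minutes:02}:{seconds % 60:02}'


def format_utterances(utterances) -> str:
    """Render one line per utterance: [HH:MM:SS] Speaker X: text."""
    return '\n'.join(
        f'[{millis_to_hms(u.start)}] Speaker {u.speaker}: {u.text}'
        for u in utterances
    )


# Stand-in objects with the attributes the function relies on.
sample_utterances = [
    SimpleNamespace(speaker='A', start=1440, text='Good morning.'),
    SimpleNamespace(speaker='B', start=3725000, text='Thank you.'),
]
print(format_utterances(sample_utterances))
```

With a real transcript, the call would be `format_utterances(transcript.utterances)`, and the result can be written to the `.txt` file instead of the plain `speaker: text` lines.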

## Working with the transcript

In [2]:
# load the transcript by id if necessary
job_id = '1dc162c7-0114-4698-be0a-8259ad1e7edf'

transcript = aai.Transcript.get_by_id(job_id)
transcript.id

'1dc162c7-0114-4698-be0a-8259ad1e7edf'

### Search for words

In [16]:
def convert_millis(millis):
    seconds = millis // 1000
    hours = seconds // 3600
    minutes = (seconds % 3600) // 60
    seconds = seconds % 60
    return f"{hours:02}:{minutes:02}:{seconds:02}"


def find_words(query):
    matches = transcript.word_search(query.split())
    for match in matches:
        print(f'{match.text}: ', end='')
        print('; '.join([convert_millis(start) for start, end in match.timestamps]))


def find_sequence(query):
    """Return start times (ms) where the first query word is directly followed by another query word.

    Only adjacent pairs are checked, so this is meant for two-word queries.
    """
    ordered_words = [word.lower() for word in query.split()]
    matches = transcript.word_search(ordered_words)
    # Pick out the match for the first word; the rest are possible followers.
    match0 = next((m for m in matches if m.text == ordered_words[0]), None)
    if match0 is None:
        print(f'No match for "{query}"')
        return []
    followers = {i for m in matches if m is not match0 for i in m.indexes}
    return [start
            for index, (start, _end) in zip(match0.indexes, match0.timestamps)
            if index + 1 in followers]
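
`find_sequence` only checks whether the word right after the first query word is one of the other query words. A sketch of a stricter variant that matches a phrase of any length is shown below. It works directly on a word list such as `transcript.words` rather than on `word_search`, compares lower-cased text with surrounding punctuation stripped, and uses the hypothetical names `normalize` and `find_phrase`. The sample words are copied from the `transcript.words` output in this notebook.

```python
import string
from types import SimpleNamespace


def normalize(token: str) -> str:
    """Lower-case a token and strip punctuation from both ends."""
    return token.lower().strip(string.punctuation)


def find_phrase(words, query):
    """Return the start time (ms) of every exact occurrence of `query`.

    `words` is any sequence of objects with `.text` and `.start` attributes.
    """
    target = [normalize(w) for w in query.split()]
    if not target:
        return []
    texts = [normalize(w.text) for w in words]
    n = len(target)
    return [
        words[i].start
        for i in range(len(texts) - n + 1)
        if texts[i:i + n] == target
    ]


# Stand-in word objects carrying only the attributes used above.
sample_words = [
    SimpleNamespace(text=t, start=s)
    for t, s in [('Good', 1440), ('morning', 1552), ('and', 1760),
                 ('welcome', 1936), ('to', 2168), ('Worldwide', 2336),
                 ('Wisdom.', 2808)]
]
print(find_phrase(sample_words, 'worldwide wisdom'))
```

The trade-off is that this scans the whole word list locally instead of relying on the `word_search` results, which is simple but does linear work per query.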

In [42]:
query = 'pro'
find_words(query)

pro: 00:25:33


In [14]:
query = 'overrule God'
matches = transcript.word_search(query.split())
matches

[WordSearchMatch(text='overrule', count=8, timestamps=[(155652, 156124), (158766, 159158), (323482, 323986), (329482, 329778), (526892, 527460), (1310364, 1310876), (1398862, 1399462), (1421294, 1421766)], indexes=[383, 391, 800, 817, 1438, 3897, 4189, 4249]),
 WordSearchMatch(text='god', count=62, timestamps=[(20664, 20984), (156172, 156800), (159174, 159366), (161486, 161686), (170430, 170646), (190434, 190666), (234720, 234984), (324018, 324274), (329794, 329986), (394814, 395046), (401282, 401466), (405602, 405834), (414562, 414746), (415762, 415946), (416946, 417194), (431460, 431932), (437780, 438044), (492546, 492746), (494514, 494890), (527540, 527844), (538984, 539360), (552664, 552944), (574656, 574856), (600348, 600644), (694018, 694178), (695754, 695986), (696666, 696818), (701882, 702370), (724746, 725106), (730666, 730914), (773528, 773728), (861360, 861664), (864048, 864256), (870208, 870416), (988096, 988296), (991316, 991468), (1200312, 1200448), (1236932, 1237560), (1

In [19]:
starts = find_sequence(query)
starts

[155652, 158766, 323482, 329482, 526892, 1310364]

In [22]:
for i, word in enumerate(transcript.words):
    if word.start in starts:
        print(f'{word.start} {word.text} {transcript.words[i+1].text}')

155652 overrule God.
158766 overrule God?
323482 overrule God,
329482 overrule God,
526892 overrule God
1310364 overrule God.
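
The loop above prints only the matched word and the one after it, and `transcript.words[i+1]` would raise an `IndexError` if a hit were the last word. A sketch that returns a few words of context on each side and stays within bounds is shown below. The name `show_context` and its `before`/`after` parameters are hypothetical, and the sample words are copied from the `transcript.words` output in this notebook.

```python
from types import SimpleNamespace


def show_context(words, starts, before=3, after=3):
    """Return one snippet per hit: the words surrounding each start time."""
    wanted = set(starts)
    snippets = []
    for i, word in enumerate(words):
        if word.start in wanted:
            lo = max(0, i - before)              # clamp at the first word
            hi = min(len(words), i + after + 1)  # clamp at the last word
            snippets.append(' '.join(w.text for w in words[lo:hi]))
    return snippets


# Stand-in word objects carrying only the attributes used above.
context_words = [
    SimpleNamespace(text=t, start=s)
    for t, s in [('Good', 1440), ('morning', 1552), ('and', 1760),
                 ('welcome', 1936), ('to', 2168), ('Worldwide', 2336),
                 ('Wisdom.', 2808)]
]
for snippet in show_context(context_words, [1936], before=1, after=2):
    print(snippet)
```

With a real transcript, the call would be `show_context(transcript.words, starts)`.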


In [20]:
transcript.words

[Word(text='Good', start=1440, end=1552, confidence=0.99349, speaker=None, channel=None),
 Word(text='morning', start=1552, end=1736, confidence=0.99996, speaker=None, channel=None),
 Word(text='and', start=1760, end=1912, confidence=0.99584, speaker=None, channel=None),
 Word(text='welcome', start=1936, end=2136, confidence=0.67908, speaker=None, channel=None),
 Word(text='to', start=2168, end=2312, confidence=0.99951, speaker=None, channel=None),
 Word(text='Worldwide', start=2336, end=2776, confidence=0.54569, speaker=None, channel=None),
 Word(text='Wisdom.', start=2808, end=3192, confidence=0.98676, speaker=None, channel=None),
 Word(text='Folks.', start=3256, end=3544, confidence=0.99498, speaker=None, channel=None),
 Word(text='Today', start=3592, end=3800, confidence=0.99831, speaker=None, channel=None),
 Word(text="we're", start=3840, end=4008, confidence=0.97077, speaker=None, channel=None),
 Word(text='learning', start=4024, end=4264, confidence=0.99088, speaker=None, channe