# Annotate
Download the audio of a YouTube video with yt-dlp, then transcribe it with OpenAI's Whisper to create captions with timestamps.

Thanks: [yt-dlp](https://github.com/yt-dlp/yt-dlp), [Whisper](https://github.com/openai/whisper)

In [None]:
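# Dependencies
# ffmpeg must also be installed and on PATH: yt-dlp's FFmpegExtractAudio
# postprocessor and Whisper's audio loading both call it.
%pip install openai-whisper
%pip install yt-dlp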

In [13]:
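# Settings for the YouTube downloader.
# CLI-style names such as "extract-audio" and "audio-format" are not YoutubeDL
# options, so they are omitted; the postprocessor below does the extraction.
# Adding "format": "bestaudio/best" would avoid downloading the video stream.
YDL_OPTS = {
    "noplaylist": True,
    "youtube_include_dash_manifest": False,
    "postprocessors": [{  # Extract audio using ffmpeg
        "key": "FFmpegExtractAudio",
        "preferredcodec": "mp3",
    }],
}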

In [21]:
from yt_dlp import YoutubeDL

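def extract_audio(filename: str, file_ext: str, url: str) -> None:
    """Download `url` and convert its audio to `<filename>.<file_ext>` with ffmpeg."""
    # Build a per-call copy so the module-level YDL_OPTS is not mutated.
    opts = {
        **YDL_OPTS,
        "outtmpl": filename,
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": file_ext}],
    }

    with YoutubeDL(opts) as ydl:
        ydl.download([url])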

In [20]:
extract_audio(filename="test_audio", file_ext="mp3", url="https://www.youtube.com/watch?v=z6xslDMimME")

[youtube] Extracting URL: https://www.youtube.com/watch?v=z6xslDMimME
[youtube] z6xslDMimME: Downloading webpage
[youtube] z6xslDMimME: Downloading android player API JSON
[info] z6xslDMimME: Downloading 1 format(s): 22
[download] test_audio has already been downloaded
[download] 100% of  382.88MiB
[ExtractAudio] Destination: test_audio.mp3
Deleting original file test_audio (pass -k to keep)


In [4]:
import whisper

model = whisper.load_model("base")
result = model.transcribe("test_audio.mp3")


In [5]:
result.keys()

dict_keys(['text', 'segments', 'language'])

In [6]:
result.get("segments")

[{'id': 0,
  'seek': 0,
  'start': 0.0,
  'end': 8.86,
  'text': ' Yes.',
  'tokens': [1079, 13],
  'temperature': 1.0,
  'avg_logprob': -2.610046068827311,
  'compression_ratio': 0.3333333333333333,
  'no_speech_prob': 0.1942272186279297},
 {'id': 1,
  'seek': 886,
  'start': 8.86,
  'end': 33.620000000000005,
  'text': ' So now I tweaked this.',
  'tokens': [407, 586, 286, 6986, 7301, 341, 13],
  'temperature': 1.0,
  'avg_logprob': -1.1685914126309482,
  'compression_ratio': 0.7333333333333333,
  'no_speech_prob': 0.18196281790733337},
 {'id': 2,
  'seek': 3362,
  'start': 33.62,
  'end': 35.62,
  'text': ' Oh, yo!',
  'tokens': [876, 11, 5290, 0],
  'temperature': 0.0,
  'avg_logprob': -0.38798428924990375,
  'compression_ratio': 1.4566473988439306,
  'no_speech_prob': 0.23740297555923462},
 {'id': 3,
  'seek': 3362,
  'start': 35.62,
  'end': 40.62,
  'text': " Alright, alright boys, we're good!",
  'tokens': [2798, 11, 5845, 6347, 11, 321, 434, 665, 0],
  'temperature': 0.0,
  'a
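
Each segment carries `avg_logprob` and `no_speech_prob` scores, which can be used to drop shaky segments before writing captions. The following is a minimal sketch, not part of the original notebook: the helper name `keep_confident` and both cutoff values are illustrative assumptions, and the sample data is copied from the first three segments above so the code runs without a model.

```python
def keep_confident(segments, min_avg_logprob=-1.0, max_no_speech_prob=0.6):
    """Return the segments whose scores pass both (illustrative) cutoffs."""
    return [
        seg
        for seg in segments
        if seg["avg_logprob"] >= min_avg_logprob
        and seg["no_speech_prob"] <= max_no_speech_prob
    ]


# First three segments from the output above, reduced to the fields used here.
sample_segments = [
    {"id": 0, "text": " Yes.",
     "avg_logprob": -2.610046068827311, "no_speech_prob": 0.1942272186279297},
    {"id": 1, "text": " So now I tweaked this.",
     "avg_logprob": -1.1685914126309482, "no_speech_prob": 0.18196281790733337},
    {"id": 2, "text": " Oh, yo!",
     "avg_logprob": -0.38798428924990375, "no_speech_prob": 0.23740297555923462},
]

kept = keep_confident(sample_segments)
print([seg["id"] for seg in kept])
```

With these cutoffs only segment 2 survives, since segments 0 and 1 both have an `avg_logprob` below -1.0. Suitable thresholds depend on the audio and the model size, so treat the defaults here as a starting point.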

In [15]:
from datetime import timedelta
import csv

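# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("output.csv", "w", newline="", encoding="utf-8") as file:
    w = csv.writer(file)
    for seg in result.get("segments", []):
        start = timedelta(seconds=seg["start"])
        end = timedelta(seconds=seg["end"])
        text = seg["text"].lstrip()  # segment text starts with a space
        w.writerow([start, end, text])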
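
Since the goal is captions, the same segments can also be rendered in SubRip (`.srt`) form, which video players read directly. This is a hedged sketch rather than part of the original notebook: `srt_timestamp` and `segments_to_srt` are hypothetical helper names, and the sample data is copied from the first two segments of the transcription output so the code runs without a model.

```python
def srt_timestamp(seconds: float) -> str:
    """Format an offset in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


# Two segments copied from the transcription output, reduced to the fields used.
sample_segments = [
    {"start": 0.0, "end": 8.86, "text": " Yes."},
    {"start": 8.86, "end": 33.620000000000005, "text": " So now I tweaked this."},
]
print(segments_to_srt(sample_segments))
```

In the notebook, `segments_to_srt(result["segments"])` could be written to a file such as `output.srt`; that file name is only an example.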

In [16]:
for seg in result.get("segments"):
    print(seg["text"])

 Yes.
 So now I tweaked this.
 Oh, yo!
 Alright, alright boys, we're good!
 I think this is gonna work!
 You know, I figured it was most appropriate to livestream here on Twitter instead of on Twitch.
 We're gonna figure out how Twitter works on Twitter.
 Should we close the closet door?
 So those of you all don't follow me on Twitch.
 I'm a streamer.
 Twitter's not Instagram, so that's not copyrighted.
 If you wanna post it, it's entire you can. No clips, obviously. You know my rules.
 But let's go!
 Okay, so the way that I got this by the way, my friend who is watching my Twitter text me is like,
 yo, I can get you media studio.
 Yeah, so we got media studio not through any legitimate channels.
 So Elon, you know, this is free money. I would have happily clicked BIM.
 Alright, so we can, I may not be S here, so I can go like this.
 And yeah, I think we should be good. That should be why it's green.
 Alright, cool.
 We're gonna move OES over here, and then we have a nice chrome with, 