# yt
> Utilities for Content Creation From YouTube

In [9]:
#|default_exp yt

In [1]:
#|export
import re
from typing import Optional, Annotated
import typer
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
from hamel.gem import gem

## YouTube Chapter Creation

Automate the tedious work of writing a video summary and chapter timestamps for the description.

In [2]:
#|export
def yt_chapters(link):
    "Generate a YouTube summary and chapters from a public video."
    
    chapter_prompt="Generate a succinct video summary (1-2 sentences) followed by YouTube chapter timestamps for this video. Format each line of the chapter summaries as 'MM:SS - Chapter Title' (e.g., '02:30 - Introduction'). Start with 00:00. Include all major topics and transitions and be thorough - do not miss any important topics.  For the summary, do not say 'In this video, we will cover the following topics', 'This video discusses..' or anything like that. Instead, reference the main speaker's name if you know it.  If there is a Q&A Section, enumerate individual questions as additional chapters."
    return gem(prompt=chapter_prompt, o=link, model="gemini-2.5-pro")

This is what it looks like for Antoine's [Late Interaction Talk](https://youtu.be/1x3k0V2IITo):

In [3]:
chp = yt_chapters("https://youtu.be/1x3k0V2IITo")
print(chp)

In this presentation, Antoine Chaffin explains the inherent limitations of single-vector search, such as information loss from pooling and poor out-of-domain performance, and introduces late interaction (multi-vector) models as a superior solution. He demonstrates how these models excel in long-context and reasoning-intensive tasks and presents the PyLate library to make training and evaluating these powerful models more accessible.

00:00 - Introduction
00:32 - About Me
01:40 - Dense (Single) Vector Search Explained
03:08 - Single Vector Models: The Go-To for RAG
03:55 - Performance Evaluation & MTEB Leaderboard
04:17 - The BEIR Benchmark & Goodhart's Law
05:36 - Limitations Beyond Benchmarks: The Long Context Problem
06:33 - Limitations Beyond Benchmarks: Reasoning-Intensive Retrieval
07:50 - The Role of BM25
08:24 - Pooling: The Intrinsic Flaw of Dense Models
11:32 - Replacing Pooling with Late Interaction
12:17 - Why Not Just Use a Bigger Single Vector?
13:51 - Late Interaction: A 
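
The prompt asks for a short summary followed by `MM:SS - Chapter Title` lines, so the returned text can be sanity-checked before it is pasted into a description. The following is a minimal sketch, assuming the model follows that format; `parse_chapters` and `CHAPTER_RE` are hypothetical names and are not part of this module.

```python
import re

# Hypothetical helper, not exported by this module.
# Matches 'MM:SS - Title' and also 'H:MM:SS - Title' for longer videos.
CHAPTER_RE = re.compile(r'^(?:(\d+):)?(\d{1,2}):(\d{2}) - (.+)$')

def parse_chapters(text: str):
    "Split model output into (summary, [(start_seconds, title), ...])."
    summary_lines, chapters = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if m := CHAPTER_RE.match(line):
            h, mins, secs, title = m.groups()
            start = int(h or 0) * 3600 + int(mins) * 60 + int(secs)
            chapters.append((start, title))
        elif not chapters:
            # Anything before the first chapter line is treated as summary text.
            summary_lines.append(line)
    return ' '.join(summary_lines), chapters

sample = """Antoine Chaffin explains the limits of single-vector search.

00:00 - Introduction
00:32 - About Me
01:40 - Dense (Single) Vector Search Explained"""

summary, chapters = parse_chapters(sample)
starts = [s for s, _ in chapters]
# The prompt requires the first chapter to start at 00:00.
starts_at_zero = bool(starts) and starts[0] == 0
# Timestamps should also be in ascending order.
is_sorted = starts == sorted(starts)
```

Lines that do not match the pattern after the first chapter are ignored, so a truncated or malformed chapter line is dropped rather than raising an error.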

## Fetch YouTube Transcript

Fetch the English transcript of a public YouTube video, with a timestamp on each line.

In [25]:
#|export
def _extract_video_id(url: str) -> Optional[str]:
    """Extract YouTube video ID from various URL formats."""
    for pattern in [r'(?:youtube\.com/watch\?v=|youtu\.be/)([^&\n?#]+)', 
                    r'youtube\.com/embed/([^&\n?#]+)', 
                    r'youtube\.com/v/([^&\n?#]+)']:
        if match := re.search(pattern, url): return match.group(1)
    return url if re.match(r'^[a-zA-Z0-9_-]{11}$', url) else None

def _format_timestamp(seconds: float) -> str:
    """Convert seconds to HH:MM:SS format."""
    h, m, s = int(seconds // 3600), int((seconds % 3600) // 60), int(seconds % 60)
    return f"{h:02d}:{m:02d}:{s:02d}"
        
def _format_seconds(seconds: float) -> str:
    """Convert seconds to a whole-second string such as '125s'."""
    return f"{int(seconds)}s"

        
def transcribe(url, seconds_only=False):
    "Download the English transcript of a YouTube video as `[timestamp] text` lines."
    if not (video_id := _extract_video_id(url)): raise ValueError(f"Could not extract video ID from '{url}'")
    try: transcript_data = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
    # Re-raise as ValueError but keep the original exception chained for debugging
    except (TranscriptsDisabled, NoTranscriptFound) as e: raise ValueError(f"{e} for video: {video_id}") from e
    # `seconds_only=True` yields `[125s]`; the default yields `[00:02:05]`
    format_func = _format_seconds if seconds_only else _format_timestamp
    return '\n'.join(f"[{format_func(e['start'])}] {e['text']}" for e in transcript_data)
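
To make the URL handling concrete, here is a small sketch that runs the same regular expressions as `_extract_video_id` and the same arithmetic as `_format_timestamp` against a few inputs. The patterns are copied from the cell above so the snippet runs on its own; the un-prefixed function names and the sample URLs are illustrative only.

```python
import re
from typing import Optional

# Same patterns as `_extract_video_id`, copied so this cell is standalone.
PATTERNS = [
    r'(?:youtube\.com/watch\?v=|youtu\.be/)([^&\n?#]+)',
    r'youtube\.com/embed/([^&\n?#]+)',
    r'youtube\.com/v/([^&\n?#]+)',
]

def extract_video_id(url: str) -> Optional[str]:
    for pattern in PATTERNS:
        if match := re.search(pattern, url):
            return match.group(1)
    # Fall back to treating the input as a bare 11-character video ID.
    return url if re.match(r'^[a-zA-Z0-9_-]{11}$', url) else None

def format_timestamp(seconds: float) -> str:
    h, m, s = int(seconds // 3600), int((seconds % 3600) // 60), int(seconds % 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

examples = [
    "https://youtu.be/1x3k0V2IITo",                          # short link
    "https://www.youtube.com/watch?v=1x3k0V2IITo&t=42s",     # extra query params after v=
    "https://www.youtube.com/embed/1x3k0V2IITo",             # embed URL
    "1x3k0V2IITo",                                           # bare video ID
    "https://www.youtube.com/watch?feature=share&v=1x3k0V2IITo",  # v= is not the first param
    "https://example.com/not-a-video",                       # unrelated URL
]
ids = [extract_video_id(u) for u in examples]
stamp = format_timestamp(3725.9)  # fractional seconds are truncated, not rounded
```

Note that a watch URL whose `v=` parameter is not the first query parameter yields `None` with these patterns, which `transcribe` turns into a `ValueError`.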

In [26]:
t = transcribe("https://youtu.be/1x3k0V2IITo")

In [27]:
print(t[:500])

[00:00:00] Hello everyone, my name is Chapan and I
[00:00:02] am a research engineer at Leighton and
[00:00:05] today I will detail some of the limits
[00:00:08] of single vector search that have been
[00:00:10] highlighted by recent usages and
[00:00:13] evaluations and then I will introduce
[00:00:16] multi vector models also known as late
[00:00:18] interaction models and how they can
[00:00:21] overcome this and to finish I will
[00:00:24] briefly present the pilot library that
[00:00:26] al
