This notebook takes a YouTube video and processes its transcript in three sections.

Section 1 summarizes the video and returns the chapter timestamps as structured output.

Section 2 chunks the transcript, first with a hand-written sliding window and then with several LangChain text splitters.

Section 3 embeds and indexes the chunks, then runs RAG over them.

In [1]:
from openai import OpenAI
from pydantic import BaseModel
import json
import numpy as np
from tqdm.auto import tqdm
from dotenv import load_dotenv
load_dotenv()

openai_client = OpenAI()

In [2]:
from youtube_transcript_api import YouTubeTranscriptApi

video_id = 'ph1PxZIkz1o'

ytt_api = YouTubeTranscriptApi()
transcript = ytt_api.fetch(video_id)

In [3]:
type(transcript)

youtube_transcript_api._transcripts.FetchedTranscript

In [4]:
len(transcript)

1407

In [5]:
for i in range(5):
    print(transcript[i])

FetchedTranscriptSnippet(text='So hi everyone. Uh today we are going to', start=0.0, duration=5.04)
FetchedTranscriptSnippet(text='talk about our upcoming course. The', start=2.96, duration=3.52)
FetchedTranscriptSnippet(text='upcoming course is called machine', start=5.04, duration=5.92)
FetchedTranscriptSnippet(text='learning zoom camp. And um this is', start=6.48, duration=5.92)
FetchedTranscriptSnippet(text='already I put the link in the', start=10.96, duration=3.599)


In [6]:
# format the transcript into a single string
def format_timestamp(seconds: float) -> str:
    """Convert seconds to H:MM:SS from one hour onward, else M:SS."""
    total_seconds = int(seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)

    if hours > 0:
        return f"{hours}:{minutes:02}:{secs:02}"
    else:
        return f"{minutes}:{secs:02}"

def make_subtitles(transcript) -> str:
    lines = []

    for entry in transcript:
        ts = format_timestamp(entry.start)
        text = entry.text.replace('\n', ' ')
        lines.append(ts + ' ' + text)

    return '\n'.join(lines)
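
A quick sanity check of `format_timestamp` on a few hand-picked values can confirm both output formats. The inputs below are illustrative and not taken from the transcript; the function is repeated so the snippet runs on its own.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to H:MM:SS from one hour onward, else M:SS."""
    total_seconds = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}:{minutes:02}:{secs:02}"
    return f"{minutes}:{secs:02}"

# Illustrative inputs (hypothetical values, in seconds)
examples = [0.0, 59.9, 158.2, 3600.0, 3725.4]
formatted = [format_timestamp(s) for s in examples]
print(formatted)
# ['0:00', '0:59', '2:38', '1:00:00', '1:02:05']
```

Fractional seconds are truncated rather than rounded, so 59.9 seconds is shown as `0:59`.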


In [7]:

subtitles = make_subtitles(transcript)

In [8]:
print(subtitles[:500])

0:00 So hi everyone. Uh today we are going to
0:02 talk about our upcoming course. The
0:05 upcoming course is called machine
0:06 learning zoom camp. And um this is
0:10 already I put the link in the
0:12 description. So if you're watching um
0:14 this video in recording or you're
0:17 watching it live, you go here in the
0:19 description after under this video and
0:21 then you see a link course. uh click on
0:25 that link and this bring you will bring
0:27 you to
0:29 this website this GitHub


# Section 1: Summarization + Structured Output

In [9]:
class Chapter(BaseModel):
    timestamp: str
    title: str

class YTSummaryResponse(BaseModel):
    summary: str
    chapters: list[Chapter]
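
To see what these models enforce, here is a minimal sketch that validates a hand-written JSON payload against them. It assumes Pydantic v2 (`model_validate_json`, `model_json_schema`), and the payload is made up for illustration. It only shows the Pydantic side, not how the OpenAI SDK uses the schema internally.

```python
from pydantic import BaseModel, ValidationError

class Chapter(BaseModel):
    timestamp: str
    title: str

class YTSummaryResponse(BaseModel):
    summary: str
    chapters: list[Chapter]

# Hypothetical JSON payload in the shape the models describe
raw = '{"summary": "Course overview.", "chapters": [{"timestamp": "0:00", "title": "Introduction"}]}'
parsed = YTSummaryResponse.model_validate_json(raw)
print(parsed.chapters[0].title)

# A payload that misses a required field is rejected
try:
    YTSummaryResponse.model_validate_json('{"summary": "No chapters"}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)

# Top-level fields of the JSON schema derived from the model
schema = YTSummaryResponse.model_json_schema()
print(sorted(schema["properties"]))
```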

In [10]:
instructions = """
Summarize the transcript and describe the main purpose of the video
and the main ideas. 

Also output chapters with time. Use usual sentence case, not Title Case for the chapter.

More chapters is better than fewer chapters. Have a chapter at least every 3-5 minutes
""".strip()

messages = [
    {"role": "system", "content": instructions}, 
    {"role": "user", "content": subtitles}
]

response = openai_client.responses.parse(
    model='gpt-4o-mini',
    input=messages,
    text_format=YTSummaryResponse
)

In [11]:
response.output_parsed

YTSummaryResponse(summary="The video provides an overview of the upcoming 'Machine Learning Zoom Camp' course, detailing course structure, prerequisites, and participant engagement. It emphasizes the engineering focus of the course and clarifies that it’s free, with no job placement support. The host encourages interaction through a live Q&A format, addressing various participant questions regarding the course, tools, and outcomes. There’s a discussion on what students can expect in terms of content updates, prerequisites, and overall learning objectives, reiterating the course's aim to equip participants with practical ML skills.", chapters=[Chapter(timestamp='0:00', title='Introduction to the machine learning zoom camp course'), Chapter(timestamp='2:21', title='Course updates and structure'), Chapter(timestamp='5:30', title='Questions on job placement and ML engineering skills'), Chapter(timestamp='10:05', title='Prerequisites for the course'), Chapter(timestamp='15:20', title='In-de

In [12]:
print(response.output_parsed)

summary="The video provides an overview of the upcoming 'Machine Learning Zoom Camp' course, detailing course structure, prerequisites, and participant engagement. It emphasizes the engineering focus of the course and clarifies that it’s free, with no job placement support. The host encourages interaction through a live Q&A format, addressing various participant questions regarding the course, tools, and outcomes. There’s a discussion on what students can expect in terms of content updates, prerequisites, and overall learning objectives, reiterating the course's aim to equip participants with practical ML skills." chapters=[Chapter(timestamp='0:00', title='Introduction to the machine learning zoom camp course'), Chapter(timestamp='2:21', title='Course updates and structure'), Chapter(timestamp='5:30', title='Questions on job placement and ML engineering skills'), Chapter(timestamp='10:05', title='Prerequisites for the course'), Chapter(timestamp='15:20', title='In-depth content overvie

In [13]:
response.output_parsed.summary

"The video provides an overview of the upcoming 'Machine Learning Zoom Camp' course, detailing course structure, prerequisites, and participant engagement. It emphasizes the engineering focus of the course and clarifies that it’s free, with no job placement support. The host encourages interaction through a live Q&A format, addressing various participant questions regarding the course, tools, and outcomes. There’s a discussion on what students can expect in terms of content updates, prerequisites, and overall learning objectives, reiterating the course's aim to equip participants with practical ML skills."

In [14]:
response.output_parsed.chapters

[Chapter(timestamp='0:00', title='Introduction to the machine learning zoom camp course'),
 Chapter(timestamp='2:21', title='Course updates and structure'),
 Chapter(timestamp='5:30', title='Questions on job placement and ML engineering skills'),
 Chapter(timestamp='10:05', title='Prerequisites for the course'),
 Chapter(timestamp='15:20', title='In-depth content overview and focus areas'),
 Chapter(timestamp='20:55', title='Companion book and its relevance'),
 Chapter(timestamp='25:15', title='Expectations and outcomes by course end'),
 Chapter(timestamp='30:40', title='Certificates and homework deadlines'),
 Chapter(timestamp='36:05', title='Project structure and requirements'),
 Chapter(timestamp='41:40', title='Live interaction format and engagement details'),
 Chapter(timestamp='45:00', title='Final thoughts and course preparation tips')]
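
Since the chapter timestamps come back as strings, converting them to seconds makes them usable for things like deep links into the video. The sketch below is one possible way to do that. `timestamp_to_seconds` and `chapter_link` are hypothetical helpers that are not part of the notebook, and the link format assumes YouTube's `t=<seconds>s` query parameter.

```python
def timestamp_to_seconds(ts: str) -> int:
    """Parse 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in ts.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def chapter_link(video_id: str, ts: str) -> str:
    """Build a watch URL that starts playback at the given timestamp."""
    return f"https://www.youtube.com/watch?v={video_id}&t={timestamp_to_seconds(ts)}s"

video_id = "ph1PxZIkz1o"  # same video ID as in the notebook
for ts in ["0:00", "2:21", "45:00"]:
    print(ts, chapter_link(video_id, ts))
```

This is the inverse of `format_timestamp`, up to the fractional seconds that were dropped when formatting.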

In [15]:
summary = response.output_parsed

print(summary.summary)
print()
for c in summary.chapters:
    print(c.timestamp, c.title)

The video provides an overview of the upcoming 'Machine Learning Zoom Camp' course, detailing course structure, prerequisites, and participant engagement. It emphasizes the engineering focus of the course and clarifies that it’s free, with no job placement support. The host encourages interaction through a live Q&A format, addressing various participant questions regarding the course, tools, and outcomes. There’s a discussion on what students can expect in terms of content updates, prerequisites, and overall learning objectives, reiterating the course's aim to equip participants with practical ML skills.

0:00 Introduction to the machine learning zoom camp course
2:21 Course updates and structure
5:30 Questions on job placement and ML engineering skills
10:05 Prerequisites for the course
15:20 In-depth content overview and focus areas
20:55 Companion book and its relevance
25:15 Expectations and outcomes by course end
30:40 Certificates and homework deadlines
36:05 Project structure an

In [16]:
# wrap in a function

def llm_structured(instructions, user_prompt, output_format, model="gpt-4o-mini"):
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_prompt}
    ]

    response = openai_client.responses.parse(
        model=model,
        input=messages,
        text_format=output_format
    )

    return response.output_parsed

summary = llm_structured(
    instructions=instructions,
    user_prompt=subtitles,
    output_format=YTSummaryResponse
)

print(summary.summary)
print()
for c in summary.chapters:
    print(c.timestamp, c.title)

This video is an interlude discussing the upcoming Machine Learning Zoom Camp course, set to start on September 15. The session answers inquiries from potential students regarding course details, prerequisites, content scope, and job readiness after completion. The speaker emphasizes that the course focuses on essential skills for machine learning engineering, notably deployment, and clarifies that while previous content has been updated, foundational modules remain unchanged. Participants are encouraged to utilize provided resources and ask questions through Slido. A significant theme includes the differences between machine learning engineers and data scientists, as well as the importance of practical projects and programming experience.

0:00 Introduction to the course
1:00 Course structure and registration
2:50 Course history and updates
5:00 Job placement and engineering focus
10:10 Prerequisites for the course
14:30 Content overview: math and programming expectations
20:25 Audien

# Section 2: Chunking

## User-defined function (sliding window)

In [17]:
def sliding_window(seq, size, step):
    """Create overlapping chunks using sliding window approach."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        batch = seq[i:i+size]
        result.append(batch)
        if i + size >= n:
            break

    return result


def join_lines(transcript) -> str:
    """Join transcript entries into continuous text."""
    lines = []

    for entry in transcript:
        text = entry.text.replace('\n', ' ')
        lines.append(text)

    return ' '.join(lines)

def format_chunk(chunk):
    """Format a chunk with start/end timestamps and text."""
    time_start = format_timestamp(chunk[0].start)
    # 'end' is the start time of the last entry (its duration is not added)
    time_end = format_timestamp(chunk[-1].start)
    text = join_lines(chunk)

    return {
        'start': time_start,
        'end': time_end,
        'text': text
    }

    


In [18]:
chunks = []

# Experiment with different values: try (30, 10) for more granular chunks
for chunk in sliding_window(transcript, 60, 30):
    processed = format_chunk(chunk)
    chunks.append(processed)

print(f"Created {len(chunks)} chunks")

Created 46 chunks
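
A small worked example may make the window arithmetic easier to follow. The toy list is made up; the function is repeated from the cell above so the snippet runs on its own. The second call uses a plain list of 1407 integers as a stand-in for the transcript entries and reproduces the chunk count.

```python
def sliding_window(seq, size, step):
    """Create overlapping chunks using a sliding window."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    n = len(seq)
    result = []
    for i in range(0, n, step):
        result.append(seq[i:i + size])
        if i + size >= n:  # this window already reaches the end
            break
    return result

toy = list(range(10))
windows = sliding_window(toy, size=4, step=2)
print(windows)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]

# Same settings as the notebook: 1407 entries, size 60, step 30
count = len(sliding_window(list(range(1407)), 60, 30))
print(count)
# 46
```

With `step` at half of `size`, every entry except those at the very beginning and end lands in two chunks. The last window starts at index 1350 and holds only 57 entries.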


In [19]:
chunks[0]

{'start': '0:00',
 'end': '2:38',
 'text': "So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under this video and then you see a link course. uh click on that link and this bring you will bring you to this website this GitHub page. This GitHub page is the main entry point to our course and um yeah I think it's more or less self-explanatory. If you want to sign up this is the button you click and the actual course starts in on September 15th. it means that it's uh slightly less than one one month before the course starts and the purpose of today's um session is to just answer your questions. So you have some questions and uh you can ask these questions using uh you can ask your questions using the pinned link. So there's a pinned link in

In [20]:
chunks[:10]

[{'start': '0:00',
  'end': '2:38',
  'text': "So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under this video and then you see a link course. uh click on that link and this bring you will bring you to this website this GitHub page. This GitHub page is the main entry point to our course and um yeah I think it's more or less self-explanatory. If you want to sign up this is the button you click and the actual course starts in on September 15th. it means that it's uh slightly less than one one month before the course starts and the purpose of today's um session is to just answer your questions. So you have some questions and uh you can ask these questions using uh you can ask your questions using the pinned link. So there's a pinned link

## Chunk using langchain

In [22]:
clean_transcript = ""
for line in tqdm(transcript):
    clean_transcript += line.text + " "


In [23]:
clean_transcript

"So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under this video and then you see a link course. uh click on that link and this bring you will bring you to this website this GitHub page. This GitHub page is the main entry point to our course and um yeah I think it's more or less self-explanatory. If you want to sign up this is the button you click and the actual course starts in on September 15th. it means that it's uh slightly less than one one month before the course starts and the purpose of today's um session is to just answer your questions. So you have some questions and uh you can ask these questions using uh you can ask your questions using the pinned link. So there's a pinned link in the live chat. Click on that link um and t

### Semantic Chunker
Split the text at the points where the embedding similarity between adjacent sentences drops.

In [46]:
from sentence_transformers import SentenceTransformer
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

In [47]:
emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



In [51]:
# Prepare the input documents
# split_documents expects a list[Document], so wrap the raw transcript text in a single Document
# docs = [Document(page_content=line.text) for line in tqdm(transcript)]
docs = [Document(page_content=clean_transcript)]

In [52]:
docs

[Document(metadata={}, page_content="So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under this video and then you see a link course. uh click on that link and this bring you will bring you to this website this GitHub page. This GitHub page is the main entry point to our course and um yeah I think it's more or less self-explanatory. If you want to sign up this is the button you click and the actual course starts in on September 15th. it means that it's uh slightly less than one one month before the course starts and the purpose of today's um session is to just answer your questions. So you have some questions and uh you can ask these questions using uh you can ask your questions using the pinned link. So there's a pinned link in the li

In [53]:

model = SentenceTransformer("/models/all-MiniLM-L6-v2")

In [54]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [55]:
splitter = SemanticChunker(emb, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=95)
docs_chunk = splitter.split_documents(docs)

In [58]:
len(docs_chunk)

31
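
The idea behind the `percentile` breakpoint can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation (the real splitter works on real sentence embeddings and may combine neighbouring sentences before embedding them). The 2-D vectors and sentences below are made up, and `semantic_split` is a hypothetical helper.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def semantic_split(sentences, vectors, q=95):
    """Start a new chunk where the distance to the next sentence exceeds the q-th percentile."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # large jump in meaning: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Toy "embeddings": the topic changes after the third sentence
sentences = ["The course is free.", "You can sign up online.", "It starts in September.",
             "Docker is used for deployment.", "We also cover Kubernetes."]
vectors = [(1.0, 0.0), (0.98, 0.2), (0.95, 0.3), (0.1, 1.0), (0.0, 1.0)]
result = semantic_split(sentences, vectors)
print(result)
```

A higher percentile means fewer breakpoints and therefore fewer, longer chunks.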

In [59]:
docs_chunk[:3]

[Document(metadata={}, page_content="So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under this video and then you see a link course. uh click on that link and this bring you will bring you to this website this GitHub page. This GitHub page is the main entry point to our course and um yeah I think it's more or less self-explanatory. If you want to sign up this is the button you click and the actual course starts in on September 15th. it means that it's uh slightly less than one one month before the course starts and the purpose of today's um session is to just answer your questions. So you have some questions and uh you can ask these questions using uh you can ask your questions using the pinned link. So there's a pinned link in the li

### Recursive Chunker

In [60]:
clean_transcript

"So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under this video and then you see a link course. uh click on that link and this bring you will bring you to this website this GitHub page. This GitHub page is the main entry point to our course and um yeah I think it's more or less self-explanatory. If you want to sign up this is the button you click and the actual course starts in on September 15th. it means that it's uh slightly less than one one month before the course starts and the purpose of today's um session is to just answer your questions. So you have some questions and uh you can ask these questions using uh you can ask your questions using the pinned link. So there's a pinned link in the live chat. Click on that link um and t

In [61]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


splitter = RecursiveCharacterTextSplitter(
    chunk_size=80,
    chunk_overlap=20,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_text(clean_transcript)

In [63]:
len(chunks)

908

In [66]:
chunks[0]

'So hi everyone. Uh today we are going to talk about our upcoming course'

### Token TextSplitter

In [67]:
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=64,      # tokens
    chunk_overlap=16,   # tokens
    encoding_name="cl100k_base"  # for OpenAI tokenization; change if needed
)

chunks = splitter.split_text(clean_transcript)

In [69]:
len(chunks)

229
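
With `chunk_size=64` and `chunk_overlap=16`, consecutive chunks share 16 tokens, so each new chunk starts 48 tokens after the previous one. The sketch below illustrates that stride on stand-in token IDs. It assumes the splitter advances by `chunk_size - chunk_overlap` tokens and does not call the real tokenizer; `token_windows` is a hypothetical helper.

```python
def token_windows(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):  # reached the end of the text
            break
        start += stride
    return windows

tokens = list(range(200))  # stand-in token IDs, not real tokenizer output
windows = token_windows(tokens, chunk_size=64, chunk_overlap=16)
bounds = [(w[0], w[-1]) for w in windows]
print(bounds)
# [(0, 63), (48, 111), (96, 159), (144, 199)]
```

In terms of the earlier `sliding_window` function, this corresponds to `size=64` and `step=48`.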

In [70]:
chunks[0]

"So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under"

### MarkdownTextSplitter

In [71]:
from langchain_text_splitters import MarkdownTextSplitter

md = """# Title
Intro paragraph.

## Section A
Details for section A.

### Sub A1
Deep details.

## Section B
More content here.
"""

splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=20)
chunks = splitter.split_text(md)
for i, c in enumerate(chunks, 1):
    print(f"--- chunk {i} ---\n{c}\n")


--- chunk 1 ---
# Title
Intro paragraph.

## Section A
Details for section A.

### Sub A1
Deep details.

## Section B
More content here.



### HTMLHeaderTextSplitter

In [76]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html = """
<html><body>
<h1>Guide</h1><p>Overview paragraph.</p>
<h2>Install</h2><p>Steps to install...</p>
<h2>Usage</h2><p>Basic usage...</p>
<h3>Advanced</h3><p>Advanced usage...</p>
</body></html>
"""

splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "H1"), ("h2", "H2"), ("h3", "H3")])
docs = splitter.split_text(html)          # returns list[Document] with header metadata
for idx, d in enumerate(docs):
    print(idx)
    print(d.metadata, d.page_content[:80], "...\n")


0
{'H1': 'Guide'} Guide ...

1
{'H1': 'Guide'} Overview paragraph. ...

2
{'H1': 'Guide', 'H2': 'Install'} Install ...

3
{'H1': 'Guide', 'H2': 'Install'} Steps to install... ...

4
{'H1': 'Guide', 'H2': 'Usage'} Usage ...

5
{'H1': 'Guide', 'H2': 'Usage'} Basic usage... ...

6
{'H1': 'Guide', 'H2': 'Usage', 'H3': 'Advanced'} Advanced ...

7
{'H1': 'Guide', 'H2': 'Usage', 'H3': 'Advanced'} Advanced usage... ...



### SentenceTransformersTokenTextSplitter

In [77]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter


st_splitter = SentenceTransformersTokenTextSplitter(
    tokens_per_chunk=128,
    chunk_overlap=32,
    model_name="sentence-transformers/all-MiniLM-L6-v2"  # matches your embeddings model
)

chunks = st_splitter.split_text(clean_transcript)
print(len(chunks), "chunks")


118 chunks


In [78]:
chunks[0]

"so hi everyone. uh today we are going to talk about our upcoming course. the upcoming course is called machine learning zoom camp. and um this is already i put the link in the description. so if you ' re watching um this video in recording or you ' re watching it live, you go here in the description after under this video and then you see a link course. uh click on that link and this bring you will bring you to this website this github page. this github page is the main entry point to our course and um yeah i think it ' s more or less self - explanatory."

# Section 3: RAG

In [79]:
openai_client = OpenAI()

def llm(user_prompt, instructions=None, model="gpt-4o-mini"):
    messages = []

    if instructions:
        messages.append({
            "role": "system",
            "content": instructions
        })

    messages.append({
        "role": "user",
        "content": user_prompt
    })

    response = openai_client.responses.create(
        model=model,
        input=messages
    )

    return response.output_text

In [81]:
instructions = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the video transcript.
Use only the facts from the CONTEXT when answering the QUESTION.
""".strip()

prompt_template = """
<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>
""".strip()

def build_prompt(question, search_results):
    search_json = json.dumps(search_results)
    return prompt_template.format(
        question=question,
        context=search_json
    )
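
To check what the model will actually receive, the prompt can be rendered with a couple of stand-in search results. The two result strings below are hypothetical; the template and function are repeated so the snippet runs on its own.

```python
import json

prompt_template = """
<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>
""".strip()

def build_prompt(question, search_results):
    # The retrieved chunks are serialized as a JSON list inside the CONTEXT tags
    search_json = json.dumps(search_results)
    return prompt_template.format(question=question, context=search_json)

# Hypothetical search results standing in for retrieved transcript chunks
sample_results = ["the course starts on september 15th.", "the course is free."]
prompt = build_prompt("When does the course start?", sample_results)
print(prompt)
```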

In [87]:
embedding_model = SentenceTransformer("/models/all-MiniLM-L6-v2")

In [88]:
chunks[:5]

["so hi everyone. uh today we are going to talk about our upcoming course. the upcoming course is called machine learning zoom camp. and um this is already i put the link in the description. so if you ' re watching um this video in recording or you ' re watching it live, you go here in the description after under this video and then you see a link course. uh click on that link and this bring you will bring you to this website this github page. this github page is the main entry point to our course and um yeah i think it ' s more or less self - explanatory.",
 ". this github page is the main entry point to our course and um yeah i think it ' s more or less self - explanatory. if you want to sign up this is the button you click and the actual course starts in on september 15th. it means that it ' s uh slightly less than one one month before the course starts and the purpose of today ' s um session is to just answer your questions. so you have some questions and uh you can ask these quest

In [89]:
embeddings = []

for d in tqdm(chunks):
    text = d
    v = embedding_model.encode(text)
    embeddings.append(v)


In [98]:
len(embeddings)

118

In [97]:
embeddings[0]

array([-5.18129133e-02, -8.77601802e-02, -2.62539368e-02, -1.92901120e-02,
        7.45076165e-02, -4.01848666e-02,  2.01712102e-02, -2.15604194e-02,
       -8.17215517e-02, -4.33636317e-03,  1.82826463e-02,  7.34690055e-02,
       -2.27548908e-02, -1.36467936e-02, -4.55070324e-02, -2.74095288e-03,
        5.13295271e-02,  9.68081411e-03, -3.30747515e-02, -1.98295270e-03,
        9.73599330e-02, -3.03309113e-02,  2.40222998e-02,  6.49953485e-02,
        2.30945647e-02, -7.15019507e-03,  8.77129566e-03,  7.51350820e-02,
        4.12595235e-02, -2.70779766e-02, -2.62982957e-03,  5.35820089e-02,
        7.76336808e-03, -6.33357931e-03, -1.42535334e-02,  4.69800830e-02,
        8.39236844e-03, -6.42409846e-02, -7.03288689e-02, -1.45103391e-02,
       -4.33865264e-02,  2.15032287e-02,  6.24255799e-02,  4.60156649e-02,
        5.30798174e-02,  5.53228669e-02, -2.23573186e-02, -1.59607634e-01,
        2.53401540e-05, -7.27942307e-03, -1.23386234e-01, -1.17364287e-01,
       -8.57263133e-02, -

In [99]:
# convert the list of 1D embedding vectors to a 2D NumPy array
embeddings = np.array(embeddings)
embeddings.shape

(118, 384)

In [100]:
from minsearch import VectorSearch

In [102]:
vindex = VectorSearch()
vindex.fit(embeddings, chunks)

<minsearch.vector.VectorSearch at 0x1f38b445ac0>

In [103]:
def vector_search(question):
    q = embedding_model.encode(question)

    return vindex.search(
        q,
        num_results=5
    )

def rag(question):
    search_results = vector_search(question)
    user_prompt = build_prompt(question, search_results)
    return llm(user_prompt, instructions=instructions)
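
Conceptually, the retrieval step ranks the chunks by how similar their embedding is to the question embedding and keeps the best few. The sketch below shows that idea in plain Python with cosine similarity. It is not minsearch's implementation, and the 3-D vectors and document strings are made up to stand in for the 384-dimensional embeddings; `top_k` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), d) for v, d in zip(doc_vecs, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

# Made-up vectors standing in for real embeddings
docs = ["course start date", "homework deadlines", "companion book"]
doc_vecs = [(0.9, 0.1, 0.0), (0.1, 0.9, 0.1), (0.0, 0.2, 0.9)]
query_vec = (0.8, 0.3, 0.0)  # pretend embedding of "when does the course start?"
hits = top_k(query_vec, doc_vecs, docs)
print(hits)
# ['course start date', 'homework deadlines']
```

Because the embedding model shown earlier ends with a `Normalize()` layer, its vectors have unit length, so cosine similarity and the plain dot product give the same ranking.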

In [104]:
question = 'what is this video about?'
rag(question)

'The video is about an upcoming course called "Machine Learning Zoom Camp." It provides information on how to access the course materials, which include both new and old videos. The course is aimed at individuals interested in building trading tools. Additionally, there will be updates to the course modules and some new material. The video explains the structure of the course, including homework assignments and the flexibility in timing for completing the coursework.'