In [1]:
from openai import OpenAI

openai_client = OpenAI()

def llm(user_prompt, instructions=None, model="gpt-4o-mini"):
    """Send a prompt (with optional system instructions) and return the text reply."""
    messages = []

    if instructions:
        messages.append({
            "role": "system",
            "content": instructions
        })

    messages.append({
        "role": "user",
        "content": user_prompt
    })

    response = openai_client.responses.create(
        model=model,
        input=messages
    )

    return response.output_text

In [2]:
from youtube_transcript_api import YouTubeTranscriptApi

In [3]:
video_id = 'ph1PxZIkz1o'

In [4]:
ytt_api = YouTubeTranscriptApi()
transcript = ytt_api.fetch(video_id)

In [5]:
import pickle

In [6]:
# Load a transcript saved earlier, so the notebook can run without re-fetching it
with open(f'{video_id}.bin', 'rb') as f_in:
    transcript = pickle.load(f_in)
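
The cell above assumes `{video_id}.bin` already exists; the notebook does not show how that file was written. Below is a minimal sketch of one way to cache the transcript, assuming it was saved with `pickle.dump`. The helper name `load_or_fetch` and its `fetch_fn` argument are hypothetical and not part of `youtube_transcript_api`.

```python
import pickle
from pathlib import Path


def load_or_fetch(path, fetch_fn):
    """Return the cached object at `path`, or call `fetch_fn()` and cache its result.

    Hypothetical helper: `fetch_fn` is any zero-argument callable,
    for example `lambda: ytt_api.fetch(video_id)` in this notebook.
    """
    path = Path(path)

    if path.exists():
        # Only unpickle files you created yourself: pickle can execute arbitrary code.
        with path.open('rb') as f_in:
            return pickle.load(f_in)

    # Cache miss: fetch once and store the result for later runs.
    result = fetch_fn()
    with path.open('wb') as f_out:
        pickle.dump(result, f_out)
    return result
```

With this in place, `transcript = load_or_fetch(f'{video_id}.bin', lambda: ytt_api.fetch(video_id))` would hit the network only on the first run, assuming the fetched transcript object can be pickled (the load above suggests it can).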

In [7]:
transcript[:10]

[FetchedTranscriptSnippet(text='So hi everyone. Uh today we are going to', start=0.0, duration=5.04),
 FetchedTranscriptSnippet(text='talk about our upcoming course. The', start=2.96, duration=3.52),
 FetchedTranscriptSnippet(text='upcoming course is called machine', start=5.04, duration=5.92),
 FetchedTranscriptSnippet(text='learning zoom camp. And um this is', start=6.48, duration=5.92),
 FetchedTranscriptSnippet(text='already I put the link in the', start=10.96, duration=3.599),
 FetchedTranscriptSnippet(text="description. So if you're watching um", start=12.4, duration=4.719),
 FetchedTranscriptSnippet(text="this video in recording or you're", start=14.559, duration=4.88),
 FetchedTranscriptSnippet(text='watching it live, you go here in the', start=17.119, duration=4.561),
 FetchedTranscriptSnippet(text='description after under this video and', start=19.439, duration=5.6),
 FetchedTranscriptSnippet(text='then you see a link course. uh click on', start=21.68, duration=6.24)]

In [8]:
def format_timestamp(seconds: float) -> str:
    """Convert seconds to H:MM:SS if at least one hour, else M:SS."""
    total_seconds = int(seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)

    if hours > 0:
        return f"{hours}:{minutes:02}:{secs:02}"
    else:
        return f"{minutes}:{secs:02}"

def make_subtitles(transcript) -> str:
    lines = []

    for entry in transcript:
        ts = format_timestamp(entry.start)
        text = entry.text.replace('\n', ' ')
        lines.append(ts + ' ' + text)

    return '\n'.join(lines)
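
A quick way to check the two helpers without a real transcript is to run them on a few hand-made entries. The sketch below repeats the functions so it runs on its own; `Snippet` is a hypothetical stand-in for `FetchedTranscriptSnippet`, assuming only the `text` and `start` attributes are needed.

```python
from collections import namedtuple

# Hypothetical stand-in for FetchedTranscriptSnippet (only the fields used here).
Snippet = namedtuple('Snippet', ['text', 'start', 'duration'])


def format_timestamp(seconds: float) -> str:
    """Convert seconds to H:MM:SS if at least one hour, else M:SS."""
    total_seconds = int(seconds)  # fractional seconds are truncated, not rounded
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}:{minutes:02}:{secs:02}"
    return f"{minutes}:{secs:02}"


def make_subtitles(transcript) -> str:
    """Render one 'timestamp text' line per transcript entry."""
    lines = []
    for entry in transcript:
        lines.append(format_timestamp(entry.start) + ' ' + entry.text.replace('\n', ' '))
    return '\n'.join(lines)


sample = [
    Snippet('So hi everyone.', 0.0, 5.04),
    Snippet('then you see\na link', 21.68, 6.24),
    Snippet('late in the stream', 3723.9, 2.0),
]

print(make_subtitles(sample))
# 0:00 So hi everyone.
# 0:21 then you see a link
# 1:02:03 late in the stream
```

Note that 21.68 becomes `0:21`: the conversion truncates, which matches the `0:21` line in the real output.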

In [9]:
subtitles = make_subtitles(transcript)

In [10]:
print(subtitles[:500])

0:00 So hi everyone. Uh today we are going to
0:02 talk about our upcoming course. The
0:05 upcoming course is called machine
0:06 learning zoom camp. And um this is
0:10 already I put the link in the
0:12 description. So if you're watching um
0:14 this video in recording or you're
0:17 watching it live, you go here in the
0:19 description after under this video and
0:21 then you see a link course. uh click on
0:25 that link and this bring you will bring
0:27 you to
0:29 this website this GitHub


In [11]:
instructions = """
Summarize the transcript and describe the main purpose of the video
and the main ideas. 

Also output chapters with timestamps. Use sentence case, not Title Case, for chapter titles.

Output format: 

<OUTPUT>
Summary

timestamp chapter 
timestamp chapter
...
timestamp chapter
</OUTPUT>

Don't include <OUTPUT> in the output
"""

In [12]:
answer = llm(subtitles, instructions=instructions)

In [14]:
print(answer)

Summary

The video discusses the upcoming "Machine Learning Zoom Camp" course, which is set to begin on September 15th. The presenter explains the course structure, updates to the content, and answers several questions from potential participants. The course is designed for both aspiring machine learning engineers and data scientists and focuses on practical skills, particularly in machine learning engineering.

Main ideas include:
- The course consists of modules that cover foundational topics and practical applications.
- Some course materials have been updated to reflect current technologies, including the shift from TensorFlow to PyTorch in certain modules.
- Participants are encouraged to be comfortable with programming and command line usage.
- The course will include projects that can enhance participants' portfolios, which will help them in job searches.
- The course does not provide job placement but has a high success rate for participants finding employment after completion.

In [15]:
from pydantic import BaseModel

In [16]:
class Chapter(BaseModel):
    timestamp: str
    title: str

class YTSummaryResponse(BaseModel):
    summary: str
    chapters: list[Chapter]


In [17]:
def llm_structured(instructions, user_prompt, output_type, model="gpt-4o-mini"):
    """Like llm(), but parse the reply into the Pydantic model `output_type`."""
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_prompt}
    ]

    response = openai_client.responses.parse(
        model=model,
        input=messages,
        text_format=output_type
    )

    return response.output_parsed

In [18]:
summary = llm_structured(
    instructions=instructions,
    user_prompt=subtitles,
    output_type=YTSummaryResponse
)

In [19]:
print(summary.summary)
print()
for c in summary.chapters:
    print(c.timestamp, c.title)

The video discusses an upcoming course titled "Machine Learning Zoom Camp," detailing its structure, content, and the prerequisites for enrolment. It emphasizes that the course is designed for aspiring ML engineers rather than data scientists, with a focus on practical skills in machine learning and deployment. The course features a mix of updated and existing materials across ten modules, four of which will be refreshed. Participants are encouraged to ask questions via a linked platform, and answers regarding job placements, course requisites, and the importance of prior programming knowledge are addressed. The course is set to commence on September 15th, with a potential for job readiness after completion, along with certification for those who fulfill project requirements.

0:00 Introduction to the machine learning zoom camp
2:38 Course updates and structure
5:25 Job placement opportunities
8:10 Prerequisites for the course
11:37 Focus on machine learning fundamentals
14:32 Importan

## RAG

In [20]:
print(subtitles[:1000])

0:00 So hi everyone. Uh today we are going to
0:02 talk about our upcoming course. The
0:05 upcoming course is called machine
0:06 learning zoom camp. And um this is
0:10 already I put the link in the
0:12 description. So if you're watching um
0:14 this video in recording or you're
0:17 watching it live, you go here in the
0:19 description after under this video and
0:21 then you see a link course. uh click on
0:25 that link and this bring you will bring
0:27 you to
0:29 this website this GitHub page.
0:34 This GitHub page is the main entry point
0:36 to our course and um yeah I think it's
0:41 more or less self-explanatory. If you
0:43 want to sign up this is the button you
0:45 click and the actual course starts in on
0:48 September 15th. it means that it's uh
0:51 slightly less than one one month before
0:53 the course starts and the purpose of
0:55 today's um session is to just answer
0:58 your questions. So you have some
1:00 questions and uh you can ask these
1:03 questions using

In [21]:
def sliding_window(seq, size, step):
    """Create overlapping chunks using sliding window approach."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []

    for i in range(0, n, step):
        batch = seq[i:i+size]
        result.append(batch)
        if i + size >= n:
            break

    return result
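
To see how `size` and `step` interact, it helps to run the function on a short list of integers. The sketch below repeats `sliding_window` so it runs on its own; the input values are arbitrary.

```python
def sliding_window(seq, size, step):
    """Create overlapping chunks using sliding window approach."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []

    for i in range(0, n, step):
        result.append(seq[i:i + size])
        # Stop once a window reaches the end, so no redundant tail chunks are added.
        if i + size >= n:
            break

    return result


items = list(range(10))

# step < size: consecutive chunks share (size - step) items.
overlapping = sliding_window(items, size=4, step=2)
print(overlapping)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]

# step == size: no overlap, and the last chunk may be shorter than size.
disjoint = sliding_window(items, size=4, step=4)
print(disjoint)
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With `size=60` and `step=30`, as used later in this notebook, each chunk shares half of its entries with the next one, so a sentence cut at one chunk boundary appears whole in a neighbouring chunk.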

In [22]:
chunk = transcript[:10]

In [23]:
def join_lines(transcript) -> str:
    """Join transcript entries into continuous text."""
    lines = []

    for entry in transcript:
        text = entry.text.replace('\n', ' ')
        lines.append(text)

    return ' '.join(lines)


def format_chunk(chunk):
    """Format a chunk with start/end timestamps and text."""
    time_start = format_timestamp(chunk[0].start)
    # Start time of the last snippet; its duration is not added
    time_end = format_timestamp(chunk[-1].start)
    text = join_lines(chunk)

    return {
        'start': time_start,
        'end': time_end,
        'text': text
    }

In [24]:
chunks = []

for chunk in sliding_window(transcript, 60, 30):
    processed = format_chunk(chunk)
    chunks.append(processed)

In [25]:
print(f"Created {len(chunks)} chunks")

Created 46 chunks


In [26]:
from minsearch import Index

index = Index(text_fields=["text"])
index.fit(chunks)

<minsearch.minsearch.Index at 0x7f6506add4f0>

In [27]:
results = index.search('Can I find a job after the course?', num_results=5)

In [29]:

import json

def search(query):
    """Search for relevant documents."""
    return index.search(
        query=query,
        num_results=15
    )

instructions = """
Answer the QUESTION based on the CONTEXT from the subtitles of a YouTube video.

Use only the facts from the CONTEXT when answering the QUESTION.

When answering the question,
provide the citation in the form of a video URL pointing at the timestamp where
this is discussed. If the question is discussed in multiple documents,
cite all of them.

Don't use markdown or any formatting in the output.
""".strip()

prompt_template = """
<VIDEO_ID>
{video_id}
</VIDEO_ID>

<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>
""".strip()

def build_prompt(question, search_results):
    context = json.dumps(search_results)
    return prompt_template.format(
        question=question,
        context=context,
        video_id=video_id
    ).strip()

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    response = llm(prompt, instructions=instructions)
    return response
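
It can be useful to inspect the exact prompt the model receives before calling the API. The sketch below repeats the template and `build_prompt` so it runs on its own, and renders them with a hand-made search result. The `fake_results` entry is illustrative only (its `end` value is made up), and `build_prompt` takes `video_id` as an explicit argument here, unlike the notebook version, which reads the global.

```python
import json

prompt_template = """
<VIDEO_ID>
{video_id}
</VIDEO_ID>

<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>
""".strip()


def build_prompt(question, search_results, video_id):
    """Fill the template; the search results are serialized as one JSON array."""
    context = json.dumps(search_results)
    return prompt_template.format(
        question=question,
        context=context,
        video_id=video_id,
    ).strip()


# Illustrative stand-in for the output of search(); not real index results.
fake_results = [
    {
        'start': '0:00',
        'end': '0:21',
        'text': 'So hi everyone. Uh today we are going to talk about our upcoming course.',
    }
]

prompt = build_prompt('When does the course start?', fake_results, 'ph1PxZIkz1o')
print(prompt)
```

Because each chunk carries its `start` and `end` strings in the JSON, the model has the timestamps it needs to build citations.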

In [30]:
answer = rag('Can I find a job after the course?')
print(answer)

Yes, you can find a job after completing the course, but it largely depends on your effort and involvement. The course is designed to make participants job-ready, especially if they engage with the projects effectively. Additionally, many past participants have successfully found jobs after completion. However, the course does not offer job placement services, so finding a job may initially require proactive steps, like seeking volunteering positions to gain practical experience.

For further details, you can refer to the video's discussion around this topic at the following timestamps: 1:21 - 3:49, 31:59 - 34:46, and 53:54 - 56:16. Here is the link to the video: https://www.youtube.com/watch?v=ph1PxZIkz1o.
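
The answer above lists timestamp ranges but links to the video without a time offset. One possible fix, sketched below, is to build the timestamped URL in code instead of relying on the model. The helper names `timestamp_to_seconds` and `video_url` are hypothetical; the `&t=<seconds>s` query parameter is the usual way to link to a point in a YouTube video.

```python
def timestamp_to_seconds(ts: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' (as produced by format_timestamp) to seconds."""
    seconds = 0
    for part in ts.split(':'):
        # Each field is worth 60x the one to its right.
        seconds = seconds * 60 + int(part)
    return seconds


def video_url(video_id: str, ts: str) -> str:
    """Build a YouTube link that starts playback at the given timestamp."""
    return f"https://www.youtube.com/watch?v={video_id}&t={timestamp_to_seconds(ts)}s"


print(video_url('ph1PxZIkz1o', '31:59'))
# https://www.youtube.com/watch?v=ph1PxZIkz1o&t=1919s
```

One option would be to add such a URL to every chunk dictionary before indexing, so that the model only has to copy it from the context.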


In [32]:

answer = rag('What are the recommended companion books in the video? Give a list.')
print(answer)

The recommended companion book mentioned in the video is "Machine Learning Book." This book serves as the basis for the course and contains examples used throughout the course, although some sections are noted to be slightly outdated. 

For more details, you can refer to the video at the following timestamp: https://www.youtube.com/watch?v=ph1PxZIkz1o&t=12m55s.
