In [1]:
from openai import OpenAI
from pydantic import BaseModel
import json
import numpy as np
from tqdm.auto import tqdm
from dotenv import load_dotenv
load_dotenv()


from typing import List, Dict, Any, Optional
import requests 

In [2]:
openai_client = OpenAI()

In [None]:
## load webpage in a simple markdown format

import requests

def get_page_content(url):
    reader_url_prefix = 'https://r.jina.ai/'
    request_url = reader_url_prefix + url
    response = requests.get(request_url)
    return response.content.decode('utf8')

In [5]:
print(get_page_content('https://datatalks.club'))

Title: Welcome to DataTalks.Club

URL Source: https://datatalks.club/

Published Time: Thu, 16 Oct 2025 09:40:58 GMT

Markdown Content:
Welcome to DataTalks.Club


AI Dev Tools Zoomcamp: Learn AI-powered coding assistants and agents[Register here!](https://airtable.com/appJRFiWKHBgmEt70/shrpw7rk55Ewr1jCG)

DataTalks.Club
--------------

[Articles](https://datatalks.club/articles.html)[Slack](https://datatalks.club/slack.html)[Events](https://datatalks.club/events.html)[Podcast](https://datatalks.club/podcast.html)[Books](https://datatalks.club/books.html)[Courses](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html)

* * *

The place to talk about data

Global online community of data science professionals, ML engineers, and AI practitioners
-----------------------------------------------------------------------------------------

Subscribe to our weekly newsletter and join our Slack.

 We'll keep you informed about everything happening in the Club.

Email 


In [6]:
## create tool to fetch content of webpage in markdown
reader_url_prefix = "https://r.jina.ai/"

def get_page_content(url: str) -> Optional[str]:
    """
    Fetch the Markdown content of a web page using the Jina Reader service.

    This function prepends the Jina Reader proxy URL to the provided `url`,
    sends a GET request with a timeout, and decodes the response as UTF-8 text.

    Args:
        url (str): The URL of the page to fetch.

    Returns:
        Optional[str]: The Markdown-formatted content of the page if the request
        succeeds; otherwise, None.

    Raises:
        None: All network or decoding errors are caught, printed, and
               reported to the caller as a None return value.
    """
    reader_url = reader_url_prefix + url

    try:
        response = requests.get(reader_url, timeout=10)
        response.raise_for_status()  # raises for 4xx/5xx HTTP errors
        return response.content.decode("utf-8")
    except (requests.exceptions.RequestException, UnicodeDecodeError) as e:
        # Optional: log or print the error for debugging
        print(f"Error fetching content from {url}: {e}")
        return None


## Configure the agent

In [25]:
from agents import Agent, Runner, function_tool, SQLiteSession

In [26]:
assistant_instructions = """
You're a helpful assistant that helps answer user questions.
"""

assistant = Agent(
    name='assistant', #name of the agent
    tools=[function_tool(get_page_content)], # tools available to agent
    instructions=assistant_instructions, # agent system instruction
    model='gpt-4o-mini'
)

In [29]:
# To run the agent, we need a Runner:

runner_session1= SQLiteSession("my_chat_1", "../sqlite_db/conv.db")  # persist history
runner = Runner()
user_prompt = "Summarize the content of https://openai.github.io/openai-agents-python/"
result = await runner.run(assistant, input=user_prompt, session= runner_session1)

The result contains detailed information about the run, including all tool calls and their responses.

Let's check new_items in the result. It holds all the communication between OpenAI and us, including every tool call and its output. The last item should be the message with the final response, so we read it at index [-1]:

In [30]:
print(result.new_items[-1].raw_item.content[0].text)

The **OpenAI Agents SDK** is a Python library designed to create agentic AI applications effortlessly. It is built on previous experiments and aims for a user-friendly experience using a minimal set of primitives, such as:

- **Agents**: LLMs equipped with instructions and tools.
- **Handoffs**: Mechanisms for agents to delegate tasks.
- **Guardrails**: Features for validating inputs and outputs.
- **Sessions**: Management of conversation history automatically across runs.

Key features include a built-in agent loop, easy orchestration using Python, and tools for visualizing and debugging workflows. The SDK is installation-friendly with a simple `pip install openai-agents` command and provides tutorials, examples, and extensive documentation. 

A basic "Hello World" example illustrates setting up an agent and running it to generate output, showcasing the SDK's simplicity.

For more details, you can explore its [official documentation](https://openai.github.io/openai-agents-python/).
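
To look at every entry in new_items rather than only the last one, a small helper can list them in order. This is a minimal sketch, not part of the SDK: `describe_items` is a hypothetical name, and it assumes each item carries a string `type` attribute (the tag values used in the stand-in objects below are also assumptions). The stand-ins let the sketch run without calling the API.

```python
from types import SimpleNamespace


def describe_items(items):
    """Return one short line per run item: index, class name, and `type` tag."""
    lines = []
    for index, item in enumerate(items):
        # Assumption: each item exposes a string `type`; fall back to "unknown".
        item_type = getattr(item, "type", "unknown")
        lines.append(f"{index}: {type(item).__name__} ({item_type})")
    return lines


# Stand-in objects (hypothetical tag values) so the sketch runs offline.
fake_items = [
    SimpleNamespace(type="tool_call_item"),
    SimpleNamespace(type="tool_call_output_item"),
    SimpleNamespace(type="message_output_item"),
]

for line in describe_items(fake_items):
    print(line)
```

In the notebook, the same helper could be called as `describe_items(result.new_items)` to see where the tool call, the tool output, and the final message sit in the list.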


In [31]:
print(result.final_output)

The **OpenAI Agents SDK** is a Python library designed to create agentic AI applications effortlessly. It is built on previous experiments and aims for a user-friendly experience using a minimal set of primitives, such as:

- **Agents**: LLMs equipped with instructions and tools.
- **Handoffs**: Mechanisms for agents to delegate tasks.
- **Guardrails**: Features for validating inputs and outputs.
- **Sessions**: Management of conversation history automatically across runs.

Key features include a built-in agent loop, easy orchestration using Python, and tools for visualizing and debugging workflows. The SDK is installation-friendly with a simple `pip install openai-agents` command and provides tutorials, examples, and extensive documentation. 

A basic "Hello World" example illustrates setting up an agent and running it to generate output, showcasing the SDK's simplicity.

For more details, you can explore its [official documentation](https://openai.github.io/openai-agents-python/).


## YouTube transcript summary agent

In [32]:
# Here's how we get a transcript
from youtube_transcript_api import YouTubeTranscriptApi

def format_timestamp(seconds: float) -> str:
    """Convert seconds to H:MM:SS if at least 1 hour, else M:SS"""
    total_seconds = int(seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)

    if hours > 0:
        return f"{hours}:{minutes:02}:{secs:02}"
    else:
        return f"{minutes}:{secs:02}"


def make_subtitles(transcript) -> str:
    lines = []

    for entry in transcript:
        ts = format_timestamp(entry.start)
        text = entry.text.replace('\n', ' ')
        lines.append(ts + ' ' + text)

    return '\n'.join(lines)


def fetch_transcript_raw(video_id):
    ytt_api = YouTubeTranscriptApi()
    transcript = ytt_api.fetch(video_id)
    return transcript


def fetch_transcript_text(video_id):
    transcript = fetch_transcript_raw(video_id)
    subtitles = make_subtitles(transcript)
    return subtitles  

We don't want to re-download the transcripts every time. Let's use the ../data_cache/youtube_videos/ folder as a cache:

1. First, check whether the folder already has a file for the YouTube video.
2. If it does, simply return the file content.
3. If it doesn't, fetch the transcript and save it.

In [35]:
from pathlib import Path

def fetch_transcript_cached(video_id):
    cache_dir = Path("../data_cache/youtube_videos")
    cache_file = cache_dir / f"{video_id}.txt"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    subtitles = fetch_transcript_text(video_id)
    cache_dir.mkdir(parents=True, exist_ok=True)  # create the cache folder on first use
    cache_file.write_text(subtitles, encoding="utf-8")

    return subtitles
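
The cache logic can be checked without contacting YouTube by passing the download step in as a parameter. The sketch below is an illustration under assumptions: `fetch_cached` and `fake_fetch` are hypothetical names, and a temporary directory stands in for the real cache folder. It shows that the second lookup for the same key is served from disk.

```python
import tempfile
from pathlib import Path


def fetch_cached(key, fetch, cache_dir):
    """Return cached text for `key`, calling `fetch(key)` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{key}.txt"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    text = fetch(key)
    cache_file.write_text(text, encoding="utf-8")
    return text


calls = []  # records every time the (fake) download actually runs


def fake_fetch(video_id):
    """Stand-in for the real transcript download."""
    calls.append(video_id)
    return f"0:00 transcript for {video_id}"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_cached("abc123", fake_fetch, tmp)
    second = fetch_cached("abc123", fake_fetch, tmp)  # served from the cache file

print(first == second, len(calls))  # → True 1
```

Injecting the fetch function keeps the cache independent of the transcript API, which makes it easy to test and to reuse for other kinds of downloads.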

In [36]:
# test
subtitles = fetch_transcript_cached('vK_SxyqIfwk')
print(subtitles[:500])

0:00 Hey everyone, welcome to our event. This
0:02 event is brought to you by data talks
0:03 club which is a community of people who
0:05 love data. We have weekly events today.
0:08 Uh this is one of such events. Um if you
0:11 want to find out more about the events
0:13 we have, there is a link in the
0:14 description. Um so click on that link,
0:16 check it out right now. We actually have
0:19 quite a few events in our pipeline, but
0:21 we need to put them on the website. Uh
0:24 but keep a


In [37]:
# create the fetch youtube transcript tool with docstring
def fetch_youtube_transcript(video_id: str) -> str:
    """
    Fetches the transcript of a YouTube video and converts it into a subtitle-formatted string.

    Args:
        video_id (str): The unique YouTube video ID.

    Returns:
        str: The subtitles generated from the video's transcript.
    """
    return fetch_transcript_cached(video_id)
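
The tool expects a bare video ID, while a user will often paste a full URL, which leaves the model to pull the ID out of the link. If you prefer to do that step in code, a helper along these lines could be used. It is a sketch under assumptions: `extract_video_id` is a hypothetical name and it only handles the common `watch?v=` and `youtu.be/` link shapes.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a common YouTube URL shape, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None

    # Standard links: https://www.youtube.com/watch?v=<id>
    if host == "youtube.com" or host.endswith(".youtube.com"):
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    return None


print(extract_video_id("https://www.youtube.com/watch?v=nMrGK5QgPVE"))  # → nMrGK5QgPVE
```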

In [38]:
summary_instructions = """
You're a helpful assistant that helps answer user questions
about YouTube videos
"""

tools = [
    function_tool(fetch_youtube_transcript)
]

youtube_assistant = Agent(
    name='youtube_assistant',
    tools=tools,
    instructions=summary_instructions,
    model='gpt-4o-mini'
)

In [43]:
# To run the agent, we need a Runner:

runner_session2= SQLiteSession("my_chat_2", "../sqlite_db/conv.db")  # persist history
runner = Runner()
user_prompt = "what is this video about https://www.youtube.com/watch?v=nMrGK5QgPVE"
result = await runner.run(youtube_assistant, input=user_prompt, session=runner_session2)

In [44]:
print(result.final_output)

The video titled **"Implement a Search Engine"** by Alexey Grigorev is a hands-on tutorial focused on building a search engine using text and vector search methods. It covers various aspects, including:

1. **Overview of Retrieval-Augmented Generation**: Introduction to the concept and its applications.
2. **Creating a Sample Search Engine**: Explanation of an in-memory index and a simple search engine.
3. **Setting Up the Environment**: Guidance on installing necessary libraries for development.
4. **Implementing Text Search**: Using tools like scikit-learn to facilitate text search.
5. **Combining Scores from Queries**: Techniques for weighting different fields in search results.
6. **Creating a Text Search Class**: Organizing code for search implementations.
7. **Introduction to Vector Search**: Discussing dimensionality reduction and its relevance in search.
8. **Using BERT for Embeddings**: Explanation of how transformers are used for embedding text data.

The tutorial emphasizes 