## Exploring YouTube Content with LangChain
In this notebook, we'll demonstrate how to use the LangChain library to extract and process transcripts from YouTube videos.

By the end of this notebook, you'll be able to:

- Load a YouTube video using its URL
- Retrieve the transcript and metadata from the video
- Count tokens in the transcript
- Summarize the video content using OpenAI's GPT model

### Requirements
To run this notebook successfully, make sure you've installed the following Python packages:

- `langchain`: Provides the main functionality to interact with YouTube videos
- `openai`: Gives us access to OpenAI models such as GPT-3.5
- `python-dotenv`: Used to read the .env file containing the OpenAI API Key
- `ipykernel`: Enables running this notebook in VSCode
- `youtube-transcript-api`: Fetches YouTube video transcripts
- `pytube`: Fetches YouTube video metadata
- `tiktoken`: Counts tokens in a text

You can install all of these with a single pip command:

```bash
pip install langchain openai python-dotenv ipykernel youtube-transcript-api pytube tiktoken
```

### Loading API Key

We need to load the OpenAI API key to utilize OpenAI's GPT models.

My key is stored in a `.env` file located in the parent directory.

Let's use the `dotenv` library to load it.

In [1]:
import os
from dotenv import load_dotenv

# load_dotenv()

# Get the absolute path of the current working directory (where the notebook runs)
script_dir = os.path.abspath(os.getcwd())

# Get the absolute path of the parent directory
parent_dir = os.path.join(script_dir, os.pardir)

dotenv_path = os.path.join(parent_dir, '.env')
# Load the .env file from the parent directory
load_dotenv(dotenv_path)

True
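`load_dotenv` returns `True` when it finds and loads a file, but that alone doesn't confirm the key is present. Below is a minimal check, assuming the key is stored under the name `OPENAI_API_KEY` (the variable the `openai` package reads by default). The helper name `require_env` is hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set. Check your .env file.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv(dotenv_path)` would surface a missing key before any request is sent.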

### LangChain Loaders

LangChain offers 80+ loaders.

Any input => Standardized Document format
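Conceptually, a loader turns its input into a list of documents, each holding the text (`page_content`) and a `metadata` dictionary. The sketch below illustrates that shape with a stand-in dataclass. `SimpleDocument` and `load_text` are hypothetical names for illustration, not LangChain's own classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleDocument:
    """Stand-in for a loader's output: text plus descriptive metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(text: str, source: str) -> List[SimpleDocument]:
    """Toy 'loader': wraps raw text into a one-element document list."""
    return [SimpleDocument(page_content=text, metadata={"source": source})]


docs_demo = load_text("hello world", source="demo.txt")
print(docs_demo[0].page_content)
print(docs_demo[0].metadata)
```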

### Loading YouTube Transcripts and Metadata

With the LangChain library, it's easy to extract transcripts and metadata from a YouTube video. 

We import `YoutubeLoader`, call its `from_youtube_url()` class method with the URL of the desired video, and then call `load()`.

In [2]:
from langchain.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url("https://youtu.be/zJBpRn2zTco") # https://youtu.be/zJBpRn2zTco
docs = loader.load()

In [3]:
print(type(docs[0]))
# docs

<class 'langchain.schema.document.Document'>


In [4]:
transcript = docs[0].page_content 
metadata = docs[0].metadata
# print(transcript)

For readability, let's introduce line breaks into the transcript text using the textwrap library.

In [5]:
import textwrap

line_width = 120

print(textwrap.fill(transcript[:2000], line_width))

less than 24 hours ago meta released llama 2 their successor to the open source llama language model that helped spawn a
hundred others including alpaca vicuna and of course Orca within a few hours of release I had read the fascinating
76-page technical paper the use guide each of the many release Pages the full terms and conditions and I've run many of
my own experiments let's start with the basics it was trained on more data the biggest model has more parameters and the
context length has doubled they also spent what must be tens of Millions on fine-tuning it for chat but I'll get into
that more later but let's start with the benchmarks they deliberately compared llama 2 to llama 1 and other famous open
source models but not with gpt4 and in these benchmarks the trend is fairly clear it crushes the other open source
language models but is more of an incremental upgrade over over llama one to massively simplify the mmlu Benchmark shows
that it knows a lot about a lot of subjects but t

In [6]:
print(metadata)

{'source': 'zJBpRn2zTco'}
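The `source` value is the video ID taken from the URL. Here is a rough sketch of how such an ID could be extracted with the standard library, assuming only the `youtu.be/<id>` and `youtube.com/watch?v=<id>` URL shapes. `video_id_from_url` is a hypothetical helper, not LangChain's implementation.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url: str) -> Optional[str]:
    """Extract a YouTube video ID from two common URL shapes."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith("youtu.be"):
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host.endswith("youtube.com"):
        # Watch links carry the ID in the "v" query parameter
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None


print(video_id_from_url("https://youtu.be/zJBpRn2zTco"))
```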


The video ID alone isn't very informative, so it would be nice to get more metadata.

### Enhancing Metadata Retrieval

To get more detailed video metadata, let's set the `add_video_info` parameter to `True` when calling `from_youtube_url()`.

*Note: Requires `pytube`*

In [7]:
loader = YoutubeLoader.from_youtube_url("https://youtu.be/zJBpRn2zTco", add_video_info=True)
docs = loader.load()


In [8]:
metadata = docs[0].metadata
print(metadata)

{'source': 'zJBpRn2zTco', 'title': 'Llama 2: Full Breakdown', 'description': 'Unknown', 'view_count': 90086, 'thumbnail_url': 'https://i.ytimg.com/vi/zJBpRn2zTco/hq720.jpg', 'publish_date': '2023-07-19 00:00:00', 'length': 948, 'author': 'AI Explained'}


Let's print the metadata in a more readable form, one key per line:

In [9]:
for key, value in metadata.items():
    print(f"{key}: {value}")

source: zJBpRn2zTco
title: Llama 2: Full Breakdown
description: Unknown
view_count: 90086
thumbnail_url: https://i.ytimg.com/vi/zJBpRn2zTco/hq720.jpg
publish_date: 2023-07-19 00:00:00
length: 948
author: AI Explained
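The `length` field appears to be in seconds (`pytube` reports video length that way). Here is a small sketch for turning it into a `minutes:seconds` string. The helper name `format_duration` is hypothetical.

```python
def format_duration(seconds: int) -> str:
    """Format a duration in seconds as M:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


print(format_duration(948))  # 15:48
```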


### Counting Tokens in the Transcript
By using OpenAI's `tiktoken` package, we can count how many tokens there are in the video's transcript. 

This helps us manage the context length.

In [10]:
import tiktoken

def count_tokens(string: str, model: str = "gpt-3.5-turbo") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [11]:
model="gpt-3.5-turbo"

count_tokens("Let's count tokens for this sentence", model)

7

In [12]:
# Count the tokens in our transcript
count_tokens(transcript, model)

3034
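Whether this many tokens fit depends on the model's context window, which has to hold the prompt and the reply together. Here is a minimal sketch, assuming a 4,096-token window (the limit of the original `gpt-3.5-turbo`) and a 500-token budget for the answer. Both the helper `fits_context` and the reply budget are assumptions made for illustration.

```python
def fits_context(prompt_tokens: int, context_window: int = 4096, reply_budget: int = 500) -> bool:
    """Check that the prompt plus a reserved reply budget fits the context window."""
    return prompt_tokens + reply_budget <= context_window


print(fits_context(3034))  # True: 3034 + 500 <= 4096
print(fits_context(3800))  # False: 3800 + 500 > 4096
```

The prompt template itself also consumes a few tokens, so in practice the count should be taken on the full formatted prompt rather than on the transcript alone.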

### Summarizing YouTube Videos

We can take advantage of the power of the GPT-3.5-turbo model to generate a summary of the video content. 

For this, let's:
1. prepare a summary prompt template,
2. add our transcript to the prompt template,
3. pass it to our model for summarization.
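The first two steps amount to string templating. Here is a minimal sketch using plain `str.format` in place of LangChain's template classes. The template text, the sample transcript, and the helper `build_prompt` are made up for illustration.

```python
DEMO_TEMPLATE = "Summarize the video transcript below.\nTranscript: {transcript}"


def build_prompt(template: str, **values: str) -> str:
    """Fill the named placeholders in a template string."""
    return template.format(**values)


demo_prompt = build_prompt(DEMO_TEMPLATE, transcript="llama 2 was released today")
print(demo_prompt)
```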

In [13]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model=model, temperature=0.1)

**Step 1. Prepare prompt template**

In [14]:
SUMMARY_PROMPT = """Summarize the video with the transcript delimited by triple backticks. \
What are the 5 key takeaways? \
Transcript: ```{transcript}```"""

In [15]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI


def summarize_video(transcript, template=SUMMARY_PROMPT, model="gpt-3.5-turbo"):
    
    prompt = ChatPromptTemplate.from_template(template)
    # passing the transcript to the template
    formatted_prompt = prompt.format_messages(transcript=transcript)

    # initialize model
    llm = ChatOpenAI(model=model, temperature=0.1)
    # generate summary
    summary = llm(formatted_prompt)
    return summary


In [16]:
summary = summarize_video(transcript=transcript)

In [17]:
summary

AIMessage(content='The video discusses the release of Llama 2, the successor to the open-source Llama language model. The five key takeaways from the video are:\n\n1. Llama 2 outperforms other open-source language models in benchmarks but is seen as an incremental upgrade over Llama 1.\n2. The model was trained on more data, has more parameters, and has a doubled context length.\n3. The paper highlights the use of reinforcement learning with human feedback to train the model, with separate reward models for helpfulness and safety.\n4. The decision to release the model was supported by a list of corporate supporters, but it has also raised concerns about potential misuse.\n5. Llama 2 has limitations in coding, reasoning, and performance in languages other than English. It also faces competition from other models like Orca and Phi 1.', additional_kwargs={}, example=False)

In [18]:
print(summary.content)

The video discusses the release of Llama 2, the successor to the open-source Llama language model. The five key takeaways from the video are:

1. Llama 2 outperforms other open-source language models in benchmarks but is seen as an incremental upgrade over Llama 1.
2. The model was trained on more data, has more parameters, and has a doubled context length.
3. The paper highlights the use of reinforcement learning with human feedback to train the model, with separate reward models for helpfulness and safety.
4. The decision to release the model was supported by a list of corporate supporters, but it has also raised concerns about potential misuse.
5. Llama 2 has limitations in coding, reasoning, and performance in languages other than English. It also faces competition from other models like Orca and Phi 1.


## The True Power of Large Language Models

**Same transcript, different tasks?** Just change the prompt.

Back in the day, when NLP engineers needed to perform different tasks on the same text data, they had to train a separate model for each task. The process included:
- Data collection
- Task-specific data preprocessing (very time-consuming!)
- Model selection and training (also time-consuming!)
- Evaluation and iteration

With LLMs, the process includes:
- Data collection
- Prompting the LLM with task-specific instructions
- Evaluation and iteration

So we got rid of the two most time-consuming steps!


**A new prompt**

In [19]:
COMPARE_PROMPT = """Based on the transcript delimited by triple backticks. \
What has improved in LLAMA 2 compared to LLAMA 1? \
Transcript: ```{transcript}```"""

In [20]:
compare = summarize_video(transcript=transcript, template=COMPARE_PROMPT)
print(compare.content)

In LLAMA 2, several improvements have been made compared to LLAMA 1. These improvements include:

1. Training on more data: LLAMA 2 was trained on more robust data cleaning and 40 more total tokens.

2. Larger model size: LLAMA 2 has more parameters, making it a bigger model compared to LLAMA 1.

3. Doubled context length: The context length in LLAMA 2 has been doubled, allowing for a better understanding of longer conversations.

4. Fine-tuning for chat: LLAMA 2 has undergone extensive fine-tuning for chat, making it more suitable for conversational tasks.

5. Better performance in benchmarks: LLAMA 2 outperforms other open source language models in benchmarks, although it is considered more of an incremental upgrade over LLAMA 1.

It is important to note that LLAMA 2 was not compared to GPT-4 in the benchmarks mentioned in the transcript.


Feel free to experiment with the prompt to get different results from the same transcript.
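One way to organize such experiments is to keep one template per task and fill each with the same transcript. The sketch below does only the templating, with no model call. The task names and template wording are hypothetical examples.

```python
TASK_PROMPTS = {
    "summary": "Summarize the transcript in three sentences.\nTranscript: {transcript}",
    "topics": "List the main topics covered in the transcript.\nTranscript: {transcript}",
    "quiz": "Write three quiz questions based on the transcript.\nTranscript: {transcript}",
}


def build_task_prompts(transcript: str) -> dict:
    """Fill every task template with the same transcript."""
    return {
        task: template.format(transcript=transcript)
        for task, template in TASK_PROMPTS.items()
    }


for task, prompt in build_task_prompts("llama 2 was released today").items():
    print(f"--- {task} ---")
    print(prompt)
```

Each template keeps the `{transcript}` placeholder, so any of them could also be passed as the `template` argument of `summarize_video`.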

## Summarize from URL
Let's combine all the steps into a single function that goes from a video URL to a summary.

### Function for getting the YouTube Transcript

In [21]:
from typing import Optional


def get_transcript(url: str) -> Optional[str]:
    """
    Returns the transcript of a YouTube video.

    Parameters:
    url (str): The YouTube URL from which the transcript will be extracted.

    Returns:
    transcript (str | None): The transcript of the video, or None if it could not be loaded.
    """
    try:
        loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
        docs = loader.load()
        if docs:
            doc = docs[0]
            transcript = doc.page_content
            print(transcript)
            return transcript
        else:
            return None
    except Exception as e:
        print(f"Failed to load transcript from URL {url}: {e}")
        return None

In [22]:
TEMPLATE_PROMPT = """Summarize the video with the transcript delimited by triple backticks. \n
Answer the questions delimited by single backticks. If no questions provided, just create a general summary. \n
Questions: ` {questions} ` \n
Transcript: ```{transcript}```"""

In [23]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI


def summarize_with_questions(transcript, questions, model):
    prompt = ChatPromptTemplate.from_template(TEMPLATE_PROMPT)
    formatted_prompt = prompt.format_messages(transcript=transcript, questions=questions)
    print("Formatted prompt: ", formatted_prompt)
    llm = ChatOpenAI(model=model, temperature=0.1)
    summary = llm(formatted_prompt)
    return summary

In [24]:
def summarize_from_url(url, questions):
    transcript = get_transcript(url)
    # get_transcript returns None when loading fails, so stop early
    if transcript is None:
        return f"Could not load a transcript for {url}."
    token_count = count_tokens(transcript)
    # select a model whose context window can hold the transcript
    if token_count < 3000:
        model = "gpt-3.5-turbo"
    elif token_count < 14000:
        model = "gpt-3.5-turbo-16k"
    else:
        return f"Summary unavailable for transcripts over 14k tokens. Your transcript has {token_count} tokens."

    summary = summarize_with_questions(transcript, questions, model)
    return summary
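The model choice can also be pulled out into its own helper, which makes the thresholds easy to check in isolation. This sketch reuses the 3,000 and 14,000 token cut-offs from the function above. The name `select_model` is hypothetical.

```python
from typing import Optional


def select_model(token_count: int) -> Optional[str]:
    """Pick a model by transcript size, or None if the transcript is too long."""
    if token_count < 3000:
        return "gpt-3.5-turbo"
    if token_count < 14000:
        return "gpt-3.5-turbo-16k"
    return None


print(select_model(3034))  # gpt-3.5-turbo-16k
```

With these cut-offs, the 3034-token transcript from earlier lands just above the first threshold and is routed to the 16k model.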

In [26]:
# example videos
urls = [
    "https://youtu.be/zJBpRn2zTco",  # AI Explained on LLAMA 2
    "https://youtu.be/blyzUI8kOG4",  # AI Advantage compares LLAMA 2 to ChatGPT
    "https://youtu.be/Xjy-CDRJa54",  # Matthew Berman checks LLAMA 2 performance
]


In [27]:
url = urls[1]
questions = "How does LLAMA 2 perform compared to ChatGPT?"

summary = summarize_from_url(url, questions)

meta just surprised us with a brand new open source language model called llama 2. this thing is the best open source model we have and in many cases they claim this to be better than GPT 3.5 which is the default chat GPT I'm better but in what ways is it better is it more up to date can you use it how does this move the AI space forward and why should you even care you should I'll cover all that today and we'll even go into a quick demo so First Things First Slice is such a big deal on why should you care well I'm gonna do my best to keep this simple but in the introduction of the paper that they released with this it says exactly what you should know there have been many public releases of pre-trained large language models such as Bloom llama and Falcon match the performance of closed pre-trained competitors like GPT free okay so that's the first thing there's a big distinction between models that are open that you can download and build your apps upon and that are closed where all t

In [28]:
print(summary)

content='LLAMA 2 is an open source language model released by Meta and Microsoft. It is claimed to be better than GPT 3.5 in terms of performance and usability. LLAMA 2 has been heavily fine-tuned to align with human preferences, making it more capable and safer than previous open source models. It comes in three variations with different parameter sizes. The largest variation, LLAMA 2 Chat with 70 billion parameters, is the most exciting one. LLAMA 2 is free for research and commercial use, making it accessible for building chatbots and other applications. In terms of safety, LLAMA 2 is considered the safest large language model, with a lower percentage of information leakage compared to ChatGPT. In benchmark tests, LLAMA 2 performed slightly better than ChatGPT in terms of helpfulness prompts. It also outperformed other closed models in various academic benchmarks. Overall, LLAMA 2 is considered better than ChatGPT, and its open source nature and licensing agreement make it a game-ch

### End of the Basic Use case