# Using LangChain and Deep Lake for YouTube summarization

In this tutorial, we explore LangChain, a library that offers a suite of tools for working with text data, including document retrieval, text summarization, and language model interactions. We focus on using LangChain together with OpenAI's language model to retrieve documents and generate concise summaries. We also use Deep Lake's ability to store multi-modal data, here text and vector embeddings, to streamline the retrieval and summarization process.

## Outline:

### Setting up the Environment:

1. Importing the necessary modules from LangChain.
2. Configuring the required components, such as OpenAI for language model interactions.

### Document Retrieval:

1. Establishing a vector store using DeepLake for efficient storage and retrieval of text data.
2. Utilizing the retriever functionality to search for relevant documents based on specific criteria.
3. Extracting the retrieved documents for further processing.

### Text Summarization:

1. Defining a prompt template to guide the summarization process.
2. Creating a prompt using the PromptTemplate class to structure the summary generation.
3. Applying the OpenAI language model through the LLMChain class to generate summaries.
4. Combining the generated summaries into a cohesive text.

### Putting It All Together:

1. Integrating the document retrieval and text summarization steps.
2. Demonstrating how the LangChain library can be leveraged to retrieve relevant documents and generate concise summaries.

By the end of this tutorial, you will have a foundational understanding of using LangChain and OpenAI's language model to perform document retrieval and generate summaries. These techniques can be applied to various text analysis tasks, enabling efficient information retrieval and summarization from large text collections.

In [None]:
import os
os.environ['OPENAI_API_KEY'] = 'YOUR KEY'
os.environ['ACTIVELOOP_TOKEN'] = 'YOUR KEY'
YOUTUBE_KEY = 'YOUR KEY'

The code above imports the `os` module and stores the OpenAI API key and the Activeloop token in the `OPENAI_API_KEY` and `ACTIVELOOP_TOKEN` environment variables. `YOUTUBE_KEY` is a regular Python variable that holds the YouTube Data API key.

To obtain a YouTube API key:
1. Go to the [Google Cloud Console](https://console.cloud.google.com/).
2. Create a new project or select an existing project.
3. Enable the "YouTube Data API v3" for your project.
4. Go to the "Credentials" section.
5. Create a new API key.
6. Copy the generated API key.
7. Replace the placeholder in the code snippet with your API key.

Remember to keep your API key secure and avoid sharing it publicly.
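
One way to keep keys out of the notebook itself is to read them from the environment at run time. The sketch below is only an illustration and is not part of the original tutorial: `require_env` is a hypothetical helper, and `YOUTUBE_API_KEY` is an assumed variable name you would have to set yourself.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value


# Hypothetical usage (YOUTUBE_API_KEY is an assumed name, not defined by the tutorial):
# YOUTUBE_KEY = require_env("YOUTUBE_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by an API client.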

In [None]:
from googleapiclient.discovery import build

CHANNEL_ID = 'UCESLZhusAkFfsNsApnjF_Cg'

youtube = build('youtube', 'v3', developerKey=YOUTUBE_KEY)

urls = []

response = youtube.search().list(
    part='snippet',
    channelId=CHANNEL_ID,
    maxResults=10,  # Set the desired number of recent videos
    order='date',  # Retrieve videos in reverse chronological order
    type='video',
).execute()

videos = response['items']
for video in videos:
    video_id = video['id']['videoId']
    video_title = video['snippet']['title']

    urls.append(video_id)

This code retrieves the IDs and titles of the most recent videos uploaded to a specific YouTube channel using the YouTube Data API. Only the IDs are kept for the next step.

Here's a breakdown:

1. Import the `build` function from the `googleapiclient.discovery` module.
2. Set the `CHANNEL_ID` variable to the unique identifier of the YouTube channel.
3. Create a `youtube` service object using the `build` function, specifying the YouTube Data API version and API key.
4. Initialize an empty list, `urls`, to store video IDs.
5. Call `youtube.search().list()` to search for videos, providing parameters such as `part`, `channelId`, `maxResults`, `order`, and `type`.
6. Execute the search request using the `execute()` method, obtaining a response.
7. Access the list of video items in the response under the `items` key.
8. Retrieve the video ID and title from each item and append the video ID to the `urls` list.

After execution, the `urls` list will contain the video IDs of the most recent videos from the specified YouTube channel, which can be used for further processing or constructing video URLs.
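
To make the last point concrete, here is a small sketch of pulling IDs out of a response-shaped dictionary and turning them into watch URLs. The sample data is made up and only mirrors the `items` / `id` / `videoId` keys used above; `extract_video_ids` and `to_watch_url` are hypothetical helper names, not part of any library.

```python
from typing import List


def extract_video_ids(response: dict) -> List[str]:
    """Collect video IDs from a search-response-shaped dict, skipping items without one."""
    ids = []
    for item in response.get("items", []):
        video_id = item.get("id", {}).get("videoId")
        if video_id:
            ids.append(video_id)
    return ids


def to_watch_url(video_id: str) -> str:
    """Build a standard watch URL from a video ID."""
    return "https://www.youtube.com/watch?v=" + video_id


# Made-up sample data with the same shape as the response accessed above.
sample_response = {
    "items": [
        {"id": {"videoId": "abc123"}, "snippet": {"title": "First video"}},
        {"id": {"videoId": "def456"}, "snippet": {"title": "Second video"}},
    ]
}

video_ids = extract_video_ids(sample_response)
watch_urls = [to_watch_url(v) for v in video_ids]
print(watch_urls)
```

Using `.get()` instead of direct indexing means an item without a `videoId` is skipped rather than raising a `KeyError`.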

In [None]:
from langchain.document_loaders import YoutubeLoader
from langchain.text_splitter import CharacterTextSplitter


docs = []
for url in urls:
    loader = YoutubeLoader.from_youtube_url('https://www.youtube.com/watch?v=' + url)
    docs.extend(loader.load_and_split())

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

LangChain is then used to load the transcript of each video and split it into smaller text chunks.

At a high level, the code works as follows:

1. Import the necessary modules from the LangChain library: `YoutubeLoader` for loading YouTube video transcripts and `CharacterTextSplitter` for splitting text.
2. Create an empty list, `docs`, to store the loaded documents.
3. Iterate over the `urls` list to process each video ID.
4. Create a `YoutubeLoader` object by passing the YouTube URL (with the video ID appended) to the `from_youtube_url` method.
5. Use the `loader.load_and_split()` method to load the video's transcript and split it into smaller chunks.
6. Extend the `docs` list with the loaded and split documents.
7. Create a `CharacterTextSplitter` object, specifying the desired `chunk_size` and `chunk_overlap`.
8. Use the `split_documents()` method of the `text_splitter` object to split the documents into smaller text chunks.
9. The resulting `texts` variable will contain the split text chunks.

This code allows for loading and splitting text from YouTube videos, enabling further processing or analysis on smaller text units.
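
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch that cuts a string into fixed-size character windows. It is not how `CharacterTextSplitter` is implemented (the real splitter also takes separators into account); `chunk_text` is a hypothetical helper written only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Simplified on purpose:
    separators are ignored, unlike in a real text splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


transcript = "abcdefghij"  # stand-in for a real transcript
print(chunk_text(transcript, chunk_size=4))
print(chunk_text(transcript, chunk_size=4, chunk_overlap=2))
```

With an overlap of 0, as in the tutorial, each character ends up in exactly one chunk. A non-zero overlap repeats some text across neighboring chunks, which can help when a sentence straddles a chunk boundary.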


In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings(disallowed_special=())
db = DeepLake(dataset_path='hub://{YOUR_NAME}/youtube', embedding_function=embeddings)
db.add_documents(texts)

This snippet uses LangChain to embed the split text chunks with OpenAI and store them in a Deep Lake vector store for later retrieval and analysis.

Here's how it works:

1. Import the necessary modules from the LangChain library: `OpenAIEmbeddings` for text embeddings and `DeepLake` for the vector store.
2. Create an `OpenAIEmbeddings` object called `embeddings` with the optional parameter `disallowed_special` set to an empty tuple. This object is responsible for generating embeddings for the text chunks.
3. Create a `DeepLake` vector store called `db`, specifying the dataset path as `'hub://{YOUR_NAME}/youtube'` and passing `embeddings` as the embedding function. Replace `{YOUR_NAME}` with your Activeloop username or organization name.
4. Add the split text chunks (`texts`) to the `db` vector store using the `add_documents()` method.

After executing this code, the `db` vector store will contain the embedded representations of the text chunks from the YouTube videos. This allows for efficient storage and retrieval of the text embeddings for subsequent analysis or similarity matching.
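
As a rough picture of what similarity matching over stored embeddings means, the sketch below ranks a few toy vectors against a query vector. Everything here is an assumption made for illustration: the three-dimensional vectors are invented, cosine similarity is one common choice of metric (the tutorial does not say which one the vector store uses), and `top_k` is a hypothetical helper rather than a Deep Lake function.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (text, score) pairs whose vectors are most similar to `query`."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented toy embeddings; real embeddings have far more dimensions.
toy_store = {
    "interest rates are rising": [0.9, 0.1, 0.0],
    "new phone released": [0.0, 0.2, 0.9],
    "central bank policy": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]
best = top_k(query_vector, toy_store, k=2)
print(best)
```

A real vector store performs the same kind of ranking, but over many high-dimensional vectors and typically with indexing so that not every vector has to be compared.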

In [None]:
from langchain import OpenAI
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate

retriever = db.as_retriever(search_kwargs={"k": 10})
retrieved_documents = retriever.get_relevant_documents('fed raising rates')

prompt_template = """
    Generate a 100-word summary for the following text:

    {text}

    To create an accurate summary of this text, please use your own words and pay attention to the main point or idea. 
    Avoid including any additional information, as that will not accurately represent the source material. 
    Your summary should include only the key facts and concepts; it should condense the original text into a concise and accurate 100-word summary that clearly communicates the main idea or point of the text.
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = LLMChain(llm=OpenAI(temperature=0, model_name='text-davinci-003'), prompt=prompt)
# Pass each document's text (page_content) rather than the Document object itself
results = chain.apply([{'text': doc.page_content} for doc in retrieved_documents])
combined_text = '\n'.join(result['text'] for result in results).strip()

This snippet uses LangChain to retrieve relevant documents and generate a 100-word summary of each one with OpenAI's language model.

The final step, creating a coherent summary of key ideas spread across multiple videos, works as follows:

1. Import the necessary modules: `OpenAI` from LangChain for language model interactions, `LLMChain` for the language model chain, and `PromptTemplate` for defining the prompt structure.
2. Convert the `db` vector store into a retriever object called `retriever` using the `as_retriever()` method. Specify the retrieval search parameters, such as `"k": 10`, to retrieve the top 10 relevant documents.
3. Retrieve relevant documents by calling `get_relevant_documents()` on the `retriever` object, providing the query `'fed raising rates'`. The retrieved documents are stored in the `retrieved_documents` variable.
4. Define the `prompt_template` variable, which contains a template for generating the summary. It includes a placeholder `{text}` for the actual text.
5. Create a `PromptTemplate` object called `prompt` using the template defined above. Specify `"text"` as the input variable for the template.
6. Create an `LLMChain` object called `chain`, utilizing OpenAI's language model (`OpenAI`) with a temperature value of 0 and the model name set to `'text-davinci-003'`. This chain is responsible for generating the summaries based on the prompt.
7. Apply the chain to each retrieved document by calling the `apply()` method on the `chain` object. The input is a list of dictionaries, where each dictionary contains the key `'text'` mapped to the document text.
8. Store the generated summaries in the `results` variable.
9. Combine the generated summaries into a single string, `combined_text`, by joining the text from each result with a newline character (`\n`). Leading and trailing whitespace is removed using `strip()`.

After executing this code, the `combined_text` variable will hold the concise 100-word summaries generated by the language model, one per retrieved document, for the query `'fed raising rates'`.
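
The fill-template, summarize, and join pattern can be tried without an API key by swapping the language model for a stand-in. In the sketch below, `fake_llm` is a hypothetical placeholder that simply returns the first sentence of the text, `summarize_each` is an illustrative helper rather than a LangChain function, and the shortened template and sample chunks are invented.

```python
from typing import Callable, Dict, List

# Shortened version of the tutorial's prompt, for illustration only.
TEMPLATE = "Generate a 100-word summary for the following text:\n\n{text}\n"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a language model: returns the first sentence of the text."""
    body = prompt.split("\n\n", 1)[1].strip()
    return body.split(".")[0].strip() + "."


def summarize_each(texts: List[str], llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Fill the template once per text and collect one result dict per input."""
    return [{"text": llm(TEMPLATE.format(text=t))} for t in texts]


# Invented sample chunks standing in for retrieved documents.
toy_chunks = ["Rates went up. Markets reacted.", "Inflation cooled. Bonds rallied."]
toy_results = summarize_each(toy_chunks, fake_llm)
toy_combined = "\n".join(r["text"] for r in toy_results).strip()
print(toy_combined)
```

Each chunk is summarized independently, so the combined text is a list of per-chunk summaries rather than a single unified summary. Passing the combined text through the model once more would be one way to condense it further.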