# System Workflow

![Screenshot](./Snapshots/flow_diagram.png)

## Step-1
## Load the YouTube transcript as timestamped chunks

In [1]:
from langchain_community.document_loaders import YoutubeLoader # Load the Youtube Transcript
from langchain_community.document_loaders.youtube import TranscriptFormat # To Get transcripts as timestamped chunks
from dotenv import load_dotenv
load_dotenv()


True

In [None]:
try:
    loader = YoutubeLoader.from_youtube_url(
        "https://www.youtube.com/watch?v=pJdMxwXBsk0&list=PLKnIA16_RmvaTbihpo4MtzVm4XOQa0ER0&index=15",
        language=["hi"],        # request the Hindi transcript
        translation="en",       # and have it translated to English
        transcript_format=TranscriptFormat.CHUNKS,
        chunk_size_seconds=60,  # one Document per 60 seconds of video
    )
    docs = loader.load()

    if docs:
        print(f"Successfully loaded {len(docs)} transcript chunks")
    else:
        print("No transcript data was loaded (empty result)")

except Exception as e:
    print(f"Error loading YouTube transcript: {e}")
    docs = None

# Later cells depend on `docs`, so flag a failed or empty load explicitly
if not docs:
    print("Failed to load transcript, cannot proceed")

Successfully loaded 52 transcript chunks


In [3]:
len(docs)

52

In [4]:
docs[0]

Document(metadata={'source': 'https://www.youtube.com/watch?v=pJdMxwXBsk0&t=0s', 'start_seconds': 0, 'start_timestamp': '00:00:00'}, page_content="Hi guys, my name is Nitesh and welcome to my YouTube channel.  In this video also we will continue our lang chain playlist. And the topic of today's video is retrievers which is a very important topic.  If you talk about rag.  If you want to build a RAG based application then retriever is a very important component.  In fact, in the future when you make some advanced rag systems, you will work with different types of retrievers there, so in that sense this particular video is very important. And I would like you to watch this video end to end.  So in today's video I will not only explain to you what are retrievers?  What do they need?  But at the same time I will also tell you about different types of retrievers and will show you the code.  Ok?  So ya let's start the video.  So guys, before we start the video, I would like to give you a quic

In [5]:
docs[2]

Document(metadata={'source': 'https://www.youtube.com/watch?v=pJdMxwXBsk0&t=120s', 'start_seconds': 120, 'start_timestamp': '00:02:00'}, page_content="how we are moving forward with this playlist. Now let's focus on today's video, which is on retrievers. Retrievers are very important in langche.  So we will cover this in great detail in today's video.  First of all, we will start with this discussion that what are retrievers?  So in very simple words, if you read this first line, it is written here that a retriever is a component in the language that fetches relevant documents from a data source in response to a user's query.  Ok?  If you focus on this diagram, you will understand things better visually. So what happens is that you have a data source where all your data is stored.  Ok ?  All the data related to anything is stupid.  Now this data source can be anything.  It could be a vector store and it could be some API or something.")

In [6]:
index= 0
print(docs[index].metadata)
print(docs[index].page_content)

{'source': 'https://www.youtube.com/watch?v=pJdMxwXBsk0&t=0s', 'start_seconds': 0, 'start_timestamp': '00:00:00'}
Hi guys, my name is Nitesh and welcome to my YouTube channel.  In this video also we will continue our lang chain playlist. And the topic of today's video is retrievers which is a very important topic.  If you talk about rag.  If you want to build a RAG based application then retriever is a very important component.  In fact, in the future when you make some advanced rag systems, you will work with different types of retrievers there, so in that sense this particular video is very important. And I would like you to watch this video end to end.  So in today's video I will not only explain to you what are retrievers?  What do they need?  But at the same time I will also tell you about different types of retrievers and will show you the code.  Ok?  So ya let's start the video.  So guys, before we start the video, I would like to give you a quick recap of what we have been doin
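
To make the metadata above easier to read, here is a minimal sketch of how timestamped chunking could work: transcript segments are grouped into fixed 60-second windows, and each window gets `source`, `start_seconds` and `start_timestamp` fields shaped like the ones printed above. The function `chunk_transcript`, the segment tuples and the `VIDEO_ID` placeholder are assumptions made for illustration; the loader's actual grouping logic may differ.

```python
def chunk_transcript(segments, video_id, chunk_size_seconds=60):
    """Group (start_seconds, text) segments into fixed-width time windows.

    Hypothetical helper for illustration only; not the loader's own code.
    """
    buckets = {}
    for start, text in segments:
        # Snap each segment to the start of the window it falls into
        window_start = int(start // chunk_size_seconds) * chunk_size_seconds
        buckets.setdefault(window_start, []).append(text.strip())

    chunks = []
    for window_start in sorted(buckets):
        hours, rem = divmod(window_start, 3600)
        minutes, seconds = divmod(rem, 60)
        chunks.append({
            "page_content": " ".join(buckets[window_start]),
            "metadata": {
                "source": f"https://www.youtube.com/watch?v={video_id}&t={window_start}s",
                "start_seconds": window_start,
                "start_timestamp": f"{hours:02d}:{minutes:02d}:{seconds:02d}",
            },
        })
    return chunks


# Made-up segments: (start time in seconds, text)
sample_segments = [(0.0, "Hi guys"), (30.5, "welcome back"), (61.0, "today: retrievers")]
for chunk in chunk_transcript(sample_segments, "VIDEO_ID"):
    print(chunk["metadata"]["start_timestamp"], chunk["page_content"])
```

Because every chunk carries its start time, a retrieved chunk can be linked straight back to the matching moment in the video.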

## Step-2
## Loading the embedding model and the LLM

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [8]:
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
from transformers import AutoTokenizer

# Sentence-embedding model used to index the transcript chunks
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Initialize an LLM served through a Hugging Face endpoint
repo_id = "mistralai/Mistral-7B-Instruct-v0.3"
# Load the tokenizer explicitly first
tokenizer = AutoTokenizer.from_pretrained(repo_id)
llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    temperature=0.8,
    max_new_tokens=500,
)
model = ChatHuggingFace(llm=llm, tokenizer=tokenizer)



In [9]:
# !pip install langchain_groq

In [10]:
from langchain_groq import ChatGroq
# Note: this rebinds `llm`, so the Groq-hosted model is the one used in the following steps
llm = ChatGroq(model_name="Llama-3.3-70b-Versatile", max_tokens=500)

## Step-3
## Creating a vector database using Chroma


In [11]:
# vector_store = FAISS.from_documents(docs, embedding_model)
from langchain_chroma import Chroma
vectorstore = Chroma.from_documents(docs, embedding_model)
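
As a rough picture of what a vector store does with the chunks, the sketch below embeds each text, stores the vectors, and ranks them by cosine similarity to the query. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not Chroma's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=2):
        query_vector = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vector, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "retrievers fetch relevant documents",
    "please subscribe to the channel",
    "a vector store holds embeddings",
])
print(store.similarity_search("what do retrievers fetch", k=1))
```

A real embedding model captures meaning rather than exact word overlap, which is why the notebook uses a sentence-transformers model instead of word counts.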

In [12]:
# !pip install langchain-chroma
# !pip install lark

## Step-4 Defining the retriever
## Using metadata-based filtering for retrieval

#### -> This retriever is known as the self-query retriever

In [13]:
print(docs[index].metadata)


{'source': 'https://www.youtube.com/watch?v=pJdMxwXBsk0&t=0s', 'start_seconds': 0, 'start_timestamp': '00:00:00'}


In [14]:
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The link of the video",
        type="string"
    ),
    AttributeInfo(
        name="start_seconds",
        description="The starting second of the video chunk (in seconds as integer)",
        type="integer"  # Changed from string to integer
    ),
    AttributeInfo(
        name="start_timestamp",
        description="Human-readable timestamp (HH:MM:SS format)",
        type="string"
    )
]

In [15]:
# # First get the base retriever from your vectorstore with increased k
# base_vectorstore_retriever = vectorstore.as_retriever(
#     # search_type = "mmr",
#     search_kwargs={"k": 20,'lambda_mult':0.5}  # Increase this number as needed
# )

In [16]:
document_content_description = "Transcript of a youtube video"
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    # base_retriever = base_vectorstore_retriever,
    verbose=True,
    search_kwargs={"k": 8}  # Increase this number as needed

)
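
The idea behind a self-query retriever is that the LLM turns the user's question into a structured query, meaning a search string plus an optional metadata filter, and the filter is then applied to the stored chunks. The sketch below shows only the filtering half under stated assumptions: `apply_filter`, the comparator table, the sample documents and the hand-written `structured_query` are all hypothetical, and no LLM is involved.

```python
import operator

# Comparator names chosen for illustration
COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def apply_filter(docs, attribute, comparator, value):
    """Keep documents whose metadata[attribute] satisfies the comparison."""
    compare = COMPARATORS[comparator]
    return [
        doc for doc in docs
        if attribute in doc["metadata"] and compare(doc["metadata"][attribute], value)
    ]


sample_docs = [
    {"page_content": "intro", "metadata": {"start_seconds": 0}},
    {"page_content": "what are retrievers", "metadata": {"start_seconds": 120}},
    {"page_content": "closing remarks", "metadata": {"start_seconds": 3060}},
]

# A question such as "what is covered after the first two minutes?" might be
# translated into a structured query like this one (written by hand here).
structured_query = {"query": "topics covered", "filter": ("start_seconds", "gte", 120)}

matches = apply_filter(sample_docs, *structured_query["filter"])
print([doc["page_content"] for doc in matches])
```

This is why the attribute types in `metadata_field_info` matter: a numeric comparison on `start_seconds` only makes sense if the field is described as an integer.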

In [34]:
# A general query with no explicit metadata filter
retriever.invoke("Create me a blog post about the video.")
# retriever.invoke("what is meant by multi query retriever ?")

[Document(id='21e30dd3-44fe-41bf-ac33-736b4aeb985a', metadata={'source': 'https://www.youtube.com/watch?v=pJdMxwXBsk0&t=3060s', 'start_seconds': 3060, 'start_timestamp': '00:51:00'}, page_content='application.  Ok?  So with that I will conclude this video.  If you liked the video, please like it.  If you have not subscribed to this channel, please do subscribe.  See you in the next video , bye.'),
 Document(id='101ad05e-b3d5-42a8-9f4d-5f6f13fb6838', metadata={'source': 'https://www.youtube.com/watch?v=pJdMxwXBsk0&t=120s', 'start_seconds': 120, 'start_timestamp': '00:02:00'}, page_content="how we are moving forward with this playlist. Now let's focus on today's video, which is on retrievers. Retrievers are very important in langche.  So we will cover this in great detail in today's video.  First of all, we will start with this discussion that what are retrievers?  So in very simple words, if you read this first line, it is written here that a retriever is a component in the language tha

## Step-5 Creating a Prompt Template

In [18]:
from langchain_core.prompts import PromptTemplate

In [22]:
template = PromptTemplate(
    template = """You are an AI assistant with access to a YouTube video's transcript. Answer the user's query based on the provided transcript context and do not hallucinate.
    If the answer to the user's query is not mentioned in the context, or if you don't know the answer, respond with 'Sorry, I do not have an answer to your question'.
    'context'
    {context}
    'Question'
    {input}""",
    input_variables=['context','input']
)
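
To show how the two placeholders get filled, here is a minimal sketch using plain string formatting: the retrieved chunks are joined into one block of text for `{context}`, and the user's question goes into `{input}`. The names `PROMPT`, `stuff_documents` and `build_prompt` are hypothetical, and the blank-line separator is an assumption made for this example.

```python
# Shortened stand-in for the template above, with the same two placeholders
PROMPT = "Context:\n{context}\n\nQuestion:\n{input}"


def stuff_documents(docs, separator="\n\n"):
    """Join the text of every retrieved chunk into a single context string."""
    return separator.join(doc["page_content"] for doc in docs)


def build_prompt(docs, question):
    """Fill both placeholders to get the final text sent to the model."""
    return PROMPT.format(context=stuff_documents(docs), input=question)


retrieved = [
    {"page_content": "A retriever fetches relevant documents."},
    {"page_content": "The data source can be a vector store."},
]
print(build_prompt(retrieved, "What is a retriever?"))
```

Since all retrieved chunks are placed in one prompt, the number of chunks returned by the retriever directly affects the prompt length.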

## Step-6 Creating a RAG Chain

In [20]:
from langchain.chains.combine_documents import create_stuff_documents_chain  # Formats the retrieved documents and the question into a prompt and passes it to the LLM
from langchain.chains import create_retrieval_chain  # Combines a retriever (to fetch docs) with the chain from create_stuff_documents_chain for end-to-end retrieval and answering

In [23]:
combined_docs_chain = create_stuff_documents_chain(llm=llm,prompt=template)
rag_chain = create_retrieval_chain(retriever,combined_docs_chain)
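
The data flow of the chain can be pictured with a small stand-in: retrieve chunks for the question, hand the chunks and the question to the answering step, and return everything in one dictionary with `input`, `context` and `answer` keys. This is only a sketch under stated assumptions: `make_rag_chain`, `fake_retrieve` and `fake_answer` are hypothetical stubs, and no real retriever or LLM is called.

```python
def make_rag_chain(retrieve, answer_from_docs):
    """Wire a retrieval function and an answering function into one callable."""
    def invoke(inputs):
        context = retrieve(inputs["input"])
        answer = answer_from_docs({"input": inputs["input"], "context": context})
        # Return the original input alongside what was retrieved and generated
        return {**inputs, "context": context, "answer": answer}
    return invoke


def fake_retrieve(question):
    """Stub retriever: always returns the same chunk."""
    return [{
        "page_content": "A retriever fetches relevant documents.",
        "metadata": {"start_timestamp": "00:02:00"},
    }]


def fake_answer(payload):
    """Stub 'LLM': reports how many chunks it was given."""
    return f"Answer to {payload['input']!r} based on {len(payload['context'])} chunk(s)."


toy_chain = make_rag_chain(fake_retrieve, fake_answer)
toy_result = toy_chain({"input": "What is a retriever?"})
print(toy_result["answer"])
print([doc["metadata"]["start_timestamp"] for doc in toy_result["context"]])
```

Keeping the retrieved chunks in the result makes it possible to show the timestamps an answer was drawn from.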

In [35]:
result=rag_chain.invoke({
    'input':"Create me a blog post about the video."
})

In [36]:
print(result['answer'])

The video discusses the importance of retrievers in Langchain, a language model framework. The speaker explains that a retriever is a component that fetches relevant documents from a data source in response to a user's query. They then delve into the different types of retrievers, including the Wikipedia Retriever, Multi-Query Retriever, and Contextual Compression Retriever.

The Wikipedia Retriever is described as a retriever that queries the Wikipedia API to fetch relevant content for a given query. The speaker provides an example of how this retriever works, using a query about Albert Einstein. They explain that the retriever hits the Wikipedia API, retrieves the most relevant articles, and returns them in the format of Langchain Document Objects.

The speaker also discusses the Multi-Query Retriever, which understands the context of the query and returns relevant documents based on that context. They provide an example of how this retriever works, using a query that has multiple po