## Install necessary libraries

In [None]:
!pip install -q youtube-transcript-api langchain_community langchain_openai faiss-cpu tiktoken python-dotenv


In [None]:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
import os
from google.colab import userdata

In [None]:
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
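The cell above only works inside Google Colab, because `userdata` comes from `google.colab`. A minimal sketch of a fallback for other environments, assuming the key is either already exported as an environment variable or can be typed in interactively. The helper name `resolve_api_key` and its parameters are hypothetical, not part of any library.

```python
import os
from getpass import getpass


def resolve_api_key(name="OPENAI_API_KEY", env=None, prompt_fn=getpass):
    """Return the API key from `env`, falling back to an interactive prompt.

    `env` defaults to os.environ; `prompt_fn` is injectable so the helper
    can be exercised without a terminal. Both are illustrative choices.
    """
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        # Nothing exported: ask the user instead of failing later in the API call.
        key = prompt_fn(f"Enter {name}: ")
    if not key:
        raise ValueError(f"{name} is required but was not provided")
    return key


# Usage outside Colab (prompts only if the variable is not already set):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```

Keeping the lookup in one function means the rest of the notebook does not need to know where the key came from.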

# Step 1: Indexing / Data Ingestion

## Document Loader

In [None]:
video_id = "p3sij8QzONQ"

try:
    # Fetch transcript directly (returns FetchedTranscriptSnippet objects)
    transcript_snippets = YouTubeTranscriptApi().fetch(video_id, languages=['en'])

    # Flatten into plain text
    text = " ".join(snippet.text for snippet in transcript_snippets)
    print(text[:500])  # print first 500 characters for preview

except TranscriptsDisabled:
    print("No captions available for this video.")

Learn to build a complete large language model from scratch using only pure PyTorch. This course takes you through the entire life cycle from foundational concepts to advanced alignment techniques. You'll begin by implementing the core transformer architecture and training a tiny language model. From there you will modernize and scale the model with production ready enhancements like RO mixture of experts layers and mixed precision training. The course then transitions to the full alignment phas


In [None]:
text

"Learn to build a complete large language model from scratch using only pure PyTorch. This course takes you through the entire life cycle from foundational concepts to advanced alignment techniques. You'll begin by implementing the core transformer architecture and training a tiny language model. From there you will modernize and scale the model with production ready enhancements like RO mixture of experts layers and mixed precision training. The course then transitions to the full alignment phase where you'll implement supervised fine-tuning and build a reward model. To complete the life cycle, you'll use proximal policy optimization or PO to align the model with reinforcement learning from human feedback. By the end, you'll have the deep hands-on experience needed to build and customize your own LLMs. >> Hello everyone and welcome to LLM from scratch, a hands-on curriculum in PyTorch. Uh this is going to be a long practical journey where we build modern large language model component

## Text Splitting

In [None]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

In [None]:
chunks = splitter.create_documents([text])

In [None]:
len(chunks)

353

In [None]:
print(chunks[0])

page_content='Learn to build a complete large language model from scratch using only pure PyTorch. This course takes you through the entire life cycle from foundational concepts to advanced alignment techniques. You'll begin by implementing the core transformer architecture and training a tiny language model. From there you will modernize and scale the model with production ready enhancements like RO mixture of experts layers and mixed precision training. The course then transitions to the full alignment phase where you'll implement supervised fine-tuning and build a reward model. To complete the life cycle, you'll use proximal policy optimization or PO to align the model with reinforcement learning from human feedback. By the end, you'll have the deep hands-on experience needed to build and customize your own LLMs. >> Hello everyone and welcome to LLM from scratch, a hands-on curriculum in PyTorch. Uh this is going to be a long practical journey where we build modern large language mo
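To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs, newlines, and spaces); it only illustrates the sliding-window idea, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into fixed-size windows; each window repeats the last
    `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Tiny example: windows of 4 characters that share 2 characters with their neighbour.
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.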

## Embedding Generation

In [None]:
embeddings = OpenAIEmbeddings(
    model = "text-embedding-3-small"
)

## Store embeddings in vector store

In [None]:
vector_store = FAISS.from_documents(chunks, embeddings)

In [None]:
vector_store.index_to_docstore_id

{0: '1fb69300-ba9f-459b-99b3-aabe973c1dfb',
 1: 'ea10ed5c-e154-4f06-a440-28165957672b',
 2: '7e4c80ba-3fe7-408d-9b22-8faebb2e3388',
 3: '3bc0a681-7ddb-4d8a-a49e-26037a27299b',
 4: 'b5f1f09d-4b8a-4a5a-aa56-0e3d07ba78cd',
 5: 'b8dbf93a-2a40-41d2-8efd-223f7f15a527',
 6: 'dd284dec-e219-416d-b624-b25d71d4d854',
 7: 'caea26e6-3abe-4974-a30d-d81f63507811',
 8: '2a99d9ca-5c1d-41d4-93e2-ddd601f4a559',
 9: '4b8470d0-86bf-4c60-a6d7-c7a4aa3cf2e7',
 10: 'b357b644-f90b-415f-8742-03342430957c',
 11: 'fbe6d41b-edd9-4b00-b078-089c4a52e757',
 12: 'b7c6e6f7-b735-4eba-b6cd-f5c8ad0f96dd',
 13: '4b0a698f-10fe-4c37-a832-92ddb72d17ec',
 14: 'cbaf6997-9133-4e0b-881d-57756517400c',
 15: '29ced1bc-f8bf-4c1c-89cb-86e3af7f3310',
 16: 'efc219d5-878f-4dd5-ba1d-83e467e98542',
 17: '9772b2fe-8a57-40d1-be6f-1aae68383755',
 18: '9a1cf499-043f-4ee5-9dd4-db8e853d19ae',
 19: 'ea4dbd84-0667-47fb-a770-115ed179925f',
 20: '486a363f-b79c-481d-8e2f-098cf7220e03',
 21: '6293f2f6-b48c-40fb-949e-122fca2aa271',
 22: 'ac66bbe3-da4c-

# Step 2: Retrieval

In [None]:
# retrieve the top 4 documents from the vector store
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k":4}
)

In [None]:
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x795af8bb2390>, search_kwargs={'k': 4})

In [None]:
retriever.invoke("What is RLHF")

[Document(id='e3e33180-50ab-4e20-8536-8435f75440ac', metadata={}, page_content="temporal movement of the policy and uh the KL overall KL divergence from uh the model. So we've uh hit our eval and we'll just go through uh the area where till where we uh generate our responses. So we've got all of that till here and hopefully things will be different in this run. So I've got the response over here which uh now devolves to something else since we have uh trained our uh policy as per the RLHF framework. But if you look at the old model, it will still be the same. So we have significant deviation from where we started. So this essentially is what uh reinforcement learning from human feedback constitutes. So you've got the reward model on one side and then uh we do proximal policy optimization to align it to human preferences by using the reward model to represent as a proxy for the human preference score. So obviously again for a two-layer neural network for 16 data points u this all won't 
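Conceptually, a `similarity` search with `k=4` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with a brute-force scan over toy two-dimensional vectors. It assumes Euclidean (L2) distance, which is believed to be the default for the LangChain FAISS wrapper. The vectors, chunk IDs, and the `top_k` helper are made up for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=4):
    """Return the IDs of the k stored vectors closest to `query_vec`."""
    scored = sorted(
        (l2_distance(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()
    )
    return [doc_id for _, doc_id in scored[:k]]


# Toy "index": real embeddings have hundreds or thousands of dimensions.
toy_index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.0, 1.0],
    "chunk-c": [0.9, 0.1],
}
print(top_k([1.0, 0.0], toy_index, k=2))
# ['chunk-a', 'chunk-c']
```

A real vector store avoids the full scan for large collections, but the ranking principle is the same.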

# Step 3: Augmentation -> (query + retrieved context)

In [None]:
prompt = PromptTemplate(
    template = """
    You are a helpful assistant.
    Answer only from the provided transcript context.
    If the context is insufficient, just say you don't know.
    {context}
    Question: {question}
    """,
    input_variables = ["context", "question"]
)

In [None]:
question = "Is RLHF discussed in the video? If yes, what was discussed?"


In [None]:
retrieved_docs = retriever.invoke(question)

In [None]:
retrieved_docs

[Document(id='4d3c9fc0-7101-4439-a7a9-d404c685d565', metadata={}, page_content="of the alpha parameter that we've configured. So in this case again uh equal importance will be paid since our default is.5. So you can go for a hybrid implementation as well. So if you come back uh we read very minimal again used very minimal necessary force to understand theory on expert routing gating networks and load balancing. I really wanted you to have that experience while we were running the code and looking at the output and not get stuck in a bunch of theory. So we've also looked at how uh layers are implemented in PyTorch. Now this along with the part three which was modernizing the architecture and part four which was the scaling up of training will form the meat of what it looks like to have a production grade LLM trained uh training stability and communication and along with uh combining the with the dense layers for hybrid architecture. Right. So with that I think we are done with the part 

In [None]:
context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)

In [None]:
context_text

"of the alpha parameter that we've configured. So in this case again uh equal importance will be paid since our default is.5. So you can go for a hybrid implementation as well. So if you come back uh we read very minimal again used very minimal necessary force to understand theory on expert routing gating networks and load balancing. I really wanted you to have that experience while we were running the code and looking at the output and not get stuck in a bunch of theory. So we've also looked at how uh layers are implemented in PyTorch. Now this along with the part three which was modernizing the architecture and part four which was the scaling up of training will form the meat of what it looks like to have a production grade LLM trained uh training stability and communication and along with uh combining the with the dense layers for hybrid architecture. Right. So with that I think we are done with the part five as well. Uh I hope all of this is still making sense to you. We were very\

In [None]:
# create final prompt
final_prompt = prompt.invoke(
    {
        "context": context_text,
        "question": question
    }
)

In [None]:
final_prompt

StringPromptValue(text="\n    You are a helpful assistant.\n    Answer only from the provided transcript context.\n    If the context is insufficient, just say you don't know.\n    of the alpha parameter that we've configured. So in this case again uh equal importance will be paid since our default is.5. So you can go for a hybrid implementation as well. So if you come back uh we read very minimal again used very minimal necessary force to understand theory on expert routing gating networks and load balancing. I really wanted you to have that experience while we were running the code and looking at the output and not get stuck in a bunch of theory. So we've also looked at how uh layers are implemented in PyTorch. Now this along with the part three which was modernizing the architecture and part four which was the scaling up of training will form the meat of what it looks like to have a production grade LLM trained uh training stability and communication and along with uh combining the 
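The augmentation step amounts to string templating: the retrieved chunks are joined into one context block and substituted, together with the question, into a fixed instruction template. A minimal sketch using plain `str.format`, which stands in for `PromptTemplate` here; `TEMPLATE` and `build_prompt` are illustrative names.

```python
TEMPLATE = (
    "You are a helpful assistant.\n"
    "Answer only from the provided transcript context.\n"
    "If the context is insufficient, just say you don't know.\n"
    "{context}\n"
    "Question: {question}\n"
)


def build_prompt(chunks, question):
    """Join the retrieved chunk texts and fill both template slots."""
    context = "\n\n".join(chunks)  # blank line between chunks, as in the notebook
    return TEMPLATE.format(context=context, question=question)


print(build_prompt(["first chunk", "second chunk"], "What is RLHF?"))
```

Building the template without leading indentation also avoids the stray spaces that show up at the start of each line in the `StringPromptValue` output above.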

# Step 4: Generation

In [None]:
llm = ChatOpenAI(
    model = "gpt-4o-mini",
    temperature = 0.2
)

In [None]:
answer = llm.invoke(final_prompt)
print(answer)

content='Yes, RLHF (Reinforcement Learning from Human Feedback) is discussed in the video. It is explained that the process involves a reward model on one side and proximal policy optimization to align the model to human preferences. The reward model serves as a proxy for human preference scores, and the discussion highlights the need for a significant amount of data to make sense of the training process. Additionally, the video mentions the anthropics RLHF dataset, which includes chosen and rejected columns based on human preferences for building a large language model.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 105, 'prompt_tokens': 884, 'total_tokens': 989, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system

# Building a chain

In [None]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

In [None]:
def format_docs(retrieved_docs):
    """Join the page content of the retrieved documents into one context string."""
    context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
    return context_text

In [None]:
parallel_chain = RunnableParallel({
    'context': retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})

In [None]:
parallel_chain.invoke("What is Mixture of Experts")

{'context': "then from there we'll try to watch a worked out example to figure out how it all comes together. So first things first, what is mixture of expert? Now here we have a rough diagram of the architecture of mixture of experts. Now as you can see all of the uh attention blocks have been collapsed in. So this can be thought of as the more modern version of multi-headed attention with the group query attention and the swigloo and the rope and all the bells and whistles that we have discussed so far. But from there instead of feed forward network what happens is there's a gating function over here or a router which takes the output from the attention module and then decides to route it to one of these or one or more of these experts. So this router will have probabilities emitting from it and based on that we are going to route it to one or more of these experts. Each of these experts is a multi-layer perceptron or simply a linear layer and then from there the whole thing goes for

In [None]:
# parser
parser = StrOutputParser()

In [None]:
main_chain = parallel_chain | prompt | llm | parser

In [None]:
main_chain.invoke("Can you summarize the video")

'The video discusses a series of parts focused on modernizing and implementing various aspects of transformer architecture and related concepts. It covers topics such as policy networks, reward signals, training loops, and stability tricks. The upcoming parts will include modernizing the vanilla transformer architecture with techniques like RMS normalization and sliding window attention, scaling up tokenization, implementing mixture of experts, supervised fine-tuning, and creating a reward model using pair-wise preference datasets. The presenter encourages viewers to code along for better understanding and mentions the use of visualization tools to enhance comprehension of attention mechanisms. The video concludes with instructions on running code to visualize multi-headed attention.'
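The `|` operator in `parallel_chain | prompt | llm | parser` composes steps so that the output of one becomes the input of the next, while the parallel step fans the same input out to several branches and collects the results in a dict. The toy classes below imitate that behaviour in plain Python. They are a sketch of the idea only, not LangChain's implementation, and every name in the block (`Step`, `parallel`, the fake retriever) is hypothetical.

```python
class Step:
    """Wrap a function so steps can be chained with `|` and run with `.invoke()`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b  ->  a new step that runs a, then feeds its result to b
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**branches):
    """Run every branch on the same input and collect the results by name."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


# Stand-ins for the retriever, formatter, passthrough, and prompt.
fake_retriever = Step(lambda question: [f"doc about {question}"])
join_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
fill_prompt = Step(lambda d: f"{d['context']}\nQuestion: {d['question']}")

toy_chain = parallel(context=fake_retriever | join_docs, question=passthrough) | fill_prompt
print(toy_chain.invoke("MoE"))
# doc about MoE
# Question: MoE
```

Seen this way, `main_chain.invoke(question)` is the four manual steps from earlier in the notebook wired together into a single call.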