<a href="https://colab.research.google.com/github/jibz33on/rag-pipeline-langchain/blob/main/RAG_Pipeline_with_LangChain_and_OpenAI_(Web_Data_Loader_%2B_ChromaDB).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Pipeline with LangChain, OpenAI & Chroma
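This notebook builds a Retrieval-Augmented Generation (RAG) pipeline: it loads a web page, splits it into chunks, embeds the chunks with OpenAI, stores them in Chroma, and answers questions from the retrieved context.  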
For an overview of the project and workflow diagram, see the [README](./README.md).


In [1]:
"""
Install required libraries:
- langchain_community, langchainhub, chromadb, langchain, langchain-openai
These are needed for document loading, embeddings, vector storage, and LLM integrations.
"""
!pip install langchain_community langchainhub chromadb langchain langchain-openai




In [2]:
"""
Load OpenAI API key from Google Colab's userdata and set it as an environment variable.
This will allow authenticated access to OpenAI models (embeddings + LLM).
"""
from google.colab import userdata
import os
os.environ['OPENAI_API_KEY'] = userdata.get('OpenAI_API_Key')
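
A missing or empty key may otherwise only surface later, as an authentication error from the first embedding or LLM call. The sketch below shows one way to fail early. `require_api_key` is a hypothetical helper (not part of Colab or LangChain), and the `env` parameter exists only so the check can be tried without touching the real environment.

```python
import os

def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it before running the pipeline")
    return key

# Demonstration against a stand-in environment (the value is a placeholder, not a real key).
demo_env = {"OPENAI_API_KEY": "sk-example"}
print(len(require_api_key(env=demo_env)))
```

In the notebook itself the call would be `require_api_key()` with no arguments, placed right after the environment variable is set.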


In [3]:
"""
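Load a webpage using LangChain's WebBaseLoader.
Here, the course page 'https://www.educosys.com/course/genai' is fetched and reduced to its text content
(HTML tags are stripped, which is why neighbouring page elements run together in the output).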
"""
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(web_paths=["https://www.educosys.com/course/genai"])
docs = loader.load()
print(docs)   # Display loaded documents




[Document(metadata={'source': 'https://www.educosys.com/course/genai', 'title': 'Hands-on Generative AI Course', 'description': 'Hands-on Generative AI Course', 'language': 'en'}, page_content="Hands-on Generative AI CourseCoursesBundle CoursesStudent DiscountFree ContentTestimonialsFAQLogin Signup Starts on 16th September 2025Hands-on Generative AI CourseLearn, Build, Deploy and Apply Generative AI7 weeks · 3 classes/week · 2 hrs/class + Post-class Doubt SupportClasses on Tue, Wed, Thurs - 9PM ISTAccess all Live BatchesLifetime access of RecordingsAccess Discord CommunityCode availableBuild ProjectsLearn Future-Ready TechEnroll 1Week 1Foundations of Generative AI Introduction to AI Mathematical Foundations for AI Probability, Statistics, and Linear Algebra Basics of Neural Networks Gradient Descent and Optimization Basics Architectures: Feedforward, RNN, and CNN Mini Project - Build a Simple Neural Network Using TensorFlow Mini Project - Train an Autoencoder on the MNIST Dataset2Week 

In [4]:
"""
Split the loaded webpage content into smaller chunks using RecursiveCharacterTextSplitter.
- chunk_size=1000 → max characters per chunk
- chunk_overlap=200 → overlapping text for better context preservation
"""
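# The splitter lives in the langchain-text-splitters package; `langchain.text_splitter` is the legacy import path
from langchain_text_splitters import RecursiveCharacterTextSplitter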

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

print(splits[0])     # Show the first split
print(len(splits))   # Show number of splits created


page_content='Hands-on Generative AI CourseCoursesBundle CoursesStudent DiscountFree ContentTestimonialsFAQLogin Signup Starts on 16th September 2025Hands-on Generative AI CourseLearn, Build, Deploy and Apply Generative AI7 weeks · 3 classes/week · 2 hrs/class + Post-class Doubt SupportClasses on Tue, Wed, Thurs - 9PM ISTAccess all Live BatchesLifetime access of RecordingsAccess Discord CommunityCode availableBuild ProjectsLearn Future-Ready TechEnroll 1Week 1Foundations of Generative AI Introduction to AI Mathematical Foundations for AI Probability, Statistics, and Linear Algebra Basics of Neural Networks Gradient Descent and Optimization Basics Architectures: Feedforward, RNN, and CNN Mini Project - Build a Simple Neural Network Using TensorFlow Mini Project - Train an Autoencoder on the MNIST Dataset2Week 2Deep Generative Models Discriminative and Generative models Generative Adversarial Networks (GANs) Variational Autoencoders (VAEs) Probabilistic Data Generation Using VAEs Four Mi
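
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than the limit). `sliding_chunks` and the sample text are hypothetical names introduced only for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# 2500 characters of stand-in text
sample_text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample_text, chunk_size=1000, chunk_overlap=200)
print([len(chunk) for chunk in chunks])
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.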

In [5]:
"""
Generate embeddings for text chunks and build a vectorstore.
- OpenAIEmbeddings(model="text-embedding-3-small") → converts text to vector representation
- Chroma → stores embeddings for similarity search and retrieval
"""
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma   # `langchain.vectorstores` is the deprecated import path

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")   # Set up embeddings
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)  # Build vectorstore

print(vectorstore._collection.count())   # Number of vectors stored
stored = vectorstore._collection.get()   # IDs, documents, and metadata (embeddings are omitted by default)
print(stored)
# IDs are regenerated on every run, so look one up instead of hard-coding it
print("\nCollection 1 - ", vectorstore._collection.get(
    ids=[stored["ids"][0]],
    include=["embeddings", "documents"]))  # Inspect one document with its embedding


11
{'ids': ['0bf9c242-c93a-4dfa-b7d3-58ed0b85c870', '28c16fff-b25f-4618-8095-65edc16cca0f', '87ab7ebd-450f-4d5d-ab00-29b14f99964b', '775e8923-ddbf-47ab-8d66-586b0c0b1f2b', 'f254136e-ca9f-468e-ac1c-4a1d99e91839', '0f2d1685-f10c-4b08-aeb1-995bc1aacb72', '8a982e82-7c96-4d9c-b67e-3f7ef3d86979', 'b5af1e61-79b5-4e41-b88d-48447f7c1591', '366e42bd-27dc-490b-9212-981a2068e743', '16e2e175-4664-49f6-b660-3281d3e0a651', 'e22ec014-d350-4b1b-b0d9-98e2bd4c0571'], 'embeddings': None, 'documents': ['Hands-on Generative AI CourseCoursesBundle CoursesStudent DiscountFree ContentTestimonialsFAQLogin Signup Starts on 16th September 2025Hands-on Generative AI CourseLearn, Build, Deploy and Apply Generative AI7 weeks · 3 classes/week · 2 hrs/class + Post-class Doubt SupportClasses on Tue, Wed, Thurs - 9PM ISTAccess all Live BatchesLifetime access of RecordingsAccess Discord CommunityCode availableBuild ProjectsLearn Future-Ready TechEnroll 1Week 1Foundations of Generative AI Introduction to AI Mathematical F
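
Conceptually, a vector store answers a query by comparing the query's embedding with every stored embedding and returning the closest chunks. The sketch below illustrates that idea with cosine similarity over hand-written three-dimensional vectors. The vectors, the texts, and the `top_k` helper are made up for illustration; Chroma's actual index and its default distance metric may differ, and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to `query_vec`."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "embeddings" (hypothetical values, not produced by any model)
toy_store = {
    "recordings have lifetime access": [0.9, 0.1, 0.0],
    "classes run on Tue, Wed, Thurs": [0.1, 0.9, 0.1],
    "mini project on MNIST": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about recordings
print(top_k(query_vec, toy_store, k=2))
```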

In [6]:
"""
Convert the vectorstore into a retriever.
Given a query, the retriever returns the most similar chunks (the 4 closest by default).
"""
retriever = vectorstore.as_retriever()


In [7]:
"""
Pull a ready-made Retrieval-Augmented Generation (RAG) prompt template from LangChain Hub.
This defines how the retrieved context + question will be formatted for the LLM.
"""
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")


In [8]:
"""
Initialize a ChatOpenAI model to be used as the LLM in the RAG pipeline.
"""
from langchain_openai import ChatOpenAI
llm = ChatOpenAI()


In [9]:
"""
Import utilities:
- RunnablePassthrough → forwards input unchanged
- StrOutputParser → parses LLM response into plain text
"""
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


In [10]:
"""
Helper function to format retrieved documents into a single string.
"""
def format_docs(docs):
  return "\n".join(doc.page_content for doc in docs)
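
A quick way to see what `format_docs` produces is to call it on stand-in objects. The sketch below assumes only that each document exposes a `page_content` attribute; `SimpleNamespace` stands in for LangChain's `Document`, and the two texts are illustrative.

```python
from types import SimpleNamespace

def format_docs(docs):
    # Same helper as in the cell above, repeated so this example runs on its own
    return "\n".join(doc.page_content for doc in docs)

# Stand-ins for retrieved documents (hypothetical content)
fake_docs = [
    SimpleNamespace(page_content="Lifetime access of Recordings"),
    SimpleNamespace(page_content="Classes on Tue, Wed, Thurs"),
]
context = format_docs(fake_docs)
print(context)
```

Because chunks are joined with a single newline, the LLM sees no explicit boundary between them. Joining with a blank line or a visible delimiter is a common alternative when chunk boundaries matter.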


In [11]:
"""
Build the RAG chain:
1. Retrieve relevant context from vectorstore
2. Format context + question into a prompt
3. Pass it to the LLM
4. Parse the output into plain text
"""
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
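
The data flow through the chain can be sketched with plain functions. This is an illustration of the four steps only, not how LangChain implements the `|` operator. Every `fake_*` name, the canned chunks, and the stub answer are hypothetical stand-ins, so the example runs without an API key.

```python
def fake_retriever(question):
    """Stand-in for `retriever`: returns canned chunks regardless of the question."""
    return ["Lifetime access of Recordings", "Access Discord Community"]

def fake_format_docs(chunks):
    """Stand-in for `format_docs`, operating on plain strings."""
    return "\n".join(chunks)

def build_inputs(question):
    # Step 1: the dict step. Both branches receive the same question:
    # one runs retrieval and formatting, the other passes the question through unchanged.
    return {
        "context": fake_format_docs(fake_retriever(question)),
        "question": question,
    }

def fake_prompt(inputs):
    # Step 2: fill a template with the context and the question
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"

def fake_llm(prompt_text):
    # Step 3: a real model would generate text here; this stub returns a fixed message
    return {"content": "stub answer"}

def parse_output(message):
    # Step 4: keep only the text of the model's message
    return message["content"]

def run_chain(question):
    return parse_output(fake_llm(fake_prompt(build_inputs(question))))

print(fake_prompt(build_inputs("Are recordings available?")))
print(run_chain("Are recordings available?"))
```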


In [12]:
"""
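Invoke the RAG chain with specific questions.
The prompt instructs the LLM to answer from the retrieved webpage content and to say so when it does not know.
A notebook cell displays only the value of its last expression, so only the second answer appears in the output;
wrap each call in print() to see both.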
"""
rag_chain.invoke("Are the recordings of the course available? For how long?")
rag_chain.invoke("Are the testimonials for the course available? Name the students who have shared testimonials")


'Yes, testimonials for the course are available. The students who have shared testimonials are Sahitya Raj, Sudarshan Suresh Srikant, and Syed I.'

In [13]:
"""
Define a custom function to print the generated prompt before sending it to the LLM.
This helps in debugging or understanding how context + question are combined.
"""
from langchain_core.runnables import RunnableLambda

def print_prompt(prompt_text):
  print("Prompt - ", prompt_text)
  return prompt_text


In [14]:
"""
Extended RAG chain with prompt inspection:
- Uses RunnableLambda to print the constructed prompt
- Sends it to the LLM
- Returns parsed text output
"""
rag_chain_with_print = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | RunnableLambda(print_prompt)
    | llm
    | StrOutputParser()
)

rag_chain_with_print.invoke("What all projects are covered in the course?")


Prompt -  messages=[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: What all projects are covered in the course? \nContext: knowledge of Machine Learning. Keerti’s passion for teaching made complex topics easy to grasp. I highly recommend this course to anyone interested in AI and ML!Read moreManika KaushikSenior Software EngineerOptum-United HealthGroupKeerti explains everything in such simple and creative manner, even difficult and huge topics became easy to understand.Frequently asked questionsIs this a Live or Recorded Course?When will the next Live batch be launched?What if I am interested in learning Live only?What are the prerequisites for the course?Is Machine Learning pre-requisite for the course?How many projects will we work on? Can I add these to resume?I

'The course covers projects on Foundations of Generative AI, Deep Generative Models, Generative Adversarial Networks (GANs), and Variational Autoencoders (VAEs). There are mini-projects included in the course, such as building a neural network using TensorFlow and training an autoencoder on the MNIST dataset. The course provides a seamless understanding of self-attention, positional encoding, and the latest concepts like DeepSeek, offering a balance between theory and hands-on learning.'