# **Retrieval-Augmented Generation (RAG) with LangChain**

## Introduction to RAG

### LLM Limitation: Knowledge Constraints
Large Language Models (LLMs) know only what was in their training data. On their own, they cannot access information published after training, nor private or domain-specific data.

### What is Retrieval-Augmented Generation?
RAG connects external data sources to an LLM to overcome this limitation. It retrieves the documents most relevant to the user's query and passes them to the LLM as context for generating the response.

## Standard RAG Workflow
1. **User Query Input**
2. **Retriever fetches relevant documents** from the vector store
3. **Context + query are passed to the LLM**
4. **LLM generates answer** using retrieved context
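
The four steps can be sketched without any library. The snippet below is a toy illustration under stated assumptions: `DOCS`, `tokenize`, `retrieve`, and `build_prompt` are hypothetical names, word overlap stands in for vector similarity, and the final LLM call is left out.

```python
# Toy sketch of the RAG workflow (hypothetical names, no LangChain involved).
DOCS = [
    "RAG retrieves documents and passes them to the LLM as context.",
    "Vector stores index embeddings for similarity search.",
    "Football is played by two teams of eleven players.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text, drop basic punctuation, and return its set of words."""
    return set(text.lower().replace(".", "").replace("?", "").split())


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Step 2: rank documents by word overlap with the query (stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Step 3: combine the retrieved context with the query."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


query = "How does RAG use documents as context?"  # Step 1: user query
top_docs = retrieve(query, DOCS, k=2)
prompt = build_prompt(query, top_docs)  # Step 4 would send this prompt to an LLM
print(prompt)
```

In a real pipeline, the overlap score is replaced by similarity between embedding vectors, and the prompt is sent to a chat model.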

## Preparing Data for Retrieval
To use RAG effectively, the documents must be ingested, split into manageable chunks, embedded, and stored in a vector database.

## ⭐ Document Loaders

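LangChain provides document loaders for many file formats and sources. Each loader returns a list of `Document` objects with `page_content` and `metadata`. The import below lists a selection; several of these loaders need optional dependencies (for example, `pypdf` for `PyPDFLoader`).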

```python
from langchain_community.document_loaders import (
    TextLoader,
    CSVLoader,
    JSONLoader,
    DirectoryLoader,
    PyPDFLoader,
    PDFPlumberLoader,
    PyMuPDFLoader,
    PDFMinerLoader,
    WebBaseLoader,
    UnstructuredURLLoader,
    RecursiveUrlLoader,
    SitemapLoader,
    S3DirectoryLoader,
    AzureBlobStorageContainerLoader,
    GoogleDriveLoader,
    ArxivLoader,
    YoutubeAudioLoader,
    NotionDirectoryLoader
)
```

```python
from langchain_community.document_loaders import CSVLoader

path_to_csv = r"E:\01_Github_Repo\GenAI-with-Langchain-and-Huggingface\_Developing_LLMs_Applications_with_LangChain\_data\fifa_countries_audience.csv"

# Load the CSV file using the CSVLoader; each row becomes one Document
csv_loader = CSVLoader(file_path=path_to_csv)
documents = csv_loader.load()

print("Content: ", documents[0].page_content, "\n")
print("Metadata:", documents[0].metadata)
```

```
Content:  country: united states
confederation: concacaf
population_share: 4.5
tv_audience_share: 4.3
gdp_weighted_share: 11.3 

Metadata: {'source': 'E:\\01_Github_Repo\\GenAI-with-Langchain-and-Huggingface\\_Developing_LLMs_Applications_with_LangChain\\_data\\fifa_countries_audience.csv', 'row': 0}
```
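
The output above shows how `CSVLoader` turns each row into one document: `page_content` holds one `column: value` line per column, and `metadata` records the source and the row index. The sketch below mimics that layout with the standard library only. `rows_to_documents` is a hypothetical helper, and the sample row copies the values from the output above, so treat it as an illustration of the format rather than the loader's actual implementation.

```python
import csv
import io

# One header line plus the first row shown in the output above
SAMPLE_CSV = (
    "country,confederation,population_share,tv_audience_share,gdp_weighted_share\n"
    "united states,concacaf,4.5,4.3,11.3\n"
)


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with page_content and metadata (hypothetical helper)."""
    documents = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per column, joined by newlines
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": index}}
        )
    return documents


docs = rows_to_documents(SAMPLE_CSV, source="fifa_countries_audience.csv")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```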


```python
from langchain_community.document_loaders import PyPDFLoader

path_to_pdf = r"E:\01_Github_Repo\GenAI-with-Langchain-and-Huggingface\_Developing_LLMs_Applications_with_LangChain\_data\RAG.pdf"

# Load the PDF; each page becomes one Document
pdf_loader = PyPDFLoader(file_path=path_to_pdf)
documents = pdf_loader.load()

print("Content: ", documents[0].page_content, "\n")
print("Metadata:", documents[0].metadata)
```

```
Content:  Retrieval Argument Generation: Enhancing Language Model 
 Capabilities Through External Knowledge Integration 
 1. Introduction to Retrieval Argument Generation (RAG) 
 Retrieval-Augmented Generation (RAG) represents a paradigm shift in how large 
 language models (LLMs) operate, moving beyond the constraints of their pre-trained 
 knowledge by incorporating information from external, authoritative knowledge bases 
 during the response generation process.  1  This  technique fundamentally optimizes the 
 output of LLMs, ensuring that the generated content is not solely reliant on the 
 model's internal parameters but is also grounded in a broader, often more current and 
 specific, set of information.  1  In the realm of natural  language processing (NLP), RAG 
 serves as a powerful tool to enhance text generation by seamlessly integrating data 
 from diverse knowledge repositories, including databases, digital asset libraries, and 
 comprehensive document repositories.  3  T
```

```python
from langchain_community.document_loaders import UnstructuredHTMLLoader

path_to_html = r"E:\01_Github_Repo\GenAI-with-Langchain-and-Huggingface\_Developing_LLMs_Applications_with_LangChain\_data\white_house_executive_order_nov_2023.html"

# Load the HTML file and extract its text content
html_loader = UnstructuredHTMLLoader(file_path=path_to_html, encoding='utf-8')
documents = html_loader.load()

print("Content: ", documents[0].page_content, "\n")
print("Metadata:", documents[0].metadata)
```

```
Content:  October 30, 2023

Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence

By the authority vested in me as President by the Constitution and the laws of the United States of America, it is hereby ordered as follows:

Section 1. Purpose. Artificial intelligence (AI) holds extraordinary potential for both promise and peril. Responsible AI use has the potential to help solve urgent challenges while making our world more prosperous, productive, innovative, and secure. At the same time, irresponsible use could exacerbate societal harms such as fraud, discrimination, bias, and disinformation; displace and disempower workers; stifle competition; and pose risks to national security. Harnessing AI for good and realizing its myriad benefits requires mitigating its substantial risks. This endeavor demands a society-wide effort that includes government, the private sector, academia, and civil society.

My Administration places the highest urge
```

## ⭐ Text Splitting

Split large documents into smaller chunks for effective embedding and retrieval.

```python
from langchain_text_splitters import (
    CharacterTextSplitter,
    TokenTextSplitter,
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
    SpacyTextSplitter,
    NLTKTextSplitter,
    MarkdownTextSplitter,
    HTMLHeaderTextSplitter,
    LatexTextSplitter,
    RecursiveJsonSplitter
)
```

```python
from langchain_text_splitters import CharacterTextSplitter

text = """Machine learning is a fascinating field.
    It involves algorithms and models that can learn from data.
    These models can then make predictions or decisions without 
    being explicitly programmed to perform the task.
    This capability is increasingly valuable in 
    various industries, from finance to healthcare.

    There are many types of machine learning, 
    including supervised, unsupervised, and reinforcement learning.
    Each type has its own 
    strengths and applications."""

# Split on paragraph breaks only
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=10
)

chunks = text_splitter.split_text(text)
print(chunks)
print([len(chunk) for chunk in chunks])
```

```
Created a chunk of size 323, which is longer than the specified 100

['Machine learning is a fascinating field.\n    It involves algorithms and models that can learn from data.\n    These models can then make predictions or decisions without \n    being explicitly programmed to perform the task.\n    This capability is increasingly valuable in \n    various industries, from finance to healthcare.', 'There are many types of machine learning, \n    including supervised, unsupervised, and reinforcement learning.\n    Each type has its own \n    strengths and applications.']
[323, 169]
```


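`CharacterTextSplitter` splits on a single separator only, so both paragraphs above stay whole (323 and 169 characters) even though `chunk_size=100`. `RecursiveCharacterTextSplitter` instead tries a list of separators in order, falling back to the next one only when a piece is still too large:

- `"\n\n"` (double newline): first, the text is split at paragraph breaks, keeping paragraphs intact.
- `"\n"` (single newline): if a piece is still too large, it is split at line breaks.
- `" "` (space): if that is still insufficient, it is split at word boundaries.
- `""` (empty string): as a last resort, it is split character by character.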
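
The fallback order above can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `recursive_split` is a hypothetical function, and it omits two things the real splitter does, namely merging small adjacent pieces back together up to `chunk_size` and applying `chunk_overlap`.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text so that no chunk exceeds chunk_size, trying coarse separators first."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, remaining = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        # Pieces that are still too long fall through to the next separator
        chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


sample = "aaaa bbbb\ncccc\n\ndddddddddddd"
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=6)
print(chunks)
```

Here the first paragraph is broken at the line break and then at the space, while the second paragraph has no separator left to try and is cut at fixed positions.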

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Try paragraph breaks first, then line breaks, spaces, and single characters
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=100,
    chunk_overlap=10
)

chunks = splitter.split_text(text)
print(chunks)
print([len(chunk) for chunk in chunks])
```

```
['Machine learning is a fascinating field.', 'It involves algorithms and models that can learn from data.', 'These models can then make predictions or decisions without', 'being explicitly programmed to perform the task.', 'This capability is increasingly valuable in', 'various industries, from finance to healthcare.', 'There are many types of machine learning,', 'including supervised, unsupervised, and reinforcement learning.\n    Each type has its own', 'strengths and applications.']
[40, 59, 59, 48, 43, 47, 41, 89, 27]
```


```python
from langchain_community.document_loaders import PyPDFLoader

path_to_pdf = r"E:\01_Github_Repo\GenAI-with-Langchain-and-Huggingface\_Developing_LLMs_Applications_with_LangChain\_data\RAG.pdf"

loader = PyPDFLoader(file_path=path_to_pdf)
documents = loader.load()

# split_documents keeps each page's metadata on the chunks derived from it
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
print(chunks)
print([len(chunk.page_content) for chunk in chunks])
```

```
[Document(metadata={'producer': 'Skia/PDF m135', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 Edg/135.0.0.0', 'creationdate': '2025-04-15T19:09:13+00:00', 'title': 'RAG: Definition and Applications - Google Docs', 'moddate': '2025-04-15T19:09:13+00:00', 'source': 'E:\\01_Github_Repo\\GenAI-with-Langchain-and-Huggingface\\_Developing_LLMs_Applications_with_LangChain\\_data\\RAG.pdf', 'total_pages': 46, 'page': 0, 'page_label': '1'}, page_content="Retrieval Argument Generation: Enhancing Language Model \n Capabilities Through External Knowledge Integration \n 1. Introduction to Retrieval Argument Generation (RAG) \n Retrieval-Augmented Generation (RAG) represents a paradigm shift in how large \n language models (LLMs) operate, moving beyond the constraints of their pre-trained \n knowledge by incorporating information from external, authoritative knowledge bases \n during the response generation process.  1  T
```

## Embedding and Storage
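An embedding model maps each chunk to a vector so that semantically similar text can be found by similarity search. LangChain integrates with many embedding providers and vector stores; this example uses OpenAI embeddings with a Chroma vector store.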
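
To make "similarity search" concrete, the sketch below ranks two stored vectors against a query vector using cosine similarity, one common choice of metric. Everything here is an assumption for illustration: the three-dimensional vectors are made up (real embedding models return far more dimensions), and `cosine_similarity` and `store` are hypothetical names, not part of LangChain or Chroma.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" keyed by a chunk label
store = {
    "chunk about retrieval": [0.9, 0.1, 0.0],
    "chunk about football": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

# The best match is the stored vector pointing in the most similar direction
best = max(store, key=lambda name: cosine_similarity(query_vector, store[name]))
print(best)
```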

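```python
import os

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Assumption: the API key is supplied via the OPENAI_API_KEY environment variable
openai_api_key = os.environ["OPENAI_API_KEY"]

embedding_model = OpenAIEmbeddings(
    api_key=openai_api_key,
    model="text-embedding-3-small"
)

# Embed every chunk and store the vectors in Chroma
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)
```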

## Building LCEL Retrieval Chain
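LangChain Expression Language (LCEL) lets you build a pipeline declaratively by chaining components with the `|` operator, so that each component's output becomes the next component's input.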
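
The idea behind `|` composition can be shown with a small stand-alone class. This is a conceptual sketch only, not how LangChain implements it: `Step`, `make_prompt`, `fake_llm`, and `parse` are hypothetical names, and the "model" just upper-cases its input.

```python
class Step:
    """Wrap a function so that steps can be chained with | (hypothetical class)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a first, then feeds its output to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


make_prompt = Step(lambda question: f"Question: {question}")
fake_llm = Step(lambda prompt: prompt.upper())  # stand-in for a model call
parse = Step(lambda text: text.strip())

pipeline = make_prompt | fake_llm | parse
print(pipeline.invoke("what is rag?"))
```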

```python
# Return the 2 chunks most similar to the query
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 2}
)
```

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""
Use the following pieces of context to answer the question at the end.
If you don't know the answer, say that you don't know.
Context: {context}
Question: {question}
""")
```

```python
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", api_key="...", temperature=0)

# The input string is sent to the retriever (as the search query)
# and passed through unchanged as "question"
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```

```python
# The chain expects the question as a plain string, because the retriever uses the input as its search query
result = chain.invoke("What are the key findings or results presented in the paper?")
print(result)
```
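
In the chain above, the retriever's output (a list of `Document` objects) is placed into `{context}` as is, so the prompt receives the list's string representation, metadata included. A common refinement is to join only the `page_content` of each document first. The sketch below shows such a helper. `format_docs` is a name chosen here, and `FakeDocument` is a hypothetical stand-in so the snippet runs without LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    FakeDocument("RAG grounds answers in external documents.", {"page": 0}),
    FakeDocument("Chunks are embedded and stored in a vector store.", {"page": 1}),
]
context = format_docs(retrieved)
print(context)
```

With real documents, the helper could be piped after the retriever, for example `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, relying on LCEL's coercion of plain functions into runnables.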