## RAG Options (Post-chunking) in LangChain — Categorized 

In [None]:
### 🧠 Embedding Options for RAG (Post-Chunking)

| Category                           | Providers / Methods                                                                 | Requires API Key | Downloads Model Locally | Notes                                                                 |
|------------------------------------|--------------------------------------------------------------------------------------|------------------|---------------------------|-----------------------------------------------------------------------|
| 🛰️ Cloud-based API Providers       | `OpenAIEmbeddings`, `CohereEmbeddings`, `AzureOpenAIEmbeddings`, `VertexAIEmbeddings`, `BedrockEmbeddings` | ✅ Yes           | ❌ No                    | Remote proprietary APIs. Fast, scalable, paid beyond free tiers.     |
| 🧠 Local Inference (Downloaded)    | `HuggingFaceEmbeddings`, `InstructorEmbedding`, `transformers` (custom), `llama-cpp` | ❌ No            | ✅ Yes                   | Fully local, private. Requires downloading models and compute.        |
| ☁️ Hosted Open-Source APIs         | `HuggingFaceInferenceAPIEmbeddings`, Together AI, Replicate (custom clients)        | ✅ Yes           | ❌ No                    | Hosted inference of open models. Slower but avoids local setup.      |
| ⚙️ Local Wrappers / CLI Simplicity | `Ollama`                                                                             | ❌ No            | ✅ Yes (on first run)    | Simplified local use. Wraps `llama.cpp`. Easy to start with.         |


## 1. Load env

In [None]:
from dotenv import load_dotenv
import os
load_dotenv('../.env')
key = os.getenv("OPENAI_KEY")
print(key[:3])

sk-
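
If `OPENAI_KEY` is missing from the `.env` file, `os.getenv` returns `None` and `key[:3]` raises a `TypeError`. A minimal sketch of a safer check follows; the helper name `key_preview` is made up for illustration and is not part of `python-dotenv`.

```python
import os


def key_preview(name: str, visible: int = 3) -> str:
    """Return the first few characters of an env var, or a notice if it is unset."""
    value = os.getenv(name)
    if not value:
        return f"{name} is not set"
    # Show only a short prefix so the full secret never reaches the notebook output.
    return value[:visible] + "..."


print(key_preview("OPENAI_KEY"))
```

Printing only a prefix keeps the secret out of saved notebook output while still confirming that the variable loaded.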


## 2. Load file

In [1]:
#2 load
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader('../llm_router.pdf')
document = loader.load()
print(document[0])

page_content='Published as a conference paper at ICLR 2025
ROUTE LLM: L EARNING TO ROUTE LLM S WITH
PREFERENCE DATA
Isaac Ong∗1 Amjad Almahairi∗2 Vincent Wu1 Wei-Lin Chiang1 Tianhao Wu1
Joseph E. Gonzalez1 M Waleed Kadous3 Ion Stoica1,2
1UC Berkeley 2Anyscale 3Canva
ABSTRACT
Large language models (LLMs) excel at a wide range of tasks, but choosing the
right model often involves balancing performance and cost. Powerful models offer
better results but are expensive, while smaller models are more cost-effective but
less capable. To address this trade-off, we introduce a training framework for
learning efficient router models that dynamically select between a stronger and
weaker LLM during inference. Our framework leverages human preference data
and employs data augmentation techniques to enhance performance. Evaluations
on public benchmarks show that our approach can reduce costs by over 2 times
without sacrificing response quality. Moreover, our routers exhibit strong general-
ization ca

## 3. Chunking

In [2]:
# 3 chunk
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=100)
docs = chunk.split_documents(documents=document)
print(docs[0])

page_content='Published as a conference paper at ICLR 2025
ROUTE LLM: L EARNING TO ROUTE LLM S WITH
PREFERENCE DATA
Isaac Ong∗1 Amjad Almahairi∗2 Vincent Wu1 Wei-Lin Chiang1 Tianhao Wu1
Joseph E. Gonzalez1 M Waleed Kadous3 Ion Stoica1,2' metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-02-25T01:57:29+00:00', 'author': '', 'keywords': '', 'moddate': '2025-02-25T01:57:29+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../llm_router.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}
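
With `chunk_size=250` and `chunk_overlap=100`, neighbouring chunks can share up to 100 characters. The sketch below shows that overlap with a plain fixed-window splitter. It is deliberately simpler than `RecursiveCharacterTextSplitter`, which tries separators such as paragraph and line breaks first, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 250, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 600 characters of repeating alphabet (illustrative only).
sample = "".join(chr(ord("a") + i % 26) for i in range(600))
parts = sliding_chunks(sample)
print([len(p) for p in parts])
print(parts[0][-100:] == parts[1][:100])
```

A large overlap relative to the chunk size, as here, produces many near-duplicate chunks, which is one reason the chunk count for a 16-page PDF is in the hundreds.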


In [63]:
cont = []
import pandas as pd
for i, doc in enumerate(docs):
    cont.append({
        "chunk_no:": i,
        "content": doc.page_content, 
        "metadata": doc.metadata
    })
pd.DataFrame(cont)

Unnamed: 0,chunk_no:,content,metadata
0,0,Published as a conference paper at ICLR 2025\n...,"{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."
1,1,Joseph E. Gonzalez1 M Waleed Kadous3 Ion Stoic...,"{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."
2,2,right model often involves balancing performan...,"{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."
3,3,"less capable. To address this trade-off, we in...","{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."
4,4,weaker LLM during inference. Our framework lev...,"{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."
...,...,...,...
358,358,optimize for performance and specify the maxim...,"{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."
359,359,token ratio so that 50% of calls are routed to...,"{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."
360,360,Both the matrix factorization router and causa...,"{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."
361,361,with up to 40% fewer calls routed to GPT-4.\nF...,"{'producer': 'pdfTeX-1.40.25', 'creator': 'LaT..."


## 4. Embedding and vectordb

### Option 1: OpenAI (paid API; fails here because the account has no quota)
#### Install: pip install tiktoken openai

In [51]:
#4 embedding + vectorize 
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
embedding = OpenAIEmbeddings(openai_api_key=key)
vectordb = FAISS.from_documents(docs, embedding)
vectordb.save_local("faissdb")

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

### Option 2: Hugging Face (local embeddings)
#### Install: pip install langchain faiss-cpu sentence-transformers

In [53]:
from langchain.embeddings import HuggingFaceBgeEmbeddings
# embedding = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5", model_kwargs={'device': 'cpu'})
embedding = HuggingFaceBgeEmbeddings()



In [65]:
import shutil
import os
from langchain.vectorstores import FAISS

vectorpath = "faissdb"

# Step 1: Delete existing folder if it exists
if os.path.exists(vectorpath):
    shutil.rmtree(vectorpath)
    print(f"Deleted existing vector database at {vectorpath}")

# Step 2: Create and save new FAISS db
vectordb = FAISS.from_documents(docs, embedding)
vectordb.save_local(vectorpath)

# Step 3: Confirm save
print(f"Saved new FAISS vector database at {vectorpath}")
print(f"Total vectors stored: {vectordb.index.ntotal}")



Deleted existing vector database at faissdb
Saved new FAISS vector database at faissdb
Total vectors stored: 363


## 5. Retrieve top k

In [66]:
# load vectordb and retrieve top k
vectorpath = "faissdb"
vectordb = FAISS.load_local(vectorpath,embedding, allow_dangerous_deserialization=True)
print(vectordb.index.ntotal)
retriever = vectordb.as_retriever(search_kwargs={
                                                'k': 1
})
question = "give me llm router algorithms?"
results = retriever.get_relevant_documents(question, filter={'keywords':''})
print(results[0].page_content)
print(results[0].metadata)

# or 
# result = vectordb.similarity_search(query=question, k=3)

363
In this work, we introduce a principled framework for learning LLM routers from preference data.
Our approach involves routing between two classes of models: (1) strong models, which provide
{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-02-25T01:57:29+00:00', 'author': '', 'keywords': '', 'moddate': '2025-02-25T01:57:29+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../llm_router.pdf', 'total_pages': 16, 'page': 1, 'page_label': '2'}
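
With `'k': 1`, the retriever returns only the single nearest chunk. The lookup itself can be sketched as a brute-force nearest-neighbour search over the stored vectors. The 2-D vectors and the `top_k` helper below are invented for illustration and do not reflect how FAISS is implemented internally.

```python
import math


def top_k(query: list[float], vectors: list[list[float]], k: int = 1) -> list[tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller Euclidean distance = more similar
    return scored[:k]


# Toy embeddings standing in for chunk vectors (assumed values).
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
toy_query = [0.5, 0.9]

for index, distance in top_k(toy_query, toy_vectors, k=2):
    print(index, round(distance, 3))
```

Raising `k` gives the LLM more context to work with, at the cost of a longer prompt.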


## 6. Use LLM for chains

### OpenAI and Groq are unavailable here, so Option 1 uses a local Hugging Face model

In [76]:
## llms
# from langchain.llms import groq
# cached at C:\Users\HIMANSHU\.cache\huggingface\hub
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
text_pipeline = pipeline(
                    # "text-generation",
                    "text2text-generation",
                    model="google/flan-t5-small",
                    # model="google/flan-t5-base",
                     max_length=1024,
                     temperature=0.5,
                     device=-1)
llm = HuggingFacePipeline(pipeline=text_pipeline)

Device set to use cpu


In [77]:
## prompt template
from langchain.prompts import PromptTemplate
prompt_template = """You are a RAG expert. Use the following context to answer the question at the end. If you don't know the answer, just say you don't know. Don't try to make up an answer.

Context:
{context}

Question:
{question}

Helpful Answer:"""

prompt = PromptTemplate(input_variables=['context', 'question'], template=prompt_template)
print(prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} template="You are a RAG expert. Use the following context to answer the question at the end. If you don't know the answer, just say you don't know. Don't try to make up an answer.\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nHelpful Answer:"
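
With `chain_type='stuff'`, the retrieved chunks are joined together and substituted for `{context}` before the prompt reaches the model. A rough sketch of that step using plain string formatting follows; the two sample chunks and the `build_prompt` helper are made up, and the template is shortened.

```python
TEMPLATE = (
    "Use the following context to answer the question at the end.\n\n"
    "Context:\n{context}\n\nQuestion:\n{question}\n\nHelpful Answer:"
)


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks with blank lines and fill the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


filled = build_prompt(
    ["Chunk one text.", "Chunk two text."],  # stand-ins for retrieved page_content
    "give me llm router algorithms?",
)
print(filled)
```

Printing the filled prompt like this is a quick way to see exactly what a small model is being asked to answer from.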


In [78]:
## chains
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm, 
                    retriever=retriever, 
                    chain_type='stuff',
                    chain_type_kwargs ={'prompt':prompt},
                    return_source_documents=True)
question = "give me llm router algorithms?"
result = chain(question)
print(result)



{'query': 'give me llm router algorithms?', 'result': '(ii)', 'source_documents': [Document(id='f0d9df1b-5dc9-4286-a1cc-51a3cb964ae4', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-02-25T01:57:29+00:00', 'author': '', 'keywords': '', 'moddate': '2025-02-25T01:57:29+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../llm_router.pdf', 'total_pages': 16, 'page': 1, 'page_label': '2'}, page_content='In this work, we introduce a principled framework for learning LLM routers from preference data.\nOur approach involves routing between two classes of models: (1) strong models, which provide')]}


In [None]:
print(result['query'])
print(result['result'])

give me llm router algorithms?
(ii)


## debug

In [71]:
for idx, doc in enumerate(result['source_documents']):
    print(f"Document {idx+1}:\n{doc.page_content}\n")


Document 1:
In this work, we introduce a principled framework for learning LLM routers from preference data.
Our approach involves routing between two classes of models: (1) strong models, which provide



## Option 2: llama.cpp
#### Install: pip install llama-cpp-python (on Windows without a C++ compiler, install the Visual C++ Build Tools first)


In [73]:
# Import the LlamaCpp class itself. `from langchain.llms import llamacpp` only imports
# the module, so `LlamaCpp(...)` failed with "NameError: name 'LlamaCpp' is not defined".
from langchain.llms import LlamaCpp
llm = LlamaCpp(
    model_path="path/to/your/model.gguf",
    temperature=0.3,
    max_tokens=1024,
    n_ctx=4096,    # set according to model capability
    # n_gpu_layers=30,  # optional: speed up if you have GPU
)

In [None]:
## chains
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm, 
                    retriever=retriever, 
                    chain_type='stuff',
                    chain_type_kwargs ={'prompt':prompt},
                    return_source_documents=True)
question = "give me llm router algorithms?"
result = chain(question)
print(result)

## Azure

In [None]:
import getpass
import os
import bs4
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import AzureOpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def initialize_llm(env_file: str = None):

    if env_file:
        load_dotenv(env_file)
    else:
        load_dotenv()

    os.environ['USER_AGENT'] = 'myagent'

    llm = AzureChatOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
        openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    )

    embeddings = AzureOpenAIEmbeddings(
        model="text-embedding-3-large",
        azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
        openai_api_version=os.environ["AZURE_OPENAI_EMBEDDINGS_API_VERSION"])
    
    return llm, embeddings

def build_rag_pipeline(llm: AzureChatOpenAI, azure_embeddings: AzureOpenAIEmbeddings, documents: list):
    # Load, chunk and index the contents of the blog.
    loader = WebBaseLoader(
        web_paths=documents,
        bs_kwargs=dict(
            parse_only=bs4.SoupStrainer(
                class_=("post-content", "post-title", "post-header")
            )
        ),
    )
    docs = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(docs)

    vectorstore = Chroma.from_documents(documents=splits, embedding=azure_embeddings)

    # Retrieve and generate using the relevant snippets of the blog.
    retriever = vectorstore.as_retriever()
    prompt = hub.pull("rlm/rag-prompt")

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    return rag_chain

def query_rag_pipeline(rag_chain, query_text: str):
    result = rag_chain.invoke(query_text)
    return result

if __name__ == "__main__":
    llm, azure_embeddings = initialize_llm("azure.env")
    documents = [
        "https://lilianweng.github.io/posts/2023-06-23-agent/"
    ]

    rag_chain = build_rag_pipeline(llm, azure_embeddings, documents)
    query = "What is Task decomposition?"
    answer = query_rag_pipeline(rag_chain, query)
    print(f"Question: {query}\nAnswer: {answer}")
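
The `rag_chain` above pipes data through four stages: retrieve, format, prompt, generate. The same data flow can be sketched with ordinary functions. Everything below is a stand-in: `retrieve` returns canned text and `fake_llm` just echoes a line, so no Azure call is made.

```python
def retrieve(question: str) -> list[str]:
    """Stand-in retriever: returns canned chunks instead of querying a vector store."""
    return ["Task decomposition splits a goal into steps.", "Agents plan before acting."]


def format_docs(docs: list[str]) -> str:
    """Join chunks with blank lines, mirroring format_docs in the pipeline above."""
    return "\n\n".join(docs)


def fill_prompt(inputs: dict) -> str:
    """Stand-in for the pulled prompt: place context and question into one string."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text: str) -> str:
    """Stand-in model: echoes the first context line instead of calling an API."""
    return prompt_text.splitlines()[1]


def toy_rag_chain(question: str) -> str:
    # The dict step: context comes from retrieve + format, the question passes through.
    inputs = {"context": format_docs(retrieve(question)), "question": question}
    return fake_llm(fill_prompt(inputs))


print(toy_rag_chain("What is Task decomposition?"))
```

Swapping any stand-in for a real component leaves the other stages unchanged, which is the main appeal of the piped style.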

## Hybrid (BM25 retriever)
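
BM25 ranks documents by keyword overlap, weighting rare terms more heavily and normalising for document length. The sketch below implements one common variant of the Okapi BM25 formula on whitespace tokens. It is illustrative only: the `k1` and `b` values are assumed typical defaults, the toy documents are invented, and the exact formula inside the library behind `BM25Retriever` may differ.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic Okapi BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "matrix factorization router for llm routing",
    "tree boosting with xgboost",
    "causal llm classifier router",
]
scores = bm25_scores("llm router", toy_docs)
print([round(s, 3) for s in scores])
```

Because BM25 matches exact tokens, it complements embedding search, which can miss rare identifiers or acronyms.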

In [None]:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQAWithSourcesChain
# from langchain.llms import Groq
from langchain.prompts import PromptTemplate
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# === Load .env ===
load_dotenv()

# === Load and Chunk PDF ===
loader = PyPDFLoader("your_file.pdf")
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
documents = splitter.split_documents(pages)

# === Add Metadata (chunk ID) ===
for i, doc in enumerate(documents):
    doc.metadata["chunk_id"] = i

# === Azure Embeddings ===
embedding = OpenAIEmbeddings(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    openai_api_base=os.getenv("OPENAI_API_BASE"),
    openai_api_type="azure",
    openai_api_version=os.getenv("OPENAI_API_VERSION"),
    deployment=os.getenv("AZURE_EMBEDDING_DEPLOYMENT")
)

# === FAISS: persist vectorstore ===
index_path = "faiss_index"
if os.path.exists(index_path):
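    # Loading unpickles the stored docstore, so recent LangChain versions require this explicit opt-in.
    vectorstore = FAISS.load_local(index_path, embedding, allow_dangerous_deserialization=True)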
else:
    vectorstore = FAISS.from_documents(documents, embedding)
    vectorstore.save_local(index_path)

faiss_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# === BM25 retriever (keyword-based) ===
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# === Hybrid Ensemble Retriever ===
retriever = EnsembleRetriever(
    retrievers=[faiss_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Tune based on performance
)

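# === Groq LLM ===
# `Groq` was never imported above; ChatGroq comes from the separate langchain-groq package.
from langchain_groq import ChatGroq
llm = ChatGroq(
    api_key=os.getenv("GROQ_API_KEY"),
    model=os.getenv("GROQ_MODEL"),
    temperature=0.3
)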

# === Advanced Prompt Template ===
prompt_template = PromptTemplate.from_template("""
You are an expert assistant helping summarize and answer from document context.
Use the following chunks to answer the question, and cite source chunk IDs when relevant.

Context:
{context}

Question:
{question}

Answer with sources at the end like: (Source: chunk_id 3, 5)
""")

# === Retrieval QA with Sources ===
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
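    # The sources chain fills a "summaries" variable by default; point it at this prompt's {context}.
    chain_type_kwargs={"prompt": prompt_template, "document_variable_name": "context"}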
)

# === Ask Question ===
query = "What are the main takeaways from the document?"
result = qa_chain(query)

print("\nAnswer:\n", result["answer"])
print("\nSources:\n", result.get("sources"))

# === Optional: Print full text of source docs ===
print("\nTop Retrieved Chunks:")
for doc in result["source_documents"]:
    print(f"Chunk ID: {doc.metadata.get('chunk_id')}")
    print(doc.page_content)
    print("-" * 40)
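
The `weights=[0.6, 0.4]` setting controls how much each retriever's ranking counts in the merged result. One way to picture the merge is weighted reciprocal rank fusion, sketched below. The chunk IDs and both rankings are invented, and the constant `c=60` is an assumed conventional value rather than something taken from this notebook.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists: each hit adds weight / (rank + c) to its document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["c3", "c1", "c7"]    # hypothetical order from the FAISS retriever
keyword_ranking = ["c1", "c9", "c3"]  # hypothetical order from the BM25 retriever
fused = weighted_rrf([dense_ranking, keyword_ranking], weights=[0.6, 0.4])
print(fused)
```

A chunk that appears near the top of both lists can outrank one that is first in only a single list, which is the behaviour a hybrid retriever is meant to provide.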


In [4]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("H:/Resume/xgboost_scale.pdf")
documents = loader.load()

In [5]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://github.com/himsgpt")
docs2 = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
textsplitter = RecursiveCharacterTextSplitter(chunk_size=900, chunk_overlap=150, separators=["\n\n", "\n", ".", " "],)
chunks = textsplitter.split_documents(documents)
chunks[0]

Document(metadata={'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2016-06-14T01:29:40+00:00', 'author': '', 'keywords': '', 'moddate': '2016-06-14T01:29:40+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'H:/Resume/xgboost_scale.pdf', 'total_pages': 13, 'page': 0, 'page_label': '1'}, page_content='XGBoost: A Scalable Tree Boosting System\nTianqi Chen\nUniversity of Washington\ntqchen@cs.washington.edu\nCarlos Guestrin\nUniversity of Washington\nguestrin@cs.washington.edu\nABSTRACT\nTree boosting is a highly eﬀective and widely used machine\nlearning method. In this paper, we describe a scalable end-\nto-end tree boosting system called XGBoost, which is used\nwidely by data scientists to achieve state-of-the-art results\non many machine learning challenges. We propose a novel\nsparsity-aware algorithm for sparse data and 

In [25]:
# chunks[0].page_content
chunks[0].metadata

{'producer': 'pdfTeX-1.40.12',
 'creator': 'LaTeX with hyperref package',
 'creationdate': '2016-06-14T01:29:40+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2016-06-14T01:29:40+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': 'H:/Resume/xgboost_scale.pdf',
 'total_pages': 13,
 'page': 0,
 'page_label': '1'}

In [15]:
chunks2 = textsplitter.split_documents(docs2)
chunks2[0].metadata

{'source': 'https://github.com/himsgpt',
 'title': 'himsgpt (Himanshu Gupta) · GitHub',
 'description': 'With 8+ years of experience in the Data Science & Products, Himanshu specializes in Fraud and Auth modeling, Generative AI product development, ML modeling - himsgpt',
 'language': 'en'}

In [4]:
chunk_lengths = [len(chunk.page_content) for chunk in chunks]
print(f"Avg length: {sum(chunk_lengths) / len(chunk_lengths):.2f}")


Avg length: 832.33
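
An average close to the 900-character limit suggests most chunks are nearly full, but the mean alone can hide a few very short chunks. A small sketch of a fuller summary follows; the `length_summary` helper and the toy lengths are made up for illustration.

```python
import statistics


def length_summary(lengths: list[int]) -> dict:
    """Summarise chunk lengths so short outliers are visible next to the mean."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(statistics.mean(lengths), 2),
        "median": statistics.median(lengths),
    }


toy_lengths = [900, 850, 880, 120, 895]  # assumed values; one short trailing chunk
summary = length_summary(toy_lengths)
print(summary)
```

In the notebook, the same helper could be called on `chunk_lengths` from the cell above.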


In [7]:
for i, chunk in enumerate(chunks[:2]):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk.page_content)
    print(f"\n[Metadata: {chunk.metadata}]")


--- Chunk 1 ---
XGBoost: A Scalable Tree Boosting System
Tianqi Chen
University of Washington
tqchen@cs.washington.edu
Carlos Guestrin
University of Washington
guestrin@cs.washington.edu
ABSTRACT
Tree boosting is a highly eﬀective and widely used machine
learning method. In this paper, we describe a scalable end-
to-end tree boosting system called XGBoost, which is used
widely by data scientists to achieve state-of-the-art results
on many machine learning challenges. We propose a novel
sparsity-aware algorithm for sparse data and weighted quan-
tile sketch for approximate tree learning. More importantly,
we provide insights on cache access patterns, data compres-
sion and sharding to build a scalable tree boosting system.
By combining these insights, XGBoost scales beyond billions
of examples using far fewer resources than existing systems.
Keywords
Large-scale Machine Learning
1. INTRODUCTION

[Metadata: {'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creati