# LangChain RAG Using Local Embeddings for PDF
This notebook implements a retrieval-augmented generation (RAG) system with the LangChain framework, with a focus on:
* Generating vector embeddings locally for similarity search
* Indexing and querying a PDF document to find relevant passages
* Using the retrieved passages to generate answers to questions

## Setting Up
Uncomment to install the required packages.

In [None]:
# pip install -U langchain-anthropic langchain_community langchain_chroma pypdf sentence_transformers

Uncomment if the `ANTHROPIC_API_KEY` environment variable is not set yet.

In [2]:
# import getpass
# import os

# os.environ["ANTHROPIC_API_KEY"] = getpass.getpass()

## Loading The Document
The document was downloaded from https://investor.fb.com/financials/ and saved as a local file.

In [5]:
from langchain_community.document_loaders import PyPDFLoader

FILE_PATH = "example_data/meta-10k-2023.pdf"
loader = PyPDFLoader(FILE_PATH)

docs = loader.load()

print(len(docs))

147


In [18]:
print(docs[0].page_content[300:650])
print(docs[0].metadata)

NSITION REPOR T PURSUANT  TO SECTION 13 OR 15(d) OF  THE SECURITIES EXCHANGE ACT  OF 1934
For the transition period fr om            to            
Commission File Number: 001-35551
__________________________
Meta Platforms, Inc.
(Exact name of r egistrant as specified in its charter)
__________________________
Delawar e 20-1665019
(State or other 
{'source': 'example_data/meta-10k-2023.pdf', 'page': 0}


## Build Custom Embeddings
This section describes how to generate embeddings locally without relying on third-party services like OpenAI. We use the `all-MiniLM-L6-v2` model, which has the following characteristics:

1. Model size: Approximately 90 MB (about 22.7 million parameters)
2. Storage: Downloaded automatically on first use, then loaded from the local cache
3. Functionality: Maps each piece of text to a 384-dimensional embedding vector

By using local embeddings, you maintain control over your data processing pipeline and reduce dependencies on external services.

In [None]:
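from typing import List

from langchain_core.embeddings import Embeddings
from sentence_transformers import SentenceTransformer


class CustomEmbeddings(Embeddings):
    """LangChain embeddings wrapper around a local SentenceTransformer model."""

    def __init__(self, model_name: str):
        # Downloads the model on first use, then loads it from the local cache.
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        # Encode all texts in one batched call instead of one call per document.
        return self.model.encode(documents).tolist()

    def embed_query(self, query: str) -> List[float]:
        # Encode a single query and return its vector as a plain list of floats.
        return self.model.encode([query])[0].tolist()


embedding_model = CustomEmbeddings("all-MiniLM-L6-v2")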

## Preparing Documents for Retrieval
1. Use a text splitter to divide the loaded documents into smaller chunks. This keeps each segment focused enough to embed meaningfully and small enough that several retrieved segments fit within the LLM's context window together.

2. Load the split documents into a vector store. This step converts each chunk into a numerical vector (an embedding) so that related text can be found by vector similarity.

3. Implement a retriever based on the vector store. This component will be responsible for fetching relevant document segments during the question-answering process.

4. Incorporate the retriever into your Retrieval-Augmented Generation (RAG) pipeline. This enables the LLM to access and utilize relevant information from the processed documents when generating responses.

Note:
* Chroma serves as the vector store here: it keeps each chunk's embedding alongside its text and metadata, and returns the chunks whose embeddings are nearest to a query embedding.
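
To make the roles of `chunk_size` and `chunk_overlap` in step 1 concrete, here is a minimal sketch of overlapping fixed-size windows. It is a simplification for illustration only: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration; not the LangChain algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 2,500-character input, made up for the demonstration.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.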

In [25]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=embedding_model)

retriever = vectorstore.as_retriever()
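
Conceptually, the retriever embeds the query and returns the few chunks whose vectors are closest to it (four by default). The sketch below illustrates that idea with plain Python and hand-written toy vectors. It assumes cosine similarity as the scoring function and a brute-force scan; Chroma's actual distance metric and index structure may differ, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k entries most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


# Toy 3-dimensional "embeddings", invented for illustration only.
toy_index = [
    ("revenue grew year over year", [0.9, 0.1, 0.0]),
    ("headcount and hiring", [0.1, 0.9, 0.0]),
    ("legal proceedings", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.1]
print(top_k(query_vec, toy_index, k=2))
```

A real vector store avoids scoring every entry by using an approximate nearest-neighbor index, but the ranking idea is the same.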

## Question Answering with RAG
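To construct the final RAG chain, use two built-in helper functions: `create_stuff_documents_chain`, which inserts the retrieved documents into the prompt, and `create_retrieval_chain`, which connects the retriever to that chain. Invoking the chain returns a dictionary with two key results:

1. Final Answer: Available in the 'answer' key of the results dictionary.
2. Context: The retrieved documents that were passed to the Language Model (LLM), available in the 'context' key.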

Examining the 'context' values reveals:
- Documents containing chunks of the ingested page content
- Preserved original metadata from the initial document loading phase

This structure lets you trace an answer back to the pages it was drawn from and check exactly what evidence the LLM was given.
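
As a rough sketch of both ideas, the snippet below shows how a "stuff" step can join chunk texts into the single string that fills `{context}`, and how the 'context' list can be reduced to distinct source and page pairs. `Doc` is a hypothetical stand-in for LangChain's `Document` (same `page_content` and `metadata` attributes), and `stuff_context`, `list_sources`, and the sample data are assumptions made up for illustration; the blank-line separator is assumed here as well.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stuff_context(docs: list[Doc], separator: str = "\n\n") -> str:
    """Join all chunk texts into one string for the prompt's {context} slot."""
    return separator.join(d.page_content for d in docs)


def list_sources(results: dict) -> list[tuple]:
    """Collect distinct (source, page) pairs from results['context'], in order."""
    seen = []
    for d in results["context"]:
        key = (d.metadata.get("source"), d.metadata.get("page"))
        if key not in seen:
            seen.append(key)
    return seen


# Made-up results dictionary mirroring the shape the chain returns.
toy_results = {
    "input": "What was Meta's revenue in 2023?",
    "context": [
        Doc("chunk A", {"source": "example_data/meta-10k-2023.pdf", "page": 75}),
        Doc("chunk B", {"source": "example_data/meta-10k-2023.pdf", "page": 75}),
        Doc("chunk C", {"source": "example_data/meta-10k-2023.pdf", "page": 60}),
    ],
    "answer": "(placeholder answer)",
}
print(stuff_context(toy_results["context"]))
print(list_sources(toy_results))
```

Applied to the real `results` dictionary, a helper like `list_sources` gives a quick list of pages to open in the PDF when verifying an answer.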

In [28]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-sonnet-20240229")

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "<context>"
    "{context}"
    "</context>"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What was Meta's revenue in 2023?"})

print(results)
print(results["answer"])

{'input': "What was Meta's revenue in 2023?", 'context': [Document(metadata={'page': 75, 'source': 'example_data/meta-10k-2023.pdf'}, page_content='Reality Labs\nRL revenue in 2023 decreased $263 million, or 12%, compared to 2022. The decrease in RL revenue was mostly driven by a net decrease in the volume\nof Meta Quest sales.\nRevenue Seasonality\nRevenue is traditionally seasonally strong in the fourth quarter of each year due in part to seasonal holiday demand. We believe that this seasonality in\nboth advertising revenue and RL consumer hardware sales affects our quarterly results, which generally reflect significant growth in revenue between the third\nand fourth quarters and a decline between the fourth and subsequent first quarters. For instance, our total revenue increased 17%, 16%, and 16% between the\nthird and fourth quarters of 2023, 2022, and 2021, respectively, while total revenue for the first quarters of 2023, 2022, and 2021 declined 11%, 17%, and 7%\ncompared to the f