# LangChain RAG Using Local Embeddings for PDF
This notebook implements a retrieval-augmented generation (RAG) system with the LangChain framework, focusing on:
* Generating vector embeddings locally for efficient similarity search
* Indexing and querying PDF documents to find relevant passages
* Using the retrieved passages to generate answers to questions

## Setting Up
Uncomment to install the required packages.

In [29]:
# pip install -U langchain-anthropic langchain_community langchain_chroma pypdf sentence_transformers

Uncomment if the `ANTHROPIC_API_KEY` environment variable is not set yet.

In [30]:
# import getpass
# import os

# os.environ["ANTHROPIC_API_KEY"] = getpass.getpass()

## Build Custom Embeddings
This section shows how to generate embeddings locally instead of calling a third-party embedding API such as OpenAI's. We use the `all-MiniLM-L6-v2` model from Sentence Transformers, which has the following characteristics:

1. Model size: approximately 80 MB
2. Storage: downloaded automatically on first use, then cached locally
3. Functionality: maps text to 384-dimensional embedding vectors

With local embeddings, document text is not sent to an external service at indexing time, and the pipeline has one less external dependency. The retrieved passages are still sent to the LLM provider when an answer is generated.

In [66]:
from typing import List
from sentence_transformers import SentenceTransformer
from langchain_core.embeddings import Embeddings

import warnings

# Suppress the specific tokenizer warning
warnings.filterwarnings("ignore", category=FutureWarning, 
                       message="`clean_up_tokenization_spaces` was not set")

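class CustomEmbeddings(Embeddings):
    """LangChain-compatible wrapper around a local SentenceTransformer model."""

    def __init__(self, model_name: str):
        # Downloads the model on first use, then loads it from the local cache
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        # Encode all texts in one batched call instead of one call per text
        return self.model.encode(documents).tolist()

    def embed_query(self, query: str) -> List[float]:
        # Encode a single query and unwrap it from the batch dimension
        return self.model.encode([query])[0].tolist()


embedding_model = CustomEmbeddings("all-MiniLM-L6-v2")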
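
To make the role of these vectors concrete, here is a minimal sketch of the kind of similarity search a vector store runs over embeddings. It uses tiny hand-written 3-dimensional vectors in place of real model output, and cosine similarity as one common choice of metric (the metric a given vector store uses by default may differ). The `cosine_similarity` and `top_k` helpers and the toy values are hypothetical illustrations, not part of LangChain or Chroma.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: List[List[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for embed_documents() output (assumed values).
toy_vectors = [
    [0.9, 0.1, 0.0],  # e.g. a chunk about revenue
    [0.0, 1.0, 0.0],  # e.g. a chunk about an unrelated topic
    [0.8, 0.2, 0.1],  # e.g. another chunk about revenue
]
toy_query = [1.0, 0.0, 0.0]  # stands in for embed_query() output

ranked = top_k(toy_query, toy_vectors, k=2)
print([index for index, _ in ranked])
```

The two vectors pointing in nearly the same direction as the query rank first, which is the behavior the retriever relies on when it selects passages for a question.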

## Creating a Simple QA System

### `add_pdfs`: Loading the Documents
The PDF documents were downloaded from the source websites and saved as local files.

To prepare the documents for retrieval:
1. Use a text splitter to divide the loaded documents into smaller chunks. This keeps each segment within the LLM's context window; note that `all-MiniLM-L6-v2` also truncates input longer than 256 word pieces when embedding.
2. Load the split documents into a vector store. This process typically involves converting text into numerical representations (vectors) for efficient searching.
3. Implement a retriever based on the vector store. This component will be responsible for fetching relevant document segments during the question-answering process.
4. Incorporate the retriever into your Retrieval-Augmented Generation (RAG) pipeline. This enables the LLM to access and utilize relevant information from the processed documents when generating responses.
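
The effect of the `chunk_size` and `chunk_overlap` parameters in step 1 can be sketched with a simplified fixed-window splitter. This is only an illustration under stated assumptions: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself prefers to break at separators such as paragraph and line breaks, so its chunks usually vary in length.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.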

### `_setup_qa_chain`: Question Answering with RAG
The `_setup_qa_chain` method composes the final RAG chain with the LangChain Expression Language (LCEL): the retriever fetches the context, the prompt combines it with the question, and the output parser returns the answer as a plain string.

If you build the chain with LangChain's built-in helper functions (such as `create_retrieval_chain`) instead, the result is a dictionary with two key entries:

1. Final answer: available under the `answer` key.
2. Context: available under the `context` key; the retrieved documents that were passed to the LLM.

Examining the `context` values reveals:
- Documents containing chunks of the ingested page content
- The original metadata preserved from the document loading phase

This structure lets you trace an answer back to the passages it was based on.

Note:
* Chroma is useful here because it provides an efficient, scalable, and semantically-aware way to store and retrieve vectorized text data.
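
One possible way to turn retrieved documents into a readable context string, with each passage labelled by its origin, is sketched below. It assumes the `source` and `page` metadata keys that `PyPDFLoader` records; `RetrievedChunk` is a hypothetical stand-in for a LangChain `Document`, and `format_context` and the sample values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_context(chunks: List[RetrievedChunk]) -> str:
    """Join chunk texts, labelling each with its source file and page."""
    parts = []
    for chunk in chunks:
        source = chunk.metadata.get("source", "unknown")
        page = chunk.metadata.get("page", "?")
        parts.append(f"[{source}, page {page}]\n{chunk.page_content}")
    return "\n\n".join(parts)


# Placeholder chunks with assumed metadata values.
sample_chunks = [
    RetrievedChunk("Placeholder text of the first chunk.", {"source": "a.pdf", "page": 3}),
    RetrievedChunk("Placeholder text of the second chunk.", {"source": "b.pdf", "page": 7}),
]
print(format_context(sample_chunks))
```

In an LCEL chain, a function like this could be piped after the retriever (for example `retriever | format_context`), since LCEL wraps plain callables as runnables. Whether labelled context improves answers for a given document set is something to check empirically.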

In [67]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_anthropic import ChatAnthropic
from typing import List, Union
from pathlib import Path


class PDFQuestionAnswering:
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        # Initialize embedding model
        self.embedding_model = CustomEmbeddings(model_name="all-MiniLM-L6-v2")
        
        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        
        # Initialize empty vectorstore
        self.vectorstore = Chroma(
            embedding_function=self.embedding_model
        )
        
        # Setup QA chain
        self.qa_chain = self._setup_qa_chain()
    
    def add_pdfs(self, pdf_files: Union[str, List[str]]) -> None:
        """Add one or more PDFs to the existing system."""
        # Handle single file input
        if isinstance(pdf_files, str):
            pdf_files = [pdf_files]
        
        documents = []
        
        # Process each PDF file
        for pdf_path in pdf_files:
            try:
                # Validate file
                path = Path(pdf_path)
                if not path.exists():
                    print(f"File not found: {pdf_path}")
                    continue
                if path.suffix.lower() != '.pdf':
                    print(f"Not a PDF file: {pdf_path}")
                    continue
                
                # Load and process document
                loader = PyPDFLoader(pdf_path)
                docs = loader.load()
                splits = self.text_splitter.split_documents(docs)
                documents.extend(splits)
                print(f"Successfully loaded: {pdf_path}")
                
            except Exception as e:
                print(f"Error processing {pdf_path}: {str(e)}")
        
        # Add to vectorstore if documents were loaded
        if documents:
            self.vectorstore.add_documents(documents)
            print(f"Added {len(documents)} chunks to the vectorstore")
    
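    def ask(self, question: str) -> str:
        """Ask a question and return the answer text."""
        # The chain takes the raw question string as input, so the retriever
        # and the prompt both receive the question itself, not a dict
        return self.qa_chain.invoke(question)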
    
    def _setup_qa_chain(self):
        """Setup the QA chain with Claude."""
        llm = ChatAnthropic(model="claude-3-sonnet-20240229")
        
        # Create the prompt template
        prompt = ChatPromptTemplate.from_messages([
            ("system", ("You are an assistant for question-answering tasks. "
                "Use the following pieces of retrieved context to answer "
                "the question. If you don't know the answer, say that you "
                "don't know. Use three sentences maximum and keep the "
                "answer concise."
                "\n\n"
                "<context>"
                "{context}"
                "</context>")),
            ("human", "{question}")
        ])
        
        # Build the retrieval chain with LCEL: the chain input is sent to the
        # retriever and is also passed through to the prompt as the question
        retriever = self.vectorstore.as_retriever()
        
        chain = (
            {"context": retriever, "question": lambda x: x}
            | prompt
            | llm
            | StrOutputParser()
        )
        
        return chain

In [68]:
# Initialize the system
qa_system = PDFQuestionAnswering(chunk_size=10000, chunk_overlap=2000)

# Add initial PDFs
# Source: https://investor.fb.com/financials/
qa_system.add_pdfs(["example_data/meta-10k-2023.pdf"])

# Ask questions
answer = qa_system.ask("How's Meta's Reality Labs revenue in 2023?")
print(answer)

Successfully loaded: example_data/meta-10k-2023.pdf
Added 147 chunks to the vectorstore
According to the context provided, Meta's Reality Labs (RL) revenue in 2023 decreased $263 million, or 12%, compared to 2022. The decrease was primarily driven by a net decrease in the volume of Meta Quest (virtual reality headset) sales.


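## Adding More PDFs Later
You can add more PDFs at any time to give the system more context. Because the retriever queries the same vector store, newly added chunks are searchable immediately.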

In [69]:
# Source: https://www.sec.gov/Archives/edgar/data/1018724/000101872424000008/amzn-20231231.htm
qa_system.add_pdfs(["example_data/amazon-10k-2023.pdf"])

# Ask questions about old and new content
answer = qa_system.ask("What are the primary revenue streams for Meta and Amazon, and how have they evolved over recent fiscal years?")
print(answer)

Successfully loaded: example_data/amazon-10k-2023.pdf
Added 123 chunks to the vectorstore
The primary revenue streams for Meta (Facebook) are advertising revenue from its Family of Apps (Facebook, Instagram, WhatsApp, etc.) and other minor revenue sources. For Amazon, the key revenue streams are retail product sales, third-party seller fees (commission and fulfillment), Amazon Web Services (AWS), advertising services, and subscriptions like Amazon Prime.

Over recent years, Meta's advertising revenue has remained dominant but saw a slight decline in 2022 before rebounding in 2023. Amazon's retail sales and AWS have continued growing, with AWS becoming an increasingly important revenue driver alongside the retail business.
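
When a question spans several PDFs, it can help to check which files the retrieved chunks came from. The sketch below assumes each chunk's metadata carries the `source` key that `PyPDFLoader` records. The `count_sources` helper, the file names, and the page numbers are hypothetical; in this notebook the metadata could be collected from the documents returned by `qa_system.vectorstore.similarity_search(question)`.

```python
from collections import Counter
from typing import Dict, List


def count_sources(metadatas: List[Dict[str, object]]) -> Counter:
    """Count how many retrieved chunks came from each source file."""
    return Counter(str(meta.get("source", "unknown")) for meta in metadatas)


# Made-up metadata for four retrieved chunks (illustrative values only).
retrieved_metadata = [
    {"source": "first.pdf", "page": 10},
    {"source": "second.pdf", "page": 4},
    {"source": "first.pdf", "page": 11},
    {"source": "first.pdf", "page": 12},
]

source_counts = count_sources(retrieved_metadata)
for source, n in source_counts.most_common():
    print(f"{source}: {n} chunk(s)")
```

If one file dominates the retrieved chunks for a comparative question, the answer may lean on that file alone; asking one question per document is one way to work around this.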
