### RAG
* RAG (retrieval-augmented generation) lets us integrate external data sources into an LLM's responses
* Introducing external data sources overcomes a key limitation of LLMs: on their own, they can only answer questions from their training data
> Retrieval: embed user query --> vector database --> retrieve relevant docs --> incorporate into model prompt

> Indexing: load documents --> split documents --> embed --> vector database
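
The two flows above can be sketched end to end without any external service. This is a toy illustration only: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the in-memory `index` list stands in for a vector database, and the sample chunks are made up.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing flow: (load -> split ->) embed -> store
chunks = [
    "retrieval augmented generation combines a retriever with a generator",
    "the retriever fetches passages from a dense vector index",
    "bananas are rich in potassium",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval flow: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, k: int = 2) -> str:
    """Incorporate the retrieved chunks into the model prompt."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("how does the retriever fetch passages"))
```

A real pipeline swaps `embed` for a learned embedding model and `index` for a vector store, but the order of the steps stays the same.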

In [24]:
from dotenv import load_dotenv,find_dotenv
import os
env_path = find_dotenv(filename="password.env", usecwd=True)
load_dotenv(env_path)

True

In [17]:
key=os.getenv("OPENAI_API_KEY")
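
`os.getenv` returns `None` without complaint when a variable is missing, so a typo in the `.env` filename only surfaces later as an authentication error. A small guard can make that failure explicit. This is a sketch; `require_env` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

def require_env(name: str) -> str:
    """Return the environment variable's value, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check that the .env file was found and loaded")
    return value
```

In the notebook this could replace the plain lookup, e.g. `key = require_env("OPENAI_API_KEY")`.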

#### Load

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('rag-paper.pdf')

docs = loader.load()

#print(docs[0].page_content)
print(docs[0].metadata)


{'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-04-13T00:48:38+00:00', 'author': '', 'keywords': '', 'moddate': '2021-04-13T00:48:38+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'rag-paper.pdf', 'total_pages': 19, 'page': 0, 'page_label': '1'}


#### Split
> We need to split the documents into chunks. Larger is not necessarily better: large chunks put more text into the prompt for every retrieved passage, which raises cost and latency and can dilute the relevant content. **Chunk size** controls this. If you are coming from the traditional ML world, you can think of it as a hyperparameter. **Chunk overlap** ensures that not too much context is lost at chunk boundaries.
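
To see how the two parameters interact, here is a minimal fixed-width chunker. It is a simplification: `chunk_text` is a hypothetical helper that slices purely by character count, whereas `RecursiveCharacterTextSplitter` first tries to break on the given separators.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(text, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk if it is short enough.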

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate

# Step 1: Load the documents
loader = PyPDFLoader('rag-paper.pdf')
data = loader.load()

# Step 2: Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                               chunk_overlap=50,
                                               separators=["\n\n", "\n", " ", ""])
docs = text_splitter.split_documents(data)
print(f"Number of documents: {len(docs)}")

# Step 3: Create the embeddings and store them in the vector database
embedding_model = OpenAIEmbeddings(model = "text-embedding-3-small",
                                   api_key = key,
                                   )
vectordb = Chroma.from_documents(documents=docs,
                                 embedding=embedding_model,
                                 collection_name="GEN_AI"
                                 )

retreiver = vectordb.as_retriever(search_type="similarity",
                                   search_kwargs={"k":3}
                                   )
# Step 4: Create an LCEL RAG chain that retrieves context and answers the question
prompt = ChatPromptTemplate.from_template(
    "Use the following context to answer the question: {context}\n\nQuestion: {question}\nAnswer:"
)

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model = "gpt-4o-mini",
                 api_key=key,
                 temperature=0,
                 max_completion_tokens=200 
                 )

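# Join the retrieved Document objects into plain text before filling the prompt;
# otherwise the prompt receives the string form of a list of Document objects
def format_docs(retrieved):
    return "\n\n".join(doc.page_content for doc in retrieved)

rag_chain = (
    {"context": retreiver | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)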
# Step 5: Test the RAG chain
result = rag_chain.invoke("Give me a summary of the paper?")
print(result) 


Number of documents: 158
The paper discusses a method for generating Wikipedia entries by summarizing long sequences of text. It was presented at the Meeting of the Association for Computational Linguistics in July 2019 in Florence, Italy. The authors of the paper include Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. The work is part of the proceedings of the conference and is accessible through the provided DOI and URL links. The paper emphasizes the application of advanced techniques in natural language processing to create concise summaries that can effectively represent longer texts.
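
The answer above reads as if it was assembled from bibliography entries rather than from the body of the paper, which is plausible when only `k=3` chunks are retrieved for a broad request like a summary. One way to check is to look at what the retriever actually returned. The helper below is a sketch: `preview_retrieved` is a hypothetical name, and it assumes only that the retriever has an `invoke` method returning objects with `page_content` and `metadata`, as LangChain documents do.

```python
def preview_retrieved(retriever, query: str, width: int = 80) -> list[tuple]:
    """Return (page, snippet) pairs for the chunks a retriever returns for a query."""
    rows = []
    for doc in retriever.invoke(query):
        page = doc.metadata.get("page")  # page index recorded by the PDF loader, if any
        snippet = " ".join(doc.page_content.split())[:width]  # collapse whitespace, truncate
        rows.append((page, snippet))
    return rows
```

Calling `preview_retrieved(retreiver, "Give me a summary of the paper?")` in the notebook would show which pages the three chunks came from. If they all sit in the reference section, a more specific question or a larger `k` may help.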
