# RAG with LangChain



Using an example PDF, we are going to store its content in a vector store and use it for RAG.

## Creating the infrastructure

- Getting the PDF
- Storing it in a vector store
- Generating a retriever



In [23]:
from langchain_community.document_loaders import PyPDFLoader


In [24]:
# Load the PDF with PyPDFLoader (one Document per page)
pdf_file_path = "my_pdf.pdf"

pdf_file = PyPDFLoader(pdf_file_path)
docs = pdf_file.load()



In [25]:
docs

[Document(metadata={'producer': 'macOS Version 15.5 (Build 24F74) Quartz PDFContext', 'creator': 'Preview', 'creationdate': "D:20250804162618Z00'00'", 'title': 'C1_PhrasalVerbs', 'moddate': "D:20250804162618Z00'00'", 'source': 'my_pdf.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}, page_content=''),
 Document(metadata={'producer': 'macOS Version 15.5 (Build 24F74) Quartz PDFContext', 'creator': 'Preview', 'creationdate': "D:20250804162618Z00'00'", 'title': 'C1_PhrasalVerbs', 'moddate': "D:20250804162618Z00'00'", 'source': 'my_pdf.pdf', 'total_pages': 6, 'page': 1, 'page_label': '2'}, page_content=''),
 Document(metadata={'producer': 'macOS Version 15.5 (Build 24F74) Quartz PDFContext', 'creator': 'Preview', 'creationdate': "D:20250804162618Z00'00'", 'title': 'C1_PhrasalVerbs', 'moddate': "D:20250804162618Z00'00'", 'source': 'my_pdf.pdf', 'total_pages': 6, 'page': 2, 'page_label': '3'}, page_content=''),
 Document(metadata={'producer': 'macOS Version 15.5 (Build 24F74) Quartz PDF

In [26]:
len(docs[0].page_content)

0
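
An empty `page_content` on every page suggests that the loader found no text to extract. A minimal sketch of that check, assuming only that each loaded document exposes a `page_content` string; the helper name `has_extractable_text` is hypothetical, and `SimpleNamespace` objects stand in for the loader's `Document` objects:

```python
from types import SimpleNamespace


def has_extractable_text(docs):
    """Return True if at least one document has non-whitespace page_content."""
    return any(doc.page_content.strip() for doc in docs)


# Stand-ins for the Document objects returned by a loader (illustrative only).
empty_pages = [SimpleNamespace(page_content="") for _ in range(6)]
text_pages = [SimpleNamespace(page_content="back up: give support")]

print(has_extractable_text(empty_pages))  # False
print(has_extractable_text(text_pages))   # True
```

If the check returns `False`, a loader that can run OCR is worth trying.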

In [27]:
# The PDF contains images, so PyPDFLoader extracts no text; we are going to use UnstructuredLoader instead
from langchain_unstructured import UnstructuredLoader


To use Unstructured parsing locally, we need to install the following additional dependencies:

Poppler (PDF analysis)

- Linux: apt-get install poppler-utils
- Mac: brew install poppler
- Windows: https://github.com/oschwartz10612/poppler-windows

Tesseract (OCR)

- Linux: apt-get install tesseract-ocr
- Mac: brew install tesseract
- Windows: https://github.com/UB-Mannheim/tesseract/wiki#tesseract-installer-for-windows

We will also need to install the unstructured PDF extras:

%pip install -qU "unstructured[pdf]"
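
Before running the loader, it can be worth confirming that both tools are reachable from the notebook's environment. A small sketch using only the standard library; it assumes Poppler exposes the `pdftoppm` command and Tesseract the `tesseract` command on PATH, and the helper name `find_missing_tools` is hypothetical:

```python
import shutil


def find_missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the command names from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("Poppler and Tesseract commands found on PATH")
```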

In [28]:
loader = UnstructuredLoader(pdf_file_path, strategy="hi_res")

# Collect the elements that the loader yields lazily
docs = list(loader.lazy_load())

INFO: Reading PDF for file: my_pdf.pdf ...




In [29]:
len(docs)

321

In [30]:
docs[6]

Document(metadata={'source': 'my_pdf.pdf', 'detection_class_prob': 0.904830276966095, 'coordinates': {'points': ((np.float64(189.44822692871094), np.float64(800.351806640625)), (np.float64(189.44822692871094), np.float64(879.6382446289062)), (np.float64(789.1942138671875), np.float64(879.6382446289062)), (np.float64(789.1942138671875), np.float64(800.351806640625))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2025-08-04T18:26:43', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'my_pdf.pdf', 'parent_id': '10650cd204a0fc51817b1b0e7df3b632', 'category': 'ListItem', 'element_id': 'a81433a9a85bcad29001b67ab3beac2b'}, page_content="back up give support to someone by telling other people that you agree with them backup (n): | didn’t believe Simon's story until Janice backed him up.")
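
With 321 elements, it can help to see which element types the parser produced before indexing them. A minimal sketch, assuming each element carries a `category` key in its metadata as in the output above; the helper name `count_categories` and the sample elements are hypothetical:

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(docs):
    """Count elements per metadata 'category' (missing keys count as 'Unknown')."""
    return Counter(doc.metadata.get("category", "Unknown") for doc in docs)


# Illustrative stand-ins; the real elements come from UnstructuredLoader.
sample = [
    SimpleNamespace(metadata={"category": "ListItem"}, page_content="back up ..."),
    SimpleNamespace(metadata={"category": "ListItem"}, page_content="brush up (on) ..."),
    SimpleNamespace(metadata={"category": "Title"}, page_content="Phrasal verbs"),
    SimpleNamespace(metadata={}, page_content=""),
]

for category, count in count_categories(sample).most_common():
    print(category, count)
```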

There is no need to chunk the content, because each element is already short. With larger elements, we would split them into smaller chunks first.
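
For larger elements, one common approach is fixed-size chunks with some overlap, so that text cut at a boundary is repeated at the start of the next chunk. A minimal sketch in plain Python, standing in for a library text splitter; the function name `chunk_text` and the sizes are assumptions for illustration:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(text, chunk_size=12, overlap=4)
print(len(chunks))
print(chunks)
```

Each chunk would then be embedded and stored as its own document.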

In [31]:
# Use OllamaEmbeddings to generate the embeddings

from langchain_ollama import OllamaEmbeddings, ChatOllama

embeddings = OllamaEmbeddings(model="nomic-embed-text")

In [35]:
# Store the embeddings in the Chroma vector store
# Requires: pip install chromadb
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores.utils import filter_complex_metadata

# The documents carry complex metadata (for example the nested "coordinates" entry);
# filter_complex_metadata removes those keys so the documents can be stored

vectorstore = Chroma.from_documents(
    documents=filter_complex_metadata(docs),
    embedding=embeddings,
    persist_directory="chroma_store",
)

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
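
Chroma expects scalar metadata values, which is why the nested `coordinates` entry is filtered out above. The following is a rough sketch of that kind of filtering in plain Python, not the library's implementation; the function name `drop_complex_values` and the set of allowed types are assumptions:

```python
ALLOWED_TYPES = (str, int, float, bool)  # assumed set of "simple" types


def drop_complex_values(metadata):
    """Return a copy of metadata keeping only values of simple scalar types."""
    return {k: v for k, v in metadata.items() if isinstance(v, ALLOWED_TYPES)}


metadata = {
    "source": "my_pdf.pdf",
    "page_number": 1,
    "detection_class_prob": 0.904830276966095,
    "languages": ["eng"],                     # list: dropped
    "coordinates": {"system": "PixelSpace"},  # nested dict: dropped
}

print(drop_complex_values(metadata))
```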


In [36]:
# Create a retriever from the vector store
retriever = vectorstore.as_retriever()  # default search settings


In [37]:
question = "Meaning of back up verb"

retriever.invoke(question)

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


[Document(metadata={'last_modified': '2025-08-04T18:26:43', 'filename': 'my_pdf.pdf', 'element_id': 'c09c9598bc80770db8cb578dea5fc825', 'parent_id': '10650cd204a0fc51817b1b0e7df3b632', 'category': 'ListItem', 'source': 'my_pdf.pdf', 'filetype': 'application/pdf', 'detection_class_prob': 0.8957679867744446, 'page_number': 1}, page_content='back up make a copy of information on your computer backup (n): Make sure you back al! your data up, just in case you get a virus.'),
 Document(metadata={'source': 'my_pdf.pdf', 'category': 'ListItem', 'element_id': '2be2ccb13a20020b7ebc5248be69f183', 'filename': 'my_pdf.pdf', 'detection_class_prob': 0.8740919828414917, 'parent_id': '10650cd204a0fc51817b1b0e7df3b632', 'filetype': 'application/pdf', 'last_modified': '2025-08-04T18:26:43', 'page_number': 1}, page_content='brush up (on) practise and improve your skills or knowledge of something: | took a class to brush up (on) my German before the trip.'),
 Document(metadata={'element_id': 'a81433a9a85bc
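
Conceptually, the retriever embeds the question and returns the stored entries whose vectors are closest to it. A toy sketch of that ranking with hand-made 3-dimensional vectors and cosine similarity; the vectors, the `top_k` helper, and the choice of cosine similarity are assumptions for illustration (the real embeddings come from `nomic-embed-text`, and the actual distance metric depends on how the vector store is configured):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors; a real store holds one embedding per document element.
store = [
    ("back up: make a copy of information", [0.9, 0.1, 0.0]),
    ("brush up (on): practise and improve", [0.2, 0.9, 0.1]),
    ("back up: give support to someone", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, store, k=2))
```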

## Generating the RAG

In [41]:
ollama_model = "qwen2.5:7b"

llm = ChatOllama(model=ollama_model)

In [42]:
from langchain_core.prompts import PromptTemplate

# Define the prompt template for our RAG chain
template="""
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to find the examples of the requested phrasal verbs.
If you don't know the answer, respond as "I don't know"

Question: Examples to understand the meaning of the phrasal verb {question}

Context: {context}

Answer:"""

prompt = PromptTemplate(template=template)
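
When the chain runs, the two placeholders are filled with the user's question and the retrieved text. The substitution can be previewed with plain `str.format`, used here as a stand-in for the prompt template object; the shortened `preview_template` and the sample values are illustrative:

```python
preview_template = """Question: Examples to understand the meaning of the phrasal verb {question}

Context: {context}

Answer:"""

filled = preview_template.format(
    question="back up",
    context="back up make a copy of information on your computer",
)
print(filled)
```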

In [43]:
# Create the pipeline
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


def format_docs(docs):
    # Join the text of the retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)


# The chain is a pipeline, like Linux pipes: each step's output feeds the next
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
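
The `|` operator chains steps so that each step's output becomes the next step's input. A toy sketch of that idea using Python's `__or__` hook; the `Step` class and the fake retriever, prompt and model functions are hypothetical stand-ins and not LangChain's implementation:

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose: run self first, then feed its result to the next step.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins for the retriever, prompt and model.
fake_retrieve = Step(lambda q: {"question": q, "context": "back up: make a copy"})
fake_prompt = Step(lambda d: f"Q: {d['question']}\nContext: {d['context']}\nA:")
fake_llm = Step(lambda text: text.upper())

toy_chain = fake_retrieve | fake_prompt | fake_llm
print(toy_chain.invoke("back up"))
```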


In [45]:
query="back up"
response = rag_chain.invoke(query)

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO: HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


In [46]:
print(response)

Here are the examples of the phrasal verb "back up" from the context provided:

1. **Make a copy of information on your computer**:
   - Context example: "backup (n): Make sure you back all your data up, just in case you get a virus."

2. **Give support to someone by telling other people that you agree with them**:
   - Context example: "backup (n): I didn’t believe Simon's story until Janice backed him up."
