## Loading PDF

In [2]:
%pip install -q unstructured langchain langchain-community langchain-text-splitters chromadb ollama
%pip install -q "unstructured[all-docs]"

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

In [4]:
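# Path to the PDF file (a local file in the same folder as this notebook)
local_path = "Iran_Natural_Environment.pdf"

# Load the local PDF; fail early if no path is set, because later cells need `data`
if not local_path:
    raise ValueError("Set local_path to the path of a PDF file")
loader = UnstructuredPDFLoader(file_path=local_path)
data = loader.load()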

In [5]:
# Preview the loaded text (in the loader's default mode, the whole PDF is returned as one document)
data[0].page_content

"Natural Environment in Iran - Overview\n\nIran's Natural Environment\n\nIran is a diverse country in terms of natural landscapes, ecosystems, and climate zones. The geography of Iran includes deserts, mountains, forests, and coastal regions, each supporting unique flora and fauna. This document covers key aspects of Iran’s natural environment, highlighting the various ecosystems, climatic conditions, and conservation challenges.\n\n1. Geographical Location: - Iran is located in the Middle East, bordered by the Caspian Sea to the north and the Persian Gulf to the south. - Its landscape includes mountain ranges, such as the Zagros and Alborz mountains, as well as large desert areas.\n\n2. Climate Zones: - Iran experiences a wide range of climates, from arid and semi-arid conditions in central deserts to humid and temperate climates along the Caspian Sea coast.\n\n3. Biodiversity: - Iran is home to a diverse range of plant and animal species, including the Persian leopard, Caspian seals,

## Vector Embeddings

In [6]:
import ollama

# Download the 'nomic-embed-text' model from Ollama to embed the PDF text
ollama.pull('nomic-embed-text')

{'status': 'success'}

List the models downloaded to your machine:

In [7]:
!ollama list


NAME                       ID              SIZE      MODIFIED               
nomic-embed-text:latest    0a109f422b47    274 MB    Less than a second ago    
phi3:14b                   cf611a26b048    7.9 GB    About an hour ago         
llama3.1:latest            42182419e950    4.7 GB    About an hour ago         


Split the PDF text into chunks sized to suit the embedding model, with an overlap between consecutive chunks so that context is not lost where the text is cut.

In [8]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [9]:
# Split the loaded document into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
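
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is an illustration only: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` works differently in that it also tries to cut on separators such as paragraph and line breaks.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return pieces


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so text near a cut appears in both neighbouring chunks.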

Embed the chunks and store them in a Chroma vector database

In [None]:
# Add to Chroma vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag"
)
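
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query's embedding. The sketch below shows that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors, `toy_store`, and `nearest` are all made up for illustration, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real embedding models use far more dimensions.
toy_store = {
    "Iran has large desert areas.": [0.9, 0.1, 0.0],
    "The Caspian Sea coast is humid.": [0.1, 0.9, 0.1],
    "Iran is home to the Persian leopard.": [0.0, 0.2, 0.9],
}


def nearest(query_vector: list[float], store: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vector."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]


print(nearest([0.8, 0.2, 0.1], toy_store))
# ['Iran has large desert areas.']
```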

## Retrieval

In [None]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

Choose the chat model

In [511]:
# LLM from Ollama
local_model = "llama3.1"
llm = ChatOllama(model=local_model)

Prompt the LLM to generate five alternative versions of the user's question

In [512]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
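
The prompt asks for alternative questions separated by newlines, so the reply has to be split into individual queries before each one is searched. The sketch below shows both halves with plain string operations: filling the `{question}` placeholder and splitting a reply on newlines. `query_template` is a shortened stand-in for the template above, and `parse_queries` and `fake_reply` are hypothetical, not part of LangChain.

```python
# Shortened stand-in for the template used in QUERY_PROMPT.
query_template = (
    "Generate five different versions of the given user question, "
    "separated by newlines.\n"
    "Original question: {question}"
)

# Fill the {question} placeholder, as the prompt template does at run time.
filled = query_template.format(question="what is the PDF about?")
print(filled)


def parse_queries(reply: str) -> list[str]:
    """Split a newline-separated reply into non-empty, stripped queries."""
    return [line.strip() for line in reply.splitlines() if line.strip()]


# Made-up reply, standing in for what the LLM might return.
fake_reply = "What topics does the PDF cover?\n\nWhat is the document's subject?\n"
print(parse_queries(fake_reply))
# ['What topics does the PDF cover?', "What is the document's subject?"]
```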

Build a retriever that runs each generated question against the vector database, then define the answer prompt

In [513]:
# RAG retriever to find the relevant contents in the vector database
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)

# RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
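
Because the retriever runs one search per generated question, the same chunk is often returned more than once. The sketch below assumes the results are merged by keeping each distinct chunk once, in order of first appearance. `unique_union` and the chunk labels are hypothetical, and plain strings stand in for document objects.

```python
def unique_union(result_lists: list[list[str]]) -> list[str]:
    """Merge several result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical chunk labels returned for three variants of one question.
per_query_results = [
    ["chunk A", "chunk B"],
    ["chunk B", "chunk C"],
    ["chunk A", "chunk C"],
]
print(unique_union(per_query_results))
# ['chunk A', 'chunk B', 'chunk C']
```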

Assemble the full pipeline, from question to answer

In [516]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
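
The data flow through the chain can be sketched with plain functions: the question goes to the retriever to produce the context and is also passed through unchanged, both are filled into the prompt, and the prompt text goes to the model. Every function below is a hypothetical stand-in (`fake_retriever`, `build_prompt`, `fake_llm`, `run_chain`); none of them calls LangChain or Ollama.

```python
def fake_retriever(question: str) -> list[str]:
    # Stand-in for the retriever: always returns the same context chunk.
    return ["The geography of Iran includes deserts, mountains, forests, and coastal regions."]


def build_prompt(inputs: dict) -> str:
    # Fill the context and question into the RAG prompt wording.
    context = "\n".join(inputs["context"])
    return (
        "Answer the question based ONLY on the following context:\n"
        f"{context}\n"
        f"Question: {inputs['question']}\n"
    )


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the chat model: reports the prompt length instead of generating an answer.
    return f"(model reply to a {len(prompt_text)}-character prompt)"


def run_chain(question: str) -> str:
    # Same order as the chain: gather inputs, build the prompt, call the model.
    inputs = {"context": fake_retriever(question), "question": question}
    return fake_llm(build_prompt(inputs))


print(build_prompt({"context": fake_retriever("q"), "question": "what is the PDF about?"}))
print(run_chain("what is the PDF about?"))
```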

Test the RAG pipeline:

In [3]:
chain.invoke(input("Enter your question: "))

In [4]:
chain.invoke("what is the PDF about?")

Delete the collection from the vector database

In [None]:
# vector_db.delete_collection()