## Ingesting Excel

In [1]:
%pip install --quiet unstructured langchain
%pip install --quiet "unstructured[all-docs]"

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain_community.document_loaders import UnstructuredExcelLoader

In [4]:
local_path = "TRD Tracker.xlsx"

# Load the local Excel file
if local_path:
  loader = UnstructuredExcelLoader(file_path=local_path)
  data = loader.load()
else:
  print("Upload an Excel file")

In [5]:
# Preview the first loaded document
data[0].page_content

'\n\n\nIndex\nSprint\nProject\nStory Title\nJira Link\nTRD Link\nLinked added to Story\nTRD Start time\nTRD End Time\nTRD Owner\nTRD Done\nTRD Verified\n\n\n1\n2407 Sprint 1\nmWorkorder 2.0\nAudio Response Type-Android\nMWO2-302\nhttps://docs.google.com/document/d/1ATY9h2ZSQw2SSp-wJGPh3eLTkjjuCI0IxkTj0Wm4jJw/edit#heading=h.sull3h3zm0u6\n\n\n\nMudit\n\n\n\n\n2\n2407 Sprint 1\nmWorkorder 2.0\nIssue List View - Filter Functionality - iOS\nMWO2-418\nhttps://docs.google.com/document/d/1VigJ_maz8InMSC9fMfg-aM19VXjv0wOWor3oDLYRcsE/edit?usp=sharing\n\n\n\nMudit\n\n\n\n\n3\n2407 Sprint 1\nmWorkorder 2.0\nBasic Authentication - iOS\nMWO2-810\nhttps://docs.google.com/document/d/1NI6DVyEw0cDbhUAxHCS0upqAYn98klinagY_vv3BXt8/edit?usp=sharing\n\n\n\nMudit\n\n\n\n\n4\n2407 Sprint 1\nmWorkorder 2.0\nBasic Authentication - Android\nMWO2-811\nhttps://docs.google.com/document/d/1TtxBXxEG6SRK0uvZhMjAIRBhObWqnpg-2xb3l2t8AKQ/edit?usp=sharing\n\n\n\nMudit\n\n\n\n\n5\n2407 Sprint 1\nmWorkorder 2.0\nDynamic Dom

## Vector Embeddings

In [6]:
!ollama pull nomic-embed-text

pulling manifest
pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
pulling c71d239df917... 100% ▕████████████████▏  11 KB
pulling ce4a164fc046... 100% ▕████████████████▏   17 B
pulling 31df23ea7daa... 100% ▕████████████████▏  420 B
verifying sha256 digest

In [7]:
!ollama list

NAME                   	ID          	SIZE  	MODIFIED       
mistral:latest         	2ae6f6dd7a3d	4.1 GB	12 hours ago  	
nomic-embed-text:latest	0a109f422b47	274 MB	32 seconds ago	


In [8]:
%pip install --q chromadb
%pip install --q langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [9]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [10]:
# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
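
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing over a string. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as newlines and spaces before falling back to raw characters, so its chunks will not match these exactly. The helper `sliding_window_chunks` and the sample string are hypothetical, introduced only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a toy
    stand-in for illustration, not the library's splitting algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Hypothetical sample: 25 digits, small sizes so the overlap is easy to see
sample = "".join(str(i % 10) for i in range(25))
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(repr(chunk))
```

With the notebook's settings (`chunk_size=7500`, `chunk_overlap=100`) the same idea applies at a larger scale: each chunk repeats roughly the last 100 characters of the previous one, so a row cut at a boundary is still partly visible in both chunks.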

In [11]:
# Add to vector database
vector_db = Chroma.from_documents(
    documents=chunks, 
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag"
)

OllamaEmbeddings: 100%|██████████| 4/4 [00:11<00:00,  2.85s/it]
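
Once the chunks are embedded, retrieval comes down to comparing a query vector against the stored vectors. The sketch below shows that idea with plain Python and cosine similarity. The three-dimensional vectors and chunk labels are made up, and the choice of cosine similarity is an assumption for illustration; the metric and index that Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k labels whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), label) for label, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Hypothetical toy index: real embeddings have hundreds of dimensions
toy_index = {
    "chunk about sprints": [0.9, 0.1, 0.0],
    "chunk about owners": [0.1, 0.9, 0.0],
    "chunk about links": [0.0, 0.2, 0.9],
}

nearest = top_k([0.8, 0.3, 0.0], toy_index, k=2)
print(nearest)
```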


## Retrieval

In [12]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [13]:
# LLM from Ollama
local_model = "mistral"
llm = ChatOllama(model=local_model)

In [14]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
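
The prompt asks the model to return its alternative questions separated by newlines, so the reply has to be split back into individual queries before each one is sent to the vector store. A minimal sketch of that step, using a made-up reply; `parse_query_lines` is a hypothetical helper written for illustration, not the parser the retriever uses internally.

```python
def parse_query_lines(llm_output: str) -> list[str]:
    """Split a newline-separated reply into clean, non-empty questions."""
    stripped = (line.strip() for line in llm_output.splitlines())
    return [line for line in stripped if line]


# Hypothetical model reply with a blank line and stray whitespace
sample_output = (
    "Which projects appear in the tracker?\n"
    "\n"
    "What projects does the file list?\n"
    "  Which project names are recorded?  "
)

alternative_questions = parse_query_lines(sample_output)
for question in alternative_questions:
    print(question)
```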

In [15]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [16]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
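
In the chain above, whatever the retriever returns is placed into the `{context}` slot of the prompt. A common variation is to join the text of the retrieved documents into a single string first, so the model sees clean text rather than a printed list of document objects. The sketch below shows that formatting step on its own. `FakeDocument` is a hypothetical stand-in with only a `page_content` attribute, and the sample strings are invented.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Invented sample documents for illustration
docs = [
    FakeDocument("Sprint: 2407 Sprint 1"),
    FakeDocument("Project: mWorkorder 2.0"),
]

context_text = format_docs(docs)
print(context_text)
```

If this helps, one way to wire it in would be `{"context": retriever | format_docs, "question": RunnablePassthrough()}`; that variant was not run as part of this notebook.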

In [17]:
chain.invoke(input(""))

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.87it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 10.12it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  4.33it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  7.71it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  5.83it/s]


' The different projects in this file are as follows:\n1. Connected Back Office (CBO)\n2. Value 360\n3. Race 2.0\n4. mWO 2.0\n5. mRounds'

In [28]:
chain.invoke("What are the 5 pillars of global cooperation?")

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.56it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  7.93it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  4.34it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  7.41it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  4.80it/s]


' The 5 pillars of global cooperation, as outlined in the World Economic Forum (WEF) Global Cooperation Barometer 2023, are as follows:\n1. Economy and Trade\n2. Innovation and Technology\n3. Climate and Environment\n4. Security and Governance\n5. Health and Human Development\n\nEach of these pillars represents a critical area where cooperation among countries can lead to mutual benefits and address global challenges effectively. The WEF Global Cooperation Barometer 2023 assesses the state of cooperation in each pillar using various metrics and indexes, providing insights into how closely nations are working together on these issues and identifying areas for improvement.\n\n[Citation: "WEF_The_Global_Cooperation_Barometer_2024.pdf"]'

In [None]:
# Delete the "local-rag" collection from the vector store
vector_db.delete_collection()