## Ingesting PDF

In [1]:
%pip install --quiet unstructured langchain
%pip install --quiet "unstructured[all-docs]"

Note: you may need to restart the kernel to use updated packages.
^C
Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

In [3]:
# Step 1: Reinstall the ONNX package
!pip uninstall -y onnx
!pip install onnx

# Step 2: Reinstall the Unstructured library
!pip uninstall -y unstructured
!pip install unstructured

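# Step 3: Install or update ONNX Runtime
# Note: unstructured-inference 0.7.18 requires onnxruntime<1.16, so an unpinned
# upgrade triggers the dependency conflict reported in the output below.
# A pinned alternative: !pip install "onnxruntime<1.16"
!pip install --upgrade onnxruntime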

# Step 4: Check for dependency issues using pipdeptree
!pip install pipdeptree
!pipdeptree

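# Step 5: Create a new Anaconda environment and install the required packages
# Note: Uncomment the following lines if you want to create a new environment.
# `!conda activate` runs in a subshell and does not switch the running kernel,
# so create the environment from a terminal and select it as the notebook kernel.
# !conda create -n myenv python=3.8 -y
# !conda activate myenv
# !pip install unstructured onnx onnxruntime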

# Step 6: Ensure that Visual C++ Redistributable packages are installed
print("Make sure you have installed the latest Visual C++ Redistributable from Microsoft.")

# Step 7: Check Python version
import sys
print("Current Python version:", sys.version)

# Step 8: Consult documentation and community
print("For further assistance, consult the official documentation or community forums.")


Found existing installation: onnx 1.17.0
Uninstalling onnx-1.17.0:
  Successfully uninstalled onnx-1.17.0
Collecting onnx
  Using cached onnx-1.17.0-cp38-cp38-win_amd64.whl.metadata (16 kB)
Using cached onnx-1.17.0-cp38-cp38-win_amd64.whl (14.5 MB)
Installing collected packages: onnx
Successfully installed onnx-1.17.0
Found existing installation: unstructured 0.11.8
Uninstalling unstructured-0.11.8:
  Successfully uninstalled unstructured-0.11.8
Collecting unstructured
  Using cached unstructured-0.11.8-py3-none-any.whl.metadata (26 kB)
Using cached unstructured-0.11.8-py3-none-any.whl (1.8 MB)
Installing collected packages: unstructured
Successfully installed unstructured-0.11.8
Collecting onnxruntime
  Downloading onnxruntime-1.19.2-cp38-cp38-win_amd64.whl.metadata (4.7 kB)
Downloading onnxruntime-1.19.2-cp38-cp38-win_amd64.whl (11.1 MB)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unstructured-inference 0.7.18 requires onnxruntime<1.16, but you have onnxruntime 1.19.2 which is incompatible.


Collecting pipdeptree
  Using cached pipdeptree-2.23.4-py3-none-any.whl.metadata (15 kB)
Using cached pipdeptree-2.23.4-py3-none-any.whl (32 kB)
Installing collected packages: pipdeptree
Successfully installed pipdeptree-2.23.4
Brotli==1.1.0
effdet==0.4.1
├── omegaconf [required: >=2.0, installed: 2.3.0]
│   ├── antlr4-python3-runtime [required: ==4.9.*, installed: 4.9.3]
│   └── PyYAML [required: >=5.1.0, installed: 6.0.2]
├── pycocotools [required: >=2.0.2, installed: 2.0.7]
│   ├── matplotlib [required: >=2.1.0, installed: 3.7.5]
│   │   ├── contourpy [required: >=1.0.1, installed: 1.1.1]
│   │   │   └── numpy [required: >=1.16,<2.0, installed: 1.24.4]
│   │   ├── cycler [required: >=0.10, installed: 0.12.1]
│   │   ├── fonttools [required: >=4.22.0, installed: 4.55.0]
│   │   ├── importlib_resources [required: >=3.2.0, installed: 6.4.5]
│   │   │   └── zipp [required: >=3.1.0, installed: 3.21.0]
│   │   ├── kiwisolver [required: >=1.0.1, installed: 1.4.7]
│   │   ├── numpy [require

* unstructured-inference==0.7.18
 - onnxruntime [required: <1.16, installed: 1.19.2]
------------------------------------------------------------------------
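
The conflict reported above (unstructured-inference 0.7.18 requires onnxruntime<1.16, while 1.19.2 is installed) can also be checked from Python before loading any PDF. The following is a minimal sketch using only the standard library; the helper names `release_tuple`, `below_upper_bound`, and `check_package` are hypothetical, and the comparison only handles plain numeric versions such as `1.19.2`.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string: str) -> tuple:
    """Turn '1.19.2' into (1, 19, 2); stops at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def below_upper_bound(installed: str, upper_bound: str) -> bool:
    """Return True when `installed` is strictly lower than `upper_bound`."""
    return release_tuple(installed) < release_tuple(upper_bound)


def check_package(name: str, upper_bound: str) -> str:
    """Report whether the installed package satisfies `<upper_bound`."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name} is not installed"
    status = "ok" if below_upper_bound(installed, upper_bound) else "too new"
    return f"{name} {installed} (needs <{upper_bound}): {status}"


# The bound below is taken from the pip error message in the output above.
print(check_package("onnxruntime", "1.16"))
```

If the report says "too new", reinstalling with a pinned requirement such as `"onnxruntime<1.16"` is one way to satisfy the constraint.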


In [None]:
local_path = "PATH"  # placeholder: replace with the path to a local PDF file
# Load the PDF from the local path
if local_path:
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
else:
    print("Upload a PDF file")

  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prana\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [5]:
# Preview the first loaded document (by default the loader returns the whole PDF as a single Document)
data[0].page_content

'Project Title: Expansion of Manufacturing Plant\n\nOperational/Service Delivery: This section outlines the responsibilities of each party in maintaining service levels. The client will be responsible for daily operations, while the contractor will handle weekly maintenance.\n\nLegal: All terms are governed by federal and state law. Any disputes will be handled in the local court of jurisdiction.\n\nCommercial: This section covers the financial terms, including payment schedules, billing rates, and contingency fees.\n\nCompliance: The project must adhere to environmental regulations and safety standards.\n\nGovernance: The project governance structure includes monthly meetings with senior management from both parties for oversight and decision-making.'

## Vector Embeddings

In [6]:
!ollama pull nomic-embed-text

In [7]:
!ollama list

NAME                       ID              SIZE      MODIFIED       
nomic-embed-text:latest    0a109f422b47    274 MB    23 seconds ago    
granite3-dense:2b          a9c7deef7ab8    1.6 GB    27 hours ago      
llama3:8b                  365c0bd3c000    4.7 GB    2 months ago      


In [8]:
%pip install --quiet chromadb
%pip install --quiet langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [9]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [10]:
# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
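
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name `sliding_chunks` is hypothetical and only illustrates the two parameters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# A text shorter than chunk_size stays in one piece.
print(len(sliding_chunks("short contract summary", chunk_size=7500, chunk_overlap=100)))
```

With `chunk_size=7500`, a short PDF like the one previewed earlier fits in a single chunk, which is consistent with the embedding progress bar below reporting `1/1`.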

In [11]:
# Add to vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag",
)

OllamaEmbeddings: 100%|██████████| 1/1 [00:06<00:00,  6.65s/it]


## Retrieval

In [12]:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [14]:
# LLM from Ollama
local_model = "granite3-dense:2b"
llm = ChatOllama(model=local_model)

In [15]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

In [16]:
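# Multi-query retriever: the LLM rewrites the question using QUERY_PROMPT,
# each rewrite is searched in the vector store, and the unique results are merged
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT
)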

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
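
Conceptually, the multi-query retriever configured above generates several rephrasings of the question, searches once per rephrasing, and returns the unique union of the hits. The sketch below illustrates that idea in plain Python; `multi_query_retrieve`, `fake_generate_queries`, `fake_search`, and the tiny corpus are hypothetical stand-ins for the LLM and the vector store, not LangChain's implementation.

```python
from typing import Callable, List


def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run every generated query and return the unique hits in first-seen order."""
    seen = set()
    merged = []
    for query in generate_queries(question):
        for hit in search(query):
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged


# Hypothetical stand-in for the LLM that rewrites the question.
def fake_generate_queries(question: str) -> List[str]:
    return [question, question.lower(), f"Explain: {question}"]


# Hypothetical stand-in for the vector store: a keyword lookup over two sentences.
def fake_search(query: str) -> List[str]:
    corpus = {
        "maintenance": "The contractor will handle weekly maintenance.",
        "court": "Disputes will be handled in the local court of jurisdiction.",
    }
    return [text for keyword, text in corpus.items() if keyword in query.lower()]


print(multi_query_retrieve("Who handles maintenance?", fake_generate_queries, fake_search))
```

Because all three rephrasings match the same sentence here, the merged result contains it only once.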

In [17]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
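
In the chain above, the retriever's list of documents is handed to the prompt as-is, so the `{context}` slot receives the string form of the whole list, which typically includes metadata as well as text. A common pattern is to join only the document text before it reaches the prompt. The sketch below assumes a hypothetical `format_docs` helper and uses a minimal `Doc` stand-in so that it runs without LangChain.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (only page_content is used)."""
    page_content: str


def format_docs(docs: Iterable) -> str:
    """Join the text of the retrieved documents with blank lines between them."""
    return "\n\n".join(doc.page_content for doc in docs)


print(format_docs([Doc("Legal: ..."), Doc("Commercial: ...")]))

# Possible use in the chain (assumes `retriever`, `prompt`, and `llm` from earlier cells):
# chain = (
#     {"context": retriever | format_docs, "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )
```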

In [18]:
chain.invoke(input("Enter your question: "))

OllamaEmbeddings: 100%|██████████| 1/1 [00:04<00:00,  4.29s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.18s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.17s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.21s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.25s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1


'The document outlines a contract for the expansion of a manufacturing plant. The operational responsibilities are divided between the client (daily operations) and the contractor (weekly maintenance). Legal disputes will be handled in local court. The commercial terms include payment schedules, billing rates, and contingency fees. The project must comply with environmental regulations and safety standards. Monthly meetings with senior management from both parties are required for oversight and decision-making.'
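
The repeated message "Number of requested results 4 is greater than number of elements in index 1" appears because the retriever asks for four results while the collection holds a single chunk. One way to avoid it is to cap the requested count at the number of indexed chunks. This is a small sketch; `clamp_k` is a hypothetical helper, and the commented line assumes `vector_db` and `chunks` from the earlier cells.

```python
def clamp_k(requested_k: int, n_indexed: int) -> int:
    """Never ask for more results than the index holds, and always at least one."""
    return max(1, min(requested_k, n_indexed))


print(clamp_k(4, 1))  # prints 1

# Possible use when building the base retriever:
# base_retriever = vector_db.as_retriever(search_kwargs={"k": clamp_k(4, len(chunks))})
```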

In [19]:
chain.invoke("What are the 5 pillars of global cooperation?")

OllamaEmbeddings: 100%|██████████| 1/1 [00:04<00:00,  4.09s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.15s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.13s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.16s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.15s/it]
Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1


'1. Operational/Service Delivery: This section outlines the responsibilities of each party in maintaining service levels. The client will be responsible for daily operations, while the contractor will handle weekly maintenance.\n2. Legal: All terms are governed by federal and state law. Any disputes will be handled in the local court of jurisdiction.\n3. Commercial: This section covers the financial terms, including payment schedules, billing rates, and contingency fees.\n4. Compliance: The project must adhere to environmental regulations and safety standards.\n5. Governance: The project governance structure includes monthly meetings with senior management from both parties for oversight and decision-making.'

In [None]:
# Delete the "local-rag" collection backing this vector store
vector_db.delete_collection()