# Ollama PDF RAG Notebook

## Import Libraries


In [24]:
# Set the protobuf implementation before importing libraries that may load protobuf
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Imports
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Jupyter-specific imports
from IPython.display import display, Markdown

## Load PDF

In [25]:
# Load PDF
local_path = "../../data/pdfs/sample/MKS-07-Poshuk-Tehnichnij-opis-ta-instruktsiya-shhodo-ekspluatuvannya.pdf"
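# Check that the file exists; a non-empty path string is always truthy
if os.path.isfile(local_path):
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
    print(f"PDF loaded successfully: {local_path}")
else:
    raise FileNotFoundError(f"PDF not found: {local_path}")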

PDF loaded successfully: ../../data/pdfs/sample/MKS-07-Poshuk-Tehnichnij-opis-ta-instruktsiya-shhodo-ekspluatuvannya.pdf


## Split text into chunks

In [28]:
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(data)
print(f"Text split into {len(chunks)} chunks")

Text split into 95 chunks
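
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

With a window of 10 and an overlap of 3, each chunk starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next. The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.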


## Create vector database

In [29]:
# Create vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag"
)
print("Vector database created successfully")

Vector database created successfully
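
Conceptually, the vector database stores one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The sketch below illustrates that idea with hand-made 3-dimensional vectors and Euclidean distance. Both are assumptions for illustration: the vectors are not output from `nomic-embed-text`, and the code does not reproduce how Chroma indexes or measures distance internally. The names `euclidean_distance`, `nearest_chunks`, and `toy_index` are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vector, indexed_chunks, k=2):
    """Return the texts of the k chunks whose vectors are closest to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: euclidean_distance(query_vector, item[1]),
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks
toy_index = [
    ("power switch", [0.9, 0.1, 0.0]),
    ("storage rules", [0.1, 0.8, 0.1]),
    ("detector housing", [0.0, 0.2, 0.9]),
]

top = nearest_chunks([0.8, 0.2, 0.0], toy_index, k=2)
print(top)
```

A real vector store replaces the linear scan with an approximate nearest-neighbor index, but the ranking principle is the same.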


## Set up LLM and Retrieval

In [38]:
# Set up LLM and retrieval
local_model = "mistral"  # or whichever model you prefer
llm = ChatOllama(model=local_model)

In [47]:
# Query prompt template
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate 2
    different versions of the given user question to retrieve relevant documents from
    a vector database. Write the alternative questions in Ukrainian. By generating
    multiple perspectives on the user question, your goal is to help the user overcome
    some of the limitations of distance-based similarity search. Provide these
    alternative questions separated by newlines.
    Original question: {question}""",
)

# Set up retriever
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT
)
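
The idea behind multi-query retrieval can be sketched in plain Python: rephrase the question several ways, search once per rephrasing, and merge the results while dropping duplicates. The sketch below assumes exactly that flow and uses stub functions in place of the LLM and the vector store. `multi_query_retrieve`, `fake_variants`, `fake_search`, and the chunk labels are all hypothetical; this is not the library's implementation.

```python
def multi_query_retrieve(question, generate_variants, search):
    """Search once per question variant and return the de-duplicated union of results."""
    seen = set()
    merged = []
    for query in generate_variants(question):
        for doc in search(query):
            if doc not in seen:  # keep the first occurrence, preserve order
                seen.add(doc)
                merged.append(doc)
    return merged


def fake_variants(question):
    """Stand-in for the LLM call that rewrites the question."""
    return [f"{question} (variant A)", f"{question} (variant B)"]


# Stand-in for the vector store: canned results per query
FAKE_RESULTS = {
    "how to power on? (variant A)": ["chunk 3", "chunk 7"],
    "how to power on? (variant B)": ["chunk 7", "chunk 12"],
}


def fake_search(query):
    """Stand-in for a similarity search against the vector store."""
    return FAKE_RESULTS.get(query, [])


merged_docs = multi_query_retrieve("how to power on?", fake_variants, fake_search)
print(merged_docs)
```

Because each variant can surface chunks the others miss, the merged set tends to cover more of the relevant text than a single query, at the cost of one extra LLM call and several searches.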

## Create chain

In [48]:
# RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [49]:
# Create chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
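
The chain above reads as a pipeline: the question is sent to the retriever to build the context and is also passed through unchanged, both values fill the prompt template, the model answers, and the parser extracts the text. A minimal plain-Python sketch of that data flow follows, with stubs in place of the retriever and the model. The helper names (`run_rag_chain`, `stub_retrieve`, `stub_generate`, and so on) are hypothetical, and the way the context is joined into a string is an assumption.

```python
TEMPLATE = (
    "Answer the question based ONLY on the following context:\n"
    "{context}\n"
    "Question: {question}\n"
)


def run_rag_chain(question, retrieve, fill_prompt, generate, parse):
    """Mimic the pipeline: build inputs, fill the prompt, call the model, parse the reply."""
    inputs = {"context": retrieve(question), "question": question}
    return parse(generate(fill_prompt(inputs)))


def stub_retrieve(question):
    """Stand-in for the retriever; returns placeholder chunk texts."""
    return ["placeholder chunk 1", "placeholder chunk 2"]


def fill_prompt(inputs):
    """Insert the retrieved context and the question into the template."""
    return TEMPLATE.format(
        context="\n".join(inputs["context"]),
        question=inputs["question"],
    )


captured_prompts = []  # records what the stub model receives, for inspection


def stub_generate(prompt_text):
    """Stand-in for the LLM; returns a message-like dict."""
    captured_prompts.append(prompt_text)
    return {"content": "stub answer"}


def parse_content(message):
    """Stand-in for the output parser; extracts the reply text."""
    return message["content"]


answer = run_rag_chain(
    "What does the device measure?",
    stub_retrieve, fill_prompt, stub_generate, parse_content,
)
print(answer)
```

Inspecting `captured_prompts[0]` shows the exact text the model would see, which is a handy way to debug answers that draw on unexpected context.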

## Chat with PDF

In [50]:
def chat_with_pdf(question):
    """
    Run the RAG chain on a question and render the answer as Markdown.
    """
    # display() returns None, so there is nothing useful to return here
    display(Markdown(chain.invoke(question)))

In [51]:
# Example 1
chat_with_pdf("What is the main idea of this document?")

 The main idea of the document is a technical description and instructions for using a device called "Block of Detection" (Beta-Particle Detector, or БДБГ-07, and Виносний блок детектування), which is used in some type of equipment (possibly MKS-07, given the file name). The Block of Detection has a specific structure, including parts made of polyethylene terephthalate plastics for protection against dust and moisture. It is connected to other components using screws, hinges, and a bracket. The device also contains a scale, a switch for powering on and off, and a protective cover (накривка). The document provides instructions for operating the device, including how to turn it on and off, as well as safety measures for storage.

In [45]:
# Example 2
chat_with_pdf("Які технічні характеристики цього пристрою?")  # What are the technical characteristics of this device?

1. The device has a detection block for beta particles (BDYB-07) with a rectangular shape and rounded corners or edges (as seen in figures V.2, V.3).
  2. There is an on/off button labeled as 'ДОЗА' (Dosa). Pressing and holding this button initiates the operation of the device.
  3. The device has a scale.
  4. The device appears to have settings for limits (ПОРІГ), accuracy (ТОЧНО), and perhaps others mentioned but not fully translated.
  5. The device can switch between modes automatically, with a 1-hour average time for precise measurement set upon initial power-up.

In [46]:
# Example 3
chat_with_pdf("Can you explain the usage cases in the document?")

1. The first document introduces a Realtime API, but no specific use case is provided.

  2. Ken Paxton's document discusses common scams, although it doesn't provide any specific use case related to the content within the context.

  3. Francesco Salvi, Manoel Horta Ribeiro, Riccardo Gallotti, and Robert West's paper focuses on the conversational persuasiveness of large language models, but no specific use case is mentioned in the context provided.

  4. Brian Schwalb's document discusses Telemarketing scams, but it doesn't provide any specific use case related to the content within the context.

  5. Fabio Urbina, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins' paper talks about the dual use of artificial-intelligence-powered drug discovery, but no specific use case is mentioned in the context provided.

  6. The document titled "MKS-07 Пoshuk Технічний опис та інструкція щодо експлуатації" discusses the usage and installation of a dosimeter with RS7 connector, particularly focusing on its protective measures such as radiation detection and alarm system, switch operations, setting the time for data averaging, and the removal of the panel during operation. It also mentions the use of a transparent polyethylene terephthalate film to protect the moving fixture and detector from dust and moisture. However, no broader use case or application of this dosimeter is mentioned in the context provided.

  7. Laura Weidinger et al.'s paper discusses the taxonomy of risks posed by language models, but it doesn't provide any specific use case related to the content within the context.

## Clean up (optional)

In [27]:
# Optional: Clean up when done
vector_db.delete_collection()
print("Vector database deleted successfully")

Vector database deleted successfully
