## Querying from a PDF

- In this notebook, we load a sample PDF (a research paper) and query its contents.
- We use open-source Ollama models: `gemma2:2b`, `phi3.5`, or `llama3.1:8b`
- Embedding model: `nomic-embed-text`
- Two-step retriever: a Chroma vector-store retriever (dense text retriever), `BM25Retriever` (keyword retriever), and `EnsembleRetriever` (combines both)
    - There is another retriever, `SelfQueryRetriever` (not fully explored yet), which combines text search and metadata filtering in one model

## Step 1: Grobid Installation and Setup
To use Grobid, the Grobid server must be running. Follow the steps below to pull the Grobid image and run it in Docker:
1. `docker pull grobid/grobid:0.8.1-name-address`
2. `docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.1-name-address`
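
Before parsing, it can help to confirm that the server is reachable. The sketch below is not part of the original notebook. It assumes Grobid is listening on `localhost:8070` (the port mapped above) and queries Grobid's `/api/isalive` health endpoint; the helper name `grobid_is_alive` is made up for illustration.

```python
from urllib.request import urlopen

# Assumption: the port mapped by the `docker run` command above.
GROBID_URL = "http://localhost:8070"


def grobid_is_alive(base_url: str = GROBID_URL, timeout: float = 5.0) -> bool:
    """Return True if the Grobid health endpoint answers with HTTP 200."""
    try:
        with urlopen(f"{base_url}/api/isalive", timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL:
        # treat all of these as "server not running".
        return False
```

Calling `grobid_is_alive()` right after starting the container should return `True` once the service has finished loading; the first seconds after startup may still report `False`.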


## Step 2: Ollama Setup/Commands [Linux]

Starting and stopping the service
1. Start the Ollama service: `systemctl start ollama.service`
2. Stop the Ollama service: `systemctl stop ollama.service`
3. Check the status of the Ollama service: `systemctl status ollama.service`

Loading models
1. Pull the `gemma2:2b` model: `ollama pull gemma2:2b`
2. Run the `gemma2:2b` model: `ollama run gemma2:2b`
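
To check from Python which models have already been pulled, one option is Ollama's local REST API. This is a hedged sketch rather than part of the original notebook: it assumes the service runs on its default address `localhost:11434` and that `GET /api/tags` returns a JSON object with a `models` list whose entries carry a `name` field. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: Ollama is serving on its default host and port.
OLLAMA_URL = "http://localhost:11434"


def installed_models(payload: dict) -> list:
    """Extract sorted model names from an `/api/tags` response payload."""
    return sorted(model["name"] for model in payload.get("models", []))


def fetch_installed_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list:
    """Ask the running Ollama service which models are available locally."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return installed_models(json.load(response))
```

With the service running, `"gemma2:2b" in fetch_installed_models()` indicates whether the pull step above has completed.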


In [None]:
# clear cache
import torch
torch.cuda.empty_cache() 

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [1]:
import os, sys
# Set the system path. `__file__` is undefined in a notebook, so use the working directory.
CURR_DIR = os.getcwd()
ROOT_DIR = os.path.abspath(os.path.join(CURR_DIR, '..'))
sys.path.append(ROOT_DIR)

from langchain_community.document_loaders.generic import GenericLoader
# from langchain_community.document_loaders.parsers import GrobidParser
from src.retrieval.grobid_services import GrobidDocumentParser
pdf_file_path = '../data/open_vocab_vit_object_detection.pdf' # Path to the pdf file

In [2]:
loader = GenericLoader.from_filesystem(
    path=pdf_file_path,
    parser=GrobidDocumentParser(segment_sentences=False),
)
docs = loader.load()

In [3]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

In [5]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    temperature=0.0,
)
# Index the chunks in Chroma and expose the store as a dense retriever.
docSearch = Chroma.from_documents(texts, embedding=embeddings)
docRetreiver = docSearch.as_retriever(
    search_type="similarity_score_threshold",  # or "mmr"
    search_kwargs={
        # "k": 3,
        # "lambda_mult": 0.2,  # only used with "mmr"
        # "fetch_k": 20,       # only used with "mmr"
        "score_threshold": 0.8,  # keep only chunks scoring at least 0.8
    },
)
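
The effect of `score_threshold` can be illustrated with a small, library-free sketch. This is not Chroma's or LangChain's implementation: it assumes cosine similarity as the relevance score, whereas the real score depends on the distance metric configured for the collection. The function names and toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_threshold(query_vec, doc_vecs, score_threshold=0.8):
    """Keep documents whose similarity to the query meets the threshold, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-D "embeddings" (hypothetical values, not real model output).
toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
strict = filter_by_threshold([1.0, 0.0], toy_docs, score_threshold=0.8)
loose = filter_by_threshold([1.0, 0.0], toy_docs, score_threshold=0.7)
```

A high threshold such as 0.8 can return very few chunks, or none, for loosely worded questions, so the value is worth tuning against a handful of known queries.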


In [6]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever
keyword_retriever = BM25Retriever.from_documents(texts)
keyword_retriever.k = 3  # return the top 3 keyword matches
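
For intuition about what a keyword retriever scores, here is a minimal sketch of the textbook Okapi BM25 formula. LangChain's `BM25Retriever` delegates scoring to the `rank_bm25` package, and this sketch is not that code: the whitespace tokenizer, the idf variant, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a simple Okapi BM25 variant."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


sample_docs = [
    "open vocabulary object detection",
    "vision transformer image encoder",
    "text encoder for detection",
]
sample_scores = bm25_scores("open vocabulary", sample_docs)
```

Only the first sample document contains the query terms, so it is the only one with a non-zero score. Unlike the embedding retriever, BM25 rewards exact term overlap, which is why the two are combined in the next step.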

In [7]:
ensemble_retriever = EnsembleRetriever(retrievers=[docRetreiver,
                                                   keyword_retriever],
                                       weights=[0.5, 0.5])
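
`EnsembleRetriever` merges the two ranked lists with weighted Reciprocal Rank Fusion. The sketch below shows the idea on plain document IDs; it is a simplified illustration rather than LangChain's code, and it assumes ranks start at 1 and a smoothing constant of `c=60`.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each hit contributes weight / (rank + c)."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


dense_hits = ["a", "b", "c"]    # hypothetical order from the vector retriever
keyword_hits = ["b", "d", "a"]  # hypothetical order from the BM25 retriever
fused = weighted_rrf([dense_hits, keyword_hits], weights=[0.5, 0.5])
```

Documents returned by both retrievers (`"b"` and `"a"` here) rise to the top, while documents found by only one retriever still appear, ranked lower. Shifting the weights away from `[0.5, 0.5]` favors one retriever's ordering.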

In [13]:
from langchain_community.llms import Ollama
llm_model = Ollama(
    model="phi3.5:3.8b",  # alternatives: "llama3.1:8b", "gemma2:2b"
)

In [8]:
SYSTEM_MESSAGE = """
You are an expert research scientist with a deep understanding of complex research topics. Your role is to analyze and explain intricate concepts from academic papers clearly and concisely. 
Focus on providing insightful summaries and elucidations that make the research accessible and understandable to a diverse audience, including those who may not have a scientific background.
"""
from langchain_community.llms import Ollama
llm_model = Ollama(
    model="llama3.1",  # options: "llama3.1:8b", "gemma2:2b", "phi3.5:3.8b"
    temperature=0.0,  # Set the sampling temperature
    num_predict=1000,  # Set the maximum number of tokens to generate
    system=SYSTEM_MESSAGE,  # Set a system message
)

In [12]:
# TODO: this part has not been fully explored yet

from langchain.retrievers.self_query.base import SelfQueryRetriever
from src.retrieval.metadata_info import MetadataInfo

paper_title = "Simple Open-Vocabulary Object Detection with Vision Transformers"

metadata_info = MetadataInfo(document_content_des=f"The document is a research paper titled {paper_title}")
retriever = SelfQueryRetriever.from_llm(
    llm=llm_model,
    vectorstore=docSearch,
    document_contents=metadata_info.document_content_description,
    metadata_field_info=metadata_info.metadata_field_info,
    verbose=True,
    use_original_query=False,
)

In [9]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=llm_model, chain_type="stuff", 
                                 retriever=ensemble_retriever, 
                                 verbose=True, return_source_documents=True,)
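
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt together with the question. The sketch below illustrates that idea only; the template text and the name `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual prompt.

```python
# Hypothetical template, written for illustration only.
STUFF_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question, chunks, template=STUFF_TEMPLATE):
    """Concatenate every retrieved chunk into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


example_prompt = build_stuff_prompt(
    "What is open-vocabulary detection?",
    ["Chunk one about detection.", "Chunk two about transformers."],
)
```

Because every chunk goes into one prompt, the number and size of retrieved chunks must fit within the model's context window; this is one reason to keep `k` and `chunk_size` small.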

In [10]:
qa.invoke("What is open vocabulory detection as explained in the paper?")

> Entering new RetrievalQA chain...

> Finished chain.

{'query': 'What is open vocabulory detection as explained in the paper?',
 'result': 'According to the paper, Open-Vocabulary Object Detection refers to the ability of a model to detect object categories that were not seen during training. This means that the model can recognize and classify objects even if they are not part of the fixed set of semantic categories used in traditional detection models.\n\nIn other words, open-vocabulary detection allows a model to generalize beyond a closed vocabulary, where the vocabulary refers to the specific set of object categories that were trained on. The goal is to enable the model to detect and classify objects from an open-ended or dynamic set of categories, without requiring explicit annotations for each category.',
 'source_documents': [Document(metadata={'text': 'Object detection is a fundamental task in computer vision.Until recently, detection models were typically limited to a small, fixed set of semantic categories, because obtaining lo

In [10]:
query = "What is the Abstract and Title of the paper?. Refer Abstract section."
result = qa.invoke({"query": query})

> Entering new RetrievalQA chain...

> Finished chain.


In [11]:
result

{'query': 'What is the Abstract and Title of the paper?. Refer Abstract section.',
 'result': "Unfortunately, I don't have enough information to provide the title and abstract of the paper. The provided text appears to be a snippet from the introduction or methodology section of the paper, but it doesn't contain the title and abstract.\n\nHowever, based on the content, I can infer that the paper is likely related to computer vision and object detection, possibly using transformer-based architectures. If you have access to the full paper, I'd be happy to help with any other questions!",
 'source_documents': [Document(metadata={'para': '6', 'pages': "('5', '5')", 'section_title': 'Model', 'section_number': '3.1'}, page_content="Architecture.Our model uses a standard Vision Transformer as the image encoder and a similar Transformer architecture as the text encoder (Figure 1).To adapt the image encoder for detection, we remove the token pooling and final projection layer, and instead linea