# Querying a PDF

- In this notebook, we load a PDF, index its text in a vector store, and ask an LLM questions about it.
- LLM: an open-source model served by Ollama, either `gemma2:2b` or `llama3.1:8b`
- Embedding model: `nomic-embed-text`

## Ollama commands [Linux]

Starting and stopping the service:
1. Start the Ollama service: `systemctl start ollama.service`
2. Stop the Ollama service: `systemctl stop ollama.service`
3. Check the status of the Ollama service: `systemctl status ollama.service`

Loading models:
1. Pull the `gemma2:2b` model: `ollama pull gemma2:2b`
2. Run the model: `ollama run gemma2:2b`
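
Before running the cells below, it can help to confirm that the Ollama service is reachable from Python. The sketch below assumes Ollama's default local address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally pulled models. The helper name `list_ollama_models` is hypothetical; adjust the URL if your service listens elsewhere.

```python
import json
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally pulled models, or None if the server is unreachable.

    Assumes the default Ollama address; `base_url` is a parameter so it can be changed.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
    return [model.get("name") for model in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    print("Ollama is not reachable" if models is None else models)
```

If the result is `None`, start the service with the `systemctl` command above; if `gemma2:2b` or `nomic-embed-text` is missing from the list, pull it first.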


In [1]:
import warnings
warnings.filterwarnings("ignore")

# Define embedding model

1. Install Ollama.
2. Pull the embedding model: `ollama pull nomic-embed-text`

In [2]:
from langchain.embeddings.ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
)

# Document loader

In [3]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [5]:
loader = PyPDFLoader(file_path="../data/open_vocab_vit_object_detection.pdf")
pages = loader.load_and_split()

In [5]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(pages)
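
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sliding-window sketch in plain Python. It is a simplified illustration, not LangChain's algorithm: `CharacterTextSplitter` splits on a separator first and then merges the pieces, so its chunk boundaries will generally differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With `chunk_overlap=0`, as in the cell above, no text is shared between neighbouring chunks, so a sentence that straddles a boundary is split between two chunks. A small overlap trades some duplicated text for a better chance that such a sentence appears whole in at least one chunk.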

# Define database

In [6]:
docSearch = Chroma.from_documents(texts, embedding=embeddings)
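
Conceptually, the vector store keeps one embedding per chunk and, at query time, returns the chunks whose embeddings are closest to the query embedding. The sketch below shows that idea with cosine similarity over hand-made three-dimensional vectors. The vectors, chunk labels, and function names are all hypothetical, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings"; a real embedding model produces much longer vectors.
toy_store = [
    ("chunk about image encoders", [0.9, 0.1, 0.0]),
    ("chunk about loss functions", [0.1, 0.9, 0.0]),
    ("chunk about datasets", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.3, 0.0]

print(top_k(toy_query, toy_store, k=2))
```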

# Load model

In [10]:
from langchain.llms.ollama import Ollama
llm_model = Ollama(
    model="gemma2:2b",
)

# Define retrieval

In [11]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=llm_model, chain_type="stuff", retriever=docSearch.as_retriever())
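
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea only. The prompt wording and the name `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
print(build_stuff_prompt("What does the paper propose?", example_chunks))
```

Because everything goes into one prompt, the retrieved chunks plus the question must fit in the model's context window, which is one reason to keep `chunk_size` moderate.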

# Query LLM

In [12]:
qa.run("Summary the pdf")

'The paper proposes a novel architecture for object detection and transfer learning, focusing on leveraging strong pre-training by combining text and image encoders. Here\'s a breakdown of key aspects:\n\n**Key Ideas:**\n\n* **Encoder-only Architecture:**  The model relies solely on text and image encoders without relying on fusion techniques between them. This simplifies the architecture and allows for more efficient training.\n* **Image-Level Contrastive Pre-Training:** The models are pre-trained using contrastive learning, leveraging a large image and text dataset to learn general representations.  This provides robustness and avoids overfitting during fine-tuning. \n* **Fine-Tuning with Queries:** Instead of relying on textual embeddings for object descriptions, the model works directly with images as queries in the classification head. This is especially beneficial for objects that are hard to describe with words.\n\n**Technical Details:**\n\n* **Freezing Text Encoder (partially):

In [15]:
import pprint
pprint.pprint('The paper proposes a novel architecture for object detection and transfer learning, focusing on leveraging strong pre-training by combining text and image encoders. Here\'s a breakdown of key aspects:\n\n**Key Ideas:**\n\n* **Encoder-only Architecture:**  The model relies solely on text and image encoders without relying on fusion techniques between them. This simplifies the architecture and allows for more efficient training.\n* **Image-Level Contrastive Pre-Training:** The models are pre-trained using contrastive learning, leveraging a large image and text dataset to learn general representations.  This provides robustness and avoids overfitting during fine-tuning. \n* **Fine-Tuning with Queries:** Instead of relying on textual embeddings for object descriptions, the model works directly with images as queries in the classification head. This is especially beneficial for objects that are hard to describe with words.\n\n**Technical Details:**\n\n* **Freezing Text Encoder (partially):**  Freezing the text encoder during fine-tuning helps prevent "forgetting" of learned semantic information from pre-training, potentially leading to better results.\n* **Biased Box Coordinates:** Centering predicted box coordinates at the position of corresponding tokens on a 2D grid improves learning speed and performance by breaking symmetry during bipartite matching (the process used for loss calculations).  \n* **Stochastic Depth Regularization:** To mitigate overfitting, stochastic depth regularization is applied to both image and text encoders.\n* **Focal Sigmoid Cross-Entropy Loss:** This type of loss function addresses the challenge of long-tailed datasets and effectively handles scenarios with imbalanced classes. \n\n**Advantages & Impact:**\n\n* **Transfer Learning on Open Vocabulary:**  The approach enables object detection using only image data, making it applicable to situations with a broad vocabulary of objects (objects not explicitly labeled in text).\n* **Reduced Data Requirements:** The model performs well even with relatively limited training data due to its ability to leverage pre-training and transfer learning.\n\n**Overall, the paper presents a promising framework for object detection using a novel encoder-based architecture that leverages large pre-trained models and transfer learning techniques.** \n\n\n\n')

('The paper proposes a novel architecture for object detection and transfer '
 'learning, focusing on leveraging strong pre-training by combining text and '
 "image encoders. Here's a breakdown of key aspects:\n"
 '\n'
 '**Key Ideas:**\n'
 '\n'
 '* **Encoder-only Architecture:**  The model relies solely on text and image '
 'encoders without relying on fusion techniques between them. This simplifies '
 'the architecture and allows for more efficient training.\n'
 '* **Image-Level Contrastive Pre-Training:** The models are pre-trained using '
 'contrastive learning, leveraging a large image and text dataset to learn '
 'general representations.  This provides robustness and avoids overfitting '
 'during fine-tuning. \n'
 '* **Fine-Tuning with Queries:** Instead of relying on textual embeddings for '
 'object descriptions, the model works directly with images as queries in the '
 'classification head. This is especially beneficial for objects that are hard '
 'to describe with words.\n'
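
`pprint` shows the string's repr, so newlines appear as literal `\n` and the Markdown structure is hard to read. As an alternative, the sketch below wraps each line of the answer separately and joins the result, so that `print` renders real line breaks. The helper name `wrap_answer` is hypothetical, and the sample string is a shortened excerpt of the answer above.

```python
import textwrap


def wrap_answer(answer, width=80):
    """Wrap each line of `answer` to `width` columns, keeping blank lines as separators."""
    wrapped_lines = []
    for line in answer.splitlines():
        # textwrap.fill returns "" for whitespace-only input, so handle blanks explicitly.
        wrapped_lines.append(textwrap.fill(line, width=width) if line.strip() else "")
    return "\n".join(wrapped_lines)


sample = (
    "**Key Ideas:**\n\n"
    "* **Encoder-only Architecture:**  The model relies solely on text and image "
    "encoders without relying on fusion techniques between them."
)
print(wrap_answer(sample, width=60))
```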


In [16]:
qa.run("Extract top 3 references from the 'References' section of the pdf which explains 80 percentage of the information in it.")

'Here are the top 3 references based on their relevance to the provided text, along with a brief explanation of why they\'re important:\n\n1. **[25,43]** This is most likely referring to papers that describe "head attention pooling" and its usage in object detection models. It highlights the core aggregation technique used in the proposed approach. \n2. **[33]**, **[39,38,3]**  This references a well-established set of works on "fine-tuning for classification", specifically in the context of large Transformer models. These resources offer valuable insights into the practical aspects of training these types of models.\n3. **[6], [13,24], [47]**, These are references to papers that detail specific techniques used for object detection, such as DETR\'s bipartite matching loss and federated annotation methods. This demonstrates how they address challenges related to open-vocabulary detection datasets, which require efficient data management and unique training approaches. \n\n\n\n**Explanat

In [17]:
import pprint
pprint.pprint('Here are the top 3 references based on their relevance to the provided text, along with a brief explanation of why they\'re important:\n\n1. **[25,43]** This is most likely referring to papers that describe "head attention pooling" and its usage in object detection models. It highlights the core aggregation technique used in the proposed approach. \n2. **[33]**, **[39,38,3]**  This references a well-established set of works on "fine-tuning for classification", specifically in the context of large Transformer models. These resources offer valuable insights into the practical aspects of training these types of models.\n3. **[6], [13,24], [47]**, These are references to papers that detail specific techniques used for object detection, such as DETR\'s bipartite matching loss and federated annotation methods. This demonstrates how they address challenges related to open-vocabulary detection datasets, which require efficient data management and unique training approaches. \n\n\n\n**Explanation:**\n\n* **The text emphasizes a new approach for open-vocabulary object detection.**  \n* The references provide crucial context on: \n    * **Model architectures:** Head attention pooling, pre-trained models.\n    * **Training techniques:** Fine-tuning methods, federated annotation, prompt engineering.\n    * **Specific challenges:** Biases for location prediction, handling open vocabulary in datasets.\n\nLet me know if you\'d like more explanation about any particular reference or aspect of the text! \n')

('Here are the top 3 references based on their relevance to the provided text, '
 "along with a brief explanation of why they're important:\n"
 '\n'
 '1. **[25,43]** This is most likely referring to papers that describe "head '
 'attention pooling" and its usage in object detection models. It highlights '
 'the core aggregation technique used in the proposed approach. \n'
 '2. **[33]**, **[39,38,3]**  This references a well-established set of works '
 'on "fine-tuning for classification", specifically in the context of large '
 'Transformer models. These resources offer valuable insights into the '
 'practical aspects of training these types of models.\n'
 '3. **[6], [13,24], [47]**, These are references to papers that detail '
 "specific techniques used for object detection, such as DETR's bipartite "
 'matching loss and federated annotation methods. This demonstrates how they '
 'address challenges related to open-vocabulary detection datasets, which '
 'require efficient data manage