<a href="https://colab.research.google.com/github/hissain/mlworks/blob/main/codes/RAG_haystack_library_BM25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install "farm-haystack[inference]" --quiet


In [None]:
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import DensePassageRetriever, FARMReader, BM25Retriever
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers

In [None]:
# Step 1: Prepare a custom dataset (documents)
documents = [
    {"content": "The Mona Lisa is a famous painting created by Leonardo da Vinci."},
    {"content": "Albert Einstein developed the theory of relativity."},
    {"content": "Python is a widely-used programming language known for its simplicity."},
    {"content": "The human heart has four chambers."},
    {"content": "The Eiffel Tower is located in Paris, France."},
    {"content": "Water boils at 100 degrees Celsius at sea level."}
]

In [None]:
# Step 2: Initialize an in-memory document store
document_store = InMemoryDocumentStore(use_bm25=True)

The line `document_store = InMemoryDocumentStore(use_bm25=True)` initializes a **document store** in Haystack, specifically an **in-memory** document store that will store your documents (or passages) for fast retrieval.

Here’s a detailed breakdown of what each part means:

### **What is a Document Store?**
The **document store** is a central repository where documents (text passages, articles, FAQs, etc.) are saved. This repository will later be queried by a **retriever** to fetch relevant documents based on the user's query.

- A **document** typically contains fields like:
  - **content**: The main body of the document (e.g., an article or text passage).
  - **meta** (optional): Metadata about the document, such as title, author, or source.
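
As a small illustration of these two fields, the sketch below builds one document dictionary that carries both `content` and `meta`, then checks its shape. The `meta` keys (`title`, `source`) and the helper `is_valid_document` are hypothetical names chosen for this example, not part of Haystack.

```python
# A document as a plain dictionary: "content" is required, "meta" is optional.
# The meta keys below are illustrative; Haystack does not require these names.
doc_with_meta = {
    "content": "The Eiffel Tower is located in Paris, France.",
    "meta": {"title": "Eiffel Tower", "source": "example"},
}

def is_valid_document(doc):
    """Hypothetical helper: check the minimal shape described above."""
    has_content = isinstance(doc.get("content"), str) and bool(doc["content"])
    meta_ok = isinstance(doc.get("meta", {}), dict)
    return has_content and meta_ok

print(is_valid_document(doc_with_meta))   # a document with content and meta
print(is_valid_document({"meta": {}}))    # missing content
```

Metadata like this is useful later for filtering or for showing the user where an answer came from.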

Haystack provides several types of **document stores**. Which one you choose depends on the scale of your dataset, the retrieval method (keyword-based or dense retrieval), and your infrastructure needs (in-memory or persistent storage).





Here’s an overview of the **different document stores** available in Haystack and their respective use cases:

---

### 1. **InMemoryDocumentStore**
- **Description**: This is a simple document store that keeps all the documents in the system’s RAM. It does not persist data on disk or use external databases.
- **Use Case**: Suitable for **small datasets** and for **testing** or **prototyping** applications where the data can fit entirely into memory. Fast retrieval times due to the in-memory nature, but limited by memory capacity.
- **Retrieval Method**: Supports **BM25** and **TF-IDF** for keyword-based search, as well as dense vector retrieval if combined with a dense retriever.
- **Key Feature**:
  - Fast and lightweight.
  - Not persistent (documents are lost when the program is restarted).

Example:
```python
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)
```

---

### 2. **ElasticsearchDocumentStore**
- **Description**: This document store is backed by **Elasticsearch**, which is a powerful search engine designed for large-scale document indexing and retrieval. It can handle **full-text search**, **keyword-based** retrieval, and **dense retrieval** with vector embeddings.
- **Use Case**: Suitable for **large-scale** applications where you need persistent storage for documents. Elasticsearch is ideal for systems where data must be stored across multiple nodes or where scaling is required.
- **Retrieval Method**:
  - Supports **BM25** (default), **TF-IDF**, and **custom Elasticsearch ranking algorithms**.
  - Can store **dense embeddings** and supports vector search.
  - Supports hybrid search (combining BM25 with dense vector retrieval).
- **Key Features**:
  - Distributed and highly scalable.
  - Supports **complex querying**.
  - Suitable for production environments.
  - Persistent and durable (data is stored on disk).

Example:
```python
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    host="localhost",
    username="",
    password="",
    index="document"  # Elasticsearch index name
)
```

---

### 3. **FAISSDocumentStore**
- **Description**: **FAISS (Facebook AI Similarity Search)** is a library designed for fast similarity search on dense vector representations. **FAISSDocumentStore** is useful for **semantic search** tasks where you need to find documents based on the similarity of their vector embeddings rather than keyword matching.
- **Use Case**: Ideal for **dense retrieval** tasks (e.g., using embeddings for document matching). It’s often used when building **semantic search** or **question-answering systems** that require matching queries and documents in a high-dimensional vector space.
- **Retrieval Method**:
  - **Dense retrieval** based on **vector similarity** (cosine similarity, L2 distance).
  - Works well with models like **Dense Passage Retrieval (DPR)** or **Sentence Transformers**.
- **Key Features**:
  - Fast and efficient for dense vector search.
  - Supports large datasets by using approximate nearest neighbor (ANN) search.
  - Not optimized for traditional keyword-based retrieval (like BM25).
  - Data is **persistent** if saved to disk, but not distributed like Elasticsearch.

Example:
```python
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(embedding_dim=768)
```

---

### 4. **WeaviateDocumentStore**
- **Description**: **Weaviate** is an open-source vector search engine that stores data as embeddings and supports semantic search. **WeaviateDocumentStore** allows you to use Weaviate’s distributed and vector-native capabilities.
- **Use Case**: Suitable for **semantic search** and **large-scale** applications where you need high-dimensional vector retrieval across multiple machines or nodes.
- **Retrieval Method**:
  - Supports **dense vector retrieval**.
  - Allows **graph-based querying**.
- **Key Features**:
  - Distributed and highly scalable.
  - Vector-native storage and retrieval system.
  - Supports hybrid searches (combining dense and traditional search).
  - Built-in **graph relationships** for richer querying options.
  
Example:
```python
from haystack.document_stores import WeaviateDocumentStore

document_store = WeaviateDocumentStore()
```

---

### 5. **MilvusDocumentStore**
- **Description**: **Milvus** is an open-source vector database designed for managing massive amounts of embedding vectors. It specializes in **vector retrieval** and is optimized for high-dimensional search tasks.
- **Use Case**: Ideal for large-scale **dense retrieval** and **embedding-based** search tasks. Useful for AI applications that handle millions or billions of embeddings.
- **Retrieval Method**:
  - Primarily **vector search** using **dense embeddings**.
  - Works well with models like DPR and BERT embeddings.
- **Key Features**:
  - Highly scalable.
  - Specifically built for managing large collections of vectors.
  - Distributed storage and retrieval.

Example:
```python
from haystack.document_stores import MilvusDocumentStore

document_store = MilvusDocumentStore(embedding_dim=768)
```

---

### 6. **SQLDocumentStore**
- **Description**: The **SQLDocumentStore** stores documents in a **SQL database** such as SQLite, MySQL, or PostgreSQL. It can be useful for smaller datasets where persistence is required, but high scalability or distributed storage is not necessary.
- **Use Case**: Suitable for smaller-scale applications where you need persistent storage but don’t require distributed or large-scale search capabilities like Elasticsearch.
- **Retrieval Method**:
  - Supports **BM25** and **TF-IDF**.
  - Not optimized for dense retrieval unless embeddings are stored and retrieved manually.
- **Key Features**:
  - Simple and lightweight.
  - Persistent storage on disk using a relational database.
  - Not optimized for large datasets or distributed environments.

Example:
```python
from haystack.document_stores import SQLDocumentStore

document_store = SQLDocumentStore(url="sqlite:///my_doc_store.db")
```


---

### Summary of Document Stores

| **Document Store**        | **Retrieval Method**                  | **Scalability**          | **Use Case**                         | **Persistence**   |
|---------------------------|---------------------------------------|--------------------------|--------------------------------------|-------------------|
| **InMemoryDocumentStore**  | BM25, TF-IDF, Dense Retrieval         | Low (limited by RAM)      | Prototyping, small datasets          | No (in-memory)    |
| **ElasticsearchDocumentStore** | BM25, TF-IDF, Dense, Hybrid       | High (distributed)        | Large-scale applications, production | Yes (disk)        |
| **FAISSDocumentStore**     | Dense Retrieval (vector similarity)   | Moderate                  | Semantic search, vector retrieval    | Yes (disk)        |
| **WeaviateDocumentStore**  | Dense Retrieval, Graph-based querying | High (distributed)        | Large-scale, semantic search         | Yes (distributed) |
| **MilvusDocumentStore**    | Dense Retrieval (vector search)       | High (distributed)        | AI, massive-scale embedding search   | Yes (distributed) |
| **SQLDocumentStore**       | BM25, TF-IDF                         | Moderate                  | Small datasets, persistent storage   | Yes (disk)        |

### Choosing the Right Document Store

- **InMemoryDocumentStore**: Best for small, fast, and temporary retrieval tasks.
- **ElasticsearchDocumentStore**: Ideal for large-scale, persistent, distributed storage with advanced search capabilities.
- **FAISSDocumentStore**: Useful for dense vector retrieval tasks (e.g., semantic search).
- **WeaviateDocumentStore**: For more complex, distributed, and graph-based searches.
- **SQLDocumentStore**: Suitable for small, relational database-based storage.
- **MilvusDocumentStore**: Best for massive-scale vector management.


### Options for Different Ranking Algorithms

#### 1. **BM25 (Keyword-based ranking)**:
   - BM25 is one of the ranking functions built into the **InMemoryDocumentStore** and **ElasticsearchDocumentStore**. It works well for text-based search (keyword matching). BM25 is useful for classic information retrieval tasks where exact word matching is important.
   - It's highly efficient and effective for simple use cases where keyword matching suffices.
   - Does not require embeddings or vectors.
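
To make the keyword-matching idea concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring over the six documents from this notebook. It is not Haystack's implementation: the tokenizer is a simplification, and `k1=1.5` and `b=0.75` are commonly used values assumed here for illustration.

```python
import math
import re
from collections import Counter

docs = [
    "The Mona Lisa is a famous painting created by Leonardo da Vinci.",
    "Albert Einstein developed the theory of relativity.",
    "Python is a widely-used programming language known for its simplicity.",
    "The human heart has four chambers.",
    "The Eiffel Tower is located in Paris, France.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def tokenize(text):
    # Lowercase and keep alphanumeric runs (a simplification of real tokenizers).
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    tokenized = [tokenize(d) for d in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents need more matches to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

scores = bm25_scores("Where is the Eiffel Tower located?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # prints the Eiffel Tower document
```

Rare query terms such as "eiffel" contribute far more to the score than common ones such as "the", which is why the Eiffel Tower document wins even though several documents share words with the query.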


#### 2. **Dense Passage Retrieval (DPR)**:
   - **Dense Passage Retrieval (DPR)** is a type of neural retrieval algorithm based on **dense embeddings** (vector representations) of both questions and documents.
   - Instead of using term frequency and inverse document frequency, DPR maps queries and documents into the same embedding space and retrieves documents that are most similar based on cosine similarity or L2 distance.
   - To use DPR, you would initialize a **DensePassageRetriever** along with a document store that supports dense embeddings, such as **FAISSDocumentStore** or **ElasticsearchDocumentStore** with dense retrieval enabled.
   - More effective for semantic search, where you want to match concepts, not just exact words.
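
The ranking step of dense retrieval can be sketched without a neural model. The example below ranks documents by cosine similarity between a query vector and document vectors. The 3-dimensional embeddings are made up for illustration; a real DPR encoder would produce them (typically with hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; in practice these come from the passage encoder.
doc_embeddings = {
    "The Eiffel Tower is located in Paris, France.": [0.9, 0.1, 0.0],
    "Python is a widely-used programming language known for its simplicity.": [0.0, 0.2, 0.9],
    "The human heart has four chambers.": [0.1, 0.9, 0.1],
}
# Stand-in for the encoded query "Where is the Eiffel Tower located?"
query_embedding = [0.8, 0.2, 0.1]

ranked = sorted(
    doc_embeddings,
    key=lambda d: cosine_similarity(query_embedding, doc_embeddings[d]),
    reverse=True,
)
print(ranked[0])  # prints the Eiffel Tower document
```

Because similarity is computed between vectors rather than words, a query can match a document even when the two share no exact terms, provided the encoder places them close together.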

#### 3. **Embedding-based Retrieval (FAISS)**:
   - **FAISS (Facebook AI Similarity Search)** allows for efficient similarity search and retrieval based on dense vector representations. FAISS is used for **semantic search** by comparing embeddings rather than direct keyword matching.
   - You can use the **FAISSDocumentStore** with **DensePassageRetriever** or **EmbeddingRetriever** to perform similarity searches on dense vectors.
   

#### 4. **TF-IDF (Term Frequency-Inverse Document Frequency)**:
   - You can use **TF-IDF** for document retrieval in Haystack. TF-IDF assigns a weight to terms based on their frequency within the document and their rarity across all documents.
   - TF-IDF can be used in the **InMemoryDocumentStore** as well.
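
A minimal sketch of the weighting described above, using the textbook form (term frequency times the log of inverse document frequency). Library implementations usually add smoothing and normalization, so treat the exact numbers as illustrative only.

```python
import math
import re
from collections import Counter

docs = [
    "The Eiffel Tower is located in Paris, France.",
    "The human heart has four chambers.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf(corpus):
    tokenized = [tokenize(d) for d in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

weights = tf_idf(docs)
# "eiffel" appears in one document, "the" in two, so "eiffel" gets the higher weight.
print(weights[0]["eiffel"] > weights[0]["the"])
```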
   
#### 5. **Elasticsearch with Advanced Ranking Algorithms**:
   - If you are using the **ElasticsearchDocumentStore**, you can leverage Elasticsearch’s **native ranking algorithms** like **BM25**, **TF-IDF**, and custom **dense embeddings** (using Elasticsearch's vector search capabilities).
   - Elasticsearch supports hybrid models where BM25 and dense retrieval can be combined for better accuracy.
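
One common way to combine a keyword ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below is a swapped-in illustration of that general idea, not a description of what Elasticsearch or Haystack does internally. The document IDs are hypothetical and `k=60` is a conventional constant assumed here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list receive a larger contribution.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_ranking = ["eiffel", "mona_lisa", "heart"]   # hypothetical keyword results
dense_ranking = ["eiffel", "heart", "python"]     # hypothetical dense results

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['eiffel', 'heart', 'mona_lisa', 'python']
```

Documents that appear in both lists rise above those that appear in only one, which is the intuition behind hybrid search.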


### Overview of Different Retrieval Techniques and Ranking Algorithms

| **Ranking Algorithm**    | **Retriever**            | **Document Store**               | **Strength**                                                                 |
|--------------------------|--------------------------|-----------------------------------|-------------------------------------------------------------------------------|
| **BM25**                  | BM25Retriever (default)  | InMemoryDocumentStore, ElasticsearchDocumentStore | Best for classic keyword-based ranking. Efficient and interpretable.          |
| **TF-IDF**                | TfidfRetriever           | InMemoryDocumentStore, ElasticsearchDocumentStore | Simple and effective for term frequency-based retrieval.                      |
| **Dense Passage Retrieval (DPR)** | DensePassageRetriever   | FAISSDocumentStore, ElasticsearchDocumentStore | Best for semantic search, uses dense embeddings for more accurate retrieval.  |
| **FAISS (Embedding Search)** | EmbeddingRetriever or DensePassageRetriever | FAISSDocumentStore              | Optimized for fast similarity search on embeddings, used for vector-based retrieval. |
| **Elasticsearch Hybrid**  | Elasticsearch-based BM25, TF-IDF, or dense retrieval | ElasticsearchDocumentStore       | Combines BM25 with embedding-based retrieval for high accuracy.               |

### Switching the Algorithm

- **For Dense or Embedding-Based Retrieval**, use **FAISSDocumentStore** or **ElasticsearchDocumentStore** and combine it with **DensePassageRetriever**.
- **For TF-IDF**, use **TfidfRetriever** with **InMemoryDocumentStore** or **ElasticsearchDocumentStore**.



Haystack supports several ranking algorithms, and the choices available depend on the **document store** and **retriever** you combine. With **InMemoryDocumentStore**, classical keyword-based retrieval means **BM25** (enabled with `use_bm25=True`) or **TF-IDF** through **TfidfRetriever**. For more advanced approaches such as **dense retrieval** or **semantic search** (based on embeddings), pair a dense retriever with a document store that holds embeddings.


### Conclusion:
You can switch the ranking algorithm by using a different **retriever** and **document store** combination. This notebook uses BM25 with **InMemoryDocumentStore** (enabled through `use_bm25=True`), but for more sophisticated ranking algorithms like **Dense Retrieval**, you would use **FAISSDocumentStore** or **ElasticsearchDocumentStore** with the appropriate retriever.

If you need dense retrieval or hybrid ranking, I recommend using **FAISS** or **Elasticsearch**, while for keyword-based or lightweight retrieval tasks, **BM25** or **TF-IDF** with **InMemoryDocumentStore** would work well.

In [None]:
# Step 3: Write the custom dataset to the document store
document_store.write_documents(documents)

Updating BM25 representation...: 100%|██████████| 6/6 [00:00<00:00, 7051.23 docs/s]


The line `document_store.write_documents(documents)` is used to **store the documents** (or passages) into the document store in **Haystack**. Here's a detailed explanation of what it does:

### **1. Purpose of `write_documents`**

- This line writes the list of documents (passages, articles, etc.) into the specified **document store**.
- The **document store** (in this case `InMemoryDocumentStore`; the same call works for stores such as `FAISSDocumentStore` or `ElasticsearchDocumentStore`) is where all your documents are kept so they can be **retrieved** when a user submits a query.

### **2. Example Structure of Documents**
The `documents` being passed to the `write_documents()` function should be in a format compatible with Haystack. Each document is typically a dictionary containing a **content** field.

Example of a document structure:
```python
documents = [
    {"content": "The Eiffel Tower is located in Paris, France."},
    {"content": "The Mona Lisa is a famous painting created by Leonardo da Vinci."},
    {"content": "Albert Einstein developed the theory of relativity."}
]
```

Each document in the list is written to the **document store** for later retrieval.

### **3. What Happens Internally**

When `write_documents(documents)` is called:
- The **documents** are stored in the document store (in memory, in the case of **InMemoryDocumentStore**, or in Elasticsearch/FAISS if you're using other document stores).
- The documents are **indexed** and saved with unique IDs.
- If BM25 is enabled, the store also updates its BM25 representation, as the progress output above shows. Dense embeddings are not computed by this call; in Haystack v1 they are typically added afterwards with `update_embeddings(retriever)`.

For instance:
- In an **InMemoryDocumentStore**, the documents are saved directly in the system’s memory (RAM).
- In a **FAISSDocumentStore**, the documents are saved and become searchable by vector similarity once their embeddings have been computed.
- In an **ElasticsearchDocumentStore**, the documents will be stored in an Elasticsearch index that supports full-text search.
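
The "stored with unique IDs" step can be sketched with a toy in-memory store. `ToyInMemoryStore` is a hypothetical stand-in written for this explanation, not Haystack's implementation. Deriving the ID from a SHA-256 hash of the content is an assumption made here to show one way of getting stable IDs.

```python
import hashlib

class ToyInMemoryStore:
    """Hypothetical stand-in for a document store; not Haystack's implementation."""

    def __init__(self):
        self.documents = {}  # id -> document dict, held only in RAM

    def write_documents(self, documents):
        for doc in documents:
            # Derive a stable ID from the content so identical text maps to the same key.
            doc_id = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()[:16]
            self.documents[doc_id] = {"id": doc_id, **doc}

    def get_document_count(self):
        return len(self.documents)

store = ToyInMemoryStore()
store.write_documents([
    {"content": "The Eiffel Tower is located in Paris, France."},
    {"content": "The human heart has four chambers."},
    {"content": "The human heart has four chambers."},  # duplicate content
])
print(store.get_document_count())  # 2, because the duplicate maps to the same ID
```

Everything lives in a Python dictionary, so the contents disappear when the process ends, which mirrors the non-persistent behaviour of an in-memory store.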

### **4. Why is This Step Important?**

This step is crucial because it allows the **retriever** to access the documents during querying. Once the documents are written to the document store:
- When a user asks a query, the retriever can look into this document store to find documents that match the query.
- The **reader** can then read these retrieved documents and extract an answer from them.


In [None]:
# Step 4: Initialize BM25Retriever for BM25 retrieval (keyword-based retrieval)
retriever = BM25Retriever(document_store=document_store)

The line `retriever = BM25Retriever(document_store=document_store)` is used to initialize a BM25Retriever in Haystack for keyword-based document retrieval.

In [None]:
# Step 5: Initialize a reader model (FARMReader)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)


The line `reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)` initializes a **reader model** using Haystack’s **FARMReader**. Here's a brief breakdown of what this line does:

### **1. FARMReader**:
- **FARMReader** is a **neural reading model** in Haystack that reads the retrieved documents and attempts to extract the most relevant answer to a given query.
- FARMReader is built on deepset's **FARM** framework and loads **Transformers** models from the Hugging Face Hub; it is designed for **extractive question answering**.
- In extractive question answering, the model **highlights the span of text** from the retrieved document that best answers the query.

### **2. `model_name_or_path="deepset/roberta-base-squad2"`**:
- This specifies the pre-trained model that will be loaded as the reader. In this case, we are using `"deepset/roberta-base-squad2"`, which is a **fine-tuned version of the RoBERTa model** on the **SQuAD 2.0** dataset (Stanford Question Answering Dataset).
- **RoBERTa** (Robustly Optimized BERT Pretraining Approach) is a variant of BERT with an improved pretraining procedure; this particular checkpoint has been fine-tuned for **question answering**.
- The **SQuAD 2.0** dataset includes unanswerable questions, so this model can identify when there is no appropriate answer in the documents.

### **3. `use_gpu=False`**:
- This parameter tells the reader whether to use a **GPU** (Graphics Processing Unit) or **CPU** for model inference.
- Setting `use_gpu=False` will run the model on the **CPU**, which is fine for smaller datasets or testing purposes.
- If you have a **GPU** available and want faster inference, you can set `use_gpu=True`.

### **Purpose of the Reader in the Pipeline**:
- The reader is the **second stage** in this retriever-reader pipeline. The notebook calls it a **RAG** pipeline, although the reader extracts answer spans rather than generating text:
  1. **Retriever** (e.g., BM25Retriever or DensePassageRetriever) retrieves relevant documents from the document store based on a user query.
  2. **Reader** (e.g., FARMReader) reads the top retrieved documents and tries to find the exact answer (a span of text) within those documents.
  
  In essence, the **retriever** provides potentially relevant documents, and the **reader** tries to find the exact answer within those documents.

### Example Flow:
1. **User Query**: The user asks a question like, "Where is the Eiffel Tower located?"
2. **Retriever**: The retriever fetches the most relevant documents about the Eiffel Tower. The number is set by `top_k`; the cells below use `top_k=1`.
3. **Reader**: The reader then examines those documents and extracts the span of text that contains the answer (e.g., "Paris, France").

### Why Use FARMReader?
- **Extractive Question Answering**: FARMReader can highlight the most relevant text directly from the retrieved documents, allowing you to extract precise answers.
- **Fine-tuned on SQuAD**: Models fine-tuned on SQuAD are trained to answer questions from real-world text passages, which makes them effective for question-answering tasks.
- **Scalability**: FARMReader supports both CPU and GPU, making it scalable based on your hardware.



In [None]:
# Step 6: Create a pipeline using BM25Retriever and the FARMReader
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

In [None]:
# Step 7: Ask a question to the RAG pipeline
query = "Where is the Eiffel Tower located?"

# Get answers from the pipeline
prediction = pipeline.run(
    query=query,
    params={
        "Retriever": {"top_k": 1},  # Number of documents to retrieve
        "Reader": {"top_k": 1}      # Number of answers to return
    }
)

print('\n')

# Print the answers
print_answers(prediction, details="minimum")

Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00,  2.98s/ Batches]



'Query: Where is the Eiffel Tower located?'
'Answers:'
[   {   'answer': 'Paris, France',
        'context': 'The Eiffel Tower is located in Paris, France.'}]





In [None]:
# Test another question
query_2 = "Which one is the tallest mountain in Bangladesh?"

# Get the answers for the second query
prediction_2 = pipeline.run(
    query=query_2,
    params={
        "Retriever": {"top_k": 1},  # Number of documents to retrieve
        "Reader": {"top_k": 1}      # Number of answers to return
    }
)

print('\n')

# Print the second set of answers
print_answers(prediction_2, details="minimum")

Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.22s/ Batches]



'Query: Which one is the tallest mountain in Bangladesh?'
'Answers:'
[   {   'answer': 'Eiffel Tower',
        'context': 'The Eiffel Tower is located in Paris, France.'}]
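
The second query has no supporting document in the store, yet the pipeline still returns a span from the only document it retrieved. One way to guard against this is to discard low-confidence answers. The sketch below assumes each answer object exposes `answer` and `score` attributes, as Haystack v1 answers do. The objects, their scores, and the `0.5` threshold are made up for illustration, since the output above does not show the real scores.

```python
from types import SimpleNamespace

def confident_answers(answers, min_score=0.5):
    """Keep only answers whose score meets the (hypothetical) threshold."""
    return [a for a in answers if a.score is not None and a.score >= min_score]

# Stand-ins for the objects in prediction["answers"]; the scores are invented.
answers = [
    SimpleNamespace(answer="Paris, France", score=0.97),
    SimpleNamespace(answer="Eiffel Tower", score=0.08),
]

kept = confident_answers(answers)
print([a.answer for a in kept])  # ['Paris, France']
```

Haystack v1's `FARMReader` also lists a `return_no_answer` option in its API reference, which may be a more direct way to surface unanswerable questions; verify it against your installed version before relying on it.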



