# Using Multi-Query Retrieval with GutenbergReader for Accessing Project Gutenberg Books
In this notebook, we demonstrate how to improve document retrieval by combining Multi-Query Retrieval (MQR) with the `GutenbergReader` class, which fetches books from Project Gutenberg. MQR expands a single query into several sub-queries, each targeting a different aspect of the question, and retrieves passages for each one. The combined results tend to be more diverse and complete than those of a single query, which makes the technique useful for research and question-answering systems.

Before you start, have the Project Gutenberg ID of the book you want to fetch ready; the reader uses this ID to download the book's content.

Also set up your API keys as environment variables and load them in Python with the dotenv library. Define `HUGGINGFACE_API_KEY` and `INDOX_API_KEY` in an environment file (this notebook loads one named `api.env`) rather than hardcoding them, which keeps the keys out of your codebase.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/Demo/multiquery_guten_chroma.ipynb)

In [None]:
!pip install indox
!pip install chromadb
!pip install beautifulsoup4
!pip install sentence_transformers
!pip install semantic_text_splitter

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indox`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indox
```
2. **Activate the virtual environment:**
```bash
indox\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
   ```bash
   python3 -m venv indox
   ```

2. **Activate the virtual environment:**
   ```bash
   source indox/bin/activate
   ```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```
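
To confirm that the notebook kernel is actually running inside the environment you just activated, a quick check such as the following can help. It is a minimal sketch; the helper name `in_virtualenv` is hypothetical and not part of Indox.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix points at the base installation. Outside one they match.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Running inside a virtual environment:", in_virtualenv())
```

If this prints `False` after activation, the IDE or Jupyter kernel is probably still bound to a different interpreter.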


## Import Essential Libraries

Next, we import the essential libraries for our Indox question-answering system:

- `IndoxRetrievalAugmentation`: Enhances the retrieval process by improving the relevance and quality of the documents retrieved, leading to better QA performance.
- `IndoxApi`: The language model client used in this notebook, authenticated with `INDOX_API_KEY` and assigned to the variable `openai_model`. In multi-query retrieval it generates the sub-queries and the final answer.
- `HuggingFaceEmbedding`: Wraps a Hugging Face embedding model that converts text into vectors capturing its contextual meaning.
- `SemanticTextSplitter`: Uses a Hugging Face tokenizer to split text into chunks of at most a specified number of tokens while keeping each chunk semantically coherent.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv('api.env')

HUGGINGFACE_API_KEY = os.environ['HUGGINGFACE_API_KEY']
INDOX_API_KEY = os.environ['INDOX_API_KEY']
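
Reading a key with `os.environ[...]` raises a bare `KeyError` when the variable is missing. As a sketch of a friendlier alternative, the hypothetical helper below (`require_env` is an illustrative name, not an Indox or dotenv function) fails with a message that says which variable to define.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Raise early with an actionable message instead of a bare KeyError.
        raise RuntimeError(
            f"Missing environment variable {name!r}; "
            "define it in your environment file before running the notebook."
        )
    return value


# Example usage after load_dotenv('api.env') has run:
# HUGGINGFACE_API_KEY = require_env("HUGGINGFACE_API_KEY")
# INDOX_API_KEY = require_env("INDOX_API_KEY")
```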

In [2]:
from indox import IndoxRetrievalAugmentation
indox = IndoxRetrievalAugmentation()

INFO: IndoxRetrievalAugmentation initialized

            ██  ███    ██  ██████   ██████  ██       ██
            ██  ████   ██  ██   ██ ██    ██   ██  ██
            ██  ██ ██  ██  ██   ██ ██    ██     ██
            ██  ██  ██ ██  ██   ██ ██    ██   ██   ██
            ██  ██  █████  ██████   ██████  ██       ██
            


## Building the GutenbergReader System and Initializing Models
Next, we initialize the models used to process content from Project Gutenberg: the `IndoxApi` language model, which generates sub-queries and answers, and a `HuggingFaceEmbedding` model, which converts text into vectors for retrieval.

In [3]:
from indox.llms import IndoxApi
from indox.embeddings import HuggingFaceEmbedding

openai_model = IndoxApi(api_key=INDOX_API_KEY)
embed = HuggingFaceEmbedding(api_key=HUGGINGFACE_API_KEY, model="multi-qa-mpnet-base-cos-v1")

INFO: Initialized HuggingFaceEmbedding with model: multi-qa-mpnet-base-cos-v1


## Setting Up the GutenbergReader for Retrieving Book Content
To demonstrate the capabilities of our `GutenbergReader` system and its integration with `Indox`, we will use a sample book from Project Gutenberg. This book will serve as our reference data, which we will use for testing and evaluation of the system.

In [4]:
from indox.data_connectors import GutenbergReader

reader = GutenbergReader()

book_id = "11"  # Alice's Adventures in Wonderland
content = reader.get_content(book_id)
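
Project Gutenberg plain-text files usually wrap the book body between `*** START OF ... PROJECT GUTENBERG EBOOK ...` and `*** END OF ... PROJECT GUTENBERG EBOOK ...` marker lines, with license text outside them. Whether `GutenbergReader` already strips that text is not shown in this notebook, so the following is only a sketch of how it could be done, assuming `content` is a plain string. The function name `strip_gutenberg_boilerplate` is hypothetical.

```python
import re

# Marker lines that Project Gutenberg plain-text files place around the book body.
_START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the START and END markers, if present."""
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    if stop < begin:
        # Markers in an unexpected order: fall back to keeping the tail.
        stop = len(text)
    return text[begin:stop].strip()


# Example usage (assumes `content` is a str):
# content = strip_gutenberg_boilerplate(content)
```

Text without any markers is returned unchanged apart from surrounding whitespace, so the helper is safe to apply unconditionally.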

## Splitting Content into Manageable Chunks
We use the `SemanticTextSplitter` class from the `indox.splitter` module to divide the retrieved content into smaller, semantically coherent chunks. The argument `400` sets the maximum number of tokens per chunk.

In [5]:
from indox.splitter import SemanticTextSplitter
splitter = SemanticTextSplitter(400)
content_chunks = splitter.split_text(content)
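
To make the idea of token-bounded chunking concrete, here is a simplified sketch. It is not how `SemanticTextSplitter` is implemented: it approximates tokens with whitespace-separated words and packs whole sentences greedily, whereas the real splitter uses a Hugging Face tokenizer. The function name `split_into_chunks` is hypothetical.

```python
import re
from typing import List


def split_into_chunks(text: str, max_tokens: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())  # crude token count: whitespace words
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than max_tokens becomes its own oversized chunk.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping sentences intact is the design choice that matters here: a chunk that ends mid-sentence tends to embed poorly and to read poorly when handed to the language model as context.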

## Storing and Indexing Content with Chroma
We use the `Chroma` vector store from the `indox.vector_stores` module to store and index the content chunks. By creating a collection named "sample" and applying an embedding function (`embed`), we convert each chunk into a vector for efficient retrieval. The `add` method then adds these vectors to the database, enabling scalable and effective search for question-answering tasks.

In [6]:
from indox.vector_stores import Chroma
db = Chroma(collection_name="sample", embedding_function=embed)
db.add(docs=content_chunks)
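
Conceptually, a vector store embeds each chunk, keeps the vectors, and answers a query by returning the chunks whose vectors are most similar to the query vector. The sketch below illustrates that idea with a toy word-count embedding and cosine similarity. It is an assumption-laden stand-in, not Chroma's implementation, and `TinyVectorStore` and `embed_text` are hypothetical names.

```python
import math
from collections import Counter
from typing import List


def embed_text(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory store that mirrors the add-then-search flow."""

    def __init__(self) -> None:
        self._docs: List[str] = []
        self._vectors: List[Counter] = []

    def add(self, docs: List[str]) -> None:
        for doc in docs:
            self._docs.append(doc)
            self._vectors.append(embed_text(doc))

    def search(self, query: str, k: int = 3) -> List[str]:
        q = embed_text(query)
        ranked = sorted(
            zip(self._docs, self._vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]
```

A real store replaces the word counts with dense vectors from a model such as `multi-qa-mpnet-base-cos-v1` and uses an index rather than a full scan, but the add and search steps have the same shape.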


2024-09-05 13:05:44,498 - chromadb.telemetry.product.posthog - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.

INFO: Storing documents in the vector store
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

INFO: Document added successfully to the vector store.
INFO: Documents stored successfully


## Querying Data with GPT and Indox Multi-Query Retrieval
With the vector store populated, we can test multi-query retrieval on a sample query. The test shows how the system retrieves passages from the vector database and uses the language model to generate an answer from them.

We’ll use the following sample query to evaluate our system:

- Query: "Who is the speaker talking to in the text?"

This query will be processed by the multi-query retrieval system, where multiple sub-queries will be generated and run against the vector database to retrieve relevant information and generate an accurate response based on the context.
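
The flow described above can be sketched in a few lines: run every sub-query against a retriever and merge the results. The code is an illustration under stated assumptions, not Indox's implementation. The corpus, the keyword-overlap retriever, and the names `multi_query_retrieve` and `keyword_retrieve` are all hypothetical, and this sketch deduplicates passages, which Indox may or may not do.

```python
from typing import Callable, List

# Toy corpus standing in for the chunks held by the vector database.
CORPUS = [
    "Alice spoke to the Mouse in the pool",
    "The Queen ordered the gardeners about",
    "The Mouse told Alice a long tale",
]


def keyword_retrieve(query: str, top_k: int) -> List[str]:
    """Rank passages by the number of words shared with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(p.lower().split())), p) for p in CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [p for _, p in ranked[:top_k]]


def multi_query_retrieve(
    sub_queries: List[str],
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 3,
) -> List[str]:
    """Retrieve top_k passages per sub-query and merge them without duplicates."""
    seen = set()
    merged: List[str] = []
    for sub_query in sub_queries:
        for passage in retrieve(sub_query, top_k):
            if passage not in seen:  # keep first occurrence, preserve order
                seen.add(passage)
                merged.append(passage)
    return merged


passages = multi_query_retrieve(
    ["who does alice address", "what did the queen order"],
    keyword_retrieve,
    top_k=2,
)
```

The second sub-query surfaces a passage the first one misses, which is the point of the technique: each sub-query covers a different part of the corpus, and the merged set is passed to the language model as context.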

In [7]:
indox.initialize_multi_query_retrieval(llm=openai_model, vector_database=db, top_k=3)

query = "Who is the speaker talking to in the text?"

answer = indox.run_multi_query_retrieval(query)

INFO: Multi-query retrieval initialized
INFO: Running multi-query retrieval for: Who is the speaker talking to in the text?
INFO: Generated queries: ['Here are three different queries you could use to gather information to determine who the speaker is talking to in a given text:', '1. **Contextual Analysis Query**: "What contextual clues in the text indicate the identity or characteristics of the audience or recipient of the speaker\'s message?"', '2. **Dialogue and Interaction Query**: "Are there any direct references or dialogue exchanges in the text that reveal who the speaker is addressing, such as names, titles, or pronouns?"', '3. **Tone and Purpose Query**: "What is the tone and purpose of the speaker\'s message, and how might these elements suggest the intended audience or recipient in the text?"']
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransform

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO: Retrieved 12 relevant passages
INFO: Generated final response


In [8]:
answer

'The speaker, Alice, is talking to the Mouse in the text. She addresses the Mouse directly, asking if it knows the way out of the pool and expressing her tiredness of swimming.'
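
The `Generated queries` log entry for this run shows that the list passed to retrieval contains the model's introductory sentence and markdown labels alongside the three actual questions. If you post-process LLM output yourself, a small cleaning step can keep only the questions. The helper below is a hypothetical sketch, not part of Indox, and assumes the numbered-list format seen in that log entry.

```python
import re
from typing import List


def clean_generated_queries(lines: List[str]) -> List[str]:
    """Keep numbered items only and extract the quoted question when present."""
    cleaned: List[str] = []
    for line in lines:
        match = re.match(r"^\s*\d+[.)]\s*(.+)$", line)
        if not match:
            continue  # skip preambles and other non-numbered lines
        body = match.group(1)
        quoted = re.search(r'"([^"]+)"', body)
        # Prefer the text inside double quotes; otherwise keep the whole item.
        cleaned.append(quoted.group(1) if quoted else body.strip())
    return cleaned
```

Dropping the preamble avoids spending one retrieval pass on a sentence that is not a real question.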