## How to use Indox Retrieval Augmentation for PDF files
In this notebook, we demonstrate how to use `inDox` as a question answering system with open-source models that are freely available online, such as `Mistral`. First, set up your environment variables and API keys in Python using the `dotenv` library.

**Note**:
Because we are using **HuggingFace** models, you need to define your `HUGGINGFACE_API_KEY` in a `.env` file. This keeps API keys and other sensitive information out of the codebase, which improves security and maintainability.


In [8]:
!pip install indox
!pip install chromadb
!pip install semantic_text_splitter
!pip install sentence-transformers


In [2]:
!wget https://raw.githubusercontent.com/osllmai/inDox/master/Demo/sample.txt

--2024-07-02 09:10:03--  https://raw.githubusercontent.com/osllmai/inDox/master/Demo/sample.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14025 (14K) [text/plain]
Saving to: ‘sample.txt’


2024-07-02 09:10:03 (72.4 MB/s) - ‘sample.txt’ saved [14025/14025]
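If `wget` is not available in your environment, the same file can be fetched from Python. The following is a minimal sketch using only the standard library; the helper name `download_file` is hypothetical and is not part of Indox.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str) -> Path:
    """Download `url` to `destination` and return the destination path."""
    target = Path(destination)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_file("https://raw.githubusercontent.com/osllmai/inDox/master/Demo/sample.txt", "sample.txt")` should leave `sample.txt` in the working directory, just as the `wget` command does.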



In [1]:
import os
from dotenv import load_dotenv

load_dotenv()
HUGGINGFACE_API_KEY = os.getenv('HUGGINGFACE_API_KEY')
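`os.getenv` silently returns `None` when the variable is missing, and the failure then only shows up later when the API is called. A small guard can make that failure explicit. This is a sketch under the assumption that the `.env` file contains a line of the form `HUGGINGFACE_API_KEY=<your key>`; the helper `require_env` is hypothetical and not part of Indox or `dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail loudly."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(
            f"{name} is not set. Add a line such as {name}=<your key> to the .env file."
        )
    return value
```

After `load_dotenv()`, the key could then be read with `require_env("HUGGINGFACE_API_KEY")` instead of `os.getenv(...)`.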

### Import Essential Libraries
Then, we import essential libraries for our `Indox` question answering system:
- `IndoxRetrievalAugmentation`: Enhances the retrieval process for better QA performance.
- `HuggingFaceModel`: The Indox wrapper for question answering with a Hugging Face model (here, Mistral).
- `HuggingFaceEmbedding`: Utilizes Hugging Face embeddings for improved semantic understanding.
- `SimpleLoadAndSplit`: A utility for loading files (PDF or plain text) and splitting them into chunks.

In [2]:
from indox import IndoxRetrievalAugmentation
from indox.llms import HuggingFaceModel
from indox.embeddings import HuggingFaceEmbedding
from indox.data_loader_splitter.SimpleLoadAndSplit import SimpleLoadAndSplit

### Building the Indox System and Initializing Models

Next, we will build our `inDox` system and initialize the Mistral question answering model along with the embedding model. This setup will allow us to leverage the advanced capabilities of Indox for our question answering tasks.


In [3]:
indox = IndoxRetrievalAugmentation()
mistral_qa = HuggingFaceModel(api_key=HUGGINGFACE_API_KEY,model="mistralai/Mistral-7B-Instruct-v0.2")
embed = HuggingFaceEmbedding(model="multi-qa-mpnet-base-cos-v1")

INFO: IndoxRetrievalAugmentation initialized

            ██  ███    ██  ██████   ██████  ██       ██
            ██  ████   ██  ██   ██ ██    ██   ██  ██
            ██  ██ ██  ██  ██   ██ ██    ██     ██
            ██  ██  ██ ██  ██   ██ ██    ██   ██   ██
            ██  ██  █████  ██████   ██████  ██       ██
            
INFO: Initializing HuggingFaceModel with model: mistralai/Mistral-7B-Instruct-v0.2
INFO: HuggingFaceModel initialized successfully


2024-07-08 20:07:06,799 INFO:Load pretrained SentenceTransformer: multi-qa-mpnet-base-cos-v1
2024-07-08 20:07:08,358 INFO:Use pytorch device: cpu


INFO: Initialized HuggingFace embeddings with model: multi-qa-mpnet-base-cos-v1


### Setting Up Reference Directory and File Path

To demonstrate the capabilities of our Indox question answering system, we use a sample text file as reference data for testing and evaluation.

The file is named `sample.txt` and is located in our working directory.

Let's define the file path for our reference data.

In [5]:
file_path = "sample.txt"

### Chunking Reference Data with SimpleLoadAndSplit

To effectively utilize our reference data, we need to process and chunk it into manageable parts. This ensures that our question answering system can efficiently handle and retrieve relevant information.

We use the `SimpleLoadAndSplit` utility for this task. It loads a file and splits it into smaller chunks, which makes the data easier for the retrieval and QA models to process. The `bert-base-uncased` model is used for splitting the data.

In this step, we use `SimpleLoadAndSplit` to chunk the data with `max_chunk_size=200`. Judging by the chunks shown in the output below, which are much longer than 200 characters, this limit appears to be counted in tokens rather than characters. The `remove_sword` parameter controls whether stop words are removed.
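To make the idea of size-bounded chunking concrete, here is a simplified sketch. It does not reproduce how `SimpleLoadAndSplit` works internally: it counts whitespace-separated words instead of model tokens, and the function name `chunk_text`, its parameters, and the tiny stop-word list are all assumptions made for illustration.

```python
import re

# A tiny, illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "was"}


def chunk_text(text: str, max_chunk_size: int, remove_stop_words: bool = False) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly `max_chunk_size` words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        if remove_stop_words:
            words = [w for w in words if w.lower().strip(".,!?") not in STOP_WORDS]
        # Start a new chunk when adding this sentence would exceed the limit.
        # A single sentence longer than the limit still becomes one oversized chunk.
        if current and len(current) + len(words) > max_chunk_size:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping sentences intact is a design choice in this sketch: it trades a strict size limit for chunks that remain readable on their own.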

Let's proceed with chunking our reference data.


In [6]:
loader = SimpleLoadAndSplit(file_path=file_path, remove_sword=False, max_chunk_size=200)
docs = loader.load_and_chunk()

2024-07-08 20:07:14,368 INFO:Initializing UnstructuredLoadAndSplit
2024-07-08 20:07:14,368 INFO:UnstructuredLoadAndSplit initialized successfully
2024-07-08 20:07:14,369 INFO:Getting all documents
2024-07-08 20:07:14,370 INFO:Starting processing
2024-07-08 20:07:14,371 INFO:Created initial document elements
2024-07-08 20:07:16,921 INFO:Completed chunking process
2024-07-08 20:07:16,922 INFO:Successfully obtained all documents


In [12]:
docs

["The wife of a rich man fell sick, and as she felt that her end was drawing near, she called her only daughter to her bedside and said, dear child, be good and pious, and then the good God will always protect you, and I will look down on you from heaven and be near you.  Thereupon she closed her eyes and departed.  Every day the maiden went out to her mother's grave, and wept, and she remained pious and good.  When winter came the snow spread a white sheet over the grave, and by the time the spring sun had drawn it off again, the man had taken another wife. The woman had brought with her into the house two daughters, who were beautiful and fair of face, but vile and black of heart. Now began a bad time for the poor step-child.  Is the stupid goose to sit in the parlor with us, they said.  He who wants to eat bread",
 'must earn it.  Out with the kitchen-wench.  They took her pretty clothes away from her, put an old grey bedgown on her, and gave her wooden shoes.  Just look at the prou

### Connecting Embedding Model to Indox

With our reference data chunked and ready, the next step is to connect our embedding model to the Indox system. This connection enables the system to leverage the embeddings for better semantic understanding and retrieval performance.

We first create a `ChromaVectorStore`, passing it the `HuggingFaceEmbedding` model and a collection name, and then call the `connect_to_vectorstore` method to attach that store to our Indox system. This ensures that our reference data is indexed and stored with the chosen embeddings, so it can be retrieved efficiently during question answering.

Let's connect the embedding model to Indox.


In [7]:
from indox.vector_stores import ChromaVectorStore
db = ChromaVectorStore(collection_name="sample",embedding=embed)

2024-07-08 20:07:25,931 INFO:Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


In [8]:
indox.connect_to_vectorstore(vectorstore_database=db)

INFO: Connection to the vector store database established successfully


<indox.vector_stores.Chroma.ChromaVectorStore at 0x2b47d77b6e0>

### Storing Data in the Vector Store

After connecting our embedding model to the Indox system, the next step is to store our chunked reference data in the vector store. This process ensures that our data is indexed and readily available for retrieval during the question-answering process.

We use the `store_in_vectorstore` method to store the processed data in the vector store. By doing this, we enhance the system's ability to quickly access and retrieve relevant information based on the embeddings generated earlier.
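The following toy example sketches what storing and retrieving by embedding means. It is not how Chroma or Indox implement it: the bag-of-words `embed` function stands in for a real embedding model, and `ToyVectorStore` with its `add` and `query` methods is a hypothetical class written only for this illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (real models return dense vectors)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, docs: list[str]) -> None:
        # Embed each chunk once at insert time so queries only embed the question.
        self._items.extend((doc, embed(doc)) for doc in docs)

    def query(self, question: str, top_k: int = 5) -> list[tuple[str, float]]:
        # Score every stored chunk and return the top_k most similar ones.
        q = embed(question)
        scored = [(doc, cosine(q, vec)) for doc, vec in self._items]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A production vector store replaces the linear scan in `query` with an approximate nearest-neighbour index, but the contract is the same: chunks go in once, and the closest ones come back with a score.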

Let's proceed with storing the data in the vector store.


In [9]:
indox.store_in_vectorstore(docs)

INFO: Storing documents in the vector store
INFO: Document added successfully to the vector store.
INFO: Documents stored successfully


<indox.vector_stores.Chroma.ChromaVectorStore at 0x2b47d77b6e0>

## Query from RAG System with Indox
With our Retrieval-Augmented Generation (RAG) system built using Indox, we are now ready to test it with a sample question. This test will demonstrate how effectively our system can retrieve and generate accurate answers based on the reference data stored in the vector store.

We'll use a sample query to test our system:
- **Query**: "How did Cinderella reach her happy ending?"

This question will be processed by our Indox system to retrieve relevant information and generate an appropriate response.

Let's test our RAG system with the sample question.

In [10]:
query = "How did Cinderella reach her happy ending?"

We'll use the `invoke` method to get a response from the system.


The `invoke` method retrieves relevant chunks from the vector store and passes them, together with the query, to the connected QA model. As the cells below show, it returns the answer as a string, and the retrieved context chunks are then available through the `context` attribute of the retriever.


We first build the retriever with `indox.QuestionAnswer`, then pass the query to its `invoke` method.
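Conceptually, a RAG system combines the retrieved chunks and the question into a single prompt for the language model. The prompt template that Indox actually uses is not shown in this notebook, so the sketch below is only an assumption about the general shape; `build_prompt` is a hypothetical helper.

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    # Number the chunks so the model (and a human reader) can tell them apart.
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

With `top_k=5`, five chunks would be placed in the context section, which is why the chunk size chosen earlier directly affects how much text the model receives.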


In [11]:
retriever = indox.QuestionAnswer(vector_database=db,llm=mistral_qa,top_k=5)

In [12]:
answer = retriever.invoke(query=query)

INFO: Retrieving context and scores from the vector database
INFO: Generating answer without document relevancy filter
INFO: Answering question
INFO: Sending request to Hugging Face API


2024-07-08 20:07:44,160 INFO:Backing off send_request(...) for 0.3s (requests.exceptions.SSLError: HTTPSConnectionPool(host='us-api.i.posthog.com', port=443): Max retries exceeded with url: /batch/ (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1000)'))))


INFO: Received successful response from Hugging Face API
INFO: Query answered successfully


In [13]:
answer

'Cinderella reached her happy ending through a series of magic events. She went and sat beneath an enchanted tree, and wept and prayed, and a little white bird always came and granted her every wish. When the king gave orders for a festival where his son could choose a bride, Cinderella wept and prayed to the bird for a golden dress and slippers. The bird granted her request, and she went to the wedding and danced with the prince until it'

In [14]:
context = retriever.context
context

['must earn it.  Out with the kitchen-wench.  They took her pretty clothes away from her, put an old grey bedgown on her, and gave her wooden shoes.  Just look at the proud princess, how decked out she is, they cried, and laughed, and led her into the kitchen. There she had to do hard work from morning till night, get up before daybreak, carry water, light fires, cook and wash.  Besides this, the sisters did her every imaginable injury - they mocked her and emptied her peas and lentils into the ashes, so that she was forced to sit and pick them out again.  In the evening when she had worked till she was weary she had no bed to go to, but had to sleep by the hearth in the cinders.  And as on that account she always looked dusty and dirty, they called her cinderella. It happened that the father was once going to the fair, and he',
 "And now the bird threw down to her a dress which was more splendid and magnificent than any she had yet had, and the slippers were golden.  And when she went

## Evaluation

In [None]:
from indox.evaluation import Evaluation
evaluator = Evaluation(["BertScore"])

In [None]:
inputs = {
    "question" : query,
    "answer" : answer,
    "context" : context
}
result = evaluator(inputs)

In [None]:
result

| BertScore metric | Value |
| --- | --- |
| Precision | 0.547425 |
| Recall | 0.482575 |
| F1-score | 0.507377 |
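To give some intuition for the three numbers, here is a deliberately simplified stand-in. BertScore matches tokens by the similarity of their contextual embeddings, whereas this sketch only counts exact word matches, so its values are not comparable to the ones above. The function name `token_overlap_scores` is hypothetical.

```python
from collections import Counter


def token_overlap_scores(candidate: str, reference: str) -> dict[str, float]:
    """Precision, recall and F1 based on exact word overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"Precision": precision, "Recall": recall, "F1-score": f1}
```

Precision asks how much of the answer is supported by the reference text, recall asks how much of the reference text the answer covers, and F1 balances the two.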
