## Introduction

In this notebook, we demonstrate how to use `Indox` as a question answering system with open-source models available online, such as `Mistral`. First, you need to set up environment variables and API keys in Python using the `dotenv` library. Environment variables are a crucial part of configuring your applications, especially when dealing with sensitive information like API keys.

::: {.callout-note}
Because this notebook uses **Mistral** models, you need to define your `MISTRAL_API_KEY` in the `.env` file. The code below also reads `HUGGINGFACE_API_KEY`, which the rest of this notebook does not use. Storing keys in `.env` keeps API keys and other sensitive information out of the codebase, improving security and maintainability.
:::

Let's start by importing the required libraries and loading our environment variables.


In [1]:
import os
from dotenv import load_dotenv

load_dotenv()
MISTRAL_API_KEY = os.getenv('MISTRAL_API_KEY')
HUGGINGFACE_API_KEY = os.getenv('HUGGINGFACE_API_KEY')
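
Note that `os.getenv` returns `None` when a variable is missing, so a misspelled or absent key would only surface later as an authentication error. A minimal sketch of an early check follows; the helper name `require_env` is hypothetical and is not part of Indox or `dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail early if it is unset."""
    value = os.getenv(name)
    if not value:
        # os.getenv returns None for a missing variable; an empty string is treated as missing too.
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value
```

After `load_dotenv()`, one could then write `MISTRAL_API_KEY = require_env("MISTRAL_API_KEY")` in place of the plain `os.getenv` call.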

### Import Essential Libraries
Next, we import the main components of our `Indox` question answering system. Only `IndoxRetrievalAugmentation` is imported in the next cell; the others are imported in later cells as they are needed:
- `IndoxRetrievalAugmentation`: Enhances the retrieval process for better QA performance.
- `Mistral`: The QA model wrapper from Indox, backed by the Mistral API.
- `MistralEmbedding`: Uses Mistral embeddings for improved semantic understanding.
- `UnstructuredLoadAndSplit`: A utility for loading and splitting unstructured data.

In [10]:
from indox import IndoxRetrievalAugmentation
indox = IndoxRetrievalAugmentation()

2024-06-26 12:11:48,329 INFO:IndoxRetrievalAugmentation initialized


### Building the Indox System and Initializing Models

Next, we will build our `Indox` system and initialize the Mistral question answering model along with the Mistral embedding model. This setup lets us use Indox for our question answering tasks.


In [6]:
from indox.llms import Mistral
from indox.embeddings import MistralEmbedding
mistral_qa = Mistral(api_key=MISTRAL_API_KEY)
embed_mistral = MistralEmbedding(MISTRAL_API_KEY)

2024-06-26 12:10:51,848 INFO:Initializing MistralAI with model: mistral-medium-latest
2024-06-26 12:10:52,059 INFO:MistralAI initialized successfully
2024-06-26 12:10:56,454 INFO:Initialized Mistral embeddings


### Setting Up Reference Directory and File Path

To demonstrate the capabilities of our Indox question answering system, we will use a sample file as reference data for testing and evaluation.

In this case, the file is named `sample.txt` and is located in the working directory.

Let's define the file path for our reference data.

In [7]:
file_path = "sample.txt"

### Chunking Reference Data with UnstructuredLoadAndSplit

To effectively utilize our reference data, we need to process and chunk it into manageable parts. This ensures that our question answering system can efficiently handle and retrieve relevant information.

We use the `UnstructuredLoadAndSplit` utility for this task. This tool allows us to load the unstructured data from our specified file and split it into smaller chunks. This process enhances the performance of our retrieval and QA models by making the data more accessible and easier to process.

In this step, we pass the file path defined above to `UnstructuredLoadAndSplit` and chunk the data with a maximum chunk size of 400 characters.

Let's proceed with chunking our reference data.


In [11]:
from indox.data_loader_splitter import UnstructuredLoadAndSplit
load_splitter = UnstructuredLoadAndSplit(file_path=file_path, max_chunk_size=400)
docs = load_splitter.load_and_chunk()

2024-06-26 12:11:58,010 INFO:Initializing UnstructuredLoadAndSplit
2024-06-26 12:11:58,011 INFO:UnstructuredLoadAndSplit initialized successfully
2024-06-26 12:11:58,011 INFO:Getting all documents
2024-06-26 12:11:58,012 INFO:Starting processing
2024-06-26 12:11:58,326 INFO:Created initial document elements
2024-06-26 12:11:58,326 INFO:Using title-based chunking
2024-06-26 12:11:58,330 INFO:Completed chunking process
2024-06-26 12:11:58,332 INFO:Successfully obtained all documents
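
To make the idea of size-limited chunking concrete, here is a minimal sketch that greedily packs paragraphs into chunks of at most 400 characters. It is an illustration only: the logs above show that `UnstructuredLoadAndSplit` uses title-based chunking, which this sketch does not reproduce, and the function name `chunk_paragraphs` is hypothetical.

```python
def chunk_paragraphs(text: str, max_chunk_size: int = 400) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of limited size."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any paragraph that exceeds the limit on its own.
        pieces = [
            paragraph[i:i + max_chunk_size]
            for i in range(0, len(paragraph), max_chunk_size)
        ]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chunk_size:
                # The piece still fits in the chunk being built.
                current = candidate
            else:
                # Close the current chunk and start a new one with this piece.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Smaller chunks give the retriever more precise units to match against a query, at the cost of splitting passages that belong together; the 400-character limit used in this notebook is one point on that trade-off.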


### Connecting Embedding Model to Indox

With our reference data chunked and ready, the next step is to connect our embedding model to the Indox system. This connection enables the system to leverage the embeddings for better semantic understanding and retrieval performance.

We first create a `ChromaVectorStore`, specifying a collection name and the `MistralEmbedding` model, and then use the `connect_to_vectorstore` method to link it to our Indox system. This ensures that our reference data is indexed and stored with these embeddings, facilitating efficient retrieval during the question-answering process.

Let's connect the embedding model to Indox.


In [12]:
from indox.vector_stores import ChromaVectorStore
db = ChromaVectorStore(collection_name="sample", embedding=embed_mistral)

In [13]:
indox.connect_to_vectorstore(vectorstore_database=db)

2024-06-26 12:12:05,057 INFO:Attempting to connect to the vector store database
2024-06-26 12:12:05,058 INFO:Connection to the vector store database established successfully


<indox.vector_stores.Chroma.ChromaVectorStore at 0x1632bb24140>

### Storing Data in the Vector Store

After connecting our embedding model to the Indox system, the next step is to store our chunked reference data in the vector store. This process ensures that our data is indexed and readily available for retrieval during the question-answering process.

We use the `store_in_vectorstore` method to store the processed data in the vector store. By doing this, we enhance the system's ability to quickly access and retrieve relevant information based on the embeddings generated earlier.

Let's proceed with storing the data in the vector store.


In [14]:
indox.store_in_vectorstore(docs)

2024-06-26 12:12:07,250 INFO:Storing documents in the vector store


[Document(page_content=The wife of a rich man fell sick, and as she felt that her end

was drawing near, she called her only daughter to her bedside and

said, dear child, be good and pious, and then the

good God will always protect you, and I will look down on you

from heaven and be near you. Thereupon she closed her eyes and

departed. Every day the maiden went out to her mother's grave,, metadata={'filename': 'sample.txt', 'filetype': 'text/plain', 'last_modified': '2024-05-30T13:53:09'}), Document(page_content=and wept, and she remained pious and good. When winter came

the snow spread a white sheet over the grave, and by the time the

spring sun had drawn it off again, the man had taken another wife.

The woman had brought with her into the house two daughters,

who were beautiful and fair of face, but vile and black of heart.

Now began a bad time for the poor step-child. Is the stupid goose, metadata={'filename': 'sample.txt', 'filetype': 'text/plain', 'last_modified': '2024-0

2024-06-26 12:12:10,335 INFO:HTTP Request: POST https://api.mistral.ai/v1/embeddings "HTTP/1.1 200 OK"
2024-06-26 12:12:15,224 INFO:Document added successfully to the vector store.
2024-06-26 12:12:15,225 INFO:Documents stored successfully


<indox.vector_stores.Chroma.ChromaVectorStore at 0x1632bb24140>
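
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are most similar to the query embedding. The toy sketch below assumes cosine similarity and uses word counts as a stand-in for a real embedding model. None of these names come from Indox or Chroma, and real stores use learned embeddings and dedicated indexes.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory store that keeps (text, embedding) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, texts: list[str]) -> None:
        """Embed each text once and keep it alongside its vector."""
        self._items.extend((text, embed(text)) for text in texts)

    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Return the top_k stored texts with their similarity to the query."""
        query_vec = embed(query)
        scored = [(text, cosine(query_vec, vec)) for text, vec in self._items]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Embedding the chunks once at storage time is what makes later queries cheap: only the query itself has to be embedded at question time.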

## Querying the RAG System with Indox
With our Retrieval-Augmented Generation (RAG) system built using Indox, we are now ready to test it with a sample question. This test will demonstrate how effectively our system can retrieve and generate accurate answers based on the reference data stored in the vector store.

We'll use a sample query to test our system:
- **Query**: "How did Cinderella reach her happy ending?"

This question will be processed by our Indox system to retrieve relevant information and generate an appropriate response.

Let's test our RAG system with the sample question.

In [15]:
query = "How cinderella reach her happy ending?"

We'll use the `invoke` method to get a response from the system.


The `invoke` method retrieves relevant chunks from the vector store and passes them, together with the query, to the connected QA model. As the cells below show:
- `invoke` returns the generated answer as a string.
- The retrieved contexts are available afterwards through `retriever.context`.
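
As a rough picture of the step between retrieval and generation, the retrieved chunks are typically combined with the question into a single prompt for the language model. The sketch below shows one way this could look. The function name `build_prompt` and the template wording are assumptions made for illustration, not the prompt that Indox actually sends.

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Combine retrieved contexts and the question into one prompt (hypothetical template)."""
    # Number the contexts so the model can tell the chunks apart.
    numbered = "\n\n".join(
        f"[{i}] {ctx}" for i, ctx in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because the model only sees the retrieved chunks, an answer can be hedged or incomplete when the relevant passage is not among them.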


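We first create a retriever with `indox.QuestionAnswer`, requesting the five most relevant chunks (`top_k=5`), then pass the query to its `invoke` method and display the response.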


In [16]:
retriever = indox.QuestionAnswer(vector_database=db, llm=mistral_qa, top_k=5)

In [17]:
answer = retriever.invoke(query=query)

2024-06-26 12:12:19,042 INFO:Retrieving context and scores from the vector database
2024-06-26 12:12:19,908 INFO:HTTP Request: POST https://api.mistral.ai/v1/embeddings "HTTP/1.1 200 OK"
2024-06-26 12:12:19,911 INFO:Generating answer without document relevancy filter
2024-06-26 12:12:19,912 INFO:Answering question: How cinderella reach her happy ending?
2024-06-26 12:12:19,912 INFO:Attempting to generate an answer for the question: How cinderella reach her happy ending?


5


2024-06-26 12:12:27,793 INFO:HTTP Request: POST https://api.mistral.ai/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-26 12:12:27,795 INFO:Query answered successfully


In [18]:
answer

"Based on the context provided, Cinderella's happy ending is not explicitly stated. However, it can be inferred that she might reach her happy ending through the help of a magical bird and the invitation to a three-day festival given by the king. Cinderella expressed a wish to the bird, and it threw down to her what she had wished for. The king also gave orders for a festival which was to last three days, and to which all the beautiful young girls in the country were invited, in order that his son might choose himself a bride. Cinderella wished to attend the festival, but her stepmother did not allow her to go. It is possible that Cinderella will use the help of the magical bird to attend the festival and meet the king's son, leading to her happy ending. However, this is only speculation based on the given context, and the actual ending is not explicitly stated."

In [12]:
context = retriever.context
context

['by the hearth in the cinders. And as on that account she always\n\nlooked dusty and dirty, they called her cinderella.\n\nIt happened that the father was once going to the fair, and he\n\nasked his two step-daughters what he should bring back for them.\n\nBeautiful dresses, said one, pearls and jewels, said the second.\n\nAnd you, cinderella, said he, what will you have. Father',
 'cinderella expressed a wish, the bird threw down to her what she\n\nhad wished for.\n\nIt happened, however, that the king gave orders for a festival\n\nwhich was to last three days, and to which all the beautiful young\n\ngirls in the country were invited, in order that his son might choose\n\nhimself a bride. When the two step-sisters heard that they too were',
 'know where she was gone. He waited until her father came, and\n\nsaid to him, the unknown maiden has escaped from me, and I\n\nbelieve she has climbed up the pear-tree. The father thought,\n\ncan it be cinderella. And had an axe brought and cut 

## Evaluation
Evaluating the performance of your question-answering system is crucial to ensure the quality and reliability of the responses. In this section, we will use the `Evaluation` module from Indox to assess our system's outputs.


In [13]:
from indox.evaluation import Evaluation
evaluator = Evaluation(["BertScore", "Toxicity"])

### Preparing Inputs for Evaluation
Next, we need to format the inputs according to the Indox evaluator's requirements. This involves creating a dictionary that includes the question, the generated answer, and the context from which the answer was derived.

In [14]:
inputs = {
    "question" : query,
    "answer" : answer,
    "context" : context
}
result = evaluator(inputs)

In [15]:
result

| Metric | Value |
|---|---|
| Precision | 0.524382 |
| Recall | 0.537209 |
| F1-score | 0.530718 |
| Toxicity | 0.074495 |
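
As a quick sanity check on these numbers, the sketch below assumes that the reported F1-score is the harmonic mean of the reported precision and recall, which is the usual definition. The values are copied from the result above and the variable names are illustrative.

```python
precision = 0.524382  # BertScore precision reported above
recall = 0.537209     # BertScore recall reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))
```

The computed value agrees with the reported F1-score of 0.530718 to six decimal places, so the three BertScore figures are internally consistent.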
