## Indox Retrieval Augmentation
Here, we explore how to work with Indox Retrieval Augmentation. We use Mistral for both the LLM and the embedding model (`mistral-embed`), so `MISTRAL_API_KEY` must be set as an environment variable. The setup cell below also reads `HUGGINGFACE_API_KEY`, although the code in this notebook does not use it afterwards.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/cookbook/indoxArcg/mistral_clusteredSplit.ipynb)

In [None]:
!pip install indoxArcg
!pip install mistralai
!pip install chromadb
!pip install python-dotenv

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indoxArcg`:

### Windows

1. **Create the virtual environment:**
```bash
  python -m venv indoxArcg
```

2. **Activate the virtual environment:**
```bash
  indoxArcg\Scripts\activate
```


### macOS/Linux

1. **Create the virtual environment:**
```bash
  python3 -m venv indoxArcg
```

2. **Activate the virtual environment:**
```bash
  source indoxArcg/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
  pip install -r requirements.txt
```
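
To confirm that the interpreter running the notebook or script is the one from the virtual environment, a small check like the following can help. This is a minimal sketch using only the standard library; the helper name `in_virtualenv` is an illustration and is not part of indoxArcg.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```

If this prints `False` after activation, the IDE or notebook kernel is probably still bound to the system interpreter.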


In [None]:
!wget https://raw.githubusercontent.com/osllmai/inDox/master/Demo/sample.txt
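
`wget` is not always available, for example on a default Windows installation. A rough standard-library alternative is sketched below. The helper name `download_file` is hypothetical; the URL is the one used in the cell above.

```python
from pathlib import Path
from urllib.request import urlretrieve

SAMPLE_URL = "https://raw.githubusercontent.com/osllmai/inDox/master/Demo/sample.txt"


def download_file(url: str, dest: str = "sample.txt") -> Path:
    """Download url to dest unless the file already exists, then return the path."""
    path = Path(dest)
    if not path.exists():
        # urlretrieve streams the response body straight into the target file.
        urlretrieve(url, str(path))
    return path


# Usage (needs network access):
# download_file(SAMPLE_URL)
```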

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()
MISTRAL_API_KEY = os.getenv('MISTRAL_API_KEY')
HUGGINGFACE_API_KEY = os.getenv('HUGGINGFACE_API_KEY')
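
`os.getenv` returns `None` when a variable is missing, so a forgotten key only surfaces later as an authentication error. A minimal sketch of failing early is shown below; `require_env` is a hypothetical helper, not part of indoxArcg or python-dotenv.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables as a dict, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Usage, after load_dotenv() has run:
# keys = require_env(["MISTRAL_API_KEY"])
# MISTRAL_API_KEY = keys["MISTRAL_API_KEY"]
```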

### Creating an instance of IndoxRetrievalAugmentation

To effectively utilize the Indox Retrieval Augmentation capabilities, you must first create an instance of the IndoxRetrievalAugmentation class. This instance will allow you to access the methods and properties defined within the class, enabling the augmentation and retrieval functionalities.

### Generating responses using Mistral's language models
The `Mistral` class handles question answering with Mistral's language models, and `MistralEmbedding` specifies the embedding model (`mistral-embed` here). `ClusteredSplit` then loads the file, splits it into chunks, and clusters and summarizes those chunks using the models passed to it. `UnstructuredLoadAndSplit`, used later in this notebook, can import various file types and split them into chunks.

In [None]:
from indoxArcg.llms import Mistral
from indoxArcg.data_loader_splitter import ClusteredSplit
from indoxArcg.embeddings import MistralEmbedding

# LLM used for question answering and for summarizing clusters
mistral_qa = Mistral(api_key=MISTRAL_API_KEY)
# Embedding model shared by the splitter and the vector store
embed_mistral = MistralEmbedding(MISTRAL_API_KEY, model="mistral-embed")
file_path = "sample.txt"

INFO: Initializing MistralAI with model: mistral-medium-latest
INFO: MistralAI initialized successfully
INFO: Initialized MistralEmbedding with model: mistral-embed


In [8]:
loader_splitter = ClusteredSplit(
    file_path=file_path,
    summary_model=mistral_qa,
    embeddings=embed_mistral,
)
docs = loader_splitter.load_and_chunk()

INFO: ClusteredSplit initialized successfully
INFO: Starting processing for documents
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using engine: mistral-embed
INFO: --Generated 6 clusters--
INFO: Generating summary for documentation
INFO: Generating summary for documentation
INFO: Generating summary for documentation
INFO: Generating summary for documentation
INFO: Generating summary for documentation
INFO: Generating summary for documentation
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using engine: mistral-embed
INFO: --Generated 1 clusters--
INFO: Generating summary for documentation
INFO: Completed chunking & clustering process
INFO: Successfully obt

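The log above suggests a layered flow: chunks are embedded and clustered, each cluster is summarized, and the summaries are embedded and clustered again until a single cluster remains. The toy below sketches that general idea only. It is not the `ClusteredSplit` implementation: the bag-of-words embedding, the greedy clustering, and the truncating "summary" are all stand-ins chosen so the example runs without any API key, and every function name is hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_cluster(chunks, threshold=0.3):
    """Greedy clustering: join the first cluster whose first chunk is similar enough."""
    clusters = []
    for chunk in chunks:
        for cluster in clusters:
            if cosine(toy_embed(chunk), toy_embed(cluster[0])) >= threshold:
                cluster.append(chunk)
                break
        else:
            clusters.append([chunk])
    return clusters


def toy_summarize(cluster, max_words=12):
    """Stand-in for an LLM summary: keep the first words of the joined chunks."""
    return " ".join(" ".join(cluster).split()[:max_words])


def clustered_split(chunks, max_levels=3):
    """Return the original chunks followed by the cluster summaries of each level."""
    all_docs = list(chunks)
    level = list(chunks)
    for _ in range(max_levels):
        clusters = toy_cluster(level)
        if len(clusters) == len(level):
            break  # nothing merged, so another level would add no information
        summaries = [toy_summarize(cluster) for cluster in clusters]
        all_docs.extend(summaries)
        if len(clusters) == 1:
            break
        level = summaries
    return all_docs


chunks = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices rose sharply today",
    "stock markets rose again today",
]
for doc in clustered_split(chunks):
    print(doc)
```

The point of keeping the summaries next to the original chunks is that a retriever can then match a query against either fine-grained text or a higher-level summary.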
Here, the `Chroma` class handles the storage and retrieval of vector embeddings. Given a collection name and an embedding function, it sets up a vector store where text embeddings can be stored and queried.

In [None]:
from indoxArcg.vector_stores import Chroma

db = Chroma(collection_name="sample", embedding_function=embed_mistral)

INFO: Connection to the vector store database established successfully


<indox.vector_stores.chroma.Chroma at 0x27bb81d07a0>
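
Conceptually, a vector store keeps each text next to its embedding and answers a query by ranking stored texts by similarity to the query embedding. The in-memory toy below illustrates that idea under simplifying assumptions: `ToyVectorStore` and `toy_embed` are hypothetical names, the embedding is a plain word count, and a real store such as Chroma adds persistence and indexing that this sketch does not attempt.

```python
import math
from collections import Counter


def toy_embed(text):
    """Bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Minimal in-memory store: a list of (text, vector) pairs."""

    def __init__(self, embedding_function):
        self.embedding_function = embedding_function
        self._rows = []

    def add(self, docs):
        """Embed and store each text; returns self so calls can be chained."""
        for text in docs:
            self._rows.append((text, self.embedding_function(text)))
        return self

    def search(self, query, top_k=5):
        """Return the top_k stored texts ranked by cosine similarity to the query."""
        query_vec = self.embedding_function(query)
        scored = [(cosine(query_vec, vec), text) for text, vec in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


store = ToyVectorStore(toy_embed).add([
    "the glass slipper fits",
    "a pumpkin becomes a coach",
    "the prince hosts a ball",
])
print(store.search("glass slipper", top_k=1))
```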

### Load and preprocess data
This part of the code loads text data from a file, splits it into chunks (`max_chunk_size=400`), and stores these chunks in the vector store that was set up previously.

In [None]:
from indoxArcg.data_loader_splitter import UnstructuredLoadAndSplit

loader_splitter = UnstructuredLoadAndSplit(file_path=file_path, max_chunk_size=400)
docs = loader_splitter.load_and_chunk()

INFO: UnstructuredLoadAndSplit initialized successfully
INFO: Getting all documents
INFO: Starting processing
INFO: Using title-based chunking
INFO: Completed chunking process
INFO: Successfully obtained all documents


In [11]:
len(docs)

40
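
To give a feel for what a size limit such as `max_chunk_size` does, the sketch below packs paragraphs into chunks that never exceed a limit. It is a simplification and not the title-based chunking the log mentions: it assumes the limit is counted in characters and that paragraphs are separated by blank lines, and `pack_paragraphs` is a hypothetical helper.

```python
def pack_paragraphs(text, max_chunk_size=400):
    """Greedily pack blank-line-separated paragraphs into chunks of bounded length."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A paragraph longer than the limit is cut into fixed-size pieces.
        while len(para) > max_chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chunk_size])
            para = para[max_chunk_size:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chunk_size:
            current = candidate  # still fits, keep growing the current chunk
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
print([len(chunk) for chunk in pack_paragraphs(sample, max_chunk_size=25)])
```

A smaller limit yields more, shorter chunks, which usually makes retrieval more precise but gives the language model less context per chunk.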

In [12]:
db.add(docs=docs)

INFO: Storing documents in the vector store
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using engine: mistral-embed
INFO: Document added successfully to the vector store.
INFO: Documents stored successfully


<indox.vector_stores.chroma.Chroma at 0x27bb81d07a0>

### Retrieve relevant information and generate an answer
These lines query the vector store for the most relevant chunks (`top_k=5`) and pass them to the language model to generate an answer.

In [13]:
from indoxArcg.pipelines.rag import RAG

query = "How does Cinderella reach her happy ending?"
# Retrieve the 5 most relevant chunks and let the LLM answer from them
retriever = RAG(llm=mistral_qa, vector_store=db, top_k=5)
answer = retriever.infer(query)
answer
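
The retrieve-then-generate step can be pictured as: fetch the top chunks, paste them into a prompt, and send that prompt to the model. The sketch below shows this shape with plain callables so it runs offline. It is an assumption about the general pattern, not the internals of the `RAG` class; `build_prompt`, `toy_rag`, and the prompt wording are all hypothetical.

```python
def build_prompt(question, contexts):
    """Number the retrieved chunks and place them ahead of the question."""
    joined = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )


def toy_rag(question, retrieve, llm, top_k=5):
    """Retrieve up to top_k chunks, build a prompt, and return the LLM's reply."""
    contexts = retrieve(question)[:top_k]
    return llm(build_prompt(question, contexts))


# Usage with stand-ins: the retriever returns fixed chunks and the "LLM"
# simply reports how long the prompt it received was.
reply = toy_rag(
    "How does Cinderella reach her happy ending?",
    retrieve=lambda q: ["The slipper fits Cinderella.", "The prince marries her."],
    llm=lambda prompt: f"(stub model saw {len(prompt)} characters)",
)
print(reply)
```

Keeping retrieval and generation behind simple callables makes it easy to swap the vector store or the model without touching the prompt logic.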