## Indox Retrieval Augmentation
This notebook shows how to work with Indox Retrieval Augmentation. It uses OpenAI models through the Indox API, so the API key must be set as an environment variable. The code below reads it from `NERD_TOKEN_API`.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/cookbook/indoxArcg/indox_api_openai.ipynb)

In [2]:
!pip install indoxArcg chromadb duckduckgo_search



## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indoxArcg`:

### Windows

1. **Create the virtual environment:**
```bash
  python -m venv indoxArcg
```

2. **Activate the virtual environment:**
```bash
  indoxArcg\Scripts\activate
```


### macOS/Linux

1. **Create the virtual environment:**
```bash
  python3 -m venv indoxArcg
```

2. **Activate the virtual environment:**
```bash
  source indoxArcg/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
  pip install -r requirements.txt
```
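
To confirm that the notebook or script is actually running inside the virtual environment, a quick check from Python can help. This is a minimal sketch that relies only on the standard library; the helper name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the environment is probably not activated, or the IDE is using a different interpreter.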


In [3]:
!wget https://raw.githubusercontent.com/osllmai/inDox/master/Demo/sample.txt

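
If `wget` is not available or fails on your system, the sample file can also be fetched from Python. The sketch below uses only the standard library; `download_file` is a hypothetical helper written for this example, not part of indoxArcg.

```python
import urllib.request
from pathlib import Path

# URL of the sample text used in this notebook.
SAMPLE_URL = "https://raw.githubusercontent.com/osllmai/inDox/master/Demo/sample.txt"


def download_file(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Download `url` to the local path `dest` and return that path."""
    dest_path = Path(dest)
    # urlopen raises URLError/HTTPError on failure, so a bad download
    # does not silently leave an empty file behind.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        dest_path.write_bytes(response.read())
    return dest_path
```

Calling `download_file(SAMPLE_URL, "sample.txt")` in a cell should save the file next to the notebook, which is where the later cells expect it.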


In [2]:
import os
from dotenv import load_dotenv

# Load variables from a local .env file into the environment
load_dotenv()
NERD_TOKEN_API = os.getenv("NERD_TOKEN_API")
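
`os.getenv` returns `None` when the variable is missing, and that usually surfaces later as a confusing authentication error. A small guard can fail early instead. This is a sketch; `require_env` is a hypothetical helper, not an indoxArcg function.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Add it to your .env file or export it in your shell."
        )
    return value


# Example use in this notebook, after load_dotenv():
# NERD_TOKEN_API = require_env("NERD_TOKEN_API")
```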

### Creating an instance of IndoxRetrievalAugmentation

To use the Indox Retrieval Augmentation capabilities, first create an instance of the IndoxRetrievalAugmentation class. The instance gives access to the methods and properties defined in the class, which provide the augmentation and retrieval functionality.

### Generating a response using Indox
The `NerdToken` class handles the question-answering task through the Indox API, and `NerdTokenEmbedding` specifies the embedding model. The `ClusteredSplit` class loads a PDF or text file and splits it into chunks.

In [4]:
# Import necessary classes from Indox library
from indoxArcg.llms import NerdToken
from indoxArcg.embeddings import NerdTokenEmbedding
from indoxArcg.data_loader_splitter import ClusteredSplit

# Create instances for API access and text embedding
openai_qa_indox = NerdToken(api_key=NERD_TOKEN_API)
embed_openai_indox = NerdTokenEmbedding(api_key=NERD_TOKEN_API, model="text-embedding-3-small")

# Specify the path to your text file
file_path = "sample.txt"

# Create a ClusteredSplit instance for handling file loading and chunking
loader_splitter = ClusteredSplit(file_path=file_path, embeddings=embed_openai_indox, summary_model=openai_qa_indox)

# Load and split the document into chunks using ClusteredSplit
docs = loader_splitter.load_and_chunk()

INFO: Initialized IndoxOpenAIEmbedding with model: text-embedding-3-small
INFO: ClusteredSplit initialized successfully
INFO: Starting processing for documents
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: --Generated 7 clusters--
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: --Generated 1 clusters--
INFO: Completed chunking & clustering process
INFO: Successfully obtained all documents


In [5]:
docs[2]

'  They took her pretty clothes away from her, put an old grey bedgown on her, and gave her wooden shoes   Just look at the proud princess, how decked out she is, they cried, and laughed, and led her into the kitchen There she had to do hard work from morning till night, get up before daybreak, carry water, light fires, cook and wash   Besides this, the sisters did her every imaginable injury - they mocked her'
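
Before storing the chunks, it can be useful to check how many there are and how long they are. The sketch below assumes the chunks are plain strings, as the `docs[2]` output suggests; `summarize_chunks` is a hypothetical helper, and the sample list stands in for the real `docs`.

```python
from statistics import mean


def summarize_chunks(chunks: list[str]) -> dict[str, float]:
    """Return the chunk count and the min, max and mean chunk length in characters."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min_chars": 0, "max_chars": 0, "mean_chars": 0.0}
    return {
        "count": len(lengths),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": mean(lengths),
    }


# Stand-in data; in the notebook you would pass `docs` instead.
sample_chunks = ["first chunk", "a somewhat longer second chunk", "third"]
print(summarize_chunks(sample_chunks))
```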

Here, `Chroma` handles the storage and retrieval of vector embeddings. Given a collection name, it sets up a vector store where text embeddings can be stored and queried.

In [6]:
from indoxArcg.vector_stores import Chroma

# Define the collection name within the vector store
collection_name = "sample"

# Create a Chroma vector store instance
db = Chroma(collection_name=collection_name, embedding_function=embed_openai_indox)

2024-12-08 18:51:32,661 - chromadb.telemetry.product.posthog - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
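
To make the idea of storing and querying embeddings concrete, here is a toy in-memory vector store. It is only an illustration of the concept and not how Chroma or indoxArcg is implemented; `ToyVectorStore` and `toy_embed` are hypothetical names, and the letter-frequency "embedding" is a stand-in for a real embedding model.

```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedding: a 26-dimensional letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


class ToyVectorStore:
    """Keeps texts with their vectors and ranks them by similarity to a query."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self.embed = embed
        self.texts: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, docs: list[str]) -> None:
        # Embed each document once at insertion time.
        for doc in docs:
            self.texts.append(doc)
            self.vectors.append(self.embed(doc))

    def query(self, text: str, top_k: int = 5) -> list[tuple[float, str]]:
        # Embed the query, score every stored vector, return the best top_k.
        query_vec = self.embed(text)
        scored = [
            (cosine_similarity(query_vec, vec), doc)
            for vec, doc in zip(self.vectors, self.texts)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

A real vector store adds persistence and approximate nearest-neighbour indexing, but the add-then-query contract is similar in spirit.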


### Load and preprocess data
The document has already been loaded and split into chunks. This step stores those chunks in the vector store that was set up previously.

In [7]:
db.add(docs=docs)

INFO: Storing documents in the vector store
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: Document added successfully to the vector store.
INFO: Documents stored successfully


### Retrieve relevant information and generate an answer
These lines define the query and build a `RAG` pipeline that retrieves the most relevant chunks from the vector store (`top_k=5`) and generates an answer with the language model.

In [8]:
from indoxArcg.pipelines.rag import RAG

query = "How cinderella reach her happy ending?"
retriever = RAG(llm=openai_qa_indox, vector_store=db, top_k=5)

The `infer(query)` method sends the query to the retriever, which searches the vector store for relevant text chunks and uses the language model to generate a response from the retrieved information.
The context property exposes the text chunks, and any additional information, that the retriever used to generate the answer, which shows how the query was answered.
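
The retrieve-then-generate flow can be sketched in a few lines. This is an illustration under stated assumptions, not indoxArcg's actual prompt or pipeline: `build_prompt`, `answer`, `fake_retrieve` and `fake_generate` are all hypothetical, and the stubs stand in for the vector store and the language model.

```python
from typing import Callable


def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> tuple[str, list[str]]:
    """Retrieve up to top_k chunks, prompt the model, return (answer, context)."""
    chunks = retrieve(query, top_k)
    prompt = build_prompt(query, chunks)
    return generate(prompt), chunks


def fake_retrieve(query: str, top_k: int) -> list[str]:
    # Stand-in for a vector store lookup; ignores the query entirely.
    corpus = ["chunk one", "chunk two", "chunk three"]
    return corpus[:top_k]


def fake_generate(prompt: str) -> str:
    # Stand-in for a language model call.
    return f"stub answer for a prompt of {len(prompt)} characters"


response, context = answer("example question", fake_retrieve, fake_generate, top_k=2)
print(response)
print(context)
```

Returning the context alongside the answer mirrors the idea of inspecting which chunks the answer was based on.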

In [9]:
retriever.infer(query)

INFO: Retrieving context and scores from the vector database
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: Generating answer without document relevancy filter
INFO: Query answered successfully


"Cinderella reaches her happy ending through a series of transformative events facilitated by her inherent goodness, magical assistance, and the eventual recognition of her true worth. Here’s a summary of the key steps leading to her happy ending:\n\n1. **Magical Assistance**: After enduring mistreatment from her stepmother and stepsisters, Cinderella seeks solace at her mother’s grave, where she prays to a hazel tree. A little bird appears to grant her wishes, providing her with beautiful dresses and shoes that allow her to attend the royal festival.\n\n2. **The Royal Festival**: Cinderella attends the king's festival, where she captivates the prince with her beauty and grace. Each night, she must leave before he discovers her true identity, but she leaves behind a slipper, which becomes a crucial symbol of her identity.\n\n3. **The Prince's Search**: After the festival, the prince searches for the owner of the golden slipper. Cinderella’s stepsisters attempt to fit into the slipper, 