# Indox Retrieval Augmentation for CSV files

This notebook shows how to use Indox Retrieval Augmentation with a CSV file. It uses the NerdToken API for both the language model and the embeddings, so set `NERD_TOKEN_API` as an environment variable before running it.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/cookbook/indoxArcg/CsvLoader.ipynb)

In [None]:
!pip install chromadb semantic_text_splitter sentence-transformers indoxarcg

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indoxarcg`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indoxarcg
```
2. **Activate the virtual environment:**
```bash
indoxarcg\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
```bash
python3 -m venv indoxarcg
```
2. **Activate the virtual environment:**
```bash
source indoxarcg/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```


In [2]:
import os
from dotenv import load_dotenv

load_dotenv()
NERD_TOKEN_API = os.getenv("NERD_TOKEN_API")
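
`os.getenv` returns `None` when the variable is missing, and the problem may only surface later as an authentication error. A minimal sketch of failing early instead is shown below; `require_env` is a hypothetical helper written for this notebook, not part of indoxArcg or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of indoxArcg.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value
```

After `load_dotenv()` has run, `NERD_TOKEN_API = require_env("NERD_TOKEN_API")` could replace the plain `os.getenv` call.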


### Creating an instance of IndoxRetrievalAugmentation

To effectively utilize the Indox Retrieval Augmentation capabilities, you must first create an instance of the IndoxRetrievalAugmentation class. This instance will allow you to access the methods and properties defined within the class, enabling the augmentation and retrieval functionalities.

### Generating a response using Indox
The `NerdToken` class handles the question-answering task, and the `NerdTokenEmbedding` class specifies the embedding model. The `CSV` loader reads the CSV file, and `semantic_text_splitter` splits the loaded text into chunks.

In [4]:
# Import necessary classes from Indox library
from indoxArcg.llms import NerdToken
from indoxArcg.embeddings import NerdTokenEmbedding
from indoxArcg.data_loaders import CSV
from indoxArcg.splitter import semantic_text_splitter
# Create instances for API access and text embedding
openai_qa_indox = NerdToken(api_key=NERD_TOKEN_API)
embed_openai_indox = NerdTokenEmbedding(api_key=NERD_TOKEN_API, model="text-embedding-3-small")



[32mINFO[0m: [1mInitialized IndoxOpenAIEmbedding with model: text-embedding-3-small[0m


In [None]:
!wget "https://docs.google.com/spreadsheets/d/1EvngUGX8YHp5N1OhbP3NlOUk_xVJgpgov5u8K2uNmyM/export?format=csv&gid=0" -O sample.csv


### Load and preprocess data
This part of the code demonstrates how to load and preprocess data from a CSV file using the `CSV` loader.

In [5]:
csv_path="sample.csv"

In [None]:
document = CSV(csv_path)
docs = document.load()
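
To make the shape of the loaded data concrete, here is a rough sketch of turning CSV rows into document-like objects, one per row, using only the standard library. It is loosely modelled on the output printed further down and is not the `CSV` loader's actual implementation; `Doc` and `rows_to_docs` are hypothetical names, and the sample text is made up for the example.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for the library's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(csv_text: str, source: str) -> list:
    """Create one Doc per CSV row, header row included."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [
        Doc(
            page_content=",".join(row),  # fields rejoined with commas
            metadata={"source": source, "num_rows": len(rows)},
        )
        for row in rows
    ]


# Made-up two-row sample; the quoted field contains commas.
sample = (
    "Question,Answer\n"
    'What is astronomy?,"The study of celestial objects, space, and the universe."\n'
)
docs_sketch = rows_to_docs(sample, "sample.csv")
for d in docs_sketch:
    print(d.page_content, d.metadata)
```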


In [None]:
splitter = semantic_text_splitter(text=str(docs),max_tokens=512)

In [None]:
from indoxArcg.splitter import semantic_text_splitter
content_chunks = semantic_text_splitter(str(docs),500)
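
The idea of splitting text under a size budget can be illustrated with a much simpler stand-in. The sketch below cuts on a fixed word count; it is not how `semantic_text_splitter` works (that splitter counts tokens and tries to respect semantic boundaries), and `split_by_word_budget` is a hypothetical name used only here.

```python
def split_by_word_budget(text: str, max_words: int) -> list:
    """Greedily split text into chunks of at most max_words whitespace-separated words.

    Simplified illustration only; a semantic splitter chooses boundaries more carefully.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks_sketch = split_by_word_budget("What is dark matter and why does it matter", 4)
print(chunks_sketch)
```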

In [None]:
doc = docs
print(doc)

[Document(page_content=Question,Answer, metadata={'source': '/content/sample.csv', 'pages': 1, 'num_rows': 11}), Document(page_content=What is astronomy?,Astronomy is the scientific study of celestial objects, space, and the universe as a whole. It involves the observation and analysis of planets, stars, galaxies, and other phenomena., metadata={'source': '/content/sample.csv', 'pages': 1, 'num_rows': 11}), Document(page_content=What is the largest planet in our solar system?,The largest planet in our solar system is Jupiter. It has a diameter of about 139,820 kilometers (86,881 miles)., metadata={'source': '/content/sample.csv', 'pages': 1, 'num_rows': 11}), Document(page_content=How far is the Earth from the Sun?,The average distance from Earth to the Sun is approximately 93 million miles or 150 million kilometers, a distance known as one astronomical unit (AU)., metadata={'source': '/content/sample.csv', 'pages': 1, 'num_rows': 11}), Document(page_content=What is a black hole?,A bla

### Retrieve relevant information and generate an answer
The main purpose of these lines is to perform a query on the vector store to retrieve the most relevant information (top_k=5) and generate an answer using the language model.
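
As a rough illustration of what "most relevant" means, the sketch below ranks stored texts by cosine similarity to a query vector and keeps the top `k`. The two-dimensional vectors are hand-made toy values, not real embeddings, and `cosine` and `top_k` are hypothetical helpers; Chroma's internal indexing and scoring may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, store, k=5):
    """Return the k (score, text) pairs with the highest similarity to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy store: (text, hand-made vector) pairs standing in for embedded rows.
toy_store = [
    ("Dark matter does not emit light.", [0.9, 0.1]),
    ("Jupiter is the largest planet.", [0.1, 0.9]),
    ("Black holes trap light.", [0.7, 0.3]),
]
results = top_k([1.0, 0.0], toy_store, k=2)
for score, text in results:
    print(round(score, 3), text)
```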

In [None]:
from indoxArcg.vector_stores import Chroma

# Define the collection name within the vector store
collection_name = "sample"
db = Chroma(collection_name=collection_name, embedding_function=embed_openai_indox)

In [None]:
db.add(docs=docs)

[32mINFO[0m: [1mStoring documents in the vector store[0m
[32mINFO[0m: [1mEmbedding documents[0m
[32mINFO[0m: [1mStarting to fetch embeddings texts using engine: text-embedding-3-small[0m
[32mINFO[0m: [1mDocument added successfully to the vector store.[0m
[32mINFO[0m: [1mDocuments stored successfully[0m


In [None]:
from indoxArcg.pipelines.rag import RAG
indox = RAG(llm=openai_qa_indox,vector_store=db,enable_web_fallback=False)

The `infer(query)` method sends the query to the retriever, which searches the vector store for relevant text chunks; the language model then generates a response based on the retrieved information.
The context property returns the text chunks the retriever used to answer the query, along with any additional information, so you can see what the answer was based on.
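
The step between retrieval and generation can be pictured as prompt assembly: the retrieved chunks are placed in front of the question before the text is sent to the language model. The sketch below shows one plausible way to do this; the template wording and the `build_prompt` helper are assumptions for illustration, not the prompt the `RAG` pipeline actually uses.

```python
def build_prompt(query: str, chunks: list) -> str:
    """Combine retrieved chunks and the user query into a single prompt string.

    Hypothetical template; the real pipeline's prompt may look different.
    """
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt_sketch = build_prompt(
    "What is dark matter?",
    ["Dark matter does not emit light.", "Black holes trap light."],
)
print(prompt_sketch)
```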

In [None]:
query = "What is dark matter?"

In [None]:
answer = indox.infer(query)
print(answer)

[32mINFO[0m: [1mRetrieving context and scores from the vector database[0m
[32mINFO[0m: [1mEmbedding documents[0m
[32mINFO[0m: [1mStarting to fetch embeddings texts using engine: text-embedding-3-small[0m
[32mINFO[0m: [1mGenerating answer without document relevancy filter[0m
[32mINFO[0m: [1mQuery answered successfully[0m
Dark matter is a form of matter that does not emit, absorb, or reflect light, making it invisible. It is believed to make up about 27% of the universe's mass and affects the motion of galaxies and galaxy clusters.
