## Indox Retrieval Augmentation
This notebook shows how to work with Indox Retrieval Augmentation, using a CSV file as the data source. It uses the OpenAI models exposed through the Indox API, so set `INDOX_API_KEY` as an environment variable before running the cells.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/Demo/CsvLoader.ipynb)

In [40]:
!pip install indox
!pip install chromadb
!pip install semantic_text_splitter
!pip install sentence-transformers


Collecting git+https://github.com/osllmai/inDox.git@feature/data_loaders
  Cloning https://github.com/osllmai/inDox.git (to revision feature/data_loaders) to /tmp/pip-req-build-lsu48apm
  Running command git clone --filter=blob:none --quiet https://github.com/osllmai/inDox.git /tmp/pip-req-build-lsu48apm
  Running command git checkout -b feature/data_loaders --track origin/feature/data_loaders
  Switched to a new branch 'feature/data_loaders'
  Branch 'feature/data_loaders' set up to track remote branch 'feature/data_loaders' from 'origin'.
  Resolved https://github.com/osllmai/inDox.git to commit c0d32803c203cb6d79b81bca7b5071f631fe70b9
  Preparing metadata (setup.py) ... done


## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indox`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indox
```
2. **Activate the virtual environment:**
```bash
indox\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
```bash
python3 -m venv indox
```

2. **Activate the virtual environment:**
```bash
source indox/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```


In [41]:
import os
from dotenv import load_dotenv

load_dotenv()
INDOX_API_KEY = os.getenv("INDOX_API_KEY")
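
`os.getenv` returns `None` when the variable is missing, so a missing key may only surface later as an authentication error. The sketch below is one way to fail early instead; `require_env` is a hypothetical helper written for this example and is not part of Indox.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of the Indox library.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to your .env file or export it in your shell."
        )
    return value
```

With this in place, `INDOX_API_KEY = require_env("INDOX_API_KEY")` could replace the plain `os.getenv` call after `load_dotenv()`.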


### Creating an instance of IndoxRetrievalAugmentation

To effectively utilize the Indox Retrieval Augmentation capabilities, you must first create an instance of the IndoxRetrievalAugmentation class. This instance will allow you to access the methods and properties defined within the class, enabling the augmentation and retrieval functionalities.

In [42]:
from indox import IndoxRetrievalAugmentation
indox = IndoxRetrievalAugmentation()

INFO: IndoxRetrievalAugmentation initialized

            ██  ███    ██  ██████   ██████  ██       ██
            ██  ████   ██  ██   ██ ██    ██   ██  ██
            ██  ██ ██  ██  ██   ██ ██    ██     ██
            ██  ██  ██ ██  ██   ██ ██    ██   ██   ██
            ██  ██  █████  ██████   ██████  ██       ██
            


### Generating response using Indox
The `IndoxApi` class handles the question-answering task using the Indox model, and the `IndoxApiEmbedding` class specifies the embedding model. The `Csv` loader and `semantic_text_splitter` are imported here as well; they are used below to load the CSV file and split its text into chunks.

In [43]:
# Import necessary classes from Indox library
from indox.llms import IndoxApi
from indox.embeddings import IndoxApiEmbedding
from indox.data_loader import Csv
from indox.splitter import semantic_text_splitter
# Create instances for API access and text embedding
openai_qa_indox = IndoxApi(api_key=INDOX_API_KEY)
embed_openai_indox = IndoxApiEmbedding(api_key=INDOX_API_KEY, model="text-embedding-3-small")



INFO: Initialized IndoxOpenAIEmbedding with model: text-embedding-3-small


In [44]:
!wget "https://docs.google.com/spreadsheets/d/1EvngUGX8YHp5N1OhbP3NlOUk_xVJgpgov5u8K2uNmyM/export?format=csv&gid=0" -O sample.csv


--2024-08-27 11:58:55--  https://docs.google.com/spreadsheets/d/1EvngUGX8YHp5N1OhbP3NlOUk_xVJgpgov5u8K2uNmyM/export?format=csv&gid=0
Resolving docs.google.com (docs.google.com)... 172.217.164.14, 2607:f8b0:4025:803::200e
Connecting to docs.google.com (docs.google.com)|172.217.164.14|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://doc-0k-9c-sheets.googleusercontent.com/export/54bogvaave6cua4cdnls17ksc4/nv8it70jmtgjbbbip2l1mitt6k/1724759935000/102335613455583036319/*/1EvngUGX8YHp5N1OhbP3NlOUk_xVJgpgov5u8K2uNmyM?format=csv&gid=0 [following]
--2024-08-27 11:58:55--  https://doc-0k-9c-sheets.googleusercontent.com/export/54bogvaave6cua4cdnls17ksc4/nv8it70jmtgjbbbip2l1mitt6k/1724759935000/102335613455583036319/*/1EvngUGX8YHp5N1OhbP3NlOUk_xVJgpgov5u8K2uNmyM?format=csv&gid=0
Resolving doc-0k-9c-sheets.googleusercontent.com (doc-0k-9c-sheets.googleusercontent.com)... 142.250.65.97, 2607:f8b0:4025:804::2001
Connecting to doc-0k-9c-sheets.googleu

### Load and preprocess data
This part of the code loads the downloaded CSV file with the `Csv` loader and splits the loaded text into chunks with `semantic_text_splitter`.

In [48]:
csv_path="/content/sample.csv"

In [49]:
document = Csv(csv_path)
docs = document.load()
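
The printed documents further down suggest that each CSV row, including the header row, ends up as one comma-joined string. The sketch below reproduces that shape with the standard library only. It illustrates the idea and is not the `Csv` loader's actual implementation; `rows_as_text` and the sample text are made up for this example.

```python
import csv
import io


def rows_as_text(csv_text: str) -> list[str]:
    """Parse CSV text and return one comma-joined string per non-empty row."""
    reader = csv.reader(io.StringIO(csv_text))
    return [",".join(row) for row in reader if row]


# Made-up two-line sample in the same Question,Answer layout as sample.csv.
sample = (
    "Question,Answer\n"
    'What is astronomy?,"Astronomy is the study of celestial objects, '
    'space, and the universe."\n'
)

for line in rows_as_text(sample):
    print(line)
```

Note that quoted fields containing commas are parsed as a single field and then re-joined, which is why the answer text keeps its internal commas.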


In [50]:
splitter = semantic_text_splitter(text=str(docs), max_tokens=512)

In [51]:
from indox.splitter import semantic_text_splitter
content_chunks = semantic_text_splitter(str(docs),500)
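
To make the effect of a size limit concrete, here is a deliberately simple fixed-size chunker. It counts whitespace-separated words as a crude stand-in for tokens and ignores sentence boundaries, so it is not how `semantic_text_splitter` works; `chunk_by_words` is a hypothetical name used only in this sketch.

```python
def chunk_by_words(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_by_words("Dark matter does not emit absorb or reflect light", 4)
for chunk in chunks:
    print(chunk)
```

A semantic splitter would instead try to break at natural boundaries such as sentences or paragraphs while staying under the limit.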

In [52]:
doc = docs
print(doc)

[Document(page_content=Question,Answer, metadata={'source': '/content/sample.csv', 'pages': 1, 'num_rows': 11}), Document(page_content=What is astronomy?,Astronomy is the scientific study of celestial objects, space, and the universe as a whole. It involves the observation and analysis of planets, stars, galaxies, and other phenomena., metadata={'source': '/content/sample.csv', 'pages': 1, 'num_rows': 11}), Document(page_content=What is the largest planet in our solar system?,The largest planet in our solar system is Jupiter. It has a diameter of about 139,820 kilometers (86,881 miles)., metadata={'source': '/content/sample.csv', 'pages': 1, 'num_rows': 11}), Document(page_content=How far is the Earth from the Sun?,The average distance from Earth to the Sun is approximately 93 million miles or 150 million kilometers, a distance known as one astronomical unit (AU)., metadata={'source': '/content/sample.csv', 'pages': 1, 'num_rows': 11}), Document(page_content=What is a black hole?,A bla

### Retrieve relevant information and generate an answer
The following cells store the loaded documents in a Chroma vector store, then query it to retrieve the most relevant chunks (top_k=5) and generate an answer with the language model.

In [53]:
from indox.vector_stores import Chroma

# Define the collection name within the vector store
collection_name = "sample"
db = Chroma(collection_name=collection_name, embedding_function=embed_openai_indox)

In [54]:
db.add(docs=docs)

INFO: Storing documents in the vector store
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: Document added successfully to the vector store.
INFO: Documents stored successfully


In [55]:
qa = IndoxRetrievalAugmentation.QuestionAnswer(llm=openai_qa_indox, vector_database=db)

The `invoke(query)` method sends the query to the retriever, which searches the vector store for relevant text chunks and uses the language model to generate a response based on the retrieved information.
The `context` property returns the text chunks that the retriever passed to the language model for the most recent query, which shows what the answer was based on.
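
The retrieve-then-generate flow described here can be sketched without any external service. The example below uses bag-of-words counts in place of real embeddings and cosine similarity for ranking, then assembles a prompt from the top results. All names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the prompt wording are assumptions made for illustration; Indox's internals may differ.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


chunks = [
    "What is dark matter?,Dark matter is a form of matter that does not emit, absorb, or reflect light.",
    "What is a black hole?,A black hole is a region in space where gravity is so strong that light cannot escape.",
    "What is the Milky Way?,The Milky Way is the galaxy that contains our solar system.",
]

context = retrieve("What is dark matter?", chunks, top_k=2)
print(build_prompt("What is dark matter?", context))
```

In this sketch, `context` plays the role of `qa.context`: it is the list of chunks handed to the model alongside the question.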

In [59]:
query = "What is dark matter?"

In [60]:
answer = qa.invoke(query)
print(answer)

INFO: Retrieving context and scores from the vector database
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: Generating answer without document relevancy filter
INFO: Query answered successfully
Dark matter is a form of matter that does not emit, absorb, or reflect light, making it invisible. It is believed to make up about 27% of the universe's mass and affects the motion of galaxies and galaxy clusters.


In [61]:
qa.context

["What is dark matter?,Dark matter is a form of matter that does not emit, absorb, or reflect light, making it invisible. It is believed to make up about 27% of the universe's mass and affects the motion of galaxies and galaxy clusters.",
 'What is a black hole?,A black hole is a region in space where the gravitational pull is so strong that not even light can escape from it. Black holes are formed from the remnants of massive stars that have collapsed under their own gravity.',
 'What is the Milky Way?,The Milky Way is the galaxy that contains our solar system. It is a barred spiral galaxy, characterized by a central bulge surrounded by spiral arms.',
 'What is astronomy?,Astronomy is the scientific study of celestial objects, space, and the universe as a whole. It involves the observation and analysis of planets, stars, galaxies, and other phenomena.',
 'What are pulsars?,Pulsars are highly magnetized, rotating neutron stars that emit beams of electromagnetic radiation out of their m