## Indox Retrieval Augmentation
This notebook shows how to work with Indox Retrieval Augmentation. It uses OpenAI models through the Indox API, so set `INDOX_API_KEY` as an environment variable before running the cells.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/Demo/pdfLoader.ipynb)

In [15]:
!pip install indox
!pip install chromadb
!pip install semantic_text_splitter
!pip install sentence-transformers
!pip install pdfminer.six




## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indox`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indox
```
2. **Activate the virtual environment:**
```bash
indox\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
```bash
python3 -m venv indox
```
2. **Activate the virtual environment:**
```bash
source indox/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```
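
Before installing, it can help to confirm that the interpreter is really running inside the virtual environment. The check below is a minimal sketch using only the standard library; the helper name `in_virtualenv` is hypothetical and not part of Indox.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtualenv())
```

If this prints `False` in a local IDE, the `indox` environment is probably not activated for that session.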


In [16]:
import os
from dotenv import load_dotenv

load_dotenv()
INDOX_API_KEY = os.getenv("INDOX_API_KEY")
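
`os.getenv` returns `None` when the variable is missing, and that only surfaces later as an authentication error. A small guard can make the failure explicit. This is a sketch; `require_env` is a hypothetical helper, not an Indox or python-dotenv function.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your .env file or export it in the shell."
        )
    return value
```

After `load_dotenv()`, one possible use is `INDOX_API_KEY = require_env("INDOX_API_KEY")` in place of the plain `os.getenv` call.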


### Creating an instance of IndoxRetrievalAugmentation

To use Indox Retrieval Augmentation, first create an instance of the `IndoxRetrievalAugmentation` class. The instance gives access to the methods and properties defined in the class, which provide the retrieval and augmentation functionality.

In [17]:
from indox import IndoxRetrievalAugmentation
indox = IndoxRetrievalAugmentation()

INFO: IndoxRetrievalAugmentation initialized

            ██  ███    ██  ██████   ██████  ██       ██
            ██  ████   ██  ██   ██ ██    ██   ██  ██
            ██  ██ ██  ██  ██   ██ ██    ██     ██
            ██  ██  ██ ██  ██   ██ ██    ██   ██   ██
            ██  ██  █████  ██████   ██████  ██       ██
            


### Generating response using Indox
The `IndoxApi` class handles the question-answering task using the Indox model, and `IndoxApiEmbedding` selects the embedding model (here `text-embedding-3-small`). `PdfMiner` and `semantic_text_splitter` are imported to load the PDF file and split its text into chunks.

In [18]:
# Import necessary classes from Indox library
from indox.llms import IndoxApi
from indox.embeddings import IndoxApiEmbedding
from indox.data_loader import PdfMiner
from indox.splitter import semantic_text_splitter
# Create instances for API access and text embedding
openai_qa_indox = IndoxApi(api_key=INDOX_API_KEY)
embed_openai_indox = IndoxApiEmbedding(api_key=INDOX_API_KEY, model="text-embedding-3-small")



INFO: Initialized IndoxOpenAIEmbedding with model: text-embedding-3-small


In [19]:
!wget --no-check-certificate 'https://docs.google.com/document/d/1_DG_rl-EQCcCUFwuY8E2s1okWUVGivZGx3j42u-U250/export?format=pdf' -O sample.pdf

--2024-08-27 10:57:05--  https://docs.google.com/document/d/1_DG_rl-EQCcCUFwuY8E2s1okWUVGivZGx3j42u-U250/export?format=pdf
Resolving docs.google.com (docs.google.com)... 172.217.164.14, 2607:f8b0:4025:803::200e
Connecting to docs.google.com (docs.google.com)|172.217.164.14|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://doc-04-4k-docstext.googleusercontent.com/export/e6hpso97lrhpva19l3uocm525o/f84us30u8ri5cph9dam2jjorhk/1724756225000/102335613455583036319/*/1_DG_rl-EQCcCUFwuY8E2s1okWUVGivZGx3j42u-U250?format=pdf [following]
--2024-08-27 10:57:06--  https://doc-04-4k-docstext.googleusercontent.com/export/e6hpso97lrhpva19l3uocm525o/f84us30u8ri5cph9dam2jjorhk/1724756225000/102335613455583036319/*/1_DG_rl-EQCcCUFwuY8E2s1okWUVGivZGx3j42u-U250?format=pdf
Resolving doc-04-4k-docstext.googleusercontent.com (doc-04-4k-docstext.googleusercontent.com)... 172.217.15.225, 2607:f8b0:4025:802::2001
Connecting to doc-04-4k-docstext.googleusercontent.

### Load and preprocess data
This part of the code shows how to load and preprocess text data from a PDF file using `PdfMiner`.

In [20]:
pdf_path="/content/sample.pdf"

In [22]:
document = PdfMiner(pdf_path)
docs = document.load()


In [23]:
splitter = semantic_text_splitter(text=str(docs),max_tokens=512)

In [29]:
from indox.splitter import semantic_text_splitter
content_chunks = semantic_text_splitter(str(docs),500)

In [30]:
doc = docs
print(doc)

[Document(page_content=The wife of a rich man fell sick, and as she felt that her end
was drawing near, she called her only daughter to her bedside and
said, dear child, be good and pious, and then the
good God will always protect you, and I will look down on you
from heaven and be near you., metadata={}), Document(page_content=, metadata={})]
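
To illustrate what splitting a document into size-bounded chunks means, here is a deliberately naive sketch. It is not the algorithm used by `semantic_text_splitter`: it counts whitespace-separated words instead of model tokens and ignores sentence boundaries. The function name `split_by_word_budget` is hypothetical.

```python
from typing import List


def split_by_word_budget(text: str, max_words: int) -> List[str]:
    """Greedily cut text into chunks of at most max_words words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words.
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


sample = (
    "The wife of a rich man fell sick, and as she felt that her end "
    "was drawing near, she called her only daughter to her bedside"
)
for chunk in split_by_word_budget(sample, max_words=8):
    print(chunk)
```

A semantic splitter tries to place the cuts at natural boundaries so that each chunk stays meaningful on its own, which generally helps retrieval quality.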


### Retrieve relevant information and generate an answer
The following cells store the documents in a Chroma vector store, query it for the most relevant chunks (top_k=5), and generate an answer using the language model.

In [31]:
from indox.vector_stores import Chroma

# Define the collection name within the vector store
collection_name = "sample"
db = Chroma(collection_name=collection_name, embedding_function=embed_openai_indox)

In [32]:
db.add(docs=docs)

INFO: Storing documents in the vector store
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: Document added successfully to the vector store.
INFO: Documents stored successfully


In [33]:
qa = IndoxRetrievalAugmentation.QuestionAnswer(llm=openai_qa_indox, vector_database=db)

The `invoke(query)` method sends the query to the retriever, which searches the vector store for relevant text chunks and passes them to the language model to generate a response.
The `context` property returns the text chunks the retriever supplied for the most recent query, which shows what the answer was based on.
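
The retrieve-then-generate flow described above can be sketched conceptually in plain Python. This is not how Indox or Chroma implement it: the toy two-dimensional vectors stand in for real embeddings, cosine similarity is an assumed scoring function, and `top_k` and `build_prompt` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          chunks: Sequence[str],
          k: int = 5) -> List[Tuple[str, float]]:
    """Return the k chunks whose vectors score highest against the query."""
    scored = [(chunk, cosine_similarity(query_vec, vec))
              for chunk, vec in zip(chunks, chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(query: str, contexts: Sequence[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    joined = "\n\n".join(contexts)
    return f"Context:\n{joined}\n\nQuestion: {query}"


# Toy data: in practice the vectors would come from the embedding model.
chunks = ["chunk about the mother", "chunk about the ball", "chunk about the slipper"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vector = [1.0, 0.1]

hits = top_k(query_vector, vectors, chunks, k=2)
print(build_prompt("How did Cinderella reach her happy ending?",
                   [chunk for chunk, _ in hits]))
```

In a real pipeline the prompt would be sent to the language model; the retrieved chunks correspond to the context behind the answer.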

In [34]:
query = "How did Cinderella reach her happy ending?"

In [35]:
answer = qa.invoke(query)
print(answer)

INFO: Retrieving context and scores from the vector database
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: Generating answer without document relevancy filter
INFO: Query answered successfully
Cinderella reached her happy ending by following her mother's advice to be good and pious. Despite facing hardships and mistreatment, Cinderella remained kind and virtuous. Eventually, her goodness was rewarded when she was noticed by the prince at the royal ball and they lived happily ever after.


In [36]:
qa.context

['The wife of a rich man fell sick, and as she felt that her end\nwas drawing near, she called her only daughter to her bedside and\nsaid, dear child, be good and pious, and then the\ngood God will always protect you, and I will look down on you\nfrom heaven and be near you.',
 '']
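
The context above includes an empty string, which comes from the empty document that the loader returned. One option is to filter out blank texts before they are stored or shown. The sketch below works on plain strings only; `drop_empty` is a hypothetical helper, and applying it to Indox documents would first require extracting their text.

```python
from typing import Iterable, List


def drop_empty(texts: Iterable[str]) -> List[str]:
    """Keep only texts that contain at least one non-whitespace character."""
    return [text for text in texts if text and text.strip()]


context = ["The wife of a rich man fell sick, and as she felt that her end", ""]
print(drop_empty(context))
```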