## Indox Retrieval Augmentation
This notebook shows how to work with Indox Retrieval Augmentation. Because we access OpenAI models through the Indox API, set `INDOX_API_KEY` as an environment variable before running the cells; the code below reads it with `os.getenv("INDOX_API_KEY")`.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/cookbook/indoxArcg/pdfLoader.ipynb)

In [1]:
!pip install indoxArcg
!pip install chromadb
!pip install semantic_text_splitter
!pip install sentence-transformers
!pip install pdfminer.six



## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indoxArcg`:

### Windows

1. **Create the virtual environment:**

```bash
python -m venv indoxArcg
```

2. **Activate the virtual environment:**

```bash
indoxArcg\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**

```bash
python3 -m venv indoxArcg
```

2. **Activate the virtual environment:**

```bash
source indoxArcg/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```


In [None]:
import os
from dotenv import load_dotenv

load_dotenv()
INDOX_API_KEY = os.getenv("INDOX_API_KEY")
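
`os.getenv` returns `None` when the variable is missing, so a misconfigured key only surfaces later as an authentication failure. A minimal sketch of failing early instead is shown here; `require_env` is a hypothetical helper written for this example and is not part of indoxArcg or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper for illustration: raises RuntimeError when the
    variable is unset or empty instead of silently returning None.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to your .env file or export it in your shell."
        )
    return value
```

With this helper, the assignment above could be written as `INDOX_API_KEY = require_env("INDOX_API_KEY")` after calling `load_dotenv()`.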

### Creating an instance of IndoxRetrievalAugmentation

To effectively utilize the Indox Retrieval Augmentation capabilities, you must first create an instance of the IndoxRetrievalAugmentation class. This instance will allow you to access the methods and properties defined within the class, enabling the augmentation and retrieval functionalities.

### Generating a response using Indox
The `IndoxApi` class handles question answering with the Indox model, and `IndoxApiEmbedding` specifies the embedding model. `PdfMiner` loads the PDF file, and `semantic_text_splitter` splits its text into chunks.

In [8]:
# Import necessary classes from Indox library
from indoxArcg.llms import IndoxApi
from indoxArcg.embeddings import IndoxApiEmbedding
from indoxArcg.data_loader import PdfMiner
from indoxArcg.splitter import semantic_text_splitter
# Create instances for API access and text embedding
openai_qa_indox = IndoxApi(api_key=INDOX_API_KEY)
embed_openai_indox = IndoxApiEmbedding(api_key=INDOX_API_KEY, model="text-embedding-3-small")




ImportError: cannot import name 'IndoxApi' from 'indoxArcg.llms' (/Users/parsanemati/miniconda3/envs/indox/lib/python3.11/site-packages/indoxArcg/llms/__init__.py)

In [2]:
!wget --no-check-certificate 'https://docs.google.com/document/d/1_DG_rl-EQCcCUFwuY8E2s1okWUVGivZGx3j42u-U250/export?format=pdf' -O sample.pdf


### Load and preprocess data
This section loads and preprocesses text from a PDF file using `PdfMiner`.

In [6]:
# On Colab the working directory is /content; use "sample.pdf" when running locally
pdf_path = "/content/sample.pdf"

In [7]:
document = PdfMiner(pdf_path)
docs = document.load()

NameError: name 'PdfMiner' is not defined

In [9]:
splitter = semantic_text_splitter(text=str(docs), max_tokens=512)

NameError: name 'semantic_text_splitter' is not defined

In [None]:
from indoxArcg.splitter import semantic_text_splitter
content_chunks = semantic_text_splitter(str(docs), 500)
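
The cell above splits the text with `semantic_text_splitter`. As a rough, dependency-free illustration of what size-bounded chunking does, the sketch below cuts a string into pieces of at most `max_words` words. This is a plain fixed-size splitter, not the semantic splitting the library performs, and it counts whitespace-separated words rather than model tokens; `chunk_words` is a hypothetical name used only here.

```python
def chunk_words(text: str, max_words: int = 500) -> list[str]:
    """Split `text` into consecutive chunks of at most `max_words` words.

    Illustration only: a fixed-size word splitter, not a semantic one.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

A semantic splitter differs mainly in where it cuts: it prefers sentence or paragraph boundaries so that each chunk stays coherent, whereas this sketch cuts wherever the word budget runs out.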

In [30]:
doc = docs
print(doc)

[Document(page_content=The wife of a rich man fell sick, and as she felt that her end
was drawing near, she called her only daughter to her bedside and
said, dear child, be good and pious, and then the
good God will always protect you, and I will look down on you
from heaven and be near you., metadata={}), Document(page_content=, metadata={})]
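
The printed list ends with a document whose `page_content` is empty. Embedding such entries adds nothing to retrieval, so it may be worth filtering them out before they reach the vector store. The sketch below assumes only that each document exposes a `page_content` string, as the output above suggests; the `Doc` dataclass is a stand-in for this example and is not the library's own document class.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in document type for illustration (not the indoxArcg class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_empty(docs):
    """Keep only documents whose page_content has non-whitespace text."""
    return [d for d in docs if d.page_content.strip()]


sample = [Doc("The wife of a rich man fell sick."), Doc(""), Doc("   ")]
non_empty = drop_empty(sample)
```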


### Retrieve relevant information and generate an answer
This section creates a Chroma vector store, adds the loaded documents to it, and then queries it for the most relevant chunks (`top_k=5`) so that the language model can generate an answer from them.

In [None]:
from indoxArcg.vector_stores import Chroma

# Define the collection name within the vector store
collection_name = "sample"
db = Chroma(collection_name=collection_name, embedding_function=embed_openai_indox)

In [32]:
db.add(docs=docs)

INFO: Storing documents in the vector store
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: Document added successfully to the vector store.
INFO: Documents stored successfully


In [33]:
from indoxArcg.pipelines.rag import RAG
retriever = RAG(llm=openai_qa_indox, vector_store=db, top_k=5)

The `infer(query)` method sends the query to the retriever, which searches the vector store for the most relevant text chunks and passes them to the language model to generate a response.
The context property exposes the text chunks that the retriever used to answer the query, which makes it possible to inspect the evidence behind an answer.
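
To make the retrieve-then-generate flow described above more concrete, here is a small self-contained sketch of top-k retrieval by cosine similarity followed by prompt assembly. It is not how `RAG` or `Chroma` are implemented internally; the function names, the toy two-dimensional vectors, and the prompt wording are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, store, k=5):
    """Return the k (score, text) pairs most similar to query_vec.

    `store` is a list of (text, vector) pairs standing in for a vector store.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query, contexts):
    """Join the retrieved chunks and the question into one prompt string."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


# Toy store: in practice the vectors come from the embedding model.
store = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 1.0])]
hits = top_k([1.0, 0.0], store, k=2)
prompt = build_prompt("How did Cinderella reach her happy ending?",
                      [text for _, text in hits])
```

In a real pipeline the prompt would then be sent to the language model; the `top_k` argument passed to `RAG` plays the role of `k` here.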

In [34]:
query = "How did Cinderella reach her happy ending?"

In [35]:
answer = retriever.infer(query)
print(answer)

INFO: Retrieving context and scores from the vector database
INFO: Embedding documents
INFO: Starting to fetch embeddings texts using engine: text-embedding-3-small
INFO: Generating answer without document relevancy filter
INFO: Query answered successfully
Cinderella reached her happy ending by following her mother's advice to be good and pious. Despite facing hardships and mistreatment, Cinderella remained kind and virtuous. Eventually, her goodness was rewarded when she was noticed by the prince at the royal ball and they lived happily ever after.
