# Cluster Split

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
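
Because `os.getenv` returns `None` when a variable is unset, a missing key would only surface later as an authentication error. One way to catch this early is a small check like the sketch below. The helper name `require_env` is hypothetical and is not part of Indox or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of Indox or python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

After `load_dotenv()`, this could be used as `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`.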

## Initial Setup

The imports below bring in the pieces needed for this example: the main Indox retrieval augmentation class, an OpenAI question-answering model, an OpenAI embedding model, and the `ClusteredSplit` data loader.

In [2]:
from Indox import IndoxRetrievalAugmentation
from Indox.QaModels import OpenAiQA
from Indox.Embeddings import OpenAiEmbedding
from Indox.DataLoaderSplitter import ClusteredSplit

In this step, we initialize the Indox Retrieval Augmentation, the QA model, and the embedding model. This example uses OpenAI's `gpt-3.5-turbo-0125` for QA and `text-embedding-3-small` for embeddings; other models can be substituted depending on your requirements.


In [3]:
Indox = IndoxRetrievalAugmentation()
qa_model = OpenAiQA(api_key=OPENAI_API_KEY, model="gpt-3.5-turbo-0125")
embed = OpenAiEmbedding(openai_api_key=OPENAI_API_KEY, model="text-embedding-3-small")

In [15]:
file_path = "sample.txt"

## Data Loader Setup

We set up the data loader using the `ClusteredSplit` class. This step involves loading documents, configuring embeddings, and setting options for processing the text.


In [17]:
docs = ClusteredSplit(
    file_path=file_path,
    embeddings=embed,
    remove_sword=True,
    re_chunk=False,
    chunk_size=300,
)

Starting processing...


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASHKAN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASHKAN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

--Generated 1 clusters--


2024-05-21 17:38:14,675 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


End Chunking & Clustering process.
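
To make the `remove_sword` and `chunk_size` options more concrete, here is a rough sketch of stopword removal followed by fixed-size chunking. It is an illustration only, not Indox's implementation. The tiny `STOPWORDS` set, the whitespace tokenization, and the `chunk_words` helper are all assumptions. The sketch counts words, whereas the library may count tokens, and the clustering step reported in the log above is omitted.

```python
# Tiny illustrative stopword set; a real pipeline would use a full list such as NLTK's.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "was"}


def chunk_words(text: str, chunk_size: int, remove_stopwords: bool = True) -> list[str]:
    """Split text into chunks of at most `chunk_size` words (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    if remove_stopwords:
        # Drop stopwords case-insensitively before chunking.
        words = [w for w in words if w.lower() not in STOPWORDS]
    # Group the remaining words into consecutive, non-overlapping chunks.
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


chunks = chunk_words(
    "The prince danced with the maiden in the hall of the castle",
    chunk_size=3,
)
print(chunks)
```

Removing stopwords before chunking means each chunk carries more content words, which is consistent with the stopword-free passages that appear in the retrieved context later in this notebook.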


## Vector Store Connection and Document Storage

In this step, we connect the Indox application to the vector store and store the processed documents.


In [19]:
Indox.connect_to_vectorstore(collection_name="sample",embeddings=embed)

2024-05-21 17:38:45,564 - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.


Connection established successfully.


In [20]:
Indox.store_in_vectorstore(docs)

2024-05-21 17:38:51,849 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-21 17:38:52,914 - INFO - Document added successfully to the vector store.


<Indox.vectorstore.ChromaVectorStore at 0x2a230cf04d0>

## Querying and Interpreting the Response

In this step, we query the Indox application with a specific question and use the QA model to generate the response. The response is a tuple: `response[0]` contains the answer, and `response[1]` contains the retrieved context passages with their cosine scores.
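
For readers unfamiliar with cosine scores and the `top_k` parameter, the sketch below shows the general idea of ranking stored vectors against a query vector by cosine similarity. It is a plain-Python illustration under stated assumptions: the function names `cosine_similarity` and `top_k` are hypothetical, the vectors are toy values, and the way the vector store actually computes and reports its scores is not shown in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" for three documents.
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranked = top_k([1.0, 0.0], doc_vecs, k=2)
print(ranked)
```

In this toy example, the second document points in the same direction as the query and ranks first, followed by the third document, which is partially aligned with it.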

In [21]:
response = Indox.answer_question(query="How cinderella reach happy ending?",qa_model=qa_model,top_k=5)

2024-05-21 17:39:29,349 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-21 17:39:32,827 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [22]:
response[0]

"Cinderella reached her happy ending by attending the royal festival with the help of birds. Despite her stepmother and stepsisters mistreating her and trying to prevent her from attending the festival, Cinderella was able to go with the assistance of the birds who provided her with a beautiful dress. At the festival, she captivated the prince with her beauty and charm, dancing with him all evening. Even though she had to escape from the prince at the end of the night, he was determined to find her and went to great lengths to track her down. Eventually, Cinderella's true identity was revealed, and she was chosen by the prince to be his bride, leading to her happily ever after."

In [23]:
response[1]

(['The provided documentation is a detailed retelling of the classic fairy tale "Cinderella." It narrates the story of a young girl whose mother has passed away, and her father marries a woman with two daughters. The stepmother and stepsisters mistreat Cinderella, giving her difficult tasks and preventing her from attending a royal festival. With the help of birds, Cinderella is able to attend the festival in a beautiful dress, where she captivates the prince. After a series of events',
  "never thought cinderella , believed sitting home dirt , picking lentils ashes prince approached , took hand danced would dance maiden , never let loose hand , one else came invite , said , partner danced till evening , wanted go home king 's son said , go bear company , wished see beautiful maiden belonged escaped , however , sprang pigeon-house king 's son waited father came , told unknown maiden leapt pigeon-house old man thought , cinderella bring axe pickaxe might hew pigeon-house pieces , one in