# How to Use ArxivReader for Retrieving Papers from arXiv
In this notebook, we demonstrate how to use the `ArxivReader` class to access papers from the arXiv repository. `ArxivReader` interacts with the arXiv API to retrieve paper content and metadata, which can then feed research and question-answering systems. To process the retrieved data, we use an open-source model, Mistral, accessed through the Hugging Face API.

To begin, set up your environment variables and API keys using the `dotenv` library. Keeping sensitive values such as your Hugging Face API key in a `.env` file avoids hardcoding them into your codebase, which improves both security and maintainability. Make sure `HUGGINGFACE_API_KEY` is defined in your `.env` file before running the notebook.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/cookbook/indoxArcg/arxiv_chroma_semantic.ipynb)

In [None]:
!pip install indoxarcg chromadb arxiv semantic_text_splitter sentence_transformers

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indoxArcg`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indoxArcg
```
2. **Activate the virtual environment:**
```bash
indoxArcg\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
   ```bash
   python3 -m venv indoxArcg
   ```

2. **Activate the virtual environment:**
   ```bash
   source indoxArcg/bin/activate
   ```
   
### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```

## Import Essential Libraries

Next, we load the API key and introduce the main components of our indoxArcg question-answering system, which are imported in the cells that follow:

- `RAG`: The retrieval-augmented generation pipeline. It retrieves the chunks most relevant to a query from the vector store and passes them to the language model.
- `HuggingFaceModel`: The LLM wrapper used for question answering. Here it calls `mistralai/Mistral-7B-Instruct-v0.2` through the Hugging Face API.
- `HuggingFaceEmbedding`: Produces Hugging Face embeddings that capture the contextual meaning of the text.
- `SemanticTextSplitter`: Uses a Hugging Face tokenizer to split text into chunks of at most a specified number of tokens, while keeping each chunk semantically coherent.

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

HUGGINGFACE_API_KEY = os.environ['HUGGINGFACE_API_KEY']
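
Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing, which can be confusing on a first run. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper written for this notebook, not part of indoxArcg or dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of indoxArcg or dotenv.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Usage, after load_dotenv() has run:
# HUGGINGFACE_API_KEY = require_env("HUGGINGFACE_API_KEY")
```

The error message points the reader at the `.env` file instead of a stack trace ending in `KeyError`.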

## Building the ArxivReader System and Initializing Models

Next, we build our ArxivReader system and initialize the Mistral model (through `HuggingFaceModel`) along with the `HuggingFaceEmbedding` model. This setup lets us retrieve and process arXiv papers and use both models for our question-answering tasks.


In [3]:
from indoxArcg.llms import HuggingFaceModel
from indoxArcg.embeddings import HuggingFaceEmbedding

mistral_qa = HuggingFaceModel(api_key=HUGGINGFACE_API_KEY, model="mistralai/Mistral-7B-Instruct-v0.2")
embed = HuggingFaceEmbedding(api_key=HUGGINGFACE_API_KEY, model="multi-qa-mpnet-base-cos-v1")

INFO: Initializing HuggingFaceModel with model: mistralai/Mistral-7B-Instruct-v0.2
INFO: HuggingFaceModel initialized successfully
INFO: Initialized HuggingFaceEmbedding with model: multi-qa-mpnet-base-cos-v1


## Setting Up the ArxivReader for Retrieving Papers
To demonstrate the capabilities of our `ArxivReader` system and the indoxArcg question-answering pipeline, we will use a sample arXiv paper ID. The content retrieved for this paper serves as our test data.

In [5]:
from indoxArcg.data_connectors import ArxivReader

reader = ArxivReader()

paper_ids = ["2201.08239"]
documents = reader.load_content(paper_ids)

In [6]:
content = documents
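
Before passing identifiers to the reader, it can help to check that they look like arXiv IDs. The sketch below validates the new-style identifier scheme (`YYMM.NNNN` or `YYMM.NNNNN`, with an optional version suffix such as `v2`). `is_new_style_arxiv_id` is a hypothetical helper; whether `ArxivReader` accepts versioned or old-style identifiers is not covered in this notebook, so treat this as a pre-check only.

```python
import re

# New-style arXiv identifiers: four digits (YYMM), a dot, four or five digits,
# and an optional version suffix such as "v2".
_ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def is_new_style_arxiv_id(paper_id: str) -> bool:
    """Return True if paper_id looks like a new-style arXiv identifier.

    Hypothetical helper for illustration; it checks the format only and does
    not confirm that the paper exists.
    """
    return _ARXIV_ID.fullmatch(paper_id.strip()) is not None


candidate_ids = ["2201.08239", "https://arxiv.org/abs/2201.08239"]
valid_ids = [pid for pid in candidate_ids if is_new_style_arxiv_id(pid)]
```

Here only the bare identifier passes the check; the full URL is filtered out.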

## Splitting Content into Manageable Chunks
We use the `SemanticTextSplitter` class from the `indoxArcg.splitter` module to divide the retrieved content into smaller, meaningful chunks. The argument `400` sets the maximum number of tokens per chunk.

In [7]:
from indoxArcg.splitter import SemanticTextSplitter
splitter = SemanticTextSplitter(400)
content_chunks = splitter.split_text(content)
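
To build intuition for size-bounded chunking, here is a simplified sketch. It is not the algorithm used by `SemanticTextSplitter`: it counts whitespace-separated words rather than tokenizer tokens, and it treats sentence boundaries as the only semantic unit. `chunk_sentences` is a hypothetical name.

```python
import re
from typing import List


def chunk_sentences(text: str, max_words: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    Illustration only. A single sentence longer than max_words is kept intact
    and becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five. Six seven eight nine. Ten."
sample_chunks = chunk_sentences(sample, max_words=5)
```

Keeping sentences whole is the simplest way to avoid cutting a thought in half; a tokenizer-aware splitter applies the same idea with a more accurate size measure.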

## Storing and Indexing Content with Chroma
We use the `Chroma` vector store from the `indoxArcg.vector_stores` module to store and index the content chunks. By creating a collection named "sample" and applying an embedding function (`embed`), we convert each chunk into a vector for efficient retrieval. The `add` method then adds these vectors to the database, enabling scalable and effective search for question-answering tasks.
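
Conceptually, a vector store keeps one embedding per chunk and returns the chunks whose embeddings are closest to the query embedding. The sketch below shows that idea with cosine similarity over made-up three-dimensional vectors. It is not how Chroma is implemented, and `cosine` and `top_k` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float], store: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k stored texts most similar to query_vec, best first."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings", invented for illustration only.
toy_store = {
    "chunk A": [0.9, 0.1, 0.0],
    "chunk B": [0.1, 0.9, 0.0],
    "chunk C": [0.0, 0.2, 0.9],
}
best = top_k([1.0, 0.0, 0.0], toy_store, k=1)
```

In the notebook, Chroma takes care of embedding, storage, and search itself, as the next cell shows.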

In [8]:
from indoxArcg.vector_stores import Chroma
db = Chroma(collection_name="sample", embedding_function=embed)
db.add(docs=content_chunks)

2025-01-19 13:23:41,716 - chromadb.telemetry.product.posthog - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.

INFO: Storing documents in the vector store
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
INFO: Document added successfully to the vector store.
INFO: Documents stored successfully


## Querying the Arxiv Data with Indox
With our `ArxivReader` system and `indoxArcg` setup complete, we are ready to test it using a sample query. This test will show how well our system can retrieve and generate accurate answers based on the arXiv papers stored in the vector store.

We’ll use a sample query to evaluate our system:
- **Query**: "what are challenges?"

The `RAG` pipeline embeds this question, retrieves the most relevant chunks from the vector store, and passes them to the language model to generate a response.
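
The retrieve-then-generate flow can be sketched as a prompt-assembly step: the retrieved chunks become numbered context, followed by the question. This is an illustration of the general pattern, not the prompt that indoxArcg's `RAG` pipeline actually builds; `build_prompt` and its wording are assumptions.

```python
from typing import List


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine retrieved chunks and a question into a single LLM prompt.

    Hypothetical helper for illustration; the real pipeline may format its
    prompt differently.
    """
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


demo_prompt = build_prompt(
    "what are challenges?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
```

Numbering the chunks makes it easier to trace which passage an answer drew on when inspecting the prompt.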

Let’s test our setup with this query.

In [11]:
from indoxArcg.pipelines.rag import RAG
indox = RAG(llm=mistral_qa, vector_store=db, enable_web_fallback=False)

In [13]:
query = "what are challenges?"
response = indox.infer(query)

INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
INFO: Sending request to Hugging Face API
INFO: Received successful response from Hugging Face API
INFO: Sending request to Hugging Face API
INFO: Received successful response from Hugging Face API
INFO: Answering question
INFO: Sending request to Hugging Face API
INFO: Received successful response from Hugging Face API
INFO: Sending request to Hugging Face API
INFO: Received successful response from Hugging Face API


In [14]:
print(response)

Challenges are situations or obstacles that require effort, skill, or determination to overcome. They can come in various forms such as personal, professional, academic, or social issues, and can test one's abilities, resilience, and adaptability. Some common challenges include managing time effectively, dealing with stress, learning new skills, facing criticism, or overcoming physical or mental health issues. Facing challenges can also be an opportunity for growth, as they provide opportunities to learn, develop new strengths, and build skills that can be applied in other areas of life. However, it's important to remember that everyone's challenges are unique, and what may be a challenge for one person may not be for another.
