# How to Use ArxivReader for Retrieving Papers from arXiv
In this notebook, we will demonstrate how to use the `ArxivReader` class for accessing papers from the arXiv repository. The `ArxivReader` class is designed to interact with the arXiv API and retrieve paper content and metadata, which can be utilized in various research and question-answering systems. We'll be leveraging open-source models available on the internet, such as Mistral, to process the retrieved data.

To begin, ensure you have set up your environment variables and API keys in Python using the dotenv library. This is crucial for securely managing sensitive information, such as API keys, especially when using services like HuggingFace. Ensure your `HUGGINGFACE_API_KEY` is defined in the `.env` file to avoid hardcoding sensitive data into your codebase, thus enhancing security and maintainability.




| Platform |
|----------|
| [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/cookbook/indoxArcg/arxiv_chroma_semantic.ipynb) |
| [![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/osllmai/inDox/blob/master/cookbook/indoxArcg/arxiv_chroma_semantic.ipynb) |

In [14]:
# !pip install indoxarcg chromadb arxiv semantic_text_splitter sentence_transformers torch

If you have trouble installing torch on Windows, install it from the CPU-only wheel index instead:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indoxArcg`:

### Windows

1. **Create the virtual environment:**

```bash
python -m venv indoxArcg
```

2. **Activate the virtual environment:**

```bash
indoxArcg\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**

```bash
python3 -m venv indoxArcg
```

2. **Activate the virtual environment:**

```bash
source indoxArcg/bin/activate
```
   
### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```

## Import Essential Libraries

Next, we import the essential components of our indoxArcg question-answering system:

- `RAG`: The retrieval-augmented generation pipeline. It retrieves the most relevant chunks from the vector store and passes them to the language model to answer a query.
- `HuggingFaceAPIModel`: The LLM wrapper used in this notebook to call `mistralai/Mistral-7B-Instruct-v0.2` through the Hugging Face API.
- `HuggingFaceEmbedding`: Uses a Hugging Face embedding model to turn text into vectors that capture its contextual meaning.
- `SemanticTextSplitter`: Uses a Hugging Face tokenizer to split text into chunks of at most a specified number of tokens, keeping each chunk semantically coherent.

## How to Obtain a Hugging Face API Key

To access the Hugging Face API for our arXiv paper retrieval system, you'll need to generate an API key. Follow these steps:

### Step 1: Create a Hugging Face Account
If you don't already have one, visit [Hugging Face](https://huggingface.co/) and sign up for an account.

### Step 2: Access Your Profile Settings
- Log in to your Hugging Face account
- Click on your profile picture in the top-right corner
- Select "Settings" from the dropdown menu

### Step 3: Generate a New API Token
- Navigate to the "Access Tokens" section in the left sidebar
- Click on "New token" to create a new user access token
- Select an appropriate role (Read for inference, Write for model uploads)
- Give your token a descriptive name (e.g., "ArXiv Reader Project")
- Copy the newly generated API token immediately (it won't be shown again)

### Step 4: Store Your API Token Securely
- Store your API token in your `.env` file as shown in the code
- Never hardcode tokens directly in your application
- Keep your token confidential as it grants access to your Hugging Face account

### Step 5: Use the API Token
Once configured in your environment, the token authenticates your requests to:
- Access the Hugging Face Inference API
- Download models and embeddings
- Push or pull models from the Hub

> **Note**: The token shown in this notebook is for demonstration purposes only. You should generate your own token for actual use.

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

HUGGINGFACE_API_KEY = os.environ['HUGGINGFACE_API_KEY']
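
Reading the key with `os.environ[...]` raises a bare `KeyError` when the variable is missing, which can be confusing on a first run. As a minimal sketch, a small helper can fail with a clearer message instead. The name `require_env` is hypothetical and is not part of indoxArcg or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a readable message.

    Hypothetical helper for illustration; not part of indoxArcg or python-dotenv.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line like {name}=<your token> to your .env "
            "file and call load_dotenv() before reading it."
        )
    return value
```

After `load_dotenv()` has run, the key could then be read with `HUGGINGFACE_API_KEY = require_env("HUGGINGFACE_API_KEY")`.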

## Building the ArxivReader System and Initializing Models

Next, we will build our ArxivReader system and initialize the Mistral model (through `HuggingFaceAPIModel`) along with the `HuggingFaceEmbedding` model. This setup lets us retrieve and process arXiv papers and use both models for our question-answering tasks.


In [3]:
import torch


In [4]:
from indoxArcg.llms import HuggingFaceAPIModel
from indoxArcg.embeddings import HuggingFaceEmbedding

# LLM that generates the answers, called through the Hugging Face API
mistral_qa = HuggingFaceAPIModel(api_key=HUGGINGFACE_API_KEY, model="mistralai/Mistral-7B-Instruct-v0.2")
# Embedding model that vectorizes chunks and queries
embed = HuggingFaceEmbedding(api_key=HUGGINGFACE_API_KEY, model="multi-qa-mpnet-base-cos-v1")



INFO: Initializing HuggingFaceAPIModel with model: mistralai/Mistral-7B-Instruct-v0.2
INFO: HuggingFaceAPIModel initialized successfully
INFO: Initialized HuggingFaceEmbedding with model: multi-qa-mpnet-base-cos-v1


## Setting Up the ArxivReader for Retrieving Papers
To demonstrate the capabilities of our `ArxivReader` system and the Indox question-answering pipeline, we will use a sample arXiv paper ID. The reader fetches that paper's content, which we will use for testing and evaluation.
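
Paper IDs are often copied from a URL or a citation, so it can help to normalize them before passing them to the reader. The sketch below checks the two public arXiv identifier schemes: new-style `YYMM.NNNN` or `YYMM.NNNNN`, and old-style `archive/YYMMNNN`. The function `normalize_arxiv_id` is a hypothetical helper, and whether `ArxivReader` accepts version suffixes such as `v3` is an assumption that is not verified here.

```python
import re

# New-style identifiers: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
_NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style identifiers: archive[.SC]/YYMMNNN, with an optional version suffix.
_OLD_STYLE = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")

_PREFIXES = ("https://arxiv.org/abs/", "http://arxiv.org/abs/", "arXiv:")


def normalize_arxiv_id(raw: str) -> str:
    """Strip common prefixes and validate the shape of an arXiv identifier.

    Hypothetical helper for illustration; not part of indoxArcg.
    """
    candidate = raw.strip()
    for prefix in _PREFIXES:
        if candidate.startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    if _NEW_STYLE.match(candidate) or _OLD_STYLE.match(candidate):
        return candidate
    raise ValueError(f"Not a recognizable arXiv identifier: {raw!r}")
```

For example, `normalize_arxiv_id("arXiv:2201.08239")` returns `"2201.08239"`, the ID used in this notebook.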

In [5]:
# !pip install arxiv

In [5]:
from indoxArcg.data_connectors import ArxivReader

reader = ArxivReader()

paper_ids = ["2201.08239"]
documents = reader.load_content(paper_ids)

In [7]:
documents

"Title: LaMDA: Language Models for Dialog Applications\n\nAuthors: Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, Quoc Le\n\nAbstract: We present 

In [6]:
content = documents
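
The text shown above begins with `Title:` and `Authors:` lines separated by blank lines. Assuming `load_content` returns a single string in that layout (an inference from the output, not a documented guarantee), the header fields can be pulled out for display or filtering. `parse_header_fields` is a hypothetical helper.

```python
import re


def parse_header_fields(text: str) -> dict:
    """Extract the title and author list from text laid out as shown above.

    Hypothetical helper; assumes 'Title: ...' and 'Authors: ...' each occupy
    one line. Fields that are not found are simply omitted.
    """
    fields = {}
    title = re.search(r"^Title: (.+)$", text, flags=re.MULTILINE)
    if title:
        fields["title"] = title.group(1).strip()
    authors = re.search(r"^Authors: (.+)$", text, flags=re.MULTILINE)
    if authors:
        fields["authors"] = [name.strip() for name in authors.group(1).split(",")]
    return fields
```

Calling `parse_header_fields(content)` would then give a dictionary with a `title` string and an `authors` list.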

## Splitting Content into Manageable Chunks
We use the `SemanticTextSplitter` class from the `indoxArcg.splitter` module to divide the retrieved content into smaller, meaningful chunks. Here the splitter is created with a maximum chunk size of 400 tokens.

In [12]:
# !pip install semantic_text_splitter

In [9]:
from indoxArcg.splitter import SemanticTextSplitter
splitter = SemanticTextSplitter(400)
content_chunks = splitter.split_text(content)

content_chunks[0]

'Title: LaMDA: Language Models for Dialog Applications\n\nAuthors: Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, Quoc Le'
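
The first chunk ends at a paragraph boundary (after the author list) rather than in the middle of a sentence. To illustrate the general idea of packing whole paragraphs under a size budget, here is a simplified sketch. It is not the `SemanticTextSplitter` implementation: it counts whitespace-separated words instead of tokenizer tokens, and `pack_paragraphs` is a hypothetical name.

```python
def pack_paragraphs(text: str, max_tokens: int) -> list[str]:
    """Greedily group whole paragraphs into chunks of at most max_tokens words.

    Simplified illustration: words stand in for tokens, and a single paragraph
    longer than the budget is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real tokenizer-based splitter will place boundaries differently, but the trade-off is the same: a smaller budget gives more focused chunks, while a larger one keeps more context together.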

## Storing and Indexing Content with Chroma
We use the `Chroma` vector store from the `indoxArcg.vector_stores` module to store and index the content chunks. By creating a collection named "sample" and applying an embedding function (`embed`), we convert each chunk into a vector for efficient retrieval. The `add` method then adds these vectors to the database, enabling scalable and effective search for question-answering tasks.
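
To make "convert each chunk into a vector for efficient retrieval" concrete, here is a toy sketch of similarity search in pure Python. It uses bag-of-words counts in place of the real embedding model and a linear scan in place of Chroma's index, so it only illustrates the ranking idea. All names and the sample chunks are made up for illustration.

```python
import math
from collections import Counter


def embed_bow(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a neural embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed_bow(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed_bow(chunk)), reverse=True)
    return ranked[:k]


# Made-up chunks, for illustration only.
toy_chunks = [
    "safety and factual grounding are key challenges",
    "the authors list is long",
    "chunks are stored as vectors",
]
best = top_k("what are the key challenges", toy_chunks, k=1)
```

A vector store does the same kind of ranking, but with dense embeddings and an index that avoids comparing the query against every stored chunk.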

In [16]:
# !pip install chromadb

In [10]:
from indoxArcg.vector_stores import Chroma
db = Chroma(collection_name="sample",embedding_function=embed)
db.add(docs=content_chunks)

2025-03-28 18:49:27,254 - chromadb.telemetry.product.posthog - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.

INFO: Storing documents in the vector store
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Batches: 100%|██████████| 1/1 [00:00<00:00,  2.72it/s]

INFO: Document added successfully to the vector store.
INFO: Documents stored successfully





## Querying the Arxiv Data with Indox
With our `ArxivReader` system and `indoxArcg` setup complete, we are ready to test it using a sample query. This test will show how well our system can retrieve and generate accurate answers based on the arXiv papers stored in the vector store.

We’ll use a sample query to evaluate our system:
- **Query**: "what are challenges?"

The query is embedded and matched against the chunks in the vector store, and the retrieved chunks are then passed to the language model to generate a response grounded in the paper.

Let’s test our setup with this query.

In [11]:
from indoxArcg.pipelines.rag import RAG
indox = RAG(llm=mistral_qa, vector_store=db)  # WebSearchFallback=True has been removed

In [12]:
query = "what are challenges?"
response = indox.infer(query)

INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Batches: 100%|██████████| 1/1 [00:00<00:00, 32.26it/s]

INFO: Answering question
INFO: Sending request to Hugging Face API
INFO: Received successful response from Hugging Face API


In [13]:
print(response)

The key challenges addressed in the study for the LaMDA (Language Models for Dialog Applications) are safety and factual grounding. Safety involves ensuring the model's responses align with human values, preventing harmful suggestions and unfair biases. Factual grounding enables the model to consult external knowledge sources, providing responses rooted in known facts, rather than merely plausible responses.
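
The answer above is grounded in the retrieved chunks because a RAG pipeline places them in the prompt alongside the question. The sketch below shows one common way to assemble such a prompt. The wording and the `build_prompt` function are assumptions for illustration, and the actual prompt template used by indoxArcg's `RAG` class may differ.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and a question into a single grounded prompt.

    Hypothetical template for illustration; not the prompt used by indoxArcg.
    """
    numbered = "\n\n".join(
        f"[{index}] {chunk.strip()}" for index, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )
```

Numbering the chunks makes it easier to trace which passage supports a given part of the answer when inspecting the prompt by hand.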


## Join Us

Join us in exploring how Indox can revolutionize your document processing workflow, bringing clarity and organization to your data retrieval needs. Connect with us and become part of our growing community through the platforms below:

## Community

- [Discord](https://discord.com/invite/xGz5tQYaeq)
- [X (Twitter)](https://x.com/osllmai)
- [LinkedIn](https://www.linkedin.com/company/osllmai/)
- [YouTube](https://www.youtube.com/@osllm-rb9pr)
- [Telegram](https://t.me/osllmai)




*Reviewed by: Ali Nemati - March 27, 2025*

