# ArxivLoader

[arXiv](https://arxiv.org/) is an open-access archive of more than 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

## Setup

To use the arXiv document loader, you'll need to install the `arxiv`, `PyMuPDF` and `langchain-community` packages. PyMuPDF converts the PDF files downloaded from arxiv.org into plain text.

In [None]:
%pip install -qU langchain-community arxiv pymupdf

## Instantiation

Now we can instantiate our loader object and load documents:

In [1]:
from langchain_community.document_loaders import ArxivLoader

# Supports all arguments of `ArxivAPIWrapper`
loader = ArxivLoader(
    query="reasoning",
    load_max_docs=2,
    # doc_content_chars_max=1000,
    # load_all_available_meta=False,
    # ...
)
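
The `query` string can be more specific than a single keyword. The sketch below composes a fielded search string using the arXiv API's `ti:`, `au:` and `cat:` prefixes. It assumes the loader forwards `query` to the arXiv search API unchanged, which is worth verifying against the API reference; `build_arxiv_query` is a hypothetical helper, not part of LangChain.

```python
from typing import Optional


def build_arxiv_query(
    title: Optional[str] = None,
    author: Optional[str] = None,
    category: Optional[str] = None,
) -> str:
    """Join the given fields into one arXiv search string (hypothetical helper)."""
    clauses = []
    if title:
        clauses.append(f'ti:"{title}"')  # match words in the paper title
    if author:
        clauses.append(f'au:"{author}"')  # match an author name
    if category:
        clauses.append(f"cat:{category}")  # restrict to a subject category
    if not clauses:
        raise ValueError("at least one field is required")
    return " AND ".join(clauses)


query = build_arxiv_query(title="deductive reasoning", category="cs.CL")
print(query)

# Assumed usage, not executed here because it needs network access:
# loader = ArxivLoader(query=query, load_max_docs=2)
```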

## Load

Use `.load()` to synchronously load all Documents into memory, with one Document per arXiv paper.

Let's run through a basic example that uses the `ArxivLoader` to search for papers about reasoning:

In [2]:
docs = loader.load()
docs[0]

Document(page_content='Hypothesis Testing Prompting Improves Deductive Reasoning in\nLarge Language Models\nYitian Li1,2, Jidong Tian1,2, Hao He1,2, Yaohui Jin1,2\n1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University\n2State Key Lab of Advanced Optical Communication System and Network\n{yitian_li, frank92, hehao, jinyh}@sjtu.edu.cn\nAbstract\nCombining different forms of prompts with pre-trained large language models has yielded remarkable results on\nreasoning tasks (e.g. Chain-of-Thought prompting). However, along with testing on more complex reasoning, these\nmethods also expose problems such as invalid reasoning and fictional reasoning paths. In this paper, we develop\nHypothesis Testing Prompting, which adds conclusion assumptions, backward reasoning, and fact verification during\nintermediate reasoning steps. Hypothesis Testing prompting involves multiple assumptions and reverses validation of\nconclusions leading to its unique correct answer. Expe

In [3]:
print(docs[0].metadata)

{'Published': '2024-05-09', 'Title': 'Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models', 'Authors': 'Yitian Li, Jidong Tian, Hao He, Yaohui Jin', 'Summary': 'Combining different forms of prompts with pre-trained large language models\nhas yielded remarkable results on reasoning tasks (e.g. Chain-of-Thought\nprompting). However, along with testing on more complex reasoning, these\nmethods also expose problems such as invalid reasoning and fictional reasoning\npaths. In this paper, we develop \\textit{Hypothesis Testing Prompting}, which\nadds conclusion assumptions, backward reasoning, and fact verification during\nintermediate reasoning steps. \\textit{Hypothesis Testing prompting} involves\nmultiple assumptions and reverses validation of conclusions leading to its\nunique correct answer. Experiments on two challenging deductive reasoning\ndatasets ProofWriter and RuleTaker show that hypothesis testing prompting not\nonly significantly improves the eff
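
Because the metadata is a plain dictionary, it is easy to turn into a short citation line. The following is a minimal sketch that relies only on the `Published`, `Title` and `Authors` keys visible in the output above; `format_citation` and its output format are illustrative assumptions rather than part of the loader.

```python
from datetime import date


def format_citation(metadata: dict) -> str:
    """Build a one-line citation from loader metadata (hypothetical helper)."""
    authors = metadata.get("Authors", "Unknown authors")
    title = metadata.get("Title", "Untitled")
    # str() covers both string dates and datetime.date values.
    published = str(metadata.get("Published", "n.d."))
    return f"{authors}. {title}. arXiv, {published}."


# Sample metadata copied from the output above (Summary omitted).
sample = {
    "Published": "2024-05-09",
    "Title": "Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models",
    "Authors": "Yitian Li, Jidong Tian, Hao He, Yaohui Jin",
}
citation = format_citation(sample)
print(citation)

# The same helper also accepts a datetime.date for "Published".
citation_from_date = format_citation({**sample, "Published": date(2024, 5, 9)})
```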

## Lazy Load

If we're loading a large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:

In [4]:
docs = []

for doc in loader.lazy_load():
    docs.append(doc)

    if len(docs) >= 10:
        # do some paged operation on the current batch, e.g.
        # index.upsert(docs)

        docs = []

In this example we never have more than 10 Documents loaded into memory at a time.
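
One detail of the loop above is that any Documents left over after the last full page are still sitting in `docs` when the loop ends. A small generic batching helper, sketched here with only the standard library, yields that final partial batch as well. The name `batched` and the string stand-ins for Documents are assumptions made for illustration; any iterable, including `loader.lazy_load()`, could be passed in.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including a final partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): a generator of five fake documents.
fake_docs = (f"doc-{i}" for i in range(5))
batch_sizes = [len(batch) for batch in batched(fake_docs, 2)]
print(batch_sizes)  # [2, 2, 1]
```

With this shape, at most `size` items are held in memory at once, and the paged operation runs once per yielded batch.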

## Use paper summaries as documents

You can use the summaries of arXiv papers as documents rather than the full papers:

In [5]:
docs = loader.get_summaries_as_docs()
docs[0]

Document(page_content='Combining different forms of prompts with pre-trained large language models\nhas yielded remarkable results on reasoning tasks (e.g. Chain-of-Thought\nprompting). However, along with testing on more complex reasoning, these\nmethods also expose problems such as invalid reasoning and fictional reasoning\npaths. In this paper, we develop \\textit{Hypothesis Testing Prompting}, which\nadds conclusion assumptions, backward reasoning, and fact verification during\nintermediate reasoning steps. \\textit{Hypothesis Testing prompting} involves\nmultiple assumptions and reverses validation of conclusions leading to its\nunique correct answer. Experiments on two challenging deductive reasoning\ndatasets ProofWriter and RuleTaker show that hypothesis testing prompting not\nonly significantly improves the effect, but also generates a more reasonable\nand standardized reasoning process.', metadata={'Entry ID': 'http://arxiv.org/abs/2405.06707v1', 'Published': datetime.date(20
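
The summaries come back as arXiv stores them, so they can contain hard line breaks and simple LaTeX markup such as `\textit{...}`, as seen in the output above. If that gets in the way downstream, a light cleanup pass may help. This is a minimal sketch using only the standard library: `clean_summary` is a hypothetical helper, and its regular expression only unwraps simple, non-nested commands of the form `\name{text}`.

```python
import re

# Matches a simple LaTeX command with one brace group, e.g. \textit{word}.
_LATEX_COMMAND = re.compile(r"\\[A-Za-z]+\{([^{}]*)\}")


def clean_summary(text: str) -> str:
    """Unwrap simple LaTeX commands and collapse whitespace (hypothetical helper)."""
    unwrapped = _LATEX_COMMAND.sub(r"\1", text)
    # split() with no arguments also removes the hard newlines.
    return " ".join(unwrapped.split())


# Fragment of the summary shown above.
raw = "In this paper, we develop \\textit{Hypothesis Testing Prompting}, which\nadds conclusion assumptions"
cleaned = clean_summary(raw)
print(cleaned)
```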

## API reference

For detailed documentation of all `ArxivLoader` features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader