# PaperQA with Aurelian

This notebook demonstrates how to use the Aurelian PaperQA integration to search, analyze, and query scientific papers. The PaperQA agent allows you to:

1. Search for papers on specific topics
2. Add papers to your collection from files or URLs
3. Query papers to answer scientific questions
4. List papers in your collection

## Setup

First, let's load any environment variables and set up the agent.

In [1]:
from dotenv import load_dotenv
load_dotenv("../../.env")

True

In [2]:
import nest_asyncio
nest_asyncio.apply()

## Initialize the PaperQA Agent

Now we'll import and initialize the PaperQA agent with custom settings.

In [3]:
import os
import tempfile
from aurelian.agents.paperqa.paperqa_agent import paperqa_agent
from aurelian.agents.paperqa.paperqa_config import get_config

# Create a temporary directory for our papers
temp_dir = tempfile.mkdtemp()
papers_dir = os.path.join(temp_dir, "papers")
os.makedirs(papers_dir, exist_ok=True)

# Configure the agent
paperqa_config = get_config()
paperqa_config.paper_directory = papers_dir

# Optionally customize other settings
paperqa_config.llm = "gpt-4o-2024-11-20"  # Or any other supported model
paperqa_config.embedding = "text-embedding-3-small"  # Default embedding model
paperqa_config.temperature = 0.2

print(f"Papers will be stored in: {papers_dir}")

Papers will be stored in: /var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmpmwujoavx/papers


## Adding Papers to the Collection

You can add papers to your collection either from local files or from URLs. Let's add a paper from a URL. 

> **Note**: A URL must be a direct link to a PDF file (ending with `.pdf`), not a landing or abstract page.
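The rule in the note above can be checked before calling `add_paper()`. The following is a minimal sketch using only the standard library; the helper name `looks_like_direct_pdf_url` is hypothetical and not part of Aurelian, and it only inspects the URL text rather than fetching anything.

```python
from urllib.parse import urlparse


def looks_like_direct_pdf_url(url: str) -> bool:
    """Return True if the URL is http(s) and its path ends in .pdf."""
    parsed = urlparse(url)
    # Compare the path only, so a query string such as ?download=1 is ignored
    return parsed.scheme in ("http", "https") and parsed.path.lower().endswith(".pdf")


print(looks_like_direct_pdf_url("https://arxiv.org/pdf/2203.06566.pdf"))  # True
print(looks_like_direct_pdf_url("https://arxiv.org/abs/2203.06566"))      # False
```

A check like this catches abstract-page links early, but it cannot confirm that the server actually returns a PDF.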

In [4]:
# Let's use the add_paper function directly
from aurelian.agents.paperqa.paperqa_tools import add_paper
from pydantic_ai import RunContext

# Create a run context with our config
ctx = RunContext(deps=paperqa_config, model=None, usage=None, prompt=None)

# Add a paper from a URL with auto_index=True to automatically build the index
url = "https://arxiv.org/pdf/2203.06566.pdf"  # PromptChainer paper (Wu et al., 2022)
result = await add_paper(ctx, url, auto_index=True)
result

Downloaded https://arxiv.org/pdf/2203.06566.pdf to /var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmpmwujoavx/papers/2203.06566.pdf


CROSSREF_MAILTO environment variable not set. Crossref API rate limits may apply.
CROSSREF_API_KEY environment variable not set. Crossref API rate limits may apply.
SEMANTIC_SCHOLAR_API_KEY environment variable not set. Semantic Scholar API rate limits may apply.


Building index for 1 PDF files in /var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmpmwujoavx/papers...


SEMANTIC_SCHOLAR_API_KEY environment variable not set. Semantic Scholar API rate limits may apply.


{'success': True,
 'docname': 'Wu2022',
 'doc': None,
 'index_result': {'success': True,
  'paper_directory': '/var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmpmwujoavx/papers',
  'pdf_files_count': 1,
  'indexed_papers_count': 1,
  'indexed_papers': ['2203.06566.pdf'],
  'message': 'Successfully indexed 1 papers out of 1 PDF files.'},
 'message': 'Paper added and indexed successfully. 1 papers now in the index.'}
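The warnings in the output above name three environment variables that were not set. As a rough sketch, you could check for them up front, right after `load_dotenv()`. The helper `missing_env_vars` is hypothetical; only the variable names come from the log messages.

```python
import os

# Names taken from the warnings printed during indexing
OPTIONAL_KEYS = ("CROSSREF_MAILTO", "CROSSREF_API_KEY", "SEMANTIC_SCHOLAR_API_KEY")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(OPTIONAL_KEYS)
if missing:
    print("Not set (API rate limits may apply):", ", ".join(missing))
```

The paper was still indexed without these variables in the run shown here, so treating them as optional seems reasonable.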

## Understanding the PaperQA Workflow

PaperQA has a specific workflow for managing papers:

1. **Adding papers**: When you add a paper with `add_paper()`, it:
   - Downloads the PDF if it's a URL (must be a direct PDF link)
   - Saves the PDF to the paper directory
   - Processes the paper with PaperQA

2. **Indexing papers**: For papers to be searchable, they need to be indexed:
   - By default, `add_paper()` has `auto_index=True` which automatically builds the index
   - For adding multiple papers, you can set `auto_index=False` and then manually call `build_index()`
   - You can also use the CLI command: `aurelian paperqa index`

3. **Listing papers**: The `list_papers()` function shows both:
   - PDF files found in the paper directory 
   - Papers that have been successfully indexed
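The batch pattern from step 2 (add several papers with `auto_index=False`, then index once) might look like the sketch below. The wrapper `add_many` is hypothetical. It takes the two tool functions as arguments and assumes they are called as `add_paper(ctx, source, auto_index=...)` and `build_index(ctx)`, as elsewhere in this notebook. Stand-in functions are used so the sketch runs without Aurelian installed.

```python
import asyncio


async def add_many(ctx, sources, add_paper, build_index):
    """Add every source with indexing deferred, then build the index once."""
    added = []
    for source in sources:
        # auto_index=False skips the per-paper index rebuild
        added.append(await add_paper(ctx, source, auto_index=False))
    index_result = await build_index(ctx)
    return added, index_result


# Stand-ins that only record the call order (hypothetical). In the notebook,
# pass the real add_paper and build_index from
# aurelian.agents.paperqa.paperqa_tools and use `await add_many(...)` directly.
calls = []


async def fake_add_paper(ctx, source, auto_index=True):
    calls.append(("add", source, auto_index))
    return {"success": True}


async def fake_build_index(ctx):
    calls.append(("index",))
    return {"success": True, "pdf_files_count": 2}


added, index_result = asyncio.run(
    add_many(None, ["a.pdf", "b.pdf"], fake_add_paper, fake_build_index)
)
print(calls)
```

Deferring the index build means the index is rebuilt once for the whole batch instead of once per paper.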

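Because `add_paper()` already built the index with `auto_index=True`, a rebuild is not strictly needed here. The next cell calls `build_index()` explicitly to show what it returns: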

In [5]:
# Rebuild the index explicitly (it was already built by auto_index=True above)
from aurelian.agents.paperqa.paperqa_tools import build_index

index_result = await build_index(ctx)
index_result

Building index for 1 PDF files in /var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmpmwujoavx/papers...


{'success': True,
 'paper_directory': '/var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmpmwujoavx/papers',
 'pdf_files_count': 1,
 'indexed_papers_count': 1,
 'indexed_papers': ['2203.06566.pdf'],
 'message': 'Successfully indexed 1 papers out of 1 PDF files.'}

## Searching for Papers

Now let's search for papers related to a specific topic.

In [6]:
# Search for papers about question answering
from aurelian.agents.paperqa.paperqa_tools import search_papers

search_results = await search_papers(ctx, "question answering with scientific papers")
search_results

AnswerResponse(session=PQASession(id=UUID('e1bca311-ff17-405a-babe-0b6f4c6d0ee5'), question='Find scientific papers about: question answering with scientific papers', answer='I cannot answer.', answer_reasoning=None, has_successful_answer=False, context="wu2022promptchainerchaininglarge pages 1-1: The paper 'PromptChainer: Chaining Large Language Model Prompts through Visual Programming' explores the process of authoring chains of large language model (LLM) prompts to handle complex tasks that cannot be addressed with a single LLM run. The authors identify user needs such as data transformation between chain steps and debugging at multiple granularities. They introduce PromptChainer, an interactive visual programming interface, to support users in building and prototyping LLM chains. Case studies demonstrate its utility for designers and developers in creating AI-infused applications. The paper also discusses challenges in scaling chains for more complex tasks and low-fidelity prototyp
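The result above reports `answer='I cannot answer.'` with `has_successful_answer=False`, so it can help to check that flag before using the answer text. Below is a minimal sketch, assuming the response exposes `session.answer` and `session.has_successful_answer` as attributes, which is what the printed repr suggests. The helper `answer_or_none` is hypothetical, and a stand-in object replaces the real response.

```python
from types import SimpleNamespace


def answer_or_none(response):
    """Return the answer text, or None when no successful answer was reported."""
    session = response.session
    if not session.has_successful_answer:
        return None
    return session.answer


# Stand-in mirroring the two fields visible in the repr above
unanswered = SimpleNamespace(
    session=SimpleNamespace(answer="I cannot answer.", has_successful_answer=False)
)
print(answer_or_none(unanswered))  # None
```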

## Querying Papers to Answer Questions

Now let's ask a question about the papers in our collection.

In [7]:
# Query papers to answer a question
from aurelian.agents.paperqa.paperqa_tools import query_papers

answer = await query_papers(ctx, "What are the main challenges in question answering with scientific literature?")
answer

AnswerResponse(session=PQASession(id=UUID('7b0e650b-6572-44a5-bb80-0c38c45c9272'), question='What are the main challenges in question answering with scientific literature?', answer="Question answering with scientific literature using large language models (LLMs) presents several challenges. First, the complexity of multi-step tasks often requires chaining multiple LLM prompts, where outputs from one step serve as inputs for the next. Designing and authoring these chains effectively is difficult, as it involves decomposing tasks into sub-tasks and iteratively testing prompts (wu2022promptchainerchaininglarge pages 1-2). \n\nSecond, the versatile and open-ended nature of LLMs introduces challenges in managing their outputs. Users must develop a mental model of the LLM's capabilities, and the arbitrary string formats of outputs make data transformation and processing nontrivial (wu2022promptchainerchaininglarge pages 3-4). Additionally, cascading errors can occur due to the black-box natu
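The answer text above cites its sources inline, in the form `(wu2022promptchainerchaininglarge pages 1-2)`. As a sketch, these citations could be collected with a regular expression. The pattern is inferred only from the output shown here and may not cover every citation format PaperQA emits. The helper `cited_pages` is hypothetical.

```python
import re

# Matches "<citation key> pages <start>-<end>", as seen in the answer text
CITATION = re.compile(r"\b(\w+) pages (\d+)-(\d+)")


def cited_pages(answer_text):
    """Map each citation key to the sorted page ranges cited for it."""
    found = {}
    for key, start, end in CITATION.findall(answer_text):
        found.setdefault(key, set()).add((int(start), int(end)))
    return {key: sorted(ranges) for key, ranges in found.items()}


# Two fragments excerpted from the answer printed above
sample = (
    "iteratively testing prompts (wu2022promptchainerchaininglarge pages 1-2). "
    "processing nontrivial (wu2022promptchainerchaininglarge pages 3-4)."
)
print(cited_pages(sample))
# {'wu2022promptchainerchaininglarge': [(1, 2), (3, 4)]}
```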

## Using the Agent Interface

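The examples above call the tools directly. You can also use the agent interface, which accepts natural-language requests. The cell below relies on `paperqa_agent` and `paperqa_config` from the initialization cell; if the kernel has been restarted (note the `In [1]` counter), rerun that cell first, otherwise the call fails with the `NameError` shown.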

In [1]:
# Use the agent interface to interact with the papers
response = await paperqa_agent.run(
    "What methods are used for embedding or representing scientific papers in the literature?",
    deps=paperqa_config
)

print(response.data)

NameError: name 'paperqa_agent' is not defined

## Cleanup

Let's clean up our temporary directory.

In [None]:
import shutil
shutil.rmtree(temp_dir)
print("Temporary directory removed.")
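If the whole workflow runs in a single cell or script, `tempfile.TemporaryDirectory` is an alternative to the explicit `shutil.rmtree()` call, because it removes the directory when the `with` block exits, even if an exception is raised. This is a sketch of the directory handling only; the agent calls are omitted.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_dir:
    papers_dir = os.path.join(temp_dir, "papers")
    os.makedirs(papers_dir, exist_ok=True)
    created = os.path.isdir(papers_dir)
    # Configure the agent and run queries inside this block

# The directory and its contents are gone once the block exits
removed = not os.path.exists(temp_dir)
print(created, removed)  # True True
```

A context manager cannot span several notebook cells, so `mkdtemp()` with a manual cleanup cell remains the better fit for a notebook like this one.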

## Conclusion

You've now seen how to use the Aurelian PaperQA integration to:
- Configure the agent with custom settings
- Add papers to your collection
- Build the index that makes papers searchable
- Search for papers on specific topics
- Query papers to answer scientific questions

This integration makes it easy to work with scientific literature and extract insights from research papers.