# PaperQA with Aurelian

This notebook demonstrates how to use the Aurelian PaperQA integration to search, analyze, and query scientific papers. The PaperQA agent allows you to:

1. Search for papers on specific topics
2. Add papers to your collection from files or URLs
3. Query papers to answer scientific questions
4. List papers in your collection

## Setup

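First, load environment variables (such as API keys) from a `.env` file, then enable nested asyncio event loops, which Jupyter needs when a library starts its own event loop.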

In [1]:
from dotenv import load_dotenv
load_dotenv("../../.env")

True

In [2]:
import nest_asyncio

# Jupyter already runs an event loop; nest_asyncio lets libraries start a nested one
nest_asyncio.apply()

## Initialize the PaperQA Agent

Now we'll import and initialize the PaperQA agent with custom settings.

In [3]:
import os
import tempfile
from aurelian.agents.paperqa.paperqa_agent import paperqa_agent
from aurelian.agents.paperqa.paperqa_config import get_config

# Create a temporary directory for our papers
temp_dir = tempfile.mkdtemp()
papers_dir = os.path.join(temp_dir, "papers")
os.makedirs(papers_dir, exist_ok=True)

# Configure the agent
paperqa_config = get_config()
paperqa_config.paper_directory = papers_dir

# Optionally customize other settings
paperqa_config.llm = "gpt-4o-2024-11-20"  # Or any other supported model
paperqa_config.embedding = "text-embedding-3-small"  # Default embedding model
paperqa_config.temperature = 0.2

print(f"Papers will be stored in: {papers_dir}")

Papers will be stored in: /var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmptekicdt3/papers


## Adding Papers to the Collection

You can add papers to your collection either from local files or from URLs. Let's add a paper from a URL. 

> **Note**: When adding a paper by URL, make sure the URL is a direct link to a PDF file (ending with `.pdf`).
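
As a rough pre-check for the note above, the sketch below tests whether a URL looks like a direct PDF link before you pass it to `add_paper()`. The helper name `looks_like_direct_pdf_url` is hypothetical and not part of Aurelian. It inspects only the shape of the URL and does not confirm that the server actually returns a PDF.

```python
from urllib.parse import urlparse


def looks_like_direct_pdf_url(url: str) -> bool:
    """Return True if the URL uses http(s) and its path ends with .pdf."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # Compare only the path so query strings such as ?download=1 are ignored.
    return parsed.path.lower().endswith(".pdf")


print(looks_like_direct_pdf_url("https://arxiv.org/pdf/2203.06566.pdf"))
print(looks_like_direct_pdf_url("https://arxiv.org/abs/2203.06566"))
```

The first call accepts the direct PDF link, while the second rejects the abstract page, which is the kind of link the note warns against.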

In [4]:
# Let's use the add_paper function directly
from aurelian.agents.paperqa.paperqa_tools import add_paper
from pydantic_ai import RunContext

# Create a run context with our config
ctx = RunContext(deps=paperqa_config, model=None, usage=None, prompt=None)

# Add a paper from a URL with auto_index=True to automatically build the index
url = "https://arxiv.org/pdf/2203.06566.pdf"  # PromptChainer paper (Wu et al., 2022)
result = await add_paper(ctx, url, auto_index=True)
result

Downloaded https://arxiv.org/pdf/2203.06566.pdf to /var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmptekicdt3/papers/2203.06566.pdf


CROSSREF_MAILTO environment variable not set. Crossref API rate limits may apply.
CROSSREF_API_KEY environment variable not set. Crossref API rate limits may apply.
SEMANTIC_SCHOLAR_API_KEY environment variable not set. Semantic Scholar API rate limits may apply.


Building index for 1 PDF files in /var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmptekicdt3/papers...


SEMANTIC_SCHOLAR_API_KEY environment variable not set. Semantic Scholar API rate limits may apply.


{'success': True,
 'docname': 'Wu2022',
 'doc': None,
 'index_result': {'success': True,
  'paper_directory': '/var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmptekicdt3/papers',
  'pdf_files_count': 1,
  'indexed_papers_count': 1,
  'indexed_papers': ['2203.06566.pdf'],
  'message': 'Successfully indexed 1 papers out of 1 PDF files.'},
 'message': 'Paper added and indexed successfully. 1 papers now in the index.'}
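
The warnings in the output above name three environment variables that were not set. The run still succeeded, so they appear to be optional, but rate limits may apply without them. The sketch below, which uses only the standard library, reports which of those variables are missing. The helper `missing_env_vars` is a hypothetical name, and the variable list is copied from the warnings rather than from any Aurelian documentation.

```python
import os

# Variable names copied from the warnings printed above.
OPTIONAL_VARS = ("CROSSREF_MAILTO", "CROSSREF_API_KEY", "SEMANTIC_SCHOLAR_API_KEY")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(OPTIONAL_VARS)
if missing:
    print("Not set:", ", ".join(missing))
```

Running this right after `load_dotenv()` shows whether your `.env` file supplies these values before any network requests are made.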

## Understanding the PaperQA Workflow

PaperQA has a specific workflow for managing papers:

1. **Adding papers**: When you add a paper with `add_paper()`, it:
   - Downloads the PDF if it's a URL (must be a direct PDF link)
   - Saves the PDF to the paper directory
   - Processes the paper with PaperQA

2. **Indexing papers**: For papers to be searchable, they need to be indexed:
   - By default, `add_paper()` uses `auto_index=True`, which builds the index automatically
   - For adding multiple papers, you can set `auto_index=False` and then manually call `build_index()`
   - You can also use the CLI command: `aurelian paperqa index`

3. **Listing papers**: The `list_papers()` function shows both:
   - PDF files found in the paper directory 
   - Papers that have been successfully indexed
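
The batch pattern from step 2 (add several papers with `auto_index=False`, then index once) can be sketched as follows. The helper `add_papers_then_index` is hypothetical. It takes the `add_paper` and `build_index` functions as arguments, and the `fake_*` coroutines are stand-ins that only mimic the call shapes used in this notebook, so the sketch runs without Aurelian installed.

```python
import asyncio


async def add_papers_then_index(ctx, sources, add_paper, build_index):
    """Add every source without indexing, then build the index once."""
    results = []
    for source in sources:
        # auto_index=False defers indexing until all papers are added.
        results.append(await add_paper(ctx, source, auto_index=False))
    index_result = await build_index(ctx)
    return results, index_result


# Stand-ins (not Aurelian code) so the sketch runs on its own.
async def fake_add_paper(ctx, source, auto_index=True):
    return {"success": True, "source": source, "indexed": auto_index}


async def fake_build_index(ctx):
    return {"success": True}


demo_results, demo_index = asyncio.run(
    add_papers_then_index(None, ["a.pdf", "b.pdf"], fake_add_paper, fake_build_index)
)
print(len(demo_results), demo_index["success"])
```

In the notebook you would pass the real functions instead, for example `await add_papers_then_index(ctx, urls, add_paper, build_index)`, assuming `urls` is your own list of direct PDF links.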

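Because `add_paper()` was called with `auto_index=True`, the paper is already indexed. The next cell calls `build_index()` directly to show how to rebuild the index by hand: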

In [5]:
# Rebuild the index manually (auto_index=True already built it once above)
from aurelian.agents.paperqa.paperqa_tools import build_index

index_result = await build_index(ctx)
index_result

Building index for 1 PDF files in /var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmptekicdt3/papers...


{'success': True,
 'paper_directory': '/var/folders/_h/52yfmvlj0ylc1jxpx9tv_j8w0000gn/T/tmptekicdt3/papers',
 'pdf_files_count': 1,
 'indexed_papers_count': 1,
 'indexed_papers': ['2203.06566.pdf'],
 'message': 'Successfully indexed 1 papers out of 1 PDF files.'}
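
The result dictionary above reports both how many PDF files were found and which ones were indexed. A small check like the one below can flag files that failed to index. It relies only on the `indexed_papers` key visible in the output. The helper name `unindexed_files` and the file `other.pdf` are hypothetical.

```python
def unindexed_files(index_result, pdf_files):
    """Return PDF file names that are absent from index_result['indexed_papers']."""
    indexed = set(index_result.get("indexed_papers", []))
    return [name for name in pdf_files if name not in indexed]


# Values copied from the output above; "other.pdf" is a made-up extra file.
example_result = {
    "success": True,
    "pdf_files_count": 1,
    "indexed_papers_count": 1,
    "indexed_papers": ["2203.06566.pdf"],
}
print(unindexed_files(example_result, ["2203.06566.pdf", "other.pdf"]))
```

In the notebook, something like `unindexed_files(index_result, sorted(os.listdir(papers_dir)))` would compare the directory contents against the index.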

## Searching for Papers

Now let's search for papers related to a specific topic.

In [6]:
# Search for papers about question answering
from aurelian.agents.paperqa.paperqa_tools import search_papers

search_results = await search_papers(ctx, "question answering with scientific papers")
search_results

AnswerResponse(session=PQASession(id=UUID('18c2c191-7477-4c53-912e-e9139da870d7'), question='Find scientific papers about: question answering with scientific papers', answer="PromptChainer is a visual programming tool designed to facilitate the creation of complex, multi-step workflows using large language models (LLMs). It enables users to chain multiple LLM prompts, where the output of one step serves as the input for the next. This approach is particularly useful for tasks such as question answering with scientific papers, where a single prompt is often insufficient to handle the complexity of the task. Users can visually design and test workflows that include steps like extracting, classifying, and summarizing information from academic texts (wu2022promptchainerchaininglarge pages 9-10).\n\nPromptChainer supports modular and interpretable pipelines, allowing users to customize workflows for unstructured queries. For example, it can be used to process scientific paper queries by cha

## Querying Papers to Answer Questions

Now let's ask a question about the papers in our collection.

In [7]:
# Query papers to answer a question
from aurelian.agents.paperqa.paperqa_tools import query_papers

answer = await query_papers(ctx, "What are the main challenges in question answering with scientific literature?")
answer

AnswerResponse(session=PQASession(id=UUID('d40dfa1b-3adc-4544-a06f-eb97924846f9'), question='What are the main challenges in question answering with scientific literature?', answer='Question answering with scientific literature presents several challenges, particularly when leveraging large language models (LLMs). One major issue is the need to decompose complex tasks into multiple sub-tasks, each requiring separate LLM prompts. This chaining approach increases transparency and control but introduces difficulties in designing and coordinating prompts, managing information flow between steps, and efficiently prototyping applications (wu2022promptchainerchaininglarge pages 1-2). \n\nErrors in one step can propagate through the chain, leading to cascading failures that hinder accurate results. Debugging such chains is particularly challenging due to the black-box nature of LLMs and the interdependencies between prompts (wu2022promptchainerchaininglarge pages 2-3, wu2022promptchainerchaini
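
The raw `AnswerResponse` repr is hard to read. Judging from the repr above, the answer text sits at `response.session.answer`, and citations appear inline as a key followed by `pages` and a page range. The sketch below pulls both out. The helper names are hypothetical, the regular expression is inferred from the output shown here rather than from a documented format, and the `SimpleNamespace` object is a stand-in, not a real `AnswerResponse`.

```python
import re
from types import SimpleNamespace

# Matches inline citations such as "somekey pages 9-10".
CITATION = re.compile(r"(\w+) pages (\d+(?:-\d+)?)")


def answer_text(response):
    """Return the answer string nested at response.session.answer."""
    return response.session.answer


def cited_pages(text):
    """Return (citation_key, page_range) pairs found in the answer text."""
    return CITATION.findall(text)


# Stand-in object mimicking the shape of the repr above.
demo = SimpleNamespace(
    session=SimpleNamespace(
        answer="Chaining helps (wu2022promptchainerchaininglarge pages 9-10)."
    )
)
print(cited_pages(answer_text(demo)))
```

In the notebook, `print(answer_text(answer))` would show just the prose of the answer instead of the full object.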

## Using the Agent Interface

The examples above call the tools directly. You can also use the agent interface, which accepts natural-language requests and performs the corresponding actions for you.

In [8]:
# Use the agent interface to interact with the papers
response = await paperqa_agent.run(
    "What is CRISPR? Add relevant papers",
    deps=paperqa_config
)

print(response.data)

I was unable to retrieve relevant information or papers about CRISPR at this time. You can consider providing me with a direct link to a paper if you have one, or specify more criteria to refine the search!


## Cleanup

Let's clean up our temporary directory.

In [9]:
import shutil
shutil.rmtree(temp_dir)
print("Temporary directory removed.")

Temporary directory removed.
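
As an alternative to calling `shutil.rmtree()` by hand, the standard library's `tempfile.TemporaryDirectory` removes the directory automatically. The sketch below shows the pattern in a single block. The name `scratch_dir` is chosen here only to avoid clashing with the notebook's `temp_dir`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch_dir:
    scratch_papers = os.path.join(scratch_dir, "papers")
    os.makedirs(scratch_papers, exist_ok=True)
    # ... configure the agent and work with papers here ...
    created = os.path.isdir(scratch_papers)

# The directory and its contents are removed when the with-block exits.
removed = not os.path.exists(scratch_dir)
print(created, removed)
```

A `with` block cannot span several notebook cells, so in a notebook you may prefer to keep the `TemporaryDirectory` object in a variable and call its `cleanup()` method in the final cell.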


## Running the PaperQA Agent UI

The Aurelian framework provides a convenient way to launch the PaperQA agent as a Gradio UI. This allows you to interact with your papers through a chat interface.

To start the PaperQA agent UI from the command line, use:

```bash
# Use the generic agent runner with the paperqa agent
aurelian agent --agent paperqa --ui

# Optional parameters:
# --workdir /path/to/papers    # Set a specific working directory
# --model gpt-3.5-turbo-0125   # Use a specific model (cheaper alternative)
# --share                      # Create a shareable link
# --server-port 7860           # Run on a specific port
```

In the UI, you can:
- Search for papers: "Search for papers on CRISPR gene editing"
- Ask questions: "What are the main challenges in CRISPR gene editing?"
- Add papers: "Add this paper: https://example.com/paper.pdf"
- List papers: "Show me all the papers in the collection"
- Rebuild index: "Rebuild the search index"

The PaperQA agent will process your natural language requests and perform the corresponding actions.

## Conclusion

You've now seen how to use the Aurelian PaperQA integration to:
- Configure the agent with custom settings
- Add papers to your collection
- Build and rebuild the search index
- Search for papers on specific topics
- Query papers to answer scientific questions

This integration makes it easy to work with scientific literature and extract insights from research papers.