# arxiv reader

This example demonstrates a technique for building an understanding of long documents, in this case recent arXiv papers from the AI category.

The basic approach is to iterate over the document in chunks and ask the LLM to generate a synthetic dataset of thoughts about each chunk. The thoughts and the chunk are then embedded and can be recalled in subsequent iterations by querying the collection. Then we'll use the generated thoughts to ask questions about the document.
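
As a rough illustration of that loop, here is a minimal, self-contained sketch. It does not use promptx or a real model: `embed`, `ToyCollection`, `fake_llm`, and `read_chunks` are hypothetical names, the "embedding" is a plain bag-of-words count, and the LLM call is replaced by a stub. Only the store-and-recall flow is meant to carry over to the real notebook.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyCollection:
    """In-memory stand-in for an embedded collection (an assumption, not promptx)."""
    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def store(self, *texts):
        for text in texts:
            self.items.append((text, embed(text)))

    def query(self, text, limit=3):
        q = embed(text)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [stored for stored, _ in ranked[:limit]]

def fake_llm(passage, recalled):
    """Stub for the LLM call: returns a single 'thought' per passage."""
    return [f'note on: {passage}']

def read_chunks(chunks, collection):
    thoughts = []
    for chunk in chunks:
        recalled = collection.query(chunk)   # recall related material
        new = fake_llm(chunk, recalled)      # generate thoughts for this chunk
        collection.store(chunk, *new)        # embed the chunk and its thoughts
        thoughts += new
    return thoughts
```

In the notebook itself, promptx's `store` and `query` play the role of `ToyCollection`, and `prompt` replaces `fake_llm`.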

First, we need to define the data types that we'll be generating. Because we're embedding the generated objects we're using `Entity`, which is a thin wrapper on top of `pydantic.BaseModel`.

In [1]:
from enum import Enum
from pydantic import Field
from promptx.collection import Entity


class Document(Entity):
    title: str
    abstract: str
    url: str

class Quote(Entity):
    text: str
    source: Document
    start: int
    end: int

class ThoughtCategory(str, Enum):
    fact = 'fact'
    opinion = 'opinion'
    idea = 'idea'
    connection = 'connection'
    belief = 'belief'

class Thought(Entity):
    value: str
    category: ThoughtCategory
    confidence: float
    source: Entity = Field(None, generate=False)

2023-11-02 04:11:33.462 | INFO     | promptx:load:104 - loading local app from /home/rjl/promptx/examples/arxiv-reader
2023-11-02 04:11:33.465 | INFO     | promptx:load:107 - loaded environment variables from /home/rjl/promptx/examples/arxiv-reader/.env
2023-11-02 04:11:33.466 | INFO     | promptx:load:108 - API KEY wMeGC


Next, we need to get the data from arxiv. We'll use `requests` to fetch the pages and `BeautifulSoup` to parse them into a `Document` instance.

In [2]:
from typing import *
from pydantic import Field
import requests
from bs4 import BeautifulSoup


def get_arxiv_urls():
    response = requests.get('https://arxiv.org/list/cs.AI/recent')
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')
    urls = [f"https://arxiv.org{a.attrs['href']}" for a in soup.find_all('a', title='Abstract')]
    return urls

def extract_whitepaper_from_arxiv(url):
    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('h1', class_='title').text.replace('Title:', '')
    abstract = soup.find('blockquote', class_='abstract').text.replace('Abstract:', '')
    url = soup.find('a', class_='download-pdf').attrs['href']
    url = f"https://arxiv.org{url}"

    return Document(
        title=title,
        abstract=abstract,
        url=url,
    )

Now we can use these functions to fetch the latest papers, select one at random, and extract the data from the HTML content.

In [3]:
import random

try:
    urls = get_arxiv_urls()
    url = random.choice(urls)
    paper = extract_whitepaper_from_arxiv(url)
    print(paper)
except Exception as e:
    print(f'Error loading {e}')

id='2349300d-a5cc-4ca4-9ac4-7f14a36468a2' type='document' title='Minimally Modifying a Markov Game to Achieve Any Nash Equilibrium and Value' abstract='\n  We study the game modification problem, where a benevolent game designer or a\nmalevolent adversary modifies the reward function of a zero-sum Markov game so\nthat a target deterministic or stochastic policy profile becomes the unique\nMarkov perfect Nash equilibrium and has a value within a target range, in a way\nthat minimizes the modification cost. We characterize the set of policy\nprofiles that can be installed as the unique equilibrium of some game, and\nestablish sufficient and necessary conditions for successful installation. We\npropose an efficient algorithm, which solves a convex optimization problem with\nlinear constraints and then performs random perturbation, to obtain a\nmodification plan with a near-optimal cost.\n\n    ' url='https://arxiv.org/pdf/2311.00582.pdf'
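
The abstract printed above still carries the line breaks and padding from the HTML page. One way to tidy such fields before storing them is sketched below. `clean_field` is a hypothetical helper, not part of the notebook, and unlike the `replace` calls above it only strips a label at the start of the text.

```python
import re

def clean_field(text, label):
    """Drop a leading label such as 'Abstract:' and collapse runs of whitespace."""
    text = text.strip()
    if text.startswith(label):
        text = text[len(label):]
    return re.sub(r'\s+', ' ', text).strip()

raw = 'Abstract:\n  We study the game modification problem, where a benevolent game designer or a\nmalevolent adversary modifies the reward function\n\n    '
print(clean_field(raw, 'Abstract:'))
```

It could be applied to `title` and `abstract` inside `extract_whitepaper_from_arxiv` before the `Document` is constructed.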


The document instance only has data about the paper and doesn't contain the actual text. Let's create a function to extract text from a PDF given a path or URL.

In [4]:
import os
import uuid

import PyPDF2
import requests

def load_pdf(filepath_or_url):
    """
    Load content of a PDF from either a file path or a remote URL.
    
    :param filepath_or_url: File path or URL to fetch the PDF from.
    :return: Content of the PDF as a string.
    """
    
    # Handle remote URL: download the PDF to ./data before reading it
    if filepath_or_url.startswith(("http://", "https://")):
        response = requests.get(filepath_or_url)
        response.raise_for_status()
        os.makedirs('./data', exist_ok=True)  # the target directory may not exist yet
        file_id = str(uuid.uuid4())
        filepath_or_url = f'./data/{file_id}.pdf'
        with open(filepath_or_url, 'wb') as pdf:
            pdf.write(response.content)
    
    with open(filepath_or_url, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        # Guard against pages for which no text can be extracted
        text_content = ''.join([page.extract_text() or '' for page in pdf_reader.pages])
    return text_content

In [5]:
pdf = load_pdf(paper.url)
print(f'Loaded pdf with {len(pdf)} characters')

Loaded pdf with 64217 characters


We have the full text, but how do we split it into chunks that are small enough to be processed by the LLM? You could do this in a number of ways, but we'll use `spacy`, a popular NLP library, to split the text into sentences and then group them into chunks of 512 tokens.

In [6]:
import spacy
import en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp(pdf)

Now that we have a parsed `spacy` document, we can split it into sentences.

In [7]:
sentences = doc.sents
random.choice(list(sentences))


The stan-
dard rock paper scissors game is a special case when
the sizes are 3, hence the name.
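
The sampled sentence shows line-break hyphenation carried over from the PDF ("stan-" followed by "dard"). A possible cleanup step is sketched below. `dehyphenate` is a hypothetical helper, and the rule is a heuristic: it would also join a genuine compound that happens to break across lines, such as "zero-" / "sum".

```python
import re

def dehyphenate(text):
    """Join words split by a hyphen at a line break, then flatten remaining line breaks."""
    joined = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    return re.sub(r'\s*\n\s*', ' ', joined).strip()

sample = 'The stan-\ndard rock paper scissors game is a special case when\nthe sizes are 3, hence the name.'
print(dehyphenate(sample))
# The standard rock paper scissors game is a special case when the sizes are 3, hence the name.
```

Running it on the raw PDF text before passing it to `spacy` may also give the sentence splitter cleaner input.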

We could iterate over each sentence individually, but that would be slow and expensive, and some sentences are too short to convey meaning on their own. Instead, we'll group them into batches, or passages, and generate thoughts based on each passage of text. Before we do that, let's define a helper function to process the sentences in batches.

In [8]:
def batch(generator, bs=1, limit=None):
    """Yield lists of up to `bs` items, stopping after `limit` items in total."""
    b = []
    for i, item in enumerate(generator):
        # Count items (not batches) against the limit
        if limit is not None and i >= limit:
            break
        b.append(item)
        if len(b) == bs:
            yield b
            b = []
    if b:  # Yield any remaining items in the batch
        yield b

This function yields the generator's items in chunks defined by the batch size, up to a total number of processed items.
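
For comparison, the same idea can be written with `itertools.islice`. This is an alternative sketch rather than the notebook's helper, and `batch_islice` is a hypothetical name. It treats `limit` as a cap on the total number of items and always yields a final partial batch.

```python
from itertools import islice

def batch_islice(iterable, bs=1, limit=None):
    """Yield lists of up to `bs` items, consuming at most `limit` items overall."""
    it = iter(iterable)
    if limit is not None:
        it = islice(it, limit)  # cap the total number of items
    while True:
        chunk = list(islice(it, bs))
        if not chunk:
            return
        yield chunk

print(list(batch_islice(range(7), bs=3)))           # [[0, 1, 2], [3, 4, 5], [6]]
print(list(batch_islice(range(7), bs=3, limit=4)))  # [[0, 1, 2], [3]]
```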

In [9]:
for chunk in batch(doc.sents, bs=5, limit=100):
    print(chunk)

[Minimally Modifying a Markov Game to Achieve Any Nash
Equilibrium and Value
Young Wu Jeremy McMahan Yiding Chen Yudong Chen Xiaojin Zhu Qiaomin Xie
University of Wisconsin Madison
Abstract
We study the game modification problem,
where a benevolent game designer or a malev-
olent adversary modifies the reward function
of a zero-sum Markov game so that a tar-
get deterministic or stochastic policy pro-
file becomes the unique Markov perfect Nash
equilibrium and has a value within a target
range, in a way that minimizes the modifica-
tion cost., We characterize the set of policy
profiles that can be installed as the unique
equilibrium of some game, and establish suf-
ficient and necessary conditions for success-
ful installation., We propose an efficient al-
gorithm, which solves a convex optimization
problem with linear constraints and then per-
forms random perturbation, to obtain a mod-
ification plan with a near-optimal cost.
, 1 Introduction
Consider a two-player zero-sum Markov gam
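
The batches above are sized by sentence count rather than by the 512-token budget mentioned earlier. A token-budgeted grouping could look like the sketch below. `group_by_token_budget` is a hypothetical helper, and it counts whitespace-separated words as a stand-in for model tokens, which is only an approximation.

```python
def group_by_token_budget(sentences, max_tokens=512):
    """Group consecutive sentences so each chunk stays within `max_tokens` words."""
    chunk, used = [], 0
    for sentence in sentences:
        n = len(sentence.split())  # crude token count: whitespace-separated words
        if chunk and used + n > max_tokens:
            yield chunk
            chunk, used = [], 0
        chunk.append(sentence)
        used += n
    if chunk:  # emit the final, possibly smaller, chunk
        yield chunk

print(list(group_by_token_budget(['a b c', 'd e', 'f g h i', 'j'], max_tokens=5)))
# [['a b c', 'd e'], ['f g h i', 'j']]
```

A sentence longer than the budget still gets a chunk of its own, so the budget is a target rather than a hard guarantee.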

Now we can define the logic for thought generation. It's often tempting to overcomplicate prompts, but it can be difficult to know whether more information is actually helping. Often, less is more as it allows the model to focus more effectively.

Let's start by simply generating a list of thoughts based on the current passage.

In [10]:
from promptx import prompt

def read(doc: list[str], bs=5, limit=None):
    thoughts = []
    for chunk in batch(doc, bs=bs, limit=limit):
        print(f'Passage: {chunk}')
        output = prompt(
            '''
            Given a passage from a document, generate a list of thoughts about the passage.
            ''',
            input=dict(
                passage=chunk,
            ),
            output=[Thought],
        ).objects

        print(f'Thoughts: {[t.value for t in output]}')
        thoughts += output
    return thoughts

In [11]:
thoughts = read([sentence.text for sentence in doc.sents], limit=100)
[thought.value for thought in thoughts]

Passage: ['Minimally Modifying a Markov Game to Achieve Any Nash\nEquilibrium and Value\nYoung Wu Jeremy McMahan Yiding Chen Yudong Chen Xiaojin Zhu Qiaomin Xie\nUniversity of Wisconsin Madison\nAbstract\nWe study the game modification problem,\nwhere a benevolent game designer or a malev-\nolent adversary modifies the reward function\nof a zero-sum Markov game so that a tar-\nget deterministic or stochastic policy pro-\nfile becomes the unique Markov perfect Nash\nequilibrium and has a value within a target\nrange, in a way that minimizes the modifica-\ntion cost.', 'We characterize the set of policy\nprofiles that can be installed as the unique\nequilibrium of some game, and establish suf-\nficient and necessary conditions for success-\nful installation.', 'We propose an efficient al-\ngorithm, which solves a convex optimization\nproblem with linear constraints and then per-\nforms random perturbation, to obtain a mod-\nification plan with a near-optimal cost.\n', '1 Introduction\nCo


[1m[[0m
    [32m'Minimally Modifying a Markov Game to Achieve Any Nash Equilibrium and Value'[0m,
    [32m'We study the game modification problem'[0m,
    [32m'A benevolent game designer or a malevolent adversary modifies the reward function of a zero-sum Markov game'[0m,
    [32m'A target deterministic or stochastic policy profile becomes the unique Markov perfect Nash equilibrium and has a value within a target range'[0m,
    [32m'Minimizes the modification cost'[0m,
    [32m'We characterize the set of policy profiles that can be installed as the unique equilibrium of some game'[0m,
    [32m'Establish sufficient and necessary conditions for successful installation'[0m,
    [32m'Propose an efficient algorithm'[0m,
    [32m'Solves a convex optimization problem with linear constraints and then performs random perturbation'[0m,
    [32m'Obtain a modification plan with a near-optimal cost'[0m,
    [32m'A two-player zero-sum Markov game G˝“ pR˝, P˝q with payoff matr

For example, if we replace the instructions with something like:

```
You are an AI researcher reading a whitepaper.
Given a passage of text from the paper, generate a list of thoughts about the passage.
```

This produces very similar results, but the prompt is now far less reusable because it is tied to one specific kind of document.

Instead, let's try to improve the results by providing some more context in each prompt.

In [12]:
from promptx import prompt

def read(doc: list[str], bs=5, limit=None):
    thoughts = []
    recent_thoughts = []
    previous_passage = None
    for chunk in batch(doc, bs=bs, limit=limit):
        print(f'Passage: {chunk}')
        output = prompt(
            '''
            Given a passage from a document, generate a list of thoughts about the passage.
            ''',
            input=dict(
                passage=chunk,
                previous_passage=previous_passage,
                recent_thoughts=[t.value for t in recent_thoughts],
            ),
            output=[Thought],
        ).objects

        print(f'Thoughts: {[t.value for t in output]}')
        thoughts += output
        previous_passage = chunk
        recent_thoughts = (output + recent_thoughts)[:5]
    return thoughts
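
The line `recent_thoughts = (output + recent_thoughts)[:5]` keeps a newest-first window of at most five thoughts. The sketch below isolates that behaviour with plain strings. `update_recent` is a hypothetical name used only for illustration.

```python
def update_recent(recent, new, limit=5):
    """Put the newest thoughts first and keep at most `limit` of them."""
    return (new + recent)[:limit]

recent = []
recent = update_recent(recent, ['t1', 't2', 't3'])
recent = update_recent(recent, ['t4', 't5', 't6'])
print(recent)  # ['t4', 't5', 't6', 't1', 't2']
```

One consequence is that a passage producing five or more thoughts pushes every older thought out of the window, so the context then only reflects the most recent passage.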

In [13]:
thoughts = read([sentence.text for sentence in doc.sents], limit=100)
[thought.value for thought in thoughts]

Passage: ['Minimally Modifying a Markov Game to Achieve Any Nash\nEquilibrium and Value\nYoung Wu Jeremy McMahan Yiding Chen Yudong Chen Xiaojin Zhu Qiaomin Xie\nUniversity of Wisconsin Madison\nAbstract\nWe study the game modification problem,\nwhere a benevolent game designer or a malev-\nolent adversary modifies the reward function\nof a zero-sum Markov game so that a tar-\nget deterministic or stochastic policy pro-\nfile becomes the unique Markov perfect Nash\nequilibrium and has a value within a target\nrange, in a way that minimizes the modifica-\ntion cost.', 'We characterize the set of policy\nprofiles that can be installed as the unique\nequilibrium of some game, and establish suf-\nficient and necessary conditions for success-\nful installation.', 'We propose an efficient al-\ngorithm, which solves a convex optimization\nproblem with linear constraints and then per-\nforms random perturbation, to obtain a mod-\nification plan with a near-optimal cost.\n', '1 Introduction\nCo


[1m[[0m
    [32m'Minimally Modifying a Markov Game to Achieve Any Nash Equilibrium and Value'[0m,
    [32m'The passage discusses the game modification problem'[0m,
    [32m'The passage proposes an efficient algorithm to solve the game modification problem'[0m,
    [32m'The passage mentions the existence of at least one Markov Perfect Equilibrium in a two-player zero-sum Markov game'[0m,
    [32m'The passage references the work of Maskin and Tirole [0m[32m([0m[32m2001[0m[32m)[0m[32m in relation to Markov Perfect Equilibrium'[0m,
    [32m'All the MPEs of G˝have the same game value, which is the expected payoff for player 1 and loss for player 2 at equilibrium.'[0m,
    [32m'In the special case where the Markov game has H“1 stage, it reduces to a matrix normal form game; the Markov Perfect Equilibrium reduces to a Nash Equilibrium [0m[32m([0m[32mNE[0m[32m)[0m[32m.'[0m,
    [32m'There may be reasons for a third party to prefer an outcome with a different M

In [14]:
from promptx import prompt

def read(doc: list[str], bs=5, limit=None):
    thoughts = []
    recent_thoughts = []
    previous_passage = None
    for chunk in batch(doc, bs=bs, limit=limit):
        print(f'Passage: {chunk}')
        output = prompt(
            '''
            Given a passage from a document, generate a list of thoughts about the passage.
            ''',
            input=dict(
                passage=chunk,
                previous_passage=previous_passage,
            ),
            output=[Thought],
        ).objects

        print(f'Thoughts: {[t.value for t in output]}')
        thoughts += output
        previous_passage = chunk
        recent_thoughts = (output + recent_thoughts)[:5]
    return thoughts

In [15]:
thoughts = read([sentence.text for sentence in doc.sents], limit=100)
[thought.value for thought in thoughts]

Passage: ['Minimally Modifying a Markov Game to Achieve Any Nash\nEquilibrium and Value\nYoung Wu Jeremy McMahan Yiding Chen Yudong Chen Xiaojin Zhu Qiaomin Xie\nUniversity of Wisconsin Madison\nAbstract\nWe study the game modification problem,\nwhere a benevolent game designer or a malev-\nolent adversary modifies the reward function\nof a zero-sum Markov game so that a tar-\nget deterministic or stochastic policy pro-\nfile becomes the unique Markov perfect Nash\nequilibrium and has a value within a target\nrange, in a way that minimizes the modifica-\ntion cost.', 'We characterize the set of policy\nprofiles that can be installed as the unique\nequilibrium of some game, and establish suf-\nficient and necessary conditions for success-\nful installation.', 'We propose an efficient al-\ngorithm, which solves a convex optimization\nproblem with linear constraints and then per-\nforms random perturbation, to obtain a mod-\nification plan with a near-optimal cost.\n', '1 Introduction\nCo


[1m[[0m
    [32m'Minimally Modifying a Markov Game to Achieve Any Nash Equilibrium and Value'[0m,
    [32m'We study the game modification problem, where a benevolent game designer or a malevolent adversary modifies the reward function of a zero-sum Markov game so that a target deterministic or stochastic policy profile becomes the unique Markov perfect Nash equilibrium and has a value within a target range, in a way that minimizes the modification cost.'[0m,
    [32m'We characterize the set of policy profiles that can be installed as the unique equilibrium of some game, and establish sufficient and necessary conditions for successful installation.'[0m,
    [32m'We propose an efficient algorithm, which solves a convex optimization problem with linear constraints and then performs random perturbation, to obtain a modification plan with a near-optimal cost.'[0m,
    [32m'Consider a two-player zero-sum Markov game G˝“pR˝, P˝qwith payoff matrices R˝and transition probability mat

In [16]:
from promptx import prompt

def read(doc: list[str], bs=5, limit=None):
    thoughts = []
    recent_thoughts = []
    previous_passage = None
    for chunk in batch(doc, bs=bs, limit=limit):
        print(f'Passage: {chunk}')
        output = prompt(
            '''
            Given a passage from a document, generate a list of thoughts about the passage.
            ''',
            input=dict(
                passage=chunk,
                recent_thoughts=recent_thoughts,
            ),
            output=[Thought],
        ).objects

        print(f'Thoughts: {[t.value for t in output]}')
        thoughts += output
        previous_passage = chunk
        recent_thoughts = (output + recent_thoughts)[:5]
    return thoughts

In [17]:
thoughts = read([sentence.text for sentence in doc.sents], limit=100)
[thought.value for thought in thoughts]

Passage: ['Minimally Modifying a Markov Game to Achieve Any Nash\nEquilibrium and Value\nYoung Wu Jeremy McMahan Yiding Chen Yudong Chen Xiaojin Zhu Qiaomin Xie\nUniversity of Wisconsin Madison\nAbstract\nWe study the game modification problem,\nwhere a benevolent game designer or a malev-\nolent adversary modifies the reward function\nof a zero-sum Markov game so that a tar-\nget deterministic or stochastic policy pro-\nfile becomes the unique Markov perfect Nash\nequilibrium and has a value within a target\nrange, in a way that minimizes the modifica-\ntion cost.', 'We characterize the set of policy\nprofiles that can be installed as the unique\nequilibrium of some game, and establish suf-\nficient and necessary conditions for success-\nful installation.', 'We propose an efficient al-\ngorithm, which solves a convex optimization\nproblem with linear constraints and then per-\nforms random perturbation, to obtain a mod-\nification plan with a near-optimal cost.\n', '1 Introduction\nCo


[1m[[0m
    [32m'Minimally Modifying a Markov Game to Achieve Any Nash Equilibrium and Value'[0m,
    [32m'We study the game modification problem,where a benevolent game designer or a malevolent adversary modifies the reward function of a zero-sum Markov game so that a target deterministic or stochastic policy profile becomes the unique Markov perfect Nash equilibrium and has a value within a target range, in a way that minimizes the modification cost.'[0m,
    [32m'We characterize the set of policy profiles that can be installed as the unique equilibrium of some game, and establish sufficient and necessary conditions for successful installation.'[0m,
    [32m'We propose an efficient algorithm, which solves a convex optimization problem with linear constraints and then performs random perturbation, to obtain a modification plan with a near-optimal cost.'[0m,
    [32m'Consider a two-player zero-sum Markov game G∼[0m[32m([0m[32mS,A1,A2,H[0m[32m)[0m[32m'[0m,
    [32m

In [33]:
from promptx import delete_collection

delete_collection('arxiv')

In [34]:
from promptx import prompt, query, store

def read(doc: list[str], bs=5, limit=None):
    thoughts = []
    recent_thoughts = []
    previous_passage = None
    for chunk in batch(doc, bs=bs, limit=limit):
        print(f'Passage: {chunk}')
        try:
            recalled_thoughts = query(*chunk, collection='arxiv-thoughts', limit=3).objects
        except Exception as e:
            recalled_thoughts = []
        
        try:
            recalled_quotes = query(*chunk, collection='arxiv-quotes', limit=3).objects
        except Exception as e:
            recalled_quotes = []

        output = prompt(
            '''
            Given a passage from a document, generate a list of thoughts about the passage.
            ''',
            input=dict(
                passage=chunk,
                previous_passage=previous_passage,
                recent_thoughts=[t.value for t in recent_thoughts],
                recalled_thoughts=[t.value for t in recalled_thoughts],
                recalled_quotes=[q.text for q in recalled_quotes],  # Quote exposes `text`, not `value`
            ),
            output=[Thought],
        ).objects

        print(f'Thoughts: {[t.value for t in output]}')
        thoughts += output
        previous_passage = chunk
        recent_thoughts = (output + recent_thoughts)[:5]

        store(*output, collection='arxiv-thoughts')  # store only the new thoughts, not the whole list again
        store(*[Quote(text=text, source=paper, start=0, end=0) for text in chunk], collection='arxiv-quotes')
    return thoughts

In [35]:
thoughts = read([sentence.text for sentence in doc.sents], limit=100)
[thought.value for thought in thoughts]

Passage: ['Minimally Modifying a Markov Game to Achieve Any Nash\nEquilibrium and Value\nYoung Wu Jeremy McMahan Yiding Chen Yudong Chen Xiaojin Zhu Qiaomin Xie\nUniversity of Wisconsin Madison\nAbstract\nWe study the game modification problem,\nwhere a benevolent game designer or a malev-\nolent adversary modifies the reward function\nof a zero-sum Markov game so that a tar-\nget deterministic or stochastic policy pro-\nfile becomes the unique Markov perfect Nash\nequilibrium and has a value within a target\nrange, in a way that minimizes the modifica-\ntion cost.', 'We characterize the set of policy\nprofiles that can be installed as the unique\nequilibrium of some game, and establish suf-\nficient and necessary conditions for success-\nful installation.', 'We propose an efficient al-\ngorithm, which solves a convex optimization\nproblem with linear constraints and then per-\nforms random perturbation, to obtain a mod-\nification plan with a near-optimal cost.\n', '1 Introduction\nCo

ValueError: No collection found with name arxiv-thoughts

BREAK
---

In [None]:

def batch(generator, bs=1, limit=None):
    """Yield lists of up to `bs` items, stopping after `limit` items in total."""
    b = []
    for i, item in enumerate(generator):
        # Count items (not batches) against the limit
        if limit is not None and i >= limit:
            break
        b.append(item)
        if len(b) == bs:
            yield b
            b = []
    if b:  # Yield any remaining items in the batch
        yield b

In [None]:
from promptx import store, query

collection_name = 'arxiv'
store(paper, collection=collection_name)
query(collection=collection_name)[['title', 'abstract', 'url']]

In [None]:
from promptx import query

paper = query(collection=collection_name).query('type == "document"').sample().first
paper

In [None]:
pdf = load_pdf(paper.url)
print(f'Loaded pdf with {len(pdf)} characters')

In [None]:
import spacy
import en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp(pdf)

In [None]:
from promptx import store, query

for chunk in batch(doc.sents, bs=10, limit=1000):
    store(
        *[
            Quote(
                text=sentence.text,
                source=paper,
                start=sentence.start_char,
                end=sentence.end_char,
            ) 
            for sentence in chunk
        ], 
        collection=collection_name
    )

query(collection=collection_name).query('type == "quote"')

In [None]:
from promptx import prompt

def read_document(doc, bs=5, limit=1000, recall_limit=3, recent_limit=5):
    sentences = doc.sents
    recent_thoughts = []
    previous_passage = None
    for chunk in batch(sentences, bs=bs, limit=limit):
        passage = [sentence.text for sentence in chunk]
        recalled_thoughts = query(*passage, collection=collection_name, limit=recall_limit).query('type == "thought"').objects
        
        thoughts = prompt(
            '''
            Given a passage of text and some context, generate some new thoughts about the text.
            Make sure to not repeat any existing thoughts too closely.
            ''',
            input=dict(
                context=dict(
                    previous_passage=previous_passage,
                    recent_thoughts=recent_thoughts,
                    recalled_thoughts=recalled_thoughts,
                ),
                passage=passage,
            ),
            output=[Thought],
        )

        thoughts = [Thought(**{**dict(thought), 'source': paper}) for thought in thoughts.objects]
        recent_thoughts = (thoughts + recent_thoughts)[:recent_limit]
        previous_passage = passage
        
        print(f'Generated {len(thoughts)} thoughts')
        print([thought.value for thought in thoughts])

        store(*thoughts, collection=collection_name)

In [None]:
read_document(doc)

In [None]:
thoughts = query(collection=collection_name).query('type == "thought"')
thoughts
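
The prompt in `read_document` asks the model not to repeat existing thoughts too closely, but nothing checks that it complies. A post-hoc filter is one option, sketched below with the standard library's `difflib`. `drop_near_duplicates` is a hypothetical helper, and the 0.9 similarity threshold is an arbitrary assumption that would need tuning.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(texts, threshold=0.9):
    """Keep the first occurrence of each text; drop later ones that are too similar."""
    kept = []
    for text in texts:
        is_dup = any(
            SequenceMatcher(None, text.lower(), other.lower()).ratio() >= threshold
            for other in kept
        )
        if not is_dup:
            kept.append(text)
    return kept

values = [
    'We study the game modification problem',
    'We study the game modification problem.',
    'Propose an efficient algorithm',
]
print(drop_near_duplicates(values))
```

Character-level similarity only catches close rewordings. Comparing embeddings would be needed to catch paraphrases, and the pairwise comparison grows quadratically with the number of thoughts.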