# arxiv reader

This example demonstrates a technique for building an understanding of long documents, in this case papers taken from the latest arXiv listings in the AI category.

The basic approach is to iterate over the document in chunks and ask the LLM to generate a synthetic dataset of thoughts about each chunk. The thoughts and the chunk are then embedded and can be recalled in subsequent iterations by querying the collection. Then we'll use the generated thoughts to ask questions about the document.

First, we need to define the data types that we'll be generating. Because we're embedding the generated objects we're using `Entity`, which is a thin wrapper on top of `pydantic.BaseModel`.

In [1]:
from enum import Enum
from pydantic import Field
from promptx.collection import Entity


class Document(Entity):
    title: str
    abstract: str
    url: str

class Quote(Entity):
    value: str
    source: Document = None
    start: int = None
    end: int = None

class ThoughtCategory(str, Enum):
    fact = 'fact'
    opinion = 'opinion'
    idea = 'idea'
    connection = 'connection'
    belief = 'belief'

class Thought(Entity):
    value: str
    category: ThoughtCategory
    confidence: float
    source: Entity = Field(None, generate=False)

2023-11-10 08:10:57.444 | INFO     | promptx:load:104 - loading local app from /home/rjl/promptx/examples/arxiv-reader
2023-11-10 08:10:57.446 | INFO     | promptx:load:107 - loaded environment variables from /home/rjl/promptx/examples/arxiv-reader/.env
2023-11-10 08:10:57.447 | INFO     | promptx:load:108 - API KEY wMeGC


Next, we need to get the data from arXiv. We'll use `requests` to get the data and `BeautifulSoup` to parse it into a `Document` instance.
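As a rough illustration of the parsing step, here is a minimal sketch that pulls the title, abstract, and URL out of an Atom-style feed such as the one the arXiv API returns. It deliberately swaps in the standard library's `xml.etree.ElementTree` for `requests` and `BeautifulSoup` so that it runs without a network connection. `PaperInfo`, `parse_atom_feed`, and the sample feed are hypothetical stand-ins, not part of promptx or of this notebook's own fetching code.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds


@dataclass
class PaperInfo:
    """Hypothetical stand-in for the Document entity defined above."""
    title: str
    abstract: str
    url: str


def parse_atom_feed(xml_text: str) -> list[PaperInfo]:
    """Extract title, abstract and URL from each <entry> of an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        # Collapse the line breaks and indentation that feeds put inside text fields
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        abstract = " ".join(entry.findtext(f"{ATOM}summary", default="").split())
        url = entry.findtext(f"{ATOM}id", default="").strip()
        papers.append(PaperInfo(title=title, abstract=abstract, url=url))
    return papers


# Made-up sample feed for illustration only; this is not real arXiv data
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://example.org/abs/placeholder</id>
    <title>An Example
      Paper Title</title>
    <summary>  A short example abstract.  </summary>
  </entry>
</feed>"""

papers = parse_atom_feed(SAMPLE_FEED)
print(papers[0])
```

In a real run the feed text would come from an HTTP response body rather than a string literal, and the fields would be passed to `Document(...)`.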

Now we can use these functions to fetch the latest papers, select one at random, and extract the data from the HTML content.

The document instance only has data about the paper and doesn't contain the actual text. Let's create a function to extract text from a PDF given a path or URL.

In [2]:
import os
import uuid

import PyPDF2
import requests

def load_pdf(filepath_or_url):
    """
    Load content of a PDF from either a file path or a remote URL.
    
    :param filepath_or_url: File path or URL to fetch the PDF from.
    :return: Content of the PDF as a string.
    """
    
    # Handle remote URL: download the PDF into ./data under a random file name
    if filepath_or_url.startswith(("http://", "https://")):
        response = requests.get(filepath_or_url)
        response.raise_for_status()
        os.makedirs('./data', exist_ok=True)  # make sure the target directory exists
        file_id = str(uuid.uuid4())
        filepath_or_url = f'./data/{file_id}.pdf'
        with open(filepath_or_url, 'wb') as pdf:
            pdf.write(response.content)
    
    with open(filepath_or_url, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        # Fall back to an empty string for pages that yield no text
        text_content = ''.join([page.extract_text() or '' for page in pdf_reader.pages])
    return text_content

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")
pdf = load_pdf('./data/8c2d8faa-e281-4378-86e8-8b9ff23c0921.pdf')
doc = nlp(pdf)

We have the full text, but how do we split it into chunks that are small enough to be processed by the LLM? You could do this in a number of ways, but we'll use `spacy`, a popular NLP library, to split the text into sentences and then group them into chunks. The aim is to keep each chunk within roughly 512 tokens; the code in this notebook approximates that by grouping a fixed number of sentences.

Now that we have a parsed `spacy` document, we can split it into sentences.

We could iterate over each sentence individually, but that would be slow and expensive, and some sentences are too short to convey meaning on their own. Instead, we'll group them into batches, or passages, and generate thoughts based on each passage of text. Before we do that, let's define a helper function to process the sentences in batches.
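If you want passages bounded by a token budget rather than by a sentence count, a minimal sketch could look like the following. The name `chunk_by_token_budget` and the whitespace-based token count are assumptions made for illustration; a real tokenizer would count differently, so treat the 512 figure as approximate.

```python
def chunk_by_token_budget(sentences, max_tokens=512, count_tokens=None):
    """Group consecutive sentences into chunks of at most `max_tokens` tokens.

    A single sentence longer than the budget becomes a chunk of its own.
    """
    if count_tokens is None:
        # Assumption: whitespace-separated words as a rough proxy for model tokens
        count_tokens = lambda s: len(s.split())
    chunk, used = [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would exceed the budget
        if chunk and used + n > max_tokens:
            yield chunk
            chunk, used = [], 0
        chunk.append(sentence)
        used += n
    if chunk:  # emit the final, possibly short, chunk
        yield chunk


sentences = ["one two three", "four five", "six seven eight nine", "ten"]
chunks = list(chunk_by_token_budget(sentences, max_tokens=5))
print(chunks)
```

With the parsed document this could be called as `chunk_by_token_budget([s.text for s in doc.sents])`. Passing a different `count_tokens` callable lets you plug in whichever tokenizer matches your model.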

In [4]:
def batch(generator, bs=1, limit=100):
    """Yield lists of up to `bs` items, stopping after `limit` items (None means no limit)."""
    b = []
    for i, item in enumerate(generator):
        if limit is not None and i >= limit:
            break
        b.append(item)
        if len(b) == bs:
            yield b
            b = []
    if b:  # Yield any remaining items in the final, partial batch
        yield b

This function yields items from the generator in chunks defined by the batch size, up to a total number of processed items.
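For comparison, the same idea can be sketched on top of `itertools.islice`. In this variant `limit` counts individual items and a short final batch is always yielded. The name `batch_islice` is hypothetical and the function is an alternative formulation, not a drop-in copy of the helper above.

```python
from itertools import islice


def batch_islice(iterable, bs=1, limit=None):
    """Yield lists of up to `bs` items, reading at most `limit` items in total."""
    # Cap the input first (if requested), then slice it into fixed-size lists
    it = iter(iterable) if limit is None else islice(iterable, limit)
    while True:
        chunk = list(islice(it, bs))
        if not chunk:
            return
        yield chunk


unlimited = list(batch_islice(range(7), bs=3))
limited = list(batch_islice(range(10), bs=4, limit=6))
print(unlimited)  # [[0, 1, 2], [3, 4, 5], [6]]
print(limited)    # [[0, 1, 2, 3], [4, 5]]
```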

In [5]:
for chunk in batch(doc.sents, bs=5, limit=100):
    print(chunk)

[DAIL:, Data Augmentation for In-Context Learning via Self-Paraphrase
Dawei Li, Yaxuan Li, Dheeraj Mekala, Shuyao Li,
Yulin wang, Xueqi Wang, William Hogan, Jingbo Shang
University of California, San Diego
dal034, yal105, dmekala, shl118, yuw033
xuw030, whogan, jshang@ucsd.edu
Abstract
In-Context Learning (ICL) combined with pre-
trained large language models has achieved
promising results on various NLP tasks., How-
ever, ICL requires high-quality annotated
demonstrations which might not be available
in real-world scenarios., To overcome this limi-
tation, we propose DataAugmentation for In-
Context Learning ( DAIL )., DAIL leverages the
intuition that large language models are more
familiar with the content generated by them-
selves.]
[It first utilizes the language model to
generate paraphrases of the test sample and
employs majority voting to determine the final
result based on individual predictions., Our ex-
tensive empirical evaluation shows that DAIL
outperforms the standard IC

In [6]:
from promptx import delete_collection

try:
    delete_collection('tmp-quotes')
except Exception:
    # Ignore failures here, for example when the collection does not exist yet
    pass

In [7]:
from promptx import store

def store_quotes(doc: list[str], bs=5, limit=None, collection='tmp'):
    for chunk in batch(doc, bs=bs, limit=limit):
        quotes = [Quote(value=line) for line in chunk]
        store(*quotes, collection=collection)

store_quotes([sentence.text for sentence in doc.sents], bs=10, collection='tmp-quotes')

Now we can define the logic for thought generation. It's often tempting to overcomplicate prompts, but it can be difficult to know whether more information is actually helping. Often, less is more as it allows the model to focus more effectively.

Let's start by simply generating a list of thoughts based on the current passage.

In [8]:
from promptx import prompt

def read(doc: list[str], bs=5, limit=None):
    thoughts = []
    for chunk in batch(doc, bs=bs, limit=limit):
        print(f'Passage: {chunk}')
        output = prompt(
            '''
            Given a passage from a document, generate a list of thoughts about the passage.
            ''',
            input=dict(
                passage=chunk,
            ),
            output=[Thought],
        ).objects

        print(f'Thoughts: {[t.value for t in output]}')
        thoughts += output
    return thoughts

In [9]:
thoughts = read([sentence.text for sentence in doc.sents], limit=100)

for chunk in batch(thoughts, bs=100):
    store(*chunk, collection='tmp-basic-thoughts')

Passage: ['DAIL:', 'Data Augmentation for In-Context Learning via Self-Paraphrase\nDawei Li, Yaxuan Li, Dheeraj Mekala, Shuyao Li,\nYulin wang, Xueqi Wang, William Hogan, Jingbo Shang\nUniversity of California, San Diego\ndal034, yal105, dmekala, shl118, yuw033\nxuw030, whogan, jshang@ucsd.edu\nAbstract\nIn-Context Learning (ICL) combined with pre-\ntrained large language models has achieved\npromising results on various NLP tasks.', 'How-\never, ICL requires high-quality annotated\ndemonstrations which might not be available\nin real-world scenarios.', 'To overcome this limi-\ntation, we propose DataAugmentation for In-\nContext Learning ( DAIL ).', 'DAIL leverages the\nintuition that large language models are more\nfamiliar with the content generated by them-\nselves.']
Thoughts: ['Data Augmentation for In-Context Learning via Self-Paraphrase', 'ICL requires high-quality annotated demonstrations', 'DataAugmentation for In-Context Learning ( DAIL )', 'Large language models are more fa

For example, if we replace the instructions with something like:

```
You are an AI researcher reading a whitepaper.
Given a passage of text from the paper, generate a list of thoughts about the passage.
```

This produces very similar results, but the prompt is now far less reusable because it is tied to one specific kind of document.

Instead, let's try to improve the results by providing some more context in each prompt.

In [10]:
from promptx import prompt

def read_with_recent_context(doc: list[str], bs=5, limit=None):
    thoughts = []
    recent_thoughts = []
    previous_passage = None
    for chunk in batch(doc, bs=bs, limit=limit):
        print(f'Passage: {chunk}')
        output = prompt(
            '''
            Given a passage from a document, generate a list of thoughts about the passage.
            Don't repeat yourself!
            ''',
            input=dict(
                passage=chunk,
                previous_passage=previous_passage,
                recent_thoughts=[t.value for t in recent_thoughts],
            ),
            output=[Thought],
        ).objects

        print(f'Thoughts: {[t.value for t in output]}')
        thoughts += output
        previous_passage = chunk
        recent_thoughts = (output + recent_thoughts)[:5]
    return thoughts
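
The expression `(output + recent_thoughts)[:5]` keeps a small window of the newest thoughts, newest first. Here is a minimal sketch of how that window evolves, using plain strings in place of `Thought` objects; `update_recent` is a hypothetical helper name.

```python
def update_recent(recent, new, size=5):
    """Return the newest `size` items, with the latest output placed first."""
    return (list(new) + list(recent))[:size]


recent = []
history = []
for output in (["a", "b", "c"], ["d", "e", "f"], ["g"]):
    recent = update_recent(recent, output)
    history.append(recent)
    print(recent)
```

After the second passage the window is already full, so `"c"` is dropped, and after the third passage `"b"` follows. If a single passage produces more than five thoughts, only the first five of them are carried forward.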

In [11]:
thoughts = read_with_recent_context([sentence.text for sentence in doc.sents], limit=100)

for chunk in batch(thoughts, bs=10):
    store(*chunk, collection='tmp-recent-context-thoughts')

Passage: ['DAIL:', 'Data Augmentation for In-Context Learning via Self-Paraphrase\nDawei Li, Yaxuan Li, Dheeraj Mekala, Shuyao Li,\nYulin wang, Xueqi Wang, William Hogan, Jingbo Shang\nUniversity of California, San Diego\ndal034, yal105, dmekala, shl118, yuw033\nxuw030, whogan, jshang@ucsd.edu\nAbstract\nIn-Context Learning (ICL) combined with pre-\ntrained large language models has achieved\npromising results on various NLP tasks.', 'How-\never, ICL requires high-quality annotated\ndemonstrations which might not be available\nin real-world scenarios.', 'To overcome this limi-\ntation, we propose DataAugmentation for In-\nContext Learning ( DAIL ).', 'DAIL leverages the\nintuition that large language models are more\nfamiliar with the content generated by them-\nselves.']
Thoughts: ['In-Context Learning (ICL) combined with pre-trained large language models has achieved promising results on various NLP tasks.', 'ICL requires high-quality annotated demonstrations which might not be avail


In [None]:
from promptx import prompt, store, query, delete_collection

def read_with_recalled_context(doc: list[str], bs=5, limit=None, collection='tmp'):
    thoughts = []
    recent_thoughts = []
    previous_passage = None
    try:
        delete_collection(collection)
    except Exception as e:
        pass
    for chunk in batch(doc, bs=bs, limit=limit):
        print(f'Passage: {chunk}')
        try:
            recalled_thoughts = query(*chunk, limit=3, collection=collection).objects
        except Exception as e:
            print(f'Error querying {e}')
            recalled_thoughts = []
        output = prompt(
            '''
            Given a passage from a document, generate a list of thoughts about the passage.
            Don't repeat yourself!
            ''',
            input=dict(
                passage=chunk,
                previous_passage=previous_passage,
                recent_thoughts=[t.value for t in recent_thoughts],
                recalled_thoughts=[t.value for t in recalled_thoughts],
            ),
            output=[Thought],
        ).objects

        print(f'Thoughts: {[t.value for t in output]}')
        thoughts += output
        previous_passage = chunk
        recent_thoughts = (output + recent_thoughts)[:5]

        quotes = [Quote(value=line) for line in chunk]
        store(*output, *quotes, collection=collection)
    return thoughts
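
To make the store-then-recall loop concrete without promptx, here is a toy in-memory version. It is only a sketch: `ToyStore` is a hypothetical class, and bag-of-words cosine similarity stands in for real embeddings, so it matches on shared words rather than on meaning. It makes no claim about how promptx implements `store` or `query`.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyStore:
    """Hypothetical stand-in for an embedded collection."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def store(self, *texts):
        self.items.extend((text, bow(text)) for text in texts)

    def query(self, *texts, limit=3):
        """Return up to `limit` stored texts most similar to the query texts."""
        q = bow(" ".join(texts))
        scored = [(cosine(q, vec), text) for text, vec in self.items]
        scored = [pair for pair in scored if pair[0] > 0]  # drop non-matches
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:limit]]


memory = ToyStore()
memory.store(
    "ICL requires high-quality annotated demonstrations",
    "DAIL leverages the intuition that large language models are more familiar "
    "with the content generated by themselves",
    "It first utilizes the language model to generate paraphrases of the test "
    "sample and employs majority voting to determine the final result",
)
recalled = memory.query("How is majority voting used?", limit=3)
print(recalled)
```

Only the sentence that mentions majority voting shares words with the query, so it is the only item recalled. In the reading loop, the recalled items are what gets passed to the prompt as `recalled_thoughts`.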

In [None]:
thoughts = read_with_recalled_context([sentence.text for sentence in doc.sents], limit=100)

for chunk in batch(thoughts, bs=10):
    store(*chunk, collection='tmp-recalled-context-thoughts')

In [6]:
import pandas as pd
pd.set_option("display.max_colwidth", None) 


In [8]:
from promptx import prompt, store, query, delete_collection

query(collection='tmp-quotes')[['id', 'type', 'value']]

| | id | type | value |
|---|---|---|---|
| 0 | fb253487-4d5a-42e0-a89d-0d5f350e1011 | quote | DAIL: |
| 1 | 6e84618d-7f15-41bb-92be-e2c022d38d25 | quote | Data Augmentation for In-Context Learning via Self-Paraphrase\nDawei Li, Yaxuan Li, Dheeraj Mekala, Shuyao Li,\nYulin wang, Xueqi Wang, William Hogan, Jingbo Shang\nUniversity of California, San Diego\ndal034, yal105, dmekala, shl118, yuw033\nxuw030, whogan, jshang@ucsd.edu\nAbstract\nIn-Context Learning (ICL) combined with pre-\ntrained large language models has achieved\npromising results on various NLP tasks. |
| 2 | 5a33b58f-f5b8-4be2-ae76-0c5a83d39b73 | quote | How-\never, ICL requires high-quality annotated\ndemonstrations which might not be available\nin real-world scenarios. |
| 3 | 80454d13-9110-4e93-99e7-93d68e78b87b | quote | To overcome this limi-\ntation, we propose DataAugmentation for In-\nContext Learning ( DAIL ). |
| 4 | 0b0754c5-5ca9-4645-a2ab-1834100fefad | quote | DAIL leverages the\nintuition that large language models are more\nfamiliar with the content generated by them-\nselves. |
| ... | ... | ... | ... |
| 225 | a38ea15f-9e00-4eb4-8d59-7e51a67987bc | quote | What type of information does the writer express for the question?\n |
| 226 | 61961fe1-6a7c-48ad-a2c7-24fcbd8e62f2 | quote | EmotionLabel the emotion class of the sentence.\n |
| 227 | 64512d6c-0bea-4636-9c80-efd4c11a18d9 | quote | What is the emotion expressed in this message?\n |
| 228 | c2a9e0e0-0bab-466c-8f59-534f346db042 | quote | What emotion does this message express?\n |


In [None]:
query('low rank', collection='tmp-quotes')

In [None]:
quotes = query(collection='tmp-quotes')

# use pandas methods to keep only rows whose 'value' text is longer than 25 characters
quotes = quotes[quotes['value'].str.len() > 25]
quotes

In [4]:
from promptx import prompt, query

def rag_qa(question, collection='tmp', step_back=True):
    if step_back:
        intermediate_questions = prompt(
            '''
            Take a step back, and think of questions that would be helpful to answer the original question.
            ''',
            input=dict(
                original_question=question,
            ),
        )

        print(intermediate_questions)
        intermediate_answers = rag_qa(intermediate_questions, collection=collection, step_back=False)
    else:
        intermediate_answers = None

    print(intermediate_answers)
    context = query(question, collection=collection).objects
    print(context)
    return prompt(
        '''
        Given a question and a context, generate an answer to the question.
        ''',
        input=dict(
            question=question,
            context=context,
            ponderings=intermediate_answers
        ),
    )
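
The control flow of `rag_qa` can be hard to follow because it calls itself once. The sketch below isolates that flow with injectable callables so it can be run without an LLM or a collection. `step_back_qa`, `ask`, `retrieve`, and the two stub functions are hypothetical names introduced for illustration; they are not promptx APIs.

```python
def step_back_qa(question, ask, retrieve, step_back=True):
    """Answer `question`, optionally answering broader questions first.

    `ask(instructions, inputs)` and `retrieve(text)` are hypothetical callables
    standing in for the LLM call and the collection lookup.
    """
    ponderings = None
    if step_back:
        broader = ask(
            "Take a step back, and think of questions that would be helpful "
            "to answer the original question.",
            {"original_question": question},
        )
        # One level of recursion only: the inner call skips the step-back stage
        ponderings = step_back_qa(broader, ask, retrieve, step_back=False)
    context = retrieve(question)
    return ask(
        "Given a question and a context, generate an answer to the question.",
        {"question": question, "context": context, "ponderings": ponderings},
    )


# Stub callables that only record what they were asked
calls = []


def fake_ask(instructions, inputs):
    calls.append(("ask", sorted(inputs)))
    return f"response {sum(1 for kind, _ in calls if kind == 'ask')}"


def fake_retrieve(text):
    calls.append(("retrieve", text))
    return [f"passage about {text!r}"]


answer = step_back_qa("What is the title of the paper?", fake_ask, fake_retrieve)
print(answer)
for call in calls:
    print(call)
```

The recorded calls show three LLM calls and two retrievals: one retrieval for the broader questions and one for the original question. The second retrieval uses the original question only, so the broader answers reach the final prompt through `ponderings` rather than through `context`.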

In [6]:
rag_qa('What is the title of the paper?', collection='tmp-quotes', step_back=False)

None


"I'm sorry, but I cannot generate an answer to the question as the input provided does not contain any relevant information or context. Please provide more information or a different question."