# Long Document Content Extraction

GPT-3 can help us extract key figures, dates or other important content from documents that are too big to fit into the context window. One approach is to split the document into chunks, process each chunk separately, and then combine the results into one list of answers.

In this notebook we'll run through this approach:
- Load in a long PDF and pull the text out
- Create a prompt to be used to extract key bits of information
- Chunk up our document and process each chunk to pull any answers out
- Combine them at the end
- This simple approach will then be extended to three more difficult questions

## Approach

- **Setup**: Take a PDF, a Formula 1 Financial Regulation document on Power Units, and extract the text from it for entity extraction. We'll use this to try to extract answers that are buried in the content.
- **Simple Entity Extraction**: Extract key bits of information from chunks of a document by:
    - Creating a template prompt with our questions and an example of the expected format
    - Creating a function that takes a chunk of text as input, combines it with the prompt and gets a response
    - Running a script to chunk the text, extract answers and output them for parsing
- **Complex Entity Extraction**: Ask some more difficult questions which require tougher reasoning to work out
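
The chunk, extract, combine flow outlined above can be sketched without any API calls. This is a minimal illustration rather than the notebook's implementation: `split_into_chunks`, `extract_from_document` and `find_years` are hypothetical names, the chunks are cut by words instead of tokens, and `find_years` is a toy stand-in for the model call.

```python
from typing import Callable, List


def split_into_chunks(words: List[str], size: int) -> List[str]:
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def extract_from_document(text: str, size: int,
                          extract: Callable[[str], List[str]]) -> List[str]:
    """Run `extract` on every chunk and concatenate the per-chunk answers."""
    answers: List[str] = []
    for chunk in split_into_chunks(text.split(), size):
        answers.extend(extract(chunk))
    return answers


def find_years(chunk: str) -> List[str]:
    """Hypothetical stand-in for the model call: return four-digit numbers."""
    cleaned = [word.strip(".,()") for word in chunk.split()]
    return [word for word in cleaned if word.isdigit() and len(word) == 4]


# Made-up sample text, used only to exercise the flow
sample = "Alpha was founded in 1990. Beta followed in 2004, and Gamma arrived in 2010."
print(extract_from_document(sample, 5, find_years))
```

Swapping `find_years` for a function that sends the chunk to a model gives the same overall shape as the cells that follow.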

## Setup

In [None]:
!pip install textract
!pip install tiktoken
!pip install langchain
!pip install openai==0.28
!pip install pypdf2

In [24]:
import textract
import os
import openai
import tiktoken
from PyPDF2 import PdfReader 

# Extract the raw text from each page of the PDF using PyPDF2
reader = PdfReader('OpenAI_test.pdf')

text = []
for page in reader.pages:
    text.append(page.extract_text())

# Join the pages into one string
text = " ".join(text)
# Collapse double spaces, then turn each newline (and any semicolon) into whitespace
clean_text = text.replace("  ", " ").replace("\n", "; ").replace(';',' ')
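
The prompts below ask the model to include the closest page number, but joining the pages into one string drops the page boundaries. One possible tweak, which is an assumption and not part of the original notebook, is to prefix each page with a marker before joining. The helper name `join_pages_with_markers` and the `[Page N]` format are hypothetical.

```python
from typing import List


def join_pages_with_markers(pages: List[str]) -> str:
    """Prefix each page's text with a '[Page N]' marker (1-indexed) and join them."""
    return " ".join(
        f"[Page {number}] {page_text.strip()}"
        for number, page_text in enumerate(pages, start=1)
    )


# Toy stand-in for [page.extract_text() for page in reader.pages]
pages = ["Introduction text.\n", "Definitions text.\n"]
print(join_pages_with_markers(pages))
```

The markers contain no semicolons or newlines, so they would pass through the cleaning step above unchanged.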

## Simple Entity Extraction

In [38]:
# Example prompt - 
document = '<document>'
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author?\n1. Which year were BASEL I accords introduced? \n2. What are the principle objectives of BASEL III?\n3. What was the impact of BASEL III in India?\n4. What are the additions to BASEL III compared to BASEL II?\n\nDocument: \"\"\"{document}\"\"\"\n'''
print(template_prompt)

Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. Who is the author?
1. Which year were BASEL I accords introduced? 
2. What are the principle objectives of BASEL III?
3. What was the impact of BASEL III in India?
4. What are the additions to BASEL III compared to BASEL II?

Document: """<document>"""



In [39]:
# Read the API key from the environment rather than hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]
# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Yield successive chunks of roughly n tokens from text."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j

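def extract_chunk(document, template_prompt):
    # Insert the chunk into the prompt in place of the <document> placeholder
    prompt = template_prompt.replace('<document>', document)

    response = openai.Completion.create(
        model='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=1500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    # The "1." prefix suits prompts that end with "1." (as the complex prompt does);
    # with other prompts it shows up as a stray "1." line in the output
    return "1." + response['choices'][0]['text']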

In [40]:
# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

results = []
    
chunks = create_chunks(clean_text,1000,tokenizer)
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))
    #print(chunk)
    print(results[-1])


1.
0. Who is the author? Prof. R. K. Maheshwari
1. Which year were BASEL I accords introduced? 1988 (Page 1)
2. What are the principle objectives of BASEL III? To strengthen global capital and liquidity regulations with the goal of promoting a more resilient banking sector, and to improve the banking sectors stability to absorb shock arising from financial and economic stress (Page 4)
3. What was the impact of BASEL III in India? To raise the resilience of individual banking institutions in periods of stress, and to address the system wide risks which can build up across the banking sector along with the the pro-cyclical amplification of these risks over time (Page 5)
4. What are the additions to BASEL III compared to BASEL II? Liquidity Coverage Ratio (LCR) and Net Stable Funding Ratio (NSFR) for funding liquidity, and a set of five tools to be used for monitoring the liquidity risk exposures of banks (Page 6)
1.
0. Who is the author? Maheshwari
1. Which year were BASEL I accords intr

In [41]:
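# Split each response into its answer lines
groups = [r.split('\n') for r in results]

# Zip the groups together so answers to the same question sit next to each other
# (zip stops at the shortest response), then drop unanswered or placeholder lines
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]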
zipped

['1.',
 '1.',
 '0. Who is the author? Prof. R. K. Maheshwari',
 '0. Who is the author? Maheshwari',
 '1. Which year were BASEL I accords introduced? 1988 (Page 1)',
 '2. What are the principle objectives of BASEL III? To strengthen global capital and liquidity regulations with the goal of promoting a more resilient banking sector, and to improve the banking sectors stability to absorb shock arising from financial and economic stress (Page 4)',
 '3. What was the impact of BASEL III in India? To raise the resilience of individual banking institutions in periods of stress, and to address the system wide risks which can build up across the banking sector along with the the pro-cyclical amplification of these risks over time (Page 5)',
 '4. What are the additions to BASEL III compared to BASEL II? Liquidity Coverage Ratio (LCR) and Net Stable Funding Ratio (NSFR) for funding liquidity, and a set of five tools to be used for monitoring the liquidity risk exposures of banks (Page 6)',
 '4. Wh
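
Because `zip` stops at the shortest response, answers can be lost when chunks return different numbers of lines. A possible alternative, sketched here under the assumption that every answer line starts with its question number followed by a full stop, is to group lines by that number. `group_answers` is a hypothetical helper, and the toy responses are made up.

```python
import re
from collections import defaultdict
from typing import Dict, List


def group_answers(results: List[str]) -> Dict[int, List[str]]:
    """Group answer lines by their leading question number, skipping empty answers."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for result in results:
        for line in result.split("\n"):
            match = re.match(r"\s*(\d+)\.\s*(\S.*)", line)
            if match is None or "Not specified" in line:
                continue  # drops bare prefixes such as "1." and unanswered questions
            grouped[int(match.group(1))].append(match.group(2).strip())
    return dict(grouped)


# Toy responses in the same shape as the output above (values are made up)
toy_results = [
    "1.\n0. Who is the author? A. Writer\n1. Which year? 1988 (Page 1)",
    "1.\n0. Who is the author? Not specified",
]
print(group_answers(toy_results))
```

With these toy responses, `zip(*groups)` would stop after two lines per response and drop the year answer, whereas grouping by number keeps it.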

## Complex Entity Extraction

In [12]:
# Example prompt - 
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author\n1. How is a Minor Overspend Breach calculated\n2. How is a Major Overspend Breach calculated\n3. Which years do these financial regulations apply to\n\nDocument: \"\"\"{document}\"\"\"\n\n0. Who is the author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)

Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. Who is the author
1. How is a Minor Overspend Breach calculated
2. How is a Major Overspend Breach calculated
3. Which years do these financial regulations apply to

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.


In [30]:
results = []

for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))
    
groups = [r.split('\n') for r in results]

# zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped

['1.',
 '1.',
 '0. Who is the author? Prof. R. K. Maheshwari',
 '0. Who is the author? Maheshwari',
 '1. Which year were BASEL I accords introduced? 1988 (Page 1)',
 '2. What are the principle objectives of BASEL III? To strengthen global capital and liquidity regulations with the goal of promoting a more resilient banking sector and to improve the banking sectors stability to absorb shock arising from financial and economic stress (Page 4)',
 '3. When was BASEL III implemented in India? April 1, 2013 to March 31, 2019 (Page 5)',
 '4. What are the additions to BASEL III compared to BASEL II? Additions to BASEL III compared to BASEL II include Minimum Capital Requirements, Supervisory Review and Evaluation Process, Market Discipline, Capital Conservation Buffer Framework, Leverage Ratio Framework, Countercyclical Capital Buffer Framework, Liquidity Coverage Ratio (LCR) and Net Stable Funding Ratio (NSFR) for funding liquidity (Page 5)']

## Consolidation

We've been able to extract the first two answers safely, while the third was confounded by the date that appeared on every page, though the correct answer is in there as well.

To tune this further you can consider experimenting with:
- A more descriptive or specific prompt
- If you have sufficient training data, fine-tuning a model to produce this set of outputs reliably
- The way you chunk your data - we have gone for 1000 tokens with no overlap, but more intelligent chunking that breaks info into sections, cuts by tokens or similar may get better results
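
As one illustration of the chunking point, the sketch below adds overlap between neighbouring chunks so that a sentence cut at a boundary still appears whole in one of them. It is a simplified fixed-window variant and not the `create_chunks` function used earlier; `chunk_with_overlap` and its parameters are hypothetical names.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[int], size: int, overlap: int) -> List[Sequence[int]]:
    """Return windows of `size` tokens; each repeats the last `overlap` tokens of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the token list
    return chunks


tokens = list(range(10))  # stand-in for tokenizer.encode(clean_text)
for chunk in chunk_with_overlap(tokens, size=4, overlap=1):
    print(chunk)
```

Overlap increases the total number of tokens sent to the model, so it trades cost for a lower chance of splitting an answer across two chunks.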

However, with minimal tuning we have now answered 6 questions of varying difficulty using the contents of a long document, and have a reusable approach that we can apply to any long document requiring entity extraction. We look forward to seeing what you can do with this!

In [42]:
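# Note: len() on a string counts characters, not tokens, so these figures are rough proxies.
# The prompt is also counted only once here, although it is sent with every chunk.
o_tokens = len("\n".join(results))
i_tokens = len(template_prompt)+len(clean_text)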

In [43]:
# Per-1K-token rates in USD; these should match the pricing of the model called in extract_chunk
i_cost = (i_tokens / 1000) * 0.0015
o_cost = (o_tokens / 1000) * 0.002

In [44]:
print(f"""Token Usage
    Prompt: {i_tokens} tokens
    Completion: {o_tokens} tokens
    Cost estimation: ${round(i_cost + o_cost, 5)}""")

Token Usage
    Prompt: 6809 tokens
    Completion: 1355 tokens
    Cost estimation: $0.01292
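
The figures above come from character counts. A closer estimate would count tokens for every prompt actually sent and every completion received. The sketch below is one way to do that: `estimate_cost` is a hypothetical helper, the rates are simply the ones used in the cell above, and the word-based counter is a stand-in so the example runs without extra dependencies. With the tokenizer created earlier, the counter could be `lambda s: len(tokenizer.encode(s))`.

```python
from typing import Callable, Dict, List


def estimate_cost(
    prompts: List[str],
    completions: List[str],
    count_tokens: Callable[[str], int],
    input_rate_per_1k: float,
    output_rate_per_1k: float,
) -> Dict[str, float]:
    """Sum token counts over every request and apply per-1K-token rates."""
    input_tokens = sum(count_tokens(p) for p in prompts)
    output_tokens = sum(count_tokens(c) for c in completions)
    cost = (input_tokens / 1000) * input_rate_per_1k + (output_tokens / 1000) * output_rate_per_1k
    return {"input_tokens": input_tokens, "output_tokens": output_tokens, "cost": round(cost, 5)}


def count_words(s: str) -> int:
    """Stand-in counter for the demo (assumption: one token per whitespace-separated word)."""
    return len(s.split())


demo = estimate_cost(
    prompts=["one two three four", "five six"],  # one full prompt per chunk
    completions=["seven eight"],
    count_tokens=count_words,
    input_rate_per_1k=0.0015,
    output_rate_per_1k=0.002,
)
print(demo)
```

In the notebook, `prompts` would be the template with each chunk substituted in, and `completions` would be the raw response texts.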
