This notebook walks you through the concepts of tokens and chunking. In the grounding notebook, we provided some additional context to the model. Is there a limit to how much additional context we can provide? Unfortunately, yes: the model limits the total number of tokens allowed in the input and the output combined.

What are tokens? Tokens are the units in which the Azure OpenAI models process text. A token can be a whole word or just a chunk of characters. Let's look at the total number of tokens in the response we got back from the grounding notebook. There are many ways to count tokens; in this notebook, we will use the tiktoken library.
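Before counting exactly, a rough estimate is sometimes enough for planning. A commonly cited rule of thumb for English text is about four characters per token. The sketch below applies that rule; the name `estimate_tokens` and the default ratio are illustrative assumptions, and the result is only an approximation of what a real tokenizer would report.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count of English text.

    Assumes about `chars_per_token` characters per token, which is a
    heuristic and not the count the model will actually use.
    """
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


sample = "Tokens are words or just chunks of characters."
print(estimate_tokens(sample))
```

For anything close to the model's limit, rely on an exact count from a tokenizer rather than this estimate.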

In [4]:
# Import the needed modules
import openai
import PyPDF3
import os
import json
import tiktoken

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

# Load your OpenAI credentials
API_KEY = os.getenv("OPENAI_API_KEY")
assert API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = API_KEY

RESOURCE_ENDPOINT = os.getenv("OPENAI_API_BASE","").strip()
assert RESOURCE_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in RESOURCE_ENDPOINT.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = RESOURCE_ENDPOINT

openai.api_type = "azure"
# openai.api_version = "2022-12-01"
openai.api_version = "2023-05-15"
chat_model=os.getenv("MODEL_NAME")
model_engine = chat_model


In [5]:

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [7]:
text = """Kaiser Permanente offers four types of individual and family plans: Copay plans, Deductible plans, Virtual plans, and HSA qualified plans. Catastrophic plans are also available in some markets."""

num_tokens_from_string(text, "cl100k_base")
# print(num_tokens)

43

Now, what happens if we want to add more context than what is in the text variable? Let's try to answer a question based on data from a PDF.

In [None]:
# Load PDF
pdf_file = open('fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf', 'rb') 
pdf_reader = PyPDF3.PdfFileReader(pdf_file)

In [None]:
# Extract text from PDF file
num_pages = pdf_reader.getNumPages()
full_text = ''
for page_num in range(num_pages):
   page = pdf_reader.getPage(page_num)
   page_text = page.extractText()
   full_text += page_text

clean_text = full_text.replace("  ", " ").replace("\n", "; ").replace(';',' ')

In [None]:
# Set up GPT-3 model and prompt
prompt = f"What is the answer to the following question regarding the PDF document?\n\n{full_text}\n\n" 

# Ask questions and get answers
questions = ["What is the document about?", "Who wrote the document?", "What is the main idea of the document?"]
for question in questions:
   full_prompt = prompt + question
   response = openai.Completion.create(engine=model_engine, prompt=full_prompt, max_tokens=100)
   answer = response.choices[0].text.strip()
   print(f"{question}\n{answer}\n")

Running the code above produces an error because the prompt exceeds the model's maximum context length. For GPT-3 models such as text-davinci-003, the limit is 4,097 tokens, shared between the prompt and the completion.
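The arithmetic behind this error can be checked before sending a request. The sketch below is a minimal illustration, assuming the 4,097-token limit quoted above; the helper names `remaining_prompt_budget` and `fits_in_context` are hypothetical and not part of any library.

```python
def remaining_prompt_budget(context_limit: int, max_completion_tokens: int) -> int:
    """Tokens left for the prompt once room for the completion is reserved."""
    return context_limit - max_completion_tokens


def fits_in_context(prompt_tokens: int, max_completion_tokens: int,
                    context_limit: int = 4097) -> bool:
    """Return True if prompt plus completion stay within the context limit."""
    return prompt_tokens + max_completion_tokens <= context_limit


print(remaining_prompt_budget(4097, 100))  # 3997
print(fits_in_context(3997, 100))          # True
print(fits_in_context(20000, 100))         # False
```

In this notebook, `prompt_tokens` could come from `num_tokens_from_string(full_prompt, "cl100k_base")`, and `max_completion_tokens` corresponds to the `max_tokens` argument of the request.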

To solve this problem, we can use a technique called chunking. Chunking limits the amount of information we pass to the model: the data is split into smaller pieces, and only the chunks most relevant to the question being asked are passed in. There are various methods to chunk data.
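As one illustration of a chunking method, the sketch below splits a token list into fixed-size windows that overlap slightly, so that text cut at a boundary still appears whole in a neighbouring chunk. This is a generic technique and not the method used later in this notebook; the name `sliding_window_chunks` and its parameters are assumptions made for the example.

```python
from typing import Iterator, List, Sequence


def sliding_window_chunks(tokens: Sequence[int], size: int,
                          overlap: int = 0) -> Iterator[List[int]]:
    """Yield windows of at most `size` tokens, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")
    step = size - overlap
    for start in range(0, len(tokens), step):
        yield list(tokens[start:start + size])
        # Stop once a window has reached the end of the token list
        if start + size >= len(tokens):
            break


tokens = list(range(10))  # stand-in for real token ids
print(list(sliding_window_chunks(tokens, size=4, overlap=1)))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

A larger overlap reduces the risk of splitting a sentence across chunks, at the cost of sending more tokens in total.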

In [None]:
# Prompt
document = '<document>'
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author\n1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR\n2. What is the value of External Manufacturing Costs in USD\n3. What is the Capital Expenditure Limit in USD\n\nDocument: \"\"\"{document}\"\"\"\n\n0. Who is the author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)

In [None]:
# Split a text into smaller chunks of about n tokens, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Yield successive chunks of roughly n tokens (as token lists) from text."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j

In [None]:
# Extract key information from one chunk of the document using the template prompt
def extract_chunk(document, template_prompt):
    # Fill the '<document>' placeholder in the template with this chunk's text
    prompt = template_prompt.replace('<document>', document)

    response = openai.Completion.create(
        engine=model_engine,
        prompt=prompt,
        temperature=0,
        max_tokens=1500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    # The template prompt ends with "1.", so prepend it to the completion text
    return "1." + response['choices'][0]['text']

In [None]:
# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

results = []
    
chunks = create_chunks(clean_text,1000,tokenizer)
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

In [None]:
for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))
    # print(chunk)
    # print(results[-1])


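# Split each result into lines, one line per numbered answer
groups = [r.split('\n') for r in results]

# Zip the groups so answers to the same question sit together
# (note that zip stops at the shortest group)
zipped = list(zip(*groups))
# Flatten and drop answers that are unspecified or contain blank placeholders
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped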


print(zipped)
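Because `zip` stops at the shortest group, answers can be lost when one chunk returns fewer lines than another. As an alternative sketch, the answers can be grouped by their leading question number instead. The function name `group_answers` and the sample strings below are made up for illustration and are not output from the model.

```python
import re
from collections import defaultdict


def group_answers(results):
    """Group answer lines by their leading question number, skipping unspecified ones."""
    grouped = defaultdict(list)
    for result in results:
        for line in result.split("\n"):
            # Expect lines such as "2. some answer (Page 3)"
            match = re.match(r"\s*(\d+)\.\s*(.+)", line)
            if not match:
                continue
            number, answer = int(match.group(1)), match.group(2).strip()
            if "Not specified" in answer:
                continue
            grouped[number].append(answer)
    return dict(grouped)


# Made-up results standing in for the output of extract_chunk
sample_results = [
    "1. Not specified\n2. Not specified\n3. example answer C (Page 4)",
    "1. example answer A (Page 2)\n2. Not specified",
]
print(group_answers(sample_results))
```

Keying on the question number keeps every specified answer, regardless of how many lines each chunk produced.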