This notebook walks you through the concepts of tokens and chunking. In the previous notebook, we provided additional context to ground the model. Is there a limit to how much additional context we can provide? Unfortunately, yes: there is a limit on the number of tokens allowed in the input and the output combined.

What are tokens? Tokens are the units in which the Azure OpenAI models process text. A token can be a whole word or just a chunk of characters. Let's look at the total number of tokens in the response we got back from the grounding notebook. There are several ways to count tokens; in this notebook, we will use the tiktoken library.
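For a quick sanity check, a rough estimate can be made without any tokenizer. The sketch below assumes the commonly quoted rule of thumb that one token is roughly four characters of English text; `estimate_tokens` and `chars_per_token` are hypothetical names used only for illustration, and the result is an approximation rather than an exact count.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes ~4 characters per token (rule of thumb)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)

sample = "Tokens are words or just chunks of characters."
print(estimate_tokens(sample))
```

An estimate like this is useful for budgeting, but an exact count still requires the tokenizer that matches the model.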

In [53]:
# Import the needed modules
import openai
import PyPDF3
import os
import json
import tiktoken
import spacy
# Requires the small English model; install it once with:
#   python -m spacy download en_core_web_sm
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

from spacy.lang.en import English 
nlp = spacy.load("en_core_web_sm")

import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load your OpenAI credentials
API_KEY = os.getenv("OPENAI_API_KEY")
assert API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = API_KEY

RESOURCE_ENDPOINT = os.getenv("OPENAI_API_BASE","").strip()
assert RESOURCE_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in RESOURCE_ENDPOINT.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = RESOURCE_ENDPOINT

openai.api_type = os.getenv("OPENAI_API_TYPE")
openai.api_version = os.getenv("OPENAI_API_VERSION")
model=os.getenv("TEXT_DAVINCI_NAME")


Tiktoken uses a technique called byte pair encoding (BPE) to convert text into tokens. Several encodings are available; in this notebook, we will use cl100k_base. Different models use different encodings, so a count is exact only for models that use the chosen encoding.
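To make the idea behind BPE concrete, here is a toy sketch that starts from single characters and repeatedly merges the most frequent adjacent pair. It is a simplified illustration under stated assumptions: it works on characters rather than bytes, learns merges from the input itself, and is not how tiktoken is implemented. The function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` merge steps."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

print(toy_bpe("low lower lowest", 3))
```

Each merge shortens the token sequence, which is why frequent words end up as a single token while rare words are split into several pieces.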

Let's count the number of tokens in the answer we received from the previous notebook.

In [54]:

def count_tokens(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [55]:
text = """Kaiser Permanente offers four types of individual and family plans: Copay plans, Deductible plans, Virtual plans, and HSA qualified plans. Catastrophic plans are also available in some markets."""

num_tokens = count_tokens(text, "cl100k_base")

print(f"There are {num_tokens} tokens in this sentence: {text}")

There are 43 tokens in this sentence: Kaiser Permanente offers four types of individual and family plans: Copay plans, Deductible plans, Virtual plans, and HSA qualified plans. Catastrophic plans are also available in some markets.


What happens if we want to add in more context than what we put in the text variable? Let's say we want to provide more context with a PDF file. Let's try to count the number of tokens again.

In [56]:
document = open(r'C:\Users\dthakar\Documents\WhatTheHack\xxx-OpenAIFundamentals\Student\Resources\data\CH3-data.pdf', 'rb') 
doc_helper = PyPDF3.PdfFileReader(document)

In [57]:
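finaltext = ''
totalpages = doc_helper.getNumPages()
# Concatenate the extracted text of every page
for eachpage in range(totalpages):
    p = doc_helper.getPage(eachpage)
    indpagetext = p.extractText()
    finaltext += indpagetext

# Replace double spaces with single ones, then turn newlines and semicolons into spaces
clean_text = finaltext.replace("  ", " ").replace("\n", "; ").replace(';',' ')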

In [58]:
prompt = f"What is the answer to the following question regarding the PDF document?\n\n{finaltext}\n\n" 
q = "Can you give me a summary of the document?"

final_prompt = prompt + q
response = openai.Completion.create(engine=model, prompt=final_prompt, max_tokens=50)
answer = response.choices[0].text.strip()
print(f"{q}\n{answer}\n")

InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 52475 tokens (52425 in your prompt; 50 for the completion). Please reduce your prompt; or completion length.

Running the snippet above produces an error: the prompt exceeds the model's maximum context length. For the model used here, the limit is 4097 tokens, shared between the prompt and the completion. How can we give the model the context it needs without running into the token limit?
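Before turning to a fix, the arithmetic behind the error message can be checked directly. The sketch below uses the numbers reported in the error; `fits_context` and `prompt_budget` are hypothetical helper names, not part of the OpenAI library.

```python
def fits_context(prompt_tokens: int, max_completion_tokens: int, context_limit: int) -> bool:
    """True if the prompt plus the requested completion fit in the context window."""
    return prompt_tokens + max_completion_tokens <= context_limit

def prompt_budget(max_completion_tokens: int, context_limit: int) -> int:
    """Tokens left for the prompt after reserving room for the completion."""
    return max(context_limit - max_completion_tokens, 0)

# Numbers taken from the error message above
print(fits_context(52425, 50, 4097))  # False: 52475 tokens requested, 4097 allowed
print(prompt_budget(50, 4097))        # 4047 tokens available for the prompt
```

With `max_tokens=50`, any prompt has to stay within 4047 tokens, which is far less than the 52425 tokens in the full document.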

To solve this problem, we can use a technique called chunking. Chunking splits the data into smaller pieces so that we can pass only the most relevant chunks to the model instead of the whole document. Below are a few methods to chunk data.

1. Chunking by splitting sentences
2. Chunking recursively
3. A variation of chunking with sentence splitting

Let us take a look at some sample text and experiment with these techniques.

Method 1: Chunking by splitting sentences

In [59]:
text = "Today was a fun day. I had lots of ice cream. I also met my best friend Sally and we played together at the new playground."

for sentence in nlp(text).sents:
    print(sentence.text)

Today was a fun day.
I had lots of ice cream.
I also met my best friend Sally and we played together at the new playground.
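Splitting into sentences is only the first step; the sentences still have to be grouped into chunks that fit a token budget. The sketch below shows one greedy way to do that. It is a minimal illustration: `pack_sentences` is a hypothetical helper, and `word_count` is a stand-in counter (whitespace-separated words) used so the example runs on its own. In this notebook, a tiktoken-based counter could be passed as `count` instead.

```python
from typing import Callable, Iterable, List

def word_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())

def pack_sentences(sentences: Iterable[str], max_tokens: int,
                   count: Callable[[str], int] = word_count) -> List[str]:
    """Greedily group consecutive sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    chunks, current, current_size = [], [], 0
    for sentence in sentences:
        size = count(sentence)
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_size + size > max_tokens:
            chunks.append(" ".join(current))
            current, current_size = [], 0
        current.append(sentence)
        current_size += size
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = [
    "Today was a fun day.",
    "I had lots of ice cream.",
    "I also met my best friend Sally and we played together at the new playground.",
]
for chunk in pack_sentences(sentences, max_tokens=12):
    print(chunk)
```

With a real tokenizer, the count of a joined chunk can differ slightly from the sum of its sentences, so leaving a small margin below the actual limit is a reasonable precaution.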


Method 2: Chunking recursively using LangChain

In [60]:
split_text = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 30
)
docs = split_text.create_documents([clean_text])
docs

[Document(page_content='Formula 1 Power Unit Financial Regulations     1     16 August     2022     © 202  2                Issue   1              FORMULA 1   POWER UNIT   FINANCIAL REGULATIONS     PUBLISHED ON   16 August     2022     Issue   1        CONTENTS        Art     CONTENTS     Page(s)     1.     GENERAL', metadata={}),
 Document(page_content='Page(s)     1.     GENERAL PRINCIPLES     ................................  ................................  ................................  ........     2     2.     POWER UNIT MANUFACTURER OBLIGATIONS     ................................  ................................  ....     3     3.', metadata={}),
 Document(page_content='....     3     3.     EXCLUSIONS     ................................  ................................  ................................  .....................     5     4.     ADJUSTMENTS     ................................  ................................  ................................', metadata={
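To illustrate the idea behind recursive splitting, here is a simplified sketch: try the coarsest separator first, and fall back to finer separators only for pieces that are still too long. This is not LangChain's actual implementation. `recursive_split` is a hypothetical helper, the separator list is an assumption, sizes are measured in characters, and chunk overlap is omitted to keep the example short.

```python
from typing import List, Sequence

def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split text into pieces of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging small pieces while they still fit
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph is short.\n\nSecond paragraph is quite a bit longer than the first one."
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

Preferring paragraph and line breaks over arbitrary cut points tends to keep related text together, which is the main reason to split recursively instead of cutting at fixed offsets.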

Method 3: Variation of chunking with sentence splitting

In [61]:
# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Yield successive chunks of roughly n tokens from text."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j

In [62]:
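# Create the tokenizer with the same cl100k_base encoding used earlier
tokenizer = tiktoken.get_encoding("cl100k_base")
chunks = create_chunks(clean_text, 1000, tokenizer)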
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]
text_chunks

['   Formula 1 Power Unit Financial Regulations     1     16 August     2022     © 202  2                Issue   1              FORMULA 1   POWER UNIT   FINANCIAL REGULATIONS     PUBLISHED ON   16 August     2022     Issue   1        CONTENTS        Art     CONTENTS     Page(s)     1.     GENERAL PRINCIPLES     ................................  ................................  ................................  ........     2     2.     POWER UNIT MANUFACTURER OBLIGATIONS     ................................  ................................  ....     3     3.     EXCLUSIONS     ................................  ................................  ................................  .....................     5     4.     ADJUSTMENTS     ................................  ................................  ................................  ..................     8     5.     REPORTING R  EQUIREMENTS     ................................  ................................  ......................