This notebook walks you through the concepts of tokens and chunking. In the previous notebook, we grounded the model by providing some additional context. Is there a limit to the amount of additional context we can provide? Unfortunately, the answer is yes. The input and the output combined cannot exceed a fixed number of tokens.

What are tokens? Tokens are the units in which the Azure OpenAI models process text. A token can be a whole word or just a chunk of characters. Let's look at the total number of tokens in the response we got back from the grounding notebook. There are many ways to count tokens. In this notebook, we will use the tiktoken library.
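
Before reaching for a tokenizer, a rough estimate is sometimes enough. The sketch below assumes the common rule of thumb that English text averages about four characters per token. Both the ratio and the helper name `estimate_tokens` are assumptions for illustration, and the result is only an approximation of what a real tokenizer reports.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count of English text.

    The default ratio of four characters per token is a rule of thumb,
    not an exact property of any encoding.
    """
    return math.ceil(len(text) / chars_per_token)

print(estimate_tokens("Tokens are words or just chunks of characters."))
```

Use an estimate like this only for quick sanity checks. For anything that must stay under a hard limit, count with the tokenizer itself.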

In [None]:
# Import the needed modules
import openai
import PyPDF3
import os
import json
import tiktoken
import spacy
from openai.error import InvalidRequestError

# Load environment variables (keys, endpoint) from a .env file
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

# Load spaCy's small English model.
# Install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load your OpenAI credentials
API_KEY = os.getenv("OPENAI_API_KEY")
assert API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = API_KEY

RESOURCE_ENDPOINT = os.getenv("OPENAI_API_BASE","").strip()
assert RESOURCE_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in RESOURCE_ENDPOINT.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = RESOURCE_ENDPOINT

openai.api_type = os.getenv("OPENAI_API_TYPE")
openai.api_version = os.getenv("OPENAI_API_VERSION")
model=os.getenv("TEXT_DAVINCI_NAME")


Tiktoken uses a technique called byte pair encoding (BPE) to convert text into tokens. Several encodings are available, and different models use different encodings, so token counts can vary slightly between them. In this notebook, we will use the cl100k_base encoding.
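
To build some intuition for byte pair encoding, here is a toy sketch of its core idea: repeatedly find the most frequent pair of adjacent symbols and merge it into a single symbol. This is not how tiktoken is implemented. Real encodings work on bytes and apply a merge table learned in advance from a large corpus. The function names below are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

print(toy_bpe("aaabdaaabac", 1))
print(toy_bpe("aaabdaaabac", 3))
```

Each merge shortens the token list while the joined tokens still spell the original text, which is why frequent character sequences end up as single tokens.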

Let's count the number of tokens in the answer we received from the previous notebook.

In [None]:

def count_tokens(string: str, encoding_name: str) -> int:
    """Return the number of tokens in `string` under the named tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [None]:
text = """Kaiser Permanente offers four types of individual and family plans: Copay plans, Deductible plans, Virtual plans, and HSA qualified plans. Catastrophic plans are also available in some markets."""

num_tokens = count_tokens(text, "cl100k_base")

print(f"There are {num_tokens} tokens in this sentence: {text}")

What happens if we want to add in more context than what we put in the text variable? Let's say we want to provide more context with a PDF file. Let's try to count the number of tokens again.

In [None]:
# Adjust this path to wherever CH3-data.pdf lives on your machine
document = open(r'C:\Users\dthakar\Documents\WhatTheHack\xxx-OpenAIFundamentals\Student\Resources\data\CH3-data.pdf', 'rb')
doc_helper = PyPDF3.PdfFileReader(document)

In [None]:
# Concatenate the extracted text of every page in the PDF
finaltext = ''
totalpages = doc_helper.getNumPages()
for eachpage in range(totalpages):
    p = doc_helper.getPage(eachpage)
    indpagetext = p.extractText()
    finaltext += indpagetext

clean_text = finaltext.replace("  ", " ").replace("\n", "; ").replace(';',' ')

In [None]:
prompt = f"What is the answer to the following question regarding the PDF document?\n\n{finaltext}\n\n" 
q = "Can you give me a summary of the document?"

try:
    final_prompt = prompt + q
    response = openai.Completion.create(engine=model, prompt=final_prompt, max_tokens=50)
    answer = response.choices[0].text.strip()
    print(f"{q}\n{answer}\n")

except InvalidRequestError as e:
    print(e.error)



Running the snippet above produces an error message: the prompt exceeds the model's maximum context length. For GPT-3 models such as text-davinci-003, the limit is 4,097 tokens, shared between the prompt and the completion. How can we give the model the context it needs without running into the token limit?
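
The limit is simple arithmetic: the tokens in the prompt plus the tokens requested for the completion must not exceed the context length. The sketch below checks that before a request is sent. The helper names are hypothetical, and the default of 4,097 is the figure quoted above, so adjust it for the model you deploy. In this notebook, `prompt_tokens` would come from a token count of the final prompt.

```python
CONTEXT_LIMIT = 4097  # figure quoted in this notebook; varies by model

def fits_in_context(prompt_tokens: int, max_tokens: int,
                    context_limit: int = CONTEXT_LIMIT) -> bool:
    """Return True if the prompt plus the requested completion fits in the context."""
    return prompt_tokens + max_tokens <= context_limit

def remaining_completion_budget(prompt_tokens: int,
                                context_limit: int = CONTEXT_LIMIT) -> int:
    """Return how many tokens are left for the completion (never negative)."""
    return max(context_limit - prompt_tokens, 0)

print(fits_in_context(prompt_tokens=4000, max_tokens=50))
print(fits_in_context(prompt_tokens=4060, max_tokens=50))
print(remaining_completion_budget(prompt_tokens=4000))
```

A check like this lets you fail early with a clear message instead of waiting for the service to reject the request.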

To solve this problem, we can use a concept called chunking. Chunking limits the amount of information we pass into the model: instead of sending all of the data, we pass only the most relevant chunks. There are many considerations that come into play when chunking. For example, you need to figure out the best chunk size. If the chunks are too small, you may lose important context. If the chunks are too big, they may contain unnecessary information.
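
As a rough illustration of passing only the most relevant chunks, the sketch below scores each chunk by how many words it shares with the question and keeps the best `k`. Word overlap is an assumption made to keep the example self-contained, and `top_chunks` is a hypothetical helper. It is a stand-in for whatever relevance measure you actually use.

```python
import re

def words(text):
    """Return the set of lowercase words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_chunks(question, chunks, k=2):
    """Return the k chunks that share the most words with the question."""
    question_words = words(question)
    # sorted() is stable, so chunks with equal scores keep their original order
    ranked = sorted(chunks,
                    key=lambda chunk: len(question_words & words(chunk)),
                    reverse=True)
    return ranked[:k]

chunks = [
    "Deer grazed in a meadow.",
    "Birds chirped in the trees.",
    "The air smelled of flowers.",
]
print(top_chunks("Which animals grazed in the meadow?", chunks, k=1))
```

Only the selected chunks would then be added to the prompt, which keeps the token count under control.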

Below are some common chunking techniques.

1. Chunking with smaller chunks 
2. Chunking by splitting sentences  
3. Chunking with sentence overlap 
4. Chunking recursively 

Let us take a look at these techniques in action.

Method 1: Chunking with smaller chunks 

In [None]:
text = "The sun was setting over the horizon, casting a warm glow over the landscape. Birds chirped in the trees, and a gentle breeze rustled the leaves. In the distance, a herd of deer grazed in a meadow. The air was filled with the sweet scent of blooming flowers. It was a peaceful and serene scene, perfect for a quiet evening stroll."
chunks = text.split()

for chunk in chunks: 
    print(chunk)

In method one, we see an example of what not to do when chunking. The chunks are so small that each one makes no sense without the rest of the context. This is a key principle when chunking: you want to preserve the semantic meaning.

Method 2: Chunking by splitting sentences

In [None]:
text = "Today was a fun day. I had lots of ice cream. I also met my best friend Sally and we played together at the new playground."

for sentence in nlp(text).sents:
    print(sentence.text)

In method two, we see a better example of chunking. Here we are using the spaCy library to chunk the text into individual sentences. This can be useful when you are trying to do text summarization. You can rank the individual sentences and use the top results in the summary.  
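
The ranking step mentioned above can be sketched with nothing but the standard library. The example below splits text with a naive regular expression rather than spaCy and scores each sentence by the average frequency of its words. Both choices are simplifying assumptions, and `top_sentences` is a hypothetical helper.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naively split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def top_sentences(text, k=1):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    best = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in best]

text = ("Today was a fun day. I had lots of ice cream. I also met my best "
        "friend Sally and we played together at the new playground.")
print(top_sentences(text, k=1))
```

Word frequency is a crude signal on such a short text. The point is only the shape of the pipeline: split, score, keep the top results.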

Method 3: Chunking with sentence overlap

In [None]:
text = "The sun was setting over the horizon, casting a warm glow over the landscape. Birds chirped in the trees, and a gentle breeze rustled the leaves. In the distance, a herd of deer grazed in a meadow. The air was filled with the sweet scent of blooming flowers. It was a peaceful and serene scene, perfect for a quiet evening stroll."
doc = nlp(text)

sentences = list(doc.sents)
overlap = 1
chunks = []

for i in range(len(sentences) - overlap):
    chunk = sentences[i : i + overlap + 1]
    chunks.append(chunk)

for chunk in chunks:
    print([sent.text for sent in chunk])

In method three, we see an example of chunking that ensures the semantic meaning is kept. Because neighboring chunks share a sentence, the context is preserved between them. This is especially important when you are searching data for relevant results or when you are summarizing a piece of text, where capturing the relationships between the sentences matters.
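
The same idea generalizes to any window size and overlap. The sketch below is a hypothetical helper, `sliding_chunks`, that works on any list (sentences, words, or tokens). With `size=2` and `overlap=1` it produces the same pairs of consecutive sentences as the cell above.

```python
def sliding_chunks(items, size, overlap):
    """Split `items` into windows of `size` that share `overlap` items with the next."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk
        if start + size >= len(items):
            break
    return chunks

sentences = ["S1", "S2", "S3", "S4", "S5"]
print(sliding_chunks(sentences, size=2, overlap=1))
print(sliding_chunks(sentences, size=3, overlap=1))
```

A larger overlap preserves more context between chunks but also repeats more text, which costs tokens.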

Method 4: Chunking recursively using LangChain

In [None]:
split_text = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 30
)
docs = split_text.create_documents([clean_text])
docs

In method four, we see an example of chunking using LangChain, a popular framework for creating applications using large language models. In the previous methods you saw various hand-written examples of chunking. LangChain can make the chunking process easier with its built-in text splitters. These include fixed-size chunking as well as recursive chunking, which we saw above.

For example, CharacterTextSplitter splits the given text into chunks of a given size, with a given number of overlapping characters between neighboring chunks.

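RecursiveCharacterTextSplitter divides the text recursively. It tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters) and only falls back to the next separator when a piece is still larger than the chunk size. Again, you can provide the chunk size and the chunk overlap.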
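
To make the recursive idea concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores chunk overlap, drops the whitespace it splits on, and measures size in characters. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters,
    preferring the coarsest separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa bbbb cccc\n\ndddd eeee"
print(recursive_split(sample, chunk_size=9))
```

Because coarse separators are tried first, paragraphs and lines stay together whenever they fit, and text is only cut mid-word when nothing else works.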


To summarize, chunking is an important technique for two reasons. It helps you stay within the token limit when working with lots of data, and it improves the response you get back from the model by keeping the context focused. Finding the right chunking technique and chunk size is crucial to receiving relevant responses.