# Challenge 03-B-Chunking 

## 1. Overview 

In this challenge, you will walk through the concepts of tokens and chunking. In the previous notebook (CH-03-A-Grounding), we provided additional context to ground the model. Is there a limit to how much additional context we can provide? Unfortunately, yes. Each model has a limit on the total number of tokens allowed in the input and the output combined.

So what are tokens? Tokens are the units in which Azure OpenAI models process text. A token can be a whole word or just a chunk of characters. Let's look at the total number of tokens in the response we got back from the first notebook in CH-03. There are many ways to count tokens. In this challenge, we will use the tiktoken library.

## 2. Let's Start Implementation

Start by importing the required modules. The following cells repeat key setup steps you completed in the previous challenges.

In [4]:
! pip install --upgrade click
! python -m spacy download en_core_web_sm


Collecting click
  Using cached click-8.1.6-py3-none-any.whl (97 kB)
Installing collected packages: click
  Attempting uninstall: click
    Found existing installation: click 7.1.2
    Uninstalling click-7.1.2:
      Successfully uninstalled click-7.1.2
Successfully installed click-8.1.6
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [1]:
import openai
import PyPDF3
import os
import json
import tiktoken
import spacy
from openai.error import InvalidRequestError

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

# Load spaCy's small English pipeline (used later for sentence splitting)
nlp = spacy.load("en_core_web_sm")

import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter

Set up your environment to access your Azure OpenAI keys. Refer to your Azure OpenAI resource in the Azure Portal to retrieve information regarding your Azure OpenAI endpoint and keys. 

For security purposes, store your sensitive information in a .env file.

In [2]:
# Load your OpenAI credentials
API_KEY = os.getenv("OPENAI_API_KEY")
assert API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = API_KEY

RESOURCE_ENDPOINT = os.getenv("OPENAI_API_BASE","").strip()
assert RESOURCE_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in RESOURCE_ENDPOINT.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = RESOURCE_ENDPOINT

openai.api_type = os.getenv("OPENAI_API_TYPE")
openai.api_version = os.getenv("OPENAI_API_VERSION")
model=os.getenv("CHAT_MODEL_NAME")


## 3. Counting Tokens

Tiktoken uses a technique called byte pair encoding (BPE) to convert the given text into tokens. Several encodings are available, and each one splits text differently. In this notebook, we will use the cl100k_base encoding.
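
To make the idea of byte pair encoding more concrete, here is a toy sketch in plain Python. It is not tiktoken's implementation: tiktoken works on bytes and applies a merge table learned in advance, whereas this example learns its merges from one short string. The function names `most_frequent_pair`, `merge_pair`, and `toy_bpe` are made up for this illustration.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return None
    # max() keeps the first pair seen when counts tie, so the demo is deterministic
    return max(pairs, key=pairs.get)

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` pair merges."""
    symbols = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        symbols = merge_pair(symbols, pair)
    return symbols

print(toy_bpe("aaabdaaabac", 3))
```

With three merges, the 11 characters of the sample string collapse into 5 symbols. Frequent character sequences end up as single tokens, which is why common words usually cost fewer tokens than rare ones.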

#### Student Task #1: 

Count the number of tokens in the final answer we received in CH-03-A-Grounding by completing the function, count_tokens, below. 

In [3]:
def count_tokens(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

#### Student Task #2:

Enter the text from the answer you received in CH-03-A-Grounding. Run the cell below to retrieve the number of tokens using the count_tokens function.

In [5]:
text = """Carlos Alcaraz
- Marketa Vondrousova."""

num_tokens = count_tokens(text, "cl100k_base")

print(f"There are {num_tokens} tokens in this sentence: {text}")

There are 14 tokens in this sentence: Carlos Alcaraz
- Marketa Vondrousova.


Ok, so now we know how many tokens we are working with. What happens if we want to add more context than what we already put in the text variable above? In our Wimbledon scenario, we need to give the model more context so that it knows everything it needs about the tournament and, more importantly, everything it needs to answer your questions when writing the report. Let's say we want to provide that context with a PDF document. Can we get a summary of the PDF document to help us with our paper?

#### Student Task #3: 

In the cell below, insert the path of the PDF document given to you in the Resources.zip file. Run the three cells to see the output.

In [6]:
document = open(r'C:\Users\dthakar\Documents\GitHub\WhatTheHack\xxx-OpenAIFundamentals\Student\Resources\data\CH3-data.pdf', 'rb')
doc_helper = PyPDF3.PdfFileReader(document)

In [7]:
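# Concatenate the extracted text of every page into one string
finaltext = ''
totalpages = doc_helper.getNumPages()
for eachpage in range(totalpages):
    p = doc_helper.getPage(eachpage)
    indpagetext = p.extractText()
    finaltext += indpagetext

# Replace double spaces with single spaces, then turn each newline (and any semicolon) into spaces
clean_text = finaltext.replace("  ", " ").replace("\n", "; ").replace(';',' ')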

In [8]:
prompt = f"What is the answer to the following question regarding the PDF document?\n\n{finaltext}\n\n" 
q = "Can you give me a summary of the document?"

try:
    final_prompt = prompt + q
    response = openai.Completion.create(engine=model, prompt=final_prompt, max_tokens=50)
    answer = response.choices[0].text.strip()
    print(f"{q}\n{answer}\n")

except InvalidRequestError as e:
    print(e.error)



{
  "message": "This model's maximum context length is 8193 tokens, however you requested 39536 tokens (39486 in your prompt; 50 for the completion). Please reduce your prompt; or completion length.",
  "type": "invalid_request_error",
  "param": null,
  "code": null
}


As you can see above, running this code returns an error: the request exceeds the model's maximum context length. The limit depends on the model. The deployment used here allows 8193 tokens, as the error message reports, while for GPT-3 models the token limit is 4097 tokens. How can we give the model all of the needed context without running into the token limit?

To solve this problem, we can take a look at a concept called chunking.
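
The numbers in the error message can be turned into a quick budget check. The sketch below uses the values reported above (39486 prompt tokens, 50 completion tokens, a context length of 8193) and estimates the minimum number of requests the document would have to be spread across. The helper name `chunks_needed` is hypothetical, and the estimate ignores the extra tokens each request spends on the instructions and the question.

```python
import math

def chunks_needed(prompt_tokens, completion_tokens, context_limit):
    """Return (fits, min_requests) for a prompt under a combined token limit."""
    # The completion budget is reserved first; the prompt gets whatever is left
    available = context_limit - completion_tokens
    if available <= 0:
        raise ValueError("completion_tokens leaves no room for the prompt")
    fits = prompt_tokens <= available
    min_requests = max(1, math.ceil(prompt_tokens / available))
    return fits, min_requests

fits, min_requests = chunks_needed(
    prompt_tokens=39486, completion_tokens=50, context_limit=8193
)
print(f"Fits in one request: {fits}; minimum number of requests: {min_requests}")
```

For these numbers the prompt does not fit, and the text would need to be divided into at least 5 pieces.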

## 4. Chunking

Chunking helps limit the amount of information we pass into the model. Instead of sending everything, we pass only the most relevant chunks of the overall data. There are many considerations when chunking. For example, you need to find the best chunk size. If the chunks are too small, you may lose important context. If the chunks are too big, they may contain unnecessary information.
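
As a rough illustration of passing only the most relevant chunks, the sketch below scores each chunk by how many words it shares with a question and keeps the best matches. This is an assumption made for demonstration purposes: word overlap is a crude measure of relevance, and the helper names `words` and `most_relevant_chunks` are hypothetical.

```python
import re

def words(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def most_relevant_chunks(chunks, question, top_k=2):
    """Return up to top_k chunks sharing at least one word with the question."""
    question_words = words(question)
    scored = [(len(question_words & words(chunk)), i) for i, chunk in enumerate(chunks)]
    # Highest overlap first; ties keep the original chunk order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for score, i in scored[:top_k] if score > 0]

chunks = [
    "The sun was setting over the horizon.",
    "Birds chirped in the trees.",
    "A herd of deer grazed in a meadow.",
]
print(most_relevant_chunks(chunks, "Which deer were in a meadow?", top_k=1))
```

Only the selected chunks would then be placed in the prompt, which keeps the request inside the token limit.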

Below are some common chunking techniques.

1. Chunking with smaller chunks 
2. Chunking by splitting sentences  
3. Chunking with sentence overlap 
4. Chunking recursively 

Let us take a look at these techniques in action.

### 4.1 Chunking with smaller chunks 

#### Student Task #4: Add code in the cell below. Use the split() function to chunk the text.

In [9]:

text = "The sun was setting over the horizon, casting a warm glow over the landscape. Birds chirped in the trees, and a gentle breeze rustled the leaves. In the distance, a herd of deer grazed in a meadow. The air was filled with the sweet scent of blooming flowers. It was a peaceful and serene scene, perfect for a quiet evening stroll."

chunks = text.split()

for chunk in chunks: 
    print(chunk)

The
sun
was
setting
over
the
horizon,
casting
a
warm
glow
over
the
landscape.
Birds
chirped
in
the
trees,
and
a
gentle
breeze
rustled
the
leaves.
In
the
distance,
a
herd
of
deer
grazed
in
a
meadow.
The
air
was
filled
with
the
sweet
scent
of
blooming
flowers.
It
was
a
peaceful
and
serene
scene,
perfect
for
a
quiet
evening
stroll.


What can you observe about the chunks returned? If each chunk stood by itself, would you be able to understand the semantic meaning?
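
One way to explore that question is to group the words into slightly larger chunks and compare how readable each chunk is. The sketch below is a simple variation on split(); the function name `group_words` and the chunk sizes are arbitrary choices for illustration.

```python
def group_words(text, words_per_chunk):
    """Split text on whitespace and join every `words_per_chunk` words into a chunk."""
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

sample = "The sun was setting over the horizon, casting a warm glow over the landscape."
for size in (1, 4, 8):
    print(size, group_words(sample, size))
```

Larger groups carry more meaning on their own, but they still cut sentences at arbitrary points, which is what the next technique addresses.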

### 4.2: Chunking by splitting sentences

#### Student Task #5: Add code in the cell below. Use the spaCy library, specifically the sents property, to chunk the text.

In [10]:
text = "Today was a fun day. I had lots of ice cream. I also met my best friend Sally and we played together at the new playground."

for sentence in nlp(text).sents:
    print(sentence.text)

Today was a fun day.
I had lots of ice cream.
I also met my best friend Sally and we played together at the new playground.


Are the results better than the method in 4.1? The spaCy library helps to chunk the text into individual sentences. This can be useful for text summarization: you can rank the individual sentences and use the top results in the summary.
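
A minimal sketch of sentence ranking is shown below. To keep it self-contained it makes two assumptions: a regular expression stands in for spaCy's sentence splitting, and sentences are scored by the average frequency of their words, which is a deliberately crude heuristic. The names `split_sentences` and `top_sentences` are hypothetical.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive stand-in for spaCy's doc.sents: split after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def top_sentences(text, k=1):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]

text = ("Today was a fun day. I had lots of ice cream. "
        "I also met my best friend Sally and we played together at the new playground.")
print(top_sentences(text, k=2))
```

In practice you would likely remove stop words or use a stronger relevance signal, but the overall shape stays the same: split, score, keep the top results.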

### 4.3: Chunking with sentence overlap 

#### Student Task #6: Run the code below to see another example of chunking. As you will see, the semantic meaning is kept. In other words, the context is preserved between the sentences. This is especially important when you are searching data for relevant results or when you are summarizing a piece of text. It is important to capture the relationships between the sentences.

In [11]:
text = "The sun was setting over the horizon, casting a warm glow over the landscape. Birds chirped in the trees, and a gentle breeze rustled the leaves. In the distance, a herd of deer grazed in a meadow. The air was filled with the sweet scent of blooming flowers. It was a peaceful and serene scene, perfect for a quiet evening stroll."
doc = nlp(text)

sentences = list(doc.sents)
overlap = 1
chunks =[]

for i in range(len(sentences) - overlap):
    chunk = sentences[i : i + overlap + 1]
    chunks.append(chunk)

for chunk in chunks:
    print([sent.text for sent in chunk])

['The sun was setting over the horizon, casting a warm glow over the landscape.', 'Birds chirped in the trees, and a gentle breeze rustled the leaves.']
['Birds chirped in the trees, and a gentle breeze rustled the leaves.', 'In the distance, a herd of deer grazed in a meadow.']
['In the distance, a herd of deer grazed in a meadow.', 'The air was filled with the sweet scent of blooming flowers.']
['The air was filled with the sweet scent of blooming flowers.', 'It was a peaceful and serene scene, perfect for a quiet evening stroll.']
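
The loop above always builds windows of two sentences that advance one sentence at a time. A more general sketch lets you choose the window size and the overlap separately. It works on any list, so plain strings are used here instead of spaCy spans; the function name `sliding_chunks` and its parameters are assumptions for this example.

```python
def sliding_chunks(items, size, overlap):
    """Return windows of `size` items where neighbors share `overlap` items."""
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("need size >= 1 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reaches the end of the list
    return chunks

sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(sliding_chunks(sentences, size=2, overlap=1))
print(sliding_chunks(sentences, size=3, overlap=1))
```

With size=2 and overlap=1 this reproduces the pairing used in the cell above. A larger size with the same overlap produces fewer, longer chunks, and the final chunk may be shorter than the others.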


### 4.4: Chunking recursively using LangChain

#### Student Task #7: Add in the required parameters for the RecursiveCharacterTextSplitter in the cell below.

In [12]:
split_text = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 30 
)
docs = split_text.create_documents([clean_text])
docs

[Document(page_content='Formula 1 Power Unit Financial Regulations     1     16 August     2022     © 202  2                Issue   1              FORMULA 1   POWER UNIT   FINANCIAL REGULATIONS     PUBLISHED ON   16 August     2022     Issue   1        CONTENTS        Art     CONTENTS     Page(s)     1.     GENERAL', metadata={}),
 Document(page_content='Page(s)     1.     GENERAL PRINCIPLES     ................................  ................................  ................................  ........     2     2.     POWER UNIT MANUFACTURER OBLIGATIONS     ................................  ................................  ....     3     3.', metadata={}),
 Document(page_content='....     3     3.     EXCLUSIONS     ................................  ................................  ................................  .....................     5     4.     ADJUSTMENTS     ................................  ................................  ................................', metadata={

Above, we did some chunking using LangChain, a popular framework for building applications with large language models. The previous sections showed several ways to chunk text by hand. LangChain makes the process easier with built-in text splitters, which cover fixed-size chunking as well as the recursive chunking we just saw.

For example, CharacterTextSplitter splits the given text on a single separator and merges the pieces into chunks of up to a given size, with a given number of overlapping characters.

RecursiveCharacterTextSplitter divides the text into smaller chunks by trying a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters) until the chunks are small enough. Again, you can provide the chunk size and chunk overlap.
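
The recursive idea can be sketched in plain Python: try the coarsest separator first, and fall back to a finer one only for pieces that are still too long. This is a simplified illustration, not LangChain's implementation. It ignores chunk overlap and drops separators at chunk boundaries, and the function name `recursive_split` and its default separator list are assumptions for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separators left: fall back to a hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph with a few words.\n\nSecond paragraph.\nIt has two lines."
for chunk in recursive_split(sample, chunk_size=20):
    print(repr(chunk))
```

Because paragraph and line breaks are tried before spaces, chunks tend to end at natural boundaries whenever the size limit allows it.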


Chunking is an important technique for many reasons. It helps you stay within the token limit when working with lots of data, and it can improve the responses you get back from the model. Finding the right chunking technique and chunk size is crucial to receiving relevant responses.

## Success Criteria

To complete this challenge successfully:

* Show an understanding of tokens and how to calculate them.
* Show an understanding of chunking by experimenting with different techniques.
* Be able to explain the importance of finding the right chunking solution based on whether the semantic meaning is captured.