### In summary
Steps to build a recursive_text_splitter to split chunks each below a target length.

1. Build a function to count the total number of tokens (This is different from counting words)
2. Build a function that halves a piece of string by its delimiter
3. Build a function that truncate a string based on the number of tokens
4. Build a recursive_text_splitter that splits a body of text into chunks. Use halve method to cut it in half, feed each half to the function recurssively until they are smaller than the max_len required.

In [8]:
# Get number of tokens
import tiktoken
BASE_MODEL = "gpt-4"

def num_tokens(text: str) -> int:
    tokenizer = tiktoken.encoding_for_model(BASE_MODEL)
    return len(tokenizer.encode(text))

In [53]:
# Split a chunk of text into half, try to blanace the number of words in each half

def halve_text(string: str, delimiter: str = '\n') -> tuple[str, str]:
    chunks = string.split(delimiter)
    half = num_tokens(string) // 2
    token_so_far = 0

    for i, chunk in enumerate(chunks):
        curr_token = num_tokens(chunk)
        if token_so_far > half:
            return (delimiter.join(chunks[:i]), delimiter.join(chunks[i:]))
        else:
            token_so_far += curr_token

    return ["", ""]

In [54]:
text = """
The Jamaica Wine House, known locally as "the Jampot", is located in St Michael's Alley, Cornhill, in the heart of London's financial district. It was the first coffee house in London and was visited by the English diarist Samuel Pepys in 1660.

[1] It is now a Grade II listed public house[2] and is set within a labyrinth of medieval courts and alleys in the City of London. It lies in the ward of Cornhill.

The Jamaica Wine House has historic links to the sugar trade of the West Indies and the Ottoman Empire. 

There is a plaque on the wall which reads "Here stood the first London Coffee house at the sign of the Pasqua Rosee's Head 1652." 

Pasqua Rosée, the proprietor, was the servant of a Levant Company merchant named Daniel Edwards, a trader in Ottoman goods, who imported the coffee and assisted Rosée in setting up the establishment. The coffee house, which opened in 1652, is known in some accounts as The Turk's Head.[3][4][5]

The building that currently stands on the site is a 19th-century public house. This pub's licence was acquired by Shepherd Neame[6] and the premises were reopened after a restoration that finished in April 2009. 

There is a wood-panelled bar with three sections on the ground floor and downstairs restaurant.

"""

left, right = halve_text(text)
print(left)


The Jamaica Wine House, known locally as "the Jampot", is located in St Michael's Alley, Cornhill, in the heart of London's financial district. It was the first coffee house in London and was visited by the English diarist Samuel Pepys in 1660.

[1] It is now a Grade II listed public house[2] and is set within a labyrinth of medieval courts and alleys in the City of London. It lies in the ward of Cornhill.

The Jamaica Wine House has historic links to the sugar trade of the West Indies and the Ottoman Empire. 

There is a plaque on the wall which reads "Here stood the first London Coffee house at the sign of the Pasqua Rosee's Head 1652." 


In [55]:
# Test how decode works

import tiktoken
model = "gpt-3.5-turbo"

string = "I like to eat bananas."
max_tokens = 3
tokenizer = tiktoken.encoding_for_model(model)
encoded_string = tokenizer.encode(string)
print(encoded_string)

truncated_string = tokenizer.decode(encoded_string[:max_tokens])
truncated_string

[40, 1093, 311, 8343, 68442, 13]


'I like to'

In [56]:
# Truncate string

def truncate_string(string: str, max_len: int) -> str:
    tokenizer = tiktoken.encoding_for_model(BASE_MODEL)
    encoded_string = tokenizer.encode(string)
    truncated_string = tokenizer.decode(encoded_string[:max_len + 1])
    return truncated_string

In [57]:
t = truncate_string(text, 10)
print(t)


The Jamaica Wine House, known locally as "the


In [58]:
# Split string into chunks fo max_len recursively
def recursive_text_splitter(text: str, max_len: int, max_recurssion: int) -> list[str]:
    length = num_tokens(text)
    if length <= max_len:
        return [text]

    if max_recurssion == 0:
        return [truncate_string(text, max_len)]

    for delimiter in ["\n\n", "\n", ". "]:
        left, right = halve_text(text, delimiter)
        if left == "" or right =="":
            # Try a more fine grained delminiator
            continue

        chunks = []
        for half in [left, right]:
            half_chunks = recursive_text_splitter(half, max_len, max_recurssion - 1)
            chunks.extend(half_chunks)
    
        return chunks

    # no split was found, just truncate
    return [truncate_string(text, max_len)]

In [62]:
chunks = recursive_text_splitter(text, 50, 5)
for i, c in enumerate(chunks):
    print(i, c)
    print()

0 
The Jamaica Wine House, known locally as "the Jampot", is located in St Michael's Alley, Cornhill, in the heart of London's financial district

1 It was the first coffee house in London and was visited by the English diarist Samuel Pepys in 1660.

2 [1] It is now a Grade II listed public house[2] and is set within a labyrinth of medieval courts and alleys in the City of London. It lies in the ward of Cornhill.

3 The Jamaica Wine House has historic links to the sugar trade of the West Indies and the Ottoman Empire. 

There is a plaque on the wall which reads "Here stood the first London Coffee house at the sign of the Pasqua Rosee's Head 165

4 Pasqua Rosée, the proprietor, was the servant of a Levant Company merchant named Daniel Edwards, a trader in Ottoman goods, who imported the coffee and assisted Rosée in setting up the establishment

5 The coffee house, which opened in 1652, is known in some accounts as The Turk's Head.[3][4][5]

6 The building that currently stands on the si