In [1]:
import langchain

In [2]:
content = """A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks typically built with a transformer-based architecture. Some recent implementations are based on alternative architectures such as recurrent neural network variants and Mamba (a state space model).

LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word. Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results. They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also inaccuracies and biases present in the corpora.

Some notable LLMs are OpenAI GPT models (e.g., GPT-3.5 and GPT-4, used in ChatGPT), Google's PaLM and Gemini (used in Bard), Microsoft's Copilot, Meta LLaMA family of open-source models, and Anthropic's Claude models.

Although sometimes matching human performance, it is not clear they are plausible cognitive models. At least for recurrent neural networks it has been shown that they sometimes learn patterns which humans do not learn, but fail to learn patterns that humans typically do learn.
"""

In [3]:
content

'A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks typically built with a transformer-based architecture. Some recent implementations are based on alternative architectures such as recurrent neural network variants and Mamba (a state space model).\n\nLLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word. Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results. They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also in

In [4]:
len(content)

1541

### Split by Character

In [5]:
from langchain.text_splitter import CharacterTextSplitter

In [8]:
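# Split on single newlines, then merge pieces up to 400 characters.
# A paragraph with no internal "\n" cannot be broken further, so any
# paragraph longer than chunk_size is emitted whole (see the warnings below).
text_splitter = CharacterTextSplitter(chunk_size=400,
                                      chunk_overlap=10,
                                      separator="\n",
                                      length_function=len,
                                      is_separator_regex=False)

chunks = text_splitter.split_text(content)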

Created a chunk of size 529, which is longer than the specified 400
Created a chunk of size 511, which is longer than the specified 400


In [9]:
print("Total number of chunks in the Paragraph:", len(chunks))

Total number of chunks in the Paragraph: 4


In [10]:
for i, chunk in enumerate(chunks):
    print(f"Chunk #{i}: {chunk}")
    print("-"*100)

Chunk #0: A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks typically built with a transformer-based architecture. Some recent implementations are based on alternative architectures such as recurrent neural network variants and Mamba (a state space model).
----------------------------------------------------------------------------------------------------
Chunk #1: LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word. Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results. They ar

In [11]:
for i, chunk in enumerate(chunks):
    print(f"Chunk #{i}, Size:{len(chunk)}")

Chunk #0, Size:529
Chunk #1, Size:511
Chunk #2, Size:217
Chunk #3, Size:277
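
The two warnings and the 529- and 511-character chunks come from the fact that a single-separator splitter can only cut where the separator occurs. The sketch below is plain Python, not LangChain's implementation. `oversized_pieces` and `sample` are hypothetical names, and `sample` is synthetic filler that only mirrors the paragraph sizes reported above.

```python
def oversized_pieces(text: str, separator: str, chunk_size: int) -> list[int]:
    """Return the lengths of separator-delimited pieces longer than chunk_size."""
    pieces = [p for p in text.split(separator) if p]  # drop empty pieces between blank lines
    return [len(p) for p in pieces if len(p) > chunk_size]


# Synthetic stand-in for `content`: four paragraphs with the sizes reported above,
# separated by blank lines (hypothetical filler text, not the real article).
sample = "a" * 529 + "\n\n" + "b" * 511 + "\n\n" + "c" * 217 + "\n\n" + "d" * 277 + "\n"

print(oversized_pieces(sample, "\n", 400))  # [529, 511]
```

Any piece reported here has no internal separator, so it ends up as a chunk larger than `chunk_size`. The recursive splitter in the next section avoids this by falling back to finer separators.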


### Recursive Split by Character

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [14]:
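# With no separators given, the splitter tries its defaults in order
# ("\n\n", "\n", " ", "") until each piece fits within chunk_size.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400,
                                               chunk_overlap=10,
                                               length_function=len,
                                               is_separator_regex=False)

chunks = text_splitter.split_text(content)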

In [15]:
print("Total number of chunks in the Paragraph:", len(chunks))

Total number of chunks in the Paragraph: 6


In [16]:
for i, chunk in enumerate(chunks):
    print(f"Chunk #{i}: {chunk}")
    print("-"*100)

Chunk #0: A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks typically built with a transformer-based architecture. Some recent
----------------------------------------------------------------------------------------------------
Chunk #1: recent implementations are based on alternative architectures such as recurrent neural network variants and Mamba (a state space model).
----------------------------------------------------------------------------------------------------
Chunk #2: LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word. Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish spe
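
Recursive splitting tries coarse separators first and falls back to finer ones only for pieces that are still too long. The sketch below shows that idea in plain Python. It is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and the sample text is made up.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split text so that every returned piece has len(piece) <= chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                pieces.append(part)
        else:
            # Still too long: retry this part with the finer separators.
            pieces.extend(recursive_split(part, chunk_size, finer))

    # Greedily merge neighbouring pieces while the result still fits.
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= chunk_size:
            merged[-1] = merged[-1] + sep + piece
        else:
            merged.append(piece)
    return merged


# Made-up sample: one 599-character paragraph and one short paragraph.
text = " ".join(["word"] * 120) + "\n\n" + "short paragraph"
chunks_demo = recursive_split(text, 400)
print([len(c) for c in chunks_demo])  # [399, 216]
```

The long paragraph has no newline to cut on, so the sketch falls back to splitting on spaces. No chunk exceeds 400 characters, which mirrors the recursive result above.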

### Tiktoken Tokenizer

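tiktoken is the tokenizer OpenAI created for its family of models. With this strategy, chunk length is measured in tokens rather than characters.
LangChain's `CharacterTextSplitter.from_tiktoken_encoder` still splits on characters and only measures length in tokens, whereas `TokenTextSplitter`, used below, cuts directly at token boundaries.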
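
To show how the length measure changes the result, the sketch below merges the same pieces twice, once counting characters and once counting tokens. It is plain Python under stated assumptions: `merge_by_length` and `toy_token_count` are hypothetical helpers, and the toy counter (one token per whitespace-separated word) merely stands in for a real tokenizer such as tiktoken.

```python
from typing import Callable


def merge_by_length(pieces: list[str], chunk_size: int,
                    length_function: Callable[[str], int],
                    joiner: str = " ") -> list[str]:
    """Greedily merge pieces while length_function(chunk) stays within chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)  # current chunk is full; start a new one
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def toy_token_count(text: str) -> int:
    # Hypothetical stand-in for a real tokenizer: one token per word.
    return len(text.split())


sentences = [
    "LLMs predict the next token.",
    "Chunk size can be counted in tokens.",
    "Or in characters.",
]

by_chars = merge_by_length(sentences, 40, len)               # limit: 40 characters
by_tokens = merge_by_length(sentences, 12, toy_token_count)  # limit: 12 toy tokens

print(len(by_chars), len(by_tokens))  # 3 2
```

Only the length function differs between the two calls, and it alone decides where the chunk boundaries fall.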

In [17]:
from langchain.text_splitter import TokenTextSplitter
import tiktoken

In [18]:
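# chunk_size and chunk_overlap are token counts here. encoding_name selects the
# tiktoken encoding (the splitter defaults to "gpt2"); length_function is not needed.
text_splitter = TokenTextSplitter(encoding_name="cl100k_base",
                                  chunk_size=400,
                                  chunk_overlap=10)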

In [19]:
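# create_documents expects a list of strings; a bare string would be iterated character by character.
texts = text_splitter.create_documents([content])

In [20]:
encoding = tiktoken.get_encoding("cl100k_base")

In [21]:
# Count the token-based chunks in texts, not the chunks list left over from the recursive splitter.
print("Total number of chunks Created:", len(texts))

Total number of chunks Created: 1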


In [25]:
print(f"Total Number of Tokens in the document: {len(encoding.encode(content))} tokens")

Total Number of Tokens in the document: 314 tokens


### Hugging Face Tokenizer

Hugging Face has become a go-to platform for building applications on LLMs and other models. Models hosted on the Hugging Face Hub generally ship with a matching tokenizer, which LangChain can use to measure chunk length in tokens.

In [30]:
from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [32]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

In [38]:
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer,
                                                                          chunk_size=400,
                                                                          chunk_overlap=10)

In [39]:
texts = text_splitter.split_text(content)

In [40]:
texts[0]

'A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks typically built with a transformer-based architecture. Some recent implementations are based on alternative architectures such as recurrent neural network variants and Mamba (a state space model).\n\nLLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word. Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results. They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also in