# SentenceTransformersTokenTextSplitter

This notebook demonstrates how to use the `SentenceTransformersTokenTextSplitter`.

Language models have a token limit that you should not exceed. When you split text into chunks, it is therefore a good idea to count tokens. Many tokenizers exist, so count tokens with the same tokenizer that the language model uses.

The `SentenceTransformersTokenTextSplitter` is a specialized text splitter for sentence-transformer models. By default, it splits text into chunks that fit the token window of the sentence-transformer model you want to use.
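To make the idea of "chunks that fit a token window" concrete, here is a minimal sketch of window-based splitting. It is not the library's implementation: `toy_tokenize` and `split_by_token_window` are hypothetical helpers, and the tokenizer is a whitespace stand-in rather than a real sentence-transformer tokenizer.

```python
from typing import List


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def split_by_token_window(
    text: str, tokens_per_chunk: int, chunk_overlap: int = 0
) -> List[str]:
    """Split text into chunks of at most `tokens_per_chunk` toy tokens.

    Consecutive chunks share `chunk_overlap` tokens.
    """
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")

    tokens = toy_tokenize(text)
    step = tokens_per_chunk - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))
        # Stop once the current window already reaches the end of the text.
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks


sample = "one two three four five six seven"
chunks = split_by_token_window(sample, tokens_per_chunk=3, chunk_overlap=1)
print(chunks)
```

With `chunk_overlap=0`, as used in this notebook, the windows simply follow one another without shared tokens.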

In [1]:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

In [2]:
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

In [3]:
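# `count_tokens` includes the start and stop tokens, so subtract them
# to get the number of tokens in the text itself.
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)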

2


In [4]:
# Number of repetitions that fit in one chunk, plus one more to overflow it
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

tokens in text to split: 514
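The arithmetic behind this count can be checked by hand. The sketch below assumes a window of 510 tokens. That value is hypothetical: it is consistent with the outputs printed in this notebook, but the real value comes from the loaded model through `splitter.maximum_tokens_per_chunk`. It also assumes that the token count of repeated text is the sum of the counts of its copies.

```python
# Hypothetical values; the real ones come from the loaded model and tokenizer.
window = 510         # assumed stand-in for splitter.maximum_tokens_per_chunk
tokens_per_copy = 2  # tokens in one "Lorem ", as printed by In [3]
special_tokens = 2   # start and stop tokens added by the tokenizer

# Same formula as In [4]: as many copies as fit in the window, plus one.
copies = window // tokens_per_copy + 1
content_tokens = copies * tokens_per_copy
total_tokens = content_tokens + special_tokens

# The extra copy guarantees that the text no longer fits in a single chunk.
assert content_tokens > window
print(copies, content_tokens, total_tokens)
```

Under these assumptions the text holds 256 copies, 512 content tokens, and 514 tokens once the start and stop tokens are counted.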


In [5]:
text_chunks = splitter.split_text(text=text_to_split)

# The second chunk holds the text that did not fit in the first one
print(text_chunks[1])

lorem
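The second chunk contains a single `lorem` because only one repetition overflows the first chunk. The text comes back in lowercase, presumably because chunks are decoded from tokens produced by a lowercasing tokenizer. The sketch below reproduces the chunk sizes under hypothetical numbers: a window of 510 tokens and 512 content tokens, both chosen to be consistent with the outputs above rather than read from the library.

```python
# Hypothetical values consistent with the notebook outputs, not library constants.
window = 510          # assumed tokens per chunk
content_tokens = 512  # assumed tokens in the text, without start and stop tokens
chunk_overlap = 0     # matches the splitter configuration in In [2]

step = window - chunk_overlap
# Size of each chunk: a full window, or whatever is left at the end.
chunk_sizes = [
    min(window, content_tokens - start)
    for start in range(0, content_tokens, step)
]
print(chunk_sizes)
```

The last entry is 2, which matches the two tokens of one `Lorem ` counted in `In [3]`.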
