Issue: TokenTextSplitter with local tokenizer? #12078
Comments
🤖 Hello, Thank you for reaching out and I apologize for the confusion. The TokenTextSplitter uses tiktoken by default, which expects a known model or encoding name rather than a local path. You can use the from_huggingface_tokenizer method to build a splitter around a locally stored Hugging Face tokenizer instead. Here's how you can do it:
from langchain.text_splitter import TextSplitter
from transformers import AutoTokenizer
# Load your local tokenizer
tokenizer = AutoTokenizer.from_pretrained('/my/path/to/my/tokenizer/')
# Configure the TextSplitter
text_splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size = 50, chunk_overlap = 10)
# Alternatively, configure the TextSplitter from a tiktoken encoding
text_splitter = TextSplitter.from_tiktoken_encoder(encoding_name='gpt2', chunk_size = 50, chunk_overlap = 10)

Please replace '/my/path/to/my/tokenizer/' with the path to your local tokenizer. These methods are defined in the TextSplitter class. I hope this helps! If you have any other questions, feel free to ask. Best regards. Sources
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
This is what I would like to achieve, but unfortunately it throws an error: 'Can't instantiate abstract class TextSplitter with abstract method split_text'. I tried changing TextSplitter to TokenTextSplitter, but that also throws errors because it tries to make online GET requests to fetch gpt-2.
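The abstract-class error comes from instantiating TextSplitter directly; the same from_huggingface_tokenizer classmethod also works on concrete subclasses, which avoids it. A minimal sketch, assuming the local tokenizer directory from above (the path is a placeholder):

from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter

# Load the tokenizer from a local directory; no network access is required
tokenizer = AutoTokenizer.from_pretrained('/my/path/to/my/tokenizer/')

# from_huggingface_tokenizer is defined on TextSplitter and inherited by its
# concrete subclasses; the tokenizer is only used to measure chunk lengths.
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=50, chunk_overlap=10
)

chunks = text_splitter.split_text("your long document text")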
🤖 Hello, Thank you for reaching out. I understand you're trying to use a local tokenizer with the text splitters. The error you hit occurs because TextSplitter is an abstract class and cannot be instantiated directly. To resolve this, you can create a custom function that uses your local tokenizer to count the length of a text, and pass it as the splitter's length_function:

from langchain.text_splitter import CharacterTextSplitter

def local_tokenizer_length(text: str) -> int:
    # Assuming `tokenizer` is your local tokenizer instance
    return len(tokenizer.encode(text))
# Create a splitter using the local tokenizer to measure lengths. TextSplitter
# itself is abstract, so instantiate a concrete subclass such as CharacterTextSplitter.
text_splitter = CharacterTextSplitter(length_function=local_tokenizer_length)

In this example, local_tokenizer_length uses your local tokenizer to measure the length of a text in tokens. If you are using a custom tokenizer, you can also create a Tokenizer object and split the text directly with split_text_on_tokens:

from typing import List
from langchain.text_splitter import Tokenizer, split_text_on_tokens

def local_tokenizer_encode(text: str) -> List[int]:
    # Assuming `tokenizer` is your local tokenizer instance
    return tokenizer.encode(text)

def local_tokenizer_decode(token_ids: List[int]) -> str:
    # Assuming `tokenizer` is your local tokenizer instance
    return tokenizer.decode(token_ids)

# Give the wrapper a distinct name so the encode/decode helpers above keep
# referring to your local tokenizer instance rather than this object.
tokenizer_wrapper = Tokenizer(
    chunk_overlap=200,
    tokens_per_chunk=4000,
    decode=local_tokenizer_decode,
    encode=local_tokenizer_encode,
)
splits = split_text_on_tokens(text=text, tokenizer=tokenizer_wrapper)

In this example, the encode and decode functions wrap your local tokenizer, and split_text_on_tokens splits the text (the text variable holds your document) into chunks of tokens_per_chunk tokens with chunk_overlap tokens of overlap. I hope this helps! If you have any other questions, feel free to ask. Best, Sources
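For completeness, the tokenizer assumed by the helper functions above can be a locally stored Hugging Face tokenizer; a small sketch, reusing the placeholder path from the original question:

from transformers import AutoTokenizer

# This is the local tokenizer instance that local_tokenizer_encode and
# local_tokenizer_decode refer to; loading from a local directory needs no network.
tokenizer = AutoTokenizer.from_pretrained('/my/path/to/my/tokenizer/')

With that in place, split_text_on_tokens runs fully offline, since every token operation goes through the local tokenizer.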
Surprisingly effective! The second bit of code using a custom tokenizer worked super well. Well done. Good bot <3
Issue you'd like to raise.
Hello,
I am using LangChain offline on a local machine. I'd like to split documents over tokens using TokenTextSplitter.
Unfortunately, I can't get the class to use a local tokenizer.
I tried to do
text_splitter = TokenTextSplitter(model_name='/my/path/to/my/tokenizer/', chunk_size = 50, chunk_overlap = 10)
like I did for the HuggingFaceEmbedding (and it worked pretty well).
But I get the following error:
Could not automatically map '/my/path/to/my/tokenizer/' to a tokenizer. Please use 'tiktoken.get_encoding' to explicitly get the tokenizer you expect
I couldn't find any info in the documentation about setting up an offline / local tokenizer.
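For context on the error: TokenTextSplitter resolves model_name through tiktoken, which only maps a fixed set of known model names to encodings and fetches the BPE file over the network on first use, so a filesystem path cannot work there. A rough sketch of what the message is pointing at (the encoding name is an example, not from this issue):

import tiktoken
from langchain.text_splitter import TokenTextSplitter

# tiktoken.get_encoding accepts known encoding names such as 'gpt2' or
# 'cl100k_base'; it cannot load an encoder from a local directory, and it
# downloads the BPE file on first use unless it is already cached.
enc = tiktoken.get_encoding('gpt2')

# The equivalent splitter configuration using an encoding name instead of model_name
text_splitter = TokenTextSplitter(encoding_name='gpt2', chunk_size=50, chunk_overlap=10)

Because of that network dependency, fully offline splitting generally needs a local tokenizer, either via from_huggingface_tokenizer or via the Tokenizer and split_text_on_tokens approach shown above.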
Suggestion:
No response