Issue: TokenTextSplitter with local tokenizer? #12078

Closed · Civel-1 opened this issue Oct 20, 2023 · 4 comments
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
🤖:docs (Changes to documentation and examples, like .md, .rst, .ipynb files; changes to the docs/ folder)
Ɑ: embeddings (Related to text embedding models module)

Comments

Civel-1 commented Oct 20, 2023

Issue you'd like to raise.

Hello,

I am using LangChain offline on a local machine. I'd like to split documents over tokens using TokenTextSplitter.

Unfortunately, I can't get the class to use a local tokenizer.

I tried to do

text_splitter = TokenTextSplitter(model_name='/my/path/to/my/tokenizer/', chunk_size=50, chunk_overlap=10)

like I did for the HuggingFaceEmbedding (and it worked pretty well).

But I get the following error:
Could not automatically map '/my/path/to/my/tokenizer/' to a tokenizer. Please use 'tiktoken.get_encoding' to explicitly get the tokenizer you expect

I couldn't find any info in the documentation about setting up an offline / local tokenizer.

Suggestion:

No response

dosubot bot added the Ɑ: embeddings, 🤖:bug, and 🤖:docs labels on Oct 20, 2023
dosubot bot commented Oct 20, 2023

🤖

Hello,

Thank you for reaching out, and I apologize for the confusion. The TokenTextSplitter class in LangChain can indeed be configured to use a local tokenizer when working offline. However, the way to do it is slightly different from what you've tried.

You can use the from_huggingface_tokenizer or from_tiktoken_encoder methods of the TextSplitter class, depending on the type of tokenizer you want to use.

Here's how you can do it:

  1. For a HuggingFace tokenizer:

from langchain.text_splitter import TextSplitter
from transformers import AutoTokenizer

# Load your local tokenizer
tokenizer = AutoTokenizer.from_pretrained('/my/path/to/my/tokenizer/')

# Configure the TextSplitter
text_splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=50, chunk_overlap=10)

  2. For a tiktoken encoder:

# Configure the TextSplitter
text_splitter = TextSplitter.from_tiktoken_encoder(encoding_name='gpt2', chunk_size=50, chunk_overlap=10)

Please replace 'gpt2' with the name of your encoding.

These methods are defined in the TextSplitter class in the LangChain repository.

As for the tiktoken.get_encoding method mentioned in the error message, it seems to be indirectly referenced in the get_token_ids method of the OpenAI class. The get_token_ids method uses tiktoken to encode the input text into token IDs. The tiktoken.encoding_for_model function is used to get the encoding for a specific model, and the encode method of the returned encoding is used to convert the text into token IDs. You can find more details about this in the OpenAI class in the LangChain repository.
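
For illustration, here is a minimal sketch of calling tiktoken directly, as the error message suggests (note that tiktoken may still try to download and cache the encoding file on first use, which matters in a fully offline setup):

import tiktoken

# Explicitly fetch a named encoding instead of mapping from a model name
enc = tiktoken.get_encoding('gpt2')

# Encode text into token IDs and count them
token_ids = enc.encode('Hello, world!')
print(len(token_ids))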

I hope this helps! If you have any other questions, feel free to ask.

Best regards.


Civel-1 commented Oct 20, 2023

@dosu-bot

# Configure the TextSplitter
text_splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size = 50, chunk_overlap = 10)

This is what I would like to achieve, but unfortunately this throws an error: 'Can't instantiate abstract class TextSplitter with abstract method split_text'.

I tried changing TextSplitter to TokenTextSplitter, but that throws errors too, as it tries to make online GET requests to fetch the gpt2 encoding.

dosubot bot commented Oct 20, 2023

🤖

Hello,

Thank you for reaching out. I understand you're trying to use a local tokenizer with the TokenTextSplitter class in the LangChain Python framework while working offline.

The TokenTextSplitter class in LangChain is designed to work with the tiktoken package, which is used to encode and decode the text. However, it seems like you're trying to use a local tokenizer from the transformers library, which is causing the issue.

To resolve this, you can create a custom function that uses your local tokenizer to count the length of the text. This function can then be passed to a concrete TextSplitter subclass as the length_function argument. Here's an example of how you can do this:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def local_tokenizer_length(text: str) -> int:
    # Assuming `tokenizer` is your local tokenizer instance
    return len(tokenizer.encode(text))

# Create a splitter using the local tokenizer to measure length
# (TextSplitter itself is abstract, so instantiate a concrete subclass)
text_splitter = RecursiveCharacterTextSplitter(length_function=local_tokenizer_length)

In this example, local_tokenizer_length is a function that uses your local tokenizer to count the tokens in the text. It is passed as the length_function argument, which the splitter uses to measure chunk lengths.
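
As a quick usage sketch (assuming `long_text` holds the document you want to split):

chunks = text_splitter.split_text(long_text)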

If you are using a custom tokenizer, you can also create a Tokenizer instance and use it with the split_text_on_tokens function. Here's an example:

from typing import List

from langchain.text_splitter import Tokenizer, split_text_on_tokens
from transformers import AutoTokenizer

# Load your local tokenizer; use a distinct name so the Tokenizer
# instance below does not shadow it inside the encode/decode functions
hf_tokenizer = AutoTokenizer.from_pretrained('/my/path/to/my/tokenizer/')

def local_tokenizer_encode(text: str) -> List[int]:
    return hf_tokenizer.encode(text)

def local_tokenizer_decode(token_ids: List[int]) -> str:
    return hf_tokenizer.decode(token_ids)

tokenizer = Tokenizer(
    chunk_overlap=200,
    tokens_per_chunk=4000,
    decode=local_tokenizer_decode,
    encode=local_tokenizer_encode,
)

splits = split_text_on_tokens(text=text, tokenizer=tokenizer)

In this example, split_text_on_tokens will split the text into chunks using your local tokenizer. The Tokenizer instance is created with your local tokenizer's encode and decode methods, and the chunk_overlap and tokens_per_chunk parameters control the size and overlap of the chunks.
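
As a quick sanity check (a sketch, assuming `text` holds your document and `hf_tokenizer` is the local HuggingFace tokenizer loaded above), you can verify that each chunk stays within the token budget:

for i, chunk in enumerate(splits[:3]):
    print(i, len(hf_tokenizer.encode(chunk)))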

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu


Civel-1 commented Oct 20, 2023

Surprisingly effective! The second bit of code using a custom tokenizer worked super well. Well done. Good bot <3
