Issue: TokenTextSplitter with local tokenizer? #12078

Closed · Civel-1 opened this issue Oct 20, 2023 · 4 comments
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
🤖:docs (Changes to documentation and examples, like .md, .rst, .ipynb files; changes to the docs/ folder)
Ɑ: embeddings (Related to text embedding models module)

Comments

Civel-1 commented Oct 20, 2023

Issue you'd like to raise.

Hello,

I am using LangChain offline on a local machine. I'd like to split documents over tokens using TokenTextSplitter.

Unfortunately, I can't get the class to use a local tokenizer.

I tried to do

text_splitter = TokenTextSplitter(model_name='/my/path/to/my/tokenizer/', chunk_size=50, chunk_overlap=10)

like I did for the HuggingFaceEmbedding (and it worked pretty well).

But I get the following error:
Could not automatically map '/my/path/to/my/tokenizer/' to a tokenizer. Please use 'tiktoken.get_encoding' to explicitly get the tokenizer you expect

I couldn't find any info in the documentation about setting up an offline / local tokenizer.

Suggestion:

No response

dosubot bot added the Ɑ: embeddings, 🤖:bug, and 🤖:docs labels on Oct 20, 2023
dosubot bot commented Oct 20, 2023

🤖

Hello,

Thank you for reaching out, and I apologize for the confusion. The TokenTextSplitter class in LangChain can indeed be configured to use a local tokenizer when working offline. However, the way to do it is slightly different from what you've tried.

You can use the from_huggingface_tokenizer or from_tiktoken_encoder methods of the TextSplitter class, depending on the type of tokenizer you want to use.

Here's how you can do it:

  1. For a HuggingFace tokenizer:

from langchain.text_splitter import TextSplitter
from transformers import AutoTokenizer

# Load your local tokenizer
tokenizer = AutoTokenizer.from_pretrained('/my/path/to/my/tokenizer/')

# Configure the TextSplitter
text_splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=50, chunk_overlap=10)

  2. For a tiktoken encoder:

# Configure the TextSplitter
text_splitter = TextSplitter.from_tiktoken_encoder(encoding_name='gpt2', chunk_size=50, chunk_overlap=10)

Please replace 'gpt2' with the name of your encoding.

These methods are defined in the TextSplitter class in the LangChain repository.

As for the tiktoken.get_encoding method mentioned in the error message, it seems to be indirectly referenced in the get_token_ids method of the OpenAI class. The get_token_ids method uses tiktoken to encode the input text into token IDs. The tiktoken.encoding_for_model function is used to get the encoding for a specific model, and the encode method of the returned encoding is used to convert the text into token IDs. You can find more details about this in the OpenAI class in the LangChain repository.
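
For illustration, here is a minimal sketch of calling tiktoken directly, as the error message suggests (note that tiktoken may still try to download and cache the encoding file on first use, which matters in a fully offline setup):

import tiktoken

# Explicitly fetch a named encoding instead of mapping from a model name
enc = tiktoken.get_encoding('gpt2')

# Encode text into token IDs and count them
token_ids = enc.encode('Hello, world!')
print(len(token_ids))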

I hope this helps! If you have any other questions, feel free to ask.

Best regards.


Civel-1 commented Oct 20, 2023

@dosu-bot

# Configure the TextSplitter
text_splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size = 50, chunk_overlap = 10)

This is what I would like to achieve, but unfortunately this throws an error: 'Can't instantiate abstract class TextSplitter with abstract method split_text'.

I tried changing TextSplitter to TokenTextSplitter, but that throws errors too, as it tries to make online GET requests to fetch the gpt2 encoding.

dosubot bot commented Oct 20, 2023

🤖

Hello,

Thank you for reaching out. I understand you're trying to use a local tokenizer with the TokenTextSplitter class in the LangChain Python framework while working offline.

The TokenTextSplitter class in LangChain is designed to work with the tiktoken package, which is used to encode and decode the text. However, it seems like you're trying to use a local tokenizer from the transformers library, which is causing the issue.

To resolve this, you can create a custom function that uses your local tokenizer to count the length of the text. This function can then be passed to a concrete TextSplitter subclass as the length_function argument. Here's an example of how you can do this:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def local_tokenizer_length(text: str) -> int:
    # Assuming `tokenizer` is your local tokenizer instance
    return len(tokenizer.encode(text))

# Create a splitter using the local tokenizer to measure length
# (TextSplitter itself is abstract, so instantiate a concrete subclass)
text_splitter = RecursiveCharacterTextSplitter(length_function=local_tokenizer_length)

In this example, local_tokenizer_length is a function that uses your local tokenizer to count the tokens in the text. It is passed as the length_function argument, which the splitter uses to measure chunk lengths.
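
As a quick usage sketch (assuming `long_text` holds the document you want to split):

chunks = text_splitter.split_text(long_text)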

If you are using a custom tokenizer, you can also create a Tokenizer instance and use it with the split_text_on_tokens function. Here's an example:

from typing import List

from langchain.text_splitter import Tokenizer, split_text_on_tokens
from transformers import AutoTokenizer

# Load your local tokenizer; use a distinct name so the Tokenizer
# instance below does not shadow it inside the encode/decode functions
hf_tokenizer = AutoTokenizer.from_pretrained('/my/path/to/my/tokenizer/')

def local_tokenizer_encode(text: str) -> List[int]:
    return hf_tokenizer.encode(text)

def local_tokenizer_decode(token_ids: List[int]) -> str:
    return hf_tokenizer.decode(token_ids)

tokenizer = Tokenizer(
    chunk_overlap=200,
    tokens_per_chunk=4000,
    decode=local_tokenizer_decode,
    encode=local_tokenizer_encode,
)

splits = split_text_on_tokens(text=text, tokenizer=tokenizer)

In this example, split_text_on_tokens will split the text into chunks using your local tokenizer. The Tokenizer instance is created with your local tokenizer's encode and decode methods, and the chunk_overlap and tokens_per_chunk parameters control the size and overlap of the chunks.
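
As a quick sanity check (a sketch, assuming `text` holds your document and `hf_tokenizer` is the local HuggingFace tokenizer loaded above), you can verify that each chunk stays within the token budget:

for i, chunk in enumerate(splits[:3]):
    print(i, len(hf_tokenizer.encode(chunk)))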

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu


Civel-1 commented Oct 20, 2023

Surprisingly effective! The second bit of code using a custom tokenizer worked super well. Well done. Good bot <3
