
TokenTextSplitter ignores model_name when using from_tiktoken_encoder #4357

Closed
1 of 14 tasks
danb27 opened this issue May 8, 2023 · 0 comments · Fixed by #4358 or #4361
Comments

Contributor

danb27 commented May 8, 2023

System Info

langchain v0.0.162
python3.10

Who can help?

@hwchase17 I believe this issue was introduced in #2963.

Passing model_name works when calling __init__ directly, but it is not forwarded to __init__ when using from_tiktoken_encoder().

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

print(TokenTextSplitter()._tokenizer)  # <Encoding 'gpt2'>
print(TokenTextSplitter(model_name="gpt-3.5-turbo")._tokenizer)  # <Encoding 'cl100k_base'>
print(TokenTextSplitter.from_tiktoken_encoder(model_name="gpt-3.5-turbo")._tokenizer)  # <Encoding 'gpt2'>

Expected behavior

print(TokenTextSplitter()._tokenizer)  # <Encoding 'gpt2'>
print(TokenTextSplitter(model_name="gpt-3.5-turbo")._tokenizer)  # <Encoding 'cl100k_base'>
print(TokenTextSplitter.from_tiktoken_encoder(model_name="gpt-3.5-turbo")._tokenizer)  # <Encoding 'cl100k_base'>
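The underlying pattern can be illustrated without langchain or tiktoken installed: a classmethod constructor accepts model_name but never forwards it to __init__, so the default encoding wins regardless of the argument. A minimal sketch (class and method names are simplified stand-ins, not the actual langchain source; the fix mirrors the change described in #4358):

```python
class Splitter:
    def __init__(self, model_name=None, chunk_size=4000):
        # __init__ honors model_name when called directly
        self.model_name = model_name
        self.chunk_size = chunk_size

    @classmethod
    def broken_from_encoder(cls, model_name=None, **kwargs):
        # Bug: model_name is accepted but silently dropped, so __init__
        # falls back to its default (analogous to the gpt2 encoding above).
        return cls(**kwargs)

    @classmethod
    def fixed_from_encoder(cls, model_name=None, **kwargs):
        # Fix: forward model_name explicitly to __init__
        return cls(model_name=model_name, **kwargs)


print(Splitter.broken_from_encoder(model_name="gpt-3.5-turbo").model_name)  # None
print(Splitter.fixed_from_encoder(model_name="gpt-3.5-turbo").model_name)   # gpt-3.5-turbo
```

The same shape of bug is easy to introduce whenever a factory method re-declares a parameter that __init__ also takes: the named parameter is captured by the factory's signature and no longer travels inside **kwargs.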
dev2049 pushed a commit that referenced this issue May 8, 2023
…encoder (#4358)

# Fix model name not being passed to __init__ when using from_tiktoken_encoder


Fixes #4357


@hwchase17
dev2049 closed this as completed in 02ebb15 on May 8, 2023
EandrewJones pushed a commit to Oogway-Technologies/langchain that referenced this issue May 9, 2023
Thanks to @danb27 for the fix! Minor update

Fixes langchain-ai#4357

---------

Co-authored-by: Dan Bianchini <42096328+danb27@users.noreply.github.com>
jpzhangvincent pushed a commit to jpzhangvincent/langchain that referenced this issue May 12, 2023
Thanks to @danb27 for the fix! Minor update

Fixes langchain-ai#4357

---------

Co-authored-by: Dan Bianchini <42096328+danb27@users.noreply.github.com>