
TokenTextSplitter ignores model_name when using from_tiktoken_encoder #4357

Closed
1 of 14 tasks
danb27 opened this issue May 8, 2023 · 0 comments · Fixed by #4358 or #4361
Comments

Contributor

danb27 commented May 8, 2023

System Info

langchain v0.0.162
python3.10

Who can help?

@hwchase17 I believe this issue was introduced in #2963.

Passing model_name works when calling __init__ directly, but it is not forwarded to __init__ when using from_tiktoken_encoder().

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

print(TokenTextSplitter()._tokenizer)  # <Encoding 'gpt2'>
print(TokenTextSplitter(model_name="gpt-3.5-turbo")._tokenizer)  # <Encoding 'cl100k_base'>
print(TokenTextSplitter.from_tiktoken_encoder(model_name="gpt-3.5-turbo")._tokenizer)  # <Encoding 'gpt2'>

Expected behavior

print(TokenTextSplitter()._tokenizer)  # <Encoding 'gpt2'>
print(TokenTextSplitter(model_name="gpt-3.5-turbo")._tokenizer)  # <Encoding 'cl100k_base'>
print(TokenTextSplitter.from_tiktoken_encoder(model_name="gpt-3.5-turbo")._tokenizer)  # <Encoding 'cl100k_base'>
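The underlying pattern can be illustrated without langchain or tiktoken installed: a classmethod constructor accepts model_name but never forwards it to __init__, so the default encoding wins regardless of the argument. A minimal sketch (class and method names are simplified stand-ins, not the actual langchain source; the fix mirrors the change described in #4358):

```python
class Splitter:
    def __init__(self, model_name=None, chunk_size=4000):
        # __init__ honors model_name when called directly
        self.model_name = model_name
        self.chunk_size = chunk_size

    @classmethod
    def broken_from_encoder(cls, model_name=None, **kwargs):
        # Bug: model_name is accepted but silently dropped, so __init__
        # falls back to its default (analogous to the gpt2 encoding above).
        return cls(**kwargs)

    @classmethod
    def fixed_from_encoder(cls, model_name=None, **kwargs):
        # Fix: forward model_name explicitly to __init__
        return cls(model_name=model_name, **kwargs)


print(Splitter.broken_from_encoder(model_name="gpt-3.5-turbo").model_name)  # None
print(Splitter.fixed_from_encoder(model_name="gpt-3.5-turbo").model_name)   # gpt-3.5-turbo
```

The same shape of bug is easy to introduce whenever a factory method re-declares a parameter that __init__ also takes: the named parameter is captured by the factory's signature and no longer travels inside **kwargs.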
dev2049 pushed a commit that referenced this issue May 8, 2023
…encoder (#4358)

# Fix model name not being passed to __init__ when using from_tiktoken_encoder


Fixes #4357


@hwchase17
dev2049 closed this as completed in 02ebb15 on May 8, 2023
EandrewJones pushed a commit to Oogway-Technologies/langchain that referenced this issue May 9, 2023
Thanks to @danb27 for the fix! Minor update

Fixes langchain-ai#4357

---------

Co-authored-by: Dan Bianchini <42096328+danb27@users.noreply.github.com>
jpzhangvincent pushed a commit to jpzhangvincent/langchain that referenced this issue May 12, 2023
Thanks to @danb27 for the fix! Minor update

Fixes langchain-ai#4357

---------

Co-authored-by: Dan Bianchini <42096328+danb27@users.noreply.github.com>