
Use a JS based tokenizer for token counting #1239

Merged
merged 12 commits into main from lite-tokenizer on May 17, 2023

Conversation

@dqbd (Collaborator) commented May 12, 2023

The idea behind this PR is to reduce the friction when adopting LangchainJS by rewriting the tokenizer library into JS and fetching ranks on demand from a CDN (https://tiktoken.pages.dev). As a side effect, this will reduce the bundle size (necessary for edge functions, see #809, #62), at the expense of slower tokenization (which should be a non-issue for pure token counting).

Tokenization behavior should match that of @dqbd/tiktoken, which has been removed entirely as a dependency.

Supersedes #847 and #1146

  • Move the tests to ensure correct tokenization
  • Remove WASM tiktoken entirely in count_tokens (weak dependencies are not well supported, as far as I can tell)
  • Serve other encoders from CDN (w/ cache)
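The on-demand loading and caching of encoder ranks described above can be sketched as a small promise cache. This is an illustrative sketch, not the code merged in this PR: the function names (`getRanks`, `fetchRanks`), the loader signature, and the exact CDN URL shape are assumptions for illustration.

```typescript
// Sketch: lazily fetch tokenizer rank files from a CDN and cache them
// in memory so each encoding is downloaded at most once.
// Names and the URL shape are assumptions, not the PR's actual code.
type Ranks = Record<string, number>;

const cache: Map<string, Promise<Ranks>> = new Map();

async function fetchRanks(encoding: string): Promise<Ranks> {
  // Hypothetical URL shape; https://tiktoken.pages.dev is the CDN named in the PR.
  const res = await fetch(`https://tiktoken.pages.dev/js/${encoding}.json`);
  if (!res.ok) throw new Error(`Failed to fetch ranks for ${encoding}`);
  return res.json();
}

// Caching the promise (not the resolved value) means concurrent callers
// share a single in-flight network request.
function getRanks(
  encoding: string,
  loader: (name: string) => Promise<Ranks> = fetchRanks
): Promise<Ranks> {
  let hit = cache.get(encoding);
  if (!hit) {
    hit = loader(encoding);
    cache.set(encoding, hit);
  }
  return hit;
}
```

The loader parameter is there so tests (or offline environments) can supply ranks without hitting the network.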

@vercel bot commented May 12, 2023

The latest updates on your projects:

langchainjs-docs: ✅ Ready (preview updated May 17, 2023 1:21pm UTC)

@nfcampos (Collaborator) left a comment

Looks great!

Resolved review threads on langchain/src/util/tiktoken.ts
@nfcampos (Collaborator) commented

Let's also update the test-exports packages to remove any custom config we had put in there for WASM, etc.

@@ -1,14 +1,6 @@
/** @type {import('next').NextConfig} */

We can probably remove this file completely?

@nfcampos added the lgtm label (PRs that are ready to be merged as-is) May 17, 2023
@nfcampos nfcampos merged commit d60eae5 into main May 17, 2023
10 checks passed
@nfcampos nfcampos deleted the lite-tokenizer branch May 17, 2023 13:44
@anzemur commented Aug 30, 2023

@nfcampos @dqbd I think this is not quite right: the unpacked size of js-tiktoken is almost 10 MB, and it is listed as a dependency. The way LangchainJS uses the library, the encoders are always loaded from the CDN the first time they are used, so the encoder files shipped in the package are never touched. That means roughly 10 MB of dead weight. A "light" version of js-tiktoken should include only the code required to run the tokenizers, with the user supplying the actual encoder, either from local files or from a CDN.

LangchainJS itself is almost 5 MB, and together with all of its dependencies this can cause trouble in environments where package size is limited (e.g. AWS Lambda).

@dqbd (Collaborator, Author) commented Aug 30, 2023

Hey @anzemur! As mentioned in dqbd/tiktoken#68, but applicable to LangchainJS as well, the best course of action is to add an intermediate build step that tree-shakes and bundles dependencies, using a tool such as esbuild or rollup (as documented here). That should address your concerns about bundle size and also improve the performance of your Lambda functions.
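The intermediate build step dqbd suggests might look like the following esbuild invocation. This is a config sketch under assumptions: the entry path, output path, and Node target are placeholders to adapt to your project, not values from this thread.

```shell
# Sketch: tree-shake and bundle a Lambda handler with esbuild so only the
# code actually imported (not the full 10 MB of encoder data) ships.
# src/handler.ts, dist/handler.js, and node18 are assumed placeholders.
npx esbuild src/handler.ts \
  --bundle \
  --platform=node \
  --target=node18 \
  --minify \
  --outfile=dist/handler.js
```

Because the encoders are fetched from the CDN at runtime, the bundler can drop the packaged encoder files entirely, which is what keeps the deployed artifact small.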
