-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a JS based tokenizer for token counting #1239
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Let's also update the test-exports packages to remove any custom config we had put in there for wasm etc |
test-exports-vercel/next.config.js
Outdated
@@ -1,14 +1,6 @@ | |||
/** @type {import('next').NextConfig} */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably remove this file completely?
@nfcampos @dqbd I think this is not quite right - the unpackaged size of The size of |
Hey @anzemur! As mentioned in dqbd/tiktoken#68 but applicable for LangchainJS as well, the best course of action is to add an intermediate build step for tree shaking and bundling dependencies, such as |
The idea behind this PR is to reduce the friction when adopting LangchainJS by rewriting the tokenizer library into JS and fetching ranks on demand from a CDN (
https://tiktoken.pages.dev
). As a side effect, this will reduce the bundle size (necessary for edge functions, see #809, #62), at the expense of slower tokenization (which should be a non-issue for pure token counting).Tokenization behavior should match the @dqbd/tiktoken, which has been entirely removed.
Supersedes #847 and #1146