feat: improves Chinese and Japanese tokenizers #899
Merged
micheleriva merged 1 commit into main on Feb 27, 2025
Conversation
This pull request includes significant changes to the `packages/tokenizers` module, primarily focused on migrating from a custom build system using Rust and WebAssembly to a TypeScript-based build system. The changes also include updates to the configuration files, package scripts, and test files to accommodate this migration.

**Migration to TypeScript-based build system:**
- `packages/tokenizers/.tshy/build.json`: Added TypeScript build configuration with the `nodenext` module and module-resolution options.
- `packages/tokenizers/.tshy/commonjs.json`: Added configuration for CommonJS output, specifying the source files to include and exclude and setting the output directory.
- `packages/tokenizers/.tshy/esm.json`: Added configuration for ES module output, specifying the source files to include and exclude and setting the output directory.

**Package configuration updates:**
- `packages/tokenizers/package.json`: Updated the `exports` field to reflect the new paths for ES and CommonJS modules, added a `tshy` section for build configuration, and updated the scripts to use `tshy` for building and `tsx` for testing. [1] [2] [3]

**Removal of Rust and WebAssembly build system:**
- `packages/tokenizers/scripts/build.mjs`: Removed the script for building the tokenizers with Rust and WebAssembly.
- `packages/tokenizers/src/tokenizer-japanese/.gitignore`, `packages/tokenizers/src/tokenizer-mandarin/.gitignore`: Removed entries related to Rust build artifacts. [1] [2]
- `packages/tokenizers/src/tokenizer-japanese/Cargo.toml`, `packages/tokenizers/src/tokenizer-mandarin/Cargo.toml`: Removed the Rust project configuration files. [1] [2]
- `packages/tokenizers/src/tokenizer-japanese/src/lib.rs`, `packages/tokenizers/src/tokenizer-mandarin/src/lib.rs`: Removed the Rust source files for the tokenizers. [1] [2]
- `packages/tokenizers/src/tokenizer-japanese/src/tokenizer.ts`, `packages/tokenizers/src/tokenizer-mandarin/src/tokenizer.ts`: Removed the TypeScript wrappers for the Rust-based tokenizers. [1] [2]

**Addition of new TypeScript tokenizers:**
- `packages/tokenizers/src/japanese.ts`: Added a new Japanese tokenizer implemented in TypeScript.
- `packages/tokenizers/src/mandarin.ts`: Added a new Mandarin tokenizer implemented in TypeScript.

**Test updates:**
- `packages/tokenizers/tests/japanese.test.ts`: Updated the Japanese tokenizer test to use the new TypeScript-based tokenizer.
- `packages/tokenizers/tests/mandarin.test.ts`: Updated the Mandarin tokenizer test to use the new TypeScript-based tokenizer.
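The dual ESM/CommonJS setup driven by `tshy` might look roughly like the following `package.json` fragment. This is a hedged sketch, not the PR's actual file: the entry-point names, `dist/` paths, and script invocations are assumptions based on `tshy`'s default output layout.

```json
{
  "exports": {
    "./japanese": {
      "import": "./dist/esm/japanese.js",
      "require": "./dist/commonjs/japanese.js"
    },
    "./mandarin": {
      "import": "./dist/esm/mandarin.js",
      "require": "./dist/commonjs/mandarin.js"
    }
  },
  "tshy": {
    "exports": {
      "./japanese": "./src/japanese.ts",
      "./mandarin": "./src/mandarin.ts"
    }
  },
  "scripts": {
    "build": "tshy",
    "test": "tsx --test tests/*.test.ts"
  }
}
```

With this shape, `tshy` compiles the sources listed under its `exports` key into both module formats, and the generated `.tshy/commonjs.json` and `.tshy/esm.json` files describe the two compilation passes.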