minbpe-rs: A pure Rust implementation of minbpe #66

Open
shubham0204 opened this issue Apr 21, 2024 · 2 comments

Comments

@shubham0204
Contributor

shubham0204 commented Apr 21, 2024

Gregor Purdy (@gnp) is working on a Rust version of minbpe: minbpe-rs.

The Rust crate (similar to a package in Python) contains the three tokenizers currently included in the Python version of minbpe: BasicTokenizer, RegexTokenizer and GPT4Tokenizer. Here's an example, similar to the one in the README of this project, but using minbpe-rs:

use std::path::Path;
use minbpe::{BasicTokenizer, Saveable, Tokenizer, Trainable};

fn main() {
    let text = "aaabdaaabac";
    let mut tokenizer = BasicTokenizer::new();
    // 256 byte tokens, then 3 merges
    tokenizer.train(text, 256 + 3, false);
    println!("{:?}", tokenizer.encode(text));
    println!("{:?}", tokenizer.decode(&[258, 100, 258, 97, 99]));
    // Write the trained tokenizer to the current directory under the name "toy"
    tokenizer.save(Path::new("./"), "toy");
}

which, when run, prints:

$> cargo run

   ...
   Compiling minbpe-test v0.1.0 (~/minbpe-test)
    Finished dev [unoptimized + debuginfo] target(s) in 15.71s
     Running `target/debug/minbpe-test`
[258, 100, 258, 97, 99]
"aaabdaaabac"

@gnp is the lead developer; I (@shubham0204) am working on the docs, examples and README of the project.
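For readers who want to see what the three merges in the toy run are doing, here is a minimal standalone sketch of byte-pair training using only the standard library. It is not the minbpe-rs implementation: the function names (`merge`, `most_frequent_pair`, `train`) are hypothetical, and the tie-breaking rule (the pair seen first wins) is an assumption chosen so that the sketch reproduces the ids shown above.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Replace every non-overlapping occurrence of `pair` with `new_id`,
/// scanning left to right.
fn merge(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Most frequent adjacent pair. Assumption: ties go to the pair whose
/// first occurrence is earliest in `ids`.
fn most_frequent_pair(ids: &[u32]) -> Option<(u32, u32)> {
    // pair -> (count, index of first occurrence)
    let mut stats: HashMap<(u32, u32), (usize, usize)> = HashMap::new();
    for (i, w) in ids.windows(2).enumerate() {
        stats.entry((w[0], w[1])).or_insert((0, i)).0 += 1;
    }
    stats
        .into_iter()
        .max_by_key(|&(_, (count, first))| (count, Reverse(first)))
        .map(|(pair, _)| pair)
}

/// Run `num_merges` merges over the UTF-8 bytes of `text`.
/// Returns the final ids and the learned merges as (pair, new_id).
fn train(text: &str, num_merges: u32) -> (Vec<u32>, Vec<((u32, u32), u32)>) {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for n in 0..num_merges {
        let Some(pair) = most_frequent_pair(&ids) else { break };
        let new_id = 256 + n; // ids 0..=255 are reserved for raw bytes
        ids = merge(&ids, pair, new_id);
        merges.push((pair, new_id));
    }
    (ids, merges)
}

fn main() {
    let (ids, merges) = train("aaabdaaabac", 3);
    println!("{:?}", ids); // [258, 100, 258, 97, 99]
    println!("{:?}", merges);
}
```

The learned merges are (97, 97) → 256, (256, 97) → 257 and (257, 98) → 258, which is why "aaab" collapses to the single id 258 in the output above.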

  • minbpe-rs would be a good start for the 2nd point in the todo section of the README: write an even more optimized C or Rust version (think through)
  • The project also contains a test comparing RegexTokenizer with the GPT-4 tokenizer from tiktoken-rs (a Rust version of tiktoken), similar to inference: GPT-4 comparison from the README. See the test here.
  • Currently, the project has a base level of documentation, which can be enriched by adding more docstrings and examples for the tokenizers

It would be great if minbpe-rs could be added as a community extension in the README of this repository, encouraging more developers to work on this Rust implementation and build more features into it (e.g. Python bindings, multi-threading support, or wrappers for Java/C). We would like the community to review minbpe-rs and provide feedback or contributions.

@karpathy
Owner

submit a PR happy to merge

@shubham0204
Contributor Author

@karpathy Thanks! Here's the PR #67
