🙂 Fast state-of-the-art tokenizers for Ruby
Add this line to your application’s Gemfile:
gem "tokenizers"
Note: Rust is currently required for installation.
Load a pretrained tokenizer
tokenizer = Tokenizers.from_pretrained("bert-base-cased")
Encode
encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.ids
encoded.tokens
Decode
tokenizer.decode(ids)
Load a tokenizer from files
tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")
View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec ruby ext/tokenizers/extconf.rb && make
bundle exec rake download:files
bundle exec rake test