tok

tok provides bindings to the 🤗tokenizers library. It uses the same Rust libraries that power the Python implementation.

tok does not yet provide the full tokenizers API. Please open an issue if a feature you need is missing.

Installation

You can install tok from CRAN using:

install.packages("tok")

Installing tok from source requires a working Rust toolchain. We recommend installing it with rustup.

On Windows, you’ll also have to add the i686-pc-windows-gnu and x86_64-pc-windows-gnu targets:

rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu
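Before installing from source, it can help to confirm that the Rust tools are visible to the R session itself, since R may see a different PATH than your terminal. The following is a minimal sketch using only base R; it is not part of tok, and the variable names are illustrative.

```r
# Look up the Rust tools on the PATH visible to the current R session.
# Sys.which() returns "" for any executable it cannot find.
rust_tools <- Sys.which(c("cargo", "rustc"))
not_found <- names(rust_tools)[rust_tools == ""]

if (length(not_found) > 0) {
  message("Not found on PATH: ", paste(not_found, collapse = ", "))
} else {
  message("Rust toolchain found: ", paste(rust_tools, collapse = ", "))
}
```

If either tool is reported as not found, restarting R after running rustup is often enough for the updated PATH to be picked up.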

Once Rust is working, you can install this package via:

remotes::install_github("dfalbel/tok")

Features

We still don’t have complete support for the 🤗tokenizers API. Please open an issue if you need a feature that is currently not implemented.

Loading tokenizers

tok can be used to load and use tokenizers that have been previously serialized. For example, HuggingFace model weights are usually accompanied by a ‘tokenizer.json’ file that can be loaded with this library.

To load a pre-trained tokenizer from a json file, use:

path <- testthat::test_path("assets/tokenizer.json")
tok <- tok::tokenizer$from_file(path)

Use the encode method to tokenize sentences, and the decode method to turn the resulting token ids back into text.

enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
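To check several inputs at once, the encode and decode calls can be wrapped in a small helper. This is a sketch rather than part of tok: `round_trip` is a hypothetical name, and it relies only on the `encode` and `decode` methods and the `ids` field shown above.

```r
# Hypothetical helper (not part of tok): encode a string, decode the ids,
# and report whether the decoded text matches the input exactly.
round_trip <- function(tokenizer, text) {
  enc <- tokenizer$encode(text)
  decoded <- tokenizer$decode(enc$ids)
  list(
    ids = enc$ids,                      # token ids produced by the tokenizer
    n_tokens = length(enc$ids),         # number of tokens for this input
    decoded = decoded,                  # text reconstructed from the ids
    matches = identical(decoded, text)  # TRUE only for an exact round trip
  )
}
```

With the tokenizer loaded above, `round_trip(tok, "hello world")$matches` would be expected to be TRUE, given the decode output shown. Some tokenizers normalize text (for example by lowercasing), so an exact match is not guaranteed for every input.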

Using pre-trained tokenizers

You can also load any tokenizer available on the Hugging Face Hub by using the from_pretrained static method. For example, to load the GPT2 tokenizer:

tok <- tok::tokenizer$from_pretrained("gpt2")
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
