RoBERTa Java Tokenizer

About

This repo contains a Java tokenizer for the RoBERTa model. The implementation mainly follows the HuggingFace Python RoBERTa tokenizer, but we also drew on other implementations, as noted in the code and below.

The algorithm used is byte-level Byte Pair Encoding (BPE):

https://huggingface.co/docs/transformers/tokenizer_summary#bytelevel-bpe

How do I get set up?

Clone the repo to work with the source directly.
To use the tokenizer in your own project, add the Maven dependency to your pom.xml:

<dependency>
    <groupId>cloud.genesys</groupId>
    <artifactId>roberta-tokenizer</artifactId>
    <version>1.0.7</version>
</dependency>

<distributionManagement>
    <repository>
      <id>ossrh</id>
      <url>https://s01.oss.sonatype.org/service/local/staging/deploy/maven2/</url>
    </repository>
    ...
</distributionManagement>

Tests

Unit tests: run them on your local machine.

File Dependencies

To keep tokenizer initialization efficient, we use a factory that creates the relevant resource files "lazily", i.e., only when they are first needed.
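
The lazy-creation idea can be sketched as follows. This is only an illustration under stated assumptions, not the factory used in this repo: the names Lazy and LazyDemo are hypothetical, and the "loader" stands in for whatever actually reads a resource file.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of this library): runs the loader on first access and caches the result. */
final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private T value;         // cached result of the loader
    private boolean loaded;  // separate flag, so a null result is also cached

    Lazy(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }
}

final class LazyDemo {
    static int loadCount = 0; // counts how often the (pretend) file load runs

    public static void main(String[] args) {
        // Assumption: a real loader would parse a resource file here.
        Lazy<String> vocabulary = new Lazy<>(() -> {
            loadCount++;
            return "pretend this was read from vocabulary.json";
        });
        System.out.println("loads before first use: " + loadCount);
        vocabulary.get();
        vocabulary.get();
        System.out.println("loads after two uses: " + loadCount);
    }
}
```

Constructing the holder costs nothing; the loader runs once, on the first call to get(), and later calls reuse the cached value.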

For this tokenizer we need 3 data files:

base_vocabulary.json - a map from byte values ([0, 255]) to symbols (Unicode characters). Only these symbols are known to the algorithm. For example, given an input String s, the tokenizer iterates over the bytes of s and replaces each byte with its mapped symbol. This guarantees which characters are passed on.
vocabulary.json - holds all the words (sub-words) and their tokens, as determined by training.
merges.txt - describes the merge rules for words. The algorithm splits a given word into two sub-words and then decides on the best split according to the rank of the sub-words. The higher those sub-words appear in the file, the higher their rank.
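
The byte-to-symbol step described for base_vocabulary.json can be sketched as below. The table is built the way GPT-2 style byte-level BPE builds it (printable bytes map to themselves, the remaining bytes map to code points from 256 upward). That base_vocabulary.json holds exactly these values is an assumption, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class ByteSymbolSketch {
    /** Builds a byte -> symbol table in the GPT-2 style (assumed, not read from base_vocabulary.json). */
    static Map<Integer, String> buildTable() {
        Map<Integer, String> table = new HashMap<>();
        int next = 0; // running offset for bytes that have no printable form
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= '!' && b <= '~')
                    || (b >= 0xA1 && b <= 0xAC)
                    || (b >= 0xAE && b <= 0xFF);
            if (printable) {
                table.put(b, String.valueOf((char) b));
            } else {
                table.put(b, String.valueOf((char) (256 + next)));
                next++;
            }
        }
        return table;
    }

    /** Iterates over the UTF-8 bytes of s and replaces each byte with its mapped symbol. */
    static String toSymbols(String s, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        for (byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            out.append(table.get(raw & 0xFF)); // mask to get the unsigned byte value
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = buildTable();
        System.out.println(toSymbols(" the", table));
    }
}
```

Under this table the space byte (32) maps to U+0120, so " the" becomes "Ġthe"; every possible input byte has a symbol, which is why no character can fall outside the known set.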

Please note:

All three files must be in the same directory.
They must be named exactly as listed above.
The result of the tokenization depends on the vocabulary and merges files.
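
To illustrate why the result depends on the merges file, here is a minimal sketch of rank-based merging. It is a simplification, not this library's implementation: it merges one pair per pass, treats the line index in the merges list as the rank (earlier line wins), and uses hypothetical names and tiny made-up merge lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BpeMergeSketch {
    /** Maps each merge rule "left right" to its line index; an earlier line means higher priority. */
    static Map<String, Integer> ranks(List<String> mergeLines) {
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < mergeLines.size(); i++) {
            ranks.put(mergeLines.get(i), i);
        }
        return ranks;
    }

    /** Starts from single symbols and repeatedly merges the adjacent pair with the best rank. */
    static List<String> apply(String word, Map<String, Integer> ranks) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            parts.add(String.valueOf(word.charAt(i)));
        }
        while (parts.size() > 1) {
            int best = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i < parts.size() - 1; i++) {
                Integer rank = ranks.get(parts.get(i) + " " + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    bestRank = rank;
                    best = i;
                }
            }
            if (best < 0) {
                break; // no known merge rule applies any more
            }
            parts.set(best, parts.get(best) + parts.get(best + 1));
            parts.remove(best + 1);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Two made-up merge lists that split the same word differently.
        System.out.println(apply("the", ranks(Arrays.asList("t h", "th e", "h e"))));
        System.out.println(apply("the", ranks(Arrays.asList("h e", "t h"))));
    }
}
```

With the first list the word collapses to [the]; with the second it stops at [t, he], because "h e" is merged first and no rule exists for "t he". Each resulting sub-word would then be looked up in vocabulary.json to obtain its token.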

Example


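import java.util.Arrays;

String baseDirPath = "base/dir/path"; // directory that holds the three data files
RobertaTokenizerResources robertaResources = new RobertaTokenizerResources(baseDirPath);
Tokenizer robertaTokenizer = new RobertaTokenizer(robertaResources);
...
String sentence = "this must be the place";
long[] tokenizedSentence = robertaTokenizer.tokenize(sentence);
// Arrays.toString prints the token ids; printing the array itself would only show its reference
System.out.println(Arrays.toString(tokenizedSentence));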

An example output would be [0, 9226, 531, 28, 5, 317, 2]; the actual result depends on the given vocabulary and merges files.

Contribution guidelines

Use temporary branches for every issue/task.
