In [1]:
@file:DependsOn("com.londogard:nlp:1.2.0-BETA2")

### Tokenizers

There are currently four tokenizers, all implementing the same `interface`.

```kotlin
interface Tokenizer {
    fun split(text: String): List<String>
    /** A more efficient approach for native tokenizers, i.e. HuggingFaceTokenizer */
    fun batchSplit(texts: List<String>): List<List<String>> = texts.map(this::split)
}
```
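
Since every tokenizer only has to provide `split`, adding your own is a small job. Below is a minimal sketch, assuming a hypothetical `WhitespaceTokenizer` that is not part of the library. The interface is repeated so the snippet compiles on its own.

```kotlin
// Copied from the interface above so that this sketch is self-contained.
interface Tokenizer {
    fun split(text: String): List<String>
    fun batchSplit(texts: List<String>): List<List<String>> = texts.map(this::split)
}

// Hypothetical example tokenizer (not part of londogard-nlp):
// splits on runs of whitespace and drops empty pieces.
class WhitespaceTokenizer : Tokenizer {
    override fun split(text: String): List<String> =
        text.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }
}

fun main() {
    val tokenizer = WhitespaceTokenizer()
    println(tokenizer.split("hello, world!"))
    // batchSplit comes for free through the interface's default implementation.
    println(tokenizer.batchSplit(listOf("a b", "c  d e")))
}
```

The default `batchSplit` simply maps `split` over the inputs, so only tokenizers with a faster native batch path need to override it.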

#### CharTokenizer

Splits a string into its individual characters.

In [2]:
import com.londogard.nlp.tokenizer.*

CharTokenizer().split("hello, world!")

[h, e, l, l, o, ,,  , w, o, r, l, d, !]
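
For intuition, the same result can be sketched in plain Kotlin. This is an illustration only; `charSplit` is a hypothetical helper and the library's actual implementation may differ.

```kotlin
// Illustration only: maps every Char (UTF-16 code unit) to a one-character String.
fun charSplit(text: String): List<String> = text.map { it.toString() }

fun main() {
    println(charSplit("hello, world!"))
}
```

Note that this sketch works on UTF-16 code units, so characters outside the Basic Multilingual Plane (most emoji, for example) would end up as two separate pieces.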

#### SimpleTokenizer

This is a word tokenizer that splits text into words using a few heuristics, which can be modified.

In [3]:
SimpleTokenizer().split("hello, world!")

[hello, ,, world, !]
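
To give a feel for what such a heuristic can look like, here is a rough sketch that reproduces the output above with a single regex. The pattern and the `simpleSplit` helper are assumptions for illustration, not the rules `SimpleTokenizer` actually uses.

```kotlin
// Hypothetical heuristic (not the library's): a token is either a run of
// word characters or one single character that is neither word nor whitespace.
val tokenPattern = Regex("""\w+|[^\w\s]""")

fun simpleSplit(text: String): List<String> =
    tokenPattern.findAll(text).map { it.value }.toList()

fun main() {
    println(simpleSplit("hello, world!"))
}
```

A rule this simple breaks contractions apart (`don't` becomes `don`, `'`, `t`), which is the kind of case where adjustable heuristics pay off.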

#### SentencePieceTokenizer
The SentencePiece tokenizer is a language-specific subword tokenizer. Through [BPEmb](https://nlp.h-its.org/bpemb/), pretrained models based on Wikipedia are available for 275 languages. Models exist for the following vocab sizes: `1000, 3000, 5000, 10_000, 25_000, 50_000, 100_000 & 200_000`. The larger the vocab, the fewer words are broken into subwords and the more are kept as whole words.

The SentencePiece implementation is the original C++ library from [Google](https://github.com/google/sentencepiece/), used through a wrapper from [DJL (Amazon)](http://docs.djl.ai/extensions/sentencepiece/index.html). I'm usually very hesitant to add native libraries (JNI) when working on JVM projects, but no one can deny the power of SentencePiece and I haven't had the time to implement the algorithm myself.  
The DJL wrapper is very small (they provide a single dependency containing only SentencePiece) and fits my philosophy of keeping dependencies and size low.

In [4]:
import com.londogard.nlp.utils.LanguageSupport.*

SentencePieceTokenizer.fromLanguageSupportOrNull(sv)?.split("hej där borta, hur mår ni?")

[▁he, j, ▁där, ▁bor, ta, ,, ▁hur, ▁mår, ▁ni, ?]
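
The `▁` (U+2581) in the output is how SentencePiece marks that a piece was preceded by a space, which makes the tokenization reversible. Here is a small sketch of turning the pieces back into text. `joinPieces` is a hypothetical helper written for this example, not a library function.

```kotlin
// Concatenate the pieces, then turn the SentencePiece space marker "▁" back
// into real spaces. trim() removes the space produced by the leading marker.
fun joinPieces(pieces: List<String>): String =
    pieces.joinToString("").replace('▁', ' ').trim()

fun main() {
    val pieces = listOf("▁he", "j", "▁där", "▁bor", "ta", ",", "▁hur", "▁mår", "▁ni", "?")
    println(joinPieces(pieces))
}
```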

#### HuggingFaceTokenizerWrapper

A simple-to-use wrapper around `HuggingFaceTokenizer` (the Rust implementation).

In [5]:
import com.londogard.nlp.tokenizer.HuggingFaceTokenizerWrapper

HuggingFaceTokenizerWrapper("bert-base-cased").split("Randomly hello!")

[[CLS], Random, ##ly, hello, !, [SEP]]
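
In this output, `[CLS]` and `[SEP]` are BERT's special tokens and the `##` prefix marks a piece that continues the previous word. Here is a minimal sketch of merging such pieces back into words. It assumes BERT-style WordPiece output, and `mergeWordPieces` with its default special-token set is a hypothetical helper, not part of the wrapper.

```kotlin
// Hypothetical helper: drops special tokens and glues "##" continuation
// pieces onto the word before them.
fun mergeWordPieces(
    tokens: List<String>,
    specialTokens: Set<String> = setOf("[CLS]", "[SEP]", "[PAD]")
): List<String> {
    val words = mutableListOf<String>()
    for (token in tokens) {
        if (token in specialTokens) continue
        if (token.startsWith("##") && words.isNotEmpty()) {
            // Continuation piece: append it to the previous word without the prefix.
            words[words.lastIndex] = words.last() + token.removePrefix("##")
        } else {
            words.add(token)
        }
    }
    return words
}

fun main() {
    println(mergeWordPieces(listOf("[CLS]", "Random", "##ly", "hello", "!", "[SEP]")))
}
```

Other model families use different markers and special tokens, so the prefix and the token set would need adjusting for them.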

In [7]:
HuggingFaceTokenizerWrapper("bert-base-cased").encode("Randomly hello!")

ai.djl.huggingface.tokenizers.Encoding@3548f963