### Simple Approach

In [1]:
s = "very long corpus..."
words = s.split(" ")  
vocabulary = dict(enumerate(set(words)))  

In [2]:
vocabulary

{0: 'corpus...', 1: 'very', 2: 'long'}
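
To make the limits of this approach concrete, here is a minimal sketch that turns the vocabulary above into an encoder. The `encode` helper and the `unk_id` placeholder are hypothetical names introduced for illustration. The vocabulary is built from `sorted(set(words))` so that the ids are reproducible, since a plain `set` has no stable order.

```python
# Toy word-level encoder built on the same idea as the cells above.
corpus = "very long corpus..."
words = corpus.split(" ")

# sorted() keeps the id assignment reproducible across runs.
id_to_word = dict(enumerate(sorted(set(words))))
word_to_id = {word: idx for idx, word in id_to_word.items()}


def encode(text, unk_id=-1):
    """Map each space-separated word to its id, or to unk_id if it was never seen."""
    return [word_to_id.get(word, unk_id) for word in text.split(" ")]


print(encode("very long"))   # both words are in the vocabulary
print(encode("very short"))  # "short" is unseen and falls back to unk_id
```

Any word missing from the corpus collapses to the same placeholder id, which is one reason to look at the sub-token models described next.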

The Tokenizers library is designed to provide all the building blocks needed to create end-to-end tokenizers in an interchangeable way. To that end, it provides the following components:

- Normalizer: Executes all the initial transformations over the initial input string. For example, when you need to lowercase some text, strip it, or apply one of the common Unicode normalization processes, you add a Normalizer.
- PreTokenizer: In charge of splitting the initial input string. This is the component that decides where and how to pre-segment the original string. The simplest example, as we saw above, is to split on spaces.
- Model: Handles all the sub-token discovery and generation. This part is trainable and depends heavily on the input data.
- Post-Processor: Provides advanced construction features for compatibility with some of the Transformers-based SoTA models. For instance, for BERT it wraps the tokenized sentence in [CLS] and [SEP] tokens.
- Decoder: In charge of mapping a tokenized input back to the original string. The decoder is usually chosen according to the PreTokenizer used previously.
- Trainer: Provides training capabilities to each model.

For each of the components above, the tokenizers library provides multiple implementations:

- Normalizer: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...
- PreTokenizer: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...
- Model: WordLevel, BPE, WordPiece
- Post-Processor: BertProcessor, ...
- Decoder: WordLevel, BPE, WordPiece, ...

All of these building blocks can be combined to create working tokenization pipelines.
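
To illustrate how the stages hand data to one another, here is a toy pipeline written in plain Python. It is only a sketch of the idea, not the library's implementation: every function name below (`normalize`, `pre_tokenize`, `build_vocab`, `encode`, `decode`) is hypothetical, the "model" is a simple word-level lookup, and the special tokens are chosen to mirror the BERT example above.

```python
import unicodedata

SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]"]


def normalize(text):
    # Normalizer stage: Unicode NFKC followed by lowercasing.
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    # PreTokenizer stage: split on whitespace.
    return text.split()


def build_vocab(corpus):
    # Trainer stage for a word-level model: special tokens first, then sorted unique words.
    words = sorted(set(pre_tokenize(normalize(corpus))))
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + words)}


def encode(text, vocab):
    tokens = pre_tokenize(normalize(text))
    # Post-processor stage: wrap the sequence in [CLS] ... [SEP].
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Model stage: look up each token, falling back to [UNK].
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


def decode(ids, vocab):
    # Decoder stage: map ids back to tokens and drop the wrapping tokens.
    id_to_token = {idx: token for token, idx in vocab.items()}
    tokens = [id_to_token[i] for i in ids]
    return " ".join(t for t in tokens if t not in ("[CLS]", "[SEP]"))


vocab = build_vocab("Very long corpus")
ids = encode("VERY long text", vocab)
print(ids)
print(decode(ids, vocab))
```

Because each stage is a separate function, swapping one of them (for example a different `pre_tokenize`) leaves the rest untouched, which is the interchangeability the library's components aim for.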

### Subtoken Tokenization

In [3]:
BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'

from requests import get

# Download first, then write, so a failed request does not leave an empty big.txt behind.
response = get(BIG_FILE_URL)

if response.status_code == 200:
    with open('big.txt', 'wb') as big_f:
        big_f.write(response.content)
else:
    print("Unable to get the file: {}".format(response.reason))

In [4]:
# Everything described below can be replaced by the ByteLevelBPETokenizer class. 

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.normalizers import Lowercase, NFKC, Sequence
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

In [5]:
tokenizer = Tokenizer(BPE())

In [6]:
# Now we enable lowercasing and Unicode normalization.
# The Sequence normalizer lets us combine multiple normalizers, which are executed in order.

tokenizer.normalizer = Sequence([
    NFKC(),
    Lowercase()
])
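
To get a feel for what this combination does to a string, the sketch below reproduces the two steps with Python's standard `unicodedata` module instead of the library. The helper name `nfkc_then_lowercase` is hypothetical, and the library's normalizers may differ from `str.lower()` on edge cases, so treat it as an approximation.

```python
import unicodedata


def nfkc_then_lowercase(text):
    """Apply NFKC compatibility normalization, then lowercase the result."""
    return unicodedata.normalize("NFKC", text).lower()


# Fullwidth "Hello" followed by a word that starts with the "fi" ligature (U+FB01).
raw = "\uff28\uff45\uff4c\uff4c\uff4f \ufb01le"
print(nfkc_then_lowercase(raw))  # prints "hello file"
```

NFKC folds compatibility characters such as fullwidth letters and ligatures into their plain equivalents, so visually similar inputs end up as the same string before the model ever sees them.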

In [7]:
# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.
tokenizer.pre_tokenizer = ByteLevel()

# And finally, let's plug in a decoder so we can recover the original text from a tokenized input
tokenizer.decoder = ByteLevelDecoder()
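
As a rough intuition for what a byte-level representation buys us, the sketch below looks at the raw UTF-8 bytes of a string using only the standard library. It does not reproduce the library's `ByteLevel` components, whose exact byte-to-symbol mapping is not shown here; the variable names are hypothetical.

```python
text = "caf\u00e9"  # "café" with a precomposed e-acute

# Every character becomes one or more bytes, each in the range 0..255.
byte_values = list(text.encode("utf-8"))
print(byte_values)  # [99, 97, 102, 195, 169]

# Decoding the same bytes recovers the original string exactly.
restored = bytes(byte_values).decode("utf-8")
print(restored == text)  # True
```

Since any text reduces to at most 256 distinct byte values, a model that starts from bytes never meets a symbol it has no entry for, and a matching decoder can map the bytes back to text.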