# Hugging Face tokenizers in Pyserini 
In this notebook we demonstrate how Hugging Face tokenizers can be used to build BM25 indices with Anserini and Pyserini. The HF tokenizers serve as drop-in replacements for Lucene-native analyzers. More examples of how to use Anserini with HF tokenizers can be found at https://github.com/castorini/anserini/blob/master/docs/regressions-msmarco-passage-hgf-wp.md.

We begin by importing all the necessary dependencies.

In [2]:
import os

from datasets import load_dataset, Dataset
from datasets.utils.py_utils import convert_file_size_to_int
from pyserini.pyclass import autoclass
from transformers import PreTrainedTokenizerFast
from tokenizers import (
    normalizers,
    pre_tokenizers,
    decoders,
    Tokenizer,
)
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tqdm import tqdm

Next, we download and preprocess a dataset from the Hugging Face Hub and split it into shards. See the Indexing notebook for more details on the steps below: https://github.com/huggingface/gaia/blob/main/notebooks/00-indexing.ipynb.

In [3]:
dset = load_dataset("imdb", split="train")
dset = dset.add_column("id", [str(i) for i in range(len(dset))])
dset = dset.rename_column("text", "contents")
dset = dset.select_columns(["id", "contents"])

shard_dir = "../shards/imdb"
max_shard_size = convert_file_size_to_int('10MB')
dataset_nbytes = dset.data.nbytes
num_shards = int(dataset_nbytes / max_shard_size) + 1
num_shards = max(num_shards, 1)
print(f"Sharding into {num_shards} JSONL files.")
os.makedirs(shard_dir, exist_ok=True)
for shard_index in tqdm(range(num_shards)):
    shard = dset.shard(num_shards=num_shards, index=shard_index, contiguous=True)
    shard.to_json(f"{shard_dir}/docs-{shard_index:03d}.jsonl", orient="records", lines=True)

Found cached dataset imdb (/home/piktus_huggingface_co/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


Sharding into 4 JSONL files.
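
The shard count follows from simple arithmetic: the dataset's byte size divided by the maximum shard size, rounded down, plus one. Below is a minimal sketch of that calculation in plain Python. It assumes that `'10MB'` resolves to 10 × 10^6 bytes; the helper name `count_shards` and the dataset size are hypothetical and chosen only for illustration.

```python
def count_shards(dataset_nbytes: int, max_shard_size: int) -> int:
    """Floor-divide the dataset size by the shard size and add one shard."""
    return dataset_nbytes // max_shard_size + 1


# Assumption: "10MB" is interpreted as decimal megabytes.
MAX_SHARD_SIZE = 10 * 10**6

# Hypothetical dataset size of 33.5 MB, used only for illustration.
print(count_shards(33_500_000, MAX_SHARD_SIZE))  # 4
```

Note that a dataset whose size is an exact multiple of the limit still gets one extra shard with this formula, so the resulting shards may be somewhat smaller than the maximum.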




## Build the tokenizer
Below we illustrate how one can train a custom BPE tokenizer for a given dataset and upload it to the Hugging Face Hub. Alternatively, one can also use any of the tokenizers already uploaded to the Hugging Face Hub.

In [4]:
def batch_iterator(dataset, text_column_name, batch_size):
    # Yield the text column in successive batches. Slicing the dataset
    # also handles a final batch that is shorter than batch_size.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size][text_column_name]
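
To see how batch boundaries behave when the number of items is not a multiple of the batch size, the same pattern can be tried on a plain Python list. This is only an illustrative sketch; the helper name `batches` and the sample texts are hypothetical and not part of the notebook.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Seven made-up documents split into batches of three.
texts = [f"review {i}" for i in range(7)]
sizes = [len(batch) for batch in batches(texts, 3)]
print(sizes)  # [3, 3, 1]
```

Slicing never reads past the end of the list, so the final batch simply contains the remaining items.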

In [15]:
VOCAB_SIZE = 25000

tokenizer = Tokenizer(BPE())

tokenizer.normalizer = normalizers.Sequence([normalizers.NFKD(), normalizers.StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=True)
tokenizer.decoder = decoders.ByteLevel(add_prefix_space=True, use_regex=True)

trainer = BpeTrainer(vocab_size=VOCAB_SIZE, show_progress=True)
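
The normalizer configured above first applies NFKD decomposition and then strips accents. The effect can be approximated with the standard library alone, as in the rough sketch below. It assumes that stripping accents amounts to dropping Unicode combining marks after decomposition; the helper name `strip_accents_nfkd` is hypothetical.

```python
import unicodedata


def strip_accents_nfkd(text: str) -> str:
    """Decompose with NFKD, then drop all combining marks (category M*)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


print(strip_accents_nfkd("Amélie"))  # Amelie
```

Because NFKD is a compatibility decomposition, it also rewrites characters such as ligatures into their plain components, which reduces the number of distinct symbols the BPE trainer has to handle.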

In [16]:
%%time
tokenizer.train_from_iterator(batch_iterator(dset, "contents", 100), trainer=trainer)




CPU times: user 30min 27s, sys: 3min 20s, total: 33min 48s
Wall time: 16.7 s


Finally, we wrap the tokenizer in a transformers tokenizer object to get access to the `push_to_hub` feature. A tokenizer built with the code above is already available on the Hugging Face Hub: https://huggingface.co/spacerini/bpe-imdb-25k. To push your own tokenizer, uncomment the last line below and use your Hugging Face credentials.

In [17]:
tokenizer_model = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer, vocab_size=VOCAB_SIZE
)
# tokenizer_model.push_to_hub("your-username/bpe-imdb-25k")

In [22]:
tokenizer.encode("hello world").tokens

['Ġhello', 'Ġworld']
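
The `Ġ` prefix in these tokens comes from the byte-level encoding: every byte of the UTF-8 input is mapped to a printable character, and the space byte ends up as `Ġ`. The sketch below reproduces that mapping in plain Python. It assumes the `ByteLevel` pre-tokenizer uses the same byte-to-character table as GPT-2; the helper names are hypothetical.

```python
def bytes_to_unicode() -> dict:
    """Build a GPT-2 style table mapping each byte value to a printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are moved to code points starting at 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TABLE = bytes_to_unicode()


def byte_level(text: str) -> str:
    """Map every UTF-8 byte of `text` to its printable stand-in."""
    return "".join(BYTE_TABLE[b] for b in text.encode("utf-8"))


print(byte_level(" hello"))  # Ġhello
```

With `add_prefix_space=True`, a space is prepended to the input before this mapping is applied, which is why the first token also starts with `Ġ`.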

## Build the index with a HF tokenizer as the Analyzer
To build an index that uses an HF tokenizer to tokenize the documents, we pass the `-analyzeWithHuggingFaceTokenizer` flag to the indexer, followed by the identifier of the tokenizer. For example, we pass `spacerini/bpe-imdb-25k` to use the tokenizer trained above. Under the hood, the indexer uses the tokenizer to normalize each document and split it into terms (or tokens).

In [49]:
JIndexCollection = autoclass("io.anserini.index.IndexCollection")

In [50]:
indexing_args = [
    "-input",
    shard_dir,
    "-index",
    "../indexes/bpe-imdb-25k",
    "-collection",
    "JsonCollection",
    "-threads",
    "28",
    "-analyzeWithHuggingFaceTokenizer",
    "spacerini/bpe-imdb-25k",
    "-storePositions",
    "-storeDocvectors",
    "-storeContents",
]
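
When indexing several collections or comparing tokenizers, it can be convenient to assemble the argument list with a small helper. The sketch below is one possible way to do that. The function name `build_indexing_args` and its parameters are hypothetical, and it only emits flags that already appear in this notebook.

```python
def build_indexing_args(input_dir, index_dir, tokenizer_name, threads=1):
    """Assemble the indexer arguments as a flat list of strings."""
    return [
        "-input", input_dir,
        "-index", index_dir,
        "-collection", "JsonCollection",
        # Every argument, including the thread count, is passed as a string.
        "-threads", str(threads),
        "-analyzeWithHuggingFaceTokenizer", tokenizer_name,
        "-storePositions",
        "-storeDocvectors",
        "-storeContents",
    ]


args = build_indexing_args(
    "../shards/imdb",
    "../indexes/bpe-imdb-25k",
    "spacerini/bpe-imdb-25k",
    threads=4,
)
print(args)
```

The resulting list can be passed to `JIndexCollection.main(...)` in the same way as `indexing_args`.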

In [51]:
JIndexCollection.main(indexing_args)

2023-05-17 15:22:52,552 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-05-17 15:22:52,552 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-05-17 15:22:52,552 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: ../shards/imdb
2023-05-17 15:22:52,553 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-05-17 15:22:52,553 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-05-17 15:22:52,553 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 28
2023-05-17 15:22:52,553 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: en
2023-05-17 15:22:52,553 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-05-17 15:22:52,555 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Keep st

## Using the index
Follow the next tutorials to see how you can interact with your index:
- searching: https://github.com/huggingface/gaia/blob/main/notebooks/02-searching.ipynb
- analysis: https://github.com/huggingface/gaia/blob/main/notebooks/03-analysis.ipynb