# Building an index

In this notebook we demonstrate how to create a Pyserini-backed BM25 index of a dataset available on the Hugging Face Hub. We start by importing the necessary dependencies.

In [1]:
import os
import json
import pprint

from tqdm.notebook import tqdm

from pyserini.index.lucene import LuceneIndexer, IndexReader
from pyserini.pyclass import autoclass

from datasets import load_dataset, Dataset
from datasets.utils.py_utils import convert_file_size_to_int

## Preparing the dataset
We will work with the IMDB dataset of movie reviews, available on the Hugging Face Hub at https://huggingface.co/datasets/imdb. We begin by loading the dataset as shown below. This operation downloads the full dataset to disk and creates a local cache to be reused in the future.

In [2]:
dset = load_dataset("imdb", split="train")
dset

Found cached dataset imdb (/home/piktus_huggingface_co/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

Pyserini expects the documents being indexed to follow a specific format: a string `id` column containing a unique identifier of a given datapoint, and a `contents` column containing the text of the document. We leverage the Hugging Face `datasets` API to reformat the dataset. More on `datasets` can be found at https://huggingface.co/docs/datasets/index.

In [3]:
dset = dset.add_column("id", [str(i) for i in range(len(dset))])
dset = dset.rename_column("text", "contents")
dset = dset.select_columns(["id", "contents"])
dset

Dataset({
    features: ['contents', 'id'],
    num_rows: 25000
})
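
The format requirement described above can be checked before any indexing happens. Below is a minimal sketch in plain Python, without Pyserini or `datasets`; the helper names `to_pyserini_docs` and `validate_docs` are hypothetical and introduced here for illustration only.

```python
import json


def to_pyserini_docs(texts):
    """Turn raw texts into dicts with a string `id` and a `contents` field."""
    return [{"id": str(i), "contents": text} for i, text in enumerate(texts)]


def validate_docs(docs):
    """Raise ValueError if a document lacks a string field or reuses an id."""
    seen = set()
    for doc in docs:
        if not isinstance(doc.get("id"), str):
            raise ValueError(f"`id` must be a string: {doc!r}")
        if not isinstance(doc.get("contents"), str):
            raise ValueError(f"`contents` must be a string: {doc!r}")
        if doc["id"] in seen:
            raise ValueError(f"duplicate id: {doc['id']}")
        seen.add(doc["id"])


docs = to_pyserini_docs(["A great movie.", "Not my cup of tea."])
validate_docs(docs)
# Each document serializes to a single JSON line, one record per line.
print(json.dumps(docs[0]))
```

Running such a check on a small sample first may catch integer ids or missing fields early, before a long indexing job starts.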

Let's see what a single document looks like!

In [4]:
pprint.pprint(dset[0])

{'contents': 'I rented I AM CURIOUS-YELLOW from my video store because of all '
             'the controversy that surrounded it when it was first released in '
             '1967. I also heard that at first it was seized by U.S. customs '
             'if it ever tried to enter this country, therefore being a fan of '
             'films considered "controversial" I really had to see this for '
             'myself.<br /><br />The plot is centered around a young Swedish '
             'drama student named Lena who wants to learn everything she can '
             'about life. In particular she wants to focus her attentions to '
             'making some sort of documentary on what the average Swede '
             'thought about certain political issues such as the Vietnam War '
             'and race issues in the United States. In between asking '
             'politicians and ordinary denizens of Stockholm about their '
             'opinions on politics, she has sex with her drama t

In order to speed up index building we shard the dataset into multiple files which can then be processed concurrently by Pyserini.

In [5]:
shard_dir = "../shards/imdb"
max_shard_size = convert_file_size_to_int("10MB")
dataset_nbytes = dset.data.nbytes
num_shards = int(dataset_nbytes / max_shard_size) + 1
num_shards = max(num_shards, 1)
print(f"Sharding into {num_shards} JSONL files.")
os.makedirs(shard_dir, exist_ok=True)
for shard_index in tqdm(range(num_shards)):
    shard = dset.shard(num_shards=num_shards, index=shard_index, contiguous=True)
    shard.to_json(
        f"{shard_dir}/docs-{shard_index:03d}.jsonl", orient="records", lines=True
    )

Sharding into 4 JSONL files.


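
The shard arithmetic used in the cell above can be illustrated without loading any data. The sketch below is plain Python and makes two assumptions: the 33 MB dataset size is made up for illustration, and `contiguous_bounds` is a hypothetical helper showing one even way to split rows into contiguous ranges, not necessarily what `datasets` does internally.

```python
def num_shards_for(dataset_nbytes, max_shard_size):
    """Same arithmetic as the sharding cell: integer quotient plus one."""
    return int(dataset_nbytes / max_shard_size) + 1


def contiguous_bounds(num_rows, num_shards, index):
    """Return the [start, end) row range of one contiguous shard.

    Rows are split as evenly as possible; the first
    `num_rows % num_shards` shards receive one extra row.
    """
    base, extra = divmod(num_rows, num_shards)
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end


# Assumption: "10MB" is interpreted as 10 * 1000**2 bytes.
max_shard_size = 10 * 1000**2
# Assumption: a dataset of about 33 MB, chosen only for illustration.
n = num_shards_for(33_000_000, max_shard_size)
bounds = [contiguous_bounds(25_000, n, i) for i in range(n)]
print(n, bounds)
```

Because the quotient is always incremented by one, a dataset whose size is an exact multiple of the shard size still gets one additional shard, so individual shards stay below the size limit on average.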

## Offline indexing
The most straightforward option is to index a dataset downloaded locally. We already have the dataset in the `../shards/imdb/` folder. We leverage the Pyserini API to build the index.

In [6]:
JIndexCollection = autoclass("io.anserini.index.IndexCollection")

In [7]:
indexing_args = [
    "-input",
    shard_dir,
    "-index",
    "../indexes/imdb",
    "-collection",
    "JsonCollection",
    "-threads",
    "28",
    "-language",
    "en",
    "-storePositions",
    "-storeDocvectors",
    "-storeContents",
]

In [8]:
%%time
JIndexCollection.main(indexing_args)

2023-05-17 12:34:45,495 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-05-17 12:34:45,497 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-05-17 12:34:45,497 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: ../shards/imdb
2023-05-17 12:34:45,498 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-05-17 12:34:45,498 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-05-17 12:34:45,499 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 28
2023-05-17 12:34:45,499 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: en
2023-05-17 12:34:45,499 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-05-17 12:34:45,499 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Keep st

The index is now built and available for querying in the `../indexes/imdb` folder. Note that the same index can be built using the command-line interface by running:
```
python -m pyserini.index.lucene \
    --collection JsonCollection \
    --input ../shards/imdb/ \
    --index ../indexes/imdb/ \
    --threads 28 \
    --storePositions \
    --storeDocvectors \
    --storeContents \
    --language en
```
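
The correspondence between the Python argument list and the command line can be sketched as follows. This is an illustration only: `to_cli` is a hypothetical helper, and it assumes that every single-dash flag in the list maps to a double-dash flag of the same name on the command line.

```python
import shlex


def to_cli(indexing_args, module="pyserini.index.lucene"):
    """Render an Anserini-style argument list as a shell command string."""
    cli_args = [
        # Assumption: "-flag" becomes "--flag"; plain values are kept as-is.
        "-" + arg if arg.startswith("-") and not arg.startswith("--") else arg
        for arg in indexing_args
    ]
    return shlex.join(["python", "-m", module, *cli_args])


indexing_args = [
    "-input", "../shards/imdb",
    "-index", "../indexes/imdb",
    "-collection", "JsonCollection",
    "-threads", "28",
    "-language", "en",
    "-storePositions",
    "-storeDocvectors",
    "-storeContents",
]
print(to_cli(indexing_args))
```

Note that this naive conversion would also rewrite a value that happens to start with a dash, so it is only suitable for argument lists like the one used in this notebook.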

## Datasets streaming
One of the contributions of our collaboration is the ability to stream data directly into the index, without first dumping the corpus into a local directory. This can be combined with the dataset streaming feature of the Hugging Face Hub, described at https://huggingface.co/docs/datasets/stream. This removes the need to save the full dataset on disk, which is particularly useful in low-disk environments.

In [9]:
dset_streaming = load_dataset("imdb", split="train", streaming=True)
dset_streaming

<datasets.iterable_dataset.IterableDataset at 0x7f61906bda80>

In [21]:
streaming_indexer = LuceneIndexer("../indexes/imdb-streaming", threads=28)

2023-05-17 12:40:47,715 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:141) - Using DefaultEnglishAnalyzer
2023-05-17 12:40:47,716 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:142) - Stemmer: porter
2023-05-17 12:40:47,716 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:143) - Keep stopwords? false
2023-05-17 12:40:47,716 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:144) - Stopwords file: null


In [22]:
%%time
for i, row in tqdm(enumerate(dset_streaming)):
    streaming_indexer.add_doc_dict({"contents": row["text"], "id": str(i)})
streaming_indexer.close()

0it [00:00, ?it/s]

CPU times: user 15.9 s, sys: 229 ms, total: 16.1 s
Wall time: 49.3 s


In [23]:
dset_streaming = load_dataset("imdb", split="train", streaming=True)
batched_streaming_indexer = LuceneIndexer(
    "../indexes/imdb-streaming-batched", threads=28
)

2023-05-17 12:42:21,062 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:141) - Using DefaultEnglishAnalyzer
2023-05-17 12:42:21,063 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:142) - Stemmer: porter
2023-05-17 12:42:21,063 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:143) - Keep stopwords? false
2023-05-17 12:42:21,063 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:144) - Stopwords file: null


In [24]:
%%time
batch_size = 1000
batch = []

for i, row in tqdm(enumerate(dset_streaming)):
    batch.append({"contents": row["text"], "id": str(i)})
    if len(batch) >= batch_size:
        batched_streaming_indexer.add_batch_dict(batch)
        batch = []

if batch:  # flush the final, possibly partial, batch
    batched_streaming_indexer.add_batch_dict(batch)

batched_streaming_indexer.close()

0it [00:00, ?it/s]

CPU times: user 17.4 s, sys: 864 ms, total: 18.3 s
Wall time: 44 s
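
When the number of rows is not a multiple of the batch size, the last batch is only partially filled and still has to be indexed. The sketch below shows a batching helper that always flushes that remainder. It is plain Python: `batched` is a hypothetical helper and `CountingIndexer` is a stand-in that only counts documents, not a Pyserini class.

```python
def batched(rows, batch_size):
    """Yield lists of up to `batch_size` rows, including a final partial one."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the remainder
        yield batch


class CountingIndexer:
    """Hypothetical stand-in for an indexer; it only counts what it receives."""

    def __init__(self):
        self.num_docs = 0
        self.num_batches = 0

    def add_batch_dict(self, batch):
        self.num_docs += len(batch)
        self.num_batches += 1


# A lazy generator plays the role of a streamed dataset.
rows = ({"id": str(i), "contents": f"review {i}"} for i in range(2500))
indexer = CountingIndexer()
for batch in batched(rows, batch_size=1000):
    indexer.add_batch_dict(batch)
print(indexer.num_docs, indexer.num_batches)
```

Keeping the batching logic in a generator means the indexing loop never has to hold more than one batch in memory, which fits the streaming setting.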


Streaming datasets from the Hugging Face Hub comes with an expected slowdown due to the need to maintain a network connection with the Hub. To better isolate the speed difference between the non-batched and the batched variants of streaming data into a Pyserini index, we also experiment with streaming into an index from local data.

In [14]:
%%time
streaming_indexer = LuceneIndexer("../indexes/imdb-streaming", threads=28)
for i, row in tqdm(enumerate(dset)):
    streaming_indexer.add_doc_dict(row)
streaming_indexer.close()

2023-05-17 12:36:36,806 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:141) - Using DefaultEnglishAnalyzer
2023-05-17 12:36:36,807 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:142) - Stemmer: porter
2023-05-17 12:36:36,807 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:143) - Keep stopwords? false
2023-05-17 12:36:36,807 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:144) - Stopwords file: null


0it [00:00, ?it/s]

CPU times: user 7.3 s, sys: 342 ms, total: 7.64 s
Wall time: 6.48 s


In [15]:
%%time
batched_streaming_indexer = LuceneIndexer(
    "../indexes/imdb-streaming-batched", threads=28
)
batch_size = 1000
batch = []

for i, row in tqdm(enumerate(dset)):
    batch.append(row)
    if len(batch) >= batch_size:
        batched_streaming_indexer.add_batch_dict(batch)
        batch = []

if batch:  # flush the final, possibly partial, batch
    batched_streaming_indexer.add_batch_dict(batch)

batched_streaming_indexer.close()

2023-05-17 12:36:43,293 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:141) - Using DefaultEnglishAnalyzer
2023-05-17 12:36:43,294 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:142) - Stemmer: porter
2023-05-17 12:36:43,294 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:143) - Keep stopwords? false
2023-05-17 12:36:43,294 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:144) - Stopwords file: null


0it [00:00, ?it/s]

CPU times: user 12.1 s, sys: 982 ms, total: 13.1 s
Wall time: 3.88 s


As we can see, with local data batched streaming is roughly 1.7 times faster in wall time for the considered dataset (3.88 s versus 6.48 s).
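
The speed comparison can be made explicit by dividing the wall times reported in the cell outputs above. This is a small arithmetic sketch; the dictionary layout and key names are chosen here for illustration.

```python
# Wall-clock times in seconds, copied from the cell outputs above.
wall_times = {
    "hub": {"single": 49.3, "batched": 44.0},
    "local": {"single": 6.48, "batched": 3.88},
}

# Speedup of the batched variant over adding one document at a time.
speedups = {
    source: round(t["single"] / t["batched"], 2)
    for source, t in wall_times.items()
}
print(speedups)  # {'hub': 1.12, 'local': 1.67}
```

These are single measurements, so the ratios should be read as rough indications rather than benchmarks.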

# Using the index
Follow the next tutorials to see how you can interact with your index:
- searching: https://github.com/huggingface/gaia/blob/main/notebooks/02-searching.ipynb
- analysis: https://github.com/huggingface/gaia/blob/main/notebooks/03-analysis.ipynb