# Creating Memory Mapped NLP-based Data

In this notebook, we use a fast dataset provider that interfaces with Hugging Face's `datasets` library (the underlying abstraction was created by HazyResearch). Its key advantage is that it uses either shared memory or memory maps in Python to speed up caching. The encoded dataset is cached as a contiguous NumPy array, so the same cache can be read back with any sequence length and the data does not need to be re-encoded for each length.
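
The idea of reading one cached array at several sequence lengths can be illustrated with plain NumPy. This is a minimal sketch, not the provider's actual implementation: it assumes a flat one-dimensional token array saved as a `.npy` file, and `get_window` is a hypothetical helper introduced here for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical flat token stream standing in for an encoded dataset.
tokens = np.arange(1000, dtype=np.int32)


def get_window(data, idx, seq_len):
    """Return a copy of the idx-th non-overlapping window of length seq_len."""
    start = idx * seq_len
    # np.array copies the slice, so the result no longer references the file.
    return np.array(data[start:start + seq_len])


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.npy")
    np.save(path, tokens)

    # mmap_mode="r" maps the file instead of loading it fully into memory.
    mapped = np.load(path, mmap_mode="r")

    # The same cached array serves two different sequence lengths.
    short_window = get_window(mapped, 1, 128)  # tokens 128..255
    long_window = get_window(mapped, 1, 512)   # tokens 512..999 (shorter tail)

    del mapped  # release the mapping before the directory is removed
```

Because the windows are computed by slicing at read time, changing the sequence length only changes the index arithmetic, not the file on disk.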

## Instantiating the Provider

The first step is to call `FastHfDatasetProvider.from_hub()`, a class method that loads and encodes the dataset. It accepts the following arguments:

* `dataset_name`: Name of the dataset.
* `dataset_config_name`: Name of the dataset configuration.
* `data_dir`: Path to the data directory.
* `tokenizer`: Instance of tokenizer to use.
* `tokenizer_name`: Name of the tokenizer, if `tokenizer` has not been passed.
* `mapping_column_name`: Columns of the dataset that should be tokenized.
* `validation_split`: Fraction of the dataset to use for validation.
* `seed`: Random seed.
* `num_workers`: Number of workers to use for encoding.
* `use_eos_token`: Whether to use EOS token to separate sequences.
* `use_shared_memory`: Whether to use shared memory for caching.
* `cache_dir`: Path to the cache directory.
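
To make the effect of `use_eos_token` concrete, the sketch below concatenates several tokenized sequences into one flat stream and optionally appends an EOS id after each. It is an illustration under assumptions, not the library's code: `concat_with_eos` is a hypothetical helper, and the token ids are made up except for `50256`, which is GPT-2's end-of-text id.

```python
from typing import Iterable, List

EOS_TOKEN_ID = 50256  # GPT-2 end-of-text token id


def concat_with_eos(
    sequences: Iterable[List[int]],
    eos_token_id: int = EOS_TOKEN_ID,
    use_eos_token: bool = True,
) -> List[int]:
    """Flatten tokenized sequences, optionally appending EOS after each one."""
    flat: List[int] = []
    for seq in sequences:
        flat.extend(seq)
        if use_eos_token:
            flat.append(eos_token_id)
    return flat


# Made-up token ids for two short sequences.
with_eos = concat_with_eos([[11, 12, 13], [21, 22]])
without_eos = concat_with_eos([[11, 12, 13], [21, 22]], use_eos_token=False)
```

With the separator enabled, sequence boundaries remain recoverable from the flat stream even after it is cut into fixed-length windows.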

In [1]:
from archai.datasets.nlp.fast_hf_dataset_provider import FastHfDatasetProvider

# The provider will automatically download the dataset and tokenizer, encode
# the dataset and cache it for future use
dataset_provider = FastHfDatasetProvider.from_hub(
    "glue",
    dataset_config_name="sst2",
    tokenizer_name="gpt2",
    mapping_column_name=["sentence"],
    use_shared_memory=False,
    cache_dir="cache/glue-sst2-gpt2"
)

# (inputs, labels) can be retrieved with any sequence length
train_dataset = dataset_provider.get_train_dataset(seq_len=512)
val_dataset = dataset_provider.get_val_dataset(seq_len=512)
print(train_dataset[0], val_dataset[0])

2023-02-24 11:58:47,846 - archai.datasets.nlp.fast_hf_dataset_provider — INFO —  Downloading dataset ...


Found cached dataset glue (C:/Users/gderosa/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



2023-02-24 11:58:52,905 - archai.datasets.nlp.fast_hf_dataset_provider — INFO —  Encoding dataset ...
2023-02-24 11:58:52,907 - archai.datasets.nlp.fast_hf_dataset_provider — INFO —  Number of workers: 1 | EOS token: True


Loading cached processed dataset at C:\Users\gderosa\.cache\huggingface\datasets\glue\sst2\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-4b268755324077d2.arrow
Loading cached processed dataset at C:\Users\gderosa\.cache\huggingface\datasets\glue\sst2\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-3313fdb7893e6922.arrow
Loading cached processed dataset at C:\Users\gderosa\.cache\huggingface\datasets\glue\sst2\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-252c3cce9291d908.arrow


2023-02-24 11:58:53,054 - archai.datasets.nlp.fast_hf_dataset_provider — INFO —  Processing dataset to memory ...
2023-02-24 11:58:53,056 - archai.datasets.nlp.fast_hf_dataset_provider — INFO —  Number of workers: 1 | Shared memory: False



2023-02-24 11:58:54,300 - archai.datasets.nlp.fast_hf_dataset_provider — INFO —  Saving dataset to: cache\glue-sst2-gpt2
(tensor([24717,   649,  3200,   507,   422,   262, 21694,  4991,   220, 50256,
         3642,  1299,   645, 20868,   837,   691,  2248,  1850,   308,  3775,
          220, 50256,  5562, 10408,   663,  3435,   290, 48556,  1223,  2138,
         4950,   546,  1692,  3450,   220, 50256,  2787,  1299, 15950, 11378,
          284,  3520,   262,   976,  3690,   220, 50256,   261,   262,  5290,
        15827,    12,  1659,    12,  1169,    12,  1008,  9310, 35478, 20954,
          262, 28303,   714, 47478,   469,   510,   220, 50256,  5562,   705,
           82,  1290,  1165, 15444,   284, 17004,   884, 31194,  3513,   220,
        50256, 26567,  2536,   689,   326,   262,  3437,   286,   884,   289,
        31777,  2512, 30181,   355, 29408,  1830,   460,   991,  1210,   503,
          257,  1402,   837,  2614,  2646,   351,   281,  7016,  3355,   404,
          764,   220

## Loading from Cache

After the dataset has been loaded and encoded for the first time, a cache is created with a unique fingerprint (identifier) based on its configuration. The cache is composed of the following files:

* `config.json`: Dataset provider configuration (used to re-create the object when loaded from cache).
* `tokenizer.pkl`: Tokenizer used to encode the data (also re-created when loaded from cache).
* `train.npy`: Training tokens (inputs and labels).
* `validation.npy`: Validation tokens (inputs and labels).
* `test.npy`: Testing tokens (inputs and labels).
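
One common way to derive a fingerprint from a configuration is to hash a canonical serialization of it. The source does not describe the scheme the provider actually uses, so the following is only a sketch of the general technique; `config_fingerprint` and the example configuration keys are assumptions made for illustration.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a configuration dictionary into a stable hexadecimal identifier."""
    # sort_keys=True makes the serialization independent of key order.
    serialized = json.dumps(config, sort_keys=True)
    return hashlib.sha1(serialized.encode("utf-8")).hexdigest()


# Hypothetical configurations; the real provider may store different keys.
config_a = {"dataset_name": "glue", "dataset_config_name": "sst2", "tokenizer_name": "gpt2"}
config_b = {"tokenizer_name": "gpt2", "dataset_config_name": "sst2", "dataset_name": "glue"}
config_c = {"dataset_name": "glue", "dataset_config_name": "mrpc", "tokenizer_name": "gpt2"}

fingerprint_a = config_fingerprint(config_a)
fingerprint_b = config_fingerprint(config_b)
fingerprint_c = config_fingerprint(config_c)
```

Identical settings map to the same identifier regardless of key order, while any change to the configuration produces a different one.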

The `FastHfDatasetProvider` class provides a `from_cache` method that re-instantiates the dataset provider from its cache, so it can be reused elsewhere without encoding the data again.
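
Before loading from a cache directory, it can be useful to check that the files listed above are present. The helper below is a hypothetical convenience written for this notebook, not part of the library, and it assumes all three splits were cached; datasets without a given split may legitimately lack the corresponding file.

```python
import tempfile
from pathlib import Path

# File names taken from the cache layout described above.
EXPECTED_FILES = ("config.json", "tokenizer.pkl", "train.npy", "validation.npy", "test.npy")


def missing_cache_files(cache_dir, expected=EXPECTED_FILES):
    """Return the expected cache files that are absent from cache_dir."""
    root = Path(cache_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration on a temporary directory containing only two of the files.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "config.json").touch()
    (Path(tmp_dir) / "train.npy").touch()
    missing = missing_cache_files(tmp_dir)

# Possible usage with the provider (left commented so the sketch runs alone):
# if not missing_cache_files("cache/glue-sst2-gpt2"):
#     dataset_provider = FastHfDatasetProvider.from_cache("cache/glue-sst2-gpt2")
```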

In [2]:
# The caching mechanism automatically saves `config.json` and `tokenizer.pkl`,
# which are used to recreate the provider when calling the `from_cache` method
dataset_provider = FastHfDatasetProvider.from_cache("cache/glue-sst2-gpt2")

train_dataset = dataset_provider.get_train_dataset(seq_len=512)
val_dataset = dataset_provider.get_val_dataset(seq_len=512)
print(train_dataset[0], val_dataset[0])

2023-02-24 11:58:54,579 - archai.datasets.nlp.fast_hf_dataset_provider — INFO —  Loading dataset from: cache/glue-sst2-gpt2
(tensor([24717,   649,  3200,   507,   422,   262, 21694,  4991,   220, 50256,
         3642,  1299,   645, 20868,   837,   691,  2248,  1850,   308,  3775,
          220, 50256,  5562, 10408,   663,  3435,   290, 48556,  1223,  2138,
         4950,   546,  1692,  3450,   220, 50256,  2787,  1299, 15950, 11378,
          284,  3520,   262,   976,  3690,   220, 50256,   261,   262,  5290,
        15827,    12,  1659,    12,  1169,    12,  1008,  9310, 35478, 20954,
          262, 28303,   714, 47478,   469,   510,   220, 50256,  5562,   705,
           82,  1290,  1165, 15444,   284, 17004,   884, 31194,  3513,   220,
        50256, 26567,  2536,   689,   326,   262,  3437,   286,   884,   289,
        31777,  2512, 30181,   355, 29408,  1830,   460,   991,  1210,   503,
          257,  1402,   837,  2614,  2646,   351,   281,  7016,  3355,   404,
          764,   