# "Speed: Fastai vs HuggingFace nlp Datasets"

> Speedtest: Fastai's `TextDataLoaders` vs HuggingFace's `nlp` Datasets

- badges: true
- categories: [nlp, fastai, dataloader]
- image: images/bokeh_mini.png

## TODOs:
Add post processing

## tl;dr
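On 10% of Sentiment140, Fastai's `TextDataLoaders` was faster to initialise (48s vs 71s), but HuggingFace's `nlp` Datasets pipeline was faster to iterate through for 1 epoch (11s vs 14s).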

## Speed
I started playing around with the `nlp` library recently and was blown away by the speed at which you can iterate through the data (thanks to PyArrow wizardry). It's seriously fast!

> twitter: https://twitter.com/Thom_Wolf/status/1272512974935203841

So I wondered if there was a significant speed-up to be gained by doing as much data processing as I could with the library. After previously discovering Fastai's functionality to do [faster text loading](https://forums.fast.ai/t/nlp-speed-up-if-using-sorteddl/74636), I was in the market for more speed!

The library supports 100+ common datasets, and thanks to [this pointer from Thomas Wolf](https://discuss.huggingface.co/t/nlp-0-3-0-is-out/50/3) on the new HuggingFace forums I learned that you can also easily load your own CSVs and bask in all of that speedy goodness!

```
from nlp import load_dataset

dataset = load_dataset('csv', data_files='my_file.csv')
dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 
                                          'test': 'my_test_file.csv'})

```

So, is it faster?

## Experiment setup

We'll be comparing Fastai's high-level `TextDataLoaders` class to a custom data processing pipeline using HuggingFace's `nlp` datasets library.

This Fastai class does a bunch of different things:
- Pre and Post Processing
- Tokenization: By default it uses Spacy's tokenizer, creates a vocabulary and parallelises the tokenization
- Optimizations: Sorting data by text sample length and padding only to the longest item in the batch, [similar to what was described here](https://towardsdatascience.com/divide-hugging-face-transformers-training-time-by-2-or-more-21bf7129db9q-21bf7129db9e)
- Creates train and validation dataloaders
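To see why sorting by length matters for padding, here is a minimal sketch in plain Python (no fastai or PyTorch). The helper names `pad_to_longest` and `padding_needed` and the random fake data are made up for illustration; only the idea of padding each batch to its longest item comes from the description above.

```python
import random

PAD_ID = 1  # assumed padding id; `xxpad` sits at index 1 in fastai's default special tokens

def pad_to_longest(batch, pad_id=PAD_ID):
    "Pad every sequence in `batch` to the length of the longest one in that batch"
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

def padding_needed(seqs, bs=64):
    "Total number of pad tokens added when `seqs` is cut into consecutive batches of `bs`"
    total = 0
    for i in range(0, len(seqs), bs):
        batch = seqs[i:i + bs]
        padded = pad_to_longest(batch)
        total += sum(len(row) for row in padded) - sum(len(seq) for seq in batch)
    return total

random.seed(0)
# Fake "numericalized" texts: lists of token id 2 with random lengths between 1 and 40
seqs = [[2] * random.randint(1, 40) for _ in range(1000)]

unsorted_pad = padding_needed(seqs)
sorted_pad = padding_needed(sorted(seqs, key=len))
print(f"pad tokens without sorting: {unsorted_pad}, with sorting: {sorted_pad}")
```

When the texts are sorted first, each batch contains items of similar length, so far fewer pad tokens end up in the batches.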

The `nlp` Datasets pipeline I wrote tries to replicate all of the core functionality of `TextDataLoaders` as best I could.

> Note: I couldn't figure out how to parallelise the text processing with `nlp`, although this is probably down to my lack of experience with parallelism rather than a limitation of the library

### Sentiment Dataset
For this experiment I used the [Sentiment140](https://huggingface.co/datasets/sentiment140) dataset, a sentiment classification dataset of Twitter data.

For our experiment we'll use
- 10% of the Sentiment140 dataset (160,000 tweets, 11.8M space-separated tokens), pulled from the `nlp` library
- 80/20 train/val split

### Experiment Settings
A full timed run consists of:

0. Reading the data from disk, from a csv for fastai and from a PyArrow file for `nlp`

1. Applying [fastai's default text pre-processing functions](http://dev.fast.ai/text.core#Preprocessing-rules). These will:

    - Fix various messy bits of html sometimes seen in documents
    - Replace repetitions at the character level, e.g. `cccc` becomes `TK_REP 4 c`
    - Replace word repetitions, e.g. `cow cow cow cow` becomes `TK_WREP 4 cow`
    - Add spaces around / and #
    - Remove multiple spaces
    - Replace tokens in ALL CAPS by their lower-case version and add `TK_UP` before
    - Replace capitalized tokens (Sentence Case) by their lower-case version and add `TK_MAJ` before
    - Lowercase everything


2. Tokenizing based on Spacy's tokenizer (fastai's default)

3. Applying a post-processing rule which replaces embedded spaces in a token with unicode line char to allow for split/join

4. Performing 1 epoch iterating through the training data, bs = 64
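To give a feel for what the repetition rules in step 1 do, here is a simplified re-implementation using only the standard library. This is a sketch, not fastai's actual code: the regexes are my own approximations, and `TK_REP` / `TK_WREP` are assumed to hold the `xxrep` / `xxwrep` strings from fastai's special tokens.

```python
import re

# Assumed values of the special tokens (fastai's defaults include 'xxrep' and 'xxwrep')
TK_REP, TK_WREP = "xxrep", "xxwrep"

_re_rep = re.compile(r"(\S)(\1{2,})")               # a non-space character repeated 3+ times
_re_wrep = re.compile(r"\b(\w+)((?:\s+\1\b){2,})")  # a word repeated 3+ times

def replace_rep(text):
    "Replace character repetitions, e.g. `cccc` -> ` xxrep 4 c `"
    def _sub(m):
        char, rest = m.groups()
        return f" {TK_REP} {len(rest) + 1} {char} "
    return _re_rep.sub(_sub, text)

def replace_wrep(text):
    "Replace word repetitions, e.g. `cow cow cow cow` -> `xxwrep 4 cow`"
    def _sub(m):
        word, rest = m.groups()
        return f"{TK_WREP} {len(rest.split()) + 1} {word}"
    return _re_wrep.sub(_sub, text)

def rm_useless_spaces(text):
    "Collapse runs of spaces into a single space"
    return re.sub(" {2,}", " ", text)

print(rm_useless_spaces(replace_rep("this is sooo good")))
print(replace_wrep("cow cow cow cow"))
```

In the real pipeline the rules are applied one after another to every text before tokenization, which is why the special tokens are also registered with the tokenizer as special cases.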



## Results

#### 10% Data
Results are...mixed! While the Fastai convenience function had a faster init (48s vs 71s), the PyArrow-backed `nlp` pipeline ran through a single epoch significantly faster (11s vs 14s).

| 0.16M ROWS: | Init (s) | 1 epoch (s) | 1 mini-batch [bs=64] (ms) |
| :- | :-: | :-: | :-: |
| **Fastai** | 124 | 14.28 | 7.4 |
| **Fastai w/sorted** | **48.1** | 14.25 | 7.4 |
| **nlp** | 71.2 | **11.27** | 5.6 |
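To make the trade-off between init time and epoch time concrete, here is a small back-of-the-envelope sketch using the 10% timings from the table. It assumes the total time is simply the init time plus a constant cost per epoch, which ignores any run-to-run variance.

```python
import math

# Timings (seconds) taken from the 10% results table
fastai_init, fastai_epoch = 48.1, 14.25   # Fastai w/sorted
nlp_init, nlp_epoch = 71.2, 11.27         # nlp

def total_time(init, per_epoch, n_epochs):
    # Assumes a fixed init cost plus a constant cost per epoch
    return init + per_epoch * n_epochs

# Number of epochs at which both pipelines have spent the same total time
break_even = (nlp_init - fastai_init) / (fastai_epoch - nlp_epoch)
first_epoch_nlp_wins = math.ceil(break_even)

print(f"break-even after {break_even:.2f} epochs")
print(f"nlp is ahead from epoch {first_epoch_nlp_wins} onwards")
```

With these numbers the break-even lands between 7 and 8 epochs, so under this simple model the faster Fastai init only pays off for short runs.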

#### 100% Data
With 100% of the data, the difference in init time is clearer. 

| 1.6M ROWS: | Init (s) | 1 epoch (s) |
| :- | :-: | :-: |
| **Fastai** | | |
| **Fastai w/sorted** | | |
| **nlp**| 1290 | |

In [None]:
#hide
import matplotlib.pyplot as plt

def total_time(n_epochs, init, per_ep):
    "Total seconds for `init` plus `n_epochs` epochs of `per_ep` seconds each"
    return init + n_epochs * per_ep

# [init (s), 1 epoch (s)] measured on 10% of the data
fastai_sorted_10 = [48, 14.25]
nlp_10 = [71, 11.27]

n_eps = list(range(0, 20, 2))
labels = ['Fastai w/sorted', 'nlp']

# One list of total times per pipeline, one entry per value in `n_eps`
totals = [[total_time(n_epochs=n, init=t[0], per_ep=t[1]) for n in n_eps]
          for t in (fastai_sorted_10, nlp_10)]

for label, timing_data in zip(labels, totals):
    plt.plot(n_eps, timing_data, label=label)
plt.xlabel('epochs'); plt.ylabel('total time (s)'); plt.legend()
plt.show();

In [1]:
#hide
%reload_ext autoreload
%autoreload 2

from fastai2.basics import *
from fastai2.text.all import *
# from fastai2.callback.all import *
# from fastai2.data.transforms import RandomSplitter
from fastai2.text.core import defaults

from nlp import load_dataset

import spacy,html
from spacy.symbols import ORTH

import timeit

Note: Dynamic Padding is only needed if you are actually feeding batches to a model. If not, then `SortedDL` should also be bypassed!

## Preprocessing Tasks

In [23]:
#hide 
# The Pre and Post-Processing functions as well as the special tokens can be found here
print(defaults.text_proc_rules,'\n\n', defaults.text_postproc_rules,'\n\n', defaults.text_spec_tok)

[<function fix_html at 0x7f25ff011ef0>, <function replace_rep at 0x7f25ff011dd0>, <function replace_wrep at 0x7f25ff011e60>, <function spec_add_spaces at 0x7f26106f75f0>, <function rm_useless_spaces at 0x7f261067bb00>, <function replace_all_caps at 0x7f25ff011f80>, <function replace_maj at 0x7f25ff01c050>, <function lowercase at 0x7f25ff01c0e0>] 

 [<function replace_space at 0x7f25ff01c170>] 

 ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj']


## Fastai Testing

In [6]:
#hide_collapse
#%%timeit

# Read data; the first 10% of the sentiment140 dataset, extracted from the `nlp` library and saved as a csv
fn_10pct = 'sentiment140_10pct.csv'
df = pd.read_csv(fn_10pct, index_col=None)

# SORT: Calculate text sample lengths
df['word_count'] = df['text'].str.split().map(len)

res=df['word_count'].values

# Create Dataloaders
dls = TextDataLoaders.from_csv(path='.', csv_fname=fn_10pct, valid_pct=0.2, bs=64, 
                               text_col='text', label_col='sentiment' , res=res)

# Do 1 pass of the training dataloader
s = """for b in dls.train:
            pass
    """

time = timeit.timeit(stmt=s, number=1, globals=globals()); time
time, time / len(dls.train)
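As an alternative to passing a code string to `timeit`, a single pass can also be timed with a small helper. This is just a sketch: `time_one_pass` is a hypothetical name, and `range(1000)` stands in for a real dataloader such as `dls.train`.

```python
from time import perf_counter

def time_one_pass(iterable):
    "Return (total seconds, milliseconds per item) for a single pass over `iterable`"
    start = perf_counter()
    n_items = 0
    for _ in iterable:
        n_items += 1
    total = perf_counter() - start
    # Guard against an empty iterable when computing the per-item time
    return total, 1000 * total / max(n_items, 1)

total_s, ms_per_batch = time_one_pass(range(1000))  # stand-in for `dls.train`
print(f"{total_s:.4f}s total, {ms_per_batch:.4f}ms per batch")
```

Counting the items inside the loop means the per-batch figure does not depend on the iterable having a `len`.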

## HuggingFace `nlp` Datasets Testing

Tokenizer, Numericalizer and Padding functions

In [2]:
#hide_collapse
class SpacyTokenizerNLP():
    "Spacy tokenizer for `lang`"
    def __init__(self, lang='en', special_toks=None, buf_sz=5000):
        self.special_toks = ifnone(special_toks, defaults.text_spec_tok)
        nlp = spacy.blank(lang, disable=["parser", "tagger", "ner"])
        for w in self.special_toks: nlp.tokenizer.add_special_case(w, [{ORTH: w}])
        self.pipe,self.buf_sz = nlp.pipe,buf_sz
        
    def encodes(self, items):
        tmp = [list(doc) for doc in self.pipe(items, batch_size=self.buf_sz)]
        return {'tok_text_pre': [list(str(t) for t in l) for l in tmp]}

def make_vocab(count, min_freq=3, max_vocab=60000, special_toks=None):
    "Create a vocab of `max_vocab` size from `Counter` `count` with items present more than `min_freq`"
    vocab = [o for o,c in count.most_common(max_vocab) if c >= min_freq]
    special_toks = ifnone(special_toks, defaults.text_spec_tok)
    for o in reversed(special_toks): #Make sure all special tokens are in the vocab
        if o in vocab: vocab.remove(o)
        vocab.insert(0, o)
    vocab = vocab[:max_vocab]
    return vocab + [f'xxfake' for i in range(0, 8-len(vocab)%8)]

class NumericalizeNLP(Transform):
    "Reversible transform of tokenized texts to numericalized ids"
    def __init__(self, dsets=None, vocab=None, min_freq=3, max_vocab=60000, special_toks=None, pad_tok=None):
        store_attr(self, 'vocab,min_freq,max_vocab,special_toks,pad_tok')
        self.vocab, self.special_toks, self.min_freq, self.max_vocab = vocab, special_toks, min_freq, max_vocab
        self.o2i = None if vocab is None else defaultdict(int, {v:k for k,v in enumerate(vocab)})

        if self.vocab is None:
            count = Counter(p for o in dsets for p in o)
            self.vocab = make_vocab(count, min_freq=self.min_freq, max_vocab=self.max_vocab, special_toks=self.special_toks)
            self.o2i = defaultdict(int, {v:k for k,v in enumerate(self.vocab) if v != 'xxfake'})
    
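    # Numericalize a single tokenized text (fastai's standard `encodes`)
    def encodes(self, o): return TensorText(tensor([self.o2i[o_] for o_ in o]))
    # Numericalize a whole batch, for use with `nlp`'s `map(..., batched=True)`
    def encodes_nlp(self, b): return {'toks' : [[self.o2i[o_] for o_ in oo] for oo in b['tok_text']]}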
    
# Padding functions
def pad_seq(x, max_batch_len, pad_idx):    
    pad =  x.new_zeros(max_batch_len-x.size(0))+pad_idx
    return torch.cat([x, pad])
 
# Pad up to longest item in the batch and put batch on the GPU
def pad_batch(batch=None, pad_token_id=1):
    batch_inputs = list()
    max_size = max([len(item['toks']) for item in batch])
    for item in batch:
        batch_inputs += [pad_seq(item['toks'], max_size, pad_token_id)]
    return torch.stack(batch_inputs).cuda()
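The vocab and numericalization logic above boils down to counting tokens, keeping the frequent ones, and mapping everything else to index 0 (`xxunk`). A stripped-down sketch of that idea in plain Python, with a shortened special-token list and a hypothetical `build_vocab` helper (the `xxfake` padding of the vocab is left out):

```python
from collections import Counter, defaultdict

SPECIAL_TOKS = ["xxunk", "xxpad", "xxbos"]  # shortened list, for illustration only

def build_vocab(tokenized_texts, min_freq=2, max_vocab=60000):
    "Special tokens first, then every token seen at least `min_freq` times, most common first"
    counts = Counter(tok for text in tokenized_texts for tok in text)
    words = [tok for tok, c in counts.most_common(max_vocab)
             if c >= min_freq and tok not in SPECIAL_TOKS]
    return (SPECIAL_TOKS + words)[:max_vocab]

def numericalize(tokens, o2i):
    "Map tokens to ids; unknown tokens fall back to 0 because `o2i` is a defaultdict(int)"
    return [o2i[tok] for tok in tokens]

texts = [["xxbos", "i", "love", "this"], ["xxbos", "i", "hate", "this"]]
vocab = build_vocab(texts)
o2i = defaultdict(int, {tok: i for i, tok in enumerate(vocab)})

ids = numericalize(["xxbos", "i", "adore", "this"], o2i)
print(vocab)  # ['xxunk', 'xxpad', 'xxbos', 'i', 'this']
print(ids)    # [2, 3, 0, 4]
```

`love`, `hate` and `adore` all end up as `xxunk` here because they appear fewer than `min_freq` times in the toy data.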

In [5]:
#hide_collapse

# Download the text; a clean version of the dataset is downloaded (not included in the timings)
senti_dataset = load_dataset('sentiment140', split='train[:100%]', download_mode='reuse_cache_if_exists')

spacy_tok = SpacyTokenizerNLP(lang='en', special_toks=defaults.text_spec_tok)

def preproc_and_tok(b): return spacy_tok.encodes(list(maps(*defaults.text_proc_rules, b['text'])))

def postproc(b): 
    return {'tok_text': [list(maps(*defaults.text_postproc_rules, _b)) for _b in b['tok_text_pre']]}

def get_tok_lengths(example_batch): return {'tok_lens': [len(e) for e in example_batch['toks']]}

def prepare_dataset(dataset):
    '''
        Takes a raw nlp dataset and returns a processed, tokenized, numericalised dataset
    '''
    # Apply processing rules and tokenize
    dataset = dataset.map(preproc_and_tok, batched=True)

    # Apply post-processing rules 
    dataset = dataset.map(postproc, batched=True)

    # Init Numericalizer and create vocab from the post-processed tokens (the same column that gets numericalized)
    numeric = NumericalizeNLP(dsets=dataset['tok_text'], special_toks=defaults.text_spec_tok, pad_tok=1)

    # Numericalize
    dataset = dataset.map(numeric.encodes_nlp, batched=True)

    # Get sample lengths for sorting
    dataset=dataset.map(get_tok_lengths, batched=True)

    # Sort dataset from small to large
    dataset = dataset.sort('tok_lens')
    
    return dataset

Downloading and preparing dataset sentiment140/sentiment140 (download: 77.59 MiB, generated: 214.21 MiB, total: 291.81 MiB) to /home/morgan/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0...



Dataset sentiment140 downloaded and prepared to /home/morgan/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0. Subsequent calls will reuse this data.
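The `dataset.map(..., batched=True)` calls in `prepare_dataset` all rely on the same contract: the function receives a dict of columns (each a list) and returns a dict of new columns. The sketch below mimics that contract in plain Python so it can be run without the library. `map_batched` is a hypothetical stand-in, not part of `nlp`, and the real `map` does considerably more (Arrow-backed storage, caching).

```python
def map_batched(columns, fn, batch_size=1000):
    "Call `fn` on consecutive slices of every column and add the columns it returns"
    n_rows = len(next(iter(columns.values())))
    new_cols = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: vals[start:start + batch_size] for name, vals in columns.items()}
        for name, vals in fn(batch).items():
            new_cols.setdefault(name, []).extend(vals)
    return {**columns, **new_cols}

def get_tok_lengths(example_batch):
    # Same shape as the function used in `prepare_dataset`: dict of lists in, dict of lists out
    return {'tok_lens': [len(e) for e in example_batch['toks']]}

data = {'toks': [[2, 3], [2], [2, 3, 4]]}
data = map_batched(data, get_tok_lengths, batch_size=2)

# Row order after sorting by length, shortest first
order = sorted(range(len(data['tok_lens'])), key=lambda i: data['tok_lens'][i])
print(data['tok_lens'], order)  # [2, 1, 3] [1, 0, 2]
```

Because each function only ever sees a slice of the columns, the same functions work unchanged whether the dataset has 160,000 rows or 1.6M.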


In [None]:
#%%timeit -n 1 -r 1
#hide_collapse

# Do all of the text processing, tokenization and numericalization
senti_dataset_f = prepare_dataset(senti_dataset)

# Create train and test splits: `.train_test_split` is giving me an error, let's use `.select` instead
train_split=int(len(senti_dataset_f)*0.8)
train_senti = senti_dataset_f.select(list(range(train_split)))
test_senti = senti_dataset_f.select(list(range(train_split, len(senti_dataset_f))))

# Format our dataset to output torch.Tensor to train a PyTorch model
columns = ['toks']
train_senti.set_format(type='torch', columns=columns)
test_senti.set_format(type='torch', columns=columns)

# Instantiate our PyTorch Dataloaders
train_dataloader = torch.utils.data.DataLoader(train_senti, batch_size=64, collate_fn=pad_batch)
test_dataloader = torch.utils.data.DataLoader(test_senti, batch_size=64, collate_fn=pad_batch)

# # Do 1 epoch
# for b in train_dataloader: 
#     pass

100%|██████████| 1600/1600 [06:53<00:00,  3.87it/s]
100%|██████████| 1600/1600 [00:27<00:00, 58.24it/s]
100%|██████████| 1600/1600 [00:32<00:00, 49.38it/s]
100%|██████████| 1600/1600 [00:33<00:00, 47.25it/s]
100%|██████████| 1600000/1600000 [06:48<00:00, 3912.30it/s]
 69%|██████▉   | 887967/1280000 [02:24<01:50, 3546.51it/s] 
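One thing to keep in mind with the `.select` split above: because the dataset is sorted by length first, taking the first 80% puts the shortest texts in train and the longest in validation. A minimal sketch of building shuffled index lists that could be passed to `.select` instead is shown below. `split_indices` is a hypothetical helper, and sorting each index list afterwards is my assumption for keeping the length ordering within each split.

```python
import random

def split_indices(n_rows, valid_pct=0.2, seed=42):
    "Randomly assign row indices to train/valid, keeping each list in ascending order"
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = n_rows - int(n_rows * valid_pct)
    # Sorting the indices preserves the original (length-sorted) row order within each split
    return sorted(idx[:cut]), sorted(idx[cut:])

train_idx, valid_idx = split_indices(10)
print(len(train_idx), len(valid_idx))  # 8 2
```

With the real dataset this would be called as `split_indices(len(senti_dataset_f))`, with the two lists taking the place of the `range(...)` lists in the cell above.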

In [None]:
#%%timeit

# Do 1 pass of the training dataloader
s = """for b in train_dataloader: 
            pass
    """
time = timeit.timeit(stmt=s, number=1, globals=globals()); time
time, time / len(train_dataloader)