# Compare JSON and Parquet Extracted Features (EF) representations

In [20]:
from htrc_features import Volume, utils
import os
import pandas as pd

In [10]:
ids = pd.read_csv('../test_dataset_htids.csv.gz', names=['htid'])['htid']
jsonpaths = ids.apply(lambda x: '/data/extracted-features/' + utils.id_to_rsync(x))
parqpaths = ids.apply(lambda x: '/data/extracted-features-parquet/' + utils.id_to_rsync(x))
# Strip the literal suffix; regex=False keeps '.' from being treated as a wildcard
parqpaths = parqpaths.str.replace('.json.bz2', '', regex=False)

In [19]:
jsonpaths.iloc[0], parqpaths.iloc[0]

('/data/extracted-features/uc1/pairtree_root/$b/29/69/86/$b296986/uc1.$b296986.json.bz2',
 '/data/extracted-features-parquet/uc1/pairtree_root/$b/29/69/86/$b296986/uc1.$b296986')

## File Size

In [55]:
def stat_if(path):
    """Return the file size in bytes, or 0 if the file is missing or unreadable."""
    try:
        return os.stat(path).st_size
    except OSError:
        return 0

In [66]:
# Total sizes in GiB (bytes / 1024**3); missing files count as 0
jsonsize = jsonpaths.apply(stat_if).div(1024**3).sum()
metasize = (parqpaths + '.meta.json').apply(stat_if).div(1024**3).sum()
parqsize = (parqpaths + '.tokens.parquet').apply(stat_if).div(1024**3).sum()
jsonsize.round(2), metasize.round(2), parqsize.round(2), (metasize+parqsize).round(2)

(23.98, 0.14, 32.1, 32.24)

In [65]:
print("Parquet is larger by {}%".format(int(((metasize+parqsize)/jsonsize - 1)*100)))

Parquet is larger by 34%
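
The two totals can be read either as a ratio or as a percentage increase, and the two readings differ by 100 points. A minimal sketch using the rounded sizes from the output above (the function names are illustrative and not part of `htrc_features`):

```python
def size_ratio_percent(baseline, other):
    """Return `other` as a percentage of `baseline` (100 means equal size)."""
    return other / baseline * 100


def percent_larger(baseline, other):
    """Return how much larger `other` is than `baseline`, in percent."""
    return (other - baseline) / baseline * 100


# Rounded totals copied from the output above
json_total = 23.98
parquet_total = 0.14 + 32.1  # metadata files + token tables

print(int(size_ratio_percent(json_total, parquet_total)))  # 134
print(int(percent_larger(json_total, parquet_total)))      # 34
```

In other words, the Parquet representation is about 134% of the JSON size, which is roughly a third more disk space.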


## Performance on token loading

As you would expect, Parquet is much quicker: in the timings below, loading token lists for 1,000 volumes takes about 34 seconds of wall time, against just over 4 minutes for JSON. Some notes, though:

- The Parquet option reads not only the Parquet files but also their associated JSON metadata files. It's possible to save without the metadata, but it's small enough to keep.
- Of course it's quicker! Besides skipping JSON parsing and using faster decompression than BZIP2, the data has already been preprocessed and formatted as a table.

The point is that if you ever expect to read your files *more than once*, [converting your local Extracted Features collection to parquet](https://github.com/massivetexts/compare-tools/blob/master/scripts/convert-to-parquet.py) using the `Volume.save_parquet` function will save you a great deal of computing time. The conversion can also be front-loaded: it can run in the background while you're developing your project code, rather than at the end.
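
One way to front-load the conversion is a resumable batch loop that skips volumes already converted, so it can be stopped and restarted while other work goes on. This is only a sketch: `convert_missing` and `convert_one` are hypothetical names, and `convert_one` stands in for whatever wraps `Volume` and `Volume.save_parquet` in your setup (its arguments are not shown here). The demo uses a fake converter that simply copies the file.

```python
import os
import shutil
import tempfile


def convert_missing(pairs, convert_one):
    """Convert every (source, target) pair whose target does not exist yet.

    `convert_one(source, target)` is a caller-supplied placeholder for the
    real conversion step; it is not part of htrc_features.
    Returns the list of targets written during this run.
    """
    written = []
    for source, target in pairs:
        if os.path.exists(target):
            continue  # already converted on an earlier run
        if not os.path.exists(source):
            continue  # source volume is not available locally
        convert_one(source, target)
        written.append(target)
    return written


# Demo with a stand-in converter that only copies bytes
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "vol1.json.bz2")
    dst = os.path.join(tmp, "vol1.tokens.parquet")
    with open(src, "wb") as f:
        f.write(b"placeholder")

    first_run = convert_missing([(src, dst)], shutil.copyfile)
    second_run = convert_missing([(src, dst)], shutil.copyfile)

print(len(first_run), len(second_run))  # 1 0
```

Because finished targets are skipped, rerunning the loop after an interruption only does the remaining work.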

In [75]:
%%time
for path in jsonpaths.head(1000):
    vol = Volume(path, parser='json')
    tl = vol.tokenlist(pos=False)

CPU times: user 3min 53s, sys: 6.33 s, total: 3min 59s
Wall time: 4min 12s


In [76]:
%%time
for path in parqpaths.head(1000):
    vol = Volume(path, parser='parquet')
    tl = vol.tokenlist(pos=False)

CPU times: user 19.9 s, sys: 528 ms, total: 20.4 s
Wall time: 34.3 s
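
To put the two timings side by side, the speedup can be worked out directly from the `%%time` output above. A small sketch (the helper name `to_seconds` is illustrative; the numbers are copied from the two cells):

```python
def to_seconds(minutes=0, seconds=0.0):
    """Convert a minutes/seconds timing into seconds."""
    return minutes * 60 + seconds


# Wall and total CPU times for 1000 volumes, from the output above
json_wall = to_seconds(minutes=4, seconds=12)
parquet_wall = to_seconds(seconds=34.3)
json_cpu = to_seconds(minutes=3, seconds=59)
parquet_cpu = to_seconds(seconds=20.4)

wall_speedup = json_wall / parquet_wall
cpu_speedup = json_cpu / parquet_cpu

print(round(wall_speedup, 1))  # 7.3
print(round(cpu_speedup, 1))   # 11.7
```

On this sample, Parquet loads roughly 7 times faster in wall time and nearly 12 times faster in CPU time, or about 34 ms per volume instead of about 252 ms.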
