In [1]:
# Benchmarking binary storage strategies.

# Feather format

Feather is a cross-platform binary table serialization format. It comes out of the Apache Arrow project, whose main goal is
a multi-language format for representing **in-memory** data; feather is a format for writing that data to disk.

In this regard it is similar to parquet, which the feature reader already supports. But feather is a little more closely tied to the internals of
some major analytics platforms widely used in the digital humanities. One of its leading developers is Wes McKinney, the author of pandas,
who has indicated that feather might form the backbone of pandas 2.0. His work on it has been supported by RStudio, the central locus for `tidyverse` R.

McKinney [benchmarked feather vs parquet performance here](https://ursalabs.org/blog/2019-10-columnar-perf/). This notebook shows how to use feather inside the feature reader, and why you might want to.

## Resolvers

We'll define a few different resolvers that use different parsing formats.
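
The idea behind chaining resolvers can be sketched with the standard library alone. This is not the `htrc_features` implementation: `DictStore` and `fetch_through_chain` are hypothetical names, and the fastest-first lookup order is an assumption, since this notebook does not show how `make_resolver_chain` consults its entries. The sketch only illustrates the caching behavior the benchmark relies on: a volume fetched once from a slow source is afterwards served from a faster local store.

```python
class DictStore:
    """Hypothetical in-memory store standing in for one resolver."""

    def __init__(self, name, contents=None):
        self.name = name
        self.contents = dict(contents or {})

    def get(self, key):
        return self.contents.get(key)

    def put(self, key, value):
        self.contents[key] = value


def fetch_through_chain(key, stores):
    """Try stores from fastest to slowest; copy a hit into every faster store."""
    for position, store in enumerate(stores):
        value = store.get(key)
        if value is not None:
            for faster in stores[:position]:
                faster.put(key, value)
            return value, store.name
    raise KeyError(key)


remote = DictStore("remote", {"vol-1": "token counts for vol-1"})
local_cache = DictStore("local cache")
stores = [local_cache, remote]

first = fetch_through_chain("vol-1", stores)   # served by the remote store
second = fetch_through_chain("vol-1", stores)  # now served by the local cache
print(first[1], "then", second[1])
```

This is why the benchmark below does one untimed pass first: the timed pass should measure only reads from the local store.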



In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import htrc_features
import htrc_features.resolvers
from htrc_features import Volume, resolvers, FeatureReader, caching
import os
import pandas as pd




In [4]:
import tempfile

json_dir = tempfile.mkdtemp()


In [75]:
# Whatever the resolver, we want to cache the json files locally from the web first.
general_args = [
    {"method": "http"},
    {"method": "stubbytree", "format": "json", "compression": "bz2", "dir": "/drobo/hathi-ef"},
    {"method": "stubbytree", "format": "json", "compression": "bz2", "dir": json_dir + "/json"},
]

my_resolvers = []
names = [("json", "bz2"), ("json", None), ("feather", "zstd"), ("feather", "lz4"), 
         ("feather", "gz"), ("feather", None)]

for compression in ["snappy", "gzip",  "brotli", "lz4", "zstd"]:
    names.append(("parquet", compression))
    
for fmt, compression in names:
    # Each chain ends with a store dedicated to one format/compression pair.
    local_args = general_args + [{
        "method": "stubbytree",
        "format": fmt,
        "compression": compression,
        "dir": f"{json_dir}/{fmt}_{compression}",
    }]
    my_resolvers.append(resolvers.make_resolver_chain(local_args))


In [72]:
# Some htids selected at random from the full list. (Literally, at random.)
sample = ['mdp.39015020742493', 'uiuo.ark:/13960/t3ws8n13d', 'hvd.hn1jew',
       'nyp.33433074850094', 'uva.x000868401', 'uc1.b4424441',
       'hvd.hwkqkk', 'mdp.39015061341361', 'uva.x030689728', 'hvd.hndc7d',
       'osu.32435023708159', 'mdp.39015004750587', 'mdp.39015031405163',
       'nyp.33433006916369', 'njp.32101068782760',
       'uc2.ark:/13960/t12n50h0v', 'hvd.32044058128661',
       'uc1.31175035532533', 'mdp.39015076449795', 'mdp.39015002692997',
          "aeu.ark:/13960/t40s02d4z",
"coo1.ark:/13960/t1wd4f728",
"hvd.32044103205969",
"hvd.32044079685467",
"inu.30000026298186",
"inu.30000088225523",
"mdp.35112103114049",
"mdp.39015007884326",
"mdp.39015011889683",
"mdp.39015038809037",
"mdp.39015041722870",
"mdp.39015047580512",
"mdp.39015055379757",
"mdp.39015067028814",
"mdp.39015065679964",
"mdp.39015078414359",
"mdp.39015086730945",
"txu.059173004447483",
"uc1.32106020347115",
"uc1.31822033837402",
"uc1.a0002741510",
"uc1.b4530124",
"uc1.b3068723",
"uc1.b3961655",
"uc1.b3176404",
"uc1.l0064461494",
"uiug.30112073642933",
"uiuo.ark:/13960/t44r4jm98",
"umn.31951002124137v",
"uva.x000072908",
"uva.x030165640",
"uva.x001918364",
]

In [73]:
# First run to create the files. Don't bother counting the time.
for resolver in my_resolvers:
    for htid in sample:
        v = Volume(htid, id_resolver=resolver)


# Relative Time

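JSON is extremely slow compared to the parquet and feather formats: in the combined table at the end of this notebook, bzipped JSON takes about 4.8 times as long to load as snappy parquet.

Gzipped feather is much slower than any other binary format (about 2.4 times as slow as snappy parquet in the run below), probably because it involves a weird shim.

Among the rest, the fastest in some runs is feather with lz4 or uncompressed feather, and in others parquet with lz4 or zstd. The differences are only a few percent and vary from run to run.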
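
Because the timings vary from run to run, a single pass can rank near-identical formats differently each time. One way to steady the comparison, sketched here with only the standard library, is to repeat each measurement and report the median. The helper name `time_repeated` and the stand-in workload are hypothetical; in this notebook the workload would be the loop over `sample` for one resolver.

```python
import statistics
import time


def time_repeated(task, repeats=5):
    """Run task() several times and return (median, min, max) in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), min(durations), max(durations)


def stand_in_workload():
    # Hypothetical placeholder for "load every volume in the sample".
    return sum(i * i for i in range(50_000))


median_s, fastest_s, slowest_s = time_repeated(stand_in_workload, repeats=5)
print(f"median {median_s:.4f}s (min {fastest_s:.4f}s, max {slowest_s:.4f}s)")
```

If the spread between the minimum and maximum is wider than the gap between two formats, the ranking between them should probably not be trusted.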

In [147]:
# Then reload, counting the time.
import time

times = []
for (fmt, compression), resolver in zip(names, my_resolvers):
    # JSON is skipped in this timing pass.
    if fmt == "json":
        continue
    start = time.perf_counter()
    for htid in sample:
        v = Volume(htid, id_resolver=resolver)
        _ = v.parser._make_tokencount_df(indexed=True)
    # Measure once so the stored and printed values agree.
    elapsed = time.perf_counter() - start
    times.append((elapsed, fmt, str(compression)))
    print(f"{fmt} with {compression} takes {elapsed}")

feather with zstd takes 3.5215890407562256
feather with lz4 takes 3.5317656993865967
feather with gz takes 8.139514923095703
feather with None takes 3.3880927562713623
parquet with snappy takes 3.3624908924102783
parquet with gzip takes 3.389672040939331
parquet with brotli takes 3.441004991531372
parquet with lz4 takes 3.3155200481414795
parquet with zstd takes 3.3089661598205566


# Relative Size

Compression can get these files pretty small, but nothing beats bzip2-compressed JSON, which comes to about 64% of the size of snappy parquet.

Several parquet codecs (brotli, gzip, and zstd) substantially outperform 'snappy',
the parquet default, producing files roughly 18 to 28% smaller.
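
The size comparison is just bytes on disk relative to a baseline. Here is a small standard-library sketch of the same bookkeeping, using codecs that ship with Python (zlib, bz2, lzma) rather than the parquet and feather codecs benchmarked here. The payload is made-up JSON loosely shaped like token counts, so the ratios it prints say nothing about the real files.

```python
import bz2
import json
import lzma
import zlib

# Made-up payload loosely shaped like per-page token counts (an assumption,
# not the real Extracted Features schema).
pages = [
    {"page": n, "tokens": {"the": 10 + n % 7, "of": 5 + n % 3, "and": 4 + n % 5}}
    for n in range(500)
]
raw = json.dumps(pages).encode("utf-8")

compressed_sizes = {
    "zlib": len(zlib.compress(raw)),
    "bz2": len(bz2.compress(raw)),
    "lzma": len(lzma.compress(raw)),
}

# Express every size relative to one baseline codec, as rel_size does
# relative to snappy parquet.
baseline = compressed_sizes["zlib"]
for codec, size in sorted(compressed_sizes.items(), key=lambda kv: kv[1]):
    print(f"{codec}: {size} bytes, rel_size {size / baseline:.2f}")
```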

In [148]:
from pathlib import Path

counts = []
for p in Path(json_dir).glob("*/*/*/*"):
    # The top-level directory under json_dir is named "{format}_{compression}".
    top = p.parent.parent.parent.name
    try:
        fmt, compression = top.split("_")
    except ValueError:
        # The directory name does not split into exactly two parts (e.g. "json").
        fmt, compression = top, "None"
    name = p.name.replace("." + fmt, "").replace("." + compression, "")
    counts.append((p.stat().st_size, name, fmt, str(compression)))

count_df = pd.DataFrame(counts, columns=["bytes", "name", "format", "compression"])
grouped = count_df.groupby(["format", "compression"])[["bytes"]]
sizes = grouped.sum()
number_of_files = grouped.count()

# Size relative to snappy-compressed parquet, the parquet default.
sizes["rel_size"] = sizes["bytes"] / sizes.loc[("parquet", "snappy"), "bytes"]


In [149]:
time_df = pd.DataFrame(times, columns = ['time', 'format', 'compression'])
time_df = time_df.set_index(["format", "compression"])
time_df['rel_time'] = time_df.time/time_df.loc["parquet", "snappy"]['time'].max()
time_df

| format | compression | time | rel_time |
| --- | --- | --- | --- |
| feather | zstd | 3.52158 | 1.047315 |
| feather | lz4 | 3.531757 | 1.050342 |
| feather | gz | 8.139508 | 2.420683 |
| feather | None | 3.388079 | 1.007612 |
| parquet | snappy | 3.362484 | 1.0 |
| parquet | gzip | 3.389665 | 1.008084 |
| parquet | brotli | 3.440996 | 1.023349 |
| parquet | lz4 | 3.315513 | 0.986031 |
| parquet | zstd | 3.308959 | 0.984082 |


In [150]:
together = time_df.join(sizes).reset_index()
together['efficiency'] = (together['rel_time'] + together['rel_size'])/2

# Size and Time together

The combined efficiency metric is the mean of relative load time and relative size, both measured against snappy parquet, so lower is better.
On that metric the best options are parquet with brotli or zstd. Both compress much better than snappy (to about 72% and 82% of its size)
and, at least in this run, load slightly faster.

Feather with zstd is comparable: it compresses down to about 84% of the size of snappy parquet and loads in about the same time.

Feather with lz4 is fast to load but large (about 26% larger than snappy parquet).
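
The efficiency score can be reproduced by hand from the relative time and size columns. The sketch below copies four rows from the combined table in this notebook and recomputes the plain average. The `time_weight` parameter is a hypothetical extension, not something the notebook uses, for cases where load time matters more than disk space or vice versa.

```python
def efficiency(rel_time, rel_size, time_weight=0.5):
    """Weighted mean of relative load time and relative size (lower is better).

    time_weight=0.5 reproduces the plain average used in this notebook; other
    weights are a hypothetical extension.
    """
    return time_weight * rel_time + (1 - time_weight) * rel_size


# (rel_time, rel_size) pairs copied from the combined table in this notebook.
measured = {
    ("parquet", "brotli"): (0.983391, 0.722969),
    ("parquet", "zstd"): (0.964535, 0.819727),
    ("feather", "zstd"): (0.992723, 0.835495),
    ("feather", "lz4"): (0.942579, 1.255741),
}

ranked = sorted(measured, key=lambda key: efficiency(*measured[key]))
for key in ranked:
    print(key, round(efficiency(*measured[key]), 6))
```

The ranking is sensitive to the weighting: with `time_weight=0.9`, for example, parquet with zstd edges ahead of parquet with brotli on these numbers.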

## The takeaway

Use parquet with brotli or zstd if you are willing to take the extra files.

Use feather with zstd if you want just one file.

### A note

If you want the files to load more than twice as fast, you 
can call `vol.parser._make_tokencount_df(indexed = False)`
if using the feather format. Parquet could also probably
share this optimization if we stored unindexed frames on 
disk.
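
Presumably the saving comes from skipping the index-building step: a flat table can be handed back as stored, while an indexed frame needs a MultiIndex built on every load. Here is a minimal pandas sketch of that extra step on tiny made-up data. The column names are assumptions chosen for illustration and are not read from a real volume.

```python
import pandas as pd

# Made-up flat token counts; column names are assumed for illustration.
flat = pd.DataFrame(
    {
        "page": [1, 1, 2, 2],
        "section": ["body", "body", "body", "body"],
        "token": ["the", "whale", "the", "sea"],
        "pos": ["DT", "NN", "DT", "NN"],
        "count": [3, 1, 2, 1],
    }
)

# The indexed form requires building a MultiIndex from four columns.
indexed = flat.set_index(["page", "section", "token", "pos"])

# Many aggregations work directly on the flat frame, so the index can be skipped.
per_token = flat.groupby("token")["count"].sum()
print(per_token)
```

If the downstream analysis is a groupby like this one, the unindexed frame gives the same answer without paying for the index.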



In [123]:
together[["format", "compression", "rel_time", "rel_size", "efficiency"]].sort_values("efficiency")

|  | format | compression | rel_time | rel_size | efficiency |
| --- | --- | --- | --- | --- | --- |
| 8 | parquet | brotli | 0.983391 | 0.722969 | 0.85318 |
| 10 | parquet | zstd | 0.964535 | 0.819727 | 0.892131 |
| 7 | parquet | gzip | 1.035055 | 0.755399 | 0.895227 |
| 2 | feather | zstd | 0.992723 | 0.835495 | 0.914109 |
| 9 | parquet | lz4 | 0.951349 | 1.007186 | 0.979267 |
| 6 | parquet | snappy | 1.0 | 1.0 | 1.0 |
| 3 | feather | lz4 | 0.942579 | 1.255741 | 1.09916 |
| 4 | feather | gz | 2.096192 | 0.700786 | 1.398489 |
| 5 | feather | None | 0.945929 | 3.029982 | 1.987956 |
| 0 | json | bz2 | 4.813219 | 0.643002 | 2.728111 |
