# Convert CSV to a Better Format

## Imports

In [2]:
# ipython config
%matplotlib inline
%reload_ext autoreload
%autoreload 2

# for Jupyter
from IPython.display import display

# for Fastai and PyTorch
from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)

# path to data
PATH='data/'



## Read from CSV

In [None]:
# CSV

table_names = ['train', 'holidays_events', 'items', 'oil', 'stores', 'transactions', 'test']

tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=True) for fname in table_names]

train, holidays_events, items, oil, stores, transactions, test = tables

print((len(train), len(test)))

## Save to Feather

Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. This initial version comes with bindings for Python (written by Wes McKinney) and R (written by Hadley Wickham).

Feather uses the Apache Arrow columnar memory specification to represent binary data on disk. This makes read and write operations very fast. This is particularly important for encoding null/NA values and variable-length types like UTF-8 strings.

Feather is a part of the broader Apache Arrow project. Feather defines its own simplified schemas and metadata for on-disk representation.

Feather currently supports the following column types:

+ A wide range of numeric types (int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double).
+ Logical/boolean values.
+ Dates, times, and timestamps.
+ Factors/categorical variables that have a fixed set of possible values.
+ UTF-8 encoded strings.
+ Arbitrary binary data.

All column types support NA/null values.

https://github.com/wesm/feather

In [None]:
for name, table in zip(table_names, tables):
    table.to_feather(f'{PATH}{name}.feather')
    print(f'Table "{name}" was saved')

## Load from Feather

In [None]:
# Feather

table_names = ['train', 'holidays_events', 'items', 'oil', 'stores', 'transactions', 'test']

tables = [pd.read_feather(f'{PATH}{fname}.feather') for fname in table_names]

train, holidays_events, items, oil, stores, transactions, test = tables

print((len(train), len(test)))

## Save to HDF5

In [None]:
for name, table in zip(table_names, tables):
    table.to_hdf(f'{PATH}{name}.h5', key='df')
    print(f'Table "{name}" was saved')

In [3]:
!ls -lh {PATH}

total 35348416
-rw-r--r--  1 ilirium  staff    22K Oct 19  2017 holidays_events.csv
-rw-r--r--  1 ilirium  staff    26K Sep  3 13:31 holidays_events.feather
-rw-r--r--  1 ilirium  staff   1.0M Sep  3 15:02 holidays_events.h5
-rw-r--r--  1 ilirium  staff    99K Oct 19  2017 items.csv
-rw-r--r--  1 ilirium  staff   149K Sep  3 13:31 items.feather
-rw-r--r--  1 ilirium  staff   1.2M Sep  3 15:02 items.h5
-rw-r--r--  1 ilirium  staff    20K Oct 19  2017 oil.csv
-rw-r--r--  1 ilirium  staff    27K Sep  3 13:31 oil.feather
-rw-r--r--  1 ilirium  staff   1.0M Sep  3 15:02 oil.h5
-rw-r--r--  1 ilirium  staff    39M Oct 19  2017 sample_submission.csv
-rw-r--r--  1 ilirium  staff   1.4K Oct 19  2017 stores.csv
-rw-r--r--  1 ilirium  staff   2.9K Sep  3 13:31 stores.feather
-rw-r--r--  1 ilirium  staff   1.0M Sep  3 15:02 stores.h5
-rw-r--r--  1 ilirium  staff   120M Oct 19  2017 test.csv
-rw-r--r--  1 ilirium  staff   123M Sep  3 13:31 test.feather
-rw-r--r--  1 ilirium  staff   

## Load from HDF5

In [None]:
# HDF5

table_names = ['train', 'holidays_events', 'items', 'oil', 'stores', 'transactions', 'test']

tables = [pd.read_hdf(f'{PATH}{fname}.h5') for fname in table_names]

train, holidays_events, items, oil, stores, transactions, test = tables

print((len(train), len(test)))

## Save to Parquet

+ Parquet format is designed for long-term storage, whereas Arrow is intended more for short-term or ephemeral storage. (Arrow may be more suitable for long-term storage after the 1.0.0 release, since the binary format will be stable then.)

+ Parquet is more expensive to write than Feather, as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. According to the answer linked below, simple compression will probably be added to Feather in the future.

+ Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files.

+ Parquet is a standard storage format for analytics that is supported by many different systems: Spark, Hive, Impala, various AWS services, and, in the future, BigQuery. So if you are doing analytics, Parquet is a good option as a reference storage format that multiple systems can query.

https://stackoverflow.com/questions/48083405/what-are-the-differences-between-feather-and-parquet

## Results

Approximate time to read all files from each format:

+ CSV ~ `1 min 35 sec`
+ Feather ~ `0 min 35 sec`
+ HDF5 ~ `1 min 32 sec`
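The notebook does not show how these times were taken. One way to reproduce such measurements is a small timing wrapper like the sketch below; the `timed` helper and the `fake_loader` stand-in are hypothetical names introduced here so the example runs without the data files.

```python
import time


def timed(loader, *args, **kwargs):
    """Call loader(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = loader(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in loader so the sketch is self-contained. In the notebook the call
# would instead wrap something like pd.read_feather(f'{PATH}{fname}.feather').
def fake_loader(n):
    return list(range(n))


rows, seconds = timed(fake_loader, 1000)
print(f'loaded {len(rows)} rows in {seconds:.4f} sec')
```

Inside Jupyter, the `%%time` cell magic gives a similar wall-clock figure without any helper code.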

`total 35348416
-rw-r--r--  1 ilirium  staff    22K Oct 19  2017 holidays_events.csv
-rw-r--r--  1 ilirium  staff    26K Sep  3 13:31 holidays_events.feather
-rw-r--r--  1 ilirium  staff   1.0M Sep  3 15:02 holidays_events.h5
-rw-r--r--  1 ilirium  staff    99K Oct 19  2017 items.csv
-rw-r--r--  1 ilirium  staff   149K Sep  3 13:31 items.feather
-rw-r--r--  1 ilirium  staff   1.2M Sep  3 15:02 items.h5
-rw-r--r--  1 ilirium  staff    20K Oct 19  2017 oil.csv
-rw-r--r--  1 ilirium  staff    27K Sep  3 13:31 oil.feather
-rw-r--r--  1 ilirium  staff   1.0M Sep  3 15:02 oil.h5
-rw-r--r--  1 ilirium  staff    39M Oct 19  2017 sample_submission.csv
-rw-r--r--  1 ilirium  staff   1.4K Oct 19  2017 stores.csv
-rw-r--r--  1 ilirium  staff   2.9K Sep  3 13:31 stores.feather
-rw-r--r--  1 ilirium  staff   1.0M Sep  3 15:02 stores.h5
-rw-r--r--  1 ilirium  staff   120M Oct 19  2017 test.csv
-rw-r--r--  1 ilirium  staff   123M Sep  3 13:31 test.feather
-rw-r--r--  1 ilirium  staff   149M Sep  3 15:02 test.h5
-rw-r--r--  1 ilirium  staff   4.7G Oct 19  2017 train.csv
-rw-r--r--  1 ilirium  staff   5.4G Sep  3 13:31 train.feather
-rw-r--r--  1 ilirium  staff   6.3G Sep  3 15:02 train.h5
-rw-r--r--  1 ilirium  staff   1.5M Oct 19  2017 transactions.csv
-rw-r--r--  1 ilirium  staff   2.4M Sep  3 13:31 transactions.feather
-rw-r--r--  1 ilirium  staff   4.0M Sep  3 15:02 transactions.h5
`