# File Formats

This section presents three data formats: Feather, Parquet and HDF. Several others exist, such as [Apache Avro](http://avro.apache.org/docs/current/) and [Apache ORC](https://orc.apache.org).

These other formats may be more appropriate in certain situations.
However, the software needed to handle them is either more difficult
to install, incomplete, or more difficult to use because less
documentation is provided. For ORC and Avro, the Python libraries
on offer are less well maintained than those for the formats covered here. Many can be found on
the web, but it is hard to know which one is the most stable:

- [pyorc](https://github.com/noirello/pyorc)
- [avro](https://avro.apache.org/docs/1.10.0/gettingstartedpython.html) and [fastavro](https://github.com/fastavro/fastavro)

The formats presented below are supported
by pandas and Apache Arrow, two projects backed by very strong communities.

## Feather

For light data, it is recommended to use [Feather](https://github.com/wesm/feather). It is a fast, interoperable data frame storage format that comes with bindings for Python and R.

Feather also uses the Apache Arrow columnar memory specification to represent binary data on disk. This makes read and write operations very fast.

## Parquet file format

[Parquet format](https://github.com/apache/parquet-format) is a common binary data store, used particularly in the Hadoop/big-data sphere.

The Apache Parquet project provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high performance data IO.



## Hierarchical Data Format

 [HDF](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) is a self-describing data format
allowing an application to interpret the structure and 
contents of a file with no outside information. 
One HDF file can hold a mix of related objects 
which can be accessed as a group or as individual objects. 
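The idea of one file holding several related objects can be sketched with `pandas.HDFStore`, which requires the PyTables package. The group name `run1` and the objects `measurements` and `calibration` are invented for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Two related objects that will live in the same HDF5 file.
measurements = pd.DataFrame({"value": np.linspace(0.0, 1.0, 5)})
calibration = pd.Series([1.0, 2.0, 3.0], name="gain")

with tempfile.TemporaryDirectory() as tmp:
    h5_path = os.path.join(tmp, "demo.h5")
    with pd.HDFStore(h5_path, mode="w") as store:
        # Keys behave like paths, so "run1" acts as a group.
        store.put("run1/measurements", measurements)
        store.put("run1/calibration", calibration)
        stored_keys = sorted(store.keys())
        # Objects can be read back individually by key.
        measurements_back = store["run1/measurements"]

print(stored_keys)
```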

Let's create a large dataframe with consistent data (floats) and 10% missing values:

In [None]:
import pandas as pd
import numpy as np

arr = np.random.randn(500000)
arr[::10] = np.nan  # every 10th value is missing: 10% nulls
df = pd.DataFrame({'column_{0}'.format(i): arr for i in range(10)})

In [None]:
%time df.to_csv('test.csv')

CPU times: user 9.31 s, sys: 378 ms, total: 9.69 s
Wall time: 9.93 s


In [None]:
%rm -f test.h5

In [None]:
%time df.to_hdf("test.h5", key="test")

CPU times: user 377 ms, sys: 2.15 s, total: 2.53 s
Wall time: 3.08 s


In [None]:
%time df.to_parquet('test.parquet')

CPU times: user 615 ms, sys: 373 ms, total: 987 ms
Wall time: 1.3 s


In [None]:
%time df.to_feather('test.feather')

CPU times: user 321 ms, sys: 180 ms, total: 502 ms
Wall time: 564 ms


In [None]:
%%bash
du -sh test.*

88M	test.csv
36M	test.feather
205M	test.h5
38M	test.parquet


In [None]:
%%time
df = pd.read_csv("test.csv")
len(df)

CPU times: user 1.32 s, sys: 829 ms, total: 2.15 s
Wall time: 2.16 s


500000

In [None]:
%%time
df = pd.read_hdf("test.h5")
len(df)

CPU times: user 337 ms, sys: 998 ms, total: 1.34 s
Wall time: 1.39 s


500000

In [None]:
%%time
df = pd.read_parquet("test.parquet")
len(df)

CPU times: user 373 ms, sys: 737 ms, total: 1.11 s
Wall time: 900 ms


500000

In [None]:
%%time
df = pd.read_feather("test.feather")
len(df)

CPU times: user 164 ms, sys: 579 ms, total: 742 ms
Wall time: 486 ms


500000

In [None]:
# Now we create a new big dataframe with a column of strings

In [None]:
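import numpy as np
import pandas as pd
from lorem import sentence

words = np.array(sentence().strip().lower().replace(".", " ").split())

# Set the seed so that the numbers can be reproduced.
np.random.seed(0)
n = 1000000
# np.c_ stacks numbers and words into a single array of strings,
# so every column of the dataframe ends up with object dtype.
df = pd.DataFrame(
    np.c_[
        np.random.randn(n, 5),
        np.random.randint(0, 10, (n, 2)),
        np.random.randint(0, 1, (n, 2)),  # upper bound is exclusive: always 0
        np.array([np.random.choice(words) for i in range(n)]),
    ],
    columns=list("ABCDEFGHIJ"),
)

# Use .loc instead of chained indexing to set every 10th value of A to NaN.
df.loc[::10, "A"] = np.nan
len(df)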


1000000

In [None]:
%%time
df.to_csv('test.csv', index=False)

CPU times: user 5.44 s, sys: 1.03 s, total: 6.48 s
Wall time: 6.78 s


In [None]:
%%time
df.to_hdf('test.h5', key="test", mode="w")

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')]

  encoding=encoding,
CPU times: user 5.57 s, sys: 4.7 s, total: 10.3 s
Wall time: 10.8 s


In [None]:
%%time
df.to_feather('test.feather')

CPU times: user 2.02 s, sys: 1.61 s, total: 3.64 s
Wall time: 3.66 s


In [None]:
%%time
df.to_parquet('test.parquet')

CPU times: user 2.63 s, sys: 1.35 s, total: 3.98 s
Wall time: 4.29 s


In [None]:
%%time 
df = pd.read_csv("test.csv")
len(df)

CPU times: user 2.54 s, sys: 6.22 s, total: 8.77 s
Wall time: 8.77 s


1000000

In [None]:
%%time 
df = pd.read_hdf("test.h5")
len(df)

CPU times: user 4.26 s, sys: 9.34 s, total: 13.6 s
Wall time: 13.6 s


1000000

In [None]:
%%time 
df = pd.read_feather('test.feather')
len(df)

CPU times: user 3.88 s, sys: 5.99 s, total: 9.87 s
Wall time: 9.1 s


1000000

In [None]:
%%time 
df = pd.read_parquet('test.parquet')
len(df)


CPU times: user 4.57 s, sys: 6.87 s, total: 11.4 s
Wall time: 10.1 s


1000000

In [None]:
df.head(10)

|   | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 0.4001572083672233 | 0.9787379841057392 | 2.240893199201458 | 1.8675579901499677 | 0 | 4 | 0 | 0 | quisquam |
| 1 | -0.977277879876411 | 0.9500884175255894 | -0.1513572082976979 | -0.1032188517935578 | 0.4105985019383723 | 5 | 5 | 0 | 0 | quisquam |
| 2 | 0.144043571160878 | 1.454273506962975 | 0.7610377251469934 | 0.1216750164928284 | 0.4438632327454256 | 6 | 1 | 0 | 0 | modi |
| 3 | 0.3336743273742668 | 1.494079073157606 | -0.2051582637658008 | 0.3130677016509013 | -0.8540957393017248 | 0 | 5 | 0 | 0 | eius |
| 4 | -2.5529898158340787 | 0.6536185954403606 | 0.8644361988595057 | -0.7421650204064419 | 2.269754623987608 | 6 | 7 | 0 | 0 | magnam |
| 5 | -1.4543656745987648 | 0.045758517301446 | -0.1871838500258336 | 1.5327792143584575 | 1.469358769900285 | 6 | 0 | 0 | 0 | velit |
| 6 | 0.1549474256969163 | 0.3781625196021735 | -0.8877857476301128 | -1.980796468223927 | -0.3479121493261526 | 8 | 0 | 0 | 0 | magnam |
| 7 | 0.15634896910398 | 1.2302906807277207 | 1.2023798487844113 | -0.3873268174079523 | -0.3023027505753355 | 5 | 5 | 0 | 0 | modi |
| 8 | -1.0485529650670926 | -1.4200179371789752 | -1.7062701906250126 | 1.9507753952317897 | -0.5096521817516535 | 7 | 5 | 0 | 0 | velit |
| 9 | -0.4380743016111864 | -1.2527953600499262 | 0.7774903558319101 | -1.6138978475579515 | -0.2127402802139687 | 2 | 0 | 0 | 0 | non |


In [None]:
df['J'] = pd.Categorical(df.J)

In [None]:
%time df.to_feather('test.feather')


CPU times: user 1.38 s, sys: 1.53 s, total: 2.91 s
Wall time: 3.11 s


In [None]:
%time df.to_parquet('test.parquet')

CPU times: user 1.96 s, sys: 1.31 s, total: 3.27 s
Wall time: 3.68 s


In [None]:
%%time 
df = pd.read_feather('test.feather')
len(df)

CPU times: user 3.23 s, sys: 4.41 s, total: 7.64 s
Wall time: 7 s


1000000

In [None]:
%%time 
df = pd.read_parquet('test.parquet')
len(df)

CPU times: user 4.34 s, sys: 7.19 s, total: 11.5 s
Wall time: 10.4 s


1000000
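Converting the string column to `pd.Categorical` stores each distinct word once, plus a small integer code per row. The sketch below only measures the in-memory effect of that conversion; the series `text_column` is a hypothetical stand-in for column `J`, built from words that appear in the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Words taken from the dataframe output above.
vocabulary = np.array(["quisquam", "modi", "eius", "magnam", "velit", "non"])

# A column of Python strings, similar in spirit to column J.
text_column = pd.Series(rng.choice(vocabulary, size=100_000), dtype=object)
as_category = text_column.astype("category")

# deep=True counts the memory of the string objects themselves.
object_bytes = text_column.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes / 1e6:.2f} MB")
print(f"category dtype: {category_bytes / 1e6:.2f} MB")
print(as_category.cat.categories.tolist())
```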

## Feather or Parquet

- Parquet is designed for long-term storage, whereas Arrow (and therefore Feather) is intended more for short-term or ephemeral storage, because its files are larger.
- Parquet is usually more expensive to write than Feather because it features more layers of encoding and compression.
- Feather version 1 is unmodified raw columnar Arrow memory. Feather version 2, the default in recent pyarrow releases, also supports optional LZ4 and ZSTD compression.
- Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files.
- Parquet is a standard storage format for analytics that is supported by Spark. If you are doing analytics, Parquet is a good option as a reference storage format that multiple systems can query.

[source stackoverflow](https://stackoverflow.com/questions/48083405/what-are-the-differences-between-feather-and-parquet)

## Apache Arrow

[Arrow](https://arrow.apache.org/docs/python/) is a columnar in-memory analytics layer designed to accelerate big data.
It houses a set of canonical in-memory representations of
hierarchical data along with multiple language bindings
for structure manipulation. Arrow offers a unified way to
share the same data representation among languages, and it is a strong candidate to become
the standard for storing dataframes across languages.

- [R package](https://cran.r-project.org/web/packages/arrow/index.html)
- [Julia package](https://github.com/JuliaData/Arrow.jl)
- [GitHub project](https://github.com/apache/arrow)

![](images/arrow_ecosystem.png)

Apache Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files. [PyArrow](https://arrow.apache.org/docs/python/) includes Python bindings to read and write Parquet files with pandas.

The Parquet format provides several advantages relevant to big-data processing:

- columnar storage, only read the data of interest
- efficient binary packing
- choice of compression algorithms and encoding
- split data into files, allowing for parallel processing
- range of logical types
- statistics stored in metadata allow for skipping unneeded chunks
- data partitioning using the directory structure

![arrow](images/arrow.png)

- https://arrow.apache.org/docs/python/csv.html
- https://arrow.apache.org/docs/python/feather.html
- https://arrow.apache.org/docs/python/parquet.html


Example:
```py
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

arr = np.random.randn(500000)
arr[::10] = np.nan  # 10% nulls
df = pd.DataFrame({'column_{0}'.format(i): arr for i in range(10)})

# Legacy API: deprecated since pyarrow 2.0 in favour of pyarrow.fs.HadoopFileSystem
hdfs = pa.hdfs.connect()
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path="test", filesystem=hdfs)
hdfs.ls("test")
```
### Read CSV from HDFS

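Put a CSV file (for example `test.csv`) on the HDFS file system. The snippet below reads the first ten rows of a file already stored there: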

```python
# `hdfs` is the connection opened with pa.hdfs.connect() above
with hdfs.open("/data/nycflights/1999.csv", "rb") as f:
    df = pd.read_csv(f, nrows=10)
print(df.head())
```

### Read Parquet File from HDFS with pandas

```python
import pandas as pd
wikipedia = pd.read_parquet("hdfs://svmass2.mass.uhb.fr:54310/data/pagecounts-parquet/part-00007-8575060f-6b57-45ea-bf1d-cd77b6141f70.snappy.parquet", engine="pyarrow")
print(wikipedia.head())
```
### Read Parquet File with pyarrow

```py
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")
```

### Writing a parquet file from Apache Arrow
```py
pq.write_table(table, "example.parquet")
```

### Check metadata
```py
parquet_file = pq.ParquetFile("example.parquet")
print(parquet_file.metadata)
```

### See schema
```py
print(parquet_file.schema)
```

### Connect to the Hadoop file system

```py
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Legacy API: deprecated since pyarrow 2.0 in favour of pyarrow.fs.HadoopFileSystem
hdfs = pa.hdfs.connect()

# copy a file from hdfs to the local disk
with hdfs.open("user.txt", "rb") as f:
    f.download("user.txt")

# upload a local parquet file to hdfs
with open("example.parquet", "rb") as f:
    hdfs.upload("example.parquet", f)

# List files
for f in hdfs.ls("/user/navaro_p"):
    print(f)

# create a small dataframe and write it to hadoop file system
small_df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
table = pa.Table.from_pandas(small_df)
pq.write_table(table, "small_df.parquet", filesystem=hdfs)

# Read files from Hadoop with pandas
with hdfs.open("/data/irmar.csv") as f:
    df = pd.read_csv(f)

print(df.head())

# Read parquet file from Hadoop with pandas
server = "hdfs://svmass2.mass.uhb.fr:54310"
path = "data/pagecounts-parquet/part-00007-8575060f-6b57-45ea-bf1d-cd77b6141f70.snappy.parquet"
url = f"{server}/{path}"  # build the URL explicitly rather than with os.path.join
pagecount = pd.read_parquet(url, engine="pyarrow")
print(pagecount.head())

# Read parquet file from Hadoop with pyarrow
table = pq.read_table(url)
print(table.schema)
df = table.to_pandas()
print(df.head())
```

### Exercise

- Take the second dataframe, with strings in the last column
- Create an Arrow table from the pandas dataframe
- Write the file test.parquet from arrow table
- Print metadata from this parquet file
- Print schema
- Upload the file to hadoop file system
- Read this file from hadoop file system and print dataframe head


Hint: check the doc https://arrow.apache.org/docs/python/parquet.html

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
arr = np.random.randn(500000)
arr[::10] = np.nan  # 10% nulls
df = pd.DataFrame({'column_{0}'.format(i): arr for i in range(10)})

# hdfs = pa.hdfs.connect()
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path="test")

In [None]:
%%bash

ls test

f5bb48bcb26749878b42a265b4716fca.parquet


```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from lorem import sentence


print("""
1 Create the dataframe with strings in the last column
""")

words = np.array(sentence().strip().lower().replace(".", " ").split())

np.random.seed(0)
n = 1000000
df = pd.DataFrame(np.c_[np.random.randn(n, 5),
                        np.random.randint(0, 10, (n, 2)),
                        np.random.randint(0, 1, (n, 2)),
                        np.array([np.random.choice(words) for i in range(n)])],
                  columns=list('ABCDEFGHIJ'))

# Use .loc instead of chained indexing to insert the missing values
df.loc[::10, "A"] = np.nan
print(len(df))

print("""
2 Create the Arrow table
""")

table = pa.Table.from_pandas(df)

print("""
3 Write the file test.parquet from the Arrow table
""")

pq.write_table(table, "test.parquet")

print("""
4 Print the metadata
""")

parquet_file = pq.ParquetFile("test.parquet")
print(parquet_file.metadata)
print(" Another way to do it ")
print(pq.read_metadata("test.parquet"))

print("""
5 Print the schema
""")

print(parquet_file.schema)

print("""
6 Upload the parquet file to the hadoop file system (warning: this is slow)
""")

# Legacy API: deprecated since pyarrow 2.0 in favour of pyarrow.fs.HadoopFileSystem
hdfs = pa.hdfs.connect()

with open("test.parquet", "rb") as f:
    hdfs.upload("test.parquet", f)

for f in hdfs.ls("/user/navaro_p"):
    print(f)

# Read the file back from the hadoop file system and print the dataframe head
server = "hdfs://svmass2.mass.uhb.fr:54310"
path = "user/navaro_p/test.parquet"
table = pq.read_table(f"{server}/{path}")
print(table.schema)
df = table.to_pandas()
print(df.head())
```