# What we are going to do
In this tutorial, I will show you how to use ezHDF to save two csv files (data1.csv and data2.csv in this folder) in HDF format.

In [1]:
import sys; sys.path.insert(0,'/Users/shutingpi/Dropbox/ezHDF')
from ezHDF.ezHDF import ezHDF
import pandas as pd



# use pandas to create a chunk reader of the csv files
Assuming we are going to read large files (much, much larger than your RAM), it is impossible to load the data all at once. What you can do is read one chunk of data at a time. This is easily achieved in pandas with the parameter "chunksize".

In [2]:
# create chunk reader object
wkdir = '/Users/shutingpi/Dropbox/ezHDF/Example'
# on_bad_lines='skip' silently skips malformed rows (pandas >= 1.3);
# older pandas versions use error_bad_lines=False, warn_bad_lines=False instead
reader1 = pd.read_csv(wkdir+'/data1.csv', sep = ',', engine = 'c', 
                    on_bad_lines='skip', index_col = False,
                    chunksize= 100)

reader2 = pd.read_csv(wkdir+'/data2.csv', sep = ',', engine = 'c', 
                    on_bad_lines='skip', index_col = False,
                    chunksize= 100)
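
To see what a chunk reader does without needing the real files, here is a minimal sketch that reads a small in-memory CSV instead. The CSV content and the names `csv_text` and `chunk_sizes` are made up for illustration; only `pd.read_csv` with `chunksize` is taken from the cell above.

```python
import io

import pandas as pd

# Build a small in-memory CSV (250 data rows) to stand in for a large file.
csv_text = "id,value\n" + "".join(f"{i},{i * 0.5}\n" for i in range(250))

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=100)

# Each iteration yields at most 100 rows; the last chunk holds the remainder.
chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)
```

Note that a chunk reader can only be iterated once. If you need to go through the file again, create a new reader.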

# create an ezHDF.hdf_store object
OK, now let's define an hdf_store object. It is the main object for handling your data. The mode can be 'w', 'r', or 'a', as for an [h5py file object](http://docs.h5py.org/en/latest/high/file.html).

In [3]:
# create hdf_store object
store = ezHDF(wkdir = wkdir, hdf_name = 'my_hdf.h5', mode = 'w')

# set parameters for the dataset
To use ezHDF, we have to manually input the name and data type of each column.

When assigning the data type, use 's' for string, 'i' for integer and 'f' for float. You may also want to input the working directory so that ezHDF can work in the correct folder.

**Note that the column names don't necessarily have to match those in your csv file; they can be arbitrary names. However, the data types must be consistent with your raw data.**

In [4]:
# define some parameters
col_name1 = ['str0','int1','float2','float3','str4','str5','str6']
dtype1 = ['s','i','f','f','s','s','s']
col_name2 = ['str0','int1','str2','str3','float4','float5','str6']
dtype2 = ['s','i','s','s','f','f','s']
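
Since the data types must match the raw data, it can help to check a few rows before declaring them. The sketch below is one possible way to do that with the standard library only. The helper `infer_code` and the sample CSV are hypothetical and not part of ezHDF.

```python
import csv
import io


def infer_code(values):
    """Guess 'i', 'f', or 's' for a column of raw CSV strings (hypothetical helper)."""
    for code, cast in (("i", int), ("f", float)):
        try:
            for v in values:
                cast(v)
            return code  # every value parsed with this type
        except ValueError:
            continue  # at least one value failed, try the next type
    return "s"  # fall back to string


# A tiny made-up CSV sample: one string, one integer, and one float column.
sample = io.StringIO("name,count,price\napple,3,1.5\npear,10,2\n")
rows = list(csv.reader(sample))
header, body = rows[0], rows[1:]

# Transpose the rows into columns and infer a code for each column.
inferred = [infer_code(col) for col in zip(*body)]
declared = ["s", "i", "f"]
print(inferred == declared)
```

This only looks at the rows you give it, so a column that turns into text further down the file would still be guessed as numeric.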

# create new datasets in the object to store your data
When you create a new dataset, you also need to provide a parameter called "container_size". Every dataset in ezHDF comes with a "size", i.e. the number of rows it can hold. In the beginning, all the rows are empty. If you append data that fits within the container_size, ezHDF doesn't need to request more space from the disk. If appending new data makes the container_size insufficient, ezHDF will **"automatically" (yes, you don't have to do it manually)** request more space from the disk to increase the container_size, which will make your code slower. Therefore, it is suggested to set an initial container_size slightly larger than your data.

Since we have 10K rows in data1.csv and 20K rows in data2.csv, we should set an initial container_size equal to or slightly larger than that.

In [5]:
store.new_dataset(ds_name = 'data1', container_size = 12000, column_names = col_name1, column_dtype = dtype1)
store.new_dataset(ds_name = 'data2', container_size = 22000, column_names = col_name2, column_dtype = dtype2)
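
To get a feeling for why the initial container_size matters, here is a toy model in plain Python. It is not ezHDF's actual implementation: the class `ToyContainer`, the function `fill`, and the "grow just enough" policy are all assumptions made for illustration. It simply counts how often a container has to grow while 10K rows arrive in chunks of 100.

```python
class ToyContainer:
    """Toy model of a preallocated dataset; not ezHDF's actual implementation."""

    def __init__(self, container_size):
        self.n_container = container_size  # rows reserved up front
        self.n_rows = 0                    # rows actually filled
        self.n_resizes = 0                 # how many times we had to grow

    def append(self, n_new):
        needed = self.n_rows + n_new
        if needed > self.n_container:
            # Assumed policy: grow just enough to fit the new rows.
            self.n_container = needed
            self.n_resizes += 1
        self.n_rows = needed


def fill(container_size, total_rows=10000, chunk=100):
    """Append total_rows rows in chunks and return the filled container."""
    container = ToyContainer(container_size)
    for _ in range(total_rows // chunk):
        container.append(chunk)
    return container


small = fill(container_size=1000)    # too small: grows on most appends
large = fill(container_size=12000)   # large enough: never grows
print(small.n_resizes, large.n_resizes)
```

In this model, the container that starts too small has to grow 90 times, while the one that starts at 12000 rows never grows at all.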

# check info about the object
You can get basic information about the hdf_store object. Here n_rows means how many rows of data have been stored, and n_container means the size of the container. Since we have not put any data in the datasets yet, n_rows = 0.

In [6]:
# show object information
store.info()


--- ezHDF hdf_store info ---

dataset name: data1
column names:
   ['str0', 'int1', 'float2', 'float3', 'str4', 'str5', 'str6']
column dtype:[ s , i , f , f , s , s , s ]
n_rows: 0
n_container: 12000

dataset name: data2
column names:
   ['str0', 'int1', 'str2', 'str3', 'float4', 'float5', 'str6']
column dtype:[ s , i , s , s , f , f , s ]
n_rows: 0
n_container: 22000



# resize the container
You can also change the size of the container if you are not satisfied with the current container size.

In [7]:
# resize container size
store.resize('data1', 15000)
store.resize('data2', 25000)

dataset (data1) now resized to (15000) rows
dataset (data2) now resized to (25000) rows


# check information again
You can see that the container sizes are now 15K and 25K respectively.

In [8]:
# show object information
store.info()


--- ezHDF hdf_store info ---

dataset name: data1
column names:
   ['str0', 'int1', 'float2', 'float3', 'str4', 'str5', 'str6']
column dtype:[ s , i , f , f , s , s , s ]
n_rows: 0
n_container: 15000

dataset name: data2
column names:
   ['str0', 'int1', 'str2', 'str3', 'float4', 'float5', 'str6']
column dtype:[ s , i , s , s , f , f , s ]
n_rows: 0
n_container: 25000



# put data into each dataset
You can use the append method of the hdf_store object (store.append in the cell below) to put your data into a dataset. Note that if your pandas data frame carries the row index as a separate column, you have to drop it before appending to the dataset.

In [9]:
# put data1 into dataset data1
for chunk in reader1:
    chunk = chunk.drop(chunk.columns[0], axis = 1)
    store.append(ds_name = 'data1', data = chunk)

# put data2 into dataset data2
for chunk in reader2:
    chunk = chunk.drop(chunk.columns[0], axis = 1)
    store.append(ds_name = 'data2', data = chunk)
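
The `drop` call in the loops above removes the saved row index. Here is a minimal sketch of what it does, using a small made-up CSV in memory (the content of `csv_text` is an assumption, not the real data1.csv):

```python
import io

import pandas as pd

# A CSV that was written together with its row index:
# note the empty first field in the header line.
csv_text = ",name,score\n0,ann,1.5\n1,bob,2.5\n2,cat,3.5\n"

# With index_col=False the saved index is read as an ordinary first column.
chunk = pd.read_csv(io.StringIO(csv_text), index_col=False)
print(len(chunk.columns))

# Drop the first column so that only the real data columns remain.
trimmed = chunk.drop(chunk.columns[0], axis=1)
print(list(trimmed.columns))
```

After the drop, the number of columns matches the number of column names you declared for the dataset.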

# check information again
Check the info again: you will find that n_rows has changed for each dataset because we have already put some data (10K and 20K rows) into the datasets.

In [10]:
# check information of the HDF file again, n_container > n_rows
store.info()


--- ezHDF hdf_store info ---

dataset name: data1
column names:
   ['str0', 'int1', 'float2', 'float3', 'str4', 'str5', 'str6']
column dtype:[ s , i , f , f , s , s , s ]
n_rows: 10000
n_container: 15000

dataset name: data2
column names:
   ['str0', 'int1', 'str2', 'str3', 'float4', 'float5', 'str6']
column dtype:[ s , i , s , s , f , f , s ]
n_rows: 20000
n_container: 25000



# resize container
Once you have put all the data in the dataset, it is a waste of storage if n_container > n_rows. You can either use hdf_store.resize() to resize the container manually or use hdf_store.auto_resize() to let ezHDF resize the container automatically.

In [11]:
# resize so that n_container = n_rows, to save storage space
store.auto_resize(ds_name = 'data1')
store.auto_resize(ds_name = 'data2')

dataset (data1) now resized to (10000) rows
dataset (data2) now resized to (20000) rows


In [12]:
# check info again, now container size is equal to n_rows
store.info()


--- ezHDF hdf_store info ---

dataset name: data1
column names:
   ['str0', 'int1', 'float2', 'float3', 'str4', 'str5', 'str6']
column dtype:[ s , i , f , f , s , s , s ]
n_rows: 10000
n_container: 10000

dataset name: data2
column names:
   ['str0', 'int1', 'str2', 'str3', 'float4', 'float5', 'str6']
column dtype:[ s , i , s , s , f , f , s ]
n_rows: 20000
n_container: 20000



# close the file
Remember to close the file if you no longer want to use it. 

In [13]:
store.close()
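
If an error occurs halfway through your code, the `close()` call may never be reached. One way to guard against that is `contextlib.closing` from the standard library, which works with any object that has a `close()` method. The sketch below uses a stand-in class, `FakeStore`, which is hypothetical and only mimics the `close()` call. Whether ezHDF offers its own context manager is not covered in this tutorial.

```python
from contextlib import closing


class FakeStore:
    """Hypothetical stand-in for a store object; it only tracks close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


store_demo = FakeStore()
try:
    # closing() calls store_demo.close() when the block exits,
    # even if an exception is raised inside it.
    with closing(store_demo):
        raise RuntimeError("simulated failure while appending")
except RuntimeError:
    pass

print(store_demo.closed)
```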

# OK, now you know everything about storing data