# Explore data using ezHDF
ezHDF also offers a convenient API for exploring data in an HDF5 file, provided the data were stored with the ezHDF hdf_store. Let's first load the file we've just stored. Remember to open it with mode = 'r'; opening it with mode = 'w' will erase all your stored data.    
   
**Note:**  
Currently ezHDF only lets you explore the data. You cannot modify the stored data.

In [1]:
import sys; sys.path.insert(0,'/Users/shutingpi/Dropbox/ezHDF')
from ezHDF.ezHDF import ezHDF
import pandas as pd
import numpy as np; np.random.seed(0)



In [2]:
wkdir = '/Users/shutingpi/Dropbox/ezHDF/example/'
store = ezHDF(wkdir = wkdir, hdf_name = 'my_hdf.h5', mode = 'r')

# Check info of the file

In [3]:
store.info()


--- ezHDF hdf_store info ---

dataset name: data1
column names:
   ['str0', 'int1', 'float2', 'float3', 'str4', 'str5', 'str6']
column dtype:[ s , i , f , f , s , s , s ]
n_rows: 10000
n_container: 10000

dataset name: data2
column names:
   ['str0', 'int1', 'str2', 'str3', 'float4', 'float5', 'str6']
column dtype:[ s , i , s , s , f , f , s ]
n_rows: 20000
n_container: 20000



# Explore a dataset
If you want to explore the dataset "data1", you can simply call the dataset explorer. It creates a dataset explorer object for you to work with the data.  

In [4]:
ds = store.explorer(ds_name = 'data1')

# Check information of the ds object
Just print it; the output shows all the necessary information about this dataset. 

In [5]:
print(ds)

ezHDF hanlder object:
  dataset name: data1
  column names: 
  [str0 , int1 , float2 , float3 , str4 , str5 , str6]
  column dtype:
  [s  ,  i  ,  f  ,  f  ,  s  ,  s  ,  s]
  size of data: 10000
  size of container 10000


# Fetch data by slicing
Now you can fetch your data by slicing with row indexes and column names. The return value is convenient: **a pandas data frame!**  
  
Note that:  
* Columns cannot be selected with a slice such as 1:3. To select several columns, pass a list.
* In HDF5, data are moved to RAM only when you slice them. Therefore, you should slice only a chunk of data at a time when dealing with a huge dataset. 
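
The column rules above can be pictured with a small key normalizer. This is only a sketch of how such indexing might be resolved, not ezHDF's actual implementation: `resolve_columns` and the `COLUMNS` list are hypothetical names introduced for illustration, with the column names copied from the `data1` info shown earlier.

```python
# Hypothetical sketch: turn a column key (a name, a position, or a list of
# either) into a list of column names. Not taken from the ezHDF source.
COLUMNS = ['str0', 'int1', 'float2', 'float3', 'str4', 'str5', 'str6']

def resolve_columns(key, columns=COLUMNS):
    if isinstance(key, slice):
        # mirrors the rule above: column slices such as 1:3 are rejected
        raise TypeError("use a list of columns instead of a slice")
    keys = key if isinstance(key, list) else [key]
    names = []
    for k in keys:
        if isinstance(k, int):
            names.append(columns[k])   # numeric position -> column name
        elif k in columns:
            names.append(k)            # already a column name
        else:
            raise KeyError(k)
    return names

print(resolve_columns('str0'))         # ['str0']
print(resolve_columns([0, 2, 4, 6]))   # ['str0', 'float2', 'str4', 'str6']
```

Rejecting slices up front keeps the accepted forms down to a single name, a single position, or a list of either.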

In [6]:
# slicing using column names
df = ds[0:5,'str0']
print(df)

                str0
0           hJNDpcpA
1  jGwbJmIFYwSwhjeVh
2            ZNpYSwb
3           nQjBoPjp
4             BTQMgU


In [7]:
# you can also slice using a list of column names
df = ds[0:5,['str0','float2','str5']]
print(df)

                str0    float2                str5
0           hJNDpcpA  0.423237     doKrkpzJmKGDDix
1  jGwbJmIFYwSwhjeVh  0.979484             nJIaonN
2            ZNpYSwb  0.963585         PLRYcItYPoe
3           nQjBoPjp  0.381746               JseDV
4             BTQMgU  0.309559  XyYvSGFkcaQNPNpgmS


In [8]:
# of course, you can also slice using numeric indexes
df = ds[0:5,2]
print(df)

     float2
0  0.423237
1  0.979484
2  0.963585
3  0.381746
4  0.309559


In [9]:
# or a list of numeric indexes
df = ds[0:5, [0,2,4,6]]
print(df)

                str0    float2                  str4                 str6
0           hJNDpcpA  0.423237        bZUXqFubbyCKCP       vySjHlNofqeNeE
1  jGwbJmIFYwSwhjeVh  0.979484        LVdZZEyENWdcvZ            edBTaHlEA
2            ZNpYSwb  0.963585  SoVCJtYtiuAAOXpYmxny  OcjYwHwrtFErHYhzGMN
3           nQjBoPjp  0.381746    fmlKBIMbLjBvJYDitK                yuGNa
4             BTQMgU  0.309559    AeYzNJgdDynHqEyWDD              beiMCJY


# Get a batch of data
When dealing with a large file, you may want to get only a chunk of data rather than loading the whole dataset into RAM. For this purpose, ezHDF offers a convenient batch generator (a regular Python generator) that yields one chunk of data at a time. 

In [10]:
# define a batch data generator
data_generator = ds.batch(batch_size = 256)
# it's a generator, so you can use it directly in a for loop
for data in data_generator:
    print(data.shape)

(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(256, 7)
(16, 7)
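
The batch sizes follow directly from the row count: 10000 rows split into chunks of 256 give 39 full batches and one final batch of 16 rows. A minimal sketch of that arithmetic in plain Python is shown here; `batch_bounds` is a hypothetical helper, not part of ezHDF, and it only computes row ranges rather than reading any data.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        # the last range is cut short when n_rows is not a multiple of batch_size
        yield start, min(start + batch_size, n_rows)

sizes = [stop - start for start, stop in batch_bounds(10000, 256)]
print(len(sizes), sizes[0], sizes[-1])   # 40 256 16
```

Each range could then be used as a row slice, so that only one chunk is held in RAM at a time.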


In [11]:
# you can also get a batch of data using the __next__() method
# note that once the generator has gone through all the data, you will need to create it again.
data_generator = ds.batch(batch_size = 10)
print(data_generator.__next__())

                 str0  int1    float2    float3                  str4  \
0            hJNDpcpA   627  0.423237  0.829474        bZUXqFubbyCKCP   
1   jGwbJmIFYwSwhjeVh   429  0.979484  0.475929        LVdZZEyENWdcvZ   
2             ZNpYSwb   984  0.963585  0.022148  SoVCJtYtiuAAOXpYmxny   
3            nQjBoPjp   362  0.381746  0.931910    fmlKBIMbLjBvJYDitK   
4              BTQMgU   574  0.309559  0.723824    AeYzNJgdDynHqEyWDD   
5    oATwdCasJncLZije    34  0.911361  0.847395          LNFaDBILpXZQ   
6  KuIWxZatqqyAuqouXt   546  0.884977  0.668139               jFVZzdS   
7     nkUgPsbqILzvpLO   361  0.439210  0.823291   vhNXvHRajNRYJXFWFbq   
8      xLljzvqCTMqZXT    21  0.378707  0.136805          tNUlZdIyEGMu   
9      xfqpURdrsWXXEB   926  0.212674  0.494258   jiROBNrVFCevuXfqmfu   

                   str5                 str6  
0       doKrkpzJmKGDDix       vySjHlNofqeNeE  
1               nJIaonN            edBTaHlEA  
2           PLRYcItYPoe  OcjYwHwrtFErHYhzGMN  
3       
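
The remark about exhausted generators is standard Python behaviour and can be checked without any HDF5 file. The sketch below uses a toy generator, `batches`, which is a hypothetical stand-in and not the ezHDF batch generator; it yields lists of row indices instead of data frames.

```python
def batches(n_rows, batch_size):
    """Toy generator standing in for a batch generator (hypothetical)."""
    for start in range(0, n_rows, batch_size):
        yield list(range(start, min(start + batch_size, n_rows)))

gen = batches(25, batch_size=10)
first = next(gen)             # same effect as gen.__next__()
rest = list(gen)              # consume the remaining batches
exhausted = next(gen, None)   # None: nothing left, and no StopIteration raised

gen = batches(25, batch_size=10)   # create a fresh generator to start over
print(len(first), len(rest), exhausted, len(next(gen)))   # 10 2 None 10
```

The built-in `next(gen)` is the usual spelling of `gen.__next__()`, and passing a default as the second argument avoids the `StopIteration` exception once the data run out.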

# Get a random batch 
Once you have stored all your data in an HDF5 file, it is difficult to shuffle them without loading everything into RAM. This becomes an issue when dealing with a huge amount of data. For this purpose, ezHDF offers a cheap approach to help you get a random batch. 

Suppose we have 1000 rows. We first group the data into "mini batches" of 10 consecutive rows (e.g. 0~9, 10~19, ..., etc.), which gives 100 mini batches in total. If we want 100 samples in each batch, ezHDF randomly permutes the 100 mini batches (e.g. 0~9, 100~109, 350~359, ..., etc.) and takes every 10 mini batches as one batch. (Note that ezHDF shuffles all the rows within the batch again before output.) 

Therefore, if you use a single row as a mini batch, you get a truly random batch. However, this is very time-consuming because the data must be read from disk many times. You must therefore trade off randomness against time. 
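
The scheme described above can be sketched on row indices alone. This is an illustration under stated assumptions, not ezHDF's code: `random_batches` and its `seed` parameter are hypothetical, the standard-library `random` module stands in for whatever generator ezHDF uses, and the handling of leftover rows here may differ from the library's.

```python
import random

def random_batches(n_rows, mini_batch_size, n_mini_batch, seed=None):
    """Yield lists of row indices built from shuffled mini batches (sketch)."""
    rng = random.Random(seed)
    # start row of each contiguous mini batch: 0, 10, 20, ...
    starts = list(range(0, n_rows, mini_batch_size))
    rng.shuffle(starts)                       # permute the mini batches
    for i in range(0, len(starts), n_mini_batch):
        rows = []
        for s in starts[i:i + n_mini_batch]:
            # each mini batch is one contiguous range of rows
            rows.extend(range(s, min(s + mini_batch_size, n_rows)))
        rng.shuffle(rows)                     # shuffle rows inside the batch
        yield rows

batches = list(random_batches(1000, mini_batch_size=10, n_mini_batch=10, seed=0))
print(len(batches), len(batches[0]))   # 10 100
```

In this sketch every row appears in exactly one batch, and each batch needs only `n_mini_batch` contiguous ranges, which is where the saving over fully random row access comes from.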

In [12]:
# use every 32 rows to form a mini batch and randomly pick 5 mini batches as a batch,
# so each batch has 160 samples
# note that once a mini batch has been picked for a batch, it will not appear in any future batch.
random_data_generator = ds.random_batch(mini_batch_size = 32, n_mini_batch = 5)

In [13]:
# get each batch using a for loop. note that the batch that contains fewer than 160 samples
# is no longer necessarily the last one, because ezHDF picks the mini batches randomly.
for data in random_data_generator:
    print(data.shape)

(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(144, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)
(160, 7)


In [14]:
# you can also get a random batch using the __next__() method
random_data_generator = ds.random_batch(mini_batch_size = 32, n_mini_batch = 5)
print(random_data_generator.__next__())

                     str0  int1    float2    float3                  str4  \
0                   WixPO   100  0.768776  0.334362               zpNRhaS   
1                  bfJsTR   813  0.491949  0.343310              lzeHAHnC   
2               mKwxBjKeU   223  0.119829  0.727196                 bLfiL   
3       lgcvueljTbvikwUtb   786  0.898268  0.433156      ErKfPAIYyvhJxetg   
4    ibUAWsvwqzmmpLUWIRuP   341  0.895168  0.322340              WiYUurbP   
5            cREQuRGlBLef   698  0.404444  0.269234       sVLmqDDHefzZPik   
6        RAedUkdefvELAICL   807  0.941902  0.891592         qkHGnUbWlOmYU   
7            ZLlQIdmdvIcr   387  0.403868  0.709471          jIshrIqGphNB   
8                  JyDWsp    85  0.982888  0.447901             iZsnUVIdV   
9           MSYeFWkSfxELg   369  0.136309  0.855019     knYHzqzuntaCpTlGT   
10                 kelWWV   932  0.969566  0.277107        xBGrEuESvnQEgX   
11        WwcomAVuZTkmWTN    73  0.658990  0.561233             YKFMHmibk   

# Control randomness
You can also reset the random number generator, which guarantees that you get exactly the same random batches. This is useful if you have two different datasets with the same number of rows and want to draw random batches from both in exactly the same way (i.e. fetch the same rows for every batch): set reset_seed = True.   
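
The reason a reset generator keeps two datasets aligned can be shown with a small sketch. It assumes only that the same seed is used for both shuffles; `shuffled_rows`, the seed value 0, and the toy `features` and `labels` lists are hypothetical, and the source does not show how reset_seed works internally.

```python
import random

def shuffled_rows(n_rows, seed):
    """Return a shuffled row order from a freshly seeded generator (sketch)."""
    rng = random.Random(seed)   # new generator, fixed seed
    rows = list(range(n_rows))
    rng.shuffle(rows)
    return rows

# two toy "datasets" with the same number of rows
features = [f"x{i}" for i in range(8)]
labels = [f"y{i}" for i in range(8)]

order_a = shuffled_rows(len(features), seed=0)
order_b = shuffled_rows(len(labels), seed=0)

# same seed and same length give the same order, so rows stay paired
pairs = [(features[i], labels[j]) for i, j in zip(order_a, order_b)]
print(order_a == order_b, all(x[1:] == y[1:] for x, y in pairs))   # True True
```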

In [15]:
random_data_generator = ds.random_batch(mini_batch_size = 32, n_mini_batch = 5, reset_seed = True)
print(random_data_generator.__next__())

                     str0  int1    float2    float3                  str4  \
0                 MexbinO   741  0.733865  0.592121         mfoYfYCogNnOQ   
1    iNmDfTYvqDvPoOCRMCXg   307  0.179869  0.494822                 byqQs   
2             iYDlSksMLBu   975  0.079313  0.365207              qWmRfwbS   
3                 FhxztEm   731  0.044610  0.993928                wOKPGK   
4                  pHvpHE    11  0.991088  0.532065                tIDGkf   
5        FuKIwNVwGfxknEHP    55  0.098161  0.026445     iXuAyTdFzQdoxwIWJ   
6                   eyEYq   575  0.452487  0.798744   rBMXsQirDOykneYeGGe   
7           rCiKyMkmvIdjz   738  0.325645  0.667743    lbsdpaNsQIHWixOtiE   
8                 dTpIaGS   857  0.091240  0.699233              RHvPBtsp   
9      xseddGYTwUTqeOEIho   352  0.703463  0.811673               GhRgwbZ   
10              cXlMySgbB   427  0.330221  0.317580             ngcNNWfmu   
11         SNcVCuoINvVzEp   706  0.490399  0.244070      cOgMQvjaodjAVWbv   

In [16]:
# create the generator again and you will get exactly the same random batch
random_data_generator = ds.random_batch(mini_batch_size = 32, n_mini_batch = 5, reset_seed = True)
print(random_data_generator.__next__())

                     str0  int1    float2    float3                  str4  \
0                 MexbinO   741  0.733865  0.592121         mfoYfYCogNnOQ   
1    iNmDfTYvqDvPoOCRMCXg   307  0.179869  0.494822                 byqQs   
2             iYDlSksMLBu   975  0.079313  0.365207              qWmRfwbS   
3                 FhxztEm   731  0.044610  0.993928                wOKPGK   
4                  pHvpHE    11  0.991088  0.532065                tIDGkf   
5        FuKIwNVwGfxknEHP    55  0.098161  0.026445     iXuAyTdFzQdoxwIWJ   
6                   eyEYq   575  0.452487  0.798744   rBMXsQirDOykneYeGGe   
7           rCiKyMkmvIdjz   738  0.325645  0.667743    lbsdpaNsQIHWixOtiE   
8                 dTpIaGS   857  0.091240  0.699233              RHvPBtsp   
9      xseddGYTwUTqeOEIho   352  0.703463  0.811673               GhRgwbZ   
10              cXlMySgbB   427  0.330221  0.317580             ngcNNWfmu   
11         SNcVCuoINvVzEp   706  0.490399  0.244070      cOgMQvjaodjAVWbv   

In [17]:
store.close()