# Converting HDF5 to CSV

HDF5 is a binary format designed for storing large arrays of data, whereas CSV files are plain text and easy to read and understand. You can also load CSV files directly into `pandas` and use them as needed.

In this notebook, we'll explore the **January 2020 GPM data**, identify the values we want to record, and create a CSV file.

## Load libraries

We need the `h5py` package to read the HDF5 file. Further, we'll use the `pandas` package to create a final dataset and save it to a CSV file.

In [1]:
import h5py
import pandas as pd

## Load dataset

We have one data file inside the **data/** directory. I'll read it using the `h5py` package.

In [2]:
dataset = h5py.File('data/gpm_jan_2020.HDF5', 'r')

## Explore dataset

Once the dataset is loaded, it acts like a Python dictionary. So we'll start by looking at its key-value pairs and, based on them, identify the values we want to keep.

In [3]:
dataset.keys()

<KeysViewHDF5 ['Grid']>

The HDF5 file contains a single group, **Grid**. Let's look at the key-value pairs inside it.

In [4]:
grid = dataset['Grid']
grid.keys()

<KeysViewHDF5 ['nv', 'lonv', 'latv', 'time', 'lon', 'lat', 'time_bnds', 'lon_bnds', 'lat_bnds', 'precipitation', 'randomError', 'gaugeRelativeWeighting', 'probabilityLiquidPrecipitation', 'precipitationQualityIndex']>
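The key listing shows names only. A minimal sketch of how one might also list the shape and data type of each dataset in a group is given here. The helper name `summarize` and the in-memory stand-in file are assumptions made for illustration, so that the cell runs without the real GPM file; they are not part of the original notebook.

```python
import h5py
import numpy as np


def summarize(group):
    """Return {name: (shape, dtype string)} for every dataset directly in a group."""
    return {
        name: (item.shape, str(item.dtype))
        for name, item in group.items()
        if isinstance(item, h5py.Dataset)
    }


# In-memory stand-in file (hypothetical contents), never written to disk.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as demo:
    demo_grid = demo.create_group("Grid")
    demo_grid.create_dataset("lon", data=np.zeros(4, dtype="float32"))
    demo_grid.create_dataset("precipitation", data=np.zeros((1, 4, 2), dtype="float32"))
    summary = summarize(demo_grid)

for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
```

With the real file, the same helper could be called as `summarize(dataset['Grid'])`.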

The group contains many datasets. Here, I'm most interested in the `lon`, `lat`, and `precipitation` values. Let's take a brief look at them.

### Longitude

In [5]:
print("Longitude data: {}".format(grid['lon']))
print("Longitude data attributes: {}".format(list(grid['lon'].attrs)))

Longitude data: <HDF5 dataset "lon": shape (3600,), type "<f4">
Longitude data attributes: ['DimensionNames', 'Units', 'units', 'standard_name', 'LongName', 'bounds', 'axis', 'CLASS', 'REFERENCE_LIST']


The shape indicates that `longitude` has 3600 values. From the attributes, I'll use `standard_name` and `units`.

In [15]:
print("Name: {}".format(grid['lon'].attrs['standard_name'].decode()))
print("Unit: {}".format(grid['lon'].attrs['units'].decode()))

Name: longitude
Unit: degrees_east
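The `.decode()` calls work here because these attributes come back as bytes. Whether an attribute is returned as bytes or as `str` can depend on how the string was stored and on the `h5py` version, so a small helper that accepts both may be safer. The name `as_text` is hypothetical and the values below are stand-ins, not read from the file.

```python
def as_text(value):
    """Return an attribute value as str, decoding it only when it is bytes."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


# Both forms give the same text.
print("Unit: {}".format(as_text(b"degrees_east")))
print("Unit: {}".format(as_text("degrees_east")))
```

With the real file, it could be used as `as_text(grid['lon'].attrs['units'])`.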


### Latitude

In [16]:
print("Latitude data: {}".format(grid['lat']))
print("Latitude data attributes: {}".format(list(grid['lat'].attrs)))

Latitude data: <HDF5 dataset "lat": shape (1800,), type "<f4">
Latitude data attributes: ['DimensionNames', 'Units', 'units', 'standard_name', 'LongName', 'bounds', 'axis', 'CLASS', 'REFERENCE_LIST']


The shape indicates that `latitude` has 1800 values. From the attributes, I'll use `standard_name` and `units`.

In [17]:
print("Name: {}".format(grid['lat'].attrs['standard_name'].decode()))
print("Unit: {}".format(grid['lat'].attrs['units'].decode()))

Name: latitude
Unit: degrees_north


### Precipitation

In [18]:
print("Precipitation data: {}".format(grid['precipitation']))
print("Precipitation data attributes: {}".format(list(grid['precipitation'].attrs)))

Precipitation data: <HDF5 dataset "precipitation": shape (1, 3600, 1800), type "<f4">
Precipitation data attributes: ['DimensionNames', 'Units', 'units', 'coordinates', '_FillValue', 'CodeMissingValue', 'DIMENSION_LIST']


The shape shows that it is a 3-dimensional array. The first dimension has length 1, and the remaining dimensions, 3600 by 1800, mean that we have 6480000 (3600\*1800) precipitation values: one for each combination of longitude and latitude. I'll also use the `units` attribute here.

In [20]:
print("Unit: {}".format(grid['precipitation'].attrs['units'].decode()))

Unit: mm/hr
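Since each precipitation value belongs to one longitude-latitude pair, the grid can be flattened into one row per pair before saving it as CSV. The following is a minimal sketch on a tiny synthetic grid; the coordinate values, the variable names, and the assumption that the axes are ordered (time, longitude, latitude) are illustrative and should be checked against the real file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real arrays: 3 longitudes, 2 latitudes.
lon = np.array([10.0, 20.0, 30.0], dtype="float32")
lat = np.array([-5.0, 5.0], dtype="float32")
precip = np.arange(6, dtype="float32").reshape(1, 3, 2)  # (time, lon, lat)

# indexing="ij" keeps longitude on the first axis, matching precip[0].
lon_grid, lat_grid = np.meshgrid(lon, lat, indexing="ij")

# ravel() walks all three arrays in the same order, so rows stay aligned.
table = pd.DataFrame({
    "longitude": lon_grid.ravel(),
    "latitude": lat_grid.ravel(),
    "precipitation": precip[0].ravel(),
})
print(table)
```

On the real data the same steps would produce 6480000 rows, which `table.to_csv(...)` could then write out.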
