# Tutorial-0

This first exercise demonstrates how to create an empty HDF5eis file and add a single channel of data to it. To begin, let's import some necessary packages for this tutorial and create a directory where we can write some data files.

In [1]:
# Standard library imports
import io
import pathlib

# Third-party imports
import hdf5eis
import obspy.clients.fdsn
import pandas as pd


OUTPUT_DIR = pathlib.Path("/home/malcolmw/scratch/hdf5eis_tutorial")
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

Core functionality for manipulating HDF5eis files is accessed via the `hdf5eis.File` class.

In [2]:
hdf5eis.File?

Init signature: hdf5eis.File(*args, overwrite=False, **kwargs)
Docstring:
An h5py.File subclass for convenient I/O of big, multidimensional
timeseries data from environmental sensors. This class provides
the core functionality for manipulating HDF5eis files.
Init docstring:
Initialize hdf5eis.File object.

Parameters
----------
*args :
    These are passed directly to the super class initializer.
    Must contain a file path (str or bytes) or a file-like
    object.
overwrite : bool, optional
    Whether or not to overwrite an existing file if one exists
    at the given location. The default is False.
**kwargs :
    These are passed directly to super class initializer.

Raises
------

    ValueError if mode="w" and file already exists.

Returns
-------
    None.
File:

Let's create an empty HDF5eis file. The default `mode` is `"r"`, so to create a new file we need to specify `mode="w"` or `mode="a"`. Note that the `hdf5eis.File` class inherits from the `h5py.File` class and passes `*args` and `**kwargs` to its super-class initializer. Any valid positional or keyword arguments for the `h5py.File` initializer are therefore valid for `hdf5eis.File` as well.

In [3]:
file_out = hdf5eis.File(OUTPUT_DIR.joinpath("my_first_file.hdf5"), mode="a")
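To make the argument forwarding and the `overwrite` guard described in the docstring concrete, here is a minimal sketch built on plain Python classes. `BaseFile` and `GuardedFile` are hypothetical stand-ins invented for illustration; this is not the `hdf5eis` or `h5py` implementation, and it only models whether a file exists, not what it contains.

```python
import pathlib
import tempfile


class BaseFile:
    """Hypothetical stand-in for a base class such as h5py.File."""

    def __init__(self, name, mode="r"):
        self.name = pathlib.Path(name)
        self.mode = mode
        if mode in ("w", "a"):
            self.name.touch()  # Create the file if it is missing.
        elif not self.name.exists():
            raise FileNotFoundError(self.name)


class GuardedFile(BaseFile):
    """Hypothetical subclass that forwards arguments to its base class."""

    def __init__(self, *args, overwrite=False, **kwargs):
        # Recover the mode whether it was given positionally or by keyword.
        mode = kwargs.get("mode", args[1] if len(args) > 1 else "r")
        if mode == "w" and not overwrite and pathlib.Path(args[0]).exists():
            raise ValueError("File exists; pass overwrite=True to replace it.")
        # Everything else goes straight to the base-class initializer.
        super().__init__(*args, **kwargs)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir, "demo.bin")
    GuardedFile(path, mode="a")  # Creates the file.
    try:
        GuardedFile(path, mode="w")  # Refused: the file already exists.
        refused = False
    except ValueError:
        refused = True
    reopened = GuardedFile(path, mode="w", overwrite=True)

print(refused, reopened.mode)
```

Because the subclass consumes only `overwrite`, any other argument accepted by the base class passes through unchanged, which is the property the paragraph above relies on.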

`hdf5eis.File` instances have three properties (`timeseries`, `metadata`, and `products`), each of which provides functionality to manipulate the group of the same name.

In [4]:
file_out.timeseries?

Type:        property
String form: <property object at 0x7f07aafaf860>
Docstring:
Provides functionality to manipulate the  "/timeseries"
group.

Returns
-------
hdf5eis.TimeseriesAccessor
    Provides functionality to manipulate the  "/timeseries"
    group.


In [5]:
file_out.metadata?

Type:        property
String form: <property object at 0x7f07aaf93950>
Docstring:
Provides functionality to manipulate the  "/metadata"
group.

Returns
-------
hdf5eis.AuxiliaryAccessor
    Provides functionality to manipulate the  "/metadata"
    group.


In [6]:
file_out.products?

Type:        property
String form: <property object at 0x7f07aafaf810>
Docstring:
Provides functionality to manipulate the  "/products"
group.

Returns
-------
hdf5eis.AuxiliaryAccessor
    Provides functionality to manipulate the  "/products"
    group.


The `hdf5eis.File.timeseries` property has an `index` property that records the contents of the group. At present, it is empty.

In [7]:
file_out.timeseries.index

tag  start_time  end_time  sampling_rate  npts


Let's download some timeseries data to add to the file. We will download one hour of data from IRIS for channel `AZ.BZN..HHZ`.

In [8]:
network = "AZ"
station = "BZN"
location = ""
channel = "HHZ"
start_time = obspy.UTCDateTime("2021-01-01T00:00:00Z")
end_time = obspy.UTCDateTime("2021-01-01T01:00:00Z")
client = obspy.clients.fdsn.Client()
stream = client.get_waveforms(
    network,
    station,
    location,
    channel,
    start_time,
    end_time
)

We can add the data using the `add` method.

In [9]:
file_out.timeseries.add?

Signature: file_out.timeseries.add(data, start_time, sampling_rate, tag='', **kwargs)
Docstring:
Add timeseries data to the parent HDF5eis file.

Parameters
----------
data : array-like
    Data array of any shape to add to file.
start_time : str, int, float, or pandas.Timestamp
    The UTC time of the first sample in data. This value is
    internally converted to a pandas.Timestamp by
    pandas.to_datetime().
sampling_rate : int, float
    The temporal sampling rate of data in units of samples per
    second.
tag : str, optional
    Tag to associate with data. The default is "".
**kwargs :
    Additional keyword arguments are passed directly the
    h5py.Group.create_datset() method and can be used, for
    example, to choose the chunk layout and compres

In [10]:
for trace in stream:
    file_out.timeseries.add(
        trace.data,
        str(trace.stats.starttime),
        trace.stats.sampling_rate,
        tag=".".join((network, station, location, channel))
    )

Now we can see that there is a corresponding row in the timeseries index.

In [11]:
file_out.timeseries.index

           tag                       start_time                         end_time  sampling_rate    npts
0  AZ.BZN..HHZ 2021-01-01 00:00:00.008400+00:00 2021-01-01 00:59:59.998400+00:00          100.0  360000
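The index row can be checked by hand. The sketch below rebuilds the tag from its four components and computes the time of the last sample from `start_time`, `sampling_rate`, and `npts`, assuming that `end_time` marks the last sample (an assumption that is consistent with the row shown here). It uses only the standard library and does not touch the file.

```python
from datetime import datetime, timedelta, timezone

# An empty location code produces the double dot in the tag.
network, station, location, channel = "AZ", "BZN", "", "HHZ"
tag = ".".join((network, station, location, channel))

# Values copied from the index row.
start_time = datetime(2021, 1, 1, 0, 0, 0, 8400, tzinfo=timezone.utc)
sampling_rate = 100.0
npts = 360000

# The last sample falls (npts - 1) sampling intervals after the first.
offset = timedelta(microseconds=round((npts - 1) * 1e6 / sampling_rate))
end_time = start_time + offset

print(tag)                   # AZ.BZN..HHZ
print(end_time.isoformat())  # 2021-01-01T00:59:59.998400+00:00
```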


We can retrieve the data now using a hybrid of dictionary-like and array-slicing syntax.

In [12]:
super_gather = file_out.timeseries["AZ.BZN..HHZ", "2021-01-01T00:00:00Z": "2021-01-01T00:30:00Z"]

Timeseries data are returned as a dictionary that maps each matching `tag` to a list of `hdf5eis.Gather` objects.

In [13]:
super_gather

{'AZ.BZN..HHZ': [<hdf5eis.gather.Gather at 0x7f09aa74a430>]}

Each `hdf5eis.Gather` object has a number of descriptive properties (a subset is demonstrated here).

In [14]:
gather = super_gather["AZ.BZN..HHZ"][0]
print("gather.data:", gather.data)           # The raw data array
print("gather.starttime:", gather.starttime) # The UTC time of the first temporal sample.
print("gather.times:", gather.times)         # The UTC time of each temporal sample.

gather.data: [ 716  693  700 ... 1247 1303 1395]
gather.starttime: 2021-01-01 00:00:00.008400+00:00
gather.times: DatetimeIndex(['2021-01-01 00:00:00.008400+00:00',
               '2021-01-01 00:00:00.018400+00:00',
               '2021-01-01 00:00:00.028400+00:00',
               '2021-01-01 00:00:00.038400+00:00',
               '2021-01-01 00:00:00.048400+00:00',
               '2021-01-01 00:00:00.058400+00:00',
               '2021-01-01 00:00:00.068400+00:00',
               '2021-01-01 00:00:00.078400+00:00',
               '2021-01-01 00:00:00.088400+00:00',
               '2021-01-01 00:00:00.098400+00:00',
               ...
               '2021-01-01 00:29:59.908400+00:00',
               '2021-01-01 00:29:59.918400+00:00',
               '2021-01-01 00:29:59.928400+00:00',
               '2021-01-01 00:29:59.938400+00:00',
               '2021-01-01 00:29:59.948400+00:00',
               '2021-01-01 00:29:59.958400+00:00',
               '2021-01-01 00:29:59.968400+00:00',


A dictionary is returned when retrieving data because regular expressions are permitted when specifying the `tag` value. To demonstrate this, let's add data for another station to the file.

In [15]:
station = "CRY"
stream = client.get_waveforms(
    network,
    station,
    location,
    channel,
    start_time,
    end_time
)
for trace in stream:
    file_out.timeseries.add(
        trace.data,
        str(trace.stats.starttime),
        trace.stats.sampling_rate,
        tag=".".join((network, station, location, channel))
    )
    
file_out.timeseries.index

           tag                       start_time                         end_time  sampling_rate    npts
0  AZ.BZN..HHZ 2021-01-01 00:00:00.008400+00:00 2021-01-01 00:59:59.998400+00:00          100.0  360000
1  AZ.CRY..HHZ 2021-01-01 00:00:00.008400+00:00 2021-01-01 00:59:59.998400+00:00          100.0  360000


Now we can specify a regular expression to select data from both stations.

In [16]:
super_gather = file_out.timeseries["AZ.*", "2021-01-01T00:00:00Z": "2021-01-01T00:30:00Z"]
super_gather

{'AZ.BZN..HHZ': [<hdf5eis.gather.Gather at 0x7f07aaff0fd0>],
 'AZ.CRY..HHZ': [<hdf5eis.gather.Gather at 0x7f09aa7035b0>]}
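One consequence of treating tags as regular expressions is that an unescaped `.` matches any character, not only a literal dot. The sketch below illustrates this with the standard `re` module. The `select` helper, the use of `re.fullmatch`, and the last two tags are assumptions made for illustration; the exact matching rule that `hdf5eis` applies is not shown in this tutorial.

```python
import re

# The last two tags are hypothetical and are not in our file.
tags = ["AZ.BZN..HHZ", "AZ.CRY..HHZ", "AZXBZN..HHZ", "CI.BZN..HHZ"]


def select(pattern, tags):
    """Return the tags that pattern matches in full (assumed matching rule)."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


wildcard = select("AZ.*", tags)                   # Anything starting with "AZ".
unescaped = select("AZ.BZN..HHZ", tags)           # Each "." matches any character.
literal = select(re.escape("AZ.BZN..HHZ"), tags)  # Dots are matched literally.

print(wildcard)
print(unescaped)
print(literal)
```

When a tag must be matched exactly, escaping it with `re.escape()` is one way to avoid picking up unintended neighbors.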

Now that we can add and retrieve timeseries data, let's get the corresponding station metadata from IRIS.

In [17]:
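# Request each station separately, then merge the results. Unpacking the list
# makes this call sum(inv_bzn, inv_cry): the second inventory serves as the
# start value, and the networks of the first are added to it one by one.
inventory = sum(*[
    client.get_stations(
        network=network,
        station=station,
        location=location,
        channel=channel
    )
    for station in ("BZN", "CRY")
])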

We can write this metadata to STATIONXML format using a buffer.

In [18]:
buffer = io.BytesIO()
inventory.write(buffer, "STATIONXML")
buffer.seek(0)
stationxml = buffer.read()

# stationxml is now a stream of UTF-8 encoded bytes.
print(stationxml)

b'<?xml version=\'1.0\' encoding=\'UTF-8\'?>\n<FDSNStationXML xmlns="http://www.fdsn.org/xml/station/1" schemaVersion="1.1">\n  <Source>IRIS-DMC</Source>\n  <Sender>IRIS-DMC</Sender>\n  <Module>IRIS WEB SERVICE: fdsnws-station | version: 1.1.48</Module>\n  <ModuleURI>http://service.iris.edu/fdsnws/station/1/query?network=AZ&amp;station=CRY&amp;location=--&amp;channel=HHZ</ModuleURI>\n  <Created>2022-06-22T13:54:33.717000Z</Created>\n  <Network code="AZ" startDate="1982-01-01T00:00:00.000000Z" restrictedStatus="open">\n    <Description>ANZA Regional Network (ANZA)</Description>\n    <Identifier type="DOI">10.7914/SN/AZ\n   </Identifier>\n    <TotalNumberStations>93</TotalNumberStations>\n    <SelectedNumberStations>1</SelectedNumberStations>\n    <Station code="CRY" startDate="1982-10-01T00:00:00.000000Z" restrictedStatus="open">\n      <Latitude unit="DEGREES">33.5654</Latitude>\n      <Longitude unit="DEGREES">-116.7373</Longitude>\n      <Elevation>1128.0</Elevation>\n      <Site>\n 
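The buffer pattern used above is not specific to ObsPy. As a rough, standard-library-only sketch, the example below writes a tiny XML document into an `io.BytesIO` buffer, extracts the bytes, and parses them back after a round trip through `str`. The element names are made up for illustration and are far simpler than real StationXML.

```python
import io
import xml.etree.ElementTree as ET

# Build a toy document (hypothetical structure, not valid StationXML).
root = ET.Element("Network", code="AZ")
ET.SubElement(root, "Station", code="BZN")
tree = ET.ElementTree(root)

# Serialize into an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
tree.write(buffer, encoding="UTF-8", xml_declaration=True)
payload = buffer.getvalue()  # Returns all bytes; no seek(0)/read() needed.

# Round trip: bytes -> str -> bytes -> parsed document.
text = payload.decode("UTF-8")
parsed = ET.parse(io.BytesIO(text.encode("UTF-8"))).getroot()
station_codes = [element.get("code") for element in parsed.iter("Station")]

print(type(payload).__name__, station_codes)
```

`buffer.getvalue()` is an alternative to the `seek(0)` and `read()` pair; both yield the full contents of the buffer.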

And we can add this byte stream to the `/metadata` group using the `hdf5eis.File.metadata.add()` method.

In [19]:
file_out.metadata.add(stationxml, "network_as_UTF8_STATIONXML")

We can retrieve this metadata using dictionary-like syntax.

In [20]:
file_out.metadata["network_as_UTF8_STATIONXML"]

'<?xml version=\'1.0\' encoding=\'UTF-8\'?>\n<FDSNStationXML xmlns="http://www.fdsn.org/xml/station/1" schemaVersion="1.1">\n  <Source>IRIS-DMC</Source>\n  <Sender>IRIS-DMC</Sender>\n  <Module>IRIS WEB SERVICE: fdsnws-station | version: 1.1.48</Module>\n  <ModuleURI>http://service.iris.edu/fdsnws/station/1/query?network=AZ&amp;station=CRY&amp;location=--&amp;channel=HHZ</ModuleURI>\n  <Created>2022-06-22T13:54:33.717000Z</Created>\n  <Network code="AZ" startDate="1982-01-01T00:00:00.000000Z" restrictedStatus="open">\n    <Description>ANZA Regional Network (ANZA)</Description>\n    <Identifier type="DOI">10.7914/SN/AZ\n   </Identifier>\n    <TotalNumberStations>93</TotalNumberStations>\n    <SelectedNumberStations>1</SelectedNumberStations>\n    <Station code="CRY" startDate="1982-10-01T00:00:00.000000Z" restrictedStatus="open">\n      <Latitude unit="DEGREES">33.5654</Latitude>\n      <Longitude unit="DEGREES">-116.7373</Longitude>\n      <Elevation>1128.0</Elevation>\n      <Site>\n  

The metadata is returned as a `str`, so we encode it back to UTF-8 bytes and parse it using `obspy.read_inventory()`.

In [21]:
buffer = io.BytesIO(file_out.metadata["network_as_UTF8_STATIONXML"].encode("UTF-8"))
obspy.read_inventory(buffer, format="STATIONXML")

Inventory created at 2022-06-22T13:54:33.717000Z
	Created by: IRIS WEB SERVICE: fdsnws-station | version: 1.1.48
		    http://service.iris.edu/fdsnws/station/1/query?network=AZ&station=C...
	Sending institution: IRIS-DMC (IRIS-DMC)
	Contains:
		Networks (2):
			AZ (2x)
		Stations (2):
			AZ.BZN (Buzz Northerns Place, Anza, CA, USA)
			AZ.CRY (Cary Ranch, Anza, CA, USA)
		Channels (0):


Finally, we can convert the metadata to a `pandas.DataFrame`, which can also be added to the `/metadata` group.

In [22]:
dataf = pd.DataFrame(
    [
        [
            network.code, 
            station.code, 
            station.latitude, 
            station.longitude, 
            station.elevation
        ]
        for network in inventory for station in network
    ],
    columns=["network", "station", "latitude", "longitude", "elevation"]
)
dataf

  network station  latitude  longitude  elevation
0      AZ     CRY   33.5654  -116.7373     1128.0
1      AZ     BZN   33.4915  -116.6670     1301.0


In [23]:
file_out.metadata.add(dataf, "network_geometry_as_table")
file_out.metadata["network_geometry_as_table"]



   elevation  latitude  longitude network station
0     1128.0   33.5654  -116.7373      AZ     CRY
1     1301.0   33.4915  -116.6670      AZ     BZN
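Notice that the retrieved table lists its columns in alphabetical order rather than in the order we supplied. If the original order matters, selecting the columns explicitly restores it. The sketch below rebuilds the retrieved table by hand from the output above, so it runs without the file.

```python
import pandas as pd

# Column order used when the table was first created.
columns = ["network", "station", "latitude", "longitude", "elevation"]

# Stand-in for the retrieved table, with values copied from the output above.
retrieved = pd.DataFrame({
    "elevation": [1128.0, 1301.0],
    "latitude": [33.5654, 33.4915],
    "longitude": [-116.7373, -116.667],
    "network": ["AZ", "AZ"],
    "station": ["CRY", "BZN"],
})

# Selecting with a list of names returns the columns in that order.
restored = retrieved[columns]
print(restored)
```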


The `hdf5eis.File.products` attribute behaves exactly like the `metadata` attribute. Let's finish responsibly by closing our file.

In [24]:
file_out.close()

Note that using the context manager is the canonical way of opening and closing HDF5eis files.

In [25]:
with hdf5eis.File(OUTPUT_DIR.joinpath("my_first_file.hdf5"), mode="r") as file_in:
    print(file_in.metadata["network_geometry_as_table"])

   elevation  latitude  longitude network station
0     1128.0   33.5654  -116.7373      AZ     CRY
1     1301.0   33.4915  -116.6670      AZ     BZN
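The main benefit of the context manager is that the file is closed even if the body of the `with` block raises an exception. The sketch below demonstrates this behavior with Python's built-in `open()` as a stand-in, so it runs without an HDF5eis file; the file name and the simulated error are invented for illustration.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir, "demo.txt")
    path.write_text("hello")
    try:
        with open(path) as handle:
            raise RuntimeError("simulated failure while reading")
    except RuntimeError:
        pass
    # The with statement closed the file before the exception propagated.
    closed_after_error = handle.closed

print(closed_after_error)  # True
```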


That's it! Those are the basics of adding data to and retrieving it from an HDF5eis file. In the next tutorial, we will learn how to add and retrieve multidimensional arrays and use the HDF5eis external-linking functionality.