# Convert STEAD data into SeisBench format
This notebook converts STEAD data into the SeisBench format. Most of the code is copied from seisbench/seisbench/data/stead.py.
All paths are currently hard-coded; each one is explained where it is used.
For convenience, I also added the SeisBench-converted 100samples dataset to the git repository. The original (unconverted) dataset is *not* included.

In [1]:
import pandas as pd
import h5py
from seisbench.data import WaveformDataset, WaveformDataWriter

In [2]:
# Maps the column names in the STEAD CSV file (keys) to SeisBench column names (values).
metadata_dict = {
    "trace_start_time": "trace_start_time",
    "trace_category": "trace_category",
    "trace_name": "trace_name",
    "p_arrival_sample": "trace_p_arrival_sample",
    "p_status": "trace_p_status",
    "p_weight": "trace_p_weight",
    "p_travel_sec": "path_p_travel_sec",
    "s_arrival_sample": "trace_s_arrival_sample",
    "s_status": "trace_s_status",
    "s_weight": "trace_s_weight",
    "s_travel_sec": "path_s_travel_sec",
    "back_azimuth_deg": "path_back_azimuth_deg",
    "snr_db": "trace_snr_db",
    "coda_end_sample": "trace_coda_end_sample",
    "network_code": "station_network_code",
    "receiver_code": "station_code",
    "receiver_type": "trace_channel",
    "receiver_latitude": "station_latitude_deg",
    "receiver_longitude": "station_longitude_deg",
    "receiver_elevation_m": "station_elevation_m",
    "source_id": "source_id",
    "source_origin_time": "source_origin_time",
    "source_origin_uncertainty_sec": "source_origin_uncertainty_sec",
    "source_latitude": "source_latitude_deg",
    "source_longitude": "source_longitude_deg",
    "source_error_sec": "source_error_sec",
    "source_gap_deg": "source_gap_deg",
    "source_horizontal_uncertainty_km": "source_horizontal_uncertainty_km",
    "source_depth_km": "source_depth_km",
    "source_depth_uncertainty_km": "source_depth_uncertainty_km",
    "source_magnitude": "source_magnitude",
    "source_magnitude_type": "source_magnitude_type",
    "source_magnitude_author": "source_magnitude_author",
}

Read the metadata CSV file.
Confusingly, it is named `merged.csv` in my case, but it actually corresponds to `100samples.csv` from the EQTransformer repository.

In [3]:
metadata = pd.read_csv("data/STEAD/example/merged.csv")
metadata.rename(columns=metadata_dict, inplace=True)
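
`DataFrame.rename` silently ignores mapping keys that are not columns of the frame, so a typo in `metadata_dict` or an unexpected CSV header would go unnoticed. A minimal sketch of a sanity check, using a hypothetical helper `missing_columns` and a toy frame with made-up values in place of the real CSV:

```python
import pandas as pd


def missing_columns(frame, mapping):
    """Return the mapping keys that are not columns of `frame` (hypothetical helper)."""
    return sorted(set(mapping) - set(frame.columns))


# Toy stand-ins for the real metadata and mapping (illustrative values only).
toy = pd.DataFrame({"p_arrival_sample": [700.0], "snr_db": ["[20. 21. 19.]"]})
toy_mapping = {
    "p_arrival_sample": "trace_p_arrival_sample",
    "snr_db": "trace_snr_db",
    "receiver_code": "station_code",  # deliberately absent from the toy frame
}

print(missing_columns(toy, toy_mapping))
renamed = toy.rename(columns=toy_mapping)  # the absent key is ignored without error
print(list(renamed.columns))
```

Running the same check on the real CSV before the rename would show whether any expected column is absent.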

`metadata_path` and `waveforms_path` are the output paths for the SeisBench-converted data.

In [4]:
writer = WaveformDataWriter(
    metadata_path="data/STEAD/example/seisbench/metadata.csv",
    waveforms_path="data/STEAD/example/seisbench/waveforms.hdf5",
)

In [5]:
# Set split
# Note: no random_state is passed, so the test split changes between runs.
test_split = metadata["trace_name"].sample(frac=0.1)
test_mask = metadata["trace_name"].isin(test_split)
train_dev = metadata["trace_name"][~test_mask].values
# Every 10th remaining trace goes to dev, i.e. 9% of all traces here
# (the EQTransformer GitHub repository suggests 5%).
dev_split = train_dev[::10]
# 100 samples: 10 test, 81 train, 9 dev (i.e. validation)
dev_mask = metadata["trace_name"].isin(dev_split)
metadata["split"] = "train"
metadata.loc[dev_mask, "split"] = "dev"
metadata.loc[test_mask, "split"] = "test"
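
The split sizes quoted in the comment can be checked on synthetic data. The sketch below repeats the same logic on 100 made-up trace names; the fixed `random_state` is an addition so that the sketch is reproducible, it is not used in the cell above.

```python
import pandas as pd

# Synthetic stand-in for the 100-trace metadata table (names are made up).
toy = pd.DataFrame({"trace_name": [f"trace_{i:03d}" for i in range(100)]})

# random_state is an assumption added for reproducibility of this sketch.
test_split = toy["trace_name"].sample(frac=0.1, random_state=0)
test_mask = toy["trace_name"].isin(test_split)
train_dev = toy["trace_name"][~test_mask].values
dev_mask = toy["trace_name"].isin(train_dev[::10])  # every 10th of the 90 remaining

toy["split"] = "train"
toy.loc[dev_mask, "split"] = "dev"
toy.loc[test_mask, "split"] = "test"

counts = toy["split"].value_counts().to_dict()
print(counts)
```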

In [6]:
writer.data_format = {
    "dimension_order": "CW",
    "component_order": "ZNE",
    "sampling_rate": 100,
    "measurement": "velocity",
    "unit": "counts",
    "instrument_response": "not restituted",
}

In [7]:
writer.set_total(len(metadata))

Traces converted: 0it [00:00, ?it/s]

Finally, convert the waveforms. As with the CSV file, `merged.hdf5` is really `100samples.hdf5`.

In [8]:
with h5py.File("data/STEAD/example/merged.hdf5", "r") as f:  # open read-only
    gdata = f["data"]
    for _, row in metadata.iterrows():
        row = row.to_dict()
        waveforms = gdata[row["trace_name"]][()]
        waveforms = waveforms.T  # From WC to CW
        waveforms = waveforms[[2, 1, 0]]  # From ENZ to ZNE

        writer.add_trace(row, waveforms)
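
The two array operations in the loop can be illustrated on a toy trace. This sketch assumes, as the comments above state, that the source arrays are stored as (samples, channels) with channel order E, N, Z; the array values are made up so that each channel is easy to recognize.

```python
import numpy as np

# Toy trace: 4 samples, 3 channels in (samples, channels) layout.
# Channel E is all 0.0, N is all 1.0, Z is all 2.0.
wc = np.stack([np.full(4, 0.0), np.full(4, 1.0), np.full(4, 2.0)], axis=1)

cw = wc.T            # (channels, samples): rows are E, N, Z
zne = cw[[2, 1, 0]]  # integer-list indexing reverses the rows: Z, N, E

print(wc.shape, zne.shape)
print(zne[:, 0])  # first sample of each channel, now in Z, N, E order
```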

In [9]:
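# _finalize() is a private method; it is normally invoked automatically when the
# writer is used as a context manager. It is called by hand here because the
# writer was created outside a `with` block.
writer._finalize()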
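
The notebook imports `WaveformDataset` but never uses it. A minimal sketch of how the converted files could be read back to check the result, assuming SeisBench is installed and the output directory written above exists. `load_converted` is a hypothetical helper, not part of SeisBench.

```python
def load_converted(path="data/STEAD/example/seisbench"):
    """Open the converted dataset and print a few sanity checks (hypothetical helper)."""
    # Imported here so that defining the helper does not itself require SeisBench.
    from seisbench.data import WaveformDataset

    data = WaveformDataset(path)
    print(len(data), "traces")
    print(data.metadata["split"].value_counts())
    # Expected to be (components, samples), given the "CW" dimension order set above.
    print(data.get_waveforms(0).shape)
    return data
```

Calling `load_converted()` after the last cell should report one trace per row of the input metadata if the conversion succeeded.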