# Header

text

**Note:** text

In [1]:
import seisbench.data as sbd
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np
import pandas as pd
from config import load_config

In [2]:
import warnings
warnings.simplefilter('ignore', DeprecationWarning)

### Loading the configuration file

In [None]:
cfg = load_config('Kaki-cfg.yml')
print(cfg)

### Loading the dataset

Now that the dataset conversion is finished, we can check the result by loading it. Here we load the dataset, print the metadata, and plot a range of waveforms together with their annotated picks.

In [None]:
base_path = Path(cfg.path.dataset)

data = sbd.WaveformDataset(base_path, sampling_rate=100)

In [None]:
print("Training examples:", len(data.train()))
print("Development examples:", len(data.dev()))
print("Test examples:", len(data.test()))

In [None]:
targets = [key for key in data.metadata.keys() if 'arrival' in key]
data.metadata[targets]
# data.metadata.keys()

In [8]:
def cmap(phase_hint):
    """Return the plot color for a phase hint; unknown phases fall back to yellow."""
    c = {'Pg': 'r', 'Sg': 'b', 'AML': 'c'}
    return c.get(phase_hint, 'y')

In [None]:
range_ii = [10, 100]
# Plot every trace with index in (range_ii[0], range_ii[1]] together with its picks.
for ii, metadata in data.metadata.iterrows():
    if range_ii[0] < ii <= range_ii[1]:
        fig = plt.figure(figsize=(7, 2.5))
        ax = fig.add_subplot(111)
        trace = data.get_waveforms(ii)
        print(trace.shape)
        ax.plot(trace.T)
        # Keep only the arrival columns that actually hold a pick for this trace.
        targets = [key for key in metadata.keys() if 'arrival' in key]
        targets = [key for key in targets if not np.isnan(metadata[key])]
        for target in targets:
            # The phase name is the second underscore-separated token of the column name.
            phase_hint = target.split('_')[1]
            ax.axvline(metadata[target], lw=3, c=cmap(phase_hint), label=phase_hint)
        if targets:
            # Skip the legend for traces without picks to avoid an empty-legend warning.
            ax.legend()
        plt.show()

## Considerations for converting datasets

As outlined above, this tutorial provides a minimal example of converting a dataset. Here we outline additional considerations that should be taken into account when preparing a dataset.

- **Grouping picks**: In this example, we created one trace for each pick. Naturally, traces will overlap if multiple picks, e.g., P and S phases, are available for an event at a station. To avoid storing the same waveform data several times, such picks can be grouped so that a single trace carries all of them. For an example implementation of this grouping operation, have a look [here](https://github.com/seisbench/seisbench/blob/df94dcd86ce66d6a2ee2bd00da3857259fe579bd/seisbench/data/ethz.py#L109) and in the subsequent lines.
- **Adding station information**: In this example, we added no station information except its name. In practice, it will often be helpful for users to incorporate, for example, the location of the station. We skipped this step here, because it requires loading station inventories through FDSN. For an example implementation, have a look [here](https://github.com/seisbench/seisbench/blob/df94dcd86ce66d6a2ee2bd00da3857259fe579bd/seisbench/data/ethz.py#L315).
- **Memory requirements**: Internally, the `WaveformDataWriter` writes out the waveforms continuously in blocks (see the next point), but keeps all metadata in memory until the dataset is complete. For very large datasets (or very detailed metadata) this can result in several gigabytes of memory consumption. If you are writing such datasets, make sure the available memory on your machine is sufficient.
- **Waveform blocks**: Instead of writing each waveform separately, waveforms are written out in blocks. This massively improves IO performance. Have a look at [the documentation](https://seisbench.readthedocs.io/en/stable/pages/data_format.html#traces-blocks) for details on the strategy. We expect that in nearly all cases using the default setting will be a good choice.
- **FDSN considerations**: When converting very large datasets, the performance might be limited by the performance of the FDSN webservice. Unfortunately, downloading lots of short waveforms (as required for many machine learning applications) does not seem to be the most favorable use case for FDSN. This leads to rather slow performance when naively downloading the waveforms as outlined above. Instead, it is often helpful to issue [bulk requests](https://docs.obspy.org/master/packages/autogen/obspy.clients.fdsn.client.Client.get_waveforms_bulk.html). In addition, it might be a good choice to first download the waveforms and cache them locally, for example, in .mseed format, and then convert them to SeisBench.
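
The grouping and bulk-request points above can be combined in a short sketch. It assumes that picks are available as simple records with hypothetical fields (`event_id`, `network`, `station`, `phase`, and `time` in epoch seconds). The names `Pick`, `group_picks`, and `to_bulk_request`, the 60 s window margins, and the `HH?` channel pattern are illustrative assumptions, not part of SeisBench or of the conversion code in this tutorial. The resulting tuples follow the (network, station, location, channel, starttime, endtime) layout that ObsPy's `get_waveforms_bulk` expects; the times are kept as plain floats here and would still need to be wrapped in `UTCDateTime` before sending a request.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pick:
    # Hypothetical pick record; the field names are assumptions for illustration.
    event_id: str
    network: str
    station: str
    phase: str
    time: float  # pick time in seconds since epoch


def group_picks(picks):
    """Group picks by (event, network, station) so that each group yields one trace."""
    groups = defaultdict(list)
    for pick in picks:
        groups[(pick.event_id, pick.network, pick.station)].append(pick)
    return dict(groups)


def to_bulk_request(groups, before=60.0, after=60.0, location="*", channel="HH?"):
    """Build one (net, sta, loc, cha, t0, t1) tuple per group of picks.

    The window runs from `before` seconds ahead of the earliest pick to
    `after` seconds past the latest pick in the group.
    """
    bulk = []
    for key in sorted(groups):
        _, network, station = key
        times = [pick.time for pick in groups[key]]
        bulk.append((network, station, location, channel,
                     min(times) - before, max(times) + after))
    return bulk


# Two picks at STA1 share one window; STA2 gets its own.
picks = [
    Pick("ev1", "XX", "STA1", "Pg", 1000.0),
    Pick("ev1", "XX", "STA1", "Sg", 1012.5),
    Pick("ev1", "XX", "STA2", "Pg", 1003.0),
]
groups = group_picks(picks)
bulk = to_bulk_request(groups)
print(len(picks), "picks ->", len(bulk), "requests")
```

Grouping before requesting means that each event-station pair is downloaded once, and the whole list can be sent in a single bulk call instead of one request per pick.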

For further details on the data format, check out [the data format specification in the SeisBench documentation](https://seisbench.readthedocs.io/en/stable/pages/data_format.html#traces-blocks).

In [None]:
ii = 4
metadata = data.metadata.iloc[ii]
# One subplot per component, sharing the sample axis.
fig, axes = plt.subplots(3, 1, figsize=(10, 5), sharex=True)
trace = data.get_waveforms(ii)
print(trace.shape)
for jj in range(3):
    axes[jj].plot(trace.T[:, jj], c='k', lw=0.5)
    # Hide the axes frame and background so only the waveform is drawn.
    axes[jj].patch.set_visible(False)
    axes[jj].axis('off')
# Keep only the arrival columns that actually hold a pick for this trace.
targets = [key for key in metadata.keys() if 'arrival' in key]
targets = [key for key in targets if not np.isnan(metadata[key])]
for target in targets:
    phase_hint = target.split('_')[1]
    # Mark the pick on all three components; the label is the first letter of the phase.
    for jj in range(3):
        axes[jj].axvline(metadata[target], lw=2, c=cmap(phase_hint), label=phase_hint[0])
axes[1].legend()