# Usage

In [None]:
from importlib import reload  # only for developing
from pathlib import Path
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('tableau-colorblind10')

from obspy.clients.filesystem.sds import Client
from obspy.core import UTCDateTime as UTC

from data_quality_control import sds_db, dqclogging, util, analysis, base

# Logger

In a script or notebook, the logger only needs to be configured once, at the
beginning. The routine `dqclogging.configure_handlers` sets up
to two handlers: one that writes messages to the console and,
optionally, one that writes to a file.
Each handler can have its own log level.

In [None]:
reload(dqclogging)
loglevel_console = "INFO"
loglevel_file = None # No log file

dqclogging.configure_handlers(loglevel_console, loglevel_file)
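
For orientation, the two-handler setup described above can be reproduced
with Python's standard `logging` module. This is only a minimal sketch and
not the implementation of `dqclogging.configure_handlers`; the helper name
`configure_two_handlers`, the logger name `dqc_demo` and the default log
file name are made up for illustration.

```python
import logging
import sys


def configure_two_handlers(name, loglevel_console="INFO",
                           loglevel_file=None, logfile="dqc.log"):
    """Hypothetical helper: console handler plus optional file handler."""
    logger = logging.getLogger(name)
    # The logger itself passes everything; the handlers do the filtering.
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers when re-running a cell

    console = logging.StreamHandler(sys.stdout)
    console.setLevel(loglevel_console)
    logger.addHandler(console)

    # A file handler is only attached if a level is given for it.
    if loglevel_file is not None:
        file_handler = logging.FileHandler(logfile)
        file_handler.setLevel(loglevel_file)
        logger.addHandler(file_handler)
    return logger


logger = configure_two_handlers("dqc_demo", "INFO", None)
logger.info("shown on the console")
logger.debug("suppressed, because DEBUG is below the console level")
```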

## Define parameters

In [None]:
# NSLC
nslc_code = "GR.BFO..BHZ"

# Processing parameters
overlap = 60 
fmin, fmax = (4, 14)
nperseg = 2048
winlen_in_s = 3600
#proclen = 24*3600
sampling_rate = 20

# Data sources
sds_root = Path('../sample_sds/').absolute()
inventory_routing_type = "eida-routing"

# Output configuration
outdir = Path('output/usage_demo')
fileunit = "year" # period to store in 1 file

Create output directory:

In [None]:
outdir.mkdir(parents=True, exist_ok=True)

You can use the SDS client directly to check the content
of the database. Note, though, that this can take some time
if your database is large.

In [None]:
sdsclient = Client(str(sds_root))
sdsclient.get_all_nslc()
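
If only a single channel is of interest, it can be quicker to check for the
expected files directly. The sketch below builds the path of one day file
following the standard SDS layout
(`YEAR/NET/STA/CHAN.TYPE/NET.STA.LOC.CHAN.TYPE.YEAR.DAY`). The helper name
`sds_day_file` and the root `/data/sds` are illustrative assumptions and not
part of `data_quality_control`.

```python
from datetime import date
from pathlib import Path


def sds_day_file(sds_root, nslc_code, day, sds_type="D"):
    """Hypothetical helper: path of one SDS day file for an NSLC code."""
    net, sta, loc, cha = nslc_code.split(".")
    doy = day.timetuple().tm_yday  # day of year, 1..366
    fname = f"{net}.{sta}.{loc}.{cha}.{sds_type}.{day.year}.{doy:03d}"
    return (Path(sds_root) / str(day.year) / net / sta
            / f"{cha}.{sds_type}" / fname)


# "/data/sds" is a placeholder root directory.
p = sds_day_file("/data/sds", "GR.BFO..BHZ", date(2020, 12, 20))
print(p)
print(p.exists())  # True only if that day file is present in the archive
```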

## Process raw data

Since we have an sds-database, we use the `sds_db` module to extract amplitudes and
power spectral densities (PSD) from the raw seismic data.

In [None]:
reload(sds_db)  # only for developing
processor = sds_db.SDSProcessor(
        nslc_code,
        inventory_routing_type,
        sds_root,
        outdir=outdir, 
        fileunit=fileunit,
        # Processing parameters
        overlap=overlap, nperseg=nperseg, 
        winlen_seconds=winlen_in_s, 
        #proclen_seconds=proclen,
        amplitude_frequencies=(fmin, fmax),
        sampling_rate=sampling_rate)

print(processor)
#processor.logger.setLevel("INFO")

In [None]:
startdate = UTC("2020-12-20")
enddate = UTC("2021-01-15")

In [None]:
%%time
#it -n1 -r7
processor.process(startdate, enddate, force_new_file=True)

If we change the `fileunit` to `"month"`, the filenames also include
the month.

In [None]:
reload(sds_db)  # only for developing
processor = sds_db.SDSProcessor(
        nslc_code,
        inventory_routing_type,
        sds_root,
        outdir=outdir, 
        fileunit="month",
        # Processing parameters
        overlap=overlap, nperseg=nperseg, 
        winlen_seconds=winlen_in_s, 
        #proclen_seconds=proclen,
        amplitude_frequencies=(fmin, fmax))

print(processor)

In [None]:
startdate = UTC("2020-12-20")
enddate = UTC("2021-01-15")

In [None]:
%%time
#it -n1 -r7
processor.process(startdate, enddate, force_new_file=True)

`fileunit="month"` produces output files ending in `YYYY-MM.hdf5`. Note that
these files are only about 1/12 of the size of the yearly files, because
each covers one month rather than one year of data.

In [None]:
%ls -lh ../sample_output/show_processing/

In [None]:
f = Path("../sample_output/show_processing/GR.BFO..BHZ_2020-12.hdf5")
print(f.stat().st_size)

In [None]:
f = Path("../sample_output/show_processing/GR.BFO..BHZ_2020.hdf5")
print(f.stat().st_size)
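
The naming pattern can be summarized in a few lines. The helper
`expected_filename` and the `DATE_FORMATS` mapping below are hypothetical;
they only mirror the two file names that appear in this notebook and are not
the package's own routine.

```python
from datetime import date

# Hypothetical mapping from fileunit to the date part of the file name.
DATE_FORMATS = {"year": "%Y", "month": "%Y-%m"}


def expected_filename(nslc_code, day, fileunit):
    """Return the HDF5 file name expected for a date and fileunit."""
    return f"{nslc_code}_{day.strftime(DATE_FORMATS[fileunit])}.hdf5"


for unit in ("year", "month"):
    print(unit, expected_filename("GR.BFO..BHZ", date(2020, 12, 20), unit))
```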

## Analyze processed data

With the `analysis` module, the processed data can be accessed 
and visualized.

In [None]:
from data_quality_control import analysis #, util

In [None]:
# Only for display in documentation!
from IPython.display import display, HTML

First, we initialize an Analyzer by setting the path to the 
HDF5-data (`outdir`), a station code and the `fileunit`, i.e.
the name format of the HDF5-files that we want to analyze.

The initial object does not have any data yet.

In [None]:
reload(analysis)
#reload(util)
lyza = analysis.Analyzer(outdir, nslc_code,
                            fileunit="year")

In [None]:
print(lyza)

We can inquire which files and time ranges are available for
the given code, location and fileunit.

In [None]:
files = lyza.get_available_datafiles()
print(files)

In [None]:
print(lyza.get_available_timerange())

## View data for time range

To view the data, amplitudes and spectra are treated differently.
Amplitudes are loaded as they are stored in the HDF5-file. Thus, we obtain an array of
shape `N_proclen x N_winlen`. The sample data covers 16 days and we used 
`proclen_seconds = 86400`, i.e. 1 day, so the first dimension is 16. 
With `winlen_seconds = 3600` there are 24 windows per day, which gives the
second dimension of the amplitude array.
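
The shape arithmetic can be checked with a small stand-in array. The zeros
below are a placeholder only; real amplitudes come from the HDF5-file via the
Analyzer.

```python
import numpy as np

n_days = 16              # days covered by the sample data
proclen_seconds = 86400  # one processing block = 1 day
winlen_seconds = 3600    # one window = 1 hour

# Number of windows that fit into one processing block.
windows_per_block = proclen_seconds // winlen_seconds

# Placeholder with the same layout as the amplitude array.
amplitudes = np.zeros((n_days, windows_per_block))
print(amplitudes.shape)
```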

For the spectra, there are two options:
1. load all spectra within a specific time range
2. load spectra for selected times given as list

The spectra are stored in the HDF5-files as 3D arrays. The
first two dimensions correspond to those of the amplitude
array; the third dimension is the frequency axis.
In contrast, the Analyzer flattens the first two dimensions,
i.e. the resulting array is basically a spectrogram, thus
a sequence of spectra over time.
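
As a sketch of what flattening means here, the NumPy snippet below merges the
day axis and the window axis of a synthetic 3D array into one time axis. The
number of frequency bins (`n_freqs = 5`) is an arbitrary placeholder, and the
snippet does not show how the Analyzer itself does it.

```python
import numpy as np

n_days, n_windows, n_freqs = 16, 24, 5  # n_freqs is a placeholder value

# Synthetic stand-in for the stored PSDs: (day, window, frequency).
stored = np.arange(n_days * n_windows * n_freqs, dtype=float).reshape(
    n_days, n_windows, n_freqs)

# Merge the day axis and the window axis into a single time axis.
spectrogram = stored.reshape(-1, n_freqs)
print(stored.shape, "->", spectrogram.shape)

# Day d, window w ends up in row d * n_windows + w of the flattened array.
d, w = 3, 7
assert np.array_equal(spectrogram[d * n_windows + w], stored[d, w])
```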

This makes it possible to select spectra only for specific times.
For example, one may want to select only those hours
where the wind speed is in a specific range.

In [None]:
startdate = UTC("2020-12-25")
enddate = UTC("2021-01-15")

In [None]:
lyza.get_data(startdate, enddate)

In [None]:
print(lyza)
print("lyza is of type", type(lyza))

#### Spectrogram

In [None]:
fig = lyza.plot_spectrogram()

#### Amplitudes

Amplitude values are visualized in a matrix covering
date vs time of day (at least if you use appropriate 
processing and window length).

In [None]:
lyza.plot_amplitudes();

#### 3D-Plots

Interactive 3D plots are created using plotly. The figures are 
HTML-code, heavily loaded with Javascript, which can be stored and
viewed in a browser.

**Careful!!!** For large data sets, the files can become extremely large
and your browser might not be able to handle them. So use with care.

In [None]:
fig_amp, fig_psd = lyza.plot3d()

In [None]:
display(HTML(fig_psd.to_html(include_mathjax="cdn")))

In [None]:
display(HTML(fig_amp.to_html(include_mathjax="cdn")))

In a notebook or script you could simply run:

```python
fig.show()
```

### View data for selected times

For some use cases one might want to get only the power 
spectral densities for specific times. For example, one could
filter a time series of wind speed data for times with a
certain speed. 

The Analyzer extracts PSDs for specific times only if it
receives a list of `UTCDateTime` objects. 

**Note that for time lists the time axis in the spectral plots is only approximate!**
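
The wind-speed use case could look roughly like the sketch below. The wind
data is synthetic, and the speed range of 5 to 8 m/s is an arbitrary example;
a real series would have to come from a weather station and be resampled to
the hourly windows of the processed data.

```python
import numpy as np

# Hourly time axis matching the 1 h windows of the processed data.
hours = np.arange("2020-12-25", "2021-01-10", dtype="datetime64[h]")

# Synthetic wind speed in m/s (placeholder for real measurements).
rng = np.random.default_rng(seed=0)
wind_speed = rng.uniform(0, 15, size=hours.size)

# Keep only the hours with wind speed in the example range.
mask = (wind_speed >= 5) & (wind_speed < 8)
selected = hours[mask]

# ISO strings; in this notebook each one would be wrapped as UTC(str(t)).
time_strings = [str(t) for t in selected]
print(len(time_strings), "of", hours.size, "hours selected")
```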

#### Create random time list

For demonstration, we create a list of 100 random times within
the time range of the data.

In [None]:
starttime = UTC("2020-12-25")
endtime = UTC("2021-01-10")
times = np.arange(str(starttime.date), str(endtime.date),
             dtype="datetime64[h]")

In [None]:
tlist = np.sort(np.random.choice(times, 100, replace=False))

tlist = [UTC(str(t)) for t in tlist]

#### Read data for times in list

Since we used 100 datetimes, we get a PSD-Matrix which has
100 entries along the time axis. The amplitude matrix remains
unaffected.

In [None]:
lyza.get_data(tlist)

print(lyza.startdate, lyza.enddate)
print(lyza.amplitudes.shape)
print("PSD shape:", lyza.psds.shape)
print("len(tlist):", len(tlist))

In [None]:
fig = lyza.plot_spectrogram()

#### 3D-Plots

In [None]:
fig_amp, fig_psd = lyza.plot3d()

In [None]:
display(HTML(fig_psd.to_html(include_mathjax="cdn")))

In [None]:
display(HTML(fig_amp.to_html(include_mathjax="cdn")))

# Smoothing & Data reduction

In some cases it is desirable, or even necessary, to smooth 
and/or reduce the amount of data. Suppose you computed PSDs and 
amplitude levels over 1 h windows from the seismic data. Plotting 
several years or even decades of such data would still involve 
an enormous number of values. (From experience, 
in a Jupyter notebook even the 2D matplotlib figures will freeze
the browser; the 3D figures can reach gigabytes in size, so you probably
won't be able to open those either.)

Thus, over such a time range, an hourly resolution is neither advisable nor
necessary. In that case, you can further reduce the 
already computed data using the `SmoothOperator`. It is built on top
of the `Analyzer` that we used before, but applies a median filter
to the data arrays. The filter is defined by two 
parameters. `kernel_size` determines over how many samples the median
is computed. `kernel_shift` determines by how many samples the 
filter is shifted to compute the next median. Thus, if 
`kernel_shift=1`, the data is smoothed at the same resolution; if
`kernel_shift>1`, the data is downsampled. 
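
The effect of the two parameters can be illustrated with a plain NumPy
sliding median. The function `median_smooth` is a hypothetical stand-in, not
the code of `SmoothOperator`; in particular, it keeps only complete windows,
which is an assumption about the edge handling.

```python
import numpy as np


def median_smooth(data, kernel_size, kernel_shift):
    """Sliding median along axis 0, using complete windows only."""
    n_out = (data.shape[0] - kernel_size) // kernel_shift + 1
    starts = np.arange(n_out) * kernel_shift
    return np.array([np.median(data[s:s + kernel_size], axis=0)
                     for s in starts])


hourly = np.arange(24, dtype=float)  # one day of hourly values

# kernel_shift=1: smoothing only, one output per possible window position.
smoothed = median_smooth(hourly, kernel_size=6, kernel_shift=1)
# kernel_shift=3: smoothing plus downsampling by a factor of 3.
reduced = median_smooth(hourly, kernel_size=6, kernel_shift=3)
print(smoothed.shape, reduced.shape)
print(reduced)
```

With 24 hourly values, a kernel of 6 samples fits at 19 positions when shifted
by 1 sample, but only at 7 positions when shifted by 3 samples.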

The result of this operation is an output file similar to those from the
original processing, stored in the same way. Make sure to
give a different directory; otherwise your existing data may be overwritten.


We create a new output directory for the smoothed data.

In [None]:
smoothdir = outdir.joinpath("smoothed")
smoothdir.mkdir(exist_ok=True)

Now we initialize the operator. Note that the `outdir` from
before becomes the data directory here. We set a 
kernel size of 6 samples, which corresponds to 6 h because the original 
data was processed in windows of 3600 s = 1 h. The kernel is shifted by
3 samples, so in the end we get one value every 3 hours.

In [None]:
reload(base)
reload(analysis)
polly = analysis.SmoothOperator(outdir, nslc_code, 
                                kernel_size=6, kernel_shift=3)

Now we start the actual filtering.

In [None]:
%%time 
polly.smooth(smoothdir, force_new_file=True)

## View smoothed data

In [None]:
%%time
lyza = analysis.Analyzer(smoothdir, nslc_code, fileunit="year")

lyza.get_data(*lyza.get_available_timerange())

In [None]:
print(lyza)

The spectrogram from the new data looks coarser than
the first one.

In [None]:
lyza.plot_spectrogram();

In [None]:
lyza.plot_amplitudes();