# Handling of data gaps within files

In [None]:
import numpy as np

from scipy.signal import welch
from obspy.signal.filter import bandpass

In [None]:
from obspy.clients.filesystem.sds import Client
from obspy.clients.fdsn import RoutingClient
from obspy.core import UTCDateTime as UTC

In [None]:
import matplotlib.pyplot as plt
plt.style.use('tableau-colorblind10')

In [None]:
from data_quality_control import base, util

In [None]:
from importlib import reload

In [None]:
network = 'GR'
station = 'BFO'
location = ''
channel = 'HHZ'
overlap = 60 #3600

fmin, fmax = (4, 14)
nperseg = 2048
winlen_in_s = 3600
proclen = 24*3600

sds_root = '/home/lehr/sds/data'
inventory_routing_type = 'eida-routing'

sdsclient = Client(sds_root)
invclient = RoutingClient(inventory_routing_type)

In [None]:
starttime = UTC("2020-336")
#overlap = 600

In [None]:
#reload(processing)
processor = base.NSCProcessor(
        network, 
        station,
        channel,
        location,
        dataclient=sdsclient,
        invclient=invclient,)

In [None]:
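# Extend the requested window by `overlap` seconds on both sides,
# so that filter edge effects can be cut off later.
starttime = starttime - overlap
endtime = starttime + proclen + 2*overlap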
st = sdsclient.get_waveforms(starttime=starttime, 
                             endtime=endtime,
                    **processor.nsc_as_dict())

In [None]:
processor.nsc_as_dict()

In [None]:
inv = invclient.get_stations(starttime=starttime, 
                        endtime=endtime, level='response',
                        **processor.nsc_as_dict())


In [None]:
tr = util.process_stream(st, inv, 
            starttime, endtime)

In [None]:
tr.plot(show=False)

Obviously, the trace contains a large data gap.

In [None]:
np.where(np.isnan(tr.data))

Now, let's go through the routine that extracts the 
spectra and amplitudes step by step.

In [None]:
# Get some numbers
sr = tr.stats.sampling_rate
nf = int(proclen/winlen_in_s)
#proclen_samples = proclen * sr
winlen_samples = int(winlen_in_s * sr)

In [None]:
nf*winlen_samples

For the spectra, we can simply reshape the data
within the time window that we want to analyse. This is
the requested window without the overlap at either end, i.e.
from `starttime+overlap` to `endtime-overlap`.

In [None]:
# Data for spectra computation
data = util.get_adjacent_frames(tr, 
        starttime+overlap, nf, winlen_samples)
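
To make the reshaping step concrete, here is a minimal sketch in plain NumPy. The helper `adjacent_frames` and all `*_demo` names are made up for this illustration; this is not the code of `util.get_adjacent_frames`, only the idea of cutting a fixed number of samples and reshaping them.

```python
import numpy as np

def adjacent_frames(x, i_start, n_frames, n_win):
    """Hypothetical helper: take n_frames * n_win samples starting at
    index i_start and reshape them to shape (n_frames, n_win)."""
    n_tot = n_frames * n_win
    segment = x[i_start:i_start + n_tot]
    if segment.size != n_tot:
        raise ValueError("not enough samples for the requested frames")
    return segment.reshape(n_frames, n_win)

# Toy numbers: 2 s overlap at each end, 4 frames of 50 samples
sr_demo = 10.0
overlap_demo = 2
n_win_demo, n_frames_demo = 50, 4
n_overlap = int(overlap_demo * sr_demo)

x = np.arange(2 * n_overlap + n_frames_demo * n_win_demo, dtype=float)
frames = adjacent_frames(x, n_overlap, n_frames_demo, n_win_demo)
print(frames.shape)
```

The first frame starts right after the leading overlap and the last frame ends right before the trailing overlap.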

Let's take a look at the data matrix:

Each row shows 1 h of data.

Between 5:00 and 10:00 there is a data gap, in which the data are NaN.

In [None]:
plt.imshow(data, aspect='auto')
plt.ylabel('hours')
plt.xlabel('npts');

In [None]:
scale = 1e6
for i, row in enumerate(data):
    plt.plot(row*scale + i)
plt.ylabel('hours')
plt.xlabel('npts');

Spectra are only obtained for rows without NaNs; a row that contains NaNs yields a spectrum of NaNs. So the resulting array with the spectral data looks like this:

In [None]:
freq, P = welch(data, fs=sr, nperseg=nperseg, axis=1)

In [None]:
plt.imshow(np.log(P), aspect='auto')
plt.ylabel('hours')
plt.xlabel('frequency bins');

In [None]:
i, j = np.where(np.isnan(P))
print(np.unique(i))
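
The same behaviour can be reproduced with synthetic data. The following sketch uses random noise and made-up `*_demo` names, and puts a short gap into one row only; it is an illustration, not part of the processing routine.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
demo = rng.standard_normal((3, 4096))
demo[1, 1000:1010] = np.nan   # short gap in the second row only

f_demo, P_demo = welch(demo, fs=100.0, nperseg=512, axis=1)

# Rows of the spectral array that contain NaNs
nan_rows = np.unique(np.where(np.isnan(P_demo))[0])
print(nan_rows)
```

Even a gap of a few samples turns the whole spectrum of that row into NaN, because the affected segments enter the average over all segments.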

For the amplitudes, we want to get the $p$-th percentile of each time frame (e.g. 1 h) in the data.
We use `np.percentile`. In principle, we need the data again as a 2D array with rows corresponding to 
each time frame, as was the case for the spectral computation.

However, for the extraction of amplitude information, the situation is a little more complicated because we want to filter the data.

To avoid edge effects from filtering affecting the amplitude extraction, we want the data to be a little bit longer than needed. This is why we add some `overlap` to both ends of the requested time window.

For contiguous data, we can simply filter the entire trace at once, cut off the overlap and reshape the data vector into a 2d array as for the 
spectral computation.

Unfortunately, the output of the filter function is NaN from the first occurrence of NaNs onwards, because an IIR filter carries a NaN forward to all later output samples. So in the example, we
would only get amplitudes for the first 4 hours, even though several complete hours that could be analysed are left later on.
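
A small sketch of this effect, using a SciPy Butterworth bandpass as a stand-in for `obspy.signal.filter.bandpass` (the filter order, the noise data and the `*_demo` names are assumptions made for the illustration):

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs_demo = 100.0
sos = butter(4, [4, 14], btype="bandpass", fs=fs_demo, output="sos")

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
x[400:420] = np.nan   # 20-sample gap, valid data resume at sample 420

y = sosfilt(sos, x)

# Index of the first NaN in the output and whether everything after it is NaN
first_nan = int(np.argmax(np.isnan(y)))
print(first_nan, np.isnan(y[first_nan:]).all())
```

Although the input is valid again after the gap, the filter state has become NaN, so the output never recovers.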

Thus, if NaNs are present, we filter the data frame-wise. However, to accommodate the edge effects, we need some overlap between the frames as
well. The overlap gets tapered by multiplying each frame with a Tukey window. Then we filter each
row, and for the amplitude analysis we use only the inner, non-tapered part.

The decision which workflow to use is made by 
`util.get_amplitude`.

Let's take a look at these steps:

First, we look at what happens if we use 
simple, adjacent frames and filter them.

In [None]:
# Reshape into frames
data = util.get_adjacent_frames(tr, 
        starttime+overlap, nf, winlen_samples)

# Filter row-wise
filtered1 = bandpass(data.copy(), fmin, fmax, sr)

The next plot demonstrates how the filter function
`bandpass` reacts if NaNs are in a row:

The row at 5:00 is processed until the first 
occurrence of NaNs after around 250000 samples.

The row at 9:00 is not processed at all, although
we know from the figures above of the unfiltered
data that the data returns later in that hour.

In [None]:
scale = 5e6
for i, row in enumerate(filtered1):
    plt.plot(row*scale + i)
plt.ylabel('hours')
plt.xlabel('npts');
#plt.xlim(-10, 100)

If we zoom in on the beginning of the rows, we 
see the edge effects of the filter. That's the
high-amplitude wiggle at the beginning.

In [None]:
scale = 1e8
for i, row in enumerate(filtered1):
    plt.plot(row*scale + i, lw=1)
plt.ylabel('hours')
plt.xlabel('npts');
plt.xlim(-100, 10000)
plt.ylim(0, 25);

Compare it with the unfiltered beginnings. There
is no such overshoot in the data.

In [None]:
scale = 1e6
for i, row in enumerate(data):
    plt.plot(row*scale + i)
plt.ylabel('hours')
plt.xlabel('npts');
plt.xlim(-100, 10000)
plt.ylim(0, 25);

So for the amplitude analysis, we want to exclude 
this early part. But if we cut it off from the adjacent frames,
we can only analyse slightly less than our desired frame
length of 1 hour. 

This is why we need overlapping frames here.

In [None]:
taper_samples = int(overlap*sr)
data2, taper = util.get_overlapping_tapered_frames(
        tr.copy(), starttime+overlap, nf, winlen_samples, 
        taper_samples)

Note that the rows of the data matrix are now slightly longer
than for the simple, adjacent framing, namely by twice
the taper length.

In [None]:
print('simple frames', data.shape)
print('overlapping frames', data2.shape)
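
The framing with overlap can be sketched as follows. `overlapping_tapered_frames` is a hypothetical helper written for this illustration, not the code of `util.get_overlapping_tapered_frames`. It assumes that each frame is extended by the taper length on both sides, multiplied by a Tukey window, and set to NaN entirely if it contains a gap.

```python
import numpy as np
from scipy.signal.windows import tukey

def overlapping_tapered_frames(x, i_start, n_frames, n_win, n_taper):
    """Hypothetical helper: frames of n_win samples, each extended by
    n_taper samples on both sides and multiplied by a Tukey window."""
    n_row = n_win + 2 * n_taper
    # alpha is the fraction of the window covered by the two cosine ramps
    window = tukey(n_row, alpha=2 * n_taper / n_row)
    frames = np.empty((n_frames, n_row))
    for i in range(n_frames):
        i0 = i_start + i * n_win - n_taper
        frames[i] = x[i0:i0 + n_row] * window
    # Frames that contain a gap are discarded as a whole
    frames[np.isnan(frames).any(axis=1)] = np.nan
    return frames

n_taper_demo, n_win_demo, n_frames_demo = 20, 100, 3
rng = np.random.default_rng(2)
x = rng.standard_normal(2 * n_taper_demo + n_frames_demo * n_win_demo)
x[150] = np.nan   # lies in the second frame only

frames = overlapping_tapered_frames(
    x, n_taper_demo, n_frames_demo, n_win_demo, n_taper_demo)
print(frames.shape)
```

Each row has `n_win + 2 * n_taper` samples, and the inner part `frames[:, n_taper:-n_taper]` is the untouched data of the actual frame.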

If we now apply the filter row-wise, the edge 
effects are suppressed by the taper. The frame
that we actually want to use starts after the 
fade-in. The same holds, of course, at the other end, where the frame stops before the fade-out.

In [None]:
# Filter row-wise
filtered2 = bandpass(data2.copy(), fmin, fmax, sr)
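
To see why the taper helps, consider a synthetic signal with a constant offset and a small oscillation inside the passband. The sketch below again uses a SciPy Butterworth bandpass as a stand-in for `bandpass`; the signal, the filter order and the taper length are assumptions chosen for the illustration.

```python
import numpy as np
from scipy.signal import butter, sosfilt
from scipy.signal.windows import tukey

fs_demo = 100.0
n, n_taper = 6000, 1000
t = np.arange(n) / fs_demo
# Small 8 Hz oscillation on top of a constant offset
x = 1.0 + 0.01 * np.sin(2 * np.pi * 8.0 * t)

sos = butter(4, [4, 14], btype="bandpass", fs=fs_demo, output="sos")
y_raw = sosfilt(sos, x)
y_tapered = sosfilt(sos, x * tukey(n, alpha=2 * n_taper / n))

# Largest amplitude at the onset vs. in the settled part of the signal
onset_raw = np.abs(y_raw[:200]).max()
onset_tapered = np.abs(y_tapered[:200]).max()
steady = np.abs(y_raw[2000:4000]).max()
print(onset_raw, onset_tapered, steady)
```

Without the taper, the filter sees a jump from zero to the offset at the first sample and responds with a transient that is much larger than the actual signal. With the taper, the data fade in smoothly and the onset stays small.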

In [None]:
scale = 1e8
for i, row in enumerate(filtered2):
    plt.plot(row*scale + i)
plt.vlines(taper, -1, 25, 'k')
plt.text(taper, 26, 'actual frame start', 
         ha='center')
plt.ylabel('hours')
plt.xlabel('npts');
plt.xlim(-100, 10000)
plt.ylim(-1, 25);


`util.get_overlapping_tapered_frames` also
sets every row that contains any NaNs to NaN entirely.

In [None]:
scale = 1e7
for i, row in enumerate(filtered2):
    plt.plot(row*scale + i)
plt.vlines([taper, filtered2.shape[1]-taper], 
           -1, 25, 'k')
plt.text(taper, 26, 'actual frame start', 
         ha='center')
plt.ylabel('hours')
plt.xlabel('npts');
#plt.xlim(-10, 1000)
plt.ylim(-1, 25);


Thus, we can apply `np.percentile` to the targeted
section `filtered2[:,taper:-taper]` to get the amplitude percentile of each frame:

In [None]:
perc = 75

In [None]:
prctl = np.percentile(filtered2[:,taper:-taper],
                      perc, axis=1)

We compare this with the result of
`util.get_amplitude`, which combines the
previous steps.

In [None]:
prctl_direct = util.get_amplitude(
        tr, starttime+overlap, 
            fmin, fmax, overlap, winlen_samples, nf)

In [None]:
plt.plot(prctl, 'o:', ms=10,
         label='prctl (step by step)')
plt.plot(prctl_direct, 's:', ms=5, 
         label='prctl_direct (util.get_amplitude)')
plt.legend();

Difference:

In [None]:
prctl-prctl_direct
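
As a side note on the percentile step: frames that were set to NaN stay NaN in `np.percentile`, so hours with gaps remain marked as missing in the amplitudes. A minimal sketch with random data and made-up names:

```python
import numpy as np

rng = np.random.default_rng(3)
frames = rng.standard_normal((3, 140))
frames[1] = np.nan        # a frame that was discarded because of a gap
n_taper_demo = 20

# 75th percentile of the inner, non-tapered part of each frame
p75 = np.percentile(frames[:, n_taper_demo:-n_taper_demo], 75, axis=1)
print(p75)
```

Only the frame with the gap yields NaN; the other frames get a regular value.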

## Note on tr.slice / tr.trim

An obvious way to cut the trace would be `tr.slice` or `tr.trim` with a start and an end time. However, using time objects at both ends can result
in an unexpected number of samples. This is because the requested times may not fall exactly on samples of the trace. In this case ObsPy has to choose where to set the cut. As a result, the number of samples in a trace can vary by 1 or 2 for the same time length. 

Since we want to apply `np.reshape`, the total number of samples in the data vector must be exactly $n_{tot} = n_{frames} \times n_{win}$, where $n_{win}$ is the number of samples per frame, i.e. the frame length in seconds $l$ times the sampling rate $f$: 
$n_{win} = l \times f$

Therefore, we just use the starttime and then select $n_{tot}$ samples.
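
The following sketch illustrates the problem with toy numbers. It counts the samples between two times with a simple rule that includes both end points. This rule is an assumption for the illustration and not necessarily how ObsPy decides, but it shows how the count depends on where the samples lie relative to the requested times. The helper `count_between` and all names are made up.

```python
import numpy as np

sr_demo = 100.0
winlen_demo = 60                       # frame length l in seconds
n_frames = 5
n_win = int(winlen_demo * sr_demo)     # n_win = l * f
n_tot = n_frames * n_win

def count_between(t_first, sr, t_start, t_end):
    """Number of samples on the grid t_first + k/sr
    with t_start <= t <= t_end."""
    k_min = np.ceil((t_start - t_first) * sr)
    k_max = np.floor((t_end - t_first) * sr)
    return int(k_max - k_min + 1)

t_start = 10.0
t_end = t_start + n_frames * winlen_demo
# Same time span, but the first sample of the trace is shifted by 3 ms
n_aligned = count_between(0.0, sr_demo, t_start, t_end)
n_shifted = count_between(0.003, sr_demo, t_start, t_end)
print(n_aligned, n_shifted, n_tot)

# Selecting a fixed number of samples always allows the reshape
x = np.arange(40000, dtype=float)
i0 = int(round((t_start - 0.003) * sr_demo))
frames = x[i0:i0 + n_tot].reshape(n_frames, n_win)
```

With the fixed count, the shape of `frames` no longer depends on the alignment of the samples.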