# Handling of missing files

Here, I demonstrate what happens if no files are available for a requested date.

We treat 3 cases:

1. no data at all for requested time period

    In this case, no data can be processed. An HDF5 file is created nevertheless, 
    but because it does not contain any data arrays, its size will be very small.
    We still iterate through every requested day because we don't check
    beforehand whether there is any data at all. We skip that check because
    it can be quite expensive if the database is large. So it is the
    user's responsibility to request only reasonable time frames.
    
2. no files before or after some day within the requested period

    E.g. the database has entries from 2020-Feb-01 to 2020-Oct-01 but we
    put in a request for 2020-Jan-01 to 2020-Dec-31. Again, the algorithm
    needs to try every day we ask for. However, we only save the data
    and dates within the available time range. The missing days are
    filled with NaNs.
    
3. no files within a requested period
    
    E.g. the database has entries from 2020-Feb-01 to 2020-Oct-01 but
    files for 2020-Mar-01 to 2020-Mar-10 are missing. We request data
    for 2020-Feb-01 to 2020-Oct-01. In this case, we get a processed
    data set with the missing days filled with NaNs in the amplitude
    and PSD data.
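
The NaN filling described in cases 2 and 3 can be pictured with a small toy model. This is only a sketch: `fill_days` and the `available` dictionary are hypothetical stand-ins for the database lookup and are not part of `data_quality_control`.

```python
import datetime as dt
import numpy as np

def fill_days(request_start, request_end, available, windows_per_day=24):
    """Toy model: one row per requested day, NaN where no file exists.

    `available` maps a date to that day's values. Everything here is
    hypothetical and only mimics the behaviour described above.
    """
    ndays = (request_end - request_start).days + 1
    out = np.full((ndays, windows_per_day), np.nan)
    for i in range(ndays):
        day = request_start + dt.timedelta(days=i)
        values = available.get(day)  # None if the file is missing
        if values is not None:
            out[i] = values
    return out

# The toy database covers 2 to 5 January, but 4 January is missing (case 3).
# The request runs from 1 to 6 January, i.e. beyond both ends (case 2).
db = {dt.date(2021, 1, d): np.ones(24) for d in (2, 3, 5)}
amps = fill_days(dt.date(2021, 1, 1), dt.date(2021, 1, 6), db)
missing = np.isnan(amps).all(axis=1)
print(missing)
```

Every requested day is visited once, which is why the run time scales with the length of the request and not with the amount of available data.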

In [None]:
from importlib import reload
import os
from pathlib import Path
import numpy as np

In [None]:
from data_quality_control import base

In [None]:
from obspy.clients.filesystem.sds import Client
from obspy.clients.fdsn import RoutingClient
from obspy.core import UTCDateTime as UTC

In [None]:
import matplotlib.pyplot as plt
plt.style.use('tableau-colorblind10')

In [None]:
nscl_code = "GR.BFO..BHZ"
overlap = 60 #3600

fmin, fmax = (4, 14)
nperseg = 2048
winlen_in_s = 3600
proclen = 24*3600

outdir = Path("output/missing_file_handling") #'../sample_output/missing_file_handling/'

sds_root = os.path.abspath('../sample_sds/')
inventory_routing_type = "eida-routing"

sdsclient = Client(sds_root)
invclient = RoutingClient(inventory_routing_type)

Create output directory:

In [None]:
outdir.mkdir(parents=True, exist_ok=True)

## Database content

We have the last 6 days of 2020 and the first 9 days of 
January 2021 for GR.BFO..BHZ.

In [None]:
%ls ../sample_sds/*/*/*/*

Initialize the processor for the station code and clients:

In [None]:
reload(base)
processor = base.GenericProcessor(
        nscl_code,
        dataclient=sdsclient, 
        invclient=invclient, 
        outdir=outdir,
        # All other parameters are left at their defaults,
        # which correspond to the values defined above.
        )

print(processor)

## Request data entirely outside available time frame

We try to process every requested day as usual but since no data is available, 
no hdf5-files are created. The logger issues a warning.

In [None]:
startdate = UTC("2018-12-25")
enddate = UTC("2019-01-05")

In [None]:
%%time
#it -n1 -r7
processor.process(startdate, enddate, force_new_file=True)

In [None]:
processor

In [None]:
%ls output/missing_file_handling/

## Request partially available time frame

We request data for 2020-12-20 to 2021-01-15. We expect 2 files, 
one for the 2020 data and one for 2021. However, the request extends 
roughly 5 days beyond the available data at both ends.

The algorithm thus starts filling the first file (the one for 2020)
only at 25 December. The second file ends at 9 January. 

Since `fileunit="year"` (the default) and the processing length is 1 day
(`proclen_seconds=24*3600`), the output file is allocated
with shape (365, 24) for the amplitude data.
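
As a back-of-the-envelope check of that shape, the sketch below derives the number of rows and columns from the calendar and the window length. The helper `expected_amplitude_shape` is hypothetical. Whether the package really allocates 366 rows for a leap year such as 2020 is an assumption, so compare the result with the shapes printed further down.

```python
import calendar

def expected_amplitude_shape(year, proclen_seconds=24 * 3600, winlen_seconds=3600):
    """Rows: days in the year (assumes proclen is one day).
    Columns: number of windows per processing length."""
    ndays = 366 if calendar.isleap(year) else 365
    windows_per_proclen = proclen_seconds // winlen_seconds
    return ndays, windows_per_proclen

print(expected_amplitude_shape(2021))
print(expected_amplitude_shape(2020))
```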

In [None]:
startdate = UTC("2020-12-20")
enddate = UTC("2021-01-15")

In [None]:
%%time
#it -n1 -r7
processor.process(startdate, enddate, force_new_file=True)

In [None]:
%ls -nhG output/missing_file_handling/

By default, the output file expects to receive one year of data and
the arrays are allocated accordingly. However, most of the entries are
NaNs, except for the very last days of 2020 and the very first days of 2021.
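
To locate the filled part of such a mostly empty array, one can look for rows that contain at least one finite value. The following is a minimal sketch on a synthetic array. `available_rows` is a hypothetical helper, and the array layout (one row per day) is assumed to match what `reshape_amps_to_days()` returns after transposing.

```python
import numpy as np

def available_rows(amplitudes_by_day):
    """Return (first, last) row index holding at least one finite value,
    or None if the array contains only NaNs."""
    has_data = np.isfinite(amplitudes_by_day).any(axis=1)
    idx = np.flatnonzero(has_data)
    if idx.size == 0:
        return None
    return int(idx[0]), int(idx[-1])

# Synthetic stand-in for one year of hourly amplitudes.
toy = np.full((366, 24), np.nan)
toy[359:366] = 1.0  # pretend only the last 7 days of the year were filled
print(available_rows(toy))
```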

In [None]:
dat1 = base.BaseProcessedData()
dat1.from_file(outdir.joinpath('GR.BFO..BHZ_2020.hdf5'))

print(dat1.amplitudes.shape)
print(dat1.psds.shape)

In [None]:
_dat = dat1
A = _dat.reshape_amps_to_days()
fig, axs = plt.subplots(1, 2, gridspec_kw=dict(wspace=0.5))
fig.suptitle("2020")
datalabels = ['whole year', 'available period']
ticks = np.arange(0, len(A.T))
ticklabels = [l.date for l in
              np.arange(_dat.startdate, _dat.enddate+24*3600, 24*3600)]

for i, datalabel in enumerate(datalabels):
    ax = axs[i]
    ax.set_title(datalabel)
    cax = ax.imshow(A.T, aspect='auto')
    ax.set_xlabel('hours');
    

    if i==0:
        ax.set_yticks(ticks[::30])
        ax.set_yticklabels(labels=ticklabels[::30]);
    elif i==1:
        ax.set_yticks(ticks)
        ax.set_yticklabels(labels=ticklabels);
        ax.set_ylim(366, 355)

In [None]:
dat2 = base.BaseProcessedData()
dat2.from_file(outdir.joinpath('GR.BFO..BHZ_2021.hdf5'))

print(dat2.amplitudes.shape)
print(dat2.psds.shape)

In [None]:
_dat = dat2
A = _dat.reshape_amps_to_days()
fig, axs = plt.subplots(1, 2, gridspec_kw=dict(wspace=0.5))
fig.suptitle("2021")
datalabels = ['whole year', 'available period']
ticks = np.arange(0, len(A.T))
ticklabels = [l.date for l in
              np.arange(_dat.startdate, _dat.enddate+24*3600, 24*3600)]

for i, datalabel in enumerate(datalabels):
    ax = axs[i]
    ax.set_title(datalabel)
    cax = ax.imshow(A.T, aspect='auto')
    ax.set_xlabel('hours');
    

    if i==0:
        ax.set_yticks(ticks[::30])
        ax.set_yticklabels(labels=ticklabels[::30]);
    elif i==1:
        ax.set_yticks(ticks)
        ax.set_yticklabels(labels=ticklabels);
        ax.set_ylim(15, -1)

## Missing files within requested time

We remove 2 days (Jan 4-5 2021) of data from the database.
Then we request data for 2021-01-01 to 2021-01-12. 

In [None]:
%mv ../sample_sds/2021/GR/BFO/BHZ.D/GR.BFO..BHZ.D.2021.00[45]* .

In [None]:
%ls ../sample_sds/*/*/*/*

We also keep a backup of the first version of the file for 2021 because the
next step will overwrite the existing one.

In [None]:
%cp output/missing_file_handling/GR.BFO..BHZ_2021.hdf5 output/missing_file_handling/GR.BFO..BHZ_2021_bak.hdf5

In [None]:
startdate = UTC("2021-01-01")
enddate = UTC("2021-01-12")

In [None]:
processor.process(startdate, enddate, force_new_file=True)

In [None]:
%ls -nhG output/missing_file_handling/

Check if the correct number of days is in the file.

In [None]:
dat3 = base.BaseProcessedData()
dat3.from_file(outdir.joinpath('GR.BFO..BHZ_2021.hdf5'))

print(dat3.amplitudes.shape)

Now we plot the amplitude arrays for the contiguous case (where no
files were missing in the database) and the case where we removed 2 days.
The two missing days appear as NaNs (white color).

Note that we also get NaNs at the edges around the data gap because there
is no data to create the overlap between the files.
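
The edge effect can be reproduced with a toy trace. In this sketch each window is padded by `overlap` samples on both sides. This padding scheme is an assumption made for illustration, not necessarily how the package builds its windows.

```python
import numpy as np

winlen, overlap, windows_per_day, ndays = 10, 2, 4, 3
samples_per_day = winlen * windows_per_day

# Continuous toy trace; the middle day has no file and is therefore NaN.
trace = np.ones(ndays * samples_per_day)
trace[samples_per_day:2 * samples_per_day] = np.nan

amps = np.empty((ndays, windows_per_day))
for k in range(ndays * windows_per_day):
    # Pad each window by `overlap` samples on both sides (assumption).
    start = max(k * winlen - overlap, 0)
    stop = min((k + 1) * winlen + overlap, trace.size)
    # np.max propagates NaN, so a window touching the gap becomes NaN.
    amps[k // windows_per_day, k % windows_per_day] = np.max(trace[start:stop])

print(np.isnan(amps))
```

In this toy, the last window before the gap and the first window after it turn into NaN although their own day has data, because their padding reaches into the missing day.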

We get the old data from the backup file.

In [None]:
dat2 = base.BaseProcessedData()
dat2.from_file(outdir.joinpath('GR.BFO..BHZ_2021_bak.hdf5'))

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True)
datalabels = ['contiguous database', 'missing files']
for i, (datalabel, _dat) in enumerate(zip(datalabels,[dat2, dat3])):
    A = _dat.reshape_amps_to_days().T
    ax = axs[i]
    ax.set_title(datalabel)
    cax = ax.imshow(A, aspect='auto')
    labels = [l.date for l in
             np.arange(_dat.startdate, _dat.enddate+24*3600, 24*3600)]
    ax.set_yticks(np.arange(len(A)))
    ax.set_yticklabels(labels=labels);
    ax.set_ylim(0, 15)
    ax.set_xlabel('hours');


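(The y-axis here is reversed compared to the previous plots because `set_ylim(0, 15)` places day 0 at the bottom, whereas the earlier plots passed the larger limit first, e.g. `set_ylim(15, -1)`.)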

Let's put those files back:

In [None]:
%mv GR.BFO..BHZ.D.2021.00[45]* ../sample_sds/2021/GR/BFO/BHZ.D/

In [None]:
%ls ../sample_sds/*/*/*/*