Here, I demonstrate what happens if no files are available for a requested date.

We treat 3 cases:

1. no data at all for the requested time period

    In this case, no data can be processed and no HDF5 file is created.
    We still iterate through every requested day because we don't check
    beforehand whether there is any data at all. We skip that check
    because it can be quite expensive if the database is large. So it is
    the user's responsibility to request only reasonable time frames.
    
2. no files before or after some day within the requested period

    E.g. the database has entries from 2020-Feb-01 to 2020-Oct-01 but we
    put in a request for 2020-Jan-01 to 2020-Dec-31. Again, the algorithm
    needs to try every day we ask for. However, we only save the data
    and dates within the available time range. The dates in the filename
    and metadata are adjusted accordingly. The missing days are not
    filled with NaNs in this case.
    
3. no files for some days within the requested period

    E.g. the database has entries from 2020-Feb-01 to 2020-Oct-01 but
    files for 2020-Mar-01 to 2020-Mar-10 are missing. We request data
    for 2020-Feb-01 to 2020-Oct-01. In this case, we get a processed
    data set with the missing days filled with NaNs in the amplitude
    and PSD data.
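
The three cases can be summarized in a small, self-contained sketch. It is not the implementation in `data_quality_control`; the function name `classify_days` and the use of `datetime.date` are assumptions made purely for illustration. Requested days before the first or after the last available day are dropped, while missing days in between are kept and reported as gaps (these are the days that would be filled with NaNs).

```python
from datetime import date, timedelta


def classify_days(start, end, available):
    """Hypothetical helper: sort requested days into saved days and gaps.

    start, end : first and last requested day (inclusive)
    available  : set of days for which the database has files
    Returns (saved_days, gap_days); both are empty if nothing is available.
    """
    ndays = (end - start).days + 1
    requested = [start + timedelta(days=i) for i in range(ndays)]
    # Every requested day is tried; there is no up-front availability check.
    found = [d for d in requested if d in available]
    if not found:
        # Case 1: no data at all, so nothing would be written.
        return [], []
    first, last = found[0], found[-1]
    # Case 2: leading and trailing days without data are trimmed.
    saved = [d for d in requested if first <= d <= last]
    # Case 3: days without files inside the trimmed range become gaps.
    gaps = [d for d in saved if d not in available]
    return saved, gaps


# Example mirroring case 3: data for 1-9 Jan 2021, but 4 and 5 Jan missing.
available = {date(2021, 1, d) for d in range(1, 10)}
available -= {date(2021, 1, 4), date(2021, 1, 5)}
saved, gaps = classify_days(date(2021, 1, 2), date(2021, 1, 12), available)
print(saved[0], saved[-1], len(saved))  # 2021-01-02 2021-01-09 8
print(gaps)
```

Note that a gap at the very start of a request counts as a leading day without data, so under this logic it is trimmed rather than reported as a gap.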

In [None]:
from importlib import reload
import os
import numpy as np

In [None]:
from data_quality_control import processing

In [None]:
from obspy.clients.filesystem.sds import Client
from obspy.clients.fdsn import RoutingClient
from obspy.core import UTCDateTime as UTC
from obspy.signal import util


In [None]:
import matplotlib.pyplot as plt
plt.style.use('tableau-colorblind10')


In [None]:
network = 'GR'
station = 'BFO'
location = ''
channel = 'HHZ'
overlap = 60 #3600

fmin, fmax = (4, 14)
nperseg = 2048
winlen_in_s = 3600
proclen = 24*3600

sds_root = os.path.abspath('../sample_sds/')
inventory_routing_type = "eida-routing"

sdsclient = Client(sds_root)
invclient = RoutingClient(inventory_routing_type)

In [None]:
sdsclient.get_all_nslc()

### Database content

We have the last 7 days of 2020 and the first 9 days of 
January 2021 for GR.BFO..HHZ.

In [None]:
%ls ../sample_sds/*/*/*/*

Initialize the processor with the station code and the two clients.

In [None]:
reload(processing)
processor = processing.RawDataProcessor2(
        sdsclient,
        invclient,
        network, 
        station,
        location,
        channel,
        )

# Request data entirely outside available time frame

We try to process every requested day as usual, but since no data is available, 
no HDF5 files are created. The logger issues a warning.

In [None]:
startdate = UTC("2018-12-25")
enddate = UTC("2019-01-05")

In [None]:
%%timeit
processor.process(startdate, enddate,
                  overlap, winlen_in_s,
                  nperseg, fmin, fmax, '../data/', proclen)

# Request partially available time frame

We request data for 2020-12-20 to 2021-01-15. We expect 2 files, 
one for the 2020 data and one for 2021. However, the request extends 
5 days before and 6 days after the available data.

The algorithm thus starts the first file (the one for 2020)
only at 25 December. The second file ends at 9 January rather
than 15 January.
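
The expected output files can also be worked out by hand. The sketch below trims the request to the available range and splits it at the year boundary. The helper name `expected_files` is hypothetical, and both the one-file-per-year rule and the file-name pattern are inferred from the files loaded later in this notebook, not taken from the package's code.

```python
from datetime import date


def expected_files(start, end, first_avail, last_avail, code="GR.BFO..HHZ"):
    """Hypothetical helper: list the file names we expect for a request.

    Assumes one file per calendar year and the name pattern
    <code>_<first day>_<last day>.hdf5 (an inference, not the package API).
    """
    # Trim the request to the days the database actually covers.
    start, end = max(start, first_avail), min(end, last_avail)
    if start > end:
        # Request lies entirely outside the available range: no files.
        return []
    names = []
    for year in range(start.year, end.year + 1):
        # Clip each file to its calendar year.
        first = max(start, date(year, 1, 1))
        last = min(end, date(year, 12, 31))
        names.append(f"{code}_{first}_{last}.hdf5")
    return names


names = expected_files(
    date(2020, 12, 20), date(2021, 1, 15),   # requested period
    first_avail=date(2020, 12, 25),          # first day in the database
    last_avail=date(2021, 1, 9),             # last day in the database
)
print(names)
```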

In [None]:
startdate = UTC("2020-12-20")
enddate = UTC("2021-01-15")

In [None]:
ch = processing.logger.handlers[0]

ch.setLevel('INFO')

In [None]:
%%timeit -n1 -r7
processor.process(startdate, enddate,
                  overlap, winlen_in_s,
                  nperseg, fmin, fmax, '../data/', proclen)

In [None]:
%ls -lh ../data

Check that the files also contain the correct number of days (7 and 9).

In [None]:
dat1 = processing.ProcessedData()
dat1.from_file('../data/GR.BFO..HHZ_2020-12-25_2020-12-31.hdf5')

print(dat1.amplitudes.shape)

In [None]:
dat2 = processing.ProcessedData()
dat2.from_file('../data/GR.BFO..HHZ_2021-01-01_2021-01-09.hdf5')

print(dat2.amplitudes.shape)
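
The expected day counts follow from the inclusive date range in each file name. A minimal sketch, assuming the `<code>_<start>_<end>.hdf5` naming used by the two files above; `days_in_file` is a hypothetical helper and not part of `data_quality_control`.

```python
from datetime import date
from pathlib import Path


def days_in_file(filename):
    """Hypothetical helper: number of days covered, read from the file name."""
    # The stem looks like 'GR.BFO..HHZ_2020-12-25_2020-12-31'.
    _, start, end = Path(filename).stem.rsplit("_", 2)
    # Both end points are included, hence the + 1.
    return (date.fromisoformat(end) - date.fromisoformat(start)).days + 1


print(days_in_file("../data/GR.BFO..HHZ_2020-12-25_2020-12-31.hdf5"))  # 7
print(days_in_file("../data/GR.BFO..HHZ_2021-01-01_2021-01-09.hdf5"))  # 9
```

If the first axis of `amplitudes` counts days, which is an assumption here, these values can be compared with `dat1.amplitudes.shape[0]` and `dat2.amplitudes.shape[0]`.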

# Missing files within requested time

We remove 2 days (4-5 Jan 2021) of data from the database.
Then we request data for 2021-01-02 to 2021-01-12. We start at 
2 Jan just to get a different file name from the previous test.

In [None]:
%mv ../sample_sds/2021/GR/BFO/HHZ.D/GR.BFO..HHZ.D.2021.00[45]* ../data

In [None]:
%ls ../sample_sds/*/*/*/*

In [None]:
startdate = UTC("2021-01-02")
enddate = UTC("2021-01-12")

In [None]:
processor.process(startdate, enddate,
                  overlap, winlen_in_s,
                  nperseg, fmin, fmax, '../data/', proclen)

In [None]:
%ls -lh ../data

Check if the correct number of days is in the file.
We now expect 1 day less than for `dat2` which started on 1 Jan 
instead of 2 Jan.

In [None]:
dat3 = processing.ProcessedData()
dat3.from_file('../data/GR.BFO..HHZ_2021-01-02_2021-01-09.hdf5')

print(dat3.amplitudes.shape)

Now we plot the amplitude arrays for the contiguous case (where no
files were missing in the database) and for the case where we removed 2 days.
The two missing days appear as NaNs (white color).

Note that we also get NaNs at the edges around the data gap because there
is no data to create the overlap between the files.
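
To locate the gaps numerically rather than by eye, one can look for NaNs along the day axis. The sketch below uses a synthetic array in place of `dat3.amplitudes`. The shape (8 days by 24 hourly windows) and the single NaN window on each side of the gap are assumptions chosen to mimic the description above, not values read from the file.

```python
import numpy as np

# Synthetic stand-in for dat3.amplitudes: 8 days (2-9 Jan) x 24 windows.
rng = np.random.default_rng(0)
amplitudes = rng.random((8, 24))
amplitudes[2:4, :] = np.nan   # 4 and 5 Jan: files missing
amplitudes[1, -1] = np.nan    # last window of 3 Jan: no overlap data (assumed)
amplitudes[4, 0] = np.nan     # first window of 6 Jan: no overlap data (assumed)

nan_mask = np.isnan(amplitudes)
# Rows that are entirely NaN correspond to days without any file.
missing_days = np.flatnonzero(nan_mask.all(axis=1))
# Rows with some, but not all, NaNs are days affected at the gap edges.
partial_days = np.flatnonzero(nan_mask.any(axis=1) & ~nan_mask.all(axis=1))

print(missing_days)  # [2 3]
print(partial_days)  # [1 4]
```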

In [None]:
fig, axs = plt.subplots(2,1, sharex=True)
datalabels = ['contiguous database', 'missing files']
for i, (datalabel, _dat) in enumerate(zip(datalabels,[dat2, dat3])):
    ax = axs[i]
    ax.set_title(datalabel)
    cax = ax.imshow(_dat.amplitudes, aspect='auto')
    labels = [l.date for l in
              np.arange(_dat.startdate, _dat.enddate+24*3600, 24*3600)]
    ax.set_yticks(np.arange(len(_dat.amplitudes)))
    ax.set_yticklabels(labels=labels);
plt.xlabel('hours');

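However, if we request processing for 2021-01-04 to 2021-01-12,
the output file starts only at 2021-01-06. The request now begins 
inside the gap, so the two missing days are treated as leading days 
without data and are trimmed instead of being filled with NaNs.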

In [None]:
startdate = UTC("2021-01-04")
enddate = UTC("2021-01-12")

In [None]:
processor.process(startdate, enddate,
                  overlap, winlen_in_s,
                  nperseg, fmin, fmax, '../data/', proclen)

In [None]:
dat4 = processing.ProcessedData()
dat4.from_file('../data/GR.BFO..HHZ_2021-01-06_2021-01-09.hdf5')

print(dat4.amplitudes.shape)
print(dat4.startdate)
print(dat4.enddate)

In [None]:
fig, axs = plt.subplots(2,1, sharex=True)
datalabels = ['start BEFORE missing files', 
              'start AT missing files']
for i, (datalabel, _dat) in enumerate(zip(datalabels,[dat3, dat4])):
    ax = axs[i]
    ax.set_title(datalabel)
    cax = ax.imshow(_dat.amplitudes, aspect='auto')
    labels = [l.date for l in
              np.arange(_dat.startdate, _dat.enddate+24*3600, 24*3600)]
    ax.set_yticks(np.arange(len(_dat.amplitudes)))
    ax.set_yticklabels(labels=labels);
plt.xlabel('hours');

Let's move those files back into the database.

In [None]:
%mv ../data/GR.BFO..HHZ.D.2021.00[45] ../sample_sds/2021/GR/BFO/HHZ.D/

In [None]:
%ls ../sample_sds/*/*/*/*