#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Mass Downloader for FDSN Compliant Web Services
===============================================
This package contains functionality to query and integrate data from any number
of `FDSN web service <https://www.fdsn.org/webservices/>`_ providers
simultaneously. The package aims to formulate download requests in a way that
is convenient for seismologists without having to worry about political and
technical data center issues. It can be used by itself or as a library
component integrated into a bigger project.
:copyright:
Lion Krischer (krischer@geophysik.uni-muenchen.de), 2014-2015
:license:
GNU Lesser General Public License, Version 3
(https://www.gnu.org/copyleft/lesser.html)
.. contents:: Contents
:local:
:depth: 2
Why Would You Want to Use This?
-------------------------------
Directly using the FDSN web services for example via the
:mod:`obspy.clients.fdsn` client is fine for small amounts of data but quickly
becomes cumbersome for larger data sets. Many data centers do provide tools to
easily download larger amounts of data but that is usually only from one data
center. Now most seismologists don't really care a lot where the data they
download originates - they just want the data for their use case and
oftentimes they want as much data as they can get. As the number of FDSN
compliant web services increases this becomes more and more cumbersome. That
is where this module comes in. You
1. specify the **geographical region** from which to download data,
2. define a number of **other restrictions** (temporal, data quality, ...),
3. and launch the download.
The mass downloader module will acquire all waveforms and associated station
information across all known FDSN web service implementations producing a
**clean data set** ready for further use. It works by
1. figuring out what stations each provider offers,
2. downloading MiniSEED and associated StationXML meta information in an
efficient and data center friendly manner, and
3. dealing with all the nasty real-world data issues like missing or incomplete
data or duplicate data across data centers, for example:
* Basic optional automatic quality control by assuring that the data has
no gaps/overlaps or is available for a certain percentage of the requested
time span.
* It can relaunch the download to acquire missing pieces, which might happen
for example if a data center has been offline.
* It can assure that there always is a corresponding StationXML file for the
waveforms.
Usage Examples
--------------
Before delving into the nitty-gritty details of how it works and why it does
things in a certain way, we'll demonstrate the usage of this module with two
annotated examples. They can serve as templates for your own needs.
Earthquake Data
~~~~~~~~~~~~~~~
The classic seismological data set consists of waveform recordings for a
certain earthquake. This example downloads all data it can find for the
Tohoku-Oki Earthquake from 5 minutes before the earthquake centroid time to 1
hour after. It will furthermore only download data with an epicentral distance
between 70.0 and 90.0 degrees and some additional restrictions. Be aware that
this example will attempt to download data from all FDSN data centers that
ObsPy knows of and combine it into one data set.
.. code-block:: python
import obspy
from obspy.clients.fdsn.mass_downloader import CircularDomain, \\
Restrictions, MassDownloader
origin_time = obspy.UTCDateTime(2011, 3, 11, 5, 47, 32)
# Circular domain around the epicenter. This will download all data between
# 70 and 90 degrees distance from the epicenter. This module also offers
# rectangular and global domains. More complex domains can be defined by
# inheriting from the Domain class.
domain = CircularDomain(latitude=37.52, longitude=143.04,
minradius=70.0, maxradius=90.0)
restrictions = Restrictions(
# Get data from 5 minutes before the event to one hour after the
# event. This defines the temporal bounds of the waveform data.
starttime=origin_time - 5 * 60,
endtime=origin_time + 3600,
# You might not want to deal with gaps in the data. If this setting is
# True, any trace with a gap/overlap will be discarded.
reject_channels_with_gaps=True,
# And you might only want waveforms that have data for at least 95 % of
# the requested time span. Any trace that is shorter than 95 % of the
# desired total duration will be discarded.
minimum_length=0.95,
# No two stations should be closer than 10 km to each other. This is
# useful, for example, to filter out stations that are part of different
# networks but at the same physical station. Setting this option to
# zero or None will disable that filtering.
minimum_interstation_distance_in_m=10E3,
# Only HH or BH channels. If a station has HH channels, those will be
# downloaded, otherwise the BH. Nothing will be downloaded if it has
# neither. You can add more/less patterns if you like.
channel_priorities=["HH[ZNE]", "BH[ZNE]"],
# Location codes are arbitrary and there is no rule as to which
# location is best. Same logic as for the previous setting.
location_priorities=["", "00", "10"])
# No specified providers will result in all known ones being queried.
mdl = MassDownloader()
# The data will be downloaded to the ``./waveforms/`` and ``./stations/``
# folders with automatically chosen file names.
mdl.download(domain, restrictions, mseed_storage="waveforms",
stationxml_storage="stations")
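
To make the ``minimum_length`` setting more concrete, the following standalone
sketch shows the kind of check it implies for the time window used above. It
is not the module's actual implementation: the helper name ``covers_enough``
and its arguments are hypothetical, and it only relies on the standard
library.

```python
from datetime import datetime, timedelta


def covers_enough(data_start, data_end, req_start, req_end, minimum_length):
    # Hypothetical helper: True if the data covers at least the given
    # fraction of the requested time span.
    requested = (req_end - req_start).total_seconds()
    # Only count the part of the data that lies inside the requested window.
    overlap_start = max(data_start, req_start)
    overlap_end = min(data_end, req_end)
    overlap = max((overlap_end - overlap_start).total_seconds(), 0.0)
    return overlap >= minimum_length * requested


origin_time = datetime(2011, 3, 11, 5, 47, 32)
req_start = origin_time - timedelta(minutes=5)
req_end = origin_time + timedelta(hours=1)

# A trace missing the last 10 minutes covers 3300 of the 3900 requested
# seconds, which is below the 95 % threshold.
short_trace_ok = covers_enough(req_start, req_end - timedelta(minutes=10),
                               req_start, req_end, 0.95)
# A trace spanning the whole window passes.
full_trace_ok = covers_enough(req_start, req_end, req_start, req_end, 0.95)
print(short_trace_ok, full_trace_ok)
```
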
Continuous Request
~~~~~~~~~~~~~~~~~~
Another use case requiring massive amounts of data is noise studies. Ambient
seismic noise correlations require continuous recordings from stations over a
large time span. This example downloads data, from within a certain
geographical domain, for a whole year. Individual MiniSEED files will be split
per day. The download helpers will attempt to optimize the queries to the data
centers and split up the files again if required.
.. code-block:: python
import obspy
from obspy.clients.fdsn.mass_downloader import RectangularDomain, \\
Restrictions, MassDownloader
# Rectangular domain containing parts of southern Germany.
domain = RectangularDomain(minlatitude=30, maxlatitude=50,
minlongitude=5, maxlongitude=35)
restrictions = Restrictions(
# Get data for a whole year.
starttime=obspy.UTCDateTime(2012, 1, 1),
endtime=obspy.UTCDateTime(2013, 1, 1),
# Chunk it to have one file per day.
chunklength_in_sec=86400,
# Considering the enormous amount of data associated with continuous
# requests, you might want to limit the data based on SEED identifiers.
# If the location code is specified, the location priority list is not
# used; the same is true for the channel argument and priority list.
network="BW", station="A*", location="", channel="EH*",
# The typical use case for such a data set is noise correlations where
# gaps are dealt with at a later stage.
reject_channels_with_gaps=False,
# Same is true with the minimum length. All data might be useful.
minimum_length=0.0,
# Guard against the same station having different names.
minimum_interstation_distance_in_m=100.0)
# Restrict the number of providers if you know which serve the desired
# data. If in doubt just don't specify - then all providers will be
# queried.
mdl = MassDownloader(providers=["LMU", "GFZ"])
mdl.download(domain, restrictions, mseed_storage="waveforms",
stationxml_storage="stations")
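
As a rough illustration of what ``chunklength_in_sec=86400`` means for the
request above, the sketch below tiles the requested year into daily intervals.
The helper ``split_into_chunks`` is hypothetical and only mirrors the idea; how
the module splits files internally may differ.

```python
from datetime import datetime, timedelta


def split_into_chunks(starttime, endtime, chunklength_in_sec):
    # Hypothetical helper: return (start, end) pairs that tile the interval.
    step = timedelta(seconds=chunklength_in_sec)
    chunks = []
    current = starttime
    while current < endtime:
        # The last chunk is cut off at the requested end time.
        chunks.append((current, min(current + step, endtime)))
        current += step
    return chunks


# 2012 is a leap year, so one chunk per day yields 366 chunks.
chunks = split_into_chunks(datetime(2012, 1, 1), datetime(2013, 1, 1), 86400)
print(len(chunks))
print(chunks[0])
print(chunks[-1])
```
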
Usage
-----
Using the download helpers requires the definition of three separate things,
all of which are detailed in the following paragraphs.
1. **Data Selection:** The data to be downloaded can be defined by enforcing
geographical or temporal constraints and a couple of other options.
2. **Storage Options:** Choosing where the final MiniSEED and StationXML files
should be stored.
3. **Start the Download:** Choose from which provider(s) to download and then
launch the downloading process.
Step 1: Data Selection
~~~~~~~~~~~~~~~~~~~~~~
Data set selection limits the download to data that is useful for the
purpose at hand. It is handled by two objects:
subclasses of the :class:`~obspy.clients.fdsn.mass_downloader.domain.Domain`
object and the
:class:`~obspy.clients.fdsn.mass_downloader.restrictions.Restrictions` class.
The :mod:`~obspy.clients.fdsn.mass_downloader.domain` module currently
defines three different domain types used to limit the geographical extent of
the queried data:
:class:`~obspy.clients.fdsn.mass_downloader.domain.RectangularDomain`,
:class:`~obspy.clients.fdsn.mass_downloader.domain.CircularDomain`, and
:class:`~obspy.clients.fdsn.mass_downloader.domain.GlobalDomain`. Subclassing
:class:`~obspy.clients.fdsn.mass_downloader.domain.Domain` enables the
construction of arbitrarily complex domains. Please see the
:mod:`~obspy.clients.fdsn.mass_downloader.domain` module for more details.
Instances of these classes will later be passed to the method that starts the
download. A rectangular domain for example is defined like this:
>>> from obspy.clients.fdsn.mass_downloader.domain import RectangularDomain
>>> domain = RectangularDomain(minlatitude=-10, maxlatitude=10,
... minlongitude=-10, maxlongitude=10)
Additional restrictions like temporal bounds, SEED identifier wildcards,
and other things are set with the help of
the :class:`~obspy.clients.fdsn.mass_downloader.restrictions.Restrictions`
class. Please refer to its documentation for a more detailed explanation of
the parameters.
>>> from obspy import UTCDateTime
>>> from obspy.clients.fdsn.mass_downloader import Restrictions
>>> restrict = Restrictions(
... starttime=UTCDateTime(2012, 1, 1),
... endtime=UTCDateTime(2012, 1, 1, 1),
... network=None, station=None, location=None, channel=None,
... reject_channels_with_gaps=True,
... minimum_length=0.9,
... minimum_interstation_distance_in_m=1000,
... channel_priorities=["HH[ZNE]", "BH[ZNE]"],
... location_priorities=["", "00", "01"])
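
The priority lists can be read as "take whatever matches the first pattern
that matches anything". The sketch below illustrates that reading with the
standard library's ``fnmatch`` module. It is an assumption about the behavior
described in the examples above, not the module's own code, and the helper
``apply_priorities`` is hypothetical.

```python
import fnmatch


def apply_priorities(available, priorities):
    # Hypothetical helper: return the entries matching the first pattern
    # that matches anything at all, or an empty list if no pattern matches.
    for pattern in priorities:
        matches = [item for item in available
                   if fnmatch.fnmatchcase(item, pattern)]
        if matches:
            return matches
    return []


channel_priorities = ["HH[ZNE]", "BH[ZNE]"]

# A station offering both HH and BH channels: only HH is kept.
both = apply_priorities(["BHZ", "BHN", "BHE", "HHZ", "HHN", "HHE"],
                        channel_priorities)
# A station without HH channels falls back to BH.
only_bh = apply_priorities(["BHZ", "BHN", "BHE", "LHZ"], channel_priorities)
# A station with neither yields nothing.
neither = apply_priorities(["LHZ", "EHZ"], channel_priorities)
print(both, only_bh, neither)
```

The same helper works for location codes, for example with the list
``["", "00", "01"]``, because an empty pattern only matches an empty code.
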
Step 2: Storage Options
~~~~~~~~~~~~~~~~~~~~~~~
After determining what to download, the helpers must know where to store the
requested data. That requires some flexibility in case the mass downloader
is to be integrated as a component into a bigger system. An example is
a toolbox that has a database to manage its data.
A major concern is to not download pre-existing data. In order to enable such
a use case the download helpers can be given functions that are evaluated when
determining the file names of the requested data. Depending on the return
value, the helper class will download the whole, part, or even none, of that
particular piece of data.
Storing MiniSEED waveforms
^^^^^^^^^^^^^^^^^^^^^^^^^^
The MiniSEED storage rules are set by the ``mseed_storage`` argument of the
:meth:`~obspy.clients.fdsn.mass_downloader.mass_downloader.MassDownloader.download`
method of the
:class:`~obspy.clients.fdsn.mass_downloader.mass_downloader.MassDownloader`
class.
**Option 1: Folder Name**
In the simplest case it is just a folder name:
>>> mseed_storage = "waveforms"
This will cause all MiniSEED files to be stored as
``waveforms/NETWORK.STATION.LOCATION.CHANNEL__STARTTIME__ENDTIME.mseed``.
An example of this is
``waveforms/BW.FURT..BHZ__20141027T163723Z__20141027T163733Z.mseed``
which is rather general but also quite long.
**Option 2: String Template**
For more control use the second possibility and provide a string containing
``{network}``, ``{station}``, ``{location}``, ``{channel}``, ``{starttime}``,
and ``{endtime}`` format specifiers. These values will be interpolated to
acquire the final filename. The start and end times will be formatted with
``strftime()`` with the specifier ``"%Y%m%dT%H%M%SZ"`` in an effort to
avoid colons which are troublesome in file names on many systems.
>>> mseed_storage = ("some_folder/{network}/{station}/"
... "{channel}.{location}.{starttime}.{endtime}.mseed")
results in
``some_folder/BW/FURT/BHZ..20141027T163723Z.20141027T163733Z.mseed``.
The download helpers will create any non-existing folders along the path.
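
The interpolation described above can be reproduced with plain string
formatting. The following sketch uses only the standard library; the helper
``fill_template`` is hypothetical and merely mimics the documented behavior
with the time format quoted above.

```python
from datetime import datetime

# Time format quoted in the text above.
TIME_FORMAT = "%Y%m%dT%H%M%SZ"


def fill_template(template, network, station, location, channel,
                  starttime, endtime):
    # Hypothetical helper: interpolate the SEED identifiers and the
    # formatted times into the template.
    return template.format(
        network=network, station=station, location=location,
        channel=channel,
        starttime=starttime.strftime(TIME_FORMAT),
        endtime=endtime.strftime(TIME_FORMAT))


template = ("some_folder/{network}/{station}/"
            "{channel}.{location}.{starttime}.{endtime}.mseed")
path = fill_template(template, "BW", "FURT", "", "BHZ",
                     datetime(2014, 10, 27, 16, 37, 23),
                     datetime(2014, 10, 27, 16, 37, 33))
print(path)
```
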
**Option 3: Custom Function**
The most complex but also most powerful possibility is to use a function which
will be evaluated to determine the filename. **If the function returns**
``True`` **, the MiniSEED file is assumed to already be available and will not
be downloaded again; keep in mind that in that case no station data will be
downloaded for that channel.** If it returns a string, the MiniSEED file will
be saved to that path. Utilize closures to use any other parameters in the
function. This hypothetical function checks if the file is already in a
database and otherwise returns a string which will be interpreted as a
filename.
>>> def get_mseed_storage(network, station, location, channel, starttime,
... endtime):
... # Returning True means that neither the data nor the StationXML file
... # will be downloaded.
... if is_in_db(network, station, location, channel, starttime, endtime):
... return True
... # If a string is returned the file will be saved in that location.
... return os.path.join(ROOT, "%s.%s.%s.%s.mseed" % (network, station,
... location, channel))
>>> mseed_storage = get_mseed_storage
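
The example above relies on ``is_in_db`` and ``ROOT`` being defined elsewhere.
One way to supply such values is a closure, as hinted at in the text. The
sketch below is a self-contained variant under stated assumptions: the factory
``make_mseed_storage`` is hypothetical, and a plain set of SEED identifiers
stands in for a real database lookup.

```python
import os


def make_mseed_storage(root, known_ids):
    # Hypothetical factory: ``known_ids`` is a set of (network, station,
    # location, channel) tuples standing in for a real database lookup.
    def get_mseed_storage(network, station, location, channel, starttime,
                          endtime):
        if (network, station, location, channel) in known_ids:
            # Already available: signal that nothing should be downloaded.
            return True
        # Otherwise return the path the file should be saved to.
        return os.path.join(root, "%s.%s.%s.%s.mseed" % (
            network, station, location, channel))
    return get_mseed_storage


mseed_storage = make_mseed_storage("waveforms", {("BW", "FURT", "", "BHZ")})

# The times are not used by this sketch, so None is passed for brevity.
# "ABC" is an arbitrary station code chosen for illustration.
known = mseed_storage("BW", "FURT", "", "BHZ", None, None)
new = mseed_storage("BW", "ABC", "", "BHZ", None, None)
print(known, new)
```
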
.. note::
No matter which approach is chosen, if a file already exists, it will not
be overwritten; it will be parsed and the download helper class will
attempt to download matching StationXML files.
Storing StationXML files
^^^^^^^^^^^^^^^^^^^^^^^^
The same logic applies to the StationXML files. This time the rules are set by
the ``stationxml_storage`` argument of the
:meth:`~obspy.clients.fdsn.mass_downloader.mass_downloader.MassDownloader.download`
method of the
:class:`~obspy.clients.fdsn.mass_downloader.mass_downloader.MassDownloader`
class. StationXML files will be downloaded on a per-station basis; thus all
channels and locations from one station will end up in the same StationXML
file.
**Option 1: Folder Name**
A simple string will be interpreted as a folder name. This example will save
the files to ``"stations/NETWORK.STATION.xml"``, e.g. to
``"stations/BW.FURT.xml"``.
>>> stationxml_storage = "stations"
**Option 2: String Template**
Another option is to provide a string formatting template, e.g.
>>> stationxml_storage = "some_folder/{network}/{station}.xml"
will write to ``"some_folder/NETWORK/STATION.xml"``, in this case for example
to ``"some_folder/BW/FURT.xml"``.
.. note::
If the StationXML file already exists, it will be opened to see what is in
the file. In case it does not contain all necessary channels, it will be
deleted and **only those channels needed in the current run will be
downloaded again**. Pass a custom function to the ``stationxml_storage``
argument if you require different behavior, as documented in the
following section.
**Option 3: Custom Function**
As with the waveform data, the StationXML paths can also be set with the help
of a function. The function in this case is a bit more complex than for the
waveform case. It has to return a dictionary with three keys:
``"available_channels"``, ``"missing_channels"``, and ``"filename"``.
``"available_channels"`` is a list of channels that are already available as
station information and that require no new download. Make sure to include all
already available channels; this information is later used to discard
MiniSEED files that have no corresponding station information.
``"missing_channels"`` is a list of channels for that particular station that
must be downloaded and ``"filename"`` determines where to save these. Please
note that in this particular case the StationXML file will be overwritten if it
already exists and only the ``"missing_channels"`` will be downloaded to it,
independent of what already exists in the file.
Alternatively the function can also return a string, in which case the
behaviour is the same as for the first two options of the
``stationxml_storage`` argument.
The next example illustrates a complex use case where the availability of each
channel's station information is queried in some database and only those
channels that do not exist yet will be downloaded. Use closures to pass more
arguments to the function.
>>> def get_stationxml_storage(network, station, channels, starttime, endtime):
... available_channels = []
... missing_channels = []
... for location, channel in channels:
... if is_in_db(network, station, location, channel, starttime,
... endtime):
... available_channels.append((location, channel))
... else:
... missing_channels.append((location, channel))
... filename = os.path.join(ROOT, "%s.%s.xml" % (network, station))
... return {
... "available_channels": available_channels,
... "missing_channels": missing_channels,
... "filename": filename}
>>> stationxml_storage = get_stationxml_storage
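
As with the waveform case, the example above assumes that ``is_in_db`` and
``ROOT`` exist. A closure-based variant that can be run on its own might look
like the sketch below. The factory ``make_stationxml_storage`` is hypothetical
and a set of identifiers again stands in for the database.

```python
import os


def make_stationxml_storage(root, known_channels):
    # Hypothetical factory: ``known_channels`` is a set of (network,
    # station, location, channel) tuples standing in for a database.
    def get_stationxml_storage(network, station, channels, starttime,
                               endtime):
        available, missing = [], []
        for location, channel in channels:
            key = (network, station, location, channel)
            # Sort each requested channel into one of the two lists.
            target = available if key in known_channels else missing
            target.append((location, channel))
        return {
            "available_channels": available,
            "missing_channels": missing,
            "filename": os.path.join(root, "%s.%s.xml" % (network, station))}
    return get_stationxml_storage


stationxml_storage = make_stationxml_storage(
    "stations", {("BW", "FURT", "", "BHZ")})

# The times are not used by this sketch, so None is passed for brevity.
result = stationxml_storage("BW", "FURT", [("", "BHZ"), ("", "BHN")],
                            None, None)
print(result)
```
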
Step 3: Start the Download
~~~~~~~~~~~~~~~~~~~~~~~~~~
The final step is to actually start the download. Pass the previously created
domain, restrictions, and path settings and off you go. Two more parameters of
interest are ``chunk_size_in_mb``, which controls how much data is requested
per thread, client, and request, and ``threads_per_client``, which controls how
many threads are used to download data in parallel per data center; 3 is a
value in agreement with some data centers.
>>> mdl = MassDownloader() # doctest: +SKIP
>>> mdl.download(domain, restrictions, chunk_size_in_mb=50,
... threads_per_client=3, mseed_storage=mseed_storage,
... stationxml_storage=stationxml_storage) # doctest: +SKIP
How it Works
------------
At a high level the mass downloader works by looping over each FDSN web service
and downloading whatever it offers. A bit more detail:
1. Loop over all passed or known FDSN web service implementations and
auto-discover if they are available and what they can do. If an
implementation has a ``dataselect`` and a ``station`` service it will be
part of the following steps. Otherwise it will be discarded.
2. For each web service client:
a) Request the availability for the given time and domain settings. It will
request a text file from the ``station`` service at the channel level. If
the service supports the ``matchtimeseries`` parameter it will be used
and the availability is considered to be *"reliable"* for the further
stages.
b) Channel and location priorities are applied resulting in a single
instrument per station.
c) Any already existing network + station combinations are discarded.
d) If the availability for the particular client is considered reliable it
will perform the minimum distance filtering now. If no stations have
already been downloaded it will select the largest subset of stations
satisfying the minimum interstation distance constraint. Otherwise it
will successively add new stations with the largest distance to the
closest already existing station until no more stations satisfying the
minimum distance remain. This results in the maximum possible number of
chosen stations satisfying the constraints.
e) Download the MiniSEED data - this is threaded and it will use a bulk
request honoring the desired ``chunk_size_in_mb`` setting. Afterwards it
splits the MiniSEED files again to match the desired restrictions. The
split happens at the record level thus no information available in the
original MiniSEED records is lost.
f) Any MiniSEED files not fulfilling the minimum length or no gap/overlap
restrictions will be deleted, as will faulty MiniSEED files.
g) For each downloaded MiniSEED file: Download the corresponding StationXML
file at the response level.
h) If the ``sanitize`` argument of the Restrictions object is ``True``,
delete all MiniSEED files for which no station information could be
downloaded. This is a useful setting if you want a clean data set.
i) If the availability information is not reliable, perform the minimum
interstation distance filtering now. This is a bit unfortunate but many
clients return pretty terrible availability information (or interpret
the ``station`` service differently) so there is no way around that for
now.
j) Rinse and repeat for all remaining FDSN web service implementations.
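
The minimum interstation distance filtering described in the list above can be
sketched for the case where some stations already exist. The code below is a
simplified illustration, not the module's implementation: the function names
are hypothetical, distances are computed with the haversine formula on a
spherical Earth of radius 6371 km, and the selection of the largest subset
when no stations exist yet is not covered.

```python
import math


def distance_in_m(a, b):
    # Great-circle distance between two (latitude, longitude) pairs given in
    # degrees, using the haversine formula on a sphere of radius 6371 km.
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2.0) ** 2 +
         math.cos(lat1) * math.cos(lat2) *
         math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * 6371000.0 * math.asin(math.sqrt(h))


def add_distant_stations(existing, candidates, minimum_distance_in_m):
    # Repeatedly accept the candidate that is farthest from its closest
    # already accepted station until none satisfies the minimum distance.
    if not existing:
        raise ValueError("this sketch needs at least one existing station")
    accepted = list(existing)
    remaining = list(candidates)
    while remaining:
        # Distance of each candidate to its closest accepted station.
        closest = [min(distance_in_m(c, s) for s in accepted)
                   for c in remaining]
        best = max(range(len(remaining)), key=closest.__getitem__)
        if closest[best] < minimum_distance_in_m:
            break
        accepted.append(remaining.pop(best))
    return accepted


existing = [(48.0, 11.0)]
# The first candidate is less than 100 m away from the existing station.
candidates = [(48.0, 11.001), (48.0, 12.0), (49.0, 11.0)]
selected = add_distant_stations(existing, candidates, 10E3)
print(selected)
```

The candidate next to the existing station is rejected, while the two distant
ones are accepted in order of decreasing distance to their closest neighbor.
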
Logging
-------
The download helpers utilize Python's `logging facilities
<https://docs.python.org/3/library/logging.html>`__. By default they will log to
stdout at the ``logging.INFO`` level which provides a fair amount of detail. If
you want to change the log level or set up a different stream handler, just get
the corresponding logger after you import the download helpers module:
>>> import logging
>>> logger = logging.getLogger("obspy.clients.fdsn.mass_downloader")
>>> logger.setLevel(logging.DEBUG) # doctest: +SKIP
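
Setting up a different handler only needs the standard ``logging`` API. The
sketch below attaches an in-memory stream handler to the logger named above
and emits a message by hand to show where the output ends up. The format
string is an arbitrary choice, and whether the module's default handler stays
active alongside the new one is not addressed here.

```python
import io
import logging

# Logger name as given in the text above.
logger = logging.getLogger("obspy.clients.fdsn.mass_downloader")
logger.setLevel(logging.DEBUG)

# Collect the messages in memory; a ``logging.FileHandler`` would write them
# to a file instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)

# Emit a message by hand to show where the output ends up.
logger.info("example message")
print(stream.getvalue())
```
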
Authentication
--------------
To make the mass downloader work for restricted data, just initialize it
with existing :class:`~obspy.clients.fdsn.client.Client` instances that have
credentials. Note that you can mix already initialized clients with varying
credentials and plain names of FDSN services to query.
>>> from obspy.clients.fdsn import Client
>>> client_orfeus = Client("ORFEUS", user="random", password="some_pw")
>>> client_eth = Client("ETH", user="from_me", password="to_you")
>>> mdl = MassDownloader(providers=[client_orfeus, "IRIS", client_eth]) \
# doctest: +SKIP
Further Documentation
---------------------
Further functionality of this module is documented in a couple of other places:
* :mod:`~.domain` module
* :class:`~.restrictions.Restrictions` class
* :class:`~.mass_downloader.MassDownloader` class
"""
import warnings
from obspy.core.util.base import SCIPY_VERSION
# Convenience imports.
from .mass_downloader import MassDownloader # NOQA
from .restrictions import Restrictions # NOQA
from .domain import (Domain, RectangularDomain, # NOQA
CircularDomain, GlobalDomain) # NOQA
__all__ = ['MassDownloader', 'Restrictions', 'Domain', 'RectangularDomain',
'CircularDomain', 'GlobalDomain']
if __name__ == '__main__':
import doctest
doctest.testmod(exclude_empty=True)