# Constructing an Event Dataset from CSS3.0 database tables 
## Prof. Gary L. Pavlis, Dept. Earth and Atmos. Sci., Indiana University


# 1. Overview
The purpose of this tutorial is to show how to use MsPASS to assemble a working data set using catalog data from an Antelope (CSS3.0) database as a template.  This is one example of many workflows that seismologists may want to use to assemble a working data set for processing by MsPASS.   We use it as an example because (a) it was a data set the author wanted to assemble, (b) it exercises many of the tools seismologists would need to assemble a data set, and (c) it illustrates some of the idiosyncrasies of real data downloaded from IRIS and the FDSN.  

The data set we will build in this exercise consists of waveform segments appropriate for estimation of so-called "receiver functions" from teleseismic P wave data.  We will be exploiting a useful but underutilized data set produced by Earthscope's USArray facility: analyst picks of teleseismic P waves recorded by USArray.   The concept I'm using here is that if the analyst could observe the phase well enough to be confident in measuring the P wave arrival time, the waveform is a candidate for receiver function processing.   We can thus use the picks as a screen to reduce the amount of debris from data with a signal so small that there is no way it can work for the processing we aim to accomplish.   An alternative workflow for accomplishing that prefiltering task is to use an automated procedure to screen junk later in the processing chain.  That would be feasible in MsPASS, but is not the approach used in this tutorial. 

The data set we will be using in this exercise was assembled with two tools the reader may already be familiar with:  (1) obspy's FDSN web service functions, and (2) Antelope's database tools.   As we use them I will explain how these data were assembled for your education, but the idea of this tutorial is that most of the data you will need for this exercise is already assembled for you to load.  The key educational objective is to help you understand how MongoDB can be used to manage raw data and what it takes to define what we call a working input data set for MsPASS.  

# 2. Creating An Empty MongoDB Database
We first need to create an instance of a MongoDB database into which we will be writing the data for this exercise.  For this tutorial I need to assume two things:
1.  You have a local instance of MongoDB running and accessible from the machine on which you are running this tutorial.  If that is not true, you need to follow the procedure described on this page: <https://github.com/wangyinz/mspass/wiki/Using-MsPASS-with-Docker> for docker or this page: <https://github.com/wangyinz/mspass/wiki/Using-MsPASS-with-Singularity-(on-HPC)> for singularity.
2.  We assume the database we will use, which we will call *raw_data_tutorial*, does not exist.   If you reenter this tutorial or something goes wrong and the tutorial database gets clobbered, we show below how to clear the contents quickly.

With that (long winded) background, if you satisfy the assumptions above run the following:

In [1]:
from pymongo import MongoClient
from mspasspy.db.database import Database
from mspasspy.db.client import Client
client=Client()
db=Database(client,'raw_data_tutorial')

The variable *db* is now a top level handle to MongoDB.   It is an abstraction that should be viewed as a handle to the "database" with the name "raw_data_tutorial".  Under this top level handle are "collections", which are MongoDB's equivalent to a *relation* (or equivalently a *table*) in a relational database.   You can see what collections are defined running the following at any time:

In [4]:
colnames=db.list_collections()
for nm in colnames:
    print(nm)

{'name': 'site', 'type': 'collection', 'options': {}, 'info': {'readOnly': False, 'uuid': UUID('0e908145-050f-4655-a99c-4bd25787840a')}, 'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_', 'ns': 'raw_data_tutorial.site'}}
{'name': 'arrival', 'type': 'collection', 'options': {}, 'info': {'readOnly': False, 'uuid': UUID('9d1b0af8-4110-441d-a5dd-22d7b728adba')}, 'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_', 'ns': 'raw_data_tutorial.arrival'}}
{'name': 'wf_miniseed', 'type': 'collection', 'options': {}, 'info': {'readOnly': False, 'uuid': UUID('d383524c-1ce0-4da3-8bb2-ac5c84f9238b')}, 'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_', 'ns': 'raw_data_tutorial.wf_miniseed'}}
{'name': 'channel', 'type': 'collection', 'options': {}, 'info': {'readOnly': False, 'uuid': UUID('e5087c60-4fbd-493e-a0ac-fab4869cda9f')}, 'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_', 'ns': 'raw_data_tutorial.channel'}}
{'name': 'source', 'type': 'collection', 'options': {}, 'info': {'readOnly

For the present you should get nothing if you run that cell.  Later, after we start adding data, you can rerun that cell and watch the list grow as we add data to multiple "collections".  If you get output other than an error message, the database already has content, and you can clear it completely with this (very dangerous) command:

In [18]:
client.drop_database('raw_data_tutorial')
client=Client()
db=Database(client,'raw_data_tutorial')

where we have repeated the initialization of db as above.   
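
The raw output of `list_collections` is verbose. As a more compact check, the following is a minimal sketch that reports only collection names and document counts. It assumes `db` behaves like a pymongo `Database` (so that `list_collection_names` and `count_documents` are available); the helper name `collection_summary` is hypothetical and not part of MsPASS.

```python
def collection_summary(db):
    """Return {collection_name: document_count} for a pymongo-style database handle.

    Assumption: db provides list_collection_names() and db[name].count_documents(),
    as a pymongo Database does.
    """
    summary = {}
    for name in sorted(db.list_collection_names()):
        # An empty filter {} counts every document in the collection
        summary[name] = db[name].count_documents({})
    return summary
```

With the handle created above, `collection_summary(db)` should be empty for a fresh database and should grow as collections such as *site* and *channel* are added later.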

# 3. Retrieving raw data with web services
Over the past three decades IRIS developed a large suite of methods to retrieve seismic data from their archives.  Many are still available, and without doubt the best alternative to the methods described here is their "BREQ_FAST" request mechanism.   For the past decade, however, IRIS DMC has moved toward modern web services as their recommended data retrieval method.   We used obspy's FDSN mass download function to accomplish this objective.  (See <https://docs.obspy.org/packages/obspy.clients.fdsn.html> for a starting point on their documentation of this methodology.)  

You are welcome to use our actual procedure to download the data in this tutorial by running the python scripts supplied with this tutorial.  From the directory where you launched this tutorial, cd to the directory *download_scripts* and run `python download2012.py`.   If you choose that route, however, you will need to be patient: when I ran this on a completely unloaded machine on a gigabit network it took about 5 days to complete.   Because that step is so time consuming, we supply all the required files that process would yield in the directory where you are running this tutorial.   We do, however, supply only a small fraction of the actual waveform data, as the size of that request is the main reason this takes 5 days - when I ran this script it returned approximately 0.65 TB.  

Since I doubt many readers will want to wait 5 days and consume over half a terabyte of disk to run this tutorial, we supply the following that we extracted from the output of that procedure:
1. We supply miniseed format waveform data for the first 10 events the download2012.py procedure will retrieve.   You could recreate this in far less time if you hacked the script to stop after 10 events or just killed it after a few hours, but I recommend you start by looking in the directory wf/2012.   There you should see a set of "event" files.   Those files are not small because they were produced by concatenating the miniseed files obspy downloads.  I did that to reduce the (absurd) number of files the original obspy script creates: obspy's function writes one file per channel per event.   Since the query I used in this script (all broadband stations running in 2012 within a rectangle around the lower 48 states of the US) yields over 1700 stations and around 5000 channels, that gets out of control very quickly.  The data for that one year alone amount to around 5 million files.  That is bad for a whole lot of reasons, from simple things like many unix commands failing on file lists for a directory that large, to truly terrible performance problems on HPC systems (e.g. I found that using the unix du command to measure the data volume of the directory chain took hours to complete on an HPC disk farm using the Lustre file system at Indiana).   We can use the concatenated miniseed files without any issues because of the way miniseed files are constructed and how they interact with obspy's reader, which we will be using later.  
2.  The procedure we will be using is driven by events obspy downloads in the XML format defined by the FDSN for web services.  The file downloaded is `./eventdata/events_2012.xml`.  Note we will load that data, but override it later using an alternative event catalog.  I include it for completeness and hope it will not prove too confusing.  It usefully emphasizes the lesson that in passive data handling we often have to face the issue of multiple source parameters for the same seismic "event".   In this exercise we will have conflicting source information from two sources:  (a) the FDSN download that produced this xml file, and (b) the event catalog constructed by Array Network Facility analysts.   
3.  During downloading obspy dynamically assembles station metadata in a custom scheme used by the download procedure in the download2012.py script.  Some of the details of how they (obspy) handle this can be found in their documentation, but a key point for this tutorial is you will not be able to reproduce the station metadata we supply here exactly unless you run the download2012.py script to completion. We supply the output in the directory site_2012.dir, which is the name their script creates and populates with the files you will find there.  That directory contains 1712 small files.  There is one file for each SEED network-station code combination in the downloaded data.  Each file is an FDSN web service xml (StationXML) representation of station metadata.  Seismologists unfamiliar with StationXML should realize it is simply a different format to contain the same information previously stored in the SEED blockettes used to define station metadata.  That information is also sometimes stored as "dataless seed" files (e.g. the Array Network Facility distributes these through their web site at <https://anf.ucsd.edu/data/products/dataless_sta/>).   
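
The concatenation mentioned in item 1 works because a miniseed file is a sequence of self-describing data records, so appending the bytes of one file to another yields another valid miniseed file. The following is a minimal sketch of that idea, not the script actually used to build the files in wf/2012; the function name and the file names in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_files(input_paths, output_path):
    """Append the raw bytes of each input file to output_path, in the order given.

    Returns the size in bytes of the combined file.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Stream the copy so large files are never held fully in memory
                shutil.copyfileobj(src, out)
    return Path(output_path).stat().st_size
```

For example, `concatenate_files(sorted(Path("some_event_dir").glob("*.mseed")), "event_1.mseed")` would merge every per-channel file for one event into a single file, where both names are placeholders.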

The next three sections describe how we assemble complete metadata for the tutorial data set we will build and load into MongoDB here.  The sections are headed by a useful classification of the type of data involved:  (a) receiver metadata, (b) source metadata, and (c) waveform data and metadata. 

# 4. Assembling the pieces
## 4.1. Receiver metadata
With the tools we presently supply this is by far the easiest thing to do.   For this tutorial, just run this simple set of commands:

In [21]:
from obspy import read_inventory
inv=read_inventory('data/raw_data_tutorial/site_2012.dir/*.xml')
db.save_inventory(inv)
inv=read_inventory('data/raw_data_tutorial/site_2011.dir/*.xml')
db.save_inventory(inv)

Data in loc code section overrides station section
Station section coordinates:   37.91886 -122.15179 0.21969999999999998
loc code section coordinates:   37.91881 -122.15176 0.2189
Data in loc code section overrides station section
Station section coordinates:   36.3887 -121.5514 0.542
loc code section coordinates:   36.3887 -121.5514 0.5395
Data in loc code section overrides station section
Station section coordinates:   40.8161 -121.46117 1.0092999999999999
loc code section coordinates:   40.8161 -121.46117 1.0086
Data in loc code section overrides station section
Station section coordinates:   36.68011 -119.02282 1.14
loc code section coordinates:   36.68011 -119.02282 1.1375
Data in loc code section overrides station section
Station section coordinates:   35.63597 -120.86984 0.4168
loc code section coordinates:   35.63597 -120.86984 0.4143
Data in loc code section overrides station section
Station section coordinates:   39.2291 -121.7861 0.252
loc code section coordinates:   39.229

Data in loc code section overrides station section
Station section coordinates:   43.7337 -96.6141 0.478
loc code section coordinates:   43.7337 -96.6141 0.43289999999999995
Data in loc code section overrides station section
Station section coordinates:   33.4112 -83.4666 0.1165
loc code section coordinates:   33.4112 -83.4666 0.083
Data in loc code section overrides station section
Station section coordinates:   39.1009 -96.6094 0.317
loc code section coordinates:   39.1009 -96.6094 0.2963
Data in loc code section overrides station section
Station section coordinates:   39.11083 -110.52383 1.804
loc code section coordinates:   39.110828 -110.523827 1.804
Data in loc code section overrides station section
Station section coordinates:   44.6837 -122.1862 0.6573
loc code section coordinates:   44.683701 -122.186203 0.652
Data in loc code section overrides station section
Station section coordinates:   47.56413 -122.82498 0.22
loc code section coordinates:   47.564129 -122.824982 0.22
Dat

Data in loc code section overrides station section
Station section coordinates:   40.4314 -117.221001 1.594
loc code section coordinates:   40.4314 -117.220993 1.594
Data in loc code section overrides station section
Station section coordinates:   39.999395 -76.348888 0.106
loc code section coordinates:   39.999199 -76.350601 0.091
Data in loc code section overrides station section
Station section coordinates:   43.973468 -74.22307 0.575
loc code section coordinates:   43.9734 -74.222801 0.575
Data in loc code section overrides station section
Station section coordinates:   41.0056 -73.9079 0.066
loc code section coordinates:   41.0056 -73.907898 0.066
Data in loc code section overrides station section
Station section coordinates:   42.335 -71.1705 0.06
loc code section coordinates:   42.334999 -71.170502 0.06
Data in loc code section overrides station section
Station section coordinates:   41.917 -71.5378 0.116
loc code section coordinates:   41.917 -71.537804 0.116
Data in loc code s

(380, 1500, 1569, 4700)

You will notice this generates a fair number of warnings.  Those warnings are intentional and should be inspected if you repeat this process for other data.   In all cases here they seem to arise from a combination of two issues: (1) many stations have a mismatch in the precision of location information stored in the station and channel sections of the xml data, and (2) some of the mismatches between the station and channel sections reflect a real geometric difference (e.g. RSSD loc code 00 is a borehole instrument so the elevation field differs from the station location, which is at surface elevation).  
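
To make the distinction between the two kinds of warning concrete, here is a minimal sketch that sorts a station/loc-code coordinate pair into "rounding" or "geometric". The tolerances are arbitrary choices made for illustration, the third value is assumed to be elevation in km (which is what the printed values suggest), and the function is not part of MsPASS.

```python
def classify_mismatch(station_xyz, loc_xyz, tol_deg=1e-5, tol_elev_km=1e-3):
    """Classify a (lat, lon, elev) mismatch as 'rounding' or 'geometric'.

    tol_deg and tol_elev_km are hypothetical thresholds, roughly 1 m each.
    """
    dlat = abs(station_xyz[0] - loc_xyz[0])
    dlon = abs(station_xyz[1] - loc_xyz[1])
    delev = abs(station_xyz[2] - loc_xyz[2])
    if dlat <= tol_deg and dlon <= tol_deg and delev <= tol_elev_km:
        return "rounding"
    return "geometric"


# Coordinate pairs copied from the warning output above
pairs = [
    ((39.11083, -110.52383, 1.804), (39.110828, -110.523827, 1.804)),
    ((43.7337, -96.6141, 0.478), (43.7337, -96.6141, 0.43289999999999995)),
]
for station, loc in pairs:
    print(station, loc, classify_mismatch(station, loc))
```

The first pair differs only in the number of digits stored, while the second has an elevation difference of about 45 m, which is the kind of offset a borehole sensor would produce.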

Notice for this process we are using obspy's read_inventory function to crack the station xml files and turn them into something we can manipulate.  That something is their Inventory object (<https://docs.obspy.org/packages/autogen/obspy.core.inventory.inventory.Inventory.html#obspy.core.inventory.inventory.Inventory>) which is pretty much a python translation of the station xml.  The save_inventory method of our Database class translates the Inventory data into the structure MongoDB requires to manage the data we just saved.   In MongoDB lingo that information is a "document".  We write one document in a collection called *site* for each "station" defined by a unique SEED network (net) and station (sta) code.  Each station always has one or more "channels" associated with it.  In SEED each channel is defined by two additional keys:  a channel (chan) code, and a location (loc) code.  SEED thus indexes a single channel of seismic data with a key defined by all four codes:  net, sta, chan, and loc.  In MsPASS we map the Channel data assembled in the obspy Inventory object to produce one or more MongoDB documents we write to a collection we call *channel*. 

A further huge complication is that station and channel metadata can be, and often are, time variable.   The most common example is a sensor change where the response data and/or the orientation of the instrument can change.  Furthermore, all channels operate only over a finite time range.  Hence, the *channel* collection commonly contains multiple documents for some channels.   There are no examples in this tutorial data with a *site* having multiple documents for the same SEED net-sta keys, but the schema we use allows that by using the time interval defined by attributes *starttime* and *endtime* as an additional constraint.   (Note using net-sta-operational_time as a unique site identifier was borrowed completely from CSS3.0, which uses that same concept.)  
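
The time-interval constraint can be sketched as follows. The first function builds a MongoDB query dict that selects the channel document valid at a given epoch time; the second applies the same test to a plain list of dicts so the logic can be seen without a server. The key names `chan` and `loc` are assumptions based on the codes named above, times are assumed to be stored as epoch seconds, and both function names are hypothetical.

```python
def channel_query(net, sta, chan, loc, epoch_time):
    """Build a MongoDB query selecting the channel document valid at epoch_time."""
    return {
        "net": net,
        "sta": sta,
        "chan": chan,
        "loc": loc,
        # Document is valid if starttime <= epoch_time <= endtime
        "starttime": {"$lte": epoch_time},
        "endtime": {"$gte": epoch_time},
    }


def find_channel_doc(docs, net, sta, chan, loc, epoch_time):
    """Pure-Python equivalent of channel_query applied to a list of dicts."""
    for doc in docs:
        same_key = (doc["net"], doc["sta"], doc["chan"], doc["loc"]) == (net, sta, chan, loc)
        if same_key and doc["starttime"] <= epoch_time <= doc["endtime"]:
            return doc
    return None
```

Under these assumptions, something like `db.channel.find_one(channel_query(net, sta, chan, loc, t))` would return the single document describing that channel at time `t`, even when a sensor change has produced several documents for the same four codes.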

To clarify this further, you can run the following optional code block to get a basic report of the contents of the *site* collection, including station location information.  You can also easily alter the report to print other attributes using the interactive feature of jupyter.

In [22]:
def print_station_report(db):
    """
    Simple example function to print a basic report on contents of the site collection.

    :param db:  is a mspasspy.db Database object assumed pointed at an operating MongoDB server
    """
    dbsite=db.site
    curs=dbsite.find({})
    print('net sta lat lon elev')
    for doc in curs:
        net=doc['net']
        sta=doc['sta']
        lat=doc['latitude']
        lon=doc['longitude']
        elev=doc['elevation']
        print(net,sta,lat,lon,elev)

print_station_report(db)
        

net sta lat lon elev
2G IUGFS 45.728371 -111.970802 1.6337000000000002
7D FS01B 40.326801 -124.949203 -0.94
7D FS05B 40.3866 -124.899597 -2.316
7D FS06B 40.381199 -124.785301 -2.198
7D FS09B 40.438702 -124.808502 -2.161
7D FS14B 40.495499 -124.591797 -0.107
7D G02B 40.048599 -125.296898 -1.92
7D G03A 40.059101 -126.162498 -4.113
7D G03B 40.057999 -126.163399 -4.051
7D G04B 40.058498 -126.928497 -4.368
7D G05B 40.070599 -127.747803 -4.462
7D G10B 40.677898 -125.553299 -2.936
7D G11B 40.6875 -126.376404 -3.123
7D G12B 40.686901 -127.228897 -3.08
7D G13B 40.683998 -128.028305 -3.215
7D G19B 41.3074 -125.773598 -3.071
7D G20B 41.298698 -126.613297 -3.141
7D G21B 41.316002 -127.453697 -3.156
7D G22B 41.3092 -128.274002 -3.038
7D G27B 41.916599 -126.016701 -3.48
7D G28B 41.942799 -126.733902 -3.327
7D G29B 41.9772 -127.483398 -3.197
7D G30A 41.955002 -128.319305 -3.124
7D G30B 41.956501 -128.319794 -3.119
7D G35B 42.5658 -126.055801 -2.367
7D G36B2 42.5994 -126.903801 -2.423
7D G37B 42.59130

TA M40A 41.405998 -91.5121 0.223
TA M41A 41.375 -90.542198 0.226
TA M42A 41.454601 -89.758102 0.212
TA M43A 41.436501 -88.958099 0.19
TA M44A 41.388199 -88.043198 0.207
TA M45A 41.3881 -87.250397 0.216
TA M46A 41.407902 -86.352402 0.242
TA M47A 41.359402 -85.621399 0.283
TA M48A 41.4846 -84.717102 0.258
TA M49A 41.474499 -83.975197 0.203
TA M50A 41.4035 -83.042801 0.176
TA M51A 41.3321 -82.183098 0.239
TA M54A 41.5079 -79.664703 0.488
TA M65A 41.562 -70.646599 0.022
TA MDND 47.848099 -99.602898 0.479
TA MSTX 33.969601 -102.7724 1.167
TA N02D 40.974 -122.705 0.937
TA N23A 40.894699 -105.944 2.458
TA N33A 40.7384 -97.4506 0.475
TA N34A 40.837799 -96.500504 0.401
TA N35A 40.861198 -95.642502 0.353
TA N36A 40.815601 -94.960403 0.349
TA N37A 40.758202 -94.209503 0.351
TA N38A 40.793098 -93.235001 0.317
TA N39A 40.877602 -92.502296 0.26
TA N40A 40.884102 -91.583702 0.208
TA N41A 40.707699 -90.855202 0.226
TA N42A 40.828999 -90.0345 0.205
TA N43A 40.9394 -89.1735 0.215
TA N44A 40.7953 -88.133

CI MUR 33.599991 -117.195427 0.562
CI MWC 34.22362 -118.05832 1.725
CI NEE2 34.76759 -114.618813 0.271
CI OSI 34.614498 -118.723503 0.718
CI PASC 34.17141 -118.18523 0.341
CI PHL 35.40773 -120.54556 0.355
CI RCT 36.305229 -119.243843 0.107
CI RPV 33.743462 -118.404121 0.107
CI SBC 34.440762 -119.71492 0.094
CI SCI2 32.9799 -118.546967 0.199
CI SDD 33.552589 -117.661713 0.12
CI SHO 35.8996 -116.27515 0.451
CI SMM 35.3142 -119.99581 0.599
CI SNCC 33.247871 -119.524368 0.275
CI SVD 34.106468 -117.098221 0.605
CI SWS 32.94508 -115.79988 0.14
CI USC 34.019192 -118.286308 0.058
CI VCS 34.48372 -118.11781 0.992
CN BMSB 48.8356 -125.1355 0.01
CN SNB 48.7751 -123.1723 0.402
IU CCM 38.0557 -91.2446 0.1715
IU HKT 29.9618 -95.8384 -0.863
IU WVT 36.1297 -87.83 0.17
LD BMNY 44.83987 -74.5065 0.115
LD LUPA 40.5987 -75.3718 0.255
LD MVL 39.999199 -76.350601 0.091
NE YLE 41.3165 -72.9208 0.005
NM SIUC 37.714802 -89.2174 0.12
PN PPBNL 38.882999 -86.450996 0.203
PN PPPHS 40.861 -86.494003 0.223
TA 034A 2

A related point is that save_inventory also created a similar collection called *channel* that stores additional information tied to a specific channel.   The *channel* collection is always much larger than *site* because today most if not all instruments record three-component data and so have at least three channels.  Channel documents also contain instrument response data and orientation information, so instrumentation changes add still more documents to *channel*.  

To see that here is a similar summary report for the *channel* collection:

In [23]:
def print_channel_report(db):
    """
    Simple example function to print a basic report on contents of the site and channel collections.
    The example uses the low level MongoDB API using json.  
    :param db:  is a mspasspy.db Database object assumed pointed at an operating MongoDB server
    """
    dbsite=db.site
    dbchannel=db.channel
    curs=dbsite.find({})
    print('net sta Number_channels')
    for doc in curs:
        net=doc['net']
        sta=doc['sta']
        query=dict()
        query['net']=net
        query['sta']=sta
        n=dbchannel.count_documents(query)
        print(net,sta,n)

print_channel_report(db)

net sta Number_channels
2G IUGFS 3
7D FS01B 3
7D FS05B 3
7D FS06B 3
7D FS09B 3
7D FS14B 3
7D G02B 3
7D G03A 3
7D G03B 3
7D G04B 3
7D G05B 3
7D G10B 3
7D G11B 3
7D G12B 3
7D G13B 3
7D G19B 3
7D G20B 3
7D G21B 3
7D G22B 3
7D G27B 3
7D G28B 3
7D G29B 3
7D G30A 3
7D G30B 3
7D G35B 3
7D G36B2 3
7D G37B 3
7D J06A 3
7D J06B 3
7D J09B 3
7D J10B 3
7D J11B 3
7D J18B 3
7D J19B 3
7D J20B 3
7D J23A 3
7D J23B 3
7D J25A 3
7D J27B 3
7D J28A 3
7D J28B 3
7D J29A 3
7D J30A 3
7D J31A 3
7D J33A 3
7D J35A 3
7D J36A 3
7D J37A 3
7D J38A 3
7D J39A 3
7D J43A 3
7D J44A 3
7D J45A 3
7D J46A 3
7D J47A 3
7D J48A 3
7D J48B 3
7D J52A 3
7D J53A 3
7D J54A 3
7D J55A 3
7D J57A 3
7D J61A 3
7D J63A 3
7D J63B 3
7D J65A 3
7D J67A 3
7D J68A 3
7D J73A 3
7D M01A 3
7D M02A 3
7D M07A 3
7D M08A 3
7D M11B 3
7D M12B 3
7D M14B 3
AE 113A 3
AE 319A 3
AE U15A 3
AE W13A 3
AE X16A 6
AE X18A 3
AE Y14A 3
AZ BZN 3
AZ CPE 6
AZ CRY 6
AZ FRD 3
AZ HWB 3
AZ KNW 3
AZ LVA2 6
AZ MONP2 6
AZ PFO 6
AZ RDM 6
AZ SCI2 3
AZ SMER 3
AZ SND 3
AZ SOL 3
AZ TRO 3

TA N45A 3
TA N46A 3
TA N47A 3
TA N48A 3
TA N49A 3
TA N50A 3
TA N51A 3
TA N53A 3
TA N54A 3
TA N55A 3
TA N59A 3
TA O02D 3
TA O03D 3
TA O03E 3
TA O20A 3
TA O33A 3
TA O34A 3
TA O35A 3
TA O36A 3
TA O37A 3
TA O38A 3
TA O39A 3
TA O40A 3
TA O41A 3
TA O42A 3
TA O43A 3
TA O44A 3
TA O45A 3
TA O47A 3
TA O48A 3
TA O49A 3
TA O50A 3
TA O51A 3
TA O52A 3
TA O53A 3
TA O54A 3
TA O55A 3
TA O56A 3
TA P34A 3
TA P35A 3
TA P36A 3
TA P37A 3
TA P38A 3
TA P39B 3
TA P40A 3
TA P41A 3
TA P42A 3
TA P43A 3
TA P44A 3
TA P45A 3
TA P46A 3
TA P47A 6
TA P48A 3
TA P49A 3
TA P50A 3
TA P51A 3
TA P52A 3
TA P53A 3
TA P55A 3
TA Q24A 3
TA Q34A 3
TA Q35A 3
TA Q36A 3
TA Q37A 3
TA Q38A 3
TA Q39A 3
TA Q40A 3
TA Q41A 3
TA Q42A 6
TA Q43A 3
TA Q44A 3
TA Q45A 3
TA Q46A 3
TA Q47A 3
TA Q48A 3
TA Q49A 3
TA Q50A 6
TA Q51A 3
TA Q52A 3
TA Q55A 3
TA R11A 3
TA R34A 3
TA R35A 3
TA R36A 3
TA R37A 3
TA R38A 3
TA R39A 3
TA R40A 3
TA R41A 3
TA R42A 3
TA R43A 3
TA R44A 3
TA R45A 3
TA R46A 3
TA R47A 3
TA R48A 3
TA R49A 3
TA R50A 3
TA R51A 3
TA R52A 3


7A W10 3
AR 113A 3
AR 319A 3
AR U15A 3
AR W13A 3
AR X18A 3
AR Y14A 3
AZ FRD 3
BK BDM 6
BK BKS 6
BK GASB 6
BK WDC 6
BK YBH 8
CI ADO 3
CI BAK 3
CI BBR 3
CI BC3 3
CI BFS 3
CI CHF 3
CI CIA 6
CI CWC 6
CI DEC 6
CI DGR 3
CI DJJ 3
CI EDW2 3
CI GLA 3
CI GMR 3
CI GRA 3
CI IBP 3
CI IKP 6
CI ISA 3
CI LRL 3
CI MLAC 3
CI MPM 6
CI MUR 6
CI MWC 3
CI NEE2 3
CI OSI 4
CI PASC 9
CI PHL 3
CI RCT 3
CI RPV 6
CI SBC 3
CI SCI2 3
CI SDD 6
CI SHO 6
CI SMM 3
CI SNCC 3
CI SVD 6
CI SWS 6
CI USC 6
CI VCS 6
CN BMSB 3
CN SNB 3
IU CCM 6
IU HKT 6
IU WVT 6
LD BMNY 3
LD LUPA 3
LD MVL 3
NE YLE 6
NM SIUC 6
PN PPBNL 1
PN PPPHS 1
TA 034A 3
TA 035A 3
TA 035Z 3
TA 130A 3
TA 131A 3
TA 133A 3
TA 134A 3
TA 135A 3
TA 230A 3
TA 231A 3
TA 232A 3
TA 233A 3
TA 234A 3
TA 238A 3
TA 239A 3
TA 330A 3
TA 331A 3
TA 332A 3
TA 333A 3
TA 334A 3
TA 335A 3
TA 336A 3
TA 337A 3
TA 338A 3
TA 339A 3
TA 340A 3
TA 431A 3
TA 432A 3
TA 433A 3
TA 434A 3
TA 436A 3
TA 437A 3
TA 438A 6
TA 439A 3
TA 440A 3
TA 530A 3
TA 531A 3
TA 532A 3
TA 533A 3
TA 534A 3
TA 

Notice that most of the stations have exactly 3 channels, but a few have a different number, most often 6. Those with only 3 had no configuration changes in the time period and only one sensor attached.  Stations with a number greater than 3 in the table above have one or both of two issues MsPASS handles:  (1) the instrumentation changed, altering the response and/or orientation, and/or (2) the station has multiple sensors attached (e.g. IU:ANMO) with different "location" codes.  
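
To tell which of the two causes applies to a given station, one can group its channel documents by location code and by operational time span. The sketch below does that for a list of dicts such as a query on the *channel* collection might return; the key names `chan`, `loc`, `starttime`, and `endtime` are assumptions, and the function is illustrative rather than part of MsPASS.

```python
from collections import defaultdict


def explain_channel_count(channel_docs):
    """Summarize one station's channel documents by loc code and time span."""
    loc_codes = set()
    epochs_per_channel = defaultdict(set)
    for doc in channel_docs:
        loc_codes.add(doc["loc"])
        # Collect the distinct time spans recorded for each (loc, chan) pair
        epochs_per_channel[(doc["loc"], doc["chan"])].add((doc["starttime"], doc["endtime"]))
    return {
        "n_docs": len(channel_docs),
        "loc_codes": sorted(loc_codes),
        "multiple_loc_codes": len(loc_codes) > 1,
        "multiple_epochs": any(len(spans) > 1 for spans in epochs_per_channel.values()),
    }
```

A station with 6 documents and `multiple_epochs` set would point to an instrumentation change, while `multiple_loc_codes` would point to more than one sensor at the site.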

A deficiency in the data currently stored with this tutorial is that the stationxml files created by obspy that we used above are incomplete.   For the purpose of this exercise it doesn't matter, but for a full workflow you would want at this stage to be sure your station metadata are complete.  We leave that process as an exercise for the student because different people have different needs with regard to this issue.

A final point concerns the Database object in MsPASS and the save_inventory method.  That method can be called repeatedly with duplicate data and the function will assure pure duplicates are not added to the site or channel collections. That makes it easy to add collections of stationxml files downloaded independently without worries about adding clutter.  That situation is common, for example, when downloading data like the TA by year; each download run will produce a new pile of stationxml files with many duplicates. To demonstrate this, run this block again, which will attempt to save the same Inventory object read earlier a second time.

In [9]:
db.save_inventory(inv)

Data in loc code section overrides station section
Station section coordinates:   37.91886 -122.15179 0.21969999999999998
loc code section coordinates:   37.91881 -122.15176 0.2189
Data in loc code section overrides station section
Station section coordinates:   36.3887 -121.5514 0.542
loc code section coordinates:   36.3887 -121.5514 0.5395
Data in loc code section overrides station section
Station section coordinates:   40.8161 -121.46117 1.0092999999999999
loc code section coordinates:   40.8161 -121.46117 1.0086
Data in loc code section overrides station section
Station section coordinates:   36.68011 -119.02282 1.14
loc code section coordinates:   36.68011 -119.02282 1.1375
Data in loc code section overrides station section
Station section coordinates:   35.63597 -120.86984 0.4168
loc code section coordinates:   35.63597 -120.86984 0.4143
Data in loc code section overrides station section
Station section coordinates:   39.2291 -121.7861 0.252
loc code section coordinates:   39.229

Data in loc code section overrides station section
Station section coordinates:   44.5646 -69.6617 0.05
loc code section coordinates:   44.564602 -69.661697 0.05
Data in loc code section overrides station section
Station section coordinates:   43.7337 -96.6141 0.478
loc code section coordinates:   43.7337 -96.6141 0.43289999999999995
Data in loc code section overrides station section
Station section coordinates:   33.4112 -83.4666 0.1165
loc code section coordinates:   33.4112 -83.4666 0.083
Data in loc code section overrides station section
Station section coordinates:   39.1009 -96.6094 0.317
loc code section coordinates:   39.1009 -96.6094 0.2963
Data in loc code section overrides station section
Station section coordinates:   39.11083 -110.52383 1.804
loc code section coordinates:   39.110828 -110.523827 1.804
Data in loc code section overrides station section
Station section coordinates:   44.6837 -122.1862 0.6573
loc code section coordinates:   44.683701 -122.186203 0.652
Data in

(0, 0, 1712, 5148)

To see the key point you will need to scroll to the bottom of the output to bypass the warning messages.  The summary should say that 0 site and 0 channel records were saved.  
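
The duplicate screening described above can be pictured as a keyed lookup before each insert. The following is a conceptual sketch only, using a plain list in place of a MongoDB collection; it is not how save_inventory is implemented, and the choice of key fields is an assumption modeled on the net-sta-time identifier discussed earlier.

```python
def save_unique(collection, docs, key_fields=("net", "sta", "starttime", "endtime")):
    """Append each doc to collection (a list) unless one with the same key exists.

    Returns the number of documents actually added.
    """
    seen = {tuple(d[k] for k in key_fields) for d in collection}
    n_saved = 0
    for doc in docs:
        key = tuple(doc[k] for k in key_fields)
        if key in seen:
            continue  # pure duplicate: skip it
        collection.append(doc)
        seen.add(key)
        n_saved += 1
    return n_saved
```

Calling this twice with the same input adds documents the first time and returns 0 the second time, which mirrors the zero counts reported when the same Inventory is saved again.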

## 4.2. Source metadata 
### 4.2.1. Extracting data from QuakeML files
MsPASS has a similar, simple method to import event (source) data stored as a file in the so-called QuakeML format used by FDSN web services (<https://www.fdsn.org/webservices/>).  Obspy has a direct translation of the XML format file to a python object they call a Catalog (<https://docs.obspy.org/packages/autogen/obspy.core.event.Catalog.html#obspy.core.event.Catalog>) object.  You can load the QuakeML data file for this tutorial with a pair of commands very similar to those used above to read the station metadata:

In [24]:
from obspy import read_events
#db.drop_collection('source')
cat=read_events('data/raw_data_tutorial/eventdata/events_2012.xml',format='QUAKEML')
n=db.save_catalog(cat)
print('Number of events stored by save_catalog in source collection=',n)

Number of events stored by save_catalog in source collection= 961


Note the method returns the number of documents added, which the last line prints with an explanation.  In MsPASS we store event information in a separate collection we call *source*.

It is important to realize one key difference between this method and save_inventory.   Unlike seismic instruments, which are fixed objects, an earthquake location is intrinsically fuzzy.  Multiple institutions often estimate the location from a different mix of instruments.   The CSS3.0 schema has a fairly elaborate way to handle this problem that causes major bookkeeping headaches for all seismic network operators or any experiment that requires preparing an event bulletin (meaning location estimates and a set of measured arrival times).   In MsPASS we treat that problem as a preprocessing issue that we will need to address below.   That is, in the next section we will add a second set of earthquake location estimates to our *source* collection and use a method to define which document in *source* to link to other documents.  A WARNING, however: you should not rerun save_catalog on data from the same input file.  It will silently insert exact duplicates of the first run. 
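
To illustrate what having multiple estimates of the same "event" means in practice, here is a minimal sketch that pairs events from two catalogs when their origin times and epicenters agree within tolerances. It is not the MsPASS method used later in this tutorial; the key names `time`, `lat`, and `lon`, the tolerances, and the function names are all assumptions chosen for illustration.

```python
import math


def great_circle_deg(lat1, lon1, lat2, lon2):
    """Angular distance in degrees between two points (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def match_events(catalog_a, catalog_b, max_dt=10.0, max_deg=1.0):
    """Pair each event in catalog_a with the closest-in-time event in catalog_b.

    An event in catalog_b qualifies only if its origin time is within max_dt
    seconds and its epicenter within max_deg degrees (hypothetical tolerances).
    """
    pairs = []
    for a in catalog_a:
        best = None
        for b in catalog_b:
            dt = abs(a["time"] - b["time"])
            if dt > max_dt:
                continue
            if great_circle_deg(a["lat"], a["lon"], b["lat"], b["lon"]) > max_deg:
                continue
            if best is None or dt < best[0]:
                best = (dt, b)
        if best is not None:
            pairs.append((a, best[1]))
    return pairs
```

Events left unpaired by a screen like this are those for which only one of the two catalogs has a location estimate.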

### 4.2.2. Import data from Antelope (CSS3.0 database)
#### 4.2.2.1. Background
Antelope is a software package developed and supported by a small company called Boulder Real Time Technologies (BRTT).   A component of Antelope is a specialized relational database package sometimes called Datascope because the current version is a descendant of an open source package with that name developed in the 1990s.   There are several key points about Antelope that are important to recognize as a preface to this tutorial:
- Although Antelope is a commercial package it is widely available and used in the US because of a generous licensing arrangement by BRTT.  Thus most readers of this tutorial in the US can obtain Antelope at no cost.  If that is you, then you can recreate the steps here and create the catalog file we will be using yourself.  For others we supply an export file that can be used directly.  If you are the latter, you need only skim this section to get an idea of what the file is and how it was generated. 
- Antelope is used by a number of regional seismic networks, notably the Alaska and Nevada networks.   More importantly to many seismologists is the fact that Antelope was the processing system used to manage the real time data from the USArray project at the Array Network Facility.   Important data relevant to this tutorial is the set of monthly catalogs of data distributed by ANF at this location:  <https://anf.ucsd.edu/tools/events/>.  In this tutorial we will be using data from the USArray recorded in 2012.   The unpacked tar files are found in a directory *anf* in the folder where you launched this tutorial.   I used a shell script found in a different directory, *catalog_data*, to use Antelope database command line tools to create a file *./catalog_data/usarray_tele2012.txt*. That script is found in *./catalog_data/export2012.csh* and for reference is this:
```
#!/bin/csh
set yr=2012
set outfile="usarray_tele${yr}.txt"
#rm $outfile
set dbdir="../anf"
foreach mon (01 02 03 04 05 06 07 08 09 10 11 12)
  set dbname=usarray_${yr}_${mon}
  set dbpath=${dbdir}/events_${dbname}/$dbname
  echo Working on database $dbname
  dbjoin ${dbpath}.event origin \
  | dbsubset - "orid==prefor" \
  | dbjoin - netmag \
  | dbsubset - "magnitude>5.0" \
  | dbjoin - assoc arrival \
  | dbsubset - "assoc.delta>30.0 && assoc.delta<95.0" \
  | dbselect - event.evid origin.lat origin.lon origin.depth origin.time origin.mb origin.ms assoc.sta assoc.phase arrival.iphase assoc.delta assoc.seaz assoc.esaz assoc.timeres arrival.time arrival.deltim \
  >> $outfile
end
```
- Although that shell script is not elegant it accomplishes the objective I was seeking.  It produces the main attributes I need from the css3.0 database for defining the data set we will build in this tutorial.   That is, a subset of teleseismic events recorded by USArray in 2012.   In particular, our objective is to produce a data set appropriate for P wave receiver function analysis.   We select only larger events (the *magnitude>5.0* dbsubset command) and arrivals measured at stations between 30 and 95 degrees epicentral distance (the dbsubset query using *assoc.delta*).  We use Antelope's dbselect command to extract only attributes we might need for this analysis. You should recognize the argument list to dbselect does not extract all attributes possible, but the specific list there is intimately linked to a step below when we parse the output of this script (the file *./catalog_data/usarray_tele2012.txt*).  
- The approach we use here can and should be thought of as a general way to export data in an Antelope css3.0 database to MsPASS.  Here the emphasis is "catalog" data, but the same general idea could be applied to any relational database view that can be constructed with Antelope tools.   This particular example is a custom export to match python functions that are distributed with MsPASS and which we will be using below.
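
To make the link between the dbselect argument list and the later parsing step concrete, here is a minimal sketch of reading one row of such an export. It assumes the fields are whitespace-delimited and appear in the order given to dbselect in the script above. The sample row is made up for illustration and is not taken from the real file. The MsPASS function used later in this tutorial is the real parser; this sketch makes no claim about its internals.

```python
# Attribute names in the order passed to dbselect in the script above
COLUMNS = [
    "event.evid", "origin.lat", "origin.lon", "origin.depth", "origin.time",
    "origin.mb", "origin.ms", "assoc.sta", "assoc.phase", "arrival.iphase",
    "assoc.delta", "assoc.seaz", "assoc.esaz", "assoc.timeres",
    "arrival.time", "arrival.deltim",
]
# Columns that hold text rather than numbers
STRING_COLUMNS = {"assoc.sta", "assoc.phase", "arrival.iphase"}

def parse_export_row(line):
    """Split one whitespace-delimited row into a dict keyed by attribute name."""
    tokens = line.split()
    if len(tokens) != len(COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(tokens)))
    row = {}
    for name, token in zip(COLUMNS, tokens):
        if name in STRING_COLUMNS:
            row[name] = token
        elif name == "event.evid":
            row[name] = int(token)
        else:
            row[name] = float(token)
    return row

# A made-up row for illustration only (not taken from the real file)
sample = ("12345 -20.5 -175.3 10.0 1325376000.0 5.6 5.8 "
          "PFO P P 82.4 235.1 48.7 0.35 1325376730.2 0.1")
row = parse_export_row(sample)
print(row["assoc.sta"], row["assoc.phase"], row["assoc.delta"])
```

Changing the dbselect argument list without changing `COLUMNS` in the same way would silently scramble the attributes, which is why the two are described above as intimately linked.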

#### 4.2.2.2. The *arrival* collection
Previously we added data to collections called *site*, *channel*, and *source*.   Those three collections contain fundamental data comparable to geometry in seismic reflection processing and are required information for most processing work flows with MsPASS.   For raw processing, however, we need an additional concept that is not as universal as those three core concepts.   That new concept is what seismologists universally call an "arrival".  We define an *arrival document* as a set of parametric data measured in a time window of data that defines a particular "seismic phase".  For this tutorial we will be working with arrivals defined by teleseismic P waves.  The main parametric data of interest for this tutorial will be the arrival time of P measured by an analyst at the Array Network Facility.   In the script above, that is the CSS3.0 attribute with the tag "arrival.time".  

We first want to load the output from the shell script described in the previous section.   Note compared to what we did previously, this data set is huge.  The table that script generates has 353,482 rows.  Run the following script to load that data in this tutorial:

In [25]:
from mspasspy.preprocessing.css30.dbarrival import load_css30_arrivals
valret=load_css30_arrivals(db,'data/raw_data_tutorial/catalog_data/usarray_tele2012.txt')

We can confirm this worked and inserted the number we expected with this pair of commands.

In [26]:
dbarrival=db.arrival
print('Number of arrivals added=number of documents in arrival=',dbarrival.count_documents({}))


Number of arrivals added=number of documents in arrival= 353482


#### 4.2.2.3. CSS3.0 Limitation History and Solution - background you may elect to skip
The CSS3.0 schema was developed in the late 1970s when storage and computer speeds were orders of magnitude smaller than today.  At the time few had the vision to think the number of seismic instruments on the planet would grow as large as it has.  The schema relied on the concept of an alphanumeric "station code", which had a long history in seismology as a filing system for analog records.  Hence, it was natural to use station codes as an index for many database tables.  One of them was the arrival table, which is what we are working to translate here. The catalog tables (event, origin, assoc, and arrival) use the station code as a primary join key.  The data we just loaded extracted that field in the shell script above as assoc.sta, which is the copy of the station code in the CSS3.0 assoc table.

In the mid 1980s, with the birth of the Global Seismographic Network and IRIS, the global seismic community developed the Standard for the Exchange of Earthquake Data (SEED) format.  The committee that drafted the standard recognized that a station code alone created a problem by limiting the namespace to define a unique key to identify a single channel of seismic data.  They elected to define a four part key that has constrained earthquake data management since that decision was made.  In MsPASS the four parts of that key have the abbreviations:  net, sta, chan, and loc.  (Obspy refers to these in the more verbose form network, station, channel, and location.)  The SEED standard created a collision with the CSS3.0 schema standard because CSS3.0 did not use the concept of a network or location code.   The developers of the original Datascope package, the ancestor of Antelope's database, had a significant code base that utilized the CSS3.0 schema - notably dbpick written by Danny Harvey that remains a workhorse in Antelope. Hence, Datascope did not use a revised schema but instead built their tools directly on the CSS3.0 schema.  As their code base grew it became too difficult for them to move to a different schema that would handle SEED data more easily.  They instead chose a band-aid solution that had several elements:
1.  The station code tag was treated as a unique identifier in all tables as defined in CSS3.0
2.  Similarly the channel (chan) code was treated as a unique identifier for each seismic channel as defined in CSS3.0
3.  If a SEED (miniseed) file was read that created a name collision with the station code (i.e. two or more stations with the same station code but a different network code) the name of all secondary stations encountered would use an altered station code.  (e.g. in this data we have net=BK and sta=HELL and net=TA and sta=HELL.  If TA:HELL had been handled earlier BK:HELL would be tagged as HELLBK.  Similarly if BK:HELL were handled first TA:HELL would be treated as HELLTA.)   
4.  Many observatory style seismic stations have multiple sensors at the same location.  GSN stations, for example, all (at least nearly all) have 2 or more sensors.   Unfortunately, the SEED standard also adopted rigid rules about how channel codes should be defined that further limited the namespace (<https://ds.iris.edu/ds/nodes/dmc/data/formats/seed-channel-naming/>).  Hence, we have only a small namespace for channel names like BHZ, HHZ, BH1, HHE, etc.  A SEED defined channel with that restriction will always be ambiguous if a seismic station has more than one sensor.   e.g. Many GSN stations have a BHZ channel defined for location codes 00 and 01.  To adapt CSS3.0 to this ambiguity the authors of Datascope elected to use similar rules as for the station information.  The first channel code with a unique name would drop the loc code and subsequent appearances would include the loc code.  
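
The renaming rule in item 3 can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not Datascope's actual code: the first network seen for a station code keeps the bare code, and any later network with the same code gets the net code appended.

```python
def assign_css_sta_names(net_sta_pairs):
    """Map (net, sta) pairs to CSS3.0 style station names.

    The first network seen for a station code keeps the bare code; later
    networks with the same code get the net code appended.
    """
    first_net = {}  # sta -> first net code seen for that sta
    names = {}
    for net, sta in net_sta_pairs:
        if sta not in first_net:
            first_net[sta] = net
            names[(net, sta)] = sta
        elif first_net[sta] == net:
            names[(net, sta)] = sta
        else:
            names[(net, sta)] = sta + net
    return names

# Order matters: whichever network is encountered first keeps the plain name
print(assign_css_sta_names([("TA", "HELL"), ("BK", "HELL")]))
print(assign_css_sta_names([("BK", "HELL"), ("TA", "HELL")]))
```

The order dependence is the important point: the same pair of stations can end up with different names in two databases built from the same data in a different order.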

### 4.3. Resolving arrival ambiguities
The point of the admittedly long-winded discussion above about CSS3.0 limitations and the net:sta:chan:loc codes in SEED is that the arrival data we just loaded currently have an ambiguous relationship with the data we stored in the site and channel collections.  To see that, run the cell below:

In [27]:
from mspasspy.preprocessing.css30.dbarrival import find_duplicate_sta
dups=find_duplicate_sta(db)
print('There are ',len(dups),' stations in site with ambiguous sta codes')
print('List of station with multiple net codes for that sta code')
for k in dups:
    print(k,dups[k])

There are  84  stations in site with ambiguous sta codes
List of station with multiple net codes for that sta code
G30A {'TA', '7D'}
J25A {'TA', '7D'}
J28A {'TA', '7D'}
J29A {'TA', '7D'}
J30A {'TA', '7D'}
J31A {'TA', '7D'}
J33A {'TA', '7D'}
J35A {'TA', '7D'}
J36A {'TA', '7D'}
J37A {'TA', '7D'}
J38A {'TA', '7D'}
J39A {'TA', '7D'}
J43A {'TA', '7D'}
J45A {'TA', '7D'}
J46A {'TA', '7D'}
J47A {'TA', '7D'}
J48A {'TA', '7D'}
J52A {'TA', '7D'}
J54A {'TA', '7D'}
J55A {'TA', '7D'}
113A {'AE', 'AR'}
319A {'AE', 'AR'}
U15A {'AE', 'AR'}
W13A {'AE', 'AR'}
X18A {'AE', 'AR'}
Y14A {'AE', 'AR'}
PFO {'AZ', 'II'}
SCI2 {'AZ', 'CI'}
PER {'Y5', 'CI'}
WES {'NE', 'CI'}
CCRK {'UW', 'HW'}
DDRF {'UW', 'HW'}
PHIN {'UW', 'HW'}
NCB {'LD', 'US'}
U32A {'TA', 'OK'}
X37A {'TA', 'OK'}
A01 {'YX', 'XD'}
A02 {'YX', 'XD'}
A03 {'YX', 'XD'}
A04 {'YX', 'XD'}
A05 {'YX', 'XD'}
A06 {'YX', 'XD'}
A08 {'YX', 'XD'}
A09 {'YX', 'XD'}
A10 {'YX', 'XD'}
A11 {'YX', 'XD'}
A13 {'YX', 'XD'}
A14 {'YX', 'XD'}
A15 {'YX', 'XD'}
A26A {'TA', 'XD'}
B0

The above shows we have 84 potentially ambiguous station codes.  That small function, however, just looks at site.  The more relevant output for the problem at hand, which is to unambiguously associate arrival times we loaded in arrival with the right documents in site and channel, comes from this related function:

In [28]:
from mspasspy.preprocessing.css30.dbarrival import check_for_ambiguous_sta
# this is using the dups list created above and extracting only the sta code into a python list
stalist=list()
for k in dups:
    stalist.append(k)
retval=check_for_ambiguous_sta(db,stalist)

station count
G30A 0
J25A 0
J28A 0
J29A 0
J30A 0
J31A 190
J33A 209
J35A 161
J36A 518
J37A 512
J38A 481
J39A 761
J43A 645
J45A 87
J46A 133
J47A 178
J48A 149
J52A 45
J54A 7
J55A 4
113A 0
319A 0
U15A 0
W13A 0
X18A 0
Y14A 0
PFO 983
SCI2 449
PER 0
WES 0
CCRK 0
DDRF 0
PHIN 0
NCB 0
U32A 80
X37A 87
A01 0
A02 0
A03 0
A04 0
A05 0
A06 0
A08 0
A09 0
A10 0
A11 0
A13 0
A14 0
A15 0
A26A 0
B01 0
B02 0
C02 0
SC04 0
D02 0
D03 0
D04 0
D05 0
D06 0
D07 0
D08 0
D09 0
D10 0
D11 0
D12 0
D13 0
D14 0
D15 0
D17 0
D18 0
D20 0
D21 0
A07 0
B03 0
W01 0
W02 0
W03 0
W04 0
W05 0
W06 0
W07 0
W08 0
W09 0
W10 0


This shows we can ignore some stations like A01 that have no picks in arrival, but we have a big issue with stations like PFO with hundreds of picks in arrival.   
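
The two checks just run amount to a grouping step followed by a counting step. The sketch below reproduces that logic on small in-memory lists so the idea is visible without a database. It is an illustration only and does not reproduce the MsPASS functions used above; the documents are made up, and only the `net` and `sta` keys are used.

```python
from collections import Counter, defaultdict

def duplicate_sta_codes(site_docs):
    """Return {sta: set of net codes} for sta codes used by more than one net."""
    nets = defaultdict(set)
    for doc in site_docs:
        nets[doc["sta"]].add(doc["net"])
    return {sta: n for sta, n in nets.items() if len(n) > 1}

def count_picks(arrival_docs, stalist):
    """Count arrival documents for each station code in stalist."""
    counts = Counter(doc["sta"] for doc in arrival_docs)
    return {sta: counts.get(sta, 0) for sta in stalist}

# Made-up documents for illustration
demo_site = [
    {"net": "AZ", "sta": "PFO"}, {"net": "II", "sta": "PFO"},
    {"net": "YX", "sta": "A01"}, {"net": "XD", "sta": "A01"},
    {"net": "TA", "sta": "120A"},
]
demo_arrival = [{"sta": "PFO"}, {"sta": "PFO"}, {"sta": "120A"}]

dups_demo = duplicate_sta_codes(demo_site)
pick_counts = count_picks(demo_arrival, sorted(dups_demo))
print(pick_counts)
```

Only ambiguous codes with a nonzero pick count need further attention.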

We can patch part of this problem using a nonstandard Datascope (Antelope) table I obtained directly from ANF, which, unfortunately, is not distributed from the ANF web site.  The reasons for that are complicated and irrelevant, but for now the key data is in the file snetsta/usarray.snetsta.  The snetsta table is a patch method Antelope uses to work around the station ambiguity problem.  Here are a few selected lines from that file that show how we use it:
```
TA       934A   934A    1503120361.98014
TA       834A   834A    1503120736.28809
TA       835A   835A    1503120736.92130
AE       X16A   X16AAE  1600860674.47829
AZ       BZN    BZN     1600860676.10732
AZ       CPE    CPE     1600860677.78964
```
In this table the net code is column 1, the SEED station code is column 2, and the sta code we can expect to find in arrival is column 3.  Column 4 is an internal time stamp used in Datascope that defines the epoch time when that tuple was last updated.  The key tuple to note is the 4th one in this list.  Note net AE and sta X16A are aliased to X16AAE.   The next step will exploit this to fix part of our ambiguity issue.

In [29]:
from mspasspy.preprocessing.css30.dbarrival import (parse_snetsta,set_netcode_snetsta)
staxref=parse_snetsta('data/raw_data_tutorial/snetsta/usarray.snetsta')
print('Example station cross reference created from snetsta for sta=120A:',staxref['120A'])
print('Using staxref to set net codes in arrival collection')
rettuple=set_netcode_snetsta(db,staxref)
print('Number of arrival documents scanned=',rettuple[0])
print('Number of arrival documents updated=',rettuple[1])

Example station cross reference created from snetsta for sta=120A: {'fsta': '120A', 'net': 'TA'}
Using staxref to set net codes in arrival collection
353482
Number of arrival documents scanned= 353482
Number of arrival documents updated= 98149
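
For readers curious about what the cross reference holds, here is a minimal sketch of building a similar dictionary from snetsta rows like those listed above. It assumes whitespace-delimited columns in the order net, SEED station code, Antelope sta, time stamp, and it mimics the `fsta` and `net` keys seen in the printed example. It is not the implementation of `parse_snetsta`.

```python
def build_sta_xref(lines):
    """Build {antelope_sta: {'fsta': seed_sta, 'net': net}} from snetsta rows."""
    xref = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) < 3:
            continue  # skip blank or malformed rows
        net, fsta, sta = tokens[0], tokens[1], tokens[2]
        xref[sta] = {"fsta": fsta, "net": net}
    return xref

# Rows copied from the excerpt shown earlier in this section
rows = [
    "TA       934A   934A    1503120361.98014",
    "AE       X16A   X16AAE  1600860674.47829",
    "AZ       BZN    BZN     1600860676.10732",
]
xref = build_sta_xref(rows)
# The aliased name X16AAE maps back to SEED station X16A in network AE
print(xref["X16AAE"])
```

Keying the dictionary on the Antelope sta name is what makes the lookup from an arrival document direct, since arrival stores the aliased name.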


I intentionally didn't print all the components of the rettuple.   We will use the 3rd component of that tuple in the next step.  First, let's see what it contains:

In [30]:
stations_not_found=rettuple[2]
print('Number of stations with unset net code=',len(stations_not_found))
print(stations_not_found)

Number of stations with unset net code= 666
{'R35A', 'F39A', 'O47A', '554A', 'D31A', 'I39A', 'I37A', 'F49A', 'K42A', 'TYNO', 'W39A', 'E47A', 'X52A', 'L40A', 'U41A', 'L38A', '152A', 'BASO', 'TASM', 'C37A', 'X49A', 'GLMI', 'L48A', 'EPYK', '541A', 'I43A', 'G42A', 'Y46A', 'LATQ', 'Y51A', 'I35A', 'H55A', '153A', 'TFRD', 'W42A', 'V35A', 'F32A', 'K43A', 'Z43A', 'U44A', 'H31A', '450A', '149A', 'U45A', 'J01E', 'D04E', '252A', 'J54A', 'X50A', 'CCM', '344A', 'P45A', '555A', 'H39A', 'S45A', 'ERPA', 'NHSC', '452A', 'O41A', 'U47A', '443A', '136A', 'ACSO', 'P38A', 'E38A', 'M47A', '061Z', '342A', 'LSQQ', 'F38A', 'C40A', 'E52A', 'S41A', 'SNCC', 'P52A', 'F52A', 'Y42A', 'M33A', 'T41A', 'G45A', 'I42A', 'R51A', 'M41A', 'B34A', 'O53A', 'J36A', 'BW06', 'O54A', 'S35A', '453A', 'H47A', 'E33A', 'H41A', 'S38A', 'N48A', 'K49A', 'L41A', 'Z45A', 'Q42A', 'L50A', 'H38A', 'Y50A', 'K55A', 'Y35A', 'C06D', 'L42A', 'W50A', 'L33A', 'O37A', '240A', 'U36A', 'Z50A', 'Y54A', 'D41A', 'N49A', 'N44A', 'P47A', '242A', 'N40A', '140

Which shows snetsta solved only part of our problem.   We have a devil of a problem to fix here with 666 stations having no entry in the snetsta table we used. (Bad joke, but couldn't resist.) 

The first fix we will apply makes a simple assumption that is reasonable here and probably usually would be for other data sets assembled by a similar mechanism (i.e. web service download of waveforms from IRIS and the FDSN).  We assume that the station data we wrote in the site collection from the download procedure is close to, but not completely, definitive.  The exact assumption is that if there is a unique match for a station code in site, we set the net code for an arrival to that value.  If it is not found, we put it in one list to be returned.  If it is ambiguous, meaning there are multiple net codes for the same sta name, we return it in a different list. The next code block applies that algorithm:

In [25]:
from mspasspy.preprocessing.css30.dbarrival import set_netcode_from_site
rt_setsite=set_netcode_from_site(db)
nprocessed=rt_setsite[0]
nupdated=rt_setsite[1]
ambiguous=rt_setsite[2]
sta_not_found=rt_setsite[3]
print('set_netcode_from_site processed ',nprocessed,' documents in arrival collection')
print('number of documents with net code update=',nupdated)

set_netcode_from_site processed  353482  documents in arrival collection
number of documents with net code update= 1388
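
The decision rule described before that code block can be sketched in plain Python. This is an illustration of the logic only, operating on in-memory lists rather than MongoDB. It does not reproduce the actual `set_netcode_from_site` implementation or its return types, and the site documents are made up.

```python
from collections import defaultdict

def resolve_net_codes(arrival_stas, site_docs):
    """Classify station codes as uniquely resolved, ambiguous, or not found."""
    nets_for_sta = defaultdict(set)
    for doc in site_docs:
        nets_for_sta[doc["sta"]].add(doc["net"])
    resolved, ambig, missing = {}, set(), set()
    for sta in arrival_stas:
        nets = nets_for_sta.get(sta, set())
        if len(nets) == 1:
            resolved[sta] = next(iter(nets))  # unique match: safe to set net
        elif len(nets) > 1:
            ambig.add(sta)                    # multiple nets share this sta
        else:
            missing.add(sta)                  # no site entry at all
    return resolved, ambig, missing

# Made-up site documents; CHGQ is deliberately absent
demo_sites = [
    {"net": "TA", "sta": "J31A"}, {"net": "7D", "sta": "J31A"},
    {"net": "IU", "sta": "CCM"},
]
res_demo, amb_demo, missing_demo = resolve_net_codes(
    ["J31A", "CCM", "CHGQ"], demo_sites)
print(res_demo, amb_demo, missing_demo)
```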


You will likely find that this function takes a while to run because it is doing a lot of updates in a modest sized collection.   Let's print the two larger containers returned in the rt_setsite tuple in a pretty form:

In [26]:
from obspy import UTCDateTime
print('ambiguous site entries matched against arrival')
# first sort by sta - put tuples in a list container first to allow sorting
amblist=list()
for t in ambiguous:
    amblist.append(t)
amblist.sort(key = lambda x: x[1])
for t in amblist:
    print(t[0],t[1],UTCDateTime(t[2]),' to ',UTCDateTime(t[3]))
print('These sta codes had entries in arrival but were not found in site')
for nullsta in sta_not_found:
    print(nullsta)

ambiguous site entries matched against arrival
TA J31A 2010-07-31T00:00:00.000000Z  to  2012-05-18T23:59:59.000000Z
7D J31A 2011-11-21T00:00:00.000000Z  to  2012-05-16T23:59:59.000000Z
TA J33A 2010-07-27T00:00:00.000000Z  to  2012-05-17T23:59:59.000000Z
7D J33A 2011-10-16T00:00:00.000000Z  to  2012-07-19T23:59:59.000000Z
TA J35A 2010-07-29T00:00:00.000000Z  to  2012-05-16T23:59:59.000000Z
7D J35A 2011-10-19T00:00:00.000000Z  to  2012-07-18T23:59:59.000000Z
7D J36A 2011-10-19T00:00:00.000000Z  to  2012-07-18T23:59:59.000000Z
TA J36A 2010-10-07T00:00:00.000000Z  to  2012-09-16T23:59:59.000000Z
7D J37A 2011-11-28T00:00:00.000000Z  to  2012-05-16T23:59:59.000000Z
TA J37A 2010-10-06T00:00:00.000000Z  to  2012-09-10T23:59:59.000000Z
TA J38A 2010-10-11T00:00:00.000000Z  to  2012-09-14T23:59:59.000000Z
7D J38A 2011-11-23T00:00:00.000000Z  to  2012-05-17T23:59:59.000000Z
7D J39A 2011-11-22T00:00:00.000000Z  to  2012-05-17T23:59:59.000000Z
TA J39A 2011-06-12T00:00:00.000000Z  to  2013-06-03T23:5

So, we clearly need to do some detective work to sort this out.   There are multiple issues we need to fix, and that, unfortunately, is part of the tedious work of assembling any data set.  We will attack these in pieces with two helpful functions we can use to query the site and channel collections.   

Let's first look at the entries for the US network code.   You might guess that they share a common problem and indeed they do as shown by the output of this small block of code:

In [2]:
from mspasspy.preprocessing.seed.util import (site_report, channel_report)
usnetlist=['AAM', 'ACSO','CBKS','ERPA','GLMI','JFWS','MCWV','NHSC']
for sta in usnetlist:
    print('site entries for station=',sta)
    site_report(db,sta=sta,net='US')
    print('channel entries for station=',sta)
    channel_report(db,sta=sta,net='US')

site entries for station= AAM
 net      sta =loc== latitude longitude starttime endtime
  US      AAM ==00== 42.3012 -83.6567 0.172 1994-06-20T00:00:00.000000Z 2051-01-01T00:00:00.000000Z
channel entries for station= AAM
 net      sta   chan =loc== latitude longitude starttime endtime
  US      AAM    BH1 ==00== 42.3012 -83.6567 0.172 2011-05-02T19:07:00.000000Z 2051-01-01T00:00:00.000000Z
  US      AAM    BH2 ==00== 42.3012 -83.6567 0.172 2011-05-02T19:07:00.000000Z 2051-01-01T00:00:00.000000Z
  US      AAM    BHZ ==00== 42.3012 -83.6567 0.172 2011-05-02T19:07:00.000000Z 2051-01-01T00:00:00.000000Z
site entries for station= ACSO
 net      sta =loc== latitude longitude starttime endtime
  US     ACSO ==00== 40.2319 -82.982 0.288 2000-05-14T00:00:00.000000Z 2051-01-01T00:00:00.000000Z
channel entries for station= ACSO
 net      sta   chan =loc== latitude longitude starttime endtime
  US     ACSO    BH1 ==00== 40.2319 -82.982 0.288 2012-01-26T00:00:00.000000Z 2051-01-01T00:00:00.000000Z


This shows all of these are ambiguous only in the sense that there are two loc codes for the same site.  However, the report shows that, as they should, both documents for the same site but different loc codes have exactly the same location.  

Let's now look at IU CCM (Cathedral Cave, Missouri; IU is the network code for the IRIS/USGS component of the Global Seismographic Network)

In [5]:
print('site data for IU CCM')
site_report(db,net='IU',sta='CCM')
print('channel data for IU CCM')
channel_report(db,net='IU',sta='CCM')

site data for IU CCM
 net      sta =loc== latitude longitude starttime endtime
  IU      CCM ==00== 38.0557 -91.2446 0.1715 2011-07-14T00:00:00.000000Z 2051-01-01T00:00:00.000000Z
  IU      CCM ==00== 38.0557 -91.2446 0.1715 1996-06-07T21:00:00.000000Z 2011-07-14T00:00:00.000000Z
channel data for IU CCM
 net      sta   chan =loc== latitude longitude starttime endtime
  IU      CCM    BH1 ==00== 38.0557 -91.2446 0.1715 2011-07-14T00:00:00.000000Z 2012-06-14T23:27:00.000000Z
  IU      CCM    BH2 ==00== 38.0557 -91.2446 0.1715 2011-07-14T00:00:00.000000Z 2012-06-14T23:27:00.000000Z
  IU      CCM    BHZ ==00== 38.0557 -91.2446 0.1715 2011-07-14T00:00:00.000000Z 2012-06-14T23:27:00.000000Z
  IU      CCM    BHE ==00== 38.0557 -91.2446 0.1715 1999-06-03T19:50:00.000000Z 2011-07-14T00:00:00.000000Z
  IU      CCM    BHN ==00== 38.0557 -91.2446 0.1715 1999-06-03T19:50:00.000000Z 2011-07-14T00:00:00.000000Z
  IU      CCM    BHZ ==00== 38.0557 -91.2446 0.1715 1999-06-03T19:50:00.000000Z 2011-07-14

This shows CCM had an instrumentation change July 14, 2011.   The location was unchanged but the channels were renamed.  For this tutorial this particular change is irrelevant as the data being handled here are only from 2012. 
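
When a station has several channel documents with different operating periods, as CCM does, a pick time can be used to select the right one. The sketch below illustrates the idea with plain dictionaries holding epoch times, using two of the CCM time spans printed in the channel report above. The pick time is made up, and the half-open interval test is a design choice of this sketch rather than something MsPASS prescribes.

```python
from datetime import datetime, timezone

def epoch(year, month, day, hour=0, minute=0):
    """Convert a UTC calendar date to epoch seconds."""
    return datetime(year, month, day, hour, minute,
                    tzinfo=timezone.utc).timestamp()

def docs_active_at(docs, t):
    """Return the documents whose [starttime, endtime) interval contains t."""
    return [d for d in docs if d["starttime"] <= t < d["endtime"]]

# Time spans taken from the IU CCM BH1 and BHE entries in the report above
ccm_horizontals = [
    {"chan": "BH1", "starttime": epoch(2011, 7, 14),
     "endtime": epoch(2012, 6, 14, 23, 27)},
    {"chan": "BHE", "starttime": epoch(1999, 6, 3, 19, 50),
     "endtime": epoch(2011, 7, 14)},
]
pick_time = epoch(2012, 3, 1)  # made-up pick time in 2012
matches = docs_active_at(ccm_horizontals, pick_time)
print(len(matches), "matching channel document(s):", matches[0]["chan"])
```

Using a half-open interval means a time exactly at the changeover on 2011-07-14 matches only the newer document.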

The next issue is the long list of stations with a common sta name but two different net codes:  TA and 7D.   We need only look closely at one of these to understand what the problem is here.

In [6]:
print('site data for J31A')
site_report(db,sta='J31A')
print('channel data for J31A')
channel_report(db,sta='J31A')

site data for J31A
 net      sta =loc== latitude longitude starttime endtime
channel data for J31A
 net      sta   chan =loc== latitude longitude starttime endtime


If you go to IRIS's Metadata aggregator page for query by network, <http://ds.iris.edu/SeismiQuery/by_network.html>, you will learn that 7D is a set of OBS instruments deployed in Cascadia in 2012.  The TA code is for the USArray Transportable Array.  The ANF received TA data in real time, and the focus of their efforts was the data received in real time.  To the best of my knowledge they did not go back and pick the OBS data, which was not accessible until well after 2012 - the year of all arrival picks we are trying to sort out here.  Hence, the assumption we make is that we can safely assign TA as the net code for all of those picks.   The only exception is X37A, but the timestamps suggest this is an example of a TA station adopted by the OK network.  Let's check the coordinates to verify that is true before proceeding:

In [7]:
print('site data for X37A')
site_report(db,sta='X37A')
print('channel data for X37A')
channel_report(db,sta='X37A')

site data for X37A
 net      sta =loc== latitude longitude starttime endtime
channel data for X37A
 net      sta   chan =loc== latitude longitude starttime endtime


Which confirms our hypothesis that X37A was adopted by OK on Feb 2, 2012.  
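
A coordinate comparison like this can also be automated. The sketch below, which is not part of MsPASS, computes the great-circle separation of two site documents and reports whether they lie within a small distance of each other. The coordinates are made-up placeholders, not real station values, and the 100 m tolerance is an arbitrary choice.

```python
import math

def separation_km(doc1, doc2, radius_km=6371.0):
    """Great-circle distance between two documents (haversine formula)."""
    lat1, lon1 = math.radians(doc1["latitude"]), math.radians(doc1["longitude"])
    lat2, lon2 = math.radians(doc2["latitude"]), math.radians(doc2["longitude"])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2.0 * radius_km * math.asin(math.sqrt(a))

def same_location(doc1, doc2, tolerance_km=0.1):
    """True if two documents are closer than tolerance_km."""
    return separation_km(doc1, doc2) < tolerance_km

# Placeholder coordinates for illustration only
doc_a = {"net": "TA", "sta": "DEMO", "latitude": 35.0000, "longitude": -97.0000}
doc_b = {"net": "OK", "sta": "DEMO", "latitude": 35.0001, "longitude": -97.0001}
print("separation (km):", round(separation_km(doc_a, doc_b), 4))
print("same location:", same_location(doc_a, doc_b))
```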

The last issue is a set of stations with the net code CI.  The network aggregator will show you that CI is the southern California seismic network and that CI is a mnemonic for Caltech, who have been the operator of that network since it was defined with that code.  Let's run a script similar to the one we used for the US net code:

In [9]:
cinetlist=['IKP','SNCC']
for sta in cinetlist:
    print('site entries for station=',sta)
    site_report(db,sta=sta,net='CI')
    print('channel entries for station=',sta)
    channel_report(db,sta=sta,net='CI')

site entries for station= IKP
 net      sta =loc== latitude longitude starttime endtime
channel entries for station= IKP
 net      sta   chan =loc== latitude longitude starttime endtime
site entries for station= SNCC
 net      sta =loc== latitude longitude starttime endtime
channel entries for station= SNCC
 net      sta   chan =loc== latitude longitude starttime endtime


This is the messiest problem we need to fix.  It appears we have near but not exact duplicate records in both site and channel for both of the stations IKP and SNCC.   It is not at all clear why this has happened, but if you go back and run the unix diff command on the xml files from 2011 you will see they are indeed different.  Why they are returned that way from web services is also unclear, but a reasonable hypothesis is they came from different data centers that had slightly different entries for these stations.   

What makes this even more challenging is that the two stations have different problems:  IKP needs edits to both site and channel while SNCC requires only edits to site.  In all cases a (nonunique) solution appears to be to use the endtime key to distinguish the keeper.   We judge the entries with endtime in 2051 to be better than the alternative because the coordinates appear to have been measured to higher precision.  First, let's fix IKP.

You may want to run this twice.  The first time it will produce a lot of output because the channel documents are large due to storage of the response data.  Repeating the box will show 0 deletes which is what we want to confirm.

In [14]:
from obspy import UTCDateTime
dbsite=db.site
query=dict()
query['sta']='IKP'
query['net']='CI'
t_test=UTCDateTime('2060-01-01:00:00:00.0')
query['endtime']={'$gt':t_test.timestamp}
n=dbsite.count_documents(query)
print('IKP with this query yields ',n,' documents')
if n==1:
    doc=dbsite.find_one(query)
    print('Deleting this entry in site:')
    print(doc)
    dbsite.delete_one(query)
# same for channel use delete_many
dbchan=db.channel
n=dbchan.count_documents(query)
print('IKP number of channel documents to be deleted=',n)
print('channel documents that will be deleted')
curs=dbchan.find(query)
for doc in curs:
    print(doc)
dbchan.delete_many(query)

IKP with this query yields  0  documents
IKP number of channel documents to be deleted= 0
channel documents that will be deleted


<pymongo.results.DeleteResult at 0x7fef55857180>

Now the same for SNCC but we only need to handle site.

In [15]:
query=dict()
query['sta']='SNCC'
query['net']='CI'
t_test=UTCDateTime('2060-01-01:00:00:00.0')
query['endtime']={'$gt':t_test.timestamp}
n=dbsite.count_documents(query)
print('SNCC with this query yields ',n,' documents')
if n==1:
    doc=dbsite.find_one(query)
    print('Deleting this entry in site:')
    print(doc)
    dbsite.delete_one(query)

SNCC with this query yields  1  documents
Deleting this entry in site:
{'_id': ObjectId('600c122eaeca3fad23077ecc'), 'loc': '', 'net': 'CI', 'sta': 'SNCC', 'latitude': 33.24787, 'longitude': -119.52437, 'coords': [33.24787, -119.52437], 'elevation': 0.275, 'edepth': 0.0, 'starttime': 767923200.0, 'endtime': 32503680000.0, 'site_id': ObjectId('600c122eaeca3fad23077ecc')}
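
The selection logic used in the two cells above can be isolated from the database calls. The following sketch splits a list of documents into keepers and candidates for deletion using the same 2060 cutoff. The first end time is the value printed for the deleted SNCC entry; the 2051 end time stands in for the entry we keep. The other fields are omitted for brevity.

```python
from datetime import datetime, timezone

# Same cutoff as the cells above, computed with the standard library
CUTOFF = datetime(2060, 1, 1, tzinfo=timezone.utc).timestamp()

def split_by_endtime(docs, cutoff=CUTOFF):
    """Return (keepers, to_delete) where to_delete has endtime > cutoff."""
    keepers = [d for d in docs if d["endtime"] <= cutoff]
    to_delete = [d for d in docs if d["endtime"] > cutoff]
    return keepers, to_delete

demo_docs = [
    {"sta": "SNCC", "endtime": 32503680000.0},  # value printed above
    {"sta": "SNCC",
     "endtime": datetime(2051, 1, 1, tzinfo=timezone.utc).timestamp()},
]
keepers, to_delete = split_by_endtime(demo_docs)
print(len(keepers), "kept;", len(to_delete), "flagged for deletion")
```

Listing the candidates before deleting anything, as the cells above do with print statements, is a sensible habit because deletes in MongoDB cannot be undone.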


Now we are left with the problem of setting net and loc codes for the stations marked ambiguous earlier.  Since we fixed the duplicate problem with IKP and SNCC the residual problems we need to fix are:
1.  The list of stations with net code US need to have the net code set to US and loc set to 00
2.  CCM arrival picks need the net code set to IU and the loc code set to 00 
3.  All remaining arrivals with net null should have it set to TA 
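
Before and after applying fixes like these, it helps to know how many arrivals still lack a net code. The sketch below tallies them per station for an in-memory list of documents. Whether an unset net code appears as a missing key or as a null value is an assumption here, so the function treats both cases the same way. The documents are made up.

```python
from collections import Counter

def count_unset_net(arrival_docs):
    """Count arrival documents per sta where net is missing or None."""
    tally = Counter()
    for doc in arrival_docs:
        if doc.get("net") is None:
            tally[doc["sta"]] += 1
    return dict(tally)

# Made-up documents for illustration
demo_arrivals = [
    {"sta": "AAM"},                  # net key missing
    {"sta": "AAM", "net": None},     # net explicitly null
    {"sta": "CCM"},
    {"sta": "PFO", "net": "II"},     # already resolved
]
unset = count_unset_net(demo_arrivals)
print(unset)
```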

First, let's handle the US stations.  All we do there is find all entries matching our US net station list and force the net code to US and the loc code to 00.   We do that because I think 00 are borehole instruments that should be quieter, but that is not necessarily the right choice.  If it really mattered one might want to dig.

In [22]:
dbarr=db.arrival
for sta in usnetlist:
    query=dict()
    query['sta']=sta
    newval={"$set" : {'loc':'00', 'net':'US'}}
    x = dbarr.update_many(query,newval)
    print(x.modified_count, "arrival documents updated for station=",sta)

501 arrival documents updated for station= AAM
583 arrival documents updated for station= ACSO
670 arrival documents updated for station= CBKS
553 arrival documents updated for station= ERPA
395 arrival documents updated for station= GLMI
666 arrival documents updated for station= JFWS
518 arrival documents updated for station= MCWV
372 arrival documents updated for station= NHSC


Now a simpler step for CCM because we don't need a loop.

In [23]:
# build a fresh query so no keys left over from earlier cells interfere
query={'sta':'CCM'}
newval={"$set" : {'loc':'00', 'net':'IU'}}
x = dbarr.update_many(query,newval)
print(x.modified_count, "arrival documents updated for station=CCM")

670 arrival documents updated for station=CCM


Now we can run this script that sets all remaining null net codes to TA. This works only because we fixed the earlier ones.

In [27]:
from mspasspy.preprocessing.css30.dbarrival import set_netcode_time_interval
# First we have to run this function again to refresh the list of ambiguous stations
rt_setsite=set_netcode_from_site(db)
# Now we only need this one
ambiguous=rt_setsite[2]
# and in a later box this one
sta_not_found=rt_setsite[3]
# first use a set container to find all unique names
sta_to_force=set()
for t in ambiguous:
    sta_to_force.add(t[1])
for sta in sta_to_force:
    n=set_netcode_time_interval(db,sta,net='TA')
    print('set net code to TA for ',n,' arrival documents with picks for station=',sta)

set net code to TA for  518  arrival documents with picks for station= J36A
set net code to TA for  87  arrival documents with picks for station= X37A
set net code to TA for  133  arrival documents with picks for station= J46A
set net code to TA for  161  arrival documents with picks for station= J35A
set net code to TA for  1069  arrival documents with picks for station= TPNV
set net code to TA for  190  arrival documents with picks for station= J31A
set net code to TA for  45  arrival documents with picks for station= J52A
set net code to TA for  4  arrival documents with picks for station= J55A
set net code to TA for  560  arrival documents with picks for station= SCIA
set net code to TA for  149  arrival documents with picks for station= J48A
set net code to TA for  178  arrival documents with picks for station= J47A
set net code to TA for  761  arrival documents with picks for station= J39A
set net code to TA for  512  arrival documents with picks for station= J37A
set net code to

Finally, we have to deal with the orphans, which are the stations held in the set container with the name sta_not_found.  We will do that with this little script:

In [28]:
from obspy.clients.fdsn import Client
webclient=Client('IRIS')
starttime=UTCDateTime("2004-01-01T00:00:00.000")
endtime=UTCDateTime("2016-12-31T00:00:00.000")
write_dir="data/raw_data_tutorial/orphanxml"
for sta in sta_not_found:
    print('Attempting to download xml data for station=',sta)
    try:
        inv=webclient.get_stations(network="*",station=sta,location="*",
                            channel="BH*",
                            starttime=starttime,endtime=endtime,
                            level="response")
    except Exception:
        print("No B channel data for station=",sta," for any net code")
        print("Ignoring this station")
        continue
    print('downloaded as this inventory object by obspy:')
    print(inv)
    fname=write_dir+'/'+sta+'.xml'
    print('writing as xml data to file=',fname)
    inv.write(fname,format='STATIONXML')
    print('saving to MongoDB site and channel collections')
    db.save_inventory(inv,verbose=True)

Attempting to download xml data for station= CHGQ
No B channel data for station= CHGQ  for any net code
Ignoring this station
Attempting to download xml data for station= ACTO
No B channel data for station= ACTO  for any net code
Ignoring this station
Attempting to download xml data for station= DRWO
No B channel data for station= DRWO  for any net code
Ignoring this station
Attempting to download xml data for station= DELO
No B channel data for station= DELO  for any net code
Ignoring this station
Attempting to download xml data for station= ORIO
No B channel data for station= ORIO  for any net code
Ignoring this station
Attempting to download xml data for station= BMRO
No B channel data for station= BMRO  for any net code
Ignoring this station
Attempting to download xml data for station= TCOL
downloaded as this inventory object by obspy:
Inventory created at 2021-01-23T21:35:06.000000Z
	Created by: IRIS WEB SERVICE: fdsnws-station | version: 1.1.47
		    http://service.iris.edu/fdsnw

No B channel data for station= BELQ  for any net code
Ignoring this station
Attempting to download xml data for station= MATQ
No B channel data for station= MATQ  for any net code
Ignoring this station
Attempting to download xml data for station= ALFO
No B channel data for station= ALFO  for any net code
Ignoring this station
Attempting to download xml data for station= ALGO
No B channel data for station= ALGO  for any net code
Ignoring this station
Attempting to download xml data for station= PLIO
No B channel data for station= PLIO  for any net code
Ignoring this station
Attempting to download xml data for station= ELFO
No B channel data for station= ELFO  for any net code
Ignoring this station
Attempting to download xml data for station= Y22E
downloaded as this inventory object by obspy:
Inventory created at 2021-01-23T21:35:14.000000Z
	Created by: IRIS WEB SERVICE: fdsnws-station | version: 1.1.47
		    http://service.iris.edu/fdsnws/station/1/query?starttime=2004-01-01...
	Sending

No B channel data for station= PLVO  for any net code
Ignoring this station
Attempting to download xml data for station= BANO
downloaded as this inventory object by obspy:
Inventory created at 2021-01-23T21:35:14.000000Z
	Created by: IRIS WEB SERVICE: fdsnws-station | version: 1.1.47
		    http://service.iris.edu/fdsnws/station/1/query?starttime=2004-01-01...
	Sending institution: IRIS-DMC (IRIS-DMC)
	Contains:
		Networks (1):
			YR
		Stations (1):
			YR.BANO (WADI DARABAN)
		Channels (6):
			YR.BANO..BHZ (2x), YR.BANO..BHN (2x), YR.BANO..BHE (2x)
writing as xml data to file= data/raw_data_tutorial/orphanxml/BANO.xml
saving to MongoDB site and channel collections
net:sta:loc= YR : BANO :  for time span  2005-09-26T12:00:00.000000Z  to  2006-08-28T12:00:00.000000Z  added to site collection
net:sta:loc:chan= YR : BANO :  : BHE for time span  2005-09-26T12:00:00.000000Z  to  2006-02-23T10:02:40.000000Z  added to channel collection
net:sta:loc:chan= YR : BANO :  : BHE for time span  2006-0

downloaded as this inventory object by obspy:
Inventory created at 2021-01-23T21:35:24.000000Z
	Created by: IRIS WEB SERVICE: fdsnws-station | version: 1.1.47
		    http://service.iris.edu/fdsnws/station/1/query?starttime=2004-01-01...
	Sending institution: IRIS-DMC (IRIS-DMC)
	Contains:
		Networks (1):
			TA
		Stations (1):
			TA.POKR (Poker Flat Research Range, AK, USA)
		Channels (9):
			TA.POKR..BHZ, TA.POKR..BHN, TA.POKR..BHE, TA.POKR.01.BHZ (2x), 
			TA.POKR.01.BHN (2x), TA.POKR.01.BHE (2x)
writing as xml data to file= data/raw_data_tutorial/orphanxml/POKR.xml
saving to MongoDB site and channel collections
net:sta:loc= TA : POKR :  for time span  2012-10-02T00:00:00.000000Z  to  2020-02-11T23:59:59.000000Z  added to site collection
net:sta:loc:chan= TA : POKR :  : BHE for time span  2012-10-02T00:00:00.000000Z  to  2020-02-11T23:59:59.000000Z  added to channel collection
net:sta:loc:chan= TA : POKR :  : BHN for time span  2012-10-02T00:00:00.000000Z  to  2020-02-11T23:59:59.00000

The verbose output of the script above shows that the problem stations fall into two groups: new sites that came into the TA data stream from testing in Alaska, and what seem to be extraneous stations that do not have broadband channels.  

Note (to be removed before release):  There is a bug somewhere in save_inventory, apparently in an initialization step, related to an obnoxious side effect of insert_one.  insert_one adds the id of the inserted document (key _id) to the dict it is passed as an argument.  That creates a duplicate key condition if that id is not cleared before the dict is reused.   I could not find the bug without more careful debugging, but I found I could get through the whole run by rerunning the cell until it finished.  Apparently the initialization problem appears on the second pass, after a file has been processed.
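
The side effect of insert_one described in the note above is easy to reproduce in isolation.  The following is a minimal sketch that uses a hypothetical stand-in class, `FakeCollection`, instead of a live MongoDB server.  It mimics only the one behavior at issue: adding an `_id` key to the dict it is passed and rejecting a repeated `_id`.  With pymongo the id would be an ObjectId and the failure would be a DuplicateKeyError; the integer ids and the KeyError used here are simplifications for illustration.

```python
class FakeCollection:
    """Hypothetical stand-in for a pymongo collection (illustration only)."""

    def __init__(self):
        self._docs = {}
        self._next_id = 0

    def insert_one(self, doc):
        # Like pymongo, add an _id to the caller's dict if it has none.
        if "_id" not in doc:
            doc["_id"] = self._next_id
            self._next_id += 1
        if doc["_id"] in self._docs:
            raise KeyError("duplicate key _id={}".format(doc["_id"]))
        self._docs[doc["_id"]] = dict(doc)
        return doc["_id"]


coll = FakeCollection()
template = {"net": "TA", "sta": "POKR"}

coll.insert_one(template)        # template now carries an _id
try:
    coll.insert_one(template)    # reusing the same dict fails
    reuse_failed = False
except KeyError:
    reuse_failed = True

# Workaround 1: insert a copy so the original dict is never modified
clean = {"net": "YR", "sta": "BANO"}
coll.insert_one(dict(clean))
coll.insert_one(dict(clean))     # each copy gets a fresh _id

# Workaround 2: clear the id before reusing a dict
template.pop("_id", None)
coll.insert_one(template)
```

Either workaround avoids the duplicate key condition; which one fits save_inventory depends on where the dict is being reused, which I have not tracked down.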

Let's verify what stations still have null net codes

In [29]:
from mspasspy.preprocessing.css30.dbarrival import find_null_net_stations
ret=find_null_net_stations(db)
print(ret)

{'CHGQ', 'ACTO', 'DRWO', 'DELO', 'ORIO', 'BMRO', 'TCOL', 'TORO', 'BRCO', 'BELQ', 'MATQ', 'ALFO', 'ALGO', 'PLIO', 'ELFO', 'Y22E', 'EPYK', 'PLVO', 'BANO', 'BASO', 'TOBO', 'PEMO', 'HDA', 'BUKO', 'CLWO', 'WLVO', 'PKRO', 'LSQQ', 'KLBO', 'MEDO', 'STCO', 'DRCO', 'BWLO', 'TYNO', 'TOLK', 'POKR', 'LATQ', 'ORHO'}


The output above suggests that none of these data are of interest for what we are assembling (Alaska test sites and sites that don't have broadband sensors), but we should do what we can.   The next block loops over the set returned above and reruns the function to try to set the net code from the site table:

In [30]:
for sta in ret:
    n=set_netcode_time_interval(db,sta,net='TA')
    print('set net code to TA for ',n,' arrival documents with picks for station=',sta)


set net code to TA for  234  arrival documents with picks for station= CHGQ
set net code to TA for  288  arrival documents with picks for station= ACTO
set net code to TA for  223  arrival documents with picks for station= DRWO
set net code to TA for  380  arrival documents with picks for station= DELO
set net code to TA for  250  arrival documents with picks for station= ORIO
set net code to TA for  305  arrival documents with picks for station= BMRO
set net code to TA for  248  arrival documents with picks for station= TCOL
set net code to TA for  73  arrival documents with picks for station= TORO
set net code to TA for  216  arrival documents with picks for station= BRCO
set net code to TA for  149  arrival documents with picks for station= BELQ
set net code to TA for  287  arrival documents with picks for station= MATQ
set net code to TA for  438  arrival documents with picks for station= ALFO
set net code to TA for  117  arrival documents with picks for station= ALGO
set net code 

Let's finish this by verifying we no longer have any null net entries in the arrival collection:

In [31]:
from mspasspy.preprocessing.css30.dbarrival import find_null_net_stations
ret=find_null_net_stations(db)
print(ret)

set()


# 5. Indexing Waveform Data
For this tutorial we will be reading a set of miniseed files derived from the files downloaded by web services with obspy as described in section 3 above.   As noted earlier, I elected to concatenate the large number of miniseed files obspy's downloader wrote to create files that would work more efficiently on HPC systems.  For readers familiar with seismic reflection processing, the files we will be working with in this and the next section can be thought of as bundles that each define a common source (shot) gather.   In MsPASS we abstract any group of seismic data objects that have some relationship into a container we call an Ensemble object.   For obspy users an ensemble is similar in concept to the obspy Stream object but more generic.  

For the preprocessing phase, the next step is to build an index that defines the contents of the data we are aiming to assemble.   We will again use MongoDB to accomplish this task.  We will create a set of documents that describe our data in its raw form.   In the following sections we will edit these documents to add the additional metadata needed to complete the data set.  Then we will run an example MsPASS job with spark that will preprocess and time window these data before writing them into the processing collection we call *wf*.   

The next code block runs a small function we can use to build our index.   Run the next code block and we will discuss what it does afterwards, mixing in some words with database queries to explain the concepts.

In [32]:
from mspasspy.preprocessing.seed.ensembles import (dbsave_seed_ensemble_file)
for i in range(29):
    fname="wf/2012/event{0}.mseed".format(i+1)
    print('Building index for miniseed file=',fname)
    try:
        dbsave_seed_ensemble_file(db,fname)
    except Exception as err:
        print(err)

Building index for miniseed file= wf/2012/event1.mseed
Building index for miniseed file= wf/2012/event2.mseed
Building index for miniseed file= wf/2012/event3.mseed
Building index for miniseed file= wf/2012/event4.mseed
Building index for miniseed file= wf/2012/event5.mseed




something threw an exception - this needs detailed handlers
Building index for miniseed file= wf/2012/event6.mseed
Building index for miniseed file= wf/2012/event7.mseed
Building index for miniseed file= wf/2012/event8.mseed
Building index for miniseed file= wf/2012/event9.mseed
Building index for miniseed file= wf/2012/event10.mseed
Building index for miniseed file= wf/2012/event11.mseed
Building index for miniseed file= wf/2012/event12.mseed
Building index for miniseed file= wf/2012/event13.mseed
Building index for miniseed file= wf/2012/event14.mseed
Building index for miniseed file= wf/2012/event15.mseed
Building index for miniseed file= wf/2012/event16.mseed
Building index for miniseed file= wf/2012/event17.mseed
Building index for miniseed file= wf/2012/event18.mseed
Building index for miniseed file= wf/2012/event19.mseed
Building index for miniseed file= wf/2012/event20.mseed
Building index for miniseed file= wf/2012/event21.mseed
Building index for miniseed file= wf/2012/event2

Notice that event5 had read errors.  That problem, unfortunately, is all too common with waveform data downloaded by web services.  This function makes no attempt to salvage those data, so they are lost unless we attempt further detective work.   We will ignore that problem for now, but emphasize to the reader that this is part of the tedious work required today to assemble the most complete data set.  What complete means depends on the project, so we treat solving that particular problem as an exercise for the student.  We intentionally leave the data slightly dirty to emphasize this point.

The following script shows that only 28 of the 29 files were successfully processed and prints the contents of the first document in a prettier form (the full collection has 27 similar documents):

In [33]:
dbwf=db.wf_miniseed
n=dbwf.count_documents({})
print('seedwf collection has ',n,' documents - should have been 29')
doc=dbwf.find_one()
print('Ensemble Metadata')
for k in doc:
    if k!='members':
        print(k,doc[k])
print('Ensemble Member Metadata')
n=0
for d in doc['members']:
    print('Metadata dict for member number ',n)
    print(d)
    n+=1
    

seedwf collection has  28  documents - should have been 29
Ensemble Metadata
_id 600c9b15df5fa30c0c62c771
dir /home/pavlis/mspass_tutorial/notebooks/wf/2012
dfile event1.mseed
format mseed
member_type TimeSeries
mover obspy_seed_ensemble_reader
starttime 1356943580.750001
endtime 1356947180.744538
Ensemble Member Metadata
Metadata dict for member number  0
{'net': '2G', 'sta': 'BHE', 'chan': '', 'starttime': 1356943580.799998, 'endtime': 1356947180.749998, 'sampling_rate': 20.0, 'delta': 0.05, 'npts': 72000, 'calib': 1.0, 'seed_file_id': '37894c69-443f-4595-99ae-43a4f7ab4965'}
Metadata dict for member number  1
{'net': '7D', 'sta': 'BHZ', 'chan': '', 'starttime': 1356943580.7527978, 'endtime': 1356947180.7327979, 'sampling_rate': 50.0, 'delta': 0.02, 'npts': 180000, 'calib': 1.0, 'seed_file_id': 'fcdebd65-1fe7-4aa1-bda3-696169247a4e'}
Metadata dict for member number  2
{'net': '7D', 'sta': 'BH2', 'chan': '', 'starttime': 1356943580.7556999, 'endtime': 1356947180.7357, 'sampling_rate': 

Metadata dict for member number  1496
{'net': 'TA', 'sta': 'BHN', 'chan': '', 'starttime': 1356943580.7500002, 'endtime': 1356947180.7500002, 'sampling_rate': 40.0, 'delta': 0.025, 'npts': 144001, 'calib': 1.0, 'seed_file_id': '82072c61-25fe-45fd-8ef7-f97a93f049a0'}
Metadata dict for member number  1497
{'net': 'TA', 'sta': 'BHN', 'chan': '', 'starttime': 1356943580.7500002, 'endtime': 1356947180.725, 'sampling_rate': 40.0, 'delta': 0.025, 'npts': 144000, 'calib': 1.0, 'seed_file_id': 'c5a6ed98-3648-4e33-a74b-b46205dcf7f1'}
Metadata dict for member number  1498
{'net': 'TA', 'sta': 'BHE', 'chan': '', 'starttime': 1356943580.7500002, 'endtime': 1356947180.7500002, 'sampling_rate': 40.0, 'delta': 0.025, 'npts': 144001, 'calib': 1.0, 'seed_file_id': 'e5ccd7fe-d2f0-4213-867f-aec07c7b402a'}
Metadata dict for member number  1499
{'net': 'TA', 'sta': 'BHE', 'chan': '', 'starttime': 1356943580.7500002, 'endtime': 1356947180.7500002, 'sampling_rate': 40.0, 'delta': 0.025, 'npts': 144001, 'calib

Metadata dict for member number  2992
{'net': 'XI', 'sta': 'BHN', 'chan': '', 'starttime': 1356943580.76, 'endtime': 1356947180.7350001, 'sampling_rate': 40.0, 'delta': 0.025, 'npts': 144000, 'calib': 1.0, 'seed_file_id': '92ea2e9f-3b31-4ac8-b76f-f28cc7748b8b'}
Metadata dict for member number  2993
{'net': 'XI', 'sta': 'BHN', 'chan': '', 'starttime': 1356943580.7649999, 'endtime': 1356947180.74, 'sampling_rate': 40.0, 'delta': 0.025, 'npts': 144000, 'calib': 1.0, 'seed_file_id': 'c71a6823-e99c-4ce3-b614-a69ba33cdc33'}
Metadata dict for member number  2994
{'net': 'XO', 'sta': 'BHE', 'chan': '', 'starttime': 1356943580.76, 'endtime': 1356947180.7350001, 'sampling_rate': 40.0, 'delta': 0.025, 'npts': 144000, 'calib': 1.0, 'seed_file_id': '1fa96f0d-7ff7-4c50-89f4-5e65039807d5'}
Metadata dict for member number  2995
{'net': 'XO', 'sta': 'BHN', 'chan': '', 'starttime': 1356943580.76, 'endtime': 1356947180.7350001, 'sampling_rate': 40.0, 'delta': 0.025, 'npts': 144000, 'calib': 1.0, 'seed_fi

Here are some key points about this rather large output:
- We write the documents that define this index in a collection called "seed_data.ensemble".   Note the "." character is treated as part of the name and is not significant.  i.e. the collection name is literally "seed_data.ensemble".  It is, however, common practice to use names like this in MongoDB to group collections having some kind of conceptual relationship.  In this case, it is one of several possible collections that could be created from seed data.  Other raw data types are expected to appear as collections with similar names (e.g. segy_data, ph5_data, etc.) 
- Readers familiar with CSS3.0 and/or Antelope may find it useful to think of seed_data.ensemble as comparable to the css3.0 table called wfdisc.   The data are not loaded into MongoDB's work area.  seed_data.ensemble documents are only an index defining data holdings in files.  The approach here is strictly file based, but it could be completely different.  For example, instead of a file with properties defined by dir, dfile, and format, we could define a web services URL.    
- Under the hood here we use obspy's miniseed reader.
- This collection uses what MongoDB calls subdocuments.  There is one subdocument for each channel of data in this event gather.  The subdocuments are a list of dict containers derived from the stats attribute of each Trace object that obspy's reader creates when cracking the miniseed file.  The list is stored under the key "members".  
- Each file also generates a set of metadata we tag in the print script above as "Ensemble Metadata".  We do that because they will literally be used for that purpose in the MsPASS Ensemble container later, but more importantly because it fits a generic concept.  That is, an "Ensemble" in MsPASS means a group of atomic objects that have some generic relationship.  In this case, the relationship is that they are all connected to a single earthquake. The next thing we will need to do is link the ensemble to the source with which it is associated.  We will do that with pure database manipulations of small tables for efficiency. 
- The member Metadata contain an attribute with the key seed_file_id.   That is an internally generated unique key that was created by this function to provide a unique tag for each original seed file TimeSeries.   For processing using the object-level history functions of MsPASS that tag is important to define what we call an ORIGIN for a processing history chain. 
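
To make the layout described in the points above concrete, here is a minimal sketch that works on a plain Python dict shaped like one index document.  The document below is a hypothetical, much shortened stand-in that I built by hand from the keys seen in the printed output; the helper name `summarize_members` is also mine and is not part of MsPASS.  In practice the dict would come from a call like find_one on the index collection.

```python
from collections import Counter

# Hypothetical index document: ensemble-level keys plus a "members" list
doc = {
    "dir": "wf/2012",
    "dfile": "event1.mseed",
    "format": "mseed",
    "members": [
        {"net": "TA", "sampling_rate": 40.0, "npts": 144001},
        {"net": "TA", "sampling_rate": 40.0, "npts": 144000},
        {"net": "7D", "sampling_rate": 50.0, "npts": 180000},
    ],
}


def summarize_members(doc):
    """Count member subdocuments by network code and by sampling rate."""
    nets = Counter(m["net"] for m in doc["members"])
    rates = Counter(m["sampling_rate"] for m in doc["members"])
    return nets, rates


# Everything except the members list is ensemble-level metadata
ensemble_md = {k: v for k, v in doc.items() if k != "members"}
nets, rates = summarize_members(doc)
print("Ensemble metadata:", ensemble_md)
print("Members per network:", dict(nets))
print("Members per sampling rate:", dict(rates))
```

A summary like this is a cheap way to see how mixed the sample rates are in a gather before any waveform data are read.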

# 6. Source Association
The general problem of associating a set of waveforms with metadata describing the source of the transient signals to be analyzed can be a thorny one.  For this tutorial we can use a very simple algorithm, however, because of the specific way I downloaded these data with obspy.   That is, all the event files were constructed to be similar to seismic reflection shot gathers, such that the first data sample for all channels is approximately the estimated origin time of the earthquake.  You can see that in the dump above because all the channels have a start time within a few tens of ms of epoch time 1356943580.7.  Hence, the algorithm we will run below just looks for events in the source collection for which the origin time is within a tolerance of the average start time of the ensemble members (stored in the ensemble metadata with the key starttime).  
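
The matching idea can be sketched in a few lines of plain Python.  This is not the MsPASS implementation; the function name `find_source_by_time`, the 10 s default tolerance, and the two source documents are assumptions made up for illustration.  Only the ensemble start time is taken from the dump above.

```python
def find_source_by_time(ensemble_starttime, sources, tolerance=10.0):
    """Return the source closest in time to ensemble_starttime.

    Returns None if no source origin time is within tolerance (seconds).
    """
    best = None
    best_dt = None
    for src in sources:
        dt = abs(src["time"] - ensemble_starttime)
        if dt <= tolerance and (best_dt is None or dt < best_dt):
            best = src
            best_dt = dt
    return best


# Hypothetical source documents; "time" is an epoch origin time
sources = [
    {"source_id": "A", "time": 1356943578.0},
    {"source_id": "B", "time": 1356900000.0},
]

match = find_source_by_time(1356943580.75, sources)
print("matched source:", match)
```

The approach only works because the start time of each file was tied to the origin time when the data were downloaded.  For data cut relative to some other reference the test would need to change.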

In [34]:
from mspasspy.preprocessing.seed.ensembles import link_source_collection
link_source_collection(db)
    

That little function associated each ensemble with source data downloaded with web services by obspy.   Those events are not from the same catalog as the events ANF used when they picked P waves from the TA data stream.   Further, a large fraction of the stations in our download did not come through the real time data stream to ANF, so they would not have picks in the css3.0 arrival data.  Recall our objective in this tutorial is to extract only waveforms that have a pick made by ANF.   

Our first step toward that goal is to also load the source locations in the ascii file we used to generate the arrival collection into our source collection.  We first run this small function to get the unique sources found in the (large) ascii table we created earlier.   Run the next cell and we will view the output below:

In [35]:
from mspasspy.preprocessing.css30.dbarrival import extract_unique_css30_sources
events=extract_unique_css30_sources('data/raw_data_tutorial/catalog_data/usarray_tele2012.txt')
print('Found ',len(events),' unique sources from 2012 css3.0 data file')
# print just the first 10 events
n=0
for k in events:
    print('evid=',k,' contents:  ',events[k])
    n+=1
    if(n>=10):
        break

Found  906  unique sources from 2012 css3.0 data file
evid= 206397  contents:   {'evid': 206397, 'latitude': 12.011, 'longitude': 143.505, 'depth': 10.0, 'time': 1325377805.03}
evid= 206398  contents:   {'evid': 206398, 'latitude': -11.372, 'longitude': 166.224, 'depth': 66.7, 'time': 1325379008.01}
evid= 206401  contents:   {'evid': 206401, 'latitude': 31.416, 'longitude': 138.155, 'depth': 348.5, 'time': 1325395674.5}
evid= 206408  contents:   {'evid': 206408, 'latitude': 12.018, 'longitude': 143.607, 'depth': 6.5, 'time': 1325432652.07}
evid= 206439  contents:   {'evid': 206439, 'latitude': -14.748, 'longitude': 167.44, 'depth': 221.6, 'time': 1325652454.05}
evid= 206454  contents:   {'evid': 206454, 'latitude': -10.67, 'longitude': 166.375, 'depth': 75.9, 'time': 1325703417.29}
evid= 206458  contents:   {'evid': 206458, 'latitude': -45.97, 'longitude': -76.014, 'depth': 10.0, 'time': 1325724873.21}
evid= 206459  contents:   {'evid': 206459, 'latitude': -17.691, 'longitude': -173.54

As the name of that function implies, it scanned that large text file and kept the location associated with each unique evid value.  That is a potentially dangerous test if used on a poorly constructed css3.0 database, but the ANF databases are clean so we don't have an issue.  I state that because anyone who tries to extend this tutorial should first make sure the evid values are unique within and between any file(s) processed this way.
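
For anyone who does want to check a different catalog, the following is a minimal sketch of the kind of test I have in mind.  It is not the code in extract_unique_css30_sources; the function name `unique_sources` and the row layout (one dict per arrival row) are assumptions.  The first two events use the values printed above, and the last row is a deliberately inconsistent, made-up entry.

```python
SOURCE_KEYS = ("evid", "latitude", "longitude", "depth", "time")


def unique_sources(rows):
    """Keep the first location seen for each evid and flag conflicts."""
    events = {}
    conflicts = set()
    for row in rows:
        loc = {k: row[k] for k in SOURCE_KEYS}
        evid = loc["evid"]
        if evid not in events:
            events[evid] = loc
        elif events[evid] != loc:
            # Same evid with a different location: evid is not a safe key
            conflicts.add(evid)
    return events, conflicts


rows = [
    {"sta": "S1", "evid": 206397, "latitude": 12.011, "longitude": 143.505,
     "depth": 10.0, "time": 1325377805.03},
    {"sta": "S2", "evid": 206397, "latitude": 12.011, "longitude": 143.505,
     "depth": 10.0, "time": 1325377805.03},
    {"sta": "S1", "evid": 206398, "latitude": -11.372, "longitude": 166.224,
     "depth": 66.7, "time": 1325379008.01},
    # Hypothetical bad row: evid reused with a different depth
    {"sta": "S3", "evid": 206398, "latitude": -11.372, "longitude": 166.224,
     "depth": 70.0, "time": 1325379008.01},
]

events, conflicts = unique_sources(rows)
print("unique evids:", sorted(events))
print("evids with conflicting locations:", conflicts)
```

An empty conflict set is the condition the ANF data satisfy.  A nonempty one means evid should not be trusted as a key for that file.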

We now use the function below to load this set of locations with the evid attributes into the source collection.  Note, however, that the number of attributes we store with these sources is much smaller than the number loaded from the quakeml files.

In [36]:
from mspasspy.preprocessing.css30.dbarrival import load_css30_sources
n=load_css30_sources(db,events)
print('loaded ',n,' new documents into source collection')

loaded  906  new documents into source collection


The set of events we just loaded can be distinguished from those created from the quakeml file by the existence of the attribute "evid" in the documents we just saved;  the quakeml generated documents do not contain an evid attribute.  This short script demonstrates that using some low level MongoDB queries:

In [38]:
dbsource=db.source
n=dbsource.count_documents({})
print('Source collection currently has ',n,' total source documents')
query={'evid': {'$exists' : True}}
n=dbsource.count_documents(query)
print(n,' of the documents have evid set')

Source collection currently has  1867  total source documents
906  of the documents have evid set


Because evid values appear in arrival documents too, we will exploit that relation to associate arrivals with waveforms.  First, because evids are potentially dangerous we will run the following small function to post the unique id source_id to all arrival documents.  

In [39]:
from mspasspy.preprocessing.css30.dbarrival import set_source_id_from_evid
result=set_source_id_from_evid(db)
print('Number of arrival documents processed=',result[0])
print('Number of arrival documents updated=',result[1])
print('Number arrivals in set that did not match keyed by evid =',result[3])

Number of arrival documents processed= 353482
Number of arrival documents updated= 353482
Number arrivals in set that did not match keyed by evid = {}
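
The cross-referencing step above amounts to a lookup from evid to the id of the matching source document.  Here is a minimal sketch of that logic on plain Python lists.  It is not the body of set_source_id_from_evid; the function name `post_source_ids`, the string ids, and the sample documents are all hypothetical.

```python
def post_source_ids(arrivals, sources):
    """Set source_id on each arrival dict using evid as the link.

    Returns (number processed, number updated, set of unmatched evids).
    """
    # Only sources that carry an evid can be linked this way
    evid_to_id = {s["evid"]: s["_id"] for s in sources if "evid" in s}
    n_processed = 0
    n_updated = 0
    unmatched = set()
    for doc in arrivals:
        n_processed += 1
        evid = doc.get("evid")
        if evid in evid_to_id:
            doc["source_id"] = evid_to_id[evid]
            n_updated += 1
        else:
            unmatched.add(evid)
    return n_processed, n_updated, unmatched


sources = [
    {"_id": "src-a", "evid": 206397},
    {"_id": "src-b", "evid": 206398},
    {"_id": "src-c"},  # e.g. a quakeml-derived source with no evid
]
arrivals = [
    {"sta": "S1", "evid": 206397},
    {"sta": "S2", "evid": 206398},
    {"sta": "S3", "evid": 999999},  # evid with no matching source
]

result = post_source_ids(arrivals, sources)
print(result)
```

Once source_id is posted, later steps can link on the database's own unique id and never need to trust evid again.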


# 7. Defining Indexes
A MongoDB database depends heavily on well-defined indexes to improve performance on large collections.  In this tutorial the only collection that will create problems for us is the arrival collection.   We saw that earlier when we updated every document in the collection. Indexes are best built on static collections because they slow writes.  We didn't do this step earlier because we were building the arrival collection and doing as many writes as reads. For the next step we will only be reading from arrival (a lot), so we need to build an index.  Experience shows this will speed processing by 1 to 3 orders of magnitude.  

We are going to build the index with these three keys to support the queries we will be running on arrival below:   source_id, net, and sta.  The MongoDB syntax to build an index is a bit weird but this serves as a good example:

In [40]:
db.arrival.create_index(
[
    ('source_id',1),
    ('net', 1),
    ('sta',1)
])

'source_id_1_net_1_sta_1'

Note two things about this:
1.  The pymongo api uses a different data structure from the mongo shell.  This warning is important if you do an internet search for building an index in MongoDB - build on pymongo examples as mongo shell commands require a translation.
2.  The magic 1 is a cryptic shorthand for the more verbose "ASCENDING" tag you will find in some sources.  You can also use "DESCENDING" or -1 to reverse the search order for the index.   Reversing the order of the index search has no merit here, but you can imagine examples where it would matter.
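
The two points above can be illustrated without a database connection.  In this sketch the constants are defined locally with the same values pymongo uses, and `default_index_name` is a helper name of my own that reproduces the naming pattern seen in the string returned by create_index above.  The mongo shell form is shown only as a comment for comparison.

```python
# Same values as pymongo.ASCENDING and pymongo.DESCENDING
ASCENDING = 1
DESCENDING = -1

# pymongo form: a list of (key, direction) tuples, so order is explicit
keys = [("source_id", ASCENDING), ("net", ASCENDING), ("sta", ASCENDING)]

# mongo shell form of the same index (JavaScript, not Python):
#   db.arrival.createIndex({source_id: 1, net: 1, sta: 1})


def default_index_name(keys):
    """Join each key and its direction with underscores."""
    return "_".join("{}_{}".format(field, direction)
                    for field, direction in keys)


print(default_index_name(keys))
```

The order of the tuples matters: this index is organized first by source_id, then net, then sta, which matches the way the queries below narrow the search.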

# 8.  Ensemble Processing
## 8.1. Cross-referencing source data
We previously wrote an index for the common source gather files we downloaded in a collection we called seed_data.ensemble.  The following function takes the time range of that data and associates a document in the source collection with each file, writing the source_id to the ensemble's index document.  We will need that in the reduction workflow that concludes this tutorial to link arrivals to the waveform data.

This function also has to deal with another not so small detail.  We have duplicate source locations for most of the events in the source collection.  The reason is that we loaded both the catalog obspy downloaded and the one ANF used in measuring the picks in the arrival collection.  We want to prefer the ones from ANF as they are the ones we can easily link to arrival.
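
The preference rule can be sketched as follows.  This is only an illustration of the idea; the function name `choose_source` and the sample documents are hypothetical, and the real function may break ties differently.

```python
def choose_source(candidates, prefer_evid=True):
    """Pick one source from candidates that all match an ensemble in time."""
    if not candidates:
        return None
    if prefer_evid:
        # Sources loaded from the css3.0 file are the ones that carry evid
        with_evid = [c for c in candidates if "evid" in c]
        if with_evid:
            return with_evid[0]
    return candidates[0]


# Two hypothetical entries for the same earthquake from different catalogs
candidates = [
    {"_id": "from-quakeml", "time": 1356943578.0},
    {"_id": "from-css30", "time": 1356943580.1, "evid": 123456},
]

chosen = choose_source(candidates)
print("chosen source:", chosen["_id"])
```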

In [41]:
from mspasspy.preprocessing.seed.ensembles import link_source_collection
# link_source_collection works like a subroutine: it updates the database and returns nothing
link_source_collection(db,prefer_evid=True)



## 8.2. Final Data Assembly Workflow
First, we need to load a number of new modules that will be used in the workflow below:

In [6]:
from mspasspy.util.Undertaker import Undertaker
from mspasspy.algorithms import signals
from mspasspy.ccore.seismic import (TimeWindow,TimeSeries,TimeSeriesEnsemble)
from mspasspy.algorithms.window import WindowData
from mspasspy.algorithms.bundle import bundle_seed_data
from mspasspy.preprocessing.seed.ensembles import (load_one_ensemble,
                                                   load_channel_data,
                                                   load_site_data,
                                                   load_arrivals_by_id,
                                                   erase_seed_metadata)
import time

We now run our load preprocessor.  It applies these steps to each ensemble in our test data set:

1. Loads the seed data into memory using the wf_miniseed collection as an index.
2. Associates arrivals with the waveforms now loaded.
3. Kills data that do not have an arrival association and destroys them (the cremate method of Undertaker).
4. Uses obspy's detrend operator on each signal.
5. Time shifts each seismogram to put time 0 at the arrival time.
6. Windows each signal from 300 s before to 400 s after the arrival time.
7. Applies a bandpass filter from 0.01 to 2.0 Hz to each windowed signal.
8. Loads data from the channel collection (notably hang and vang needed to define orientation).
9. Runs bundle_seed_data to create Seismogram objects from groups of 3 TimeSeries with common net,sta,loc values.
10. Saves the results to MongoDB in gridfs storage. 
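
Steps 5 and 6 above are simple arithmetic on the time axis, and it may help to see that written out.  The following is a minimal sketch in plain Python, not the MsPASS ator or WindowData code.  The function name `window_sample_range` is hypothetical, and clipping the window to the samples that exist is an assumption of this sketch, not a statement about how MsPASS treats windows that run off the data.

```python
import math


def window_sample_range(starttime, sampling_rate, npts, arrival_time,
                        tstart=-300.0, tend=400.0):
    """Return (first, last) sample indices inside the window, or None.

    tstart and tend are in seconds relative to the arrival (time 0).
    """
    # Time of sample 0 after shifting the clock so the arrival is at 0
    t0_relative = starttime - arrival_time
    first = math.ceil((tstart - t0_relative) * sampling_rate)
    last = math.floor((tend - t0_relative) * sampling_rate)
    # Clip to samples that actually exist (assumption of this sketch)
    first = max(first, 0)
    last = min(last, npts - 1)
    if last < first:
        return None
    return first, last


# Hypothetical one hour trace at 40 sps with a pick 500 s after the start
rng = window_sample_range(1000.0, 40.0, 144001, 1500.0)
print("sample range:", rng)
print("samples in window:", rng[1] - rng[0] + 1)
```

With a 700 s window at 40 samples per second the result is 28001 samples, which is why the windowed data are so much smaller than the one hour segments in the raw files.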

In [7]:
stedronsky=Undertaker(db)
dbwf=db.wf_miniseed
cursor=dbwf.find({},no_cursor_timeout = True)
n=0
for doc in cursor:
    print('working on enemble number ',n)
    if 'evid' not in doc:
        print('Ensemble number ',n,' does not have an evid set - skipped')
        n+=1
        continue
    ens=load_one_ensemble(doc,apply_calib=True,
                  ensemble_mdkeys=['source_id','evid','starttime','endtime'])
    load_site_data(db,ens)
    nlive=load_arrivals_by_id(db,ens)
    n+=1
    t0=time.time()
    d_cleaned=stedronsky.cremate(ens)
    t1=time.time()
    print('Time to handle dead data=',t1-t0)
    t0=time.time()
    signals.detrend(d_cleaned,'demean')
    t1=time.time()
    print('Time to apply detrend operator=',t1-t0)
    t0=time.time()
    twin=TimeWindow(-300.0,400.0)
    for i in range(len(d_cleaned.member)):
        d=TimeSeries(d_cleaned.member[i])
        t=d['arrival_time']
        d.ator(t)
        d=WindowData(d,twin)
        d_cleaned.member[i]=d
    t1=time.time()
    print('Time to window cleaned ensemble=',t1-t0)
    t0=time.time()
    signals.filter(d_cleaned,"bandpass",corners=2,freqmin=0.01,freqmax=2.0)
    t1=time.time()
    print('Time to filter cleaned ensemble=',t1-t0)
    t0=time.time()
    load_channel_data(db,d_cleaned)
    t1=time.time()
    print('Time to build cross reference to channel=',t1-t0)
    t0=time.time()
    ens3c=bundle_seed_data(d_cleaned)
    t1=time.time()
    print('Time to run bundle algorithm=',t1-t0)
    print("Number of Seismograms after bundle_seed_data=",len(ens3c.member))
    #print_dead_logs(ens3c)
    erase_seed_metadata(d_cleaned)
    # This is a workaround for a bug that currently exists in
    # save_ensemble_data in handling history database
    # Should be able to eventually delete this
    for d in ens3c.member:
        d.clear_history()
    t0=time.time()
    db.save_ensemble_data(ens3c,'gridfs')
    t1=time.time()
    print("Time to save data to MongoDB=",t1-t0)
    del ens
    n+=1


working on enemble number  0
Ensemble number  0  does not have an evid set - skipped
working on enemble number  1
Ensemble number  1  does not have an evid set - skipped
working on enemble number  2
Time to handle dead data= 0.10956454277038574
Time to apply detrend operator= 40.367491245269775
Time to window cleaned ensemble= 0.16676712036132812
Time to filter cleaned ensemble= 8.166987895965576
Time to build cross reference to channel= 2.6701512336730957
Time to run bundle algorithm= 0.3094930648803711
Number of Seismograms after bundle_seed_data= 190
Time to save data to MongoDB= 0.20568346977233887
working on enemble number  4
Time to handle dead data= 0.037932395935058594
Time to apply detrend operator= 9.896740913391113
Time to window cleaned ensemble= 0.03679490089416504
Time to filter cleaned ensemble= 2.0053420066833496
Time to build cross reference to channel= 0.6282927989959717
Time to run bundle algorithm= 0.06404304504394531
Number of Seismograms after bundle_seed_data= 46