# MsPASS Getting Started Tutorial
## *Gary L. Pavlis, Indiana University and Yinzhi (Ian) Wang, TACC*
## Preliminaries
This tutorial assumes you have already done the following:
1.  Installed docker.
2.  Run the commmand `docker pull wangyinz/mspass_tutorial` 
3.  Launched docker with `docker run` as described in the User's Manual.  You should launch this instance with the `-p 8787:8787` option to utilize dask diagnostics near the end.  RESTART if you haven't enabled 8787 or you will have issues starting this in the middle.
4.  Connected the container to get this tutorial running.

Our user manual and github wiki pages describes how to do that.  This tutorial assumes that was completed and you are running this notebook while connected to MsPASS running within the docker container. 
Note MsPASS can also be run from a local copy of MsPASS installed through pip.   The easiest way to do that is to still use the standard container but only use it to launch and run MongoDB.  See the github wiki page on running mspass with docker for guidance on that procedure.  After launching the container the main difference is in launching jupyter-notebook/jupyterlab to get this tutorial running.   None of the tutorial should depend upon which approach you are using.  Further, if either approach was not done correctly you can expect python errors at the first import of a mspasspy module. 

## Download data with obspy
### Overview of this section
MsPASS leans heavily on obspy.  In particular, in this section we will use obspy's web services functions to download waveform data, station metadata, and source metadata.  The approach we are using here is to stage these data to your local disk.   The dataset we will assemble is the mainshock and 7 days of larger aftershocks of the Tohoku earthquake.  The next section then covers how we import these data into the MsPASS framework to allow them to be processed.

### Select, download, and save source data in MongoDB
As noted we are focusing on the Tohoku earthquake and its aftershocks.  That earthquake's origin time is approximately  March 11, 2011, at 5:46:24 UTC.  The ISC epicenter is 38.30N, 142.50E.  We will then apply obspy's *get_events* function with the following time and area filters:
1.  Starttime March 11, 2011, 1 hour before the origin time.
2.  End time 7 days after the mainshock origin time.
3.  Epicenters within + or - 3 degrees Latitude
4.  Epicenters within + or - 3 degrees of Longitude. 
5.  Only aftershocks larger than 6.5

Here is the incantation in obspy to do that:

In [1]:
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
client=Client("IRIS")
t0=UTCDateTime('2011-03-11T05:46:24.0')
starttime=t0-3600.0
endtime=t0+(7.0)*(24.0)*(3600.0)
lat0=38.3
lon0=142.5
minlat=lat0-3.0
maxlat=lat0+3.0
minlon=lon0-3.0
maxlon=lon0+3.0
minmag=6.5

cat=client.get_events(starttime=starttime,endtime=endtime,
        minlatitude=minlat,minlongitude=minlon,
        maxlatitude=maxlat,maxlongitude=maxlon,
        minmagnitude=minmag)
print(cat)

11 Event(s) in Catalog:
2011-03-12T01:47:16.160000Z | +37.590, +142.751 | 6.5  MW
2011-03-11T19:46:35.300000Z | +38.800, +142.200 | 6.5  mb
...
2011-03-11T05:51:20.500000Z | +37.310, +142.240 | 6.8  None
2011-03-11T05:46:23.200000Z | +38.296, +142.498 | 9.1  MW
To see all events call 'print(CatalogObject.__str__(print_all=True))'


We can save these easily into MongoDB for use in later processing with this simple command.

In [2]:
from mspasspy.db.database import Database
import mspasspy.client as msc
#dbclient=msc.Client(schema='mspass_lite.yaml')
dbclient=msc.Client()
db = dbclient.get_database('getting_started')

In [3]:
n=db.save_catalog(cat)
print('number of event entries saved in source collection=',n)

number of event entries saved in source collection= 11


### Select, download, and save station metadata to MongoDB
We use a very similar procedure to download and save station data.   We again use obspy but in this case we use their *get_stations* function to construct what they call an "Inventory" object containing the station data. 

In [4]:
inv=client.get_stations(network='TA',starttime=starttime,endtime=endtime,
                        format='xml',channel='BH?',level='response')
net=inv.networks
x=net[0]
sta=x.stations
print("Number of stations retrieved=",len(sta))
print(inv)

Number of stations retrieved= 446
Inventory created at 2023-05-23T11:58:49.751200Z
	Created by: IRIS WEB SERVICE: fdsnws-station | version: 1.1.52
		    http://service.iris.edu/fdsnws/station/1/query?starttime=2011-03-...
	Sending institution: IRIS-DMC (IRIS-DMC)
	Contains:
		Networks (1):
			TA
		Stations (446):
			TA.034A (Hebronville, TX, USA)
			TA.035A (Encino, TX, USA)
			TA.035Z (Hargill, TX, USA)
			TA.109C (Camp Elliot, Miramar, CA, USA)
			TA.121A (Cookes Peak, Deming, NM, USA)
			TA.133A (Hamilton Ranch, Breckenridge, TX, USA)
			TA.134A (White-Moore Ranch, Lipan, TX, USA)
			TA.135A (Vickery Place, Crowley, TX, USA)
			TA.136A (Ennis, TX, USA)
			TA.137A (Heron Place, Grand Saline, TX, USA)
			TA.138A (Matatall Enterprise, Big Sandy, TX, USA)
			TA.139A (Bunkhouse Ranch, Marshall, TX, USA)
			TA.140A (Cam and Jess, Hughton, LA, USA)
			TA.141A (Papa Simpson, Farm, Arcadia, LA, USA)
			TA.142A (Monroe, LA, USA)
			TA.143A (Socs Landing, Pioneer, LA, USA)
			TA.214A (Organ Pi

The output shows we just downloaded the data form 446 TA stations that were running during this time period. Note a detail is if you want full response information stored in the database you need to specify *level='response'* as we have here.  The default is never right.  You need to specify level as at least "channel". 

We will now save these data to MongoDB with a very similar command to above: 

In [5]:
ret=db.save_inventory(inv,verbose=True)
print('save_inventory returned value=',ret)

net:sta:loc= TA : 034A :  for time span  2010-01-08T00:00:00.000000Z  to  2011-11-17T23:59:59.000000Z  added to site collection
net:sta:loc:chan= TA : 034A :  : BHE for time span  2010-01-08T00:00:00.000000Z  to  2011-11-17T17:05:00.000000Z  added to channel collection
net:sta:loc:chan= TA : 034A :  : BHN for time span  2010-01-08T00:00:00.000000Z  to  2011-11-17T17:05:00.000000Z  added to channel collection
net:sta:loc:chan= TA : 034A :  : BHZ for time span  2010-01-08T00:00:00.000000Z  to  2011-11-17T17:05:00.000000Z  added to channel collection
net:sta:loc= TA : 035A :  for time span  2010-01-12T00:00:00.000000Z  to  2011-11-14T23:59:59.000000Z  added to site collection
net:sta:loc:chan= TA : 035A :  : BHE for time span  2010-01-12T00:00:00.000000Z  to  2011-11-14T17:40:00.000000Z  added to channel collection
net:sta:loc:chan= TA : 035A :  : BHN for time span  2010-01-12T00:00:00.000000Z  to  2011-11-14T17:40:00.000000Z  added to channel collection
net:sta:loc:chan= TA : 035A :  : B

We turned on verbose mode so the output is quite large, but we did that because it demonstrates what this Database method is doing.  It takes apart the complicated Inventory object obspy created from the stationxml data we downloaded and turns the result into a set of documents saved in the two main station metadata collections in MongoDB:  (1) *site* for station information and (2) *channel* that contains most of the same data as *site* but add a set of important additional information (notably component orientation and response data).  

We turn now to the task of downloading the waveform data.

### Download waveform data
The last download step for this tutorial is the one that will take the most time and consume the most disk space;  downloading the waveform data.   To keep this under control we keep only a waveform section spanning most of the body waves.   We won't burden you with the details of how we obtained the following rough numbers we use to define the waveform downloading parameters:

1.  The approximate distance from the mainshock epicenter to the center of the USArray in 2011 is 86.5 degrees.
2.  P arrival is expected about 763 s after the origin time
3.  S arrival is expected about 1400 s after the origin time

Since we have stations spanning the continent we will use the origin time of each event +P travel time (763 s) - 4 minutes as the start time.  For the end time we will use the origin time + S travel time (1400 s) + 10 minutes.  

This process will be driven by origin times from the events we downloaded earlier.   We could drive this by using the obspy *Catalog* object created above, but because the event data was previously saved to the database we will use this opportunity to illustrate how that data is managed in MsPASS.

First, let's go over the data we saved in MongoDB.  We saved the source data in a *collection* we call *source*.   For those familiar with relational databases a MongoDB "collection" plays a role similar to a table (relation) in a relational database.   A "collection" contains one or more "documents".  A "document" in MongoDB is analagous to a single tuple in a relational database.  The internal structure of a MongoDB is, however, very different being represented by binary storage of name-value pairs in a format they call BSON because the structure can be represented in human readable form as a common format today called JSON.   A key point for MsPASS to understand is that the BSON (JSON) documents stored in MongoDB map directly into a python dict container.   We illustrate that in the next box by printing the event hypocenter data we downloaded above and then stored in MongoDB:

In [6]:
dbsource=db.source
cursor=dbsource.find()   # This says to retrieve and iterator overall all source documents
# The Cursor object MongoDB's find function returns is iterable
print('Event in tutorial dataset')
evlist=list()
for doc in cursor:
    lat=doc['lat']
    lon=doc['lon']
    depth=doc['depth']
    origin_time=doc['time']
    mag=doc['magnitude']
    # In MsPASS all times are stored as epoch times. obspy's UTCDateTime function easily converts these to 
    # a readable form in the print statment here but do that only for printing or where required to 
    # interact with obspy
    print(lat,lon,depth,UTCDateTime(origin_time),mag)
    evlist.append([lat,lon,depth,origin_time,mag])

Event in tutorial dataset
37.5898 142.7512 24.8 2011-03-12T01:47:16.160000Z 6.5
38.8 142.2 33.0 2011-03-11T19:46:35.300000Z 6.5
39.2219 142.5316 18.9 2011-03-11T11:36:40.030000Z 6.5
36.1862 141.6639 26.0 2011-03-11T08:19:27.470000Z 6.5
38.051 144.6297 19.8 2011-03-11T06:25:50.740000Z 7.6
36.0675 141.7291 23.0 2011-03-11T06:20:02.390000Z 6.5
36.0692 142.1388 23.0 2011-03-11T06:18:51.060000Z 6.6
36.2274 141.088 25.4 2011-03-11T06:15:37.570000Z 7.9
38.9847 143.4632 20.0 2011-03-11T06:08:32.540000Z 6.7
37.31 142.24 33.0 2011-03-11T05:51:20.500000Z 6.8
38.2963 142.498 19.7 2011-03-11T05:46:23.200000Z 9.1


Notice that we use python dict syntax to extract attributes like latitude ('lat' key) from "document" which acts like a python dict.  

We saved the core metadata attributes in a python list, *evlist*, to allow us to reduce the volume of data we will retrieve.  We'll keep just the three biggest events;  the mainshock and the two 7+ aftershocks.  We do that in the block below using a set container to define keepers.  The approach is a bit obscure and not the most efficient.  We could just use the list we just built to drive the processing in the next box, but we use the approach there to illustrate another example of how to loop over documents retrieved from MongoDB - a common thing you will need to do to work with MongoDB in MsPASS.

With that the loop below is similar to the simple print loop above, BUT requires an obscure parameter not usually discussed in the MongoDB documentation.   The problem we have to deal with is that the obspy web service downloader we are going to call in the loop below will take a while to complete (likely over an hour).  Long running processes interacting with MongoDB and using a "Cursor" object (the thing *find* returned) can fail and throw confusing messages from a timeout problem.  That is, a job will mysteriously fail with a message that does not always make the fundamental problem clear.  The solution is to create what some books call an "immortal cursor".   You will see this in the next box that does the waveform downloading as this line:
```
cursor=dbsource.find({},no_cursor_timeout=True)
```
where here we use an explicit "find all" with the (weird) syntax of "{}" and make the cursor immortal by setting no_cursor_timeout to True.

With that background here is the script to download data.  You might want to go grab a cup of coffee while this runs as it will take a while.  

In [7]:
from mspasspy.util.converter import Trace2TimeSeries

#db=Database(dbclient,'getting_started')
# We have to redefine the client from obspy to use their so called bulk downloader
from obspy.clients.fdsn import RoutingClient
client = RoutingClient("iris-federator")
# This uses the (admitted obscure) approach of a set container of keepers to reduce the download time
keepers=set()
keepers.add(4)
keepers.add(7)
keepers.add(10)

cursor=db.source.find({},no_cursor_timeout = True)
count=0
start_offset=763.0-4*60.0
end_offset=1400.0+10*60.0
i=0
for doc in cursor:
    if i in keepers: 
        origin_time=doc['time']
        # We need the ObjectId of the source to provide a cross reference to link our 
        # waveform data to the right source.  Better to do this now than later as association 
        # can be a big challenge
        id=doc['_id']
        
        print('Starting on event number',i,' with origin time=',UTCDateTime(origin_time))
        stime=origin_time+start_offset
        etime=origin_time+end_offset
        strm=client.get_waveforms(
            starttime=UTCDateTime(stime),
            endtime=UTCDateTime(etime),
            network='TA',
            channel='BH?',
            location='*'
        )
        # The output of get_waveforms is an obspy Stream object. Stream objects are iterable 
        # so we work through the group, converting each to a MsPASS TimeSeries, and then saving 
        # the results to the database.  
        for d in strm:
            d_mspass=Trace2TimeSeries(d)
            # Here is where we save the id linked to the source collection. 
            # We use a MsPASS convention that such metadata have name collection_id
            d_mspass['source_id'] = id
            d_mspass=db.save_data(d_mspass)
            print('Saved data for sta=',d_mspass['sta'],' and chan=',d_mspass['chan'],"status=",d_mspass.live)
            count += 1
            if count>20:
                break
    else:
        print('Skipping event number ',i,' with origin time=',UTCDateTime(origin_time))
    i += 1
print('Number of waveforms saved=',count)
    

Skipping event number  0  with origin time= 2011-03-11T05:46:23.200000Z
Skipping event number  1  with origin time= 2011-03-11T05:46:23.200000Z
Skipping event number  2  with origin time= 2011-03-11T05:46:23.200000Z
Skipping event number  3  with origin time= 2011-03-11T05:46:23.200000Z
Starting on event number 4  with origin time= 2011-03-11T06:25:50.740000Z


  return Cursor(self, *args, **kwargs)


Saved data for sta= 034A  and chan= BHE status= True
Saved data for sta= 034A  and chan= BHN status= True
Saved data for sta= 034A  and chan= BHZ status= True
Saved data for sta= 035A  and chan= BHE status= True
Saved data for sta= 035A  and chan= BHN status= True
Saved data for sta= 035A  and chan= BHZ status= True
Saved data for sta= 035Z  and chan= BHE status= True
Saved data for sta= 035Z  and chan= BHN status= True
Saved data for sta= 035Z  and chan= BHZ status= True
Saved data for sta= 109C  and chan= BHE status= True
Saved data for sta= 109C  and chan= BHN status= True
Saved data for sta= 109C  and chan= BHZ status= True
Saved data for sta= 121A  and chan= BHE status= True
Saved data for sta= 121A  and chan= BHN status= True
Saved data for sta= 121A  and chan= BHZ status= True
Saved data for sta= 133A  and chan= BHE status= True
Saved data for sta= 133A  and chan= BHN status= True
Saved data for sta= 133A  and chan= BHZ status= True
Saved data for sta= 134A  and chan= BHE status

KeyboardInterrupt: 

In [None]:
for d in strm:
    d_mspass=Trace2TimeSeries(d)
    print(d_mspass["net"],d_mspass["sta"],d_mspass["chan"])

In [None]:
from bson import json_util
doc=db.wf_TimeSeries.find({}).limit(5)
print(json_util.dumps(doc,indent=2))

In [None]:
print("obspy Trace stats")
print(d.stats)
print("mspass converted metadata")
print(json_util.dumps(d_mspass,indent=2))

## Quick Look for QC
MsPASS has some basic graphics capabilities to display its standard data types.  We refer the reader to a the *BasicGraphics* tutorial for more details, but for now we just illustrate using the plotting for a quick look as a basic QC to verify we have what we were looking for.

The previous step used *save_data* to save the data we downloaded to MongoDB.   How it is stored is a topic for later, but here we'll retrieve some of that data and plot it to illustrate the value of abstracting the read and write operations.   Reading is slightly more complicated than writing one atomic object as we did above.  The reason is that we often want to do a database select operation to limit what we get.  This example illustrates some basics on MongoDB queries.  The reader is referred to extensive external documentation on MongoDB (books and many online sources) on this topic. 

Let's first read the data for one station from all three events we saved.   This is a good illustration of the basic query mechanism used by MongoDB.  It also illustrates the concept of an *Ensemble* (Readers familiar with seismic reflection processing can view an Ensemble as a generic form of "gather".) which means a group of related data bundled together.  First run the following code block.  Explanations of what is done and what it should teach are in the text block that follows.

In [None]:
sta='234A'
query={ 'sta' : sta }
ensemble_md=query
n=db.wf_TimeSeries.count_documents(query)
print('Trying to retrieve ',n,' TimeSeries objects for station=',sta)
curs=db.wf_TimeSeries.find(query)
ensemble=db.read_ensemble_data(curs,ensemble_metadata=ensemble_md)
print(len(ensemble.member))
print('Success:  number of members in this ensemble=',
      len(ensemble.member))
print('python type of ensemble symbol=',type(ensemble))
print('Ensemble header (Metadata container) contents=',ensemble)


Starting at the top this algorithm can be broken up into four steps:
1.  The first two lines generate a query construct for MongoDB.  This example is a single, simple equality matching for one station using the seed code.
2.  Line 3 is not essential but illustrates a best practice for this example.   MsPASS ensemble objects contain a *Metadata* container that can be used to stored attributes common to the entire group.  Since this is a common station gather driven by a specific SEED station code, a logical way to define that is with that name (234A).  
3.  Lines 4-7 show a typical MsPASS construct to read data.  Since our example is naturally defined by an "ensemble" we read the data with the *read_ensemble_data* method of the database handle (db).  Notice that arg0 of the call to that method is the return from the MongoDB *find* method.  The "wf_TimeSeries" incantation is clarifier that tells MongoDB to read these data from a "collection" called "wf_TimeSeries". We use the optional argument, *ensemble_metadata*, to tell the reader to load the content of the python dict *ensemble_md* to the ensemble header.   The first two lines of this section are not required but are often a good idea to assure you are getting what you expect.  The *count_documents* method called there is used to clarify how many objects the reader should expect to find from the query.  We print the result as a sanity check.   Note a useful precheck on a large data set would be a simple loop that used that same construct to print the number of data objects expected for each ensemble.   
4.  That last group of lines are optional print statements but useful here to clarify the basics of what was read and illustrate typical QC print statements you may need in assembling a dataset.  The last line may seen a bit weird as one might expect a very large output from a print of the entire ensemble.   The example illustrates that what it does instead is only print the header data (the ensemble's *Metadata* container content).  

In addition to the educational point of a how to read data, this example also illustrates a key concept in MsPASS.   We define two data objects as *Atomic* that we refer to as *TimeSeries* and *Seismogram*.
As the one print statement above shows the symbol *ensemble* is what we call a *TimeSeriesEnsemble*.   It is a container that has an attribute called *member* that is itself a vector of  *TimeSeries* objects.  *TimeSeries* objects are an abstraction of the single channel records downloaded with FDSN web services.   Later in this tutorial we will convert the data associated with the wf_TimeSeries database collection to *Seismogram* objects and write them into a wf_Seismogram collection.  We call *Timeseries* and *Seismogram* object "Atomic" because for most processing they should be considered a single thing.  (For more experienced programs, note that in reality like atoms these data objects are made of subatomic particles with a class inheritance structure but the energy barrier to pull them apart is significant.)  Many functions, however, need the concept we call an *Ensemble*.   They should be used to group a set of atomic objects together that have a generic relationship.   Examples are "shot gather", "CMP gather", and "common receiver gather" concepts used in seismic reflection processing.  With passive array data the two most common groups are those associated with a single event and collections of all data recording a particular time period.  An *Ensemble* provides a generic way to hold any of those.  Your workflow needs to be aware at all times what is in any ensemble and be sure the requirements of an algorithm are met.  (e.g. in reflection processing you will get junk if you think an ensemble is an NMO corrected CMP gather and the data are actually a raw shot gather).  We reiterate that there are two named types of ensemble containers; one for each atomic data object.  They are called *TimeSeriesEnsemble* and *SeismogramEnsemble* for containers of *TimeSeries* and *Seismogram* object respectively.   In both cases the seismic data components are stored in simple vector container defined with the symbol *member*. The "vector" in this context is more generic than something like a numpy array.   We use the C++ std::vector generic concept that the member symbol can be subscripted with a "random access iterator".   That means you can use randomly access a single TimeSeries object in the above example with a construct like this:

d = ensemble.member[4]

and python can fetch that instance in equal time for random access.
The container also can act a bit like a python dict to store global metadata related to the ensemble. We will refer to that container here as the *ensemble metadata*.   Our example above can be thought of as a common receiver gather for TA station 234A.  Hence, we post that name with the key *sta* with the line `ensemble['sta']=sta`.    That model should be the norm for any ensemble.   That is, the ensemble metadata should normally contain a set of one or more key-value pairs that at least provide a  hint at the ensemble contents.  Here that is the station name, but we could add other data like the stations coordinates.  We defer that to below where that data becomes necessary. 

With that background let's plot these data.  Here we illustrate the use of basic plotting routines in MsPASS but we note any python graphic package can be use for plotting these data if you understand the data structures.  We provide the *SeismicPlotter* class here as a convenience.  A valuable addition for community development is extensions of our basic graphics module or alternative plotting modules.

In [None]:
from mspasspy.graphics import SeismicPlotter
plotter=SeismicPlotter(normalize=True)
# TODO  default wtvaimg has a bug and produces a 0 height plot - wtva works for now
plotter.change_style('wtva')
plotter.plot(ensemble)

Again, about as simple as it gets.  This and the earlier examples illustrate a key design goal we had for MsPASS:  make the package as simple to use for beginners as possible.   The only option we used here was to turn on automatic scaling (The `normalize=True` line in the constructor), which is turned off by default.  Scaling is essential here since we are mixing data from events with different magnitudes.  Without scaling only the mainshock signals would visible.

The plots shows the mainshock record as the top 3 signals.   Below that are sets of 3 signals from the other 2 large events in our data subset defined as "keepers".  Notice how the 3 sets of signals are offset in time.  The data are plotted that way because the data time stamp is coordinated universal time (UTC) and they are being plotted in their actual timing position.   The topic of a UTC time standard and how MsPASS handles this is a unique feature of MsPASS and is the topic of next section.

The SeismicPlotter has a fair amount of additional functionality.  See the BasicGraphics tutorial to learn some of that functionality.

## UTC and Relative Time
A unique feature of MsPASS is that we aimed to make it generic by supporting multiple time standards.  MsPASS currently supports two time standards we refer to with the name keys *UTC* and *Relative*.   The first, *UTC*, is well understood by all seismologists who work with any modern data.  UTC is a standard abbreviation for coordinated univeral time, which is the time standard used on all modern data loggers.  It is important to recognize that unlike obspy we store all UTC times internally as "unix epoch times".  Epoch times are the number of seconds in UTC since the first instant of the year 1970.  We use only epoch times internally as it vastly simplifies storage of time attributes since they can be stored as a standard python float (always a 64 bit real number in python) that causes no complications in storage to MongoDB.  It also vastly simpifies computing time differences, which is a very common thing in data processing.   To convert UTC times to a human readable form we suggest using the obspy UTCDateTime class as we did above.  The inverse (converting a UTC date string to an epoch time) is simple with the timestamp method of UTCDateTime.  Some wrapped obspy functions require UTCDateTime objects but all database times are stored as floats. Most obspy function, like the web service functions we used above, use the UTCDateTime class to define times.  The point is be to be cautious about what time arguments mean to different functions.

The idea of a *Relative* time is well known to anyone who has done seismic reflection processing.   Experienced SAC users will also understand the concept through a different mechanism that we generalize.  Time 0 for seismic reflection data ALWAYS means the time that the "shot" was fired.  That is a type example of what we mean by *Relative* time.   Times are "relative" to the shot time and 0 is the shot time.  Earthquake data can be converted to the same concept by setting time zero time for each signal to the origin time of the event.   SAC users will recognize that idea as the case of the "O" definition of the data's time stamp.   Our *Relative* time, in fact, is a generalization of SAC's finite set of definitions for the time stamp for one of their data files like Tn, B, O, etc.   *Relative* just means the data are relative to some arbitrary time stamp that we refer to internally as *t0_shift*.  It is the user's responsibility to keep track of what *t0_shift* means for your data and whether that reference is rational for the algorithm being run.  We stress, however, that TimeSeries and Seismogram objects do keep track of *t0_shift* internally.  The time reference can be shifted to a different meaning if desired through combinations of three different methods:  *rtoa* (switch from Relative to Absolute=UTC), *ator* (switch from Absolute(UTC) to Relative), and *shift* that is used to apply a relative time shift.  In particular, when used correctly any relative time data can be restored to UTC by simply called the *rtoa* method or using the *rota* wrapper function.

In this next block we take the ensemble we created above and apply a time shift to put 0 at a constant time shift from the origin time of each event.  

In [None]:
from obspy import UTCDateTime
from mspasspy.ccore.seismic import TimeSeriesEnsemble
# We make a deep copy with this mechanism so we can restore raw data later
enscpy=TimeSeriesEnsemble(ensemble)
i=0
for d in ensemble.member:
    print('member ',i,' input t0 time=',UTCDateTime(d.t0))
    d.ator(d.t0)
    print('member ',i,' after running ator has time 0=',d.t0)
    i+=1

Notice how the times changed from an offset from the origin time we used for downloading to 0.   We can see this effect graphically in the next box.  

In [None]:
plotter.plot(ensemble)

Note that now the time axis starts at 0, BUT that is Relative time.  Here that time is a fixed offset from the origin time we obtained form the hypocenter origin time for each event.  

Those signals are ugly because these aftershocks are buried in the long period coda of the mainshock.  That should be clear from the UTC time plot we first made where the bottom six signals overlap, a common problem that presents a huge problem in some processing frameworks.   Let's plot this again with a bandpass filter applied.

In [None]:
from mspasspy.algorithms.signals import filter
filter(ensemble,'bandpass',freqmin=0.2,freqmax=2.0)
plotter.plot(ensemble)

Notice that we can now see the P and S waves even for the two aftershocks in this extended short period band.  

## Windowing Data

The time scale we have in the plot above is largely useless;  it is just an arbitrary offset from the origin time for each event.   In this section we will illustrate the common processing step in dealing with teleseismic data where we need to extract a smaller time window around a phase of interest.  

First, as we just saw the aftershocks are a special problem because they are buried in low frequency coda of the mainshock.  We will thus focus for the rest of this tutorial on the mainshock.   First, let's load the vertical components of the mainshock from all recording stations into a working ensemble.   To do that, we first have to define a query method to extract only the data we want.   This will be a step you will nearly alway need to address in handling teleseismic data.  The previous examples and the steps below are a start, but we reiterate that if you become a serious user of MsPASS you will need to become familiar with the pymongo API.   Our documentation indirectly covers many of the essentials, but MongoDB is a large, heavily-used package with a lot of features.  Google is your friend with MongoDB and it is relatively easy to find answers to almost any question about usage.

With that lecture, the key data we will use is the start times printed in the box above that ran the ator method.  Here is the algorithm.  See comments below the code box that provide the lessons to be learned form this example:

In [None]:
t_to_query=UTCDateTime('2011-03-11T05:55:06.000000')
# This is a MongoDB range query.   We allow a +- 5 s slop because the start 
# times are alway slightly irregular. 
# we also do an exact match test for BHZ.  That works for these data but 
# more elaborate queries with wildcards are subject best left to MongoDB documentation
dt_for_test=5.0
tmin=t_to_query-dt_for_test
tmax=t_to_query+dt_for_test
query={'starttime' : {'$gt': tmin.timestamp,'$lt' : tmax.timestamp }}
# query is a python dict so we can add to an additional criteria like this
query['chan']='BHZ'
print('Time query to send to MongoDB server:')
print(query)
n=db.wf_TimeSeries.count_documents(query)
print('Ensemble size expected from time query=',n)
# We could just use that but let's get the special key source_id we set above
# and query with it as a more unambigous query
doc=db.wf_TimeSeries.find_one(query)
srcid=doc['source_id']
idquery={'source_id' : srcid}
idquery['chan']='BHZ'  # add same equality test of BHZ 
n=db.wf_TimeSeries.count_documents(idquery)
print('Query sent to MongoDB server=',idquery)
print('Mainshock ensemble size that is expected for that query=',n)
# Now build our working enemble.  To keep graphics clean we limit the 
# number retrieved to the first 15 stations.  Illustrates another pymongo function
ensemble=TimeSeriesEnsemble(15)
# Set the source_id into the ensembled metadata - this will now be a common source(shot)
# gather so appropriate to post it like this.  Might normally also add other source 
# metadata there but they aren't needed for this tutorial
ensemble['source_id']=doc['source_id']
cursor=db.wf_TimeSeries.find(idquery).limit(15)
for doc in cursor:
    d=db.read_data(doc)
    ensemble.member.append(d)
print("Size of ensemble actualy plotted=",len(ensemble.member))
plotter.plot(ensemble)

Some things you should learn from this example working from the top down:
1.  With pymongo a query is constructed as a python dict.   Equality matches are implied by  constructs like `query['chan']='BHZ'`.  The starttime construct is more elaborate but note the key to the dict is *starttime* and the value associated with the key is itself a dict.  Note that the concept of what MongoDB calls a "document" maps exactly into a python dict container.  MongoDB would call the dict container that is the value associated with *starttime* a "subdocument".  The construct used for the *starttime* query uses MongoDB operators as the keys in the subdocument.  The operators are keywords that begin with the dollar sign ($) symbol.   There is a long list of operators that can be found in various MongoDB online sources [like this one.](https://docs.mongodb.com/manual/reference/operator/query/>)
2.  We constructed both of our two query dictionaries (`query` and `idquery`) in two steps.  That isn't required.  We did it just to illustrate that a query can often best be constructed from a core set that can be defined with an incantation using curly brackets, colons, constants, and python variables.  That programming trick is not discussed in most tutorials.
3.  We create a new TimeSeriesEnsemble with a different approach that illustrates the atomic reader.  That constructor initializes the container setting aside 15 slots for TimeSeries objects.  Specifying 15 provides a minor improvement in efficiency.  All core data objects in MsPASS are implemented in C++ to improve performance.  Note that any symbol loaded with the path defined by *mspasspy.ccore* means the thing being accessed was written in C++ with python wrappers.  We do not currently have a clean mechanism for creating complete sphynx documentation pages for the C++ code.  Most of the API can be inferred from the C++ doxygen web page found [here.](https://wangyinz.github.io/mspass/cxx_api/mspass.html#mspass-namespace)
4.  Our example intentionally issues a second (redundant) query for the data of interest (the `idquery` symbol).  We do that strictly for the educational value this provides.  In this case they are exactly equivalent.  The two queries illustrate two alternative mechanism (with these data) to assemble a common source gather (Ensemble).  The second query uses a special entity used extensively in MongoDB called a [ObjectID](https://docs.mongodb.com/manual/reference/method/ObjectId/).  The second query using a *source_id* value works only because we loaded that cross-reference key with the data.  If you inspect the earlier boxes you will see we downloaded the data looping over a list of sources.  We defined *source_id* by extracting the ObjectID of the document holding the source information in the *source collection* referenced by the key `"_id"` of the linking source document.  (i.e. *source_id* is the same as the value associated with a document in the source collection with the key *_id*.) It has the advantage of always being accessible through a fast index.
In MsPASS the standard model for data is that source and receiver information are stored in what MongoDB calls the [normalized data model](https://docs.mongodb.com/manual/core/data-model-design/).  The next section addresses approaches to handling normalized data.

The last thing the above box does is plot the data. We see the data are again in UTC time and the data are not aligned because there is no "moveout correction".  The next step is then to convert the data to relative time and align the data on the predicted P wave arrival time.  The first step is to associate each TimeSeries object with the receiver metadata of the instrument that that recorded the data.  Because these data came from an FDSN data center (IRIS) a given channel of data is uniquely defined by the four magic SEED code names referred to as network, station, channel, and location.  In the standard MsPASS schema these are shortened to *net*, *sta*, *chan*, and *loc* respectively.  Here we use a preprocessing function called *get_seed_channel* to retrieve the basic station metadata and load that data into each member of the working ensemble.  

## Data Normalization
Data normalization is MongoDB's approach to what is called a "join" in relational database theory.   Relational databases (e.g. Antelope) are usually carefully designed to make some "join" operators as fast as possible.  An example in the CSS3.0 schema used by Antelope is joining "wfdisc" table tuples to "site" tuples.   In CSS3.0 databases that operation is used to link station coordinates to each waveform.   MongoDB is not a relational database but defines the same concept as "normalization".   We use `channel` and `site` collections to fill the roll of several CSS3.0 tables in a single "collection".  We also use a `source` collection in place of several source related tables in CSS3.0. The purpose of all is to put small amounts of information that are shared by many waveforms in a single, organized place that MongoDB calls a "collection".   We created `source`, `channel`, and `site` collections for this tutorial at the beginning of this notebook.

We have found that a pure database approach to handling MongoDB normalization is slow and can be a bottleneck in handling large data sets.   For that reason we developed multiple ways of handling the "normalization" problem.   The workflow below illustrates the two primary mechanisms we have found as a best practice for working with MsPASS.   As above we suggest you run this code box and then refer to below to discuss what it does and what it should teach you.

In [None]:
from mspasspy.db.normalize import (normalize,
                                   MiniseedMatcher,
                                   ObjectIdMatcher)
from bson import json_util
# We have to reconstruct the database handle to use the full schema for wf_TimeSeries
# Without this step the source collection normalization will fail
dbclient2=msc.Client()
db = dbclient2.get_database('getting_started')
# Method 1:   normalization during read is used for source data
idquery={'source_id' : srcid}
cursor=db.wf_TimeSeries.find(idquery)
ensemble=db.read_ensemble_data(cursor,normalize=["source"])
# Method 2:   matcher object with data loaded from db but saved in a cache
station_matcher = MiniseedMatcher(db)
for i in range(len(ensemble.member)):
    d = ensemble.member[i]
    ensemble.member[i] = normalize(d,station_matcher)
# Final section:  printing of results
print("This is a (pretty) print of all the metadata for member 0 of the ensemble")
print(json_util.dumps(ensemble.member[0],indent=2))
print("Table of source and receiver coordinates for first 10 ensemble members")
print("source: lat, lon, depth, otime | receiver:  lat, lon, elevation")
for i in range(10):
      d=ensemble.member[i]   # shorthand done only to reduce typing
      print( d["source_lat"],
             d["source_lon"],
             d["source_depth"],
             UTCDateTime(d["source_time"]),
             " | ",
             d["channel_lat"],
             d["channel_lon"],
             d["channel_elev"]
           )
    

Comments in the code above bracket sections of code showing the two methods we used for normalization.   They have different strengths and weaknesses described in the User's Manual.   Details on what was done above follow:

*Method 1 section*.   The example shows a "normalize on read" approach to normalization.   The example first uses a query to limit the read to a single ensemble linked to all data from a common source.   (We did that only because this section is instructional. We will extend this below to process the entire data set.)  We then call the Database method `read_ensemble_data` using the MongDB cursor object returned by the find query.   That form is identical to above but we added the `normalize=["source"]` argument.  That argument assumes each datum read has the attribute `source_id` defined.   In this case that is guaranteed because the query only returns data matching the specified source id.   Internally, the reader does a query of the "source" collection for the document with the specified source id and loads that data into the headers (Metadata) of every member of the ensemble.   Note we used that approach here as an illustration but it is not at al the most efficient way to do this operation.   A better algorithm would be to match and load the data to the ensemble's header (Metadata container).   We leave it as an exercise for the student to alter this notebook to do that.

*Method 2 section*.   Here we create a special python object (class) with the type `MiniseedMatcher`.   Details on this object can be found by reading the docstring for the class.  The basic idea, however, is that when the contents of the  `station_matcher` are created the entire "channel" collection will be loaded into memory and indexed.   The index uses the miniseed station codes and time ranges.  It should be obvious this approach only makes sense if the size of the channel collection is not huge.   In our experience, for passive array data that is always true.   We can imagine examples where that model may be wrong (e.g. DAS data) so we note there is also a `MiniseedDBmatcher` that accomplishes the same thing using database transactions.   In any case, what the code above does is loop through all members of the ensemble being processed loading a default list of attributes from the channel collection.  This print section at the end of the code blocks shows that both methods were successful in loading source and receiver coordinates.  

## Processing Workflow 
Finally, we illustrate an example workflow.   We show first how this particular task can be done as a "serial" job.  We then demonstrate that with the proper structure it is very easy to convert a serial job to one that can use the dask or spark schedulers to allow the task to be run on a cluster.

### Serial Version of Workflow
The example here aims to process the data set we downloaded with the following sequence of steps repeated for every waveform in this dataset:
1.  Load the data and build the internal mspass data objects we call TimeSeries.  We normalize the source data while reading.
2.  Normalize the data to load receiver coordinates.
3.  Demean each datum
4.  Resample all data to 20 sps.
5.  Compute model-based P wave travel times and load the times into the data headers.
6.  Time shift the data to have zero time equal to the P computed P wave arrival time.
7.  Window the data to a smaller time window relative to the P time.

Normally we would terminate this job by saving the results to the database, but we will not do that initially to demonstrate parallel concepts.  

Before running the serial version of this workflow, it is worth pointing out that in most cases the best way to build the script to process a set of data is to use a subset of the data to verify everything is working right before attempting to run a workflow on a large data set.  The example below emphasizes that by not actually attempting to process all the data.  The following line contains the MongoDB incantation to limit the test run to the first 100 waveforms:
```
cursor=db.wf_TimeSeries.find(query).limit(100)
```
If we left off `limit(100)` that command would read all the data satisfying the query.  

Now run the code below and we will discuss what the details after the code block.

In [None]:
from mspasspy.algorithms.window import WindowData
from mspasspy.algorithms.signals import detrend
from mspasspy.algorithms.basic import ator
from mspasspy.ccore.algorithms.basic import TimeWindow
from mspasspy.db.normalize import normalize
from mspasspy.algorithms.window import WindowData
from mspasspy.algorithms.resample import (ScipyResampler,
                                          ScipyDecimator,
                                          resample,
                                         )
from mspasspy.algorithms.bundle import bundle_seed_data
from obspy.geodetics import gps2dist_azimuth,kilometers2degrees
from obspy.taup import TauPyModel
import time

def set_Ptime(d,Ptimekey="Ptime",model=None):
    """
    Function to calculate P wave arrival time and set that time 
    as the header (Metadata) field defined by Ptimekey.   
    
    :param d:  input datum.  Type is not checked and is assumed 
      to be a TimeSeries.  It also must contain source coordinates 
      and "channel" based coordinates (i.e.  channel_lat, channel_lon,
      and channel_elev) or the datum will be killed.  This algorithm 
      assumes these are already set because in this workflow 
      normalization methods will kill any datum for which they 
      couldn't be loaded.
    :param Ptimekey:  key used to define the header attribute that 
      will contain the computed P time.  Default "Ptime".
    :param model:  instance of obspy TauPyModel travel time engine. 
      Default is None.   That mode is slow as an new engine will be
      constructed on each call to the function.  Normal use should 
      pass an instance for greater efficiency.  
    """
    if d.live:
        if model is None:
            model = TauPyModel(model="iasp91") 
        srclat=d["source_lat"]
        srclon=d["source_lon"]
        srcz=d["source_depth"]
        srct=d["source_time"]
        stalat=d["channel_lat"]
        stalon=d["channel_lon"]
        staelev=d["channel_elev"]
        georesult=gps2dist_azimuth(srclat,srclon,stalat,stalon)
        # obspy's function we just called returns distance in m in element 0 of a tuple
        # their travel time calculator it is degrees so we need this conversion
        dist=kilometers2degrees(georesult[0]/1000.0)
        arrivals=model.get_travel_times(source_depth_in_km=srcz,
                                            distance_in_degree=dist,
                                            phase_list=['P'])
        Ptime=srct+arrivals[0].time
        d.put(Ptimekey,Ptime)
    return d

ttmodel = TauPyModel(model="iasp91")
station_matcher = MiniseedMatcher(db)
resampler=ScipyResampler(20.0)
decimator=ScipyDecimator(20.0)
stime=-5.0
etime=100.0
# This query is used to reduce debris.  We only return data for which the 
# id source_id is defined
query={"source_id" : {"$exists":True}}
n=db.wf_TimeSeries.count_documents(query)
print("Number of waveforms to process for full data set=",n)
cursor=db.wf_TimeSeries.find(query).limit(100)
t0 = time.time()
for doc in cursor:
    d = db.read_data(doc,normalize=['source'])
    d = normalize(d,station_matcher)
    d = detrend(d,type="constant")
    d = resample(d,decimator,resampler)
    d = set_Ptime(d,model=ttmodel)
    # We don't need to test if d is dead here because ator 
    # is a mspass function that handles that automatically
    d = ator(d,d["Ptime"])
    d = WindowData(d,stime,etime)
    #print(d.live,d.dt,d.t0,d.endtime())
t=time.time()    
print("Time to process 100 waveforms using a loop=",t-t0)

Elements of this code from the to down are:
1.  After a set of required imports we define a function needed in the processing step to compute and set P wave arrival times.
2.  From the end of the function body to the to of the for loop are a set of initializations.  In addition to the matcher that is the same as that used above we define an instance of the obspy travel time calcultor, we recreate a pair of resample operators used to assure all the data are at the same sample rate, and we define two time window parameters (`stime` and `etime`).  
3.  The for loop is driven by a MongoDB "cursor" object return by the call to the database find method.  As noted for this instructive test we limit the loop to only 100 waveforms. 
4.  We then enter the for loop that processes a series of waveforms sequentially (serial processing).  The sequence of lines implement the steps listed above:  read data, normalize to set channel attributes, detrend, resample, set the P wave arrival time, time shift so time 0 is at the P wave time, and then window around the P wave arrival time. 

### Parallel workflow

Now we show how the exact same workflow as above can be run in parallel.  Note for this case we remove the safeties and process the entire job.  We don't save it yet though.  Run this block and read the discussion below while it runs.

In [None]:
from mspasspy.io.distributed import read_distributed_data

ttmodel = TauPyModel(model="iasp91")
station_matcher = MiniseedMatcher(db)
resampler=ScipyResampler(20.0)
decimator=ScipyDecimator(20.0)
stime=-5.0
etime=100.0
# This query is used to reduce debris.  We only return data for which the 
# id source_id is defined
query={"source_id" : {"$exists":True}}
n=db.wf_TimeSeries.count_documents(query)
print("Number of waveforms to process=",n)
cursor=db.wf_TimeSeries.find(query)
t0 = time.time()
bg = read_distributed_data(db,cursor,normalize=['source'])
bg = bg.map(normalize,station_matcher)
bg = bg.map(detrend,type="constant")
bg = bg.map(resample,decimator,resampler)
bg = bg.map(set_Ptime,model=ttmodel)
# This version took 5 times longer to run
#bg = bg.map(set_Ptime)
bg = bg.map(lambda d : ator(d,d["Ptime"]))
bg = bg.map(WindowData,stime,etime)
bg = bg.map(db.save_data,data_tag="Pwave_windowed_data")
bg = bg.map(lambda d : d.live)
res=bg.compute()
t=time.time()    
print("Parallel job processing time=",t-t0)
print("Time per waveform=",(t-t0)/n)
nlive=0
n=len(res)
for x in res:
    if x:
        nlive += 1
print("Processing completed ",nlive," of ",n," waveforms handled")

Compare the above with the earlier serial job.  The for loop command and read_data line are replaced by the following line:
```
bg = read_distributed_data(db,cursor,normalize=['source'])
```
The `read_distributed_data` function creates a container called a dask "bag".  A convenient way to view a bag is list of things that doesn't need to fit in memory.   The "things", in our case, are mspass TimeSeries objects.   The `read_distributed_data` line is followed by a series of lines that in python jargon apply the "map method" of the "bag" object/container. The concept of a "map" operator is one of the two keywords in the modern concept of the "map-reduce" model of big data science.  You can find many web pages and turorials discussing map-reduce in general and map-reduce for dask in particular.   For now, we emphasize that arg0 of the map method is a function name.  Each call to map applies a named function to data that it assumes emits another datum that is always the same type.   All the processing functions in the loop above use that model.  For example, the `resample` function takes an input TimeSeries of any sample rate and returns a resampled representation of that datum at 20 sps.  

With that background, note the workflow runs a sequence of algorithms through the map method driven by the same function names as above in the same order. For example, consider this line in the serial job that runs the normalize function we used earlier:
```
d = normalize(d,station_matcher)
```
The comparable operator above is this:
```
bg = bg.map(normalize,station_matcher)
```
The key point we want to make here is that it is straightforward to convert any loop like the serial job to a parallel version using dask.  There are three deviation:
1.  The call to the ator function required us to use a python `lambda` function.  That is often a useful trick to handle variable arguments padded through header values.   If you are unfamiliar with lambda function there are numerous articles on this topic on the web.
2. We added a call to `db.save_data` so we an work on these data further befow.   
3.  We use a terminator lambda function to return only the value of the boolean "live" attribute.    

A feature of dask potentially confusing to newcomers is all the calls the the bag "map method" are "lazy".   What that means is nothing is actually computed until we call the bag's "compute method".   A simple way to understand the call to compute is that it converts a bag to a python list and returns the result.  We store that list here as `res`.  This entire data set may not fit in your local machine.  That is why we used the last lambda function. It reduces the bag to a list of booleans that are unlikely to cause a memory problem. 

### Parallel Workflow with dask distributed
We next introduce an important variant of the above when running a job on a large cluster with multiple nodes.  The method above is appropriate for testing a workflow on a desktop before running the job on a bigger scale data set on a large cluster with multiple nodes.   In that case the authors of dask recommend the use of the newer scheduler they call [dask distributed](https://distributed.dask.org/en/stable/). Besides better performance in a cluster dask distributed adds the capability of monitoring a job in real time and profiling a job through the dask [diagnostic monitor](https://distributed.dask.org/en/stable/diagnosing-performance.html).   The next block enables this capability with this notebook. 

In [None]:
from dask.distributed import Client
scheduler_client=Client()
scheduler_client

The status page for this notebook is now available to you an can be accessed via port 8787.   It is because of that requirement that you may have had to restart this container with the `-p 8787:8787` incantation.   Without that port mapping you would not be able to connect to the diagnostics page.   We note that in our experience using the hyperlink above will no work either.   You instead will probably need to use the link via the default localhost of `127.0.0.1:8787/status`.  You might be able to click on [this link](http://127.0.0.1:8787/status).   If that doesn't work resort to a cut and paste of the above url. 

Now that we have dask diagnostics running let's run a variation of the above workflow that will allow you to watch dask work.  Note this workflow differs from the above in three ways:
1.  It doesn't repeat the initializations.  Note that was done here only because of the structure of this notebook and would not be normal.
2.  We use a different approach to launch the computations.  We link the bag (`bg`) to the dask distributed scheduler we just created.   Without that line the diagnostics monitor will not display.
3.  We intentionally commented out the line to save the data.  We did that to allow you to run this next box repeatedly and not produce duplicate data.  

As item 3 says run the next code box and watch the real time display.   Experiment with the different menu options as described in the dask documentation link above.   When the job completes you might also want to look at the profiling output.   We won't dwell on the details of dask diagnostics, but refer you to documentation.  The main point here is that those tools can be useful to improve performance on a workflow you need to run on a large amount of data.

In [None]:
cursor=db.wf_TimeSeries.find(query)
t0 = time.time()
bg = read_distributed_data(db,cursor,normalize=['source'])
bg = bg.map(normalize,station_matcher)
bg = bg.map(detrend,type="constant")
bg = bg.map(resample,decimator,resampler)
bg = bg.map(set_Ptime,model=ttmodel)
bg = bg.map(lambda d : ator(d,d["Ptime"]))
bg = bg.map(WindowData,stime,etime)
bg = bg.map(lambda d : d.live)
#bg = bg.map(db.save_data,data_tag="Pwave_windowed_data")
scheduler_client.persist(bg)
res=bg.compute()
t=time.time()    
print("Parallel job processing time with dask distributed=",t-t0)
print("Time per waveform=",(t-t0)/n)
nlive=0
n=len(res)
for x in res:
    if x:
        nlive += 1
print("Processing completed ",nlive," of ",n," waveforms handled")

## Creating Seismogram Objects
MsPASS considers two data types to be atomic seismic data.  What we call a *TimeSeries* is a single channel of data.  As we saw above a unique seed combination of the codes net, sta, chan, and loc define a single channel data stream.  In all seismic data processing we usually cut out sections of data as a chunk to be dealt with as a single entity.  There are some algorithms where the model of a single channel of data is meaningless or useless.  Data recorded by a three-component seismic station is a case in point;  the components have a fundamental relationship that for some applications make them indivisible.  Two examples most seismologists will be familiar with are particle motion measurements and conventional receiver functions.   Both require an input of three-component data to make any sense.   For this reason we distinguish a differerent atomic object we call a *Seismogram* to define data that is by definition a three-component set of recordings.

The problem of assembling three-component data from raw data is not at all trivial. Today most data loggers produce multiple sample rate representations of the same data stream and observatory data like GSN stations frequently have multiple, three-component (3C) sensors at the same approximate location.  It is not at all uncommon to have 24 or more channels defining the same data with different sensors and different sample rates.  Active source multichannel data, by which I mean older cable systems were the data were multiplexed, present a different problem assembling three-component data as a map between channel number and component is required to put the components in the right place for each receiver position.  The primary reason we define a separate data object for 3C versus scalar data is too allow algorithms that depend upon 3C data to not be burdened with that complexity.  If the workflow you need for your research requires 3C data you should alway think of your workflow as four distinct steps:
1. Importing raw data as TimeSeries.  For MsPASS that always means selecting and cutting out time windows that define what part of what signals you need to work with.  The example above cutting data down to a P wave window is an example, but the concept is generic.
2. Populating the metadata that define at least two fundamental properties of all data channels to be handled:  (1) at least relative amplitudes between components and (2) orientations of the components in space.  We already saw a solution to this problem in the section on "normalization".  
3. A generic process we call *bundle*.  Bundling is descriptive because the process requires the three components that define a particular sensor to be joined together an put into the single container we call a *Seismogram*.  
4.  Save the bundled data in some form.  In MsPASS the preferred form is storage under the control of MongoDB and MsPASS, but such data could also be exported in some output format for handling by some external package.  


In this tutorial we already completed step 1 and part of step 2.  The example we develop here is form miniseed data assembled in common source gathers, which what we created above.  

temporary - develop this workflow to create seismogram data serial then get it running parallel

In [None]:
from mspasspy.algorithms.bundle import bundle_seed_data
from mspasspy.util.Undertaker import Undertaker
from mspasspy.ccore.seismic import TimeSeriesEnsemble
db=Database(dbclient,'getting_started')

srcids=db.wf_TimeSeries.distinct('source_id')
stedronsky=Undertaker(db)
for sid in srcids:
    query={'source_id' : sid,
           'data_tag' : 'Pwave_windowed_data'}
    nd=db.wf_TimeSeries.count_documents(query)
    cursor=db.wf_TimeSeries.find(query)
    ensemble = db.read_ensemble_data(cursor,normalize=["source","channel"]
    print('Number of TimeSeries objects for this source=',len(ensemble.member))
    ens3c=bundle_seed_data(ensemble)
    print('Number of (3C) Seismogram object saved for this source=',len(ens3c.member))
    [living,bodies]=stedronsky.bring_out_your_dead(ens3c)
    print('number of bundled Seismogram=',len(living.member))
    print('number of killed Seismogram=',len(bodies.member))
    for i in range(len(bodies.member)):
        d=bodies.member[i]
        net=d['net']
        sta=d['sta']
        print('Errors posted for net=',net,' station=',sta)
        for e in d.elog.get_error_log():
            print(e.algorithm,e.badness,e.message)
    db.save_ensemble_data(ens3c)

In [None]:
n=db.Seismogram.count_documents({})
print('Total number of seismograms objects now in db=',n)

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/obspy/clients/fdsn/routing/routing_client.py", line 97, in _try_download_bulk
    return _download_bulk(r)
  File "/opt/conda/lib/python3.10/site-packages/obspy/clients/fdsn/routing/routing_client.py", line 140, in _download_bulk
    return fct(bulk_str + r["bulk_str"])
  File "/opt/conda/lib/python3.10/site-packages/obspy/clients/fdsn/client.py", line 1051, in get_waveforms_bulk
    data_stream = self._download(
  File "/opt/conda/lib/python3.10/site-packages/obspy/clients/fdsn/client.py", line 1482, in _download
    code, data = download_url(
  File "/opt/conda/lib/python3.10/site-packages/obspy/clients/fdsn/client.py", line 1927, in download_url
    data = io.BytesIO(f.read())
  File "/opt/conda/lib/python3.10/http/client.py", line 459, in read
    return self._read_chunked(amt)
  File "/opt/conda/lib/python3.10/http/client.py", line 591, in _read_chunked
    value.append(self._safe_read(chunk_left))
