# Earthscope MsPASS Short Course - preprocess

## *Gary L. Pavlis, Indiana University and Yinzhi (Ian) Wang, TACC*

## Purpose of this Notebook
This notebook is intended to be run without explanation by students prior to the start of the course.  The purpose is only to create a working data set that you will work with during the first class session.  You may attempt to grok the code in this notebook but be aware the plan is to visit the code in this notebook near the end of the first class session to discuss what exactly it does.  

Readers who find this notebook on GitHub and are not in the 2025 short course should note that it was designed to run on Earthscope's GeoLab jupyter lab gateway on AWS.   It can still work if you are running MsPASS on a local computer.   The main difference is that if you run __[mspass-desktop](https://www.mspass.org/getting_started/mspass_desktop.html)__ the step below to launch mongod is not necessary.    

## Step one:  Terminal preprocess preparation steps
### Launch the database server
You must run a couple of commands in a jupyter terminal window to allow this notebook to be run.   The first is a necessary evil to avoid authentication complexities in a shared database server.   That is, for this class each student will be running their own instance of the database package called MongoDB.  In normal MsPASS usage mongodb is launched as a service automatically, but we have been unable to devise a comparable setup on GeoLab.  Consequently, you will need to do the following:
1.  Launch a jupyter "Terminal" tab.  (Push the + icon in the upper left to create  "Launcher" tab. Then push the Terminal icon.)
2.  Then type this incantation:
```
mongod --dbpath ./db --logpath ./logs
```
3.  Verify that worked by typing `ps -A`.  You should see a line where the CMD field is "mongod".   If you do not, and you are unable to solve the problem yourself, contact me by email or Slack.
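
As an alternative check from Python, the sketch below tests whether anything is accepting connections on MongoDB's default port, 27017. It assumes mongod was started with the command above on the same machine and with the default port. The function name `mongod_is_listening` is hypothetical, and an open port only shows that some process is listening there, not that the process is mongod.

```python
import socket

def mongod_is_listening(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError if nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("Server reachable on default MongoDB port:", mongod_is_listening())
```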

### Fetch waveform files
Another current limitation of GeoLab is there is no shared storage area.  You will need to download the waveform data stored on AWS as "S3" bucket objects to your GeoLab file work space.  Cut-and-paste this incantation in the terminal window you used above:
```
mkdir -p wf && cd wf && for i in {0..19}; do wget https://essc-mspass2024.s3.us-east-2.amazonaws.com/Event_${i}.msd;done
```
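
To confirm the download completed, a minimal sketch such as the following can be run in a notebook cell. It assumes the files were written to a `wf` directory under the notebook's working directory and are named `Event_0.msd` through `Event_19.msd`, as in the wget loop above. The helper `missing_event_files` is hypothetical and is not part of MsPASS.

```python
import os

def missing_event_files(directory="wf", nfiles=20):
    """Return expected Event_<i>.msd names that are absent or empty in directory."""
    missing = []
    for i in range(nfiles):
        name = f"Event_{i}.msd"
        path = os.path.join(directory, name)
        # a zero-length file usually means the transfer failed
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing

print("Missing or empty files:", missing_event_files())
```

An empty list means all 20 files are present and non-empty; it does not verify that the contents are valid miniseed.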

## Step two:  Run the notebook
Once the steps above have run successfully, the rest of this notebook can be run sequentially from top to bottom.  When mongod is running and your wget command to fetch the waveform data has completed in the terminal window, just select Run->Run All Cells.   When the notebook finishes you will need to complete a short quiz in Moodle to verify you completed this preclass exercise successfully.   

In [1]:
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
client=Client("IRIS")
ts=UTCDateTime('2011-01-01T00:00:00.0')
starttime=ts
te=UTCDateTime('2012-01-01T00:00:00.0')
endtime=te
lat0=38.3
lon0=142.5
minmag=7.0

cat=client.get_events(starttime=starttime,endtime=endtime,
        minmagnitude=minmag)
# this is a weird incantation suggested by obspy to print a summary of all the events
print(cat.__str__(print_all=True))

20 Event(s) in Catalog:
2011-12-14T05:04:57.810000Z |  -7.528, +146.814 | 7.1  MW
2011-10-28T18:54:34.750000Z | -14.557,  -76.121 | 7.0  MW
2011-10-23T10:41:22.010000Z | +38.729,  +43.447 | 7.1  MW
2011-10-21T17:57:17.310000Z | -28.881, -176.033 | 7.4  MW
2011-09-15T19:31:03.160000Z | -21.593, -179.324 | 7.3  MW
2011-09-03T22:55:35.760000Z | -20.628, +169.778 | 7.0  MW
2011-08-24T17:46:11.560000Z |  -7.620,  -74.538 | 7.0  MW
2011-08-20T18:19:24.610000Z | -18.331, +168.226 | 7.0  MW
2011-08-20T16:55:04.090000Z | -18.277, +168.067 | 7.1  MW
2011-07-10T00:57:10.910000Z | +38.055, +143.302 | 7.0  MW
2011-07-06T19:03:20.470000Z | -29.307, -176.257 | 7.6  MW
2011-06-24T03:09:38.920000Z | +51.980, -171.820 | 7.3  MW
2011-04-07T14:32:44.100000Z | +38.251, +141.730 | 7.1  MW
2011-03-11T06:25:50.740000Z | +38.051, +144.630 | 7.6  MW
2011-03-11T06:15:37.570000Z | +36.227, +141.088 | 7.9  MW
2011-03-11T05:46:23.200000Z | +38.296, +142.498 | 9.1  MW
2011-03-09T02:45:19.590000Z | +38.441, +142.980 

This next section creates a client and a related database handle to interact with MongoDB.

In [2]:
import mspasspy.client as msc

msc_client=msc.Client()
dbclient=msc_client.get_database_client()

In [3]:
dbname = "Earthscope2025"
db = dbclient.get_database(dbname)

As the name suggests this saves the data downloaded by obspy to MongoDB. 

In [4]:
n=db.save_catalog(cat)
print('number of event entries saved in source collection=',n)

number of event entries saved in source collection= 20


The files we just downloaded are raw miniseed data files.   We need to build an index that MsPASS can use to crack these files and load them into our processing workflow later.   We run this step in parallel to improve performance as this process has to pass through every byte of all 20 files. 

In [5]:
import os
import dask.bag as dbg
# remove the comment below if you need to restart this workflow 
# at this point
#db.drop_collection('wf_miniseed')
# Note this dir value assumes the wf dir was created with 
# the previous command that also downloads the data from AWS
current_directory = os.getcwd()
dir = os.path.join(current_directory, 'wf')
dfilelist=[]
with os.scandir(dir) as entries:
    for entry in entries:
        if entry.is_file():
            dfilelist.append(entry.name)
print(dfilelist)
mydata = dbg.from_sequence(dfilelist)
mydata = mydata.map(db.index_mseed_file,dir=dir)
index_return = mydata.compute()

['Event_7.msd', 'Event_17.msd', 'Event_0.msd', 'Event_11.msd', 'Event_4.msd', 'Event_1.msd', 'Event_14.msd', 'Event_12.msd', 'Event_15.msd', 'Event_6.msd', 'Event_16.msd', 'Event_19.msd', 'Event_13.msd', 'Event_8.msd', 'Event_5.msd', 'Event_9.msd', 'Event_3.msd', 'Event_2.msd', 'Event_10.msd', 'Event_18.msd']


In [6]:
n=db.wf_miniseed.count_documents({})
print("Number of wf_miniseed indexing documents saved by MongoDB = ",n)

Number of wf_miniseed indexing documents saved by MongoDB =  26247


Next we retrieve station metadata with web services using obspy.  The result is loaded as a single obspy *Inventory* object.  We then save the data to MongoDB with the MsPASS database method called *save_inventory*.   The *Inventory* object is disassembled and its contents are saved as python dictionaries, each of which becomes a MongoDB document.
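
The idea of disassembling a hierarchical inventory into flat documents can be sketched with plain dictionaries. This is only an illustration: the nested input structure, the key names, and the station code `X01` are assumptions chosen for the example, and the code is not how *save_inventory* is implemented.

```python
def flatten_inventory(networks):
    """Turn a network -> station -> channel hierarchy into flat dictionaries."""
    site_docs = []
    channel_docs = []
    for net in networks:
        for sta in net["stations"]:
            site = {"net": net["code"], "sta": sta["code"],
                    "lat": sta["lat"], "lon": sta["lon"], "elev": sta["elev"]}
            site_docs.append(site)
            for chan in sta["channels"]:
                # each channel document repeats the station fields it needs
                doc = dict(site)
                doc.update({"chan": chan["code"],
                            "hang": chan["hang"], "vang": chan["vang"]})
                channel_docs.append(doc)
    return site_docs, channel_docs

# hypothetical one-station, three-channel example
example = [{"code": "TA", "stations": [
    {"code": "X01", "lat": 40.0, "lon": -100.0, "elev": 0.5,
     "channels": [{"code": "BHZ", "hang": 0.0, "vang": 0.0},
                  {"code": "BHN", "hang": 0.0, "vang": 90.0},
                  {"code": "BHE", "hang": 90.0, "vang": 90.0}]}]}]
sites, channels = flatten_inventory(example)
print(len(sites), "site document(s),", len(channels), "channel document(s)")
```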

In [7]:
ts=UTCDateTime('2010-01-01T00:00:00.0')
starttime=ts
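te=UTCDateTime('2013-01-01T00:00:00.0')
# NOTE: endtime is not reassigned here, so the query below still uses the
# value set in the first code cell (2012-01-01); te is defined but not used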
inv=client.get_stations(network='TA',starttime=starttime,endtime=endtime,
                        format='xml',channel='BH?',level='response')
net=inv.networks
x=net[0]
sta=x.stations
print("Number of stations retrieved=",len(sta))

Number of stations retrieved= 855


In [8]:
ret=db.save_inventory(inv)
print('save_inventory returned value=',ret)

Database.save_inventory processing summary:
Number of site records processed= 857
number of site records saved= 857
number of channel records processed= 2796
number of channel records saved= 2784
save_inventory returned value= (857, 2784, 857, 2796)


This section creates the normalization cross-references needed later to connect wf_miniseed documents to matching documents in the *site*, *channel*, and *source* collections.   The algorithms used for source and receiver data are different.  
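
The difference between the two kinds of matching can be sketched in plain Python. Receiver metadata is matched on station codes plus a time interval, while source metadata is matched on time alone. The sketch assumes the source test is that the data start time minus `t0offset` falls within `tolerance` seconds of the origin time. That reading of the `OriginTimeMatcher` arguments, along with the helper names and dictionary keys used here, is an assumption of this example and should be checked against the MsPASS documentation.

```python
def match_channel(wfdoc, channel_docs):
    """Return channel docs with the same codes whose time span contains the data start time."""
    keys = ("net", "sta", "chan", "loc")
    return [c for c in channel_docs
            if all(c.get(k) == wfdoc.get(k) for k in keys)
            and c["starttime"] <= wfdoc["starttime"] <= c["endtime"]]

def match_source(wfdoc, source_docs, t0offset=300.0, tolerance=100.0):
    """Return source docs whose origin time is near (data start time - t0offset)."""
    expected_origin = wfdoc["starttime"] - t0offset
    return [s for s in source_docs
            if abs(s["time"] - expected_origin) <= tolerance]

# hypothetical documents with times in seconds
wf = {"net": "TA", "sta": "X01", "chan": "BHZ", "loc": "", "starttime": 450.0}
chans = [{"net": "TA", "sta": "X01", "chan": "BHZ", "loc": "",
          "starttime": 0.0, "endtime": 1000.0}]
sources = [{"time": 100.0}, {"time": 5000.0}]
print("channel matches:", len(match_channel(wf, chans)))
print("source matches:", len(match_source(wf, sources)))
```

A time-only test for sources can fail to match or can match more than one event when origin times are close together, which is one reason the source count can differ from the channel and site counts.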

In [9]:
from mspasspy.db.normalize import (
    bulk_normalize,
    MiniseedMatcher,
    OriginTimeMatcher,
)
chan_matcher = MiniseedMatcher(db,
        collection="channel",
        attributes_to_load=["starttime","endtime","lat","lon","elev","hang","vang","_id"],
    )
site_matcher = MiniseedMatcher(db,
        collection="site",
        attributes_to_load=["starttime","endtime","lat","lon","elev","_id"],
    )
source_matcher = OriginTimeMatcher(db,t0offset=300.0,tolerance=100.0)
ret = bulk_normalize(db,matcher_list=[chan_matcher,site_matcher,source_matcher])
print("Number of documents processed in wf_miniseed=",ret[0])
print("Number of documents updated with channel cross reference id=",ret[1])
print("Number of documents updated with site cross reference id=",ret[2])
print("Number of documents updated with source cross reference id=",ret[3])

Number of documents processed in wf_miniseed= 26247
Number of documents updated with channel cross reference id= 26247
Number of documents updated with site cross reference id= 26247
Number of documents updated with source cross reference id= 26232


In [10]:
from mspasspy.algorithms.window import WindowData
from mspasspy.algorithms.signals import detrend
from mspasspy.algorithms.basic import ator
from mspasspy.ccore.algorithms.basic import TimeWindow
from mspasspy.ccore.utility import ErrorSeverity
from mspasspy.db.normalize import (normalize,
                                   ObjectIdMatcher,
                                   OriginTimeMatcher)
from mspasspy.algorithms.resample import (ScipyResampler,
                                          ScipyDecimator,
                                          resample,
                                         )
from obspy.geodetics import gps2dist_azimuth,kilometers2degrees
from obspy.taup import TauPyModel
import time

def set_PStime(d,Ptimekey="Ptime",Stimekey="Stime",model=None):
    """
    Function to calculate P and S wave arrival time and set times 
    as the header (Metadata) fields defined by Ptimekey and Stimekey.
    Tries to handle some complexities of what the travel time calculator 
    returns when one or both of P and S cannot be calculated.  That is 
    the norm in or at the edge of the core shadow.  
    
    :param d:  input TimeSeries datum.  Assumes datum's Metadata 
      contains stock source and channel attributes.  
    :param Ptimekey:  key used to define the header attribute that 
      will contain the computed P time.  Default "Ptime".
    :param Stimekey:  key used to define the header attribute that 
      will contain the computed S time.  Default "Stime".
    :param model:  instance of obspy TauPyModel travel time engine. 
      Default is None.   That mode is slow as a new engine will be
      constructed on each call to the function.  Normal use should 
      pass an instance for greater efficiency.  
    """
    if d.live:
        if model is None:
            model = TauPyModel(model="iasp91") 
        # extract required source attributes
        srclat=d["source_lat"]
        srclon=d["source_lon"]
        srcz=d["source_depth"]
        srct=d["source_time"] 
        # extract required channel attributes
        stalat=d["channel_lat"]
        stalon=d["channel_lon"]
        staelev=d["channel_elev"]
        # set up and run travel time calculator
        georesult=gps2dist_azimuth(srclat,srclon,stalat,stalon)
        # obspy's function we just called returns distance in m in element 0 of a tuple
        # their travel time calculator expects degrees so we need this conversion
        dist=kilometers2degrees(georesult[0]/1000.0)
        arrivals=model.get_travel_times(source_depth_in_km=srcz,
                                            distance_in_degree=dist,
                                            phase_list=['P','S'])
        # always post this as it is not cheap to compute
        # WARNING:  don't use common abbreviation delta - collides with data dt
        d['epicentral_distance']=dist
        # these are CSS3.0 shorthands: s - station, e - event
        esaz = georesult[1]
        seaz = georesult[2]
        # css3.0 names esaz = event to station azimuth; seaz = station to event azimuth
        d['esaz']=esaz
        d['seaz']=seaz
        # get_travel_times can return fewer than two arrivals when P and/or S 
        # cannot be calculated.  The blocks below handle only the cases of 
        # two arrivals or one arrival; for any other count no times are posted
        if len(arrivals)==2:
            Ptime=srct+arrivals[0].time
            rayp = arrivals[0].ray_param
            Stime=srct+arrivals[1].time
            rayp_S = arrivals[1].ray_param
            d.put(Ptimekey,Ptime)
            d.put(Stimekey,Stime)
            # These keys are not passed as arguments but could be - a choice
            # Ray parameter is needed for free surface transformation operator
            # note tau p calculator in obspy returns p=R sin(theta)/V_0
            d.put("rayp_P",rayp)
            d.put("rayp_S",rayp_S)
        elif len(arrivals)==1:
            if arrivals[0].name == 'P':
                Ptime=srct+arrivals[0].time
                rayp = arrivals[0].ray_param
                d.put(Ptimekey,Ptime)
                d.put("rayp_P",rayp)
            else:
                # Not sure we can assume name is S
                if arrivals[0].name == 'S':
                    Stime=srct+arrivals[0].time
                    rayp_S = arrivals[0].ray_param
                    d.put(Stimekey,Stime)
                    d.put("rayp_S",rayp_S)
                else:
                    message = "Unexpected single phase name returned by taup calculator\n"
                    message += "Expected phase name S but got " + arrivals[0].name
                    d.elog.log_error("set_PStime",
                                     message,
                                     ErrorSeverity.Invalid)
                    d.kill()
                
    # Note python indents mean if the datum is marked dead this function just silently returns 
    # what it received doing nothing - correct mspass model
    return d


def cut_Pwindow(d,stime=-100.0,etime=500.0):
    """
    Window datum relative to P time window.  Time
    interval extracted is Ptime+stime to Ptime+etime.
    Uses ator,rtoa feature of MsPASS.
    """
    if d.live:
        if "Ptime" in d:
            ptime = d["Ptime"]
            d.ator(ptime)
            d = WindowData(d,stime,etime)
            d.rtoa()
    return d
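
The windowing logic in `cut_Pwindow` can be illustrated without MsPASS. The sketch below assumes a uniformly sampled signal described by a start time `t0` and sample interval `dt`. It shifts the time base so the P time is zero, cuts the interval from `stime` to `etime`, and then restores absolute time. The function `cut_relative` and its arguments are hypothetical, and the sketch does not reproduce how `WindowData` treats rounding or windows that extend beyond the data.

```python
def cut_relative(samples, t0, dt, ptime, stime=-100.0, etime=500.0):
    """Cut samples to ptime+stime .. ptime+etime; return (window, new_t0)."""
    # analogue of ator: express the data start time relative to the P time
    t0_rel = t0 - ptime
    # sample indices bounding the requested relative window, clipped to the data
    istart = max(0, round((stime - t0_rel) / dt))
    iend = min(len(samples), round((etime - t0_rel) / dt) + 1)
    window = samples[istart:iend]
    # analogue of rtoa: restore an absolute start time for the cut data
    new_t0 = ptime + t0_rel + istart * dt
    return window, new_t0

# hypothetical example: 1000 samples at 1 s spacing with P at absolute time 1300 s
window, new_t0 = cut_relative(list(range(1000)), t0=1000.0, dt=1.0, ptime=1300.0)
print(len(window), "samples starting at absolute time", new_t0)
```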
   

In [11]:
from obspy.taup import TauPyModel
from mspasspy.algorithms.signals import filter,detrend
from mspasspy.util.Janitor import Janitor

ttmodel = TauPyModel(model="iasp91")
stime=-100.0
etime=500.0

janitor = Janitor()
# nonstandard keys added for travel times - need to save these to keep them from being thrown out by the janitor
for k in ["seaz","esaz","Ptime","epicentral_distance","rayp_P","rayp_S"]:
    janitor.add2keepers(k)

t0 = time.time()
nlive=0
ndead=0
cursor=db.wf_miniseed.find({})
for doc in cursor:
    # the normalize matchers in this read were defined in the normalize section of this 
    # notebook.  Could cause problems if this box is run out of order
    d = db.read_data(doc,collection='wf_miniseed',normalize=[chan_matcher,source_matcher])
    d = detrend(d,type="constant")
    d = filter(d,'lowpass',freq=2.0,zerophase=False)
    
    # this function will run faster if passed an instance the TauP calculator (ttmodel)
    # If left as the default an instance is instantiated on each call to the function which is very inefficient.
    d = set_PStime(d,model=ttmodel)
    d = cut_Pwindow(d,stime,etime)
    if d.live:
        nlive += 1
    else:
        ndead += 1
    janitor.clean(d)
    db.save_data(d,storage_mode='file',dir='./wf_TimeSeries',data_tag='serial_preprocessed')
t=time.time()    
print("Total processing time=",t-t0)
print("Number of live data saved=",nlive)
print("number of data killed=",ndead)



Total processing time= 408.6127417087555
Number of live data saved= 26178
number of data killed= 69


In [12]:
from mspasspy.algorithms.bundle import bundle_seed_data
from mspasspy.util.Undertaker import Undertaker

stedronsky=Undertaker(db)
t0=time.time()
srcids=db.wf_TimeSeries.distinct('source_id')
nsrc=len(srcids)
print("This run will process ",nsrc,
      " common source gathers into Seismogram objects")
for sid in srcids:
    query={'source_id' : sid,
           'data_tag' : 'serial_preprocessed'}
    nd=db.wf_TimeSeries.count_documents(query)
    print('working on ensemble with source_id=',sid)
    print('database contains ',nd,' documents == channels for this ensemble')
    cursor=db.wf_TimeSeries.find(query)
    # For this operation we only need channel metadata loaded by normalization
    # orientation data is critical (hang and vang attributes)
    ensemble = db.read_data(cursor,
                            normalize=[chan_matcher],
                           )
    print('Number of TimeSeries objects constructed for this source=',len(ensemble.member))
    ensemble=bundle_seed_data(ensemble)
    # The reader would do the following handling of dead data automatically
    # it is included here for demonstration purposes only
    # part of the lesson is handling of dead data
    print('Number of (3C) Seismogram object created from input ensemble=',len(ensemble.member))
    [living,bodies]=stedronsky.bring_out_your_dead(ensemble)
    print('number of bundled Seismogram=',len(living.member))
    print('number of killed Seismogram=',len(bodies.member))
    # bury the dead if necessary
    if len(bodies.member)>0:
        stedronsky.bury(bodies)
    janitor.clean(ensemble)
    db.save_data(ensemble,storage_mode='file',dir='./wf_Seismogram',collection='wf_Seismogram',data_tag='serial_preprocessed')

t = time.time()
print("Time to convert all data to Seismogram objects=",t-t0)

This run will process  20  common source gathers into Seismogram objects
working on ensemble with source_id= 6846c258f5576e7425beca63
database contains  1311  documents == channels for this ensemble
Number of TimeSeries objects constructed for this source= 1311
Number of (3C) Seismogram object created from input ensemble= 437
number of bundled Seismogram= 437
number of killed Seismogram= 0
working on ensemble with source_id= 6846c258f5576e7425beca64
database contains  1332  documents == channels for this ensemble
Number of TimeSeries objects constructed for this source= 1332
Number of (3C) Seismogram object created from input ensemble= 444
number of bundled Seismogram= 444
number of killed Seismogram= 0
working on ensemble with source_id= 6846c258f5576e7425beca65
database contains  1323  documents == channels for this ensemble
Number of TimeSeries objects constructed for this source= 1323
Number of (3C) Seismogram object created from input ensemble= 441
number of bundled Seismogram= 44