# Earthscope MsPASS Short Course - preprocess

## *Gary L. Pavlis, Indiana University and Yinzhi (Ian) Wang, TACC*

## Purpose of this Notebook
This notebook is intended to be run without explanation by students prior to the start of the course.  The purpose is only to create a working data set that you will work with during the first class session.  You may attempt to grok the code in this notebook but be aware the plan is to visit the code in this notebook near the end of the first class session to discuss what exactly it does.  

If you are reading this on GitHub and are not in the 2025 short course, note that this notebook was designed to run on Earthscope's GeoLab JupyterLab gateway on AWS.   It can still work if you are running MsPASS on a local computer.   The main difference is that if you run __[mspass-desktop](https://www.mspass.org/getting_started/mspass_desktop.html)__ the step below to launch mongod is not necessary.    

## Step one:  Terminal preprocess preparation steps
### Launch the database server
You will need to run a command in a Jupyter terminal window before this notebook can be run.   This is a necessary evil to avoid the authentication complexities of a shared database server, which we could otherwise have used to solve this problem.   As a result, for this class each student will run their own instance of the database package called MongoDB.  That is the mode of running MsPASS for most research applications anyway.  In normal MsPASS usage, however, MongoDB is launched as a service automatically, and we have been unable to devise a comparable setup on GeoLab.  Consequently, you will need to do the following:
1.  Launch a jupyter "Terminal":
    a.  Push the + icon in the upper left corner if you aren't seeing a Launcher tab in your browser
    b.  In the Launcher window select "Terminal".
2.  Create two working directories you will need for this exercise in your home directory.  You can copy and paste the following into the terminal you just launched:
```
mkdir db
mkdir logs
```
3.  Then type this incantation to launch a MongoDB server.
```
mongod --dbpath ./db --logpath ./logs/mongo_log
```
Note that each time you reconnect to GeoLab you will need to relaunch the MongoDB server with that last incantation. 
4.  Verify that worked by typing `ps -A`.  You should see a line where the CMD field is "mongod".   If you do not and are unable to solve the problem, contact me by email or Slack.
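
If you would rather check from Python than read the `ps` output, a minimal sketch like the following tests whether anything is accepting connections on the port mongod uses. It assumes the server was launched as shown above with no `--port` option, so it listens on MongoDB's default port 27017 on localhost. The function name `port_is_open` is made up for this illustration and is not part of MsPASS.

```python
import socket


def port_is_open(host="localhost", port=27017, timeout=1.0):
    """Return True if something accepts TCP connections at host:port."""
    try:
        # create_connection raises OSError if the connection is refused
        # or does not complete within the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumption: mongod is running on this machine on its default port
print("Something is listening on port 27017:", port_is_open())
```

A result of `True` only shows that some server is listening on that port. It does not prove the server is mongod, so `ps -A` remains the more direct check.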

### Use outside GeoLab
If you are accessing this notebook from github and are not part of the 2025 short course you can still run this tutorial notebook on a desktop system. To do so you will need to do two things:
1.  Install the `mspass-desktop` GUI described in __[MsPASS User's Manual here](https://www.mspass.org/getting_started/mspass_desktop.html)__
2.  Launch MsPASS as described on that page and run this notebook using the "run" button on the GUI. Alternatively, you may run it interactively by pushing the "jupyter" button.

## Step two:  Run the notebook
Once the steps above have run successfully, the rest of this notebook can be run sequentially from top to bottom.  Just select *Run->Run All Cells* or *Kernel->Restart Kernel and Run All Cells...*

Note that running this entire notebook takes on the order of 10 minutes.   If you choose to run it box by box, be prepared to wait several minutes for some of the code boxes to run.  When the notebook finishes you will need to complete a short quiz in Moodle to verify you completed this preclass exercise successfully.   

In [None]:
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
client=Client("IRIS")
ts=UTCDateTime('2011-01-01T00:00:00.0')
starttime=ts
te=UTCDateTime('2012-01-01T00:00:00.0')
endtime=te
lat0=38.3
lon0=142.5
minmag=7.0

cat=client.get_events(starttime=starttime,endtime=endtime,
        minmagnitude=minmag)
# this is a weird incantation suggested by obspy to print a summary of all the events
print(cat.__str__(print_all=True))

This next section creates a client and a related database handle to interact with MongoDB.

In [None]:
import mspasspy.client as msc

msc_client=msc.Client()
dbclient=msc_client.get_database_client()

In [None]:
dbname = "Earthscope2025"
db = dbclient.get_database(dbname)

As the name suggests this saves the data downloaded by obspy to MongoDB. 

In [None]:
n=db.save_catalog(cat)
print('number of event entries saved in source collection=',n)

Next we fetch a set of waveform files from AWS S3.  We will talk about the incantations below in the third class on cloud computing.  For now you should realize that the incantations below download a set of 20 files from AWS.  Those 20 files are miniseed format event files that were obtained using web services to extract the data from the Earthscope archives. In the cloud session we will refer to the approach here as the "download" model, where you save an image of what is transferred from AWS to local storage.
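
The "download" model amounts to copying each remote object to a local file and then working from the local copy. The sketch below shows that pattern with a guard for reruns, and it deliberately makes no AWS calls. `fetch` is a hypothetical stand-in for whatever performs the transfer, and `download_all` is an illustrative name that is not part of MsPASS or boto3. Note that it skips files that already exist, which differs from the cell below, where existing files are overwritten.

```python
import os


def download_all(keys, outdir, fetch, overwrite=False):
    """Copy each remote key to outdir, skipping files already present.

    fetch(key, path) is a caller-supplied function (hypothetical here)
    that transfers one remote object to the local file `path`.
    Returns the list of local paths that were actually fetched.
    """
    os.makedirs(outdir, exist_ok=True)
    fetched = []
    for key in keys:
        path = os.path.join(outdir, key)
        if os.path.exists(path) and not overwrite:
            continue  # a local image already exists; nothing to transfer
        fetch(key, path)
        fetched.append(path)
    return fetched


# The 20 keys used in this notebook follow the pattern Event_<n>.msd
keys = ["Event_{}.msd".format(evid) for evid in range(20)]
```

With boto3, the `fetch` argument could plausibly be the bucket's `download_file` method, which takes the key and the output file name in that order.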

In [None]:
# create the download directory unless it already exists so this box is safe to rerun
import os
wfdir="./wf"
if os.path.isdir(wfdir):
    print("Warning:  ",wfdir," already exists.  miniseed files in that directory will be overwritten")
else:
    os.mkdir(wfdir)

In [None]:
# this is the AWS python package for accessing "s3" data
import boto3
from botocore import UNSIGNED
from botocore.config import Config
s3=boto3.resource("s3",config=Config(signature_version=UNSIGNED))  # the config arg seems necessary for anonymous access
bucket="essc-mspass2024"
for evid in range(20):
    # for this bucket the file name is the key required by boto3
    fname = "Event_{}.msd".format(evid)
    key = fname
    # this is a file path passed to download_file as the output file name
    path = wfdir+"/"+fname
    s3.Bucket(bucket).download_file(key,path)


The files we just downloaded are raw miniseed data files.   We need to build an index that MsPASS can use to crack these files and load them into our processing workflow later.   We run this step in parallel to improve performance as this process has to pass through every byte of all 20 files. 
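
The pattern of mapping one per-file function over a list of files in parallel can be sketched with the standard library alone. This is not the MsPASS indexer: `scan_file` is a hypothetical stand-in that only counts bytes, and a thread pool takes the place of the dask bag used in this notebook. The point is only the shape of the computation, one independent task per file.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def scan_file(fname, dir="."):
    """Read one file in chunks and return (file name, byte count)."""
    nbytes = 0
    with open(os.path.join(dir, fname), "rb") as fh:
        while True:
            chunk = fh.read(65536)
            if not chunk:
                break
            nbytes += len(chunk)
    return fname, nbytes


def scan_all(filenames, dir=".", max_workers=4):
    """Apply scan_file to every name; results keep the input order."""
    worker = partial(scan_file, dir=dir)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, filenames))
```

Because each file is handled independently, the work can be spread over as many workers as are available. The same property is what lets dask distribute the `map` call in the cell below.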

In [None]:
import os
import dask.bag as dbg
# remove the comment below if you need to restart this workflow 
# at this point
#db.drop_collection('wf_miniseed')
# Note this dir value assumes the wf dir was created with 
# the previous command that also downloads the data from AWS
current_directory = os.getcwd()
dir = os.path.join(current_directory, 'wf')
dfilelist=[]
with os.scandir(dir) as entries:
    for entry in entries:
        if entry.is_file():
            dfilelist.append(entry.name)
print(dfilelist)
mydata = dbg.from_sequence(dfilelist)
mydata = mydata.map(db.index_mseed_file,dir=dir)
index_return = mydata.compute()

In [None]:
n=db.wf_miniseed.count_documents({})
print("Number of wf_miniseed indexing documents saved by MongoDB = ",n)

Next we retrieve station metadata with web services using obspy.  The result is loaded as a single obspy *Inventory* object.  We then save the data to MongoDB with the MsPASS database method called *save_inventory*.   The *Inventory* object is disassembled and its contents are saved in the form of python dictionaries, each of which becomes a MongoDB document.
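
To make "disassembled into dictionaries" concrete, here is a minimal sketch that flattens a nested network/station/channel structure into one flat dictionary per station and per channel. The key names mimic ones used in this notebook (`lat`, `lon`, `elev`, `hang`, `vang`), but the nested input layout and all values are invented for illustration. The real *save_inventory* works on obspy objects and stores more attributes than shown here.

```python
def flatten_inventory(networks):
    """Turn a nested network/station/channel structure into flat dicts."""
    site_docs = []
    channel_docs = []
    for net in networks:
        for sta in net["stations"]:
            site = {
                "net": net["code"],
                "sta": sta["code"],
                "lat": sta["lat"],
                "lon": sta["lon"],
                "elev": sta["elev"],
            }
            site_docs.append(site)
            for chan in sta["channels"]:
                doc = dict(site)  # channel documents repeat the station attributes
                doc["chan"] = chan["code"]
                doc["hang"] = chan["hang"]
                doc["vang"] = chan["vang"]
                channel_docs.append(doc)
    return site_docs, channel_docs


# Invented example: one network with one three-component station
example = [
    {
        "code": "XX",
        "stations": [
            {
                "code": "STA1",
                "lat": 40.0,
                "lon": -100.0,
                "elev": 0.5,
                "channels": [
                    {"code": "BHZ", "hang": 0.0, "vang": 0.0},
                    {"code": "BHN", "hang": 0.0, "vang": 90.0},
                    {"code": "BHE", "hang": 90.0, "vang": 90.0},
                ],
            }
        ],
    }
]
sites, channels = flatten_inventory(example)
print(len(sites), "site document(s) and", len(channels), "channel document(s)")
```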

In [None]:
ts=UTCDateTime('2010-01-01T00:00:00.0')
starttime=ts
te=UTCDateTime('2013-01-01T00:00:00.0')
endtime=te
inv=client.get_stations(network='TA',starttime=starttime,endtime=endtime,
                        format='xml',channel='BH?',level='response')
net=inv.networks
x=net[0]
sta=x.stations
print("Number of stations retrieved=",len(sta))

In [None]:
ret=db.save_inventory(inv)
print('save_inventory returned value=',ret)

This section creates a normalization cross-reference needed later to connect wf documents to matching documents in the *site*, *channel*, and *source* collections.   The algorithms used for source and receiver data are different.  
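
The difference between the two algorithms can be sketched in plain Python. Receiver metadata are matched on the SEED station codes plus a time interval test, while source metadata are matched on time alone. The functions below are illustrative only and are not the MsPASS implementation. In particular, the sign convention for `t0offset` (data start time minus origin time) is an assumption, and the document contents are invented.

```python
def match_channel(wfdoc, channel_docs):
    """Return the _id of the channel document matching the codes and time."""
    keys = ("net", "sta", "chan", "loc")
    for cdoc in channel_docs:
        same_codes = all(wfdoc.get(k) == cdoc.get(k) for k in keys)
        in_epoch = cdoc["starttime"] <= wfdoc["starttime"] <= cdoc["endtime"]
        if same_codes and in_epoch:
            return cdoc["_id"]
    return None


def match_source(wfdoc, source_docs, t0offset=300.0, tolerance=100.0):
    """Return the _id of the source whose origin time fits the data start."""
    # Assumption: data start time = origin time + t0offset
    predicted_origin = wfdoc["starttime"] - t0offset
    for sdoc in source_docs:
        if abs(sdoc["time"] - predicted_origin) <= tolerance:
            return sdoc["_id"]
    return None


# Invented documents; times are plain epoch seconds
chan_docs = [
    {"_id": "old", "net": "XX", "sta": "STA1", "chan": "BHZ", "loc": "",
     "starttime": 0.0, "endtime": 999.0},
    {"_id": "new", "net": "XX", "sta": "STA1", "chan": "BHZ", "loc": "",
     "starttime": 1000.0, "endtime": 5000.0},
]
src_docs = [{"_id": "ev1", "time": 1000.0}, {"_id": "ev2", "time": 3000.0}]
wf = {"net": "XX", "sta": "STA1", "chan": "BHZ", "loc": "", "starttime": 1350.0}
print(match_channel(wf, chan_docs), match_source(wf, src_docs))
```

The interval test on the receiver side matters because the same station codes can refer to different sensor epochs. On the source side a tolerance is needed because the data start time is only approximately at a fixed offset from the origin time.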

In [None]:
from mspasspy.db.normalize import (
    bulk_normalize,
    MiniseedMatcher,
    OriginTimeMatcher,
)
chan_matcher = MiniseedMatcher(db,
        collection="channel",
        attributes_to_load=["starttime","endtime","lat","lon","elev","hang","vang","_id"],
    )
site_matcher = MiniseedMatcher(db,
        collection="site",
        attributes_to_load=["starttime","endtime","lat","lon","elev","_id"],
    )
source_matcher = OriginTimeMatcher(db,t0offset=300.0,tolerance=100.0)
ret = bulk_normalize(db,matcher_list=[chan_matcher,site_matcher,source_matcher])
print("Number of documents processed in wf_miniseed=",ret[0])
print("Number of documents updated with channel cross reference id=",ret[1])
print("Number of documents updated with site cross reference id=",ret[2])
print("Number of documents updated with source cross reference id=",ret[3])

Next we define a set of specialized functions for completing the processing of this notebook.  The docstring with each function describes what it does.  Note that all use components of MsPASS to form a composite function that does something useful from the pieces of the framework.  That is a general theme for developing a new processing workflow.
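
One of the ideas used in this section is windowing around a computed arrival time: shift the time axis so the arrival is at time zero, cut a fixed window, then shift back to absolute time. The sketch below shows that arithmetic on a plain list of samples. It is a simplified illustration with a made-up function name `cut_window`. The MsPASS window function may treat rounding at the window edges and windows that run past the data differently.

```python
import math


def cut_window(samples, t0, dt, tref, stime, etime):
    """Cut samples to the interval tref+stime to tref+etime.

    samples : list of values; sample i is at absolute time t0 + i*dt
    tref    : reference time such as a P arrival (absolute time)
    Returns (windowed samples, absolute time of first retained sample).
    """
    # shift to a time axis relative to tref (what ator does conceptually)
    t0_rel = t0 - tref
    # first and last sample indices that fall inside [stime, etime]
    istart = max(0, math.ceil((stime - t0_rel) / dt))
    iend = min(len(samples) - 1, math.floor((etime - t0_rel) / dt))
    if iend < istart:
        return [], None  # the window does not overlap the data
    windowed = samples[istart:iend + 1]
    # shift back to absolute time (what rtoa does conceptually)
    new_t0 = (t0_rel + istart * dt) + tref
    return windowed, new_t0


# 100 samples at 1 s spacing starting at t=1000; reference time t=1050
data = list(range(100))
win, win_t0 = cut_window(data, 1000.0, 1.0, 1050.0, -10.0, 20.0)
print(len(win), "samples starting at absolute time", win_t0)
```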

In [None]:
from mspasspy.algorithms.window import WindowData
from mspasspy.algorithms.signals import detrend
from mspasspy.algorithms.basic import ator
from mspasspy.ccore.algorithms.basic import TimeWindow
from mspasspy.ccore.utility import ErrorSeverity
from mspasspy.db.normalize import (normalize,
                                   ObjectIdMatcher,
                                   OriginTimeMatcher)
from mspasspy.algorithms.resample import (ScipyResampler,
                                          ScipyDecimator,
                                          resample,
                                         )
from obspy.geodetics import gps2dist_azimuth,kilometers2degrees
from obspy.taup import TauPyModel
import time

def set_PStime(d,Ptimekey="Ptime",Stimekey="Stime",model=None):
    """
    Function to calculate P and S wave arrival time and set times 
    as the header (Metadata) fields defined by Ptimekey and Stimekey.
    Tries to handle some complexities of the travel time calculator 
    returns when one or both P and S aren't calculatable.  That is 
    the norm in or at the edge of the core shadow.  
    
    :param d:  input TimeSeries datum.  Assumes datum's Metadata 
      contains stock source and channel attributes.  
    :param Ptimekey:  key used to define the header attribute that 
      will contain the computed P time.  Default "Ptime".
    :param Stimekey:  key used to define the header attribute that 
      will contain the computed S time.  Default "Stime".
    :param model:  instance of obspy TauPyModel travel time engine. 
      Default is None.   That mode is slow as a new engine will be
      constructed on each call to the function.  Normal use should 
      pass an instance for greater efficiency.  
    """
    if d.live:
        if model is None:
            model = TauPyModel(model="iasp91") 
        # extract required source attributes
        srclat=d["source_lat"]
        srclon=d["source_lon"]
        srcz=d["source_depth"]
        srct=d["source_time"] 
        # extract required channel attributes
        stalat=d["channel_lat"]
        stalon=d["channel_lon"]
        staelev=d["channel_elev"]
        # set up and run travel time calculator
        georesult=gps2dist_azimuth(srclat,srclon,stalat,stalon)
        # obspy's function we just called returns distance in m in element 0 of a tuple
        # their travel time calculator it is degrees so we need this conversion
        dist=kilometers2degrees(georesult[0]/1000.0)
        arrivals=model.get_travel_times(source_depth_in_km=srcz,
                                            distance_in_degree=dist,
                                            phase_list=['P','S'])
        # always post this as it is not cheap to compute
        # WARNING:  don't use common abbreviation delta - collides with data dt
        d['epicentral_distance']=dist
        # these are CSS3.0 shorthands s - station e - event
        esaz = georesult[1]
        seaz = georesult[2]
        # css3.0 names esaz = event to station azimuth; seaz = station to event azimuth
        d['esaz']=esaz
        d['seaz']=seaz
        # get_travel_times returns an empty list if neither P nor S can be 
        # calculated.  Only lists of length 1 or 2 are handled below; 
        # otherwise no arrival time attributes are set
        if len(arrivals)==2:
            Ptime=srct+arrivals[0].time
            rayp = arrivals[0].ray_param
            Stime=srct+arrivals[1].time
            rayp_S = arrivals[1].ray_param
            d.put(Ptimekey,Ptime)
            d.put(Stimekey,Stime)
            # These keys are not passed as arguments but could be - a choice
            # Ray parameter is needed for free surface transformation operator
            # note tau p calculator in obspy returns p=R sin(theta)/V_0
            d.put("rayp_P",rayp)
            d.put("rayp_S",rayp_S)
        elif len(arrivals)==1:
            if arrivals[0].name == 'P':
                Ptime=srct+arrivals[0].time
                rayp = arrivals[0].ray_param
                d.put(Ptimekey,Ptime)
                d.put("rayp_P",rayp)
            else:
                # Not sure we can assume name is S
                if arrivals[0].name == 'S':
                    Stime=srct+arrivals[0].time
                    rayp_S = arrivals[0].ray_param
                    d.put(Stimekey,Stime)
                    d.put("rayp_S",rayp_S)
                else:
                    message = "Unexpected single phase name returned by taup calculator\n"
                    message += "Expected phase name S but got " + arrivals[0].name
                    d.elog.log_error("set_PStime",
                                     message,
                                     ErrorSeverity.Invalid)
                    d.kill()
                
    # Note python indents mean if a datum is marked dead this function just silently returns 
    # what it received doing nothing - correct mspass model
    return d


def cut_Pwindow(d,stime=-100.0,etime=500.0):
    """
    Window datum relative to the P arrival time.  Time
    interval extracted is Ptime+stime to Ptime+etime.
    Uses the ator and rtoa methods of MsPASS data objects.
    A datum with no "Ptime" attribute is returned unchanged.
    """
    if d.live:
        if "Ptime" in d:
            ptime = d["Ptime"]
            d.ator(ptime)
            d = WindowData(d,stime,etime)
            d.rtoa()
    return d
   

This next box does the first-order processing for this analysis.   It reads raw data from the miniseed files we downloaded above, does a couple of simple processing steps, and saves the data in the MsPASS native format, which is much faster to read back in.  

This code illustrates a standard serial processing algorithm for an entire data set. The for loop iterates through all the TimeSeries data that can be constructed from the miniseed data files and handles them one at a time.  Later in the course we will see how to parallelize this loop.
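
The loop relies on a convention used throughout this notebook: every processing function checks whether its input is live, returns dead data untouched, and marks a datum dead with a logged message rather than raising an exception. The sketch below mimics that convention with a small stand-in class. `Datum`, `remove_mean`, and `process_all` are invented for illustration and are much simpler than the real MsPASS data objects.

```python
from dataclasses import dataclass, field


@dataclass
class Datum:
    """Tiny stand-in for a data object with a live flag and an error log."""
    samples: list
    live: bool = True
    elog: list = field(default_factory=list)

    def kill(self, message):
        self.live = False
        self.elog.append(message)


def remove_mean(d):
    """Subtract the mean; dead data pass through unchanged."""
    if d.live:
        if len(d.samples) == 0:
            d.kill("remove_mean: datum has no samples")
        else:
            mean = sum(d.samples) / len(d.samples)
            d.samples = [x - mean for x in d.samples]
    return d


def process_all(data):
    """Serial loop over all data, counting survivors and casualties."""
    nlive = 0
    ndead = 0
    for d in data:
        d = remove_mean(d)
        if d.live:
            nlive += 1
        else:
            ndead += 1
    return nlive, ndead


dataset = [Datum([1.0, 2.0, 3.0]), Datum([])]
print(process_all(dataset))
```

Because a failure is recorded on the datum instead of stopping the run, one bad waveform does not abort the processing of an entire data set.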

In [None]:
from obspy.taup import TauPyModel
from mspasspy.algorithms.signals import filter,detrend
from mspasspy.util.Janitor import Janitor

ttmodel = TauPyModel(model="iasp91")
stime=-100.0
etime=500.0

janitor = Janitor()
# nonstandard keys added for travel times - need to save these to keep them from being thrown out by the janitor
for k in ["seaz","esaz","Ptime","epicentral_distance","rayp_P","rayp_S"]:
    janitor.add2keepers(k)

t0 = time.time()
nlive=0
ndead=0
cursor=db.wf_miniseed.find({})
for doc in cursor:
    # the normalize matchers in this read were defined in the normalize section of this 
    # notebook.  Could cause problems if this box is run out of order
    d = db.read_data(doc,collection='wf_miniseed',normalize=[chan_matcher,source_matcher])
    d = detrend(d,type="constant")
    d = filter(d,'lowpass',freq=2.0,zerophase=False)
    
    # this function will run faster if passed an instance of the TauP calculator (ttmodel)
    # If left as the default, a new instance is constructed on each call to the function, which is very inefficient.
    d = set_PStime(d,model=ttmodel)
    d = cut_Pwindow(d,stime,etime)
    if d.live:
        nlive += 1
    else:
        ndead += 1
    janitor.clean(d)
    db.save_data(d,storage_mode='file',dir='./wf_TimeSeries',data_tag='serial_preprocessed')
t=time.time()    
print("Total processing time=",t-t0)
print("Number of live data saved=",nlive)
print("number of data killed=",ndead)

This final code box does one simple task:  it assembles all possible three-component sets of TimeSeries objects and "bundles" them into what we call a Seismogram object.  We will review that concept in detail in the first class session.
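
The grouping idea behind bundling can be sketched with dictionaries. Channels that share network, station, and location codes and the first two characters of the channel code are collected, and a group is kept only if it has exactly three members with three distinct orientation codes. This is a simplified illustration with an invented function name. The logic inside *bundle_seed_data* is more elaborate and is not reproduced here.

```python
from collections import defaultdict


def group_three_component(channels):
    """Split channel records into complete 3C groups and leftovers."""
    groups = defaultdict(list)
    for c in channels:
        # e.g. BHZ, BHN, BHE share the band and instrument codes "BH"
        key = (c["net"], c["sta"], c["loc"], c["chan"][:2])
        groups[key].append(c)
    bundles = []
    leftovers = []
    for members in groups.values():
        orientations = {m["chan"][2] for m in members}
        if len(members) == 3 and len(orientations) == 3:
            bundles.append(sorted(members, key=lambda m: m["chan"]))
        else:
            leftovers.extend(members)
    return bundles, leftovers


# Invented records: one complete station and one with a missing component
records = [
    {"net": "XX", "sta": "STA1", "loc": "", "chan": "BHZ"},
    {"net": "XX", "sta": "STA1", "loc": "", "chan": "BHN"},
    {"net": "XX", "sta": "STA1", "loc": "", "chan": "BHE"},
    {"net": "XX", "sta": "STA2", "loc": "", "chan": "BHZ"},
    {"net": "XX", "sta": "STA2", "loc": "", "chan": "BHN"},
]
bundles, leftovers = group_three_component(records)
print(len(bundles), "complete group(s),", len(leftovers), "leftover channel(s)")
```

The leftovers correspond to the data that cannot be assembled into a three-component object, which is why the cell below counts and handles killed members after bundling.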

In [None]:
from mspasspy.algorithms.bundle import bundle_seed_data
from mspasspy.util.Undertaker import Undertaker

stedronsky=Undertaker(db)
t0=time.time()
srcids=db.wf_TimeSeries.distinct('source_id')
nsrc=len(srcids)
print("This run will process ",nsrc,
      " common source gathers into Seismogram objects")
for sid in srcids:
    query={'source_id' : sid,
           'data_tag' : 'serial_preprocessed'}
    nd=db.wf_TimeSeries.count_documents(query)
    print('working on ensemble with source_id=',sid)
    print('database contains ',nd,' documents == channels for this ensemble')
    cursor=db.wf_TimeSeries.find(query)
    # For this operation we only need channel metadata loaded by normalization
    # orientation data is critical (hang and vang attributes)
    ensemble = db.read_data(cursor,
                            normalize=[chan_matcher],
                           )
    print('Number of TimeSeries objects constructed for this source=',len(ensemble.member))
    ensemble=bundle_seed_data(ensemble)
    # The reader would do the following handling of dead data automatically
    # it is included here for demonstration purposes only
    # part of the lesson is handling of dead data
    print('Number of (3C) Seismogram objects created from input ensemble=',len(ensemble.member))
    [living,bodies]=stedronsky.bring_out_your_dead(ensemble)
    print('number of bundled Seismogram=',len(living.member))
    print('number of killed Seismogram=',len(bodies.member))
    # bury the dead if necessary
    if len(bodies.member)>0:
        stedronsky.bury(bodies)
    janitor.clean(ensemble)
    db.save_data(ensemble,storage_mode='file',dir='./wf_Seismogram',collection='wf_Seismogram',data_tag='serial_preprocessed')

t = time.time()
print("Time to convert all data to Seismogram objects=",t-t0)