# Benchmark Data Assembly
## *Gary Pavlis - started March 30, 2021*
This notebook documents how the benchmark data were assembled from the downloaded USArray data set.  It builds on the raw_data_tutorial, combined with a more complete database that was assembled previously, exported as JSON, and sent to TACC.  This notebook will be run on a Mac Pro called quakes.geology.indiana.edu to test the docker container image.

## Importing metadata collections
As noted previously, the first step is to import the JSON files created with the export process.  To do that I first had to start the docker all-in-one container.  An oddity I think we need to fix is that the name of the running container seems to be assigned automatically somehow.  My instance on quakes was called "infallible_thompson".  To get a shell in the container I did this:
```
docker exec -it infallible_thompson bash
cd /home
```
Note that I had launched docker from /Volumes/data/usarray/db.  Postscript:  After a call to Ian, it seems the random naming was intentional for this use.  The all-in-one procedure is intended for running tutorials, where knowing the container name would be baggage.

Importing such data is well documented in the MongoDB documentation.  I followed this source, https://docs.mongodb.com/guides/server/import/, to generate the following command lines to import the four exported collections.

```
# top level directory on quakes is - run this in a shell from this directory /scratch/06058/iwang/mspass/workdir
mongoimport --host $HOSTNAME --port "27017" --db usarraytest --collection arrival --file json_import/arrival.json
mongoimport --host $HOSTNAME --port "27017" --db usarraytest --collection channel --file json_import/channel.json
mongoimport --host $HOSTNAME --port "27017" --db usarraytest --collection site --file json_import/site.json
mongoimport --host $HOSTNAME --port "27017" --db usarraytest --collection source --file json_import/source.json
```
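
The four commands differ only in the collection name, so they could be generated in a loop.  The following is a minimal sketch, not part of the workflow I actually ran: the helper name `mongoimport_command` is hypothetical, and it only builds argument lists with the same flags shown above.  The `localhost` fallback for `HOSTNAME` is also an assumption.

```python
import os

def mongoimport_command(host, collection, db="usarraytest",
                        port="27017", json_dir="json_import"):
    """Build the mongoimport argument list for one exported collection.

    Hypothetical helper: it only mirrors the flags used in the commands above.
    """
    return [
        "mongoimport",
        "--host", host,
        "--port", port,
        "--db", db,
        "--collection", collection,
        "--file", os.path.join(json_dir, collection + ".json"),
    ]

host = os.environ.get("HOSTNAME", "localhost")  # fallback is an assumption
commands = [mongoimport_command(host, c)
            for c in ("arrival", "channel", "site", "source")]
for cmd in commands:
    print(" ".join(cmd))
    # To actually run the import: subprocess.run(cmd, check=True)
```

Keeping the command as a list of arguments avoids shell quoting problems if it is later passed to `subprocess.run`.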

## Indexing waveform data
We now run a variant of section 5 of the raw data tutorial with the same title as above.  The variant is that here we define a year at the top and assume a specific file structure that is valid for the way I assembled these data.  Some tweaking may be needed when this notebook is transferred to TACC.

In running this I uncovered an important issue to beware of:  it really matters what directory is mapped into the container.  I (incorrectly) created a separate db directory and had docker mount that as /home.  The problem I created is that the mount point is inside the file system tree.  I had the waveform data in (to use unix shell syntax) the directory ../wf, but ".." is not visible from inside the container the way I mounted it.  Symbolic links wouldn't work either, for the same reason.  I solved it for this test by creating a wf/2012 directory and creating hard links from ../wf/2012/*.mseed to create aliases in the local wf/2012.  That works as a patch, but it is not the right solution.
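
The hard-link patch could be scripted along the following lines.  This is a minimal sketch, not what I actually ran: the function name `hardlink_mseed_files` is made up for illustration, and it assumes the source and destination directories are on the same file system (hard links cannot cross file systems).  The demonstration at the bottom runs in a scratch directory with placeholder files.

```python
import fnmatch
import os
import tempfile

def hardlink_mseed_files(src_dir, dest_dir, pattern="*.mseed"):
    """Create hard links in dest_dir for every file in src_dir matching pattern.

    Returns the sorted list of file names that were linked.
    Files that already exist in dest_dir are left alone.
    """
    os.makedirs(dest_dir, exist_ok=True)
    linked = []
    for name in sorted(fnmatch.filter(os.listdir(src_dir), pattern)):
        dest = os.path.join(dest_dir, name)
        if not os.path.exists(dest):
            os.link(os.path.join(src_dir, name), dest)
            linked.append(name)
    return linked

# Self-contained demonstration in a scratch directory (made-up file names)
with tempfile.TemporaryDirectory() as top:
    src = os.path.join(top, "outside", "wf", "2012")
    dest = os.path.join(top, "db", "wf", "2012")
    os.makedirs(src)
    for name in ("event1.mseed", "event2.mseed", "notes.txt"):
        with open(os.path.join(src, name), "w") as fh:
            fh.write("placeholder")
    demo_linked = hardlink_mseed_files(src, dest)
    # A hard link and its source refer to the same underlying file
    demo_same_file = os.path.samefile(os.path.join(src, "event1.mseed"),
                                      os.path.join(dest, "event1.mseed"))
print(demo_linked, demo_same_file)
```

For the layout described above, the call would be something like `hardlink_mseed_files('../wf/2012', 'wf/2012')`, run on the host from the directory that docker mounts.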

In [1]:
import fnmatch
import os
topdirectory= os.environ.get('SCRATCH') + '/mspass/workdir/wf'   # appropriate for container if wf contains the waveform data
year='2012'   # Change this for a different year
dir=topdirectory+'/'+year
filelist=fnmatch.filter(os.listdir(dir),'*.mseed')
print(filelist)

['event756.mseed', 'event548.mseed', 'event394.mseed', 'event714.mseed', 'event305.mseed', 'event936.mseed', 'event89.mseed', 'event960.mseed', 'event656.mseed', 'event916.mseed', 'event327.mseed', 'event580.mseed', 'event716.mseed', 'event506.mseed', 'event407.mseed', 'event674.mseed', 'event867.mseed', 'event892.mseed', 'event443.mseed', 'event310.mseed', 'event83.mseed', 'event39.mseed', 'event421.mseed', 'event161.mseed', 'event645.mseed', 'event32.mseed', 'event820.mseed', 'event171.mseed', 'event110.mseed', 'event634.mseed', 'event308.mseed', 'event953.mseed', 'event851.mseed', 'event267.mseed', 'event781.mseed', 'event434.mseed', 'event795.mseed', 'event195.mseed', 'event668.mseed', 'event526.mseed', 'event463.mseed', 'event870.mseed', 'event56.mseed', 'event497.mseed', 'event44.mseed', 'event393.mseed', 'event830.mseed', 'event928.mseed', 'event292.mseed', 'event194.mseed', 'event119.mseed', 'event279.mseed', 'event502.mseed', 'event212.mseed', 'event204.mseed', 'event670.mseed

In [4]:
import dask.bag
from dask.distributed import Client as DaskClient
from mspasspy.db.database import Database
from mspasspy.db.client import DBClient
from mspasspy.preprocessing.seed.ensembles import (dbsave_seed_ensemble_file)

hostname = os.environ.get('HOSTNAME')
daskclient = DaskClient(hostname + ':8786')
dbclient = DBClient(hostname + ':27017')
db = Database(dbclient,'usarraytest')

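def save_ens(fn):
    """Index one miniseed file; return a status string, or the exception on failure."""
    fname=dir+'/'+fn
    try:
        db.index_mseed_file(fn, dir=dir, collection='import_miniseed_ensemble')
    except Exception as err:
        # Return rather than raise so one bad file does not abort the whole bag
        return err
    return 'Building index for miniseed file='+fname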

filelist_dbg = dask.bag.from_sequence(filelist)
filelist_map = filelist_dbg.map(save_ens)
save_log = filelist_map.compute()


# from mspasspy.db.database import Database
# from mspasspy.db.client import Client
# from mspasspy.preprocessing.seed.ensembles import (dbsave_seed_ensemble_file)

# dbclient=Client('mongodb://c402-003:27017')
# db=Database(dbclient,'usarraytest')
# print(db.list_collection_names())
# for file in filelist:
#     fname=dir+'/'+file
#     print('Building index for miniseed file=',fname)
#     try:
#         dbsave_seed_ensemble_file(db,fname)
#     except Exception as err:
#         print(err)

In [5]:
print(save_log)

[1, 4, 9, 16, 25]
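
Since `save_ens` returns either a status string or the caught exception, the log can be split into successes and failures after the fact.  The following is a minimal sketch of that idea: `split_log` is a hypothetical helper, and the two log entries are made up to show the shape of the result.

```python
def split_log(log):
    """Separate status messages from exceptions returned by the workers."""
    failures = [entry for entry in log if isinstance(entry, Exception)]
    successes = [entry for entry in log if not isinstance(entry, Exception)]
    return successes, failures

# Stand-in log with made-up entries, only to show the shape of the result
example_log = [
    "Building index for miniseed file=wf/2012/event756.mseed",
    FileNotFoundError("wf/2012/event999.mseed"),
]
ok, bad = split_log(example_log)
print(len(ok), "indexed,", len(bad), "failed")
for err in bad:
    print(type(err).__name__, err)
```

With the real run, `split_log(save_log)` would give the list of files that need another look.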


## Source association
Next come the same steps as section 6 of the raw data tutorial, shortened and without the explanations:

In [10]:
# Added to test status when running remotely.  Could be removed.

dbclient=DBClient(hostname + ':27017')
db=Database(dbclient,'usarraytest')
cols=db.list_collections()
for doc in cols:
    print(doc)

{'name': 'source', 'type': 'collection', 'options': {}, 'info': {'readOnly': False, 'uuid': UUID('2a422223-5cb7-41de-8dc6-ae3dd19a953e')}, 'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_', 'ns': 'usarraytest.source'}}
{'name': 'channel', 'type': 'collection', 'options': {}, 'info': {'readOnly': False, 'uuid': UUID('600479e9-de65-49ff-8f70-ee2eeabce12c')}, 'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_', 'ns': 'usarraytest.channel'}}
{'name': 'site', 'type': 'collection', 'options': {}, 'info': {'readOnly': False, 'uuid': UUID('7515415e-bdef-41e0-86fb-4a0ae30a8e60')}, 'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_', 'ns': 'usarraytest.site'}}
{'name': 'import_miniseed_ensemble', 'type': 'collection', 'options': {}, 'info': {'readOnly': False, 'uuid': UUID('96be719e-9b12-4439-9e5d-046036cbf4b1')}, 'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_', 'ns': 'usarraytest.import_miniseed_ensemble'}}
{'name': 'arrival', 'type': 'collection', 'options': {}, 'info': {'readOnl

I started this up much later than the above (April 7), so I wanted to verify how many ensembles were actually processed with the following:

In [11]:
n=db.import_miniseed_ensemble.count_documents({})
print("number of ensembles successfully indexed=",n)

number of ensembles successfully indexed= 3833785


In [None]:
from mspasspy.preprocessing.seed.ensembles import link_source_collection
link_source_collection(db)

The next step requires some changes because it needs to access the converted Antelope database tables used to build the arrival collection.

In [8]:
from mspasspy.preprocessing.css30.dbarrival import extract_unique_css30_sources
arrfiledir= os.environ.get('SCRATCH') + '/mspass'  # for docker container
arrfile=arrfiledir+'/'+'usarray_tele'+year+'.txt'
events=extract_unique_css30_sources(arrfile)
# print just the first 10 events
n=0
for k in events:
    print('evid=',k,' contents:  ',events[k])
    n+=1
    if(n>=10):
        break

evid= 206397  contents:   {'evid': 206397, 'lat': 12.011, 'lon': 143.505, 'depth': 10.0, 'time': 1325377805.03}
evid= 206398  contents:   {'evid': 206398, 'lat': -11.372, 'lon': 166.224, 'depth': 66.7, 'time': 1325379008.01}
evid= 206401  contents:   {'evid': 206401, 'lat': 31.416, 'lon': 138.155, 'depth': 348.5, 'time': 1325395674.5}
evid= 206408  contents:   {'evid': 206408, 'lat': 12.018, 'lon': 143.607, 'depth': 6.5, 'time': 1325432652.07}
evid= 206439  contents:   {'evid': 206439, 'lat': -14.748, 'lon': 167.44, 'depth': 221.6, 'time': 1325652454.05}
evid= 206454  contents:   {'evid': 206454, 'lat': -10.67, 'lon': 166.375, 'depth': 75.9, 'time': 1325703417.29}
evid= 206458  contents:   {'evid': 206458, 'lat': -45.97, 'lon': -76.014, 'depth': 10.0, 'time': 1325724873.21}
evid= 206459  contents:   {'evid': 206459, 'lat': -17.691, 'lon': -173.543, 'depth': 35.0, 'time': 1325726020.43}
evid= 206464  contents:   {'evid': 206464, 'lat': 18.325, 'lon': -70.361, 'depth': 39.8, 'time': 1325
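
The counter-and-break loop used to print the first 10 events can be written more compactly with `itertools.islice`.  A minimal sketch follows, using a stand-in dict (`events_example`, a name made up here) whose two entries are copied from the output above rather than the real return of `extract_unique_css30_sources`.

```python
from itertools import islice

# Stand-in for the dict returned by extract_unique_css30_sources;
# the two entries are copied from the printed output above.
events_example = {
    206397: {'evid': 206397, 'lat': 12.011, 'lon': 143.505,
             'depth': 10.0, 'time': 1325377805.03},
    206398: {'evid': 206398, 'lat': -11.372, 'lon': 166.224,
             'depth': 66.7, 'time': 1325379008.01},
}

# islice stops after at most 10 items without needing a manual counter
first_ten = list(islice(events_example.items(), 10))
for evid, contents in first_ten:
    print('evid=', evid, ' contents:  ', contents)
```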

Then run this, which uses the output of the above:

In [9]:
from mspasspy.preprocessing.css30.dbarrival import load_css30_sources
n=load_css30_sources(db,events)
print('loaded ',n,' new documents into source collection')

The following records in collection  source  have matching data for one or more evids
You must fix the mismatch problem before you can load these data
206397 {'_id': ObjectId('6012d95c961f396071be8657'), 'evid': 206397, 'latitude': 12.011, 'longitude': 143.505, 'depth': 10.0, 'time': 1325377805.03, 'source_id': '6012d95c961f396071be8657'}
206398 {'_id': ObjectId('6012d95c961f396071be8658'), 'evid': 206398, 'latitude': -11.372, 'longitude': 166.224, 'depth': 66.7, 'time': 1325379008.01, 'source_id': '6012d95c961f396071be8658'}
206401 {'_id': ObjectId('6012d95c961f396071be8659'), 'evid': 206401, 'latitude': 31.416, 'longitude': 138.155, 'depth': 348.5, 'time': 1325395674.5, 'source_id': '6012d95c961f396071be8659'}
206408 {'_id': ObjectId('6012d95c961f396071be865a'), 'evid': 206408, 'latitude': 12.018, 'longitude': 143.607, 'depth': 6.5, 'time': 1325432652.07, 'source_id': '6012d95c961f396071be865a'}
206439 {'_id': ObjectId('6012d95c961f396071be865b'), 'evid': 206439, 'latitude': -14.748,

207375 {'_id': ObjectId('6012d95c961f396071be86f7'), 'evid': 207375, 'latitude': 40.134, 'longitude': 24.059, 'depth': 2.5, 'time': 1330831868.4, 'source_id': '6012d95c961f396071be86f7'}
207379 {'_id': ObjectId('6012d95c961f396071be86f8'), 'evid': 207379, 'latitude': 2.687, 'longitude': -84.34, 'depth': 9.0, 'time': 1330854254.6, 'source_id': '6012d95c961f396071be86f8'}
207381 {'_id': ObjectId('6012d95c961f396071be86f9'), 'evid': 207381, 'latitude': -21.529, 'longitude': 169.769, 'depth': 14.0, 'time': 1330865342.44, 'source_id': '6012d95c961f396071be86f9'}
207383 {'_id': ObjectId('6012d95c961f396071be86fa'), 'evid': 207383, 'latitude': -21.554, 'longitude': -69.848, 'depth': 46.2, 'time': 1330878441.04, 'source_id': '6012d95c961f396071be86fa'}
207390 {'_id': ObjectId('6012d95c961f396071be86fb'), 'evid': 207390, 'latitude': -28.246, 'longitude': -63.294, 'depth': 553.9, 'time': 1330933570.04, 'source_id': '6012d95c961f396071be86fb'}
207407 {'_id': ObjectId('6012d95c961f396071be86fc'), 

208764 {'_id': ObjectId('6012d95c961f396071be87bc'), 'evid': 208764, 'latitude': 39.762, 'longitude': 75.194, 'depth': 9.9, 'time': 1338553942.54, 'source_id': '6012d95c961f396071be87bc'}
208781 {'_id': ObjectId('6012d95c961f396071be87bd'), 'evid': 208781, 'latitude': -22.059, 'longitude': -63.555, 'depth': 527.0, 'time': 1338623573.99, 'source_id': '6012d95c961f396071be87bd'}
208800 {'_id': ObjectId('6012d95c961f396071be87be'), 'evid': 208800, 'latitude': 44.9, 'longitude': 10.94, 'depth': 9.0, 'time': 1338751243.0, 'source_id': '6012d95c961f396071be87be'}
208805 {'_id': ObjectId('6012d95c961f396071be87bf'), 'evid': 208805, 'latitude': 5.305, 'longitude': -82.629, 'depth': 7.0, 'time': 1338770715.29, 'source_id': '6012d95c961f396071be87bf'}
208806 {'_id': ObjectId('6012d95c961f396071be87c0'), 'evid': 208806, 'latitude': 5.508, 'longitude': -82.563, 'depth': 7.0, 'time': 1338779724.75, 'source_id': '6012d95c961f396071be87c0'}
208808 {'_id': ObjectId('6012d95c961f396071be87c1'), 'evid':

210514 {'_id': ObjectId('6012d95c961f396071be887e'), 'evid': 210514, 'latitude': 30.545, 'longitude': -113.909, 'depth': 10.0, 'time': 1346109360.51, 'source_id': '6012d95c961f396071be887e'}
210524 {'_id': ObjectId('6012d95c961f396071be887f'), 'evid': 210524, 'latitude': 12.499, 'longitude': -88.65, 'depth': 34.0, 'time': 1346134096.61, 'source_id': '6012d95c961f396071be887f'}
210527 {'_id': ObjectId('6012d95c961f396071be8880'), 'evid': 210527, 'latitude': 11.978, 'longitude': -88.706, 'depth': 35.0, 'time': 1346144016.88, 'source_id': '6012d95c961f396071be8880'}
210565 {'_id': ObjectId('6012d95c961f396071be8881'), 'evid': 210565, 'latitude': -17.61, 'longitude': 168.334, 'depth': 103.1, 'time': 1346247270.18, 'source_id': '6012d95c961f396071be8881'}
210576 {'_id': ObjectId('6012d95c961f396071be8882'), 'evid': 210576, 'latitude': 38.425, 'longitude': 141.889, 'depth': 53.7, 'time': 1346267111.92, 'source_id': '6012d95c961f396071be8882'}
210588 {'_id': ObjectId('6012d95c961f396071be8883

302619 {'_id': ObjectId('6012d95c961f396071be8939'), 'evid': 302619, 'latitude': 49.399, 'longitude': 155.489, 'depth': 59.5, 'time': 1351753043.5, 'source_id': '6012d95c961f396071be8939'}
302620 {'_id': ObjectId('6012d95c961f396071be893a'), 'evid': 302620, 'latitude': 51.011, 'longitude': -179.699, 'depth': 9.0, 'time': 1351763459.09, 'source_id': '6012d95c961f396071be893a'}
302631 {'_id': ObjectId('6012d95c961f396071be893b'), 'evid': 302631, 'latitude': 55.887, 'longitude': 162.799, 'depth': 9.1, 'time': 1351821123.35, 'source_id': '6012d95c961f396071be893b'}
302640 {'_id': ObjectId('6012d95c961f396071be893c'), 'evid': 302640, 'latitude': 9.219, 'longitude': 126.161, 'depth': 37.0, 'time': 1351880252.73, 'source_id': '6012d95c961f396071be893c'}
302657 {'_id': ObjectId('6012d95c961f396071be893d'), 'evid': 302657, 'latitude': 10.515, 'longitude': 126.93, 'depth': 35.0, 'time': 1351941141.68, 'source_id': '6012d95c961f396071be893d'}
302658 {'_id': ObjectId('6012d95c961f396071be893e'), '

That output suggests strongly that evids were set earlier when I created the original usarray data set.  Will blunder on assuming that is so.
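
One way to check that assumption is to compare an extracted event with the source document that has the same evid.  The following is a minimal sketch: `matches_existing_source` is a hypothetical helper, the key mapping (`lat` to `latitude`, `lon` to `longitude`) is read off the two printed outputs above, and the tolerance is an arbitrary choice.

```python
def matches_existing_source(event, source_doc, tolerance=1e-6):
    """Return True if a CSS3.0 event dict agrees with a source document.

    The key mapping is inferred from the printed outputs above;
    the tolerance is an arbitrary choice.
    """
    key_map = {'lat': 'latitude', 'lon': 'longitude',
               'depth': 'depth', 'time': 'time'}
    return all(abs(event[ek] - source_doc[sk]) <= tolerance
               for ek, sk in key_map.items())

# Values copied from the outputs above for evid 206397
event = {'evid': 206397, 'lat': 12.011, 'lon': 143.505,
         'depth': 10.0, 'time': 1325377805.03}
source_doc = {'evid': 206397, 'latitude': 12.011, 'longitude': 143.505,
              'depth': 10.0, 'time': 1325377805.03,
              'source_id': '6012d95c961f396071be8657'}
print(matches_existing_source(event, source_doc))
```

If every evid flagged in the output passes this test, the "mismatch" is only that the records were already loaded.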

And finally this for this step:

In [9]:
from mspasspy.preprocessing.css30.dbarrival import set_source_id_from_evid
result=set_source_id_from_evid(db)
print('Number of arrival documents processed=',result[0])
print('Number of arrival documents updated=',result[1])
print('Number arrivals in set that did not match keyed by evid =',result[3])

Number of arrival documents processed= 0
Number of arrival documents updated= 0
Number arrivals in set that did not match keyed by evid = {}


That failed for an unknown reason.  The next box seeks to find out why.

In [10]:
cursor=db.arrival.find({}).limit(20)
for doc in cursor:
    print(doc)

{'_id': ObjectId('601012354b4f9e654b4e0e97'), 'index': 6, 'evid': 243020, 'source_lat': -17.376, 'source_lon': -72.611, 'source_depth': 26.9, 'source_time': 1174877218.64, 'mb': -999.0, 'ms': -999.0, 'sta': '115A', 'phase': 'P', 'iphase': 'P', 'delta': 62.786, 'seaz': 136.82, 'esaz': 322.89, 'residual': -0.054000000000000006, 'time': 1174877841.55093, 'deltim': 2.275, 'net': 'TA', 'source_id': '6012d929961f396071be7569'}
{'_id': ObjectId('601012354b4f9e654b4e0e91'), 'index': 0, 'evid': 243020, 'source_lat': -17.376, 'source_lon': -72.611, 'source_depth': 26.9, 'source_time': 1174877218.64, 'mb': -999.0, 'ms': -999.0, 'sta': '319A', 'phase': 'P', 'iphase': 'P', 'delta': 60.128, 'seaz': 138.91, 'esaz': 323.99, 'residual': -0.445, 'time': 1174877823.17531, 'deltim': 1.3, 'css30_sta': '319ATA', 'net': 'TA', 'source_id': '6012d929961f396071be7569'}
{'_id': ObjectId('601012354b4f9e654b4e0e94'), 'index': 3, 'evid': 243020, 'source_lat': -17.376, 'source_lon': -72.611, 'source_depth': 26.9, 's

From this output and looking at the code for set_source_id_from_evid I think it means this is all good to go.   Hence I'm going to see if I can release the kraken and get on.
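
The reasoning here is that the arrival documents already carry a `source_id`, so there was nothing left to update.  A minimal sketch of how that could be confirmed on a sample of documents follows; `count_missing` is a hypothetical helper and the sample documents are abbreviated from the output above.

```python
def count_missing(docs, key='source_id'):
    """Count documents in an iterable that lack the given key."""
    return sum(1 for doc in docs if key not in doc)

# Abbreviated copies of the first two arrival documents printed above
sample = [
    {'evid': 243020, 'sta': '115A', 'net': 'TA',
     'source_id': '6012d929961f396071be7569'},
    {'evid': 243020, 'sta': '319A', 'net': 'TA',
     'source_id': '6012d929961f396071be7569'},
]
print('documents without source_id =', count_missing(sample))

# With a live database the same question can be asked of the whole collection:
#   db.arrival.count_documents({'source_id': {'$exists': False}})
```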

Warning - the above takes a long time.  It might be smart to run the next box before running the above.

## Build index
As noted in the raw data tutorial, building these indices will speed things up a lot:

In [11]:
db.arrival.create_index(
[
    ('source_id',1),
    ('net', 1),
    ('sta',1)
])

'source_id_1_net_1_sta_1'
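
The string returned by `create_index` is the index name, which MongoDB builds by default from the field names and sort directions.  The following minimal sketch reproduces that naming convention in plain Python; `default_index_name` is a hypothetical helper written for illustration, not a pymongo function.

```python
def default_index_name(keys):
    """Join (field, direction) pairs the way MongoDB names an index by default."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)

keys = [('source_id', 1), ('net', 1), ('sta', 1)]
index_name = default_index_name(keys)
print(index_name)

# With a live database, db.arrival.index_information() lists the
# indexes that exist on the collection, keyed by these names.
```

Because this is a compound index, queries on `source_id` alone, or on `source_id` and `net`, can also use it.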

## Source data final preprocessing
One last step:  section 8.1 of the raw data tutorial.  No change from that prototype.

In [12]:
from mspasspy.preprocessing.seed.ensembles import link_source_collection
# This function behaves like a subroutine and returns nothing
#link_source_collection(db,prefer_evid=True)
# The call above failed, so trying this variant
link_source_collection(db)

## Ensemble preprocessing data loading
This is a variant of the last box in the raw data tutorial.  It should run a bit faster with no significant degradation of the result.  It omits the bandpass filtering and doesn't demean until after cutting the data down to size with a shorter time window.

In [13]:
from mspasspy.util.Undertaker import Undertaker
from mspasspy.algorithms import signals
from mspasspy.ccore.seismic import (TimeWindow,TimeSeries,TimeSeriesEnsemble)
from mspasspy.algorithms.window import WindowData
from mspasspy.algorithms.bundle import bundle_seed_data
from mspasspy.preprocessing.seed.ensembles import (load_one_ensemble,
                                                   load_channel_data,
                                                   load_site_data,
                                                   load_arrivals_by_id,
                                                   erase_seed_metadata)
import time
stedronsky=Undertaker(db)
dbwf=db.import_miniseed_ensemble
cursor=dbwf.find({},no_cursor_timeout = True)
n=0
for doc in cursor:
    print('working on ensemble number ',n)
    if 'evid' not in doc:
        print('Ensemble number ',n,' does not have an evid set - skipped')
        n+=1
        continue
    print('calling load_one_ensemble for event number=',n)
    ens=load_one_ensemble(doc,apply_calib=True,
                  ensemble_mdkeys=['source_id','evid','starttime','endtime'])
    print('calling load_site_data')
    load_site_data(db,ens)
    print('calling_load_arrivals_by_id')
    nlive=load_arrivals_by_id(db,ens)
    print('Done - calling editor to cremate the dead')
    # n is incremented once per ensemble, at the end of the loop body
    t0=time.time()
    d_cleaned=stedronsky.cremate(ens)
    t1=time.time()
    print('Time to handle dead data=',t1-t0)
    t0=time.time()
    twin=TimeWindow(-300.0,400.0)
    for i in range(len(d_cleaned.member)):
        d=TimeSeries(d_cleaned.member[i])
        t=d['arrival_time']
        d.ator(t)
        d=WindowData(d,twin)
        d_cleaned.member[i]=d
    t1=time.time()
    print('Time to window cleaned ensemble=',t1-t0)
    t0=time.time()
    signals.detrend(d_cleaned,'demean')
    t1=time.time()
    print('Time to apply detrend operator=',t1-t0)
    t0=time.time()
    load_channel_data(db,d_cleaned)
    t1=time.time()
    print('Time to build cross reference to channel=',t1-t0)
    t0=time.time()
    ens3c=bundle_seed_data(d_cleaned)
    t1=time.time()
    print('Time to run bundle algorithm=',t1-t0)
    print("Number of Seismograms after bundle_seed_data=",len(ens3c.member))
    #print_dead_logs(ens3c)
    erase_seed_metadata(d_cleaned)
    # This is a workaround for a bug that currently exists in
    # save_ensemble_data in handling history database
    # Should be able to eventually delete this
    for d in ens3c.member:
        d.clear_history()
    t0=time.time()
    db.save_ensemble_data(ens3c,'gridfs')
    t1=time.time()
    print("Time to save data to MongoDB=",t1-t0)
    del ens
    n+=1
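
The repeated `t0=time.time()` / `t1=time.time()` pairs in the loop above could be folded into a small context manager.  This is a minimal sketch of that idea, not code from the tutorial: `timed` is a hypothetical helper, and the summation is only a placeholder for a real step such as `stedronsky.cremate(ens)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the elapsed wall time of a code block."""
    t0 = time.time()
    try:
        yield
    finally:
        elapsed = time.time() - t0
        print(label, elapsed)
        if results is not None:
            results[label] = elapsed

timings = {}
with timed('Time to run a stand-in step=', timings):
    total = sum(range(100000))  # placeholder for a real processing step
```

Collecting the elapsed times in a dict makes it easy to accumulate totals per step across all ensembles instead of reading them from the printed log.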