# Session 2:  MsPASS Generalized Headers (Metadata container)
## Basic Metadata Operations
This tutorial assumes you have already completed the Session 1 getting-started tutorial, where we saved waveform data downloaded with FDSN web services to a folder called "./wf".  In this session we build an alternative form of input handling that reads the raw miniseed data directly rather than converting the data immediately to MsPASS data objects.  Reading directly from miniseed is a useful feature in its own right, and because most students in this course are likely to have some experience with miniseed, we hope the header namespace it creates will be familiar ground.  We will also use the one-file-per-channel paradigm that SAC users are familiar with.  We do that for its educational benefit: we presume it is familiar ground to most students, and it helps clarify the key relationship of which metadata belong to which signal.  A noteworthy point, however, is that we consider the model of one file per signal an abomination that needs to be exorcised from data management dogma because it creates horrible performance problems on HPC systems.

Our first step is to read the set of miniseed files we created in Session 1.  First, let's use Python to look into that directory:

In [14]:
import os
with os.scandir('./wf') as entries:
    for entry in entries:
        print(entry.name)

wfshortcourse_1091.mseed
wfshortcourse_1186.mseed
wfshortcourse_450.mseed
wfshortcourse_584.mseed
wfshortcourse_255.mseed
wfshortcourse_522.mseed
wfshortcourse_539.mseed
wfshortcourse_1019.mseed
wfshortcourse_781.mseed
wfshortcourse_761.mseed
wfshortcourse_109.mseed
wfshortcourse_637.mseed
wfshortcourse_668.mseed
wfshortcourse_849.mseed
wfshortcourse_565.mseed
wfshortcourse_138.mseed
wfshortcourse_1117.mseed
wfshortcourse_24.mseed
wfshortcourse_729.mseed
wfshortcourse_645.mseed
wfshortcourse_985.mseed
wfshortcourse_474.mseed
wfshortcourse_1115.mseed
wfshortcourse_33.mseed
wfshortcourse_967.mseed
wfshortcourse_424.mseed
wfshortcourse_180.mseed
wfshortcourse_1063.mseed
wfshortcourse_994.mseed
wfshortcourse_757.mseed
wfshortcourse_316.mseed
wfshortcourse_332.mseed
wfshortcourse_618.mseed
wfshortcourse_499.mseed
wfshortcourse_1262.mseed
wfshortcourse_1207.mseed
wfshortcourse_276.mseed
wfshortcourse_1018.mseed
wfshortcourse_913.mseed
wfshortcourse_1102.mseed
wfshortcourse_1208.mseed
wfshort

The names are meaningless, but let's now build an index for this set of files and store it in our database.  To do that we first have to create the MsPASS handle used to access the data.  Variations of the following incantation will normally appear at the top of any MsPASS job script:

In [13]:
from mspasspy.db.client import DBClient
from mspasspy.db.database import Database
dbclient=DBClient()
dbh=Database(dbclient,'shortcourse')

There are two things created here that are important to understand:
1.  *dbclient* is an instance of a communication handle created to interact with MongoDB.  It is a minor variant of a pymongo class called *MongoClient*.  (For those familiar with object-oriented programming, *DBClient* is a subclass/child of *MongoClient*.)  It can be thought of as a top-level handle for the entire database system.  It is normally created once and referenced only in calls like the last line of the box above.
2.  *dbh* is an instance of the MsPASS *Database* class.  *dbh* is a handle that we will use to manipulate a particular database, which in this case we called "shortcourse".  We will get more into the weeds of MongoDB later, but for now think of this as an abstract handle we use to interact with the database. 

With that our next step is to build an index to each of the files in "./wf".   

In [18]:
with os.scandir('./wf') as entries:
    for entry in entries:
        if entry.is_file():
            # entry.path is the scanned directory joined with entry.name
            filename=entry.path
            dbh.index_mseed_file(filename)
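
The loop above indexes every regular file found in "./wf".  As a minimal sketch, assuming the files of interest all end in ".mseed", the paths could instead be gathered first with the standard-library pathlib module so that stray files are skipped and the order is reproducible.  The helper name `list_mseed_files` is hypothetical and is not part of MsPASS.

```python
from pathlib import Path

def list_mseed_files(directory):
    """Return a sorted list of paths (as strings) to the .mseed files in directory.

    Hypothetical helper for illustration only; not part of MsPASS.
    """
    paths = []
    for p in Path(directory).iterdir():
        # Keep only regular files with the expected miniseed extension
        if p.is_file() and p.suffix == '.mseed':
            paths.append(str(p))
    return sorted(paths)
```

Each returned path could then be passed to dbh.index_mseed_file in the same way as the filename variable in the loop above.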

In [19]:
n=dbh.wf_miniseed.count_documents({})
print('Number of documents in wf_miniseed collection=',n)

Number of documents in wf_miniseed collection= 1287


In [21]:
cursor=dbh.wf_miniseed.find({}).limit(3)
# This form is currently broken. Replace next block with this when read_ensemble_data is repaired
#ensemble=dbh.read_ensemble_data(cursor)

In [22]:
# This box is temporary until read_ensemble_data gets repaired
from mspasspy.ccore.seismic import TimeSeriesEnsemble
ensemble=TimeSeriesEnsemble(3)
cursor=dbh.wf_miniseed.find({}).limit(3)
for doc in cursor:
    d=dbh.read_data(doc,collection='wf_miniseed')
    ensemble.member.append(d)

Let's first look at what the headers contain.   We can start with a simple print of the first member of the 
ensemble:

In [23]:
print(ensemble.member[0])

{'_id': ObjectId('60e7003608bbf0c506fbfb95'), 'chan': 'BHZ', 'delta': 0.025000, 'dfile': './wf/wfshortcourse_1091.mseed', 'dir': '/home/pavlis/MsPASSShortCourse2021/Notebooks', 'foff': 0, 'format': 'mseed', 'last_packet_time': 1302188679.900000, 'nbytes': 53248, 'net': 'TA', 'npts': 55712, 'sampling_rate': 40.000000, 'sta': 'T36A', 'starttime': 1302187287.100000, 'storage_mode': 'file'}


The default print is hard to read.  Here is an alternative display that will be helpful for class discussion:

In [24]:
# This copy is not efficient but makes the syntax less obscure
d=ensemble.member[0]
print('python type of symbol d=',type(d))
print('\nkey-value pairs in d:\n')
for k in d:
    print(k,d[k])

python type of symbol d= <class 'mspasspy.ccore.seismic.TimeSeries'>

key-value pairs in d:

_id 60e7003608bbf0c506fbfb95
chan BHZ
delta 0.025
dfile ./wf/wfshortcourse_1091.mseed
dir /home/pavlis/MsPASSShortCourse2021/Notebooks
foff 0
format mseed
last_packet_time 1302188679.9
nbytes 53248
net TA
npts 55712
sampling_rate 40.0
sta T36A
starttime 1302187287.1
storage_mode file
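
The starttime and last_packet_time values above are epoch times (seconds since 1970-01-01 UTC), which are hard to read as raw floats.  The following is a minimal sketch of the same key-value loop with a more readable time display.  It uses a plain Python dict holding a few of the values printed above as a stand-in for the TimeSeries object.  The names `header`, `TIME_KEYS`, and `format_value` are hypothetical and introduced only for this illustration.

```python
from datetime import datetime, timezone

# Stand-in for the TimeSeries header: a plain dict with values copied from the output above
header = {
    'net': 'TA',
    'sta': 'T36A',
    'chan': 'BHZ',
    'npts': 55712,
    'sampling_rate': 40.0,
    'starttime': 1302187287.1,
    'last_packet_time': 1302188679.9,
}

# Keys assumed (for this sketch only) to hold epoch times in seconds
TIME_KEYS = {'starttime', 'last_packet_time'}

def format_value(key, value):
    """Return value as a string, converting epoch times to ISO 8601 UTC strings."""
    if key in TIME_KEYS:
        return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()
    return str(value)

# Same loop form as the cell above: iterate over keys and fetch each value
for k in header:
    print(k, format_value(k, header[k]))
```

The loop body has the same form as the one applied to d above because d supports the same key iteration and d[k] access shown there.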


We can also print a summary of the entire ensemble in tabular form as follows:

In [25]:
from obspy import UTCDateTime
print('number  net sta chan ->loc<- starttime endtime dt npts')
i=0
for d in ensemble.member:
    net=d['net']
    sta=d['sta']
    chan=d['chan']
    if d.is_defined('loc'):
        loc=d['loc']
    else:
        loc='  '
    stime=d.t0
    etime=d.endtime()
    dt=d.dt
    npts=d.npts
    print(i,net,sta,chan,'->',loc,'<-',UTCDateTime(stime),UTCDateTime(etime),dt,npts)
    i+=1

number  net sta chan ->loc<- starttime endtime dt npts
0 TA T36A BHZ ->    <- 2011-04-07T14:41:27.100000Z 2011-04-07T15:04:39.875000Z 0.025 55712
1 TA W37B BHN ->    <- 2011-04-07T14:41:27.100000Z 2011-04-07T15:06:01.025000Z 0.025 58958
2 TA D35A BHE ->    <- 2011-04-07T14:41:27.100000Z 2011-04-07T15:04:30.000000Z 0.025 55317


A few points of clarification as a lead in to a class discussion:

* Notice the use of the "member" container in TimeSeriesEnsemble.  The code above shows it is a Python
  "iterable".  Loops like the one above are a common construct.
* The "member" elements of the ensemble are *TimeSeries* objects.  The type(d) output above shows that
  explicitly for member 0.
* Notice the use of the dict-like syntax for the getters, like "net=d['net']".
* The tabular print shows how to display properties of d that aren't appropriate as "Metadata":
  the calls to d.npts, d.dt, and d.endtime().
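
The end times in the table are consistent with the usual relation for uniformly sampled data, endtime = t0 + (npts-1)*dt.  The following is a minimal sketch of that arithmetic using the values for member 0.  The function `compute_endtime` is a hypothetical name for illustration, and this is not claimed to be how d.endtime() is implemented.

```python
from datetime import datetime, timezone

def compute_endtime(t0, dt, npts):
    """Time of the last sample for npts uniformly spaced samples starting at t0."""
    # The first sample is at t0, so the last one is (npts-1) intervals later
    return t0 + (npts - 1) * dt

# Values for member 0 taken from the output above
t0 = 1302187287.1   # start time in epoch seconds
dt = 0.025          # sample interval in seconds
npts = 55712        # number of samples

etime = compute_endtime(t0, dt, npts)
# Display as a UTC date string for comparison with the endtime column of the table
print(datetime.fromtimestamp(etime, tz=timezone.utc).isoformat())
```

To within floating-point rounding, the result should agree with the end time listed for row 0 of the table (15:04:39.875 on 2011-04-07).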