# Use of Metadata, MetadataDefinitions, and AntelopePf

## Metadata

MsPASS makes extensive use of a C++ object we call Metadata.  A Metadata object is one of many options we could have used to generalize the idea of a header familiar to many seismologists.  Headers are used, for example, in SAC and in all seismic reflection packages we are aware of.  A "header", however, can be thought of as an implementation detail of a more general concept:  fetching parameters through a name-value pair relationship.  We use Metadata in preference to a python dict because the implementation is cleaner at the C++ level.  It is also, in principle, faster since the same methods visible through the python wrappers are accessible in C++ code.  The purpose of this tutorial is to familiarize users with this core class, which is used for a wide variety of purposes.

This tutorial first teaches how to utilize the Metadata object in python.  When that is understood, we move to the AntelopePf, which is a child of Metadata with expanded capabilities.  

We have to tell python what a Metadata object is.  We use the following common python incantation:

In [1]:
from mspasspy.ccore import Metadata


We can then create an empty Metadata container in the standard python way.

In [2]:
md=Metadata()

The Metadata container itself is more flexible, but to provide a cleaner mapping to MongoDB, MsPASS restricts the contents to the four lowest common denominator types native to most computer languages:
* real numbers, which are always promoted in MsPASS to double (64 bit floats)
* integers, which are always promoted to 64 bit signed ints
* character strings, which are handled in C++ as string objects and wrapped to python strings.  i.e. the conversion between C++ and python strings should be seamless.
* booleans - meaning values that are "True" or "False" in python.   

There are putters that create each of these core types, following a clear naming convention:

In [3]:
md.put_double("real_example",10.45)
md.put_long("int_example",42)
md.put_string("foo","bar")  # the classic programmer idiom
md.put_bool("bool_example",True)

There are also "overloaded" versions of the same methods that depend on the python interpreter's rules for assigning a type to a literal.  These four lines do nearly the same thing as the previous four lines of python code:

In [4]:
md.put("overreal",99.45)
md.put("overint",2)
md.put("overstr","foobar")
md.put("overbool",False)
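
The overloaded `put` relies on the type python assigns to each value.  The following sketch is not MsPASS code: `mspass_type_name` is a hypothetical helper, written only to illustrate how a python value would map onto the four supported types.  One detail worth noting is the order of the tests.  In python `bool` is a subclass of `int`, so a boolean has to be tested before an integer.

```python
def mspass_type_name(value):
    """Return the name of the Metadata type a python value would map to.

    Hypothetical helper for illustration only; it is not part of mspasspy.
    """
    # bool must be tested first because isinstance(True, int) is True
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError("unsupported type: " + type(value).__name__)


# The same four literals used with the overloaded put calls
for v in (99.45, 2, "foobar", False):
    print(repr(v), "->", mspass_type_name(v))
```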

C++ code could use operator<< to dump the contents of md, but we currently have no equivalent in python.  For now you need to use getters.  This small example shows getters to pull a subset of the 8 entries currently stored in md.

In [6]:
x=md.get_double('overreal')
i=md.get_long('int_example')
s=md.get_string('foo')
b=md.get_bool('bool_example')
print("real example =",x)
print("int example=",i)
print("string example=",s)
print("boolean example=",b)

real example = 99.45
int example= 42
string example= bar
boolean example= True


In interactive scripts or for testing it is often helpful to know what data are defined.  For that purpose we provide the keys method used as follows:

In [7]:
keys=md.keys()
print(keys)

{'bool_example', 'overstr', 'foo', 'real_example', 'overbool', 'overint', 'int_example', 'overreal'}


## MetadataDefinitions

A thorny problem in real data analysis with header attributes, which is what Metadata generalizes, is that the type requested for a parameter must match what is stored.  For example, if we tried to fetch the parameter "foo" from md as an integer or real value, the result would make no sense.  In that case the Metadata class throws a C++ exception that your python code may need to handle.  The current wrappers for Metadata cast the exception message to a stock python RuntimeError.  Hence, if you have a section of code where a key-value pair may not be defined or is subject to a type mismatch, you should use a construct like the following:

In [8]:
try:
    s=md.get_string("foo")  # this will work
    x=md.get_double("foo")  # this will throw an exception we need to handle
except Exception as e: 
    print(repr(e))

RuntimeError('Error in Metadata get method.   Type mismatch in attem to get data with key=foo\nboost::any bad_any_cast wrote this message:  \nboost::bad_any_cast: failed conversion using boost::any_cast\nTrying to convert to data of type=float\nActual entry has type=std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >\n',)


This example uses a best-practice handler that catches the base class Exception and uses the built-in repr function to convert the RuntimeError to a more readable form.  The output remains a bit ugly because of a mismatch in how ipython handles the newline (\n) character, but the gist of the error message should be clear.
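
When a missing or mistyped attribute should not stop a job, the try/except construct can be wrapped in a small helper.  The sketch below is one possible way to do that, not part of MsPASS: `get_or_default` and `StandInMetadata` are hypothetical names.  The stand-in class exists only so the example runs without mspasspy installed; it mimics the one behavior described above, raising a RuntimeError on a type mismatch.

```python
def get_or_default(md, getter_name, key, default=None):
    """Call md.<getter_name>(key); return default if it raises RuntimeError."""
    getter = getattr(md, getter_name)
    try:
        return getter(key)
    except RuntimeError as err:
        # Report only the first line of the multi-line message
        print("Warning:", str(err).split("\n", 1)[0])
        return default


class StandInMetadata:
    """Hypothetical stand-in that mimics only the type-mismatch behavior."""

    def __init__(self, data):
        self._data = dict(data)

    def get_string(self, key):
        return self._typed(key, str)

    def get_double(self, key):
        return self._typed(key, float)

    def _typed(self, key, expected):
        value = self._data.get(key)
        if not isinstance(value, expected):
            raise RuntimeError("Type mismatch for key=" + key + "\n(details)")
        return value


md_demo = StandInMetadata({"foo": "bar"})
s = get_or_default(md_demo, "get_string", "foo")               # succeeds
x = get_or_default(md_demo, "get_double", "foo", default=0.0)  # falls back
print(s, x)
```

Returning a default silently can hide real problems, so a helper like this is best reserved for attributes that are genuinely optional.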

The general problem of data types with attributes like those we extract from Metadata has two end members in the computing world:  (1) python, which is completely agnostic about type, and (2) strongly typed languages like C/C++ and FORTRAN.  Since MsPASS is a hybrid, with C++ implementations of some compute-intensive core algorithms and python for higher-level processing, this clash of concepts has to be handled.  An additional constraint comes from the use of MongoDB or any database engine.  Chaos will reign if a name key is used for attributes of different types, a problem far too easy to create even accidentally with python.  Since the main purpose of MsPASS is to provide a framework for handling long-running, compute-intensive data processing jobs, we have to enforce some type restrictions on attributes to avoid mysterious downstream behaviour and bugs that are difficult or impossible to find.  We do this through a core MsPASS object we call MetadataDefinitions.  More about the philosophy and design concepts of MetadataDefinitions can be found in the User's Manual (hyperlink to proper page).  Here we focus on how MetadataDefinitions should be used in a workflow.

Any MsPASS job using any of the ccore data objects should contain these lines near the top of the job script:

In [9]:
from mspasspy.ccore import MetadataDefinitions
mdef=MetadataDefinitions()

The import line, of course, could be mixed with other import commands at the top of your processing script.  The point here is the call to the default constructor for MetadataDefinitions().   That step initiates a read of a configuration file that defines the MsPASS default attribute namespace.    The full details of the default namespace can be found in tables in the User's Manual (hyperlink).  

First, let's look at the full set of keys for the default namespace:

In [10]:
mdkeys=mdef.keys()
print(mdkeys)

['U11', 'U12', 'U13', 'U21', 'U22', 'U23', 'U31', 'U32', 'U33', 'calib', 'chan', 'chanid', 'delta', 'dfile', 'dir', 'foff', 'gridfs_idstr', 'hang', 'iphase', 'loc', 'mb', 'ms', 'net', 'npts', 'oid_site', 'oid_source', 'orid', 'phase', 'sampling_rate', 'site_elev', 'site_id', 'site_lat', 'site_lon', 'source_depth', 'source_id', 'source_lat', 'source_lon', 'source_time', 'sta', 'starttime', 'storage_mode', 't0_shift', 'time_standard', 'vang', 'wfid_string']


You can find the type of any key through the type method.  This script prints the full table of types for all defined keys:

In [11]:
for k in mdkeys:
    t=mdef.type(k)
    print(k,' has type=',t)

U11  has type= MDtype.Double
U12  has type= MDtype.Double
U13  has type= MDtype.Double
U21  has type= MDtype.Double
U22  has type= MDtype.Double
U23  has type= MDtype.Double
U31  has type= MDtype.Double
U32  has type= MDtype.Double
U33  has type= MDtype.Double
calib  has type= MDtype.Double
chan  has type= MDtype.String
chanid  has type= MDtype.Int64
delta  has type= MDtype.Double
dfile  has type= MDtype.String
dir  has type= MDtype.String
foff  has type= MDtype.Int64
gridfs_idstr  has type= MDtype.String
hang  has type= MDtype.Double
iphase  has type= MDtype.String
loc  has type= MDtype.String
mb  has type= MDtype.Double
ms  has type= MDtype.Double
net  has type= MDtype.String
npts  has type= MDtype.Int64
oid_site  has type= MDtype.String
oid_source  has type= MDtype.String
orid  has type= MDtype.Int64
phase  has type= MDtype.String
sampling_rate  has type= MDtype.Double
site_elev  has type= MDtype.Double
site_id  has type= MDtype.Int64
site_lat  has type= MDtype.Double
site_lon  has 

While writing a script if you aren't sure what a key represents you can use the concept method.  Here, for example, we ask for a brief description of what the attribute "vang" should be:

In [12]:
s=mdef.concept('vang')
print(s)

Inclination from +up (in degree) of a seismometer component - vertical angle


To provide a mechanism for MsPASS to work with legacy packages like SAC, the MetadataDefinitions object has a generic alias mechanism.  For example, if we wanted to know the alternative names that can be handled automatically for the attribute called "source_lat", we can issue this command:

In [13]:
print(mdef.aliases('source_lat'))

['EVLA', 'origin.lat']


This shows that we can use the SAC name, EVLA, or the Antelope name, origin.lat, for an alternative to source_lat and it will be handled automatically by database readers and writers.

There is a collection of other useful methods defined in MetadataDefinitions for dealing with aliases.  The following script illustrates their use:

In [14]:
if(mdef.is_alias('EVLA')):
    print('EVLA is a valid alias')
    ukey=mdef.unique_name('EVLA')
    print('The unique key for EVLA for database records is ',ukey)
if(mdef.has_alias('source_lat')):
    print('The keyword source_lat has one or more aliases defined')
    print('Valid aliases are',mdef.aliases('source_lat'))

EVLA is a valid alias
The unique key for EVLA for database records is  ('source_lat', MDtype.Double)
The keyword source_lat has one or more aliases defined
Valid aliases are ['EVLA', 'origin.lat']
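
Conceptually, an alias table is a many-to-one map from alternative names to a unique key.  The sketch below is a toy model in plain python, not the MetadataDefinitions implementation.  The class `ToyAliases` is hypothetical, including the argument order of its `add_alias` method, and it is loaded with only the two aliases shown above.  Unlike the real `unique_name`, which returned a name and type pair in the output above, the toy version returns only the name.

```python
class ToyAliases:
    """Toy many-to-one alias table (illustration only)."""

    def __init__(self):
        self._unique = {}   # alias -> unique key
        self._aliases = {}  # unique key -> list of aliases

    def add_alias(self, key, alias):
        self._unique[alias] = key
        self._aliases.setdefault(key, []).append(alias)

    def is_alias(self, name):
        return name in self._unique

    def has_alias(self, key):
        return key in self._aliases

    def unique_name(self, alias):
        return self._unique[alias]

    def aliases(self, key):
        return list(self._aliases.get(key, []))


toy = ToyAliases()
toy.add_alias("source_lat", "EVLA")
toy.add_alias("source_lat", "origin.lat")
print(toy.is_alias("EVLA"), toy.unique_name("EVLA"), toy.aliases("source_lat"))
```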


The use of aliases is an important component of MsPASS because it simplifies utilizing legacy software as part of a workflow.  A typical example would be some specialized program that was built around the SAC data format, or an anticipated development in MsPASS of running SAC from a MsPASS workflow.  To support that type of application, *MetadataDefinitions* has two methods called *apply_aliases* and its inverse, *clear_aliases*.  A script later in this section demonstrates both.

MetadataDefinitions has what could be called advanced methods for handling a feature of MongoDB we found useful called "normalization".   (See https://docs.mongodb.com/manual/core/data-model-design/ for the concepts of normalization in MongoDB)   The key point is that some attributes like receiver coordinates and response information are better stored in a single place and found by a cross-reference mechanism rather than being duplicated many times or risk mistakes from incorrect associations.   Methods available to support this feature are:

1.  *is_normalized* - test whether an attribute is expected to be normalized
2.  *collection* - return the MongoDB collection name where the master of an attribute should be located.
3.  *readonly, writeable* - test for whether an attribute is marked readonly.   Normalized data are normally marked readonly (immutable) because they should be set only on construction and never altered by a processing workflow.   
4.  *set_readonly, set_writeable* - backdoor methods to set (set_readonly) or override locks (set_writeable) on an attribute.   These functions should not be used unless essential.   
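
The idea behind normalization can be sketched with plain python dictionaries standing in for MongoDB collections.  Everything in this sketch is an assumption made for illustration: the layout of the two stand-in collections, the values, and the `resolve` helper are hypothetical, and none of it uses the MetadataDefinitions API.  Only the attribute names are taken from the default namespace listed earlier.

```python
# Master copy of receiver coordinates, one entry per site
# (a dict standing in for a collection)
site = {
    1: {"site_lat": 45.0, "site_lon": 55.0, "site_elev": 0.1},
}

# Waveform records store only the cross-reference key, not the coordinates
wf = [
    {"site_id": 1, "npts": 100},
    {"site_id": 1, "npts": 200},
]


def resolve(record, master, id_key):
    """Return a copy of record merged with the master entry it points to."""
    merged = dict(master[record[id_key]])
    merged.update(record)
    return merged


full = [resolve(r, site, "site_id") for r in wf]
print(full[0]["site_lat"], full[1]["site_lat"])
```

Because the coordinates live in one place, correcting an error in the master entry fixes every record that references it.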

Finally, there are two methods for manually defining new attributes not defined in the master namespace.

1. *add* - define a new attribute with type and concept properties
2. *add_alias* - define a new alias for an existing attribute. 

These also should be used with caution, as it is preferable for custom applications to edit the master list used to construct MetadataDefinitions objects.

Returning to aliases, the following script demonstrates *apply_aliases* on a data object:

In [17]:
from mspasspy.ccore import CoreSeismogram
def printmd(md):
    keys=md.keys()
    for k in keys:
        print("Value with key=",k," is ",md[k])

# We create a seismogram object to demonstrate that these methods work on data objects
sacmd=CoreSeismogram(100)
sacmd.put_double("source_lat",10.0)
sacmd.put_double("source_lon",-75.0)
sacmd.put_double("source_depth",10.0)
sacmd.put_double("site_lat",45.0)
sacmd.put_double("site_lon",55.0)
print("Initial metadata with MsPASS names")
printmd(sacmd)
# This creates the list of aliases to apply - the names are the alias names
# Note STEL is not actually defined above - illustrates handling of unset aliases
aliaslist=["EVLA","EVLO","EVDP","STLA","STLO","STEL"]
mdef.apply_aliases(sacmd,aliaslist)
print("Metadata after apply_aliases")
printmd(sacmd)

Initial metadata with MsPASS names
Value with key= site_lon  is  55.0
Value with key= starttime  is  0.0
Value with key= source_lat  is  10.0
Value with key= npts  is  100
Value with key= delta  is  1.0
Value with key= source_lon  is  -75.0
Value with key= source_depth  is  10.0
Value with key= site_lat  is  45.0
Metadata after apply_aliases
Value with key= EVLO  is  -75.0
Value with key= starttime  is  0.0
Value with key= npts  is  100
Value with key= EVDP  is  10.0
Value with key= delta  is  1.0
Value with key= EVLA  is  10.0
Value with key= STLO  is  55.0
Value with key= STLA  is  45.0


Notice that apply_aliases silently skipped the inconsistency that *site_elev* is not defined in sacmd.  That was an intentional design choice because we expect global alias lists (e.g. applying a set of fixed aliases for SAC, obspy, or Antelope) to be the norm.

Also notice that three additional Metadata attributes appeared when the printmd function was called:  starttime (real), npts (integer), and delta (real).  These are three attributes wired into private variables in the MsPASS data objects.  The C++ API guarantees these Metadata name-value pairs are consistent with the internal values.  They appear in this output because printmd lists all defined Metadata fields.  Note, finally, that printmd uses the alternative access method for Metadata fields, which takes the key in the same syntax as a python dict.  For purely pythonic interactions that approach is appropriate because python just returns the right type.  In contrast, calling a method like get_double that is dogmatic about type will generate an exception if the type does not match.  Which you should use will depend on context.

The *clear_aliases* method of *MetadataDefinitions* will restore all entries to their standard names as illustrated here:

In [None]:
mdef.clear_aliases(sacmd)
printmd(sacmd)

The *apply_aliases* and *clear_aliases* methods provide generic support for working with different namespaces.  The cost of applying them is not large, but not tiny either.  For efficiency, avoid unnecessary alias definitions by grouping processes that use a common namespace when possible.  On the other hand, when a block of processing steps requiring aliases is finished you should immediately call *clear_aliases* to avoid downstream errors.  We emphasize that *clear_aliases* resets ALL metadata defined as an alias by the MetadataDefinitions object.
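
One way to make sure *clear_aliases* is always called is to wrap the pair in a python context manager.  This is a sketch of that pattern rather than a feature of MsPASS: `aliased` and `StandInDefinitions` are hypothetical names.  The stand-in class only records the calls it receives so that the example runs without mspasspy, and the two call signatures are copied from the cells above.

```python
from contextlib import contextmanager


@contextmanager
def aliased(mdef, d, aliaslist):
    """Apply aliases to d for the duration of a with-block, then clear them."""
    mdef.apply_aliases(d, aliaslist)
    try:
        yield d
    finally:
        # Runs even if the block raises, so aliases never leak downstream
        mdef.clear_aliases(d)


class StandInDefinitions:
    """Hypothetical stand-in that only records which calls were made."""

    def __init__(self):
        self.calls = []

    def apply_aliases(self, d, aliaslist):
        self.calls.append("apply")

    def clear_aliases(self, d):
        self.calls.append("clear")


mdef_demo = StandInDefinitions()
data = {}
with aliased(mdef_demo, data, ["EVLA", "EVLO"]):
    # Steps that need the alias names would go here
    mdef_demo.calls.append("process")
print(mdef_demo.calls)
```

Grouping all steps that need one namespace inside a single with-block also follows the efficiency advice above, since the aliases are applied and cleared only once.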


## AntelopePf

The Metadata object is an implementation of the core idea of fetching an attribute defined by a name-value pair, where the name is what is commonly called a key and the value is the data we want to associate with the key.  As we just saw, this maps exactly onto the concept of a header, which has proven useful since the earliest days of seismic data processing pioneered by the oil and gas industry in the 1960s.

The *AntelopePf* addresses a different but related problem common to all data processing.  Almost any algorithm has at least one tunable parameter that needs to be defined.  In general, the more generic an algorithm is, the more parameters are needed to define how it should behave.  In the early days of data processing this problem was often solved by creating special format "input files" that a program read at startup.  Computer scientists realized decades ago that this was a generic problem and thus had generic solutions.  Hundreds of solutions to this problem exist, with configuration files of various formats.  Today the most common is probably xml.  We elected not to use xml for our initial development of MsPASS for two reasons:
1.  xml is not a human readable format, but a language for robots (computers).  It is very hard for a human being to create a valid xml file by entering the data manually.  We needed a format that was easy for a human to construct.
2.  Many seismologists utilize BRTT's "parameter files" because of the generous license agreement BRTT provides for U.S. scientists.   Furthermore, both primary authors of MsPASS were familiar with parameter files and we had an open source implementation we could build on from Pavlis's plane wave migration code.

We thus adopted the "parameter files" syntax to implement an extension of Metadata we call an *AntelopePf*. The *AntelopePf* is a child of *Metadata* so the same methods introduced for *Metadata* can be used for an *AntelopePf*.   The use, however, is more than a little subtle and is best understood from an example.  

Let's look at a concrete example of the kind of complex parameter file that *AntelopePf* was designed to handle.  The following is the default configuration for a new deconvolution routine in MsPASS we call CNR3CDecon (for Colored Noise 3C (Three-component) Deconvolution):
```
########################################################################
operator_nfft 4096
#damping_factor 1000.0
damping_factor 1.0
snr_regularization_floor 2.0
target_sample_interval 0.05
deconvolution_data_window_start -2.0
deconvolution_data_window_end 30.0
time_bandwidth_product 4.5
number_tapers 8
shaping_wavelet_dt 0.05
shaping_wavelet_type ricker
shaping_wavelet_frequency 1.0
shaping_wavelet_frequency_for_inverse 0.5
noise_window_start -30.0
noise_window_end -5.0

taper_type cosine
CosineTaper &Arr{
  data_taper &Arr{
    front0 -2.0
    front1 -1.0
    tail1 27.0
    tail0 29.5
  }
  wavelet_taper &Arr{
   front0 -0.75
   front1 -0.25
   tail1 2.5
   tail0 3.0
  }
}
LinearTaper &Arr{
  data_taper &Arr{
    front0 -2.0
    front1 -1.0
    tail1 27.0
    tail0 29.5
  }
  wavelet_taper &Arr{
   front0 -0.75
   front1 -0.25
   tail1 2.5
   tail0 3.0
  }
}
########################################################################

```
We have supplied a copy of the data above in a file called data/test.pf.  You should then be able to load this file with the following:  

In [25]:
from mspasspy.ccore import AntelopePf
pf=AntelopePf('data/test.pf')

Parameters outside the curly brackets and the "&Arr" tags are handled by Metadata methods because they are simple name-value pairs.  Here are a couple of examples.  You can extend these to test your knowledge.

In [26]:
print("Simple parameter operator_nfft has this value:",pf.get_long("operator_nfft"))
print("Simple parameter damping_factor has this value:",pf.get_double("damping_factor"))

Simple parameter operator_nfft has this value: 4096
Simple parameter damping_factor has this value: 1.0


*AntelopePf* extends *Metadata* with two primary methods:   *get_branch* and *get_tbl*.  This little code fragment illustrates the *get_branch* method:

In [27]:
pfb1=pf.get_branch('CosineTaper')
# Note the example pf has no simple name-value pairs under the CosineTaper tag.  
# This illustrates that is o
keys=pfb1.keys()
print('Metadata keys for CosineTaper branch  (an empty list)',keys)
pfb2=pfb1.get_branch('wavelet_taper')
keys=pfb2.keys()
print('Metadata keys found for wavelet_taper branch: ',keys)

Metadata keys for CosineTaper branch  (an empty list) set()
Metadata keys found for wavelet_taper branch:  {'front0', 'tail0', 'tail1', 'front1'}


This particular example is a little unusual in that the result of the first get_branch call has no simple name-value pairs, only two branches.  It also has no data that can be fetched with the other *AntelopePf* extension, *get_tbl*.  To see how that works, here is the default parameter file for the Antelope contrib program export_to_mspass, which can be used to take a data set defined by an Antelope database and import it to MsPASS.
```
required &Tbl{
dt delta real
origin.depth source_depth real
origin.lat source_lat real
origin.lon source_lon real
origin.time source_time real
site.lat site_lat real
site.lon site_lon real
site.elev site_elev real
nsamp npts int
sta sta string
evid source_id int
U11 U11 real
U12 U12 real
U13 U13 real
U21 U21 real
U22 U22 real
U23 U23 real
U31 U31 real
U32 U32 real
U33 U33 real
}
optional &Tbl{
origin.mb mb real
origin.ms ms real
arrival.iphase iphase string
assoc.phase phase string
orid orid int
}
```
The above data are contained in another file data/test2.pf.   You should be able to load it by running the following:

In [28]:
pf2=AntelopePf('data/test2.pf')
tbllist=pf2.get_tbl('optional')
print(tbllist)

['origin.mb mb real', 'origin.ms ms real', 'arrival.iphase iphase string', 'assoc.phase phase string', 'orid orid int']


This example illustrates that the *get_tbl* method returns the data between the "optional &Tbl{" and the "}" at the end of the data file as a python list of strings, one list element per line.  This can be used by any program where the input can be defined as a sequence of lines.  This example uses a format where token 1 is the Antelope database attribute name that is to be fetched, token 2 is the name that is to be assigned for the export file, and token 3 is the name used to define the type of the data expected (Antelope's database has type constraints for the same reasons we noted above).
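
Because *get_tbl* returns plain strings, turning them into structured records is left to the caller.  The following is a minimal sketch of how the three-token format described above could be parsed.  `ExportRule` and `parse_tbl` are hypothetical names, not part of mspasspy, and the input list is copied from the output above so the example runs without a parameter file.

```python
from typing import List, NamedTuple


class ExportRule(NamedTuple):
    db_name: str      # token 1: Antelope database attribute name
    mspass_name: str  # token 2: name assigned in the export file
    type_name: str    # token 3: type keyword (real, int, string)


def parse_tbl(lines: List[str]) -> List[ExportRule]:
    """Split each Tbl line on whitespace into its three tokens."""
    rules = []
    for line in lines:
        tokens = line.split()
        if len(tokens) != 3:
            raise ValueError("expected 3 tokens, got: " + repr(line))
        rules.append(ExportRule(*tokens))
    return rules


# Same list printed by the get_tbl call above
tbllist = ['origin.mb mb real', 'origin.ms ms real',
           'arrival.iphase iphase string', 'assoc.phase phase string',
           'orid orid int']
for rule in parse_tbl(tbllist):
    print(rule.db_name, "->", rule.mspass_name, "(" + rule.type_name + ")")
```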