# Earthscope Course 2024 - Session 2
## *Gary L. Pavlis* and *Ian Wang*

## Overview
The learning objectives of this tutorial are:
1. Gain basic skills in using the MongoDB Database to manage seismic data.
2. Gain a more complete understanding of content of the different seismic data types used in MsPASS and how they are handled with the MongoDB database inside MsPASS.
3. Understand what we mean by "atomic data" versus "ensembles".   

In session 1 we utilized MongoDB but gave you examples that are more-or-less like an incantation for a magic trick.   Through this tutorial our aim is to help you see through the magician's trick to know how it is done so you can use MongoDB effectively in your work.  The integrated database of MsPASS is one of the most important reasons MsPASS is the best solution today for handling the enormous volume of seismology data now available.  

Numerous pedagogic materials exist online for learning MongoDB, but this notebook focuses on key features the authors have found useful in seismology research.  It is best used in conjunction with two other sources:
1.  The section of the User's Manual titled "[Using MongoDB with MsPASS](http://www.mspass.org/user_manual/mongodb_and_mspass.html)".
2.  As with most modern IT topics a web search for details of some topics addressed in this tutorial may be helpful if the MsPASS User's Manual proves inadequate.

Embedded within the tutorial on MongoDB it is natural to teach the concepts for objectives 2 and 3.   

## MongoDB Core Concepts
### Client-server model
MongoDB is a client-server system.  That bit of jargon 
has some important implications:

1.  Database commands issued from python are never executed directly by the python interpreter.  Instead, instructions are sent to the MongoDB server.   In MsPASS the server is launched inside a container.   Unless you are running this notebook on a cluster with multiple nodes, you can verify the server is running by launching a terminal in the jupyterlab interface and running the command `ps -A`.  You should get output similar to the following, which shows the server as the CMD with the name `mongod`:
```
root@b0d79c4cc440:/home/scoped# ps -A
  PID TTY          TIME CMD
    1 ?        00:00:00 tini
    8 ?        00:00:00 start-mspass.sh
   15 ?        00:07:27 dask-scheduler
   21 ?        00:06:47 dask-worker
   22 ?        00:01:44 mongod
   23 ?        00:00:44 jupyter-lab
   34 ?        00:00:00 python3.10
   37 ?        00:10:43 python3.10
  154 ?        00:00:20 python
  364 ?        00:00:01 python
 1010 pts/0    00:00:00 bash
 1036 pts/0    00:00:00 ps
```
2.  All database IO passes through a network data connection on network "port number 27017".   That is important to know as a fundamental issue because a network communication channel is not the fastest data pipe on most computers. It can also create a need to work around a firewall on some systems.  
3.  To communicate with MongoDB, your program must create a connection to the "server".  In the jargon of modern computing you have to create a "client" that will act as your agent to talk to the arrogant MongoDB "server" (the mongod program running in the background).  
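
As a quick illustration of points 2 and 3, here is a minimal sketch that uses only the python standard library to test whether anything is accepting connections on the MongoDB port.  The function name `port_is_open` and the host name "localhost" are my assumptions for illustration; they are not part of MsPASS or pymongo, and a successful connection only shows that some server is listening, not that it is MongoDB.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 27017 is the MongoDB port noted above; "localhost" assumes the server
# runs on the same machine (or container) as this notebook
print("MongoDB port reachable:", port_is_open("localhost", 27017))
```

If this prints `False` on a system where `ps -A` shows `mongod` running, a firewall or a nonstandard port is a likely suspect.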

The "client-server model" is ubiquitous in the modern computing environment.   To show that, here are three clients we used in one way or another in session 1:  

In [1]:
from obspy.clients.fdsn import Client as ObspyClient
from mspasspy.client import Client
obspy_client = ObspyClient()
mspass_client = Client(database_name='Earthscope2024')
dbclient = mspass_client.get_database_client()
print("The type of dbclient is ", type(dbclient))

The type of dbclient is  <class 'mspasspy.db.client.DBClient'>


Noting:
1. *obspy_client* is an instance of obspy's FDSN web service client we used to interact with FDSN web services.
2. *mspass_client* is a top-level "client" used in MsPASS.   It is more-or-less a container we use to interact with the two main services that are central to MsPASS:  (a) MongoDB and (b) the "client" used to interact with the parallel "scheduler" (dask or spark) that we will learn about in session 3.
3. *dbclient* is an instance of the "client" mentioned above for interacting with MongoDB.  Notice we fetch it from the *mspass_client* object with its `get_database_client` method.  

A geeky detail worth noting here, which is illustrated by the print statement in the last line, is that the symbol *dbclient* is an instance of a class called `DBClient`.  `DBClient` is a "subclass" of `pymongo.MongoClient`.  I point that out because most online introductions to MongoDB create an instance of [MongoClient](https://pymongo.readthedocs.io/en/stable/api/pymongo/mongo_client.html) instead of the MsPASS extension [DBClient](http://www.mspass.org/python_api/mspasspy.db.html#module-mspasspy.db.client).  An important "extension" DBClient adds is illustrated by the next code box:

In [2]:
db = dbclient.get_database("Earthscope2024")

This incantation runs the `get_database` "method" of the class called `DBClient`.   It returns what we call a "database handle" in the User's Manual.   The MsPASS "database handle" is a python class that is itself a subclass  of another pymongo class.  Both have the name `Database`, but the [MsPASS version](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database) adds a number of extensions for handling of seismic data.   The main ones of interest are readers and writers for seismic data objects, station metadata, and source metadata.  A key point is almost all MsPASS workflows begin with a variation of the combination of the two python code boxes above.   In particular, this is a copy of what we ran in code box 2 of Session 1.   A version of this changing only the database name, which in this case is "Earthscope2024", should appear at the top of almost any MsPASS python script/notebook.   (This one is disabled as we already created *db* so we don't need it here.)

When you call the `get_database` method as shown above the "handle" is created/constructed and can be accessed for the rest of your python workflow with the symbol you put on the left hand side of the expression (`db` in this example).  That name, of course, can be anything you want it to be.  For all examples in the MsPASS documentation we used `db` as a standard symbol to reduce confusion, but that should be viewed as simply a notation convention, not a rule.

### Documents and Collections
In the lecture part of this session we will discuss the MongoDB jargon terms `document` and `collection` at length.   We will not repeat that material here, but note from here on I assume you know what those two terms mean.   If you don't know what these terms mean consult the [Using MongoDB with MsPASS](http://www.mspass.org/user_manual/mongodb_and_mspass.html) section of the User's Manual or a bewildering array of internet sources before proceeding.

## CRUD
A near universal mnemonic found in books on database theory and online tutorials is the acronym CRUD.  CRUD is short for Create-Read-Update-Delete.  It is used as a mnemonic to remember the four primary functions any operational database must be capable of doing.   In this class we will focus exclusively on the C (Create==writers) and R (Read==readers).   Reading and writing are pretty much essential for all MsPASS workflows.  Updates and deletes, in contrast, are rarely needed and, in fact, are usually ill advised and best done not within a larger workflow but as a sidebar to fix some problem.  A notable exception is the [normalize_mseed]() function we used in session 1 to add cross-reference ids between `wf_miniseed` and `channel` documents.  That is a pure update function, but it does the work more efficiently due to some tricks done under the hood.   In any case, to reduce information overload this class will focus on read and write operations.  Once you understand the syntax for the C and R functions of MongoDB, the forms for the U and D functions are completely logical.   You can also consult the sections in the [User Manual on CRUD operations](http://www.mspass.org/user_manual/CRUD_operations.html) and the section titled ["Using MongoDB with MsPASS"](http://www.mspass.org/user_manual/mongodb_and_mspass.html).

### Create
#### CommandCursor concept
The first letter in the CRUD acronym is "Create".  For all applications some form of "create" is an essential first step to put some kind of data into your database.  We already did that in the first session of this class.  There we populated several collections.   We can see what they are with the `list_collections` method of the `Database` class.  This shows the usage:

In [3]:
cursor=db.list_collections()
for doc in cursor:
    print(doc['name'])

wf_TimeSeries
channel
abortions
wf_Seismogram
elog
wf_miniseed
source
cemetery
history_object
fs.chunks
site
fs.files


As you can see there are a lot more there than the "wf" collections and the "channel", "site", and "source" collections that were explicitly discussed in the last class.   We will examine some of them in more detail later, but for now focus on the block of python code that created that output.   The `list_collections` method returned a special data type used in pymongo.  To see what that is consider:

In [4]:
type(cursor)

pymongo.command_cursor.CommandCursor

Every database system I know of implements some version of the concept encapsulated by the pymongo class called a [CommandCursor](https://pymongo.readthedocs.io/en/stable/api/pymongo/command_cursor.html).   A "cursor" is a standard return from any query-like operation in any database system.  A MongoDB `CommandCursor` is technically a __[forward iterator](https://www.boost.org/sgi/stl/ForwardIterator.html)__.   That means it acts like a list that can only be traversed "forward" with a construct like that above.   It is not at all the same thing, however, as a python list.   It is a handle that interacts with the database to sequentially return documents.  The above example is so small that none of that complexity matters.  Where a cursor is fundamentally different is when the number of elements exceeds the memory buffer size of the client that handles io with the MongoDB server.  Then the client-server pair manage the grungy work of trying to keep the memory buffer full and assuring the client does not have to wait for data to arrive.  The important consequence is that when reading a very large amount of data (e.g. processing millions of TimeSeries objects driven by wf_miniseed records) sequential reads with a cursor rarely have to wait for data.  In addition, we will see examples below where the understanding that a [CommandCursor](https://pymongo.readthedocs.io/en/stable/api/pymongo/command_cursor.html) mostly acts like a python list is fundamental to many database driven algorithms.   
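
To make the "forward iterator" idea concrete, here is a toy sketch in pure python.  The generator `toy_cursor` and its `batch_size` argument are hypothetical names invented for this illustration; this is not how pymongo implements a cursor, but it mimics the two behaviors described above:  documents arrive in batches behind the scenes, and the sequence can only be traversed once.

```python
def toy_cursor(documents, batch_size=2):
    """Yield documents one at a time while fetching them in batches."""
    for start in range(0, len(documents), batch_size):
        # this slice stands in for one round trip to the database server
        batch = documents[start:start + batch_size]
        for doc in batch:
            yield doc

docs = [{"name": "site"}, {"name": "channel"}, {"name": "source"}]
cursor = toy_cursor(docs)
first_pass = [doc["name"] for doc in cursor]
second_pass = [doc["name"] for doc in cursor]   # the iterator is already exhausted
print(first_pass)    # ['site', 'channel', 'source']
print(second_pass)   # []
```

A python list, in contrast, could be traversed as many times as you like, which is one practical way the two differ.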

#### MsPASS writers
If you read books and online tutorials on MongoDB you will find that the standard "Create" functionality in pymongo is defined by two "collection-level" methods called `insert_one` and `insert_many`.  An introductory tutorial on both can be found [here](https://www.w3schools.com/python/python_mongodb_insert.asp).   Both have their uses, but are rarely used directly in seismic processing with MsPASS.   The fundamental reason is that both `insert_one` and `insert_many` are low-level operators that only work on "documents" (i.e. python dictionaries).  The data we aim to manage with MongoDB in MsPASS are more complex data objects that cannot always be reduced to "documents" or need to be converted to that form.   For that reason in MsPASS we have implemented the following high level writers that do most "Create" operations.  Examples of all of these can be found by reviewing the notebook from session 1 of this class:
1. [index_mseed_file](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database.index_mseed_file) is used to scan a file and create one or more wf_miniseed documents that define an index for waveform segments stored in the file processed.   
2. [save_data](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database.save_data) is used to save all seismic data objects (i.e. `TimeSeries`, `Seismogram`, `TimeSeriesEnsemble`, or `SeismogramEnsemble`).   
3. [save_inventory](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database.read_inventory) is used to save station metadata downloaded via web services with obspy and bundled into the obspy [Inventory](https://docs.obspy.org/packages/autogen/obspy.core.inventory.inventory.Inventory.html) object that is more-or-less an image of the [StationXML](https://docs.fdsn.org/projects/stationxml/en/latest/overview.html) data retrieved.  
4. [save_catalog](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database.save_catalog) is used to save earthquake source data downloaded via web services with obspy and bundled into obspy's [Catalog object]().  Like `Inventory` the `Catalog` object is more-or-less an image of the new [QuakeML format](https://quake.ethz.ch/quakeml) for distributing earthquake source parameters. 
5. [write_distributed_data](http://www.mspass.org/python_api/mspasspy.io.html#mspasspy.io.distributed.write_distributed_data)  is the parallel writer for seismic data objects.  We will discuss this function in more detail in session 3 of the class.

All the MsPASS writers use `insert_one` and/or `insert_many` operations to save data to MongoDB.   For source and receiver data that process is more straightforward.  Each relevant source or channel/station record in an input QuakeML or StationXML file image is saved as one document in a MongoDB collection (nominally "source" and "channel" respectively).  Saving seismic data objects is much more complex for three reasons:
1.   We recognized that large data storage is a rapidly evolving technology today and we aimed to abstract the process to support multiple versions of what we call "storage mode".  The rest of this session will give an overview of that concept.
2.   As we will show in the presentation done in parallel with this notebook, the seismic data objects in MsPASS are conceptually defined as composed of four different containers:   (a) the Metadata container that translates directly into a MongoDB document, (b) the sample data, (c) an [error log](http://www.mspass.org/python_api/mspasspy.ccore.html#mspasspy.ccore.utility.ErrorLogger) container, and (d) a [ProcessingHistory](http://www.mspass.org/python_api/mspasspy.ccore.html#mspasspy.ccore.utility.ProcessingHistory) container used to (optionally) store what we call "object-level history" (see [this section](http://www.mspass.org/user_manual/processing_history_concepts.html) of the User Manual).
3.   We aimed to create a single function to save atomic data and ensemble data.  Ensembles are mostly groups of atomic data, but present some special challenges the `save_data` method needed to handle.   Using the common function to automatically handle those details simplifies usage.      

#### Storage Mode Options
The [save_data](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database.save_data) method of `Database` has an argument called "storage_mode".   It determines how the sample data are saved.   The options are string keywords that must be one of the following:
1.  "gridfs" (the default) saves the sample data internally in the MongoDB data area.  Because in session 1 we did all our saves without specifying the storage option, all the processed data created there have their `Metadata` content stored in the collection (see list above) called "wf_TimeSeries" or "wf_Seismogram".   The sample data are stored in file-like objects managed by MongoDB, with the documents needed to define them in the two collections called "fs.files" and "fs.chunks".  Gridfs is convenient storage because it is easier to manage as an integrated and bombproof storage area managed completely by MongoDB.   The dark side is that we know from experience writing data to gridfs can cause an io bottleneck, since the more voluminous sample data have to pass through the same io channel as the Metadata write operations (the actual calls to `insert_one` on "wf_TimeSeries" or "wf_Seismogram").
2.  "file", as the name suggests, writes data to conventional computer files.   When using the option `storage_mode="file"` in a call to `save_data` the writer has to know what file it should open and use to save the requested data.  The best way to do that is to set an explicit value for `dir` and `dfile` in the line where you call `save_data`.  If the arguments are not defined, `save_data` attempts to extract values from each atomic datum's Metadata container using the same keywords (i.e. it attempts to retrieve two string values with the keys "dir" and "dfile").  If that fails, it falls back to the last resort:  "dir" will be set to the run directory and "dfile" will be defined by a unique, random string.  In all cases the file name is then generated by using the stock python `join` function of `os.path`.   That is, the file name is generated as `fname=os.path.join(dir,dfile)`.  `save_data` then attempts to open the file.  If successful it seeks to the end of the file, posts the byte offset as the "foff" attribute, and writes the sample data.   The default is a raw binary dump, but as described in the User Manual and docstring for [save_data](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database.save_data) the output can be written in any format supported by obspy's writer (subject to major issues of rigid namespace requirements for many formats).  All should recognize that using files requires some thought beforehand about how the files should be named and organized.   The model used is heavily project dependent and outside the scope of this course.
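
The "seek to the end, post foff, then write" bookkeeping described for file storage can be sketched with nothing but the python standard library.  This is a conceptual illustration only, not the MsPASS implementation:  the functions `append_samples` and `read_samples`, the file name "demo.dat", and the choice of 64-bit floats written with the `array` module are all assumptions made for this example.

```python
import os
import tempfile
from array import array

def append_samples(directory, dfile, samples):
    """Append samples as raw 64-bit floats and return (foff, nbytes)."""
    fname = os.path.join(directory, dfile)
    data = array("d", samples)
    with open(fname, "ab") as fh:
        fh.seek(0, os.SEEK_END)   # position at the end of the file
        foff = fh.tell()          # byte offset where this segment starts
        data.tofile(fh)
    return foff, len(data) * data.itemsize

def read_samples(directory, dfile, foff, npts):
    """Read npts 64-bit floats starting at byte offset foff."""
    data = array("d")
    with open(os.path.join(directory, dfile), "rb") as fh:
        fh.seek(foff)
        data.fromfile(fh, npts)
    return list(data)

with tempfile.TemporaryDirectory() as tmpdir:
    foff1, nbytes1 = append_samples(tmpdir, "demo.dat", [1.0, 2.0, 3.0])
    foff2, nbytes2 = append_samples(tmpdir, "demo.dat", [4.0, 5.0])
    second = read_samples(tmpdir, "demo.dat", foff2, 2)
print("first segment:  foff =", foff1, "nbytes =", nbytes1)
print("second segment: foff =", foff2, "nbytes =", nbytes2)
print("samples read back from second segment:", second)
```

The point to take away is that "dir", "dfile", "foff", and the number of samples are enough to find any segment again, which is why attributes with those names appear in the wf documents shown later in this notebook.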

There is one more detail about writing data that is an important performance issue.  When working with atomic data MsPASS will always open a file, write data, and then close the file.  All three operations take nontrivial amounts of time to complete.  The excessive open-close commands are intrinsic bottlenecks on even a desktop system.   The fastest write model is possible with ensembles written in the default binary mode.   In that mode, a file is opened only once for each ensemble and the data are dumped sequentially to the same file using the low-level C fwrite function.
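
The two write patterns just described can be mimicked with plain python file io.  The sketch below is only an analogy under stated assumptions:  `write_atomic` and `write_ensemble` are hypothetical names, the segment sizes are arbitrary, and nothing here calls MsPASS.  Both patterns produce the same bytes; only the number of open/close operations differs, and the timing you see will depend heavily on your file system.

```python
import os
import tempfile
import time
from array import array

def write_atomic(fname, segments):
    """Open, append, and close the file once per segment."""
    for seg in segments:
        with open(fname, "ab") as fh:
            array("d", seg).tofile(fh)

def write_ensemble(fname, segments):
    """Open the file once and append all segments sequentially."""
    with open(fname, "ab") as fh:
        for seg in segments:
            array("d", seg).tofile(fh)

# 200 fake "waveforms" of 1000 samples each (arbitrary sizes)
segments = [[float(i)] * 1000 for i in range(200)]
results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for label, writer in (("atomic", write_atomic), ("ensemble", write_ensemble)):
        fname = os.path.join(tmpdir, label + ".dat")
        t0 = time.perf_counter()
        writer(fname, segments)
        elapsed = time.perf_counter() - t0
        results[label] = (elapsed, os.path.getsize(fname))
        print(label, "elapsed seconds:", elapsed, "bytes written:", results[label][1])
```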

The small python code below illustrates the different modes of writing data and times their relative performance.   The output at the end demonstrates the difference in performance for the different approaches:

In [5]:
import time
import os
# first we make sure to create a file directory we will call "wf3c" to separate it from where we have the miniseed data stored
current_directory = os.getcwd()
dir = os.path.join(current_directory, 'wf3c')
# dir='/home/wf3c'   # note this assumes running in the docker container with /home bound to 
if not os.path.exists(dir):
    os.mkdir(dir)
reading_time=0.0
atomic_gridfs_write_time=0.0
atomic_file_write_time=0.0
ensemble_write_time=0.0
# drive the processing by source_id - builds on output from session 1
srcids = db.wf_Seismogram.distinct('source_id')[:2] # we only use 2 events here to reduce output size
Nens = len(srcids)   # right because we know there are waveforms for each source document
for sid in srcids:
    # we will learn more about this query structure shortly
    query = {'source_id' : sid, 'data_tag' : 'serial_preprocessed'}
    t0 = time.time()
    cursor = db.wf_Seismogram.find(query)
    ensemble=db.read_data(cursor,collection='wf_Seismogram')
    t = time.time()
    reading_time += (t-t0)
    # time atomic writes with gridfs default (done by the writer)
    t0 = time.time()
    # note use of data_tag to allow us to later ignore these data
    db.save_data(ensemble,collection='wf_Seismogram',data_tag='gridfs_write_test')
    t = time.time()
    atomic_gridfs_write_time += (t-t0)
    # now write to files as atomic writes - loop over members
    t0 = time.time()
    dfile = str(sid) + ".dat"
    for d in ensemble.member:
        db.save_data(d,collection='wf_Seismogram',storage_mode='file',dir=dir,dfile=dfile,data_tag='atomic_file_write_test')
    t = time.time()
    atomic_file_write_time += (t-t0)
    # finally write the files with ensemble writer - use the same file names but they get appended
    t0 = time.time()
    dfile = str(sid) + ".dat"
    db.save_data(ensemble,collection='wf_Seismogram',storage_mode='file',dir=dir,dfile=dfile,data_tag='ensemble_file_write_test')
    t = time.time()
    ensemble_write_time += (t-t0)
print("Number of ensembles processed=",Nens)
print("Total time spent reading=",reading_time)
print("Total time writing with gridfs=",atomic_gridfs_write_time)
print("Total time with atomic writes to files=",atomic_file_write_time)
print("Total time with ensemble writes=",ensemble_write_time) 

Number of ensembles processed= 20
Total time spent reading= 85.33140468597412
Total time writing with gridfs= 128.1169364452362
Total time with atomic writes to files= 69.78627610206604
Total time with ensemble writes= 65.72909188270569


You will get different results on different hardware and operating systems.   You should definitely see that gridfs is always significantly slower.   Whether the atomic write is faster or slower than the ensemble write is more variable.   The reasons are deep in the weeds of the implementation and are not important for this class.   The big lesson is to use files for performance, but do be careful about how you organize the files.   

## In class session 1
Answer questions with this tag in Homework2.ipynb.

### Read
The R of CRUD is "Read" and is more-or-less the inverse of "create".   The keyword used for pulling "documents" from a MongoDB database, however, is `find`.  There are two basic methods in the core MongoDB API for fetching documents:  `find_one` and `find`.  They behave completely differently.

#### find_one

Let's begin with a simple application of `find_one`.  As the name implies it always returns one and only one document.  Here is a default application to the "site" collection that was created under the hood when we ran `save_inventory` in session 1:

In [6]:
doc = db.site.find_one()
print("The type of a document = ",type(doc))
print("This is the content of that document")
print(doc)

The type of a document =  <class 'dict'>
This is the content of that document
{'_id': ObjectId('666d5452f63ce193f69019c4'), 'loc': '', 'net': 'TA', 'sta': '034A', 'lat': 27.064699, 'lon': -98.683296, 'coords': [-98.683296, 27.064699], 'location': {'type': 'Point', 'coordinates': [-98.683296, 27.064699]}, 'elev': 0.155, 'edepth': 0.0, 'starttime': 1262908800.0, 'endtime': 1321574399.0, 'site_id': ObjectId('666d5452f63ce193f69019c4')}


As the output demonstrates a `find_one` returns data in a python dictionary.   You might also note the raw `print(doc)` output is a bit challenging to read.   For the rest of this tutorial we will use a simple little function defined below called `pretty_print` to make the output a bit easier to read.  

In [7]:
from bson import json_util
def pretty_print(doc,indent=2):
    print(json_util.dumps(doc,indent=indent))
doc=db['site'].find_one()
pretty_print(doc)

{
  "_id": {
    "$oid": "666d5452f63ce193f69019c4"
  },
  "loc": "",
  "net": "TA",
  "sta": "034A",
  "lat": 27.064699,
  "lon": -98.683296,
  "coords": [
    -98.683296,
    27.064699
  ],
  "location": {
    "type": "Point",
    "coordinates": [
      -98.683296,
      27.064699
    ]
  },
  "elev": 0.155,
  "edepth": 0.0,
  "starttime": 1262908800.0,
  "endtime": 1321574399.0,
  "site_id": {
    "$oid": "666d5452f63ce193f69019c4"
  }
}


Things of note in that box are:
1.  The `pretty_print` function definition is a bit trivial, which is why it isn't a standard MsPASS function.   It uses the `json_util.dumps` function to create the curly bracket formatted print that is a lot easier to understand than the raw dump of the python dictionary.   It shows more clearly that a document is always made up of one or more key-value pairs.
2.  This example intentionally uses a variant of the syntax for interacting with the database handle.   Note in the first box I used `db.site` while in the second I used `db['site']`.   A powerful but confusing, in my opinion, feature of python is its capability to create that type of syntactic alternative incantation.   Technically, what it does is specify a "collection", which in this case is named "site".  In the jargon of MongoDB the `find` and `find_one` methods, which are the core MongoDB "read" methods, are "collection operations".   You should realize that `db` is the top-level symbol that refers to the "whole" database, which is assumed to contain one or more collections.  The two incantations used above are alternative ways to get a handle to a specific "collection".  You will see both in books and online sources.
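
For the curious, the python mechanism behind those two incantations can be sketched with a toy class.  `ToyDatabase` and `ToyCollection` are hypothetical names for this illustration and are not the pymongo classes; the sketch only shows how the special methods `__getattr__` and `__getitem__` let one object support both the dot and the bracket syntax.

```python
class ToyCollection:
    """Toy stand-in for a collection handle."""
    def __init__(self, name):
        self.name = name

class ToyDatabase:
    """Toy stand-in for a database handle; not the pymongo implementation."""
    def __getitem__(self, name):
        # toy_db['site'] lands here
        return ToyCollection(name)

    def __getattr__(self, name):
        # toy_db.site lands here because 'site' is not a regular attribute
        if name.startswith("_"):
            raise AttributeError(name)
        return self[name]

toy_db = ToyDatabase()
print(toy_db.site.name)      # site
print(toy_db["site"].name)   # site
```

One practical consequence is that the bracket form is the one to use when the collection name is held in a string variable.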

#### find
We are now ready to dig more deeply into the standard MongoDB query function called `find`.  We've used it many times already above and in session 1.  We now dig deeper into `find` as it is the standard query interface for MongoDB.  Like `find_one` it is a "collection" method so it normally occurs in a construct like the following:

In [8]:
cursor = db.wf_miniseed.find({})   # as we will see {} means all 
print("The type of the object returned by find is ",type(cursor))

The type of the object returned by find is  <class 'pymongo.cursor.Cursor'>


Note that this is not exactly the same type that was returned earlier when we ran the command `cursor=db.list_collections()`.   That call returned a `CommandCursor`, while `find` returns the closely related pymongo class `Cursor`.   We described the generic concept of a cursor there, and it applies equally here:  the cursor we just created should be thought of as a way to sequentially return the documents the query would return.  The empty dictionary in our example, as the comment states, is MongoDB's way of asking for "all".   We will see why in a moment when we get deeper into MongoDB's query language.  

First, let's verify `find` does what we just described.   We don't want to traverse the previous cursor as is because it would blast over 26,000 documents at us.  Here is a trick to limit that:

In [9]:
cursor.limit(2)
for doc in cursor:
    pretty_print(doc)

{
  "_id": {
    "$oid": "666d53c640282871ba98e64a"
  },
  "sta": "034A",
  "net": "TA",
  "chan": "BHE",
  "sampling_rate": 40.0,
  "delta": 0.025,
  "starttime": 1299639019.000001,
  "last_packet_time": 1299641348.450001,
  "foff": 0,
  "nbytes": 86016,
  "npts": 96000,
  "endtime": 1299641418.9750009,
  "storage_mode": "file",
  "format": "mseed",
  "dir": "/home/wf",
  "dfile": "Event_16.msd",
  "time_standard": "UTC",
  "channel_endtime": 1321549500.0,
  "channel_hang": 89.1,
  "channel_id": {
    "$oid": "666d5452f63ce193f69019c4"
  },
  "channel_starttime": 1262908800.0,
  "channel_vang": 90.0,
  "site_endtime": 1321574399.0,
  "site_id": {
    "$oid": "666d5452f63ce193f69019c4"
  },
  "site_starttime": 1262908800.0
}
{
  "_id": {
    "$oid": "666d53c740282871ba98e64b"
  },
  "sta": "034A",
  "net": "TA",
  "chan": "BHN",
  "sampling_rate": 40.0,
  "delta": 0.025,
  "starttime": 1299639019.000001,
  "last_packet_time": 1299641341.1750011,
  "foff": 86016,
  "nbytes": 86016,
  "n

The point here is that we used the `limit` method of the `Cursor` class to retrieve only the first 2 documents instead of all 26,000+ in wf_miniseed.  

To emphasize the point that a `Cursor`, like a `CommandCursor`, is a "forward iterator" consider this:

In [10]:
for doc in cursor:
    pretty_print(doc)

That produced no output because the cursor had already been traversed.  (Note in some contexts that kind of construct would also throw a python exception.)  The solution is an application of the "rewind" method:


In [11]:
cursor.rewind()
cursor.limit(2)
for doc in cursor:
    pretty_print(doc)

{
  "_id": {
    "$oid": "666d53c640282871ba98e64a"
  },
  "sta": "034A",
  "net": "TA",
  "chan": "BHE",
  "sampling_rate": 40.0,
  "delta": 0.025,
  "starttime": 1299639019.000001,
  "last_packet_time": 1299641348.450001,
  "foff": 0,
  "nbytes": 86016,
  "npts": 96000,
  "endtime": 1299641418.9750009,
  "storage_mode": "file",
  "format": "mseed",
  "dir": "/home/wf",
  "dfile": "Event_16.msd",
  "time_standard": "UTC",
  "channel_endtime": 1321549500.0,
  "channel_hang": 89.1,
  "channel_id": {
    "$oid": "666d5452f63ce193f69019c4"
  },
  "channel_starttime": 1262908800.0,
  "channel_vang": 90.0,
  "site_endtime": 1321574399.0,
  "site_id": {
    "$oid": "666d5452f63ce193f69019c4"
  },
  "site_starttime": 1262908800.0
}
{
  "_id": {
    "$oid": "666d53c740282871ba98e64b"
  },
  "sta": "034A",
  "net": "TA",
  "chan": "BHN",
  "sampling_rate": 40.0,
  "delta": 0.025,
  "starttime": 1299639019.000001,
  "last_packet_time": 1299641341.1750011,
  "foff": 86016,
  "nbytes": 86016,
  "n

### Mongo Query Language (MQL)
#### Single key match and basics
I will run a set of examples of increasing levels of complexity.   This particular section of this tutorial is intended as a hands-on supplement to the lecture in this class and the material in the User Manual section titled ["Using MongoDB with MsPASS"](http://www.mspass.org/user_manual/mongodb_and_mspass.html).  A point made there worth repeating is that we have found no book or online source that describes the syntax rules of the MQL language.   The following list, quoted from the User's Manual, is our take on the rules defining MQL:

1.  All queries use a python dictionary to contain the instructions.
2.  The key of a dictionary used for query normally refers to an attribute
    in documents of the collection being queried.  There is an exception
    for the logical OR and logical AND operators (discussed below).
3.  The "value" of each key-value pair is normally itself a python
    dictionary.   The contents of the dictionary define a simple
    language (Mongo Query Language) that resolves True for a match
    and False if there is no match.  The key point is the overall
    expression the query dictionary has to resolve to a boolean condition.
4.  The keys of the dict containers that are on the value side of
    a query dict are normally operators.  Operators are defined with
    strings that begin with the "$" symbol.
5.  Simple queries are a single key-value pair with the value either
    a constant or a dictionary with a single operator key.  e.g.
    to a test for the "sta" attribute being the constant "AAK" the
    query could be either `{"sta" : "AAK"}` or `{"sta" : {"$eq" : "AAK"}}`.
    The form with constant value only works for "$eq".
6.  Compound queries (e.g. time interval expressions) have a value
    with multiple operator keys.
7.  There is an implied logical AND operation
    between multiple key operations.  An OR must be specified differently
    (see below).
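
To make these rules less abstract, here is a toy evaluator written in pure python that applies a query dictionary to ordinary dictionaries.  It is emphatically not MongoDB's implementation:  the function `matches` is a hypothetical helper, it supports only a handful of operators, it assumes any dictionary on the value side contains only operator keys, and it treats a missing key as "no match" for every operator.  The two sample documents reuse attribute values from the "site" documents printed in this notebook.

```python
import operator

# A small subset of MQL comparison operators mapped to python functions
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def matches(doc, query):
    """Return True if doc satisfies query (toy subset of MQL)."""
    for key, condition in query.items():      # rule 7: implied AND between keys
        if key == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif key == "$and":
            if not all(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):     # rules 3, 4, 6: operator dictionary
            if key not in doc:
                return False
            for op, operand in condition.items():
                if not OPERATORS[op](doc[key], operand):
                    return False
        elif doc.get(key) != condition:       # rule 5: constant means "$eq"
            return False
    return True

site_docs = [
    {"net": "TA", "sta": "034A", "lat": 27.064699},
    {"net": "TA", "sta": "134A", "lat": 32.572899},
]
queries = [
    {"sta": "134A"},                               # simple equality
    {"sta": {"$eq": "134A"}},                      # same thing with an operator
    {"lat": {"$gt": 30.0, "$lt": 35.0}},           # compound range query
    {"net": "TA", "lat": {"$lt": 30.0}},           # implied AND of two keys
    {"$or": [{"sta": "034A"}, {"sta": "134A"}]},   # explicit OR
]
for query in queries:
    print(query, "->", [d["sta"] for d in site_docs if matches(d, query)])
```

Every one of the query dictionaries in that list uses only syntax described in the rules quoted above, so the same dictionaries could be tried with `find` or `count_documents` on the real "site" collection.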

With that knowledge of the MQL syntax rules, the rest of this section demonstrates those ideas with examples.  For this exercise we are going to focus on the "site" collection as it has fewer complexities and has relatively small documents compared to anything else we could work with here. 

First, a unique match query:

In [12]:
query={'sta' : '134A'}
nsite=db.site.count_documents(query)
print("Number of site documents for station 134A=",nsite)
nchannel=db.channel.count_documents(query)
print("Number of channel documents for station 134A=",nchannel)
cursor=db.site.find(query)
for doc in cursor:
    pretty_print(doc)

Number of site documents for station 134A= 1
Number of channel documents for station 134A= 3
{
  "_id": {
    "$oid": "666d5452f63ce193f69019f4"
  },
  "loc": "",
  "net": "TA",
  "sta": "134A",
  "lat": 32.572899,
  "lon": -98.079498,
  "coords": [
    -98.079498,
    32.572899
  ],
  "location": {
    "type": "Point",
    "coordinates": [
      -98.079498,
      32.572899
    ]
  },
  "elev": 0.297,
  "edepth": 0.0,
  "starttime": 1258329600.0,
  "endtime": 1315526399.0,
  "site_id": {
    "$oid": "666d5452f63ce193f69019f4"
  }
}


Notice:
1.  I used another important collection method called `count_documents` to fetch the number of documents the query would yield.  Standard practice when working through many queries is to check that the number returned makes sense.
2.  There is one and only one document matching the query in site and three matching in channel.  The reason channel has three, of course, is that there is a three-component sensor at that station that defines the recording channels. 

### Projections
There is a lot of extra stuff in the document we retrieved.  We often want a simple "report" that only displays the subset of the content we are interested in.  SQL users will recognize this functionality as the SELECT clause of an SQL query.   The same idea in MongoDB is called, for reasons known only to the developers of MongoDB, a "projection".   
Here is an example where we extract and print only net, sta, chan, loc, hang, and vang from each of the 3 channel documents our query returns:

In [13]:
projection={'net':1,'sta':1,'chan':1,'loc':1,'vang':1,'hang':1,'_id':0}
cursor=db.channel.find(query,projection)
for doc in cursor:
    print(doc)

{'loc': '', 'net': 'TA', 'sta': '134A', 'chan': 'BHE', 'vang': 90.0, 'hang': 90.7}
{'loc': '', 'net': 'TA', 'sta': '134A', 'chan': 'BHN', 'vang': 90.0, 'hang': 0.7}
{'loc': '', 'net': 'TA', 'sta': '134A', 'chan': 'BHZ', 'vang': 0.0, 'hang': 0.0}


Note that the "projection" symbol is a python dictionary, which is to say a MongoDB document.   In it a value of 0 means False and 1 means True.   That definition says we want to retrieve everything listed with 1 and drop everything else.  The oddity of setting "_id" to 0 is necessary because by default the id is always retrieved in a find/find_one operation.  That incantation says we don't want to see it here. 
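
If you build many projections, a small helper can remove the repetition.  The function below is a hypothetical convenience (it is not part of MsPASS or PyMongo) that turns a list of keys into a projection dictionary and suppresses "_id" unless asked to keep it.

```python
def make_projection(keys, include_id=False):
    """Build a MongoDB projection dict that keeps only the listed keys."""
    proj = {key: 1 for key in keys}
    if not include_id:
        # "_id" is returned by default, so it must be switched off explicitly
        proj["_id"] = 0
    return proj


proj = make_projection(["net", "sta", "chan", "loc", "vang", "hang"])
print(proj)
```

The result can be passed as arg1 of `find` exactly like the hand-written dictionary above.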

Here is a fancier variant using pandas to print the same attributes in tabular form dropping "loc" since we see it is always empty in this case:

In [14]:
import pandas as pd
projection={
    'net':1,
    'sta':1,
    'chan':1,
    'lat':1,
    'lon':1,
    'elev':1,
    'hang':1,
    'vang':1,
    '_id':0,
}
cursor=db.channel.find(query,projection)
doclist=[]
for doc in cursor:
    doclist.append(doc)
df = pd.DataFrame.from_dict(doclist)
print(df)

  net   sta        lat        lon   elev chan  vang  hang
0  TA  134A  32.572899 -98.079498  0.297  BHE  90.0  90.7
1  TA  134A  32.572899 -98.079498  0.297  BHN  90.0   0.7
2  TA  134A  32.572899 -98.079498  0.297  BHZ   0.0   0.0


The pandas construct is useful for a number of reasons.  Therefore, let's create a function to simplify that type of printing operation.

In [15]:
import pandas as pd
def print_as_table(doclist):
    df = pd.DataFrame.from_dict(doclist)
    print(df)

#### Multiple key equality matching
Next let's do a query with multiple keys.   We will fetch the (shortened) record for the BHZ component of a different station:

In [16]:
query={
    'sta' : '131A',
    'chan' : 'BHZ',
}
cursor=db.channel.find(query,projection)
for doc in cursor:
    pretty_print(doc)

{
  "net": "TA",
  "sta": "131A",
  "lat": 32.673698,
  "lon": -100.388802,
  "elev": 0.622,
  "chan": "BHZ",
  "vang": 0.0,
  "hang": 0.0
}


#### Range operator examples (compound query)
We often want to query by a range of values.  Here is an example that returns the coordinates of all TA stations within a box spanning 30 to 35 degrees latitude and -110 to -100 degrees longitude: 

In [17]:
query={
    'lat' : {'$gte' : 30.0,'$lte' : 35.0},
    'lon' : {'$gte' : -110.0, '$lte' : -100},
}
projection={
   'net':1,
    'sta':1,
    'chan':1,
    'lat':1,
    'lon':1,
    'elev':1,
    '_id':0, 
}
cursor=db.site.find(query,projection)
doclist=[]
for doc in cursor:
    doclist.append(doc)
print_as_table(doclist)


   net   sta        lat         lon   elev
0   TA  121A  32.532398 -107.785103  1.652
1   TA  123A  32.634899 -106.262199  1.206
2   TA  124A  32.700100 -105.454399  2.078
3   TA  125A  32.658798 -104.657303  1.212
4   TA  126A  32.646198 -104.020401  1.032
..  ..   ...        ...         ...    ...
71  TA  Z27A  33.314999 -103.214500  1.197
72  TA  Z28A  33.288399 -102.386597  1.045
73  TA  Z29A  33.259499 -101.706200  0.938
74  TA  Z30A  33.286098 -101.128197  0.729
75  TA  Z31A  33.318298 -100.143501  0.547

[76 rows x 5 columns]


A variant using a regular expression to only select station names that start with the letter "Y":

In [18]:
query={
    'lat' : {'$gte' : 30.0,'$lte' : 35.0},
    'lon' : {'$gte' : -110.0, '$lte' : -100},
    'sta' : {'$regex' : '^Y'},   # ^ anchors the match to the start of the name
}
cursor=db.site.find(query,projection)
doclist=[]
for doc in cursor:
    doclist.append(doc)
print_as_table(doclist)

   net   sta        lat         lon   elev
0   TA  Y22D  34.073900 -106.921000  1.436
1   TA  Y22E  34.074200 -106.920799  1.444
2   TA  Y22E  34.074200 -106.920799  1.444
3   TA  Y23A  33.931499 -106.054901  1.789
4   TA  Y24A  33.925701 -105.436096  1.827
5   TA  Y25A  33.922901 -104.692802  1.364
6   TA  Y26A  33.923199 -103.824600  1.371
7   TA  Y27A  33.883900 -103.163300  1.253
8   TA  Y28A  33.908600 -102.247902  1.068
9   TA  Y29A  33.860199 -101.671204  0.991
10  TA  Y30A  33.876598 -100.897797  0.812
11  TA  Y31A  33.962898 -100.261497  0.530
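
A detail worth knowing is that a `$regex` pattern is not anchored unless you anchor it yourself, so a pattern such as `'Y.*'` matches any name that contains a "Y" anywhere, while `'^Y'` matches only names that begin with one.  The sketch below imitates that behavior with python's `re.search`, which is likewise unanchored.  The code "KY22" is made up for the illustration; the other codes come from tables in this notebook.

```python
import re

# Station codes from the tables above plus one made-up code ("KY22")
# that contains, but does not start with, the letter Y.
stations = ["Y22D", "Y23A", "Z27A", "121A", "KY22"]


def regex_matches(pattern, names):
    """Emulate an unanchored regex search over a list of names."""
    return [name for name in names if re.search(pattern, name)]


print(regex_matches("Y.*", stations))  # unanchored: also picks up KY22
print(regex_matches("^Y", stations))   # anchored: leading Y only
```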


#### Geospatial query
MongoDB has some very useful geospatial query capabilities.  See the ["MongoDB and MsPASS"](http://www.mspass.org/user_manual/mongodb_and_mspass.html) section of the User's Manual for more about this capability.  On the other hand, it is probably best thought of, at least at present, as an advanced feature.   The syntax is complex and, as noted in that section of the manual, MongoDB documentation is less than ideal and many online sources are inconsistent with the current implementation.  For this tutorial I will just show an example that is a variant of the one shown in that User's Manual page.

An IMPORTANT rule about using geospatial searches is that a special index is REQUIRED.  For this example the following is needed to make this work:

In [19]:
db.site.create_index({'location' : '2dsphere'})

'location_2dsphere'

Noting:
1.  'location' is the key used to tag the GeoJSON format documents `save_inventory` created in the site collection.  It is a constant tag in the MsPASS schema for these data.  Note also that if you were running this on the source collection the key has a different name ('epicenter') since the content exactly matches the definition of the jargon term. 
2. '2dsphere' is a magic string that tells MongoDB to create a special index that uses spherical geometry for spatial calculations.  The alternative is '2d', but it is not advised for most if not all seismology applications.  The '2d' index treats coordinates as points on a flat plane, which produces distorted answers unless the area of study is small.  Examples you can find online use a '2d' index for applications like apps that have data only for a single city.
3. An advanced topic, which is a side issue for this discussion of geospatial queries, is that any key that is used frequently in find operations on large collections should have an index created.  An index is a lookup structure that allows the MongoDB server to find documents without doing a linear search through the entire collection.   We have found multiple orders of magnitude differences in performance with million-scale collections.  

Now that we have an index, we can do a search.  This search produces a similar result to the lat-lon range query above but for a circular region (a circle in great-circle distance, that is) with a radius of 300 km centered on the same lat-lon box as above.  

In [20]:
query = {"location":{
        '$nearSphere': {
            '$geometry' : {
                'type' : 'Point',
                'coordinates' : [-105.0,32.5]
            },
            '$maxDistance' : 300000.0,
        }
      }
    }
# count_documents does not accept the $nearSphere operator
# (PyMongo documents this restriction), so the call below
# is commented out.  Remove the comment character to see
# the error it throws.
#n=db.site.count_documents(query)
cursor=db.site.find(query)
for doc in cursor:
    pretty_print(doc)

{
  "_id": {
    "$oid": "666d5452f63ce193f69019dc"
  },
  "loc": "",
  "net": "TA",
  "sta": "125A",
  "lat": 32.658798,
  "lon": -104.657303,
  "coords": [
    -104.657303,
    32.658798
  ],
  "location": {
    "type": "Point",
    "coordinates": [
      -104.657303,
      32.658798
    ]
  },
  "elev": 1.212,
  "edepth": 0.0,
  "starttime": 1205366400.0,
  "endtime": 1266537599.0,
  "site_id": {
    "$oid": "666d5452f63ce193f69019dc"
  }
}
{
  "_id": {
    "$oid": "666d5452f63ce193f6901a36"
  },
  "loc": "",
  "net": "TA",
  "sta": "225A",
  "lat": 32.1101,
  "lon": -104.822899,
  "coords": [
    -104.822899,
    32.1101
  ],
  "location": {
    "type": "Point",
    "coordinates": [
      -104.822899,
      32.1101
    ]
  },
  "elev": 1.703,
  "edepth": 0.0,
  "starttime": 1206489600.0,
  "endtime": 1266623999.0,
  "site_id": {
    "$oid": "666d5452f63ce193f6901a36"
  }
}
{
  "_id": {
    "$oid": "666d5452f63ce193f69019d9"
  },
  "loc": "",
  "net": "TA",
  "sta": "124A",
  "lat":

Because of the pretty print of the full documents, that is a bit verbose, but it hopefully illustrates the point.  Although geospatial queries are complex, they have a lot of potential use for workflows that need to group data by the spatial location of stations (a "virtual array" concept) or by source (stacking of closely spaced sources).  
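
If you need a count for a circular region, the PyMongo documentation for `count_documents` says to replace `$nearSphere` with `$geoWithin` and `$centerSphere`.  The sketch below builds such a query.  Two things here are assumptions: the helper name `within_circle_query` is made up, and the Earth radius of 6378.1 km is used to convert the distance in meters to the radians that `$centerSphere` expects.  Note also that, unlike `$nearSphere`, a `$geoWithin` query does not return documents sorted by distance.

```python
EARTH_RADIUS_M = 6378100.0  # equatorial radius in meters (assumed value)


def within_circle_query(lon, lat, radius_m, key="location"):
    """Build a $geoWithin/$centerSphere query for a circle of radius_m meters."""
    radius_radians = radius_m / EARTH_RADIUS_M
    return {key: {"$geoWithin": {"$centerSphere": [[lon, lat], radius_radians]}}}


count_query = within_circle_query(-105.0, 32.5, 300000.0)
print(count_query)
# With a live database handle this form can be counted (not run here):
# n = db.site.count_documents(count_query)
```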

### Sorting
There are many situations where it is advantageous to sort the return of a query by one or more keys.   Sorting is technically a method of the `Cursor` object returned by a query, but more magic happens when the client passes the query to the MongoDB server to assure the operation is done efficiently.   The reason I point that out here is mostly to clarify why the sort clause appears where it does in typical usage.  The User Manual addresses this in more detail, but here is an example that sorts channel documents into an order sensible for miniseed, which uses net:sta:chan:loc:time-interval as a unique key combination.  

In [21]:
# sort channel documents by net, sta, chan, and starttime and show the first 6
filter_clause = {
    "_id":0,
    "sta":1,
    "chan":1,
    "starttime":1,
    "endtime":1,
}
sort_clause = [
    ("net",1),
    ("sta",1),
    ("chan",1),
    ("starttime",1),
  ]
cursor=db.channel.find({},filter_clause).sort(sort_clause).limit(6)
doclist=[]
for doc in cursor:
    doclist.append(doc)
from obspy import UTCDateTime
for doc in doclist:
    doc['starttime']=UTCDateTime(doc['starttime'])
    doc['endtime']=UTCDateTime(doc['endtime'])
print_as_table(doclist)
    

    sta                    starttime                      endtime chan
0  034A  2010-01-08T00:00:00.000000Z  2011-11-17T17:05:00.000000Z  BHE
1  034A  2010-01-08T00:00:00.000000Z  2011-11-17T17:05:00.000000Z  BHN
2  034A  2010-01-08T00:00:00.000000Z  2011-11-17T17:05:00.000000Z  BHZ
3  035A  2010-01-12T00:00:00.000000Z  2011-11-14T17:40:00.000000Z  BHE
4  035A  2010-01-12T00:00:00.000000Z  2011-11-14T17:40:00.000000Z  BHN
5  035A  2010-01-12T00:00:00.000000Z  2011-11-14T17:40:00.000000Z  BHZ


Noting:
1.  The "sort" function call appears after the find function with arguments.   That is the syntax because "sort" is a Cursor "method".
2.  I added a second qualifier, limit, to only return the first 6 documents.  I did that just to keep the volume of the output under control.   The number returned is much larger if you remove the `.limit(6)` qualifier.
3.  I did a projection and used the `print_as_table` function we defined to make a more readable report. 

### The read_data method
Now that  you have a basic understanding of MQL and the two "Read" operators in MongoDB called `find_one` and `find`, we return to the MsPASS workhorse method of [Database](http://www.mspass.org/python_api/mspasspy.db.html#module-mspasspy.db.database) called [read_data](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database.read_data).  It is the serial processing tool for loading data in MsPASS.   (In the third class of this course we will use the parallel version that is a function called [read_distributed_data](http://www.mspass.org/python_api/mspasspy.io.html#mspasspy.io.distributed.read_distributed_data).)  From the docstring realize first that [read_data](http://www.mspass.org/python_api/mspasspy.db.html#mspasspy.db.database.Database.read_data) is a method of [Database](http://www.mspass.org/python_api/mspasspy.db.html#module-mspasspy.db.database) and does NOT accept MQL commands at all.   What it does is driven by arg0 which must be one of two things or it will throw an exception:

1.  A python dictionary with content sufficient to construct a `TimeSeries` or `Seismogram` object.  The simplest way to say that is it is a document from one of the "wf" collections of MsPASS:  "wf_miniseed", "wf_TimeSeries", or "wf_Seismogram".   Note we populated all of those already in our first class.
2.  A [CommandCursor](https://pymongo.readthedocs.io/en/stable/api/pymongo/command_cursor.html) that points to one of the "wf" collections.

For case 1 `read_data` will return an atomic datum (i.e. a `TimeSeries` or `Seismogram`); for case 2 it will return an ensemble.   Although there are defaults, it is good practice to ALWAYS set the "collection" argument of `read_data`, both for clarity and because it will abort in many cases if you don't.   The following block contains some variants of sections of code from our first class that illustrate what we are discussing:

In [22]:
# atomic read from wf_miniseed
doc = db.wf_miniseed.find_one()
d = db.read_data(doc,collection='wf_miniseed')
print("Type of return from read_data=",type(d))
# atomic read from wf_Seismogram
doc = db.wf_Seismogram.find_one()
d = db.read_data(doc,collection='wf_Seismogram')
print("Type of return from read_data=",type(d))
# atomic read from default wf_TimeSeries
doc = db.wf_TimeSeries.find_one()
d = db.read_data(doc)
print("Type of return from read_data=",type(d))
# ensemble read demonstration using source_id.   First get a valid source_id and then construct a query
doc = db.source.find_one()
sid=doc['_id']
query={'source_id' : sid, 'data_tag' : 'serial_preprocessed'}
# read ensemble from wf_TimeSeries with a cursor 
cursor = db.wf_TimeSeries.find(query)
d = db.read_data(cursor,collection='wf_TimeSeries')  # collection could be dropped here but clearer to specify it
print('Type of return from read_data=',type(d))
print('Number of members in this ensemble=',len(d.member))
# repeat for wf_Seismogram
cursor = db.wf_Seismogram.find(query)
d = db.read_data(cursor,collection='wf_Seismogram')  # collection could be dropped here but clearer to specify it
print('Type of return from read_data=',type(d))
print('Number of members in this ensemble=',len(d.member))

Type of return from read_data= <class 'mspasspy.ccore.seismic.TimeSeries'>
Type of return from read_data= <class 'mspasspy.ccore.seismic.Seismogram'>
Type of return from read_data= <class 'mspasspy.ccore.seismic.TimeSeries'>
Type of return from read_data= <class 'mspasspy.ccore.seismic.TimeSeriesEnsemble'>
Number of members in this ensemble= 1311
Type of return from read_data= <class 'mspasspy.ccore.seismic.SeismogramEnsemble'>
Number of members in this ensemble= 437


Finally, we emphasize that all MsPASS processing is normally driven by a list of documents from a wf collection.   Sometimes we process the entire collection.  An example is the first waveform processing loop in session 1 that we drove with wf_miniseed.  There we used this construct:
```
cursor=db.wf_miniseed.find({})   # {} means all - now that you know MQL rules you should understand why
for doc in cursor:
    d = db.read_data(doc,collection='wf_miniseed')
```
Most processing, however, uses some form of query to limit what is passed through the processing chain.   A nearly universal one is a limit on the "data_tag" attribute, which you should always use to define the result of a particular save at a particular stage of processing.  For example, the box below does nothing but read all the data in wf_TimeSeries with the source_id value we set above:

In [23]:
import time
query={'source_id' : sid}
n = db.wf_TimeSeries.count_documents(query)  
print("Number of documents with source_id=",sid," is ",n)
t0 = time.time()
cursor = db.wf_TimeSeries.find(query)
for doc in cursor:
    d = db.read_data(doc,collection='wf_TimeSeries')
t = time.time()
print("Time to read ",n,' TimeSeries objects was ',t-t0)

Number of documents with source_id= 666d53c3f63ce193f69019b0  is  1311
Time to read  1311  TimeSeries objects was  8.648014783859253


## Update
One has to do an "update" to a MongoDB database if you need to change the contents of one or more documents.  Database updates happen in the modern world in inconceivably huge numbers every day in commercial operations.  For example, if you order something from Amazon, all those tracking stages from your clicking history to the time a package is delivered to your home invoke a series of database transactions including, I presume, a lot of updates.  

Although updates are a common requirement in commercial databases, a less obvious thing to most people is that updates are rarely if ever needed in data processing with a system like MsPASS.   Most data processing involves three stages:  1) read the data set, 2) process the data set, and 3) save the results.   Some processors may need to do read operations from the database, but updates are rarely needed.  They are also highly undesirable in a data-driven workflow like that because database transactions, from the computer's perspective, are like a human talking to someone on Jupiter; a response to the request for an update takes forever in terms of computer clock cycles.  For that reason, updates should be avoided in any workflow and should absolutely never be embedded in a large, parallel processing sequence. 

In MsPASS updates can nearly always be avoided by a simple, alternative approach:   if a change is needed that needs to be saved (e.g. you compute a set of new attributes from the data) simply post that data to the associated object's `Metadata` container.   In that model, when the final results are saved the newly computed attributes will be saved with the data.  Then the overhead of writing to the database is absorbed in the normally essential save step anyway.  
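
A minimal sketch of that pattern follows.  It relies only on dictionary-style assignment, which MsPASS data objects support for their Metadata.  To keep the example runnable without a database, a plain dict stands in for a `TimeSeries`; the helper name and the key "snr_estimate" are made up for the illustration.

```python
def post_attributes(datum, attributes):
    """Copy computed attributes into a datum using dict-style assignment."""
    for key, value in attributes.items():
        datum[key] = value
    return datum


# A plain dict stands in for a TimeSeries here (assumption for illustration)
d = {"sta": "O34A", "chan": "BHZ"}
d = post_attributes(d, {"snr_estimate": 12.5})
print(d)
# In a real workflow the new key would be written by the normal save step
# at the end of processing rather than by a separate database update.
```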

With that long caveat, there are two standard ways to do updates:  `update_one` changes one document at a time, and `update_many` updates multiple documents with one client-server transaction.  Most people can understand usage of these two methods better by examples.  The examples below focus on updates to "normalizing" collections as that, from my experience, is the most common need for updates when using MsPASS.

### update_one example
Suppose we learned that the recording period for a seismic station is wrong.  That is, with SEED data station information has a time period for which the data are considered valid.   That period is defined by two attributes with the keys "starttime" and "endtime".  Changing these fields would be highly unusual for data downloaded from the FDSN, but is not at all uncommon for portable deployments while the experiment is in progress.  Our example is contrived, as what we are about to do will make the entry we edit wrong.   So the hypothetical situation we are modeling is that we learned the "endtime" for station O34A is wrong.  We first query the site collection to verify what we have:

In [24]:
from obspy import UTCDateTime
query={'sta' : 'O34A'}
# verify there is only one entry - not always true with this query
ndocs=db.site.count_documents(query)
print('Number of documents for station O34A = ',ndocs)
doc=db.site.find_one(query)
print(doc['sta'],
    UTCDateTime(doc['starttime']), UTCDateTime(doc['endtime']))

Number of documents for station O34A =  1
O34A 2010-06-11T00:00:00.000000Z 2012-04-18T23:59:59.000000Z


We say, "ahh the endtime should have been on April 19 not April 18 and our field notes show the actual time was 13:44 UTC."   We can make that change with this use of `update_one`.  

In [25]:
new_time=UTCDateTime('2012-04-19T13:44:00.0Z')
update_doc={ '$set' :
            {'endtime' : new_time.timestamp}
           }
db.site.update_one(query,update_doc)
print('Updated data for O34A')
doc=db.site.find_one(query)
print(doc['sta'],
    UTCDateTime(doc['starttime']), UTCDateTime(doc['endtime']))

Updated data for O34A
O34A 2010-06-11T00:00:00.000000Z 2012-04-19T13:44:00.000000Z


Notice `update_one` has two required arguments: arg0 is a query and arg1 is required to be an update 'operator' document, meaning it has to use one of the 'dollar' operators of the kind discussed above.  This one uses '$set', which means replace the value (or create the key if it does not already exist).  In my experience, that is the most common operator for updates.
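
Because `update_one` silently modifies only the first match, it can be worth wrapping the count check and the update together.  The function below is a hypothetical helper, not part of MsPASS.  It uses `count_documents` and `update_one` as in the cells above, plus the `matched_count` and `modified_count` attributes of the result object PyMongo returns from an update.

```python
def checked_update_one(collection, query, update_doc):
    """Run update_one only when the query matches exactly one document."""
    n = collection.count_documents(query)
    if n != 1:
        raise ValueError(f"query matched {n} documents; expected exactly 1")
    result = collection.update_one(query, update_doc)
    # The result reports how many documents matched and actually changed
    return result.matched_count, result.modified_count


# Usage with a live database handle (not run here):
# matched, modified = checked_update_one(db.site, query, update_doc)
```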

### update_many example
The basic argument structure required for `update_many` is the same as `update_one`.   The difference is that you should use `update_many` when the query in arg0 is expected to match more than one document to be modified.  The example below is the same as for `update_one` but applied to the "channel" collection.   As the `count_documents` output below shows, the same query yields 3 documents for channel because the site has a three-component sensor.

In [26]:
ndocs=db.channel.count_documents(query)
print('number of channel documents for O34A=',ndocs)
# we use the same query and update_doc as above
db.channel.update_many(query,update_doc)
print('Updated data for O34A')
cursor=db.channel.find(query)
for doc in cursor:
    print(doc['sta'],doc['chan'],
      UTCDateTime(doc['starttime']), 
      UTCDateTime(doc['endtime']))

number of channel documents for O34A= 3
Updated data for O34A
O34A BHE 2010-06-11T00:00:00.000000Z 2012-04-19T13:44:00.000000Z
O34A BHN 2010-06-11T00:00:00.000000Z 2012-04-19T13:44:00.000000Z
O34A BHZ 2010-06-11T00:00:00.000000Z 2012-04-19T13:44:00.000000Z


## Delete
The API for deleting documents is very similar to that for find.  There is a `delete_one` method to delete a single document and a `delete_many` method that more-or-less does a find followed by deleting each document the query found.  For instance, the following deletes what we just updated in channel:

In [27]:
# repeating this query to be clear but not required in this context
query={'sta' : 'O34A'}
ndocs=db.channel.count_documents(query)
print('number of channel documents for O34A before delete=',ndocs)
ret=db.channel.delete_many(query)
ndocs=db.channel.count_documents(query)
print('number of channel documents for O34A after delete_many=',ndocs)

number of channel documents for O34A before delete= 3
number of channel documents for O34A after delete_many= 0


Handling deletions of waveform data is a much more difficult problem.   In MsPASS there is a special method of our `Database` class called `delete_data`.  That method has to do a lot more than just call the `delete_one` method to remove the database document.  There are two reasons for that:
1.  In MsPASS the sample data, which are typically orders of magnitude larger than the "document" saved in MongoDB, are stored separately from the "document" of name-value pairs.
2.  MsPASS also supports multiple "storage modes" for how to handle the sample data.   It also allows multiple "formats" for how that data is represented externally (e.g. miniseed is a "format" that is light years from the natural representation of seismic data). At this time there are three basic "storage modes":  (1) "file", (2) "gridfs", and (3) "url".  How they need to be handled with a "delete" operation is very different.  When "storage_mode" is set to "file" the sample data are stored in a file system in a set of files.  There the problem is that one file normally contains many waveforms, so if a lot of editing is done data will be stranded.  MsPASS has a way to automatically delete files that no longer contain a reference in the database to reduce debris, but it only works if the entire file content is deleted.   Using "gridfs" storage is a simpler problem as our waveform delete operator will automatically clear sample data stored in the gridfs system.  If your application requires a lot of editing to remove stale waveforms, gridfs is by far the best choice.  Finally, "url" is pretty much defined to be read-only, so the only thing that happens for data indexed that way is that the document vanishes. For data access via the cloud with the new Earthscope system this mode may become common.     

One common application of `delete_data` is to clear some temporary save copy that is no longer needed.  In MsPASS when data are saved we recommend ALWAYS using the "data_tag" argument to provide a unique tag for data at a specific stage of processing.   With that understanding, suppose we saved an intermediate copy of a working dataset with the `data_tag="preprocessed"` and we wanted to clear the disk space associated with that intermediate copy.  The following simple code box would do that (Note it will do nothing here because the db we have been using contains no waveform data so I disabled the code box):  

Note arg0 of this method (currently) requires the ObjectId of the document to be deleted.  arg1 must be either "TimeSeries" or "Seismogram" or the method will throw an exception.
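
A minimal sketch of such a cleanup is shown here as a function so nothing is deleted unless you call it.  It assumes only what the note above states about `delete_data` (arg0 is the ObjectId, arg1 is the type name) and that the tagged data live in "wf_TimeSeries".  The function name `clear_tagged_waveforms` is made up, and any optional arguments of `delete_data` are left at their defaults, so check the docstring before using this on real data.

```python
def clear_tagged_waveforms(db, data_tag, collection="wf_TimeSeries",
                           object_type="TimeSeries"):
    """Delete every waveform saved with a given data_tag; return the number removed."""
    query = {"data_tag": data_tag}
    # Collect the ids first so we are not deleting while iterating the cursor
    ids = [doc["_id"] for doc in db[collection].find(query, {"_id": 1})]
    for oid in ids:
        db.delete_data(oid, object_type)
    return len(ids)


# Usage with a live MsPASS Database handle (not run here):
# n = clear_tagged_waveforms(db, "preprocessed")
```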

## Importing Tabular Data
A final point we want to teach in this session is the utility of MongoDB for importing all kinds of weird data.   You can find a more lengthy discussion of the ideas in [this section](http://www.mspass.org/user_manual/importing_tabular_data.html) of the User Manual.  There are two key points we highlight to motivate why you should listen to this:
1.  Cutting-edge research often involves reading and managing nonstandard data.   MongoDB is the best solution we know of for managing weird data because a "document" is a container that can hold just about anything we have encountered. 
2.  A large fraction of open data are distributed completely or in part as tables of information.  As a result there is a rich ecosystem for handling tabular data in python that are automatically available and packaged with MsPASS.   Examples include readers for csv files, fixed format text files, and readers to interact with any SQL database server.

We encourage you to read the User Manual page in the link above at your leisure along with examples found in a similar tutorial to this one found [here](https://github.com/mspass-team/mspass_tutorial/blob/master/notebooks/mongodb_tutorial.ipynb). 
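
As a small illustration of why tables map so easily onto MongoDB, the sketch below turns each row of a csv table into a dictionary, which is exactly the form `insert_many` accepts.  It uses only the python standard library.  The table content, the helper name `csv_to_documents`, and the collection name in the final comment are all made up for the example.

```python
import csv
import io

# Made-up table standing in for a downloaded csv file
csv_text = """sta,lat,lon,elev
X01,32.50,-105.00,1.20
X02,33.10,-104.25,0.95
"""


def csv_to_documents(stream, float_keys=()):
    """Turn each row of a csv table into a dict suitable for insert_many."""
    docs = []
    for row in csv.DictReader(stream):
        doc = dict(row)
        # csv fields arrive as strings; convert the numeric ones
        for key in float_keys:
            doc[key] = float(doc[key])
        docs.append(doc)
    return docs


docs = csv_to_documents(io.StringIO(csv_text), float_keys=("lat", "lon", "elev"))
print(docs)
# With a live handle (collection name is hypothetical):
# db["my_table"].insert_many(docs)
```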

A special case in seismology is interaction with relational database systems.   Most regional networks today use some form of relational database to manage some or all of their data.   If, in your work, you need to interact with the information system of some provider that utilizes an SQL server and you can get read access to the database, follow the link above to our User Manual section discussing this topic.   (There are standard tools in Pandas and dask to interact with SQL servers.)   A special case in our community is the "flat file" database system developed originally in the 1980s for the IRIS Joint Seismic Program originally called "Datascope".   At the end of the Joint Seismic Program the authors of Datascope spun off the software company called [Boulder Real Time Technologies](https://brtt.com/) using it as the framework for their real-time seismic network monitoring software they called "Antelope".   Their software is used in several US seismic networks and many others around the world.  It is also used by PASSCAL for some elements of experimental data handling.  Furthermore, US seismology research scientists not operating seismic networks can obtain a license for their software at no cost.    That is, there are many places you can find Datascope tables that contain useful data for research in our community.   A type example is the phase picks made by the USArray network facility we will examine here that were originally downloaded from the [Array Network Facility (ANF) website](https://anf.ucsd.edu/tools/events/).  We close this session with a brief demonstration of the special tool recently developed for MsPASS for working with a Datascope database. 

The MsPASS tool for working with Datascope is a special database class we call [DatascopeDatabase](http://www.mspass.org/python_api/mspasspy.preprocessing.html#mspasspy.preprocessing.css30.datascope.DatascopeDatabase).  In the working directory for this tutorial is a directory containing picks made by the ANF from the Earthscope TA in January of 2011.  We downloaded and unpacked it from the ANF site referenced above.  It produced the content of the directory you should see in the jupyter lab file pane called "events_usarray_2011_01".   The database tables are in that directory and all begin with the "database name" of "usarray_2011_01".  With that background we can create a `DatascopeDatabase` handle to that data with this incantation:

In [28]:
from mspasspy.preprocessing.css30.datascope import DatascopeDatabase
# Temporary until bug is repaired - should be able to remove pffile arg when that is repaired
dsdb = DatascopeDatabase("events_usarray_2011_01/usarray_2011_01",pffile="DatascopeDatabase.pf")

The [DatascopeDatabase](http://www.mspass.org/python_api/mspasspy.preprocessing.html#mspasspy.preprocessing.css30.datascope.DatascopeDatabase) class has a number of useful methods, but the most useful one for this example is one called `CSS30Catalog2df`.   It creates a large table that is the full "catalog" of picks that includes cross-referencing "joins" defining source locations associated with each pick.  This final box is a terse example that illustrates this functionality.  This approach can be used to jump start any study with TA data that would benefit from these phase picks.

In [29]:
df = dsdb.CSS30Catalog2df()
print(df)

           evid evname    prefor        auth  commid        lddate      lat  \
0      212668.0      -  349724.0  QED_weekly    -1.0  1.321650e+09  27.2470   
1      212668.0      -  349724.0  QED_weekly    -1.0  1.321650e+09  27.2470   
2      212668.0      -  349724.0  QED_weekly    -1.0  1.321650e+09  27.2470   
3      212668.0      -  349724.0  QED_weekly    -1.0  1.321650e+09  27.2470   
4      212668.0      -  349724.0  QED_weekly    -1.0  1.321650e+09  27.2470   
...         ...    ...       ...         ...     ...           ...      ...   
52427       NaN    NaN       NaN         NaN     NaN           NaN      NaN   
52428  213019.0      -  350200.0    ANF:tcox    -1.0  1.320339e+09  47.5331   
52429       NaN    NaN       NaN         NaN     NaN           NaN      NaN   
52430       NaN    NaN       NaN         NaN     NaN           NaN      NaN   
52431       NaN    NaN       NaN         NaN     NaN           NaN      NaN   

            lon    depth          time  ...  amp  p

Note that the NaNs result from "unassociated" picks, meaning an analyst picked a phase that couldn't be associated with any known earthquake.  
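
If you plan to load these picks into MongoDB, you will usually want to separate the associated picks from the unassociated ones first.  The sketch below works on a list of dictionaries shaped like the output of pandas `df.to_dict("records")` and uses only the standard library so it runs on its own.  The helper names and the "sta" values are made up; "evid" is the event id column shown in the table above.

```python
import math


def is_missing(value):
    """True for None or a float NaN, the markers pandas uses for empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_by_association(records, key="evid"):
    """Separate pick records into associated and unassociated lists."""
    associated, unassociated = [], []
    for rec in records:
        if is_missing(rec.get(key)):
            unassociated.append(rec)
        else:
            associated.append(rec)
    return associated, unassociated


# Records shaped like df.to_dict("records"); the "sta" values are made up
records = [
    {"evid": 212668.0, "sta": "X01"},
    {"evid": float("nan"), "sta": "X02"},
]
associated, unassociated = split_by_association(records)
print(len(associated), len(unassociated))
```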