normalize method proposal for Database #303

pavlis · 2022-02-08T11:40:12Z

pavlis
Feb 8, 2022
Collaborator

As discussed in our meeting this morning I realized in working with refining a workflow to handle the full usarray teleseismic data set that we have a weakness in handling of normalization with MongoDB. The problem is more serious for source normalization than receiver (channel or site) normalization because data sets are highly variable. That is, how to do association with source collection documents depends on how one assembles the data. Two common choices for event-based data organization is waveform start times defined by a phase arrival time (commonly P) that depends upon source position and origin time and a common time relative to the source origin time. For the record, this problem is unique to natural source data processing as active source data face a simpler problem - or at least different - well solved in any seismic reflection processing package.

The idea I had that I want to discuss is perhaps best introduced by the following proposed new method signature for Database.

def normalize(self,mspass_object, collection="channel", 
      data_to_load={"channel_id" : "_id","lat" : "lat", "lon" : "lon","elev" : "elev"},matcher=seed_code_matcher): 
  """
  This function does a database query on a specified collection to load a set of data from that collection driven by  
  the python dict data_to_load.   The method depends on a matcher class to define which document in 
  the collection you aim to match is the mate to the contents of the input data, mspass_object.   The default 
  behavior load standard attributes to match seed net, sta, chan, loc, and time intervals to matching keys in the 
  input mspass_object.   

  :param mspass_object:   mspass data object with Metadata containing key-value pairs that have a means for 
    an exact match against specified collection
  :param collection:   auxiliar collection to match with contents of mspass_object.  For efficiency this algorithm makes 
    sense only if this collection is not huge and has an appropriate index to mesh with the matcher.
  :param data_to_load:  python dict of data to be loaded from collection.  The key is the name that that will be assigned 
   in the data's metadata and the value is the key expected in the actual collection.
  :param matcher:  subclass of Matcher class used to define how to find a unique match with collection.

The real challenge is designing the base class Matcher and the way subclasses derived from it would function.

I think the first think is the base class would have to have it's own constructor. The base class constructor needs at least the following signature:

def __init__(self,db,collection):

That way each subclass could call the base class constructor then add things it might need beyond those basics. e.g. the suggested seed_code_matcher might have this set of code as the top:

class seed_code_matcher (Matcher):
   def __init__(self,db,collection,default_net, default_sta, default_chan,loc=None):
   """
   Constructor initializes default keys for net, sta, chan, and loc.  This design would make them dynamic 
   with each instance changing the self attributes - see later
   """
      super().__init__(Matcher,db,collection)  # I think this is the right syntax
     self.net =default_net
     self.sta= default_sta 
     self.chan = default_chan 
     self.loc = default_loc
  def find(net=None, sta=None, chan=None, loc=None,time=None):
    """
    Find matching document for net, sta, chan, loc, and time in time range for that channel/site.   Default None 
   means use self keys.  Return cursor to return of find.
   """

We should produce other (i.e. the seed one above is definitely needed) subclasses of Matcher for these common situations:

Matcher for source using a fixed time offset from origin time.
Matcher for source using a phase arrival time reference. This one is much harder because it requires a query to first find a set of sources that are feasible based on a reference phase time, then the algorithm would need to compute a theoretical time for all feasible sources choosing the one that is the closest match. It could still only need one query as the find operation in this case would just need to keep the object id of all feasible sources and return the one that is the match.
A source or receiver matcher using a geospatial query and a search radius. The same class could also be used for imaging applications where a point in space is virtual as for example with pwstack or any ccp stacking algorithm.
For source we might want a 4-d search - i.e. query for all 3 space coordinates and time to find matches.
We don't have this at present, but there are a whole set of different Matcher classes that could be developed for data that could be defined with Cartesian coordinates. They are pretty much the same as geographic examples, but would be different because the coordinate system is radically different.

pavlis · 2022-03-27T19:48:51Z

pavlis
Mar 27, 2022
Collaborator Author

In our meeting last week we discussed this issue a bit. In response I wrote a prototype you can look at in a new branch (not in a pull request - I just pushed it for you guys to review. The code there is incomplete and completely untested at this point.) called extend_normalization_options.

My proposed design uses an abstract base class that requires a subclass to implement two methods:

A get_document method. The get_document method fetches the complete document from mongodb matching the criteria defined by the algorithm for that instance. e.g. I have implemented one subclass that does what we do now: matches an object id using a key like "channel_id".
A normalize method. The normalize method in every case I've implemented first calls get_document and then extracts and (sometimes) rename the key before loading a specified set of attributes into the data's Metadata. The base class actually uses the call mechanism to make calling the class name equivalent to calling normalize as normalize is the main point of this function object class.

You can find the source code for my prototype in python/db in the file called matcher.py. I should warn you the docstrings are incomplete at this time so it might be a bit confusing. Nonethessless, you will see I have implemented the following normalization algorithms in this prototype:

The same thing we do now with ObjectIDs. I gave the class the name ID_matcher. The constructor has this signature: def __init__(self, db, collection, attributes_to_load, kill_on_failure=True): where the collection argument is a string driving the use of constructs like source_id.
I build a matcher for channel data with miniseed data. It has the (verbose) name mseed_channel_matcher.
There is a very similar class to mseed_channel_matcher I called mseed_site_matcher. It is used to match against the site collection using the seed names net, sta, and loc (channel adds chan).
I wrote the simplest source collection matcher I called origin_time_source_matcher. It can be used for the (common) case where data are segmented by time relative to an origin time. The extended usarray data we have been working with, for example, was downloaded from iris using obspy's mass downloader and has miniseed files that could be normalized with this algorithm.
A different thing is the class with the name css30_arrival_interval_matcher. It is a different thing as the use there would be to attempt to load arrival time data from a css3.0 database similar to some of the prototypes found in our tutorials.

I pushed that incomplete prototype for now largely as a basis for discussing this design in our regular meeting tomorrow. Hopefully yall can review this before then and we can make some plans to move forward on this design.

One thing for sure we should discuss is how to best modify the read_data api to utilize this approach for a generic matching function object. We currently have the argument normalize that is expected to be a list of collection names to be normalized using the id method. To be backward compatible I suggest we retain the normalize parameter as a list like now but add an option parameter with a name something like normalizing_matchers (maybe that is too verbose). That could default to None like the normalize parameter but if the user wants to normalize the data they could define a python dict with keys being normalization collection names (normal expectation the same ones in the normalize list) and values being a subclass of the NMF (Normalizing Match Function - base class name for the function objects in matcher.py) constructed and set by the caller. We should allow a field defined in the normalize list to not be defined in the normalizing_matchers dict and if it is not define default to the ID_matcher class. (Note the later would be a bad thing to have a user do for a large data set as the ID_matcher would have to be constructed for every object processed while if it is defined it could be parsed.)

0 replies

pavlis · 2022-05-27T20:00:14Z

pavlis
May 27, 2022
Collaborator Author

For the record a lot happened since this discussion was started. This idea was expanded and is now implemented albeit with a few residual bugs to kill revealed by testing on a large data set.

The "large data set" point is why I want to add to this discussion today. I am in the middle of running a test of this new functionality on a data set with 3.6 Million documents. Timing data in progress shows that process is going to take over 24 hours to run. We can't blame indices or anything like that as the updates adding channel_id and site_id are using the object id of wf_miniseed documents. Note one issue is definitely that the database in this case is hosted on a plain magnetic disk on a desktop running the all-in-on mode. The process is clearly totally limited by mongodb updates even though the implementation uses the bulk_write function and does 1000 updates at a time rather than single transactions.

The discussion question is how the performance of this process could be improved? I could try hosting the database on an SSD on the same system. That might turn a marginally feasible processing run into a more reasonable one, although I do not really know how different it will be. The alternative is to try out the on-the-fly version of this new implementation. That is, this new set of matching functions could be used in a map operation to load channel_id, site_id, and source_id in the workflow. If that works out we would still need some kind of capability to sort out receiver and source metadata deficiencies. i.e. I know for certain that this data set I'm processing has some missing channel documents that could be retrieved with obspy fdsn web services.

Comments?

4 replies

pavlis May 27, 2022
Collaborator Author

A postscript to that last comment. I am now pretty certain our new implementation is also failing because of a cursor timeout problem. Running timing data I know it died about about 30 minutes, which is the cursor timeout interval in MongoDB. We should revisit the immortal cursor problem we never fully resolved. I did not think we would get a timeout problem with this algorithm because it is using the cursor object continuously without large pauses.

Here is the actual exception message posted when the script is aborted:


CursorNotFound: cursor id 5895185101045202573 not found, full error: {'ok': 0.0, 'errmsg': 'cursor id 5895185101045202573 not found', 'code': 43, 'codeName': 'CursorNotFound'}

wangyinz May 27, 2022
Maintainer

That performance you observed doesn't make sense. With 3.6 Million documents and 24 hours, that means the performance is 41 document/second. I think we do need to take a closer look and profile the run to see which part is slowing down the process.

In the meantime, to rule out the impact of magnetic disk, we should try the same run with the in-memory file system. With docker, you could mount a tmpfs into the container (example). Basically, you need to add --mount type=tmpfs,destination=/home/db to the docker run command. This will turn the /home/db directory within the container into an in-memory filesystem. With that, we should be able to completely rule out the performance impact from the filesystem.

For the cursor timeout issue, I wonder which query within the new normalization code caused the issue. This should be easily fixed by the no cursor timeout option.

pavlis May 28, 2022
Collaborator Author

You were absolutely right that didn't make sense. When I dug deeper I discovered it was a clear bug. I think @Yangzhengtang carried forward a bug that was present in my prototype. Problem was not clearing a list called "bulk" in the normalize code that was used to accumulate bulk write requests. bulk just grew and grew until the update time did lead to a cursor timeout. Also, of course, led to redundant updates. Actually, by the way, I think the default timeout is 10 min not 30 - my bad memory. At least that is what sources I read this morning said.

I'm working on fixing that and the problem I noted in an email about botching the channel_id key (double _ in the name).

pavlis May 28, 2022
Collaborator Author

One more thing for the record. After the bug fix above normalize_mseed took about 30 minutes to add channel_id and site_id values to about 3.6 M documents. That is on a mag disk. May or may not be any faster on SSD or faster media as it could be controlled by client-server comm delays. Future performance test to consider.

pavlis · 2022-08-03T09:53:52Z

pavlis
Aug 3, 2022
Collaborator Author

There is a fairly large time jump in this discussion. For that reason I considered starting a new thread on the related topic this box addresses, but because the topic is a followup on this thread it seems more appropriate to just extend this.

Topic: I want to propose a major change to the API we developed for normalization.

First, some background for the record. Since this was started some months ago the development progressed as follows:

I wrote a prototype normalize module. Some of that design is preserved above.
@Yangzhengtang took my prototype and revised it significantly. He removed some redundant code, cleaned up some nonstandard default argument definitions, and added a number of missing docstrings. I confess to not looking very carefully at the changes due to conflicting demands on my time.
For the past month I've been chipping away at the documentation for the normalize module. That involved a new user manual page and cleaning up the text for the docstrings.

This comment is fallout from item 3. I find some fundamental flaws in our implementation. I realized this mostly from trying to write the docstrings describing the components of the normalize module. I also found it very challenging to write a section in the user manual on how to write an extension to normalize. It was the later that made me realize we have a serious design problem as extensions would be very difficult. Further, I would argue our current implementation will be difficult to maintain. The class structure has some complicated, interleaved methods that are difficult to understand let alone tell our users how to use. I won't belabor this point but move to some important, fundamental advice too often ignored in all group dynamicss: don't complain about something unless you have an alternative to propose. Here is my proposal.

My revised designed comes from this quote from the draft user's manual on the concepts required to implement normalization:

An alternative to normalization during a read operation is to match records
in a normalizing collection on the fly and load desired attributes
from that collection. The abstraction of this process we use in MsPASS
makes a fundamental assumption that the normalizing collection is small
compared to the size of the wf collection that defines the working data set.
With that assumption we abstract normalization as two independent
operations:

We need to define an algorithm that provides a one-to-one match of records in
the normalizing collection with the target of the normalization.

After a match is found we usually need to copy a set of attributes
from the normalizing collection to the target.

Those word helped me realize our fundamental design flaw was mixing these two concepts into a single class structure. That is, in the current implementation we have a base class NMF that implements both concepts in one wrapper. Further, after the initial design we added the argument-driven option to have two options for implementing concept 1: database query and a cached version added for speed.

With that foundation, I proposed we restrict the class definition to only implement concept 1 and concept 2 is better implemented as a generic function. Specifically, here is my proposal for the base class implementing concept 1:

class BasicMatcher(ABC):
    """
    Base class defining the api for a generic matching capability for
    MongoDB normalization.   The base class is a skeleton that
    defines on required abstract method (find) and initializes a set of
    universal attributes all matchers need.
    """
    def __init__(
        self,
        attributes_to_load=None,
        load_if_defined=None,
        kill_on_failure=True,
        verbose=False,
    ):
    """
    Base class constructor sets only attributes considered necessary for all
    subclasses.   Most if not all subclasses will want to call this
    constructor driven by an arg list passed to the subclass constructor.
    """
    # code

    def __call__(self, d, *args, **kwargs):
        return self.find(d, *args, **kwargs)

    @abstractmethod
    def find(self, d, *args, **kwargs)->tuple:
        """
        All subclasses must implement this method.  An implementation should
        test key-value pairs in the container passed as d and return
        the matching document in a normalizing collection in a MsPASS
        Metadata container.  The return must be a 2-component tuple
        with 0 holding the Metadata and 1 containing an ErrorLogger object.
        Matching failures should be signaled by returning None in
        component 0 with an optional error messaged posted to the
        ErrorLogger object in component 1.   Normal returns should
        signal have a None for component 1 but can post warnings if
        needed and return an ErrorLogger with content.   Callers need to
        handle the 4 possible combinations cleanly:
           None None  - failure but not message available
           None ErrorLog - failure with an error message that can be passed on
           Metadata None - all is good with no detected errors
           Metadata ErrorLog - valid data returned but there is a warning
              message posted.  A common example of this would be
              match that is not unique.

        """

Noting the name here is more appropriate and matches PEP rules for class naming - the current implementation does not.

Concept 2 would be better implemented as the following generic function:

def normalize(d,matcher):
    """
    Function to apply a normalizing collection defined by the matcher
    argument that must be a concrete instance of a BasicMatcher class.
    All the complications in this api are defined in the matcher class.
    It can be a database query or cached method and use any matching
    algorithm possible with the content of d.

    :param d:  MsPASS data object.   i.e. one of TimeSeries,Seismogram,
      TimeSeriesEnsemble, or SeismogramEnsemble.   Will throw a MsPASSError
      if d is not one of the valid types.  Should also cleanly handle
      None and a datum marked dead.

    :param matcher:  implementation of the BasicMatcher class.  The function
      uses the abstract find method to obtain a match with data or Metadata
      with d.

    :return:  copy of d with normalizing attributes loaded in d.  Matcher may
      kill d and the result may be dead and have an error posted in elog if
      matching failed.  That behavior is controlled by the implementation of
      matcher.
    """
    # this is a variant of the prototype
    if not _input_is_valid(d):
        raise TypeError("normalize:  received invalid data type="+type(d))
    if d.dead():
        return d
    if not isinstance(matcher,BasicMatcher):
        raise TypeError("normalize:  arg 1 must be a subclass of BasicMatcher.  Type of data received="+type(BasicMatcher))

    [doc,elog] = matcher.find(d)
    if doc == None:
        if elog == None:
            d.elog.log_error("normalize","matcher failed applied failed but did not post an error message",ErrorSeverity.Invalid)
        else:
            d.elog += elog    # I think += is overloaded like this - may not be correct
    else:
        # may want to log a complaint in a verbose mode if a key was already
        # define
        for key in doc.keys():
            d[key] = doc[key]
        if elog != None:
            d.elog += elog
    return d

That function is totally untested, but illustrates how simple the generic function is if you separate the matcher class concept from the load and store concept.

To clean up the way we handle caching versus database matching I propose the following two subclasses to BasicMatcher:

def DatabaseMatcher(BasicMatcher):
    """
    Matcher using direct database queries to MongoDB.   Each call to
    the find method of this class constructs a query, calls the MongoDB
    database find method with that query, and extracts desired attributes
    from the return in the form of a Metadata container.
    """
def CachedMatcher(BasicMatcher):
    """
    Matcher implementing a caching method.   This class is largely an
    intermediate class for instances that define a specific key used
    internally to construct the cache.   It has an abstract (virtual)
    method instances must complete to define how the cache key is
    derived from one or more key-value pairs.  The cache as in the older
    implementation would be a dict container with keys defined by the
    _cache_id method (see below) and valued being Metadata containers of
    matching results to be returned by find.
    """
    def__init__(self, args):

    @abstractmethod
    def _cache_id(self,d)->str:
        """
        Instances should implement this method that generate an indexing
        string for the cache.  For something like an _id this would just be
        the str of the ObjectId and for a single string key it could be just
        the key to match.  For multiple keys a more complex method would be
        needed to this interface is generic enough it should be functional
        for anything we need in MsPASS.
        """

    def find(self,d)->tuple:
        """
        Implementation of find for cached method.   It uses the _cache_id
        method to create the indexing string from d and then returns a
        match to the cache stored in self. It is simillar to the prototype
        version.  The unknown here is if this interwoven set of virtual
        methods will work with the python interpreter.
        """

This design uses an untested construct I know works on C++ code, but I am unsure about python and the ABC module. That is, note the virtual method, _cache_id, defined for CachedMatcher. I am not at all sure that will work, but it should if ABC is implemented correctly as that is a standard OOP construct.

The subclasses for database and cached algorithms are, I think, a more stable and transparent way to define two fundamentally different approaches to implementing the BasicMatcher class. The dark side is it will extend the namespace, but in my opinion that is better than having hidden default options as we have in the current implementation. Here is the set of definitions I propose to replace all the current functionality.

class IDMatcherDatabase(DatabaseMatcher):
    """
    Implementation of db_matcher for ObjectIds.
    """

class IDmatcher(CachedMatcher):
    """
    Implement an ObjectId match with caching.
    """

class MiniseedMatcherDatabase(DatabaseMatcher):
    """
    Database implementation of matcher for miniseed data using
    four key net:sta:chan:loc and a time interval test.   This should
    have a option for collection to work for both site and channel.
    Users should create two instances for both site and channel matches.
    """

class MiniseedMatcher(CachedMatcher):
    """
    Cached version of matcher for miniseed.   Example of a more complex
    implementation of _cache_id as it has to handle time interval.  The
    prototype has an algorithm that I think will work with some minor
    blemishes cleand up (like an unnecessarily complicated key).
    """

class EqualityMatcher(CachedMatcher):
    """
    Match with an equality test of the value of a specified key.   Should
    support any type that supports operator ===.   Should provide the option
    of using different keys for the normalizing collection and the target.
    That provides the same functionality as a standard sql join in this
    different context.
    """


class EqualityMatcherDatabase(DatabaseMatcher):
    """
    Database equivalent of EqualityMatcher.

    """

class OriginTimeMatcher(CachedMatcher):
    """
    Cached version of matching data with start times defined by an
    approximately constant offset for an event origin time.   Previously
    called "origin_time_source_matcher".  The prototype code should be
    adaptable with minimal changes.
    """

class OriginTimeMatcherDatabase(DatabaseMatcher):
    """
    Database query version of OriginTimeMatcher.
    """

class ArrivalMatcherDatabase(DatabaseMatcher):
    """
    Previously called css30_arrival_interval_matcher.   Should only
    be implemented with database query implementation as arrival matching
    would often by a huge table.
    """

I am not married to the names I chose here, BUT they at least match PEP rules for class names and make it clear if a choice is cached or Database driven. My thought was only add a "Database" tag for database query driven matchers and have not tag at all for the cached version. My thought was that was less verbose than adding a "Cached" tag for the something like "OriginTimeMatcherCached" and is consistent with recommending use of the cached version as the default. We might consider using "DB" instead of "Database" to make this less verbose and use it as a prefix instead of a suffix. i.e. something like MiniseedMatcherDatabase might be better as DBMiniseedMatcher.

Let's discuss this on our call this week and decide if this is worth doing. The other viewpoint is to say "if it ain't broken don't fix it". Our existing implementation works, but as I noted it is fragile and will be a challenge to maintain. On the other hand, this stuff is not documented at all yet so no one out there will be using it so if we are going to revise the api like this now is the time rather than later when we might get request for backward compatibility.

1 reply

wangyinz Aug 3, 2022
Maintainer

I think the redesign looks much better and we probably should implement it now. I guess we do need to work a bit on the naming. For example, because we have DBMatcher, we probably should name derived ones as MiniseedDBMatcher for consistency.

One other issue of these normalize methods that I found is the performance. I think we need to explore the aggregation methods supported natively by MongoDB.

pavlis · 2022-08-03T19:17:47Z

pavlis
Aug 3, 2022
Collaborator Author

Ok, I'm going to start in on a few things. There will be surely be more smaller granularity issues I need to post here. One comes up immediately. The base class will NOT have a database handle inside it. The reason is I can conceive of the need for a file-based construct. e.g. it would make some sense to read an antelope css3.0 ascii site table without needing to load that data into MongoDB. I thus plan to define a third intermediate level class with this signature:

class FileMatcher(BasicMatcher):
    """
    Like CachedMatcher but constructor should implement a file reader to build the table of attributes and an associated index. 
    """
    def __init__(
       self,
       filename,
       format="CSV",
       attributes_to_load=None,
       load_if_defined=None,
       kill_on_failure=True,
       verbose=False):

The above would be a strictly intermediate class and an instance would call the above constructor and then contain code to handle the filename arg and format arg. It would load data into a cache much like the current implementation. We csv files you could add additional functionality with pandas. That is a detail. For now I think I will just stub the intermediate class and document how it would be used. I may even implement a couple of antelope readers as examples. Let me know what you think.

0 replies

pavlis · 2022-08-16T10:39:14Z

pavlis
Aug 16, 2022
Collaborator Author

For the record note there has been a fairly large jump not recorded from the previous comment. I made the initial effort to revise the normalize module api based on the ideas noted above. The changes were pushed for discussion at our group meeting Aug 15,2022, to a branch with the tag "new_normalize_api". I believe our conclusion is that ideas are sound and we should proceed with this basic design. There is, however, one major additional design element I want to preserve here. I also want to make sure we are all on the same page before trying to implement any of this.

Two fundamental issues arose when I tried to implement this design:

Some common matching conditions were very awkward to impractical for the way we implemented the cache with a python dictionary. The most important for seismology is time interval matching or matching a soft time stamp like an arrival time.
I tried to implement a pandas input option for all subclasses of the CachedMatcher that implemented the python dictionary cache.

A solution to both of the above is to always implement the cache matchers as a pandas dataframe or provide that as an option. My proposal here is to support both.

The first draft implementation has this class hierarchy (a figure would be better but hope this conveys the idea):

BasicMatcher->DatabaseMatcher->[ObjectIdMatcher, EqualityDBMatcher, MiniseedDBMatcher, OriginTimeDBMatcher, ArrivalDBMatcher]
BasicMatcher->CachedMatcher->[ObjectIdMatcher, EqualityMatcher, MiniseedMatcher]

I suggest two changes:

CachedMatcher's name should be changed to "DictionaryCacheMatcher". Other than names changes only bug fixes and some minor details need to implement this. I would suggest this be retained as the intermediate class for the following concrete implementations: ObjectIdMatcher, EqualityMatcher, and (maybe) MiniseedMatcher. i.e. the ones already using the python dictionary cache. In this model I'd also strip any mention of dataframes from the documentation for any piece of the class hierarchy using the python dictionary cache.
We should create a new intermediate class for pandas. I would suggest the name "DataframeCacheMatcher" or "PandasCacheMatcher". I would prefer the former as I always have thought pandas was a weird and confusing name for that package, but that is purely an opinion. It will only be an intermediate class anyway.

As noted 1 is pretty simple to produce from the current implementation. I think it is worth retaining for exact matches with keys like the objectid and equality matching where the use of a single key matches exactly the concepts of a python dictionary. I think the miniseed example is more debatable but worth retaining as an example of how one can use a generic find for keys that don't provide a unique match.

For item 2 a cursory reminder of the pandas api shows the approach is clearly feasible and likely pretty efficient for most applications we can anticipate. Before proceeding I'd like to get some feedback on some design issues.

A minor change from the initial discussion above for BasicMatcher is that the current implementation has two virtual methods: (1) find as noted above and (2) find_one that always returns one and only one Metadata container or a None. Those would have to be implemented for this class. DatabaseMatcher abstracts the find/find_one operation by using a virtual method called query_generator and the current CachedMatcher specifies two virtual methods called cache_id and db_make_cache_id. We all should give some thought to how to define a virtual method for a pandas implementation that would simplify implementing a new matcher class our user's might need to develop. Note the wonderful advantage of the virtual methods in CachedMethod at present is the find/find_one methods are generic and implemented in CachedMethod and to make a subclass concrete only the two virtual methods need to be defined for the concrete class. I think we should aim for a similar functionality with dataframes. I think it may be as simple as this signature:

@abstractmethod
def subset(self,mspass_object):
    """
    A concrete implementation should extract metadata or even compute something from the sample data and use that condition 
   in a subset operation on the internal pandas dataframe cache.   A common example would be a time interval match where 
   the endpoints would be fetched from mspass_object and used in a fixed subset operation on the dataframe.  
   """

I think the constructor should replace arg0 with a dataframe as input. i.e. we shouldn't support file input but tell user's if they are importing tabular data (csv files, antelope ascii tables/views, or the dataframe sql import functions) it is there job to create a valid dataframe for the operation.
An issue with a dataframe not faced with a python dictionary is null values. My reading suggests there is a standard way to handle this in dataframes using None or NaN values. I think how we do this will be an implementation detail we will need to document clearly. e.g. relational tables usually have a defined Null value but how that translates to a dataframe is likely dependent on how the user creates the dataframe.
There might be some generic subset methods that might prove helpful to reduce the need for specialized matchers. Some example would be one key "GT", "GE", "LT", and "LE" tests. I can also see a need for a generic time interval overlap matcher and a time within an interval matcher. The later would be an obvious choice to try to implement first. A version of "OriginTimeMatcher" with a dataframe, for example, would exactly use that test. test_time-tolerance < t0 < test_time+tolerance.

So, please comment on this and I'll try to move forward on some of it.

0 replies

pavlis · 2022-08-17T10:35:26Z

pavlis
Aug 17, 2022
Collaborator Author

I sat down to expand a bit on the definition of the dataframe implementation. Here is a complete skeleton of an implementation of the proposed intermediate class I have called DataFrameCacheMatcher:

class DataFrameCacheMatcher(BasicMatcher):
    def __init__(
            self,
            df,
            collection=None,
            attributes_to_load=None,
            load_if_defined=None,
            require_unique_match=False,
            prepend_collection_name=False,
            use_dataframe_index_as_cache_id=True,
            custom_null_values=None,
            ):
        """
        Constructor for this intermediate class.  It should not be 
        used except by subclasses as this intermediate class 
        is not concrete.    
        """
        super().__init__(
            attributes_to_load=attributes_to_load,
            load_if_defined=load_if_defined,
            )
        # Should have a mechanism here to reduce the dataframe removing 
        # unnecessary columns.  For now just copy it
        self.cache = df
    def find(self,mspass_object)->tuple:
        """
        DataFrame generic implementation of find method.  
        
        This method uses content in any part of the mspass_object
        (data object) to subset the internal DataFrame cache to 
        return subset of tuples matching some condition defined 
        computed through the abstract (virtual) methdod subset.  
        It then copies entries in attributes_to_load and when not 
        null load_if_defined into one Metadata container for each 
        row of the returned DataFrame.  
        """
        pass
    def find_one(self,mspass_object)->tuple:
        """
        DataFrame implementation of the find_one method.
        
        This method is mostly a wrapper around the find method.  
        It calls the find method and then does one of two thing s
        depending upon the value of self.require_unique_match.   
        When that boolean is True if the match is not unique it 
        creates an ErrorLogger object, posts a message to the log, 
        and then returns a [Null,elog] pair.  If self.require_unique_match 
        is False and the match is not ambiguous, it again creates an 
        ErrorLogger and posts a message, but it also takes the first 
        container in the list returned by find and returns in as 
        component 0 of the pair.  
        """
    @abstractmethod
    def subset(self,mspass_object)->pd.DataFrame:
        """
        Required method defining how the internal DataFrame cache is 
        to be subsetted using the contents of the data object 
        mspass_object.   Concrete implementation must implement this class. 
        The point of this abstract method is that the way one defines 
        how to get the information needed to define a match with the 
        cache is application dependent.  An implementation can use Metadata 
        attributes, data object attributes (e.g. TimeSeries t0 attribute), 
        or even sample data to compute a value to use in DataFrame 
        subset condition.   This simplifies writing a custom matcher to 
        implementing only this method as find and find_one use it.
        """
        pass

Noting it is a skeleton as the methods except the constructor have only pass statement. I will proceed with find and find_one once I get some feedback from the rest of the group.

1 reply

pavlis Aug 17, 2022
Collaborator Author

Just thought of something I neglected to mention in the previous comment. When @wangyinz and I discussed this idea earlier this week it was recognized that we should support the use of dask DataFrame seamlessly with a pandas DataFrame. I think we don't really need to do anything to do that except for two things:

Probably should have the constructor test the type of the input DataFrame and set some internal switch if the input is dask DataFrame. That could be used to control content of any error message or work around any api differences between DataFrame implementations.
Document-document-document: we would really need to emphasize this detail in the user manual and carefully explain that one should only use the dask version if the DataFrame is huge.

pavlis · 2022-08-19T11:49:00Z

pavlis
Aug 19, 2022
Collaborator Author

Haven't heard any feedback from any of you so I went ahead and made some executive decisions and did a rough implementation of the ideas above for DataFrame data. I want to summarize what I learned in doing so and by the end of this box I'll post some things I need some help on before proceeding.

First, I implemented this new intermediate class that has only a few changes from the proposal above:

class DataFrameCacheMatcher(BasicMatcher):
    def __init__(
            self,
            df,
            collection=None,
            attributes_to_load=None,
            load_if_defined=None,
            require_unique_match=False,
            prepend_collection_name=False,
            # for consideration as a feature
            #custom_null_values=None,
            #aliases=None,
            ):

I included the constructor signature line as there are two possible args there (commented out) worth a discussion. The idea I had was that custom_null_values and aliases would both be python dictionaries. custom_null_values would be useful for nonstandard null definitions of values in a dataframe. As I read the pandas documentation the standard test for null in a DataFrame is a python NoneType or a NaN. I know other scenarios that could happen from loading data from a relational database where null might be defined differently. Case in point is Antelope's flat file tables where null for each attribute is defined in the schema definition and can be a lot of different things. The idea of the arg would be to have the dictionary keyed by DataFrame column names and have values == null definition for that field. Default would be a standand definition for a DataFrame. The alternative to adding such a feature is to make it the User's responsibility to convert any custom Nulls to a standard DataFrame null definition. I tend to think that is the right model, actually, instead of burdening this low-level code base with such a feature. If we wanted to be kind to our users a python function to implement that feature on a dataframe would be fairly simple to write and we could just include it in the code base of the normalize module or elsewhere.

The aliases dictionary is a more likely option we should consider implementing. It should, perhaps, even be pushed to the base class BasicMatcher. The idea is to have a way to use different keys in a data object than in the database or cache. The more I think about it the more I think it should go in the base class.

At present I have three intermediate classes that are subclasses of BasicMatcher in the normalize module. In addition to the one above I have DictionaryCacheMatcher, which is almost identical to what I called CachedMatcher in the previous version, and DatabaseMatcher which is virtually unchanged other than a minor api change on the find method we discussed Monday. I then implemented the set of matchers that are revisions of the earlier prototype we have in what could be called version 1 of the normalize module. Here are the class definitions illustrating the inheritance structure of all new concrete classes:

class ObjectIdDBMatcher(DatabaseMatcher):
class ObjectIdMatcher(DictionaryCacheMatcher):
class MiniseedDBMatcher(DatabaseMatcher):
class MiniseedMatcher(DictionaryCacheMatcher):
class EqualityMatcher(DataFrameCacheMatcher):
class EqualityDBMatcher(DatabaseMatcher):
class OriginTimeDBMatcher(DatabaseMatcher):
class OriginTimeMatcher(DataFrameCacheMatcher):
class ArrivalDBMatcher(DatabaseMatcher):
class ArrivalMatcher(DataFrameCacheMatcher):

Some lessons learned from that exercise:

The DataFrame model is absolutely a superior solution for a caching model for any test that isn't easily reduced to a single key string match - essential for using a python dictionary as the cache.
I have no idea of the performance of DataFrame relative to a dictionary or database for any of these algorithms. I could make a pretty good guess, but the guess would be a hypothesis that should be tested. When in doubt, I think we need to design some performance tests on the dictionary versus dataframe cache implementation and hide that detail to the users. A case in point above is MiniseedMatcher. I had confidence the existing implementation worked (or at least almost works) enough that I didn't change it. I think we should consider implementing a DataFrame version, do some performance tests, and depricate the one that performs worse.
Our older implementations of ArrivalMatcher/ArrivalDBMatcher were fatally flawed. They would not have even done anything useful. The correct usage needed to do something like match a receiver to the arrival time data. In css3.0 that by the "sta" key. I implemented a "sta" with an optional "net" matching and an automatic provision to handle the "READONLYDATA_" issue - moved forward from the old code base.

Topics for discussion before we can finalize this design (The first 3 are single sentence summaries of points raised above):

Should we implement custom_null_values?
Should we implement the aliases feature in BasicMatcher?
What should we do about MiniseedMatcher?
Should the constructor of DictionaryMatcher handle input as db_or_df or just db. In the version I pushed for discussion early this week I had the db_or_df which means the table being matched could be loaded from MongoDB or a DataFrame. The db only model means it would only accept input as a MongoDB collection. My recommendation is to not support db_or_df largely because it would be very confusing. The class hierarchy is much much cleaner if we simply use DataFrameCache as an intermediate class whenever the input is a DataFrame.
Implementing concrete implementations of DataFrameCache made me realize there is a class of generic algorithms we should consider building that could be used by superclasses to handle specific attributes. Examples are: (1) TimeRangeMatcher could test a time extracted from a datum as inside a cached range defined by two column names (tmin<t<tmax test where tmin and tmax are in the dataframe and t comes form the data), (2) one sided versions of the time interval test, (3) there are gte, gt, lte, and le variants of the time inequality tests, (4) using time is actually restrictive and maybe we just need key inequality matching. This could quickly add more confusion than it solves I fear, but an example is that OriginTimeMatcher could probably be changed to class OriginTimeMatcher(TimeRangeMatcher which would create the inheritance relation: OriginTimeMatcher->TimeRangeMatcher->DataFrameMatcher->BasicMatcher. My initial thought is that if we find that ultimately useful to reduce the code base we could implement it later without changing the API for the concrete classes Users would use anyway if we follow the class hierarchy design like that I suggested for OriginTimeMatcher. It would take some thought to design the intermediate classes properly.
I have not yet done anything about implementing an option for dask DataFrame implementations. I did check that every pandas api call I used in my implementation does seem to be implemented in dask, but all of this is totally untested. I think all we need to do is do an isinstance check in the constructor of DataFrameCache and either set a self variable like self.cache_is_dask or add more isinstance checks in the find method implementation. The main thing needed, of course, is the need to called compute if the dataframe is dask. A big reason I didn't even try is I didn't know if Spark even has a comparable support for DataFrame? It would be a big perturbation to our code base if this particular module would only support Dask.
A detail I didn't discuss with out last Monday, @wangyinz, is how to handle bulk_normalize and the specialization we called normalize_mseed. The problem with those function is they aim to set cross referencing "_id" values (e.g. channel_id, site_id, and source_id) in a wf collection. The matcher design above is mainly intended for inline matching with a map operation on a bag/rdd of seismic data. Where the rubber meets the road (to use a bad cliche) on this question is a type collision that causes on the find_one method. The current api definition and all implementations require the input to find_one to be a "mspass_object" or in some cases an atomic data object (TimeSeries or Seismogram). The solution I used in MiniseedMatcher was to add a find_doc method and change bulk_normalize to call the find_doc method of the Matcher it is passed. I have NOT implemented find_doc on any of the concrete classes I've been experimenting with except MiniseedMatcher. I would argue that find_doc should not be a required base for all matchers (i.e. have and @AbstractMethod decorator inserted in BasicMatcher defining find_doc). The reason is the concept is only relevant for matchers that have use for finding and loading cross-referencing ids. In the above list that is only these: ObjectIdMatcher, MiniseedMatcher, and OrignTimeMatcher. It would be counterproducive for any database matcher and other cases unnecessary and unlikely to be of any use. A point worth discussion.

3 replies

wangyinz Aug 19, 2022
Maintainer

Thank you for all the tests and pushing the development forward. I did not respond earlier because I think the topic is much easier to discuss in our meeting and we don't necessarily need to preserve all the details of rationales behind the design here. Also, I do agree to all your proposals in general. Anyway, to respond to your points above:

1 and 2. Both custom_null_values and aliases make sense, and I think that is quite obvious, so we should just go ahead and add them.

For MiniseedMatcher, I am inclined to implement it with DataFrame. This will make the code more concise and readable. I think performance is less of a concern here for these matchers, but we could definitely test it out. Also, even if dataframe is slower than dictionary, we may still use the dask version of it to speed it up. It will also be capable of holding a large table that doesn't fit on one node.
Agree. It makes no sense to map dataframe into a dictionary when we already have a dataframe implementation.
For a range match, we could use the between method directly (doc here). I am also not sure if an intermediate class is necessary, but if we do want to create one, we could follow the API of that between method I guess.
Yeah, Spark is gonna be a problem. So, Spark definitely has a DataFrame implemented, but I don't think it is as compatible as dask to pandas based on our previous experience (that was two years ago), but it may get better now. You could definitely check it up in their API doc and see if all what you need is implemented.
Yeah, that makes sense. I think it is fine to keep bulk_normalize a special case for a handful of matchers.

BTW, I spent some time looking at the tests and tried to fix them, but I found that most of them are indeed bugs in the code rather than a simple mismatch between the test and the implementation. So, I guess we need to clean up and debug the code after all the class hierarchy being finalized.

pavlis Aug 20, 2022
Collaborator Author

First, I disagree that "we don't necessarily need to preserve all the details of rationales behind the design here". I say that for two reasons. First, we can use this page as a reference via a hyperlink in the User's Manual on background for the design for the normalize module. Some users might find that helpful, but more importantly it can reduce the need for lots of words about the rationale in the user manual. That is actually the main reason I spent the time to post all that here. Second, I think the design we have here is fairly novel and worth preserving in this forum. There may be some random person out there who finds this page useful.

Now to a couple followup questions.

Sorry your answer to question about custom_null_values use is ambiguous. My fault because there are too many words above and I never really made an unambigous suggestion. Note an issue to resolve is if we do or don't want to add that as a feature for dataframe input? We could take punt it to the user and say it is your problem to fix custom nulls. Further, an issue is how it should be done? As noted it would probably be better if we take that route to add a generic function in the normalize module that would change nonstandard null values to the standard dataframe null. To be clear, on reflection I think that is my recommendation: implement a generic function to set nonstandard nulls to the standand pandas null. Let's decide at our next meeting
The reference to "between" is so typical of pandas/dataframe from my experience. Multiple ways to solve the same problem. I think, however, that between alone is not right for a DataFrame as the documentation says "between" is a method of Series. I think the right construct is something like this example: df.query('closing_price.between(99, 101, inclusive=True)', engine="python") found here. I just checked and dask dataframe implements query so that might be the right solution for intervals. Certainly cleaner what I tried which is to construct a pair of conditionals with and & between them. I'm gong to proceed that way anyway for the what I've started as I do think it make the code much cleaner.
Given what you say about spark before we should proceed cautiously with implementing support in normalize for a dask dataframe. An idea to consider is that we create a different module (maybe called dask_normalize) that defines a dask only replacement for the new intermediate class I described above called DataFrameMatcher. The only time I can see one might want to use a dask DataFrame is for matching huge tables like the example I created a while back matching Earthscope picks from the Array Network Facility to a large waveform collection like the ones we've assembled. What is completely unclear, however, is if that would do any better than a MongoDB implementation of the same matching algorithm? Hence, I'm a little inclined to suggest we put aside any support of dask dataframes until it proves to be needed.

pavlis Aug 20, 2022
Collaborator Author

Another feature we need to explore. Some subclasses of DataFrameMatcher may want the option of loading from the database or a dataframe. e.g. OriginTimeMatcher might want to use data in the source collection or a DataFrame created from an Antelope origin table or some custom view joining event->origin.

wangyinz · 2022-08-20T12:57:44Z

wangyinz
Aug 20, 2022
Maintainer

Sounds all good. Just want to clarify on the suggestion of using between. Not sure if you misread it or something, but a Series is a column of a DataFrame. So, instead of composing a string as the query, the equivalent call to your example above will be:

df[df['closing_price'].between(99, 101)]

This is in fact provided as the highest voted answer in your link. So, using query is fine, but to my understanding, it is an overkill here as that it used to support more complex queries than just a range query here.

For custom_null_values, what I mean is that we should provide it as an optional argument for convenience. What you are trying to discuss is the actual implementation, and I don't really have a strong preference. I think the generic function you propose makes sense.

I took a closer look to spark's dataframe api, and I think it should work perfectly fine for us. In fact, the only thing that refrained us from using DataFrame as the basic type for parallelism is that Spark doesn't support a custom type, and we wanted to put data object into the DataFrame. I don't think that was a good idea anyway. We may go back and re-evaluate this (and that's a totally different topic). Here, I think it will be perfectly fine to have the matcher implementation compatible for all three different DataFrame implementations. With that said, I don't think we have to worry about the API right now. We can just assume pandas API when implementing all the matchers here. It will be straightforward to implement the parallel version later anyway.

2 replies

wangyinz Aug 20, 2022
Maintainer

btw, I found the comment in your link that suggests combining query and between, and it is probably not a good choice:

df.query('closing_price.between(99, 101, inclusive=True)', engine="python") - but this will be slower compared to "numexpr" engine.

Even the poser says this will be slow, so we probably should avoid...

pavlis Aug 20, 2022
Collaborator Author

That is all too common error I make in the modern world of internet information overload - tending to skim and miss important points like that. Thanks for responding so quickly. I'll see what I can finish between now and our Monday meeting. I think the discussion we need to preserve here may largely be done now unless something surfaces in our next meeting.

pavlis · 2022-08-21T11:14:19Z

pavlis
Aug 21, 2022
Collaborator Author

Found another small detail that is worth some discussion as opinions may differ. We have had a common feature of the normalize code since the beginning (i.e. the way the reader handles normalization of collections like channel) set in the BasicMatcher family with the boolean argument prepend_collection_name.

The question for discussion is how to handle the situation of defining aliases when prepend_collection_name is True and False. Some options are:

Make them mutually exclusive. i.e. if prepend_collection_name is True aliases is either ignored or if not empty generates an exception.
Have aliases override the automatic names generated by prepend_collection_name set True. e.g. if we had a sta attribute to be loaded in normalization from channel and aliases had {"STA" : "sta"} the name assigned would be "STA" not "channel_sta".
We could do the reverse of 2 which would effectively mean aliases would be ignored when prepend_collection_name was set True.

I tend to favor 1 to avoid confusing behavior.

2 replies

wangyinz Aug 22, 2022
Maintainer

Or, maybe we can do both? So, in your example above (normalization from channel and aliases had {"STA" : "sta"}), the key will become channel_STA. It seems to me this behavior is more intuitive.

pavlis Aug 22, 2022
Collaborator Author

That makes senses and is exactly why discussions like this are a good thing - we all have blind spots

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

normalize method proposal for Database #303

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 9 comments 13 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

normalize method proposal for Database #303

pavlis Feb 8, 2022 Collaborator

Replies: 9 comments · 13 replies

pavlis Mar 27, 2022 Collaborator Author

pavlis May 27, 2022 Collaborator Author

pavlis May 27, 2022 Collaborator Author

wangyinz May 27, 2022 Maintainer

pavlis May 28, 2022 Collaborator Author

pavlis May 28, 2022 Collaborator Author

pavlis Aug 3, 2022 Collaborator Author

wangyinz Aug 3, 2022 Maintainer

pavlis Aug 3, 2022 Collaborator Author

pavlis Aug 16, 2022 Collaborator Author

pavlis Aug 17, 2022 Collaborator Author

pavlis Aug 17, 2022 Collaborator Author

pavlis Aug 19, 2022 Collaborator Author

wangyinz Aug 19, 2022 Maintainer

pavlis Aug 20, 2022 Collaborator Author

pavlis Aug 20, 2022 Collaborator Author

wangyinz Aug 20, 2022 Maintainer

wangyinz Aug 20, 2022 Maintainer

pavlis Aug 20, 2022 Collaborator Author

pavlis Aug 21, 2022 Collaborator Author

wangyinz Aug 22, 2022 Maintainer

pavlis Aug 22, 2022 Collaborator Author

pavlis
Feb 8, 2022
Collaborator

Replies: 9 comments 13 replies

pavlis
Mar 27, 2022
Collaborator Author

pavlis
May 27, 2022
Collaborator Author

pavlis May 27, 2022
Collaborator Author

wangyinz May 27, 2022
Maintainer

pavlis May 28, 2022
Collaborator Author

pavlis May 28, 2022
Collaborator Author

pavlis
Aug 3, 2022
Collaborator Author

wangyinz Aug 3, 2022
Maintainer

pavlis
Aug 3, 2022
Collaborator Author

pavlis
Aug 16, 2022
Collaborator Author

pavlis
Aug 17, 2022
Collaborator Author

pavlis Aug 17, 2022
Collaborator Author

pavlis
Aug 19, 2022
Collaborator Author

wangyinz Aug 19, 2022
Maintainer

pavlis Aug 20, 2022
Collaborator Author

pavlis Aug 20, 2022
Collaborator Author

wangyinz
Aug 20, 2022
Maintainer

wangyinz Aug 20, 2022
Maintainer

pavlis Aug 20, 2022
Collaborator Author

pavlis
Aug 21, 2022
Collaborator Author

wangyinz Aug 22, 2022
Maintainer

pavlis Aug 22, 2022
Collaborator Author