normalize method proposal for Database #303
-
In our meeting last week we discussed this issue a bit. In response I wrote a prototype you can look at in a new branch called extend_normalization_options (not in a pull request; I just pushed it for you guys to review, and the code there is incomplete and completely untested at this point). My proposed design uses an abstract base class that requires a subclass to implement two methods:
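A minimal sketch of that kind of abstract base class is below. The method names find and find_one and the class name are assumptions drawn from later comments in this thread, not necessarily what is in the branch:

```python
# Minimal sketch only; names are assumptions, the real prototype is in the
# extend_normalization_options branch.
from abc import ABC, abstractmethod

class Matcher(ABC):
    @abstractmethod
    def find(self, mspass_object):
        """Return all normalizing documents that match this datum."""

    @abstractmethod
    def find_one(self, mspass_object):
        """Return the single best matching document, or None."""
```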
You can find the source code for my prototype in python/db in the file called
I pushed that incomplete prototype for now largely as a basis for discussing this design in our regular meeting tomorrow. Hopefully you all can review it before then and we can make some plans to move forward on this design. One thing we should definitely discuss is how best to modify the read_data API to utilize this approach for a generic matching function object. We currently have the argument
-
For the record, a lot has happened since this discussion was started. This idea was expanded and is now implemented, albeit with a few residual bugs to kill revealed by testing on a large data set. The "large data set" point is why I want to add to this discussion today. I am in the middle of running a test of this new functionality on a data set with 3.6 million documents. Timing data so far shows the process is going to take over 24 hours to run. We can't blame indices or anything like that, as the updates adding channel_id and site_id use the object id of wf_miniseed documents. Note one issue is definitely that the database in this case is hosted on a plain magnetic disk on a desktop running in all-in-one mode. The process is clearly totally limited by MongoDB updates even though the implementation uses the bulk_write function and does 1000 updates at a time rather than single transactions. The discussion question is how the performance of this process could be improved. I could try hosting the database on an SSD on the same system. That might turn a marginally feasible processing run into a more reasonable one, although I do not really know how different it would be. The alternative is to try out the on-the-fly version of this new implementation. That is, this new set of matching functions could be used in a map operation to load channel_id, site_id, and source_id in the workflow. If that works out we would still need some kind of capability to sort out receiver and source metadata deficiencies; e.g., I know for certain that the data set I'm processing has some missing channel documents that could be retrieved with obspy fdsn web services. Comments?
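For reference, the batched update pattern described above looks roughly like the pymongo sketch below; the database name and the $set payload are placeholders, not the actual matching logic:

```python
# Rough sketch of batched updates with pymongo bulk_write; the database name
# and the $set payload are placeholders for the real matcher output.
from pymongo import MongoClient, UpdateOne

wf = MongoClient()["mspass"]["wf_miniseed"]

batch = []
for doc in wf.find({}, {"_id": 1}):
    # placeholder: the real code computes these ids with the matching function
    ids = {"channel_id": None, "site_id": None}
    batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": ids}))
    if len(batch) >= 1000:
        wf.bulk_write(batch)
        batch = []
if batch:
    wf.bulk_write(batch)
```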
-
There is a fairly large time jump in this discussion. For that reason I considered starting a new thread on the related topic this box addresses, but because the topic is a follow-up to this thread it seems more appropriate to just extend it. Topic: I want to propose a major change to the API we developed for normalization. First, some background for the record. Since this was started some months ago the development progressed as follows:
This comment is fallout from item 3. I find some fundamental flaws in our implementation. I realized this mostly from trying to write the docstrings describing the components of the normalize module. I also found it very challenging to write a section in the user manual on how to write an extension to normalize. It was the latter that made me realize we have a serious design problem, as extensions would be very difficult. Further, I would argue our current implementation will be difficult to maintain. The class structure has some complicated, interleaved methods that are difficult to understand, let alone tell our users how to use. I won't belabor this point but move to some important, fundamental advice too often ignored in all group dynamics: don't complain about something unless you have an alternative to propose. Here is my proposal. My revised design comes from this quote from the draft user's manual on the concepts required to implement normalization:
Those words helped me realize our fundamental design flaw was mixing these two concepts into a single class structure. That is, in the current implementation we have a base class that mixes both of these concepts. With that foundation, I propose we restrict the class definition to implement only concept 1; concept 2 is better implemented as a generic function. Specifically, here is my proposal for the base class implementing concept 1:
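The original code block did not survive the export here, so what follows is only an illustrative sketch of a concept-1-only base class; the constructor arguments are assumptions, not the adopted API:

```python
# Illustrative sketch of a base class restricted to concept 1 (matching),
# with no knowledge of how attributes are loaded and stored in a datum.
# Constructor arguments are assumptions for discussion.
from abc import ABC, abstractmethod

class BasicMatcher(ABC):
    def __init__(self, attributes_to_load=None, aliases=None):
        # what to copy from a matched document, and optional renaming rules
        self.attributes_to_load = attributes_to_load or []
        self.aliases = aliases or {}

    @abstractmethod
    def find(self, mspass_object):
        """Return a list of matching normalizing documents (possibly empty)."""

    @abstractmethod
    def find_one(self, mspass_object):
        """Return the single best match, or None if there is no match."""
```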
Note the name here is more appropriate and matches PEP 8 rules for class naming; the name in the current implementation does not. Concept 2 would be better implemented as the following generic function:
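The original block was lost here as well, so the following is only a hedged sketch of what that generic function could look like; the kill_on_failure option and the live/kill calls follow common MsPASS conventions but are assumptions, not the adopted signature:

```python
# Untested sketch of the "concept 2" generic function: apply any matcher to a
# datum and copy the matched attributes into it.  Argument names are
# assumptions for discussion.
def normalize(mspass_object, matcher, kill_on_failure=True):
    if not mspass_object.live:
        return mspass_object
    doc = matcher.find_one(mspass_object)
    if doc is None:
        if kill_on_failure:
            mspass_object.kill()
        return mspass_object
    for key, value in doc.items():
        mspass_object[key] = value
    return mspass_object
```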
That function is totally untested, but it illustrates how simple the generic function is if you separate the matcher class concept from the load-and-store concept. To clean up the way we handle caching versus database matching I propose the following two subclasses of BasicMatcher:
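The two subclass definitions were also lost in the export; the sketch below only illustrates the database-driven versus cached split, with placeholder names and constructor arguments:

```python
# Illustrative sketch only: one database-driven and one cached subclass of
# BasicMatcher.  All names and constructor arguments are placeholders for
# discussion, not the names finally adopted.
class DatabaseMatcher(BasicMatcher):
    """Issues a MongoDB query for each datum."""
    def __init__(self, db, collection, query_generator, attributes_to_load):
        self.collection = db[collection]
        self.query_generator = query_generator        # builds a query dict from a datum
        self.attributes_to_load = attributes_to_load

    def find(self, mspass_object):
        query = self.query_generator(mspass_object)
        return [{k: doc[k] for k in self.attributes_to_load if k in doc}
                for doc in self.collection.find(query)]

    def find_one(self, mspass_object):
        matches = self.find(mspass_object)
        return matches[0] if matches else None


class CachedMatcher(BasicMatcher):
    """Preloads the normalizing collection into an in-memory dict."""
    def __init__(self, db, collection, key_generator, attributes_to_load):
        self.key_generator = key_generator            # builds a hashable key from a datum or doc
        self.cache = {
            key_generator(doc): {k: doc[k] for k in attributes_to_load if k in doc}
            for doc in db[collection].find({})
        }

    def find(self, mspass_object):
        key = self.key_generator(mspass_object)
        return [self.cache[key]] if key in self.cache else []

    def find_one(self, mspass_object):
        return self.cache.get(self.key_generator(mspass_object))
```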
This design uses an untested construct I know works in C++, but I am unsure about Python and the ABC module. That is, note the virtual method. The subclasses for database and cached algorithms are, I think, a more stable and transparent way to define two fundamentally different approaches to implementing the BasicMatcher class. The dark side is it will extend the namespace, but in my opinion that is better than having hidden default options as we have in the current implementation. Here is the set of definitions I propose to replace all the current functionality.
I am not married to the names I chose here, BUT they at least match PEP 8 rules for class names and make it clear whether a choice is cached or database driven. My thought was to only add a "Database" tag for database-query-driven matchers and no tag at all for the cached version. My reasoning was that this is less verbose than adding a "Cached" tag for something like "OriginTimeMatcherCached" and is consistent with recommending use of the cached version as the default. We might consider using "DB" instead of "Database" to make this less verbose and use it as a prefix instead of a suffix. Let's discuss this on our call this week and decide if this is worth doing. The other viewpoint is to say "if it ain't broken don't fix it". Our existing implementation works, but as I noted it is fragile and will be a challenge to maintain. On the other hand, this stuff is not documented at all yet, so no one out there will be using it; if we are going to revise the API like this, now is the time rather than later when we might get requests for backward compatibility.
-
Ok, I'm going to start in on a few things. There will surely be more fine-grained issues I need to post here. One comes up immediately. The base class will NOT have a database handle inside it. The reason is I can conceive of the need for a file-based construct; e.g., it would make some sense to read an Antelope CSS3.0 ASCII site table without needing to load that data into MongoDB. I thus plan to define a third, intermediate-level class with this signature:
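The signature itself was lost here; the following sketch is only a guess at its shape, with the class name and defaults as placeholders based on the next paragraph:

```python
# Hedged sketch of the proposed intermediate file-based class; the name and
# argument defaults are placeholders, not the adopted API.
class FileBasedMatcher(BasicMatcher):
    def __init__(self, filename, format="css3.0", attributes_to_load=None):
        # parse filename in the given format (e.g. an Antelope CSS3.0 site
        # table) and build an in-memory cache with no MongoDB dependency
        ...

    def find(self, mspass_object):
        ...

    def find_one(self, mspass_object):
        ...
```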
The above would be a strictly intermediate class: an instance would call the above constructor and then contain code to handle the filename arg and format arg. It would load data into a cache much like the current implementation. With csv files you could add additional functionality with pandas. That is a detail. For now I think I will just stub the intermediate class and document how it would be used. I may even implement a couple of Antelope readers as examples. Let me know what you think.
-
For the record, note there has been a fairly large jump not recorded from the previous comment. I made the initial effort to revise the normalize module API based on the ideas noted above. The changes were pushed for discussion at our group meeting Aug 15, 2022, to a branch with the tag "new_normalize_api". I believe our conclusion was that the ideas are sound and we should proceed with this basic design. There is, however, one major additional design element I want to preserve here. I also want to make sure we are all on the same page before trying to implement any of this. Two fundamental issues arose when I tried to implement this design:
A solution to both of the above is to always implement the cache matchers as a pandas DataFrame, or provide that as an option. My proposal here is to support both. The first draft implementation has this class hierarchy (a figure would be better but I hope this conveys the idea): BasicMatcher->DatabaseMatcher->[ObjectIdMatcher, EqualityDBMatcher, MiniseedDBMatcher, OriginTimeDBMatcher, ArrivalDBMatcher] I suggest two changes:
As noted, item 1 is pretty simple to produce from the current implementation. I think it is worth retaining for exact matches with keys like the ObjectId and equality matching, where the use of a single key matches exactly the concept of a python dictionary. I think the miniseed example is more debatable but worth retaining as an example of how one can use a generic find for keys that don't provide a unique match. For item 2, a cursory review of the pandas API shows the approach is clearly feasible and likely pretty efficient for most applications we can anticipate. Before proceeding I'd like to get some feedback on some design issues.
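As a small illustration of why a DataFrame cache handles the generic case (item 2), a dict of equality constraints maps directly to a boolean mask; column names here are only examples, not our schema:

```python
# Sketch of generic equality matching against a pandas DataFrame cache.
import pandas as pd

site_table = pd.DataFrame(
    {"net": ["TA", "TA"], "sta": ["A04A", "A05A"], "lat": [48.7, 48.8]})

def df_find(df, constraints):
    """Return rows of df matching all key/value pairs in constraints."""
    mask = pd.Series(True, index=df.index)
    for key, value in constraints.items():
        mask &= (df[key] == value)
    return df[mask]

print(df_find(site_table, {"net": "TA", "sta": "A04A"}))
```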
So, please comment on this and I'll try to move forward on some of it.
-
I sat down to expand a bit on the definition of the dataframe implementation. Here is a complete skeleton of an implementation of the proposed intermediate class I have called
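The skeleton itself did not survive the export; the sketch below is only a reconstruction-style placeholder with a hypothetical class name (DataFrameCacheMatcher) and argument list:

```python
# Reconstruction-style sketch only; the class name and arguments here are
# hypothetical placeholders for the skeleton that was pushed to the branch.
import pandas as pd

class DataFrameCacheMatcher(BasicMatcher):
    def __init__(self, db, collection, attributes_to_load=None):
        # load the normalizing collection into a pandas DataFrame cache
        self.cache = pd.DataFrame(list(db[collection].find({})))
        self.attributes_to_load = attributes_to_load

    def find(self, mspass_object):
        pass

    def find_one(self, mspass_object):
        pass
```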
Note it is a skeleton, as the methods other than the constructor have only a pass statement. I will proceed with find and find_one once I get some feedback from the rest of the group.
-
Haven't heard any feedback from any of you, so I went ahead, made some executive decisions, and did a rough implementation of the ideas above for DataFrame data. I want to summarize what I learned in doing so, and by the end of this box I'll post some things I need some help on before proceeding. First, I implemented this new intermediate class that has only a few changes from the proposal above:
I included the constructor signature line as there are two possible args there (commented out) that are worth a discussion. At present I have three intermediate classes that are subclasses of BasicMatcher in the normalize module; in addition to the one above I have two others.
Some lessons learned from that exercise:
Topics for discussion before we can finalize this design (the first three are single-sentence summaries of points raised above):
-
Sounds all good. I just want to clarify the suggestion of using
This is in fact provided as the highest-voted answer in your link. I took a closer look at Spark's DataFrame API, and I think it should work perfectly fine for us. In fact, the only thing that kept us from using DataFrame as the basic type for parallelism is that Spark doesn't support a custom type, and we wanted to put data objects into the DataFrame. I don't think that was a good idea anyway. We may go back and re-evaluate this (and that's a totally different topic). Here, I think it will be perfectly fine to have the matcher implementation compatible with all three different DataFrame implementations. With that said, I don't think we have to worry about the API right now. We can just assume the pandas API when implementing all the matchers here. It will be straightforward to implement the parallel version later anyway.
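As a minor illustration of that point, simple boolean-mask selection uses the same syntax in pandas, dask.dataframe, and pyspark.pandas, so matcher code written against the pandas API should port with little change (this is just a sketch, not code from the repository):

```python
# The same selection idiom works across the three DataFrame implementations.
import pandas as pd
# import dask.dataframe as dd      # dd.from_pandas(df, npartitions=4) gives the same syntax
# import pyspark.pandas as ps      # ps.from_pandas(df) likewise

df = pd.DataFrame({"sta": ["A04A", "A05A"], "net": ["TA", "TA"]})
subset = df[(df["net"] == "TA") & (df["sta"] == "A04A")]
print(subset)
```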
-
Found another small detail that is worth some discussion, as opinions may differ. We have had a common feature of the normalize code since the beginning (i.e. the way the reader handles normalization of collections like channel) controlled in the BasicMatcher family with a boolean argument. The question for discussion is how to handle the situation of defining aliases when that boolean is in use.
I tend to favor option 1 to avoid confusing behavior.
-
As discussed in our meeting this morning, while refining a workflow to handle the full USArray teleseismic data set I realized we have a weakness in the handling of normalization with MongoDB. The problem is more serious for source normalization than receiver (channel or site) normalization because data sets are highly variable. That is, how to do association with source collection documents depends on how one assembles the data. Two common choices for event-based data organization are (1) waveform start times defined by a phase arrival time (commonly P), which depends upon source position and origin time, and (2) a common start time relative to the source origin time. For the record, this problem is unique to natural source data processing, as active source data face a simpler, or at least different, problem that is well solved in any seismic reflection processing package.
The idea I had that I want to discuss is perhaps best introduced by the following proposed new method signature for Database:
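That signature was not captured in this export, so the following is purely a hedged guess at its general shape; the method name follows the thread title and the argument names are placeholders:

```python
# Purely illustrative guess at the shape of the proposed Database method;
# the argument names are placeholders, not the adopted API.
class Database:
    def normalize(self, matcher, collection="source", kill_on_failure=True):
        """Use a Matcher instance to associate documents in `collection`
        with waveform documents, writing cross-reference ids like source_id."""
        ...
```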
The real challenge is designing the base class Matcher and the way subclasses derived from it would function.
I think the first thing is that the base class would have to have its own constructor. The base class constructor needs at least the following signature:
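The signature did not survive here either; the sketch below is a hedged guess at the minimal argument list, with the names as placeholders:

```python
# Hedged guess at "at least the following" constructor arguments; treat the
# names as placeholders.
class Matcher:
    def __init__(self, db, collection, attributes_to_load):
        self.db = db                                # database handle for subclass queries
        self.collection = collection                # normalizing collection name, e.g. "source"
        self.attributes_to_load = attributes_to_load
```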
That way each subclass could call the base class constructor and then add things it might need beyond those basics; e.g., the suggested seed_code_matcher might have this set of code at the top:
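Since the original snippet is missing, here is only a sketch of what that top-of-constructor code might look like; all default values are assumptions:

```python
# Sketch only: call the base constructor, then add SEED-specific setup.
# All defaults here are assumptions for discussion.
class seed_code_matcher(Matcher):
    def __init__(self, db, collection="channel",
                 attributes_to_load=("_id", "net", "sta", "chan", "loc")):
        super().__init__(db, collection, list(attributes_to_load))
        # SEED-specific additions (e.g. handling of channel time epochs)
        # would follow here
```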
We should produce other subclasses of Matcher (the seed one above is definitely needed) for these common situations: