GATHER = subset of Ensemble design #339
Replies: 5 comments 1 reply
-
Quick addition to this discussion. I just poked around the source code for Seismic Unix and don't think there is any merit to creating an SUgather object. If someone wanted to adapt an SU algorithm they would likely be best advised to produce pybind11 wrapper for the SU C code and build a python wrapper around the C code to interact with a gather. SU has way too many interwoven libraries and a long list of main programs using the unix filter model that would make adapting it to a python world clumsy to impossible in my opinio. What might be useful, however, is thinking about a generic interface to run SU unix filters on a @wangyinz what other packages were you thinking of that drove you to suggest we needed to implement what I've here called the gather concept? |
Beta Was this translation helpful? Give feedback.
-
For anyone reading this outside the mspass-team you should be aware this conversation was held for a while around a google doc an discussions during weekly meeting. I (@pavlis ) sat down and turned current ideas into a sketch of a prototype python API. I think this forum will make it easier to refine suggestions than the google doc. First, this design makes use of inheritance. A problem I faced in trying to define how to do that is a huge impedance mismatch between how inheritance works in python and C++. Why that matters is that I was trying to make this interface as much like the C++/pybind11 API for atomic data in MsPASS (TimeSeries and Seismogram) and commonality with MsPASS ensemble objects that are actually implemented as C++ templates. Things that would make sense in C++ would be crazy to impossible in python. For that reason I am here proposing the following inheritance structure for this new data object:
A few clarifications that are side issues from the above that this design includes:
That said, here is my sketched design for the base class:
I emphasize this is mostly just method signatures and docstrings, but it is more concrete than what we've been discussing on github. I next propose this for single component data that I suggest we call a
And this painfully similar for this rough outline for Seismogram input:
I think the strongest thing about this design is it cleanly separates scalar and 3C data in a manner as consistently as I can come up with. I've tried to model the methods to be as consistent with I'm going to close this comment. Please put comments on this basic design below this box as replies I'm going to add two more boxes that are independent topics you should comment on separately. |
Beta Was this translation helpful? Give feedback.
-
In composing the above and reviewing various notes I had an idea this encompasses. It build on an earlier comment about seismic unix. Let me just put down the class with a rough docstring that I think introduces the idea reasonably well:
This is very rough but it boils down to two ideas:
|
Beta Was this translation helpful? Give feedback.
-
There are two low-level algorithms we have defined for current MsPASS data objects for which we need the comparable functionality for these new data objects: (1) time windowing and (2) ExtractComponent for 3C data. I first thought both might be more appropriate as methods but should remain in algorithms. Neither are trivial. If you accept my proposal to allow the new family of gather objects to handle relative and UTC time like BasicTimeSeries, time windowing has complexities we need to discuss. I'm pretty certain ExtractComponent would have a dependence on the large matrix package implementation. Point is these two are core algorithms we would need to implement as part of this development. A discussion point is what other things currently in algorithms should accept one of these objects directly and not require a translation to something like a loop over atomic objects? |
Beta Was this translation helpful? Give feedback.
-
Have one last point for this. We perhaps need mechanisms for things like the member method to return obspy Trace or Stream objects instead of MsPASS data objects. Could streamline obspy algorithms at the cost of losing the history and error logging capability of the MsPASS data objects. I can see both sides of the argument. Suggest we not consider that until someone out there specifically asks for it. I don't think it would be a difficult retrofit. Constructing a TimeSeries and a Trace from the contents of a Gather are not really very different. |
Beta Was this translation helpful? Give feedback.
-
Purpose of this discussion page
We have a number of issues floating around about what we have been calling “ensembles” in MsPASS, but which are often also called a “gather” in seismic reflection processing. The purpose of this discussion page is to help the group (and community who might read this) understand some key ideas and conceptual models of data related to how an ensemble/gather should be defined. i.e. this is a higher level design document that we should use to flesh out the abstractions we should use of those “conceptual models” of an ensemble/gather.
Definitions
Let me do a couple internal definitions that might help make this discussion more tractable. Humans name things to distinguish different concepts so let me make this pair of definitions for this discussion:
Each section below expands what I mean in these definitions.
gather concepts
The way a gather is treated in seismic reflection processing comes from a combination of the history of instrumentation and the data model that emerged in the field for efficiency. All seismic reflection systems have relics of the “multichannel” data recorders that were universal for all production systems until about 30 years ago. Multichannel systems used a single digitizers that cycled through “channels” in a single sample cycle. The natural data output was equivalent to writing a matrix in row order (multichannel data). A big issue in the earliest days of seismic reflection processing (i.e. 1960 and early 1970s) that is almost laughable in today’s world of GBy computers was “demultiplexing”. Demultiplexing is mathematically identical to a matrix transpose of a Fortran storage array. Demux algorithms were more or less that but were a challenge when a single shot gather could not fit in a computer’s memory. The point is that a field “shot gather” was universally conceptualized as a matrix from the 1960s until today. An individual “trace” was one column in the matrix. For a “shot gather” each columns was the output of one sensor for one shot. It was thus natural to organize all other “gathers” into a matrix. e.g. a “CMP gather” is a set of column vectors from a set of shot gather matrices created by just sorting the vectors into “CMP bins”. Furthermore, because the only language of the day was FORTRAN the data order was naturally frozen as FORTRAN order (row index varies fastest).
There are three fundamental assumptions in gather processing that make that approach possible:
ensemble concepts
For the record, the roots of MsPASS are my seispp library that students and I developed from around 2005 till around 2018. That library provided a prototype for the mspasspy.ccore.seismic modules. We implemented an ensemble in a more generic way that does not require the the assumptions above. i.e. THE fundamental difference between a MsPASS ensemble and a seismic reflection "gather" is that it does not require ANY of the three assumptions above that are implicit in seismic reflection gather processing.
The way we accomplished that is to define two atomic data objects: (1)
TimeSeries
objects and (2)Seismogram
objects. The former is used for scalar data while the later are bundled three-component data. Each atomic object can have a different sampling interval or number of samples than any other ensemble member. Further, there is no assumption about the time standard. Data can be defined by unix epoch times or as some relative time standard and can be mixed up (although that would usually be a bad idea). Readers should note that an ensemble is then stored in a C++ std::vector container defined by the symbolmember
in the class definition. The same name is bound to python using pybind11 in a way that themember
symbol can be manipulated with subscripts and iterators.To be clear, this is the class template prototype for an ensemble object:
member
is the std::vector container holding atomic data objects. The template argument,Tdata
, isTimeSeries
for aTimeSeriesEnsemble
andSeismogram
for aSeismogramEnsemble
. Note also ourEnsemble
has aMetadata
container used to hold metadata common to all atomic data "members".It is also noteworthy that pybind11 has allowed us to bind the data arrays in both TimeSeries and Seismogram objects to be accessible from numpy functions. TimeSeries sample data is made to look like a vector (one-dimensional ndarray) while the sample data of a Seismogram is bound to a 3xnsample numpy array in Fortran order (each column of the matrix is a 3-component vector data sample at a time that can be computed by the vector index).
Finally, for this design discussion it is important to recognize that a patch we made after a year+ development in MsPASS is to redefine an ensemble to implement two useful concepts implemented in atomic data objects: (1) a boolean that allows the entire ensemble to be "killed" and (2) an
ErrorLogger
attribute (symbol elog) that allows ensemble related error messages to be posted without aborting a parallel job. For ease of reading this is portion of the class template that defines ourLoggingEnsemble
that pybind11 binds to the two functional namesTimeSeriesEnsemble
andSeismogramEnsemble
:ensembles versus gather issues
The concepts defining a
gather
versus a MsPASS ensemble can be summarized in one phrase: agather
is a special case of anensemble
. That is, a standard reflection seismology gather could be represented by aTimeSeriesEnsemble
if we impose three restrictions matching the assumptionsnoted above:
Only if a
TimeSeriesEnsemble
satisfies the above restrictions can its sample data array be abstracted as a matrix.A driving question is this: why is there a need for a
gather
abstraction given the now solid code base forTimeSeriesEnsemble
data objects? There are at least three reasons I can think of:Gather
object implemented.A final issue to address is how a ThreeComponentEnsemble could be mapped into a gather concept? I would suggest that the answer is actually fairly simple: a ThreeComponentEnsemble that satisfies the gather
assumption could readily be represented as a three-dimensional array. The only question would be which index varies fastest? Addressing that question is best left to later, but it is important for efficiency and how hard it would be to adapt any legacy code. Also numpy may simply the whole thing as they have stock methods to reorganize array data. I would also note that implementing a gather model for three-component data should probably be viewed as a low priority as I know of no standard package that it needs to be adapted to the framework that would profit from the distinction between an ensemble and gather as defined here.
Functional Specifications for Gather Implementation
First, I propose we agree on my definition of a
gather
versus andensemble
. If so, we should aim to use a namespace with the top level working class calledGather
. If, as I propose below, we choose to utilize inheritance as a key design element we can expect some intermediate class names likeBasicGather
orCoreGather
to be useful. There might also be subclasses ofGather
that are specialized. e.g. as I'll suggest below we might want something like anSUGather
that has frozen features mated with Seismic Unixcode.
I would propose the following functional requirements for the new
Gather
API:Gather
should match as much as possible the API for ensemble objects. i.e. all ensemble methods that match in concept with a gather should be implemented in theGather
object. That can assure maximum interoperability. It is to be determined where the mismatch in concept occurs.Gather
should be less dogmatic about sample data type than ensemble. A bit of a flaw in our design for ensemble objects is the dogmatic use of 64 bit floats for sample data. To simplify adapting legacy code we should allow support for other numeric data types.TimeSeriesEnsmble
into aGather
and vice-versa. That is probably best done as a functions rather than as class constructors to make it easy to run the function in a map operation.Gather
.Implementation Proposal
First a couple of axioms:
I'm not exactly sure if the following would work, but let me suggest the following as a base class for Gather objects:
I really have no idea if pybind11 objects can be used in an inheritance tree like this, but it at least defines the base functionality I think a base class for a
Gather
should contain.Metadata
andProcessingHistory
create a functionality that is automatically compatible with our exixtingLoggingEnsemble
akaTimeSeriesEnsemble
andSeismogramEnsemble
. UsingBasicTimeSeries
in the above way is a concise, although perhaps confusing, way to add the functionality inBasicTimeSeries
of concepts of "kill" and "time standard". The methods of BasicTimeSeries are all appropriate for a gather object - it is just the name that is potentially confusing. Thet0
,npts
, and all the related methods are appropriate for a gather because of the assumptions that define a gather as a subset of an ensemble. The only concept collision I see is that there are cases where the time standard could be ambiguous. If time is UTC then t0 is unambiguous, but if time is relative two gathers may not be equivalent with no way to tell the difference. e.g. a gather in relative time could have t0 defined by event origin time (shot time in the reflection world) or something like the "arrival time reference". Those two would need to be treated very differently. I think we could work around that by either defining some metadata key-value pair with the value defining some time standard or add to this enum class defined in the C++ code base:to make it something like this
I haven't looked to see if that would break anything in the C++ code base but it might. It shouldn't, however, be hard to handle if it does - some if-else constructs might need to be switch-case construct. The bigger unknown is if one can build a python class from multiple inheritance of 3 classes defined by pybind11? I would think it would work, but that is an unknown. If we agree on this idea one of the first things to do is see if that can be done. The key point is the three classes I suggested above get us a long way to building a functional object with a long list of useful methods we wouldn't need to implement.
I next suggest the working
Gather
object would be defined as a simple subclass ofBasicGather
but adding a data array and associated stuff. Here is a rough prototype:Other constructors are nearly guaranteed to be needed. Others please add comments to suggest other methods that would be necessary core methods.
As suggested above a goal of our design should be to maximize reusability and allow specialization of the more generic
Gather
object. Here is a rough version of the seismic unix gather idea noted earlier:A couple special features of a SUgather that would be locked in are:
Beta Was this translation helpful? Give feedback.
All reactions