Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catalog dict conversions #2210

Open
wants to merge 19 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@d-chambers
Copy link
Member

d-chambers commented Sep 1, 2018

What does this PR do?

Adds a module to obspy.core.event named "dict" which contains functions for converting catalogs to and from dictionaries.

Why was it initiated? Any relevant Issues?

There is some discussion in #1910 on the topic. To recap, this is a first step towards:

  1. A viable json (or json-like) format for losslessly reading, and writing catalogs. This could be better IO format than quakeML for some applications.
  2. Enabling other event formats to use an intermediate dict for reading/writing.

PR Checklist

  • Correct base branch selected? master for new features, maintenance_... for bug fixes
  • This PR is not directly related to an existing issue (which has no PR yet).
  • If the PR is making changes to documentation, docs pages can be built automatically.
    Just remove the space in the following string after the + sign: "+ DOCS"
  • If any network modules should be tested for the PR, add them as a comma separated list
    (e.g. clients.fdsn,clients.arclink) after the colon in the following magic string: "+TESTS:"
    (you can also add "ALL" to just simply run all tests across all modules)
  • All tests still pass.
  • Any new features or fixed regressions are be covered via new tests.
  • Any new or changed features have are fully documented.
  • Significant changes have been added to CHANGELOG.txt .
  • First time contributors have added your name to CONTRIBUTORS.txt .

How it works

The catalog_to_dict function recursively converts obspy objects to dictionaries. In most cases, the dictionary created from an object can be passed to the class' __init__ method as kwargs to create a new and equal instance of the same class. This ensures only public attributes are used in order to better future-proof the serialization scheme. There are a few classes, such as obspy.UTCDateTime where a custom serialization function is needed because the the first approach does not work. In UTCDateTime's case the str representation is used.

The dict_to_catalog function uses a map of attribute names to their appropriate classes to recurse the dictionary and construct the class tree.

What it still needs

More tests cases to ensure it works with all valid catalog objects.

@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Sep 2, 2018

I did some simple profiling using this branch to read and write catalog json files here, which may be of interest to the #1910 thread. There is probably more that will need to be done, like adding a version number and some simple optimizations, but it is a reasonable start. Also needs more catalogs for testing.

@megies

This comment has been minimized.

Copy link
Member

megies commented Sep 6, 2018

I don't have time to look at this in detail yet.. but any chance this can be compatible with #752?

Only comment I have right now is that naming the file dict.py is probably not a good idea. Builtin names are scary and can bite. But obviously that's easy to fix should this get a nod..

@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Sep 6, 2018

I agree on the name but couldn't come up with something better, suggestions? Of course the three hard problems in computer science are naming variables and off by one errors.

I will take a look at #752 and see what I can do this weekend.

@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Sep 8, 2018

naming the file dict.py is probably not a good idea

Change the name to "dictionary" (I know, very creative 😄 )

any chance this can be compatible with #752?

Looking at it closer, #752 is doing something quite different than this PR. This PR adds a function to convert a obspy catalog tree to a "json-able" strucutre and back in a loss-less way (for storage/transmission). IMHO the catalog dict and Catalog objects should be identical in their structure and attribute names, otherwise the conversion is not "obvious". #752, on the other hand, is about implementing a property that other programs can use to extract spatial information on a subset of objects in the catalog hierarchy. The use cases are really not the same.

@megies

This comment has been minimized.

Copy link
Member

megies commented Sep 10, 2018

naming the file dict.py is probably not a good idea

Change the name to "dictionary" (I know, very creative smile )

Actually, as an I/O plugin it might be good to move it to io/ascii/ anyway. It could go into a new file event.py there?

This PR adds a function to convert a obspy catalog tree to a "json-able" strucutre

Oh, so it's not really de/serialization to JSON then, just the preparation for it, I guess? Then the above comment does not really fit, I guess..

@calum-chamberlain

This comment has been minimized.

Copy link
Contributor

calum-chamberlain commented Oct 31, 2018

I pulled this branch and added a fairly revolting way to get all the test catalogs into the obspy.core.utils.testing.py file. I collects 96 catalogs and they all pass the current test @d-chambers wrote. I'm loath to push those changes up here just because they are so ugly, I could create a PR onto this branch?

The function is:

def read_test_catalogs(warn=False):
    """
    Read all io test catalogs

    Will try and read everything in the test datasets, regardless of whether
    this can or should be read in.  Errors are ignored, unless warn=True. If
    warn=True then errors are reported as UserWarning.

    :return: list of catalogs
    """
    from obspy import read_events
    catalogs = []
    # This is revolting.
    base = os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(
        os.path.abspath(inspect.getfile(inspect.currentframe()))))))
    eps = _get_ordered_entry_points(
        'obspy.plugin.event', 'readFormat', EVENT_PREFERRED_ORDER)
    for name, ep in eps.items():
        test_base = os.path.join(
            base, *ep.module_name.split(".")[:3], "tests", "data")
        valid_files = glob.glob(os.path.join(test_base, "*"))
        for x in valid_files:
            with WarningsCapture():
                try:
                    catalogs.append(read_events(x, name))
                except Exception as e:
                    if warn:
                        UserWarning(e)
                    pass
    return catalogs
@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Oct 31, 2018

Actually that doesn't look all that bad. I am glad the tests pass! A way to collect all tests files would certainly be useful. How many files does it collect and fail to read? Any reason for not using __file__ rather messing with the inspect module? It might be useful to yield each catalog to minimize memory usage. Maybe make it general so we could do something similar to collect all waveforms or inventories as well?

I have no problem if you want to push changes directly to this branch.

I should apologize, the code in this PR is a bit complex (hopefully not complicated) because there is a fair bit of recursion. It is the most elegant solution I could come up with, but I know it can be a bit tricky to think through.

@megies

This comment has been minimized.

Copy link
Member

megies commented Oct 31, 2018

Putting this in the 1.2.0 milestone so it doesnt get overlooked.. when we eventually get round to doing it.. -_-

@megies megies added this to the 1.2.0 milestone Oct 31, 2018

@calum-chamberlain

This comment has been minimized.

Copy link
Contributor

calum-chamberlain commented Oct 31, 2018

Thanks for those comments @d-chambers!

How many files does it collect and fail to read?

It collects 130 files and reads 98 of them. I know some tests have both event and waveform files in their tests/data folder (seiscomp for one I think), and/or files that are known to fail (nordic has one of those), so its not surprising that not everything it finds is read. Also, all the other tests we have for those IO formats should take care of reading issues, so I'm keen to just ignore/overlook them here.

Any reason for not using file rather messing with the inspect module?

I thought that __file__ wasn't super robust for getting paths, in the past I have been advised to use the inspect and abspath methods, but in this case (or at least the way I have done it) is really ugly. If __file__ is safe then I would rather use that for "prettiness". @megies do you have an opinion on that?

It might be useful to yield each catalog to minimize memory usage.

Good suggestion, I will try that.

Maybe make it general so we could do something similar to collect all waveforms or inventories as well?

Also good suggestion, I will try to do that too.

I should apologize, the code in this PR is a bit complex (hopefully not complicated) because there is a fair bit of recursion. It is the most elegant solution I could come up with, but I know it can be a bit tricky to think through.

It looks good to me, recursion is the way I would have done it, but it can be hard to follow.

@calum-chamberlain

This comment has been minimized.

Copy link
Contributor

calum-chamberlain commented Nov 1, 2018

I generalised and used a yield. I'm not sure how people feel about using the plugin read functions directly or using the read and read_events functions (which is what I'm doing at the moment)?

@megies

This comment has been minimized.

Copy link
Member

megies commented Nov 2, 2018

If __file__ is safe then I would rather use that for "prettiness". @megies do you have an opinion on that?

Looks like the __file__ is gone anyways now? In any case, since it was in the testing routines only, no harm from it i'd say. worst thing that could ever happen there are failing tests?

@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Dec 3, 2018

@calum-chamberlain what do you think is left to do on this? Is it ready for review?

@calum-chamberlain

This comment has been minimized.

Copy link
Contributor

calum-chamberlain commented Dec 3, 2018

I think it is probably ready for review - it might be nice to change back to using __file__ rather than my horrible thing, but otherwise I think it could be okay. Would be great to have this in soon.

@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Jan 13, 2019

@calum-chamberlain have you pushed all the work you have done on this branch? I am going rebase this soon, and add make a few more tweaks. After that we should see if we can't get this merged and closed.

@calum-chamberlain

This comment has been minimized.

Copy link
Contributor

calum-chamberlain commented Jan 13, 2019

Yup, nothing more from me since the last push. Rebase away!

@d-chambers d-chambers force-pushed the catalog_dict branch from 484b739 to e654fbd Jan 13, 2019

@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Jan 13, 2019

Ok, hopefully I did that right 🤷‍♂️

@krischer krischer added this to Waiting for Review in Release 1.2.0 Feb 14, 2019

@ThomasLecocq

This comment has been minimized.

Copy link
Contributor

ThomasLecocq commented Feb 14, 2019

the only failing test that looks worrying is for py3.5 on Linux:

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/test-environment/lib/python3.5/site-packages/obspy/core/tests/test_dict_conversion.py", line 41, in test_roundtrip_conversion
    self.assertEqual(cat1, cat2)
AssertionError: <obspy.core.event.catalog.Catalog object at 0x7f88c26524e0> != <obspy.core.event.catalog.Catalog object at 0x7f88c23d6cf8>

as I think the failure on windows (evalresp) is nor related, is it ?

d-chambers added some commits Sep 1, 2018

d-chambers and others added some commits Sep 2, 2018

@megies megies force-pushed the catalog_dict branch from 855b887 to e97ea71 Feb 14, 2019

@megies

This comment has been minimized.

Copy link
Member

megies commented Feb 14, 2019

Rebased on current master and force-pushed so that we have fresh CI results tomorrow.

@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Feb 14, 2019

as I think the failure on windows (evalresp) is nor related, is it ?

Nope, it shouldn't be using that at all.

@ThomasLecocq

This comment has been minimized.

Copy link
Contributor

ThomasLecocq commented Feb 15, 2019

@megies only the classic evalresp appveyor issue, otherwise all green

@ThomasLecocq ThomasLecocq moved this from Waiting for Review to Waiting on CI in Release 1.2.0 Feb 15, 2019

@ThomasLecocq ThomasLecocq moved this from Waiting on CI to Waiting for final manual validation by Core Dev in Release 1.2.0 Feb 15, 2019

@krischer

This comment has been minimized.

Copy link
Member

krischer commented Mar 15, 2019

I looked at this for a bit and I think we should push this to the next release to not commit to a solution we might not want to maintain in the long run. In general I really like the idea in this PR and I think we should go down that route but I feel like it should be implemented in a more general way.

A few thoughts I had:

  • This should be integrated into obspy.io.json. The serialization to JSON/dict is actually identical in both so the catalog_to_dictionary() function should be use there to not have two implementations which might diverge in the future.

  • Instead of dealing with the serialization for all types of objects in a single file/function we should rather define some interfaces. I think a __json__() method might be nice - it not a fully standard Python interface but its has been used.

  • Almost all the event classes actually inherit from AttribDict so a lot of functionality could be collected there.

  • The __init__() methods of the event objects should be smart enough to work with nested dictionary structures und convert to the proper event objects in the init function, e.g. Catalog(**nested_dict) should IMHO work. The main issue here (which this PR also mainly deal with) is how to map keys in the dictionary to Python classes. Maybe we can define some kind of interface for that so it would work for all kinds of objects?

  • This is a nice library and we could draw some inspiration from it: https://jsons.readthedocs.io/en/latest/?badge=latest After we release ObsPy 1.2 we'll move to Python 3 so we could consider employing type hints.

@krischer krischer modified the milestones: 1.2.0, 2.0.0 Mar 15, 2019

@krischer krischer removed this from Waiting for final manual validation by Core Dev in Release 1.2.0 Mar 15, 2019

@d-chambers

This comment has been minimized.

Copy link
Member Author

d-chambers commented Mar 15, 2019

@krischer sounds good. I agree there is room for improvement in both the interface and implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.