
Resource id refactor #2091

Merged
merged 37 commits into master from resource_id_refactor on Jul 5, 2018

Conversation

d-chambers
Member

This PR represents a few days of work trying to track down a particularly hard-to-find bug with the "event scoped" object referrals of the ResourceIdentifier class. The issues were related to premature deletions from the __resource_id_weak_dict due to bugs in the resource_id's manual reference counting. While debugging the code I ended up performing some major refactoring and adding some additional features, namely:

  1. Moved the ResourceIdentifier class from obspy.core.event.base to its own module obspy.core.event.resourceid as the class definition was getting quite large.

  2. Added a context manager for safely (and temporarily) replacing the ResourceIdentifier class state,
    called _debug_class_state (a private method for debugging only). This allowed the removal of
    all manual clearing of the __resource_id_weak_dict (which the docs explicitly warn against).

  3. Removed the manual id reference counting by ResourceIdentifier and replaced it with a semi-singleton class (only one instance ever exists per id at a time) called _ResourceSingleton. The resource singleton references are kept in a weak value dict and are used as keys to the __resource_id_weak_dict. This results in the same behavior as before, but pushes the burden of reference counting back to the standard library/Python implementation (see the sketch after this list).

  4. Added several more tests for the event-scoped behavior of the ResourceIdentifier.

  5. Added the MegaCatalog class to the testing module of utils for creating a Catalog object with most, if not all, of the classes contained in obspy.core.event.

  6. Added a helper function to the testing module that allows context managers to be used in the `setUp` method of `unittest.TestCase` instances.
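
A minimal sketch of the semi-singleton idea from point 3 (a simplified illustration, not the exact ObsPy implementation): the cache is a WeakValueDictionary, so each instance disappears as soon as nothing else holds a strong reference to it, which is exactly the reference counting now delegated to the standard library.

import weakref

class _ResourceSingletonSketch(object):
    # id string -> the single live instance for that id
    _cache = weakref.WeakValueDictionary()

    def __new__(cls, id_str):
        try:
            return cls._cache[id_str]
        except KeyError:
            inst = super(_ResourceSingletonSketch, cls).__new__(cls)
            cls._cache[id_str] = inst
            return inst

a = _ResourceSingletonSketch('smi:local/abc')
b = _ResourceSingletonSketch('smi:local/abc')
assert a is b  # one instance per id while any strong reference exists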

PR Checklist

  • Correct base branch selected? master for new features, maintenance_... for bug fixes
  • This PR is not directly related to an existing issue (which has no PR yet).
  • If the PR is making changes to documentation, docs pages can be built automatically.
    Just remove the space in the following string after the + sign: "+DOCS"
  • If any network modules should be tested for the PR, add them as a comma separated list
    (e.g. clients.fdsn,clients.arclink) after the colon in the following magic string: "+TESTS:"
    (you can also add "ALL" to just simply run all tests across all modules)
  • All tests still pass.
  • Any new features or fixed regressions are covered via new tests.
  • Any new or changed features are fully documented.
  • Significant changes have been added to CHANGELOG.txt.
  • First-time contributors have added their name to CONTRIBUTORS.txt.

@d-chambers d-chambers added the .core.event and cleanup labels Mar 20, 2018
@krischer krischer left a comment

Thanks a lot for spending the time to tackle all these nasty issues!

I only have some minor comments - otherwise this looks pretty good. Does pickling one of the child objects work correctly with the chosen approach?

@rlock
def __setstate__(self, state_dict):
    super(Event, self).__setstate__(state_dict)
    ResourceIdentifier.bind_resource_ids()
Member

Just writing down my thoughts: There are now two separate thread locks for the Event objects, but I think this is fine as __setstate__() and __deepcopy__() cannot really be called at the same time, so there are no race conditions.

Member

Should bind_resource_ids() also be called for all the other event objects like Origin, Magnitude, ...?

Member Author

No, the idea is to call it only on the event level to bind the intra-event resource ids (e.g. picks and arrivals). Since origins are on the same nesting level as picks, an origin could get constructed before the picks that its arrivals refer to, in which case calling bind_resource_ids would cause the arrivals to be bound to the wrong picks.
This whole thing can get confusing, I know...

Member Author

Tbh I am not really sure if we need an rlock on __setstate__ or not. When I am feeling really ambitious I should set up a nasty multi-threaded test case to make sure it all still works with many threads.
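
For reference, a rough sketch of the kind of multi-threaded stress test meant here (hypothetical; not part of this PR):

import threading

import obspy.core.event as ev

def hammer():
    # repeatedly create events and resolve their resource ids
    for _ in range(200):
        event = ev.Event()
        rid = ev.ResourceIdentifier(referred_object=event)
        assert rid.get_referred_object() is event

threads = [threading.Thread(target=hammer) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()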

self.r_dict = state['rdict']  # the weak resource dict
self.unbound = state['unbound']  # the unbound resource ids

def print_state(self):
Member

Can you add a nocover pragma here, e.g.:

def print_state(self):  # pragma: nocover
    ...

"""
# setup temporary id dict for tests in this case and bind to self
context = ResourceIdentifier._debug_class_state()
state = setup_context_testcase(self, context)
Member

Just for my understanding: __exit__() of the context manager would be called once the instance is garbage collected?

Member Author

I am using Ned Batchelder's recipe for using context managers with unittest, which calls the setup and cleanup functions before and after every test. So the context manager enters and exits for every test thanks to this helper function.

Pytest has an elegant solution for this kind of thing with yield fixtures, but this was the best I could find for unittest.
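
For reference, the recipe boils down to entering the context manager during setup and registering its exit with addCleanup; a minimal sketch of such a helper (simplified; not necessarily ObsPy's exact implementation):

def setup_context_testcase(test_case, context_manager):
    """Enter a context manager and exit it when the test finishes."""
    value = context_manager.__enter__()
    test_case.addCleanup(context_manager.__exit__, None, None, None)
    return value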

def test_id_without_reference_not_in_global_list(self):
    """
    This tests some internal workings of the ResourceIdentifier class.
    NEVER modify the __resource_id_weak_dict!
Member

This sentence is obsolete now.

"specified in the QuakeML manual section 3.1 and in particular "
"exclude colons for the final part.")

def test_dubug_class_state(self):
Member

typo

@d-chambers
Member Author

This is still a work in progress; I would like to think about it a bit more before it gets merged.

@krischer
Member

krischer commented Apr 11, 2018

Just a quick ping on this. No need to hurry but it would be nice if this does not get lost in a stale PR.

@d-chambers
Member Author

Hey @krischer,
I am still working on this. I have a solution that is robust and thread-safe, but I need a bit longer to clean it up and to look at ways of speeding it up (reading/copying catalogs currently takes a ~15% performance hit).

@d-chambers
Member Author

I am struggling a bit with this one. I ended up abandoning the implementation from #1644 that depended on event creation order because it was causing all sorts of hard-to-debug problems, and multi-threading didn't work despite rlocking key functions. IMO it was not a good design.

To replace it, I implemented a method of scoping resource ids that uses a parent id (e.g. the id of the event object) as a namespace of sorts for resource_ids and the objects they refer to. This works by calling a recursive function to find all resource ids in an object's attribute tree and assign them to the parent namespace (see the sketch below). This approach is thread-safe and seems to be much more reliable.
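
A rough sketch of such a recursive walk (hypothetical names; the real function also tracks parents and attribute names):

from obspy.core.event import ResourceIdentifier

def _walk_resource_ids(obj, found=None, seen=None):
    # collect every ResourceIdentifier in obj's attribute tree
    found = [] if found is None else found
    seen = set() if seen is None else seen
    if id(obj) in seen:  # guard against circular references
        return found
    seen.add(id(obj))
    if isinstance(obj, ResourceIdentifier):
        found.append(obj)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            _walk_resource_ids(item, found, seen)
    elif hasattr(obj, '__dict__'):
        for value in obj.__dict__.values():
            _walk_resource_ids(value, found, seen)
    return found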

However, recursion is not particularly efficient in Python. This new method slows catalog creation operations (read, copy, unpickle) down by around 20%. This will hamper other efforts to improve efficiency (e.g. #1910).

Perhaps it might be better to leave the resource_ids un-scoped (i.e. they will behave how they originally did before #1644) but provide a public method on the catalog and event objects for scoping the resource ids? I imagine many catalog/event interactions don't even make use of the resource_ids, so it would just be a waste of computation in those cases, but it would give the user the option to ensure correct resource_id behavior when it is needed. It could be rolled up into a general function called something like "validate" that would also try to detect/fix other issues that might occur in event objects.

I will push the work I have so far, but it still will require a bit of cleanup. If anyone has any suggestions they would be much appreciated.

@d-chambers
Member Author

d-chambers commented Apr 15, 2018

I spent some time revisiting this and optimizing, and I am pleased with the results. Doing the event scoping of the resource_ids still has a cost, but it is now not all that much. From my profiling on Python 3.6 and Ubuntu 16.04 I get the following (no event scoping / event scoping):

read quakeml: 13.3 / 13.6 ms
copy catalog: 4.8 / 5.07 ms
unpickle: 1.71 / 2.21 ms

Here is the ipython notebook I used to do the profiling. In order to run it you need to be on this branch with your current obspy installation.
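
A rough equivalent of those three measurements using timeit (an assumption; the original notebook is not reproduced here):

import copy
import pickle
import timeit

import obspy

cat = obspy.read_events()  # default example catalog
n = 100
t_read = timeit.timeit(lambda: obspy.read_events(), number=n) / n
t_copy = timeit.timeit(lambda: copy.deepcopy(cat), number=n) / n
data = pickle.dumps(cat)
t_unpickle = timeit.timeit(lambda: pickle.loads(data), number=n) / n
print(t_read, t_copy, t_unpickle)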

It is interesting that unpickling an event object takes less than half the time that copying does, but that is an issue for a different day.

@d-chambers d-chambers added this to the 1.2.0 milestone Apr 19, 2018
@d-chambers d-chambers added the ready for review label Apr 19, 2018
@krischer krischer left a comment

First of all sorry for being so late to review this but I just could not find the time.

I'd feel pretty comfortable merging this as the tests are very comprehensive. The only downside I see is that this is really hard to understand - I've been staring at it for a while now and still don't fully get all aspects. But I do see that this is definitely necessary to get the references to finally work correctly.

I rebased on top of the latest master, force-pushed, and made some small additional changes. Aside from this I have some clarification questions/some small suggestions.

I'm gonna look at this a bit more tomorrow but would tend to merge it fairly soon if there are no objections.

Thanks a lot for this @d-chambers!

@@ -639,6 +638,111 @@ def buffered_load_entry_point(dist, group, name):
return _ENTRY_POINT_CACHE[hash_str]


def yield_obj_parent_attr(obj, cls=None, is_attr=None, has_attr=None):
Member

This is no longer needed with the specialized form below. Are you fine with removing it? If we ever need it again it's still in the history.

Member Author

If there aren't any issues I would like to keep it; I find this function very useful for traversing Catalog and Inventory instances and am planning to use it in #2154. I can always re-implement it later if that would be cleaner. It is used in several places in the new tests anyhow.

Member

That argument is good enough for me. I'll prefix it with an underscore to mark it as private, just so we don't commit to maintaining it and are free to change its API.

if obj is not None and (hasattr(obj, '__dict__') or
                        isinstance(obj, (list, tuple))):
    id_tuple = (id(obj), id(parent))
    if id_tuple not in ids:
Member

Stylistically it's IMHO a bit nicer to do the inverse test and continue early - then the following code can be dedented one level.

Member Author

Good idea, but this is a bit tricky; this line is not in the scope of a loop, so the continue keyword isn't valid. Since this is a generator we can just return (which is the new way of raising StopIteration according to PEP 479). I think this will work in 2.7; let's see what the CI says.
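
For illustration, the pattern in question (per PEP 479, a bare return in a generator simply ends the iteration):

def gen(items):
    if items is None:
        return  # ends the generator instead of raising StopIteration
    for item in items:
        yield item

assert list(gen(None)) == []
assert list(gen([1, 2])) == [1, 2]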

Member

Ah yea, sorry, you are right. CI seems to like it on 2.7 (which I think works only when one does not return a value).


ids = set() # id cache to avoid circular references

def func(obj, attr=None, parent=None):
Member

It's slightly confusing that this function takes obj, attr, parent but the iterator yields obj, parent, attr. But yea - super minor internal detail, so feel free to ignore it.

Member Author

This function is only used in the _yield_resource_id_parent_attr scope, but I agree it would be more consistent to switch the order.

def __set__(self, instance, value):
    # if an object was passed, use the id of the object for hash
    if value is not None:
        if not isinstance(value, (int, str, native_str)):
Member

Is this important? I feel like this even introduces (a very very very tiny) chance of collision if an int happens to be identical to an object's id.

Member Author

Yes, some of the objects used as parents are (rightfully) not hashable. For example:

import obspy.core.event as ev
hash(ev.Event())  # raises TypeError

This little hack gets around it. In this case, we are actually just interested in ensuring the key objects are unique (in memory), so the mutability considerations that normally should guide whether an object is hashable are not really applicable.

You bring up a good point about possible collisions, although I think it would have to be deliberate, or just a very, very bad day. I guess if we are concerned about it I could also check if value is an int and cast it to a str or float, so that the only way an int gets into the hash table is through this if clause? It may incur a slight performance hit though.

Member Author

Also, looking at the flow of the program, I would be surprised if an int ever makes it here, but it wouldn't be impossible.

Member

This little hack gets around it. In this case, we are actually just interested in ensuring the key objects are unique (in memory), so the mutability considerations that normally should guide whether an object is hashable are not really applicable.

I was thinking more of just always hashing id(value). I agree that we don't really have to worry about this minuscule chance, and it could still happen with any hash.

>>> # Deleting it, or letting the garbage collector handle the object will
>>> # invalidate the reference.
>>> del event
>>> print(res_id.get_referred_object()) # doctest: +SKIP
Member

Why do we skip this?

Member Author

unskipped

>>> del obj_b # obj_b gets garbage collected
>>> assert ref_b.get_referred_object() is obj_a # doctest: +SKIP
>>> del obj_a # now no object with res_id exists
>>> assert ref_b.get_referred_object() is None # doctest: +SKIP
Member

Why is it being skipped?

Member Author

unskipped

del r1
self.assertEqual(r2.get_referred_object(), t1)
self.assertIn(r2._resource_key, r_dict)
# Deleting the second one should (r_dict should not be empty)
Member

s/not/now

Member Author

Sorry, I am not sure what you mean.

Member

Sorry, I should have been clearer. s/SEARCH/REPLACE is sed-style syntax for searching and replacing strings. It should read: r_dict should now be empty.

if self._object_id is not None:
    msg = ("The object with identity of: %d no longer exists, "
           "returning the most recently created object with a"
           " resource id of: %s") % (self._object_id, self.id)
Member

This warning message is not fully correct as it would also be triggered if the object is found in the parent scope.

Member Author

fixed

self._object_id = id(referred_object)
# Use parent-scoping feature if parent is not None.
if parent is not None:
    self._parent_key = parent
Member

What would happen if this is called multiple times with different parent objects?

Member Author

I am trying to remember (I had this really well thought out at one point), but I think this does happen when an event is copied. I don't think this will cause issues, since the resource_id instance will then point to the correct object in the new scope.

        warn=False)
else:
    resource_id._parent_key = self
    resource_id._object_id = None
Member

How are things like the derived origin ids scoped after this function has been called? _parent_key would be set, but _parent_id_tree and things like that might not be set. But I might just be misunderstanding some things.

Member Author

Any object that is referenced by a resource_id should have the attribute resource_id. Lines 332-334 take care of this case, which involves logging the identities of all the objects with resource_id attributes into the proper place in the _parent_id_tree. Then, resource ids that refer to other objects in the same event (but not their parent), such as the pick_id of the Arrival object, will find the correct object when get_referred_object is called, because it was previously put into _parent_id_tree. Because _object_id has been set to None it will not raise a warning. I added another test to specifically check for warnings being raised in this case.

Member

👍 Thanks for the explanation. After spending some more time I think I now fully understand your changes and how everything works. Nice job!

@d-chambers
Member Author

No problem, thanks for taking a look at it. I really hope this finally fixes all the issues (it seems to have worked so far).

The only downside I see is that this is really hard to understand.

I agree. It is more complicated than I would like, but I couldn't come up with something simpler that still solved all the edge cases. The basic gist, which I am sure you know, is to have each id scoped to some parent, in this case the event object. The only way I found to reliably do this is to wait until an event and all related attributes are created, then recurse and set each resource_id in the event to the correct parent.

@krischer
Member

krischer commented Jul 4, 2018

I cleaned up a bit more and fixed two things that got lost during my rebase. Double-checking could not hurt, but I think I found everything that was missing.

Also, all event plugins now scope the resource identifiers, just to make sure everything works.

I still have trouble understanding the whole _ResourceKey + descriptor business. I see that the _ResourceKey object is a necessary wrapper so it can be used in the weakref dictionaries. But why does it have to be a singleton? It's only used for dictionary keys, so why does a deterministic hash like hash(id(value)) not work? Also, where do the objects live? They don't seem to get garbage collected as I would expect, see e.g.

In [1]: import obspy

In [2]: cat = obspy.read_events()

In [3]: cat
Out[3]:
3 Event(s) in Catalog:
2012-04-04T14:21:42.300000Z | +41.818,  +79.689 | 4.4 mb | manual
2012-04-04T14:18:37.000000Z | +39.342,  +41.044 | 4.3 ML | manual
2012-04-04T14:08:46.000000Z | +38.017,  +37.736 | 3.0 ML | manual

In [4]: dict(cat[0].resource_id._parent_key._singleton_cache)
Out[4]:
{112059039136: <obspy.core.event.resourceid._ResourceKey at 0x1a173d6908>,
 112059050304: <obspy.core.event.resourceid._ResourceKey at 0x1a173f52e8>,
 112059175544: <obspy.core.event.resourceid._ResourceKey at 0x1a173f5748>,
 'quakeml:eu.emsc/NetworkMagnitude/rts/261017/782476/796628': <obspy.core.event.resourceid._ResourceKey at 0x1a173d6ba8>,
 'quakeml:eu.emsc/NetworkMagnitude/rts/261018/782478/796631': <obspy.core.event.resourceid._ResourceKey at 0x1a173f5438>,
 'quakeml:eu.emsc/NetworkMagnitude/rts/261020/782484/796646': <obspy.core.event.resourceid._ResourceKey at 0x1a173c7b00>,
 'quakeml:eu.emsc/event/20120404_0000038': <obspy.core.event.resourceid._ResourceKey at 0x1a173f5320>,
 'quakeml:eu.emsc/event/20120404_0000039': <obspy.core.event.resourceid._ResourceKey at 0x1a173f5b00>,
 'quakeml:eu.emsc/event/20120404_0000041': <obspy.core.event.resourceid._ResourceKey at 0x1a173d6240>,
 'quakeml:eu.emsc/origin/rts/261017/782476': <obspy.core.event.resourceid._ResourceKey at 0x1a173d66a0>,
 'quakeml:eu.emsc/origin/rts/261018/782478': <obspy.core.event.resourceid._ResourceKey at 0x1a173f53c8>,
 'quakeml:eu.emsc/origin/rts/261020/782484': <obspy.core.event.resourceid._ResourceKey at 0x1a173d3b00>,
 'smi://eu.emsc/unid': <obspy.core.event.resourceid._ResourceKey at 0x1a173f5b38>,
 'smi:eu.emsc-csem/origin_method/NA': <obspy.core.event.resourceid._ResourceKey at 0x1a173c7da0>,
 'smi:smi-registry/organization/DDA': <obspy.core.event.resourceid._ResourceKey at 0x1a173d6eb8>,
 'smi:smi-registry/organization/EMSC': <obspy.core.event.resourceid._ResourceKey at 0x1a173c7978>,
 'smi:smi-registry/organization/NNC': <obspy.core.event.resourceid._ResourceKey at 0x1a173d6a58>}

In [5]: del cat[0]; del cat[1]

In [6]: cat
Out[6]:
1 Event(s) in Catalog:
2012-04-04T14:18:37.000000Z | +39.342,  +41.044 | 4.3 ML | manual

In [7]: import gc; gc.collect()
Out[7]: 249

In [8]: dict(cat[0].resource_id._parent_key._singleton_cache)
Out[8]:
{112059039136: <obspy.core.event.resourceid._ResourceKey at 0x1a173d6908>,
 112059050304: <obspy.core.event.resourceid._ResourceKey at 0x1a173f52e8>,
 112059175544: <obspy.core.event.resourceid._ResourceKey at 0x1a173f5748>,
 'quakeml:eu.emsc/NetworkMagnitude/rts/261017/782476/796628': <obspy.core.event.resourceid._ResourceKey at 0x1a173d6ba8>,
 'quakeml:eu.emsc/NetworkMagnitude/rts/261018/782478/796631': <obspy.core.event.resourceid._ResourceKey at 0x1a173f5438>,
 'quakeml:eu.emsc/NetworkMagnitude/rts/261020/782484/796646': <obspy.core.event.resourceid._ResourceKey at 0x1a173c7b00>,
 'quakeml:eu.emsc/event/20120404_0000038': <obspy.core.event.resourceid._ResourceKey at 0x1a173f5320>,
 'quakeml:eu.emsc/event/20120404_0000039': <obspy.core.event.resourceid._ResourceKey at 0x1a173f5b00>,
 'quakeml:eu.emsc/event/20120404_0000041': <obspy.core.event.resourceid._ResourceKey at 0x1a173d6240>,
 'quakeml:eu.emsc/origin/rts/261017/782476': <obspy.core.event.resourceid._ResourceKey at 0x1a173d66a0>,
 'quakeml:eu.emsc/origin/rts/261018/782478': <obspy.core.event.resourceid._ResourceKey at 0x1a173f53c8>,
 'quakeml:eu.emsc/origin/rts/261020/782484': <obspy.core.event.resourceid._ResourceKey at 0x1a173d3b00>,
 'smi://eu.emsc/unid': <obspy.core.event.resourceid._ResourceKey at 0x1a173f5b38>,
 'smi:eu.emsc-csem/origin_method/NA': <obspy.core.event.resourceid._ResourceKey at 0x1a173c7da0>,
 'smi:smi-registry/organization/DDA': <obspy.core.event.resourceid._ResourceKey at 0x1a173d6eb8>,
 'smi:smi-registry/organization/EMSC': <obspy.core.event.resourceid._ResourceKey at 0x1a173c7978>,
 'smi:smi-registry/organization/NNC': <obspy.core.event.resourceid._ResourceKey at 0x1a173d6a58>}

@d-chambers
Member Author

Why does the _ResourceKey have to be a singleton? Why not use hash(id(value))?

  1. The _ResourceKey is a trick to enable the Python gc to handle some of the work the resource_id was doing (unreliably) on its own before in the __del__ method, so we need to use the _ResourceKey instances in the weakref dicts. The _ResourceKey.get_resource_key method handles two distinct cases:

a. An object that is not an int/str is passed. The _ResourceKey instance needs to be unique to this object only (using its identity), and a different _ResourceKey instance is returned if a similar (even equal) object is passed to _ResourceKey.get_resource_key. This enables the parent scoping of equal yet distinct Event objects, for example.

b. An object that is an int/str is passed (note that ResourceIdentifier.id will always be a str, or a TypeError is raised by the setter). The same _ResourceKey instance should be used for equal ints/strs that have different identities. This enables different ResourceIdentifier instances with the same id to refer to the same object. The strings in the following example should return the same instance of _ResourceKey if passed to _ResourceKey.get_resource_key. If we used the identity in the hash this would not work.

a = "'lion roars at stale pull requests'"
b = "'lion roars at stale pull requests'"
assert a == b
assert id(a) != id(b)
  2. I suppose _ResourceKey could be a non-singleton as long as it implemented a hash based on the correct input (the identity of an object, or the str/int), but this could greatly increase the number of _ResourceKey instances that Python has to create/manage in the weak value dict, as well as require an additional setattr. All of this could result in a performance/memory usage hit. In this case I figured it is better to do more with less.

I should note, too, that I don't think singleton is exactly the right term (but I don't know a better one), since a true singleton should only have one instance, not just one instance per unique input.
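
To make the two cases concrete, a hypothetical simplification of the get_resource_key dispatch (not the exact ObsPy code):

import weakref

class _ResourceKeySketch(object):
    _singleton_cache = weakref.WeakValueDictionary()

    @classmethod
    def get_resource_key(cls, value):
        # case b: equal str/int ids share one key instance;
        # case a: anything else is keyed on its identity
        key = value if isinstance(value, (int, str)) else id(value)
        try:
            return cls._singleton_cache[key]
        except KeyError:
            inst = cls()
            cls._singleton_cache[key] = inst
            return inst

k1 = _ResourceKeySketch.get_resource_key('smi:local/a')
k2 = _ResourceKeySketch.get_resource_key('smi:local/a')
assert k1 is k2  # same id string -> same key instance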

Where do the _ResourceKey objects live?

The only strong references to these should be attached to the ResourceIdentifier instances as the _parent_key or _resource_key attributes. In your example I think there is some caching going on in the read_events call, as the following simpler case does work as expected:

import gc

import obspy.core.event as ev

event = ev.Event()
print(dict(ev.ResourceIdentifier._parent_id_tree))  # dict with one entry

del event
gc.collect()

print(dict(ev.ResourceIdentifier._parent_id_tree))  # empty dict

In fact, if we count the references, there seem to be many in the default catalog that are being held onto by something (which is why the del isn't working as you expected):

import gc
import sys

import obspy
import obspy.core.event as ev

def print_ref_counts(obj):
    rc = sys.getrefcount(obj)
    # subtract 3 due to temporary reference created in this function/getrefcount
    print(f'obj has {rc - 3} refcounts')

event = ev.Event()
print_ref_counts(event.resource_id._parent_key)  # prints "obj has 1 refcounts"

cat = obspy.read_events()
print_ref_counts(cat[0].resource_id._parent_key)  # prints "obj has 13 refcounts"

What is the deal with the descriptor?

Since there is some logic required to figure out the key for caching the _ResourceKey instances (selecting the object identity or the str/int), the descriptor just lets me put that logic in one place rather than everywhere a value is set on _parent_key or _resource_key. Of course it could be implemented as a getter/setter (which I think is the same thing in the end) but since there are two of them I figured it would be better to have an external descriptor to avoid writing the same code twice.
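
A minimal sketch of that descriptor idea (illustrative names, not the exact ObsPy code); both attributes funnel their assignments through one conversion function:

def _make_key(value):
    # stand-in for the real _ResourceKey lookup
    return value if isinstance(value, (int, str)) else id(value)

class _ResourceKeyDescriptor(object):
    def __init__(self, storage_name):
        self._attr = storage_name  # where the converted key is stored

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return instance.__dict__.get(self._attr)

    def __set__(self, instance, value):
        if value is not None:
            value = _make_key(value)
        instance.__dict__[self._attr] = value

class ResourceIdentifierSketch(object):
    _parent_key = _ResourceKeyDescriptor('_parent_key_value')
    _resource_key = _ResourceKeyDescriptor('_resource_key_value')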

@d-chambers
Member Author

Another thing to perhaps do in this PR, or maybe in a new one, is to rework convert_id_to_quakeml_uri to return a copy of the resource_id with the new id rather than modifying it in place. Since the resource_ids are hashed on the id attr we should treat them as immutable.

@krischer
Member

krischer commented Jul 4, 2018

Thanks a bunch for the explanations! I especially did not get the distinction between 1 a) and 1 b) but now everything is clear.

All of this could result in a performance/memory usage hit. In this case I figured it is better to do more with less.

I actually tested this and it is indeed a small but measurable bit (~3%) slower when running test_events.py.

Of course it could be implemented as a getter/setter (which I think is the same thing in the end) but since there are two of them I figured it would be better to have an external descriptor to avoid writing the same code twice.

Yea that's the same thing. Python internally implements properties as descriptors. I do like the descriptor approach as it (as you say) forces a single implementation.

Another thing to perhaps do in this PR, or maybe in a new one, is to rework convert_id_to_quakeml_uri to return a copy of the resource_id with the new id rather than modifying it in place. Since the resource_ids are hashed on the id attr we should treat them as immutable.

👍 for dealing with this in a separate PR. We have to be a bit careful here and maybe raise a few new warnings. Some people might use ResourceIdentifiers as dictionary keys and then write out a QuakeML file, which (in your proposal) would create new objects, and the dictionary keys might thus become meaningless. But let's discuss it elsewhere.
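
To illustrate the hazard (assumed behavior under the proposal, given that resource_ids are hashed on the id attribute):

from obspy.core.event import ResourceIdentifier

rid = ResourceIdentifier('smi:local/a')
lookup = {rid: 'some value'}
# a hypothetical converted copy with a new id no longer matches
# the original dictionary key
converted = ResourceIdentifier('quakeml:local/a')
assert converted not in lookup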

Regarding the references: I investigated a bit more (gc.get_referrers() is actually pretty helpful) and it turns out that all the spurious references are coming from IPython, especially all its magic like the jedi code completion. Just running it in the normal interpreter completely fixes all my issues, and the high refcount in your example comes from internal references (mostly _parent_keys). So all is well.

import gc
import obspy
from obspy.core.event.resourceid import _ResourceKey

cat = obspy.read_events()
print(len(_ResourceKey._singleton_cache))

del cat[0]
gc.collect()
print(len(_ResourceKey._singleton_cache))

del cat
gc.collect()
print(len(_ResourceKey._singleton_cache))

Output:

17
12
0

@d-chambers If you have no objections I would merge this now.

@d-chambers
Member Author

I have no objections, go ahead and merge 🎉

@d-chambers
Member Author

Actually hold off merging for now. I think I found another (small) issue.

@d-chambers
Member Author

Ok, I made a slight change to the logic that governs when a warning is issued and added a test for it. If you approve of the change go ahead and merge.

Thanks again @krischer for all the time you spent reviewing this!

@krischer krischer merged commit a2689ca into master Jul 5, 2018
@krischer krischer deleted the resource_id_refactor branch July 5, 2018 08:21
@krischer
Member

krischer commented Jul 5, 2018

🎉

Thanks @d-chambers!
