
Request/question: Slow QuakeML reading #1910

Open
calum-chamberlain opened this issue Oct 6, 2017 · 61 comments
Labels
.io.quakeml question Questions asked by users. Please use https://discourse.obspy.org for asking new questions

Comments

@calum-chamberlain
Contributor

QuakeML reading is quite slow - in my naive test, about 10x slower than writing. For large catalog files this becomes a bit of an issue.

The main point of this issue is to ask:

  1. Is there a faster IO alternative already implemented?
  2. It looks to me like the main slow-down is the loop over events in obspy.io.quakeml.core.Unpickler._deserialize, I wonder if this can be a threaded operation?
  3. If none of the above, does anyone have any ideas for speeding up the reading of QuakeML files to ObsPy Events?

If any suggestions are made I'm happy to undertake the work to do it.

@calum-chamberlain added the .io.quakeml and question labels Oct 6, 2017
@megies
Member

megies commented Oct 6, 2017

Is there a faster IO alternative already implemented?

Not to my knowledge.

It looks to me like the main slow-down is the loop over events in obspy.io.quakeml.core.Unpickler._deserialize, I wonder if this can be a threaded operation?

Sounds like a pretty good idea. The FDSN client.py has threading for the individual endpoints and might be a good starting point for this.

@megies megies added this to the 1.2.0 milestone Oct 6, 2017
@calum-chamberlain
Contributor Author

Thanks for that @megies, I will have a look there and crack on with trying to speed it up.

@megies
Member

megies commented Oct 6, 2017

No idea if there are hidden pitfalls I don't see right now with multiple threads working on the xml object, but I guess it shouldn't be too much work to set it up and see what happens.

@krischer
Member

krischer commented Oct 6, 2017

Threading will not do the trick, I fear - Python's GIL prevents threads from actually manipulating multiple Python objects at the same time. It is really only worthwhile for I/O-bound tasks (the FDSN client downloads a number of files at initialization time, and this actually happens in parallel).

I also think that reading QuakeML files is way too slow and we should really improve it. For a while I thought it was due to all the objects that are created, but I don't think that is really the case. This example here is a fairly large catalog with around 6000 events, once in ZMAP and once in QuakeML (the QuakeML file has been created from the ZMAP one, so the information content and object count should be comparable in both).

In [6]: %timeit obspy.read_events("./last_three_years_around_livermore.xml")
14.5 s ± 426 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit obspy.read_events("./last_three_years_around_livermore.zmap")
2.91 s ± 877 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So I guess there is something inefficient in the way we parse things in the quakeml plugin - it is not lxml:

In [11]: %timeit etree.parse("./last_three_years_around_livermore.xml")
108 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So before turning to parallelization I would do some careful benchmarking first to see where the time is actually spent. We might have to do a fairly significant re-engineering of how QuakeML files are parsed to make it faster.

Another thing that might make object creation faster would be to set the __slots__ attribute for all the event classes - this would also affect the zmap parsing.

If we really need to parallelize things, we have to use the multiprocessing module. One way to split up the work would be to utilize etree.iterparse(), which is blazingly fast, to cut out events and distribute them across cores. But I would rather view this as a last resort, as it significantly increases the complexity, and there is also a cost to forking processes, as well as to later transferring the created objects back to the original process.
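The iterparse splitting idea could look roughly like this - a sketch using the stdlib ElementTree API (lxml's etree.iterparse has the same shape and additionally accepts a tag= filter); the QuakeML 1.2 BED namespace and the function name are illustrative:

```python
import xml.etree.ElementTree as ET

# Rough sketch of the last-resort idea: stream through a QuakeML file and
# yield each <event> element as a standalone bytestring that a worker
# process could deserialize independently.
QML_NS = "http://quakeml.org/xmlns/bed/1.2"

def iter_event_chunks(source):
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "{%s}event" % QML_NS:
            yield ET.tostring(elem)
            elem.clear()  # free memory for already-processed events

# Each chunk could then go to multiprocessing.Pool.map, at the cost of
# forking workers and pickling the created objects back afterwards.
```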

@d-chambers
Member

Hey @calum-chamberlain,

If you haven't used it, I recommend checking out snakeviz, it makes profiling things like this a delightful experience.

I did something similar trying to speed up reading miniseed files (#1845) and was able to find some low hanging fruit very quickly. This notebook is a simple example.
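For anyone following along, the cProfile workflow feeding snakeviz is roughly as follows (profile_call is just an illustrative helper, not an ObsPy or snakeviz API):

```python
import cProfile
import io
import pstats

def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return the top of the cumulative-time report."""
    pr = cProfile.Profile()
    pr.enable()
    func(*args, **kwargs)
    pr.disable()
    stream = io.StringIO()
    pstats.Stats(pr, stream=stream).sort_stats("cumulative").print_stats(10)
    return stream.getvalue()

# e.g. report = profile_call(obspy.read_events, "big_catalog.xml")
# To browse interactively, dump the stats to a file with
# pr.dump_stats("read.prof") and then run `snakeviz read.prof` from a shell.
```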

@krischer krischer reopened this Oct 6, 2017
@calum-chamberlain
Contributor Author

Thanks for those comments guys - I thought I could get around the GIL with more direct access to lxml (which I think releases the GIL?), but it would require some strange in-between stages that I'm not that keen on.

Thanks @d-chambers I haven't used snakeviz - will do and do something similar with a notebook and share results here!

@calum-chamberlain
Contributor Author

On that note, while I'm playing around with it, I may have a look at memory usage for writing QuakeML files too - currently it seems like quite an expensive operation - serialising an entire (large) catalog seems to use a lot of additional memory.

@krischer
Member

krischer commented Oct 6, 2017

Thanks for those comments guys - I thought I could get around the GIL with more direct access to lxml (which I think releases the GIL?), but it would require some strange in-between stages that I'm not that keen on.

Might not be worth it - in the measurements I made above the lxml parsing was less than 1 percent of the full runtime.

@megies
Member

megies commented Oct 7, 2017

I did something similar trying to speed up reading miniseed files (#1845) and was able to find some low hanging fruit very quickly. This notebook is a simple example.

Nice! @d-chambers, it would be great if you could link this from https://github.com/obspy/obspy/wiki/Testing%2C-Debugging-%26-Profiling

@d-chambers
Member

Hey @megies, I added a bit in the wiki on this.

@calum-chamberlain
Contributor Author

Okay, I have done some timing and some initial playing, with little success. The results are in the gist I posted earlier - re-linked here.
The main point is that calls to lxml.etree.xpath take the majority of the time. It is my understanding that we could make use of the XPath class to speed things up a bit there, but probably not by a huge amount - I got about a 3% speed-up by setting up XPath callables for external lookups. That doesn't cover a large fraction of the xpath calls, but implementing them all within a notebook is impractical.

Of the calls to obspy.io.quakeml.core._xpath, about 15% of the time is spent processing hasattr, which I think is because namespace is not passed to _xpath, so we spend time looking up the namespace for each element and subelement of the tree. Is this necessary? Is there a case where subelements of a QuakeML file have different namespaces from the whole tree? Otherwise this is a simple speed-up case, but it requires changes in lots of places.

Do you think it is worth me going ahead and making those changes in a branch and profiling that? I would probably write my own Unpickler outside of obspy first so that I can directly compare the two.
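The XPath-class idea mentioned above can be sketched like this: compile an expression once and reuse the callable for every event, instead of passing a fresh expression string to element.xpath() each time. The namespace below is QuakeML 1.2 BED; find_picks and pick_elements are illustrative names, not ObsPy API:

```python
from lxml import etree

NSMAP = {"q": "http://quakeml.org/xmlns/bed/1.2"}

# Compiled once at import time, reused for every event element.
find_picks = etree.XPath("q:pick", namespaces=NSMAP)

def pick_elements(event_elem):
    # Reuses the precompiled expression; avoids re-parsing the XPath string
    # on every call.
    return find_picks(event_elem)
```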

@megies
Member

megies commented Oct 9, 2017

Is there a case where subelements of a QuakeML file have different namespaces to the whole tree?

Yes, there are such cases. See http://docs.obspy.org/master/tutorial/code_snippets/quakeml_custom_tags.html?highlight=extra

@calum-chamberlain
Contributor Author

Ah, fair enough - I will not pursue that strand any further then. I need to do some more reading on efficient ways to read in large xml files.
On the note of the __slots__ attribute: this would probably speed up creating new objects, but it looks like it would reduce the flexibility of the objects, which would again get in the way of custom attributes for Event subclasses...

@megies
Member

megies commented Oct 10, 2017

custom attributes for Event subclasses...

Custom attributes that are handled by QuakeML I/O are all stored consistently under extra. That said, right now users can also attach their own arbitrary attributes to event-type objects, which can be used in custom workflows.

@krischer
Member

This is a trade-off I would be willing to make to be honest.

@krischer
Member

If the xpath is the slow part: it might be possible to just parse everything upfront into a dictionary and then operate on that with try/except calls, but I'm not sure if that would actually be a lot faster.

@megies
Member

megies commented Oct 10, 2017

This is a trade-off I would be willing to make to be honest.

I agree, being able to attach arbitrary attributes to those objects should not be top-priority.

@d-chambers
Member

Here is a good SO question explaining slots in more detail, and here is the official documentation on the subject for Python 3.

It seems the biggest problem with slots is that it will break multiple inheritance. I am not sure if any users would be sub-classing any of the event objects with multiple inheritance (potentially adding mixins?), but it is not inconceivable.

Also, if you need dynamic behaviour you can add "__dict__" to the slots. I would be in favor of doing this, as I have several pieces of code that dynamically add methods to these objects, but this is probably not that common. If you need weak reference support (which I believe the ResourceIdentifiers depend on, so you will), you can add "__weakref__" to the slots as well.

However, it is doubtful that defining slots will actually speed up object instantiation considerably, as slots is primarily designed to save memory. The instantiation speed-ups may be even smaller after defining __dict__ and __weakref__. For example, if we take the simplest case:

# Using Python 3.6.2 on Ubuntu 16.04 (IPython session)

class Slots:
    __slots__ = ()

class SlotsDictWeakRef:
    __slots__ = ('__dict__', '__weakref__')

class NoSlots:
    pass

%timeit [Slots() for _ in range(1_000_000)]             # best: 160 ms
%timeit [SlotsDictWeakRef() for _ in range(1_000_000)]  # best: 182 ms
%timeit [NoSlots() for _ in range(1_000_000)]           # best: 206 ms

We really don't see much improvement. However, in a more complicated object instantiation the results may vary.

@megies
Member

megies commented Oct 11, 2017

event objects with multiple inheritance (potentially adding mixins?) but it is not inconceivable

Actually, I'm doing exactly that in some code of mine (over at megies/obspyck). ^^

And yeah, slots will probably only do much in terms of memory.

@calum-chamberlain
Contributor Author

calum-chamberlain commented Jul 15, 2018

So after talking with @krischer at SSA, I finally had time to try a different implementation that converts the xml to a dictionary immediately, then makes the objects from the dictionary. This implementation is in this branch. At the moment I haven't finished implementing the _extra attributes, but, I think, namespaces stick with the attributes at the moment, so adding that in shouldn't be too difficult. All other tests pass at the moment though (not that that means it is perfect).

This is a really naive implementation of this, and I think it could be done a lot better, but I wanted to see if this made any difference to speed. In a quick and dirty test, reading a 1,000 event catalog (downloaded from GeoNet NZ, so no focal mechanism information) takes 54s on my machine (average of 7 runs), versus 2min 21s for the current master implementation (again, average of 7 runs). This is after I removed the _extra bit from master so that the comparison was fair.

I'm keen to get opinions on this, and feedback on how this should/could be done if we go down this route, and if anyone sees any obvious issues with doing it this way. I have reused all of the sub-methods that deserialise used to call, and I haven't touched the Pickler.

@calum-chamberlain
Contributor Author

I imagine there are lots of use cases for using catalogs. Personally, a common use-case I have is downloading waveforms for events and cutting them around picks (for templates in matched-filters), e.g. (pseudocode):

streams = []
for event in catalog:
    bulk = []
    for pick in event.picks:
        wid = pick.waveform_id
        # get_waveforms_bulk expects (network, station, location, channel, t1, t2)
        bulk.append((wid.network_code, wid.station_code, wid.location_code,
                     wid.channel_code, pick.time - 10, pick.time + 90))
    st = client.get_waveforms_bulk(bulk)
    for tr in st:
        pick = lookuppick  # not a thing at the moment: would find the pick matching tr.id
        tr.trim(pick.time - 0.5, pick.time + 5.5)
    streams.append(st)
return streams

I also often use the amplitudes to recalculate magnitudes based on different local magnitude scales, and put my own amplitudes in there. I would have thought most attributes get hit by someone at some time. There are many possible use-cases - I'm not sure if giving any one or a few would be that useful?

I also don't just iterate through things, e.g. the non-existent lookuppick in the pseudocode above would be to get a pick that matches the trace (based on tr.id and pick.waveform_id).

I also filter catalogs based on location and picks based on frequency and other things (which station or channel). Fairly sure that filtering could be done better.

@dsentinel

Thanks @calum-chamberlain! I'll have a look.
Are there performance issues with catalogs at that scale once loaded?
I've tested big inventories (~150 MB): loading is painful - several minutes - but iterating/operating is reasonable.
I think before we start optimizing, we need to define what we're optimizing for.

@calum-chamberlain
Contributor Author

Agreed - I'm certainly pretty happy with working on Catalogs once loaded; I find it quite straightforward to get the parts I care about, and haven't found any noticeable speed or memory limitations once loaded. I'm sure they could be more memory efficient, but in most of the work that I do the Catalogs are smaller than other things (mostly waveforms).

The main issues are the speed of loading, and the memory consumption of writing large Catalogs.

I remain out of my depth with how a dictionary or abc backend to Catalog would work, so maybe someone else has something more useful to add.

@dsentinel

The main issues are the speed of loading, and the memory consumption of writing large Catalogs.

To clarify: When writing large catalogs to quakeml, memory usage is excessively high?

@calum-chamberlain
Contributor Author

@dsentinel, yes - the whole object is serialized in memory before dumping, which is probably the fastest way, but does use a lot of memory to keep both the object and a serialized object in memory. Happy to work around that on the user side though (e.g. don't try and write massive catalogs).
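For the writing-memory side, one possible mitigation (a rough sketch, not current ObsPy behaviour) is lxml's incremental etree.xmlfile writer, which streams elements to disk one at a time instead of serializing the whole document in memory first. serialize_event is a hypothetical helper, and real QuakeML output would also need the proper root element, namespaces, and attributes:

```python
from lxml import etree

def write_catalog_incrementally(catalog, path, serialize_event):
    # Only one serialized <event> element is held in memory at a time,
    # rather than a full serialized copy of the whole catalog.
    with etree.xmlfile(path, encoding="utf-8") as xf:
        with xf.element("eventParameters"):
            for event in catalog:
                xf.write(serialize_event(event))
```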

@d-chambers
Member

I agree with @calum-chamberlain about the IO speed being an issue, but I have yet to run into memory problems, although my workstation does have > 100 GB of RAM. I don't have any issues with performance once the catalog is loaded into memory.

For example, I recently deployed 12 nodes over a mine for about a month. After running a STA/LTA and coincidence filter I detected ~ 12,000 events. I then stored the picks and nothing else in each Event object and saved the catalog to disk. It takes 2 minutes and 44 seconds to load the file with obspy.read_events from master, which is a significant bottleneck to the rest of my workflow.

Having an intermediate dict format would be useful for a few reasons:

  1. It would allow other serialization formats to simply convert the data to a dict; a well-tested dict_to_catalog method could then be used rather than re-implementing all the object creation logic. This would be particularly useful for JSON or for data stored in a JSON-like format in a database (e.g. MongoDB).

  2. It could enable swapping out the catalog backend, or making a lazy catalog, etc. This is something that would be far down the road and doesn't merit a lot of conversation in this thread.

If using an intermediate dict is a significant slowdown, however, maybe it is not worth it.
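The shape of that idea might look something like the following; all names here are hypothetical (not real ObsPy API), with SimpleNamespace standing in for the real Event/Pick classes:

```python
from types import SimpleNamespace

def dict_to_catalog(data):
    # One shared, well-tested builder: every format (QuakeML, JSON,
    # MongoDB documents, ...) only has to produce this plain-dict layout.
    events = []
    for ev in data.get("events", []):
        picks = [SimpleNamespace(**p) for p in ev.get("picks", [])]
        events.append(SimpleNamespace(resource_id=ev.get("resource_id"),
                                      picks=picks))
    return events
```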

@calum-chamberlain
Contributor Author

I do think the memory issue is likely to be a non-issue for most people - it's only been an issue for me for large catalogs (of a few thousand events) on my laptop with 16 GB RAM, while I'm using memory for other things as well, i.e. when I'm being silly.

👍 for having a useful dict_to_catalog and using JSON-like things.

@filefolder
Contributor

filefolder commented Sep 19, 2022

Sorry to revive this, but the problem has gotten a lot worse in the last four years as ML pickers have made catalogs enormous.

Would HDF5 be a viable solution? I think there are ways to generate the proper hierarchy via a schema file, but I have only started looking into it. This idea seems fairly popular in the adaptive meshing community.

I note also that SeisComP appears to be able to read large XML much, much faster via C; to the extent that what I have to do now is import a catalog into SeisComP and then read it back into obspy via localhost FDSN.

(edit: is there a way to "bump" this thread so it appears at the top among the recent issues? Perhaps some fresh eyes could help.)

@calum-chamberlain
Contributor Author

Just for reference, I think the seiscomp code is here. I don't know if/how HDF5 would solve the speed issue. From memory (and my memory is not great), I think the main issues that we had were in deserialising the xml doc. The actual reading of the xml was fairly fast. I don't know if deserialising from an HDF5 file would be faster without other changes.

@d-chambers
Member

In ObsPlus we defined pydantic models which mirror ObsPy's catalog hierarchy. This enables fast translation to/from JSON, and with future improvements to pydantic (the Rust rewrite) it will only get faster, so you might give JSON a try. See the docs for more details.

My first impression is that HDF5 won't perform well if you mirror the hierarchical structure of QuakeML exactly. If, however, you broke the structure down into a collection of tables, it might work really well - but then why not just use CSS and a proper database to begin with? Pisces might help if you want to go the ORM route with this approach.

@filefolder
Contributor

Thanks both for the replies and links! I had not heard of ObsPlus but this looks like the easiest way forward.

Just doing a quick comparison: loading 2621 events (picks, amplitudes, magnitudes, etc.) took 2350 seconds via ObsPy and only 19 seconds via ObsPlus/JSON (having written to binary first). Not sure how it scales at 10x or 100x, but I would imagine as well or better.

e.g.

    import obsplus

    with open('cat.bjson', 'rb') as f:
        cat_json = f.read().decode()  # back to a string
    cat = obsplus.json_to_cat(cat_json)

It seems like ObsPy integration of something similar to this was eventually abandoned (#2210), I can understand why, but it may be worth a rethink at some point.

@d-chambers
Member

d-chambers commented Sep 20, 2022

@filefolder I am glad it works well for you.

It seems like ObsPy integration of something similar to this was eventually abandoned (#2210), I can understand why, but it may be worth a rethink at some point.

I agree with you. There are just two obstacles:

  1. Time: There is never enough of it
  2. Dependencies: There are quite a few people using ObsPy who would rather keep a very short list of dependencies. This way it can run in more environments. To me, ObsPlus is a place for trying out "wouldn't it be cool if ObsPy ..." sorts of ideas that would require one or more significant dependencies (like pydantic, pandas, pytables, etc.).

@megies
Member

megies commented Sep 21, 2022

So, do those 19 secs reading time also include creating all the obspy objects in Python?

@megies
Member

megies commented Sep 21, 2022

what I have to do now is import a catalog into seiscomp and then read it back in obspy via localhost FDSN.

I'm confused - I must be missing something. How would this be faster? If you request event data via FDSN, that's just an additional step compared to just reading the data from a local file?

@filefolder
Contributor

So, do those 19 secs reading time also include creating all the obspy objects in Python?

Yes, both times are from scratch to full access to the catalog, although I suppose saving the JSON version as binary beforehand is cheating a little. I can be more rigorous with my testing next week.

what I have to do now is import a catalog into seiscomp and then read it back in obspy via localhost FDSN.

I'm confused. I must be missing something.. how would this be faster? If you request event data via FDSN thats just an additional step compared to just reading the data from a local file?

It does seem crazy, but it only takes maybe 10 minutes to import a 10k+ catalog (scdb) and then just connect and download it back via FDSN to use in obspy. In contrast, reading directly via obspy is usually a "let it run overnight" operation.

@megies
Member

megies commented Sep 22, 2022

it does seem crazy but it only takes maybe 10 minutes to import a 10k+ catalog (scdb) and then just connect and download it back via FDSN to use in obspy. in contrast reading directly via obspy is usually a “let it run overnight” operation.

I still don't get it - it makes no sense to me. The same operation with added overhead, faster than just reading directly??

The only thing I can think of that might make sense is a slow filesystem like NFS or something.

@d-chambers
Member

I think @dsentinel said the XML parser's poor performance and scaling is due to xpath searching the whole XML document for certain tags for each event.

@megies
Member

megies commented Sep 24, 2022

I think @dsentinel said the xml's parser's poor performance and scaling is due to xpath searching the whole xml document for certain tags for each event.

That could explain how obsplus is faster on a plain read, although it's hardly surprising if you read from some other format (some binary or JSON or something?) where you expect everything to be in the exact place - which will just fail if anything is out of place - as opposed to XML, where everything might be all over the place and the reading is robust.

If obsplus is still parsing XML free-form, and not in some fully compiled code, let us know.

This is just about use cases, to be honest. Most users read some small QuakeML file and it should be robust; then some power users want to read 5k events with tons of picks each, ideally within seconds. It's just two worlds, really.

But that still doesn't explain how an obspy FDSNWS client request could be faster than a simple local obspy read_events - I still want to know what that's about.

@d-chambers
Member

The ObsPlus parser uses pydantic to read JSON, not XML. All the compiled magic happens in pydantic.

I didn't realize the XML elements could be anywhere in the document - I guess that's why we use xpath even though it scales poorly. I wonder if we could instead parse all the elements into a hash table, then just do fast lookups on ids when reconstructing the hierarchy?
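A minimal sketch of that hash-table pass, using the stdlib parser (the publicID attribute name is as in QuakeML; the function name is illustrative):

```python
import xml.etree.ElementTree as ET

def index_by_public_id(root):
    # One linear pass building a publicID -> element map, so that resolving
    # cross-references becomes an O(1) dict lookup instead of an XPath
    # search over the whole document for every event.
    return {elem.get("publicID"): elem
            for elem in root.iter()
            if elem.get("publicID") is not None}
```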

@megies
Member

megies commented Sep 25, 2022

I wonder if we could instead parse all the elements into a hash table, then just do fast lookups on ids when reconstructing the hierarchy?

Sounds like it could help, but it also sounds like quite some work.
