Request/question: Slow QuakeML reading #1910
Not to my knowledge.
Sounds like a pretty good idea. The FDSN
Thanks for that @megies, I will have a look there and crack on with trying to speed it up.
No idea if there are hidden pitfalls I don't see right now with multiple threads working on the xml object, but I guess it shouldn't be too much work to set it up and see what happens.
Threading will not do the trick I fear - Python's GIL prevents threads from actually manipulating multiple Python objects at the same time. It is really only worth it for I/O bound tasks (the FDSN client downloads a number of files at initialization time and this actually happens in parallel).

I also think that reading QuakeML files is way too slow and we should really improve it. For a while I thought that it is due to all the objects that are created, but I don't think that is really the case. This example here is a fairly large catalog with around 6000 events, once in ZMAP and once in QuakeML (the QuakeML one has been created from the ZMAP one, so the information content and the object count in both should be comparable):

```python
In [6]: %timeit obspy.read_events("./last_three_years_around_livermore.xml")
14.5 s ± 426 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit obspy.read_events("./last_three_years_around_livermore.zmap")
2.91 s ± 877 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

So I guess there is something inefficient in the way we parse things in the quakeml plugin - it is not the `lxml` parsing itself:

```python
In [11]: %timeit etree.parse("./last_three_years_around_livermore.xml")
108 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

So before turning to parallelization I would do some careful benchmarking first to see where the time is actually spent. We might have to do a fairly significant re-engineering of how QuakeML files are parsed to make it faster. Another thing that might make object creation faster would be to set the

If we really need to parallelize things we have to use the
Hey @calum-chamberlain, if you haven't used it, I recommend checking out snakeviz, it makes profiling things like this a delightful experience. I did something similar trying to speed up reading miniseed files (#1845) and was able to find some low hanging fruit very quickly. This notebook is a simple example.
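For anyone picking this up later, here is a minimal sketch of the profiling workflow described above. The workload function is a stand-in for `obspy.read_events`, purely so the example is self-contained; only the `cProfile`/`pstats`/snakeviz mechanics are the point.

```python
import cProfile
import io
import pstats

def parse_events(n=200_000):
    # Stand-in workload for obspy.read_events("catalog.xml")
    return [{"id": i, "time": i * 0.01} for i in range(n)]

profiler = cProfile.Profile()
profiler.enable()
events = parse_events()
profiler.disable()

# Dump stats to a file that `snakeviz read_events.prof` can visualize.
profiler.dump_stats("read_events.prof")

# Or inspect the hottest calls directly in the terminal:
stats = pstats.Stats(profiler, stream=io.StringIO()).sort_stats("cumulative")
stats.print_stats(5)
```

Running `snakeviz read_events.prof` then opens an interactive view of where the time actually goes, which is a quick way to find the low hanging fruit mentioned above.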
Thanks for those comments guys - I thought I could get around the GIL with more direct access to lxml (which I think releases the GIL?), but it would require some strange in-between stages that I'm not that keen on. Thanks @d-chambers, I haven't used snakeviz - will give it a go, do something similar with a notebook, and share results here!
On that note, while I'm playing around with it, I may have a look at memory usage for writing QuakeML files too - currently it seems like quite an expensive operation - serialising an entire (large) catalog seems to use a lot of additional memory.
Might not be worth it - in the measurements I made above the lxml parsing was less than 1 percent of the full runtime.
Nice! @d-chambers, it would be great if you linked this to https://github.com/obspy/obspy/wiki/Testing%2C-Debugging-%26-Profiling
Hey @megies, I added a bit in the wiki on this.
Okay, I have done some timing, and initial playing, with little success. The results are in the gist I posted earlier - re-linked here. Of calls for

Do you think it is worth me going ahead and making those changes in a branch and profiling that? I would probably write my own
Yes, there are such cases. See http://docs.obspy.org/master/tutorial/code_snippets/quakeml_custom_tags.html?highlight=extra
Ah, fair enough - will not pursue that strand any further then. I think I need to do some more reading on useful ways to read in large xml files.
Custom attributes that are handled by QuakeML I/O are all stored consistently under
This is a trade-off I would be willing to make, to be honest.
If the xpath is the slow part: it might be possible to just upfront parse everything to a dictionary and then operate on that with
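A minimal sketch of the "parse everything to a dictionary upfront" idea (the element and attribute names here are illustrative, not real QuakeML): one recursive pass converts the tree to nested dicts, after which all lookups are plain dict accesses instead of repeated xpath queries.

```python
import xml.etree.ElementTree as ET

def etree_to_dict(elem):
    """Recursively convert an Element into {attrib, text, children-by-tag}."""
    node = {"attrib": dict(elem.attrib), "text": (elem.text or "").strip()}
    children = {}
    for child in elem:
        # Group repeated tags (e.g. multiple <pick> elements) into lists.
        children.setdefault(child.tag, []).append(etree_to_dict(child))
    node["children"] = children
    return node

xml = """<event publicID="smi:local/ev1">
  <origin><time><value>2018-01-01T00:00:00</value></time></origin>
</event>"""

tree = etree_to_dict(ET.fromstring(xml))
origin = tree["children"]["origin"][0]
time_value = origin["children"]["time"][0]["children"]["value"][0]["text"]
```

The single traversal is O(document size); every later access is O(1), whereas an xpath query per field re-scans part of the tree each time.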
I agree, being able to attach arbitrary attributes to those objects should not be top-priority.
Here is a good SO question explaining slots in more detail, and here is the official documentation on the subject for python 3. It seems the biggest problem with slots is that it can break multiple inheritance. I am not sure if any users would be sub-classing any of the event objects with multiple inheritance (potentially adding mixins?) but it is not inconceivable. Also, if you need dynamic behaviour you can add `'__dict__'` to the slots. I would be in favor of doing this, as I have several codes that dynamically add methods to these objects, but this is probably not that common. If you need weak reference support (which I believe the ResourceIdentifiers depend on, so you will) you can add `'__weakref__'` to the slots as well.

However, it is doubtful that defining slots will actually speed up object instantiation considerably, as slots is primarily designed to save memory. The instantiation speedups may be even smaller after defining `__dict__` and `__weakref__`. For example, if we take the simplest case:

```python
# I am using python 3.6.2 on Ubuntu 16.04

class Slots:
    __slots__ = ()

class SlotsDictWeakRef:
    __slots__ = ('__dict__', '__weakref__')

class NoSlots:
    pass

%timeit [Slots() for _ in range(1_000_000)]             # best 160 ms
%timeit [SlotsDictWeakRef() for _ in range(1_000_000)]  # best 182 ms
%timeit [NoSlots() for _ in range(1_000_000)]           # best 206 ms
```

We really don't see much improvement. However, in a more complicated object instantiation the results may vary.
Actually, I'm doing exactly that in some code of mine (over at megies/obspyck). ^^ And yeah, slots will probably only do much in terms of memory.
So after talking with @krischer at SSA, I finally had time to try a different implementation that converts the xml to a dictionary immediately, then makes the objects from the dictionary. This implementation is in this branch. At the moment I haven't finished implementing the

This is a really naive implementation, and I think it could be done a lot better, but I wanted to see if this made any difference to speed. In a quick and dirty test, reading a 1,000 event catalog (downloaded from GeoNet NZ, so no focal mechanism information) takes 54 s on my machine (average of 7 runs), versus 2 min 21 s for the current master implementation (again, average of 7 runs). This is after I removed the

I'm keen to get opinions on this, and feedback on how this should/could be done if we go down this route, and whether anyone sees any obvious issues with doing it this way. I have reused all of the sub-methods that
I imagine there are lots of use cases for using catalogs. Personally, a common use-case I have is downloading waveforms for events and cutting them around picks (for templates in matched-filters), e.g. (pseudocode):

```python
streams = []
for event in catalog:
    bulk = []
    for pick in event.picks:
        # Items in tuple likely to be in the wrong order
        bulk.append((pick.waveform_id.station_code,
                     pick.waveform_id.network_code,
                     pick.waveform_id.location_code,
                     pick.waveform_id.channel_code,
                     pick.time - 10, pick.time + 90))
    st = client.get_waveforms_bulk(bulk)
    for tr in st:
        pick = lookuppick  # not a thing at the moment
        tr.trim(pick.time - 0.5, pick.time + 5.5)
    streams.append(st)
return streams
```

I also often use the amplitudes to recalculate magnitudes based on different local magnitude-scales, and put my own amplitudes in there. I would have thought most attributes get hit by someone at some time. There are many possible use-cases - not sure if giving any one or a few would be that useful? I also don't just iterate through things, e.g. the non-existent

I also filter catalogs based on location, and picks based on frequency and other things (which station or channel). Fairly sure that filtering could be done better.
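The missing `lookuppick` step in the pseudocode above could be a plain dict built once per event and keyed by the waveform id. This sketch uses stand-in namedtuples rather than obspy's actual `Pick`/`WaveformStreamID` classes, so it is self-contained; the field names mirror obspy's but the classes are hypothetical.

```python
from collections import namedtuple

# Stand-ins for obspy's WaveformStreamID and Pick (field names mirror obspy's).
WaveformID = namedtuple(
    "WaveformID", "network_code station_code location_code channel_code")
Pick = namedtuple("Pick", "waveform_id time")

picks = [
    Pick(WaveformID("NZ", "WEL", "10", "HHZ"), 100.0),
    Pick(WaveformID("NZ", "KHZ", "10", "HHN"), 105.0),
]

# Build the lookup once per event: O(n) build, O(1) per-trace lookup.
pick_lookup = {
    (p.waveform_id.network_code, p.waveform_id.station_code,
     p.waveform_id.location_code, p.waveform_id.channel_code): p
    for p in picks
}

# In the trimming loop, a trace's stats give the same four-part key:
pick = pick_lookup[("NZ", "WEL", "10", "HHZ")]
```

With obspy objects the key would come from `tr.stats.network`, `tr.stats.station`, `tr.stats.location`, `tr.stats.channel` in the inner loop.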
Thanks @calum-chamberlain! I'll have a look. |
Agreed - I'm certainly pretty happy with working on Catalogs once loaded; I find it quite straightforward to get the parts I care about, and haven't found any noticeable speed or memory limitations once loaded. I'm sure they could be more memory efficient, but in most of the work that I do the Catalogs are smaller than other things (mostly waveforms). The main issues are the speed of loading, and the memory consumption of writing large Catalogs. I remain out of my depth with how a dictionary or abc backend to Catalog would work, so maybe someone else has something more useful to add.
To clarify: when writing large catalogs to quakeml, memory usage is excessively high?
@dsentinel, yes - the whole object is serialized in memory before dumping, which is probably the fastest way, but does use a lot of memory to keep both the object and a serialized copy in memory. Happy to work around that on the user side though (e.g. don't try and write massive catalogs).
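One user-side workaround sketch for the write-memory issue: slice the catalog and write each chunk to its own QuakeML file, so only one chunk's serialized string lives in memory at a time. obspy `Catalog` objects do support slicing and `write(..., format="QUAKEML")`; the `FakeCatalog` class below exists only so the example runs without obspy installed.

```python
def write_in_chunks(catalog, basename, chunk_size=1000):
    # Serialize each slice separately: peak memory is one chunk's XML
    # string rather than the whole catalog's.
    for i in range(0, len(catalog), chunk_size):
        chunk = catalog[i:i + chunk_size]
        chunk.write(f"{basename}_{i // chunk_size:03d}.xml", format="QUAKEML")

# Stand-in to demonstrate the chunking behaviour without obspy:
class FakeCatalog(list):
    def __getitem__(self, item):
        result = super().__getitem__(item)
        return FakeCatalog(result) if isinstance(item, slice) else result

    def write(self, filename, format=None):
        written.append((filename, len(self)))

written = []
write_in_chunks(FakeCatalog(range(2500)), "big_catalog", chunk_size=1000)
# written now records three files of 1000, 1000 and 500 events
```

Reading the pieces back is then a loop over the files, concatenating the resulting catalogs.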
I agree with @calum-chamberlain about the IO speed being an issue, but I have yet to run into memory problems, although my workstation does have > 100 GB of ram. I don't have any issues with performance once the catalog is loaded into memory. For example, I recently deployed 12 nodes over a mine for about a month. After running a STA/LTA and coincidence filter I detected ~12,000 events. I then stored the picks and nothing else in each

Having an intermediate dict format would be useful for a few reasons:

If it is a significant slowdown to use an intermediate dict, however, maybe it is not worth it.
I do think the memory issue is likely to be mostly a non-issue for most people - it's only been an issue for me for large catalogs (of a few thousand events) on my laptop with 16 GB RAM, while I'm using memory for other things as well, e.g. when I'm being silly. 👍 for having a useful
Sorry to revive this, but the problem has gotten a lot worse in the last four years as ML pickers have made catalogs enormous. Would HDF5 be a viable solution? I think there are ways to generate the proper hierarchy via a schema file, but I have only started looking into it. Seems like this idea is fairly popular in the adaptive meshing community. I note also that Seiscomp appears to be able to read large XML much, much faster via C; to the extent that what I have to do now is import a catalog into seiscomp and then read it back in obspy via localhost FDSN. (edit: is there a way to "bump" this thread so it appears at the top among the recent issues? perhaps some fresh eyes could help)
Just for reference, I think the seiscomp code is here. I don't know if/how HDF5 would solve the speed issue. From memory (and my memory is not great), I think the main issues that we had were in deserialising the xml doc. The actual reading of the xml was fairly fast. I don't know if deserialising from an HDF5 file would be faster without other changes.
In ObsPlus we defined pydantic models which reflect ObsPy's catalog hierarchy. This enables fast translation to/from json, and with future improvements to pydantic (the rust rewrite) it will only get faster, so you might give json a try. See the docs for more details. My first impression is that HDF5 won't perform well if you mirror the hierarchical structure of quakeml exactly. If, however, you broke the structure down into a collection of tables it might work really well, but then why not just use CSS and a proper database to begin with? Pisces might help if you want to go the ORM route with this approach.
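ObsPlus itself uses pydantic for this, but the core idea - mirror the event hierarchy in plain models and round-trip through JSON, which is far cheaper to parse than XML with xpath - can be sketched with stdlib dataclasses. The field names below are illustrative, not ObsPlus's actual schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PickModel:
    time: str
    phase_hint: str

@dataclass
class EventModel:
    resource_id: str
    picks: list

event = EventModel("smi:local/ev1", [PickModel("2018-01-01T00:00:00", "P")])

# Serialize: asdict recurses into nested dataclasses, so one json.dumps
# captures the whole hierarchy.
blob = json.dumps(asdict(event))

# Deserialize: json.loads is a single fast structural parse; rebuilding the
# models is then straightforward attribute assignment.
raw = json.loads(blob)
restored = EventModel(raw["resource_id"],
                      [PickModel(**p) for p in raw["picks"]])
```

pydantic adds validation and (in v2) a compiled Rust core on top of exactly this pattern, which is where the speedup over xpath-driven XML parsing comes from.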
Thanks both for the replies and links! I had not heard of ObsPlus, but this looks like the easiest way forward. Just doing a quick comparison, loading 2621 events (picks, amplitudes, magnitudes, etc.) took 2350 seconds via obspy and only 19 seconds via ObsPlus / json (having written to binary). Not sure how it scales at 10x or 100x, but I would imagine as good or better. e.g.

It seems like ObsPy integration of something similar to this was eventually abandoned (#2210). I can understand why, but it may be worth a rethink at some point.
@filefolder I am glad it works well for you.

I agree with you. There are just two obstacles:
So, do those 19 s of reading time also include creating all the obspy objects in Python?
I'm confused. I must be missing something - how would this be faster? If you request event data via FDSN, that's just an additional step compared to just reading the data from a local file?
Yes, both times are from scratch to full access to the catalog, although I suppose saving the json version as binary beforehand is cheating a little. I can be more rigorous with my testing next week.

It does seem crazy, but it only takes maybe 10 minutes to import a 10k+ catalog (scdb) and then just connect and download it back via FDSN to use in obspy. In contrast, reading directly via obspy is usually a "let it run overnight" operation.
I still don't get it - makes no sense to me. The same operation with added overhead, faster than just reading directly?? The only thing I can think of that might make sense is a slow filesystem like NFS or something.
I think @dsentinel said the XML parser's poor performance and scaling is due to xpath searching the whole xml document for certain tags for each event.
That could explain how obsplus is faster on a plain read, although it's hardly surprising if you read from some other format (some binary or json or something?) where you expect everything to be in the exact place, and which will just fail if anything is out of place, as opposed to xml where everything might be all over the place and the reading is robust. If obsplus is still parsing xml and free form, and not in some fully compiled code, let us know.

This is just about use cases, to be honest. Most users read some small quakeml and it should be robust. And then some power users want to read 5k events with tons of picks each, ideally within seconds. It's just two worlds, really. But it still doesn't explain how an obspy FDSNWS client request should be faster than a simple local obspy
The ObsPlus parser uses pydantic to read json, not xml. All the compiled magic is happening in pydantic. I didn't realize the xml elements could be anywhere in the document; I guess that's why we use xpath even though it scales poorly. I wonder if we could instead parse all the elements into a hash table and then just do fast look-ups on ids for reconstructing the hierarchy?
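A sketch of that id-to-element hash table (the tag and attribute names are illustrative, not exact QuakeML): one linear pass over the parsed document builds the index, after which every cross-reference lookup is O(1) instead of an xpath scan of the whole tree per event.

```python
import xml.etree.ElementTree as ET

xml = """<eventParameters>
  <event publicID="smi:local/ev1">
    <preferredOriginID>smi:local/or1</preferredOriginID>
    <origin publicID="smi:local/or1"><latitude>-41.3</latitude></origin>
  </event>
</eventParameters>"""

root = ET.fromstring(xml)

# Single pass: index every element carrying a publicID, wherever it sits.
by_id = {el.get("publicID"): el
         for el in root.iter() if el.get("publicID")}

# Resolving a reference is now a dict lookup, not a document-wide search.
event = by_id["smi:local/ev1"]
origin = by_id[event.find("preferredOriginID").text]
lat = origin.find("latitude").text
```

Because the index tolerates elements appearing anywhere in the document, this keeps the robustness that motivated xpath in the first place while removing the repeated full-document scans.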
Sounds like it could help, but it sounds like quite some work too.
QuakeML reading is quite slow - in my naive test, about 10x slower than writing. For large catalog files this becomes a bit of an issue.
The main point of this issue is to ask:
If any suggestions are made I'm happy to undertake the work to do it.