earthworm client: slow stream cleanup #1711
Earthworm's waveserverV returns data in an ordered set of small chunks. The first chunk returned may start before the requested start time, and the last chunk returned may end after the requested end time. The time required to trim the data using Stream.trim() grows linearly with the number of data chunks returned by the server.
This change replaces the call to Stream.trim() with individual calls to trim only the first and the last trace in the Stream.
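The idea behind the change can be sketched as follows. Note this is an illustrative stand-in, not the actual ObsPy code: `FakeTrace` and `trim_edges` are hypothetical names, and a minimal class replaces `obspy.Trace` so the snippet is self-contained.

```python
class FakeTrace:
    """Minimal stand-in for obspy.Trace that records whether trim was called."""
    def __init__(self):
        self.trimmed = False

    def trim(self, starttime=None, endtime=None):
        self.trimmed = True


def trim_edges(traces, starttime, endtime):
    # Only the first trace can start too early and only the last can end
    # too late, so these are the only two that ever need trimming.
    if traces:
        traces[0].trim(starttime=starttime, endtime=endtime)
        traces[-1].trim(starttime=starttime, endtime=endtime)
    return traces


stream = [FakeTrace() for _ in range(5)]
trim_edges(stream, 0, 100)
print(sum(t.trimmed for t in stream))  # prints 2: middle traces untouched
```

With 86k traces, this turns 86k trim calls into two.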
krischer left a comment
Can you please rebase on master (or just cherry pick your two commits on a branch from master) - then the CircleCI failure should also go away.
Aside from that I don't think this fully works because at least the station wildcard case would bypass the new trimming code.
Do you have some measurements of when this actually starts to matter? The trimming would be a no-op for most traces (if there are indeed many), and while Python loops are not the fastest, it would take a very large number of traces before this causes a user-impacting delay. I would assume that the actual network I/O is by far the dominant factor in this method.
If trimming is slow with a no-op we should investigate why that is.
Thanks for the quick review. Sorry I have so much to fix.
We want to migrate some operational code from Matlab to ObsPy and are running into performance problems with the earthworm client. My pull requests attempt to correct problems which are preventing us from using ObsPy. I administer the wave server that the earthworm client uses as its example in the docs, and have no reason to suspect that what I'm seeing is an incompatibility between my server and the code.
As an example, I retrieved 24 hours of data a few different ways and timed how long it took at each stage. The code I used was modeled after the example in the earthworm client docs.
The times are from a single run each, but are consistent with previous testing. The request for 24 hours of data returned 86,401 traces and what I found is in the table below.
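For reference, per-stage timing of this kind can be collected with a small wrapper like the one below. This is a hypothetical harness, not the code actually used; the commented call follows the shape of the earthworm client docs example, with the station and channel names as placeholders.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock time for this stage, return its result."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - t0:.2f} s")
    return result


# Usage against the earthworm client would look roughly like:
#   st = timed("retrieve", client.get_waveforms,
#              "AV", "ACH", "--", "EHE", t, t + 86400)
#   timed("cleanup", st.trim, t, t + 86400)
print(timed("add", lambda a, b: a + b, 1, 2))  # prints the timing line, then 3
```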
I haven't looked at the cleanup code, but it appears to scale much worse than linearly. Over 11 minutes to clean up 24 hours of wave data is consistent with my other testing.
I'm not sure why a no-op would take 39 seconds, even when called 86k times. Perhaps this is revealing another problem? If there's a better way to resolve the performance problem, I'm happy to use it.
Good call on the station wildcard. The method returns early when provided a wildcard. My code doesn't break, but it also doesn't change the trim behavior.
I'll rebase on master and also apply my change to the trim behavior when a wildcard is provided.
Thanks for your measurements, which uncover quite a few issues! None of the core developers regularly uses earthworm, so we were not aware of them.
86k traces is admittedly a bit of an extreme case, but everything here is definitely too slow and needs to be faster. ObsPy and Python are also not really designed for these kinds of workloads, which is also the reason why we don't, for example, expose record-level MiniSEED to users.
In most practical cases data has only very few traces so we were not really aware of the problems.
Regarding the proper fix for the waveserver client: IMHO the code should be refactored quite a bit to first collect everything in a buffer, then treat it as a file and parse it with a single read call, roughly:

```python
with io.BytesIO() as buf:
    while data:
        buf.write(data)
        # ... receive the next chunk into `data` ...
    buf.seek(0, 0)
    st = read_file(buf)
```
This (if done properly) would eliminate almost all the overhead and runtime would pretty much be the download time + some parsing time. ObsPy reads an equivalent MiniSEED file in 0.3 seconds.
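The buffering idea can be shown as a runnable sketch. The chunk data and the single-pass parse (`buf.read()`) are stand-ins; in the real client the chunks would come from the socket and the buffer would go to a proper parser.

```python
import io


def collect(chunks):
    """Write every received chunk into one in-memory buffer, then hand the
    whole buffer to a single parsing pass instead of parsing chunk by chunk."""
    buf = io.BytesIO()
    for data in chunks:   # stand-in for the socket recv loop
        buf.write(data)
    buf.seek(0, 0)
    return buf.read()     # stand-in for read_file(buf)


payload = collect([b"abc", b"def", b"ghi"])
print(payload)  # prints b'abcdefghi'
```

The per-chunk cost drops to a single buffered write, and all the expensive object creation happens once.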
Do you know the format that the waveserver serves? Judging from a quick look at the code it appears to be some kind of custom but simple binary format. This could be read quite efficiently using numpy structured arrays, maybe then going down to C for the chunk merging part.
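To illustrate the structured-array technique: a fixed-size binary header layout can be mapped onto a numpy dtype and parsed for all packets in one `frombuffer` call. The field layout below is hypothetical and only demonstrates the approach; the real wave server packet format differs.

```python
import numpy as np

# Hypothetical fixed-size packet header layout (illustrative only).
header = np.dtype([
    ("pinno", "<i4"),      # pin number
    ("nsamp", "<i4"),      # samples in this packet
    ("starttime", "<f8"),  # epoch seconds
    ("samprate", "<f8"),   # samples per second
])

# Build two fake packets' worth of raw bytes, then parse them all at once.
packets = np.zeros(2, dtype=header)
packets["nsamp"] = [100, 200]
raw = packets.tobytes()

parsed = np.frombuffer(raw, dtype=header)
print(int(parsed["nsamp"].sum()))  # prints 300
```

One vectorized parse replaces a Python-level loop over 86k packets.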
You might have to spearhead this a bit to get it done in due time but I'm happy to help out along the way!
I think you're right about the proper fix. I had been dodging that solution, but it is the right thing to do. I'm somewhat familiar with the earthworm WSV protocol and will give it a shot. I may need to take you up on your offer of help.
Should I close this pull request and create a new one when ready, or leave this one open?
+1 for fixing this while interpreting the data returned from the server. The current implementation is clearly quick-and-dirty in this regard, creating lots of individual Trace objects and handing them over to the user to worry about.
It would be great if you worked on this; it will definitely be an improvement for our earthworm client implementation. :-)
You can force-push to the branch of this PR and reuse this PR. That will avoid ticket duplication. :-)
@tparker-usgs I pushed a quick commit to here: https://github.com/megies/obspy/commits/earthworm_data_parsing
This might be a small improvement, avoiding some copying of data around in memory (first determining which packets are contiguous and then merging the numpy array data once). If this is still too slow you might have to go with the record array approach @krischer mentioned. You could have a look at my implementation of the reftek reader based on record arrays in #1433. The most relevant pieces might be these:
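The "merge once" idea can be sketched as below. This is an assumption-laden illustration, not the linked commit: packets are `(starttime, samples)` pairs, the sample rate and gap tolerance are made up, and `merge_packets` is a hypothetical name.

```python
import numpy as np


def merge_packets(packets, samprate):
    """Group packets that are contiguous in time, then concatenate each
    group's samples with a single numpy call instead of appending
    packet by packet."""
    groups, current = [], [packets[0]]
    for prev, cur in zip(packets, packets[1:]):
        expected = prev[0] + len(prev[1]) / samprate
        if abs(cur[0] - expected) < 0.5 / samprate:
            current.append(cur)   # contiguous with the previous packet
        else:
            groups.append(current)  # gap found: start a new trace
            current = [cur]
    groups.append(current)
    return [np.concatenate([data for _, data in g]) for g in groups]


packets = [(0.0, np.arange(5)), (5.0, np.arange(5, 10)), (20.0, np.arange(3))]
merged = merge_packets(packets, samprate=1.0)
print(len(merged))  # prints 2: one merged trace, plus one after the gap
```

Each output array is built with exactly one `np.concatenate`, so the data is copied once per gap-free run rather than once per packet.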
However, I clobbered the head fork of this pull request without understanding how it would affect the pull request. Is there a way to point the PR to a new fork, or should I close and re-create the pull request?