
earthworm client: slow stream cleanup #1711

Closed
wants to merge 3 commits

Conversation

@tparker-usgs (Member) commented Mar 10, 2017

Earthworm's waveserverV returns data in an ordered set of small chunks. The first chunk returned may start before the requested start time, and the last chunk returned may end after the requested end time. The time required to trim the data using Stream.trim() grows linearly with the number of data chunks returned by the server.

This change replaces the call to Stream.trim() with individual calls to trim only the first and the last trace in the Stream.
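In rough terms the change amounts to something like the sketch below (illustrative only, not the exact patch; trim_edges is just a hypothetical helper name):

def trim_edges(st, starttime, endtime):
    """Trim only the first and last trace of an ObsPy Stream to the window."""
    if len(st):
        st[0].trim(starttime=starttime)   # first chunk may start too early
        st[-1].trim(endtime=endtime)      # last chunk may end too late
    return st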

megies and others added some commits Feb 27, 2017

release 1.0.3
(empty commit to avoid ambiguities with 1.0.3rc1)
@krischer (Member) left a comment

Can you please rebase on master (or just cherry-pick your two commits onto a branch from master)? Then the CircleCI failure should also go away.

Aside from that, I don't think this fully works, because at least the channel argument of the get_waveforms() method appears to support wildcards. In that case you might have multiple traces that need to be trimmed at the beginning and at the end.

Do you have measurements showing when this actually starts to matter? Trimming would be a no-op for most traces (if there are indeed many), and while Python for-loops are not the fastest, it would take a very large number of traces before this results in a user-visible delay. I would assume that the actual network I/O is by far the dominant factor in this method.

If trimming is slow even when it is a no-op, we should investigate why.

@krischer (Member) commented Mar 10, 2017

Also note that if cleanup is True it will merge perfectly adjacent traces in any case, which (for reasonable data) should result in very few Traces actually entering the trimming operation.

@tparker-usgs (Member, Author) commented Mar 10, 2017

Thanks for the quick review. Sorry I have so much to fix.

We want to migrate some operational code from MATLAB to ObsPy and are running into performance problems with the earthworm client. My pull requests attempt to correct the problems that are preventing us from using ObsPy. I administer the wave server that the earthworm client uses as its example in the docs, so I have no reason to suspect that what I'm seeing is an incompatibility between my server and the code.

As an example, I retrieved 24 hours of data a few different ways and timed how long it took at each stage. The code I used was modeled after the example in the earthworm client docs.

from obspy import UTCDateTime
from obspy.clients.earthworm import Client

client = Client("pubavo1.wr.usgs.gov", 16022)
st = client.get_waveforms('AV', 'ACH', '', 'EHE', UTCDateTime(2017, 3, 8, 0, 0), UTCDateTime(2017, 3, 9, 0, 0), cleanup=False)

The times are from a single run each, but they are consistent with previous testing. The request for 24 hours of data returned 86,401 traces; the results are in the table below.

| Task | Total run (s) | Data retrieval (s) | Build Stream object (s) | Cleanup (s) | Trim (s) |
| --- | --- | --- | --- | --- | --- |
| Retrieve with cleanup=True | 692.6 | 13.6 | 25.1 | 651.5 | 0 |
| Retrieve with cleanup=False | 85.2 | 16.2 | 27.5 | 0 | 38.8 |
| Retrieve with cleanup=False and minimal trimming | 40.5 | 12.7 | 25.7 | 0 | 0 |

I haven't looked at the cleanup code, but it appears to scale much worse than linearly. Over 11 minutes to clean up 24 hours of wave data is consistent with my other testing.

I'm not sure why a no-op would take 39 seconds, even when called 86k times; perhaps this is revealing another problem. If there's a better way to resolve the performance problem, I'm happy to use it.
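One way to check whether Stream.trim() itself is the bottleneck, independent of the server, is a synthetic stream of many tiny traces (a rough sketch, not a rigorous benchmark; timings will vary):

import time
import numpy as np
from obspy import Trace, Stream, UTCDateTime

t0 = UTCDateTime(2017, 3, 8)
# Build ~10k one-second traces of 10 samples each, similar in spirit to the
# small chunks returned by the wave server.
traces = [Trace(data=np.arange(10, dtype=np.int32),
                header={"starttime": t0 + i, "sampling_rate": 10.0})
          for i in range(10000)]
st = Stream(traces)

start = time.time()
st.trim(t0, t0 + 10000)  # the window already covers everything, so a no-op
print("Stream.trim() took %.2f s" % (time.time() - start))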

Good call on the wildcard case. The method returns early when a wildcard is provided, so my code doesn't break, but it also doesn't change the trim behavior.

I'll rebase on master and also apply my change to the trim behavior when a wildcard is provided.

@krischer (Member) commented Mar 10, 2017

Thanks for your measurements, which uncover quite a few issues! None of the core developers regularly uses Earthworm, so we were not aware of these problems.

86k traces is admittedly a bit of an extreme case, but everything here is definitely too slow and needs to be faster. ObsPy and Python are also not really designed for these kinds of workloads, which is also why we don't, for example, expose record-level MiniSEED to users.

  • The trimming IMHO should be faster - I'll look into it next week.
  • The cleanup code path is slow because it successively adds traces together, which keeps allocating new, larger arrays... This needs to be fixed, as it also matters for streams with far fewer traces (see the sketch below).
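To illustrate the second point, growing one array chunk by chunk re-allocates and copies on every step, while collecting the chunks and concatenating once does a single pass (a rough sketch, not the actual cleanup code):

import time
import numpy as np

chunks = [np.arange(100, dtype=np.int32) for _ in range(5000)]

start = time.time()
data = np.array([], dtype=np.int32)
for c in chunks:
    data = np.concatenate([data, c])   # re-allocates and copies the whole array each time
print("successive concatenation: %.2f s" % (time.time() - start))

start = time.time()
data = np.concatenate(chunks)          # one allocation, one pass over the data
print("single concatenation:     %.2f s" % (time.time() - start))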

In most practical cases the data has only a few traces, so we were not really aware of these problems.

Regarding the proper fix for the waveserver client: IMHO the code should be refactored quite a bit to first collect everything in a buffer, then treat it as a file and call a read_file() method on it, e.g. (pseudocode):

import io

with io.BytesIO() as buf:
    while True:
        data = recv_chunk()   # pseudocode: read the next chunk from the socket
        if not data:
            break
        buf.write(data)

    buf.seek(0, 0)
    st = read_file(buf)       # pseudocode: parse the whole buffered response

This (if done properly) would eliminate almost all of the overhead; the runtime would then pretty much be the download time plus some parsing time. For comparison, ObsPy reads an equivalent MiniSEED file in 0.3 seconds.
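obspy.read() already accepts file-like objects, so the buffered response could be handed to a reader in exactly this way once one exists for the waveserver format. A sketch, assuming a local example.mseed file just to stand in for the buffered server response:

import io
from obspy import read

with open("example.mseed", "rb") as f:   # hypothetical local file
    buf = io.BytesIO(f.read())
st = read(buf)   # read() parses the in-memory buffer like a file
print(st)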

Do you know the format that the waveserver serves? Judging from a quick look at the code, it appears to be some kind of custom but simple binary format. It could be read quite efficiently using numpy structured arrays, perhaps then dropping down to C for the chunk-merging part.
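As an illustration of the structured-array idea (the field names and sizes below are made up and do not match the actual waveserver packet layout):

import numpy as np

# Hypothetical fixed-size record layout, for illustration only.
record_dtype = np.dtype([
    ("starttime", ">f8"),          # assumed: big-endian double, epoch seconds
    ("sampling_rate", ">f8"),
    ("nsamp", ">i4"),
    ("samples", ">i4", (100,)),    # assumed: fixed 100 samples per record
])

def parse_records(raw_bytes):
    # One vectorized pass over the whole buffer instead of a Python loop
    # over individual packets.
    records = np.frombuffer(raw_bytes, dtype=record_dtype)
    data = records["samples"].reshape(-1)
    return records["starttime"], records["sampling_rate"], data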

You might have to spearhead this a bit to get it done in due time but I'm happy to help out along the way!

@tparker-usgs (Member, Author) commented Mar 16, 2017

I think you're right about the proper fix. I had been dodging that solution, but it is the right thing to do. I'm somewhat familiar with the earthworm WSV protocol and will give it a shot. I may need to take you up on your offer of help.

Should I close this pull request and create a new one when ready, or leave this one open?

@megies (Member) commented Mar 20, 2017

+1 for fixing this while interpreting the data returned from the server. The current implementation is clearly quick-and-dirty in this regard, creating lots of individual Trace objects and handing them over to the user to worry about.

Would be great if you could work on this; it will definitely be an improvement for our earthworm client implementation. :-)

Should I close this pull request and create a new one when ready, or leave this one open?

You can force-push to the branch of this PR and reuse this PR. That will avoid ticket duplication. :-)

@megies (Member) commented Mar 20, 2017

@tparker-usgs I pushed a quick commit to here: https://github.com/megies/obspy/commits/earthworm_data_parsing

This might be a small improvement, avoiding copying data around in memory to some extent (it first determines which packets are contiguous and then merges the numpy array data once). If this is still too slow you might have to go with the record-array approach @krischer mentioned. You could have a look at my implementation of the reftek reader based on record arrays in #1433. Most relevant pieces might be these:

@megies changed the title from "Streamline trace trimming" to "earthworm client: slow stream cleanup" Mar 20, 2017

@megies added this to the 1.1.0 milestone Mar 20, 2017

@megies added the enhancement label Mar 20, 2017

@megies (Member) commented Mar 27, 2017

This might take some more time, so bumping to 1.1.1 (assuming that we can consider this some sort of a bug fix..?)

@megies modified the milestones: 1.1.1, 1.1.0 Mar 27, 2017

@tparker-usgs (Member, Author) commented Apr 24, 2017

Following suggestions by @krischer and @megies, I reworked the code to merge adjacent traces as the data is retrieved. This works well and performance is much better.
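In rough terms the approach groups packets that line up end-to-start while reading and concatenates each group's samples exactly once, something like this sketch (illustrative only, not the actual diff):

import numpy as np

def merge_contiguous(packets):
    """packets: time-sorted list of (starttime, sampling_rate, samples) tuples."""
    groups = []   # each entry: [starttime, rate, [chunks], expected_next_start]
    for start, rate, data in packets:
        delta = 1.0 / rate
        if groups and groups[-1][1] == rate and abs(start - groups[-1][3]) < 0.5 * delta:
            groups[-1][2].append(data)              # contiguous: extend the current group
            groups[-1][3] = start + len(data) * delta
        else:
            groups.append([start, rate, [data], start + len(data) * delta])
    # one concatenate per contiguous group
    return [(s, r, np.concatenate(chunks)) for s, r, chunks, _ in groups]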

However, I clobbered the head fork of this pull request without understanding how it would affect the pull request. Is there a way to point the PR to a new fork, or should I close and re-create the pull request?

@krischer (Member) commented Apr 25, 2017

Hmm - I'm not sure if you can edit the origin branch/fork of a PR. You also removed the original one, so feel free to open a new PR, as I'm not sure how to otherwise resolve this.

@megies modified the milestones: 1.1.0, 1.1.1 Apr 25, 2017

@megies (Member) commented Apr 25, 2017

Re-adding this to 1.1.0, as it sounds like it might be almost ready.
