Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add protections for the get_data utility #293

Merged
merged 6 commits into from Apr 9, 2019

Conversation

alurban
Copy link
Member

@alurban alurban commented Apr 3, 2019

This PR fixes a bug in the get_data utility by adding protections against nonexistent frame files when attempting to build a frame file cache, and falling back to TimeSeries{Dict}.get rather than fetch when that fails. Unit tests are also updated to reflect these changes.

Other changes include:

  • Remove redundant obs keyword argument from get_data, and infer the observatory from the frametype (when applicable) using a regular expression
  • Remove references to obs from a few scripts
  • Simplify a couple of unit tests
  • Update docstrings
  • Misc. typo fix in gwdetchar-conlog

cc @duncanmmacleod

Copy link
Member

@duncanmmacleod duncanmmacleod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused as to why all of the tools now call out to gwdatafind. It looks like the returned file lists aren't being reused (which would be a good reason to call find_urls upfront), so can they not just let gwpy do the gwdatafind call for them?

# read from frames or NDS
if source:
if source is not None:
if isinstance(channel, (list, tuple)):
channel = remove_missing_channels(channel, source)
return series_class.read(
source, channel, start=start, end=end, nproc=nproc,
verbose=verbose, **kwargs)
elif isinstance(channel, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it occurs to me that this should probably be

        elif not isinstance(channel, (list, tuple))

to match the condition above that declares series_class

Copy link
Member Author

@alurban alurban Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @duncanmmacleod, I've changed this in the latest commit. I also restored the logic that attempts to load a cache using gwdatafind because I remembered that it's much simpler that way, but added some protection against non-existent frame files.

@alurban alurban changed the title Simplify the get_data utility WIP: Add protections for the get_data utility Apr 3, 2019
@alurban alurban force-pushed the patch-get-data branch 2 times, most recently from 3ba8ff0 to 2407255 Compare April 3, 2019 17:55
@alurban alurban changed the title WIP: Add protections for the get_data utility Add protections for the get_data utility Apr 3, 2019
@alurban alurban force-pushed the patch-get-data branch 2 times, most recently from 57f7b0e to f047644 Compare April 3, 2019 18:38
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 95.398% when pulling f047644 on alurban:patch-get-data into 5ef647c on gwdetchar:master.

3 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 95.398% when pulling f047644 on alurban:patch-get-data into 5ef647c on gwdetchar:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 95.398% when pulling f047644 on alurban:patch-get-data into 5ef647c on gwdetchar:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 95.398% when pulling f047644 on alurban:patch-get-data into 5ef647c on gwdetchar:master.

@coveralls
Copy link

coveralls commented Apr 3, 2019

Coverage Status

Coverage decreased (-0.04%) to 95.472% when pulling a1e4921 on alurban:patch-get-data into a801697 on gwdetchar:master.

@alurban alurban force-pushed the patch-get-data branch 2 times, most recently from 55ff36f to be4e43b Compare April 3, 2019 19:01
@alurban
Copy link
Member Author

alurban commented Apr 3, 2019

@duncanmmacleod, I just saw your longer comment, sorry about that. Response below:

I prefer to have gwdetchar.io.datafind.get_data attempt to create its own file cache before falling back to TimeSeries.get, mainly because it then becomes possible to sieve the file cache against requested channels that actually exist in frame files. This makes gwdetchar tools extremely robust to changes in channel names, which is very common between observing runs.

For example, gwdetchar-scattering is written to analyze optical squeezer suspensions by default, but it will still run successfully (with default options) over O1 or O2 data when those squeezer optics didn't exist. Likewise for gwdetchar-omega, whose default channel list configuration changed several times just in the first few months of 2019.

if isinstance(channel, (list, tuple)):
channel = remove_missing_channels(channel, source)
return series_class.read(
source, channel, start=start, end=end, nproc=nproc,
verbose=verbose, **kwargs)
elif isinstance(channel, str):
return series_class.fetch(
except (HTTPError, TypeError): # frame files not found
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain when an HTTPError would be raised?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@duncanmmacleod, the HTTPError is raised whenever gwdatafind.find_urls can't locate frame files (you get a 400 Bad Request).

Copy link
Member Author

@alurban alurban Apr 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if you don't pass obs and the first letter of frametype is not a meaningful observatory code (e.g. for SenseMon frames), this will also lead to HTTPError. The TypeError case comes when you pass neither frametype nor source. In all these cases, the idea is that it would then fall back to TimeSeries.get with no protection against missing channels in frame files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@duncanmmacleod, I stand corrected, if you attempt to read SenseMon frames with frametype='SenseMonitor_hoft_L1_M' and obs=None then the result is an IORegistryError: Format could not be identified. Passing frametype=None falls back to TimeSeries.get, and passing both frametype and obs successfully uses TimeSeries.read.

@alurban
Copy link
Member Author

alurban commented Apr 5, 2019

@duncanmmacleod, I've fixed this so that the obs kwarg is no longer needed, the observatory is now inferred from frametype (when it's passed) using a regular expression. Also, whenever neither frametype nor source are passed, the utility will fall back to TimeSeries.get.

I've also updated the docstring to [hopefully] make it clearer to the user which arguments do what, and when.

Copy link
Member

@duncanmmacleod duncanmmacleod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye, looks good.

@alurban alurban merged commit 7e0070b into gwdetchar:master Apr 9, 2019
@alurban alurban deleted the patch-get-data branch April 9, 2019 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants