Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce container readers. #704

Closed
wants to merge 11 commits into from
Closed

Introduce container readers. #704

wants to merge 11 commits into from

Conversation

stscieisenhamer
Copy link
Contributor

In part thinking about issue #642 and in dealing with datasets with many FITS extensions, came up with a generic FITS loader.

Haven't done tests/docs yet, but wanted to throw this out for thoughts and possibly becoming the default for FITS files.

@ChrisBeaumont
Copy link
Member

@stscieisenhamer can you add some comments here about what this change does? It looks the main difference from the current fits loader is that it loads all extensions, creating as many datasets as needed so that each dataset has the same shape.

This also seems relevant to @astrofrog's work on adding UI to let user's configure how datasets and components are organized at file load time

@stscieisenhamer
Copy link
Contributor Author

Sorry, Yes I indeed need to be more verbose, in life in general.

However, @ChrisBeaumont has it all: a bulk reader that reads as many extensions as it can, groups like dimensionality into common datasets, with some sort of reasonable name guessing, though that could use some work. Having such "bulk" or, using the coinged term, container loaders, would match with the spirit of data exploration: for those situations where one does not know what the data really is, such loaders would allow a low-barrier way of allowing to learn about a given container.

This would work in conjunction with UI that allows specification of different subsets of a container to be read, once one knows what to look for and what is needed.

With HDF5 and the future ASDF formats, both of which can contain many sets of data, not all necessarily of like dimensionality, having such a class of "container loaders" seems to be advantageous, if practical.

@astrofrog
Copy link
Member

@stscieisenhamer - this container idea is great. Would you have time in the coming few days to finish this up? The main things that need to be done:

  • Get the tests to pass on AppVeyor and Travis
  • Add a changelog entry

(if you don't have time, I can copy over these changes to another PR and finish it up)

Another comment I have, which we can address after this PR is merged, is that I think that we should prompt the user about whether to merge datasets or not rather than doing it automatically. The issue is that even if the shape matches, the WCS may not. For example, the WFCAM data files (for the UKIDSS survey) include four different images from different CCD chips as extensions. The shapes match, but the WCS do not. It's surprisingly difficult to check if two WCSs are exactly the same, hence why I'm suggesting that later we should make sure that we allow the user to decide whether or not to merge extensions. But this is quite a bit more work and this PR is already a big improvement over the current situation.

@stscieisenhamer
Copy link
Contributor Author

@astrofrog I should be able to finish this up by this Friday, 21Aug.

I agree about the WCS issue, and just in general, the user should have some input. This actually came up in a slightly different context, described below, during the last sprint. There were two approaches:

  1. Never merge: just make a data set with each extension.
  2. Merge UNLESS there is an explicit WCS defined.

Option 1. is, to my sensibilities, really dumb. However, it has the advantage that people know their FITS files by their extensions, and this would make it very explicit what went where. I prefer 2. and, unless otherwise, very easily, persuaded, will attempt that approach.

The context in which we hit this occurred when there was already an existing data set with matching shape. But, in this context, I was going to actually make an issue. When datasets are merged, it appears that the WCS of the new data takes over the dataset. In this situation, if a WCS has actually already been set, it should remain set. OR, the user has the option of choosing which WCS to use, OR, the merge is just not allowed to happen if both the existing dataset and the data to be merged both have explicit WCS definitions.

@astrofrog
Copy link
Member

@stscieisenhamer - I'm fine with option 2 for now. As I said, I think the 'right' thing in future will probably be to have a pop-up window that helps the user customize the merging or not of datasets. But for now I like option 2.

Remove use of __factories__, use @data_factory for all data factories instead
fits_container.label = "FITS full file"
fits_container.identifier = is_fits
__factories__.append(fits_container)
set_default_factory('fits', fits_container)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As done in #724, please now register the data factory using the @data_factory decorator. If you want this to take precedence over any other FITS reader, you can just set the priority in that decorator to say 100 (it has to be below 1000 because that is for the dendrogram reader, which is a more specialized format and should take precedence).

FITS is the first.
The new data_factory decorator is used.
Also brought in the updated code from the
original development in another project.
Only if an extension has no WCS defined will
the Generic Fits loader combine extensions into
the same Data container.

Otherwise, the user is prompted, through the
standard glue framework, to combine like-shaped
data. Hence the user has the option of preserving
wcs information or not.

The naming scheme is also changed from shape-based
to extension-based. Whether this is a positive
change or not will remain to be seen.
@stscieisenhamer
Copy link
Contributor Author

Did the check on whether a wcs exists or not. If one exists, a new Data element will be created.

I happen to have a good dataset where this can potentially blow up. There is a cube where three extensions exists for the data, error, and mask. The error and mask extensions do not have wcs, and the expected happens.

However, there are also 12 other extensions, all the same shape and all of wcs information. So, the loader makes no attempt to merge. However, this causes the standard prompting from glue about whether to merge the datasets. This, I believe, actually gets the same effect as creating a dedicated UI for this situation. Of course, the generic request make no mention of the same/different wcs information, but it is at least now consistent with any other data format read, and you do get the user input.

There is also another change that an opinion would be good on. Previously, the Data labels included the shape. Now, because different datasets may still have the same shape, I have gone to an extension-name-based labeling. For FITS users, this may (or may not) be better. Thoughts.

I will come back and check testing state later this evening and fix as need be.

@astrofrog
Copy link
Member

@stscieisenhamer - given your example, I'm actually inclined to say that we should not merge datasets by default, but instead just let the default prompt come up and let the user choose. The merging will be very context-dependent, and I can also find examples of datasets where I would not want merging to happen but it would happen automatically in the current PR. At the end of the day, it's easier to merge datasets than to unmerge them.

So for now, would it be ok to simply turn off the auto-merging?

Another quick note - I noticed that you merged master into your branch, but to keep the history clean, we normally rebase the branch on master instead. Would you mind rebasing this branch against the upstream master to get rid of the merge commit?

Thanks for your work on this!

@astrofrog
Copy link
Member

Actually let me make a small modification to my suggestion: if you prefer maybe you can add an option to the data factory called auto_merge that defaults to False. But if set to True, then it follows your current merging rules. Then we can see in future whether we can expose that option, but for now it would have the effect of never merging automatically, while having the benefit of keeping the code there for auto-merging.

@stscieisenhamer
Copy link
Contributor Author

srysly: 2.6? grrr...

@astrofrog
Copy link
Member

@stscieisenhamer - it looks like this needs to be rebased as it includes commits that are not related to this pull request (and also to get rid of the merge commit). Make sure you rebase against the glue-viz/glue repository, not your fork.

If you have any issues with doing this, I'm also happy to do the rebasing and open another pull request - just let me know.

@stscieisenhamer
Copy link
Contributor Author

My apologies. I've been trying to do it, but there seems to be some serious confusion. In particular, in doing the squashing, something is meeting a horrible death. So, I'm going to take you up on your offer; a new PR seems to be the only way out.

@stscieisenhamer
Copy link
Contributor Author

Otherwise, I did implement the one-to-one importing of extensions to data objects, fixed current tests, and implement a specific test for this container.

@astrofrog
Copy link
Member

@stscieisenhamer - thanks!

Note that you don't need to squash when rebasing, but it looks like you may have rebased against your fork's master rather than the upstream master. But in any case, I'll open a new PR with a rebased version, and will put the commands I use here for future reference.

@stscieisenhamer
Copy link
Contributor Author

maybe, but it must have happened sometime earlier than today...hence the weird state. my apologies again. my next step was to basically do the new PR...

@astrofrog
Copy link
Member

@stscieisenhamer - ah sorry, I mis-understood, would you prefer to open the new PR? (if not, then I can do so)

@stscieisenhamer
Copy link
Contributor Author

Sure! I think I can get that part right at least. Will do this later today.

@astrofrog
Copy link
Member

@stscieisenhamer - ok, thanks. Just to let you know, I did decide to do the rebase to better understand what the issue was, and I have a cleaned up version in the following branch

https://github.com/astrofrog/glue/tree/fits-containers

If you like, we can just open a pull request from that branch (unless you would prefer to try and do it)

@astrofrog
Copy link
Member

Just based on the rebase, I think what might have happened is that you merged the latest master into your branch and possibly then rebased on a version of master that wasn't the latest upstream one.

In future, I would recommend basically never merging master into your branch. Instead, always just rebase using:

git fetch upstream
git rebase -i upstream/master

@astrofrog
Copy link
Member

However, I should note that the rebase now is very difficult, and it took me a while to get right, so if you want to open a new PR not based on the branch I pushed, you would be better off making a completely new branch.

@stscieisenhamer
Copy link
Contributor Author

First, lets just go with the pr on your branch; I've messed up enough today.

But, good news for me: that process is exactly what I've been doing. But, bad news, I must have done something bad along the way; i'm guessing a too-quick-on-the-bash-history execution... 👎 Probably real late one night.

My apologies again. -1 on the git-fu

@astrofrog
Copy link
Member

No problem! The new PR is at #732

@stscieisenhamer
Copy link
Contributor Author

Now in PR #732

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants