Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Generalize Zenodo content provider to support other Invenio repositories #704

Merged
merged 8 commits into from Jun 21, 2019

Conversation

@tmorrell
Copy link
Contributor

tmorrell commented Jun 13, 2019

Updates the Zenodo content provider to:

  • Use a more general DOI -> URL conversion
  • Support different hosts and different metadata organizations

My interest is supporting CaltechDATA, which is based on Zenodo but has a different metadata scheme. I'll also work on getting Binder buttons into CaltechDATA, which would be cool.

I'm still working on figuring out how to generalize the object path stuff. The idutils functions could also be copied into utils; don't know what your preference is on dependencies. Any and all suggestions are appreciated!

@betatim

This comment has been minimized.

Copy link
Member

betatim commented Jun 13, 2019

This is cool!

I think I'd import the is_doi and normalize_doi functions (with a reference to the original and copyright notice) as they are both one-liners so not worth adding a dependency that has dependencies IMHO.

One thing I couldn't figure out from a quick reading of the code: we should organise for this content provider to only trigger for some DOIs (right now ones that point to Zenodo.org or CaltechDATA, in the future other Invenio instances (is Invenio really the software behind Zenodo? I always thought of it as a web framework kinda thing as the agenda and document servers at CERN were also "Invenio")) so that other content providers can deal with DOIs that point to other types of repositories. Same logic if someone passes a URL to either a Invenio record or dx.doi.org with a DOI that points to one of them.

@tmorrell

This comment has been minimized.

Copy link
Contributor Author

tmorrell commented Jun 13, 2019

I don't think we need an explicit DOI check. I resolve all DOIs to a url, but then see if the url is on the list of known hosts. So if it's a DOI for a different repository, the url won't match and you won't get anything back.

There are two versions of Invenio, and there's going to be a lot of new development around Zenodo and Zenodo flavors in the next year. It's why I didn't rename the content provider...Zenodo has better name recognition at this point.

tmorrell added 3 commits Jun 18, 2019
@tmorrell

This comment has been minimized.

Copy link
Contributor Author

tmorrell commented Jun 19, 2019

This is mostly done. I've got one functional change for Zenodo that I wanted to discuss. Currently the integration will try to download files for any Zenodo record, which for datasets could be very large (e.g. repo2docker 10.5281/zenodo.3250135). I think we should only attempt the download for items with type "Software", since those are the only ones are likely to work with repo2docker.

@betatim

This comment has been minimized.

Copy link
Member

betatim commented Jun 19, 2019

The thinking behind making it work with any kind of record was that even an empty Git repository or one containing no files recognised by repo2docker is a valid input repository. Which made me think that we could use that logic to allow datasets as a way to make it easy for people to explore a dataset. A notebook and a Python interpreter is already a good starting point for exploring data. If people use it to create very large images I'd solve this at the deployment level (by configuring BinderHub to reject DOIs that don't point to software records. What do you think of that thinking?

repo2docker/utils.py Outdated Show resolved Hide resolved
repo2docker/utils.py Outdated Show resolved Hide resolved
@betatim

This comment has been minimized.

Copy link
Member

betatim commented Jun 19, 2019

Nice work making the code more general! I left some comments on cleaning up stray white space, potentially parametrizing a test (we should definitely split it) and made a suggestion for a "deep get" that I find easier to follow (and maybe eliminate a use of deepcopy()).

@betatim

This comment has been minimized.

Copy link
Member

betatim commented Jun 19, 2019

To please the formatting robot pip install black and black . in the top level directory.

@tmorrell

This comment has been minimized.

Copy link
Contributor Author

tmorrell commented Jun 20, 2019

Thanks for the great suggestions @betatim! I think I've implemented everything, and I added a couple of new tests as well.

For functional testing, I've got a notebook in CaltechDATA that works: repo2docker 10.22002/d1.1250

And for fun, here is how a binder badge will look in the repository: https://doi.org/10.22002/d1.1250

@tmorrell tmorrell changed the title [WIP] Generalize Zenodo content provider to support other Invenio repositories [MRG] Generalize Zenodo content provider to support other Invenio repositories Jun 20, 2019
@betatim

This comment has been minimized.

Copy link
Member

betatim commented Jun 21, 2019

Merged!

Nice work and with all those tests how could I not merge it :)

@betatim betatim merged commit 7081941 into jupyter:master Jun 21, 2019
5 checks passed
5 checks passed
ci/circleci: build_docs Your tests passed on CircleCI!
Details
ci/dockercloud Your tests passed in Docker Cloud
Details
codecov/patch 95% of diff hit (target 20%)
Details
codecov/project 90.17% (+0.01%) compared to f19e159
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details


def test_is_doi():
assert utils.is_doi("10.1234/jshd123") != None

This comment has been minimized.

Copy link
@betatim

betatim Jun 21, 2019

Member

For even more Pythonicness people usually do something is None and something is not None instead of == and !=. To be honest I'd have to dig a bit to remember/explain why or in which edge cases is is better :D

@betatim

This comment has been minimized.

Copy link
Member

betatim commented Jun 21, 2019

@tmorrell tmorrell deleted the caltechlibrary:zenodo-caltechdata branch Jun 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.