Add IPFS content provider (InterPlanetary File System) #1098

Closed
wants to merge 7 commits into from


d70-t

@d70-t d70-t commented Nov 26, 2021

This PR is adding an IPFS content provider (see #1096).

The following builds the requirements.txt example via IPFS:

jupyter-repo2docker QmPjPUTcXeiEdNUMEPusP4rnJNz2YPw1XrYQkp43C96DyS 

Still open:
Likely one wants to have an option to configure the list of possible IPFS gateways. E.g. an environment variable?

@welcome

welcome bot commented Nov 26, 2021

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@d70-t d70-t marked this pull request as ready for review November 26, 2021 20:37
@d70-t
Author

d70-t commented Nov 26, 2021

Tests are now passing, including a test which actually creates a docker image from IPFS.

There's still the open question regarding customizability of the IPFS gateways to try. In ipfsspec the environment variable IPFSSPEC_GATEWAYS can be used to change the list of gateways. I'd imagine that one likely wants to specify a local one if the datacenter running repo2docker has some gateway(s) running there... What would be a preferable way to handle this in repo2docker?

  • Would we hijack / reuse this variable?
  • Would we like to have a new one (e.g. REPO2DOCKER_IPFS_GATEWAY)?
  • Something else?

@yuvipanda what do you think?
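
To make the options above concrete, here is a minimal sketch of how a gateway list could be resolved from the environment. It is not code from this PR. `REPO2DOCKER_IPFS_GATEWAY` is the hypothetical variable name floated above, the whitespace-separated format is an assumption, and the two default gateways are placeholders rather than the list used by the PR or by ipfsspec.

```python
import os

# Placeholder defaults for illustration only; not the list used by the PR or by ipfsspec.
DEFAULT_GATEWAYS = ["https://ipfs.io", "https://dweb.link"]


def resolve_gateways(environ=None):
    """Return the gateway list, preferring the most specific variable that is set."""
    env = os.environ if environ is None else environ
    # The repo2docker-specific (hypothetical) variable wins over the ipfsspec one.
    for name in ("REPO2DOCKER_IPFS_GATEWAY", "IPFSSPEC_GATEWAYS"):
        # Assumption: several gateways are separated by whitespace.
        gateways = env.get(name, "").split()
        if gateways:
            return [gateway.rstrip("/") for gateway in gateways]
    # Nothing configured: fall back to public gateways so novice users need no setup.
    return list(DEFAULT_GATEWAYS)
```

With this ordering, a datacenter operator could point repo2docker at a local gateway without affecting other ipfsspec users, while an unset environment still works out of the box.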

@manics
Member

manics commented Nov 29, 2021

A couple of technical issues:

  • repo2docker needs to be usable by novice users, so expecting users to configure or even understand gateways is probably too complicated
  • Just using an alpha-numeric string CID is too generic, e.g. it could coincide with the name of a local directory, or some other future provider that uses a hash may be invented and cause confusion. Does IPFS have a standard URL for referencing objects?

A more general issue which ties in with @betatim's comment on jupyterhub/mybinder.org-deploy#2082 (comment)
How widespread is IPFS usage for publishing or referencing code and data repositories? It's clear that IPFS can support reproducibility, but so can many other tools, and repo2docker can't support all of them. Do you have any evidence for how many researchers use it, and what its unique advantages are over other content providers?

@d70-t
Author

d70-t commented Nov 29, 2021

Thanks @manics for the comments!

For the first point: yes, that's also my concern. I believe that IPFS (or any other content addressable storage system) can be a very useful tool in science, but it is incredibly hard to get people on board initially, as there are several concepts which are new, creating a large initial burden. My take on the gateway issue would be to offer a somewhat large list of public gateways by default, which should just work for anyone, and to offer a customization point for people who want to improve performance. (The use of public gateways in the case of this PR is easier than for data, as code tends to be smaller and the entire thing can be downloaded by a single request, strongly reducing the pressure created on the gateways.)
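
The idea of trying a list of gateways in turn can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `fetch_from_gateways` and its `fetch` parameter are hypothetical names, and the sketch only requests the plain `<gateway>/ipfs/<cid>` path, leaving open how a whole directory would be retrieved in a single request.

```python
from urllib.request import urlopen


def fetch_from_gateways(cid, gateways, fetch=None, timeout=10):
    """Try each gateway in order and return (gateway, content) from the first that answers."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET. It can be swapped out, e.g. for testing.
        def fetch(url):
            with urlopen(url, timeout=timeout) as response:
                return response.read()

    errors = []
    for gateway in gateways:
        url = f"{gateway.rstrip('/')}/ipfs/{cid}"
        try:
            return gateway, fetch(url)
        except OSError as exc:
            # URLError, HTTPError and timeouts are all OSError subclasses.
            errors.append(f"{gateway}: {exc}")
    raise RuntimeError(f"no gateway could serve {cid}; tried: " + "; ".join(errors))
```

Because the content is addressed by its hash, any gateway that answers should return the same bytes, which is what makes this kind of simple fallback reasonable.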

The second point also came to my mind after writing this. There's the ipfs:// protocol (also listed at IANA) which would be suited for this (and which is also used by ipfsspec). I can modify the PR to require the addition of this protocol prefix.
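
A minimal sketch of what requiring the prefix could look like. `detect_ipfs` is a hypothetical standalone function and the shape of the returned dictionary is an assumption; the PR's actual provider class is not shown here.

```python
from urllib.parse import urlparse


def detect_ipfs(source):
    """Return a spec for ipfs:// sources and None for everything else."""
    parsed = urlparse(source)
    if parsed.scheme != "ipfs" or not parsed.netloc:
        # A bare CID, a local directory name or another URL scheme is not claimed.
        return None
    # Use netloc rather than hostname: hostname is lowercased, and some CIDs are case-sensitive.
    return {"cid": parsed.netloc, "path": parsed.path.lstrip("/")}
```

With this check, `ipfs://QmPjPUTcXeiEdNUMEPusP4rnJNz2YPw1XrYQkp43C96DyS` is recognized, while the bare string `QmPjPUTcXeiEdNUMEPusP4rnJNz2YPw1XrYQkp43C96DyS` is left for other providers (for example a local directory of that name).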

The third point is of course the hardest one, and maybe also a bit of a chicken and egg problem (see the gateway issue for example). Upfront: I believe that up to now, there are not yet too many researchers working with data on IPFS, but I hope this will change soon and repo2docker / binder would be great accelerators. The reason I started investing time into IPFS is because we've got a lot of data from a recent field campaign (https://eurec4a.eu/), which unfortunately is still distributed across the world, partly in very inaccessible places, and which should be made accessible for collaborative investigation. To do so, we figured that a basic requirement would be the availability of a simple-looking function which should work for all the data:

ds = get_dataset("some id")

The main goal of this function would be that one can write an analysis script and the script "just works" on any coworker's computer. Thus, this function should work on any computer at any time, even when the primary server is offline, and without changing the identifier (over a long timespan). It should be fast, at least if the data is close by (possibly on a local disk). And the result of the function must not change over time, because otherwise my analysis wouldn't be reproducible at all. Another point which we have to (and want to) deal with is the possibility of having copies of the data at our (and our collaborators') datacenters. This is partly in order to have better performance and redundancy and partly for political reasons (some data must be held in some countries or at some institutions). For the larger datasets, we also need the possibility of efficient subsetting without downloading (in particular in the context of demonstration scripts), which is often not available at scientific data repositories.

For at least those reasons, I figure that a global content addressable storage system would be very helpful (this provides verifiable data integrity, trivial caching and consequently a simple to implement system of globally distributed copies). IPFS is of course only one possibility, but the implementation of a single global namespace without the immediate need to name the location of the data provider in the dataset identifier (due to a lookup in a distributed hash table) is particularly helpful in the setting outlined above.
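
The three properties named here (verifiable integrity, trivial caching, interchangeable copies) can be illustrated with a toy content-addressed lookup. This is only a sketch: it uses a plain SHA-256 hex digest as the identifier, which is not how IPFS CIDs are encoded, and the `mirrors` and `cache` parameters are hypothetical stand-ins (plain dictionaries) for remote copies and a local disk cache.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Toy identifier: the SHA-256 hex digest of the content (not a real IPFS CID)."""
    return hashlib.sha256(data).hexdigest()


def get_dataset(identifier, mirrors, cache):
    """Return the bytes for identifier from the cache or the first mirror with a valid copy."""
    if identifier in cache:
        # Cached content can never be stale: the identifier pins the exact bytes.
        return cache[identifier]
    for mirror in mirrors:
        data = mirror.get(identifier)
        # Any copy is acceptable as long as it hashes to the identifier.
        if data is not None and content_id(data) == identifier:
            cache[identifier] = data
            return data
    raise KeyError(f"no mirror holds a valid copy of {identifier}")


payload = b"time,temperature\n0,25.1\n"
dataset_id = content_id(payload)
tampered_mirror = {dataset_id: b"something else"}
good_mirror = {dataset_id: payload}
local_cache = {}
ds = get_dataset(dataset_id, [tampered_mirror, good_mirror], local_cache)
```

The tampered copy is skipped because its hash does not match, and a second call with the same identifier is answered from `local_cache` even if no mirror is reachable.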

Based on this reasoning, I am primarily interested in having datasets on top of IPFS and that's still true. Having code on IPFS would be a neat addition (a single CID would suffice to reference the whole analysis as well as all the data which went into it), but it's probably also fine to have that as a second step. I've made the PR mainly because @yuvipanda suggested it and it seemed to be relatively simple to implement. Also, as the use of IPFS is rather controlled (basically only a single HTTP-request to a gateway), I assumed that the implementation is relatively unproblematic.


Although I'm so far quite convinced that IPFS will provide a lot of benefits for scientific data storage, I'm also interested in other good and practical solutions. Thus, if there are public data (and code) repositories which cover all of the requirements above, I'd be glad to learn more about them independent of this PR.

@manics
Member

manics commented Nov 30, 2021

Definitely a chicken and egg problem 😃 . Personally I don't think repo2docker should take the lead in promoting a particular new/upcoming technology; instead I see its role as supporting reproducible research using existing well-known tools that are already in use by the community.

For example, a very quick Google brought up dat, should we also add support for that?

Perhaps a long term solution to this problem is to make all content providers into plugins so it's easy to extend r2d and experiment with new optional providers. A similar idea has already been suggested for buildpacks.
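
As a rough illustration of the plugin idea, the sketch below keeps providers in a registry and asks each in turn whether it recognizes a source. All names (`PROVIDERS`, `register`, `pick_provider`, the two provider classes) are hypothetical and do not reflect repo2docker's actual classes; a real plugin system would more likely discover providers from installed packages than from a module-level list.

```python
PROVIDERS = []


def register(provider_cls):
    """Class decorator that adds a provider to the registry, in registration order."""
    PROVIDERS.append(provider_cls)
    return provider_cls


@register
class IPFSProvider:
    def detect(self, source):
        prefix = "ipfs://"
        if source.startswith(prefix):
            return {"cid": source[len(prefix):]}
        return None


@register
class LocalProvider:
    def detect(self, source):
        # Catch-all fallback: treat anything else as a local path.
        return {"path": source}


def pick_provider(source):
    """Return (provider, spec) for the first registered provider that claims the source."""
    for provider_cls in PROVIDERS:
        provider = provider_cls()
        spec = provider.detect(source)
        if spec is not None:
            return provider, spec
    raise ValueError(f"no content provider recognizes {source!r}")
```

In such a design an experimental provider lives outside the core package and only takes part in detection when a user has chosen to install it.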

@manics manics marked this pull request as draft January 26, 2022 19:16
@consideRatio consideRatio changed the title [WIP] IPFS content provider IPFS content provider Oct 30, 2022
@consideRatio consideRatio changed the title IPFS content provider Add IPFS (InterPlanetary File System) content provider Oct 31, 2022
@consideRatio consideRatio changed the title Add IPFS (InterPlanetary File System) content provider Add IPFS content provider (InterPlanetary File System) Oct 31, 2022
@yuvipanda
Collaborator

I think consensus here is to not add this. Personally for me, I've been disappointed with the progress IPFS has made in the last few years. So while I did initially champion this PR, I think we should close this one for now.

I'm really sorry, @d70-t!

@yuvipanda yuvipanda closed this Jun 20, 2024
4 participants