[2021 Theme Proposal] IPFS ❤️ Data #68

Closed
atopal opened this issue Nov 19, 2020 · 3 comments

Comments

atopal (Contributor) commented Nov 19, 2020

Note, this is part of the 2021 IPFS project planning process - feel free to add other potential 2021 themes for the IPFS project by opening a new issue or discuss this proposed theme in the comments, especially other example workstreams that could fit under this theme for 2021. Please also review others’ proposed themes and leave feedback here!

Theme description

Create a delightful experience for storing and working with datasets on IPFS by building one or more application stacks that cover storage, replication, retrieval, and computation, and by improving the parts of the core implementations needed to enable these use cases. The target outcome is that ~20% of the world’s important datasets are stored on these systems.

Hypothesis

IPFS’s ability to enable accessibility / portability / extensibility of data is a great fit for many dataset applications, solving many problems that dataset storage and retrieval face in web2 models. Current IPFS implementations are not far from being able to address almost all of these problems and capture these use cases.

Vision statement

There is a rich ecosystem of IPFS-based applications that support onboarding, versioning, and utilizing major datasets, and they are the premier places to store and interface with the world’s most important datasets. This data has gravity, with a rich, budding application ecosystem being built on top of these stored datasets to address many end use cases. Tooling improves from the feedback loop generated by building these products on IPFS.

Why focus this year

This is a use case for which IPFS can likely provide a ton of value, even at its current level of maturity (e.g., before read/write privacy). The use case is large and important. Further, the value Filecoin provides as a backup medium and the momentum from its ecosystem make 2021 a great opportunity to focus on this.

Example workstreams

Development of IPFS-based applications to store, replicate, serve, and process many types and applications of datasets; improvement of core implementations to handle large datasets (e.g., scaling to large volumes of provider records, raising throughput on transport connections so that they can be saturated, and connecting to multiple providers); and addition of maintainers of key datasets to the IPFS ecosystem.

b5 commented Nov 25, 2020

sounds neat 😉

ArneBinder commented Nov 26, 2020

And it would be really neat to have machine learning applications in mind when tackling this. Machine learning is a very hot topic and requires a lot of data. Furthermore, for scientific machine learning, reproducibility is essential. Having processes that produce a well-known output for a specific input helps a lot. So, having tools that deterministically identify specific versions of datasets (i.e., inputs and outputs of machine learning models) would be very beneficial.

I'm a natural language processing researcher and work a lot with https://github.com/huggingface/datasets. This tool and collection of scripts provides a promising way to easily integrate a wide variety of data into the machine learning model of your choice. However, versioning is a pain, the original data is usually saved on a single server, and creating your own datasets or deriving new ones by (automatically) annotating existing datasets is not transparently modeled, which often makes the dataset creation process non-reproducible. I really like the concepts around IPFS and think that there are a lot of potential synergies with the field of machine learning.
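
To make the idea of deterministically identifying a dataset version a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name `dataset_fingerprint` is hypothetical, the dataset is assumed to be a directory of files, and the result is a plain SHA-256 digest, not an IPFS CID (IPFS derives identifiers from content as well, but with its own chunking and encoding). The point is just that the identifier depends only on relative paths and file bytes, so identical content always maps to the same identifier.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root: str) -> str:
    """Return a SHA-256 hex digest derived only from relative paths and file bytes.

    Hypothetical helper for illustration; this is NOT how IPFS computes a CID.
    """
    root_path = Path(root)
    # Sort by relative POSIX path so traversal order cannot change the result.
    entries = sorted(
        (p.relative_to(root_path).as_posix(), p)
        for p in root_path.rglob("*")
        if p.is_file()
    )
    digest = hashlib.sha256()
    for rel, path in entries:
        rel_bytes = rel.encode("utf-8")
        # Length-prefix the path so neighbouring fields cannot run together.
        digest.update(len(rel_bytes).to_bytes(8, "big"))
        digest.update(rel_bytes)
        # Hash each file separately; the fixed-size digest needs no length prefix.
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()
```

Two copies of the same dataset yield the same fingerprint regardless of where they live or in which order the files were written, while changing a single byte in any file yields a different one. A training script could record this value next to its results to pin the exact inputs it used.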

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Sep 27, 2023
github-actions bot closed this as not planned on Oct 3, 2023