Sprint: Data.gov (aka 300 TB Challenge) #87
Participants from IPFS Team:
During this sprint, we will work with collaborating institutions to load all of data.gov (350 TB of datasets) into IPFS, publish the hashes on the DHT, and replicate the data to nodes at participating institutions.
Main Issues & Boards for Tracking this Work
Top level objectives
What will be Downloaded
The data.gov website is a portal for searching through all the open data published by US federal agencies. It currently lists over 190,000 datasets. The goal is to download those datasets, back them up, and eventually publish them on IPFS, and replicate them across multiple institutions.
How is this different from the Internet Archive's EOT Harvest of data.gov?
In short, the End of Term Presidential Harvest will capture the data.gov website itself but is not likely to capture the datasets it links to. We aim to capture and replicate the datasets.
The End of Term Presidential Harvest is focused on crawling websites (i.e. grabbing HTML & static files, following links), including data.gov. It is not focused on downloading whole datasets, which often must be retrieved using tools and processes beyond the capabilities of regular web crawlers.
You can see the ongoing EOT Harvest work, and nominate sites for harvest, at the End of Term Presidential Harvest 2016 website.
Isolating from the Main IPFS Network: We might do this on a separate "private" IPFS network to ensure stability while we load-test the system (we want to make sure it's all working smoothly before we flood the main public IPFS network with provide statements and additions to the DHT).
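For reference, here is a sketch of how a private IPFS network can be set up with go-ipfs. The swarm-key mechanism and the `LIBP2P_FORCE_PNET` variable are part of go-ipfs/libp2p; the addresses and paths are illustrative assumptions, not our actual deployment.

```shell
# Generate a shared secret (swarm key) and place the same file in the
# IPFS repo of every node that should join the private network:
printf '/key/swarm/psk/1.0.0/\n/base16/\n%s\n' \
  "$(tr -dc 'a-f0-9' < /dev/urandom | head -c 64)" > ~/.ipfs/swarm.key

# Drop the default public bootstrap peers so nodes only dial each other
# (the multiaddr below is a placeholder for one of the private nodes):
ipfs bootstrap rm --all
ipfs bootstrap add /ip4/10.0.0.1/tcp/4001/ipfs/<peer-id-of-private-node>

# Refuse to start at all unless the private-network key is present:
export LIBP2P_FORCE_PNET=1
ipfs daemon
```

Nodes sharing the key form an isolated swarm, so load-testing provide storms there cannot affect the public DHT.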
Possible Areas of Focus for Engineering Efforts
Areas that we might focus on in this sprint (needs prioritization):
Sprint Prep Action Items
Sprint Planning: data.gov (aka 300 TB Challenge)
Useful Links & Issues
TODO: dig up diagram @whyrusleeping created
The system is currently not scaling well, and we don't have good metrics, graphs, or reports about performance -- where, when, and how it dipped under certain circumstances. We need to know more than "does it scale?"; we need to know "how does it scale?" so we can identify the domain of problems.
The current implementation mixes porcelain UX concerns with the underlying implementation/plumbing. This makes the interfaces confusing and complicated. It also makes the underlying plumbing more complicated and less robust than it should be.
@jbenet & @whyrusleeping need to sit down and figure out how they want to proceed with this. @flyingzumwalt will try to capture that info in the filestore Stories & Epics. The main things that need to be specified:
The case for ipfs-pack
Currently the way people use go-ipfs is with
Extending that idea, if you create little .ipfs repositories next to the manifest files, it becomes possible to
Why implement ipfs-pack now?
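To make the case concrete, here is a sketch of the workflow ipfs-pack aims to enable. The command names follow the ipfs-pack prototype; exact flags and behavior may differ, so treat this as an illustration rather than a reference.

```shell
cd /data/some-dataset            # a directory of dataset files on disk

# Build a PackManifest and a small .ipfs repo next to the files,
# referencing them in place instead of copying them into a repo:
ipfs-pack make

# Verify that the files on disk still match the hashes in the manifest:
ipfs-pack verify

# Serve the pack's contents to the IPFS network directly from disk:
ipfs-pack serve
```

Because the data never gets copied into a blockstore, an institution can offer a multi-terabyte dataset to the network without doubling its storage footprint.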
I listened in from around 1:00 to 1:30.
I like the idea of ipfs-pack, but I see some potential problems. I have not had time to review the spec, so it would be premature to bring them up.
I, too, would like to be present for the meeting on the filestore core so I can give feedback before we try to implement anything. There are some tricky aspects regarding multiple files with the same hash that need to be addressed for this to be considered a stable format. Most likely, the existing code can be adopted.
@Kubuxu sorry for the delay. Here you go: https://drive.google.com/file/d/0BzWuWHFTTIPERnpJSGJzYUFkYTA/view?usp=sharing
compressed version: https://ipfs.io/ipfs/QmW82hMetgeM1K4dTFL5w9aDsZ3LYekYkeMgHfY6wtED5c
Report from data.gov Sprint
The IPFS team has reached the end of its data.gov Sprint. Due to constraints on our very busy Q1 roadmap, we were only able to allocate a single two-week sprint (16-27 January 2017) to work on this full-time. While we didn't reach all of the objectives, we have done our best to clear the path for our collaborators to finish the experiment. In the coming weeks, @flyingzumwalt will continue to participate in the project, and the IPFS maintainers will provide information and advice where possible.
Within the IPFS team, we were excited to have the opportunity to help. This situation gets at one of the key reasons why we're building IPFS -- we want everyone to be able to hold and serve copies of the data they care about rather than relying on centralized services.
What we Accomplished
A number of collaborators have stepped up on very short notice to replicate these datasets. We're happy to tell them that the software is ready for them to use. Here's what @whyrusleeping, @Kubuxu, @kevina, @jbenet and @flyingzumwalt have done:
Next Steps with the data.gov Datasets
Specifically regarding the data.gov Datasets, next steps include:
@flyingzumwalt will remain the main point of contact on the ipfs side coordinating this work.
Next Steps for the IPFS Code Base
Relevant follow-up work on the IPFS code bases involves:
Here's a breakdown of each:
ipfs-cluster: IPFS Nodes Coordinating to Hold Datasets
The DataRescue effort has triggered multiple requests for tools that allow IPFS nodes to coordinate with each other in order to hold valuable content. We were already working on this functionality, under the name ipfs-cluster. The captain for ipfs-cluster is @hsanjuan. That code base will be moving forward throughout the quarter.
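As an illustration of what this coordination looks like in practice, here is a hedged sketch using the ipfs-cluster CLI. The command names come from the ipfs-cluster project; the exact interface is still evolving, so treat the flags and the placeholder hash as assumptions.

```shell
# Each institution runs a cluster peer alongside its regular IPFS daemon:
ipfs-cluster-service daemon &

# Pin a dataset's root hash across the cluster, so that several
# participating peers commit to holding a copy:
ipfs-cluster-ctl pin add <dataset-root-hash>

# Check which peers are holding the pin and whether replication succeeded:
ipfs-cluster-ctl status
```

The point is that no single node is the custodian: the cluster tracks a shared pinset and re-replicates content if a peer drops out.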
Some relevant discussions related to ipfs-cluster:
More Testing and Optimization to Come
We wish we could have done more testing and optimizing before the collaborators started replicating the actual data.gov datasets, but that work will have to wait for a few more weeks. We have two sprints scheduled later in the quarter that will be specifically focused on Improving our Testing & CI Infrastructure and Building a Proper Test Lab for Distributed Networks. We're confident that those tests will allow us to achieve major improvements in speed, stability, configurability, and security.
Deduplication of Datasets
In the aftermath of this high-speed effort to download datasets, people are now asking how to deduplicate datasets. This becomes especially relevant when we consider distributing and archiving datasets as they change over time -- if parts of the datasets stay the same between versions, we want to avoid storing & replicating them multiple times.
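The versioning point above can be illustrated with a toy content-defined chunker. This is not IPFS's actual rabin implementation -- just a minimal sketch of the principle: if cut points are chosen by content rather than fixed offsets, prepending bytes to a dataset only changes the first chunk or two, and every later chunk deduplicates against the previous version.

```python
# Toy content-defined chunking demo (illustrative only; IPFS's rabin
# chunker is more sophisticated, but the dedup principle is the same).
import hashlib
import random

def chunks(data: bytes, mask: int = 0xFF) -> list[bytes]:
    """Cut wherever the low bits of a tiny sliding-window hash are zero,
    so cut points follow the content rather than fixed byte offsets."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Shift-XOR hash: only the last ~8 bytes influence the low 8 bits,
        # so the boundary test behaves like a small sliding window.
        h = ((h << 1) ^ b) & 0xFFFFFFFF
        if (h & mask) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(0)
v1 = random.randbytes(100_000)        # version 1 of a "dataset"
v2 = b"16-byte header!\n" + v1        # version 2: metadata prepended

c1 = {hashlib.sha256(c).hexdigest() for c in chunks(v1)}
c2 = {hashlib.sha256(c).hexdigest() for c in chunks(v2)}
print(f"chunks of v1 reusable for v2: {len(c1 & c2)} of {len(c1)}")
```

With fixed-size chunking, the 16-byte prepend would shift every chunk boundary and nothing would deduplicate; with content-defined chunking, almost every chunk hash is shared between the two versions, so a second institution replicating v2 only fetches the new material.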
This has spurred interest in the different chunking algorithms IPFS supports. In particular, people are taking interest in Rabin fingerprinting. Here are some GitHub issues where the discussion is happening:
If we've missed anything important from our list of Next Steps, please let us know so we don't lose track of it.