Sprint: Data.gov (aka 300 TB Challenge) #87

flyingzumwalt · 2017-01-14T19:48:50Z

Dates: 16-27 January 2017
Sprint Milestone: https://github.com/ipfs/archives/milestone/1
Waffle Board: https://waffle.io/ipfs/archives

Participants from IPFS Team:

Collaborators

@mejackreed

Advisors

@jonnycrunch

Description

During this sprint, we will work with collaborating institutions to load all of data.gov (350 TB of datasets) into IPFS, publish the hashes on the DHT, and replicate the data to nodes at participating institutions.

Main Issues & Boards for Tracking this Work

Sprint Milestone: https://github.com/ipfs/archives/milestone/1
Waffle Board: https://waffle.io/ipfs/archives

Main Issues for Tracking this work:

Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World) #104: Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World)
Call for Participants/Collaborators for data.gov Sprint #107: Call for Participants/Collaborators for data.gov Sprint
Download all of data.gov #113: Download all of the datasets from data.gov

Objectives

Top level objectives

Add data.gov (300+ TB) to IPFS so that people around the world can replicate authenticated copies of the data using IPFS
Provide advice to organizations who are adding large volumes of content to IPFS
Test IPFS performance at this scale and tune for performance, memory usage, stability, etc.
Improve User Experience for people adding and replicating large volumes of data
Identify possible next steps
Move towards making IPFS work at exabyte scale

What will be Downloaded

The data.gov website is a portal for searching through all the open data published by US federal agencies. It currently lists over 190,000 datasets. The goal is to download those datasets, back them up, and eventually publish them on IPFS, and replicate them across multiple institutions.

How is this different from the Internet Archive's EOT Harvest of data.gov?

In short,the End of Term Presidential Harvest will capture the data.gov website but is not likely to capture the datasets that it links to. We aim to capture and replicate the datasets.

From the Federal Depository Library Program website:

The Library of Congress, California Digital Library, University of North Texas Libraries, Internet Archive, George Washington University Libraries, Stanford University Libraries, and the U.S. Government Publishing Office (GPO) are leading a collaborative project to harvest and preserve public U.S. Government websites at the conclusion of the current Presidential administration ending on January 20, 2017.

This End of Term Presidential Harvest is focused on crawling websites (ie. grabbing HTML & static files, following links), including data.gov. It is not focused on downloading whole datasets, which often have to be retrieved using tools/processes beyond the capabilities of regular web crawlers.

You can see the ongoing EOT Harvest work, and nominate sites for harvest, at the End of Term Presidential Harvest 2016 Website

Obstacles

Allocating Storage: It's hard to spin up 350TB of storage on short notice. Ideally, we need at least 700TB (one full copy backed upoutside IPFS, another copy inside IPFS)
Time Constraints: We added this to the calendar at the last minute. The assigned IPFS team is only available for these 2 weeks. After that, they are booked on other sprints through the remainder of Q1. This means we need to make the most of this time and need to leave a trail for community members & collaborators to continue the work after the end of the sprint.

Notes

Isolating from the Main IPFS Network: We might do this on a separate "private" IPFS network to ensure stability while we load-test the system (we want to make sure it's all working smoothly before we flood the main public IPFS network with provide statements and additions to the DHT.

Possible Areas of Focus for Engineering Efforts

Areas that we might focus on in this sprint (needs prioritization):

ipfs-pack
Index-in-place (aka. IPFS "Filestore") -- allows you to serve content through IPFS without copying the content itself into the ipfs repo.
Providers UX
- Providing only roots
- Easy specification of blocks to autoprovide
- Introspection of providers processes
Blockstore Perf
- Analysis of different datastores
- make datastores configurable (and tool for converting between them)
- Run benchmarks on multi-TB datasets/repos
- Investigate 'single file' blockstore
Delegated Content Routing
- Supernode DHT
- 'Trackers'
- Multi-ContentRouting
- DHT Record Signing
Memory Usage Improvements
- Multiplex
- Peerstore to disk
Deployment/Ops UX
- see Operational Peace of Mind sprint

flyingzumwalt · 2017-01-15T23:35:02Z

@jbenet suggested here:

Some things to do that would help, before tue:

review existing filestore stuff

review ipfs-pack draft proposal -- Proposing some tooling for datasets (ipfs-pack and stuff) ipfs/notes#205

make a short list of the "big bugs" and "big optimizations" relevant for this sprint. (i can think of a couple -- file attrs, bitswap supporting paths (kills so many RTTs) -- , but we'll want a good list to have in mind)

refine the concrete use case and UX we're shooting for

create the test workloads + scripts we'll work against:

for each order of magnitude of total data, from 1M -> 100TB

with giant files

with lots of little files

can just add tests to https://github.com/ipfs/fs-stress-test

review ipfs-s3 datastore options -- S3-backed IPFS ipfs/notes#214

flyingzumwalt · 2017-01-15T23:54:09Z

kevina · 2017-01-16T17:45:46Z

This sounds like it involves me. Is there a reason I am not mentioned?

flyingzumwalt · 2017-01-16T18:35:05Z

@kevina this sprint arose very quickly based on sudden interest outside our org. I'm putting together the docs as quickly as I can so we can coordinate. I suspect that @jbenet and @whyrusleeping will pull you onto the sprint if you're available.

flyingzumwalt · 2017-01-17T20:31:37Z

Sprint Planning: data.gov (aka 300 TB Challenge)

Date: 2017-17-01

Lead: @flyingzumwalt

Notetaker: @flyingzumwalt

Participants

Notes

Useful Links & Issues

Sprint: Data.gov (aka 300 TB Challenge) #87: Main Sprint Issue for data.gov Sprint
Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World) #104: Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World)
Epics for the whole sprint
ipfs-pack Stories & Epics
filestore Stories & Epics

Big Optimizations

TODO: dig up diagram @whyrusleeping created

Adding is still very slow. We can do way better.
- adding large files is faster than adding lots of small files
- need a way to test these things See Story: Test Suite for 1MB -> 100TB Payloads #102
- @lgierth recently added ~3.2 TB for CCC. It took about a day to add. Performance dropped as the repo grew. Would have taken half a day if performance had stayed constant.
- @Kubuxu ran some tests (see Identify the "big bugs" and "big optimizations" relevant for data.gov #105 (comment))
- Path forward: design good tests. See Story: Test Suite for 1MB -> 100TB Payloads #102
fetching from network is very slow (won't be able to fill the pipes)
- tests should also address this
DHT with huge datasets might get oversaturated
- DHT is not going to scale in time for this sprint, so we just won't use the DHT -- this means we need to find a way to do the routing. See Figure out content routing without DHT #120
- {@whyrusleeping mentioned something i didn't hear..}
Garbage Collection might not work with huge datasets
- leaving GC out of scope for this sprint.
Bitswap hasn't really been tested yet
- @lgierth & @whyrusleeping ran some tests on this but didn't get clear info. It took over a week to ???
- See Make sure Bitswap works in all cases #121
Private Networks -- do we need it in order to do Provide Instructions for setting up data.gov Collaborators' Testbed Network #116 and Figure out content routing without DHT #120? @Kubuxu will follow up with estimates (or merged code)

Test Suite

See #102

Currently not scaling well. Don't have good metrics, graphs or reports about performance -- where/when/how performance dipped under certain circumstances.

We need to know more than "Does it scale?". We need to know "how does it scale?" So we can identify the domain of problems, etc.

Filestore

See filestore Stories & Epics

The current implementation mixes porcelain UX concerns with the underlying iplementation/plumbing. This makes the interfaces confusing & complicated. It also makes the underlying plumbing more complicated and less robust than it should be.
Best approach: take the pieces of the code that we need and package it as an experimental feature with simple, straightforward interfaces.

@jbenet & @whyrusleeping need to sit down and figure out how they want to proceed with this. @flyingzumwalt will try to capture that info in the filestore Stories & Epics Main things that need to be specified:

How to do the internals/plumbing
What the UX should look like

ipfs-pack

ipfs-pack Stories & Epics

The case for ipfs-pack

Currently the way people use go-ipfs is with ipfs add which creates a duplicate copy of the added data on the machine. With filestore we aim to build indexes of pointers to data/blocks in-place. This solves performance concerns, but creates a brittle situation -- if you move the file, ipfs won't be able to serve it any more. ipfs-pack aims to address this by building manifest files that hold the indexes that match ipfs hashes to the content. If you store those manifest files alongside the cotnent they point to, it becomes a portable dataset.

Extending that idea, if you create little .ipfs repositories next to the manifest files, it becomes possible to

serve that dataset as its own little ipfs node
register the contents of that dataset with another ipfs node, serving the content directly from wherever you've stored/mounted it

Why implement ipfs-pack now?

Makes the UX much smoother for providers and their peers
packs make the a lot of these concepts clear, straightforward & relatable

kevina · 2017-01-18T03:20:28Z

I listened in from around 1:00 to 1:30.

I like the idea of ipfs-pack, but I see some potential problems. i have not had time to review the spec so it will be premature to bring them up.

I too would like to be present for the meeting on the filestore core so I can give feedback before we try to implement anything, there are some tricky aspects regarding multiple files with the same hash that need to be addressed for this to be considered a stable format. Most likely, the existing code can be adopted.

flyingzumwalt · 2017-01-24T19:43:50Z

@mejackreed so @whyrusleeping can do more realistic load testing in #126 can you please run du -a {path-to-data} on the datasets you've downloaded so far?

mejackreed · 2017-01-25T00:28:06Z

currently 2053258584 /data/master/pairtree_root/

Kubuxu · 2017-01-25T13:38:24Z

@mejackreed we are also interested in whole output of this command, it will allows us to know the distribution of filesizes, directories and so on.

mejackreed · 2017-01-25T22:58:16Z

@Kubuxu sorry for the delay. Here you go: https://drive.google.com/file/d/0BzWuWHFTTIPERnpJSGJzYUFkYTA/view?usp=sharing

Kubuxu · 2017-01-25T23:59:01Z

compressed version: https://ipfs.io/ipfs/QmW82hMetgeM1K4dTFL5w9aDsZ3LYekYkeMgHfY6wtED5c

flyingzumwalt · 2017-01-30T03:23:25Z

Report from data.gov Sprint

The IPFS team have reached the end of our data.gov Sprint. Due to constraints on our very busy Q1 Roadmap, we were only able to allocate a single sprint 16-27 January 2017 (2 weeks) to work on this full-time. While we didn't reach all of the objectives, we have done our best to clear the path for our collaborators to finish the experiment. In the coming weeks, @flyingzumwalt will continue to participate in the project and the IPFS maintainers will provide information & advice when possible.

Within the IPFS team, we were excited to have the opportunity to help. This situation gets at one of the key reasons why we're building IPFS -- we want everyone to be able to hold and serve copies of the data they care about rather than relying on centralized services.

What we Accomplished

A number of collaborators have stepped up on very short notice to replicate these datasets. We're happy to tell them that the software is ready for you to use. Here's what @whyrusleeping, @Kubuxu, @kevina, @jbenet and @flyingzumwalt have done:

We've written instructions for you to follow when replicating the data.gov datasets (and any other datasets). They're titled Instructions for Replicating Large Amounts of Data with Minimal Overhead
We made version 0.4.5 of ipfs support adding content in-place. This means you can now add content to ipfs without ipfs creating a duplicate copy of the data. Using this approach cuts the storage overhead by nearly 50%.
We created ipfs-pack to simplify the process of adding downloaded datasets to the ipfs network.
We ran some preliminary tests to confirm that everything will work.

Next Steps

Next Steps with the data.gov Datasets

Specifically regarding the data.gov Datasets, next steps include:

Finish Downloading the datasets
Follow the Instructions for Replicating Large Amounts of Data with Minimal Overhead
Connect the nodes to the general IPFS Network so others can replicate the data
Report on the Results

@flyingzumwalt will remain the main point of contact on the ipfs side coordinating this work.

Next Steps for the IPFS Code Base

Relevant follow-up work on the IPFS code bases involve:

ipfs-cluster: IPFS Nodes Coordinating to Hold datasets
More Testing and Optimization
Deduplication of Datasets

Here's a breakdown of each:

ipfs-cluster: IPFS Nodes Coordinating to Hold datasets

The DataRescue effort has triggered multiple requests for tools that allow IPFS nodes to coordinate with each other in order to hold valuable content. We were already working on this functionality, under the name ipfs-cluster. The captain for ipfs-cluster is @hsanjuan. That code base will be moving forward throughout the quarter.

Some relevant discussions related to ipfs-cluster:

Discussion: Making IPFS work for distributed archiving
ipfs-cluster user story: pinning rings
Making IPFS accessible for distributed archival. ipfs/notes#210

More Testing and Optimization to Come

We wish we could have done more testing and optimizing before the collaborators started replicating the actual data.gov datasets, but that work will have to wait for a few more weeks. We have two sprints scheduled later in the quarter that will be specifically focused on Improving our Testing & CI Infrastructure and Building a Proper Test Lab for Distributed Networks. We're confident that those tests will allow us to achieve major improvements in speed, stability, configurability, and security.

Deduplication of Datasets

In the aftermath of this high-speed effort to download datasets, people are now asking how to deduplicate datasets. This becomes especially relevant when we consider distributing and archiving datasets as they change over time -- if parts of the datasets stay the same between versions, we want to avoid storing & replicating them multiple times.

This has spurred interest in the different chunking algorithms IPFS supports. In particular, people are taking interest in rabin fingerprinting. Here are some Github issues where the discussion is happening:

Anything Else?

If we've missed anything important from our list of Next Steps, please let us know so we don't lose track of it.

flyingzumwalt self-assigned this Jan 14, 2017

flyingzumwalt modified the milestone: Data.gov (aka 300 TB Challenge) Jan 14, 2017

flyingzumwalt changed the title ~~Sprint: Data.Gov (aka 300 TB Challenge)~~ Sprint: Data.gov (aka 300 TB Challenge) Jan 14, 2017

This was referenced Jan 15, 2017

About this project... #88

Closed

Sprint: data.gov (aka 300 TB Challenge) ipfs/team-mgmt#342

Closed

flyingzumwalt added ready in progress and removed ready labels Jan 15, 2017

This was referenced Jan 15, 2017

Sprint: January 16-23 ipfs/team-mgmt#309

Closed

Review ipfs-s3 datastore options #100

Open

This was referenced Jan 16, 2017

Help Jack get access to storage for data.gov #106

Closed

Call for Participants/Collaborators for data.gov Sprint #107

Open

This was referenced Jan 17, 2017

Identify the "big bugs" and "big optimizations" relevant for data.gov #105

Closed

Review Filestore for 300TB Challenge. Update Stories + Specs #85

Closed

flyingzumwalt mentioned this issue Jan 18, 2017

Download all of data.gov #113

Open

flyingzumwalt added this to the Data.gov (aka 300 TB Challenge) milestone Jan 18, 2017

This was referenced Jan 18, 2017

is using ipfs for large scale content distribution a good idea? ipfs-inactive/faq#216

Closed

Meeting: Decentralizing LCSH ipld/persistent-metadata#1

Closed

flyingzumwalt mentioned this issue Jan 25, 2017

Whole team Prep Tasks CodeForPhilly/ipfs-switchboard#4

Open

flyingzumwalt mentioned this issue Jan 30, 2017

Captain's Log for the ipfs/archives endeavor #138

Open

titaniumbones mentioned this issue Jan 31, 2017

Test IPFS out as storage method for data edgi-govdata-archiving/overview#27

Closed

nicola mentioned this issue Feb 2, 2017

Storing data.gov for climate change data, together #139

Open

14 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sprint: Data.gov (aka 300 TB Challenge) #87

Sprint: Data.gov (aka 300 TB Challenge) #87

flyingzumwalt commented Jan 14, 2017 •

edited

Loading

flyingzumwalt commented Jan 15, 2017

flyingzumwalt commented Jan 15, 2017 •

edited

Loading

kevina commented Jan 16, 2017

flyingzumwalt commented Jan 16, 2017

flyingzumwalt commented Jan 17, 2017 •

edited

Loading

kevina commented Jan 18, 2017

flyingzumwalt commented Jan 24, 2017

mejackreed commented Jan 25, 2017

Kubuxu commented Jan 25, 2017

mejackreed commented Jan 25, 2017

Kubuxu commented Jan 25, 2017

flyingzumwalt commented Jan 30, 2017

Sprint: Data.gov (aka 300 TB Challenge) #87

Sprint: Data.gov (aka 300 TB Challenge) #87

Comments

flyingzumwalt commented Jan 14, 2017 • edited Loading

Description

Main Issues & Boards for Tracking this Work

Objectives

Top level objectives

What will be Downloaded

How is this different from the Internet Archive's EOT Harvest of data.gov?

Obstacles

Notes

Possible Areas of Focus for Engineering Efforts

flyingzumwalt commented Jan 15, 2017

flyingzumwalt commented Jan 15, 2017 • edited Loading

Sprint Prep Action Items

@jbenet:

Together:

@flyingzumwalt:

kevina commented Jan 16, 2017

flyingzumwalt commented Jan 16, 2017

flyingzumwalt commented Jan 17, 2017 • edited Loading

Sprint Planning: data.gov (aka 300 TB Challenge)

Lead: @flyingzumwalt

Notetaker: @flyingzumwalt

Participants

Notes

Useful Links & Issues

Big Optimizations

Test Suite

Filestore

ipfs-pack

The case for ipfs-pack

Why implement ipfs-pack now?

kevina commented Jan 18, 2017

flyingzumwalt commented Jan 24, 2017

mejackreed commented Jan 25, 2017

Kubuxu commented Jan 25, 2017

mejackreed commented Jan 25, 2017

Kubuxu commented Jan 25, 2017

flyingzumwalt commented Jan 30, 2017

Report from data.gov Sprint

What we Accomplished

Next Steps

Next Steps with the data.gov Datasets

Next Steps for the IPFS Code Base

ipfs-cluster: IPFS Nodes Coordinating to Hold datasets

More Testing and Optimization to Come

Deduplication of Datasets

Anything Else?

flyingzumwalt commented Jan 14, 2017 •

edited

Loading

flyingzumwalt commented Jan 15, 2017 •

edited

Loading

flyingzumwalt commented Jan 17, 2017 •

edited

Loading