NOAA OISST Zarr is now on IPFS - next steps w/ Filecoin? #40

Open
cisaacstern opened this issue Oct 27, 2021 · 4 comments

@cisaacstern

cisaacstern commented Oct 27, 2021

Thanks to @sheriflouis-FF, @jnthnvctr, and @d70-t, the NOAA OISST Zarr store is now on IPFS, and openable with xarray:

https://gist.github.com/cisaacstern/de5b5d0a17bc3dadb372997f43e79a42
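
For those who don't want to click through, the open looks roughly like this (a sketch, assuming ipfsspec is installed so that fsspec recognizes the `ipfs://` protocol; the CID and path are taken from the gateway URL quoted further down in this thread):

```python
import xarray as xr
from fsspec import get_mapper

# Map the IPFS location to a key-value store that zarr can read.
# Requires ipfsspec, which registers the "ipfs" protocol with fsspec.
mapper = get_mapper(
    "ipfs://QmfLZZBXj46yz6WHfnQErBkMj65GrbrQcUUBjgr1sbfUBT/noaa_oisst/v2.1-avhrr.zarr"
)
ds = xr.open_zarr(mapper, consolidated=True)
print(ds)
```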

A few notes:

  • This dataset was copied to IPFS from our Open Storage Network (OSN) S3 bucket. IIUC, Tobias's ipfsspec is currently read-only, so there is no direct path yet for writing to IPFS from pangeo-forge-recipes.

  • The opening time of 4+ minutes is obviously slow; it could be reduced by:

    1. Running a local IPFS gateway. ipfsspec prefers a local gateway, and falls back to a remote one if no local gateway is found. I am still unclear on the simplest recommended way to run a local gateway.
    2. Removing chunking from the time dimension. The Zarr store from which this was copied was written to OSN prior to pangeo-forge-recipes#210 (Consolidate dimension coordinates), which I believe resolves this issue; see the sketch after this list.
    3. Tobias is working on an async implementation of ipfsspec, which would also help.
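
For item 2, a rough sketch of what rewriting with consolidated dimension coordinates could look like (hypothetical, not the pangeo-forge-recipes implementation; `src_store` and `dst_store` are placeholders for real store mappers):

```python
import xarray as xr

# Hypothetical rewrite: store each dimension coordinate (e.g. "time") as a
# single chunk, so building the index needs one request per coordinate
# instead of one request per small chunk.
ds = xr.open_zarr(src_store, consolidated=True)
for dim in ds.dims:
    if dim in ds.coords:
        ds[dim].encoding["chunks"] = ds[dim].shape
ds.to_zarr(dst_store, consolidated=True)
```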

Now that we have a minimal working example on the read side, I'm opening this issue to solicit input from @pangeo-forge/dev-team as well as the IPFS crew (please tag others if I've missed someone!) regarding next steps with this project. What milestones should we focus on?

@martindurant

I imagine the majority of the time is spent fetching the time chunks, so options 1, 2, or both would make a big difference.

@d70-t

d70-t commented Oct 27, 2021

Yes @martindurant, most of the time is spent fetching the time chunks. Indeed, all three of the changes would help speed up retrieval.

One could simulate what a proper async implementation would do by running, e.g.:

```python
import xarray as xr

# Going through an HTTPS gateway URL means fsspec's async HTTP filesystem
# is used, so chunk fetches happen concurrently (no ipfsspec involved).
path = "https://tempgw01.web3.storage/ipfs/QmfLZZBXj46yz6WHfnQErBkMj65GrbrQcUUBjgr1sbfUBT/noaa_oisst/v2.1-avhrr.zarr"
ds = xr.open_zarr(path, consolidated=True)
print(ds)
```

which runs within 3 seconds on my laptop.

Of course, referring to the data directly via the HTTPS link defeats the purpose of IPFS, which is to become independent of location-based addressing. So the proper way will in any case be to have one (or many) gateways / IPFS nodes close to where the data is used.

If I run the async ipfsspec implementation on my laptop (which has an IPFS node running), I can open the same dataset in 500 ms. As the async implementation currently doesn't have fallback / load balancing, I'm hesitant to release it as the standard ipfsspec.

When I run the sync variant of ipfsspec against my local node, the open time coincidentally drops to 3 s as well.
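
For anyone wanting to reproduce these numbers, a trivial timing harness (reusing the gateway path from above; swap in the `ipfs://...` URL via ipfsspec to compare against a local node):

```python
import time

import xarray as xr

# Time a cold open of the store through the HTTPS gateway.
path = "https://tempgw01.web3.storage/ipfs/QmfLZZBXj46yz6WHfnQErBkMj65GrbrQcUUBjgr1sbfUBT/noaa_oisst/v2.1-avhrr.zarr"
start = time.perf_counter()
ds = xr.open_zarr(path, consolidated=True)
print(f"opened in {time.perf_counter() - start:.2f} s")
```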


I've created another issue on ipfsspec to discuss some of its particular design decisions.

@sheriflouis-FF

The reason ipfsspec is read-only is that Zarr needs to know the keys before it starts writing, i.e. we need to figure out a way to generate the CIDs before creating the DAG (Directed Acyclic Graph).
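
A toy illustration of that chicken-and-egg problem, using a plain SHA-256 digest as a stand-in for a real CID:

```python
import hashlib

# Content-addressed store: the key is derived from the bytes, so it cannot
# exist before the content does.
chunk_bytes = b"\x00" * 1024  # pretend this is a compressed Zarr chunk
cid_like_key = hashlib.sha256(chunk_bytes).hexdigest()
content_addressed = {cid_like_key: chunk_bytes}

# Zarr's write model: the key (the chunk's path) is fixed before the bytes
# are produced, which is the opposite ordering.
path_addressed = {"sst/0.0.0": chunk_bytes}
```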

I wanted to break down the issues with the IPFS node + gateway, as there are two layers.

First, pinning and advertising: when a root CID is pinned to an IPFS node, the default behavior is to advertise the whole tree to the IPFS network, i.e. the root CID and all of its children. This makes finding CIDs on IPFS pretty fast. When this is done on Estuary, the children are not advertised, only the root; this is why every file and directory within the dataset will need to be advertised or pinned individually. This also gives more control over the traffic generated by an IPFS node, and is the recommended path for busy IPFS nodes.

Second, retrieval latency:
1. If retrieving a CID took more than one minute, that usually means IPFS had to go through its distributed hash table (DHT) and peers to find the CID.
2. If the retrieval took a few seconds, that points to high latency within the gateway. We are currently working on scaling this and thus improving the overall performance.
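
As a concrete illustration, recursively pinning the dataset's root CID on a local node (so that the whole tree is advertised, per the default behavior above) could look like this via the go-ipfs HTTP API (a sketch, assuming a node's API is listening on the default port 5001):

```python
import requests

# Recursively pin the root CID on a local IPFS node; recursive pinning
# covers the root and all of its children.
ROOT_CID = "QmfLZZBXj46yz6WHfnQErBkMj65GrbrQcUUBjgr1sbfUBT"
resp = requests.post(
    "http://127.0.0.1:5001/api/v0/pin/add",
    params={"arg": ROOT_CID, "recursive": "true"},
)
print(resp.json())  # e.g. {"Pins": ["Qmf..."]}
```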

@d70-t

d70-t commented Nov 18, 2021

@sheriflouis-FF regarding the pinning: I think for the particular case of Zarr, the finest level accessed directly via CID (i.e. without any path) might well be the level of a Zarr array (i.e. a folder containing a file called .zarray). Everything below that will likely (for now) always be accessed using a path (identifying the chunk) relative to that folder. So it might be a good tradeoff to advertise CIDs down to that level, but maybe not further.
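
To illustrate what path-based access below an array-level CID looks like (a sketch, assuming ipfsspec is installed; the CID here is a made-up placeholder):

```python
import fsspec

# Hypothetical array-level CID (a placeholder, not a real hash): the folder
# it points at contains the array metadata plus the chunk objects.
array_cid = "QmArrayLevelCidPlaceholder"

fs = fsspec.filesystem("ipfs")
print(fs.cat(f"{array_cid}/.zarray"))  # array metadata, addressed by path
chunk = fs.cat(f"{array_cid}/0.0.0")   # a chunk, also addressed by path
```

Only `array_cid` itself would need to be advertised; the keys below it resolve by walking the DAG down from that CID.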

If at some point we were able to trace the chunk CIDs through computations (as briefly mentioned here), this might change, however.
