NOAA OISST Zarr is now on IPFS - next steps w/ Filecoin? #40

Open
cisaacstern opened this issue Oct 27, 2021 · 4 comments

@cisaacstern

cisaacstern commented Oct 27, 2021

Thanks to @sheriflouis-FF, @jnthnvctr, and @d70-t, the NOAA OISST Zarr store is now on IPFS, and openable with xarray:

https://gist.github.com/cisaacstern/de5b5d0a17bc3dadb372997f43e79a42
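
For those who don't want to click through, the open looks roughly like this (a sketch, assuming ipfsspec is installed so that fsspec recognizes the `ipfs://` protocol; the CID and path are taken from the gateway URL quoted further down in this thread):

```python
import xarray as xr
from fsspec import get_mapper

# Map the IPFS location to a key-value store that zarr can read.
# Requires ipfsspec, which registers the "ipfs" protocol with fsspec.
mapper = get_mapper(
    "ipfs://QmfLZZBXj46yz6WHfnQErBkMj65GrbrQcUUBjgr1sbfUBT/noaa_oisst/v2.1-avhrr.zarr"
)
ds = xr.open_zarr(mapper, consolidated=True)
print(ds)
```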

A few notes:

  • This dataset was copied to IPFS from our Open Storage Network (OSN) S3 bucket. IIUC, Tobias's ipfsspec is currently read-only, so there is no direct path yet for writing to IPFS from pangeo-forge-recipes.

  • The opening time of 4+ minutes is obviously slow; it could be reduced by:

    1. Running a local IPFS gateway. ipfsspec prefers a local gateway, and falls back to a remote one if no local gateway is found. I am still unclear on the simplest recommended way to run a local gateway.
    2. Removing chunking from the time dimension. The Zarr store from which this was copied was written to OSN prior to pangeo-forge-recipes#210 (Consolidate dimension coordinates), which I believe resolves this issue; see the sketch after this list.
    3. Tobias is working on an async implementation of ipfsspec, which would also help.
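
For item 2, a rough sketch of what rewriting with consolidated dimension coordinates could look like (hypothetical, not the pangeo-forge-recipes implementation; `src_store` and `dst_store` are placeholders for real store mappers):

```python
import xarray as xr

# Hypothetical rewrite: store each dimension coordinate (e.g. "time") as a
# single chunk, so building the index needs one request per coordinate
# instead of one request per small chunk.
ds = xr.open_zarr(src_store, consolidated=True)
for dim in ds.dims:
    if dim in ds.coords:
        ds[dim].encoding["chunks"] = ds[dim].shape
ds.to_zarr(dst_store, consolidated=True)
```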

Now that we have a minimal working example on the read side, I'm opening this issue to solicit input from @pangeo-forge/dev-team as well as the IPFS crew (please tag others if I've missed someone!) regarding next steps with this project. What milestones should we focus on?

@martindurant

I imagine the majority of the time is spent fetching the time chunks, so options 1, 2, or both would make a big difference.

@d70-t

d70-t commented Oct 27, 2021

Yes @martindurant, most of the time is spent fetching the time chunks. Indeed, all three of the changes would help speed up retrieval.

One could simulate what a proper async implementation would do by running, e.g.:

```python
import xarray as xr

# Going through an HTTPS gateway URL means fsspec's async HTTP filesystem
# is used, so chunk fetches happen concurrently (no ipfsspec involved).
path = "https://tempgw01.web3.storage/ipfs/QmfLZZBXj46yz6WHfnQErBkMj65GrbrQcUUBjgr1sbfUBT/noaa_oisst/v2.1-avhrr.zarr"
ds = xr.open_zarr(path, consolidated=True)
print(ds)
```

which runs within 3 seconds on my laptop.

Of course, referring to the data directly via the HTTPS link defeats the purpose of IPFS, which is to become independent of location-based addressing. So the proper way will in any case be to have one (or many) gateways / IPFS nodes close to where the data is used.

If I run the async ipfsspec implementation on my laptop (which has an IPFS node running), I can open the same dataset in 500 ms. As the async implementation currently doesn't have fallback / load balancing, I'm hesitant to release it as the standard ipfsspec.

When I run the sync variant of ipfsspec against my local node, the open time coincidentally drops to 3 s as well.
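
For anyone wanting to reproduce these numbers, a trivial timing harness (reusing the gateway path from above; swap in the `ipfs://...` URL via ipfsspec to compare against a local node):

```python
import time

import xarray as xr

# Time a cold open of the store through the HTTPS gateway.
path = "https://tempgw01.web3.storage/ipfs/QmfLZZBXj46yz6WHfnQErBkMj65GrbrQcUUBjgr1sbfUBT/noaa_oisst/v2.1-avhrr.zarr"
start = time.perf_counter()
ds = xr.open_zarr(path, consolidated=True)
print(f"opened in {time.perf_counter() - start:.2f} s")
```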


I've created another issue on ipfsspec to discuss some of its particular design decisions.

@sheriflouis-FF

The reason ipfsspec is read-only is that Zarr needs to know the keys before it starts writing, i.e. we need to figure out a way to generate the CIDs before creating the DAG (Directed Acyclic Graph).
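
A toy illustration of that chicken-and-egg problem, using a plain SHA-256 digest as a stand-in for a real CID:

```python
import hashlib

# Content-addressed store: the key is derived from the bytes, so it cannot
# exist before the content does.
chunk_bytes = b"\x00" * 1024  # pretend this is a compressed Zarr chunk
cid_like_key = hashlib.sha256(chunk_bytes).hexdigest()
content_addressed = {cid_like_key: chunk_bytes}

# Zarr's write model: the key (the chunk's path) is fixed before the bytes
# are produced, which is the opposite ordering.
path_addressed = {"sst/0.0.0": chunk_bytes}
```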

I wanted to break down the issues with the IPFS node + gateway, as there are two layers.

First, pinning and advertising: when a root CID is pinned to an IPFS node, the default behavior is to advertise the whole tree to the IPFS network, i.e. the root CID and all of its children. This makes finding CIDs on IPFS pretty fast. When this is done on Estuary, the children are not advertised, only the root; this is why every file and directory within the dataset will need to be advertised or pinned individually. This also gives more control over the traffic generated by an IPFS node, and is the recommended path for busy IPFS nodes.

Second, retrieval latency:
1. If retrieving a CID took more than one minute, that usually means IPFS had to go through its distributed hash table (DHT) and peers to find the CID.
2. If the retrieval took a few seconds, that points to high latency within the gateway. We are currently working on scaling this and thus improving the overall performance.
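
As a concrete illustration, recursively pinning the dataset's root CID on a local node (so that the whole tree is advertised, per the default behavior above) could look like this via the go-ipfs HTTP API (a sketch, assuming a node's API is listening on the default port 5001):

```python
import requests

# Recursively pin the root CID on a local IPFS node; recursive pinning
# covers the root and all of its children.
ROOT_CID = "QmfLZZBXj46yz6WHfnQErBkMj65GrbrQcUUBjgr1sbfUBT"
resp = requests.post(
    "http://127.0.0.1:5001/api/v0/pin/add",
    params={"arg": ROOT_CID, "recursive": "true"},
)
print(resp.json())  # e.g. {"Pins": ["Qmf..."]}
```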

@d70-t

d70-t commented Nov 18, 2021

@sheriflouis-FF regarding the pinning: I think for the particular case of Zarr, the finest level accessed directly via CID (i.e. without any path) might well be the level of a Zarr array (i.e. a folder containing a file called .zarray). Everything below that will likely (for now) always be accessed using a path (identifying the chunk) relative to that folder. So it might be a good tradeoff to advertise CIDs down to that level, but maybe not further.
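
To illustrate what path-based access below an array-level CID looks like (a sketch, assuming ipfsspec is installed; the CID here is a made-up placeholder):

```python
import fsspec

# Hypothetical array-level CID (a placeholder, not a real hash): the folder
# it points at contains the array metadata plus the chunk objects.
array_cid = "QmArrayLevelCidPlaceholder"

fs = fsspec.filesystem("ipfs")
print(fs.cat(f"{array_cid}/.zarray"))  # array metadata, addressed by path
chunk = fs.cat(f"{array_cid}/0.0.0")   # a chunk, also addressed by path
```

Only `array_cid` itself would need to be advertised; the keys below it resolve by walking the DAG down from that CID.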

If at some point we were able to trace the chunk CIDs through computations (as briefly mentioned here), this might change, however.
