Support incremental appending #37

rabernat · 2021-01-22T22:11:52Z

Currently, when a recipe is run, it always caches all of the inputs and writes all of the chunks. However, it would be nice to have an option where, if the target already exists, it only writes new chunks. This raises some design questions.

Currently, the target is never read until we start to execute the recipe (not until the prepare_target stage). However, for this to work, the iter_inputs() and iter_chunks() methods need to know which inputs and chunks to process. In order to build the pipeline for execution, this information needs to already be inside the recipe object. So this implies that we need to open the target in __post_init__. Could this cause problems?
How do we align the recipe with the target? For the standard NetCDFZarrSequential recipe, it may be as simple as looking at the length of the sequence dimension: if the target has 100 items but the recipe has 120, we assume the last 20 need to be appended. But are there edge cases to worry about?
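
For illustration, the length heuristic above could be sketched roughly as follows. The function name `append_region` is hypothetical and not part of pangeo-forge-recipes; the sketch assumes the target already holds a contiguous prefix of the recipe's sequence, which is exactly the assumption that edge cases might break.

```python
def append_region(target_len: int, recipe_len: int) -> range:
    """Return the sequence-dimension indices that still need to be written.

    Hypothetical helper: assumes the existing target is a contiguous
    prefix of the recipe's sequence (items 0 .. target_len - 1).
    """
    if target_len < 0 or recipe_len < 0:
        raise ValueError("lengths must be non-negative")
    if target_len > recipe_len:
        # The target has more items than the recipe describes, so
        # "append the tail" is not a meaningful operation here.
        raise ValueError("target is longer than the recipe; cannot append")
    return range(target_len, recipe_len)


# The example from the text: target has 100 items, recipe has 120.
new_items = append_region(target_len=100, recipe_len=120)
```

Here `new_items` covers indices 100 through 119, i.e. the last 20 items. A target that is already up to date yields an empty range.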

This intersects a bit with the "versioning" question in #3.

If we agree on the answers to the questions above, I think we can move ahead with implementing incremental updates to the NetCDFZarrSequentialRecipe class.

rabernat · 2021-01-24T04:56:41Z

Appending is significantly more complicated for the case discussed in #50: variable items per input. In this case, we don't know the size of the target dataset from the outset, so we can't use simple heuristics like the one I proposed above to figure out the append region.

Maybe recipes can implement their own methods for examining the target and determining which input chunks are needed? For that, it seems like the recipe would have to know more about the inputs than just a list of paths. For instance, it might have to understand time. What if input_urls were a dictionary rather than a list, and the keys held some semantic meaning that could be used to compare to the target?
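
A minimal sketch of the dictionary idea, assuming the keys are dates and that the target's time coordinate can be read back as a set of dates. The function name `inputs_to_append` and the example.com URLs are hypothetical, and nothing here reflects an existing recipe API.

```python
from datetime import date
from typing import Dict, Set


def inputs_to_append(input_urls: Dict[date, str], target_times: Set[date]) -> Dict[date, str]:
    """Return the inputs whose key is not yet present in the target.

    Hypothetical helper: keys carry semantic meaning (here, a date),
    so they can be compared directly with the target's time coordinate.
    """
    return {
        key: url
        for key, url in sorted(input_urls.items())
        if key not in target_times
    }


# Placeholder URLs, for illustration only.
input_urls = {
    date(2021, 1, 1): "https://example.com/2021-01-01.nc",
    date(2021, 1, 2): "https://example.com/2021-01-02.nc",
    date(2021, 1, 3): "https://example.com/2021-01-03.nc",
}
target_times = {date(2021, 1, 1), date(2021, 1, 2)}  # already in the target

missing = inputs_to_append(input_urls, target_times)
```

With this approach the comparison no longer depends on the size of the target along the sequence dimension, only on which keys it already contains.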

davidbrochart · 2021-02-19T19:27:16Z

Maybe instead of looking at what has been produced, we could look at what has been consumed? For instance, after running a recipe, we could store the list of input files that were processed, so that when we get a new list the next time the recipe is run, we look at the already processed input files and restore a "resuming" state. That could be done with a "dry run" that would run the recipe without actually producing anything. We would still have the list of all processed input files, which might be needed when we "finalize" the target.
I don't know where we could store the list of processed input files. Probably alongside the target, which seems the most natural. On the source side, tying a recipe to a target doesn't seem right.
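
One way to picture this, as a sketch only: keep a small JSON manifest of consumed inputs next to the target and filter the new input list against it. The file name `processed_inputs.json` and all function names are hypothetical, and a local temporary directory stands in for the target's storage location.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def load_processed(manifest: Path) -> List[str]:
    """Return inputs recorded by previous runs (empty on the first run)."""
    if not manifest.exists():
        return []
    return json.loads(manifest.read_text())


def select_new_inputs(all_inputs: Iterable[str], processed: Iterable[str]) -> List[str]:
    """Keep only inputs no previous run has consumed, preserving order."""
    seen = set(processed)
    return [url for url in all_inputs if url not in seen]


def record_processed(manifest: Path, processed: List[str], newly_done: List[str]) -> None:
    """Persist the full list of consumed inputs for the next run."""
    manifest.write_text(json.dumps(processed + newly_done, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "processed_inputs.json"  # hypothetical sidecar file

    # First run: nothing recorded yet, so every input is new.
    first_run = select_new_inputs(["a.nc", "b.nc"], load_processed(manifest))
    record_processed(manifest, load_processed(manifest), first_run)

    # Second run: the input list has grown by one file.
    second_run = select_new_inputs(["a.nc", "b.nc", "c.nc"], load_processed(manifest))
    record_processed(manifest, load_processed(manifest), second_run)
    all_processed = load_processed(manifest)
```

The manifest still holds the complete list of processed inputs after each run, which is what a "finalize" step would need.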

rabernat · 2021-02-20T20:29:58Z

This is a good idea, David. Perhaps we could store the list of input files directly in the target dataset metadata itself (attrs). This would be useful for incremental appending but also for general provenance tracking.

davidbrochart · 2021-02-20T22:26:46Z

> Perhaps we could store the list of input files directly in the target dataset metadata itself (attrs).

I was thinking about that, but what if the target is not in the Zarr format? Do other formats all have metadata that we can use for this purpose? I don't know COG that much, but I'm not sure it does for instance.

martindurant · 2021-02-21T00:42:56Z

Sounds like the kind of thing that naturally belongs in a catalogue, perhaps backed by a db, so that updates are safe from races.
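
As a rough sketch of what a db-backed record could look like, the snippet below uses SQLite from the Python standard library as a stand-in for whatever database a real catalogue would use. The table name, column names, and function names are all hypothetical. The point is only that a primary-key constraint lets the database, rather than the caller, decide which writer recorded an input first.

```python
import sqlite3


def open_catalogue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical catalogue of processed inputs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_inputs ("
        " target TEXT NOT NULL,"
        " input_url TEXT NOT NULL,"
        " PRIMARY KEY (target, input_url))"
    )
    return conn


def claim_input(conn: sqlite3.Connection, target: str, input_url: str) -> bool:
    """Record an input for a target; return True only for the first claim.

    INSERT OR IGNORE makes a repeated claim a no-op, so two writers
    cannot both believe they were the first to record the same input.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_inputs (target, input_url) VALUES (?, ?)",
            (target, input_url),
        )
    return cur.rowcount == 1


conn = open_catalogue()
first_claim = claim_input(conn, "my-target.zarr", "a.nc")   # newly recorded
second_claim = claim_input(conn, "my-target.zarr", "a.nc")  # already recorded
```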

rabernat · 2022-05-05T18:02:13Z

The input hashing stuff introduced by @cisaacstern in #349 should make this doable. The user story for this is being tracked in pangeo-forge/user-stories#5.

Charles, would you be game for diving into this and developing a prototype?

cisaacstern · 2022-05-05T18:10:43Z

In order for this to work within pangeo-forge-recipes entirely (without external information from the database/orchestration layer), we'll need to leave some metadata (i.e. recipe and/or pattern hashes) in the target store. Based on reading the thread, it seems like it could be okay to put this in .zmetadata?

rabernat · 2022-05-05T18:13:27Z

That sounds reasonable to me. I think in general we should be injecting extra metadata into the datasets we write. Stuff like

{
    "pangeo-forge:version": "0.6.2",
    "pangeo-forge:recipe-hash": "a1b2c3",
    "pangeo-forge:input-hash": "..."
}

Addressing that as a standalone issue would be a good place to start.
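
To make the idea concrete, here is a minimal sketch of how a stored input hash could be compared with the current inputs. It is not the hashing introduced in #349; the functions `input_hash` and `needs_update` are hypothetical, a plain dict stands in for the target's attrs, and only the key name "pangeo-forge:input-hash" is taken from the example above.

```python
import hashlib
import json
from typing import Dict, Iterable


def input_hash(input_urls: Iterable[str]) -> str:
    """Hash a list of input URLs (assumption: order does not matter)."""
    payload = json.dumps(sorted(input_urls)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_update(target_attrs: Dict[str, str], input_urls: Iterable[str]) -> bool:
    """Return True if the stored hash is missing or differs from the inputs."""
    stored = target_attrs.get("pangeo-forge:input-hash")
    return stored != input_hash(input_urls)


# A plain dict stands in for the metadata written with the target.
attrs = {"pangeo-forge:input-hash": input_hash(["a.nc", "b.nc"])}

unchanged = needs_update(attrs, ["b.nc", "a.nc"])        # same inputs, reordered
grown = needs_update(attrs, ["a.nc", "b.nc", "c.nc"])    # one new input
```

A single hash only says that something changed, not what changed. Working out the append region would still need the per-input information discussed earlier in the thread.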

cisaacstern · 2023-08-28T19:09:40Z

Per conversation at today's coordination meeting, people felt it would be simpler to have a single tracking issue for appending, so closing this and directing further discussion to #447.

rabernat added the "design question" (A question of the design of Pangeo Forge) and "recipe enhancement" (Solving this requires us to enhance the recipe classes) labels Jan 22, 2021

rabernat added this to Discussion Needed in Software Development via automation Jan 24, 2021

davidbrochart mentioned this issue Mar 3, 2021

Support incremental appending #81

Closed

sharkinsspatial mentioned this issue Aug 2, 2021

Finalize Github workflow logic for running recipes in feedstocks. pangeo-forge/staged-recipes#67

Open

rabernat mentioned this issue Sep 16, 2021

Add version to meta.yaml pangeo-forge/roadmap#34

Merged

rabernat mentioned this issue Jan 25, 2022

hash all downloads #266

Open

cisaacstern mentioned this issue Mar 24, 2022

Proposed Recipes for NASA MODIS-COSP data (satellite observations of clouds) pangeo-forge/staged-recipes#125

Open

TomAugspurger mentioned this issue May 5, 2022

GPM-IMERG-hhr files on Planetary Computer pangeo-forge/gpm-imerge-hhr-feedstock#2

Open

rabernat mentioned this issue May 5, 2022

Append-only production runs pangeo-forge/user-stories#5

Open

cisaacstern mentioned this issue May 6, 2022

Persist execution context in storage target #359

Merged

sebastienlanglois mentioned this issue May 30, 2022

Development of an ERA5 cloud storage for efficient access Ouranosinc/raven#396

Closed

rabernat pinned this issue Jun 21, 2022

rabernat mentioned this issue Dec 5, 2022

Making appending work in the beam refactor #447

Open

cisaacstern closed this as completed Aug 28, 2023

Software Development automation moved this from Discussion Needed to Done Aug 28, 2023
