Recipe stuck? #3

jbusecke · 2022-11-04T14:49:37Z

I suspect the recipe execution got stuck at this point. I just checked the Dataflow console, and only two jobs ran yesterday.

I presume the short successful run was the pruned test, and the cancelled run is the full recipe. Presumably something is going wrong beyond the first two files?

I wonder what the best way to debug this is. I would think pulling the logs and chasing down the error is the best next step. I am unsure if I actually have the credentials to get the logs, cc @andersy005 can you pull them?

If that does not work I can try to locally submit the recipe to the LEAP account.
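For whoever has the credentials, pulling the logs could look roughly like the sketch below. It assumes the `google-cloud-logging` Python client (a recent version) is installed and that Dataflow worker logs sit under the `dataflow_step` resource type in Cloud Logging. The project and job IDs are placeholders, not the real LEAP values, and the helper names are hypothetical.

```python
"""Sketch: fetch recent error logs for one Dataflow job from Cloud Logging."""


def build_dataflow_log_filter(job_id: str, min_severity: str = "ERROR") -> str:
    """Return a Cloud Logging filter selecting one Dataflow job's log entries."""
    return (
        'resource.type="dataflow_step" '
        f'AND resource.labels.job_id="{job_id}" '
        f"AND severity>={min_severity}"
    )


def print_recent_errors(project: str, job_id: str, limit: int = 50) -> None:
    """Print the newest matching entries; needs credentials with log-read access."""
    # Imported here so the filter helper works without the client library.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project=project)
    entries = client.list_entries(
        filter_=build_dataflow_log_filter(job_id),
        order_by=cloud_logging.DESCENDING,
        max_results=limit,
    )
    for entry in entries:
        print(entry.timestamp, entry.severity, entry.payload)


if __name__ == "__main__":
    # Placeholder identifier, not the real job.
    print(build_dataflow_log_filter("EXAMPLE_JOB_ID"))
```

Calling `print_recent_errors("<project-id>", "<job-id>")` with the ID of the cancelled job from the Dataflow console should surface the most recent errors first, which is usually the quickest way to see where a run stopped.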

raf-antwerpen · 2022-11-04T16:03:59Z

Hi @jbusecke, thanks for checking this! Could it be that the recipe is just taking a long time? The pruned recipe only processed 18 timesteps, while the full dataset has more than 8,000 timesteps (22 years × 365 days) and is ~130 GB.
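As a rough check of the scale difference, here is a back-of-envelope sketch. It ignores leap days and assumes the ~130 GB is spread evenly across daily timesteps, which is only an illustration.

```python
# Back-of-envelope comparison of the pruned test run and the full recipe.
# Figures come from the comment above; an even split of data across
# timesteps is an assumption made only for illustration.
YEARS = 22
DAYS_PER_YEAR = 365          # leap days ignored
PRUNED_TIMESTEPS = 18
TOTAL_SIZE_GB = 130          # approximate

full_timesteps = YEARS * DAYS_PER_YEAR
scale_factor = full_timesteps / PRUNED_TIMESTEPS
mb_per_timestep = TOTAL_SIZE_GB * 1000 / full_timesteps

print(f"full timesteps: {full_timesteps}")
print(f"full / pruned: {scale_factor:.0f}x")
print(f"approx. size per timestep: {mb_per_timestep:.1f} MB")
```

That gives 8030 timesteps, so the full run has several hundred times as many inputs as the pruned one. A much longer runtime would not be surprising on its own.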

andersy005 · 2022-11-04T17:56:45Z

Cross posting this here for reference

@jkingslake / @raf-antwerpen, thank you for your patience!

it turns out i made a change to the backend that resulted in the full recipe not being run after merging this PR. i'm still looking into fixing this bug, and will report back in #3

Originally posted by @andersy005 in #2 (comment)

raf-antwerpen · 2022-11-15T18:13:18Z

Hi @andersy005, any update on this? Or anything I can do from this side?

raf-antwerpen · 2023-01-05T18:14:15Z

Hi @andersy005 and @jbusecke. Is there any update on this? Can we do anything from our side to help this process?

jkingslake · 2023-01-12T23:45:41Z

Just pinging this again, @jbusecke, in case it got missed. :-)

jbusecke · 2023-01-13T18:07:24Z

@andersy005 @cisaacstern is there any update that merits a retry here?

andersy005 · 2023-01-13T21:17:30Z

@jkingslake / @raf-antwerpen / @jbusecke,

thank you all for your patience!

> @andersy005 @cisaacstern is there any update that merits a retry here?

Perhaps @cisaacstern can help (I haven't followed the recent development closely).

raf-antwerpen · 2023-01-26T16:07:09Z

Hi @jbusecke @andersy005 @cisaacstern @jkingslake,

Just wanted to open this again. I'd love to get this recipe working! Is there anything I can do to help with this? I'd be happy to meet with one of you at the LEAP workspace to discuss how to approach this if that helps! Just let me know.

cisaacstern · 2023-01-26T18:52:18Z

Hi all, thank you for the persistence and patience here, and sincere apologies that this has escaped my attention until now.

(We've gotten to the point where there are now quite a large number of active feedstocks in pangeo-forge, and frankly it's a bit difficult to keep track of all of them... @andersy005, this is a great example of where the labeling and project board system we discussed this week will be of great benefit!)

As it seems we're lacking clarity on what the problem was the first time around, my typical approach to debugging something like this would be to re-run the recipe (assuming it may fail) and then begin investigating causes of the failure using the logs from the most recently failed run.

Unfortunately, that is not possible immediately, because we seem to have hit some type of bug that is preventing the backend service from submitting jobs for any feedstock (not just this one). I've been tracking this issue for a week or so in pangeo-forge/pangeo-forge-orchestrator#220 and am now working on adding integration testing in pangeo-forge/pangeo-forge-orchestrator#226, as the first step towards figuring out why this is happening (and preventing it from taking down this aspect of our backend service in the future).

This is a long way of saying that I am happy to revisit this issue as soon as our job submission problem is resolved, which, realistically, will probably be at least a week from now. My hope in providing more detail above is to give some context for what is involved in making this all work, and perhaps something of an explanation for why things do at times take longer than we'd hope.

Happy to answer any more specific questions here, and please feel free to ping me again in a week or so if I haven't checked back by then.

raf-antwerpen · 2023-01-26T22:29:45Z

Thanks for the explanation and clarity @cisaacstern! This is really helpful. Good luck on the backend issue.

raf-antwerpen · 2023-02-02T15:49:37Z

Hi @cisaacstern, just wanted to check in on this. How's everything going?

cisaacstern · 2023-02-03T22:08:30Z

@raf-antwerpen thanks for checking in. The backend service bug remains unresolved. Thanks for your patience and please feel free to check back here at regular intervals. As an order of magnitude, my expectation is that this bug fix timeline will be measured in weeks. (Again, the longer turnaround here is a reflection of the fact that Pangeo Forge Cloud has only one full-time maintainer: yours truly 😆.)

In the event it's useful, I'll add that so long as pangeo-forge/pangeo-forge-orchestrator#226 remains unmerged, it's quite likely the bug is unresolved (that is the test I'm working on to figure out the cause of the bug).

Also, on the chance that this helps unblock you in the short term, it is entirely possible to run your own recipes (using various execution modes), as described in:

https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_recipes/recipe_user_guide/execution.html

The caveat here is that this will require writing to your own storage (whether that's cloud object storage, local disk, etc.).

raf-antwerpen · 2023-02-07T16:36:21Z

Thanks for the transparency @cisaacstern! I'll keep an eye on #226 for updates and check on how to run my own recipe.

raf-antwerpen · 2023-03-15T13:17:39Z

Hi @cisaacstern. I just wanted to check in about this since I'm unsure of the status of #226 and how to proceed from my end. I'd love to start using the data I'm trying to get onto Pangeo!

jbusecke · 2023-03-16T13:51:46Z

I just wanted to quickly chime in here:
If it would be of benefit to @raf-antwerpen, we could try to run the recipe locally or on Dataflow using the LEAP resources.
This, however, raises the question of where the data target should be. There are several options:

LEAP persistent bucket (only accessible for LEAP members)
Perhaps the same bucket that hosts the NetCDF files can be used to host a Zarr version (cc @jkingslake)

I wonder if this is a decent solution to unblock @raf-antwerpen.

jkingslake · 2023-03-16T14:47:36Z

Yes, this could be a good workaround for now. Putting it in the same bucket as the NC data works for me.

jbusecke · 2023-03-20T15:19:34Z

How about we discuss further steps together during my LEAP-Pangeo Office Hours?

cisaacstern · 2023-03-20T20:33:47Z

Big 👍 from me on this idea, thanks for the offer @jbusecke! Please let me know how I can advise/help.

jkingslake · 2023-03-21T00:44:51Z

Sounds good. I missed them today, so I am booked in for next Thursday.

andersy005 mentioned this issue Nov 4, 2022

Fix meta file #2

Merged

jbusecke added a commit that referenced this issue Nov 16, 2022

Dummy commit

881f430

Trying to get #3 unstuck by resubmitting with a dummy commit (I think that having a way to resubmit without changes is underway, but not functional yet).

jbusecke mentioned this issue Nov 16, 2022

Rerun recipe using dummy changes #4

Open

cisaacstern mentioned this issue Jan 27, 2023

osi_saf_450_430-a_rg025 pangeo-forge/staged-recipes#245

Open
