Recipe stuck? #3

Open
jbusecke opened this issue Nov 4, 2022 · 19 comments

Comments

@jbusecke

jbusecke commented Nov 4, 2022

I suspect that the recipe execution got stuck at this point. I just checked the Dataflow console, and only two jobs ran yesterday:
[image: screenshot of the Dataflow console]
I presume the short successful run was the pruned test and the cancelled run is the full recipe. Presumably something is going wrong beyond the first two files?

I wonder what the best way to debug this is. I would think pulling the logs and chasing down the error is the best next step. I am unsure if I actually have the credentials to get the logs. cc @andersy005, can you pull them?

If that does not work I can try to locally submit the recipe to the LEAP account.

@raf-antwerpen
Contributor

Hi @jbusecke, thanks for checking this! Could it be that the recipe is just taking a long time? The pruned recipe only processed 18 time steps, while the full dataset has more than 8000 time steps (22 years × 365 days) and is ~130 GB.
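
As a rough check on that estimate, here is a minimal sketch of the arithmetic. It assumes one time step per day, ignores leap days, and treats the ~130 GB figure as decimal gigabytes. The variable names are illustrative only and are not part of the recipe.

```python
# Rough scale comparison between the pruned test and the full recipe.
# All figures come from the comment above; one time step per day is an
# assumption, and leap days are ignored.
YEARS = 22
DAYS_PER_YEAR = 365
PRUNED_STEPS = 18
TOTAL_SIZE_GB = 130  # approximate, decimal gigabytes assumed

full_steps = YEARS * DAYS_PER_YEAR                 # time steps in the full dataset
scale_factor = full_steps / PRUNED_STEPS           # full run relative to pruned run
mb_per_step = TOTAL_SIZE_GB * 1000 / full_steps    # average size of one time step

print(f"full recipe: {full_steps} time steps")
print(f"about {scale_factor:.0f}x the pruned run")
print(f"about {mb_per_step:.0f} MB per time step")
```

If run time scaled linearly with the number of time steps, the full recipe would take several hundred times as long as the pruned test, so a long-running job would not by itself indicate a failure.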

@andersy005 andersy005 mentioned this issue Nov 4, 2022
@andersy005
Member

Cross posting this here for reference

@jkingslake / @raf-antwerpen, thank you for your patience!

It turns out I made a change to the backend that resulted in the full recipe not being run after merging this PR. I'm still looking into fixing this bug, and will report back in #3.

Originally posted by @andersy005 in #2 (comment)

@raf-antwerpen
Contributor

Hi @andersy005, any update on this? Or anything I can do from this side?

jbusecke added a commit that referenced this issue Nov 16, 2022
Trying to get #3 unstuck by resubmitting with a dummy commit (I think that having a way to resubmit without changes is underway, but not functional yet).
@raf-antwerpen
Contributor

Hi @andersy005 and @jbusecke. Is there any update on this? Can we do anything from our side to help this process?

@jkingslake

Just pinging this again @jbusecke, in case it got missed. :-)

@jbusecke
Author

@andersy005 @cisaacstern is there any update that merits a retry here?

@andersy005
Member

@jkingslake / @raf-antwerpen / @jbusecke,

thank you all for your patience!

> @andersy005 @cisaacstern is there any update that merits a retry here?

Perhaps @cisaacstern can help (I haven't followed the recent development closely).

@raf-antwerpen
Contributor

Hi @jbusecke @andersy005 @cisaacstern @jkingslake,

Just wanted to open this again. I'd love to get this recipe working! Is there anything I can do to help with this? I'd be happy to meet with one of you at the LEAP workspace to discuss how to approach this if that helps! Just let me know.

@cisaacstern
Member

Hi all, thank you for the persistence and patience here, and sincere apologies that this has escaped my attention until now.

(We've gotten to the point where there are now quite a large number of active feedstocks in pangeo-forge, and frankly it's a bit difficult to keep track of all of them... @andersy005, this is a great example of where the labeling and project board system we discussed this week will be of great benefit!)

As it seems we're lacking clarity on what the problem was the first time around, my typical approach to debugging something like this would be to re-run the recipe (assuming it may fail) and then begin investigating causes of the failure using the logs from the most recently failed run.

Unfortunately, that is not possible immediately, because we seem to have hit some type of bug which is preventing the backend service from submitting jobs from any feedstock (not just this one). I've been tracking this issue for a week or so in pangeo-forge/pangeo-forge-orchestrator#220 and am now working on adding integration testing in pangeo-forge/pangeo-forge-orchestrator#226, as the first step towards figuring out why this is happening (and preventing it from taking down this aspect of our backend service in the future).

This is a long way of saying that I am happy to revisit this issue as soon as our job submission problem is resolved, which, being realistic, will probably be at least a week from now. My hope in providing more detail above is to give some context for what is involved in making this all work, and perhaps something of an explanation for why things do at times take longer than we'd hope.

Happy to answer any more specific question here, and please feel free to ping me again here in a week or so if I haven't checked back by then.

@raf-antwerpen
Contributor

Thanks for the explanation and clarity @cisaacstern! This is really helpful. Good luck on the backend issue.

@raf-antwerpen
Contributor

Hi @cisaacstern, just wanted to check in on this. How's everything going?

@cisaacstern
Member

cisaacstern commented Feb 3, 2023

@raf-antwerpen thanks for checking in. The backend service bug remains unresolved. Thanks for your patience and please feel free to check back here at regular intervals. As an order of magnitude, my expectation is that this bug fix timeline will be measured in weeks. (Again, the longer turnaround here being a reflection of the fact that Pangeo Forge Cloud has only one full-time maintainer: yours truly 😆.)

In the event it's useful, I'll add that so long as pangeo-forge/pangeo-forge-orchestrator#226 remains unmerged, it's quite likely the bug is unresolved (that is the test I'm working on to figure out the cause of the bug).

Also, on the chance this may help unblock you in the short term, it is entirely possible to run your own recipes (using various execution modes), as described in:

https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_recipes/recipe_user_guide/execution.html

The caveat here is that this will require writing to your own storage (whether that's cloud object storage, local disk, etc.).

@raf-antwerpen
Contributor

Thanks for the transparency @cisaacstern! I'll keep an eye on #226 for updates and check on how to run my own recipe.

@raf-antwerpen
Contributor

Hi @cisaacstern. I just wanted to check in about this since I'm unsure of the status of #226 and how to proceed from my end. I'd love to start using the data I'm trying to get onto Pangeo!

@jbusecke
Author

I just wanted to quickly chime in here:
If it would be of benefit to @raf-antwerpen, we could try to run the recipe locally or on Dataflow using the LEAP resources.
This, however, raises the question of where the data target should be. Several options:

  • LEAP persistent bucket (only accessible for LEAP members)
  • Perhaps the same bucket that hosts the NetCDF files can also host a Zarr version (cc @jkingslake)

I wonder if this is a decent solution to unblock @raf-antwerpen.

@jkingslake

Yes, this could be a good workaround for now. Putting it in the same bucket as the NetCDF data works for me.

@jbusecke
Author

How about we discuss further steps together during my LEAP-Pangeo Office Hours?

@cisaacstern
Member

Big 👍 from me on this idea, thanks for the offer @jbusecke! Please let me know how I can advise/help.

@jkingslake

Sounds good. I missed them today, so I am booked in for next Thursday.
