Recipe stuck? #3
Comments
Hi @jbusecke, thanks for checking this! Could it be that the recipe is just taking a long time? The pruned recipe only did 18 time steps, while the full dataset is more than 8000 time steps (22 years of 365 days) and ~130 GB in size.
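For a rough sense of scale, the figures quoted in this comment can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes one time step per day, ignores leap days, treats 130 GB as decimal gigabytes, and all variable names are made up for illustration.

```python
# Back-of-the-envelope comparison of the pruned test run and the full recipe.
# Assumptions (not from the recipe itself): one time step per day, no leap days,
# and 130 GB meaning 130 * 1000 MB.
years = 22
days_per_year = 365
pruned_steps = 18
total_gb = 130

full_steps = years * days_per_year            # total daily time steps
mb_per_step = total_gb * 1000 / full_steps    # average data volume per time step
scale_factor = full_steps / pruned_steps      # how much larger the full run is

print(f"full recipe time steps: {full_steps}")
print(f"average size per step:  {mb_per_step:.1f} MB")
print(f"full run vs pruned run: {scale_factor:.0f}x more time steps")
```

If run time scaled linearly with the number of time steps, the full recipe would take a few hundred times longer than the pruned test, so a long-running job would not by itself mean the job is stuck.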
Cross-posting this here for reference:
@jkingslake / @raf-antwerpen, thank you for your patience! It turns out I made a change to the backend that resulted in the full recipe not being run after merging this PR. I'm still looking into fixing this bug, and will report back in #3.
Originally posted by @andersy005 in #2 (comment)
Hi @andersy005, any update on this? Or anything I can do from this side?
Trying to get #3 unstuck by resubmitting with a dummy commit (I think that having a way to resubmit without changes is underway, but not functional yet).
Hi @andersy005 and @jbusecke. Is there any update on this? Can we do anything from our side to help this process?
Just pinging this again @jbusecke, in case it got missed. :-)
@andersy005 @cisaacstern is there any update that merits a retry here?
@jkingslake / @raf-antwerpen / @jbusecke, thank you all for your patience!
Perhaps @cisaacstern can help (I haven't followed the recent development closely).
Hi @jbusecke @andersy005 @cisaacstern @jkingslake, just wanted to open this again. I'd love to get this recipe working! Is there anything I can do to help with this? I'd be happy to meet with one of you at the LEAP workspace to discuss how to approach this if that helps! Just let me know.
Hi all, thank you for the persistence and patience here, and sincere apologies that this has escaped my attention until now. (We've gotten to the point where there are now quite a large number of active feedstocks in […].)
As it seems we're lacking clarity on what the problem was the first time around, my typical approach to debugging something like this would be to re-run the recipe (assuming it may fail) and then begin investigating causes of the failure using the logs from the most recently failed run.
Unfortunately, that is not possible immediately, because we seem to have hit some type of bug which is preventing the backend service from submitting jobs from any feedstock (not just this one). I've been tracking this issue for a week or so in pangeo-forge/pangeo-forge-orchestrator#220 and am now working on adding integration testing in pangeo-forge/pangeo-forge-orchestrator#226, as the first step towards figuring out why this is happening (and preventing it from taking down this aspect of our backend service in the future).
This is a long way of saying that I am happy to revisit this issue as soon as our job submission problem is resolved, which, being realistic, will probably be at least a week from now.
My hope in providing more detail above is to provide some context for what is involved to make this all work, and perhaps something of an explanation for why things do at times take longer than we'd hope. Happy to answer any more specific questions here, and please feel free to ping me again here in a week or so if I haven't checked back by then.
Thanks for the explanation and clarity @cisaacstern! This is really helpful. Good luck on the backend issue.
Hi @cisaacstern, just wanted to check in on this. How's everything going?
@raf-antwerpen thanks for checking in. The backend service bug remains unresolved. Thanks for your patience and please feel free to check back here at regular intervals. As an order of magnitude, my expectation is that this bug fix timeline will be measured in weeks. (Again, the longer turnaround here being a reflection of the fact that Pangeo Forge Cloud has only one full-time maintainer: yours truly 😆.) In the event it's useful, I'll add that so long as pangeo-forge/pangeo-forge-orchestrator#226 remains unmerged, it's quite likely the bug is unresolved (that is the test I'm working on to figure out the cause of the bug). Also, on the chance this may help unblock you in the short term, it is entirely possible to run your own recipes (using various execution modes), as described in:
The caveat here is that this will require writing to your own storage (whether that's cloud object storage, local disk, etc.).
Thanks for the transparency @cisaacstern! I'll keep an eye on #226 for updates and check on how to run my own recipe.
Hi @cisaacstern. I just wanted to check in about this since I'm unsure of the status of #226 and how to proceed from my end. I'd love to start using the data I'm trying to get onto Pangeo!
I just wanted to quickly chime in here:
I wonder if this is a decent solution to unblock @raf-antwerpen.
Yes, this could be a good workaround for now. Putting it in the same bucket as the NC data works for me.
How about we discuss further steps together during my LEAP-Pangeo Office Hours?
Big 👍 from me on this idea, thanks for the offer @jbusecke! Please let me know how I can advise/help.
Sounds good. I missed them today, so I am booked in for next Thursday.
I suspect that the recipe execution got stuck at this point. I just checked the Dataflow console and there are only two jobs that ran yesterday:
![image](https://user-images.githubusercontent.com/14314623/200003439-9d552631-db43-4a15-8cd6-7b70bd80898a.png)
I presume the short successful run was the pruned test, and the cancelled run is the full recipe. Presumably something is going wrong beyond the first two files?
I wonder what the best way to debug this is. I would think pulling the logs and chasing down the error is the best next step. I am unsure if I actually have the credentials to get the logs, cc @andersy005 can you pull them?
If that does not work I can try to locally submit the recipe to the LEAP account.