Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokio stack overflow #130

Closed
augustuswm opened this issue May 5, 2022 · 10 comments
Closed

Tokio stack overflow #130

augustuswm opened this issue May 5, 2022 · 10 comments
Labels
needs fix Currently functionality that is broken

Comments

@augustuswm
Copy link
Contributor

This occurs a few times per day in the webhooky. Looks to be correlated to M-F work (as expected). Needs investigation.

2022-05-05 13:50:35.960 CDT thread 'tokio-runtime-worker' has overflowed its stack
2022-05-05 13:50:35.960 CDT fatal runtime error: stack overflow
2022-05-05 13:50:35.961 CDT Uncaught signal: 6, pid=1, tid=3, fault_addr=0.
@augustuswm augustuswm added 🐛 bug Something that isn't working. needs fix Currently functionality that is broken and removed 🐛 bug Something that isn't working. labels May 6, 2022
@zephraph
Copy link
Contributor

zephraph commented May 6, 2022

I believe @jessfraz was looking into something related to this a little while back.

@jessfraz
Copy link
Contributor

jessfraz commented May 6, 2022

Yeah we've talked about the stack overflows

@jessfraz
Copy link
Contributor

jessfraz commented May 6, 2022

The nightmare bugs

@augustuswm
Copy link
Contributor Author

Looks to be highly correlated with pushes to RFD branches handled by handle_rfd_push. These are the main RFD branches (named 0\d\d\d) as opposed to something named 0003-new-edits (which are ignored).

Looking over the after the last two weeks all stack overflows are preceded by a log entry of the form:
{"level":"INFO", "msg":"push event -> file rfd/0219/README.adoc was modified on branch 0219", "ts":"2022-04-25T10:09:18.045736121Z"}
about 4-5 seconds before tokio crashes. The inverse also looks to be true (RFD push is followed by overflow).

This GCP query can be used to isolate the log entries: (("`push` event -> file" AND jsonPayload.msg =~ "branch 0\d\d\d$") OR "has overflowed its stack")

I need to look closer to determine what exactly is causing the overflow, but it looks to originate between lines 555 and 577 in webhooky/src/handlers_github.rs (those lines call a lot of other code though)

@augustuswm
Copy link
Contributor Author

augustuswm commented May 10, 2022

It seems to be failing on let r = github.repos().get(owner, repo).await?; in get_rfd_contents_from_repo I don't think the line itself it an issue, but possible the app is creating a deep enough tokio stack that it is actually running out of memory?

@jessfraz
Copy link
Contributor

okay so this is where I have gotten as well. Then I dug deep into the generated client and couldn't find anything. And putting printfs random places didn't give me anything either. Somewhere there must be a recursive function right, but I could not for the life of me figure it out, like it must be in a lib of a lib or something, anyways I didn't get much further other than it was happening right before the auth function in the octorust generated lib

@augustuswm
Copy link
Contributor Author

Running tokio with 4MB thread stacks instead of the default 2MB looks to provide enough memory for the RFD push events to be handled. After measuring more locally it looks like the large async trees are getting too large. I never hit an overflow locally, but got very close. My best guess at this point is we weren't hitting recursive calls, we were just creating too deeply nested of an async struct. I'll likely leave the increased thread stack size at 4MB for now, and will retest when we look at breaking down some of the functions.

@jessfraz
Copy link
Contributor

OMG that makes so much more sense!

@jessfraz
Copy link
Contributor

I was going crazy thinking something was recursive!!

@augustuswm
Copy link
Contributor Author

Closing this for now as the immediate problem is resolved. We will revisit memory usage in the future to bring down the needed stack size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs fix Currently functionality that is broken
Projects
None yet
Development

No branches or pull requests

3 participants