Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix proper exception handling and impedance match in tornado_requests #1128

Merged
merged 5 commits into from Oct 10, 2022

Conversation

galmasi
Copy link
Contributor

@galmasi galmasi commented Oct 10, 2022

Summary

This patch fixes an impedance match problem between the verifier and the python tornado package. TLDR exception handling is incorrectly implemented in the glue layer (tornado_requests.py) resulting in both uncaught exceptions and potentially returned None values that the verifier cannot handle.

Technical details

  • The uncaught exception addressed by this fix occurs in cloudlab_verifier_tornado in the function invoke_get_quote.
  • The immediate cause for the failure is an operating system level failure that occurs in rare circumstances. We could only make it happen in virtual deployments, at large enough scale (1k+ agents) and with sufficient traffic over a virtual e1000 network adapter. We were unable to recreate this exception in controlled circumstances; as is, we are seeing approximately 3 network device driver crashes per verifier instance per day.
  • With a relatively low probability (< 5% based on rough back of the envelope calculations) the network device driver crash causes an uncaught exception in invoke_get_quote. We call this an Old Attestation Failure (OAF) event, because the net effect is that the verifier thread quits, leading to a situation where a particular agent no longer receives get_quote requests. The verifier's state machine remains in the verified state, and the Verifiermain database is no longer updated for this agent.
  • 95% of the time the outcome is a caught exception in invoke_get_quote. This results in a standard communication failure exception followed by a keylime retry with exponential backoff, and the system recovers normally. The only record of such an event is the retry notification in the log.
  • We believe that the uncaught communication exception is tied to the network device driver failing mid-communication. We believe neither tornado nor tornado_requests is handling this situation appropriately.
  • Given the above deployment numbers above, we saw approximately one OAF event per verifier every 6 days. With 10 verifiers handling 1500 nodes, this resulted in about one OAF event every half day.
  • OAF events do not show up in any keylime logs. Apparently exceptions thrown by coroutines to not print stack traces in the log unless async debugging is on.

We became aware of OAF events only after the patch by @maugustosilva (#1091 #1093) to dump attestation timers into the VerifierMain database.

cc: @mdrocco @maugustosilva

George Almasi and others added 2 commits October 8, 2022 14:10
@THS-on
Copy link
Member

THS-on commented Oct 10, 2022

@galmasi the fix itself LGTM, but can you fix the style and linter issues and then rebase your commit on the latest master without the merge commit?

George Almasi added 3 commits October 10, 2022 17:26
Signed-off-by: George Almasi <gheorghe@us.ibm.com>
Signed-off-by: George Almasi <gheorghe@us.ibm.com>
@maugustosilva maugustosilva self-requested a review October 10, 2022 17:52
@THS-on
Copy link
Member

THS-on commented Oct 10, 2022

@galmasi can you squash the commits together?

@maugustosilva
Copy link
Contributor

I will squash and merge. This commit is highly relevant to address a persistent timing bug we are seeing on our scalability tests.

@maugustosilva maugustosilva merged commit f969d39 into keylime:master Oct 10, 2022
8 of 13 checks passed
@galmasi
Copy link
Contributor Author

galmasi commented Oct 10, 2022

whoops, @maugustosilva did my work for me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants