Fix proper exception handling and impedance match in tornado_requests
#1128
Summary
This patch fixes an impedance mismatch between the verifier and the Python tornado package. TL;DR: exception handling is incorrectly implemented in the glue layer (tornado_requests.py), resulting in both uncaught exceptions and potentially returned None values that the verifier cannot handle.

Technical details
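The failure mode named in the summary, a glue layer that lets transport exceptions escape or hands back None, can be illustrated with a minimal sketch. This is not the code in tornado_requests.py, nor the change made by this patch: `Response`, `safe_request`, and the `fetch` argument are hypothetical names, and the status codes 599 and 500 are illustrative choices. The only point is that the wrapper always returns a well-formed object, so the caller can decide whether to retry.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Response:
    """Hypothetical stand-in for the object a glue layer hands to its caller."""
    status_code: int
    body: Optional[bytes] = None
    error: Optional[str] = None


async def safe_request(fetch: Callable[[], Awaitable[bytes]]) -> Response:
    """Await fetch() and always return a Response: never None, never a raised I/O error."""
    try:
        body = await fetch()
    except (OSError, asyncio.TimeoutError) as exc:
        # Transport-level failure (connection reset, timeout, device error).
        # 599 is an assumed "no HTTP response received" code for this sketch.
        return Response(status_code=599, error=f"{type(exc).__name__}: {exc}")
    except Exception as exc:
        # Anything unexpected is still reported to the caller instead of
        # silently ending the task that issued the request.
        return Response(status_code=500, error=f"{type(exc).__name__}: {exc}")
    return Response(status_code=200, body=body)
```

Under these assumptions, a caller only has to check `status_code` before using `body`, and a retry with backoff can be driven from a single code path instead of depending on which exception happened to propagate.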
[...] cloudlab_verifier_tornado in the function invoke_get_quote.
[...] e1000 network adapter. We were unable to recreate this exception in controlled circumstances; as is, we are seeing approximately 3 network device driver crashes per verifier instance per day.
[...] invoke_get_quote. We call this an Old Attestation Failure (OAF) event, because the net effect is that the verifier thread quits, leading to a situation where a particular agent no longer receives get_quote requests. The verifier's state machine remains in the verified state, and the Verifiermain database is no longer updated for this agent.
[...] invoke_get_quote. This results in a standard communication failure exception followed by a keylime retry with exponential backoff, and the system recovers normally. The only record of such an event is the retry notification in the log.
[...] tornado nor tornado_requests is handling this situation appropriately.

We became aware of OAF events only after the patch by @maugustosilva (#1091 #1093) to dump attestation timers into the VerifierMain database.

cc: @mdrocco @maugustosilva