
Improve logging at Arrow stream shutdown; avoid the explicit Canceled message at stream lifetime #170

Merged: 9 commits merged into open-telemetry:main from jmacd/combined on Apr 3, 2024

Conversation

@jmacd (Contributor) commented Apr 1, 2024:

Fixes #160.
Fixes a test flake discovered while preparing open-telemetry/opentelemetry-collector-contrib#31996.

For the test flake, the test is rewritten. The old test relied on the behavior of a single stream at shutdown, whereas it was meant to verify the seamlessness of stream restart. The new test covers the intended condition better: it configures a short stream lifetime, runs long enough for five stream lifetimes, and then checks that no logs were observed.

I carried out manual testing of the shutdown process to help with #160. The problem was that the exporter receives an error from both its writer and its reader goroutines and logs both. There is now consistent handling of the EOF and Canceled error states, which are very similar: EOF is what the client sees when the server responds to the client's CloseSend(), while Canceled is generated internally by gRPC.

Both the exporter and the receiver now have similar error-logging logic. Each treats EOF and Canceled as the same condition and Debug-logs a "shutdown" message when it occurs; all other error conditions are logged as errors.
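
As an illustration of that classification, here is a minimal Go sketch. The helper name logStreamError, the "who" label, and the zap wiring are assumptions for illustration, not the exact code in this PR.

package sketch

import (
	"errors"
	"io"

	"go.uber.org/zap"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// logStreamError sketches the shared classification: io.EOF and a gRPC
// Canceled status are the normal stream-shutdown case and are logged at
// Debug level; anything else is logged as an error.
func logStreamError(logger *zap.Logger, who string, err error) {
	if err == nil {
		return
	}
	if errors.Is(err, io.EOF) || status.Code(err) == codes.Canceled {
		logger.Debug("arrow stream shutdown", zap.String("who", who))
		return
	}
	logger.Error("arrow stream error", zap.String("who", who), zap.Error(err))
}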

This removes the former use of the StatusCode called Canceled; it is no longer necessary after switching the code to use a Status-wrapped Canceled code instead. If receivers are updated ahead of exporters, good. If exporters are upgraded ahead of receivers, they will log spurious errors at stream shutdown (but, as #160 shows, they were already issuing spurious logs). The code remains defined in the protocol, since it is mapped to the gRPC code space, and it is still handled as an error condition.

		return str.anyStreamClient, nil
	}
}

// healthyTestChannel accepts the connection and returns an OK status immediately.
type healthyTestChannel struct {
	lock sync.Mutex
@jmacd (Contributor, Author) commented:

This lock is needed to avoid a race, because CloseSend() now closes the sent channel, which some of the former tests were doing manually. CloseSend() is now always called, and the closed channel serves to assist with some tests.
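
For illustration only, a minimal sketch of a lock-guarded test channel of this shape; the sent field, the closed flag, and the struct{} payload type are assumptions rather than the exact test code. Writers and CloseSend() take the same lock, so they never race on close(sent).

package sketch

import "sync"

// healthyTestChannel sketch: the mutex serializes writes to and closing of
// the sent channel.
type healthyTestChannel struct {
	lock   sync.Mutex
	sent   chan struct{} // stands in for the real payload type
	closed bool
}

// CloseSend closes the sent channel exactly once, under the lock, so it
// cannot race with a test that also closes or writes to the channel.
func (h *healthyTestChannel) CloseSend() error {
	h.lock.Lock()
	defer h.lock.Unlock()
	if !h.closed {
		close(h.sent)
		h.closed = true
	}
	return nil
}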

Comment on lines 237 to 238
case <-e.shutdown:
	err = context.Canceled
@jmacd (Contributor, Author) commented:

The test flake was related to this code. Shutdown does not cancel the request context, so another signal was needed to exit the exporter cleanly.
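
A rough sketch of the shape this takes, with a hypothetical helper name (waitForShutdown) and a plain shutdown channel standing in for the exporter's field:

package sketch

import "context"

// waitForShutdown shows the extra signal: because Shutdown() does not cancel
// the request context, a separate shutdown channel is also selected on and
// mapped to context.Canceled, so the write loop exits through the same
// clean-shutdown path.
func waitForShutdown(ctx context.Context, shutdown <-chan struct{}) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-shutdown:
		return context.Canceled
	}
}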

Comment on lines +211 to +220
// This (client == nil) signals the controller to
// downgrade when all streams have returned in that
// status.
//
// This is a special case because we reset s.client,
// which sets up a downgrade after the streams return.
s.client = nil
s.telemetry.Logger.Info("arrow is not supported",
	zap.String("message", status.Message()),
)
@jmacd (Contributor, Author) commented:

Note that this is a special case carried over from the old code. The rest of the rewrite in this function body is a major simplification.

Comment on lines +243 to +245
defer func() {
	s.client.CloseSend()
}()
@jmacd (Contributor, Author) commented:

This is as much about cleaning up tests as it is about being consistent when the stream is shutting down. Always closing the send channel appears to work, and it is simpler.

Comment on lines +363 to +365
	return status.Error(codes.Canceled, "client stream shutdown")
} else if errors.Is(err, context.Canceled) {
	return status.Error(codes.Canceled, "server stream shutdown")
@jmacd (Contributor, Author) commented:

This is the big change. Instead of returning a plain error, the receiver returns a status.Error() with the Canceled code, so the exporter sees that Canceled status in its reader and shuts down cleanly.
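
For the reader side, a hypothetical sketch (classifyRecvError is an assumed name, not the PR's code) of how the exporter's read loop can now treat that status the same way as io.EOF:

package sketch

import (
	"errors"
	"io"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classifyRecvError reports whether the stream is done and whether the
// reason is a real error: a Canceled status from the receiver is treated
// like io.EOF, i.e. a clean shutdown with nothing to report.
func classifyRecvError(err error) (done bool, realErr error) {
	switch {
	case err == nil:
		return false, nil
	case errors.Is(err, io.EOF), status.Code(err) == codes.Canceled:
		return true, nil // clean stream shutdown
	default:
		return true, err // a real error to report
	}
}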

@jmacd (Contributor, Author) commented Apr 1, 2024:

My manual test consisted of three collectors with two Arrow exporter/receiver pairs connecting an ordinary receiver and a debug exporter. There are no Info-level log messages despite stream recycling, and shutdown happens cleanly if the next hop is functioning.

@lquerel (Contributor) left a comment:

LGTM

@jmacd merged commit 9551a08 into open-telemetry:main on Apr 3, 2024
2 checks passed
@jmacd deleted the jmacd/combined branch on May 22, 2024 at 17:50

Successfully merging this pull request may close these issues.

graceful shutdown shows internal error instead of context canceled