Context.canceled handling changes for slo and receiver shim #3505

Closed
wants to merge 2 commits
Changes from 1 commit
7 changes: 7 additions & 0 deletions modules/distributor/receiver/shim.go
@@ -2,6 +2,7 @@ package receiver

import (
"context"
"errors"
"fmt"
"os"
"time"
@@ -342,6 +343,12 @@ func (r *receiversShim) ConsumeTraces(ctx context.Context, td ptrace.Traces) err
metricPushDuration.Observe(time.Since(start).Seconds())
if err != nil {
r.logger.Log("msg", "pusher failed to consume trace data", "err", err)

// Client disconnects are logged but not propagated back.
if errors.Is(err, context.Canceled) {
Review comment (Member):
This brings to mind a difficulty I have on the read path: it's impossible to tell where this context.Canceled came from. Is it further up in the otel receiver code due to a client disconnect, or deeper down in the distributor code?

For instance, if we fail to write to 2+ ingesters due to this timeout, I think that would bubble up as a context.Canceled as well:

localCtx, cancel := context.WithTimeout(ctx, d.clientCfg.RemoteTimeout)
defer cancel()

WithCancelCause was added in Go 1.20:

https://pkg.go.dev/context#WithCancelCause

to allow communicating the reason, but I don't know if this is set correctly in the gRPC server. It's definitely not in our own code. Maybe we set it in our code and assume that if there is no cause it's due to a client disconnect?

We unfortunately cancel contexts in a lot of places and don't have good patterns for when, why, or what is communicated when we do. As is, I think this would mask timeouts to the ingesters.
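For illustration, a minimal sketch of the WithCancelCause pattern described above, assuming we control the internal cancellation sites. The names errRemoteTimeout and isClientDisconnect are hypothetical, not Tempo code, and whether the gRPC server records a cause on client disconnect is still an open question:

// Minimal sketch (not Tempo code): cancel internal contexts with an explicit
// cause, then treat a context.Canceled without a recorded cause as a likely
// client disconnect.
package main

import (
	"context"
	"errors"
	"fmt"
)

var errRemoteTimeout = errors.New("remote write timed out")

// isClientDisconnect reports whether a context.Canceled error looks like a
// client disconnect: the context was canceled but no explicit cause was set.
func isClientDisconnect(ctx context.Context, err error) bool {
	if !errors.Is(err, context.Canceled) {
		return false
	}
	// context.Cause returns ctx.Err() when no cause was recorded, so an
	// explicit cause only appears if internal code called cancel(cause).
	cause := context.Cause(ctx)
	return cause == nil || errors.Is(cause, context.Canceled)
}

func main() {
	// Internal cancellation: record why we canceled.
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errRemoteTimeout)

	fmt.Println(isClientDisconnect(ctx, context.Canceled)) // false: cause is the timeout
	fmt.Println(context.Cause(ctx))                        // remote write timed out
}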

return nil
}

err = wrapErrorIfRetryable(err, r.retryDelay)
}

8 changes: 6 additions & 2 deletions modules/frontend/slos.go
@@ -1,6 +1,8 @@
package frontend

import (
"context"
"errors"
"net/http"
"time"

@@ -66,8 +68,10 @@ func sloHook(allByTenantCounter, withinSLOByTenantCounter *prometheus.CounterVec

// most errors are SLO violations
if err != nil {
// however, if this is a grpc resource exhausted error (429) then we are within SLO
if status.Code(err) == codes.ResourceExhausted {
// However, these errors are considered within SLO:
// * grpc resource exhausted error (429)
// * context canceled (client disconnected or canceled)
if status.Code(err) == codes.ResourceExhausted || errors.Is(err, context.Canceled) {
Review comment (Member):
Same thoughts here. Maybe we just log the cancel cause if one exists and see if that's populated?
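A sketch of that suggestion, assuming the hook can reach the request context; the function name and the use of the standard logger are illustrative, not Tempo's actual hook:

package main

import (
	"context"
	"errors"
	"log"
)

// logCancelCause logs the recorded cancel cause, if any, before the request
// is counted as within SLO. Illustrative only.
func logCancelCause(ctx context.Context, err error) {
	if !errors.Is(err, context.Canceled) {
		return
	}
	// context.Cause returns ctx.Err() when no explicit cause was set, so
	// anything else means internal code canceled with a reason.
	if cause := context.Cause(ctx); cause != nil && !errors.Is(cause, context.Canceled) {
		log.Printf("request canceled with cause: %v", cause)
	}
}

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errors.New("downstream timeout"))
	logCancelCause(ctx, context.Canceled) // logs: request canceled with cause: downstream timeout
}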

withinSLOByTenantCounter.WithLabelValues(tenant).Inc()
}
return
6 changes: 6 additions & 0 deletions modules/frontend/slos_test.go
@@ -1,6 +1,7 @@
package frontend

import (
"context"
"errors"
"net/http"
"testing"
@@ -32,6 +33,11 @@ func TestSLOHook(t *testing.T) {
name: "no slo fails : error",
err: errors.New("foo"),
},
{
name: "client disconnect (context canceled) passes",
err: context.Canceled,
expectedWithSLO: 1.0,
},
{
name: "no slo passes : resource exhausted grpc error",
err: status.Error(codes.ResourceExhausted, "foo"),