Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about correctness of metric.Meter implementation and use #4836

Closed
cep21 opened this issue Jan 20, 2024 · 16 comments · Fixed by open-telemetry/opentelemetry-go-contrib#4875
Labels
bug Something isn't working

Comments

@cep21
Copy link

cep21 commented Jan 20, 2024

Description

metric.Meter has the following code

type Meter interface {
...
  Int64Counter(name string, options ...Int64CounterOption) (Int64Counter, error)
...
}

Users, like grpc, have the following code.

	c.rpcResponseSize, err = c.meter.Int64Histogram("rpc."+role+".response.size",
		metric.WithDescription("Measures size of RPC response messages (uncompressed)."),
		metric.WithUnit("By"))
	if err != nil {
		otel.Handle(err)
	}

My problem is that metric.Meter does not document if it is ok to use the first returned value if the second is non-nil. In Go, this answer is generally assumed "no". In the example above, otel.Handle is called, but the function is not returned and continues. Later on, because there is no return, code could use the c.rpcResponseSize object and panic.

The only documentation for otel.Handle is

	// Handle handles any error deemed irremediable by an OpenTelemetry
	// component.
	Handle(error)

My assumption is that handle is intended to allow more than just panic. Maybe one idea is the service emails an error to a sys admin, but continues to run.

If that's the case, then creates of otel instrumentation, such as gRPC, need to either:

  • Do much more than otel.Handle, but actually set a noop object into their config
  • Implementers of metric.Meter need to always return a non-nil object as the first parameter and this contract should be clear in the documentation of metric.Meter

For the first option above, the implementation of gRPC otel should be modified. For the second option above, the documentation of metric.Meter needs to be modified.

Recommendation

The contract, IE documentation, of metric.Meter should say "If the second returned value is non-nil, then the first returned value will be a no-op object and users should call otel.Handle."

@cep21 cep21 added the bug Something isn't working label Jan 20, 2024
@MrAlias
Copy link
Contributor

MrAlias commented Jan 23, 2024

"If the second returned value is non-nil, then the first returned value will be a no-op object and users should call otel.Handle."

This is not true if there was an instrument name error. The returned instrument will be "usable" (though it may produce conflicting metric streams) and the error is ErrInstrumentName. How the user want's to handle this situation is going to be case-by-case dependent.

It is also not true that all users should pipe their error to otel.Handle. That is one option, but it is not universal. If it were we would have just done so within the called function. We cannot make a global suggestion that users should do that, it would be incorrect.

@cep21
Copy link
Author

cep21 commented Jan 23, 2024

The returned instrument will be "usable"

Should this be part of the documentation of the interface since implementations, such as gRPC, are depending upon this behavior to be correct?

also not true that all users should pipe their error to otel.Handle

Is there guidance on when to call otel.Handle and when not to?

@MrAlias
Copy link
Contributor

MrAlias commented Jan 23, 2024

Is there guidance on when to call otel.Handle and when not to?

For authors of the OTel SDK: whenever an error that cannot be returned to the user (asynchronous, not allowed via the API).

For authors of OTel instrumentation: as little as possible. If the error can be handled, handle it; if it cannot but can be returned to the user, return it to the user; if neither can be done and it is possibly recoverable or critical to the use then send to the otel.Handler.

For users of OTel that is defined by themselves: they are setting a handler, they are in the best position to decide this.

@MrAlias
Copy link
Contributor

MrAlias commented Jan 23, 2024

Should this be part of the documentation of the interface since implementations, such as gRPC, are depending upon this behavior to be correct?

Sure, sounds good to me. But I think it should be a part of the implementation of the interface's documentation not the interface itself given that is not where the behavior comes from.

@cep21
Copy link
Author

cep21 commented Jan 23, 2024

should be a part of the implementation of the interface's documentation

I disagree with that part. Here is my reasoning:

gRPC's usage takes a metric.Meter, not an implementation of a Meter. Therefor, it must only code against the contract of the thing it takes, an interface, not the contract of an implementation of that interface, of which there could be infinite. In the future, if people want to implement their own metric.Meter, they will need to know what behavior to use to implement a correct object. If correctness requires that the first parameter is always non nil and usable, then that detail must exist in the interface itself.

@MrAlias
Copy link
Contributor

MrAlias commented Jan 23, 2024

which there could be infinite

This it the reason you cannot say the interface will always return an error for a name defined by OpenTelemetry to be invalid for their SDKs in the interface. There is nothing stopping one of those infinite implementations from deviating from this behavior, nor should there be.

@cep21
Copy link
Author

cep21 commented Jan 23, 2024

I'm sorry I'll clarify my problem. I am writing code that takes a metric.Meter. I as calling Int64Counter. If the second parameter is an error, I want to know if it is ok to call functions on the first returned Int64Counter, or if those function calls will panic.

If it is ok to call functions on the first parameter, I would like to solidify this contract by putting it in the documentation of the object I'm calling: metric.Meter.

@MrAlias
Copy link
Contributor

MrAlias commented Jan 24, 2024

Yeah, I think we can document the Meter interface to state that if a non-nil instrument is returned it should never panic when its methods are called.

@cep21
Copy link
Author

cep21 commented Jan 24, 2024

if a non-nil instrument is returned

I'm sorry, I want to document.

All function calls of Meter that return an object and an error, always return a usable non-nil first value even if the second value is an error

I would like to document this because this current behavior of Meter is different than the rest of the Go standard library. For example, http.NewRequest returns a nil first object if the second return value, error, is non nil.

cep21 added a commit to cep21/opentelemetry-go that referenced this issue Jan 24, 2024
@cep21
Copy link
Author

cep21 commented Jan 24, 2024

I've submitted an example PR here: #4853

@MrAlias
Copy link
Contributor

MrAlias commented Jan 24, 2024

You are again describing the behavior of the OTel SDK that backs the interface. I don't think we should limit other implementations of that interface to never return nil. That is too restrictive.

@cep21
Copy link
Author

cep21 commented Jan 24, 2024

I see the disagreement. Is it fair to say:

I'm saying:

  • If a function takes a meter.Meter, then it can only rely on the documented contract of the object it takes: meter.Meter.

You're saying:

  • Code can take a meter.Meter and depend upon the contract of a Meter implementation, not meter.Meter itself, and still be correct.

@MrAlias
Copy link
Contributor

MrAlias commented Jan 24, 2024

I'm saying all a user can rely on is the interface contract itself. If an implementation of that interface wants to return a partial result and an error that is fine. If they want to return an unrecoverable error and nil, also fine.

If we want to communicate to users that an implementation may return partial results through the interface that sounds fine with me. I'm not okay with saying all implementations will be expected to do this and never return nil and an error.

@cep21
Copy link
Author

cep21 commented Jan 24, 2024

Thanks for clearing this up! Back to the gRPC implementation of open telemetry with this code

	c.rpcResponseSize, err = c.meter.Int64Histogram("rpc."+role+".response.size",
		metric.WithDescription("Measures size of RPC response messages (uncompressed)."),
		metric.WithUnit("By"))
	if err != nil {
		otel.Handle(err)
	}
        // ...
        c.rpcResponseSize.Record(...)

Since this implementation is relying on the behavior that c.rpcResponseSize is non nil, is this implementation of gRPC open telemetry incorrect? Should the implementation of open telemetry for gRPC be modified to not call functions on c.rpcResponseSize if the error is nil? Maybe

	c.rpcResponseSize, err = c.meter.Int64Histogram("rpc."+role+".response.size",
		metric.WithDescription("Measures size of RPC response messages (uncompressed)."),
		metric.WithUnit("By"))
	if err != nil {
		otel.Handle(err)
+		c.rpcResponseSize = &noopHistogram{}
	}

@MrAlias
Copy link
Contributor

MrAlias commented Jan 24, 2024

I think the instrumentation needs to check if the instrument is nil

	c.rpcResponseSize, err = c.meter.Int64Histogram("rpc."+role+".response.size",
		metric.WithDescription("Measures size of RPC response messages (uncompressed)."),
		metric.WithUnit("By"))
	if err != nil {
		otel.Handle(err)
	}
        // ...
        if c.rpcResponseSize != nil {
		c.rpcResponseSize.Record(...)
	}

or, like you mentioned, just set it to a noop if it is nil

	c.rpcResponseSize, err = c.meter.Int64Histogram("rpc."+role+".response.size",
		metric.WithDescription("Measures size of RPC response messages (uncompressed)."),
		metric.WithUnit("By"))
	if err != nil {
		otel.Handle(err)
		if c.rpcResponseSize == nil {
			c.rpcResponseSize = &noopHistogram{}
		}
	}

@cep21
Copy link
Author

cep21 commented Jan 24, 2024

Thanks for clarifying. I'm trying to write my own collector and am using the existing contrib collectors as models for best practice, which is where my confusion came from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants