Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: add OTEL_RESOURCE_ATTRIBUTES env var. #14556

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

lgfa29
Copy link
Contributor

@lgfa29 lgfa29 commented Sep 13, 2022

Add new task hook to inject a OTEL_RESOURCE_ATTRIBUTES environment variable with Nomad attributes into tasks. The attributes set are related to the alloc and specific task that is running, the node where the alloc is running, and the job and eval that generated the alloc.

These attributes are merged if the task already defines a OTEL_RESOURCE_ATTRIBUTES environment variable, or disabled if the value defined by the task is an empty string.

@lgfa29 lgfa29 added this to the 1.4.0 milestone Sep 13, 2022
lgfa29 added a commit that referenced this pull request Sep 13, 2022
Add new task hook to inject a `OTEL_RESOURCE_ATTRIBUTES` environment
variable with Nomad attributes into tasks. The attributes set are
related to the alloc and specific task that is running, the node where
the alloc is running, and the job and eval that generated the alloc.

These attributes are merged if the task already defines a
`OTEL_RESOURCE_ATTRIBUTES` environment variable, or disabled if the
value defined by the task is an empty string.
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code mostly looks fine here, but I'm not sure about some of the design motivations of this one @lgfa29. There's a hidden backwards compatibility issue here as well, which we've run into when discussing adding Docker labels: if you are running OTEL already by adding the OTEL env var to your tasks, Nomad just ballooned your OTEL provider costs. You can only opt-out of the env var entirely, not out of having Nomad add to it.

multierror "github.com/hashicorp/go-multierror"
"github.com/hashicorp/nomad/client/allocrunner/interfaces"
"github.com/hashicorp/nomad/nomad/structs"
"go.opentelemetry.io/otel/baggage"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They went with the name "baggage" for this concept? 😦

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, that's another spec 😅
https://www.w3.org/TR/baggage/

Comment on lines +104 to +120
members := []baggage.Member{
newMember("nomad.alloc.createTime", fmt.Sprintf("%v", alloc.CreateTime), mErr),
newMember("nomad.alloc.id", alloc.ID, mErr),
newMember("nomad.alloc.name", alloc.Name, mErr),
newMember("nomad.eval.id", alloc.EvalID, mErr),
newMember("nomad.group.name", alloc.TaskGroup, mErr),
newMember("nomad.job.id", job.ID, mErr),
newMember("nomad.job.name", job.Name, mErr),
newMember("nomad.job.region", job.Region, mErr),
newMember("nomad.job.type", job.Type, mErr),
newMember("nomad.namespace", alloc.Namespace, mErr),
newMember("nomad.node.id", node.ID, mErr),
newMember("nomad.node.name", node.Name, mErr),
newMember("nomad.node.datacenter", node.Datacenter, mErr),
newMember("nomad.task.name", task.Name, mErr),
newMember("nomad.task.driver", task.Driver, mErr),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the value to operators in providing these specific attributes to task processes. Ex. why do we want to expose the eval ID for every running process on the cluster?

From a design standpoint I'm not sure it makes sense to have Nomad itself define a hard-coded set of attributes, rather than making this something the operators define (as either client configuration or in the jobspec). Could this whole thing be done via clever templating of an env block, and if it can be but only with a lot of work, could we implement something to make it easier instead of tying ourselves to the extremely fast-moving OTEL project?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. I thought of some kind of configuration (either jobspec or client side), but that would be even more commitment to keep it updated, or meta values, but they are not really well defined and should probably remain opaque to Nomad.

Maybe an external project may be better for now then. I will think more about it, thanks!

Comment on lines +89 to +94
// TODO(luiz): remove decode step once the Otel SDK handles it internally.
// https://github.com/open-telemetry/opentelemetry-go/pull/2963
attrs, err := url.QueryUnescape(resourceAttrs.String())
if err != nil {
attrs = resourceAttrs.String()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misunderstanding this PR, but this looks like it impacts the read side and not the write side. If the read side of the SDK gets fixed, don't we still need to encode on the write side so that older versions of the SDK aren't broken? (And are there lots of read-side SDKs for different languages? If there's only go, then it doesn't seem sensible for us to bake-in support for a single language in Nomad.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading a bit more, this looks like we could end up double-encoding in the case where the user has something we're merging onto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was a mismatch between how the baggage spec and the collector handled encoding values. Only the baggage required encoding/decoding so the more general "fix" was an update to the spec. Other languages will need to be updated to handle this as well, but good point on supporting older versions. I will make sure to test it.

Attempting to double-encode would result in an error that is handled by the using the original string.

client/allocrunner/taskrunner/otel_hook.go Show resolved Hide resolved
// https://github.com/open-telemetry/opentelemetry-go/issues/3164
member, err := baggage.NewMember(k, v)
if err != nil {
logger.Warn("failed to create new baggage member", "key", k, "value", v, "error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ridiculous name "baggage" aside, which isn't your fault, I'm not sure we should expose that inside-baseball terminology to end-users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, good point 👍

@lgfa29
Copy link
Contributor Author

lgfa29 commented Sep 13, 2022

Yup, those are great observations. I'm going to mark this as draft and rethink the approach. Thanks for the review!

@lgfa29 lgfa29 marked this pull request as draft September 13, 2022 15:04
@schmichael
Copy link
Member

This is a great candidate for Task Hook Plugins... if such a thing existed. The hook API was designed to seamlessly be able to add them someday. 🤷

@lgfa29 lgfa29 removed this from the 1.4.0 milestone Sep 22, 2022
@tgross tgross added the stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows label May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants