Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add detail for GCP Cloud Run Job execution #3378

Merged
merged 6 commits into from
May 9, 2023

Conversation

damemi
Copy link
Contributor

@damemi damemi commented Apr 10, 2023

Changes

Add a note for Cloud Run Jobs on how to identify faas.version.

Related issues
Downstream PR proposing this semconv change: GoogleCloudPlatform/opentelemetry-operations-go#465

Related OTEP(s) #

@damemi
Copy link
Contributor Author

damemi commented Apr 10, 2023

/cc @punya @jsuereth @aabmass @psx95

@damemi damemi force-pushed the gcp-cloud-run-service-version branch 2 times, most recently from 1d2542c to 5b37817 Compare April 10, 2023 18:12
@damemi damemi force-pushed the gcp-cloud-run-service-version branch from 5b37817 to 3dbd919 Compare April 10, 2023 18:14
@damemi
Copy link
Contributor Author

damemi commented Apr 14, 2023

ping @carlosalberto ptal

semantic_conventions/resource/faas.yaml Outdated Show resolved Hide resolved
Copy link
Member

@Oberon00 Oberon00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the linked page https://cloud.google.com/run/docs/managing/job-executions and it seems to confirm what I already suspected from the name "job execution". It does not really fit the version, it sounds like it would better fit the instance or invocation, or maybe cloud run jobs don't fit FaaS conventions at all.

A version should be something more or less permanent , that you can refer to, e.g. when you invoke a function (these conventions are about functions, not jobs), some version gets selected, and a particular version may be launched repeatedly.

E.g. when I read the docs here:

https://cloud.google.com/run/docs/managing/jobs#delete

Deleting a job cancels all running job executions of that job.

A job execution is something running, that's what an instance or invocation is, a version is something that can be run (multiple times).

See also https://cloud.google.com/run/docs/resource-model#job-executions and https://cloud.google.com/run/docs/resource-model#jobs

So I would say that CloudRun jobs don't really fit FaaS conventions well, but if we want to force them into the FaaS conventions, the job execution would match both an invocation_id and instance. I'm leaning slightly towards invocation_id, but not sure

@damemi
Copy link
Contributor Author

damemi commented Apr 18, 2023

Versions, instances, and invocations are all functionally similar (you could even argue they are the same) in the isolated context of Jobs, due to the inherent nature of Jobs. In effect, each version corresponds to one or more instances triggered by an invocation. So, you could map Execution to any one of version, instance, or invocation based solely on how it behaves. What's critical is what it represents.

In the broader context of Cloud Run, if you consider a job to be just a short-lived, single-request Service, then the concepts of version, instance, and invocation are equivalent:

  • Invocation: User request made to the application.
  • Instance: Container instance ID within the application.
  • Version: Each new deployment of the application.

We can rule out invocation, because that is not what we want to represent with this attribute. This attribute is meant to represent the Job itself, not the request made to it.

Instance maps better to Execution (because each Task runs as its own instance). But there are 2 problems: 1) the Instance ID does not represent the deployment, it represents a single container and 2) we also provide Instance as its own attribute. Mapping Instance to both Version and Instance is redundant and leaves no room for us to provide the Execution.

So with Invocation ruled out, and Instance mapped to Instance, that logically leaves (Job)Execution to map to (Service)Revision. We at the very least need somewhere to expose the Job Execution. And from a Cloud Run user's perspective, it would be expected in the same place as Service Revision.

For users on Cloud Run, the same resource detector should be able to fill in the same fields whether for a Cloud Run Service or Cloud Run Job. If only for consistency with the existing Services conventions, we should not fill in fields based on different metadata sources within Cloud Run.

So I would say that CloudRun jobs don't really fit FaaS conventions well

I understand that I'm being very self-referential by using the Service conventions to map to Jobs. I'm assuming that these existing conventions set a precedent for the inclusion of Cloud Run apps as FaaS concepts. So if the conventions for Services are valid, then this proposal for Jobs is also valid. Because you can, in theory, reduce a Service to a Job, and a Job to a Function.

@Oberon00
Copy link
Member

Oberon00 commented Apr 18, 2023

Actually "invocation" was just recently renamed and used to be named "execution" #3209. I think there is also not really a difference between "the Job itself" and "the request made to it". After all, the same applies to the invocation_id: It identifies the executing invocation, not the request causing the invocation (for example, on AWS Lambda, the request ID is generated by the Lambda backend, not by the requester).

Regarding "instance":

But there are 2 problems: 1) the Instance ID does not represent the deployment, it represents a single container and 2) we also provide Instance as its own attribute.

What is that own attribute? The term "instance" has many meanings, but a "faas.instance" is an instance of a FaaS execution environment, whatever that encompasses (even multiple "instances" of something else). According to the Google Documentation, a job may invoke multiple containers, so container IDs cannot be used in that attribute anyway (since then a single faas.invocation_id would span multiple faas.instances which is not how these concepts are supposed to be used)

In the end, it might also be worth it to introduce a special attribute gcp.job_execution. I just don't see how "version" would fit.

@damemi
Copy link
Contributor Author

damemi commented Apr 18, 2023

Actually "invocation" was just recently renamed and used to be named "execution" #3209.

I don't know the background for this change, but that PR description says it was changed because execution was "too generic and confusing". Maybe I'm misinterpreting, but that implies to me that the name change is intentionally meant to focus it on one thing (the request?)

Is there any discussion on how the term "invocation" was decided, or a definition of it for this context? Josh's #3378 (comment) defined it as the request, so that's what I'm basing this on.

I think there is also not really a difference between "the Job itself" and "the request made to it".

The Job is a resource that exists in the GCP Cloud Run API and handles requests. The request is an event initiated by a user to that resource. I think the fact that execution was renamed to invocation to make the term more specific means we should draw the distinction between the two, which I'm comparing to the definition in this section on the difference between instance and invocation.

But there are 2 problems: 1) the Instance ID does not represent the deployment, it represents a single container and 2) we also provide Instance as its own attribute.

What is that own attribute?

faas.instance

The term "instance" has many meanings, but a "faas.instance" is an instance of a FaaS execution environment, whatever that encompasses (even multiple "instances" of something else).

A container is the instance of a Cloud Run execution environment. In Cloud Run Services, this container can serve multiple invocations. In Jobs, the container only lives for a single invocation but is technically still the execution environment.

According to the Google Documentation, a job may invoke multiple containers, so container IDs cannot be used in that attribute anyway (since then a single faas.invocation_id would span multiple faas.instances which is not how these concepts are supposed to be used)

As a span attribute, does invocation_id cross multiple instances? eg, a user calls a function that calls another function, is that the same invocation_id across both of those instances, or is the invocation_id meant to change whenever it hits a new instance?

@Oberon00
Copy link
Member

that execution was renamed to invocation to make the term more specific means we should draw the distinction between the two

The renaming definitely did not have the intention to restrict the meaning of the attribute such that execution and invocation could now be two separate things. At least that is my interpretation. @tylerbenson do you agree? Or should we re-introduce execution (or execution_id) as an additional attribute? I don't think so...

But I can see that the current definition of invocation_id is unclear with regard to your use case. You could adapt it though.

@Oberon00
Copy link
Member

As a span attribute, does invocation_id cross multiple instances? eg, a user calls a function that calls another function, is that the same invocation_id across both of those instances, or is the invocation_id meant to change whenever it hits a new instance?

It would usually be different for each function, but that might depend on the cloud provider. The attribute was originally modeled after the AWS Lambda Request ID https://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.html#runtimes-api-next.

@Oberon00
Copy link
Member

Oberon00 commented Apr 18, 2023

But there are 2 problems: 1) the Instance ID does not represent the deployment, it represents a single container and 2) we also provide Instance as its own attribute.

What is that own attribute?

faas.instance

OK, now I get that I still don't get this aspect, sorry to keep bugging you here, but I think it might be important. So, what do you set as value here? The instance ID of the host that executes the container? Is it then possible that the same Job execution spans multiple instances?

@damemi
Copy link
Contributor Author

damemi commented Apr 18, 2023

OK, now I get that I still don't get this aspect, sorry to keep bugging you here, but I think it might be important. So, what do you set as value here?

No problem. We use the InstanceID() gce metadata function https://pkg.go.dev/cloud.google.com/go/compute/metadata#InstanceID which returns the current VM's numeric instance ID, which in Cloud Run is a Unique identifier of the container instance https://cloud.google.com/run/docs/container-contract#metadata-server, iow the container ID.

Is it then possible that the same Job execution spans multiple instances?

Yes

The renaming definitely did not have the intention to restrict the meaning of the attribute such that execution and invocation could now be two separate things. At least that is my interpretation. @tylerbenson do you agree? Or should we re-introduce execution (or execution_id) as an additional attribute?

I don't mean that the renaming meant to split execution and invocation. I mean that it meant to more clearly define invocation as a request, rather than the execution. My argument here is that the execution is already represented by the version in Cloud Run.

It would usually be different for each function, but that might depend on the cloud provider. The attribute was originally modeled after the AWS Lambda Request ID https://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.html#runtimes-api-next.

That doc convinces me more that invocation is meant to map to the request event.

@Oberon00
Copy link
Member

Oberon00 commented Apr 18, 2023

Is it then possible that the same Job execution spans multiple instances?

Yes

I think this is not correct then. You should probably use host.id to store that. faas.instance is meant for the exectution environment of the function call, not part of it. That we even have to consider the difference, is a hint that faas conventions might really not be suited for CloudRun jobs.

That doc convinces me more that invocation is meant to map to the request event.

You may be right.

Regarding the version though, maybe a litmus test if you can use it could be "can you invoke the version multiple times?". If not, then it is probably not a version.

@damemi
Copy link
Contributor Author

damemi commented Apr 18, 2023

I think this is not correct then. You should probably use host.id to store that. faas.instance is meant for the exectution environment of the function call, not part of it.

This is the container ID on Cloud Run. That is the execution environment. The Host is not visible to the user, and is not appropriate for this.

Regarding the version though, maybe a litmus test if you can use it could be "can you invoke the version multiple times?". If not, then it is probably not a version.

Is this a requirement, or can a version auto-increment with each request?

@Oberon00
Copy link
Member

Oberon00 commented Apr 18, 2023

Is this a requirement, or can a version auto-increment with each request?

If the version auto-increments with every request, it would be something that can be used to uniquely identify the request, which would be usable as as invocation_id...

It is not a requirement that is codified in the semantic conventions (yet), but is for the version concept I had in my mind.

Regarding host.id, it is defined as:

Unique host ID. For Cloud, this must be the instance_id assigned by the cloud provider. For non-containerized systems [...]

But there is also container.id here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/container.md

@tylerbenson
Copy link
Member

(Can I suggest adding feedback as inline comments so it is easier to thread the conversation better. This format makes it difficult to follow the conversation and respond to individual items. I also suggest this would be a good topic for the FAAS SIG meeting later today. Perhaps you both could attend or send a representative?)

@Oberon00
Copy link
Member

Unfortunately that meeting is very late in my timezone, I won't be able to attend. But I'm very likely to accept the outcome reached there, if you discuss it and other FaaS SIG members had opinions on the topic. 😃

@tylerbenson
Copy link
Member

@Oberon00 understand. Perhaps you could summarize your concerns into a single paragraph that I can paste into the meeting agenda? (Or you're welcome to add it directly yourself.)

@jsuereth
Copy link
Contributor

jsuereth commented Apr 18, 2023

@Oberon00 My read of invocation_id is that it's a concept way too tightly tied to AWS lambda. It does not fit the runtime model of Cloud Run, and it's not something we would want to promote or encourage OTEL users to assume our instance ID is actually a viable "invocation_id" as defined.

Specifically:

  • faas.instance.id is described as The execution environment ID as a string, that will be potentially reused for other invocations to the same function/function version.

  • You claim: What is that own attribute? The term "instance" has many meanings, but a "faas.instance" is an instance of a FaaS execution environment, whatever that encompasses (even multiple "instances" of something else). According to the Google Documentation, a job may invoke multiple containers, so container IDs cannot be used in that attribute anyway (since then a single faas.invocation_id would span multiple faas.instances which is not how these concepts are supposed to be used)

You cannot just change this wording in your definition. The way I read the description, everything Mike has proposed is inline with the current definition. In fact given the potentially reused for other invocation, I'd argue your definition is strictly invalid here. We can use an id that changes.

Additionally, the term "execution environment" here is highly suspect. Perhaps in AWS there's some abstraction between your deployment and a running container. In cloud run that abstraction is NOT identified or called out as something you can look at. There's just (deletable) does this function name / job name exist, or there's a running container instance.

+1 to @tylerbenson comments. Right now I'd like an ordered list of your concerns with this PR so that they can be discussed and addressed.

@tylerbenson
Copy link
Member

Quick summary from the SIG meeting:
GCR jobs have a "generation" version associated with them, but it isn't exposed in a way that is useful to the user. As such, we decided that it is better to not include it. (One potential reason to include it still would be to signal to the observability system that a code/config change precipitated a problem even though it is difficult to tie back to the specific version in the UI.)
We decided that the information that was going to version should be instead populated in some combination of faas. invocation_id or faas.instance instead. @damemi is going to provide more details on how this information is exposed to the user, and examples of what the values will look like in various scenarios.

@Oberon00
Copy link
Member

Oberon00 commented Apr 19, 2023

My concern is simple, and there is only a single one: A job execution ID sounds like a totally different concept from a function version.

Using invocation_id or instance were suggestions. I am also OK if you use another more fitting attribute, like gcp.cloud_run.job_execution_id.

@damemi damemi requested a review from a team as a code owner April 24, 2023 20:52
@damemi damemi force-pushed the gcp-cloud-run-service-version branch from 36c082f to 5b58946 Compare April 25, 2023 14:31
@damemi
Copy link
Contributor Author

damemi commented Apr 25, 2023

Thanks for all the feedback so far. I do appreciate everyone's patience and help with this.

I pushed a new commit 5b58946 that reverts the originally proposed Job Run version definition. I added a clarification that the existing version definition only applies to Services.

I then added a new cloud provider resource directory and 2 new attributes for Cloud Run based on Josh's suggestion #3378 (comment):

  1. gcp.cloud_run.job.execution
  2. gcp.cloud_run.job.task_index

We will need to update our resource detectors to differentiate between Jobs and Services for Cloud Run. These detectors are stable so we can't change how Services are handled, but I think we can add Jobs as a subset without breaking the existing Service workflow.

Thanks again for all the thoughtful input on this, hopefully we can reach a resolution soon.

@Oberon00 Oberon00 dismissed their stale review April 25, 2023 14:50

Main concerns addressed. Thank you!

@Oberon00
Copy link
Member

Oberon00 commented Apr 25, 2023

@tylerbenson @Aneurysm9 @jsuereth What do you think about reverting the renaming / redefinition of faas.execution to faas.invocation_id done in #3209, or adjusting the changes in such a way that they could also be used to store the Cloud Watch Job execution ID, so that we don't need a new attribute?
EDIT: Maybe a small clarification to the description of the invocation_id would be enough? (I don't think we want to distinguish request ID vs invocation ID vs execution ID on the generic faas level)

@damemi
Copy link
Contributor Author

damemi commented Apr 25, 2023

@Oberon00 faas.invocation/execution is a span attribute, but what we want is a resource attribute so would that also involve generalizing it to the resource level?

@Oberon00
Copy link
Member

Oberon00 commented Apr 25, 2023

We actually already have many FaaS attributes that can be set at either the Span or resource level (like faas.id). They are all "usually resource", so this would be the first "usually span" attribute explicitly being allowed in both positions, but I don't see a problem with it.

(Note that you will still want at least one FaaS attribute on your FaaS root span that is always on the span, otherwise you can't determine which of your spans represents the FaaS invocation, if everything is on the resource. E.g. faas.trigger might be used as such a "marker attribute" if you put everything else on the Resource (or you might duplicate it on Span+Resource))

@jsuereth
Copy link
Contributor

jsuereth commented May 1, 2023

@Oberon00 I'm still not certain invocation_id / execution_id make sense in my head vs. a trace_id.

I think it's worth having that discussion on a new bug/channel. Given this PR now introduces gcp specific attributes, we can always double-write IDs if we define an "exeuction_id" we're happy with.

I think fundamentally, the "model" used by the FAAS semconv group should be well documented so we can better understand how to align our solution against it. Happy to continue that conversation.

For now, we'd love to see this PR go through so Cloud Run Jobs have valid identities in OTEL in the meantime.

@Oberon00
Copy link
Member

Oberon00 commented May 2, 2023

I think fundamentally, the "model" used by the FAAS semconv group should be well documented so we can better understand how to align our solution against it. Happy to continue that conversation.

Since the model is not formally defined in all details, we can also define it in such a way that your concerns about "not fitting" attributes are resolved

EDIT: There is a bit of definition language at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/faas.md#difference-between-invocation-and-instance that should be taken into account/moved/changed when we do this.

@damemi damemi force-pushed the gcp-cloud-run-service-version branch from f2d4075 to 92070cc Compare May 2, 2023 18:59
@damemi damemi changed the title Add detail for GCP Cloud Run Job version Add detail for GCP Cloud Run Job execution May 5, 2023
@damemi
Copy link
Contributor Author

damemi commented May 5, 2023

The link check failure looks unrelated to this change, is this good to merge?

@reyang reyang added the area:semantic-conventions Related to semantic conventions label May 9, 2023
@reyang
Copy link
Member

reyang commented May 9, 2023

FYI @damemi @jsuereth @reyang this is missing a changelog entry, we need to make sure the changelog entry is added in the new https://github.com/open-telemetry/semantic-conventions (to be created this week) repo.

Related to #3474 (review).

@reyang reyang merged commit ccac3a2 into open-telemetry:main May 9, 2023
9 checks passed
@damemi
Copy link
Contributor Author

damemi commented May 12, 2023

@reyang should the new repo have its own CHANGELOG.md just for semantic conventions?

@reyang
Copy link
Member

reyang commented May 12, 2023

@reyang should the new repo have its own CHANGELOG.md just for semantic conventions?

I think the new repo should have a single CHANGELOG.md file, if there isn't one, it should be created.

@damemi
Copy link
Contributor Author

damemi commented May 12, 2023

@reyang should the new repo have its own CHANGELOG.md just for semantic conventions?

I think the new repo should have a single CHANGELOG.md file, if there isn't one, it should be created.

sounds good, opened that PR here open-telemetry/semantic-conventions#18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants