Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

registry: add heartbeat call for running builds #11846

Merged
merged 4 commits into from Jul 5, 2022
Merged

Conversation

lbajolet-hashicorp
Copy link
Collaborator

When a build is running, we send periodic heartbeats to HCP Packer's
API.

This will be used on the service side to detect if a build is stalled on
the core side because of a crash or any other malfunction that caused
Packer not to send an update on the status of a build.

@lbajolet-hashicorp lbajolet-hashicorp requested a review from a team as a code owner June 10, 2022 19:32
@@ -50,6 +55,31 @@ func (b *RegistryBuilder) Run(ctx context.Context, ui packersdk.Ui, hook packers
log.Printf("[TRACE] failed to update HCP Packer registry status for %q: %s", b.Name, err)
}

go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe similar logic will be needed in the RegistryPostProcessor which wraps all configured post processors and run after a successful build has completed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I don't think copying this logic to the RegistryPostProcessor is necessary, since we only stop sending heartbeats on context cancellation, which happens independently from the build finishing.
However there is another issue due to the fact that the cloud-provider is set, which causes heartbeats to fail as they include the cloud-provider in the updates (this should not be the case). We need to fix this before we can consider merging this PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: fixed in c31a436

@lbajolet-hashicorp lbajolet-hashicorp changed the title WIP: registry: add heartbeat call for running builds registry: add heartbeat call for running builds Jun 17, 2022
Copy link
Contributor

@JenGoldstrich JenGoldstrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good to me, nice work Lucas! I think we should add some acceptance tests though in HPATS as we talked about, and probably wait for the next release on this one.

@@ -156,10 +156,10 @@ func (b *Bucket) UpdateBuildStatus(ctx context.Context, name string, status mode
_, err = b.client.UpdateBuild(ctx,
buildToUpdate.ID,
buildToUpdate.RunUUID,
buildToUpdate.CloudProvider,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that the builds were never set to running? 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were, just once they passed a certain point in the build (after the build is done, when post-processors start to be worked on), the status would never be updated again as all requests would fail.
Also a new run of the build would also never succeed to change the status as it would get rejected since the updates payload would contain the cloud provider, which gets rejected on HCP

Copy link
Member

@nywilken nywilken Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the issue here. A build, if I'm not mistaken, will get the call to update the status to running once a build begins in the registry builder. The cloud provider information is derived from an build image (at the end of a build) so cloud provider will always be empty here.

Unless a build was somehow given a provider without an image. Is this possible on the API?

Copy link
Member

@nywilken nywilken Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the updating of build labels is being removed here. If the issue is that we can not update the cloud provider once set would it makes sense to consider the following options.

  1. Update the API to only error if the value of cloudprovider is being changed to some new value
  2. Update the UpdateBuild to check if the status of buildToUpdate is already set to DONE and if so bail as there is no need for a heartbeat. All builds are stored in memory and refreshed after a successful API call so that state in memory should contain all the information needed to determine if a heartbeat update needs to be sent.
  3. Handle the error of calling UpdateBuild when the provider is set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's probably not super clear here, but that's linked to heartbeats being emitted during the post-processing steps, since they use that same function to update the status to "RUNNING" again in order to bump "UpdatedAt", which in turn caused the cloud-provider filed to be set at that time, and because updates to the field are rejected and we include it all the time, we can't send anymore updates to the build later on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the API to only error if the value of cloudprovider is being changed to some new value.

The API already does that! Labels can be updated and new labels will be concatenated with what's already stored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think checking for the Done status will fix this particular issue. If I got it right, the heartbeat that sends the cloud-provider happens before the build is set to done but after the build is populated with the metadata. So the heartbeat is the one updating those fields as Lucas said.

Is safe to only set the cloud provider inside markBuildComplete, this way we make sure this is set just once when the build is actually done.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sylviamoss you have nailed that issue, well said.

Copy link
Member

@nywilken nywilken Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Thanks for clarifying!

@sylviamoss pointed out that the persisting of the cloud provider to the in-memory build object is also happening here and not only when calling markBuildComplete. Which is why CloudProvider is not empty for the UpdateBuild call.

Looking at the bucket method (b *Bucket) UpdateBuildStatus I agree with removing all of the fields except the required and status. So this change make sense.

I do wonder if we should just create a new method called (c *Client) UpdateBuildStatus that just updates the status that is called in (b *Bucket) UpdateBuildStatus and leave the call to (c *Client) UpdateBuild inside of the (b *Bucket) markBuildComplete method. To avoid confusion in the future. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For visibility, there's going to be some changes to this code:

1- We'll change the conditions for the heartbeats, replicate the code, or mutualise it so we continue broadcasting the heartbeats until the individual build is done. As of now with the current code, the heartbeats may continue after a build has terminated (either with a success or a failure), which is not ideal.

2- We'll do a slight refactor to the heartbeat logic so they only send the state of the build, and not change the value of the shared status.

3- We'll split status updates from the final build done step.

There are more things to work on, but that'll probably be when the HCP logic gets extracted to a separate plugin later on.

@sylviamoss
Copy link
Member

Agree with @JenGoldstrich that we can bake this one a little more before releasing it.

@lbajolet-hashicorp lbajolet-hashicorp force-pushed the heartbeat_hcp branch 3 times, most recently from 03f1a12 to 7fb9615 Compare June 23, 2022 12:19
Copy link
Member

@nywilken nywilken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done. The changes look good to me. I left a comment about the log messaging. Debug logs are used by the user for troubleshooting and sometimes general logging. We should provide some context into what information is being sent up every 2 minutes - heartbeat seems a bit vague.

internal/registry/types.bucket.go Outdated Show resolved Hide resolved
internal/registry/types.bucket.go Outdated Show resolved Hide resolved
internal/registry/types.bucket.go Show resolved Hide resolved
@@ -141,7 +146,7 @@ func (b *Bucket) CreateInitialBuildForIteration(ctx context.Context, componentTy
// UpdateBuildStatus updates the status of a build entry on the HCP Packer registry with its current local status.
func (b *Bucket) UpdateBuildStatus(ctx context.Context, name string, status models.HashicorpCloudPackerBuildStatus) error {
if status == models.HashicorpCloudPackerBuildStatusDONE {
return b.markBuildComplete(ctx, name)
return fmt.Errorf("do not use UpdateBuildStatus for updating to DONE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice way to bubble up the info to the caller.

When a build is running, we send periodic heartbeats to HCP Packer's
API.

This will be used on the service side to detect if a build is stalled on
the core side because of a crash or any other malfunction that caused
Packer not to send an update on the status of a build.
Prior to this commit, we'd send updates to both the labels and the
cloud-provider whenever an update to the status of a build would be
sent.

This would cause a bug in which once a build reached the post-processing
state, and its cloud-provider was set, the status could not be updated
anymore, as the cloud-provider would be set and further updates are
rejected by the platform.

To avoid this problem, we only transmit the status when doing a status
update, and no other fields along with it.
The markBuildComplete private function used to be called from a final
status update to DONE, through the UpdateBuildStatus function.

The problem being that the (now) CompleteBuild function not only updates
the status of the build, but also all its metadata.

For consistency, we remove the indirection, and explicitely call
CompleteBuild when we want to finish a build.
@nywilken nywilken merged commit 0a41694 into main Jul 5, 2022
@nywilken nywilken deleted the heartbeat_hcp branch July 5, 2022 18:37
@github-actions
Copy link

github-actions bot commented Aug 5, 2022

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants