Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1076 Fix CR stuck in Updating state #374

Merged
merged 2 commits into from
Jun 20, 2023

Conversation

jotak
Copy link
Member

@jotak jotak commented Jun 15, 2023

The problem was that copying certificates across namespaces was done on every reconciliation call, with or without digest change. This is not a critical issue as it would only trigger a no-op update, but it was screwing up our "updating" state detection.

There's actually two changes in this commit:

  1. Fixing the "Updating" flag, so that it's not set when an update is actually performed (we need to fetch back the object (from cache) and check if the resource version was actually modified)
  2. Fixing the certificate watching process to avoid unnecessary tries to update

Also:

  • Add / fix tests
  • Modify some logs to make important information more visible
  • Makefile change for building manifest without error (unrelated)

The problem was that copying certificates across namespaces was done on
every reconciliation call, with or without digest change. This is not
a critical issue as it would only trigger a no-op update, but it was
screwing up our "updating" state detection.

There's actually two changes in this commit:
1. Fixing the "Updating" flag, so that it's not set when an update is
   actually performed (we need to fetch back the object (from cache) and
check if the resource version was actually modified)
2. Fixing the certificate watching process to avoid unnecessary tries to
   update

Also:
- Add / fix tests
- Modify some logs to make important information more visible
- Makefile change for building manifest without error (unrelated)
@jotak
Copy link
Member Author

jotak commented Jun 15, 2023

Backport: #375

Makefile Outdated
ifeq (${OCI_BIN}, docker)
DOCKER_BUILDKIT=1 $(OCI_BIN) manifest create ${IMAGE} $(foreach target,$(MULTIARCH_TARGETS), --amend ${IMAGE}-$(target));
else
DOCKER_BUILDKIT=1 $(OCI_BIN) rmi ${IMAGE}; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use -f in case the image doesn't exits so it won't fail ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like -i is a better fit for that?

Options:
  -a, --all        Remove all images
  -f, --force      Force Removal of the image
  -i, --ignore     Ignore errors if a specified image does not exist
      --no-prune   Do not remove dangling images

@msherif1234 can you check if it's the same with docker ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 docker rmi --help

Usage:  docker rmi [OPTIONS] IMAGE [IMAGE...]

Remove one or more images

Options:
  -f, --force      Force removal of the image
      --no-prune   Do not delete untagged parents

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like it's not for docker https://docs.docker.com/engine/reference/commandline/rmi/
so let's go with -f

Makefile Outdated
DOCKER_BUILDKIT=1 $(OCI_BIN) manifest create ${IMAGE} $(foreach target,$(MULTIARCH_TARGETS), --amend ${IMAGE}-$(target));
else
DOCKER_BUILDKIT=1 $(OCI_BIN) rmi ${IMAGE}; \
DOCKER_BUILDKIT=1 $(OCI_BIN) manifest create ${IMAGE}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why changed this logic since docker has --amend support its better to use it than the work around we have for the older podman version that we are stuck with ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's to avoid having multiple ways of doing things as much as we can ? I think when we can run the same script regardless of the environment, we should do that, rather than having different procedures for different environments?
Moreover the intent of this make target is really to create a fresh new manifest, not updating an existing one, so it seems to me that even with docker it makes sense to create a fresh one? For instance, what happens if you build an image for arm64 only and then, later, build an image for amd64 only? I think with the current script you would get the two of them in the manifest, which isn't what you asked for

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to check on this I think we should endup with one manifest for the example u have at the end, my concern here what I did for podman was more of a workaround because it lack --amend support and now you proposing using the workaround for both so I see this as step backward while if we have the latest podman we would just do --amend for both

@@ -103,7 +108,7 @@ func (m *NamespacedObjectManager) TryDelete(ctx context.Context, obj client.Obje
if m.Exists(obj) {
log := log.FromContext(ctx)
kind := reflect.TypeOf(obj).String()
log.Info("Deleting old "+kind, "Namespace", obj.GetNamespace(), "Name", obj.GetName())
log.Info("DELETING "+kind, "Namespace", obj.GetNamespace(), "Name", obj.GetName())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't know if there is a need to have all caps here I don't think this align with go coding style ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is to make some important information more visible - I found it very useful when debugging, whereas with the normal sentence case it's sometimes hard to find relevant logs at a glance. That's really for pragmatism.

log.Error(err, "Failed to get updated resource "+kind, "Namespace", obj.GetNamespace(), "Name", obj.GetName())
return err
}
if obj.GetResourceVersion() != old.GetResourceVersion() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about using deep.Equal(old, obj) to know if it was updated or not ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that would work also, but is there a benefit? A simple GetResourceVersion() comparison is for sure more efficient, and I would expect that it does exactly what we need here ... unless if it's possible that the object is actually updated without a version change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u never know what k8s do that is why I suggested to deepEqual to be 100% certain

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given the official definition of resource version, we should be fine: https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions

Resource versions are strings that identify the server's internal version of an object. Resource versions can be used by clients to determine when objects have changed [etc.]

That's exactly what's done here

return "", err
}
if report.Check("Digest changed", targetDigest != digest) {
// Update existing
rlog.Info(fmt.Sprintf("updating %s %s in namespace %s", ref.kind, ref.name, destNamespace))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this log spam the log file or its not expected to change too often ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it's not expected to change often, hardly never, in fact: only during certificate rotations or if the certificate owner (kafka/loki/...) is reinstalled

@codecov
Copy link

codecov bot commented Jun 15, 2023

Codecov Report

Merging #374 (4aceac4) into main (92e78be) will increase coverage by 0.05%.
The diff coverage is 75.75%.

❗ Current head 4aceac4 differs from pull request most recent head 1d8b524. Consider uploading reports for the commit 1d8b524 to get more accurate results

@@            Coverage Diff             @@
##             main     #374      +/-   ##
==========================================
+ Coverage   53.54%   53.59%   +0.05%     
==========================================
  Files          45       45              
  Lines        5381     5403      +22     
==========================================
+ Hits         2881     2896      +15     
- Misses       2294     2299       +5     
- Partials      206      208       +2     
Flag Coverage Δ
unittests 53.59% <75.75%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/helper/client_helper.go 53.44% <66.66%> (+0.38%) ⬆️
pkg/watchers/watcher.go 75.00% <66.66%> (-1.00%) ⬇️
controllers/ebpf/agent_controller.go 85.77% <100.00%> (ø)
...trollers/reconcilers/namespaced_objects_manager.go 81.08% <100.00%> (+1.37%) ⬆️

@jotak jotak force-pushed the fix-stuck-update branch 2 times, most recently from e6660ea to 4aceac4 Compare June 15, 2023 13:35
@jotak
Copy link
Member Author

jotak commented Jun 15, 2023

@msherif1234 ok I finally ran through the test workflow procedure: https://github.com/netobserv/network-observability-operator/blob/main/DEVELOPMENT.md#testing-the-github-workflow
Which helped to spot a bug :)
... and ended up doing a mix of the two solutions: we still remove the manifest before creating it AND we use the --amend syntax: this works both with podman and docker, as long as they're not to outdated. The CI runs with a recent docker so that's fine - I tested it here: https://github.com/netobserv/network-observability-operator/actions/runs/5280309768

tl;dr: should all be good now

@@ -303,13 +303,8 @@ image-push: ## Push MULTIARCH_TARGETS images
.PHONY: manifest-build
manifest-build: ## Build MULTIARCH_TARGETS manifest
@echo 'building manifest $(IMAGE)'
ifeq (${OCI_BIN}, docker)
DOCKER_BUILDKIT=1 $(OCI_BIN) rmi ${IMAGE} -f
DOCKER_BUILDKIT=1 $(OCI_BIN) manifest create ${IMAGE} $(foreach target,$(MULTIARCH_TARGETS), --amend ${IMAGE}-$(target));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will break with older versions podman since it doesn't have --amend but since we default to docker at least out CI should be fine

Copy link
Member Author

@jotak jotak Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since recent podman does have --amend : https://docs.podman.io/en/latest/markdown/podman-manifest-create.1.html#options I think that's ok.
idk in which version it was introduced, but the outdated version shipped with the github CI image that used once was really out of date in my understanding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@msherif1234 msherif1234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@openshift-ci openshift-ci bot added the lgtm label Jun 15, 2023
@jotak jotak added the no-doc This PR doesn't require documentation change on the NetObserv operator label Jun 16, 2023
@Amoghrd
Copy link
Contributor

Amoghrd commented Jun 20, 2023

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 20, 2023
@github-actions
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:8a62641
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-8a62641
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-8a62641

They will expire after two weeks.

Catalog source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-8a62641
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@Amoghrd
Copy link
Contributor

Amoghrd commented Jun 20, 2023

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label Jun 20, 2023
@jotak
Copy link
Member Author

jotak commented Jun 20, 2023

/approve

@openshift-ci
Copy link

openshift-ci bot commented Jun 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit e54f527 into netobserv:main Jun 20, 2023
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm no-doc This PR doesn't require documentation change on the NetObserv operator ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. qe-approved QE has approved this pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants