-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NETOBSERV-1076 Fix CR stuck in Updating state #374
Conversation
The problem was that copying certificates across namespaces was done on every reconciliation call, with or without digest change. This is not a critical issue as it would only trigger a no-op update, but it was screwing up our "updating" state detection. There's actually two changes in this commit: 1. Fixing the "Updating" flag, so that it's not set when an update is actually performed (we need to fetch back the object (from cache) and check if the resource version was actually modified) 2. Fixing the certificate watching process to avoid unnecessary tries to update Also: - Add / fix tests - Modify some logs to make important information more visible - Makefile change for building manifest without error (unrelated)
Backport: #375 |
Makefile
Outdated
ifeq (${OCI_BIN}, docker) | ||
DOCKER_BUILDKIT=1 $(OCI_BIN) manifest create ${IMAGE} $(foreach target,$(MULTIARCH_TARGETS), --amend ${IMAGE}-$(target)); | ||
else | ||
DOCKER_BUILDKIT=1 $(OCI_BIN) rmi ${IMAGE}; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use -f
in case the image doesn't exits so it won't fail ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like -i is a better fit for that?
Options:
-a, --all Remove all images
-f, --force Force Removal of the image
-i, --ignore Ignore errors if a specified image does not exist
--no-prune Do not remove dangling images
@msherif1234 can you check if it's the same with docker ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docker rmi --help
Usage: docker rmi [OPTIONS] IMAGE [IMAGE...]
Remove one or more images
Options:
-f, --force Force removal of the image
--no-prune Do not delete untagged parents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like it's not for docker https://docs.docker.com/engine/reference/commandline/rmi/
so let's go with -f
Makefile
Outdated
DOCKER_BUILDKIT=1 $(OCI_BIN) manifest create ${IMAGE} $(foreach target,$(MULTIARCH_TARGETS), --amend ${IMAGE}-$(target)); | ||
else | ||
DOCKER_BUILDKIT=1 $(OCI_BIN) rmi ${IMAGE}; \ | ||
DOCKER_BUILDKIT=1 $(OCI_BIN) manifest create ${IMAGE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm why changed this logic since docker has --amend
support its better to use it than the work around we have for the older podman version that we are stuck with ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's to avoid having multiple ways of doing things as much as we can ? I think when we can run the same script regardless of the environment, we should do that, rather than having different procedures for different environments?
Moreover the intent of this make target is really to create a fresh new manifest, not updating an existing one, so it seems to me that even with docker it makes sense to create a fresh one? For instance, what happens if you build an image for arm64 only and then, later, build an image for amd64 only? I think with the current script you would get the two of them in the manifest, which isn't what you asked for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to check on this I think we should endup with one manifest for the example u have at the end, my concern here what I did for podman was more of a workaround because it lack --amend support and now you proposing using the workaround for both so I see this as step backward while if we have the latest podman we would just do --amend for both
@@ -103,7 +108,7 @@ func (m *NamespacedObjectManager) TryDelete(ctx context.Context, obj client.Obje | |||
if m.Exists(obj) { | |||
log := log.FromContext(ctx) | |||
kind := reflect.TypeOf(obj).String() | |||
log.Info("Deleting old "+kind, "Namespace", obj.GetNamespace(), "Name", obj.GetName()) | |||
log.Info("DELETING "+kind, "Namespace", obj.GetNamespace(), "Name", obj.GetName()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't know if there is a need to have all caps here I don't think this align with go coding style ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent is to make some important information more visible - I found it very useful when debugging, whereas with the normal sentence case it's sometimes hard to find relevant logs at a glance. That's really for pragmatism.
log.Error(err, "Failed to get updated resource "+kind, "Namespace", obj.GetNamespace(), "Name", obj.GetName()) | ||
return err | ||
} | ||
if obj.GetResourceVersion() != old.GetResourceVersion() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about using deep.Equal(old, obj) to know if it was updated or not ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that would work also, but is there a benefit? A simple GetResourceVersion()
comparison is for sure more efficient, and I would expect that it does exactly what we need here ... unless if it's possible that the object is actually updated without a version change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u never know what k8s do that is why I suggested to deepEqual to be 100% certain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given the official definition of resource version, we should be fine: https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions
Resource versions are strings that identify the server's internal version of an object. Resource versions can be used by clients to determine when objects have changed [etc.]
That's exactly what's done here
return "", err | ||
} | ||
if report.Check("Digest changed", targetDigest != digest) { | ||
// Update existing | ||
rlog.Info(fmt.Sprintf("updating %s %s in namespace %s", ref.kind, ref.name, destNamespace)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this log spam the log file or its not expected to change too often ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it's not expected to change often, hardly never, in fact: only during certificate rotations or if the certificate owner (kafka/loki/...) is reinstalled
Codecov Report
@@ Coverage Diff @@
## main #374 +/- ##
==========================================
+ Coverage 53.54% 53.59% +0.05%
==========================================
Files 45 45
Lines 5381 5403 +22
==========================================
+ Hits 2881 2896 +15
- Misses 2294 2299 +5
- Partials 206 208 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
|
e6660ea
to
4aceac4
Compare
@msherif1234 ok I finally ran through the test workflow procedure: https://github.com/netobserv/network-observability-operator/blob/main/DEVELOPMENT.md#testing-the-github-workflow tl;dr: should all be good now |
@@ -303,13 +303,8 @@ image-push: ## Push MULTIARCH_TARGETS images | |||
.PHONY: manifest-build | |||
manifest-build: ## Build MULTIARCH_TARGETS manifest | |||
@echo 'building manifest $(IMAGE)' | |||
ifeq (${OCI_BIN}, docker) | |||
DOCKER_BUILDKIT=1 $(OCI_BIN) rmi ${IMAGE} -f | |||
DOCKER_BUILDKIT=1 $(OCI_BIN) manifest create ${IMAGE} $(foreach target,$(MULTIARCH_TARGETS), --amend ${IMAGE}-$(target)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will break with older versions podman since it doesn't have --amend
but since we default to docker at least out CI should be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since recent podman does have --amend
: https://docs.podman.io/en/latest/markdown/podman-manifest-create.1.html#options I think that's ok.
idk in which version it was introduced, but the outdated version shipped with the github CI image that used once was really out of date in my understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/LGTM
/ok-to-test |
New images:
They will expire after two weeks. Catalog source: apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
name: netobserv-dev
namespace: openshift-marketplace
spec:
sourceType: grpc
image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-8a62641
displayName: NetObserv development catalog
publisher: Me
updateStrategy:
registryPoll:
interval: 1m |
/label qe-approved |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jotak The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The problem was that copying certificates across namespaces was done on every reconciliation call, with or without digest change. This is not a critical issue as it would only trigger a no-op update, but it was screwing up our "updating" state detection.
There's actually two changes in this commit:
Also: