Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

image-generator: fix image creation edge cases #369

Merged
merged 2 commits into from Jan 15, 2024
Merged

Conversation

snir911
Copy link
Contributor

@snir911 snir911 commented Dec 13, 2023

  • at creation: name images with unique naming to aviod name reuse failures
  • at deletion: ignore if image is already deleted

Fixes: #KATA-2618

to test:

quay.io/snir/osc-operator:v9.9.2618
quay.io/snir/osc-operator-catalog:v9.9.2618
quay.io/snir/osc-operator-bundle:v9.9.2618

in OCP make sure to set also:
oc set env deployment.apps/controller-manager SANDBOXED_CONTAINERS_EXTENSION=sandboxed-containers -n openshift-sandboxed-containers-operator

* at creation: name images with unique naming to aviod name reuse failures
* at deletion: ignore if image is already deleted

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
@snir911 snir911 self-assigned this Dec 13, 2023
@abhbaner
Copy link

abhbaner commented Dec 14, 2023

Successfully Pre-Merge tested it on AWS

Edge case 1 : If there is an existing podvm Image with the name 'peer-pod-XXXX' (on the cloud provider) the kataconfig install does not begin as the IMAGE ID creation job fails (due to the same name of the existing podvm image) and that causes the kataconfig install to be stuck in and there is no progress

Edge case 2: If by mistake the user deletes the podvm IMAGE (from the console of the cloud provider ) the kataconfig deletion will not progress as the process waits for the image deletion job to complete which it wont in this case as the user has deleted the podvm image already and the job fails , unable to find the podvm image id.

Pre-Merge test Steps followed

  • Applied the catalog source (with the catalog quay.io/snir/osc-operator-catalog:v9.9.2618 in it)
  • Installed the OSC Operator
  • Post install of the operator I issued the cmd
    oc set env deployment.apps/controller-manager SANDBOXED_CONTAINERS_EXTENSION=sandboxed-containers -n openshift-sandboxed-containers-operator
  • Created the the peer pod cm and secret
  • Created the kataconfig with enablepeerpods set to 'true'

Edge case 1 tested

  • The AWS console had an existing AMI with name (and ID) as peer-pod-ami (ami-0cc7239b859af5641). Despite that the image id creation job was able to create a new AMI ID and named it peer-pod-ami-3c37 (ami-0cd66ff9c8c21755c)
  • The kataconfig creation went through fine as well

Edge case 2 tested

  • Deregistered all AMIs from the AWS console with peer-pod-ami in their name
  • Successfully deleted kataconfig

@cpmeadors
Copy link
Contributor

/override ci/prow/check

Copy link

openshift-ci bot commented Jan 5, 2024

@cpmeadors: Overrode contexts on behalf of cpmeadors: ci/prow/check

In response to this:

/override ci/prow/check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cpmeadors
Copy link
Contributor

@abhbaner can you approve since you did the pre-merge testing?

Copy link
Contributor

@cpmeadors cpmeadors left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some questions for docs (who should probably be reviewing as well) and a question about the output and user visibility.

@@ -59,6 +57,7 @@ spec:
- -c
- |
set -e
[[ ! "${IMAGE_NAME}" ]] && UUID=$(uuidgen) && export IMAGE_NAME="peer-pod-ami-${UUID::6}" && echo "IMAGE_NAME:${IMAGE_NAME}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not requiring IMAGE_NAME to be explicitly unset we probably need to add that to the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User shouldn't even know it's possible to set it, this variable is not mentioned in the docs

Comment on lines +34 to +35
[[ ! "${PODVM_AMI_ID}" ]] && echo "PODVM_AMI_ID is missing, it's unknown which image to delete" && exit 1
[[ "${IMAGE_NAME}" ]] && echo "IMAGE_NAME:${IMAGE_NAME} is set, it implies image was not automatically created, delete it manually" && exit 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to make sure it is clear in docs as to what happens with manually created images and that user is responsible for deleting them as the message states.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is custom images supported ATM? AFAIK the reason it's supported is for tech/dev-preview features only

Comment on lines +45 to +48
RES=$(aws ec2 deregister-image --image-id "${PODVM_AMI_ID}" --region "${AWS_REGION}" 2>&1) || ERR=true
echo ${RES}
[[ ${ERR} ]] && [[ "$RES" =~ InvalidAMIID\.(Unavailable|NotFound) ]] # if deregister returned error and image is already deleted, continue
echo "Deleted AMI: ${PODVM_AMI_ID} - DONE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are users supposed to see the error message in the output of the job? Is that captured in a log anywhere? Also a little weird that we say the AMI is deleted when it didn't exist. I feel like there is an "if" statement missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible to see the error in the job logs, however, in this case the error means image is already deleted and there's nothing else to do which is fine, would it make more sense to change the wording to something like: "image deletion job completed"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are users supposed to see the error message in the output of the job?

Users can see these logs with oc logs but they aren't supposed to look at them. This is an internal job.

Is that captured in a log anywhere?

This should probably be captured by the OSC must-gather if it isn't the case already.

It's possible to see the error in the job logs, however, in this case the error means image is already deleted and there's nothing else to do which is fine, would it make more sense to change the wording to something like: "image deletion job completed"?

IIUC the only case where we get the InvalidAMIID\.(Unavailable|NotFound) error is when someone deleted the image automatically created by us in our back, right ? Assuming this is fine, then the job just needs to be idempotent and report something like "PODVM_AMI_ID: \"$PODVM_AMI_ID\" is deleted".

@@ -59,6 +57,7 @@ spec:
- -c
- |
set -e
[[ ! "${IMAGE_NAME}" ]] && UUID=$(uuidgen) && export IMAGE_NAME="peer-pod-ami-${UUID::6}" && echo "IMAGE_NAME:${IMAGE_NAME}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snir911 I don't understand the motivation behind the [[ ! "${IMAGE_NAME}" ]] check... what is it for ? Don't we want to enforce the name of the image to be something like peer-pod-ami-XXXXXX ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To preserve previous behavior mostly I think, for example if users wants to specify a name that would be easy for him to track later using AWS cli.
TBH i don't recall if i had better reason, but it make sense to me that if user explicitly set it we should comply

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so IMAGE_NAME is a kind of hidden API that we don't want to document for users but we still want to be usable, correct ? It would be very nice to document this in some document for developers at least, or even comments in the yaml files, because the questions in this PR clearly show that it isn't obvious.

Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have appreciated a comment to be added in the deletion scripts so that one doesn't need to git blame to fully understand what's going on. Not critical enough to hold this PR. Thanks @snir911 !

Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @snir911 !

@gkurz
Copy link
Member

gkurz commented Jan 15, 2024

/override ci/prow/check

Copy link

openshift-ci bot commented Jan 15, 2024

@gkurz: Overrode contexts on behalf of gkurz: ci/prow/check

In response to this:

/override ci/prow/check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link

openshift-ci bot commented Jan 15, 2024

@snir911: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/sandboxed-containers-operator-e2e 96cb6bc link false /test sandboxed-containers-operator-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gkurz gkurz merged commit cd3716d into openshift:devel Jan 15, 2024
3 of 4 checks passed
@gkurz gkurz mentioned this pull request Feb 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants