image-generator: fix image creation edge cases #369
* at creation: give images unique names to avoid name-reuse failures
* at deletion: ignore if image is already deleted

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
Successfully pre-merge tested on AWS.

Edge case 1: If a podvm image named 'peer-pod-XXXX' already exists on the cloud provider, the kataconfig install does not begin. The image ID creation job fails because of the name collision, and the kataconfig install gets stuck with no progress.

Edge case 2: If the user mistakenly deletes the podvm image from the cloud provider's console, the kataconfig deletion does not progress. It waits for the image deletion job to complete, which never happens: the user has already deleted the podvm image, so the job fails because it cannot find the podvm image ID.

Pre-merge test steps followed:
Edge case 1 tested
Edge case 2 tested
/override ci/prow/check
@cpmeadors: Overrode contexts on behalf of cpmeadors: ci/prow/check
@abhbaner can you approve since you did the pre-merge testing?
Just some questions for docs (who should probably be reviewing as well) and a question about the output and user visibility.
@@ -59,6 +57,7 @@ spec:
- -c
- |
set -e
[[ ! "${IMAGE_NAME}" ]] && UUID=$(uuidgen) && export IMAGE_NAME="peer-pod-ami-${UUID::6}" && echo "IMAGE_NAME:${IMAGE_NAME}"
If we are not requiring IMAGE_NAME to be explicitly unset we probably need to add that to the docs.
Users shouldn't even know it's possible to set it; this variable is not mentioned in the docs.
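The creation-side one-liner in the diff above can be unpacked into a small function to make both branches visible. This is a minimal sketch, not the operator's actual job script: `resolve_image_name` and `new_uuid` are hypothetical helper names, the `/proc` fallback is an assumption for hosts without `uuidgen`, and `printf '%.6s'` is a portable stand-in for the bash-only `${UUID::6}` slice.

```bash
#!/usr/bin/env bash
set -e

# new_uuid (hypothetical helper): prints one UUID on stdout.
new_uuid() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen
  else
    # Assumption: fall back to the Linux kernel's UUID source.
    cat /proc/sys/kernel/random/uuid
  fi
}

# resolve_image_name (hypothetical helper): prints the image name to use.
# An IMAGE_NAME that is already set is honored as-is; otherwise a name with
# a short UUID-derived suffix is generated so repeated installs do not
# collide with an existing image.
resolve_image_name() {
  if [ -n "${IMAGE_NAME:-}" ]; then
    printf '%s\n' "${IMAGE_NAME}"
  else
    # %.6s keeps only the first six characters of the UUID.
    printf 'peer-pod-ami-%.6s\n' "$(new_uuid)"
  fi
}
```

A caller would use it as `IMAGE_NAME=$(resolve_image_name); export IMAGE_NAME`. Six hexadecimal characters give roughly 16.7 million possible suffixes, so a collision is unlikely but not impossible.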
[[ ! "${PODVM_AMI_ID}" ]] && echo "PODVM_AMI_ID is missing, it's unknown which image to delete" && exit 1
[[ "${IMAGE_NAME}" ]] && echo "IMAGE_NAME:${IMAGE_NAME} is set, it implies image was not automatically created, delete it manually" && exit 0
I think we need to make sure it is clear in docs as to what happens with manually created images and that user is responsible for deleting them as the message states.
Are custom images supported ATM? AFAIK the only reason they're supported is for tech/dev-preview features.
RES=$(aws ec2 deregister-image --image-id "${PODVM_AMI_ID}" --region "${AWS_REGION}" 2>&1) || ERR=true
echo ${RES}
[[ ${ERR} ]] && [[ "$RES" =~ InvalidAMIID\.(Unavailable|NotFound) ]] # if deregister returned error and image is already deleted, continue
echo "Deleted AMI: ${PODVM_AMI_ID} - DONE"
Are users supposed to see the error message in the output of the job? Is that captured in a log anywhere? Also a little weird that we say the AMI is deleted when it didn't exist. I feel like there is an "if" statement missing.
It's possible to see the error in the job logs. However, in this case the error means the image is already deleted and there's nothing else to do, which is fine. Would it make more sense to change the wording to something like "image deletion job completed"?
> Are users supposed to see the error message in the output of the job?

Users can see these logs with oc logs, but they aren't supposed to look at them. This is an internal job.

> Is that captured in a log anywhere?

This should probably be captured by the OSC must-gather, if that isn't the case already.

> It's possible to see the error in the job logs. However, in this case the error means the image is already deleted and there's nothing else to do, which is fine. Would it make more sense to change the wording to something like "image deletion job completed"?

IIUC the only case where we get the InvalidAMIID\.(Unavailable|NotFound) error is when someone deleted the image automatically created by us behind our back, right? Assuming this is fine, the job just needs to be idempotent and report something like "PODVM_AMI_ID: \"$PODVM_AMI_ID\" is deleted".
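The explicit branch that the reviewers ask for could look roughly like the following. It is a sketch under stated assumptions, not the code merged in this PR: `delete_ami` is a hypothetical helper name, the `aws ec2 deregister-image` call and its two flags are taken from the diff above, the "is deleted" message follows the wording suggested in this thread, and the "was already deleted" message is an invented variant.

```bash
#!/usr/bin/env bash
set -e

# delete_ami (hypothetical helper): deregister an AMI idempotently.
#   $1 - AMI ID, $2 - AWS region
# Returns 0 if the image is gone afterwards (deleted now or earlier),
# 1 on any other error.
delete_ami() {
  ami_id="$1"
  region="$2"
  if out=$(aws ec2 deregister-image --image-id "${ami_id}" --region "${region}" 2>&1); then
    echo "PODVM_AMI_ID: \"${ami_id}\" is deleted"
    return 0
  fi
  case "${out}" in
    *InvalidAMIID.Unavailable*|*InvalidAMIID.NotFound*)
      # The image was removed out of band; nothing left to do.
      echo "PODVM_AMI_ID: \"${ami_id}\" was already deleted"
      return 0
      ;;
    *)
      # Any other failure is a real error: surface it and fail the job.
      echo "${out}" >&2
      return 1
      ;;
  esac
}
```

Running the call inside `if` keeps `set -e` from aborting the script before the error text can be inspected, and the three outcomes (deleted, already gone, real failure) each get their own message.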
@@ -59,6 +57,7 @@ spec:
- -c
- |
set -e
[[ ! "${IMAGE_NAME}" ]] && UUID=$(uuidgen) && export IMAGE_NAME="peer-pod-ami-${UUID::6}" && echo "IMAGE_NAME:${IMAGE_NAME}"
@snir911 I don't understand the motivation behind the [[ ! "${IMAGE_NAME}" ]] check... what is it for? Don't we want to enforce the name of the image to be something like peer-pod-ami-XXXXXX?
Mostly to preserve the previous behavior, I think, for example if a user wants to specify a name that is easy to track later using the AWS CLI.
TBH I don't recall if I had a better reason, but it makes sense to me that if the user explicitly sets it, we should comply.
Ok, so IMAGE_NAME is a kind of hidden API that we don't want to document for users but that we still want to be usable, correct? It would be very nice to document this somewhere for developers at least, or even in comments in the yaml files, because the questions in this PR clearly show that it isn't obvious.
I would have appreciated a comment to be added in the deletion scripts so that one doesn't need to git blame to fully understand what's going on. Not critical enough to hold this PR. Thanks @snir911 !
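As an illustration of the kind of in-script comments requested here, the deletion-side guards could be written as a commented function. This is only a sketch: `check_delete_preconditions` and its return codes are hypothetical, while the two messages are copied from the diff under review.

```bash
#!/usr/bin/env bash
set -e

# check_delete_preconditions (hypothetical helper) decides whether the
# deletion job should deregister the image. Return codes:
#   0 - proceed: the image was created automatically by the job
#   1 - error:   PODVM_AMI_ID is missing, so the target image is unknown
#   2 - skip:    IMAGE_NAME is set, so the image is user-managed
check_delete_preconditions() {
  if [ -z "${PODVM_AMI_ID:-}" ]; then
    echo "PODVM_AMI_ID is missing, it's unknown which image to delete" >&2
    return 1
  fi
  # IMAGE_NAME is an undocumented override kept for developers. When it is
  # set, the image name was chosen explicitly rather than generated by the
  # creation job, so cleanup is left to whoever created the image.
  if [ -n "${IMAGE_NAME:-}" ]; then
    echo "IMAGE_NAME:${IMAGE_NAME} is set, it implies image was not automatically created, delete it manually"
    return 2
  fi
  return 0
}
```

In the job itself, the caller would map return code 2 to `exit 0`, matching the diff, so that a user-managed image does not make the deletion job fail.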
Thanks @snir911 !
/override ci/prow/check
@gkurz: Overrode contexts on behalf of gkurz: ci/prow/check
@snir911: The following test failed, say
Fixes: #KATA-2618
To test:
In OCP, make sure to also set:
oc set env deployment.apps/controller-manager SANDBOXED_CONTAINERS_EXTENSION=sandboxed-containers -n openshift-sandboxed-containers-operator