
OCPBUGS-32495: setup stateroot with a job and other prep stage refactoring #476

Merged

Conversation

pixelsoccupied
Contributor

@pixelsoccupied pixelsoccupied commented May 2, 2024

This PR aims to make the Prep stage error handling more robust and, hopefully, simpler overall. At a high level, the following changes were made:

  • The bulk of the stateroot setup functions (e.g. pulling the image, calling rpm-ostree) now live in a CLI (lca-cli) and are run as a Kubernetes Job (see the sketch after this list).
    • This allows the workload to always be "watched" via the Job, so any unexpected event (a deleted pod, a system reboot) is captured immediately. Currently, any unexpected event marks the stage as failed (no retries).
  • The precache job is treated as a generic Job and handled similarly to how the stateroot setup Job is orchestrated, reducing code.
  • As part of an effort to identify any data that needs to be persisted, the Prep stage now annotates the IBU CR with a (new) warning if, during validation, it detects that the current cluster is missing certain CRDs. The annotation with an example value may look like this:
  annotations:
    lca.openshift.io/warn-extramanifest-cm-missing-crd: '{"velero.io/v1, Kind=Backup":"acm-klusterlet2","velero.io/v1,
      Kind=Restore":"acm-klusterlet3"}'
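
For illustration only, here is a minimal sketch (not this repository's actual code) of how a stateroot setup Job with no retries could be created from an operator using controller-runtime; the Job name, namespace, container image, and lca-cli subcommand are placeholder assumptions:

    // Illustrative sketch: create a stateroot-setup Job that fails fast with no retries.
    // Names (namespace, image, command) are assumptions, not the repository's actual values.
    package prep

    import (
    	"context"

    	batchv1 "k8s.io/api/batch/v1"
    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    )

    func int32Ptr(i int32) *int32 { return &i }

    // launchStaterootSetupJob creates a Job that runs a (hypothetical) lca-cli
    // stateroot setup subcommand. BackoffLimit=0 means a failed pod is not retried,
    // so any unexpected termination surfaces immediately as a failed Job.
    func launchStaterootSetupJob(ctx context.Context, c client.Client, image string) error {
    	job := &batchv1.Job{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:      "lca-prep-stateroot-setup",  // hypothetical name
    			Namespace: "openshift-lifecycle-agent", // hypothetical namespace
    		},
    		Spec: batchv1.JobSpec{
    			BackoffLimit: int32Ptr(0), // no retries: fail fast on any unexpected event
    			Template: corev1.PodTemplateSpec{
    				Spec: corev1.PodSpec{
    					RestartPolicy: corev1.RestartPolicyNever,
    					Containers: []corev1.Container{{
    						Name:    "stateroot-setup",
    						Image:   image,
    						Command: []string{"lca-cli", "ibu-stateroot-setup"}, // hypothetical subcommand
    					}},
    				},
    			},
    		},
    	}
    	return c.Create(ctx, job)
    }

With BackoffLimit set to 0 and RestartPolicy Never, a single pod failure (for example a deleted pod or a node reboot) immediately flips the Job to failed, so the controller only needs to watch the Job status to fail the Prep stage without retrying.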

Additional notes:

  • There's a new function (with the required RBACs) to grab a running Job's pod logs (see the sketch after this list).
  • AutoRollbackOnFailure is now a pointer, to suppress a false-positive "can not change spec.autoRollbackOnFailure while ibu is in progress" validation error. Without the pointer, the struct would auto-initialize and prevent the IBU from being updated with the new annotation at runtime. Edit: using Patch (instead of Update) should avoid the auto-initialization, but it's still good to have the struct as a pointer.
  • No more tasks/goroutines during the Prep stage.
  • Suppress logs for successful podman exec calls in precache to reduce noise.
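
As a rough illustration of the pod-log helper mentioned above (not the PR's actual implementation), the sketch below fetches a running Job's pod log via client-go using the standard job-name label; the function name and error handling are assumptions, and the required RBAC is roughly list on pods plus get on pods/log:

    // Illustrative sketch: fetch the log of the first pod belonging to a Job.
    package prep

    import (
    	"context"
    	"fmt"
    	"io"

    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    )

    // getJobPodLogs returns the logs of the first pod created by jobName.
    func getJobPodLogs(ctx context.Context, cs kubernetes.Interface, namespace, jobName string) (string, error) {
    	// Job controllers label their pods with "job-name=<job>".
    	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
    		LabelSelector: fmt.Sprintf("job-name=%s", jobName),
    	})
    	if err != nil {
    		return "", fmt.Errorf("listing pods for job %s: %w", jobName, err)
    	}
    	if len(pods.Items) == 0 {
    		return "", fmt.Errorf("no pods found for job %s", jobName)
    	}

    	// Stream the pod log and read it into memory.
    	req := cs.CoreV1().Pods(namespace).GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{})
    	stream, err := req.Stream(ctx)
    	if err != nil {
    		return "", fmt.Errorf("streaming logs: %w", err)
    	}
    	defer stream.Close()

    	data, err := io.ReadAll(stream)
    	if err != nil {
    		return "", fmt.Errorf("reading logs: %w", err)
    	}
    	return string(data), nil
    }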

/cc @Missxiaoguo @donpenney @jc-rh @browsell

@openshift-ci-robot

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32495, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @yliu127

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This PR aims to make the Prep stage error handling more robust and, hopefully, simpler overall. At a high level, the following changes were made:

  • The bulk of the stateroot setup functions (e.g. pulling the image, calling rpm-ostree) now live in a CLI (lca-cli) and are run as a Kubernetes Job.
  • This allows the workload to always be "watched" via the Job, so any unexpected event (a deleted pod, a system reboot) is captured immediately. Currently, any unexpected event marks the stage as failed (no retries).
  • The precache job is treated as a generic Job and handled similarly to how the stateroot setup Job is orchestrated, reducing code.
  • As part of the effort to remove any data that needs to be persisted, the Prep stage will annotate the IBU CR with a new warning if it detects that extra-manifest ConfigMaps reference CRDs missing from the current cluster. The user is expected to remove this annotation to acknowledge the warning and start the Upgrade. The annotation with an example value may look like this:
  annotations:
    lca.openshift.io/warn-extramanifest-cm-missing-crd: '{"velero.io/v1, Kind=Backup":"acm-klusterlet2","velero.io/v1,
      Kind=Restore":"acm-klusterlet3"}'
 - lastTransitionTime: "2024-05-02T16:57:25Z"
   message: Detected of presence of lca.openshift.io/warn-extramanifest-cm-missing-crd
     in IBU Cr. Please remove this annotation to acknowledge and allow Upgrade stage
     to proceed
   observedGeneration: 55
   reason: InProgress
   status: "True"
   type: UpgradeInProgress

Additional notes:

  • There's a new function (with the required RBACs) to grab a running Job's pod logs
  • AutoRollbackOnFailure is now a pointer, to suppress a false-positive "can not change spec.autoRollbackOnFailure while ibu is in progress" validation error. Without the pointer, the struct would auto-initialize and prevent the IBU from being updated with the new annotation at runtime.
  • No more tasks/goroutines during the prep stage.
  • Precache now logs at a reduced level and only prints errors from podman exec calls; e.g. from the LCA pod you may see the following during a failure:
2024-05-01T23:27:34Z    INFO    controllers.ImageBasedUpgrade   ------ start pod `lca-precache-job-q8q9k` log  -----    {"job name": "lca-precache-job"}
2024-05-01T23:27:34Z    INFO    controllers.ImageBasedUpgrade   time="2024-05-01T23:27:24Z" level=info msg="Attempt 1/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:25Z" level=info msg="Attempt 2/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:25Z" level=info msg="Attempt 3/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:25Z" level=info msg="Attempt 4/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:26Z" level=info msg="Attempt 5/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:26Z" level=error msg="Failed to pull image: registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3, error: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:26Z" level=info msg="Waiting for precaching threads to finish..."
time="2024-05-01T23:27:32Z" level=info msg="All the precaching threads have finished."
time="2024-05-01T23:27:32Z" level=info msg="Total Images: 136"
time="2024-05-01T23:27:32Z" level=info msg="Images Pulled Successfully: 135"
time="2024-05-01T23:27:32Z" level=info msg="Images Failed to Pull: 1"
time="2024-05-01T23:27:32Z" level=info msg="failed: registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3"
time="2024-05-01T23:27:32Z" level=info msg="Completed executing pre-caching"
time="2024-05-01T23:27:32Z" level=info msg="Failed to pre-cache the following images:"
time="2024-05-01T23:27:32Z" level=info msg="registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3"
time="2024-05-01T23:27:32Z" level=error msg="terminating pre-caching job due to error: failed to pre-cache one or more images"

2024-05-01T23:27:34Z    INFO    controllers.ImageBasedUpgrade   ------ end pod `lca-precache-job-q8q9k` log  -----  {"job name": "lca-precache-job"}

/cc @Missxiaoguo @donpenney @jc-rh @browsell

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32495, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @yliu127

internal/reboot/reboot.go: review thread (outdated, resolved)
lca-cli/cmd/staterootSetupIBU.go: review thread (outdated, resolved)
@pixelsoccupied
Contributor Author

/hold
code review changes and other updates

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024

@pixelsoccupied pixelsoccupied force-pushed the stateroot_job branch 2 times, most recently from 62a5c70 to ce9b507 (May 6, 2024 19:37)
@pixelsoccupied
Contributor Author

/hold

@pixelsoccupied
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@pixelsoccupied
Contributor Author

/hold
will patch instead of update during cleanup as well

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@pixelsoccupied pixelsoccupied force-pushed the stateroot_job branch 3 times, most recently from 0f636cb to 917593e (May 6, 2024 21:24)
@pixelsoccupied
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@pixelsoccupied pixelsoccupied force-pushed the stateroot_job branch 2 times, most recently from cb9c80d to 17ab70f (May 7, 2024 01:36)
@pixelsoccupied
Contributor Author

/test ibu-e2e-flow

@pixelsoccupied pixelsoccupied force-pushed the stateroot_job branch 2 times, most recently from 8ec10cb to 6634b38 (May 7, 2024 15:28)
@donpenney
Collaborator

/retest

@donpenney
Collaborator

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2024
Contributor

@Missxiaoguo Missxiaoguo left a comment

Thanks for your great work!!
/lgtm

internal/extramanifest/extramanifest.go: review thread (resolved)
main/main.go: review thread (resolved)
@donpenney
Collaborator

/approve

Contributor

openshift-ci bot commented May 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: donpenney

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 8650bb8 into openshift-kni:main May 8, 2024
8 checks passed
@openshift-ci-robot

@pixelsoccupied: Jira Issue OCPBUGS-32495: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32495 has been moved to the MODIFIED state.
