
OCPBUGS-32495: setup stateroot with a job and other prep stage refactoring #476

Merged

Conversation

pixelsoccupied
Contributor

@pixelsoccupied pixelsoccupied commented May 2, 2024

This PR aims to make the Prep stage error handling more robust and, hopefully, simpler overall. At a high level, the following changes were made:

  • The bulk of the stateroot setup functions (e.g. pulling the image, calling rpm-ostree) now live in a CLI (lca-cli) and are run as a Kubernetes Job (see the sketch after this list).
    • This allows the workload to always be "watched" via the Job, so any unexpected event (a deleted pod, a system reboot) is captured immediately. Currently, any unexpected event marks the stage as failed (no retries).
  • The precache job is treated as a generic Job and handled similarly to how the stateroot setup Job is orchestrated, reducing code.
  • As part of an effort to identify any data that needs to be persisted, the Prep stage now annotates the IBU CR with a (new) warning if, during validation, it detects that the current cluster is missing certain CRDs. The annotation with an example value may look like this:
  annotations:
    lca.openshift.io/warn-extramanifest-cm-missing-crd: '{"velero.io/v1, Kind=Backup":"acm-klusterlet2","velero.io/v1,
      Kind=Restore":"acm-klusterlet3"}'
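
For illustration only, here is a minimal sketch (not this repository's actual code) of how a stateroot setup Job with no retries could be created from an operator using controller-runtime; the Job name, namespace, container image, and lca-cli subcommand are placeholder assumptions:

    // Illustrative sketch: create a stateroot-setup Job that fails fast with no retries.
    // Names (namespace, image, command) are assumptions, not the repository's actual values.
    package prep

    import (
    	"context"

    	batchv1 "k8s.io/api/batch/v1"
    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    )

    func int32Ptr(i int32) *int32 { return &i }

    // launchStaterootSetupJob creates a Job that runs a (hypothetical) lca-cli
    // stateroot setup subcommand. BackoffLimit=0 means a failed pod is not retried,
    // so any unexpected termination surfaces immediately as a failed Job.
    func launchStaterootSetupJob(ctx context.Context, c client.Client, image string) error {
    	job := &batchv1.Job{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:      "lca-prep-stateroot-setup",  // hypothetical name
    			Namespace: "openshift-lifecycle-agent", // hypothetical namespace
    		},
    		Spec: batchv1.JobSpec{
    			BackoffLimit: int32Ptr(0), // no retries: fail fast on any unexpected event
    			Template: corev1.PodTemplateSpec{
    				Spec: corev1.PodSpec{
    					RestartPolicy: corev1.RestartPolicyNever,
    					Containers: []corev1.Container{{
    						Name:    "stateroot-setup",
    						Image:   image,
    						Command: []string{"lca-cli", "ibu-stateroot-setup"}, // hypothetical subcommand
    					}},
    				},
    			},
    		},
    	}
    	return c.Create(ctx, job)
    }

With BackoffLimit set to 0 and RestartPolicy Never, a single pod failure (for example a deleted pod or a node reboot) immediately flips the Job to failed, so the controller only needs to watch the Job status to fail the Prep stage without retrying.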

Additional notes:

  • There's a new function (with the required RBACs) to grab a running Job's pod logs (see the sketch after this list).
  • AutoRollbackOnFailure is now a pointer, to suppress a false-positive "can not change spec.autoRollbackOnFailure while ibu is in progress" validation error. Without the pointer, the struct would auto-initialize and prevent the IBU from being updated with the new annotation at runtime. Edit: using Patch (instead of Update) should avoid the auto-initialization, but it's still good to have the struct as a pointer.
  • No more tasks/goroutines during the Prep stage.
  • Suppress logs for successful podman exec calls in precache to reduce noise.
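
As a rough illustration of the pod-log helper mentioned above (not the PR's actual implementation), the sketch below fetches a running Job's pod log via client-go using the standard job-name label; the function name and error handling are assumptions, and the required RBAC is roughly list on pods plus get on pods/log:

    // Illustrative sketch: fetch the log of the first pod belonging to a Job.
    package prep

    import (
    	"context"
    	"fmt"
    	"io"

    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    )

    // getJobPodLogs returns the logs of the first pod created by jobName.
    func getJobPodLogs(ctx context.Context, cs kubernetes.Interface, namespace, jobName string) (string, error) {
    	// Job controllers label their pods with "job-name=<job>".
    	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
    		LabelSelector: fmt.Sprintf("job-name=%s", jobName),
    	})
    	if err != nil {
    		return "", fmt.Errorf("listing pods for job %s: %w", jobName, err)
    	}
    	if len(pods.Items) == 0 {
    		return "", fmt.Errorf("no pods found for job %s", jobName)
    	}

    	// Stream the pod log and read it into memory.
    	req := cs.CoreV1().Pods(namespace).GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{})
    	stream, err := req.Stream(ctx)
    	if err != nil {
    		return "", fmt.Errorf("streaming logs: %w", err)
    	}
    	defer stream.Close()

    	data, err := io.ReadAll(stream)
    	if err != nil {
    		return "", fmt.Errorf("reading logs: %w", err)
    	}
    	return string(data), nil
    }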

/cc @Missxiaoguo @donpenney @jc-rh @browsell

@openshift-ci-robot

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32495, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @yliu127

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This PR aims to make the Prep stage error handling more robust and, hopefully, simpler overall. At a high level, the following changes were made:

  • The bulk of the stateroot setup functions (e.g. pulling the image, calling rpm-ostree) now live in a CLI (lca-cli) and are run as a Kubernetes Job.
  • This allows the workload to always be "watched" via the Job, so any unexpected event (a deleted pod, a system reboot) is captured immediately. Currently, any unexpected event marks the stage as failed (no retries).
  • The precache job is treated as a generic Job and handled similarly to how the stateroot setup Job is orchestrated, reducing code.
  • As part of the effort to remove any data that needs to be persisted, the Prep stage will annotate the IBU CR with a new warning if it detects that extra-manifest ConfigMaps reference CRDs missing from the current cluster. The user is expected to remove this annotation to acknowledge the warning and start the Upgrade. The annotation with an example value may look like this:
  annotations:
    lca.openshift.io/warn-extramanifest-cm-missing-crd: '{"velero.io/v1, Kind=Backup":"acm-klusterlet2","velero.io/v1,
      Kind=Restore":"acm-klusterlet3"}'
 - lastTransitionTime: "2024-05-02T16:57:25Z"
   message: Detected of presence of lca.openshift.io/warn-extramanifest-cm-missing-crd
     in IBU Cr. Please remove this annotation to acknowledge and allow Upgrade stage
     to proceed
   observedGeneration: 55
   reason: InProgress
   status: "True"
   type: UpgradeInProgress

Additional notes:

  • There's a new function (with the required RBACs) to grab a running Job's pod logs
  • AutoRollbackOnFailure is now a pointer, to suppress a false-positive "can not change spec.autoRollbackOnFailure while ibu is in progress" validation error. Without the pointer, the struct would auto-initialize and prevent the IBU from being updated with the new annotation at runtime.
  • No more tasks/goroutines during the prep stage.
  • Precache now logs at a reduced level and only prints errors from podman exec calls; e.g. from the LCA pod you may see the following during a failure:
2024-05-01T23:27:34Z    INFO    controllers.ImageBasedUpgrade   ------ start pod `lca-precache-job-q8q9k` log  -----    {"job name": "lca-precache-job"}
2024-05-01T23:27:34Z    INFO    controllers.ImageBasedUpgrade   time="2024-05-01T23:27:24Z" level=info msg="Attempt 1/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:25Z" level=info msg="Attempt 2/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:25Z" level=info msg="Attempt 3/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:25Z" level=info msg="Attempt 4/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:26Z" level=info msg="Attempt 5/5: Failed to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:26Z" level=error msg="Failed to pull image: registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3, error: failed podman pull with args [pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 --authfile /var/lib/kubelet/config.json]: Trying to pull registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3...\nError: initializing source docker://registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3: reading manifest sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3 in registry.redhat.io/redhat/redhat-operator-index: manifest unknown\n: exit status 125"
time="2024-05-01T23:27:26Z" level=info msg="Waiting for precaching threads to finish..."
time="2024-05-01T23:27:32Z" level=info msg="All the precaching threads have finished."
time="2024-05-01T23:27:32Z" level=info msg="Total Images: 136"
time="2024-05-01T23:27:32Z" level=info msg="Images Pulled Successfully: 135"
time="2024-05-01T23:27:32Z" level=info msg="Images Failed to Pull: 1"
time="2024-05-01T23:27:32Z" level=info msg="failed: registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3"
time="2024-05-01T23:27:32Z" level=info msg="Completed executing pre-caching"
time="2024-05-01T23:27:32Z" level=info msg="Failed to pre-cache the following images:"
time="2024-05-01T23:27:32Z" level=info msg="registry.redhat.io/redhat/redhat-operator-index@sha256:6f6f64186e2384056f53f9fd44065387a2f4ca75011e7ce60e772c2efa12efd3"
time="2024-05-01T23:27:32Z" level=error msg="terminating pre-caching job due to error: failed to pre-cache one or more images"

2024-05-01T23:27:34Z    INFO    controllers.ImageBasedUpgrade   ------ end pod `lca-precache-job-q8q9k` log  -----  {"job name": "lca-precache-job"}

/cc @Missxiaoguo @donpenney @jc-rh @browsell

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32495, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @yliu127

internal/reboot/reboot.go: review thread (outdated, resolved)
lca-cli/cmd/staterootSetupIBU.go: review thread (outdated, resolved)
@pixelsoccupied
Contributor Author

/hold
code review changes and other updates

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024

@pixelsoccupied pixelsoccupied force-pushed the stateroot_job branch 2 times, most recently from 62a5c70 to ce9b507 (May 6, 2024 19:37)
@pixelsoccupied
Contributor Author

/hold

@pixelsoccupied
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@pixelsoccupied
Contributor Author

/hold
will patch instead of update during cleanup as well

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@pixelsoccupied pixelsoccupied force-pushed the stateroot_job branch 3 times, most recently from 0f636cb to 917593e (May 6, 2024 21:24)
@pixelsoccupied
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@pixelsoccupied pixelsoccupied force-pushed the stateroot_job branch 2 times, most recently from cb9c80d to 17ab70f (May 7, 2024 01:36)
@pixelsoccupied
Contributor Author

/test ibu-e2e-flow

@pixelsoccupied pixelsoccupied force-pushed the stateroot_job branch 2 times, most recently from 8ec10cb to 6634b38 (May 7, 2024 15:28)
@donpenney
Collaborator

/retest

@donpenney
Collaborator

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2024
Contributor

@Missxiaoguo Missxiaoguo left a comment

Thanks for your great work!!
/lgtm

internal/extramanifest/extramanifest.go: review thread (resolved)
main/main.go: review thread (resolved)
@donpenney
Collaborator

/approve

Contributor

openshift-ci bot commented May 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: donpenney

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 8650bb8 into openshift-kni:main May 8, 2024
8 checks passed
@openshift-ci-robot

@pixelsoccupied: Jira Issue OCPBUGS-32495: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32495 has been moved to the MODIFIED state.
