Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs #2567

Closed
wants to merge 2 commits into from

Conversation

wking
Copy link
Member

@wking wking commented Oct 24, 2019

Instead of just dropping them into the users lap "here's a big tarball, have fun", look through them for obvious things that we can summarize. With:

 func runGatherBootstrapCmd(directory string) error {
+       return analyzeGatheredBootstrap("/tmp/log-bundle.tar.gz")

to feed this tarball into the analysis logic, the output looks like:

WARNING control-plane/10.0.134.229 had failing systemd units: crio.service
WARNING control-plane/10.0.134.229: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:31 UTC; 320ms ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8491 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8491 (code=exited, status=1/FAILURE)
      CPU: 144ms

Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:31 ip-10-0-134-229 crio[8491]: time="2019-10-24 11:11:31.895986612Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Consumed 144ms CPU time
WARNING control-plane/10.0.134.243 had failing systemd units: crio.service
WARNING control-plane/10.0.134.243: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:35 UTC; 8s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8439 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8439 (code=exited, status=1/FAILURE)
      CPU: 151ms

Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:35 ip-10-0-134-243 crio[8439]: time="2019-10-24 11:11:35.238163016Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Consumed 151ms CPU time
WARNING control-plane/10.0.157.61 had failing systemd units: crio.service
WARNING control-plane/10.0.157.61: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:36 UTC; 1s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8379 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8379 (code=exited, status=1/FAILURE)
      CPU: 158ms

Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:36 ip-10-0-157-61 crio[8379]: time="2019-10-24 11:11:36.807284677Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Consumed 158ms CPU time

That's maybe a bit noisy, but mostly because all three control-plane machines failed the same way.

It might be worth exposing this as:

$ openshift-install analyze bootstrap PATH

so folks could look at bootstrap logs which had been gathered by third-parties, but I'm punting on that for now.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 24, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 24, 2019
@eparis
Copy link
Member

eparis commented Oct 24, 2019

It seems like a lot of code for what should, hopefully, only ever show up in RHCOS promotion jobs. do we have more cases where getting the end of the bootkube logs or something like that would also be a really good set of analysis?

@abhinavdahiya
Copy link
Contributor

/hold

let's make sure have an github issue that we use to discuss the actual use and scope of the analysis or even an openshift/enhancement before we think about adding this.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2019
@wking
Copy link
Member Author

wking commented Oct 24, 2019

Spun off #2569 for "do we want this?"

@abhinavdahiya abhinavdahiya added version/4.6 Tracking changes that should end up in 4.6 release and removed version/4.5 labels May 8, 2020
@iranzo
Copy link
Member

iranzo commented Aug 13, 2020

@wking is there a new tarball available anywhere to use as base data for testing? thx

@openshift-ci-robot openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2020
@wking
Copy link
Member Author

wking commented Aug 13, 2020

bbc87bc -> e20102c rebases around #3489 and adds a new commit to force a failure so folks can see this working in CI. If/when folks are satisfied (still no comment in #2569 about which first thing folks want analyzed), we can drop that commit and lift the hold here.

/hold

@wking
Copy link
Member Author

wking commented Aug 13, 2020

And yes, many, many of these in CI: https://search.ci.openshift.org/?search=Bootstrap%20gather%20logs%20captured%20here

@abhinavdahiya
Copy link
Contributor

Closing this as it has stalled for very long. Please feel free to re-open in case you intent to keep pushing it, or consider tracking the work in an issue if this requires agreement for implementation.

/close

@openshift-ci-robot
Copy link
Contributor

@abhinavdahiya: Closed this PR.

In response to this:

Closing this as it has stalled for very long. Please feel free to re-open in case you intent to keep pushing it, or consider tracking the work in an issue if this requires agreement for implementation.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Copy link
Member Author

wking commented Oct 6, 2020

/reopen

I will keep pushing :)

@openshift-ci-robot
Copy link
Contributor

@wking: Reopened this PR.

In response to this:

/reopen

I will keep pushing :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign wking after the PR has been reviewed.
You can assign the PR to them by writing /assign @wking in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Oct 6, 2020

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 bbc87bc link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-aws-fips e20102c link /test e2e-aws-fips
ci/prow/e2e-aws-workers-rhel7 e20102c link /test e2e-aws-workers-rhel7
ci/prow/e2e-crc e20102c link /test e2e-crc
ci/prow/e2e-aws e20102c link /test e2e-aws
ci/prow/e2e-openstack e20102c link /test e2e-openstack
ci/prow/e2e-metal-ipi e20102c link /test e2e-metal-ipi
ci/prow/e2e-ovirt e20102c link /test e2e-ovirt
ci/prow/e2e-libvirt e20102c link /test e2e-libvirt

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking wking force-pushed the summarize-failed-units branch 2 times, most recently from 61bf20c to 1c874dc Compare November 4, 2020 22:32
pkg/systemd/entry.go Outdated Show resolved Hide resolved
@wking
Copy link
Member Author

wking commented Nov 6, 2020

e2e-aws:

level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20201106191638.tar.gz"
level=error msg=Attempted to gather debug logs after installation failure: failed to analyze "/tmp/installer/log-bundle-20201106191638.tar.gz": parsing bootstrap log bundle /tmp/installer/log-bundle-20201106191638.tar.gz: parsing systemd log JSON "log-bundle-20201106191638/unit-status/bootkube.service.log.json": parse N_RESTARTS: strconv.Atoi: parsing "": invalid syntax
level=fatal msg=Bootstrap failed to complete: failed waiting for Kubernetes API: Get "https://api.ci-op-crddhnmp-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443/version?timeout=32s": dial tcp 3.19.166.52:6443: connect: connection refused
2020/11/06 19:16:43 Failed to upload $KUBECONFIG: timed out waiting for the condition 

I've pushed 759c627 -> 2755198 updating my fixtures to include more entry samples, and adjusted the parser to be more robust so it doesn't choke on any of those samples.

Instead of just dropping them into the users lap "here's a big
tarball, have fun", look through them for obvious things that we can
summarize.  With:

   func runGatherBootstrapCmd(directory string) error {
  +       return analyzeGatheredBootstrap("/tmp/log-bundle.tar.gz")

to feed [1] into the analysis logic, the output looks like:

  WARNING control-plane/10.0.134.229 had failing systemd units: crio.service
  WARNING control-plane/10.0.134.229: crio.service:
  ● crio.service - Open Container Initiative Daemon
     Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/crio.service.d
             └─10-default-env.conf
     Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:31 UTC; 320ms ago
       Docs: https://github.com/cri-o/cri-o
    Process: 8491 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 8491 (code=exited, status=1/FAILURE)
        CPU: 144ms

  Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon...
  Oct 24 11:11:31 ip-10-0-134-229 crio[8491]: time="2019-10-24 11:11:31.895986612Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
  Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
  Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'.
  Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Failed to start Open Container Initiative Daemon.
  Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Consumed 144ms CPU time
  WARNING control-plane/10.0.134.243 had failing systemd units: crio.service
  WARNING control-plane/10.0.134.243: crio.service:
  ● crio.service - Open Container Initiative Daemon
     Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/crio.service.d
             └─10-default-env.conf
     Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:35 UTC; 8s ago
       Docs: https://github.com/cri-o/cri-o
    Process: 8439 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 8439 (code=exited, status=1/FAILURE)
        CPU: 151ms

  Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Starting Open Container Initiative Daemon...
  Oct 24 11:11:35 ip-10-0-134-243 crio[8439]: time="2019-10-24 11:11:35.238163016Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
  Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
  Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Failed with result 'exit-code'.
  Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Failed to start Open Container Initiative Daemon.
  Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Consumed 151ms CPU time
  WARNING control-plane/10.0.157.61 had failing systemd units: crio.service
  WARNING control-plane/10.0.157.61: crio.service:
  ● crio.service - Open Container Initiative Daemon
     Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/crio.service.d
             └─10-default-env.conf
     Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:36 UTC; 1s ago
       Docs: https://github.com/cri-o/cri-o
    Process: 8379 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 8379 (code=exited, status=1/FAILURE)
        CPU: 158ms

  Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Starting Open Container Initiative Daemon...
  Oct 24 11:11:36 ip-10-0-157-61 crio[8379]: time="2019-10-24 11:11:36.807284677Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
  Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
  Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Failed with result 'exit-code'.
  Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Failed to start Open Container Initiative Daemon.
  Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Consumed 158ms CPU time

That's maybe a bit noisy, but mostly because all three control-plane
machines failed the same way.

Journal field docs are linked from the Go docstrings, but if these
evolve later, the ones I'm using are from [2], except for UNIT, where I have opened an upstream documentation request [3].

It might be worth exposing this as:

  $ openshift-install analyze bootstrap PATH

so folks could look at bootstrap logs which had been gathered by
third-parties, but I'm punting on that for now.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.3/2455/artifacts/e2e-aws/installer/log-bundle-20191024111122.tar
[2]: https://github.com/systemd/systemd/blob/v246/man/systemd.journal-fields.xml
[3]: systemd/systemd#17538
@wking
Copy link
Member Author

wking commented Nov 6, 2020

e2e-aws:

level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20201106211245.tar.gz"
level=warning msg=bootstrap had failing or restarting systemd units: bootkube.service
level=warning msg=bootstrap had restarting systemd unit bootkube.service
level=warning msg=● bootkube.service - Bootstrap a Kubernetes cluster
level=warning msg=   Loaded: loaded (/etc/systemd/system/bootkube.service; static; vendor preset: disabled)
level=warning msg=   Active: activating (auto-restart) (Result: exit-code) since Fri 2020-11-06 21:12:45 UTC; 412ms ago
level=warning msg=  Process: 15924 ExecStart=/usr/local/bin/bootkube.sh (code=exited, status=127)
level=warning msg= Main PID: 15924 (code=exited, status=127)
level=warning
level=warning msg=2020-11-06T21:12:45Z ip-10-0-5-167 systemd[1]: Started Bootstrap a Kubernetes cluster.
level=warning msg=2020-11-06T21:12:45Z ip-10-0-5-167 bootkube.sh[15924]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
level=warning msg=2020-11-06T21:12:45Z ip-10-0-5-167 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
level=warning msg=2020-11-06T21:12:45Z ip-10-0-5-167 systemd[1]: bootkube.service: Failed with result 'exit-code'.
level=warning msg=control-plane 10.0.168.25 access error: failed to SSH into host
level=warning msg=control-plane 10.0.183.105 access error: failed to SSH into host
level=warning msg=control-plane 10.0.222.207 access error: failed to SSH into host
level=fatal msg=Bootstrap failed to complete: failed waiting for Kubernetes API: Get "https://api.ci-op-84bd5z4s-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443/version?timeout=32s": dial tcp 35.155.151.1:6443: connect: connection refused 

❤️ I've pushed 46e6124 -> f7a1d33 to scrape SSH errors out of #4345's gather.log, and added a new test-only commit to break release-image auth, so we can see what the "I am installing a CI release, and my CI-registry token is missing/stale" use-case looks like.

@wking
Copy link
Member Author

wking commented Nov 6, 2020

e2e-aws:

level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20201106221538.tar.gz"
level=warning msg=control-plane 10.0.182.219 access error: ssh: connect to host 10.0.182.219 port 22: Connection refused
level=warning msg=control-plane 10.0.188.197 access error: ssh: connect to host 10.0.188.197 port 22: Connection refused
level=warning msg=control-plane 10.0.240.47 access error: ssh: connect to host 10.0.240.47 port 22: Connection refused
level=warning msg=bootstrap had failing or restarting systemd units: bootkube.service
level=warning msg=bootstrap had restarting systemd unit bootkube.service
level=warning msg=● bootkube.service - Bootstrap a Kubernetes cluster
level=warning msg=   Loaded: loaded (/etc/systemd/system/bootkube.service; static; vendor preset: disabled)
level=warning msg=   Active: activating (auto-restart) (Result: exit-code) since Fri 2020-11-06 22:15:35 UTC; 3s ago
level=warning msg=  Process: 15832 ExecStart=/usr/local/bin/bootkube.sh (code=exited, status=127)
level=warning msg= Main PID: 15832 (code=exited, status=127)
level=warning msg=    Tasks: 0 (limit: 49124)
level=warning msg=   Memory: 0B
level=warning msg=   CGroup: /system.slice/bootkube.service
level=warning
level=warning msg=2020-11-06T22:15:35Z ip-10-0-9-32 systemd[1]: Started Bootstrap a Kubernetes cluster.
level=warning msg=2020-11-06T22:15:35Z ip-10-0-9-32 bootkube.sh[15832]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
level=warning msg=2020-11-06T22:15:35Z ip-10-0-9-32 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
level=warning msg=2020-11-06T22:15:35Z ip-10-0-9-32 systemd[1]: bootkube.service: Failed with result 'exit-code'.
level=fatal msg=Bootstrap failed to complete: failed waiting for Kubernetes API: Get "https://api.ci-op-x1lmrvsl-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443/version?timeout=32s": dial tcp 3.18.149.28:6443: connect: connection refused 

Because release-image access is apparently open:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324825813257818112/artifacts/e2e-aws/ipi-install-install/log-bundle-20201106221538.tar | tar xOz log-bundle-20201106221538/bootstrap/journals/release-image.log
-- Logs begin at Fri 2020-11-06 21:50:55 UTC, end at Fri 2020-11-06 22:15:38 UTC. --
Nov 06 21:51:12 ip-10-0-9-32 systemd[1]: Starting Download the OpenShift Release Image...
Nov 06 21:51:12 ip-10-0-9-32 release-image-download.sh[1667]: Pulling registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b...
Nov 06 21:51:22 ip-10-0-9-32 release-image-download.sh[1667]: d1c62a5431ad4b5886dc9dcb36ed1b86ade893ab6a12dd80d3807d65f19fe9ca
Nov 06 21:51:22 ip-10-0-9-32 systemd[1]: Started Download the OpenShift Release Image.

while CI-release images are not:

$ cat auth.json 
{}
$ podman pull --authfile auth.json registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b
Trying to pull registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b...Getting image source signatures
Copying blob sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30
 3.44 MB / 70.44 MB [==>-------------------------------------------------------]^C
$ podman pull --authfile auth.json registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141
Trying to pull registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141...Failed
error pulling image "registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141": unable to pull registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141: unable to pull image: Error determining manifest MIME type for docker://registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141: Error reading manifest 4.7.0-0.ci-2020-11-06-074141 in registry.svc.ci.openshift.org/ocp/release: unauthorized: authentication required

I've pushed f7a1d33 -> f7c9ff4, hard-coding the auth-requiring CI release.

@wking
Copy link
Member Author

wking commented Nov 6, 2020

e2e-aws:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324841426894196736/artifacts/e2e-aws/ipi-install-install/log-bundle-20201106231049.tar | tar xOz log-bundle-20201106231049/bootstrap/journals/release-image.log
-- Logs begin at Fri 2020-11-06 22:45:49 UTC, end at Fri 2020-11-06 23:10:49 UTC. --
Nov 06 22:46:06 ip-10-0-25-221 systemd[1]: Starting Download the OpenShift Release Image...
Nov 06 22:46:06 ip-10-0-25-221 release-image-download.sh[1651]: Pulling registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141...
Nov 06 22:46:17 ip-10-0-25-221 release-image-download.sh[1651]: 75e4113eaae0d4aa041d405dea7dfb3745961dae05421c0420b3965d66deb5d2
Nov 06 22:46:17 ip-10-0-25-221 systemd[1]: Started Download the OpenShift Release Image.

I dunno why my empty auth-file didn't break the pull, but I've pushed f7c9ff4 -> fa8d68e to slot in a does-not-exist tag that will certainly (I hope ;)) fail.

@wking
Copy link
Member Author

wking commented Nov 7, 2020

e2e-aws:

level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20201107001451.tar.gz"
level=warning msg=control-plane 10.0.149.19 access error: ssh: connect to host 10.0.149.19 port 22: Connection refused
level=warning msg=control-plane 10.0.164.179 access error: ssh: connect to host 10.0.164.179 port 22: Connection refused
level=warning msg=control-plane 10.0.249.133 access error: ssh: connect to host 10.0.249.133 port 22: Connection refused
level=info msg=bootstrap had no failing or restarting systemd units
level=fatal msg=Bootstrap failed to complete: failed waiting for Kubernetes API: Get "https://api.ci-op-kqxfq9f7-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443/version?timeout=32s": dial tcp 54.184.162.11:6443: connect: connection refused 

This is because we are looping in the pulling script, not failing and restarting the unit:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324857293598101504/artifacts/e2e-aws/ipi-install-install/log-bundle-20201107001451.tar | tar xOz log-bundle-20201107001451/bootstrap/journals/release-image.log | tail -n4
Nov 07 00:14:49 ip-10-0-63-105 release-image-download.sh[1624]: Error: error pulling image "registry.svc.ci.openshift.org/ocp/release:does-not-exist": unable to pull registry.svc.ci.openshift.org/ocp/release:does-not-exist: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ocp/release:does-not-exist: Error reading manifest does-not-exist in registry.svc.ci.openshift.org/ocp/release: manifest unknown: manifest unknown
Nov 07 00:14:49 ip-10-0-63-105 release-image-download.sh[1624]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp/release:does-not-exist...
Nov 07 00:14:51 ip-10-0-63-105 release-image-download.sh[1624]: Error: error pulling image "registry.svc.ci.openshift.org/ocp/release:does-not-exist": unable to pull registry.svc.ci.openshift.org/ocp/release:does-not-exist: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ocp/release:does-not-exist: Error reading manifest does-not-exist in registry.svc.ci.openshift.org/ocp/release: manifest unknown: manifest unknown
Nov 07 00:14:51 ip-10-0-63-105 release-image-download.sh[1624]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp/release:does-not-exist...

Ideally we'd have a oneshot unit that we could restart until it succeeded, but it would still remain around with logs and block dependent units until it completed. Looks like systemd grew support for that in systemd/systemd#13754, which went out v244. Unfortunately, 4.7 is currently running:

$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-10-27-051128 | grep "Red Hat Enterprise Linux CoreOS"
  machine-os 47.82.202010270142-0 Red Hat Enterprise Linux CoreOS
$ curl -s https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.7/47.82.202010270142-0/x86_64/commitmeta.json | jq -c '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "systemd")'
["systemd","0","239","31.el8_2.2","x86_64"]

So we'll have to wait a bit before we get to 244 (or get a backport) before we get nice reporting for failed release-image pulls. I've pushed fa8d68e -> fe43889, dropping my debugging commits, and I think with that we're good to land this.

/assign @staebler

@wking
Copy link
Member Author

wking commented Nov 10, 2020

/retest

@wking
Copy link
Member Author

wking commented Nov 10, 2020

/hold cancel

Debugging commits removed.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2020
@openshift-merge-robot
Copy link
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-crc fe43889 link /test e2e-crc
ci/prow/e2e-ovirt fe43889 link /test e2e-ovirt
ci/prow/e2e-aws-fips fe43889 link /test e2e-aws-fips
ci/prow/e2e-aws-upgrade fe43889 link /test e2e-aws-upgrade
ci/prow/e2e-aws fe43889 link /test e2e-aws
ci/prow/e2e-openstack fe43889 link /test e2e-openstack
ci/prow/e2e-aws-workers-rhel7 fe43889 link /test e2e-aws-workers-rhel7
ci/prow/e2e-libvirt fe43889 link /test e2e-libvirt
ci/prow/e2e-metal-ipi fe43889 link /test e2e-metal-ipi

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Copy link
Member Author

wking commented Dec 5, 2020

/hold

I guess I'll take another shot at building consensus around this when 4.8 opens up ;).

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 5, 2020
@staebler
Copy link
Contributor

staebler commented Dec 7, 2020

/hold

I guess I'll take another shot at building consensus around this when 4.8 opens up ;).

Sorry that I did not get time to look at this for 4.7.

@romfreiman
Copy link

@eranco74 @tsorya @ronniel1 any reuse we can do to this logic in AI?

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 5, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 5, 2021

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Copy link
Member Author

wking commented Jun 14, 2021

Replaced by #4751.

@wking wking closed this Jun 14, 2021
@wking wking deleted the summarize-failed-units branch June 14, 2021 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants