
Bootstrap machine fails to install with okd 4.7, 4.8 and 4.9 #897

Closed
sandrobonazzola opened this issue Sep 28, 2021 · 40 comments
Comments
@sandrobonazzola

Describe the bug
The bootstrap machine fails to install with OKD 4.8 and OKD 4.9.

Watching the installation with: # openshift-install --dir /root/install_dir wait-for install-complete --log-level=debug reports:

DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 8 of 742 done (1% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 193 of 742 done (26% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 332 of 742 done (44% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 517 of 742 done (69% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 602 of 742 done (81% complete) 
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress: 
DEBUG * Could not update oauthclient "console" (449 of 742): the server does not recognize this resource, check extension API servers 
DEBUG * Could not update role "openshift-apiserver/prometheus-k8s" (720 of 742): resource may have been deleted 
DEBUG * Could not update role "openshift-authentication/prometheus-k8s" (630 of 742): resource may have been deleted 
DEBUG * Could not update role "openshift-console-operator/prometheus-k8s" (667 of 742): resource may have been deleted 
DEBUG * Could not update role "openshift-controller-manager/prometheus-k8s" (728 of 742): resource may have been deleted 
DEBUG * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (412 of 742): resource may have been deleted 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 602 of 742 done (81% complete)

Version
4.9.0-0.okd-2021-09-27-224448

How reproducible
100%

Log bundle

# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT 
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
[must-gather      ] OUT Using must-gather plug-in image: registry.redhat.io/openshift4/ose-must-gather:latest
ClusterID: 6e34f392-23f5-4e12-82b1-e612e5d38fb1
ClusterVersion: Installing "4.9.0-0.okd-2021-09-27-224448" for 8 minutes: Working towards 4.9.0-0.okd-2021-09-27-224448: 602 of 742 done (81% complete)
ClusterOperators:
	clusteroperator/authentication is not available (<missing>) because <missing>
	clusteroperator/baremetal is not available (<missing>) because <missing>
	clusteroperator/cloud-controller-manager is not available (<missing>) because <missing>
	clusteroperator/cluster-autoscaler is not available (<missing>) because <missing>
	clusteroperator/config-operator is not available (<missing>) because <missing>
	clusteroperator/console is not available (<missing>) because <missing>
	clusteroperator/csi-snapshot-controller is not available (<missing>) because <missing>
	clusteroperator/dns is not available (<missing>) because <missing>
	clusteroperator/etcd is not available (<missing>) because <missing>
	clusteroperator/image-registry is not available (<missing>) because <missing>
	clusteroperator/ingress is not available (<missing>) because <missing>
	clusteroperator/insights is not available (<missing>) because <missing>
	clusteroperator/kube-apiserver is not available (<missing>) because <missing>
	clusteroperator/kube-controller-manager is not available (<missing>) because <missing>
	clusteroperator/kube-scheduler is not available (<missing>) because <missing>
	clusteroperator/kube-storage-version-migrator is not available (<missing>) because <missing>
	clusteroperator/machine-api is not available (<missing>) because <missing>
	clusteroperator/machine-approver is not available (<missing>) because <missing>
	clusteroperator/machine-config is not available (<missing>) because <missing>
	clusteroperator/marketplace is not available (<missing>) because <missing>
	clusteroperator/monitoring is not available (<missing>) because <missing>
	clusteroperator/network is not available (<missing>) because <missing>
	clusteroperator/node-tuning is not available (<missing>) because <missing>
	clusteroperator/openshift-apiserver is not available (<missing>) because <missing>
	clusteroperator/openshift-controller-manager is not available (<missing>) because <missing>
	clusteroperator/openshift-samples is not available (<missing>) because <missing>
	clusteroperator/operator-lifecycle-manager is not available (<missing>) because <missing>
	clusteroperator/operator-lifecycle-manager-catalog is not available (<missing>) because <missing>
	clusteroperator/operator-lifecycle-manager-packageserver is not available (<missing>) because <missing>
	clusteroperator/service-ca is not available (<missing>) because <missing>
	clusteroperator/storage is not available (<missing>) because <missing>


[must-gather      ] OUT namespace/openshift-must-gather-qqz9v created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-728c6 created
[must-gather      ] OUT pod for plug-in image registry.redhat.io/openshift4/ose-must-gather:latest created

and stays stuck there.

@sandrobonazzola
Author

About

DEBUG * Could not update oauthclient "console" (449 of 742): the server does not recognize this resource, check extension API servers

tried:

# oc get ouathclient
error: the server doesn't have a resource type "ouathclient"

@sandrobonazzola sandrobonazzola changed the title Bootstrap machine fails to install with okd 4.8 and okd 4.9 Bootstrap machine fails to install with okd 4.7, 4.8 and 4.9 Sep 29, 2021
@sandrobonazzola
Author

sandrobonazzola commented Sep 29, 2021

Failing on 4.7 and 4.8 as well

@vrutkovs
Member

Please attach a log bundle

@sandrobonazzola
Author

@vrutkovs
Member

Masters booted but didn't request the master ignition. Check the boot log on the master machine; it seems the network is down there.

@sandrobonazzola
Author

Master is up, the console shows:
Screenshot

@vrutkovs
Member

Seems to be an invalid network config; apparently it can't reach the DNS server?

@sandrobonazzola
Author

The DNS is provided by the lab, but here it seems to be trying to connect to [::1]:53 rather than to the lab DNS.

@sandrobonazzola
Author

Is there any way to configure the DNS to be used from install-config.yaml?

@rvanderp3
Contributor

Can you confirm that the master node is getting an IP address (i.e. can you ping it)? This does seem to indicate an issue with the network configuration. The IP configuration cannot be set via install-config.
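For UPI installs, static IP and DNS settings are typically passed as dracut kernel arguments at first boot rather than through install-config.yaml. A minimal sketch, assuming a static IPv4 setup (the addresses, hostname, and interface name below are placeholders, not values from this thread):

```
ip=192.168.1.10::192.168.1.1:255.255.255.0:master0.okd.example.com:ens3:none
nameserver=192.168.1.53
```

The `ip=` argument follows the dracut ordering client-IP:peer:gateway:netmask:hostname:interface:autoconf, with `none` disabling DHCP on that interface.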

@sandrobonazzola
Author

master0 doesn't reply to ping, but I see it got both IPv4 and IPv6 addresses from the guest agent running on it (it's a bare metal UPI install, but the host is virtual)

@sandrobonazzola
Author

Turning the NIC up and down shows that eth0 got renamed to ens3.
Screenshot2

@sandrobonazzola
Author

Is there a way to set up the core user password from install-config.yaml, as described in https://docs.fedoraproject.org/en-US/fedora-coreos/authentication/#_using_password_authentication ?

@rvanderp3
Contributor

rvanderp3 commented Sep 29, 2021

Unfortunately, I don't believe there are many further insights to be offered without understanding why the master is unreachable. You might consider booting the FCOS live ISO to see if you can troubleshoot the network from a vantage point where you can run troubleshooting commands.

install-config.yaml doesn't allow specific users to be created. Users are created via the ignition content that is produced by the installer binary. Regardless, until the master nodes retrieve their ignition, a user provided via master.ign wouldn't be applied anyway. IMO your best bet is to boot the ISO and see if you can inspect the network configuration.
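As a sketch of that approach, one could patch the generated master.ign before serving it, merging a password hash into the passwd section for the core user. This assumes an Ignition spec 3.x config; the hash below is a placeholder (a real one would come from e.g. `mkpasswd`), not anything from this thread:

```python
import json

def add_core_password_hash(ignition: dict, password_hash: str) -> dict:
    """Return a copy of an Ignition config with a passwordHash set for the
    'core' user, creating the passwd/users section if it is absent."""
    updated = json.loads(json.dumps(ignition))  # deep copy via round-trip
    users = updated.setdefault("passwd", {}).setdefault("users", [])
    for user in users:
        if user.get("name") == "core":
            user["passwordHash"] = password_hash
            break
    else:
        users.append({"name": "core", "passwordHash": password_hash})
    return updated

# Example with a stub config; "$y$j9T$PLACEHOLDER" stands in for a real hash.
master_ign = {"ignition": {"version": "3.2.0"}}
patched = add_core_password_hash(master_ign, "$y$j9T$PLACEHOLDER")
print(json.dumps(patched["passwd"], indent=2))
```

As noted above, though, the hash only takes effect if the node actually retrieves and applies its ignition.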

@sandrobonazzola
Author

  • Running Fedora CoreOS 34 live shows the network is working properly, getting the same IP address reported by the guest agent, and it is able to query DNS. The NIC is called ens3.
  • I edited the passwd section for masters in the generated metadata and added a password for the core user. I re-generated the ignition and re-deployed the master node. It was pointless because at boot I could observe that nm-initrd.service failed, and it never reached a point where the user could log in.

@sandrobonazzola
Author

After a dozen attempts, I got this screenshot:
Screenshot3

@sandrobonazzola
Author

sandrobonazzola commented Sep 29, 2021

@dustymabe sounds like something you handled in coreos/fedora-coreos-tracker#883

@sandrobonazzola
Author

sandrobonazzola commented Sep 29, 2021

Looks like passing console=null on the kernel boot command line lets the network come up. I'll give it another re-deploy from scratch, passing this command line.
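Baked into the install, the workaround might look like this as a coreos-installer invocation (a sketch only; the device, hostname, and paths are placeholders, not values from this thread):

```
# Hypothetical example: make console=null a first-boot kernel argument
sudo coreos-installer install /dev/sda \
    --firstboot-args='console=null' \
    --insecure-ignition \
    --ignition-url=http://fs.example.com/okd/ignition/master.ign
```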

@LorbusChris
Contributor

LorbusChris commented Sep 29, 2021

^ i.e. coreos/fedora-coreos-tracker#883
It seems that change has made it into OKD's machine-os-content, but not the boot image used for 4.7, as that is older and wouldn't have it: https://github.com/vrutkovs/installer/blob/release-4.7-okd/data/data/fcos-amd64.json#L72
I'll open a PR to bump that.

@sandrobonazzola
Author

@LorbusChris please also check 4.8 and 4.9, as I see the issue there too

@LorbusChris
Contributor

Which is also a version from well before the change even made it into FCOS's testing-devel (on 20210709: coreos/fedora-coreos-config@876fda1)

@markandrewj

markandrewj commented Sep 29, 2021

We have been struggling with this same issue for the last week while trying to stand up a new OKD cluster. When the nodes try to pull the ignition config from the bootstrap, we get a 'connection refused' message. We are using FCOS 34.20210904.3.0 and trying to build an OKD 4.7 UPI bare metal cluster.

okd_issue

Any feedback would be appreciated. We have validated our network configuration (DNS, DHCP, load balancer, etc.) multiple times.

@sandrobonazzola
Author

@markandrewj I managed to get past the network issue by adding console=null to the kernel boot command line on the master nodes' first boot.

I now progressed the deployment from 80% to 87% and got stuck with:

DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, machine-config, monitoring

I tried attaching the log bundle here but I guess it's too large. Uploaded here: http://file.rdu.redhat.com/~sbonazzo/log-bundle-20210930072055.tar.gz but I'm not sure it's visible outside Red Hat.

@sandrobonazzola
Author

After a few reboots of masters and workers the OKD deployment reached 100%.

The whole flow will need to be re-tested once @LorbusChris's patches get into a build.

@markandrewj

markandrewj commented Sep 30, 2021

@sandrobonazzola we can give your suggestion a try, thank you. The log bundle you have linked is unfortunately not available from our network. Once the patches are applied we will probably try rebuilding the nodes again. We are building this as a test cluster to support two other OKD/OCP clusters we currently have. The plan is to use it for testing upgrades, cluster configuration changes, and other administrative tasks.

@kai-uwe-rommel

Please see here: coreos/fedora-coreos-tracker#943
You perhaps hit the same problem. It's been diagnosed and a fix is pending for FCOS.
Try again with FCOS 20210626 or earlier.

@LorbusChris
Contributor

If that's the case, bumping the boot image now (see PRs linked above) won't solve it.

@markandrewj

markandrewj commented Oct 1, 2021

@kai-uwe-rommel we took your suggestion and tried to ignite the cluster using the fedora-coreos-34.20210626.3.2-live.x86_64 iso. Using this version of FCOS we were able to get past the issue we were having. The cluster is currently still bootstrapping, I will follow up further after the installation completes (or doesn't complete). Thank you for the feedback.

@smuda

smuda commented Oct 3, 2021

I had the same problem (masters querying localhost DNS instead of the network DNS) on my NUCs. It seems the NICs on the NUCs are too slow during startup, and I had to add a timeout during first boot:

sudo coreos-installer \
    install \
    /dev/sda \
    --firstboot-args='console=tty0 rd.neednet=1 rd.net.timeout.carrier=30' \
    --insecure-ignition \
    --ignition-url=http://infra1.ocp4.example.com:8080/master.ign

Perhaps that would be something to try out as well?

@markandrewj

We tried @smuda's suggestion, and we got past the original issue where we saw a 'connection refused' message. However, now when the worker nodes try to pull the config from the internal API, we receive an 'internal server error' message. The master nodes were able to pull their configs, though.

We tried using an older FCOS image from June, which allowed the config to be pulled, but the bootstrap never completed. We also tried installing using both the 4.7 and 4.8 installers.

okd-issue

@rvanderp3
Contributor

The worker ignition config isn't made available until sometime around the bootstrap-complete phase of the installation. It may need some time to finish bootstrapping. A must-gather could be useful in understanding why the machine-config server isn't yet returning a worker ignition.

@smuda

smuda commented Oct 22, 2021

I use FCOS 34.20210626.3.1 which works for me but it needs some fixes.

Once I detect the bootstrap has finished, I remove the bootstrap machine from the loadbalancer in front of api-int, thereby only including the masters.
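That step might look like this in an haproxy backend for api-int (a sketch with placeholder hostnames, not my actual config; the bootstrap server line is removed or commented out once bootstrapping completes):

```
backend okd-api-int
    mode tcp
    balance roundrobin
    # server bootstrap bootstrap.okd.example.com:6443 check  # remove after bootstrap-complete
    server master0 master0.okd.example.com:6443 check
    server master1 master1.okd.example.com:6443 check
    server master2 master2.okd.example.com:6443 check
```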

I've documented my setup at AyoyAB/okd-with-ansible. Perhaps something in that setup can give you some inspiration?

Among other things, there are fixes for FCOS issues (746, 975), which I've included in the ansible role "0-create-local-files".

@markandrewj

markandrewj commented Oct 26, 2021

We were finally able to get the cluster running using openshift-install-linux-4.7.0-0.okd-2021-07-03-190901.tar.gz and fedora-coreos-34.20210626.3.2-live.x86_64.iso. I would recommend doing a test install of OKD using the bare metal UPI method, if this isn't already tested as part of the OKD release pipeline. It seems that some of the FCOS images currently have issues.

@vrutkovs
Member

I would recommend doing a test install of OKD using the bare metal UPI method, if this isn't already tested in as part of the OKD release pipeline

That would indeed be great, but we don't have the capacity for this yet.

It seems that some of the FCOS images have issues currently

Yes. See https://github.com/openshift/okd/blob/master/FAQ.md#which-fedora-coreos-should-i-use for OKD 4.8

@sandrobonazzola do you mind if I close this one?

@sandrobonazzola
Author

@vrutkovs ok to close this one; the issue I hit is being tracked on the FCOS side.

@karthik101

@sandrobonazzola Where can I find the 4.9 version of the installer? I don't see it in the release tags.

@sandrobonazzola
Author

@karthik101 you can get it here: https://amd64.origin.releases.ci.openshift.org/#4.9.0-0.okd but be aware it's not promoted to the 4-stable channel.

@konup

konup commented Feb 4, 2022

Hi,
I have the same problem as reported by @sandrobonazzola.

The bootstrap machine fails to install with OKD 4.7, 4.8 and 4.9.
(The bootstrap machine is OK with all previous installations up to OKD 4.5.)

For testing, VMs with these compatibility modes were used:
ESXi 6.5 and later (VM version 13)
ESXi 6.7 U2 and later (VM version 15)

These OKD versions were (unsuccessfully) tried:
oc adm release extract --tools quay.io/openshift/okd:4.7.0-0.okd-2021-03-28-152009
oc adm release extract --tools quay.io/openshift/okd:4.8.0-0.okd-2021-11-14-052418
oc adm release extract --tools quay.io/openshift/okd:4.9.0-0.okd-2022-01-14-230113
oc adm release extract --tools quay.io/openshift/okd:4.9.0-0.okd-2022-01-29-035536

The first part of the ignition process looks fine:

  • bootstrap machine booted from the FCOS live image
    (these versions of FCOS were tried)
    fedora-coreos-33.20210328.3.0
    fedora-coreos-34.20210529.3.0
    fedora-coreos-34.20210626.3.1
    fedora-coreos-34.20211031.3.0
    fedora-coreos-35.20220116.3.0

  • networking (routing, fw, dns) looks OK
    the remote ignition bootstrap.ign file was successfully used from the LAN over HTTP with hostname resolving
    (coreos.inst.ignition_url=http://fs.company.local/okd/ignition/bootstrap.ign)

  • after the second automatic machine reboot during the ignition process, the OKD installation started;
    after a few seconds the API is accessible at https://api.okd.company.local:6443

  • I tried connecting to the bootstrap machine over SSH; DNS resolution of internal and public names looks OK:
    api.okd.company.local
    api-int.okd.company.local
    console-openshift-console.apps.okd.company.local
    and public names such as google.com, quay.io, ...

  • local and public network connectivity is also OK; I tested accessibility to
    api.okd.company.local:6443
    api.okd.company.local:22623
    fs.company.local:80 (place with ignition files)
    TCP to any public :80 and :443, such as redhat.com or google.com

The installation process ends at the same place as described by @sandrobonazzola, and the symptoms are very similar.

logs:

$ openshift-install --dir=/opt/remctl/OKD/init wait-for install-complete --log-level debug

INFO Waiting up to 40m0s for the cluster at https://api.okd.company.local:6443 to initialize...
DEBUG Still waiting for the cluster to initialize:
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: downloading update
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: downloading update
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 203 of 745 done (27% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 365 of 745 done (48% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 550 of 745 done (73% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 551 of 745 done (73% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 593 of 745 done (79% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 595 of 745 done (79% complete)
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
DEBUG * Could not update imagestream "openshift/driver-toolkit" (501 of 745): resource may have been deleted
DEBUG * Could not update oauthclient "console" (451 of 745): the server does not recognize this resource, check extension API servers
DEBUG * Could not update role "openshift-apiserver/prometheus-k8s" (723 of 745): resource may have been deleted
DEBUG * Could not update role "openshift-authentication/prometheus-k8s" (633 of 745): resource may have been deleted
DEBUG * Could not update role "openshift-console-operator/prometheus-k8s" (670 of 745): resource may have been deleted
DEBUG * Could not update role "openshift-controller-manager/prometheus-k8s" (731 of 745): resource may have been deleted
DEBUG * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (414 of 745): resource may have been deleted
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 596 of 745 done (80% complete)

$ oc adm must-gather

[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plug-in image: registry.redhat.io/openshift4/ose-must-gather:latest
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: af807ebb-b069-46d6-af2d-98aa24f7a32f
ClusterVersion: Installing "4.9.0-0.okd-2022-01-29-035536" for 5 minutes: Working towards 4.9.0-0.okd-2022-01-29-035536: 596 of 745 done (80% complete)
ClusterOperators:
        clusteroperator/authentication is not available (<missing>) because <missing>
        clusteroperator/baremetal is not available (<missing>) because <missing>
        clusteroperator/cloud-controller-manager is not available (<missing>) because <missing>
        clusteroperator/cluster-autoscaler is not available (<missing>) because <missing>
        clusteroperator/config-operator is not available (<missing>) because <missing>
        clusteroperator/console is not available (<missing>) because <missing>
        clusteroperator/csi-snapshot-controller is not available (<missing>) because <missing>
        clusteroperator/dns is not available (<missing>) because <missing>
        clusteroperator/etcd is not available (<missing>) because <missing>
        clusteroperator/image-registry is not available (<missing>) because <missing>
        clusteroperator/ingress is not available (<missing>) because <missing>
        clusteroperator/insights is not available (<missing>) because <missing>
        clusteroperator/kube-apiserver is not available (<missing>) because <missing>
        clusteroperator/kube-controller-manager is not available (<missing>) because <missing>
        clusteroperator/kube-scheduler is not available (<missing>) because <missing>
        clusteroperator/kube-storage-version-migrator is not available (<missing>) because <missing>
        clusteroperator/machine-api is not available (<missing>) because <missing>
        clusteroperator/machine-approver is not available (<missing>) because <missing>
        clusteroperator/machine-config is not available (<missing>) because <missing>
        clusteroperator/marketplace is not available (<missing>) because <missing>
        clusteroperator/monitoring is not available (<missing>) because <missing>
        clusteroperator/network is not available (<missing>) because <missing>
        clusteroperator/node-tuning is not available (<missing>) because <missing>
        clusteroperator/openshift-apiserver is not available (<missing>) because <missing>
        clusteroperator/openshift-controller-manager is not available (<missing>) because <missing>
        clusteroperator/openshift-samples is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager-catalog is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager-packageserver is not available (<missing>) because <missing>
        clusteroperator/service-ca is not available (<missing>) because <missing>
        clusteroperator/storage is not available (<missing>) because <missing>

[must-gather      ] OUT namespace/openshift-must-gather-nvkch created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-b9gpq created
[must-gather      ] OUT pod for plug-in image registry.redhat.io/openshift4/ose-must-gather:latest created

End of journalctl from the bootstrap machine:
[core@bootstrap ~]$ sudo journalctl -xe

Feb 04 12:47:11 bootstrap.okd.company.local approve-csr.sh[5356]: No resources found
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.242163    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.243984    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.244030    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.244045    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.402334    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.404235    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.404276    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.404289    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007782    3974 kubelet_getters.go:176] "Pod status updated" pod="kube-system/bootstrap-kube-scheduler-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007823    3974 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-bootstrap-member-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007834    3974 kubelet_getters.go:176] "Pod status updated" pod="default/bootstrap-machine-config-operator-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007846    3974 kubelet_getters.go:176] "Pod status updated" pod="openshift-cluster-version/bootstrap-cluster-version-operator-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007856    3974 kubelet_getters.go:176] "Pod status updated" pod="openshift-cloud-credential-operator/cloud-credential-operator-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007869    3974 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/bootstrap-kube-apiserver-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007879    3974 kubelet_getters.go:176] "Pod status updated" pod="kube-system/bootstrap-kube-controller-manager-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local audit[5372]: AVC avc:  denied  { ioctl } for  pid=5372 comm="iptables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Feb 04 12:47:24 bootstrap.okd.company.local audit[5372]: SYSCALL arch=c000003e syscall=59 success=yes exit=0 a0=c00137c9f0 a1=c001364c30 a2=c0011bcf60 a3=8 items=0 ppid=3974 pid=5372 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1400 audit(1643978844.086:438): avc:  denied  { ioctl } for  pid=5372 comm="iptables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1300 audit(1643978844.086:438): arch=c000003e syscall=59 success=yes exit=0 a0=c00137c9f0 a1=c001364c30 a2=c0011bcf60 a3=8 items=0 ppid=3974 pid=5372 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1309 audit(1643978844.086:438): argc=9 a0="iptables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Feb 04 12:47:24 bootstrap.okd.company.local audit: EXECVE argc=9 a0="iptables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Feb 04 12:47:24 bootstrap.okd.company.local audit: PROCTITLE proctitle=69707461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1327 audit(1643978844.086:438): proctitle=69707461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.241922    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.243702    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.243744    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.243757    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:24 bootstrap.okd.company.local audit[5373]: AVC avc:  denied  { ioctl } for  pid=5373 comm="ip6tables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Feb 04 12:47:24 bootstrap.okd.company.local audit[5373]: SYSCALL arch=c000003e syscall=59 success=yes exit=0 a0=c00137d560 a1=c001365950 a2=c00162ea80 a3=8 items=0 ppid=3974 pid=5373 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip6tables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1400 audit(1643978844.251:439): avc:  denied  { ioctl } for  pid=5373 comm="ip6tables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1300 audit(1643978844.251:439): arch=c000003e syscall=59 success=yes exit=0 a0=c00137d560 a1=c001365950 a2=c00162ea80 a3=8 items=0 ppid=3974 pid=5373 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip6tables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1309 audit(1643978844.251:439): argc=9 a0="ip6tables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Feb 04 12:47:24 bootstrap.okd.company.local audit: EXECVE argc=9 a0="ip6tables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Feb 04 12:47:24 bootstrap.okd.company.local audit: PROCTITLE proctitle=6970367461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1327 audit(1643978844.251:439): proctitle=6970367461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
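The `PROCTITLE` records above carry the process's argv hex-encoded, with NUL bytes separating the arguments. A small sketch (the hex string is copied verbatim from the log line above) decodes it back into the command that triggered the AVC denial, matching the `EXECVE` record:

```python
# Decode an audit PROCTITLE payload: the process's argv,
# hex-encoded, with NUL bytes separating the arguments.
hex_payload = ("6970367461626C6573002D770035002D5700313030303030"
               "002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65")
argv = [a.decode() for a in bytes.fromhex(hex_payload).split(b"\x00")]
print(" ".join(argv))
# → ip6tables -w 5 -W 100000 -S KUBE-KUBELET-CANARY -t mangle
```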
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.414222    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.416164    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.416205    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.416221    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:31 bootstrap.okd.company.local approve-csr.sh[5375]: No resources found
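The `No resources found` line from approve-csr.sh just means no pending CertificateSigningRequests existed yet — the masters had not come up far enough to request client certificates. As a hedged sketch (the JSON shape matches `oc get csr -o json`; the function name is mine), a Pending CSR is one whose status has no conditions:

```python
import json

def pending_csr_names(csr_list_json: str) -> list[str]:
    """Return names of CertificateSigningRequests that are still Pending,
    i.e. have no status.conditions yet. These are the CSRs that the
    bootstrap's approve loop would pass to `oc adm certificate approve`."""
    items = json.loads(csr_list_json).get("items", [])
    return [c["metadata"]["name"]
            for c in items
            if not c.get("status", {}).get("conditions")]
```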
Feb 04 12:47:33 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:33.242931    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:33 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:33.245710    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:33 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:33.245760    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:33 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:33.245777    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:34 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:34.428619    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:34 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:34.431286    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:34 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:34.431370    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:34 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:34.431400    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:37 bootstrap.okd.company.local systemd[5176]: Starting Mark boot as successful...
░░ Subject: A start job for unit UNIT has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit UNIT has begun execution.
░░
░░ The job identifier is 13.
Feb 04 12:47:37 bootstrap.okd.company.local grub2-set-bootflag[5389]: Creating tmpfile failed: Read-only file system
Feb 04 12:47:37 bootstrap.okd.company.local systemd[5176]: grub-boot-success.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit UNIT has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Feb 04 12:47:37 bootstrap.okd.company.local systemd[5176]: grub-boot-success.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit UNIT has entered the 'failed' state with result 'exit-code'.
Feb 04 12:47:37 bootstrap.okd.company.local systemd[5176]: Failed to start Mark boot as successful.
░░ Subject: A start job for unit UNIT has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit UNIT has finished with a failure.
░░
░░ The job identifier is 13 and the job result is failed.
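The grub-boot-success failure is most likely harmless noise here: on Fedora CoreOS `/boot` is mounted read-only, so grub2-set-bootflag cannot create its tmpfile. A quick way to confirm the mount state (a sketch, assuming the standard `/proc/mounts` format) is to check the mount options:

```python
def mount_is_readonly(mountpoint: str, mounts_path: str = "/proc/mounts"):
    """Return True if `mountpoint` carries the `ro` mount option,
    False if mounted read-write, None if it is not in the mounts file."""
    with open(mounts_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == mountpoint:
                return "ro" in fields[3].split(",")
    return None
```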

@smuda

smuda commented Feb 5, 2022

@konup It would seem you have a very different problem, since you are on ESXi (vs bare metal) and the master can connect to the bootstrap node during installation (vs a network problem fetching the secondary ignition file).

You'd probably be best off creating your own issue.

@konup

konup commented Feb 6, 2022

OK — since this is a different problem, I opened new issue #1093.
