
Bootstrap machine fails to install with okd 4.7, 4.8 and 4.9 #897

Closed
sandrobonazzola opened this issue Sep 28, 2021 · 40 comments
Comments
@sandrobonazzola

Describe the bug
The bootstrap machine fails to install with OKD 4.8 and OKD 4.9.

Watching the installation with: # openshift-install --dir /root/install_dir wait-for install-complete --log-level=debug reports:

DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 8 of 742 done (1% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 193 of 742 done (26% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 332 of 742 done (44% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 517 of 742 done (69% complete) 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 602 of 742 done (81% complete) 
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress: 
DEBUG * Could not update oauthclient "console" (449 of 742): the server does not recognize this resource, check extension API servers 
DEBUG * Could not update role "openshift-apiserver/prometheus-k8s" (720 of 742): resource may have been deleted 
DEBUG * Could not update role "openshift-authentication/prometheus-k8s" (630 of 742): resource may have been deleted 
DEBUG * Could not update role "openshift-console-operator/prometheus-k8s" (667 of 742): resource may have been deleted 
DEBUG * Could not update role "openshift-controller-manager/prometheus-k8s" (728 of 742): resource may have been deleted 
DEBUG * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (412 of 742): resource may have been deleted 
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2021-09-27-224448: 602 of 742 done (81% complete)

Version
4.9.0-0.okd-2021-09-27-224448

How reproducible
100%

Log bundle

# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT 
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
[must-gather      ] OUT Using must-gather plug-in image: registry.redhat.io/openshift4/ose-must-gather:latest
ClusterID: 6e34f392-23f5-4e12-82b1-e612e5d38fb1
ClusterVersion: Installing "4.9.0-0.okd-2021-09-27-224448" for 8 minutes: Working towards 4.9.0-0.okd-2021-09-27-224448: 602 of 742 done (81% complete)
ClusterOperators:
	clusteroperator/authentication is not available (<missing>) because <missing>
	clusteroperator/baremetal is not available (<missing>) because <missing>
	clusteroperator/cloud-controller-manager is not available (<missing>) because <missing>
	clusteroperator/cluster-autoscaler is not available (<missing>) because <missing>
	clusteroperator/config-operator is not available (<missing>) because <missing>
	clusteroperator/console is not available (<missing>) because <missing>
	clusteroperator/csi-snapshot-controller is not available (<missing>) because <missing>
	clusteroperator/dns is not available (<missing>) because <missing>
	clusteroperator/etcd is not available (<missing>) because <missing>
	clusteroperator/image-registry is not available (<missing>) because <missing>
	clusteroperator/ingress is not available (<missing>) because <missing>
	clusteroperator/insights is not available (<missing>) because <missing>
	clusteroperator/kube-apiserver is not available (<missing>) because <missing>
	clusteroperator/kube-controller-manager is not available (<missing>) because <missing>
	clusteroperator/kube-scheduler is not available (<missing>) because <missing>
	clusteroperator/kube-storage-version-migrator is not available (<missing>) because <missing>
	clusteroperator/machine-api is not available (<missing>) because <missing>
	clusteroperator/machine-approver is not available (<missing>) because <missing>
	clusteroperator/machine-config is not available (<missing>) because <missing>
	clusteroperator/marketplace is not available (<missing>) because <missing>
	clusteroperator/monitoring is not available (<missing>) because <missing>
	clusteroperator/network is not available (<missing>) because <missing>
	clusteroperator/node-tuning is not available (<missing>) because <missing>
	clusteroperator/openshift-apiserver is not available (<missing>) because <missing>
	clusteroperator/openshift-controller-manager is not available (<missing>) because <missing>
	clusteroperator/openshift-samples is not available (<missing>) because <missing>
	clusteroperator/operator-lifecycle-manager is not available (<missing>) because <missing>
	clusteroperator/operator-lifecycle-manager-catalog is not available (<missing>) because <missing>
	clusteroperator/operator-lifecycle-manager-packageserver is not available (<missing>) because <missing>
	clusteroperator/service-ca is not available (<missing>) because <missing>
	clusteroperator/storage is not available (<missing>) because <missing>


[must-gather      ] OUT namespace/openshift-must-gather-qqz9v created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-728c6 created
[must-gather      ] OUT pod for plug-in image registry.redhat.io/openshift4/ose-must-gather:latest created

and stays stuck there.

@sandrobonazzola
Author

About

DEBUG * Could not update oauthclient "console" (449 of 742): the server does not recognize this resource, check extension API servers

tried:

# oc get ouathclient
error: the server doesn't have a resource type "ouathclient"

@sandrobonazzola sandrobonazzola changed the title Bootstrap machine fails to install with okd 4.8 and okd 4.9 Bootstrap machine fails to install with okd 4.7, 4.8 and 4.9 Sep 29, 2021
@sandrobonazzola
Author

sandrobonazzola commented Sep 29, 2021

Failing on 4.7 and 4.8 as well

@vrutkovs
Member

Please attach a log bundle

@sandrobonazzola
Author

@vrutkovs
Member

Masters booted but didn't request the master ignition. Check the boot log on the master machine; it seems the network is down there.

@sandrobonazzola
Author

Master is up, the console shows:
Screenshot

@vrutkovs
Member

Seems to be an invalid network config; apparently it can't reach the DNS server?

@sandrobonazzola
Author

The DNS is provided by the lab, but here it seems to be trying to connect to [::1]:53 rather than to the lab DNS.

@sandrobonazzola
Author

Is there any way to configure the DNS to be used from install-config.yaml?

@rvanderp3
Contributor

Can you confirm that the master node is getting an IP address (i.e. can you ping it)? This does seem to indicate an issue with the network configuration. The IP configuration cannot be set via install-config.
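For UPI installs, static IP and DNS settings are typically passed as dracut kernel arguments at first boot rather than through install-config.yaml. A minimal sketch, assuming a static IPv4 setup (the addresses, hostname, and interface name below are placeholders, not values from this thread):

```
ip=192.168.1.10::192.168.1.1:255.255.255.0:master0.okd.example.com:ens3:none
nameserver=192.168.1.53
```

The `ip=` argument follows the dracut ordering client-IP:peer:gateway:netmask:hostname:interface:autoconf, with `none` disabling DHCP on that interface.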

@sandrobonazzola
Author

master0 doesn't reply to ping, but I see it got both IPv4 and IPv6 addresses from the guest agent running on it (it's a bare metal UPI install, but the host is virtual)

@sandrobonazzola
Author

Turning the NIC up and down shows that eth0 got renamed to ens3.
Screenshot2

@sandrobonazzola
Author

Is there a way to set up the core user password from install-config.yaml, as described in https://docs.fedoraproject.org/en-US/fedora-coreos/authentication/#_using_password_authentication ?

@rvanderp3
Contributor

rvanderp3 commented Sep 29, 2021

Unfortunately, I don't believe there are many further insights to be offered without understanding why the master is unreachable. You might consider booting the FCOS live ISO to see if you can troubleshoot the network from a vantage point where you can run troubleshooting commands.

install-config.yaml doesn't allow specific users to be created. Users are created via the ignition content that is produced by the installer binary. Regardless, until the master nodes retrieve their ignition, a user provided via master.ign wouldn't be applied anyway. IMO your best bet is to boot the ISO and see if you can inspect the network configuration.
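As a sketch of that approach, one could patch the generated master.ign before serving it, merging a password hash into the passwd section for the core user. This assumes an Ignition spec 3.x config; the hash below is a placeholder (a real one would come from e.g. `mkpasswd`), not anything from this thread:

```python
import json

def add_core_password_hash(ignition: dict, password_hash: str) -> dict:
    """Return a copy of an Ignition config with a passwordHash set for the
    'core' user, creating the passwd/users section if it is absent."""
    updated = json.loads(json.dumps(ignition))  # deep copy via round-trip
    users = updated.setdefault("passwd", {}).setdefault("users", [])
    for user in users:
        if user.get("name") == "core":
            user["passwordHash"] = password_hash
            break
    else:
        users.append({"name": "core", "passwordHash": password_hash})
    return updated

# Example with a stub config; "$y$j9T$PLACEHOLDER" stands in for a real hash.
master_ign = {"ignition": {"version": "3.2.0"}}
patched = add_core_password_hash(master_ign, "$y$j9T$PLACEHOLDER")
print(json.dumps(patched["passwd"], indent=2))
```

As noted above, though, the hash only takes effect if the node actually retrieves and applies its ignition.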

@sandrobonazzola
Author

  • Running Fedora CoreOS 34 live shows the network is working properly, getting the same IP address reported by the guest agent, and it is able to query DNS. The NIC is called ens3.
  • I edited the passwd section for masters in the generated metadata and added a password for the core user. I re-generated the ignition and re-deployed the master node. It was pointless because at boot I could observe that nm-initrd.service failed, and it never reached a point where the user could log in.

@sandrobonazzola
Author

After a dozen attempts, I got this screenshot:
Screenshot3

@sandrobonazzola
Author

sandrobonazzola commented Sep 29, 2021

@dustymabe sounds like something you handled in coreos/fedora-coreos-tracker#883

@sandrobonazzola
Author

sandrobonazzola commented Sep 29, 2021

Looks like passing console=null on the kernel boot command line lets the network come up. I'll give it another re-deploy from scratch, passing this command line.
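Baked into the install, the workaround might look like this as a coreos-installer invocation (a sketch only; the device, hostname, and paths are placeholders, not values from this thread):

```
# Hypothetical example: make console=null a first-boot kernel argument
sudo coreos-installer install /dev/sda \
    --firstboot-args='console=null' \
    --insecure-ignition \
    --ignition-url=http://fs.example.com/okd/ignition/master.ign
```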

@LorbusChris
Contributor

LorbusChris commented Sep 29, 2021

^ i.e. coreos/fedora-coreos-tracker#883
It seems that change has made it into OKD's machine-os-content, but not the boot image used for 4.7, as that is older and wouldn't have it: https://github.com/vrutkovs/installer/blob/release-4.7-okd/data/data/fcos-amd64.json#L72
I'll open a PR to bump that.

@sandrobonazzola
Author

@LorbusChris please also check 4.8 and 4.9, as I see the issue there too

@LorbusChris
Contributor

Which is also a version from well before the change even made it into FCOS's testing-devel (on 20210709: coreos/fedora-coreos-config@876fda1)

@markandrewj

markandrewj commented Sep 29, 2021

We have been struggling with this same issue for the last week while trying to stand up a new OKD cluster. When the nodes try to pull the ignition config from the bootstrap, we get a 'connection refused' message. We are using FCOS 34.20210904.3.0 and trying to build an OKD 4.7 UPI bare metal cluster.

okd_issue

Any feedback would be appreciated. We have validated our network configuration (DNS, DHCP, load balancer, etc.) multiple times.

@sandrobonazzola
Author

@markandrewj I managed to get past the network issue by adding console=null to the kernel boot command line on the master nodes' first boot.

I now progressed the deployment from 80% to 87% and got stuck with:

DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, machine-config, monitoring

I tried attaching the log bundle here but I guess it's too large. Uploaded here: http://file.rdu.redhat.com/~sbonazzo/log-bundle-20210930072055.tar.gz but I'm not sure it's visible outside Red Hat.

@sandrobonazzola
Author

After a few reboots of masters and workers the OKD deployment reached 100%.

The whole flow will need to be re-tested once @LorbusChris's patches get into a build.

@markandrewj

markandrewj commented Sep 30, 2021

@sandrobonazzola we can give your suggestion a try, thank you. The log bundle you have linked is unfortunately not available from our network. Once the patches are applied we will probably try rebuilding the nodes again. We are building this as a test cluster to support two other OKD/OCP clusters we currently have. The plan is to use it for testing upgrades, cluster configuration changes, and other administrative tasks.

@kai-uwe-rommel

Please see here: coreos/fedora-coreos-tracker#943
You perhaps hit the same problem. It's been diagnosed and a fix is pending for FCOS.
Try again with FCOS 20210626 or earlier.

@LorbusChris
Contributor

If that's the case, bumping the boot image now (see PRs linked above) won't solve it.

@markandrewj

markandrewj commented Oct 1, 2021

@kai-uwe-rommel we took your suggestion and tried to ignite the cluster using the fedora-coreos-34.20210626.3.2-live.x86_64 iso. Using this version of FCOS we were able to get past the issue we were having. The cluster is currently still bootstrapping, I will follow up further after the installation completes (or doesn't complete). Thank you for the feedback.

@smuda

smuda commented Oct 3, 2021

I had the same problem (masters querying localhost DNS instead of the network DNS) on my NUCs. It seems the NICs on the NUCs are too slow during startup, and I had to add a timeout during first boot:

sudo coreos-installer \
    install \
    /dev/sda \
    --firstboot-args='console=tty0 rd.neednet=1 rd.net.timeout.carrier=30' \
    --insecure-ignition \
    --ignition-url=http://infra1.ocp4.example.com:8080/master.ign

Perhaps that would be something to try out as well?

@markandrewj

We tried @smuda's suggestion, and we got past the original issue where we saw a 'connection refused' message. However, now when the worker nodes try to pull the config from the internal API, we receive an 'internal server error' message. The master nodes were able to pull their configs, though.

We tried using an older FCOS image from June, which allowed the config to be pulled, but the bootstrap never completed. We also tried installing using both the 4.7 and 4.8 installers.

okd-issue

@rvanderp3
Contributor

The worker ignition config isn't made available until sometime around the bootstrap-complete phase of the installation. It may need some time to finish bootstrapping. A must-gather could be useful in understanding why the machine-config server isn't yet returning a worker ignition.

@smuda

smuda commented Oct 22, 2021

I use FCOS 34.20210626.3.1 which works for me but it needs some fixes.

Once I detect the bootstrap has finished, I remove the bootstrap machine from the loadbalancer in front of api-int, thereby only including the masters.
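That step might look like this in an haproxy backend for api-int (a sketch with placeholder hostnames, not my actual config; the bootstrap server line is removed or commented out once bootstrapping completes):

```
backend okd-api-int
    mode tcp
    balance roundrobin
    # server bootstrap bootstrap.okd.example.com:6443 check  # remove after bootstrap-complete
    server master0 master0.okd.example.com:6443 check
    server master1 master1.okd.example.com:6443 check
    server master2 master2.okd.example.com:6443 check
```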

I've documented my setup at AyoyAB/okd-with-ansible. Perhaps something in that setup can give you some inspiration?

Among other things, there are fixes for FCOS issues (746, 975), which I've included in the ansible role "0-create-local-files".

@markandrewj

markandrewj commented Oct 26, 2021

We were finally able to get the cluster running using openshift-install-linux-4.7.0-0.okd-2021-07-03-190901.tar.gz and fedora-coreos-34.20210626.3.2-live.x86_64.iso. I would recommend doing a test install of OKD using the bare metal UPI method, if this isn't already tested as part of the OKD release pipeline. It seems that some of the FCOS images currently have issues.

@vrutkovs
Member

I would recommend doing a test install of OKD using the bare metal UPI method, if this isn't already tested in as part of the OKD release pipeline

That would indeed be great, but we don't have the capacity for this yet.

It seems that some of the FCOS images have issues currently

Yes. See https://github.com/openshift/okd/blob/master/FAQ.md#which-fedora-coreos-should-i-use for OKD 4.8

@sandrobonazzola do you mind if I close this one?

@sandrobonazzola
Author

@vrutkovs ok to close this one; the issue I hit is being tracked on the FCOS side.

@karthik101

@sandrobonazzola Where can I find the 4.9 version of the installer? I don't see it in the release tags.

@sandrobonazzola
Author

@karthik101 you can get it here: https://amd64.origin.releases.ci.openshift.org/#4.9.0-0.okd but be aware it's not promoted to the 4-stable channel.

@konup

konup commented Feb 4, 2022

Hi,
I have the same problem as reported by @sandrobonazzola.

The bootstrap machine fails to install with OKD 4.7, 4.8 and 4.9.
(The bootstrap machine is OK with all previous installations up to OKD 4.5.)

For testing, VMs with these compatibility modes were used:
ESXi 6.5 and later (VM version 13)
ESXi 6.7 U2 and later (VM version 15)

These OKD versions were (unsuccessfully) tried:
oc adm release extract --tools quay.io/openshift/okd:4.7.0-0.okd-2021-03-28-152009
oc adm release extract --tools quay.io/openshift/okd:4.8.0-0.okd-2021-11-14-052418
oc adm release extract --tools quay.io/openshift/okd:4.9.0-0.okd-2022-01-14-230113
oc adm release extract --tools quay.io/openshift/okd:4.9.0-0.okd-2022-01-29-035536

The first part of the ignition process looks fine:

  • bootstrap machine booted from the FCOS live image
    (these versions of FCOS were tried)
    fedora-coreos-33.20210328.3.0
    fedora-coreos-34.20210529.3.0
    fedora-coreos-34.20210626.3.1
    fedora-coreos-34.20211031.3.0
    fedora-coreos-35.20220116.3.0

  • networking (routing, fw, dns) looks OK
    the remote ignition bootstrap.ign file was successfully used from the LAN over HTTP with hostname resolving
    (coreos.inst.ignition_url=http://fs.company.local/okd/ignition/bootstrap.ign)

  • after the second automatic machine reboot during the ignition process, the OKD installation started;
    after a few seconds the API is accessible at https://api.okd.company.local:6443

  • I tried connecting to the bootstrap machine over SSH; DNS resolution of internal and public names looks OK:
    api.okd.company.local
    api-int.okd.company.local
    console-openshift-console.apps.okd.company.local
    and public names such as google.com, quay.io, ...

  • local and public network connectivity is also OK; I tested accessibility to
    api.okd.company.local:6443
    api.okd.company.local:22623
    fs.company.local:80 (place with ignition files)
    TCP to any public :80 and :443, such as redhat.com or google.com

The installation process ends at the same place as described by @sandrobonazzola, and the symptoms are very similar.

logs:

$ openshift-install --dir=/opt/remctl/OKD/init wait-for install-complete --log-level debug

INFO Waiting up to 40m0s for the cluster at https://api.okd.company.local:6443 to initialize...
DEBUG Still waiting for the cluster to initialize:
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: downloading update
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: downloading update
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 203 of 745 done (27% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 365 of 745 done (48% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 550 of 745 done (73% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 551 of 745 done (73% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 593 of 745 done (79% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 595 of 745 done (79% complete)
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
DEBUG * Could not update imagestream "openshift/driver-toolkit" (501 of 745): resource may have been deleted
DEBUG * Could not update oauthclient "console" (451 of 745): the server does not recognize this resource, check extension API servers
DEBUG * Could not update role "openshift-apiserver/prometheus-k8s" (723 of 745): resource may have been deleted
DEBUG * Could not update role "openshift-authentication/prometheus-k8s" (633 of 745): resource may have been deleted
DEBUG * Could not update role "openshift-console-operator/prometheus-k8s" (670 of 745): resource may have been deleted
DEBUG * Could not update role "openshift-controller-manager/prometheus-k8s" (731 of 745): resource may have been deleted
DEBUG * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (414 of 745): resource may have been deleted
DEBUG Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-01-29-035536: 596 of 745 done (80% complete)

$ oc adm must-gather

[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plug-in image: registry.redhat.io/openshift4/ose-must-gather:latest
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: af807ebb-b069-46d6-af2d-98aa24f7a32f
ClusterVersion: Installing "4.9.0-0.okd-2022-01-29-035536" for 5 minutes: Working towards 4.9.0-0.okd-2022-01-29-035536: 596 of 745 done (80% complete)
ClusterOperators:
        clusteroperator/authentication is not available (<missing>) because <missing>
        clusteroperator/baremetal is not available (<missing>) because <missing>
        clusteroperator/cloud-controller-manager is not available (<missing>) because <missing>
        clusteroperator/cluster-autoscaler is not available (<missing>) because <missing>
        clusteroperator/config-operator is not available (<missing>) because <missing>
        clusteroperator/console is not available (<missing>) because <missing>
        clusteroperator/csi-snapshot-controller is not available (<missing>) because <missing>
        clusteroperator/dns is not available (<missing>) because <missing>
        clusteroperator/etcd is not available (<missing>) because <missing>
        clusteroperator/image-registry is not available (<missing>) because <missing>
        clusteroperator/ingress is not available (<missing>) because <missing>
        clusteroperator/insights is not available (<missing>) because <missing>
        clusteroperator/kube-apiserver is not available (<missing>) because <missing>
        clusteroperator/kube-controller-manager is not available (<missing>) because <missing>
        clusteroperator/kube-scheduler is not available (<missing>) because <missing>
        clusteroperator/kube-storage-version-migrator is not available (<missing>) because <missing>
        clusteroperator/machine-api is not available (<missing>) because <missing>
        clusteroperator/machine-approver is not available (<missing>) because <missing>
        clusteroperator/machine-config is not available (<missing>) because <missing>
        clusteroperator/marketplace is not available (<missing>) because <missing>
        clusteroperator/monitoring is not available (<missing>) because <missing>
        clusteroperator/network is not available (<missing>) because <missing>
        clusteroperator/node-tuning is not available (<missing>) because <missing>
        clusteroperator/openshift-apiserver is not available (<missing>) because <missing>
        clusteroperator/openshift-controller-manager is not available (<missing>) because <missing>
        clusteroperator/openshift-samples is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager-catalog is not available (<missing>) because <missing>
        clusteroperator/operator-lifecycle-manager-packageserver is not available (<missing>) because <missing>
        clusteroperator/service-ca is not available (<missing>) because <missing>
        clusteroperator/storage is not available (<missing>) because <missing>

[must-gather      ] OUT namespace/openshift-must-gather-nvkch created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-b9gpq created
[must-gather      ] OUT pod for plug-in image registry.redhat.io/openshift4/ose-must-gather:latest created

End of journalctl from the bootstrap machine:
[core@bootstrap ~]$ sudo journalctl -xe

Feb 04 12:47:11 bootstrap.okd.company.local approve-csr.sh[5356]: No resources found
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.242163    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.243984    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.244030    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.244045    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.402334    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.404235    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.404276    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:14 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:14.404289    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007782    3974 kubelet_getters.go:176] "Pod status updated" pod="kube-system/bootstrap-kube-scheduler-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007823    3974 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-bootstrap-member-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007834    3974 kubelet_getters.go:176] "Pod status updated" pod="default/bootstrap-machine-config-operator-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007846    3974 kubelet_getters.go:176] "Pod status updated" pod="openshift-cluster-version/bootstrap-cluster-version-operator-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007856    3974 kubelet_getters.go:176] "Pod status updated" pod="openshift-cloud-credential-operator/cloud-credential-operator-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007869    3974 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/bootstrap-kube-apiserver-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.007879    3974 kubelet_getters.go:176] "Pod status updated" pod="kube-system/bootstrap-kube-controller-manager-bootstrap.okd.company.local" status=Running
Feb 04 12:47:24 bootstrap.okd.company.local audit[5372]: AVC avc:  denied  { ioctl } for  pid=5372 comm="iptables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Feb 04 12:47:24 bootstrap.okd.company.local audit[5372]: SYSCALL arch=c000003e syscall=59 success=yes exit=0 a0=c00137c9f0 a1=c001364c30 a2=c0011bcf60 a3=8 items=0 ppid=3974 pid=5372 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1400 audit(1643978844.086:438): avc:  denied  { ioctl } for  pid=5372 comm="iptables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1300 audit(1643978844.086:438): arch=c000003e syscall=59 success=yes exit=0 a0=c00137c9f0 a1=c001364c30 a2=c0011bcf60 a3=8 items=0 ppid=3974 pid=5372 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1309 audit(1643978844.086:438): argc=9 a0="iptables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Feb 04 12:47:24 bootstrap.okd.company.local audit: EXECVE argc=9 a0="iptables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Feb 04 12:47:24 bootstrap.okd.company.local audit: PROCTITLE proctitle=69707461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1327 audit(1643978844.086:438): proctitle=69707461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.241922    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.243702    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.243744    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.243757    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:24 bootstrap.okd.company.local audit[5373]: AVC avc:  denied  { ioctl } for  pid=5373 comm="ip6tables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Feb 04 12:47:24 bootstrap.okd.company.local audit[5373]: SYSCALL arch=c000003e syscall=59 success=yes exit=0 a0=c00137d560 a1=c001365950 a2=c00162ea80 a3=8 items=0 ppid=3974 pid=5373 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip6tables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1400 audit(1643978844.251:439): avc:  denied  { ioctl } for  pid=5373 comm="ip6tables" path="/sys/fs/cgroup" dev="cgroup2" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1300 audit(1643978844.251:439): arch=c000003e syscall=59 success=yes exit=0 a0=c00137d560 a1=c001365950 a2=c00162ea80 a3=8 items=0 ppid=3974 pid=5373 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip6tables" exe="/usr/sbin/xtables-legacy-multi" subj=system_u:system_r:iptables_t:s0 key=(null)
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1309 audit(1643978844.251:439): argc=9 a0="ip6tables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Feb 04 12:47:24 bootstrap.okd.company.local audit: EXECVE argc=9 a0="ip6tables" a1="-w" a2="5" a3="-W" a4="100000" a5="-S" a6="KUBE-KUBELET-CANARY" a7="-t" a8="mangle"
Feb 04 12:47:24 bootstrap.okd.company.local audit: PROCTITLE proctitle=6970367461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
Feb 04 12:47:24 bootstrap.okd.company.local kernel: audit: type=1327 audit(1643978844.251:439): proctitle=6970367461626C6573002D770035002D5700313030303030002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65
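The `PROCTITLE` records above carry the process's argv hex-encoded, with NUL bytes separating the arguments. A small sketch (the hex string is copied verbatim from the log line above) decodes it back into the command that triggered the AVC denial, matching the `EXECVE` record:

```python
# Decode an audit PROCTITLE payload: the process's argv,
# hex-encoded, with NUL bytes separating the arguments.
hex_payload = ("6970367461626C6573002D770035002D5700313030303030"
               "002D53004B5542452D4B5542454C45542D43414E415259002D74006D616E676C65")
argv = [a.decode() for a in bytes.fromhex(hex_payload).split(b"\x00")]
print(" ".join(argv))
# → ip6tables -w 5 -W 100000 -S KUBE-KUBELET-CANARY -t mangle
```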
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.414222    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.416164    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.416205    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:24 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:24.416221    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:31 bootstrap.okd.company.local approve-csr.sh[5375]: No resources found
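The `No resources found` line from approve-csr.sh just means no pending CertificateSigningRequests existed yet — the masters had not come up far enough to request client certificates. As a hedged sketch (the JSON shape matches `oc get csr -o json`; the function name is mine), a Pending CSR is one whose status has no conditions:

```python
import json

def pending_csr_names(csr_list_json: str) -> list[str]:
    """Return names of CertificateSigningRequests that are still Pending,
    i.e. have no status.conditions yet. These are the CSRs that the
    bootstrap's approve loop would pass to `oc adm certificate approve`."""
    items = json.loads(csr_list_json).get("items", [])
    return [c["metadata"]["name"]
            for c in items
            if not c.get("status", {}).get("conditions")]
```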
Feb 04 12:47:33 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:33.242931    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:33 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:33.245710    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:33 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:33.245760    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:33 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:33.245777    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:34 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:34.428619    3974 kubelet_node_status.go:386] "Setting node annotation to enable volume controller attach/detach"
Feb 04 12:47:34 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:34.431286    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientMemory"
Feb 04 12:47:34 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:34.431370    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasNoDiskPressure"
Feb 04 12:47:34 bootstrap.okd.company.local kubelet.sh[3974]: I0204 12:47:34.431400    3974 kubelet_node_status.go:581] "Recording event message for node" node="bootstrap.okd.company.local" event="NodeHasSufficientPID"
Feb 04 12:47:37 bootstrap.okd.company.local systemd[5176]: Starting Mark boot as successful...
░░ Subject: A start job for unit UNIT has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit UNIT has begun execution.
░░
░░ The job identifier is 13.
Feb 04 12:47:37 bootstrap.okd.company.local grub2-set-bootflag[5389]: Creating tmpfile failed: Read-only file system
Feb 04 12:47:37 bootstrap.okd.company.local systemd[5176]: grub-boot-success.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit UNIT has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Feb 04 12:47:37 bootstrap.okd.company.local systemd[5176]: grub-boot-success.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit UNIT has entered the 'failed' state with result 'exit-code'.
Feb 04 12:47:37 bootstrap.okd.company.local systemd[5176]: Failed to start Mark boot as successful.
░░ Subject: A start job for unit UNIT has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit UNIT has finished with a failure.
░░
░░ The job identifier is 13 and the job result is failed.
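The grub-boot-success failure is most likely harmless noise here: on Fedora CoreOS `/boot` is mounted read-only, so grub2-set-bootflag cannot create its tmpfile. A quick way to confirm the mount state (a sketch, assuming the standard `/proc/mounts` format) is to check the mount options:

```python
def mount_is_readonly(mountpoint: str, mounts_path: str = "/proc/mounts"):
    """Return True if `mountpoint` carries the `ro` mount option,
    False if mounted read-write, None if it is not in the mounts file."""
    with open(mounts_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == mountpoint:
                return "ro" in fields[3].split(",")
    return None
```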

@smuda

smuda commented Feb 5, 2022

@konup It would seem you have a very different problem, since you are on ESXi (vs bare metal) and the master can connect to the bootstrap node during installation (vs a network problem fetching the secondary ignition file).

You'd probably be best off creating your own issue.

@konup

konup commented Feb 6, 2022

OK — since this is a different problem, I opened new issue #1093.
