
Conversation

@dougbtv (Contributor) commented Sep 5, 2019

This is a WIP to address the release-blocker Bugzilla bug at https://bugzilla.redhat.com/show_bug.cgi?id=1749448

NOTE: Originally submitted as a test. I am actively looking for feedback on where this directory-creation code should live. Originally, I simply found a convenient place to sketch it in (otherwise, I am somewhat unfamiliar with the MCO).

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 5, 2019
@dougbtv (Contributor, Author) commented Sep 5, 2019

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2019
@kikisdeliveryservice (Contributor)

This would probably be better suited as a template...

Take a look in https://github.com/openshift/machine-config-operator/tree/master/templates/common/_base/files and see what you think.

@dougbtv (Contributor, Author) commented Sep 5, 2019

@kikisdeliveryservice -- that's a monstrous help, thank you! I think this is just what I'm looking for. Would I create a .dummy file to specify that I want to make a dir? I'm copying the example from https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/volume-plugins.yaml

filesystem: "root"
mode: 0755
path: "/var/run/multus/cni/net.d/.dummy"
contents:
  inline: |

And just create it in a file like ./templates/common/_base/files/alternate-cni-configuration-dir.yaml?
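For reference, a complete template along those lines might look like the sketch below. The filename and the empty contents are assumptions drawn from this thread (modeled on the linked volume-plugins.yaml), not the final merged change:

```yaml
# Hypothetical ./templates/common/_base/files/alternate-cni-configuration-dir.yaml
# An empty .dummy file makes Ignition create the parent directories.
filesystem: "root"
mode: 0644
path: "/var/run/multus/cni/net.d/.dummy"
contents:
  inline: ""
```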

@dougbtv dougbtv force-pushed the mco-add-cni-config-dir branch from 15f58d3 to a1f1ec7 on September 5, 2019 20:07
@dougbtv dougbtv changed the title [WIP] Creates alternative configuration directory for the cluster network o… [WIP] Creates alternative CNI configuration directory for the cluster network operator Sep 5, 2019
@kikisdeliveryservice (Contributor)

/retest

@wking (Member) commented Sep 5, 2019

e2e-aws died waiting for bootstrap completion. From the gathered tarball's ./bootstrap/journals/bootkube.log:

Sep 05 20:28:36 ip-10-0-3-90 bootkube.sh[1621]: Created "cvo-overrides.yaml" clusterversions.v1.config.openshift.io/version -n openshift-cluster-version
Sep 05 20:48:02 ip-10-0-3-90 bootkube.sh[1621]: Error: error while checking pod status: timed out waiting for the condition
Sep 05 20:48:02 ip-10-0-3-90 bootkube.sh[1621]: Tearing down temporary bootstrap control plane...
Sep 05 20:48:02 ip-10-0-3-90 bootkube.sh[1621]: Error: error while checking pod status: timed out waiting for the condition

Would be nice if they mentioned the condition that had timed out ;). There are a number of error lines in ./bootstrap/containers/kube-apiserver-9735b1d92756368339de6e60586dfdc157b6a3ded46bad6149916166c04ff65f.log, but I'm not clear enough on what that log usually looks like for any of them to stand out as obvious issues. The gathered ClusterVersion had a bunch of server does not recognize this resource errors.

So I dunno what happened. We'll see how the retest goes.

@kikisdeliveryservice (Contributor)

@wking I've been looking through the logs and haven't figured it out; thought maybe retesting would help.

@wking (Member) commented Sep 5, 2019

Oh, thinking this PR over a bit more gave me an idea to grep for in the gathered tarball ;)

$ grep -r multus/cni
...
control-plane/10.0.129.21/containers/kube-multus-e7ca15b25b701bdab39b49dc5534db009ff6f72c2905573827d542966adfc33c.log:2019-09-05T20:39:29+0000 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
...
$ cat control-plane/10.0.129.21/containers/kube-multus-e7ca15b25b701bdab39b49dc5534db009ff6f72c2905573827d542966adfc33c.log
2019-09-05T20:39:29+0000 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2019-09-05T20:39:29+0000 Attemping to find master plugin configuration, attempt 0
2019-09-05T20:39:34+0000 Attemping to find master plugin configuration, attempt 5
...
2019-09-05T20:49:26+0000 Attemping to find master plugin configuration, attempt 595
2019-09-05T20:49:31+0000 ERR:  {Multus could not be configured: no master plugin was found.}

I don't understand multus enough to know what that means. But maybe we need to drop in more than the empty directory holder?

@kikisdeliveryservice (Contributor) commented Sep 5, 2019

I saw that multus is somehow connected to the SDN:

                        "lastTransitionTime": "2019-09-05T20:29:19Z",
                        "message": "DaemonSet \"openshift-multus/multus-admission-controller\" is not available (awaiting 3 nodes)\nDaemonSet \"openshift-sdn/sdn\" is not available (awaiting 3 nodes)",
                        "reason": "Deploying",
                        "status": "True",
                        "type": "Progressing"
                    },

@dougbtv (Contributor, Author) commented Sep 5, 2019

Thanks for triggering the restart. The Multus could not be configured: no master plugin was found error means that it couldn't find the source CNI config generated by openshift-sdn -- which seems counterintuitive, as this has generally worked. It's worth looking at the logs for openshift-sdn and the cluster network operator. I am unfortunately mobile for a couple of hours, but I will take a deeper look when I get back to a workstation.

@wking (Member) commented Sep 5, 2019

cni-plugins containers look happy, although we don't have logs for them:

$ wc -l control-plane/*/containers/cni-plugins-*.log 
0 control-plane/10.0.129.21/containers/cni-plugins-2e8ba60c18074801d4f2b9847d5a74170faab910085d6e02761cdaa4289d652b.log
0 control-plane/10.0.134.250/containers/cni-plugins-fab3a0e40eefa7943a0cab0129b33bb1874ed782446b4379c53c8f7f14c7f33f.log
0 control-plane/10.0.159.6/containers/cni-plugins-a08e3f7cae4781717866a444e8da3951241eb73e17ce33cc812f004cd66f478d.log
0 total
$ jq -r '.status | .finishedAt + "\t" + .state + "\t" + (.exitCode | tostring)' control-plane/*/containers/cni-plugins-*.inspect
2019-09-05T20:29:24.00616605Z	CONTAINER_EXITED	0
2019-09-05T20:29:23.816250778Z	CONTAINER_EXITED	0
2019-09-05T20:29:24.107361774Z	CONTAINER_EXITED	0

Also install-cni-plugins containers:

$ head control-plane/*/containers/install-cni-plugins-*.log 
==> control-plane/10.0.129.21/containers/install-cni-plugins-9709d95c620224f019a1d7edaf07b73272bbd7d2226fc8e181427d67e967d8d7.log <==
+ cp -f /usr/src/plugins/bin/loopback /usr/src/plugins/bin/host-local /host/opt/cni/bin

==> control-plane/10.0.134.250/containers/install-cni-plugins-811d847504753eb2c69605fbdee748b74399a2d062a53bc54d189091ed843e22.log <==
+ cp -f /usr/src/plugins/bin/loopback /usr/src/plugins/bin/host-local /host/opt/cni/bin

==> control-plane/10.0.159.6/containers/install-cni-plugins-1f8b64a80db4e68a55924cff7c17960eb35feaf704d145b4b52dfc0d786b8b4f.log <==
+ cp -f /usr/src/plugins/bin/loopback /usr/src/plugins/bin/host-local /host/opt/cni/bin
$ jq -r '.status | .finishedAt + "\t" + .state + "\t" + (.exitCode | tostring)' control-plane/*/containers/install-cni-plugins-*.inspect
2019-09-05T20:29:31.662820192Z	CONTAINER_EXITED	0
2019-09-05T20:29:31.718027272Z	CONTAINER_EXITED	0
2019-09-05T20:29:31.726282675Z	CONTAINER_EXITED	0

SDN containers seem to be dying with:

$ tail -n2 control-plane/10.0.129.21/containers/sdn-b43bd292f53e0c76cea4daf0d6b79d7f6d3ae6207d4122fa4df9550a9c817a75.log 
I0905 20:55:27.453413    4389 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0905 20:55:28.453428    4389 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory

@kikisdeliveryservice (Contributor)

I think the bug should be https://bugzilla.redhat.com/show_bug.cgi?id=1749446 -- the one linked above has been marked as a dupe of this one.

@wking (Member) commented Sep 5, 2019

/retitle [WIP] Bug 1749446: Creates alternative CNI configuration directory for the cluster network operator

@openshift-ci-robot openshift-ci-robot changed the title [WIP] Creates alternative CNI configuration directory for the cluster network operator [WIP] Bug 1749446: Creates alternative CNI configuration directory for the cluster network operator Sep 5, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Sep 5, 2019
@openshift-ci-robot (Contributor)

@dougbtv: This pull request references Bugzilla bug 1749446, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

[WIP] Bug 1749446: Creates alternative CNI configuration directory for the cluster network operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dougbtv (Contributor, Author) commented Sep 6, 2019

Still not back to a desk, but I'm wondering if /var/run (as a parent of the directory created by this patch) is getting created early, or with different permissions. Otherwise, I haven't seen that error out of openshift-sdn before.

@dougbtv dougbtv force-pushed the mco-add-cni-config-dir branch from a1f1ec7 to e6b29a8 on September 6, 2019 01:35
@dougbtv (Contributor, Author) commented Sep 6, 2019

New commit creates /run/multus/cni/net.d/.dummy as opposed to /var/run/multus/cni/net.d/.dummy as /var/run can be a link to /run.

@dougbtv (Contributor, Author) commented Sep 6, 2019

/retest

Didn't quite look the same; the API didn't come up. I'm not 100% positive, but this looks like a potential flake, so I'm giving it another run.

@dougbtv dougbtv force-pushed the mco-add-cni-config-dir branch from e6b29a8 to c0103ab on September 6, 2019 04:11
@dougbtv (Contributor, Author) commented Sep 6, 2019

I got @danwinship to take a look, and he's noted that:

It looks like the ovs pod mounts /run/openvswitch while the sdn pod mounts /var/run/openvswitch. And those should be the same, but apparently aren't now. So it seems like maybe your change is causing /var/run and /run to become separate directories

I'm starting to wonder whether my earlier thought -- about /var/run being a symlink to /run, and how the directory gets created there -- has legs. Thinking it through now.

Edit: trying the /run/multus/cni/net.d/.dummy path another time and pushing another commit. I had two CI failures last night whose causes I couldn't exactly diagnose, but in light of this, I'll try it again.
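The divergence @danwinship describes can be sketched in a few lines of shell, using a scratch directory as a stand-in for the real filesystem (paths here are illustrative only): /var/run is normally a symlink to /run, so creating a directory through either name lands in the same tree -- until something replaces the symlink with a real directory.

```shell
set -e
root=$(mktemp -d)
mkdir "$root/run"
ln -s run "$root/var-run"              # stand-in for /var/run -> /run
mkdir -p "$root/var-run/openvswitch"   # create via the /var/run name...
test -d "$root/run/openvswitch"        # ...and it appears under /run too

# If provisioning replaces the symlink with a real directory,
# the two trees silently diverge:
rm "$root/var-run"
mkdir -p "$root/var-run/multus/cni/net.d"
test ! -e "$root/run/multus"           # /run no longer sees the new path
```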

@dougbtv dougbtv force-pushed the mco-add-cni-config-dir branch from c0103ab to 5c69661 on September 6, 2019 14:54
@cgwalters (Member)

We can't create files in /run via Ignition -- or rather, we can, but they will vanish on reboot.

You need to install a systemd tmpfiles snippet instead.

@cgwalters (Member)

It's a lot like having an RPM install files in /run, which fails for the same reason.

@cgwalters (Member)

Another alternative is to have an init container pre-create the directories.
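For illustration, the init-container alternative might look roughly like this pod-spec fragment (container name, image, and mount paths are hypothetical; this PR did not take this approach):

```yaml
# Hypothetical sketch: pre-create the CNI config dir before the main
# container starts; not taken from the merged change.
initContainers:
- name: make-cni-conf-dir
  image: busybox
  command: ["mkdir", "-p", "/host/run/multus/cni/net.d"]
  volumeMounts:
  - name: host-run
    mountPath: /host/run
volumes:
- name: host-run
  hostPath:
    path: /run
```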

@dougbtv dougbtv force-pushed the mco-add-cni-config-dir branch from 5c69661 to e189be7 on September 6, 2019 18:21
@dougbtv (Contributor, Author) commented Sep 6, 2019

Thanks @cgwalters!! Huge help. I've pushed a new commit where I added another instruction to ./templates/common/_base/files/cleanup-cni-conf.yaml (as it already covers relevant CNI configuration) to take the systemd tmpfiles.d approach.

Added:

d /var/run/multus/cni/net.d/ 0755 root root - -
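Shipped through an MCO file template, a tmpfiles.d entry along those lines might look roughly like the sketch below (the filename under /etc/tmpfiles.d/ is illustrative, not taken from the merged change). Because systemd-tmpfiles recreates the directory on every boot, this survives reboots where an Ignition-created /run path would not:

```yaml
# Sketch: a file template dropping a tmpfiles.d snippet into place.
# The tmpfiles.d filename here is illustrative.
filesystem: "root"
mode: 0644
path: "/etc/tmpfiles.d/multus-cni-conf.conf"
contents:
  inline: |
    d /run/multus/cni/net.d/ 0755 root root - -
```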

@cgwalters (Member) left a review comment:

Mind changing your commit message to something like:

templates: Create /run/multus/cni/net.d/

The multus containers need this to exist.

https://bugzilla.redhat.com/show_bug.cgi?id=1749446

See https://chris.beams.io/posts/git-commit/ (and git log in this repository)

@cgwalters (Member)

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 6, 2019
@dougbtv dougbtv force-pushed the mco-add-cni-config-dir branch from e189be7 to 3e11bb7 on September 6, 2019 19:25
@dougbtv (Contributor, Author) commented Sep 6, 2019

Thanks @cgwalters for the comments; I've pushed an updated commit referencing /run (as opposed to /var/run) and an updated commit message.

@cgwalters (Member)

/lgtm

But you have both WIP and a /hold here; I'll leave undoing those to you if you want it to merge.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 6, 2019
@kikisdeliveryservice (Contributor) left a review comment:

Glad this got sorted out. Thanks @cgwalters!!

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, dougbtv, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom runcom changed the title [WIP] Bug 1749446: Creates alternative CNI configuration directory for the cluster network operator Bug 1749446: Creates alternative CNI configuration directory for the cluster network operator Sep 8, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 8, 2019
@runcom (Member) commented Sep 8, 2019

@dougbtv do we feel this is good to go now? I'll lift the hold if that's the case

@dougbtv (Contributor, Author) commented Sep 8, 2019

@runcom looking good to me! Thanks!

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2019
@openshift-merge-robot openshift-merge-robot merged commit a18c203 into openshift:master Sep 9, 2019
@openshift-ci-robot (Contributor)

@dougbtv: All pull requests linked via external trackers have merged. Bugzilla bug 1749446 has been moved to the MODIFIED state.

In response to this:

Bug 1749446: Creates alternative CNI configuration directory for the cluster network operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • lgtm: Indicates that a PR is ready to be merged.
  • size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.


7 participants