sdn: better warning when non-restarted 3.3 node process calls updated 3.4 script #11990

dcbw wants to merge 1 commit into openshift:master
Conversation
Thanks, we should really be restarting the service and we can probably fix that. Cc: @dgoodwin
I think there's an issue with restarting the node as soon as the rpm is updated; at least on masters the rpms get updated during control plane upgrade, which pulls in node/ovs packages as dependencies. At this point it's not safe to restart the node service, as it hasn't been unscheduled safely; that part comes a little later during the node+docker upgrade.
pkg/sdn/plugin/bin/openshift-sdn-ovs (Outdated)
```sh
case "$action" in
    setup)
        # With openshift 3.3 $2 was the netns path; check for that
        if [[ "${veth_host}" =~ "/proc" ]]; then
```
should we put this at the top level rather than making it be setup-specific?
…usters

When the openshift-sdn RPM gets updated, that drops a new /usr/bin/openshift-sdn-ovs script on disk. But if the openshift-node process hasn't been restarted yet for whatever reason, openshift-sdn-ovs will print unhelpful error messages due to argument mismatches between the 3.3 and 3.4 versions of that script.

Bring back the 3.3 script and redirect the request to the 3.3 script when we detect one. And yell at people to restart their node process.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1396919
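The shim the commit message describes could look roughly like the sketch below. This is hypothetical, not the actual patch: the legacy-script path, the function names, and the warning text are all illustrative. The one grounded detail is the detection trick discussed in the review: in 3.3 the argument after the action was a netns path under `/proc`, whereas in 3.4 it is a host veth name.

```shell
#!/bin/bash
# Hypothetical sketch of the 3.3-compat dispatch, not the real
# openshift-sdn-ovs patch. LEGACY_SCRIPT is an assumed install location.
LEGACY_SCRIPT=/usr/bin/openshift-sdn-ovs-3.3

looks_like_33_call() {
    # In 3.3 the second argument was a netns path such as
    # /proc/<pid>/ns/net; in 3.4 it is a veth interface name.
    [[ "$1" == /proc/* ]]
}

dispatch() {
    local second=$2
    if looks_like_33_call "$second"; then
        # Yell at the admin, then hand off to the bundled 3.3 script.
        echo "WARNING: the openshift-node process predates the" \
             "openshift-sdn upgrade; restart the node service." >&2
        echo "legacy"    # stand-in for: exec "$LEGACY_SCRIPT" "$@"
    else
        echo "current"
    fi
}
```

A 3.3 caller would then be routed to the old script until the node service is restarted, after which the 3.4-style arguments take the `current` path.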
Force-pushed from afbb99e to 8b5cb03
@danwinship @sdodson @dgoodwin how about something more like this, then? Except we apply it only to the release-1.4 branch, not to git master. Do we really support running a 3.3 node process against a 3.4 master process anyway? How are cluster upgrades supposed to work?
My understanding is that we do need to have an older node work against a newer master. In the context of an upgrade that should be a short-lived scenario, just for the period of time between control plane upgrade and node upgrade.
Yeah, we definitely need to support newer master and old nodes.
@dcbw that looks good to me, 👍 to only doing this on release-1.4
@danwinship @knobunc what do you guys think about this approach?
"ugh". We shouldn't be deploying new pods during an upgrade, should we? If so, we'd really only need to deal with the teardown case, which should be simpler...
I reordered the router/registry upgrade (which uses a deployer pod) to follow the node upgrade, so everything should be up and running by that point now, and we won't be running new pods during upgrade as far as I'm aware. Not to say the cluster won't be trying to, though.
Do we mark the node as unschedulable while upgrading it? (Starting before we install the new RPM and continuing until after we restart it.) If not, should we?
Indeed, we do set the node unschedulable before touching docker or our services, and only make it schedulable again when everything is completed. For masters, their node rpms can get updated before evacuation because the node rpm is pulled in as a dependency when updating the master rpm; however, we don't bounce the node services until later, when it should be safe.
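The ordering described here can be sketched with the 3.x admin CLI. This is a minimal sketch under assumptions: the node name is a placeholder, and the `run()` wrapper only prints the commands so the snippet is safe to execute outside a cluster.

```shell
#!/bin/bash
# Sketch of the cordon / upgrade / uncordon ordering described above.
# run() only echoes each command; drop the wrapper to execute for real.
run() { echo "+ $*"; }

NODE=node1.example.com   # placeholder node name

# 1. Mark the node unschedulable before touching docker or the services.
run oadm manage-node "$NODE" --schedulable=false

# 2. Evacuate pods, update RPMs, restart docker and the node service here.

# 3. Only once everything is complete, make it schedulable again.
run oadm manage-node "$NODE" --schedulable=true
```

The gap being discussed is exactly step 2 for node-on-master hosts: the RPMs land during the master upgrade, but the service restart happens later.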
@dgoodwin @danwinship hmm, if we already mark the node as unschedulable, how did we get into the situation in the RH bugzilla bug? Was it because it was a node-on-master and thus the service wasn't bounced? Couldn't we just mark the node-on-master as unschedulable when upgrading the master RPMs, since the SDN scripts get updated at the same time? Then when the node-on-master service gets updated and restarted, mark it schedulable again.
Race conditions, maybe? Blah. I don't really like shipping the 3.3 script in 3.4, but if that's the only way to make upgrades work without undesirable downtime then I guess we have to.
@dcbw @danwinship the bug was reported against a "master node", yes. I don't think we can commit to reworking upgrade to push a node restart into master upgrade. This has been a delicate balance to get everything covered without duplicating evacuations, docker restarts, and service restarts. Separate control plane + node upgrade phases is just becoming a feature in 1.4, where you can run each separately, and on sub-sets of your nodes by label.

If it were possible to have the sdn keep working until the services are restarted, that would be ideal from my perspective. It would mean carrying the previous version of the script when there has been a breaking change like this. (Only one previous version is required, as we don't upgrade beyond more than one release.) I assume there's no way for the new script to just handle the old args intelligently?

If absolutely necessary, I could go back and rework upgrade again to try to get master upgrade to include a node restart. I can't pinpoint why, but I'm fairly certain there was a blocker with this, though it's been a while. In any case it would be too destabilizing, I think, to consider now, and would have to be 1.5+.
Well, it could, but that would be a larger change with more chance of breaking the non-upgrade codepaths. Or else it would end up just having two copies of every code path and be incredibly ugly and confusing. |
@danwinship @knobunc should we just merge this one, or do you think there's a better approach?
Ideally I think we'd just return an error if you tried to deploy a pod while the node was in the middle of an upgrade, but kubernetes doesn't really recover very well from the network plugin returning errors (as seen in the original report), so that's not a useful solution. I hate this patch, but I have no better suggestion.
So I think this should be closed and reopened against OSE 3.4?
So we never dealt with this, right? And now the same problem exists for 3.5, except that instead of returning errors, it will just install OVS rules that don't have any effect (due to the table renumbering from 3.4 to 3.5).
Do we intend to do anything with this? Or are we hand-waving it as an installer problem?
@knobunc at this point, perhaps we just close and ignore. I haven't seen anyone yell about this in a long time, though I may just not be looking for complaints.

@danwinship for 3.3 -> 3.5, wouldn't this still work, more or less? We update this PR to add the 3.4 script back, and then each version calls its respective script and renumbering won't be an issue. When they restart to 3.5+, it just won't call a script anymore at all.
If we had committed it, maybe, but we didn't. |
3.5 still used a script (which is not compatible with 3.4 because of table numbering). 3.6 ships the 3.5 script but doesn't use it. 3.7 will drop the script entirely.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dcbw
No associated issue. Update the pull-request body to add a reference to an issue. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
@openshift/networking @sdodson