A ovs process gets killed when oom-killer is invoked #80

Merged: 1 commit into openshift:master on Feb 7, 2019

Conversation

pecameron
Contributor

@pecameron pecameron commented Jan 28, 2019

The change from 3.9 to 3.10 means OVS now runs in a pod. There must be
sufficient memory or the OOM killer will be invoked. This change adds a
liveness probe that checks that the process is running. Also, resource
limits are relaxed a little.

bug 1671822
https://bugzilla.redhat.com/show_bug.cgi?id=1671822
clone of bug 1669311
https://bugzilla.redhat.com/show_bug.cgi?id=1669311

Signed-off-by: Phil Cameron pcameron@redhat.com

@pecameron
Contributor Author

@squeed PTAL

@dcbw
Contributor

dcbw commented Jan 29, 2019

@pecameron did the updated resource limits seem OK to the OVS team? 400M seems pretty large. Are we also sure that whoever gets this error is running with the revalidator thread limits we added a while back?

@danwinship
Contributor

did the updated resource limits seem OK to the OVS team? 400M seems pretty large.

As I understand it, the limit section is there not because we want kubelet to actually act on those specific limits, but just because we want it to set its oom-score to "don't kill me" rather than "kill me please". So it's ok if limits.memory is "too high", since we don't expect/want to hit it anyway.

As for requests.memory: the amount of memory the pod uses is going to depend on the size of the cluster, and there doesn't seem to be any way to represent that... We could see if there was some install-time "cluster size" hint, or maybe CNO could sporadically update the daemonset with a current estimate (except that that would be terrible because any change to the daemonset will cause all the pods to be redeployed).
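
(Illustrative sketch of the kind of stanza under discussion; the request figure is a placeholder and the limit echoes the 400M mentioned above, neither is taken from this PR's diff:)

    # Sketch only: placeholder request, deliberately generous limit.
    resources:
      requests:
        memory: 300Mi   # placeholder; the amount kubelet reserves when scheduling
      limits:
        memory: 400M    # high ceiling; mainly affects QoS / oom-score,
                        # we don't expect or want the pod to reach it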

@pecameron
Contributor Author

@danwinship So it sounds like we should pick a number that works in "most" clusters and let the admin adjust it as needed based on guidelines provided by Red Hat. If that is the case, are the proposed numbers OK?

@danwinship
Contributor

There is no way for the admin to adjust it; the values are hardcoded into the CNO. The only current option is to specify a value that will inevitably be incorrect for most users...

@pecameron
Contributor Author

@danwinship Your observation applies to the cluster generally. Admins will have little opportunity to tune anything. Unless these numbers are very sensitive and must be carefully tuned, we have to pick a number that will generally work. 4.0 is new turf.

livenessProbe:
  exec:
    command:
    - cat
Contributor

This should just be /usr/share/openvswitch/scripts/ovs-ctl status. Then you can replace the loop in the pod's command with just sleep 10000.
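
(A minimal sketch of that suggestion; the probe timings are assumptions, not values from this PR:)

    livenessProbe:
      exec:
        command:
        - /usr/share/openvswitch/scripts/ovs-ctl
        - status
      initialDelaySeconds: 15   # assumed timing, for illustration only
      periodSeconds: 10         # assumed timing, for illustration only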

@squeed
Contributor

squeed commented Jan 31, 2019

Yeah, we're in a tricky spot. We can't get guaranteed resources unless we also set a limit, which is reasonable (we would also have to change the QoS class). We also don't really know what a safe limit is, since it depends on cluster size. We also don't want to waste resources; a too-high limit would tie up memory unnecessarily.

For the time being, I think the right thing to do is stay in BestEffort, set a reasonable request, and have a good liveness probe.

@sjenning explicitly removed the limits from OVS two weeks ago in #62 - I don't recall the exact reason why. However, I don't think we should be re-adding limits.

@danwinship
Contributor

We also don't want to waste resources; a too-high limit would tie up memory unnecessarily.

Hm... we should probably ask someone who knows this stuff better, but my understanding is that the limits section doesn't tie anything up; it just declares a size after which kube can consider the pod to be misbehaving and potentially evictable (as well as changing the QoS class). It's the requests section that ties up memory; kubelet will assume that the OVS pod is eventually going to need all of the memory listed in its requests, even if it is currently using much less.

Maybe we should just drop requests?

and have a good liveness probe.

Alternatively, if we could run ovsdb-server and ovs-vswitchd in separate containers with appropriate RestartPolicy then each would just get automatically restarted if it got killed...
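
(A hypothetical layout for that idea, not what this PR implements; the container commands are omitted because both daemons would have to run in the foreground:)

    # Hypothetical split: one container per daemon, so kubelet's normal restart
    # handling brings back whichever container the OOM killer takes out.
    spec:
      restartPolicy: Always      # DaemonSet pods use Always anyway
      containers:
      - name: ovsdb-server
        # would run ovsdb-server in the foreground
      - name: ovs-vswitchd
        # would run ovs-vswitchd in the foreground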

Um... wait a minute... the ovs-ctl docs say:

   --no-monitor
          By   default   ovs-ctl  passes  --monitor  to  ovs-vswitchd  and
          ovsdb-server, requesting that it spawn a process  monitor  which
          will  restart  the daemon if it crashes.  This option suppresses
          that behavior.

and we are not passing --no-monitor... so why isn't OVS restarting itself?

@pecameron
Contributor Author

@danwinship The OOM killer likely killed the OVS monitor as well. It keeps killing until it has freed "enough" memory, and it kills what it likes. You can give it hints, and sometimes it takes them. The OOM killer will keep the kernel alive; nothing else matters.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 31, 2019
@pecameron
Contributor Author

Aaron C sent email on this topic: "I think it's crazy to put a limit on the
memory, anyway. Someday, when OpenShift has to deploy VMs that get connected
with hugepages, the numbers will be in the dozens of gigs or more! Is it
really a requirement to have resource constraints for these kinds of core
system daemons?"

@danwinship
Contributor

oom likely killed the ovs monitor as well

That seems improbable; the monitor is tiny, so killing it does not help the OOM killer free up memory.

@dcbw
Contributor

dcbw commented Feb 4, 2019

and we are not passing --no-monitor... so why isn't OVS restarting itself?

@danwinship because the monitor code only restarts on program-type errors. SIGKILL is not in that list; I guess that's not considered a program crash (and I think I'd agree).

static bool
should_restart(int status)
{
    if (WIFSIGNALED(status)) {
        static const int error_signals[] = {
            /* This list of signals is documented in daemon.man.  If you
             * change the list, update the documentation too. */
            SIGABRT, SIGALRM, SIGBUS, SIGFPE, SIGILL, SIGPIPE, SIGSEGV,
            SIGXCPU, SIGXFSZ
        };

        size_t i;

        for (i = 0; i < ARRAY_SIZE(error_signals); i++) {
            if (error_signals[i] == WTERMSIG(status)) {
                return true;
            }
        }
    }
    return false;
}

@eparis
Member

eparis commented Feb 5, 2019

@danwinship So it sounds like we should pick a number that works in "most" clusters and let the admin adjust it as needed based on guidelines provided by Red Hat. If that is the case, are the proposed numbers OK?

Under no circumstances should we let an admin control this value. Period. If we want this value to change, it must be because the network operator is smart enough to manage it.

@dcbw
Contributor

dcbw commented Feb 5, 2019

@pecameron here's what I think we should do: we set some pretty high requests, but no limit. That means we are Burstable, but OVS will be killed later than other burstable things. We have to balance the request amount, though, because higher requests have scheduling implications.

There's no way to set ourselves Guaranteed (i.e. killed after everything else) unless we set requests == limits, and since our memory usage is likely quite variable based on cluster size, we can't find a good value that fits all clusters.

So maybe bump it to a 500M request and call it a day?
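
(Roughly the shape being proposed; 500M is the figure floated above, not a final value:)

    # Requests only, no limits: the pod stays Burstable, but a larger memory
    # request lowers its oom_score_adj relative to pods that request less,
    # so it is killed later under memory pressure.
    resources:
      requests:
        memory: 500M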

@pecameron
Contributor Author

@dcbw This is the 4.0 fix for https://bugzilla.redhat.com/show_bug.cgi?id=1669311 which is on 3.10. We need to reach consensus there as well.

@openshift-merge-robot
Contributor

/retest

@squeed
Contributor

squeed commented Feb 6, 2019

At this point, the additional "ovs-ctl" check on line 77 is redundant. Can you just replace it with a sleep 10000? And check that it all works when you kill openvswitch?

@openshift-merge-robot
Contributor

/retest

The change from 3.9 to 3.10 means OVS now runs in a pod. There must be
sufficient memory or the OOM killer will be invoked. This change adds a
liveness probe that checks that the process is running. Also, resource
limits are removed.

bug 1671822
https://bugzilla.redhat.com/show_bug.cgi?id=1671822
clone of bug 1669311
https://bugzilla.redhat.com/show_bug.cgi?id=1669311

Signed-off-by: Phil Cameron <pcameron@redhat.com>
@pecameron
Contributor Author

@squeed made the change, PTAL
Will test when I get the test setup working

@squeed
Contributor

squeed commented Feb 6, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pecameron, squeed

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 6, 2019
@sjenning
Contributor

sjenning commented Feb 6, 2019

related kubernetes/kubernetes#73758

This would allow pods with system-critical priority to get a low oom_score_adj without having to be in the Guaranteed QoS tier, which requires setting a memory limit.

Looking to backport this.
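
(Sketch of the direction that change points at; whether the OVS pod would use this priority class, and the container name, are assumptions:)

    # With kubernetes/kubernetes#73758, a pod with system-critical priority
    # could get a low oom_score_adj without declaring a memory limit, i.e.
    # without being in the Guaranteed QoS tier.
    spec:
      priorityClassName: system-node-critical
      containers:
      - name: openvswitch         # container name assumed for illustration
        resources:
          requests:
            memory: 300Mi         # request only, no limit set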

@squeed
Contributor

squeed commented Feb 7, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit 0890964 into openshift:master Feb 7, 2019