A ovs process gets killed when oom-killer is invoked #80

Merged: 1 commit into openshift:master on Feb 7, 2019

Conversation

pecameron
Contributor

@pecameron pecameron commented Jan 28, 2019

The change from 3.9 to 3.10 means OVS now runs in a pod. There must be
sufficient memory or the OOM killer will be invoked. This change adds a
liveness probe that checks that the process is running. Also, resource
limits are relaxed a little.

bug 1671822
https://bugzilla.redhat.com/show_bug.cgi?id=1671822
clone of bug 1669311
https://bugzilla.redhat.com/show_bug.cgi?id=1669311

Signed-off-by: Phil Cameron pcameron@redhat.com

@pecameron
Contributor Author

@squeed PTAL

@dcbw
Contributor

dcbw commented Jan 29, 2019

@pecameron did the updated resource limits seem OK to the OVS team? 400M seems pretty large. Are we also sure that whoever gets this error is running with the revalidator thread limits we added a while back?

@danwinship
Contributor

did the updated resource limits seem OK to the OVS team? 400M seems pretty large.

As I understand it, the limit section is there not because we want kubelet to actually act on those specific limits, but just because we want it to set its oom-score to "don't kill me" rather than "kill me please". So it's ok if limits.memory is "too high", since we don't expect/want to hit it anyway.

As for requests.memory: the amount of memory the pod uses is going to depend on the size of the cluster, and there doesn't seem to be any way to represent that... We could see if there was some install-time "cluster size" hint, or maybe CNO could sporadically update the daemonset with a current estimate (except that that would be terrible because any change to the daemonset will cause all the pods to be redeployed).
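
(Illustrative sketch of the kind of stanza under discussion; the request figure is a placeholder and the limit echoes the 400M mentioned above, neither is taken from this PR's diff:)

    # Sketch only: placeholder request, deliberately generous limit.
    resources:
      requests:
        memory: 300Mi   # placeholder; the amount kubelet reserves when scheduling
      limits:
        memory: 400M    # high ceiling; mainly affects QoS / oom-score,
                        # we don't expect or want the pod to reach it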

@pecameron
Contributor Author

@danwinship So it sounds like we should pick a number that works in "most" clusters and let the admin adjust it as needed based on guidelines provided by Red Hat. If that is the case, are the proposed numbers OK?

@danwinship
Contributor

There is no way for the admin to adjust it; the values are hardcoded into the CNO. The only current option is to specify a value that will inevitably be incorrect for most users...

@pecameron
Contributor Author

@danwinship Your observation applies to the cluster generally. Admins will have little opportunity to tune anything. Unless these numbers are very sensitive and must be carefully tuned, we have to pick a number that will generally work. 4.0 is new turf.

livenessProbe:
  exec:
    command:
    - cat
Contributor

This should just be /usr/share/openvswitch/scripts/ovs-ctl status. Then you can replace the loop in the pod's command with just sleep 10000.
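
(A minimal sketch of that suggestion; the probe timings are assumptions, not values from this PR:)

    livenessProbe:
      exec:
        command:
        - /usr/share/openvswitch/scripts/ovs-ctl
        - status
      initialDelaySeconds: 15   # assumed timing, for illustration only
      periodSeconds: 10         # assumed timing, for illustration only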

@squeed
Contributor

squeed commented Jan 31, 2019

Yeah, we're in a tricky spot. We can't get guaranteed resources unless we also set a limit, which is reasonable (we would also have to change the QoS class). We also don't really know what a safe limit is, since it depends on cluster size. We also don't want to waste resources; a too-high limit would tie up memory unnecessarily.

For the time being, I think the right thing to do is stay in BestEffort, set a reasonable request, and have a good liveness probe.

@sjenning explicitly removed the limits from OVS two weeks ago in #62 - I don't recall the exact reason why. However, I don't think we should be re-adding limits.

@danwinship
Contributor

We also don't want to waste resources; a too-high limit would tie up memory unnecessarily.

Hm... we should probably ask someone who knows this stuff better, but my understanding is that the limits section doesn't tie anything up; it just declares a size after which kube can consider the pod to be misbehaving and potentially evictable (as well as changing the QoS class). It's the requests section that ties up memory; kubelet will assume that the OVS pod is eventually going to need all of the memory listed in its requests, even if it is currently using much less.

Maybe we should just drop requests?

and have a good liveness probe.

Alternatively, if we could run ovsdb-server and ovs-vswitchd in separate containers with appropriate RestartPolicy then each would just get automatically restarted if it got killed...
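
(A hypothetical layout for that idea, not what this PR implements; the container commands are omitted because both daemons would have to run in the foreground:)

    # Hypothetical split: one container per daemon, so kubelet's normal restart
    # handling brings back whichever container the OOM killer takes out.
    spec:
      restartPolicy: Always      # DaemonSet pods use Always anyway
      containers:
      - name: ovsdb-server
        # would run ovsdb-server in the foreground
      - name: ovs-vswitchd
        # would run ovs-vswitchd in the foreground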

Um... wait a minute... the ovs-ctl docs say:

   --no-monitor
          By   default   ovs-ctl  passes  --monitor  to  ovs-vswitchd  and
          ovsdb-server, requesting that it spawn a process  monitor  which
          will  restart  the daemon if it crashes.  This option suppresses
          that behavior.

and we are not passing --no-monitor... so why isn't OVS restarting itself?

@pecameron
Contributor Author

@danwinship The OOM killer likely killed the OVS monitor as well. It keeps killing until it has freed "enough" memory, and it kills what it likes. You can give it hints, and sometimes it takes them. The OOM killer will keep the kernel alive; nothing else matters.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 31, 2019
@pecameron
Contributor Author

Aaron C sent email on this topic: "I think it's crazy to put a limit on the
memory, anyway. Someday, when OpenShift has to deploy VMs that get connected
with hugepages, the numbers will be in the dozens of gigs or more! Is it
really a requirement to have resource constraints for these kinds of core
system daemons?"

@danwinship
Contributor

oom likely killed the ovs monitor as well

That seems improbable; the monitor is tiny, so killing it does not help the OOM killer free up memory.

@dcbw
Contributor

dcbw commented Feb 4, 2019

and we are not passing --no-monitor... so why isn't OVS restarting itself?

@danwinship because the monitor code only restarts on program-type errors. SIGKILL is not in that list; I guess that's not considered a program crash (and I think I'd agree).

static bool
should_restart(int status)
{
    if (WIFSIGNALED(status)) {
        static const int error_signals[] = {
            /* This list of signals is documented in daemon.man.  If you
             * change the list, update the documentation too. */
            SIGABRT, SIGALRM, SIGBUS, SIGFPE, SIGILL, SIGPIPE, SIGSEGV,
            SIGXCPU, SIGXFSZ
        };

        size_t i;

        for (i = 0; i < ARRAY_SIZE(error_signals); i++) {
            if (error_signals[i] == WTERMSIG(status)) {
                return true;
            }
        }
    }
    return false;
}

@eparis
Member

eparis commented Feb 5, 2019

@danwinship So it sounds like we should pick a number that works in "most" clusters and let the admin adjust it as needed based on guidelines provided by Red Hat. If that is the case, are the proposed numbers OK?

Under no circumstances should we let an admin control this value. Period. If we want this value to change, it must be because the network operator is smart enough to manage it.

@dcbw
Contributor

dcbw commented Feb 5, 2019

@pecameron here's what I think we should do: we set some pretty high requests, but no limit. That means we are Burstable, but OVS will be killed later than other burstable things. We have to balance the request amount, though, because higher requests have scheduling implications.

There's no way to set ourselves Guaranteed (i.e. killed after everything else) unless we set requests == limits, and since our memory usage is likely quite variable based on cluster size, we can't find a good value that fits all clusters.

So maybe bump it to a 500M request and call it a day?
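
(Roughly the shape being proposed; 500M is the figure floated above, not a final value:)

    # Requests only, no limits: the pod stays Burstable, but a larger memory
    # request lowers its oom_score_adj relative to pods that request less,
    # so it is killed later under memory pressure.
    resources:
      requests:
        memory: 500M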

@pecameron
Contributor Author

@dcbw This is the 4.0 fix for https://bugzilla.redhat.com/show_bug.cgi?id=1669311 which is on 3.10. We need to reach consensus there as well.

@openshift-merge-robot
Contributor

/retest

@squeed
Contributor

squeed commented Feb 6, 2019

At this point, the additional "ovs-ctl" check on line 77 is redundant. Can you just replace it with a sleep 10000? And check that it all works when you kill openvswitch?

@openshift-merge-robot
Contributor

/retest

The change from 3.9 to 3.10 means OVS now runs in a pod. There must be
sufficient memory or the OOM killer will be invoked. This change adds a
liveness probe that checks that the process is running. Also, resource
limits are removed.

bug 1671822
https://bugzilla.redhat.com/show_bug.cgi?id=1671822
clone of bug 1669311
https://bugzilla.redhat.com/show_bug.cgi?id=1669311

Signed-off-by: Phil Cameron <pcameron@redhat.com>
@pecameron
Contributor Author

@squeed made the change, PTAL
Will test when I get the test setup working

@squeed
Contributor

squeed commented Feb 6, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pecameron, squeed

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 6, 2019
@sjenning
Contributor

sjenning commented Feb 6, 2019

related kubernetes/kubernetes#73758

This would allow pods with system-critical priority to get a low oom_score_adj without having to be in the Guaranteed QoS tier, which requires setting a memory limit.

Looking to backport this.
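
(Sketch of the direction that change points at; whether the OVS pod would use this priority class, and the container name, are assumptions:)

    # With kubernetes/kubernetes#73758, a pod with system-critical priority
    # could get a low oom_score_adj without declaring a memory limit, i.e.
    # without being in the Guaranteed QoS tier.
    spec:
      priorityClassName: system-node-critical
      containers:
      - name: openvswitch         # container name assumed for illustration
        resources:
          requests:
            memory: 300Mi         # request only, no limit set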

@squeed
Contributor

squeed commented Feb 7, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit 0890964 into openshift:master Feb 7, 2019