Determine if we should support cpuset-cpus and cpuset-mem #10570

Closed
derekwaynecarr opened this issue Jun 30, 2015 · 27 comments
Labels
area/isolation · lifecycle/stale · priority/awaiting-more-evidence · sig/node

Comments

@derekwaynecarr
Member

Docker now supports the cpuset-cpus and cpuset-mems arguments when running a container, to control which CPUs (and memory nodes) a container is allowed to use. We have been asked by some users if we plan to support this feature in Kubernetes.

https://docs.docker.com/reference/run/#runtime-constraints-on-cpu-and-memory
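
For context, a minimal sketch (Go, and not what Docker itself actually does internally) of what these flags amount to on the node: they end up as writes to the container's cpuset cgroup files. The cgroup v1 mount point and the container directory name below are assumptions for illustration only.

```go
// Minimal sketch: --cpuset-cpus="0-3" and --cpuset-mems="0" ultimately mean the
// container's cpuset cgroup gets these values written into it. Paths assume a
// cgroup v1 hierarchy mounted at /sys/fs/cgroup; "docker-abc123" is hypothetical.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func pinCgroup(cgroupDir, cpus, mems string) error {
	// cpuset.cpus restricts which logical CPUs tasks in this cgroup may run on.
	if err := os.WriteFile(filepath.Join(cgroupDir, "cpuset.cpus"), []byte(cpus), 0644); err != nil {
		return err
	}
	// cpuset.mems restricts which NUMA memory nodes those tasks may allocate from.
	return os.WriteFile(filepath.Join(cgroupDir, "cpuset.mems"), []byte(mems), 0644)
}

func main() {
	if err := pinCgroup("/sys/fs/cgroup/cpuset/docker/docker-abc123", "0-3", "0"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```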

Opening an issue to gather feedback on whether it is desired or not.

@erictune @dchen1107

@erictune
Member

erictune commented Jul 1, 2015

Short answer

Not soon.

Long answer

Supporting both cpusets and cfs well is very complex.

Having to support both options will increase the development time of some features which we definitely want, such as:

  • QoS Tiers and dealing with CPU overcommitment
  • Auto-scaling

The complexity will become further evident if we later add features like:

  • normalizing CPU for different platforms
  • optimizing access to i/o devices
  • dealing with processor cache interference.
  • dealing with NUMA

If we expose the docker flags to users via the Pod spec today, it will be much harder to build these things.

However, we should begin collecting the use cases where users think cpusets will help them. Then we can have a broader cost/benefit discussion about supporting cpusets. I think I know what many of these use cases are, but it would still be good to hear users state them.

@derekwaynecarr
Member Author

@erictune - I am going to direct the original questioner to this issue for feedback on their needs, but my initial response, prior to opening the issue, was in line with yours.

@zmerlynn zmerlynn added this to the v1.0-post milestone Jul 2, 2015
@zmerlynn zmerlynn added sig/node Categorizes an issue or PR as relevant to SIG Node. team/master priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Jul 2, 2015
@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@resouer
Contributor

resouer commented Nov 4, 2015

@erictune Is it possible for you to explain the main difficulty if we support cpuset?

My use case is: since k8s will support a job controller soon, we want to pin long-running containers to specific CPUs and let other jobs (batch workloads) compete for the remaining cores. Does that make sense?

@jeremyeder

Hi Eric,

Judging by your initial comments, I'm sure we're on the same page in understanding the technical and performance benefits of this feature.

Indeed, the use cases for exposing these knobs are precisely as you've stated: optimizing access to I/O devices, dealing with processor cache interference, and dealing with NUMA.

If a user has an existing performance-sensitive application that lends itself well to microservices/containerization, they'll need all of these features in order to consider containers as a possibility.

These applications typically have their own init scripts which apply tuning at both the host and app level and (very often) automatically account for different hardware/virtual topologies.

We can approach containerizing this sort of application by simply migrating the init scripts (hardware discovery, numactl, SCHED_FIFO and IRQ stuff) into each container. This, of course, is the definition of an anti-pattern.

However, I'd argue that wanting to run performance-sensitive pods (I call them "PSPs") in Kube is not an anti-pattern -- it's a pattern that's not implemented yet. I agree that usage may require some sort of nodeSelector matching, to avoid the complexities of having CFS workloads co-located with these HPC loads.

Exposing these knobs is one step in enabling a Kube stack to support workloads such as batch processing, HPC/scientific computing, and big-data compute-bound analysis.

Examples of additional enablement would be exposing specific I/O devices to pods, RDMA, and having a container "follow" a PCI device in terms of NUMA locality.

Thanks!

@erictune
Member

There are benefits to exposing this level of detail (cores, numa nodes) to users, and there are costs to exposing it as well.

Within the problem space of running web services, my experience is that on balance it is better not to expose these details to users.

I can see from your comments, Jeremy, that you are interested in high-frequency trading (HFT). With web services, latency is measured in milliseconds. I've heard that for HFT, it is measured in microseconds or even nanoseconds. For very short deadlines, like with HFT, I see how it is important to expose these details.

So, I think the question is whether Kubernetes is a good fit for HFT, or if it can be adapted to be a good fit for HFT without compromising the web services use case. I'm not sure what the answer is to this question.

@erictune
Member

Some examples to illustrate my previous comment:

Say your process is sharing a CPU core with another process (time-slicing), and that other process occasionally evicts all your data from the L1 cache when it runs. Then the next time you run, you have to refill your L1. It takes around ten microseconds to refill the L1. In the web services case, 10us << 1ms, so you don't care. In the HFT case, 10us >> 1us, so you totally care.

Even if the processes are sharing the L1 cache nicely, Linux context switch time can be several microseconds (ref). Again, this is too long for HFT, but not too long for web services.

So, for short deadlines, like in HFT, you need to allocate specific cores to specific containers to avoid context switch overheads and L1 cache interference.

But for web-serving applications, you need to allow fraction-of-a-core requests to get better cluster utilization (see Borg paper, section 5.4). Mixing specific core allocations with fraction-of-a-core requests is tricky. Doing so while also considering NUMA is trickier. Also, allocating entire cores to containers hinders resource reclamation (ibid, section 5.5).
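
To make the contrast concrete, here is a rough sketch of how a fraction-of-a-core request turns into a CFS bandwidth limit rather than a dedicated core. The 100ms period and the minimum quota mirror common defaults; the exact constants and rounding should be read as assumptions, not a spec.

```go
// Sketch: fractional-core ("millicore") requests become a runtime quota per CFS
// scheduling period, enforceable on any core, instead of exclusive cpuset pins.
package main

import "fmt"

const (
	cfsPeriodUs   = 100000 // 100ms scheduling period, in microseconds
	minCFSQuotaUs = 1000   // a small floor so tiny requests still get some runtime
)

// milliCPUToQuota converts e.g. 250 millicores into "25ms of CPU time per 100ms
// period", which the kernel's CFS bandwidth controller enforces.
func milliCPUToQuota(milliCPU int64) int64 {
	quota := milliCPU * cfsPeriodUs / 1000
	if quota < minCFSQuotaUs {
		quota = minCFSQuotaUs
	}
	return quota
}

func main() {
	for _, m := range []int64{250, 500, 1500} {
		fmt.Printf("%4dm CPU -> cfs_quota_us=%d (period=%d)\n", m, milliCPUToQuota(m), cfsPeriodUs)
	}
}
```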

@thockin
Member

thockin commented Nov 22, 2015

I agree with Eric (not surprising since we're drawing on mostly the same experiences from Borg). It's not that these features are not useful, it's that they are an attractive nuisance. People see them and want to tweak them. Once a significant number of people use them it is MUCH harder to claim them for automated use (impossible in some cases) and we lose out on more globally optimal designs. This isn't even hypothetical - we know pretty clearly what we want to build here, we just need to get a hundred other things lined up before it makes sense.

So the question becomes: should we enable control of a knob we know we want to take back later, and if so, how?

My feeling is that IF we do it, we should either do it very coarsely (reserved whole cores or even reserved LLCs) or we should do it in a way that is very clearly stepping outside the normal bounds (i.e. you break it, you buy it). It's DEFINITELY not as simple as adding a couple of fields to the API.

@thockin
Member

thockin commented Nov 22, 2015

For a concrete example: we could do this through the proposed opaque extensions. Tell docker what you want, but you are very clearly outside the supported sphere. Or we could make opaque counted resources for things like LLC0 and LLC1 (with a count of 1), so you could schedule and ask for one instance of LLC0 - clearly you know what your platform architecture is in that case. There are a lot of design avenues I might consider acceptable.
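
Purely as a hypothetical illustration of the "opaque counted resources" idea above (none of this is an existing API): a node could advertise one schedulable unit per last-level cache, and a pod could request exactly one of them. The example.com/llc0 and example.com/llc1 resource names are invented for this sketch.

```go
// Hypothetical sketch only: counted LLC resources expressed with Kubernetes
// resource types. The example.com/* names are invented; there is no such
// built-in resource for cache allocation.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// A node with two last-level caches advertises one unit of each.
	nodeCapacity := v1.ResourceList{
		v1.ResourceCPU:                      resource.MustParse("16"),
		v1.ResourceName("example.com/llc0"): resource.MustParse("1"),
		v1.ResourceName("example.com/llc1"): resource.MustParse("1"),
	}

	// A pod wanting exclusive use of one LLC asks for exactly one unit of it,
	// which makes its platform assumptions explicit.
	podRequests := v1.ResourceList{
		v1.ResourceCPU:                      resource.MustParse("4"),
		v1.ResourceName("example.com/llc0"): resource.MustParse("1"),
	}

	fmt.Println("node capacity:", nodeCapacity)
	fmt.Println("pod requests: ", podRequests)
}
```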

@jeremyeder

I understand it's unavoidable to at least consider the implications of this feature, but let's walk this back to a higher level.

Eric asked: "I think the question is whether Kubernetes is a good fit for HFT, or if it can be adapted to be a good fit for HFT without compromising the web services use case."

Indeed this is the root-question. I have mentioned HFT in other github issues, but was careful not to bring it up here because it's too focused on a particular use-case.

I've seen others express interest in various github issues for Kube to support performance-sensitive-pods and workloads. One that was brought up is NFV, or network function virtualization. NFV is remarkably similar to HFT in terms of what it will need from Kube (I'd argue they're basically the same).

Another is HPC workloads (which I alluded to with my comment about RDMA), which again are vastly different from web services.

All of these workloads and industry verticals stand to benefit significantly from the development methodology efficiencies of container-based workflows and related ecosystems. However -- without proper support from the orchestration system, it's likely that those industries would need to custom build in-house solutions (again).

I've read the Borg paper several times (thanks for releasing it), and so am somewhat aware of the background decisions/experiments and Google's business factors that were involved in the genesis of Borg. However, I just wanted to point out, as I'm quite sure you're aware -- there isn't always a 1:1 mapping between those experiences and other industries.

Support for performance-sensitive-pods would broaden the applicability of Kube beyond web services -- and this is the philosophical question to be answered.

@jeremyeder

Any thoughts, guys ?

@thockin
Member

thockin commented Dec 12, 2015

I think we can support cases like HFT, but it's going to require a non-trivial set of features and enhancements that are designed in concert. We might enable things like this today, but it has to be clear that they are outside the sphere of what is "well supported".

There is actually a cost to running apps under a system like Borg or Kubernetes - it becomes more difficult to squeeze out every last drop of performance without adding a million very targeted features, which has a cost of its own. My biggest concern is that we don't sacrifice usability and comprehension to satisfy very particular workloads.

Philosophically there is no objection, I think, just a desire to tread carefully and very, very intentionally. Also, we're BURIED in work, so things like this are sort of under-serviced.


@erictune
Member

I agree with your point, Jeremy, that the Borg experience does not necessarily map to experience in other industries. I've long known that there is an impedance mismatch when running some types of scientific computing workloads on a Borg-style system. The HFT and NFV cases are new to me, but I can see where, if you were designing a system specifically for those cases, you might do some things differently.

We've got to first make Kubernetes not just good but awesome for the use cases (web services, web analytics, and map-reduce-type batch) that it was created for. Then we can double back and see what can be done to optimize it for cases like NFV and HFT. I think a bunch of us have ideas on how to support those cases better, but doing a deep dive on those right now could draw attention away from the current focus.

@timothysc
Member

/cc @nqn @ConnorDoyle

@neomantra

neomantra commented Oct 21, 2016

I am using containers with an HFT-type workload, generically termed "Fast Data". I like this notion of Performance Sensitive Pods. I'm sharing a concrete use case to help with the design of this:

My underlying setup involves common practices for these workloads (though often not containerized):

  • a kernel bypass technology (e.g. OpenOnload), which requires the docker options --device and --net=host
  • isolating CPUs on the host with isolcpus and pinning containers to cores with docker --cpuset-cpus, plus sometimes pthread_setaffinity_np within the app (see the sketch after this list)
  • choosing cores on the NUMA node of the network card
  • the application is configured to spin on epoll, reads, etc. (hence the need for core isolation)
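
For the in-app pinning step mentioned above, a minimal sketch of what that looks like on Linux, using sched_setaffinity via golang.org/x/sys/unix. Core 2 stands in for a CPU reserved with isolcpus; this illustrates the technique only, not anyone's production setup.

```go
// Sketch: lock the current goroutine to an OS thread, then restrict that thread
// to a single (assumed isolcpus-reserved) core before entering a busy-poll loop.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func pinToCore(core int) error {
	// Keep this goroutine on one OS thread so the affinity setting sticks to it.
	runtime.LockOSThread()

	var set unix.CPUSet
	set.Zero()
	set.Set(core)
	// pid 0 means "the calling thread"; equivalent in effect to pthread_setaffinity_np.
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	if err := pinToCore(2); err != nil {
		fmt.Println("failed to pin:", err)
		return
	}
	fmt.Println("packet-processing spin loop now runs only on core 2")
	// ... epoll / multicast-processing loop would go here ...
}
```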

Currently, I manually manage all of this with configuration (e.g. which containers run on which cores on which hosts) and schedule it with cron. It's a pain and doesn't scale well at all. But performance-wise it is awesome.

Since I don't currently use Kubernetes for these workloads, I don't use Pods. I do set up multiple containers that should be placed near each other (same NUMA node or even share a socket), e.g. a network-heavy process reading packets, processing them, then feeding the results to a Redis process on another core.

At the simplest level, I'd like Kubernetes to know that I have:

  • dedicated/isolated cores in my cluster
  • nodes that allow certain exposed devices and network modes

and that I'd like to schedule a Pod that needs N cores on a compatible node.

@jeremyeder

Thanks very much for that -- I've seen your github repo. We are designing for exactly your use case and more. We have a NUMA and cpuset prototype working internally and have shared our design goals with Solarflare engineering. I was planning to discuss this in more depth at the KubeCon developer conference. So you have some background: I represent RH in the stacresearch.com consortium for tuning for HFT and exchange-side workloads, and have written several whitepapers for Red Hat on tuning RHEL for HFT using Solarflare adapters. We should be able to build a kick-ass system!

@derekwaynecarr
Member Author

Evan - thanks for the details!

Given what you described about your workload (multiple containers that share a common NUMA node), if you were to adopt Kubernetes in the future for this style of workload, I am curious how you would look at adopting Pods.

For this style of workload, do you prefer a model of:

  1. Having a single Pod object that groups multiple containers (shared fate, shared memory, single IP, etc.) and gets scheduled to a single NUMA node
  2. Having multiple Pod objects (independent fate, non-shared memory, individual IPs, etc.) that each need to be scheduled to the same NUMA node on the same box

In general, I am wondering if there is simplification possible at the Pod level versus what is exposed on a per-container basis today, because you lack the concept of Pods for co-located scheduling. For example, is a single Pod (with multiple containers) per NUMA node realistic for your workload style?


@thockin
Member

thockin commented Oct 25, 2016

I'm particularly interested in the CAREFUL evolution of this at the API level, given our experiences with Borg. Looking forward to it.


@derekwaynecarr
Member Author

@thockin -- agreed. hope to chat more at kubecon on it.

@neomantra

@derekwaynecarr Sorry for the delay in replying; I've rewritten this a few times and still feel like I don't know enough to comment well, and I don't want to bikeshed. So I'll try to document more of what I'm doing now and map it to the two choices you gave (which did help me think about this).

For the case I wrote about earlier, I would have to go with multiple Pods, as my containers have different lifetimes. The datastore (Redis) runs round the clock, whereas the feed handlers / data processors are restarted overnight. Those are in-house programs, so I could change that. What would be important here is that I get accelerated TCP loopback, as it makes a difference for me.

I also have single-process, multi-threaded services that are configured to take a set of isolated cores and non-isolated cores. The isolated cores are devoted to ripping multicast packets off the network, performing calculations and transformations on them, and then storing results in thread-safe/concurrent data structures. The non-isolated cores have HTTP worker threads (via Proxygen) assigned to them; these take client requests and access the data structures to respond. I've also thought about adopting gRPC workers in a similar fashion. I don't do this currently, but if I had shared-memory multi-process concurrent data structures, then the single-Pod approach would be appropriate.

Another common tuning suggestion that I forgot to mention before is IRQ affinity.

Unfortunately, I won't be at KubeCon, but I am happy to connect in ways other than an issue thread.

@janosi
Contributor

janosi commented Nov 2, 2016

Please consider this presentation about "cpushare vs. cpuset vs JRE"
https://vimeo.com/181900266
As I understand from this one (though I am not an expert in this area), with current kernel capabilities (i.e. no API for the application to query the resources assigned to it), the reliable way to run Java apps in a container is to use cpuset. With cpushare the JRE gets false information about its environment, even when Java 9 switches to sched_getaffinity().
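
The difference is observable from inside the container: with a cpuset the kernel's affinity mask shrinks to the assigned cores, while cpu-shares alone leaves it at the full host CPU count, which is what misleads a runtime sizing its thread pools. A small sketch of that query (the same sched_getaffinity call mentioned above), assuming Linux and golang.org/x/sys/unix:

```go
// Sketch: report how many CPUs the kernel will actually let this process run on.
// Started with --cpuset-cpus="0-1" this prints 2; with --cpu-shares alone it
// prints the full host core count, i.e. the "false information" described above.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var set unix.CPUSet
	if err := unix.SchedGetaffinity(0, &set); err != nil { // pid 0 = calling thread
		panic(err)
	}
	fmt.Println("CPUs allowed by affinity mask:", set.Count())
}
```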

@jakub-d

jakub-d commented Nov 16, 2016

Any updates on this topic? What were the decisions after KubeCon?

@jeremyeder

The feedback was that we need to prototype using side-car techniques, prove out the ideas, and then work to get them built into Kube. This means using the node-feature-discovery pod for hardware discovery. NFD feeds information into Opaque Integer Resources. We extend the Kube API using ThirdPartyResources such that a pod manifest can request those resources as specified by OIR.

  • NFD runs as a daemonset, is out of tree, and we can extend it however we need.
  • OIR is upstream already; as I understand it, it already has the features we need to do cpuset.
  • TPR is free-form and unrestricted... it will help us prototype more quickly.

A user would then specify numa=true and (optionally) numa_node=1. The numa=true portion is handled by the Kube scheduler, because we have OIR upstream and thus the scheduler can do the "fitting" aspect. The numa_node=1 decision would be a "node-local decision", IOW handled by the kubelet. If it cannot fulfill the request, the pod would fail scheduling and the Kube scheduler would try another node.
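
As a rough sketch of what the pod-side request could look like under this prototype: the opaque-int-resource prefix matches how OIR resources were named at the time, but the numa-node-1 resource name, container name, and image below are assumptions for illustration only.

```go
// Sketch of a container requesting one unit of a (hypothetical) per-NUMA-node
// opaque integer resource, alongside a whole-core CPU request.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	oirNUMANode1 := v1.ResourceName("pod.alpha.kubernetes.io/opaque-int-resource-numa-node-1")

	container := v1.Container{
		Name:  "latency-sensitive",           // hypothetical
		Image: "example/feed-handler:latest", // placeholder image
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{
				v1.ResourceCPU: resource.MustParse("4"),
				oirNUMANode1:   resource.MustParse("1"),
			},
			Limits: v1.ResourceList{
				v1.ResourceCPU: resource.MustParse("4"),
				oirNUMANode1:   resource.MustParse("1"),
			},
		},
	}

	// The scheduler handles the "does any node offer this resource?" fit;
	// the node-local cpuset/NUMA placement is left to the kubelet, as described above.
	fmt.Printf("requests: %v\n", container.Resources.Requests)
}
```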

The mechanics between the kubelet and the NFD pod for the node-local decision... I don't think we reached precise consensus on that, but the thought was that NFD would provide feedback to the kubelet about the NUMA node. We think we'd need a small kubelet change to "shell out" to NFD for node-local decisions.

Long term, we hope to graduate this into proper Kube objects rather than TPR, while formalizing the NFD (aka node-agent) pod technique along with several other supporting features we need to complete the design (I'm sure you understand that cpuset is a small, yet critical, part of the big picture).

@davidopp
Member

davidopp commented Jan 2, 2017

ref/ kubernetes/community#171

@neomantra

I just discovered that Docker doesn't honor isolcpus -- it will assign an affinity of ALL cores to a container, disregarding isolcpus. See moby/moby#31086

I'm pretty surprised by this. It makes it "dangerous" to combine pinned workloads (either bare or via --cpuset-cpus) with unpinned Docker workloads, in the sense that a random container can interrupt the pinned container (which is not allowed to switch to other cores to do its work).

Basically, you need to specify --cpuset-cpus for every container if you are dealing with pinning at all.
This is a Docker issue, but it of course affects this conversation.

@ConnorDoyle
Contributor

xref kubernetes/community#654

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2017
@ConnorDoyle
Contributor

/close

CPU manager docs: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/
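
For reference, the CPU manager's static policy grants exclusive cores to containers in Guaranteed pods whose CPU request is an integer number of cores equal to the limit. A small sketch of that container-level check, mirroring the documented rules rather than kubelet code; the pod is assumed to also satisfy the other Guaranteed QoS requirements (e.g. memory requests equal to limits).

```go
// Sketch: would this container get exclusive cores under the static CPU manager
// policy? (Whole-number CPU request equal to its limit, in a Guaranteed pod.)
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func getsExclusiveCPUs(c v1.Container) bool {
	req := c.Resources.Requests[v1.ResourceCPU]
	limit := c.Resources.Limits[v1.ResourceCPU]
	wholeCores := req.MilliValue() > 0 && req.MilliValue()%1000 == 0
	return wholeCores && req.Cmp(limit) == 0 // integer cores, request == limit
}

func main() {
	c := v1.Container{
		Name: "pinned",
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("2")},
			Limits:   v1.ResourceList{v1.ResourceCPU: resource.MustParse("2")},
		},
	}
	fmt.Println("exclusive cores:", getsExclusiveCPUs(c)) // true; 1500m would print false
}
```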

If folks don’t agree this thread has run its course please re-open.
