MESOS: scheduler: implement role awareness #15775

Merged
8 changes: 4 additions & 4 deletions cluster/mesos/docker/docker-compose.yml
@@ -23,6 +23,7 @@ mesosmaster1:
- MESOS_QUORUM=1
- MESOS_REGISTRY=in_memory
- MESOS_WORK_DIR=/var/lib/mesos
- MESOS_ROLES=role1
links:
- etcd
- "ambassador:apiserver"
@@ -40,15 +41,15 @@ mesosslave:
DOCKER_NETWORK_OFFSET=0.0.$${N}.0
exec wrapdocker mesos-slave
--work_dir="/var/tmp/mesos/$${N}"
--attributes="rack:$${N};gen:201$${N}"
--attributes="rack:$${N};gen:201$${N};role:role$${N}"
--hostname=$$(getent hosts mesosslave | cut -d' ' -f1 | sort -u | tail -1)
--resources="cpus:4;mem:1280;disk:25600;ports:[8000-21099];cpus(role$${N}):1;mem(role$${N}):640;disk(role$${N}):25600;ports(role$${N}):[7000-7999]"
command: []
environment:
- MESOS_MASTER=mesosmaster1:5050
- MESOS_PORT=5051
- MESOS_LOG_DIR=/var/log/mesos
- MESOS_LOGGING_LEVEL=INFO
- MESOS_RESOURCES=cpus:4;mem:1280;disk:25600;ports:[8000-21099]
- MESOS_SWITCH_USER=0
- MESOS_CONTAINERIZERS=docker,mesos
- MESOS_ISOLATION=cgroups/cpu,cgroups/mem
@@ -58,8 +59,6 @@ mesosslave:
- etcd
- mesosmaster1
- "ambassador:apiserver"
volumes:
- ${MESOS_DOCKER_WORK_DIR}/mesosslave:/var/tmp/mesos
apiserver:
hostname: apiserver
image: mesosphere/kubernetes-mesos
@@ -145,6 +144,7 @@ scheduler:
--mesos-executor-cpus=1.0
--mesos-sandbox-overlay=/opt/sandbox-overlay.tar.gz
--static-pods-config=/opt/static-pods
--mesos-roles=*,role1
--v=4
--executor-logv=4
--profiling=true
88 changes: 88 additions & 0 deletions contrib/mesos/docs/scheduler.md
@@ -30,6 +30,93 @@ example, the Kubernetes-Mesos executor manages `k8s.mesosphere.io/attribute`
labels and will auto-detect and update modified attributes when the mesos-slave
is restarted.

## Resource Roles

A Mesos cluster can be statically partitioned using [resource roles][2]. Each
resource is assigned such a role (`*` is the default role if none is explicitly
assigned on the mesos-slave command line). The Mesos master will send offers to
frameworks for `*` resources and, optionally, for one extra role that a
framework is assigned to. Currently, only one such extra role per framework is
supported.
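
A mesos-slave assigns resources to roles on its command line. Here is a minimal
sketch mirroring the `docker-compose.yml` change in this PR (the flag values
are illustrative):

```bash
# Reserve part of the slave's resources for role1; everything not listed
# with an explicit role stays in the default (*) role.
mesos-slave \
  --master=mesosmaster1:5050 \
  --work_dir=/var/tmp/mesos/1 \
  --attributes="rack:1;gen:2011;role:role1" \
  --resources="cpus:4;mem:1280;disk:25600;ports:[8000-21099];cpus(role1):1;mem(role1):640;disk(role1):25600;ports(role1):[7000-7999]"
```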

### Configuring Roles for the Scheduler

Every Mesos framework scheduler can choose among the offered `*` resources and
those of its extra role. The Kubernetes-Mesos scheduler supports this via the
framework roles given on the scheduler command line, e.g.

```bash
$ km scheduler ... --mesos-roles="*,role1" ...
```

This tells the Kubernetes-Mesos scheduler to default to `*` resources when a
pod is not explicitly assigned to another role. Moreover, the extra role
`role1` is allowed, i.e. the Mesos master will send resources of role `role1`
to the Kubernetes scheduler.

Note the following restrictions and possibilities (summarized in the sketch
below):
- Due to a restriction in Mesos, only one extra role may be provided on the
  command line.
- The extra role may be passed alone, without `*`, e.g. `--mesos-roles=role1`.
  This means that the scheduler will not consider `*` resources at all.
- The extra role may come first, e.g. `--mesos-roles=role1,*`. This makes
  `role1` the default role for pods without a special role assignment (see
  below), while `*` resources are still considered for pods with an explicit
  `*` assignment.
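
In summary, the accepted variants look like this (a sketch; `role1` stands in
for whatever extra role the framework is registered with):

```bash
$ km scheduler ... --mesos-roles="*,role1" ...  # default to *, also accept role1
$ km scheduler ... --mesos-roles=role1 ...      # role1 only, ignore * resources
$ km scheduler ... --mesos-roles=role1,* ...    # default to role1, also accept *
```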

### Specifying Roles for Pods

By default, a pod is scheduled using resources of the role that comes first in
the list of scheduler roles.

A pod can opt out of this default behaviour using the `k8s.mesosphere.io/roles`
label:

```yaml
k8s.mesosphere.io/roles: role1,role2,role3
```

The format is a comma-separated list of allowed resource roles. The scheduler
will try to schedule the pod with `role1` resources first, fall back to
`role2` resources if the former are not available, and finally fall back to
`role3` resources.

The `*` role may be specified as well in this list.

**Note:** An empty list means that no resource roles are allowed, which is
equivalent to the pod being unschedulable.

For example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: backend
  labels:
    k8s.mesosphere.io/roles: "*,prod,test,dev"
  namespace: prod
spec:
  ...
```

This `prod/backend` pod will be scheduled using resources from all four roles,
preferring `*` resources, followed by `prod`, `test` and `dev`. If those four
roles together do not provide enough resources, scheduling fails.

**Note:** The scheduler will also mix different roles in the following sense:
if a node provides `cpu` resources for the `*` role, but `mem` resources only
for the `prod` role, the pod above will be scheduled using `cpu(*)` and
`mem(prod)` resources.

**Note:** The scheduler might also mix roles within one resource type, i.e. it
will use as many `cpu`s of the `*` role as possible. If a pod requires more
`cpu` resources (defined using the `pod.spec.containers[].resources.limits`
property) for successful scheduling, the scheduler adds resources from the
`prod`, `test` and `dev` roles, in this order, until the pod's resource
requirements are satisfied. E.g. a pod might be scheduled with 0.5 `cpu(*)`,
1.5 `cpu(prod)` and 1 `cpu(test)`, plus e.g. 2 GB of `mem(prod)` resources.
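
As a sketch of the arithmetic above, such a pod could declare its requirements
via container resource limits (the values are illustrative, and
`example/backend` is a hypothetical image):

```yaml
# A pod requesting 3 cpus and 2Gi of memory, which could be satisfied by
# 0.5 cpu(*) + 1.5 cpu(prod) + 1 cpu(test), plus 2Gi of mem(prod).
apiVersion: v1
kind: Pod
metadata:
  name: backend
  labels:
    k8s.mesosphere.io/roles: "*,prod,test,dev"
  namespace: prod
spec:
  containers:
  - name: backend
    image: example/backend:latest  # hypothetical image
    resources:
      limits:
        cpu: "3"
        memory: 2Gi
```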

## Tuning

The scheduler configuration can be fine-tuned using an ini-style configuration file.
@@ -49,6 +136,7 @@ offer-ttl = 5s
; duration an expired offer lingers in history
offer-linger-ttl = 2m

; duration between offer listener notifications
listener-delay = 1s

59 changes: 53 additions & 6 deletions contrib/mesos/pkg/executor/executor.go
@@ -17,6 +17,7 @@ limitations under the License.
package executor

import (
"bytes"
"encoding/json"
"fmt"
"strings"
@@ -33,6 +34,7 @@ import (
"k8s.io/kubernetes/contrib/mesos/pkg/executor/messages"
"k8s.io/kubernetes/contrib/mesos/pkg/node"
"k8s.io/kubernetes/contrib/mesos/pkg/podutil"
"k8s.io/kubernetes/contrib/mesos/pkg/scheduler/executorinfo"
"k8s.io/kubernetes/contrib/mesos/pkg/scheduler/meta"
"k8s.io/kubernetes/pkg/api"
unversionedapi "k8s.io/kubernetes/pkg/api/unversioned"
@@ -223,13 +225,21 @@ func (k *Executor) sendPodsSnapshot() bool {
}

// Registered is called when the executor is successfully registered with the slave.
func (k *Executor) Registered(driver bindings.ExecutorDriver,
executorInfo *mesos.ExecutorInfo, frameworkInfo *mesos.FrameworkInfo, slaveInfo *mesos.SlaveInfo) {
func (k *Executor) Registered(
driver bindings.ExecutorDriver,
executorInfo *mesos.ExecutorInfo,
frameworkInfo *mesos.FrameworkInfo,
slaveInfo *mesos.SlaveInfo,
) {
if k.isDone() {
return
}
log.Infof("Executor %v of framework %v registered with slave %v\n",
executorInfo, frameworkInfo, slaveInfo)

log.Infof(
"Executor %v of framework %v registered with slave %v\n",
executorInfo, frameworkInfo, slaveInfo,
)

if !(&k.state).transition(disconnectedState, connectedState) {
log.Errorf("failed to register/transition to a connected state")
}
@@ -241,8 +251,22 @@ func (k *Executor) Registered(driver bindings.ExecutorDriver,
}
}

annotations, err := executorInfoToAnnotations(executorInfo)
if err != nil {
log.Errorf(
"cannot get node annotations from executor info %v error %v",
executorInfo, err,
)
}

if slaveInfo != nil {
_, err := node.CreateOrUpdate(k.client, slaveInfo.GetHostname(), node.SlaveAttributesToLabels(slaveInfo.Attributes))
_, err := node.CreateOrUpdate(
k.client,
slaveInfo.GetHostname(),
node.SlaveAttributesToLabels(slaveInfo.Attributes),
annotations,
)

[Review comment, Contributor] future TODO: integrate the updates that we want to send to the node w/ the node update loop that already exists within the kubelet

[Reply, Contributor] This is a one-time update. I don't think it's worth it to optimize this much more.

if err != nil {
log.Errorf("cannot update node labels: %v", err)
}
@@ -270,7 +294,13 @@ func (k *Executor) Reregistered(driver bindings.ExecutorDriver, slaveInfo *mesos
}

if slaveInfo != nil {
_, err := node.CreateOrUpdate(k.client, slaveInfo.GetHostname(), node.SlaveAttributesToLabels(slaveInfo.Attributes))
_, err := node.CreateOrUpdate(
k.client,
slaveInfo.GetHostname(),
node.SlaveAttributesToLabels(slaveInfo.Attributes),
nil, // don't change annotations
)

if err != nil {
log.Errorf("cannot update node labels: %v", err)
}
@@ -988,3 +1018,20 @@ func nodeInfo(si *mesos.SlaveInfo, ei *mesos.ExecutorInfo) NodeInfo {
}
return ni
}

func executorInfoToAnnotations(ei *mesos.ExecutorInfo) (annotations map[string]string, err error) {
annotations = map[string]string{}
if ei == nil {
return
}

var buf bytes.Buffer
if err = executorinfo.EncodeResources(&buf, ei.GetResources()); err != nil {
return
}

annotations[meta.ExecutorIdKey] = ei.GetExecutorId().GetValue()
annotations[meta.ExecutorResourcesKey] = buf.String()

return
}
34 changes: 30 additions & 4 deletions contrib/mesos/pkg/executor/executor_test.go
@@ -168,10 +168,23 @@ func TestExecutorLaunchAndKillTask(t *testing.T) {
}

pod := NewTestPod(1)
podTask, err := podtask.New(api.NewDefaultContext(), "", pod)
executorinfo := &mesosproto.ExecutorInfo{}
podTask, err := podtask.New(
api.NewDefaultContext(),
"",
pod,
executorinfo,
nil,
)

assert.Equal(t, nil, err, "must be able to create a task from a pod")

taskInfo := podTask.BuildTaskInfo(&mesosproto.ExecutorInfo{})
podTask.Spec = &podtask.Spec{
Executor: executorinfo,
}
taskInfo, err := podTask.BuildTaskInfo()
assert.Equal(t, nil, err, "must be able to build task info")

data, err := testapi.Default.Codec().Encode(pod)
assert.Equal(t, nil, err, "must be able to encode a pod's spec data")
taskInfo.Data = data
@@ -370,8 +383,21 @@ func TestExecutorFrameworkMessage(t *testing.T) {

// set up a pod to then lose
pod := NewTestPod(1)
podTask, _ := podtask.New(api.NewDefaultContext(), "foo", pod)
taskInfo := podTask.BuildTaskInfo(&mesosproto.ExecutorInfo{})
executorinfo := &mesosproto.ExecutorInfo{}
podTask, _ := podtask.New(
api.NewDefaultContext(),
"foo",
pod,
executorinfo,
nil,
)

podTask.Spec = &podtask.Spec{
Executor: executorinfo,
}
taskInfo, err := podTask.BuildTaskInfo()
assert.Equal(t, nil, err, "must be able to build task info")

data, _ := testapi.Default.Codec().Encode(pod)
taskInfo.Data = data
