The Kubernetes Scheduler operator manages and updates the Kubernetes Scheduler deployed on top of OpenShift. The operator is based on the OpenShift library-go framework and is installed via the Cluster Version Operator (CVO).
It contains the following components:
- Operator
- Bootstrap manifest renderer
- Installer based on static pods
- Configuration observer
By default, the operator exposes Prometheus metrics via the metrics service.
The metrics are collected from the following components:
- Kubernetes Scheduler operator
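To take a quick look at what is exported, the metrics endpoint can be scraped manually. The following is only a sketch; the service name, port, and namespace are assumptions and should be verified with oc get svc -n openshift-kube-scheduler-operator:
# port-forward the (assumed) metrics service locally
$ oc -n openshift-kube-scheduler-operator port-forward svc/metrics 8443:443
# scrape it, authenticating as the current user
$ curl -k -H "Authorization: Bearer $(oc whoami -t)" https://localhost:8443/metrics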
The configuration for the Kubernetes Scheduler is the result of merging:
- a default config
- an observed config (compare observed values above) from the spec of schedulers.config.openshift.io
All of these are sparse configurations, i.e. unvalidated JSON snippets which are merged in order to form a valid configuration at the end.
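A hedged sketch of how to inspect these pieces: the observed config is recorded on the operator resource, and the rendered result ends up in the operand's configmap (the configmap name and namespace are assumptions; jq is only used for pretty-printing):
# observed config kept on the KubeScheduler operator resource
$ oc get kubescheduler cluster -o jsonpath='{.spec.observedConfig}' | jq .
# merged configuration consumed by the kube-scheduler static pods
$ oc get configmap config -n openshift-kube-scheduler -o yaml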
The following profiles are currently provided: LowNodeUtilization (the default), HighNodeUtilization, and NoScoring. Each of these enables cluster-wide scheduling, configured via the Scheduler custom resource:
$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: LowNodeUtilization
...
The HighNodeUtilization profile disables the NodeResourcesBalancedAllocation plugin and the NodeResourcesFit plugin with the LeastAllocated type, and enables the NodeResourcesFit plugin with the MostAllocated type, favoring nodes that have a high allocation of resources. In the past this profile corresponded to disabling the NodeResourcesLeastAllocated and NodeResourcesBalancedAllocation plugins and enabling the NodeResourcesMostAllocated plugin.

The LowNodeUtilization profile (the default) corresponds to the default list of scheduling profiles as provided by the kube-scheduler.

The NoScoring profile disables all scoring plugins.
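For example, to switch the cluster to a different profile, the Scheduler resource can be patched; a sketch that sets the spec.profile field shown above:
$ oc patch scheduler cluster --type=merge -p '{"spec":{"profile":"HighNodeUtilization"}}'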
Customizations of existing profiles are available under the .spec.profileCustomizations field:

| Name | Type | Description |
| --- | --- | --- |
| dynamicResourceAllocation | string | Enable Dynamic Resource Allocation functionality |
E.g.:
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: HighNodeUtilization
  profileCustomizations:
    dynamicResourceAllocation: Enabled
...
The operator also exposes events that can help with debugging issues. To get the operator events, run the following command:
$ oc get events -n openshift-cluster-kube-scheduler-operator
This operator is configured via the KubeScheduler custom resource:
$ oc describe kubescheduler
apiVersion: operator.openshift.io/v1
kind: KubeScheduler
metadata:
  name: cluster
spec:
  managementState: Managed
...
The log level of individual kube-scheduler instances can be increased by setting the .spec.logLevel field:
$ oc explain kubescheduler.spec.logLevel
KIND: KubeScheduler
VERSION: operator.openshift.io/v1
FIELD: logLevel <string>
DESCRIPTION:
logLevel is an intent based logging for an overall component. It does not
give fine grained control, but it is a simple way to manage coarse grained
logging choices that operators have to interpret for their operands. Valid
values are: "Normal", "Debug", "Trace", "TraceAll". Defaults to "Normal".
For example:
apiVersion: operator.openshift.io/v1
kind: KubeScheduler
metadata:
  name: cluster
spec:
  logLevel: Debug
...
Currently the log levels correspond to:
| logLevel | log level |
| --- | --- |
| Normal | 2 |
| Debug | 4 |
| Trace | 6 |
| TraceAll | 10 |
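The log level can also be changed with a patch instead of editing the manifest; a sketch equivalent to the example above:
$ oc patch kubescheduler cluster --type=merge -p '{"spec":{"logLevel":"Debug"}}'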
More about the individual configuration options can be learned by invoking oc explain:
$ oc explain kubescheduler
The current operator status is reported using the ClusterOperator resource. To get the current status, run the following command:
$ oc get clusteroperator/kube-scheduler
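To see the individual conditions behind that status (for example when the operator reports Degraded or Progressing), they can be printed directly; a sketch using jsonpath:
$ oc get clusteroperator kube-scheduler -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'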
In a running cluster, the cluster-version-operator is responsible for keeping the cluster's elements functioning and unaltered. To be able to use a custom operator image, one of the following operations has to be performed:
- Set your operator to an unmanaged state, see here for details; in short:
oc patch clusterversion/version --type='merge' -p "$(cat <<- EOF
spec:
  overrides:
  - group: apps
    kind: Deployment
    name: kube-scheduler-operator
    namespace: openshift-kube-scheduler-operator
    unmanaged: true
EOF
)"
- Scale down cluster-version-operator:
oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version
IMPORTANT: This approach disables the cluster-version-operator completely, whereas the previous one only tells it not to manage the kube-scheduler-operator!
After doing this, you can change the image of the operator to the desired one:
oc patch pod/openshift-kube-scheduler-operator-<rand_digits> -n openshift-kube-scheduler-operator -p '{"spec":{"containers":[{"name":"kube-scheduler-operator-container","image":"<user>/cluster-kube-scheduler-operator"}]}}'
The operator image version used by the installer bootstrap phase can be overridden by creating a custom origin-release image pointing to the developer's operator :latest image:
$ IMAGE_ORG=<user> make images
$ docker push <user>/origin-cluster-kube-scheduler-operator
$ cd ../cluster-kube-apiserver-operator
$ IMAGES=cluster-kube-scheduler-operator IMAGE_ORG=<user> make origin-release
$ docker push <user>/origin-release:latest
$ cd ../installer
$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=docker.io/<user>/origin-release:latest bin/openshift-install cluster ...
By default, kube-scheduler profiling is disabled. Profiling can be enabled manually by editing the config.yaml files under each master node.
Warning: the configuration gets undone after a new revision is rolled out, and the steps need to be repeated.
Steps:
- access every master node (e.g. via ssh or with oc debug)
- edit /etc/kubernetes/static-pod-resources/kube-scheduler-pod-$REV/configmaps/config/config.yaml (where $REV corresponds to the latest revision) and set the enableProfiling field to True
- make a benign change to /etc/kubernetes/manifests/kube-scheduler-pod.yaml, e.g. updating "Waiting for port" to "Waiting for port " (adding one blank space to the string). Wait for the updated pod manifest to be picked up and a new kube-scheduler instance to be running and ready.
- run oc port-forward pod/$KUBE_SCHEDULER_POD_NAME 10259:10259 in a separate terminal/window (where $KUBE_SCHEDULER_POD_NAME corresponds to a running kube-scheduler pod instance)
- apply the following manifests to allow anonymous access:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubescheduler-anonymous-access
rules:
  - nonResourceURLs: ["/debug", "/debug/*"]
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubescheduler-anonymous-access
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubescheduler-anonymous-access
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: system:anonymous
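With profiling enabled, the port-forward running, and anonymous access allowed, the heap data referenced below can be pulled with a plain HTTP client, e.g. (a sketch):
$ curl -k https://localhost:10259/debug/pprof/heap -o heap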
Analyzing the heap profile with go tool pprof requires pulling the heap file and the kube-scheduler binary.
Steps:
- Pull the heap data by accessing https://localhost:10259/debug/pprof/heap
- Extract the kube-scheduler binary from the corresponding image (by checking the kube-scheduler pod manifest):
$ podman pull --authfile $AUTHFILE $KUBE_SCHEDULER_IMAGE
$ podman cp $(podman create --name kube-scheduler $KUBE_SCHEDULER_IMAGE):/usr/bin/kube-scheduler ./kube-scheduler
where:
- $AUTHFILE corresponds to your authentication file, if not already located in the known paths
- $KUBE_SCHEDULER_IMAGE corresponds to the kube-scheduler image found in a kube-scheduler pod manifest
- Run go tool pprof kube-scheduler heap
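Non-interactive pprof invocations work as well; a sketch (generating the SVG requires graphviz to be installed):
# text summary of the biggest allocators
$ go tool pprof -top kube-scheduler heap
# call graph rendered as an SVG
$ go tool pprof -svg kube-scheduler heap > heap.svg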
// CacheDebugger provides ways to check and write cache information for debugging.
// ListenForSignal starts a goroutine that will trigger the CacheDebugger's
// behavior when the process receives SIGINT (Windows) or SIGUSER2 (non-Windows).
When a kube-scheduler process receives the SIGUSR2 signal, the node cache gets dumped into the logs. E.g.:
I0105 03:32:31.936642 1 dumper.go:52] "Dump of cached NodeInfo" nodes=<
Node name: NODENAME1
Deleted: false
Requested Resources: ...
Scheduled Pods(number: 41):
name: POD_NAME, namespace: POD_NAMESPACE, uid: 23c63c58-cc36-48be-97d9-f4f6088a709d, phase: Running, nominated node:
name: POD_NAME, namespace: POD_NAMESPACE, uid: 04b3b3b4-52a3-46d0-b7ff-aa748eecd404, phase: Running, nominated node:
...
Node name: NODENAME2
Deleted: false
Requested Resources: ...
Scheduled Pods(number: 53):
name: POD_NAME, namespace: POD_NAMESPACE, uid: 7cbce63f-3fb9-404a-a69b-6728592e6b2, phase: Running, nominated node:
name: POD_NAME, namespace: POD_NAMESPACE, uid: 50bc7d7e-bd30-4c47-82ce-a9d3eb737434, phase: Running, nominated node:
...
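A hedged sketch of triggering such a dump from outside the pod (the namespace, the pod name variable, and the availability of kill inside the container image are assumptions):
# send SIGUSR2 to the kube-scheduler process (PID 1 in the container)
$ oc exec -n openshift-kube-scheduler $KUBE_SCHEDULER_POD_NAME -- kill -SIGUSR2 1
# then look for the dump in the logs
$ oc logs -n openshift-kube-scheduler $KUBE_SCHEDULER_POD_NAME | grep -A 5 'Dump of cached NodeInfo'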