kubelet: get cgroup driver config from CRI #118770

Merged
67 changes: 62 additions & 5 deletions cmd/kubelet/app/server.go
@@ -35,6 +35,8 @@ import (
"github.com/coreos/go-systemd/v22/daemon"
"github.com/spf13/cobra"
"github.com/spf13/pflag"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"k8s.io/klog/v2"
"k8s.io/mount-utils"

@@ -76,6 +78,7 @@ import (
"k8s.io/component-base/version"
"k8s.io/component-base/version/verflag"
nodeutil "k8s.io/component-helpers/node/util"
runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
"k8s.io/kubernetes/cmd/kubelet/app/options"
"k8s.io/kubernetes/pkg/api/legacyscheme"
@@ -625,6 +628,17 @@ func run(ctx context.Context, s *options.KubeletServer, kubeDeps *kubelet.Depend
runAuthenticatorCAReload(ctx.Done())
}

if err := kubelet.PreInitRuntimeService(&s.KubeletConfiguration, kubeDeps); err != nil {
return err
}

// Get cgroup driver setting from CRI
if utilfeature.DefaultFeatureGate.Enabled(features.KubeletCgroupDriverFromCRI) {
if err := getCgroupDriverFromCRI(ctx, s, kubeDeps); err != nil {
return err
}
}
Comment on lines +636 to +640
Contributor

Do we do this:

  • per new connection to the socket?
  • per kubelet startup?

I'm thinking that if someone restarts the process at the other end of the socket, we should re-query this setting.

Contributor Author

This happens only at kubelet startup.

Technically we could query it at CRI reconnection but I wouldn't go there.

  1. The cgroup driver is really not something that's expected to change without taking the node down. Changing it correctly requires restarting all pods, including the static ones. A system reboot is a safe method.
  2. I don't know what would break inside the kubelet if we changed the cgroup driver in-flight.

We could re-query the cgroup driver at re-connection and exit if the cgroup driver setting has changed. But I'm not sure if that's desired. Thoughts @SergeyKanzhelev @mrunalp @haircommander

Contributor

I think our efforts would be better spent making it very clear that a change of the cgroup driver in the CRI runtime requires a node reboot.
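
For illustration only: the re-query-on-reconnect idea floated above could look roughly like the sketch below. This is hypothetical and not part of this PR; cgroupDriverName is an assumed helper that maps the CRI enum to the kubelet's string form, rs stands in for the kubelet's CRI runtime service, and the reconnect hook itself is assumed to exist.

```go
// Hypothetical sketch: after a CRI reconnect, re-query the driver and require
// a restart if it changed, since the kubelet cannot switch cgroup drivers
// in-flight.
func verifyCgroupDriverUnchanged(ctx context.Context, rs internalapi.RuntimeService, current string) error {
	resp, err := rs.RuntimeConfig(ctx)
	if err != nil {
		// Includes Unimplemented: nothing to compare, keep the current driver.
		return nil
	}
	// cgroupDriverName: assumed helper translating the CRI enum to "systemd"/"cgroupfs".
	if got := cgroupDriverName(resp.GetLinux().GetCgroupDriver()); got != current {
		return fmt.Errorf("CRI runtime cgroup driver changed from %q to %q; node restart required", current, got)
	}
	return nil
}
```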


var cgroupRoots []string
nodeAllocatableRoot := cm.NodeAllocatableRoot(s.CgroupRoot, s.CgroupsPerQOS, s.CgroupDriver)
cgroupRoots = append(cgroupRoots, nodeAllocatableRoot)
@@ -775,11 +789,6 @@ func run(ctx context.Context, s *options.KubeletServer, kubeDeps *kubelet.Depend
klog.InfoS("Failed to ApplyOOMScoreAdj", "err", err)
}

err = kubelet.PreInitRuntimeService(&s.KubeletConfiguration, kubeDeps)
if err != nil {
return err
}

if err := RunKubelet(s, kubeDeps, s.RunOnce); err != nil {
return err
}
@@ -1282,3 +1291,51 @@ func newTracerProvider(s *options.KubeletServer) (oteltrace.TracerProvider, erro
}
return tp, nil
}

func getCgroupDriverFromCRI(ctx context.Context, s *options.KubeletServer, kubeDeps *kubelet.Dependencies) error {
klog.V(4).InfoS("Getting CRI runtime configuration information")

var (
runtimeConfig *runtimeapi.RuntimeConfigResponse
err error
)
// Retry a couple of times, hoping that any errors are transient.
// Fail quickly on known, non-transient errors.
for i := 0; i < 3; i++ {
Contributor

I am not sure that adding a loop here is adding much value. What transient errors do we expect to succeed on retries? Also, there is no backoff or wait in between attempts.

Member

If the kubelet and the runtime are started together, a transient issue may just be a matter of startup timing. Previously we established the connection a bit later in the function, after reading some metrics and such. We can exit the kubelet immediately and rely on an external restart of the kubelet, or we can do a few iterations here. But if we do retry here, there should be a delay between attempts.

Contributor

Yes, without delay there isn't much value here.

Contributor

How about 2 seconds? I had trouble finding a value to base it on, so I just made a guess.

Contributor Author

I'm back from PTO, thanks for all the help with this @kubernetes/milestone-maintainers

How long would we expect CRI runtime startup to take? Just wondering whether the sleep should be 1 or 2 seconds or something else.
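
As an aside, the retry-with-delay being discussed could also be expressed with the polling helpers from k8s.io/apimachinery/pkg/util/wait instead of a hand-rolled loop. A minimal sketch, with interval and timeout values that are guesses just like the 2 seconds above (an alternative shape, not the code this PR adds):

```go
// Poll the runtime every 2s for up to 10s. Unimplemented is non-transient,
// so surface it immediately; any other error is retried after the interval.
var runtimeConfig *runtimeapi.RuntimeConfigResponse
err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 10*time.Second, true,
	func(ctx context.Context) (bool, error) {
		resp, err := kubeDeps.RemoteRuntimeService.RuntimeConfig(ctx)
		if err != nil {
			if s, ok := status.FromError(err); ok && s.Code() == codes.Unimplemented {
				return false, err // stop polling; the caller falls back to config
			}
			return false, nil // transient, try again after the interval
		}
		runtimeConfig = resp
		return true, nil
	})
```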

runtimeConfig, err = kubeDeps.RemoteRuntimeService.RuntimeConfig(ctx)
if err != nil {
s, ok := status.FromError(err)
if !ok || s.Code() != codes.Unimplemented {
// We could introduce a backoff delay or jitter, but this is largely catching cases
// where the runtime is still starting up and we request too early.
// Give it a little more time.
time.Sleep(time.Second * 2)
continue
}
// The CRI implementation doesn't support RuntimeConfig; fall back to the kubelet config.
klog.InfoS("CRI implementation should be updated to support RuntimeConfig when KubeletCgroupDriverFromCRI feature gate has been enabled. Falling back to using cgroupDriver from kubelet config.")
return nil
}
// Success; stop retrying.
break
}
if err != nil {
return err
}

// Calling GetLinux().GetCgroupDriver() won't segfault, but it would always default to systemd,
// which is not the intent when the runtime has simply left the fields unpopulated.
linuxConfig := runtimeConfig.GetLinux()
if linuxConfig == nil {
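// No Linux runtime config returned: leave s.CgroupDriver as it was set by the kubelet configuration (--cgroup-driver).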
return nil
Contributor

Can there be a comment saying where the cgroup driver comes from / what sets it if this case happens?

Contributor Author

It now states "Falling back to using cgroupDriver from kubelet config." Do you suggest something else, or something in addition? @haircommander

}

switch d := linuxConfig.GetCgroupDriver(); d {
case runtimeapi.CgroupDriver_SYSTEMD:
s.CgroupDriver = "systemd"
case runtimeapi.CgroupDriver_CGROUPFS:
s.CgroupDriver = "cgroupfs"
default:
return fmt.Errorf("runtime returned an unknown cgroup driver %d", d)
Contributor

I don't know what the kubelet's policy on panics is, but this seems like a prime reason to panic, as it's a programming error that should never happen. I'm not stubborn about that, though.

Contributor Author

Yeah, it's a programming error, but on the CRI runtime's side, out of our (the kubelet's) control, so I'm not sure we want to panic.

Member

let's fall back to the configuration file for now

Member

Thinking more... going forward, this should be treated differently than, say, the runtime being unavailable or returning an error. If the runtime returns an error, we should retry a few times and then fall back to the config. If it's a new, unknown value, the kubelet should not retry and should exit immediately.

Contributor

Good idea. Is this what you were thinking, @SergeyKanzhelev?

Contributor

@mrunalp I added it because of @SergeyKanzhelev's comment. I am personally fine without it; I'll let y'all decide :)

}
klog.InfoS("Using cgroup driver setting received from the CRI runtime", "cgroupDriver", s.CgroupDriver)
return nil
}
13 changes: 13 additions & 0 deletions pkg/features/kube_features.go
@@ -416,6 +416,17 @@ const (
// yet.
JobTrackingWithFinalizers featuregate.Feature = "JobTrackingWithFinalizers"

// owner: @marquiz
// kep: http://kep.k8s.io/4033
// alpha: v1.28
//
// Enable detection of the kubelet cgroup driver configuration option from
// the CRI. The CRI runtime also needs to support this feature, in which
// case the kubelet will ignore the cgroupDriver (--cgroup-driver)
// configuration option. If the runtime doesn't support it, the kubelet will
// fall back to using its cgroupDriver option.
KubeletCgroupDriverFromCRI featuregate.Feature = "KubeletCgroupDriverFromCRI"

// owner: @AkihiroSuda
// alpha: v1.22
//
@@ -1007,6 +1018,8 @@ var defaultKubernetesFeatureGates = map[featuregate.Feature]featuregate.FeatureS

JobTrackingWithFinalizers: {Default: true, PreRelease: featuregate.GA, LockToDefault: true}, // remove in 1.28

KubeletCgroupDriverFromCRI: {Default: false, PreRelease: featuregate.Alpha},

KubeletInUserNamespace: {Default: false, PreRelease: featuregate.Alpha},

KubeletPodResources: {Default: true, PreRelease: featuregate.GA, LockToDefault: true}, // GA in 1.28, remove in 1.30
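For anyone trying the alpha gate, it can be switched on through the kubelet configuration file; a minimal illustrative snippet (featureGates is the standard KubeletConfiguration field):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletCgroupDriverFromCRI: true
```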
10 changes: 10 additions & 0 deletions pkg/kubelet/cri/remote/fake/fake_runtime.go
@@ -356,3 +356,13 @@ func (f *RemoteRuntime) ListPodSandboxMetrics(ctx context.Context, req *kubeapi.

return &kubeapi.ListPodSandboxMetricsResponse{PodMetrics: podMetrics}, nil
}

// RuntimeConfig returns the configuration information of the runtime.
func (f *RemoteRuntime) RuntimeConfig(ctx context.Context, req *kubeapi.RuntimeConfigRequest) (*kubeapi.RuntimeConfigResponse, error) {
resp, err := f.RuntimeService.RuntimeConfig(ctx)
if err != nil {
return nil, err
}

return resp, nil
}
15 changes: 15 additions & 0 deletions pkg/kubelet/cri/remote/remote_runtime.go
@@ -865,3 +865,18 @@ func (r *remoteRuntimeService) ListPodSandboxMetrics(ctx context.Context) ([]*ru

return resp.GetPodMetrics(), nil
}

// RuntimeConfig returns the configuration information of the runtime.
func (r *remoteRuntimeService) RuntimeConfig(ctx context.Context) (*runtimeapi.RuntimeConfigResponse, error) {
ctx, cancel := context.WithTimeout(ctx, r.timeout)
defer cancel()

resp, err := r.runtimeClient.RuntimeConfig(ctx, &runtimeapi.RuntimeConfigRequest{})
if err != nil {
klog.ErrorS(err, "RuntimeConfig from runtime service failed")
return nil, err
}
klog.V(10).InfoS("[RemoteRuntimeService] RuntimeConfigResponse", "linuxConfig", resp.GetLinux())

return resp, nil
}
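
To make the wire-level behavior concrete, here is a hedged standalone sketch of the same RPC issued from a bare gRPC client. The socket path is only an example (containerd's default); runtimes that don't implement this RPC return codes.Unimplemented:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Example socket path; CRI-O, for instance, listens on /var/run/crio/crio.sock.
	conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	resp, err := client.RuntimeConfig(ctx, &runtimeapi.RuntimeConfigRequest{})
	if err != nil {
		panic(err) // older runtimes return status code Unimplemented here
	}
	fmt.Println("cgroup driver:", resp.GetLinux().GetCgroupDriver())
}
```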
9 changes: 9 additions & 0 deletions pkg/kubelet/kuberuntime/instrumented_services.go
@@ -361,3 +361,12 @@ func (in instrumentedRuntimeService) ListPodSandboxMetrics(ctx context.Context)
recordError(operation, err)
return out, err
}

func (in instrumentedRuntimeService) RuntimeConfig(ctx context.Context) (*runtimeapi.RuntimeConfigResponse, error) {
const operation = "runtime_config"
defer recordOperation(operation, time.Now())

out, err := in.service.RuntimeConfig(ctx)
recordError(operation, err)
return out, err
}
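
For context on the instrumentation pattern: recordOperation and recordError are pre-existing helpers in this file, roughly as below (paraphrased for reference, not part of this diff):

```go
// recordOperation records the duration of the operation.
func recordOperation(operation string, start time.Time) {
	metrics.RuntimeOperations.WithLabelValues(operation).Inc()
	metrics.RuntimeOperationsDuration.WithLabelValues(operation).Observe(metrics.SinceInSeconds(start))
}

// recordError counts an error for the operation's error metric, if any.
func recordError(operation string, err error) {
	if err != nil {
		metrics.RuntimeOperationsErrors.WithLabelValues(operation).Inc()
	}
}
```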