
Propose a feature to troubleshoot running pods #649

Merged
merged 6 commits into kubernetes:master on Sep 27, 2017

Conversation

@verb (Contributor) commented May 22, 2017

This feature allows troubleshooting of running pods by running a new "Debug Container" in the pod namespaces.

This proposal was originally opened and reviewed in kubernetes/kubernetes#35584.

This proposal needs LGTM by the following SIGs:

  • SIG Node
  • SIG CLI
  • SIG Auth
  • API Reviewer

Work in Progress:

  • Prototype kubectl attach for debug containers
  • Talk to sig-api-machinery about /debug subresource semantics

@k8s-ci-robot added the cncf-cla: yes label on May 22, 2017
@verb (author) commented May 22, 2017

/assign @dashpole

@dashpole (Contributor):
/lgtm

@k8s-ci-robot added the lgtm label on May 22, 2017
@derekwaynecarr (Member):
I think this proposal needs wider review since it comes with API changes.

@derekwaynecarr removed the lgtm label on May 22, 2017
@derekwaynecarr (Member):
Related to feature kubernetes/enhancements#277

```
type PodStatus struct {
...
DebugContainerStatuses []ContainerStatus
```
Member:

since this feature is alpha, doesn't the field name need to reflect that?

@verb (author), May 22, 2017:

That would make sense, I just didn't know that was how to do it. I know there's been a lot of discussion in kubernetes/kubernetes#30819, but I haven't followed it. It doesn't look like there's consensus, but I can try to cherry-pick something that might work.

Contributor:

I think we do not need to update the field name; how about using annotations like init containers did? E.g. adding an annotation pod.alpha.kubernetes.io/debug-containers. You can get more detail from https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md

Member:

We should definitely not store something as important as debug container info in an annotation. Those are also compatibility nightmares when they transition to supported fields.

@kubernetes/sig-api-machinery-misc for guidance on alpha field names and gating

Contributor:

@liggitt can you give more detail on this? Init containers used this annotation approach to go from alpha to beta to graduated; why can't debug containers do the same?

Member:

It did, and it has been a very painful transition, with very poor user experience (c.f. kubernetes/kubernetes#45627 (comment))

Contributor:

Got it, thanks @liggitt

@derekwaynecarr (Member):
I think a section that covers the security concerns associated with this feature would be appreciated for future readers.

This creates an interactive shell in a pod which can examine and signal all
processes in the pod. It has access to the same network and IPC as processes in
the pod. It can access the filesystem of other processes by `/proc/$PID/root`,
and enter arbitrary namespaces of another container via `nsenter` when
Member:

is it possible to hop from one container to another within a pod with nsenter today? is the proposed debug container more powerful than a bash prompt inside an existing container obtained via exec today?

@verb (author):

nsenter is possible today with CAP_SYS_ADMIN. Debug containers aren't different from containers in Kubernetes today except that they aren't in a pod spec. If you were to add the following container to a pod spec:

  - name: shell
    image: debian
    stdin: true
    tty: true
    securityContext:
      capabilities:
        add:
          - SYS_ADMIN

Then, as long as there is a shared PID namespace, you can attach and nsenter other processes.

SYS_ADMIN wouldn't be granted to Debug Containers by default, but it should be an option.

Member:

Kubernetes master with a new enough docker version uses a shared pid namespace by default now.


The process for creating a Debug Container is:

1. `kubectl` constructs a `v1.Container` based on command line flags and
Member:

v1.Container is not a top-level object (it has no objectmeta or typemeta)... can you describe the wrapper object that would be posted?

@verb (author):

Added description of a new object PodDebugContainer
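The updated proposal text isn't reproduced in this thread, so for concreteness, a wrapper along these lines would answer the objection. Only the type name PodDebugContainer comes from the comment above; the fields are assumptions for illustration:

```go
package debug

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodDebugContainer is a hypothetical sketch of the wrapper object: a
// top-level kind (with TypeMeta/ObjectMeta) carrying the container to run.
type PodDebugContainer struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Container is the v1.Container to run in the target pod's namespaces.
	Container v1.Container `json:"container"`
}
```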


1. `kubectl` constructs a `v1.Container` based on command line flags and
`POST`s it to `/api/v1/namespaces/$NS/pods/$POD/debug`.
1. The API server performs admission control and proxies the connection to the
Member:

this means that all admission plugins that gate on pods, pods/exec, and pods/attach would need to be updated to guard a new kind. Expanding the surface area an admission plugin needs to protect will become a bigger deal when we have out of tree admission plugins (the mechanism for which is in progress)

@verb (author):

I've noted this in a new security considerations section. It sounds like it will be good to get this change in prior to admission plugins becoming GA, though I would hope the plugins will deny resources they don't recognize.

@liggitt (Member), May 24, 2017:

I would hope the plugins will deny resources they don't recognize.

they do not, they select the resources/subresources they guard (and admission plugins are already GA, it's the externalized ones that are being developed)
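For concreteness, a minimal sketch of why an unrecognized subresource passes through, assuming the standard admission.Attributes interface (the plugin itself is hypothetical, and interface details vary across Kubernetes versions):

```go
package exampleplugin

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apiserver/pkg/admission"
)

// examplePlugin is a hypothetical admission plugin that gates pod exec/attach.
type examplePlugin struct{}

func (p *examplePlugin) Admit(a admission.Attributes) error {
	// Only resources the plugin knows about are examined at all.
	if a.GetResource().GroupResource() != (schema.GroupResource{Resource: "pods"}) {
		return nil
	}
	switch a.GetSubresource() {
	case "exec", "attach":
		// ...enforce policy on the connect request here...
		return nil
	default:
		// A new subresource such as "debug" falls through here: the plugin
		// does not reject what it does not recognize.
		return nil
	}
}
```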

is used because `/debug` is already used by the kubelet. `/podDebug` was
chosen to parallel existing endpoints like `/containerLogs`.
1. The kubelet instructs the Generic Runtime Manager (this feature is only
implemented for the CRI) to create a Debug Container.
Member:

how is availability of this feature determined, if only some CRI implementations support it?

@verb (author):

Perhaps this was poorly worded. What I was trying to say is that this is implemented using the CRI and not the legacy runtimes. It doesn't require any changes to the CRI and will work with any runtime implementing the interface. I've updated the wording in my copy.

It is an error to attempt to create a Debug Container with the same name as a
container that exists in the pod spec. There are no limits on the number of
Debug Containers that can be created in a pod, but exceeding a pod's resource
allocation may cause it to be evicted.
Member:

clarify what would be evicted? the pod? the debug container?

@verb (author):

The pod would be evicted. Updated the doc.

to streaming.

It is an error to attempt to create a Debug Container with the same name as a
container that exists in the pod spec. There are no limits on the number of
Member:

what component detects this error and what response is returned in that case?

@verb (author):

The apiserver detects the error and returns a BadRequest, i.e.:

https://github.com/kubernetes/kubernetes/pull/46243/files#diff-c73f80ad83608f18657d22a06950d929R470

will update doc.
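A sketch of what that check could look like, following the approach in the linked PR; the helper name is made up:

```go
package debug

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
)

// validateDebugContainerName is a hypothetical helper: reject a debug
// container whose name collides with one already in the pod spec.
func validateDebugContainerName(pod *v1.Pod, name string) error {
	for _, cs := range [][]v1.Container{pod.Spec.InitContainers, pod.Spec.Containers} {
		for _, c := range cs {
			if c.Name == name {
				// Surfaces to the client as HTTP 400 Bad Request.
				return errors.NewBadRequest(fmt.Sprintf(
					"container %q already exists in pod %q", name, pod.Name))
			}
		}
	}
	return nil
}
```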

policy.
* Explicit reattaching isn't implemented. Instead a `kubectl debug` invocation
will implicitly reattach if there is an existing, running container with the
same name. In this case container configuration will be ignored.
Member:

ignoring specified information and implicitly reattaching seems confusing, and not like something we'd want long-term

@verb (author):

I agree 100%

Member:

so if this moves from alpha to beta to stable, how do you maintain skew compatibility with older kubectl clients' debug implementations without propagating the implicit throw-away-info-and-reattach behavior into the stable version of the feature?

Member:

I think if we just expand kubectl attach to support debug containers, that would solve the problem.

@verb (author):

either way this has to be resolved prior to moving out of alpha


`startContainer()` will be updated to write a new label
`io.kubernetes.container.type` to the runtime. Existing containers will be
started with a type of `REGULAR` or `INIT`. When added in a subsequent step,
Member:

does this mean a 1.7 kubelet managing containers without this label (that were started by a 1.6 kubelet) will be confused about whether they are debug containers?

@verb (author):

No, it's backwards compatible (and there's a test). The REGULAR and INIT labels aren't used by anything initially and the kubelet behavior only differs when Type == "DEBUG". (Type will be an empty string for a container that existed prior to the feature being enabled.)
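A sketch of the compatibility logic being described; the label key is from the proposal, the helper is illustrative:

```go
// containerTypeLabelKey is the runtime label described in the proposal.
const containerTypeLabelKey = "io.kubernetes.container.type"

// isDebugContainer sketches the kubelet-side check: only an explicit
// "DEBUG" type changes behavior. A container started by an older kubelet
// carries no type label, so the lookup yields "" and the container is
// treated exactly as before.
func isDebugContainer(labels map[string]string) bool {
	return labels[containerTypeLabelKey] == "DEBUG"
}
```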


### Additional Constraints

1. Non-interactive workloads are explicitly supported. There are no plans to
Member:

If the container run and attach steps are distinct, how is stdout/stderr coordinated so that the attach request obtains the first byte written to each? Is the first attach special? How do subsequent or additional attaches behave w.r.t. previous output from the debug container's entrypoint?

@verb (author):

You're right, there's a race here since the runtime is buffering the output. The container starts and its initial output goes to the buffer (visible via kubectl logs), and then the attach picks up mid-stream. This is fine for interactive troubleshooting but might be a problem for non-interactive workloads.

Member:

if @ncdc has learned anything, it is that if an output buffer race can happen, it will, and that if it might be a problem, it will.

Member:

@liggitt the typical process for starting a docker container is (outside of kubernetes):

  1. create container
  2. attach to container
  3. start container

If you must start the container prior to attaching (which tends to be the case for things like kubectl run), then your only option to make sure you see all prior output is to specify logs=true when attaching. This has downsides: last time I checked, you can't limit the output to e.g. the last 100 lines, and if you have a TTY, I'm not sure what happens in that case. Also note, this isn't currently available in the version of the docker API vendored into Kubernetes.
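To make the ordering concrete, a minimal sketch with the Docker Go client; exact call signatures vary across client versions, and the image and container names here are arbitrary:

```go
package main

import (
	"context"
	"io"
	"os"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewEnvClient()
	if err != nil {
		panic(err)
	}

	// 1. Create the container; it exists but is not running yet.
	created, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "debian", Cmd: []string{"echo", "hello"}},
		nil, nil, "attach-ordering-demo")
	if err != nil {
		panic(err)
	}

	// 2. Attach while it is stopped, so no output can be missed.
	hijacked, err := cli.ContainerAttach(ctx, created.ID,
		types.ContainerAttachOptions{Stream: true, Stdout: true, Stderr: true})
	if err != nil {
		panic(err)
	}
	defer hijacked.Close()

	// 3. Only now start the container; the attach sees the first byte.
	if err := cli.ContainerStart(ctx, created.ID, types.ContainerStartOptions{}); err != nil {
		panic(err)
	}

	io.Copy(os.Stdout, hijacked.Reader)
}
```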

@verb (author):

@ncdc What's the longer term direction here? It would be significantly more work, but we could theoretically add create/attach/start functionality for Debug Containers. I wouldn't want to do it for an MVP, though.

Member:

I just wrote in another comment about this, but I think given the way the kubelet sync loop works, it will be very difficult to achieve create/attach/start without blocking one of the kubelet workers. We will need to discuss with sig-node if we want to pursue this.

1. Non-interactive workloads are explicitly supported. There are no plans to
support detached workloads, but doing so would be trivial with an
`attach=false` flag.
1. There are no guaranteed resources for ad-hoc troubleshooting. If
Member:

this seems like it would make debugging a pod that was memory constrained pretty difficult.

@verb (author):

I agree, but in practice that's not been a problem we've had so far and we have to start somewhere. This could be improved in the future with the planned vertical pod autoscaling feature.

@liggitt (Member) commented May 23, 2017:

how do debug containers interact with graceful termination of pods?

@verb (author) commented May 23, 2017:

@liggitt Debug containers receive the same signals as other containers in the pod for lifecycle events. They differ only in that they aren't deleted when syncing pod spec while the pod is alive.
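Not the actual kubelet code, but the lifecycle difference described here amounts to something like the following in the sync loop; types and helper are hypothetical:

```go
// runtimeContainer is a stand-in for the kubelet's internal view of a
// running container, including the type label discussed elsewhere.
type runtimeContainer struct {
	Name string
	Type string // "", "REGULAR", "INIT", or "DEBUG"
}

// shouldKillOnSync sketches the exemption: containers absent from the pod
// spec are normally killed during sync, but debug containers survive until
// the pod sandbox itself is torn down, at which point every container,
// debug or not, receives the usual termination signals.
func shouldKillOnSync(c runtimeContainer, specNames map[string]bool) bool {
	if c.Type == "DEBUG" {
		return false
	}
	return !specNames[c.Name]
}
```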

@verb (author) commented May 23, 2017:

/assign @pwittrock
/assign @liggitt


It would be reasonable for Kubernetes to provide a default container name and
image1, making the minimal possible debug command:
Member:

[embedded screenshot]

@verb (author):

Corrected, thanks.

image1, making the minimal possible debug command:

```
kubectl debug -it target-pod
```
Member:

Do you think you'd ever want to run kubectl debug and not attach stdin and use a tty?

@verb (author):

I wouldn't, no, but then again I'm not sure why the option exists in kubectl exec. I can imagine wanting to run kubectl debug target-pod -- netstat -an, but only if I'll definitely get the first byte of the output stream and that would of course work just as well with a tty.

@verb (author):

Updated the document to specify that -i & -t will be enabled by default.


replace a Debug Container that has exited by re-using a Debug Container name. It
is an error to attempt to replace a Debug Container that is still running.

One way in which `kubectl debug` differs from `kubectl exec` is the ability to
Member:

kubectl attach supports attaching to both init and normal containers. Would you want to expand it to support debug containers too? That would require the least amount of coding.

@verb (author):

Ahh, well this is a compelling alternative. I may have misunderstood your intention, but not streaming the /debug subresource and instead relying solely on attach would solve several problems. It would sidestep (though not solve) output stream coordination and allow kubectl to generate the container configuration, which is more flexible. Off the top of my head:

  1. kubectl debug would be a two-step process: run the debug container, then optionally attach. The optional attach would better support non-interactive workloads.
  2. the apiserver can't check that the debug container exists by examining Pod.Spec as it does for regular/init containers, but it should be able to check Pod.Status.DebugContainerStatuses. It's the same story for kubectl.

Output stream coordination would then be solved for Debug Containers when/if it's solved for attach.

Great idea, I'll prototype it.
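A sketch of the status-based check from point 2, using local stand-in types since the draft DebugContainerStatuses field doesn't exist in v1 today:

```go
// ContainerStatus and PodStatus mirror the draft API from this proposal;
// DebugContainerStatuses is the field the proposal adds.
type ContainerStatus struct {
	Name string
	// ...other fields as in v1.ContainerStatus...
}

type PodStatus struct {
	DebugContainerStatuses []ContainerStatus
}

// debugContainerExists is how the apiserver (or kubectl) could validate an
// attach target: the pod spec says nothing about debug containers, so the
// status list is the only record of them.
func debugContainerExists(status PodStatus, name string) bool {
	for _, s := range status.DebugContainerStatuses {
		if s.Name == name {
			return true
		}
	}
	return false
}
```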

Member:

The only true way to ensure you don't miss any output is create-container, attach-container, start-container, in that order. That's how docker run works. For something like kubectl run and probably kubectl debug too, we can't really do that, because of the way the kubelet sync loop works (kubectl waits until it sees the pod is Running before attaching). Well, I guess we could potentially do that, but it would require pausing a sync loop iteration until the remote client (kubectl) attaches, which isn't ideal.

@verb (author):

Added this info to the doc, thanks!

### Killing Debug Containers

Debug containers will not be killed automatically until the pod (specifically,
the pod sandbox) is destroyed. Unlike `kubectl exec`, Debug Containers will not
Member:

There is roundabout support in newer docker versions for killing exec sessions. It now records the pid of the process that was exec'd, and we could use that information to do a kill.

the pod sandbox) is destroyed. Unlike `kubectl exec`, Debug Containers will not
receive an EOF if their connection is interrupted. Instead, Debug Containers
must be reattached to exit a running process. This could be tricky if the
process does not allocate a TTY; in this case a second Debug Container could be
Member:

Same comment as above: why would you not allocate a tty?

@verb (author):

Unless there's a reason that non-interactive workloads might shun a TTY, I have no argument for not having one. I was just following the perceived convention of kubectl exec.

Member:

I would view kubectl debug as an interactive utility for debugging a running pod, whereas I see kubectl exec as a tool in my toolbox that might or might not require user interaction. Although the more I think about it, typically when I do use kubectl exec it's to get a shell, in which case I almost always do -it. There's a recent issue proposing that exec defaults these to true: #46300.

@verb (author):

@ncdc @liggitt since kubectl debug is mainly a tool for getting a shell with a TTY, how much do we care about output stream coordination when kubectl will already say "press enter to get a prompt"? I think it's reasonable to defer to a future general solution rather than trying to solve this for debug.

Member:

No objections to deferring



@verb verb changed the title Propose a feature to troubleshoot running pods WIP: Propose a feature to troubleshoot running pods May 24, 2017
@verb verb changed the title WIP: Propose a feature to troubleshoot running pods Propose a feature to troubleshoot running pods May 31, 2017
of Debug Containers is reported via a new field in `v1.PodStatus`, described in
a subsequent section.

#### Alternative: Extending `/exec`
@verb (author):

@dchen1107 @lavalamp @smarterclayton @pwittrock

For pod troubleshooting we need to choose between the object-based approach described above (suggested by @smarterclayton) and the exec-based approach described below (suggested by @lavalamp and resembling the "image exec" approach of the original proposal, interestingly).

13 ? Ss 0:00 bash
26 ? Ss+ 0:00 /neato
107 ? R+ 0:00 ps x
root@debug-image:~# cat /proc/26/root/etc/resolv.conf
Member:

nit: I don't think this works with docker.

This updates the Pod Troubleshooting Design Proposal for recent
developments in the community and to reflect the consensus from the API
review: using the existing /exec endpoint as a starting point for this
feature.
...
// DebugName is the name of the Debug Container. Its presence will cause
// exec to create a Debug Container rather than performing a runtime exec.
DebugName string `json:"debugName,omitempty" ...`
Member:

I vote for making a sub section:

type PodExecOptions struct {
  ...
  EphemeralContainer *PodExecEphemeralContainerSpec
}

type PodExecEphemeralContainerSpec struct {
  Name string
  Image string
}

@verb (author):

SGTM, though in my naive prototype the sub section wasn't populated from the HTTP params.

I'm not very familiar with the api machinery so I probably just missed something. I see where queryparams flattens the struct based on JSON names. Do I need to write a custom converter somewhere to get it back from params to an object? Could you point me in the right direction?

Member:

yeah, nested structs parsed from query params won't work cleanly (kubernetes/kubernetes#21476)

Member:

Earlier in the text it said this was not using query params--I thought it was using a message sent after the SPDY channel was opened.

We will extend `v1.Pod`'s `/exec` subresource to support "executing" container
images. The current `/exec` endpoint must implement `GET` to support streaming
for all clients. We don't want to encode a (potentially large) `v1.Container` as
an HTTP parameter, so we must extend `v1.PodExecOptions` with the specific
Member:

Oh, I see. I misunderstood this line.

type PodExecOptions struct {
...
// Run Command in an ephemeral container which shares some namespaces with Container.
EphemeralContainer PodExecEphemeralContainerSpec
Member:

This needs to be a pointer. Uh, but I will suggest something else since sub structs are indeed currently broken.

Member:

// PodExecOptions is the query options to a Pod's remote exec call
type PodExecOptions struct {
  ...
  // EphemeralContainerName is the name of an ephemeral container in which the
  // command ought to be run. Either both EphemeralContainerName and
  // EphemeralContainerImage fields must be set, or neither.
  EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`

  // EphemeralContainerImage is the image of an ephemeral container in which the command
  // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
  // fields must be set, or neither.
  EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
}

@verb (author):

Renamed as suggested

@lavalamp (Member):
/lgtm

@k8s-ci-robot added the lgtm label on Sep 27, 2017
@k8s-github-robot:
Automatic merge from submit-queue.

@k8s-github-robot k8s-github-robot merged commit c4d900e into kubernetes:master Sep 27, 2017
```
type PodStatus struct {
...
DebugStatuses []DebugStatus
```
Member:

should this mirror the exec option names? ephemeral statuses, etc?

Contributor:

We should use consistent naming

@lavalamp (Member), Sep 28, 2017:

Argh, yes, I was in a rush and didn't see this section. @verb can you modify to

type PodStatus struct {
  
  EphemeralContainerStatuses []v1.ContainerStatus
}

Member:

Hm. Actually were you trying to represent all exec actions with this?

Member:

(note I edited my comment above. I can't see anything you need that isn't already in v1.ContainerStatus.)

@verb (author):

@lavalamp I'd like to at least have command and args, which aren't part of ContainerStatus.


1. `kubectl` invokes the debug API as described in the preceding section.
1. The API server checks for name collisions with existing containers, performs
admission control and proxies the connection to the kubelet's
Member:

note that all admission plugins that do anything related to checking containers, images, etc, would need to be updated to check ephemeral images specified in exec options now

@verb (author):

This is noted in the "Security Considerations" section.

requests and the kubelet must return an error to all but one request.

There are no limits on the number of Debug Containers that can be created in a
pod, but exceeding a pod's resource allocation may cause the pod to be evicted.
Member:

this also would bypass admission plugins that set resource limits/range on containers. please describe the container spec that would result from an ephemeral exec request

1. `KillPod()` already operates on all running containers returned by the
runtime.
1. Containers created prior to this feature being enabled will have a
`containerType` of `""`. Since this does not match `"DEBUG"` the special
Member:

DEBUG or EPHEMERAL?

Member:

Yes please change references in the API to "ephemeral" everywhere.

@verb (author):

To be clear, this is a private label in the kubelet's runtime manager and not part of the API. I've updated it to EPHEMERAL for consistency, though.

@smarterclayton (Contributor) left a comment:

I don't see a discussion of how this will be secure (users with access to CAP_SYS_ADMIN)

* Exited Debug Containers will be garbage collected as regular containers and
may disappear from the list of Debug Container Statuses.
* Security Context for the Debug Container is not configurable. It will always
be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`.
Contributor:

So only a cluster admin should ever use debug containers?

Contributor:

Also, pods don't have security context set in many cases. Exec implicitly escalating to root is bad.

@verb (author):

This should be configurable at debug time, but that depends on the API. Some of my proposed API changes addressed this, but SIG Node and the API reviewers deadlocked on which was best. Our compromise was to proceed to alpha with the minimum possible API change, which doesn't include a configurable security context.

Created kubernetes/kubernetes#53188 to track this.

/cc @thockin

Member:

Our compromise was to proceed to alpha with the minimum possible API change, which doesn't include a configurable security context.

you have a pointer to where that discussion happened? no one from the auth/psp side was involved afaik

@verb (author):

I don't understand, you've been involved since the first draft of the proposal 10 months ago, and your input has always been most welcome. It's not too late, what would you like to see changed?

Member:

@verb how bad would it be to pass a full v1.Container?

@verb (author):

@thockin Not bad. It's what the kubelet does internally, and I've had it working in a prototype.

We would use the API described in Alternative 1 to POST a v1.Container (wrapped in a new top level object) to a new /debug subresource. Then the client would perform a separate /attach.

The API reviewers had concerns about this being a novel use of the API. Nothing else POSTs to a subresource.

Both the API reviewers and SIG Node reviewers had concerns about using v1.Container being confusing or communicating the wrong intent to the user. Debug Containers are not general purpose containers and should not be used to build services or for routine operations. Some prefer to consider "extended exec" rather than "configurable container".

Most of the fields of v1.Container do not apply to Debug Containers and should be rejected if configured (lifecycle, livenessProbe, ports, readinessProbe, resources, stdin, stdinOnce, terminationMessagePath, terminationMessagePolicy, tty, volumeMounts). We can do this with a validation whitelist, though it would be simpler to pass a securityContext rather than a full v1.Container.
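A sketch of that validation whitelist; the field list comes from the comment above (trimmed for brevity) and the helper is hypothetical:

```go
package debug

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// validateDebugContainerFields sketches the whitelist: reject any field of
// v1.Container that does not apply to a Debug Container. Only a few of the
// disallowed fields from the list above are checked here.
func validateDebugContainerFields(c *v1.Container) error {
	if c.Lifecycle != nil || c.LivenessProbe != nil || c.ReadinessProbe != nil {
		return fmt.Errorf("probes and lifecycle hooks do not apply to debug containers")
	}
	if len(c.Ports) > 0 || len(c.VolumeMounts) > 0 {
		return fmt.Errorf("ports and volumeMounts do not apply to debug containers")
	}
	if c.TerminationMessagePath != "" {
		return fmt.Errorf("terminationMessagePath does not apply to debug containers")
	}
	return nil
}
```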

Member:

The API reviewers had concerns about this being a novel use of the API. Nothing else POSTs to a subresource.

I thought scheduler POSTs to pod/binding subresource, clients POST to pod/eviction subresource, etc. Or is there a distinction between POST and PUT here? (Which I can never keep straight.)

@verb (author):

@davidopp Oh, that's good news then. When I prototyped this ~6 months ago I recall needing a couple of changes in the apiserver in order to make it work, but maybe those were specific to upgrading a connection to streaming after a POST. It's been a long ride.

Let's figure out a way to move this forward. Since we already had agreement among the reviewers at the time, and now we want to renegotiate that agreement based on new reviewers, I suggest that take the form of a new PR to amend the proposal where the new and old reviewers can work out their conflicting requirements. I'll prepare a diff.

Member:

The API reviewers had concerns about this being a novel use of the API. Nothing else POSTs to a subresource.

I thought scheduler POSTs to pod/binding subresource, clients POST to pod/eviction subresource, etc. Or is there a distinction between POST and PUT here? (Which I can never keep straight.)

Not sure about subresources, but there are issues with both PUT and POST as part of websocket requests (not all clients support them)


particular, they should enforce the same container image policy on the `Image`
parameter as is enforced for regular containers. During the alpha phase we will
additionally support a container image whitelist as a kubelet flag to allow
cluster administrators to easily constrain debug container images.
@liggitt (Member), Sep 28, 2017:

what security context settings (uid/gid, selinux, apparmor) will the debug container have? how will admission plugins that constrain/force those (like PodSecurityPolicy) govern an ephemeral container?

edit: just saw https://github.com/kubernetes/community/pull/649/files#diff-5cfb31b40ca47511743d0545d5697aa0R394

can we determine the equivalent v1.Container (including securitycontext) that would correspond to the ephemeral container? If so, we could see if a PodSecurityPolicy would allow the pod with a container with those settings/permissions

Member:

I thought @verb had worked this out with @derekwaynecarr already?

@verb (author):

@liggitt Compatibility with admission plugins is a top priority and a strict requirement. The implementation will depend a little bit on how the Kubernetes API settles, but it will be one of the following (see the sketch after this list):

  • The client may end up providing the v1.Container that creates the Ephemeral Container.
  • If we stick with the imperative, exec-style API, we can do exactly as you suggest and provide the v1.Container based on the PodExecOptions.
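A sketch of the second option: synthesizing the equivalent v1.Container from the exec options so existing admission logic can evaluate it. The flat field names follow the PodExecOptions shape proposed earlier in this thread; the options type and helper are hypothetical:

```go
package debug

import v1 "k8s.io/api/core/v1"

// EphemeralExecOptions carries the flat fields proposed for PodExecOptions.
type EphemeralExecOptions struct {
	EphemeralContainerName  string
	EphemeralContainerImage string
	Command                 []string
}

// containerForAdmission builds the v1.Container an admission plugin such as
// PodSecurityPolicy would evaluate, as suggested above.
func containerForAdmission(opts EphemeralExecOptions) v1.Container {
	return v1.Container{
		Name:    opts.EphemeralContainerName,
		Image:   opts.EphemeralContainerImage,
		Command: opts.Command,
		// SecurityContext would be filled in from whatever defaults the
		// kubelet applies; the proposal currently fixes these (e.g.
		// CAP_SYS_PTRACE and CAP_SYS_ADMIN).
	}
}
```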

* Security Context for the Debug Container is not configurable. It will always
be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`.
* Image pull policy for the Debug Container is not configurable. It will
always be run with `PullAlways`.
Member:

this prevents offline installations with pre-pulled images

@verb (author):

Good point. Created kubernetes/kubernetes#53189 to track this.

DebugStatuses []DebugStatus
}

type DebugStatus struct {
Member:

I don't see the necessity of this.

If you wanted to represent the way(s) in which the container is "dirty" after exec/attach/port-forward etc., I don't think this is the way to go.

@verb (author):

Yes, that's something we want to represent along with details about what command was run by exec (whether traditional exec or in an ephemeral container). This isn't a blocker for the alpha implementation, though, so if you think this is the wrong approach then I'll remove this bit from the proposal and we can figure out the correct way later.

/cc @dchen1107 @thockin

Member:

I think we do want some way to represent the taintedness (we should also do this for exec).

@verb (author):

@r2d4 Can you work on the best way to represent taintedness in v1.PodStatus?


@verb ack looking into it

@ndeloof commented Jun 25, 2018:

Hi @verb, I'm glad to see your repeated effort to get this feature added to k8s.

I understand the initial use-case for this effort is diagnostics on a running Pod, but I'd like to share another use-case that would benefit from this exact same improvement:

I'm working on Jenkins integration in k8s; we run containerized builds as pods. During the build execution, a new container might be required. In many cases this can be identified before the build is scheduled and set as part of the Pod's spec, but in some cases the required image is dynamically selected as part of the build. Also, being able to run containers as part of the build, just like developers do on their workstations, helps to make the build script reproducible and portable.

With a new API to add (transient) containers to a Pod, I could provide the glue code for build scripts to control such additional containers. The current practice for most users is to run a privileged (!) DinD container to host the build, or to bind-mount docker.sock from the host (!). As you can guess, I'd prefer we don't rely on this :-\

Hope this helps illustrate potential use-cases this feature could support.

@smarterclayton (Contributor) commented Jul 2, 2018 via email

@smarterclayton (Contributor) commented Jul 3, 2018 via email

@avanier commented Oct 22, 2018:

I have a keen interest in this feature. Does anyone have progress to report outside of this ticket?

Also, to echo @ndeloof's mention of Jenkins-style use cases, I have to say this is indeed quite useful. I've been running Concourse at a few shops over recent years, and using Pivotal's Garden OCI runtime they achieve exactly that.

@verb (author) commented Oct 22, 2018:

@avanier kubernetes/enhancements#277 might be better for tracking progress of this feature. The API change is under review in kubernetes/kubernetes#59416; once the API changes land, there should be quick progress.

MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Automatic merge from submit-queue.

Propose a feature to troubleshoot running pods

This feature allows troubleshooting of running pods by running a new "Debug Container" in the pod namespaces.

This proposal was originally opened and reviewed in kubernetes/kubernetes#35584.

This proposal needs LGTM by the following SIGs:
- [ ] SIG Node
- [ ] SIG CLI
- [ ] SIG Auth
- [x] API Reviewer

Work in Progress:
- [x] Prototype `kubectl attach` for debug containers
- [x] Talk to sig-api-machinery about `/debug` subresource semantics
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023
Labels
cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
lgtm ("Looks good to me", indicates that a PR is ready to be merged)
size/XL (denotes a PR that changes 500-999 lines, ignoring generated files)