RFC: KEP-4381: DRA: avoid kubelet API version dependency with REST proxy #4615

329 changes: 269 additions & 60 deletions keps/sig-node/4381-dra-structured-parameters/README.md
@@ -110,9 +110,11 @@ SIG Architecture for cross-cutting KEPs).
- [PreBind](#prebind)
- [Unreserve](#unreserve)
- [kubelet](#kubelet)
- [Managing resources](#managing-resources)
- [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin)
- [NodeListAndWatchResources](#nodelistandwatchresources)
- [REST proxy](#rest-proxy)
- [Security](#security)
- [gRPC API](#grpc-api)
- [Managing resources](#managing-resources)
- [NodePrepareResource](#nodeprepareresource)
- [NodeUnprepareResources](#nodeunprepareresources)
- [Simulation with CA](#simulation-with-ca)
@@ -531,20 +533,15 @@ the kubelet, as described below. However, the source of this data may vary; for
example, a cloud provider controller could populate this based upon information
from the cloud provider API.

In the kubelet case, each kubelet publishes a set of
`ResourceSlice` objects to the API server with content provided by the
corresponding DRA drivers running on its node. Access control through the node
authorizer ensures that the kubelet running on one node is not allowed to
create or modify `ResourceSlices` belonging to another node. A `nodeName`
field in each `ResourceSlice` object is used to determine which objects are
managed by which kubelet.

**NOTE:** `ResourceSlices` are published separately for each driver, using
whatever version of the `resource.k8s.io` API is supported by the kubelet. That
same version is then also used in the gRPC interface between the kubelet and
the DRA drivers providing content for those objects. It might be possible to
support version skew (= keeping kubelet at an older version than the control
plane and the DRA drivers) in the future, but currently this is out of scope.
In the kubelet case, each driver running on a node publishes a set of
`ResourceSlice` objects to the API server for its own resources.
These requests are [proxied by kubelet](#kubelet-rest-proxy)
and thus seen as coming from the kubelet by the apiserver.
Access control through the node
authorizer ensures that the drivers running on one node are not allowed to
create or modify `ResourceSlices` belonging to another node. The `nodeName`
and `driverName` fields in each `ResourceSlice` object are used to determine which objects are
managed by which driver instance.
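
To make the ownership model concrete, below is a minimal Go sketch of a
driver-populated `ResourceSlice` tying together node and driver. The object
name and driver name are made-up examples, the structured model payload is
omitted, and the field set reflects the `resource.k8s.io/v1alpha2` API at the
time of this KEP.

```go
package example

import (
	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleSlice shows the per-node, per-driver ownership fields; the structured
// model that would normally be embedded is left out.
func exampleSlice() resourcev1alpha2.ResourceSlice {
	return resourcev1alpha2.ResourceSlice{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "worker-1-gpu.example.com-"},
		NodeName:   "worker-1",        // the node this driver instance runs on
		DriverName: "gpu.example.com", // the driver instance managing this slice
	}
}
```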

Embedded inside each `ResourceSlice` is the representation of the resources
managed by a driver according to a specific "structured model". In the example
@@ -931,7 +928,7 @@ Several components must be implemented or modified in Kubernetes:
ResourceClaim (directly or through a template) and ensure that the
resource is allocated before the Pod gets scheduled, similar to
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/scheduling/scheduler_binder.go
- Kubelet must be extended to retrieve information from ResourceClaims
- Kubelet must be extended to manage ResourceClaims
and to call a resource kubelet plugin. That plugin returns CDI device ID(s)
which then must be passed to the container runtime.

@@ -1188,13 +1185,13 @@ drivers are expected to be written for Kubernetes.

##### ResourceSlice

For each node, one or more ResourceSlice objects get created. The kubelet
publishes them with the node as the owner, so they get deleted when a node goes
For each node, one or more ResourceSlice objects get created. The drivers
on a node publish them with the node as the owner, so they get deleted when a node goes
down and then gets removed.

All list types are atomic because that makes tracking the owner for
server-side-apply (SSA) simpler. Patching individual list elements is not
needed and there is a single owner (kubelet).
needed and there is a single owner.

```go
// ResourceSlice provides information about available
@@ -2049,6 +2046,247 @@ Unreserve is called in two scenarios:

### kubelet

#### Communication between kubelet and resource kubelet plugin

Resource kubelet plugins are discovered through the [kubelet plugin registration
mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
new "ResourcePlugin" type will be used in the Type field of the
[PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
response to distinguish the plugin from device and CSI plugins.
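
As an illustration of the registration handshake, the following hedged Go
sketch shows a plugin answering `GetInfo` with the new type string. The plugin
struct, driver name, endpoint, and supported version are placeholder values;
only the `PluginInfo` shape comes from the `pluginregistration/v1` API.

```go
package example

import (
	"context"

	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1"
)

// draPlugin is a hypothetical plugin type; only the registration answer is shown.
type draPlugin struct {
	endpoint string // path of the advertised Unix Domain socket
}

// GetInfo is called by the kubelet plugin manager after it discovers the
// registration socket.
func (p *draPlugin) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              "ResourcePlugin",  // new type string proposed above
		Name:              "gpu.example.com", // placeholder driver name
		Endpoint:          p.endpoint,
		SupportedVersions: []string{"1.0.0"}, // placeholder
	}, nil
}
```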

Under the advertised Unix Domain socket the kubelet plugin provides the
k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
with “volume” replaced by “resource” and volume specific parts removed.

#### REST proxy

Member:

/sig api-machinery auth
/cc @jpbetz @deads2k @enj

For visibility to a kube-apiserver <-- kubelet <-- client proxied API surface proposal

The read part of this is similar in some ways to what was proposed in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4188-kubelet-pod-readiness-api#proposed-api

  1. the reads / watches are passed through to kube-apiserver with the kubelet's credentials
  2. the API is a complete proxy, so client-go is usable (though tunneled over gRPC, which I'm not sure we've done before)
  3. the API allows drivers to write through to kube-apiserver using the kubelet's credentials

A lot of the discussion around that proposal seems relevant to this one (sig-arch 2023-08-24, https://docs.google.com/document/d/1BlmHq5uPyBUDlppYqAAzslVbAO8hilgjqZUTaNXUhKM/edit#bookmark=id.vud1o04xj4iv, https://www.youtube.com/watch?v=toN7t_y4zCk&t=22m45s)

Contributor Author:

The read part of this is similar in some ways to what was proposed in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4188-kubelet-pod-readiness-api#proposed-api

One key difference is that I am not proposing to add a new kubelet socket. I was worried about security implications around that. Instead, kubelet connects to plugins the same way as before and takes requests through one stream per plugin which is kept open by each plugin.

Contributor:

Given a desire to read/write effectively from the kube-apiserver and the development of validatingadmissionpolicy + serviceaccount node claims + downward API for node name that can be used to self-restrict reads (for convenience) and enforce write requirements (must be from the same node), why is the proxy a better and more secure choice?

Contributor Author:

I'm not familiar enough with that alternative to do a comparison. Can you point me to documentation?

The ask is to ensure that a DRA driver deployed as a daemonset gets RBAC permissions which allow it to read/create/update ResourceSlice objects where the "nodeName" field is the same as the node on which the pod is running.

Contributor Author:

For ResourceClaim, the ask is to ensure that the pod only gets read permission for ResourceClaims referenced by pods which have been scheduled to the node.

Contributor Author:

@deads2k: I looked through https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy and https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/. It's not clear to me how I can connect the identity of the driver pod with the node that it is running on and that identity with rules that restrict what that pod can access.

The node authorizer uses hand-crafted Go code for this.

Member:

See kubernetes/kubernetes#124711 for the VAP+SA example David is referring to.

Note that admission only covers writes, so reads cannot be restricted in this way.

Contributor Author:

This seems to be the key part:
object.metadata.name == request.userInfo.extra["authentication.kubernetes.io/node-name"][0]

Does object provide access to any field?

Is it possible to extend what gets added to the credentials? If there was a dra.kubernetes.io/driver-name, then the same check could be done for the driverName field in ResourceSlice. That would be better than what the REST proxy can do.

I'm fine with dropping the read check on ResourceClaim. Node authorizer also has a "loophole" there because it does not limit what data gets returned by watches.


Previously, kubelet retrieved ResourceClaims and published ResourceSlices on
behalf of DRA drivers on the node. The information included in those got passed
between API server, kubelet, and kubelet plugin using the version of the
resource.k8s.io API used by the kubelet. Combining a kubelet using some older API
version with a plugin using a new version was not possible because conversion
of the resource.k8s.io types is only supported in the API server and an old
kubelet wouldn't know about a new version anyway.

Keeping the kubelet at some old release while upgrading the control plane and DRA drivers
is desirable and officially supported by Kubernetes. To support the same when
using DRA, the kubelet now provides a REST proxy via gRPC that can be used by

Contributor:

This seems generally useful for a range of version skew scenarios between on-node agents and the apiserver, as well as security controls.

I'm slightly concerned as to the scalability of this approach (kubelet acting as gateway) but the upsides involved for managing the impact of per-node agents are very interesting.

Member:

I'm slightly concerned as to the scalability of this approach (kubelet acting as gateway)

OTOH, kube-apiserver will be rate-limiting requests anyway, so I doubt that it would really be kubelet that would be a gatekeeper here - in such case those would probably be throttled server side anyway.

drivers to execute HTTP requests against the API server. The drivers determine
which version of the resource.k8s.io API they use. In those few cases where the
kubelet needs to access the resource.k8s.io API itself, it does so with a dynamic client.

The alternative to implementing a generic REST proxy would have been to create
dedicated gRPC APIs for all operations that are needed by a driver. This is a
larger API and prevents using the normal client-go APIs. The way it is
implemented now, client-go in the plugin is instantiated with an HTTP
roundtripper provided by
k8s.io/dynamic-resource-allocation/restproxy. Implementations of that helper
code in languages other than Go are possible and may provide similar benefits.
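
The following is a minimal Go sketch, not the actual helper API, of how
client-go can be instantiated on top of an arbitrary `http.RoundTripper` such
as the one the restproxy package is described as providing; the host string
and function name are illustrative.

```go
package example

import (
	"net/http"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newProxiedClientset wires client-go to a roundtripper that tunnels all
// requests through the kubelet REST proxy instead of dialing the apiserver.
func newProxiedClientset(rt http.RoundTripper) (kubernetes.Interface, error) {
	cfg := &rest.Config{
		// The host is nominal; every request goes through the provided transport.
		Host: "https://kubelet-rest-proxy",
		WrapTransport: func(http.RoundTripper) http.RoundTripper {
			return rt // replace the default transport entirely
		},
	}
	return kubernetes.NewForConfig(cfg)
}
```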

##### Security

Requests originating from a driver will be sent to the API server with the
credentials of the kubelet. This has the advantage that the node authorizer can
be used to limit access to objects belonging to the node. The downside is that
a driver might abuse this to access other resources that the kubelet has
access to, like pods. To prevent this, the REST proxy filters requests by path
and method and only passes through requests that the driver is meant to
issue (ResourceSlice and ResourceClaim). Access to ResourceClaims is further
limited to read-only access by the node authorizer.
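
A rough sketch of the kind of path/method allow list described above; the
function name and exact path checks are assumptions, not the actual kubelet
code.

```go
package example

import "strings"

// requestAllowed mirrors the filtering described above: ResourceSlices may be
// read and written, ResourceClaims are read-only, everything else is rejected.
func requestAllowed(method, path string) bool {
	if !strings.HasPrefix(path, "/apis/resource.k8s.io/") {
		return false
	}
	switch {
	case strings.Contains(path, "/resourceslices"):
		return true // node authorizer still restricts writes to this node
	case strings.Contains(path, "/resourceclaims"):
		return method == "GET" // read-only for drivers
	default:
		return false
	}
}
```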

In addition, the REST proxy adds `nodeName` and `driverName` field selectors to
requests that list ResourceSlice objects. This is mostly for
convenience because for a PUT of a ResourceSlice, the driver has to be trusted
not to create an object for some other driver on the node. This cannot be
Comment on lines +2100 to +2101
Member:

This is a downside of the proxy path, right? The unstructured / dynamic client aggregation approach could have checked the driver association field, right?

Contributor Author:

Correct.

However, the "unstructured / dynamic client" then has other downsides. Right now, it can check this because we assume that all future ResourceSlices will be of the form "nodeName + driverName + other fields". Without knowing anything about "other fields", it's hard to write a controller that synchronizes the content in the ResourceSlice objects with the desired content - not impossible, but harder.

The other downside, if we consider apiserver traffic, is the use of JSON for the request bodies (both ways) instead of protobuf.

Member:

Why do we want kubelet in the middle here if drivers are going to be doing more than write-only status reporting? If they are watching / getting / reading / reconciling status, that seems more like a normal client to me. Is the only reason for the node proxy so they can piggy-back on kubelet credentials and get authorized?

Contributor Author:

The advantage is indeed that the node authorizer continues to work.

Member:

but... not really the way we want, since it lets devices stomp each other's status?

Contributor Author:

Drivers on the same node have to trust each other. They often run with elevated privileges, so they can already do much harm locally even without this shared access to ResourceSlice objects of the node.

That's still better than having to trust all drivers anywhere in the cluster.

Member:

I feel like this is the ideal scenario for us to build out the authz capabilities of node bound SA tokens (possibly combined with #4600). We have a good story for node agents that want to perform write requests via VAP+SA node ref. Having something similar to that for reads seems broadly useful. Building an entire gRPC proxy as a one-off for DRA seems like the wrong approach.

Contributor Author:

The REST proxy could be reused. But I agree, figuring out how to do this through normal auth mechanisms is the better approach. That would be out-of-scope for this KEP, though.

Instead, I'll put in some wording along the lines of:

  • DRA kubelet plugins must connect to the API server themselves to publish their own ResourceSlice objects and to read ResourceClaim objects.
  • Limiting access of the plugin is the responsibility of whoever deploys the driver. Some methods and best practices may get documented separately.

Does that sound okay? Then I'll close this PR and create a different one with that approach.

Member:

@pohly I added this to the SIG Auth meeting for May 22nd. Hopefully we can hash out a concrete path forward.

Contributor Author:

It gets even more interesting with some use cases that people have brought up for DRA: a DRA driver for a NIC runs on the node and must publish network-related information, ideally to the ResourceClaim status. It must do that itself because kubelet might not know what that information is due to version skew.

The node authorizer can limit write access to ResourceClaims that are actually in use on the node because it builds the graph (node -> pod -> claim). In contrast to ResourceSlice, there isn't a specific field in a ResourceClaim that can be checked.

checked by the REST proxy because it would have to decode the opaque request
body.
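
For illustration, a hedged sketch of how the proxy could pin the field
selector of ResourceSlice list/watch requests to the local node and driver.
The function name is hypothetical, and the rewrite is shown on the URL query,
which is where field selectors are normally carried.

```go
package example

import (
	"net/url"

	"k8s.io/apimachinery/pkg/fields"
)

// addFieldSelector pins a ResourceSlice list/watch request to the local node
// and the driver behind this proxy stream.
func addFieldSelector(query url.Values, nodeName, driverName string) {
	sel := fields.Set{
		"nodeName":   nodeName,
		"driverName": driverName,
	}.AsSelector().String()
	query.Set("fieldSelector", sel)
}
```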

##### gRPC API

What makes the implementation of the REST proxy complicated is that the kubelet
acts as the gRPC client and the plugins as gRPC servers. gRPC itself doesn't support
requests from a server to a client. The REST proxy emulates that change of
direction by creating a gRPC stream. Each message sent through that stream by
the plugin represents one REST request. The entire request body is included
directly. This works because requests are small enough. Each request has a
unique ID.

The response from the API server is delivered through multiple gRPC calls which
contain that same unique ID and incrementally provide the response headers and
the response body data as it comes in from the API server. These Reply calls
block if the plugin consumes data more slowly than the API server delivers it,
which in turn throttles reading from the API server. Delivery of the data
continues until either side closes their end of
the stream. Long-running requests, like watching a resource, are supported.
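
To make the flow concrete, here is a rough Go sketch of the kubelet-side loop,
assuming the gRPC code generated from the API below follows the usual
protoc-gen-go naming (`RESTClient`, `Proxy`, `Reply`); `handleRequest` is a
hypothetical helper.

```go
package example

import (
	"context"
	"net/http"

	restproxy "k8s.io/dynamic-resource-allocation/apis/restproxy"
)

// runProxy consumes one Proxy stream: every message received is a REST request
// from the plugin, handled concurrently and answered through Reply calls.
func runProxy(ctx context.Context, client restproxy.RESTClient, rt http.RoundTripper) error {
	stream, err := client.Proxy(ctx, &restproxy.ProxyMessage{})
	if err != nil {
		return err
	}
	for {
		req, err := stream.Recv() // one message per REST request
		if err != nil {
			return err
		}
		// Replies for different requests are correlated via the unique ID.
		go handleRequest(ctx, client, rt, req)
	}
}

// handleRequest stands in for the per-request work described in the text:
// check the allow list, forward to the apiserver via rt, then stream the
// response back in chunks with client.Reply.
func handleRequest(ctx context.Context, client restproxy.RESTClient, rt http.RoundTripper, req *restproxy.Request) {
	// ... filter, execute, chunked Reply calls ...
}
```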

This sequence diagram shows the initialization of the different components and
the execution of one REST request:

```mermaid
sequenceDiagram
participant apiserver
box kubelet
participant plugins as plugin manager
participant manager as DRA manager
participant proxy as REST proxy
participant proxyreader as REST proxy reader
end
box DRA driver plugin
participant grpc as gRPC server
participant roundtripper as REST roundtripper
participant restclient as REST client
end
Note over grpc, roundtripper: handle gRPC via registration socket<br>and plugin socket
plugins -->> grpc: pluginregistration.Registration/GetInfo
grpc -->> plugins: GetInfo response: PluginInfo{Type:DRAPlugin}
plugins -->> grpc: pluginregistration.Registration/NotifyRegistrationStatus
grpc -->> plugins:
plugins ->> manager: add plugin
Note over roundtripper,restclient: The REST client can<br>be used immediately,<br>but the roundtripper blocks until<br>it has a stream.
loop while plugin is registered
manager -->> grpc: REST/NodeObject{Name, UID}
grpc -->> manager:
manager ->>+ proxy: start
%% Strictly speaking, the proxy gets created here (one for each plugin).
%% Mermaid cannot model that when using boxes (https://github.com/mermaid-js/mermaid/issues/5023)
Note right of proxy: one proxy per plugin
proxy -->> roundtripper: REST/Proxy
restclient ->> roundtripper: GET /api/...
roundtripper -->> proxy: Request{Id:1,Method:GET, ...}
Note right of proxy: checks allow list for methods+path,<br>adds field filter (driverName, nodeName)
proxy ->>+ proxyreader: start
Note right of proxyreader: one per request
proxy ->> apiserver: HTTP GET
par Response delivery
loop while there is response data
apiserver ->> proxyreader: write response
proxyreader -->> roundtripper: ReplyMessage{ID:1,Header:...,Body:...}
roundtripper -->> proxyreader:
end
and
loop while there is response data
roundtripper ->> restclient: Read
end
end
deactivate proxyreader
roundtripper -->> proxy: close REST/Proxy stream
deactivate proxy
end
```

The corresponding gRPC API is defined in
`k8s.io/dynamic-resource-allocation/apis/restproxy`:

```
// REST is the gRPC service on the side which issues REST requests.
//
// Because a gRPC server cannot send requests to its client,
// a stream gets established by the client where each response
// is a REST request, which then gets handled by the client.
service REST {
// Proxy is called by the REST proxy to enable sending
// REST requests. It gets called again after errors.
//
// Each stream response is a single REST request. The response
// is returned by the proxy through one or more Reply
// calls.
rpc Proxy (ProxyMessage)
returns (stream Request) {}

// Reply provides part of the response for a REST request.
rpc Reply (ReplyMessage)
returns (ReplyResponse) {}

// NodeObject is called as soon as kubelet has information
// about its node object. It's not called when used elsewhere.
rpc NodeObject(NodeObjectRequest)
returns (NodeObjectResponse) {}
}

message ProxyMessage {
// Intentionally empty.
}

message Request {
// Id is used as identifier for all response messages for this
// request. It is included in all ReplyMessages for this Request.
int64 id = 1;

string method = 2;
string path = 3;
string rawQuery = 4;
map<string, RESTHeader> header = 5;

// Body contains the entire request body data.
bytes body = 6;
}

message RESTHeader {
repeated string values = 1;
}

// ReplyMessage is one of many replies that are sent
// by the proxy for each Request. If the error and/or close are set,
// then the request has failed and no further replies are going to
// be sent.
//
// The proxy waits for the ReplyResponse before sending the next
// ReplyMessage. This ensures that the gRPC server receives
// the body chunks in the right order.
message ReplyMessage {
// Id matches the Id in the Request that this reply belongs to.
int64 id = 1;

// Error is set if and only if executing the request encountered a problem.
string error = 2;

// Close indicates that the end of the body has been reached.
bool close = 3;

// Header contains the response header from the REST server. It is
// set in all reply messages.
ResponseHeader header = 4;

// BodyOffset is the index of the body data in the overall response body.
int64 body_offset = 5;

// Body contains some of the response body data.
// The entire data is provided in chunks in multiple
// replies. A reply may provide an error, indicate the
// end of the response data, and contain some more data.
bytes body = 6;
}

message ResponseHeader {
string status = 1; // e.g. "200 OK"
int32 status_code = 2; // e.g. 200
string proto = 3; // e.g. "HTTP/1.0"
int32 proto_major = 4; // e.g. 1
int32 proto_minor = 5; // e.g. 0
map<string, RESTHeader> header = 6;

// ContentLength is the total expected length of the response body.
int64 content_length = 7;
}

message ReplyResponse {
// Close is true if the client is not interested in receiving more reply data.
bool close = 1;
}

message NodeObjectRequest {
string name = 1;
string uid = 2;
}

message NodeObjectResponse {
}
```

Adding `NodeObject` to this gRPC interface was done because it was
convenient. The information about the node is used by the ResourceSlice
controller, not the REST proxy itself.

#### Managing resources

kubelet must ensure that resources are ready for use on the node before running
@@ -2068,38 +2306,13 @@ successfully before allowing the pod to be deleted. This ensures that network-at
for other Pods, including those that might get scheduled to other nodes. It
also signals that it is safe to deallocate and delete the ResourceClaim.

The kubelet uses a specific version of the resource.k8s.io API for these
checks. Version skew between kubelet and the control plane is supported as long
as the apiserver still provides ResourceClaim objects with the version needed
by the kubelet.

![kubelet](./kubelet.png)

#### Communication between kubelet and resource kubelet plugin

Resource kubelet plugins are discovered through the [kubelet plugin registration
mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
new "ResourcePlugin" type will be used in the Type field of the
[PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
response to distinguish the plugin from device and CSI plugins.

Under the advertised Unix Domain socket the kubelet plugin provides the
k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
with “volume” replaced by “resource” and volume specific parts removed.

##### NodeListAndWatchResources

NodeListAndWatchResources returns a stream of NodeListAndWatchResourcesResponse objects.
At the start and whenever resource availability changes, the
plugin must send one such object with all information to the kubelet. The
kubelet then syncs that information with ResourceSlice objects.

```
message NodeListAndWatchResourcesRequest {
}

message NodeListAndWatchResourcesResponse {
repeated k8s.io.api.resource.v1alpha2.ResourceModel resources = 1;
}
```

##### NodePrepareResource

This RPC is called by the kubelet when a Pod that wants to use the specified
@@ -2155,20 +2368,16 @@ message Claim {
// The name of the Resource claim (ResourceClaim.meta.Name)
// This field is REQUIRED.
string name = 3;
// Resource handle (AllocationResult.ResourceHandles[*].Data)
// This field is OPTIONAL.
string resource_handle = 4;
// Structured parameter resource handle (AllocationResult.ResourceHandles[*].StructuredData).
// This field is OPTIONAL. If present, it needs to be used
// instead of resource_handle. It will only have a single entry.
//
// Using "repeated" instead of "optional" is a workaround for https://github.com/gogo/protobuf/issues/713.
repeated k8s.io.api.resource.v1alpha2.StructuredResourceHandle structured_resource_handle = 5;
}
```

`resource_handle` and `structured_resource_handle` will be set depending on how
the claim was allocated. See also KEP #3063.
The allocation result is intentionally not included here. The content of that
field is version-dependent. The kubelet would need to discover in which version
each plugin wants the data, then potentially get the claim multiple times
because only the apiserver can convert between versions. Instead, each plugin
is required to get the claim itself using the REST proxy. In the most common
case of one plugin per claim, that doubles the number of GETs for each claim
(once by the kubelet, once by the plugin).
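
As an illustration of that extra GET, a hedged sketch of a plugin fetching the
claim itself through the proxied clientset during `NodePrepareResources`; the
`driver` type and method name are hypothetical.

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// driver holds the proxied clientset created earlier.
type driver struct {
	clientset kubernetes.Interface
}

// prepareClaim fetches the claim in whatever resource.k8s.io version the
// plugin was built against and reads the allocation result from its status.
func (d *driver) prepareClaim(ctx context.Context, namespace, name string) error {
	claim, err := d.clientset.ResourceV1alpha2().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	_ = claim.Status.Allocation // version-specific data interpreted by the driver
	return nil
}
```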

```
message NodePrepareResourcesResponse {