Distributed provisioning #838

Merged · 24 commits · Jan 20, 2021

Conversation

pohly
Contributor

@pohly pohly commented Dec 18, 2020

This changes the architecture of PMEM-CSI such that the controller part only implements the webhooks and provisioning happens on the nodes.

This PR is based on some pending changes (PR #825, PR #836) and uses the external-provisioner canary image because v2.1.0 has been tagged, but the versioned image is not available yet. Therefore the PR is ready for review and further testing, but merging should wait.

Fixes: #823, #733, #729, #594, #526, #507

@pohly pohly requested a review from avalluri December 18, 2020 11:06
@pohly pohly force-pushed the distributed-provisioning branch 3 times, most recently from 8e75f25 to 753962f Compare December 18, 2020 18:52
@pohly
Contributor Author

pohly commented Dec 18, 2020

I've rebased, so only new commits are present now.

I also re-enabled the stress test. It passed in local testing.

VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
// TODO: replace name, create secret via cert-manager
Contributor Author

This is a bit more complicated. When deploying via OLM, we need OLM to create the certificates. My understanding is that this can be done for "normal" webhooks, just not for conversion webhooks.

For deployment of the operator via YAML files we need to document how to set up those certs.

In both cases, re-configuring the cluster to use the webhooks is something that needs to be done manually.

If the user of the operator doesn't need the scheduler extensions, then it's still okay to start the statefulset unconditionally. The pod simply won't run.

Contributor

This is a bit more complicated. When deploying via OLM, we need OLM to create the certificates. My understanding is that this can be done for "normal" webhooks, just not for conversion webhooks.

I doubt that OLM can create certificates for services created by the operator for the CRs, though it provides service certificates for the operator itself.

Contributor

@pohly How do the scheduler extension and the Pod mutation work in the context of multiple PmemCSIDeployments? I think we should not/cannot support multiple deployments.

Contributor Author
@pohly pohly Jan 11, 2021

For the mutating pod webhook: it must use a resource name which includes the driver name. I thought it already did this, but currently it doesn't:

Resource = "pmem-csi.intel.com/scheduler"

For the scheduler extender: it must be triggered by the same resource name (driver name to be added) and then check only pods that belong to that instance - already implemented:

if label.GetName() == "driver_name" &&
label.GetValue() != c.driverName {
return 0, wrongPod
}
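For illustration, a minimal sketch of how the mutating webhook could derive that extended resource name from the configured driver name and add it to a pod; the variable names and the request value are assumptions, not the actual PMEM-CSI code:

// Per-deployment resource name built from the driver name, so that each
// PmemCSIDeployment instance only triggers its own scheduler extender.
resourceName := corev1.ResourceName(driverName + "/scheduler")
if pod.Spec.Containers[0].Resources.Requests == nil {
    pod.Spec.Containers[0].Resources.Requests = corev1.ResourceList{}
}
pod.Spec.Containers[0].Resources.Requests[resourceName] = resource.MustParse("1")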

Contributor Author

I doubt that OLM can create certificates for services created by the operator for the CRs, though it provides service certificates for the operator itself.

When we discussed it this morning, my conclusion was that the user needs to supply TLS credentials and that the webhooks need to be disabled if not provided.

Contributor Author

In both cases, re-configuring the cluster to use the webhooks is something that needs to be done manually.

Have I mentioned how much I hate configuring a scheduler extension? 😢

The way we currently do it is by creating the certificates first, then setting up the "kubeadm init" configuration so that those certificates can be used by kube-scheduler. This no longer works when creating those certificates inside the cluster with a tool like cert-manager.

It is possible to set up the cluster without the extension enabled, then create certificates, and then reconfigure kube-scheduler. kubeadm upgrade supports that. But this looks complicated. I think I'll stick with the current approach.

- -nodeid=$(KUBE_NODE_NAME)
- -mode=webhooks
- -schedulerListen=:8000
- -drivername=$(PMEM_CSI_DRIVER_NAME)
Contributor Author

We still need the external-provisioner in the controller as a fallback for kubernetes-csi/external-provisioner#544

Contributor Author

... or a custom controller which removes the selected-node annotation.
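A minimal sketch of that custom controller's core step, assuming a client-go clientset and the selected-node annotation key used for late binding (both are assumptions here):

// Clear the node assignment so that the PVC goes back to kube-scheduler;
// pvc is a *corev1.PersistentVolumeClaim, client a kubernetes.Interface.
delete(pvc.Annotations, "volume.kubernetes.io/selected-node")
_, err := client.CoreV1().PersistentVolumeClaims(pvc.Namespace).Update(ctx, pvc, metav1.UpdateOptions{})
if err != nil {
    // a real controller would retry on conflicts
}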

Contributor Author

... or just set AllowedTopologies in our storage classes for late binding (kubernetes-csi/external-provisioner#544 (comment)).

patch: |-
- op: add
path: /spec/template/spec/containers/0/command/-
value: -mode=webhooks
Contributor

What does this "webhooks" mode do? Isn't it that the valid modes are 'controller' and 'node'?

Contributor

Okay, after going through the design documentation changes, I understood that the new 'webhooks' mode replaces the existing 'controller' mode. But the driver command-line flags in pkg/pmem-csi-driver/main.go are not synced with this.

Contributor Author

Fixed.
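For reference, a hedged sketch of what the mode validation in DriverMode.Set could look like after the rename; the constant names and error text are assumptions:

func (mode *DriverMode) Set(value string) error {
    switch value {
    case string(Webhooks), string(Node):
        // "controller" is gone; only the webhooks and node modes remain
        *mode = DriverMode(value)
        return nil
    default:
        return fmt.Errorf("invalid driver mode: %s", value)
    }
}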

path: /spec/template/spec/containers/0/command/-
value: -kube-api-burst=10
- op: remove
path: /spec/template/spec/containers/1
Contributor

Why can't we make these changes (removing the external provisioner from the controller) in our base driver YAML instead of in a kustomize layer?

Contributor Author

We can, and I thought I had made that change when squashing. I'll check again... it looks like I simply forgot to remove this file. It shouldn't be needed anymore.

Initially it wasn't done because distributed and centralized provisioning were both supported for a while.

Contributor Author

Fixed.

@@ -155,88 +155,35 @@ Kata Container support has to be enabled explicitly via a [storage
class parameter and Kata Containers must be set up
appropriately](install.md#kata-containers-support)

## Driver modes
Contributor

The reference is not removed from the table of contents.

docs/design.md Outdated

PMEM-CSI solves this problem by deploying `external-provisioner`
alongside each node driver and enabling ["distributed
provisioning"](https://github.com/kubernetes-csi/external-provisioner/issues/487):
Contributor
@avalluri avalluri Jan 4, 2021

Wouldn't it be better to link to the documentation, if there is any, instead of the issue?

Contributor Author

Yes, now that it is merged. Will change it.

Contributor Author

Fixed.

on which nodes the PMEM-CSI driver is running and how much capacity is
available there. This information is retrieved by dynamically
discovering PMEM-CSI pods and connecting to their [metrics
endpoint](/docs/install.md#metrics-support).
Contributor

This makes metrics support a hard dependency of the scheduler extension. The code should enforce this by always enabling node metrics when the scheduler extension is enabled.

Contributor Author

Does the node part know whether scheduler extensions are enabled? I don't think so, because those parameters are passed to the controller part.

Contributor Author

Or did you mean some other code than the main.go and pmem-csi-driver.go?

Contributor

Does the node part know whether scheduler extensions are enabled? I don't think so, because those parameters are passed to the controller part.

Yes, you are right, it is not possible to enforce this in the code. It would be good to add a note about enabling metrics while using the scheduler extension in the documentation.

Contributor Author

It's documented that the scheduler extension code uses metrics. I think that is enough. We don't describe how to disable metrics support (it's always on in YAML and operator deployments), so the risk that someone accidentally deploys incorrectly is low.

docs/install.md Outdated
@@ -8,7 +8,7 @@
- [Install PMEM-CSI driver](#install-pmem-csi-driver)
- [Install using the operator](#install-using-the-operator)
- [Install from source](#install-from-source)
- [Expose persistent and cache volumes to applications](#expose-persistent-and-cache-volumes-to-applications)
- [Expose volumes to applications](#expose-volumes-to-applications)
Contributor

Links in the table of contents are broken.

Contributor Author

Fixed, also in the body.

@@ -71,7 +67,7 @@ func Main() int {
klog.V(3).Info("Version: ", version)

if config.schedulerListen != "" {
if config.Mode != Controller {
if config.Mode != Webhooks {
Contributor

I feel that offloading the webhook and scheduler extension part from the CSI driver into a separate binary makes more sense and is cleaner than treating it as a mode of the driver.

Contributor Author

What's the benefit? Have you considered the drawbacks?

"Cleaner" is subjective.

Contributor Author

Is this perhaps about publishing one image that has multiple purposes vs. two different images with different content and different purposes?

I prefer to stick with a single image until someone has a strong argument against it. I doubt that the size of an image that only runs once in a cluster matters that much, and publishing a single image is simpler, so I prefer that.

When we have a single image, then a single binary is likely to be smaller overall.

What that binary is called and how it is invoked hardly matters for end-users - they never invoke the binary. Renaming the binary is possible, but makes this PR more complex for (IMHO) little gain.

Contributor

My point was not about different images, but about a different binary. Now that there is no central controller driver component, IMHO the scheduler code does not belong in the driver code.

Contributor Author

We discussed this in the standup meeting.

"pmem-csi-driver" has already before served multiple different purposes and we agreed to not change that in this PR because doing so would make our image larger for not much gain (basically cleaner command line arguments of the binary which isn't that important).

We also raised the question whether "pmem-csi-operator" should be a separate binary. Probably not. But again, that is a different issue.

@pohly
Contributor Author

pohly commented Jan 14, 2021

I've pushed a major update: now the operator supports configuration of the webhooks and distributed provisioning. Some E2E tests passed (in particular the ones for the operator), but for full coverage I let the CI do its work overnight.

I'll double-check tomorrow, but I think there is only one minor TODO left (the one about reading the node port from the shell script in deploy.go).

I've not tried to ensure that all tests pass for all commits, so strictly speaking I would have to squash "distributed provisioning", "deploy: consistent object naming, app.kubernetes.io labels" and "operator: distributed provisioning, scheduler extensions". But I think it's easier to review when they are separate.

@pohly
Contributor Author

pohly commented Jan 15, 2021

I'll double-check tomorrow, but I think there is only one minor TODO left (the one about reading the node port from the shell script in deploy.go).

That referred to TODOs inside the patch itself. I still need to address review comments.

@@ -104,9 +98,6 @@ type Config struct {
NodeID string
//Endpoint exported csi driver endpoint
Endpoint string
//TestEndpoint adds the controller service to the server listening on Endpoint.
//Only needed for testing.
TestEndpoint bool
//Mode mode of the driver
Mode DriverMode
//RegistryEndpoint exported registry server endpoint
Contributor

RegistryEndpoint is no longer needed and could be removed.

Contributor Author

Removed (but not pushed yet).

@pohly
Contributor Author

pohly commented Jan 15, 2021

I've pushed an update with various test fixes. Several tests still used the old labels and/or had issues with the optional controller part.

@@ -188,6 +196,16 @@ func (csid *csiDriver) Run() error {
)
csid.gatherers = append(csid.gatherers, cmm.GetRegistry())

var client kubernetes.Interface
if config.schedulerListen != "" ||
csid.cfg.Mode == Node {
Contributor

I think the check is supposed to be csid.cfg.Mode == Webhooks; in that case we could remove this check and move the client creation under the webhooks section of the switch statement.

Contributor Author

Indeed, good catch. The initialization of cmm can also be moved.
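A rough sketch of how that restructuring could look; the config loading and error handling here are illustrative assumptions, not the actual code:

switch csid.cfg.Mode {
case Webhooks:
    // Only the webhooks mode talks to the apiserver directly.
    restConfig, err := rest.InClusterConfig()
    if err != nil {
        return fmt.Errorf("connect to apiserver: %v", err)
    }
    client, err := kubernetes.NewForConfig(restConfig)
    if err != nil {
        return fmt.Errorf("create Kubernetes client: %v", err)
    }
    // ... pass client to the scheduler extender and webhook setup;
    // per the comment above, the cmm initialization could move here too ...
case Node:
    // the node part does not need its own Kubernetes client here
}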

kind: ServiceAccount
metadata:
labels:
pmem-csi.intel.com/deployment: lvm-testing
Contributor

Is keeping this label (and also the rest of this change) intentional?

Contributor Author

Yes. The usage of the "pmem-csi.intel.com/deployment" label to mark different kinds of deployments in deploy.go doesn't change.


That secret must contain the following data items:
- `ca.crt`: root CA certificate
- `tls.key`: secret key of the webhook
Contributor

Wouldn't it be better to mention that the secret/public key is for both the webhook and the scheduler extender?

Contributor Author

The scheduler extender is also a webhook. Both run in the controller.

With that in mind, it's better to say above that "The controller needs TLS certificates" - I'll use that instead. The name of the API was already chosen accordingly.

@@ -377,6 +424,7 @@ var subObjectHandlers = map[string]redeployObject{
},
"metrics service": {
objType: reflect.TypeOf(&corev1.Service{}),
enabled: controllerEnabled,
Contributor

Can't the node metrics service be enabled regardless of the controller?

Contributor Author

Node metrics are always enabled. There's nothing in the operator API to control that.

# It must match the "provisioner" field above.
- key: pmem-csi.intel.com/driver
values:
- started
Contributor Author

Looks like this breaks skew testing; I've not looked into it yet, but probably it is related to the older deployment not setting this label.

I wanted to ensure that only nodes where the driver actually runs (or at least ran) are used, vs. nodes where it is supposed to run. The latter is what we would get when using the label that admins need to set for the nodes with PMEM.

Will have to think about this some more...

Contributor Author

The problem is that the 0.8.0 deployment doesn't handle the new topology:

pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.703703       1 tracing.go:19] GRPC call: /csi.v1.Controller/CreateVolume
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.703770       1 tracing.go:20] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"pmem-csi.intel.com/driver":"started"}}],"requisite":[{"segments":{"pmem-csi.intel.com/driver":"started"}}]},"capacity_range":{"required_bytes":115343360},"name":"pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0","parameters":{"eraseafter":"true"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.704225       1 controllerserver-master.go:181] Controller CreateVolume: Name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0 required_bytes:115343360 limit_bytes:0
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.704251       1 controllerserver-master.go:86] Controller CreateVolume: Create VolumeID:pvc-26-328dad453a5df7f8348d57f9f7422e6b084a58303f5400af6f92906a based on name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0
pmem-csi-controller-0/pmem-driver@pmem..ker1: W0118 11:38:34.704273       1 controllerserver-master.go:236] failed to connect to : No node registered with id: 

I'm a bit surprised that "pmem-csi.intel.com/node" doesn't show up at all; might be related to how we configure the external-provisioner. Anyway, we can't change our topology labels without breaking compatibility, so I'll switch to using the node selector labels that we use for the PMEM-CSI node pods also for the storage classes.

Contributor Author

Using our node selector label also doesn't work because the external-provisioner only seems to allow node labels that were set by the CSI driver:

pmem-csi-intel-com-node-q94vv/external-provisioner@pmem..ker2: I0118 14:46:54.540845       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"skew-4425", Name:"direct-testingv4nl9", UID:"fb94e3a9-62e5-44ec-b48b-33471452142c", APIVersion:"v1", ResourceVersion:"1204", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-skew-4425-unused": error generating accessibility requirements: selected node '"pmem-csi-pmem-govm-worker2"' topology 'map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2]' is not in allowed topologies: [map[feature.node.kubernetes.io/memory-nv.dax:true]]

Contributor Author

At this point, running a "deprovisioner" in the PMEM-CSI controller looks like the simplest solution.

Contributor Author

I've implemented that, albeit with the name "(PVC) rescheduler" because that's what it is.

I decided to keep the commits for the previous two attempts because it is something that we might want to refer to. I could also squash.

I used the existing external-provisioner helper lib. It's not a perfect fit because it needs a PV informer without ever doing anything with PVs, but using it was simpler than writing everything from scratch (which probably would have amounted to a lot of copy-and-paste).

I've pushed after some local testing ("make test", the new late binding test, some skew testing).

@avalluri can you take another look?

Contributor Author

I've filed kubernetes-csi/external-provisioner#548 to discuss why we can't use the node label in AllowedTopology. With a modified external-provisioner we might be able to restore that solution, which would be better than the current solution alone because then a bad node selection should happen a lot less. Even if we can enable that again, rescheduling as a second line of defense will still be useful (think "node selected based on labels", "labels change", "provisioning no longer possible" -> "must reschedule").

This can be run concurrently with creating volumes and will then show
how objects change over time.
The previous error message only listed the last path that had been
tried. Listing all the paths that were checked makes it more obvious
what might be missing.
User input is already checked in DriverMode.Set, therefore we can
assume that all DriverMode strings are valid.
This is useful for debugging outside of a cluster.
By moving the code into the device manager package it becomes
reusable, for example in upcoming tests.

While at it, the resync period gets moved and increased. The code
shouldn't depend on resyncing to catch up with the current state of
the world.
Events get recorded in one goroutine and checked in another.
We must protect the data structure where we store events.
The newer runtime has a nicer API:
- client.Object combines access to meta data and object data,
  which in some cases means we can drop one parameter
  or extra code
- context is passed explicitly to Reconcile()
This is merely a precaution. Right now, the existing initial events
are secrets which should be read-only and thus safe to share.
The Go runtime sometimes fails the deployment test with an error about
a concurrent write access to a map (copied below, see also
intel#852). It's not clear what
that map is. It looks like it's related to decoding into the same
object, but objects shouldn't be shared between goroutines.

As a workaround we can prevent parallel execution of Patch in the
test. This is less restrictive than preventing parallel execution
altogether.

fatal error: concurrent map writes

goroutine 1498 [running]:
runtime.throw(0x17a34b4, 0x15)
	/nvme/gopath/go-1.15.2/src/runtime/panic.go:1116 +0x72 fp=0xc0019627a0 sp=0xc001962770 pc=0x439552
runtime.mapassign(0x15d4600, 0xc000206bd0, 0xc00051a7f0, 0xc001513440)
	/nvme/gopath/go-1.15.2/src/runtime/map.go:585 +0x5c5 fp=0xc001962820 sp=0xc0019627a0 pc=0x411ca5
reflect.mapassign(0x15d4600, 0xc000206bd0, 0xc00051a7f0, 0xc00051a820)
	/nvme/gopath/go-1.15.2/src/runtime/map.go:1319 +0x3f fp=0xc001962850 sp=0xc001962820 pc=0x46abbf
github.com/modern-go/reflect2.(*UnsafeMapType).UnsafeSetIndex(...)
	/nvme/gopath/pkg/mod/github.com/modern-go/reflect2@v1.0.1/unsafe_map.go:76
github.com/json-iterator/go.(*mapDecoder).Decode(0xc0007ed090, 0xc001b40ba8, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_map.go:180 +0x1a5 fp=0xc0019628d0 sp=0xc001962850 pc=0x880f65
github.com/json-iterator/go.(*placeholderDecoder).Decode(0xc000998130, 0xc001b40ba8, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect.go:324 +0x47 fp=0xc0019628f8 sp=0xc0019628d0 pc=0x879ea7
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000ba0100, 0xc001b40b18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962980 sp=0xc0019628f8 pc=0x890f58
github.com/json-iterator/go.(*generalStructDecoder).decodeOneField(0xc000ba0960, 0xc001b40b18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:552 +0x217 fp=0xc0019629f8 sp=0xc001962980 pc=0x88e577
github.com/json-iterator/go.(*generalStructDecoder).Decode(0xc000ba0960, 0xc001b40b18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:508 +0x85 fp=0xc001962a68 sp=0xc0019629f8 pc=0x88e065
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000ba1f20, 0xc001b40a20, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962af0 sp=0xc001962a68 pc=0x890f58
github.com/json-iterator/go.(*twoFieldsStructDecoder).Decode(0xc0007db380, 0xc001b40a20, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:617 +0xab fp=0xc001962b58 sp=0xc001962af0 pc=0x88ebab
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000e8c6a0, 0xc001b40a18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962be0 sp=0xc001962b58 pc=0x890f58
github.com/json-iterator/go.(*fiveFieldsStructDecoder).Decode(0xc00058d8c0, 0xc001b40a18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:743 +0xcb fp=0xc001962c48 sp=0xc001962be0 pc=0x88f5eb
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000e8d6e0, 0xc001b40900, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962cd0 sp=0xc001962c48 pc=0x890f58
github.com/json-iterator/go.(*fiveFieldsStructDecoder).Decode(0xc00058daa0, 0xc001b40900, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:741 +0x2f2 fp=0xc001962d38 sp=0xc001962cd0 pc=0x88f812
github.com/json-iterator/go.(*Iterator).ReadVal(0xc001513440, 0x176d840, 0xc001b40900)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect.go:79 +0xc2 fp=0xc001962da8 sp=0xc001962d38 pc=0x877d42
github.com/json-iterator/go.(*frozenConfig).Unmarshal(0xc0001e4fa0, 0xc002213300, 0x11f7, 0x1300, 0x176d840, 0xc001b40900, 0x0, 0x0)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/config.go:348 +0xb7 fp=0xc001962e08 sp=0xc001962da8 pc=0x86da17
k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Decode(0xc0000e8aa0, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0x2b, 0x2b, 0xc000044390, ...)
	/nvme/gopath/pkg/mod/k8s.io/apimachinery@v0.19.1-rc.0/pkg/runtime/serializer/json/json.go:263 +0x602 fp=0xc001963130 sp=0xc001962e08 pc=0x10ab082
k8s.io/apimachinery/pkg/runtime/serializer/recognizer.(*decoder).Decode(0xc000230240, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0xc0007270e0, 0x7f510e3e3498, 0xc0007270e0, ...)
	/nvme/gopath/pkg/mod/k8s.io/apimachinery@v0.19.1-rc.0/pkg/runtime/serializer/recognizer/recognizer.go:107 +0x4f7 fp=0xc001963218 sp=0xc001963130 pc=0x10a9377
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Decode(0xc0007270e0, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0x3, 0x4, 0x19752a0, ...)
	/nvme/gopath/pkg/mod/k8s.io/apimachinery@v0.19.1-rc.0/pkg/runtime/serializer/versioning/versioning.go:136 +0xa9 fp=0xc001963300 sp=0xc001963218 pc=0x10b0bc9
sigs.k8s.io/controller-runtime/pkg/client/fake.(*fakeClient).Patch(0xc00376a060, 0x1978f60, 0xc003ad2fc0, 0x1948e40, 0xc001b40900, 0x194e480, 0xc002cb60a0, 0x0, 0x0, 0x0, ...)
	/nvme/gopath/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/client/fake/client.go:378 +0x747 fp=0xc001963610 sp=0xc001963300 pc=0x14bf607
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.(*testClient).Patch(0xc00376a080, 0x1978f60, 0xc003ad2fc0, 0x1948e40, 0xc001b40900, 0x194e480, 0xc002cb60a0, 0x0, 0x0, 0x0, ...)
	<autogenerated>:1 +0xb6 fp=0xc001963680 sp=0xc001963610 pc=0x14c9596
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*objectPatch).apply(0xc0023af848, 0x1978f60, 0xc003ad2fc0, 0x198a040, 0xc00376a080, 0x0, 0x0)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:192 +0x6a4 fp=0xc0019637c0 sp=0xc001963680 pc=0x13cbee4
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).redeploy(0xc0036020c0, 0x1978f60, 0xc003ad2fc0, 0xc000a26b40, 0x199d160, 0x176d840, 0x0, 0x183dc88, 0x183dce0, 0x183dd20, ...)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:329 +0x37d fp=0xc0019638f8 sp=0xc0019637c0 pc=0x13cd49d
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).reconcile.func1(0xc003602200, 0x179457c)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:233 +0x1d2 fp=0xc001963ab0 sp=0xc0019638f8 pc=0x13debd2
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).reconcile(0xc0036020c0, 0x1978f60, 0xc003ad2fc0, 0xc000a26b40, 0x2, 0x2)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:242 +0x2e4 fp=0xc001963c00 sp=0xc001963ab0 pc=0x13cc244
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*ReconcileDeployment).Reconcile(0xc000a26b40, 0x0, 0x0, 0xc003014440, 0x16, 0xc002103e00, 0x0, 0x0, 0x0)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller.go:324 +0x774 fp=0xc001963dd8 sp=0xc001963c00 pc=0x13d7354
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.testReconcile(0xc000834d80, 0x19407a0, 0xc000a26b40, 0xc003014440, 0x16, 0x0)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:154 +0x5a fp=0xc001963e48 sp=0xc001963dd8 pc=0x14c181a
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.testReconcilePhase(0xc000834d80, 0x19407a0, 0xc000a26b40, 0x198a040, 0xc00376a080, 0xc003014440, 0x16, 0x0, 0x179520f, 0x7)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:164 +0x6f fp=0xc001963e90 sp=0xc001963e48 pc=0x14c1a2f
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.TestDeploymentController.func1.9.1.1(0xc000088300)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:575 +0x769 fp=0xc001963f68 sp=0xc001963e90 pc=0x14c5549
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.TestDeploymentController.func1.9.1.2(0xc0009a9b00)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:584 +0x2a fp=0xc001963f80 sp=0xc001963f68 pc=0x14c56aa
testing.tRunner(0xc0009a9b00, 0xc000d900b0)
	/nvme/gopath/go-1.15.2/src/testing/testing.go:1127 +0xef fp=0xc001963fd0 sp=0xc001963f80 pc=0x521a6f
runtime.goexit()
	/nvme/gopath/go-1.15.2/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc001963fd8 sp=0xc001963fd0 pc=0x471081
created by testing.(*T).Run
	/nvme/gopath/go-1.15.2/src/testing/testing.go:1178 +0x386
The code had two problems.

It stored pointers to the objects passed in by the caller, which is
dangerous because those instances might be reused by the caller, in
which case the content changes. In practice this doesn't seem to
happen, but we can't know that.

It relied on receiving events in a certain order. The apiserver
and (to some extent) the fake client may combine events, which
apparently can also cause them to be delivered out-of-order. This has
happened in practice and broke the
test (intel#852).

The solution is to de-duplicate after sorting. Because the order
depends on what we compare against, it is done during the actual
comparison together with de-duplication.
This is redundant when the scheduler extensions are active, but when
those are disabled, then allowedTopologies prevents scheduling
of pods onto nodes where the driver is not running.

The central provisioner recovered from that by removing the bad
"selected node" annotation. With distributed provisioning such a pod
would get stuck forever.

Testing of the scenario without scheduler extensions is covered by
adding late binding tests to "operator-direct-production". Using an
alternative driver name is covered by "operator-lvm-production".
The previous attempt with a fixed label did not work when downgrading
to 0.8.0:

pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.703703       1 tracing.go:19] GRPC call: /csi.v1.Controller/CreateVolume
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.703770       1 tracing.go:20] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"pmem-csi.intel.com/driver":"started"}}],"requisite":[{"segments":{"pmem-csi.intel.com/driver":"started"}}]},"capacity_range":{"required_bytes":115343360},"name":"pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0","parameters":{"eraseafter":"true"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.704225       1 controllerserver-master.go:181] Controller CreateVolume: Name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0 required_bytes:115343360 limit_bytes:0
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.704251       1 controllerserver-master.go:86] Controller CreateVolume: Create VolumeID:pvc-26-328dad453a5df7f8348d57f9f7422e6b084a58303f5400af6f92906a based on name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0
pmem-csi-controller-0/pmem-driver@pmem..ker1: W0118 11:38:34.704273       1 controllerserver-master.go:236] failed to connect to : No node registered with id:

This attempt uses the node selector label for the PMEM-CSI node pod in
the allowedTopologies field. Unfortunately, it also doesn't work:

pmem-csi-intel-com-node-q94vv/external-provisioner@pmem..ker2: I0118 14:46:54.540845       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"skew-4425", Name:"direct-testingv4nl9", UID:"fb94e3a9-62e5-44ec-b48b-33471452142c", APIVersion:"v1", ResourceVersion:"1204", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-skew-4425-unused": error generating accessibility requirements: selected node '"pmem-csi-pmem-govm-worker2"' topology 'map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2]' is not in allowed topologies: [map[feature.node.kubernetes.io/memory-nv.dax:true]]
@pohly
Contributor Author

pohly commented Jan 20, 2021

I intentionally left #857 out of the current implementation. It's not going to be a problem for a while and can be addressed later.

verbs:
- get
- list
- watch
Contributor Author

With these changes, the ClusterRole becomes a combination of permissions for the scheduler extensions and for the rescheduler.

I suspect putting "webhooks" into the name of the ClusterRole was a poor choice from the beginning; "controller" would have been better. I didn't change it in this commit because I wanted to keep it small, but I can do it in a separate cleanup commit.

@@ -3,12 +3,3 @@ kind: StorageClass
metadata:
name: pmem-csi-sc
provisioner: pmem-csi.intel.com
# Setting allowedTopology ensures that pods using volumes
# with late binding only get scheduled to pmem
allowedTopologies:
Contributor

@pohly Why should we remove allowedTopologies from the storage class?

Contributor Author

Because it didn't work - see the commit message in 7c4ee86

Contributor

If I am not wrong, that attempt was for using {"pmem-csi.intel.com/driver":"started"} as the allowed topology, wasn't it? But it should work for {storage: "pmem"} as before. And anyway, you are passing the same label for the controller's --nodeSelector here. Then why exactly is this --nodeSelector needed if we get the same info from the storage class?

Contributor Author

If I am not wrong, that attempt was for using {"pmem-csi.intel.com/driver":"started"} as the allowed topology.

That's the first attempt (a03a700). The second attempt was {storage: pmem} which also didn't work.

Then why exactly is this --nodeSelector needed if we get the same info from the storage class?

We can't use AllowedTopology at all at the moment, not with the currently released external-provisioner. Even if we can start using AllowedTopology with some future external-provisioner, then the label check in kube-scheduler may still go wrong when labels get changed (less likely, but can happen). The rescheduler then handles that case and thus remains useful.

@@ -48,6 +48,7 @@ func init() {

/* Controller mode options */
flag.StringVar(&config.schedulerListen, "schedulerListen", "", "controller: listen address (like :8000) for scheduler extender and mutating webhook, disabled by default")
flag.Var(&config.nodeSelector, "nodeSelector", "controller: reschedule PVCs with a selected node where PMEM-CSI is not meant to run because the node does not have these labels (represented as JSON map)")
Contributor

I am a bit confused about where exactly the provided nodeSelector is used in the code. I could only see that it is used to check whether the controller should start the rescheduler, but I couldn't find anywhere that the contents of the provided nodeSelector are used. In that case, why can't it be a simple boolean value? @pohly Pardon me and please correct me if I missed something.

Contributor Author

It's passed to the rescheduler here:

pcp = newRescheduler(ctx,
csid.cfg.DriverName,
client, pvcInformer, scInformer, pvInformer, csiNodeLister,
csid.cfg.nodeSelector,
serverVersion.GitVersion)

Contributor

Yes, but it is not used by the rescheduler?

Contributor Author

And it should be used here:

driverMightRun := true
for key, value := range node.Labels {
if node.Labels[key] == value {
driverMightRun = false
}
}

... but isn't! Duh. That code is of course bogus. Not sure what I was thinking. I'll fix it.
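A minimal sketch of what the corrected check could look like, assuming pcp.nodeSelector holds the label key/value pairs passed via -nodeSelector:

// The driver might run on the node only if all selector labels match.
driverMightRun := true
for key, value := range pcp.nodeSelector {
    if node.Labels[key] != value {
        driverMightRun = false
        break
    }
}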

Contributor Author

I also need to add unit tests.

Contributor Author

Done, see updated commit.


driverMightRun := true
for key, value := range node.Labels {
if node.Labels[key] == value {
Contributor

I think this check is wrong. Is this where pcp.nodeSelector is supposed to be used?

Contributor Author

Yes.

Contributor Author

I've fixed the commit.

}
}

reschedule := !driverMightRun && !driverIsRunning
Contributor

Shouldn't we make sure that the selected node has enough PMEM capacity instead of just checking whether the driver is running on the node?

Contributor Author

If the driver is running, then it will do that check itself.

Checking capacity here is redundant and racy (we check capacity, fail, and remove the selected node, while sometime later the driver processes the PVC and creates a volume although it is no longer assigned to the node).

Contributor Author

In other words, the rescheduler may only handle the case where the driver definitely isn't running.
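Putting the two conditions together, the rescheduler's decision could look roughly like this; determining driverIsRunning via the CSINode object is an assumption about the implementation, and the names are illustrative:

// driverMightRun: node labels match the configured node selector (see above).
// driverIsRunning: the CSINode object of the selected node lists our driver.
driverIsRunning := false
for _, driver := range csiNode.Spec.Drivers {
    if driver.Name == driverName {
        driverIsRunning = true
        break
    }
}
if !driverMightRun && !driverIsRunning {
    // Removing the selected-node annotation sends the PVC back to
    // kube-scheduler so that a different node can be picked.
    delete(pvc.Annotations, "volume.kubernetes.io/selected-node")
}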

When a PVC gets assigned to a node which has no PMEM-CSI running,
whether it is because scheduler extensions are not enabled or there
was a race while changing where to run the driver, then the new
"descheduler" (= a stripped down external provisioner) unsets the
"selected node" annotation.

For the user this looks like this:

Events:
  Type     Reason                 Age              From                                                                                     Message
  ----     ------                 ----             ----                                                                                     -------
  Normal   ExternalProvisioning   2s (x2 over 2s)  persistentvolume-controller                                                              waiting for a volume to be created, either by external provisioner "pmem-csi.intel.com" or manually created by system administrator
  Normal   Provisioning           2s               pmem-csi.intel.com_pmem-csi-intel-com-controller-0_9e16d74d-c645-4478-9f0d-50db58a962ce  External provisioner is provisioning volume for claim "latebinding-7887/pvc-g7p97"
  Warning  ProvisioningFailed     2s               pmem-csi.intel.com_pmem-csi-intel-com-controller-0_9e16d74d-c645-4478-9f0d-50db58a962ce  failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-latebinding-7887": reschedule PVC latebinding-7887/pvc-g7p97 because it is assigned to node pmem-csi-pmem-govm-master which has no PMEM-CSI driver
  Normal   WaitForPodScheduled    2s               persistentvolume-controller                                                              waiting for pod pvc-volume-tester-writer-latebinding-4pkpv to be scheduled
  Normal   Provisioning           2s               pmem-csi.intel.com_pmem-csi-intel-com-node-4gblz_23b46c1d-b5c3-4cfe-a2f9-b22c77e666c6    External provisioner is provisioning volume for claim "latebinding-7887/pvc-g7p97"
  Normal   ProvisioningSucceeded  2s               pmem-csi.intel.com_pmem-csi-intel-com-node-4gblz_23b46c1d-b5c3-4cfe-a2f9-b22c77e666c6    Successfully provisioned volume pvc-66a4a486-5782-4cf2-8b13-f13f66e30e19

For the sake of simplicity, RBAC rules and resource allocation are
based on what they would have to be when running both webhook and
rescheduler.
Contributor
@avalluri avalluri left a comment

Looks good to me

@pohly
Contributor Author

pohly commented Jan 20, 2021

There was a test flake (?), filed as #858.

I don't think that it was caused by this PR, so I am going to merge it despite that.

Successfully merging this pull request may close these issues:

volume leak during stress test