Distributed provisioning #838

Merged · 24 commits · Jan 20, 2021

Conversation

pohly
Contributor

@pohly pohly commented Dec 18, 2020

This changes the architecture of PMEM-CSI such that the controller part only implements the webhooks and provisioning happens on the nodes.

This PR is based on some pending changes (PR #825, PR #836) and uses the external-provisioner canary image because v2.1.0 has been tagged, but the versioned image is not available yet. Therefore the PR is ready for review and further testing, but merging should wait.

Fixes: #823, #733, #729, #594, #526, #507

@pohly pohly requested a review from avalluri December 18, 2020 11:06
@pohly pohly force-pushed the distributed-provisioning branch 3 times, most recently from 8e75f25 to 753962f Compare December 18, 2020 18:52
@pohly
Contributor Author

pohly commented Dec 18, 2020

I've rebased, so only new commits are present now.

I also re-enabled the stress test. It passed in local testing.

VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
// TODO: replace name, create secret via cert-manager
Contributor Author

This is a bit more complicated. When deploying via OLM, we need OLM to create the certificates. My understanding is that this can be done for "normal" webhooks, just not for conversion webhooks.

For deployment of the operator via YAML files we need to document how to set up those certs.

In both cases, re-configuring the cluster to use the webhooks is something that needs to be done manually.

If the user of the operator doesn't need the scheduler extensions, then it's still okay to start the statefulset unconditionally. The pod simply won't run.

Contributor

This is a bit more complicated. When deploying via OLM, we need OLM to create the certificates. My understanding is that this can be done for "normal" webhooks, just not for conversion webhooks.

I doubt that OLM can create certificates for services created by the operator for the CRs, though it provides service certificates for the operator itself.

Contributor

@pohly How do the scheduler extension and the Pod mutation work in the context of multiple PmemCSIDeployments? I think we should not/cannot support multiple deployments.

Contributor Author
@pohly pohly Jan 11, 2021

For the mutating pod webhook: it must use a resource name which includes the driver name. I thought it already did this, but currently it doesn't:

Resource = "pmem-csi.intel.com/scheduler"

For the scheduler extender: it must be triggered by the same resource name (driver name to be added) and then check only pods that belong to that instance - already implemented:

if label.GetName() == "driver_name" &&
label.GetValue() != c.driverName {
return 0, wrongPod
}
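For illustration, a minimal sketch of how the mutating webhook could derive that extended resource name from the configured driver name and add it to a pod; the variable names and the request value are assumptions, not the actual PMEM-CSI code:

// Per-deployment resource name built from the driver name, so that each
// PmemCSIDeployment instance only triggers its own scheduler extender.
resourceName := corev1.ResourceName(driverName + "/scheduler")
if pod.Spec.Containers[0].Resources.Requests == nil {
    pod.Spec.Containers[0].Resources.Requests = corev1.ResourceList{}
}
pod.Spec.Containers[0].Resources.Requests[resourceName] = resource.MustParse("1")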

Contributor Author

I doubt that OLM can create certificates for services created by the operator for the CRs, though it provides service certificates for the operator itself.

When we discussed it this morning, my conclusion was that the user needs to supply TLS credentials and that the webhooks need to be disabled if not provided.

Contributor Author

In both cases, re-configuring the cluster to use the webhooks is something that needs to be done manually.

Have I mentioned how much I hate configuring a scheduler extension? 😢

The way we currently do it is by creating the certificates first, then setting up the "kubeadm init" configuration so that those certificates can be used by kube-scheduler. This no longer works when creating those certificates inside the cluster with a tool like cert-manager.

It is possible to set up the cluster without the extension enabled, then create certificates, and then reconfigure kube-scheduler. kubeadm upgrade supports that. But this looks complicated. I think I'll stick with the current approach.

- -nodeid=$(KUBE_NODE_NAME)
- -mode=webhooks
- -schedulerListen=:8000
- -drivername=$(PMEM_CSI_DRIVER_NAME)
Contributor Author

We still need the external-provisioner in the controller as a fallback for kubernetes-csi/external-provisioner#544

Contributor Author

... or a custom controller which removes the selected-node annotation.
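A minimal sketch of that custom controller's core step, assuming a client-go clientset and the selected-node annotation key used for late binding (both are assumptions here):

// Clear the node assignment so that the PVC goes back to kube-scheduler;
// pvc is a *corev1.PersistentVolumeClaim, client a kubernetes.Interface.
delete(pvc.Annotations, "volume.kubernetes.io/selected-node")
_, err := client.CoreV1().PersistentVolumeClaims(pvc.Namespace).Update(ctx, pvc, metav1.UpdateOptions{})
if err != nil {
    // a real controller would retry on conflicts
}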

Contributor Author

... or just set AllowedTopologies in our storage classes for late binding (kubernetes-csi/external-provisioner#544 (comment)).

patch: |-
- op: add
path: /spec/template/spec/containers/0/command/-
value: -mode=webhooks
Contributor

What does this "webhooks" mode do? Isn't it that the valid modes are 'controller' and 'node'?

Contributor

Okay, after going through the design documentation changes, I understood that the new 'webhooks' mode replaces the existing 'controller' mode. But the driver command-line flags in pkg/pmem-csi-driver/main.go are not synced with this.

Contributor Author

Fixed.
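For reference, a hedged sketch of what the mode validation in DriverMode.Set could look like after the rename; the constant names and error text are assumptions:

func (mode *DriverMode) Set(value string) error {
    switch value {
    case string(Webhooks), string(Node):
        // "controller" is gone; only the webhooks and node modes remain
        *mode = DriverMode(value)
        return nil
    default:
        return fmt.Errorf("invalid driver mode: %s", value)
    }
}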

path: /spec/template/spec/containers/0/command/-
value: -kube-api-burst=10
- op: remove
path: /spec/template/spec/containers/1
Contributor

Why can't we make these changes (removing the external provisioner from the controller) in our base driver YAML instead of in a kustomize layer?

Contributor Author

We can, and I thought I had made that change when squashing. I'll check again... it looks like I simply forgot to remove this file. It shouldn't be needed anymore.

Initially it wasn't done because distributed and centralized provisioning were both supported for a while.

Contributor Author

Fixed.

@@ -155,88 +155,35 @@ Kata Container support has to be enabled explicitly via a [storage
class parameter and Kata Containers must be set up
appropriately](install.md#kata-containers-support)

## Driver modes
Contributor

The reference is not removed from the table of contents.

docs/design.md Outdated

PMEM-CSI solves this problem by deploying `external-provisioner`
alongside each node driver and enabling ["distributed
provisioning"](https://github.com/kubernetes-csi/external-provisioner/issues/487):
Contributor
@avalluri avalluri Jan 4, 2021

Wouldn't it be better to link to the documentation, if there is any, instead of the issue?

Contributor Author

Yes, now that it is merged. Will change it.

Contributor Author

Fixed.

on which nodes the PMEM-CSI driver is running and how much capacity is
available there. This information is retrieved by dynamically
discovering PMEM-CSI pods and connecting to their [metrics
endpoint](/docs/install.md#metrics-support).
Contributor

This makes metrics support a hard dependency of the scheduler extension. The code should enforce this by always enabling node metrics when the scheduler extension is enabled.

Contributor Author

Does the node part know whether scheduler extensions are enabled? I don't think so, because those parameters are passed to the controller part.

Contributor Author

Or did you mean some other code than the main.go and pmem-csi-driver.go?

Contributor

Does the node part know whether scheduler extensions are enabled? I don't think so, because those parameters are passed to the controller part.

Yes, you are right, it is not possible to enforce this in the code. It would be good to add a note about enabling metrics while using the scheduler extension in the documentation.

Contributor Author

It's documented that the scheduler extension code uses metrics. I think that is enough. We don't describe how to disable metrics support (it's always on in YAML and operator deployments), so the risk that someone accidentally deploys incorrectly is low.

docs/install.md Outdated
@@ -8,7 +8,7 @@
- [Install PMEM-CSI driver](#install-pmem-csi-driver)
- [Install using the operator](#install-using-the-operator)
- [Install from source](#install-from-source)
- [Expose persistent and cache volumes to applications](#expose-persistent-and-cache-volumes-to-applications)
- [Expose volumes to applications](#expose-volumes-to-applications)
Contributor

Links in the table of contents are broken.

Contributor Author

Fixed, also in the body.

@@ -71,7 +67,7 @@ func Main() int {
klog.V(3).Info("Version: ", version)

if config.schedulerListen != "" {
if config.Mode != Controller {
if config.Mode != Webhooks {
Contributor

I feel that offloading the webhook and scheduler extension part from the CSI driver into a separate binary makes more sense and is cleaner than treating it as a mode of the driver.

Contributor Author

What's the benefit? Have you considered the drawbacks?

"Cleaner" is subjective.

Contributor Author

Is this perhaps about publishing one image that has multiple purposes vs. two different images with different content and different purposes?

I prefer to stick with a single image until someone has a strong argument against it. I doubt that the size of an image that only runs once in a cluster matters that much, and publishing a single image is simpler, so I prefer that.

When we have a single image, then a single binary is likely to be smaller overall.

What that binary is called and how it is invoked hardly matters for end-users - they never invoke the binary. Renaming the binary is possible, but makes this PR more complex for (IMHO) little gain.

Contributor

My point was not about different images, but about a different binary. Now that there is no central controller driver component, IMHO the scheduler code does not belong in the driver code.

Contributor Author

We discussed this in the standup meeting.

"pmem-csi-driver" has already before served multiple different purposes and we agreed to not change that in this PR because doing so would make our image larger for not much gain (basically cleaner command line arguments of the binary which isn't that important).

We also raised the question whether "pmem-csi-operator" should be a separate binary. Probably not. But again, that is a different issue.

@pohly
Contributor Author

pohly commented Jan 14, 2021

I've pushed a major update: now the operator supports configuration of the webhooks and distributed provisioning. Some E2E tests passed (in particular the ones for the operator), but for full coverage I let the CI do its work overnight.

I'll double-check tomorrow, but I think there is only one minor TODO left (the one about reading the node port from the shell script in deploy.go).

I've not tried to ensure that all tests pass for all commits, so strictly speaking I would have to squash "distributed provisioning", "deploy: consistent object naming, app.kubernetes.io labels" and "operator: distributed provisioning, scheduler extensions". But I think it's easier to review when they are separate.

@pohly
Contributor Author

pohly commented Jan 15, 2021

I'll double-check tomorrow, but I think there is only one minor TODO left (the one about reading the node port from the shell script in deploy.go).

That referred to TODOs inside the patch itself. I still need to address review comments.

@@ -104,9 +98,6 @@ type Config struct {
NodeID string
//Endpoint exported csi driver endpoint
Endpoint string
//TestEndpoint adds the controller service to the server listening on Endpoint.
//Only needed for testing.
TestEndpoint bool
//Mode mode of the driver
Mode DriverMode
//RegistryEndpoint exported registry server endpoint
Contributor

RegistryEndpoint is no longer needed and could be removed.

Contributor Author

Removed (but not pushed yet).

@pohly
Contributor Author

pohly commented Jan 15, 2021

I've pushed an update with various test fixes. Several tests still used the old labels and/or had issues with the optional controller part.

@@ -188,6 +196,16 @@ func (csid *csiDriver) Run() error {
)
csid.gatherers = append(csid.gatherers, cmm.GetRegistry())

var client kubernetes.Interface
if config.schedulerListen != "" ||
csid.cfg.Mode == Node {
Contributor

I think the check is supposed to be csid.cfg.Mode == Webhooks; in that case we could remove this check and move the client creation under the webhooks section of the switch statement.

Contributor Author

Indeed, good catch. The initialization of cmm can also be moved.
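A rough sketch of how that restructuring could look; the config loading and error handling here are illustrative assumptions, not the actual code:

switch csid.cfg.Mode {
case Webhooks:
    // Only the webhooks mode talks to the apiserver directly.
    restConfig, err := rest.InClusterConfig()
    if err != nil {
        return fmt.Errorf("connect to apiserver: %v", err)
    }
    client, err := kubernetes.NewForConfig(restConfig)
    if err != nil {
        return fmt.Errorf("create Kubernetes client: %v", err)
    }
    // ... pass client to the scheduler extender and webhook setup;
    // per the comment above, the cmm initialization could move here too ...
case Node:
    // the node part does not need its own Kubernetes client here
}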

kind: ServiceAccount
metadata:
labels:
pmem-csi.intel.com/deployment: lvm-testing
Contributor

Is keeping this label (and also the rest of this change) intentional?

Contributor Author

Yes. The usage of the "pmem-csi.intel.com/deployment" label to mark different kinds of deployments in deploy.go doesn't change.


That secret must contain the following data items:
- `ca.crt`: root CA certificate
- `tls.key`: secret key of the webhook
Contributor

Wouldn't it be better to mention that the secret/public key is for both the webhook and the scheduler extender?

Contributor Author

The scheduler extender is also a webhook. Both run in the controller.

With that in mind, it's better to say above that "The controller needs TLS certificates" - I'll use that instead. The name of the API was already chosen accordingly.

@@ -377,6 +424,7 @@ var subObjectHandlers = map[string]redeployObject{
},
"metrics service": {
objType: reflect.TypeOf(&corev1.Service{}),
enabled: controllerEnabled,
Contributor

Can't the node metrics service be enabled regardless of the controller?

Contributor Author

Node metrics are always enabled. There's nothing in the operator API to control that.

# It must match the "provisioner" field above.
- key: pmem-csi.intel.com/driver
values:
- started
Contributor Author

Looks like this breaks skew testing; I've not looked into it yet, but probably it is related to the older deployment not setting this label.

I wanted to ensure that only nodes where the driver actually runs (or at least ran) are used, vs. nodes where it is supposed to run. The latter is what we would get when using the label that admins need to set for the nodes with PMEM.

Will have to think about this some more...

Contributor Author

The problem is that the 0.8.0 deployment doesn't handle the new topology:

pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.703703       1 tracing.go:19] GRPC call: /csi.v1.Controller/CreateVolume
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.703770       1 tracing.go:20] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"pmem-csi.intel.com/driver":"started"}}],"requisite":[{"segments":{"pmem-csi.intel.com/driver":"started"}}]},"capacity_range":{"required_bytes":115343360},"name":"pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0","parameters":{"eraseafter":"true"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.704225       1 controllerserver-master.go:181] Controller CreateVolume: Name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0 required_bytes:115343360 limit_bytes:0
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.704251       1 controllerserver-master.go:86] Controller CreateVolume: Create VolumeID:pvc-26-328dad453a5df7f8348d57f9f7422e6b084a58303f5400af6f92906a based on name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0
pmem-csi-controller-0/pmem-driver@pmem..ker1: W0118 11:38:34.704273       1 controllerserver-master.go:236] failed to connect to : No node registered with id: 

I'm a bit surprised that "pmem-csi.intel.com/node" doesn't show up at all; might be related to how we configure the external-provisioner. Anyway, we can't change our topology labels without breaking compatibility, so I'll switch to using the node selector labels that we use for the PMEM-CSI node pods also for the storage classes.

Contributor Author

Using our node selector label also doesn't work because the external-provisioner only seems to allow node labels that were set by the CSI driver:

pmem-csi-intel-com-node-q94vv/external-provisioner@pmem..ker2: I0118 14:46:54.540845       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"skew-4425", Name:"direct-testingv4nl9", UID:"fb94e3a9-62e5-44ec-b48b-33471452142c", APIVersion:"v1", ResourceVersion:"1204", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-skew-4425-unused": error generating accessibility requirements: selected node '"pmem-csi-pmem-govm-worker2"' topology 'map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2]' is not in allowed topologies: [map[feature.node.kubernetes.io/memory-nv.dax:true]]

Contributor Author

At this point, running a "deprovisioner" in the PMEM-CSI controller looks like the simplest solution.

Contributor Author

I've implemented that, albeit with the name "(PVC) rescheduler" because that's what it is.

I decided to keep the commits for the previous two attempts because it is something that we might want to refer to. I could also squash.

I used the existing external-provisioner helper lib. It's not a perfect fit because it needs a PV informer without ever doing anything with PVs, but using it was simpler than writing everything from scratch (which probably would have amounted to a lot of copy-and-paste).

I've pushed after some local testing ("make test", the new late binding test, some skew testing).

@avalluri can you take another look?

Contributor Author

I've filed kubernetes-csi/external-provisioner#548 to discuss why we can't use the node label in AllowedTopology. With a modified external-provisioner we might be able to restore that solution, which would be better than the current solution alone because then a bad node selection should happen a lot less. Even if we can enable that again, rescheduling as a second line of defense will still be useful (think "node selected based on labels", "labels change", "provisioning no longer possible" -> "must reschedule").

This can be run concurrently with creating volumes and will then show
how objects change over time.
The previous error message only listed the last path that had been
tried. Listing all the paths that were checked makes it more obvious
what might be missing.
User input is already checked in DriverMode.Set, therefore we can
assume that all DriverMode strings are valid.
This is useful for debugging outside of a cluster.
By moving the code into the device manager package it becomes
reusable, for example in upcoming tests.

While at it, the resync period gets moved and increased. The code
shouldn't depend on resyncing to catch up with the current state of
the world.
Events get recorded in one goroutine and checked in another.
We must protect the data structure where we store events.
The newer runtime has a nicer API:
- client.Object combines access to meta data and object data,
  which in some cases means we can drop one parameter
  or extra code
- context is passed explicitly to Reconcile()
This is merely a precaution. Right now, the existing initial events
are secrets which should be read-only and thus safe to share.
The Go runtime sometimes fails the deployment test with an error about
a concurrent write access to a map (copied below, see also
intel#852). It's not clear what
that map is. It looks like it's related to decoding into the same
object, but objects shouldn't be shared between goroutines.

As a workaround we can prevent parallel execution of Patch in the
test. This is less restrictive than preventing parallel execution
altogether.

fatal error: concurrent map writes

goroutine 1498 [running]:
runtime.throw(0x17a34b4, 0x15)
	/nvme/gopath/go-1.15.2/src/runtime/panic.go:1116 +0x72 fp=0xc0019627a0 sp=0xc001962770 pc=0x439552
runtime.mapassign(0x15d4600, 0xc000206bd0, 0xc00051a7f0, 0xc001513440)
	/nvme/gopath/go-1.15.2/src/runtime/map.go:585 +0x5c5 fp=0xc001962820 sp=0xc0019627a0 pc=0x411ca5
reflect.mapassign(0x15d4600, 0xc000206bd0, 0xc00051a7f0, 0xc00051a820)
	/nvme/gopath/go-1.15.2/src/runtime/map.go:1319 +0x3f fp=0xc001962850 sp=0xc001962820 pc=0x46abbf
github.com/modern-go/reflect2.(*UnsafeMapType).UnsafeSetIndex(...)
	/nvme/gopath/pkg/mod/github.com/modern-go/reflect2@v1.0.1/unsafe_map.go:76
github.com/json-iterator/go.(*mapDecoder).Decode(0xc0007ed090, 0xc001b40ba8, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_map.go:180 +0x1a5 fp=0xc0019628d0 sp=0xc001962850 pc=0x880f65
github.com/json-iterator/go.(*placeholderDecoder).Decode(0xc000998130, 0xc001b40ba8, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect.go:324 +0x47 fp=0xc0019628f8 sp=0xc0019628d0 pc=0x879ea7
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000ba0100, 0xc001b40b18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962980 sp=0xc0019628f8 pc=0x890f58
github.com/json-iterator/go.(*generalStructDecoder).decodeOneField(0xc000ba0960, 0xc001b40b18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:552 +0x217 fp=0xc0019629f8 sp=0xc001962980 pc=0x88e577
github.com/json-iterator/go.(*generalStructDecoder).Decode(0xc000ba0960, 0xc001b40b18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:508 +0x85 fp=0xc001962a68 sp=0xc0019629f8 pc=0x88e065
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000ba1f20, 0xc001b40a20, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962af0 sp=0xc001962a68 pc=0x890f58
github.com/json-iterator/go.(*twoFieldsStructDecoder).Decode(0xc0007db380, 0xc001b40a20, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:617 +0xab fp=0xc001962b58 sp=0xc001962af0 pc=0x88ebab
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000e8c6a0, 0xc001b40a18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962be0 sp=0xc001962b58 pc=0x890f58
github.com/json-iterator/go.(*fiveFieldsStructDecoder).Decode(0xc00058d8c0, 0xc001b40a18, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:743 +0xcb fp=0xc001962c48 sp=0xc001962be0 pc=0x88f5eb
github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000e8d6e0, 0xc001b40900, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962cd0 sp=0xc001962c48 pc=0x890f58
github.com/json-iterator/go.(*fiveFieldsStructDecoder).Decode(0xc00058daa0, 0xc001b40900, 0xc001513440)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect_struct_decoder.go:741 +0x2f2 fp=0xc001962d38 sp=0xc001962cd0 pc=0x88f812
github.com/json-iterator/go.(*Iterator).ReadVal(0xc001513440, 0x176d840, 0xc001b40900)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/reflect.go:79 +0xc2 fp=0xc001962da8 sp=0xc001962d38 pc=0x877d42
github.com/json-iterator/go.(*frozenConfig).Unmarshal(0xc0001e4fa0, 0xc002213300, 0x11f7, 0x1300, 0x176d840, 0xc001b40900, 0x0, 0x0)
	/nvme/gopath/pkg/mod/github.com/json-iterator/go@v1.1.10/config.go:348 +0xb7 fp=0xc001962e08 sp=0xc001962da8 pc=0x86da17
k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Decode(0xc0000e8aa0, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0x2b, 0x2b, 0xc000044390, ...)
	/nvme/gopath/pkg/mod/k8s.io/apimachinery@v0.19.1-rc.0/pkg/runtime/serializer/json/json.go:263 +0x602 fp=0xc001963130 sp=0xc001962e08 pc=0x10ab082
k8s.io/apimachinery/pkg/runtime/serializer/recognizer.(*decoder).Decode(0xc000230240, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0xc0007270e0, 0x7f510e3e3498, 0xc0007270e0, ...)
	/nvme/gopath/pkg/mod/k8s.io/apimachinery@v0.19.1-rc.0/pkg/runtime/serializer/recognizer/recognizer.go:107 +0x4f7 fp=0xc001963218 sp=0xc001963130 pc=0x10a9377
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Decode(0xc0007270e0, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0x3, 0x4, 0x19752a0, ...)
	/nvme/gopath/pkg/mod/k8s.io/apimachinery@v0.19.1-rc.0/pkg/runtime/serializer/versioning/versioning.go:136 +0xa9 fp=0xc001963300 sp=0xc001963218 pc=0x10b0bc9
sigs.k8s.io/controller-runtime/pkg/client/fake.(*fakeClient).Patch(0xc00376a060, 0x1978f60, 0xc003ad2fc0, 0x1948e40, 0xc001b40900, 0x194e480, 0xc002cb60a0, 0x0, 0x0, 0x0, ...)
	/nvme/gopath/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/client/fake/client.go:378 +0x747 fp=0xc001963610 sp=0xc001963300 pc=0x14bf607
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.(*testClient).Patch(0xc00376a080, 0x1978f60, 0xc003ad2fc0, 0x1948e40, 0xc001b40900, 0x194e480, 0xc002cb60a0, 0x0, 0x0, 0x0, ...)
	<autogenerated>:1 +0xb6 fp=0xc001963680 sp=0xc001963610 pc=0x14c9596
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*objectPatch).apply(0xc0023af848, 0x1978f60, 0xc003ad2fc0, 0x198a040, 0xc00376a080, 0x0, 0x0)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:192 +0x6a4 fp=0xc0019637c0 sp=0xc001963680 pc=0x13cbee4
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).redeploy(0xc0036020c0, 0x1978f60, 0xc003ad2fc0, 0xc000a26b40, 0x199d160, 0x176d840, 0x0, 0x183dc88, 0x183dce0, 0x183dd20, ...)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:329 +0x37d fp=0xc0019638f8 sp=0xc0019637c0 pc=0x13cd49d
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).reconcile.func1(0xc003602200, 0x179457c)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:233 +0x1d2 fp=0xc001963ab0 sp=0xc0019638f8 pc=0x13debd2
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).reconcile(0xc0036020c0, 0x1978f60, 0xc003ad2fc0, 0xc000a26b40, 0x2, 0x2)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:242 +0x2e4 fp=0xc001963c00 sp=0xc001963ab0 pc=0x13cc244
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*ReconcileDeployment).Reconcile(0xc000a26b40, 0x0, 0x0, 0xc003014440, 0x16, 0xc002103e00, 0x0, 0x0, 0x0)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller.go:324 +0x774 fp=0xc001963dd8 sp=0xc001963c00 pc=0x13d7354
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.testReconcile(0xc000834d80, 0x19407a0, 0xc000a26b40, 0xc003014440, 0x16, 0x0)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:154 +0x5a fp=0xc001963e48 sp=0xc001963dd8 pc=0x14c181a
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.testReconcilePhase(0xc000834d80, 0x19407a0, 0xc000a26b40, 0x198a040, 0xc00376a080, 0xc003014440, 0x16, 0x0, 0x179520f, 0x7)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:164 +0x6f fp=0xc001963e90 sp=0xc001963e48 pc=0x14c1a2f
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.TestDeploymentController.func1.9.1.1(0xc000088300)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:575 +0x769 fp=0xc001963f68 sp=0xc001963e90 pc=0x14c5549
github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.TestDeploymentController.func1.9.1.2(0xc0009a9b00)
	/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:584 +0x2a fp=0xc001963f80 sp=0xc001963f68 pc=0x14c56aa
testing.tRunner(0xc0009a9b00, 0xc000d900b0)
	/nvme/gopath/go-1.15.2/src/testing/testing.go:1127 +0xef fp=0xc001963fd0 sp=0xc001963f80 pc=0x521a6f
runtime.goexit()
	/nvme/gopath/go-1.15.2/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc001963fd8 sp=0xc001963fd0 pc=0x471081
created by testing.(*T).Run
	/nvme/gopath/go-1.15.2/src/testing/testing.go:1178 +0x386
The code had two problems.

It stored pointers to the objects passed in by the caller, which is
dangerous because those instances might be reused by the caller, in
which case the content changes. In practice this doesn't seem to
happen, but we can't know that.

It relied on receiving events in a certain order. The apiserver
and (to some extent) the fake client may combine events, which
apparently can also cause them to be delivered out-of-order. This has
happened in practice and broke the
test (intel#852).

The solution is to de-duplicate after sorting. Because the order
depends on what we compare against, it is done during the actual
comparison together with de-duplication.
This is redundant when the scheduler extensions are active, but when
those are disabled, then allowedTopologies prevents scheduling
of pods onto nodes where the driver is not running.

The central provisioner recovered from that by removing the bad
"selected node" annotation. With distributed provisioning such a pod
would get stuck forever.

Testing of the scenario without scheduler extensions is covered by
adding late binding tests to "operator-direct-production". Using an
alternative driver name is covered by "operator-lvm-production".
The previous attempt with a fixed label did not work when downgrading
to 0.8.0:

pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.703703       1 tracing.go:19] GRPC call: /csi.v1.Controller/CreateVolume
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.703770       1 tracing.go:20] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"pmem-csi.intel.com/driver":"started"}}],"requisite":[{"segments":{"pmem-csi.intel.com/driver":"started"}}]},"capacity_range":{"required_bytes":115343360},"name":"pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0","parameters":{"eraseafter":"true"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.704225       1 controllerserver-master.go:181] Controller CreateVolume: Name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0 required_bytes:115343360 limit_bytes:0
pmem-csi-controller-0/pmem-driver@pmem..ker1: I0118 11:38:34.704251       1 controllerserver-master.go:86] Controller CreateVolume: Create VolumeID:pvc-26-328dad453a5df7f8348d57f9f7422e6b084a58303f5400af6f92906a based on name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0
pmem-csi-controller-0/pmem-driver@pmem..ker1: W0118 11:38:34.704273       1 controllerserver-master.go:236] failed to connect to : No node registered with id:

This attempt uses the node selector label for the PMEM-CSI node pod in
the allowedTopologies field. Unfortunately, it also doesn't work:

pmem-csi-intel-com-node-q94vv/external-provisioner@pmem..ker2: I0118 14:46:54.540845       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"skew-4425", Name:"direct-testingv4nl9", UID:"fb94e3a9-62e5-44ec-b48b-33471452142c", APIVersion:"v1", ResourceVersion:"1204", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-skew-4425-unused": error generating accessibility requirements: selected node '"pmem-csi-pmem-govm-worker2"' topology 'map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2]' is not in allowed topologies: [map[feature.node.kubernetes.io/memory-nv.dax:true]]
@pohly
Contributor Author

pohly commented Jan 20, 2021

I intentionally left #857 out of the current implementation. It's not going to be a problem for a while and can be addressed later.

verbs:
- get
- list
- watch
Contributor Author

With these changes, the ClusterRole becomes a combination of permissions for the scheduler extensions and for the rescheduler.

I suspect putting "webhooks" into the name of the ClusterRole was a poor choice from the beginning; "controller" would have been better. I didn't change it in this commit because I wanted to keep it small, but I can do it in a separate cleanup commit.

@@ -3,12 +3,3 @@ kind: StorageClass
metadata:
name: pmem-csi-sc
provisioner: pmem-csi.intel.com
# Setting allowedTopology ensures that pods using volumes
# with late binding only get scheduled to pmem
allowedTopologies:
Contributor

@pohly Why should we remove allowedTopologies from the storage class?

Contributor Author

Because it didn't work - see the commit message in 7c4ee86

Contributor

If I am not wrong, that attempt was for using {"pmem-csi.intel.com/driver":"started"} as the allowed topology, wasn't it? But it should work for {storage: "pmem"} as before. And anyway, you are passing the same label for the controller's --nodeSelector here. Then why exactly is this --nodeSelector needed if we get the same info from the storage class?

Contributor Author

If I am not wrong, that attempt was for using {"pmem-csi.intel.com/driver":"started"} as the allowed topology.

That's the first attempt (a03a700). The second attempt was {storage: pmem} which also didn't work.

Then why exactly is this --nodeSelector needed if we get the same info from the storage class?

We can't use AllowedTopology at all at the moment, not with the currently released external-provisioner. Even if we can start using AllowedTopology with some future external-provisioner, then the label check in kube-scheduler may still go wrong when labels get changed (less likely, but can happen). The rescheduler then handles that case and thus remains useful.

@@ -48,6 +48,7 @@ func init() {

/* Controller mode options */
flag.StringVar(&config.schedulerListen, "schedulerListen", "", "controller: listen address (like :8000) for scheduler extender and mutating webhook, disabled by default")
flag.Var(&config.nodeSelector, "nodeSelector", "controller: reschedule PVCs with a selected node where PMEM-CSI is not meant to run because the node does not have these labels (represented as JSON map)")
Contributor

I am a bit confused about where exactly the provided nodeSelector is used in the code. I could only see that it is used to check whether the controller should start the rescheduler, but I couldn't find anywhere that the contents of the provided nodeSelector are used. In that case, why can't it be a simple boolean value? @pohly Pardon me and please correct me if I missed something.

Contributor Author

It's passed to the rescheduler here:

pcp = newRescheduler(ctx,
csid.cfg.DriverName,
client, pvcInformer, scInformer, pvInformer, csiNodeLister,
csid.cfg.nodeSelector,
serverVersion.GitVersion)

Contributor

Yes, but it is not used by the rescheduler?

Contributor Author

And it should be used here:

driverMightRun := true
for key, value := range node.Labels {
if node.Labels[key] == value {
driverMightRun = false
}
}

... but isn't! Duh. That code is of course bogus. Not sure what I was thinking. I'll fix it.
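A minimal sketch of what the corrected check could look like, assuming pcp.nodeSelector holds the label key/value pairs passed via -nodeSelector:

// The driver might run on the node only if all selector labels match.
driverMightRun := true
for key, value := range pcp.nodeSelector {
    if node.Labels[key] != value {
        driverMightRun = false
        break
    }
}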

Contributor Author

I also need to add unit tests.

Contributor Author

Done, see updated commit.


driverMightRun := true
for key, value := range node.Labels {
if node.Labels[key] == value {
Contributor

I think this check is wrong. Is this where pcp.nodeSelector is supposed to be used?

Contributor Author

Yes.

Contributor Author

I've fixed the commit.

}
}

reschedule := !driverMightRun && !driverIsRunning
Contributor

Shouldn't we make sure that the selected node has enough PMEM capacity instead of just checking whether the driver is running on the node?

Contributor Author

If the driver is running, then it will do that check itself.

Checking capacity here is redundant and racy (we check capacity, fail, and remove the selected node, while sometime later the driver processes the PVC and creates a volume although it is no longer assigned to the node).

Contributor Author

In other words, the rescheduler may only handle the case where the driver definitely isn't running.
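Putting the two conditions together, the rescheduler's decision could look roughly like this; determining driverIsRunning via the CSINode object is an assumption about the implementation, and the names are illustrative:

// driverMightRun: node labels match the configured node selector (see above).
// driverIsRunning: the CSINode object of the selected node lists our driver.
driverIsRunning := false
for _, driver := range csiNode.Spec.Drivers {
    if driver.Name == driverName {
        driverIsRunning = true
        break
    }
}
if !driverMightRun && !driverIsRunning {
    // Removing the selected-node annotation sends the PVC back to
    // kube-scheduler so that a different node can be picked.
    delete(pvc.Annotations, "volume.kubernetes.io/selected-node")
}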

When a PVC gets assigned to a node which has no PMEM-CSI running,
whether it is because scheduler extensions are not enabled or there
was a race while changing where to run the driver, then the new
"descheduler" (= a stripped down external provisioner) unsets the
"selected node" annotation.

For the user this looks like this:

Events:
  Type     Reason                 Age              From                                                                                     Message
  ----     ------                 ----             ----                                                                                     -------
  Normal   ExternalProvisioning   2s (x2 over 2s)  persistentvolume-controller                                                              waiting for a volume to be created, either by external provisioner "pmem-csi.intel.com" or manually created by system administrator
  Normal   Provisioning           2s               pmem-csi.intel.com_pmem-csi-intel-com-controller-0_9e16d74d-c645-4478-9f0d-50db58a962ce  External provisioner is provisioning volume for claim "latebinding-7887/pvc-g7p97"
  Warning  ProvisioningFailed     2s               pmem-csi.intel.com_pmem-csi-intel-com-controller-0_9e16d74d-c645-4478-9f0d-50db58a962ce  failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-latebinding-7887": reschedule PVC latebinding-7887/pvc-g7p97 because it is assigned to node pmem-csi-pmem-govm-master which has no PMEM-CSI driver
  Normal   WaitForPodScheduled    2s               persistentvolume-controller                                                              waiting for pod pvc-volume-tester-writer-latebinding-4pkpv to be scheduled
  Normal   Provisioning           2s               pmem-csi.intel.com_pmem-csi-intel-com-node-4gblz_23b46c1d-b5c3-4cfe-a2f9-b22c77e666c6    External provisioner is provisioning volume for claim "latebinding-7887/pvc-g7p97"
  Normal   ProvisioningSucceeded  2s               pmem-csi.intel.com_pmem-csi-intel-com-node-4gblz_23b46c1d-b5c3-4cfe-a2f9-b22c77e666c6    Successfully provisioned volume pvc-66a4a486-5782-4cf2-8b13-f13f66e30e19

For the sake of simplicity, RBAC rules and resource allocation are
based on what they would have to be when running both webhook and
rescheduler.
Contributor
@avalluri avalluri left a comment

Looks good to me

@pohly
Contributor Author

pohly commented Jan 20, 2021

There was a test flake (?), filed as #858.

I don't think that it was caused by this PR, so I am going to merge it despite that.

Successfully merging this pull request may close these issues:

volume leak during stress test