Make sure the v1 CRDs exist before starting the controller #504

Merged
merged 1 commit on Apr 27, 2021

Conversation

mauriciopoppe
Member

@mauriciopoppe mauriciopoppe commented Apr 22, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

When upgrading VolumeSnapshots from v1beta1 to v1 with a rolling update of the controller Deployment, it's possible to get into a state where the v1 snapshot-controller is up but the v1 CRDs aren't applied yet (from https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/):

It might take a few seconds for the endpoint to be created. You can watch the Established condition of your CustomResourceDefinition to be true or watch the discovery information of the API server for your resource to show up.
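
As an aside, that condition can also be watched from the CLI before rolling out the controller; something like the following (using this project's volumesnapshots CRD name) blocks until the CRD is Established:

kubectl wait --for condition=established --timeout=60s crd/volumesnapshots.snapshot.storage.k8s.io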

To test it, I had 2 replicas of the v3 controller in a cluster and ran kubectl apply -f with the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: csi-snapshot-controller
  name: csi-snapshot-controller
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: csi-snapshot-controller
  minReadySeconds: 30
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        k8s-app: csi-snapshot-controller
    spec:
      containers:
      - args:
        - --v=5
        - --leader-election=true
        - --leader-election-namespace=$(SNAPSHOT_CONTROLLER_NAMESPACE)
        env:
        - name: SNAPSHOT_CONTROLLER_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: repo/snapshot-controller:v4.0.0
        name: csi-snapshot-controller
      serviceAccount: csi-snapshot-controller

Note maxSurge: 0 and maxUnavailable: 1; also note that I didn't deploy the v1 CRDs. The order of events was:

  • (immediately after apply) [v3 controller ready, v4 controller creating, v3 controller terminating] the v3 controller that's still running becomes the leader
  • [v3 controller ready, v4 controller running & ready] (at this point I saw that the v4 controller was showing as ready, i.e. it wasn't crashing even though it couldn't watch the v1 CRDs; logs: https://pastebin.com/rWx9sXes)
  • (after 30 seconds) [v3 controller terminating, v4 controller ready, v4 controller creating], the v4 controller that's ready becomes the leader!
  • (after 30 seconds) [v4 controller ready, v4 controller ready], all v3 controller pods are gone!

Because the v4 controllers can't read the v1 CRDs, we believe the container should fail to start. I also noted that a controller only starts its informers and fetches data about the CRDs after it becomes the leader.

With this change we ensure that there's a startup period where the container cannot continue until it sees the CRDs (a sketch of the check appears after the pod listings below). The order of events for the same scenario is:

  • (immediately after apply) [v3 controller ready, v4 controller creating, v3 controller terminating] the v3 controller that's still running becomes the leader
  • [v3 controller ready, v4 controller in CrashLoopBackOff], the v3 controller is still the leader
NAME                                       READY   STATUS             RESTARTS   AGE
csi-snapshot-controller-547ccbc8b4-78cx9   0/1     CrashLoopBackOff   9          24m
csi-snapshot-controller-679f6d65c6-zqcfw   1/1     Running            0          75m

The controller gets stuck in this state until I deploy the v1 CRDs, the order of events after I deploy the v1 CRDs is:

  • (wait some seconds until the pod is restarted) [v3 controller ready, v4 controller ready]
  • (after 30 seconds) [v4 controller ready, v3 controller terminating, v4 controller creating], the first v4 controller becomes the leader
  • (after 30 seconds) [v4 controller ready, v4 controller ready]
NAME                                       READY   STATUS    RESTARTS   AGE
csi-snapshot-controller-547ccbc8b4-78cx9   1/1     Running   10         27m
csi-snapshot-controller-547ccbc8b4-shpcq   1/1     Running   0          79s
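
Aside for readers: a minimal sketch of the startup gate described above, reconstructed from the review snippets further down. The resource probed (the cluster-scoped VolumeSnapshotClasses, which sidesteps the namespace question raised in the review) and the client wiring are illustrative, not necessarily what the final commit does:

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"

	clientset "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
)

// Block startup until the v1 endpoints are served. While the v1 CRDs are
// missing, listing any v1 resource fails with "the server could not find the
// requested resource", so the condition stays false, the process exits, and
// the pod goes into CrashLoopBackOff until the CRDs are applied.
func ensureCustomResourceDefinitionsExist(client clientset.Interface) error {
	condition := func() (bool, error) {
		_, err := client.SnapshotV1().VolumeSnapshotClasses().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			klog.Errorf("Failed to list v1 volumesnapshotclasses with error=%+v", err)
			return false, nil
		}
		return true, nil
	}
	// Exponential backoff: 10 attempts spread over roughly 7.5 seconds.
	backoff := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 1.5, Steps: 10}
	return wait.ExponentialBackoff(backoff, condition)
}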

Does this PR introduce a user-facing change?:

Add check for v1 CRDs to allow for rolling update of the snapshot-controller

/assign @msau42 @verult

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 22, 2021
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 22, 2021
@k8s-ci-robot
Contributor

Welcome @mauriciopoppe!

It looks like this is your first PR to kubernetes-csi/external-snapshotter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-csi/external-snapshotter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @mauriciopoppe. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 22, 2021
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 22, 2021

condition := func() (bool, error) {
	var err error
	_, err = client.SnapshotV1().VolumeSnapshots("kube-system").List(context.TODO(), metav1.ListOptions{})
Member Author

VolumeSnapshots needs a namespace. I was not sure if this should be kube-system (although I'm not sure it exists in every cluster) or if I should just check for the VolumeSnapshotClasses / VolumeSnapshotContents.

Collaborator

We can't assume VolumeSnapshots are in "kube-system" namespace.

Also I see that you are checking if there are any VolumeSnapshot/VolumeSnapshotContent/VolumeSnapshotClass resources. Can you just check if the CRDs are deployed?

Collaborator

The purpose of the check is that the endpoint exists; it doesn't matter if there's nothing in the namespace.

I was also thinking of checking the CRD directly, but that would require new RBACs.

Member

@zhucan zhucan Apr 22, 2021

If the list call returns an error, that isn't necessarily because the CRD doesn't exist; a network problem could also cause it.

Collaborator

Can you check if the specific error indicates CRDs are not deployed?

Member Author

I believe both calls hit the same endpoint; I did the following experiment with volumesnapshotclasses.

The dynamic client returns the unstructured results of a GET on volumesnapshotclasses (not on the CRD itself). I tried making the request with curl through a proxy to the API server:

  • with the dynamic client we're doing a GET /apis/GROUP/VERSION/RESOURCE
# v1beta1
curl localhost:8080/apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotclasses/
{"apiVersion":"snapshot.storage.k8s.io/v1beta1","items":[],"kind":"VolumeSnapshotClassList","metadata":{"continue":"","resourceVersion":"58636","selfLink":"/apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotclasses/"}}

# v1
curl localhost:8080/apis/snapshot.storage.k8s.io/v1/volumesnapshotclasses/
404 page not found
  • with the snapshot v1 client and kubectl I saw the following
# v1beta1
kubectl get volumesnapshotclasses.v1beta1.snapshot.storage.k8s.io --v=7
I0422 08:06:43.695353   77906 loader.go:375] Config loaded from file:  bmctl-workspace/ci-6ecffb9bde2ccf8/ci-6ecffb9bde2ccf8-kubeconfig
I0422 08:06:43.705084   77906 round_trippers.go:421] GET https://10.200.0.5:443/apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotclasses?limit=500
I0422 08:06:43.705120   77906 round_trippers.go:428] Request Headers:
I0422 08:06:43.705131   77906 round_trippers.go:432]     Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json
I0422 08:06:43.705151   77906 round_trippers.go:432]     User-Agent: kubectl/v1.19.7 (linux/amd64) kubernetes/1dd5338
I0422 08:06:43.714630   77906 round_trippers.go:447] Response Status: 200 OK in 9 milliseconds
No resources found

#v1
kubectl get volumesnapshotclasses.v1.snapshot.storage.k8s.io --v=6
I0422 08:18:48.320835   83848 loader.go:375] Config loaded from file:  bmctl-workspace/ci-6ecffb9bde2ccf8/ci-6ecffb9bde2ccf8-kubeconfig
I0422 08:18:48.329195   83848 discovery.go:214] Invalidating discovery information
I0422 08:18:48.338454   83848 round_trippers.go:444] GET https://10.200.0.5:443/api?timeout=32s 200 OK in 9 milliseconds
I0422 08:18:48.348421   83848 round_trippers.go:444] GET https://10.200.0.5:443/apis?timeout=32s 200 OK in 1 milliseconds
... lots of OKs ...
F0422 08:18:48.792106   83848 helpers.go:115] error: the server doesn't have a resource type "volumesnapshotclasses"

The endpoints for the dynamic client and the snapshot v1 client are the same. If our objective is to check the CRDs, I think I could use another client, like the apiextensions client, to get the CRDs, check that they declare the v1 version, and check that the CRD has the Established condition as the docs say, e.g.:

kubectl get crd volumesnapshots.snapshot.storage.k8s.io --v=7 -o custom-columns=NAME:.metadata.name,VERSIONS:.spec.versions.*.name,CONDITIONS:.status.conditions.*.type
I0422 08:17:25.251236   83018 loader.go:375] Config loaded from file:  bmctl-workspace/ci-6ecffb9bde2ccf8/ci-6ecffb9bde2ccf8-kubeconfig
I0422 08:17:25.260059   83018 round_trippers.go:421] GET https://10.200.0.5:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/volumesnapshots.snapshot.storage.k8s.io
I0422 08:17:25.260089   83018 round_trippers.go:428] Request Headers:
I0422 08:17:25.260096   83018 round_trippers.go:432]     Accept: application/json
I0422 08:17:25.260101   83018 round_trippers.go:432]     User-Agent: kubectl/v1.19.7 (linux/amd64) kubernetes/1dd5338
I0422 08:17:25.273409   83018 round_trippers.go:447] Response Status: 200 OK in 13 milliseconds
I0422 08:17:25.274596   83018 table_printer.go:45] Unable to decode server response into a Table. Falling back to hardcoded types: attempt to decode non-Table object
NAME                                      VERSIONS   CONDITIONS
volumesnapshots.snapshot.storage.k8s.io   v1beta1    KubernetesAPIApprovalPolicyConformant,NamesAccepted,Established

Thanks for the comments! This was a really useful exercise to learn more about how the API server works.

Collaborator

Can you check if the specific error indicates CRDs are not deployed?

I'm not sure it's important to check for that specific error. If we're not able to access the API server due to network issues, that's also a problem (although it may be temporary).

Member Author

As @msau42 mentioned, this change is doable; however, we'd need to add new rules to the RBAC, e.g.:

- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - get

I haven't tried this in a cluster yet, but this would be an approach:

import (
	"context"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclientset "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
)

// Checks that the VolumeSnapshot v1 CRDs exist and that they have the Established condition.
func ensureCRDsExist(client apiextensionsclientset.Interface) error {
	crdHasV1Version := func(crd *apiextensionsv1.CustomResourceDefinition) bool {
		for _, version := range crd.Spec.Versions {
			if version.Name == snapshotv1.SchemeGroupVersion.Version {
				return true
			}
		}
		return false
	}

	crdHasEstablishedCondition := func(crd *apiextensionsv1.CustomResourceDefinition) bool {
		for _, condition := range crd.Status.Conditions {
			if condition.Type == apiextensionsv1.Established {
				return true
			}
		}
		return false
	}

	condition := func() (bool, error) {
		groupResources := []string{
			snapshotv1.Resource("volumesnapshots").String(),
			snapshotv1.Resource("volumesnapshotclasses").String(),
			snapshotv1.Resource("volumesnapshotcontents").String(),
		}
		for _, groupResource := range groupResources {
			crd, err := client.ApiextensionsV1().CustomResourceDefinitions().Get(context.TODO(), groupResource, metav1.GetOptions{})
			if err != nil {
				klog.Errorf("Failed to get CRD %s with error: %+v", groupResource, err)
				return false, nil
			}
			// Check that the CRD has the v1 version and that it has the Established condition
			// The Established condition was mentioned in https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#create-a-customresourcedefinition
			if !crdHasV1Version(crd) {
				klog.Errorf("CRD %s doesn't declare a v1 version", groupResource)
				return false, nil
			}
			if !crdHasEstablishedCondition(crd) {
				klog.Errorf("CRD %s doesn't have an Established condition", groupResource)
				return false, nil
			}
		}
		return true, nil
	}

	// with a Factor of 1.5 we wait up to 7.5 seconds (the 10th attempt)
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond,
		Factor:   1.5,
		Steps:    10,
	}
	if err := wait.ExponentialBackoff(backoff, condition); err != nil {
		return err
	}
	return nil
}

I believe that just testing that the endpoint exists with the snapshot client gives us a good indication that the CRD has the Established condition.

Member

@zhucan zhucan May 6, 2021

VolumeSnapshots needs a namespace. I was not sure if this should be kube-system (although I'm not sure it exists in every cluster) or if I should just check for the VolumeSnapshotClasses / VolumeSnapshotContents.

If we deploy the sidecar in another namespace, can it still list the resources under kube-system?

Member Author

Thanks for pointing this out; I created #514 to fix this.

@msau42
Collaborator

msau42 commented Apr 22, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 22, 2021
@mauriciopoppe mauriciopoppe changed the title [WIP] Make sure the v1 CRDs exist before starting the controller Make sure the v1 CRDs exist before starting the controller Apr 22, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2021
@msau42
Collaborator

msau42 commented Apr 22, 2021

/assign @xing-yang

}

pollPeriod := 100 * time.Millisecond
if err := wait.PollImmediateUntil(pollPeriod, condition, timeoutCh); err != nil {
Collaborator

In general, we prefer retrying with exponential backoff so that we don't potentially overload the API server with too many requests in a short time.

Member Author

This is good to know. I was thinking about timing out in less than 10s, e.g.:

	// with a Factor of 1.5 we wait up to 7.5 seconds (the 10th attempt)
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond,
		Factor:   1.5,
		Steps:    10,
	}
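
For reference, the sleeps between attempts are 100ms · 1.5^k for k = 0..8 (100ms, 150ms, 225ms, ...), which add up to roughly 7.5s before the 10th and final attempt; that matches the spacing of the "Failed to list" lines in the logs below.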

@mauriciopoppe
Member Author

mauriciopoppe commented Apr 22, 2021

I've deployed a new version, and without the CRDs it crashes after around 7.5 seconds.

Logs:

I0422 19:45:45.406459       1 main.go:108] Version: v4.0.0-mauricio.0-0-g1e5b3a93
I0422 19:45:45.407920       1 main.go:157] Start NewCSISnapshotController with kubeconfig [] resyncPeriod [15m0s]
E0422 19:45:45.414761       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:45.517047       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:45.668903       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:45.895563       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:46.235683       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:46.743988       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:47.505403       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:48.646981       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:50.358475       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:52.924478       1 main.go:72] Failed to list v1 volumesnapshots with error=the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
E0422 19:45:52.924501       1 main.go:171] Failed to ensure CRDs exist during startup: timed out waiting for the condition

Status:

kubectl -n kube-system describe deploy/csi-snapshot-controller

Name:                   csi-snapshot-controller
Namespace:              kube-system
CreationTimestamp:      Thu, 22 Apr 2021 06:36:16 +0000
Labels:                 k8s-app=csi-snapshot-controller
Annotations:            deployment.kubernetes.io/revision: 8
Selector:               k8s-app=csi-snapshot-controller
Replicas:               2 desired | 1 updated | 2 total | 1 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        15
RollingUpdateStrategy:  1 max unavailable, 0 max surge
Pod Template:
  Labels:           k8s-app=csi-snapshot-controller
  Service Account:  csi-snapshot-controller
  Containers:
   csi-snapshot-controller:
    Image:      image/repo/snapshot-controller:v4.0.0-mauricio.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=5
      --leader-election=true
      --leader-election-namespace=$(SNAPSHOT_CONTROLLER_NAMESPACE)
    Environment:
      SNAPSHOT_CONTROLLER_NAMESPACE:   (v1:metadata.namespace)
    Mounts:                           <none>
  Volumes:                            <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  csi-snapshot-controller-679f6d65c6 (1/1 replicas created)
NewReplicaSet:   csi-snapshot-controller-574d748468 (1/1 replicas created)
Events:
  Type    Reason             Age                  From                   Message
  ----    ------             ----                 ----                   -------
  Normal  ScalingReplicaSet  15m                  deployment-controller  Scaled down replica set csi-snapshot-controller-85779fc79c to 0
  Normal  ScalingReplicaSet  5m55s                deployment-controller  Scaled up replica set csi-snapshot-controller-55d57c8644 to 1
  Normal  ScalingReplicaSet  4m39s (x4 over 13h)  deployment-controller  Scaled up replica set csi-snapshot-controller-679f6d65c6 to 2
  Normal  ScalingReplicaSet  4m39s                deployment-controller  Scaled down replica set csi-snapshot-controller-55d57c8644 to 0
  Normal  ScalingReplicaSet  4m3s (x4 over 90m)   deployment-controller  Scaled down replica set csi-snapshot-controller-679f6d65c6 to 1
  Normal  ScalingReplicaSet  4m3s                 deployment-controller  Scaled up replica set csi-snapshot-controller-574d748468 to
kubectl -n kube-system get pod -l k8s-app=csi-snapshot-controller
NAME                                       READY   STATUS             RESTARTS   AGE
csi-snapshot-controller-574d748468-7mfck   0/1     CrashLoopBackOff   5          6m4s
csi-snapshot-controller-679f6d65c6-wl648   1/1     Running            0          13h

After I deployed the CRD, I had to delete the v4 pod to make it work (after going into CrashLoopBackOff it hadn't been restarted again after 5 minutes)

kubectl -n kube-system get pod -l k8s-app=csi-snapshot-controller
NAME                                       READY   STATUS    RESTARTS   AGE
csi-snapshot-controller-574d748468-kvblk   1/1     Running   0          3m59s
csi-snapshot-controller-574d748468-p6jhb   1/1     Running   0          4m18s

@verult
Contributor

verult commented Apr 22, 2021

Just found: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy

So the restart's exponential backoff is capped at 5m; hopefully, within 5m of the last crash, it'd start again.

func ensureCustomResourceDefinitionsExist(client *clientset.Clientset) error {
	condition := func() (bool, error) {
		var err error
		_, err = client.SnapshotV1().VolumeSnapshots("kube-system").List(context.TODO(), metav1.ListOptions{})
Collaborator

Are we sure the "kube-system" namespace exists in all Kubernetes distributions?

Member Author

From https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/#working-with-namespaces

Kubernetes starts with four initial namespaces:

  • default The default namespace for objects with no other namespace
  • kube-system The namespace for objects created by the Kubernetes system
  • kube-public This namespace is created automatically and is readable by all users (including those not authenticated). This namespace is mostly reserved for cluster usage, in case that some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.
  • kube-node-lease This namespace is for the lease objects associated with each node, which improve the performance of the node heartbeats as the cluster scales.

I'm not sure if it can be deleted later though

Collaborator

Can you add a check to see if "kube-system" exists before checking if there are snapshot resources there?

Member Author

Done; if kube-system doesn't exist, we skip the VolumeSnapshot list request.
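
A minimal sketch of that guard, assuming a core-v1 clientset is available alongside the snapshot clientset (the function and variable names here are illustrative, not the PR's actual code):

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	clientset "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
)

// Probe VolumeSnapshots in kube-system only when that namespace exists.
func v1VolumeSnapshotsServed(kubeClient kubernetes.Interface, client clientset.Interface) (bool, error) {
	_, err := kubeClient.CoreV1().Namespaces().Get(context.TODO(), "kube-system", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// kube-system doesn't exist: skip the namespaced list request.
		return true, nil
	}
	if err != nil {
		// Transient error: report "not yet" so the caller's backoff retries.
		return false, nil
	}
	_, err = client.SnapshotV1().VolumeSnapshots("kube-system").List(context.TODO(), metav1.ListOptions{})
	return err == nil, nil
}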

@msau42
Collaborator

msau42 commented Apr 22, 2021

After I deployed the CRD, I had to delete the v4 pod to make it work

Can you verify that after 5 minutes it came back?

@mauriciopoppe
Member Author

mauriciopoppe commented Apr 23, 2021

Sure, same scenario but I waited until the deployment got to this state:

kubectl -n kube-system get pods -l k8s-app=csi-snapshot-controller
NAME                                       READY   STATUS             RESTARTS   AGE
csi-snapshot-controller-689dbc454-hjbgx    0/1     CrashLoopBackOff   6          7m52s
csi-snapshot-controller-759d567df6-kxv8j   1/1     Running            0          19m

I applied the CRDs and then I saw:

kubectl -n kube-system get pods --watch -l k8s-app=csi-snapshot-controller
NAME                                       READY   STATUS             RESTARTS   AGE
csi-snapshot-controller-689dbc454-hjbgx    0/1     CrashLoopBackOff   6          11m
csi-snapshot-controller-759d567df6-kxv8j   1/1     Running            0          22m
csi-snapshot-controller-689dbc454-hjbgx    1/1     Running            7          11m
csi-snapshot-controller-759d567df6-kxv8j   1/1     Terminating        0          23m
csi-snapshot-controller-689dbc454-45zz2    0/1     Pending            0          0s
csi-snapshot-controller-689dbc454-45zz2    0/1     Pending            0          0s
csi-snapshot-controller-689dbc454-45zz2    0/1     ContainerCreating   0          0s
csi-snapshot-controller-759d567df6-kxv8j   0/1     Terminating         0          23m
csi-snapshot-controller-759d567df6-kxv8j   0/1     Terminating         0          23m
csi-snapshot-controller-759d567df6-kxv8j   0/1     Terminating         0          23m
csi-snapshot-controller-689dbc454-45zz2    1/1     Running             0          4s

kubectl -n kube-system get pods -l k8s-app=csi-snapshot-controller
NAME                                      READY   STATUS    RESTARTS   AGE
csi-snapshot-controller-689dbc454-45zz2   1/1     Running   0          2m3s
csi-snapshot-controller-689dbc454-hjbgx   1/1     Running   7          14m

Something I noticed this time was the downtime between terminating the last v3 controller and one of the v4 controllers making a request to become the leader; it was around 5 seconds.

@mauriciopoppe
Member Author

Can you verify that after 5 minutes it came back?

@msau42 Is this question about the v3 controller coming back? AFAIK we'd need to do a rollout undo after we detect that the v4 controller isn't up. I think the cap @verult pointed out is about how often Kubernetes triggers a restart, e.g. every 10s, 20s, 40s, ..., 5m, 5m, 5m, ...

@msau42
Collaborator

msau42 commented Apr 23, 2021

Is this question about the v3 controller coming back?

You mentioned earlier that you had to restart the v4 pod after installing the CRDs because it wasn't being restarted after failing for 5 minutes. I wanted to check that this was because of the issue Cheng mentioned about the backoff period becoming 5 minutes after that point. I would expect that after an additional 5 minutes, the v4 pod would get restarted.

@mauriciopoppe
Member Author

Yeah, in the first experiment I waited around 5 minutes before I manually restarted the pod; I think I should've waited just a little bit longer to see the pod restart. In the last experiment I just waited and it was restarted.

@verult
Contributor

verult commented Apr 24, 2021

Something I noticed this time was the downtime between terminating the last v3 controller and one of the v4 controllers making a request to become the leader; it was around 5 seconds.

That makes sense.

IMO this is pretty good; if the retry period is 5s, the amount of downtime should often be less than 5s.
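
For context, a sketch of the client-go leader election wiring behind this behavior; the lock name, namespace, and timing values are assumptions for illustration, not necessarily this controller's actual settings. A non-leader retries acquiring the Lease every RetryPeriod, so once the previous holder's lease expires, failover should complete within about one retry period:

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func runWithLeaderElection(ctx context.Context, kubeClient kubernetes.Interface, id string, run func(context.Context)) {
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"kube-system", "snapshot-controller-leader", // assumed names
		kubeClient.CoreV1(), kubeClient.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		klog.Fatalf("Failed to create resource lock: %v", err)
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   5 * time.Second, // candidates retry acquisition this often
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() { klog.Fatal("Lost the lease") },
		},
	})
}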

@verult
Contributor

verult commented Apr 26, 2021

LGTM from my perspective!

@msau42
Collaborator

msau42 commented Apr 27, 2021

Can you also update the release note to say something like "Add check for v1 CRDs to allow for rolling update of the snapshot-controller"

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2021
@xing-yang
Collaborator

Can you squash your commits?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2021
@mauriciopoppe
Member Author

@xing-yang I've squashed the commits
@msau42 I've updated the release note too

@xing-yang
Collaborator

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mauriciopoppe, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

	Factor: 1.5,
	Steps:  10,
}
if err := wait.ExponentialBackoff(backoff, condition); err != nil {
Member

I don't understand: why do we need to wait?

Collaborator

Wait until the CRDs are installed.

Member

If it crashes immediately without retrying, won't it check again anyway when the container restarts?

Member Author

Yes, but we'd like a faster startup time. If the container exits quickly and goes into CrashLoopBackOff, it's only restarted every 10s, 20s, 40s, and so on. By retrying during the first run instead of failing quickly, the CRDs might become ready after a few seconds and the wait condition can succeed in one of those 10 attempts.

Member

@mauriciopoppe copy that, thanks.
