
Massive EBS volume detachment after CSI installed #13197

Closed
cmotta2016 opened this issue Feb 3, 2022 · 3 comments · Fixed by #13203
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@cmotta2016

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

1.22.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.18.20

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
Enable CSI in a 1.18.20 cluster with existing PVs

5. What happened after the commands executed?
kops isn't adding the CSIMigrationAWSComplete feature gate to kube-controller-manager

6. What did you expect to happen?
The CSIMigrationAWSComplete feature gate to be enabled on kube-controller-manager

7. Anything else do we need to know?
A few days after enabling CSI, we started seeing the KCM massively detaching EBS volumes.

This bug occurs every time the KCM restarts with this log:

I0203 14:59:35.895309 1 attach_detach_controller.go:757] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-1a/vol-xxxx" ("ip-xxxxx.ec2.internal") is not attached (Detached)
I0203 15:00:26.501827 1 reconciler.go:219] attacherDetacher.DetachVolume started for volume "pvc-a9beee94-48dd-4a2e-a844-07a9ea08cc3c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-xxxx") on node "ip-xxxxx.ec2.internal"
I0203 15:00:26.506450 1 operation_generator.go:1384] Verified volume is safe to detach for volume "pvc-a9beee94-48dd-4a2e-a844-07a9ea08cc3c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-xxxx") on node "ip-xxxxx.ec2.internal"
I0203 15:00:32.045628 1 aws.go:2251] Waiting for volume "vol-xxxx" state: actual=detaching, desired=detached
I0203 15:00:34.153425 1 aws.go:2477] waitForAttachmentStatus returned non-nil attachment with state=detached: {
AttachTime: 2022-02-03 14:57:58 +0000 UTC,
State: "detaching",
I0203 15:00:34.155182 1 operation_generator.go:472] DetachVolume.Detach succeeded for volume "pvc-a9beee94-48dd-4a2e-a844-07a9ea08cc3c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-xxxx") on node "ip-xxxxx.ec2.internal"

This occurs only with migrated PVs.

After investigating, we discovered that kops doesn't enable the CSIMigrationAWSComplete feature gate on the KCM, only CSIMigrationAWS:

  kubeControllerManager:
    allocateNodeCIDRs: true
    attachDetachReconcileSyncPeriod: 1m0s
    cloudProvider: aws
    clusterCIDR: 100.96.0.0/11
    clusterName: cluster-name
    configureCloudRoutes: false
    featureGates:
      CSIMigrationAWS: "true"
    image: k8s.gcr.io/kube-controller-manager:v1.18.20
    leaderElection:
      leaderElect: true
    logLevel: 2
    useServiceAccountCredentials: true
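
For comparison, what we expected kops to render is roughly the following (an illustrative sketch only; the rest of the block is unchanged, and CSIMigrationAWSComplete mirrors the gate kops already sets for the kubelet):

  kubeControllerManager:
    featureGates:
      CSIMigrationAWS: "true"
      CSIMigrationAWSComplete: "true"   # expected alongside CSIMigrationAWS, but currently missing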

We enabled the feature gate directly in the kube-controller-manager manifest (/etc/kubernetes/manifests). Now the KCM isn't detaching volumes anymore:

W0201 20:30:30.604716 1 attach_detach_controller.go:738] Skipping processing the volume "pvc-13fd2d5f-f1fd-4850-9571-d4018e67245e" on nodeName: "ip-10-48-122-60.ec2.internal", no attacher interface found. err=no volume plugin matched
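
Concretely, the manual workaround amounts to adding the gate to the KCM command line in the static pod manifest, roughly like this (the file name, binary path, and surrounding flags shown here are illustrative of a typical layout, not an exact copy of our manifest):

    # excerpt of /etc/kubernetes/manifests/kube-controller-manager.manifest (assumed layout)
    spec:
      containers:
      - name: kube-controller-manager
        command:
        - /usr/local/bin/kube-controller-manager
        - --cloud-provider=aws
        - --feature-gates=CSIMigrationAWS=true,CSIMigrationAWSComplete=true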

So what's wrong?

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 3, 2022
@olemarkus
Member

It really helps if you follow the template when submitting a bug, and paste the entire (redacted) cluster spec.

Have you actually enabled the CSI driver in the cluster spec? The KCM feature flag should be set when you do. If things are configured correctly, you should see the flag in kops get cluster --name <cluster> --full

@cmotta2016
Author

cmotta2016 commented Feb 4, 2022

> It really helps if you follow the template when submitting a bug, and paste the entire (redacted) cluster spec.

Sorry for that, I will fix it soon.

> Have you actually enabled the CSI driver in the cluster spec? The KCM feature flag should be set when you do. If things are configured correctly, you should see the flag in kops get cluster --name <cluster> --full

Yes, we enabled CSI in the cluster spec:

spec:
  cloudConfig:
    awsEBSCSIDriver:
      enabled: true

In cluster-completed.spec, we can see that the CSIMigrationAWSComplete feature gate does not appear in the kubeControllerManager block:

  kubeControllerManager:
    allocateNodeCIDRs: true
    attachDetachReconcileSyncPeriod: 1m0s
    cloudProvider: aws
    clusterCIDR: <redacted>
    clusterName: <redacted>
    configureCloudRoutes: false
    featureGates:
      CSIMigrationAWS: "true"
    image: k8s.gcr.io/kube-controller-manager:v1.18.20
    leaderElection:
      leaderElect: true
    logLevel: 2
    useServiceAccountCredentials: true

But it does appear in the kubelet block:

 kubelet:
   anonymousAuth: false
   authenticationTokenWebhook: true
   authorizationMode: Webhook
   cgroupRoot: /
   cloudProvider: aws
   clusterDNS: <redacted>
   clusterDomain: cluster.local
   cpuCFSQuota: false
   cpuManagerPolicy: static
   enableDebuggingHandlers: true
   evictionHard: memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5%
   featureGates:
     CSIMigrationAWS: "true"
     CSIMigrationAWSComplete: "true"
   hostnameOverride: '@aws'

@olemarkus
Member

Okay, I see the problem. Thanks for reporting this.
