Federation cluster scripts accidentally delete PV's. #46380

Closed
madhusudancs opened this issue May 24, 2017 · 11 comments
@madhusudancs
Contributor

We have a PVC that uses the alpha annotation to dynamically provision a GCE PD/PV. Sometimes in our test environment, the dynamically provisioned PV gets automatically deleted by the PV controller without any of us deleting the PVC. I am attaching the controller manager logs here: kube-controller-manager.log-20170523-1495580401.gz

Interesting bits in the logs start at:

I0523 22:08:01.117123       5 gce_util.go:122] Successfully created GCE PD volume jenkins-us-central1-f--pvc-4723f163-4004-11e7-a75a-42010a80000a

The namespaced name of the claim is f8n-system-agent-pr-93-0/e2e-f8n-agent-pr-93-0-apiserver-etcd-claim

Here is the deployment YAML:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    federation.alpha.kubernetes.io/federation-name: e2e-f8n-agent-pr-93-0
  labels:
    app: federated-cluster
  name: e2e-f8n-agent-pr-93-0-apiserver
  namespace: f8n-system-agent-pr-93-0
spec:
  replicas: 1
  selector:
    matchLabels:
      app: federated-cluster
      module: federation-apiserver
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        federation.alpha.kubernetes.io/federation-name: e2e-f8n-agent-pr-93-0
      creationTimestamp: null
      labels:
        app: federated-cluster
        module: federation-apiserver
      name: e2e-f8n-agent-pr-93-0-apiserver
    spec:
      containers:
      - command:
        - /hyperkube
        - federation-apiserver
        - --admission-control=NamespaceLifecycle
        - --advertise-address=104.154.166.121
        - --basic-auth-file=/etc/federation/apiserver/basicauth.csv
        - --bind-address=0.0.0.0
        - --client-ca-file=/etc/federation/apiserver/ca.crt
        - --etcd-servers=http://localhost:2379
        - --secure-port=8443
        - --tls-cert-file=/etc/federation/apiserver/server.crt
        - --tls-private-key-file=/etc/federation/apiserver/server.key
        - --token-auth-file=/etc/federation/apiserver/token.csv
        - --v=4
        image: gcr.io/k8s-jkns-pr-bldr-e2e-gce-fdrtn/hyperkube-amd64:v1.7.0-alpha.4.362_c5319821fe72fa
        imagePullPolicy: IfNotPresent
        name: apiserver
        ports:
        - containerPort: 8443
          name: https
          protocol: TCP
        - containerPort: 8080
          name: local
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/federation/apiserver
          name: e2e-f8n-agent-pr-93-0-apiserver-credentials
          readOnly: true
      - command:
        - /usr/local/bin/etcd
        - --data-dir
        - /var/etcd/data
        image: gcr.io/google_containers/etcd:3.0.17
        imagePullPolicy: IfNotPresent
        name: etcd
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/etcd
          name: etcddata
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: e2e-f8n-agent-pr-93-0-apiserver-credentials
        secret:
          defaultMode: 420
          secretName: e2e-f8n-agent-pr-93-0-apiserver-credentials
      - name: etcddata
        persistentVolumeClaim:
          claimName: e2e-f8n-agent-pr-93-0-apiserver-etcd-claim

cc @kubernetes/sig-storage-bugs @saad-ali @kubernetes/sig-federation-bugs

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/federation labels May 24, 2017
@madhusudancs madhusudancs changed the title PV controller deletes the dynamically provisioned PV without any intervention PV controller deletes dynamically provisioned PV without any intervention May 24, 2017
@madhusudancs
Contributor Author

PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    federation.alpha.kubernetes.io/federation-name: e2e-f8n-agent-pr-93-0
    volume.alpha.kubernetes.io/storage-class: "yes"
  labels:
    app: federated-cluster
  name: e2e-f8n-agent-pr-93-0-apiserver-etcd-claim
  namespace: e2e-f8n-agent-pr-93-0
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

@wongma7
Contributor

wongma7 commented May 24, 2017

The PV got deleted somehow, so the PVC entered the Lost state. If we rule out that the PV was accidentally deleted (or that it wasn't deleted at all), then potentially there's an issue with the PV controller's cache getting out of sync with reality, or something of that nature.

I0523 22:08:01.361501 5 pv_controller_base.go:215] volume "pvc-4723f163-4004-11e7-a75a-42010a80000a" deleted

edit: to clarify, the above line only proves it's deleting the volume from the internal cache, which it should only do if it observes a PV deletion event. Not trying to state the obvious, just thinking out loud...

@wongma7
Contributor

wongma7 commented May 24, 2017

I don't see "doDeleteVolume [pvc-4723f163-4004-11e7-a75a-42010a80000a" so I'm inclined to think the PV object was accidentally deleted by a user.

ping @jsafrane since this involves PV controller

@msau42
Member

msau42 commented May 24, 2017

@madhusudancs can you also attach apiserver logs?

@madhusudancs
Contributor Author

@msau42 unfortunately we don't have the API server logs corresponding to the attached controller-manager logs; that cluster has already been torn down. I can attach the API server logs from a different instance when this happens again.

@jsafrane
Member

jsafrane commented May 26, 2017

@wongma7 is right, it seems that something other than the PV controller deleted the PV named pvc-4723f163-4004-11e7-a75a-42010a80000a.
If the controller itself had deleted the volume, we would see at least:

  • deleteVolumeOperation [%s] started from PV controller
  • Successfully deleted GCE PD volume %s from volume plugin (at level 2!)
  • volume %q deleted from the controller again (at level 2!)
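
A quick way to check the attached controller-manager log for those markers (just a grep sketch, using the messages above with this issue's PV name substituted for the placeholders):

# If neither message appears, the PV controller never ran its own delete
# operation for this volume, so something else removed the PV object.
grep -E "deleteVolumeOperation \[pvc-4723f163-4004-11e7-a75a-42010a80000a|Successfully deleted GCE PD volume pvc-4723f163-4004-11e7-a75a-42010a80000a" kube-controller-manager.log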

Looking at the log, there are quite a lot of Lost PVCs... Some of the corresponding PVs are deleted very quickly after they're provisioned, some of them survive for a couple of minutes, and they are deleted in batches. Here is a log excerpt from the controller where it provisioned 3 PVs over 3 minutes and all of them were deleted at the same time by something:

I0523 22:28:51.909294       5 pv_controller.go:1414] volume "pvc-30852a65-4007-11e7-a75a-42010a80000a" provisioned for claim "f8n-system-agent-pr-33-0/e2e-f8n-agent-pr-33-0-apiserver-etcd-claim"
I0523 22:29:35.778848       5 pv_controller.go:1414] volume "pvc-4aab8a3b-4007-11e7-a75a-42010a80000a" provisioned for claim "f8n-system-agent-pr-39-0/e2e-f8n-agent-pr-39-0-apiserver-etcd-claim"
I0523 22:30:28.335114       5 pv_controller.go:1414] volume "pvc-69f8571d-4007-11e7-a75a-42010a80000a" provisioned for claim "f8n-system-agent-pr-13-0/e2e-f8n-agent-pr-13-0-apiserver-etcd-claim"
I0523 22:31:01.482932       5 pv_controller.go:653] claim "f8n-system-agent-pr-33-0/e2e-f8n-agent-pr-33-0-apiserver-etcd-claim" entered phase "Lost"
I0523 22:31:01.505094       5 pv_controller.go:653] claim "f8n-system-agent-pr-39-0/e2e-f8n-agent-pr-39-0-apiserver-etcd-claim" entered phase "Lost"
I0523 22:31:01.522478       5 pv_controller.go:653] claim "f8n-system-agent-pr-13-0/e2e-f8n-agent-pr-13-0-apiserver-etcd-claim" entered phase "Lost"

Something must be watching the PVs periodically and deleting them. Do you have any 3rd-party controllers / provisioners? Do you accidentally run a second controller-manager in parallel? Is the GCE PD deleted too, or just its Kubernetes PV?

Watching the API server logs could help, especially if you could tell who deleted the PV object.
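
One way to narrow that down, as a rough sketch (this assumes the default apiserver request logging, where the HTTP verb and resource path appear on one line; an audit log would additionally show the requesting user):

# Find the DELETE request against the PV object in the apiserver log.
grep -E "DELETE .*persistentvolumes/pvc-4723f163-4004-11e7-a75a-42010a80000a" kube-apiserver.log
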

madhusudancs added a commit to madhusudancs/kubernetes that referenced this issue Jun 4, 2017
PV is a non-namespaced resource. Running `kubectl delete pv --all`, even
with `--namespace`, is going to delete all the PVs in the cluster. This
is a dangerous operation, and PVs should not be deleted this way.

Instead we now retrieve the PVs bound to the PVCs in the namespace we
are deleting and delete only those PVs.

Fixes issue kubernetes#46380.
@madhusudancs
Contributor Author

@jsafrane @wongma7 you were right. We were running `kubectl delete pvc,pv,pods,... --namespace=${FEDERATION_NAME} --all` without paying close attention to what was going on there. Fix sent in #46945.
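
Not the actual change from #46945, just a rough sketch of the approach the commit message above describes, assuming ${FEDERATION_NAME} holds the namespace being torn down:

# Record the PV names bound to this namespace's PVCs before deleting anything.
bound_pvs=$(kubectl get pvc --namespace "${FEDERATION_NAME}" -o jsonpath='{.items[*].spec.volumeName}')

# Delete the namespaced resources as before, but leave the cluster-scoped pv
# type out of the list (other namespaced types from the original command elided here).
kubectl delete pvc,pods --namespace "${FEDERATION_NAME}" --all

# Delete only the PVs that were bound to this namespace's claims.
for pv in ${bound_pvs}; do
  kubectl delete pv "${pv}"
done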

@madhusudancs
Contributor Author

/assign

@madhusudancs madhusudancs added this to the v1.7 milestone Jun 4, 2017
mrIncompetent pushed a commit to kubermatic/kubernetes that referenced this issue Jun 6, 2017
@saad-ali
Member

saad-ali commented Jun 7, 2017

Good debugging @jsafrane!

@saad-ali saad-ali removed the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jun 7, 2017
@marun
Contributor

marun commented Jun 12, 2017

@madhusudancs Is there more to do with this issue that wasn't addressed by #46945?

@ghost ghost changed the title PV controller deletes dynamically provisioned PV without any intervention Federation cluster scripts accidentally delete PV's. Jun 12, 2017
@ghost

ghost commented Jun 12, 2017

Renamed and closed, as #46945 has merged.

@ghost ghost closed this as completed Jun 12, 2017