Kubernetes deleted volume unexpectedly while leaving PV/PVC state unchanged #44372

Closed
zhan849 opened this issue Apr 12, 2017 · 10 comments
Labels
sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@zhan849
Contributor

zhan849 commented Apr 12, 2017

Is this a request for help? NO

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

Kubernetes volume deleted unexpectedly. Kubernetes volume deleted after apiserver panic

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Kubernetes version (use kubectl version):

root@2b1dfb511ac9:~# kubectl version                 
Client Version: version.Info{Major:"1", Minor:"4+", GitVersion:"v1.4.3-39+397fdc3990005b", GitCommit:"397fdc3990005b3823a020d001de4b8381efff89", GitTreeState:"clean", BuildDate:"2017-03-16T22:22:25Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4+", GitVersion:"v1.4.3-39+397fdc3990005b", GitCommit:"397fdc3990005b3823a020d001de4b8381efff89", GitTreeState:"clean", BuildDate:"2017-03-16T22:13:52Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: aws m3.large instance
  • OS (e.g. from /etc/os-release):
root@ip-10-128-0-9:/home/admin# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a): Linux ip-10-128-0-9 4.4.41-k8s #1 SMP Mon Jan 9 15:34:39 UTC 2017 x86_64 GNU/Linux
  • Install tools: kube-up (For node setup only)
  • Others:

What happened:
The Kubernetes controller manager deleted an AWS EBS volume unintentionally (no one issued a DELETE API call), while leaving the PV/PVC state unchanged. Before the volume was deleted, the API server had a couple of goroutines crash due to timeouts.

What you expected to happen:

  1. The EBS volume should not be deleted unintentionally.
  2. Even if the EBS volume is deleted, the PV/PVC status should reflect it.

How to reproduce it (as minimally and precisely as possible):
Cannot reproduce it reliably.

Anything else we need to know:
Here are some details:

PV status (After volume is deleted)

root@2b1dfb511ac9:~# kubectl --kubeconfig /tmp/ax_kube/cluster_hz-mac-ttt93-b1c768d0-1f09-11e7-b6c5-0242ac110007.conf --namespace axsys get pv pvc-e34de0c7-1f0a-11e7-bcca-0a92a07a4e23 -o yaml 
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: aws-ebs-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
  creationTimestamp: 2017-04-11T23:03:50Z
  labels:
    failure-domain.beta.kubernetes.io/region: us-west-2
    failure-domain.beta.kubernetes.io/zone: us-west-2c
  name: pvc-e34de0c7-1f0a-11e7-bcca-0a92a07a4e23
  resourceVersion: "425"
  selfLink: /api/v1/persistentvolumes/pvc-e34de0c7-1f0a-11e7-bcca-0a92a07a4e23
  uid: 2023c519-1f0b-11e7-bcca-0a92a07a4e23
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://us-west-2c/vol-0b3bc6fd027da7cf4
  capacity:
    storage: 30Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: gateway-persistent-storage
    namespace: axsys
    resourceVersion: "241"
    uid: e34de0c7-1f0a-11e7-bcca-0a92a07a4e23
  persistentVolumeReclaimPolicy: Delete
status:
  phase: Bound

PVC status (after volume is deleted):

root@2b1dfb511ac9:~# kubectl --kubeconfig /tmp/ax_kube/cluster_hz-mac-ttt93-b1c768d0-1f09-11e7-b6c5-0242ac110007.conf --namespace axsys get pvc gateway-persistent-storage -o yaml             
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.alpha.kubernetes.io/storage-class: anything
  creationTimestamp: 2017-04-11T23:02:08Z
  labels:
    app: gateway
  name: gateway-persistent-storage
  namespace: axsys
  resourceVersion: "427"
  selfLink: /api/v1/namespaces/axsys/persistentvolumeclaims/gateway-persistent-storage
  uid: e34de0c7-1f0a-11e7-bcca-0a92a07a4e23
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
  volumeName: pvc-e34de0c7-1f0a-11e7-bcca-0a92a07a4e23
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 30Gi
  phase: Bound

kube-controller-manager log entries for this volume (as well as the PV and PVC):

I0411 23:02:08.346500       5 util.go:326] Creating volume for PVC "gateway-persistent-storage"; chose zone="us-west-2c" from zones=["us-west-2c"]
I0411 23:02:08.382722       5 aws.go:1077] Found instances in zones map[us-west-2c:{}]
......
I0411 23:02:08.631235       5 aws_util.go:123] Successfully created EBS Disk volume aws://us-west-2c/vol-0b3bc6fd027da7cf4
......
I0411 23:03:50.249778       5 controller.go:669] volume "pvc-e34de0c7-1f0a-11e7-bcca-0a92a07a4e23" entered phase "Bound"
I0411 23:03:50.249827       5 controller.go:805] volume "pvc-e34de0c7-1f0a-11e7-bcca-0a92a07a4e23" bound to claim "axsys/gateway-persistent-storage"
......
I0411 23:04:54.685169       5 aws_util.go:56] Successfully deleted EBS Disk volume aws://us-west-2c/vol-0b3bc6fd027da7cf4
I0411 23:04:54.685197       5 controller.go:1179] volume "pvc-e34de0c7-1f0a-11e7-bcca-0a92a07a4e23" deleted

kube-apiserver panics

E0411 23:03:53.202143       5 apiserver.go:516] apiserver was unable to write a JSON response: http: Handler timeout
E0411 23:03:53.202196       5 errors.go:63] apiserver received an error that is not an unversioned.Status: http: Handler timeout
I0411 23:03:53.202292       5 trace.go:61] Trace "Create /api/v1/persistentvolumes" (started 2017-04-11 23:02:08.611031602 +0000 UTC):
[27.501µs] [27.501µs] About to convert to expected version
[141.219µs] [113.718µs] Conversion done
[1m44.589453767s] [1m44.589312548s] About to store object in database
[1m44.591236532s] [1.782765ms] END
I0411 23:03:53.202352       5 handlers.go:162] POST /api/v1/persistentvolumes: (1m44.591426253s) 500
goroutine 5551 [running]:
k8s.io/kubernetes/pkg/httplog.(*respLogger).recordStatus(0xc82398eaf0, 0x1f4)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/httplog/log.go:219 +0xa3
k8s.io/kubernetes/pkg/httplog.(*respLogger).WriteHeader(0xc82398eaf0, 0x1f4)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/httplog/log.go:198 +0x2b
k8s.io/kubernetes/pkg/apiserver/metrics.(*responseWriterDelegator).WriteHeader(0xc823ad8ff0, 0x1f4)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/metrics/metrics.go:117 +0x4c
k8s.io/kubernetes/pkg/apiserver.errorJSONFatal(0x0, 0x0, 0x7f14ea7b3460, 0xc822d04300, 0x7f14ea5b7d50, 0xc820d1cab0, 0xc820010ae0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/apiserver.go:526 +0x47a
k8s.io/kubernetes/pkg/apiserver.writeNegotiated(0x7f14ea7b33e8, 0xc820564f50, 0x0, 0x0, 0x3f1ae88, 0x2, 0x7f14ea5b7d50, 0xc820d1cab0, 0xc822b4eee0, 0x199, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/apiserver.go:496 +0x3d4
k8s.io/kubernetes/pkg/apiserver.errorNegotiated(0x7f14ea608528, 0xc822d6b780, 0x7f14ea7b33e8, 0xc820564f50, 0x0, 0x0, 0x3f1ae88, 0x2, 0x7f14ea5b7d50, 0xc820d1cab0, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/apiserver.go:509 +0x1a9
k8s.io/kubernetes/pkg/apiserver.(*RequestScope).err(0xc8200afc00, 0x7f14ea608528, 0xc822d6b780, 0x7f14ea5b7d50, 0xc820d1cab0, 0xc822b4eee0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/resthandler.go:87 +0x191
k8s.io/kubernetes/pkg/apiserver.createHandler.func1(0xc823ad8f60, 0xc8239abaa0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/resthandler.go:410 +0x14cd
k8s.io/kubernetes/pkg/apiserver/metrics.InstrumentRouteFunc.func1(0xc823ad8f60, 0xc8239abaa0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/metrics/metrics.go:101 +0x25f
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch(0xc820531200, 0x7f14ea5b79c0, 0xc82398eaf0, 0xc822b4eee0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:272 +0xf30
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).(k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.dispatch)-fm(0x7f14ea5b79c0, 0xc82398eaf0, 0xc822b4eee0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:120 +0x3e
net/http.HandlerFunc.ServeHTTP(0xc820ca57d0, 0x7f14ea5b79c0, 0xc82398eaf0, 0xc822b4eee0)
        /usr/local/go/src/net/http/server.go:1618 +0x3a
net/http.(*ServeMux).ServeHTTP(0xc8200336e0, 0x7f14ea5b79c0, 0xc82398eaf0, 0xc822b4eee0)
        /usr/local/go/src/net/http/server.go:1910 +0x17d
k8s.io/kubernetes/pkg/api.NewRequestContextFilter.func1(0x7f14ea5b79c0, 0xc82398eaf0, 0xc822b4eee0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/api/requestcontext.go:101 +0x157
net/http.HandlerFunc.ServeHTTP(0xc8208296c0, 0x7f14ea5b79c0, 0xc82398eaf0, 0xc822b4eee0)
        /usr/local/go/src/net/http/server.go:1618 +0x3a
k8s.io/kubernetes/pkg/apiserver.RecoverPanics.func1(0x7f14ea5b79c0, 0xc82398eaf0, 0xc822b4eee0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:161 +0x1f3
net/http.HandlerFunc.ServeHTTP(0xc821c08460, 0x7f14ea5b7988, 0xc820d1ca98, 0xc822b4eee0)
        /usr/local/go/src/net/http/server.go:1618 +0x3a
k8s.io/kubernetes/pkg/apiserver.(*timeoutHandler).ServeHTTP.func1(0xc821c084a0, 0x7f14ea5b7948, 0xc820d1ca98, 0xc822b4eee0, 0xc8239ab9e0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:194 +0x8d
created by k8s.io/kubernetes/pkg/apiserver.(*timeoutHandler).ServeHTTP
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:196 +0x15b

logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12+\n\x04\n\x00\x12\x00\x12\aFailure\x1a\x15http: Handler timeout\"\x000\xf4\x03\x1a\x00\"\x00"
 [[kube-controller-manager/v1.4.3 (linux/amd64) kubernetes/397fdc3/persistent-volume-binder] 127.0.0.1:47348]

(The above happened a couple of times before)

......

I0411 23:04:27.534961       5 trace.go:61] Trace "Create /api/v1/persistentvolumes" (started 2017-04-11 23:04:19.443869726 +0000 UTC):
[25.663µs] [25.663µs] About to convert to expected version
[132.76µs] [107.097µs] Conversion done
[8.08948191s] [8.08934915s] About to store object in database
[8.091061632s] [1.579722ms] END
I0411 23:04:27.535009       5 handlers.go:162] POST /api/v1/persistentvolumes: (8.091446476s) 409 [[kube-controller-manager/v1.4.3 (linux/amd64) kubernetes/397fdc3/persistent-volume-binder] 127.0.0.1:48064]

......

I0411 23:04:44.578049       5 trace.go:61] Trace "Create /api/v1/persistentvolumes" (started 2017-04-11 23:04:37.536507929 +0000 UTC):
[28.013µs] [28.013µs] About to convert to expected version
[154.613µs] [126.6µs] Conversion done
[7.03989939s] [7.039744777s] About to store object in database
[7.041511925s] [1.612535ms] END
I0411 23:04:44.578299       5 handlers.go:162] POST /api/v1/persistentvolumes: (7.041757553s) 409 [[kube-controller-manager/v1.4.3 (linux/amd64) kubernetes/397fdc3/persistent-volume-binder] 127.0.0.1:48076]
@msau42
Member

msau42 commented Apr 13, 2017

cc @kubernetes/sig-storage-bugs

@jsafrane
Member

Can you please post the complete controller-manager and apiserver logs between 23:02:08.631235 (when the volume is created) and 23:04:54.685169 (when it's deleted)?

@zhan849
Contributor Author

zhan849 commented Apr 14, 2017

@jsafrane sorry, we have already torn down the cluster. But I tried to paste everything related to that PV/PVC/volume; the rest is irrelevant.

For the API server, that same goroutine panic happened 4~5 times, and the one I pasted was the last one. For the controller manager, that's really everything I have, as I copied and pasted the lines one by one, and anything in between was related to other controllers.

Is anything suspicious missing?

@jsafrane
Member

I want to see the reason why the PV was deleted. There should be at least `volume %q is released and reclaim policy %q will be executed` in the controller log.

There is a very slim possibility that the API server crashed after it had stored the new PV object but before sending a response to the controller confirming that the object was written. The controller then tries to delete the volume, and it won't try to delete the PV from Kubernetes because, from its point of view, the PV was never successfully written. There should be "Error creating provisioned PV object for claim %s: %v. Deleting the volume." in the controller-manager logs at level 3, and the same error should be visible in `kubectl describe pvc` as an event on the PVC.
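To make that failure mode concrete, here is a minimal, self-contained Go sketch of the provisioning error path described above (the interfaces, names, and types are illustrative stand-ins, not the real controller code): any error returned by the PV create call, including a handler timeout for a write that actually succeeded on the server, sends the controller down the volume-deletion branch.

```go
// Illustrative sketch only -- not the actual kube-controller-manager code.
package main

import (
	"errors"
	"fmt"
)

// pvClient stands in for the API-server client used by the PV controller.
type pvClient interface {
	CreatePV(name string) error // POST /api/v1/persistentvolumes
}

// volumePlugin stands in for the cloud volume plugin (here, AWS EBS).
type volumePlugin interface {
	DeleteVolume(volumeID string) error
}

// saveProvisionedPV mirrors the suspected failure path: any error from the
// create call makes the controller delete the freshly provisioned volume,
// even if the PV object was in fact stored and the "error" was only a
// timeout on the response.
func saveProvisionedPV(c pvClient, p volumePlugin, pvName, volumeID string) error {
	if err := c.CreatePV(pvName); err != nil {
		fmt.Printf("Error creating provisioned PV object for claim: %v. Deleting the volume.\n", err)
		return p.DeleteVolume(volumeID)
	}
	return nil
}

// timedOutClient simulates an API server that stores the PV but times out
// before acknowledging it, as in the traces above.
type timedOutClient struct{ stored map[string]bool }

func (c *timedOutClient) CreatePV(name string) error {
	c.stored[name] = true                      // the write succeeds...
	return errors.New("http: Handler timeout") // ...but the response never arrives
}

// fakeEBS simulates the AWS EBS plugin.
type fakeEBS struct{}

func (fakeEBS) DeleteVolume(id string) error {
	fmt.Println("deleted EBS volume", id)
	return nil
}

func main() {
	api := &timedOutClient{stored: map[string]bool{}}
	_ = saveProvisionedPV(api, fakeEBS{}, "pvc-e34de0c7", "vol-0b3bc6fd027da7cf4")
	// The outcome matches the report: the volume is gone, yet the PV object
	// is still stored and the PVC still appears Bound.
	fmt.Println("PV still stored:", api.stored["pvc-e34de0c7"])
}
```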

@zhan849
Contributor Author

zhan849 commented Apr 18, 2017

@jsafrane thanks for the detailed explanation. It might be because we are running with --v=2, so there is little to see. I will use `kubectl describe pvc` to get more information if it happens again. But it's weird that API server goroutines can crash, as there was no other intervention such as a host-level OOM kill.

Btw, is this slim possibility fixed/improved in a later version? Any pull request / bug I can refer to?

@jsafrane
Member

I created patch #44719, which fixes my suspicion from my previous comment, but it's hard to say whether it really fixes the bug you are reporting because I don't know if it's the same bug. Please collect the logs next time.

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 31, 2017
@0xmichalis
Contributor

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jun 4, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 4, 2017
@Calpicow
Contributor

Calpicow commented Jun 19, 2017

I'm having a similar issue. I'm not sure whether it is triggered by an apiserver panic, but something is causing my PVs to get deleted, which leaves the PVCs in a 'Lost' status. Most deletions are prevented since the volumes are in use, but unused ones are deleted about once every two weeks.

I0613 23:13:26.041679 1 pv_controller.go:441] volume "pvc-e7c1bc14-7e99-11e6-8738-06b2026391b7" is released and reclaim policy "Delete" will be executed
I0613 23:13:26.050315 1 panics.go:76] PUT /api/v1/persistentvolumes/pvc-e7c1bc14-7e99-11e6-8738-06b2026391b7/status: (8.297651ms) 200 [[hyperkube/v1.5.3 (linux/amd64) kubernetes/029c3a4/persistent-volume-binder] 127.0.0.1:40662]
I0613 23:13:26.050664 1 pv_controller.go:672] volume "pvc-e7c1bc14-7e99-11e6-8738-06b2026391b7" entered phase "Released"
I0613 23:13:26.054434 1 pv_controller.go:1155] isVolumeReleased[pvc-e7c1bc14-7e99-11e6-8738-06b2026391b7]: volume is released
E0613 23:13:26.275924 1 goroutinemap.go:164] Operation for "delete-pvc-e7c1bc14-7e99-11e6-8738-06b2026391b7[e89c8a48-7e99-11e6-8738-06b2026391b7]" failed. No retries permitted until 2017-06-13 23:13:26.775907733 +0000 UTC (durationBeforeRetry 500ms). Error: VolumeInUse: Volume vol-4f18efcb is currently attached to i-0acfa60b1af84faaf

@jsafrane
Member

@Calpicow, this looks like a different bug. The log snippet shows that a PVC was deleted by something and that Kubernetes tried to delete the PV as a consequence. It does not show a PVC becoming Lost because something deleted a PV.

Please create a new issue and post a longer controller-manager log snippet there, covering the point where a PVC becomes Lost plus a few minutes before.

@Calpicow
Contributor

Calpicow commented Jun 26, 2017

Before I create an issue... the PVC is definitely not getting deleted. Below is the result when the PV disappears:

NAMESPACE   NAME   STATUS   VOLUME                                     CAPACITY   ACCESSMODES   AGE
test        test   Lost     pvc-e7c1bc14-7e99-11e6-8738-06b2026391b7   0                        195d

k8s-github-robot pushed a commit that referenced this issue Aug 29, 2017
Automatic merge from submit-queue (batch tested with PRs 44719, 48454)

Fix handling of APIserver errors when saving provisioned PVs.

When the API server crashes *after* saving a provisioned PV and before sending
200 OK, the controller tries to save the PV again. In this case, it gets an
AlreadyExists error, which should be interpreted as success, not as an error.

In particular, the volume that corresponds to the PV should not be deleted in the
underlying storage.
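A minimal sketch of that interpretation, assuming the `IsAlreadyExists` helper from `k8s.io/apimachinery/pkg/api/errors` (the package path differed in older releases, and the surrounding function is illustrative, not the merged patch):

```go
// Illustrative sketch of the fixed interpretation -- not the merged patch.
package provisionsketch

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// interpretPVCreateError decides whether a PV create attempt counts as
// success. AlreadyExists means an earlier, timed-out request already stored
// the PV, so the attempt is treated as success and the underlying volume is
// NOT deleted.
func interpretPVCreateError(err error) error {
	if err == nil || apierrors.IsAlreadyExists(err) {
		return nil // PV is saved (now or by a previous attempt); keep the volume.
	}
	return err // genuine failure; the caller may delete the provisioned volume.
}
```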

Fixes #44372

```release-note
NONE
```

@kubernetes/sig-storage-pr-reviews