resize pv failed #72393

Closed
z8772083 opened this issue Dec 28, 2018 · 29 comments · Fixed by #72431
Labels: kind/bug, lifecycle/rotten, sig/storage

Comments

@z8772083

What happened:
kubectl create -f ceph-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-expand-test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-storage

and the result is OK:
pvc-expand-test Bound pvc-962a674c-0a73-11e9-b2d8-0050569bfc0f 1Gi RWO ceph-storage 1h

Now I want to resize the PV from 1Gi to 2Gi, so I edit the PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-expand-test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: ceph-storage

kubectl describe pvc pvc-expand-test

Name:          pvc-expand-test
Namespace:     default
StorageClass:  ceph-storage
Status:        Bound
Volume:        pvc-962a674c-0a73-11e9-b2d8-0050569bfc0f
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-provisioner=ceph.com/rbd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWO
Conditions:
  Type       Status  LastProbeTime                     LastTransitionTime                Reason  Message
  ----       ------  -----------------                 ------------------                ------  -------
  Resizing   True    Mon, 01 Jan 0001 00:00:00 +0000   Fri, 28 Dec 2018 15:40:04 +0800           
Events:
  Type     Reason              Age                From           Message
  ----     ------              ----               ----           -------
  Warning  VolumeResizeFailed  24s (x21 over 1h)  volume_expand  Error expanding volume "default/pvc-expand-test" of plugin kubernetes.io/rbd : rbd info failed, error: can not get image size info kubernetes-dynamic-pvc-962c05a3-0a73-11e9-951a-0a580af404fa: rbd image 'kubernetes-dynamic-pvc-962c05a3-0a73-11e9-951a-0a580af404fa':
           size 1 GiB in 256 objects
           order 22 (4 MiB objects)
           id: 374526b8b4567
           block_name_prefix: rbd_data.374526b8b4567
           format: 2
           features: 
           op_features: 
           flags: 
           create_timestamp: Fri Dec 28 07:38:36 2018

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
ceph storageclass yaml:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: ceph-storage
provisioner: ceph.com/rbd
parameters:
  monitors: 192.168.20.195:6789,192.168.20.196:6789,192.168.20.197:6789
  adminId: admin
  adminSecretNamespace: default
  adminSecretName: ceph-secret
  pool: k8s-rbd
  userId: admin
  userSecretName: ceph-secret
  fsType: ext4
  imageFormat: "2"
allowVolumeExpansion: true

ceph secret:


apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
type: "kubernetes.io/rbd"
data:
  key: QVFCaVhBWmNJcdfa1QTJmZXJRRmNRLzBtSnlYZ1BEdmlMakE9PQ==

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.5", GitCommit:"753b2dbc622f5cc417845f0ff8a77f539a4213ea", GitTreeState:"clean", BuildDate:"2018-11-26T14:41:50Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.5", GitCommit:"753b2dbc622f5cc417845f0ff8a77f539a4213ea", GitTreeState:"clean", BuildDate:"2018-11-26T14:31:35Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release):
    CentOS Linux release 7.4.1708 (Core)

  • Kernel (e.g. uname -a):
    Linux m01 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
    kubeadm

  • Others:
    ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)

/kind bug

k8s-ci-robot added the kind/bug and needs-sig labels on Dec 28, 2018
@z8772083
Author

@kubernetes/sig-storage-bugs

k8s-ci-robot added the sig/storage label and removed the needs-sig label on Dec 28, 2018
@k8s-ci-robot
Contributor

@z8772083: Reiterating the mentions to trigger a notification:
@kubernetes/sig-storage-bugs


@mlmhl
Contributor

mlmhl commented Dec 29, 2018

What's the version of your ceph cluster? k8s uses the rbd info command to get the rbd image's size, and it expects the size unit in the output to be MB. However, in your environment the unit is GiB:

size 1 GiB in 256 objects
order 22 (4 MiB objects)
...
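
For context, here is a minimal sketch of the kind of fixed-unit text parsing that breaks on this output (illustrative only, not the actual Kubernetes code; the regex and function names are made up):

package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sizeMBRegex only understands a size reported in MB, e.g.
// "size 1024 MB in 256 objects" as printed by older Ceph releases.
var sizeMBRegex = regexp.MustCompile(`size (\d+) MB in`)

// parseSizeMB returns the image size in MB, or an error if the expected
// "size <n> MB in" fragment is missing from the rbd info output.
func parseSizeMB(rbdInfo string) (int, error) {
	m := sizeMBRegex.FindStringSubmatch(rbdInfo)
	if m == nil {
		return 0, fmt.Errorf("can not get image size info from %q", rbdInfo)
	}
	return strconv.Atoi(m[1])
}

func main() {
	// Ceph mimic (13.x) prints "1 GiB" instead of "1024 MB", so the pattern
	// never matches and the resize controller reports a failure.
	_, err := parseSizeMB("rbd image 'foo': size 1 GiB in 256 objects")
	fmt.Println(err)
}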

@z8772083
Author

@mlmhl ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)

@mlmhl
Contributor

mlmhl commented Dec 29, 2018

It seems the output format differs between Ceph versions. I will try to fix it.

@mlmhl
Contributor

mlmhl commented Jan 4, 2019

/assign

@polym

polym commented Jan 8, 2019

@mlmhl your PR has been merged into the master branch. When will this fix land in Kubernetes v1.11?

@mlmhl
Contributor

mlmhl commented Jan 9, 2019

Sorry for the delay, I've sent cherry-pick PRs: #72718 #72719 #72720

@lucaim

lucaim commented Jan 29, 2019

@mlmhl It seems this change is not working in some cases:

  Type     Reason              Age                    From           Message
  ----     ------              ----                   ----           -------
  Warning  VolumeResizeFailed  100s (x53 over 3h29m)  volume_expand  (combined from similar events): Error expanding volume "namespace/prometheus1-server" of plugin kubernetes.io/rbd : rbd info failed, error: parse rbd info output failed: 2019-01-28 16:47:24.878565 7f7eef611100 -1 did not load config file, using default settings.
{"name":"kubernetes-dynamic-pvc-155ee065-4ede-11e8-b665-02420a141559","size":536870912000,"objects":128000,"order":22,"object_size":4194304,"block_name_prefix":"rbd_data.259f7374b0dc51","format":2,"features":[],"flags":[]}, invalid character '-' after top-level value

I think this is due to two concurrent conditions:
1. The ceph client still complains on stderr about not being able to load a config file, even though there is no need for one since a key is supplied (hence the stderr output 2019-01-28 16:47:24.878565 7f7eef611100 -1 did not load config file, using default settings.). This happens not only in hyperkube.
2. It seems we capture both stdout and stderr from the execution of the rbd info command (note the output in the error from exec.Run), and the JSON unmarshalling then fails. To be clear, the useless warning is on stderr, while the JSON output is on stdout.

I worked around it by creating empty files for /etc/ceph/ceph.conf and /etc/ceph/ceph.keyring so that the ceph client no longer emits the warning, but I feel it would be much better to parse only stdout (see the sketch below).
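
For illustration, here is a minimal sketch (not the actual Kubernetes implementation; the function name, argument list, and exact rbd flags are assumptions based on this thread) of keeping stdout and stderr separate so the warning never reaches the JSON decoder:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
)

// rbdImageInfo holds the fields of `rbd info --format json` output that
// matter for resizing.
type rbdImageInfo struct {
	Name string `json:"name"`
	Size int64  `json:"size"`
}

// rbdImageSize runs rbd info with JSON output and decodes only stdout.
// Warnings such as "did not load config file, using default settings" go to
// stderr and therefore never reach the JSON decoder.
func rbdImageSize(pool, image, id, key, mon string) (int64, error) {
	cmd := exec.Command("rbd", "info", pool+"/"+image,
		"--id", id, "--key="+key, "-m", mon, "--format", "json")

	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout // the JSON document
	cmd.Stderr = &stderr // warnings only; kept around for error reporting

	if err := cmd.Run(); err != nil {
		return 0, fmt.Errorf("rbd info failed: %v, stderr: %s", err, stderr.String())
	}

	var info rbdImageInfo
	if err := json.Unmarshal(stdout.Bytes(), &info); err != nil {
		return 0, fmt.Errorf("parse rbd info output failed: %v", err)
	}
	return info.Size, nil
}

func main() {
	size, err := rbdImageSize("k8s-rbd", "some-image", "admin", "<key>", "192.168.20.195:6789")
	fmt.Println(size, err)
}

If the two streams are combined into one buffer instead, the decoder sees the stderr timestamp first, parses 2019 as a bare number, and then chokes on the '-' that follows, which would match the "invalid character '-' after top-level value" error above.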

For reference, the relevant source change in #72431: https://github.com/kubernetes/kubernetes/pull/72431/files#diff-36f18e327f36d95eb1333c4a18781184R704

Thanks

@hasonhai

hasonhai commented Feb 5, 2019

@lucaim could you tell us where you put the empty ceph.conf and ceph.keyring files? On the node running kubelet?

@lucaim

lucaim commented Feb 5, 2019

@hasonhai Inside the kube-controller-manager pod, since the rbd resize is executed by that component.

@xiyangxdy

@lucaim I had the same problem; I added ceph.conf and ceph.keyring and that solved it, but I think k8s falls back to Ceph's default user before reading ceph.conf and ceph.keyring. The following code supports my conjecture: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/rbd/rbd_util.go#L672
Ultimately, I would like k8s to locate those two files through configurable Ceph parameters.
What do you think?

@lucaim

lucaim commented Mar 6, 2019

@xiyangxdy as far as I know you do not need to configure those two files if you use a key. That is still a hypothesis, as this switch seems to be undocumented in the Ceph version we are using: http://docs.ceph.com/docs/luminous/man/8/rbd/
Anyway, the warning can safely be ignored, and I do not see why we should parse stderr when there is no error.

@xiyangxdy

@lucaim I added the two files with empty content before and got no error; later I replaced them with symlinks named ceph.conf and ceph.keyring (pointing to the configuration file and keyring generated by my cluster), and ceph still did not report an error. I think ceph only needs the two files to exist and does not care about their contents. Fortunately, the current workaround solves the problem.

@lucaim

lucaim commented Mar 7, 2019

@xiyangxdy this matches my experience (see above, I created the empty files too and got no error), but another option is simply not to parse stderr, which I find "cleaner": there is no error in this case, not even in the exit code, and what we get is just a warning.

@cschramm

cschramm commented Jul 1, 2019

Same issue as @lucaim here. The warnings about the missing etc files break the JSON output. Adding empty files in the kube-controller-manager container fixes resizing.

@Bluewind

Shouldn't this be reopened until the bug is fixed?

Creating the two files also "solves" the issue for us.

@msau42
Member

msau42 commented Jul 17, 2019

/reopen
cc @gnufied @humblec

@k8s-ci-robot
Contributor

@msau42: Reopened this issue.


k8s-ci-robot reopened this on Jul 17, 2019
@humblec
Contributor

humblec commented Jul 18, 2019

@lucaim @Bluewind which version of Ceph is your cluster running?

@Bluewind

We are running a ceph cluster with ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable). kube-controller-manager is running rancher/hyperkube:v1.14.3-rancher1 which internally uses ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e). Also just to prevent any confusion: I've used kubectl edit to resize the PVC, not rancher.

@Bluewind

This also seems related to #66757.

@lucaim

lucaim commented Jul 18, 2019

@humblec We are running ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable)

@adampl

adampl commented Oct 11, 2019

Are there plans to resolve this issue?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jan 9, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 8, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.


@adampl

adampl commented Dec 9, 2020

Answering myself - apparently it has been fixed in k8s 1.20: #92027
