Garbage collector deleted wrong resources #88097

Open
luisdavim opened this issue Feb 12, 2020 · 4 comments

luisdavim commented Feb 12, 2020

What happened:
Today I ran into a problem that looks like a race condition in how kube-controller-manager primes its cache before starting garbage collection. We have a cluster with hundreds of namespaces and thousands of custom resources that have parent-child relationships. During a cluster upgrade, while the control plane nodes were being replaced, the garbage collector deleted some resources that it was not supposed to: they had proper owner references, and the owners were alive and well.

Looking at the audit logs I can see a spike in the number of objects being deleted by the garbage collector during the upgrade.
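A spike like this can be quantified by bucketing the garbage collector's delete requests per minute. The sketch below is illustrative only. It assumes the audit log is written as JSON lines in the `audit.k8s.io/v1` event format, and that the garbage collector authenticates as `system:serviceaccount:kube-system:generic-garbage-collector`, which depends on how kube-controller-manager is configured. The function name `gc_deletes_per_minute` is hypothetical.

```python
import json
from collections import Counter

# Assumption: the username the garbage collector uses in this cluster.
# Adjust it to whatever appears in your own audit events.
GC_USER = "system:serviceaccount:kube-system:generic-garbage-collector"


def gc_deletes_per_minute(lines, gc_user=GC_USER):
    """Count completed delete requests by gc_user, bucketed per minute."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("verb") != "delete":
            continue
        # Count each request once, at the stage where the response is complete.
        if event.get("stage") != "ResponseComplete":
            continue
        if event.get("user", {}).get("username") != gc_user:
            continue
        # Timestamps look like 2020-02-12T10:15:30.000000Z; keep up to the minute.
        minute = event.get("requestReceivedTimestamp", "")[:16]
        counts[minute] += 1
    return counts
```

Called with an open audit log file as `lines`, it returns a `Counter` keyed by minute, which makes a burst of deletions during the upgrade window easy to spot.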

My theory is that the garbage collector started processing the child resources before their parents were cached. Would this be possible?
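The suspected failure mode can be illustrated with a toy model. This is not the kube-controller-manager implementation, only a sketch under the assumption that a collector decides from a local cache of owner UIDs. All names here (`Obj`, `dangling_children`, `guarded_dangling_children`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Obj:
    uid: str
    owner_uids: Tuple[str, ...] = ()


def dangling_children(children: Iterable[Obj], cached_owner_uids: Set[str]) -> List[str]:
    """UIDs of children whose owners are all missing from the cache.

    A collector that trusts an incomplete cache would treat these as orphans.
    """
    return sorted(
        c.uid
        for c in children
        if c.owner_uids and not any(u in cached_owner_uids for u in c.owner_uids)
    )


def guarded_dangling_children(
    children: Iterable[Obj], cached_owner_uids: Set[str], cache_synced: bool
) -> List[str]:
    """Same check, but refuse to decide anything until the cache is complete."""
    if not cache_synced:
        return []
    return dangling_children(children, cached_owner_uids)
```

In this model, a child whose parent exists but has not been cached yet is reported as dangling by the unguarded check, which matches the behaviour described in this theory. Whether the real collector can reach that state is exactly the open question.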

What you expected to happen:
The healthy, non-orphaned resources should not have been deleted.

How to reproduce it (as minimally and precisely as possible):
Not really sure, but this seems to be a race condition. I think that killing the leader kube-controller-manager in a cluster with a large number of custom resources that have parent-child relationships is the way to reproduce it. I was able to reproduce it 3 times, but not consistently.

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.9", GitCommit:"2e808b7cb054ee242b68e62455323aa783991f03", GitTreeState:"clean", BuildDate:"2020-01-18T23:24:23Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Azure VMs/Scale sets
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
Linux cpscu0c52000000 5.0.0-1027-azure #29~18.04.1-Ubuntu SMP Mon Nov 25 21:18:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

neolit123 (Member) commented Feb 12, 2020

/sig api-machinery

liggitt (Member) commented Feb 13, 2020

this is likely the same issue as #65200

are your owner resources namespaced or cluster-scoped?
are your child resources namespaced or cluster-scoped?
if both are namespaced, are the child resources in the same namespace as the parent?
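These questions map onto the documented scope rules for owner references: a cluster-scoped owner is valid for any dependent, a namespaced owner must be in the same namespace as its dependent, and a cluster-scoped dependent cannot have a namespaced owner. A minimal sketch of that check follows. The function name is hypothetical, and `None` is used here to stand for "cluster-scoped".

```python
from typing import Optional


def owner_reference_allowed(
    child_namespace: Optional[str], owner_namespace: Optional[str]
) -> bool:
    """Return True if the owner/dependent scopes form a valid combination.

    None means the object is cluster-scoped.
    """
    if owner_namespace is None:
        # Cluster-scoped owners are valid for any dependent.
        return True
    if child_namespace is None:
        # A cluster-scoped dependent cannot point at a namespaced owner.
        return False
    # Both namespaced: cross-namespace owner references are not allowed.
    return child_namespace == owner_namespace
```

For example, `owner_reference_allowed("team-a", "team-a")` holds, while `owner_reference_allowed("team-a", "team-b")` does not.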

luisdavim (Author) commented Feb 13, 2020

Hi, the resources are all namespaced and the parents and children are in the same namespace.

luisdavim (Author) commented Feb 13, 2020

The description in the issue you linked seems like the opposite of what I'm describing. In my case, the children had valid owner references pointing to valid parents that were not marked for deletion. Both the parents and the children were namespaced and in the same namespace.

@liggitt liggitt self-assigned this Feb 17, 2020