[GC] controller manager gets error "unable to get REST mapping for kind" for ownerRefs to TPR and add-on APIs #39816
Comments
well, the garbage collector still uses a static RESTMapper
Not recognizing it is fine and expected. But the controller manager keeps printing the error message without backing off. @caesarxuchao Can you confirm this is unintended? Or are we missing something here?
Right. I should have mentioned it: the controller manager is unresponsive and overloaded with error messages. Updated the title and top comment.
there are two things that need fixing here:
@liggitt
The ownerRef could have bad data, or it could reference a kind that is no longer part of the cluster; we need to define what happens in those cases. When it references a dynamic kind (either a kind contributed by an add-on server or by a thirdpartyresource), yes, it should re-discover on some interval to recognize the kind.
I have the same issue when launching an etcd cluster with the coreos etcd-operator. It causes the API server to run out of disk space as it logs infinitely. I have to keep restarting the controller container to reclaim disk space.
{"log":"I0118 09:51:08.806215 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.2.102\", UID:\"882b86ac-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.2.102 event: Registered Node 10.10.2.102 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.8087232Z"}
{"log":"I0118 09:51:08.806251 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.103\", UID:\"db9e7d24-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.103 event: Registered Node 10.10.3.103 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.808748906Z"}
{"log":"I0118 09:51:08.806268 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.104\", UID:\"d856a908-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.104 event: Registered Node 10.10.3.104 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.808759381Z"}
{"log":"I0118 09:51:08.806282 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.202\", UID:\"d175e1b3-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.202 event: Registered Node 10.10.3.202 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.808769116Z"}
{"log":"I0118 09:51:08.806296 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.204\", UID:\"c8bca4db-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.204 event: Registered Node 10.10.3.204 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.808778906Z"}
{"log":"I0118 09:51:08.806310 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.205\", UID:\"d54de405-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.205 event: Registered Node 10.10.3.205 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.808788798Z"}
{"log":"I0118 09:51:08.806324 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.102\", UID:\"d18c2b99-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.102 event: Registered Node 10.10.3.102 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.808798638Z"}
{"log":"I0118 09:51:08.806338 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.101\", UID:\"db492f64-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.101 event: Registered Node 10.10.3.101 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.808808629Z"}
{"log":"I0118 09:51:08.806352 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.201\", UID:\"d50dc664-dc0d-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.201 event: Registered Node 10.10.3.201 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.80881979Z"}
{"log":"I0118 09:51:08.806368 1 event.go:217] Event(api.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"10.10.3.203\", UID:\"0eea82fc-dc0e-11e6-9ff0-001e4f520ea4\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RegisteredNode' Node 10.10.3.203 event: Registered Node 10.10.3.203 in NodeController\n","stream":"stderr","time":"2017-01-18T09:51:08.808830097Z"}
{"log":"I0118 09:51:18.471149 1 garbagecollector.go:780] Garbage Collector: All monitored resources synced. Proceeding to collect garbage\n","stream":"stderr","time":"2017-01-18T09:51:18.471400353Z"}
{"log":"E0118 09:58:30.196257 1 garbagecollector.go:593] Error syncing item \u0026garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:metatypes.OwnerReference{APIVersion:\"coreos.com/v1\", Kind:\"EtcdCluster\", UID:\"a9df64c0-dd64-11e6-a5cc-001e4f520ea4\", Name:\"etcd-client\", Controller:(*bool)(0xc421948230)}, Namespace:\"rhino-ci\"}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{(*garbagecollector.node)(0xc422b50510):struct {}{}}, owners:[]metatypes.OwnerReference(nil)}: unable to get REST mapping for kind: EtcdCluster, version: coreos.com/v1\n","stream":"stderr","time":"2017-01-18T09:58:30.196391795Z"}
{"log":"E0118 09:58:30.197389 1 garbagecollector.go:593] Error syncing item \u0026garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:metatypes.OwnerReference{APIVersion:\"coreos.com/v1\", Kind:\"EtcdCluster\", UID:\"a9df64c0-dd64-11e6-a5cc-001e4f520ea4\", Name:\"etcd-client\", Controller:(*bool)(0xc421948230)}, Namespace:\"rhino-ci\"}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{(*garbagecollector.node)(0xc422b50510):struct {}{}}, owners:[]metatypes.OwnerReference(nil)}: unable to get REST mapping for kind: EtcdCluster, version: coreos.com/v1\n","stream":"stderr","time":"2017-01-18T09:58:30.197454872Z"}
(…the same "unable to get REST mapping for kind: EtcdCluster, version: coreos.com/v1" error line repeats indefinitely…)
@Kargakis Why is the target fix in 1.6 and not in a patch release for 1.5?
I would not anticipate the garbage collector actually enabling support for TPRs in 1.5, but the bug that causes it to become unresponsive when it sees an ownerRef of an unknown type should be fixed |
Changed the milestone to 1.5 |
I agree. I do not expect the GC to work with TPR either, but it should not "kill" the controller manager, and that is the issue we should fix in 1.5. We can make GC work for TPR in 1.6.
For me this is currently a DoS attack.
Repro instructions on master in GCE:
$ cat <<EOF | kubectl apply -f -
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1alpha1
metadata:
namespace: default
name: service-account-all
rules:
- apiGroups: ["*"]
resources: ["*"]
verbs: ["*"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1alpha1
metadata:
name: read-secrets-global
subjects:
- kind: Group
name: system:serviceaccounts
roleRef:
kind: ClusterRole
name: service-account-all
apiGroup: rbac.authorization.k8s.io
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: etcd-operator
spec:
replicas: 1
template:
metadata:
labels:
name: etcd-operator
spec:
containers:
- name: etcd-operator
image: quay.io/coreos/etcd-operator
env:
- name: MY_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
EOF
$ sleep 30 # wait for operator to create third party resource
$ cat <<EOF | kubectl create -f -
apiVersion: "coreos.com/v1"
kind: "EtcdCluster"
metadata:
name: "etcd-cluster"
spec:
size: 3
version: "v3.1.0-alpha.1"
EOF
@mikedanese So you don't even need to have something with an ownerRef pointing at the TPR to trigger this?
I would guess an ownerRef pointing to any arbitrary unknown kind would trigger the same behavior.
Ah, the operator triggers it. OK, I see.
The etcd operator does not rely on the k8s GC features (it has an internal GC), but it does write owner refs. When k8s GC works for TPR, we will remove our temporary internal GC; for now, the owner ref causes the controller manager to spam.
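For illustration, this is roughly the shape of metadata the operator writes — a hypothetical Service manifest carrying an ownerReference to an EtcdCluster TPR instance. The kind, names, namespace, and UID are taken from the error logs above, but the exact manifest is an assumption; it is the mere presence of an ownerReference to an unknown kind that triggers the GC error.

```yaml
# Illustrative only: a dependent object whose ownerReference points at a
# TPR kind the garbage collector's static RESTMapper cannot resolve.
apiVersion: v1
kind: Service
metadata:
  name: etcd-client
  namespace: rhino-ci
  ownerReferences:
  - apiVersion: coreos.com/v1
    kind: EtcdCluster
    name: etcd-cluster
    uid: a9df64c0-dd64-11e6-a5cc-001e4f520ea4
    controller: true
```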
FWIW, here's the patch I wrote to get back to more reasonable behavior:
diff --git a/pkg/controller/garbagecollector/garbagecollector.go b/pkg/controller/garbagecollector/garbagecollector.go
index 77b8251ca5..4db4583915 100644
--- a/pkg/controller/garbagecollector/garbagecollector.go
+++ b/pkg/controller/garbagecollector/garbagecollector.go
@@ -591,8 +591,11 @@ func (gc *GarbageCollector) worker() {
err := gc.processItem(timedItem.Object.(*node))
if err != nil {
utilruntime.HandleError(fmt.Errorf("Error syncing item %#v: %v", timedItem.Object, err))
- // retry if garbage collection of an object failed.
- gc.dirtyQueue.Add(timedItem)
+ go func() {
+ time.Sleep(500 * time.Millisecond) // quick and dirty fix the hot loop
+ // retry if garbage collection of an object failed.
+ gc.dirtyQueue.Add(timedItem)
+ }()
return
}
 DirtyProcessingLatency.Observe(sinceInMicroseconds(gc.clock, timedItem.StartTime))
(then, rebuild your controller-manager or hyperkube)
Thanks @MikaelCluseau. #38679 changed the GC to use the rate-limited queue, so the hot loop should have been fixed on the master branch. [edit] That said, we still need to make the RESTMapper dynamic; I'm working on a fix.
My patch was done on the
@caesarxuchao BTW, is there a backport to 1.5 planned? #38679 is flagged
Update: I tried to reuse DeferredDiscoveryRESTMapper to dynamically translate TPR, but it didn't seem to work because of #42516. I'll find a workaround. |
@MikaelCluseau What version are you on? @mikedanese said #40497 landed in v1.5.3, so the problems should be masked. Is that true?
@caesarxuchao I had to overwrite hyperkube in the docker image.
Ahhh it has this commit 64029a2 which says:
So probably not a difference a human can detect. |
I see. A 1 ms interval is still too spammy.
At least, it makes the logs unusable without filtering ;) I have a general feeling that the Kubernetes logging system could be improved, but I still need to form a better picture of it. The general problem to solve is probably log repetition, something that's solved by events (or at least by the way kubectl shows them); so maybe I should shift my practice more toward events and, when I don't find the answer in the events, file an issue/PR.
Automatic merge from submit-queue (batch tested with PRs 38805, 42362, 42862)
Let GC print a specific message for RESTMapping failure
Makes the error messages reported in #39816 more specific, and only prints the message once. I'll also update the garbage collector's doc to clearly state we don't support TPR yet. We'll wait for the watchable discovery feature (@sttts are you going to work on that?) to land in 1.7, and then enable the garbage collector to handle TPR. cc @hongchaodeng @MikaelCluseau @djMax
@caesarxuchao The backoff removed the log flooding, right? If so, I think this can be moved to 1.7.
cf. #42615 (comment)
I think the original issue of "spamming error messages" was solved. I created #44507 to track the issue of "GC should support non-core API". Closing this issue. |
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
TPR. GC. OwnerReference. REST mapping.
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT.
Kubernetes version (use kubectl version): master, v1.5.1
Environment:
Repro in "hack/local-up-cluster.sh", but I have seen it in many different environments.
Kernel (e.g. uname -a):
What happened:
The controller manager is unresponsive and overloaded with error messages:
The error just keeps repeating without any backoff. It seems the REST mapper didn't recognize the kind.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Generic way:
Easy way:
Anything else do we need to know: