CRD age does not change from <invalid> as soon as it becomes Established #57042

Open
vadimeisenbergibm opened this Issue Dec 11, 2017 · 29 comments

vadimeisenbergibm commented Dec 11, 2017

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
When deploying https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml on minikube, I get the following errors:

unable to recognize "install/kubernetes/istio.yaml": no matches for config.istio.io/, Kind=attributemanifest

for each of the custom resources specified in https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml

Running kubectl get crd returns:

NAME                                  AGE
attributemanifests.config.istio.io    <invalid>

After several seconds, kubectl get crd returns a valid age for the CRDs, and then the custom resources are added successfully.

What you expected to happen:
I expect this to be explicitly documented: can custom resources be added immediately after their CRDs, or should the user wait until kubectl get crd returns a valid age?

How to reproduce it (as minimally and precisely as possible):
Deploy https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml on a minikube with limited resources, so it will take time for the CRDs to become valid.
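For example (a rough sketch, not the exact setup from this report; the flag values are only illustrative), a constrained minikube can be started like this before applying the manifest:

$ minikube start --cpus 1 --memory 1024
$ kubectl apply -f install/kubernetes/istio.yaml    # from an istio checkout

With so little CPU and memory, the apiserver takes noticeably longer to establish the new CRDs, which makes the failure window easier to hit.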

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.8.0
  • Cloud provider or hardware configuration: Minikube 0.23.0
  • OS (e.g. from /etc/os-release): Minikube on Virtual Box on Mac OS
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
vadimeisenbergibm commented Dec 11, 2017

/sig api-machinery

ncdc (Member) commented Dec 11, 2017

It typically takes a short amount of time for the server to process a new CRD before you can begin creating resources. You can check the CRD's status.conditions for an entry in the array where type equals Established.
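For illustration, the relevant part of the status looks roughly like this (output trimmed; the exact reason/message strings may differ between versions):

$ kubectl get crd attributemanifests.config.istio.io -o yaml
...
status:
  conditions:
  - lastTransitionTime: ...
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: ...
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established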

cc @deads2k @sttts @enisoc @nikhita - do we have this documented anywhere?

vadimeisenbergibm commented Dec 11, 2017

@ncdc So what are users of kubectl advised to do with regard to new CRDs and their resources? Define the CRDs first, wait several seconds, and then add the resources? Or just add new CRDs and their resources, and retry in case of failure? I would like this advice to be documented somewhere.

You can check the CRD's status.conditions for an entry in the array where type equals Established

By which command can I check this?

ncdc (Member) commented Dec 11, 2017

@vadimeisenbergibm

So what are users of kubectl advised to do with regard to new CRDs and their resources? Define the CRDs first, wait several seconds, and then add the resources? Or just add new CRDs and their resources, and retry in case of failure? I would like this advice to be documented somewhere.

I'm not sure what our guidance is beyond waiting. You could also programmatically poll/watch the CRD and wait for it to be established. cc @kubernetes/sig-api-machinery-misc for increased visibility & guidance

By which command can I check this?

Here is a sample I just ran:

$ kubectl get crd/backups.ark.heptio.com -o jsonpath='{.status.conditions[?(.type == "Established")].status}{"\n"}'
True
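You could wrap that in a rough polling loop (an untested sketch along these lines):

$ until kubectl get crd/backups.ark.heptio.com -o jsonpath='{.status.conditions[?(.type == "Established")].status}' 2>/dev/null | grep -q True; do sleep 1; done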
sttts (Contributor) commented Dec 11, 2017

And of course there is no connection to Minikube; it behaves this way by design on every cluster.

nikhita (Member) commented Dec 12, 2017

For the docs: kubernetes/website#6660.

nikhita (Member) commented Dec 12, 2017

/assign

vadimeisenbergibm commented Dec 12, 2017

@ncdc I get the output of kubectl get crd/backups.ark.heptio.com -o jsonpath='{.status.conditions[?(.type == "Established")].status}{"\n"}' as True, despite the fact that kubectl get crd returns AGE <invalid>. I think the simplest way is just to run kubectl get crd and wait until all the CRDs show a non-invalid age.
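For example, something along these lines (a rough, untested sketch):

$ while kubectl get crd --no-headers 2>/dev/null | grep -q '<invalid>'; do sleep 1; done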

ncdc (Member) commented Dec 12, 2017

@deads2k @nikhita @sttts does the age matter at all?

ncdc (Member) commented Dec 12, 2017

From what I'm reading, you'll see <invalid> if the object's lifespan appears to be negative. I think you can ignore that. The definitive green-light is the Established condition and/or the availability of the custom api group+resource in discovery.
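For example, to check discovery you can look for the group in the output of kubectl api-versions (the group here is the one from the Istio manifest; substitute your own):

$ kubectl api-versions | grep config.istio.io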

vadimeisenbergibm commented Dec 12, 2017

In my experience, adding resources fails until the age becomes 0s; that is, while the age is <invalid>, adding resources fails. The Established condition is True immediately, even while the age is <invalid> and the resources still fail to be added.

ncdc (Member) commented Dec 12, 2017

Can you share the logs from the apiserver around the time of the failures?

sttts (Contributor) commented Dec 13, 2017

We might have a race condition between the update of the Established condition (https://github.com/sttts/kubernetes/blob/1161561ee12eea5ba28eb5efa9b384ec732def0e/staging/src/k8s.io/apiextensions-apiserver/pkg/controller/status/naming_controller.go#L253) and the HTTP handler of the API server (same process) actually picking it up (through informers, https://github.com/sttts/kubernetes/blob/b5b62c68318be79a665257c260ea9f9bbb6d6318/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/customresource_handler.go#L116). We could synchronize these two, but that wouldn't help in HA setups anyway.
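As a rough client-side workaround sketch until something like that lands: retry the apply until the handler actually serves the new types, e.g.

$ until kubectl apply -f install/kubernetes/istio.yaml; do sleep 1; done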

vadimeisenbergibm commented Dec 13, 2017

@ncdc Here is the audit log
audit_log.txt

See for example attributemanifests.config.istio.io.

Do you want me to produce any other log?

nikhita (Member) commented Dec 23, 2017

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Dec 23, 2017

venkatsc (Contributor) commented Jan 11, 2018

Facing the same issue on 1.9.1. CRD AGE never changed from <invalid>.

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.1", GitCommit:"3a1c9449a956b6026f075fa3134ff92f7d55f812", GitTreeState:"clean", BuildDate:"2018-01-04T11:52:23Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"", Minor:"", GitVersion:"v1.9.1+coreos.0", GitCommit:"49020061462317aa0d54c2577f5bfb257e9cf58e", GitTreeState:"clean", BuildDate:"2018-01-08T18:21:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

nikhita (Member) commented Jan 23, 2018

Looks like the real issue here is that the CRD age does not change from <invalid> as soon as the CRD becomes Established.

Can someone update the issue title to reflect this? Thanks!

@sttts sttts changed the title from Adding custom resources fails in minikube when they are defined together with their CRDs. to CRD age does not change from <invalid> as soon as it becomes Established Jan 23, 2018

@nikhita nikhita referenced this issue Jan 23, 2018: Umbrella issue for CRDs moving to GA #58682

@sttts sttts added the help wanted label Mar 1, 2018

xmudrii (Member) commented Mar 1, 2018

Working on reproducing the issue and implementing tests for this case.

xmudrii (Member) commented Mar 13, 2018

I've worked with @sttts on reproducing the issue and finding the root of the problem. I'll try my best to explain what I found.

It's hard to define what "limited resources" means. In most cases, everything worked flawlessly. To be able to reproduce the issue, I had to put my local machine (which was running the cluster via ./hack/local-up-cluster.sh) under heavy load (100% CPU and a lot of I/O operations).
To accomplish this I used the stress command.
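Something along these lines (the parameters are only illustrative, not the exact values I used):

$ stress --cpu 8 --io 4 --timeout 600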

Under normal conditions everything works as expected: the age is valid and the CRs are created successfully.
Under heavy load, it can happen that CRs fail to be created, but the <invalid> age itself is very hard to reproduce.

It is expected that CRDs take some time to be fully created/provisioned, but under normal conditions that is not a problem. If the cluster has limited resources it can become one, but there is a possible fix we can discuss.

About the <invalid> age: the age is rendered by the ShortHumanDuration function in the duration package, and <invalid> is printed only when the computed duration is negative, i.e. when there is some deviation between machine clocks.

This function is invoked by the translateTimestamp function with now - creationTimestamp.

Because the translateTimestamp function first checks whether the timestamp is zero (and in that case returns <unknown> instead of <invalid>), we know that some creation timestamp is being returned by the API.

To analyze this further, I checked whether the API indeed returns the creationTimestamp even under load, and yes, it does.
The time is set by the BeforeCreate function, specifically by the FillObjectMetaSystemFields function, which is defined in the same package.
The FillObjectMetaSystemFields function sets the creationTimestamp to Now().

This Playground is from the beginning of my research, but it may contain some useful information, along with a test that confirms that creationTimestamp is not zero.

To fix the problem with CR creation failure, @sttts proposed the following solution when I was talking to him about the issue:

there is a fix for the race: right now we have this: 1) The CRD is created. 2) The CRD names are checked and validated. 3) The CRD is set to "Established". 4) The HTTP handler sees the "Established" condition on the CRD and 5) starts to serve it. The race is between 3 and 5: the user observes 3, but 5 is not done yet. Instead we have to turn 3 and 5 around, i.e. when the handler starts serving it, it should set the Established condition. For that we need another signal in 3. Maybe we already have a signal, "NoNameConflict" or something like that, which could be used here.

I would also like to mention that the <invalid> age is not limited to CRDs. I had the same problem on the local cluster, but with pods instead of CRDs.
The command kubectl get pods -w --all-namespaces returned:

kube-system   kube-dns-86f6f55dd5-8kbzn   0/3       Pending   0         <invalid>
kube-system   kube-dns-86f6f55dd5-8kbzn   0/3       ContainerCreating   0         <invalid>
kube-system   kube-dns-86f6f55dd5-8kbzn   2/3       Running   0         <invalid>
kube-system   kube-dns-86f6f55dd5-8kbzn   0/3       Evicted   0         37s
kube-system   kube-dns-86f6f55dd5-mgfng   0/3       Pending   0         <invalid>
kube-system   kube-dns-86f6f55dd5-mgfng   0/3       Pending   0         <invalid>
kube-system   kube-dns-86f6f55dd5-mgfng   0/3       ContainerCreating   0         <invalid>
kube-system   kube-dns-86f6f55dd5-mgfng   2/3       Running   0         <invalid>

Just to note, the stress command was running while the cluster was provisioning, so the local machine had limited resources.

fejta-bot commented Jun 11, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

xmudrii (Member) commented Jun 11, 2018

The race between updating the CRD's condition to Established and the HTTP handler serving its resources should be improved/fixed by #63068. The fix could also prevent, or lower the frequency of, the mentioned errors when applying CRs, but I have not tested this specifically yet.

The <invalid> age problem could still be there, but I'm not sure what its root cause is. I have left some comments on this earlier in this issue.

/remove-lifecycle stale

nikhita (Member) commented Jun 13, 2018

The <invalid> age problem could still be there, but I'm not sure what its root cause is. I have left some comments on this earlier in this issue.

This is still pending.

I'm going to explicitly unassign myself here, since this has the help wanted label (and could use some help with the above). I will still be involved in this, though.
/unassign

fejta-bot commented Sep 11, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

xmudrii (Member) commented Sep 11, 2018

/remove-lifecycle stale

Issue #65517 describes a similar problem, so maybe one of them can be closed.

nikhita (Member) commented Sep 13, 2018

Issue #65517 describes a similar problem, so maybe one of them can be closed.

I think the underlying issue should be the same, but let's keep it open until we confirm this.

Removing the help wanted label because we don't have explicit directions for solving this yet.
/remove-help

fejta-bot commented Dec 12, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

nikhita (Member) commented Dec 12, 2018

/remove-lifecycle stale
