Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Establishing logic for CRDs by implementing Installing status to prevent race conditions #62725

Closed
xmudrii opened this issue Apr 17, 2018 · 5 comments · Fixed by #63068
Labels
area/custom-resources kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@xmudrii
Copy link
Member

xmudrii commented Apr 17, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

If you create CR shorty after CRD, it can happen that it returns error that CRD doesn't exists, even if it exists and is Established.

Relevant issue for this problem #57042, where this happened on the cluster with limited resources while deploying Istio. This one describes age <invalid> more, but could is relevant to the problem. For earlier research, check out #57042 (comment).

What you expected to happen:

CR creation success if CRD is Established.

Anything else we need to know?:

With the current implementation, we have the following steps when creating and establishing the CRD:

  • Naming controller sets the NamesAccepted status.
  • Naming controller sets the Established status shortly after (synchronously without roundtripping to the apiserver).

After some time, the following happens:

  • Informer gets notified that Established is true.
  • Some CR request is sent (e.g. by the user code) and received by the CR handler (e.g. ServeHTTP). It calls the lister to make sure CRD is Established.

However, we think there can be a race condition here, which causes the above described issue.
What can happen, is that the Established becomes true, CR request comes in, the lister is called, but the informer hasn't got the UPDATE event, and therefore, Established is not true in the lister memory cache. In that case, the request is rejected even if Established is set to true, and we get the described issue.

The potential fix could be to change the workflow little bit. Instead of NamesAccpeted=> Established, we want to add another status, Installing, which would happen after NamesAccpeted. By the implementation idea, once Installation is set to true, it can't be turned to false (maybe just when it's time to turn it into Established), unlike the NamesAccepted status.

Then, we would turn CRD into Established in the UPDATE event handler for CR handlers, e.g. in the updateCustomResourceDefinition handler.

Also, we want to make ServeHTTP for CR to answer the request when the CRD is still in Installing, but not Established state, e.g. by changing this part of the code.

This can be further improved for HA environments if we get an HA checkpointing algorithm, so we can extend to logic to only set Established when all API servers have seen installing.

I'll be working on fixing this one, so this issue is a placeholder for further discussions on the approach until I open the PR. If you have any suggestion, please let me know.

/cc deads2k sttts
/sig api-machinery
/area custom-resources

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. area/custom-resources labels Apr 17, 2018
@xmudrii
Copy link
Member Author

xmudrii commented Apr 17, 2018

/cc @sttts
/cc @deads2k

@deads2k
Copy link
Contributor

deads2k commented Apr 17, 2018

Adding a new status just pushes the problem somewhere else. I'm ok with holding requests for a very short period (order of five-ish seconds).

@sttts
Copy link
Contributor

sttts commented Apr 17, 2018

I'm ok with holding requests for a very short period

With the Installing condition we can add this 5s sleep (to survive an HA environment reasonably well) between Installing and Established transparently. I.e. it will be transparent for the user. Now we have to make every 3rdparty user to add those 5 seconds of grace period after seeing the Established condition.

Another solution would be:

  • set NamesAccepted.
  • wait 5s
  • if NamesAccepted has been true for 5 seconds (the timestamp of last change is in the condition afaik), set Established.

@sttts
Copy link
Contributor

sttts commented Apr 17, 2018

But we have to serve the CR during those 5 seconds already to avoid the race. So maybe we still need the Installing condition.

@lavalamp
Copy link
Member

cc @mbohlool @yliaog

k8s-github-robot pushed a commit that referenced this issue May 29, 2018
Automatic merge from submit-queue (batch tested with PRs 64258, 63068). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Add Establishing Controller to avoid race between Established condition and CRs actually served

**What this PR does / why we need it**: If you create CR shorty after CRD, it can happen that it returns error that CRD doesn't exists, even if it exists and is Established. This implements the Establishing Controller, is used to Establish CRD once we're sure it's ready and CRs are served. For more details, check issue #62725.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #62725 

```release-note
Add Establishing Controller on CRDs to avoid race between Established condition and CRs actually served. In HA setups, the Established condition is delayed by 5 seconds.
```

/sig api-machinery
/area custom-resources
/cc sttts deads2k
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/custom-resources kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
Projects
None yet
5 participants