
Allow unsafe read from cache for Azure #83685

Open

aramase wants to merge 2 commits into master from aramase:cache
Conversation

@aramase (Member) commented Oct 9, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:

  • Allows unsafe reads from the cache for certain LIST/GET calls where it is safe to consume stale data.
  • For the disk reconcile loop, which ensures all disks are attached/detached, we read stale data from the cache instead of calling ARM (see the sketch after this list).
    • After every Attach/Detach, the cache entry for the VM is invalidated by deleting it. When the reconcile loop runs for the first time after the last operation, the entry for the VM no longer exists, so the data is fetched again and added to the cache. This ensures we have a fresh copy of the entry after our operation.
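
The following is a minimal, illustrative sketch of the read-through/invalidate pattern described above, assuming a simple map-backed cache. Identifiers such as vmCache, newVMCache, and fetchVMFromARM are hypothetical and are not the PR's actual code.

```go
// Illustrative sketch only (not the PR's implementation): a reconcile loop can
// read VM data from a cache, and every Attach/Detach deletes the entry so the
// next read fetches fresh data from ARM and re-populates the cache.
package cachesketch

import "sync"

type vmCache struct {
	mu      sync.Mutex
	entries map[string]interface{} // VM name -> cached VM object
}

func newVMCache() *vmCache {
	return &vmCache{entries: map[string]interface{}{}}
}

// get returns the cached entry if present (possibly stale, which is acceptable
// for the reconcile loop); otherwise it fetches from ARM and caches the result.
func (c *vmCache) get(vmName string, fetchVMFromARM func(string) (interface{}, error)) (interface{}, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if vm, ok := c.entries[vmName]; ok {
		return vm, nil
	}
	vm, err := fetchVMFromARM(vmName)
	if err != nil {
		return nil, err
	}
	c.entries[vmName] = vm
	return vm, nil
}

// invalidate is called after every Attach/Detach so that the next read cannot
// observe the pre-operation disk list.
func (c *vmCache) invalidate(vmName string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, vmName)
}
```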

Which issue(s) this PR fixes:

Partially fixes
kubernetes/cloud-provider-azure#247

Special notes for your reviewer:
Benchmark testing is currently in progress:

  1. Scenario 1: 1 VMSS (100 nodes) with 100 disks, ~100 calls/hour in idle state.

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@aramase (Member Author) commented Oct 9, 2019

/area provider/azure

@aramase (Member Author) commented Oct 9, 2019

/cc @khenidak

@k8s-ci-robot k8s-ci-robot requested a review from khenidak Oct 9, 2019
@aramase aramase force-pushed the aramase:cache branch 3 times, most recently from b75a4fa to 26617af Oct 10, 2019
@aramase (Member Author) commented Oct 10, 2019

/retest

@aramase aramase force-pushed the aramase:cache branch 2 times, most recently from 6d4c705 to 5c2a0c4 Oct 11, 2019
@aramase (Member Author) commented Oct 16, 2019

/test pull-kubernetes-e2e-gce

@craiglpeters craiglpeters added this to In progress in Provider Azure Oct 17, 2019
@craiglpeters craiglpeters moved this from In progress to Needs Review in Provider Azure Oct 17, 2019
@aramase aramase force-pushed the aramase:cache branch from 5c2a0c4 to b9c06a8 Oct 17, 2019
@aramase (Member Author) commented Oct 17, 2019

/test pull-kubernetes-e2e-gce

var machine compute.VirtualMachine
var retryErr error
err := wait.ExponentialBackoff(az.RequestBackoff(), func() (bool, error) {
-machine, retryErr = az.getVirtualMachine(name)
+machine, retryErr = az.getVirtualMachine(name, crt)

@khenidak (Contributor) commented Oct 17, 2019

@feiskyer - do we still need the wait.ExponentialBackoff even when the SDK already does that on our behalf?

@feiskyer (Member) commented Oct 22, 2019

I have plans to remove the backoff logic from the SDK and let the cloud provider fully control it. We have seen a lot of issues because of SDK retries, e.g. infinite 429 retries. @khenidak WDYT?

@khenidak (Contributor) commented Oct 22, 2019

I think we are better off relying on the SDK, not our code. The SDK can then adapt to the RP interface as it changes, i.e. supporting Retry-After etc.
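
For context on the pattern being discussed, here is a minimal sketch of provider-side retry control around an ARM GET using wait.ExponentialBackoff, the same helper that appears in the diff above. getVMFromARM, getVMWithBackoff, and the backoff values are illustrative assumptions, not the provider's actual code.

```go
// Minimal sketch (assumptions, not the PR's code): the cloud provider, rather
// than the SDK, controls retries for an ARM GET via wait.ExponentialBackoff.
package retrysketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

type virtualMachine struct{ Name string }

// getVMFromARM stands in for a single, non-retrying ARM call; a real
// implementation would surface transient errors such as HTTP 429.
func getVMFromARM(name string) (virtualMachine, error) {
	return virtualMachine{Name: name}, nil
}

func getVMWithBackoff(name string) (virtualMachine, error) {
	var vm virtualMachine
	var retryErr error
	backoff := wait.Backoff{Duration: 2 * time.Second, Factor: 2.0, Steps: 5}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		vm, retryErr = getVMFromARM(name)
		if retryErr != nil {
			// false with a nil error tells ExponentialBackoff to retry after
			// the next interval; a non-nil error would abort immediately.
			return false, nil
		}
		return true, nil
	})
	if err == wait.ErrWaitTimeout {
		return vm, retryErr
	}
	return vm, err
}
```

Keeping this loop in the provider rather than relying on SDK-internal retries is what makes it possible to cap total attempts and decide centrally how to honor Retry-After, which is the trade-off discussed in the comments above.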

@aramase aramase force-pushed the aramase:cache branch from 253811d to 5a2114b Oct 21, 2019
@aramase aramase force-pushed the aramase:cache branch 2 times, most recently from 651445d to 4c8216f Oct 21, 2019
@aramase (Member Author) commented Oct 21, 2019

/retest

@feiskyer (Member) commented Oct 22, 2019

Fixes kubernetes/cloud-provider-azure#247

@aramase this doesn't fix it, but it is part of the fix. Let me know when you finish the validation and it's ready for review.

@aramase (Member Author) commented Oct 22, 2019

@feiskyer Yeah, I changed it to "partially fixes". We are testing different scenarios, but I will let you know when it's ready for review.

@feiskyer feiskyer self-assigned this Oct 22, 2019
@justaugustus justaugustus removed their request for review Oct 22, 2019
@@ -132,7 +132,7 @@ func (ss *scaleSet) getVmssVM(nodeName string) (string, string, *compute.Virtual

// GetPowerStatusByNodeName returns the power state of the specified node.
func (ss *scaleSet) GetPowerStatusByNodeName(name string) (powerState string, err error) {
-_, _, vm, err := ss.getVmssVM(name)
+_, _, vm, err := ss.getVmssVM(name, cachedData)

@khenidak (Contributor) commented Oct 22, 2019

This is one of the reasons I want to keep the cache TTL short, not long.

@feiskyer FYI

@@ -536,7 +536,7 @@ func (ss *scaleSet) getAgentPoolScaleSets(nodes []*v1.Node) (*[]string, error) {
}

nodeName := nodes[nx].Name
-ssName, _, _, err := ss.getVmssVM(nodeName)
+ssName, _, _, err := ss.getVmssVM(nodeName, cachedData)

@khenidak (Contributor) commented Oct 22, 2019

This should be an unsafe read; agents don't change pools.
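
To make the cached vs. unsafe distinction concrete, here is a sketch of what a read-type flag for these getters could look like. The constant names (cachedData, unsafeData, forceRefresh) and their semantics are illustrative assumptions based on this thread, not the PR's exact definitions.

```go
// Illustrative sketch (not the PR's actual definitions): a read-type flag the
// VM/VMSS getters could accept to choose between cached, unsafe (stale-OK),
// and force-refresh reads. All identifiers here are hypothetical.
package cachetypesketch

type cacheReadType int

const (
	// cachedData: return the cached entry if it exists and has not expired,
	// otherwise fetch from ARM.
	cachedData cacheReadType = iota
	// unsafeData: return the cached entry even if its TTL has expired; only
	// fetch from ARM when there is no entry at all. Suitable for data that
	// rarely changes, e.g. which agent pool a node belongs to.
	unsafeData
	// forceRefresh: always bypass the cache and fetch from ARM.
	forceRefresh
)
```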

Contributor left a comment:

A few more comments. The main question I have now is the impact of getLUN/DetachDisk using cached data. We rely on the fact that each attach deletes the object from the cache, and we also rely on the fact that attach/detach calls are serialized (one at a time). If these two facts hold, we should be fine. (A sketch of this pattern follows.)
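
A minimal sketch of the two assumptions above: attach/detach serialized through a lock, and the cache entry deleted after each operation so the next getLUN/reconcile read cannot observe the pre-operation disk list. diskController, attachDiskToARM, and the locking scheme are hypothetical illustrations, not the Azure provider's actual code.

```go
// Illustrative sketch only: per-controller serialization of attach/detach plus
// cache invalidation after each operation.
package attachsketch

import "sync"

type diskController struct {
	diskLock sync.Mutex             // serializes attach/detach (one at a time)
	cacheMu  sync.Mutex             // guards vmCache
	vmCache  map[string]interface{} // node name -> cached VM object
}

func (c *diskController) attachDisk(nodeName, diskURI string, attachDiskToARM func(string, string) error) error {
	c.diskLock.Lock()
	defer c.diskLock.Unlock()

	if err := attachDiskToARM(nodeName, diskURI); err != nil {
		return err
	}

	// Invalidate the cached VM so the next read (e.g. getLUN or the disk
	// reconcile loop) fetches the post-attach state from ARM.
	c.cacheMu.Lock()
	delete(c.vmCache, nodeName)
	c.cacheMu.Unlock()
	return nil
}
```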

update common controller

add comment

update to cached data

add cacheReadType type

update logic for disk reconcile loop to read from cache

update delete cache for node

review feedback
@aramase aramase force-pushed the aramase:cache branch from f3838ca to cebdd85 Oct 22, 2019
@k8s-ci-robot (Contributor) commented Oct 22, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aramase
To complete the pull request process, please assign feiskyer
You can assign the PR to them by writing /assign @feiskyer in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aramase aramase changed the title [WIP] Allow unsafe read from cache for Azure Allow unsafe read from cache for Azure Oct 22, 2019
@aramase (Member Author) commented Oct 22, 2019

/retest

@aramase (Member Author) commented Oct 22, 2019

@feiskyer The PR is ready for a round of review. We have soak tests currently running with my custom hyperkube. PTAL!
