
Update dual stack KEP metadata to include status and PRR fields #2327

Merged
merged 26 commits into kubernetes:master on Feb 9, 2021

Conversation

lachie83
Member

  • Add milestone and status fields
  • Add PRR related fields in preparation for beta

Signed-off-by: Lachlan Evenson <lachlan.evenson@microsoft.com>

Add PRR related fields in preparation for beta

Signed-off-by: Lachlan Evenson <lachlan.evenson@microsoft.com>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 25, 2021
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Jan 25, 2021
@lachie83
Member Author

Tracking enhancement issue #563

Signed-off-by: Lachlan Evenson <lachlan.evenson@microsoft.com>
@lachie83 lachie83 changed the title Update dual stack KEP metadata to include status and PRR fields [WIP] Update dual stack KEP metadata to include status and PRR fields Jan 26, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 26, 2021
@bridgetkromhout
Member

Here's how I could reproduce the failure:

$ ./hack/verify-kep-metadata.sh
--- FAIL: TestValidation (0.06s)
    --- FAIL: TestValidation/../../keps/sig-network/563-dual-stack/kep.yaml (0.00s)
        main_test.go:112: PRR approval is required to target milestone v1.21 (stage beta)
        main_test.go:113: For more details about PRR approval see: https://github.com/kubernetes/community/blob/master/sig-architecture/production-readiness.md
        main_test.go:114: To get PRR approval modify appropriately file ../../keps/prod-readiness/sig-network/563.yaml and have this approved by PRR team
FAIL
FAIL	k8s.io/enhancements/cmd/kepval	0.854s
FAIL
$ 

I think this is how we correct this test: update keps/prod-readiness/sig-network/563.yaml (but the approver can't be just any name):

$ ./hack/verify-kep-metadata.sh
--- FAIL: TestValidation (0.09s)
    --- FAIL: TestValidation/../../keps/sig-network/563-dual-stack/kep.yaml (0.00s)
        main_test.go:122: PRR approval file ../../keps/prod-readiness/sig-network/563.yaml has an error: error validating PRR approval metadata: invalid beta field: "approver" must be one of (deads2k,johnbelamaric,wojtek-t) but it is a string: thockin
FAIL
FAIL	k8s.io/enhancements/cmd/kepval	0.905s
FAIL
$

We put one of the approved people in this file:

$ cat keps/prod-readiness/sig-network/563.yaml 
kep-number: 563
beta:
  approver: "@wojtek-t"
$

Then this check completes:

$ ./hack/verify-kep-metadata.sh
ok  	k8s.io/enhancements/cmd/kepval	0.661s
$

I'm not sure who's the right person to list in this case.

@jeremyrickard
Contributor

jeremyrickard commented Jan 27, 2021

I don’t believe that Tim is a PRR approver; I believe @deads2k, @johnbelamaric, and @wojtek-t are the only PRR approvers currently. It looks like the PRR approver field is empty, so I’d recommend adding one of the approvers there.

@bridgetkromhout
Member

@jeremyrickard I do indeed see that they are the only ones in prod-readiness-approvers. When quickly scanning https://github.com/kubernetes/community/blob/master/sig-architecture/production-readiness.md#submitting-a-kep-for-production-readiness-approval I was looking at the sig-network-leads list in https://github.com/kubernetes/enhancements/blob/master/OWNERS_ALIASES#L57-L60 instead of the prod-readiness-approvers list itself.

Of course, we'll start with our SIG leads and then get a PRR approval. Thanks!

@lachie83
Member Author

lachie83 commented Jan 27, 2021

I don’t believe that Tim is a PRR approver; I believe @deads2k, @johnbelamaric, and @wojtek-t are the only PRR approvers currently. It looks like the PRR approver field is empty, so I’d recommend adding one of the approvers there.

I've connected with @johnbelamaric via the prod-readiness channel and he has agreed to review this once ready. I will push a commit to update.

Signed-off-by: Lachlan Evenson <lachlan.evenson@microsoft.com>
@lachie83 lachie83 marked this pull request as draft January 27, 2021 17:43
@lachie83
Member Author

lachie83 commented Feb 2, 2021

PRR sections added to KEP here - #2325

Add implementation history

Signed-off-by: Lachlan Evenson <lachlan.evenson@microsoft.com>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 2, 2021
@lachie83 lachie83 marked this pull request as ready for review February 2, 2021 21:31
@lachie83 lachie83 changed the title [WIP] Update dual stack KEP metadata to include status and PRR fields Update dual stack KEP metadata to include status and PRR fields Feb 2, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 2, 2021
Signed-off-by: Lachlan Evenson <lachlan.evenson@microsoft.com>
@lachie83
Member Author

lachie83 commented Feb 2, 2021

@johnbelamaric this is ready for review.

@dcbw
Member

dcbw commented Feb 4, 2021

I skimmed the checklist and I think it's accurate.
/lgtm
/approve

bridgetkromhout and others added 5 commits February 5, 2021 16:12
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 5, 2021

@johnbelamaric johnbelamaric left a comment


Hi. I don't think the latest round of revisions really answers my questions. I think to answer them you need to shift mindset a little. These questions aren't about "how can this specific code fail" but about "what could go wrong now that this change is out there". That is, they really aren't narrowly focused on just your changes, but rather on how those changes interact with the system as a whole. What are the downstream consequences of enabling this feature?

Really put yourself in the shoes of someone who is responsible for tens or hundreds of thousands of workloads running on a fleet of clusters, and ask yourself: Am I willing to enable this feature? If I do, how can I make sure that I haven't broken the cluster control plane? How can I make sure I haven't broken user workloads? Is there any planning I should do before I enable it? How can I go home and not worry about getting paged? What do I have to check, and what would inform me that I need to roll it back?


1. For services:

Services are not in the path of pod creation. Thus, any malfunction or

I'm not sure why you're only talking about pod creation. What about service creation? Are there metrics there that could be useful? Do we have metrics around IP allocations? What about endpoint selection/creation? Do we have metrics about endpoints created?

- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:
Dual-stack networking is a functional addition, not a service with SLIs.

It's not a service itself, but it can fail, right? You can fail to allocate the IP, you can fail to assign the IP to a Pod, you can fail to wire up the Pod sandbox, etc. These are likely already covered by existing metrics. Can you list them here, to inform operators what metrics they should be looking at? If they all of a sudden see a bunch of failures then they may want to turn it off.

What about watching for workload failures as I mentioned in my last review?

The point here is just to provide guidance and documentation. If you upgraded your cluster with 10,000 workloads on it, what would you look for to see that everything is going OK?
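For illustration (not part of the KEP), a quick fleet-level spot check using only standard Pod status fields could list the pods that do not yet report two pod IPs:

# prints namespace/name plus pod IPs; dual-stack pods show two IPs (3 fields)
$ kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.podIPs[*].ip}{"\n"}{end}' | awk 'NF < 3'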


* **What happens if we reenable the feature if it was previously rolled back?**
Similar to enable it the first time on a cluster.
Similar to enabling it the first time on a cluster. We don't load balance

Without my questions this is hard to follow; it needs more context. Even with my questions I am still not clear on how the answer addresses the "imbalance during re-enablement" case.

If that case is really possible (and I haven't seen anything to say whether or not it is), we just need to document it so people know what to expect and what to look for.

Once people upgrade, this goes on by default. If something goes wrong, and they turn it off, and then later want to re-enable it, we need to make sure they can plan appropriately. They may need to find any dual-stack services with bazillions of endpoints and make them single-stack first, for example. Just to avoid that imbalance situation.

I think the statement "we don't load balance across families" is trying to say that the imbalance is not possible. But I don't see how that invalidates the scenario I described. In fact, if we did load balance across families and did do protocol translation, then the scenario I described would not be a problem.
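As a sketch of that planning step, existing dual-stack Services could be listed with the fields this KEP adds (the column layout is just illustrative):

$ kubectl get services -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,POLICY:.spec.ipFamilyPolicy,FAMILIES:.spec.ipFamilies[*],CLUSTER-IPS:.spec.clusterIPs[*]'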

- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
N/A

I don't think this is true. If the SLIs above are existing SLIs, you might just be able to say something like: existing kubelet pod creation or service creation SLOs (and whatever others may apply).


Dual-stack components are not in path of pod creation. It is in the path
of reporting pod ips. So pod creation will not be affected; if it is
affected, then it is a CNI issue.

So, does CNI have any input from the kubelet as to whether Pods should be single or dual stack? Or is that completely out-of-band configuration that doesn't come from this feature at all? How does CNI know today which family of address to allocate?
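For context, the pod address families generally come from the CNI configuration itself rather than from kubelet; a dual-stack host-local IPAM config for the reference bridge plugin looks roughly like this (plugin choice, path, and subnets are illustrative, not prescribed by the KEP):

$ cat /etc/cni/net.d/10-bridge.conflist
{
  "cniVersion": "0.3.1",
  "name": "mynet",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [{ "subnet": "10.244.1.0/24" }],
          [{ "subnet": "fd00:10:244:1::/64" }]
        ],
        "routes": [
          { "dst": "0.0.0.0/0" },
          { "dst": "::/0" }
        ]
      }
    }
  ]
}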


Dual-stack components are in the path of PodIPs reporting which affects
kubelet. If there is a problem (or if there are persistent problems)
then disabling the featuregate on api-server will mitigate.

On the apiserver or on the kubelet? Shouldn't it be everywhere? Otherwise I imagine really bad stuff can happen.
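If it does need to be off everywhere, that presumably means setting the gate on every component that honors it, roughly like this (flag placement is illustrative; the rest of each command line is elided):

$ kube-apiserver --feature-gates=IPv6DualStack=false ...
$ kube-controller-manager --feature-gates=IPv6DualStack=false ...
$ kube-proxy --feature-gates=IPv6DualStack=false ...
$ kubelet --feature-gates=IPv6DualStack=false ...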

@@ -1459,7 +1497,8 @@ of this feature?**

* **Will enabling / using this feature result in any new calls to the cloud
provider?**
No
No. Errors are surfaced when the user makes a call. We don't focus on

This question isn't about the metrics server; it's about calls to the cloud provider. I believe in some cases IP allocation is done by the cloud provider? In GKE we have native IPs assigned to Pods. I don't know where that allocation is done exactly; maybe @bowei knows. But it seems feasible that dual stack would require an extra call to the cloud provider to allocate one of those native IPs.

* **How does this feature react if the API server and/or etcd is unavailable?**
This feature will not be operable if either kube-apiserver or etcd is unavailable.

* **What are other known failure modes?**
For each of them, fill in the following information by copying the below template:

I don't think this addresses workload failures or the "imbalance"; we probably want to discuss those.

2. kind dual-stack iptables: https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20dual,%20master
3. kind dual-stack ipvs: https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20ipvs,%20master

### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

What about workload failures? If I upgrade and suddenly 10% of my workloads start crashlooping because they are stupid and don't know what to do with an AAAA record from DNS, I want to disable and/or roll back. Again, this probably doesn't apply much in the upgrade case because there are no dual-stack services, but in the re-enablement case it does.
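One way to spot-check a workload ahead of time is to see what it actually gets back from DNS for a dual-stack Service; a sketch, assuming the pod image ships getent (names are placeholders):

$ kubectl exec my-app-pod -- getent ahosts my-service.my-namespace.svc.cluster.local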


If you change the IPFamilyPolicy to SingleStack, you will be back to the initial situation.
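In kubectl terms that would be something like (service name is a placeholder):

$ kubectl patch service my-service -p '{"spec":{"ipFamilyPolicy":"SingleStack"}}'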

bridgetkromhout and others added 11 commits February 8, 2021 15:04
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Lachlan Evenson <lachlan.evenson@microsoft.com>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>

@johnbelamaric johnbelamaric left a comment


Looking really good. A couple of minor points.

Pods and Services will remain single-stack until cli flags have been modified
as described in this KEP. Existing and new services will remain single-stack
until user requests otherwise. Pods will become dual-stack once CNI is
configured for dual-stack.

A little confusing:

Pods and Services will remain single-stack until cli flags have been modified
as described in this KEP.

and

Pods will become dual-stack once CNI is configured for dual-stack.

are contradictory. I think the distinction is the API vs. the pod runtime environment. Please clear that up.


Thanks - edited.
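For reference, the cli flags the KEP text refers to are the dual-stack CIDR settings on the control plane components, roughly as below (CIDRs are placeholders and the rest of each command line is elided):

$ kube-apiserver --service-cluster-ip-range=10.96.0.0/16,fd00:10:96::/112 ...
$ kube-controller-manager --cluster-cidr=10.244.0.0/16,fd00:10:244::/56 --service-cluster-ip-range=10.96.0.0/16,fd00:10:96::/112 --node-cidr-mask-size-ipv4=24 --node-cidr-mask-size-ipv6=64 ...
$ kube-proxy --cluster-cidr=10.244.0.0/16,fd00:10:244::/56 ...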

the controller manager will automatically update endpoints and endpointSlices
to match the service IP families. When the feature is reenabled, kube-proxy
will automatically start updating iptables/ipvs rules for the alternative
ipfamily, for existing and new dual-stack services.

Here is where you should mention:

DNS will immediately begin returning the secondary IP family, while endpoints, endpointSlices, and iptables programming may take some time. This can lead to large or very busy services receiving excessive traffic on the secondary family address until the endpoints, endpointSlices, and iptables rules are fully populated.


I wish we had a decent mitigation for this. It can happen anytime a larger service goes from single -> dual stack.


Thanks - edited.
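One way to watch the programming catch up for a given Service is the per-family EndpointSlices (service name is a placeholder; the ADDRESSTYPE column shows which family each slice serves):

$ kubectl get endpointslices -l kubernetes.io/service-name=my-service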

1. (preferred) Do not create dual-stack services until the rollout of the
dual-stack feature across the cluster is complete.
or
2. Cordon and drain the node(s) where the feature is not enabled

thank you for adding these
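For option 2 above, the cordon-and-drain step is the usual one (node name is a placeholder):

$ kubectl cordon node-1
$ kubectl drain node-1 --ignore-daemonsets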

IPv4/IPv6 dual-stack](https://kubernetes.io/docs/tasks/network/validate-dual-stack/)
to ensure that node addressing, pod addressing, and services are configured
correctly. If dual-stack services are created, they have passed validation.
Metrics to check could include pods stuck in pending; look in the event logs

+1, nice
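The linked validation doc boils down to checks like these (node and pod names are placeholders):

$ kubectl get node node-1 -o jsonpath='{.spec.podCIDRs}'
$ kubectl get pod my-app-pod -o jsonpath='{.status.podIPs}'
$ kubectl get pods -A --field-selector=status.phase=Pending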

bridgetkromhout and others added 2 commits February 9, 2021 13:46
Signed-off-by: Bridget Kromhout <bridget@kromhout.org>
@johnbelamaric
Member

Thank you, looking good!

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, johnbelamaric, lachie83


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2021
@johnbelamaric
Member

And

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2021
@k8s-ci-robot k8s-ci-robot merged commit 3533411 into kubernetes:master Feb 9, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 9, 2021