
Azure AD: Migration path from 1.17.0 to 1.17.1 #87464

Closed
jcrowthe opened this issue Jan 22, 2020 · 21 comments · Fixed by #87507
Labels: area/provider/azure, kind/bug, kind/regression, sig/auth, sig/cli, sig/cloud-provider, sig/windows

@jcrowthe

What happened:
Kubernetes 1.17.1 included a breaking change (#86412) to how Azure AD may be used for OIDC authentication in Kubernetes. For all teams that were previously using Azure AD's implementation of OIDC in Kubernetes successfully, this change means:

  • >=1.17.1 clusters are only compatible with >=1.17.1 kubectl versions and client libraries
  • <=1.17.0 clusters are only compatible with <=1.17.0 kubectl versions and client libraries

Notably, this includes not only teams who directly use AAD, but also all AKS clusters. See Azure/aks-engine#2582.

What you expected to happen:
A breaking change should come with a migration path. In this case (and as currently released), both the apiserver and all of its clients must change versions simultaneously to avoid impact. In larger deployments of Kubernetes, a single cluster may have hundreds (or more) of disparate clients; such a large change cannot reasonably be executed without impacting customers and causing critical service outages.

This is a request for details on how cluster operators and Kubernetes users (that use Azure AD) should migrate from 1.17.0 to 1.17.1 without impact.

How to reproduce it (as minimally and precisely as possible):

  1. Launch a cluster on 1.17.0
  2. Follow the instructions here to integrate with Azure AD (a rough sketch of the resulting configuration follows this list): https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/client-go/plugin/pkg/client/auth/azure
  3. Verify that authn/authz succeeds with kubectl <=1.17.0
  4. Verify that authn/authz fails with kubectl >=1.17.1
  5. Repeat the test with a 1.17.1 cluster and kubectl versions below 1.17.1
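
For reference, a rough sketch of the two pieces of configuration involved (all values are placeholders and details may differ; the linked client-go README is the authoritative source):

    # Client side: kubeconfig user entry using the azure auth-provider
    users:
    - name: azure-user
      user:
        auth-provider:
          name: azure
          config:
            environment: AzurePublicCloud
            client-id: <client-application-id>
            tenant-id: <tenant-id>
            apiserver-id: <server-application-id>

    # Server side: kube-apiserver OIDC flags as documented for <=1.17.0 clients
    --oidc-issuer-url=https://sts.windows.net/<tenant-id>/
    --oidc-client-id=spn:<server-application-id>

The client change in 1.17.1 effectively requires the server side to drop the spn: prefix, i.e. --oidc-client-id=<server-application-id>.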

Anything else we need to know?:
That the underlying functionality should change is not disputed (see the evidence in the original ticket, #86410): Azure AD has improperly injected an spn: prefix into the aud field of JWT tokens for a long time now. The issue is not with the new functionality itself, but rather with the manner in which the change was implemented. Without a designated migration path, this change is highly impactful to clients.
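
To make the incompatibility concrete, here is a minimal illustration (not real token contents, placeholder IDs) of the audience claim in the ID tokens involved; the apiserver's --oidc-client-id value must match this claim, which is why the two formats are mutually exclusive:

    # aud claim produced via the <=1.17.0 client flow (spn: prefix present)
    "aud": "spn:<server-application-id>"

    # aud claim produced via the >=1.17.1 client flow (bare application ID)
    "aud": "<server-application-id>"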

Environment:

  • Kubernetes version (use kubectl version): apiserver: 1.17.1, kubectl: 1.17.0 (and vice versa)
  • Cloud provider or hardware configuration: N/A
  • OS (e.g: cat /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Network plugin and version (if this is a network-related bug): N/A
  • Others:

Tagging original author and approvers of the PR: @feiskyer @andyzhangx @enj

/sig cloud-provider
/area provider/azure

@jcrowthe jcrowthe added the kind/bug Categorizes issue or PR as related to a bug. label Jan 22, 2020
@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. area/provider/azure Issues or PRs related to azure provider labels Jan 22, 2020
@andyzhangx
Member

@jcrowthe do you mean we should revert this change?
cc @weinong @karataliu
Currently it's only in the k8s v1.17.1 release.

@jcrowthe
Author

@andyzhangx To help in determining that, I'd suggest the following:

  • This is a major functionality change which was introduced in a patch release. It would be more appropriate for such a change to be introduced in a major Kubernetes release version.
  • This is a breaking change with high impact to both operations teams and end users of Kubernetes. Such a change should be introduced with a clear migration strategy that does not impact cluster operations.

I am supportive of this functionality change being made to Kubernetes; however, this PR as implemented imposes an unreasonable cost on end users. I recommend it be reworked into a more appropriate, non-impactful PR that provides a solid migration path for those already using the "old" functionality.

With that, I'll defer to maintainers to determine the proper next steps.

@andyzhangx
Member

andyzhangx commented Jan 23, 2020

@andyzhangx To help in determining that, I'd suggest the following:

  • This is a major functionality change which was introduced in a patch release. It would be more appropriate for such a change to be introduced in a major Kubernetes release version.
  • This is a breaking change with high impact to both operations teams and end users of Kubernetes. Such a change should be introduced with a clear migration strategy that does not impact cluster operations.

I am supportive of this functionality change being made to Kubernetes; however, this PR as implemented imposes an unreasonable cost on end users. I recommend it be reworked into a more appropriate, non-impactful PR that provides a solid migration path for those already using the "old" functionality.

With that, I'll defer to maintainers to determine the proper next steps.

@jcrowthe thanks, that's the case: if there is an AAD-enabled cluster with both 1.17.0 and 1.17.1 agent nodes, there is no way to use a single kubectl version to operate on this cluster.

@andyzhangx
Member

Since there is already a breaking change in 1.17.1, a revert can only be done in 1.17.2, and that would make the situation more complex. I would suggest we add a doc:

On an AAD enabled cluster,
>=1.17.1 clusters are only compatible with >=1.17.1 kubectl versions and client libraries
<=1.17.0 clusters are only compatible with <=1.17.0 kubectl versions and client libraries

We need to check whether this will break anything, e.g. upgrading an AKS cluster from 1.17.0 to 1.17.1.

@karataliu
Contributor

It's not related to the server-side version, but to the apiserver configuration.
The current situation is:

'--oidc-client-id=spn:<guid>' is compatible with <=1.17.0 clients
'--oidc-client-id=<guid>' is compatible with >=1.17.1 clients

A workable fix would be to allow multiple '--oidc-client-id' values to be passed in, while the server switches to

'--oidc-client-id=spn:<guid>,<guid>'
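
If that were supported, a migration-window apiserver configuration might look like the sketch below. This is hypothetical: --oidc-client-id currently accepts a single value, so the comma-separated form is only the proposal above, not an existing flag format.

    # Hypothetical kube-apiserver static pod manifest fragment (proposed, not currently supported)
    spec:
      containers:
      - command:
        - kube-apiserver
        - --oidc-issuer-url=https://sts.windows.net/<tenant-id>/
        - --oidc-client-id=spn:<server-application-id>,<server-application-id>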

@mblaschke

mblaschke commented Jan 23, 2020

When you're running a shared cluster this situation is really bad. Many people/teams are now complaining that they can no longer log in to the clusters, so they have to keep multiple versions of kubectl, which breaks a lot of scripts.

Please create a migration path for companies that have to run multiple clusters on different versions, or roll back the change.

The support overhead is huge.

@dharmab
Contributor

dharmab commented Jan 23, 2020

Since there is already a breaking change in 1.17.1, a revert can only be done in 1.17.2, and that would make the situation more complex. I would suggest we add a doc:

On an AAD enabled cluster,
>=1.17.1 clusters are only compatible with >=1.17.1 kubectl versions and client libraries
<=1.17.0 clusters are only compatible with <=1.17.0 kubectl versions and client libraries

We need to check whether this will break anything, e.g. upgrading an AKS cluster from 1.17.0 to 1.17.1.

This isn't a feasible solution for users. As it stands, users using Azure AD OIDC have no upgrade path from 1.17.0.

  • Users operating a single high-availability cluster will have a window where both 1.17.1 and 1.17.0 apiservers are running and either version of kubectl will randomly fail depending on which apiserver responds.
  • For users operating a single cluster, asking thousands of clients to switch over at the exact same instant the new apiserver comes up is unreasonable, especially for clients using a Kubernetes client library for their programming language of choice, since changing the client version may require deploying a different build.
  • For users operating multiple clusters, it is impossible for clients to switch over without breakage because different clusters would likely be upgraded at different times (possibly spanning days or weeks).

The only reasonable solution is to provide an "overlapping" release where both OIDC client ID formats are accepted.

I think a reasonable solution would be:

  • Revert this change for both 1.17.2 and 1.18.0
  • Publish notifications that users using Azure AD OIDC should skip 1.17.1
  • Rework this change to allow both the old and new format of client ID to be used. Add warning messages in the apiserver startup logs and release changelog that the old format will go away in a future release. This change would ideally be in a new minor release of Kubernetes rather than a patch release.
  • In a future minor release, at least one minor version past the one that allowed the overlap, remove support for the old OIDC client ID format.

@dharmab
Contributor

dharmab commented Jan 23, 2020

/sig auth

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Jan 23, 2020
@dharmab
Contributor

dharmab commented Jan 23, 2020

/sig cli

@k8s-ci-robot k8s-ci-robot added the sig/cli Categorizes an issue or PR as relevant to SIG CLI. label Jan 23, 2020
@dharmab
Contributor

dharmab commented Jan 23, 2020

Note that this change also invalidates the kubectl version skew policy

kubectl is supported within one minor version (older or newer) of kube-apiserver

i.e. 1.17.X clusters must support kubectl 1.16.X and 1.18.X. So the OIDC change would have to remain compatible across three minor versions, instead of a single patch version as it stands.

@weinong
Contributor

weinong commented Jan 23, 2020

Sorry that my change caused such a problem. Originally I thought a configuration change was an acceptable migration path. I will submit a PR to revert it.

@weinong
Contributor

weinong commented Jan 23, 2020

Another thought:

This would prevent a breaking change and allow the new scenario to be enabled immediately.

The reason I suggest adding an azure2 module is that right now it's a bug in kubectl that dictates the AAD OIDC argument in the api-server. It seems weird to fix kube-apiserver for a bug in kubectl. There will always be a time when kubectl is updated and breaks the existing AAD OIDC flow because the api-server was not updated accordingly, as people have voiced here. That suggests a solution requiring no change to the api-server OIDC argument. At the same time, we also should not do too many hacky things in the api-server to intelligently omit the spn: prefix, because authentication is security critical.

Hence, we might as well just implement a new auth module...

thoughts?

@weinong
Contributor

weinong commented Jan 23, 2020

@karataliu I'm not sure whether the proposed '--oidc-client-id=spn:<guid>,<guid>' is helpful or not. It's only helpful for dealing with different versions of kubectl. But it doesn't help in the case where a new kubectl talks to an existing api-server. As indicated in the comments, people would need to update their clusters as soon as kubectl is updated. That's not much different from what we have right now. Right?

@dharmab
Contributor

dharmab commented Jan 23, 2020

But it doesn't help in the case where a new kubectl talks to an existing api-server. As indicated in the comments, people would need to update their clusters as soon as kubectl is updated

While this migration path is also technically out of compliance with the kubectl version skew policy, it is at least possible if users upgrade the apiserver prior to upgrading kubectl.

The separate auth module for AAD v2 is an interesting thought. Is it possible for an apiserver to use multiple auth modules simultaneously? If both auth modules could be accepted at the same time, there would be a clean migration path by keeping both modules available over a minor release.

@weinong
Contributor

weinong commented Jan 23, 2020

But it doesn't help in the case where a new kubectl talks to an existing api-server. As indicated in the comments, people would need to update their clusters as soon as kubectl is updated

While this migration path is also technically out of compliance with the kubectl version skew policy, it is at least possible if users upgrade the apiserver prior to upgrading kubectl.

This is technically not true. As people have commented here, there will be a window of time where an older kubectl is unable to talk to an api-server with the newer configuration.

The separate auth module for AAD v2 is an interesting thought. Is it possible for an apiserver to use multiple auth modules simultaneously? If both auth modules could be accepted at the same time, there would be a clean migration path by keeping both modules available over a minor release.

I think it's possible. We will need to 1) implement two auth modules in kubectl, 2) enable multiple --oidc-client-id values in the api-server, as @karataliu proposed, and 3) migrate kubeconfigs to use the new auth module.
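
As a rough sketch of step 3, a migrated kubeconfig user entry might look like the following. The auth-provider name azure2 is purely the hypothetical new module discussed above (it does not exist today), and its config keys are assumed to mirror the current azure provider:

    # Before: existing azure auth-provider (requests tokens with the spn:-prefixed audience)
    user:
      auth-provider:
        name: azure
        config:
          client-id: <client-application-id>
          tenant-id: <tenant-id>
          apiserver-id: <server-application-id>

    # After (hypothetical): new client module producing the bare-GUID audience
    user:
      auth-provider:
        name: azure2   # illustrative name only
        config:
          client-id: <client-application-id>
          tenant-id: <tenant-id>
          apiserver-id: <server-application-id>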

@weinong
Contributor

weinong commented Jan 23, 2020

@mblaschke @jcrowthe @dharmab does that sound like a plausible path? Should I just open issues to track these tasks while I'm preparing to revert?

@jcrowthe
Author

@karataliu I'm not sure whether the proposed '--oidc-client-id=spn:<guid>,<guid>' is helpful or not. It's only helpful for dealing with different versions of kubectl. But it doesn't help in the case where a new kubectl talks to an existing api-server. As indicated in the comments, people would need to update their clusters as soon as kubectl is updated. That's not much different from what we have right now. Right?

@weinong We've identified a lot of use cases here, each with differing requirements. I believe that, to entirely remove the impact, both a server-side solution and a client-side solution need to be provided.

Of these two sides -- client and server -- a solution on one side means the other doesn't need to change. I.e., if we change the apiserver to support multiple IDs, no clients need to change how they authenticate, as all old and new OIDC client IDs will be accepted. And vice versa: if we change kubectl to include a second auth option, cluster operators don't need to reconfigure apiservers to change which IDs they accept.

Having both of these solutions gives all users and operations teams involved the flexibility to select whichever option fits their needs best. It also allows for swapping between spn: and no spn: as needed (and moving off spn: is desirable for the many reasons stated in the original issue).

Having said all of this, and as an operator of (very many) clusters, the primary cost I'm looking to avoid is the requirement that every user of a k8s cluster my team operates must change their client libraries/kubectl version and the associated kubeconfig. Since that affects thousands of users and deployment tools in my organization, it is the primary point of impact I'm attempting to avoid. For that reason I'm much more interested in the server-side apiserver modification, yet others likely feel differently based on their use case. As such, the proper solution is likely to support both server-side and client-side solutions.

Hopefully this all makes sense.

@weinong
Contributor

weinong commented Jan 24, 2020

@jcrowthe I was referring to this paragraph, which is aligned with your thought that we need both server-side and client-side changes for a proper migration:

The separate auth module for AAD v2 is an interesting thought. Is it possible for an apiserver to use multiple auth modules simultaneously? If both auth modules could be accepted at the same time, there would be a clean migration path by keeping both modules available over a minor release.

I think it's possible. We will need to 1) implement two auth modules in kubectl, 2) enable multiple --oidc-client-id values in the api-server, as @karataliu proposed, and 3) migrate kubeconfigs to use the new auth module.

@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Jan 24, 2020
@liggitt
Member

liggitt commented Jan 24, 2020

The first priority should be to restore compatibility with the oldest existing releases. I would expect the breaking change to be reverted on 1.15/1.16/1.17 branches.

@liggitt liggitt removed the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Jan 24, 2020
@liggitt
Member

liggitt commented Jan 24, 2020

/assign @andyzhangx @feiskyer

@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Jan 24, 2020
@weinong
Contributor

weinong commented Jan 24, 2020

@liggitt yes, will cherry-pick once it's merged. Thank you!
