
Azure AD: Migration path from 1.17.0 to 1.17.1 #87464

Closed
jcrowthe opened this issue Jan 22, 2020 · 21 comments · Fixed by #87507
Labels: area/provider/azure, kind/bug, kind/regression, sig/auth, sig/cli, sig/cloud-provider, sig/windows

@jcrowthe

What happened:
Kubernetes 1.17.1 included a breaking change (#86412) to how Azure AD may be used for OIDC authentication in Kubernetes. For all teams that were previously using Azure AD's implementation of OIDC in Kubernetes successfully, this change means:

  • >=1.17.1 clusters are only compatible with >=1.17.1 kubectl versions and client libraries
  • <=1.17.0 clusters are only compatible with <=1.17.0 kubectl versions and client libraries

Notably, this includes not only teams who directly use AAD, but also all AKS clusters. See Azure/aks-engine#2582.

What you expected to happen:
A breaking change should come with a migration path. In this case (and as currently released), both the apiserver and all of its clients must change versions simultaneously to avoid impact. In larger deployments of Kubernetes, a single cluster may have hundreds (or more) of disparate clients; such a large change cannot reasonably be executed without impacting customers and causing critical service outages.

This is a request for details on how cluster operators and Kubernetes users (that use Azure AD) should migrate from 1.17.0 to 1.17.1 without impact.

How to reproduce it (as minimally and precisely as possible):

  1. Launch a cluster on 1.17.0
  2. Follow the instructions here to integrate with Azure AD (a rough sketch of the resulting configuration follows this list): https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/client-go/plugin/pkg/client/auth/azure
  3. Verify that authn/authz succeeds with kubectl <=1.17.0
  4. Verify that authn/authz fails with kubectl >=1.17.1
  5. Repeat the test with a 1.17.1 cluster and kubectl versions below 1.17.1
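
For reference, a rough sketch of the two pieces of configuration involved (all values are placeholders and details may differ; the linked client-go README is the authoritative source):

    # Client side: kubeconfig user entry using the azure auth-provider
    users:
    - name: azure-user
      user:
        auth-provider:
          name: azure
          config:
            environment: AzurePublicCloud
            client-id: <client-application-id>
            tenant-id: <tenant-id>
            apiserver-id: <server-application-id>

    # Server side: kube-apiserver OIDC flags as documented for <=1.17.0 clients
    --oidc-issuer-url=https://sts.windows.net/<tenant-id>/
    --oidc-client-id=spn:<server-application-id>

The client change in 1.17.1 effectively requires the server side to drop the spn: prefix, i.e. --oidc-client-id=<server-application-id>.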

Anything else we need to know?:
That the underlying functionality should change is not disputed (see the evidence in the original ticket, #86410): Azure AD has improperly injected an spn: prefix into the aud field of JWT tokens for a long time now. The issue is not with the new functionality itself, but rather with the manner in which the change was implemented. Without a designated migration path, this change is highly impactful to clients.
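
To make the incompatibility concrete, here is a minimal illustration (not real token contents, placeholder IDs) of the audience claim in the ID tokens involved; the apiserver's --oidc-client-id value must match this claim, which is why the two formats are mutually exclusive:

    # aud claim produced via the <=1.17.0 client flow (spn: prefix present)
    "aud": "spn:<server-application-id>"

    # aud claim produced via the >=1.17.1 client flow (bare application ID)
    "aud": "<server-application-id>"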

Environment:

  • Kubernetes version (use kubectl version): apiserver: 1.17.1, kubectl: 1.17.0 (and vice versa)
  • Cloud provider or hardware configuration: N/A
  • OS (e.g: cat /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Network plugin and version (if this is a network-related bug): N/A
  • Others:

Tagging original author and approvers of the PR: @feiskyer @andyzhangx @enj

/sig cloud-provider
/area provider/azure

@jcrowthe jcrowthe added the kind/bug Categorizes issue or PR as related to a bug. label Jan 22, 2020
@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. area/provider/azure Issues or PRs related to azure provider labels Jan 22, 2020
@andyzhangx
Member

@jcrowthe do you mean we should revert this change?
cc @weinong @karataliu
Currently it's only in the k8s v1.17.1 release.

@jcrowthe
Author

@andyzhangx To help in determining that, I'd suggest the following:

  • This is a major functionality change which was introduced in a patch release. It would be more appropriate for such a change to be introduced in a major Kubernetes release version.
  • This is a breaking change with high impact to both operations teams and end users of Kubernetes. Such a change should be introduced with a clear migration strategy that does not impact cluster operations.

I am supportive of this functionality change being made to Kubernetes; however, this PR as implemented imposes an unreasonable cost on end users. I recommend it be reworked into a more appropriate, non-impactful PR that provides a solid migration path for those already using the "old" functionality.

With that, I'll defer to maintainers to determine the proper next steps.

@andyzhangx
Member

andyzhangx commented Jan 23, 2020

@andyzhangx To help in determining that, I'd suggest the following:

  • This is a major functionality change which was introduced in a patch release. It would be more appropriate for such a change to be introduced in a major Kubernetes release version.
  • This is a breaking change with high impact to both operations teams and end users of Kubernetes. Such a change should be introduced with a clear migration strategy that does not impact cluster operations.

I am supportive of this functionality change being made to Kubernetes; however, this PR as implemented imposes an unreasonable cost on end users. I recommend it be reworked into a more appropriate, non-impactful PR that provides a solid migration path for those already using the "old" functionality.

With that, I'll defer to maintainers to determine the proper next steps.

@jcrowthe thanks, that's the case: if there is an AAD-enabled cluster with both 1.17.0 and 1.17.1 agent nodes, there is no way to use a single kubectl version to operate on this cluster.

@andyzhangx
Member

Since there is already a breaking change in 1.17.1, a revert can only be done in 1.17.2, and that would make the situation more complex. I would suggest we add a doc:

On an AAD enabled cluster,
>=1.17.1 clusters are only compatible with >=1.17.1 kubectl versions and client libraries
<=1.17.0 clusters are only compatible with <=1.17.0 kubectl versions and client libraries

We need to check whether this will break anything, e.g. upgrading an AKS cluster from 1.17.0 to 1.17.1.

@karataliu
Contributor

It's not related to the server-side version, but to the apiserver configuration.
The current situation is:

'--oidc-client-id=spn:<guid>' is compatible with <=1.17.0 clients
'--oidc-client-id=<guid>' is compatible with >=1.17.1 clients

A workable fix would be to allow multiple '--oidc-client-id' values to be passed in, while the server switches to

'--oidc-client-id=spn:<guid>,<guid>'
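
If that were supported, a migration-window apiserver configuration might look like the sketch below. This is hypothetical: --oidc-client-id currently accepts a single value, so the comma-separated form is only the proposal above, not an existing flag format.

    # Hypothetical kube-apiserver static pod manifest fragment (proposed, not currently supported)
    spec:
      containers:
      - command:
        - kube-apiserver
        - --oidc-issuer-url=https://sts.windows.net/<tenant-id>/
        - --oidc-client-id=spn:<server-application-id>,<server-application-id>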

@mblaschke

mblaschke commented Jan 23, 2020

When you're running a shared cluster this situation is really bad. Many people/teams are now complaining that they can no longer log in to the clusters, so they have to keep multiple versions of kubectl, which breaks a lot of scripts.

Please create a migration path for companies that have to run multiple clusters on different versions, or roll back the change.

The support overhead is huge.

@dharmab
Contributor

dharmab commented Jan 23, 2020

Since there is already a breaking change in 1.17.1, a revert can only be done in 1.17.2, and that would make the situation more complex. I would suggest we add a doc:

On an AAD enabled cluster,
>=1.17.1 clusters are only compatible with >=1.17.1 kubectl versions and client libraries
<=1.17.0 clusters are only compatible with <=1.17.0 kubectl versions and client libraries

We need to check whether this will break anything, e.g. upgrading an AKS cluster from 1.17.0 to 1.17.1.

This isn't a feasible solution for users. As it stands, users using Azure AD OIDC have no upgrade path from 1.17.0.

  • Users operating a single high-availability cluster will have a window where both 1.17.1 and 1.17.0 apiservers are running and either version of kubectl will randomly fail depending on which apiserver responds.
  • For users operating a single cluster, asking thousands of clients to switch over at the exact same instant the new apiserver comes up is unreasonable, especially for clients using a Kubernetes client library for their programming language of choice, since changing the client version may require deploying a different build.
  • For users operating multiple clusters, it is impossible for clients to switch over without breakage because different clusters would likely be upgraded at different times (possibly spanning days or weeks).

The only reasonable solution is to provide an "overlapping" release where both OIDC client ID formats are accepted.

I think a reasonable solution would be:

  • Revert this change for both 1.17.2 and 1.18.0
  • Publish notifications that users using Azure AD OIDC should skip 1.17.1
  • Rework this change to allow both the old and new format of client ID to be used. Add warning messages in the apiserver startup logs and release changelog that the old format will go away in a future release. This change would ideally be in a new minor release of Kubernetes rather than a patch release.
  • In a future minor release, at least one minor version past the one that allowed the overlap, remove support for the old OIDC client ID format.

@dharmab
Contributor

dharmab commented Jan 23, 2020

/sig auth

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Jan 23, 2020
@dharmab
Contributor

dharmab commented Jan 23, 2020

/sig cli

@k8s-ci-robot k8s-ci-robot added the sig/cli Categorizes an issue or PR as relevant to SIG CLI. label Jan 23, 2020
@dharmab
Contributor

dharmab commented Jan 23, 2020

Note that this change also invalidates the kubectl version skew policy

kubectl is supported within one minor version (older or newer) of kube-apiserver

i.e. 1.17.X clusters must support kubectl 1.16.X and 1.18.X. So the OIDC change would have to remain compatible across three minor versions, instead of a single patch version as it stands.

@weinong
Contributor

weinong commented Jan 23, 2020

Sorry that my change caused such a problem. Originally I thought a configuration change was an acceptable migration path. I will submit a PR to revert it.

@weinong
Contributor

weinong commented Jan 23, 2020

Another thought:

This would prevent a breaking change and allow the new scenario to be enabled immediately.

The reason I suggest adding an azure2 module is that right now it's a bug in kubectl that dictates the AAD OIDC argument in the api-server. It seems weird to fix kube-apiserver for a bug in kubectl. There will always be a time when kubectl is updated and breaks the existing AAD OIDC flow because the api-server was not updated accordingly, as people have voiced here. That suggests a solution requiring no change to the api-server OIDC argument. At the same time, we also should not do too many hacky things in the api-server to intelligently omit the spn: prefix, because authentication is security critical.

Hence, we might as well just implement a new auth module...

thoughts?

@weinong
Contributor

weinong commented Jan 23, 2020

@karataliu I'm not sure whether the proposed '--oidc-client-id=spn:<guid>,<guid>' is helpful or not. It's only helpful for dealing with different versions of kubectl. But it doesn't help in the case where a new kubectl talks to an existing api-server. As indicated in the comments, people would need to update their clusters as soon as kubectl is updated. That's not much different from what we have right now. Right?

@dharmab
Contributor

dharmab commented Jan 23, 2020

But it doesn't help in the case where a new kubectl talks to an existing api-server. As indicated in the comments, people would need to update their clusters as soon as kubectl is updated

While this migration path is also technically out of compliance with the kubectl version skew policy, it is at least possible if users upgrade the apiserver prior to upgrading kubectl.

The separate auth module for AAD v2 is an interesting thought. Is it possible for an apiserver to use multiple auth modules simultaneously? If both auth modules could be accepted at the same time, there would be a clean migration path by keeping both modules available over a minor release.

@weinong
Contributor

weinong commented Jan 23, 2020

But it doesn't help in the case where a new kubectl talks to an existing api-server. As indicated in the comments, people would need to update their clusters as soon as kubectl is updated

While this migration path is also technically out of compliance with the kubectl version skew policy, it is at least possible if users upgrade the apiserver prior to upgrading kubectl.

This is technically not true. As people have commented here, there will be a window of time where an older kubectl is unable to talk to an api-server with the newer configuration.

The separate auth module for AAD v2 is an interesting thought. Is it possible for an apiserver to use multiple auth modules simultaneously? If both auth modules could be accepted at the same time, there would be a clean migration path by keeping both modules available over a minor release.

I think it's possible. We will need to 1) implement two auth modules in kubectl, 2) enable multiple --oidc-client-id values in the api-server, as @karataliu proposed, and 3) migrate kubeconfigs to use the new auth module.
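
As a rough sketch of step 3, a migrated kubeconfig user entry might look like the following. The auth-provider name azure2 is purely the hypothetical new module discussed above (it does not exist today), and its config keys are assumed to mirror the current azure provider:

    # Before: existing azure auth-provider (requests tokens with the spn:-prefixed audience)
    user:
      auth-provider:
        name: azure
        config:
          client-id: <client-application-id>
          tenant-id: <tenant-id>
          apiserver-id: <server-application-id>

    # After (hypothetical): new client module producing the bare-GUID audience
    user:
      auth-provider:
        name: azure2   # illustrative name only
        config:
          client-id: <client-application-id>
          tenant-id: <tenant-id>
          apiserver-id: <server-application-id>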

@weinong
Contributor

weinong commented Jan 23, 2020

@mblaschke @jcrowthe @dharmab does that sound like a plausible path? Should I just open issues to track these tasks while I'm preparing to revert?

@jcrowthe
Author

@karataliu I'm not sure whether the proposed '--oidc-client-id=spn:<guid>,<guid>' is helpful or not. It's only helpful for dealing with different versions of kubectl. But it doesn't help in the case where a new kubectl talks to an existing api-server. As indicated in the comments, people would need to update their clusters as soon as kubectl is updated. That's not much different from what we have right now. Right?

@weinong We've identified a lot of use cases here, each with differing requirements. I believe that, to entirely remove the impact, both a server-side solution and a client-side solution need to be provided.

Of these two sides -- client and server -- a solution on one side means the other doesn't need to change. I.e., if we change the apiserver to support multiple IDs, no clients need to change how they authenticate, as all old and new OIDC client IDs will be accepted. And vice versa: if we change kubectl to include a second auth option, cluster operators don't need to reconfigure apiservers to change which IDs they accept.

Having both of these solutions gives all users and operations teams involved the flexibility to select whichever option fits their needs best. It also allows for swapping between spn: and no spn: as needed (and moving off spn: is desirable for the many reasons stated in the original issue).

Having said all of this, and as an operator of (very many) clusters, the primary cost I'm looking to avoid is the requirement that every user of a k8s cluster my team operates must change their client libraries/kubectl version and the associated kubeconfig. Since that affects thousands of users and deployment tools in my organization, it is the primary point of impact I'm attempting to avoid. For that reason I'm much more interested in the server-side apiserver modification, yet others likely feel differently based on their use case. As such, the proper solution is likely to support both server-side and client-side solutions.

Hopefully this all makes sense.

@weinong
Contributor

weinong commented Jan 24, 2020

@jcrowthe I was referring to this paragraph, which is aligned with your thought that we need both server-side and client-side changes for a proper migration:

The separate auth module for AAD v2 is an interesting thought. Is it possible for an apiserver to use multiple auth modules simultaneously? If both auth modules could be accepted at the same time, there would be a clean migration path by keeping both modules available over a minor release.

I think it's possible. We will need to 1) implement two auth modules in kubectl, 2) enable multiple --oidc-client-id values in the api-server, as @karataliu proposed, and 3) migrate kubeconfigs to use the new auth module.

@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Jan 24, 2020
@liggitt
Member

liggitt commented Jan 24, 2020

The first priority should be to restore compatibility with the oldest existing releases. I would expect the breaking change to be reverted on 1.15/1.16/1.17 branches.

@liggitt liggitt removed the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Jan 24, 2020
@liggitt
Member

liggitt commented Jan 24, 2020

/assign @andyzhangx @feiskyer

@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Jan 24, 2020
@weinong
Contributor

weinong commented Jan 24, 2020

@liggitt yes, will cherry-pick once it's merged. Thank you!
