
Allow for easy migration from Citadel to SPIRE-backed workload identities #50783

Open
SpectralHiss opened this issue May 1, 2024 · 1 comment

Comments

@SpectralHiss
Contributor

SpectralHiss commented May 1, 2024

When an operator wants to migrate from the self-signed Citadel CA to SPIRE-backed workload identities, they will encounter multiple difficulties that seem to force them to restart all cross-communicating workloads at once. This makes the migration path difficult without a blue/green setup or a full cluster upgrade.

To better illustrate our findings about the current mTLS behavior, here is an experiment we ran:

First install standard istio and bookinfo:

kind create cluster --name test-onboard-spire
istioctl install --set profile=minimal
kubectl create ns bookinfo
kubectl label ns bookinfo istio-injection=enabled
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.21/samples/bookinfo/platform/kube/bookinfo.yaml -n bookinfo

Then apply the SPIRE quickstart:

kubectl apply -f spire-backed/spire-quickstart.yaml
kubectl wait pod --for=condition=ready -n spire -l app=spire-agent
kubectl apply -f spire-backed/clusterspiffeid.yaml
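
The clusterspiffeid.yaml above is a local file not included in this issue. As a rough sketch, it would contain a ClusterSPIFFEID resource along the lines of the Istio/SPIRE integration docs; the template, name, and selector below are assumptions about that file, not its verbatim contents:

# Hypothetical reconstruction of spire-backed/clusterspiffeid.yaml
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: istio-sidecar-registrar
spec:
  # Issue identities matching Istio's spiffe://<trust-domain>/ns/<ns>/sa/<sa> scheme
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  # Only pods opted in via this label get SPIRE-managed identities
  podSelector:
    matchLabels:
      spiffe.io/spire-managed-identity: "true"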

istioctl install -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  namespace: istio-system
spec:
  profile: minimal
  meshConfig:
    trustDomain: cluster.local
  values:
    # This is used to customize the sidecar template
    sidecarInjectorWebhook:
      templates:
        spire: |
          spec:
            containers:
            - name: istio-proxy
              volumeMounts:
              - name: workload-socket
                mountPath: /run/secrets/workload-spiffe-uds
                readOnly: true
            volumes:
              - name: workload-socket
                csi:
                  driver: "csi.spiffe.io"
                  readOnly: true
EOF

(Since there is no ingress gateway, this config is simpler than the standard quickstart.)
Notice that we chose the exact same trust domain, cluster.local, as the SPIRE trust domain.

Next, we make the details pod fetch a SPIRE certificate for itself:

# These are the bookinfo manifests, but with the SPIRE pod selector labels and template annotations added
kubectl apply -f spire-backed/bookinfo-spired.yaml -l app=details,version=v1
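
The modified bookinfo-spired.yaml is also not included here. The per-workload changes would be roughly as follows; the label and annotation names follow the usual Istio SPIRE integration pattern, so treat this as an assumption about that file:

# Hypothetical excerpt of a Deployment pod template from bookinfo-spired.yaml
template:
  metadata:
    labels:
      app: details
      version: v1
      # Matches the ClusterSPIFFEID podSelector so SPIRE issues an identity
      spiffe.io/spire-managed-identity: "true"
    annotations:
      # Render the default sidecar template plus the custom "spire"
      # template defined in the IstioOperator above
      inject.istio.io/templates: "sidecar,spire"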

We confirmed that the details pod successfully gets issued a SPIRE-backed certificate and connects to the control plane.
However, communication from productpage -> details is now broken due to a TLS error.

This is because productpage only has the old self-signed root (from the istio-root-ca-configmap) in its trust store, and this shows in istioctl pc secret.
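
To verify this, you can dump the certificates loaded in each sidecar (the pod name suffix below is hypothetical):

# Inspect which root(s) the productpage sidecar currently trusts
istioctl proxy-config secret productpage-v1-<pod-suffix> -n bookinfo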

If we now make productpage use SPIRE:

# Make productpage use spire
kubectl apply -f spire-backed/bookinfo-spired.yaml -l app=productpage,version=v1

Now productpage reaches details fine, but ratings/reviews are broken (as they use the old CA and are no longer trusted).

What's even more confusing is that the connection to the control plane still works correctly, even though it uses the old CA.
In case you were not aware, the Istiod control plane does not yet support workload identity through SPIRE, as tracked in this issue:
#49087

The reason this works is that the xDS proxy in each sidecar, which connects to istiod on behalf of Envoy, sets up a separate root of trust on workloads, including SPIRE-backed ones, and uses the istio-root-ca-configmap:

func (a *Agent) FindRootCAForXDS() (string, error) {

https://github.com/istio/api/blob/68cdbb256ce1d970fa0a2fb4397057d165ee4732/mesh/v1alpha1/proxy.proto#L593
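
Assuming the configmap referenced above is the per-namespace istio-ca-root-cert configmap that istiod replicates into each namespace, you can see the root the agent falls back to for its own XDS connection like this:

# Dump the istiod-distributed root that sidecars use for their
# control-plane (XDS) connection, independent of the workload trust bundle
kubectl get configmap istio-ca-root-cert -n bookinfo -o jsonpath='{.data.root-cert\.pem}'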

This is inconsistent and very surprising. In my view, there should be a way to include the old roots in the trust bundle of all workloads, to allow SPIRE-to-non-SPIRE communication, if only to enable a gradual rollout.

Describe alternatives you've considered

  1. Install SPIRE on all workloads and do a full cluster upgrade at once within a maintenance window, or use a blue/green setup for this if available.
  2. Perhaps force the trust bundle for workloads to always contain both the SPIRE root and the control plane's self-signed root (see the sketch below); it is unclear whether current mechanisms allow this.
    This would still require workload restarts to take effect at present, but would at least allow a gradual rollout.
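
For option 2, one existing mechanism that might apply is meshConfig.caCertificates, which distributes extra trust anchors to workloads; whether it actually covers this migration case is exactly what is unclear, so the following is only a sketch:

# Sketch: keep the old self-signed Citadel root trusted alongside the SPIRE root
meshConfig:
  caCertificates:
    - pem: |
        -----BEGIN CERTIFICATE-----
        <old Citadel root certificate>
        -----END CERTIFICATE-----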

Affected product area (please put an X in all that apply)

[ ] Ambient
[ ] Docs
[ ] Dual Stack
[x] Installation
[ ] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[x] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Additional context
Tested on Istio 1.18.2

@SpectralHiss
Contributor Author

SpectralHiss commented May 1, 2024

To better illustrate the trust relationships at play, I've created a couple of diagrams:

Details service only uses SPIRE certificates

[Diagram: onboard-spire-details]

Productpage and details use SPIRE certificates

[Diagram: onboard-spire-details-productpage]
