Shoot migration not working anymore #306

Open
mwennrich opened this issue Apr 26, 2023 · 2 comments


mwennrich commented Apr 26, 2023

"Waiting until the namespace 'shoot--p4jxn2--mwentest' has been cleaned up and deleted in the Seed cluster...

shoot--p4jxn2--mwentest                        Terminating   20m


NAME                                                                     NAMESPACE                AGE
firewalldeployment.firewall.metal-stack.io/shoot-firewall                shoot--p4jxn2--mwentest  17m
firewall.firewall.metal-stack.io/shoot--p4jxn2--mwentest-firewall-0f0a9  shoot--p4jxn2--mwentest  17m
firewallset.firewall.metal-stack.io/shoot-firewall-0eea7                 shoot--p4jxn2--mwentest  17m

The fw2, fwset, and fwdeployment objects have a firewall.metal-stack.io/firewall-controller-manager finalizer, but the firewall-controller-manager (fcm) has already been deleted, so the finalizer is never removed.

After removing the finalizer, migration continues, but after the restore, a new firewall is created, without deleting the old one.
This results in a cluster with two firewalls.

$ k get fwmon -n firewall
NAME                                     MACHINE ID                             IMAGE                          SIZE            LAST EVENT    AGE
shoot--p4jxn2--mwentest-firewall-0f0a9   256b1c00-be6d-11e9-8000-3cecef22b288   firewall-ubuntu-3.0.20230404   n1-medium-x86   Phoned Home   35m
shoot--p4jxn2--mwentest-firewall-e3c19   48eb9200-be80-11e9-8000-3cecef22fc1a   firewall-ubuntu-3.0.20230404   n1-medium-x86   Phoned Home   8m30s

Gerrit91 commented May 2, 2023

Very rough idea:

  • Remove the finalizers from the resources on migrate
  • Restore firewall resources from the firewall monitor resources in the shoot
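The first bullet above could be sketched roughly as follows. This is only an illustration on plain dictionaries: the helper name `strip_finalizer` is hypothetical, and a real implementation in the firewall-controller-manager would patch the objects through a Kubernetes client instead.

```python
import copy

# Finalizer name as it appears on the stuck resources in this issue.
FCM_FINALIZER = "firewall.metal-stack.io/firewall-controller-manager"


def strip_finalizer(obj: dict, finalizer: str = FCM_FINALIZER) -> dict:
    """Return a copy of a Kubernetes object (as a dict) without the given finalizer.

    Hypothetical helper: it only transforms the manifest and does not talk
    to an API server.
    """
    result = copy.deepcopy(obj)
    metadata = result.setdefault("metadata", {})
    remaining = [f for f in metadata.get("finalizers", []) if f != finalizer]
    if remaining:
        metadata["finalizers"] = remaining
    else:
        # Drop the key entirely when no finalizers are left.
        metadata.pop("finalizers", None)
    return result


firewall = {
    "kind": "Firewall",
    "metadata": {
        "name": "shoot--p4jxn2--mwentest-firewall-0f0a9",
        "namespace": "shoot--p4jxn2--mwentest",
        "finalizers": [FCM_FINALIZER],
    },
}
patched = strip_finalizer(firewall)
```

Other finalizers on the object are left untouched, so only the one owned by the firewall-controller-manager is released on migrate.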

Gerrit91 changed the title from "control plane migration not working anymore" to "Shoot migration not working anymore" on Jul 24, 2023

Gerrit91 commented Jul 24, 2023

With #308 we can make the firewall survive the shoot migration.

However, as the firewall-controller is now maintaining a seed client for reconciliation, the seed client becomes invalid after a shoot migration. This is because we use a static service account token, which Kubernetes signs with the cluster's CA, which has, of course, changed after the migration. Also the server endpoint has changed after the migration.

Thus, the firewall-controller needs a way to migrate its client to the new seed. For this, I think we have two options:

  • Recreate the firewall in the process of the migration (drawback: user traffic interruption during a migration; #308 "First part implementation of shoot migration for firewalls" becomes unnecessary)
  • Offer a renewal of the seed client through the firewall monitor resource (shoot client secret can be persisted during a Gardener shoot migration, so this continues working)

If we decide on the second variant, we should also consider migrating away from static service account tokens and instead start rotating certificates. We could also use bootstrap tokens to establish a trusted connection between the firewall-controller and the api-server.

Here is a brief description of what the process could look like:

  1. The firewall gets created with a bootstrap kubeconfig, passed through userdata and placed at /etc/firewall-controller/.bootstrap.kubeconfig, along with the following roles in the shoot's seed namespace:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: firewall.metal-stack.io:system:firewall-bootstrapper
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - create
  - get
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests/firewallcontroller
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: firewall.metal-stack.io:system:firewall-bootstrapper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: firewall.metal-stack.io:system:firewall-bootstrapper
subjects:
- kind: Group
  name: system:bootstrappers
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-07401b
  namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
  description: "Token for bootstrapping the metal-stack firewall-controller."
  token-id: 07401b
  token-secret: f395accd246ae52d
  expiration: <now+60m>
  usage-bootstrap-authentication: "true"
  usage-bootstrap-signing: "true"
  auth-extra-groups: system:bootstrappers
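The bootstrap token secret above could be generated along these lines. This is a minimal sketch, assuming the usual bootstrap token format (a 6-character token ID and a 16-character token secret, both from `[a-z0-9]`) and an RFC 3339 expiration timestamp. The function names are hypothetical. The sketch leaves out `auth-extra-groups`, on the assumption that bootstrap tokens already authenticate as members of `system:bootstrappers`.

```python
import secrets
import string
from datetime import datetime, timedelta, timezone

# Bootstrap token IDs and secrets are drawn from lowercase letters and digits.
ALPHABET = string.ascii_lowercase + string.digits


def random_part(length: int) -> str:
    """Return a cryptographically random string over [a-z0-9]."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def new_bootstrap_token_secret(ttl: timedelta = timedelta(minutes=60), now=None) -> dict:
    """Build a bootstrap token Secret manifest (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    token_id = random_part(6)
    token_secret = random_part(16)
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"bootstrap-token-{token_id}",
            "namespace": "kube-system",
        },
        "type": "bootstrap.kubernetes.io/token",
        "stringData": {
            "description": "Token for bootstrapping the metal-stack firewall-controller.",
            "token-id": token_id,
            "token-secret": token_secret,
            # RFC 3339 timestamp in UTC, here now + 60 minutes by default.
            "expiration": (now + ttl).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "usage-bootstrap-authentication": "true",
            "usage-bootstrap-signing": "true",
        },
    }


def bearer_token(secret: dict) -> str:
    """Return the '<token-id>.<token-secret>' string for the bootstrap kubeconfig."""
    data = secret["stringData"]
    return f"{data['token-id']}.{data['token-secret']}"


example = new_bootstrap_token_secret(now=datetime(2023, 7, 24, 12, 0, tzinfo=timezone.utc))
```

The value returned by `bearer_token` is what would go into the user section of the bootstrap kubeconfig that is handed to the firewall through userdata.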
  2. The firewall-controller starts up and uses the bootstrap kubeconfig to issue a certificate signing request (CSR).
  3. The firewall-controller-manager approves the CSR, enabling the firewall-controller to construct a seed client with the same minimal permissions that are implemented today.
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: firewall-controller-csr
spec:
  groups:
  - system:authenticated
  request: <csr>
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - digital signature
  - key encipherment
  - client auth
  username: shoot--pcfgbt--cilium-firewall-653f3  # FCM creates a rolebinding and role for every firewall
  expirationSeconds: <1 year?>
status:
  certificate: <cert>
  conditions:
  - lastTransitionTime: "2023-06-21T10:39:54Z"
    lastUpdateTime: "2023-06-21T10:39:54Z"
    message: Auto approving firewall-controller client certificate after SubjectAccessReview.
    reason: AutoApproved
    status: "True"
    type: Approved
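Before auto-approving such a CSR, the firewall-controller-manager would probably want to run a few sanity checks. The following is only a sketch of what those checks might be: the function `should_auto_approve` and its parameters are assumptions, and it does not perform the SubjectAccessReview mentioned in the condition message above.

```python
from typing import Iterable, Set, Tuple

EXPECTED_SIGNER = "kubernetes.io/kube-apiserver-client"
REQUIRED_USAGE = "client auth"
# Usages requested in the CSR example above.
ALLOWED_USAGES = {"digital signature", "key encipherment", "client auth"}


def should_auto_approve(
    signer_name: str,
    usages: Iterable[str],
    firewall_name: str,
    known_firewalls: Set[str],
) -> Tuple[bool, str]:
    """Decide whether a firewall-controller CSR may be approved (hypothetical policy).

    firewall_name is the identity the certificate is requested for;
    known_firewalls are the firewalls the manager currently maintains.
    """
    requested = set(usages)
    if signer_name != EXPECTED_SIGNER:
        return False, "unexpected signer"
    if REQUIRED_USAGE not in requested or not requested <= ALLOWED_USAGES:
        return False, "unexpected usages"
    if firewall_name not in known_firewalls:
        return False, "unknown firewall"
    return True, "ok"


known = {"shoot--pcfgbt--cilium-firewall-653f3"}
decision = should_auto_approve(
    "kubernetes.io/kube-apiserver-client",
    ["digital signature", "key encipherment", "client auth"],
    "shoot--pcfgbt--cilium-firewall-653f3",
    known,
)
```

Rejecting CSRs for unknown firewalls keeps a leaked bootstrap token from being used to obtain a certificate for an arbitrary identity.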
  4. The firewall-controller writes the seed kubeconfig to /etc/firewall-controller/.seed.kubeconfig.
  5. The firewall-controller uses the shoot access fields from the firewall object to create the shoot client.
  6. The shoot client is written to /etc/firewall-controller/.shoot.kubeconfig.
  7. The firewall-controller starts normal operation.
    • It asynchronously updates the tokens in the .shoot.kubeconfig and .seed.kubeconfig through the firewall monitor's shoot access fields.
  8. The firewall-controller-manager continuously checks the signed certificate of the firewall-controller.
    • When the certificate becomes invalid (e.g. due to a shoot migration or a requested CA roll), a new bootstrap kubeconfig is placed in the seed access section of the firewall monitor.
  9. If the firewall-controller receives an invalid certificate error from the seed client, it repeats the initial bootstrap process and creates a new seed client.
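The last step, falling back to the bootstrap process on an invalid certificate, could look roughly like this. It is a sketch with made-up names (`InvalidCertificateError`, `with_seed_client`, and the callables passed in are all hypothetical). The real firewall-controller would wire this into its client construction and reconciliation loop.

```python
class InvalidCertificateError(Exception):
    """Raised when the seed API server rejects the client certificate (hypothetical)."""


def with_seed_client(operation, load_seed_client, bootstrap, max_bootstraps=1):
    """Run operation(client), repeating the bootstrap process if the certificate is invalid.

    load_seed_client: builds a client from the existing .seed.kubeconfig
    bootstrap:        repeats the CSR flow and returns a fresh seed client
    """
    client = load_seed_client()
    for attempt in range(max_bootstraps + 1):
        try:
            return operation(client)
        except InvalidCertificateError:
            if attempt == max_bootstraps:
                raise  # bootstrapping did not help, surface the error
            client = bootstrap()


# Usage with stand-in callables that simulate a client invalidated by a migration.
calls = []


def load_old():
    calls.append("load")
    return "old-seed-client"


def rebootstrap():
    calls.append("bootstrap")
    return "new-seed-client"


def reconcile(client):
    if client == "old-seed-client":
        raise InvalidCertificateError("certificate signed by unknown authority")
    return f"reconciled with {client}"


result = with_seed_client(reconcile, load_old, rebootstrap)
```

Limiting the number of bootstrap attempts avoids a tight retry loop when the new bootstrap kubeconfig has not yet been published in the firewall monitor.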
