feat: al2 support #119

Status: Open. szab100 wants to merge 4 commits into base: master.
Conversation

@szab100 commented Dec 11, 2023

Adding Amazon Linux 2 support.

Notes:

  • Requires 5.15 kernel (available from amazon-linux-extras RPM source)
    • can be done via the following AMI packer provisioner script:
        {
            "type": "shell",
            "inline": [
                "sudo amazon-linux-extras disable kernel-5.4",
                "sudo amazon-linux-extras install -y kernel-5.15",
                "sudo yum install -y kernel-devel"
            ]
        },
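    A quick sanity check after baking the AMI (not part of this PR, just a sketch) is to confirm on the instance that the kernel switch took effect:

        # Run on the baked AL2 instance after reboot:
        uname -r             # expect a 5.15.x kernel
        rpm -q kernel-devel  # headers installed by the provisioner above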

Known / pending issues:

  1. Auto K8S ServiceAccount secret mounts using tmpfs are not working (see workaround below).

    The resulting error looks like this:

    container create failed: time="2024-01-19T16:16:13Z" level=error msg="container_linux.go:424: starting container process caused: process_linux.go:607: container init caused: process_linux.go:578: handleReqOp caused: rootfs_init_linux.go:378: bind mounting /var/lib/kubelet/pods/4feffa97-9271-47c6-831f-6f4f2132c146/volumes/kubernetes.io~projected/kube-api-access-dm675 to run/secrets/kubernetes.io/serviceaccount caused: bind mount through procfd of /var/lib/kubelet/pods/4feffa97-9271-47c6-831f-6f4f2132c146/volumes/kubernetes.io~projected/kube-api-access-dm675 -> run/secrets/kubernetes.io/serviceaccount: open o_path procfd: open /var/lib/containers/storage/overlay/642e09746c44d554f6b2ffda373169624ef1c1609648924e6be5dd2a33073603/merged/run/secrets/kubernetes.io/serviceaccount: no such file or directory"
    

    Workaround: these auto-mounted SA tokens fail to mount at the default /var/run/secrets/.. location; however, disabling the auto-mount on the Pod and adding the volume and volumeMount entries manually with a different mountPath (like /secrets/...) works, as shown in the two steps below:

    1. Disable auto-mounting of these SA tokens (done by various K8S / AWS controllers) at the Pod or container level by adding either of these to your Pod spec:
      annotations:
        eks.amazonaws.com/skip-containers: "dev"

      or (depending on which AWS controller does the mounting):
      spec:
        automountServiceAccountToken: false
      
    2. Add the volume & volumeMount entries manually to mount the Secret-backed token(s) to your Pod/container:
        volumes:
        - name: aws-iam-token
          projected:
            defaultMode: 420
            sources:
            - serviceAccountToken:
                audience: sts.amazonaws.com
                expirationSeconds: 86400
                path: token
      
        containers:
        - name: dev
          volumeMounts:
            - mountPath: /secrets/eks.amazonaws.com/serviceaccount
              name: aws-iam-token
              readOnly: true
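
    With the token mounted at a non-default path, workloads that expect the standard location may need to be pointed at it explicitly. A minimal sketch (AWS_WEB_IDENTITY_TOKEN_FILE is the standard AWS SDK variable; the role ARN is a placeholder, and whether it must be set depends on whether the EKS webhook injection was skipped):

        # Inside the "dev" container (sketch only, not part of this PR):
        export AWS_WEB_IDENTITY_TOKEN_FILE=/secrets/eks.amazonaws.com/serviceaccount/token
        export AWS_ROLE_ARN=arn:aws:iam::123456789012:role/my-irsa-role   # placeholder
        aws sts get-caller-identity   # should resolve the role via the projected token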
      

@ctalledo (Member) left a comment

Thanks @szab100 for the contribution!

Looks good in general, just a couple of comments.

k8s/scripts/kubelet-config-helper.sh
@@ -1476,6 +1488,7 @@ function do_config_kubelet() {
clean_cgroups_kubepods
config_kubelet "host-based"
adjust_crio_config_dependencies
restart_containerd
Member:

Curious on why restarting containerd is needed; it's non-obvious to me because we are not changing any containerd config, and after this script switches kubelet to use CRI-O, containerd will no longer be used.

@szab100 (Author) commented Jan 23, 2024:

Yeah, I'm not quite sure. I know this was needed at some point (it may be AL2-specific), but I'm unsure whether it was ultimately needed or not. It shouldn't hurt, though.

@ctalledo (Member) commented Jan 30, 2024:

I would prefer to leave it out: since the script doesn't change any containerd config, restarting containerd is confusing, I think; unless it's actually needed, of course, in which case a comment explaining why would be helpful.

Member:

I wonder if it helps with the sysbox uninstall from the cluster, where we revert from CRI-O back to containerd.

Author:

Hey @ctalledo, I made a new build today on top of sysbox v0.6.3, applying the new sysbox-runc patch + this PR + reverting the change that deprecated v1.25 (we are still on that; it is officially supported by AWS/EKS until October '24), and I can confirm that this line (restarting containerd) does seem to be needed. Without it, the installer daemonset finishes its first run and the node just fails (the kubelet is down).

Upon SSH-ing to the failing node, I see the following issue:

[root@ip-10-194-243-117 user]# systemctl status kubelet-config-helper.service
● kubelet-config-helper.service - Kubelet config service
   Loaded: loaded (/usr/lib/systemd/system/kubelet-config-helper.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2024-02-01 20:53:48 UTC; 14min ago
  Process: 24294 ExecStart=/bin/sh -c /usr/local/bin/kubelet-config-helper.sh (code=exited, status=1/FAILURE)
 Main PID: 24294 (code=exited, status=1/FAILURE)

Feb 01 20:52:18 ip-10-194-243-117.vpc.internal sh[24294]: + systemctl restart crio
Feb 01 20:52:18 ip-10-194-243-117.vpc.internal sh[24294]: + restart_kubelet
Feb 01 20:52:18 ip-10-194-243-117.vpc.internal sh[24294]: + echo 'Restarting Kubelet ...'
Feb 01 20:52:18 ip-10-194-243-117.vpc.internal sh[24294]: Restarting Kubelet ...
Feb 01 20:52:18 ip-10-194-243-117.vpc.internal sh[24294]: + systemctl restart kubelet
Feb 01 20:53:48 ip-10-194-243-117.vpc.internal sh[24294]: A dependency job for kubelet.service failed. See 'journalctl -xe' for details.
Feb 01 20:53:48 ip-10-194-243-117.vpc.internal systemd[1]: kubelet-config-helper.service: main process exited, code=exited, status=1/FAILURE
Feb 01 20:53:48 ip-10-194-243-117.vpc.internal systemd[1]: Failed to start Kubelet config service.
Feb 01 20:53:48 ip-10-194-243-117.vpc.internal systemd[1]: Unit kubelet-config-helper.service entered failed state.
Feb 01 20:53:48 ip-10-194-243-117.vpc.internal systemd[1]: kubelet-config-helper.service failed.
[root@ip-10-194-243-117 user]# systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubelet-args.conf, 30-kubelet-extra-args.conf
   Active: inactive (dead) since Thu 2024-02-01 20:52:13 UTC; 20min ago
     Docs: https://github.com/kubernetes/kubernetes
 Main PID: 6995 (code=exited, status=0/SUCCESS)
...
Feb 01 20:52:13 ip-10-194-243-117.vpc.internal kubelet[6995]: I0201 20:52:13.113055    6995 kubelet.go:2132] "SyncLoop (PLEG): event for pod" pod="addon-active-monitor-ns/aws-asg-activities-healthcheck-workflow...3aa88fa506f6a}
Feb 01 20:52:13 ip-10-194-243-117.vpc.internal systemd[1]: Stopping Kubernetes Kubelet...
Feb 01 20:52:13 ip-10-194-243-117.vpc.internal kubelet[6995]: I0201 20:52:13.564307    6995 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Feb 01 20:52:13 ip-10-194-243-117.vpc.internal systemd[1]: Stopped Kubernetes Kubelet.
Feb 01 20:53:48 ip-10-194-243-117.vpc.internal systemd[1]: Dependency failed for Kubernetes Kubelet.
Feb 01 20:53:48 ip-10-194-243-117.vpc.internal systemd[1]: Job kubelet.service/start failed with result 'dependency'.
Hint: Some lines were ellipsized, use -l to show in full.

This is probably because the kubelet systemd unit has containerd as a dependency (After & Requires):

[root@ip-10-194-243-117 user]# cat /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=containerd.service sandbox-image.service
Requires=containerd.service sandbox-image.service
...

So I added a new sed command to replace any potential dependencies on 'containerd' with 'crio', which fixes that. But even after this replacement, followed by the systemctl daemon-reload at the end of the config_kubelet() function, restarting kubelet still fails until the containerd service is stopped. That is why adding a containerd restart helped; I have since replaced it with a call to the existing stop_containerd() func instead.
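
In short, the order of operations that ends up working on AL2 is roughly the following (a sketch of the sequence described above, not a verbatim excerpt from kubelet-config-helper.sh):

# Drop the kubelet unit's dependency on containerd in favor of CRI-O.
sed -i "s/containerd.service/crio.service/" /etc/systemd/system/kubelet.service
systemctl daemon-reload      # pick up the edited kubelet unit
systemctl stop containerd    # kubelet won't (re)start while containerd is still up
systemctl restart kubelet    # kubelet now starts against CRI-O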

Member:

Hi @szab100, thank you very much for the detailed explanation; let's keep it there (and perhaps add a short comment explaining why it's needed).

> reverting the change that deprecated v1.25 (we are still on that, it is officially supported by AWS/EKS until '24 October)

Oh I had missed that, my mistake; we can bring it back then (it's a pretty simple change in the Makefiles IIRC), but I don't think we can actively test on it (but since it's been tested before, things should continue to work).

Author:

> Hi @szab100, thank you very much for the detailed explanation; let's keep it there (and perhaps add a short comment explaining why it's needed).
>
> > reverting the change that deprecated v1.25 (we are still on that, it is officially supported by AWS/EKS until '24 October)
>
> Oh I had missed that, my mistake; we can bring it back then (it's a pretty simple change in the Makefiles IIRC), but I don't think we can actively test on it (but since it's been tested before, things should continue to work).

It's fine, actually in the meantime we upgraded to 1.26.6 as well, so no need to bring it back.

@szab100 requested a review from ctalledo on January 23, 2024 at 22:30
@ctalledo (Member) left a comment

One more comment please.

else
kubelet_env_file=$(echo "$kubelet_env_files" | awk '{print $NF}')
fi

backup_config "$kubelet_env_file" "kubelet_env_file"

# Replace potential dependencies on 'containerd' with 'crio'
sed -i "s/containerd.service/crio.service/" /etc/systemd/system/kubelet.service
Member:

During the sysbox-deploy-k8s uninstall, don't we need to "undo" this change? Maybe create a copy of the kubelet.service, and then revert to it during uninstall.
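
Something along these lines could work; this is only a sketch of the suggestion (the .orig backup path and the uninstall-side location are assumptions, not code from this PR):

# Install side (kubelet-config-helper.sh): keep a pristine copy before editing the unit.
cp /etc/systemd/system/kubelet.service /etc/systemd/system/kubelet.service.orig
sed -i "s/containerd.service/crio.service/" /etc/systemd/system/kubelet.service

# Uninstall side (the kubelet revert/unconfig helper): restore the original unit.
if [ -f /etc/systemd/system/kubelet.service.orig ]; then
    mv /etc/systemd/system/kubelet.service.orig /etc/systemd/system/kubelet.service
    systemctl daemon-reload
fi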
