[FEATURE] Container-Optimized OS support #6165

Closed
innobead opened this issue Jun 20, 2023 · 13 comments
Labels
area/install-uninstall-upgrade Install, Uninstall or Upgrade related area/platform-arch Platform and architecture support related highlight Important feature/issue to highlight kind/feature Feature request, new feature priority/0 Must be fixed in this release (managed by PO) require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated

Comments

@innobead
Member

innobead commented Jun 20, 2023

Is your feature request related to a problem? Please describe (👍 if you like this request)

https://cloud.google.com/container-optimized-os/docs This is an evaluation task to see how Longhorn can integrate with Container-Optimized OS.

Describe the solution you'd like

A clear and concise description of what you want to happen

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Limitations https://cloud.google.com/container-optimized-os/docs/concepts/features-and-benefits

@innobead innobead added kind/feature Feature request, new feature area/install-uninstall-upgrade Install, Uninstall or Upgrade related priority/0 Must be fixed in this release (managed by PO) investigation-needed Need to identify the case before estimating and starting the development labels Jun 20, 2023
@innobead innobead added this to the v1.6.0 milestone Jun 20, 2023
@innobead innobead changed the title [FEATURE] Container OS support [FEATURE] Container-Optimized OS support Jul 17, 2023
@innobead innobead added the highlight Important feature/issue to highlight label Jul 17, 2023
@innobead innobead added priority/1 Highly recommended to fix in this release (managed by PO) and removed priority/0 Must be fixed in this release (managed by PO) labels Dec 13, 2023
@innobead innobead modified the milestones: v1.6.0, v1.7.0 Dec 25, 2023
@innobead innobead added priority/0 Must be fixed in this release (managed by PO) area/platform-arch Platform and architecture support related and removed priority/1 Highly recommended to fix in this release (managed by PO) labels Feb 26, 2024
@c3y1huang
Contributor

c3y1huang commented Mar 6, 2024

According to the document statement:

Container-Optimized OS is the default node OS Image in Kubernetes Engine and other Kubernetes deployments on Google Cloud Platform.

Given that Longhorn is compatible with Google Kubernetes Engine (GKE), we can assume that it should also work on self-launched COS + Kubernetes.

TODO: manually create a COS cluster with Kubernetes and verify Longhorn passes the core testing.

@c3y1huang
Contributor

c3y1huang commented Mar 6, 2024

It seems we've only tested GKE with UBUNTU_CONTAINERD. We've never done any testing with COS_CONTAINERD.

TODO: figure out whether it is possible to get Longhorn dependencies onto COS.

@c3y1huang
Copy link
Contributor

c3y1huang commented Mar 12, 2024

TODO: figure out whether it is possible to get Longhorn dependencies onto COS.

COS_CONTAINERD doesn't ship with a package manager, and the majority of directories are read-only. GKE works around this with cloud-config in user-data and with /home/kubernetes/containerized_mounter/rootfs, which contains the dependencies and performs mounts within a chroot environment.

TODO: manually create a COS cluster with Kubernetes and verify Longhorn passes the core testing.

Testing with other orchestrators like K3s is challenging because there is no supported method for pre-configuring the environment.

The easiest approach is to use GKE, which comes with COS_CONTAINERD + Kubernetes. We could potentially use the same rootfs for installing Longhorn dependencies and for mounting in the data path. However, since the user-data is predefined in GKE, we can't use cloud-config directly. Instead, a possible workaround is a DaemonSet that runs chroot and nsenter to set up and install the Longhorn dependencies.

ref: https://github.com/kubernetes/kubernetes/tree/v1.29.2/cluster/gce
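
The DaemonSet workaround described above could look roughly like the following sketch. It builds a manifest as a plain Python dict and prints it as JSON, which kubectl also accepts. The name, image, and setup command are hypothetical placeholders, not the manifest Longhorn ships; only the rootfs path comes from the comment above, and it is assumed that chroot is available on the host.

```python
import json

# Path taken from the comment above; everything else in this sketch is assumed.
GKE_MOUNTER_ROOTFS = "/home/kubernetes/containerized_mounter/rootfs"


def build_node_agent_daemonset(name, image, setup_command):
    """Return a DaemonSet manifest that runs setup_command inside the host chroot.

    name, image and setup_command are hypothetical inputs for illustration.
    """
    # nsenter joins the host mount namespace through PID 1 (this needs hostPID
    # and a privileged container); chroot then switches into the GKE rootfs.
    command = [
        "nsenter", "--target", "1", "--mount", "--",
        "chroot", GKE_MOUNTER_ROOTFS,
        "/bin/sh", "-c", setup_command,
    ]
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "hostPID": True,
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                        "securityContext": {"privileged": True},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_node_agent_daemonset(
        name="example-cos-node-agent",         # hypothetical name
        image="example.invalid/tools:latest",  # hypothetical image that provides nsenter
        setup_command="echo 'install Longhorn dependencies here'",  # placeholder
    )
    print(json.dumps(manifest, indent=2))
```

Since DaemonSet pods are restarted whenever their container exits, a real agent would also need to keep running (for example by sleeping) after the setup step finishes.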

@c3y1huang
Contributor

c3y1huang commented Mar 12, 2024

As evidence supporting the assumption mentioned above, I installed open-iscsi via a DaemonSet and confirmed that iscsid is running on the host system:

> ps aux | grep iscsid
root       58264  0.0  0.0  25384   292 ?        Ss   06:21   0:00 /sbin/iscsid
root       58265  0.0  0.0  25888  5324 ?        S<Ls 06:21   0:00 /sbin/iscsid
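
For anyone who wants to automate this check, here is a minimal sketch that extracts the iscsid PIDs from `ps aux` output. The function name is made up for illustration, and it assumes the standard 11-column `ps aux` layout.

```python
def iscsid_processes(ps_output):
    """Return the PIDs of iscsid processes found in `ps aux` output."""
    pids = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        # Compare only the binary name, so the `grep iscsid` line is not counted.
        if executable.rsplit("/", 1)[-1] == "iscsid":
            pids.append(int(fields[1]))
    return pids


sample = """\
root       58264  0.0  0.0  25384   292 ?        Ss   06:21   0:00 /sbin/iscsid
root       58265  0.0  0.0  25888  5324 ?        S<Ls 06:21   0:00 /sbin/iscsid
"""
print(iscsid_processes(sample))  # prints [58264, 58265]
```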

@c3y1huang
Contributor

c3y1huang commented Mar 18, 2024

Given that we've already done the majority of the work for the Talos support feature, only a small effort seems to be needed for this one: primarily, providing users with a way to install the necessary dependencies.

Additionally, in GKE, periodic updates are made to topology.kubernetes.io/zone to reflect the actual zone, causing failures of the replica_auto_balance_zone tests when such updates occur. To tackle this, we can modify the tests to refresh the zone labels and maintain the expected zone label simulation throughout each test run.
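
The label refresh idea can be sketched as a pure function that computes which nodes need their zone label re-applied. This is only an illustration under assumed data shapes, not the code in longhorn-tests; the function name and the dict layouts are hypothetical.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"


def zone_label_patches(current_labels, expected_zones):
    """Return {node_name: patch_body} for nodes whose zone label has drifted.

    current_labels: {node_name: {label_key: value}} as read from the cluster.
    expected_zones: {node_name: zone} that the test wants to simulate.
    """
    patches = {}
    for node, zone in expected_zones.items():
        actual = current_labels.get(node, {}).get(ZONE_LABEL)
        if actual != zone:
            # Patch body in the shape a node label update would use.
            patches[node] = {"metadata": {"labels": {ZONE_LABEL: zone}}}
    return patches


current = {
    "node-1": {ZONE_LABEL: "us-central1-a"},  # overwritten by GKE
    "node-2": {ZONE_LABEL: "lh-zone2"},       # still as the test expects
}
expected = {"node-1": "lh-zone1", "node-2": "lh-zone2"}
print(zone_label_patches(current, expected))
```

A test helper could call this periodically during a run and apply each returned patch to the corresponding node, so that the simulated zones survive GKE's updates.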

Below is the full PoC test result using 1.28.7-gke.1026000, COS_CONTAINERD cos-109-17800-66-78, S3 as the backup store.

= 355 passed, 19 skipped, 348 warnings in 53969.99s (14:59:29) =

Below is the core test result using 1.28.7-gke.1026000, COS_CONTAINERD cos-109-17800-66-78, NFS as the backup store.

= 64 passed, 310 deselected, 44 warnings in 7587.32s (2:06:27) =
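
As a quick sanity check on the two summary lines above, this small sketch parses the counts and confirms that the reported seconds match the hh:mm:ss values. The helper names are made up for illustration.

```python
import re


def parse_pytest_summary(line):
    """Return ({outcome: count}, seconds) from a pytest summary line."""
    counts_part = line.split(" in ")[0]
    counts = {name: int(num) for num, name in re.findall(r"(\d+) (\w+)", counts_part)}
    seconds = float(re.search(r"in ([\d.]+)s", line).group(1))
    return counts, seconds


def as_hms(seconds):
    """Format a duration in seconds as h:mm:ss, dropping the fraction."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


full = "= 355 passed, 19 skipped, 348 warnings in 53969.99s (14:59:29) ="
core = "= 64 passed, 310 deselected, 44 warnings in 7587.32s (2:06:27) ="
for line in (full, core):
    counts, seconds = parse_pytest_summary(line)
    print(counts, as_hms(seconds))
```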

@c3y1huang c3y1huang added require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated and removed investigation-needed Need to identify the case before estimating and starting the development labels Mar 18, 2024
@longhorn-io-github-bot

longhorn-io-github-bot commented Mar 19, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at: feat(platform): support GKE Container-Optimized OS #8196
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area platform
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at doc(1.7.0): support GKE Container-Optimized OS website#884

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at test(integration/manual): support GKE Container-Optimized OS longhorn-tests#1819
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at test(integration/manual): support GKE Container-Optimized OS longhorn-tests#1819

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@chriscchien
Contributor

Hi @c3y1huang, test case test_replica_auto_balance_node_duplicates_in_multiple_zones failed on the master pipeline after longhorn/longhorn-tests#1819. Could you take a look at that? Thank you.

@c3y1huang
Contributor

Hi @c3y1huang, test case test_replica_auto_balance_node_duplicates_in_multiple_zones failed on the master pipeline after longhorn/longhorn-tests#1819. Could you take a look at that? Thank you.

Thank you @chriscchien. This will be fixed in longhorn/longhorn-tests#1849.

@yangchiu
Member

After running kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/prerequisite/longhorn-gke-cos-node-agent.yaml on cos_containerd, there are some error messages in the log:

$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS        AGE
longhorn-gke-cos-node-agent-5n9xx     1/1     Running   1 (7h43m ago)   7h44m
longhorn-gke-cos-node-agent-8r7tx     1/1     Running   1 (7h43m ago)   7h44m
longhorn-gke-cos-node-agent-f28vz     1/1     Running   1 (7h43m ago)   7h44m

$ kubectl logs longhorn-gke-cos-node-agent-5n9xx
...
(23/24) Installing: systemd-249.17-150400.8.40.1.x86_64 [.......
Creating group systemd-journal with gid 485.
Creating group systemd-network with gid 484.
Creating user systemd-network (systemd Network Management) with uid 484 and gid 484.
Creating group systemd-timesync with gid 483.
Creating user systemd-timesync (systemd Time Synchronization) with uid 483 and gid 483.
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
/usr/lib/tmpfiles.d/journal-nocow.conf:26: Failed to resolve specifier: uninitialized /etc detected, skipping
All rules containing unresolvable specifiers will be skipped.
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
done]
(24/24) Installing: open-iscsi-2.1.9-150500.46.3.1.x86_64 [..
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down

Not sure if there's any issue here, but longhorn can be installed without problem. cc @c3y1huang

@c3y1huang
Contributor

c3y1huang commented Apr 25, 2024

Not sure if there's any issue here, but longhorn can be installed without problem.

When iscsid is installed via a package manager inside a container, the package manager tries to start the iSCSI services through systemd. That fails because the container environment doesn't fully support systemd, so the script starts the daemon manually instead of relying on systemd. Errors of this kind should therefore be safe to ignore in this context.
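
To make it easier to see whether anything else shows up in the node-agent log, here is a minimal sketch that separates the two systemd messages from the remaining lines. The function name is hypothetical, and the message list only covers the two lines quoted from the log in this thread.

```python
# The two messages from the log above that the explanation covers.
EXPECTED_SYSTEMD_MESSAGES = (
    "System has not been booted with systemd as init system (PID 1). Can't operate.",
    "Failed to connect to bus: Host is down",
)


def split_systemd_noise(log_text):
    """Return (count of expected systemd messages, all other non-empty lines)."""
    noise_count = 0
    remaining = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped in EXPECTED_SYSTEMD_MESSAGES:
            noise_count += 1
        else:
            remaining.append(stripped)
    return noise_count, remaining


sample_log = """\
(24/24) Installing: open-iscsi-2.1.9-150500.46.3.1.x86_64 [..
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
"""
count, others = split_systemd_noise(sample_log)
print(count, others)
```

The remaining lines still need a human look; the sketch only removes the messages that are already explained.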

@yangchiu
Copy link
Member

Verified as passed on master-head (longhorn-manager f63611c), following the document to set up Longhorn on cos_containerd. Longhorn installed without problems, and all core test cases passed when run on it.

@innobead
Member Author

innobead commented Apr 26, 2024

Not sure if there's any issue here, but longhorn can be installed without problem.

When iscsid is installed via a package manager inside a container, the package manager tries to start the iSCSI services through systemd. That fails because the container environment doesn't fully support systemd, so the script starts the daemon manually instead of relying on systemd. Errors of this kind should therefore be safe to ignore in this context.

@c3y1huang can we mention this error as an expected warning in the doc? to prevent any confusion or later explanation?

@c3y1huang
Contributor

c3y1huang commented Apr 26, 2024

@c3y1huang can we mention this error as an expected warning in the doc? to prevent any confusion or later explanation?

Sure! Doc updated.


5 participants