Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for arm images for hyperkube, kubeadm and cni_binary #4261

Open
wants to merge 8 commits into
base: master
from

Conversation

Projects
None yet
7 participants
@lwolf
Copy link

commented Feb 17, 2019

This adds support for arm checksums for hyperkube, kubeadm and cni images.

Related to #4065

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Feb 17, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lwolf
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: woopstar

If they are not already assigned, you can assign the PR to them by writing /assign @woopstar in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from Miouge1 and mirwan Feb 17, 2019

@k8s-ci-robot k8s-ci-robot added the size/M label Feb 17, 2019

@Miouge1

This comment has been minimized.

Copy link
Member

commented Feb 18, 2019

@lwolf thank you for your contribution. Great to see more ARM support, do you mind sharing the tests you did around this? Which HW platform did you use etc...?

ci check this
/lgtm

@lwolf

This comment has been minimized.

Copy link
Author

commented Feb 18, 2019

As a test, I run an upgrade of my current multiarch cluster from v1.12.3 to v1.12.5 with changes from this PR and #4176 backported to the 2.8 branch.

Arm nodes are based on https://www.hardkernel.com/shop/odroid-hc1-home-cloud-one/

$ uname -a
Linux odroid-hc-01 4.14.78-150 #1 SMP PREEMPT Tue Oct 23 10:43:36 -03 2018 armv7l armv7l armv7l GNU/Linux

To be clear, this PR is not enough to install kubespray on arm nodes, but it reduces number of hacks needed to do that.

Let me know if there are any specific tests I can run

@Miouge1

This comment has been minimized.

Copy link
Member

commented Feb 18, 2019

@lwolf if you can, it would be good to make an issue tracking all hacks necessary for armv7l support, kind of like #2551 is for armv8/arm64.

Also adding an arm hash for etcd_binary_checksums would be nice.

@lwolf

This comment has been minimized.

Copy link
Author

commented Feb 18, 2019

I'll run more or less clean deploy in the next few days and try to collect all the issues I'll have in a ticket.

Regarding etcd, unfortunately there is no arm build available.

@laimison

This comment has been minimized.

Copy link

commented Feb 20, 2019

There is a preview/unofficial version of Debian Buster for Raspberry which enables arm64
https://wiki.debian.org/RaspberryPi3

@lwolf do you think it can reduce the issues that you have faced using arm64 instead of 32 bit?

@lwolf

This comment has been minimized.

Copy link
Author

commented Feb 20, 2019

@laimison I'm not sure how exactly arm64 debian will help in arm32 setup.

@b23prodtm

This comment has been minimized.

Copy link

commented Feb 21, 2019

There is an issue #4259 opened also for etcd arm (armv7l , "32 bits") binary support.
Actually Pi 3 is a 64 bits capable CPU which wouldn't come enabled with "stretch" but "buster" as mentioned @laimison

@lwolf

This comment has been minimized.

Copy link
Author

commented Feb 21, 2019

So, if I'm not mistaken, RPI3 (actually a 64bit CPU) is being wrongly identified in older OS/kernel, which will be fixed in newer version of OS.
But most of the real armv7l devices are 32-bit, and require binaries for arm32.
So it doesn't help in the current context.

@b23prodtm thanks for linking related issue

@laimison

This comment has been minimized.

Copy link

commented Feb 21, 2019

I have mentioned 64 bit solution because etcd binary is available at https://github.com/etcd-io/etcd/releases/download/v3.3.12/etcd-v3.3.12-linux-arm64.tar.gz and Raspberry itself is capable to run Buster. So the question is whether other Kubernetes binaries are missed and if Buster is ready/worth to try. Am I missing something especially about that etcd binary?

@lwolf

This comment has been minimized.

Copy link
Author

commented Feb 22, 2019

@laimison It seems that you have an assumption that Raspberry PI 3 is the only existing ARM hardware, or at least the only one, people will use with kubespray, therefore upgrading it to Buster and running arm64 binaries will solve all the problems.

This PR is about adding support for truly 32 bit ARM hardware, therefore, the only way I see to solve the issue with etcd is by compiling it accordingly.

Let me know if I misunderstood or missing something.

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Feb 22, 2019

When I try to run in on the worker node:

    "downloads": "VARIABLE IS NOT DEFINED!: 'dict object' has no attribute u'arm'"

Somewhere in kubespray/extra_playbooks/roles/download/defaults/main.yml the arm is missing

@lwolf lwolf referenced this pull request Feb 22, 2019

Open

Track the gaps when porting to ARM (arm7l) #4294

0 of 5 tasks complete
@Miouge1

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

@nmiculinic is this on a kube-node without any kube-master or etcd role?

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

yes, that's right. Adding dummy etcd for arm kinda fixes things (( at least this bug for me ))

@Miouge1

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

@lwolf is it possible to add something (document or an assert) to gracefully fail if someone tries to do etcd on arm32?

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

When I run this playbook. For arm etcd I put some dummy hash so the dict doesn't crash and burn.


TASK [kubernetes/kubeadm : Join to cluster] ****************************************************************************
skipping: [ip-10-100-60-75.eu-west-1.compute.internal]
fatal: [bbb-test]: FAILED! => {"ansible_job_id": "848543574878.27806", "changed": true, "cmd": ["/usr/local/bin/kubeadm", "join", "--config", "/etc/kubernetes/kubeadm-client.conf", "--ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests"], "delta": "0:00:05.137759", "end": "2019-02-25 08:24:26.577483", "finished": 1, "msg": "non-zero return code", "rc": 2, "start": "2019-02-25 08:24:21.439724", "stderr": "\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty\n[preflight] Some fatal errors occurred:\n\t[ERROR Port-10250]: Port 10250 is in use\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "stderr_lines": ["\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty", "[preflight] Some fatal errors occurred:", "\t[ERROR Port-10250]: Port 10250 is in use", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`"], "stdout": "[preflight] Running pre-flight checks", "stdout_lines": ["[preflight] Running pre-flight checks"]}

TASK [kubernetes/kubeadm : Join to cluster with ignores] ***************************************************************
fatal: [bbb-test]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}

TASK [kubernetes/kubeadm : Display kubeadm join stderr if any] *********************************************************
fatal: [bbb-test]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stderr_lines'\n\nThe error appears to have been in '/Users/lpp/Desktop/ascalia/bb/kubespray/roles/kubernetes/kubeadm/tasks/main.yml': line 95, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n    - name: Display kubeadm join stderr if any\n      ^ here\n"}
skipping: [ip-10-100-60-75.eu-west-1.compute.internal]
@lwolf

This comment has been minimized.

Copy link
Author

commented Feb 25, 2019

@Miouge1 I was thinking about adding a dummy value, but assert sounds like a better option. Will update the PR

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

This big error I c/p is result of armv7 having "\n" at the end and misconfiguring kubelet systemd service.

@lwolf

This comment has been minimized.

Copy link
Author

commented Feb 25, 2019

@nmiculinic yeah, I mentioned this issue and workaround I use in #4065 (comment)
but I feel like it is too "hacky" to put it into upstream.
Didn't have time yet to find more elegant solution

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

Oh I missed it, I had a pause of a couple of weeks while I was doing other stuff, and now I'm back at this.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Feb 26, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Feb 26, 2019

New changes are detected. LGTM label has been removed.

@lwolf

This comment has been minimized.

Copy link
Author

commented Apr 17, 2019

@nmiculinic the main block for me was in the CI. I'll update the PR with the latest changes from master and new checksums.

@lwolf

This comment has been minimized.

Copy link
Author

commented Apr 17, 2019

@nmiculinic I merged current master into the branch and new checksums.

@Miouge1

This comment has been minimized.

Copy link
Member

commented Apr 17, 2019

@lwolf ansible-lint was broken until #4561 can you merge that in?

@lwolf

This comment has been minimized.

Copy link
Author

commented Apr 17, 2019

@Miouge1 merged, but seems there some problems in gitlab with dns resolving

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Apr 19, 2019

I'm having problems with this PR, commit f689be4:

Full ansible playbook with -vvv:

ansible.log

k8s-cluster:

dashboard_enabled: false
kube_api_anonymous_auth: true

## Change this to use another Kubernetes version, e.g. a current beta release
kube_version: v1.13.4

# kubernetes image repo define
kube_image_repo: "gcr.io/google-containers"

kube_network_plugin: cni

# Can be ipvs, iptables
kube_proxy_mode: iptables

cluster_name: bbb.local

## Container runtime
## docker for docker and crio for cri-o.
container_manager: docker
deploy_container_engine: false
download:
  container: false

kubelet_node_custom_flags: 
  - --container-runtime=remote
  - --container-runtime-endpoint=unix:///run/containerd/containerd.sock

# pod security policy (RBAC must be enabled either by having 'RBAC' in authorization_modes or kubeadm enabled)
podsecuritypolicy_enabled: true

# Make a copy of kubeconfig on the host that runs Ansible in {{ inventory_dir }}/artifacts
kubeconfig_localhost: true

kubelet_preferred_address_types: "InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP"

etcd_deployment_type: host
kubelet_deployment_type: host

kubeadm_enabled: true

kube_pods_subnet: 10.101.0.0/16

loadbalancer_apiserver:
  port: 443

Specific variables for ads0903:

# download: download_container.yml
download_container: False

pod_infra_image_repo: k8s.gcr.io/pause
pod_infra_image_tag: "3.1"

container_manager: docker
docker_version: "latest"
ignore_assert_errors: true  # To ignore min RAM requirement

kubelet_node_custom_flags: # less housekeeping for bbb, this one is big performance/accuracry tradeoff
  - --housekeeping-interval=10m
  - --container-runtime=remote
  - --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock

# manually configure it for fuck sake
resolvconf_mode: none
# resolvconffile: /etc/systemd/resolved.conf.d/k8s.conf
@Miouge1

This comment has been minimized.

Copy link
Member

commented Apr 19, 2019

@nmiculinic it looks like you have ansible_architecture: "armdv7\n" somewhere, is that possible?

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Apr 19, 2019

I have patch on the PC; will post after Easter; .dunno why but ansible generates that \n somewhere and it propagates .

I trim it (( via set fact )) and then later problems down the road. .

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2019

Ok, I've debugged things a bit further. I'm not sure what's the problem but I get another error this patch solves:

diff --git a/roles/download/defaults/main.yml b/roles/download/defaults/main.yml
index a96e4e624..522a51662 100644
--- a/roles/download/defaults/main.yml
+++ b/roles/download/defaults/main.yml
@@ -174,6 +174,9 @@ cni_binary_checksums:
   arm64: 016bbc989877e35e3cd49fafe11415fb2717e52c74fde6b1650411154cb91b81
   amd64: f04339a21b8edf76d415e7f17b620e63b8f37a76b2f706671587ab6464411f2d
 calicoctl_binary_checksums:
+  arm:
+    v3.5.4: 0
+    v3.4.4: 0
   amd64:
     v3.5.4: 197194b838cc2a9a7455c2ebd5505a5e24f8f3d994eb75c17f5dd568944100b8
     v3.4.4: 93bd084e053cf1bf3b7fef369677bd6767c30fe7135e2c7e044e31693422ef61
@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2019

I'm not even using calico in any way shape or form yet this crashes the ansible playbook

@lwolf

This comment has been minimized.

Copy link
Author

commented Apr 23, 2019

thanks @nmiculinic for trying it out.
I was upgrading my cluster from 2.8 to 2.9 + this patch and encountered the issue with the calico but in the other place.
upgrade was failing because of calico_policy which I had to disable.

calico_policy:
     enabled: "{{ enable_network_policy or kube_network_plugin == 'canal' }}"

https://github.com/kubernetes-sigs/kubespray/blob/master/roles/download/defaults/main.yml#L416

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2019

@nmiculinic it looks like you have ansible_architecture: "armdv7\n" somewhere, is that possible?

It was probably some stale ansible caching since I cannot replicate it anymore

@nmiculinic

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2019

I've managed to install it now. (( only had to apply patched I provided few comments earlier ))

@lwolf

This comment has been minimized.

Copy link
Author

commented Apr 23, 2019

Thanks, I added dummy checksums for the calicoctl_binary_checksums

@Miouge1

This comment has been minimized.

Copy link
Member

commented Apr 23, 2019

CI is broken in master, once it's fixed in master you can rebase to get CI to run.

lwolf added some commits Feb 17, 2019

Add dummy etcd checksum for arm
This commit adds dummy etcd checksum for arm to avoid "no attribute" error
during setup.

@lwolf lwolf force-pushed the lwolf:arm branch 2 times, most recently from 1295039 to ab12c34 Apr 23, 2019

@lwolf

This comment has been minimized.

Copy link
Author

commented Apr 24, 2019

what's the right way to restart failed builds (for not code related reasons) - amend last commit or ask somebody to trigger restart?

@lwolf lwolf force-pushed the lwolf:arm branch from ab12c34 to 577eb4c Apr 24, 2019

@lwolf

This comment has been minimized.

Copy link
Author

commented Apr 24, 2019

@Miouge1 could you please advice.
I can't make architecture check pass the tests.

If I set gather_facts: True it fails with python not found
https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/201813766

If I set gather_facts: False it fails with ansible_architecture is undefined.
https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/201820403

@Miouge1

This comment has been minimized.

Copy link
Member

commented Apr 27, 2019

@lwolf special access is required to Retry CI jobs, otherwise an amend and force push can retry the whole pipeline.

As we are trying to improve CI coverage, there has been significant changes in CI in the past few weeks.

@lwolf

This comment has been minimized.

Copy link
Author

commented Apr 27, 2019

I see, thanks. I wasn't sure that amending is the right way to do it.
I removed the new architecture check that was causing build to fail, and now it seems fine.
Is there anything else that needs to be done to get this merged?

@woopstar

This comment has been minimized.

Copy link
Member

commented May 1, 2019

@Miouge1 what do you think of this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.