Failing download. #10592

Closed
dgarner-cg opened this issue Nov 3, 2023 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dgarner-cg

dgarner-cg commented Nov 3, 2023

I have tried everything I can think of to resolve this issue for over a week.
It's getting frustrating.
I've tried with the newest version of everything involved (Ansible, Kubespray, etc.) on the standard OS install, and alternatively in a venv with the versions pinned in requirements.txt.
I've tried to rule out every troubleshooting option before posting, and it always comes down to the same Download section.
I'm seeing a lot of output I haven't seen before with this inventory, but I have to run to the pharmacy before it closes and wanted to post right away. This has actually been an issue for more like 3 weeks; I've only focused on it consistently for the last week.

fyi..

• pve-cos-pri: Outside server not involved in the k8s cluster; this can be considered the primary server of the network.
• pve-k8s-...: the cluster nodes.
• Primary network subnet: 10.0.0.0/24
• "DHCP" slots for k8s: 10.0.0.82 - 88
• The DNS subdomain for the k8s cluster is separate and lives on the mgmt.sub.domain.tld portion.
• All machines have working DHCP and DNS resolution, and curl ifconfig.me works properly on each.

I can't think of much else, outside of the process itself, that it could be or that I should run through. I'm moving on to other things for now and will be back later.

Thanks guys,

Environment:
Local Baremetal Proxmox,
Dual Socket Xeon Gold 6148 80 Core with 256 GB RAM.

  • OS:
    Control Server: Debian GNU/Linux 12 (Bookworm) Linux 6.1.0-13-amd64 x86_64

7 Node K8s Cluster, all the same.

  • Version of Ansible (ansible --version): 2.14.11

  • Version of Python (python --version): Python 3.11.2

Kubespray version (commit) (git rev-parse --short HEAD):
22f58a5

Network plugin used:
Calico

Full inventory with variables:

https://gist.github.com/dgarner-cg/c5ea336fdc78b369145cf52cd075dfee

Command used to invoke ansible:

ansible-playbook \
  -i inventory/k8-mg/hosts.yaml \
  --private-key=~/.ssh/id_rsa \
  -u root \
  --become \
  cluster.yml

Output of ansible run:

https://gist.github.com/dgarner-cg/3f57fe502a970ead3529ac7fd836b043

Anything else do we need to know:
I would look into why the following error is being thrown, as it may be a separate issue, but I have to run out before the pharmacy closes.

https://gist.github.com/dgarner-cg/d055057c89634705e8366b14208c5223

dgarner-cg added the kind/bug label Nov 3, 2023
@dgarner-cg
Author

dgarner-cg commented Nov 4, 2023

This is insanely frustrating.

I've added the following to /roles/.../download_file.yml

- name: Download_file | Download item
  block:
    - name: Download file
      get_url:
        url: "{{ valid_mirror_urls | random }}"
        dest: "{{ file_path_cached if download_force_cache else download.dest }}"
        owner: "{{ omit if download_localhost else (download.owner | default(omit)) }}"
        mode: "{{ omit if download_localhost else (download.mode | default(omit)) }}"
        checksum: "{{ 'sha256:' + download.sha256 if download.sha256 else omit }}"
        validate_certs: "{{ download_validate_certs }}"
        url_username: "{{ download.username | default(omit) }}"
        url_password: "{{ download.password | default(omit) }}"
        force_basic_auth: "{{ download.force_basic_auth | default(omit) }}"
        timeout: "{{ download.timeout | default(omit) }}"
      # Task keywords belong at task level, not under the get_url arguments.
      delegate_to: "{{ download_delegate if download_force_cache else inventory_hostname }}"
      run_once: "{{ download_force_cache }}"
      register: get_url_result
      become: "{{ not download_localhost }}"
      # Retry the download itself; a retry loop on a debug task never re-runs get_url.
      until: "'OK' in get_url_result.msg or 'file already exists' in get_url_result.msg"
      retries: "{{ download_retries }}"
      delay: "{{ retry_stagger | default(5) }}"
      environment: "{{ proxy_env }}"
      no_log: "{{ not (unsafe_show_logs | bool) }}"

  rescue:
    # Runs only after the download task has exhausted its retries.
    - name: Handle download errors
      fail:
        msg: "Download failed: {{ get_url_result.msg | default('no message returned') }}"

  always:
    # Runs whether the download succeeded or failed.
    - name: Print results
      debug:
        var: get_url_result

And below is further output ...

https://gist.github.com/dgarner-cg/064541f36bbac6b3ea49590f759989b0

@dgarner-cg
Author

Bro, same ish on all Ubuntu systems.. what the f.

@FaraSys

FaraSys commented Nov 11, 2023

Hi.
I have the same issue on Ubuntu 22.04 LTS
Kubespray Release 2.23.1

@dgarner-cg
Author

I am .. making progress, I have literally been working on this for a week.

@arusa

arusa commented Nov 19, 2023

I'm seeing very similar problems; DNS issues come up all the time. I'm currently trying to add a new node and a control-plane node using cluster.yml and scale.yml, and it always ends with the servers unable to download anything: Kubespray has already updated their /etc/systemd/resolved.conf to resolve via CoreDNS, but they don't have access to CoreDNS yet :(

I'm very happy that I'm only running a test cluster.

@dgarner-cg
Author

Thanks for your feedback. I want to say that all the nodes are reachable via valid DNS, but I will check. I know my outside-of-cluster installer controller and the 2 k8s controller nodes all have valid DNS from here to Google, but I also had no idea about the CoreDNS issue.

As I have time, I'm working to make sure Cilium is used across all files and to use a local repo, but work has picked up going into the holidays; I just got off a 7-straight-week, 24/7 on-call stretch. :D

I will take a look at this again in a few moments and hope to knock it out.

@arusa

arusa commented Nov 19, 2023

I finally managed to add the new node. When I saw in the Ansible output that it had just updated the /etc/systemd/resolved.conf file, I quickly opened it on the new node and changed the line:

DNS=10.233.0...

to

DNS=1.1.1.1

and ran:

systemctl restart systemd-resolved.service

This way the node managed to finish all downloads executed by ansible. And in the end the resolved.conf was already changed back to use the coredns service as a resolver.

Btw. I also had to set enable_nodelocaldns to false yesterday, because I had a similar resolving problem while rolling out some changes using kubespray. At one point the nodes couldn't resolve anything because the nodelocaldns iptables rules probably weren't ready.

So DNS feels generally very fragile with Kubespray.
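
The manual edit described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: set_resolved_dns is a hypothetical function name, the file path and the 1.1.1.1 resolver are the ones mentioned in this comment, and the helper only rewrites an uncommented DNS= line. It has to be run on the affected node while the playbook is in its download phase.

```shell
#!/bin/sh
# Sketch of the manual workaround: point systemd-resolved at a public
# resolver while the cluster DNS service is not reachable yet.
# set_resolved_dns is a hypothetical helper name, not part of Kubespray.

set_resolved_dns() {
    conf_file=$1    # e.g. /etc/systemd/resolved.conf
    dns_server=$2   # e.g. 1.1.1.1
    tmp_file=$(mktemp) || return 1

    # Rewrite every uncommented "DNS=" line; all other lines pass through.
    if ! sed "s/^DNS=.*/DNS=${dns_server}/" "$conf_file" > "$tmp_file"; then
        rm -f "$tmp_file"
        return 1
    fi

    # Overwrite in place so the original file's owner and mode are kept.
    cat "$tmp_file" > "$conf_file"
    status=$?
    rm -f "$tmp_file"
    return "$status"
}

# Usage on the affected node (as root), run by hand:
#   set_resolved_dns /etc/systemd/resolved.conf 1.1.1.1
#   systemctl restart systemd-resolved.service
```

Per the report above, Kubespray had already switched resolved.conf back to the cluster resolver by the end of the run, so no manual revert step is shown.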

@marvin0815

No idea if it's new but GitHub now gives a 401 Forbidden for me when validating mirrors in Kubespray

See:
curl -vJL -X HEAD https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.24.0/crictl-v1.24.0-linux-amd64.tar.gz

@marvin0815

I get a 200 again today. I still modified the download role to check with GET instead of HEAD in order to deploy.

diff --git a/roles/download/tasks/download_file.yml b/roles/download/tasks/download_file.yml
index 376a15e8a..88f83c8cb 100644
--- a/roles/download/tasks/download_file.yml
+++ b/roles/download/tasks/download_file.yml
@@ -55,7 +55,7 @@
   - name: download_file | Validate mirrors
     uri:
       url: "{{ mirror }}"
-      method: HEAD
+      method: GET
       validate_certs: "{{ download_validate_certs }}"
       url_username: "{{ download.username | default(omit) }}"
       url_password: "{{ download.password | default(omit) }}"

Just in case it's a random bug in GitHub's cache system or something.
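
To check whether a mirror answers HEAD and GET differently, the two status codes can be compared side by side. The following is a rough sketch, not part of Kubespray: http_status and describe_mirror are hypothetical helper names, and the HEAD request uses curl's --head option rather than -X HEAD, since -X only changes the method name and curl may then wait for a body that never arrives.

```shell
#!/bin/sh
# Compare the status a mirror returns for HEAD and for GET.
# http_status and describe_mirror are hypothetical helper names.

# Print the final HTTP status code (after redirects) for METHOD against URL.
http_status() {
    method=$1
    url=$2
    if [ "$method" = "HEAD" ]; then
        # --head sends a real HEAD request and expects no response body.
        curl --silent --location --head --output /dev/null \
             --write-out '%{http_code}' "$url"
    else
        # A plain GET; note that this downloads the whole file to /dev/null.
        curl --silent --location --output /dev/null \
             --write-out '%{http_code}' "$url"
    fi
}

# Turn a pair of status codes into a one-line diagnosis.
describe_mirror() {
    head_code=$1
    get_code=$2
    if [ "$head_code" = "$get_code" ]; then
        echo "consistent: HEAD and GET both returned $head_code"
    elif [ "$get_code" = "200" ]; then
        echo "HEAD-only failure: HEAD returned $head_code but GET returned 200"
    else
        echo "mismatch: HEAD returned $head_code, GET returned $get_code"
    fi
}

# Usage (needs network access), with the URL from the comment above:
#   url=https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.24.0/crictl-v1.24.0-linux-amd64.tar.gz
#   describe_mirror "$(http_status HEAD "$url")" "$(http_status GET "$url")"
```

A "HEAD-only failure" result would point at the validation method rather than at the mirror itself, which is the case the GET change above works around.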

@mdbudnick

mdbudnick commented Dec 15, 2023

It looks like I am having a similar issue and this is my output when running with the block: and outputting get_url_result:

ok: [workernode-3] => {
    "get_url_result": {
        "attempts": 4,
        "changed": false,
        "checksum_dest": null,
        "checksum_src": "d11d2f438da1892c8b1bdfc638ddb6764dbd0e2c",
        "dest": "/tmp/releases/runc-v1.1.9.arm64",
        "elapsed": 0,
        "failed": true,
        "msg": "Destination /tmp/releases does not exist",
        "src": "/home/mb/.ansible/tmp/ansible-tmp-1702614795.2398012-24550-25178834028643/tmpr_hbccf9",
        "url": "https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.arm64"
    }
}

Please note: "Destination /tmp/releases does not exist" is not the issue as it fails with the same msg after adding an explicit file task before to create the directory.

Edit: There is no checksum issue.

I will try v2.22.1 and other versions and investigate the difference if I get it to work.

@mdbudnick

Nevermind me, TIL --check has major limitations. This is my first Ansible playbook outside of tutorials, in my defense.

@VannTen
Contributor

VannTen commented Jan 16, 2024

Kubespray version (commit) (git rev-parse --short HEAD):
22f58a5

I can't find this commit in the repository.
From your gist

{{ etcd_supported_versions[kube_major_version] }}: 'dict object' has no attribute 'v1.28'. 'dict object' has no attribute 'v1.28'. {{ etcd_supported_versions[kube_major_version] }}: 'dict object' has no attribute 'v1.28'. 'dict object' has no attribute 'v1.28'\n\nThe error appears to be in '/etc/ansible/usr-playbooks/cg-k8-ctrl/roles/download/tasks/download_file.yml': line 10, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: Download_file | Starting download of file\n ^ here\n"}

It looks like you tried to use Kubernetes 1.28 with a Kubespray version that doesn't support it.
I'm going to close this, feel free to reopen if you actually still encounter a bug
/close
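
A quick way to confirm this kind of mismatch before a run is to look for the version key in the checkout's defaults. The sketch below is an illustration resting on several assumptions: major_version and has_etcd_version are hypothetical helpers, the defaults file path must be supplied by the caller because it differs between Kubespray versions, and the layout of etcd_supported_versions (a top-level key followed by indented vX.Y: entries) is assumed from the error message rather than confirmed here.

```shell
#!/bin/sh
# Check whether a Kubespray defaults file maps a Kubernetes minor release
# to an etcd version. Both helper names are hypothetical.

# Reduce a full version such as v1.28.3 to the vMAJOR.MINOR form (v1.28).
major_version() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed if KEY (e.g. v1.28) appears inside the etcd_supported_versions
# mapping of FILE. Assumes the mapping is a top-level key whose entries
# are indented lines of the form "vX.Y: ...".
has_etcd_version() {
    file=$1
    key=$2
    awk -v key="$key" '
        /^etcd_supported_versions:/ { in_block = 1; next }
        in_block && /^[^ \t]/       { in_block = 0 }   # next top-level key
        in_block && $1 == (key ":") { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Usage, with a placeholder path that has to be adapted to the checkout:
#   has_etcd_version path/to/download/defaults.yml "$(major_version v1.28.3)" \
#     || echo "this checkout has no etcd mapping for that Kubernetes version"
```

If the key is missing, the checkout predates support for that Kubernetes release, and the template lookup quoted above would be expected to fail in the same way.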

@k8s-ci-robot
Contributor

@VannTen: Closing this issue.
