Kargo can't deploy 200 nodes cluster (timeout to quay.io to fetch etcd image) #479

Closed
olegeech-me opened this issue Sep 12, 2016 · 4 comments

@olegeech-me

olegeech-me commented Sep 12, 2016

After several retries it fails with:

TASK [kubernetes/secrets : Check_certs | Set 'sync_certs' to true] *************
2016-09-12T03:19:40.513428 (delta: 16.515082)         elapsed: 4829.598493 **** 
fatal: [node1]: FAILED! => {"failed": true, "msg": "The conditional check '{%- set certs = {'sync': False} -%} {%- for server in play_hosts\n   if (not hostvars[server].kubecert.stat.exists|default(False)) or\n   (hostvars[server].kubecert.stat.checksum|default('') != kubecert_master.stat.checksum|default('')) -%}\n   {%- set _ = certs.update({'sync': True}) -%}\n{%- endfor -%} {{ certs.sync }}' failed. The error was: error while evaluating conditional ({%- set certs = {'sync': False} -%} {%- for server in play_hosts\n   if (not hostvars[server].kubecert.stat.exists|default(False)) or\n   (hostvars[server].kubecert.stat.checksum|default('') != kubecert_master.stat.checksum|default('')) -%}\n   {%- set _ = certs.update({'sync': True}) -%}\n{%- endfor -%} {{ certs.sync }}): 'dict object' has no attribute 'kubecert'\n\nThe error appears to have been in '/home/vagrant/workspace/kargo/roles/kubernetes/secrets/tasks/check-certs.yml': line 25, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: \"Check_certs | Set 'sync_certs' to true\"\n  ^ here\n"}

msg: The conditional check '{%- set certs = {'sync': False} -%} {%- for server in play_hosts
   if (not hostvars[server].kubecert.stat.exists|default(False)) or
   (hostvars[server].kubecert.stat.checksum|default('') != kubecert_master.stat.checksum|default('')) -%}
   {%- set _ = certs.update({'sync': True}) -%}
{%- endfor -%} {{ certs.sync }}' failed. The error was: error while evaluating conditional ({%- set certs = {'sync': False} -%} {%- for server in play_hosts
   if (not hostvars[server].kubecert.stat.exists|default(False)) or
   (hostvars[server].kubecert.stat.checksum|default('') != kubecert_master.stat.checksum|default('')) -%}
   {%- set _ = certs.update({'sync': True}) -%}
{%- endfor -%} {{ certs.sync }}): 'dict object' has no attribute 'kubecert'

The error appears to have been in '/home/vagrant/workspace/kargo/roles/kubernetes/secrets/tasks/check-certs.yml': line 25, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

There are also warnings in the log collection step:

 ssh -A -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null vagrant@10.3.56.3 ADMIN_USER=vagrant LOG_LEVEL= ADMIN_WORKSPACE=workspace collect_logs.sh
Warning: Permanently added '10.3.56.3' (ECDSA) to the list of known hosts.
[DEPRECATION WARNING]: Using bare variables is deprecated. Update your 
playbooks so that the environment value uses the full variable syntax 
('{{output.results}}').
This feature will be removed in a future release. 
Deprecation warnings can be disabled by setting deprecation_warnings=False in 
ansible.cfg.
[DEPRECATION WARNING]: Using bare variables is deprecated. Update your 
playbooks so that the environment value uses the full variable syntax 
('{{output.results}}').
This feature will be removed in a future release. 
Deprecation warnings can be disabled by setting deprecation_warnings=False in 
ansible.cfg.

Full Kargo run output:

kargo_run_200.txt.gz

@bogdando
Contributor

Please attach the tarball with the diagnostic info.

@bogdando
Contributor

bogdando commented Sep 12, 2016

The deployment failed to set up the etcd cluster because of a connection timeout while pulling the etcd image from quay.io (the DNS lookup for quay.io against 8.8.8.8 timed out):

fatal: [node161]: FAILED! => {"changed": false, "cmd": ["sh", "-c", "/usr/bin/docker rm -f etcdctl-binarycopy; /usr/bin/docker create --name etcdctl-binarycopy quay.io/coreos/etcd:v3.0.1 && /usr/bin/docker cp etcdctl-binarycopy:/usr/local/bin/etcdctl /usr/local/bin/etcdctl && /usr/bin/docker rm -f etcdctl-binarycopy"], "delta": "0:00:24.512206", "end": "2016-09-11 23:54:11.669654", "failed": true, "rc": 1, "start": "2016-09-11 23:53:47.157448", "stderr": "Error response from daemon: No such container: etcdctl-binarycopy\nUnable to find image 'quay.io/coreos/etcd:v3.0.1' locally\nError response from daemon: Get https://quay.io/v2/coreos/etcd/manifests/v3.0.1: Get https://quay.io/v2/auth?scope=repository%3Acoreos%2Fetcd%3Apull&service=quay.io: dial tcp: lookup quay.io on 8.8.8.8:53: read udp 10.3.59.55:44707->8.8.8.8:53: i/o timeout", "stdout": "", "stdout_lines": [], "warnings": []}

It would be nice to have retries for such tasks.

@bogdando changed the title from "Kargo can't deploy 200 nodes cluster" to "Kargo can't deploy 200 nodes cluster (timeout to quay.io to fetch etcd image)" on Sep 12, 2016
@olegeech-me
Author

Sure, here's the tarball from node1:
logs.tar.gz

@bogdando
Contributor

So, I'll implement retries to mitigate intermittent download issues.
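
For illustration only, a retry on the failing task could look roughly like the sketch below, using Ansible's standard `register` / `until` / `retries` / `delay` keywords. This is not the actual Kargo change: the task name and the `etcdctl_copy` variable are hypothetical, the retry count and delay are arbitrary, and the command is simply the one from the failed task output above.

```yaml
# Hypothetical sketch, not the real Kargo task.
- name: Copy etcdctl binary from the etcd image (retried on failure)
  shell: >
    /usr/bin/docker rm -f etcdctl-binarycopy;
    /usr/bin/docker create --name etcdctl-binarycopy quay.io/coreos/etcd:v3.0.1 &&
    /usr/bin/docker cp etcdctl-binarycopy:/usr/local/bin/etcdctl /usr/local/bin/etcdctl &&
    /usr/bin/docker rm -f etcdctl-binarycopy
  register: etcdctl_copy          # hypothetical variable name
  until: etcdctl_copy.rc == 0     # repeat until the whole command chain succeeds
  retries: 4                      # arbitrary; tune for the environment
  delay: 10                       # seconds to wait between attempts (arbitrary)
```

With a loop like this, a transient DNS or registry timeout on a single node would trigger another attempt after the delay, and the task would fail only once all retries are used up.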
