Nomad unable to launch on EC2 graviton instance types #7989

shoenig · 2020-05-16T15:10:24Z

We go through the trouble of shipping a table of CPU performance data for EC2 types as of v0.11.2, so we should be able to launch here. It looks like the standard CPU fingerprinter causes an error on graviton (ARM) instances because no MHz information is available. (On AMD/Intel, the data is available but meaningless, and is discarded.)

ubuntu@ip-172-31-17-121:~$ curl -s http://169.254.169.254/latest/meta-data/instance-type
a1.medium

ubuntu@ip-172-31-17-121:~$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          1
On-line CPU(s) list:             0
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           3
Model name:                      Cortex-A72
Stepping:                        r0p3
BogoMIPS:                        166.66
L1d cache:                       32 KiB
L1i cache:                       48 KiB
L2 cache:                        2 MiB
NUMA node0 CPU(s):               0
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Branch predictor hardening
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

ubuntu@ip-172-31-17-121:~$ ./nomad version 
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

ubuntu@ip-172-31-17-121:~$ ./nomad agent -dev -log-level=TRACE
==> No configuration files loaded
==> Starting Nomad agent...
==> Error starting agent: client setup failed: fingerprinting failed: cannot detect cpu total compute. CPU compute must be set manually using the client config option "cpu_total_compute"
    2020-05-16T14:58:28.830Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=
    2020-05-16T14:58:28.830Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
    2020-05-16T14:58:28.835Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:127.0.0.1:4647 Address:127.0.0.1:4647}]"
    2020-05-16T14:58:28.836Z [INFO]  nomad: serf: EventMemberJoin: ip-172-31-17-121.global 127.0.0.1
    2020-05-16T14:58:28.836Z [INFO]  nomad: starting scheduling worker(s): num_workers=1 schedulers=[service, batch, system, _core]
    2020-05-16T14:58:28.836Z [INFO]  client: using state directory: state_dir=/tmp/NomadClient618123295
    2020-05-16T14:58:28.837Z [INFO]  client: using alloc directory: alloc_dir=/tmp/NomadClient800920818
    2020-05-16T14:58:28.842Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=[arch, cgroup, consul, cpu, host, memory, network, nomad, signal, storage, vault, env_aws, env_gce]
    2020-05-16T14:58:28.843Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2020-05-16T14:58:28.843Z [INFO]  nomad.raft: entering follower state: follower="Node at 127.0.0.1:4647 [Follower]" leader=
    2020-05-16T14:58:28.843Z [INFO]  nomad: adding server: server="ip-172-31-17-121.global (Addr: 127.0.0.1:4647) (DC: dc1)"
    2020-05-16T14:58:28.843Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup period=15s
    2020-05-16T14:58:28.844Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=1

…etected

Previously, Nomad would fail to start up if the CPU fingerprinter could not detect the CPU total compute (i.e. cores * MHz). This is common on some EC2 instance types (graviton class), where the env_aws fingerprinter will override the detected CPU performance with a more accurate value anyway.

Instead of crashing on startup, have Nomad use a low default for available CPU performance of 1000 ticks (e.g. 1 core * 1 GHz). This enables Nomad to get past the useless CPU fingerprinting on those EC2 instances. The crashing error message is now a log statement suggesting the setting of cpu_total_compute in client config.

Fixes #7989
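As a rough illustration of the fallback described in the commit message above, here is a minimal Go sketch. It is not Nomad's actual code: the names `defaultTotalCompute` and `totalCompute` are hypothetical, and the log wording is an assumption. Only the 1000-tick default, the cores * MHz formula, and the `cpu_total_compute` option name come from this thread.

```go
package main

import (
	"fmt"
	"log"
)

// defaultTotalCompute is the low fallback described above:
// 1000 ticks, i.e. 1 core * 1 GHz. The constant name is hypothetical.
const defaultTotalCompute = 1000

// totalCompute returns cores * mhz in ticks. When that product is not
// usable (for example, no MHz information is reported, as on graviton),
// it logs a hint and returns the fallback instead of failing.
// This helper is an illustrative assumption, not Nomad's real function.
func totalCompute(cores int, mhz float64) int {
	ticks := int(float64(cores) * mhz)
	if ticks <= 0 {
		log.Printf("cannot detect cpu total compute, using fallback of %d; "+
			"consider setting cpu_total_compute in client config", defaultTotalCompute)
		return defaultTotalCompute
	}
	return ticks
}

func main() {
	// Graviton-like case: one core detected, but no MHz information.
	fmt.Println(totalCompute(1, 0))
	// Case where both cores and MHz are detected.
	fmt.Println(totalCompute(4, 2500))
}
```

In this sketch the fallback only keeps the agent from failing at startup; per the commit message, a later fingerprinter such as env_aws is expected to override the value with a more accurate one.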

github-actions · 2022-10-27T02:37:47Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

shoenig added theme/fingerprint theme/environment-aws labels May 16, 2020

tgross mentioned this issue May 21, 2020

e2e: Linux ARM64 test target #7769

Open

schmichael added the theme/platform-arm label Jun 30, 2020

tekacs mentioned this issue Oct 7, 2020

Feature Request: Support for ARM installations hashicorp/terraform-aws-nomad#80

Closed

shoenig mentioned this issue Dec 3, 2020

Fingerprinting fails to detect cpu total compute in arm64 EC2 instances #9511

Closed

shoenig self-assigned this Dec 9, 2020

shoenig mentioned this issue Dec 9, 2020

client/fingerprint/cpu: use fallback total compute value if cpu not detected #9589

Merged

shoenig added this to the 1.0.1 milestone Dec 9, 2020

shoenig closed this as completed in #9589 Dec 9, 2020

github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2022
