Allocation resource usage showing zero for rootless raw_exec driver in v1.7.6 #20285

Open
simon1990zcs opened this issue Apr 4, 2024 · 6 comments

@simon1990zcs

Nomad version

Nomad v1.6.9 (expected behavior) compared with Nomad v1.7.6 (incorrect behavior)

Operating system and Environment details

cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

Issue

In v1.7.6, allocation resource usage shows wrong (zero) values for the raw_exec driver when Nomad is running rootless (NOT running as root). In contrast, it displays properly in v1.6.9.
FYI, I don't know how Nomad behaves as root or under other drivers (Java, Docker, exec, ...), since we can't run as root.

Reproduction steps

I performed the steps below exactly the same way for Nomad v1.6.9 and v1.7.6; Nomad v1.6.9 output the expected values, while Nomad v1.7.6 output wrong (zero) values.

  1. Start a dev Nomad server and client: nomad agent -dev -log-include-location -log-level=debug -bind 0.0.0.0
  2. Run a very simple Python server: nomad job run python-service.hcl. Note: python-service.hcl is shown below.
  3. Once the job is running, display the allocation info by running nomad alloc status <correspondingAllocId>. FYI, this misbehavior is not limited to the nomad alloc status command; it happens across the board, including the usage graph on the UI dashboard and the /v1/metrics API output.

Expected Result

The CPU and memory usage should be displayed properly (non-zero), as shown by Nomad v1.6.9.

nomad/nomad1.6.9 alloc status 94e3e990
ID                  = 94e3e990-e0ff-efb1-0bd2-aca285bfd853
Eval ID             = 901d4f9c
Name                = python-service.python-service-nomad[0]
Node ID             = d25b0ede
Node Name           = txvslgcath004
Job ID              = python-service
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 37s ago
Modified            = 27s ago
Deployment ID       = 01b02560
Deployment Health   = healthy

Allocation Addresses:
Label     Dynamic  Address
*webapp0  yes      127.0.0.1:25915

Task "web-app-nomad" is "running"
Task Resources:
CPU         Memory          Disk     Addresses
29/100 MHz  47 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2024-04-04T10:05:04Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2024-04-04T11:05:04+01:00  Started     Task started by client
2024-04-04T11:05:04+01:00  Task Setup  Building Task Directory
2024-04-04T11:05:04+01:00  Received    Task received by client

Actual Result

As shown in the output below, Nomad v1.7.6 reports 0 for both CPU and memory usage.

nomad/nomad1.7.6 alloc status b427e125-2117-80a4-f02d-f0c730eb7aaa
ID                  = b427e125-2117-80a4-f02d-f0c730eb7aaa
Eval ID             = c9dd3ba9
Name                = python-service.python-service-nomad[0]
Node ID             = e5be6681
Node Name           = txvslgcath004
Job ID              = python-service
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 5m37s ago
Modified            = 5m27s ago
Deployment ID       = 48763ca7
Deployment Health   = healthy

Allocation Addresses:
Label     Dynamic  Address
*webapp0  yes      127.0.0.1:25201

Task "web-app-nomad" is "running"
Task Resources:
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Task Events:
Started At     = 2024-04-04T09:56:27Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2024-04-04T10:56:27+01:00  Started     Task started by client
2024-04-04T10:56:27+01:00  Task Setup  Building Task Directory
2024-04-04T10:56:27+01:00  Received    Task received by client

Job file (if appropriate)

job "python-service" {
  # Specifies the datacenter where this job should be run
  # This can be omitted and it will default to ["*"]
  datacenters = ["*"]

  group "python-service-nomad" {
    network {
      port "webapp0" {}
    }
    
    task "web-app-nomad" {
      driver = "raw_exec"

      config {
        command   = "python"
        args    = ["-m","http.server","${NOMAD_PORT_webapp0}"]
      }
    }
  }
}

Nomad Server logs (if appropriate)

Below are startup logs from both Nomad v1.6.9 and Nomad v1.7.6.
I noticed that cgroups are not detected in v1.7.6; I don't think it's related to the CPU/memory usage info, but it's worth mentioning.

Startup log of Nomad v1.6.9

nomad/nomad1.6.9 agent -dev -log-include-location -log-level=debug -bind 0.0.0.0
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 10.57.20.156:4646; RPC: 10.57.20.156:4647; Serf: 10.57.20.156:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: true
             Log Level: debug
               Node Id: 41b8f952-3c9e-261e-e6c3-41bb032a0ea0
                Region: global (DC: dc1)
                Server: true
               Version: 1.6.9

==> Nomad agent started! Log data will stream in below:

    2024-04-04T11:03:56.642+0100 [INFO]  raft@v1.5.0/api.go:591: nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:8180a835-a887-8df5-ce2f-74638feed68f Address:10.57.20.156:4647}]"
    2024-04-04T11:03:56.642+0100 [INFO]  raft@v1.5.0/raft.go:159: nomad.raft: entering follower state: follower="Node at 10.57.20.156:4647 [Follower]" leader-address= leader-id=
    2024-04-04T11:03:56.642+0100 [INFO]  go-hclog@v1.5.0/stdlog.go:60: nomad: serf: EventMemberJoin: txvslgcath004.global 10.57.20.156
    2024-04-04T11:03:56.642+0100 [INFO]  nomad/server.go:1757: nomad: starting scheduling worker(s): num_workers=8 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=28f9afe0-bd29-0f0e-ec62-b2c8dab6242a index=1 of=8
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=39ae8096-4293-76cf-617c-10560ad95c9b index=2 of=8
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=23e63116-a6f7-b30c-b6ec-c2ed7511a294 index=3 of=8
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=3419dbbc-790a-4b45-680b-6d136557a8bd index=4 of=8
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=17157abf-1a73-7150-2297-66206481f425 index=5 of=8
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=23e63116-a6f7-b30c-b6ec-c2ed7511a294
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=17157abf-1a73-7150-2297-66206481f425
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=39ae8096-4293-76cf-617c-10560ad95c9b
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=8a92f25e-dd71-a624-f1da-e189b997463b index=6 of=8
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=28f9afe0-bd29-0f0e-ec62-b2c8dab6242a
    2024-04-04T11:03:56.643+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=514b0144-8ff1-abcc-2154-de1b38bd8567 index=7 of=8
    2024-04-04T11:03:56.643+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=8b983810-6d40-0e03-89b7-1e5b0a9e6e6a index=8 of=8
    2024-04-04T11:03:56.643+0100 [INFO]  nomad/server.go:1769: nomad: started scheduling worker(s): num_workers=8 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-04-04T11:03:56.643+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=514b0144-8ff1-abcc-2154-de1b38bd8567
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=3419dbbc-790a-4b45-680b-6d136557a8bd
    2024-04-04T11:03:56.643+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=8b983810-6d40-0e03-89b7-1e5b0a9e6e6a
    2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=8a92f25e-dd71-a624-f1da-e189b997463b
    2024-04-04T11:03:56.643+0100 [INFO]  nomad/serf.go:62: nomad: adding server: server="txvslgcath004.global (Addr: 10.57.20.156:4647) (DC: dc1)"
    2024-04-04T11:03:56.643+0100 [DEBUG] nomad/encrypter.go:446: nomad.keyring.replicator: starting encryption key replication
    2024-04-04T11:03:56.645+0100 [DEBUG] docker/driver.go:1885: agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=""
    2024-04-04T11:03:56.645+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2024-04-04T11:03:56.645+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2024-04-04T11:03:56.645+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2024-04-04T11:03:56.645+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2024-04-04T11:03:56.645+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2024-04-04T11:03:56.645+0100 [WARN]  cgutil/cpuset_manager_v1.go:48: client.cpuset.v1: failed to ensure cgroup parent exists; disable cpuset management: error="mkdir /sys/fs/cgroup/cpuset/nomad: permission denied"
    2024-04-04T11:03:56.645+0100 [INFO]  client/client.go:623: client: using state directory: state_dir=/tmp/NomadClient3429186125
    2024-04-04T11:03:56.645+0100 [INFO]  client/client.go:670: client: using alloc directory: alloc_dir=/tmp/NomadClient1804281296
    2024-04-04T11:03:56.645+0100 [INFO]  client/client.go:678: client: using dynamic ports: min=20000 max=32000 reserved=""
    2024-04-04T11:03:56.654+0100 [DEBUG] client/fingerprint_manager.go:83: client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_azure", "env_digitalocean", "env_aws", "env_gce"]
    2024-04-04T11:03:56.667+0100 [INFO]  fingerprint/cgroup_linux.go:43: client.fingerprint_mgr.cgroup: cgroups are available
    2024-04-04T11:03:56.667+0100 [DEBUG] fingerprint/cni.go:31: client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
    2024-04-04T11:03:56.667+0100 [DEBUG] client/fingerprint_manager.go:165: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup initial_period=15s
    2024-04-04T11:03:56.667+0100 [DEBUG] client/fingerprint_manager.go:165: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
    2024-04-04T11:03:56.667+0100 [DEBUG] fingerprint/cpu.go:74: client.fingerprint_mgr.cpu: detected CPU model: name="Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz"
    2024-04-04T11:03:56.667+0100 [DEBUG] fingerprint/cpu.go:92: client.fingerprint_mgr.cpu: detected CPU frequency: mhz=2992
    2024-04-04T11:03:56.667+0100 [DEBUG] fingerprint/cpu.go:110: client.fingerprint_mgr.cpu: detected CPU core count: EXTRA_VALUE_AT_END=8
    2024-04-04T11:03:56.673+0100 [WARN]  fingerprint/landlock.go:34: client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="function not implemented"
    2024-04-04T11:03:56.674+0100 [WARN]  fingerprint/network_linux.go:65: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
    2024-04-04T11:03:56.674+0100 [DEBUG] fingerprint/network_linux.go:22: client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2024-04-04T11:03:56.674+0100 [DEBUG] fingerprint/network.go:98: client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: interface=lo mbits=1000
    2024-04-04T11:03:56.674+0100 [DEBUG] fingerprint/network.go:118: client.fingerprint_mgr.network: detected interface IP: interface=lo IP=127.0.0.1
    2024-04-04T11:03:56.674+0100 [DEBUG] fingerprint/network.go:118: client.fingerprint_mgr.network: detected interface IP: interface=lo IP=::1
    2024-04-04T11:03:56.675+0100 [WARN]  fingerprint/network_linux.go:65: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
    2024-04-04T11:03:56.675+0100 [DEBUG] fingerprint/network_linux.go:22: client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2024-04-04T11:03:56.675+0100 [DEBUG] fingerprint/network.go:147: client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
    2024-04-04T11:03:56.676+0100 [WARN]  fingerprint/plugins_cni.go:53: client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
    2024-04-04T11:03:56.677+0100 [DEBUG] client/fingerprint_manager.go:165: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
    2024-04-04T11:03:56.681+0100 [DEBUG] fingerprint/env_gce.go:110: client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-04-04T11:03:56.681+0100 [DEBUG] fingerprint/env_gce.go:285: client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
    2024-04-04T11:03:56.682+0100 [DEBUG] fingerprint/env_azure.go:95: client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-04-04T11:03:56.683+0100 [DEBUG] fingerprint/env_digitalocean.go:84: client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-04-04T11:03:56.683+0100 [DEBUG] client/fingerprint_manager.go:158: client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cgroup", "cpu", "host", "network", "nomad", "signal", "storage"]
    2024-04-04T11:03:56.683+0100 [INFO]  pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=csi
    2024-04-04T11:03:56.683+0100 [INFO]  pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=driver
    2024-04-04T11:03:56.683+0100 [INFO]  pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=device
    2024-04-04T11:03:56.683+0100 [DEBUG] devicemanager/manager.go:145: client.device_mgr: exiting since there are no device plugins
    2024-04-04T11:03:56.683+0100 [DEBUG] pluginmanager/group.go:70: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2024-04-04T11:03:56.683+0100 [DEBUG] pluginmanager/group.go:70: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2024-04-04T11:03:56.683+0100 [DEBUG] pluginmanager/group.go:74: client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2024-04-04T11:03:56.683+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
    2024-04-04T11:03:56.683+0100 [DEBUG] docker/driver.go:1885: client.driver_mgr.docker: using client connection initialized from environment: driver=docker
    2024-04-04T11:03:56.683+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=exec health=undetected description="Driver must run as root"
    2024-04-04T11:03:56.683+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=java health=undetected description="Driver must run as root"
    2024-04-04T11:03:56.684+0100 [ERROR] docker/driver.go:830: client.driver_mgr.docker: failed to list pause containers for recovery: driver=docker error="Get \"http://unix.sock/containers/json?filters=%7B%22label%22%3A%5B%22com.hashicorp.nomad.alloc_id%22%5D%7D\": dial unix /var/run/docker.sock: connect: no such file or directory"
    2024-04-04T11:03:56.684+0100 [DEBUG] docker/fingerprint.go:116: client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get \"http://unix.sock/version\": dial unix /var/run/docker.sock: connect: no such file or directory"
    2024-04-04T11:03:56.684+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
    2024-04-04T11:03:56.684+0100 [DEBUG] servers/manager.go:208: client.server_mgr: new server list: new_servers=[0.0.0.0:4647, 10.57.20.156:4647] old_servers=[]
    2024-04-04T11:03:56.687+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
    2024-04-04T11:03:56.687+0100 [DEBUG] drivermanager/manager.go:281: client.driver_mgr: detected drivers: drivers="map[healthy:[raw_exec] undetected:[exec java docker qemu]]"
    2024-04-04T11:03:56.687+0100 [DEBUG] pluginmanager/group.go:74: client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2024-04-04T11:03:56.687+0100 [INFO]  client/client.go:588: client: started client: node_id=d25b0ede-b22e-7ffc-d641-8393267a1567
    2024-04-04T11:03:56.687+0100 [DEBUG] agent/http.go:510: http: UI is enabled
    2024-04-04T11:03:56.687+0100 [DEBUG] agent/http.go:510: http: UI is enabled
    2024-04-04T11:03:57.914+0100 [WARN]  raft@v1.5.0/raft.go:241: nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-04-04T11:03:57.914+0100 [INFO]  raft@v1.5.0/raft.go:285: nomad.raft: entering candidate state: node="Node at 10.57.20.156:4647 [Candidate]" term=2
    2024-04-04T11:03:57.914+0100 [DEBUG] raft@v1.5.0/raft.go:1869: nomad.raft: voting for self: term=2 id=8180a835-a887-8df5-ce2f-74638feed68f
    2024-04-04T11:03:57.914+0100 [DEBUG] raft@v1.5.0/raft.go:304: nomad.raft: calculated votes needed: needed=1 term=2
    2024-04-04T11:03:57.914+0100 [DEBUG] raft@v1.5.0/raft.go:327: nomad.raft: vote granted: from=8180a835-a887-8df5-ce2f-74638feed68f term=2 tally=1
    2024-04-04T11:03:57.914+0100 [INFO]  raft@v1.5.0/raft.go:332: nomad.raft: election won: term=2 tally=1
    2024-04-04T11:03:57.914+0100 [INFO]  raft@v1.5.0/raft.go:421: nomad.raft: entering leader state: leader="Node at 10.57.20.156:4647 [Leader]"
    2024-04-04T11:03:57.914+0100 [INFO]  nomad/leader.go:104: nomad: cluster leadership acquired
    2024-04-04T11:03:57.916+0100 [DEBUG] raft-autopilot@v0.1.6/run.go:103: nomad.autopilot: autopilot is now running
    2024-04-04T11:03:57.916+0100 [DEBUG] raft-autopilot@v0.1.6/run.go:175: nomad.autopilot: state update routine is now running
    2024-04-04T11:03:57.916+0100 [INFO]  nomad/leader.go:2732: nomad.core: established cluster id: cluster_id=3b95c2a8-7d89-6a64-db4e-02511408bf9e create_time=1712225037916341223
    2024-04-04T11:03:57.916+0100 [INFO]  nomad/leader.go:2784: nomad: eval broker status modified: paused=false
    2024-04-04T11:03:57.916+0100 [INFO]  nomad/leader.go:2791: nomad: blocked evals status modified: paused=false
    2024-04-04T11:03:57.918+0100 [INFO]  nomad/leader.go:2717: nomad.keyring: initialized keyring: id=ab5a3307-d5ea-2798-4a7f-8d09e4b1ac8c
    2024-04-04T11:03:58.144+0100 [DEBUG] servers/manager.go:208: client.server_mgr: new server list: new_servers=[10.57.20.156:4647] old_servers=[0.0.0.0:4647, 10.57.20.156:4647]
    2024-04-04T11:03:58.144+0100 [INFO]  client/client.go:2030: client: node registration complete
    2024-04-04T11:03:58.144+0100 [DEBUG] client/client.go:2451: client: updated allocations: index=1 total=0 pulled=0 filtered=0
    2024-04-04T11:03:58.144+0100 [DEBUG] client/client.go:2524: client: allocation updates: added=0 removed=0 updated=0 ignored=0
    2024-04-04T11:03:58.144+0100 [DEBUG] client/client.go:2568: client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
    2024-04-04T11:03:58.145+0100 [DEBUG] client/client.go:2077: client: state updated: node_status=ready
    2024-04-04T11:03:59.145+0100 [DEBUG] client/client.go:2497: client: state changed, updating node and re-registering
    2024-04-04T11:03:59.146+0100 [INFO]  client/client.go:2030: client: node registration complete
==> Failed to check for updates: Get "https://checkpoint-api.hashicorp.com/v1/check/nomad?arch=amd64&os=linux&signature=&version=1.6.9": Proxy Authentication Required

Startup log of Nomad v1.7.6

nomad/nomad1.7.6 agent -dev -log-include-location -log-level=debug -bind 0.0.0.0
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 10.57.20.156:4646; RPC: 10.57.20.156:4647; Serf: 10.57.20.156:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: true
             Log Level: debug
               Node Id: 4bee006f-9a4c-dc14-c7de-e89afdde439b
                Region: global (DC: dc1)
                Server: true
               Version: 1.7.6

==> Nomad agent started! Log data will stream in below:

    2024-04-04T10:54:26.543+0100 [DEBUG] nomad/server.go:440: nomad: issuer not set; OIDC Discovery endpoint for workload identities disabled
    2024-04-04T10:54:26.544+0100 [INFO]  raft@v1.5.0/api.go:591: nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:a3882dfa-e3b0-d1d5-c45f-8cbc011d941b Address:10.57.20.156:4647}]"
    2024-04-04T10:54:26.545+0100 [INFO]  raft@v1.5.0/raft.go:159: nomad.raft: entering follower state: follower="Node at 10.57.20.156:4647 [Follower]" leader-address= leader-id=
    2024-04-04T10:54:26.545+0100 [INFO]  go-hclog@v1.5.0/stdlog.go:60: nomad: serf: EventMemberJoin: txvslgcath004.global 10.57.20.156
    2024-04-04T10:54:26.545+0100 [INFO]  nomad/server.go:1818: nomad: starting scheduling worker(s): num_workers=8 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=32044e35-0ef3-c4ad-0ae0-f2e09f176c04 index=1 of=8
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=c42ddd46-caa9-1af6-7ce8-2ea34b3a82c1 index=2 of=8
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=b962e4f2-6e81-bba5-35dc-022011ce839e index=3 of=8
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=2eed5234-f8e6-eb14-74cd-bbc632b1d04a index=4 of=8
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=4c94eb32-03a1-043c-39b2-2dcabe99ef5f index=5 of=8
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=fc2af168-830c-c3de-471c-97c9bdfa4ead index=6 of=8
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=9997d867-45f6-3527-bddc-220cb29a2da4 index=7 of=8
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=32044e35-0ef3-c4ad-0ae0-f2e09f176c04
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=b962e4f2-6e81-bba5-35dc-022011ce839e
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=869b07c8-c751-2524-6ee5-fb2d10f2674d index=8 of=8
    2024-04-04T10:54:26.545+0100 [INFO]  nomad/server.go:1830: nomad: started scheduling worker(s): num_workers=8 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=869b07c8-c751-2524-6ee5-fb2d10f2674d
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=fc2af168-830c-c3de-471c-97c9bdfa4ead
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=2eed5234-f8e6-eb14-74cd-bbc632b1d04a
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=c42ddd46-caa9-1af6-7ce8-2ea34b3a82c1
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=4c94eb32-03a1-043c-39b2-2dcabe99ef5f
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/encrypter.go:555: nomad.keyring.replicator: starting encryption key replication
    2024-04-04T10:54:26.545+0100 [INFO]  nomad/serf.go:62: nomad: adding server: server="txvslgcath004.global (Addr: 10.57.20.156:4647) (DC: dc1)"
    2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=9997d867-45f6-3527-bddc-220cb29a2da4
    2024-04-04T10:54:26.546+0100 [DEBUG] docker/driver.go:1870: agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=""
    2024-04-04T10:54:26.547+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2024-04-04T10:54:26.547+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2024-04-04T10:54:26.547+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2024-04-04T10:54:26.547+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2024-04-04T10:54:26.547+0100 [INFO]  agent/plugins.go:41: agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2024-04-04T10:54:26.547+0100 [INFO]  client/client.go:665: client: using state directory: state_dir=/tmp/NomadClient3325952921
    2024-04-04T10:54:26.547+0100 [INFO]  client/client.go:712: client: using alloc directory: alloc_dir=/tmp/NomadClient1555522694
    2024-04-04T10:54:26.547+0100 [INFO]  client/client.go:720: client: using dynamic ports: min=20000 max=32000 reserved=""
    2024-04-04T10:54:26.553+0100 [DEBUG] client/fingerprint_manager.go:88: client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_gce", "env_azure", "env_digitalocean", "env_aws"]
    2024-04-04T10:54:26.565+0100 [DEBUG] fingerprint/cni.go:31: client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
    2024-04-04T10:54:26.566+0100 [DEBUG] client/fingerprint_manager.go:170: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
    2024-04-04T10:54:26.567+0100 [DEBUG] fingerprint/cpu.go:104: client.fingerprint_mgr.cpu: detected CPU model: name="Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz"
    2024-04-04T10:54:26.567+0100 [DEBUG] fingerprint/cpu.go:122: client.fingerprint_mgr.cpu: detected CPU frequency: mhz=2992
    2024-04-04T10:54:26.567+0100 [DEBUG] fingerprint/cpu.go:148: client.fingerprint_mgr.cpu: detected CPU core count: cores=8
    2024-04-04T10:54:26.572+0100 [WARN]  fingerprint/landlock.go:34: client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="function not implemented"
    2024-04-04T10:54:26.574+0100 [WARN]  fingerprint/network_linux.go:65: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
    2024-04-04T10:54:26.574+0100 [DEBUG] fingerprint/network_linux.go:22: client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2024-04-04T10:54:26.574+0100 [DEBUG] fingerprint/network.go:98: client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: interface=lo mbits=1000
    2024-04-04T10:54:26.574+0100 [DEBUG] fingerprint/network.go:113: client.fingerprint_mgr.network: detected interface IP: interface=lo IP=127.0.0.1
    2024-04-04T10:54:26.574+0100 [DEBUG] fingerprint/network.go:113: client.fingerprint_mgr.network: detected interface IP: interface=lo IP=::1
    2024-04-04T10:54:26.575+0100 [WARN]  fingerprint/network_linux.go:65: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
    2024-04-04T10:54:26.575+0100 [DEBUG] fingerprint/network_linux.go:22: client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2024-04-04T10:54:26.575+0100 [DEBUG] fingerprint/network.go:142: client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
    2024-04-04T10:54:26.576+0100 [WARN]  fingerprint/plugins_cni.go:53: client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
    2024-04-04T10:54:26.577+0100 [DEBUG] client/fingerprint_manager.go:170: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
    2024-04-04T10:54:26.581+0100 [DEBUG] fingerprint/env_gce.go:110: client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-04-04T10:54:26.581+0100 [DEBUG] fingerprint/env_gce.go:285: client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
    2024-04-04T10:54:26.582+0100 [DEBUG] fingerprint/env_azure.go:94: client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-04-04T10:54:26.583+0100 [DEBUG] fingerprint/env_digitalocean.go:84: client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-04-04T10:54:26.583+0100 [DEBUG] client/fingerprint_manager.go:163: client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cpu", "host", "network", "nomad", "signal", "storage"]
    2024-04-04T10:54:26.583+0100 [INFO]  cgroupslib/init.go:28: client.proclib.cg2: initializing nomad cgroups: cores=0-7
    2024-04-04T10:54:26.583+0100 [INFO]  pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=csi
    2024-04-04T10:54:26.583+0100 [INFO]  pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=driver
    2024-04-04T10:54:26.583+0100 [INFO]  pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=device
    2024-04-04T10:54:26.583+0100 [DEBUG] devicemanager/manager.go:145: client.device_mgr: exiting since there are no device plugins
    2024-04-04T10:54:26.583+0100 [DEBUG] pluginmanager/group.go:70: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2024-04-04T10:54:26.583+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=exec health=undetected description="Driver must run as root"
    2024-04-04T10:54:26.583+0100 [DEBUG] docker/driver.go:1870: client.driver_mgr.docker: using client connection initialized from environment: driver=docker
    2024-04-04T10:54:26.583+0100 [DEBUG] pluginmanager/group.go:70: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2024-04-04T10:54:26.584+0100 [DEBUG] pluginmanager/group.go:74: client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2024-04-04T10:54:26.584+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Driver must run as root"
    2024-04-04T10:54:26.583+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=java health=undetected description="Driver must run as root"
    2024-04-04T10:54:26.583+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
    2024-04-04T10:54:26.583+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
    2024-04-04T10:54:26.584+0100 [DEBUG] drivermanager/manager.go:281: client.driver_mgr: detected drivers: drivers="map[healthy:[raw_exec] undetected:[exec docker java qemu]]"
    2024-04-04T10:54:26.584+0100 [DEBUG] pluginmanager/group.go:74: client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2024-04-04T10:54:26.584+0100 [ERROR] docker/driver.go:823: client.driver_mgr.docker: failed to list pause containers for recovery: driver=docker error="Get \"http://unix.sock/containers/json?filters=%7B%22label%22%3A%5B%22com.hashicorp.nomad.alloc_id%22%5D%7D\": dial unix /var/run/docker.sock: connect: no such file or directory"
    2024-04-04T10:54:26.585+0100 [DEBUG] servers/manager.go:208: client.server_mgr: new server list: new_servers=[0.0.0.0:4647, 10.57.20.156:4647] old_servers=[]
    2024-04-04T10:54:26.585+0100 [INFO]  client/client.go:630: client: started client: node_id=e5be6681-6198-1191-5a05-0d1226c37f81
    2024-04-04T10:54:26.585+0100 [DEBUG] agent/http.go:515: http: UI is enabled
    2024-04-04T10:54:26.586+0100 [DEBUG] agent/http.go:515: http: UI is enabled
    2024-04-04T10:54:28.174+0100 [WARN]  raft@v1.5.0/raft.go:241: nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-04-04T10:54:28.174+0100 [INFO]  raft@v1.5.0/raft.go:285: nomad.raft: entering candidate state: node="Node at 10.57.20.156:4647 [Candidate]" term=2
    2024-04-04T10:54:28.174+0100 [DEBUG] raft@v1.5.0/raft.go:1869: nomad.raft: voting for self: term=2 id=a3882dfa-e3b0-d1d5-c45f-8cbc011d941b
    2024-04-04T10:54:28.174+0100 [DEBUG] raft@v1.5.0/raft.go:304: nomad.raft: calculated votes needed: needed=1 term=2
    2024-04-04T10:54:28.174+0100 [DEBUG] raft@v1.5.0/raft.go:327: nomad.raft: vote granted: from=a3882dfa-e3b0-d1d5-c45f-8cbc011d941b term=2 tally=1
    2024-04-04T10:54:28.174+0100 [INFO]  raft@v1.5.0/raft.go:332: nomad.raft: election won: term=2 tally=1
    2024-04-04T10:54:28.174+0100 [INFO]  raft@v1.5.0/raft.go:421: nomad.raft: entering leader state: leader="Node at 10.57.20.156:4647 [Leader]"
    2024-04-04T10:54:28.174+0100 [INFO]  nomad/leader.go:109: nomad: cluster leadership acquired
    2024-04-04T10:54:28.176+0100 [DEBUG] raft-autopilot@v0.1.6/run.go:103: nomad.autopilot: autopilot is now running
    2024-04-04T10:54:28.176+0100 [DEBUG] raft-autopilot@v0.1.6/run.go:175: nomad.autopilot: state update routine is now running
    2024-04-04T10:54:28.176+0100 [INFO]  nomad/leader.go:2798: nomad.core: established cluster id: cluster_id=3342784b-d650-3b98-4c50-cdd043d4fc54 create_time=1712224468176447102
    2024-04-04T10:54:28.176+0100 [INFO]  nomad/leader.go:2850: nomad: eval broker status modified: paused=false
    2024-04-04T10:54:28.176+0100 [INFO]  nomad/leader.go:2857: nomad: blocked evals status modified: paused=false
    2024-04-04T10:54:28.345+0100 [DEBUG] servers/manager.go:208: client.server_mgr: new server list: new_servers=[10.57.20.156:4647] old_servers=[0.0.0.0:4647, 10.57.20.156:4647]
    2024-04-04T10:54:28.345+0100 [INFO]  client/client.go:2041: client: node registration complete
    2024-04-04T10:54:28.345+0100 [DEBUG] client/client.go:2088: client: state updated: node_status=ready
    2024-04-04T10:54:28.345+0100 [DEBUG] client/client.go:2462: client: updated allocations: index=1 total=0 pulled=0 filtered=0
    2024-04-04T10:54:28.346+0100 [DEBUG] client/client.go:2535: client: allocation updates: added=0 removed=0 updated=0 ignored=0
    2024-04-04T10:54:28.346+0100 [DEBUG] client/client.go:2579: client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
    2024-04-04T10:54:28.652+0100 [INFO]  nomad/leader.go:2783: nomad.keyring: initialized keyring: id=be29d285-56a1-0e0b-92fa-12e930e5bddb
    2024-04-04T10:54:29.345+0100 [DEBUG] client/client.go:2508: client: state changed, updating node and re-registering
    2024-04-04T10:54:29.346+0100 [INFO]  client/client.go:2041: client: node registration complete
==> Failed to check for updates: Get "https://checkpoint-api.hashicorp.com/v1/check/nomad?arch=amd64&os=linux&signature=&version=1.7.6": Proxy Authentication Required

Nomad Client logs (if appropriate)

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 4, 2024
@simon1990zcs
Author

FYI, this post seems to report exactly the same issue as mine:
https://discuss.hashicorp.com/t/nomad-1-7-3-upgraded-from-1-0-4-cpu-and-memory-allocation-showing-as-zero/62062

@jrasell
Member

jrasell commented Apr 5, 2024

Hi @simon1990zcs and thanks for raising this issue. The Nomad client process needs to run as root and therefore this could be causing the issue you have seen. Recent cgroup changes and other isolation changes to support features like NUMA and cgroupv2 are likely to be involved here. This issue has some useful additional information.

Allocation resource usage is collected by the Nomad clients, and therefore the logs from those agents might hold the information and errors to diagnose this problem.

@jrasell jrasell added stage/needs-investigation theme/rootless Run Nomad Client without root labels Apr 5, 2024
@jrasell jrasell moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Apr 5, 2024
@simon1990zcs
Author

Hi @jrasell, thanks for looking into it.

  1. In terms of client-side logs, surprisingly, there is not much when I run nomad alloc status <allocId> under Nomad v1.7.6 (same output for v1.6.9):
    2024-04-05T10:30:11.944+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/allocations?prefix=8ccbe83b duration="300.194µs"
    2024-04-05T10:30:11.947+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/allocation/8ccbe83b-fc7a-4001-4943-febed50fbd6d?namespace=default duration="551.809µs"
    2024-04-05T10:30:11.952+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/client/allocation/8ccbe83b-fc7a-4001-4943-febed50fbd6d/checks duration="436.809µs"
    2024-04-05T10:30:11.954+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/client/allocation/8ccbe83b-fc7a-4001-4943-febed50fbd6d/stats duration="393.442µs"
  2. In terms of running the Nomad client as a non-root user, I am aware of the drawbacks, and aware that we don't have root privileges for tools like cgroups for resource isolation. That's all fine for me, since I can't do that myself as a non-root user, and I don't expect Nomad to do it for me. I very much appreciate that Nomad provides the raw_exec driver as a window for non-root users, so that Nomad can act as my user to run/kill jobs, like the python-service in this demo.
  3. However, querying allocation resource usage doesn't require root access at all, to my understanding. That's why I consider this a bug.
    I checked the Nomad codebase very briefly, and I believe Nomad queries allocation/process usage through here. It basically uses the process.MemoryInfoWithContext(context) and process.TimesWithContext(context) methods from the https://github.com/shirou/gopsutil library, and I checked their codebase a little as well: they use the standard Linux approach of reading /proc/{pid}/statm and /proc/{pid}/stat. Again, reading these files doesn't require root access at all. Since no error was logged anywhere, I guess Nomad silently returns zero somewhere earlier. Further investigation would require me to debug the source code, so I raised this ticket for help; I hope you can help address this issue. (A minimal gopsutil sketch of that read path follows below.)

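For reference, below is a minimal, self-contained sketch of that read path using gopsutil directly. It is illustrative only (not Nomad's actual code, and the v3 import path and self-PID target are assumptions for the sketch); it demonstrates that collecting per-process memory and CPU stats needs nothing more than read access to /proc.

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/shirou/gopsutil/v3/process"
)

func main() {
	// Stand-in target: our own PID. Nomad would use the task's PID here.
	p, err := process.NewProcess(int32(os.Getpid()))
	if err != nil {
		panic(err)
	}

	ctx := context.Background()

	// Reads /proc/<pid>/statm under the hood; no root required.
	mem, err := p.MemoryInfoWithContext(ctx)
	if err != nil {
		panic(err)
	}

	// Reads /proc/<pid>/stat under the hood; no root required.
	times, err := p.TimesWithContext(ctx)
	if err != nil {
		panic(err)
	}

	fmt.Printf("RSS=%d bytes user=%.2fs system=%.2fs\n", mem.RSS, times.User, times.System)
}
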
Let me know if you need any more information.

@simon1990zcs
Author

In case you need full logs from the Nomad client, including the logs from running nomad job run python-service.hcl and nomad alloc status <allocId>, below are the logs immediately following the startup log.

    2024-04-05T11:17:17.222+0100 [DEBUG] nomad/worker.go:524: worker: dequeued evaluation: worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 eval_id=09539011-9833-570d-72c3-2dcd9a08630b type=service namespace=default job_id=python-service node_id="" triggered_by=job-register
    2024-04-05T11:17:17.222+0100 [DEBUG] agent/http.go:714: http: request complete: method=PUT path=/v1/jobs duration=2.057555ms
    2024-04-05T11:17:17.222+0100 [DEBUG] scheduler/generic_sched.go:390: worker.service_sched: reconciled current state with desired state: eval_id=09539011-9833-570d-72c3-2dcd9a08630b job_id=python-service namespace=default worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213
  results=
  | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
  | Created Deployment: "e9de8c7a-23f6-e68f-84d7-fe07778ccb4e"
  | Desired Changes for "python-service-nomad": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

    2024-04-05T11:17:17.225+0100 [DEBUG] nomad/worker.go:689: worker: submitted plan for evaluation: worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 eval_id=09539011-9833-570d-72c3-2dcd9a08630b
    2024-04-05T11:17:17.225+0100 [DEBUG] scheduler/util.go:629: worker.service_sched: setting eval status: eval_id=09539011-9833-570d-72c3-2dcd9a08630b job_id=python-service namespace=default worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 status=complete
    2024-04-05T11:17:17.225+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/evaluation/09539011-9833-570d-72c3-2dcd9a08630b duration="168.904µs"
    2024-04-05T11:17:17.225+0100 [DEBUG] nomad/worker.go:752: worker: updated evaluation: worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 eval="<Eval \"09539011-9833-570d-72c3-2dcd9a08630b\" JobID: \"python-service\" Namespace: \"default\">"
    2024-04-05T11:17:17.225+0100 [DEBUG] nomad/worker.go:559: worker: ack evaluation: worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 eval_id=09539011-9833-570d-72c3-2dcd9a08630b type=service namespace=default job_id=python-service node_id="" triggered_by=job-register
    2024-04-05T11:17:17.226+0100 [DEBUG] client/client.go:2462: client: updated allocations: index=12 total=1 pulled=1 filtered=0
    2024-04-05T11:17:17.226+0100 [DEBUG] client/client.go:2535: client: allocation updates: added=1 removed=0 updated=0 ignored=0
    2024-04-05T11:17:17.226+0100 [INFO]  taskrunner/task_runner.go:1402: client.alloc_runner.task_runner: Task event: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad type=Received msg="Task received by client" failed=false
    2024-04-05T11:17:17.226+0100 [DEBUG] client/client.go:2579: client: allocation updates applied: added=1 removed=0 updated=0 ignored=0 errors=0
    2024-04-05T11:17:17.226+0100 [DEBUG] widmgr/widmgr.go:115: client.alloc_runner.widmgr: retrieving and renewing workload identities: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed num_identities=0
    2024-04-05T11:17:17.227+0100 [DEBUG] taskrunner/task_runner.go:591: client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad
    2024-04-05T11:17:17.227+0100 [INFO]  taskrunner/task_runner.go:1402: client.alloc_runner.task_runner: Task event: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad type="Task Setup" msg="Building Task Directory" failed=false
    2024-04-05T11:17:17.227+0100 [DEBUG] cmdrunner/cmd_runner.go:73: client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad path=/dat/cathedral_core/developers/nomad/nomad1.7.6 args=["/dat/cathedral_core/developers/nomad/nomad1.7.6", "logmon"]
    2024-04-05T11:17:17.228+0100 [DEBUG] cmdrunner/cmd_runner.go:80: client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad path=/dat/cathedral_core/developers/nomad/nomad1.7.6 pid=17300
    2024-04-05T11:17:17.228+0100 [DEBUG] go-plugin@v1.6.0/client.go:825: client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad plugin=/dat/cathedral_core/developers/nomad/nomad1.7.6
    2024-04-05T11:17:17.228+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/evaluation/09539011-9833-570d-72c3-2dcd9a08630b/allocations duration="264.661µs"
    2024-04-05T11:17:17.258+0100 [DEBUG] go-plugin@v1.6.0/client.go:1224: client.alloc_runner.task_runner.task_hook.logmon.nomad1.7.6: plugin address: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad @module=logmon address=/tmp/plugin1256268056 network=unix timestamp="2024-04-05T11:17:17.258+0100"
    2024-04-05T11:17:17.258+0100 [DEBUG] go-plugin@v1.6.0/client.go:878: client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad version=2
    2024-04-05T11:17:17.259+0100 [DEBUG] go-plugin@v1.6.0/client.go:1224: client.alloc_runner.task_runner.task_hook.logmon.nomad1.7.6: opening fifo: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad path=/tmp/NomadClient2485119766/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/alloc/logs/.web-app-nomad.stdout.fifo @module=logmon timestamp="2024-04-05T11:17:17.259+0100"
    2024-04-05T11:17:17.259+0100 [DEBUG] go-plugin@v1.6.0/client.go:1224: client.alloc_runner.task_runner.task_hook.logmon.nomad1.7.6: opening fifo: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad @module=logmon path=/tmp/NomadClient2485119766/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/alloc/logs/.web-app-nomad.stderr.fifo timestamp="2024-04-05T11:17:17.259+0100"
    2024-04-05T11:17:17.262+0100 [INFO]  rawexec/driver.go:322: client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:python Args:[-m http.server 20186]}"
    2024-04-05T11:17:17.262+0100 [DEBUG] cmdrunner/cmd_runner.go:73: client.driver_mgr.raw_exec.executor: starting plugin: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed driver=raw_exec task_name=web-app-nomad path=/dat/cathedral_core/developers/nomad/nomad1.7.6 args=["/dat/cathedral_core/developers/nomad/nomad1.7.6", "executor", "{\"LogFile\":\"/tmp/NomadClient2485119766/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/web-app-nomad/executor.out\",\"LogLevel\":\"debug\",\"FSIsolation\":false,\"Compute\":{\"tc\":23936,\"nc\":8}}"]
    2024-04-05T11:17:17.263+0100 [DEBUG] cmdrunner/cmd_runner.go:80: client.driver_mgr.raw_exec.executor: plugin started: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed driver=raw_exec task_name=web-app-nomad path=/dat/cathedral_core/developers/nomad/nomad1.7.6 pid=17312
    2024-04-05T11:17:17.263+0100 [DEBUG] go-plugin@v1.6.0/client.go:825: client.driver_mgr.raw_exec.executor: waiting for RPC address: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed driver=raw_exec task_name=web-app-nomad plugin=/dat/cathedral_core/developers/nomad/nomad1.7.6
    2024-04-05T11:17:17.293+0100 [DEBUG] go-plugin@v1.6.0/client.go:878: client.driver_mgr.raw_exec.executor: using plugin: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed driver=raw_exec task_name=web-app-nomad version=2
    2024-04-05T11:17:17.300+0100 [INFO]  taskrunner/task_runner.go:1402: client.alloc_runner.task_runner: Task event: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad type=Started msg="Task started by client" failed=false
    2024-04-05T11:17:17.950+0100 [DEBUG] client/client.go:2462: client: updated allocations: index=14 total=1 pulled=0 filtered=1
    2024-04-05T11:17:17.951+0100 [DEBUG] client/client.go:2535: client: allocation updates: added=0 removed=0 updated=0 ignored=1
    2024-04-05T11:17:17.951+0100 [DEBUG] client/client.go:2579: client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
    2024-04-05T11:17:28.039+0100 [DEBUG] client/client.go:2462: client: updated allocations: index=15 total=1 pulled=0 filtered=1
    2024-04-05T11:17:28.039+0100 [DEBUG] client/client.go:2535: client: allocation updates: added=0 removed=0 updated=0 ignored=1
    2024-04-05T11:17:28.039+0100 [DEBUG] client/client.go:2579: client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
    2024-04-05T11:17:29.290+0100 [DEBUG] nomad/worker.go:524: worker: dequeued evaluation: worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545 type=service namespace=default job_id=python-service node_id="" triggered_by=deployment-watcher
    2024-04-05T11:17:29.291+0100 [DEBUG] scheduler/generic_sched.go:390: worker.service_sched: reconciled current state with desired state: eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545 job_id=python-service namespace=default worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f
  results=
  | Total changes: (place 0) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
  | Deployment Update for ID "e9de8c7a-23f6-e68f-84d7-fe07778ccb4e": Status "successful"; Description "Deployment completed successfully"
  | Desired Changes for "python-service-nomad": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)

    2024-04-05T11:17:29.291+0100 [DEBUG] nomad/worker.go:689: worker: submitted plan for evaluation: worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545
    2024-04-05T11:17:29.291+0100 [DEBUG] scheduler/util.go:629: worker.service_sched: setting eval status: eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545 job_id=python-service namespace=default worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f status=complete
    2024-04-05T11:17:29.292+0100 [DEBUG] nomad/worker.go:752: worker: updated evaluation: worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f eval="<Eval \"8d8ce2ef-23a5-414a-a93a-ffd2b4a72545\" JobID: \"python-service\" Namespace: \"default\">"
    2024-04-05T11:17:29.292+0100 [DEBUG] nomad/worker.go:559: worker: ack evaluation: worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545 type=service namespace=default job_id=python-service node_id="" triggered_by=deployment-watcher
    2024-04-05T11:18:10.998+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/allocations?prefix=50a8fb0d- duration="339.696µs"
    2024-04-05T11:18:11.002+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/allocation/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed?namespace=default duration="244.028µs"
    2024-04-05T11:18:11.007+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/client/allocation/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/checks duration="114.396µs"
    2024-04-05T11:18:11.009+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/client/allocation/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/stats duration="155.035µs"

@simon1990zcs
Author

Now I'm more certain this is a bug, since it works fine on Windows and macOS Nomad clients, but not on Linux, or at least not on CentOS 7.6.
Could you please advise with some code references for the entry points to the client's task/allocation resource usage, so that I can download the code and debug further what's going wrong on Linux (or CentOS 7.6)?

@simon1990zcs
Author

@jrasell hey James, I found the root cause after debugging the codebase.
Basically, drivers/shared/executor/procstats/getstats.go@scanPIDs() is supposed to fetch the detected live PIDs that make up the task's process tree (i.e., the task's process group). However, in my case it comes back with an empty PID set,
because executor_universal_linux.go@ListProcesses() assumes that cgroups are available on the client machine, which yields zero live PIDs when cgroups are not actually in use:

// Current implementation: always lists PIDs via the task's cgroup, which
// returns nothing when Nomad runs rootless and cgroups are unmanaged.
func (e *UniversalExecutor) ListProcesses() *set.Set[procstats.ProcessID] {
	return procstats.List(e.command)
}

One proposed fix is to add a condition here that falls back to PID-based listing when cgroups are missing, as below.

func (e *UniversalExecutor) ListProcesses() *set.Set[procstats.ProcessID] {
	// Prefer cgroup-based listing whenever Nomad manages cgroups.
	if cgroupslib.GetMode() != cgroupslib.OFF {
		return procstats.List(e.command)
	}
	// Otherwise fall back to walking the executor's process tree by PID
	// (ListByPid is the proposed helper; see below).
	return procstats.ListByPid(e.childCmd.Process.Pid)
}

The procstats.ListByPid mentioned in the proposal is exactly the same as list_default.go's List(executorPID int), so I suggest refactoring it into a common package and renaming it to ListByPid, e.g. procstats.go@ListByPid(executorPID int).
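
For illustration, here is a hedged sketch of what that shared helper could look like, assuming gopsutil for process-tree discovery and hashicorp/go-set for the result type. The ProcessID alias and the breadth-first walk are assumptions made for the sketch, not the actual contents of list_default.go.

package procstats

import (
	"github.com/hashicorp/go-set"
	"github.com/shirou/gopsutil/v3/process"
)

// ProcessID is assumed to be a plain int here; adjust to whatever the
// procstats package actually defines.
type ProcessID = int

// ListByPid returns the executor PID plus all of its descendants by walking
// the process tree via /proc instead of reading cgroup membership, so it
// also works for rootless raw_exec tasks.
func ListByPid(executorPID int) *set.Set[ProcessID] {
	result := set.New[ProcessID](8)
	result.Insert(executorPID)

	root, err := process.NewProcess(int32(executorPID))
	if err != nil {
		return result // executor already exited; return just its own PID
	}

	// Children returns direct children only, so walk the tree breadth-first.
	queue := []*process.Process{root}
	for len(queue) > 0 {
		parent := queue[0]
		queue = queue[1:]
		kids, err := parent.Children()
		if err != nil {
			continue // no children, or raced with an exiting process
		}
		for _, kid := range kids {
			if result.Insert(ProcessID(kid.Pid)) {
				queue = append(queue, kid)
			}
		}
	}
	return result
}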
