
Inaccurate memory reporting in Nomad #5165

Closed · ashald opened this issue Jan 8, 2019 · 6 comments

@ashald commented Jan 8, 2019

Nomad version

$ nomad version
Nomad v0.8.4+ent (806c7a42398568f7ad91cf33daff47a41189fb8a)

Operating system and Environment details

CentOS 7.6.1810

Issue

We noticed this issue only when we moved our workloads from rkt to Docker, but I suspect it might affect other drivers as well.

We have a job running minio (we also noticed a similar issue with a number of other services written in Scala and Python) whose resources stanza looks like this (excerpt):

resources {
  cpu    = 6800
  memory = 10240
}

What Nomad reports as memory consumption for this process is (excerpt from a call to https://www.nomadproject.io/api/client.html#read-allocation-statistics):

{
  "Tasks": {
    "minio": {
      "ResourceUsage": {
        "MemoryStats": {
          "RSS": 97730560,
          "Cache": 673124352,
          "Swap": 0,
          "MaxUsage": 10737418240,
          "KernelUsage": 0,
          "KernelMaxUsage": 0,
          "Measured": [
            "RSS",
            "Cache",
            "Swap",
            "Max Usage"
          ]
        }
      },
      "Timestamp": 1546972954150818057,
      "Pids": null
    }
  }
}

Even if we count in swap (which is zero), we would still see:

(97730560+673124352)/10737418240=0.07179145813

which is only about 7% of the limit (roughly 735 MB), and that is approximately what we see in the Nomad UI as well.
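
For reference, here is a minimal sketch in Go (not the code Nomad or its UI actually uses) that queries the allocation stats endpoint linked above and reproduces the arithmetic; the allocation ID and the hard-coded 10 GiB limit are placeholders taken from this job:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// allocStats mirrors just the fields of the stats excerpt shown above.
type allocStats struct {
	Tasks map[string]struct {
		ResourceUsage struct {
			MemoryStats struct {
				RSS      uint64
				Cache    uint64
				Swap     uint64
				MaxUsage uint64
			}
		}
	}
}

func main() {
	// "Read Allocation Statistics" endpoint from the client API docs;
	// <alloc-id> is a placeholder for the real allocation ID.
	resp, err := http.Get("http://127.0.0.1:4646/v1/client/allocation/<alloc-id>/stats")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats allocStats
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}

	const limit = 10 * 1024 * 1024 * 1024 // 10 GiB, i.e. memory = 10240 in the job

	m := stats.Tasks["minio"].ResourceUsage.MemoryStats
	used := m.RSS + m.Cache + m.Swap
	fmt.Printf("RSS+Cache+Swap = %d bytes (%.1f%% of the 10 GiB limit)\n",
		used, float64(used)/float64(limit)*100)
}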

OTOH, Docker shows a completely different value:

$ docker container stats 5e8c4b48a147
CONTAINER ID        NAME                                         CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
5e8c4b48a147        minio-a79f62cf-aae6-6571-6154-c4ec0e2ae81d   102.11%             9.356GiB / 10GiB    93.56%              14.2GB / 105GB      0B / 483kB          121

Docker on its own derives memory consumption from cgroups and reports it as mem.Usage - mem.Stats["cache"] (https://github.com/docker/cli/blob/master/cli/command/container/stats_helpers.go#L227-L229), which is populated from https://github.com/docker/libcontainer/blob/master/cgroups/fs/memory.go#L127-L164. In terms of sysfs, given:

$ tree /sys/fs/cgroup/memory/docker/5e8c4b48a14718ed42c77f89502597cd44f549edac3d6ff6f81af088c2768a39
/sys/fs/cgroup/memory/docker/5e8c4b48a14718ed42c77f89502597cd44f549edac3d6ff6f81af088c2768a39
├── cgroup.clone_children
├── cgroup.event_control
├── cgroup.procs
├── memory.failcnt
├── memory.force_empty
├── memory.kmem.failcnt
├── memory.kmem.limit_in_bytes
├── memory.kmem.max_usage_in_bytes
├── memory.kmem.slabinfo
├── memory.kmem.tcp.failcnt
├── memory.kmem.tcp.limit_in_bytes
├── memory.kmem.tcp.max_usage_in_bytes
├── memory.kmem.tcp.usage_in_bytes
├── memory.kmem.usage_in_bytes
├── memory.limit_in_bytes
├── memory.max_usage_in_bytes
├── memory.memsw.failcnt
├── memory.memsw.limit_in_bytes
├── memory.memsw.max_usage_in_bytes
├── memory.memsw.usage_in_bytes
├── memory.move_charge_at_immigrate
├── memory.numa_stat
├── memory.oom_control
├── memory.pressure_level
├── memory.soft_limit_in_bytes
├── memory.stat
├── memory.swappiness
├── memory.usage_in_bytes
├── memory.use_hierarchy
├── notify_on_release
└── tasks

0 directories, 31 files

where:

$ cat memory.limit_in_bytes
10737418240

and:

$ cat memory.usage_in_bytes
10737127424

and

$ cat memory.stat
cache 405790720
rss 97730560
rss_huge 0
mapped_file 0
swap 0
pgpgin 71391821
pgpgout 71268891
pgfault 33089794
pgmajfault 0
inactive_anon 52785152
active_anon 44945408
inactive_file 179175424
active_file 226521088
unevictable 0
hierarchical_memory_limit 10737418240
hierarchical_memsw_limit 10737418240
total_cache 405790720
total_rss 97730560
total_rss_huge 0
total_mapped_file 0
total_swap 0
total_pgpgin 71391821
total_pgpgout 71268891
total_pgfault 33089794
total_pgmajfault 0
total_inactive_anon 52785152
total_active_anon 44945408
total_inactive_file 179175424
total_active_file 226521088
total_unevictable 0

this would look like:

(memory.usage_in_bytes - memory.stat:cache) / memory.limit_in_bytes = (10737127424 - 405790720) / 10737418240 = 0.96

which corresponds to what Docker shows.
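
To double-check that number outside of Docker, here is a minimal sketch in Go, assuming cgroup v1 and the cgroup path shown above (the container ID is a placeholder), that reproduces Docker's calculation from the same sysfs files:

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readUint reads a single integer value from a cgroup file such as
// memory.usage_in_bytes or memory.limit_in_bytes.
func readUint(path string) uint64 {
	b, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	if err != nil {
		panic(err)
	}
	return v
}

// readStat extracts one "key value" line from memory.stat.
func readStat(path, key string) uint64 {
	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == key {
			v, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				panic(err)
			}
			return v
		}
	}
	panic("key not found in " + path + ": " + key)
}

func main() {
	// Placeholder path; substitute the real container ID.
	cg := "/sys/fs/cgroup/memory/docker/<container-id>"

	usage := readUint(filepath.Join(cg, "memory.usage_in_bytes"))
	limit := readUint(filepath.Join(cg, "memory.limit_in_bytes"))
	cache := readStat(filepath.Join(cg, "memory.stat"), "cache")

	// Docker-style "MEM USAGE": current usage with the page cache excluded.
	fmt.Printf("(%d - %d) / %d = %.2f\n", usage, cache, limit,
		float64(usage-cache)/float64(limit))
}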

Until we increased the memory limit to its current level, we saw minio frequently being killed by the OOM killer, which led us to the assumption that the real memory usage is different from what Nomad reports.
It seems this is also the cause of #4495.

@endocrimes (Contributor)

I think this should be fixed in 0.9 because we get those metrics from the Docker API as part of the new driver.

@angrycub (Contributor) commented Jan 9, 2019

@ashald, I was able to confirm that this is still an issue in Nomad v0.9.0-dev (c506b56). We'll be looking into it further.

notnoop pushed a commit that referenced this issue Jan 15, 2019
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_memory_usage_in_bytes` and friends. This number is closer to
what Docker reports.

Related to #5165.
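
Once the current usage is tracked alongside the cache value, the Docker-style figure can be derived on the consumer side; here is a tiny sketch with a hypothetical helper (not part of Nomad's API), fed the cgroup numbers from the excerpt above:

package main

import "fmt"

// dockerStyleUsage mirrors Docker's stats helper: current usage minus the
// page cache, as a fraction of the cgroup limit.
func dockerStyleUsage(usage, cache, limit uint64) float64 {
	return float64(usage-cache) / float64(limit)
}

func main() {
	// memory.usage_in_bytes, memory.stat:cache, memory.limit_in_bytes
	fmt.Printf("%.2f\n", dockerStyleUsage(10737127424, 405790720, 10737418240))
}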
hashicorp deleted a comment from the stale bot on May 10, 2019
@endocrimes (Contributor)

@ashald Could you confirm if this was fixed in 0.9.0/0.9.1 as part of #5190?

@ashald (Author) commented May 14, 2019

@endocrimes we haven't had a chance to upgrade to Nomad 0.9.x yet, so I cannot confirm it so far. I think we can resolve this issue now, given that the fix looks correct; it can be reopened later (or a new one can be submitted) should a regression happen.

@nickethier (Member)

Thanks @ashald

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Nov 23, 2022