
Implement manage_rlimit_as; add PTL test #2332

Merged
merged 14 commits into openpbs:master from alexis-cousein:cgroups_rlimit_as on Apr 13, 2021

Conversation


@alexis-cousein alexis-cousein commented Mar 26, 2021

Describe Bug or Feature

When enabling the "memsw" functionality in the cgroup hook, the semantics of "vmem" change.

Without it "vmem" is managed by MoM and denotes the sum of the address spaces used by all processes on the mother superior node, and MoM sets a per process limit for RLIMIT_AS (since clearly a single process cannot use less address space than the sum for all processes).

But when the cgroup hook "memsw" functionality is enabled, "vmem" requests instead specify the sum of physical memory plus swap usage for the job, which is often smaller than the address space limit.

Indeed, in the face of kernel limits on real memory plus swap usage, it is doubtful anyone would care about address space limits, especially on 64-bit systems (which have an 8 exabyte positive address space). Address space limits were used in the past, when there were no kernel-enforced limits on physical memory plus swap, but cgroups make them unnecessary.

But on most current versions of PBSPro, MoM does not know about this change in semantics for vmem, and it still sets an RLIMIT_AS limit for all processes in the job. Since many applications allocate much more address space than the memory they actually use, this can make applications fail even though they stay well within the memory+swap usage corresponding to their vmem requests.

Even MoM itself can be a victim of small vmem limits: when the job starter child forks to run the execjob_launch hook, the fork and exec of pbs_python can use up to 150-200 MB of address space, so setting a small RLIMIT_AS limit can make either the job starter or the hooks fail.

Furthermore, a per-job 'pvmem' resource request unambiguously requests exactly an RLIMIT_AS limit, so there is no real need to use vmem to specify this limit if one is desired.

To summarise: when enabling the "memsw" functionality in the cgroup hook, RLIMIT_AS should be unlimited, unless the job also requests "pvmem" (which is an unambiguous request to set RLIMIT_AS).

Note: this really is a MoM bug; ideally it should be possible to enable or disable automatic setting of the RLIMIT_AS limit when 'vmem' is requested in the MoM config file. But until such a mechanism exists, at least the cgroup hook can mitigate the effect for values of 'vmem' large enough to allow the job starter and the hooks to run.

Describe Your Change

A new parameter is introduced to the cgroup configuration file, manage_rlimit_as, which by default is true.

If enabled, the cgroup hook resets the RLIMIT_AS process limit of the job's task processes to either unlimited (if pvmem is not requested) or the pvmem value requested for the job.

This functionality requires a kernel that supports the prlimit system call (i.e. Linux kernel 2.6.36 and above).

If hooks run under Python 3, i.e. for PBS versions 2020 and above, that is the only requirement.

For older versions of PBS, the 'prlimit' command must be present. It is available in util-linux 2.21 and above, i.e. on:
- CentOS/RHEL 7.0 and later
- SLES 12 and later
- Ubuntu 16.04 and later
- Debian 8.0 and later
(in other words, most OS versions whose kernel supports the prlimit system call and that ship util-linux >= 2.21).

If support for changing the resource limits of other processes is not present, limits set by MoM are never changed (i.e. the hook behaves as before, as if manage_rlimit_as were disabled).
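
To illustrate the mechanism, here is a minimal sketch (not the hook's actual code; the helper name and the way PIDs are collected are assumed): Python 3's resource.prlimit wraps the prlimit(2) system call, and on older Python the prlimit command from util-linux can be used instead.

```python
import resource
import subprocess

def reset_rlimit_as(pids, pvmem_bytes=None):
    """Set RLIMIT_AS for every task PID: unlimited unless pvmem was requested."""
    limit = resource.RLIM_INFINITY if pvmem_bytes is None else pvmem_bytes
    for pid in pids:
        try:
            # Python 3.4+: resource.prlimit wraps prlimit(2) (kernel >= 2.6.36)
            resource.prlimit(pid, resource.RLIMIT_AS, (limit, limit))
        except AttributeError:
            # Python 2: fall back to the prlimit command (util-linux >= 2.21)
            value = 'unlimited' if pvmem_bytes is None else str(pvmem_bytes)
            subprocess.call(['prlimit', '--pid', str(pid),
                             '--as=%s:%s' % (value, value)])
```

Passing pvmem_bytes=None corresponds to a job that requests vmem but no pvmem, so RLIMIT_AS becomes unlimited.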

When the flag is disabled, limits set by MoM will not be changed. It is unlikely anyone would want this behaviour, except on MoMs where it is possible to disable enforcement of 'vmem' as an RLIMIT_AS limit set on processes.

Link to Design Doc

https://openpbs.atlassian.net/wiki/spaces/PD/pages/2668199965/Fix+incorrect+MoM+RLIMIT+AS+when+vmem+is+used+for+the+group+hook+s+memsw+functionality

Attach Test and Valgrind Logs/Output

A PTL test was added, test_manage_rlimit_as.

One job requests -lselect=1:ncpus=0:mem=300mb:vmem=300mb and the test checks that ulimit -v inside the job returns 'unlimited'. A second job requests -lselect=1:ncpus=0:mem=300mb:vmem=300mb plus pvmem=300mb, and ulimit -v should then return 307200 (since ulimit -v reports in KB).
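
As a rough sketch of the check performed on the job output (the helper and the file paths are illustrative, not the actual PTL code):

```python
def check_rlimit_as_output(output_file, expect_kb=None):
    """Check the ulimit -v line in a job output file; expect_kb=None means 'unlimited'."""
    with open(output_file) as f:
        lines = [line.strip() for line in f]
    expected = 'unlimited' if expect_kb is None else str(expect_kb)
    # search every line, to tolerate hosts that add extra spam to job output
    assert any(line == expected for line in lines), \
        'ulimit -v returned %s, expected %s' % (lines, expected)

# pvmem=300mb -> 300 * 1024 = 307200, since ulimit -v reports the limit in KB
check_rlimit_as_output('/home/pbsuser/PtlPbsJobScript.o4424')          # vmem only: unlimited
check_rlimit_as_output('/home/pbsuser/PtlPbsJobScript.o4425', 307200)  # with pvmem=300mb
```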

Relevant portion of test_manage_rlimit_as output:

2021-03-26 13:08:54,255 INFO     manager on dragon: set server {'job_history_enable': 'True'}
2021-03-26 13:08:54,255 INFOCLI  dragon: /opt/pbs/bin/qmgr -c set server job_history_enable=True
2021-03-26 13:08:54,283 INFOCLI  dragon: /opt/pbs/bin/qstat -Bf dragon
2021-03-26 13:08:54,294 INFO     expect on server dragon: job_history_enable = True server dragon ...  OK
2021-03-26 13:08:54,294 INFO     Current config: {
    "cgroup_prefix"         : "pbs_jobs",
    "exclude_hosts"         : [],
    "exclude_vntypes"       : ["no_cgroups"],
    "run_only_on_hosts"     : [],
    "periodic_resc_update"  : true,
    "vnode_per_numa_node"   : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads"      : false,
    "ncpus_are_cores"       : false,
    "manage_rlimit_as"      : true,
    "cgroup" : {
        "cpuacct" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : []
        },
        "cpuset" : {
            "enabled"            : true,
            "exclude_cpus"       : [],
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "mem_fences"         : true,
            "mem_hardwall"       : false,
            "memory_spread_page" : false
        },
        "devices" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "allow"              : [
                "b *:* rwm",
                "c *:* rwm"
            ]
        },
        "hugetlb" : {
            "enabled"            : false,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "default"            : "0MB",
            "reserve_percent"    : 0,
            "reserve_amount"     : "0MB"
        },
        "memory" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "soft_limit"         : false,
            "default"            : "100MB",
            "reserve_percent"    : 0,
            "swappiness"         : 0,
            "reserve_amount"     : "2GB",
            "enforce_default"    : true,
            "exclhost_ignore_default" : true
        },
        "memsw" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "default"            : "100MB",
            "reserve_percent"    : 0,
            "reserve_amount"     : "10GB",
            "manage_cgswap"      : true,
            "enforce_default"    : true,
            "exclhost_ignore_default" : true
        }
    }
}

2021-03-26 13:09:01,301 INFO     manager on dragon: import hook pbs_cgroups {'content-type': 'application/x-config', 'content-encoding': 'default', 'input-file': '/tmp/PtlPbs9u4sidjl'}
2021-03-26 13:09:01,302 INFOCLI  dragon: sudo -H /opt/pbs/bin/qmgr -c import hook pbs_cgroups application/x-config default /tmp/PtlPbs9u4sidjl
2021-03-26 13:09:01,326 ERROR    err: ["hook 'pbs_cgroups' contents overwritten"]
2021-03-26 13:09:04,892 INFO     mom dragon log match: searching for "pbs_cgroups.CF;copy hook-related file request received" - with existence... OK
2021-03-26 13:09:04,903 INFOCLI2 dragon: sudo -H cat /var/spool/PBS/server_priv/hooks/pbs_cgroups.CF
2021-03-26 13:09:04,923 INFOCLI2 dragon: sudo -H cat /var/spool/PBS/mom_priv/hooks/pbs_cgroups.CF
2021-03-26 13:09:04,941 INFO     server & mom pbs_cgroups.CF match
2021-03-26 13:09:08,945 INFO     mom dragon: sent signal -HUP
2021-03-26 13:09:08,945 INFOCLI2 dragon: sudo -H /usr/bin/cat /var/spool/PBS/mom_priv/mom.lock
2021-03-26 13:09:08,964 INFOCLI2 dragon: sudo -H kill -HUP 554166
2021-03-26 13:09:12,543 INFO     mom dragon log match: searching for "hook_perf_stat;label=hook_exechost_startup_pbs_cgroups_.* profile_stop" - using regular expression  - with existence - No match
2021-03-26 13:09:17,143 INFO     mom dragon log match: searching for "hook_perf_stat;label=hook_exechost_startup_pbs_cgroups_.* profile_stop" - using regular expression  - with existence... OK
2021-03-26 13:09:17,153 INFO     job: executable set to /bin/sleep with arguments: 100
2021-03-26 13:09:17,154 INFOCLI2 dragon: which chmod
2021-03-26 13:09:17,163 INFOCLI2 dragon: /usr/bin/chmod 755 /tmp/PtlPbsJobScript_fdaldlg
2021-03-26 13:09:17,168 INFOCLI  dragon: sudo -H -u pbsuser /opt/pbs/bin/qsub -l select=1:ncpus=0:mem=300mb:vmem=300mb:vnode=dragon /tmp/PtlPbsJobScript_fdaldlg
2021-03-26 13:09:17,203 INFO     submit to dragon as pbsuser: job 4424.dragon OrderedDict([('Resource_List.select', '1:ncpus=0:mem=300mb:vmem=300mb:vnode=dragon')])
2021-03-26 13:09:17,203 INFOCLI  job script /tmp/PtlPbsJobScript_fdaldlg
---
#!/bin/bash
ulimit -v
---
2021-03-26 13:09:17,203 INFO     server dragon: expect offset set to 1
2021-03-26 13:09:18,204 INFOCLI  dragon: /opt/pbs/bin/qstat -x -f 4424.dragon
2021-03-26 13:09:18,218 INFO     expect on server dragon: job_state = F job 4424.dragon  got: job_state = R
2021-03-26 13:09:18,719 INFOCLI  dragon: /opt/pbs/bin/qmgr -c set server scheduling=True
2021-03-26 13:09:18,738 INFOCLI  dragon: /opt/pbs/bin/qstat -x -f 4424.dragon
2021-03-26 13:09:18,950 INFO     expect on server dragon: job_state = F job 4424.dragon attempt: 2  got: job_state = E
2021-03-26 13:09:19,451 INFOCLI  dragon: /opt/pbs/bin/qmgr -c set server scheduling=True
2021-03-26 13:09:19,470 INFOCLI  dragon: /opt/pbs/bin/qstat -x -f 4424.dragon
2021-03-26 13:09:19,482 INFO     expect on server dragon: job_state = F job 4424.dragon attempt: 3 ...  OK
2021-03-26 13:09:19,482 INFO     status on dragon: job 4424.dragon
2021-03-26 13:09:19,482 INFOCLI  dragon: /opt/pbs/bin/qstat -x -f 4424.dragon
2021-03-26 13:09:19,693 INFOCLI2 dragon: sudo -H /usr/bin/cat /home/pbsuser/PtlPbsJobScript_fdaldlg.o4424
2021-03-26 13:09:19,714 INFO     job_out=unlimited
2021-03-26 13:09:19,714 INFO     Job that requests vmem but no pvmem correctly has unlimited RLIMIT_AS
2021-03-26 13:09:19,714 INFO     job: executable set to /bin/sleep with arguments: 100
2021-03-26 13:09:19,715 INFOCLI2 dragon: /usr/bin/chmod 755 /tmp/PtlPbsJobScriptld59nol_
2021-03-26 13:09:19,723 INFOCLI  dragon: sudo -H -u pbsuser /opt/pbs/bin/qsub -l select=1:ncpus=0:mem=300mb:vmem=300mb:vnode=dragon -l pvmem=300mb /tmp/PtlPbsJobScriptld59nol_
2021-03-26 13:09:19,745 INFO     submit to dragon as pbsuser: job 4425.dragon OrderedDict([('Resource_List.select', '1:ncpus=0:mem=300mb:vmem=300mb:vnode=dragon'), ('Resource_List.pvmem', '300mb')])
2021-03-26 13:09:19,745 INFOCLI  job script /tmp/PtlPbsJobScriptld59nol_
---
#!/bin/bash
ulimit -v
---
2021-03-26 13:09:19,745 INFO     server dragon: expect offset set to 1
2021-03-26 13:09:20,746 INFOCLI  dragon: /opt/pbs/bin/qstat -x -f 4425.dragon
2021-03-26 13:09:20,761 INFO     expect on server dragon: job_state = F job 4425.dragon  got: job_state = E
2021-03-26 13:09:21,261 INFOCLI  dragon: /opt/pbs/bin/qmgr -c set server scheduling=True
2021-03-26 13:09:21,281 INFOCLI  dragon: /opt/pbs/bin/qstat -x -f 4425.dragon
2021-03-26 13:09:21,293 INFO     expect on server dragon: job_state = F job 4425.dragon attempt: 2  got: job_state = E
2021-03-26 13:09:21,794 INFOCLI  dragon: /opt/pbs/bin/qmgr -c set server scheduling=True
2021-03-26 13:09:21,811 INFOCLI  dragon: /opt/pbs/bin/qstat -x -f 4425.dragon
2021-03-26 13:09:22,023 INFO     expect on server dragon: job_state = F job 4425.dragon attempt: 3 ...  OK
2021-03-26 13:09:22,024 INFO     status on dragon: job 4425.dragon
2021-03-26 13:09:22,024 INFOCLI  dragon: /opt/pbs/bin/qstat -x -f 4425.dragon
2021-03-26 13:09:22,237 INFOCLI2 dragon: sudo -H /usr/bin/cat /home/pbsuser/PtlPbsJobScriptld59nol_.o4425
2021-03-26 13:09:22,257 INFO     job_out=307200
2021-03-26 13:09:22,257 INFO     Job that requests 300mb pvmem correctly has 300mb RLIMIT_AS

@alexis-cousein
Contributor Author

The test also passed on the Test Harness, on CentOS7, CentOS8 and SLES15.

6752 | Custom | Tests from given sources on platforms CENTOS7, CENTOS8, SUSE15 | alexis-cousein/pbspro@cgroups_rlimit_as | 26th Mar 2021 05:35:50.619 PM +01:00 | 00:07:24.458 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3

When rerunning all cgroup tests I see there's still a race condition in the test to see if stuck nodes are correctly offlined, because the job that should get stuck can still be deleted before the processes are frozen. I'm going to implement a more robust test to confirm the freezer has frozen the processes (instead of "if I sleep 1 second it should be OK").

Stay tuned for another commit.

@alexis-cousein
Contributor Author

It will also be necessary to rewrite test_cgroup_enforce_memsw, since its 'success' criterion depends on the very bug worked around by manage_rlimit_as: it currently tests whether Python was stopped because it ran out of address space, rather than because the memsw limit was enforced.

@alexis-cousein
Contributor Author

Since the Test Harness has no swap and will not run it, here is the relevant output of the fixed test_cgroup_enforce_memsw test on my system:

2021-03-27 01:45:11,433 INFO     Current config: {
    "exclude_hosts"         : [],
    "exclude_vntypes"       : [],
    "run_only_on_hosts"     : [],
    "periodic_resc_update"  : true,
    "vnode_per_numa_node"   : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads"      : true,
    "cgroup":
    {
        "cpuacct":
        {
            "enabled"         : true,
            "exclude_hosts"   : [],
            "exclude_vntypes" : []
        },
        "cpuset":
        {
            "enabled"         : true,
            "exclude_hosts"   : [],
            "exclude_vntypes" : []
        },
        "devices":
        {
            "enabled"         : false
        },
        "hugetlb":
        {
            "enabled"         : false
        },
        "memory":
        {
            "enabled"         : true,
            "default"         : "96MB",
            "reserve_amount"  : "50MB",
            "exclude_hosts"   : [],
            "exclude_vntypes" : []
        },
        "memsw":
        {
            "enabled"         : true,
            "default"         : "96MB",
            "reserve_amount"  : "45MB",
            "exclude_hosts"   : [],
            "exclude_vntypes" : []
        }
    }
}

2021-03-27 01:45:18,435 INFO     manager on dragon: import hook pbs_cgroups {'content-type': 'application/x-config', 'content-encoding': 'default', 'input-file': '/tmp/PtlPbspszxq2fy'}
2021-03-27 01:45:18,435 INFOCLI  dragon: sudo -H /opt/pbs/bin/qmgr -c import hook pbs_cgroups application/x-config default /tmp/PtlPbspszxq2fy
2021-03-27 01:45:18,453 ERROR    err: ["hook 'pbs_cgroups' contents overwritten"]
2021-03-27 01:45:18,558 INFO     mom dragon log match: searching for "pbs_cgroups.CF;copy hook-related file request received" - with existence... OK
2021-03-27 01:45:18,558 INFOCLI2 dragon: sudo -H cat /var/spool/PBS/server_priv/hooks/pbs_cgroups.CF
2021-03-27 01:45:18,570 INFOCLI2 dragon: sudo -H cat /var/spool/PBS/mom_priv/hooks/pbs_cgroups.CF
2021-03-27 01:45:18,581 INFO     server & mom pbs_cgroups.CF match
2021-03-27 01:45:22,585 INFO     mom dragon: sent signal -HUP
2021-03-27 01:45:22,586 INFOCLI2 dragon: sudo -H /usr/bin/cat /var/spool/PBS/mom_priv/mom.lock
2021-03-27 01:45:22,598 INFOCLI2 dragon: sudo -H kill -HUP 713397
2021-03-27 01:45:22,715 INFO     mom dragon log match: searching for "hook_perf_stat;label=hook_exechost_startup_pbs_cgroups_.* profile_stop" - using regular expression  - with existence - No match
2021-03-27 01:45:23,825 INFO     mom dragon log match: searching for "hook_perf_stat;label=hook_exechost_startup_pbs_cgroups_.* profile_stop" - using regular expression  - with existence... OK
2021-03-27 01:45:23,825 INFOCLI2 dragon: sudo -H -u pbsuser /tmp/PtlPbsej6r65hs
Contents of /tmp/PtlPbsej6r65hs:
----------------------------------------
#!/bin/bash
/usr/bin/rm ~/CGROUP6.*
----------------------------------------
2021-03-27 01:45:23,838 INFOCLI2 dragon: /usr/bin/rm /tmp/PtlPbsej6r65hs
2021-03-27 01:45:23,840 INFO     job: executable set to /bin/sleep with arguments: 100
2021-03-27 01:45:23,840 INFOCLI2 dragon: which chmod
2021-03-27 01:45:23,841 INFOCLI2 dragon: /usr/bin/chmod 755 /tmp/PtlPbsJobScriptiwry015u
2021-03-27 01:45:23,843 INFOCLI  dragon: sudo -H -u pbsuser /opt/pbs/bin/qsub -l select=1:ncpus=1:mem=300mb:vmem=320mb:host=dragon -N CGROUP6 /tmp/PtlPbsJobScriptiwry015u
2021-03-27 01:45:23,876 INFO     submit to dragon as pbsuser: job 4465.dragon OrderedDict([('Resource_List.select', '1:ncpus=1:mem=300mb:vmem=320mb:host=dragon'), ('Job_Name', 'CGROUP6')])
2021-03-27 01:45:23,876 INFOCLI  job script /tmp/PtlPbsJobScriptiwry015u
---
#PBS -joe
#PBS -S /bin/bash
timeout -s KILL 60 sync
sleep 10
python_path=`which python 2>/dev/null`
python3_path=`which python3 2>/dev/null`
python2_path=`which python2 2>/dev/null`
if [ -z "$python_path" ]; then
    if [ -n "$python3_path" ]; then
        python_path=$python3_path
    else
        python_path=$python2_path
    fi
fi
if [ -z "$python_path" ]; then
    echo Exiting -- no python found
    exit 1
fi
$python_path - 80 10 10 <<EOF

import sys
import time
MB = 2 ** 20
iterations = 1
chunkSizeMb = 1
sleeptime = 0
if (len(sys.argv) > 1):
    iterations = int(sys.argv[1])
if (len(sys.argv) > 2):
    chunkSizeMb = int(sys.argv[2])
if (len(sys.argv) > 3):
    sleeptime = int(sys.argv[3])
if (iterations < 1):
    print('Iteration count must be greater than zero.')
    exit(1)
if (chunkSizeMb < 1):
    print('Chunk size must be greater than zero.')
    exit(1)
totalSizeMb = chunkSizeMb * iterations
print('Allocating %d chunk(s) of size %dMB. (%dMB total)' %
      (iterations, chunkSizeMb, totalSizeMb))
buf = ''
for i in range(iterations):
    print('allocating %dMB' % ((i + 1) * chunkSizeMb))
    buf += ('#' * MB * chunkSizeMb)
if sleeptime > 0:
    time.sleep(sleeptime)

EOF

---
2021-03-27 01:45:23,876 INFOCLI  dragon: /opt/pbs/bin/qstat -f 4465.dragon
2021-03-27 01:45:23,882 INFO     expect on server dragon: job_state = R && substate = 42 job 4465.dragon  got: substate = 41
2021-03-27 01:45:24,383 INFOCLI  dragon: /opt/pbs/bin/qmgr -c set server scheduling=True
2021-03-27 01:45:24,394 INFOCLI  dragon: /opt/pbs/bin/qstat -f 4465.dragon
2021-03-27 01:45:24,400 INFO     expect on server dragon: job_state = R && substate = 42 job 4465.dragon attempt: 2 ...  OK
2021-03-27 01:45:24,400 INFO     status on dragon: job 4465.dragon
2021-03-27 01:45:24,400 INFOCLI  dragon: /opt/pbs/bin/qstat -f 4465.dragon
2021-03-27 01:45:24,607 INFO     Reading file: /home/pbsuser/CGROUP6.o4465 on host: dragon
2021-03-27 01:45:37,120 INFOCLI2 dragon: sudo -H /usr/bin/cat /home/pbsuser/CGROUP6.o4465
2021-03-27 01:45:37,135 INFO     Joined stdout/stderr contained expected string: Cgroup memsw limit exceeded

@alexis-cousein
Contributor Author

Note that a cgroup hook fix was also necessary: with joined stderr/stdout, if the cgroup limit detection happens only in the execjob_epilogue, it is too late to append to stderr (the message does not make it into the joined file), so we need to write to stdout if we detect Join_Path set to 'oe'.

That was indeed a valid 'test failure', because users who use this feature would otherwise never see the cgroup limit exceeded messages in what is supposed to be a joined stdout/stderr.
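
A hedged sketch of the idea (the function and the spool file paths are hypothetical, not the hook's actual code): when the message is only detected in the epilogue and the job joins its output streams, it has to go to the stdout file.

```python
def write_limit_message(job, msg, stdout_path, stderr_path):
    """Append a cgroup limit message to the right spool file (illustrative paths)."""
    joined = str(getattr(job, 'Join_Path', '')) == 'oe'
    # with joined output, anything appended to stderr in the epilogue is lost,
    # so write the message to the stdout file instead
    target = stdout_path if joined else stderr_path
    with open(target, 'a') as out:
        out.write(msg + '\n')
```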

The dmesg searcher was also extended to support more recent kernels, which no longer have lines matching the old parser; for more recent kernels the lines added to stderr (or the joined stdout/stderr) are in the form:

Cgroup mem limit exceeded: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=4464.dragon,mems_allowed=0,oom_memcg=/pbs_jobs.service/jobid/4464.dragon,task_memcg=/pbs_jobs.service/jobid/4464.dragon,task=python,pid=711137,uid=4359
Cgroup memsw limit exceeded: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=4464.dragon,mems_allowed=0,oom_memcg=/pbs_jobs.service/jobid/4464.dragon,task_memcg=/pbs_jobs.service/jobid/4464.dragon,task=python,pid=711137,uid=4359

As before, they identify the limit whose failcnt was non-zero (here both limits were exceeded), the cgroup (which confirms the message is about the correct job) and the victim selected by the OOM killer. Site admins can find more information in the dmesg output.
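
A sketch of how such a line can be parsed (the regex and field handling are chosen for illustration; the hook's actual parser may differ):

```python
import re

# New-style oom-kill lines are a comma-separated list of key=value pairs.
OOM_RE = re.compile(r'oom-kill:(?P<fields>\S+)')

def parse_oom_kill(line):
    """Return the key=value fields of a new-style oom-kill dmesg line, or None."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    return dict(kv.split('=', 1) for kv in m.group('fields').split(',') if '=' in kv)

line = ('oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),'
        'cpuset=4464.dragon,mems_allowed=0,'
        'oom_memcg=/pbs_jobs.service/jobid/4464.dragon,'
        'task_memcg=/pbs_jobs.service/jobid/4464.dragon,'
        'task=python,pid=711137,uid=4359')
info = parse_oom_kill(line)
print(info['oom_memcg'], info['task'], info['pid'])  # cgroup, victim task, victim pid
```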

@alexis-cousein
Contributor Author

alexis-cousein commented Mar 27, 2021

The sibling test test_cgroup_enforce_memory is also incorrect: it uses a job that sets only a mem limit, together with a configuration file that enables "memory" and disables "memsw", leaving "swappiness" at the default of 1. That will not make the job exceed the memory limit in all cases: if the machine has swap, once the job reaches its memory limit it is allowed to push pages out to swap (and since memsw is disabled there is no limit to that). On some kernels there may not even be a failcnt recorded on the memory cgroup when the OOM killer wakes up because swap is depleted -- so the job may get killed because of a "general" lack of swap rather than a cgroup limit violation in its specific cgroup.

To avoid this behaviour, one can either enable memsw and set a vmem limit, or set swappiness to 0 in the configuration file. Since:
- there is already a separate test_cgroup_enforce_memsw for the former, and
- we want this test to also run on machines without swap, where memsw disabled and swappiness set to 0 is a more natural configuration (one that is also not tested elsewhere),
the latter is preferable, and was implemented by setting up a slightly different configuration file.

@alexis-cousein
Contributor Author

All tests that run without swap passed on CentOS7, CentOS8 and SLES15 on the test harness:

6763 | cgroups_cousein | Tests from given sources on platforms CENTOS7, SUSE15, CENTOS8 | alexis-cousein/pbspro@cgroups_rlimit_as | 27th Mar 2021 03:09:14.645 AM +01:00 | 00:32:36.812 | 159 | 0 | 0 | 0 | 0 | 0 | 39 | 120

159 run, 39 skipped (for lack of swap or because the number of nodes is too low), 120 passed.

All tests also pass on my machine which has swap enabled.

bayucan
bayucan previously approved these changes Mar 30, 2021
Contributor

@bayucan bayucan left a comment


Looks good.

@alexis-cousein
Contributor Author

Fixing a few other problems:
- Cleanup was not accurate: when 'cleaning up' freezers, it would "rm" rather than rmdir the freezer, and it would not clean up processes either. The current cleanup thaws all freezers (including those not tied to jobs, e.g. the PtlPbs freezer made by the test_cgroup_offline_node test), then kills processes in freezers whose directories look like jobs (i.e. a number followed by '.' and something else) before doing an rmdir; a rough sketch of this logic follows the list.
- Ditto for cleaning up other cgroups: there was no attempt to kill the processes before doing an rmdir, so lingering processes would make the cleanup function fail to mop up everything. I also made it resistant to future tests that create sub-cgroups inside the job cgroups.
- The PtlPbs freezer was made in the freezer root. We have no delegated control, so on some OSes systemd can move processes out of it into the (unfrozen) root freezer and clean it up because it "should not be there" (since we have not registered a service or slice with systemd), making the test fail. The error handling for that case (which I had not seen, since the freezer worked on my system) also threw an exception (hardly what you want when cleaning up a freezer) and was corrected.
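
A rough sketch of the cleanup logic described above (the mount point, the helper and the job-name pattern are illustrative; the real PTL code also walks sub-cgroups):

```python
import os
import re
import signal

FREEZER_ROOT = '/sys/fs/cgroup/freezer'   # illustrative cgroup v1 mount point
JOBDIR_RE = re.compile(r'^\d+\..+')       # "<number>.<something>" looks like a job

def cleanup_freezers(root=FREEZER_ROOT):
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            continue
        # 1. thaw every freezer, including those not tied to jobs
        with open(os.path.join(path, 'freezer.state'), 'w') as f:
            f.write('THAWED')
        if not JOBDIR_RE.match(name):
            continue
        # 2. kill lingering processes in job-like freezers only
        with open(os.path.join(path, 'tasks')) as f:
            for pid in f.read().split():
                try:
                    os.kill(int(pid), signal.SIGKILL)
                except OSError:
                    pass
        # 3. only then remove the (now empty) cgroup directory
        os.rmdir(path)
```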

@alexis-cousein
Contributor Author

All tests passed or were (correctly) skipped on the test harness, now also on Ubuntu 20.04:
[attached screenshot: PTL_results]

@alexis-cousein
Contributor Author

alexis-cousein commented Apr 6, 2021

There were a number of issues with exception handling in the hook and in the tests.

The tests weren't correctly skipping some cases on nodes that have no kernel support for swap accounting (aka memsw): self.swapctl is set to the string 'false' there, so 'if not self.swapctl' does not work.

Corrected that in the tests -- and when detecting memsw kernel support, an info message is now logged in all cases.
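
A minimal illustration of the pitfall (the attribute name is taken from the comment above, the values are assumed):

```python
swapctl = 'false'        # stored as a string, not a bool

if not swapctl:          # wrong: the non-empty string 'false' is truthy, so this never skips
    print('skip test')

if swapctl == 'false':   # correct: compare against the string value
    print('skip test')
```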

But I also made sure the hook does not fail with incomprehensible exceptions when memsw is enabled while kernel support for it is missing. I added guards around all the failing code paths encountered, reinstated some sanity checks for vmem_avail and vmem_default, and fixed an indentation issue in the lookup of memory.memsw.limit_in_bytes that made the periodic hook fail on hosts without swap accounting.

size_to_int was also fixed to return None instead of raising an incomprehensible exception when passed something that cannot be a PBS size (e.g. a negative number, which in this case resulted from failing to read the memory.memsw.limit_in_bytes file).
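
A hedged sketch of the defensive behaviour (not the hook's actual size_to_int; the unit handling is simplified):

```python
def size_to_int(value):
    """Convert a PBS size string such as '300mb' to bytes; return None when invalid."""
    suffixes = [('kb', 1024), ('mb', 1024 ** 2), ('gb', 1024 ** 3),
                ('tb', 1024 ** 4), ('b', 1)]
    try:
        text = str(value).strip().lower()
        for suffix, factor in suffixes:
            if text.endswith(suffix):
                number = int(text[:-len(suffix)])
                break
        else:
            number, factor = int(text), 1
        if number < 0:
            # e.g. the bogus value seen when memory.memsw.limit_in_bytes is unreadable
            return None
        return number * factor
    except (ValueError, TypeError):
        return None
```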

Finally, the cgroup hook was modified to neuter this "impossible" configuration: if memsw is enabled in the config file but kernel support is lacking, memsw is disabled and every hook event prints three messages in the logs at EVENT_ERROR level, starting with "CONFIG ERROR:", telling the site that something needs to be fixed.
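
A sketch of the idea in PBS hook terms (only pbs.logmsg and the EVENT_ERROR level mentioned above are taken from the hook API; the message texts, the helper names and the cgroup path checked are illustrative):

```python
import os
import pbs   # only available inside a PBS hook

def kernel_supports_memsw():
    # assumed check: the memsw limit file exists in the memory cgroup (cgroup v1)
    return os.path.exists('/sys/fs/cgroup/memory/memory.memsw.limit_in_bytes')

def neuter_memsw_if_unsupported(cfg, host):
    """Disable memsw when the kernel lacks swap accounting and complain loudly."""
    if cfg['cgroup']['memsw']['enabled'] and not kernel_supports_memsw():
        cfg['cgroup']['memsw']['enabled'] = False
        for msg in ('CONFIG ERROR: memsw is enabled in the hook configuration file,',
                    'CONFIG ERROR: but host %s has no kernel swap accounting support;' % host,
                    'CONFIG ERROR: disabling memsw, please fix the configuration file'):
            pbs.logmsg(pbs.EVENT_ERROR, msg)
    return cfg
```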

One test was forcibly creating two vnodes on all hosts, named hostname[0] and hostname[1], even when there is only one socket. Needless to say, configure_job does not take kindly to a request by the scheduler to assign memory on "hostname[1]", which it takes to be a request to assign it to the non-existing socket 1. The hook crashed with an ungainly exception (a KeyError on 1), so I modified the hook to fail with "Failed to assign resources" and a specific error message in the log.

The test itself needed to be rewritten to let the cgroup hook create the vnodes and work with what it gets, instead of messing with vnodes that are supposed to be managed by the cgroup hook.

It now sets vnode_per_numa_node, lets MoM create the extra vnodes, checks for the presence of a vnode for socket [1] (and skips if it is not present), tallies the total memory available on all vnodes of the first host, and creates a job that uses 1 CPU and all of that memory except the last 2mb, forced to run on the first MoM host. It then attempts to run a second job that asks for 1024mb more (to ensure the first job did not run merely because memory requests are ignored), which should fail because there is not enough memory on the requested host.

The test success criterion was also changed; it was looking for a "Can Never Run" message for insufficient mem, but if there is more than one node the message can also start with "Not Running" instead of that string ('Can Never Run' is printed when the entire complex does not have enough memory, not when the requested host does not), so it now just matches the "insufficient resource" part of the string.

Finally, some tests were printing a value without first asserting that the value is valid, resulting in an unreadable exception rather than a failure with a human-readable assertion message. That was fixed: these tests now first assert that the value is valid and only then print an informational message using it.

@alexis-cousein
Contributor Author

Also changed some tests that will be problematic on hosts that are not "real" VMs or hosts but containers. Some tests were using sync(), which if unfiltered is going to hang until the entire host kernel has flushed its buffer cache (with very uncertain timings, and creating a process that is unkillable while the sync is in progress and may mess up test cleanup).

I replaced it with a command to echo 3 to /proc/sys/vm/drop_caches, which on hosts and VMs will ask to flush the caches (but in contrast with sync will not wait around and will not create an unkillable process), on privileged containers will ask to flush the host caches, and on unprivileged containers will simply return an error (as /proc is not writeable on these containers).

Contributor

@nishiya nishiya left a comment


LGTM. Please attach latest test results.

@vstumpf vstumpf requested a review from bayucan April 6, 2021 21:43
@vstumpf vstumpf dismissed bayucan’s stale review April 6, 2021 21:43

to prevent errant merge

@alexis-cousein
Contributor Author

The drop_caches approach turns out to be a bad idea -- all it does on recent kernels is mark clean pages free, and if the tunables of the host OS are sane (i.e. vm.min_free_kbytes is not too low) that is pointless; it is the dirty pages that are hard to get rid of.

@alexis-cousein
Contributor Author

Tests on the test harness on all the OSes I could find either passed or were skipped.

Description: Rerun All Tests From #6905
Platforms: CENTOS7,SUSE15,CENTOS8,SLES12,SLES15,UBUNTU1604,UBUNTU1804,UBUNTU2004
Profile: cgroups_cousein

ID TOTAL FAIL ERROR TIMEDOUT SKIP PASS
6910 424 0 0 0 105 319

@nishiya
Contributor

nishiya commented Apr 7, 2021

Thanks for posting test results, looks good.
Need a maintainer review.

@alexis-cousein
Contributor Author

alexis-cousein commented Apr 8, 2021

The rlimit_as test was not robust on hosts that add extra lines of spam to job output; the test was changed to search all lines of the output for the required value.

New test results on the Test Harness:

Description: Tests from given sources on platforms CENTOS7, SUSE15, CENTOS8, SLES12, SLES15, UBUNTU1604, UBUNTU1804, UBUNTU2004
Platforms: CENTOS7,SUSE15,CENTOS8,SLES12,SLES15,UBUNTU1604,UBUNTU1804,UBUNTU2004
Profile: cgroups_cousein

ID TOTAL FAIL ERROR TIMEDOUT SKIP PASS
6957 424 0 0 0 104 320

Contributor

@bayucan bayucan left a comment


Looks good.

@bayucan
Contributor

bayucan commented Apr 12, 2021

@nishiya : All the tests are passing now after the latest commit from Alexis. I'll wait until you've seen the latest commit before I merge this in.

Contributor

@nishiya nishiya left a comment


LGTM

@bayucan bayucan merged commit d449a6b into openpbs:master Apr 13, 2021
@bayucan
Contributor

bayucan commented Apr 13, 2021

Merged #2332 into master.
