
Job status Z for some parallel jobs #191

Closed · bart1 opened this issue Apr 1, 2020 · 15 comments

bart1 commented Apr 1, 2020

I guess this is more a question than a real bug, and I'm not sure how to proceed to debug or solve it. I'm using drake with clustermq on a cluster. I reserve a node with 16 cores and plenty of memory. Some of my jobs require quite a bit of data to run, and I suspect there might be a slight bottleneck there.

When I retrieve the top output from the node I get this result:

top - 11:29:52 up 17 min,  0 users,  load average: 8.51, 37.31, 44.33
Tasks: 269 total,   9 running, 252 sleeping,   0 stopped,   8 zombie
%Cpu(s): 56.5 us,  2.0 sy,  0.0 ni, 32.0 id,  9.4 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 97566520 total, 56060736 free, 32877460 used,  8628324 buff/cache
KiB Swap:  8388604 total,  1861784 free,  6526820 used. 63820784 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  4575 bartk     20   0 5392324 3.679g  17880 R 100.0  4.0   5:08.03 R
  4533 bartk     20   0 5467368 3.505g  11020 R  94.1  3.8   5:11.34 R
  4545 bartk     20   0 5504140 3.744g  17700 R  94.1  4.0   5:15.31 R
  4548 bartk     20   0 4651516 2.957g  18336 R  94.1  3.2   4:59.44 R
  4551 bartk     20   0 5403236 3.571g  11028 R  94.1  3.8   5:12.47 R
  4557 bartk     20   0 5493364 3.832g  17836 R  94.1  4.1   5:03.62 R
  4566 bartk     20   0 5492232 3.820g  17828 R  94.1  4.1   5:08.13 R
  4563 bartk     20   0 5970844 4.251g  10780 R  88.2  4.6   4:57.96 R
  5082 bartk     20   0   45504   3728   3044 R   5.9  0.0   0:00.01 top
  4488 bartk     20   0 1366912 191732   8424 S   0.0  0.2   0:31.45 R
  4536 bartk     20   0       0      0      0 Z   0.0  0.0   0:28.46 R
  4539 bartk     20   0       0      0      0 Z   0.0  0.0   1:29.11 R
  4542 bartk     20   0       0      0      0 Z   0.0  0.0   0:37.59 R
  4554 bartk     20   0       0      0      0 Z   0.0  0.0   0:17.86 R
  4560 bartk     20   0       0      0      0 Z   0.0  0.0   0:16.98 R
  4569 bartk     20   0       0      0      0 Z   0.0  0.0   0:58.99 R
  4572 bartk     20   0       0      0      0 Z   0.0  0.0   0:17.56 R
  4578 bartk     20   0       0      0      0 Z   0.0  0.0   0:15.23 R
  5081 bartk     20   0   95072   4712   3372 S   0.0  0.0   0:00.00 sshd

I notice that 8 of the 16 parallel processes have status Z (defunct "zombie" process: terminated but not reaped by its parent). This number slowly grows over time, which means I am not using the 16 available cores very efficiently.
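
To quantify this over time, here is a minimal monitoring sketch (it assumes a GNU/Linux ps on the compute node) that counts R processes per state:

# Count R processes per state on this node; Z = zombie
states <- system2("ps", c("-C", "R", "-o", "stat="), stdout = TRUE)
print(table(substr(trimws(states), 1, 1)))  # e.g. counts for R/S/D/Z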

My submission script looks as follows (slightly simplified):

#!/sw/arch/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/bin/Rscript
#SBATCH --job-name=build           # job name
#SBATCH --partition=normal                 # partition
#SBATCH --output=.build.o # you can add .%a for array index
#SBATCH --error=.build.e  # log file
#SBATCH --mem=32136
#SBATCH -n 16
#SBATCH --time=23:59:00
options(clustermq.scheduler = "multicore")
source("targets.R")
source("_drake.R")
for (i in seq_along(targets)) {
  make(envir$plan, parallelism = "clustermq",
       jobs = 16,
       sleep = function(i) 0.03,
       memory_strategy = "preclean",
       caching = "worker",
       garbage_collection = TRUE, envir = envir,
       keep_going = FALSE, targets = targets[[i]])
}

What are the best ways to debug this, and are there settings that can prevent it from happening?

bart1 commented Apr 1, 2020

Note that I run the master process on the same node as the workers, as my cluster kills anything long-running on the login node.

@mschubert (Owner)

Hi,
Is there any chance you can reproduce this behavior without drake, in a self-contained script?

bart1 commented Apr 1, 2020

I will give it a try

bart1 commented Apr 1, 2020

OK, I have been able to reproduce it with a self-contained example, though still using drake (I will try a bit further with only clustermq):

top - 14:40:35 up 10 min,  0 users,  load average: 12.75, 5.93, 2.34
Tasks: 308 total,   7 running, 299 sleeping,   0 stopped,   2 zombie
%Cpu(s):  6.8 us,  2.5 sy,  0.0 ni, 79.5 id, 11.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 97566520 total, 65024864 free, 30798916 used,  1742740 buff/cache
KiB Swap:  8388604 total,  3543864 free,  4844740 used. 65909836 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  3630 bartk     20   0 2851776 1.743g  16396 R 100.0  1.9   0:29.38 R
  3603 bartk     20   0 2811220 2.249g  16216 D  88.2  2.4   0:38.69 R
  3604 bartk     20   0 2885020 2.257g  16400 R  88.2  2.4   0:27.22 R
  3615 bartk     20   0 3107720 2.187g  16388 R  88.2  2.4   0:29.86 R
  3621 bartk     20   0 2687168 2.006g  16272 R  88.2  2.2   0:34.91 R
  3633 bartk     20   0 2680832 1.999g  16164 R  88.2  2.1   0:36.08 R
  3612 bartk     20   0 1692688 1.055g  16016 S  64.7  1.1   0:49.39 R
  3636 bartk     20   0 3587672 2.855g  15980 R  58.8  3.1   0:30.94 R
  3609 bartk     20   0 2955336 2.274g   7052 D  17.6  2.4   0:30.81 R
  3627 bartk     20   0 4054740 2.900g   5320 D  17.6  3.1   0:26.80 R
  3639 bartk     20   0 2580744 1.720g   7556 D  11.8  1.8   0:21.24 R
  3581 bartk     20   0  688140  68916   8176 S   5.9  0.1   0:02.05 R
  3642 bartk     20   0 3370348 2.535g   6916 D   5.9  2.7   0:29.66 R
  4244 bartk     20   0   45636   3664   2928 R   5.9  0.0   0:00.01 top
  3599 bartk     20   0 3547912 379372   3868 D   0.0  0.4   0:17.85 R
  3600 bartk     20   0       0      0      0 Z   0.0  0.0   0:13.46 R
  3618 bartk     20   0       0      0      0 Z   0.0  0.0   0:12.22 R
  3624 bartk     20   0 3263740 1.694g   3872 D   0.0  1.8   0:19.83 R
  4243 bartk     20   0   95072   4740   3388 S   0.0  0.0   0:00.00 sshd
The script:

#!/sw/arch/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/bin/Rscript
#SBATCH --job-name=test # job name
#SBATCH --partition=normal                 # partition
#SBATCH --output=.test.o # you can add .%a for array index
#SBATCH --error=.test.e  # log file
#SBATCH --mem=32136
#SBATCH -n 16
#SBATCH --time=00:19:00
options(clustermq.scheduler = "multicore")
require(clustermq)
f <- list.files("data", full.names = TRUE)
require(stars)
fx <- function(x) st_as_stars(st_apply(read_stars(x, proxy = TRUE)[2, ], 2, mean))
require(drake)
p <- drake_plan(
  files = list.files("data", full.names = TRUE),
  ttt = target(fx(files), dynamic = map(files)))
make(p,
     parallelism = "clustermq", jobs = 16, caching = "worker")

Note that the data directory contains large NetCDF files (about 80 of them).

bart1 commented Apr 1, 2020

OK, I now also have an example without drake:

top - 14:48:35 up  3:30,  0 users,  load average: 11.15, 10.76, 11.37
Tasks: 308 total,   6 running, 299 sleeping,   0 stopped,   3 zombie
%Cpu(s): 53.8 us,  2.5 sy,  0.0 ni, 42.5 id,  1.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 97566520 total, 69849712 free, 26405552 used,  1311260 buff/cache
KiB Swap:  8388604 total,  5570456 free,  2818148 used. 70281664 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 50130 bartk     20   0 3040808 2.407g  16484 R  94.1  2.6   0:25.33 R
 50140 bartk     20   0 3375820 2.590g   7568 R  88.2  2.8   0:22.92 R
 50146 bartk     20   0 2991472 2.359g  15840 R  88.2  2.5   0:26.34 R
 50144 bartk     20   0 3028436 2.391g  15832 R  70.6  2.6   0:24.81 R
 50156 bartk     20   0 2508280 1.541g   6540 D  11.8  1.7   0:21.20 R
 50113 bartk     20   0 2393668 1.529g   7348 D   5.9  1.6   0:21.45 R
 50116 bartk     20   0 2400176 1.460g   6652 D   5.9  1.6   0:21.69 R
 50120 bartk     20   0 2341420 1.812g   7292 D   5.9  1.9   0:21.39 R
 50121 bartk     20   0 2341420 1.514g   7136 D   5.9  1.6   0:20.98 R
 50134 bartk     20   0 2531216 1.554g   6520 R   5.9  1.7   0:21.94 R
 50150 bartk     20   0 2472496 1.551g   6520 D   5.9  1.7   0:20.96 R
 50797 bartk     20   0   45632   3792   3060 R   5.9  0.0   0:00.02 top
 50082 bartk     20   0  610944  12932   7224 S   0.0  0.0   0:01.27 R
 50112 bartk     20   0       0      0      0 Z   0.0  0.0   0:16.97 R
 50117 bartk     20   0 2341424 1.565g   6492 D   0.0  1.7   0:21.41 R
 50124 bartk     20   0       0      0      0 Z   0.0  0.0   0:11.07 R
 50137 bartk     20   0 2502012 1.833g   7108 D   0.0  2.0   0:21.55 R
 50153 bartk     20   0       0      0      0 Z   0.0  0.0   0:11.56 R
 50796 bartk     20   0   95072   4880   3524 S   0.0  0.0   0:00.00 sshd

The R script:

#!/sw/arch/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/bin/Rscript
#SBATCH --job-name=test # job name
#SBATCH --partition=normal                 # partition
#SBATCH --output=.test.o # you can add .%a for array index
#SBATCH --error=.test.e  # log file
#SBATCH --mem=32136
#SBATCH -n 16
#SBATCH --time=00:19:00
options(clustermq.scheduler = "multicore")
require(clustermq)
f <- list.files("data", full.names = TRUE)
require(stars)
fx <- function(x) st_as_stars(st_apply(read_stars(x, proxy = TRUE)[2, ], 2, mean))

a <- Q(fx, x = f, n_jobs = 16)

The job also does not finish properly (all processes end up in either S (sleeping) or Z state); I guess this is because some jobs never finish:

top - 14:51:35 up  3:33,  0 users,  load average: 0.97, 6.53, 9.66
Tasks: 297 total,   1 running, 293 sleeping,   0 stopped,   3 zombie
%Cpu(s): 53.1 us,  2.5 sy,  0.0 ni, 43.2 id,  1.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 97566520 total, 82542632 free, 13577792 used,  1446096 buff/cache
KiB Swap:  8388604 total,  8259016 free,   129588 used. 83109744 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 50082 bartk     20   0  610944  26352   5864 S   0.0  0.0   0:01.35 R
 50112 bartk     20   0       0      0      0 Z   0.0  0.0   0:16.97 R
 50113 bartk     20   0 1552856 1.112g  15520 S   0.0  1.2   0:31.62 R
 50116 bartk     20   0 1557364 1.117g  15924 S   0.0  1.2   0:32.26 R
 50117 bartk     20   0 1463864 1.022g   9208 S   0.0  1.1   0:33.20 R
 50120 bartk     20   0 1463856 1.027g  15444 S   0.0  1.1   0:34.56 R
 50121 bartk     20   0 1518812 1.080g  15428 S   0.0  1.2   0:31.13 R
 50124 bartk     20   0       0      0      0 Z   0.0  0.0   0:11.07 R
 50130 bartk     20   0  852360 326608  15832 S   0.0  0.3   0:40.69 R
 50134 bartk     20   0 1627088 1.050g   7084 S   0.0  1.1   0:33.30 R
 50137 bartk     20   0 1270948 753120  15284 S   0.0  0.8   0:35.26 R
 50140 bartk     20   0 1507416 989780  15452 S   0.0  1.0   0:35.82 R
 50144 bartk     20   0 1680428 1.101g   7608 S   0.0  1.2   0:41.68 R
 50146 bartk     20   0 1627112 1.051g   8468 S   0.0  1.1   0:41.80 R
 50150 bartk     20   0 1591820 1.018g   9180 S   0.0  1.1   0:32.83 R
 50153 bartk     20   0       0      0      0 Z   0.0  0.0   0:11.56 R
 50156 bartk     20   0 1278968 729396  15912 S   0.0  0.7   0:31.33 R
 51009 bartk     20   0   95072   3680   2320 S   0.0  0.0   0:00.00 sshd
 51010 bartk     20   0   45628   3736   3012 R   0.0  0.0   0:00.01 top

@mschubert (Owner)

Ok, great! Now we know this is a potential clustermq issue and not related to drake. Or, it could be that your function call crashes R for some reason. So let's test that.

I would now check the following:

  • Do your computations finish successfully if clustermq.scheduler="local"?
  • Do your computations finish successfully if you submit them as jobs (clustermq.scheduler="slurm")? (See the sketch below.)
  • What do the logs say if you activate them?

(Unfortunately, the logging on multicore is a bit underdeveloped. I still need to fix that.)
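
For example, a minimal sketch of the first two checks, reusing the fx and f from your script above:

# Check 1: run everything sequentially in the main R session
options(clustermq.scheduler = "local")
a_local <- Q(fx, x = f, n_jobs = 1)

# Check 2: submit each worker as its own SLURM job, which also enables logging
options(clustermq.scheduler = "slurm")
a_slurm <- Q(fx, x = f, n_jobs = 16, log_worker = TRUE)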

bart1 commented Apr 1, 2020

I have tested with clustermq.scheduler="local"; that finished without any issues. I tried activating the logs for a multicore call as follows:

a <- Q(fx, x = f, n_jobs = 16, log_worker = TRUE, template = list(log_file = "clustermq.log"))

However, no log file is produced. Is there a specific way to enable logging for multicore?

@mschubert (Owner)

> (Unfortunately, the logging on multicore is a bit underdeveloped. I still need to fix that.)

You'd need to run these as SLURM jobs; multicore workers don't support proper logging (for now).

bart1 commented Apr 2, 2020

I have been trying a bit further and ran it in two different ways. First, I tried to run 16 workers on a node, using the following script and template (renamed so I could attach them):

test2.txt

slurm_clustermq.tmpl.txt

This job does not finish properly and the worker node does not shut down before I cancel it (last line in the log). The worker node produces the following log:

slurm-4669134.out.log

The master node has the following log:

Loading required package: clustermq
Loading required package: stars
Loading required package: abind
Loading required package: sf
Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 5.0.0
Submitting 16 worker jobs (ID: 6500) ...
Running 84 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Error in self$receive_data(timeout, with_checks = with_checks) : 
  Trying to receive data without workers
Calls: Q ... master -> <Anonymous> -> <Anonymous> -> <Anonymous>
Execution halted

I'm not sure if this is the same problem or a different one.

Single worker on a node

This works properly but is rather inefficient:

test2.txt
template.txt

Output of the worker node:

slurm-4669207.out.log

Successful output of the master node:

Loading required package: clustermq
Loading required package: stars
Loading required package: abind
Loading required package: sf
Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 5.0.0
Submitting 16 worker jobs (ID: 7629) ...
Warning in private$fill_options(...) :
  Add 'CMQ_AUTH={{ auth }}' to template to enable socket authentication
Running 84 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [328.8s 0.1% CPU]; Worker: [avg 99.3% CPU, max 3248.9 Mb]
c(16142.1510370588, 16173.7716961821, 16205.5004053062, 16237.1328383444, 16269.9509591846, 16303.2011487977, 16337.6605700659, 16373.347024536, 16407.6724520324, 16442.6700693559, 16475.7976513871, 16506.0830752003, 16536.2453834126, 16566.3473387541, 16595.3064131182, 16624.8984663854, 16655.0043924283, 16684.3316444557, 16713.5975583934, 16742.8486463249, 16772.5408509392, 16802.3021203059, 16833.0455248498, 16864.0435662742, 16895.6030165495, 16926.9370166867, 16958.0463232466, 16989.6115128455, 
17021.1157997998, 17053.6974611012, 17085.5259974742, 17117.5674545038, 17148.8087734615, 17179.8319295221, 17211.0570863137, 17242.0665218813, 17273.3808514385, 17303.5514877842, 17334.3936776904, 17364.7727102406, 17395.3072871116, 17424.8993085654, 17453.9778966651, 17482.6439314819, 17510.340652888, 17537.9195804667, 17564.0312136117, 17590.2533603023, 17614.9966315126, 17639.859933425, 17663.3394946399, 17686.1774068251, 17708.4757907717, 17729.4796733955, 17750.8122056193, 17769.9326833986, 
17788.9847515813, 17805.762140778, 17822.4755245362, 17837.6002887535, 17852.4804919686, 17866.7851890507, 17879.6646449455, 17893.8523693818, 17905.2133034273, 17919.0947122158, 17928.7579626699, 17943.2143268231, 17955.9000604121, 17960.520934751, 17970.2885856203, 17969.3942641831, 17971.6104533665, 17974.8370766788, 17975.676868079, 17981.3611348064, 17980.806931675, 17982.1201302886, 17982.1648160881, 17981.0783093266, 17978.9898745255)

Session info

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /sara/eb/AVX2/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/lib64/R/lib/libR.so
LAPACK: /sara/eb/AVX2/Debian9/EB_production/2019/software/R/3.5.1-intel-2018b/lib64/R/modules/lapack.so

locale:
 [1] LC_CTYPE=en_US       LC_NUMERIC=C         LC_TIME=en_US       
 [4] LC_COLLATE=en_US     LC_MONETARY=en_US    LC_MESSAGES=en_US   
 [7] LC_PAPER=en_US       LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_US LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustermq_0.8.9

loaded via a namespace (and not attached):
[1] compiler_3.5.1

@mschubert (Owner)

Sorry for the late reply.

I'm a bit confused about what might be going on here.

Can you split each worker log into a separate file?
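
If it helps, per-worker log files should be possible by putting the SLURM array index into the log file name, since the template's {{ log_file }} field feeds #SBATCH --output (a sketch; the cmq_worker_%a.log name is just an example):

a <- Q(fx, x = f, n_jobs = 16, log_worker = TRUE,
       template = list(log_file = "cmq_worker_%a.log"))  # %a = array task index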

bart1 commented Apr 7, 2020

I have been working a bit further, and the issue does not seem to occur if I allocate a lot more memory (although I do not have a solid reproduction yet). It might thus be that the processes get killed by some controlling software when memory gets tight.
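
If memory pressure is indeed the cause, the kernel log on the compute node should show it (a rough check; this assumes dmesg is readable by regular users on the cluster):

# Look for OOM-killer entries that would explain workers disappearing as zombies
system("dmesg | grep -i -E 'out of memory|killed process' | tail -n 20")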

@mschubert (Owner)

Did you find out any more about the cause of this?

mschubert commented Jun 21, 2020

Ok, looking at your submission script again:

#SBATCH --array=1-{{ n_jobs }}              # job array
#SBATCH --time=00:29:00 # default lisa is 5 minutes
R -e 'sessionInfo()'
echo 'asdf'
ulimit -v $(( 1024 * {{ memory | 11096 }} ))
echo $SLURM_NTASKS
for i in {1..22} 
do 
  echo $i
  (CMQ_AUTH={{ auth }} R --quiet --no-save --no-restore -e 'clustermq:::worker("{{ master }}")' ) & 
done

This will not work, because clustermq expects n_jobs workers, but you are submitting 22*n_jobs. So when all jobs have been sent, it shuts down the additional workers (subtracting one each time from what it thinks is running), and will soon think that there are no workers left. So the loop should not be there. This is where the "Trying to receive data without workers" error comes from.
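
A sketch of how the relevant template lines could look instead, letting the array give you exactly one worker per task, with no loop:

#SBATCH --array=1-{{ n_jobs }}              # one array task per worker
#SBATCH --time=00:29:00
ulimit -v $(( 1024 * {{ memory | 11096 }} ))
CMQ_AUTH={{ auth }} R --quiet --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'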

Unfortunately, I will only be able to help you further once this is fixed and we can see the SLURM logs. It would also be great if you could provide an example I can reproduce without relying on external data.
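
For example, something along these lines (a hypothetical sketch that replaces the NetCDF reads with a dummy memory-heavy function) would already be useful:

library(clustermq)
options(clustermq.scheduler = "multicore")
fx <- function(x) { v <- rnorm(5e7); mean(v) + x }  # ~400 Mb per call, no external data needed
a <- Q(fx, x = 1:80, n_jobs = 16)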

@mschubert (Owner)

According to #179, this is fixed in develop. Please reopen if that is not the case, including this information.

bart1 commented Jun 30, 2020

Thank you for resolving this; I will let you know if it occurs again.
