
Memory issues with RiboDetector #9

Closed
akrinos opened this issue Mar 1, 2022 · 23 comments
Labels
bug (Something isn't working), enhancement (New feature or request), onnxruntime (onnxruntime on SLURM hangs), SLURM

Comments

akrinos commented Mar 1, 2022

Hi, thanks for the awesome tool! I have been trying to run it, and keep running into issues with memory in CPU mode. I am running with 10 threads and a chunk size of 256 on paired-end sequence read files. The job appears to fail due to over-consumption of memory when trying to write out non-rRNA sequences, even though the node should have 180 GB of RAM allocated. The failure is not within RiboDetector itself (though previously I did get a MemoryError within Python); rather, the job scheduler is cancelling the job.
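
The scheduler's side of this can be checked in SLURM's accounting records; a minimal sketch, with a placeholder job ID:

# Show peak memory, requested memory, and exit state for the killed job:
sacct -j 12345678 --format=JobID,MaxRSS,ReqMem,State,ExitCode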

I have checked the read length and tried reducing the threads to 5 and the chunk size to 128. Are there other things that can be tried? And does writing the output as gz change the memory use or the time taken?

Thanks in advance!

akrinos commented Mar 1, 2022

Upon looking into this further, it is potentially a SLURM configuration issue, at least in part. I will keep you posted if I figure it out!

dawnmy commented Mar 1, 2022

Hi,

Thank you for reporting this issue. I have tested it on SGE and it worked without any issue; I would expect SLURM to be similar. Are you able to run it interactively on a computer/server without using SLURM? A chunk size of 256 should use less than 10 GB of memory.
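
For an interactive test, something along these lines could work (a sketch; the partition name, resource values, read length, and file names are all placeholders):

# Request an interactive shell on a compute node, then run RiboDetector directly:
srun --partition=interactive --cpus-per-task=10 --mem=40G --pty bash
ribodetector_cpu -t 10 -l 101 -i reads_R1.fq.gz reads_R2.fq.gz -e rrna --chunk_size 256 -o out_R1.fq.gz out_R2.fq.gz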

> I have checked the read length and tried reducing the threads to 5 and the chunk size to 128. Are there other things that can be tried? And does writing the output as gz change the memory use or the time taken?

Output with gz will increase the runtime (compression needs time) but not the memory use.

Looking forward to your updates.

Best,
ZL

akrinos commented Mar 1, 2022

Thanks for getting back to me! I tried running a single sample and got a MemoryError from Python again:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "<path-to-conda>/ribodetector/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "<path-to-conda>/ribodetector/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "<path-to-conda>/ribodetector/lib/python3.9/multiprocessing/pool.py", line 576, in _handle_results
    task = get()
  File "<path-to-conda>/ribodetector/lib/python3.9/multiprocessing/connection.py", line 256, in recv
    return _ForkingPickler.loads(buf.getbuffer())
MemoryError

Using the following call to RiboDetector:

ribodetector_cpu -t 8 \
    -l 101 \
    -i $r1 $r2 \
    -r $out1 $out2 \
    -e rrna \
    --chunk_size 128 \
    -o $outtotal1 $outtotal2

where each variable denoted with $ just stores the name of a file used for input or output. I can provide a few lines of the gzipped fastq files I am using as input, if that helps!

dawnmy commented Mar 1, 2022

Which OS are you using? I think the multiprocessing code does not work on Windows.

akrinos commented Mar 1, 2022

This is on a CentOS 7 Linux system.

dawnmy commented Mar 1, 2022

Could you send me example fastq files that can reproduce this error? How many bases and reads are in your input files?

dawnmy commented Mar 1, 2022

@akrinos Could you also check, with top or ps, how many RiboDetector processes and threads are actually running after you submit the job? It might be an issue caused by wrong NUM_THREADS environment variable settings (e.g., OMP_NUM_THREADS).
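
For example (a sketch; the process may show up as python rather than ribodetector, and the thread count of 8 is illustrative):

# Count processes and their threads on the job's node (nlwp = number of threads per process):
ps -u $USER -o pid,nlwp,pcpu,pmem,comm | grep -E 'ribodetector|python'

# Pin the common thread-count variables before launching, in case a library oversubscribes:
export OMP_NUM_THREADS=8
export OPENBLAS_NUM_THREADS=8
export MKL_NUM_THREADS=8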

akrinos commented Mar 1, 2022

Hi, thanks for all the help! The files are quite large, 5-8 GB per gzipped paired-end fastq file (meta-omic data). I've been trying to get it to run with a single thread, but I am confused as to why the total size of the file would cause the memory issue so early on. The single-thread run has been going for several hours, so I'll let you know if I have any luck with that (it has not failed, though). I also just tried another parallelized run with the relevant environment variables set beforehand. I will also see if I can put together a test data set that I can send over!

dawnmy commented Mar 1, 2022

The large input files could be the reason; you can confirm this with a subset of your data. Before the sequences are converted to features, they are all loaded into memory. The current chunk_size only controls how much memory is used to convert sequences into features (numbers), so the size of the uncompressed input sequence data should not exceed the total RAM. 5-8 GB of compressed data per end is unusually large, although 180 GB of RAM should still be enough. I suspect the free memory on your server is much smaller than 180 GB because of other running processes? I will add chunk-size support for loading the sequences into memory in a future release.
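
A subset for such a test could be made along these lines (a sketch; file names are placeholders, and 4,000,000 FASTQ lines correspond to 1 million reads per end):

zcat sample_R1.fastq.gz | head -n 4000000 | gzip > subset_R1.fastq.gz
zcat sample_R2.fastq.gz | head -n 4000000 | gzip > subset_R2.fastq.gz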

dawnmy commented Mar 2, 2022

Another user also reported a similar memory issue while running RiboDetector with PBS. It would be great if you could post the job script you used to submit the job; it will help us locate the root cause of the issue.

akrinos commented Mar 2, 2022

Hi, thanks so much for the work you're putting into this! I would just like to note that the job appears to be killed for lack of memory on a write-out step. As for SLURM, I'm using Snakemake integration - these are the cluster parameters I most recently used for the submission when I tried really high memory:

__default__:
  account: <account-name>
  command_options:
    slurm:
      account: --account={}
      command: sbatch --parsable
      key_mapping: null
      mem: --mem={}g
      name: --job-name={}
      nodes: -N {}
      queue: --partition={}
      threads: -n {}
      time: --time={}
  cpupertask: 20
  mem: 1000
  nodes: 1
  queue: <queue-name>
  system: slurm
  tasks: 1
  time: 5000

dawnmy added the onnxruntime and SLURM labels Mar 3, 2022
sjaenick commented Mar 4, 2022

Depending on the SLURM setup, cluster jobs might be executed within the context of a Linux cgroup, which limits the maximum amount of, e.g., memory that can be allocated before a process is aborted. Try increasing your memory request when submitting cluster jobs.
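
For example (a sketch; the memory value and script name are placeholders):

# Request more memory for the job than the uncompressed input data needs:
sbatch --mem=200G --cpus-per-task=20 run_ribodetector.sh

# From inside a running job, /proc/self/cgroup shows which cgroup the job landed in;
# under cgroup v1, its memory cap is in the corresponding memory.limit_in_bytes file.
cat /proc/self/cgroup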

dawnmy commented Mar 4, 2022

Does mem: 1000 mean 1 GB of memory? If yes, this will be too small for your input data. I found that the onnxruntime package RiboDetector uses for CPU inference has a compatibility issue with SLURM. This issue seems to be related: #6

akrinos commented Mar 4, 2022

@dawnmy mem: 1000 is memory in gigabytes (i.e., 1 TB of memory), per the --mem={}g mapping above.

akrinos commented Mar 6, 2022

pip3 install --force-reinstall onnxruntime, as suggested by @sjaenick, made it so that RiboDetector has now run successfully on several of my samples despite the large file sizes.

akrinos commented Mar 6, 2022

But trying to increase threads still doesn't totally work, probably because of the large file size.

dawnmy commented Mar 6, 2022

> But trying to increase threads still doesn't totally work, probably because of the large file size.

What do you mean by "doesn't totally work"? RiboDetector uses only two CPUs while loading the input paired-end files into memory. After loading, the encoding and prediction will utilize the full specified number of CPUs.

akrinos commented Mar 6, 2022

I say "doesn't totally work" because I started with 1 thread (which worked), tried 2 (didn't work on all samples), and 20 threads did not work regardless of chunk size - still doing some testing

dawnmy commented Mar 6, 2022

Did you run all samples at once or one by one? It is better to run only one sample at a time, with multiple CPUs (-t). Running multiple samples at the same time will not be faster for the same total number of CPUs and will consume much more memory.
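
A sequential pattern along these lines keeps one sample running at a time with all the CPUs (a sketch; the file-naming scheme and thread count are placeholders):

# Process each paired-end sample in turn, deriving R2 and output names from R1:
for r1 in *_R1.fastq.gz; do
    r2=${r1/_R1/_R2}
    ribodetector_cpu -t 20 -l 101 \
        -i "$r1" "$r2" \
        -e rrna --chunk_size 128 \
        -o "nonrrna_$r1" "nonrrna_$r2"
done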

akrinos commented Mar 6, 2022

One by one, not at the same time

dawnmy commented Mar 7, 2022

This issue should be solved as of v0.2.4. Please update with:

pip install ribodetector -U

dawnmy commented Mar 8, 2022

This issue seems to be solved, so I will close it. Feel free to reopen if it is still not working.

dawnmy closed this as completed Mar 8, 2022
dawnmy commented Apr 21, 2022

@akrinos The new release v0.2.6 can substantially reduce the memory use for large input files via the chunk_size parameter. You can update it with pip. See #13
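
For example (a sketch; file names and the thread count are placeholders):

pip install -U ribodetector
ribodetector_cpu -t 20 -l 101 \
    -i sample_R1.fastq.gz sample_R2.fastq.gz \
    -e rrna --chunk_size 256 \
    -o sample_nonrrna_R1.fastq.gz sample_nonrrna_R2.fastq.gz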
