
Memory issues with RiboDetector #9

Closed
akrinos opened this issue Mar 1, 2022 · 23 comments
Labels
bug (Something isn't working), enhancement (New feature or request), onnxruntime (onnxruntime on SLURM hangs), SLURM

Comments

akrinos commented Mar 1, 2022

Hi, thanks for the awesome tool! I have been trying to run it, and keep running into issues with memory in CPU mode. I am running with 10 threads and a chunk size of 256 on paired-end sequence read files. The job appears to fail due to over-consumption of memory when trying to write out non-rRNA sequences, even though the node should have 180 GB of RAM allocated. The failure is not within RiboDetector itself (though previously I did get a MemoryError within Python); rather, the job scheduler is cancelling the job.
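
The scheduler's side of this can be checked in SLURM's accounting records; a minimal sketch, with a placeholder job ID:

# Show peak memory, requested memory, and exit state for the killed job:
sacct -j 12345678 --format=JobID,MaxRSS,ReqMem,State,ExitCode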

I have checked the read length and tried reducing the threads to 5 and the chunk size to 128. Are there other things that can be tried? And does writing the output as gz change the memory use or the time taken?

Thanks in advance!

akrinos commented Mar 1, 2022

Upon looking into this further, it is potentially a SLURM configuration issue, at least in part. I will keep you posted if I figure it out!

dawnmy commented Mar 1, 2022

Hi,

Thank you for reporting this issue. I have tested it on SGE and it worked without any issue; I would expect SLURM to be similar. Are you able to run it interactively on a computer/server without using SLURM? A chunk size of 256 should use less than 10 GB of memory.
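
For an interactive test, something along these lines could work (a sketch; the partition name, resource values, read length, and file names are all placeholders):

# Request an interactive shell on a compute node, then run RiboDetector directly:
srun --partition=interactive --cpus-per-task=10 --mem=40G --pty bash
ribodetector_cpu -t 10 -l 101 -i reads_R1.fq.gz reads_R2.fq.gz -e rrna --chunk_size 256 -o out_R1.fq.gz out_R2.fq.gz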

> I have checked the read length and tried reducing the threads to 5 and the chunk size to 128. Are there other things that can be tried? And does writing the output as gz change the memory use or the time taken?

Output with gz will increase the runtime (compression needs time) but not the memory use.

Looking forward to your updates.

Best,
ZL

akrinos commented Mar 1, 2022

Thanks for getting back to me! I tried running a single sample and got a MemoryError from Python again:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "<path-to-conda>/ribodetector/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "<path-to-conda>/ribodetector/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "<path-to-conda>/ribodetector/lib/python3.9/multiprocessing/pool.py", line 576, in _handle_results
    task = get()
  File "<path-to-conda>/ribodetector/lib/python3.9/multiprocessing/connection.py", line 256, in recv
    return _ForkingPickler.loads(buf.getbuffer())
MemoryError

Using the following call to RiboDetector:

ribodetector_cpu -t 8 \
    -l 101 \
    -i $r1 $r2 \
    -r $out1 $out2 \
    -e rrna \
    --chunk_size 128 \
    -o $outtotal1 $outtotal2

where each variable denoted with $ just stores the name of a file used for input or output. I can provide a few lines of the gzipped fastq files I am using as input, if that helps!

dawnmy commented Mar 1, 2022

Which OS are you using? I think the multiprocessing code does not work on Windows.

akrinos commented Mar 1, 2022

This is on a CentOS 7 Linux system.

dawnmy commented Mar 1, 2022

Could you send me example fastq files that can reproduce this error? How many bases and reads are in your input files?

dawnmy commented Mar 1, 2022

@akrinos Could you also check, with top or ps, how many RiboDetector processes and threads are actually running after you submit the job? It might be an issue caused by wrong NUM_THREADS environment variable settings (e.g., OMP_NUM_THREADS).
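
For example (a sketch; the process may show up as python rather than ribodetector, and the thread count of 8 is illustrative):

# Count processes and their threads on the job's node (nlwp = number of threads per process):
ps -u $USER -o pid,nlwp,pcpu,pmem,comm | grep -E 'ribodetector|python'

# Pin the common thread-count variables before launching, in case a library oversubscribes:
export OMP_NUM_THREADS=8
export OPENBLAS_NUM_THREADS=8
export MKL_NUM_THREADS=8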

akrinos commented Mar 1, 2022

Hi, thanks for all the help! The files are quite large, 5-8 GB per gzipped paired-end fastq file (meta-omic data). I've been trying to get it to run with a single thread, but I am confused as to why the total size of the file would cause the memory issue so early on. The single-thread run has been going for several hours, so I'll let you know if I have any luck with that (it has not failed, though). I also just tried another parallelized run with the relevant environment variables set beforehand. I will also see if I can put together a test data set that I can send over!

dawnmy commented Mar 1, 2022

The large input files could be the reason; you can confirm this with a subset of your data. Before the sequences are converted to features, they are all loaded into memory. The current chunk_size only controls how much memory is used to convert sequences into features (numbers), so the size of the uncompressed input sequence data should not exceed the total RAM. 5-8 GB of compressed data per end is unusually large, although 180 GB of RAM should still be enough. I suspect the free memory on your server is much smaller than 180 GB because of other running processes? I will add chunk-size support for loading the sequences into memory in a future release.
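
A subset for such a test could be made along these lines (a sketch; file names are placeholders, and 4,000,000 FASTQ lines correspond to 1 million reads per end):

zcat sample_R1.fastq.gz | head -n 4000000 | gzip > subset_R1.fastq.gz
zcat sample_R2.fastq.gz | head -n 4000000 | gzip > subset_R2.fastq.gz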

dawnmy commented Mar 2, 2022

Another user also reported a similar memory issue while running RiboDetector with PBS. It would be great if you could post the job script you used to submit the job; it will help us locate the root cause of the issue.

akrinos commented Mar 2, 2022

Hi, thanks so much for the work you're putting into this! I would just like to note that the job appears to be killed for lack of memory on a write-out step. As for SLURM, I'm using Snakemake integration - these are the cluster parameters I most recently used for the submission when I tried really high memory:

__default__:
  account: <account-name>
  command_options:
    slurm:
      account: --account={}
      command: sbatch --parsable
      key_mapping: null
      mem: --mem={}g
      name: --job-name={}
      nodes: -N {}
      queue: --partition={}
      threads: -n {}
      time: --time={}
  cpupertask: 20
  mem: 1000
  nodes: 1
  queue: <queue-name>
  system: slurm
  tasks: 1
  time: 5000

dawnmy added the onnxruntime and SLURM labels Mar 3, 2022
sjaenick commented Mar 4, 2022

Depending on the SLURM setup, cluster jobs might be executed within the context of a Linux cgroup, which limits the maximum amount of, e.g., memory that can be allocated before a process is aborted. Try increasing your memory request when submitting cluster jobs.
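
For example (a sketch; the memory value and script name are placeholders):

# Request more memory for the job than the uncompressed input data needs:
sbatch --mem=200G --cpus-per-task=20 run_ribodetector.sh

# From inside a running job, /proc/self/cgroup shows which cgroup the job landed in;
# under cgroup v1, its memory cap is in the corresponding memory.limit_in_bytes file.
cat /proc/self/cgroup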

dawnmy commented Mar 4, 2022

Does mem: 1000 mean 1 GB of memory? If yes, this will be too small for your input data. I found that the onnxruntime package RiboDetector uses for CPU inference has a compatibility issue with SLURM. This issue seems to be related: #6

akrinos commented Mar 4, 2022

@dawnmy mem: 1000 is memory in gigabytes (i.e., 1 TB of memory), per the --mem={}g mapping above.

akrinos commented Mar 6, 2022

pip3 install --force-reinstall onnxruntime, as suggested by @sjaenick, made it so that RiboDetector has now run successfully on several of my samples despite the large file sizes.

akrinos commented Mar 6, 2022

But trying to increase threads still doesn't totally work, probably because of the large file size.

dawnmy commented Mar 6, 2022

> But trying to increase threads still doesn't totally work, probably because of the large file size.

What do you mean by "doesn't totally work"? RiboDetector uses only two CPUs while loading the input paired-end files into memory. After loading, the encoding and prediction will utilize the full specified number of CPUs.

akrinos commented Mar 6, 2022

I say "doesn't totally work" because I started with 1 thread (which worked), tried 2 (didn't work on all samples), and 20 threads did not work regardless of chunk size - still doing some testing

dawnmy commented Mar 6, 2022

Did you run all samples at once or one by one? It is better to run only one sample at a time, with multiple CPUs (-t). Running multiple samples at the same time will not be faster for the same total number of CPUs and will consume much more memory.
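
A sequential pattern along these lines keeps one sample running at a time with all the CPUs (a sketch; the file-naming scheme and thread count are placeholders):

# Process each paired-end sample in turn, deriving R2 and output names from R1:
for r1 in *_R1.fastq.gz; do
    r2=${r1/_R1/_R2}
    ribodetector_cpu -t 20 -l 101 \
        -i "$r1" "$r2" \
        -e rrna --chunk_size 128 \
        -o "nonrrna_$r1" "nonrrna_$r2"
done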

akrinos commented Mar 6, 2022

One by one, not at the same time

dawnmy commented Mar 7, 2022

This issue should be solved as of v0.2.4. Please update with:

pip install ribodetector -U

dawnmy commented Mar 8, 2022

This issue seems to be solved, so I will close it. Feel free to reopen if it is still not working.

dawnmy closed this as completed Mar 8, 2022
dawnmy commented Apr 21, 2022

@akrinos The new release v0.2.6 can substantially reduce the memory use for large input files via the chunk_size parameter. You can update it with pip. See #13
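
For example (a sketch; file names and the thread count are placeholders):

pip install -U ribodetector
ribodetector_cpu -t 20 -l 101 \
    -i sample_R1.fastq.gz sample_R2.fastq.gz \
    -e rrna --chunk_size 256 \
    -o sample_nonrrna_R1.fastq.gz sample_nonrrna_R2.fastq.gz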
