
hostremoval_input_fastq has insufficient memory #789

Closed
4 of 5 tasks
ivelsko opened this issue Sep 1, 2021 · 4 comments
Labels
bug Something isn't working DSL2 needs upstream fix Needs a fix in the upstream tool project

Comments


ivelsko commented Sep 1, 2021

Check Documentation

I have checked the following places for your error:

Description of the bug

hostremoval_input_fastq fails on larger samples because it runs out of memory

Steps to reproduce

Steps to reproduce the behaviour:

  1. Command line:
nextflow run nf-core/eager \
-r 2.3.5 \
-profile eva,archgen,big_data \
--outdir /mnt/archgen/microbiome_calculus/abpCapture/03-preprocessing/set1_set3/eager2 \
-work-dir /mnt/archgen/microbiome_calculus/abpCapture/03-preprocessing/set1_set3/work \
--input /mnt/archgen/microbiome_calculus/abpCapture/03-preprocessing/set1_set3/abpCap_set1_set3_eager_input.tsv \
--complexity_filter_poly_g \
--fasta /mnt/archgen/Reference_Genomes/Human/HG19/hg19_complete.fasta \
--seq_dict /mnt/projects1/Reference_Genomes/Human/HG19/hg19_complete.dict \
--bwa_index /mnt/archgen/Reference_Genomes/Human/HG19/ \
--bwaalnn 0.02 \
--bwaalnl 1024 \
--hostremoval_input_fastq \
--hostremoval_mode remove \
--run_bam_filtering \
--bam_unmapped_type fastq \
--skip_damage_calculation \
--skip_qualimap \
--email irina_marie_velsko@eva.mpg.de \
-name abpCap_set13 \
-with-tower
  2. See error:
Error executing process > 'hostremoval_input_fastq (BSH001.A0101.SG1)'

Caused by:
  Process `hostremoval_input_fastq (BSH001.A0101.SG1)` terminated with an error exit status (1)

Command executed:

  samtools index BSH001.A0101.SG1_PE.mapped.bam
  extract_map_reads.py BSH001.A0101.SG1_PE.mapped.bam BSH001.A0101.SG1_R1_lanemerged.fq.gz -rev BSH001.A0101.SG1_R2_lanemerged.fq.gz -m remove -of BSH001.A0101.SG1_PE.mapped.hostremoved.fwd.fq.gz -or BSH001.A0101.SG1_PE.mapped.hostremoved.rev.fq.gz -p 1

Command exit status:
  1

Command output:
  - Extracting mapped reads from BSH001.A0101.SG1_PE.mapped.bam
  - Parsing forward fq file BSH001.A0101.SG1_R1_lanemerged.fq.gz

Command error:
  Traceback (most recent call last):
    File "/home/irina_marie_velsko/.nextflow/assets/nf-core/eager/bin/extract_map_reads.py", line 270, in <module>
    File "/home/irina_marie_velsko/.nextflow/assets/nf-core/eager/bin/extract_map_reads.py", line 147, in parse_fq
    File "/home/irina_marie_velsko/.nextflow/assets/nf-core/eager/bin/extract_map_reads.py", line 120, in get_fq_reads
    File "/opt/conda/envs/nf-core-eager-2.3.5/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 933, in FastqGeneralIterator
      seq_string = handle_readline().rstrip()
    File "/opt/conda/envs/nf-core-eager-2.3.5/lib/python3.7/site-packages/xopen/__init__.py", line 268, in readline
      return self._file.readline(*args)
    File "/opt/conda/envs/nf-core-eager-2.3.5/lib/python3.7/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  MemoryError

Work dir:
  /mnt/archgen/microbiome_calculus/abpCapture/03-preprocessing/set1_set3/work/e4/663badafbd377d9291bdb211a98525

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
-[nf-core/eager] Pipeline completed with errors-

Expected behaviour

There should be enough memory for the host-mapped reads to be removed from the input files and the resulting host_removed fastq files to be written for both forward and reverse reads.

Log files

Have you provided the following extra information/files:

  • The command used to run the pipeline
  • The .nextflow.log file
  • The exact error: see above

System

  • Hardware: HPC
  • Executor: sge
  • OS: Linux
  • Version: Ubuntu 20.04.3 LTS

Nextflow Installation

  • Version: 20.10.0 build 5430

Container engine

  • Engine: Singularity
  • version:
  • Image tag: nfcore/eager:2.3.5

Additional context

I tried to increase the memory from 32GB by adjusting the lines

#$ -l h_rss=184320M,mem_free=184320M
#$ -S /bin/bash -j y -o output.log -l h_vmem=180G,virtual_free=180G

in the .command.run file. It ran with 180GB as written above, but the qacct record says maxvmem 120.745GB.
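
A more durable way to raise the limit than hand-editing the generated .command.run is a custom Nextflow config passed with `-c`. A minimal sketch, assuming the DSL1 process name `hostremoval_input_fastq` and that your cluster profile honours the `memory` directive:

```groovy
// custom.config — hypothetical override; pass with `nextflow run ... -c custom.config`
process {
    withName: hostremoval_input_fastq {
        memory = 180.GB
    }
}
```

This survives `-resume` and avoids re-editing per-task scripts after each retry.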

@ivelsko ivelsko added the bug Something isn't working label Sep 1, 2021
ivelsko commented Sep 1, 2021

@maxibor for you, but I can't seem to assign you


jfy133 commented Sep 6, 2021

@maxibor more specifically, it seems this ends up using a ridiculous amount of memory, so I think it would require optimisation on the part of the script.
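
The traceback points at FastqGeneralIterator inside get_fq_reads, which suggests whole FASTQ records are being accumulated in memory during parsing. A minimal sketch of a streaming alternative (a hypothetical rewrite for illustration, not the pipeline's actual extract_map_reads.py; it uses plain gzip rather than xopen) that keeps only the set of mapped read names resident:

```python
import gzip

def filter_fastq(fq_in, fq_out, mapped_names, mode="remove"):
    """Stream records from fq_in to fq_out one at a time.

    mode='remove'  -> write reads NOT in mapped_names (drop host reads)
    mode='extract' -> write reads that ARE in mapped_names
    Only the set of mapped read names is held in memory.
    """
    with gzip.open(fq_in, "rt") as fin, gzip.open(fq_out, "wt") as fout:
        while True:
            header = fin.readline()
            if not header:  # end of file
                break
            seq, plus, qual = fin.readline(), fin.readline(), fin.readline()
            # Read name: strip '@', any description, and a /1 or /2 suffix
            name = header[1:].split()[0].split("/")[0]
            keep = (name in mapped_names) == (mode == "extract")
            if keep:
                fout.write(header + seq + plus + qual)
```

Peak memory then scales with the number of mapped read names, not with the FASTQ size.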

@jfy133 jfy133 added the needs upstream fix Needs a fix in the upstream tool project label Nov 29, 2021

jfy133 commented Nov 29, 2021

@maxibor also posted the following on slack:

https://github.com/TOAST-sandbox/podPeople
https://github.com/sandberg-lab/dataprivacy (a.k.a. BAMboozle)

Note: these tools are probably indeed more robust, but we need to consider that it might be unwanted to just 'replace' reads or variants with reference-genome ones. If someone wanted to reanalyse e.g. calculus for human DNA, they may not realise they are looking at 'fake' sequence.

This is sort of 'tampering' with the FASTQ file in a misleading sense. Therefore I would still rather have a NNN replacement or entire removal (will need to check if the tools support this).
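
The NNN-replacement idea could look something like the following (a hypothetical sketch, not an existing eager or upstream-tool feature): mask host-mapped reads in place rather than dropping them, so record counts and read pairing stay intact without shipping real host sequence:

```python
import gzip

def mask_fastq(fq_in, fq_out, mapped_names):
    """Stream records; replace the sequence of host-mapped reads with Ns
    (and qualities with '!', the lowest Phred+33 score) instead of
    removing the record entirely."""
    with gzip.open(fq_in, "rt") as fin, gzip.open(fq_out, "wt") as fout:
        while True:
            header = fin.readline()
            if not header:  # end of file
                break
            seq, plus, qual = fin.readline(), fin.readline(), fin.readline()
            name = header[1:].split()[0].split("/")[0]
            if name in mapped_names:
                n = len(seq.rstrip("\n"))
                seq, qual = "N" * n + "\n", "!" * n + "\n"
            fout.write(header + seq + plus + qual)
```

Downstream tools still see complete, correctly paired files, but any 'human' hits against the masked reads are unambiguously void.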


jfy133 commented Sep 6, 2022

Done in 2.4.5!

@jfy133 jfy133 closed this as completed Sep 6, 2022