
Segmentation fault (core dumped) #21

Open
jaclyn-taroni opened this issue Jun 5, 2019 · 10 comments

@jaclyn-taroni

For context: I am attempting to create an augmented FASTA file to add decoy sequence to a Salmon index as noted in the release notes in the most recent version of Salmon (0.14.0): https://github.com/COMBINE-lab/salmon/releases/tag/v0.14.0

The authors provide a script that makes use of MashMap to do so here: https://github.com/COMBINE-lab/SalmonTools/blob/master/scripts/generateDecoyTranscriptome.sh

I get Segmentation fault (core dumped) when the script reaches the MashMap step at this line https://github.com/COMBINE-lab/SalmonTools/blob/23eac847decf601c345abd8527eed5dc1b382573/scripts/generateDecoyTranscriptome.sh#L105

This can be reproduced from the command line:

mashmap -r reference.masked.genome.fa -q Homo_sapiens.GRCh38.cdna.all.fa -t 8 --pi 80 -s 500
>>>>>>>>>>>>>>>>>>
Reference = [reference.masked.genome.fa]
Query = [Homo_sapiens.GRCh38.cdna.all.fa]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 8
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 985533927
Segmentation fault (core dumped)

The relevant inputs to generateDecoyTranscriptome.sh, used to generate reference.masked.genome.fa and the transcript FASTA, are:

Input File          Download
GTF                 ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz
Genome FASTA        ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Transcript FASTA    ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

I'm using a Docker image with the v2.0 release of MashMap. (It can be pulled from jtaroni/2019-chi-training and MashMap is installed like so: https://github.com/AlexsLemonade/RNA-Seq-Exercises/blob/d6e5f8627c75e55e572e9061f0498388ebb7d212/Dockerfile#L91).

This also occurs running on my Ubuntu 18.04 machine w/ 64GB RAM outside the container.

Any ideas about what may be happening would be appreciated. Thank you!

@cjain7
Contributor

cjain7 commented Jun 6, 2019

Would it be possible to re-run mashmap with the /usr/bin/time utility to report its memory usage? Comparing the peak memory usage with the RAM size would help. My first guess is that it's running out of memory with the parameters --pi 80 -s 500.
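If GNU time isn't installed (e.g. in a slim container image), a rough Python equivalent of its peak-memory report can be sketched with the standard library. This is a hypothetical helper, not part of MashMap or SalmonTools, and the kilobyte unit for ru_maxrss assumes Linux:

```python
import resource
import shlex
import subprocess

def run_and_report(cmd):
    """Run a command and report the peak resident set size of child
    processes, similar to what `/usr/bin/time --verbose` prints as
    'Maximum resident set size'."""
    proc = subprocess.run(shlex.split(cmd))
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # On Linux, ru_maxrss is reported in kilobytes.
    print(f"exit status: {proc.returncode}")
    print(f"peak RSS: {usage.ru_maxrss / 1024:.1f} MiB")
    return proc.returncode, usage.ru_maxrss
```

For the case above, the command string would be the full mashmap invocation; a crash by signal shows up as a negative return code in subprocess.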

@lpantano

lpantano commented Jun 6, 2019

Hi,

I got the same error when running on a cluster: the job was killed by the scheduler because of memory, and it showed the same error.

@cjain7, do you know how much memory it needs to run this kind of alignment? It would be the transcriptome against the genome.

I set up the limit to 200GB and it wasn't enough.

Thanks!

@k3yavi

k3yavi commented Jun 6, 2019

I've just finished running it on human GENCODE data and annotation. It took ~80 GB of memory for me to finish.

@lpantano

lpantano commented Jun 7, 2019 via email

@k3yavi

k3yavi commented Jun 7, 2019

No problem @lpantano.
Not to swarm the issue with Salmon-related files, but gentrome.fa for GENCODE human comes out to around 477 MB, while the Ensembl one is around 431 MB. If you are looking for human Ensembl decoys, we have uploaded them here. You can also follow up or raise a request for creating decoys for a non-model organism at COMBINE-lab/SalmonTools#5; we would be happy to create that for you.

@jaclyn-taroni
Author

Thanks all for the replies. I am out of the office today, but I will run this with GNU time when I get back in early next week and see if that gives us any additional insight.

@lpantano

lpantano commented Jun 7, 2019

@k3yavi, thanks. All good: 100 GB was enough. I messed up the configuration, sorry about that, but it's good to know about the resources. Thanks so much for your time, I really appreciate the help!

@jaclyn-taroni
Author

Hi @cjain7,

When I run /usr/bin/time with --verbose, the output is:

Command terminated by signal 11
    Command being timed: "mashmap -r reference.masked.genome.fa -q Homo_sapiens.GRCh38.cdna.all.fa -t 8 --pi 80 -s 500"
    User time (seconds): 1269.39
    System time (seconds): 64.50
    Percent of CPU this job got: 273%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 8:07.51
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 48309816
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1
    Minor (reclaiming a frame) page faults: 56777549
    Voluntary context switches: 195530
    Involuntary context switches: 332377
    Swaps: 0
    File system inputs: 106068536
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Thank you!
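Converting the reported peak RSS to gibibytes shows the process was already near the 64 GB of available RAM when it was killed, consistent with the out-of-memory guess. The arithmetic (using the value from the output above) is:

```python
# GNU time reports "Maximum resident set size" in kilobytes,
# so the value above converts to GiB like this:
peak_rss_kb = 48309816  # from the /usr/bin/time output above
peak_rss_gib = peak_rss_kb / (1024 * 1024)
print(f"peak RSS ≈ {peak_rss_gib:.1f} GiB")  # ≈ 46.1 GiB on a 64 GB machine
```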

@antonkulaga

Guys, you claim "MashMap can map a human genome assembly to the human reference genome in about one minute total execution time and < 4 GB memory using just 8 CPU threads". Why, then, does it take > 64 GB of RAM to run the human alignment in the SalmonTools decoy script with MashMap?

@cjain7
Contributor

cjain7 commented Jul 7, 2019

The performance is highly dependent on the length [-s] and identity [--pi] requirements provided to MashMap.
When looking for long approximate matches that are highly similar, the algorithm can use a sparse LSH sketch for the computation. This was the case when comparing two human genome assemblies (--pi 95 -s 5000).

When looking for short, divergent matches (--pi 80 -s 500, i.e., segment length 500 and a 20% error rate here in your application), it needs a dense sketch to identify them, hence the large memory use and runtime in your specific case. (The MashMap paper is a good reference for a detailed discussion of this.)

One possible suggestion is to see whether relaxing (i.e., increasing) the minimum identity/length requirements makes sense for the application. If that is doable, the algorithm will execute much faster, with much less memory.

The other way around this problem would be to partition the reference into smaller chunks and run those independently, but this pipeline will require a bit more engineering to aggregate the results.
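The partitioning idea could be sketched as follows. This is illustrative only: read_fasta and split_fasta are hypothetical helpers, not part of MashMap or SalmonTools, and the split happens on record boundaries, so a single very large chromosome still ends up whole in one chunk.

```python
def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_fasta(path, n_chunks, prefix="chunk"):
    """Greedily distribute FASTA records into n_chunks files of roughly
    equal total sequence length, so each chunk can be mapped separately."""
    sizes = [0] * n_chunks
    outs = [open(f"{prefix}_{i}.fa", "w") for i in range(n_chunks)]
    try:
        for header, seq in read_fasta(path):
            i = sizes.index(min(sizes))  # smallest chunk so far
            outs[i].write(f"{header}\n{seq}\n")
            sizes[i] += len(seq)
    finally:
        for fh in outs:
            fh.close()
    return sizes
```

Each chunk file would then be passed as -r to a separate mashmap run. Since MashMap's output filtering happens per run, the concatenated per-chunk outputs would still need a joint filtering pass, which is the extra engineering to aggregate results mentioned above.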
