sga index segfault with large values of -d #131

Open
sjackman opened this issue Nov 3, 2016 · 11 comments

sjackman commented Nov 3, 2016

The command sga index -d 20000000 -t 64 hsapiens.preprocess.filter.pass.merged.fa segfaults. Reducing to -d 1000000 works. Is each BWT batch limited in size, perhaps to 2 or 4 billion nucleotides? With a mean sequence size of ~300 bp, -d 20000000 should correspond to a batch of about 6 Gbp.
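To make that guess concrete, here is a hypothetical illustration (not SGA's actual code): if batch positions were held in a 32-bit type, a -d 20000000 batch of ~300 bp reads would overflow it, which could plausibly produce a segfault.

```cpp
// Hypothetical sketch, not SGA's actual code: would a -d 20000000
// batch of ~300 bp reads fit in a 32-bit position type?
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t reads_per_batch = 20000000;  // -d 20000000
    const uint64_t mean_read_bp    = 300;       // mean sequence size from above
    const uint64_t batch_bp = reads_per_batch * mean_read_bp;  // 6,000,000,000 nt

    std::printf("batch size: %llu nt\n", (unsigned long long)batch_bp);
    std::printf("fits in 32 bits: %s\n",
                batch_bp <= UINT32_MAX ? "yes" : "no");  // no: 6e9 > ~4.29e9
    return 0;
}
```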


sjackman commented Nov 3, 2016

Can sga index -a ropebwt work with the output of sga fm-merge? The mean sequence size is 300 bp, and the largest sequence is 30,889 bp.


jts commented Nov 3, 2016

Did you run out of memory with -d 20000000? Without -a ropebwt a memory-inefficient algorithm is used. There is no 2 (or 4) billion nucleotide batch limit.


jts commented Nov 3, 2016

Whether it is worth using -a ropebwt depends on the read length distribution. I suggest sticking with the recommended parameters (not ropebwt, -d X). It shouldn't take very long.


sjackman commented Nov 3, 2016

The fm-merge FASTA file is 20 GB, so it should be possible to construct the BWT in a single pass using SAIS in roughly 200 GB of RAM. I reported this issue because of the segfault, which is 😢. I'm happy with the -d 1000000 workaround, though.

> Did you run out of memory with -d 20000000?

I don't believe so. It was using 76 GB of RAM when it crashed, and the machine has 2.5 TB available.

> It shouldn't take very long.

I'm using sga index -d 1000000 now. It has finished 41 of 69 batches in four hours, so it's trucking along nicely. 🏎
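(Back-of-envelope for the ~200 GB SAIS estimate above, under an assumed working set of one 64-bit suffix-array entry plus the symbol itself, i.e. ~9 bytes per input symbol:)

```cpp
// Assumption: SAIS working set ~= one 64-bit suffix-array entry
// plus the 1-byte symbol, per input symbol.
#include <cstdio>

int main() {
    const double text_gb   = 20.0;  // fm-merge FASTA size, ~= symbol count in Gsymbols
    const double sa_bytes  = 8.0;   // one 64-bit suffix-array entry per symbol
    const double sym_bytes = 1.0;   // the symbol itself
    std::printf("approx peak RAM: %.0f GB\n",
                text_gb * (sa_bytes + sym_bytes));  // ~180 GB, in line with ~200 GB
    return 0;
}
```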


sjackman commented Nov 3, 2016

Have you read Optimal In-Place Suffix Sorting? https://arxiv.org/abs/1610.08305
It seems worth checking out. @rob-p brought it to my attention.


sjackman commented Nov 7, 2016

sga index -d 1000000 completed in 25 hours.

sga index -d 1000000 -t 64 hsapiens.preprocess.filter.pass.merged.fa
205964.05s user 3080.39s system 232% cpu 24:56:18.90 total 9111 MB
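As a sanity check on these figures (using the batch count and mean read length mentioned earlier in the thread): 69 batches at -d 1000000 with ~300 bp reads is ~20.7 Gbp, which lines up with the ~20 GB fm-merge FASTA.

```cpp
// Cross-check using figures from this thread: 69 batches at
// -d 1000000 (reads per batch) and ~300 bp mean read length.
#include <cstdio>

int main() {
    const double batches = 69, reads_per_batch = 1e6, mean_bp = 300;
    std::printf("total input: %.1f Gbp\n",
                batches * reads_per_batch * mean_bp / 1e9);  // ~20.7 Gbp
    return 0;
}
```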


jts commented Nov 7, 2016

Thanks for the update. I did see that paper from @rob-p's Twitter - it's on my to-read list :)


sjackman commented Nov 7, 2016

Here are the wall-clock and memory results for SGA on human HG004 data with and without fm-merge (a memo to self and for future curious readers).

fm-merge  Wall clock (h)  Peak memory (GB)
FALSE     65.4            270.35938
TRUE      65.0            82.24316


jts commented Nov 8, 2016

Interesting, thanks! I wouldn't have expected the runtimes to be (nearly) the same, but it is good to see.


sjackman commented Nov 8, 2016

It was surprising to me too. Running fm-merge first speeds up overlap and assemble quite a bit. I found that rmdup after fm-merge didn't remove any sequences. Is it necessary, or did I just get lucky?


sjackman commented Nov 9, 2016

sga index -d 1000000 succeeded.
sga index -d 10000000 succeeded.
sga index -d 20000000 segfaulted.

At ~300 bp per read, -d 10000000 is a ~3 Gbp batch (under 2^32 nucleotides) and -d 20000000 is ~6 Gbp (over), so this would be consistent with a limit somewhere around 2^32 nucleotides per batch.
