Memory optimisations #15

alexjironkin · 2016-06-08T07:28:08Z

I have reworked the filtering and vcf2fasta steps to use memory more efficiently (see #12).

TODO:

- Documentation
- Testing

When filtering a VCF, all records were kept in memory, which led to huge memory consumption (~10G for a 5Mb genome). Now, if *out_vcf* is specified for *VariantSet.filter_vcf*, each record is written to the file as soon as it has been filtered. This reduces the memory footprint, which is now mostly Python overhead.
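As a rough illustration of the idea (not the actual *VariantSet.filter_vcf* code), the sketch below streams records to an output handle when one is supplied and only accumulates them otherwise. The names `filter_stream` and `mark_low_qual`, and the tab-separated record layout, are hypothetical and chosen for this example only.

```python
import io


def filter_stream(records, annotate, out=None):
    """Annotate each record; write it immediately if `out` is given.

    `records` is any iterable of text lines and `annotate` is a callable
    returning the annotated line. Both are assumptions for this sketch.
    When `out` is None the annotated records are collected in a list,
    which is the memory-hungry behaviour the streaming path avoids.
    """
    results = None if out is not None else []
    count = 0
    for record in records:
        annotated = annotate(record)
        count += 1
        if out is not None:
            out.write(annotated + "\n")  # nothing is kept in memory
        else:
            results.append(annotated)
    return count, results


def mark_low_qual(line, min_qual=30.0):
    """Hypothetical filter: flag a record by the quality in its 5th column."""
    fields = line.split("\t")
    status = "PASS" if float(fields[4]) >= min_qual else "LowQual"
    return "\t".join(fields + [status])


lines = ["chr1\t1\tA\tT\t50", "chr1\t2\tG\tC\t10"]
buffer = io.StringIO()  # stands in for an open output file
streamed_count, streamed_results = filter_stream(lines, mark_low_qual, out=buffer)
kept_count, kept_results = filter_stream(lines, mark_low_qual)
```

With an output handle, memory use stays roughly constant regardless of how many records are processed; without one, it grows with the input.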


Aleksey Jironkin added 23 commits May 27, 2016 17:08

Added an assert for vcf2fasta.

1f876b4

This assert helps validate that all output sequences are the same length.
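A check of this kind might look like the following minimal sketch. The function name `assert_equal_lengths` and the dict-of-strings input are assumptions for illustration, not the code added in this commit.

```python
def assert_equal_lengths(sequences):
    """Fail loudly if the sequences are not all the same length.

    `sequences` maps a sample name to its sequence string
    (a hypothetical layout for this sketch).
    """
    lengths = {name: len(seq) for name, seq in sequences.items()}
    assert len(set(lengths.values())) <= 1, (
        "Sequences differ in length: %s" % lengths
    )


# Equal-length sequences pass silently.
assert_equal_lengths({"sample1": "ACGT", "sample2": "A-GT"})
```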

Started on ParallelVCFReader class.

a12db37

New working version of ParallelVCFReader for vcf2fasta.

37250c2

Some features have been disabled, but otherwise the result is comparable to the old version and more memory friendly.

Working vcf2fasta. Some features are still missing.

606eb3f

Fixed type with positive_float comparison.

8c781cd

vcf2fasta now works for the --reference option. Removed distance matrix calculation.

2addfc6

Distance matrix calculation is going to be done in a separate module.

Disabled --sample-Ns filtering.

1bb4afc

sample-Ns is working. sample-gaps introduced to remove samples with high gap content.

ac14ed1
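To make the sample-gaps idea concrete, here is a small sketch of dropping samples whose gap content is too high. It assumes the threshold is a fraction of the sequence length and that gaps are written as `-`. The function name `drop_gappy_samples` and its parameters are hypothetical, and the real option may define the cutoff differently.

```python
def drop_gappy_samples(sequences, max_gap_fraction, gap_char="-"):
    """Return only the samples whose gap fraction is within the limit.

    `sequences` maps sample name to sequence string; the threshold
    semantics (fraction, inclusive) are assumptions for this sketch.
    """
    kept = {}
    for name, seq in sequences.items():
        gap_fraction = seq.count(gap_char) / len(seq) if seq else 0.0
        if gap_fraction <= max_gap_fraction:
            kept[name] = seq
    return kept


samples = {"s1": "ACGT", "s2": "A--T", "s3": "----"}
filtered = drop_gappy_samples(samples, max_gap_fraction=0.5)
```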

Fully working --reference parameter.

1a9c402

Corrected type for reference.

93ca3ae

Reshuffled the classes a little into new phe.utils.reader module.

609bcc6

ParallelVCFReader can now be found in phe.utils.reader and used to traverse multiple VCF readers one position at a time.
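The general technique of walking several sorted readers one position at a time can be sketched with the standard library alone. This is not the ParallelVCFReader implementation: the function `traverse_in_parallel`, the `(position, record)` tuples, and the single-contig assumption are all simplifications made for this example.

```python
import heapq
from itertools import groupby


def traverse_in_parallel(readers):
    """Yield (position, {sample: record}) in ascending position order.

    `readers` maps a sample name to an iterable of (position, record)
    pairs already sorted by position on a single contig (an assumption
    of this sketch). Samples with no record at a position are simply
    absent from that position's dict.
    """
    def tagged(name, reader):
        # Attach the sample name so records can be regrouped after merging.
        for pos, record in reader:
            yield pos, name, record

    merged = heapq.merge(
        *(tagged(name, reader) for name, reader in readers.items()),
        key=lambda item: item[0]
    )
    for pos, group in groupby(merged, key=lambda item: item[0]):
        yield pos, {name: record for _, name, record in group}


readers = {
    "s1": [(1, "A"), (3, "T")],
    "s2": [(1, "C"), (2, "G")],
}
positions = list(traverse_in_parallel(readers))
```

Because `heapq.merge` is lazy, only one pending record per reader is held at any time, which is what keeps the memory footprint small.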

Tidy up the code.

f254f43

More code tidy up.

06e8952

import tidy.

20b3072

Updated ignores for eclipse.

d13dd11

Working and tested version of multi-contig vcf2fasta.

1000b0e

Updated stats to reflect correctly information about the SNPs, gaps etc.

e04bc0a

Only use reference if it is not N (#13). Fully working sample-Ns and sample-gaps filtering.

44028a9

Remove the 'tmp' folder forcefully.

fdd5edc

Documentation for the optimised memory and all ancillary parts.

c6ab22c

Removed --with-mixtured parameter from vcf2fasta, as it is not currently used.

1a94f42

Removed --plots-dir parameter from vcf2fasta as not used.

fc05f92

alexjironkin merged commit 7206951 into master Jun 8, 2016
