Avoid looping through sample data twice when computing genotype information. #39

Merged 2 commits into jamescasbon:master on May 22, 2012



arq5x commented Apr 5, 2012

This change builds on the fact that Reader._parse_samples already loops through all of the genotypes for each sample when constructing samp_data, which is a list of _Call instances. Since we are already creating _Call instances there, we can build lists, with one entry per sample, of commonly used information in the same pass, such as the actual genotype ("A/G") and the numeric genotype type (0, 1, 2).

This matters when doing anything meaningful with genotype information on VCFs with many samples, because otherwise we are forced to loop through all of the genotype information twice: once in _parse_samples and once in the user's code, which would have to do something like:

for var in vcf_reader:
    # build a list of the samples' numeric genotype types
    gt_types = [s.gt_type for s in var.samples]

In contrast, since they are now pre-computed in the Reader, one can do:

for var in vcf_reader:
    # build a list of the samples' numeric genotype types
    gt_types = var.gt_types
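
A minimal sketch of the single-pass idea, not the actual diff: `Call`, `gt_type_of`, and `parse_samples` below are hypothetical stand-ins for `_Call` and `Reader._parse_samples`, and the 0/1/2 encoding is assumed to mean hom-ref/het/hom-alt. The point is only that the per-sample lists are filled in the same loop that creates the call objects.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's _Call class; the fields are assumptions.
Call = namedtuple("Call", ["sample", "gt_bases", "gt_type"])


def gt_type_of(gt):
    """Map a GT string such as '0/1' to 0 (hom-ref), 1 (het), 2 (hom-alt), or None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    if len(set(alleles)) > 1:
        return 1
    return 0 if alleles[0] == "0" else 2


def parse_samples(sample_names, gt_fields, ref, alts):
    """Build the call objects and the per-sample summary lists in one loop."""
    alleles = [ref] + list(alts)
    calls, gt_bases, gt_types = [], [], []
    for name, gt in zip(sample_names, gt_fields):
        gtype = gt_type_of(gt)
        if gtype is None:
            bases = None
        else:
            sep = "|" if "|" in gt else "/"
            bases = sep.join(alleles[int(i)] for i in gt.split(sep))
        calls.append(Call(name, bases, gtype))
        # Filled in the same pass, so user code needs no second loop.
        gt_bases.append(bases)
        gt_types.append(gtype)
    return calls, gt_bases, gt_types


calls, gt_bases, gt_types = parse_samples(
    ["s1", "s2", "s3", "s4"], ["0/0", "0/1", "1/1", "./."], "A", ["G"]
)
```

The trade-off is a small amount of extra memory per record in exchange for not touching every call object again in user code.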

I have tested this on a VCF file with 1000 records containing 1046 samples from the 1000 Genomes Project. Prior to this change, it took 42 seconds to loop through each record and compute the aforementioned list. After the change, it took 25 seconds, a reduction of about 40%.
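
For anyone who wants to get a feel for the user-side cost in isolation, here is a rough timing harness on synthetic data. It is only a sketch: `Call` and `Record` are hypothetical stand-ins rather than the library's classes, the sizes are arbitrary apart from the 1046 samples, and it measures only the second pass in user code, not parsing, so it will not reproduce the numbers above.

```python
import timeit


class Call:
    """Hypothetical per-sample call object holding only a numeric genotype type."""

    def __init__(self, gt_type):
        self.gt_type = gt_type


class Record:
    """Hypothetical record with per-sample calls plus a list precomputed at parse time."""

    def __init__(self, gt_types):
        self.samples = [Call(t) for t in gt_types]
        self.gt_types = list(gt_types)


def second_pass(records):
    # What user code has to do without the precomputed list.
    return [[s.gt_type for s in r.samples] for r in records]


def precomputed(records):
    # What user code can do with the precomputed list.
    return [r.gt_types for r in records]


# 100 synthetic records with 1046 samples each.
records = [Record([i % 3 for i in range(1046)]) for _ in range(100)]

t_loop = timeit.timeit(lambda: second_pass(records), number=10)
t_pre = timeit.timeit(lambda: precomputed(records), number=10)
print("second pass: %.4fs, precomputed: %.4fs" % (t_loop, t_pre))
```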

Do others agree that it is better to pre-compute this, recognizing that VCF files are only going to include more and more samples, and these are the types of information that people will need to do meaningful things with the data?

@jamescasbon jamescasbon merged commit 7afcaea into jamescasbon:master May 22, 2012
