Methods for Comparison of Different Imputation Pipelines


The 23andMe file used for the comparison can be downloaded from:

Extracting the R2 values

The first step is to extract the R2 values from the output of each imputation pipeline.


Beagle

Beagle (the fast imputation module on Deploit) supplies the R2 value, named DR2, within the INFO field (column 8) of the output VCF file:

grep -v '^#' 8589.23andme.6953.vcf > test.vcf
awk '{print $8}' test.vcf > dr2.txt
grep -oE 'DR2=[0-9]+(\.[0-9]+)?' dr2.txt | cut -d= -f2 > beagle_r2.txt
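The three commands above can also be collapsed into a single pass. The following is a minimal sketch, not part of the original pipeline: `extract_info_tag` is a hypothetical helper name, and it assumes a tab-separated VCF in which the INFO column holds `;`-separated `KEY=value` entries.

```shell
# Hypothetical helper: print the value of one INFO key for every variant line.
# Arguments: $1 = INFO key (e.g. DR2), $2 = path to an uncompressed VCF file.
extract_info_tag() {
    awk -F '\t' -v tag="$1" '
        /^#/ { next }                          # skip VCF header lines
        {
            n = split($8, kv, ";")             # INFO is column 8
            for (i = 1; i <= n; i++) {
                # Match the key only at the start of an entry,
                # so that asking for R2 does not pick up ER2 or DR2.
                if (index(kv[i], tag "=") == 1) {
                    print substr(kv[i], length(tag) + 2)
                    break
                }
            }
        }
    ' "$2"
}

# Example usage (file names as in the commands above):
#   extract_info_tag DR2 8589.23andme.6953.vcf > beagle_r2.txt
```

Because the key is matched as a whole entry, the same helper should work for other INFO keys as well. For multi-allelic records the value is printed as-is, so a comma-separated list would need further splitting.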

Michigan Imputation Server

The Michigan Imputation Server supplies the R2 value within the INFO field (column 8) of the output VCF file, as R2=<value>:

grep -v '^#' michigan_imputed.vcf | awk '{print $8}' | grep -oE '(^|;)R2=[0-9]+(\.[0-9]+)?' | cut -d= -f2 > michigan_r2.txt


Impute2

For Impute2, the equivalent of the R2 value (the info value) was extracted. To do this, the _info files generated per chromosome were first merged by making the following changes to the pipeline. The info values (column 7 of the merged file) can then be extracted with:

awk '{print $7}' merged_impute2_info.txt > impute2_info.txt
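If the per-chromosome _info files were merged by plain concatenation (an assumption; the merge step is not shown here), each chromosome's header row is still present in the merged file and the command above would write the literal word "info" into the output. A minimal sketch that keeps only numeric values, using a hypothetical helper name `extract_impute2_info`:

```shell
# Hypothetical helper: print column 7 (the Impute2 info value) of a merged
# _info file, skipping any row whose column 7 is not a plain number
# (for example repeated header rows left over from concatenation).
# Argument: $1 = path to the merged _info file.
extract_impute2_info() {
    awk '$7 ~ /^-?[0-9]+(\.[0-9]+)?$/ { print $7 }' "$1"
}

# Example usage (file names as in the command above):
#   extract_impute2_info merged_impute2_info.txt > impute2_info.txt
```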

Calculating the number of SNPs at different R2 value thresholds

Once the R2 values have been extracted, the number of SNPs at each R2 threshold can be counted using the script. You can run ./ with the files containing the R2 values (generated in the previous step) in the current directory. This should produce three new files, one per imputation pipeline, each containing the total number of SNPs at the different R2 thresholds. These counts can then be plotted (see example here).
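For illustration, the counting step can be sketched as follows. This is not the script referred to above; `count_r2_thresholds` is a hypothetical name, the thresholds in the usage example are arbitrary, and the sketch assumes that a SNP is counted when its R2 value is greater than or equal to the threshold.

```shell
# Hypothetical helper: for each threshold, count the lines of an R2 file
# (one value per line) whose value is >= that threshold.
# Arguments: $1 = R2 file, remaining arguments = thresholds.
# Output: one "threshold<TAB>count" line per threshold.
count_r2_thresholds() {
    r2_file="$1"
    shift
    for threshold in "$@"; do
        awk -v t="$threshold" '
            # Only consider lines that look like a non-negative number.
            $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $1 + 0 >= t + 0 { n++ }
            END { printf "%s\t%d\n", t, n }
        ' "$r2_file"
    done
}

# Example usage (output file name is an assumption):
#   count_r2_thresholds beagle_r2.txt 0.3 0.5 0.8 > beagle_r2_counts.txt
```

Running the helper once per pipeline on beagle_r2.txt, michigan_r2.txt and impute2_info.txt would give three small tables that can be plotted side by side.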

