snpflip is a command line tool which outputs each SNP on the reverse strand in your GWAS data. It also reports the SNPs for which it is impossible to tell whether they are from the reference strand or not. In addition, it prints a table listing the frequency of each type of SNP (see the extended example below).
The only third-party module requirement is Biopython. It is only used in a few lines (~5) to do reference genome lookups and can easily be changed to use the library of your choice.
python2 snipflip.py -b gwas_migraine.bim -rgf genome.fa
The inputs are:
-rgf/--reference-genome-fastaa reference genome in the fasta format
.bimfile describing your data.
There is an optional
-op/--output_prefix argument, which allows you to name your output files. Otherwise, it defaults to using the prefix of your input file.
The output files are:
<prefix>.reverse- The SNPs that are on the reverse strand.
<prefix>.ambiguous- The SNPs that cannot be assigned to a strand.
<prefix>.bad- The SNPs that are invalid.
Each output file is tab-separated and has the following fields:
"SNP name" "Chromosome" "SNP position" "Allele one" "Allele two" "Reference nucleotide"
which means the snpflip output files can be used as input to Plink. This is convenient if you want to remove the bad or ambiguous SNPs or to flip the SNPs that are on the reverse strand.
Example usage with plink:
python2 snipflip.py -b snp_data.bim -rgf genome.fa -op snpflip_output
plink --file snp_data --flip snpflip_output.reverse --make-bed
Main use cases
Generate a list of SNPs not on the reference strand: Many bioinformatics applications require the SNPs to be on the reference strand. By running snpflip on your GWAS data you'll find the SNPs that need to be flipped.
Quality controlling your GWAS data and pruning it: If your GWAS data contains ambiguous SNPs, these might ruin the usefulness of your data for certain purposes (e.g., imputation). By running snpflip you can get the names of the ambiguous SNPs and remove them with Plink.
This example uses the example files in the example_files catalog.
endrebak@tang:~/biocore/snpflip$ cat example_files/reference_genome.fa >chr1 ACT >chr2 CCC endrebak@tang:~/biocore/snpflip$ cat example_files/example.bim 1 snp1 0 1 A C 1 snp2 0 2 A T 1 snp3 0 3 A G 2 snp4 0 1 A G 2 esv5 0 2 AA G endrebak@tang:~/biocore/snpflip$ python2 snpflip.py -b example_files/example.bim -rgf \ example_files/reference_genome.fa -op extended_example Loading reference genome file example_files/reference_genome.fa *** Results for example_files/example.bim *** Reverse strand Forward strand Ambiguous snp Monomorphic 0 0 0 Polymorphic 2 1 1 Structural variant NA NA 1 Invalid snp NA NA 0 endrebak@tang:~/biocore/snpflip$ cat extended_example.ambiguous snp2 1 2 A T C endrebak@tang:~/biocore/snpflip$ cat extended_example.reverse snp3 1 3 A G T snp4 2 1 A G C
My SNPs are not in Plink .bed+.bim+.fam format. What do I do?
Use Plink to convert your files to the binary plink format. Example:
plink --vcf yourfile.vcf --make-bed --out your_prefix
How do you decide the SNP type and what strand it came from?
Please see the unittests in the test folder for documentation.
No issue is too small. I will entertain all feature requests. General support questions accepted.