Simple command line tool reporting the reverse strand SNPs in your GWAS data.
Python
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
config
example_files
file_io
snp_info
tests
.gitignore
README.md
snpflip.py

README.md

snpflip

snpflip is a command line tool which outputs each SNP on the reverse strand in your GWAS data. It also reports the SNPs for which it is impossible to tell whether they are from the reference strand or not. In addition, it prints a table listing the frequency of each type of SNP (see the extended example below).

Requirements

The only third-party module requirement is Biopython. It is only used in a few lines (~5) to do reference genome lookups and can easily be changed to use the library of your choice.

Usage

python2 snipflip.py -b gwas_migraine.bim -rgf genome.fa

The inputs are:

  • -rgf/--reference-genome-fasta a reference genome in the fasta format
  • -b/--bim-file a plink .bim file describing your data.

There is an optional -op/--output_prefix argument, which allows you to name your output files. Otherwise, it defaults to using the prefix of your input file.

The output files are:

  • <prefix>.reverse - The SNPs that are on the reverse strand.
  • <prefix>.ambiguous - The SNPs that cannot be assigned to a strand.
  • <prefix>.bad - The SNPs that are invalid.

Each output file is tab-separated and has the following fields:

"SNP name" "Chromosome" "SNP position" "Allele one" "Allele two" "Reference nucleotide"

which means the snpflip output files can be used as input to Plink. This is convenient if you want to remove the bad or ambiguous SNPs or to flip the SNPs that are on the reverse strand.

Example usage with plink:

python2 snipflip.py -b snp_data.bim -rgf genome.fa -op snpflip_output

plink --file snp_data --flip snpflip_output.reverse --make-bed

Main use cases

  • Generate a list of SNPs not on the reference strand: Many bioinformatics applications require the SNPs to be on the reference strand. By running snpflip on your GWAS data you'll find the SNPs that need to be flipped.

  • Quality controlling your GWAS data and pruning it: If your GWAS data contains ambiguous SNPs, these might ruin the usefulness of your data for certain purposes (e.g., imputation). By running snpflip you can get the names of the ambiguous SNPs and remove them with Plink.

Extended example

This example uses the example files in the example_files catalog.

endrebak@tang:~/biocore/snpflip$ cat example_files/reference_genome.fa
>chr1
ACT
>chr2
CCC
endrebak@tang:~/biocore/snpflip$ cat example_files/example.bim
1 snp1 0 1 A C
1 snp2 0 2 A T
1 snp3 0 3 A G
2 snp4 0 1 A G
2 esv5 0 2 AA G
endrebak@tang:~/biocore/snpflip$ python2 snpflip.py -b example_files/example.bim -rgf \
 example_files/reference_genome.fa -op extended_example
Loading reference genome file example_files/reference_genome.fa
***
Results for example_files/example.bim
***
                          Reverse strand      Forward strand       Ambiguous snp
         Monomorphic                   0                   0                   0
         Polymorphic                   2                   1                   1
  Structural variant                  NA                  NA                   1
         Invalid snp                  NA                  NA                   0
endrebak@tang:~/biocore/snpflip$ cat extended_example.ambiguous
snp2    1    2    A    T    C
endrebak@tang:~/biocore/snpflip$ cat extended_example.reverse
snp3    1    3    A    G    T
snp4    2    1    A    G    C

FAQ

My SNPs are not in Plink .bed+.bim+.fam format. What do I do?

Use Plink to convert your files to the binary plink format. Example:

plink --vcf yourfile.vcf --make-bed --out your_prefix

How do you decide the SNP type and what strand it came from?

Please see the unittests in the test folder for documentation.

Issues

See https://github.com/endrebak/snipflip/issues

No issue is too small. I will entertain all feature requests. General support questions accepted.