Detection of Non-homologous and Very Divergent Regions in Protein Sequence Alignments

-------------
-------------
DivA 1.0
M. Lisandra Zepeda Mendoza & Rute R. da Fonseca
-------------
-------------


-------------
DESCRIPTION
-------------

DivA is a set of Python scripts designed to detect non-homologous and very Divergent regions in protein sequence Alignments. DivA was tested with Python 2.7.

DivA makes no assumptions about evolutionary models and is ideal for detecting incorrectly annotated segments within individual gene sequences. It is a binary decision-making method that applies a sliding-window approach to estimate four divergence-based parameters and defines their outlier values according to automatically defined thresholds, which can optionally be modified. DivA then classifies a window of a sequence in an alignment as very divergent (potentially non-homologous) if it presents a combination of outlier values for the four parameters. The windows classified as very divergent can optionally be masked in the alignment. This allows DivA to discard a minimum amount of sequence information compared to other currently available methods, which remove entire sequences or blocks of a multiple sequence alignment. One important application of DivA is the detection of incorrectly annotated sequences produced by automatic gene annotation, which can have confounding effects in comparative genomics and phylogenomics analyses.
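The general idea of a sliding window with an outlier threshold can be sketched as follows. This is only an illustration and not DivA's actual code: this README does not define the four parameters, so the sketch uses a single hypothetical score (the fraction of positions that differ from a reference sequence). It also assumes a window step of one position, that high values are the outliers, and that the threshold is computed from one sequence rather than from all the alignments. All function names are made up for this example.

```python
import math


def iter_windows(seq, size=12, max_gap=0.6):
    """Yield (start, end, window) for windows whose gap fraction is <= max_gap."""
    for start in range(len(seq) - size + 1):
        window = seq[start:start + size]
        if window.count("-") / float(size) <= max_gap:
            yield start, start + size, window


def upper_threshold(values, n_sd):
    """Return mean + n_sd * population standard deviation of the values."""
    mean = sum(values) / float(len(values))
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / float(len(values)))
    return mean + n_sd * sd


def mismatch_fraction(window, reference):
    """Hypothetical divergence score: fraction of positions differing from the reference."""
    diffs = sum(1 for a, b in zip(window, reference) if a != b)
    return diffs / float(len(window))


def flag_outlier_windows(seq, reference, size=12, max_gap=0.6, n_sd=2):
    """Return the (start, end) spans whose score exceeds the threshold."""
    scored = [(s, e, mismatch_fraction(w, reference[s:e]))
              for s, e, w in iter_windows(seq, size, max_gap)]
    if not scored:
        return []
    cutoff = upper_threshold([score for _, _, score in scored], n_sd)
    return [(s, e) for s, e, score in scored if score > cutoff]
```

In this sketch the arguments size, max_gap and n_sd play roles similar to the -w, -g and standard-deviation options of DivA, but the correspondence is only conceptual.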


-------------
INSTALLATION
-------------

DivA is a Python script and does not need any compilation. It was developed in Python 2.7.3 and uses the following modules, which should already be installed on the user's system:

- numpy
- AlignIO from Bio (Biopython)
- re
- os
- sys
- argparse

Make sure to put the bin directory in your path, where blosum62.txt should also be placed; alternatively, place blosum62.txt or another distance matrix of your choice in the directory where DivA is going to be used.
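Before running DivA, it may help to confirm that the modules listed above can be imported. The following is a minimal sketch of such a check; the function name missing_modules is hypothetical and is not part of DivA.

```python
import importlib

# Modules listed in this README; Bio.AlignIO is provided by Biopython.
REQUIRED = ["numpy", "Bio.AlignIO", "re", "os", "sys", "argparse"]


def missing_modules(names):
    """Return the names from the given list that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    if absent:
        print("Missing modules: " + ", ".join(absent))
    else:
        print("All required modules were found.")
```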


------
USAGE
------

usage: DivA.py [-h] [--mask] [--printAllwindows] [-w W] [-g G] [-p P] [-zp ZP]
               [-d D] [-zd ZD] [-o O] [-m M]
               alnNamesFile

Identify very divergent potentially non-homologous windows in a protein
multiple sequence alignment.

positional arguments:
  alnNamesFile       A txt file with the file name(s) of the MSA(s) on which
                     to perform the method

optional arguments:
  -h, --help         show this help message and exit
  --mask             Flag for the output of an alignment with the wrong
                     windows masked with XXs [default not set]
  --printAllwindows  Flag for the output of a file with the parameter values
                     and start and end positions of all the windows in the
                     MSA(s) [default not set]
  -w W               The size of the sliding window [default 12]
  -g G               Maximum gap content in a window to be considered [default
                     0.6]
  -p P               The number of standard deviations from the mean of the
                     alpha parameter to use as threshold [default 1]
  -zp ZP             The number of standard deviations from the mean of the
                     Zalpha parameter to use as threshold [default 2]
  -d D               The number of standard deviations from the mean of the
                     beta parameter to use as threshold [default 2]
  -zd ZD             The number of standard deviations from the mean of the
                     Zbeta parameter to use as threshold [default 2]
  -o O               Output basename prefix [default "out"]
  -m M               The amino acid distance matrix [default "blosum62.txt"]
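Conceptually, the --mask flag replaces the residues of the flagged windows with X. The sketch below illustrates that idea only; it is not taken from DivA's source. The function name mask_windows is hypothetical, and leaving gap characters untouched is an assumption of this example.

```python
def mask_windows(seq, spans, mask_char="X"):
    """Return seq with the residues inside each (start, end) span replaced.

    Spans are 0-based and end-exclusive. Gap characters ("-") are kept,
    which is an assumption of this sketch.
    """
    chars = list(seq)
    for start, end in spans:
        for i in range(start, min(end, len(chars))):
            if chars[i] != "-":
                chars[i] = mask_char
    return "".join(chars)
```

Overlapping spans are harmless here, because a position that is already masked is simply masked again.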





#Example:


 1. Create a file with the names/paths of the alignments to be analyzed. The final thresholds will be calculated using all those alignments.

 2. Run DivA:

python DivA.py ListOfAlignments.txt # Basic default DivA run

python DivA.py -h # Will display the help

python DivA.py ListOfAlignments.txt -o DivaOutput --mask --printAllwindows # The outputs will have the prefix "DivaOutput" and alignments with the wrong windows masked will be generated, as well as an extra output file containing all the windows with the four parameter values and start and end positions.

python DivA.py ListOfAlignments.txt -o DivaOutput -p 2 # The number of standard deviations from the mean of the alpha parameter is changed to 2 and the outputs will have the prefix "DivaOutput"
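The list file from step 1 can be written by hand or generated with a few lines of Python. The sketch below assumes one alignment path per line and a ".fasta" extension; neither is specified in this README, so adjust both as needed. The function name write_alignment_list is hypothetical.

```python
import glob
import os


def write_alignment_list(directory, list_path, pattern="*.fasta"):
    """Write the sorted paths matching pattern in directory, one per line.

    Returns the list of paths that were written.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

For example, write_alignment_list("alignments", "ListOfAlignments.txt") would list every .fasta file in a directory called alignments (a made-up directory name) in ListOfAlignments.txt.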


#Example files in the 'Test' directory
The Test.aln file corresponds to ortholog alignment 14518.fasta from Jarvis et al.


-----
CITE
-----

Zepeda Mendoza ML, Nygaard S, and da Fonseca R (2014)  "DivA: detection of non-homologous and very Divergent regions in protein sequence Alignments"

--------
CONTACT
--------

For any enquiries, please contact rute.r.da.fonseca@gmail.com