A set of Python command-line programs for obtaining additional statistics about microsatellites, also called simple sequence repeats (SSRs), from MISA output.
Example
.gitignore
INSTALL.txt
LICENSE.txt
README.txt
filterrepeatsmisa.bat
filterrepeatsmisa.py
format_border.bat
format_border.py
getsequences.bat
getsequences.py
imperfect.bat
imperfect.py
setpath.bat
statgetlongest.bat
statgetlongest.py
statistics_misa.bat
statistics_misa.py

README.txt

PySSRstat
=========

Version 1.0, 2015-08-31

PySSRstat is a set of command-line programs that work on the output of a Perl
script called MISA [1], which must be run first. MISA generates two output
files. The file ending in ".misa" is referred to below as the "MISA-file",
and the file ending in ".statistics" as the "MISA-statistics-file".

The PySSRstat programs use the MISA output files to analyse the distribution
of repeats, to find the longest repeats, and to filter the MISA-file by
repeat length (minimum and maximum) and, optionally, by border (n bp up- and
downstream of the microsatellite) for primer selection.
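
As an illustration of the input these programs work on, the following is a
minimal sketch of reading a MISA-file in Python. It assumes the usual
tab-delimited MISA layout with one header line and the columns ID, SSR nr.,
SSR type, SSR, size, start and end. The names MisaRecord and read_misa are
hypothetical and are not taken from the PySSRstat sources.

```python
import csv
import io
from typing import Iterator, NamedTuple


class MisaRecord(NamedTuple):
    """One SSR line of a MISA-file (assumed column layout)."""
    seq_id: str
    ssr_nr: int
    ssr_type: str   # e.g. "p2" for a perfect dinucleotide repeat, "c" for compound
    ssr: str        # e.g. "(AG)12"
    size: int       # length of the SSR in bp
    start: int      # 1-based start position
    end: int        # 1-based end position


def read_misa(handle) -> Iterator[MisaRecord]:
    """Yield one MisaRecord per data line of an open MISA-file."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) < 7:
            continue  # ignore blank or incomplete lines
        yield MisaRecord(row[0], int(row[1]), row[2], row[3],
                         int(row[4]), int(row[5]), int(row[6]))


# Small in-memory example instead of a real file.
SAMPLE = (
    "ID\tSSR nr.\tSSR type\tSSR\tsize\tstart\tend\n"
    "seq1\t1\tp2\t(AG)12\t24\t101\t124\n"
    "seq2\t1\tp3\t(AAT)6\t18\t40\t57\n"
)
records = list(read_misa(io.StringIO(SAMPLE)))
```

With a real file, the same function would be called on an open file object,
for example open("example.misa", newline="").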

PySSRstat was written for and used in Galasso and Ponzoni (2015) [2].

The PySSRstat programs were written by Mario Nenno in Python 3.4 and tested
on a PC with a Core i5-4570 CPU, 8 GB RAM and MS Windows 8.1.

All programs of PySSRstat are copyright 2015 by Mario Nenno and distributed
under the terms of the Revised BSD License. For more details see the file
LICENSE.txt.


Set 1) Distribution of repeats and longest repeats
==================================================

Program               Description
------------------------------------------------------------------
statistics_misa.py    Extract additional statistical data from the
                      MISA-statistics-file and MISA-file
                     
                      Input: 
                        - MISA-file
                        - MISA-statistics-file
                        - optional parameter -rpc for SSR repeat classes
                          (experimental)
                      
                      Output: repeats_analysis.txt


statgetlongest.py     Find accessions of the longest repeats

                      Input: 
                        - repeats_analysis.txt
                        - MISA-file

                      Output: longest-sequences-list.txt
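
The two steps of Set 1 can be pictured with the following minimal sketch. It
is not the code of statistics_misa.py or statgetlongest.py. It assumes that
each SSR is available as a tuple (id, SSR type, SSR, size) taken from the
MISA-file, and that perfect repeats are written as "(MOTIF)count". The
function names are hypothetical.

```python
import re
from collections import Counter

# Matches a single perfect repeat such as "(AG)12"; compound repeats do not match.
PERFECT = re.compile(r"\(([A-Za-z]+)\)(\d+)")


def repeat_distribution(records):
    """Count SSRs per SSR type and, for perfect repeats, per motif."""
    by_type = Counter()
    by_motif = Counter()
    for _seq_id, ssr_type, ssr, _size in records:
        by_type[ssr_type] += 1
        match = PERFECT.fullmatch(ssr)
        if match:
            by_motif[match.group(1).upper()] += 1
    return by_type, by_motif


def longest_per_type(records):
    """Map each SSR type to (largest size, [(id, SSR), ...]) including ties."""
    best = {}
    for seq_id, ssr_type, ssr, size in records:
        current = best.get(ssr_type)
        if current is None or size > current[0]:
            best[ssr_type] = (size, [(seq_id, ssr)])
        elif size == current[0]:
            current[1].append((seq_id, ssr))
    return best


# Made-up example data.
records = [
    ("seq1", "p2", "(AG)12", 24),
    ("seq2", "p2", "(AG)8", 16),
    ("seq3", "p3", "(AAT)6", 18),
    ("seq4", "p2", "(CT)12", 24),
    ("seq5", "c", "(AG)6(CT)7", 26),
]
by_type, by_motif = repeat_distribution(records)
longest = longest_per_type(records)
```

Ties are kept on purpose, so that several accessions sharing the longest
repeat of a type are all reported.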
                     

Set 2) Filter MISA-file by repeat length and border for primer selection
========================================================================

Program               Description
-------------------------------------------------------------------------------
filterrepeatsmisa.py  Filter the MISA file by minimum and maximum repeat 
                      length

                      Input:
                        - MISA-file
                        - <minimum>
                        - <maximum>

                      Output: filtered-repeats-sequence-list.txt

getsequences.py       Extract the previously filtered accessions from the
                      database file. Optionally filter for a border of n bp
                      up- and downstream of the microsatellite

                      Input:
                        - filtered-repeats-sequence-list.txt
                        - db file with original sequences in FASTA format
                        - [-b nnn] border in bp (optional)
                    
                      Output:
                        - index.txt
                        - getsequences-info.txt
                        - repeats-sequences.fas
                          or repeats-sequences-border.fas (with the border
                          option)
                        - border.txt (optional)
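
The filtering and border steps of Set 2 can be sketched as follows. This is
an illustration under stated assumptions, not the code of
filterrepeatsmisa.py or getsequences.py: the repeat length is taken to be the
size column of the MISA-file, the minimum and maximum are treated as
inclusive, start and end are 1-based inclusive positions, and an SSR without
enough flanking sequence is skipped. The function names are hypothetical.

```python
def filter_by_size(records, minimum, maximum):
    """Keep records whose size lies in [minimum, maximum] (assumed inclusive)."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return [rec for rec in records if minimum <= rec[2] <= maximum]


def flanks(sequence, start, end, border):
    """Return (upstream, ssr, downstream) for a 1-based inclusive SSR position.

    Returns None when fewer than `border` bases are available on either side.
    """
    left = start - 1 - border
    right = end + border
    if left < 0 or right > len(sequence):
        return None
    return sequence[left:start - 1], sequence[start - 1:end], sequence[end:right]


# Made-up example: records are (id, SSR, size, start, end).
sequences = {"seq1": "TTTTT" + "AGAGAG" + "CCCCC"}
records = [
    ("seq1", "(AG)3", 6, 6, 11),
    ("seq1", "(A)40", 40, 1, 40),
]
kept = filter_by_size(records, 6, 30)

seq_id, ssr, size, start, end = kept[0]
with_border = flanks(sequences[seq_id], start, end, 3)
too_wide = flanks(sequences[seq_id], start, end, 6)
```

Requiring the full border on both sides is one possible policy for primer
selection, because a primer cannot be placed in a flank that is missing.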


Set 3) Others
=============
imperfect.py          Statistics about imperfect repeats in MISA-file 
                      (experimental)

                      Input:
                        - MISA-file

                      Output:
                      - imperfect.txt

format_border.py      Format the file border.txt with spaces or tabs

                      Input:
                      - border file (output of getsequences.py)
                      - 'tab' or 'space' as column delimiter

                      Optional parameter:
                      -idt or --idtrunc to truncate the id at the first
                       underscore character

                      Output:
                      - border-space.txt (if space delimited)
                      - border-tab.txt (if tab delimited)
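
The formatting step can be pictured with the short sketch below. The column
layout of border.txt is not described in this README, so the rows used here
are made up, and the function names are hypothetical. The sketch only shows
the two ideas named above: truncating the id at the first underscore, and
writing the columns either tab delimited or padded with spaces.

```python
def truncate_id(seq_id):
    """Cut the id at the first underscore; ids without one are unchanged."""
    return seq_id.split("_", 1)[0]


def format_rows(rows, delimiter="tab", idtrunc=False):
    """Return one formatted text line per row.

    rows      -- list of lists of strings, first column is the id
    delimiter -- "tab" or "space"
    idtrunc   -- truncate the id column at the first underscore
    """
    if delimiter not in ("tab", "space"):
        raise ValueError("delimiter must be 'tab' or 'space'")
    if not rows:
        return []
    rows = [list(row) for row in rows]
    if idtrunc:
        for row in rows:
            row[0] = truncate_id(row[0])
    if delimiter == "tab":
        return ["\t".join(row) for row in rows]
    # Space delimited: pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return ["  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
            for row in rows]


# Made-up example rows.
rows = [["contig1_len500", "12", "AG"], ["c2_x", "7", "AAT"]]
tab_lines = format_rows(rows, "tab", idtrunc=True)
space_lines = format_rows(rows, "space")
```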


Flow of data and programs
=========================

a) Repeat analysis, distribution, longest:

 MISA-file
 MISA-statistics-file
    |
    |
[statistics_misa.py]  ->   repeats_analysis.txt
                              MISA-file
                                  |
                                  |
                          [statgetlongest.py]  -> longest-sequences-list.txt





b) Primer selection:

   MISA-file
       |
       |
[filterrepeatsmisa.py] -> filtered-repeats-sequence-list.txt
                           db file with original sequences
                                   |
                                   |
                           [getsequences.py]  ->  getsequences-info.txt
                                                  repeats-sequences.fas
                                                       or
                                                  repeats-sequences-border.fas
                                                  border.txt


Note: The file index.txt is for internal use and helps to speed up the
      extraction of accessions from the sequence db file.
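
Why an index speeds up extraction can be shown with a generic sketch. The
actual content and format of index.txt are not described in this README, so
the following is an assumption: it stores, for every FASTA record, the byte
offset of its header line, so that a sequence can be read with a single seek
instead of scanning the whole db file. The function names are hypothetical.

```python
import io


def build_fasta_index(handle):
    """Map each sequence id to the byte offset of its header line.

    handle must be opened in binary mode so that offsets are exact.
    """
    index = {}
    offset = handle.tell()
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:  # skip headers without an id
                index[fields[0].decode("ascii")] = offset
        offset = handle.tell()
    return index


def fetch_sequence(handle, index, seq_id):
    """Jump to the record of seq_id and return its sequence as one string."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # next record starts here
        parts.append(line.strip())
    return b"".join(parts).decode("ascii")


# In-memory stand-in for a FASTA db file.
db = io.BytesIO(b">seq1 example\nACGT\nAC\n>seq2\nGGG\n")
index = build_fasta_index(db)
first = fetch_sequence(db, index, "seq1")
second = fetch_sequence(db, index, "seq2")
```

The index is built once in a single pass. After that, each lookup costs one
seek plus the length of the requested record, which matters when only a few
filtered accessions are needed from a large db file.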


References
==========
[1] Thiel T., Michalek W., Varshney R., Graner A. 2003. Exploiting EST 
databases for the development and characterization of gene-derived SSR-markers 
in barley (Hordeum vulgare L.). Theoretical and Applied Genetics 106: 411-422.
Link: http://pgrc.ipk-gatersleben.de/misa/.


[2] Galasso I., Ponzoni E. 2015. In Silico Exploration of Cannabis sativa L.
Genome for Simple Sequence Repeats (SSRs). American Journal of Plant Sciences
6: 3244-3250. doi: 10.4236/ajps.2015.619315
Link: http://www.scirp.org/Journal/PaperInformation.aspx?PaperID=62020
Fulltext: http://dx.doi.org/10.4236/ajps.2015.619315