PredRAD

High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes, generally known as restriction-site associated DNA sequencing (RAD-seq), is now one of the most commonly used strategies for generating single nucleotide polymorphism data in eukaryotes. The choice of restriction enzyme is critical to the design of any RAD-seq study because it determines the number of genetic markers that can be obtained for a given species and, ultimately, the success of a project.

When designing a study that uses RAD-seq or a related methodology, researchers face two fundamental questions: (i) which restriction enzyme is best for obtaining a desired number of RAD tags in the organism of interest, and (ii) how many markers can be obtained with a particular enzyme in that organism? This software pipeline gives approximate answers to both questions and helps guide the design of studies that use RAD sequencing and related methods.

This repository contains the software code and output results from Herrera S., P.H. Reyes-Herrera & T.M. Shank (2015) Predicting RAD-seq Marker Numbers across the Eukaryotic Tree of Life.


Requirements


Install

Download the Python and shell scripts.

Make the shell script executable (for example, chmod u+x restriction_site_search.sh).


Usage

  • restriction_site_search.sh. This shell script searches for every restriction site listed in the pattern file (patternfilename) in every genome listed in the genome file (genomefilename). It produces the following files:

    • ALL.count.txt - a table with the number of restriction sites found in each genome
    • ALL.size.txt - a table with the size of each genome
    • If bowtieflag equals YES, it also produces ALL.aligned.txt, ALL.failed.txt, ALL.processed.txt and ALL.suppressed.txt - each a table summarizing the bowtie output (reads aligned, failed, processed and suppressed) for each genome.

    The input argument is:

    • parametersfilename: name of a file containing the following four parameters (see test/params.txt)
      • genomefilename: name of a file containing a two-column table with (1) the species code and (2) a link to the whole-genome FASTA file or the path to a local FASTA file (for an example with URLs see test/genomeFileExample.txt; for an example with local file paths see test/genomeFileExample_localfile.txt)
      • patternfilename: name of a file containing a two-column table with (1) the restriction site regular expression and (2) the restriction site name (see test/Patterns_list.txt)
      • bowtieflag: YES (default value) to align with bowtie; any other value to skip bowtie
      • localfile flag: NO (default value) to download the FASTA files; YES to look for a local file at the indicated path

    To run, type in a shell:

    ./restriction_site_search.sh parametersfilename
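
The core idea of the search, counting matches of a restriction-site regular expression in a genome sequence, can be illustrated with a minimal Python sketch. This is not the code of restriction_site_search.sh; the function names read_fasta_sequences and count_sites are hypothetical, and the choices to count overlapping matches and to scan only the given strand are assumptions made for illustration.

```python
import re


def read_fasta_sequences(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_sites(sequence, pattern):
    """Count matches of a restriction-site regex, overlapping ones included."""
    # The zero-width lookahead lets the scan advance one base at a time,
    # so overlapping occurrences are each counted (an assumption of this
    # sketch, not a documented behaviour of the shell script).
    return sum(1 for _ in re.finditer("(?=(?:%s))" % pattern, sequence))


if __name__ == "__main__":
    fasta_lines = [">chr1", "GAATTCAA", "ccgaattc"]
    total = sum(count_sites(seq, "GAATTC")
                for _, seq in read_fasta_sequences(fasta_lines))
    print("EcoRI-like sites found:", total)
```

Summing count_sites over all records of a genome gives a per-genome count comparable in spirit to one cell of ALL.count.txt.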


  • obtain_nucleotides_model.py. This Python script obtains the nucleotide, dinucleotide and trinucleotide distributions for each genome listed in the input file (genomefilename).

    The input arguments are:

    • genomefilename: name of a file containing a two-column table with (1) the species code and (2) a link to the whole-genome FASTA file or the path to a local FASTA file (for an example with URLs see test/genomeFileExample_2.txt; for an example with local file paths see test/genomeFileExample_localfile.txt)
    • resultsfile: name of the output file
    • localfileflag: yes if the files are stored locally, no otherwise

    To run, type in a shell:

    python obtain_nucleotides_model.py genomefilename resultsfile localfileflag

For details of the events that occur while the script runs, check the .log file.
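
A nucleotide, dinucleotide or trinucleotide distribution of the kind described above can be computed by counting overlapping k-mers. The following is a minimal sketch, not the implementation in obtain_nucleotides_model.py; the function name kmer_frequencies is hypothetical, and skipping any k-mer that contains a character other than A, C, G or T (for example N) is an assumption.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Return relative frequencies of overlapping k-mers made of A, C, G, T."""
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are ignored.
        if set(kmer) <= valid:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    genome = "AACGTTNACG"
    for k, label in ((1, "nt"), (2, "dint"), (3, "trint")):
        print(label, kmer_frequencies(genome, k))
```

Calling the function with k = 1, 2 and 3 yields the three distributions; each set of frequencies sums to 1 over the k-mers observed.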


  • sequence_probability.py. This Python script obtains the probability of each restriction site listed in the pattern file (patternfilename) in every genome, based on the nucleotide (nt), dinucleotide (dint) and trinucleotide (trint) frequencies in the distribution file (distributionfile). It produces the following files:

    • $distributionfile$_nt - a table with the sequence probabilities (based on nucleotide probabilities)
    • $distributionfile$_dint - a table with the sequence probabilities (based on dinucleotide probabilities)
    • $distributionfile$_trint - a table with the sequence probabilities (based on trinucleotide probabilities)

    The input arguments are:

    • distributionfile - output from genome_nucleotide_distrib_paper (see test/DistributionFile.txt)
    • patternfilename - name of a file containing a two-column table with (1) the restriction site regular expression and (2) the restriction site name (see test/Patterns_list.txt)

    To run, type in a shell:

    python sequence_probability.py distributionfile patternfilename
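
One common way to turn k-mer frequencies into the probability of a recognition sequence is a Markov chain of order k - 1: the first k-mer contributes its frequency, and each later base contributes the ratio of a k-mer frequency to the frequency of its (k - 1)-base prefix. The sketch below follows that idea and is not taken from sequence_probability.py, whose exact model may differ. The names site_probability and uniform_frequencies are hypothetical, the frequency tables are assumed to be dictionaries keyed by k-mer, and only unambiguous sites (plain strings, not regular expressions) are handled.

```python
from itertools import product


def site_probability(site, freqs, k):
    """Probability of `site` under an order-(k-1) Markov model.

    `freqs` maps each order (1..k) to a dict of k-mer frequencies.
    """
    if len(site) < k:
        raise ValueError("site is shorter than the k-mer size")
    prob = freqs[k].get(site[:k], 0.0)
    for i in range(1, len(site) - k + 1):
        kmer = site[i:i + k]
        if k == 1:
            # Independent bases: multiply single-nucleotide frequencies.
            prob *= freqs[1].get(kmer, 0.0)
        else:
            # Conditional probability of the last base given its context.
            context = freqs[k - 1].get(kmer[:-1], 0.0)
            if context == 0.0:
                return 0.0
            prob *= freqs[k].get(kmer, 0.0) / context
    return prob


def uniform_frequencies(max_k):
    """Illustrative tables in which every k-mer is equally likely."""
    return {k: {"".join(p): 0.25 ** k for p in product("ACGT", repeat=k)}
            for k in range(1, max_k + 1)}


if __name__ == "__main__":
    freqs = uniform_frequencies(3)
    for k, label in ((1, "nt"), (2, "dint"), (3, "trint")):
        print(label, site_probability("GAATTC", freqs, k))
```

With uniform frequencies the three models agree; they diverge once the tables reflect the composition of a real genome. Multiplying such a probability by the number of positions in a genome gives a rough expected count of sites under the chosen model.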

License

Created by Santiago Herrera and Paula H. Reyes-Herrera on 11 June 2014. Copyright (c) 2014 Santiago Herrera and Paula H. Reyes-Herrera. All rights reserved.

PredRAD is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2.