For simulations of targeted-sequencing experiments under a known species/gene tree distribution, extracts the reference sequences that will be used for mapping.

merlyescalona/refselector

Reference selector

© 2017 Merly Escalona (merlyescalona@uvigo.es)

University of Vigo, Spain, http://darwin.uvigo.es

SimPhy/NGSphy refselector

For simulations of targeted-sequencing experiments under a known species/gene tree distribution, extracts the reference sequences that would have been used as targets in the probe design.

Assumptions

  • We are working under a SimPhy/NGSphy simulation pipeline scenario and follow the same hierarchical folder structure.

  • It is also assumed that the SimPhy project folder has been compressed using simphycompress and that the length of the N sequence used for concatenation is known. To know more about the simulation pipeline scenario go to:

  • Species tree replicates are filtered based on the number of loci (the number of sequence/FASTA files) present in each replicate folder.

  • True sequences from the SimPhy/INDELible simulation process do not contain N's.
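The replicate filtering described above can be sketched as follows. This is only an illustration, not the tool's actual code: the function names, the `min_loci` threshold, the accepted file extensions, and the layout (one sub-folder per replicate inside the project folder) are all assumptions made for the example.

```python
import os


def count_loci(replicate_dir, prefix):
    """Count FASTA files in one replicate folder whose names start with `prefix`."""
    return sum(
        1
        for name in os.listdir(replicate_dir)
        if name.startswith(prefix) and name.endswith((".fasta", ".fa"))
    )


def filter_replicates(project_dir, prefix, min_loci=1):
    """Return the replicate folder names holding at least `min_loci` FASTA files.

    Assumption: every sub-folder of `project_dir` is one species tree replicate.
    """
    kept = []
    for name in sorted(os.listdir(project_dir)):
        path = os.path.join(project_dir, name)
        if os.path.isdir(path) and count_loci(path, prefix) >= min_loci:
            kept.append(name)
    return kept
```

A call such as `filter_replicates("my_project", "data", min_loci=10)` would keep only the replicates with ten or more matching FASTA files; the path, prefix, and threshold here are hypothetical.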

Input

  • SimPhy folder path
  • prefix of the existing FASTA files
  • prefix for the output files
  • method indicating how to obtain the reference sequences
  • (optional) length of the N sequence that will be used to separate the sequences when concatenated
  • (optional) file with the description of the sequences that will be used as reference.

Methods for reference selection

Method used to obtain the reference sequence. Values range from 0 to 4 (default: 0), where:

  • (0): Uses the outgroup sequence of each locus as the reference.
  • (1): Extracts a specific sequence per locus
  • (2): Selects a random sequence from the ingroups. Same sequence throughout the loci.
  • (3): Randomly selects a species and generates a consensus sequence from the sequences belonging to that species.
  • (4): Generates a consensus sequence from all the sequences involved (will need parameter -sdf/--seq-desc-file)

**NOTE:** The higher the method number, the longer it will take to generate the reference loci.
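To make the consensus-based methods (3 and 4) more concrete, here is a minimal sketch of a majority-rule consensus over aligned sequences. It is an assumption that a simple per-column majority is close to what the tool does; the real implementation may treat ties, gaps, or ambiguity codes differently. The function name `consensus` and the alphabetical tie-break are choices made for this example.

```python
from collections import Counter


def consensus(sequences):
    """Return a majority-rule consensus of equally long, aligned sequences.

    Assumptions of this sketch: ties are broken alphabetically, and gap
    characters are counted like any other symbol.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    result = []
    for column in zip(*sequences):
        counts = Counter(column)
        # Most frequent symbol first; equal counts fall back to alphabetical order.
        result.append(min(counts, key=lambda base: (-counts[base], base)))
    return "".join(result)
```

For method 3 the input would be the sequences of one randomly chosen species, and for method 4 all sequences of the locus. Every column of every sequence is visited, which is consistent with the note that higher method numbers take longer.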

Reference description file

Each description should be on a separate line. Descriptions are assigned to replicates in file order (i.e. line 1 is used for replicate 1). If there are fewer descriptions than species tree replicates, the remaining replicates will use sequence 1_0_0 as the reference.
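The line-to-replicate mapping with its 1_0_0 fallback could look like the sketch below. The function name and the handling of blank lines and surplus lines are assumptions made for illustration; only the ordering rule and the default label come from the description above.

```python
def descriptions_for_replicates(lines, num_replicates, default="1_0_0"):
    """Map description lines to replicates: line 1 -> replicate 1, and so on.

    Assumptions of this sketch: blank lines fall back to `default`, and
    lines beyond `num_replicates` are ignored.
    """
    descriptions = [line.strip() for line in lines][:num_replicates]
    descriptions = [desc if desc else default for desc in descriptions]
    # Pad with the default label when the file is shorter than the replicate count.
    descriptions.extend([default] * (num_replicates - len(descriptions)))
    return descriptions
```

The `lines` argument can be an open file handle, since iterating over a text file yields one line at a time.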

Output

  • The output will be a directory of FASTA files
  • There should be as many FASTA files as replicates have been generated for the current SimPhy project
  • Each file will contain all the selected loci, either concatenated or as a multiple alignment file
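For the concatenated output, the selected loci are separated by a run of N's of known length. A minimal sketch of that idea follows; the function names are hypothetical and the tool's actual output format may differ. Splitting the sequence back into loci relies on the stated assumption that the true sequences contain no N's.

```python
def concatenate_loci(loci, n_length):
    """Join locus sequences with a spacer made of `n_length` N characters."""
    if n_length < 0:
        raise ValueError("n_length must be non-negative")
    return ("N" * n_length).join(loci)


def split_loci(concatenated, n_length):
    """Recover the loci from a concatenated reference.

    Only valid when the loci themselves contain no N's and n_length > 0.
    """
    if n_length <= 0:
        # Without a spacer the locus boundaries cannot be recovered.
        return [concatenated]
    return concatenated.split("N" * n_length)
```

Knowing the spacer length matters downstream: with it, coordinates on the concatenated reference can be translated back to individual loci.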

Install

  • Clone this repository:
git clone git@github.com:merlyescalona/refselector.git
  • Change your current directory to the downloaded folder:
cd refselector
  • Install:
python setup.py install --user

Usage

Documentation

Go to the wiki
