findsim_seq

Biopiece: findsim_seq

Description

Warning: findsim_seq is under active development and testing.

The purpose of findsim_seq is to find similarities between query DNA/RNA sequences and a database of small subunit ribosomal DNA/RNA sequences. The sequence similarities are calculated as the number of shared unique oligos/kmers divided by the smallest number of unique oligoes in either the query or database sequence. This yields a rough under estimate of similarity e.g. 50% oligo similarity may correspond to 80% similarity on a nucleotide level are records with identity and query sequence and subject sequence indeces or IDs. The records are sorted with repect to query seqence order and decending score. The records look like this:

REC_TYPE: findsim
Q_ID: 9
S_ID: 209
SCORE: 0.96
---

Or using the -Q and -S switches the indeces are replaced with IDs:

---
REC_TYPE: findsim
Q_ID: ID00000098
S_ID: AANN01161813.9.1494 Eukaryota;Metazoa;Chordata;Craniata; ...
SCORE: 0.96

The -M switch will limit the output of hits so that the diversity of the hits is less than or equal to the best hit - this value. E.g. if the best hit identity is 98.9% and -M is 2.0 all hits within the range 98.9% down to 96.9% will be output.

Using the -r switch will cause all hits to be realigned using muscle to yiels a more precise sequence identity.

The performance is good. 2000 full length 16S RNA sequences (~1500 nt) was matched aginst 600,000 database sequences in ~10 minutes using a single CPU and ~700Mb memory on a laptop.

findsim_seq can be used for any type of amplicon data - it is not restricted to 16S rRNA amplicons.

findsim_seq is an implementation of the SimRank logic proposed by Niels Larsen.

Usage

... | findsim_seq [options]

Options

[-?         | --help]                   # Print full usage description.
[-d <sting> | --database=<string>]      # Database to use - a FASTA file.
[-k <uint>  | --kmer=<uint>]            # Kmer size or oligo size                              -  Default=8
[-s <uint>  | --step=<uint>]            # Kmer step size                                       -  Default=1
[-m <float> | --min_score=<float>]      # Minimum score to include                             -  Default=0.5
[-h <uint>  | --max_hits=<uint>]        # Maximum number of hits per query                     -  Default=20
[-M <float> | --max_diversity=<float>]  # Maximum diversity between best and worst hit (in %).
[-Q         | --query_ids]              # Replace query index with query ID.
[-S         | --subject_ids]            # Replace subject index with subject ID.
[-r         | --realign]                # Realign hits to obtain precise identity.
[-I <file!> | --stream_in=<file!>]      # Read input from stream file                          -  Default=STDIN
[-O <file>  | --stream_out=<file>]      # Write output to stream file                          -  Default=STDOUT
[-v         | --verbose]                # Verbose output.

Examples

In the process to classify a set of sequences using the taxonomy data in SILVA we can do:

read_fasta -i amplicon_reads.fna | findsim_seq -d silva.fna -S | grab -Kp REC_TYPE

Author

mail@maasha.dk

April 2012

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

findsim_seq is part of the Biopieces framework.

http://www.biopieces.org

Provide feedback

Saved searches

Use saved searches to filter your results more quickly