
Linux: https://ci.inria.fr/gatb-core/view/RConnector/job/tool-rconnector-build-debian7-64bits-gcc-4.7/
Mac OSX: https://ci.inria.fr/gatb-core/view/RConnector/job/tool-rconnector-build-macos-10.9.5-gcc-4.2.1/

http://www.gnu.org/licenses/agpl-3.0.en.html

What is Short Read Connector (SRC)?

Short Read Connector compares two read sets B and Q. For each read from Q it provides either:
* the number of occurrences of each k-mer of the read in the set B (SRC_counter), or
* a list of reads from B that share enough k-mers with the tested read from Q, or with a window of it (SRC_linker).

Citation: Camille Marchet, Antoine Limasset, Lucie Bittner, Pierre Peterlongo. A resource-frugal probabilistic dictionary and applications in (meta)genomics. 2016.

Getting the latest source code

Requirements

CMake 2.6+; see http://www.cmake.org/cmake/resources/software.html

C++ compiler; compilation was tested with gcc and g++ version>=4.5 (Linux) and clang version>=4.1 (Mac OSX).

Instructions

# get a local copy of source code
git clone --recursive https://github.com/GATB/rconnector.git

# compile the code and run a simple test on your computer
cd rconnector
sh INSTALL

Getting a binary stable release

Binary releases for Linux and Mac OSX are provided in the "Releases" tab of the Github/rconnector web page.

Quick start

Run a simple test looking for reads from data/c2.fasta.gz that share at least 20 kmers (k=25) with data/c1.fasta.gz. Kmers indexed from data/c1.fasta.gz are those occurring at least 2 times.

 sh short_read_connector.sh -b data/c1.fasta.gz -q data/fof.txt

Usage

Minimal call

Calling SRC_linker between read sets bank and query:

sh short_read_connector.sh -b bank -q query

Options

 -c: use short_read_connector_counter (SRC_counter)
 -r: with this option (incompatible with SRC_counter), no details about pairs of similar reads are output. Only the ids of query reads similar to at least one read from the bank are output.
 -p prefix. All output files will start with this prefix. Default="short_read_connector_res"
 -g: with this option, if a file of solid kmers exists with the same prefix name and the same k value, it is re-used rather than re-computed.
 -k value. Set the length of used kmers. Must fit the compiled value. Default=31
 -f value. Fingerprint size. Size of the key associated to each indexed value, limiting false positives. Default=12
 -G value. Gamma value. MPHF parameter for expert users. Default=2
 -a: minimal kmer abundance (kmers from the bank seen fewer times than this value are not indexed). Default=2
 -s: minimal percentage of shared kmer span for considering 2 reads as similar. The kmer span is the number of bases of the query read covered by a kmer shared with the target read. If a read of length 80 has a kmer span of 60 with another read from the bank (of unknown size), then the percentage of shared kmer span is 75%. If at least one window (of size "windows_size") contains at least kmer_threshold percent of positions covered by shared kmers, the read pair is kept.
 -w: size of the window. If the window size is zero (default value), the full read is considered.
 -t: number of threads used. Default=0
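
The kmer-span rule behind the -s and -w options can be illustrated with a small Python sketch. It is not the tool's implementation: the function names are hypothetical, the shared kmers are given as a list of start positions on the query read, and sliding the window one base at a time is an assumption, since the exact windowing strategy is not described here.

```python
def covered_positions(read_length, k, shared_starts):
    """Mark each base of the query read covered by at least one shared kmer."""
    covered = [False] * read_length
    for start in shared_starts:
        for pos in range(start, min(start + k, read_length)):
            covered[pos] = True
    return covered


def kmer_span(read_length, k, shared_starts):
    """Number of bases of the query read covered by shared kmers."""
    return sum(covered_positions(read_length, k, shared_starts))


def best_window_percentage(read_length, k, shared_starts, window_size=0):
    """Highest percentage of covered positions over all windows.

    A window size of zero means the full read is considered.
    Assumption: windows slide one base at a time.
    """
    covered = covered_positions(read_length, k, shared_starts)
    if window_size <= 0 or window_size > read_length:
        window_size = read_length
    best = max(
        sum(covered[i:i + window_size])
        for i in range(read_length - window_size + 1)
    )
    return 100.0 * best / window_size


# Read of length 80, k=31, shared kmers starting at positions 0..29:
# bases 0..59 are covered, so the kmer span is 60.
starts = range(30)
print(kmer_span(80, 31, starts))                   # 60
print(best_window_percentage(80, 31, starts))      # 75.0
print(best_window_percentage(80, 31, starts, 60))  # 100.0
```

With a threshold of 80, this read pair would be rejected on the full read (75%) but kept with a window of size 60, where one window is entirely covered.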
 

Output Format

Short reads counter

Command:

 sh short_read_connector.sh -b data/c1.fasta.gz -q data/fof.txt -c

Two first lines of the output file:

 #query_read_id mean median min max number of shared 31mers with banq read set data/c1.fasta.gz
 0 3.614286 4 2 5

The first line is the file header. The second line can be decomposed as:
* 0: id of the query read (from the read set contained in fof.txt)
* 3.614286: mean number of occurrences of its k-mers (here with k=31) in the read set data/c1.fasta.gz
* 4: median number of occurrences of its k-mers (here with k=31) in the read set data/c1.fasta.gz
* 2: minimal number of occurrences of a k-mer from read 0 in the read set data/c1.fasta.gz
* 5: maximal number of occurrences of a k-mer from read 0 in the read set data/c1.fasta.gz
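
For downstream processing, the SRC_counter output can be read with a few lines of Python. This is a minimal sketch based only on the two example lines shown above: the names `CounterRecord` and `read_counter_records` are hypothetical, and it assumes whitespace-separated fields, integer read ids, and integer minimum and maximum values.

```python
from typing import NamedTuple


class CounterRecord(NamedTuple):
    query_read_id: int
    mean: float
    median: float
    minimum: int
    maximum: int


def parse_counter_line(line):
    """Parse one data line: id, mean, median, min, max (assumed layout)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return CounterRecord(
        int(fields[0]),
        float(fields[1]),
        float(fields[2]),
        int(fields[3]),
        int(fields[4]),
    )


def read_counter_records(lines):
    """Yield one record per data line, skipping the '#' header and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        yield parse_counter_line(line)


example = [
    "#query_read_id mean median min max number of shared 31mers with banq read set data/c1.fasta.gz",
    "0 3.614286 4 2 5",
]
for record in read_counter_records(example):
    print(record.query_read_id, record.mean, record.maximum)
```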

Short reads linker

Command:

 sh short_read_connector.sh -b data/c1.fasta.gz -q data/fof.txt

Two first lines of the output file:

 #query_read_id [target_read_id-kmer_span (k=31)-kmer_span query percentage]* or U (unvalid read, containing not only ACGT characters or low complexity read)
 1:676-93-93.000000 809-89-89.000000

The first line is the file header. The second line can be decomposed as:
* 1: id of the query read
* 676-93-93.000000: a target read and its pieces of information:
  * 676: id of the target read
  * 93: kmer-span (number of positions of read 1 covered by at least one solid kmer present in read 676)
  * 93.000000: kmer-span percentage with respect to the length of read 1 (here 100)
* 809-89-89.000000: a second target read and its pieces of information (and so on).
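
The SRC_linker line format described above can be decoded as follows. This is a hedged sketch derived from the single example line: the names `Target` and `parse_linker_line` are hypothetical, read ids are assumed to be integers, and lines flagged U (invalid reads) are not handled because their exact layout is not shown here.

```python
from typing import NamedTuple


class Target(NamedTuple):
    target_read_id: int
    kmer_span: int
    span_percentage: float


def parse_linker_line(line):
    """Split 'query:target-span-pct target-span-pct ...' into its parts."""
    query_part, _, targets_part = line.strip().partition(":")
    targets = []
    for entry in targets_part.split():
        target_id, span, percentage = entry.split("-")
        targets.append(Target(int(target_id), int(span), float(percentage)))
    return int(query_part), targets


query_id, targets = parse_linker_line("1:676-93-93.000000 809-89-89.000000")
print(query_id)
for target in targets:
    print(target.target_read_id, target.kmer_span, target.span_percentage)
```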

Note that with the -r option, only the ids of query reads similar to at least one read from the bank are output. In this example the output would be limited to

 #query_read_id 
 1

 

Input read sets

The input read sets are provided using a file of files (fof). Each line of a file of files contains either a read file or another file of files. Let's look at a few usual cases (indented entries show the content of a file):
* Case1: I have a unique read set composed of a unique read file (reads.fq.gz).
  * fof.txt:
    * reads.fq.gz
* Case2: I have a unique read set composed of a pair of read files (reads_R1.fq.gz and reads_R2.fq.gz). This may be the case with paired-end sequencing.
  * fof.txt:
    * fof_reads.txt
  * with fof_reads.txt:
    * reads_R1.fq.gz
    * reads_R2.fq.gz
* Case3: I have two read sets, each composed of a unique read file: reads1.fq.gz and reads2.fq.gz.
  * fof.txt:
    * reads1.fq.gz
    * reads2.fq.gz
* Case4: I have two read sets, each composed of two read files: reads1_R1.fq.gz and reads1_R2.fq.gz, and reads2_R1.fq.gz and reads2_R2.fq.gz.
  * fof.txt:
    * fof_reads1.txt
    * fof_reads2.txt
  * with fof_reads1.txt:
    * reads1_R1.fq.gz
    * reads1_R2.fq.gz
  * with fof_reads2.txt:
    * reads2_R1.fq.gz
    * reads2_R2.fq.gz
* and so on...
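
As an illustration, the files of Case4 can be generated with a short Python script. This is only a sketch: the helper names `write_fof` and `build_case4` are hypothetical, and the entries are written as bare file names because how relative paths are resolved is not specified here.

```python
import tempfile
from pathlib import Path


def write_fof(path, entries):
    """Write a file of files: one read file or nested fof per line."""
    path.write_text("".join(f"{entry}\n" for entry in entries))


def build_case4(directory):
    """Create the three fof files of Case4 and return the top-level one."""
    directory = Path(directory)
    write_fof(directory / "fof_reads1.txt", ["reads1_R1.fq.gz", "reads1_R2.fq.gz"])
    write_fof(directory / "fof_reads2.txt", ["reads2_R1.fq.gz", "reads2_R2.fq.gz"])
    top_level = directory / "fof.txt"
    # Each line of the top-level file stands for one read set.
    write_fof(top_level, ["fof_reads1.txt", "fof_reads2.txt"])
    return top_level


with tempfile.TemporaryDirectory() as workdir:
    fof = build_case4(workdir)
    print(fof.read_text(), end="")
```

The resulting fof.txt would then be passed to the -q option, as in the commands above.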

Contact

Contact: Pierre Peterlongo: pierre.peterlongo@inria.fr