IdentiBin

(Known in another life as seanie_parallel)

This program has been designed for two primary purposes:

  1. Identify identical bins from multiple assembly techniques so that the best version (i.e. most complete and least redundant/contaminated) of a bin can be selected.
  2. Identify the amount of new information contained within one bin compared to another. Good for understanding how assembly techniques affect bin fidelity.

Admittedly, this is similar in purpose to other software packages, such as the excellent program dRep, and the two could be used in conjunction. dRep uses average nucleotide identity (ANI) metrics to determine the relatedness of bins so that bins can be dereplicated. One potential pitfall of ANI is that the metric can be susceptible to incompleteness and contig fragmentation (the kind you might see, for instance, when comparing assemblies of bins at different coverage levels). IdentiBin instead compares open reading frame (ORF) or amino acid calls to avoid these pitfalls.

The main difference here is that we are comparing the percentage of novel 100% identical ORF clusters between two bins. So, the resulting metric is EXTREMELY conservative. This was by design, so that we could identify bins from identical organisms between assemblies, not just closely related organisms within a species.

Another way to look at this metric (from the outfile):

info that b1 adds to b2 = 1 - (num mixed clusters / total clusters with an ORF from b1)

b1 = bin of interest 1
b2 = bin of interest 2
num mixed clusters = number of 100% identical ORF clusters shared by b1 and b2
total clusters with an ORF from b1 = total number of 100% identical ORF clusters that contain an ORF from bin of interest 1
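
The arithmetic behind this metric can be made concrete with a small Ruby sketch. It is not taken from identibin.rb: the method name `info_added` and the representation of each cluster as the set of bins that contributed an ORF to it are assumptions made purely for illustration.

```ruby
require "set"

# Hypothetical helper (not part of identibin.rb): given clusters of 100%
# identical ORFs, each recorded as the set of bins contributing an ORF,
# return the fraction of b1's clusters that are NOT shared with b2.
def info_added(clusters, b1, b2)
  with_b1 = clusters.select { |bins| bins.include?(b1) }
  return nil if with_b1.empty? # metric is undefined when b1 has no clusters

  mixed = with_b1.count { |bins| bins.include?(b2) }
  1.0 - mixed.to_f / with_b1.size
end

# Toy data: four clusters, labelled by the bins whose ORFs they contain.
clusters = [
  Set["b1", "b2"], # mixed cluster
  Set["b1"],       # only in b1
  Set["b1"],       # only in b1
  Set["b2"]        # only in b2
]

puts info_added(clusters, "b1", "b2") # b1 has 3 clusters, 1 of them mixed
puts info_added(clusters, "b2", "b1") # b2 has 2 clusters, 1 of them mixed
```

Because the denominator depends on which bin is treated as b1, the metric is not symmetric, which is why the outfile reports every pair in both directions.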

Dependencies

The following Ruby gems must be installed: fileutils, systemu, parse_fasta, abort_if, parallel.

To install:

gem install fileutils systemu parse_fasta abort_if parallel

Installation

Download the repository from GitHub: IdentiBin

Usage

After installing, you can call the program with:

ruby ~/software/seanie_parallel/identibin.rb

This returns the usage information:

USAGE: ruby ~/software/seanie_parallel/identibin.rb num_threads tmpdir bin1_orfs.fa bin2_orfs.fa [bin3_orfs.fa ...] > jawns.txt

Arguments for the identibin.rb script are positional, meaning you must provide them in the order indicated.

num_threads = The max number of threads to use for parallelized pairwise comparisons
tmpdir = Directory for storing intermediate results and logs. We recommend creating a dedicated 'temp' directory for this.
bin1_orfs.fa = FASTA-formatted file with all ORFs or amino acid calls for a bin
bin2_orfs.fa = same as above; continue for each bin in the comparison
jawns.txt = generic output file name. Change it to what you like.
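
To show what "positional" means in practice, here is a minimal Ruby sketch of how such an argument list could be unpacked. The `parse_args` method is hypothetical and written only for this illustration; identibin.rb may handle its arguments differently.

```ruby
# Hypothetical illustration of positional-argument handling; identibin.rb
# may parse its arguments differently.
def parse_args(argv)
  # The first two values are fixed; everything after them is an ORF file.
  num_threads, tmpdir, *orf_files = argv
  if orf_files.size < 2
    raise ArgumentError, "need num_threads, tmpdir and at least two ORF files"
  end

  { num_threads: Integer(num_threads), tmpdir: tmpdir, orf_files: orf_files }
end

opts = parse_args(%w[4 temp bin1_orfs.fa bin2_orfs.fa bin3_orfs.fa])
puts opts[:num_threads]
puts opts[:tmpdir]
puts opts[:orf_files].join(", ")
```

Swapping the first two arguments in this sketch would make `Integer("temp")` fail, which is the kind of error that positional interfaces produce when the order is wrong.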

Example:

The Zetaproteobacteria have several "species" defined as operational taxonomic units (ZOTUs). Here we compare four bins from two ZOTUs and two assembly techniques.

Bin           ZOTU     Assembly Technique
S1_10_Zeta1   ZOTU2    10% subassembly
S1_Zeta1      ZOTU2    Individual sample assembly
S6_Zeta10     ZOTU2    Individual sample assembly
S6_Zeta3      ZOTU10   Individual sample assembly

Running the script:

mkdir temp
ruby ~/software/seanie_parallel/identibin.rb 4 temp S1_10_Zeta1.faa S1_Zeta1_MANUALCURATION.faa S6_Zeta10_MANUALCURATION.faa S6_Zeta3_MANUALCURATION.faa > jawns.txt

Produces this outfile (jawns.txt):

b1 b2 info that b1 adds to b2 -- (1 - (num mixed clusters / total clusters with an ORF from b1)
S1_10_Zeta1_faa S1_Zeta1_faa 0.452431289640592
S1_Zeta1_faa S1_10_Zeta1_faa 0.468990261404408
S6_Zeta10_faa S1_10_Zeta1_faa 0.9590371621621622
S1_10_Zeta1_faa S6_Zeta10_faa 0.9487315010570825
S6_Zeta3_faa S1_10_Zeta1_faa 1.0
S1_10_Zeta1_faa S6_Zeta3_faa 1.0
S6_Zeta10_faa S1_Zeta1_faa 0.9569256756756757
S1_Zeta1_faa S6_Zeta10_faa 0.9477191184008201
S1_Zeta1_faa S6_Zeta3_faa 1.0
S6_Zeta3_faa S1_Zeta1_faa 1.0
S6_Zeta10_faa S6_Zeta3_faa 1.0
S6_Zeta3_faa S6_Zeta10_faa 1.0

From this outfile you can see that most bin pairs share only about 5% or less of their 100% identical ORF clusters (i.e. roughly 95% or more of the information in the bin is novel). The only pair that shares a substantial fraction of 100% identical ORF clusters (roughly 53-55%) is the pair from the same organism in two different assemblies: S1_10_Zeta1 and S1_Zeta1. Even bins from the same ZOTU (e.g. S1_Zeta1 vs. S6_Zeta10) share very few 100% identical ORFs.
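
This kind of reading of the outfile can also be scripted. The Ruby sketch below is an illustration only: it assumes the outfile has a single header line followed by whitespace-separated rows of bin 1, bin 2 and the metric, as in the example above. The method names and the 0.5 cutoff are hypothetical choices, not recommendations from the program.

```ruby
# Sketch only: assumes one header line followed by whitespace-separated
# rows of "b1 b2 value", as in the example outfile.
def parse_identibin(text)
  text.lines.drop(1).map(&:split).reject(&:empty?).map do |b1, b2, value|
    [b1, b2, Float(value)]
  end
end

# Keep pairs where b1 adds less than `cutoff` new information to b2.
# The 0.5 default is an arbitrary illustration, not a recommended threshold.
def likely_same_organism(rows, cutoff: 0.5)
  rows.select { |_b1, _b2, value| value < cutoff }
end

# A few rows copied from the example outfile.
sample = <<~OUT
  b1 b2 info that b1 adds to b2 -- (1 - (num mixed clusters / total clusters with an ORF from b1)
  S1_10_Zeta1_faa S1_Zeta1_faa 0.452431289640592
  S1_Zeta1_faa S1_10_Zeta1_faa 0.468990261404408
  S6_Zeta10_faa S1_10_Zeta1_faa 0.9590371621621622
  S6_Zeta3_faa S1_10_Zeta1_faa 1.0
OUT

likely_same_organism(parse_identibin(sample)).each do |b1, b2, value|
  puts format("%s vs %s: %.1f%% of clusters shared", b1, b2, (1 - value) * 100)
end
```

With a real run, `File.read("jawns.txt")` would take the place of the inline `sample` string.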

Hopefully this illustrates how conservative and sensitive this technique is.

The repository also includes a three-column-to-matrix converter, three_col_to_matrix_converter.rb, which takes the output from identibin.rb and converts it to a square matrix that can be imported into R and displayed graphically.

To run this program:

ruby ~/software/seanie_parallel/three_col_to_matrix_converter.rb jawns.txt > matrix.txt

The output file (matrix.txt) looks like:

S1_10_Zeta1_faa S1_Zeta1_faa S6_Zeta10_faa S6_Zeta3_faa
S1_10_Zeta1_faa 0 0.452431289640592 0.9487315010570825 1.0
S1_Zeta1_faa 0.468990261404408 0 0.9477191184008201 1.0
S6_Zeta10_faa 0.9590371621621622 0.9569256756756757 0 1.0
S6_Zeta3_faa 1.0 1.0 1.0 0
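
For readers curious how such a conversion can work, here is a minimal Ruby sketch. It is not the code in three_col_to_matrix_converter.rb; the method name `three_col_to_matrix` and the choice to leave unreported pairs empty are assumptions made for illustration. Rows correspond to b1 and columns to b2, matching the example matrix above.

```ruby
# Sketch only, not the code in three_col_to_matrix_converter.rb.
# rows: array of [b1, b2, value]; returns [names, matrix] where
# matrix[i][j] is the info that names[i] adds to names[j].
def three_col_to_matrix(rows)
  names = rows.flat_map { |b1, b2, _value| [b1, b2] }.uniq.sort
  index = names.each_with_index.to_h
  # Diagonal (a bin against itself) is 0, as in the example matrix;
  # pairs that never appear in the input stay nil (an assumption here).
  matrix = Array.new(names.size) do |i|
    Array.new(names.size) { |j| i == j ? 0 : nil }
  end
  rows.each { |b1, b2, value| matrix[index[b1]][index[b2]] = value }
  [names, matrix]
end

# A subset of the example outfile, already parsed into three columns.
rows = [
  ["S1_10_Zeta1_faa", "S1_Zeta1_faa", 0.452431289640592],
  ["S1_Zeta1_faa", "S1_10_Zeta1_faa", 0.468990261404408],
  ["S6_Zeta3_faa", "S1_10_Zeta1_faa", 1.0],
  ["S1_10_Zeta1_faa", "S6_Zeta3_faa", 1.0],
  ["S1_Zeta1_faa", "S6_Zeta3_faa", 1.0],
  ["S6_Zeta3_faa", "S1_Zeta1_faa", 1.0]
]

names, matrix = three_col_to_matrix(rows)
puts names.join("\t")
names.zip(matrix).each { |name, row| puts [name, *row].join("\t") }
```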

Please open an issue if you have any problems or questions. This program is provided without warranty or a guarantee of support. Thanks!!
