GMASS

The GMASS score is a novel measure of structural similarity between two genome assemblies. It is based on the length and number of similar genomic regions, defined as consensus segment blocks (CSBs), in the two assemblies.

[Figure: GMASS diagram]

Quick start

git clone https://github.com/jkimlab/GMASS.git 
cd GMASS
perl setup.pl install
perl calculateGMASS.pl -p example/params.txt -o outdir

Install package

System requirements

  • Linux x64 (tested on CentOS 6.9, CentOS 6.10, Ubuntu 16.04, and Ubuntu 18.04)
  • Perl >= 5.10
  • Perl modules
    • Switch
    • Parallel::ForkManager
    • Sort::Key::Natural
  • GCC >= 4.4.7
  • zlib >= 1.2.7
  • glib 2.14
  • glib 2.17

To install the package for calculating the GMASS score:

git clone https://github.com/jkimlab/GMASS.git
cd GMASS
perl setup.pl install

To test whether the package is installed properly, run:

perl calculateGMASS.pl -p example/params.txt -o outdir

To uninstall the package, run:

perl setup.pl uninstall

Run package

Usage

calculateGMASS.pl [options] -f1 <fasta1> -f2 <fasta2> -r <resolutions> -s <dist>
   
-Inputs:
  -f1/-f2		Uncompressed sequence files in fasta format
  -r|--resolution		Comma-separated resolution list
  -s|--strict		Alignment strictness [self|near|medium|far] (Default: near)

-Options:
  -c|--core		Core number  (Default: 1)
  -o|--outdir		Path of output directory  (Default: Current directory)
  -h|--help		Print help message

You can also supply the input parameters to the package in a file:

  calculateGMASS.pl [options] -p <params>
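
As a minimal sketch of the direct-argument form, the snippet below assembles one command line from the flags listed in the usage text. The helper name `build_gmass_command` is hypothetical, and it assumes the script is invoked through `perl` from the GMASS directory, as in the Quick start. The two checks mirror the constraints described under Inputs (at least two resolutions, four strictness levels).

```python
import shlex

# Strictness levels listed in the usage text.
STRICTNESS_LEVELS = ("self", "near", "medium", "far")


def build_gmass_command(fasta1, fasta2, resolutions, strict="near", cores=1, outdir="."):
    """Return the argument list for one calculateGMASS.pl run (hypothetical helper)."""
    if len(resolutions) < 2:
        raise ValueError("at least two resolution values are required")
    if strict not in STRICTNESS_LEVELS:
        raise ValueError(f"strictness must be one of {STRICTNESS_LEVELS}")
    return [
        "perl", "calculateGMASS.pl",  # assumption: run from the GMASS directory
        "-f1", fasta1,
        "-f2", fasta2,
        "-r", ",".join(str(r) for r in resolutions),
        "-s", strict,
        "-c", str(cores),
        "-o", outdir,
    ]


cmd = build_gmass_command(
    "example/Assembly1.fa", "example/Assembly2.fa",
    [10000, 20000, 30000], strict="near", cores=4, outdir="outdir",
)
print(shlex.join(cmd))  # a shell-ready string; pass `cmd` to subprocess.run to execute
```

Building the arguments as a list rather than a single string avoids quoting problems if a file path contains spaces.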

Inputs

  • -f1/-f2: <fasta>
    Uncompressed sequence files for the pair of genome assemblies being compared. The files must be in FASTA format; see https://en.wikipedia.org/wiki/FASTA_format for details.

    Providing an assembly with -f1 does not necessarily make it the reference assembly. Which assembly is used as the reference and which as the target is decided automatically by N50 size. The package calculates the N50 size itself, so the user does not need to provide it.

  • -r: <resolutions>
    Comma-separated list of resolutions. Each resolution is used to set the minimum CSB size. At least two resolution values are required.

  • -s: <dist>
    Strictness of the alignment. There are four options, 'self', 'near', 'medium', and 'far'; the default is 'near'.

    Example of usage

    • self: Comparing different versions of human genome assemblies
    • near: Comparing human genome assembly to chimpanzee genome assembly
    • medium: Comparing human genome assembly to mouse genome assembly
    • far: Comparing human genome assembly to chicken genome assembly
  • -p: <params>
    A file containing input parameters.

      # Assemblies compared
      ### file format should be fasta format
      $ASSEMBLY1=example/Assembly1.fa
      $ASSEMBLY2=example/Assembly2.fa
      
      # Resolution lists (Comma-separated)
      $RESOLUTION=10000,20000,30000,40000,50000
    
      # Alignment strictness 
      ### Option: self, near(default), medium, far
      $STRICT=near
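
The -f1/-f2 notes above say that the reference and target roles are decided by N50 size. For readers unfamiliar with the statistic, here is a rough sketch of how N50 is commonly computed from a FASTA file. It is not the package's own code; the function names are hypothetical, and the toy sequences are made up for illustration.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence in an uncompressed FASTA stream."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; emit the previous record's length.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the length at which the longest sequences first cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Toy assembly with scaffold lengths 10, 6, and 4 (illustrative data only).
toy_fasta = """>scaf1
ACGTACGTAC
>scaf2
ACGTAC
>scaf3
ACGT
""".splitlines()
print(n50(read_fasta_lengths(toy_fasta)))
```

For a real file, pass an open file handle instead of the toy list, for example `with open("example/Assembly1.fa") as fh: n50(read_fasta_lengths(fh))`.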
    

Outputs

If the package runs successfully, the output directory looks like this:

  aln.log  assembly_CSB.stats.txt  chainNet/  CSB/  data/  scores.txt
  • aln.log
    Log file of alignment run

  • assembly_CSB.stats.txt
    File containing the features of the assemblies and CSBs at each resolution

    Description

    • AS_count: The number of scaffolds (chromosomes) in reference/target assembly
    • AS_length: Total size of scaffolds (chromosomes) in reference/target assembly
    • usedAS_count: The number of scaffolds being used for CSBs in reference/target assembly
    • usedAS_length: Total size of scaffolds being used for CSBs in reference/target assembly
    • CSB_count: The number of CSBs constructed between the assembly pair
    • CSB_length: Total size of CSBs constructed between the assembly pair

    Example

      #stats  10000   20000   30000   40000   50000
      AS_count(ref)   19  19  19  19  19
      AS_count(tar)   17  17  17  17  17
      AS_length(ref)  2880676 2880676 2880676 2880676 2880676
      AS_length(tar)  2862930 2862930 2862930 2862930 2862930
      CSB_count(ref)  5   5   5   4   4
      CSB_count(tar)  5   5   5   4   4
      CSB_length(ref) 2837645 2837645 2837645 2799734 2799734
      CSB_length(tar) 2828992 2828992 2828992 2791108 2791108
      usedAS_count(ref)   5   5   5   4   4
      usedAS_count(tar)   5   5   5   4   4
      usedAS_length(ref)  2855326 2855326 2855326 2817019 2817019
      usedAS_length(tar)  2828992 2828992 2828992 2791108 2791108
    
  • chainNet/
    Directory containing alignment results

  • CSB/
    Directory containing the CSBs constructed at each resolution

  • data/
    Directory for package input data

  • scores.txt
    File containing the GMASS score as well as the Li, Ci, and Si scores at each resolution

    Example

    #Resolution Li score    Ci score    Si score
    10000   0.996889512514958   1   0.996889512514958
    20000   0.996889512514958   1   0.996889512514958
    30000   0.996889512514958   1   0.996889512514958
    40000   0.996917865804394   1   0.996917865804394
    50000   0.996917865804394   1   0.996917865804394
    
    --
    GMASS   0.996900853830732
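
To use the scores in downstream scripts, a small parser along these lines may help. It is a sketch that assumes the whitespace-separated layout shown in the scores.txt example above; `parse_scores` is a hypothetical helper, not part of the package. In the example data the GMASS value coincides with the arithmetic mean of the Si column, and the last lines check that for this example only.

```python
from statistics import mean


def parse_scores(text):
    """Parse scores.txt content into (per-resolution rows, GMASS score)."""
    rows, gmass = {}, None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the '#' header, and the '--' separator.
        if not fields or fields[0].startswith("#") or fields[0] == "--":
            continue
        if fields[0] == "GMASS":
            gmass = float(fields[1])
        else:
            li, ci, si = (float(x) for x in fields[1:4])
            rows[int(fields[0])] = {"Li": li, "Ci": ci, "Si": si}
    return rows, gmass


# Content copied from the scores.txt example above.
example = """\
#Resolution Li score    Ci score    Si score
10000   0.996889512514958   1   0.996889512514958
20000   0.996889512514958   1   0.996889512514958
30000   0.996889512514958   1   0.996889512514958
40000   0.996917865804394   1   0.996917865804394
50000   0.996917865804394   1   0.996917865804394

--
GMASS   0.996900853830732
"""
rows, gmass = parse_scores(example)
si_mean = mean(row["Si"] for row in rows.values())
print(len(rows), gmass, abs(si_mean - gmass) < 1e-12)
```

For a real run, read the file first, for example `parse_scores(open("outdir/scores.txt").read())`.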
    

Third party tools

How to cite

Kwon D, Lee J, Kim J. GMASS: a novel measure for genome assembly structural similarity. BMC Bioinformatics. 2019 Mar 18;20(1):147. doi: 10.1186/s12859-019-2710-z.

Contact

bioinfolabkr@gmail.com
