
COME --- calculate COding potential from Multiple fEatures.

0. About COME

COME (coding potential calculator based on multiple features) is a computational tool that predicts the coding potential of a given transcript. It integrates multiple sequence-derived and experiment-based features using a decompose-compose method, which makes COME more accurate and robust than other well-known tools. First, COME composes the feature matrix for the given transcripts from the pre-calculated feature vectors. Second, COME predicts the coding potential with the pre-trained models, using the feature matrix generated in the first step.

COME is currently pre-trained for five model species: human (hg19), mouse (mm10), fly (dm3), worm (ce10) and plant (TAIR10). The pre-trained models are available in the [bin/models](https://github.com/lulab/COME/tree/master/bin/models) folder.

COME integrates features including GC content; DNA sequence conservation, protein conservation and RNA secondary structure conservation; expression abundance from poly(A)+, poly(A)- and small RNA sequencing; and H3K36me3 and H3K4me3 modification. These input features are pre-calculated and available in the [bin/HDF5](https://github.com/lulab/COME/tree/master/bin/HDF5) folder.

For users who are not familiar with Linux, we also provide a webserver, which is still in beta.

1. Installation

Prerequisites:

  1. Linux

  2. R (>=2.15.2)

  3. R packages ("randomForest" and "rhdf5"). You can install them by starting R and entering:

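     ## Install package "randomForest"
     install.packages("randomForest")
     ## Install package "rhdf5" from Bioconductor (R >= 3.5)
     if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
     BiocManager::install("rhdf5")
     ## On older R versions, the legacy installer was:
     ## source("http://bioconductor.org/biocLite.R"); biocLite("rhdf5")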
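
Before installing COME, it can help to confirm that the prerequisites above are met. The following is a minimal sketch, not part of COME: the function names `version_ge` and `check_come_prerequisites` are hypothetical, and the version comparison assumes GNU `sort -V` is available, which is normally the case on Linux.

```shell
# Hypothetical helpers (bash), not shipped with COME.

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Reports whether R (>= 2.15.2) and the two required packages are usable.
check_come_prerequisites() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "R (Rscript) not found on PATH"
        return 1
    fi
    local r_version
    r_version=$(Rscript -e 'cat(as.character(getRversion()))')
    if ! version_ge "$r_version" "2.15.2"; then
        echo "R $r_version is older than 2.15.2"
        return 1
    fi
    # Exit with a non-zero status when either package is missing.
    Rscript -e 'if (!all(c("randomForest", "rhdf5") %in% rownames(installed.packages()))) quit(status = 1)' \
        || { echo "randomForest and/or rhdf5 is not installed"; return 1; }
    echo "prerequisites look OK"
}
```

Calling `check_come_prerequisites` after sourcing these definitions prints a short message describing the first problem it finds, if any.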
    

Download the files into specific folders.

  1. First, change to your working directory, download the source code from https://github.com/lulab/COME/archive/master.zip and decompress it. Enter the subfolder "COME-master/bin" and save its path in the variable Bin_dir.

     $ unzip		./COME-master.zip;
     $ cd 		./COME-master/bin;
     $ Bin_dir=$(pwd);
    
  2. Second, download the feature vector files for your species (for example, human) from OneDrive or Tsinghua Cloud. These (nine) files need to be placed in the subfolder "COME-master/bin/HDF5".

     $ unzip	./human.feature_vector.HDF5.zip;
     $ mv	./human/human.HDF5.*	$Bin_dir/HDF5;
    
  3. Third, download the model file for your species from OneDrive or Tsinghua Cloud. The (one) model file needs to be placed in the subfolder "COME-master/bin/models".

     $ mv	./human.model	$Bin_dir/models;
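
After these three steps, the layout can be checked before running COME. The sketch below is not part of COME: `check_come_layout` is a hypothetical name, and the expected count of nine feature vector files is taken from the human example above, so it is exposed as an optional third argument in case other species differ.

```shell
# Hypothetical helper (bash), not shipped with COME.
# Usage: check_come_layout /path/to/COME-master/bin species [expected_file_count]
check_come_layout() {
    local bin_dir=$1 species=$2 expected=${3:-9}
    local n_hdf5
    # Count feature vector files named <species>.HDF5.* in bin/HDF5.
    n_hdf5=$(find "$bin_dir/HDF5" -maxdepth 1 -type f -name "$species.HDF5.*" 2>/dev/null | wc -l)
    n_hdf5=$((n_hdf5))   # normalise any padding added by wc
    if [ "$n_hdf5" -ne "$expected" ]; then
        echo "expected $expected feature vector files in $bin_dir/HDF5, found $n_hdf5"
        return 1
    fi
    # The default model file is named <species>.model in bin/models.
    if [ ! -f "$bin_dir/models/$species.model" ]; then
        echo "missing model file $bin_dir/models/$species.model"
        return 1
    fi
    echo "layout OK for $species"
}
```

For the human example, `check_come_layout "$Bin_dir" human` should report success once the feature vector files and the model file are in place.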
    

2. Usage and Examples

bash /path/to/bin_subfolder/COME_main.sh /path/to/your/transcripts.gtf	/path/to/your/output_folder/	/path/to/bin_subfolder/	species	model;

  • /path/to/bin_subfolder/ is the path to the downloaded COME "bin" subfolder, i.e., $Bin_dir.

  • /path/to/bin_subfolder/COME_main.sh is COME's main program script.

  • /path/to/your/transcripts.gtf is your input GTF file. It should follow UCSC's [GTF format](http://genome.ucsc.edu/FAQ/FAQformat.html#format4). In summary, the first field should be the chromosome name in abbreviated form with a lowercase prefix (e.g., chr1, chrX); the third field should be exactly "exon"; the seventh field should be the strand (i.e., + or -). The attribute list that follows must begin with the two mandatory attributes: gene_id "value"; transcript_id "value". In addition, each transcript should be longer than 50 nucleotides. Any line of the input file that does not match these criteria will be skipped.

  • /path/to/your/output_folder/ is the folder where the output file "result.txt" will be saved. It is created if it does not already exist.

  • species is one of these five names: "human", "mouse", "fly", "worm" and "plant". It specifies which species' feature vector files are used for your calculation.

  • model is one of these ten names: "human.model", "human.NoExpHis.model", "mouse.model", "mouse.NoExpHis.model", "fly.model", "fly.NoExpHis.model", "worm.model", "worm.NoExpHis.model", "plant.model" and "plant.NoExpHis.model". It specifies which model is applied to your calculation. *.model, e.g., human.model, is the default model, trained on multiple sequence-derived and experiment-based features. We also provide *.NoExpHis.model, e.g., human.NoExpHis.model, which is trained on sequence-derived features only.
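
Because lines that fail the GTF criteria are skipped, it can be useful to preview which lines would qualify before running COME. The sketch below is an independent pre-check, not COME's own parser: the function names are hypothetical, the chromosome pattern (`chr` followed by letters, digits or underscores) is an assumption, and transcript length is assumed to be the sum of exon lengths using GTF's 1-based inclusive coordinates.

```shell
# Hypothetical pre-checks (bash + awk), not shipped with COME.

# Prints the lines that satisfy the per-line criteria and reports
# the line numbers of all other lines on stderr.
filter_gtf_for_come() {
    awk -F '\t' '
        $1 ~ /^chr[0-9A-Za-z_]+$/ &&
        $3 == "exon" &&
        ($7 == "+" || $7 == "-") &&
        $9 ~ /^gene_id "[^"]+"; *transcript_id "[^"]+"/ { print; next }
        { print "skipped line " NR > "/dev/stderr" }
    ' "$1"
}

# Prints "transcript_id length" for transcripts whose summed exon length
# is 50 nucleotides or less (assumed to be too short for COME).
short_transcripts() {
    awk -F '\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # Strip the leading `transcript_id "` (15 chars) and the closing quote.
            id = substr($9, RSTART + 15, RLENGTH - 16)
            len[id] += $5 - $4 + 1
        }
        END { for (id in len) if (len[id] <= 50) print id, len[id] }
    ' "$1"
}
```

A typical use would be `filter_gtf_for_come transcripts.gtf > transcripts.checked.gtf`, followed by `short_transcripts transcripts.checked.gtf` to list transcripts that are likely too short.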


An example:

Suppose I want to predict the coding potential of the human test transcripts, human.test.gtf, from the [examples](https://github.com/lulab/COME/tree/master/examples) folder. I work in my home directory ~/ and want the output of COME stored in a folder named ~/COME_out/.

  1. Download COME-master.zip to the working directory ~/ from [GitHub](https://github.com/lulab/COME/archive/master.zip) by clicking the link or using wget:

     $ cd ~;
     $ wget -c --content-disposition   http://github.com/lulab/COME/archive/master.zip;
    
  2. Download human.feature_vector.HDF5.zip to the working directory ~/ from OneDrive or Tsinghua Cloud by clicking the link or using wget:

     $ cd ~;
     $ wget -c --content-disposition http://lulab.life.tsinghua.edu.cn/RNAfinder/download_files_for_COME/HDF5/human.feature_vector.HDF5.zip
    
  3. Download human.model to the working directory ~/ from OneDrive or Tsinghua Cloud by clicking the link or using wget:

     $ cd ~;
     $ wget -c --content-disposition   http://lulab.life.tsinghua.edu.cn/RNAfinder/download_files_for_COME/models/human.model
    
  4. Then run COME with the following commands:

     ## Installation and preparation
     $ cd ~/;		
     $ unzip	./COME-master.zip;
     $ cd 	./COME-master/bin;
     ## Save the path of "bin" subfolder to the variable "$Bin_dir"
     $ Bin_dir=$(pwd);
     $ cd ~/;
     $ unzip	./human.feature_vector.HDF5.zip;
     $ mv	./human/human.HDF5.*	$Bin_dir/HDF5;
     $ rm -rf	./human;
     $ mv	./human.model	$Bin_dir/models;
     ## Running COME
     $ bash $Bin_dir/COME_main.sh	$Bin_dir/../examples/human.test.gtf	~/COME_out	$Bin_dir	human	human.model;
    
  5. The final output will be stored in ~/COME_out/result.txt. You can compare it with the example output file ~/human.test.result.txt. (Note: the subclass numbers may differ, because the K-means algorithm uses a random seed.)

  6. We recommend using absolute paths (/dir1/dir2/file1) rather than relative paths (../../file2).
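
To follow the last recommendation, a relative path can be converted to an absolute one before it is passed to COME_main.sh. The helper below is a minimal sketch, not part of COME: `abs_path` is a hypothetical name, and it assumes the parent directory of the path already exists (the final component, such as an output folder, may not exist yet).

```shell
# Hypothetical helper (bash), not shipped with COME.
# Prints the absolute form of a path whose parent directory exists.
abs_path() {
    local dir base
    dir=$(dirname -- "$1")
    base=$(basename -- "$1")
    # Resolve the parent directory in a subshell so the caller's
    # working directory is left unchanged.
    (
        cd -- "$dir" || exit 1
        parent=$(pwd -P)
        # Drop a trailing slash so that a parent of "/" does not yield "//".
        printf '%s/%s\n' "${parent%/}" "$base"
    )
}

# Example call, mirroring the commands in step 4:
# bash "$Bin_dir/COME_main.sh" "$(abs_path ../examples/human.test.gtf)" \
#     "$(abs_path ~/COME_out)" "$Bin_dir" human human.model
```

Resolving paths this way avoids surprises if any of COME's scripts change directory while they run.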

3. Citing COME

Hu L., Xu Z., Hu B. and Lu ZJ, COME: a robust coding potential calculation tool for lncRNA identification and characterization based on multiple features, 2016

4. Contact

Long Hu hulongptp@gmail.com
