We ask you to cite the main publication related to this software whenever you use any part of this software in any scientific publication.
You may use the following BibTeX entry to cite the main publication of this software:
@article{10.1093/bib/bbaa185,
author = {da Cruz, Murilo Horacio Pereira and Domingues, Douglas Silva and Saito, Priscila Tiemi Maeda and Paschoal, Alexandre Rossi and Bugatti, Pedro Henrique},
title = "{TERL: classification of transposable elements by convolutional neural networks}",
journal = {Briefings in Bioinformatics},
year = {2020},
month = {09},
issn = {1477-4054},
doi = {10.1093/bib/bbaa185},
url = {https://doi.org/10.1093/bib/bbaa185},
note = {bbaa185},
eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbaa185/33724804/bbaa185.pdf},
}
To install TERL, clone the repository to your local machine. This requires git, so if git is not yet installed on your machine, install it first. Once git is available, clone this repository with the following command:
git clone https://github.com/muriloHoracio/TERL
After cloning, you will have a directory named TERL that contains all the code needed to train TERL and classify sequences.
Since TERL is written for Python 3.6 and uses several libraries, we recommend running it in a virtual environment. Before installing virtualenv, make sure you have pip3 installed. To install pip3, run the following command:
sudo apt-get install python3-pip
Update pip3 by running the following command:
sudo -H pip3 install --upgrade pip
You also need to install setuptools by running the following command:
sudo apt-get install python3-setuptools
To create a virtual environment, you need virtualenv installed. To install it, run the following command:
sudo apt-get install virtualenv
To create the virtual environment, you need to execute the following command:
virtualenv -p python3 .venv
Once the virtual environment is created, you need to activate it in order to install the dependencies of TERL. To do this, run the following command:
. .venv/bin/activate
If everything worked well, you will notice that (.venv) appears before the user name in the command line.
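If the prompt is customized and the (.venv) marker is hard to spot, the interpreter itself can report whether a virtual environment is active. The following is a minimal sketch; the function name `in_virtualenv` is hypothetical and not part of TERL.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # venv and recent virtualenv releases make sys.prefix differ from
    # sys.base_prefix; older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Running this with the python3 of the activated environment should report that the environment is active.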
Now you must install the dependencies needed to run TERL.
If you are using a GPU, you must run the following command:
pip3 install -r requirements-gpu.txt
Otherwise, you must run the following command:
pip3 install -r requirements.txt
If everything went well, TERL is now installed and you are ready to train your own model on your sequences or to classify sequences with a previously trained model.
Transposable Elements Representation Learner
TERL can be used to classify any genomic sequence. The framework provides tools to train and test models, and users can choose to deploy the trained network model. Many parameters are available to define the network architecture and to set the model's hyperparameters. All of them are described here with examples of usage.
To use this framework to train and test CNN models on genomic data, users need to organize the dataset files in a fixed structure. The files should be stored in the following way:
Root
└─── Train
|    └─── Class1.fa
|    └─── Class2.fa
|    └─── Class3.fa
|    └─── Class4.fa
└─── Test
     └─── Class1.fa
     └─── Class2.fa
     └─── Class3.fa
     └─── Class4.fa
The file names in the Train and Test folders must be identical and must reflect the class that each file represents.
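The layout rule can be checked before starting a long training run. The sketch below is not part of TERL; `check_dataset_layout` is a hypothetical helper that assumes the class label is the file name without its extension.

```python
import tempfile
from pathlib import Path


def check_dataset_layout(root):
    """Return the sorted class names if Train and Test hold identical file names."""
    root = Path(root)
    names = {}
    for split in ("Train", "Test"):
        folder = root / split
        if not folder.is_dir():
            raise FileNotFoundError(f"missing folder: {folder}")
        names[split] = {p.name for p in folder.iterdir() if p.is_file()}
    if names["Train"] != names["Test"]:
        mismatch = sorted(names["Train"] ^ names["Test"])
        raise ValueError(f"files not present in both Train and Test: {mismatch}")
    if not names["Train"]:
        raise ValueError("no class files found")
    # Assumption: the class label is the file name without its extension.
    return sorted(Path(name).stem for name in names["Train"])


# Demo on a throwaway dataset that mirrors the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    for split in ("Train", "Test"):
        (Path(tmp) / split).mkdir()
        for name in ("Class1.fa", "Class2.fa"):
            (Path(tmp) / split / name).write_text(">seq1\nACGT\n")
    demo_classes = check_dataset_layout(tmp)
    print(demo_classes)
```

A mismatch between the two folders raises a ValueError that lists the offending file names.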
This is an example of how to train a model with TERL. The files are stored in the Train and Test folders, which are inside the Dataset folder, itself stored in the TERL folder.
Dataset
└─── Train
|    └─── LTR.fa
|    └─── LINE.fa
|    └─── SINE.fa
└─── Test
     └─── LTR.fa
     └─── LINE.fa
     └─── SINE.fa
This example model has the following architecture:
Architecture:  conv  pool  conv  pool  fc    fc
Functions:     relu  avg   relu  avg   relu  relu
Widths:        30    20    30    20    1500  500
Strides:       1     20    1     20    -     -
Feature maps:  64    -     32    -     -     -
Example:
python3 terl_train.py -r Dataset -l 6 -a conv pool conv pool fc fc -f relu avg relu avg relu relu -w 30 20 30 20 1500 500 -s 1 20 1 20 -fm 64 32 -sg -sr -sm
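When the same model is trained repeatedly, it can help to assemble this command from Python lists instead of typing it by hand. The following sketch only builds the argument list. `build_train_command` is a hypothetical helper, not part of TERL, and it uses only the flags shown in the example above.

```python
import shlex


def build_train_command(root, architecture, functions, widths, strides,
                        feature_maps, save_graphs=False, save_report=False,
                        save_model=False):
    """Assemble the terl_train.py argument list from per-layer lists."""
    # -l is derived from the architecture so the two can never disagree.
    cmd = ["python3", "terl_train.py", "-r", str(root),
           "-l", str(len(architecture))]
    cmd += ["-a", *architecture]
    cmd += ["-f", *functions]
    cmd += ["-w", *map(str, widths)]
    cmd += ["-s", *map(str, strides)]
    cmd += ["-fm", *map(str, feature_maps)]
    for enabled, flag in ((save_graphs, "-sg"), (save_report, "-sr"),
                          (save_model, "-sm")):
        if enabled:
            cmd.append(flag)
    return cmd


command = build_train_command(
    "Dataset",
    ["conv", "pool", "conv", "pool", "fc", "fc"],
    ["relu", "avg", "relu", "avg", "relu", "relu"],
    [30, 20, 30, 20, 1500, 500],
    [1, 20, 1, 20],
    [64, 32],
    save_graphs=True, save_report=True, save_model=True,
)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside the TERL folder.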
This section describes each parameter, its possible values, and examples of usage.
Required parameter (-r) that defines the Root folder, where the Train and Test folders containing the sample sequence files are located.
It can be a relative or an absolute path to Root.
Example:
python3 terl_train.py -r ~/TERL/Datasets/DS1
Parameter (-l) that defines the number of layers that will be created. It is used to check that the number of layers defined in the model is correct. The layers can be defined without this parameter, but using it is good practice because it guarantees that the model has the intended number of layers.
Default value is 8, which is the number of layers of the default model.
Example:
python3 terl_train.py -l 6
Parameter (-a, --architecture) that defines the architecture of the model, i.e. the types of the layers and their order in the model.
The input and classification layers should not be included.
The supported values are:
- conv (Convolution layer)
- pool (Pooling layer)
- fc (Fully connected layer)
Default value is: conv pool conv pool conv pool fc fc
Example:
python3 terl_train.py -a conv pool conv pool fc fc
Parameter (-f) that defines the function of each layer. The functions should be entered in the order given by the --architecture parameter, i.e. the first option is the function of the first layer defined in --architecture, the second option is the function of the second layer, and so on.
The available activation functions for convolution and fully connected layers are:
- relu
- tanh
- sigmoid
- leaky_relu
- elu
The available functions for pooling layers are:
- avg
- max
Default value is: -f relu avg relu avg relu avg relu relu
Example:
python3 terl_train.py -f relu avg relu avg relu relu
Parameter (-w) that defines the widths of the filters (for convolution and pooling layers) and the number of neurons (for fully connected layers). The values should be entered in the order given by the --architecture parameter, i.e. the first value is the filter width of the first layer, the second value is the filter width of the second layer, and so on.
Default value is: -w 30 20 30 20 30 10 1500 500
Example:
python3 terl_train.py -w 30 20 30 20 1500 500
Parameter (-s) that defines the strides of the layers of the model. The values should be entered in the order given by the --architecture parameter, i.e. the first value is the stride of the first layer, the second value is the stride of the second layer, and so on. Judging by the default value and the example below, strides are given only for convolution and pooling layers.
Default value is: -s 1 20 1 20 1 10
Example:
python3 terl_train.py -s 1 20 1 20
Parameter (-fm) that defines the number of feature maps of each convolution layer. A network with n convolution layers requires n values.
Default value is: -fm 64 32 16
Example:
python3 terl_train.py -fm 64 32
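The per-layer lists above have to agree with the architecture, and a mismatch is easy to introduce when editing a long command. The sketch below checks the list lengths and the function names before training starts. `check_layer_spec` is a hypothetical helper, not TERL's own validation. It assumes one stride per convolution or pooling layer, which is inferred from the default value and the example rather than stated explicitly.

```python
CONV_FC_FUNCTIONS = {"relu", "tanh", "sigmoid", "leaky_relu", "elu"}
POOL_FUNCTIONS = {"avg", "max"}


def check_layer_spec(architecture, functions, widths, strides, feature_maps):
    """Raise ValueError if the per-layer lists do not fit the architecture."""
    unknown = [layer for layer in architecture
               if layer not in ("conv", "pool", "fc")]
    if unknown:
        raise ValueError(f"unsupported layer types: {unknown}")
    if len(functions) != len(architecture):
        raise ValueError("one function is needed per layer")
    for layer, function in zip(architecture, functions):
        allowed = POOL_FUNCTIONS if layer == "pool" else CONV_FC_FUNCTIONS
        if function not in allowed:
            raise ValueError(f"{function!r} is not valid for a {layer} layer")
    if len(widths) != len(architecture):
        raise ValueError("one width is needed per layer")
    n_conv = architecture.count("conv")
    n_pool = architecture.count("pool")
    # Assumption: strides are listed only for conv and pool layers.
    if len(strides) != n_conv + n_pool:
        raise ValueError("one stride is needed per conv or pool layer")
    if len(feature_maps) != n_conv:
        raise ValueError("one feature map count is needed per conv layer")
    return True


# The default model described in this section passes the check.
check_layer_spec(
    ["conv", "pool", "conv", "pool", "conv", "pool", "fc", "fc"],
    ["relu", "avg", "relu", "avg", "relu", "avg", "relu", "relu"],
    [30, 20, 30, 20, 30, 10, 1500, 500],
    [1, 20, 1, 20, 1, 10],
    [64, 32, 16],
)
```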
Parameter (-o) that defines the optimizer used to train the model, i.e. to optimize the values of its learnable parameters (weights).
The available optimizers are:
- adam
- adadelta
- adagrad
- ftrl
- rmsprop
- grad_desc
Default value is: -o adam
Example:
python3 terl_train.py -o adagrad
Parameter (-lr) that defines the learning rate used by the optimizer.
Default value is: -lr 0.001
Example:
python3 terl_train.py -lr 0.001
Parameter (-trb) that defines the training batch size, i.e. the number of samples presented to the network in each training step.
Default value is: 32
Example:
python3 terl_train.py -trb 64
Parameter (-tsb) that defines the test batch size, i.e. the number of samples presented to the network in each testing step.
Default value is: 32
Example:
python3 terl_train.py -tsb 64
Parameter (-e) that defines the number of training epochs. In each epoch, all training samples are presented to the network.
Default value is: 30
Example:
python3 terl_train.py -e 100
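The batch size and the number of epochs together determine how many training steps a run performs. The arithmetic can be sketched as follows. `training_steps` is a hypothetical helper, and the sketch assumes that the last, possibly smaller, batch of each epoch is still processed, which this README does not confirm.

```python
import math


def training_steps(n_samples, batch_size=32, epochs=30):
    """Return (steps per epoch, total steps) for a training run."""
    # Assumption: a final partial batch counts as one step.
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 1000 training sequences with the default batch size and epoch count.
print(training_steps(1000))
```

With the defaults, 1000 samples give 32 steps per epoch and 960 steps in total.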
Parameter (-d) that defines the dropout rate used to drop neurons in each convolution and fully connected layer of the model.
Default value is: 0.5
Example:
python3 terl_train.py -d 0.3
Parameter (-sg) that enables saving the confusion matrix and learning curve graphs. The titles of the graphs are defined by the --confusion-matrix-title and --learning-curve-title parameters.
By default, graphs are not saved. Set this parameter to save them.
Example:
python3 terl_train.py -sg
Parameter (-cmt, --confusion-matrix-title) that defines the title of the confusion matrix graph. The title should not contain the character "-".
Default value is: Confusion Matrix
Example:
python3 terl_train.py -cmt Confusion Matrix DS1
Parameter (-lct, --learning-curve-title) that defines the title of the learning curve graph. The title should not contain the character "-".
Default value is: Learning Curve
Example:
python3 terl_train.py -lct Learning Curve DS1
Parameter (-p) that defines the prefix used to name saved files, e.g. graphs, models and reports. The name must be a single string, i.e. without spaces.
Default value is: RUN_yyyymmdd_HHMMSS
Where yyyy is the 4-digit current year, mm the 2-digit current month, dd the 2-digit current day, and HH, MM, and SS the current hour, minute, and second, respectively.
Example:
python3 terl_train.py -p DS1_Tests
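To locate the files of a run that used the default prefix, or to generate matching names in your own scripts, the timestamp pattern can be reproduced with the standard library. This is a sketch of the naming pattern only and not TERL's own code. `default_prefix` is a hypothetical helper.

```python
from datetime import datetime


def default_prefix(now=None, stem="RUN"):
    """Build a prefix of the form RUN_yyyymmdd_HHMMSS."""
    now = now or datetime.now()
    # %Y%m%d is year, month, day; %H%M%S is hour, minute, second.
    return f"{stem}_{now:%Y%m%d_%H%M%S}"


print(default_prefix(datetime(2020, 9, 1, 14, 5, 9)))
```

For the fixed date above, this prints RUN_20200901_140509.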
Parameter (-sm) that enables saving the model to the directory defined by --model-export-dir.
By default, the model is not saved. Set this parameter to save it.
Example:
python3 terl_train.py -sm
Parameter (-md, --model-export-dir) that defines the folder to which the model will be exported. The value should be a relative or an absolute path to the desired folder. We suggest using the Models folder inside the TERL folder.
Default value is: Models/Model_yyyymmdd_HHMMSS
Where yyyy is the 4-digit current year, mm the 2-digit current month, dd the 2-digit current day, and HH, MM, and SS the current hour, minute, and second, respectively.
Example:
python3 terl_train.py -md Models/DS1_Model
Parameter (-sr) that enables saving the reports to the Outputs folder, which is located in the TERL folder.
By default, reports are not saved. Set this parameter to save them.
Example:
python3 terl_train.py -sr
Parameter (-nv) that disables verbose mode, which provides useful information to the user.
The verbose mode shows the following information:
- OPTIONS (all parameters used)
- FILES (training and testing file)
- CLASSIFICATION INFO (classes, train and test size, longest sequence and vocabulary size)
- Accuracy micro, macro and simple after each epoch
- REPORT (confusion matrix and classification metrics)
- TIME (train and test times)
By default, verbose mode is on. Set this parameter to disable it.
Example:
python3 terl_train.py -nv
This is an example of how to test TERL or classify files. You must provide a trained and saved model to perform this operation.
Example:
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa
After classification is done, one output file per input file (three in this example) is created with the prefix TERL_YYYYmmdd_HHMMSS_, containing the results of the classification. TERL copies the sequences and changes each header according to the predicted class.
This section describes each parameter, its possible values, and examples of usage.
Required parameter (-m) that defines the model to be used for classification.
Example:
python3 terl_test.py -m Models/TERLModel
Parameter (-f) that defines the FASTA files to be classified. After classification, one output file is created per input file. Its name is the input file name preceded by a prefix, and it contains the sequences of the original file with headers that carry the predicted classes.
The default prefix is TERL_YYYYmmdd_HHMMSS_, where YYYY, mm, dd, HH, MM, and SS are the current year, month, day, hour, minute, and second.
Example:
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa
Parameter (-b) that defines the batch size used to load sequences and classify them.
Default value is 32
Example:
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa -b 64
Parameter (-p) that defines the prefix used when writing the output files.
Default value is TERL_YYYYmmdd_HHMMSS_
Example:
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa -p TERL_exp1_
This produces the following output files:
TERL_exp1_file1.fa
TERL_exp1_file2.fa
TERL_exp1_file3.fa
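Downstream scripts often need to know the output file names in advance. The mapping shown above can be sketched as follows. `output_names` is a hypothetical helper, and the handling of input paths that contain directories is an assumption, since the example only uses bare file names.

```python
import os


def output_names(prefix, input_files):
    """Map each input FASTA file to its prefixed output file name."""
    # Assumption: only the base name of each input path is kept.
    return [prefix + os.path.basename(path) for path in input_files]


print(output_names("TERL_exp1_", ["file1.fa", "file2.fa", "file3.fa"]))
```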
Parameter (-q) that deactivates verbose mode, which prints useful information during classification.
Default value is False, i.e. the information is printed to the terminal.
Example:
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa -q
The above command will print only TensorFlow's own logs.
If you want to use any of the pre-trained models, they were trained as follows:
- DS1 was trained based on RepBase sequences to classify the following superfamilies: Copia, Gypsy, Bel-Pao, ERV, L1, Tc1-Mariner, and hAT.
- DS2 was also trained based on RepBase to classify the sequences into Class II, LTR, and LINE.
- DS3 was trained with sequences from 7 databases and classifies sequences into Copia, Gypsy, Bel-Pao, ERV, L1, SINE, Tc1-Mariner, hAT, Mutator, PIF-Harbinger, and En/Spm - CACTA.
- DS4 was also trained with sequences from 7 databases, but it classifies sequences into Class II, LTR, LINE, and SINE.
- DS5 was trained with sequences from RepBase and tested with sequences from the other databases. It can classify sequences into Class II, LTR, and LINE.
All models were trained with a class of non-TE sequences that were generated by shuffling the sequences to simulate noise.
Basically, DS1 and DS3 are for superfamily classification, whereas DS2, DS4, and DS5 are for order.
For more details read the Experiments section of the original paper.
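The non-TE class mentioned above is built from shuffled sequences. A plain permutation of the nucleotides, which keeps the base composition but destroys the order, can be sketched as below. This only illustrates the idea. The exact shuffling procedure used to build the datasets is described in the paper and may differ, and `shuffled_sequence` is a hypothetical helper.

```python
import random


def shuffled_sequence(sequence, seed=None):
    """Return a random permutation of the characters of a sequence."""
    rng = random.Random(seed)  # a seed makes the shuffle reproducible
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


original = "ACGTACGTTTGA"
noise = shuffled_sequence(original, seed=0)
# The shuffled sequence has the same length and base composition.
print(len(noise) == len(original), sorted(noise) == sorted(original))
```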
You can access all of the datasets used in this work through the original papers listed below:
- RepBase (version 23.10) DOI: 10.1159/000084979
- DPTEdb DOI: 10.1093/database/baw078
- SPTEdb DOI: 10.1093/database/bay024
- PGSB PlantsDB DOI: 10.1093/nar/gkv1130
- RiTE database DOI: 10.1186/s12864-015-1762-3
- TREP DOI: 10.1016/S1360-1385(02)02372-5
- TEfam DOI: 10.1186/1471-2164-12-260