exRNA Biomarker Discovery for Liquid Biopsy


  • The exSEEK program starts from a data matrix of gene expression (read counts of each gene in each sample) and performs normalization, feature selection and evaluation.
  • We also provide pipelines and QC steps for pre-processing exRNA-seq raw data (including long and short cfRNA-seq/exoRNA-seq).
  • For pre-processing, we also recommend alternatives such as exceRpt, which was developed specifically for processing exRNA-seq raw reads.



For easy installation, you can use the exSEEK Docker image, which has all dependencies installed:

docker pull ltbyshi/exseek

Alternatively, you can use Singularity or udocker to run the container if your Linux kernel version is < 3 or if you don't have permission to use Docker.


Run the main program from docker:

docker run --rm -it -v "$PWD":/workspace -w /workspace ltbyshi/exseek

The exSEEK repository is cloned to /apps/exseek inside the Docker image.

You can create a bash script named exseek and make it executable:

#!/bin/bash
# Run exSEEK inside the container, mounting the current directory as /workspace
docker run --rm -it -v "$PWD":/workspace -w /workspace ltbyshi/exseek "$@"

After placing the file in one of the directories listed in your $PATH variable, you can simply run: exseek.
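
The installation step above can be sketched as a small shell function. This is only an illustration: the function name install_exseek_wrapper and the choice of target directory are assumptions, and the script it writes follows the wrapper shown above (with $PWD quoted so that paths containing spaces still work).

```shell
# Hypothetical helper: write the exseek wrapper into a directory and make it executable.
# Usage: install_exseek_wrapper TARGET_DIR
install_exseek_wrapper() {
    target_dir="$1"
    mkdir -p "$target_dir" || return 1
    # The quoted heredoc delimiter keeps $PWD and "$@" literal in the written script.
    cat > "$target_dir/exseek" <<'EOF'
#!/bin/bash
docker run --rm -it -v "$PWD":/workspace -w /workspace ltbyshi/exseek "$@"
EOF
    chmod +x "$target_dir/exseek"
}
```

For example, calling install_exseek_wrapper "$HOME/bin" and making sure $HOME/bin is listed in $PATH would make exseek available as a command.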

The basic usage of exSEEK is:

exseek ${step_name} -d ${dataset}


  • Other arguments are passed to snakemake
  • Specify number of processes to run in parallel with -j
  • ${step_name} is one of normalization and cross_validation.
  • ${dataset} is the name of your dataset that should match the prefix of your configuration file described in the following section.
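
As a rough illustration of how these arguments fit together, the sketch below prints the command line for each requested step without running anything. The helper name print_exseek_commands is hypothetical, and the dataset name and job count are placeholders.

```shell
# Hypothetical helper: print the exseek command line for each requested step.
# Usage: print_exseek_commands DATASET JOBS STEP [STEP ...]
print_exseek_commands() {
    dataset="$1"
    jobs="$2"
    shift 2
    for step in "$@"; do
        # -j is forwarded to snakemake and sets the number of parallel processes.
        echo "exseek $step -d $dataset -j $jobs"
    done
}

# Dry run: prints the command for the example dataset instead of executing it.
print_exseek_commands example 4 normalization
```

Here "example" must match the prefix of config/example.yaml. To actually run a step, type the printed command in a terminal.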

Input files

An example can be found in the example_data directory, which has the following structure:

├── config
│   └── example.yaml
├── data
│   └── example
│       ├── batch_info.txt
│       ├── compare_groups.yaml
│       ├── sample_classes.txt
│       └── sample_ids.txt
└── output
    └── example
        └── count_matrix
            └── mirna_and_domains_rna.txt


  • config/example.yaml: configuration file
  • data/example/batch_info.txt: table of batch information
  • data/example/compare_groups.yaml: configuration file for definition of positive and negative samples
  • data/example/sample_classes.txt: table of sample labels
  • output/example/count_matrix/mirna_and_domains_rna.txt: input matrix of read counts
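
Before running a step, it can help to confirm that these files are in place. The following sketch checks for them; the helper name check_exseek_inputs is hypothetical, and it assumes that your dataset uses the same file names as the example (including the count matrix name), which may not hold for every configuration.

```shell
# Hypothetical helper: report which expected input files are missing.
# Usage: check_exseek_inputs ROOT_DIR DATASET
# Returns 0 when all files exist, 1 otherwise.
check_exseek_inputs() {
    root="$1"
    dataset="$2"
    missing=0
    for path in \
        "config/$dataset.yaml" \
        "data/$dataset/batch_info.txt" \
        "data/$dataset/compare_groups.yaml" \
        "data/$dataset/sample_classes.txt" \
        "data/$dataset/sample_ids.txt" \
        "output/$dataset/count_matrix/mirna_and_domains_rna.txt"
    do
        if [ ! -f "$root/$path" ]; then
            # Report each missing file on stderr and remember the failure.
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the bundled example, this would be called as check_exseek_inputs example_data example.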

You can create your own data directory with the above directory structure. Multiple datasets can be put in the same directory by replacing "example" with your own dataset names.
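
A minimal sketch of setting up the skeleton for a new dataset is shown below. The helper name scaffold_exseek_dataset is hypothetical; it only creates the directories and leaves the configuration, sample tables and count matrix for you to provide.

```shell
# Hypothetical helper: create the empty directory skeleton for a new dataset.
# Usage: scaffold_exseek_dataset ROOT_DIR DATASET
scaffold_exseek_dataset() {
    root="$1"
    dataset="$2"
    # config/ is shared by all datasets; data/ and output/ get one subdirectory each.
    mkdir -p \
        "$root/config" \
        "$root/data/$dataset" \
        "$root/output/$dataset/count_matrix"
}
```

Calling the function twice with the same root and different dataset names places both datasets side by side in one directory tree.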

More information about input and output files can be found on the File Format page.



exseek normalization -d ${dataset}

This generates a normalized expression matrix for every combination of normalization and batch removal methods, with the following file name pattern:


You can specify the normalization methods by setting the value of normalization_method, and the batch removal methods by setting the value of batch_removal_method, in config/${dataset}.yaml.

Supported normalization methods: TMM, RLE, CPM, CPM_top, UQ, null

Supported batch removal methods: limma, ComBat, RUV, null

When the method name is set to "null", the step is skipped.
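
To get a sense of how many method combinations a full sweep covers, the sketch below enumerates every pairing of the methods listed above. It assumes that all listed methods are enabled in the configuration; the function name list_method_combinations is hypothetical and the output is just a list of pairs, not exSEEK file names.

```shell
# Method names as listed above; "null" means the step is skipped.
normalization_methods="TMM RLE CPM CPM_top UQ null"
batch_removal_methods="limma ComBat RUV null"

# Print one "normalization<TAB>batch_removal" pair per line.
list_method_combinations() {
    # The variables are left unquoted on purpose so each method becomes one loop item.
    for norm in $normalization_methods; do
        for batch in $batch_removal_methods; do
            printf '%s\t%s\n' "$norm" "$batch"
        done
    done
}

list_method_combinations
```

With six normalization options and four batch removal options this yields 24 pairs, one of which (null with null) skips both steps.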

${batch_index} is the column number (starting from 1) in data/${dataset}/batch_info.txt to be used to remove batch effects.

Feature selection


exseek feature_selection -d ${dataset}

This will evaluate all combinations of feature selection methods and classifiers by cross-validation.

Three summary files will be generated:

  • output/${dataset}/summary/cross_validation/metrics.test.txt
  • output/${dataset}/summary/cross_validation/metrics.train.txt
  • output/${dataset}/summary/cross_validation/feature_stability.txt
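
After a run, a quick way to see which of these summaries were produced is sketched below. The helper name report_exseek_summaries is hypothetical; it only checks that the files exist and makes no assumption about their contents.

```shell
# Hypothetical helper: print the three summary paths for a dataset,
# marking each as present or missing relative to ROOT_DIR.
# Usage: report_exseek_summaries ROOT_DIR DATASET
report_exseek_summaries() {
    root="$1"
    dataset="$2"
    for name in metrics.test.txt metrics.train.txt feature_stability.txt; do
        path="output/$dataset/summary/cross_validation/$name"
        if [ -f "$root/$path" ]; then
            echo "present $path"
        else
            echo "missing $path"
        fi
    done
}
```

Run from the directory that contains output/, for example report_exseek_summaries . example.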

Cross-validation results and trained models for individual combinations are in this directory:


The list of selected features is in features.txt.

Note: More information about output files can be found on the File Format page. Detailed parameters for feature selection and classifiers can be found in config/machine_learning.yaml.

Advanced Usage

Copyright and License Information

Copyright (C) 2019 Tsinghua University, Beijing, China

This program is released under a license that restricts commercial use. Please see the LICENSE file for details.


Binbin Shi, Jingyi Cao, Xupeng Chen and Zhi John Lu (2019) exSEEK: an integrative computational framework for identifying extracellular RNA biomarkers in liquid biopsy
