Workflow for RNA sequencing using the Parallel Scripting Library - Parsl.
Reference: Cruz, L., Coelho, M., Terra, R., Carvalho, D., Gadelha, L., Osthoff, C., & Ocaña, K. (2021). Workflows Científicos de RNA-Seq em Ambientes Distribuídos de Alto Desempenho: Otimização de Desempenho e Análises de Dados de Expressão Diferencial de Genes. In Anais do XV Brazilian e-Science Workshop, p. 57-64. Porto Alegre: SBC. DOI: https://doi.org/10.5753/bresci.2021.15789
To use the RNA-seq workflow, the following tools must be available:
ParslRNA-Seq was tested on Python, version 3.8.2.
The recommended way to install Parsl is the approach suggested in Parsl's documentation:
python3 -m pip install parsl
You can install Bowtie2 from the prebuilt release archive:
bowtie2-2.3.5.1-linux-x86_64.zip
Or, on systems that use yum, by running:
sudo yum install bowtie2-2.3.5-linux-x86_64
Samtools is a suite of programs for interacting with high-throughput sequencing data.
Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats.
HTSeq is a native Python library that follows the conventions of many Python packages. You can install it by running:
pip install HTSeq
HTSeq uses NumPy, Pysam and matplotlib. Make sure these tools are installed.
To use the DESeq2 script, make sure the R language is also installed. You can install it by running:
sudo apt install r-base
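Before running the workflow, it can help to confirm that the prerequisites listed above are actually visible from your environment. The following is a minimal sketch, not part of ParslRNA-Seq: the function name `missing_prerequisites` is hypothetical, and the executable names (`bowtie2`, `samtools`, `java` for Picard, `Rscript`) are assumptions about a typical installation.

```python
import importlib.util
import shutil

# Import names of the Python packages mentioned above (assumed import names).
PYTHON_MODULES = ("parsl", "HTSeq", "numpy", "pysam", "matplotlib")

# Executables assumed to be on PATH; Picard is a Java tool, so only `java`
# is checked for it here.
COMMANDS = ("bowtie2", "samtools", "java", "Rscript")


def missing_prerequisites(modules=PYTHON_MODULES, commands=COMMANDS):
    """Return a list describing every module or command that was not found."""
    missing = [
        f"python module {name}"
        for name in modules
        if importlib.util.find_spec(name) is None
    ]
    missing += [
        f"command {name}" for name in commands if shutil.which(name) is None
    ]
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing prerequisites:")
        for item in problems:
            print(" -", item)
    else:
        print("All prerequisites were found.")
```

An empty list means every module can be located and every command is on `PATH`; it does not check versions.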
First of all, create a Comma Separated Values (CSV) file. On the first line, type: sampleName,fileName,condition. Remember, there must be no spaces between items. You can use the file "table.csv" in this repository as an example. Your CSV file will look like this:
sampleName | fileName | condition |
---|---|---|
tissue control 1 | SRR5445794.merge.count | control |
tissue control 2 | SRR5445795.merge.count | control |
tissue control 3 | SRR5445796.merge.count | control |
tissue wntup 1 | SRR5445797.merge.count | wntup |
tissue wntup 2 | SRR5445798.merge.count | wntup |
tissue wntup 3 | SRR5445799.merge.count | wntup |
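If you prefer to generate the table instead of typing it, a small script can produce it. This is only a sketch under assumptions: the helper name `format_sample_table` is hypothetical, and the rows are the example values from the table above.

```python
import csv
import io

HEADER = ["sampleName", "fileName", "condition"]

# Example rows taken from the table above.
SAMPLES = [
    ("tissue control 1", "SRR5445794.merge.count", "control"),
    ("tissue control 2", "SRR5445795.merge.count", "control"),
    ("tissue control 3", "SRR5445796.merge.count", "control"),
    ("tissue wntup 1", "SRR5445797.merge.count", "wntup"),
    ("tissue wntup 2", "SRR5445798.merge.count", "wntup"),
    ("tissue wntup 3", "SRR5445799.merge.count", "wntup"),
]


def format_sample_table(rows):
    """Return the CSV text for the given (sampleName, fileName, condition) rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} fields, got {row!r}")
        # Remove stray spaces around each item so no spaces surround the commas.
        writer.writerow([field.strip() for field in row])
    return buffer.getvalue()


if __name__ == "__main__":
    # Redirect the output to a file, e.g. table.csv, to use it with the workflow.
    print(format_sample_table(SAMPLES), end="")
```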
The command-line arguments passed to the Python script, after the script's name, must be, in order:
- The indexed genome;
- The number of threads or split files for the bowtie, sort, split and htseq tasks;
- Path to the FASTQ reads, which is the path of the input files;
- Name of the directory where the output files must be placed;
- GTF file;
- and, lastly, the DESeq script.
Make sure all the files necessary to run the workflow are in the same directory and that the FASTQ files are in a dedicated folder, used as the input directory. The command line will look like this:
python3 rna-seq.py ../mm9/mm9 24 ../inputs/ ../outputs ../Mus_musculus.NCBIM37.67.gtf ../DESeq.R
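To make the order of the six positional arguments concrete, here is a minimal `argparse` sketch that accepts the command above. It is not the code of `rna-seq.py`; the argument names (`genome_index`, `threads`, `input_dir`, `output_dir`, `gtf`, `deseq_script`) are hypothetical and chosen only for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the six positional arguments, in the documented order."""
    parser = argparse.ArgumentParser(
        description="Illustrative parser for the RNA-seq workflow arguments."
    )
    parser.add_argument("genome_index", help="prefix of the indexed genome")
    parser.add_argument(
        "threads", type=int,
        help="number of threads or split files for bowtie, sort, split and htseq",
    )
    parser.add_argument("input_dir", help="directory with the input read files")
    parser.add_argument("output_dir", help="directory for the output files")
    parser.add_argument("gtf", help="GTF annotation file")
    parser.add_argument("deseq_script", help="path to the DESeq R script")
    return parser


if __name__ == "__main__":
    # Parse the same values used in the example command line.
    args = build_parser().parse_args([
        "../mm9/mm9", "24", "../inputs/", "../outputs",
        "../Mus_musculus.NCBIM37.67.gtf", "../DESeq.R",
    ])
    print(args)
```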
Remember to adjust the multithreading and multicore parameters according to your computational environment. For example, if your machine has 8 cores, set the parameter to 8.
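One way to pick that value is to ask Python how many cores the machine reports. The helper below is a hypothetical sketch, not part of the workflow: it returns the reported core count, or caps a requested value so it never exceeds it.

```python
import os


def suggested_threads(requested=None):
    """Return a thread count that does not exceed the cores reported by the OS."""
    # os.cpu_count() can return None on some platforms; fall back to 1.
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


if __name__ == "__main__":
    print("Suggested value for the threads argument:", suggested_threads())
```

On a shared cluster node the number of cores you are allowed to use may be lower than what `os.cpu_count()` reports, so treat the result as an upper bound.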
ParslRNA-Seq is also available on Docker. You can pull it from DockerHub by running the following command:
docker pull lucruzz/parslrna-seq
To run it, create a directory on the host machine with the following hierarchy of directories and mount it in the container:
- inputs
  - input files
- outputs
- table.csv
- gtf
  - file.gtf
- genomic_base
  - genomic-base-files
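The directory skeleton can also be created with a few lines of Python. This sketch assumes that `inputs`, `outputs`, `gtf` and `genomic_base` are subdirectories of the mounted directory and that `table.csv` sits at its top level; the function name `create_layout` is hypothetical. You still need to copy your own input files, GTF file, genome index files and `table.csv` into place.

```python
import tempfile
from pathlib import Path

# Subdirectories assumed from the hierarchy described above.
SUBDIRECTORIES = ("inputs", "outputs", "gtf", "genomic_base")


def create_layout(root):
    """Create the expected subdirectories under `root` and return it as a Path."""
    root = Path(root)
    for name in SUBDIRECTORIES:
        # exist_ok=True makes the call safe to repeat on an existing layout.
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


if __name__ == "__main__":
    # Demonstration in a temporary directory; pass your own path in practice.
    workdir = create_layout(tempfile.mkdtemp(prefix="parslrnaseq_"))
    for path in sorted(workdir.iterdir()):
        print(path)
```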
To run ParslRNA-Seq in the container, run the following command and keep monitoring the outputs directory:
$ sudo docker run -d -v host_machine_directory:/workdir \
    -e RNASEQ_TABLE_CSV=/workdir/table.csv \
    -e RNASEQ_GENETIC_BASE=/workdir/genomic_base/file_prefix \
    -e RNASEQ_NUM_THREADS=4 -e RNASEQ_INPUTS=/workdir/inputs/ \
    -e RNASEQ_OUTPUTS=/workdir/outputs/ -e RNASEQ_GTF=/workdir/file.gtf \
    rnaseq:1.0 /RNA-seq/rna-seq.sh
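Because the command is long, it is easy to mistype. As a sketch, the same invocation can be assembled in Python from its parts. The environment variable names, image name and script path are taken from the command above; the function name `docker_run_command` and its parameters are hypothetical, and the paths in the example are placeholders to replace with your own.

```python
import shlex


def docker_run_command(host_dir, genome_prefix, gtf_path,
                       threads=4, image="rnaseq:1.0"):
    """Return the `docker run` invocation as a list of arguments.

    `genome_prefix` and `gtf_path` are paths as seen inside the container,
    i.e. below /workdir.
    """
    env = {
        "RNASEQ_TABLE_CSV": "/workdir/table.csv",
        "RNASEQ_GENETIC_BASE": genome_prefix,
        "RNASEQ_NUM_THREADS": str(threads),
        "RNASEQ_INPUTS": "/workdir/inputs/",
        "RNASEQ_OUTPUTS": "/workdir/outputs/",
        "RNASEQ_GTF": gtf_path,
    }
    command = ["docker", "run", "-d", "-v", f"{host_dir}:/workdir"]
    for name, value in env.items():
        command += ["-e", f"{name}={value}"]
    command += [image, "/RNA-seq/rna-seq.sh"]
    return command


if __name__ == "__main__":
    cmd = docker_run_command(
        "/path/to/host_directory",           # placeholder host directory
        "/workdir/genomic_base/file_prefix", # placeholder index prefix
        "/workdir/gtf/file.gtf",             # placeholder GTF path
    )
    # Print a shell-quoted version that can be copied into a terminal.
    print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting problems; prefix it with `sudo` if your Docker setup requires it.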