PiReT

Pipeline for Reference based Transcriptomics.

0.0 Installing PiReT

PiReT is installed using conda, so please make sure that conda is installed and in your PATH. The installation can take up to 2 hours, depending on your internet speed.

0.0.1 Install directly from bioconda

Coming soon!

0.0.2 Install dependencies separately using conda

For the installation to work, conda must be installed; see the conda documentation for installation instructions. Use the following commands to create a conda environment and install the corresponding packages. Make sure that no environment named piret_env exists before attempting the installation, and delete it if it is already present. If you are Python savvy, I recommend this approach: you control every step of the installation, and if something fails you won't have to start from the beginning.

# get the PiReT source
git clone https://github.com/mshakya/piret.git
cd piret
# create the conda environment and install the command-line tools into it
conda create -n piret_env python=3.6.6 --yes
conda install -c bioconda faqcs -n piret_env --yes
conda install -c bioconda star hisat2 subread -n piret_env --yes
conda install -c bioconda stringtie -n piret_env --yes
conda install -c bioconda samtools bamtools bedtools -n piret_env --yes
conda install -c bioconda diamond=0.9.24 -n piret_env --yes
source activate piret_env
# fetch eggnog-mapper and download its database
cd thirdparty
rm -rf eggnog-mapper
git clone https://github.com/mshakya/eggnog-mapper.git
cd eggnog-mapper
python download_eggnog_data.py -y
cd ..
cd ..
# install BiocManager
Rscript --no-init-file -e "if('BiocManager' %in% rownames(installed.packages()) == FALSE){install.packages('BiocManager',repos='https://cran.r-project.org')}";
# install optparse
Rscript --no-init-file -e "if('optparse' %in% rownames(installed.packages()) == FALSE){install.packages('optparse',repos='https://cran.r-project.org')}";
# install tidyverse
Rscript --no-init-file -e "if('tidyverse' %in% rownames(installed.packages()) == FALSE){install.packages('tidyverse',repos='https://cran.r-project.org')}";
# install R reshape2 packages
Rscript --no-init-file -e "if('reshape2' %in% rownames(installed.packages()) == FALSE){install.packages('reshape2',repos='https://cran.r-project.org')}";
# install R pheatmap packages
Rscript --no-init-file -e "if('pheatmap' %in% rownames(installed.packages()) == FALSE){install.packages('pheatmap',repos='https://cran.r-project.org')}";
# install R edgeR packages
Rscript --no-init-file -e "if('edgeR' %in% rownames(installed.packages()) == FALSE){BiocManager::install('edgeR')}";
# install R deseq2 packages
Rscript --no-init-file -e "if('DESeq2' %in% rownames(installed.packages()) == FALSE){BiocManager::install('DESeq2')}";
# install R pathview package
Rscript --no-init-file -e "if('pathview' %in% rownames(installed.packages()) == FALSE){BiocManager::install('pathview')}";
# install R gage package
Rscript --no-init-file -e "if('gage' %in% rownames(installed.packages()) == FALSE){BiocManager::install('gage')}";
# install R ballgown package
Rscript --no-init-file -e "if('ballgown' %in% rownames(installed.packages()) == FALSE){BiocManager::install('ballgown')}";
python setup.py install

0.0.3 Install using provided bash script

$ git clone https://github.com/mshakya/piret.git
$ cd piret
$ ./installer.sh <conda_env>

For example:

$ git clone https://github.com/mshakya/piret.git
$ cd piret
$ ./installer.sh piret_env

Make sure that the environment name (e.g. piret_env) doesn't exist yet.
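One way to check the name before installing is to parse the output of `conda env list --json`. The sketch below is illustrative and not part of PiReT: the helper names `env_names` and `env_exists` are hypothetical, and it assumes that conda is on the PATH and that the JSON output carries an `envs` list of environment paths.

```python
import json
import os
import subprocess


def env_names(conda_json):
    """Return environment names from the text printed by `conda env list --json`."""
    paths = json.loads(conda_json).get("envs", [])
    # Each entry is a directory path; its last component is the environment name.
    # (The base environment shows up under its install directory name.)
    return {os.path.basename(p.rstrip("/\\")) for p in paths}


def env_exists(name, conda_json=None):
    """Return True if an environment called `name` is already present."""
    if conda_json is None:
        # Assumption: conda is installed and on the PATH.
        conda_json = subprocess.check_output(
            ["conda", "env", "list", "--json"]
        ).decode()
    return name in env_names(conda_json)
```

Calling `env_exists("piret_env")` before running the installer would then tell you whether the name is still free. Passing the JSON text explicitly, as the second argument, makes the check usable without invoking conda.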

0.0.4 Install using pip

Coming soon!

1.0 Testing Installation

We provide a test data set to check whether the installation was successful. FASTQ files can be found in tests/fastqs, and the corresponding reference FASTA files are in tests/data. Run the tests from within the piret directory.

For running tests on eukaryote datasets:

$ cd piret
$ source activate piret_env

$ LUIGI_CONFIG_PATH="full_path_to/piret/tests/test_euk.cfg" bin/piret -c tests/test_euk.cfg -d tests/test_euk -e tests/test_euk.txt

For running tests on prokarya datasets:

$ LUIGI_CONFIG_PATH="full_path_to/piret/tests/test_prok.cfg" bin/piret -c tests/test_prok.cfg -d tests/test_prok -e tests/test_prok.txt

For running tests using both prokarya and eukarya datasets:

$ LUIGI_CONFIG_PATH="full_path_to/piret/tests/test_both.cfg" bin/piret -c tests/test_both.cfg -d tests/test_both -e tests/test_both.txt
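Each test command sets LUIGI_CONFIG_PATH to the full path of a config file and also passes a config file with -c. A minimal sketch of a wrapper that builds that pairing from a relative path is shown here. `piret_command` and `run_test` are hypothetical helpers, not part of PiReT, and the sketch assumes it is run from the top of the piret directory so that `bin/piret` resolves.

```python
import os
import subprocess


def piret_command(config, workdir, design, piret="bin/piret"):
    """Build the argument list and environment for one PiReT run.

    LUIGI_CONFIG_PATH is set to the absolute path of `config`, standing in
    for the "full_path_to/..." placeholder in the test commands.
    """
    env = dict(os.environ)
    env["LUIGI_CONFIG_PATH"] = os.path.abspath(config)
    argv = [piret, "-c", config, "-d", workdir, "-e", design]
    return argv, env


def run_test(name):
    """Run one bundled test, e.g. name="test_euk" or name="test_prok".

    Assumption: the config, working directory and design file follow the
    tests/<name>.cfg, tests/<name> and tests/<name>.txt naming pattern.
    """
    argv, env = piret_command(
        "tests/{}.cfg".format(name),
        "tests/{}".format(name),
        "tests/{}.txt".format(name),
    )
    return subprocess.call(argv, env=env)
```

Building the environment in one place avoids the common mistake of pointing LUIGI_CONFIG_PATH and -c at different files.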

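To get KO IDs for genes, PiReT uses emapper. The conda install of PiReT also includes emapper. However, its database needs to be downloaded separately, following the eggnog-mapper instructions. Briefly, run download_eggnog_data.py from within thirdparty/eggnog-mapper, as shown in the installation commands in section 0.0.2.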

0.1 Dependencies

PiReT requires the following dependencies, all of which should be installed and in the PATH.

0.1.0 Programming/Scripting languages

0.1.1 Installing dependencies

  • conda v4.2.13. If conda is not installed, INSTALL.sh will download and install miniconda, a "mini" version of conda that installs only a handful of packages compared to anaconda.

0.1.2 Third-party software/packages

0.1.3 R packages

0.1.4 Python packages

2.0 Running PiReT

usage: piret [-h] -d WORKDIR -e EXPDSN -c CONFIG [-v]

piret

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

required arguments:
  -d WORKDIR            working directory where all output files will be
                        processed and written (default: None)
  -e EXPDSN             tab delimited experimental design file
  -c CONFIG, --config CONFIG
                        luigi config file for setting parameters that control
                        each step, see github repo for an example (default:
                        None)

Example runs:

        piret -d <workdir> -e <design file>  -c <config file>

2.1 Experimental design file

An experimental design file consists of the sample name (SampleID), the full path to the fastq files (Files), and the group each sample belongs to (Group). We recommend using a text editor such as BBEdit or TextWrangler to generate the tab-delimited experimental design file. Exporting a tab-delimited file directly from Excel tends to cause formatting problems. If possible, please avoid any special characters in sample names and group names.

For example:

samp1, samp_1: good names
samp 1, samp.1: not good names; these will likely cause errors.

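A sample experimental design file can be found in the repository; for example, tests/test_euk.txt is the design file passed with -e in the test commands above.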
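As an alternative to a text editor, the design file can be generated with a short script, which sidesteps the Excel export problem. The following is a sketch under stated assumptions and not PiReT code. The column names come from the description above, but the rule for a "safe" name (letters, digits and underscores only) is a guess at what counts as a special character. `write_design` is a hypothetical helper, and the Files value is written through unchanged.

```python
import csv
import io
import re

COLUMNS = ["SampleID", "Files", "Group"]
# Assumption: only letters, digits and underscores are safe in names,
# so "samp 1" and "samp.1" are rejected while "samp_1" is accepted.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def write_design(rows, handle):
    """Write (sample_id, files, group) tuples as a tab-delimited table."""
    for sample_id, _files, group in rows:
        for name in (sample_id, group):
            if not SAFE_NAME.fullmatch(name):
                raise ValueError(
                    "avoid special characters in name: {!r}".format(name)
                )
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Example with a made-up fastq path; replace with full paths to real files.
buffer = io.StringIO()
write_design([("samp1", "/full/path/to/samp1.fastq", "group1")], buffer)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`.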

2.2 Config file

All options are set in the config file.

3.0 Output

All the outputs will be within the working directory. The main output file is a concatenated JSON file called out.json.

  • samp2: The name of this directory corresponds to the sample name. Within this folder there are two sub-folders:

    • mapping_results: This folder contains reads mapped using hisat2 in the following formats. If splice_sites_gff.txt is present, hisat2 aligns based on known splice sites.
      • *.sam: outputs of hisat2
      • *.bam: generated from .sam
      • mapping.log: Alignment summary file from hisat2.
      • *sTie.tab: Tab-delimited file with coverage, FPKM, and TPM for all genes and novel transcripts, generated using StringTie.
      • *sTie.gtf: Primary GTF-formatted output of StringTie.
    • trimming_results: This folder contains results of quality trimming and filtering using FaQCs.
      • *_qc_report.pdf: A QC report file with figures.
      • fastqCount.txt: A text file with summary of read counts.
      • *trimmed.fastq: Pair of trimmed fastq files.
      • *unpaired.trimmed.fastq: fastq that did not have pairs after QC.
      • *.stats.txt: Summary file with numbers of reads before and after QC.
  • ballgown: A folder to be read by the R package ballgown for finding differentially expressed genes. It contains one folder per sample.

  • *merged_transcript.gtf: Non-redundant list of transcripts in GTF format merged from all samples.

  • featureCounts: A folder containing tables of counts from featureCounts.

  • edgeR: A folder containing tables and figures produced mainly with the R package edgeR to detect differentially expressed genes. Depending on the options picked, it contains one or two folders, prokarya and eukarya. Within these folders are the following files and figures.

    • *RPKM.csv: A table with RPKM values for all genes across all samples.
    • *CPM.csv: A table with CPM values for all features across all samples.
    • *feature_count_heatmap.pdf: Heatmap based on count data for the features listed in gff files.
    • *feature_count_CPM_histogram.pdf: A histogram of CPMs.
    • *MDS.pdf: An MDS plot based on reads mapped to samples.
    • group1__group2__gene__et.csv: A table with gene name, logFC, logCPM, PValue, and FDR comparing group1 vs. group2. It contains all genes that have any counts.
    • group1__group2__gene__sig.csv: A subset of group1__group2__gene__et.csv with only the genes that are significant based on the specified P-value.
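The relationship between the `__et.csv` and `__sig.csv` tables can be illustrated with a small filter. This is a sketch only and not how PiReT produces the file. It assumes a comma-separated table with a header row, and the choice of column to threshold on (`FDR` here), the default cutoff and the `gene` header used in the example data are all assumptions; `significant_rows` is a hypothetical helper.

```python
import csv
import io


def significant_rows(handle, cutoff=0.05, column="FDR"):
    """Yield rows of an *__et.csv-style table whose `column` is below `cutoff`."""
    for row in csv.DictReader(handle):
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where the value is missing or not numeric
        if value < cutoff:
            yield row


# Made-up example table with the columns described above.
example = io.StringIO(
    "gene,logFC,logCPM,PValue,FDR\n"
    "geneA,2.1,5.0,0.0001,0.003\n"
    "geneB,0.2,4.1,0.40,0.71\n"
)
kept = [row["gene"] for row in significant_rows(example)]
```

Passing `column="PValue"` applies the same cutoff to the unadjusted P-values instead.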

4.0 Removing PiReT

Because all dependencies that were not already on your system are installed within the PiReT folder, deleting that folder (rm -rf) is sufficient to uninstall the package. Before removing it, check whether any of your project files are inside the PiReT directory.

5.0 Contributions

  • Migun Shakya

6.0 Citations:

If you use PiReT, please cite the following papers:

  • samtools: Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
  • bowtie2: Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-359. [PMID: 22388286]
  • bwa: Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
  • DESeq2: Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550. [PMID: 25516281]
  • edgeR: McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288-4297. [PMID: 22287627]
  • HTSeq: Anders, S., Pyl, P. T., & Huber, W. (2014). HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics. [PMID: 25260700]
  • hisat2: Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), 357-360. [PMID: 25751142]
  • BEDTools: Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842. [PMID: 20110278]
  • GAGE: Luo, Weijun, Michael S. Friedman, Kerby Shedden, Kurt D. Hankenson, and Peter J. Woolf. 2009. “GAGE: Generally Applicable Gene Set Enrichment for Pathway Analysis.” BMC Bioinformatics 10 (May): 161.
  • Pathview: Luo, Weijun, and Cory Brouwer. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14). Oxford University Press: 1830–31.
  • Ballgown: Frazee, Alyssa C., Geo Pertea, Andrew E. Jaffe, Ben Langmead, Steven L. Salzberg, and Jeffrey T. Leek. 2015. “Ballgown Bridges the Gap between Transcriptome Assembly and Expression Analysis.” Nature Biotechnology 33 (3): 243–46.
  • featureCounts: Liao, Yang, Gordon K. Smyth, and Wei Shi. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.
  • StringTie: Pertea, Mihaela, Geo M. Pertea, Corina M. Antonescu, Tsung-Cheng Chang, Joshua T. Mendell, and Steven L. Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from RNA-Seq Reads.” Nature Biotechnology 33 (3): 290–95.

Copyright

Copyright (XXXX). Triad National Security, LLC. All rights reserved.

This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration.

All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

This is open source software; you can redistribute it and/or modify it under the terms of the GPLv3 License. If software is modified to produce derivative works, such modified software should be clearly marked, so as not to confuse it with the version available from LANL. Full text of the GPLv3 License can be found in the License file in the main development branch of the repository.