Skip to content

Process for taking a single sample of Illumina RNA expression data, in fastq format, and processing it into FPKM values that are comparable to publicly available datasets.

mrbende/RNAprep

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

76 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

alt text

RNAprep

Process for taking a single sample of Illumina RNA expression data, in fastq format, and processing it into FPKM values that are comparable to publicly available datasets.

Reference Files

The hg19 (GRCh37) build of the USCS human reference genome can be found here, which should be downloaded in 2bit format. The utility program for Linux distributions, twoBitToFa, can be used to extract .fa file(s) from this 2bit binary. A precompiled version of this tool can be found here.

The comprehensive gene annotation list can be found here. These are GENCODE annotations specific to the hg19 build of the human genome reference chromosomes and were downloaded in gtf format.

Tools

NOTE: If you plan to append the sample of RNAseq data to an existing matrix of samples, GEMprep offers the tools to do that.

Process

This process was developed using Clemson University's Palmetto Cluster, which utilizes the Portable Batch Scheduling system (PBS) to manage job submission. Most commands were wrapped into independent scripts that specified resource allocation and the exact command line parameters. The code is copied below, but considering the large resource requirements of these processes it is recommended that they be submitted as batch jobs if possible. It is also important that the STAR executable as well as the RSEM commands are recognizable within the system path, or otherwise are explicitly directed.

On the Palmetto Cluster, these jobs were submitted with the following parameters:

select=1:ncpus=24:mpiprocs=1:mem=494gb,walltime=12:00:00

One "chunk" of hardware, 24 CPU cores, 494 gb of RAM, with 12 hours of allocated walltime. This is overkill for many of these jobs, however these programs do utilize significant compute resources. Use what you can.

1. Generate the Genome Index

STAR --runThreadN 24 --runMode genomeGenerate \
--genomeDir /path/to/desired/output/directory \
--genomeFastaFiles /path/to/genome/fasta/hg19.fa \
--sjdbGTFfile /path/to/annotations/gencode.v19.annotation.gtf \
--sjdbOverhang 99

The --runThreadN flag should be set to the number of avaiable cores on the node. The genome index that this process creates will be stored in a new directory, designated by the --genomeDir flag. It will henseforth be referred to as /path/to/genome.

2. Align the Expression Data to the Reference Genome

STAR --runThreadN 24 --runMode alignReads \
--outSAMtype BAM Unsorted SortedByCoordinate \
--genomeDir /path/to/genome \
--outFileNamePrefix /path/to/output/DESIRED_FILE_PREFIX \
--readFilesIn /path/to/lane1/read1,/path/to/lane2/read1 /path/to/lane1/read2,/path/to/lane2/read2
--outFilterType BySJout \
--outSAMattributes NH HI AS NM MD \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--quantMode TranscriptomeSAM \
--alignSJoverhangMin 1 \
--alignSJDBoverhangMin 8 

By specifying the option --quantMode TranscriptomeSAM, STAR will output a file Aligned.toTranscriptome.bam. This is what we will use for RSEM. For more information regarding these paramters, refer to the STAR manual.

If your fastq files are not pair-end reads, you will only have one read to input. Not all illumina results will appear in different 'lane' files, as this largely depends on the size of the inputs and numebr of genes sequenced. Alter the --readFilesIn flag to represent your data.

3. RSEM Prepare Genome Reference

/path/to/RSEM/rsem-prepare-reference -p 24 --star \
--gtf /path/to/annotations/gencode.v19.annotation.gtf \
/path/to/genome/fasta/hg19.fa \
/path/to/desired/output/human_ref/hg19

The -p flag replaces the --runThreadN flag before, and still represnts the number of threads available. This command takes the same inputs as when genrating the genome index with STAR, however greates a unique reference directory for use with RSEM. Ensure that this output directory does not overwrite the directory generated with STAR, as it will be used in the next step... It is also worth noting that RSEM's rsem-prepare-reference command will run the STAR genomegenerate command as a part of its process. This results in some redundancy, but current configuration uses the STAR command to align the genome and the RSEM command to calculate expression.

4. RSEM calculate expression

Before calculating expression, it is important to verify that the input files are valid because we used an alternate aligner (RSEM defaults to Bowtie aligner, this process uses STAR). RSEM requires that the two mates of any paired-end alignments be adjacent. To check this, run the following:

/path/to/RSEM/rsem-sam-validator DESIRED_FILE_PREFIXAligned.toTranscriptome.out.bam

This could take some time, depending on the size of the bam file. If the command returns The input file is valid! then you are good to go! Now for the ultimate step of finally calculating expression...

/path/to/RSEM/rsem-calculate-expression --alignments --paired-end -p 24 \
/path/to/STARaligned/DESIRED_FILE_PREFIXAligned.toTranscriptome.out.bam \
/path/to/output/human_ref/hg19 \
/path/to/desired/FPKM/outputs/SAMPLE_NAME

If the initial expression data was not paired-end, remove the --paired-end flag. The end result will be a file SAMPLE_NAME.genes.results. This will contain ensembl gene IDs, FPKM values, along with other extraneous information.

To isolate the gene IDs and FPKM values, the simplest way is to run the following in a bash environment:

cat SAMPLE_NAME.genes.results | awk '{print $1,$7}' > SAMPLE_NAME.fpkm.txt

This is the ultimate result, a vector of FPKM expression values for the sample, matched with Ensembl gene IDs. For instructions on appending this vector to a pre-existing Gene Expression Matrix (GEM), continue on!

5. Downloading and Processing Tissue Data

This study published the initial method for combining GTEx and TCGA data by a common processing method.

alt text

Here, we download the tissue datasets of interest from the public database. These files present normalized gene expression levels, in FPKM format, after being corrected for batch variations between GTEx and TCGA. These files, divided into independent files by tissue type, can be viewed in figshare. Particular datasets of interest can be downloaded independently, or all can be downloaded together.

The rest of this documentation will use the kidney data for example. The following files were then downloaded from the fileshare and extracted from their gunziped format:

kidney-rsem-fpkm-gtex.txt
kirp-rsem-fpkm-tcga-t.txt
kirp-rsem-fpkm-tcga.txt
kich-rsem-fpkm-tcga-t.txt
kich-rsem-fpkm-tcga.txt
kirc-rsem-fpkm-tcga-t.txt
kirc-rsem-fpkm-tcga.txt

KIRP, KICH, and KIRC are three subtypes of renal cell carcinoma. The tcga data is divided into the tumor samples, denoted as *-tcga-t.txt, and healthy tissue samples, denotes as *-tcga.txt. We also downloaded the gtex samples, which can now be direcly evaluated with the tcga normal samples.

The recommended way to use the scripts in this repository is with an Anaconda environment. This is the Anaconda environemnt used by GEMprep later, but we will need some of these packages before then. To create an Anaconda environment:

module add anaconda3/5.1.0

conda create -n myenv python=3.6 matplotlib mpi4py numpy pandas r scikit-learn seaborn

source deactivate

All of the downloaded Gene Expression Matrices (GEMs) should be added to the same directory, containing only these files. Then, to activate the conda environment for running the python script...

conda activate myenv

The gemmaker.py script in this repo is used to merge all of the GEMs located in the directory while aligning the genes. It is important to make note of the beginnign of this python script.

dir_path = '/scratch2/mrbende/KIDNEY/'

labels_file = 'kidney-labels.txt'

GEM_file = 'kidney_FPKM.txt'

These variables should be changed to reflect your environment. dir_path should be the absolute path where the directory containing the GEMs is located. The labels_file being generated will be relevant in downstream applications so just make note of it here. GEM_file will be the GEM that is created containing FPKM values of all samples.

Once these script variables have been changed, simply run the python code. Note that it requires python3, so check the version.

python gemmaker.py

6. Appending the sample expression data to this larger dataset

Once the GEM of tissue samples has been created, and the larger GEM of GTEx-TCGA data has been combined, the last step is to add the sample processed with STAR/RSEM to the larger sample of public data. This is most easily done by running gemmaker.py again. Create a new directory, containing kidney_FPKM.txt and SAMPLE_NAME.fpkm.txt. It is also possible to add multiple sample files, as it is often the case to have one 'normal' file and one 'tumor' sample. With this new directory, simplpy run gemmaker.py again.

NOTE: This can be combined with the previous step to just run gemmaker.py once. It is recommended to do this seperately, however, because there are many downstream applications where it will be useful to have a separate GEM containing just the public data.

From here, we can now work with a single output file. We will call it a generic GEM.txt with a corresponding labels.txt.

7. Normalize and Visualize the data with GEMPREP

About

Process for taking a single sample of Illumina RNA expression data, in fastq format, and processing it into FPKM values that are comparable to publicly available datasets.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages