Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Clone this wiki locally
6-iii. Integrated assignment answers
Background: The PCA3 gene plays a role in Prostate Cancer detection due to its localized expression in prostate tissues and its over-expression in tumour tissues. This gene expression profile makes it a useful marker that can complement the most frequently used biomarker for prostate cancer, PSA. There are cancer assays available that test the presence of PCA3 in urine.
Objectives: In this assignment, we will be using a subset of the GSE22260 dataset, which consists of 30 RNA-seq tumour/normal pairs, to assess the prostate cancer specific expression of the PCA3 gene.
Experimental information and other things to keep in mind:
- The libraries are polyA selected.
- The libraries are prepared as paired end.
- The samples are sequenced on a Illumina Genome Analyzer II (this data is now quite old).
- Each read is 36 bp long
- The average insert size is 150 bp with standard deviation of 38bp.
- We will only look at chromosome 9 in this exercise.
- The dataset is located here: GSE22260
- 20 tumour and 10 normal samples are available
- For this exercise we will pick 3 matched pairs (C02,C03,C06 for tumour and N02,N03,N06 for normal). We can do more if we have time.
PART 1 : Obtaining Data and References
- Obtain the files necessary for data processing
- Familiarize yourself with reference and annotation file format
- Familiarize yourself with sequence FASTQ format
Create a working directory
~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named
RNA_ASSIGNMENT that stores this path for convenience in later commands.
cd $RNA_HOME mkdir -p ~/workspace/rnaseq/integrated_assignment/ export RNA_ASSIGNMENT=~/workspace/rnaseq/integrated_assignment/
Obtain reference, annotation and data files and place them in the integrated assignment directory
Note: when initiating an environment variable, we do not need the $; however, everytime we call the variable, it needs to be preceeded by a $.
echo $RNA_ASSIGNMENT cd $RNA_ASSIGNMENT wget https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/misc/integrated_assignment_refs.tar.gz wget https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/misc/integrated_assignment_data.tar.gz tar -zxvf integrated_assignment_refs.tar.gz tar -zxvf integrated_assignment_data.tar.gz
Q1.) How many items are there under the “refs” directory (counting all files in all sub-directories)?
A1.) The answer is 6. Review these files so that you are familiar with them.
cd $RNA_ASSIGNMENT/refs/ tree find * find * | wc -l
What if this reference file was not provided for you? How would you obtain/create a reference genome fasta file for chromosome 9 only. How about the GTF transcripts file from Ensembl? How would you create one that contained only transcripts on chromosome 9?
Q2.) How many exons does the gene PCA3 have?
A2.) The answer is 4. Review the GTF file so that you are familiar with it. What downstream steps will we need this file for? What is it used for?
cd $RNA_ASSIGNMENT/refs/hg19/genes/ grep -w "PCA3" genes_chr9.gtf grep -w "PCA3" genes_chr9.gtf | wc -l
Q3.) How many cancer/normal samples do you see under the data directory?
A3.) The answer is 12. 6 normal and 6 tumor.
cd $RNA_ASSIGNMENT/data/ ls -l ls -1 | wc -l
NOTE: The fasta files you have copied above contain sequences for chr9 only. We have pre-processed those fasta files to obtain chr9 and also matched read1/read2 sequences for each of the samples. You do not need to redo this.
Q4.) What sample has the highest number of reads?
A4.) The answer is that 'carcinoma_C06' has the most reads (288428/2 = 144214 reads).
An easy way to figure out the number of reads is to make use of the command ‘wc’. This command counts the number of lines in a file. Keep in mind that one sequence can be represented by multiple lines. Therefore, you need to first grep the read tag ">" and count those.
Running this command only give you 2 x read number:
cd $RNA_ASSIGNMENT/data/ wc -l YourFastaFile.fasta wc -l *
PART 2: Data alignment
- Familiarize yourself with Tophat/Bowtie alignment options
- Perform alignments
- Obtain alignment summary
Q5.) What is the value of --mate-inner-dist? What calculation did you do to get that answer?
A5.) Mate inner distance is the approximate distance between the reads (~80 bp).
You can get this number by:
- Using insert size estimates provided from the library preparation step. --mate-inner-distance = insert size - (2 x ReadLength)
- If you don’t have that information, then you can subset the FASTA file and run a quick alignment. Plot the fragment distribution from this subset and use those numbers for the full alignment
- We were told that the average insert size for these samples is 150 bp and the reads are 36 bp long. so --mate-inner-distance = 150 - (2 x 36) = 78 = ~80 bp
Refer to this diagram to figure out what the mate inner distance should be:
PE reads R1---------> <---------R2 fragment ~~~========================================~~~ insert ======================================== inner mate distance ...............
Q6.) Considering that the read length in this exercise is 36bp, what should you set the --segment-length to (default is 25bp)?
A6.) If you keep the default value of 25 bases, Tophat will split each read into 2 segments of 25bp and 11bp lengths. It is preferred to split the read into segments of equal length. Therefore, assigning —segment-length a value of 18 for a 36bp read is recommended. When deciding on a number, try avoiding a split that will result in a very short segment. Short segments might not be uniquely mapped and this can affect your transcript assembly process.
Create a bowtie index of the reference genome sequence for TopHat to use
cd $RNA_ASSIGNMENT/refs/hg19/ mkdir -p bwt/9 bowtie2-build fasta/9/9.fa bwt/9/9 cp $RNA_ASSIGNMENT/refs/hg19/fasta/9/*.fa $RNA_ASSIGNMENT/refs/hg19/bwt/9/ ls bwt/9/
Create a directory to store the transcriptome index that tophat2 will create the first time you run it
cd $RNA_ASSIGNMENT/ export RNA_DATA_DIR=$RNA_ASSIGNMENT/data/ echo $RNA_DATA_DIR mkdir -p alignments/tophat/trans_idx cd alignments/tophat export TRANS_IDX_DIR=$RNA_ASSIGNMENT/alignments/tophat/trans_idx/ echo $TRANS_IDX_DIR
Once you have a value for --mate-inner-dist and --transcriptome-index, create tophat2 alignment commands for all six samples and store the results in appropriately named output directories
tophat2 -p 8 --mate-inner-dist 80 --mate-std-dev 38 --segment-length 18 --rg-id=normal --rg-sample=normal_N02 -o normal_N02 -G $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --transcriptome-index $TRANS_IDX_DIR/ENSG_Genes $RNA_ASSIGNMENT/refs/hg19/bwt/9/9 $RNA_DATA_DIR/normal_N02_read1.fasta $RNA_DATA_DIR/normal_N02_read2.fasta tophat2 -p 8 --mate-inner-dist 80 --mate-std-dev 38 --segment-length 18 --rg-id=normal --rg-sample=normal_N03 -o normal_N03 -G $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --transcriptome-index $TRANS_IDX_DIR/ENSG_Genes $RNA_ASSIGNMENT/refs/hg19/bwt/9/9 $RNA_DATA_DIR/normal_N03_read1.fasta $RNA_DATA_DIR/normal_N03_read2.fasta tophat2 -p 8 --mate-inner-dist 80 --mate-std-dev 38 --segment-length 18 --rg-id=normal --rg-sample=normal_N06 -o normal_N06 -G $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --transcriptome-index $TRANS_IDX_DIR/ENSG_Genes $RNA_ASSIGNMENT/refs/hg19/bwt/9/9 $RNA_DATA_DIR/normal_N06_read1.fasta $RNA_DATA_DIR/normal_N06_read2.fasta tophat2 -p 8 --mate-inner-dist 80 --mate-std-dev 38 --segment-length 18 --rg-id=carcinoma --rg-sample=carcinoma_C02 -o carcinoma_C02 -G $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --transcriptome-index $TRANS_IDX_DIR/ENSG_Genes $RNA_ASSIGNMENT/refs/hg19/bwt/9/9 $RNA_DATA_DIR/carcinoma_C02_read1.fasta $RNA_DATA_DIR/carcinoma_C02_read2.fasta tophat2 -p 8 --mate-inner-dist 80 --mate-std-dev 38 --segment-length 18 --rg-id=carcinoma --rg-sample=carcinoma_C03 -o carcinoma_C03 -G $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --transcriptome-index $TRANS_IDX_DIR/ENSG_Genes $RNA_ASSIGNMENT/refs/hg19/bwt/9/9 $RNA_DATA_DIR/carcinoma_C03_read1.fasta $RNA_DATA_DIR/carcinoma_C03_read2.fasta tophat2 -p 8 --mate-inner-dist 80 --mate-std-dev 38 --segment-length 18 --rg-id=carcinoma --rg-sample=carcinoma_C06 -o carcinoma_C06 -G $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --transcriptome-index $TRANS_IDX_DIR/ENSG_Genes $RNA_ASSIGNMENT/refs/hg19/bwt/9/9 $RNA_DATA_DIR/carcinoma_C06_read1.fasta $RNA_DATA_DIR/carcinoma_C06_read2.fasta
Q7.) How would you obtain summary statistics for each aligned file?
A7.) There are many RNA-seq QC tools available that can provide you with detailed information about the quality of the aligned sample (e.g. FastQC and RSeQC). However, for a simple summary of aligned reads counts you can use samtools flagstat. You can also look for the logs generated by TopHat. These logs provide a summary of the aligned reads.
cd $RNA_ASSIGNMENT/alignments/tophat/ samtools flagstat carcinoma_C02/accepted_hits.bam > carcinoma_C02/accepted_hits.flagstat.txt samtools flagstat carcinoma_C03/accepted_hits.bam > carcinoma_C03/accepted_hits.flagstat.txt samtools flagstat carcinoma_C06/accepted_hits.bam > carcinoma_C06/accepted_hits.flagstat.txt samtools flagstat normal_N02/accepted_hits.bam > normal_N02/accepted_hits.flagstat.txt samtools flagstat normal_N03/accepted_hits.bam > normal_N03/accepted_hits.flagstat.txt samtools flagstat normal_N06/accepted_hits.bam > normal_N06/accepted_hits.flagstat.txt grep "mapped (" */accepted_hits.flagstat.txt
PART 3: Expression Estimation
- Familiarize yourself with Cufflinks options
- Run Cufflinks to obtain expression values
- Obtain expression values for the gene PCA3
Create an expression results directory, run cuffinks on all samples, and store the results in appropriately named subdirectories in this results dir
cd $RNA_ASSIGNMENT mkdir expression cd expression cufflinks -p 8 -o normal_N02 --GTF $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --no-update-check $RNA_ASSIGNMENT/alignments/tophat/normal_N02/accepted_hits.bam cufflinks -p 8 -o normal_N03 --GTF $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --no-update-check $RNA_ASSIGNMENT/alignments/tophat/normal_N03/accepted_hits.bam cufflinks -p 8 -o normal_N06 --GTF $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --no-update-check $RNA_ASSIGNMENT/alignments/tophat/normal_N06/accepted_hits.bam cufflinks -p 8 -o carcinoma_C02 --GTF $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --no-update-check $RNA_ASSIGNMENT/alignments/tophat/carcinoma_C02/accepted_hits.bam cufflinks -p 8 -o carcinoma_C03 --GTF $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --no-update-check $RNA_ASSIGNMENT/alignments/tophat/carcinoma_C03/accepted_hits.bam cufflinks -p 8 -o carcinoma_C06 --GTF $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf --no-update-check $RNA_ASSIGNMENT/alignments/tophat/carcinoma_C06/accepted_hits.bam
Q8.) How do you get the expression of the gene PCA3 across the normal and carcinoma samples?
A8.) Cufflinks generates two expression files: gene level expression and isoform level expression. To look for the expression value of a specific gene, you can use the command ‘grep’ followed by the gene name and the path to the expression file
cd $RNA_ASSIGNMENT/expression grep PCA3 ./*/genes.fpkm_tracking
PART 4: Differential Expression Analysis
- Perform differential analysis between tumor and normal samples
- Check if PCA3 is differentially expressed
Use cuffmerge to create a combined transcripts GTF from the six transcripts.gtf created by Cufflinks (one for each sample)
cd $RNA_ASSIGNMENT/expression ls -1 */transcripts.gtf > assembly_GTF_list.txt; cuffmerge -p 8 -o merged -g $RNA_ASSIGNMENT/refs/hg19/genes/genes_chr9.gtf -s $RNA_ASSIGNMENT/refs/hg19/bwt/9/ assembly_GTF_list.txt
Create a new directory to store the differential expression results
cd $RNA_ASSIGNMENT/ mkdir de mkdir de/reference_only
Run cuffdiff to perform comparisons between all tumor and normal samples (3 tumor versus 3 normal)
cd $RNA_ASSIGNMENT/alignments/tophat cuffdiff -p 8 -L Normal,Carcinoma -o $RNA_ASSIGNMENT/de/reference_only/ --no-update-check $RNA_ASSIGNMENT/expression/merged/merged.gtf normal_N02/accepted_hits.bam,normal_N03/accepted_hits.bam,normal_N06/accepted_hits.bam carcinoma_C02/accepted_hits.bam,carcinoma_C03/accepted_hits.bam,carcinoma_C06/accepted_hits.bam
Q9.) Are there any significant differentially expressed genes? What about the PCA3?
A9.) Due to the small sample size, the PCA3 signal is not significant at the adjusted p-value level. You can try re-running the above exercise on your own by using all of the samples in the original data set. Does including more samples change the results?
cd $RNA_ASSIGNMENT/de/reference_only/ grep PCA3 gene_exp.diff
Q10.) What plots can you generate to help you visualize this gene expression profile
A10.) The CummerBund package provides a wide variety of plots that can be used to visualize a gene’s expression profile or genes that are differentially expressed. Some of these plots include heatmaps, boxplots, and volcano plots. Alternatively you can use custom plots using ggplot2 command or base R plotting commands such as those provided in the supplementary tutorials. Start with something very simple such as a scatter plot of tumor vs. normal FPKM values.
| Previous Section | This Section | Next Section | |:------------------------------------------------------------:|:--------------------------:|:-------------------------------------------:| | Practical Exercise Solutions | Integrated Assignment | Proposed Improvements |