# RNA-seq expression

This document shows how to use various modules to prepare reference data, perform RNA-seq calling, quantify expression levels, run quality control, and finally normalize the data. In particular, it uses the following notebooks:

1. `RNA_calling.ipynb`
2. `bulk_expression_QC.ipynb`
3. `bulk_expression_normalization.ipynb`

This protocol is meant for generating the expression phenotype for eQTL studies, although a subset of the steps can be equally useful for preparing data for other analyses, e.g., differential gene expression analysis.

Before starting, please refer to the reference_data page to generate the required reference data. Alternatively, the reference data can be downloaded from Synapse as illustrated on the landing page.

`Input` of this mini-protocol is a collection of FASTQ files and a sample list file describing the sample name, file names, and optionally the strandedness and read length of each sample.

```
ID fq1 fq2 strand read_length
sample_1 samp1_r1.fq.gz samp1_r2.fq.gz rf 100
sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150
sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75
```
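
A malformed sample list is easier to catch before any pipeline step starts. The following is a minimal sketch of such a check, not part of the pipeline itself: `parse_sample_list` and `REQUIRED_COLUMNS` are hypothetical names, the file is assumed to be whitespace-delimited as in the example above, and the paired-end columns `ID`, `fq1`, and `fq2` are assumed to be required while `strand` and `read_length` are treated as optional.

```python
# Assumption: these three columns are required, based on the paired-end example above.
REQUIRED_COLUMNS = ("ID", "fq1", "fq2")


def parse_sample_list(text):
    """Parse a whitespace-delimited sample list into a list of dicts."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("sample list is empty")
    header, body = rows[0], rows[1:]
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    records = []
    # Row numbers count non-empty lines, with the header as row 1.
    for row_no, fields in enumerate(body, start=2):
        if len(fields) != len(header):
            raise ValueError(
                f"row {row_no}: expected {len(header)} fields, got {len(fields)}"
            )
        record = dict(zip(header, fields))
        if "read_length" in record:
            # read_length is optional; convert it to an integer when present.
            record["read_length"] = int(record["read_length"])
        records.append(record)
    ids = [record["ID"] for record in records]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate sample IDs found")
    return records


example = """\
ID fq1 fq2 strand read_length
sample_1 samp1_r1.fq.gz samp1_r2.fq.gz rf 100
sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150
sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75
"""

records = parse_sample_list(example)
for record in records:
    print(record["ID"], record.get("strand"), record.get("read_length"))
```

In practice the text would be read from the file passed to `--samples`; the check only covers layout and does not verify that the FASTQ files exist.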


`Output` of this mini-protocol is a bed.gz file that is TensorQTL-ready.

## Perform data quality summary via `fastqc`

In [None]:
sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq \
    --container containers/rna_quantification.sif \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

## Cut adaptor (Optional)
This step trims the FASTQ files to remove adaptors. It is optional because the FASTQ files in the protocol data folders were converted from BAM files and already have no adaptors.


In [None]:
sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output/rnaseq --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat

## Read alignment via STAR and QC via Picard
The following step requires at least 40G of memory and takes about 2 hours in total. It also produces the input needed for splicing QTL analysis. Note that the GTF file used here is the same as the one fed into the RSEM index step in the reference data mini-protocol, i.e., the one without `gene` in its file name.

In [None]:
sos run pipeline/RNA_calling.ipynb STAR_output \
    --cwd output/rnaseq --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat

## Call gene-level RNA expression via `rnaseqc`
The following step generates the count table of gene expression. Note that the GTF file used here is not the same as the one used in `STAR_output`: use the one with `gene` in its file name.

In [None]:
sos run pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq \
    --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf.ref.flat  \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list

## Call transcript-level RNA expression via `RSEM`

In [None]:
sos run pipeline/RNA_calling.ipynb rsem_call \
    --cwd output/rnaseq \
    --samples data/sample_fastq.list \
    --data-dir data \
    --fasta_with_adapters_etc TruSeq3-PE.fa \
    --STAR-index reference_data/STAR_Index/ \
    --RSEM-index reference_data/RSEM_Index/ \
    --container containers/rna_quantification.sif \
    --mem 40G

## Multi-sample RNA-seq QC

This step requires a different minimal working example (MWE) data set that contains multiple samples. It is available at this [Google Drive link](https://drive.google.com/drive/u/0/folders/1Rv2bWHBbX_tastTh49ToYVDMV6rFP5Wk).

In [None]:
sos run pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_readsCount.gct.gz \
    --container containers/rna_quantification.sif 
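
Before running the QC step, it can help to confirm that the TPM and count tables describe the same genes and samples. The sketch below is illustrative only and is not part of `bulk_expression_QC.ipynb`. It assumes the inputs follow the GCT v1.2 layout (a `#1.2` line, a dimension line, then a tab-delimited table whose first two columns are `Name` and `Description`). The helper names `read_gct_layout`, `same_layout`, and `write_toy_gct` are hypothetical, and the toy files only stand in for the real `--tpm-gct` and `--counts-gct` inputs.

```python
import gzip
import os
import tempfile


def read_gct_layout(path):
    """Return (gene_ids, sample_ids) from a gzipped GCT v1.2 file."""
    with gzip.open(path, "rt") as handle:
        version = handle.readline().strip()
        if version != "#1.2":
            raise ValueError(f"unexpected GCT version line: {version!r}")
        n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
        header = handle.readline().rstrip("\n").split("\t")
        samples = header[2:]  # the first two columns are Name and Description
        genes = [line.split("\t", 1)[0] for line in handle if line.strip()]
    if len(genes) != n_rows or len(samples) != n_cols:
        raise ValueError("dimension line does not match the table contents")
    return genes, samples


def same_layout(tpm_path, counts_path):
    """True when both tables list the same genes and samples in the same order."""
    return read_gct_layout(tpm_path) == read_gct_layout(counts_path)


def write_toy_gct(path, samples, table):
    """Write a tiny GCT v1.2 file for demonstration purposes only."""
    with gzip.open(path, "wt") as handle:
        handle.write("#1.2\n")
        handle.write(f"{len(table)}\t{len(samples)}\n")
        handle.write("\t".join(["Name", "Description"] + samples) + "\n")
        for gene, values in table.items():
            handle.write("\t".join([gene, gene] + [str(v) for v in values]) + "\n")


# Toy data: gene and sample names here are placeholders, not real identifiers.
workdir = tempfile.mkdtemp()
toy_tpm = os.path.join(workdir, "toy.gene_tpm.gct.gz")
toy_counts = os.path.join(workdir, "toy.gene_readsCount.gct.gz")
write_toy_gct(toy_tpm, ["s1", "s2"], {"geneA": [1.5, 2.0], "geneB": [0.0, 3.2]})
write_toy_gct(toy_counts, ["s1", "s2"], {"geneA": [10, 14], "geneB": [0, 21]})
print(same_layout(toy_tpm, toy_counts))
```

For real data, the two paths passed to `same_layout` would be the files given to `--tpm-gct` and `--counts-gct`.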

## Multi-sample read count normalization
Please download the reference_data/sample_participant_lookup.rnaseq file from the reference_data folder on Synapse.

In [None]:
sos run pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --container containers/rna_quantification.sif \
    --count-threshold 1 \
    --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

## Optional: Region list generation
By default, the first 4 columns of the bed.gz file contain the chr, start (TSS), end (TSS+1), and gene ID of each gene. Users can extract this information with a simple `zcat {phenoFile}.bed.gz | cut -f 1,2,3,4` command. However, when a region list with both gene ID and gene name is needed, the following utility is provided.


In [None]:
sos run pipeline/gene_annotation.ipynb region_list_generation \
    --cwd output/rnaseq  \
    --phenoFile output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --sample-participant-lookup reference_data/sample_participant_lookup.rnaseq \
    --container containers/bioinfo.sif \
    --phenotype-id-type gene_id
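
The `zcat {phenoFile}.bed.gz | cut -f 1,2,3,4` extraction described at the start of this section can also be done in Python when the coordinates are needed inside a script. This is a minimal sketch under stated assumptions: the file is tab-delimited, the helper name `first_four_columns` is hypothetical, and the toy file, its `#chr start end gene_id` header, and its gene IDs are placeholders rather than real pipeline output.

```python
import gzip
import os
import tempfile


def first_four_columns(bed_gz_path):
    """Return the first four tab-delimited fields of every non-empty line."""
    rows = []
    with gzip.open(bed_gz_path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue
            # Like `cut -f 1,2,3,4`, the header line is kept as the first row.
            rows.append(tuple(line.rstrip("\n").split("\t")[:4]))
    return rows


# Toy input with made-up gene IDs and expression values.
toy_bed = os.path.join(tempfile.mkdtemp(), "toy.bed.gz")
with gzip.open(toy_bed, "wt") as handle:
    handle.write("#chr\tstart\tend\tgene_id\ts1\ts2\n")
    handle.write("chr1\t100\t101\ttoy_gene_1\t0.5\t-0.2\n")
    handle.write("chr2\t500\t501\ttoy_gene_2\t1.1\t0.3\n")

for row in first_four_columns(toy_bed):
    print("\t".join(row))
```

This sketch does not add gene names. When both gene ID and gene name are needed, the `region_list_generation` workflow above remains the intended route.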