# Alternative splicing from RNA-seq data

This document shows how to use the modules for data preparation, quantification, and quality control plus normalization in the analysis of splicing events, and how to convert the results to molecular phenotype data in `bed` format. In particular:

1. `molecular_phenotypes/calling/RNA_calling.ipynb`
2. `molecular_phenotypes/calling/splicing_calling.ipynb`
3. `molecular_phenotypes/QC/splicing_normalization.ipynb`
4. `data_preprocessing/phenotype/gene_annotation.ipynb`

Two tools, leafCutter and Psichomics, are used in this splicing analysis workflow; please check the corresponding modules for code documentation. Various reference data need to be prepared before using this workflow; see [this module](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/reference_data.html) to download and prepare them.
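
Since every step below depends on reference files being in place, a quick preflight check can save a failed run. The following is a minimal sketch, not part of the pipeline: `check_files_exist` is a hypothetical helper name, and it only tests that each path exists, not that the file is valid.

```shell
# Hypothetical helper (not part of the pipeline): report which of the
# given paths are missing and return non-zero if any are.
check_files_exist() {
    cfe_missing=0
    for cfe_path in "$@"; do
        if [ ! -e "$cfe_path" ]; then
            echo "MISSING: $cfe_path" >&2
            cfe_missing=$((cfe_missing + 1))
        fi
    done
    # The exit status of the function is the status of this test.
    [ "$cfe_missing" -eq 0 ]
}
```

For example, `check_files_exist reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf reference_data/STAR_Index/ containers/rna_quantification.sif` would print one `MISSING:` line per absent path, using the paths that appear in the commands of this protocol.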


The data used in this mini protocol can be found in the input folder within the protocol data folder in the Synapse repo, as outlined in the landing pages. The `fastq` files can be used for the "fastqc", "fastp_trim_adaptor", and "STAR_output" steps below, which are exactly the same as the first half of the RNA-calling mini protocol.

The output of these overlapping steps can also be found in the output folder within the protocol data folder, so the overlapping steps can be skipped.

## RNA Seq Alignment

### Perform data quality summary via `fastqc`

In [None]:
sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq \
    --container containers/rna_quantification.sif \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

### Cut adaptor (Optional)
This step trims the fastq files to remove adaptors. It is optional because the fastq files in the protocol data folder were converted from bam files and already have no adaptors.


In [None]:
sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output/rnaseq --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat

### Read alignment via STAR and QC via Picard
The following step needs at least 40G of memory and takes about 2 hours in total. It also produces the input needed for the splicing QTL analysis. Note that the gtf file used here is the same one fed into the RSEM index step in the reference data mini protocol, i.e. the one without `gene` in its file name.
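
Because of the memory requirement, it may help to check the machine before launching STAR. The sketch below assumes a Linux host where `/proc/meminfo` is available; `total_mem_gb` and `has_enough_memory` are hypothetical helper names. It reports the total memory of the node, which on a cluster can differ from what the scheduler actually grants to a job.

```shell
# Hypothetical helpers (not part of the pipeline).
# Print total physical memory in whole GB from a /proc/meminfo-style file
# (the MemTotal value is reported in kB).
total_mem_gb() {
    tmg_file="${1:-/proc/meminfo}"
    awk '$1 == "MemTotal:" { print int($2 / 1024 / 1024) }' "$tmg_file"
}

# Succeed only if the total memory is at least the requested number of GB.
has_enough_memory() {
    hem_required="$1"
    hem_file="${2:-/proc/meminfo}"
    [ "$(total_mem_gb "$hem_file")" -ge "$hem_required" ]
}
```

A possible use is `has_enough_memory 40 || echo "less than 40G of memory on this node"` before running the `STAR_output` step.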

In [None]:
sos run pipeline/RNA_calling.ipynb STAR_output \
    --cwd output/rnaseq --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat

The LeafCutter and Psichomics parts below are parallel. They should be run independently, and their inputs and outputs do not depend on each other.

## LeafCutter part workflow

### Intron usage ratio quantification via `leafCutter`
*  `input`: a metadata file containing the locations of all Aligned.sortedByCoord.out.bam files to be analyzed.
*  `output`: a file with intron usage ratios, ending with "_intron_usage_perind.counts.gz"

In [None]:
sos run pipeline/splicing_calling.ipynb leafcutter \
    --cwd output/leaf_cutter/ \
    --samples output/rnaseq/xqtl_protocol_data_bam_list \
    --container containers/leafcutter.sif 
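
If the leafCutter step fails early, a common cause is a BAM path in the sample list that no longer resolves. The sketch below is one way to check this. It assumes, without confirmation from the pipeline documentation, that BAM paths appear in the list as whitespace-separated fields ending in `.bam` and contain no spaces; `missing_bams` is a hypothetical helper name. Relative paths are resolved against the current working directory.

```shell
# Hypothetical helper (not part of the pipeline): print every BAM path
# mentioned in a sample list that does not exist on disk.
# Assumption: BAM paths are whitespace-separated fields ending in ".bam".
missing_bams() {
    mb_list="$1"
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /\.bam$/) print $i }' "$mb_list" |
    while IFS= read -r mb_path; do
        [ -e "$mb_path" ] || echo "$mb_path"
    done
}
```

Running `missing_bams output/rnaseq/xqtl_protocol_data_bam_list` prints nothing when all listed BAM files are present.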

### QC and Normalization of leafCutter outputs
*  `input`: the "_intron_usage_perind.counts.gz" file from the previous step
*  `output`: QC'd and normalized phenotype table ending with "qqnorm.txt"

Note that the `ratio` file to be fed into `leafcutter_norm` is the one without `numers` in its filename.

In [None]:
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd output/leaf_cutter/ \
    --ratios output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz \
    --container containers/leafcutter.sif 
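
The two leafCutter count files have very similar names, so it is easy to pass the wrong one to `--ratios`. As a small sketch, the hypothetical helper `find_ratio_files` below lists only the ratio files in a directory. It relies on the file names shown in this protocol: the ratio file ends with `_perind.counts.gz`, while the count file ends with `_perind_numers.counts.gz` and is therefore not matched by the glob.

```shell
# Hypothetical helper (not part of the pipeline): print the leafCutter
# ratio files in a directory. Files ending in "_perind_numers.counts.gz"
# do not match this glob, so they are skipped.
find_ratio_files() {
    frf_dir="$1"
    for frf_path in "$frf_dir"/*_perind.counts.gz; do
        # Guard against the unexpanded glob when nothing matches.
        [ -e "$frf_path" ] && echo "$frf_path"
    done
    return 0
}
```

For instance, `find_ratio_files output/leaf_cutter` should print the path given to `--ratios` in the command above.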



### Post-processing of leafCutter outputs to make them TensorQTL ready
*  `input`: outputs of the previous two steps and the gtf file.
*  `output`: a file in bed format ending with "formated.bed.gz"

In [None]:
sos run pipeline/gene_annotation.ipynb annotate_leafcutter_isoforms \
    --cwd output/leaf_cutter/ \
    --intron_count output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind_numers.counts.gz \
    --phenoFile output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/bioinfo.sif \
    --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq
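
Before handing the result to TensorQTL, it can be worth confirming that the table is well formed. The sketch below assumes the usual phenotype layout of a header line followed by four leading columns (chromosome, start, end, phenotype ID) and one column per sample; that layout is an assumption here, not something taken from the pipeline documentation. `check_bed_gz` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the pipeline): check that a gzipped,
# tab-separated phenotype table is rectangular and has at least 5 columns.
# Prints "<data rows> <columns>" on success, returns non-zero otherwise.
check_bed_gz() {
    gzip -dc "$1" | awk -F '\t' '
        NR == 1 { ncol = NF; next }          # header defines the column count
        NF != ncol {
            printf "line %d has %d fields, expected %d\n", NR, NF, ncol > "/dev/stderr"
            bad = 1
        }
        END {
            if (bad || ncol < 5) exit 1
            print NR - 1, ncol
        }'
}
```

Calling `check_bed_gz` on the "formated.bed.gz" output gives a quick count of phenotypes and columns, which can be compared with the expected number of samples plus four.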

## Psichomics part workflow

### Percent Spliced In (PSI) quantification for alternative splicing events via `Psichomics`
*  `input`: a metadata file containing the locations of all SJ.out.tab files to be analyzed.
*  `output`: psi_raw_data.tsv, containing percent spliced in values for each alternative splicing event

In [None]:
sos run pipeline/splicing_calling.ipynb psichomics \
    --cwd psichomics_output/ \
    --samples sample_fastq_bam_list \
    --splicing_annotation hg38_suppa.rds \
    --container containers/psichomics.sif
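
PSI estimates are only as reliable as the junction read support behind them, so a quick look at the STAR junction files can be informative. The sketch below counts junctions with a minimum number of uniquely mapped reads, using column 7 of `SJ.out.tab` (the number of uniquely mapping reads crossing the junction). The helper name `count_supported_junctions` and the idea of a read threshold are illustrative assumptions, not something the Psichomics step requires.

```shell
# Hypothetical helper (not part of the pipeline): count junctions in a
# STAR SJ.out.tab file supported by at least MIN_READS uniquely mapped
# reads (column 7). MIN_READS defaults to 1.
count_supported_junctions() {
    csj_file="$1"
    csj_min="${2:-1}"
    awk -F '\t' -v min="$csj_min" '$7 >= min { n++ } END { print n + 0 }' "$csj_file"
}
```

For a given sample, comparing `count_supported_junctions <sample>.SJ.out.tab 1` with the same call at a higher threshold gives a rough sense of how many junctions rest on very few reads.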

### QC and Normalization of psichomics outputs
*  `input`: the "psi_raw_data.tsv" file from the previous step
*  `output`: QC'd and normalized phenotype table ending with "qqnorm.txt"

In [None]:
sos run pipeline/splicing_normalization.ipynb psichomics_norm \
    --cwd psichomics_output \
    --ratios psichomics_output/psi_raw_data.tsv \
    --container containers/psichomics.sif

### Post-processing of Psichomics outputs to make them TensorQTL ready
*  `input`: the "qqnorm.txt" output from the previous step and the gtf file.
*  `output`: a file in bed format ending with "formated.bed.gz"

In [None]:
sos run pipeline/gene_annotation.ipynb annotate_psichomics_isoforms \
    --cwd psichomics_output \
    --phenoFile psichomics_output/psichomics_raw_data_bedded.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gene.gtf \
    --container containers/bioinfo.sif