# Quantifying alternative splicing from RNA-seq data

This workflow calls alternative splicing events from RNA-seq data using [`leafcutter`](https://www.nature.com/articles/s41588-017-0004-9) and [`psichomics`](https://academic.oup.com/nar/article/47/2/e7/5114259), starting from alignments of the original `fastq.gz` data. It follows the GTEx pipeline for the GTEx/TOPMed project; please refer to [this page](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) for details. The choice of pipeline modules in this project is supported by internal (unpublished) benchmarks from the GTEx group.

**Various reference data need to be prepared before using this workflow**. [Here we provide a module](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/reference_data.html) to download and prepare the reference data.

The output of this workflow can be used to generate phenotype tables with `/molecular_phenotypes/QC/splicing_normalization.ipynb`.

## Methods overview

There are many types of alternative splicing events. See [Wang et al (2008)](https://pubmed.ncbi.nlm.nih.gov/18978772/) and [Park et al (2018)](https://pubmed.ncbi.nlm.nih.gov/29304370/) for an illustration of the different events and how splicing is regulated. We apply two methods to quantify alternative splicing:

1. [`psichomics`](https://academic.oup.com/nar/article/47/2/e7/5114259), which quantifies each specific event type, in particular the exon skipping event that is also used in the GTEx sQTL analysis.
2. [`leafcutter`](https://www.nature.com/articles/s41588-017-0004-9), which quantifies the usage of alternatively excised introns. This collectively captures skipped exons, 5’ and 3’ alternative splice site usage and other complex events. The method was previously applied to ROSMAP data as part of the Brain xQTL version 2.0.

## Input

Both the `leafcutter` and `psichomics` sections take the same meta-data file as input. It is tab-delimited (the loader in this notebook reads it with `sep='\t'`) and contains 4 columns: sample ID, RNA strandness, path to the BAM files (input for the `leafcutter` section) and path to the `SJ.out.tab` files (input for the `psichomics` section):

```
sample_id       strand          coord_bam_list                          SJ_list
sample_1        rf              sample_1.Aligned.sortedByCoord.out.bam  sample_1.SJ.out.tab
sample_2        fr              sample_2.Aligned.sortedByCoord.out.bam  sample_2.SJ.out.tab
sample_3        strand_missing  sample_3.Aligned.sortedByCoord.out.bam  sample_3.SJ.out.tab
```

If only one type of input file is available, either the BAM column or the `SJ_list` column can be left empty.
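
As a minimal sketch, the meta-data file could be written and sanity-checked with the Python standard library before launching the workflow. The helper names `write_sample_list` and `check_sample_list` are hypothetical and not part of the pipeline; the column names are assumed to match what `get_samples` in the global section reads (`sample_id`, `strand`, `coord_bam_list`, `SJ_list`).

```python
import csv

# Column names as read by get_samples() in the global section (assumption)
COLUMNS = ["sample_id", "strand", "coord_bam_list", "SJ_list"]
VALID_STRANDS = {"rf", "fr", "strand_missing"}

def write_sample_list(path, rows):
    """Write rows (dicts keyed by COLUMNS) as a tab-delimited meta-data file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

def check_sample_list(path):
    """Return a list of human-readable problems found in the meta-data file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames != COLUMNS:
            return [f"unexpected header: {reader.fieldnames}"]
        rows = list(reader)
    problems = []
    for row in rows:
        if row["strand"] not in VALID_STRANDS:
            problems.append(f"{row['sample_id']}: unknown strand {row['strand']!r}")
    # At least one of the two file columns must be filled in
    no_bam = all(not row["coord_bam_list"] for row in rows)
    no_sj = all(not row["SJ_list"] for row in rows)
    if no_bam and no_sj:
        problems.append("both file columns are empty")
    return problems
```

An empty list returned by `check_sample_list` means none of these basic problems were found; it does not check that the listed files exist, which `get_samples` does on its own.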

### `leafcutter`

The BAM files can be generated by the `STAR_align` workflow from our `RNA_calling.ipynb` module.

All BAM files should be available under the specified folder (by default, the folder that contains the meta-data file).

To exclude some chromosomes from the analysis, add a text file named `black_list.txt`, with one chromosome name per line, to the same directory as the meta-data file.
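
The blacklist format can be illustrated with a short sketch. It only shows how a file with one chromosome name per line could be read and applied to tab-delimited junction records whose first column is the chromosome; the function names are hypothetical, and how the pipeline consumes `black_list.txt` internally is not shown in this notebook.

```python
def read_blacklist(path):
    """Read one chromosome name per line, ignoring blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

def keep_junction(junction_line, blacklist):
    """Return True if the record's chromosome (first tab-delimited column) is not blacklisted."""
    chrom = junction_line.split("\t", 1)[0]
    return chrom not in blacklist
```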


### `psichomics`

The `SJ.out.tab` files can be generated by the `STAR_align` workflow from our `RNA_calling.ipynb` module.

All `SJ.out.tab` files should be available under the specified folder (by default, the folder that contains the meta-data file).



## Output

### `leafcutter`

`{sample_list}` below refers to the name of the meta-data file input.

Main outputs include:

- `{sample_list}_intron_usage_perind.counts.gz`: rows are introns with ids in the format "chromosome:intron_start:intron_end:cluster_id", columns are the input sample names, and each cell holds the intron usage ratio for that sample (i.e. #reads for this intron in the sample / #reads for all introns of the same cluster in the sample).
- `{sample_list}_intron_usage_perind_numers.counts.gz`: same row and column labels, but each cell holds the read count of the intron.
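
The relation between the two files can be sketched as follows, assuming the per-sample counts have already been loaded into a dictionary keyed by the row id. The function names and the example ids are hypothetical; the ratio is the one defined in the first bullet above.

```python
from collections import defaultdict

def parse_intron_id(row_id):
    """Split 'chromosome:intron_start:intron_end:cluster_id' into typed fields."""
    chrom, start, end, cluster = row_id.split(":")
    return chrom, int(start), int(end), cluster

def usage_ratios(counts):
    """counts maps intron row id -> read count for ONE sample.

    Returns intron row id -> count / total count of its cluster
    (None when the cluster has no reads in this sample).
    """
    totals = defaultdict(int)
    for row_id, n in counts.items():
        totals[parse_intron_id(row_id)[3]] += n
    ratios = {}
    for row_id, n in counts.items():
        total = totals[parse_intron_id(row_id)[3]]
        ratios[row_id] = n / total if total else None
    return ratios

# Hypothetical counts for one sample: two introns in clu_1, one in clu_2
example = {"chr1:100:200:clu_1": 30, "chr1:100:300:clu_1": 10, "chr2:500:900:clu_2": 5}
ratios = usage_ratios(example)
```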

### `psichomics`

- `psi_raw_data.tsv`: a data frame of PSI values (quantification of the alternative splicing events) whose first column is the splicing event identifier. An identifier such as `SE_1_-_2125078_2124414_2124284_2121220_C1orf86` is composed of:
    - Event type (`SE` stands for skipped exon)
    - Chromosome (`1`)
    - Strand (`-`)
    - Relevant coordinates depending on event type (in this case, the first constitutive exon’s end, the alternative exon’s start and end, and the second constitutive exon’s start)
    - Associated gene (`C1orf86`)

| Splicing Event Type | Abbreviation | [Coordinates](https://bioconductor.org/packages/release/bioc/manuals/psichomics/man/psichomics.pdf) |
| --- | --- | --- |
| Skipped Exon | SE | constitutive exon 1 end, alternative exon (start and end) and constitutive exon 2 start |
| Mutually exclusive exon | MXE | constitutive exon 1 end, alternative exon 1 and 2 (start and end) and constitutive exon 2 start |
| Alternative 5' splice site | A5SS | constitutive exon 1 end, alternative exon 1 end and constitutive exon 2 start |
| Alternative 3' splice site | A3SS | constitutive exon 1 end, alternative exon 1 start and constitutive exon 2 start |
| Alternative first exon | AFE | constitutive exon 1 end, alternative exon 1 end and constitutive exon 2 start |
| Alternative last exon | ALE | constitutive exon 1 end, alternative exon 1 start and constitutive exon 2 start |
| Alternative first exon (exon-centered - less reliable) | AFE_exon | constitutive exon 1 end, alternative exon 1 end and constitutive exon 2 start |
| Alternative last exon (exon-centered - less reliable) | ALE_exon | constitutive exon 1 end, alternative exon 1 start and constitutive exon 2 start |
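
A minimal sketch of how such an identifier could be split into its parts. The function name `parse_event_id` is hypothetical, and the sketch assumes that the chromosome name contains no underscore, that coordinates are plain integers, and that the gene label is not purely numeric; identifiers that break these assumptions would need extra handling.

```python
def parse_event_id(event_id):
    """Split a psichomics-style event identifier into its components."""
    tokens = event_id.split("_")
    # Exon-centered types carry an underscore in their abbreviation (AFE_exon, ALE_exon)
    if len(tokens) > 1 and tokens[1] == "exon":
        event_type, rest = "_".join(tokens[:2]), tokens[2:]
    else:
        event_type, rest = tokens[0], tokens[1:]
    chromosome, strand, rest = rest[0], rest[1], rest[2:]
    # Consecutive numeric tokens are the coordinates; whatever follows is the gene label
    coordinates = []
    while rest and rest[0].isdigit():
        coordinates.append(int(rest.pop(0)))
    return {
        "type": event_type,
        "chromosome": chromosome,
        "strand": strand,
        "coordinates": coordinates,
        "gene": "_".join(rest),
    }

event = parse_event_id("SE_1_-_2125078_2124414_2124284_2121220_C1orf86")
```

For the skipped exon example, the four coordinates map to the order given in the table: constitutive exon 1 end, alternative exon start and end, and constitutive exon 2 start.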



## Minimal working example


A minimal working example is available on [Google Drive](https://drive.google.com/drive/folders/1lpcx3eKG2UpauntLUuJ6bMBjHyIhWW_R). It contains example inputs for leafcutter/psichomics, two splicing annotations for psichomics, the meta-data file list, and an example blacklist chromosome file for leafcutter.

### For `leafcutter`

In [None]:
sos run pipeline/splicing_calling.ipynb leafcutter \
    --cwd leafcutter_output/ \
    --samples sample_bam.list \
    --container containers/leafcutter.sif 

### For `psichomics`

In [None]:
sos run splicing_calling.ipynb psichomics \
    --cwd psidata/output/ \
    --samples psidata/sample_SJ.list \
    --splicing_annotation hg38_suppa.rds \
    --container container/psichomics.sif

## Command interface

In [40]:
sos run splicing_calling.ipynb -h

usage: sos run splicing_calling.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  leafcutter
  psichomics

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --samples VAL (as path, required)
                        Sample meta data list
  --data-dir  path(f"{samples:d}")

                        Raw data directory, default to the same directory as
                        sample list
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --nu

## Setup and global parameters

In [21]:
[global]
# The output directory for generated files. 
parameter: cwd = path("output")
# Sample meta data list
parameter: samples = path
# Raw data directory, default to the same directory as sample list
parameter: data_dir = path(f"{samples:d}")
# splicing annotation for psichomics
parameter: splicing_annotation = ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')

def get_samples(fn, dr):
    import os
    import pandas as pd

    # The meta-data file is tab-delimited; cells left empty are recorded as "NA"
    samples = pd.read_csv(fn, sep='\t').fillna("NA")
    names = samples['sample_id'].tolist()
    strandness = samples['strand'].tolist()
    bam_list = samples['coord_bam_list'].tolist()
    SJtab_list = samples['SJ_list'].tolist()

    # A column counts as missing when every entry is "NA"
    bam_missing = all(x == "NA" for x in bam_list)
    SJtab_missing = all(x == "NA" for x in SJtab_list)
    if bam_missing and SJtab_missing:
        raise ValueError("At least one type of input should be ready")

    # for regtools command usage, replace 0 = unstranded/XS, 1 = first-strand/RF, 2 = second-strand/FR
    strand_code = {'rf': 1, 'fr': 2, 'strand_missing': 0}
    strandness = [strand_code.get(s, s) for s in strandness]

    def check_files(file_list, label):
        # Only called for a column that is in use, so an all-"NA" column
        # is never reported as a list of duplicated files
        if len(file_list) != len(set(file_list)):
            raise ValueError(f"Duplicated files are found (but should not be allowed) in {label} file list")
        files = []
        for y in file_list:
            y = os.path.join(dr, y)
            if not os.path.isfile(y):
                raise ValueError(f"File {y} does not exist")
            files.append(y)
        return files

    bam_files = [] if bam_missing else check_files(bam_list, "BAM")
    SJtab_files = [] if SJtab_missing else check_files(SJtab_list, "SJ.tab")

    return names, strandness, bam_files, SJtab_files

sample_id, strandness, bam_data, SJtab_data = get_samples(samples, data_dir)

## `leafcutter`

Documentation: [`leafcutter`](https://davidaknowles.github.io/leafcutter/index.html). The choices of `regtools` parameters are [discussed here](https://github.com/davidaknowles/leafcutter/issues/127).

### Other clustering options:

*   "-q", "--quiet" : don't print status messages to stdout, default=True.

*   "-p", "--mincluratio" : minimum fraction of reads in a cluster that support a junction, default 0.001. 

*   "-c", "--cluster" : refined cluster file when clusters are already made, default = None.

*   "-k", "--nochromcheck" : don't check that the chromosomes are well formatted e.g. chr1, chr2, ..., or 1, 2, ..., default = False.

*   "-C", "--includeconst" : also include constitutive introns, default = False.

The default parameters we use are:

`--min_clu_ratio 0.001 --max_intron_len 500000 --min_clu_reads 30`

These parameters are based on [GTEx's sQTL discovery pipeline (Section 3.4.3)](https://www.science.org/action/downloadSupplement?doi=10.1126%2Fscience.aaz1776&file=aaz1776_aguet_sm.pdf).
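
To make the meaning of the three thresholds concrete, here is a deliberately simplified sketch. It is not `leafcutter`'s implementation (which refines and re-splits clusters as it filters); it only applies each threshold once to already-formed clusters. The function name and the example data are hypothetical.

```python
def filter_clusters(clusters, min_clu_reads=30, min_clu_ratio=0.001, max_intron_len=500000):
    """clusters maps cluster id -> {(intron_start, intron_end): read count}."""
    kept = {}
    for cluster_id, introns in clusters.items():
        # --max_intron_len: ignore introns longer than the limit
        introns = {k: n for k, n in introns.items() if k[1] - k[0] <= max_intron_len}
        total = sum(introns.values())
        # --min_clu_reads: drop clusters with too few supporting reads overall
        if total == 0 or total < min_clu_reads:
            continue
        # --min_clu_ratio: drop junctions supported by too small a fraction of the cluster
        introns = {k: n for k, n in introns.items() if n / total >= min_clu_ratio}
        if introns:
            kept[cluster_id] = introns
    return kept

# Hypothetical clusters: clu_1 has too few reads; clu_2 has one overlong
# intron and one junction below the ratio threshold
example_clusters = {
    "clu_1": {(100, 200): 29},
    "clu_2": {(100, 200): 9999, (100, 300): 1, (100, 900000): 50},
}
filtered = filter_clusters(example_clusters)
```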

### Things to keep in mind:

* If the `.bam.bai` index files of the input `.bam` files already exist, they can be placed in the same directory as the input `.bam` files and the `samtools index ${_input}` line can be skipped.
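
One way to decide whether indexing can be skipped is to check for an up-to-date index next to each BAM file. This is a minimal sketch with a hypothetical helper name; it assumes the index is named `<file>.bam.bai`, as in the note above.

```python
import os

def needs_index(bam_path):
    """Return True if '<bam>.bai' is missing or older than the BAM file."""
    index_path = bam_path + ".bai"
    if not os.path.isfile(index_path):
        return True
    # An index older than its BAM file may be stale and should be rebuilt
    return os.path.getmtime(index_path) < os.path.getmtime(bam_path)
```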


In [18]:
[leafcutter_1]
# anchor length (default 8)
parameter: anchor_len = 8
# minimum intron length to be analyzed (default 50)
parameter: min_intron_len = 50
# maximum intron length to be analyzed (default 500000)
parameter: max_intron_len = 500000
input: bam_data, group_by = 1, group_with = "strandness"
output: f'{cwd}/{_input:bn}.junc' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
    samtools index ${_input}
    regtools junctions extract -a ${anchor_len} -m ${min_intron_len} -M ${max_intron_len} -s ${_strandness} ${_input} -o ${_output}

In [19]:
[leafcutter_2]
# minimum reads in a cluster (default 30 reads)
parameter: min_clu_reads = 30 
# maximum intron length to be analyzed (default 500000)
parameter: max_intron_len = 500000 
# minimum fraction of reads in a cluster that support a junction (default 0.001)
parameter: min_clu_ratio = 0.001
input: group_by = 'all'
output: f'{cwd}/{samples:bn}_intron_usage_perind.counts.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    rm -f ${_output:nn}.junc
    for i in ${_input:r}; do
    echo $i >> ${_output:nn}.junc ; done
    python /opt/leafcutter/clustering/leafcutter_cluster_regtools.py -j ${_output:nn}.junc -o ${f'{_output:bnn}'.replace("_perind","")} -m ${min_clu_reads} -l ${max_intron_len} -r ${cwd} -p ${min_clu_ratio}

## `psichomics`

Documentation: [`psichomics`](http://bioconductor.org/packages/release/bioc/html/psichomics.html)

### Other options

quantifySplicing( annotation,
                  junctionQuant,
                  eventType = c("SE", "MXE", "ALE", "AFE", "A3SS", "A5SS"),
                  minReads = 10,
                  genes = NULL
)

The usage and default values of `quantifySplicing` are shown above. The following arguments can be specified: `eventType` (character: splicing event types to quantify), `minReads` (integer: values whose total number of supporting reads is below `minReads` are returned as NA) and `genes` (character: gene symbols for which to quantify splicing events; if NULL, events from all genes are quantified).
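
The effect of `minReads` can be illustrated for a skipped exon event. The formula below is an assumption made for illustration (inclusion taken as the mean of the reads on the two junctions flanking the alternative exon, exclusion as the reads on the junction that skips it); it is a sketch of the idea, not the package's code, and the function name is hypothetical.

```python
def psi_skipped_exon(c1_a, a_c2, c1_c2, min_reads=10):
    """Percent spliced-in for one skipped exon event in one sample.

    c1_a  : reads on the junction constitutive exon 1 -> alternative exon
    a_c2  : reads on the junction alternative exon -> constitutive exon 2
    c1_c2 : reads on the junction that skips the alternative exon
    Returns None (NA) when the total support is below min_reads.
    """
    inclusion = (c1_a + a_c2) / 2   # assumed: mean of the two inclusion junctions
    exclusion = c1_c2
    total = inclusion + exclusion
    if total == 0 or total < min_reads:
        return None
    return inclusion / total
```

With these assumptions, `psi_skipped_exon(30, 10, 20)` gives an inclusion of 20 out of 40 supporting reads, while `psi_skipped_exon(2, 2, 3)` is reported as missing because only 5 reads support the event.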

### Alternative Splicing Annotation Information

Two alternative splicing annotations are provided with this pipeline and can be downloaded [here](https://drive.google.com/drive/folders/1lpcx3eKG2UpauntLUuJ6bMBjHyIhWW_R). `hg38_suppa.rds` is created via SUPPA using the gtf file of the xqtl-pipeline, and `modified_psichomics_hg38_splicing_annotation.rds` is modified from the default Human hg38 (2018-04-30) annotation provided by the psichomics package. A description of the database can be found in the Alternative splicing annotation section of the [MATERIALS AND METHODS](https://academic.oup.com/nar/article/47/2/e7/5114259?login=true#130023625) part. Gene names of the original annotation are replaced by Ensembl IDs to unify the format. The Ensembl IDs used in the modification are matched from the gtf file, the HGNC database, and the SUPPA and VAST-TOOLS records within the original annotation.

In principle, the annotation created from the gtf file alone gives results that are more consistent with other parts of the pipeline. The annotation modified from the original psichomics hg38 annotation can identify more events, since it was built to maximize the information included, but it also carries a risk of containing outdated information.

For details on how the gtf file and the two splicing annotations were generated, see the GFF3 to GTF formatting, Generation of SUPPA annotation for psichomics, and Modification of psichomics default Hg38 splicing annotation sections in [reference_data.ipynb](https://github.com/cumc/xqtl-pipeline/blob/main/code/data_preprocessing/reference_data.ipynb).

### Things to keep in mind:

* The script below can run the `prepareJunctionQuant()` function from the psichomics package on several input directories. However, `prepareJunctionQuant()` generates one `psichomics_junctions.txt` in each input directory. We recommend deleting these `psichomics_junctions.txt` files from the input directories before rerunning the `psichomics_1` step, since there will probably be an overwrite conflict otherwise.
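
The cleanup described above could be scripted along these lines. This is a minimal sketch with a hypothetical helper name; it defaults to a dry run so that nothing is deleted until the list of files has been reviewed.

```python
from pathlib import Path

def remove_stale_junction_files(input_dirs, filename="psichomics_junctions.txt", dry_run=True):
    """Find (and, unless dry_run, delete) leftover junction files in the input directories.

    Returns the list of paths that were found.
    """
    found = []
    for directory in input_dirs:
        target = Path(directory) / filename
        if target.is_file():
            if not dry_run:
                target.unlink()
            found.append(str(target))
    return found
```

Calling it once with `dry_run=True` and then again with `dry_run=False` keeps the deletion an explicit second step.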

In [26]:
[psichomics_1]
input: SJtab_data
output: f'{cwd}/psi_raw_data.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
    library("psichomics")
    library("dplyr")
    library("tidyr")
    library("purrr")
  
    # Group the SJ.out.tab files by the directory that contains them
    files = list()

    for (f in c(${_input:ar,})){
      filename = basename(f)
      directory = dirname(f)
      if (length(files[[directory]]) == 0){
        files[[directory]] = filename
      } else {
        files[[directory]] = append(files[[directory]], filename)
      }
    }

    # Run prepareJunctionQuant() from within each input directory
    # (it writes psichomics_junctions.txt there), then merge the results
    if (length(files) == 1) {
      setwd(names(files)[1])
      res = prepareJunctionQuant(files[[1]])
    } else {
      res = list()
      for (d in names(files)) {
        setwd(d)
        res[[d]] = prepareJunctionQuant(files[[d]])
      }
      res = res %>% reduce(full_join, by = "Junction ID")
    }
    
    res[is.na(res)] <- 0
  
    write.table(res, file='${cwd}/psichomics_junctions.txt', quote=FALSE, sep='\t', row.names=FALSE)
    
    data <- loadLocalFiles("${cwd}")
    junctionQuant <- data[[1]]$`Junction quantification`
    annotation = readRDS("${splicing_annotation}")
    psi <- quantifySplicing(annotation, junctionQuant)
    write.table(psi, file='${cwd}/psi_raw_data.tsv', quote=FALSE, sep='\t')