# Reference data standardization

This module provides reference data download, indexing and preprocessing (if necessary), in preparation for use throughout the pipeline.

The PDF document compiled by the data standardization subgroup is available [on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing) as well as on the [ADSP Dashboard](https://www.niagads.org/adsp/content/adspgcadgenomeresources-v2pdf). It lists the reference data to use for the project.

## Overview

This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). The processed reference data (see the Methods section and the rest of this document for details) can be found [in this folder on Google Drive](https://drive.google.com/drive/folders/19fmoII8yS7XE7HFcMU4OfvC2bL1zMD_P).

### Processed reference file for RNA-seq based expression quantification

**We have decided to use these preprocessed reference files for RNA-seq expression quantification. They may not be applicable to other molecular phenotypes.**

Specifically, the reference files to be used are:

1. `GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}`
2. `Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf` for stranded protocols, and `Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf` for unstranded protocols.
3. Everything under the `STAR_Index` folder
4. Everything under the `RSEM_Index` folder
5. Optionally, for quality control, `gtf_ref.flat`
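
As a quick sanity check before running the pipeline, the list above can be turned into a small script that reports which entries are missing from a local reference folder. This is only a sketch: the function names are hypothetical, and it assumes all files and both index folders sit directly under one directory, which may not match your layout.

```python
from pathlib import Path

# File names copied from the list above; the flat directory layout is an assumption.
PREFIX = "GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC"
GTF_STRANDED = "Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf"
GTF_UNSTRANDED = "Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf"


def expected_reference_files(stranded):
    """Return the file and folder names expected for the chosen protocol."""
    names = [f"{PREFIX}.{ext}" for ext in ("dict", "fasta", "fasta.fai")]
    names.append(GTF_STRANDED if stranded else GTF_UNSTRANDED)
    names += ["STAR_Index", "RSEM_Index"]
    return names


def missing_reference_files(ref_dir, stranded):
    """List the expected entries that do not exist under ref_dir."""
    root = Path(ref_dir)
    return [n for n in expected_reference_files(stranded) if not (root / n).exists()]
```

For example, `missing_reference_files("reference_data", stranded=True)` returns an empty list when everything is in place.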



## Methods

Workflows implemented include:

### Convert transcript feature file gff3 to gtf

- Input: an uncompressed gff3 file (i.e., one that can be viewed with `cat`).
- Output: a gtf file.

### Collapse transcript features into genes

- Input: a gtf file.
- Output: a gtf file with a collapsed gene model.

### Generate STAR index based on gtf and reference fasta

- Input: a gtf file and an accompanying fasta file.
- Output: a folder containing the STAR index.

### Generate RSEM index based on gtf and reference fasta

- Input: a gtf file and an accompanying fasta file.
- Output: a folder containing the RSEM index.

## Example commands

Downloading the reference data takes approximately one hour, depending on the network.

In [None]:
sos run pipeline/reference_data.ipynb download_hg_reference --cwd reference_data    &
sos run pipeline/reference_data.ipynb download_gene_annotation --cwd reference_data &
sos run pipeline/reference_data.ipynb download_ercc_reference --cwd reference_data &
sos run pipeline/reference_data.ipynb download_dbsnp --cwd reference_data &

To format the reference data (these steps should take about 10 minutes in total, with 16 GB of memory):

In [None]:
sos run reference_data.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container container/rna_quantification.sif

In [None]:
sos run pipeline/reference_data.ipynb hg_gtf \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container containers/rna_quantification.sif --stranded

To format gene feature data:

In [None]:
sos run pipeline/reference_data.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container containers/rna_quantification.sif --stranded

**For an unstranded RNA-seq protocol, use the switch `--no-stranded` in the command above instead of `--stranded`. More details can be found later in the document.**

Generating the STAR index without the GTF annotation file allows the read length to be customized later, at the STAR alignment step. STAR needs at least 40 GB of memory to build the index.

- Approximate time: 30 min
- Memory: 40 GB

In [None]:
sos run pipeline/reference_data.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container containers/rna_quantification.sif \
    --mem 40G

**Notice that command above requires at least 40G of memory, and takes quite a while to complete**.

To generate the RSEM index, use the gtf file from **before** the gene collapsing step (the one **without** the `gene` tag in its file name).

Approximate time: 1 min

In [None]:
sos run pipeline/reference_data.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif  &

To generate the RefFlat annotation for Picard QC:

In [None]:
sos run pipeline/reference_data.ipynb RefFlat_generation \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf 

To generate the SUPPA annotation for psichomics:

In [None]:
sos run pipeline/reference_data.ipynb SUPPA_annotation \
    --hg_gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/psochimics.sif

## Command interface

In [1]:
sos run reference_data.ipynb -h

usage: sos run reference_data.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  download_hg_reference
  download_gene_annotation
  download_ercc_reference
  gff3_to_gtf
  hg_reference
  hg_gtf
  ercc_gtf
  gene_annotation
  STAR_index
  RSEM_indexing

Global Workflow Options:
  --cwd VAL (as path, required)
                        The output directory for generated files.
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container ''
               

In [None]:
[global]
# The output directory for generated files.
parameter: cwd = path("output")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
cwd = path(f'{cwd:a}')
from sos.utils import expand_size

## Data download

In [None]:
[download_hg_reference]
output: f"{cwd:a}/GRCh38_full_analysis_set_plus_decoy_hla.fa"
download: dest_dir = cwd
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

In [None]:
[download_gene_annotation]
output: f"{cwd:a}/Homo_sapiens.GRCh38.103.chr.gtf"
download: dest_dir = cwd, decompress=True
    http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.chr.gtf.gz

In [None]:
[download_ercc_reference]
output: f"{cwd:a}/ERCC92.gtf", f"{cwd:a}/ERCC92.fa"
download: dest_dir = cwd, decompress=True
    https://tools.thermofisher.com/content/sfs/manuals/ERCC92.zip

In [None]:
[download_dbsnp]
output: f"{cwd:a}/00-All.vcf.gz", f"{cwd:a}/00-All.vcf.gz.tbi"
download: dest_dir = cwd
    ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
    ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz.tbi

## GFF3 to GTF formatting

In [None]:
[gff3_to_gtf]
parameter: gff3_file = path
input: gff3_file
output: f'{cwd}/{_input:bn}.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
        gffread ${_input} -T -o ${_output}

## HG reference file preprocessing
1. Remove the HLA/ALT/Decoy records from the fasta file, because none of the downstream RNA-seq calling pipeline components can handle them properly.
2. Add ERCC information to the fasta file. Even if ERCC is not included in the RNA-seq library, adding it does no harm.
3. Generate an index for the fasta file.
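
The contig filter in step 1 can also be written as a streaming pass, which avoids holding the whole genome in memory. The sketch below is an illustration, not the code used by this workflow. It applies the same name rules (`_alt` suffix, `HLA` prefix, `_decoy` suffix), and the function names `keep_contig` and `filter_fasta` are hypothetical.

```python
def keep_contig(name):
    """True for contigs to keep; False for ALT, HLA and decoy contigs."""
    return not (name.endswith("_alt") or name.startswith("HLA") or name.endswith("_decoy"))


def filter_fasta(lines):
    """Yield only the lines that belong to contigs accepted by keep_contig."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The contig name is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            keep = keep_contig(fields[0] if fields else "")
        if keep:
            yield line


def filter_fasta_file(src, dst):
    """Stream src to dst, dropping ALT, HLA and decoy records."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(filter_fasta(fin))
```

Because `filter_fasta` works on any iterable of lines, it can be tried on a short list of strings before being pointed at the full reference.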

In [None]:
[hg_reference_1 (HLA ALT Decoy removal)]
# Path to HG reference file
parameter: hg_reference = path
input: hg_reference
output:  f'{cwd}/{_input:bn}.noALT_noHLA_noDecoy.fasta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    with open('${_input}', 'r') as fasta:
        contigs = fasta.read()
        contigs = contigs.split('>')
        contig_ids = [i.split(' ', 1)[0] for i in contigs]

        # exclude ALT, HLA and decoy contigs
        filtered_fasta = '>'.join([c for i,c in zip(contig_ids, contigs)
        if not (i[-4:]=='_alt' or i[:3]=='HLA' or i[-6:]=='_decoy')])
    
    with open('${_output}', 'w') as fasta:
        fasta.write(filtered_fasta)

In [None]:
[hg_reference_2 (merge with ERCC reference)]
parameter: ercc_reference = path
output: f'{cwd}/{_input:bn}_ERCC.fasta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output}.stdout', container = container
    sed 's/ERCC-/ERCC_/g' ${ercc_reference} >  ${ercc_reference:n}.patched.fa
    cat ${_input} ${ercc_reference:n}.patched.fa > ${_output}

In [None]:
[hg_reference_3 (index the fasta file)]
output: f'{cwd}/{_input:bn}.dict'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    samtools faidx ${_input}
    java -jar /opt/picard-tools/picard.jar \
    CreateSequenceDictionary \
    R=${_input} \
    O=${_output}

## Transcript and gene model reference processing

This step modifies the `gtf` file for the following reasons:

1. RSEM requires the GTF input to use the same chromosome name format (with the `chr` prefix) as the fasta file. **Although for STAR this problem can be solved by the now commented-out `--sjdbGTFchrPrefix "chr"` option, we have to add `chr` to the GTF for use with RSEM.**
2. The gene model collapsing script `collapse_annotation.py` from GTEx requires the gtf to have `transcript_type` instead of `transcript_biotype` in its annotation. We rename it here. **This problem can also be solved by modifying `collapse_annotation.py` while building the docker image, but since we are already doing 1 above, we think it is better to add another customization here.**
3. Add ERCC information to the `gtf` reference.

We may reimplement 1 and 2 if the problem with RSEM is solved, or when RSEM is no longer needed.
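
To make items 1 and 2 concrete, here is a minimal line-level sketch of the two edits. It is not the R implementation used by the workflow: it always adds the prefix rather than checking the fasta header first, it passes comment lines through instead of dropping them, and the function name `reformat_gtf_line` is hypothetical.

```python
def reformat_gtf_line(line, add_chr=True):
    """Apply the chr-prefix and transcript_type edits to a single GTF line."""
    if line.startswith("#"):
        return line  # header/comment lines are passed through unchanged
    fields = line.rstrip("\n").split("\t")
    # Column 1 is the sequence name; add the prefix only if it is missing.
    if add_chr and not fields[0].startswith("chr"):
        fields[0] = "chr" + fields[0]
    # Column 9 holds the attributes; rename the Ensembl-style key to the GENCODE-style one.
    fields[8] = fields[8].replace("transcript_biotype", "transcript_type")
    return "\t".join(fields) + "\n"
```

The function assumes well-formed nine-column GTF records. Malformed lines would raise an `IndexError` rather than being silently skipped.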

In [None]:
[hg_gtf_1 (add chr prefix to gtf file)]
parameter: hg_reference = path
parameter: hg_gtf = path
input: hg_reference, hg_gtf
output: f'{cwd}/{_input[1]:bn}.reformatted.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    library("readr")
    library("stringr")
    library("dplyr")
    options(scipen = 999)
    con <- file("${_input[0]}","r")
    fasta <- readLines(con,n=1)
    close(con)
    gtf = read_delim("${_input[1]}", "\t",  col_names  = F, comment = "#", col_types="ccccccccc")
    if(!str_detect(fasta,">chr")) {
        gtf_mod = gtf%>%mutate(X1 = str_remove_all(X1,"chr"))
    } else if (!any(str_detect(gtf$X1[1],"chr"))) {
        gtf_mod = gtf%>%mutate(X1 = paste0("chr",X1))
    } else (gtf_mod = gtf)
    if(any(str_detect(gtf_mod$X9, "transcript_biotype"))) {
      gtf_mod = gtf_mod%>%mutate(X9 = str_replace_all(X9,"transcript_biotype","transcript_type"))
    }
    gtf_mod%>%write.table("${_output}",sep = "\t",quote = FALSE,col.names = F,row.names = F)

**Text below is taken from https://github.com/broadinstitute/gtex-pipeline/tree/master/gene_model**


Gene-level expression and eQTLs from the GTEx project are calculated based on a collapsed gene model (i.e., combining all isoforms of a gene into a single transcript), according to the following rules:

1. Transcripts annotated as “retained_intron” or “read_through” are excluded. Additionally, transcripts that overlap with annotated read-through transcripts may be blacklisted (blacklists for GENCODE v19, 24 & 25 are provided in this repository; no transcripts were blacklisted for v26).
2. The union of all exon intervals of each gene is calculated.
3. Overlapping intervals between genes are excluded from all genes.


The purpose of step 3 is primarily to exclude overlapping regions from genes annotated on both strands, which can't be unambiguously quantified from unstranded RNA-seq (GTEx samples were sequenced using an unstranded protocol). For stranded protocols, this step can be skipped by adding the `--collapse_only` flag.

Further documentation is available on the [GTEx Portal](https://gtexportal.org/home/documentationPage#staticTextAnalysisMethods).
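
The following toy example (our own illustration, not part of the GTEx text and not how `collapse_annotation.py` is implemented) sketches rules 2 and 3 on plain integer intervals for genes on a single chromosome. It ignores transcript filtering and strand. Coordinates are treated as closed, 1-based intervals, adjacent intervals are merged, and all function names are hypothetical.

```python
def merge_intervals(intervals):
    """Union of closed integer intervals, returned sorted and non-overlapping."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def subtract(intervals, blocked):
    """Remove every position covered by blocked (sorted, merged) from intervals."""
    result = []
    for start, end in intervals:
        cursor = start
        for b_start, b_end in blocked:
            if b_end < cursor or b_start > end:
                continue
            if b_start > cursor:
                result.append((cursor, b_start - 1))
            cursor = max(cursor, b_end + 1)
        if cursor <= end:
            result.append((cursor, end))
    return result


def collapse(genes, collapse_only=False):
    """genes maps gene id -> list of exon intervals; returns the collapsed model."""
    unions = {g: merge_intervals(exons) for g, exons in genes.items()}  # rule 2
    if collapse_only:
        return unions  # stranded protocols: keep regions shared between genes
    collapsed = {}
    for g, own in unions.items():
        others = merge_intervals([iv for h, u in unions.items() if h != g for iv in u])
        collapsed[g] = subtract(own, others)  # rule 3
    return collapsed
```

With `genes = {"A": [(1, 10), (5, 20)], "B": [(15, 30)]}`, the default call keeps positions 1 to 14 for `A` and 21 to 30 for `B`, while `collapse_only=True` keeps the full unions.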

In [None]:
[hg_gtf_2 (collapsed gene model)]
parameter: stranded = bool
output: f'{_input:n}{".collapse_only" if stranded else ""}.gene.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    collapse_annotation.py ${"--collapse_only" if stranded else ""} ${_input} ${_output}

In [None]:
[ercc_gtf (Preprocess ERCC gtf file)]
parameter: ercc_gtf = path
input: ercc_gtf
output: f'{cwd}/{_input:bn}.genes.patched.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    with open('${_input}') as exon_gtf, open('${_output}', 'w') as gene_gtf:
        for line in exon_gtf:
            f = line.strip().split('\t')
            f[0] = f[0].replace('-','_')  # required for RNA-SeQC/GATK (no '-' in contig name)
        
            attr = f[8]
            if attr[-1]==';':
                attr = attr[:-1]
            attr = dict([i.split(' ') for i in attr.replace('"','').split('; ')])
            # add gene_name, gene_type
            attr['gene_name'] = attr['gene_id']
            attr['gene_type'] = 'ercc_control'
            attr['gene_status'] = 'KNOWN'
            attr['level'] = 2
            for k in ['id', 'type', 'name', 'status']:
                attr['transcript_'+k] = attr['gene_'+k]
        
            attr_str = []
            for k in ['gene_id', 'transcript_id', 'gene_type', 'gene_status', 'gene_name',
                'transcript_type', 'transcript_status', 'transcript_name']:
                attr_str.append('{0:s} "{1:s}";'.format(k, attr[k]))
            attr_str.append('{0:s} {1:d};'.format('level', attr['level']))
            f[8] = ' '.join(attr_str)
        
            # write gene, transcript, exon
            gene_gtf.write('\t'.join(f[:2]+['gene']+f[3:])+'\n')
            gene_gtf.write('\t'.join(f[:2]+['transcript']+f[3:])+'\n')
            f[8] = ' '.join(attr_str[:2])
            gene_gtf.write('\t'.join(f[:2]+['exon']+f[3:])+'\n')

In [None]:
[gene_annotation]
input: output_from("hg_gtf_1"), output_from("hg_gtf_2"), output_from("ercc_gtf")
output: f'{cwd}/{_input[0]:bn}.ERCC.gtf', f'{cwd}/{_input[1]:bn}.ERCC.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    cat ${_input[0]} ${_input[2]} > ${_output[0]}
    cat ${_input[1]} ${_input[2]} > ${_output[1]}

## Generating index file for `STAR` 

This step generates the index files for STAR alignment. The index only needs to be generated once and can be reused.

**At least 40GB of memory is needed**.
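
The memory floor can be thought of as "use the requested amount unless it is below 40G". The sketch below illustrates that rule with a small hand-written size parser. It does not reproduce `sos.utils.expand_size`, whose unit handling may differ. The names `parse_size` and `star_memory` are hypothetical, and binary units (1G = 1024^3 bytes) are an assumption.

```python
import re

# Binary multipliers; whether the scheduler uses binary or decimal units is an assumption.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert strings such as '16G' or '40960MB' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT])B?\s*", text.upper())
    if match is None:
        raise ValueError(f"cannot parse memory size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def star_memory(requested, floor="40G"):
    """Return the requested memory, raised to the floor if it is too small."""
    return requested if parse_size(requested) >= parse_size(floor) else floor
```

For instance, `star_memory("16G")` gives `"40G"`, while `star_memory("64G")` is returned unchanged.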

### Step Inputs

* `gtf` and `fasta`: paths to the reference annotation and sequence. Both need to be uncompressed. The `gtf` should be the one prior to collapsing by gene.
* `sjdbOverhang`: specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. We use 100 here as recommended by the TOPMed pipeline. See here for [some additional discussions](https://groups.google.com/g/rna-star/c/h9oh10UlvhI/m/BfSPGivUHmsJ). 
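
The ReadLength-1 rule is simple arithmetic, shown here as a tiny helper. The function name is hypothetical. Taking the maximum over several read lengths follows our reading of the STAR manual's advice for reads of varying length, so treat that part as an assumption.

```python
def sjdb_overhang(read_lengths):
    """Ideal sjdbOverhang for the given read lengths: max(ReadLength) - 1."""
    longest = max(read_lengths)
    if longest < 2:
        raise ValueError("read length must be at least 2")
    return longest - 1
```

Reads of 101 bases give `sjdb_overhang([101]) == 100`, which matches the value of 100 mentioned above.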

### Step Output

* Index files stored in `{cwd}/STAR_Index`, which will be used by `STAR`

In [None]:
[STAR_index]
parameter: hg_reference = path
# Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.
# Default choice follows from TOPMed pipeline recommendation.
if expand_size(mem) < expand_size('40G'):
    print("Insufficient memory for STAR, changing to 40G")
    star_mem = '40G'
else:
    star_mem = mem
input: hg_reference
output: f"{cwd}/STAR_Index/chrName.txt", 
        f"{cwd}/STAR_Index/SAindex", f"{cwd}/STAR_Index/SA", f"{cwd}/STAR_Index/genomeParameters.txt", 
        f"{cwd}/STAR_Index/chrStart.txt",
        f"{cwd}/STAR_Index/chrLength.txt", 
        f"{cwd}/STAR_Index/Genome", f"{cwd}/STAR_Index/chrNameLength.txt", 
        f"{cwd}/STAR_Index/geneInfo.tab"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = star_mem, tags = f'{step_name}_{_output[0]:bd}'
bash: container=container, expand= "${ }", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout'
    STAR --runMode genomeGenerate \
         --genomeDir ${_output[0]:d} \
         --genomeFastaFiles ${_input[0]} \
         --runThreadN ${numThreads}

## Generating index file for `RSEM`

This step generates the index files for `RSEM`. The index only needs to be generated once.

### Step Inputs

* `gtf` and `fasta`: paths to the reference annotation and sequence. The `gtf` should be the one prior to collapsing by gene.
* `sjdbOverhang`: specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.

### Step Outputs
* Index files stored in `{cwd}/RSEM_Index`, which will be used by `RSEM`

In [None]:
[RSEM_index]
parameter: hg_gtf = path
parameter: hg_reference = path
input: hg_reference, hg_gtf
output: f"{cwd}/RSEM_Index/rsem_reference.n2g.idx.fa", f"{cwd}/RSEM_Index/rsem_reference.grp", 
        f"{cwd}/RSEM_Index/rsem_reference.idx.fa", f"{cwd}/RSEM_Index/rsem_reference.ti", 
        f"{cwd}/RSEM_Index/rsem_reference.chrlist", f"{cwd}/RSEM_Index/rsem_reference.seq", 
        f"{cwd}/RSEM_Index/rsem_reference.transcripts.fa"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}'
bash: container=container, expand= "${ }", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout'
    rsem-prepare-reference \
            ${_input[0]} \
            ${_output[1]:n} \
            --gtf ${_input[1]} \
            --num-threads ${numThreads}

## Generation of RefFlat file

This file is needed by the Picard `CollectRnaSeqMetrics` module, which in turn
>produces metrics describing the distribution of the bases within the transcripts. It calculates the total numbers and the fractions of nucleotides within specific genomic regions including untranslated regions (UTRs), introns, intergenic sequences (between discrete genes), and peptide-coding sequences (exons). This tool also determines the numbers of bases that pass quality filters that are specific to Illumina data (PF_BASES).

In [None]:
[RefFlat_generation]
parameter: hg_gtf = path
input: hg_gtf
output: f'{_input:n}.ref.flat'
bash: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    gtfToGenePred ${_input}  ${_output}.tmp -genePredExt -geneNameAsName2
    awk -F'\t' -v OFS="\t" '{$1=$12 OFS $1;}7' ${_output}.tmp | cut -f 1-11 > ${_output}
    rm ${_output}.tmp
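
The `awk` and `cut` commands above amount to a column reorder: column 12 of the `genePredExt` table moves to the front, and the first ten original columns follow, giving eleven columns in total. A minimal Python sketch of the same reorder is shown below. The function name is hypothetical, and the column meanings in the comments are our understanding of the UCSC `genePredExt` layout rather than something stated in this document.

```python
def genepred_ext_to_refflat(line):
    """Reorder one tab-delimited genePredExt row into the 11-column refFlat layout."""
    cols = line.rstrip("\n").split("\t")
    # Column 12 (index 11, assumed to be the gene name) goes first,
    # followed by columns 1-10 (transcript name through exon ends).
    return "\t".join([cols[11]] + cols[:10]) + "\n"
```

The code depends only on column positions, so it behaves like the shell commands regardless of what each column contains.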

### Generation of SUPPA annotation for psichomics
The generation of the custom alternative splicing annotation is based on [this tutorial](https://rpubs.com/nuno-agostinho/preparing-AS-annotation). How to generate the local alternative splicing SUPPA output is documented on the [SUPPA GitHub page](https://github.com/comprna/SUPPA).

In [None]:
sos run pipeline/reference_data.ipynb SUPPA_annotation \
    --hg_gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/psochimics.sif

In [None]:
[SUPPA_annotation_1]
parameter: hg_gtf = path
input: hg_gtf
output: f'{cwd}/hg38.{_input:bn}_SE_strict.ioe' # The stderr file must not share the same prefix as the output file
bash: container=container, expand= "${ }", stderr = f'{cwd}/{_input:bn}.stderr', stdout = f'{cwd}/{_input:bn}.stdout'
    python ~/GIT/SUPPA/suppa.py generateEvents -i ${_input} -o ${cwd}/hg38.${_input:bn} -f ioe -e SE SS MX RI FL
[SUPPA_annotation_2]
parameter: hg_gtf = path
output: f'{cwd}/{hg_gtf:bn}.SUPPA_annotation.rds'
R: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library("psichomics")
    suppa <- parseSuppaAnnotation("${_input:d}", genome="hg38") 
    annot <- prepareAnnotationFromEvents(suppa)
    saveRDS(annot, file=${_output:r})

### Modification of psichomics default Hg38 splicing annotation
Since the original annotation provided by the psichomics package uses gene symbols, we modified it to use Ensembl IDs. The modified annotation is used in the [psichomics section](https://github.com/cumc/xqtl-pipeline/blob/main/code/molecular_phenotypes/calling/splicing_calling.ipynb).
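
The workflow below resolves each gene symbol to an Ensembl ID by consulting several sources in a fixed order: the GTF file, then HGNC, then the VAST-TOOLS event ID, then the SUPPA event ID. Symbols starting with `LOC` are kept as a last resort. The per-gene decision can be sketched in a few lines of Python. This illustrates the precedence only, not the R code, and the function name and the example IDs in the usage note are made up.

```python
def resolve_ensembl_id(gene, gtf_id=None, hgnc_id=None, vast_id=None, suppa_id=None):
    """Pick the identifier for one gene symbol, or None if it should be dropped."""
    # First candidate that is present, in priority order GTF > HGNC > VAST-TOOLS > SUPPA.
    candidate = next(
        (c for c in (gtf_id, hgnc_id, vast_id, suppa_id) if c is not None), None
    )
    # Anything that is not an Ensembl gene ID is treated as missing.
    if candidate is not None and candidate.startswith("ENSG"):
        return candidate
    # Fall back to the original symbol when it is a LOC-style identifier.
    if gene.startswith("LOC"):
        return gene
    return None
```

For example, `resolve_ensembl_id("GENE1", hgnc_id="ENSG00000000001", vast_id="ENSG00000000002")` returns the HGNC value, and a symbol with no usable ID and no `LOC` prefix returns `None`, which corresponds to the row being dropped.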

In [None]:
sos run pipeline/reference_data.ipynb psi_hg38_annotation_modification \
    --hg_gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gtf \
    --hgrc_db reference_data/hgnc_database.txt \
    --container container/psichomics.sif

In [None]:
[psi_hg38_annotation_modification_1]
parameter: hg_gtf = path
parameter: hgrc_db = path
input: hg_gtf, hgrc_db
output: f'{cwd}/modified_psichomics_hg38_splicing_annotation.rds'
R: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library("psichomics")
    library("purrr")
    library("dplyr")
    library("tidyr")
    library("data.table")
  
    # load the psichomics default annotation, the hg38 option from listSplicingAnnotations()
    annotation <- loadAnnotation("AH63657")

    
    # reduce the dimension of the annotation file
    annotation <- 
      map(annotation, ~.x%>%
                       tidyr::unnest(cols = `Gene`))
  
    # Create empty columns for each event for easier mapping
    annotation[["Tandem UTR"]][["SUPPA.Event.ID"]] <- NA
    annotation[["Tandem UTR"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Alternative first exon"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Alternative last exon"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Mutually exclusive exon"]][["VAST-TOOLS.Event.ID"]] <- NA
  
    # extract Ensembl ID substring from original SUPPA.ID and VASTTOOL.ID
    annotation <- 
      map(annotation, ~.x%>%
                       mutate(ENSG.SUPPA = substr(`SUPPA.Event.ID`, 1, 15))%>%
                       mutate(ENSG.VAST = substr(`VAST-TOOLS.Event.ID`, 1, 15)))
  
    # Load gtf file
    gtf_sample <- read.table('${_input[0]}',header = FALSE, sep = '\t')
  
    # from the gtf file, separate gene names and the corresponding Ensembl IDs
    gtf_sample <- separate(gtf_sample, V9, sep = ";",into = c("gene_id", "transcript_id", "exon_number", "gene_name"))
    gtf_sample <- separate(gtf_sample, gene_id, sep = " ",into = c("gene_id", "gene_id_val"))
    gtf_sample <- separate(gtf_sample, gene_name, sep = "e ",into = c("gene_name", "gene_name_val"))
  
    gtf_name_id_match <- gtf_sample[,c("gene_id_val","gene_name_val")]
    gtf_name_id_match <- gtf_name_id_match[!duplicated(gtf_name_id_match), ]
  
    # For any matched approved id in the psi hg38 annotation and gtf file, record the corresponding Ensembl ID
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.GTF` = gtf_name_id_match$gene_id_val[match(`Gene`, gtf_name_id_match$gene_name_val)]))
  
    # load hgnc database
    hgnc_db <- fread('${_input[1]}', fill = TRUE, header = TRUE, sep = '\t', quote="")
  
    # Combine the `Ensembl.ID.supplied.by.Ensembl.` and `Ensembl.gene.ID` columns; if there is any conflict, use the former
    # For the conflicting ones (15 in total), both records point to the same gene name on the Ensembl website, so the order should not matter
    hgnc_db <- hgnc_db %>%
    mutate(ENSG.ID = ifelse(`Ensembl ID(supplied by Ensembl)` == "", `Ensembl gene ID`, `Ensembl ID(supplied by Ensembl)`))
  
    # Create a one to one reference list for approved names, previous names and aliases
    # There is no duplicate symbol and Ensembl id info for approved symbol so no need for chromosome verification
    hgnc_name_id_match <- hgnc_db[,c("Approved symbol","ENSG.ID")]
    hgnc_name_prev_check <- hgnc_db[,c("Previous symbols","Chromosome","ENSG.ID")]
    hgnc_name_alias_check <- hgnc_db[,c("Alias symbols","Chromosome","ENSG.ID")]
  
    # Remove NAs
    hgnc_name_prev_check <- hgnc_name_prev_check[hgnc_name_prev_check$ENSG.ID != "",]
    hgnc_name_alias_check <- hgnc_name_alias_check[hgnc_name_alias_check$ENSG.ID != "",]

    hgnc_name_prev_check <- hgnc_name_prev_check[hgnc_name_prev_check$"Previous symbols" != "",] 
    hgnc_name_alias_check <- hgnc_name_alias_check[hgnc_name_alias_check$"Alias symbols" != "",]

    # Separate symbol column values from a list of symbols into individual rows with one symbol each
    hgnc_name_prev_check <- separate_rows(hgnc_name_prev_check, "Previous symbols", convert = FALSE)
    hgnc_name_alias_check <- separate_rows(hgnc_name_alias_check, "Alias symbols", convert = FALSE)
  
    # Convert chromosome info in the hgnc database to a number for matching with other databases
    hgnc_name_prev_check <- separate(hgnc_name_prev_check, "Chromosome", sep = 'p', into = "Chrp", remove = FALSE)
    hgnc_name_prev_check <- separate(hgnc_name_prev_check, "Chromosome", sep = 'q', into = "Chrq", remove = FALSE)
    hgnc_name_prev_check <- hgnc_name_prev_check%>%
                                mutate(Chr = ifelse(nchar(hgnc_name_prev_check$Chrp) <= 2, Chrp, Chrq))
    
    hgnc_name_alias_check <- separate(hgnc_name_alias_check, "Chromosome", sep = 'p', into = "Chrp", remove = FALSE)
    hgnc_name_alias_check <- separate(hgnc_name_alias_check, "Chromosome", sep = 'q', into = "Chrq", remove = FALSE)
    hgnc_name_alias_check <- hgnc_name_alias_check%>%
                                mutate(Chr = ifelse(nchar(hgnc_name_alias_check$Chrp) <= 2, Chrp, Chrq))
  
    # For any matched approved id in the psi hg38 annotation and hgnc database, record the corresponding Ensembl ID
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.HGNC` = hgnc_name_id_match$`ENSG.ID`[match(`Gene`, hgnc_name_id_match$"Approved symbol")]))
  
    # Drop hypothetical genes
    annotation<-
      map(annotation, ~.x%>%
            subset(`Gene` != 'Hypothetical'))
  
    # In the remaining NAs, for any matched alias/previous name and chromosome in the psi hg38 annotation and hgnc database, record the corresponding Ensembl ID
    # Symbol and chromosome are pasted into one key so that both must match in the same HGNC row
    annotation <-
      map(annotation, ~.x%>%
                      mutate(ENSG.HGNC = ifelse(is.na(`ENSG.HGNC`) | `ENSG.HGNC` == "",
                                                hgnc_name_alias_check$`ENSG.ID`[match(paste(`Gene`, `Chromosome`), paste(hgnc_name_alias_check$"Alias symbols", hgnc_name_alias_check$Chr))],
                                                `ENSG.HGNC`))%>%
                      mutate(ENSG.HGNC = ifelse(is.na(`ENSG.HGNC`) | `ENSG.HGNC` == "",
                                                hgnc_name_prev_check$`ENSG.ID`[match(paste(`Gene`, `Chromosome`), paste(hgnc_name_prev_check$"Previous symbols", hgnc_name_prev_check$Chr))],
                                                `ENSG.HGNC`)))
  
    # Build the final Ensembl ID column based on the gtf file first; then, for remaining NAs, check the HGNC database record, then the VAST-TOOLS and SUPPA records
    # Drop special cases where the ID recorded in VAST-TOOLS or SUPPA is not an Ensembl ID
    # Finally, for the remaining NA Ensembl IDs, if the gene name in the original annotation is an NCBI ID (LOC prefix), just use it
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.ID` = `ENSG.GTF`)%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.HGNC`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.VAST`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.SUPPA`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(substr(`ENSG.ID`, 1, 4) == 'ENSG',
                                                 `ENSG.ID`,
                                                 NA))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`) & substr(`Gene`, 1, 3) == 'LOC',
                                               `Gene`,
                                                `ENSG.ID`)))

    # Use the Ensembl IDs to replace gene names, drop remaining NAs
    annotation <-
    map(annotation, ~.x%>%
                       mutate(`Gene` = `ENSG.ID`)%>%
            drop_na(`Gene`)
          )

    # save modified annotation
    saveRDS(annotation, file = "${_output}")