# Gene coordinate annotation


This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated in `gct` format and converts them to `bed` format for downstream analysis.

## Overview

This pipeline is based on [`pyqtl`, as demonstrated here](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py).

**FIXME: please explain here what we do with gene symbol vs gene ID**

### Alternative implementation

Previously we used the `biomaRt` package in R instead of code from `pyqtl`. The core function calls are:

```r
    ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
    ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)
```

We require the ENSEMBL version to be specified explicitly in this pipeline. As of 2021, the Brain xQTL project uses ENSEMBL version 103.

## Input

1. Molecular phenotype data in `gct` format, with the first column being the ENSEMBL ID and the other columns being sample names.
2. GTF for the collapsed gene model
    - the gene names must be consistent with the molecular phenotype data matrices (e.g. ENSG00000000003 vs. ENSG00000000003.1 will not work)
3. (Optional) Metadata to match sample names between expression data and genotype files
    - Tab delimited with header
    - Only 2 columns: the first column is the sample name in the expression data, the second column is the sample name in the genotype data
    - **must contain all the sample names in the expression matrices, even if they do not exist in the genotype data**
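
The two consistency requirements above (sample coverage in the lookup table, and gene IDs that match with or without a version suffix) can be checked before running the workflow. The following is a minimal sketch, not part of the pipeline: the sample names, the column headers `sample_id` / `participant_id`, and the helper names `check_lookup` and `strip_version` are all hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-column lookup: expression sample name -> genotype sample name.
lookup_text = "sample_id\tparticipant_id\nS1_rna\tP1\nS2_rna\tP2\nS3_rna\tP3\n"
lookup = pd.read_csv(io.StringIO(lookup_text), sep="\t", dtype=str)

# Hypothetical phenotype matrix: first column is the gene ID, the rest are samples.
pheno = pd.DataFrame({
    "gene_id": ["ENSG00000000003", "ENSG00000000005"],
    "S1_rna": [5.1, 0.2],
    "S2_rna": [4.8, 0.0],
})

def check_lookup(pheno, lookup):
    """Return the expression samples that are missing from the lookup table."""
    samples = pheno.columns[1:]
    return sorted(set(samples) - set(lookup.iloc[:, 0]))

def strip_version(gene_ids):
    """Drop a trailing '.N' version suffix, e.g. ENSG00000000003.1 -> ENSG00000000003."""
    return pd.Series(gene_ids, dtype=str).str.replace(r"\.\d+$", "", regex=True)

# Every expression sample must appear in the first column of the lookup.
missing = check_lookup(pheno, lookup)

# Rename expression sample columns to genotype sample names.
mapping = dict(zip(lookup.iloc[:, 0], lookup.iloc[:, 1]))
renamed = pheno.rename(columns=mapping)
```

If the phenotype IDs and the GTF IDs only agree after `strip_version` is applied to both, the mismatch is a version suffix and one of the two inputs needs to be reformatted before annotation.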
    

## Output

Molecular phenotype data in `bed` format.

## Minimal working example

The MWE is available on [Google Drive](https://drive.google.com/drive/u/0/folders/1Rv2bWHBbX_tastTh49ToYVDMV6rFP5Wk).

In [None]:
sos run gene_annotation.ipynb annotate_coord \
    --cwd output \
    --phenoFile data/MWE.pheno_log2cpm.tsv.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf \
    --sample-participant-lookup data/sampleSheetAfterQC.txt \
    --container container/rna_quantification.sif --phenotype-id-type gene_name

## Command interface

In [3]:
sos run gene_annotation.ipynb -h

usage: sos run gene_annotation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  annotate_coord
  annotate_coord_biomart

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --annotation-gtf VAL (as path, required)
                        gene gtf annotation table
  --phenoFile VAL (as path, required)
                        Molecular phenotype matrix
  --phenotype-id-type 'gene_id'
                        Whether the input data is named by gene_id or gene_name.
                        By default it is gene_id, if not, please change it to
                        gene_name
  --job-size 1 (as int)
             

In [None]:
[global]
# Work directory & output directory
parameter: cwd = path("output")
#  gene gtf annotation table
parameter: annotation_gtf = path
# Molecular phenotype matrix
parameter: phenoFile = path
# Whether the input data is named by gene_id or gene_name. By default it is gene_id, if not, please change it to gene_name
parameter: phenotype_id_type = 'gene_id'
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
parameter: container = ""

## Region List generation

Partitioning the data by gene requires a region list file that:

1. has 5 columns: chr, start, end, gene_id, gene_name
2. contains the same genes as the bed file, or a subset of them
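
Both requirements on the region list can be verified with a few lines of pandas. This is a minimal sketch under stated assumptions: the helper `validate_region_list` is hypothetical, the first column is assumed to be written as `#chr` (as the step below does), and the coordinates in the example are illustrative only.

```python
import pandas as pd

# Expected header of the region list (first column assumed to be "#chr").
REGION_COLUMNS = ["#chr", "start", "end", "gene_id", "gene_name"]

def validate_region_list(region_list, bed_gene_ids):
    """Return a list of problems; an empty list means both requirements hold."""
    problems = []
    if list(region_list.columns) != REGION_COLUMNS:
        problems.append(
            f"expected columns {REGION_COLUMNS}, got {list(region_list.columns)}"
        )
    else:
        # The region list may contain fewer genes than the bed file, never more.
        extra = sorted(set(region_list["gene_id"]) - set(bed_gene_ids))
        if extra:
            problems.append(f"genes absent from the bed file: {extra}")
    return problems

# Illustrative region list; the coordinates are made up for the example.
region_list = pd.DataFrame(
    [
        ["chr1", 29553, 29554, "ENSG00000243485", "MIR1302-2HG"],
        ["chr1", 36080, 36081, "ENSG00000237613", "FAM138A"],
        ["chr1", 65418, 65419, "ENSG00000186092", "OR4F5"],
    ],
    columns=REGION_COLUMNS,
)

ok = validate_region_list(
    region_list, ["ENSG00000243485", "ENSG00000237613", "ENSG00000186092"]
)
bad = validate_region_list(region_list, ["ENSG00000243485", "ENSG00000237613"])
```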

Input:

1. The gtf file used to generate the bed file
2. A phenotype bed file, which must have a gene_id column giving the gene names.

In [None]:
[region_list_generation]
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bnn}.region_list'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
python: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
    import pandas as pd
    import qtl.io
    # get the five column data
    bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = "gene_id" )
    bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = "gene_name" )
    bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = ["chr","start","end"])
    bed_template_df.columns = ["#chr","start","end","gene_id","gene_name"]
    pheno = pd.read_csv(${_input[0]:r}, sep = "\t")
    # Retaining only the genes in the data
    region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]
    region_list.to_csv("${_output}", sep = "\t",index = 0)

## Implementation using `pyqtl`

Implementation based on [GTEx pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py).

In [None]:
[annotate_coord]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
python: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container

    import pandas as pd
    import qtl.io
    from pathlib import Path
    def prepare_bed(df, bed_template_df, chr_subset=None):
        bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
        # sort by start position
        bed_df = bed_df.groupby('chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
        if chr_subset is not None:
            # subset chrs from VCF
            bed_df = bed_df[bed_df.chr.isin(chr_subset)]
        return bed_df
    # Load data
    df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
    sample_participant_lookup = Path("${sample_participant_lookup:a}")
    if "chr" in df.columns and "start" in df.columns and  "end" in df.columns:
        # drop existing coordinate columns (not rows) before re-annotating
        df = df.drop(columns=["chr", "start", "end"])
    df.set_index( df.columns[0] , inplace=True)
    
    # change sample IDs to participant IDs
    if sample_participant_lookup.is_file():
        sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str}, squeeze=True)
        df.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)

    if sum(qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='gene',phenotype_id = "gene_id" ).index.duplicated()) >0:
        raise ValueError(f"GTF file ${_input[1]:ar} needs to be collapsed into gene model by reference data processing module")
         
    bed_template_df = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = "${phenotype_id_type}" )

    ### Detect duplicated gene_id
    dup_count = bed_template_df.groupby(bed_template_df.index).cumcount().astype(str).values
    dup_count = pd.Series([f'.{x}'.replace(".0","") for x in dup_count])
    ### Add suffix to the duplicated gene_id
    bed_template_df.index = bed_template_df.index +  dup_count
    bed_template_df.gene_id = bed_template_df.index
    bed_df = prepare_bed(df, bed_template_df)
    qtl.io.write_bed(bed_df, ${_output:r})

## Implementation using biomaRt
This workflow adds the chromosome position (the TSS, where start = end - 1) and gene_ID annotations to the `bed` file. **This workflow is obsolete**.
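
To make the `start = end - 1` convention concrete, the sketch below converts 1-based, inclusive gene coordinates (as found in a GTF) into a one-base BED interval at the TSS. The function `tss_interval` is hypothetical and the strand-aware choice of TSS is an assumption made for illustration; note that the biomaRt code below derives `end = start + 1` from `start_position` without consulting the strand.

```python
def tss_interval(start, end, strand="+"):
    """Return the BED interval (start, end) covering the TSS, with start = end - 1.

    `start` and `end` are 1-based inclusive gene coordinates; the returned
    interval is 0-based and half-open, as BED requires.
    """
    if strand == "+":
        tss = start  # transcription begins at the left edge
    elif strand == "-":
        tss = end    # transcription begins at the right edge
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return tss - 1, tss

# Illustrative spans only; not taken from a real gene model.
plus_interval = tss_interval(29554, 31097, "+")
minus_interval = tss_interval(34554, 36081, "-")
```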

In [None]:
[annotate_coord_biomart]
parameter: ensembl_version=int

input: phenoFile
output: f'{cwd:a}/{_input:bn}.bed.gz',
        f'{cwd:a}/{_input:bn}.region_list'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
R:  expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout' ,container = container
    library("biomaRt")
    library(dplyr)
    library(readr)
    biomartCacheClear()
    gene_exp = readr::read_delim("$[_input[0]]",delim = "\t")
    if("#chr" %in% colnames(gene_exp) ){
      # need to re-annotate
      gene_exp = gene_exp[,4:ncol(gene_exp)]
    }
    ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
    ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)
    my_genes = gene_exp$gene_ID
    keep_genes =  my_genes
    my_genes_ann = ensembl_df[match(my_genes, ensembl_df$ensembl_gene_id),]%>%filter(chromosome_name%in%1:23)%>%dplyr::rename( "#chr" = chromosome_name, "start" = start_position, "end" = end_position,"gene_ID" = ensembl_gene_id)%>%filter(gene_ID!="NA", gene_ID%in%keep_genes)
    my_genes_ann%>%select(`#chr`,start,end,gene_ID)%>%write_delim(path = "$[_output[1]]","\t")
    my_gene_bed = inner_join(my_genes_ann %>%mutate(end = start + 1) %>%select(`#chr`,start,end,gene_ID),gene_exp,by = "gene_ID" )%>%arrange(`#chr`,start) 
    my_gene_bed%>%readr::write_tsv( path = "$[_output[0]:n]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")

bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
        bgzip -f $[_output[0]:n]
        tabix -p bed $[_output[0]] -f

## Annotation of leafcutter isoform
The following steps process the output files of leafcutter so that they are TensorQTL ready. Shown below are three intermediate files:

Exon list

chr   |  start  |  end    |  strand  | gene_id | gene_name
------|---------|---------|----------|----------|--------------
chr1  |  29554  |  30039  |  +       | ENSG00000243485 | MIR1302-2HG
chr1  |  30564  |  30667  |  +       | ENSG00000243485 | MIR1302-2HG
chr1  |  30976  |  31097  |  +       | ENSG00000243485 | MIR1302-2HG
chr1  |  35721  |  36081  |  -       | ENSG00000237613 | FAM138A
chr1  |  35277  |  35481  |  -       | ENSG00000237613 | FAM138A
chr1  |  34554  |  35174  |  -       | ENSG00000237613 | FAM138A
chr1  |  65419  |  65433  |  +       | ENSG00000186092 | OR4F5
chr1  |  65520  |  65573  |  +       | ENSG00000186092 | OR4F5
chr1  |  69037  |  71585  |  +       | ENSG00000186092 | OR4F5

clusters_to_genes


|clu    | genes |
|--------|---------- |
|1:clu_1_+  |    ENSG00000116288|
|1:clu_10_+ |    ENSG00000143774|
|1:clu_11_+ |    ENSG00000143774|
|1:clu_12_+ |    ENSG00000143774|
|1:clu_14_- |    ENSG00000126709|
|1:clu_15_- |    ENSG00000121753|
|1:clu_16_- |    ENSG00000121753|
|1:clu_17_- |    ENSG00000116560|
|1:clu_18_- |    ENSG00000143549|

phenotype_group

|X1|X2|
|-------------|---|
| 7:102476270:102478811:clu_309_-:ENSG00000005075 | ENSG00000005075 | 
| 7:102476270:102478808:clu_309_-:ENSG00000005075 | ENSG00000005075 |
| X:47572961:47574002:clu_349_-:ENSG00000008056   | ENSG00000008056 |
| X:47572999:47574002:clu_349_-:ENSG00000008056   | ENSG00000008056 |
| 8:27236905:27239971:clu_322_-:ENSG00000015592   | ENSG00000015592 |
| 8:27239279:27239971:clu_322_-:ENSG00000015592   | ENSG00000015592 |
| 8:27241262:27241677:clu_323_-:ENSG00000015592   | ENSG00000015592 |
| 8:27241262:27242397:clu_323_-:ENSG00000015592   | ENSG00000015592 |
| 8:27241757:27242397:clu_323_-:ENSG00000015592   | ENSG00000015592 |
| 1:35558223:35559107:clu_4_+:ENSG00000020129     | ENSG00000020129 |

The gtf used here should be the collapsed gtf, i.e. the final output of the reference_data gtf processing and the one used for RNA-seq calling.
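
The relationship between the three tables can be illustrated in plain Python: an intron ID of the form `chr:start:end:cluster` is reduced to a cluster ID `chr:cluster`, looked up in the clusters-to-genes table, and expanded into one `intron:gene` phenotype ID per mapped gene. This is a simplified sketch of that bookkeeping; the helper names are hypothetical and the cluster-to-gene entry is inferred from the phenotype_group example above.

```python
def cluster_id_from_intron(intron_id):
    """'7:102476270:102478811:clu_309_-' -> '7:clu_309_-'."""
    parts = intron_id.split(":")
    return parts[0] + ":" + parts[-1]

def phenotype_groups(intron_ids, cluster_to_genes):
    """Map each 'intron:gene' phenotype ID to its gene.

    `cluster_to_genes` maps a cluster ID to a comma-separated gene string.
    Introns whose cluster has no gene mapping are skipped.
    """
    groups = {}
    for intron_id in intron_ids:
        genes = cluster_to_genes.get(cluster_id_from_intron(intron_id))
        if genes is None:
            continue
        for gene in genes.split(","):
            groups[f"{intron_id}:{gene}"] = gene
    return groups

cluster_to_genes = {"7:clu_309_-": "ENSG00000005075"}
introns = [
    "7:102476270:102478811:clu_309_-",
    "7:102476270:102478808:clu_309_-",
    "9:1000:2000:clu_999_+",  # hypothetical intron with no gene mapping
]
groups = phenotype_groups(introns, cluster_to_genes)
```

Each key of `groups` corresponds to the first column of the phenotype_group table and each value to the second.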

In [None]:
[map_leafcutter_cluster_to_gene]
## Extract the code in case psichomics needs to be processed the same way
## phenoFile in this step is the intron_count file
parameter: intron_count = path
input: intron_count, annotation_gtf
output: f'{_input[1]}.exon_list', f'{cwd}/{_input[0]:b}.leafcutter.clusters_to_genes.txt'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container
    import pandas as pd
    import qtl.annotation
    # Load data
    annot = qtl.annotation.Annotation(${_input[1]:r})
    exon_df = pd.DataFrame([[g.chr, e.start_pos, e.end_pos, g.strand, g.id, g.name]
                        for g in annot.genes for e in g.transcripts[0].exons],
                       columns=['chr', 'start', 'end', 'strand', 'gene_id', 'gene_name'])
    exon_df.to_csv(${_output[0]:r}, sep='\t', index=False)


R:expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container
    suppressMessages(library(dplyr, quietly=TRUE))
    suppressMessages(library(stringr, quietly=TRUE))
    suppressMessages(library(foreach, quietly=TRUE))
    # leafcutter functions:
    
    #' Make a data.frame of meta data about the introns
    #' @param introns Names of the introns
    #' @return Data.frame with chr, start, end, cluster id
    #' @export
    get_intron_meta <- function(introns) {
      intron_meta <- do.call(rbind, strsplit(introns,":"))
      colnames(intron_meta) <- c("chr","start","end","clu")
      intron_meta <- as.data.frame(intron_meta, stringsAsFactors=FALSE)
      intron_meta$start <- as.numeric(intron_meta$start)
      intron_meta$end <- as.numeric(intron_meta$end)
      intron_meta
    }
    
    #' Work out which gene each cluster belongs to. Note the chromosome names used in the two inputs must match.
    #' @param intron_meta Data frame describing the introns, usually from get_intron_meta
    #' @param exons_table Table of exons, see e.g. /data/gencode19_exons.txt.gz
    #' @return Data.frame with cluster ids and genes separated by commas
    #' @import dplyr
    #' @export
    map_clusters_to_genes <- function(intron_meta, exons_table) {
      gene_df <- foreach (chr=sort(unique(intron_meta$chr)), .combine=rbind) %dopar% {
    
        intron_chr <- intron_meta[ intron_meta$chr==chr, ]
        exons_chr <- exons_table[exons_table$chr==chr, ]
    
        exons_chr$temp <- exons_chr$start
        intron_chr$temp <- intron_chr$end
        three_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
    
        exons_chr$temp <- exons_chr$end
        intron_chr$temp <- intron_chr$start
        five_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
    
        all_matches <- rbind(three_prime_matches, five_prime_matches)[ , c("clu", "gene_name")]
    
        all_matches <- all_matches[!duplicated(all_matches),]
    
        if (nrow(all_matches)==0) return(NULL)
        all_matches$clu <- paste(chr,all_matches$clu,sep=':')
        all_matches
      }
    
      clu_df <- gene_df %>% group_by(clu) %>% summarize(genes=paste(gene_name, collapse = ","))
      class(clu_df) <- "data.frame"
      clu_df
    }

    cat("LeafCutter: mapping clusters to genes\n")
    intron_counts <- read.table(${_input[0]:r}, header=TRUE, check.names=FALSE, row.names=1)
    intron_meta <- get_intron_meta(rownames(intron_counts))
    exon_table <- read.table(${_output[0]:r}, header=TRUE, stringsAsFactors=FALSE)
    if (!any(str_detect(intron_meta$chr, "chr"))) {
        # intron IDs carry no "chr" prefix: strip it from the exon table
        exon_table = exon_table%>%mutate(chr = str_remove_all(chr,"chr"))
    } else if (!any(str_detect(exon_table$chr[1],"chr"))) {
        # intron IDs carry the prefix but the exon table does not: add it
        exon_table = exon_table%>%mutate(chr = paste0("chr",chr))
    }
    stopifnot(is.element('gene_id', colnames(exon_table)))
    exon_table[, 'gene_name'] <- exon_table[, 'gene_id']
    m <- map_clusters_to_genes(intron_meta, exon_table)
    write.table(m, ${_output[1]:r}, sep = "\t", quote=FALSE, row.names=FALSE)

In [None]:
[annotate_leafcutter_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf,output_from("map_leafcutter_cluster_to_gene")
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container
    import pandas as pd
    import numpy as np
    import qtl.io
    from pathlib import Path
    # Load data
    tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r})
    bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
    bed_df.columns.values[0] = "#chr" # Temporary
    sample_participant_lookup = Path("${sample_participant_lookup:a}")
    cluster2gene_dict = pd.read_csv(${_input[3]:r}, sep='\t', index_col=0, squeeze=True).to_dict()
    print('    ** assigning introns to gene mapping(s)')
    n = 0
    gene_bed_df = []
    group_s = {}
    for _,r in bed_df.iterrows():
        s = r['ID'].split(':')
        cluster_id = s[0]+':'+s[-1]
        if cluster_id in cluster2gene_dict:
            gene_ids = cluster2gene_dict[cluster_id].split(',')
            for g in gene_ids:
                gi = r['ID']+':'+g
                gene_bed_df.append(tss_df.loc[g, ['chr', 'start', 'end']].tolist() + [gi] + r.iloc[4:].tolist())
                group_s[gi] = g
        else:
            n += 1
    if n > 0:
        print(f'    ** discarded {n} introns without a gene mapping')

    print('  * writing BED files for QTL mapping')
    gene_bed_df = pd.DataFrame(gene_bed_df, columns=bed_df.columns)
    # sort by TSS
    gene_bed_df = gene_bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
    # change sample IDs to participant IDs
    if sample_participant_lookup.is_file():
        sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str}, squeeze=True)
        gene_bed_df.rename(columns=sample_participant_lookup_s, inplace=True)
    qtl.io.write_bed(gene_bed_df, ${_output[0]:r})
    gene_bed_df[['start', 'end']] = gene_bed_df[['start', 'end']].astype(np.int32)
    gene_bed_df[gene_bed_df.columns[4:]] = gene_bed_df[gene_bed_df.columns[4:]].astype(np.float32)
    pd.Series(group_s).sort_values().to_csv(${_output[1]:r}, sep='\t', header=False)

## Processing of psichomics output
psichomics by default groups the isoforms by gene name, so the only thing that needs to be done is to extract this information and, potentially, convert the gene symbol to an ENSG ID.
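
As a rough illustration of that extraction, the sketch below takes the text after the last underscore of each event ID as the gene symbol and then maps the symbol to an ENSG ID. The event IDs and the `symbol_to_ensg` dictionary are hypothetical; a gene symbol that itself contains an underscore would not be handled correctly by this approach.

```python
import pandas as pd

# Hypothetical event IDs whose last "_"-separated field is the gene symbol.
events = pd.DataFrame({
    "ID": [
        "SE_1_+_65419_65433_65520_65573_OR4F5",
        "A3SS_1_-_35721_35481_35277_FAM138A",
    ]
})

# Take everything after the last underscore as the gene symbol.
events["gene_name"] = events["ID"].str.rsplit("_", n=1).str[-1]

# Hypothetical symbol -> ENSG lookup, e.g. built from the collapsed GTF.
symbol_to_ensg = {"OR4F5": "ENSG00000186092", "FAM138A": "ENSG00000237613"}
events["gene_id"] = events["gene_name"].map(symbol_to_ensg)

# Two-column table analogous to the phenotype_group file.
phenotype_group = events[["ID", "gene_id"]]
```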

In [None]:
[annotate_psichomics_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container
    import pandas as pd
    import numpy as np
    import qtl.io
    from pathlib import Path
    tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r}, feature='gene',phenotype_id = "gene_id" )
    bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
    bed_df["gene_id"]  = [x[-1] for x in bed_df.ID.str.split("_")]
    sample_participant_lookup = Path("${sample_participant_lookup:a}")
    if "start" in  bed_df.columns:
        bed_df = bed_df.drop(["#Chr","start","end"],axis = 1)
    output = tss_df.merge(bed_df, left_on = ["gene_id"], right_on = ["gene_id"],how = "right").sort_values(["chr","start"])
    # change sample IDs to participant IDs
    if sample_participant_lookup.is_file():
        sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str}, squeeze=True)
        output.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)
    bed_output = output.drop("gene_id" , axis = 1)
    phenotype_group = output[["ID","gene_id"]]
    # pass the separator by keyword; positional use is deprecated in recent pandas
    bed_output.to_csv(${_output[0]:nr}, sep="\t", index=False)
    phenotype_group.to_csv(${_output[1]:r}, sep="\t", index=False, header=False)


bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container
    bgzip -f ${_output[0]:n}
    tabix ${_output[0]}