# Function getGeneNames()
---
This function takes a character vector of Ensembl IDs of one kind (gene, transcript, protein or exon stable IDs, optionally with version suffixes) and returns a vector of the gene names (HGNC symbols) associated with those IDs:

```R
getGeneNames(ids, biomart.dt, strip.version=TRUE, gene.symbols=FALSE)
```

This function accepts 4 arguments:
- `ids` - the input vector of Ensembl IDs
- `biomart.dt` - the input Ensembl BioMart data on Ensembl IDs, must be provided as a data table
- `strip.version=TRUE` - by default version suffixes are stripped from the IDs; change to _FALSE_ if ID versions are meant to be used
- `gene.symbols=FALSE` - by default the input vector is interpreted as a vector of valid Ensembl IDs, for which HGNC symbols (gene symbols) are returned; change to _TRUE_ to use this function the opposite way, i.e. to obtain Gene stable IDs for input gene symbols

The function output is a vector with the following properties:
- The vector is of the same length as the input vector, and the values of both vectors correspond to each other position by position.
- If a given ID exists but is not linked to any gene symbol, an empty character value is returned.
- If a given ID does not exist in the reference data, an NA value is returned.
- For every NA value in the input vector, an NA value is returned.
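
The output rules above can be illustrated with a plain base R lookup on toy data. This is only a sketch of the expected behaviour, not the implementation of `getGeneNames()`; the `ref` table and its IDs are made up for illustration.

```R
# Toy reference data (hypothetical IDs and symbols, not real Ensembl records).
ref <- data.frame(
    id     = c("ENSG_A", "ENSG_B"),
    symbol = c("GENE1", ""),   # ENSG_B exists but has no gene symbol
    stringsAsFactors = FALSE
)

# Input with an existing ID without a symbol, an NA, an unknown ID and a mapped ID.
ids <- c("ENSG_B", NA, "ENSG_X", "ENSG_A")

# match() yields NA for IDs absent from the reference and for NA input,
# so the result has the same length and order as the input.
out <- ref$symbol[match(ids, ref$id)]
print(out)
# The values are, in order: "", NA, NA, "GENE1"
```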

Data on IDs and HGNC symbols (BioMart data) must be obtained via [Ensembl BioMart](https://www.ensembl.org/biomart/martview) (see the next section) and provided as a data table with unchanged column names.

Examples:
```R
getGeneNames("ENSP00000354687", biomart_dt)
# Output: 'MT-ND1'

getGeneNames(c("RAVER2", "MT-RNR1"), biomart_dt, gene.symbols=TRUE)
# Output: 'ENSG00000162437' 'ENSG00000211459'
```

---

The data on IDs is already saved in the input directory as _`mart_export_ids.tsv.gz`_. It can be redownloaded in the following way:
- go to https://www.ensembl.org/biomart/martview
- as `-CHOOSE DATABASE-` select `Ensembl Genes 109`
- as `-CHOOSE DATASET-` select `Human Genes (GRCh38.p13)`
- go to `Attributes` and in the section `GENES:` select:
    - Gene stable ID
    - Gene stable ID version
    - Transcript stable ID
    - Transcript stable ID version
    - Protein stable ID
    - Protein stable ID version
    - Exon stable ID
- and in the section `EXTERNALS:` (subsection _External References_) select:
    - HGNC symbol
- in all sections, unselect any preselected attributes other than those listed above
- click the `Results` button
- in `Export all results to` choose `Compressed file (.gz)`, `TSV`
- check the option `Unique results only`
- click the `Go` button
- save the file in a location of your choosing
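
Because the function relies on unchanged BioMart column names, a downloaded file can be checked before use. The sketch below assumes the eight attributes listed above; `missingBiomartColumns()` is a hypothetical helper and not part of this notebook.

```R
# Column names expected in the BioMart export, as listed in the steps above.
expected_cols <- c(
    "Gene stable ID", "Gene stable ID version",
    "Transcript stable ID", "Transcript stable ID version",
    "Protein stable ID", "Protein stable ID version",
    "Exon stable ID", "HGNC symbol"
)

# Hypothetical helper: returns the expected columns missing from a table.
missingBiomartColumns <- function(dt, expected = expected_cols) {
    setdiff(expected, names(dt))
}

# Toy empty table that lacks the 'HGNC symbol' column.
toy <- data.frame(matrix(character(0), nrow = 0, ncol = 7))
names(toy) <- expected_cols[1:7]

missing_cols <- missingBiomartColumns(toy)
print(missing_cols)   # the only missing column is "HGNC symbol"
```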

---

- Do the necessary imports:

In [1]:
library("data.table")

- Define `getGeneNames()` function:

In [2]:
#' Get gene names (HGNC symbols) for Ensembl IDs
#'
#' This function takes a character vector of Ensembl IDs of one kind
#' (gene, transcript, protein or exon stable IDs, also ID versions),
#' and returns a vector with gene names (HGNC symbols) associated with those IDs.
#'
#' @param ids The input vector of Ensembl IDs
#' @param biomart.dt The input Ensembl BioMart data on Ensembl IDs, must be provided as a data table.
#' @param strip.version If TRUE (default), version suffixes are stripped from the IDs;
#'                      set to FALSE if ID versions are meant to be used.
#' @param gene.symbols If FALSE (default), the input vector is interpreted as a vector of valid
#'                     Ensembl IDs, for which HGNC symbols (gene symbols) are returned; set to TRUE
#'                     to use this function the opposite way, i.e. to obtain Gene stable IDs
#'                     for input gene symbols.
#' @return The vector of corresponding gene names (HGNC symbols):
#' The vector is of the same length as the input vector, and subsequent values of both vectors correspond to each other.
#' If a given ID exists but is not linked to any gene symbol, empty character value is returned.
#' If a given ID does not exist in the reference data, NA value is returned.
#' For every NA value in the input vector, NA value is returned.
#' @examples
#' getGeneNames("ENSP00000354687", biomart_dt)
#' # Output: 'MT-ND1'
#' getGeneNames(c("RAVER2", "MT-RNR1"), biomart_dt, gene.symbols=TRUE)
#' # Output: 'ENSG00000162437' 'ENSG00000211459'
getGeneNames <- function(ids, biomart.dt, strip.version=TRUE, gene.symbols=FALSE) {
    # If ids is not a vector, throw an error.
    if (!is.vector(ids)){
        stop("Input data 'ids' is not a vector!")
    }

    # Create a copy of ids input vector.
    ids <- copy(ids)

    # If strip.version is set to TRUE (default), version suffixes
    # will be stripped, if present, from IDs (ids) values.
    if (strip.version)
        ids <- sub("\\.[^.]*$", "", ids)

    # The src_col is a column in biomart.dt that corresponds
    # to the type of provided ids.
    # Set the source column for biomart.dt to 'HGNC symbol'.
    # It is preliminarily assumed that provided ids are gene symbols.
    src_col <- "HGNC symbol"

    # If gene.symbols is FALSE, valid Ensembl IDs are expected.
    # Check the number of unique prefixes for ids.
    # If there is only one, find out of which kind or throw an error.
    # If there is more than one, throw an error.
    if (!gene.symbols) {
        id_types = unique( substr(ids[!is.na(ids)], 1, 4) )
        if (length(id_types) == 1) {
            src_col <- switch(id_types,
                "ENSG" = "Gene stable ID",
                "ENST" = "Transcript stable ID",
                "ENSP" = "Protein stable ID",
                "ENSE" = "Exon stable ID",
                NA
            )

            if (is.na(src_col)) {
                stop( paste("Unknown Ensembl ID prefix:", paste(id_types, collapse=", ")) )
            }
        }
        else {
            stop( paste("Mixed Ensembl ID types:", paste(id_types, collapse=", "), ".",
                        "Please provide IDs of the same type or",
                        "set gene.symbols to TRUE in order to use gene symbols.") )
        }
    }

    # The out_col is the column in biomart.dt whose values, matched to
    # the provided ids, are placed in the output vector.
    # By default it is 'HGNC symbol'; if gene.symbols is set to TRUE,
    # it is 'Gene stable ID'.
    out_col <- if (!gene.symbols) "HGNC symbol" else "Gene stable ID"

    # If versions are allowed, check whether all non-NA ids have versions
    # (NA values are skipped so that they do not count as unversioned IDs).
    # If everything is ok, add the proper suffix to src_col, unless
    # it is 'Exon stable ID', which has no version column;
    # in that case, throw an error.
    if (!strip.version) {
        contains_dot = grepl("\\.", ids[!is.na(ids)])
        if ( any(contains_dot) ) {
            if ( all(contains_dot) ) {
                if (src_col != "Exon stable ID") {
                    src_col <- paste(src_col, "version")
                } else {
                    stop("Exon stable IDs cannot be used with version designation.")
                }
            } else {
                stop( paste("Input data contains Ensembl IDs mixed with IDs with versions.",
                            "Use only one or the other, or set strip.version to TRUE.") )
            }
        }
    }

    # Convert ids into a data.table,
    # give it the name src_col.
    ids_dt <- data.table(col=ids)
    setnames(ids_dt, "col", src_col)

    # Create a copy of unique biomart.dt rows for src_col and out_col,
    # and convert them into a data.table.
    col_names <- c(src_col, out_col)
    sub_dt <- data.table( copy( unique( biomart.dt[, ..col_names] ) ) )

    # Group the resulting data.table by the column src_col,
    # take the unique grouped values from the column out_col,
    # and finally paste them into one string.
    # In short, when one ID is mapped to more than one gene symbol,
    # that ID is mapped to a single character value of concatenated symbols.

    groupped_dt <- sub_dt[, .(col = paste( unique( get(out_col) ), collapse = ", ")), by=get(src_col)]
    setnames(groupped_dt, "get", src_col)
    setnames(groupped_dt, "col", out_col)

    # Join groupped_dt with ids_dt on src_col.
    # That should preserve the order of IDs as
    # it is provided in the input vector ids.
    merged_dt <- groupped_dt[ids_dt, on=src_col]

    # Pick up and return the vector of gene names.
    gene_names <- merged_dt[[out_col]]
    return(gene_names)
}
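
Two of the steps inside the function, version stripping and ID type detection, can be tried in isolation. The snippet below is a minimal sketch that reuses the same regular expression and prefix mapping as the function body; the input vector is chosen only for illustration.

```R
ids <- c("ENSG00000198888.2", "ENSG00000211459", NA)

# Remove everything from the last dot onwards;
# IDs without a dot and NA values are left unchanged.
stripped <- sub("\\.[^.]*$", "", ids)

# Detect the ID type from the first four characters, ignoring NA values.
id_types <- unique(substr(stripped[!is.na(stripped)], 1, 4))

# Map the single detected prefix to the matching BioMart column name.
src_col <- switch(id_types,
    "ENSG" = "Gene stable ID",
    "ENST" = "Transcript stable ID",
    "ENSP" = "Protein stable ID",
    "ENSE" = "Exon stable ID",
    NA
)

print(stripped)
print(src_col)   # "Gene stable ID"
```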

---
# Test getGeneNames() function
___
- Load the BioMart data into a data table, then look up its dimensions and the first 6 rows.

In [3]:
biomart_dt <- fread(
    file="input/mart_export_ids.tsv.gz", sep="\t",
    header=TRUE, check.names=FALSE, stringsAsFactors=FALSE
)

dim(biomart_dt)
head(biomart_dt)

| Gene stable ID | Gene stable ID version | Transcript stable ID | Transcript stable ID version | Protein stable ID | Protein stable ID version | Exon stable ID | HGNC symbol |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| ENSG00000210049 | ENSG00000210049.1 | ENST00000387314 | ENST00000387314.1 | | | ENSE00001544501 | MT-TF |
| ENSG00000211459 | ENSG00000211459.2 | ENST00000389680 | ENST00000389680.2 | | | ENSE00001544499 | MT-RNR1 |
| ENSG00000210077 | ENSG00000210077.1 | ENST00000387342 | ENST00000387342.1 | | | ENSE00001544498 | MT-TV |
| ENSG00000210082 | ENSG00000210082.2 | ENST00000387347 | ENST00000387347.2 | | | ENSE00001544497 | MT-RNR2 |
| ENSG00000209082 | ENSG00000209082.1 | ENST00000386347 | ENST00000386347.1 | | | ENSE00002006242 | MT-TL1 |
| ENSG00000198888 | ENSG00000198888.2 | ENST00000361390 | ENST00000361390.2 | ENSP00000354687 | ENSP00000354687.2 | ENSE00001435714 | MT-ND1 |


---
### Use the getGeneNames() function with an example vector of several values

---
- A vector of example Ensembl IDs (gene stable IDs). It also contains one NA value.

In [4]:
exmpl_ids <- c("ENSG00000211459", NA, "ENSG00000179546", "ENSG00000210049",
               "ENSG00000289881", "ENSG00000276085", "ENSG00000277336")

- Pass a single value only:

In [5]:
getGeneNames(exmpl_ids[1], biomart_dt)

- Pass the whole vector:

In [6]:
getGeneNames(exmpl_ids, biomart_dt)

- A test for a vector of example HGNC symbols. It also contains one NA value and one incorrect value, _HEYHEY_. If gene symbols are used, this must be indicated explicitly (by `gene.symbols=TRUE`), which turns off the automatic Ensembl ID type detection.

In [7]:
gene_symbols <- c("MT-RNR1", NA, "HEYHEY", "CCL3L1", "CCL3L3")

In [8]:
getGeneNames(gene_symbols, biomart_dt, gene.symbols=TRUE)
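
When one key maps to several values, the function collapses them into a single comma-separated string before joining. The sketch below reproduces that grouping and the order-preserving join with `data.table` on toy data; the symbols and IDs are hypothetical, and the column names are simplified for readability.

```R
library(data.table)

# Toy mapping (hypothetical values): SYM1 is linked to two gene IDs.
map_dt <- data.table(
    symbol  = c("SYM1", "SYM1", "SYM2"),
    gene_id = c("ENSG_A", "ENSG_B", "ENSG_C")
)

# Collapse all gene IDs of one symbol into a single string.
grouped_dt <- map_dt[, .(gene_id = paste(unique(gene_id), collapse = ", ")), by = symbol]

# Join in the order of the query; unknown and NA symbols yield NA.
query_dt <- data.table(symbol = c("SYM2", NA, "NOPE", "SYM1"))
res <- grouped_dt[query_dt, on = "symbol"][["gene_id"]]
print(res)
# The values are, in order: "ENSG_C", NA, NA, "ENSG_A, ENSG_B"
```

The join `grouped_dt[query_dt, on = ...]` returns one row per row of `query_dt`, which is what keeps the output aligned with the input vector.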

---
### Use the getGeneNames() function with a larger dataset

---
- Load the dataset into a data frame, then look up its dimensions and the first two rows.

In [9]:
ids_df <- read.table(
    gzfile("input/test_ids.tsv.gz"), sep="\t", header=TRUE,
    check.names=FALSE, stringsAsFactors=FALSE
)

dim(ids_df)
head(ids_df, 2)

| | gene_id | transcript_id |
|---|---|---|
| | `<chr>` | `<chr>` |
| 1 | ENSG00000000003.10 | ENST00000373020.4 |
| 2 | ENSG00000000003.10 | ENST00000494424.1 |


---
- Test the function for **10** random `gene_id` values and the corresponding `transcript_id` values, with both values of the `strip.version` argument:

In [10]:
set.seed(1)
smpl_rows <- sample(which(complete.cases(ids_df)), 10)

In [11]:
ids <- ids_df[smpl_rows, "gene_id"]
names <- getGeneNames(ids, biomart_dt)
names

In [12]:
ids <- ids_df[smpl_rows, "transcript_id"]
names <- getGeneNames(ids, biomart_dt)
names

In [13]:
ids <- ids_df[smpl_rows, "gene_id"]
names <- getGeneNames(ids, biomart_dt, strip.version=FALSE)
names

In [14]:
ids <- ids_df[smpl_rows, "transcript_id"]
names <- getGeneNames(ids, biomart_dt, strip.version=FALSE)
names
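
With `strip.version=FALSE` the input must be consistently versioned or consistently unversioned. The check can be sketched as a small standalone function; `versionStatus()` is a hypothetical helper, and skipping NA values is an assumption of this sketch.

```R
# Hypothetical helper: classifies a vector of IDs by the presence of version suffixes.
versionStatus <- function(ids) {
    # Look for a dot in every non-NA ID.
    has_version <- grepl("\\.", ids[!is.na(ids)])
    if (!any(has_version)) {
        "unversioned"
    } else if (all(has_version)) {
        "versioned"
    } else {
        "mixed"
    }
}

s1 <- versionStatus(c("ENST00000373020.4", "ENST00000494424.1"))  # "versioned"
s2 <- versionStatus(c("ENST00000373020", "ENST00000494424"))      # "unversioned"
s3 <- versionStatus(c("ENST00000373020.4", "ENST00000494424"))    # "mixed"
print(c(s1, s2, s3))
```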

---
- Define a helper function `printResStats()` that calculates some basic statistics and, in particular, shows how many IDs cannot be found in the provided BioMart data.
- Test the `getGeneNames()` function for **all** `gene_id` values and the corresponding `transcript_id` values (196,520 each) using the default `strip.version` argument value (_TRUE_):

In [15]:
printResStats <- function(ids, names) {
    in_na  <- sum( is.na(ids) )
    out_na <- sum( is.na(names) )
    print( paste("Input vector length:", length(ids)) )
    print( paste("Output vector length:", length(names)) )
    print( paste("Input NA count:", in_na) )
    print( paste("Output NA count:", out_na) )
    print( paste("IDs not found:", out_na-in_na) )
}

In [16]:
ids <- ids_df[["gene_id"]]
names <- getGeneNames(ids, biomart_dt)

printResStats(ids, names)

[1] "Input vector length: 196520"
[1] "Output vector length: 196520"
[1] "Input NA count: 0"
[1] "Output NA count: 7714"
[1] "IDs not found: 7714"


In [17]:
ids <- ids_df[["transcript_id"]]
names <- getGeneNames(ids, biomart_dt)

printResStats(ids, names)

[1] "Input vector length: 196520"
[1] "Output vector length: 196520"
[1] "Input NA count: 0"
[1] "Output NA count: 17043"
[1] "IDs not found: 17043"
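
Beyond counting, it can be useful to see which IDs were not found. A minimal base R sketch follows, using toy vectors in place of the real input and output; the values and the variable names `toy_ids`, `toy_names` and `not_found` are made up for illustration.

```R
# Toy input IDs and the corresponding toy output (hypothetical values).
toy_ids   <- c("ENSG_A.1", "ENSG_B.2", "ENSG_B.2", NA, "ENSG_C.3")
toy_names <- c("GENE1", NA, NA, NA, "")

# IDs that were provided (not NA) but received NA in the output,
# i.e. IDs absent from the reference data; duplicates are reported once.
not_found <- unique(toy_ids[is.na(toy_names) & !is.na(toy_ids)])
print(not_found)   # "ENSG_B.2"
```

Note that the empty string returned for `ENSG_C.3` is not counted, since it marks an ID that exists but has no gene symbol.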
