# Function get_gene_names()
---
This function takes a list of Ensembl IDs of one kind (gene, transcript, protein or exon stable IDs, with or without versions) and returns a list of the gene names (HGNC symbols) associated with those IDs:

```Python
get_gene_names(ids, biomart_df, strip_version=True, gene_symbols=False)
```

This function accepts 4 arguments:
- `ids` - the input list of Ensembl IDs
- `biomart_df` - the input Ensembl BioMart data on Ensembl IDs, must be provided as a Pandas DataFrame
- `strip_version=True` - by default, version suffixes are stripped from the IDs; change to _False_ if ID versions are meant to be used
- `gene_symbols=False` - by default, the input list is interpreted as a list of valid Ensembl IDs, for which HGNC symbols (gene symbols) are returned; change to _True_ to use the function in the opposite direction, i.e. to obtain Gene stable IDs for input gene symbols
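
To illustrate what `strip_version` does, here is a minimal sketch using the same regular expression as the function body. The sample IDs are illustrative only, and the literal-dot check at the end is one possible way to detect versioned IDs, not necessarily the only one.

```python
import pandas as pd

# Illustrative input: one versioned ID, one unversioned ID, one missing value.
ids = pd.Series(['ENSG00000000003.10', 'ENST00000373020', None])

# Same pattern as in get_gene_names(): drop a trailing ".<version>" if present.
stripped = ids.str.replace(r'\.[^\.]+$', '', regex=True)

# A literal-dot test needs regex=False; as a regex, '.' matches any character.
has_version = ids.dropna().str.contains('.', regex=False)

print(stripped.tolist())      # versions removed, missing value stays missing
print(has_version.tolist())   # [True, False]
```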

The function returns a list, as described below:
- The list is of the same length as the input list, and subsequent values of both lists correspond to each other.
- If a given ID exists but is not linked to any gene symbol, an empty string is returned.
- If a given ID does not exist in the reference data, a nan value is returned.
- For every None/nan value in the input list, a nan value is returned.
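
The three kinds of output values can be told apart on the caller's side. The sketch below assumes a hypothetical helper, `classify_result()`, which is not part of this notebook; note that a nan in the output covers both an unknown ID and a missing input value.

```python
import math

def classify_result(value):
    """Hypothetical helper: label one element of the get_gene_names() output."""
    # nan (or None) means the ID was not found or the input value was missing.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 'not found or missing input'
    # An empty string means the ID exists but has no gene symbol attached.
    if value == '':
        return 'known ID without a symbol'
    return 'mapped'

# Illustrative output list, shaped like the results shown in this notebook.
example_output = ['MT-RNR1', float('nan'), '', 'CCL3L1, CCL3L3']
labels = [classify_result(v) for v in example_output]
print(labels)
```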

Data on IDs and HGNC symbols (BioMart data) must be obtained via [Ensembl BioMart](https://www.ensembl.org/biomart/martview) (see the next section) and provided as a Pandas DataFrame with unchanged column names.

Examples:
```Python
get_gene_names(["ENSP00000354687"], biomart_df)
# Output: ['MT-ND1']

get_gene_names(["RAVER2", "MT-RNR1"], biomart_df, gene_symbols=True)
# Output: ['ENSG00000162437', 'ENSG00000211459']
```

---

The data on IDs is already saved in the input directory as _`mart_export_ids.tsv.gz`_. It can be redownloaded in the following way:
- go to https://www.ensembl.org/biomart/martview
- as `-CHOOSE DATABASE-` select `Ensembl Genes 109`
- as `-CHOOSE DATASET-` select `Human Genes (GRCh38.p13)`
- go to `Attributes` and in the section `GENES:` select:
    - Gene stable ID
    - Gene stable ID version
    - Transcript stable ID
    - Transcript stable ID version
    - Protein stable ID
    - Protein stable ID version
    - Exon stable ID
- and in the section `EXTERNALS:` (subsection _External References_) select:
    - HGNC symbol
- deselect any attributes other than those listed above, in all sections, if any were preselected
- click the `Results` button
- in `Export all results to` choose `Compressed file (.gz)`, `TSV`
- check the option `Unique results only`
- click the `Go` button
- save the file in a location of your choosing
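
Since the function relies on unchanged BioMart column names, it may be worth checking a freshly downloaded export before using it. The following is a minimal sketch; `check_biomart_columns()` and `EXPECTED_COLUMNS` are hypothetical names, and a small in-memory TSV stands in for the real file.

```python
import io
import pandas as pd

# Column names as selected in the steps above.
EXPECTED_COLUMNS = [
    'Gene stable ID', 'Gene stable ID version',
    'Transcript stable ID', 'Transcript stable ID version',
    'Protein stable ID', 'Protein stable ID version',
    'Exon stable ID', 'HGNC symbol',
]

def check_biomart_columns(df):
    """Hypothetical helper: list the expected columns that are missing from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Stand-in for the exported file: a header and a single data row.
tsv_text = (
    '\t'.join(EXPECTED_COLUMNS) + '\n'
    'ENSG00000210049\tENSG00000210049.1\tENST00000387314\t'
    'ENST00000387314.1\t\t\tENSE00001544501\tMT-TF\n'
)
toy_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

missing = check_biomart_columns(toy_df)
print(missing)   # an empty list means all expected columns are present
```

For a real export, the same check would be applied to the DataFrame returned by `pd.read_csv()` for the downloaded `.gz` file.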

---

- Do the necessary imports:

In [1]:
import pandas as pd
from collections import defaultdict

- Create a default dictionary that maps Ensembl ID prefixes to BioMart column names and returns None for non-existing keys:

In [2]:
name_map = defaultdict(lambda: None, {
    'ENSG' : 'Gene stable ID',
    'ENST' : 'Transcript stable ID',
    'ENSP' : 'Protein stable ID',
    'ENSE' : 'Exon stable ID'
})
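
As a brief illustration of how such a default dictionary behaves, the sketch below uses a separate, shortened mapping named `prefix_map` (an illustrative name, so that it does not overwrite `name_map`). One detail worth knowing: indexing a `defaultdict` with a missing key stores that key, whereas `.get()` does not.

```python
from collections import defaultdict

# Shortened, illustrative mapping from ID prefixes to BioMart column names.
prefix_map = defaultdict(lambda: None, {
    'ENSG': 'Gene stable ID',
    'ENST': 'Transcript stable ID',
})

# The first four characters of an ID select the column.
known = prefix_map['ENSG00000211459'[0:4]]

# An unknown prefix yields None instead of raising KeyError ...
unknown = prefix_map['XYZ1']

# ... but indexing also stores the missing key as a side effect.
stored = 'XYZ1' in prefix_map

# .get() returns None as well and leaves the dictionary unchanged.
safe = prefix_map.get('ABCD')

print(known, unknown, stored, safe)
```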

- Define `get_gene_names()` function:

In [3]:
def get_gene_names(ids, biomart_df, strip_version=True, gene_symbols=False):
    '''Takes a list of Ensembl IDs of one kind
    (gene, transcript, protein or exon stable IDs, also ID versions),
    and returns a list with gene names (HGNC symbols) associated with those IDs.
    Arguments:
    ids           -- The input list of Ensembl IDs
    biomart_df    -- The input Ensembl BioMart data on Ensembl IDs, must be provided as Pandas DataFrame.
    strip_version -- By default True, version suffixes are stripped from IDs, change to False
                     if ID versions are meant to be used
    gene_symbols  -- By default False, the input list is interpreted as a list of valid Ensembl IDs,
                     for which HGNC symbols (gene symbols) are returned, change to True to use
                     this function in an opposite way, i.e. to obtain Gene stable IDs
                     for input gene symbols.
    Returns:
    gene_names    -- the list of corresponding gene names (HGNC symbols):
                     The list is of the same length as the input list, and subsequent values
                     of both lists correspond to each other.
                     If a given ID exists but is not linked to any gene symbol,
                     empty string value is returned.
                     If a given ID does not exist in the reference data, nan value is returned.
                     For every None/nan value in the input list, nan value is returned.
    Examples:
    get_gene_names(['ENSP00000354687'], biomart_df)
    # Output: ['MT-ND1']
    get_gene_names(['RAVER2', 'MT-RNR1'], biomart_df, gene_symbols=True)
    # Output: ['ENSG00000162437', 'ENSG00000211459']'''
    
    # If ids is not a list, throw an error.
    if type(ids) is not list:
        raise Exception('Input data "ids" is not a list!')

    # Convert the ids into Pandas Series
    ids = pd.Series(ids)

    # If strip_version is set to True (default), version suffixes
    # will be stripped, if present, from IDs (ids) values.
    if strip_version:
        ids = ids.str.replace(r'\.[^\.]+$', '', regex=True)

    # The src_col is a column in biomart_df that corresponds
    # to the type of provided ids.
    # Set the source column for biomart_df to 'HGNC symbol'.
    # It is preliminarily assumed that provided ids are gene symbols.
    src_col = 'HGNC symbol'

    # If gene_symbols is False, valid Ensembl IDs are expected.
    # Check the number of unique prefixes for ids.
    # If there is only one find out of which kind or throw an error.
    # If there is more than one, throw an error.
    if not gene_symbols:
        id_types = ids[ids.notna()].str[0:4].unique()
        if id_types.shape[0] == 1:
            src_col = name_map[ id_types[0] ]
            if src_col is None:
                raise Exception(f'Unknown Ensembl ID prefix: {id_types[0]}')
        else:
            raise Exception(
                'Mixed Ensembl ID types: ' + ' '.join(id_types) + '. ' +
                'Please provide IDs of the same type or set gene_symbols '  +
                'to True in order to use gene symbols.'
            )

    # The out_col is a column in biomart_df, values of which that
    # correspond to provided ids will be placed in the output list.
    # By default it is 'HGNC symbol', unless gene_symbols is set to True,
    # then it is 'Gene stable ID'.
    out_col = 'HGNC symbol' if not gene_symbols else 'Gene stable ID'

    # If versions are allowed, check whether all IDs have versions.
    # If everything is ok, add proper suffix to src_col, unless
    # it is 'Exon stable ID', which does not have version,
    # then again, throw an error.
    if not strip_version:
        # Match a literal dot (regex=False) and skip missing values.
        contains_dot = ids.dropna().str.contains('.', regex=False)
        if any(contains_dot):
            if all(contains_dot):
                if src_col != 'Exon stable ID':
                    src_col += ' version'
                else:
                    raise Exception(
                        'Exon stable IDs cannot be used with version designation.'
                    )
            else:
                raise Exception(
                    'Input data contains Ensembl IDs mixed with IDs with versions. ' +
                    'Use only one or another, or set strip_version to True.'
                )

    # Get unique biomart_df rows for src_col and out_col
    sub_df = biomart_df[[src_col, out_col]].drop_duplicates()

    # Group the resulting DataFrame by the column src_col,
    # collect the values of the column out_col in each group into a list,
    # and finally join them into one string.
    # In short, when one ID is mapped to more than one gene symbol,
    # that ID will be mapped to one string of concatenated symbols.
    groupped_df = sub_df.fillna('').groupby(src_col).agg(list)
    groupped_df[out_col] = groupped_df[out_col].str.join(', ')

    # Wrap ids up into a one-column Pandas DataFrame ids_df.
    # Merge ids_df with groupped_df on src_col.
    # The left merge preserves the order of IDs as
    # it is provided in ids_df.
    ids_df = ids.to_frame(name=src_col)
    selected_df = ids_df.merge(groupped_df, how='left', on=src_col)

    # Pick up and return the vector of gene names.
    gene_names = list(selected_df[out_col])
    return gene_names
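
The core of the function, grouping the reference data and left-merging the input IDs onto it, can be tried in isolation. The sketch below repeats those steps on a toy DataFrame; all IDs and symbols in it are made up for illustration.

```python
import pandas as pd

# Toy reference data: one ID with two symbols, one with a single symbol,
# and one without any symbol. All values are made up.
toy_df = pd.DataFrame({
    'Gene stable ID': ['ENSG01', 'ENSG01', 'ENSG02', 'ENSG03'],
    'HGNC symbol':    ['A1',     'A2',     'B',      None],
})

# Unique (ID, symbol) pairs, then one comma-separated string per ID.
sub_df = toy_df[['Gene stable ID', 'HGNC symbol']].drop_duplicates()
grouped = sub_df.fillna('').groupby('Gene stable ID').agg(list)
grouped['HGNC symbol'] = grouped['HGNC symbol'].str.join(', ')

# The left merge keeps the input order; unknown and missing IDs become nan.
ids = pd.Series(['ENSG02', None, 'ENSG01', 'ENSG03', 'ENSG99'])
ids_df = ids.to_frame(name='Gene stable ID')
result = ids_df.merge(grouped, how='left', on='Gene stable ID')

names = list(result['HGNC symbol'])
print(names)   # ['B', nan, 'A1, A2', '', nan]
```

Because the grouped table has exactly one row per ID, the left merge cannot add rows, so the output stays aligned with the input list.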

---
# Test get_gene_names() function
___
- Load the BioMart data into a data frame and look at the first 5 rows.

In [4]:
biomart_df = pd.read_csv('input/mart_export_ids.tsv.gz', sep='\t')
biomart_df.head()

|   | Gene stable ID | Gene stable ID version | Transcript stable ID | Transcript stable ID version | Protein stable ID | Protein stable ID version | Exon stable ID | HGNC symbol |
|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000210049 | ENSG00000210049.1 | ENST00000387314 | ENST00000387314.1 |  |  | ENSE00001544501 | MT-TF |
| 1 | ENSG00000211459 | ENSG00000211459.2 | ENST00000389680 | ENST00000389680.2 |  |  | ENSE00001544499 | MT-RNR1 |
| 2 | ENSG00000210077 | ENSG00000210077.1 | ENST00000387342 | ENST00000387342.1 |  |  | ENSE00001544498 | MT-TV |
| 3 | ENSG00000210082 | ENSG00000210082.2 | ENST00000387347 | ENST00000387347.2 |  |  | ENSE00001544497 | MT-RNR2 |
| 4 | ENSG00000209082 | ENSG00000209082.1 | ENST00000386347 | ENST00000386347.1 |  |  | ENSE00002006242 | MT-TL1 |


---
### Use get_gene_names() function using an exemplary vector of several values

---
- A vector of exemplary Ensembl IDs (gene stable IDs). It also contains one missing value (`None`).

In [5]:
exmpl_ids = ['ENSG00000211459', None, 'ENSG00000179546', 'ENSG00000210049',
             'ENSG00000289881', 'ENSG00000276085', 'ENSG00000277336']

- Pass a single value only:

In [6]:
get_gene_names([exmpl_ids[0]], biomart_df)

['MT-RNR1']

- Pass the whole vector:

In [7]:
get_gene_names(exmpl_ids, biomart_df)

['MT-RNR1', nan, 'HTR1D', 'MT-TF', '', 'CCL3L1, CCL3L3', 'CCL3L1, CCL3L3']

- A test for a vector of exemplary HGNC symbols. It also contains one missing value (`None`) and one incorrect value, _HEYHEY_. If gene symbols are used, this must be indicated explicitly (with `gene_symbols=True`), which turns off the automatic Ensembl ID type detection.

In [8]:
gene_symbols = ['MT-RNR1', None, 'HEYHEY', 'CCL3L1', 'CCL3L3']

In [9]:
get_gene_names(gene_symbols, biomart_df, gene_symbols=True)

['ENSG00000211459',
 nan,
 nan,
 'ENSG00000277796, ENSG00000277768, ENSG00000277336, ENSG00000276085',
 'ENSG00000277796, ENSG00000277768, ENSG00000277336, ENSG00000276085']

---
### Use get_gene_names() function using a larger dataset

---
- Load the dataset into a data frame, then look at its dimensions and the first two rows:

In [10]:
ids_df = pd.read_csv('input/test_ids.tsv.gz', sep='\t')

display(ids_df.shape)
ids_df.head(2)

(196520, 2)

|   | gene_id | transcript_id |
|---|---|---|
| 0 | ENSG00000000003.10 | ENST00000373020.4 |
| 1 | ENSG00000000003.10 | ENST00000494424.1 |


---
- Test the function for **10** random `gene_id` values and the corresponding `transcript_id` values, with different values of the `strip_version` argument:

In [11]:
ids = list( ids_df['gene_id'].sample(10, random_state=1) )
names = get_gene_names(ids, biomart_df)
names

[nan,
 'AFG3L1P',
 'NELFA',
 'APOL1',
 'IKBKB-DT',
 'PLCB1-IT1',
 'AREG',
 nan,
 '',
 'MGST1']

In [12]:
ids = list( ids_df['transcript_id'].sample(10, random_state=1) )
names = get_gene_names(ids, biomart_df)
names

[nan,
 'AFG3L1P',
 'NELFA',
 'APOL1',
 'IKBKB-DT',
 'PLCB1-IT1',
 'AREG',
 nan,
 '',
 'MGST1']

In [13]:
ids = list( ids_df['gene_id'].sample(10, random_state=1) )
names = get_gene_names(ids, biomart_df, strip_version=False)
names

[nan, nan, nan, nan, nan, 'PLCB1-IT1', nan, nan, '', nan]

In [14]:
ids = list( ids_df['transcript_id'].sample(10, random_state=1) )
names = get_gene_names(ids, biomart_df, strip_version=False)
names

[nan, nan, nan, nan, nan, 'PLCB1-IT1', 'AREG', nan, '', nan]

---
- Define a helper function `print_res_stats()` that calculates some basic stats and, among other things, shows how many IDs cannot be found in the provided BioMart data.
- Test the `get_gene_names()` function for **all** `gene_id` values and the corresponding `transcript_id` values (~200k) using the default `strip_version` argument value (_True_):

In [15]:
def print_res_stats(ids, names):
    in_na  = pd.isna(ids).sum()
    out_na = pd.isna(names).sum()
    print(f'Input vector length: {len(ids)}')
    print(f'Output vector length: {len(names)}')
    print(f'Input NA count: {in_na}')
    print(f'Output NA count: {out_na}')
    print(f'IDs not found: {out_na-in_na}')
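
Beyond counting, it can be useful to see which IDs were not found. The sketch below assumes a hypothetical helper, `find_missing_ids()`, that pairs each input ID with its result; the toy lists are illustrative only.

```python
import math

def is_missing(value):
    """Return True for None or a float nan."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def find_missing_ids(ids, names):
    """Hypothetical helper: input IDs that were provided but not mapped."""
    return [i for i, n in zip(ids, names)
            if not is_missing(i) and is_missing(n)]

# Toy input and output lists of equal length.
toy_ids   = ['ENSG00000211459', None, 'ENSG_UNKNOWN']
toy_names = ['MT-RNR1', float('nan'), float('nan')]

not_found = find_missing_ids(toy_ids, toy_names)
print(not_found)   # ['ENSG_UNKNOWN']
```

Missing input values are excluded here, which mirrors the `out_na-in_na` difference reported by `print_res_stats()`.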

In [16]:
ids   = list( ids_df['gene_id'] )
names = get_gene_names(ids, biomart_df)

print_res_stats(ids, names)

Input vector length: 196520
Output vector length: 196520
Input NA count: 0
Output NA count: 7714
IDs not found: 7714


In [17]:
ids   = list( ids_df['transcript_id'] )
names = get_gene_names(ids, biomart_df)

print_res_stats(ids, names)

Input vector length: 196520
Output vector length: 196520
Input NA count: 0
Output NA count: 17043
IDs not found: 17043
