# Hands on: Introduction to data annotation using identifiers

In [1]:
!pip install -q -r requirements.txt

## 1. Example data set 3

![image.png](attachment:image.png)

- Experimental factors:
  - cell lines (tabs): LOXIMVI, COLO858, etc.
  - drugs (rows): AZ628, Selumetinib, Vemurafenib, etc.
  - antibodies for protein abundance/PTM status (columns): MEK, ERK, p90RSK, etc.

In [21]:
import pandas as pd

DATA_PATH = '../data/fallahi/fallahi_data.xlsx'

sheets = pd.read_excel(DATA_PATH, sheet_name=None, skiprows=2)

In [22]:
sheets

{'TableS1':                                            Unnamed: 0
 0   Table S1A. RPPA data for C32 cell line imaged ...
 1   Table S1B. RPPA data for COLO858 cell line ima...
 2   Table S1C. RPPA data for K2 cell line imaged o...
 3   Table S1D. RPPA data for LOXIMVI cell line ima...
 4   Table S1E. RPPA data for MMACSF cell line imag...
 5   Table S1F. RPPA data for MZ7MEL cell line imag...
 6   Table S1G. RPPA data for RVH421 cell line imag...
 7   Table S1H. RPPA data for SKMEL28 cell line ima...
 8   Table S1I. RPPA data for WM115 cell line image...
 9   Table S1J. RPPA data for WM1552C cell line ima...
 10                                                NaN
 11   Median/Std have been split into separate tables.,
 'C32':         AZ628   1  0.0032  0.22484  0.10692  NaN  0.01074   0.24791    0.3933  \
 0       AZ628   1  0.0100 -0.16601 -0.19265  NaN -0.31682  0.074960  0.233410   
 1       AZ628   1  0.0316 -0.64244 -1.14610  NaN -0.40092  0.040274  0.095077   
 2       AZ628   1  

In [23]:
sheet_names = list(sheets)
sheet_names

['TableS1',
 'C32',
 'C32-std',
 'COLO858',
 'COLO858-std',
 'K2',
 'K2-std',
 'LOXIMVI',
 'LOXIMVI-std',
 'MMACSF',
 'MMACSF-std',
 'MZ7MEL',
 'MZ7MEL-std',
 'RVH421',
 'RVH421-std',
 'SKMEL28',
 'SKMEL28-std',
 'WM115',
 'WM115-std',
 'WM1552C',
 'WM1552C-std']

In [24]:
cell_lines = [s for s in sheet_names if 'Table' not in s and '-std' not in s]
cell_lines

['C32',
 'COLO858',
 'K2',
 'LOXIMVI',
 'MMACSF',
 'MZ7MEL',
 'RVH421',
 'SKMEL28',
 'WM115',
 'WM1552C']
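
Because the workbook stores standard deviations in separate `-std` sheets, it can be handy to pair each cell line with its companion sheet. The following is a minimal sketch, assuming the `<name>-std` naming seen in the sheet list above; the names are hard-coded (and abbreviated to three cell lines) so the snippet runs without the Excel file, and `example_sheet_names` and `std_sheet_for` are illustrative names.

```python
# Sheet names as listed above, abbreviated to three cell lines for brevity
example_sheet_names = ['TableS1', 'C32', 'C32-std', 'COLO858', 'COLO858-std', 'K2', 'K2-std']

STD_SUFFIX = '-std'

# Map each data sheet to its standard-deviation sheet, if one exists
std_sheet_for = {
    name: name + STD_SUFFIX
    for name in example_sheet_names
    if not name.endswith(STD_SUFFIX) and name + STD_SUFFIX in example_sheet_names
}
print(std_sheet_for)
```

The description sheet `TableS1` drops out automatically because it has no `-std` companion.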

## How can we normalize text names to identifiers?

Gilda is a Python package and REST service that grounds (i.e., finds appropriate identifiers in namespaces for) named entities in biomedical text.

Gyori BM, Hoyt CT, Steppi A (2022). Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances, 2022; vbac034 https://doi.org/10.1093/bioadv/vbac034.

In [25]:
import gilda

In [26]:
for cell_line in cell_lines:
    matches = gilda.ground(cell_line)
    if matches:
        print('%s: %s' % (cell_line, matches[0].term))
    else:
        print('%s could not be grounded' % cell_line)

C32: Term(c32,C32,EFO,0006364,C32,name,efo,None,None,None)
COLO858 could not be grounded
K2: Term(k2,K2,HGNC,13686,RBPJP3,synonym,hgnc,9606,None,None)
LOXIMVI: Term(loximvi,LOXIMVI,EFO,0006284,LOXIMVI,name,efo,None,None,None)
MMACSF could not be grounded
MZ7MEL could not be grounded
RVH421 could not be grounded
SKMEL28: Term(skmel28,SK-MEL-28,EFO,0003081,SK-MEL-28,name,efo,None,None,None)
WM115: Term(wm115,WM115,EFO,0002390,WM115,name,efo,None,None,None)
WM1552C could not be grounded


Let's resolve LOXIMVI using its ID: https://bioregistry.io/EFO:0006284

Many cell lines couldn't be grounded using the default Gilda instance. Gilda can be customized with additional resources such as Cellosaurus, which contains a more systematic catalogue of cell lines. For those interested, see: https://github.com/gyorilab/gilda/blob/master/notebooks/custom_grounders.ipynb
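
The grounding results above show two kinds of problems: names with no match at all, and names whose top match falls in an unexpected namespace (K2 is matched to an HGNC gene entry rather than a cell line). One way to keep track of both is to triage the top matches by namespace. The sketch below hard-codes the results printed above so that it runs without Gilda, and it assumes, as a simplification, that only EFO hits are acceptable for cell lines. `EXPECTED_NAMESPACES` and `triage` are illustrative names, not part of Gilda.

```python
# Top match per cell line as (namespace, identifier), copied from the output above;
# None means Gilda returned no match.
top_matches = {
    'C32': ('EFO', '0006364'),
    'COLO858': None,
    'K2': ('HGNC', '13686'),
    'LOXIMVI': ('EFO', '0006284'),
    'MMACSF': None,
    'MZ7MEL': None,
    'RVH421': None,
    'SKMEL28': ('EFO', '0003081'),
    'WM115': ('EFO', '0002390'),
    'WM1552C': None,
}

# Assumption: for cell lines we only trust hits from these namespaces
EXPECTED_NAMESPACES = {'EFO'}


def triage(matches, expected):
    """Split names into accepted, suspicious, and ungrounded groups."""
    accepted, suspicious, ungrounded = {}, {}, []
    for name, match in matches.items():
        if match is None:
            ungrounded.append(name)
        elif match[0] in expected:
            accepted[name] = f'{match[0]}:{match[1]}'
        else:
            suspicious[name] = f'{match[0]}:{match[1]}'
    return accepted, suspicious, ungrounded


accepted, suspicious, ungrounded = triage(top_matches, EXPECTED_NAMESPACES)
print(suspicious)   # matches that need manual review
print(ungrounded)   # names with no match
```

Entries in `suspicious` and `ungrounded` are candidates for manual curation or for a custom grounder.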

In [28]:
df = sheets['LOXIMVI']
df

Unnamed: 0,Drug,Time (hr),Concentration (uM),pMEK(S217/221),pERK(T202/Y204),p-p90RSK(S380),p-p90RSK(T573),p-AKT(T308),p-AKT(S473),p-mTOR(S2448),...,p-JNK(T183/Y185),Total c-Jun,p-cJun(S63),p-P38(T180/Y182),p-HSP27(S82),p-NFKB(S536),Bim,cPARP,p-Histone H3(S10),p27 Kip1
0,AZ628,1,0.0032,0.79671,1.92250,0.693510,0.95871,,1.030700,1.294000,...,1.601000,1.19060,1.61060,1.731500,1.524300,0.776550,1.80920,1.44660,0.23324,1.84360
1,AZ628,1,0.0100,0.34613,2.06110,0.653200,1.00940,,1.343700,1.428700,...,1.775200,1.28160,1.82370,1.725300,1.261200,0.985020,2.07280,1.61360,0.44360,1.40240
2,AZ628,1,0.0316,-0.17426,1.47580,0.531690,0.87866,,1.477500,1.341200,...,1.716700,1.26090,1.82310,1.437700,1.143800,0.970360,2.22710,1.61820,0.80767,1.17420
3,AZ628,1,0.1000,-0.53714,0.77338,0.631350,1.06920,,1.483400,1.035600,...,1.411100,1.14720,1.92130,1.048600,1.356800,0.845850,2.19750,1.62300,0.79431,1.39630
4,AZ628,1,0.3160,-1.28310,-0.48519,0.644720,0.84908,,1.605400,1.031200,...,1.330300,0.74568,1.81040,-0.269690,0.991100,0.718900,1.60500,1.25360,0.47505,1.76480
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170,SB590885,48,0.0316,-0.84192,-0.93416,0.231260,-0.65540,,0.514670,0.365790,...,0.219610,-0.56826,-0.45860,-0.046925,-0.033998,0.116620,0.42773,0.29036,0.31432,-0.62735
171,SB590885,48,0.1000,-1.74070,-1.60260,0.049010,-0.48014,,0.534460,0.018717,...,0.031661,-0.65894,-0.70742,-0.020500,-0.120410,0.019411,0.36881,0.17516,0.16177,-0.33856
172,SB590885,48,0.3160,-1.86090,-1.41040,0.300630,-0.32317,,0.461400,0.476280,...,0.421190,-0.62106,-0.30036,0.085235,0.433510,0.284350,0.89159,0.42707,0.25295,0.40443
173,SB590885,48,1.0000,-2.37280,-1.59910,0.235590,-0.51840,,0.013754,0.308200,...,0.213210,-1.04860,-0.44825,-0.183070,0.185150,0.420750,0.63868,0.17904,-0.14101,-0.27232


In [29]:
drugs = set(df.Drug)
drugs

{'AZ628', 'PLX4720', 'SB590885', 'Selumetinib', 'Vemurafenib'}

In [30]:
for drug in drugs:
    matches = gilda.ground(drug)
    if matches:
        print('%s: %s' % (drug, matches[0].term))
    else:
        print('%s could not be grounded' % drug)

SB590885: Term(sb590885,SB-590885,CHEBI,CHEBI:131882,SB-590885,name,chebi,None,None,None)
AZ628: Term(az628,AZ-628,MESH,C000592454,AZ-628,name,mesh,None,None,None)
PLX4720: Term(plx4720,PLX-4720,CHEBI,CHEBI:90295,PLX-4720,name,chebi,None,None,None)
Vemurafenib: Term(vemurafenib,Vemurafenib,CHEBI,CHEBI:63637,vemurafenib,curated,famplex,None,None,None)
Selumetinib: Term(selumetinib,selumetinib,CHEBI,CHEBI:90227,selumetinib,name,chebi,None,None,None)


In [35]:
import bioregistry

Any of these IDs can be resolved through the Bioregistry, e.g., https://bioregistry.io/CHEBI:131882. The Bioregistry Python package can also show which IRI an ID resolves to by default:

In [37]:
bioregistry.get_default_iri('CHEBI', '131882')

'http://purl.obolibrary.org/obo/CHEBI_131882'
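
Conceptually, resolving an identifier means expanding a compact identifier (CURIE) of the form `prefix:local_id` into an IRI by appending the local ID to a per-namespace URI prefix. The following is a minimal sketch of that idea rather than Bioregistry's implementation; the `URI_PREFIXES` table and the `expand_curie` helper are hand-written assumptions, and the table contains only the ChEBI pattern visible in the output above.

```python
# Hand-written stand-in for a registry: prefix -> URI prefix.
# Only the ChEBI entry is taken from the output above.
URI_PREFIXES = {
    'chebi': 'http://purl.obolibrary.org/obo/CHEBI_',
}


def expand_curie(curie):
    """Expand 'prefix:local_id' into an IRI, or return None for unknown prefixes."""
    prefix, sep, local_id = curie.partition(':')
    if not sep or not local_id:
        raise ValueError(f'not a CURIE: {curie!r}')
    uri_prefix = URI_PREFIXES.get(prefix.lower())
    return uri_prefix + local_id if uri_prefix else None


print(expand_curie('CHEBI:131882'))
```

A real registry additionally handles prefix synonyms, multiple providers per namespace, and identifier validation.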

## 2. Example Dataset 1

We use an example dataset produced from an MSstats differential abundance analysis.  This dataset is a small molecule dataset with known inhibition targets.  It includes 8 small molecule inhibitors and a control DMSO holdout. 

In [40]:
DATA_PATH = "ProteinLevelData.csv"
input_data = pd.read_csv(DATA_PATH)
input_data

Unnamed: 0,RUN,Protein,LogIntensities,originalRUN,GROUP,SUBJECT,TotalGroupMeasurements,NumMeasuredFeature,MissingPercentage,more50missing,NumImputedFeature
0,1,1433B_HUMAN,12.873423,230719_THP-1_Chrom_end2end_Plate1_DMSO_A02_DIA,DMSO,2,1210,10,0.0,False,0
1,2,1433B_HUMAN,12.866217,230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA,DMSO,5,1210,10,0.0,False,0
2,3,1433B_HUMAN,12.686827,230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA,DMSO,10,1210,10,0.0,False,0
3,4,1433B_HUMAN,12.625462,230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA,DMSO,12,1210,10,0.0,False,0
4,5,1433B_HUMAN,12.538365,230719_THP-1_Chrom_end2end_Plate1_DMSO_B01_DIA,DMSO,13,1210,10,0.0,False,0
...,...,...,...,...,...,...,...,...,...,...,...
1189821,266,ZZZ3_HUMAN,10.384438,230719_THP-1_Chrom_end2end_Plate3_DMSO_A10,VTP50469,202,170,10,0.0,False,0
1189822,267,ZZZ3_HUMAN,10.231615,230719_THP-1_Chrom_end2end_Plate3_DMSO_B03,VTP50469,207,170,10,0.0,False,0
1189823,268,ZZZ3_HUMAN,10.502691,230719_THP-1_Chrom_end2end_Plate3_DbET6_C07,VTP50469,223,170,10,0.0,False,0
1189824,269,ZZZ3_HUMAN,10.674776,230719_THP-1_Chrom_end2end_Plate3_DMSO_C11,VTP50469,227,170,10,0.0,False,0


### Experimental Factors:
| Treatment    | Nominal target |
| :-------- | :------- |
| DMSO  | Control    |
| VTP50469  | MEN1    |
| PF477736 | Chk1    |
| Jakafi    | JAK1/2    |
| K-975  | TEAD1   |
| VE-821 | ATR    |
| dBET6    | BRD2/3/4   |


Our first goal is to make this data set interoperable and connected to other data sets and surrounding knowledge. This means we must normalize the experimental factors to identifiers.

## How can we normalize text names to identifiers?

In [41]:
import gilda

We can ground each drug name using Gilda.

In [42]:
gilda.ground('PF-477736')[0].term

Term(pf477736,PF-477736,CHEBI,CHEBI:91385,PF-00477736,synonym,chebi,None,None,None)

In [43]:
gilda.ground('Jakafi')[0].term

Term(jakafi,Jakafi,CHEBI,CHEBI:66917,ruxolitinib phosphate,synonym,chebi,None,None,None)

We can also ground target names. For example, Chk1 is grounded to the HGNC gene symbol CHEK1.

In [44]:
gilda.ground('Chk1')[0].term

Term(chk1,Chk1,HGNC,1925,CHEK1,curated,famplex,9606,None,None)
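
The groundings obtained so far can be collected into a small annotation table keyed by the treatment labels used in the dataset. This is a sketch with identifiers copied from the outputs above; treatments and targets that have not been grounded in this notebook are left as `None` rather than guessed, and the `annotations` layout is just one possible choice.

```python
# Treatment label -> (treatment CURIE, nominal target label, target CURIE).
# CURIEs are copied from the Gilda outputs above ('PF-477736' was the text grounded
# for the PF477736 treatment); None marks entries not grounded in this notebook.
annotations = {
    'PF477736': ('CHEBI:91385', 'Chk1', 'HGNC:1925'),
    'Jakafi': ('CHEBI:66917', 'JAK1/2', None),
    'VTP50469': (None, 'MEN1', None),
    'K-975': (None, 'TEAD1', None),
    'VE-821': (None, 'ATR', None),
    'dBET6': (None, 'BRD2/3/4', None),
}

# List what still needs grounding
missing_drugs = [drug for drug, (curie, _, _) in annotations.items() if curie is None]
missing_targets = [target for _, target, curie in annotations.values() if curie is None]
print(missing_drugs)
print(missing_targets)
```

Compound targets such as `JAK1/2` or `BRD2/3/4` would need to be split into individual genes, or mapped to a family-level entry, before grounding.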

Gilda also offers a REST service that accepts POST requests with a JSON body on the /ground endpoint. A public instance of the service is running at http://grounding.indra.bio.

In [45]:
import requests
res = requests.post('http://grounding.indra.bio/ground', json={'text': 'Jakafi'})
res.json()

[{'match': {'cap_combos': [],
   'dash_mismatches': [],
   'exact': True,
   'query': 'Jakafi',
   'ref': 'Jakafi',
   'space_mismatch': False},
  'score': 0.5555555555555556,
  'term': {'db': 'CHEBI',
   'entry_name': 'ruxolitinib phosphate',
   'id': 'CHEBI:66917',
   'norm_text': 'jakafi',
   'source': 'chebi',
   'status': 'synonym',
   'text': 'Jakafi'},
  'url': 'https://identifiers.org/CHEBI:66917'}]
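
The response is a list of scored matches, each carrying a `term` with its namespace (`db`), identifier, and standard name. A sketch for pulling out the best match, assuming the field names shown in the output above: the response is pasted in as a string (trimmed to the fields used) so that the snippet runs offline, and `best_match` is a hypothetical helper, not part of Gilda.

```python
import json

# Response body copied from the output above, trimmed to the fields used here
response_text = '''
[{"match": {"exact": true, "query": "Jakafi", "ref": "Jakafi"},
  "score": 0.5555555555555556,
  "term": {"db": "CHEBI", "entry_name": "ruxolitinib phosphate",
           "id": "CHEBI:66917", "status": "synonym", "text": "Jakafi"},
  "url": "https://identifiers.org/CHEBI:66917"}]
'''


def best_match(scored_matches):
    """Return (db, id, entry_name, score) for the highest-scoring match, or None."""
    if not scored_matches:
        return None
    top = max(scored_matches, key=lambda m: m['score'])
    term = top['term']
    return term['db'], term['id'], term['entry_name'], top['score']


print(best_match(json.loads(response_text)))
```

With a live request, `res.json()` could be passed to `best_match` in place of the parsed string.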

The Bioregistry can convert a CURIE to its canonical form with the normalize_curie() function, which normalizes the prefix and removes redundant namespaces embedded in local unique identifiers (LUIs).

In [46]:
# CHEBIID is an alternative prefix for chebi
bioregistry.normalize_curie('CHEBIID:66917')

'chebi:66917'
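
To make the two normalization steps concrete, here is a simplified sketch of the idea. It is not Bioregistry's implementation: `PREFIX_SYNONYMS` is a hand-written assumption holding only the synonym from the example above, and `normalize_curie_sketch` is a hypothetical helper. The second call shows the "redundant namespace" case, which arises when a namespace such as `CHEBI` is concatenated with an ID that already embeds the prefix (as in Gilda's `CHEBI:66917`).

```python
# Assumption: a tiny synonym table; only CHEBIID -> chebi comes from the example above
PREFIX_SYNONYMS = {'chebiid': 'chebi'}


def normalize_curie_sketch(curie):
    """Canonicalize the prefix and drop a prefix repeated inside the local ID."""
    prefix, _, local_id = curie.partition(':')
    prefix = prefix.lower()
    prefix = PREFIX_SYNONYMS.get(prefix, prefix)
    # e.g. 'CHEBI:CHEBI:66917' embeds the namespace in the local ID
    redundant = prefix + ':'
    if local_id.lower().startswith(redundant):
        local_id = local_id[len(redundant):]
    return f'{prefix}:{local_id}'


print(normalize_curie_sketch('CHEBIID:66917'))
print(normalize_curie_sketch('CHEBI:CHEBI:66917'))
```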

We can also check whether an ID is a valid identifier within a particular namespace. The example below uses the bioregistry package to check whether each ID is a valid UniProt ID.

In [47]:
print(bioregistry.is_valid_identifier('uniprot', '1433B_HUMAN')) # uniprot mnemonic ID
print(bioregistry.is_valid_identifier('uniprot', 'YWHAB')) # corresponding gene name
print(bioregistry.is_valid_identifier('uniprot', 'P31946')) # corresponding uniprot ID

False
False
True
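
Validation of this kind typically comes down to matching the ID against a regular expression registered for the namespace. The sketch below uses the accession number format documented by UniProt; the pattern Bioregistry actually applies for `uniprot` may be broader (for example, to admit isoform suffixes), so treat this as an illustration. `looks_like_uniprot_accession` is a hypothetical helper name.

```python
import re

# UniProt's documented accession format (6 or 10 characters)
UNIPROT_ACCESSION = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def looks_like_uniprot_accession(identifier):
    """Return True if the whole string matches the accession pattern."""
    return UNIPROT_ACCESSION.fullmatch(identifier) is not None


for candidate in ['1433B_HUMAN', 'YWHAB', 'P31946']:
    print(candidate, looks_like_uniprot_accession(candidate))
```

A pattern match only shows that an ID is well formed; it does not guarantee that the entry exists in UniProt.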


The Bioregistry can also provide an IRI at which information about the identifier can be accessed.

In [48]:
bioregistry.get_default_iri('uniprot', 'P31946')

'https://purl.uniprot.org/uniprot/P31946'

## 3. Identifier mapping

There are many overlapping databases for proteins, small molecules, etc., and even within the same database like UniProt, people refer to an entry using different types of IDs or names. We need to be able to map between these. 

This dataset uses UniProt mnemonics to identify proteins. Let's see how we can map these to UniProt identifiers, and also to the corresponding human gene ID and symbol.

In [49]:
from indra.databases import uniprot_client

In [50]:
uniprot_client.get_id_from_mnemonic('1433B_HUMAN')

'P31946'

In [53]:
uniprot_client.get_hgnc_id('P31946')

'12849'

In [54]:
uniprot_client.get_gene_name('P31946')

'YWHAB'
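
The three lookups above can be chained into one helper that records where a mapping fails. This is a sketch: `map_mnemonic` is a hypothetical helper that takes the lookup functions as arguments, and the dictionaries below are offline stand-ins holding only the single example from above, so the snippet runs without INDRA.

```python
def map_mnemonic(mnemonic, to_uniprot, to_hgnc, to_symbol):
    """Chain mnemonic -> UniProt ID -> (HGNC ID, gene symbol); None where a step fails."""
    uniprot_id = to_uniprot(mnemonic)
    if not uniprot_id:
        return {'mnemonic': mnemonic, 'uniprot': None, 'hgnc': None, 'symbol': None}
    return {
        'mnemonic': mnemonic,
        'uniprot': uniprot_id,
        'hgnc': to_hgnc(uniprot_id),
        'symbol': to_symbol(uniprot_id),
    }


# Offline stand-ins holding only the single example from above
mnemonic_to_id = {'1433B_HUMAN': 'P31946'}
id_to_hgnc = {'P31946': '12849'}
id_to_symbol = {'P31946': 'YWHAB'}

print(map_mnemonic('1433B_HUMAN', mnemonic_to_id.get, id_to_hgnc.get, id_to_symbol.get))
print(map_mnemonic('NOT_A_REAL_ENTRY', mnemonic_to_id.get, id_to_hgnc.get, id_to_symbol.get))
```

With INDRA installed, `uniprot_client.get_id_from_mnemonic`, `uniprot_client.get_hgnc_id`, and `uniprot_client.get_gene_name` could be passed in place of the dictionary lookups, assuming they return `None` (or another empty value) for inputs they cannot map.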

## Exercise

Iterate over all the proteins in the data set and find the corresponding UniProt ID, then map that to an HGNC ID. Print the results or store them in a data structure such as a dictionary or data frame. Do you notice any issues, such as proteins that cannot be mapped correctly?