# Hands on: Introduction to data annotation using identifiers

In [67]:
!pip install -q -r requirements.txt

## 1. Introduction

### Example Dataset

We use an example dataset produced by an MSstats differential abundance analysis. It profiles small-molecule treatments with known inhibition targets: 8 small-molecule inhibitors plus a DMSO control.

In [3]:
import pandas as pd

DATA_PATH = "dataProcessOutput.csv"

def import_data(filename):
    pandas_df = pd.read_csv(filename)
    return pandas_df

input_data = import_data(DATA_PATH)
input_data

Unnamed: 0,RUN,Protein,LogIntensities,originalRUN,GROUP,SUBJECT,TotalGroupMeasurements,NumMeasuredFeature,MissingPercentage,more50missing,NumImputedFeature
0,1,1433B_HUMAN,12.873423,230719_THP-1_Chrom_end2end_Plate1_DMSO_A02_DIA,DMSO,2,1210,10,0.0,False,0
1,2,1433B_HUMAN,12.866217,230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA,DMSO,5,1210,10,0.0,False,0
2,3,1433B_HUMAN,12.686827,230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA,DMSO,10,1210,10,0.0,False,0
3,4,1433B_HUMAN,12.625462,230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA,DMSO,12,1210,10,0.0,False,0
4,5,1433B_HUMAN,12.538365,230719_THP-1_Chrom_end2end_Plate1_DMSO_B01_DIA,DMSO,13,1210,10,0.0,False,0
...,...,...,...,...,...,...,...,...,...,...,...
1189821,266,ZZZ3_HUMAN,10.384438,230719_THP-1_Chrom_end2end_Plate3_DMSO_A10,VTP50469,202,170,10,0.0,False,0
1189822,267,ZZZ3_HUMAN,10.231615,230719_THP-1_Chrom_end2end_Plate3_DMSO_B03,VTP50469,207,170,10,0.0,False,0
1189823,268,ZZZ3_HUMAN,10.502691,230719_THP-1_Chrom_end2end_Plate3_DbET6_C07,VTP50469,223,170,10,0.0,False,0
1189824,269,ZZZ3_HUMAN,10.674776,230719_THP-1_Chrom_end2end_Plate3_DMSO_C11,VTP50469,227,170,10,0.0,False,0


### Experimental Factors:
| Treatment    | Target |
| :-------- | :------- |
| DMSO  | Control    |
| VTP50469  | MEN1    |
| PF477736 | Chk1    |
| Jakafi    | JAK1/2    |
| K-975  | TEAD1   |
| VE-821 | ATR    |
| dBET6    | BRD2/3/4   |

- Proteins: 1433B_HUMAN, ZZZ3_HUMAN, etc.
- Protein Abundance (LogIntensities)


Our first goal is to make this dataset interoperable and connected to other datasets and surrounding knowledge. This means we must normalize the experimental factors (proteins and treatments) to identifiers.
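Because the table has over a million rows but comparatively few distinct proteins and treatments, it is usually enough to normalize each distinct value once. The following is a minimal sketch using only the standard library and a few rows copied from the output above; the `unique_values` helper is a hypothetical name, not part of any package.

```python
import csv
import io

# A few rows (and a subset of columns) copied from the table above.
SAMPLE_CSV = """\
,RUN,Protein,LogIntensities,originalRUN,GROUP,SUBJECT
0,1,1433B_HUMAN,12.873423,230719_THP-1_Chrom_end2end_Plate1_DMSO_A02_DIA,DMSO,2
1,2,1433B_HUMAN,12.866217,230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA,DMSO,5
1189821,266,ZZZ3_HUMAN,10.384438,230719_THP-1_Chrom_end2end_Plate3_DMSO_A10,VTP50469,202
"""


def unique_values(rows, column):
    """Return the distinct values of `column`, in first-seen order."""
    seen = {}
    for row in rows:
        seen.setdefault(row[column], None)
    return list(seen)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
proteins = unique_values(rows, "Protein")
treatments = unique_values(rows, "GROUP")
print(proteins)
print(treatments)
```

With the full data frame loaded above, `input_data["Protein"].unique()` and `input_data["GROUP"].unique()` give the corresponding lists.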

## 2. How can we normalize text names to identifiers?

Gilda is a Python package and REST service that grounds (i.e., finds appropriate identifiers in namespaces for) named entities in biomedical text.

Gyori BM, Hoyt CT, Steppi A (2022). Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances, 2022; vbac034 https://doi.org/10.1093/bioadv/vbac034.

In [2]:
import gilda

In [3]:
gilda.ground('ZZZ3_HUMAN')

[]

We first try to ground the protein IDs in the dataset directly, but this returns no matches. Naively removing the `_HUMAN` suffix does yield a grounding for this protein.

In [7]:
gilda.ground('ZZZ3')[0].term

Term(zzz3,ZZZ3,HGNC,24523,ZZZ3,name,hgnc,9606,None,None)
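The naive suffix removal can be written as a small helper. This is only a sketch: `strip_species_suffix` is a hypothetical name, and the part before the underscore in a UniProt mnemonic is a protein-name code, so it is not guaranteed to be a gene symbol.

```python
def strip_species_suffix(mnemonic):
    """Drop the trailing `_SPECIES` part of a UniProt mnemonic, if present."""
    head, sep, _species = mnemonic.rpartition("_")
    # rpartition returns ('', '', mnemonic) when there is no underscore,
    # so fall back to the input unchanged in that case.
    return head if sep else mnemonic


print(strip_species_suffix("ZZZ3_HUMAN"))
print(strip_species_suffix("1433B_HUMAN"))
print(strip_species_suffix("YWHAB"))
```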

In [24]:
gilda.ground('1433B')

[]

However, the dataset uses UniProt mnemonic IDs, which Gilda cannot ground at the time of this writing, so stripping the suffix does not work in general: `1433B` returns no match. Alternatively, INDRA's UniProt client can look up the gene name for a given UniProt mnemonic ID, and that gene name can then be grounded.

In [23]:
from indra.databases import uniprot_client
uniprot_client.get_gene_name('1433B_HUMAN')

'YWHAB'

In [11]:
gilda.ground('YWHAB')[0].term

Term(ywhab,YWHAB,HGNC,12849,YWHAB,name,hgnc,9606,None,None)
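The two steps above (mnemonic to gene name, then gene name to grounding) can be chained. The sketch below takes the two lookups as arguments so that it runs without INDRA or Gilda installed; `ground_mnemonic` and the two `fake_*` stand-ins are hypothetical names, and the stand-ins only replay the outputs shown above. It also assumes the gene-name lookup returns `None` for an unknown ID.

```python
def ground_mnemonic(mnemonic, get_gene_name, ground):
    """Map a UniProt mnemonic to its top grounding via the gene name.

    `get_gene_name` and `ground` are passed in, so this sketch does not
    depend on a particular installation.
    """
    gene_name = get_gene_name(mnemonic)
    if gene_name is None:
        return None
    matches = ground(gene_name)
    return matches[0] if matches else None


# Stand-ins that replay the outputs shown above, for illustration only.
def fake_get_gene_name(mnemonic):
    return {"1433B_HUMAN": "YWHAB"}.get(mnemonic)


def fake_ground(text):
    return [("HGNC", "12849", text)] if text == "YWHAB" else []


print(ground_mnemonic("1433B_HUMAN", fake_get_gene_name, fake_ground))
print(ground_mnemonic("UNKNOWN_HUMAN", fake_get_gene_name, fake_ground))
```

In the notebook, the real functions would be passed instead: `ground_mnemonic('1433B_HUMAN', uniprot_client.get_gene_name, gilda.ground)`.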

We can also ground the drug treatment names listed in the table earlier.

In [47]:
gilda.ground('PF-477736')[0].term

Term(pf477736,PF-477736,CHEBI,CHEBI:91385,PF-00477736,synonym,chebi,None,None,None)

In [46]:
gilda.ground('Jakafi')[0].term

Term(jakafi,Jakafi,CHEBI,CHEBI:66917,ruxolitinib phosphate,synonym,chebi,None,None,None)
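When grounding a whole list of treatment labels, it helps to keep track of which ones returned nothing so they can be curated by hand. The following is a minimal sketch, with the grounding function passed in; `ground_all`, `fake_ground`, and the label `some unlisted label` are hypothetical, and the stand-in only knows the two results shown above.

```python
def ground_all(names, ground):
    """Ground each name once; return (grounded, unmatched)."""
    grounded, unmatched = {}, []
    for name in names:
        matches = ground(name)
        if matches:
            grounded[name] = matches[0]  # keep the top-ranked match
        else:
            unmatched.append(name)
    return grounded, unmatched


# Stand-in grounder replaying the two results shown above. Anything else
# is unmatched here only because the stub does not know it.
KNOWN = {"PF-477736": "CHEBI:91385", "Jakafi": "CHEBI:66917"}


def fake_ground(text):
    return [KNOWN[text]] if text in KNOWN else []


grounded, unmatched = ground_all(
    ["PF-477736", "Jakafi", "some unlisted label"], fake_ground
)
print(grounded)
print(unmatched)
```

Passing `gilda.ground` in place of `fake_ground` would return `ScoredMatch` objects as the dictionary values.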

Gilda also runs as a REST service that accepts POST requests with a JSON body on the `/ground` endpoint. A public instance of the service is available at http://grounding.indra.bio

In [17]:
import requests
res = requests.post('http://grounding.indra.bio/ground', json={'text': 'YWHAB'})
res.json()

[{'match': {'cap_combos': [],
   'dash_mismatches': [],
   'exact': True,
   'query': 'YWHAB',
   'ref': 'YWHAB',
   'space_mismatch': False},
  'score': 0.7777777777777778,
  'term': {'db': 'HGNC',
   'entry_name': 'YWHAB',
   'id': '12849',
   'norm_text': 'ywhab',
   'organism': '9606',
   'source': 'uniprot',
   'source_db': 'UP',
   'source_id': 'P31946',
   'status': 'name',
   'text': 'YWHAB'},
  'url': 'https://identifiers.org/hgnc:12849'}]
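For annotation, usually only a few fields of each returned record are needed. The sketch below reduces one record to a CURIE, a name, a score, and a URL. The record is a trimmed copy of the response above, `summarize_match` is a hypothetical helper, and lowercasing the prefix is an assumption that mirrors the `identifiers.org` URL in the response.

```python
# Trimmed copy of the first record in the response above.
record = {
    "score": 0.7777777777777778,
    "term": {"db": "HGNC", "entry_name": "YWHAB", "id": "12849"},
    "url": "https://identifiers.org/hgnc:12849",
}


def summarize_match(record):
    """Reduce one grounding record to the fields needed for annotation."""
    term = record["term"]
    db, identifier = term["db"], term["id"]
    # Some IDs already carry their namespace (e.g. 'CHEBI:91385');
    # strip it so the prefix is not repeated in the CURIE.
    if identifier.startswith(db + ":"):
        identifier = identifier[len(db) + 1:]
    return {
        "curie": f"{db.lower()}:{identifier}",
        "name": term["entry_name"],
        "score": record["score"],
        "url": record["url"],
    }


summary = summarize_match(record)
print(summary)
```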

In some cases, there are multiple feasible groundings for a particular name. For example, the treatment VE-821 targets ATR, and grounding `ATR` returns three candidates.

In [28]:
gilda.ground('ATR')

[ScoredMatch(Term(atr,ATR,HGNC,882,ATR,curated,famplex,9606,None,None),1.0,Match(query=ATR,ref=ATR,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[])),
 ScoredMatch(Term(atr,ATR,HGNC,21014,ANTXR1,synonym,hgnc,9606,None,None),0.5555555555555556,Match(query=ATR,ref=ATR,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[])),
 ScoredMatch(Term(atr,ATR,HGNC,8985,SERPINA2,synonym,hgnc,9606,None,None),0.5555555555555556,Match(query=ATR,ref=ATR,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[]))]
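One simple way to handle several candidates is to take the top-scoring one and flag the result when the runner-up scores almost as high. This is a sketch only: `pick_grounding` is a hypothetical helper, the `margin` of 0.1 is an arbitrary assumption, and the (name, score) pairs are transcribed from the output above.

```python
# (entry name, score) pairs transcribed from the output above.
atr_matches = [
    ("ATR", 1.0),
    ("ANTXR1", 0.5555555555555556),
    ("SERPINA2", 0.5555555555555556),
]


def pick_grounding(matches, margin=0.1):
    """Return (best_name, ambiguous) for a list of (name, score) pairs.

    `ambiguous` is True when the best score beats the runner-up by less
    than `margin` (an arbitrary threshold chosen for this sketch).
    """
    if not matches:
        return None, False
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    best = ranked[0]
    ambiguous = len(ranked) > 1 and (best[1] - ranked[1][1]) < margin
    return best[0], ambiguous


print(pick_grounding(atr_matches))
# Without the curated ATR entry, the two remaining synonyms tie.
print(pick_grounding(atr_matches[1:]))
```

With Gilda itself, the pairs could be built as `[(m.term.entry_name, m.score) for m in gilda.ground('ATR')]`.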

Users can supply context to disambiguate between candidate terms. Take "IR" as an example: it is widely used in the literature as an acronym for, among other things, insulin receptor and ionizing radiation.

In the first example, we ground IR with context implying the insulin receptor sense:

In [64]:
res = requests.post('http://grounding.indra.bio/ground', json={
        'text': 'IR',
        'context': 'IR binds INS at the membrane.'
    }
)
res.json()[0]

{'disambiguation': {'match': 'grounded',
  'score': 0.9945447300565196,
  'type': 'adeft'},
 'match': {'cap_combos': [],
  'dash_mismatches': [],
  'exact': True,
  'query': 'IR',
  'ref': 'IR',
  'space_mismatch': False},
 'score': 0.9945447300565196,
 'term': {'db': 'HGNC',
  'entry_name': 'INSR',
  'id': '6091',
  'norm_text': 'ir',
  'organism': '9606',
  'source': 'famplex',
  'status': 'curated',
  'text': 'IR'},
 'url': 'https://identifiers.org/hgnc:6091'}

Next, we look at a sentence which implies that IR is used in the sense of ionizing radiation:

In [61]:
res = requests.post('http://grounding.indra.bio/ground', json={
        'text': 'IR',
        'context': 'IR can cause DNA damage.'
    }
)
res.json()[0]

{'disambiguation': {'match': 'grounded',
  'score': 0.9915279740334499,
  'type': 'adeft'},
 'match': {'cap_combos': [],
  'dash_mismatches': [],
  'exact': True,
  'query': 'IR',
  'ref': 'IR',
  'space_mismatch': False},
 'score': 0.9915279740334499,
 'term': {'db': 'MESH',
  'entry_name': 'Radiation, Ionizing',
  'id': 'D011839',
  'norm_text': 'ir',
  'source': 'famplex',
  'source_db': 'MESH',
  'source_id': 'D011839',
  'status': 'curated',
  'text': 'IR'},
 'url': 'https://identifiers.org/mesh:D011839'}
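The two requests above differ only in the `context` field, so they can be wrapped in one helper. The sketch below uses the standard-library `urllib` rather than `requests` so that it has no third-party dependency; `build_payload` and `ground_remote` are hypothetical names, and the network call is defined but not executed here.

```python
import json
from urllib import request

GILDA_URL = "http://grounding.indra.bio/ground"


def build_payload(text, context=None):
    """Build the JSON body for the /ground endpoint."""
    payload = {"text": text}
    if context:
        payload["context"] = context
    return payload


def ground_remote(text, context=None, url=GILDA_URL, timeout=10):
    """POST the payload and return the decoded JSON response."""
    data = json.dumps(build_payload(text, context)).encode("utf-8")
    req = request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only the payloads are built here; no request is sent.
print(build_payload("IR", "IR binds INS at the membrane."))
print(build_payload("IR", "IR can cause DNA damage."))
print(build_payload("IR"))
```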

In [68]:
import bioregistry

The Bioregistry can convert a CURIE to its canonical form with the `normalize_curie()` function, which normalizes the prefix and removes redundant namespaces embedded in local unique identifiers (LUIs).

In [76]:
bioregistry.normalize_curie('UniProtKB:1433B_HUMAN')

'uniprot:1433B_HUMAN'
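To make the two normalization steps concrete, here is a toy version. It is not the Bioregistry's implementation: `normalize_curie_sketch` is a hypothetical function, and `PREFIX_SYNONYMS` is a one-entry stand-in for the registry's much larger synonym table.

```python
# Hypothetical stand-in for the registry's prefix synonym table.
PREFIX_SYNONYMS = {"uniprotkb": "uniprot"}


def normalize_curie_sketch(curie):
    """Toy CURIE normalizer: canonical prefix, no repeated namespace."""
    prefix, _, identifier = curie.partition(":")
    norm_prefix = prefix.lower()
    norm_prefix = PREFIX_SYNONYMS.get(norm_prefix, norm_prefix)
    # Remove a namespace repeated inside the local identifier,
    # e.g. 'CHEBI:CHEBI:91385' -> 'chebi:91385'.
    repeated = norm_prefix + ":"
    if identifier.lower().startswith(repeated):
        identifier = identifier[len(repeated):]
    return f"{norm_prefix}:{identifier}"


print(normalize_curie_sketch("UniProtKB:1433B_HUMAN"))
print(normalize_curie_sketch("CHEBI:CHEBI:91385"))
```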

We can also check whether an ID is a valid identifier within a particular namespace. The example below uses the bioregistry package to test whether each of three candidate IDs is a valid UniProt ID.

In [82]:
print(bioregistry.is_valid_identifier('uniprot', '1433B_HUMAN')) # uniprot mnemonic
print(bioregistry.is_valid_identifier('uniprot', 'YWHAB')) # hgnc gene name
print(bioregistry.is_valid_identifier('uniprot', 'P31946')) # uniprot ID

False
False
True
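A check of this kind can be approximated with a regular expression. The sketch below uses the accession pattern published in the UniProt documentation; whether the Bioregistry uses exactly this pattern is an assumption, and `looks_like_uniprot_accession` is a hypothetical name.

```python
import re

# Accession pattern as published in the UniProt documentation.
# Assumption: the registry's own pattern may differ in detail.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(identifier):
    """True if the whole string matches the UniProt accession pattern."""
    return UNIPROT_ACCESSION.fullmatch(identifier) is not None


for candidate in ["1433B_HUMAN", "YWHAB", "P31946"]:
    print(candidate, looks_like_uniprot_accession(candidate))
```

A pattern match only says the string has the right shape; it does not confirm that the accession exists in UniProt.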


Bioregistry can also provide a URL for accessing information about the identifier.

In [81]:
bioregistry.resolve_identifier.get_default_iri('uniprot', 'P31946')

'https://purl.uniprot.org/uniprot/P31946'
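Conceptually, such a URL comes from filling the identifier into a per-namespace template. The sketch below illustrates the idea with a single template inferred from the output above; `URI_FORMATS` and `build_iri` are hypothetical, and the `$1` placeholder convention is an assumption about how such templates are written.

```python
# Hypothetical one-entry template table; the template is inferred from
# the URL returned above, with `$1` standing for the local identifier.
URI_FORMATS = {"uniprot": "https://purl.uniprot.org/uniprot/$1"}


def build_iri(prefix, identifier):
    """Fill the identifier into the namespace's URL template, if known."""
    template = URI_FORMATS.get(prefix)
    if template is None:
        return None
    return template.replace("$1", identifier)


print(build_iri("uniprot", "P31946"))
print(build_iri("unknown-prefix", "P31946"))
```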