# Hands on: Introduction to data annotation using identifiers

In [1]:
!pip install -q -r requirements.txt

## 1. Introduction

### Example Dataset

We use an example dataset produced from an MSstats differential abundance analysis. It is a small-molecule dataset with known inhibition targets, and it includes 8 small-molecule inhibitors and a DMSO control holdout.

In [2]:
import pandas as pd

DATA_PATH = "dataProcessOutput.csv"

def import_data(filename):
    pandas_df = pd.read_csv(filename)
    return pandas_df

input_data = import_data(DATA_PATH)
input_data

Unnamed: 0,RUN,Protein,LogIntensities,originalRUN,GROUP,SUBJECT,TotalGroupMeasurements,NumMeasuredFeature,MissingPercentage,more50missing,NumImputedFeature
0,1,1433B_HUMAN,12.873423,230719_THP-1_Chrom_end2end_Plate1_DMSO_A02_DIA,DMSO,2,1210,10,0.0,False,0
1,2,1433B_HUMAN,12.866217,230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA,DMSO,5,1210,10,0.0,False,0
2,3,1433B_HUMAN,12.686827,230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA,DMSO,10,1210,10,0.0,False,0
3,4,1433B_HUMAN,12.625462,230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA,DMSO,12,1210,10,0.0,False,0
4,5,1433B_HUMAN,12.538365,230719_THP-1_Chrom_end2end_Plate1_DMSO_B01_DIA,DMSO,13,1210,10,0.0,False,0
...,...,...,...,...,...,...,...,...,...,...,...
1189821,266,ZZZ3_HUMAN,10.384438,230719_THP-1_Chrom_end2end_Plate3_DMSO_A10,VTP50469,202,170,10,0.0,False,0
1189822,267,ZZZ3_HUMAN,10.231615,230719_THP-1_Chrom_end2end_Plate3_DMSO_B03,VTP50469,207,170,10,0.0,False,0
1189823,268,ZZZ3_HUMAN,10.502691,230719_THP-1_Chrom_end2end_Plate3_DbET6_C07,VTP50469,223,170,10,0.0,False,0
1189824,269,ZZZ3_HUMAN,10.674776,230719_THP-1_Chrom_end2end_Plate3_DMSO_C11,VTP50469,227,170,10,0.0,False,0
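Before grounding anything, it helps to pull out the distinct treatment labels, which the output above suggests are stored in the GROUP column. The following is a minimal sketch that runs on a few rows copied from that output rather than on the full file; `SAMPLE_CSV`, `sample`, and `treatments` are names introduced here for illustration.

```python
import io

import pandas as pd

# A few rows copied from the output above, keeping only two columns.
SAMPLE_CSV = """Protein,GROUP
1433B_HUMAN,DMSO
1433B_HUMAN,DMSO
ZZZ3_HUMAN,VTP50469
ZZZ3_HUMAN,VTP50469
"""

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Distinct treatment labels, in order of first appearance.
treatments = sample["GROUP"].drop_duplicates().tolist()
print(treatments)
```

On the real data, the same expression applied to `input_data["GROUP"]` should give the list of experimental factors that need identifiers.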


### Experimental Factors:
| Treatment    | Target |
| :-------- | :------- |
| DMSO  | Control    |
| VTP50469  | MEN1    |
| PF477736 | Chk1    |
| Jakafi    | JAK1/2    |
| K-975  | TEAD1   |
| VE-821 | ATR    |
| dBET6    | BRD2/3/4   |


Our first goal is to make this dataset interoperable and connected to other datasets and surrounding knowledge. This means we must normalize the experimental factors to identifiers.

## 2. How can we normalize text names to identifiers?

Gilda is a Python package and REST service that grounds (i.e., finds appropriate identifiers in namespaces for) named entities in biomedical text.

Gyori BM, Hoyt CT, Steppi A (2022). Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances, 2022; vbac034 https://doi.org/10.1093/bioadv/vbac034.

In [3]:
import gilda

We can ground each drug name using Gilda.

In [4]:
gilda.ground('PF-477736')[0].term

Term(pf477736,PF-477736,CHEBI,CHEBI:91385,PF-00477736,synonym,chebi,None,None,None)

In [5]:
gilda.ground('Jakafi')[0].term

Term(jakafi,Jakafi,CHEBI,CHEBI:66917,ruxolitinib phosphate,synonym,chebi,None,None,None)

We can also ground target names. For example, Chk1 is grounded to the HGNC gene symbol CHEK1.

In [6]:
gilda.ground('Chk1')[0].term

Term(chk1,Chk1,HGNC,1925,CHEK1,curated,famplex,9606,None,None)
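The calls above index the result with `[0]`, which fails if a name has no match. They also show that the identifier format differs by namespace: the ChEBI identifier already carries its prefix (`CHEBI:91385`), while the HGNC one does not (`1925`). The sketch below wraps both concerns in a hypothetical helper, `ground_names`, which is not part of Gilda. It assumes the grounding function returns a list of matches ordered best first, each with `.term` and `.score` attributes, as the outputs above suggest.

```python
def ground_names(names, ground_fn):
    """Ground each name and return {name: summary dict or None}.

    `ground_fn` is any callable shaped like `gilda.ground`: it takes a
    string and returns a list of scored matches, best match first.
    """
    results = {}
    for name in names:
        matches = ground_fn(name)
        if not matches:
            # No candidate at all: indexing with [0] would raise IndexError.
            results[name] = None
            continue
        best = matches[0]
        term = best.term
        # Some namespaces embed the prefix in the identifier ("CHEBI:91385"),
        # others do not (HGNC "1925"), so avoid doubling the prefix.
        prefix = term.db + ":"
        curie = term.id if term.id.startswith(prefix) else prefix + term.id
        results[name] = {
            "curie": curie,
            "entry_name": term.entry_name,
            "score": best.score,
        }
    return results
```

With Gilda imported, a call such as `ground_names(["PF-477736", "Jakafi", "Chk1"], gilda.ground)` would ground all three names in one pass. Passing the grounding function as an argument is a design choice that keeps the helper easy to test with a stub.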

Gilda also runs as a REST service that accepts POST requests with a JSON payload on the /ground endpoint. A public instance is available at http://grounding.indra.bio.

In [7]:
import requests
res = requests.post('http://grounding.indra.bio/ground', json={'text': 'Jakafi'})
res.json()

[{'match': {'cap_combos': [],
   'dash_mismatches': [],
   'exact': True,
   'query': 'Jakafi',
   'ref': 'Jakafi',
   'space_mismatch': False},
  'score': 0.5555555555555556,
  'term': {'db': 'CHEBI',
   'entry_name': 'ruxolitinib phosphate',
   'id': 'CHEBI:66917',
   'norm_text': 'jakafi',
   'source': 'chebi',
   'status': 'synonym',
   'text': 'Jakafi'},
  'url': 'https://identifiers.org/CHEBI:66917'}]
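The response above is a list of match objects, each with `term`, `score`, and `url` keys. As a rough sketch, the two hypothetical helpers below (not part of Gilda) pull the top match out of such a response and add a timeout and an HTTP status check to the request. They assume that the list is ordered best first and that an unmatched query yields an empty list.

```python
def top_grounding(payload):
    """Return (db, id, entry_name, url) for the first match, or None."""
    if not payload:
        # Assumption: an unmatched query produces an empty list.
        return None
    best = payload[0]
    term = best["term"]
    return term["db"], term["id"], term["entry_name"], best["url"]


def ground_via_rest(text, base_url="http://grounding.indra.bio", timeout=10):
    """POST `text` to the /ground endpoint and summarize the top match."""
    import requests  # imported here so top_grounding works without it

    res = requests.post(f"{base_url}/ground", json={"text": text}, timeout=timeout)
    res.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return top_grounding(res.json())
```

Keeping the parsing step separate from the network call means it can be checked against a saved response without contacting the public service.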

In [8]:
import bioregistry

The Bioregistry's normalize_curie() function converts a CURIE (a compact identifier written as prefix:identifier) to its canonical form by normalizing the prefix and removing redundant namespaces embedded in local unique identifiers (LUIs).

In [9]:
# UniProtKB is an alternative prefix for UniProt
bioregistry.normalize_curie('UniProtKB:1433B_HUMAN')

'uniprot:1433B_HUMAN'

In [10]:
# CHEBIID is an alternative prefix for chebi
bioregistry.normalize_curie('CHEBIID:66917')

'chebi:66917'
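To make the two normalization steps concrete, here is a toy illustration written from scratch. It is not Bioregistry's implementation: `PREFIX_SYNONYMS` is a hypothetical four-entry table standing in for the real registry, and `toy_normalize_curie` is a name introduced here.

```python
# Hypothetical synonym table for illustration only; Bioregistry's real
# registry is far larger and is not consulted here.
PREFIX_SYNONYMS = {
    "uniprotkb": "uniprot",
    "uniprot": "uniprot",
    "chebiid": "chebi",
    "chebi": "chebi",
}


def toy_normalize_curie(curie):
    """Return a canonical 'prefix:identifier' string, or None if unknown."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return None  # not a CURIE: no colon separating prefix and identifier
    canonical = PREFIX_SYNONYMS.get(prefix.lower())
    if canonical is None:
        return None  # prefix not in the toy table
    # Drop a redundant namespace embedded in the local identifier,
    # e.g. "CHEBI:CHEBI:66917" has local identifier "CHEBI:66917".
    redundant = canonical + ":"
    if local_id.lower().startswith(redundant):
        local_id = local_id[len(redundant):]
    return f"{canonical}:{local_id}"
```

For real work, prefer `bioregistry.normalize_curie`, which draws on the full registry of prefixes and synonyms.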

We can also check whether an ID is a valid identifier within a particular namespace. The example below uses the bioregistry package to test three candidate UniProt identifiers.

In [11]:
print(bioregistry.is_valid_identifier('uniprot', '1433B_HUMAN')) # UniProt mnemonic (entry name)
print(bioregistry.is_valid_identifier('uniprot', 'YWHAB')) # corresponding gene symbol
print(bioregistry.is_valid_identifier('uniprot', 'P31946')) # corresponding UniProt accession

False
False
True
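These results are consistent with a purely syntactic check: only `P31946` has the shape of a UniProt accession. The sketch below reproduces that behavior with the accession pattern published in the UniProt documentation. It assumes, without confirming it from the Bioregistry source, that the validity check is pattern based; `looks_like_uniprot_accession` is a name introduced here.

```python
import re

# UniProt accession format as documented by UniProt: 6 or 10 characters.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(identifier):
    """Check the syntax only; this does not confirm the entry exists."""
    return UNIPROT_ACCESSION.fullmatch(identifier) is not None


for candidate in ["1433B_HUMAN", "YWHAB", "P31946"]:
    print(candidate, looks_like_uniprot_accession(candidate))
```

A practical consequence for this dataset is that the Protein column holds mnemonics such as `1433B_HUMAN`, so those values would need to be mapped to accessions before they pass this kind of check.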


Bioregistry can also provide a URL for looking up information about an identifier.

In [12]:
bioregistry.resolve_identifier.get_default_iri('uniprot', 'P31946')

'https://purl.uniprot.org/uniprot/P31946'