# Find UniProt IDs in ChEMBL targets

Ultimately, we are interested in getting [activity data from ChEMBL](/chembl-27/query_local_chembl-27.ipynb). To do that, we need to account for three components:

* The compound being measured
* The target the compound binds to
* The assay where this measurement took place

So, to find all activity data stored in ChEMBL that refers to kinases, we have to query for those assays annotated with a certain target.

Each of those three components has a unique ChEMBL ID, but so far we have only obtained UniProt IDs in the `human-kinases` notebook. We need a way to connect UniProt IDs to ChEMBL target IDs. Fortunately, ChEMBL maintains such a map in its FTP releases. We will parse that file and convert it into a dataframe for easy manipulation.

In [1]:
from pathlib import Path
import urllib.request

import pandas as pd

In [2]:
REPO = (Path(_dh[-1]) / "..").resolve()
DATA = REPO / 'data'

In [3]:
CHEMBL_VERSION = 29
url = fr"ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_{CHEMBL_VERSION}/chembl_uniprot_mapping.txt"

In [4]:
with urllib.request.urlopen(url) as response:
    uniprot_map = pd.read_csv(response, sep="\t", skiprows=[0], names=["UniprotID", "chembl_targets", "description", "type"])
uniprot_map

Unnamed: 0,UniprotID,chembl_targets,description,type
0,P21266,CHEMBL2242,Glutathione S-transferase Mu 3,SINGLE PROTEIN
1,O00519,CHEMBL2243,Anandamide amidohydrolase,SINGLE PROTEIN
2,P19217,CHEMBL2244,Estrogen sulfotransferase,SINGLE PROTEIN
3,P97292,CHEMBL2245,Histamine H2 receptor,SINGLE PROTEIN
4,P17342,CHEMBL2247,Atrial natriuretic peptide receptor C,SINGLE PROTEIN
...,...,...,...,...
13150,A0A0J5PZ55,CHEMBL4630893,"Dihydroorotate dehydrogenase (quinone), mitoch...",SINGLE PROTEIN
13151,Q8NI17,CHEMBL4630894,Interleukin-31 receptor subunit alpha,SINGLE PROTEIN
13152,P43630,CHEMBL4630895,Killer cell immunoglobulin-like receptor 3DL2,SINGLE PROTEIN
13153,A0A0H3HP34,CHEMBL4630896,Enoyl-[acyl-carrier-protein] reductase [NADH],SINGLE PROTEIN
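
To see what the `read_csv` arguments are doing without touching the FTP server, here is a minimal sketch on an in-memory excerpt. The two records are copied from the table above; the wording of the first line is an assumption and only stands in for whatever header line `skiprows=[0]` discards in the real file.

```python
import io

import pandas as pd

# Two records copied from the output above. The first line is a placeholder
# (assumed wording) for the header line that skiprows=[0] throws away.
sample = (
    "# header line (assumed)\n"
    "P21266\tCHEMBL2242\tGlutathione S-transferase Mu 3\tSINGLE PROTEIN\n"
    "O00519\tCHEMBL2243\tAnandamide amidohydrolase\tSINGLE PROTEIN\n"
)

sample_map = pd.read_csv(
    io.StringIO(sample),
    sep="\t",          # the mapping file is tab-separated
    skiprows=[0],      # drop the first line instead of parsing it as a header
    names=["UniprotID", "chembl_targets", "description", "type"],
)
print(sample_map)
```

Passing `names` explicitly keeps the column labels stable even if the header text in the file changes between releases.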


We join this new information to the human kinases aggregated list from `human-kinases` (all of them, regardless of the source):

In [15]:
kinases = pd.read_csv(DATA / "human_kinases.aggregated.csv", index_col=0)
kinases

Unnamed: 0,UniprotID,Name,kinhub,klifs,pkinfam,reviewed_uniprot,dunbrack_msa
0,A0A0B4J2F2,SIK1B,False,False,True,True,True
1,A4QPH2,PI4KAP2,False,True,False,False,False
2,B5MCJ9,TRIM66,True,False,False,False,False
3,O00141,SGK1,True,True,True,True,True
4,O00238,BMPR1B|BMR1B,True,True,True,True,True
...,...,...,...,...,...,...,...
547,Q9Y616,IRAK3,True,True,True,True,True
548,Q9Y6E0,STK24,True,True,True,True,True
549,Q9Y6M4,CSNK1G3|KC1G3,True,True,True,True,True
550,Q9Y6R4,M3K4|MAP3K4,True,True,True,True,True


We are only interested in those kinases present in these datasets:

* KinHub
* KLIFS
* PKinFam
* Dunbrack's MSA

In [16]:
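# Keep kinases listed in at least one dataset; .copy() avoids a SettingWithCopyWarning when columns are added later
kinases_subset = kinases[kinases[["kinhub", "klifs", "pkinfam", "dunbrack_msa"]].sum(axis=1) > 0].copy()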
kinases_subset

Unnamed: 0,UniprotID,Name,kinhub,klifs,pkinfam,reviewed_uniprot,dunbrack_msa
0,A0A0B4J2F2,SIK1B,False,False,True,True,True
1,A4QPH2,PI4KAP2,False,True,False,False,False
2,B5MCJ9,TRIM66,True,False,False,False,False
3,O00141,SGK1,True,True,True,True,True
4,O00238,BMPR1B|BMR1B,True,True,True,True,True
...,...,...,...,...,...,...,...
547,Q9Y616,IRAK3,True,True,True,True,True
548,Q9Y6E0,STK24,True,True,True,True,True
549,Q9Y6M4,CSNK1G3|KC1G3,True,True,True,True,True
550,Q9Y6R4,M3K4|MAP3K4,True,True,True,True,True
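
The filter above keeps a row when at least one of the four boolean columns is `True`. As a small sketch on made-up data (the `X0000n` identifiers are hypothetical), the same selection can be written with `any(axis=1)`, which some readers may find more explicit than summing booleans:

```python
import pandas as pd

# Toy table with hypothetical identifiers; the columns mirror the real ones.
toy = pd.DataFrame(
    {
        "UniprotID": ["X00001", "X00002", "X00003"],
        "kinhub": [True, False, False],
        "klifs": [False, False, False],
        "pkinfam": [True, False, False],
        "reviewed_uniprot": [True, True, False],
        "dunbrack_msa": [False, False, True],
    }
)
sources = ["kinhub", "klifs", "pkinfam", "dunbrack_msa"]

# Version used in this notebook: booleans sum as 0/1, so > 0 means "at least one".
by_sum = toy[toy[sources].sum(axis=1) > 0]
# Equivalent formulation: row-wise logical OR.
by_any = toy[toy[sources].any(axis=1)]

print(by_any["UniprotID"].tolist())
```

In this toy table, `X00002` is dropped because it only appears in `reviewed_uniprot`, which is not one of the four datasets used for filtering.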


We would also like to preserve the provenance of the UniProt assignment, so we will now combine the provenance columns into a single one.

In [17]:
kinases_subset["origin"] = kinases_subset.apply(lambda s: '|'.join([k for k in ["kinhub", "klifs", "pkinfam", "reviewed_uniprot", "dunbrack_msa"] if s[k]]), axis=1)
kinases_subset

Unnamed: 0,UniprotID,Name,kinhub,klifs,pkinfam,reviewed_uniprot,dunbrack_msa,origin
0,A0A0B4J2F2,SIK1B,False,False,True,True,True,pkinfam|reviewed_uniprot|dunbrack_msa
1,A4QPH2,PI4KAP2,False,True,False,False,False,klifs
2,B5MCJ9,TRIM66,True,False,False,False,False,kinhub
3,O00141,SGK1,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
4,O00238,BMPR1B|BMR1B,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
...,...,...,...,...,...,...,...,...
547,Q9Y616,IRAK3,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
548,Q9Y6E0,STK24,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
549,Q9Y6M4,CSNK1G3|KC1G3,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
550,Q9Y6R4,M3K4|MAP3K4,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
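
The `origin` column is a `|`-joined list of the provenance flags that are `True` for each row. A minimal sketch on a two-row toy frame (the values are made up; `join_sources` is a hypothetical helper name) shows the encoding and how to split it back into a list:

```python
import pandas as pd

provenance = ["kinhub", "klifs", "pkinfam", "reviewed_uniprot", "dunbrack_msa"]

# Two made-up rows of provenance flags.
toy = pd.DataFrame(
    [
        [True, False, True, True, False],
        [False, True, False, False, False],
    ],
    columns=provenance,
)


def join_sources(row):
    """Join the names of the provenance columns that are True for this row."""
    return "|".join(name for name in provenance if row[name])


toy["origin"] = toy.apply(join_sources, axis=1)

# The encoding is reversible: a single-character separator is split literally.
recovered = toy["origin"].str.split("|")

print(toy["origin"].tolist())
```

Storing the provenance as one string keeps the merged table narrow while still letting us recover the individual sources later.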


We can now merge the needed columns based on the `UniprotID` key.

In [18]:
merged = pd.merge(
    kinases_subset[["UniprotID", "Name", "origin"]],
    uniprot_map[["UniprotID", "chembl_targets", "type"]],
    how="inner",
    on="UniprotID",
)[["UniprotID", "Name", "chembl_targets", "type", "origin"]]
merged

Unnamed: 0,UniprotID,Name,chembl_targets,type,origin
0,A4QPH2,PI4KAP2,CHEMBL4105789,SINGLE PROTEIN,klifs
1,A4QPH2,PI4KAP2,CHEMBL3038509,PROTEIN COMPLEX,klifs
2,O00141,SGK1,CHEMBL2343,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
3,O00238,BMPR1B|BMR1B,CHEMBL5476,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
4,O00311,CDC7,CHEMBL2111377,PROTEIN COMPLEX,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
...,...,...,...,...,...
990,Q9Y5S2,MRCKB|CDC42BPB,CHEMBL5052,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
991,Q9Y616,IRAK3,CHEMBL5081,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
992,Q9Y6E0,STK24,CHEMBL5082,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
993,Q9Y6M4,CSNK1G3|KC1G3,CHEMBL5084,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...


~~How is this possible? 969 targets (ChEMBL 28)?!~~

Apparently, there is no 1:1 correspondence between UniProt IDs and ChEMBL target IDs! Some UniProt IDs are included in several ChEMBL targets:

In [20]:
merged.UniprotID.value_counts()

P11802    17
P24941    15
P35968    11
P06493    11
Q00534    11
          ..
O00141     1
Q12852     1
Q12851     1
Q09013     1
Q9Y6R4     1
Name: UniprotID, Length: 494, dtype: int64

In [21]:
merged[merged.UniprotID == "P11802"]

Unnamed: 0,UniprotID,Name,chembl_targets,type,origin
210,P11802,CDK4,CHEMBL331,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
211,P11802,CDK4,CHEMBL2095942,PROTEIN COMPLEX,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
212,P11802,CDK4,CHEMBL2111326,SELECTIVITY GROUP,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
213,P11802,CDK4,CHEMBL1907601,PROTEIN COMPLEX,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
214,P11802,CDK4,CHEMBL3301385,PROTEIN COMPLEX,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
215,P11802,CDK4,CHEMBL4523686,PROTEIN-PROTEIN INTERACTION,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
216,P11802,CDK4,CHEMBL4523715,PROTEIN-PROTEIN INTERACTION,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
217,P11802,CDK4,CHEMBL4523732,PROTEIN-PROTEIN INTERACTION,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
218,P11802,CDK4,CHEMBL3038472,PROTEIN COMPLEX,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
219,P11802,CDK4,CHEMBL4523963,SELECTIVITY GROUP,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...


... and some ChEMBL targets include several kinases (e.g. chimeric proteins):

In [22]:
merged[merged.chembl_targets == "CHEMBL2096618"]

Unnamed: 0,UniprotID,Name,chembl_targets,type,origin
108,P00519,ABL1,CHEMBL2096618,CHIMERIC PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
198,P11274,BCR,CHEMBL2096618,CHIMERIC PROTEIN,kinhub|klifs


This is due to the different `type` values:

In [23]:
merged.type.value_counts()

SINGLE PROTEIN                 492
PROTEIN FAMILY                 245
PROTEIN COMPLEX                120
PROTEIN-PROTEIN INTERACTION     91
SELECTIVITY GROUP               22
CHIMERIC PROTEIN                14
PROTEIN COMPLEX GROUP           11
Name: type, dtype: int64

If we focus on `SINGLE PROTEIN` types:

In [24]:
merged[merged.type == "SINGLE PROTEIN"]

Unnamed: 0,UniprotID,Name,chembl_targets,type,origin
0,A4QPH2,PI4KAP2,CHEMBL4105789,SINGLE PROTEIN,klifs
2,O00141,SGK1,CHEMBL2343,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
3,O00238,BMPR1B|BMR1B,CHEMBL5476,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
5,O00311,CDC7,CHEMBL5443,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
7,O00329,PIK3CD,CHEMBL3130,SINGLE PROTEIN,klifs|pkinfam
...,...,...,...,...,...
990,Q9Y5S2,MRCKB|CDC42BPB,CHEMBL5052,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
991,Q9Y616,IRAK3,CHEMBL5081,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
992,Q9Y6E0,STK24,CHEMBL5082,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
993,Q9Y6M4,CSNK1G3|KC1G3,CHEMBL5084,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...


... we end up with a total of 491 targets (ChEMBL 28; 492 in the `type` counts shown above), which is more acceptable.

For that reason, we will only save records corresponding to `type == "SINGLE PROTEIN"`:

In [25]:
merged[merged.type == "SINGLE PROTEIN"].to_csv(DATA / f"human_kinases_and_chembl_targets.chembl_{CHEMBL_VERSION}.csv", index=False)
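
If we want the notebook to fail loudly when the `SINGLE PROTEIN` subset is not one-to-one, one option is the `validate` argument of `pd.merge`. This is a sketch on a toy excerpt built from rows shown above; whether the full ChEMBL mapping satisfies the check after filtering is an assumption that would need to be verified on the real data.

```python
import pandas as pd

# Toy inputs built from rows shown earlier in this notebook.
kin = pd.DataFrame({"UniprotID": ["P11802", "P00519"], "Name": ["CDK4", "ABL1"]})
mapping = pd.DataFrame(
    {
        "UniprotID": ["P11802", "P11802", "P00519"],
        "chembl_targets": ["CHEMBL331", "CHEMBL2095942", "CHEMBL2096618"],
        "type": ["SINGLE PROTEIN", "PROTEIN COMPLEX", "CHIMERIC PROTEIN"],
    }
)

# After filtering, each UniProt ID appears at most once, so the check passes.
single = mapping[mapping["type"] == "SINGLE PROTEIN"]
checked = pd.merge(kin, single, how="inner", on="UniprotID", validate="one_to_one")

# Without filtering, P11802 appears twice in the mapping and the check raises.
try:
    pd.merge(kin, mapping, how="inner", on="UniprotID", validate="one_to_one")
    unfiltered_is_one_to_one = True
except pd.errors.MergeError:
    unfiltered_is_one_to_one = False

print(checked)
print(unfiltered_is_one_to_one)
```

With an inner join, kinases without a `SINGLE PROTEIN` target (ABL1 in this toy excerpt) simply drop out of the result.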