# Find UniProt IDs in ChEMBL targets

Ultimately, we are interested in getting [activity data from ChEMBL](/chembl-27/query_local_chembl-27.ipynb). To do so, we need to account for three components:

* The compound being measured
* The target the compound binds to
* The assay where this measurement took place

So, to find all activity data stored in ChEMBL that refers to kinases, we have to query for those assays annotated with a certain target.

Each of those three components has a unique ChEMBL ID, but so far we have only obtained UniProt IDs in the `human-kinases` notebook. We need a way to connect UniProt IDs to ChEMBL target IDs. Fortunately, ChEMBL maintains such a mapping in its FTP releases. We will parse that file and convert it into a dataframe for easy manipulation.

In [3]:
from pathlib import Path
import urllib.request

import pandas as pd

In [4]:
REPO = (Path(_dh[-1]) / "..").resolve()
DATA = REPO / 'data'

In [5]:
CHEMBL_VERSION = "chembl_27"
url = fr"ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/{CHEMBL_VERSION}/chembl_uniprot_mapping.txt"

In [6]:
with urllib.request.urlopen(url) as response:
    uniprot_map = pd.read_csv(response, sep="\t", skiprows=[0], names=["UniprotID", "chembl_targets", "description", "type"])
uniprot_map

Unnamed: 0,UniprotID,chembl_targets,description,type
0,P21266,CHEMBL2242,Glutathione S-transferase Mu 3,SINGLE PROTEIN
1,O00519,CHEMBL2243,Anandamide amidohydrolase,SINGLE PROTEIN
2,P19217,CHEMBL2244,Estrogen sulfotransferase,SINGLE PROTEIN
3,P97292,CHEMBL2245,Histamine H2 receptor,SINGLE PROTEIN
4,P17342,CHEMBL2247,Atrial natriuretic peptide receptor C,SINGLE PROTEIN
...,...,...,...,...
11779,Q91ZR5,CHEMBL3886121,Cation channel sperm-associated protein 1,SINGLE PROTEIN
11780,P48763,CHEMBL3886122,Sodium/hydrogen exchanger 2,SINGLE PROTEIN
11781,Q9UKU6,CHEMBL3886123,Thyrotropin-releasing hormone-degrading ectoen...,SINGLE PROTEIN
11782,Q9JJH7,CHEMBL3886124,Transient receptor potential cation channel su...,SINGLE PROTEIN
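
To make the parsing options concrete, here is a minimal offline sketch that runs the same `pd.read_csv` call on a small in-memory sample instead of the FTP file. The two data rows are copied from the output above; the wording of the header line is an assumption, included only to show what `skiprows=[0]` discards.

```python
import io

import pandas as pd

# Hypothetical excerpt in the same tab-separated layout as the mapping file.
# The first line stands in for the file's header line (its exact text is assumed).
sample = (
    "# UniProt ID\tChEMBL ID\tname\ttype\n"
    "P21266\tCHEMBL2242\tGlutathione S-transferase Mu 3\tSINGLE PROTEIN\n"
    "O00519\tCHEMBL2243\tAnandamide amidohydrolase\tSINGLE PROTEIN\n"
)

sample_map = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    skiprows=[0],  # skip the header line instead of parsing it
    names=["UniprotID", "chembl_targets", "description", "type"],  # our own column names
)
print(sample_map)
```

Passing explicit `names` keeps the column labels independent of whatever the header line in a given release looks like.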


We will join this new information with the aggregated list of human kinases from the `human-kinases` notebook (all of them, regardless of the source). First, we load that list:

In [8]:
kinases = pd.read_csv(DATA / "human_kinases.aggregated.csv", index_col=0)
kinases

Unnamed: 0,UniprotID,Name,kinhub,klifs,pkinfam,reviewed_uniprot,dunbrack_msa
0,A0A0B4J2F2,SIK1B,False,False,True,True,True
1,A2A3N6,PIPSL,False,False,False,True,False
2,A2RU49,HYKK,False,False,False,True,False
3,A4D2B8,PM2P1,False,False,False,True,False
4,A4QPH2,PI4KAP2|PI4P2,False,True,False,True,False
...,...,...,...,...,...,...,...
756,Q9Y6K8,KAD5,False,False,False,True,False
757,Q9Y6M4,KC1G3|CSNK1G3,True,True,True,True,True
758,Q9Y6R4,M3K4|MAP3K4,True,True,True,True,True
759,Q9Y6S9,RPKL1|RPS6KL1,True,True,True,True,True


We are only interested in those kinases present in these datasets:

* KinHub
* KLIFS
* PKinFam
* Dunbrack's MSA

In [18]:
kinases_subset = kinases[kinases[["kinhub", "klifs", "pkinfam", "dunbrack_msa"]].sum(axis=1) > 0]
kinases_subset

Unnamed: 0,UniprotID,Name,kinhub,klifs,pkinfam,reviewed_uniprot,dunbrack_msa
0,A0A0B4J2F2,SIK1B,False,False,True,True,True
4,A4QPH2,PI4KAP2|PI4P2,False,True,False,True,False
5,B5MCJ9,TRIM66,True,False,False,False,False
6,O00141,SGK1,True,True,True,True,True
8,O00238,BMPR1B|BMR1B,True,True,True,True,True
...,...,...,...,...,...,...,...
753,Q9Y616,IRAK3,True,True,True,True,True
754,Q9Y6E0,STK24,True,True,True,True,True
757,Q9Y6M4,KC1G3|CSNK1G3,True,True,True,True,True
758,Q9Y6R4,M3K4|MAP3K4,True,True,True,True,True
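
The filter above counts how many source flags are set for each row. Since the flags are booleans, `.any(axis=1)` expresses the same condition more directly. The sketch below compares both on a toy table whose rows are taken from the listings above. `toy_kinases` is a hypothetical stand-in for `kinases`, and the `.copy()` is an optional addition so that later column assignments act on an independent dataframe.

```python
import pandas as pd

source_columns = ["kinhub", "klifs", "pkinfam", "dunbrack_msa"]

# Toy stand-in for `kinases`; values copied from the tables above.
toy_kinases = pd.DataFrame(
    {
        "UniprotID": ["A0A0B4J2F2", "A2A3N6", "O00141"],
        "Name": ["SIK1B", "PIPSL", "SGK1"],
        "kinhub": [False, False, True],
        "klifs": [False, False, True],
        "pkinfam": [True, False, True],
        "reviewed_uniprot": [True, True, True],
        "dunbrack_msa": [True, False, True],
    }
)

# Original formulation: at least one source flag is set
by_sum = toy_kinases[toy_kinases[source_columns].sum(axis=1) > 0]
# Equivalent formulation, returned as an explicit copy
by_any = toy_kinases[toy_kinases[source_columns].any(axis=1)].copy()

print(by_any["UniprotID"].tolist())  # ['A0A0B4J2F2', 'O00141']
```

Note that `reviewed_uniprot` is deliberately left out of `source_columns`, which is why `A2A3N6` (only flagged by UniProt) is dropped.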


We would also like to preserve the provenance of each UniProt assignment, so we now collapse the provenance columns into a single `origin` column.

In [53]:
# Work on an explicit copy so that pandas does not raise SettingWithCopyWarning
kinases_subset = kinases_subset.copy()
kinases_subset["origin"] = kinases_subset.apply(lambda s: '|'.join([k for k in ["kinhub", "klifs", "pkinfam", "reviewed_uniprot", "dunbrack_msa"] if s[k]]), axis=1)
kinases_subset

Unnamed: 0,UniprotID,Name,kinhub,klifs,pkinfam,reviewed_uniprot,dunbrack_msa,origin
0,A0A0B4J2F2,SIK1B,False,False,True,True,True,pkinfam|reviewed_uniprot|dunbrack_msa
4,A4QPH2,PI4KAP2|PI4P2,False,True,False,True,False,klifs|reviewed_uniprot
5,B5MCJ9,TRIM66,True,False,False,False,False,kinhub
6,O00141,SGK1,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
8,O00238,BMPR1B|BMR1B,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
...,...,...,...,...,...,...,...,...
753,Q9Y616,IRAK3,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
754,Q9Y6E0,STK24,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
757,Q9Y6M4,KC1G3|CSNK1G3,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
758,Q9Y6R4,M3K4|MAP3K4,True,True,True,True,True,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...


We can now merge the needed columns based on the `UniprotID` key.

In [55]:
merged = pd.merge(kinases_subset[["UniprotID", "Name", "origin"]], uniprot_map[["UniprotID", "chembl_targets", "type"]], how="inner", on='UniprotID')[["UniprotID", "Name", "chembl_targets", "type", "origin"]]
merged

Unnamed: 0,UniprotID,Name,chembl_targets,type,origin
0,A4QPH2,PI4KAP2|PI4P2,CHEMBL4105789,SINGLE PROTEIN,klifs|reviewed_uniprot
1,A4QPH2,PI4KAP2|PI4P2,CHEMBL3038509,PROTEIN COMPLEX,klifs|reviewed_uniprot
2,O00141,SGK1,CHEMBL2343,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
3,O00238,BMPR1B|BMR1B,CHEMBL5476,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
4,O00311,CDC7,CHEMBL2111377,PROTEIN COMPLEX,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
...,...,...,...,...,...
873,Q9Y5S2,CDC42BPB|MRCKB,CHEMBL5052,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
874,Q9Y616,IRAK3,CHEMBL5081,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
875,Q9Y6E0,STK24,CHEMBL5082,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...
876,Q9Y6M4,KC1G3|CSNK1G3,CHEMBL5084,SINGLE PROTEIN,kinhub|klifs|pkinfam|reviewed_uniprot|dunbrack...


How is this possible? The merge produced 878 rows, more than the number of kinases we started with!

Apparently, there is no 1:1 correspondence between UniProt IDs and ChEMBL target IDs! Some UniProt IDs are included in several ChEMBL targets:

In [56]:
merged.UniprotID.value_counts()

P24941    12
P11802    12
P35968    10
Q13131     9
P54646     9
          ..
Q16512     1
O43293     1
Q9UIK4     1
Q15835     1
Q9H2X6     1
Name: UniprotID, Length: 488, dtype: int64

In [10]:
merged[merged.UniprotID == "P11802"]

Unnamed: 0,UniprotID,chembl_targets,type
181,P11802,CHEMBL331,SINGLE PROTEIN
182,P11802,CHEMBL2095942,PROTEIN COMPLEX
183,P11802,CHEMBL2111326,SELECTIVITY GROUP
184,P11802,CHEMBL1907601,PROTEIN COMPLEX
185,P11802,CHEMBL3301385,PROTEIN COMPLEX
186,P11802,CHEMBL3038472,PROTEIN COMPLEX
187,P11802,CHEMBL3559691,PROTEIN FAMILY
188,P11802,CHEMBL3038517,PROTEIN FAMILY
189,P11802,CHEMBL4106184,PROTEIN FAMILY
190,P11802,CHEMBL3885548,PROTEIN COMPLEX


... and some ChEMBL targets include several kinases (e.g. chimeric proteins):

In [11]:
merged[merged.chembl_targets == "CHEMBL2096618"]

Unnamed: 0,UniprotID,chembl_targets,type
98,P00519,CHEMBL2096618,CHIMERIC PROTEIN
171,P11274,CHEMBL2096618,CHIMERIC PROTEIN


This is due to the different `type` values:

In [12]:
merged.type.value_counts()

SINGLE PROTEIN                 485
PROTEIN FAMILY                 215
PROTEIN COMPLEX                107
PROTEIN-PROTEIN INTERACTION     33
SELECTIVITY GROUP               16
CHIMERIC PROTEIN                11
PROTEIN COMPLEX GROUP           11
Name: type, dtype: int64
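
The many-to-many relationship can also be quantified in both directions with `groupby(...).nunique()`. The sketch below uses a toy table with rows taken from the outputs above; `toy_merged` is a hypothetical stand-in for `merged`.

```python
import pandas as pd

# Toy stand-in for `merged`; rows copied from the outputs above.
toy_merged = pd.DataFrame(
    {
        "UniprotID": ["P11802", "P11802", "P00519", "P11274"],
        "chembl_targets": ["CHEMBL331", "CHEMBL2095942", "CHEMBL2096618", "CHEMBL2096618"],
        "type": ["SINGLE PROTEIN", "PROTEIN COMPLEX", "CHIMERIC PROTEIN", "CHIMERIC PROTEIN"],
    }
)

# Number of distinct ChEMBL targets per UniProt ID
targets_per_protein = toy_merged.groupby("UniprotID")["chembl_targets"].nunique()
# Number of distinct UniProt IDs per ChEMBL target
proteins_per_target = toy_merged.groupby("chembl_targets")["UniprotID"].nunique()

# Entries above 1 in either direction break the 1:1 assumption
print(targets_per_protein[targets_per_protein > 1])
print(proteins_per_target[proteins_per_target > 1])
```

As an alternative, `pd.merge` accepts a `validate` argument (for example `validate="one_to_one"`) that raises a `MergeError` when the join keys are not unique, which can surface this kind of surprise at merge time.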

If we focus on `SINGLE PROTEIN` types:

In [13]:
merged[merged.type == "SINGLE PROTEIN"]

Unnamed: 0,UniprotID,chembl_targets,type
0,A4QPH2,CHEMBL4105789,SINGLE PROTEIN
2,O00141,CHEMBL2343,SINGLE PROTEIN
3,O00238,CHEMBL5476,SINGLE PROTEIN
5,O00311,CHEMBL5443,SINGLE PROTEIN
6,O00329,CHEMBL3130,SINGLE PROTEIN
...,...,...,...
873,Q9Y5S2,CHEMBL5052,SINGLE PROTEIN
874,Q9Y616,CHEMBL5081,SINGLE PROTEIN
875,Q9Y6E0,CHEMBL5082,SINGLE PROTEIN
876,Q9Y6M4,CHEMBL5084,SINGLE PROTEIN


... we end up with a total of 485 targets, which is more acceptable.

For that reason, we will only save the records with `type == "SINGLE PROTEIN"`:

In [57]:
merged[merged.type == "SINGLE PROTEIN"].to_csv(DATA / f"human_kinases_and_chembl_targets.{CHEMBL_VERSION}.csv", index=False)
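
Before relying on the saved table as a lookup, it may be worth checking whether the `SINGLE PROTEIN` subset really has one row per UniProt ID, since filtering by `type` does not by itself guarantee that. A minimal sketch on toy data follows; the dataframe and the temporary file name are illustrative and not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for `merged`; rows copied from the outputs above.
toy_merged = pd.DataFrame(
    {
        "UniprotID": ["O00141", "O00311", "O00311"],
        "chembl_targets": ["CHEMBL2343", "CHEMBL2111377", "CHEMBL5443"],
        "type": ["SINGLE PROTEIN", "PROTEIN COMPLEX", "SINGLE PROTEIN"],
    }
)

single = toy_merged[toy_merged["type"] == "SINGLE PROTEIN"]

# UniProt IDs that still map to more than one target after filtering
duplicated_ids = single.loc[single["UniprotID"].duplicated(keep=False), "UniprotID"].unique()
print(duplicated_ids)

# Round-trip through CSV to confirm the file contains what we expect
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_kinases_and_chembl_targets.csv"  # hypothetical file name
    single.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded)
```

If `duplicated_ids` is not empty for the real data, those kinases would need a manual decision on which ChEMBL target to keep.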