# Find all human kinases in ChEMBL

This notebook maps UniProt IDs to ChEMBL target IDs and produces a helper CSV file for use in other notebooks.

In [1]:
from pathlib import Path
import urllib.request

import pandas as pd

In [2]:
# `_dh` is IPython's directory history; its last entry is the most recent working directory
REPO = (Path(_dh[-1]) / "..").resolve()
DATA = REPO / 'data'

Load the list of human kinases, as obtained from http://kinhub.org/kinases.html:

In [3]:
kinases = pd.read_csv(DATA / "KinHubKinaseList.csv")
kinases

Unnamed: 0,xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID
0,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519
1,ACK,ACK,TNK2,Activated CDC42 kinase 1,TK,Ack,,Q07912
2,ACTR2,ACTR2,ACVR2A,Activin receptor type-2A,TKL,STKR,STKR2,P27037
3,ACTR2B,ACTR2B,ACVR2B,Activin receptor type-2B,TKL,STKR,STKR2,Q13705
4,ADCK4,ADCK4,ADCK4,Uncharacterized aarF domain-containing protein...,Atypical,ABC1,ABC1-A,Q96D53
...,...,...,...,...,...,...,...,...
531,GTF2F1,GTF2F1,,,Atypical,GTF2F1,,Q6IBK5
532,Col4A3BP,Col4A3BP,COL4A3BP,Collagen type IV alpha-3-binding protein,Atypical,Col4A3BP,,Q9Y5P4
533,BLVRA,BLVRA,BLVRA,Biliverdin reductase A,Atypical,BLVRA,,P53004
534,BAZ1A,BAZ1A,BAZ1A,Bromodomain adjacent to zinc finger domain pro...,Atypical,BAZ,,Q9NRL2


Get the UniProt-ChEMBL mapping from the EBI FTP server. Update `CHEMBL_VERSION` (and with it the URL) to reflect new ChEMBL releases!

In [4]:
CHEMBL_VERSION = "chembl_27"
url = fr"ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/{CHEMBL_VERSION}/chembl_uniprot_mapping.txt"

In [5]:
with urllib.request.urlopen(url) as response:
    # Tab-separated file: skip its first line and supply our own column names
    uniprot_map = pd.read_csv(response, sep="\t", skiprows=[0], names=["UniprotID", "chembl_targets", "description", "type"])
uniprot_map

Unnamed: 0,UniprotID,chembl_targets,description,type
0,P21266,CHEMBL2242,Glutathione S-transferase Mu 3,SINGLE PROTEIN
1,O00519,CHEMBL2243,Anandamide amidohydrolase,SINGLE PROTEIN
2,P19217,CHEMBL2244,Estrogen sulfotransferase,SINGLE PROTEIN
3,P97292,CHEMBL2245,Histamine H2 receptor,SINGLE PROTEIN
4,P17342,CHEMBL2247,Atrial natriuretic peptide receptor C,SINGLE PROTEIN
...,...,...,...,...
11779,Q91ZR5,CHEMBL3886121,Cation channel sperm-associated protein 1,SINGLE PROTEIN
11780,P48763,CHEMBL3886122,Sodium/hydrogen exchanger 2,SINGLE PROTEIN
11781,Q9UKU6,CHEMBL3886123,Thyrotropin-releasing hormone-degrading ectoen...,SINGLE PROTEIN
11782,Q9JJH7,CHEMBL3886124,Transient receptor potential cation channel su...,SINGLE PROTEIN
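
The parsing options used above can be tried offline on a small in-memory sample. This is only a sketch: the first line of `sample` is a placeholder (the real first line of the file is not shown in this notebook), and `sample`, `columns` and `sample_map` are names made up for the illustration. The three records are copied from the output above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: one leading line that gets
# skipped, followed by tab-separated records copied from the output above.
sample = (
    "# placeholder first line (skipped below)\n"
    "P21266\tCHEMBL2242\tGlutathione S-transferase Mu 3\tSINGLE PROTEIN\n"
    "O00519\tCHEMBL2243\tAnandamide amidohydrolase\tSINGLE PROTEIN\n"
    "P19217\tCHEMBL2244\tEstrogen sulfotransferase\tSINGLE PROTEIN\n"
)

columns = ["UniprotID", "chembl_targets", "description", "type"]

# Same options as in the cell above: tab separator, skip line 0, explicit names
sample_map = pd.read_csv(io.StringIO(sample), sep="\t", skiprows=[0], names=columns)
print(sample_map)
```

Because `names` is given and line 0 is skipped, every remaining line is read as data, so the sample yields three rows and four columns.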


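We join this new information to the KinHub list. `pd.merge` performs an inner join by default, so kinases without an entry in the ChEMBL mapping are dropped: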

In [6]:
merged = pd.merge(kinases, uniprot_map[["UniprotID", "chembl_targets", "type"]], on='UniprotID')
merged

Unnamed: 0,xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID,chembl_targets,type
0,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL1862,SINGLE PROTEIN
1,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2096618,CHIMERIC PROTEIN
2,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2111414,PROTEIN FAMILY
3,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296119,PROTEIN-PROTEIN INTERACTION
4,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296120,PROTEIN-PROTEIN INTERACTION
...,...,...,...,...,...,...,...,...,...,...
865,BCR,BCR,BCR,Breakpoint cluster region protein,Atypical,BCR,,P11274,CHEMBL4296120,PROTEIN-PROTEIN INTERACTION
866,BCR,BCR,BCR,Breakpoint cluster region protein,Atypical,BCR,,P11274,CHEMBL4296137,PROTEIN-PROTEIN INTERACTION
867,Col4A3BP,Col4A3BP,COL4A3BP,Collagen type IV alpha-3-binding protein,Atypical,Col4A3BP,,Q9Y5P4,CHEMBL3399913,SINGLE PROTEIN
868,BAZ1A,BAZ1A,BAZ1A,Bromodomain adjacent to zinc finger domain pro...,Atypical,BAZ,,Q9NRL2,CHEMBL4105737,SINGLE PROTEIN
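
To see which kinases have no ChEMBL target at all, one option is a left join with `indicator=True`. The sketch below uses toy stand-ins (`toy_kinases`, `toy_map`, both hypothetical names) rather than the real tables, and the pairing of BLVRA with "no mapping" is an assumption of the toy data, not a verified statement about ChEMBL.

```python
import pandas as pd

# Toy stand-ins for `kinases` and `uniprot_map`; the IDs are copied from the
# tables above, but which kinase lacks a mapping is an assumption made here.
toy_kinases = pd.DataFrame(
    {"xName": ["ABL1", "BLVRA"], "UniprotID": ["P00519", "P53004"]}
)
toy_map = pd.DataFrame(
    {
        "UniprotID": ["P00519", "P00519"],
        "chembl_targets": ["CHEMBL1862", "CHEMBL2096618"],
        "type": ["SINGLE PROTEIN", "CHIMERIC PROTEIN"],
    }
)

# how="left" keeps every kinase; indicator=True adds a `_merge` column that
# says whether a row was matched ("both") or not ("left_only").
checked = pd.merge(toy_kinases, toy_map, on="UniprotID", how="left", indicator=True)
unmapped = checked.loc[checked["_merge"] == "left_only", "xName"].tolist()
print(unmapped)
```

Applied to the real `kinases` and `uniprot_map` frames, the same pattern would list the kinases that the inner join silently leaves out.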


Save the merged table as CSV for easy reuse in other notebooks.

In [7]:
merged.to_csv(DATA / f"human_kinases_and_chembl_targets.{CHEMBL_VERSION}.csv", index=False)

Note that there is no 1:1 correspondence between UniProt IDs and ChEMBL target IDs! Some kinases are included in several ChEMBL targets:

In [8]:
merged[merged.UniprotID == "P00519"]

Unnamed: 0,xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID,chembl_targets,type
0,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL1862,SINGLE PROTEIN
1,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2096618,CHIMERIC PROTEIN
2,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2111414,PROTEIN FAMILY
3,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296119,PROTEIN-PROTEIN INTERACTION
4,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296120,PROTEIN-PROTEIN INTERACTION
5,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296137,PROTEIN-PROTEIN INTERACTION
6,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL3885630,PROTEIN-PROTEIN INTERACTION
7,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL3885645,CHIMERIC PROTEIN


... and some ChEMBL targets include several kinases (e.g. chimeric proteins):

In [9]:
merged[merged.chembl_targets == "CHEMBL2096618"]

Unnamed: 0,xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID,chembl_targets,type
1,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2096618,CHIMERIC PROTEIN
863,BCR,BCR,BCR,Breakpoint cluster region protein,Atypical,BCR,,P11274,CHEMBL2096618,CHIMERIC PROTEIN
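
The many-to-many relationship can also be quantified in both directions with `groupby` and `nunique`. This is a minimal sketch on a toy frame (`toy_merged` is a hypothetical name; its rows are a small subset of the values shown above), not a summary of the full table.

```python
import pandas as pd

# Toy subset of the merged table, using IDs that appear in the outputs above
toy_merged = pd.DataFrame(
    {
        "UniprotID": ["P00519", "P00519", "P11274", "Q9Y5P4"],
        "chembl_targets": [
            "CHEMBL1862",
            "CHEMBL2096618",
            "CHEMBL2096618",
            "CHEMBL3399913",
        ],
    }
)

# Number of distinct ChEMBL targets each kinase appears in
targets_per_kinase = toy_merged.groupby("UniprotID")["chembl_targets"].nunique()

# Number of distinct kinases each ChEMBL target contains
kinases_per_target = toy_merged.groupby("chembl_targets")["UniprotID"].nunique()

print(targets_per_kinase[targets_per_kinase > 1])
print(kinases_per_target[kinases_per_target > 1])
```

Any count above 1 in either series marks a departure from a 1:1 mapping.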


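This is because ChEMBL targets are not only single proteins. The `type` column also takes other values, such as protein families, complexes and chimeric proteins: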

In [10]:
merged["type"].value_counts()

SINGLE PROTEIN                 475
PROTEIN FAMILY                 219
PROTEIN COMPLEX                110
PROTEIN-PROTEIN INTERACTION     32
SELECTIVITY GROUP               16
CHIMERIC PROTEIN                11
PROTEIN COMPLEX GROUP            7
Name: type, dtype: int64
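
If a downstream notebook needs one ChEMBL target per kinase, a possible approach is to keep only the `SINGLE PROTEIN` rows and then check whether the result is unique in both columns. The sketch below uses a toy frame (`toy_merged`, `single` and `one_to_one` are hypothetical names); whether the real `SINGLE PROTEIN` subset is strictly 1:1 is not verified here.

```python
import pandas as pd

# Toy subset of the merged table, using rows that appear in the outputs above
toy_merged = pd.DataFrame(
    {
        "UniprotID": ["P00519", "P00519", "P11274"],
        "chembl_targets": ["CHEMBL1862", "CHEMBL2096618", "CHEMBL2096618"],
        "type": ["SINGLE PROTEIN", "CHIMERIC PROTEIN", "CHIMERIC PROTEIN"],
    }
)

# Keep only targets that correspond to one individual protein
single = toy_merged[toy_merged["type"] == "SINGLE PROTEIN"]

# Check, rather than assume, that the filtered table is 1:1
one_to_one = single["UniprotID"].is_unique and single["chembl_targets"].is_unique
print(single)
print(one_to_one)
```

Note that in this toy frame the filter drops BCR (P11274) entirely, since its only row is a chimeric protein, so filtering by type can lose kinases as well as duplicates.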