# Find all human kinases in ChEMBL

This notebook maps Uniprot IDs to ChEMBL target IDs and produces a helper CSV file useful in other notebooks

In [1]:
from pathlib import Path
import urllib.request

import requests
import bs4
import pandas as pd

In [2]:
REPO = (Path(_dh[-1]) / "..").resolve()
DATA = REPO / 'data'

## 01. Obtain UniprotIDs for all human kinases

There are several sources that can be used to identify the human kinome in Uniprot:

* [KinHub](http://kinhub.org/kinases.html)
* [KLIFS](https://klifs.vu-compmedchem.nl/)
* [UniProt](https://www.uniprot.org/docs/pkinfam.txt)'s _Human and mouse protein kinases: classification and index_.
* Scientific literature

There are also several nomenclatures commonly used to refer to kinases: xName, Manning, HGNC... For  completeness, we will try to merge all these sources into a single table we can use in later exercises.

### 01.1 KinHub

This part is straightforward: it retrieves the table listed at [Kinhub.org](http://kinhub.org/kinases.html) with `requests` and `beautifulsoup`.

In [3]:
r = requests.get("http://kinhub.org/kinases.html")
r.raise_for_status()
html = bs4.BeautifulSoup(r.text)

In [45]:
table = html.find('table')
columns = [cell.text.strip() for cell in table.find('thead').find_all('th')]
records = [[cell.text.strip() for cell in row.find_all('td')] for row in table.find('tbody').find_all('tr')]
kinhub = pd.DataFrame.from_records(records, columns=columns).sort_values(by="xName")
kinhub

Unnamed: 0,xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID
336,AAK1,AAK1,AAK1,AP2-associated protein kinase 1,Other,NAK,,Q2M2I8
0,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519
24,ABL2,ARG,ABL2,Abelson tyrosine-protein kinase 2,TK,Abl,,P42684
529,ABR,ABR,ABR,Active breakpoint cluster region-related protein,Atypical,BCR,,Q12979
1,ACK,ACK,TNK2,Activated CDC42 kinase 1,TK,Ack,,Q07912
...,...,...,...,...,...,...,...,...
274,p38g,p38g,MAPK12,Mitogen-activated protein kinase 12,CMGC,MAPK,p38,P53778
209,p70S6K,p70S6K,RPS6KB1,Ribosomal protein S6 kinase beta-1,AGC,RSK,RSKp70,P23443
210,p70S6Kb,p70S6Kb,RPS6KB2,Ribosomal protein S6 kinase beta-2,AGC,RSK,RSKp70,Q9UBS0
477,skMLCK,skMLCK,MYLK2,Myosin light chain kinase 2,CAMK,MLCK,,Q9H1R3


## 01.2 KLIFS

In [21]:
r = requests.get("https://klifs.vu-compmedchem.nl/api/kinase_names?species=HUMAN")
r.raise_for_status()
data = r.json()
all_ids = [record["kinase_ID"] for record in data]

In [23]:
r = requests.get(f"https://klifs.vu-compmedchem.nl/api/kinase_information?kinase_ID={','.join(map(str, all_ids))}")
r.raise_for_status()
data = r.json()

In [46]:
items = []
for record in data:
    items.append({
        "KLIFS Name": record["name"],
        "HGNC Name": record["HGNC"],
        "Family": record["family"],
        "Group": record["group"],
        "Class": record["kinase_class"],
        "Kinase Name": record["full_name"],
        "UniprotID": record["uniprot"],
        "IUPHAR": record["iuphar"],
        "Pocket": record["pocket"],
        "KLIFS ID": record["kinase_ID"]
    })
klifs = pd.DataFrame.from_dict(items).sort_values(by="KLIFS Name")
klifs

Unnamed: 0,KLIFS Name,HGNC Name,Family,Group,Class,Kinase Name,UniprotID,IUPHAR,Pocket,KLIFS ID
276,AAK1,AAK1,NAK,Other,BIKE,AP2 associated kinase 1,Q2M2I8,1921,EVLAEGGFAIVFLCALKRMVCKREIQIMRDLSKNIVGYIDSLILMD...,277
391,ABL1,ABL1,Abl,TK,,"ABL proto-oncogene 1, non-receptor tyrosine ki...",P00519,1923,HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIITE...,392
392,ABL2,ABL2,Abl,TK,,"ABL proto-oncogene 2, non-receptor tyrosine ki...",P42684,1924,HKLGGGQYGEVYVVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIVTE...,393
393,ACK,TNK2,Ack,TK,,"tyrosine kinase, non-receptor, 2",Q07912,2246,EKLGDGSFGVVRRVAVKCLDFIREVNAMHSLDRNLIRLYGVKMVTE...,394
522,ACTR2,ACVR2A,STKR,TKL,Type2,activin A receptor type IIA,P27037,1791,EVKARGRFGCVWKVAVKIFSWQNEYEVYSLPGENILQFIGAWLITA...,523
...,...,...,...,...,...,...,...,...,...,...
251,p38g,MAPK12,MAPK,CMGC,p38,mitogen-activated protein kinase 12,P53778,1501,QPVGSGAYGAVCSVAIKKLRAYRELRLLKHMRENVIGLLDVYLVMP...,252
50,p70S6K,RPS6KB1,RSK,AGC,p70,ribosomal protein S6 kinase B1,P23443,1525,RVLGKGGYGKVFQFAMKVLHTKAERNILEEVKPFIVDLIYAYLILE...,51
51,p70S6Kb,RPS6KB2,RSK,AGC,p70,ribosomal protein S6 kinase B2,Q9UBS0,1526,RVLGKGGYGKVFQYAMKVLHTRAERNILESVKPFIVELAYAYLILE...,52
154,skMLCK,MYLK2,MLCK,CAMK,,myosin light chain kinase 2,Q9H1R3,1553,EALGGGKFGAVCTLAAKVIMVLLEIEVMNQLNRNLIQLYAAVLFME...,155


## 01.3 UniProt

UniProt maintains [a txt file](https://www.uniprot.org/docs/pkinfam.txt) with a kinase index. We can try to parse it and see what information we have there.

In [31]:
r = requests.get("https://www.uniprot.org/docs/pkinfam.txt")
r.raise_for_status()
text = r.text

In [47]:
items = []
current_category = None
lines = iter(text.splitlines())
for line in lines:
    if line.startswith("==="):  # this is a header
        line = next(lines)
        current_category = line.strip()
        next(lines)
        next(lines)
        continue
    if not line.strip() or not current_category or '_HUMAN' not in line:
        continue
    if line.startswith("---") and current_category:
        break
    
    fields = line.split()
    name = fields[0]
    uniprot_id = fields[2][1:]
    items.append({
        'Name': name,
        'Family': current_category,
        'UniprotID': uniprot_id,
    })
uniprot = pd.DataFrame.from_dict(items).sort_values(by="Name")
uniprot

Unnamed: 0,Name,Family,UniprotID
403,AAK1,Other,Q2M2I8
313,AATK,Tyr protein kinase family,Q6ZMQ8
314,ABL1,Tyr protein kinase family,P00519
315,ABL2,Tyr protein kinase family,P42684
279,ACVR1,TKL Ser/Thr protein kinase family,Q04771
...,...,...,...
481,WNK2,Other,Q9Y3S1
482,WNK3,Other,Q9BYP7
483,WNK4,Other,Q96J92
401,YES1,Tyr protein kinase family,P07947


### Review results so far

KinHub lists 536 kinases, KLIFS 555, and UniProt itself 513?

***

WIP! Cells below have not been updated yet!

***

## 02. Find UniProt IDs in ChEMBL targets

Get UniProt-ChEMBL mapping from EBI FTP server. Update URL to reflect new ChEMBL releases!

In [4]:
CHEMBL_VERSION = "chembl_27"
url = fr"ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/{CHEMBL_VERSION}/chembl_uniprot_mapping.txt"

In [5]:
with urllib.request.urlopen(url) as response:
    uniprot_map = pd.read_csv(response, sep="\t", skiprows=[0], names=["UniprotID", "chembl_targets", "description", "type"])
uniprot_map

Unnamed: 0,UniprotID,chembl_targets,description,type
0,P21266,CHEMBL2242,Glutathione S-transferase Mu 3,SINGLE PROTEIN
1,O00519,CHEMBL2243,Anandamide amidohydrolase,SINGLE PROTEIN
2,P19217,CHEMBL2244,Estrogen sulfotransferase,SINGLE PROTEIN
3,P97292,CHEMBL2245,Histamine H2 receptor,SINGLE PROTEIN
4,P17342,CHEMBL2247,Atrial natriuretic peptide receptor C,SINGLE PROTEIN
...,...,...,...,...
11779,Q91ZR5,CHEMBL3886121,Cation channel sperm-associated protein 1,SINGLE PROTEIN
11780,P48763,CHEMBL3886122,Sodium/hydrogen exchanger 2,SINGLE PROTEIN
11781,Q9UKU6,CHEMBL3886123,Thyrotropin-releasing hormone-degrading ectoen...,SINGLE PROTEIN
11782,Q9JJH7,CHEMBL3886124,Transient receptor potential cation channel su...,SINGLE PROTEIN


We join this new information to the KinHub list:

In [6]:
merged = pd.merge(kinases, uniprot_map[["UniprotID", "chembl_targets", "type"]], on='UniprotID')
merged

Unnamed: 0,xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID,chembl_targets,type
0,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL1862,SINGLE PROTEIN
1,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2096618,CHIMERIC PROTEIN
2,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2111414,PROTEIN FAMILY
3,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296119,PROTEIN-PROTEIN INTERACTION
4,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296120,PROTEIN-PROTEIN INTERACTION
...,...,...,...,...,...,...,...,...,...,...
865,BCR,BCR,BCR,Breakpoint cluster region protein,Atypical,BCR,,P11274,CHEMBL4296120,PROTEIN-PROTEIN INTERACTION
866,BCR,BCR,BCR,Breakpoint cluster region protein,Atypical,BCR,,P11274,CHEMBL4296137,PROTEIN-PROTEIN INTERACTION
867,Col4A3BP,Col4A3BP,COL4A3BP,Collagen type IV alpha-3-binding protein,Atypical,Col4A3BP,,Q9Y5P4,CHEMBL3399913,SINGLE PROTEIN
868,BAZ1A,BAZ1A,BAZ1A,Bromodomain adjacent to zinc finger domain pro...,Atypical,BAZ,,Q9NRL2,CHEMBL4105737,SINGLE PROTEIN


And save as CSV for easy reuse in other notebooks.

In [7]:
merged.to_csv(DATA /  f"human_kinases_and_chembl_targets.{CHEMBL_VERSION}.csv", index=False)

Note there's not 1:1 correspondence between UniprotID and ChEMBL ID! Some kinases are included in several ChEMBL targets:

In [8]:
merged[merged.UniprotID == "P00519"]

Unnamed: 0,xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID,chembl_targets,type
0,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL1862,SINGLE PROTEIN
1,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2096618,CHIMERIC PROTEIN
2,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2111414,PROTEIN FAMILY
3,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296119,PROTEIN-PROTEIN INTERACTION
4,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296120,PROTEIN-PROTEIN INTERACTION
5,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL4296137,PROTEIN-PROTEIN INTERACTION
6,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL3885630,PROTEIN-PROTEIN INTERACTION
7,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL3885645,CHIMERIC PROTEIN


... and some ChEMBL targets include several kinases (e.g. chimeric proteins):

In [9]:
merged[merged.chembl_targets == "CHEMBL2096618"]

Unnamed: 0,xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID,chembl_targets,type
1,ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519,CHEMBL2096618,CHIMERIC PROTEIN
863,BCR,BCR,BCR,Breakpoint cluster region protein,Atypical,BCR,,P11274,CHEMBL2096618,CHIMERIC PROTEIN


This is due to the different `type` values:

In [10]:
merged.type.value_counts()

SINGLE PROTEIN                 475
PROTEIN FAMILY                 219
PROTEIN COMPLEX                110
PROTEIN-PROTEIN INTERACTION     32
SELECTIVITY GROUP               16
CHIMERIC PROTEIN                11
PROTEIN COMPLEX GROUP            7
Name: type, dtype: int64