# EGFR kinase

This notebook filters the curated ChEMBL kinase activity data (kinodata release v0.3) for the EGFR kinase.

In [1]:
from pathlib import Path
import pandas as pd
from collections import Counter

In [2]:
HERE = Path(_dh[-1])
REPO = (HERE / "..").resolve()
DATA = REPO / "data"
OUT = HERE / "_out"
OUT.mkdir(parents=True, exist_ok=True)

In [3]:
CHEMBL_VERSION = 29

In [4]:
path = f"https://github.com/openkinome/kinodata/releases/download/v0.3/activities-chembl{CHEMBL_VERSION}_v0.3.zip"
data = pd.read_csv(path)
print(f"Shape of data: {data.shape}")

Shape of data: (190634, 17)


In [5]:
data.head()

Unnamed: 0.1,Unnamed: 0,activities.activity_id,assays.chembl_id,target_dictionary.chembl_id,molecule_dictionary.chembl_id,molecule_dictionary.max_phase,activities.standard_type,activities.standard_value,activities.standard_units,compound_structures.canonical_smiles,compound_structures.standard_inchi,component_sequences.sequence,assays.confidence_score,docs.chembl_id,docs.year,docs.authors,UniprotID
0,96698,16291323,CHEMBL3705523,CHEMBL2973,CHEMBL3666724,0,pIC50,14.09691,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C31H33N7O3/c1-2-4-29(40)33-22-6-3-5-2...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
1,94326,16264754,CHEMBL3705523,CHEMBL2973,CHEMBL3666728,0,pIC50,14.0,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C34H40N8O3/c1-5-7-32(43)36-24-9-6-8-2...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
2,98119,16306943,CHEMBL3705523,CHEMBL2973,CHEMBL1968705,0,pIC50,14.0,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C31H33N7O2/c1-2-6-29(39)33-23-8-5-7-2...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
3,101161,16340050,CHEMBL3705523,CHEMBL2973,CHEMBL1997433,0,pIC50,13.958607,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C28H28N6O3/c1-3-5-26(35)30-20-7-4-6-1...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
4,96324,16287186,CHEMBL3705523,CHEMBL2973,CHEMBL3666721,0,pIC50,13.920819,nM,CCCC(=O)Nc1cccc(-c2nc(Nc3ccc4[nH]ncc4c3)c3cc(O...,InChI=1S/C32H35N7O2/c1-2-7-30(40)34-24-9-6-8-2...,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,9,CHEMBL3639077,2014.0,,O75116
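The `Unnamed: 0.1` and `Unnamed: 0` columns in the table above look like index columns left over from earlier CSV round trips, although the release file does not confirm this. The sketch below uses a made-up two-row table (not the kinodata file) to show how such a column can appear when `to_csv` writes the index, and how `index=False` avoids it.

```python
import io

import pandas as pd

# Hypothetical toy table; only the column name is borrowed from the notebook.
toy = pd.DataFrame({"UniprotID": ["P00533", "O75116"]})

# Default behaviour: the index is written as a header-less first column ...
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)  # ... and comes back as "Unnamed: 0".

# Writing with index=False keeps the round trip free of extra columns.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to drop these columns in this notebook is a judgment call: removing them would change the reported shape of `(190634, 17)`.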


In [6]:
print(f"Number of unique uniprot IDs: {len(set(data['UniprotID']))}")

Number of unique uniprot IDs: 462


Let's focus on the EGFR kinase. Its UniProt ID is P00533, see https://www.uniprot.org/uniprot/P00533

In [7]:
data = data[data["UniprotID"] == "P00533"]
print(f"Shape of data: {data.shape}")

Shape of data: (6509, 17)
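A quick sanity check after filtering is to tally which activity types the subset contains, which is one possible use for the `Counter` imported at the top. The following is a minimal sketch on a made-up four-row table; the rows and the `pKi` entry are illustrative assumptions, not values read from the kinodata file.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the kinodata table, reduced to the two columns used here.
toy = pd.DataFrame(
    {
        "UniprotID": ["P00533", "P00533", "O75116", "P00533"],
        "activities.standard_type": ["pIC50", "pKi", "pIC50", "pIC50"],
    }
)

# Same boolean-mask filter as in the notebook cell above.
egfr = toy[toy["UniprotID"] == "P00533"]

# Count how often each measurement type occurs in the filtered subset.
type_counts = Counter(egfr["activities.standard_type"])
print(type_counts)
```

On the real data, the same two lines applied to `data` would show how the 6509 EGFR measurements split across activity types.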


We save the EGFR activity data to a CSV file.

In [8]:
data.to_csv(OUT / f"EGFR-activities-chembl{CHEMBL_VERSION}.csv")

And compress it:

In [9]:
%%bash
cd _out
zip -r EGFR-activities-chembl29.zip EGFR-activities-chembl29.csv

  adding: EGFR-activities-chembl29.csv (deflated 89%)
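As an alternative to the separate `zip` call, pandas can write the zip archive directly, which also avoids hard-coding the ChEMBL version in the shell command. The sketch below assumes a pandas version that accepts a `compression` dictionary with `archive_name` in `to_csv`; the toy table and file names are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row table standing in for the EGFR activities.
toy = pd.DataFrame({"UniprotID": ["P00533"], "activities.standard_value": [11.5]})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "EGFR-activities-toy.zip"

    # Write the CSV straight into a zip archive under the given member name.
    toy.to_csv(
        zip_path,
        index=False,
        compression={"method": "zip", "archive_name": "EGFR-activities-toy.csv"},
    )

    # Inspect the archive and read it back; read_csv infers zip from the extension.
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.namelist()
    roundtrip = pd.read_csv(zip_path)

print(members)
print(roundtrip.shape)
```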


## Subset of the EGFR activity data

For testing purposes, it is often convenient to work with a small sample of data points. We simply take the first 100 rows.

In [10]:
EGFR_sample = data.iloc[:100]  # first 100 rows by position

In [11]:
EGFR_sample.shape

(100, 17)

In [12]:
EGFR_sample.head()

Unnamed: 0.1,Unnamed: 0,activities.activity_id,assays.chembl_id,target_dictionary.chembl_id,molecule_dictionary.chembl_id,molecule_dictionary.max_phase,activities.standard_type,activities.standard_value,activities.standard_units,compound_structures.canonical_smiles,compound_structures.standard_inchi,component_sequences.sequence,assays.confidence_score,docs.chembl_id,docs.year,docs.authors,UniprotID
58,7654,1044894,CHEMBL683040,CHEMBL203,CHEMBL63786,0,pIC50,11.522879,nM,Brc1cccc(Nc2ncnc3cc4ccccc4cc23)c1,InChI=1S/C18H12BrN3/c19-14-6-3-7-15(10-14)22-1...,MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED...,9,CHEMBL1129035,1996.0,"Rewcastle GW, Palmer BD, Bridges AJ, Showalter...",P00533
98,4823,720032,CHEMBL674648,CHEMBL203,CHEMBL53711,0,pIC50,11.221849,nM,CN(C)c1cc2c(Nc3cccc(Br)c3)ncnc2cn1,InChI=1S/C15H14BrN5/c1-21(2)14-7-12-13(8-17-14...,MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED...,8,CHEMBL1131301,1998.0,"Rewcastle GW, Murray DK, Elliott WL, Fry DW, H...",P00533
99,23101,2136934,CHEMBL939337,CHEMBL203,CHEMBL35820,0,pIC50,11.221849,nM,CCOc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OCC,InChI=1S/C18H18BrN3O2/c1-3-23-16-9-14-15(10-17...,MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED...,8,CHEMBL1145498,2007.0,"Fedorov O, Marsden B, Pogacic V, Rellos P, Mül...",P00533
140,2647,400160,CHEMBL679944,CHEMBL203,CHEMBL53753,0,pIC50,11.09691,nM,CNc1cc2c(Nc3cccc(Br)c3)ncnc2cn1,InChI=1S/C14H12BrN5/c1-16-13-6-11-12(7-17-13)1...,MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED...,8,CHEMBL1132555,1999.0,"Showalter HD, Bridges AJ, Zhou H, Sercel AD, M...",P00533
141,7726,1054530,CHEMBL683040,CHEMBL203,CHEMBL66031,0,pIC50,11.09691,nM,Brc1cccc(Nc2ncnc3cc4[nH]cnc4cc23)c1,InChI=1S/C15H10BrN5/c16-9-2-1-3-10(4-9)21-15-1...,MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED...,9,CHEMBL1129035,1996.0,"Rewcastle GW, Palmer BD, Bridges AJ, Showalter...",P00533
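The `head()` output above suggests that the rows are ordered by decreasing activity value, so the first 100 rows would contain only the most potent measurements. If a more representative test set is wanted, a seeded random sample is one option. This is a sketch on a made-up table, not a change to the notebook's sampling; the seed `42` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical table sorted by decreasing activity, mimicking the ordering seen above.
toy = pd.DataFrame(
    {"activities.standard_value": [float(v) for v in range(1000, 0, -1)]}
)

# Taking the first rows keeps only the top of the sorted table ...
head_sample = toy.iloc[:100]

# ... whereas a seeded random sample draws from the whole range reproducibly.
random_sample = toy.sample(n=100, random_state=42)

print(head_sample["activities.standard_value"].min())
print(random_sample.shape)
```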


We save the sample to a CSV file:

In [13]:
# Write the 100-row sample, not the full EGFR table
EGFR_sample.to_csv(OUT / f"EGFR-activities-chembl{CHEMBL_VERSION}-sample.csv")

And compress it:

In [14]:
%%bash
cd _out
zip -r EGFR-activities-chembl29-sample.zip EGFR-activities-chembl29-sample.csv

  adding: EGFR-activities-chembl29-sample.csv (deflated 89%)
