## ExCAPE-DB

Explore the ExCAPE-DB dataset. Processing the TSV file with pandas is slow, so we may consider converting the dataset to SQLite if we need to do heavier analysis.
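The SQLite conversion mentioned above could look roughly like the sketch below. It is not part of the original analysis: the function name `tsv_to_sqlite`, the table name `excape`, the chunk size, and the choice of `Entrez_ID` as an indexed column are all assumptions made for illustration. It streams the file in chunks so the whole TSV never has to sit in memory at once.

```python
import sqlite3

import pandas as pd


def tsv_to_sqlite(tsv_path, db_path, table="excape",
                  index_cols=("Entrez_ID",), chunksize=100_000):
    """Stream a TSV file into a SQLite table and return the row count.

    Hypothetical helper: table name, index columns and chunk size are
    illustrative defaults, not values taken from the notebook.
    Running it twice on the same database appends the rows twice.
    """
    rows = 0
    conn = sqlite3.connect(db_path)
    try:
        # Read the file piece by piece instead of loading it all at once.
        for chunk in pd.read_csv(tsv_path, sep="\t", chunksize=chunksize):
            chunk.to_sql(table, conn, if_exists="append", index=False)
            rows += len(chunk)
        # Index the columns we expect to filter on most often.
        for col in index_cols:
            conn.execute(
                f'CREATE INDEX IF NOT EXISTS "idx_{table}_{col}" '
                f'ON "{table}" ("{col}")'
            )
        conn.commit()
    finally:
        conn.close()
    return rows
```

Once loaded, a query such as `SELECT COUNT(*) FROM excape WHERE Entrez_ID = 1956` runs inside SQLite and can use the index, instead of scanning a DataFrame in Python.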

In [51]:
import pandas as pd
import numpy as np
import re
from collections import Counter
from random import sample

In [4]:
df = pd.read_csv("pubchem.chembl.dataset4publication_inchi_smiles.tsv", sep='\t')

In [5]:
df.head()

Unnamed: 0,Ambit_InchiKey,Original_Entry_ID,Entrez_ID,Activity_Flag,pXC50,DB,Original_Assay_ID,Tax_ID,Gene_Symbol,Ortholog_Group,InChI,SMILES
0,AAAAZQPHATYWOK-YRBRRWAQNA-N,11399331,2064,A,7.19382,pubchem,248914,9606,ERBB2,1346,InChI=1/C32H29ClN6O3S/c1-4-41-28-16-25-22(15-2...,ClC=1C=C(NC=2C=3C(N=CC2C#N)=CC(OCC)=C(NC(=O)/C...
1,AAAAZQPHATYWOK-YRBRRWAQNA-N,CHEMBL175513,1956,A,6.73,chembl20,312997,9606,EGFR,1260,InChI=1/C32H29ClN6O3S/c1-4-41-28-16-25-22(15-2...,C1=2C(=C(C#N)C=NC1=CC(=C(C2)NC(/C=C/CN(C)C)=O)...
2,AAABHMIRDIOYOK-NPVYFSBINA-N,CHEMBL1527551,10919,N,4.55,chembl20,737344,9606,EHMT2,6822,InChI=1/C18H14N6O3/c1-23-10-15(24(26)27)16(22-...,O=C(NC=1C=C2N=C(NC2=CC1)C=3C=CC=CC3)C4=NN(C=C4...
3,AAABHMIRDIOYOK-NPVYFSBINA-N,CHEMBL1527551,19885,A,5.35,chembl20,688759,10090,RORC,3770,InChI=1/C18H14N6O3/c1-23-10-15(24(26)27)16(22-...,O=C(NC=1C=C2N=C(NC2=CC1)C=3C=CC=CC3)C4=NN(C=C4...
4,AAABHMIRDIOYOK-NPVYFSBINA-N,CHEMBL1527551,216,N,4.4,chembl20,688238,9606,ALDH1A1,143,InChI=1/C18H14N6O3/c1-23-10-15(24(26)27)16(22-...,O=C(NC=1C=C2N=C(NC2=CC1)C=3C=CC=CC3)C4=NN(C=C4...


In [24]:
assay_ids = df["Original_Assay_ID"].tolist()
len(set(assay_ids))

85307

In [25]:
target_ids = df["Entrez_ID"].tolist()
len(set(target_ids))

1667

In [26]:
Counter(df["Activity_Flag"])

Counter({'A': 1332426, 'N': 69517737})

## Check Overlap Using InChI

In [64]:
bone_df = pd.read_csv("/Users/JayWong/Programs/gitter_lab/pharmaco-image/data/test/meta_data/chemical_annotations_inchi.csv")
total_inchi = set(df["InChI"])
bone_inchi = set(bone_df["INCHI"])
print("There are {} compounds in U2OS, and {} compounds in ExCAPE.".format(len(bone_inchi),
                                                                           len(total_inchi)))

There are 30405 compounds in U2OS, and 998535 compounds in ExCAPE.


In [32]:
total_inchi & bone_inchi

set()

In [42]:
sample(list(bone_inchi), 10)  # random.sample needs a sequence; sets raise TypeError since Python 3.11

['InChI=1S/C22H18N2O5S2/c1-10-3-6-12(7-4-10)24-20(26)16-15(11-5-8-13(25)14(9-11)29-2)17-19(23-22(28)31-17)30-18(16)21(24)27/h3-9,15-16,18,25H,1-2H3,(H,23,28)',
 'InChI=1S/C15H15N/c1-2-6-12(7-3-1)15-11-16-10-13-8-4-5-9-14(13)15/h1-9,15-16H,10-11H2',
 'InChI=1S/C28H34FN3O5/c1-17-14-32(28(35)21-7-5-6-8-23(21)29)18(2)16-37-24-12-11-20(30-26(33)19-9-10-19)13-22(24)27(34)31(3)15-25(17)36-4/h5-8,11-13,17-19,25H,9-10,14-16H2,1-4H3,(H,30,33)/t17-,18-,25+/m1/s1',
 'InChI=1S/C17H16FNO3/c18-14-4-2-1-3-13(14)17(20)19-8-7-12-5-6-15-16(11-12)22-10-9-21-15/h1-6,11H,7-10H2,(H,19,20)',
 'InChI=1S/C10H10O4/c1-14-10(13)5-2-7-6-8(11)3-4-9(7)12/h2-6,11-12H,1H3',
 'InChI=1S/C25H23N5O4/c1-15-2-4-17(5-3-15)22-21-20(19(14-26)23(27)34-24(21)29-28-22)16-6-8-18(9-7-16)33-25(31)30-10-12-32-13-11-30/h2-9,20H,10-13,27H2,1H3,(H,28,29)',
 'InChI=1S/C31H42N4O6/c1-20-17-35(21(2)19-36)30(38)25-11-8-12-26(33-29(37)22-9-6-5-7-10-22)28(25)41-27(20)18-34(3)31(39)32-23-13-15-24(40-4)16-14-23/h8,11-16,20-22,27,36H,5-7,9-10,17-1

In [43]:
sample(list(total_inchi), 10)  # random.sample needs a sequence; sets raise TypeError since Python 3.11

['InChI=1/C23H14ClF2N3O/c24-18-11-15(9-10-19(18)26)27-23-28-21(13-5-7-14(25)8-6-13)17-12-30-20-4-2-1-3-16(20)22(17)29-23/h1-11H,12H2,(H,27,28,29)/f/h27H',
 'InChI=1/C20H21N3O5S/c1-22-17-7-6-16(12-18(17)28-20(22)25)29(26,27)21-10-8-19(24)23-11-9-14-4-2-3-5-15(14)13-23/h2-7,12,21H,8-11,13H2,1H3',
 'InChI=1/C26H27N3O8S/c1-16-11-22(17(2)29(16)20-7-10-24-25(12-20)37-15-36-24)23(31)14-35-26(32)13-28(4)38(33,34)21-8-5-19(6-9-21)27-18(3)30/h5-12H,13-15H2,1-4H3,(H,27,30)/f/h27H',
 'InChI=1/C20H25F3N4O2/c21-20(22,23)16-6-4-15(5-7-16)19-24-18(29-25-19)14-27-8-2-1-3-17(27)13-26-9-11-28-12-10-26/h4-7,17H,1-3,8-14H2',
 'InChI=1/C20H16Cl3N3O2/c21-11-1-2-12(16(22)6-11)14-8-24-9-15(14)20(28)26-18-5-10-3-4-25-19(27)13(10)7-17(18)23/h1-7,14-15,24H,8-9H2,(H,25,27)(H,26,28)/f/h26-27H',
 'InChI=1/C22H20N2O4/c1-27-19-10-4-8-17(13-19)23-21(25)15-6-3-7-16(12-15)22(26)24-18-9-5-11-20(14-18)28-2/h3-14H,1-2H3,(H,23,25)(H,24,26)/f/h23-24H',
 'InChI=1/C20H29N3O3/c1-15-7-5-9-18(16(15)2)26-14-11-21-10-3-4-12-23-19(24

The two sets share no compounds. Inspecting 10 random samples from each set shows why: the InChI strings we generated with `rdkit` are standard InChI (prefix `InChI=1S`), while ExCAPE-DB uses non-standard version 1 InChI (prefix `InChI=1`), so an exact string comparison cannot match.
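Rather than relying on 10 random samples, the prefix difference can be checked over a whole collection. The sketch below is an illustration only; `inchi_flavor` and `flavor_counts` are hypothetical helper names, and missing values (such as NaN entries read by pandas) are counted separately.

```python
from collections import Counter


def inchi_flavor(inchi):
    """Classify an InChI string by the prefix before the first '/'."""
    if not isinstance(inchi, str):
        return "missing"  # e.g. NaN values coming from pandas
    prefix = inchi.split("/", 1)[0]
    if prefix == "InChI=1S":
        return "standard"
    if prefix == "InChI=1":
        return "non-standard"
    return "other"


def flavor_counts(inchis):
    """Count how many strings fall into each prefix category."""
    return Counter(inchi_flavor(s) for s in inchis)
```

Calling `flavor_counts(bone_inchi)` and `flavor_counts(total_inchi)` would show whether each set uses a single format throughout.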

In [65]:
# Rewrite the standard prefix "InChI=1S" as "InChI=1", skipping missing values
modified_bone_inchi = {re.sub(r'^InChI=1S', 'InChI=1', s) for s in bone_inchi if not pd.isna(s)}
len(total_inchi & modified_bone_inchi)

9468

After naively removing the "S" from our InChI strings, we get 9468 compounds in common with ExCAPE-DB.
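Several of the ExCAPE-DB samples above end in a fixed-hydrogen layer (for example `/f/h27H`), which standard InChI strings never contain, so the prefix swap alone cannot match those entries. One possible refinement, which was not run against this data, is to also drop everything from `/f` onward before comparing. The helper name `comparable_inchi` is hypothetical, and this is a string-level heuristic under the assumption that the remaining layers are comparable; it is not equivalent to regenerating InChI with matching options, and distinct tautomers may collapse to the same key.

```python
def comparable_inchi(inchi, drop_fixed_h=True):
    """Return a simplified InChI string for approximate set comparison.

    Heuristic sketch only:
    - rewrites the standard prefix "InChI=1S/" as "InChI=1/"
    - optionally removes the fixed-H layer, i.e. "/f" and what follows
    Returns None for missing (non-string) values.
    """
    if not isinstance(inchi, str):
        return None
    standard_prefix = "InChI=1S/"
    if inchi.startswith(standard_prefix):
        inchi = "InChI=1/" + inchi[len(standard_prefix):]
    if drop_fixed_h:
        # Layer markers are lowercase, so "/f" does not clash with a
        # formula layer, which starts with an uppercase element symbol.
        inchi = inchi.split("/f", 1)[0]
    return inchi
```

Both sides would be mapped through the same function, for example `{comparable_inchi(s) for s in total_inchi} - {None}`, before taking the intersection.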