FDA-approved small drug 3D structure dataset

FDA-approved small drug 3D structure dataset

FDA-approved small drug 3D structure dataset

1600 FDA-approved small molecule drugs

http://www.drugbank.ca/downloads#structures
./dat/approved.txt

Molecule sizes

average heavy atom numbers and standard deviation

script

./script/count_atms.py
result

./dat/approved_heavy_atom_num.txt ./dat/elements.txt

DONE 3D structures of the chemicals

DONE Drug Bank id to pdb id

pdb –> uniprt

./dat/pdbsws_chain.txt
drug bank id –> uniprt ids
1. ./script/crawler.py
2. ./dat/drugid_uniprturls.dat
uniprt id –> pdb id
1. ./script/uniprt_pdb.py
2. ./dat/uniprt_to_pdb.dat
drug bank id –> pdb id
1. ./script/query_pdb.py
2. ./dat/drugid_pdb.dat
3. 82773 pdbs
4. 138078 chains

table

import pickle
import pandas as pd

dat_ifn = "./dat/drugid_pdb.dat"

with open(dat_ifn, 'r') as f:
    dat = pickle.load(f)

dset = pd.DataFrame({key: list(val) for key, val in dat.iteritems()})

DONE Filter the unbound proteins

DONE Extract the ligand atoms from the CIF files
1. ./script/qmol_CIF2PDB.pl (not compatible with latest perl modules)
2. work/jaydy/dat/fda_pdb_mb

DONE Cluster drug ligands

cmd

$ cd ~/Workspace/Bitbucket/fda-approved-drugs/dat/
$ python /home/jaydy/Workspace/Bitbucket/geauxcluster/src/geauxcluster.py -i approved.txt -o approved.txt.clusters -k DRUGBANK_ID

script ./script/clustering.py

DONE Compare the pdb ligand with drug bank ligand
1. calculate tanimoto coefficient using Kcombu
  1. ./script/calculateTanimota.py
  2. pkcombu Segmentation fault for
    1. DB01049
    2. to check pkcombu -A /work/jaydy/working/kcombu_run/DB01049_fda.sdf -B /work/jaydy/dat/fda_pdb_mb/ib/3ibdA.LG3.pdb
    3. DB00707
2. refuse proteins if their ligand's tanimoto < 0.9
  1. ./script/calculateTanimoto.py
  2. TANI_DIR = "/work/jaydy/working/tanimoto"

TODO Cluster the bound proteins based on sequence similarity

DONE add missing atoms

./script/fixPrt.py
DONE clean the proteins sequences

python ./script/pdb_to_fasta.py > run.sh sh ./script/run.sh
1. ctrip, heavy atom model
  1. /project/michal/apps/jackal_64bit/bin/ctrip
2. ./script/pdb2pdb.pl move pdb sequence to start at 1
3. ./script/pdb2fasta.pl convert pdb to fasta
4. ./script/fasta2fasta.pl break the lines at 80
DONE cd-hit to cluster the sequences
1. do not count the prt with seq length > 600
2. cluster python ./script/cluster_seq.py > ./script/cluster_seq.sh cd /work/jaydy/working/cluster sh /home/jaydy/Workspace/Bitbucket/fda-approved-drugs/script/cluster_seq.sh
3. collect /home/jaydy/Workspace/Bitbucket/fda-approved-drugs/script/collect.py

TODO

original data set of 1554 fda-approved drugs
1. script ./script/count_atms.py ./dat/drug_size.txt
2. result
  
  count 1554.000000
  
  mean 26.257400
  
  std 18.172654
  
  min 1.000000
  
  25% 18.000000
  
  50% 23.000000
  
  75% 30.000000
  
  max 419.000000
3. one atom drugs
  
  DrugBank ID Chem
  
  DB01356 Li
  
  DB01370 Al
  
  DB01592 Fe
  
  DB01593 Zn
4. 2 σ range (0, 62.6)
5. 1 σ range (8, 44)

after processing

script ./script/drug_size.py
result

statistics #HeavyAtom

count 197.000000

mean 23.928934

std 12.470170

min 6.000000

25% 16.000000

50% 21.000000

75% 29.000000

max 93.000000

within the range of (8, 44)

dat ./dat/representative_drugs.csv

stats #HeavyAtom

count 186.000000

mean 22.639785

std 8.456204

min 8.000000

25% 16.000000

50% 21.000000

75% 29.000000

max 44.000000

3D structures

protein bound ligands

import itertools
import shutil
import os
import pandas as pd
import cPickle

represents_ifn = "./dat/representative_drugs.csv"
dat_ifn = "./dat/dat.pkl"

with open(dat_ifn, 'r') as f:
    dset = cPickle.load(f)

represents = pd.read_csv(represents_ifn, index_col=0)

inter = dset[dset.DRUGBANK_ID.isin(represents.id)]
inter.to_csv("./dat/fda_drugs_prt_lig.csv")

drugs = {}
for index, row in inter.iterrows():
    drugs[row.DRUGBANK_ID] = row.ProteinBoundLig.split()

complexes = set(itertools.chain(*drugs.values()))
print "# complexes %d" % (len(complexes))

proteins = set([complex_id.split('.')[0][:4] for complex_id in complexes])
print "# proteins %d" % (len(proteins))

# retrive the 3D structures
DAT_DIR = "/work/jaydy/dat/fda_pdb_mb"
STRUCTURE_DIR = "/work/jaydy/dat/fda_drug_structures"

def getPaths(ligand_id):
    mid_two = ligand_id[1:3]
    ligand_path = os.path.join(DAT_DIR, mid_two, ligand_id)
    prt_id = ligand_id.split('.')[0] + '.pdb'
    prt_path = os.path.join(DAT_DIR, mid_two, prt_id)
    return ligand_path, prt_path

def copyFiles(drug_id):
    dat_dir = os.path.join(STRUCTURE_DIR, drug_id)
    os.makedirs(dat_dir)
    for ligand_id in drugs[drug_id]:
        ligand_path, prt_path = getPaths(ligand_id)
        shutil.copy(ligand_path, dat_dir)
        shutil.copy(prt_path, dat_dir)

    print drug_id, "done"


for drug_id in drugs:
    copyFiles(drug_id)

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
dat		dat
script		script
src		src
readme.md		readme.md
readme.org		readme.org

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dat

dat

script

script

src

src

readme.md

readme.md

readme.org

readme.org

Repository files navigation

FDA-approved small drug 3D structure dataset

1600 FDA-approved small molecule drugs

Molecule sizes

average heavy atom numbers and standard deviation

DONE 3D structures of the chemicals

DONE Drug Bank id to pdb id

DONE Filter the unbound proteins

TODO Cluster the bound proteins based on sequence similarity

About

Releases

Packages

Languages

count	1554.000000
mean	26.257400
std	18.172654
min	1.000000
25%	18.000000
50%	23.000000
75%	30.000000
max	419.000000

statistics	#HeavyAtom
count	197.000000
mean	23.928934
std	12.470170
min	6.000000
25%	16.000000
50%	21.000000
75%	29.000000
max	93.000000

stats	#HeavyAtom
count	186.000000
mean	22.639785
std	8.456204
min	8.000000
25%	16.000000
50%	21.000000
75%	29.000000
max	44.000000

DrugBank ID	Chem
DB01356	Li
DB01370	Al
DB01592	Fe
DB01593	Zn

omagebright/FDA-approved

Folders and files

Latest commit

History

Repository files navigation

FDA-approved small drug 3D structure dataset

1600 FDA-approved small molecule drugs

Molecule sizes

average heavy atom numbers and standard deviation

DONE 3D structures of the chemicals

DONE Drug Bank id to pdb id

DONE Filter the unbound proteins

TODO Cluster the bound proteins based on sequence similarity

About

Resources

Stars

Watchers

Forks

Languages