3D chemical structures of the FDA-approved small molecules

omagebright/FDA-approved

 
 

3D structure dataset of FDA-approved small-molecule drugs

1600 FDA-approved small molecule drugs

  1. http://www.drugbank.ca/downloads#structures
  2. ./dat/approved.txt

Molecule sizes

Mean and standard deviation of the heavy-atom count per drug

  1. script

    ./script/count_atms.py

  2. result

    ./dat/approved_heavy_atom_num.txt ./dat/elements.txt
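The contents of ./script/count_atms.py are not shown here. As a rough sketch of what counting heavy atoms can look like, and assuming the DrugBank structures are stored as V2000 molfile/SDF records, the snippet below counts every atom that is not a hydrogen isotope. The names `count_heavy_atoms` and `make_atom_line` are hypothetical and the water coordinates are made up.

```python
def count_heavy_atoms(molblock):
    """Count non-hydrogen atoms in one V2000 molfile record."""
    lines = molblock.splitlines()
    # Line 4 is the counts line; its first three columns hold the atom count.
    n_atoms = int(lines[3][0:3])
    heavy = 0
    for atom_line in lines[4:4 + n_atoms]:
        # The element symbol sits in columns 32-34 (1-based) of an atom line.
        symbol = atom_line[31:34].strip()
        if symbol not in ("H", "D", "T"):
            heavy += 1
    return heavy


def make_atom_line(x, y, z, symbol):
    """Format one V2000 atom line (helper used only for the toy record)."""
    return "%10.4f%10.4f%10.4f %-3s 0  0  0  0  0  0  0  0  0  0  0  0" % (
        x, y, z, symbol)


# Toy record: water with explicit hydrogens (coordinates are illustrative).
water = "\n".join([
    "water",
    "  sketch",
    "",
    "%3d%3d  0  0  0  0  0  0  0  0999 V2000" % (3, 2),
    make_atom_line(0.0, 0.0, 0.0, "O"),
    make_atom_line(0.9572, 0.0, 0.0, "H"),
    make_atom_line(-0.24, 0.9266, 0.0, "H"),
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
])

print(count_heavy_atoms(water))
```

Reading the symbol from fixed columns rather than splitting on whitespace avoids miscounts when large coordinates run together without a separating space.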

DONE 3D structures of the chemicals

DONE DrugBank ID to PDB ID

  1. PDB –> UniProt

    ./dat/pdbsws_chain.txt

  2. DrugBank ID –> UniProt IDs

    1. ./script/crawler.py
    2. ./dat/drugid_uniprturls.dat
  3. UniProt ID –> PDB ID

    1. ./script/uniprt_pdb.py
    2. ./dat/uniprt_to_pdb.dat
  4. DrugBank ID –> PDB ID

    1. ./script/query_pdb.py
    2. ./dat/drugid_pdb.dat
    3. 82773 pdbs
    4. 138078 chains
  5. table

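    import pickle
    import pandas as pd

    dat_ifn = "./dat/drugid_pdb.dat"

    # Pickle files must be opened in binary mode.
    with open(dat_ifn, 'rb') as f:
        dat = pickle.load(f)

    # One column per DrugBank ID. Wrapping each list in a Series pads
    # columns of unequal length with NaN instead of raising an error.
    dset = pd.DataFrame({key: pd.Series(list(val)) for key, val in dat.items()})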
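The chain DrugBank ID –> UniProt ID –> PDB ID amounts to composing two one-to-many mappings. The scripts that do this are not reproduced here, so the following is only a minimal sketch under the assumption that both mappings are available as dictionaries of ID collections. `compose_mappings` is a hypothetical name and all IDs in the example are placeholders.

```python
from collections import defaultdict


def compose_mappings(drug_to_uniprot, uniprot_to_pdb):
    """Join drug -> UniProt and UniProt -> PDB into drug -> set of PDB IDs."""
    drug_to_pdb = defaultdict(set)
    for drug_id, uniprot_ids in drug_to_uniprot.items():
        for uniprot_id in uniprot_ids:
            # UniProt entries without a known structure contribute nothing.
            drug_to_pdb[drug_id].update(uniprot_to_pdb.get(uniprot_id, ()))
    return dict(drug_to_pdb)


# Placeholder identifiers, not real database entries.
drug_to_uniprot = {"DRUG_A": ["UNIPROT_1", "UNIPROT_2"], "DRUG_B": ["UNIPROT_3"]}
uniprot_to_pdb = {"UNIPROT_1": ["PDB_1"], "UNIPROT_2": ["PDB_1", "PDB_2"]}

for drug_id, pdb_ids in sorted(compose_mappings(drug_to_uniprot, uniprot_to_pdb).items()):
    print(drug_id, sorted(pdb_ids))
```

Using sets removes the duplicates that arise when several UniProt entries of one drug point to the same PDB structure.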

DONE Filter the unbound proteins

  1. DONE Extract the ligand atoms from the CIF files

    1. ./script/qmol_CIF2PDB.pl (not compatible with latest perl modules)
    2. /work/jaydy/dat/fda_pdb_mb
  2. DONE Cluster drug ligands

    1. cmd

      $ cd ~/Workspace/Bitbucket/fda-approved-drugs/dat/
      $ python /home/jaydy/Workspace/Bitbucket/geauxcluster/src/geauxcluster.py -i approved.txt -o approved.txt.clusters -k DRUGBANK_ID
    2. script ./script/clustering.py

  3. DONE Compare the pdb ligand with drug bank ligand

    1. calculate the Tanimoto coefficient using KCOMBU (pkcombu)
      1. ./script/calculateTanimota.py
      2. pkcombu crashes with a segmentation fault for
        1. DB01049
        2. DB00707
        3. to reproduce: pkcombu -A /work/jaydy/working/kcombu_run/DB01049_fda.sdf -B /work/jaydy/dat/fda_pdb_mb/ib/3ibdA.LG3.pdb
    2. reject proteins whose bound ligand has a Tanimoto coefficient < 0.9 against the DrugBank ligand
      1. ./script/calculateTanimoto.py
      2. TANI_DIR = "/work/jaydy/working/tanimoto"
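The 0.9 cutoff above can be illustrated with a small sketch. It assumes the Tanimoto coefficient is defined over atom counts as common / (a + b - common), where common is the size of the shared substructure. That definition is an assumption about how the score is computed, and the code does not parse pkcombu output. The function names and the scores are hypothetical.

```python
def mcs_tanimoto(n_common, n_atoms_a, n_atoms_b):
    """Tanimoto coefficient from a common-substructure size and two molecule sizes."""
    union = n_atoms_a + n_atoms_b - n_common
    if union <= 0:
        raise ValueError("atom counts must describe non-empty molecules")
    return n_common / union


def keep_matching(scores, threshold=0.9):
    """Keep only the ligands whose score reaches the threshold."""
    return {ligand: s for ligand, s in scores.items() if s >= threshold}


# Made-up atom counts: (common atoms, PDB ligand atoms, DrugBank ligand atoms).
scores = {
    "ligand_1": mcs_tanimoto(20, 20, 20),  # identical heavy-atom graphs
    "ligand_2": mcs_tanimoto(18, 20, 20),  # two atoms differ on each side
}
print(keep_matching(scores))
```

With these numbers only `ligand_1` survives, since 18 / (20 + 20 - 18) is about 0.82.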

TODO Cluster the bound proteins based on sequence similarity

  1. DONE add missing atoms

    ./script/fixPrt.py

  2. DONE clean the protein sequences

    python ./script/pdb_to_fasta.py > run.sh
    sh ./script/run.sh

    1. ctrip, heavy atom model
      1. /project/michal/apps/jackal_64bit/bin/ctrip
    2. ./script/pdb2pdb.pl: renumber the PDB residues so the sequence starts at 1
    3. ./script/pdb2fasta.pl: convert PDB to FASTA
    4. ./script/fasta2fasta.pl: wrap the sequence lines at 80 characters
  3. DONE cd-hit to cluster the sequences

    1. exclude proteins with sequence length > 600
    2. cluster

      $ python ./script/cluster_seq.py > ./script/cluster_seq.sh
      $ cd /work/jaydy/working/cluster
      $ sh /home/jaydy/Workspace/Bitbucket/fda-approved-drugs/script/cluster_seq.sh
    3. collect: /home/jaydy/Workspace/Bitbucket/fda-approved-drugs/script/collect.py
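Two of the preparation steps above, dropping sequences longer than 600 residues and wrapping sequence lines at 80 characters, can be sketched in a few lines. This is not the content of the Perl or Python scripts in ./script; it is an illustration with hypothetical function names, and it does not run cd-hit itself.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_and_wrap(records, max_len=600, width=80):
    """Drop sequences longer than max_len and wrap the rest at width columns."""
    out = []
    for header, seq in records:
        if len(seq) > max_len:
            continue
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy input: one short and one over-long sequence (artificial residues).
toy = ">short\n" + "A" * 170 + "\n>long\n" + "G" * 601 + "\n"
print(filter_and_wrap(parse_fasta(toy)))
```

The filter is applied before wrapping so that the length check always sees the full, unwrapped sequence.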
  4. TODO

    1. original data set of 1554 FDA-approved drugs
      1. script ./script/count_atms.py ./dat/drug_size.txt

      2. result

        count 1554.000000
        mean 26.257400
        std 18.172654
        min 1.000000
        25% 18.000000
        50% 23.000000
        75% 30.000000
        max 419.000000
      3. single-atom drugs

        DrugBank ID Chem
        DB01356 Li
        DB01370 Al
        DB01592 Fe
        DB01593 Zn
      4. mean ± 2 σ range: (0, 62.6), with the lower bound clipped at 0

      5. mean ± 1 σ range: (8, 44)
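The two ranges follow directly from the mean and standard deviation reported above. A minimal sketch of the arithmetic, with a hypothetical helper `sigma_range` and the assumption that a negative lower bound is clipped to 0 because an atom count cannot be negative:

```python
def sigma_range(mean, std, k, floor=0.0):
    """Return (lower, upper) for mean +/- k*std, with the lower bound clipped at floor."""
    lower = max(floor, mean - k * std)
    upper = mean + k * std
    return lower, upper


# Values taken from the summary statistics of the 1554 drugs.
mean, std = 26.257400, 18.172654

one_sigma = sigma_range(mean, std, 1)
two_sigma = sigma_range(mean, std, 2)

print("1 sigma: (%.1f, %.1f)" % one_sigma)
print("2 sigma: (%.1f, %.1f)" % two_sigma)
```

Rounded to whole atoms, the 1 σ interval gives the (8, 44) window used for the size filter.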

    2. after processing
      1. script ./script/drug_size.py

      2. result

        statistics #HeavyAtom
        count 197.000000
        mean 23.928934
        std 12.470170
        min 6.000000
        25% 16.000000
        50% 21.000000
        75% 29.000000
        max 93.000000
      3. restricted to the 1 σ range (8, 44)

        1. dat ./dat/representative_drugs.csv

          stats #HeavyAtom
          count 186.000000
          mean 22.639785
          std 8.456204
          min 8.000000
          25% 16.000000
          50% 21.000000
          75% 29.000000
          max 44.000000
        2. 3D structures

          1. protein bound ligands

            import itertools
            import os
            import pickle
            import shutil

            import pandas as pd

            represents_ifn = "./dat/representative_drugs.csv"
            dat_ifn = "./dat/dat.pkl"

            # Pickle files must be opened in binary mode.
            with open(dat_ifn, 'rb') as f:
                dset = pickle.load(f)

            represents = pd.read_csv(represents_ifn, index_col=0)

            # Keep only the drugs that are in the representative set.
            inter = dset[dset.DRUGBANK_ID.isin(represents.id)]
            inter.to_csv("./dat/fda_drugs_prt_lig.csv")

            # Map each DrugBank ID to its whitespace-separated bound-ligand file names.
            drugs = {}
            for index, row in inter.iterrows():
                drugs[row.DRUGBANK_ID] = row.ProteinBoundLig.split()

            complexes = set(itertools.chain(*drugs.values()))
            print("# complexes %d" % len(complexes))

            # The first four characters of a ligand file name are the PDB ID.
            proteins = {complex_id.split('.')[0][:4] for complex_id in complexes}
            print("# proteins %d" % len(proteins))

            # Retrieve the 3D structures.
            DAT_DIR = "/work/jaydy/dat/fda_pdb_mb"
            STRUCTURE_DIR = "/work/jaydy/dat/fda_drug_structures"

            def getPaths(ligand_id):
                # Files are sharded by the middle two characters of the PDB ID.
                mid_two = ligand_id[1:3]
                ligand_path = os.path.join(DAT_DIR, mid_two, ligand_id)
                prt_id = ligand_id.split('.')[0] + '.pdb'
                prt_path = os.path.join(DAT_DIR, mid_two, prt_id)
                return ligand_path, prt_path

            def copyFiles(drug_id):
                dat_dir = os.path.join(STRUCTURE_DIR, drug_id)
                # exist_ok lets the script be re-run without failing.
                os.makedirs(dat_dir, exist_ok=True)
                for ligand_id in drugs[drug_id]:
                    ligand_path, prt_path = getPaths(ligand_id)
                    shutil.copy(ligand_path, dat_dir)
                    shutil.copy(prt_path, dat_dir)

                print(drug_id, "done")


            for drug_id in drugs:
                copyFiles(drug_id)
