Can you provide the SMILES with labels for the ZINC dataset? #56

DomInvivo · 2020-12-17T15:41:00Z

Thank you for your work in providing a standard repository for graph benchmarking.

Some applications might require using the SMILES to build different types of graphs than the one provided by the benchmarking platform. I know that the ZINC dataset comes from the JT-VAE paper, where the SMILES are provided. However, the dataset used here is only a subset of the original one, and the train/val/test split is different.

I tried going from the DGLGraph back to SMILES, but it is not possible since I don't know which node label corresponds to which atom.

vijaydwivedi75 · 2020-12-18T16:27:50Z

Hi @DomInvivo,
Thank you for the query.

Quick reply for:

D: I tried going from the DGLGraph back to SMILES, but it is not possible since I don't know which node label corresponds to which atom.

Please download the molecules_zinc_full.zip file from the link provided in this notebook. The zip contains library-agnostic ZINC data pickle files along with atom_dict.pickle and bond_dict.pickle.
Read the atom and bond dicts from the respective pickle files using this snippet to find out which atom each node label corresponds to.

import pickle


class Dictionary:
    """
    word2idx and idx2word are mappings from words to idx and vice versa.
    word2idx is a dictionary.
    idx2word is a list.
    word2num_occurence counts the number of times a given word has been added to the dictionary.
    idx2num_occurence does the same, but with the index of the word rather than the word itself.
    """

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        self.word2num_occurence = {}
        self.idx2num_occurence = []


# Folder containing atom_dict.pickle and bond_dict.pickle.
data_folder = './'

# Dictionary is defined above so that pickle can rebuild the stored objects.
with open(data_folder + "atom_dict.pickle", "rb") as f:
    atom_dict = pickle.load(f)
with open(data_folder + "bond_dict.pickle", "rb") as f:
    bond_dict = pickle.load(f)

# Print the word -> index mappings for atoms and bonds.
print(atom_dict.word2idx)
print(bond_dict.word2idx)

OUTPUT:

{'C': 0, 'O': 1, 'N': 2, 'F': 3, 'C H1': 4, 'S': 5, 'Cl': 6, 'O -': 7, 'N H1 +': 8, 'Br': 9, 'N H3 +': 10, 'N H2 +': 11, 'N +': 12, 'N -': 13, 'S -': 14, 'I': 15, 'P': 16, 'O H1 +': 17, 'N H1 -': 18, 'O +': 19, 'S +': 20, 'P H1': 21, 'P H2': 22, 'C H2 -': 23, 'P +': 24, 'S H1 +': 25, 'C H1 -': 26, 'P H1 +': 27}
{'NONE': 0, 'SINGLE': 1, 'DOUBLE': 2, 'TRIPLE': 3}
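
With these mappings, integer node and edge labels can be translated back to atom and bond tokens. The following is only a minimal sketch: it inlines the two dictionaries printed above so that it runs without the pickle files, and the helper names decode_atoms and decode_bonds are hypothetical, not part of the benchmarking code. With the pickles loaded, atom_dict.idx2word should presumably give the same lookup.

```python
# Mappings copied from the printed output above (word -> index).
atom_word2idx = {
    'C': 0, 'O': 1, 'N': 2, 'F': 3, 'C H1': 4, 'S': 5, 'Cl': 6, 'O -': 7,
    'N H1 +': 8, 'Br': 9, 'N H3 +': 10, 'N H2 +': 11, 'N +': 12, 'N -': 13,
    'S -': 14, 'I': 15, 'P': 16, 'O H1 +': 17, 'N H1 -': 18, 'O +': 19,
    'S +': 20, 'P H1': 21, 'P H2': 22, 'C H2 -': 23, 'P +': 24,
    'S H1 +': 25, 'C H1 -': 26, 'P H1 +': 27,
}
bond_word2idx = {'NONE': 0, 'SINGLE': 1, 'DOUBLE': 2, 'TRIPLE': 3}

# Invert the mappings (index -> word).
idx2atom = {idx: word for word, idx in atom_word2idx.items()}
idx2bond = {idx: word for word, idx in bond_word2idx.items()}


def decode_atoms(node_labels):
    """Hypothetical helper: map integer node labels to atom tokens."""
    return [idx2atom[int(label)] for label in node_labels]


def decode_bonds(edge_labels):
    """Hypothetical helper: map integer edge labels to bond type names."""
    return [idx2bond[int(label)] for label in edge_labels]


# Example with made-up labels, not taken from an actual ZINC molecule.
print(decode_atoms([0, 0, 2, 1, 7]))
print(decode_bonds([1, 2, 1]))
```

The atom tokens appear to combine the element symbol with an optional hydrogen count and charge (for example 'N H1 +'), so rebuilding a molecule from them would need that extra parsing step.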

--
I will get back to you soon about the SMILES data!

DomInvivo · 2021-01-20T00:54:29Z

Hey,
Thanks again for your answer! Did you manage to find the SMILES data of the ZINC dataset?

vijaydwivedi75 · 2021-01-20T12:48:16Z

Hi @DomInvivo,

The SMILES are here in the paper's repo, as you already mentioned.
Corresponding to this, the full dataset (ZINC-full; 249K) is in this benchmarking repo. The order of molecules is the same.
The indices that correspond to the molecules in the subset (ZINC; 12K) are in this folder.
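
Since the molecule order is the same, the SMILES for the subset can be recovered by indexing into the full SMILES list. A rough sketch of that lookup follows. It assumes the SMILES have already been read into a list in their original order and that the indices are plain integer row positions. The function name select_smiles and the toy data are hypothetical; the actual file names and formats should be checked in the two repos.

```python
def select_smiles(all_smiles, indices):
    """Return the SMILES at the given row positions, in the order of `indices`.

    Assumes `all_smiles` is in the same order as the ZINC-full molecules.
    """
    return [all_smiles[i] for i in indices]


# Toy stand-ins for the full SMILES list and one split's index list.
all_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CN"]
subset_indices = [2, 0]

print(select_smiles(all_smiles, subset_indices))
```

Applying this once per index list (train, val, test) would give SMILES that line up with the benchmark's own splits.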

Sorry for the late clarification.
Best Regards,
Vijay

DomInvivo · 2021-01-27T15:37:03Z

Thanks for your answer, this is helpful :)

vijaydwivedi75 closed this as completed Jan 29, 2021
