Can you provide the SMILES with labels for the ZINC dataset? #56

Closed
DomInvivo opened this issue Dec 17, 2020 · 4 comments

@DomInvivo

Thank you for your work in providing a standard repository for graph benchmarking.

Some applications might require using the SMILES to build different types of graphs than the ones provided by the benchmarking platform. I know that the ZINC dataset comes from the JT-VAE paper, where the SMILES are provided. However, the benchmark uses only a subset of the original dataset, and its train-val-test split is different.

I tried going from the DGLGraph back to SMILES, but it is not possible since I don't know which node label corresponds to which atom.

@vijaydwivedi75
Member

Hi @DomInvivo,
Thank you for the query.

Quick reply for:

D: I tried going from the DGLGraph back to SMILES, but it is not possible since I don't know which node label corresponds to which atom.

  1. Please download the molecules_zinc_full.zip file from the link provided in this notebook. The zip contains library-agnostic ZINC data pickle files along with atom_dict.pickle and bond_dict.pickle.
  2. Read the atom and bond dicts from the respective pickle files using this snippet to find out which atom each node label corresponds to.
import pickle

# The pickled objects appear to be instances of this class, so it must be
# defined (under the same name) before calling pickle.load.
class Dictionary:
    """
    word2idx and idx2word are mappings from words to indices and vice versa.
    word2idx is a dictionary.
    idx2word is a list.
    word2num_occurence counts the number of times a given word has been added to the dictionary.
    idx2num_occurence does the same, but by the index of the word rather than the word itself.
    """

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        self.word2num_occurence = {}
        self.idx2num_occurence = []

data_folder = './'

# Load the atom and bond vocabularies shipped in the zip.
with open(data_folder + "atom_dict.pickle", "rb") as f:
    atom_dict = pickle.load(f)
with open(data_folder + "bond_dict.pickle", "rb") as f:
    bond_dict = pickle.load(f)

# Print the token -> integer label mappings.
print(atom_dict.word2idx)
print(bond_dict.word2idx)

OUTPUT:

{'C': 0, 'O': 1, 'N': 2, 'F': 3, 'C H1': 4, 'S': 5, 'Cl': 6, 'O -': 7, 'N H1 +': 8, 'Br': 9, 'N H3 +': 10, 'N H2 +': 11, 'N +': 12, 'N -': 13, 'S -': 14, 'I': 15, 'P': 16, 'O H1 +': 17, 'N H1 -': 18, 'O +': 19, 'S +': 20, 'P H1': 21, 'P H2': 22, 'C H2 -': 23, 'P +': 24, 'S H1 +': 25, 'C H1 -': 26, 'P H1 +': 27}
{'NONE': 0, 'SINGLE': 1, 'DOUBLE': 2, 'TRIPLE': 3}
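As a rough illustration of how the mapping above could be used to go from integer node labels back to atoms, here is a minimal sketch. It hard-codes a few entries copied from the OUTPUT instead of loading the pickle, and the helper names `parse_atom_token` and `decode_node_labels` are hypothetical. It also assumes that each token has the form "Element [H<n>] [+|-]", where `H<n>` is a hydrogen count and a lone `+` or `-` is a formal charge of magnitude 1; that reading is inferred from the printed tokens, not confirmed by the repo.

```python
# A few entries copied from the atom_dict.word2idx OUTPUT above (not the full table).
atom_word2idx = {'C': 0, 'O': 1, 'N': 2, 'F': 3, 'C H1': 4,
                 'S': 5, 'Cl': 6, 'O -': 7, 'N H1 +': 8}

# Invert the mapping: integer node label -> atom token.
idx2atom = {idx: word for word, idx in atom_word2idx.items()}

def parse_atom_token(token):
    """Split a token such as 'N H1 +' into (element, h_count, formal_charge).

    Assumption: tokens are space-separated as 'Element [H<n>] [+|-]'.
    """
    parts = token.split()
    element = parts[0]
    num_h = 0
    charge = 0
    for part in parts[1:]:
        if part.startswith('H'):
            num_h = int(part[1:])
        elif part == '+':
            charge = 1
        elif part == '-':
            charge = -1
    return element, num_h, charge

def decode_node_labels(labels):
    """Map a sequence of integer node labels to parsed atom descriptions."""
    return [parse_atom_token(idx2atom[i]) for i in labels]

decoded = decode_node_labels([0, 8, 7])
print(decoded)
```

With the real data, `atom_dict.idx2word` from the snippet above could presumably replace the hand-built `idx2atom`, since the docstring describes it as the index-to-word list.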

--
I will get back to you soon about the SMILES data!

@DomInvivo
Author

Hey,
Thanks again for your answer! Did you manage to find the SMILES data of the ZINC dataset?

@vijaydwivedi75
Member

vijaydwivedi75 commented Jan 20, 2021

Hi @DomInvivo,

  • The SMILES are here in the paper's repo, as you already mentioned.

  • Corresponding to this, the full dataset (ZINC-full; 249K) is in this benchmarking repo. The order of the molecules is the same.

  • The indices that correspond to the molecules in the subset (ZINC; 12K) are in this folder.
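Since the molecule order is stated to be the same, the SMILES for the 12K subset could in principle be recovered by indexing into the full SMILES list. The sketch below shows only that indexing step with placeholder data. The helper `select_subset` is hypothetical, the SMILES strings and indices are made up, and it assumes the index files hold 0-based positions into the full list, which should be checked against the actual files.

```python
def select_subset(full_smiles, indices):
    """Return the SMILES at the given positions, in the order of `indices`.

    Assumption: `indices` are 0-based positions into `full_smiles`.
    """
    return [full_smiles[i] for i in indices]

# Placeholder data standing in for the full SMILES list and one split's index file.
full_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
train_indices = [2, 0]

subset = select_subset(full_smiles, train_indices)
print(subset)
```

Applying the same function to the train, val and test index lists would give one SMILES list per split, aligned with the graphs in the benchmark subset.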

Sorry for the late clarification.
Best Regards,
Vijay

@DomInvivo
Author

Thanks for your answer, this is helpful :)
