
Replace SMILES input to Coulomb matrix #56

Open
jeffrey9909 opened this issue Jan 20, 2017 · 6 comments

@jeffrey9909 commented Jan 20, 2017

I am working on changing the input from SMILES to Coulomb matrices. 200 Coulomb matrices (29×29) with their HOMO-LUMO gaps have been produced and saved into a .h5 file with the following code:

# Saving in .h5 format
import h5py

h5f = h5py.File('processed.h5', 'w')
h5f.create_dataset('homo_lumo_gaps', data=homo_lumo_gaps)
h5f.create_dataset('padded_coulomb_matrices', data=padded_coulomb_matrices)
h5f.close()

I then tried to run train.py directly with the generated processed.h5, and it gives me this error message:

KeyError: "Unable to open object (Object 'data_train' doesn't exist)"
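Listing the keys in the file confirms it only contains the two datasets I created, so there is no 'data_train' for train.py to find:

# Show which datasets the file actually contains; with the save code above
# this prints ['homo_lumo_gaps', 'padded_coulomb_matrices'].
import h5py
with h5py.File('processed.h5', 'r') as h5f:
    print(list(h5f.keys()))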

I think the problem is that the way I save the file differs from the original preprocess.py... but I don't fully understand the original approach, so I don't know how I should modify my code.
The preprocess.py I am using is here:

https://docs.google.com/document/d/17f9n7tzeadpCo0_pit548QiU1-Loib2opMcm0I4MxzQ/edit?usp=sharing

Also, apart from the "naming" problem I mentioned, I would like to know: will the NN work as I expect if I directly feed in the Coulomb matrices in place of the SMILES strings? Is there any part of the code I will need to modify? Thank you.
I know that this is not a good way to ask questions, but I really need some help. Any help is appreciated. Thank you.

@pechersky (Collaborator)

The h5 file is expected to have two datasets: "data_train" and "data_test".

You might want to do something like this, from the repo's preprocess.py:

train_idx, test_idx = map(np.array, train_test_split(structures.index, test_size = 0.20))

Then, using a chunking function we defined, we do:

    create_chunk_dataset(h5f, 'data_train', train_idx,
                         (len(train_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
    create_chunk_dataset(h5f, 'data_test', test_idx,
                         (len(test_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
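Ignoring the chunking, those two calls amount to roughly the following (a sketch, not the repo code, assuming structures is a pandas Series of SMILES strings):

# Rough, un-chunked equivalent: one-hot encode every training SMILES string
# into a (120, len(charset)) array and store the stacked result.
data_train = np.array([one_hot_encoded_fn(s) for s in structures[train_idx]])
h5f.create_dataset('data_train', data=data_train)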

@jeffrey9909 (Author) commented Jan 20, 2017

Thanks for your answer.
Do you mean I should rename my coulomb_matrix to data_test and data_train (which would mean they are the same)?
And what should I do with my HOMO-LUMO gaps?
I have tried to do what you mentioned, but I don't really get this part:

 train_idx, test_idx = map(np.array, train_test_split(structures.index, test_size = 0.20))

and this part

    create_chunk_dataset(h5f, 'data_train', train_idx,
                         (len(train_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
    create_chunk_dataset(h5f, 'data_test', test_idx,
                         (len(test_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))

That is why I failed to rename my dataset.

@pechersky (Collaborator)

What I mean is that you have to split your dataset of coulomb matrices into a train set and a test set. A helper function to do that is the train_test_split function from sklearn.model_selection. Then, you would do something like

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(padded_coulomb_matrices, test_size=0.20)
h5f.create_dataset('data_train', data=train_data)
h5f.create_dataset('data_test', data=test_data)
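If you also want to keep your HOMO-LUMO gaps paired with the same split, train_test_split accepts multiple arrays and splits them consistently (a sketch; the *_gaps dataset names below are placeholders, not keys that train.py reads):

# Split matrices and gaps together so corresponding rows stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    padded_coulomb_matrices, homo_lumo_gaps, test_size=0.20)
h5f.create_dataset('data_train', data=X_train)
h5f.create_dataset('data_test', data=X_test)
h5f.create_dataset('homo_lumo_gaps_train', data=y_train)  # placeholder name
h5f.create_dataset('homo_lumo_gaps_test', data=y_test)    # placeholder name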

Either way, that will create an h5 with your data split into train and test as expected by train.py. However, note that train.py (and all the other scripts in the repo) uses a particular network topology that probably won't work with the shape of your data. The model is defined at https://github.com/maxhodak/keras-molecules/blob/master/molecules/model.py. As you can see, the dimensions of the input tensors as defined are (max_length, len(charset)) = (120, 51), and each row is expected to be one-hot.
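Concretely, pointing the model at your data would at minimum mean changing that input shape; a bare Keras sketch (the downstream layer sizes would also need retuning, and none of this is in the repo as-is):

from keras.layers import Input

# The repo's encoder input is (max_length, len(charset)) = (120, 51),
# one one-hot row per character. A padded 29x29 Coulomb matrix would
# instead have 29 rows of 29 real values, not one-hot rows:
x = Input(shape=(29, 29))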

@jeffrey9909 (Author)

Oh, I think I get it.
Thank you for your help.
I will try to fix it tomorrow.
Thanks again.

@jeffrey9909 (Author)

Just a question: how was the latent_dim determined to be 292?
I am trying to modify the code by myself at the moment, but I have no idea about this...
Thanks.

@pechersky (Collaborator) commented Jan 24, 2017 via email
