
Replace SMILES input to Coulomb matrix #56

Open
jeffrey9909 opened this issue Jan 20, 2017 · 6 comments

@jeffrey9909 commented Jan 20, 2017

I am working on changing the input from SMILES to Coulomb matrices. 200 Coulomb matrices (29×29) with their HOMO-LUMO gaps have been produced and saved into a .h5 file with the following code:

# Saving in .h5 format
import h5py

h5f = h5py.File('processed.h5', 'w')
h5f.create_dataset('homo_lumo_gaps', data=homo_lumo_gaps)
h5f.create_dataset('padded_coulomb_matrices', data=padded_coulomb_matrices)
h5f.close()

I then tried to run train.py directly with the generated processed.h5, and it gives me this error message:

KeyError: "Unable to open object (Object 'data_train' doesn't exist)"
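Listing the keys in the file confirms it only contains the two datasets I created, so there is no 'data_train' for train.py to find:

# Show which datasets the file actually contains; with the save code above
# this prints ['homo_lumo_gaps', 'padded_coulomb_matrices'].
import h5py
with h5py.File('processed.h5', 'r') as h5f:
    print(list(h5f.keys()))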

I think the problem is that the way I save the file differs from the original preprocess.py... but I don't fully understand the original approach, so I don't know how I should modify my code.
The preprocess.py I am using is here:

https://docs.google.com/document/d/17f9n7tzeadpCo0_pit548QiU1-Loib2opMcm0I4MxzQ/edit?usp=sharing

Also, apart from the "naming" problem I mentioned, I would like to know: will the NN work as I expect if I directly feed in the Coulomb matrices in place of the SMILES strings? Is there any part of the code I will need to modify? Thank you.
I know that this is not a good way to ask questions, but I really need some help. Any help is appreciated. Thank you.

@pechersky (Collaborator)

The h5 file is expected to have two datasets: "data_train" and "data_test".

You might want to do something like this, from the repo's preprocess.py:

train_idx, test_idx = map(np.array, train_test_split(structures.index, test_size = 0.20))

Then, using a chunking function we defined, we do:

    create_chunk_dataset(h5f, 'data_train', train_idx,
                         (len(train_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
    create_chunk_dataset(h5f, 'data_test', test_idx,
                         (len(test_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
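Ignoring the chunking, those two calls amount to roughly the following (a sketch, not the repo code, assuming structures is a pandas Series of SMILES strings):

# Rough, un-chunked equivalent: one-hot encode every training SMILES string
# into a (120, len(charset)) array and store the stacked result.
data_train = np.array([one_hot_encoded_fn(s) for s in structures[train_idx]])
h5f.create_dataset('data_train', data=data_train)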

@jeffrey9909 (Author) commented Jan 20, 2017

Thanks for your answer.
Do you mean I should rename my coulomb_matrix to data_test and data_train (which would mean they are the same)?
And what should I do with my HOMO-LUMO gaps?
I have tried to do what you mentioned, but I don't really get this part:

 train_idx, test_idx = map(np.array, train_test_split(structures.index, test_size = 0.20))

and this part

    create_chunk_dataset(h5f, 'data_train', train_idx,
                         (len(train_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
    create_chunk_dataset(h5f, 'data_test', test_idx,
                         (len(test_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))

That is why I failed to rename my dataset.

@pechersky (Collaborator)

What I mean is that you have to split your dataset of coulomb matrices into a train set and a test set. A helper function to do that is the train_test_split function from sklearn.model_selection. Then, you would do something like

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(padded_coulomb_matrices, test_size=0.20)
h5f.create_dataset('data_train', data=train_data)
h5f.create_dataset('data_test', data=test_data)
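If you also want to keep your HOMO-LUMO gaps paired with the same split, train_test_split accepts multiple arrays and splits them consistently (a sketch; the *_gaps dataset names below are placeholders, not keys that train.py reads):

# Split matrices and gaps together so corresponding rows stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    padded_coulomb_matrices, homo_lumo_gaps, test_size=0.20)
h5f.create_dataset('data_train', data=X_train)
h5f.create_dataset('data_test', data=X_test)
h5f.create_dataset('homo_lumo_gaps_train', data=y_train)  # placeholder name
h5f.create_dataset('homo_lumo_gaps_test', data=y_test)    # placeholder name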

Either way, that will create an h5 with your data split into train and test as expected by train.py. However, note that train.py (and all the other scripts in the repo) uses a particular network topology that probably won't work with the shape of your data. The model is defined at https://github.com/maxhodak/keras-molecules/blob/master/molecules/model.py. As you can see, the dimensions of the input tensors as defined are (max_length, len(charset)) = (120, 51), and each row is expected to be one-hot.
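Concretely, pointing the model at your data would at minimum mean changing that input shape; a bare Keras sketch (the downstream layer sizes would also need retuning, and none of this is in the repo as-is):

from keras.layers import Input

# The repo's encoder input is (max_length, len(charset)) = (120, 51),
# one one-hot row per character. A padded 29x29 Coulomb matrix would
# instead have 29 rows of 29 real values, not one-hot rows:
x = Input(shape=(29, 29))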

@jeffrey9909 (Author)

Oh, I think I get it.
Thank you for your help.
I will try to fix it tomorrow.
Thanks again.

@jeffrey9909 (Author)

Just a question: how was the latent_dim determined to be 292?
I am trying to modify the code by myself at the moment, but I have no idea about this...
Thanks.

@pechersky (Collaborator) commented Jan 24, 2017 via email
