Preprocess uses too much RAM #35

Open
dakoner opened this issue Nov 14, 2016 · 21 comments
Comments

@dakoner
Contributor

dakoner commented Nov 14, 2016

preprocess.py loads the entire raw dataset into RAM and then does transforms that require even more RAM. I'm trying to preprocess GDB-17 with 50M SMILES strings and it just about fills up my 64 GB machine. We should be able to go directly from SMILES input to preprocessed input files with far fewer resources, although it would take some work to make all the steps incremental.
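
For illustration, a minimal sketch of what an incremental pass could look like: stream SMILES out of the gzipped text file and append fixed-width rows to an extendable HDF5 dataset, so peak memory stays bounded by the chunk size. The `structure` dataset name, the 120-character cap, and the function names are assumptions here, not the current preprocess.py behavior.

```python
import gzip
import h5py
import numpy as np

CHUNK = 100000    # SMILES buffered per write; peak RAM is bounded by this
MAX_LEN = 120     # assumed maximum SMILES length

def stream_smiles(path):
    """Yield one SMILES string at a time from a gzipped text file."""
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            parts = line.split()
            if parts and len(parts[0]) <= MAX_LEN:
                yield parts[0]

def preprocess_incremental(in_path, out_path):
    """Append fixed-width SMILES to an extendable HDF5 dataset chunk by chunk."""
    dtype = 'S%d' % MAX_LEN
    with h5py.File(out_path, 'w') as h5:
        dset = h5.create_dataset('structure', shape=(0,),
                                 maxshape=(None,), dtype=dtype)
        buf = []
        for smi in stream_smiles(in_path):
            buf.append(smi.encode('ascii'))
            if len(buf) >= CHUNK:
                n = dset.shape[0]
                dset.resize((n + len(buf),))
                dset[n:] = np.array(buf, dtype=dtype)
                buf = []
        if buf:                                  # flush the final partial chunk
            n = dset.shape[0]
            dset.resize((n + len(buf),))
            dset[n:] = np.array(buf, dtype=dtype)
```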

@pechersky
Collaborator

Would we be alright with switching to generator-based training? That gets rid
of the need to preprocess and the need to compress as well.
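
As a rough illustration of the idea (not the branch that appears later in this thread), a generator like the one below one-hot encodes batches on the fly and feeds Keras 1.x's fit_generator, so only the raw strings ever sit in memory. The charset handling and the space padding character are assumptions.

```python
import numpy as np

def smiles_generator(smiles_list, charset, batch_size=256, max_len=120):
    """Yield (x, x) batches of one-hot encoded SMILES forever, encoding on the fly."""
    char_to_idx = {c: i for i, c in enumerate(charset)}
    while True:
        picks = np.random.randint(0, len(smiles_list), size=batch_size)
        x = np.zeros((batch_size, max_len, len(charset)), dtype=np.float32)
        for row, i in enumerate(picks):
            padded = smiles_list[i].ljust(max_len)   # ' ' assumed to be in charset
            for col, ch in enumerate(padded[:max_len]):
                x[row, col, char_to_idx[ch]] = 1.0
        yield x, x                                   # autoencoder: input == target

# Keras 1.x usage (hypothetical model and charset):
# model.fit_generator(smiles_generator(smiles, charset),
#                     samples_per_epoch=500000, nb_epoch=10)
```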


@maxhodak
Owner

sure! send a pull request!

@pechersky
Collaborator

@dakoner can you provide a link to the 50M GDB-17 dataset you're using?

@dakoner
Contributor Author

dakoner commented Nov 21, 2016

http://gdb.unibe.ch/downloads/
Specifically http://gdb.unibe.ch/cdn/GDB17.50000000.smi.gz


@pechersky
Collaborator

The following branch should be able to train using a stream-based approach, requiring way less RAM. It also provides a solution for issue #39.

Please test it out -- you'll need to edit train_gen.py to pass in some iterable that has SMILES strings in it. Right now, it expects an h5 file with a structure column; it could be a plain gzip text file instead.

https://github.com/pechersky/keras-molecules/tree/stream-process
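
If anyone wants to try the plain-gzip route, an iterable along these lines (a hypothetical helper, not something in the branch) should be enough to swap in for the h5 `structure` column when editing train_gen.py:

```python
import gzip

def smiles_from_gzip(path):
    """Lazily yield one SMILES string per line from a gzipped text file."""
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            parts = line.split()
            if parts:
                yield parts[0]

# note: a generator is single-pass; wrap it in list(...) if train_gen
# iterates over the structures more than once
structures = smiles_from_gzip('GDB17.50000000.smi.gz')
```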

@dakoner
Contributor Author

dakoner commented Nov 22, 2016

Yakov,

Do I understand correctly that now there is no need to preprocess SMILES input at all? I.e., this code is buffering the SMILES input in RAM (in pandas), then sampling from the data at runtime?

I am curious about the memory overhead of storing HDF5/pandas data vs. an array of strings.
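
One quick, rough way to compare the two layouts (stand-in data; not a statement about what the branch actually stores):

```python
import numpy as np
import pandas as pd

smiles = ['CC(=O)Oc1ccccc1C(=O)O'] * 1000000      # stand-in data

as_series = pd.Series(smiles)                     # object dtype: one Python str per row
as_fixed = np.array(smiles, dtype='S120')         # fixed-width bytes: 120 B per row

print(as_series.memory_usage(deep=True))          # counts the str objects themselves
print(as_fixed.nbytes)                            # 120 MB for 1M rows
```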


@dakoner
Contributor Author

dakoner commented Nov 22, 2016

I had to make a few changes to train_gen.py to get things to work (see my comments on your commit).

I can confirm it starts training w/o any preprocessing steps.


@dakoner
Contributor Author

dakoner commented Nov 22, 2016

This code doesn't seem to train up quickly like the old train.py does. It gets stuck at an accuracy of about 0.07, while the old train.py jumps up to 50% in just a few batches.

I am seeing this warning at the end of the epoch when I set --epoch_size to 50000:

/home/dek/keras-molecules/env/src/keras/keras/engine/training.py:1480: UserWarning: Epoch comprised more than samples_per_epoch samples, which might affect learning results. Set samples_per_epoch correctly to avoid this warning.
  warnings.warn('Epoch comprised more than '


@pechersky
Collaborator

You might have gotten the epoch warning if your batch_size doesn't cleanly divide epoch_size.

Thanks for your comments here and on the commit. Could you share the command that you were using to run the old train.py? The only thing that I can think of as being different in this case is that sampling is with replacement -- the code as I have it currently ignores the weights on training.
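
To make the divisibility point concrete: Keras 1.x consumes whole batches from the generator, so the epoch only ends cleanly when samples_per_epoch is a multiple of batch_size. With the 50000 epoch size above and a batch size that doesn't divide it (300 here is just a hypothetical value):

```python
epoch_size = 50000
batch_size = 300                        # hypothetical; does not divide 50000
print(epoch_size % batch_size)          # 200 -> the last batch overshoots and Keras warns
epoch_size -= epoch_size % batch_size
print(epoch_size)                       # 49800, a clean multiple, no warning
```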

@pechersky
Collaborator

pechersky commented Nov 22, 2016

@dakoner There was a bug in encoding: it wasn't properly encoding padded words. I've also fixed the bugs you've pointed out. Now train_gen quickly reaches >60% accuracy within the first epoch.

@dakoner
Contributor Author

dakoner commented Nov 23, 2016

Thanks. I changed batch_size and the error went away.

Also, I pulled your newer code and can confirm the training is working as expected.


@dakoner
Contributor Author

dakoner commented Nov 23, 2016

Yakov,

I've got a question about how the new generator-based training works.

IIUC, it loads the entire dataset into RAM, defines an index point in the dataset, and then returns samples from before or after the index point depending on whether a training or test sample is requested.

If you set an epoch size larger than the dataset size, doesn't that mean no test samples will be returned?


@pechersky
Collaborator

pechersky commented Nov 23, 2016

The sampling is with replacement, so any epoch size can be used. I chose "with replacement" so the generator carries as little state as possible. For some of the datasets I am trying, the number of samples is so much larger than any reasonable epoch size that I rely on random sampling for each epoch, where no epoch is guaranteed to contain the same samples as a previous one.

I am working on extending the "CanonicalSmilesDataGenerator" to take a SMILES string, and create non-canonical representations of the same compound -- similar to how image data generators add noise through rotation etc. This will create an even larger dataset (finite, yet difficult to enumerate). In that case, sampling with replacement is important since you'd be fine with having two different representations of the same compound in a single epoch.
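
For what it's worth, one way such an augmenting generator could produce alternative strings for the same compound with RDKit is to shuffle the atom order before writing the SMILES back out. This is only a sketch of the technique, not a claim about how CanonicalSmilesDataGenerator will be implemented:

```python
import random
from rdkit import Chem

def randomized_smiles(smiles, n=5):
    """Return up to n non-canonical SMILES for one compound by shuffling atom order."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = set()
    for _ in range(n):
        order = list(range(mol.GetNumAtoms()))
        random.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return sorted(variants)

print(randomized_smiles('c1ccccc1O'))   # several different strings, all the same phenol
```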

@dakoner
Contributor Author

dakoner commented Nov 23, 2016

Great. I am currently training with a 35K SMILES string dataset, and I set --epoch_size to 35K. TensorFlow says it's training on 35K samples; does that mean it's really sampling from a training set of 35K/2 items and a test set of 35K/2 items?


@pechersky
Collaborator

In the molecules.vectorizer, SmilesDataGenerator takes an optional test_split parameter that creates the "index point" you mentioned. By default, it is 0.20, so 4/5 of the data is used for training and 1/5 is used for testing. The version you pulled has a hard-coded epoch size for testing -- I just pushed a new version that scales the test-epoch size based on this test_split param.
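
Roughly, the index point works like this (illustrative only, not the vectorizer code verbatim):

```python
smiles = ['CCO', 'CCN', 'c1ccccc1', 'CC(=O)O', 'CCCl']   # stand-in data
test_split = 0.20
split = int(len(smiles) * (1 - test_split))              # the "index point"
train_pool, test_pool = smiles[:split], smiles[split:]   # 4/5 train, 1/5 test
# training batches sample (with replacement) from before the index point,
# test batches from after it, so the two pools never overlap
```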

I'm also making a PR with that branch. Thank you for testing this, @dakoner. @maxhodak, if you could also test it and validate it for your purposes, especially the new sampling code, that would be great. I'd only like to merge it in if it works better than our current approaches (for processing, training, testing, sampling) in terms of memory load and ease of use.

@pechersky
Collaborator

I should add that if you are training on 35K, and you assume the default test_split=0.20, then your "true" effective training set size is 28K. That's the epoch size you'll want to use, which will also give you a test-epoch of 7K.

@dakoner
Contributor Author

dakoner commented Nov 23, 2016

BTW, you should be able to use the smilesparser to generate non-canonical SMILES strings from canonical ones. The operation set includes permuting the order of branches, starting the string from various terminals, etc.
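
I'm not familiar with smilesparser's exact API, but for illustration RDKit can do something similar by writing the string starting from different atoms:

```python
from rdkit import Chem

def rooted_smiles(smiles):
    """Write the same molecule starting from each atom in turn."""
    mol = Chem.MolFromSmiles(smiles)
    return sorted({Chem.MolToSmiles(mol, canonical=False, rootedAtAtom=i)
                   for i in range(mol.GetNumAtoms())})

print(rooted_smiles('CCO'))   # e.g. ['C(C)O', 'CCO', 'OCC'], depending on RDKit version
```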


@dakoner
Contributor Author

dakoner commented Nov 23, 2016

In this case, I'm training on a file with 1.3M compounds. I believe your comment only applies if I'm providing a small input file and setting --epoch_size to a value that is close to the input file size... right? Anyway, it seems like I should be passing --test_split as a parameter, and the code should do the work of figuring out everything else.


@pechersky
Collaborator

I've just pushed a change that allows you to pass test_split as an optional command-line argument. The test-epoch size will be equal to epoch_size / int((1 - test_split) / test_split). The comment I made previously isn't so important with large dataset sizes, I guess. Still, the test-epoch size will be a quarter of the train-epoch size with a test_split of 0.20, not a fifth.
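
Plugging the 28K/7K example from earlier into that formula (just arithmetic, no new behavior):

```python
epoch_size = 28000                   # 35K * (1 - 0.20), the effective training pool
test_split = 0.20
test_epoch_size = epoch_size / int((1 - test_split) / test_split)
print(test_epoch_size)               # 7000.0: a quarter of the train epoch, not a fifth
```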

@dakoner
Contributor Author

dakoner commented Nov 27, 2016 via email

@pechersky
Collaborator

#43 got merged 3 days ago, it seems. After 50 epochs of 600,000 strings (batch size 300) using the gen method, I got 97% accuracy. After 150 epochs, I've plateaued at ~98.5% accuracy. Does this also fix #38, do you think? I don't really have a validation set, so I can't say -- but I do notice that molecules that start with N still do not get autoencoded to results that start with N.
