Preprocess uses too much RAM #35

Open
dakoner opened this issue Nov 14, 2016 · 21 comments
Comments

@dakoner
Contributor

dakoner commented Nov 14, 2016

preprocess.py loads the entire raw dataset into RAM and then does transforms that require even more RAM. I'm trying to preprocess GDB-17 with 50M SMILES strings and it just about fills up my 64 GB machine. We should be able to go directly from SMILES input to preprocessed input files with far fewer resources, although it would take some work to make all the steps incremental.
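
For illustration, a minimal sketch of what an incremental pass could look like: stream SMILES out of the gzipped text file and append fixed-width rows to an extendable HDF5 dataset, so peak memory stays bounded by the chunk size. The `structure` dataset name, the 120-character cap, and the function names are assumptions here, not the current preprocess.py behavior.

```python
import gzip
import h5py
import numpy as np

CHUNK = 100000    # SMILES buffered per write; peak RAM is bounded by this
MAX_LEN = 120     # assumed maximum SMILES length

def stream_smiles(path):
    """Yield one SMILES string at a time from a gzipped text file."""
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            parts = line.split()
            if parts and len(parts[0]) <= MAX_LEN:
                yield parts[0]

def preprocess_incremental(in_path, out_path):
    """Append fixed-width SMILES to an extendable HDF5 dataset chunk by chunk."""
    dtype = 'S%d' % MAX_LEN
    with h5py.File(out_path, 'w') as h5:
        dset = h5.create_dataset('structure', shape=(0,),
                                 maxshape=(None,), dtype=dtype)
        buf = []
        for smi in stream_smiles(in_path):
            buf.append(smi.encode('ascii'))
            if len(buf) >= CHUNK:
                n = dset.shape[0]
                dset.resize((n + len(buf),))
                dset[n:] = np.array(buf, dtype=dtype)
                buf = []
        if buf:                                  # flush the final partial chunk
            n = dset.shape[0]
            dset.resize((n + len(buf),))
            dset[n:] = np.array(buf, dtype=dtype)
```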

@pechersky
Collaborator

Would we be alright with switching to generator-based training? That gets rid
of the need to preprocess and the need to compress as well.
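
As a rough illustration of the idea (not the branch that appears later in this thread), a generator like the one below one-hot encodes batches on the fly and feeds Keras 1.x's fit_generator, so only the raw strings ever sit in memory. The charset handling and the space padding character are assumptions.

```python
import numpy as np

def smiles_generator(smiles_list, charset, batch_size=256, max_len=120):
    """Yield (x, x) batches of one-hot encoded SMILES forever, encoding on the fly."""
    char_to_idx = {c: i for i, c in enumerate(charset)}
    while True:
        picks = np.random.randint(0, len(smiles_list), size=batch_size)
        x = np.zeros((batch_size, max_len, len(charset)), dtype=np.float32)
        for row, i in enumerate(picks):
            padded = smiles_list[i].ljust(max_len)   # ' ' assumed to be in charset
            for col, ch in enumerate(padded[:max_len]):
                x[row, col, char_to_idx[ch]] = 1.0
        yield x, x                                   # autoencoder: input == target

# Keras 1.x usage (hypothetical model and charset):
# model.fit_generator(smiles_generator(smiles, charset),
#                     samples_per_epoch=500000, nb_epoch=10)
```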


@maxhodak
Owner

sure! send a pull request!

@pechersky
Collaborator

@dakoner can you provide a link to the 50M GDB-17 dataset you're using?

@dakoner
Contributor Author

dakoner commented Nov 21, 2016

http://gdb.unibe.ch/downloads/
Specifically http://gdb.unibe.ch/cdn/GDB17.50000000.smi.gz


@pechersky
Collaborator

The following branch should be able to train using a stream-based approach, requiring way less RAM. It also provides a solution for issue #39.

Please test it out -- you'll need to edit train_gen.py to pass in some iterable that has SMILES strings in it. Right now, it expects an h5 file with a structure column; it could be a plain gzip text file instead.

https://github.com/pechersky/keras-molecules/tree/stream-process
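
If anyone wants to try the plain-gzip route, an iterable along these lines (a hypothetical helper, not something in the branch) should be enough to swap in for the h5 `structure` column when editing train_gen.py:

```python
import gzip

def smiles_from_gzip(path):
    """Lazily yield one SMILES string per line from a gzipped text file."""
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            parts = line.split()
            if parts:
                yield parts[0]

# note: a generator is single-pass; wrap it in list(...) if train_gen
# iterates over the structures more than once
structures = smiles_from_gzip('GDB17.50000000.smi.gz')
```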

@dakoner
Contributor Author

dakoner commented Nov 22, 2016

Yakov,

Do I understand correctly that now there is no need to preprocess SMILES input at all? I.e., this code is buffering the SMILES input in RAM (in pandas), then sampling from the data at runtime?

I am curious about the memory overhead of storing HDF5/pandas data vs. an array of strings.
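
One quick, rough way to compare the two layouts (stand-in data; not a statement about what the branch actually stores):

```python
import numpy as np
import pandas as pd

smiles = ['CC(=O)Oc1ccccc1C(=O)O'] * 1000000      # stand-in data

as_series = pd.Series(smiles)                     # object dtype: one Python str per row
as_fixed = np.array(smiles, dtype='S120')         # fixed-width bytes: 120 B per row

print(as_series.memory_usage(deep=True))          # counts the str objects themselves
print(as_fixed.nbytes)                            # 120 MB for 1M rows
```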


@dakoner
Contributor Author

dakoner commented Nov 22, 2016

I had to make a few changes to train_gen.py to get things to work (see my comments on your commit).

I can confirm it starts training w/o any preprocessing steps.


@dakoner
Contributor Author

dakoner commented Nov 22, 2016

This code doesn't seem to train up quickly like the old train.py does. It gets stuck at an accuracy of about 0.07, while the old train.py jumps up to 50% in just a few batches.

I am seeing this warning at the end of the epoch when I set --epoch_size to 50000:

/home/dek/keras-molecules/env/src/keras/keras/engine/training.py:1480: UserWarning: Epoch comprised more than samples_per_epoch samples, which might affect learning results. Set samples_per_epoch correctly to avoid this warning.
  warnings.warn('Epoch comprised more than '


@pechersky
Collaborator

You might have gotten the epoch warning if your batch_size doesn't cleanly divide epoch_size.

Thanks for your comments here and on the commit. Could you share the command that you were using to run the old train.py? The only thing that I can think of as being different in this case is that sampling is with replacement -- the code as I have it currently ignores the weights on training.
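
To make the divisibility point concrete: Keras 1.x consumes whole batches from the generator, so the epoch only ends cleanly when samples_per_epoch is a multiple of batch_size. With the 50000 epoch size above and a batch size that doesn't divide it (300 here is just a hypothetical value):

```python
epoch_size = 50000
batch_size = 300                        # hypothetical; does not divide 50000
print(epoch_size % batch_size)          # 200 -> the last batch overshoots and Keras warns
epoch_size -= epoch_size % batch_size
print(epoch_size)                       # 49800, a clean multiple, no warning
```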

@pechersky
Collaborator

pechersky commented Nov 22, 2016

@dakoner There was a bug in encoding: it wasn't properly encoding padded words. I've also fixed the bugs you've pointed out. Now train_gen quickly reaches >60% accuracy within the first epoch.

@dakoner
Contributor Author

dakoner commented Nov 23, 2016

Thanks. I changed batch_size and the error went away.

Also, I pulled your newer code and can confirm the training is working as expected.


@dakoner
Contributor Author

dakoner commented Nov 23, 2016

Yakov,

I've got a question about how the new generator-based training works.

IIUC, it loads the entire dataset into RAM, defines an index point in the dataset, and then returns samples from before or after the index point depending on whether a training or test sample is requested.

If you set an epoch size larger than the dataset size, doesn't that mean no test samples will be returned?


@pechersky
Collaborator

pechersky commented Nov 23, 2016

The sampling is with replacement, so any epoch size can be used. I chose "with replacement" so the generator carries as little state as possible. For some of the datasets I am trying, the number of samples is so much larger than any reasonable epoch size that I rely on random sampling for each epoch, where no epoch is guaranteed to contain the same samples as a previous one.

I am working on extending the "CanonicalSmilesDataGenerator" to take a SMILES string, and create non-canonical representations of the same compound -- similar to how image data generators add noise through rotation etc. This will create an even larger dataset (finite, yet difficult to enumerate). In that case, sampling with replacement is important since you'd be fine with having two different representations of the same compound in a single epoch.
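
For what it's worth, one way such an augmenting generator could produce alternative strings for the same compound with RDKit is to shuffle the atom order before writing the SMILES back out. This is only a sketch of the technique, not a claim about how CanonicalSmilesDataGenerator will be implemented:

```python
import random
from rdkit import Chem

def randomized_smiles(smiles, n=5):
    """Return up to n non-canonical SMILES for one compound by shuffling atom order."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = set()
    for _ in range(n):
        order = list(range(mol.GetNumAtoms()))
        random.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return sorted(variants)

print(randomized_smiles('c1ccccc1O'))   # several different strings, all the same phenol
```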

@dakoner
Contributor Author

dakoner commented Nov 23, 2016

Great. I am currently training with a 35K SMILES string dataset, and I set --epoch_size to 35K. TensorFlow says it's training on 35K samples; does that mean it's really sampling from a training set of 35K/2 items and a test set of 35K/2 items?


@pechersky
Collaborator

In the molecules.vectorizer, SmilesDataGenerator takes an optional test_split parameter that creates the "index point" you mentioned. By default, it is 0.20, so 4/5 of the data is used for training and 1/5 is used for testing. The version you pulled has a hard-coded epoch size for testing -- I just pushed a new version that scales the test-epoch size based on this test_split param.
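
Roughly, the index point works like this (illustrative only, not the vectorizer code verbatim):

```python
smiles = ['CCO', 'CCN', 'c1ccccc1', 'CC(=O)O', 'CCCl']   # stand-in data
test_split = 0.20
split = int(len(smiles) * (1 - test_split))              # the "index point"
train_pool, test_pool = smiles[:split], smiles[split:]   # 4/5 train, 1/5 test
# training batches sample (with replacement) from before the index point,
# test batches from after it, so the two pools never overlap
```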

I'm also making a PR with that branch. Thank you for testing this, @dakoner. @maxhodak, if you could also test it and validate it for your purposes, especially the new sampling code, that would be great. I'd only like to merge it in if it works better than our current approaches (for processing, training, testing, sampling) in terms of memory load and ease of use.

@pechersky
Collaborator

I should add that if you are training on 35K, and you assume the default test_split=0.20, then your "true" effective training set size is 28K. That's the epoch size you'll want to use, which will also give you a test-epoch of 7K.

@dakoner
Contributor Author

dakoner commented Nov 23, 2016

BTW, you should be able to use the smilesparser to generate non-canonical SMILES strings from canonical ones. The operation set includes permuting the order of branches, starting the string from various terminals, etc.
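
I'm not familiar with smilesparser's exact API, but for illustration RDKit can do something similar by writing the string starting from different atoms:

```python
from rdkit import Chem

def rooted_smiles(smiles):
    """Write the same molecule starting from each atom in turn."""
    mol = Chem.MolFromSmiles(smiles)
    return sorted({Chem.MolToSmiles(mol, canonical=False, rootedAtAtom=i)
                   for i in range(mol.GetNumAtoms())})

print(rooted_smiles('CCO'))   # e.g. ['C(C)O', 'CCO', 'OCC'], depending on RDKit version
```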


@dakoner
Contributor Author

dakoner commented Nov 23, 2016

In this case, I'm training on a file with 1.3M compounds. I believe your comment only applies if I'm providing a small input file and setting --epoch_size to a value that is close to the input file size... right? Anyway, it seems like I should be passing --test_split as a parameter, and the code should do the work of figuring out everything else.


@pechersky
Collaborator

I've just pushed a change that allows you to pass test_split as an optional command-line argument. The test-epoch size will be equal to epoch_size / int((1 - test_split) / test_split). The comment I made previously isn't so important with large dataset sizes, I guess. Still, the test-epoch size will be a quarter of the train-epoch size with a test_split of 0.20, not a fifth.
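
Plugging the 28K/7K example from earlier into that formula (just arithmetic, no new behavior):

```python
epoch_size = 28000                   # 35K * (1 - 0.20), the effective training pool
test_split = 0.20
test_epoch_size = epoch_size / int((1 - test_split) / test_split)
print(test_epoch_size)               # 7000.0: a quarter of the train epoch, not a fifth
```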

@dakoner
Contributor Author

dakoner commented Nov 27, 2016 via email

@pechersky
Collaborator

#43 got merged 3 days ago, it seems. After 50 epochs of 600,000 strings (batch size 300) using the gen method, I got 97% accuracy. After 150 epochs, I've plateaued at ~98.5% accuracy. Does this also fix #38, do you think? I don't really have a validation set, so I can't say -- but I do notice that molecules that start with N still do not get autoencoded to results that start with N.
