Preprocess uses too much RAM #35
Comments
Would we be alright with switching to generator-based training? That gets rid […]
sure! send a pull request!
@dakoner can you provide a link to the 50M GDB-17 dataset you're using?
http://gdb.unibe.ch/downloads/
The following branch should be able to train using a stream-based approach, requiring way less RAM. It also provides a solution for issue #39. Please test it out -- you'll need to edit […] https://github.com/pechersky/keras-molecules/tree/stream-process
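For context, here is a minimal sketch of what stream-based training looks like. It is not the code in the branch; the generator, the charset handling, and the Keras 1.x `fit_generator` call in the usage comment are all assumptions for illustration.

```python
# Minimal sketch (not the stream-process branch itself): read SMILES from
# disk in batches and one-hot encode on the fly, so the full dataset never
# has to live in RAM.
import numpy as np

def smiles_batch_generator(path, charset, max_len=120, batch_size=256):
    # `charset` is assumed to be a precomputed list of characters with the
    # padding character at index 0; unknown characters map to padding.
    char_to_idx = {c: i for i, c in enumerate(charset)}
    while True:  # Keras generators are expected to loop forever
        with open(path) as fh:
            batch = []
            for line in fh:
                smiles = line.strip()[:max_len].ljust(max_len, charset[0])
                one_hot = np.zeros((max_len, len(charset)), dtype=np.float32)
                for i, char in enumerate(smiles):
                    one_hot[i, char_to_idx.get(char, 0)] = 1.0
                batch.append(one_hot)
                if len(batch) == batch_size:
                    x = np.array(batch)
                    yield x, x  # autoencoder: the input is also the target
                    batch = []

# Usage with the Keras 1.x API of that era (attribute and argument names
# are assumed, not taken from the repo):
# model.autoencoder.fit_generator(
#     smiles_batch_generator('smiles.txt', charset),
#     samples_per_epoch=500000, nb_epoch=20)
```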
Yakov, do I understand correctly that now there is no need to preprocess SMILES […]? I am curious about the memory overhead of storing HDF5/pandas data vs. an […]
I had to make a few changes to train_gen.py to get things to work (see my […]). I can confirm it starts training w/o any preprocessing steps.
This code doesn't seem to train up quickly like the old train.py does. It […] I am seeing a warning at the end of the epoch, from /home/dek/keras-molecules/env/src/keras/keras/engine/training.py:1480, when I set --epoch_size to 50000.
You might have gotten the epoch warning if your batch_size doesn't cleanly divide epoch_size. Thanks for your comments here and on the commit. Could you share the command that you were using to run the old train.py? The only thing that I can think of as being different in this case is that sampling is with replacement -- the code as I have it currently ignores the weights on training.
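To make the divisibility point concrete, a quick check with assumed values (the batch size here is made up; only the epoch size of 50000 comes from the thread):

```python
# Keras 1.x warns when the generator feeds more samples than
# samples_per_epoch, which happens whenever batch_size does not divide
# epoch_size evenly.
epoch_size = 50000   # the --epoch_size value mentioned above
batch_size = 300     # assumed for illustration
print(epoch_size % batch_size)  # 200 -> the last batch overshoots, warning
print(epoch_size % 250)         # 0   -> batches line up exactly, no warning
```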
@dakoner There was a bug in encoding: it wasn't properly encoding padded words. I've also fixed the bugs you pointed out. Now […]
Thanks. I changed batch_size and the error went away. Also, I pulled your newer code and can confirm the training is working as […]
Yakov, IIUC it loads the entire dataset into RAM, defines an index point in the […]. If you set an epoch size larger than the dataset size, doesn't that cause […]?
The sampling is with replacement, so any epoch size can be used. I chose "with replacement" so that the generator carries as little state as possible. For some of the datasets I am trying, the number of samples is so much larger than reasonable epoch sizes that I rely on random sampling for each epoch, where no epoch is promised to have the same samples as a previous one. I am working on extending the "CanonicalSmilesDataGenerator" to take a SMILES string and create non-canonical representations of the same compound -- similar to how image data generators add noise through rotation etc. This will create an even larger dataset (finite, yet difficult to enumerate). In that case, sampling with replacement is important, since you'd be fine with having two different representations of the same compound in a single epoch.
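As an illustration of the idea (not the CanonicalSmilesDataGenerator extension itself), one way to get non-canonical SMILES for the same compound with RDKit is to shuffle the atom order before writing the string:

```python
# Illustrative only: produce a non-canonical SMILES for the same molecule
# by renumbering atoms randomly and disabling canonicalization on output.
import random
from rdkit import Chem

def randomized_smiles(smiles, rng=random):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:          # unparsable input: return it unchanged
        return smiles
    order = list(range(mol.GetNumAtoms()))
    rng.shuffle(order)
    shuffled = Chem.RenumberAtoms(mol, order)
    return Chem.MolToSmiles(shuffled, canonical=False)

print(randomized_smiles("c1ccccc1O"))  # e.g. 'Oc1ccccc1'; varies per draw
```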
Great. I am currently training with a 35K SMILES string dataset, and I set […]
[…] I'm also making a PR with that branch. Thank you for testing this, @dakoner. @maxhodak, if you could also test it and validate it for your purposes, especially the new sampling code, that would be great. I'd only like to merge it in if it works better than our current approaches (for processing, training, testing, sampling) in terms of memory load and ease of use.
I should add that if you are training on 35K and you assume the default […]
BTW, you should be able to use the smilesparser to generate non-canonical […]
In this case, I'm training on a file with 1.3M compounds. I believe your […]
I've just pushed a change which allows you to pass test_split as an optional command-line argument. Test-epoch size will be equal to epoch_size / int((1 - test_split) / test_split). The comment I made previously isn't so important with large dataset sizes, I guess. Still, test-epoch-size will be 4 times less than the train-epoch-size with a test_split of 0.20, not 5 times less.
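Worked through in Python for a test_split of 0.20 (the epoch_size value is assumed):

```python
epoch_size = 500000                            # assumed value for illustration
test_split = 0.20
factor = int((1 - test_split) / test_split)    # int(0.8 / 0.2) == 4, not 5
test_epoch_size = epoch_size / factor          # 125000.0
print(factor, test_epoch_size)
```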
FWIW I spent about 3 days training a model with what I pulled (from 4 days ago) from your fork. It got up to about 95%; I didn't go any further, but I believe that with another day or two of training it would reach the same level the older code did (99.5% accuracy for my data set). So I don't think there are any functional regressions associated with the generative training, and I think it has a big advantage of not needing two large .h5 files. I would prefer if this or something similar gets merged.
#43 got merged 3 days ago, it seems. After 50 epochs of 600,000 strings (batch size 300) using the gen-method, I got 97% acc. After 150 epochs, I've plateaued at ~98.5% acc. Does this also fix #38, do you think? I don't really have a validation set, so I can't say -- but I do notice that, still, the molecules that start with […]
preprocess.py loads the entire raw (not-yet-preprocessed) dataset into RAM, then does transforms that require even more RAM. I'm trying to preprocess GDB-17 with 50M SMILES strings and it's just about filling up my 64GB RAM machine. We should be able to go directly from SMILES input to preprocessed input files with far fewer resources, although it would take work to make all the steps incremental.
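As a rough sketch of the incremental direction suggested here (assuming a plain one-SMILES-per-line input; the column name, HDF5 key, and chunk size are made up), something like:

```python
# Read the SMILES file in chunks and append each chunk to an HDF5 table,
# so peak RAM is bounded by the chunk size rather than the 50M-row dataset.
import pandas as pd

chunks = pd.read_csv('gdb17.smi', header=None, names=['structure'],
                     chunksize=100000)
with pd.HDFStore('data.h5', mode='w') as store:
    for chunk in chunks:
        chunk['structure'] = chunk['structure'].str.strip()
        store.append('table', chunk, format='table',
                     min_itemsize={'structure': 120})
```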