Full training set? #8

proteneer · 2017-08-01T17:39:01Z

We're waiting on the full dataset (with augmentation via normal mode sampling) so that we can do a complete validation of the ANI model. @lilleswing recommended using Git LFS for distribution.

Thoughts?

ghutchis · 2017-08-01T18:25:53Z

For one, you probably need a separate repository. How much data is it?

Jussmith01 · 2017-08-01T18:34:43Z

Hello,

We are currently in the process of publishing the data descriptor and data set. We will be submitting before the weekend. The data should be available shortly after. We will add a link on this repo's readme when we have it to share. Thanks!

isayev · 2017-08-01T22:16:18Z

Hey @proteneer and @ghutchis: this is a multi-gigabyte data set. We will provide a simple Python package to read and slice the data.

ghutchis · 2017-08-01T22:55:22Z

@isayev - I was guessing. But GitHub (even with LFS) may not be the ideal place to store it.

hlwoodcock · 2017-08-01T22:58:34Z

Hi all, if anyone's school uses Google for email services, then Google Drive should offer unlimited storage and an easy option for sharing.


andersx · 2017-08-02T11:49:56Z

For a permanent solution, I suggest storing the dataset somewhere that allows for a DOI. I'm not sure you can get one using Google Drive. Maybe datadryad.org? I guess it also depends on how much data we are talking about, and whether you are willing to spend money on hosting at all.

proteneer · 2017-08-02T12:28:36Z

What type of information is in the training set? Atomic coordinates, types, and predicted QM energies? Bond orders? Topologies? SMILES?

We could also consider hosting a mirror of it ourselves.

isayev · 2017-08-02T13:35:09Z

@andersx yup, we will host it with DOI!
@proteneer the data is xyz-file-like. We have a 3D array containing the Cartesian coordinates for each conformer of the molecule, a vector of atom species, and a vector of energies. We don't use bond orders or topologies.
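
As a rough illustration of the layout described above, here is a minimal sketch in plain Python. It is not the released file format or the promised reader package: the class name `MoleculeRecord`, its field names, and the numbers in the example are all hypothetical placeholders. It only shows how one species vector is shared by every conformer, while coordinates are indexed as conformer, then atom, then x/y/z, with one energy per conformer.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float, float]


@dataclass
class MoleculeRecord:
    """Hypothetical container; names are illustrative, not the real API."""
    species: List[str]              # one element symbol per atom
    coordinates: List[List[Coord]]  # [conformer][atom] -> (x, y, z)
    energies: List[float]           # one energy per conformer

    def validate(self) -> None:
        """Check that the three pieces have consistent shapes."""
        if len(self.coordinates) != len(self.energies):
            raise ValueError("need exactly one energy per conformer")
        for conformer in self.coordinates:
            if len(conformer) != len(self.species):
                raise ValueError("need one coordinate triple per atom")

    def to_xyz(self, index: int) -> str:
        """Render a single conformer as an xyz-style text block."""
        lines = [str(len(self.species)), f"energy={self.energies[index]}"]
        for symbol, (x, y, z) in zip(self.species, self.coordinates[index]):
            lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
        return "\n".join(lines)


# Placeholder geometry and energies, made up for the example.
water = MoleculeRecord(
    species=["O", "H", "H"],
    coordinates=[
        [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)],
        [(0.0, 0.0, 0.0), (0.98, 0.0, 0.0), (-0.26, 0.95, 0.0)],
    ],
    energies=[-1.0, -0.9],
)
water.validate()
print(water.to_xyz(0))
```

Because no bond orders or topologies are stored, nothing in the record beyond species and coordinates is needed to rebuild an xyz file for any conformer.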

proteneer · 2017-08-02T15:43:39Z

I see. I presume you're tossing out formal charges as well, then? If you're also throwing away bond orders/topologies, it might be a little difficult to do reconstruction/debugging, but we can live with xyz-like for now. You can probably get away with 16 bits of precision plus compression to reduce the file sizes (internally we use something like the GROMACS XTC format + gzip with 16 bits to drastically reduce sizes).
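
To make the 16-bit idea concrete, here is a minimal sketch of fixed-point quantization followed by gzip, using only the Python standard library. It is not the XTC algorithm and not the pipeline mentioned above; the function names and the `resolution` step of 0.001 coordinate units are assumptions chosen for illustration. With that step, representable values are limited to roughly ±32.7 units, and the round-trip error is at most half a step.

```python
import gzip
import struct


def pack_coords(values, resolution=0.001):
    """Quantize floats to signed 16-bit integers, then gzip the bytes.

    `resolution` is a hypothetical step size in coordinate units.
    """
    quantized = []
    for value in values:
        q = round(value / resolution)
        if not -32768 <= q <= 32767:
            raise OverflowError(f"{value} does not fit in 16 bits at this resolution")
        quantized.append(q)
    # "<h" = little-endian signed 16-bit integer, one per value.
    raw = struct.pack(f"<{len(quantized)}h", *quantized)
    return gzip.compress(raw)


def unpack_coords(blob, resolution=0.001):
    """Invert pack_coords, up to the quantization error."""
    raw = gzip.decompress(blob)
    count = len(raw) // 2  # two bytes per 16-bit value
    return [q * resolution for q in struct.unpack(f"<{count}h", raw)]


coords = [0.0, 1.2346, -3.21, 32.0]
restored = unpack_coords(pack_coords(coords))
largest_error = max(abs(a - b) for a, b in zip(coords, restored))
print(f"largest round-trip error: {largest_error:.6f}")
```

The trade-off is that this scheme is lossy: anything finer than the chosen step is discarded, which matters if the coordinates are later fed back into QM calculations.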

isayev · 2017-08-02T15:53:33Z

The whole point of this approach is to be QM-like! It does not rely on anything but element species and coordinates. We ran all systems in DFT and converged them successfully. All of them are neutral. I think we also have SMILES strings. However, for every SMILES there will be an ensemble (~100-1000) of 3D conformations. The dataset will be available as a (lossless) HDF5 file with a Python wrapper class.

proteneer · 2017-08-02T15:55:50Z

Okay - works for us.

proteneer · 2017-08-23T17:30:49Z

Closing - thanks guys!

proteneer closed this as completed Aug 23, 2017
