Full training set? #8

Closed
proteneer opened this issue Aug 1, 2017 · 12 comments

Comments

@proteneer
Collaborator

We're waiting on the full dataset (with augmentation via normal mode sampling) so that we can fully validate the ANI model. @lilleswing recommended using Git LFS for distribution.

Thoughts?

@ghutchis
Collaborator

ghutchis commented Aug 1, 2017

For one, you probably need a separate repository. How much data is it?

@Jussmith01
Collaborator

Hello,

We are currently in the process of publishing the data descriptor and data set. We will be submitting before the weekend, and the data should be available shortly after. We will add a link to this repo's README when we have one to share. Thanks!

@isayev
Owner

isayev commented Aug 1, 2017

Hey, @proteneer and @ghutchis: this is a multi-gigabyte data set. We will provide a simple Python package to read and slice the data.

@ghutchis
Collaborator

ghutchis commented Aug 1, 2017

@isayev - I was guessing. But GitHub (even with LFS) may not be the ideal place to store it.

@hlwoodcock
Collaborator

hlwoodcock commented Aug 1, 2017 via email

@andersx
Collaborator

andersx commented Aug 2, 2017

For a permanent solution, I suggest storing the dataset somewhere that allows for a DOI. I'm not sure whether you can get one using Google Drive. Maybe datadryad.org? I guess it also depends on how much data we are talking about, and whether you are willing to spend money on hosting at all.

@proteneer
Collaborator Author

What type of information is in the training set? Atomic coordinates, types, and predicted QM energies? Bond orders? Topologies? SMILES?

We could also consider hosting it ourselves as a mirror.

@isayev
Owner

isayev commented Aug 2, 2017

@andersx yup, we will host it with a DOI!
@proteneer the data is xyz-file-like. We have a 3D array containing the Cartesian coordinates for each conformer of the molecule, a vector of atom species, and a vector of energies. We don't use bond orders or topologies.
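
For readers who want to picture the layout described above, here is a minimal NumPy sketch. It assumes one coordinate array of shape `(n_conformers, n_atoms, 3)` per molecule, a species vector of length `n_atoms`, and an energy vector of length `n_conformers`. The variable names, the `to_xyz` helper, and all numeric values are hypothetical placeholders, not the released file format or real data.

```python
import numpy as np

# Placeholder molecule: 3 atoms, 2 conformers. Values are illustrative only.
species = ["O", "H", "H"]                        # one entry per atom
coordinates = np.zeros((2, 3, 3), dtype=np.float32)
coordinates[1, 1] = [0.1, 0.0, 0.0]              # displace one atom in conformer 1
energies = np.array([-1.0, -0.9])                # one entry per conformer

def to_xyz(species, coords, energy):
    """Render a single conformer as xyz-style text (hypothetical helper)."""
    lines = [str(len(species)), f"energy={energy}"]
    for symbol, (x, y, z) in zip(species, coords):
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines)

# Slicing by conformer index keeps coordinates and energies aligned.
lowest = int(np.argmin(energies))
xyz_text = to_xyz(species, coordinates[lowest], energies[lowest])
```

Because the first axis indexes conformers in both `coordinates` and `energies`, any index or boolean mask applied to one can be applied to the other to select a consistent subset.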

@proteneer
Collaborator Author

proteneer commented Aug 2, 2017

I see - I presume you're tossing out formal charges as well then? If you're also throwing away bond orders/topologies, it might be a little difficult to do reconstruction/debugging, but we can live with xyz-like for now. You can probably get away with 16 bits of precision plus compression to reduce the file sizes (internally we use something like the GROMACS XTC format + gzip with 16 bits to drastically reduce sizes).
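
As a rough illustration of the "16 bits plus compression" idea, the sketch below stores coordinates as fixed-point `int16` values and gzips the bytes. This is not the XTC algorithm or anyone's internal tooling. The function names and the `scale` of 1000 steps per coordinate unit are assumptions chosen for the example, which limits representable values to roughly ±32.7 units.

```python
import gzip
import numpy as np

def quantize(coords, scale=1000.0):
    """Fixed-point encode: round(coords * scale), stored as int16."""
    q = np.rint(np.asarray(coords, dtype=np.float64) * scale)
    info = np.iinfo(np.int16)
    if q.min() < info.min or q.max() > info.max:
        raise ValueError("coordinates exceed the int16 range for this scale")
    return q.astype(np.int16)

def pack(coords, scale=1000.0):
    """Quantize to int16, then gzip the raw bytes."""
    return gzip.compress(quantize(coords, scale).tobytes())

def unpack(blob, shape, scale=1000.0):
    """Inverse of pack; the array shape must be stored separately."""
    q = np.frombuffer(gzip.decompress(blob), dtype=np.int16).reshape(shape)
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
coords = rng.uniform(-10.0, 10.0, size=(5, 4, 3))  # placeholder data
blob = pack(coords)
restored = unpack(blob, coords.shape)
```

The round-trip error is bounded by about half a quantization step (`0.5 / scale`), so the choice of `scale` trades precision against the range of coordinates that fit in 16 bits.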

@isayev
Owner

isayev commented Aug 2, 2017 via email

@proteneer
Collaborator Author

Okay - works for us.

@proteneer
Collaborator Author

Closing - thanks guys!

6 participants