Full training set? #8
Comments
For one, you probably need a separate repository. How much data is it? |
Hello, We are currently in the process of publishing the data descriptor and data set. We will be submitting before the weekend. The data should be available shortly after. We will add a link on this repo's readme when we have it to share. Thanks! |
Hey, @proteneer and @ghutchis: this is a multi-gigabyte data set. We will provide a simple Python package to read and slice the data. |
@isayev - I was guessing. But GitHub (even with LFS) may not be the ideal place to store it. |
Hi all - if anyone's school uses Google for email services, then Google
Drive should offer unlimited storage and an easy option for sharing.
On Tue, Aug 1, 2017 at 6:55 PM, Geoff Hutchison wrote:
> @isayev - I was guessing. But GitHub (even with LFS) may not be the
> ideal place to store it.
|
For a permanent solution, I suggest storing the dataset somewhere that can issue a DOI. I'm not sure you can get one with Google Drive. Maybe datadryad.org? I guess it also depends on how much data we are talking about, and whether you are willing to spend money on hosting at all. |
What type of information is in the training set? Atomic coordinates, types, and predicted QM energies? Bond orders? Topologies? SMILES? We could also consider hosting it ourselves as a mirror. |
@andersx yup, we will host it with DOI! |
I see - I presume you're tossing out formal charges as well, then? If you're also throwing away bond orders/topologies, it might be a little difficult to do reconstruction/debugging, but we can live with an xyz-like format for now. You can probably get away with 16 bits of precision plus compression to reduce the file sizes (internally we use something like the GROMACS XTC format + gzip with 16 bits to drastically reduce sizes). |
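To illustrate the "16 bits of precision + compression" idea in the comment above, here is a minimal sketch using only the Python standard library. It is plain fixed-point quantization followed by gzip, not the actual XTC algorithm, and the names `pack_coords`, `unpack_coords`, and the `SCALE` value are hypothetical choices made for this example, not anything from this thread or the dataset.

```python
import gzip
from array import array

# Assumed resolution: 1/1000 of a coordinate unit. This value is an
# illustration only; it is not taken from the thread or from XTC.
SCALE = 1000


def pack_coords(coords, scale=SCALE):
    """Quantize floats to signed 16-bit integers, then gzip the bytes."""
    quantized = array("h")  # "h" = signed short, 2 bytes on common platforms
    for x in coords:
        q = round(x * scale)
        if not -32768 <= q <= 32767:
            raise ValueError(f"{x} does not fit in 16 bits at scale {scale}")
        quantized.append(q)
    # Native byte order is used, so pack and unpack on the same machine.
    return gzip.compress(quantized.tobytes())


def unpack_coords(blob, scale=SCALE):
    """Inverse of pack_coords; values come back rounded to 1/scale."""
    quantized = array("h")
    quantized.frombytes(gzip.decompress(blob))
    return [q / scale for q in quantized]


coords = [0.0, 1.2345, -3.9871, 12.5]
restored = unpack_coords(pack_coords(coords))
```

Note that this is lossy: each value is recovered only to within half of `1 / SCALE`, and at this scale anything beyond roughly ±32.7 units raises an error, so larger systems would need a coarser scale or per-molecule offsets.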
The whole point of this approach is to be QM-like! It does not rely on anything but element species and coordinates. We ran all systems in DFT and all of them converged successfully. All of them are neutral.
I think we have SMILES strings too. However, for every SMILES string there will be an ensemble (~100-1000) of 3D conformations.
The dataset will be available as a (lossless) HDF5 file with a Python wrapper class.
|
Okay - works for us. |
Closing - thanks guys! |
We're waiting on the full dataset (with augmentation via normal mode sampling) so that we can do a complete validation of the ANI model - @lilleswing recommended using Git LFS for distribution.
Thoughts?