Estimate disk storage needed for all DWPC matrices #91
Conversation
The latest commits showed that among the Rephetio metapaths, the average density was almost exactly 20%. At this density, we get about a 77% reduction in file size by using sparse matrices.

What I would like to investigate is how much we could save on disk space if we strategically selected between sparse and dense matrices. For example, if we have a very low-density, very large matrix, we may want to store it in sparse format, while for high-density or small matrices we get less benefit. This is assuming that ~10 TB is larger than ideal for our storage purposes.

Edit: See the latest commits, where we get, for some … In cases like this it can surely only make sense to store this matrix as sparse.
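A minimal sketch of the kind of per-matrix estimate such a selection could be based on, assuming float64 values and CSR-style sparse storage with 32-bit indices (the function name and the threshold at the end are illustrative, not from the repository):

```python
import numpy as np

def estimated_bytes(shape, density, dtype=np.float64):
    """Rough uncompressed storage estimates for one matrix:
    dense .npy-style buffer vs. CSR-style sparse storage."""
    n_rows, n_cols = shape
    n_elements = n_rows * n_cols
    itemsize = np.dtype(dtype).itemsize
    dense = n_elements * itemsize
    nnz = int(n_elements * density)
    # CSR keeps the nonzero values, their column indices (int32 here),
    # and a row-pointer array of length n_rows + 1.
    sparse = nnz * (itemsize + 4) + (n_rows + 1) * 4
    return dense, sparse

# Example: a 20,000 x 20,000 matrix at 20% density.
dense, sparse = estimated_bytes((20_000, 20_000), density=0.20)
print(f"dense:  {dense / 1e9:.2f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")

# One possible rule: only use sparse storage when it is, say,
# at least twice as small as the dense equivalent.
use_sparse = sparse < dense / 2
```

At 20,000 x 20,000 and 20% density this gives roughly 3.2 GB dense versus about 1 GB of uncompressed sparse storage; compressed `.npz` files would come in smaller still.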
Haven't looked at the notebook yet, but wanted to make a point about floating point numbers. I was wrong that we would want to transform values, e.g. log1p(DWPC). I believe currently our matrices use a dtype of float64.

Note that floating point numbers have more precision closer to zero, so we don't need to perform any transformation to make values larger. I think we could probably get by with float32, but perhaps not float16.

Another thing we should consider is compressing .npy files. I assume common compression algorithms like xz would achieve savings similar to those of compressed scipy.sparse .npz files. The big disadvantage of .npy.xz files would arise if we were dealing with a matrix that was very sparse but very large, and therefore could never be fully loaded in Python unless using a sparse data structure. The advantage of .npy.xz is that conversion between .npy.xz and .npy is straightforward. As long as we don't need mem-mapped reading, we could keep .npy files compressed on disk.
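As a sketch of that .npy.xz round trip (file names are hypothetical, and this assumes the array fits in memory, since mem-mapping is lost once the file is compressed):

```python
import lzma
import numpy as np

# Hypothetical file name; any matrix saved with np.save would work.
matrix = np.random.default_rng(0).random((1_000, 1_000))
np.save("dwpc_example.npy", matrix)

# Compress the existing .npy bytes into a .npy.xz file.
with open("dwpc_example.npy", "rb") as src, lzma.open("dwpc_example.npy.xz", "wb") as dst:
    dst.write(src.read())

# Loading only requires decompressing first; there is no mem-mapped access,
# so the full array must fit in memory.
with lzma.open("dwpc_example.npy.xz", "rb") as f:
    restored = np.load(f)

assert np.array_equal(matrix, restored)
```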
The size of a .npy file is pretty straightforward, it seems.
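For reference, a sketch of that calculation: an .npy file is essentially a small header (typically 128 bytes for simple arrays, though the exact size can vary) plus the raw, uncompressed array buffer.

```python
import numpy as np

def npy_size_bytes(shape, dtype=np.float64, header_bytes=128):
    """Approximate .npy file size: a small header plus the raw array buffer."""
    return header_bytes + int(np.prod(shape)) * np.dtype(dtype).itemsize

# A 20,000 x 20,000 float64 matrix: about 3.2 GB on disk.
print(npy_size_bytes((20_000, 20_000)) / 1e9)
```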
Two more thoughts. We often do scale & transform DWPCs so they are in the range of 0-6. If we do this transformation, we won't have as small of numbers (although this may or may not matter). We also have to think about permuted hetnets. If we're storing 5 permuted DWPC matrices, that will 6x our current estimates. Perhaps it's time to revive the fabled R-DWPC, which would transform each individual DWPC value to be relative to the permuted derivatives. @zietzm if you're around, I think we should meet up to do a brainstorm ⚡
Just pushed a quick commit; no surprises with regard to half precision. But something to consider is doing a test case so that we can see what precision we need for accuracy. Presumably double precision is not needed in order to get good prediction accuracy, but I really can't attest to that myself. @dhimmel, could you suggest a way to investigate this?
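One possible starting point (not necessarily what is being asked for here) would be to measure the pure round-trip error of downcasting a float64 DWPC matrix before tying it to prediction accuracy; the matrix below is a synthetic stand-in, not real DWPC data.

```python
import numpy as np

def downcast_error(matrix, dtype):
    """Worst-case relative error from casting float64 values down to
    a lower-precision dtype and back."""
    lowered = matrix.astype(dtype).astype(np.float64)
    nonzero = matrix != 0
    relative = np.abs(lowered[nonzero] - matrix[nonzero]) / np.abs(matrix[nonzero])
    return relative.max()

# Synthetic stand-in for a DWPC matrix: ~20% of entries nonzero.
rng = np.random.default_rng(0)
dwpc = rng.random((2_000, 2_000)) * (rng.random((2_000, 2_000)) < 0.2)

for dtype in (np.float16, np.float32):
    print(dtype.__name__, downcast_error(dwpc, dtype))

# Values smaller than the target dtype can represent
# (np.finfo(np.float16).tiny is about 6e-5) lose most or all precision,
# so the worst case is driven by the smallest nonzero DWPCs.
```

The next step would be to rerun whatever prediction evaluation we trust with the downcast matrices and see whether performance actually changes.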
PR looks good to me. We can merge or keep open if there's more you'd like to do as part of it (like seeing if the precision loss affects DWPCs).
One request would be to add a title cell to each notebook, so viewers can quickly read what the notebook will be exploring.
I think I may prefer to do this in a future PR. Just added headers, so if everything on this one looks good, I'd be happy to merge it and continue with DWPC precision, etc. in a separate one.
Merges greenelab/connectivity-search-analyses#94 Refs greenelab/connectivity-search-analyses#91 (comment) Former-commit-id: 10b80b5076e7fa81d7c77bdddd7a58970e33fb05
Addresses #76
Several key points from a tentative investigation of metapaths up to and including length 4:
- … in dense `.npy` format and about 175 MB in sparse `.npz` format; these are `G---G` matrices, roughly 20,000 x 20,000 = 400,000,000 numbers stored.
- Applying `numpy.log1p` to these matrices does not appear to reduce file size. In `.npz` format, these matrices take about 1.1 GB.
- Stored in dense `.npy` form, we can expect this to require about 10 TB.
- To estimate `.npz` storage, I looked at the average density of matrices.
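A quick way to reproduce numbers of this kind for a single matrix is to write out both formats and check the on-disk sizes; the shape and density below are a smaller stand-in for the `G---G` case, so the absolute sizes will differ from those above.

```python
import os
import numpy as np
import scipy.sparse

# Smaller stand-in for a G---G-like matrix at roughly 20% density.
matrix = scipy.sparse.random(4_000, 4_000, density=0.20, format="csr", random_state=0)

np.save("example_dense.npy", matrix.toarray())
scipy.sparse.save_npz("example_sparse.npz", matrix)  # compressed by default

for path in ("example_dense.npy", "example_sparse.npz"):
    size_mb = os.path.getsize(path) / 1e6
    print(f"{path}: {size_mb:.1f} MB")
```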