Estimate disk storage needed for all DWPC matrices #91

Merged
merged 9 commits into greenelab:master from disk_storage on Apr 18, 2018

Conversation

@zietzm (Collaborator) commented on Apr 9, 2018

Addresses #76

Several key points from a preliminary investigation of metapaths up to and including length 4:

  • The average matrix size is about 7,850 x 7,850 ~= 61,600,000 numbers stored
  • If an average-sized matrix has 30% density, meaning 70% of the matrix is zero, it takes ~500 MB in dense .npy format and about 175 MB in sparse .npz format.
  • The largest matrices are, of course, G---G matrices, roughly 20,000 x 20,000 = 400,000,000 numbers stored
    • These require about 3.2 GB to store in dense format
    • Applying numpy.log1p to these matrices does not appear to reduce file size
    • In sparse .npz format, these matrices take about 1.1 GB
  • If we store all matrices (length <= 4) in dense .npy form, we can expect this to require about 10 TB
    • 19,716 x 500 MB = 9,858,000 MB ~= 10 TB
    • See the table below for rough calculations using an average size of 493 MB, and the sketch after this list that reproduces them
| Length | Number of paths | Estimated size (GB) |
|--------|-----------------|---------------------|
| 2      | 266             | 131                 |
| 3      | 2,205           | 1,087               |
| 4      | 19,716          | 9,720               |
| 5      | 174,363         | 85,960              |
  • To estimate sparse .npz storage, I looked at the average density of matrices.
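To make the arithmetic behind these estimates easy to re-run, here is a minimal sketch (my own, not code from this PR) that recomputes the dense-storage column of the table from the ~7,850 x 7,850 average matrix size, assuming float64 values and ignoring .npy header overhead:

```python
import numpy as np

# Average DWPC matrix assumed to be ~7,850 x 7,850 float64 values (8 bytes each).
AVG_DIM = 7_850
BYTES_PER_VALUE = np.dtype(np.float64).itemsize  # 8
avg_matrix_mb = AVG_DIM * AVG_DIM * BYTES_PER_VALUE / 1e6  # ~493 MB

# Metapath counts per length, copied from the table above.
metapath_counts = {2: 266, 3: 2_205, 4: 19_716, 5: 174_363}

for length, n_paths in metapath_counts.items():
    total_gb = n_paths * avg_matrix_mb / 1e3
    print(f"length {length}: {n_paths:>7,} metapaths -> ~{total_gb:,.0f} GB dense .npy")
```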

@zietzm (Collaborator, Author) commented on Apr 10, 2018

The latest commits showed that among the Rephetio metapaths, the average density was almost exactly 20%. At this density, we get about a 77% reduction in file size by storing sparse .npz matrices on disk. Assuming that the Rephetio metapaths are a representative sample with respect to density, all metapaths up to length 4 should take around 2.3 TB if stored in sparse .npz format.

What I would like to investigate is how much disk space we could save if we strategically selected between sparse and dense matrices. For example, a very low density, very large matrix may be worth storing in sparse format, while high-density or small matrices get less benefit.

This is all assuming that ~10 TB is larger than ideal for our storage purposes.

Edit: See the latest commits, where we get the following for some G-G metapaths (example metapath GdCpDuG, with density 0.00092):

Dense: 3510 MB
Sparse: 0.7248 MB

In cases like this, it surely only makes sense to store the matrix as sparse.

[attached plots: density, file_size]
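As a rough illustration of what strategically selecting between formats could look like, the sketch below (the helper names and the 4-byte index assumption are mine, not part of this PR) estimates both representations and saves whichever is expected to be smaller; the sparse estimate ignores the extra zip compression that scipy.sparse.save_npz applies:

```python
import numpy as np
import scipy.sparse


def dense_size_mb(matrix):
    """Approximate dense .npy size in MB (header overhead ignored)."""
    return matrix.shape[0] * matrix.shape[1] * matrix.dtype.itemsize / 1e6


def sparse_size_mb(matrix):
    """Approximate CSC size in MB: one value plus one int32 row index per
    nonzero, plus one int32 pointer per column (zip compression ignored)."""
    nnz = np.count_nonzero(matrix)
    return (nnz * (matrix.dtype.itemsize + 4) + 4 * (matrix.shape[1] + 1)) / 1e6


def save_adaptively(matrix, path_stem):
    """Store the matrix in whichever format is estimated to be smaller."""
    if sparse_size_mb(matrix) < dense_size_mb(matrix):
        scipy.sparse.save_npz(f"{path_stem}.sparse.npz", scipy.sparse.csc_matrix(matrix))
    else:
        np.save(f"{path_stem}.npy", matrix)
```

For a case like GdCpDuG above (density 0.00092, dense 3510 MB vs. sparse 0.7 MB), this heuristic would clearly pick the sparse path.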

@dhimmel (Collaborator) commented on Apr 10, 2018

Haven't looked at the notebook yet, but wanted to make a point about floating point numbers. I was wrong that we would want to transform values, e.g. log1p(DWPC). I believe our matrices currently use a dtype of numpy.float64. Each value therefore requires 64 bits (8 bytes). Alternatively, we could explore:

Respect the #mantissa

Note that floating point numbers have more precision closer to zero, so we don't need to perform any transformation to make values larger. I think we could probably get by with float32, but perhaps not float16.

Another thing we should consider is compressing .npy files. I assume common compression algorithms like xz would achieve savings similar to those of compressed scipy.sparse .npz files. The big disadvantage of .npy.xz files would be if we were dealing with a matrix that was very sparse but very large and therefore could never be fully loaded in Python, unless using a sparse data structure. The advantage of using .npy.xz is that the conversion between .npy.xz and .npy is straightforward. As long as we don't need mem-mapped reading, we could keep .npy files compressed on disk.
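A minimal sketch of that .npy.xz round trip, using Python's built-in lzma module (the file names are hypothetical and the actual compression ratio would need to be measured on real DWPC matrices):

```python
import io
import lzma
import pathlib

import numpy as np

# Optionally drop to single precision first, halving the uncompressed size.
dwpc = np.load("GdCpDuG.npy").astype(np.float32)  # hypothetical example file
np.save("GdCpDuG.float32.npy", dwpc)

# Compress the .npy bytes with xz; decompression restores a byte-identical .npy,
# so the conversion is lossless (but mem-mapped reading is no longer possible).
raw = pathlib.Path("GdCpDuG.float32.npy").read_bytes()
pathlib.Path("GdCpDuG.float32.npy.xz").write_bytes(lzma.compress(raw, preset=6))

# Reading the compressed file back without writing an intermediate .npy to disk:
compressed = pathlib.Path("GdCpDuG.float32.npy.xz").read_bytes()
dwpc_again = np.load(io.BytesIO(lzma.decompress(compressed)))
```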

@dhimmel (Collaborator) commented on Apr 10, 2018

The size of a .npy file seems pretty straightforward:

megabytes for a .npy file = n_rows * n_cols * (fp_bits / 8) / 1_000_000
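For completeness, the same formula as a small helper (the function name is mine), checked against two sizes quoted earlier in this thread:

```python
def npy_megabytes(n_rows, n_cols, fp_bits=64):
    """Approximate .npy size in MB; the tiny .npy header is negligible."""
    return n_rows * n_cols * (fp_bits / 8) / 1_000_000

npy_megabytes(20_000, 20_000)  # ~3,200 MB, the dense G---G estimate above
npy_megabytes(7_850, 7_850)    # ~493 MB, the average-matrix estimate above
```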

@dhimmel (Collaborator) commented on Apr 10, 2018

Two more thoughts. We often scale & transform DWPCs so they are in the range of 0-6. If we do this transformation, we won't have such small numbers (although this may or may not matter). We also have to think about permuted hetnets: if we're also storing DWPC matrices for 5 permuted hetnets, that will 6x our current estimates. Perhaps it's time to revive the fabled R-DWPC, which would transform each individual DWPC value to be relative to its permuted derivatives.


@zietzm if you're around, I think we should meet up to do a brainstorm ⚡

@zietzm (Collaborator, Author) commented on Apr 17, 2018

Just pushed a quick commit which found, unsurprisingly, that half precision (np.float16) takes one quarter of the on-disk size of double precision (np.float64), and half that of single precision (np.float32). This is entirely expected.

Something to consider, though, is running a test case to see what precision we actually need for accuracy. Presumably double precision is not needed to get good prediction accuracy, but I can't attest to that myself.

@dhimmel, could you suggest a way to investigate this?
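One way both questions (on-disk size and precision loss) could be probed together is sketched below, using a random matrix as a stand-in for a real DWPC matrix (dimensions and file name are illustrative only):

```python
import pathlib

import numpy as np

rng = np.random.default_rng(0)
dwpc = rng.random((7_850, 7_850))  # stand-in for a real DWPC matrix, float64, ~0.5 GB in memory

for dtype in (np.float64, np.float32, np.float16):
    np.save("tmp_dwpc.npy", dwpc.astype(dtype))
    size_mb = pathlib.Path("tmp_dwpc.npy").stat().st_size / 1e6
    # Relative rounding error introduced by casting down and back up.
    roundtrip = dwpc.astype(dtype).astype(np.float64)
    rel_err = np.max(np.abs(roundtrip - dwpc) / np.maximum(dwpc, np.finfo(float).tiny))
    print(f"{np.dtype(dtype).name}: {size_mb:.0f} MB on disk, max relative error {rel_err:.1e}")
```

float16 carries roughly 3 significant decimal digits (max relative rounding error around 5e-4); whether that is acceptable would still need to be checked against actual prediction performance.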

@dhimmel (Collaborator) left a review comment

PR looks good to me. We can merge, or keep it open if there's more you'd like to do as part of it (like seeing if the precision loss affects DWPCs).

One request would be to add a title cell to each notebook, so viewers can quickly read what the notebook will be exploring.

@zietzm (Collaborator, Author) commented on Apr 18, 2018

> like seeing if the precision loss affects DWPCs

I think I may prefer to do this in a future PR.

Just added headers, so if everything on this one looks good, I'd be happy to merge it and continue with DWPC precision, etc. in a separate one.

@dhimmel changed the title from "[WIP] Estimate disk storage needed for all DWPC matrices" to "Estimate disk storage needed for all DWPC matrices" on Apr 18, 2018
@dhimmel merged commit eecf282 into greenelab:master on Apr 18, 2018
@zietzm deleted the disk_storage branch on April 18, 2018
dhimmel pushed a commit to hetio/hetmatpy that referenced this pull request Nov 7, 2018