Estimate disk storage needed for all DWPC matrices #91

Merged
merged 9 commits into greenelab:master from disk_storage on Apr 18, 2018

Conversation

@zietzm (Collaborator) commented on Apr 9, 2018

Addresses #76

Several key points from a preliminary investigation of metapaths up to and including length 4:

  • The average matrix size is about 7,850 x 7,850 ~= 61,600,000 numbers stored
  • If an average-sized matrix has 30% density, meaning 70% of the matrix is zero, it takes ~500 MB in dense .npy format and about 175 MB in sparse .npz format.
  • The largest matrices are, of course, G---G matrices, roughly 20,000 x 20,000 = 400,000,000 numbers stored
    • These require about 3.2 GB to store in dense format
    • Applying numpy.log1p to these matrices does not appear to reduce file size
    • In sparse .npz format, these matrices take about 1.1 GB
  • If we store all matrices (length <= 4) in dense .npy form, we can expect this to require about 10 TB
    • 19,716 x 500 MB = 9,858,000 MB ~= 10 TB
    • See the table below for rough calculations using an average size of 493 MB, and the sketch after this list that reproduces them
| Length | Number of paths | Estimated size (GB) |
|--------|-----------------|---------------------|
| 2      | 266             | 131                 |
| 3      | 2,205           | 1,087               |
| 4      | 19,716          | 9,720               |
| 5      | 174,363         | 85,960              |
  • To estimate sparse .npz storage, I looked at the average density of matrices.
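To make the arithmetic behind these estimates easy to re-run, here is a minimal sketch (my own, not code from this PR) that recomputes the dense-storage column of the table from the ~7,850 x 7,850 average matrix size, assuming float64 values and ignoring .npy header overhead:

```python
import numpy as np

# Average DWPC matrix assumed to be ~7,850 x 7,850 float64 values (8 bytes each).
AVG_DIM = 7_850
BYTES_PER_VALUE = np.dtype(np.float64).itemsize  # 8
avg_matrix_mb = AVG_DIM * AVG_DIM * BYTES_PER_VALUE / 1e6  # ~493 MB

# Metapath counts per length, copied from the table above.
metapath_counts = {2: 266, 3: 2_205, 4: 19_716, 5: 174_363}

for length, n_paths in metapath_counts.items():
    total_gb = n_paths * avg_matrix_mb / 1e3
    print(f"length {length}: {n_paths:>7,} metapaths -> ~{total_gb:,.0f} GB dense .npy")
```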

@zietzm (Collaborator, Author) commented on Apr 10, 2018

The latest commits showed that among the Rephetio metapaths, the average density was almost exactly 20%. At this density, we get about a 77% reduction in file size by storing sparse .npz matrices on disk. Assuming that the Rephetio metapaths are a representative sample with respect to density, all metapaths up to length 4 should take around 2.3 TB if stored in sparse .npz format.

What I would like to investigate is how much disk space we could save if we strategically selected between sparse and dense matrices. For example, a very low density, very large matrix may be worth storing in sparse format, while high-density or small matrices get less benefit.

This is all assuming that ~10 TB is larger than ideal for our storage purposes.

Edit: See the latest commits, where we get the following for some G-G metapaths (example metapath GdCpDuG, with density 0.00092):

Dense: 3510 MB
Sparse: 0.7248 MB

In cases like this, it surely only makes sense to store the matrix as sparse.

[attached plots: density, file_size]
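As a rough illustration of what strategically selecting between formats could look like, the sketch below (the helper names and the 4-byte index assumption are mine, not part of this PR) estimates both representations and saves whichever is expected to be smaller; the sparse estimate ignores the extra zip compression that scipy.sparse.save_npz applies:

```python
import numpy as np
import scipy.sparse


def dense_size_mb(matrix):
    """Approximate dense .npy size in MB (header overhead ignored)."""
    return matrix.shape[0] * matrix.shape[1] * matrix.dtype.itemsize / 1e6


def sparse_size_mb(matrix):
    """Approximate CSC size in MB: one value plus one int32 row index per
    nonzero, plus one int32 pointer per column (zip compression ignored)."""
    nnz = np.count_nonzero(matrix)
    return (nnz * (matrix.dtype.itemsize + 4) + 4 * (matrix.shape[1] + 1)) / 1e6


def save_adaptively(matrix, path_stem):
    """Store the matrix in whichever format is estimated to be smaller."""
    if sparse_size_mb(matrix) < dense_size_mb(matrix):
        scipy.sparse.save_npz(f"{path_stem}.sparse.npz", scipy.sparse.csc_matrix(matrix))
    else:
        np.save(f"{path_stem}.npy", matrix)
```

For a case like GdCpDuG above (density 0.00092, dense 3510 MB vs. sparse 0.7 MB), this heuristic would clearly pick the sparse path.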

@dhimmel (Collaborator) commented on Apr 10, 2018

Haven't looked at the notebook yet, but wanted to make a point about floating point numbers. I was wrong that we would want to transform values, e.g. log1p(DWPC). I believe our matrices currently use a dtype of numpy.float64. Each value therefore requires 64 bits (8 bytes). Alternatively, we could explore:

Respect the #mantissa

Note that floating point numbers have more precision closer to zero, so we don't need to perform any transformation to make values larger. I think we could probably get by with float32, but perhaps not float16.

Another thing we should consider is compressing .npy files. I assume common compression algorithms like xz would achieve savings similar to those of compressed scipy.sparse .npz files. The big disadvantage of .npy.xz files would be if we were dealing with a matrix that was very sparse but very large and therefore could never be fully loaded in Python, unless using a sparse data structure. The advantage of using .npy.xz is that the conversion between .npy.xz and .npy is straightforward. As long as we don't need mem-mapped reading, we could keep .npy files compressed on disk.
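A minimal sketch of that .npy.xz round trip, using Python's built-in lzma module (the file names are hypothetical and the actual compression ratio would need to be measured on real DWPC matrices):

```python
import io
import lzma
import pathlib

import numpy as np

# Optionally drop to single precision first, halving the uncompressed size.
dwpc = np.load("GdCpDuG.npy").astype(np.float32)  # hypothetical example file
np.save("GdCpDuG.float32.npy", dwpc)

# Compress the .npy bytes with xz; decompression restores a byte-identical .npy,
# so the conversion is lossless (but mem-mapped reading is no longer possible).
raw = pathlib.Path("GdCpDuG.float32.npy").read_bytes()
pathlib.Path("GdCpDuG.float32.npy.xz").write_bytes(lzma.compress(raw, preset=6))

# Reading the compressed file back without writing an intermediate .npy to disk:
compressed = pathlib.Path("GdCpDuG.float32.npy.xz").read_bytes()
dwpc_again = np.load(io.BytesIO(lzma.decompress(compressed)))
```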

@dhimmel (Collaborator) commented on Apr 10, 2018

The size of a .npy file seems pretty straightforward:

megabytes for a .npy file = n_rows * n_cols * (fp_bits / 8) / 1_000_000
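For completeness, the same formula as a small helper (the function name is mine), checked against two sizes quoted earlier in this thread:

```python
def npy_megabytes(n_rows, n_cols, fp_bits=64):
    """Approximate .npy size in MB; the tiny .npy header is negligible."""
    return n_rows * n_cols * (fp_bits / 8) / 1_000_000

npy_megabytes(20_000, 20_000)  # ~3,200 MB, the dense G---G estimate above
npy_megabytes(7_850, 7_850)    # ~493 MB, the average-matrix estimate above
```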

@dhimmel (Collaborator) commented on Apr 10, 2018

Two more thoughts. We often scale & transform DWPCs so they are in the range of 0-6. If we do this transformation, we won't have such small numbers (although this may or may not matter). We also have to think about permuted hetnets: if we're also storing DWPC matrices for 5 permuted hetnets, that will 6x our current estimates. Perhaps it's time to revive the fabled R-DWPC, which would transform each individual DWPC value to be relative to its permuted derivatives.


@zietzm if you're around, I think we should meet up to do a brainstorm ⚡

@zietzm (Collaborator, Author) commented on Apr 17, 2018

Just pushed a quick commit which found, unsurprisingly, that half precision (np.float16) takes one quarter of the on-disk size of double precision (np.float64), and half that of single precision (np.float32). This is entirely expected.

Something to consider, though, is running a test case to see what precision we actually need for accuracy. Presumably double precision is not needed to get good prediction accuracy, but I can't attest to that myself.

@dhimmel, could you suggest a way to investigate this?
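One way both questions (on-disk size and precision loss) could be probed together is sketched below, using a random matrix as a stand-in for a real DWPC matrix (dimensions and file name are illustrative only):

```python
import pathlib

import numpy as np

rng = np.random.default_rng(0)
dwpc = rng.random((7_850, 7_850))  # stand-in for a real DWPC matrix, float64, ~0.5 GB in memory

for dtype in (np.float64, np.float32, np.float16):
    np.save("tmp_dwpc.npy", dwpc.astype(dtype))
    size_mb = pathlib.Path("tmp_dwpc.npy").stat().st_size / 1e6
    # Relative rounding error introduced by casting down and back up.
    roundtrip = dwpc.astype(dtype).astype(np.float64)
    rel_err = np.max(np.abs(roundtrip - dwpc) / np.maximum(dwpc, np.finfo(float).tiny))
    print(f"{np.dtype(dtype).name}: {size_mb:.0f} MB on disk, max relative error {rel_err:.1e}")
```

float16 carries roughly 3 significant decimal digits (max relative rounding error around 5e-4); whether that is acceptable would still need to be checked against actual prediction performance.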

@dhimmel (Collaborator) left a review comment

PR looks good to me. We can merge, or keep it open if there's more you'd like to do as part of it (like seeing if the precision loss affects DWPCs).

One request would be to add a title cell to each notebook, so viewers can quickly read what the notebook will be exploring.

@zietzm (Collaborator, Author) commented on Apr 18, 2018

> like seeing if the precision loss affects DWPCs

I think I may prefer to do this in a future PR.

Just added headers, so if everything on this one looks good, I'd be happy to merge it and continue with DWPC precision, etc. in a separate one.

@dhimmel changed the title from "[WIP] Estimate disk storage needed for all DWPC matrices" to "Estimate disk storage needed for all DWPC matrices" on Apr 18, 2018
@dhimmel merged commit eecf282 into greenelab:master on Apr 18, 2018
@zietzm deleted the disk_storage branch on April 18, 2018
dhimmel pushed a commit to hetio/hetmatpy that referenced this pull request Nov 7, 2018