Low memory combine_rows #30

RMeli · 2019-08-02T11:16:44Z

The script combine_rows.py uses a lot of memory on large datasets (such as PDBbind18) and runs out of memory on a 64GB machine.

This PR re-implements the script with a completely different approach: it pre-allocates pandas.DataFrames and populates them one row at a time. This approach considerably reduces memory usage.
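For readers unfamiliar with the pattern, here is a minimal sketch of pre-allocating a frame and filling it row by row. It is not the code in combine_rows_lowmem.py: the helper names `preallocate` and `fill_row` and the target names are hypothetical, and the real script's input format is not assumed here.

```python
import numpy as np
import pandas as pd


def preallocate(names):
    """Return an all-NaN square frame labelled by `names` on both axes.

    The full matrix is allocated once, so memory use is known up front
    and does not grow while rows are being read.
    """
    return pd.DataFrame(np.nan, index=names, columns=names, dtype=np.float64)


def fill_row(frame, name, values):
    """Write one row in place; nothing per-row is kept around afterwards."""
    frame.loc[name, :] = values


# Hypothetical target names, for illustration only.
names = ["a", "b", "c"]
dist = preallocate(names)

# Rows that are never loaded stay NaN, as in the small test below.
fill_row(dist, "b", [0.5, 0.0, 0.25])
print(dist)
```

Rows that were not loaded remain NaN, which makes missing rows easy to spot afterwards.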

The final check is also sped up using numpy instead of Python loops.
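As an illustration of why numpy helps here, the sketch below assumes the final check looks for entries that were never filled (still NaN); the PR description does not spell out what the check does, so treat this as an assumption. The function names are hypothetical.

```python
import numpy as np


def has_missing_loop(m):
    """Pure-Python double loop: one interpreter round trip per element."""
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            if np.isnan(m[i, j]):
                return True
    return False


def has_missing_numpy(m):
    """Vectorised equivalent: the scan over all elements runs inside numpy."""
    return bool(np.isnan(m).any())


def missing_rows(m):
    """Indices of rows that contain at least one NaN."""
    return np.flatnonzero(np.isnan(m).any(axis=1))


m = np.zeros((3, 3))
m[2, 1] = np.nan  # pretend row 2 was not fully populated

print(has_missing_loop(m), has_missing_numpy(m), missing_rows(m))
```

Both functions give the same answer; the vectorised one avoids the per-element Python overhead, which matters for matrices with many thousands of rows.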

Since the approach is completely different from the original script, I created a new script combine_rows_lowmem.py instead of modifying the original one.

Small Test

Loading only rows 1-3 and printing a 5x5 sub-matrix with the new code:

dist =
[[     nan      nan      nan      nan      nan]
 [0.008475 0.       0.016667 0.672362 0.628814]
 [0.025    0.016667 0.       0.664322 0.633333]
 [0.677387 0.672362 0.664322 0.       0.78191 ]
 [     nan      nan      nan      nan      nan]]

lsim=
[[     nan      nan      nan      nan      nan]
 [0.479689 1.       0.290152 0.364799 0.328125]
 [0.221582 0.290152 1.       0.300582 0.323887]
 [0.301352 0.364799 0.300582 1.       0.328125]
 [     nan      nan      nan      nan      nan]]

Loading only rows 1-3 and printing a 5x5 sub-matrix with combine_rows.py:

m =
[[     nan      nan      nan      nan      nan]
 [0.008475 0.       0.016667 0.672362 0.628814]
 [0.025    0.016667 0.       0.664322 0.633333]
 [0.677387 0.672362 0.664322 0.       0.78191 ]
 [     nan      nan      nan      nan      nan]]

lm =
[[     nan      nan      nan      nan      nan]
 [0.479689 1.       0.290152 0.364799 0.328125]
 [0.221582 0.290152 1.       0.300582 0.323887]
 [0.301352 0.364799 0.300582 1.       0.328125]
 [     nan      nan      nan      nan      nan]]
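The outputs of the two scripts above match by eye. A programmatic comparison has to treat NaN as equal to NaN, because unloaded rows are NaN in both. A minimal sketch, assuming both matrices are available as numpy arrays (the helper name `same_matrix` and the tolerance are hypothetical):

```python
import numpy as np


def same_matrix(a, b, atol=1e-6):
    """True if shapes match and values agree, counting NaN == NaN as equal."""
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol, equal_nan=True))


nan = np.nan
# First two rows of the printed dist sub-matrix.
new = np.array([
    [nan, nan, nan, nan, nan],
    [0.008475, 0.0, 0.016667, 0.672362, 0.628814],
])
old = new.copy()

print(same_matrix(new, old))     # NaN-aware comparison
print(np.array_equal(new, old))  # plain equality fails because NaN != NaN
```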


dkoes

Not sure how I missed this PR.

RMeli added 5 commits August 1, 2019 15:26

low memory version of combine-rows

296aa5e

check speedup with numpy

37211e9

reverted change of another PR

a82ec59

reverted correctly

a5a6ced

improved output

465cca8

RMeli changed the title ~~Lowe memory combine_rows~~ Low memory combine_rows Aug 2, 2019

RMeli and others added 10 commits August 8, 2019 10:40

added checks on matrix properties

27cdb12

add output

c244ac6

formatting

5cf14cb

fixed tests

e37efa8

fixed output name

a3f8597

Merge remote-tracking branch 'upstream/master'

e3a2e93

Merge branch 'master' into gsoc/dev

4d7718d

fixed f-string name

cdb7eb8

Merge branch 'gsoc/dev' of https://github.com/RMeli/gninascripts into gsoc/dev

b88264a

Merge remote-tracking branch 'upstream/master' into gsoc/dev

ac99a68

dkoes approved these changes Aug 26, 2020


dkoes merged commit f8334cd into gnina:master Aug 26, 2020

RMeli deleted the gsoc/dev branch February 25, 2022 10:07
