
High-Throughput Screening Reliability #10

Open
patrickfuller opened this issue Sep 1, 2014 · 3 comments

@patrickfuller
Member

This is a set of issues around running 100k+ simulations with RASPA, stemming mainly from its memory leaks. RASPA currently leaks a significant amount of memory, which accumulates over many simulations. The "solution" currently used is to run every simulation in its own process, i.e.:

  • Spin up a new process
  • Run the simulation
  • Shut down the process, letting the OS reclaim the leaked memory

This approach really slows down high-throughput screening, causes a lot of unexplained segmentation faults, and generally does not play well with other programs. Instead, screening should use one process per core, and each process should be capable of running an unlimited number of simulations serially (sketched below).
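For reference, a minimal sketch of that one-process-per-core model using Python's multiprocessing.Pool; run_simulation() here is a hypothetical stand-in for a call into libraspa, not an API from this repository:

from multiprocessing import Pool, cpu_count

def run_simulation(input_file):
    # Hypothetical stand-in for a call into libraspa.
    return "result for {}".format(input_file)

if __name__ == "__main__":
    inputs = ["simulation_{}.input".format(i) for i in range(100000)]
    # One long-lived worker per core; each worker runs many simulations serially.
    pool = Pool(processes=cpu_count())
    try:
        for result in pool.imap_unordered(run_simulation, inputs):
            pass  # stream each result to disk or a database here
    finally:
        pool.close()
        pool.join()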

History of debugging this:

  • libraspa shut down after 2-3 runs. This was due to a ridiculous amount of memory (~400 MB) leaking in movies.c. 72002b9 fixed this.
  • libraspa shut down after 128 runs. This was due to 4 dangling file pointers per run and an OS limit of 512 open file descriptors per process (128 * 4 = 512); the snippet after this list shows how to check that limit. a659eba cleaned up the dangling file pointers and removed this cap.
  • libraspa shut down after ~1000 runs. This is believed to be due to RASPA printing random simulation info to logfiles, which grew quickly. 38db332 fixed this by removing the relevant stderr printing when streaming.
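As an aside, checking the per-process open-file limit behind that 128-run ceiling is a one-liner with Python's resource module (POSIX-only; ulimit -n reports the same number from a shell):

import resource

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit: {}, hard limit: {}".format(soft, hard))

# With 4 leaked descriptors per run, a soft limit of 512 allows 512 / 4 = 128 runs.
print("runs before exhaustion: {}".format(soft // 4))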

Currently, libraspa shuts down after ~5000 runs. I think this is because of the remaining ~7 MB of memory leaked per run: 7 MB * 5000 = 35 GB, and my computer has 32 GB of memory.
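A back-of-the-envelope check of that hypothesis (a sketch; the ~7 MB per-run figure comes from the still reachable total in the valgrind summary below):

# Rough estimate of how many serial runs fit in memory before the process dies,
# assuming the per-run leak is never reclaimed while the process stays alive.
leak_per_run_mb = 7              # approximate still-reachable memory leaked per run
available_memory_mb = 32 * 1024  # 32 GB of RAM on the machine in question

max_runs = available_memory_mb // leak_per_run_mb
print(max_runs)  # ~4681, consistent with the observed ~5000-run shutdown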

For context, the valgrind output as of today from:

▶ valgrind --log-file="valgrind.txt" --leak-check=full --show-leak-kinds=all ~/Dropbox/Github/RASPA2/simulations/bin/./simulate -i simulation.input

is 14,378 lines long and ends in

==3574== 
==3574== LEAK SUMMARY:
==3574==    definitely lost: 4,530 bytes in 9 blocks
==3574==    indirectly lost: 0 bytes in 0 blocks
==3574==      possibly lost: 0 bytes in 0 blocks
==3574==    still reachable: 7,569,987 bytes in 3,932 blocks
==3574==         suppressed: 0 bytes in 0 blocks
==3574== 

To fix, I'll have to go through this list.
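Going through a 14,378-line log by hand is painful, so here's a small helper (a sketch, assuming the valgrind.txt produced by the command above) that prints only the loss-record headers and the final summary, skipping the stack traces in between:

# Print each loss-record header plus the final summary from a valgrind log.
keywords = ("in loss record", "LEAK SUMMARY", "lost:",
            "still reachable:", "suppressed:")

with open("valgrind.txt") as handle:
    for line in handle:
        if any(keyword in line for keyword in keywords):
            print(line.rstrip())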

@patrickfuller
Member Author

fba3cb4 got that still reachable number down.

==16717== LEAK SUMMARY:
==16717==    definitely lost: 4,530 bytes in 9 blocks
==16717==    indirectly lost: 0 bytes in 0 blocks
==16717==      possibly lost: 0 bytes in 0 blocks
==16717==    still reachable: 1,004,323 bytes in 2,434 blocks
==16717==         suppressed: 0 bytes in 0 blocks

Still 2,030 valgrind errors.

Although still reachable memory will be properly freed on process exit, reducing it matters when one process runs many simulations serially: the process will crash once still reachable * number of simulations exceeds available memory.

definitely lost should be completely eliminated. No questions there.
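Until those leaks are fixed, one way to bound the still reachable buildup (not what this repo does today, just a sketch) is to recycle each worker after a fixed number of runs via multiprocessing.Pool's maxtasksperchild argument; run_simulation() is again a hypothetical stand-in for a libraspa call:

from multiprocessing import Pool, cpu_count

def run_simulation(input_file):
    # Hypothetical stand-in for a call into libraspa.
    return "result for {}".format(input_file)

if __name__ == "__main__":
    inputs = ["simulation_{}.input".format(i) for i in range(100000)]
    # With ~7 MB leaked per run and 15 workers sharing 32 GB, roughly
    # 32768 MB / 15 / 7 MB = ~312 runs per worker is the ceiling, so 250 is a
    # safe cap; each worker is replaced after that many runs.
    pool = Pool(processes=cpu_count(), maxtasksperchild=250)
    try:
        results = pool.map(run_simulation, inputs)
    finally:
        pool.close()
        pool.join()

This keeps the per-worker leak bounded while still amortizing process startup over hundreds of runs instead of paying it on every run.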

patrickfuller added a commit that referenced this issue Sep 4, 2014
@patrickfuller
Member Author

95d0096 isolates the RASPA C code into its own process, which is constructed and terminated on every RASPA run. This allows the OS to come in and free the still reachable memory, but comes at the cost of complexity and speed (in NuMat's high-throughput screen, this turns 15 processes into >1,000,000 processes).

Not closing this issue because the right way to do this is to track down all the memory leaks, fix them, and then undo 95d0096. We're sacrificing speed until this happens.

@patrickfuller
Member Author

Here's a basic test script for measuring process startup / shutdown time:

from multiprocessing import Process, Pipe
from time import sleep

def f(conn):
    # Pretend to do one second of simulation work, then send back a result.
    sleep(1)
    conn.send(5 * 5 * 5 * 5 * 5)
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    output = parent_conn.recv()  # blocks until the child sends its result
    p.join()                     # the child has already exited, so terminate() is unnecessary
    print(output)

Run with:

python -m cProfile -s time test.py

Seems that the overhead is about 50 ms per process. That gives a low-end estimate of 0.05 s * 1,000,000 / 3,600 ≈ 13.9 hours of added computation in a high-throughput screen. There's probably a relationship between construction/destruction time and the number of concurrent processes, so I'd expect this number to be more like ~100 hours. Still much better than the previous 3-5 second OS shutdown approach (at 5 s per process, 5 * 1,000,000 / 86,400 ≈ 58 days of added computation!).
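A more direct way to measure that same overhead, without going through cProfile (a sketch; the numbers will vary by machine and OS, and perf_counter needs Python 3.3+):

from multiprocessing import Process
from time import perf_counter

def noop():
    pass

if __name__ == "__main__":
    n = 100
    start = perf_counter()
    for _ in range(n):
        p = Process(target=noop)
        p.start()
        p.join()
    elapsed = perf_counter() - start
    # Average per-process construction + teardown time, in milliseconds.
    print("{:.1f} ms per process".format(1000 * elapsed / n))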
