Potential memory leak #1014

Closed · chongchenmath opened this issue May 31, 2017 · 24 comments

@chongchenmath

There seems to be a potential memory leak somewhere in odl. When running the code from @davlars' repo to make reconstructions on simulated 3D data, the used RAM increases continuously. Trying a TV reconstruction with the Chambolle-Pock solver, we (@aringh and I) get the following memory usage over the iterations:

iter = 0
CPU usage: [11.6, 12.5, 49.4, 11.5, 9.2, 9.2, 56.5, 8.8]
RAM usage: svmem(total=67514511360, available=27549761536, percent=59.2, used=39263768576, free=20183449600, active=46021120000, inactive=513998848, buffers=47173632, cached=8020119552, shared=38600704)
SWAP usage: sswap(total=4294963200, used=2080911360, free=2214051840, percent=48.5, sin=28048973824, sout=45718224896)

[...]

iter = 100
CPU usage: [19.8, 19.1, 17.1, 31.8, 13.6, 14.0, 12.8, 75.2]
RAM usage: svmem(total=67514511360, available=25842774016, percent=61.7, used=40964677632, free=18139529216, active=48008773632, inactive=556429312, buffers=60911616, cached=8349392896, shared=39292928)
SWAP usage: sswap(total=4294963200, used=2048077824, free=2246885376, percent=47.7, sin=28079247360, sout=45718224896)

[...]

iter = 500
CPU usage: [16.5, 19.1, 25.1, 45.3, 14.7, 14.1, 13.9, 61.1]
RAM usage: svmem(total=67514511360, available=18392444928, percent=72.8, used=48325439488, free=11604901888, active=50951569408, inactive=3880685568, buffers=491245568, cached=7092924416, shared=43077632)
SWAP usage: sswap(total=4294963200, used=1929170944, free=2365792256, percent=44.9, sin=28226437120, sout=45718732800)

As can be seen, the percentage of used RAM increases from 59.2 to 72.8 over the iterations.

We do not yet know where the leak is, and it needs further investigation. Can anyone suggest a good way to do it?

The code used to produce the above printouts is the following (note, however, that as-is it takes several hours to run 500 iterations):

import odl
import numpy as np
import os
import adutils
import sys
import psutil

saveReco = True
savePath = '/home/chchen/dev/ad-skull-reconstruction/reco/'


# Create a text file and write everything both in terminal and in file
if not os.path.exists(savePath):
    os.makedirs(savePath)

output_filename = 'Terminal_output_TV.txt'

class Logger(object):
    def __init__(self, filename="Default.log"):
        self.terminal = sys.stdout
        self.log = open(filename, "a")
 
    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)
 
    def flush(self):
        self.terminal.flush()
        self.log.flush()

sys.stdout = Logger(savePath + output_filename)


# Discretization
space = adutils.get_discretization()

# Forward operator (in the form of a broadcast operator)
A = adutils.get_ray_trafo(space)

# Data
rhs = adutils.get_data(A)

# Construct operators and functionals
gradient = odl.Gradient(space)

# Column vector of operators
op = odl.BroadcastOperator(A, gradient)

Anorm = odl.power_method_opnorm(A, maxiter=2)
Dnorm = odl.power_method_opnorm(gradient,
                                xstart=odl.phantom.white_noise(gradient.domain),
                                maxiter=10)

# Estimated operator norm, add 10 percent
op_norm = 1.1 * np.sqrt(len(A.operators)*(Anorm**2) + Dnorm**2)

print('Norm of the product space operator: {}'.format(op_norm))

lamb = 0.005 

# l2-squared data matching
l2_norm = odl.solvers.L2NormSquared(A.range).translated(rhs)

# Isotropic TV-regularization i.e. the l1-norm
l1_norm = lamb * odl.solvers.L1Norm(gradient.range)

# Combine functionals
f = odl.solvers.SeparableSum(l2_norm, l1_norm)

# Set g functional to zero
g = odl.solvers.ZeroFunctional(op.domain)

# Acceleration parameter
gamma = 0.4

# Step size for the proximal operator for the primal variable x
tau = 1.0 / op_norm

# Step size for the proximal operator for the dual variable y
sigma = 1.0 / op_norm  # 1.0 / (op_norm ** 2 * tau)

# Use initial guess
x = A.domain.zero()

# Callback for the solver that prints both the iteration number and CPU/RAM usage
class CallbackMycallback(odl.solvers.util.callback.SolverCallback):

    def __init__(self):
        self.print_iter_callback = odl.solvers.CallbackPrintIteration()

    def __call__(self, x):
        self.print_iter_callback(x)
        print('CPU usage: {}'.format(psutil.cpu_percent(percpu=True)))
        print('RAM usage: {}'.format(psutil.virtual_memory()))
        print('SWAP usage: {}'.format(psutil.swap_memory()))

    def reset(self):
        self.print_iter_callback.reset()

callback_print_iter = CallbackMycallback()

# Run the solver such that the last iterate is saved (saveReco = True) or not (saveReco = False)
niter = 1000
odl.solvers.chambolle_pock_solver(x, f, g, op, tau=tau, sigma=sigma,
                                  niter=niter, gamma=gamma,
                                  callback=callback_print_iter)

# Close the log-file and set standard output to terminal
sys.stdout.log.close()
sys.stdout = sys.stdout.terminal
@adler-j

adler-j commented May 31, 2017

We do not yet know where the leak is, and it needs further investigation. Can anyone suggest a good way to do it?

I know that there is a tool (that I currently cannot find, since Google is blocked in China) that allows you to list all objects currently in Python memory. It also has functionality to only list those objects that are "new" since some checkpoint.

You could use this to see what new stuff is being allocated in each iteration and check whether anything looks wrong there.
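
For reference, a rough version of that checkpoint-diff idea can be done with only the standard library, by counting live objects per type with gc and comparing snapshots. This is just a sketch; a dedicated tool (pympler, as identified later in this thread) gives nicer summaries:

import gc
from collections import Counter

def type_counts():
    # Count all objects currently tracked by the garbage collector, per type
    return Counter(type(obj).__name__ for obj in gc.get_objects())

baseline = type_counts()

# ... run one or more solver iterations here ...

# Show only the object types whose count grew since the checkpoint
diff = type_counts() - baseline
for name, count in diff.most_common(10):
    print('{:+6d}  {}'.format(count, name))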

@adler-j

adler-j commented May 31, 2017

Btw, that callback is quite useful, would you consider making a pull request with it?

@adler-j

adler-j commented May 31, 2017

Finally, is this specific to this example, or could you try reproducing it with something that does not take hours to run?

Anyway, this obviously looks quite bad and is something we need to fix.

@chongchenmath

In the iteration loop of chambolle_pock_solver, the lines proximal_dual(sigma)(dual_tmp, out=y) and proximal_primal(tau)(primal_tmp, out=x) are executed at each iteration, so new proximal operators are created every time. My guess is that this consumes new memory at each iteration without releasing the old.
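
For what it is worth, this pattern can be tested in isolation. The sketch below mimics it with a throwaway DummyProx class (a made-up placeholder, not an ODL class) and watches the process RSS with psutil; if the per-iteration temporaries were retained, the reported memory would grow steadily over the loop.

import os

import numpy as np
import psutil

class DummyProx(object):
    """Stand-in for the operator returned by a proximal factory call."""

    def __init__(self, sigma):
        self.sigma = sigma
        self.workspace = np.empty(10 ** 6)  # some per-operator allocation

    def __call__(self, x, out=None):
        return x if out is None else out

proc = psutil.Process(os.getpid())
x = np.zeros(10 ** 6)
for i in range(500):
    DummyProx(0.1)(x, out=x)  # new operator created and discarded each iteration
    if i % 100 == 0:
        print('RSS: {:.1f} MB'.format(proc.memory_info().rss / 1e6))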

@adler-j

adler-j commented May 31, 2017

That is likely not the issue, since Python garbage-collects things like that. The issue is somewhere deeper.

Basically, there are two ways that we could leak memory (that I know of):

  • Circular dependencies in classes with a __del__ method, e.g. by doing
import psutil
import numpy as np

class MyList(list):
    def __init__(self, *args, **kwargs):
        # Allocate something to cause memory growth
        self.large_array = np.random.rand(50000)
        list.__init__(self, *args, **kwargs)

    def __del__(self):
        pass

n = 10000
for i in range(n):
    x = MyList()
    x.append(x)  # create a circular reference: the list contains itself
    if i % 1000 == 0:
        print('RAM usage: {}%'.format(psutil.virtual_memory().percent))

which gives

RAM usage: 32.1%
RAM usage: 35.3%
RAM usage: 38.5%
RAM usage: 41.6%
RAM usage: 44.8%
RAM usage: 47.9%
RAM usage: 51.1%
RAM usage: 54.2%
RAM usage: 57.4%
RAM usage: 60.6%

However, this only occurs with Python < 3.4, since PEP 442 fixes it in 3.4, giving something like:

RAM usage: 35.5%
RAM usage: 35.7%
RAM usage: 36.4%
RAM usage: 35.8%
RAM usage: 36.5%
RAM usage: 36.0%
RAM usage: 36.7%
RAM usage: 36.1%
RAM usage: 35.6%
RAM usage: 36.3%

Interestingly, we only implement __del__ in two places, AstraCudaProjectorImpl and AstraCudaBackProjectorImpl. Perhaps we somehow manage to create a circular dependency with them? I cannot find one myself, however.

  • The other option is that we're leaking some external resource, in this case that would be ASTRA. We could for example forget to call astra.delete somewhere.

So anyway, my debug steps for you would be:

  1. Try to find the package I mentioned above and see where the leaked memory is going.
  2. Try running your code with Python >= 3.4 and see if it helps (a stdlib check for the uncollectable-cycle scenario is sketched below).
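
Regarding the first scenario: on Python < 3.4, objects caught in such cycles end up in gc.garbage after a collection, so that mechanism can be checked directly with the standard library (a minimal sketch):

import gc

# On Python < 3.4, objects in reference cycles that define __del__ are not
# freed; after a collection they are parked in gc.garbage instead.
gc.set_debug(gc.DEBUG_UNCOLLECTABLE)

# ... run a few solver iterations here ...

gc.collect()
if gc.garbage:
    print('Uncollectable objects: {}'.format(len(gc.garbage)))
    for obj in gc.garbage[:10]:
        print(type(obj))
else:
    print('No uncollectable cycles found.')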

@kohr-h

kohr-h commented May 31, 2017

Btw, that callback is quite useful, would you consider making a pull request with it?

This is definitely useful. @chongchenmath, would you make a separate pull request with just this callback, giving it a name like CallbackPrintUsedResources or something similar? Thanks!

@chongchenmath

chongchenmath commented May 31, 2017

You're welcome! This was input by @aringh; I think he will do it then.

@chongchenmath

We ran the code with Python 3.5.3, so it is probably not a Python version problem.

@adler-j

adler-j commented Jun 1, 2017

Frankly, there is a large risk that this has to do with ASTRA then. Anyway, we need statistics on new allocations to get further with this.

@aringh

aringh commented Jun 2, 2017

Yes, of course I can do a PR with the callback.

When it comes to the real issue:

  • I will try to run it on a small tomography example that does not take as much time to run
  • I will try to look up the tool you mentioned, @adler-j

but it will have to wait until next week ;-)

@adler-j

adler-j commented Jun 3, 2017

No problem! But I think this has to do with 3D somehow, given that we did 20 000 000 projections in https://github.com/adler-j/learned_gradient_tomography without any noticeable memory leaks.

@adler-j

adler-j commented Jun 8, 2017

Any update on this rather interesting issue?

@aringh

aringh commented Jun 8, 2017

I am stuck with other things right now... but if I make sufficient progress there I hope to be able to have a look at this late this afternoon.

@aringh

aringh commented Jun 8, 2017

The tool I found is called pympler. Was this the one you meant, @adler-j?

@adler-j

adler-j commented Jun 8, 2017

That sure looks like it. By looking at diffs between iterates you should be able to see what has been created. You can do that by writing a simple callback.
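
A minimal sketch of such a callback, using pympler's SummaryTracker and mirroring the callback pattern from the original script (the class name CallbackMemoryDiff is made up):

import odl
from pympler import tracker

class CallbackMemoryDiff(odl.solvers.util.callback.SolverCallback):

    """Print a summary of objects allocated since the previous iterate."""

    def __init__(self):
        self.tracker = tracker.SummaryTracker()

    def __call__(self, x):
        # Prints a (type, count diff, total size diff) table since the last call
        self.tracker.print_diff()

    def reset(self):
        self.tracker = tracker.SummaryTracker()

It would be passed as callback= to chambolle_pock_solver in the same way as the existing callback.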

@aringh

aringh commented Jun 8, 2017

I have started to look at it. All code, output and findings can be found on this branch; my summary is in the notes.txt file.

Conclusion so far: I still have no idea.

@adler-j

adler-j commented Jun 9, 2017

VERY nice investigative work, and interesting indeed that it seems to "stagnate" at some point (indicating that this may have something to do with garbage collection).

Some suggestions:

  • Check if any individual component is causing this, e.g. possible suspects:
    • creating proximals; specifically, class 'odl.solvers.nonsmooth.proximal_operators.proximal_cconj_l1.<locals>.ProximalCConjL1' looks suspicious
    • calling the ray transform and its adjoint (e.g. change the solver to landweber)
  • The thing with 130 sounds scarily similar to my optimization with lru_list_elem, since I think the LRU cache keeps at most 100 elements. BUT! The issue existed even before that (I only just added that thing 3 days ago, while this issue was reported 9 days ago). One option would be to disable the lru cache (here) and re-run the tests to see what happens.
  • Further, it seems like @chongchenmath's runs do not stagnate (indeed, it seems like they very steadily allocate about 18314633 bytes of memory per iteration). Can you make sure to EXACTLY replicate what he does (including the terminal redirection tricks etc.)? Also, perhaps the number 18314633 bytes is a hint as to what is happening.

P.S. I'm quite sure this type of work should not be on a branch (since it will never be merged back, given the huge log files).

@aringh

aringh commented Jun 16, 2017

Current update: when running from odl commit 3fd52ac (as reported in issue davlars/ad-skull-reconstruction#35), a TV reconstruction on the AD skull data set crashes after 4 iterations due to running out of memory. However, since the code had run successfully before, we checked how it was run on @chongchenmath's account. There it runs from odl commit 7824dd2. From that commit I can run the exact same code on my account, and it does not crash after 4 iterations.

Conclusion 1: something has happened between commit 7824dd2 and 3fd52ac that makes the latter allocate much more memory in fewer iterations.

@aringh

aringh commented Jun 16, 2017

Update on the previous comment: in fact I see similar behavior to what was originally reported. Memory use increases in the same way during the first 66 iterations, after which I terminated the code in order to run it with pympler and see what is actually growing in memory.

P.S. Output files and notes are updated on the branch.

@adler-j

adler-j commented Jun 17, 2017

Did you try this:

One option would be to disable the lru cache (here) and re-run the tests to see what happens.

That is a likely culprit.

@aringh

aringh commented Jun 19, 2017

I have not tried it. But in order to reproduce the exact same behavior as first reported, I need to run from commit 7824dd2 (potentially somewhat later, but at least before commit 3fd52ac), which is from February. Hence this is long before the lru things were added.

I will pause this debugging for now: I think I need some help from @adler-j in person in order to continue this in a meaningful way :-)

@adler-j

adler-j commented Jun 19, 2017

I will pause this debugging for now: I think I need some help from @adler-j in person in order to continue this in a meaningful way :-)

No problem, I'll be back next week. Great job so far!

@adler-j

adler-j commented Oct 16, 2017

This has basically been solved in the ASTRA branch by this commit: astra-toolbox/astra-toolbox@aa32503

We're waiting for a conda release.

@adler-j

adler-j commented Oct 16, 2017

Closing this issue here for now. The solution is to build ASTRA locally.

adler-j closed this as completed Oct 16, 2017