Possible memory leak #37401

rbauststfc · 2024-05-20T11:27:39Z

Please read all comments to get full picture of the issue, as the problem appears to be more general than first described here.

Describe the bug

We've had reports of Mantid crashing due to memory use when users are auto-processing INTER experiment data while time slicing. I've done a bit of investigating and it looks to me like there may be a memory leak somewhere. This is most noticeable when we time slice in the GUI.

To Reproduce

The easiest way to replicate is as follows:

Go to the Interfaces -> Reflectometry -> ISIS Reflectometry.
On the Runs tab, in the first row of the table on the right hand side enter run 77064 and angle 0.8.
On the Event Handling tab, under the "Uniform slicing" section select the (sec) slices radio box and enter 5 to slice the data into 5 second slices.
On the Experiment Settings tab enter the following into the first row of the Settings Lookup table:
- ROI: 3-258
- ROI Detector IDS: 3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256
Back on the Runs tab, click the process button. This should complete successfully and create a set of workspace groups. Take a note of the memory usage at this stage.
Clear the workspace list using the button in Mantid Workbench. The memory use won't go down, but I don't think this in itself necessarily indicates a memory leak, as it can just be that the system hasn't retrieved the memory yet. To demonstrate the leak, in the Reflectometry GUI click to process the row again. The memory usage should continue to build up. If you keep doing this in a loop (5 or 6 times usually, clearing the workspace list in between) the memory completely fills up and Mantid eventually crashes.

If I do the above without the GUI (i.e. calling ReflectometryISISLoadAndProcess from a script while time-slicing and clearing workspaces between each run) then the memory doesn't seem to build up very much at all. If I run the algorithm directly from the algorithm dialog then the memory does build up more quickly and eventually causes a crash.

Expected behavior

Repeated processing of the data should not cause the memory to fill up and cause a crash if the workspace list is being cleared in between.

Platform/Version (please complete the following information):

IDAaaS, can't replicate on Windows, need to test on Linux and potentially Mac
Mantid Version - seems to be a problem in the Nightly, can't replicate it in version 6.9.1 or earlier.

Additional Information
Another way I've tested is to run this script repeatedly, which does not perform the time slicing, clearing the workspaces in between:

from mantid.simpleapi import *
import matplotlib.pyplot as plt
import numpy as np

num = 77064
for i in range(50):
    num = num + 1
    ReflectometryISISLoadAndProcess(InputRunList=str(num),ThetaIn=0.8,ROIDetectorIDs="3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256",ProcessingInstructions="3-258")

I haven't been able to crash Mantid using this script, though, even though the memory fills up to over 90% if run repeatedly. The memory builds up pretty slowly when tested in this way.

The text was updated successfully, but these errors were encountered:

rbauststfc · 2024-05-20T15:40:26Z

@cailafinn would you be happy to double check this on IDAaaS and check if you see the same behaviour on your Linux machine (and maybe Mac)? I would recommend using the first set of testing instructions (i.e. the time slicing example). While the script I've provided for the second way of causing the memory to fill up gets to over 90%, I can't seem to get it to the point where it crashes Mantid, so I think the first set of instructions is the most reliable way to demonstrate the problem.

cailafinn · 2024-05-21T13:22:45Z

Reproduced on Ubuntu 22.04 (Nightly conda package). Took two repeats of clicking Process to crash on a system with 32GiB of RAM. When the ADS was cleared, memory usage dropped somewhat, but not significantly. See below.

Breakdowns:

Nightly:

Start:
- Mantid: 0.5GiB
- System: 5.2GiB
1st Process:
- Mantid: 15.8GiB
- System: 20.4GiB
ADS Clear
- Mantid: 13.3GiB
- System: 18.1GiB
2nd Process
- Usage rose over 5-7 seconds before reaching the limits of the system until workbench froze and was crashed by the OOM Killer.

V9.6.1 and earlier

Start:
- Mantid: 0.4GiB
- System: 1.1GiB
1st Process:
- Mantid: 15.8GiB
- System: 16.4GiB
ADS Clear
- Mantid: 13.3GiB
- System: 14.0GiB
2nd Process:
- Mantid: 15.8GiB
- System: 16.5GiB
3rd Process:
- Mantid: 15.7GiB
- System: 16.5GiB

This would appear to be a regression.

rbauststfc · 2024-05-21T15:40:43Z

As Caila has suggested, we should see if this PR to address issues with jemalloc has any impact on this issue. We've not made any changes to the Reflectometry code since 6.9.1 that seem like they should be causing this, and it isn't happening on Windows. I'll test on IDAaaS when that PR has gone in and see where we are then.

rbauststfc · 2024-05-23T15:47:00Z

This doesn't appear to have been helped by the jemalloc change. We're still waiting on results from a Valgrind scan. In the meantime I've done some more testing on IDAaaS and have found this alternative way to replicate the behaviour:

Load data using LoadEventNexus with the following settings:
- Filename = INTER77064
- LoadMonitors = True
- MonitorsLoadOnly = Histogram
From the algorithm dialog, run ReflectometrySliceEventWorkspace with the following settings:
- InputWorkspace = name of loaded workspace
- MonitorWorkspace = name of loaded monitor workspace
- UseNewFilterAlgorithm = True (note the memory builds up regardless of how this is set)
- TimeInterval = 5

It should show the same behaviour as the Reflectometry GUI. If ReflectometrySliceEventWorkspace is called repeatedly via a script, though, then the memory doesn't seem to build up in the same way, it seems to be necessary to call it from a GUI.

rbauststfc · 2024-05-23T15:47:48Z

Depending on the results of the Valgrind scan, if further investigation is needed it will need to be done by someone with access to a Linux machine as I still can't replicate on Windows.

rbauststfc · 2024-05-24T08:40:47Z

I've done a bit more testing on IDAaaS and this is starting to look like a more general problem. I can replicate it using the LoadISISNexus algorithm to load a WISH dataset.

I've found that if you run the following script then it doesn't build up the memory:

for i in range(3):
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
    AnalysisDataService.clear()

However if you do the above in any of the following ways then the memory does build up:

Run LoadISISNexus via the algorithm dialog and clear the ADS using the Clear button.
Run LoadISISNexus via the algorithm dialog and clear the ADS by running AnalysisDataService.clear() in the script editor.
Run LoadISISNexus via the script editor and clear the ADS using the Clear button.

This doesn't seem to happen in version 6.9.1 of Mantid, and again doesn't happen on Windows. When the memory fills up with this test it seems to eventually cause Mantid to freeze and become unresponsive on IDAaaS rather than crash.

warunawickramasingha · 2024-05-24T16:51:37Z

The steps were followed in IDAaaS using Valgrind memory analyser tool to check for any memory leaks. The main observation was that there was a discrepancy in the amount of memory reported as leaked by Valgrind and the memory usage reported by the Mantid workbench as shown below. As can be seen, there has not been such a significant memory leak in this case.

Valgrind summary ->

System memory usage reported by Mantid at the end of test step 6 ->

Valgrind log file for reference
37401_valgrind.zip

cailafinn · 2024-05-28T11:59:22Z

One billion permutations later, here are my results. Testing on macOS to follow.

Testing on Ubuntu 22.04. 32GiB RAM.

Run Script Once: Loop Loading Call

for i in range(15):
    AnalysisDataService.clear()
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")

Does not crash the program, even if the same load and clear function is performed multiple times within the script.

Run Script Multiple Times: Single Loading Call

# Repeatedly ran 3 times.
AnalysisDataService.clear()
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")

With each re-run, the memory usage was seen to drop by around 2.3GiB when the ADS was cleared.
Memory usage then rose again. This could only be repeated 3 times before Mantid took up too much memory and was killed by the OoM Killer.

Version	Start	1st Run	Clear	2nd Run	Clear	3rd Run	Clear	4th Run	Clear	5th Run
Nightly	0.3	14.8	13.1	26.2	23.9	CRASH
6.9.1	0.3	14.8	13.4	26.2	23.8	CRASH
6.7	0.4	14.6	12.3	14.8	12.6	26.3	23.8	26.0	24.1	CRASH

Behaviour appears to be the same.

Run Script Multiple Times: Swap Clear Order

# Repeatedly ran 4 times.
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
AnalysisDataService.clear()

Version	Start	1st Run	Clear	2nd Run	Clear	3rd Run	Clear	4th Run	Clear	5th Run
Nightly	0.5	14.8	12.3	26.2	23.6	25.8	23.8	CRASH	N/A	N/A
6.9.1	0.4	14.8	12.3	25.7	23.7	25.9	23.9	25.8	24.1	CRASH

Behaviour is similar. This way around was a bit more variable for some reason, but I don't think this one is specifically a regression.

No Script

# Click Load and load WISH57086 from the algorithm dialog
# Clear the ADS using the clear button.

Version	Start	1st Run	Clear	2nd Run	Clear	3rd Run	Clear	4th Run	Clear	5th Run
Nightly	0.4	14.8	12.3	26.2	23.7	CRASH
6.9.1	0.4	14.8	12.3	14.5	12.5	14.6	12.7	14.7	12.9	14.7
6.7	0.4	14.8	12.3	14.8	12.4	14.8	12.6	14.7	12.7	14.7

Okay, beginning to see a difference here.

Load with Algorithm Dialog, Clear From Script

# Click Load and load WISH57086 from the algorithm dialog
# Run below from a script
AnalysisDataService.clear()

Version	Start	1st Run	Clear	2nd Run	Clear	3rd Run	Clear	4th Run	Clear	5th Run
Nightly	0.5	14.8	12.3	26.2	24.9	CRASH
6.9.1	0.5	14.7	12.3	14.8	12.5	14.7	13.1	14.5	13.2	14.7

Same behaviour as with no scripting involved.

Load with a script, Clear from the Button

# Run the below script to load WISH57086
# Then clear using the button on the workbench GUI.
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")

Version	Start	1st Run	Clear	2nd Run	Clear	3rd Run	Clear	4th Run
Nightly	0.5	14.7	12.3	26.2	23.7	CRASH
6.9.1	0.4	14.6	13.4	26.2	23.7	CRASH
6.7	0.4	13.8	12.3	26.2	23.7	26.2	23.9	CRASH

Okay, so the total of all of this seems to tell us is that something has changed when loading from the algorithm dialog. Crashes occur on both versions when loading from scripts, regardless of what is done with the ADS clear. The only thing that stops the crashes from occurring in 6.9.1 is using the algorithm dialog to load the data.

cailafinn · 2024-05-28T12:52:06Z

Testing on macOS 14.4.1 (Intel). 32GB RAM.

No Script

# Click Load and load WISH57086 from the algorithm dialog
# Clear the ADS using the clear button.

Version	Start	1st Run	Clear	2nd Run	Clear	3rd Run	Clear	4th Run	Clear	5th Run
Nightly	0.4	5.5	2.1	7.4	2.9	7.5	6.2	10.0	3.3	12.2
6.7	0.5	12.5	2.5	12.5	3.1	12.5	3.3	12.6	etc.

Despite the workspace appearing in the "Workspaces" list as 20GB, it didn't appear to take all that up as RAM. I wonder if there's something different about how macOS handles the memory for large files?
Regardless, there doesn't seem to be an issue on macOS, it would appear to be specifically on linux.

thomashampson · 2024-05-28T13:16:44Z

@cailafinn Are you launching with the workbench entry point, or using the mantidworkbench script?

cailafinn · 2024-05-28T13:29:25Z

workbench entry point on both OS'. Always using the version from conda.

rbauststfc · 2024-05-28T13:53:48Z

We've noticed that the version of the Nightly from 24th May (which is the first to include the jemalloc pin) is looking much better on IDAaaS now, when started with the launcher from the Applications menu (FYI @sf1919). Tom has explained that launching the Nightly using the IDAaaS launcher sets up jemalloc, whereas using the workbench entry point does not use the script that sets up jemalloc.

cailafinn · 2024-05-28T14:01:09Z

Re-tested launching using mantidworkbench:

Version	Start	1st Run	Clear	2nd Run	Clear	3rd Run	Clear	4th Run	Clear	5th Run
Nightly	0.5	14.7	12.3	26.2	23.7	CRASH
N `mw`	0.5	15.9	4.3	16.2	4.2	16.2	4.2	16.2	4.5	16.1
6.9.1	0.4	14.6	13.4	26.2	23.7	CRASH
6.7	0.4	13.8	12.3	26.2	23.7	26.2	23.9	CRASH

Seems to fix the issue. Do we need to alter the launch instructions for linux? We currently recommend starting it using the workbench entry in the Downloads docs.

thomashampson · 2024-05-28T15:25:51Z

Some of my findings when loading and clearing all via the GUI (all nightly versions here are 6.9.20240524.1804):
On Ubuntu:

v6.9.1 is fine with and without jemalloc (mantidworkbench and workbench)
nightly version fills up memory and freezes with and without jemalloc

On IDAaaS:

nightly version mantidworkbench (with jemalloc) appears to fill up the memory, but is actually OK because it gets release just in time
nightly version via workbench freezes after a few iterations and eventually crashes
6.9.1 seems to be fine in both cases. With jemalloc it seems to give the memory back much quicker than with the nightly. Maybe because we downgraded from 5.3.0 to 5.2.0.

sf1919 · 2024-06-12T09:54:32Z

With the further changes to IDAaaS last week how are things looking @thomashampson ?

rbauststfc added this to the Release 6.10 milestone May 20, 2024

rbauststfc self-assigned this May 20, 2024

rbauststfc changed the title ~~Memory leak when time slicing in ISIS Reflectometry GUI~~ Memory leak when reducing ISIS Reflectometry data May 20, 2024

rbauststfc changed the title ~~Memory leak when reducing ISIS Reflectometry data~~ Possible memory leak when reducing ISIS Reflectometry data May 21, 2024

yusufjimoh mentioned this issue May 23, 2024

Fixes grabbing of legend when cursor is above the legend during scrolling of plot #37413

Closed

rbauststfc removed Reported By User Issues that were found or highlighted by a user/scientist Reflectometry Issues and pull requests related to reflectometry labels May 24, 2024

rbauststfc changed the title ~~Possible memory leak when reducing ISIS Reflectometry data~~ Possible memory leak May 24, 2024

rbauststfc added the Linux Only Only on Linux label May 28, 2024

rbauststfc added the ISIS Team: Core Issue and pull requests managed by the Core subteam at ISIS label May 29, 2024

thomashampson assigned thomashampson and unassigned rbauststfc May 29, 2024

sf1919 modified the milestones: Release 6.10, Release 6.11 Jun 4, 2024

rbauststfc removed the ISIS Team: LSS Issue and pull requests managed by the LSS subteam at ISIS label Jun 13, 2024

thomashampson modified the milestones: Release 6.11, Release 6.12 Sep 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible memory leak #37401

Possible memory leak #37401

rbauststfc commented May 20, 2024 •

edited

Loading

rbauststfc commented May 20, 2024 •

edited

Loading

cailafinn commented May 21, 2024 •

edited

Loading

rbauststfc commented May 21, 2024 •

edited

Loading

rbauststfc commented May 23, 2024

rbauststfc commented May 23, 2024

rbauststfc commented May 24, 2024 •

edited

Loading

warunawickramasingha commented May 24, 2024 •

edited

Loading

cailafinn commented May 28, 2024 •

edited

Loading

cailafinn commented May 28, 2024 •

edited

Loading

thomashampson commented May 28, 2024

cailafinn commented May 28, 2024 •

edited

Loading

rbauststfc commented May 28, 2024

cailafinn commented May 28, 2024

thomashampson commented May 28, 2024 •

edited

Loading

sf1919 commented Jun 12, 2024

Possible memory leak #37401

Possible memory leak #37401

Comments

rbauststfc commented May 20, 2024 • edited Loading

rbauststfc commented May 20, 2024 • edited Loading

cailafinn commented May 21, 2024 • edited Loading

Breakdowns:

Nightly:

V9.6.1 and earlier

rbauststfc commented May 21, 2024 • edited Loading

rbauststfc commented May 23, 2024

rbauststfc commented May 23, 2024

rbauststfc commented May 24, 2024 • edited Loading

warunawickramasingha commented May 24, 2024 • edited Loading

cailafinn commented May 28, 2024 • edited Loading

Testing on Ubuntu 22.04. 32GiB RAM.

Run Script Once: Loop Loading Call

Run Script Multiple Times: Single Loading Call

Run Script Multiple Times: Swap Clear Order

No Script

Load with Algorithm Dialog, Clear From Script

Load with a script, Clear from the Button

cailafinn commented May 28, 2024 • edited Loading

Testing on macOS 14.4.1 (Intel). 32GB RAM.

No Script

thomashampson commented May 28, 2024

cailafinn commented May 28, 2024 • edited Loading

rbauststfc commented May 28, 2024

cailafinn commented May 28, 2024

thomashampson commented May 28, 2024 • edited Loading

sf1919 commented Jun 12, 2024

rbauststfc commented May 20, 2024 •

edited

Loading

rbauststfc commented May 20, 2024 •

edited

Loading

cailafinn commented May 21, 2024 •

edited

Loading

rbauststfc commented May 21, 2024 •

edited

Loading

rbauststfc commented May 24, 2024 •

edited

Loading

warunawickramasingha commented May 24, 2024 •

edited

Loading

cailafinn commented May 28, 2024 •

edited

Loading

cailafinn commented May 28, 2024 •

edited

Loading

cailafinn commented May 28, 2024 •

edited

Loading

thomashampson commented May 28, 2024 •

edited

Loading