Reading large EDF files with preload = False raises memory error #10634

Closed
arnaumanasanch opened this issue May 16, 2022 · 11 comments · Fixed by #10638
@arnaumanasanch

Issue/Bug

When reading a large .edf file (10 GB on disk: 4 hours, 150 channels, 2048 Hz sampling frequency) in an IPython notebook with:

raw_edf = mne.io.read_raw_edf('file.edf', preload=False)

the kernel crashes because the machine's 12 GB of RAM fills up completely.

I thought that with preload=False only the metadata would be loaded (which should be no more than a few hundred MB).

Is there a way to read the raw EDF (metadata only) and then use the get_data() method to load just a small piece of data (n channels over a given time range) into memory, without the system failing because the RAM is full?
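For reference, the workflow I have in mind is roughly this (just a sketch; the file name and channel picks are placeholders):

import mne

raw = mne.io.read_raw_edf("file.edf", preload=False)  # metadata only
sfreq = raw.info["sfreq"]

# pull a 60 s window from two channels into memory
data = raw.get_data(
    picks=["EEG Fp1", "EEG Fp2"],  # hypothetical channel names
    start=int(0 * sfreq),
    stop=int(60 * sfreq),
)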

If I try the same on a machine with more RAM (32 GB) there is no problem, and once the file is loaded the RAM usage goes back to normal (roughly 1 GB before loading, up to 15 GB while loading, then back to about 1 GB once loaded). We need to make this work, if possible, with the 12 GB of RAM.

For privacy reasons, I can’t share the file that I am using.

Thanks in advance.

Additional information

MNE version: 1.0.3
Operating system: Windows 10


@cbrnr
Contributor

cbrnr commented May 16, 2022

Would you be able to run mprof on your machine with 32 GB of RAM while loading the EDF file? Once we have the result, we can think about where to place the @profile decorator to see where this is happening.

Oh, and before that, does the problem also occur outside of a notebook (e.g. plain Python script or Python interactive interpreter, or even IPython)?

@arnaumanasanch
Author

Thanks for the rapid response.

Yes, it also occurs outside a notebook, both in a plain Python script and in the Python interactive interpreter.

I have run memory_profiler on the 32 GB RAM machine and the result is the following:

[memory_profiler line-by-line output for reading the EDF file]

The final increment is only 19 MiB (the size of the metadata, I assume), but the process to get there requires not just these 19 MiB but roughly 10 GB. Should we dig deeper into the source code?

Let me know if I can help with anything else.

@agramfort
Member

agramfort commented May 16, 2022 via email

@cbrnr
Contributor

cbrnr commented May 16, 2022

Can you do a time-based memory profile (https://github.com/pythonprofilers/memory_profiler#time-based-memory-usage)? This should at least show the 10 GB spike at some point.
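If the mprof command line is inconvenient, memory_profiler's Python API can do roughly the same thing; here's a sketch (the file name is a placeholder):

from memory_profiler import memory_usage
from mne.io import read_raw_edf

# sample resident memory every 50 ms while the call runs
samples = memory_usage(
    (read_raw_edf, ("file.edf",), {"preload": False}),
    interval=0.05,
)
print(f"peak during the call: {max(samples):.0f} MiB")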

@arnaumanasanch
Author

Here it is:

[time-based memory usage plot showing the spike while reading the file]

@cbrnr
Contributor

cbrnr commented May 16, 2022

Thanks! I guess the next thing to do would be to sprinkle some @profile decorators on functions that might be the culprit. This should then be reflected by time stamps in the diagram, plus the function name.
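For example, something along these lines (a sketch at the call site; for the real hunt the decorator would go directly on the suspected functions inside mne/io/edf/edf.py):

from mne.io import read_raw_edf

try:
    profile  # injected as a builtin by "mprof run"
except NameError:
    from memory_profiler import profile  # fallback for a plain "python script.py"


@profile
def load_header_only(fname):
    # with preload=False, only the metadata should end up in memory
    return read_raw_edf(fname, preload=False)


load_header_only("file.edf")  # placeholder file name

Running the script with mprof run should then mark the decorated function's entry and exit in the plot.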

@arnaumanasanch
Author

arnaumanasanch commented May 16, 2022

So, I ran memory_profiler over the source code, and the RAM gets filled in the _read_segment_file function in edf.py.
I won't paste the entire log because it is too long, just the meaningful part:

[memory_profiler output for _read_segment_file in edf.py]

The RAM fills up inside the loop over ai, more specifically when the variable many_chunk is assigned:

many_chunk = _read_ch(fid, subtype, ch_offsets[-1] * n_read, dtype_byte, dtype).reshape(n_read, -1)

If I track the RAM before and after this line executes (using psutil), it increases by around 10 MB at every iteration.
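The check looks roughly like this (a simplified, self-contained version; a random array of about the same size stands in for the _read_ch(...) call):

import os

import numpy as np
import psutil

proc = psutil.Process(os.getpid())

rss_before = proc.memory_info().rss
chunk = np.random.randn(1300, 1024)  # ~10 MB, standing in for the many_chunk assignment
rss_after = proc.memory_info().rss

print(f"RSS grew by {(rss_after - rss_before) / 2**20:.1f} MiB")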

Let me know if you need any other information. Thanks for the help.

@cbrnr
Contributor

cbrnr commented May 16, 2022

Thanks @arnaumanasanch, I'll take a look to see why this is necessary without preload (likely it isn't).

@cbrnr
Contributor

cbrnr commented May 17, 2022

If anyone wants to reproduce the problem, here's a reprex that generates a large EDF file (944 MB on disk, 3.6 GB in RAM, but the values can be adapted) and reads it with read_raw_edf():

import numpy as np
from mne.io import read_raw_edf
from pyedflib.highlevel import write_edf_quick


def write_large_edf():
    # ~944 MB on disk, ~3.6 GB in RAM; adapt the values as needed
    n_chans = 64
    length = 2 * 60 * 60  # two hours, in seconds
    fs = 1024
    write_edf_quick("large.edf", np.random.randn(n_chans, length * fs), fs)


# call write_large_edf() once beforehand to create the file, then profile this line:
raw = read_raw_edf("large.edf", preload=False)

Running that script with mprof run -T 0.05 test.py (assuming that the script is stored in test.py and that write_large_edf() has been called separately before) followed by mprof plot -t "" produces the following graph:

[mprof plot: memory usage while reading large.edf with preload=False]

@cbrnr
Contributor

cbrnr commented May 17, 2022

I think I found one place where we accidentally create a view on an array, which prevents garbage collection and therefore fills up memory. This line creates a reference to many_chunk, which should be overwritten in the next loop iteration, but it isn't because of that reference. If I make a copy, memory consumption goes down a lot:
[mprof plot after making a copy: memory consumption is much lower]
I don't think I can bring it down further (but I will check), because we need to go through the file when annotations are present. Even in the toy data file from my example, there's a channel called "EDF Annotations", and this triggers the reading.
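The mechanism in isolation looks like this (a toy sketch, not the MNE code):

import numpy as np

# a small view keeps the entire parent buffer alive
parent = np.random.randn(1_000_000)  # ~8 MB
view = parent[:10]
print(view.base is parent)  # True: parent cannot be garbage collected
del parent  # the 8 MB buffer is still held alive by `view`

# a copy detaches from the parent buffer
parent = np.random.randn(1_000_000)
small = parent[:10].copy()
print(small.base is None)  # True: no reference back to the parent
del parent  # now the 8 MB buffer can actually be freed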

I'll submit a PR so that you can test with your file @arnaumanasanch.
