pandas.read_csv leaks memory while opening massive files with chunksize & iterator=True #21516

Closed
Shirui816 opened this Issue Jun 18, 2018 · 6 comments

@Shirui816

Shirui816 commented Jun 18, 2018

I am using anaconda and my pandas version is 0.23.1. When dealing with a single large file, setting chunksize or iterator=True works fine and memory usage stays low. The problem arises when I try to deal with 5000+ files (file names are in filelist):

trajectory = [pd.read_csv(f, delim_whitespace=True, header=None, chunksize=10000) for f in filelist]

Memory usage rises very quickly and soon exceeds 20 GB. However, trajectory = [open(f, 'r')....] and reading 10000 lines from each file works fine.

I also tried the low_memory=True option, but it does not help. Both the engine='python' and memory_map=<some file> options solve the memory problem, but when I use the data with

X = np.asarray([f.get_chunk().values for f in trajectory])
FX = np.fft.fft(X, axis=0)

the multi-threading of MKL-FFT no longer works.
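For context, the pattern can be reduced to a minimal, self-contained sketch (using io.StringIO stand-ins for the real files, and sep=r'\s+' in place of delim_whitespace=True):

```python
import io

import numpy as np
import pandas as pd

# Three small in-memory stand-ins for the real trajectory files.
files = [io.StringIO("1 2\n3 4\n5 6\n") for _ in range(3)]
trajectory = [pd.read_csv(f, sep=r'\s+', header=None, chunksize=2) for f in files]

# One chunk per reader, stacked along a new leading axis, then FFT over that axis.
X = np.asarray([r.get_chunk().values for r in trajectory])
FX = np.fft.fft(X, axis=0)
print(FX.shape)  # (3, 2, 2)
```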

@gfyoung

Member

gfyoung commented Jun 18, 2018

  • This might be related to #21353
  • When you say you tried low_memory=True and it's not working, what do you mean?
  • You might need to check your concatenation when using engine='python' and memory_map=...
@Shirui816


Shirui816 commented Jun 19, 2018

Thanks for replying :) @gfyoung

I mean that after adding the low_memory=True option, as in

trajectory = [pd.read_csv(f, delim_whitespace=True, header=None, chunksize=10000, low_memory=True) for f in filelist]

the memory usage does not change compared to the case without this option.

@Shirui816


Shirui816 commented Jun 19, 2018

The environment is:
CentOS Linux release 7.4.1708 (Core)
Python 3.6.5 :: Anaconda custom (64-bit)
with pandas version 0.23.1

From #21353, I tracked the memory usage:

from sys import argv

import psutil
import pandas as pd

traj = []
i = 0
for f in argv[1:]:
    a = pd.read_csv(f, squeeze=False, header=None, delim_whitespace=True,
                    chunksize=10000, comment='#')
    traj.append(a)
    if not i % 100:
        print('%s th file, memory: ' % i, psutil.Process().memory_info().rss / 1024**2)
    i += 1

and the output:

0 th file, memory:  61.96484375
100 th file, memory:  214.66015625
200 th file, memory:  367.32421875
300 th file, memory:  520.046875
400 th file, memory:  674.76953125
500 th file, memory:  829.5
600 th file, memory:  982.22265625
700 th file, memory:  1134.9453125
800 th file, memory:  1287.66796875
900 th file, memory:  1442.3828125
1000 th file, memory:  1597.109375
1100 th file, memory:  1749.84765625
1200 th file, memory:  1932.57421875
1300 th file, memory:  2122.796875
1400 th file, memory:  2313.01953125
1500 th file, memory:  2503.2421875
...
4600 th file, memory:  8414.0234375
4700 th file, memory:  8604.24609375
4800 th file, memory:  8794.4765625
4900 th file, memory:  8984.6953125
5000 th file, memory:  9174.921875
5100 th file, memory:  9367.14453125
5200 th file, memory:  9557.37109375
5300 th file, memory:  9747.59375
5400 th file, memory:  9937.81640625
5500 th file, memory:  10128.04296875
5600 th file, memory:  10320.26953125

It turns out that memory increases by ~1.9 MB per file. The files used in this test are about 800 KB each.

Also tried malloc_trim(0) from #2659:

from sys import argv
from ctypes import CDLL, cdll

import psutil
import pandas as pd

cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")

traj = []
i = 0
for f in argv[1:]:
    libc.malloc_trim(0)
    a = pd.read_csv(f, squeeze=False, header=None, delim_whitespace=True,
                    chunksize=10000, comment='#')
    traj.append(a)
    if not i % 100:
        print('%s th file, memory: ' % i, psutil.Process().memory_info().rss / 1024**2)
    i += 1

The results are the same as above; the memory usage still increases quickly.

@gfyoung

Member

gfyoung commented Jun 19, 2018

Hmm...admittedly, this is the first time I've seen so many of these issues regarding memory leakage in read_csv, and I'm still uncertain whether it has to do with DataFrame or with read_csv.

cc @jreback

@Liam3851

Contributor

Liam3851 commented Jul 3, 2018

@Shirui816 You're appending the result of pd.read_csv to a list:

traj = []
for f in argv[1:]:
    a = pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#')
    traj.append(a)

Adding objects to a list means they can't be garbage collected. Thus you're keeping thousands of file handles and the related iterator objects open, so we would expect memory use to grow. I've confirmed that memory does not grow if you remove the traj.append call.
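This is easy to check directly: once nothing references a reader, CPython reclaims it. A minimal sketch with weakref (using an in-memory io.StringIO stand-in for a file, and sep=r'\s+' in place of delim_whitespace=True):

```python
import gc
import io
import weakref

import pandas as pd

# An in-memory stand-in for one trajectory file.
reader = pd.read_csv(io.StringIO("1 2\n3 4\n"), sep=r'\s+', header=None, chunksize=1)
ref = weakref.ref(reader)

del reader    # no list (or anything else) keeps the parser alive...
gc.collect()  # ...so it is reclaimed, buffers, handle and all
print(ref() is None)  # True
```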

If the issue is that memory use is growing faster than you expect based on the file sizes (per your comment, "It turns out that memory increases by ~1.9 MB per file. The files used in this test are about 800 KB each."), note that the above call does not actually read the file in all the way: because you're using the chunksize parameter, it creates a persistent iterator and file handle on the file. If you only want the first 10000 lines of the file, use

a = next(pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#'))

This will throw away the handle and the rest of the iterator object, leaving just your data.
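A runnable sketch of the same idea, with an in-memory stand-in for the file and an explicit close() to make the handle's lifetime obvious (sep=r'\s+' stands in for delim_whitespace=True):

```python
import io

import pandas as pd

csv_data = "1 2 3\n4 5 6\n7 8 9\n"  # stand-in for one trajectory file
reader = pd.read_csv(io.StringIO(csv_data), sep=r'\s+', header=None, chunksize=2)
first = next(reader)  # materializes only the first chunk (2 rows)
reader.close()        # release the parser; `first` still holds the data
print(first.shape)    # (2, 3)
```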

@Shirui816


Shirui816 commented Jul 4, 2018

@Liam3851 Thank you very much for the explanation. I increased the file size and re-ran the test; the memory gain per file was still about 1.9 MB. This handle is much larger than the one from the open function....emmmm.... Does the engine='python' option mean the iterator and file handle are held by Python, as with the open function? I am wondering why, after adding this option (and/or the memory_map=... option), the parallel acceleration of MKL doesn't work any more. I have no clue about this problem. Are there any suggested tests to find the reason? The code is in my first post: after creating a list of iterators in trajectory, take a chunk from each handle, then perform an FFT.

The environment is:
CentOS Linux release 7.4.1708 (Core)
Python 3.6.5 :: Anaconda custom (64-bit)
with pandas version 0.23.1
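For what it's worth, the one-chunk-per-file pattern from the previous comment can also be wrapped in a small helper with an explicit close(), so no parser or file handle outlives the call (a sketch; first_chunk is a hypothetical name, and sep=r'\s+' stands in for delim_whitespace=True):

```python
import os
import tempfile

import pandas as pd

def first_chunk(path, n=10000):
    """Read only the first n rows of a whitespace-delimited file,
    closing the parser (and its file handle) before returning."""
    reader = pd.read_csv(path, sep=r'\s+', header=None, chunksize=n, comment='#')
    try:
        return reader.get_chunk()
    finally:
        reader.close()

# Demo on a throwaway file.
tmp = tempfile.NamedTemporaryFile('w', suffix='.dat', delete=False)
tmp.write("# comment line\n1 2\n3 4\n5 6\n")
tmp.close()
chunk = first_chunk(tmp.name, n=2)
os.unlink(tmp.name)
print(chunk.shape)  # (2, 2)
```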
