Memory leak in pd.read_csv or DataFrame #21353

Closed
kuraga opened this issue Jun 7, 2018 · 14 comments


@kuraga kuraga commented Jun 7, 2018

Code Sample, a copy-pastable example if possible

import sys

m = int(sys.argv[1])
n = int(sys.argv[2])

with open('df.csv', 'wt') as f:
    for i in range(n-1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n-1) + '\n')
    for j in range(m):
        for i in range(n-1):
            f.write('1,')
        f.write('1\n')


import psutil

print(psutil.Process().memory_info().rss / 1024**2)

import pandas as pd
df = pd.read_csv('df.csv')

print(df.shape)
print(psutil.Process().memory_info().rss / 1024**2)

import gc
del df
gc.collect()

print(psutil.Process().memory_info().rss / 1024**2)

Problem description

$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375

$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25

$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!

$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375

$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!

Two issues:

  1. There is a "standard" leak of ~53 MB after reading any CSV or just creating a pd.DataFrame().
  2. We see a much larger leak in some other cases (marked with !!! above).
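
One possible way to tell a leak of Python-tracked objects apart from memory that the process merely keeps mapped is to compare RSS with what tracemalloc reports. The sketch below is only an illustration: the helper name traced_after_release is hypothetical, and it uses a plain list of bytes objects as a stand-in for a DataFrame. Note that tracemalloc only sees allocations routed through Python's memory allocators, so raw malloc calls inside C extensions would not show up here.

```python
import gc
import tracemalloc


def traced_after_release(make_object):
    """Return (bytes held while alive, bytes left after del + gc.collect()).

    Hypothetical helper: measures only memory that tracemalloc can see.
    """
    was_tracing = tracemalloc.is_tracing()
    if not was_tracing:
        tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()

    obj = make_object()
    held, _ = tracemalloc.get_traced_memory()

    del obj
    gc.collect()
    after, _ = tracemalloc.get_traced_memory()

    if not was_tracing:
        tracemalloc.stop()
    return held - baseline, after - baseline


if __name__ == "__main__":
    # Stand-in workload (assumption): roughly 10 MB of small bytes objects.
    held, leftover = traced_after_release(
        lambda: [bytes(1000) for _ in range(10_000)]
    )
    print(f"held while alive: {held / 1024**2:.1f} MiB")
    print(f"left after release: {leftover / 1024**2:.1f} MiB")
```

If the traced figure drops back to roughly the baseline while RSS stays high, that would suggest the memory is retained below the Python level (allocator behavior or untracked C allocations) rather than by live Python objects.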

cc @gfyoung

Output of pd.show_versions()

(same for 0.21, 0.22, 0.23)

pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung (Member) commented Jun 7, 2018

@kuraga : Thanks for the updated issue!

cc @jreback @jorisvandenbossche

@kuraga kuraga (Author) commented Jun 13, 2018

Seems like it's not only a pd.read_csv issue...

[Image: memory_leak_2]


@nynorbert nynorbert commented Jun 13, 2018

I have a similar issue. I tried to debug it with memory_profiler, but I don't see the source of the leak.
The output of the profiler:

 Line #    Mem usage    Increment   Line Contents
 ================================================
    187    261.3 MiB      0.0 MiB        if "history" in self.watch_list:
    188    491.9 MiB    230.6 MiB            self.history = pd.read_csv(self.path + '/' + self.history_files[self.current][1], delimiter=';', header=None)
    189    491.9 MiB      0.0 MiB            self.history_group = self.history.groupby([0])

This snippet is inside a loop, and every iteration increases the memory usage. I also tried deleting the history and history_group objects and calling gc.collect() manually, but nothing seems to work.
Is it possible that there is some cyclic dependency between history and history_group? And if so, why did deleting both history_group and history not solve the problem?

P.S.: My pandas version is 0.23.1


@nynorbert nynorbert commented Jun 13, 2018

Sorry, I was wrong. It is not read_csv that consumes the memory, but rather a drop:

Line #    Mem usage    Increment   Line Contents
 ================================================
   265   1425.1 MiB      9.6 MiB                        self.history.drop(self.history_group.get_group(self.current_timestamp).index)

And I think I found out that malloc_trim solves the problem, similar to this: #2659

@kuraga Maybe you should try it.
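
For anyone who wants to try this, here is a minimal sketch of calling malloc_trim from Python through ctypes. It assumes a glibc-based Linux system; the helper name trim_memory is hypothetical, and on platforms where the C library does not export malloc_trim it simply returns None.

```python
import ctypes
import ctypes.util


def trim_memory():
    """Ask glibc to return free heap memory to the OS.

    Returns the result of malloc_trim(0) (1 if memory was released,
    0 otherwise), or None when malloc_trim is not available.
    """
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return None
    try:
        libc = ctypes.CDLL(libc_path)
    except OSError:
        return None
    # Looking up a missing symbol raises AttributeError, so hasattr is safe.
    if not hasattr(libc, "malloc_trim"):
        return None
    libc.malloc_trim.argtypes = [ctypes.c_size_t]
    libc.malloc_trim.restype = ctypes.c_int
    return libc.malloc_trim(0)


if __name__ == "__main__":
    print("malloc_trim result:", trim_memory())
```

Calling trim_memory() right after `del df` and `gc.collect()` in the original script, and printing RSS again, would show whether the remaining memory is just free heap that glibc had not handed back.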

@zhezherun zhezherun (Contributor) commented Oct 8, 2018

I also noticed a memory leak in read_csv and ran it through valgrind, which said that the result of the kset_from_list function was never freed. I was able to fix this leak locally by patching parsers.pyx and rebuilding pandas.

@gfyoung, could you please review the patch below? It might also help with the leak discussed here, although I am not sure if it is the same leak or not. The patch

  • Moves the allocation of na_hashset further down, closer to where it is used. Otherwise it will not be freed if continue is executed,
  • Makes sure that na_hashset is deleted if there is an exception,
  • Also cleans up the allocation inside kset_from_list before raising an exception.
--- parsers.pyx	2018-08-01 19:57:16.000000000 +0100
+++ parsers.pyx	2018-10-08 15:25:32.124526087 +0100
@@ -1054,18 +1054,6 @@
 
             conv = self._get_converter(i, name)
 
-            # XXX
-            na_flist = set()
-            if self.na_filter:
-                na_list, na_flist = self._get_na_list(i, name)
-                if na_list is None:
-                    na_filter = 0
-                else:
-                    na_filter = 1
-                    na_hashset = kset_from_list(na_list)
-            else:
-                na_filter = 0
-
             col_dtype = None
             if self.dtype is not None:
                 if isinstance(self.dtype, dict):
@@ -1090,13 +1078,26 @@
                                               self.c_encoding)
                 continue
 
-            # Should return as the desired dtype (inferred or specified)
-            col_res, na_count = self._convert_tokens(
-                i, start, end, name, na_filter, na_hashset,
-                na_flist, col_dtype)
+            # XXX
+            na_flist = set()
+            if self.na_filter:
+                na_list, na_flist = self._get_na_list(i, name)
+                if na_list is None:
+                    na_filter = 0
+                else:
+                    na_filter = 1
+                    na_hashset = kset_from_list(na_list)
+            else:
+                na_filter = 0
 
-            if na_filter:
-                self._free_na_set(na_hashset)
+            try:
+                # Should return as the desired dtype (inferred or specified)
+                col_res, na_count = self._convert_tokens(
+                    i, start, end, name, na_filter, na_hashset,
+                    na_flist, col_dtype)
+            finally:
+                if na_filter:
+                    self._free_na_set(na_hashset)
 
             if upcast_na and na_count > 0:
                 col_res = _maybe_upcast(col_res)
@@ -2043,6 +2044,7 @@
 
         # None creeps in sometimes, which isn't possible here
         if not PyBytes_Check(val):
+            kh_destroy_str(table)
             raise ValueError('Must be all encoded bytes')
 
         k = kh_put_str(table, PyBytes_AsString(val), &ret)
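
The control-flow problem the patch addresses can be illustrated in plain Python. This is only a sketch of the pattern, not the parser code: Tracker, process, convert_leaky, convert_fixed and live_after are hypothetical names, and the counter stands in for the C-level hash set that has to be freed explicitly.

```python
class Tracker:
    """Hypothetical stand-in for a C allocator: counts live handles."""

    def __init__(self):
        self.live = 0

    def alloc(self):
        self.live += 1
        return object()

    def free(self, handle):
        self.live -= 1


def process(col):
    """Pretend conversion step that fails for columns marked 'bad'."""
    if col.get("bad"):
        raise ValueError("conversion failed")


def convert_leaky(columns, tracker):
    for col in columns:
        handle = tracker.alloc()   # allocated before the early exit
        if col.get("skip"):
            continue               # handle is never freed
        process(col)               # an exception here also skips free()
        tracker.free(handle)


def convert_fixed(columns, tracker):
    for col in columns:
        if col.get("skip"):
            continue
        handle = tracker.alloc()   # allocated only when it will be used
        try:
            process(col)
        finally:
            tracker.free(handle)   # runs on success and on exception


def live_after(convert, columns):
    """Run a converter and report how many handles are still live."""
    tracker = Tracker()
    try:
        convert(columns, tracker)
    except ValueError:
        pass
    return tracker.live


if __name__ == "__main__":
    columns = [{"skip": True}, {}, {"bad": True}]
    print("leaky:", live_after(convert_leaky, columns))
    print("fixed:", live_after(convert_fixed, columns))
```

In the leaky variant one handle is lost on the skipped column and another on the failing one; moving the allocation below the early exit and freeing in a finally block closes both paths.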

@gfyoung gfyoung (Member) commented Oct 8, 2018

@zhezherun : That's a good catch! Create a PR, and we can review.

@jreback jreback added this to the 0.24.0 milestone Oct 10, 2018

@kuraga kuraga (Author) commented Oct 23, 2018

Trying to patch is cool, but I fear that #2659 (comment)...

gfyoung added a commit to zhezherun/pandas that referenced this issue Nov 19, 2018
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes pandas-devgh-21353.
TomAugspurger added a commit that referenced this issue Nov 19, 2018
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes gh-21353.

@kuraga kuraga (Author) commented Nov 19, 2018

@zhezherun, @TomAugspurger, thanks very much!

But could you please describe the connection with @nynorbert's observation:

And I think I found out that malloc_trim solves the problem, similar to this: #2659

So, we had a memory leak in pandas in addition to glibc's behavior of not trimming after free?

Thanks.

@TomAugspurger TomAugspurger (Contributor) commented Nov 19, 2018

I don't know C, so no. Perhaps @nynorbert can clarify.

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes pandas-devgh-21353.

@kuraga kuraga (Author) commented Jun 6, 2020

The glibc.malloc.mxfast tunable has been introduced in glibc (https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html).


@wasonkartik wasonkartik commented Jul 23, 2020

Hi, I am facing this issue on google compute engine (Windows Server 2012 R2 Datacenter, 64 bit). How do I fix it? I have installed the latest version of Pandas.


@gberth gberth commented Aug 20, 2020

Theory: when reading large files with Python (pd.read_csv, csv.reader, plain Python I/O, or mmap), it seems that the reading thread holds on to memory. If the same thread does a new read, the already allocated memory is reused; if a new thread reads, it acquires additional memory. With pandas on Google, reading 3 files of approx. 100 MB required approx. 3 GB that is not released; with csv.reader approx. 300 MB, and with plain read and mmap approx. 200 MB. So multithreaded reading of the 3 files can result in extensive memory use (25 GB+). This is not my home field, but it has been a frustrating week looking for leaks. If I'm wrong, sorry for the disturbance. (Python 3.7 and 3.8)

@bashtage bashtage (Contributor) commented Aug 20, 2020

@gberth If you use engine="python" do you see that same pattern?


@gberth gberth commented Aug 25, 2020

Sorry, no difference. If I make sure the files are read twice in the same thread, it does not consume or hold more memory. Read in two different threads, both hold 2 GB+ for as long as the threads live (at least that is how it looks to me).
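
If the per-thread pattern comes from glibc's per-thread malloc arenas (an assumption, not something confirmed in this thread), one experiment is to rerun the reader with the MALLOC_ARENA_MAX environment variable set to a small value. The sketch below launches a child Python process with that variable set; the helper name run_with_arena_limit is hypothetical, and the variable only has an effect with glibc's malloc.

```python
import os
import subprocess
import sys


def run_with_arena_limit(python_args, arena_max=1):
    """Run a child Python process with MALLOC_ARENA_MAX set.

    python_args are the arguments passed to the interpreter, for example
    ["reader.py"] (hypothetical script name) or ["-c", "<code>"].
    """
    env = dict(os.environ, MALLOC_ARENA_MAX=str(arena_max))
    return subprocess.run(
        [sys.executable, *python_args],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    result = run_with_arena_limit(
        ["-c", "import os; print(os.environ['MALLOC_ARENA_MAX'])"],
        arena_max=1,
    )
    print(result.stdout.strip())
```

Comparing the memory held by the multithreaded reader with and without the limit would indicate whether arenas are involved; the variable has to be set before the process starts, which is why a child process is used here.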
