
BUG: to_json memory leak (introduced in 1.1.0) #43877

Closed · Fixed by #45489

vernetya opened this issue on Oct 4, 2021 · 11 comments
Labels: Bug · IO JSON (read_json, to_json, json_normalize) · Performance (memory or execution speed performance)
Milestone: Contributions Welcome

Comments

@vernetya (Contributor) commented on Oct 4, 2021:

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# any loop will do
for _ in range(1000):
    df = pd.DataFrame({str(c): np.random.random_sample(size=100_000) for c in range(10)})
    df.to_json()  # same leak regardless of orient, or when writing to a file

Issue Description

There appears to be a memory leak when calling to_json, introduced in version 1.1.0. It seems to prevent the DataFrame from being correctly garbage collected. Here's a memory profile of pandas 1.1.0 compared to the previous version, 1.0.5:

[figure: memory profile of the loop above, pandas 1.1.0 vs 1.0.5]

I see the same trend on Windows 10 and Ubuntu Linux, with Python 3.7, 3.8, and 3.9.
The leak is still present in the latest pandas version (1.3.3) and is proportional to the size of the DataFrame. I've tried explicit calls to del and gc.collect(), but they don't change anything.

It's specific to the to_json method. I haven't observed a leak with other formats such as CSV.

I don't know if this makes sense or helps, but here's the output of tracemalloc from this code:

import tracemalloc

import numpy as np
import pandas as pd


def foo():
    df = pd.DataFrame({str(c): np.random.random_sample(size=100_000) for c in range(5)})
    df.to_json()


if __name__ == "__main__":
    tracemalloc.start(50)

    foo()

    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('traceback')

    # pick the biggest memory block
    stat = top_stats[0]
    print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
    for line in stat.traceback.format():
        print(line)

With pandas 1.1.0 or 1.3.3:

5 memory blocks: 782.5 KiB
  File "main.py"
    foo()
  File "main.py"
    df.to_json()
  File ".\ven37\lib\site-packages\pandas\core\generic.py", line 2571
    storage_options=storage_options,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 122
    indent=indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 183
    indent=self.indent,
  File ".\ven37\lib\site-packages\pandas\core\indexes\base.py", line 4367
    return self._data
  File ".\ven37\lib\site-packages\pandas\core\indexes\range.py", line 186
    return np.arange(self.start, self.stop, self.step, dtype=np.int64)

whereas 1.0.5 produces this:

9 memory blocks: 1.6 KiB
  File "main.py"
    foo()
  File "main.py"
    df.to_json()
  File ".\ven37\lib\site-packages\pandas\core\generic.py", line 2364
    indent=indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 85
    indent=indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 145
    self.indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 245
    indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 167
    indent=indent,

Expected Behavior

No leak is expected, matching the behavior of version 1.0.5.

Installed Versions

Versions with leak:
master -----------------
commit : 6599834
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.4.0.dev0+833.g6599834103
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
...

1.3.3 ------------------
commit : 73c6825
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.3.3
numpy : 1.21.2
...

1.1.0 ------------------
commit : d9fff27
pandas : 1.1.0
numpy : 1.21.2
...

Versions without leak:
commit : None
pandas : 1.0.5
numpy : 1.21.2
...

vernetya added the Bug and Needs Triage labels on Oct 4, 2021
@jreback (Contributor) commented on Oct 4, 2021:

pls check master as well

@phofl (Member) commented on Oct 4, 2021:

persists on master

@vernetya (Contributor, Author) commented on Oct 5, 2021:

Hi,

yes, I still see the same leak on master:

INSTALLED VERSIONS

commit : 6599834
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.4.0.dev0+833.g6599834103
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
...

mzeitlin11 added the IO JSON and Performance labels and removed the Needs Triage label on Oct 9, 2021
mzeitlin11 added this to the Contributions Welcome milestone on Oct 9, 2021
@jbrockmendel (Member) commented:

cc @WillAyd

@WillAyd (Member) commented on Oct 21, 2021 via email

@asmodehn commented:
I ran valgrind on this script with Python 3.9.5 and the current master version of pandas (commit 9018d327de):

import pandas as pd
import numpy as np

df = pd.DataFrame({str(c): np.random.random_sample(size=100_000) for c in range(10)})
df.to_json()

Valgrind reports quite a few leaks, although I am not familiar with the tool, so I can't be sure what to conclude from the results.
I am only showing the (possible) leaks that are directly related to to_json:

[...]
==19968== 400 bytes in 1 blocks are possibly lost in loss record 14,852 of 16,824
==19968==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==19968==    by 0x2928DB: _PyObject_GC_Alloc (gcmodule.c:2237)
==19968==    by 0x2928DB: _PyObject_GC_Malloc (gcmodule.c:2264)
==19968==    by 0x2928DB: _PyObject_GC_NewVar (gcmodule.c:2293)
==19968==    by 0x1643C3: frame_alloc (frameobject.c:790)
==19968==    by 0x1643C3: _PyFrame_New_NoTrack (frameobject.c:885)
==19968==    by 0x163D8C: function_code_fastcall (call.c:319)
==19968==    by 0x31E65D: _PyObject_VectorcallTstate (abstract.h:118)
==19968==    by 0x31E65D: PyObject_CallOneArg (abstract.h:188)
==19968==    by 0x31E65D: property_descr_get (descrobject.c:1573)
==19968==    by 0x1B6F09: _PyObject_GenericGetAttrWithDict (object.c:1201)
==19968==    by 0x1B6F09: PyObject_GenericGetAttr (object.c:1280)
==19968==    by 0x1B84CA: PyObject_GetAttrString (object.c:795)
==19968==    by 0x1B84CA: PyObject_GetAttrString (object.c:786)
==19968==    by 0x32CBBEA5: get_values (objToJSON.c:224)
==19968==    by 0x32CC078A: Object_beginTypeContext (objToJSON.c:1763)
==19968==    by 0x32CB9239: encode (ultrajsonenc.c:966)
==19968==    by 0x32CB9D83: JSON_EncodeObject (ultrajsonenc.c:1190)
==19968==    by 0x32CC170D: objToJSON (objToJSON.c:2089)
[...]
==19968== 416 bytes in 1 blocks are possibly lost in loss record 14,873 of 16,824
==19968==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==19968==    by 0x2928DB: _PyObject_GC_Alloc (gcmodule.c:2237)
==19968==    by 0x2928DB: _PyObject_GC_Malloc (gcmodule.c:2264)
==19968==    by 0x2928DB: _PyObject_GC_NewVar (gcmodule.c:2293)
==19968==    by 0x1643C3: frame_alloc (frameobject.c:790)
==19968==    by 0x1643C3: _PyFrame_New_NoTrack (frameobject.c:885)
==19968==    by 0x163D8C: function_code_fastcall (call.c:319)
==19968==    by 0x31E65D: _PyObject_VectorcallTstate (abstract.h:118)
==19968==    by 0x31E65D: PyObject_CallOneArg (abstract.h:188)
==19968==    by 0x31E65D: property_descr_get (descrobject.c:1573)
==19968==    by 0x1B6F09: _PyObject_GenericGetAttrWithDict (object.c:1201)
==19968==    by 0x1B6F09: PyObject_GenericGetAttr (object.c:1280)
==19968==    by 0x1D8F7C: slot_tp_getattr_hook (typeobject.c:6778)
==19968==    by 0x1B84CA: PyObject_GetAttrString (object.c:795)
==19968==    by 0x1B84CA: PyObject_GetAttrString (object.c:786)
==19968==    by 0x32CC05B5: Object_beginTypeContext (objToJSON.c:1723)
==19968==    by 0x32CB9239: encode (ultrajsonenc.c:966)
==19968==    by 0x32CB9D83: JSON_EncodeObject (ultrajsonenc.c:1190)
==19968==    by 0x32CC170D: objToJSON (objToJSON.c:2089)
[...]
==19968== LEAK SUMMARY:
==19968==    definitely lost: 33,072 bytes in 198 blocks
==19968==    indirectly lost: 12,064 bytes in 159 blocks
==19968==      possibly lost: 14,899,972 bytes in 96,619 blocks
==19968==    still reachable: 1,195,913 bytes in 8,073 blocks
==19968==                       of which reachable via heuristic:
==19968==                         stdstring          : 2,484 bytes in 62 blocks
==19968==                         multipleinheritance: 992 bytes in 12 blocks
==19968==         suppressed: 0 bytes in 0 blocks
==19968== Reachable blocks (those to which a pointer was found) are not shown.
==19968== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==19968==
==19968== For lists of detected and suppressed errors, rerun with: -s
==19968== ERROR SUMMARY: 15290 errors from 15287 contexts (suppressed: 8 from 4)

@WillAyd (Member) commented on Oct 25, 2021:

At least one of the leaks points at line 1763. Between that and line 1780, the code looks suspicious and can likely be refactored:

values = get_values(tmpObj);

Line 1723 is also a culprit from what you've shared, though I don't think that code was refactored around the time of 1.1.0.
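
For context, here is a simplified sketch of the suspected pattern. This is a hypothetical reconstruction, not the actual objToJSON.c source: the names follow the snippet above, and get_values() is assumed to return a new reference, as the valgrind traces through PyObject_GetAttrString suggest.

#include <Python.h>

/* Hypothetical sketch of the suspected leak around objToJSON.c lines
 * 1763-1780 -- not the actual pandas source. get_values() stands in for
 * the real helper and is assumed to return a new reference. */
static PyObject *get_values(PyObject *obj);

static void begin_type_context_sketch(PyObject *obj, PyObject *tmpObj)
{
    PyObject *values = get_values(tmpObj); /* ~line 1763: new reference */

    /* ... further down, the same data is fetched a second time ... */
    values = get_values(obj); /* ~line 1780: overwrites the pointer, so the
                                 array from the first call is never
                                 Py_DECREF'd and its buffer leaks */
    if (values != NULL) {
        /* ... encode from values ... */
        Py_DECREF(values); /* only the second reference is ever released */
    }
}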

@WillAyd (Member) commented on Oct 25, 2021:

Thanks for running that by the way!

@vernetya (Contributor, Author) commented:

yep, thanks for running it.

FYI, it's reproducible without using numpy to create the DataFrame:

import pandas as pd

for _ in range(1000):
    df = pd.DataFrame({str(c): list(range(100_000)) for c in range(10)})
    df.to_json()

@WillAyd (Member) commented on Oct 27, 2021:

Cool, an even more minimal example. So yes, I am 99% sure the problem is that the result of the get_values call on line 1763 is never released, and that it is also duplicative of the call on line 1780. Either releasing the result of line 1763, or refactoring so get_values doesn't get called twice, should help.
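
In reference-counting terms, the two options might look roughly like this. Again, this is a sketch under the same assumptions as the earlier one, not necessarily the patch that actually landed in #45489:

/* Option 1: release the first result before the pointer is reused. */
PyObject *values = get_values(tmpObj); /* ~line 1763 */
/* ... */
Py_XDECREF(values);       /* drop the first reference so it can be freed */
values = get_values(obj); /* ~line 1780 */

/* Option 2: refactor so get_values() is called only once, giving the
 * result a single owner and a single release. */
values = get_values(obj);
if (values != NULL) {
    /* ... encode from values ... */
    Py_DECREF(values); /* exactly one DECREF per new reference: no leak */
}

Either way, the invariant is one Py_DECREF for every new reference that get_values returns.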

vernetya mentioned this issue on Jan 20, 2022
@vernetya (Contributor, Author) commented:

Hi @WillAyd
I took a closer look and may have found the culprit. I created a pull request; could you have a look?

Thanks
