zjweiss-google
Contributor

Previously, this could trigger ASan leak reports because the vecs and lens buffers allocated with malloc were never freed.

This was noticed internally on our codebase.

This is identical to #62569, but I had to switch github accounts.

@Alvaro-Kothe
Contributor

Alvaro-Kothe commented Oct 3, 2025

I built pandas with -fsanitize=address and wrote this Python script to check for the error:

# t.py
import numpy as np
from pandas._libs.hashing import hash_object_array

arr = np.array(["hello", None, "world", 42], dtype=object)
key = "1234567890123456"

hash_object_array(arr, key)

I ran it with:

LD_PRELOAD=/usr/lib64/libasan.so.8 python t.py

The ASan output includes:

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f700f8e6f2b in malloc (/usr/lib64/libasan.so.8+0xe6f2b) (BuildId: 10b8ccd49f75c21babf1d7abe51bb63589d8471f)
    #1 0x7b6fb107edcd in __pyx_pf_6pandas_5_libs_7hashing_hash_object_array pandas/_libs/hashing.cpython-313-x86_64-linux-gnu.so.p/pandas/_libs/hashing.pyx.c:5979
    #2 0x7b6fb107d4fa in __pyx_pw_6pandas_5_libs_7hashing_1hash_object_array pandas/_libs/hashing.cpython-313-x86_64-linux-gnu.so.p/pandas/_libs/hashing.pyx.c:5731
    #3 0x7b6fb10990e2 in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS pandas/_libs/hashing.cpython-313-x86_64-linux-gnu.so.p/pandas/_libs/hashing.pyx.c:12106
    #4 0x7f700f340aa6 in PyObject_Vectorcall (/lib64/libpython3.13.so.1.0+0x140aa6) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #5 0x7f700f353a8c in _PyEval_EvalFrameDefault (/lib64/libpython3.13.so.1.0+0x153a8c) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #6 0x7f700f42a05e in PyEval_EvalCode (/lib64/libpython3.13.so.1.0+0x22a05e) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #7 0x7f700f469b3b in run_eval_code_obj (/lib64/libpython3.13.so.1.0+0x269b3b) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #8 0x7f700f466c64 in run_mod (/lib64/libpython3.13.so.1.0+0x266c64) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #9 0x7f700f4632e6 in pyrun_file (/lib64/libpython3.13.so.1.0+0x2632e6) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #10 0x7f700f462f5e in _PyRun_SimpleFileObject (/lib64/libpython3.13.so.1.0+0x262f5e) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #11 0x7f700f462d80 in _PyRun_AnyFileObject (/lib64/libpython3.13.so.1.0+0x262d80) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #12 0x7f700f461146 in Py_RunMain (/lib64/libpython3.13.so.1.0+0x261146) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #13 0x7f700f417a4a in Py_BytesMain (/lib64/libpython3.13.so.1.0+0x217a4a) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #14 0x7f700f011574 in __libc_start_call_main (/lib64/libc.so.6+0x3574) (BuildId: 48c4b9b1efb1df15da8e787f489128bf31893317)
    #15 0x7f700f011627 in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3627) (BuildId: 48c4b9b1efb1df15da8e787f489128bf31893317)
    #16 0x55f2cd5923d4 in _start (/usr/bin/python3.13+0x3d4) (BuildId: 53a9c612272faab67c9cb9166575b1726a625502)

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f700f8e6f2b in malloc (/usr/lib64/libasan.so.8+0xe6f2b) (BuildId: 10b8ccd49f75c21babf1d7abe51bb63589d8471f)
    #1 0x7b6fb107ed6b in __pyx_pf_6pandas_5_libs_7hashing_hash_object_array pandas/_libs/hashing.cpython-313-x86_64-linux-gnu.so.p/pandas/_libs/hashing.pyx.c:5942
    #2 0x7b6fb107d4fa in __pyx_pw_6pandas_5_libs_7hashing_1hash_object_array pandas/_libs/hashing.cpython-313-x86_64-linux-gnu.so.p/pandas/_libs/hashing.pyx.c:5731
    #3 0x7b6fb10990e2 in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS pandas/_libs/hashing.cpython-313-x86_64-linux-gnu.so.p/pandas/_libs/hashing.pyx.c:12106
    #4 0x7f700f340aa6 in PyObject_Vectorcall (/lib64/libpython3.13.so.1.0+0x140aa6) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #5 0x7f700f353a8c in _PyEval_EvalFrameDefault (/lib64/libpython3.13.so.1.0+0x153a8c) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #6 0x7f700f42a05e in PyEval_EvalCode (/lib64/libpython3.13.so.1.0+0x22a05e) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #7 0x7f700f469b3b in run_eval_code_obj (/lib64/libpython3.13.so.1.0+0x269b3b) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #8 0x7f700f466c64 in run_mod (/lib64/libpython3.13.so.1.0+0x266c64) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #9 0x7f700f4632e6 in pyrun_file (/lib64/libpython3.13.so.1.0+0x2632e6) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #10 0x7f700f462f5e in _PyRun_SimpleFileObject (/lib64/libpython3.13.so.1.0+0x262f5e) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #11 0x7f700f462d80 in _PyRun_AnyFileObject (/lib64/libpython3.13.so.1.0+0x262d80) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #12 0x7f700f461146 in Py_RunMain (/lib64/libpython3.13.so.1.0+0x261146) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #13 0x7f700f417a4a in Py_BytesMain (/lib64/libpython3.13.so.1.0+0x217a4a) (BuildId: 0872b3c7f17afc4f09c2021d53aee657a2d60be7)
    #14 0x7f700f011574 in __libc_start_call_main (/lib64/libc.so.6+0x3574) (BuildId: 48c4b9b1efb1df15da8e787f489128bf31893317)
    #15 0x7f700f011627 in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3627) (BuildId: 48c4b9b1efb1df15da8e787f489128bf31893317)
    #16 0x55f2cd5923d4 in _start (/usr/bin/python3.13+0x3d4) (BuildId: 53a9c612272faab67c9cb9166575b1726a625502)
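When comparing sanitizer output before and after a fix, it can help to reduce the report to a per-function total. The following is a minimal sketch, not part of pandas or ASan: `summarize_leaks` is a hypothetical helper, and the sample text is an abbreviated copy of the two records above with the file paths shortened.

```python
import re
from collections import Counter

# Matches leak record headers such as
# "Direct leak of 32 byte(s) in 1 object(s) allocated from:"
LEAK_RE = re.compile(r"^(Direct|Indirect) leak of (\d+) byte\(s\) in (\d+) object\(s\)")
# Matches a stack frame line such as "    #1 0x7b6f... in some_function ..."
FRAME_RE = re.compile(r"^\s*#(\d+)\s+0x[0-9a-f]+\s+in\s+(\S+)")


def summarize_leaks(report: str) -> Counter:
    """Total leaked bytes keyed by the first non-malloc frame (hypothetical helper)."""
    totals = Counter()
    pending = None  # byte count of the leak record currently being read
    for line in report.splitlines():
        leak = LEAK_RE.match(line)
        if leak:
            pending = int(leak.group(2))
            continue
        frame = FRAME_RE.match(line)
        if frame and pending is not None and frame.group(2) != "malloc":
            totals[frame.group(2)] += pending
            pending = None  # attribute each record to a single frame
    return totals


# Abbreviated sample: paths and BuildIds are shortened for illustration.
sample = """\
Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f700f8e6f2b in malloc (/usr/lib64/libasan.so.8+0xe6f2b)
    #1 0x7b6fb107edcd in __pyx_pf_6pandas_5_libs_7hashing_hash_object_array hashing.pyx.c:5979

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f700f8e6f2b in malloc (/usr/lib64/libasan.so.8+0xe6f2b)
    #1 0x7b6fb107ed6b in __pyx_pf_6pandas_5_libs_7hashing_hash_object_array hashing.pyx.c:5942
"""

if __name__ == "__main__":
    for func, nbytes in summarize_leaks(sample).items():
        print(f"{func}: {nbytes} byte(s)")
```

For the two records shown, both 32-byte leaks are attributed to the same generated function, for a total of 64 bytes.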

Line 5942 of the generated hashing.pyx.c:

  /* "pandas/_libs/hashing.pyx":70
 * 
 *     # create an array of bytes
 *     vecs = <char **>malloc(n * sizeof(char *))             # <<<<<<<<<<<<<<
 *     if vecs is NULL:
 *         raise MemoryError()
*/
  __pyx_v_vecs = ((char **)malloc((__pyx_v_n * (sizeof(char *)))));

Line 5979 of the generated hashing.pyx.c:

  /* "pandas/_libs/hashing.pyx":73
 *     if vecs is NULL:
 *         raise MemoryError()
 *     lens = <uint64_t*>malloc(n * sizeof(uint64_t))             # <<<<<<<<<<<<<<
 *     if lens is NULL:
 *         raise MemoryError()
*/
  __pyx_v_lens = ((__pyx_t_5numpy_uint64_t *)malloc((__pyx_v_n * (sizeof(__pyx_t_5numpy_uint64_t)))));
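A common way to avoid this class of leak is to release the buffers in a finally block so they are freed on the exception path too. The sketch below is a pure-Python illustration of that pattern, not the code in this PR: `TrackingAllocator`, `validate`, `hash_leaky`, and `hash_safe` are hypothetical names, and it assumes the leak occurs when an exception is raised after both malloc calls (the reproduction script includes a non-string element, 42, which plausibly takes that path).

```python
class TrackingAllocator:
    """Toy stand-in for malloc/free that records outstanding blocks."""

    def __init__(self):
        self.live = set()
        self._next = 1

    def malloc(self, size):
        handle = self._next
        self._next += 1
        self.live.add(handle)
        return handle

    def free(self, handle):
        self.live.discard(handle)


def validate(values):
    # Assumed behavior: reject elements that are not str, bytes, or None.
    for val in values:
        if not (val is None or isinstance(val, (str, bytes))):
            raise TypeError(f"{val!r} is not a valid type for hashing")


def hash_leaky(alloc, values):
    vecs = alloc.malloc(len(values) * 8)
    lens = alloc.malloc(len(values) * 8)
    validate(values)  # an exception here skips the frees below
    alloc.free(vecs)
    alloc.free(lens)


def hash_safe(alloc, values):
    vecs = alloc.malloc(len(values) * 8)
    lens = alloc.malloc(len(values) * 8)
    try:
        validate(values)
    finally:
        # Runs on both the normal path and the exception path.
        alloc.free(vecs)
        alloc.free(lens)


def outstanding_after(fn, values):
    """Run fn and return how many blocks were never freed."""
    alloc = TrackingAllocator()
    try:
        fn(alloc, values)
    except TypeError:
        pass
    return len(alloc.live)


if __name__ == "__main__":
    values = ["hello", None, "world", 42]
    for fn in (hash_leaky, hash_safe):
        print(fn.__name__, "outstanding blocks:", outstanding_after(fn, values))
```

With the same input as t.py, the leaky variant leaves two blocks outstanding (mirroring the two leak records), while the try/finally variant leaves none.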

Good catch!

mroeschke added this to the 3.0 milestone Oct 4, 2025
mroeschke added the Performance (Memory or execution speed performance) label Oct 4, 2025
mroeschke merged commit 323036b into pandas-dev:main Oct 4, 2025
47 checks passed
@mroeschke
Member

Thanks @zjweiss-google

@zjweiss-google
Contributor Author

zjweiss-google commented Oct 4, 2025 via email
