
PERF: Possible Memory Leak when Importing Parquet File with PyArrow Engine in Pandas #59969

@Voltagabbana


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Description

We've identified a memory leak when importing Parquet files into Pandas DataFrames using the PyArrow engine. The issue occurs specifically during the conversion from Arrow to Pandas objects, as memory is not released even after deleting the DataFrame and invoking garbage collection.

Key findings:

  • No leak with PyArrow alone: When using PyArrow to read Parquet without converting to Pandas (i.e., no .to_pandas()), the memory leak does not occur (see the baseline sketch after this list).
  • Leak with .to_pandas(): The memory leak appears during the conversion from Arrow to Pandas, suggesting the problem is tied to this process.
  • No issue with Fastparquet or Polars: Fastparquet and Polars (even with PyArrow) do not exhibit this memory issue, reinforcing that the problem is in Pandas’ handling of Arrow data.
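
For reference, here is a minimal sketch of the pure-PyArrow baseline we used (the path is a placeholder; pyarrow.total_allocated_bytes() reports what PyArrow's default memory pool still holds):

import gc
import pyarrow as pa
import pyarrow.parquet as pq

for _ in range(10):
    table = pq.read_table("/data/to/file.parquet")  # no .to_pandas()
    del table
    gc.collect()
    # The pool drops back down after every iteration in this loop,
    # so Arrow itself is not retaining the buffers.
    print(pa.total_allocated_bytes())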

Reproduction Code

import pandas as pd
import polars as pl
import gc
import pyarrow.parquet
import ctypes

# To manually trigger memory release: glibc's malloc_trim(0) returns
# free'd heap pages to the OS (C signature: int malloc_trim(size_t pad))
malloc_trim = ctypes.CDLL("libc.so.6").malloc_trim
malloc_trim.argtypes = [ctypes.c_size_t]
malloc_trim.restype = ctypes.c_int

for _ in range(10):
    df = pd.read_parquet("/data/to/file.parquet", engine="pyarrow")
    # Also tested with:
    # df = pyarrow.parquet.read_pandas("/data/to/file.parquet").to_pandas()
    # df = pl.read_parquet("/data/to/file.parquet", use_pyarrow=True)

    del df  # Explicitly delete the DataFrame
    print(gc.get_count())  # Object counts before garbage collection

    for _ in range(3):  # Force garbage collection multiple times
        gc.collect()

    print(gc.get_count())  # Object counts after garbage collection

# Calling malloc_trim(0) is the only way we found to release the memory
# malloc_trim(0)

Observations:

  • Garbage Collection: Despite invoking the garbage collector multiple times, the memory allocated to the Python process keeps increasing when .to_pandas() is used, indicating improper memory release during the conversion (a measurement sketch follows this list).
  • Direct Use of PyArrow: When we import the data directly using PyArrow (without converting to Pandas), the memory usage remains stable, showing that the problem originates in the Arrow-to-Pandas conversion process.
  • Manual Memory Release (ctypes): The only reliable way we have found to release the memory is by manually calling malloc_trim(0) via ctypes. However, we believe this is not a proper solution and that memory management should be handled internally by Pandas.
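
For completeness, this is roughly how the resident memory of the process can be tracked to reproduce the observation. A sketch assuming Linux (it reads VmRSS from /proc/self/status, so no third-party dependency is needed); the path is again a placeholder:

import gc
import ctypes
import pandas as pd

def rss_mib() -> float:
    """Current resident set size of this process in MiB (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # reported in kB
    return 0.0

malloc_trim = ctypes.CDLL("libc.so.6").malloc_trim

for i in range(10):
    df = pd.read_parquet("/data/to/file.parquet", engine="pyarrow")
    del df
    gc.collect()
    print(f"iteration {i}: {rss_mib():.1f} MiB after gc.collect()")

# RSS only drops back once glibc's free lists are trimmed.
malloc_trim(0)
print(f"after malloc_trim(0): {rss_mib():.1f} MiB")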

OS environment

Icon name: computer-vm
Chassis: vm
Virtualization: microsoft
Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
Kernel: Linux 4.18.0-553.16.1.el8_10.x86_64
Architecture: x86-64

Affected Versions

pandas==2.2.2
pandas==2.2.3
Latest development version (as of writing)

Conclusion

The issue seems to occur during the conversion from Arrow to Pandas, rather than being a problem within PyArrow itself. Given that memory is only released by manually invoking malloc_trim(0), we suspect there is a problem with how Pandas handles memory management when working with Arrow data. This issue does not arise when using Fastparquet or Polars, further indicating that it is specific to the Pandas-Arrow interaction.

We recommend investigating how memory is allocated and released during the conversion from Arrow objects to Pandas DataFrames to resolve this issue.
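
As a possible mitigation in the meantime (our assumption, not a verified fix), the copy made during the conversion can be reduced or avoided with documented options; we have not yet confirmed whether either avoids the growth described above:

import pandas as pd
import pyarrow.parquet as pq

# Option 1: keep Arrow-backed extension dtypes, skipping the copy into
# NumPy blocks made by the default Arrow-to-pandas conversion
# (dtype_backend is available in pandas >= 2.0).
df = pd.read_parquet(
    "/data/to/file.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

# Option 2: PyArrow's documented memory-saving flags for to_pandas(),
# which split the result into more blocks and free Arrow buffers as
# the conversion proceeds.
table = pq.read_table("/data/to/file.parquet")
df2 = table.to_pandas(split_blocks=True, self_destruct=True)
del table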

Please let us know if further details are needed, and we are happy to assist.

We would appreciate any feedback or insights from the maintainers and other contributors on how to improve memory management in this context.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.14.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-553.16.1.el8_10.x86_64
Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 2.0.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Prior Performance

No response
