Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
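The full reproduction is in the Colab notebook linked under Issue Description below. The sketch here is only a rough, self-contained illustration of the workflow: the squeeze_memory and optimize_memory bodies are simplified stand-ins for the helpers described below, not the notebook's actual code.

```python
import numpy as np
import pandas as pd


def squeeze_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Deep-copy the frame, then downcast numeric columns (simplified stand-in)."""
    out = df.copy(deep=True)
    for col in out.select_dtypes(include="number").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


def optimize_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Build a new frame column by column from a deep-copied source (simplified stand-in)."""
    src = df.copy(deep=True)
    out = pd.DataFrame(index=src.index)
    for col in src.columns:
        if pd.api.types.is_numeric_dtype(src[col]):
            out[col] = pd.to_numeric(src[col], downcast="integer")
        else:
            out[col] = src[col]
    return out


# Korean text plus numeric data, as in the description below.
df_test = pd.DataFrame(
    {
        "text": ["안녕하세요"] * 100_000,
        "value": np.arange(100_000, dtype="int64"),
    }
)
df_backup = df_test.copy(deep=True)

# 1st measure: original and optimized frames.
print(df_test.memory_usage(deep=True).sum())
print(squeeze_memory(df_test).memory_usage(deep=True).sum())
print(optimize_memory(df_test).memory_usage(deep=True).sum())

# 2nd measure: repeat the same calls on the untouched original and backup.
print(df_test.memory_usage(deep=True).sum())
print(df_backup.memory_usage(deep=True).sum())
```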
Issue Description
I encountered a strange memory usage bug while optimizing the memory usage of a pandas DataFrame composed of Korean text and numerical data. The process followed in the reproduction code (see the linked Colab notebook) includes:

1. Generating data containing Korean characters and numbers, and creating a deep-copy backup (df_backup) to track changes in memory usage.
2. Using two different functions to change the dtype of each column:
2-1. The squeeze_memory function performs a deep copy.
2-2. The optimize_memory function creates a new DataFrame and adds data from a deep-copied source.
3. Measuring the memory usage of both the original and the optimized DataFrames (1st measure).
4. Re-measuring the memory usage after the memory optimization with the same code (2nd measure).
5. Checking the memory usage of both the original DataFrame (df_test) and the deep-copied backup DataFrame (df_backup).

Expected Behavior

There are three main issues in this process:

1. Despite making deep copies or creating new DataFrames for memory optimization, the memory usage reported for the original DataFrame fluctuates (and even increases) between the 1st and 2nd measurements.
2. df_backup, which was deep-copied and not altered in any way during the process, shows a change in memory usage in the final step.
3. All of these issues occur only with non-English text: the same problems were observed with Korean, Chinese, and Spanish data, whereas the code behaves as expected with English data (see the linked notebook for the English case's result, and the note after this list).
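For reference (this snippet is not from the original notebook), the measurement behind the numbers above is memory_usage(deep=True), which for object-dtype columns essentially sums the sizes of the individual Python string objects; those per-string sizes differ between ASCII and Korean text because CPython stores them with different widths per character. The snippet also repeats the same call on an untouched deep-copied backup, which is the check the second issue refers to.

```python
import sys

import pandas as pd

# Per-string object sizes drive the deep measurement for object-dtype columns,
# and they differ between ASCII and non-ASCII text.
print(sys.getsizeof("hello"), sys.getsizeof("안녕하세요"))

df = pd.DataFrame({"text": ["안녕하세요"] * 1_000, "value": range(1_000)})
backup = df.copy(deep=True)

# Two identical measurements of an untouched deep copy; one would expect
# the same number both times.
print(backup.memory_usage(deep=True).sum())
print(backup.memory_usage(deep=True).sum())
```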
Please refer to the following Colab notebook for this experiment: https://colab.research.google.com/drive/1VJwrda_PzuzveSkrZVmenILG3c_5LlL-?usp=sharing
I'm reporting this issue from a Google Colab environment, but the same issue occurs in my local environment (Windows 11, Python 3.11, latest pandas version).
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.1.58+
Version : #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.25.2
pytz : 2023.4
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.10
pytest : 7.4.4
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.4
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 7.34.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.58.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.19.2
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.29
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None