BUG: Pandas Discrepancies in Handling Non-English Data During Memory Optimization #58233

Open · smbslt3 opened this issue Apr 12, 2024 · 0 comments
Labels: Bug, Needs Triage

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# ----- Define memory optimize functions -----

def squeeze_memory(raw_df):
    # Downcast every column on a deep copy of the input frame.
    df = raw_df.copy(deep=True)
    df['회사명'] = df['회사명'].astype('category')
    df['평점'] = df['평점'].astype('int8')
    df['작성일'] = pd.to_datetime(df['작성일'], errors='coerce')
    df['제목'] = df['제목'].astype('string[pyarrow]')
    df['추천여부'] = df['추천여부'].astype('category')
    return df

def optimize_memory(df):
    # Build a new frame, converting each deep-copied column individually.
    my_df = pd.DataFrame()
    my_df['회사명'] = df['회사명'].copy(deep=True).astype('category')
    my_df['평점'] = df['평점'].copy(deep=True).astype('int8')
    # Note: format='%Y. %m' does not match the ISO dates in the test data,
    # so errors='coerce' turns this column into NaT values.
    my_df['작성일'] = pd.to_datetime(df['작성일'].copy(deep=True), format='%Y. %m', errors='coerce')
    my_df['제목'] = df['제목'].copy(deep=True).astype('string')
    my_df['추천여부'] = df['추천여부'].copy(deep=True).astype('category')
    return my_df

# ----- Set dataframe for test -----

np.random.seed(0)
df_test = pd.DataFrame({
    '회사명': np.random.choice(['회사A', '회사B', '회사C', '회사D'], size=2000),  # company name ('회사A' = 'company A', etc.)
    '평점': np.random.randint(1, 6, size=2000),  # rating
    '작성일': np.random.choice(pd.date_range(start='2005-01-01', periods=7000, freq='D').astype(str), size=2000),  # date written
    '제목': [f'제목_{i}' for i in range(2000)],  # title
    '추천여부': np.random.choice(['추천', '비추천'], size=2000)  # recommended / not recommended
})

def memory_usage_of_dataframe(my_df):
    mem_usage = my_df.memory_usage(deep=True).sum()
    return mem_usage

df_backup = df_test.copy(deep=True)

# ----- Run optimization and check memory usages -----

print('Korean case')

print('df_test:  ', memory_usage_of_dataframe(df_test))
print('df_backup:', memory_usage_of_dataframe(df_backup))

original_memory  = memory_usage_of_dataframe(df_test)
optimized_memory = memory_usage_of_dataframe(optimize_memory(df_test))
squeezed_memory  = memory_usage_of_dataframe(squeeze_memory(df_test))

print(f"1st measure: {original_memory:,} | {optimized_memory:,} | {squeezed_memory:,}")

original_memory  = memory_usage_of_dataframe(df_test)
optimized_memory = memory_usage_of_dataframe(optimize_memory(df_test))
squeezed_memory  = memory_usage_of_dataframe(squeeze_memory(df_test))

print(f"2nd measure: {original_memory:,} | {optimized_memory:,} | {squeezed_memory:,}")

print('df_test:  ', memory_usage_of_dataframe(df_test))
print('df_backup:', memory_usage_of_dataframe(df_backup))

# ----- Result -----

Korean case
df_test:   689988
df_backup: 689988
1st measure: 689,988 | 212,763 | 59,873
2nd measure: 745,998 | 235,653 | 59,873
df_test:   745998
df_backup: 745998

Issue Description

I encountered strange memory-usage behavior while optimizing the memory footprint of a pandas DataFrame composed of Korean text and numeric data. The code above proceeds as follows:

  1. Generating data containing Korean characters and numbers.
  2. Creating a deep-copied backup (df_backup) so that changes in memory usage can be tracked.
  3. Using two different functions to change the dtype of each column:
    3-1. The squeeze_memory function performs the conversions on a deep copy.
    3-2. The optimize_memory function creates a new DataFrame and fills it from a deep-copied source.
  4. Measuring the memory usage of the original and both optimized DataFrames (1st measure); the sketch after this list shows what the deep measurement counts for object columns.
  5. Re-measuring the memory usage with the same code (2nd measure).
  6. Checking the memory usage of both the original DataFrame (df_test) and the deep-copied backup (df_backup).
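
For object-dtype columns, memory_usage(deep=True) is, to a close approximation, the shallow figure plus a per-element sys.getsizeof sum; the following sketch (mine, not part of the original experiment) makes that accounting explicit:

import sys
import pandas as pd

s = pd.Series(['회사A', '회사B', '제목_1'])   # object dtype holding Python str

shallow = s.memory_usage(deep=False)   # index plus one 8-byte pointer per row
deep = s.memory_usage(deep=True)       # additionally counts each str object

# For plain str elements the two figures below match, so any change in
# sys.getsizeof of the stored strings changes the report even though no
# pandas data changed.
print(deep, shallow + sum(sys.getsizeof(x) for x in s))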

Expected Behavior

There are three main issues in this process:

  1. Despite making deep copies or creating new DataFrames for the optimization, the reported memory usage of the original DataFrame fluctuates, and even increases, between the 1st and 2nd measurements; the sketch after this list reduces this to a single column.
  2. df_backup, which was deep-copied and never altered during the process, shows a change in memory usage in the final step.
  3. All of these issues occur only with non-English data: the same problems were observed with Korean, Chinese, and Spanish data, whereas the code behaves as expected with English data.
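
The fluctuation in issue 1 can be reduced to a single untouched Series. The sketch below is my own minimal reduction, assuming (not verified here) that the pyarrow-backed string conversion alone triggers the change:

import pandas as pd

s = pd.Series([f'제목_{i}' for i in range(2000)])   # non-ASCII object-dtype strings

before = s.memory_usage(deep=True)
_ = s.astype('string[pyarrow]')    # convert a copy; s itself is never modified
after = s.memory_usage(deep=True)

# With non-ASCII data, `after` can exceed `before` even though s was untouched.
print(f'{before:,} -> {after:,}')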

Here is the result for the English case:

English case
df_test:   551178
df_backup: 551178
1st measure: 551,178 | 155,698 | 57,698
2nd measure: 551,178 | 155,698 | 57,698
df_test:   551178
df_backup: 551178
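
One mechanism that would explain why only non-ASCII data is affected (my assumption, not something confirmed in this report) is CPython's internal UTF-8 cache: the first time a C extension requests the UTF-8 form of a non-ASCII str, CPython caches a UTF-8 copy on the object, and sys.getsizeof then includes that cache. ASCII strings already store their UTF-8 bytes as the internal representation, so English data is unaffected. A direct demonstration via the C API:

import ctypes
import sys

s = '회사' * 100                  # non-ASCII: no UTF-8 cache populated yet
size_before = sys.getsizeof(s)

# Request the UTF-8 form the way C extensions do; as a side effect CPython
# caches a UTF-8 copy on the string object itself.
ctypes.pythonapi.PyUnicode_AsUTF8(ctypes.py_object(s))
size_after = sys.getsizeof(s)
print(size_before, size_after)    # size_after > size_before

a = 'a' * 100                     # ASCII: internal bytes already are UTF-8
ctypes.pythonapi.PyUnicode_AsUTF8(ctypes.py_object(a))
print(sys.getsizeof(a))           # unchanged by the same call

If the string[pyarrow] or datetime conversions request UTF-8 this way, df_test and df_backup would grow together: DataFrame.copy(deep=True) copies the object array but not the Python str objects inside it, so the backup shares the very strings whose reported size just changed.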

Please refer to the following Colab notebook for this experiment: (https://colab.research.google.com/drive/1VJwrda_PzuzveSkrZVmenILG3c_5LlL-?usp=sharing)

I'm reporting this issue from a Google Colab environment, but the same issue occurs in my local environment (Windows 11, Python 3.11, latest pandas version).

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.1.58+
Version : #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.25.2
pytz : 2023.4
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.10
pytest : 7.4.4
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.4
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 7.34.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.58.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.19.2
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.29
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
