Copy and groupby-sum change row-major ordering to column-major ordering with huge impact on performance #26502

@TannhauserGate42

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

print("pandas version: ", pd.__version__)

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Numpy array is C-contiguous: ", array.flags.c_contiguous)

dataframe = pd.DataFrame(array, index = pd.MultiIndex.from_tuples([('A', 'U'), ('A', 'V'), ('B', 'W')], names=['dim_one', 'dim_two']))
print("DataFrame is C-contiguous: ", dataframe.values.flags.c_contiguous)

dataframe_copy = dataframe.copy()
print("Copy of DataFrame is C-contiguous: ", dataframe_copy.values.flags.c_contiguous)

aggregated_dataframe = dataframe.groupby('dim_one').sum()
print("Aggregated DataFrame is C-contiguous: ", aggregated_dataframe.values.flags.c_contiguous)

## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  False
# Aggregated DataFrame is C-contiguous:  False

Problem description

pandas.DataFrame.copy returns a copy whose underlying NumPy array has its major ordering changed from C to F if it was initially in C order.

pandas.DataFrame.groupby.sum returns a DataFrame whose underlying NumPy array has its major ordering changed from C to F if the original DataFrame's values were initially in C order.
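For readers less familiar with the terminology, the flags checked in the code sample can be illustrated with NumPy alone. This is only a sketch: the names `c_array` and `f_array` are made up for illustration, and the F-ordered copy is created explicitly here rather than by pandas.

```python
import numpy as np

# Same 3x3 data as in the report, stored row by row (C order, row-major).
c_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# An explicit copy of the same values stored column by column (F order, column-major).
f_array = np.asfortranarray(c_array)

print("C-ordered array flags: ", c_array.flags.c_contiguous, c_array.flags.f_contiguous)
print("F-ordered array flags: ", f_array.flags.c_contiguous, f_array.flags.f_contiguous)

# The values compare equal; only the memory layout (the strides) differs.
print("Equal values: ", np.array_equal(c_array, f_array))
print("Strides: ", c_array.strides, f_array.strides)
```

A `False` for `c_contiguous` in the report's output therefore means the values are unchanged but are now laid out differently in memory.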

This impacts aggregation performance substantially! I observed a slowdown from 17 seconds to 5 min 46 seconds when aggregating on the index of a 45023 × 100000 scenarios dataset.
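A rough way to check whether memory layout matters for a row-gathering aggregation on a given machine is sketched below, using NumPy only. It does not reproduce the pandas code path or the timings quoted above. The array size, the `groups` split, and the helper `sum_by_row_groups` are assumptions made for illustration, and which layout comes out faster may vary with the operation and the hardware.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
c_data = rng.random((1000, 1000))    # C order: each row is contiguous in memory
f_data = np.asfortranarray(c_data)   # F order: each column is contiguous in memory

# Hypothetical grouping: four blocks of row positions standing in for index groups.
groups = np.array_split(np.arange(c_data.shape[0]), 4)

def sum_by_row_groups(values):
    """Gather the rows of each group and sum them column by column."""
    return np.stack([values.take(rows, axis=0).sum(axis=0) for rows in groups])

c_result = sum_by_row_groups(c_data)
f_result = sum_by_row_groups(f_data)

# Time the same aggregation on both layouts; only the layout differs.
c_seconds = timeit.timeit(lambda: sum_by_row_groups(c_data), number=5)
f_seconds = timeit.timeit(lambda: sum_by_row_groups(f_data), number=5)
print(f"C order: {c_seconds:.4f} s, F order: {f_seconds:.4f} s")
```

Both calls return the same sums up to floating-point rounding, so any difference in the printed timings comes from the layout alone.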

Copy and Aggregation operations need to preserve the major-order on the underlying data!
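Until the order is preserved by pandas itself, one possible stopgap is to convert the values back to C order explicitly before running layout-sensitive NumPy code. This is a sketch rather than something proposed in the report: `c_ordered_values` is a hypothetical helper, it returns a plain NumPy array rather than a DataFrame, and it costs an extra copy whenever the input is not already C-contiguous.

```python
import numpy as np
import pandas as pd

def c_ordered_values(frame):
    """Return the frame's values as a C-contiguous NumPy array (hypothetical helper)."""
    # np.ascontiguousarray copies only when the input is not already C-contiguous.
    return np.ascontiguousarray(frame.values)

frame = pd.DataFrame(np.arange(9).reshape(3, 3))
values = c_ordered_values(frame.copy())
print("C-contiguous after conversion: ", values.flags.c_contiguous)
```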

Expected Output

## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  True
# Aggregated DataFrame is C-contiguous:  True

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.14.4.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.2.3
pip: 19.0.2
setuptools: 40.8.0
Cython: 0.22
numpy: 1.15.4
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.6.3
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 2.2.2
openpyxl: 2.4.7
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Groupby, Performance (memory or execution speed performance)
