Description
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
print("pandas version: ", pd.__version__)
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Numpy array is C-contiguous: ", array.flags.c_contiguous)
dataframe = pd.DataFrame(array, index = pd.MultiIndex.from_tuples([('A', 'U'), ('A', 'V'), ('B', 'W')], names=['dim_one', 'dim_two']))
print("DataFrame is C-contiguous: ", dataframe.values.flags.c_contiguous)
dataframe_copy = dataframe.copy()
print("Copy of DataFrame is C-contiguous: ", dataframe_copy.values.flags.c_contiguous)
aggregated_dataframe = dataframe.groupby('dim_one').sum()
print("Aggregated DataFrame is C-contiguous: ", aggregated_dataframe.values.flags.c_contiguous)
# Output in Jupyter Notebook
# pandas version: 0.23.4
# Numpy array is C-contiguous: True
# DataFrame is C-contiguous: True
# Copy of DataFrame is C-contiguous: False
# Aggregated DataFrame is C-contiguous: False
Problem description
pandas.DataFrame.copy returns a copy whose underlying numpy array has its major order changed from C to F when the original was in C order.
pandas.DataFrame.groupby.sum likewise returns a DataFrame whose underlying numpy array is in F order when the original DataFrame's values were in C order.
This impacts aggregation performance substantially! I observed the runtime increase from 17 seconds to 5 min 46 seconds when aggregating on the index of a 45023 times 100000 scenarios dataset.
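For context, here is a small NumPy-only sketch of what the two major orders mean in memory. It is an illustration, not a reproduction of the timings above: the names c_arr and f_arr are made up for this example, and the dtype is fixed to int64 so the byte counts are predictable.

```python
import numpy as np

# Same values stored in both memory orders (int64 = 8 bytes per element).
c_arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.int64)
f_arr = np.asfortranarray(c_arr)

# strides = number of bytes to step to reach the next row / the next column.
print("C order strides:", c_arr.strides)  # (24, 8): elements of a row are adjacent
print("F order strides:", f_arr.strides)  # (8, 24): elements of a column are adjacent

# The logical content is identical; only the memory layout differs.
print("Same values:", np.array_equal(c_arr, f_arr))
```

Code that walks along rows touches adjacent memory in C order but jumps a whole column length per element in F order, which is presumably why a silent change of order can hurt on wide data.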
Copy and aggregation operations need to preserve the major order of the underlying data!
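Until the order is preserved, one possible workaround is to force the values back into C order before handing them to order-sensitive code. The sketch below assumes that an explicit conversion is acceptable; c_ordered_values is a hypothetical helper name, not a pandas function, and it only affects the array it returns, not how pandas stores the data internally.

```python
import numpy as np
import pandas as pd


def c_ordered_values(df):
    """Hypothetical helper: return df's values as a C-contiguous ndarray.

    np.ascontiguousarray copies only when the input is not already
    C-contiguous, so an array that is already in C order is passed through.
    """
    return np.ascontiguousarray(df.values)


array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
index = pd.MultiIndex.from_tuples(
    [("A", "U"), ("A", "V"), ("B", "W")], names=["dim_one", "dim_two"]
)
dataframe = pd.DataFrame(array, index=index)

# Aggregate as in the report, then convert the result explicitly.
aggregated_dataframe = dataframe.groupby("dim_one").sum()
aggregated_values = c_ordered_values(aggregated_dataframe)
copied_values = c_ordered_values(dataframe.copy())

print("Aggregated values are C-contiguous: ", aggregated_values.flags.c_contiguous)
print("Copied values are C-contiguous: ", copied_values.flags.c_contiguous)
```

The conversion may cost an extra copy of the data, so on a dataset of the size mentioned above it is a stopgap rather than a fix.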
Expected Output
# Output in Jupyter Notebook
pandas version: 0.23.4
Numpy array is C-contiguous: True
DataFrame is C-contiguous: True
Copy of DataFrame is C-contiguous: True
Aggregated DataFrame is C-contiguous: True
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.14.4.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.2.3
pip: 19.0.2
setuptools: 40.8.0
Cython: 0.22
numpy: 1.15.4
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.6.3
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 2.2.2
openpyxl: 2.4.7
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None