.add gives incorrect result with MI Dataframe with mix of object and datetimes on index. #26558

andymcarter · 2019-05-29T09:39:08Z

Code Sample, a copy-pastable example if possible

import datetime

import pandas as pd

# Inner index level holds a date-like *string*.
df1 = pd.DataFrame(
    data=[10],
    index=pd.MultiIndex.from_tuples([('Z', '2019-05-31')]),
    columns=['A'],
)

# Inner index level holds an actual datetime.
df2 = pd.DataFrame(
    data=[20],
    index=pd.MultiIndex.from_tuples([('Z', datetime.datetime(2019, 5, 31))]),
    columns=['A'],
)

df1.add(df2, fill_value=0)

Problem description

I have two MultiIndexed DataFrames. The inner index level holds a datetime-like string in the first and an actual datetime in the second.

Calling .add returns a DataFrame with two index rows as expected (one for the string date, the other for the datetime), but the values in the output do not match the two input frames: both rows show 20.0, so the 10 from df1 is lost.

df1.add(df2, fill_value=0)
                 A
Z 2019-05-31  20.0
  2019-05-31  20.0

Expected Output

df1.add(df2, fill_value=0)
                 A
Z 2019-05-31  10.0
  2019-05-31  20.0

OR

df1.add(df2, fill_value=0)
               A
Z 2019-05-31  30
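A possible workaround for anyone who wants the second expected output (a single row summing to 30) is to normalize the string labels to datetimes before calling .add, so both frames carry identical indexes and no mixed-type join is needed. This is only a sketch, not something proposed in the thread; `normalized` and `df1_fixed` are illustrative names, and the `%Y-%m-%d` format is assumed from the sample data.

```python
import datetime

import pandas as pd

df1 = pd.DataFrame(
    data=[10],
    index=pd.MultiIndex.from_tuples([("Z", "2019-05-31")]),
    columns=["A"],
)
df2 = pd.DataFrame(
    data=[20],
    index=pd.MultiIndex.from_tuples([("Z", datetime.datetime(2019, 5, 31))]),
    columns=["A"],
)

# Parse the string labels into datetime objects so that df1's index is
# built the same way as df2's (assumes the labels use YYYY-MM-DD).
normalized = pd.MultiIndex.from_tuples(
    [(key, datetime.datetime.strptime(day, "%Y-%m-%d")) for key, day in df1.index]
)

df1_fixed = df1.copy()
df1_fixed.index = normalized

# Both frames now share the same labels, so the values are summed in one row.
result = df1_fixed.add(df2, fill_value=0)
print(result)
```

This sidesteps the bug rather than fixing it: it only helps when treating the string and the datetime as the same label is actually what you want.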

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: None
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: None
sqlalchemy: 1.3.3
pymysql: 0.9.3
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


TomAugspurger · 2019-05-29T13:36:37Z

Looks like a bug. We seem to convert the string to a datetime.

The bug will be further down the stack than .add itself; the conversion already shows up in .align:

In [36]: a, b = df1.align(df2)

In [37]: a.index.levels[1]
Out[37]: DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None)

In [38]: b.index.levels[1]
Out[38]: DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None)
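For comparison, the inner levels can be checked before any alignment happens. The sketch below (with the illustrative names `mi_str` and `mi_dt`) rebuilds the two indexes from the report and shows that they start out with different label types, so the identical DatetimeIndex levels above are a product of the alignment step.

```python
import datetime

import pandas as pd

mi_str = pd.MultiIndex.from_tuples([("Z", "2019-05-31")])
mi_dt = pd.MultiIndex.from_tuples([("Z", datetime.datetime(2019, 5, 31))])

# Pull the inner-level label out of the first (and only) tuple of each index.
str_label = mi_str[0][1]
dt_label = mi_dt[0][1]
print(type(str_label).__name__, type(dt_label).__name__)  # str Timestamp

# Only the second index has a datetime-typed inner level.
is_dt = pd.api.types.is_datetime64_any_dtype
print(is_dt(mi_str.levels[1]), is_dt(mi_dt.levels[1]))  # False True
```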

@andymcarter are you interested in debugging this?

andymcarter · 2019-05-29T14:06:02Z

Will give it a go now.

andymcarter · 2019-05-29T14:49:47Z

It seems to be connected to this line:

pandas/pandas/core/arrays/categorical.py

Line 2671 in a91da0c

cat = Categorical(values, ordered=False)

where the tuple of the string and the date gets returned as a Categorical with a single datetime category.

And here is where the reindexing of left and right takes place with the incorrectly unioned index:

pandas/pandas/core/generic.py

Line 8601 in a91da0c

right = other._reindex_with_indexers({0: [join_index, iridx],

which explains the outcome of the sum.

nrebena · 2020-03-15T20:12:38Z

OK, going further on this, the culprit is the way we join MultiIndexes.
Here is a minimal example of the bug:

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

mi1 = pd.MultiIndex.from_tuples([('2019-05-31',)])
mi2 = pd.MultiIndex.from_tuples([(datetime.datetime(2019, 5, 31),)])

print(i1.join(i2, return_indexers=True))
print(mi1.join(mi2, return_indexers=True))
   
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)
# (MultiIndex([('2019-05-31',)], ), None, array([-1]))

Here we can see that the plain Index join and the MultiIndex join disagree, and that one of the indexers returned for the MultiIndex is wrong.
I will submit a PR shortly 🤞
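For readers unfamiliar with the tuples printed above: with return_indexers=True, join returns the joined index plus one positional indexer per input, and -1 in an indexer marks a joined label that has no match in that input (None means the input already lines up with the result). A small sketch on plain string indexes, unrelated to the bug itself, illustrates the convention; `left` and `right` are illustrative names.

```python
import pandas as pd

left = pd.Index(["a", "b"])
right = pd.Index(["b", "c"])

# Outer join: every label from either side appears in the result.
joined, lidx, ridx = left.join(right, how="outer", return_indexers=True)

print(list(joined))  # ['a', 'b', 'c']
print(list(lidx))    # [0, 1, -1]  -> 'c' is missing from left
print(list(ridx))    # [-1, 0, 1]  -> 'a' is missing from right
```

Read this way, the array([-1]) in the MultiIndex output says that mi2 has no entry matching the joined label, whereas the plain Index join treated the string and the datetime as the same label.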

TomAugspurger added the Dtype Conversions (unexpected or buggy dtype conversions) and Timeseries labels May 29, 2019

TomAugspurger added this to the Contributions Welcome milestone May 29, 2019

TomAugspurger added Effort Medium labels May 29, 2019

jbrockmendel removed Effort Medium labels Oct 21, 2019

nrebena added a commit to nrebena/pandas that referenced this issue Mar 15, 2020

TST: Add test for pandas-dev#26558

a50bfe9

nrebena added a commit to nrebena/pandas that referenced this issue Mar 15, 2020

BUG: pandas-dev#26558 Fix join on MultiIndex with datetime and string

311ad1c

nrebena mentioned this issue Mar 15, 2020

BUG: Fix join on MultiIndex for mixed Datetimelike and string levels #32739

Closed

5 tasks

mroeschke added the Bug and Numeric Operations (arithmetic, comparison, and logical operations) labels Apr 2, 2020

mroeschke added the MultiIndex label Jul 10, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
