.add gives incorrect result with MI Dataframe with mix of object and datetimes on index. #26558

Open
andymcarter opened this issue May 29, 2019 · 4 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions MultiIndex Numeric Operations Arithmetic, Comparison, and Logical operations Timeseries

Comments

@andymcarter

Code Sample, a copy-pastable example if possible

import pandas as pd
import datetime

# Inner index level is a datetime-like string
df1 = pd.DataFrame(
    data=[10],
    index=pd.MultiIndex.from_tuples([('Z', '2019-05-31')]),
    columns=['A'],
)

# Inner index level is an actual datetime
df2 = pd.DataFrame(
    data=[20],
    index=pd.MultiIndex.from_tuples([('Z', datetime.datetime(2019, 5, 31))]),
    columns=['A'],
)

df1.add(df2, fill_value=0)

Problem description

Two MultiIndexed DataFrames whose inner index level holds the same date: as a datetime-like string in the first, and as an actual datetime in the second.

Calling .add returns a DataFrame with two index rows, as expected (one for the string date and one for the datetime), but the values in the output do not match the two input frames: both rows show 20.0.

df1.add(df2, fill_value=0)
                 A
Z 2019-05-31  20.0
  2019-05-31  20.0

Expected Output

df1.add(df2, fill_value=0)
                 A
Z 2019-05-31  10.0
  2019-05-31  20.0

OR

df1.add(df2, fill_value=0)
               A
Z 2019-05-31  30
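
Until the underlying behaviour is fixed, one way to reach the second expected output is to give both frames the same inner-level type before calling .add. The sketch below is a workaround, not the fix discussed in this issue. It assumes a two-level MultiIndex and strings in YYYY-MM-DD format; `normalize_inner_level` is a hypothetical helper name, not a pandas function.

```python
import datetime

import pandas as pd


def normalize_inner_level(df, fmt="%Y-%m-%d"):
    """Return a copy of df whose inner index level holds datetime objects.

    Hypothetical helper: assumes a two-level MultiIndex whose inner level
    contains either datetimes or strings in the given format.
    """
    tuples = [
        (
            outer,
            datetime.datetime.strptime(inner, fmt) if isinstance(inner, str) else inner,
        )
        for outer, inner in df.index
    ]
    out = df.copy()
    out.index = pd.MultiIndex.from_tuples(tuples, names=df.index.names)
    return out


df1 = pd.DataFrame(
    data=[10],
    index=pd.MultiIndex.from_tuples([("Z", "2019-05-31")]),
    columns=["A"],
)
df2 = pd.DataFrame(
    data=[20],
    index=pd.MultiIndex.from_tuples([("Z", datetime.datetime(2019, 5, 31))]),
    columns=["A"],
)

# Both frames now carry datetimes on the inner level, so the rows line up
# and the values are summed instead of being kept on separate rows.
result = normalize_inner_level(df1).add(df2, fill_value=0)
print(result)
```

With both indexes built the same way, no string-versus-datetime comparison is left for the alignment step to get wrong.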

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: None
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: None
sqlalchemy: 1.3.3
pymysql: 0.9.3
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

TomAugspurger commented May 29, 2019

Looks like a bug. We seem to convert the string to a datetime.

The bug is probably further down the stack than .add itself; aligning the two frames already shows it:

In [36]: a, b = df1.align(df2)

In [37]: a.index.levels[1]
Out[37]: DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None)

In [38]: b.index.levels[1]
Out[38]: DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None)
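
For comparison, the level dtypes can also be checked on the original frames, before any alignment happens. A minimal sketch, assuming the two frames from the report; `level_is_datetime` is a hypothetical helper name that only wraps `MultiIndex.levels` and `pandas.api.types.is_datetime64_any_dtype`.

```python
import datetime

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def level_is_datetime(df, level=1):
    """Hypothetical helper: True if the given index level has a datetime dtype."""
    return is_datetime64_any_dtype(df.index.levels[level].dtype)


df1 = pd.DataFrame(
    data=[10],
    index=pd.MultiIndex.from_tuples([("Z", "2019-05-31")]),
    columns=["A"],
)
df2 = pd.DataFrame(
    data=[20],
    index=pd.MultiIndex.from_tuples([("Z", datetime.datetime(2019, 5, 31))]),
    columns=["A"],
)

# Before alignment the string level is not a datetime level; only df2's is.
before = (level_is_datetime(df1), level_is_datetime(df2))
print(before)
```

If the two flags differ before alignment but the aligned levels are both datetime, the conversion happened somewhere inside the alignment.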

@andymcarter are you interested in debugging this?

@TomAugspurger TomAugspurger added Dtype Conversions Unexpected or buggy dtype conversions Timeseries labels May 29, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone May 29, 2019
@andymcarter
Author

Will give it a go now.

@andymcarter
Author

andymcarter commented May 29, 2019

It seems to be connected to this line:

cat = Categorical(values, ordered=False)

where the values (the string and the datetime) come back as a Categorical with a single datetime category.

And here is where the reindexing of left and right takes place with the incorrectly unioned index:

right = other._reindex_with_indexers({0: [join_index, iridx],

which explains the outcome of the sum.

@nrebena
Contributor

nrebena commented Mar 15, 2020

OK, digging further into this, the culprit is the way we join MultiIndexes.
Here is a minimal example of the bug:

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

mi1 = pd.MultiIndex.from_tuples([('2019-05-31',)])
mi2 = pd.MultiIndex.from_tuples([(datetime.datetime(2019, 5, 31),)])

print(i1.join(i2, return_indexers=True))
print(mi1.join(mi2, return_indexers=True))
   
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)
# (MultiIndex([('2019-05-31',)], ), None, array([-1]))

Here we can see that one of the indexers returned by the MultiIndex join is wrong.
I will submit a PR shortly 🤞
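
For readers unfamiliar with `return_indexers=True`, here is a small sketch of how the indexers are usually read, on plain string indexes where the join is uncontroversial. As I understand it, each indexer maps positions in the joined index back to positions in the corresponding input, -1 marks a label that side does not have, and None means that side needs no reordering. `take_with_missing` is a hypothetical helper written for this illustration.

```python
import numpy as np
import pandas as pd

left = pd.Index(["a", "b"])
right = pd.Index(["b", "c"])

joined, lidx, ridx = left.join(right, how="outer", return_indexers=True)


def take_with_missing(index, indexer, size):
    """Hypothetical helper: apply a join indexer, using None for missing labels."""
    if indexer is None:
        # No indexer returned: positions are taken as unchanged.
        indexer = np.arange(size)
    return [index[i] if i != -1 else None for i in indexer]


left_view = take_with_missing(left, lidx, len(joined))
right_view = take_with_missing(right, ridx, len(joined))

# Each view either matches the joined label or is None where that side
# has no such label.
for label, l, r in zip(joined, left_view, right_view):
    print(label, l, r)
```

Read the same way, the array([-1]) in the MultiIndex result above would mean the single row has no counterpart in mi2, which is not consistent with the plain Index join shown next to it.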

nrebena added a commit to nrebena/pandas that referenced this issue Mar 15, 2020
nrebena added a commit to nrebena/pandas that referenced this issue Mar 15, 2020
@mroeschke mroeschke added Bug Numeric Operations Arithmetic, Comparison, and Logical operations labels Apr 2, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022