Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joining MultiIndexes with NaNs treats NaN as a match-any (unlike regular index joins) #25138

Closed
thorbjornwolf opened this issue Feb 4, 2019 · 3 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@thorbjornwolf
Copy link

thorbjornwolf commented Feb 4, 2019

Thank you so much for your work on pandas 🤗

I don't sufficiently understand pd.concat and MultiIndex (yet!) to explain what is going on here, so I hope someone more knowledgeable can shed light on these curious symptoms:

Code Sample

import pandas as pd
from numpy import nan

# Build some Series, where one has NaN in the index
left = pd.Series(["left-a", "left-nan", "left-c"], index=["a", nan, "c"], name="left")
# a        left-a
# NaN    left-nan
# c        left-c
# Name: left, dtype: object
right = pd.Series(["right-a", "right-b", "right-d"], index=["a", "b", "d"], name="right")
# a    right-a
# b    right-b
# d    right-d
# Name: right, dtype: object

# Concat'ing with the regular Index
single = pd.concat([left, right], axis=1, join="outer", sort=False).sort_index()
#          left    right
# a      left-a  right-a
# b         NaN  right-b
# c      left-c      NaN
# d         NaN  right-d
# NaN  left-nan      NaN


# `left_multi` and `right_multi` look just like `left` and `right`,
# and only differ in that they have a 1-level MultiIndex
left_multi = left.set_axis(pd.MultiIndex.from_arrays([left.index]), inplace=False)
right_multi = right.set_axis(pd.MultiIndex.from_arrays([right.index]), inplace=False)

# Concat'ing with the 1-level MultiIndex
multi = pd.concat([left_multi, right_multi], axis=1, join="outer", sort=False).sort_index()
#          left    right
# a      left-a  right-a
# b    left-nan  right-b    <--- surprise left-nan!
# c      left-c      NaN
# d    left-nan  right-d    <--- surprise left-nan!
# NaN  left-nan      NaN

The same results can be achieved with DataFrames in (at least) these other ways, so in that way it is very consistent 😀

left_frame = left.to_frame()
left_multi_frame = left_multi.to_frame()
right_frame = right.to_frame()
right_multi_frame = right_multi.to_frame()

# DataFrame.join
s_join = left_frame.join(right_frame, how="outer").sort_index()
assert single.equals(s_join)
m_join = left_multi_frame.join(right_multi_frame, how="outer").sort_index()
assert multi.equals(m_join)

# DataFrame.merge
s_merge = left_frame.merge(right_frame, left_index=True, right_index=True, how="outer").sort_index()
assert single.equals(s_merge)
m_merge = left_multi_frame.merge(right_multi_frame, left_index=True, right_index=True, how="outer").sort_index()
assert multi.equals(m_merge)

# pd.concat on DataFrames
s_concat = pd.concat([left_frame, right_frame], axis=1, join="outer", sort=False).sort_index()
assert single.equals(s_concat)
m_concat = pd.concat([left_multi_frame, right_multi_frame], axis=1, join="outer", sort=False).sort_index()
assert multi.equals(m_concat)

Problem description

I am surprised to see NaNs being filled out for left when the index is a MultiIndex. This is because I would expect the same underlying logic for joining on MultiIndex as on a regular Index: Either NaN counts as a match-all value (MultiIndex behaviour), or NaN just matches the other's NaN like a regular value (regular Index behaviour).

I failed to find docs describing either concat'ing of MultiIndex with NaNs or regular Index with NaNs, so I am unsure if this is intended behaviour.

Expected Output

I expect the two approaches to have the same output, be it NaN as a match-all or NaN as a regular value.

Output of pd.show_versions()

Tested in two pandas versions.

pandas 0.24.1:

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-13-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

pandas 0.23.4:
INSTALLED VERSIONS ------------------ commit: None python: 3.7.1.final.0 python-bits: 64 OS: Linux OS-release: 4.18.0-13-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode MultiIndex labels Feb 7, 2019
@gfyoung
Copy link
Member

gfyoung commented Feb 7, 2019

cc @toobaz

@toobaz
Copy link
Member

toobaz commented Feb 7, 2019

The problem is in reindex when working with MultiIndex:

In [3]: index = pd.MultiIndex(levels=[['a', 'b', 'c', 'd']],
   ...:            codes=[[0, -1, 2, 1, 3]],
   ...:            sortorder=0)
   ...:            

In [4]: val = pd.Series(['left-a', 'left-nan', 'left-c'], index=pd.MultiIndex.from_arrays([['a', np.nan, 'c']]), name='left')

In [5]: val.reindex(index)
Out[5]: 
a        left-a
NaN    left-nan
c        left-c
b      left-nan
d      left-nan
Name: left, dtype: object

... somewhere inside _reindex_axes, in pandas/core/generic.py.

@mroeschke
Copy link
Member

Closing as duplicate of #32306 with a more recent discussion on the future policy we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

4 participants