Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should DataFrame.merge match NaN with NaN? #22491

Closed
alexlenail opened this issue Aug 24, 2018 · 7 comments
Closed

Should DataFrame.merge match NaN with NaN? #22491

alexlenail opened this issue Aug 24, 2018 · 7 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@alexlenail
Copy link
Contributor

Code Sample, a copy-pastable example if possible

pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]}).merge(pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]}), how='left', left_on='b', right_on='d')

Problem description

df1:

  a b
0 1 4.0
1 2 5.0
2 3 NaN

df2:

  c d
0 6 4.0
1 7 NaN
2 8 NaN
3 9 5.0

Current output:

  a b c d
0 1 4.0 6 4.0
1 2 5.0 9 5.0
2 3 NaN 7 NaN
3 3 NaN 8 NaN

Expected Output

  a b c d
0 1 4.0 6 4.0
1 2 5.0 9 5.0

What's happening is the NaN is df1.b is matching the NaNs in df2.d.

I don't see a situation in which this would be desirable behavior, but if such a situation exists, surely the opposite is also conceivable, and so there should be some documented option in DataFrame.merge which accomplishes this.

What do you think?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@alexlenail
Copy link
Contributor Author

I was able to solve my problem with

df1.merge(df2.dropna(subset=['d']), how='left', left_on='b', right_on='d')

Nevertheless, I still feel like this shouldn't be the default behavior for DataFrame.merge.

What do you think?

@WillAyd
Copy link
Member

WillAyd commented Aug 24, 2018

Hmm yea I don't think the NA values should be producing a match here - @TomAugspurger any thoughts?

@WillAyd WillAyd added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 24, 2018
@jorisvandenbossche
Copy link
Member

Agreed, I also would not expect NAs to match here.
It might be good to explore a bit if we have always been doing that, and if we do this consistently within pandas (in which case we should certainly do some kind of deprecation if we want to change this)

@jtkiley
Copy link
Contributor

jtkiley commented Nov 27, 2018

I ran into this today with a dataset. In my case, I wanted a merge with an outer join, but I saw the same NaN-matching behavior.

My workaround was merging data frames like this (adapted to match the example above):

data = pd.merge(df1[df1['b'].notnull()],
                df2[df2['d'].notnull()], how='outer',
                left_on='b', right_on='d')
data = pd.concat([data, df1[df1['b'].isnull()],
                  df2[df2['d'].isnull()]],
                 ignore_index=True, sort=False)

I didn't know that merge would match NaNs, and I expected to get the output from merge with how='outer' that I got with this code. Fortunately, it was easy to spot (tons of repeated data), but it took a bit to understand and chase down.

Anyway, +1 on not matching NaNs, and maybe this snippet will be helpful if someone happens by who needs to do the same thing I was doing.

@dsaxton
Copy link
Member

dsaxton commented Jan 1, 2019

I kind of feel like pd.merge should behave like a SQL join, with None and np.nan being interpreted as null, and with any equality comparison involving a null value itself evaluating to null (and hence no match)

@ericness
Copy link

I agree that this behavior is unexpected. Since np.NaN == np.NaN evaluates to False these rows shouldn't be included in the merged DataFrame.

@mroeschke
Copy link
Member

Closing as duplicate of #32306 with a more recent discussion on the future policy we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

7 participants