Should DataFrame.merge match NaN with NaN? #22491

alexlenail · 2018-08-24T00:22:25Z

Code Sample, a copy-pastable example if possible

pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]}).merge(pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]}), how='left', left_on='b', right_on='d')

Problem description

df1:

	a	b
0	1	4.0
1	2	5.0
2	3	NaN

df2:

	c	d
0	6	4.0
1	7	NaN
2	8	NaN
3	9	5.0

Current output:

	a	b	c	d
0	1	4.0	6	4.0
1	2	5.0	9	5.0
2	3	NaN	7	NaN
3	3	NaN	8	NaN

Expected Output

	a	b	c	d
0	1	4.0	6	4.0
1	2	5.0	9	5.0

What's happening is the NaN is df1.b is matching the NaNs in df2.d.

I don't see a situation in which this would be desirable behavior, but if such a situation exists, surely the opposite is also conceivable, and so there should be some documented option in DataFrame.merge which accomplishes this.

What do you think?

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

alexlenail · 2018-08-24T01:10:08Z

I was able to solve my problem with

df1.merge(df2.dropna(subset=['d']), how='left', left_on='b', right_on='d')

Nevertheless, I still feel like this shouldn't be the default behavior for DataFrame.merge.

What do you think?

WillAyd · 2018-08-24T15:17:06Z

Hmm yea I don't think the NA values should be producing a match here - @TomAugspurger any thoughts?

jorisvandenbossche · 2018-08-24T15:31:19Z

Agreed, I also would not expect NAs to match here.
It might be good to explore a bit if we have always been doing that, and if we do this consistently within pandas (in which case we should certainly do some kind of deprecation if we want to change this)

jtkiley · 2018-11-27T23:31:31Z

I ran into this today with a dataset. In my case, I wanted a merge with an outer join, but I saw the same NaN-matching behavior.

My workaround was merging data frames like this (adapted to match the example above):

data = pd.merge(df1[df1['b'].notnull()],
                df2[df2['d'].notnull()], how='outer',
                left_on='b', right_on='d')
data = pd.concat([data, df1[df1['b'].isnull()],
                  df2[df2['d'].isnull()]],
                 ignore_index=True, sort=False)

I didn't know that merge would match NaNs, and I expected to get the output from merge with how='outer' that I got with this code. Fortunately, it was easy to spot (tons of repeated data), but it took a bit to understand and chase down.

Anyway, +1 on not matching NaNs, and maybe this snippet will be helpful if someone happens by who needs to do the same thing I was doing.

dsaxton · 2019-01-01T02:27:55Z

I kind of feel like pd.merge should behave like a SQL join, with None and np.nan being interpreted as null, and with any equality comparison involving a null value itself evaluating to null (and hence no match)

ericness · 2019-06-19T18:55:20Z

I agree that this behavior is unexpected. Since np.NaN == np.NaN evaluates to False these rows shouldn't be included in the merged DataFrame.

mroeschke · 2021-07-21T04:38:15Z

Closing as duplicate of #32306 with a more recent discussion on the future policy we want.

WillAyd added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 24, 2018

WillAyd mentioned this issue Sep 6, 2018

Merge matches np.nan and None while merging #22618

Closed

TomAugspurger mentioned this issue Dec 18, 2018

BUG: merging an Integer EA rasises #23262

Merged

3 tasks

b060149ee mentioned this issue Sep 24, 2019

Merge even on NaN #28522

Closed

jorisvandenbossche mentioned this issue Mar 3, 2020

Joining on a nullable column #32306

Open

mroeschke added the Bug label Jun 22, 2021

mroeschke closed this as completed Jul 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should DataFrame.merge match NaN with NaN? #22491

Should DataFrame.merge match NaN with NaN? #22491

alexlenail commented Aug 24, 2018

INSTALLED VERSIONS

alexlenail commented Aug 24, 2018

WillAyd commented Aug 24, 2018

jorisvandenbossche commented Aug 24, 2018

jtkiley commented Nov 27, 2018

dsaxton commented Jan 1, 2019

ericness commented Jun 19, 2019

mroeschke commented Jul 21, 2021

Should DataFrame.merge match NaN with NaN? #22491

Should DataFrame.merge match NaN with NaN? #22491

Comments

alexlenail commented Aug 24, 2018

Code Sample, a copy-pastable example if possible

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

alexlenail commented Aug 24, 2018

WillAyd commented Aug 24, 2018

jorisvandenbossche commented Aug 24, 2018

jtkiley commented Nov 27, 2018

dsaxton commented Jan 1, 2019

ericness commented Jun 19, 2019

mroeschke commented Jul 21, 2021

Output of `pd.show_versions()`