Filtering by exclusion of duplicate rows does not preserve column list for an empty dataframe #25184

anatoly-scherbakov · 2019-02-06T10:38:36Z

Code Sample, a copy-pastable example if possible

import pandas as pd

x_df = pd.DataFrame(columns=['a', 'b'])
series = x_df.duplicated(subset=['a'])

list(x_df[~series])

# Expected output on Pandas 0.23.4: ['a', 'b']
# But, Pandas 0.24.1 returns: []

Problem description

We have been using this approach to remove duplicate rows from a dataframe, where rows are compared by one column only. Everything worked perfectly until we upgraded Pandas to the latest version and found that, if the original dataframe is empty, the column list is lost in the resulting dataframe.
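For reference, here is a minimal sketch of the approach described above next to `DataFrame.drop_duplicates`, which should select the same rows when given the same `subset`. The sample data is made up for illustration. The claim that `drop_duplicates` keeps the columns of an empty frame is an assumption that should be checked on the affected pandas version.

```python
import pandas as pd

# Illustrative data only (not from the original report).
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "z"]})

# The approach from this report: keep rows that are not flagged
# as duplicates when compared on column 'a' only.
by_mask = df[~df.duplicated(subset=["a"])]

# drop_duplicates with the same subset keeps the first row for each 'a' value.
by_drop = df.drop_duplicates(subset=["a"])

# Both selections should contain the same rows.
same_rows = by_mask.equals(by_drop)

# Assumption to verify on your version: on an empty frame,
# drop_duplicates returns a copy that still has both columns.
empty_cols = list(pd.DataFrame(columns=["a", "b"]).drop_duplicates(subset=["a"]))
```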

Expected Output

We would expect the column list to be preserved.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-44-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 18.0
setuptools: 40.0.0
Cython: 0.22
numpy: 1.12.1
scipy: None
pyarrow: None
xarray: None
IPython: 4.2.0
sphinx: 1.4.4
patsy: None
dateutil: 2.5.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.14
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 3.7.3
bs4: 4.5.3
html5lib: 1.


gnilrets · 2019-02-06T16:43:38Z

Just ran into this myself today.

Looks like the root issue may be that `duplicated` now returns an empty Series of dtype=float64, whereas it used to return an empty Series of dtype=bool.

I've never contributed to pandas before, but I think fixing this would be as simple as changing this line to return `Series(dtype=bool)`: https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L4638
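Until a fix lands, a possible user-side workaround is to cast the mask to bool before inverting it, so the result no longer depends on the dtype that `duplicated` returns for an empty frame. This is only a sketch based on the dtype explanation above; it is not the change proposed for `frame.py`.

```python
import pandas as pd

x_df = pd.DataFrame(columns=["a", "b"])

# Cast the mask to bool explicitly, whatever dtype duplicated() produced
# for the empty frame (float64 is what this comment suspects on 0.24.1).
mask = x_df.duplicated(subset=["a"]).astype(bool)

# Inverting an empty boolean mask and indexing with it selects no rows
# and should leave the column labels in place.
result = x_df[~mask]
cols = list(result)
```

With a boolean mask, `cols` should be `['a', 'b']`, which matches the behaviour reported for 0.23.4.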

gfyoung · 2019-02-07T06:17:42Z

@gnilrets : You are more than welcome to investigate this! For information about contributing, see here:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html

cc @jreback

jreback · 2019-02-07T12:14:26Z

yep looks like a regression

gfyoung added the Index, Indexing, and Regression labels and removed the Index label Feb 7, 2019

gfyoung added this to the 0.24.2 milestone Feb 7, 2019

gnilrets mentioned this issue Feb 9, 2019

BUG: Duplicated returns boolean dataframe #25234 (Merged)

jreback closed this as completed in #25234 Feb 11, 2019
