BUG: DataFrame.(any|all) inconsistency #34918

jbrockmendel · 2020-06-21T04:37:54Z

No description provided.

jbrockmendel · 2020-06-22T01:45:35Z

pandas/core/internals/managers.py

-        return self._combine([b for b in self.blocks if b.is_bool], copy)
+        # Note: use is_bool_dtype instead of blk.is_bool to exclude
+        #  object-dtype blocks containing all-bool entries.
+        return self._combine([b for b in self.blocks if is_bool_dtype(b.dtype)], copy)


this may actually be a bug, since it implies the following inconsistent behavior in master:

    ser = pd.Series([True, False, True], dtype=object)
    ser2 = pd.Series(["A", "B", "C"])
    df = ser.to_frame("A")

    >>> df._get_bool_data()
           A
    0   True
    1  False
    2   True

    df["B"] = ser2

    >>> df._get_bool_data()
    Empty DataFrame
    Columns: []
    Index: [0, 1, 2]

adding columns shouldn't make get_bool_data smaller.
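To make the mechanism concrete, here is a pure-Python toy model of the behavior described above. It is only a sketch and not pandas code: `consolidate`, `block_is_bool` and `get_bool_columns` are hypothetical names, dtypes are plain string labels, and the assumption is that all object-dtype columns end up in one consolidated block that is judged as a whole.

```python
def consolidate(columns):
    """Group columns into one block per dtype label (toy stand-in for consolidation)."""
    blocks = {}
    for name, (dtype, values) in columns.items():
        blocks.setdefault(dtype, {})[name] = values
    return blocks


def block_is_bool(dtype, block):
    """Old-style rule: bool blocks qualify, and so do object blocks holding only bools."""
    if dtype == "bool":
        return True
    if dtype == "object":
        return all(isinstance(v, bool) for values in block.values() for v in values)
    return False


def get_bool_columns(columns):
    """Return the names of columns that live in a block judged to be boolean."""
    kept = []
    for dtype, block in consolidate(columns).items():
        if block_is_bool(dtype, block):
            kept.extend(block)
    return sorted(kept)


one = {"A": ("object", [True, False, True])}
two = dict(one, B=("object", ["A", "B", "C"]))

print(get_bool_columns(one))  # ['A']
print(get_bool_columns(two))  # []
```

In this model, column "A" disappears from the result once the string column "B" joins the same object block, which is the shrinkage reported above.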

can you add a test for this

jorisvandenbossche · 2020-06-24T16:37:48Z

pandas/core/internals/managers.py

@@ -688,8 +689,9 @@ def get_bool_data(self, copy: bool = False) -> "BlockManager":
        copy : bool, default False
            Whether to copy the blocks
        """
-        self._consolidate_inplace()


I think consolidating here might be the better option, as it at least ensures consistent behaviour when you have multiple columns, independent of consolidation status. The inconsistency between a single column and multiple columns is still present of course, but IMO there is nothing to do about this given our consolidated blocks (well, we could deprecate object dtype being regarded as bool ..)
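A rough illustration of the consolidation-status point, again as a stdlib-only toy rather than the real BlockManager: blocks are modelled as `(dtype_label, {column: values})` pairs, and `is_bool_block`, `consolidate` and `bool_columns` are made-up names. The assumption is that the boolean check is applied per block.

```python
def is_bool_block(dtype, block):
    """Per-block check: bool dtype, or object dtype holding nothing but bools."""
    if dtype == "bool":
        return True
    return dtype == "object" and all(
        isinstance(x, bool) for col in block.values() for x in col
    )


def consolidate(blocks):
    """Merge all blocks that share a dtype label into a single block."""
    merged = {}
    for dtype, block in blocks:
        merged.setdefault(dtype, {}).update(block)
    return list(merged.items())


def bool_columns(blocks):
    """Names of the columns that sit in a block passing the boolean check."""
    return sorted(
        name
        for dtype, block in blocks
        if is_bool_block(dtype, block)
        for name in block
    )


# Same two columns, stored as two separate object blocks.
unconsolidated = [
    ("object", {"A": [True, False, True]}),
    ("object", {"B": ["A", "B", "C"]}),
]

print(bool_columns(unconsolidated))               # ['A']
print(bool_columns(consolidate(unconsolidated)))  # []
```

Under that assumption, the same frame gives different answers depending on whether its blocks were merged first. Consolidating up front removes that dependence, but not the single-column versus multi-column difference.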

well, we could deprecate object dtype being regarded as bool

This PR does that (well, changes it outright rather than deprecating). AFAICT that's the only way to make the behavior independent of the presence of another object-dtype-but-not-bool-like column.

IMO we should deprecate this first

The only non-test place where get_bool_data is called from is in DataFrame._reduce which I know you've been working on recently. The topic of how to handle object-dtype blocks has come up there, too. If we end up handling object-dtype blocks column-wise there, that would render this distinction irrelevant.

IMO we should deprecate this first

what exactly are you suggesting we deprecate? do you have an example of something that breaks on this change?

@jorisvandenbossche can you respond here re what specific behavior you'd like to deprecate?

jorisvandenbossche · 2020-06-24T16:39:48Z

pandas/core/internals/managers.py

@@ -698,7 +700,6 @@ def get_numeric_data(self, copy: bool = False) -> "BlockManager":
        copy : bool, default False
            Whether to copy the blocks
        """
-        self._consolidate_inplace()


Did you check the impact of this?

jorisvandenbossche · 2020-06-25T07:18:03Z

pandas/tests/internals/test_internals.py

+        df["B"] = ser2
+
+        bd2 = df._get_bool_data()
+        tm.assert_frame_equal(bd1, bd2)


Can you also assert the actual expected result that is constructed manually instead of only ensuring both _get_bool_data calls give the same?

jorisvandenbossche · 2020-06-25T07:19:52Z

pandas/core/internals/managers.py

@@ -688,8 +689,9 @@ def get_bool_data(self, copy: bool = False) -> "BlockManager":
        copy : bool, default False
            Whether to copy the blocks
        """
-        self._consolidate_inplace()


IMO we should deprecate this first

jbrockmendel · 2020-08-24T15:24:54Z

AFAICT there are 3 concerns to be balanced here:

1) internal consistency w/r/t sub-frames, i.e. df[subset].all(bool_only=True) should never be larger than df.all(bool_only=True) (bug in master)
2) independence of consolidation status (in the status quo we consolidate, so this is not an issue)
3) backwards compatibility

I see two possible ways to make 1) consistent with 2):

a) drop object-dtype blocks in get_bool_data (i.e. what this PR does ATM). This will produce some cases where df.all(bool_only=True) is smaller than it is in master.
b) operate column-wise on object-dtype blocks. This will produce some cases where df.all(bool_only=True) is larger than it is in master.

I lean towards a) because it is simpler to implement and describe/document. Since it is fixing a bug in a corner case, I don't think a deprecation cycle is needed
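The trade-off between the status quo and options a) and b) can be sketched with a small stdlib-only model. This is an illustration under stated assumptions, not the pandas implementation: `bool_only_columns` and its `policy` argument are hypothetical, dtypes are string labels, and "master" is modelled as one consolidated object block judged as a whole.

```python
def bool_only_columns(columns, policy):
    """columns maps name -> (dtype_label, values); returns the sorted names kept."""
    kept = [name for name, (dtype, _) in columns.items() if dtype == "bool"]
    object_cols = {n: v for n, (d, v) in columns.items() if d == "object"}

    if policy == "master":
        # One consolidated object block, kept only if every entry is a bool.
        if object_cols and all(
            isinstance(x, bool) for v in object_cols.values() for x in v
        ):
            kept.extend(object_cols)
    elif policy == "a":
        pass  # object-dtype columns never count as boolean
    elif policy == "b":
        # Infer column by column instead of block by block.
        kept.extend(
            n for n, v in object_cols.items() if all(isinstance(x, bool) for x in v)
        )
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    return sorted(kept)


frame = {
    "A": ("object", [True, False, True]),
    "B": ("object", ["A", "B", "C"]),
    "C": ("bool", [True, True, False]),
}
subset = {"A": frame["A"]}

for policy in ("master", "a", "b"):
    sub = bool_only_columns(subset, policy)
    full = bool_only_columns(frame, policy)
    print(policy, sub, full, set(sub) <= set(full))
```

In this toy frame, "master" keeps "A" for the sub-frame but not for the full frame, so concern 1) is violated. Option a) returns fewer columns than "master" for the sub-frame, option b) returns more for the full frame, and both keep the sub-frame result inside the full-frame result.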

jorisvandenbossche · 2020-08-25T20:56:28Z

I am also in favor of your option a). Although b) is technically possible (and something we would also get with 1D blocks), I think long term we should simply not regard object dtype columns as boolean or infer if they might be boolean (the same goes for indexing with object dtype bools).

But, I still think we can deprecate this first instead of directly changing. Yes, it's quite a specific case, but it doesn't seem that complicated to deprecate? (the logic is only in _get_bool_data?)
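For discussion, a deprecation path could look roughly like the following. This is a hypothetical stdlib-only sketch, not a patch against `_get_bool_data`: `select_bool_columns` is an invented name, the warning text is a placeholder, and the old behavior is modelled as "keep the object columns only if every object entry is a bool".

```python
import warnings


def select_bool_columns(columns):
    """columns maps name -> (dtype_label, values); returns the sorted names kept."""
    kept = [name for name, (dtype, _) in columns.items() if dtype == "bool"]
    object_cols = {n: v for n, (d, v) in columns.items() if d == "object"}

    # Old behavior: object columns are kept when all of their entries are bools.
    # During the deprecation period the result is unchanged, but a warning is raised.
    if object_cols and all(
        isinstance(x, bool) for v in object_cols.values() for x in v
    ):
        warnings.warn(
            "Treating object-dtype columns as boolean is deprecated; "
            "cast them to bool explicitly.",
            FutureWarning,
            stacklevel=2,
        )
        kept.extend(object_cols)
    return sorted(kept)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = select_bool_columns({"A": ("object", [True, False, True])})

print(result)       # ['A']
print(len(caught))  # 1
```

The warning only fires on the path whose result would change later, so frames with true bool-dtype columns stay silent.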

jbrockmendel · 2020-08-25T22:40:14Z

(the logic is only in _get_bool_data?)

AFAIK the only things affected are DataFrame.all(bool_only=True) and DataFrame.any(bool_only=True)

But, I still think we can deprecate this first instead of directly changing. Yes, it's quite a specific case, but it doesn't seem that complicated to deprecate?

The corner-ness of it makes me not care that much about deprecate vs change. I marginally lean towards getting it over with because there is a bug fix involved and it will make it easier to simplify DataFrame._reduce.

jbrockmendel · 2020-09-24T01:57:50Z

Went through the issues and added Reduction and Nuisance Column labels where appropriate. Reinforced my belief that numeric_only needs to be thrown into the sun.

github-actions · 2020-11-05T00:10:54Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jbrockmendel · 2020-11-05T00:15:51Z

rebased. i maintain this bug merits ripping off the bandaid.

jbrockmendel added 2 commits June 20, 2020 21:36

CLN: avoid unnecessary consolidate_inplace calls

40a55ec

Fix bool-object corner case

ba0c526

jbrockmendel commented Jun 22, 2020

jreback added the Clean and Internals (Related to non-user accessible pandas implementation) labels Jun 23, 2020

jbrockmendel added 2 commits June 23, 2020 15:51

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

e73204b

…f-consolidate-equals

add test

faf8bc8

jorisvandenbossche requested changes Jun 24, 2020

jorisvandenbossche reviewed Jun 24, 2020

jorisvandenbossche requested changes Jun 25, 2020

jbrockmendel added 4 commits June 25, 2020 12:48

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

e9a9d1a

…f-consolidate-equals

more explicit assertions

fa398e0

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

52902a6

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

b5ecafb

…f-consolidate-equals

jbrockmendel mentioned this pull request Jul 17, 2020

REGR: setting column with setitem should not modify existing array inplace #33457

Open

jbrockmendel added 2 commits August 4, 2020 12:05

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

a65cac0

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

da19314

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

887f6d1

…f-consolidate-equals

jbrockmendel mentioned this pull request Sep 5, 2020

BUG: item_cache invalidation in get_numeric_data #35882

Merged

5 tasks

jbrockmendel added 3 commits September 4, 2020 20:40

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

1d72508

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

4d6749b

…f-consolidate-equals

whatsnew

103d446

jbrockmendel changed the title ~~CLN: avoid unnecessary consolidate_inplace calls~~ BUG: DataFrame.(any|all) inconsistency Sep 8, 2020

jbrockmendel added 4 commits September 8, 2020 09:02

revert whitespace change

5adff4e

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

bf3d940

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

fd286dc

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

a772fb7

…f-consolidate-equals

jbrockmendel added 5 commits September 21, 2020 09:57

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

03bb7a9

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

2c4c011

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

3f2cf67

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

214b362

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

3804b0d

…f-consolidate-equals

simonjayhawkins mentioned this pull request Sep 23, 2020

REGR: Series.__mod__ behaves different with numexpr #36552

Merged

5 tasks

jbrockmendel added 5 commits September 23, 2020 20:20

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

3f141b5

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

b39e533

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

5f8086e

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

eb8c75d

…f-consolidate-equals

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

2dc84b9

…f-consolidate-equals

github-actions bot added the Stale label Nov 5, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

1a27257

…f-consolidate-equals

jbrockmendel added 2 commits November 5, 2020 13:05

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

aca1f84

…f-consolidate-equals

lint fixup

7bfadbc

jbrockmendel closed this Nov 14, 2020

jbrockmendel deleted the ref-consolidate-equals branch November 14, 2020 02:59
