BUG: crosstab with duplicate column or index labels #37997

Merged Nov 28, 2020 (57 commits)
Changes from 54 commits

Commits
a748ccd
BUG: Check for duplicate names in columns and index when calling cros…
cuchoi Jun 7, 2019
c5430ab
Updated test for duplicated names in crosstab
cuchoi Jun 9, 2019
5762bb6
Flake8 compliance
cuchoi Jun 9, 2019
989bd5a
Black formatting
cuchoi Sep 17, 2019
672bd4b
Creates name mapping to overcome duplicated columns issues in pd.cros…
cuchoi Sep 20, 2019
1b67514
Removed debug prints
cuchoi Sep 20, 2019
4f2fa86
Removed empty line
cuchoi Sep 20, 2019
fbff182
Improved comments
cuchoi Sep 20, 2019
6071db5
String manipulation compatible with <3.6
cuchoi Sep 20, 2019
3d9c632
Replaced custom method with _value_counts_arraylike
cuchoi Sep 22, 2019
d27233f
Resort imports
cuchoi Sep 24, 2019
478d8dc
Merge branch 'master' of https://github.com/pandas-dev/pandas into cr…
cuchoi Sep 24, 2019
1828a36
Merged master
cuchoi Nov 18, 2019
3e9333d
Added docstring and type hints to _build_names_mapper
cuchoi Nov 19, 2019
fe6f7fd
Sorted imports
cuchoi Nov 19, 2019
9f4a00a
Added types to unique_names
cuchoi Nov 19, 2019
788ac2b
Updated tests
cuchoi Nov 21, 2019
354a180
Merge branch 'master' into crosstab_dup_names
cuchoi Nov 21, 2019
d938a96
Sorted imports
cuchoi Nov 21, 2019
d60308c
Sorted imports
cuchoi Nov 21, 2019
b759073
Merged master
cuchoi Nov 21, 2019
cf63d4c
Merge remote-tracking branch 'upstream/master' into crosstab_dup_names
cuchoi Jan 2, 2020
94e8175
Merged master
cuchoi Jan 6, 2020
c6ccab1
Merge remote-tracking branch 'upstream/master' into crosstab_dup_names
cuchoi Jan 9, 2020
142ade0
Merged master
cuchoi Jan 26, 2020
710ab63
Merged master
cuchoi Feb 12, 2020
cad2dea
Improved comment
cuchoi Apr 11, 2020
f2741db
Merged master
cuchoi Apr 11, 2020
f0261e9
Updated tests
cuchoi Apr 11, 2020
2733c17
Removed pd. reference in tests
cuchoi Apr 11, 2020
f1b19a6
Removed pd. reference in tests
cuchoi Apr 11, 2020
d475edd
Merged to master
cuchoi May 29, 2020
85e010b
Added Set import from typing
cuchoi May 29, 2020
558493f
Merge remote-tracking branch 'upstream/master' into crosstab_dup_names
cuchoi Jul 13, 2020
3ed74d9
Renamed duplicated row/col names test
cuchoi Jul 13, 2020
e5c6cd9
Added what's new entry
cuchoi Jul 13, 2020
d16b840
Merge remote-tracking branch 'upstream/master' into crosstab_dup_names
simonjayhawkins Jul 17, 2020
32aa475
Moved release note to 1.2
cuchoi Aug 2, 2020
820945e
Merged master
cuchoi Sep 11, 2020
dc1d5c5
Updated reference to value_counts_arraylike
cuchoi Sep 11, 2020
0b002da
Merge branch 'master' into crosstab_dup_names
cuchoi Sep 14, 2020
42d130b
merge with upstream/master
arw2019 Nov 22, 2020
86c6ff1
consolidate test
arw2019 Nov 22, 2020
bbf7757
comments + cleanup
arw2019 Nov 22, 2020
575e7cd
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 22, 2020
00f6ca3
rewrite _build_names_mapper for rownames, colnames args
arw2019 Nov 22, 2020
2a41491
simplify _build_names_mapper
arw2019 Nov 22, 2020
975af3e
rename row_names->rownames, col_names->colnames for consistency
arw2019 Nov 22, 2020
33f76c8
rewrite whatsnew
arw2019 Nov 22, 2020
c9a11dd
rename vars in _build_names_mapper
arw2019 Nov 22, 2020
a849884
typing
arw2019 Nov 22, 2020
0c44679
more typing
arw2019 Nov 22, 2020
11616a0
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 23, 2020
84ca716
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 25, 2020
b59cbc7
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 27, 2020
e42011a
review comment: add docstring
arw2019 Nov 27, 2020
5ca0447
more on docstring
arw2019 Nov 27, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -716,6 +716,7 @@ Groupby/resample/rolling
 Reshaping
 ^^^^^^^^^

+- Bug in :func:`crosstab` returning incorrect results on inputs with duplicate row names, duplicate column names or duplicate names between row and column labels (:issue:`22529`)
 - Bug in :meth:`DataFrame.pivot_table` with ``aggfunc='count'`` or ``aggfunc='sum'`` returning ``NaN`` for missing categories when pivoted on a ``Categorical``. Now returning ``0`` (:issue:`31422`)
 - Bug in :func:`union_indexes` where input index names are not preserved in some cases. Affects :func:`concat` and :class:`DataFrame` constructor (:issue:`13475`)
 - Bug in :func:`crosstab` when using multiple columns with ``margins=True`` and ``normalize=True`` (:issue:`35144`)
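As a rough illustration of the behaviour this release note describes, the sketch below cross-tabulates two Series that share a name. The data and variable names are made up for illustration, and it assumes a pandas version that includes this fix (1.2 or later).

```python
import pandas as pd

# Two inputs with the same name: one supplies the row labels,
# the other the column labels (illustrative data only).
rows = pd.Series(["a", "b", "a"], name="foo")
cols = pd.Series(["x", "x", "y"], name="foo")

table = pd.crosstab(rows, cols)

# With the fix, both axes keep the shared name "foo" and the
# counts reflect the (row, column) pairs in the input.
print(table.index.name, table.columns.name)
print(table)
```

Both axis names should come back as `foo`, with one count each for the pairs (a, x), (a, y) and (b, x).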
65 changes: 52 additions & 13 deletions pandas/core/reshape/pivot.py
@@ -5,6 +5,7 @@
     List,
     Optional,
     Sequence,
+    Set,
     Tuple,
     Union,
     cast,
@@ -578,29 +579,37 @@ def crosstab(
     b 0 1 0
     c 0 0 0
     """
+    if values is None and aggfunc is not None:
+        raise ValueError("aggfunc cannot be used without values.")
+
+    if values is not None and aggfunc is None:
+        raise ValueError("values cannot be used without an aggfunc.")
+
     index = com.maybe_make_list(index)
     columns = com.maybe_make_list(columns)

-    rownames = _get_names(index, rownames, prefix="row")
-    colnames = _get_names(columns, colnames, prefix="col")
-
     common_idx = None
     pass_objs = [x for x in index + columns if isinstance(x, (ABCSeries, ABCDataFrame))]
     if pass_objs:
         common_idx = get_objs_combined_axis(pass_objs, intersect=True, sort=False)

-    data: Dict = {}
-    data.update(zip(rownames, index))
-    data.update(zip(colnames, columns))
-
-    if values is None and aggfunc is not None:
-        raise ValueError("aggfunc cannot be used without values.")
+    rownames = _get_names(index, rownames, prefix="row")
+    colnames = _get_names(columns, colnames, prefix="col")

-    if values is not None and aggfunc is None:
-        raise ValueError("values cannot be used without an aggfunc.")
+    # duplicate names mapped to unique names for pivot op
+    (
+        rownames_mapper,
+        unique_rownames,
+        colnames_mapper,
+        unique_colnames,
+    ) = _build_names_mapper(rownames, colnames)

     from pandas import DataFrame

+    data = {
+        **dict(zip(unique_rownames, index)),
+        **dict(zip(unique_colnames, columns)),
+    }
     df = DataFrame(data, index=common_idx)
     original_df_cols = df.columns

@@ -613,8 +622,8 @@ def crosstab(

     table = df.pivot_table(
         ["__dummy__"],
-        index=rownames,
-        columns=colnames,
+        index=unique_rownames,
+        columns=unique_colnames,
         margins=margins,
         margins_name=margins_name,
         dropna=dropna,
@@ -633,6 +642,9 @@ def crosstab(
             table, normalize=normalize, margins=margins, margins_name=margins_name
         )

+    table = table.rename_axis(index=rownames_mapper, axis=0)
+    table = table.rename_axis(columns=colnames_mapper, axis=1)
+
     return table


@@ -731,3 +743,30 @@ def _get_names(arrs, names, prefix: str = "row"):
             names = list(names)

     return names
+
+
+def _build_names_mapper(
+    rownames: List[str], colnames: List[str]
+) -> Tuple[Dict[str, str], List[str], Dict[str, str], List[str]]:
+    def get_duplicates(names):
Contributor: can you add a doc-string here

Member Author: Done

+        # a name counts as a duplicate once it has already been seen in ``names``
+        seen: Set = set()
+        return {name for name in names if name in seen or seen.add(name)}

+    shared_names = set(rownames).intersection(set(colnames))
+    dup_names = get_duplicates(rownames) | get_duplicates(colnames) | shared_names
+
+    rownames_mapper = {
+        f"row_{i}": name for i, name in enumerate(rownames) if name in dup_names
+    }
+    unique_rownames = [
+        f"row_{i}" if name in dup_names else name for i, name in enumerate(rownames)
+    ]
+
+    colnames_mapper = {
+        f"col_{i}": name for i, name in enumerate(colnames) if name in dup_names
+    }
+    unique_colnames = [
+        f"col_{i}" if name in dup_names else name for i, name in enumerate(colnames)
+    ]
+
+    return rownames_mapper, unique_rownames, colnames_mapper, unique_colnames
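
To make the name-mapping idea easier to follow outside pandas, here is a minimal standalone sketch in plain Python. The function names `get_duplicates` and `build_names_mapper` are hypothetical stand-ins rather than the pandas internals, and the duplicate detection is written as an explicit loop on the assumption that a name is a duplicate when it appears more than once within an axis.

```python
from typing import Dict, List, Set, Tuple


def get_duplicates(names: List[str]) -> Set[str]:
    """Return the names that appear more than once in ``names``."""
    seen: Set[str] = set()
    duplicates: Set[str] = set()
    for name in names:
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return duplicates


def build_names_mapper(
    rownames: List[str], colnames: List[str]
) -> Tuple[Dict[str, str], List[str], Dict[str, str], List[str]]:
    """Hypothetical standalone sketch of the name-mapping helper."""
    # Names used on both axes clash just like names repeated within one axis.
    shared_names = set(rownames) & set(colnames)
    dup_names = get_duplicates(rownames) | get_duplicates(colnames) | shared_names

    # Placeholder -> original name, used to restore the labels afterwards.
    rownames_mapper = {
        f"row_{i}": name for i, name in enumerate(rownames) if name in dup_names
    }
    # Names used while pivoting: placeholders replace only the clashing names.
    unique_rownames = [
        f"row_{i}" if name in dup_names else name for i, name in enumerate(rownames)
    ]
    colnames_mapper = {
        f"col_{i}": name for i, name in enumerate(colnames) if name in dup_names
    }
    unique_colnames = [
        f"col_{i}" if name in dup_names else name for i, name in enumerate(colnames)
    ]
    return rownames_mapper, unique_rownames, colnames_mapper, unique_colnames


row_map, rows, col_map, cols = build_names_mapper(["foo", "baz"], ["foo", "bar"])
print(row_map, rows)  # {'row_0': 'foo'} ['row_0', 'baz']
print(col_map, cols)  # {'col_0': 'foo'} ['col_0', 'bar']
```

Only the clashing name `foo` is swapped for a placeholder; `baz` and `bar` pass through unchanged, and the two mapper dictionaries carry what is needed to rename the axes back after the pivot.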
33 changes: 25 additions & 8 deletions pandas/tests/reshape/test_crosstab.py
@@ -535,15 +535,32 @@ def test_crosstab_with_numpy_size(self):
         )
         tm.assert_frame_equal(result, expected)

-    def test_crosstab_dup_index_names(self):
-        # GH 13279
-        s = Series(range(3), name="foo")
+    def test_crosstab_duplicate_names(self):
+        # GH 13279 / 22529
+
+        s1 = Series(range(3), name="foo")
+        s2_foo = Series(range(1, 4), name="foo")
+        s2_bar = Series(range(1, 4), name="bar")
+        s3 = Series(range(3), name="waldo")
+
+        # check result computed with duplicate labels against
+        # result computed with unique labels, then relabelled
+        mapper = {"bar": "foo"}
+
+        # duplicate row, column labels
+        result = crosstab(s1, s2_foo)
+        expected = crosstab(s1, s2_bar).rename_axis(columns=mapper, axis=1)
+        tm.assert_frame_equal(result, expected)
+
+        # duplicate row, unique column labels
+        result = crosstab([s1, s2_foo], s3)
+        expected = crosstab([s1, s2_bar], s3).rename_axis(index=mapper, axis=0)
+        tm.assert_frame_equal(result, expected)
+
+        # unique row, duplicate column labels
+        result = crosstab(s3, [s1, s2_foo])
+        expected = crosstab(s3, [s1, s2_bar]).rename_axis(columns=mapper, axis=1)

-        result = crosstab(s, s)
-        expected_index = Index(range(3), name="foo")
-        expected = DataFrame(
-            np.eye(3, dtype=np.int64), index=expected_index, columns=expected_index
-        )
         tm.assert_frame_equal(result, expected)

     @pytest.mark.parametrize("names", [["a", ("b", "c")], [("a", "b"), "c"]])
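
The relabelling strategy used by the test can be tried on its own. The sketch below mirrors the "unique row, duplicate column labels" case with the public `pd.crosstab` and `rename_axis` calls; it assumes a pandas version that includes this change (1.2 or later) and is meant as an illustration, not a replacement for the test.

```python
import pandas as pd

s1 = pd.Series(range(3), name="foo")
s2_foo = pd.Series(range(1, 4), name="foo")
s2_bar = pd.Series(range(1, 4), name="bar")
s3 = pd.Series(range(3), name="waldo")

# Result computed directly from inputs whose column names clash.
result = pd.crosstab(s3, [s1, s2_foo])

# Reference: compute with unique names, then relabel "bar" back to "foo".
# A dict passed to rename_axis maps old level names to new ones.
expected = pd.crosstab(s3, [s1, s2_bar]).rename_axis(columns={"bar": "foo"})

print(list(result.columns.names))
pd.testing.assert_frame_equal(result, expected)
```

Comparing against a relabelled unique-name result avoids hand-writing the expected frame, so the check stays valid for any input data.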