Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Allow non-tuples in pandas.merge #34810

Merged
merged 8 commits into from
Jun 30, 2020
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -658,7 +658,6 @@ Other API changes
- :func: `pandas.api.dtypes.is_string_dtype` no longer incorrectly identifies categorical series as string.
- :func:`read_excel` no longer takes ``**kwds`` arguments. This means that passing in keyword ``chunksize`` now raises a ``TypeError``
(previously raised a ``NotImplementedError``), while passing in keyword ``encoding`` now raises a ``TypeError`` (:issue:`34464`)
- :func: `merge` now checks ``suffixes`` parameter type to be ``tuple`` and raises ``TypeError``, whereas before a ``list`` or ``set`` were accepted and that the ``set`` could produce unexpected results (:issue:`33740`)
- :class:`Period` no longer accepts tuples for the ``freq`` argument (:issue:`34658`)
- :meth:`Series.interpolate` and :meth:`DataFrame.interpolate` now raises ValueError if ``limit_direction`` is 'forward' or 'both' and ``method`` is 'backfill' or 'bfill' or ``limit_direction`` is 'backward' or 'both' and ``method`` is 'pad' or 'ffill' (:issue:`34746`)

Expand Down Expand Up @@ -749,6 +748,7 @@ Deprecations
- :meth:`DataFrame.to_dict` has deprecated accepting short names for ``orient`` in future versions (:issue:`32515`)
- :meth:`Categorical.to_dense` is deprecated and will be removed in a future version, use ``np.asarray(cat)`` instead (:issue:`32639`)
- The ``fastpath`` keyword in the ``SingleBlockManager`` constructor is deprecated and will be removed in a future version (:issue:`33092`)
- Providing ``suffixes`` as a ``set`` in :func:`pandas.merge` is deprecated. Provide a tuple instead (:issue:`33740`, :issue:`34741`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are going to do this (I guess @jorisvandenbossche really wants this), then let's update the doc-string of pd.merge to be correct (needs list-like) and the docs in reshape.rst as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pd.merge docs already state that we want a sequence.

    suffixes : Sequence, default is ("_x", "_y")
        A length-2 sequence where

Updated the DataFrame.merge docs to be consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm we don't use sequence elsewhere do we, don't we use list-like everywhere?

- :meth:`Index.is_mixed` is deprecated and will be removed in a future version, check ``index.inferred_type`` directly instead (:issue:`32922`)

- Passing any arguments but the first one to :func:`read_html` as
Expand Down
11 changes: 7 additions & 4 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -226,10 +226,13 @@
sort : bool, default False
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).
suffixes : tuple of (str, str), default ('_x', '_y')
Suffix to apply to overlapping column names in the left and right
side, respectively. To raise an exception on overlapping columns use
(False, False).
suffixes : Sequence, default is ("_x", "_y")
A length-2 sequence where each element is optionally a string
indicating the suffix to add to overlapping column names in
`left` and `right` respectively. Pass a value of `None` instead
of a string to indicate that the column name from `left` or
`right` should be left as-is, with no suffix. At least one of the
values must not be None.
copy : bool, default True
If False, avoid copy if possible.
indicator : bool or str, default False
Expand Down
10 changes: 7 additions & 3 deletions pandas/core/reshape/merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -2072,9 +2072,13 @@ def _items_overlap_with_suffix(left: Index, right: Index, suffixes: Tuple[str, s
If corresponding suffix is empty, the entry is simply converted to string.

"""
if not isinstance(suffixes, tuple):
raise TypeError(
f"suffixes should be tuple of (str, str). But got {type(suffixes).__name__}"
if isinstance(suffixes, set):
warnings.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just raise? also what about a dict or non-list-like for example?

do we check is_list_like anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need a warning according to our backwards compatibility poilcy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's easy to do it with a warning, so I would also just do that.

@TomAugspurger if we don't want to have it set-specific, could also do if not is_list_like(suffixes, allow_sets=False), although in practice that might give the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that should use what @jorisvandenbossche suggests here , e.g. dict would pass this test but is not acceptable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dicts are ordered though, so they shouldn't be a problem. I thought the original issue was non-deterministic outputs because sets aren't ordered?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually sets are not ordered, but I think we need an actual sequence here (sure .keys() satisfies), but i'd rather be explicit. we do not treat dicts as list-like anywhere else.

"Passing 'suffixes' as a set, which is unordered, may result in "
"unexpected results. Provide 'suffixes' as a tuple instead. In the "
"future a 'TypeError' will be raised.",
FutureWarning,
stacklevel=4,
)

to_rename = left.intersection(right)
Expand Down
17 changes: 6 additions & 11 deletions pandas/tests/reshape/merge/test_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -1999,6 +1999,7 @@ def test_merge_series(on, left_on, right_on, left_index, right_index, nm):
(0, 0, dict(suffixes=("", "_dup")), ["0", "0_dup"]),
(0, 0, dict(suffixes=(None, "_dup")), [0, "0_dup"]),
(0, 0, dict(suffixes=("_x", "_y")), ["0_x", "0_y"]),
(0, 0, dict(suffixes=["_x", "_y"]), ["0_x", "0_y"]),
("a", 0, dict(suffixes=(None, "_y")), ["a", 0]),
(0.0, 0.0, dict(suffixes=("_x", None)), ["0.0_x", 0.0]),
("b", "b", dict(suffixes=(None, "_y")), ["b", "b_y"]),
Expand Down Expand Up @@ -2069,18 +2070,12 @@ def test_merge_suffix_error(col1, col2, suffixes):
pd.merge(a, b, left_index=True, right_index=True, suffixes=suffixes)


@pytest.mark.parametrize(
"col1, col2, suffixes", [("a", "a", {"a", "b"}), ("a", "a", None), (0, 0, None)],
)
def test_merge_suffix_type_error(col1, col2, suffixes):
a = pd.DataFrame({col1: [1, 2, 3]})
b = pd.DataFrame({col2: [3, 4, 5]})
def test_merge_suffix_set():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this non_list_like parmaterize on dict as well

a = pd.DataFrame({"a": [1, 2, 3]})
b = pd.DataFrame({"b": [3, 4, 5]})

msg = (
f"suffixes should be tuple of \\(str, str\\). But got {type(suffixes).__name__}"
)
with pytest.raises(TypeError, match=msg):
pd.merge(a, b, left_index=True, right_index=True, suffixes=suffixes)
with tm.assert_produces_warning(FutureWarning):
pd.merge(a, b, left_index=True, right_index=True, suffixes={"left", "right"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not actually passing the parameterized suffixes here, so I think that the dict case is not tested, and will also not raise a warning (as dicts are considered list-like)



@pytest.mark.parametrize(
Expand Down