Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG-24212 fix when other_index has incompatible dtype #25009

Merged
merged 30 commits into from May 5, 2019

Conversation

@JustinZhengBC
Copy link
Contributor

commented Jan 29, 2019

  • closes #25001
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Followup to #24916, addresses the case when the other index has an incompatible dtype, so we cannot take directly from it. Currently, this PR naively replaces the missing index values with the number of the rows in the other index that caused them replaces the missing index values with the appropriate NA value.

Still working on adding cases when it is possible to combine indices of sparse/categorical dtypes without densifying.

@JustinZhengBC JustinZhengBC changed the title BUG-24212 fix when other_index has incompatible dtype [WIP] BUG-24212 fix when other_index has incompatible dtype Jan 29, 2019

@codecov

This comment has been minimized.

Copy link

commented Jan 29, 2019

Codecov Report

Merging #25009 into master will decrease coverage by 49.49%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25009      +/-   ##
==========================================
- Coverage   92.38%   42.88%   -49.5%     
==========================================
  Files         166      166              
  Lines       52401    52407       +6     
==========================================
- Hits        48409    22475   -25934     
- Misses       3992    29932   +25940
Flag Coverage Δ
#multiple ?
#single 42.88% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 9.43% <0%> (-85.05%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update abf0824...0e6de81. Read the comment docs.

@codecov

This comment has been minimized.

Copy link

commented Jan 29, 2019

Codecov Report

Merging #25009 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25009      +/-   ##
==========================================
- Coverage   91.97%   91.96%   -0.01%     
==========================================
  Files         175      175              
  Lines       52368    52365       -3     
==========================================
- Hits        48164    48157       -7     
- Misses       4204     4208       +4
Flag Coverage Δ
#multiple 90.52% <100%> (-0.01%) ⬇️
#single 40.7% <0%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.45% <100%> (-0.03%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9feb3ad...88cdf8b. Read the comment docs.

join_list[mask] = other_list[mask]
join_index = Index(join_list, dtype=other_index.dtype,
name=other_index.name)
except ValueError:

This comment has been minimized.

Copy link
@jreback

jreback Jan 30, 2019

Contributor

we really don't want to do a try/except here. What is falling into the except?

This comment has been minimized.

Copy link
@JustinZhengBC

JustinZhengBC Jan 30, 2019

Author Contributor

When 'other_index' has a different dtype that causes an exception to be raised when values from it are inserted into the current index. I did not compare dtypes because in some cases differing dtypes are possible (example: int can be added to a Float64Index)

This comment has been minimized.

Copy link
@jreback

jreback Jan 30, 2019

Contributor

can you:

  • always just do this (what you have in the except)
  • or use is_dtype_equal to test?

This comment has been minimized.

Copy link
@JustinZhengBC

JustinZhengBC Jan 30, 2019

Author Contributor

I don't think is_dtype_equal would work because some combinations of different dtypes are still usable.

I agree with the first option because joining on the index of the other frame kind of makes the row order of the other frame arbitrary anyway.

@JustinZhengBC JustinZhengBC changed the title [WIP] BUG-24212 fix when other_index has incompatible dtype BUG-24212 fix when other_index has incompatible dtype Jan 31, 2019

@jreback
Copy link
Contributor

left a comment

can you merge master and let's have a look at this

name=join_index.name)
return join_index
# if values missing (-1) from target index, replace missing
# values by their column position or NA if not applicable

This comment has been minimized.

Copy link
@jreback

jreback Mar 26, 2019

Contributor

we don't want to dispatch on the index type here at all, other than calling a method on the index. This is just ripe for errors. Need to make this much more generic.

@@ -940,11 +941,56 @@ def test_merge_two_empty_df_no_division_error(self):
merge(a, a, on=('a', 'b'))

@pytest.mark.parametrize('how', ['right', 'outer'])
def test_merge_on_index_with_more_values(self, how):
@pytest.mark.parametrize('index,expected_index',

This comment has been minimized.

Copy link
@jreback

jreback Mar 26, 2019

Contributor

can you format this a bit better. start like

@pytest.mark.parametrize(
   'index,expected_index',
    [(......),
.....
@JustinZhengBC

This comment has been minimized.

Copy link
Contributor Author

commented Mar 28, 2019

@jreback I've modified the logic so the same behavior is used for all dtypes: missing indices are filled with an appropriate NA value

'index,expected_index',
[(CategoricalIndex([1, 2, 4]),
CategoricalIndex([1, 2, 4, None, None, None])),
(DatetimeIndex(['2001-01-01',

This comment has been minimized.

Copy link
@jreback

jreback Mar 28, 2019

Contributor

its ok to put multiple values on a line el.g. in the DTI and other construction to make this a bit shorter

(TimedeltaIndex(['1d',
'2d',
'3d']),
TimedeltaIndex(['1d',

This comment has been minimized.

Copy link
@jreback

jreback Mar 28, 2019

Contributor

e.g. here

Show resolved Hide resolved pandas/tests/reshape/merge/test_merge.py

@jreback jreback added this to the 0.25.0 milestone Mar 29, 2019

if is_integer_dtype(index.dtype):
fill_value = np.nan
else:
fill_value = na_value_for_dtype(index.dtype)

This comment has been minimized.

Copy link
@jreback

jreback Mar 29, 2019

Contributor

use this for all, passing compat=False

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 20, 2019

can you merge master

@JustinZhengBC

This comment has been minimized.

Copy link
Contributor Author

commented Apr 21, 2019

@jreback done

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 21, 2019

@JustinZhengBC so we are calling this an internal impl change, this has no outward effects? IOW user code will be unchanged? or does this have any cases that now work that didn't?

@JustinZhengBC

This comment has been minimized.

Copy link
Contributor Author

commented Apr 22, 2019

@jreback #24916 had a whatsnew note like "bug in merge when merging by index name would sometimes result in an incorrectly numbered index," which is the same problem this addresses. There is an outwards change in that merge now fills in missing index values with NA values, whereas previously it would try to infer index values based on the other index. I have modified the whatsnew to clarify the new behaviour

@jreback

This comment has been minimized.

Copy link
Contributor

commented on doc/source/whatsnew/v0.25.0.rst in 6772618 Apr 22, 2019

add this PR number as well

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2019

lgtm. can you merge master; ping on green.

@JustinZhengBC

This comment has been minimized.

Copy link
Contributor Author

commented May 5, 2019

@jreback done

@jreback jreback merged commit cc3b2f0 into pandas-dev:master May 5, 2019

11 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20190429.6 succeeded
Details
pandas-dev.pandas (Checks_and_doc) Checks_and_doc succeeded
Details
pandas-dev.pandas (Linux py35_compat) Linux py35_compat succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow) Linux py36_locale_slow succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow_old_np) Linux py36_locale_slow_old_np succeeded
Details
pandas-dev.pandas (Linux py37_locale) Linux py37_locale succeeded
Details
pandas-dev.pandas (Linux py37_np_dev) Linux py37_np_dev succeeded
Details
pandas-dev.pandas (Windows py36_np15) Windows py36_np15 succeeded
Details
pandas-dev.pandas (Windows py37_np141) Windows py37_np141 succeeded
Details
pandas-dev.pandas (macOS py35_macos) macOS py35_macos succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented May 5, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.