BUG: Fix Series constructor for Categorical with index #19714

Merged
jreback merged 9 commits into pandas-dev:master from cbertinato:issue-19342 on Feb 27, 2018

5 participants
@cbertinato
Contributor

commented Feb 15, 2018

Fixes Series constructor so that ValueError is raised when a Categorical and index of incorrect length are given. Closes issue #19342

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
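
The check described above can be pictured with a minimal pure-Python sketch. `check_index_length` is a hypothetical helper, not the pandas implementation; the message format is borrowed from the traceback quoted later in this thread.

```python
def check_index_length(data, index):
    """Raise ValueError when data and index differ in length.

    Hypothetical stand-in for the check added to the Series constructor.
    """
    if index is not None and len(index) != len(data):
        raise ValueError(
            'Length of passed values is {val}, index implies {ind}'
            .format(val=len(data), ind=len(index)))


# Three values against a four-element index: the lengths disagree.
try:
    check_index_length(['a', 'b', 'c'], range(4))
    message = None
except ValueError as err:
    message = str(err)
```

With matching lengths the helper returns without raising, which is the behaviour the constructor keeps for well-formed input.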
@TomAugspurger
Contributor

left a comment

Looks good! Thanks.

@@ -690,6 +690,7 @@ Categorical
- Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`)
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of incorrect length is given (:issue:`19342`)

@TomAugspurger

TomAugspurger Feb 15, 2018

Contributor

Maybe say "index of different length". It could be the categorical that's incorrect :)

@cbertinato

cbertinato Feb 15, 2018

Author Contributor

Very good point. Will do.

@codecov

commented Feb 15, 2018

Codecov Report

Merging #19714 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #19714      +/-   ##
==========================================
+ Coverage   91.66%   91.66%   +<.01%     
==========================================
  Files         150      150              
  Lines       48969    48975       +6     
==========================================
+ Hits        44886    44892       +6     
  Misses       4083     4083

Flag         Coverage Δ
#multiple    90.04% <100%> (ø) ⬆️
#single      41.85% <50%> (ø) ⬆️

Impacted Files           Coverage Δ
pandas/core/series.py    94.44% <100%> (+0.03%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e4c50a...f5db9ab. Read the comment docs.
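
The "<.01%" figure in the report can be reproduced from the Hits and Lines rows. The snippet below is only a worked check of that arithmetic; `coverage` is a hypothetical helper name.

```python
def coverage(hits, lines):
    # Percentage of lines that were executed by the test suite.
    return 100.0 * hits / lines


before = coverage(44886, 48969)   # master
after = coverage(44892, 48975)    # this PR: 6 new lines, all 6 hit
delta = after - before
```

Both values round to 91.66, and the difference is positive but smaller than 0.01 percentage points, which matches the "+<.01%" shown in the diff.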

@cbertinato cbertinato force-pushed the cbertinato:issue-19342 branch from 1434b63 to c6b2016 Feb 15, 2018

        map(lambda x: x, range(3))])
    def test_constructor_index_mismatch(self, input):
        # GH 19342
        pytest.raises(ValueError, Series, input, index=np.arange(4))

@gfyoung

gfyoung Feb 15, 2018

Member

Let's also check the error message.

@cbertinato

cbertinato Feb 16, 2018

Author Contributor

Will do!
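
Checking the message as well as the exception type could look like the sketch below. It uses the standard library's `unittest` rather than the pytest helpers in the pandas suite, `build` is a hypothetical stand-in for the constructor, and the message text is assumed from the traceback quoted later in this thread.

```python
import unittest


def build(values, index):
    # Hypothetical constructor stand-in that rejects mismatched lengths.
    if len(values) != len(index):
        raise ValueError(
            'Length of passed values is {val}, index implies {ind}'
            .format(val=len(values), ind=len(index)))
    return dict(zip(index, values))


class TestIndexMismatch(unittest.TestCase):
    def test_message(self):
        # Assert on the message, not only on the exception type.
        msg = 'Length of passed values is 3, index implies 4'
        with self.assertRaisesRegex(ValueError, msg):
            build([1, 2, 3], range(4))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIndexMismatch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Matching on the message guards against the test passing because some unrelated `ValueError` was raised elsewhere in the constructor.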

@cbertinato cbertinato force-pushed the cbertinato:issue-19342 branch 2 times, most recently from 2c351ea to 3bc499d Feb 16, 2018

@pep8speaks

commented Feb 16, 2018

Hello @cbertinato! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 26, 2018 at 14:53 Hours UTC

@cbertinato cbertinato force-pushed the cbertinato:issue-19342 branch from 3bc499d to 28b70b8 Feb 16, 2018

        # raises an error
        idx = np.arange(4)
        if compat.PY2:
            typs = types.GeneratorType

@jreback

jreback Feb 18, 2018

Contributor

you don't need all of this; it's confusing to do so much just to construct an error message. Just do a simpler check on the error.

@@ -210,6 +210,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
                raise ValueError("cannot specify a dtype with a "
                                 "Categorical unless "
                                 "dtype='category'")
            if index is not None and len(index) != len(data):

@jreback

jreback Feb 18, 2018

Contributor

this should go a little further down after

if index is None:
    ....
else:
    # put the check here

@jreback

jreback Feb 18, 2018

Contributor

though maybe this should go in _sanitize_array around L3242

@cbertinato

cbertinato Feb 18, 2018

Author Contributor

I don't see an advantage to moving it into _sanitize_array versus putting it in the if at L230. But I could be missing something. What do you think?

@jreback

jreback Feb 18, 2018

Contributor

it’s prob ok here but a bit lower
early failure is good

@cbertinato

cbertinato Feb 19, 2018

Author Contributor

Ok. Proposed a placement in this next push. Placing it in the

if index is None:
    ...

at ~L226 as an else breaks some tests if data is a scalar or SingleBlockManager. It's difficult to catch all cases if we put it in _sanitize_array because of the returns in the if cases. So the next best place appears to be after the call to _sanitize_array. Not ideal with regard to early failure, but really the only place that I can see to put a single check just to be able to do len(data). Not much different from the current location except that it catches cases other than Categorical.
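
The scalar problem mentioned here can be illustrated without pandas. In this sketch `is_list_like` is a simplified stand-in for the pandas helper of the same name, and both check functions are hypothetical.

```python
def is_list_like(obj):
    # Simplified stand-in for the pandas helper of the same name.
    return isinstance(obj, (list, tuple, range))


def naive_check(data, index):
    # Runs before any normalisation, so it assumes data has a length.
    if index is not None and len(index) != len(data):
        raise ValueError('length mismatch')


def guarded_check(data, index):
    # Compares lengths only for list-likes; scalars pass through.
    if index is not None and is_list_like(data):
        if len(index) != len(data):
            raise ValueError('length mismatch')


try:
    naive_check(5, range(4))       # scalar meant to be broadcast
    naive_outcome = 'ok'
except TypeError:
    naive_outcome = 'TypeError'    # int has no len()

guarded_check(5, range(4))         # no exception: the scalar is skipped
```

The unguarded version fails with `TypeError` before it can compare anything, which is why a bare `len(data)` check cannot sit ahead of the code that normalises scalars.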

@@ -690,6 +690,7 @@ Categorical
- Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`)
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`)

@jreback

jreback Feb 18, 2018

Contributor

put this in reshaping

@cbertinato cbertinato force-pushed the cbertinato:issue-19342 branch from 28b70b8 to abe385d Feb 19, 2018

@TomAugspurger
Contributor

left a comment

Could you merge in master and fix the merge conflict? There are also a couple of linting errors.

@@ -844,6 +844,7 @@ Reshaping
- Improved error message for :func:`DataFrame.merge` when there is no common merge key (:issue:`19427`)
- Bug in :func:`DataFrame.join` which does an *outer* instead of a *left* join when being called with multiple DataFrames and some have non-unique indices (:issue:`19624`)
- :func:`Series.rename` now accepts ``axis`` as a kwarg (:issue:`18589`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`)

@TomAugspurger

TomAugspurger Feb 20, 2018

Contributor

Could you clarify error -> ValueError

@@ -5,6 +5,7 @@

from datetime import datetime, timedelta
from collections import OrderedDict
import types

@TomAugspurger

TomAugspurger Feb 20, 2018

Contributor

These imports aren't needed now.

@cbertinato cbertinato force-pushed the cbertinato:issue-19342 branch from abe385d to 11522eb Feb 20, 2018

@jreback jreback added this to the 0.23.0 milestone Feb 21, 2018

@@ -238,6 +239,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
                data = _sanitize_array(data, index, dtype, copy,
                                       raise_cast_failure=True)

            if index is not None and len(index) != len(data):

@jreback

jreback Feb 22, 2018

Contributor

this should go a touch higher,

            if index is None:
                if not is_list_like(data):
                    data = [data]
                index = com._default_index(len(data))
            else:
                # add here

            # create/copy the manager
            if isinstance(data, SingleBlockManager):
                if dtype is not None:
                    data = data.astype(dtype=dtype, errors='ignore',
                                       copy=copy)

@cbertinato

cbertinato Feb 22, 2018

Author Contributor

Ok. Added a scalar check that lets scalars through, so we are assuming that the Series is shaped correctly when the scalar is broadcast to fit the index, which is probably ok.

if index is None:
    ...
else:
    if not isscalar(data) and len(index) != len(data):
        ...

@cbertinato

cbertinato Feb 22, 2018

Author Contributor

Never mind. A few other inputs break as well; np.array and np.dtype appear to be just two of them. Unless we add specific checks for these, I think we may need to move it lower, below _sanitize_array.
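
One way such inputs trip a plain length comparison: a 0-d NumPy array such as `np.array(100)` raises `TypeError` from `len()`. The sketch below imitates that with a hypothetical `ZeroDimArray` class so it runs without NumPy, and it assumes `TypeError` is the exception a guard would need to catch.

```python
class ZeroDimArray:
    """Hypothetical stand-in for a 0-d array such as np.array(100)."""

    def __init__(self, value):
        self.value = value

    def __len__(self):
        # Mirrors the failure a 0-d NumPy array produces.
        raise TypeError('len() of unsized object')


def lengths_disagree(data, index):
    # Treat objects whose len() fails as scalars and let them through.
    try:
        return len(data) != len(index)
    except TypeError:
        return False


unsized_result = lengths_disagree(ZeroDimArray(100), range(4))
sized_result = lengths_disagree([1, 2], range(4))
```

Here the unsized object is allowed through to be broadcast, while an ordinary list of the wrong length is still flagged.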

@cbertinato cbertinato force-pushed the cbertinato:issue-19342 branch from 11522eb to 98f1f16 Feb 22, 2018

@@ -226,6 +227,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
                if not is_list_like(data):
                    data = [data]
                index = com._default_index(len(data))
            else:

@jreback

jreback Feb 23, 2018

Contributor

can make this an elif here

@cbertinato cbertinato force-pushed the cbertinato:issue-19342 branch from 98f1f16 to b6df1c8 Feb 23, 2018

@jreback

Contributor

commented Feb 24, 2018

i rebased. ping on green.

                # a scalar numpy array is list-like but doesn't
                # have a proper length
                try:
                    if len(data) > 1 and len(index) != len(data):

@jreback

jreback Feb 25, 2018

Contributor

hmm, did this change? 0-len should be ok, can you add a test

@cbertinato

cbertinato Feb 25, 2018

Author Contributor

0-len gets caught deeper, in the SingleBlockManager, but len 1 gets caught here. It should be let through to be broadcast in _sanitize_array. I'll add a test.

@jreback

jreback Feb 25, 2018

Contributor

hmm, would be ok with catching both cases here, or are they different?

@cbertinato

cbertinato Feb 25, 2018

Author Contributor

Yeah. I think catching the 0-len case here would be good for consistency. We don't want to catch the len 1 case because it will get broadcast, so it will look something like:

                # a scalar numpy array is list-like but doesn't
                # have a proper length
                try:
                    if len(data) != 1 and len(index) != len(data):

Unless the intention is not to broadcast a list-like of length 1. One could argue that it would be better to raise an error instead of broadcasting. If one wanted to broadcast a scalar, then just pass a scalar.
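
The difference between the conditions under discussion is easier to see when they are evaluated side by side. The three predicates below are hypothetical reductions of the checks to plain integers (data length and index length); they are not the pandas code.

```python
def raises_gt1(n_data, n_index):
    # First version: only lists longer than one element are checked.
    return n_data > 1 and n_index != n_data


def raises_ne1(n_data, n_index):
    # Proposed version: everything except single-element lists is checked.
    return n_data != 1 and n_index != n_data


def raises_strict(n_data, n_index):
    # No exemption: any length mismatch raises.
    return n_index != n_data


n_index = 2
table = {
    n: (raises_gt1(n, n_index), raises_ne1(n, n_index), raises_strict(n, n_index))
    for n in (0, 1, 2, 3)
}
```

Against an index of length 2, the variants only disagree for data of length 0 and 1: `> 1` lets both through, `!= 1` catches length 0 but still lets length 1 broadcast, and the strict form rejects both.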

@jreback

jreback Feb 25, 2018

Contributor

you would have to show the test which fails for this, len(data) == 0 is valid

@cbertinato

cbertinato Feb 26, 2018

Author Contributor

The test test_apply_subset in tests/io/formats/test_style.py raises an error. The traceback indicates the input to the Series constructor is:

data = ['color: baz'], index = RangeIndex(start=0, stop=2, step=1), dtype = None

This should be valid. Checking that len(data) != 1 lets this case pass.

@@ -418,8 +418,8 @@ def test_constructor_numpy_scalar(self):
# GH 19342
# construction with a numpy scalar
# should not raise
result = Series(np.array(100), index=np.arange(4))
expected = Series(100, index=np.arange(4))
result = Series(np.array(100), index=np.arange(4), dtype='int64')

@jreback

jreback Feb 25, 2018

Contributor

ahh, ok thanks

                # a scalar numpy array is list-like but doesn't
                # have a proper length
                try:
                    if len(data) != 1 and len(index) != len(data):

@jreback

jreback Feb 26, 2018

Contributor

still not convinced about this, what fails for len(data)

@cbertinato

cbertinato Feb 26, 2018

Author Contributor

If I remove len(data) != 1 and run python -m pytest pandas/tests/io/formats/test_style.py I get:

                try:
                    if len(index) != len(data):
                        raise ValueError(
                            'Length of passed values is {val}, '
                            'index implies {ind}'
>                           .format(val=len(data), ind=len(index)))
E                           ValueError: ('Length of passed values is 1, index implies 2', 'occurred at index A')

pandas/core/series.py:246: ValueError

I should add a test for this case in test_constructors.

@jreback

jreback Feb 26, 2018

Contributor

ok add a test in test_constructors. which is this failing on in test_style?

@cbertinato

cbertinato Feb 26, 2018

Author Contributor

test_apply_subset. Input to the Series constructor is:

data = ['color: baz'],
index = RangeIndex(start=0, stop=2, step=1), 
dtype = None, 
name = 'A', 
copy = False, 
fastpath = False

@jreback

jreback Feb 26, 2018

Contributor

So this should raise! If the data is a scalar this is ok, but we can't broadcast a list like that (well we can, but we shouldn't).

In [2]: data = ['color: baz']
   ...: index = pd.RangeIndex(start=0, stop=2, step=1)
   ...: 

In [3]: pd.Series(data, index)
Out[3]: 
0    color: baz
1    color: baz
dtype: object
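
The session above shows the behaviour before the change. The rule being argued for (broadcast scalars, never broadcast lists) might be sketched as follows; `make_values` is a hypothetical pure-Python helper, and treating everything that is not a list or tuple as a scalar is a simplifying assumption.

```python
def make_values(data, index):
    """Hypothetical sketch: broadcast scalars, never broadcast lists."""
    index = list(index)
    if isinstance(data, (list, tuple)):
        if len(data) != len(index):
            raise ValueError(
                'Length of passed values is {val}, index implies {ind}'
                .format(val=len(data), ind=len(index)))
        return list(data)
    # Anything that is not a list or tuple is treated as a scalar here.
    return [data] * len(index)


broadcast = make_values('color: baz', range(2))   # scalar: repeated

try:
    make_values(['color: baz'], range(2))         # one-element list
    rejected = False
except ValueError:
    rejected = True
```

Under this rule a caller who wants broadcasting passes the bare scalar, and a one-element list with a longer index is reported as a length mismatch.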

cbertinato and others added some commits Feb 22, 2018

BUG: Fix Series constructor for Categorical with index
Fixes Series constructor so that ValueError is raised when a Categorical and index of different length are given.

@cbertinato cbertinato force-pushed the cbertinato:issue-19342 branch from 4540878 to e756c7e Feb 26, 2018

@cbertinato

Contributor Author

commented Feb 26, 2018

I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style?

@cbertinato

Contributor Author

commented Feb 26, 2018

I wasn’t sure whether anything else relied on this behavior.

@jreback

Contributor

commented Feb 26, 2018

> I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style?

yes, and add this as an additional test in test_constructor.

Disallow broadcasting of single-element lists
Modified test setup in io/formats/test_style.py accordingly

@jreback jreback merged commit e51800b into pandas-dev:master Feb 27, 2018

2 of 3 checks passed

continuous-integration/travis-ci/pr: The Travis CI build failed
ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
@jreback

Contributor

commented Feb 27, 2018

thanks @cbertinato sometimes the seemingly small changes are hard!

@cbertinato

Contributor Author

commented Feb 27, 2018

Thanks for the help and advice!
