Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for unequal comparisons of categorical and scalar #9848

Closed
wants to merge 1 commit into from

Conversation

jankatins
Copy link
Contributor

Fixes: #9836

@jreback
Copy link
Contributor

jreback commented Apr 10, 2015

I'll paste this here, just ran into this on CategoricalIndex (this is with current impl)

@jorisvandenbossche @shoyer

In [2]: df = DataFrame({'A' : np.arange(6,dtype='int64'),
                              'B' : Series([1,1,2,1,3,2]).astype('category',categories=[3,2,1],ordered=True) }).set_index('B')

In [3]: df
Out[3]: 
   A
B   
1  0
1  1
2  2
1  3
3  4
2  5

In [4]: df.index
Out[4]: 
CategoricalIndex([1, 1, 2, 1, 3, 2],
                 categories=[3, 2, 1],
                 ordered=True,
                 name=u'B')

In [5]: df[df.index==1]
Out[5]: 
   A
B   
1  0
1  1
1  3

In [6]: df[df.index>1]
Out[6]: 
Empty DataFrame
Columns: [A]
Index: []

In [7]: df[df.index<1]
Out[7]: 
   A
B   
2  2
3  4
2  5

@jankatins jankatins force-pushed the fix_categorical_comp branch 2 times, most recently from 7817721 to 86e0376 Compare April 10, 2015 19:56
@jankatins jankatins changed the title [DO NOT MERGE] FIx for unequal comparisons of categorical and skalar Fix for unequal comparisons of categorical and scalar Apr 10, 2015
values = self.get_values()
other = _index.convert_scalar(values,_values_from_object(other))
if com.is_categorical_dtype(self):
# cats are a special case as get_values() would return an ndarray, which would then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit hacky to me. Ideally, shouldn't blocks contain all the logic for arithmetic operations? That way, the special casing for different dtypes would just be a different methods defined on block subclasses, not a mess of dtype checks on the series itself.

This is not your technical debt, though, so we can save that cleanup for later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in reaility some of the core/ops.py should be in internals. @shoyer let's make an issue about this, but @JanSchulz change is ok for now.

@JanSchulz can you add a release note. I'll merge this before CategoricalIndex

@jorisvandenbossche
Copy link
Member

Another design question: what is the goal of get_values?
Why does it return a plain array while .values returns a categorical array?

@jreback jreback added Bug Categorical Categorical Data Type labels Apr 11, 2015
@jreback jreback added this to the 0.16.1 milestone Apr 11, 2015
@jreback
Copy link
Contributor

jreback commented Apr 11, 2015

so here's the difference between .values and .get_values()

.values returns the underlying data for the Series directly, e.g. a numpy array for basic types, but a Categorical or a SparseArray for cat/sparse types.

.get_values() is a dense version. for basic types its the same, but for sparse its to_dense() and the same for categoricals.

@jankatins
Copy link
Contributor Author

@jreback Then SingleBlockManager.get_values() is wrong: it does

    def get_values(self):
        """ return a dense type view """
        return np.array(self._block.to_dense(),copy=False)

@jreback
Copy link
Contributor

jreback commented Apr 11, 2015

@JanSchulz not sure what you mean, that's exactly what it should do. In the case of a basic type (e.g. a FloatBlock, this just returns the ndarray).

@jankatins
Copy link
Contributor Author

@jreback: why not simply return self._block.to_dense()? Anyway, I think the SingleBlockManager should push that down to the block via return self._block.get_values() and then the different blocks should "do the right things", as currently CategoricalBlock would need a different implementation than the rest.

Anyway, the problem in the above was, that even when I took that out (by only using values = self.values instead of values = self.get_values() for categoricals), the next line was also not succeeding and I didn't understand what other = _index.convert_scalar(values,_values_from_object(other)) actually does :-(

@jorisvandenbossche
Copy link
Member

@JanSchulz for a simple Block, to_dense returns a view of the array, so it that sense it indeed does what @jreback said

I understand the case for a SparseArray, but I don't know if this should also be the case for a Categorical. With get_values on a categorical, you loose information about the actual values, which is not the case for a SparseArray.

@jankatins
Copy link
Contributor Author

It definitely should NOT do that in case of a CategoricalBlock, but on the other hand if the above is the Definition if get_values() then the implementation of Categorical.get_values() is also wrong as it converts back to an ndarray:

    def get_values(self):
        """ Return the values.

        For internal compatibility with pandas formatting. [...]
        """

        # if we are a period index, return a string repr
        if isinstance(self.categories, PeriodIndex):
            return take_1d(np.array(self.categories.to_native_types(), dtype=object),
                           self._codes)

        return np.array(self)

@jreback
Copy link
Contributor

jreback commented Apr 11, 2015

@JanSchulz the point of these methods is to get 2 differently thing. A ndarray w/o doing any work, and the actual object. .get_values() is used for example when I need to concat things and I have to treat the things as a uniform array, equiv to what np.array(stuff) does.

I think some of this type inference in core/ops needs to be changed (and pushed down to the block), eg. compare(....), but let's open a new issue for that.

@jankatins
Copy link
Contributor Author

As far as I understood it, we tried to make Categorical as similar to ndarray as possible so if the difference between values and get_values() is simple that the latter guarantees a to_dense() representation, then for CategoricalBlock it should just return the Categorical as there is no sparse categorical.

@jreback
Copy link
Contributor

jreback commented Apr 11, 2015

@JanSchulz as I said, this is ok for now. pls add a release note.

Before, unequal comparisons were not checking the order of the
categories.

This was due to a conversion to an ndarray, which turned the
comparison to one between ndarray and scalar, which of course
has no categories to take into account.

Also add test cases and remove the one which actually tested the
wrong behaviour.
@jankatins
Copy link
Contributor Author

@jreback: release note added. Haven't run a doc build, though :-/

@jreback
Copy link
Contributor

jreback commented Apr 11, 2015

merged via f0ac930

ty!

@kay1793
Copy link

kay1793 commented Apr 12, 2015

Isn't this behavior the exact opposite of what @JanSchulz was saying in #8995?
So equality doesn't raise when comparing against a non-category value, but inequality does (should)?

@jankatins
Copy link
Contributor Author

@jreback: should I also add an issue about get_values() which should be pushed to Blocks?

@jreback
Copy link
Contributor

jreback commented Apr 12, 2015

@JanSchulz not sure what you are talking about .get_values() it is defined correctly

@shoyer
Copy link
Member

shoyer commented Apr 12, 2015

@jreback I know you've called get_values "unadvertised public API" [1] but IMO that is a mistake -- it should not a be public method.

[1] #9580 (comment)

@jreback
Copy link
Contributor

jreback commented Apr 12, 2015

well public for quite a long time
it does what it says so not sure what the issue is

@shoyer
Copy link
Member

shoyer commented Apr 12, 2015

we can keep get_values around for the sake of backward compat but let's not encourage its use :)

@jreback
Copy link
Contributor

jreback commented Apr 12, 2015

I am NOT encouraging use :)

@JanSchulz brought up something - still not sure about what

@jankatins
Copy link
Contributor Author

This was more along the line of #9859 which asked that ops, which check for the content of the type should be pushed to the blocks layer.

I still think that get_values() for a Series of type category should return the Categorical, if get_values() is defined by "return the to_dense variant of the underlying values". All things which cannot work with this should either be fixed by adding the appropriate methods in Categorical or by adding the right things to the block layer.

@jreback
Copy link
Contributor

jreback commented Apr 12, 2015

@JanSchulz what's the point ? .values already does that

@shoyer
Copy link
Member

shoyer commented Apr 12, 2015

If get_values is API, we should stick to its existing docstring.... (so I agree with Jan)

@jorisvandenbossche
Copy link
Member

docstring get_values: "same as values (but handles sparseness conversions)"

I think it is less confusing that get_values is the same as values for categoricals (and also more logical in my opinion)

and @jreback you could say the same as ".to_dense already does that" :-)

@jankatins
Copy link
Contributor Author

@jreback @jorisvandenbossche @shoyer Whats teh status of the .get_values() discussion here? Should I open a new issue to track that or is it currently fine?

  • push .get_values() into the Blocks-layer
  • decide what get_values()should do: return a "not sparse ndarray of the values" or a "not sparse .values" (aka should that return a Categorical or a ndarray?)

There is also the issue of pushing some ops into the Blocks layer, so that type checks can be removed there.

@shoyer
Copy link
Member

shoyer commented May 11, 2015

@JanSchulz Here is the issue I made for pushing ops to the blocks layer: #9859.

If we could go back in time and remove get_values() from ever being released as a public attribute, I would do that. But we can't, so my vote is: deprecate it as public API and don't change/extend current behavior. Categorical is a new type, so you're really free to do whatever you want -- making it equivalent to .values seems like the simplest option to me.

@jreback
Copy link
Contributor

jreback commented May 11, 2015

@JanSchulz I suppose you could deprecatee .get_values(). I don't think its public, except for my comments on it (its not in the api doc), and is just a regular method. So I don't think this is a big deal. That said, you certainly can try to push stuff down to the blocks. The problem I originally ran into was that .get_values() IS needed in some cases, so it is a Series (and hence an internal method).

But if you can figure out a better way, gr8!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

less than, greater than cuts on categorical don't follow order
5 participants