New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical fixups #7768

Closed
wants to merge 10 commits into
base: master
from

Conversation

Projects
None yet
2 participants
@jankatins
Contributor

jankatins commented Jul 16, 2014

Some fixups for Categoricals.

Fixes: #3678

  • Maybe change Series.cat after the discussion in #7207 ?
  • remove Series.cat from tab completition if Series is not of dtype category
  • fix for the "FIXME" in unittests
  • Look at problems in docs (-> hdf support)
  • Fixup Comparison thingies...

@jreback jreback added this to the 0.15.0 milestone Jul 16, 2014

@@ -220,9 +220,17 @@ def __init__(self, values, levels=None, ordered=None, name=None, fastpath=False,
inferred = com._possibly_infer_to_datetimelike(values)
if not isinstance(inferred, np.ndarray):
from pandas.core.series import _sanitize_array
values = _sanitize_array(values, None)
safe_dtype = None
if isinstance(values, list) and np.nan in values:

This comment has been minimized.

@jreback

jreback Jul 16, 2014

Contributor

I think you have to do isnull(values).any() instead of np.nan in values

This comment has been minimized.

@jankatins

jankatins Jul 16, 2014

Contributor
In[47]: np.nan in [np.nan, 1]
Out[47]: True
In[48]: np.nan in [2, 1]
Out[48]: False

Not sure what's faster: converting to numpy array and doing the isnull(..).any() check or the "is in list" check.

This comment has been minimized.

@jreback

jreback Jul 16, 2014

Contributor

this is too specific a check, instead do this:

dtype = 'object' if isnull(values).any() else None
values = _sanitize_array(values, dtype=dtype)

This comment has been minimized.

@jankatins

jankatins Jul 16, 2014

Contributor

done

@jankatins jankatins referenced this pull request Jul 16, 2014

Closed

Categoricals with NaNs #3678

@jreback

This comment has been minimized.

Contributor

jreback commented Jul 16, 2014

see here: jreback@1fc80e6

I don't buy the 2 NaN output case. very odd thing to do. A nan is a nan

@jankatins

This comment has been minimized.

Contributor

jankatins commented Jul 16, 2014

This currently does not work:

cat = pd.Categorical([1,2,3, np.nan], levels=[1,2,3])
cat.levels = [1,2,3, np.nan]
cat[1] = np.nan
exp = np.array([0,3,2,-1])
self.assert_numpy_array_equal(cat.codes, exp)

I'm not sure if this here is a bug or actually as intended:

In[18]: idx = _ensure_index([1,2,3,np.nan]) 
In[19]: idx
Out[19]: Float64Index([1.0, 2.0, 3.0, nan], dtype='float64')
In[20]: idx.get_indexer([np.nan])
Out[20]: array([-1])
@jreback

This comment has been minimized.

Contributor

jreback commented Jul 16, 2014

with a Floatindex u can't do that
u have to use hasnans and a check indexer

u really can only have 1 nan if you intend to do any kind of indexing operation rlse how would u know where it goes?

it not a value but an indicator of a missing value

so u can have more than 1 nan but then indexing (eg setitem/getitem) is impossible by value (but u can take by position)

@jankatins

This comment has been minimized.

Contributor

jankatins commented Jul 16, 2014

Re "two times nan in describe": R has the same problem with "NA as a level vs NA as missing value": http://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

If NA is a level, the way to set a code to be missing (as opposed to the code of the missing level) is to use is.na on the left-hand-side of an assignment (as in is.na(f)[i] <- TRUE; indexing inside is.na does not work). Under those circumstances missing values are currently printed as , i.e., identical to entries of level NA.

R does the same thing, printing NA twice, but once indicating that it is a level <NA> and once that it is missing NA's:

> f <- factor(c(1,2,3,NA), exclude=F)
> f
[1] 1    2    3    <NA>
Levels: 1 2 3 <NA>
> is.na(f)[1] <- TRUE
> f
[1] <NA> 2    3    <NA>
Levels: 1 2 3 <NA>
> table(f)
f
   1    2    3 <NA> 
   0    1    1    1 
> summary(f)
   1    2    3 <NA> NA's 
   0    1    1    1    1 

@jankatins jankatins changed the title from Categorical: preserve ints when NaN are present to Categorical and NaN fixups Jul 16, 2014

@jankatins jankatins changed the title from Categorical and NaN fixups to Categorical fixups Jul 23, 2014

@@ -120,9 +120,9 @@ class Categorical(PandasObject):
Attributes
----------
levels : ndarray
levels : Index

This comment has been minimized.

@jreback

jreback Jul 23, 2014

Contributor

why don't u say both of these are index-like (very confusing before and now)

This comment has been minimized.

@jankatins

jankatins Jul 23, 2014

Contributor

cat.levels returns a Index, but cat.codes returns a (readonly) numpy.array. codes : index implies certain methods, which are not present!?

So I think this is correct now...

This comment has been minimized.

@jreback

jreback Jul 23, 2014

Contributor

they are both Indexes (at least in the currentl impl) - they don't necessarily have to be but they are

This comment has been minimized.

@jankatins

jankatins Jul 24, 2014

Contributor

Not sure what makes it "index-like", but currently cat._levels and the public accessor cat.levels is an index, but cat._codes and the public accessor cat.codes is a np.ndarray:

In[47]: type(pd.Categorical([1]).codes)
Out[47]: numpy.ndarray
In[48]: type(pd.Categorical([1])._codes)
Out[48]: numpy.ndarray
In[49]: type(pd.Categorical([1])._levels)
Out[49]: pandas.core.index.Int64Index

Internally, _code is mutable (at least in __set_item__()) , but _levels is always replaced, so this fits the ndarray vs index.

@jankatins

This comment has been minimized.

Contributor

jankatins commented Jul 23, 2014

@jreback Please have a look ... I added a workaround for NaN in index in 39a7531and implemented all the stuff which came up in the end of #7217.

So, if Travis passes and you are satisfied, I think this is ready to merge.

@jreback

This comment has been minimized.

Contributor

jreback commented Jul 23, 2014

I want to fix the issue with nan first

# assignment step.
# tuple are list_like but com.isnull(<tuple>) will return a single bool,
# which then raises an AttributeError: 'bool' object has no attribute 'any'
has_null = (com.is_list_like(values) and not isinstance(values, tuple)

This comment has been minimized.

@jankatins

jankatins Jul 23, 2014

Contributor

BTW: did I use the right method to test for lists? OI had expected that if com.is_list_like(a): np.something(a) would always work, but not if a is a tuple :-(

This comment has been minimized.

@jreback

jreback Jul 23, 2014

Contributor

let me have a look ; you are jumping through too many hoops here

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 8, 2014

@JanSchulz relating to your comment from #7207
What more do you want here? I think this groups very nicely, don't show excess stuff.

In [1]: s = Series(list('aabbccd')).astype('category')

In [2]: s
Out[2]: 
0    a
1    a
2    b
3    b
4    c
5    c
6    d
dtype: category
Levels (4, object): [a < b < c < d]

In [3]: s.values.
s.values.T                     s.values.describe              s.values.from_array            s.values.max                   s.values.order                 s.values.reorder_levels        s.values.take_nd               
s.values.argsort               s.values.dtype                 s.values.from_codes            s.values.min                   s.values.ordered               s.values.shape                 s.values.to_dense              
s.values.codes                 s.values.equals                s.values.get_values            s.values.mode                  s.values.ravel                 s.values.sort                  s.values.unique                
s.values.copy                  s.values.fillna                s.values.levels                s.values.ndim                  s.values.remove_unused_levels  s.values.take                  s.values.view                  

In [3]: s.values.
@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 8, 2014

[The follwoing is just about Series.cat, not abaout Series.values which should be the underlying data structure (either numpy.array or pd.Categorical)]

from my perspective, the above has way to many methods visible (and some relevant are not even visible): when you have a Series of type category, only a few methods should be available under Series.cat: .levels .reorder_levels(), .remove_unused_levels() and .codes. Everything else is an implementation detail and the functionality is exposed via the normal Series methods (unique, min, max,...).

Only declaring the above 4 methods/properties as API (and only exposing them) will mean that we can switch to a numpy category implementation (which I expect to happen) much more easily without breaking user code and it makes for a nicer tab completion...

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 8, 2014

numpy cat implementation is pie-in-the-sky. if it happens great. but don't hold your breath. And if it actually does, and we actually switch to it (a big IF, why would it really be any different/better?), so we make a change, so what.

what methods are relevant but not visible?

what about .codes,fillna,copy,astype these seem relevant?

yes they are available (and delegate from series). but as a user wouldn't I expect to see what I can do with the object?

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 8, 2014

.codes is, but .fillna is done via Series.fillna() and not via Series.cat.fillna() (same for copy and astype). If you know what you are doing, fine, just go for Series.values and do whatever you want with it, but you also have to watch out for bugs then.

From my perspective (using pandas for some not-so-advanced data munging and statistics) I never noticed how numpy arrays are wrapped until I worked on Categorical, so I also never tried to work with Series.values and I also think it would be wrong to use methods on the numpy array directly because that would (as far as my understanding goes) lead to strange results like in the Series(numpyarray)-case where you can manipulate the numpyarray and the Series changes in some cases (depending on view or not).

I'm also a fan of only promoting to API as few methods/properties as possible to have some room for future developments, but that's probably not as problematic in python as it is in java.

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 8, 2014

ok, I am on-board with this then. this is actually really easy,

just provide a new Categorical.__dir__() with the methods/properties you want (just a list of strings)

It was a bit non-trivial with datetime, because ipython search all of the base classes (which complicates things). Here base class is pretty trivial so its not an issue (and they ignore anything that starts with '_')).

and need a test as well (see what I did for the .dt tests)

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 8, 2014

But that has a problem too: Looking at the Categorical object is like looking at the numpy array and that object should expose every method. So S.cat.<tab> should give you a different list of methods than s.values.<tab> or Categorical(...).<tab>. So I still think it makes sense to replace s.cat with a similar object like it is now done with s.dt (which has the same objective to hide the internals of the implementation from the user).

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 8, 2014

oh, I c what you want to do, like a CategoricalDelegate or something that is like DatetimelikeProperties, delgating as needed (but this is only for a limited purpose).

ok then go for it, should be straightforward.

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 8, 2014

I current try my idea with the "generate the dt accessors" and if that works I will base the cat access on that.

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 8, 2014

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 8, 2014

Just saw it. Damn, you are too fast! :-)

If _add_accessors(cls) get's changed to _add_accessors(cls, names), this will work for cat as well. Then it can also be put into a more generic place.

jankatins added some commits Jul 23, 2014

API: change default Categorical.from_codes() to ordered=False
In the normal constructor `ordered=True` is only assumed if the levels
are given or the values are sortable (which is most of the cases), but
in `from_codes(...)` we can't asssume this so the default should be
`False`.
Categorical: use s.values when calling private methods
s.values is the underlying Categorical object, s.cat will be changed
to only expose the API methods/properties.
Categorical: Fix comparison of Categoricals and Series|Categorical|np…
….array

Categorical can only be comapred to another Categorical with the same levels
and the same ordering or to a scalar value.

If the Categorical has no order defined (cat.ordered == False), only equal
(and not equal) are defined.
@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 11, 2014

@jreback So, I hope I got every unittests regression... If you (or anybody else :-) ) have any comments, I will address them tomorrow, I'm off to bed... :-)

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 11, 2014

@JanSchulz haha, ok will take a look

@@ -547,7 +553,7 @@ the Categorical back to a numpy array, so levels and order information is not pr
Categorical.__array__
To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
are also introduced.

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

what does this mean?

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

non-API = they may change, so don't rely on them in production code

.. ipython:: python
cat = pd.Series(pd.Categorical([1,2,3], levels=[3,2,1]))
cat_base = pd.Series(pd.Categorical([2,2,2], levels=[3,2,1]))

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

show the cats after they are created

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

done

cat > cat_base
# This doesn't work because the levels are not the same
try:

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

their is a way to do this in the docs (showing an exception); can also do a code block

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

I haven't found a way to do that. Just letting the exception happen results in long stacktraces and I don't like codeblocks, where the exception message has to be manually inserted (and maintained). Maybe that would be a nice PR for the ipython directive....

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

yep that's fine (I bet their is a way with :okexcept: though)

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

I looked at the sphinx extension source and don't think there is a way without modifying it. `:okexcept:' basically only prevents sphinx to write the exception to stdout.

A :nostacktrace: option would be nice...

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

maybe can create a small function and put in utils for this purpose (basically what u r doing)

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

something like

with no_stacktrace():
   a < cat
except TypeError as e:
print("TypeError: " + str(e))
cat > 2

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

put a comment above (eg comparison vs scalar)

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

done

@@ -208,8 +232,25 @@ def __init__(self, values, levels=None, ordered=None, name=None, fastpath=False,
# under certain versions of numpy as well
inferred = com._possibly_infer_to_datetimelike(values)
if not isinstance(inferred, np.ndarray):
# Input sanitation...

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

hmm I am but sure about some of this
u r converting to lists then back again

"""
return com.take_1d(self.levels.values, self._codes)
ret = com.take_1d(self.levels.values, self._codes)
if dtype and dtype != self.levels.dtype:

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

why r u doing this? a take will always return the same dtype of what's taken

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

Because currently we ignored any dtype which was passed in to the array interface: __array__(dtype=None), now this is taken into account and convert if the caller requests that from __array__()

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

I have never seen that used. and we don't do it anywhere else. if its coerced via __array__ user can do it on their own.

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

But then why the dtype argument?

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

not sure, never used it (I am sure numpy has a reason).

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

Note this is the ref in numpy:

ndarray.__array__(|dtype) → reference if type unchanged, copy otherwise.

Returns either a new reference to self if dtype is not given or a new array of provided data type if dtype is different from the current dtype of the array.
return ret
def astype(self, dtype, order='K', casting='unsafe', subok=True, copy=True):
return np.asarray(self, dtype)

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

why r u adding numpy args here? they'd don't apply

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

Not sure what's better: to fail because of missing arguments when someone calls the 'astype()' interface with a signature as defined in the numpy docs or to ignore arguments (what's now happening). I can change that to **kwargs and issue a warning that these arguments are not understood?

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

I just fixed the signature to what we use in pandas (which doesn't have any of those argument)

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

I removed that now

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

I think I needed that in an earlier version of the null check, where pandas called values.astype(...) and then failed.

@@ -503,10 +551,27 @@ def order(self, inplace=False, ascending=True, na_position='last', **kwargs):
if na_position not in ['last','first']:
raise ValueError('invalid na_position: {!r}'.format(na_position))
codes = np.sort(self._codes.copy())
codes = np.sort(self._codes)

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

you have to copy or it changes the original (np.sort is inplace)

This comment has been minimized.

@jankatins

jankatins Aug 12, 2014

Contributor

According to the docs it returns a new array:

np.sort??
[...]
def sort(a, axis=-1, kind='quicksort', order=None):
    """
    Return a sorted copy of an array.

This comment has been minimized.

@jreback

jreback Aug 12, 2014

Contributor

you are right (I had to change it back). somehow always though this was inplace.

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 12, 2014

https://github.com/jreback/pandas/tree/cats

added a couple of commits to fix docs / reorg isnull

otherwise looks ok

give a look and I will rebase and merge

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 12, 2014

I'll have to fix this when we merge: https://travis-ci.org/jreback/pandas/jobs/32332841

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 12, 2014

closing in favor of #8006

@jreback jreback closed this Aug 12, 2014

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 12, 2014

Argh, this was so nicely put into different commits :-(

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 12, 2014

doesn't matter it has to get squashed anyhow. (that's just the convention)

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 12, 2014

Why the squash? I always thought that's because of the problems with rebasing (which I did in my branch via patch export and apply).

I think all commits pass the unittests and only have one logical step, so they are no "work in progress" commits.

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 12, 2014

I just simpler when looking at the log. It could be a couple, but these are all interelated so just easier.

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 12, 2014

I've picked your changes (damn squash :-) ), but not the :okexcept: one, as there are too many raised exceptions in the doc which makes it kind of unreadable because some stacktraces are just to distracting and long.

Also squashed the whole together into one commit which lists the logical commits (i.e. the CLN is part of another commit)

Please pull https://github.com/JanSchulz/pandas/tree/categorical_fixups

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 12, 2014

It seems that this PR is not updated anymore, should I open another PR? Or can you reopen this one?

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 12, 2014

hmm, why don't you open a new one. not sure why its not updating.

@jankatins

This comment has been minimized.

Contributor

jankatins commented Aug 12, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment