API: Add str/dt accessors to categorical #10661

sinhrks · 2015-07-23T12:10:44Z

Accessors should be enabled depending on the dtype of the categories. CategoricalIndex should be handled as well.

import pandas as pd
s = pd.Series(['A', 'B'], dtype='category')
s.str
# AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

s = pd.Series([pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-01')], dtype='category')
s.dt
# AttributeError: Can only use .dt accessor with datetimelike values


jreback · 2015-07-23T13:18:17Z

xref to #8627

What you are saying is that a Series should ALSO allow the accessor that is correct for the dtype of its values (or of its categories, if it is a Categorical), in addition to .cat if it is a categorical. Then I would agree (though I don't think this actually works ATM; see the xref issue).

jankatins · 2015-11-12T11:45:16Z

So, what this should do is simply allow s.str to succeed when s.cat.categories is of type string, and s.dt when s.cat.categories is of type datetime/...?
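A minimal sketch of the rule being asked about, assuming the choice of accessor depends only on the dtype of the categories. `accessor_kind` is a hypothetical helper name used for illustration, not part of pandas; the dtype checks come from `pandas.api.types`.

```python
import pandas as pd
from pandas.api import types as ptypes


def accessor_kind(s):
    """Return 'str', 'dt', or None for a Series (hypothetical helper)."""
    # For categorical data, look at the categories rather than the codes.
    if isinstance(s.dtype, pd.CategoricalDtype):
        values = s.cat.categories
    else:
        values = s
    if ptypes.is_datetime64_any_dtype(values) or ptypes.is_timedelta64_dtype(values):
        return "dt"
    if ptypes.is_string_dtype(values):
        return "str"
    return None


strings = pd.Series(["A", "B"], dtype="category")
stamps = pd.Series([pd.Timestamp("2011-01-01")] * 2, dtype="category")
numbers = pd.Series([1, 2], dtype="category")
print(accessor_kind(strings), accessor_kind(stamps), accessor_kind(numbers))
```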

jankatins · 2015-11-12T12:08:07Z

One solution (for .dt) is to convert a copy of the categorical to the dtype of its categories and build the datetime accessor from that:

    def _make_dt_accessor(self):
        try:
            return maybe_to_datetimelike(self)
        except Exception:
            if is_categorical_dtype(self.dtype):
                try:
                    cat_dtype = self.values.categories.dtype
                    return maybe_to_datetimelike(self.astype(cat_dtype))
                except Exception:
                    pass  # fall through to the AttributeError below
            raise AttributeError("Can only use .dt accessor with datetimelike "
                                 "values")

jreback · 2015-11-12T12:13:42Z

@JanSchulz not at all.

This needs to operate on the categories, then return a new categorical object that has the transformed values.

In [1]: s = Series(list('aabbc')).astype('category')

In [2]: s.astype(object).str.upper()
Out[2]: 
0    A
1    A
2    B
3    B
4    C
dtype: object

# I still want to type: s.str.upper() though
In [4]: s.cat.rename_categories(s.cat.categories.str.upper())
Out[4]: 
0    A
1    A
2    B
3    B
4    C
dtype: category
Categories (3, object): [A, B, C]
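The same idea can be wrapped in a small function. This is only a sketch: `transform_categories` is a hypothetical name, and it assumes the transformed categories stay unique, which `rename_categories` requires.

```python
import pandas as pd


def transform_categories(s, func):
    """Apply func to the categories once and keep the category dtype.

    Hypothetical helper; assumes func returns unique values.
    """
    new_categories = func(s.cat.categories)
    return s.cat.rename_categories(new_categories)


s = pd.Series(list("aabbc")).astype("category")
upper = transform_categories(s, lambda cats: cats.str.upper())
print(upper)
```

The transformation runs once per category instead of once per row, which is where the efficiency gain would come from.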

jankatins · 2015-11-12T12:36:02Z

ok, on it...

jankatins · 2015-11-12T12:41:03Z

Ok, what should that return:

s = Series(list('aabbc')).astype('category')
s.str.upper()

Should it be a series of type string/object, or a series of type category whose categories are of dtype str (and transformed)? I would go for the first, as it would honor the contract for str, which says the result is a string:

from http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

Series.str can be used to access the values of the series as strings and apply several methods to it. These can be accessed like Series.str.<function/property>.

jreback · 2015-11-12T12:47:28Z

@JanSchulz, no, I would change the docs. The point of using a category dtype is that it essentially acts like its object cousin, but is simply more efficient.

jreback · 2015-11-12T12:55:30Z

This is slightly more tricky though (and it's actually an example where it is quite useful). Note that [123], as I show below, is actually pretty inefficient, as I already know the indexers. I think we can compute that directly.

In [120]: s = Series(['foo','foo','fub','fub','bar']).astype('category')

In [121]: s
Out[121]: 
0    foo
1    foo
2    fub
3    fub
4    bar
dtype: category
Categories (3, object): [bar, foo, fub]

In [122]: s.astype(object).str.contains('^f')
Out[122]: 
0     True
1     True
2     True
3     True
4    False
dtype: bool

In [123]: s.isin(s.cat.categories[s.cat.categories.str.contains('^f')])
Out[123]: 
0     True
1     True
2     True
3     True
4    False
dtype: bool
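One way to "compute that directly" is to evaluate the pattern once per category and then index the per-category result with the integer codes. The sketch below is an assumption about what that could look like, not the code that was merged; `contains_via_categories` is a hypothetical name, and missing values (code -1) are mapped to False here, which differs from the NaN an object Series would give.

```python
import numpy as np
import pandas as pd


def contains_via_categories(s, pat):
    """Regex match computed on the categories, then expanded via the codes."""
    # One regex evaluation per category, not per row.
    cat_mask = np.asarray(s.cat.categories.str.contains(pat), dtype=bool)
    codes = s.cat.codes.to_numpy()
    result = np.zeros(len(s), dtype=bool)
    valid = codes >= 0  # -1 marks a missing value
    result[valid] = cat_mask[codes[valid]]
    return pd.Series(result, index=s.index)


s = pd.Series(["foo", "foo", "fub", "fub", "bar"]).astype("category")
matches = contains_via_categories(s, "^f")
print(matches)
```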

jankatins · 2015-11-12T13:03:28Z

I really hate that argument: "The point of using a category dtype is that it essentially acts like its object cousin, but is simply more efficient". Can't we get a String (or Object) class which is basically a copy of Category (with a shared base class)?

jankatins · 2015-11-12T13:09:49Z

Ok, found a case where it should not result in a category:

In[29]: s = Series(list('aabb')).astype('category') 
In[30]: s + s
Traceback (most recent call last):
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-30-b7ea6e807a5e>", line 1, in <module>
    s + s
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\pandas\core\ops.py", line 609, in wrapper
    arr = na_op(lvalues, rvalues)
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\pandas\core\ops.py", line 565, in na_op
    raise TypeError("{typ} cannot perform the operation {op}".format(typ=type(x).__name__,op=str_rep))
TypeError: Categorical cannot perform the operation +
In[31]: s = Series(list('aabb')) 
In[32]: s +s 
Out[32]: 
0    aa
1    aa
2    bb
3    bb
dtype: object
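The difference can be reproduced in a few lines. This sketch assumes that adding two categorical Series raises TypeError, as in the traceback above, and that converting to object dtype first is an acceptable workaround.

```python
import pandas as pd

cat = pd.Series(list("aabb")).astype("category")

# Addition is not defined for categorical data.
try:
    cat + cat
    categorical_add_failed = False
except TypeError:
    categorical_add_failed = True

# Converting to object dtype restores element-wise string concatenation.
as_object = cat.astype(object)
doubled = as_object + as_object
print(categorical_add_failed)
print(doubled)
```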

jankatins · 2015-11-12T13:19:19Z

Another problem if we returned a category: should it be ordered or not?

jankatins · 2015-11-12T13:55:38Z

OK, I have a PR which returns a normal Series (not a categorical), so that you can concatenate the resulting strings (s + s works for a Series of type string, but not for one of type category). The same goes for .dt. See #11582.

jorisvandenbossche · 2015-11-12T13:56:50Z

I don't think we should try to return a categorical here. There are way too many corner cases to deal with in all of the string manipulation functions, and they are all a bit subjective.
It will be a lot simpler to just return the result as an object series (or boolean series, depending on the function). Just being able to use the str accessor methods is already an improvement.

E.g., what should be done with pd.Series(list('abAb'), dtype='category').str.upper()? Merge categories?
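A short sketch of that corner case, assuming the categories-based approach goes through `rename_categories`: the upper-cased categories are no longer unique, so the rename is rejected, while a round trip through object dtype silently merges them.

```python
import pandas as pd

s = pd.Series(list("abAb"), dtype="category")

# 'a' and 'A' both map to 'A', so the new categories would contain a duplicate.
try:
    s.cat.rename_categories(s.cat.categories.str.upper())
    rename_failed = False
except ValueError:
    rename_failed = True

# Going through object dtype and re-categorizing merges the two categories.
merged = s.astype(object).str.upper().astype("category")
print(rename_failed)
print(list(merged.cat.categories))
```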

jorisvandenbossche · 2015-11-12T14:02:35Z

I think the concatenation of strings with + is a separate issue from the str accessor. I am not sure I am in favor of adding this. Disallowing addition (and other operations) seems like a design choice to me. E.g., integer categories cannot be added to each other either.

jankatins · 2015-11-12T14:07:21Z

@jorisvandenbossche s + s is only allowed for series of type string, not category (before and after this PR :-)). The PR makes series_of_type_cat.str.<method> return a Series of the same type as series_of_type_string.str.<method> (and not of type category).

jorisvandenbossche · 2015-11-12T14:14:32Z

Ah, sorry, I misread your comment above as saying that you wanted to do s + s on categoricals (but you want to do it on the results of str, which is a reason not to let it return a categorical series?)

jankatins · 2015-11-12T15:49:32Z

@jorisvandenbossche yep, exactly: if cat.str.whatever() returned a series of type category, you could not add two strings or ints (e.g. .dt.days).
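To illustrate with .dt: the sketch below assumes a pandas version in which .dt is available on categorical data and returns a plain (non-categorical) Series, so ordinary arithmetic works on the result. It uses .dt.year rather than .dt.days only to keep the example small.

```python
import pandas as pd

s = pd.Series(
    [pd.Timestamp("2011-01-01"), pd.Timestamp("2012-06-15")],
    dtype="category",
)

# The accessor result is a plain integer Series, not a categorical one.
years = s.dt.year
next_years = years + 1
print(next_years)
```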

jreback · 2015-11-18T11:46:48Z

closed by #11582

sinhrks added Timeseries API Design Categorical Categorical Data Type labels Jul 23, 2015

sinhrks added this to the 0.17.0 milestone Jul 23, 2015

jreback modified the milestones: Next Major Release, 0.17.0 Aug 20, 2015

jreback added Prio-high labels Aug 20, 2015

jankatins mentioned this issue Nov 12, 2015

Make .str/.dt available for Series of type category with string/datetime #11582

Closed

jreback mentioned this issue Nov 13, 2015

PERF: perform .str operations on categoricals #8627

Closed

jreback modified the milestones: 0.17.1, Next Major Release Nov 17, 2015

jreback closed this as completed Nov 18, 2015

jorisvandenbossche mentioned this issue Jan 23, 2017

API: .str ops on category should return category if result is non-boolean #15198

Open
