API: Add str/dt accessors to categorical #10661

Closed
sinhrks opened this issue Jul 23, 2015 · 18 comments

@sinhrks
Member

sinhrks commented Jul 23, 2015

The .str and .dt accessors should be enabled depending on the dtype of the categories. CategoricalIndex should be handled as well.

import pandas as pd
s = pd.Series(['A', 'B'], dtype='category')
s.str
# AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

s = pd.Series([pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-01')], dtype='category')
s.dt
# AttributeError: Can only use .dt accessor with datetimelike values
@sinhrks sinhrks added this to the 0.17.0 milestone Jul 23, 2015
@jreback
Contributor

jreback commented Jul 23, 2015

xref to #8627

What you are saying is that a Series should ALSO allow the accessor that matches the dtype of its values (or of its categories, if it is a Categorical), in addition to .cat if it is a categorical. Then I would agree (though I don't think this actually works ATM, see the xref issue).

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 20, 2015
@jankatins
Contributor

So, what this should do is simply allow s.str to succeed when s.cat.categories is of type string, and s.dt when s.cat.categories is of type datetime/...?

@jankatins
Contributor

One solution (for .dt) is to return a copy of the categorical as datetime object:

    def _make_dt_accessor(self):
        try:
            return maybe_to_datetimelike(self)
        except Exception:
            if is_categorical_dtype(self.dtype):
                try:
                    cat_dtype = self.values.categories.dtype
                    return maybe_to_datetimelike(self.astype(cat_dtype))
                except Exception:
                    pass  # fall through to the AttributeError below
            raise AttributeError("Can only use .dt accessor with datetimelike "
                                 "values")

@jreback
Contributor

jreback commented Nov 12, 2015

@JanSchulz not at all.

This needs to operate on the categories, then return a new categorical object that has the transformed values.

In [1]: s = Series(list('aabbc')).astype('category')

In [2]: s.astype(object).str.upper()
Out[2]: 
0    A
1    A
2    B
3    B
4    C
dtype: object

# I still want to type: s.str.upper() though
In [4]: s.cat.rename_categories(s.cat.categories.str.upper())
Out[4]: 
0    A
1    A
2    B
3    B
4    C
dtype: category
Categories (3, object): [A, B, C]

@jankatins
Contributor

ok, on it...

@jankatins
Contributor

Ok, what should that return:

s = Series(list('aabbc')).astype('category')
s.str.upper()

-> a series of type string/object, or a series of type category where the categories are of dtype str (and transformed)? I would go for the first, as it would honor the contract for str, which says the values are strings:

From http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

Series.str can be used to access the values of the series as strings and apply several methods to it. These can be accessed like Series.str.<function/property>.

@jreback
Contributor

jreback commented Nov 12, 2015

@JanSchulz, no, I would change the docs. The point of using a category dtype is that it essentially acts like its object cousin, but is simply more efficient.

@jreback
Contributor

jreback commented Nov 12, 2015

This is slightly more tricky though (and it's actually an example where it is quite useful). Note that [123], as I show below, is actually pretty inefficient, since I already know the indexers (the category codes). I think we can compute that directly.

In [120]: s = Series(['foo','foo','fub','fub','bar']).astype('category')

In [121]: s
Out[121]: 
0    foo
1    foo
2    fub
3    fub
4    bar
dtype: category
Categories (3, object): [bar, foo, fub]

In [122]: s.astype(object).str.contains('^f')
Out[122]: 
0     True
1     True
2     True
3     True
4    False
dtype: bool

In [123]: s.isin(s.cat.categories[s.cat.categories.str.contains('^f')])
Out[123]: 
0     True
1     True
2     True
3     True
4    False
dtype: bool
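
A minimal sketch of what "computing that directly" could look like: evaluate the string method once per category, then expand the per-category result through the integer codes. This is an illustration only, not the implementation that was merged. It assumes the series has no missing values (a code of -1 would need separate handling), and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', 'foo', 'fub', 'fub', 'bar']).astype('category')

# Run the string method once per category (3 values), not once per row (5 values).
cat_mask = np.asarray(s.cat.categories.str.contains('^f'), dtype=bool)

# Each row stores the integer position of its category, so indexing the
# per-category mask with the codes expands it back to row level.
# Assumption: no missing values, i.e. no code equals -1.
codes = s.cat.codes.to_numpy()
result = pd.Series(cat_mask[codes], index=s.index)
```

For this input the result matches Out[122] and Out[123] above, without the extra isin lookup.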

@jankatins
Contributor

I really hate that argument: "The point of using a category dtype is that it essentially acts like its object cousin, but is simply more efficient". Can't we get a String (or Object) class which is basically a copy of Category (sharing a common base class)?

@jankatins
Contributor

Ok, found a case where it should not result in a category:

In[29]: s = Series(list('aabb')).astype('category') 
In[30]: s + s
Traceback (most recent call last):
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-30-b7ea6e807a5e>", line 1, in <module>
    s + s
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\pandas\core\ops.py", line 609, in wrapper
    arr = na_op(lvalues, rvalues)
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\pandas\core\ops.py", line 565, in na_op
    raise TypeError("{typ} cannot perform the operation {op}".format(typ=type(x).__name__,op=str_rep))
TypeError: Categorical cannot perform the operation +
In[31]: s = Series(list('aabb')) 
In[32]: s +s 
Out[32]: 
0    aa
1    aa
2    bb
3    bb
dtype: object

@jankatins
Contributor

Another problem if we returned a category: should it be ordered or not?

@jankatins
Contributor

Ok, I have a PR which returns a normal Series (not a categorical) so that you can concatenate the resulting strings (s + s works for a Series of type string, but not for category), and the same for .dt. See #11582

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 12, 2015

I don't think we should try to return a categorical here. There are way too many corner cases to deal with across all of the string manipulation functions, and they are all a bit subjective.
It will be a lot simpler to just return the result as an object series (or boolean series, depending on the function). Just being able to use the str accessor methods is already an improvement.

E.g. what to do with pd.Series(list('abAb'), dtype='category').str.upper()? Merge categories?
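
To make that corner case concrete, here is a small sketch (assuming a reasonably recent pandas; variable names are illustrative). Upper-casing the categories 'A', 'a', 'b' yields a duplicate 'A', and rename_categories rejects non-unique categories, so a categorical result would need an explicit merge step such as a round trip through object dtype.

```python
import pandas as pd

s = pd.Series(list('abAb'), dtype='category')

# The categories are A, a, b; upper-casing them gives A, A, B.
upper_categories = s.cat.categories.str.upper()

# Categories must be unique, so renaming in place is rejected.
try:
    s.cat.rename_categories(upper_categories)
    collided = False
except ValueError:
    collided = True

# One possible way to "merge": transform the values, then re-categorize.
merged = s.astype(object).str.upper().astype('category')
```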

@jorisvandenbossche
Member

I think the concatenation of strings with + is a separate issue from the str accessor. I am not sure I am in favor of adding this. Disallowing addition (and other operations) seems like a design choice to me. E.g. integer categories cannot be added to each other either.

@jankatins
Contributor

@jorisvandenbossche s + s is only allowed for series of type string, not category (before and after this PR :-)). The PR makes series_of_type_cat.str.<method> return a Series of the same type as series_of_type_string.str.<method> (and not of type category).
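
A short sketch of the behaviour described here, assuming a pandas version that includes this change: the str methods are callable on a categorical series, the result is a plain (non-categorical) series, and such results can be concatenated with +.

```python
import pandas as pd

s = pd.Series(list('aabb')).astype('category')

# The accessor works on the categorical but returns a non-categorical series.
upper = s.str.upper()
is_categorical = isinstance(upper.dtype, pd.CategoricalDtype)

# Because the results are ordinary string series, + concatenates them.
combined = upper + s.str.lower()
```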

@jorisvandenbossche
Member

Ah, sorry, I misread your comment above as saying that you wanted to do s + s on categoricals. But you want to do it on the results of str, which is a reason not to return a categorical series?

@jankatins
Contributor

@jorisvandenbossche yep, exactly: if cat.str.whatever() returned a series of type category, you could not add the resulting strings or ints (e.g. from .dt.days).
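
The same point for .dt, as a hedged sketch assuming a pandas version that includes this change. The example uses .dt.year on datetime categories as a stand-in for any integer-valued property; the sample dates are made up.

```python
import pandas as pd

dates = pd.Series(
    pd.to_datetime(['2011-01-01', '2011-01-01', '2012-06-15'])
).astype('category')

# .dt is available because the categories are datetimes; the result is a
# plain integer series, so ordinary arithmetic works on it.
years = dates.dt.year
next_year = years + 1
```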

@jreback jreback modified the milestones: 0.17.1, Next Major Release Nov 17, 2015
@jreback
Contributor

jreback commented Nov 18, 2015

closed by #11582
