API: Add str/dt accessors to categorical #10661

Closed
sinhrks opened this Issue Jul 23, 2015 · 18 comments

@sinhrks
Member

sinhrks commented Jul 23, 2015

Accessors should be enabled depending on the dtype of the categories. CategoricalIndex should be handled as well.

import pandas as pd
s = pd.Series(['A', 'B'], dtype='category')
s.str
# AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

s = pd.Series([pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-01')], dtype='category')
s.dt
# AttributeError: Can only use .dt accessor with datetimelike values
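As long as the accessors are not enabled for categoricals, one workaround is to convert the series back to the dtype of its categories first. This is only a minimal sketch of that workaround, not the behaviour requested in this issue; the variable names are illustrative.

```python
import pandas as pd

# String categories: go through object dtype, then use .str
s = pd.Series(["A", "B"], dtype="category")
lowered = s.astype(object).str.lower()

# Datetime categories: cast to the dtype of the categories, then use .dt
t = pd.Series(
    [pd.Timestamp("2011-01-01"), pd.Timestamp("2011-01-01")],
    dtype="category",
)
years = t.astype(t.cat.categories.dtype).dt.year

print(lowered.tolist())
print(years.tolist())
```

Both conversions materialise a full non-categorical copy, so the memory benefit of the category dtype is lost for the intermediate result.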

@sinhrks sinhrks added this to the 0.17.0 milestone Jul 23, 2015

@jreback

jreback Jul 23, 2015

Contributor

xref to #8627

What you are saying is that it should ALSO allow the dtype-appropriate accessor for its values (or for its categories if it's a Categorical), in addition to .cat if it's a categorical. Then I would agree (though I don't think this actually works ATM; see the xref issue).

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 20, 2015

@jankatins

jankatins Nov 12, 2015

Contributor

So, what this should do is simply allow s.str to succeed when s.cat.categories is of type string, and s.dt when s.cat.categories is of type datetime/...?
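The rule described here could be sketched as a dtype check on the categories. The helper name `accessor_for_categories` is hypothetical and not part of pandas; it only illustrates the decision, using the public `pandas.api.types` checks.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_string_dtype


def accessor_for_categories(s):
    """Hypothetical helper: name the accessor matching the categories' dtype."""
    cat_dtype = s.cat.categories.dtype
    if is_datetime64_any_dtype(cat_dtype):
        return "dt"
    if is_string_dtype(cat_dtype):
        # object dtype is treated as "string" by this check
        return "str"
    return None


strings = pd.Series(["A", "B"], dtype="category")
stamps = pd.Series(
    [pd.Timestamp("2011-01-01"), pd.Timestamp("2011-01-01")],
    dtype="category",
)
numbers = pd.Series([1, 2], dtype="category")

print(accessor_for_categories(strings))
print(accessor_for_categories(stamps))
print(accessor_for_categories(numbers))
```

The check only covers datetime categories for .dt; timedelta or period categories would need additional branches.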

@jankatins

jankatins Nov 12, 2015

Contributor

One solution (for .dt) is to return a copy of the categorical converted to the datetime dtype of its categories:

    def _make_dt_accessor(self):
        try:
            return maybe_to_datetimelike(self)
        except Exception:
            if is_categorical_dtype(self.dtype):
                try:
                    cat_dtype = self.values.categories.dtype
                    return maybe_to_datetimelike(self.astype(cat_dtype))
                except Exception:
                    pass  # fall through and raise below
            raise AttributeError("Can only use .dt accessor with datetimelike "
                                 "values")
@jreback

jreback Nov 12, 2015

Contributor

@JanSchulz not at all.

This needs to operate on the categories, then return a new categorical object that has the transformed values.

In [1]: s = Series(list('aabbc')).astype('category')

In [2]: s.astype(object).str.upper()
Out[2]: 
0    A
1    A
2    B
3    B
4    C
dtype: object

# I still want to type: s.str.upper() though
In [4]: s.cat.rename_categories(s.cat.categories.str.upper())
Out[4]: 
0    A
1    A
2    B
3    B
4    C
dtype: category
Categories (3, object): [A, B, C]
@jankatins

jankatins Nov 12, 2015

Contributor

ok, on it...

@jankatins

jankatins Nov 12, 2015

Contributor

Ok, what should that return:

s = Series(list('aabbc')).astype('category')
s.str.upper()

-> a series of type string/object or a series of type category where the categories are of dtype str (and transformed)? I would go for the first, as it would honor the contract for str, which says it's a string:

from http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

Series.str can be used to access the values of the series as strings and apply several methods to it. These can be accessed like Series.str.<function/property>.

@jreback

jreback Nov 12, 2015

Contributor

@JanSchulz, no, I would change the docs. The point of using a category dtype is that it essentially acts like its object cousin, but is simply more efficient.

@jreback

jreback Nov 12, 2015

Contributor

This is slightly more tricky though (and it's actually an example where it is quite useful). Note that [123], as I show below, is actually pretty inefficient, as I already know the indexers (the category codes). I think we can compute that directly.

In [120]: s = Series(['foo','foo','fub','fub','bar']).astype('category')

In [121]: s
Out[121]: 
0    foo
1    foo
2    fub
3    fub
4    bar
dtype: category
Categories (3, object): [bar, foo, fub]

In [122]: s.astype(object).str.contains('^f')
Out[122]: 
0     True
1     True
2     True
3     True
4    False
dtype: bool

In [123]: s.isin(s.cat.categories[s.cat.categories.str.contains('^f')])
Out[123]: 
0     True
1     True
2     True
3     True
4    False
dtype: bool
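One way to read "compute that directly" is to evaluate the string method once per category and then look the result up through the category codes. This is only a sketch of that idea under the assumption that the series has no missing values (a code of -1 would index the wrong element); it is not necessarily how pandas implements it.

```python
import numpy as np
import pandas as pd

s = pd.Series(["foo", "foo", "fub", "fub", "bar"]).astype("category")

# Evaluate the regex once per category (3 evaluations instead of 5)
per_category = np.asarray(s.cat.categories.str.contains("^f"), dtype=bool)

# The codes are integer positions into the categories; use them as indexers
codes = s.cat.codes.to_numpy()
result = pd.Series(per_category[codes], index=s.index)

print(result.tolist())
```

The saving grows with the ratio of rows to categories, since the string method runs only on the unique values.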
@jankatins

jankatins Nov 12, 2015

Contributor

I really hate that argument: "The point of using a category dtype is that it essentially acts like its object cousin, but is simply more efficient". Can't we get a String (or Object) class which is basically a copy of Category (sharing a common base class)?

@jankatins

jankatins Nov 12, 2015

Contributor

Ok, found a case where it should not result in a category:

In[29]: s = Series(list('aabb')).astype('category') 
In[30]: s + s
Traceback (most recent call last):
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-30-b7ea6e807a5e>", line 1, in <module>
    s + s
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\pandas\core\ops.py", line 609, in wrapper
    arr = na_op(lvalues, rvalues)
  File "C:\portabel\miniconda\envs\knitpy_27\lib\site-packages\pandas\core\ops.py", line 565, in na_op
    raise TypeError("{typ} cannot perform the operation {op}".format(typ=type(x).__name__,op=str_rep))
TypeError: Categorical cannot perform the operation +
In[31]: s = Series(list('aabb')) 
In[32]: s +s 
Out[32]: 
0    aa
1    aa
2    bb
3    bb
dtype: object
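The same contrast as a small script: addition on the categorical is attempted inside a try block, and the concatenation is done after converting to object dtype. The variable names are illustrative.

```python
import pandas as pd

cat = pd.Series(list("aabb")).astype("category")

# Addition is not defined for categoricals; record whether it fails
try:
    cat + cat
    addition_failed = False
except TypeError:
    addition_failed = True

# Converting to object dtype first makes string concatenation available
as_object = cat.astype(object)
doubled = as_object + as_object

print(addition_failed)
print(doubled.tolist())
```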
@jankatins

jankatins Nov 12, 2015

Contributor

Another problem if we returned a category: should it be ordered or not?

@jankatins

jankatins Nov 12, 2015

Contributor

Ok, I have a PR which returns normal Series (not categoricals), so that you can concatenate substrings (s + s works for a Series of type string, but not category), and the same for .dt. See #11582.
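A short sketch of the behaviour described here, assuming a pandas version that includes this change: the accessors work on a categorical series and return a non-categorical result, so the results can be added.

```python
import pandas as pd

s = pd.Series(list("aabbc")).astype("category")

# .str on a categorical with string categories returns a plain Series
upper = s.str.upper()
doubled = upper + upper  # works because the result is not categorical

# .dt on a categorical with datetime categories behaves the same way
d = pd.Series(
    pd.to_datetime(["2011-01-01", "2011-01-01", "2012-06-15"])
).astype("category")
years = d.dt.year

print(upper.tolist())
print(doubled.tolist())
print(years.tolist())
```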

@jorisvandenbossche

jorisvandenbossche Nov 12, 2015

Member

I don't think we should try to return a categorical here. There are way too many corner cases to deal with across all of the string manipulation functions, and they are all a bit subjective.
It will be a lot simpler to just return the resulting series as an object series (or boolean series, depending on the function). Just being able to use the str accessor methods is already an improvement.

E.g. what to do with pd.Series(list('abAb'), dtype='category').str.upper()? Merge categories?
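The corner case can be made concrete: upper-casing the categories produces duplicates, so they cannot be reused as categories as-is. The sketch below only shows the collision and one possible resolution (transform the values, then re-categorize); whether merging is the right answer is exactly the subjective part.

```python
import pandas as pd

s = pd.Series(list("abAb"), dtype="category")

# Categories are 'A', 'a', 'b'; upper-casing maps 'A' and 'a' to the same label
upper_cats = s.cat.categories.str.upper()
collides = not upper_cats.is_unique

# One possible resolution: transform the values, then build new categories
merged = s.astype(object).str.upper().astype("category")

print(collides)
print(list(merged.cat.categories))
print(merged.tolist())
```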

@jorisvandenbossche

jorisvandenbossche Nov 12, 2015

Member

I think the concatenation of strings with + is a different issue from the str accessor. I am not sure I am in favor of adding this. Disallowing addition (and other operations) seems like a design choice to me. E.g. integer categories cannot be added to each other either.

@jankatins

jankatins Nov 12, 2015

Contributor

@jorisvandenbossche s + s is only allowed for series of type string, not category (before and after this PR :-)). The PR makes series_of_type_cat.str.<method> return a Series of the same type as series_of_type_string.str.<method> (and not of type category).

@jorisvandenbossche

jorisvandenbossche Nov 12, 2015

Member

Ah, sorry, I misread your comment above as saying you wanted to do s + s on categoricals (but you want to do it on the results of .str, which is a reason not to return a categorical series?)

@jankatins

jankatins Nov 12, 2015

Contributor

@jorisvandenbossche yep, exactly: if cat.str.whatever() returns a series of type category, you cannot add the resulting strings or ints (e.g. from .dt.days).

@jreback jreback modified the milestones: 0.17.1, Next Major Release Nov 17, 2015

@jreback

jreback Nov 18, 2015

Contributor

closed by #11582
