ENH: support .astype('category') on DataFrame / aka co-factorization #12860

Closed
jreback opened this Issue Apr 11, 2016 · 6 comments

jreback commented Apr 11, 2016

xref to #10696, #8709

We currently don't allow calling astype on a DataFrame to convert it to category directly:

In [44]: df.astype('category')
NotImplementedError: > 1 ndim Categorical are not supported at this time

Instead you can apply the astype per-column.

In [35]: df = DataFrame({'A' : list('aabcda'), 'B' : list('bcdaae')})

In [36]: df
Out[36]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

In [37]: df.apply(lambda x: x.astype('category'))
Out[37]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

In [38]: df.apply(lambda x: x.astype('category')).B
Out[38]: 
0    b
1    c
2    d
3    a
4    a
5    e
Name: B, dtype: category
Categories (5, object): [a, b, c, d, e]

In [39]: df.apply(lambda x: x.astype('category')).A
Out[39]: 
0    a
1    a
2    b
3    c
4    d
5    a
Name: A, dtype: category
Categories (4, object): [a, b, c, d]

But if the columns have 'similar' categories, you would usually want to do this instead,
automatically astyping every column with the same uniques.

In [41]: uniques = np.sort(pd.unique(df.values.ravel()))

In [42]: df.apply(lambda x: x.astype('category', categories=uniques)).A
Out[42]: 
0    a
1    a
2    b
3    c
4    d
5    a
Name: A, dtype: category
Categories (5, object): [a, b, c, d, e]

In [43]: df.apply(lambda x: x.astype('category', categories=uniques)).B
Out[43]: 
0    b
1    c
2    d
3    a
4    a
5    e
Name: B, dtype: category
Categories (5, object): [a, b, c, d, e]

This is fairly straightforward to implement, and I think it makes for a nice, easy way of coding, without having to support 2-D categoricals internally (and we are moving away from internal 2-D structures anyhow).
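For readers on newer pandas releases, here is a minimal sketch of the same shared-uniques conversion. It assumes a pandas version in which the categories= keyword of astype is no longer accepted and a CategoricalDtype (from pandas.api.types) is passed instead; the name shared is only illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("aabcda"), "B": list("bcdaae")})

# Collect the sorted uniques across ALL columns, as in In [41] above.
uniques = np.sort(pd.unique(df.to_numpy().ravel()))

# One dtype object shared by every column (illustrative name).
shared = CategoricalDtype(categories=uniques)

# DataFrame.astype applies the dtype column by column.
out = df.astype(shared)

print(out["A"].cat.categories.tolist())
print(out["B"].cat.categories.tolist())
```

Because both columns carry the same dtype, their integer codes refer to the same category positions, which is the point of forming the uniques first.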

@jreback jreback added this to the Next Major Release milestone Apr 11, 2016

jreback commented Apr 11, 2016

@jankatins

jankatins commented Apr 11, 2016

I'm not sure what you are proposing here. That df.astype('category', ...) would internally be mapped to df.apply(lambda x: x.astype('category', ...))?

For my use case df.astype(...) is not necessary: I usually have different types of columns, and doing an astype on the complete df would just destroy the df... But if others have the need...

My use case is more like:

lickert_columns= [...] # a few of the columns in my df
for col in lickert_columns:
    df[col] = df[col].astype("category", categories=lickert_scale, ordered=True)
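The loop above can also be sketched without a loop, assuming a pandas version where DataFrame.astype accepts a dict mapping column names to dtypes and where CategoricalDtype is available. The column names, the scale values, and the sample data below are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data: two Likert-style columns and one free-text column.
df = pd.DataFrame({
    "q1": ["agree", "neutral", "disagree"],
    "q2": ["neutral", "neutral", "agree"],
    "note": ["x", "y", "z"],
})

likert_scale = ["disagree", "neutral", "agree"]  # assumed ordering
likert_columns = ["q1", "q2"]

likert_dtype = CategoricalDtype(categories=likert_scale, ordered=True)

# Only the listed columns are converted; all others keep their dtype.
out = df.astype({col: likert_dtype for col in likert_columns})

# Ordered categoricals support comparisons against a category value.
above_neutral = out["q1"] > "neutral"
```

The dict form leaves the remaining columns alone, so there is no need to convert the whole frame.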

jreback commented Apr 11, 2016

Well, you would oftentimes do this on a subset, I think, e.g. df[['A','B']].astype(...)

The reason I bring this up is the question of whether we should form the uniques FIRST, before any conversion. In other words, if categories=None is passed (which is the default), then we would create the categories explicitly from ALL of the passed values, as opposed to creating them individually per column.

jankatins commented Apr 11, 2016

IMO constructing the categories from all uniques makes sense.

[How would one merge these subsets back into the original DF? By dropping the old columns and merging the new ones back in? That sounds like a lot of work, which ends up as long as the for loop.]
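One possible answer to the write-back question, as a rough sketch only: build the shared categories from the chosen subset and hand DataFrame.astype a dict, so the untouched columns pass through unchanged. The helper name astype_shared_category is hypothetical (it is not a pandas function), and the sketch assumes the selected columns contain no missing values.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def astype_shared_category(df, columns):
    """Hypothetical helper: convert `columns` to one shared categorical dtype.

    Categories are the uniques over ALL selected columns, in order of
    first appearance. Assumes no missing values in those columns.
    """
    flat = df[columns].to_numpy().ravel()
    shared = CategoricalDtype(categories=pd.unique(flat))
    # A dict restricts the conversion to the selected columns.
    return df.astype({col: shared for col in columns})


df = pd.DataFrame({"A": list("aab"), "B": list("bca"), "n": [1, 2, 3]})
out = astype_shared_category(df, ["A", "B"])
```

No dropping or merging is needed here, because astype returns a full copy of the frame with only the requested columns converted.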

jreback commented Oct 16, 2017

example from SO

Here is a complete example

In [9]: np.random.seed(1234)

In [10]: import string

In [11]: df = pd.DataFrame([np.random.choice(list(string.ascii_lowercase), 10) for i in range(5)])

In [12]: df
Out[12]: 
   0  1  2  3  4  5  6  7  8  9
0  p  t  g  v  m  u  y  z  p  r
1  x  j  l  m  w  y  q  f  q  j
2  w  p  s  q  m  f  c  g  d  h
3  l  a  j  l  q  d  c  t  m  b
4  l  t  l  r  o  t  h  k  l  o

In [14]: b = pd.unique(df.values.T.reshape(-1, ))
    ...: df.apply(lambda x: pd.Categorical(x, b).codes)
Out[14]: 
   0  1  2   3   4   5   6   7   8   9
0  0  4  7   9  10  14  15  20   0  12
1  1  5  3  10   2  15  11  16  11   5
2  2  0  8  11  10  16  18   7  17  19
3  3  6  5   3  11  17  18   4  10  22
4  3  4  3  12  13   4  19  21   3  13

jreback commented Oct 16, 2017

Note this can actually be implemented in a more performant way via https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py#L1453
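As a hedged illustration of co-factorizing in a single pass, the sketch below uses the public pd.factorize on the flattened values. This is not the internal routine linked above, only an assumption about what a single-pass version might look like with public API; it reproduces the codes that the per-column pd.Categorical(x, b).codes approach gives.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("aabcda"), "B": list("bcdaae")})

# Flatten column by column, matching df.values.T.reshape(-1) above.
flat = df.to_numpy().T.ravel()

# One factorization over all values: codes follow order of first appearance.
codes, uniques = pd.factorize(flat)

# Reshape back to (n_columns, n_rows), then transpose to the frame's shape.
out = pd.DataFrame(
    codes.reshape(df.shape[::-1]).T, index=df.index, columns=df.columns
)

# Column A -> [0, 0, 1, 2, 3, 0], column B -> [1, 2, 3, 0, 0, 4]
print(out)
```

This factorizes the data once instead of once per column; whether it is actually faster in practice would need to be measured.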

@jreback jreback changed the title from ENH: support .astype('category') on DataFrame to ENH: support .astype('category') on DataFrame / aka co-factorization Oct 16, 2017

@jschendel jschendel referenced this issue Nov 3, 2017

Merged

ENH: Implement DataFrame.astype('category') #18099

4 of 4 tasks complete

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Feb 24, 2018
