fill_value kwarg for unstack #9746

Closed
amcpherson opened this Issue Mar 29, 2015 · 3 comments

Contributor

amcpherson commented Mar 29, 2015

Currently:

In [2]: df = pd.DataFrame({'x':['a', 'a', 'b'], 'y':['j', 'k', 'j'], 'z':[0, 1, 2]})

In [3]: df.set_index(['x', 'y']).unstack()
Out[3]:
   z
y  j   k
x
a  0   1
b  2 NaN

If I want to fill with -1, I need to fillna and then astype back to int. Ideally:

In [3]: df.set_index(['x', 'y']).unstack(fill_value=-1)
Out[3]:
   z
y  j   k
x
a  0   1
b  2  -1
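
For comparison, a minimal sketch of the two-step workaround described above (fillna, then astype). The name `wide` is illustrative and not from the original post, and `int64` is an assumed target dtype.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b'], 'y': ['j', 'k', 'j'], 'z': [0, 1, 2]})

# unstack() leaves NaN in the missing (b, k) cell, which upcasts z to float64
wide = df.set_index(['x', 'y']).unstack()

# Workaround: fill the gap, then cast back to an integer dtype by hand
# (int64 is an assumption; pick whatever integer dtype the data needs)
wide = wide.fillna(-1).astype('int64')

print(wide)
print(wide.dtypes)
```

The cast is a separate pass over the data, which is the extra step a `fill_value` keyword would avoid.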
Contributor

jreback commented Mar 29, 2015

You can do this by specifying the downcast keyword. This is NOT automatic because, as a general operation, it can be expensive.

In [10]: df.set_index(['x','y']).unstack().fillna(-1,downcast='infer')
Out[10]: 
   z   
y  j  k
x      
a  0  1
b  2 -1

In [11]: df.set_index(['x','y']).unstack().fillna(-1,downcast='infer').dtypes
Out[11]: 
   y
z  j    int64
   k    int64
dtype: object

jreback closed this Mar 29, 2015

Contributor

amcpherson commented Mar 29, 2015

There may be some merit to this being allowed directly, even if the functionality can be accomplished with a series of operations. For instance, when trying to limit memory usage on a big dataset, perhaps it would be preferable to keep the data as np.int8.

In [15]: idx = np.array([0, 0, 1], dtype=np.int32)

In [16]: idx2 = np.array([0, 1, 0], dtype=np.int8)

In [17]: value = np.array([0, 1, 2], dtype=np.int8)

In [18]: df = pd.DataFrame({'idx':idx, 'idx2':idx2, 'value':value})

In [19]: df.dtypes
Out[19]:
idx      int32
idx2      int8
value     int8
dtype: object

In [20]: df.set_index(['idx', 'idx2']).unstack().dtypes
Out[20]:
       idx2
value  0       float64
       1       float64
dtype: object

After the unstack my data table is suddenly much larger than necessary.

Also, from looking at the code this would be fairly trivial to implement, without much impact on existing code.
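
To make the size difference concrete, here is a rough sketch that compares raw buffer sizes before and after the unstack. It rebuilds the frame from the session above. `nbytes` on the NumPy array is used as a simple proxy for memory, and the names `long_bytes`, `wide`, and `narrow` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'idx': np.array([0, 0, 1], dtype=np.int32),
    'idx2': np.array([0, 1, 0], dtype=np.int8),
    'value': np.array([0, 1, 2], dtype=np.int8),
})

# Three int8 values: one byte each
long_bytes = df['value'].to_numpy().nbytes

# The missing (1, 1) cell becomes NaN, so every cell is stored as float64
wide = df.set_index(['idx', 'idx2']).unstack()
wide_bytes = wide.to_numpy().nbytes

# Casting back recovers int8, but only after the float64 table was allocated
narrow = wide.fillna(-1).astype(np.int8)
narrow_bytes = narrow.to_numpy().nbytes

print(long_bytes, wide_bytes, narrow_bytes)
```

Even with the cast back to int8, the intermediate float64 table still has to be allocated, which is the cost that matters on a big dataset.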

Contributor

jreback commented Mar 29, 2015

@amcpherson ok, if you can find a reasonable way to do this without affecting performance, then it would be ok to have a fill_value argument.

jreback reopened this Dec 11, 2015

jreback added this to the 0.18.0 milestone Dec 11, 2015

jreback closed this in de46056 Jan 30, 2016
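
A short sketch of the keyword in use, assuming a pandas release at or after the 0.18.0 milestone noted above, where `unstack` accepts `fill_value`. The frame is the int8 example from earlier in the thread. Keeping the original dtype is the intended behaviour, but it is worth checking on your own version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'idx': np.array([0, 0, 1], dtype=np.int32),
    'idx2': np.array([0, 1, 0], dtype=np.int8),
    'value': np.array([0, 1, 2], dtype=np.int8),
})

# Assumes unstack() accepts fill_value (pandas >= 0.18.0).
# The missing (1, 1) cell is filled with -1 instead of NaN.
wide = df.set_index(['idx', 'idx2']).unstack(fill_value=-1)

print(wide)
print(wide.dtypes)
```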
