fill_value kwarg for unstack #9746

Closed
amcpherson opened this issue Mar 29, 2015 · 3 comments

@amcpherson
Contributor

commented Mar 29, 2015

Currently:

In [2]: df = pd.DataFrame({'x':['a', 'a', 'b'], 'y':['j', 'k', 'j'], 'z':[0, 1, 2]})

In [3]: df.set_index(['x', 'y']).unstack()
Out[3]:
   z
y  j   k
x
a  0   1
b  2 NaN

If I want to fill with -1, I need to call fillna and then astype to cast back to int. Ideally:

In [3]: df.set_index(['x', 'y']).unstack(fill_value=-1)
Out[3]:
   z
y  j   k
x
a  0   1
b  2  -1
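
For reference, the fillna-then-astype workaround mentioned above might look like the following. This is a minimal sketch on the same toy frame; the variable names `wide` and `filled` are illustrative only, and the explicit `'int64'` target is an assumption about the dtype one wants back.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b'], 'y': ['j', 'k', 'j'], 'z': [0, 1, 2]})

# unstack leaves the missing (b, k) cell as NaN, which forces float64 columns
wide = df.set_index(['x', 'y']).unstack()

# fill the hole, then cast back to an integer dtype in a second pass
filled = wide.fillna(-1).astype('int64')
print(filled)
print(filled.dtypes)
```

The cost is two extra passes over the data plus a temporary float64 copy, which is what a fill_value argument on unstack would avoid.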
@jreback

Contributor

commented Mar 29, 2015

You can do this by specifying the downcast keyword. This is NOT automatic because, as a general operation, it can be expensive.

In [10]: df.set_index(['x','y']).unstack().fillna(-1,downcast='infer')
Out[10]: 
   z   
y  j  k
x      
a  0  1
b  2 -1

In [11]: df.set_index(['x','y']).unstack().fillna(-1,downcast='infer').dtypes
Out[11]: 
   y
z  j    int64
   k    int64
dtype: object
@amcpherson

Contributor Author

commented Mar 29, 2015

There may be some merit to this being allowed directly, even if the functionality can be accomplished with a series of operations. For instance, when trying to limit memory usage on a big dataset, perhaps it would be preferable to keep the data as np.int8.

In [15]: idx = np.array([0, 0, 1], dtype=np.int32)

In [16]: idx2 = np.array([0, 1, 0], dtype=np.int8)

In [17]: value = np.array([0, 1, 2], dtype=np.int8)

In [18]: df = pd.DataFrame({'idx':idx, 'idx2':idx2, 'value':value})

In [19]: df.dtypes
Out[19]:
idx      int32
idx2      int8
value     int8
dtype: object

In [20]: df.set_index(['idx', 'idx2']).unstack().dtypes
Out[20]:
       idx2
value  0       float64
       1       float64
dtype: object

After the unstack my data table is suddenly much larger than necessary.
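
To put a rough number on that growth, one can compare the per-cell storage before and after the unstack. This is only a sketch on the same three-row example; the names `bytes_per_cell_before` and `bytes_per_cell_after` are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = np.array([0, 0, 1], dtype=np.int32)
idx2 = np.array([0, 1, 0], dtype=np.int8)
value = np.array([0, 1, 2], dtype=np.int8)
df = pd.DataFrame({'idx': idx, 'idx2': idx2, 'value': value})

# the (idx=1, idx2=1) combination is absent, so unstack inserts NaN
wide = df.set_index(['idx', 'idx2']).unstack()

# bytes used per stored value, before and after the reshape
bytes_per_cell_before = df['value'].dtype.itemsize
bytes_per_cell_after = wide.dtypes.iloc[0].itemsize
print(bytes_per_cell_before, bytes_per_cell_after)
```

Each value goes from one byte (int8) to eight bytes (float64), independent of how many cells are actually missing.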

Also, from looking at the code this would be fairly trivial to implement, without much impact on existing code.

@jreback

Contributor

commented Mar 29, 2015

@amcpherson OK, if you can find a reasonable way to do this without affecting performance, then it would be fine to have a fill_value argument.
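
For readers trying the proposed call: the sketch below assumes a pandas release in which unstack accepts a fill_value keyword, which was not the case when this issue was opened. It reuses the toy frame from the first comment; `filled` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b'], 'y': ['j', 'k', 'j'], 'z': [0, 1, 2]})

# assumption: this pandas version supports unstack(fill_value=...)
# the missing (b, k) cell is filled directly instead of becoming NaN
filled = df.set_index(['x', 'y']).unstack(fill_value=-1)
print(filled)
print(filled.dtypes)
```

Because no NaN is ever introduced, the columns do not need to be upcast to float and then cast back.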
