Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas Summarize needs to validate each result #138

Closed
machow opened this issue Oct 22, 2019 · 2 comments
Closed

pandas Summarize needs to validate each result #138

machow opened this issue Oct 22, 2019 · 2 comments
Labels

Comments

@machow
Copy link
Owner

machow commented Oct 22, 2019

Currently, if an expression passed to summarize produces a Series of length 3, only the first entry will be kept. This is because summarize creates a DataFrame with index = [0], and pandas uses this index subset the Series result.

Could do a simple length check on the Series.

Once this is complete, letting Summarize refer back to just-defined columns seems doable... (#20)

Example

import pandas as pd
from siuba import _, summarize

df = pd.DataFrame({'x': [1,2,3]})

# length 3 Series
df.x.mode()

# length 1 row data frame
summarize(df, result = _.x.mode())
@machow machow added the time:3 A couple hours label Oct 22, 2019
machow added a commit that referenced this issue Oct 24, 2019
@machow
Copy link
Owner Author

machow commented May 12, 2020

Was pairing w/ @mariekers, who hit this issue, so probably time to fix. Her specific issue was that if you do a elementwise operation that just so happens to work like an aggregation, summarize breaks.

For example,

import pandas as pd
from siuba import _, summarize

# note that all groups only have 1 row
df = pd.DataFrame({'g': ['a', 'b', 'c'], 'x': [1, 2, 3], 'y': [4,5,6]})

summarize(df.groupby('g'), res =  _.x / _.y)

Returns

   g   res
0  a  0.25
1  b   NaN
2  c   NaN

which is pretty embarrassing. Note that fast_summarize raises an informative error, which is way better. For slow summarize, we should just use the result's underlying array (to avoid the assumption that the result's index is [0]).

@machow
Copy link
Owner Author

machow commented May 12, 2020

I think ultimately the main challenge here is performance. Within each grouping, we might calculate three columns...

import pandas as pd
a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6], index = [4,5,6])
c = 1

Then need to construct a DataFrame

# think of something like...
pd.DataFrame({'a': a, 'b': b, 'c': c})

But this will produce crazy results

     a    b  c
0  1.0  NaN  1
1  2.0  NaN  1
2  3.0  NaN  1
4  NaN  4.0  1
5  NaN  5.0  1
6  NaN  6.0  1

One way out is to strip out the underlying arrays...

pd.DataFrame({'a': a.array, 'b': b.array, 'c': c})

All this finagling is not great for speed, but is exactly why the fast_summarize function exists: it plans out how to handle operations beforehand, so you can plan once per summarize argument, rather than post-hoc instance checking. So might as well instance check here, and prioritize getting fast summarize out of experimental + user testing and documenting it. (plus, similar issues will arise and are likely dealt with in pandas transform etc..)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: No status
Development

No branches or pull requests

1 participant