pandas Summarize needs to validate each result #138

machow · 2019-10-22T20:07:43Z

Currently, if an expression passed to summarize produces a Series of length 3, only the first entry will be kept. This is because summarize creates a DataFrame with index = [0], and pandas uses this index subset the Series result.

Could do a simple length check on the Series.

Once this is complete, letting Summarize refer back to just-defined columns seems doable... (#20)

Example

import pandas as pd
from siuba import _, summarize

df = pd.DataFrame({'x': [1,2,3]})

# length 3 Series
df.x.mode()

# length 1 row data frame
summarize(df, result = _.x.mode())

The text was updated successfully, but these errors were encountered:

machow · 2020-05-12T21:07:51Z

Was pairing w/ @mariekers, who hit this issue, so probably time to fix. Her specific issue was that if you do a elementwise operation that just so happens to work like an aggregation, summarize breaks.

For example,

import pandas as pd
from siuba import _, summarize

# note that all groups only have 1 row
df = pd.DataFrame({'g': ['a', 'b', 'c'], 'x': [1, 2, 3], 'y': [4,5,6]})

summarize(df.groupby('g'), res =  _.x / _.y)

Returns

   g   res
0  a  0.25
1  b   NaN
2  c   NaN

which is pretty embarrassing. Note that fast_summarize raises an informative error, which is way better. For slow summarize, we should just use the result's underlying array (to avoid the assumption that the result's index is [0]).

machow · 2020-05-12T21:27:24Z

I think ultimately the main challenge here is performance. Within each grouping, we might calculate three columns...

import pandas as pd
a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6], index = [4,5,6])
c = 1

Then need to construct a DataFrame

# think of something like...
pd.DataFrame({'a': a, 'b': b, 'c': c})

But this will produce crazy results

     a    b  c
0  1.0  NaN  1
1  2.0  NaN  1
2  3.0  NaN  1
4  NaN  4.0  1
5  NaN  5.0  1
6  NaN  6.0  1

One way out is to strip out the underlying arrays...

pd.DataFrame({'a': a.array, 'b': b.array, 'c': c})

All this finagling is not great for speed, but is exactly why the fast_summarize function exists: it plans out how to handle operations beforehand, so you can plan once per summarize argument, rather than post-hoc instance checking. So might as well instance check here, and prioritize getting fast summarize out of experimental + user testing and documenting it. (plus, similar issues will arise and are likely dealt with in pandas transform etc..)

machow added the time:3 A couple hours label Oct 22, 2019

machow added a commit that referenced this issue Oct 24, 2019

add skipped test for issue #138

fa1f1f5

machow added the high-priority label Mar 6, 2020

machow mentioned this issue May 12, 2020

fix: pandas summarize validate len, use Series arrays #231

Merged

machow closed this as completed May 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pandas Summarize needs to validate each result #138

pandas Summarize needs to validate each result #138

machow commented Oct 22, 2019 •

edited

machow commented May 12, 2020

machow commented May 12, 2020 •

edited

pandas Summarize needs to validate each result #138

pandas Summarize needs to validate each result #138

Comments

machow commented Oct 22, 2019 • edited

machow commented May 12, 2020

machow commented May 12, 2020 • edited

machow commented Oct 22, 2019 •

edited

machow commented May 12, 2020 •

edited