Grouped DataFrame with array elements fails to combine #3424

Closed
huangyxi opened this issue Jan 31, 2024 · 4 comments

@huangyxi

Calling combine on a grouped DataFrame produces differently shaped output depending on what the aggregation returns. With :v=>sum, where each element of v is an array, I expect one row per group, with the summed array kept intact in the aggregated column. Instead, the summed array is expanded into multiple rows of scalar values.

julia> a = DataFrame(k=[1,1,2,2], v=[[1,2],[3,4],[5,6],[7,8]], t=[1,2,3,4])
4×3 DataFrame
 Row │ k      v       t
     │ Int64  Array…  Int64
─────┼──────────────────────
   1 │     1  [1, 2]      1
   2 │     1  [3, 4]      2
   3 │     2  [5, 6]      3
   4 │     2  [7, 8]      4

julia> combine(groupby(a, :k), :v=>sum)
4×2 DataFrame
 Row │ k      v_sum
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     1      6
   3 │     2     12
   4 │     2     14

julia> combine(groupby(a, :k), :t=>sum)
2×2 DataFrame
 Row │ k      t_sum
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      7

Expected:

julia> combine(groupby(a, :k), :v=>sum)
2×2 DataFrame
 Row │ k      v_sum
     │ Int64  Array…
─────┼─────────────────
   1 │     1  [4, 6]
   2 │     2  [12, 14]
@huangyxi
Author

Another example:

julia> combine(groupby(a, :k), :t=>stack)
4×2 DataFrame
 Row │ k      t_stack
     │ Int64  Int64
─────┼────────────────
   1 │     1        1
   2 │     1        2
   3 │     2        3
   4 │     2        4

Expected behavior:

julia> combine(groupby(a, :k), :t=>stack)
2×2 DataFrame
 Row │ k      t_stack
     │ Int64  Array…
─────┼────────────────
   1 │     1  [1, 2]
   2 │     2  [3, 4]

I'm not sure about the internal mechanism, but it looks as if the result is flattened after the per-group results are combined.

@bkamins
Member

bkamins commented Jan 31, 2024

This is a design feature. The rule is that if a function returns a vector, the vector gets expanded into multiple rows. The reason is that in the vast majority of cases this is what users expect, and requiring them to flatten the result every time would be inconvenient.
Note that even the simplest :a => identity requires flattening to produce a correct result.

It is important to understand that aggregation functions decide how to handle the results of transformations based on the VALUE returned, not on which function was called. Relying on the function called would produce many special cases that would be even harder to learn.
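As a rough illustration of the value-based rule, here is a minimal sketch that reuses the DataFrame from the issue. The variable names (`g`, `scalar_res`, and so on) are only for this example. The point is that the number of rows depends on whether the function returns a scalar, a vector, or a wrapped vector, not on which function is called.

```julia
using DataFrames

a = DataFrame(k=[1, 1, 2, 2], v=[[1, 2], [3, 4], [5, 6], [7, 8]], t=[1, 2, 3, 4])
g = groupby(a, :k)

# sum over integers returns a scalar: one row per group
scalar_res = combine(g, :t => sum)

# identity returns the group's column, which is a vector: one row per element
vector_res = combine(g, :t => identity)

# sum over vectors also returns a vector, so it is expanded the same way
expanded_res = combine(g, :v => sum)

# wrapping the returned vector in Ref makes it count as a single row
wrapped_res = combine(g, :v => Ref ∘ sum)

# (rows in each result)
(nrow(scalar_res), nrow(vector_res), nrow(expanded_res), nrow(wrapped_res))
```

Here sum appears in three of the calls, yet only the shape of the value it returns determines whether the result has two rows or four.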

Your case is rare (applying sum over a vector of vectors), so the decision was that it should be handled by a special rule. As you have found, and as is written in the docstring:

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.

So you can write e.g. one of these (whichever is easier to remember for you):

julia> combine(groupby(a, :k), :v=>Ref∘sum)
2×2 DataFrame
 Row │ k      v_Ref_sum
     │ Int64  Array…
─────┼──────────────────
   1 │     1  [4, 6]
   2 │     2  [12, 14]

julia> combine(groupby(a, :k), :v=>fill∘sum)
2×2 DataFrame
 Row │ k      v_fill_sum
     │ Int64  Array…
─────┼───────────────────
   1 │     1  [4, 6]
   2 │     2  [12, 14]

Either of these gives you what you want.
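For readers unfamiliar with the two wrappers, the following sketch uses only Base Julia to show what Ref and fill produce when given a vector. The variable names are illustrative. It is meant to make the docstring's "Ref or 0-dimensional AbstractArray" wording concrete, not to describe how DataFrames.jl is implemented internally.

```julia
# The per-group value from the issue: summing a vector of vectors
s = sum([[1, 2], [3, 4]])   # adds the vectors element-wise

r = Ref(s)    # a reference wrapper holding the vector as one value
z = fill(s)   # fill with no dimensions gives a 0-dimensional Array holding s

zero_dim = ndims(z) == 0            # z has no dimensions at all
same_value = r[] == s && z[] == s   # indexing with [] unwraps either wrapper
```

In both cases the vector is carried as a single element, which is why combine treats it as one row once the wrapper is removed.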

@huangyxi
Author

Thank you for your response. I have updated the document to ensure proper dissemination.
