Clarify documentation for the agg_list argument in Expr.map_batches #13612

Closed
Wainberg opened this issue Jan 10, 2024 · 7 comments · Fixed by #13625
Labels
accepted Ready for implementation documentation Improvements or additions to documentation

Comments

@Wainberg
Contributor

Description

The documentation for Expr.map_batches pithily describes the function of the agg_list argument as "Aggregate list". What does this argument do? It would be good to update the documentation.

@Wainberg Wainberg added the enhancement New feature or an improvement of an existing feature label Jan 10, 2024
@deanm0000
Collaborator

It seems to exist for map_elements. I don't think it has a practical use outside of map_elements, but I haven't experimented with it much.

@Wainberg
Contributor Author

map_elements doesn't have an agg_list argument, though.

@deanm0000
Collaborator

Sorry, to clarify: map_elements calls map_batches, and in doing so it sets that parameter under certain conditions that I don't remember offhand.

@cmdlineluser
Contributor

cmdlineluser commented Jan 10, 2024

Looks like it controls ApplyOptions::ApplyList

let (collect_groups, name) = if agg_list {
    (ApplyOptions::ApplyList, MAP_LIST_NAME)

Which is defined here:

pub enum ApplyOptions {
    /// Collect groups to a list and apply the function over the groups.
    /// This can be important in aggregation context.
    // e.g. [g1, g1, g2] -> [[g1, g1], g2]
    GroupWise,
    // collect groups to a list and then apply
    // e.g. [g1, g1, g2] -> list([g1, g1, g2])
    ApplyList,
    // do not collect before apply
    // e.g. [g1, g1, g2] -> [g1, g1, g2]
    ElementWise,

@Wainberg
Contributor Author

Does it actually do anything? I haven't been able to find an example where it changes the result.

@reswqa
Collaborator

reswqa commented Jan 11, 2024

The two settings differ mainly in the aggregation context. If agg_list is False, the UDF is called once per group. If agg_list is True, the UDF is invoked only once, on a single list Series that holds all the groups.

Let's use an example to illustrate this further:

import polars as pl

df = pl.DataFrame(
    {
        "a": [0, 1, 0, 1],
        "b": [1, 2, 3, 4],
    }
)

def f(x):
    # Print whatever Polars passes to the UDF, then return it unchanged.
    print(x)
    return x
  1. Disable agg_list:
df.group_by("a").agg(pl.col("b").map_batches(f, agg_list=False))

# first output
Series: '' [i64]
[
	2
	4
]

# second output
Series: '' [i64]
[
	1
	3
]
  2. Enable agg_list:
df.group_by("a").agg(pl.col("b").map_batches(f, agg_list=True))

# output
Series: 'b' [list[i64]]
[
	[2, 4]
	[1, 3]
]
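
For readers without Polars at hand, the two calling patterns can be mimicked in plain Python. This is only a rough sketch of the behaviour described above, not how Polars implements it: group_values, apply_per_group, and apply_on_list are hypothetical helpers invented for illustration, and ordinary lists stand in for Series. Group order here follows first appearance, whereas group_by in Polars does not guarantee an order unless asked to.

```python
from collections import defaultdict


def group_values(keys, values):
    """Collect values per key, keeping keys in first-seen order."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return dict(groups)


def apply_per_group(groups, udf):
    # Mimics agg_list=False: the UDF is called once for every group.
    return {key: udf(vals) for key, vals in groups.items()}


def apply_on_list(groups, udf):
    # Mimics agg_list=True: the UDF is called a single time and
    # receives one list whose entries are the groups.
    return udf(list(groups.values()))


calls = []


def f(x):
    # Record each argument so the number of UDF calls can be counted.
    calls.append(x)
    return x


# Same data as the example above: column "a" as keys, column "b" as values.
groups = group_values([0, 1, 0, 1], [1, 2, 3, 4])

per_group = apply_per_group(groups, f)
per_group_calls = len(calls)

calls.clear()
on_list = apply_on_list(groups, f)
on_list_calls = len(calls)

print(per_group_calls, on_list_calls)
```

The call counts are the point: the per-group variant calls f once for each of the two groups, while the list variant calls it once with both groups nested in a single argument.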

@reswqa
Collaborator

reswqa commented Jan 11, 2024

Maybe I can update the documentation to make this easier to understand.

@reswqa reswqa added the documentation Improvements or additions to documentation label Jan 11, 2024
@reswqa reswqa self-assigned this Jan 11, 2024
@reswqa reswqa added accepted Ready for implementation and removed enhancement New feature or an improvement of an existing feature labels Jan 11, 2024
4 participants