
Cache multiple identical map_elements calls within a function call #13895

Closed
Adrianf23 opened this issue Jan 22, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

Adrianf23 commented Jan 22, 2024

Description

I asked a question about map_elements on Discord, and @MarcoGorelli suggested that I open a feature request about a possible optimization (this point may need clarification). I want to run a UDF (or a third-party library function) that has multiple return values and assign those values to different columns. Because the function is computationally heavy, I am looking for a way to run it only once per item. If one of the methods below is the preferred way to do things, I think it would be helpful to add it to the documentation.

Multiple function calls per row (total = number of rows * number of new cols)

In this example, you can see that the function runs twice per row, once for each list.get() call, so it prints "here" six times in total.

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).lazy()


def foo(x):
    print("here")
    return (x, x + 1)


out = df.with_columns(
    r1=pl.col("a").map_elements(foo).list.get(0),
    r2=pl.col("a").map_elements(foo).list.get(1),
).collect()

print(out)

here
here
here
here
here
here
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ r1  ┆ r2  │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 4   ┆ 1   ┆ 2   │
│ 2   ┆ 5   ┆ 2   ┆ 3   │
│ 3   ┆ 6   ┆ 3   ┆ 4   │
└─────┴─────┴─────┴─────┘

One function call per row (total = number of rows)

In this example, the function isn't called more than once per row. This is closest to what I was looking for, but we do have to create the intermediate r column and then drop it. I believe this is where the aforementioned optimization could come in. I am not sure exactly how to tackle it, but I would like to see a syntax or argument that lets you keep or drop the intermediate result.

out_2 = (
    df.with_columns(r=pl.col("a").map_elements(foo))
    .with_columns(
        r1=pl.col("r").list.get(0),
        r2=pl.col("r").list.get(1),
    )
    .drop("r")
    .collect()
)

print(out_2)

here
here
here
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ r1  ┆ r2  │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 4   ┆ 1   ┆ 2   │
│ 2   ┆ 5   ┆ 2   ┆ 3   │
│ 3   ┆ 6   ┆ 3   ┆ 4   │
└─────┴─────┴─────┴─────┘

Note: Unnest a dictionary of return elements

This might work if you define the function yourself. I think it would be tedious to make a wrapper around a third-party library so that it returns a dictionary.

import polars as pl


def exp_and_x_root(x):
    exp = x**x
    x_root = x ** (1 / x)
    return {"x^x": exp, "x^(1/x)": x_root}

df = pl.DataFrame({"num_list": [1, 2, 3]})
out = df.with_columns(
    result=pl.col("num_list").map_elements(exp_and_x_root)
).unnest("result")

print(out)

shape: (3, 3)
┌──────────┬─────┬──────────┐
│ num_list ┆ x^x ┆ x^(1/x)  │
│ ---      ┆ --- ┆ ---      │
│ i64      ┆ i64 ┆ f64      │
╞══════════╪═════╪══════════╡
│ 1        ┆ 1   ┆ 1.0      │
│ 2        ┆ 4   ┆ 1.414214 │
│ 3        ┆ 27  ┆ 1.44225  │
└──────────┴─────┴──────────┘

I searched the enhancement-tagged issues for something that specifically matched this but didn't find anything. @MarcoGorelli, if I missed something in the feature request, please let me know.

@Adrianf23 Adrianf23 added the enhancement New feature or an improvement of an existing feature label Jan 22, 2024
@MarcoGorelli MarcoGorelli changed the title Create multiple columns from a single map_elements function call Cache multiple identical map_elements calls within a function call Jan 22, 2024
@MarcoGorelli
Collaborator

Thanks @Adrianf23 for the report

To put this more concisely: the request is that if you write

df.with_columns(
    r1=pl.col("a").map_elements(foo).list.get(0),
    r2=pl.col("a").map_elements(foo).list.get(1),
)

then Polars should cache the common map_elements(foo) part, so that it becomes equivalent to

(
    df.with_columns(r=pl.col("a").map_elements(foo))
    .with_columns(
        r1=pl.col("r").list.get(0),
        r2=pl.col("r").list.get(1),
    )
    .drop("r")
)

This looks reasonable to me, but it's too far from anything I've personally worked on. @reswqa, do you know about this one?

@ritchie46
Member

We cannot trust UDFs to be pure. A UDF could read a global counter, use a random number generator, etc.

@MarcoGorelli
Collaborator

Right, thanks. Closing, as this looks like expected behaviour then.
