
Cache multiple identical map_elements calls within a function call #13895

Closed
Adrianf23 opened this issue Jan 22, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

Adrianf23 commented Jan 22, 2024

Description

I asked a question about map_elements on Discord, and @MarcoGorelli suggested that I open a feature request about a possible optimization (this point may need clarification). I want to run a UDF (or a third-party library function) that has multiple return values and assign those values to different columns. Because the function is computationally heavy, I am looking for a way to run it only once per item. If one of the methods below is the preferred way to do things, I think it would be helpful to add it to the documentation.

Multiple function calls per row (total = number of rows * number of new cols)

In this example, you can see that the function runs twice per row, once for each list.get() call, so it prints "here" six times in total.

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).lazy()


def foo(x):
    print("here")
    return (x, x + 1)


out = df.with_columns(
    r1=pl.col("a").map_elements(foo).list.get(0),
    r2=pl.col("a").map_elements(foo).list.get(1),
).collect()

print(out)

here
here
here
here
here
here
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ r1  ┆ r2  │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 4   ┆ 1   ┆ 2   │
│ 2   ┆ 5   ┆ 2   ┆ 3   │
│ 3   ┆ 6   ┆ 3   ┆ 4   │
└─────┴─────┴─────┴─────┘

One function call per row (total = number of rows)

In this example, the function isn't called more than once per row. This is closest to what I was looking for, but we do have to create the intermediate r column and then drop it. I believe this is where the aforementioned optimization could come in. I am not sure exactly how to tackle it, but I would like to see a syntax or argument that lets you keep or drop the intermediate result.

out_2 = (
    df.with_columns(r=pl.col("a").map_elements(foo))
    .with_columns(
        r1=pl.col("r").list.get(0),
        r2=pl.col("r").list.get(1),
    )
    .drop("r")
    .collect()
)

print(out_2)

here
here
here
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ r1  ┆ r2  │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 4   ┆ 1   ┆ 2   │
│ 2   ┆ 5   ┆ 2   ┆ 3   │
│ 3   ┆ 6   ┆ 3   ┆ 4   │
└─────┴─────┴─────┴─────┘

Note: Unnest a dictionary of return elements

This might work if you define the function yourself. I think it would be tedious to make a wrapper around a third-party library so that it returns a dictionary.

import polars as pl


def exp_and_x_root(x):
    exp = x**x
    x_root = x ** (1 / x)
    return {"x^x": exp, "x^(1/x)": x_root}

df = pl.DataFrame({"num_list": [1, 2, 3]})
out = df.with_columns(
    result=pl.col("num_list").map_elements(exp_and_x_root)
).unnest("result")

print(out)

shape: (3, 3)
┌──────────┬─────┬──────────┐
│ num_list ┆ x^x ┆ x^(1/x)  │
│ ---      ┆ --- ┆ ---      │
│ i64      ┆ i64 ┆ f64      │
╞══════════╪═════╪══════════╡
│ 1        ┆ 1   ┆ 1.0      │
│ 2        ┆ 4   ┆ 1.414214 │
│ 3        ┆ 27  ┆ 1.44225  │
└──────────┴─────┴──────────┘

I searched the enhancement-tagged issues for something that specifically matched this but didn't find anything. @MarcoGorelli, if I missed something in the feature request, please let me know.

@Adrianf23 Adrianf23 added the enhancement New feature or an improvement of an existing feature label Jan 22, 2024
@MarcoGorelli MarcoGorelli changed the title Create multiple columns from a single map_elements function call Cache multiple identical map_elements calls within a function call Jan 22, 2024
@MarcoGorelli
Collaborator

Thanks @Adrianf23 for the report

To put this more concisely: the request is that if you write

df.with_columns(
    r1=pl.col("a").map_elements(foo).list.get(0),
    r2=pl.col("a").map_elements(foo).list.get(1),
)

then Polars should cache the common map_elements(foo) part, so that it becomes equivalent to

(
    df.with_columns(r=pl.col("a").map_elements(foo))
    .with_columns(
        r1=pl.col("r").list.get(0),
        r2=pl.col("r").list.get(1),
    )
    .drop("r")
)

This looks reasonable to me, but it's too far from anything I've personally worked on. @reswqa, do you know about this one?

@ritchie46
Member

We cannot trust UDFs to be pure. A UDF could read a global counter, use a random number generator, etc.

@MarcoGorelli
Collaborator

Right, thanks. Closing, as this looks like expected behaviour then.
