Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LazyFrame crashes on with_columns and rolling when DataFrame succeeds #15588

Closed
2 tasks done
SMa2021 opened this issue Apr 10, 2024 · 8 comments · Fixed by #15718
Closed
2 tasks done

LazyFrame crashes on with_columns and rolling when DataFrame succeeds #15588

SMa2021 opened this issue Apr 10, 2024 · 8 comments · Fixed by #15718
Assignees
Labels
accepted Ready for implementation bug Something isn't working invalid A bug report that is not actually a bug python Related to Python Polars

Comments

@SMa2021
Copy link

SMa2021 commented Apr 10, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

test_data = pl.LazyFrame({
            'Rank': list(range(4)),
            'Value': [1]*4
        }).set_sorted('Rank')

def func(df):
    df = df.with_columns(
        df.rolling(index_column = 'Rank', period = '3i').agg(
            pl.sum('Value').alias('RolledValue'),
        )
    )
    return df

func(test_data.collect()) # this works
func(test_data).collect() # this crashes

Log output

invalid literal value: 'naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)\n\nAGGREGATE\n\t[col("Value").sum().alias("RolledValue")] BY [] FROM\n   WITH_COLUMNS:\n   [col("Rank").set_sorted()]\n    DF ["Rank", "Value"]; PROJECT */2 COLUMNS; SELECTION: "None"'

Issue description

The intention is to add a column of the rolled values to the dataframe. While .agg return a dataframe, the eager DataFrame executed successfully, but the LazyFrame rejected the operation.

Expected behavior

Here is the expected output. The eager mode achieves this output.

 expected_output = pl.DataFrame({
      'Rank': list(range(4)),
      'Value': [1]*4,
      'RolledValue': [1,2,3,3]
  })

Installed versions

--------Version info---------
Polars:               0.20.19
Index type:           UInt32
Platform:             Linux-5.4.0-174-generic-x86_64-with-glibc2.31
Python:               3.11.3 (main, May 15 2023, 15:45:52) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.6.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.7.2
nest_asyncio:         1.5.1
numpy:                1.24.4
openpyxl:             <not installed>
pandas:               2.0.2
pyarrow:              12.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.52
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@SMa2021 SMa2021 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 10, 2024
@deanm0000 deanm0000 added invalid A bug report that is not actually a bug and removed needs triage Awaiting prioritization by a maintainer labels Apr 10, 2024
@deanm0000
Copy link
Collaborator

You're trying to nest the lazyframe inside of a with_columns. You'd have to collect the inner one.

You can redefine your func as

def func(df):
    df = df.with_columns(
        df.lazy().rolling(index_column = 'Rank', period = '3i').agg(
            pl.sum('Value').alias('RolledValue'),
        ).collect()
    )
    return df

and then it will work for both eager and lazy just be aware it's double materializing.

@cmdlineluser
Copy link
Contributor

Should something be done to improve the error message here instead of trying to stringify a LazyFrame?

The fact it works with DataFrames is a side-effect of iteration unpacking each Series object.

The "proper" way to combine them would be pl.concat() horizontally or .join() which would work regardless of eager/lazy.

@deanm0000
Copy link
Collaborator

The error could be better, sure. We'd have to check all the inputs to with_columns (and all the other contexts) to see if any are LazyFrames and then it'd have a specific error. Should we add checks on the eager side so that both of these scenarios error?

I'll reopen just in case others feel differently but it seems like it's just bike shedding from here.

@deanm0000 deanm0000 reopened this Apr 10, 2024
@SMa2021
Copy link
Author

SMa2021 commented Apr 11, 2024

Thanks for the response. I ended up using

def func(df):
    df_agg = df.rolling(index_column = 'Rank', period = '3i').agg(
            pl.sum('Value').alias('RolledValue'))
    return pl.concat([df, df_agg], how='align')

@cmdlineluser
Copy link
Contributor

cmdlineluser commented Apr 11, 2024

@deanm0000 I'm not sure - it just seemed a little odd to me that this happens.

It's as if the eager version works "by accident" - but perhaps I'm mistaken.

@SMa2021 It seems there is also Expr.rolling() which can create a new column directly.

pl.sum("Value").rolling(index_column="Rank", period="3i")

@SMa2021
Copy link
Author

SMa2021 commented Apr 11, 2024

Hi @cmdlineluser, thank you for the suggestion! I also want to use a group_by argument, which seems to not be supported by Expr.rolling(), but I removed it from the example to be more minimal.

@deanm0000
Copy link
Collaborator

deanm0000 commented Apr 11, 2024

@cmdlineluser It's definitely working by accident for the reason you said. A dataframe has __iter__ which turns a df into its individual column Series. The issue is that we want both df.with_columns([something]) and df.with_columns(something) to each work but in this case with outer_df.with_columns(inner_df) the inner_df ends up getting treated like it was a list and when a df is iterated it outputs its individual Series which are, on a standalone basis, valid inputs. I made a PR to raise if a DF or LF are inputs where an expression should go.

@stinodego
Copy link
Member

Python does not have strict typing and we cannot check for all possible wrong inputs.

A DataFrame is an iterable of its Series. So when inputting it into a function that accepts an iterable of IntoExpr, it will be treated like an iterable of Series and it will work. It's no different than passing a list of Series. This is not exactly intended, but it's not wrong. And the performance should be comparable to a horizontal concat.

A LazyFrame is not an iterable. So it correctly raises a TypeError. The error message can be improved, I will open a PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation bug Something isn't working invalid A bug report that is not actually a bug python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants