Wrong type promotion on lazy parquet dataframe #6192

Closed
2 tasks done
Hoeze opened this issue Jan 12, 2023 · 3 comments · Fixed by #13776
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@Hoeze

Hoeze commented Jan 12, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

When reading a parquet dataframe lazily and selecting some transformed Float32 column, the data type gets incorrectly reported as being Float64. After collecting the result however, it's correctly reported as being Float32 again.

Reproducible example

import polars as pl
import numpy as np

# write_parquet returns None, so the result is not assigned
pl.DataFrame({
    "x": np.array([1, 1, 1], dtype="int32"),
    "y": np.array([1, 2, 3], dtype="float32"),
}).write_parquet("test.parquet")

df = pl.read_parquet("test.parquet").lazy()

print(df.schema)
# {'x': Int32, 'y': Float32}

print(df.select(pl.col("y")).schema)
# {'y': Float32}

# THIS FAILS! Data type should be reported as Float32
print(df.select(-pl.col("y")).schema)
# {'literal': Float64}

# After collecting the dataset, the type is correct
print(df.select(-pl.col("y")).collect().schema)
# {'literal': Float32}

Expected behavior

print(df.schema)
# {'x': Int32, 'y': Float32}

print(df.select(-pl.col("y")).schema)
# {'literal': Float32}

Installed versions

---Version info---
Polars: 0.15.14
Index type: UInt32
Platform: Linux-6.1.0-1.el8.elrepo.x86_64-x86_64-with-glibc2.28
Python: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:41:03) 
[GCC 9.4.0]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.2
numpy: 1.23.2
fsspec: 2022.01.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: 3.5.1
@Hoeze Hoeze added bug Something isn't working python Related to Python Polars labels Jan 12, 2023
@stinodego stinodego added needs triage Awaiting prioritization by a maintainer A-io Area: reading and writing data and removed A-io Area: reading and writing data labels Jan 13, 2024
@stinodego
Member

stinodego commented Jan 16, 2024

This has nothing to do with Parquet. A more minimal example:

import polars as pl
from polars.testing import assert_frame_equal

lf = pl.LazyFrame({"a": [1.0, 2.0]}, schema={"a": pl.Float32})

result = lf.select(-pl.col("a"))

expected = pl.LazyFrame({"a": [-1.0, -2.0]}, schema={"a": pl.Float32})
assert_frame_equal(result, expected)
# AssertionError: LazyFrames are different (dtypes do not match)
# [left]:  {'a': Float64}
# [right]: {'a': Float32}

There are two issues here:

  • We calculate the negation by doing pl.lit(0) - expr, which casts to the supertype of expr and Int32. We should implement negation directly to avoid this.
  • The supertype in this case is Float64. So the schema is correct, but the result of the minus operation is wrong. This should be fixed.

I'll make a fix.

@stinodego stinodego added P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Jan 16, 2024
@stinodego stinodego self-assigned this Jan 16, 2024
@Hoeze
Author

Hoeze commented Jan 16, 2024

Thanks a lot, @stinodego!

@stinodego stinodego added P-high Priority: high and removed P-low Priority: low labels Jan 16, 2024
@stinodego
Member

I opened a PR that will nicely fix the code example in this issue.

I opened a separate issue for the underlying cause that should be fixed in another way: #13804

@c-peters c-peters added the accepted Ready for implementation label Jan 23, 2024
3 participants