Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No error thrown when using clip on int column with float bounds #11385

Open
nicolas-mng opened this issue Sep 28, 2023 · 7 comments
Open

No error thrown when using clip on int column with float bounds #11385

nicolas-mng opened this issue Sep 28, 2023 · 7 comments
Labels
bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@nicolas-mng
Copy link

nicolas-mng commented Sep 28, 2023

Description

Hi,
I have this situation occurring to me and I think that polars throwing an error would have helped me debug.

# Case 1 with an int64 polars DataFrame
# Trying to clip a column between min_clip and max_clip, both being float values
import polars as pl
import numpy as np

df = pl.DataFrame({"column": np.arange(100, dtype=np.int64)})
min_clip, max_clip = 0.2, 20
df = df.with_columns(
                [
                    pl.lit(min_clip).alias("min_clip"),  # just here to show the actual values and types
                    pl.lit(max_clip).alias("max_clip"),
                    (pl.col("column").clip(min_clip, max_clip)).alias("column_clipped"),
                ])

┌───────────┬───────────────────┬───────────────────┬───────────────────┐
│ column      ┆ min_clip       ┆ max_clip      ┆ column_clipped│
│ ---       ┆ ---               ┆ ---               ┆ ---               │
│ i64       ┆ f64               ┆ i32               ┆ i64               │
╞═══════════╪═══════════════════╪═══════════════════╪═══════════════════╡
│ 0         ┆ 0.2               ┆ 20                ┆ 0 <- the min clipped value here is wrong, it should be 0.2
│ 1         ┆ 0.2               ┆ 20                ┆ 1                 │
│ 2         ┆ 0.2               ┆ 20                ┆ 2                 │
│ 3         ┆ 0.2               ┆ 20                ┆ 3                 │
│ …         ┆ …                 ┆ …                 ┆ …                 │
│ 96        ┆ 0.2               ┆ 20                ┆ 20                │
│ 97        ┆ 0.2               ┆ 20                ┆ 20                │
│ 98        ┆ 0.2               ┆ 20                ┆ 20                │
│ 99        ┆ 0.2               ┆ 20                ┆ 20                │
└───────────┴───────────────────┴───────────────────┴───────────────────┘

# Case 2 with a float32 polars DataFrame
df = pl.DataFrame({"column": np.arange(100, dtype=np.float32)})
┌───────────┬───────────────────┬───────────────────┬───────────────────┐
│ column       ┆ min_clip      ┆ max_clip      ┆ column_clipped│
│ ---       ┆ ---               ┆ ---               ┆ ---               │
│ f32       ┆ f64               ┆ i32               ┆ f32               │
╞═══════════╪═══════════════════╪═══════════════════╪═══════════════════╡
│ 0.0       ┆ 0.2               ┆ 20                ┆ 0.2  <- it now works│
│ 1.0       ┆ 0.2               ┆ 20                ┆ 1.0               │
│ 2.0       ┆ 0.2               ┆ 20                ┆ 2.0               │
│ 3.0       ┆ 0.2               ┆ 20                ┆ 3.0               │
│ …         ┆ …                 ┆ …                 ┆ …                 │
│ 96.0      ┆ 0.2               ┆ 20                ┆ 20.0              │
│ 97.0      ┆ 0.2               ┆ 20                ┆ 20.0              │
│ 98.0      ┆ 0.2               ┆ 20                ┆ 20.0              │
│ 99.0      ┆ 0.2               ┆ 20                ┆ 20.0              │
└───────────┴───────────────────┴───────────────────┴───────────────────┘

The error must be happening because of various casting of ints and floats and adding a .cast(pl.Float32) seems to fix the issue. But in my actual use case, this operation was hidden within a more convoluted DataFrame operation so it was harder to track the origin of my bug. Maybe having polars throw a warning here would help? Something about boundaries not respected because of type mismatch?

Thanks for the work on polars!

@nicolas-mng nicolas-mng added the enhancement New feature or an improvement of an existing feature label Sep 28, 2023
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 28, 2023

Could you fix the test case?
Currently doesn't run ;)

@nicolas-mng
Copy link
Author

Oops sorry, it now works :)

@stinodego stinodego added accepted Ready for implementation and removed accepted Ready for implementation labels Oct 2, 2023
@stinodego
Copy link
Member

Not sure what the right way to go is here. We could have clip always return a float, but that feels inefficient...

@nicolas-mng
Copy link
Author

I guess, either autocasting to the higher of the precisions or throwing a type mismatch error would be two legitimate options. I'm worried now that a simple warning would be too quiet and I cannot think of too many use-cases where having values wrongly clipped is ok?

@s-banach
Copy link
Contributor

s-banach commented Jan 11, 2024

I would assume x.clip(0.2, 20) clips all integers below 0.2 and above 20.
So I would assume 0 gets clipped up to 1 in an integer column.
Just got caught by surprise with this one today.

@s-banach
Copy link
Contributor

Similar issue, no warning with Series.scatter(), scattering float values into an int column.

@stinodego
Copy link
Member

stinodego commented Jan 12, 2024

I guess we should strict cast the inputs to the type of the column, rather than regular cast. That would raise an error with bad inputs and require you to cast the column manually first.

EDIT: Actually, we should use supertypes here.

EDIT2: Actually, we should not upcast the original column...

@stinodego stinodego added accepted Ready for implementation bug Something isn't working P-medium Priority: medium and removed enhancement New feature or an improvement of an existing feature labels Jan 12, 2024
@stinodego stinodego self-assigned this Jan 12, 2024
@stinodego stinodego added python Related to Python Polars and removed accepted Ready for implementation labels Jan 12, 2024
@stinodego stinodego changed the title No warning thrown when using pl.clip on int column No error thrown when using clip on int column with float bounds Feb 11, 2024
@stinodego stinodego removed their assignment Jun 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P-medium Priority: medium python Related to Python Polars
Projects
Status: Ready
Development

Successfully merging a pull request may close this issue.

4 participants