Non-deterministic behaviour when using is_null() in LazyFrame #14595

Closed
2 tasks done
dcferreira opened this issue Feb 19, 2024 · 7 comments · Fixed by #14668
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@dcferreira

dcferreira commented Feb 19, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Unfortunately I couldn't get a reproducible example without my data (though I tried quite hard!), but I am willing to spend some time on this if someone has an idea of how to get one.

df is a LazyFrame read from a delta table.

import polars as pl

# `df` is the LazyFrame read from the delta table (not shown here).
ids_set = {
    '02d0927b-77ea-400f-8adc-22474e45d6d5',
    '03094785-91c7-4e98-9072-3336ff67c222',
    '031d9dfb-38c2-4229-92b3-d397b6d0313b',
    '033d8ebd-07ca-467a-8833-f5c23138746b',
    '0347c17f-dd73-43fc-969d-2e46b6406dea'
}
df_filtered = df.filter(pl.col("label_id").is_in(ids_set))
tmp = df_filtered.select('label_id', 'code', pl.col("code").is_not_null().alias("code_not_null"))

print('"original" dataframe')
print(tmp.collect())
print()

print("shapes")
for _ in range(10):
    tmp1 = tmp.collect().filter(pl.col("code").is_not_null())
    tmp2 = tmp.filter(pl.col("code").is_not_null()).collect()
    tmp3 = tmp.filter(pl.col("code_not_null")).collect()
    
    print(tmp1.shape, tmp2.shape, tmp3.shape)

Outputs:

"original" dataframe
shape: (5, 3)
┌───────────────────────────────────┬───────┬───────────────┐
│ label_id                          ┆ code  ┆ code_not_null │
│ ---                               ┆ ---   ┆ ---           │
│ str                               ┆ str   ┆ bool          │
╞═══════════════════════════════════╪═══════╪═══════════════╡
│ 033d8ebd-07ca-467a-8833-f5c23138… ┆ A-P-2 ┆ true          │
│ 0347c17f-dd73-43fc-969d-2e46b640… ┆ null  ┆ false         │
│ 02d0927b-77ea-400f-8adc-22474e45… ┆ null  ┆ false         │
│ 031d9dfb-38c2-4229-92b3-d397b6d0… ┆ null  ┆ false         │
│ 03094785-91c7-4e98-9072-3336ff67… ┆ null  ┆ false         │
└───────────────────────────────────┴───────┴───────────────┘

shapes
(1, 3) (1, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)

Log output

No response

Issue description

Filtering by pl.col().is_null() or pl.col().is_not_null() before collecting gives me a non-deterministic wrong result.

I really tried to get a completely reproducible example, but did not succeed.
Here's what I tried:

  • make a small dataframe with some nulls -> convert to lazy -> run the code above 1000s of times
  • save the dataframe to delta -> load with pl.scan_delta -> run the code above also 1000s of times

In both these cases, the results were consistently correct.

However, for the example in my data, something is clearly wrong.

Expected behavior

In the code snippet above, I'm filtering a lazyframe by null values in a column, and printing out the shape of the output.
I'm doing that in 3 different ways:

  1. collect and then filter
  2. filter and then collect
  3. filter on a boolean column that represents the same filter, and then collect.

I expected all 3 of these to give the exact same result.
However, the filtering in number 2 only works sometimes.

Installed versions

--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             Linux-5.15.0-1043-aws-x86_64-with-glibc2.31
Python:               3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            0.15.3
fsspec:               2024.2.0
gevent:               24.2.1
hvplot:               <not installed>
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.0.1
pyarrow:              15.0.0
pydantic:             1.10.14
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.51
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@dcferreira dcferreira added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 19, 2024
@ritchie46
Member

Can you please create a reproducible example? There is nothing we can do without one.

@dcferreira
Author

Like I said, unfortunately I couldn't get a reproducible example (see the original post for a few things I tried). If you or anyone else has suggestions that might help me get one, I'm definitely willing to try.

@nameexhaustion
Collaborator

Could you also show the output of tmp.filter(pl.col("code").is_not_null()).explain() of every iteration?

@dcferreira
Author

Yes of course!

So, I changed the loop a little bit to add the explanation:

print("shapes")
for _ in range(10):
    tmp2_plan = tmp.filter(pl.col("code").is_not_null())
    tmp1 = tmp.collect().filter(pl.col("code").is_not_null())
    tmp2 = tmp2_plan.collect()
    tmp3 = tmp.filter(pl.col("code_not_null")).collect()

    print(tmp2_plan.explain())
    print(tmp1.shape, tmp2.shape, tmp3.shape)
    print('=' * 100)

Here's the (long) output (note that the data changed very slightly, but the issue persists):

Long output
"original" dataframe
shape: (6, 3)
┌───────────────────────────────────┬───────┬───────────────┐
│ label_id                          ┆ code  ┆ code_not_null │
│ ---                               ┆ ---   ┆ ---           │
│ str                               ┆ str   ┆ bool          │
╞═══════════════════════════════════╪═══════╪═══════════════╡
│ 033d8ebd-07ca-467a-8833-f5c23138… ┆ A-P-2 ┆ true          │
│ 033d8ebd-07ca-467a-8833-f5c23138… ┆ A-P-2 ┆ true          │
│ 02d0927b-77ea-400f-8adc-22474e45… ┆ null  ┆ false         │
│ 0347c17f-dd73-43fc-969d-2e46b640… ┆ null  ┆ false         │
│ 031d9dfb-38c2-4229-92b3-d397b6d0… ┆ null  ┆ false         │
│ 03094785-91c7-4e98-9072-3336ff67… ┆ null  ┆ false         │
└───────────────────────────────────┴───────┴───────────────┘

shapes
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          FILTER col("label_id").is_in([Series]) FROM

          UNIQUE BY Some(["external_id"])
            FAST_PROJECT: [label_id, external_id, code]

                PYTHON SCAN 
                PROJECT 3/18 COLUMNS
                SELECTION: ~(pa.compute.field('code')).is_null()
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (6, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          DROP_NULLS by: [code]
            UNIQUE BY Some(["external_id"])
              FAST_PROJECT: [label_id, external_id, code]

                  PYTHON SCAN 
                  PROJECT 3/18 COLUMNS
                  SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-38c2-4229-92b3-d397b6d0313b","02d0927b-77ea-400f-8adc-22474e45d6d5","0347c17f-dd73-43fc-969d-2e46b6406dea","033d8ebd-07ca-467a-8833-f5c23138746b","03094785-91c7-4e98-9072-3336ff67c222"])
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (6, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          DROP_NULLS by: [code]
            UNIQUE BY Some(["external_id"])
              FAST_PROJECT: [label_id, external_id, code]

                  PYTHON SCAN 
                  PROJECT 3/18 COLUMNS
                  SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-38c2-4229-92b3-d397b6d0313b","02d0927b-77ea-400f-8adc-22474e45d6d5","0347c17f-dd73-43fc-969d-2e46b6406dea","033d8ebd-07ca-467a-8833-f5c23138746b","03094785-91c7-4e98-9072-3336ff67c222"])
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (6, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          DROP_NULLS by: [code]
            UNIQUE BY Some(["external_id"])
              FAST_PROJECT: [label_id, external_id, code]

                  PYTHON SCAN 
                  PROJECT 3/18 COLUMNS
                  SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-38c2-4229-92b3-d397b6d0313b","02d0927b-77ea-400f-8adc-22474e45d6d5","0347c17f-dd73-43fc-969d-2e46b6406dea","033d8ebd-07ca-467a-8833-f5c23138746b","03094785-91c7-4e98-9072-3336ff67c222"])
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (6, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          DROP_NULLS by: [code]
            UNIQUE BY Some(["external_id"])
              FAST_PROJECT: [label_id, external_id, code]

                  PYTHON SCAN 
                  PROJECT 3/18 COLUMNS
                  SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-38c2-4229-92b3-d397b6d0313b","02d0927b-77ea-400f-8adc-22474e45d6d5","0347c17f-dd73-43fc-969d-2e46b6406dea","033d8ebd-07ca-467a-8833-f5c23138746b","03094785-91c7-4e98-9072-3336ff67c222"])
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (6, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          DROP_NULLS by: [code]
            UNIQUE BY Some(["external_id"])
              FAST_PROJECT: [label_id, external_id, code]

                  PYTHON SCAN 
                  PROJECT 3/18 COLUMNS
                  SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-38c2-4229-92b3-d397b6d0313b","02d0927b-77ea-400f-8adc-22474e45d6d5","0347c17f-dd73-43fc-969d-2e46b6406dea","033d8ebd-07ca-467a-8833-f5c23138746b","03094785-91c7-4e98-9072-3336ff67c222"])
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (2, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          DROP_NULLS by: [code]
            UNIQUE BY Some(["external_id"])
              FAST_PROJECT: [label_id, external_id, code]

                  PYTHON SCAN 
                  PROJECT 3/18 COLUMNS
                  SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-38c2-4229-92b3-d397b6d0313b","02d0927b-77ea-400f-8adc-22474e45d6d5","0347c17f-dd73-43fc-969d-2e46b6406dea","033d8ebd-07ca-467a-8833-f5c23138746b","03094785-91c7-4e98-9072-3336ff67c222"])
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (2, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          FILTER col("label_id").is_in([Series]) FROM

          UNIQUE BY Some(["external_id"])
            FAST_PROJECT: [label_id, external_id, code]

                PYTHON SCAN 
                PROJECT 3/18 COLUMNS
                SELECTION: ~(pa.compute.field('code')).is_null()
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (2, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          FILTER col("label_id").is_in([Series]) FROM

          UNIQUE BY Some(["external_id"])
            FAST_PROJECT: [label_id, external_id, code]

                PYTHON SCAN 
                PROJECT 3/18 COLUMNS
                SELECTION: ~(pa.compute.field('code')).is_null()
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (6, 3) (2, 3)
====================================================================================================
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          DROP_NULLS by: [code]
            UNIQUE BY Some(["external_id"])
              FAST_PROJECT: [label_id, external_id, code]

                  PYTHON SCAN 
                  PROJECT 3/18 COLUMNS
                  SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-38c2-4229-92b3-d397b6d0313b","02d0927b-77ea-400f-8adc-22474e45d6d5","0347c17f-dd73-43fc-969d-2e46b6406dea","033d8ebd-07ca-467a-8833-f5c23138746b","03094785-91c7-4e98-9072-3336ff67c222"])
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
(2, 3) (2, 3) (2, 3)
====================================================================================================

The plan seems to be the same every time; I'm not sure what can be concluded from that.

@nameexhaustion
Collaborator

Thanks @dcferreira, the query plans show that predicate pushdown is optimizing differently between iterations; there are 2 variations in the SELECTION:

SELECTION: ~(pa.compute.field('code')).is_null()
SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-...

This is also likely the cause of the differing outputs. But in fact predicates should not be getting pushed past the DISTINCT operator at all unless they depend only on the column subset.
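
As a rough illustration of why the order matters, here is a minimal pure-Python sketch (not Polars internals). The helpers `unique_first` and `keep_if` are hypothetical stand-ins for a "keep the first row per key" unique and a filter, and the three rows are made-up data. A predicate on a column outside the unique subset gives different results depending on whether it runs before or after the unique step, while a predicate on the subset column alone does not.

```python
def unique_first(rows, subset):
    """Hypothetical helper: keep the first row seen for each key in `subset`."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def keep_if(rows, predicate):
    """Hypothetical helper: plain row filter."""
    return [row for row in rows if predicate(row)]


# Made-up data: external_id 1 appears twice, once with a null code.
rows = [
    {"external_id": 1, "code": None},
    {"external_id": 1, "code": "A-P-2"},
    {"external_id": 2, "code": None},
]


def code_not_null(row):
    return row["code"] is not None


def id_is_one(row):
    return row["external_id"] == 1


# Predicate on a column outside the subset: the two orders disagree.
unique_then_filter = keep_if(unique_first(rows, ["external_id"]), code_not_null)
filter_then_unique = unique_first(keep_if(rows, code_not_null), ["external_id"])
print(unique_then_filter)  # []
print(filter_then_unique)  # [{'external_id': 1, 'code': 'A-P-2'}]

# Predicate that depends only on the subset column: the two orders agree.
safe_after = keep_if(unique_first(rows, ["external_id"]), id_is_one)
safe_before = unique_first(keep_if(rows, id_is_one), ["external_id"])
print(safe_after == safe_before)  # True
```

In this toy model, filtering first removes the null row that the unique step would otherwise have kept, so a different row survives for `external_id` 1.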

@dcferreira
Author

Good spotting!

Thanks so much for the PR @nameexhaustion, that was super fast!

Your fix/everything around the planning goes a bit over my head, but while trying to understand it I ran into something a bit weird:

This code outputs 2 different results, with the only difference being that the optimizations are enabled/disabled:

ids_set = {
    '02d0927b-77ea-400f-8adc-22474e45d6d5',
    '03094785-91c7-4e98-9072-3336ff67c222',
    '031d9dfb-38c2-4229-92b3-d397b6d0313b',
    '033d8ebd-07ca-467a-8833-f5c23138746b',
    '0347c17f-dd73-43fc-969d-2e46b6406dea'
}
df_filtered = df.filter(pl.col("label_id").is_in(ids_set))
tmp = df_filtered.select('label_id', 'code', pl.col("code").is_not_null().alias("code_not_null"))

print('"original" dataframe')
print(tmp.explain(optimized=False))
print(tmp.collect(no_optimization=True))
print("#" * 100)
print(tmp.explain(optimized=True))
print(tmp.collect(no_optimization=False))
Long output
"original" dataframe
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FILTER col("label_id").is_in([Series]) FROM

  INNER JOIN:
  LEFT PLAN ON: [col("external_id")]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      UNIQUE BY Some(["external_id"])
         SELECT [col("label_id"), col("external_id"), col("p_label_batch"), col("label_source"), col("code")] FROM

            PYTHON SCAN 
            PROJECT */18 COLUMNS
    RIGHT PLAN ON: [col("external_id")]
       SELECT [col("external_id"), col("p_source")] FROM

          PYTHON SCAN 
          PROJECT */10 COLUMNS
    END INNER JOIN
  RIGHT PLAN ON: [col("external_id")]
    FILTER [(col("p_split")) == (String(val))] FROM

     SELECT [col("external_id"), col("p_split")] FROM

        PYTHON SCAN 
        PROJECT */5 COLUMNS
  END INNER JOIN
shape: (4, 3)
┌───────────────────────────────────┬──────┬───────────────┐
│ label_id                          ┆ code ┆ code_not_null │
│ ---                               ┆ ---  ┆ ---           │
│ str                               ┆ str  ┆ bool          │
╞═══════════════════════════════════╪══════╪═══════════════╡
│ 031d9dfb-38c2-4229-92b3-d397b6d0… ┆ null ┆ false         │
│ 0347c17f-dd73-43fc-969d-2e46b640… ┆ null ┆ false         │
│ 02d0927b-77ea-400f-8adc-22474e45… ┆ null ┆ false         │
│ 03094785-91c7-4e98-9072-3336ff67… ┆ null ┆ false         │
└───────────────────────────────────┴──────┴───────────────┘
####################################################################################################
 SELECT [col("label_id"), col("code"), col("code").is_not_null().alias("code_not_null")] FROM
  FAST_PROJECT: [external_id, label_id, code]
    INNER JOIN:
    LEFT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, label_id, code]
        INNER JOIN:
        LEFT PLAN ON: [col("external_id")]
          UNIQUE BY Some(["external_id"])
            FAST_PROJECT: [label_id, external_id, code]

                PYTHON SCAN 
                PROJECT 3/18 COLUMNS
                SELECTION: (pa.compute.field('label_id')).isin(["031d9dfb-38c2-4229-92b3-d397b6d0313b","02d0927b-77ea-400f-8adc-22474e45d6d5","03094785-91c7-4e98-9072-3336ff67c222","0347c17f-dd73-43fc-969d-2e46b6406dea","033d8ebd-07ca-467a-8833-f5c23138746b"])
        RIGHT PLAN ON: [col("external_id")]
          FAST_PROJECT: [external_id]

              PYTHON SCAN 
              PROJECT 1/10 COLUMNS
        END INNER JOIN
    RIGHT PLAN ON: [col("external_id")]
      FAST_PROJECT: [external_id, p_split]

          PYTHON SCAN 
          PROJECT 2/5 COLUMNS
          SELECTION: (pa.compute.field('p_split') == 'val')
    END INNER JOIN
shape: (6, 3)
┌───────────────────────────────────┬───────┬───────────────┐
│ label_id                          ┆ code  ┆ code_not_null │
│ ---                               ┆ ---   ┆ ---           │
│ str                               ┆ str   ┆ bool          │
╞═══════════════════════════════════╪═══════╪═══════════════╡
│ 033d8ebd-07ca-467a-8833-f5c23138… ┆ A-P-2 ┆ true          │
│ 033d8ebd-07ca-467a-8833-f5c23138… ┆ A-P-2 ┆ true          │
│ 02d0927b-77ea-400f-8adc-22474e45… ┆ null  ┆ false         │
│ 0347c17f-dd73-43fc-969d-2e46b640… ┆ null  ┆ false         │
│ 031d9dfb-38c2-4229-92b3-d397b6d0… ┆ null  ┆ false         │
│ 03094785-91c7-4e98-9072-3336ff67… ┆ null  ┆ false         │
└───────────────────────────────────┴───────┴───────────────┘

Is this covered in the test you added with (maintain_order=False, keep="none"), or is there still some additional issue here? (or did I misunderstand the meaning of "optimization", and different results are just expected?)

@nameexhaustion
Collaborator

With the existing release, I expect that running with collect(predicate_pushdown=False) should give you the same result as collect(no_optimization=True). Also make sure you are not using keep="any" (the default) for unique.
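
To see why keep="any" matters here, the following pure-Python sketch (again a toy model, not Polars code) enumerates every result a unique step may return when it gives no guarantee about which row per key is kept. The function `admissible_unique_any` and the sample rows are hypothetical. Rows are `(external_id, code)` tuples and the unique key is `external_id`.

```python
from itertools import product


def admissible_unique_any(rows):
    """Hypothetical model: all results a keep="any"-style unique may return.

    One row is picked per key (first tuple element), with no guarantee which.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return {frozenset(choice) for choice in product(*groups.values())}


def not_null(row):
    return row[1] is not None


# Made-up data: external_id 1 has one null and one non-null code.
rows = [(1, None), (1, "A-P-2"), (2, None)]

# Unique first, then filter: the outcome depends on which row was kept.
unique_then_filter = {
    frozenset(row for row in outcome if not_null(row))
    for outcome in admissible_unique_any(rows)
}

# Filter pushed below the unique: only non-null rows ever reach it.
filter_then_unique = admissible_unique_any([row for row in rows if not_null(row)])

print(len(unique_then_filter))  # 2
print(len(filter_then_unique))  # 1
```

Under this model, the unique-then-filter plan can legitimately return either zero rows or one row, so runs that differ in row choice or plan shape can differ in output even though each individual step behaves as specified.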
