Using pl.write_parquet() gives wrong results for values inside lists. #17805

Matthias-Warlop · 2024-07-23T12:25:07Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df_gold = df_silver.select(
            [
                pl.when(pl.col("description").is_null()).then(pl.lit(None))
                .when(pl.col("description") == "")
                .then(pl.lit(None))
                .otherwise(
                    pl.concat_list(
                        [
                            pl.struct(
                                pl.lit("en").alias("locale"),
                                pl.col("description").alias("translation"),
                            )
                        ]
                    )
                ).alias("description"),
            ]
        )

Log output

No response

Issue description

When the description is equal to an empty string "" in df_silver the resulting df_gold has a None value (this is expected).
However, the subsequent row, regardless of its value, results in the description column in gold, as
[{'locale': 'en', 'translation': ''}]
Because the value originally was a sentence (string), this is an unexpected result. If I then filter the dataframe so that the row with the empty string is not present anymore, the same row that first resulted as [{'locale': 'en', 'translation': ''}] , is now resulting as [{'locale': 'en', 'translation': '{ORIGINALSENTENCE}'}] which would be the expected result.

Expected behavior

The preceding row in a dataframe should not have any influence on the result of the next row this is a bug that results in unexpected and nonsensical results.

Installed versions

--------Version info---------
Polars:               1.1.0
Index type:           UInt32
Platform:             Linux-6.8.0-38-generic-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

The text was updated successfully, but these errors were encountered:

cmdlineluser · 2024-07-23T12:51:33Z

Can you show an example of the problem?

e.g.

df = pl.DataFrame({
    "description": ["", None, "hi", ""],
    "locale": ["a", "b", "c", "d"],
})

df.select(
    pl.when(pl.col("description").is_null()).then(pl.lit(None))
      .when(pl.col("description") == "")
      .then(pl.lit(None))
      .otherwise(
          pl.concat_list(
              pl.struct(
                  pl.lit("en").alias("locale"),
                  pl.col("description").alias("translation"),
              )
          )
      )
      .alias("description")
)

shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
│ null            │
│ null            │
│ [{"en","hi"}]   │
│ null            │
└─────────────────┘

Matthias-Warlop · 2024-07-23T13:14:13Z

I have been investigating further. The issue seems to be more complex.

df_silver = pl.read_parquet(f"data/silver/locations.parquet")
df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.struct(
                    pl.lit("en").alias("locale"),
                    pl.col("description").alias("translation"),
                )
            )
        )
        .alias("description"),
    ]
)
# printing the the range of rows where the issue occurs
print(df_gold[60:70])

shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

As you can see, no issue.
If I then do

df_gold.write_parquet(f"data/gold/locations.parquet")
df_gold_read = pl.read_parquet(f"data/gold/locations.parquet")
print(df_gold_read[60:70])

shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en",""}]                     │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

I see the empty translation field again, just like I saw when exploring the parquet file in data wrangler.
Even stranger is when i do this:

df_gold.write_json(f"data/gold/eldrive_locations.csv")
df_gold_read = pl.read_json(f"data/gold/eldrive_locations.csv")
print(df_gold_read[60:70])

shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en",null}]                   │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en",null}]                   │
│ [{"en",null}]                   │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en",null}]                   │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

To show the complete values in the dataframe I ran:

for row in df_gold[60:70].iter_rows(named=True):
    print(row["description"])

None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms.'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]

ritchie46 · 2024-07-23T13:16:25Z

Can you update to latest Polars (1.2.1) and confirm it still occurs?

Matthias-Warlop · 2024-07-23T13:24:50Z

Sorry, I thought that I already was on the latest version. The ouput of my tests are identical with the new version.

pl.show_versions()

--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             Linux-6.8.0-38-generic-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

I also tried using the dataframe you used in your example. Saving as json and parquet give the same results here. printing the read dataframe:

shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
│ null            │
│ null            │
│ [{"en",""}]     │
│ null            │
└─────────────────┘

ritchie46 · 2024-07-23T13:29:32Z

And have you got a minimal repro? I don't see any way to reproduce your query? Ideally on syntetic data in memory, otherwise from the file. Cut out everything that isn't involved.

Matthias-Warlop · 2024-07-23T13:39:20Z

minimal reproducable code:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": ["", "hello", "hi", ""],
    }
)

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.struct(
                    pl.col("description").alias("translation"),
                )
            )
        )
        .alias("description"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)

shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[1]] │
╞═════════════════╡
│ null            │
│ [{"hello"}]     │
│ [{"hi"}]        │
│ null            │
└─────────────────┘
shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[1]] │
╞═════════════════╡
│ null            │
│ [{""}]          │
│ [{"hello"}]     │
│ null            │
└─────────────────┘

I left away the None value and check for None, because this does not seem to influence the test. Also remove the locale field.

Matthias-Warlop · 2024-07-23T15:06:58Z

The struct also does not seem to influence the bug

import polars as pl

df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)

shape: (4, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["hello"]       │
│ ["hi"]          │
│ null            │
└─────────────────┘
shape: (4, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ [""]            │
│ ["hello"]       │
│ null            │
└─────────────────┘

cmdlineluser · 2024-07-23T15:41:16Z

So it seems to end up being a problem with write_parquet specifically?

If we switch to sink_parquet - it does not replicate the bug:

df_gold.lazy().sink_parquet(f"test.parquet")
pl.read_parquet(f"test.parquet")

# shape: (4, 1)
# ┌─────────────────┐
# │ description_new │
# │ ---             │
# │ list[str]       │
# ╞═════════════════╡
# │ null            │
# │ ["hello"]       │
# │ ["hi"]          │
# │ null            │
# └─────────────────┘

Matthias-Warlop · 2024-07-24T07:37:59Z

This indeed fixes the issue. Am I right to think this should be considered as a major issue, as it negatively affects the data quality?
Also I tried further testing and it does have something to do with the creation of the df_gold dataframe (so the .select part). When I create a dataframe similar to df_gold directly like this:

df_silver = pl.DataFrame({"description": [None, ["hello"], ["hi"], None]})

print(df_silver)
df_silver.write_parquet(f"test.parquet")
df_silver_read = pl.read_parquet(f"test.parquet")
print(df_silver_read)

┌─────────────┐
│ description │
│ ---         │
│ list[str]   │
╞═════════════╡
│ null        │
│ ["hello"]   │
│ ["hi"]      │
│ null        │
└─────────────┘
shape: (4, 1)
┌─────────────┐
│ description │
│ ---         │
│ list[str]   │
╞═════════════╡
│ null        │
│ ["hello"]   │
│ ["hi"]      │
│ null        │
└─────────────┘

Both the hello and hi appear correctly in the output.

ritchie46 · 2024-07-24T07:54:38Z

import polars as pl

df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)

@coastalwhite can you take a look?

Matthias-Warlop · 2024-07-24T08:06:15Z

The lazy().sink_parquet does not seem to fix the issue. The bug is extemely specific as to when it occurs, but when it does, it occurs consistently.
Take the following code:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": [
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            None,
            "Hello",
            None,
            None,
            "Hello",
            None,
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
        ],
    }
)


df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold[10:])
df_gold.lazy().sink_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read[10:])

shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ [""]            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘

when the "" value is placed one row higher during the creation of the dataframe:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": [
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            None,
            "Hello",
            None,
            None,
            "Hello",
            None,
            "Hello",
            "Hello",
            "Hello",
            "",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
        ],
    }
)


df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold[10:])
df_gold.lazy().sink_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read[10:])

shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘

There is no longer a bug. lazy().sink_parquet seems to alter the behaviour of the bug, but the bug is still very much there. Again when reading the parquet I can see the wrong values too, so the bug is not because of the pl.read_parquet

cmdlineluser · 2024-07-24T12:03:58Z

Trying to find out when this changed:

pl.__version__
df_gold[10:].equals(df_gold_read[10:])

It seems it happened after 0.20.15

'0.20.16'
False

'0.20.15'
True

Looking at the release notes for 0.20.16:

https://github.com/pola-rs/polars/releases/tag/py-0.20.16

add new when-then-otherwise kernels #15089

As the bug seems to depend on when-then, this may be a starting point for further investigation.

Fixes pola-rs#17805.

Fixes pola-rs#17805. This fixes an issue on the Parquet writer where values that would be valid in the primitive array but invalid at a higher nesting level would still be written. This could for example be true when do `x = (x == "") ? [ x ] : None`. In this case, the empty string might still be valid but the above list is not valid anymore. This is solved by walking through the structure and propagating the nulls to the lower levels in the parquet writer.

Matthias-Warlop added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 23, 2024

Matthias-Warlop changed the title ~~Select expression with .when and .then statements gives incorrect results depending on preceding row.~~ Using pl.write_parquet() gives wrong results for values inside lists. Jul 24, 2024

coastalwhite self-assigned this Jul 24, 2024

coastalwhite added P-high Priority: high A-io-parquet Area: reading/writing Parquet files and removed needs triage Awaiting prioritization by a maintainer labels Jul 24, 2024

coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 24, 2024

fix: write highly-nested nulls in Parquet

1c175d2

Fixes pola-rs#17805.

coastalwhite mentioned this issue Jul 24, 2024

fix: Properly write nest-nulled values in Parquet #17845

Merged

ritchie46 closed this as completed in #17845 Jul 25, 2024

c-peters added the accepted Ready for implementation label Jul 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using pl.write_parquet() gives wrong results for values inside lists. #17805

Using pl.write_parquet() gives wrong results for values inside lists. #17805

Matthias-Warlop commented Jul 23, 2024

cmdlineluser commented Jul 23, 2024

Matthias-Warlop commented Jul 23, 2024

ritchie46 commented Jul 23, 2024

Matthias-Warlop commented Jul 23, 2024 •

edited

Loading

ritchie46 commented Jul 23, 2024

Matthias-Warlop commented Jul 23, 2024 •

edited

Loading

Matthias-Warlop commented Jul 23, 2024

cmdlineluser commented Jul 23, 2024

Matthias-Warlop commented Jul 24, 2024

ritchie46 commented Jul 24, 2024

Matthias-Warlop commented Jul 24, 2024 •

edited

Loading

cmdlineluser commented Jul 24, 2024 •

edited

Loading

Using pl.write_parquet() gives wrong results for values inside lists. #17805

Using pl.write_parquet() gives wrong results for values inside lists. #17805

Comments

Matthias-Warlop commented Jul 23, 2024

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

cmdlineluser commented Jul 23, 2024

Matthias-Warlop commented Jul 23, 2024

ritchie46 commented Jul 23, 2024

Matthias-Warlop commented Jul 23, 2024 • edited Loading

ritchie46 commented Jul 23, 2024

Matthias-Warlop commented Jul 23, 2024 • edited Loading

Matthias-Warlop commented Jul 23, 2024

cmdlineluser commented Jul 23, 2024

Matthias-Warlop commented Jul 24, 2024

ritchie46 commented Jul 24, 2024

Matthias-Warlop commented Jul 24, 2024 • edited Loading

cmdlineluser commented Jul 24, 2024 • edited Loading

Matthias-Warlop commented Jul 23, 2024 •

edited

Loading

Matthias-Warlop commented Jul 23, 2024 •

edited

Loading

Matthias-Warlop commented Jul 24, 2024 •

edited

Loading

cmdlineluser commented Jul 24, 2024 •

edited

Loading