Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using pl.write_parquet() gives wrong results for values inside lists. #17805

Closed
2 tasks done
Matthias-Warlop opened this issue Jul 23, 2024 · 12 comments · Fixed by #17845
Closed
2 tasks done

Using pl.write_parquet() gives wrong results for values inside lists. #17805

Matthias-Warlop opened this issue Jul 23, 2024 · 12 comments · Fixed by #17845
Assignees
Labels
A-io-parquet Area: reading/writing Parquet files accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@Matthias-Warlop
Copy link

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df_gold = df_silver.select(
            [
                pl.when(pl.col("description").is_null()).then(pl.lit(None))
                .when(pl.col("description") == "")
                .then(pl.lit(None))
                .otherwise(
                    pl.concat_list(
                        [
                            pl.struct(
                                pl.lit("en").alias("locale"),
                                pl.col("description").alias("translation"),
                            )
                        ]
                    )
                ).alias("description"),
            ]
        )

Log output

No response

Issue description

When the description is equal to an empty string "" in df_silver the resulting df_gold has a None value (this is expected).
However, the subsequent row, regardless of its value, results in the description column in gold, as
[{'locale': 'en', 'translation': ''}]
Because the value originally was a sentence (string), this is an unexpected result. If I then filter the dataframe so that the row with the empty string is not present anymore, the same row that first resulted as [{'locale': 'en', 'translation': ''}] , is now resulting as [{'locale': 'en', 'translation': '{ORIGINALSENTENCE}'}] which would be the expected result.

Expected behavior

The preceding row in a dataframe should not have any influence on the result of the next row this is a bug that results in unexpected and nonsensical results.

Installed versions

--------Version info---------
Polars:               1.1.0
Index type:           UInt32
Platform:             Linux-6.8.0-38-generic-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@Matthias-Warlop Matthias-Warlop added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 23, 2024
@cmdlineluser
Copy link
Contributor

Can you show an example of the problem?

e.g.

df = pl.DataFrame({
    "description": ["", None, "hi", ""],
    "locale": ["a", "b", "c", "d"],
})

df.select(
    pl.when(pl.col("description").is_null()).then(pl.lit(None))
      .when(pl.col("description") == "")
      .then(pl.lit(None))
      .otherwise(
          pl.concat_list(
              pl.struct(
                  pl.lit("en").alias("locale"),
                  pl.col("description").alias("translation"),
              )
          )
      )
      .alias("description")
)
shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
│ null            │
│ null            │
│ [{"en","hi"}]   │
│ null            │
└─────────────────┘

@Matthias-Warlop
Copy link
Author

I have been investigating further. The issue seems to be more complex.

df_silver = pl.read_parquet(f"data/silver/locations.parquet")
df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.struct(
                    pl.lit("en").alias("locale"),
                    pl.col("description").alias("translation"),
                )
            )
        )
        .alias("description"),
    ]
)
# printing the the range of rows where the issue occurs
print(df_gold[60:70])
shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

As you can see, no issue.
If I then do

df_gold.write_parquet(f"data/gold/locations.parquet")
df_gold_read = pl.read_parquet(f"data/gold/locations.parquet")
print(df_gold_read[60:70])
shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en",""}]                     │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

I see the empty translation field again, just like I saw when exploring the parquet file in data wrangler.
Even stranger is when i do this:

df_gold.write_json(f"data/gold/eldrive_locations.csv")
df_gold_read = pl.read_json(f"data/gold/eldrive_locations.csv")
print(df_gold_read[60:70])
shape: (10, 1)
┌─────────────────────────────────┐
│ description                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ null                            │
│ [{"en",null}]                   │
│ null                            │
│ [{"en","Parking working hours … │
│ [{"en",null}]                   │
│ [{"en",null}]                   │
│ [{"en","Parking working hours … │
│ null                            │
│ [{"en",null}]                   │
│ [{"en","Parking working hours … │
└─────────────────────────────────┘

To show the complete values in the dataframe I ran:

for row in df_gold[60:70].iter_rows(named=True):
    print(row["description"])
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms.'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
None
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]
[{'locale': 'en', 'translation': 'Parking working hours and tariffs as per location terms'}]

@ritchie46
Copy link
Member

Can you update to latest Polars (1.2.1) and confirm it still occurs?

@Matthias-Warlop
Copy link
Author

Matthias-Warlop commented Jul 23, 2024

Sorry, I thought that I already was on the latest version. The ouput of my tests are identical with the new version.

pl.show_versions()
--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             Linux-6.8.0-38-generic-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

I also tried using the dataframe you used in your example. Saving as json and parquet give the same results here. printing the read dataframe:

shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
│ null            │
│ null            │
│ [{"en",""}]     │
│ null            │
└─────────────────┘

@ritchie46
Copy link
Member

And have you got a minimal repro? I don't see any way to reproduce your query? Ideally on syntetic data in memory, otherwise from the file. Cut out everything that isn't involved.

@Matthias-Warlop
Copy link
Author

Matthias-Warlop commented Jul 23, 2024

minimal reproducable code:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": ["", "hello", "hi", ""],
    }
)

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.struct(
                    pl.col("description").alias("translation"),
                )
            )
        )
        .alias("description"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)
shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[1]] │
╞═════════════════╡
│ null            │
│ [{"hello"}]     │
│ [{"hi"}]        │
│ null            │
└─────────────────┘
shape: (4, 1)
┌─────────────────┐
│ description     │
│ ---             │
│ list[struct[1]] │
╞═════════════════╡
│ null            │
│ [{""}]          │
│ [{"hello"}]     │
│ null            │
└─────────────────┘

I left away the None value and check for None, because this does not seem to influence the test. Also remove the locale field.

@Matthias-Warlop
Copy link
Author

The struct also does not seem to influence the bug

import polars as pl

df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)
shape: (4, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["hello"]       │
│ ["hi"]          │
│ null            │
└─────────────────┘
shape: (4, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ [""]            │
│ ["hello"]       │
│ null            │
└─────────────────┘

@cmdlineluser
Copy link
Contributor

So it seems to end up being a problem with write_parquet specifically?

If we switch to sink_parquet - it does not replicate the bug:

df_gold.lazy().sink_parquet(f"test.parquet")
pl.read_parquet(f"test.parquet")

# shape: (4, 1)
# ┌─────────────────┐
# │ description_new │
# │ ---             │
# │ list[str]       │
# ╞═════════════════╡
# │ null            │
# │ ["hello"]       │
# │ ["hi"]          │
# │ null            │
# └─────────────────┘

@Matthias-Warlop
Copy link
Author

This indeed fixes the issue. Am I right to think this should be considered as a major issue, as it negatively affects the data quality?
Also I tried further testing and it does have something to do with the creation of the df_gold dataframe (so the .select part). When I create a dataframe similar to df_gold directly like this:

df_silver = pl.DataFrame({"description": [None, ["hello"], ["hi"], None]})

print(df_silver)
df_silver.write_parquet(f"test.parquet")
df_silver_read = pl.read_parquet(f"test.parquet")
print(df_silver_read)
┌─────────────┐
│ description │
│ ---         │
│ list[str]   │
╞═════════════╡
│ null        │
│ ["hello"]   │
│ ["hi"]      │
│ null        │
└─────────────┘
shape: (4, 1)
┌─────────────┐
│ description │
│ ---         │
│ list[str]   │
╞═════════════╡
│ null        │
│ ["hello"]   │
│ ["hi"]      │
│ null        │
└─────────────┘

Both the hello and hi appear correctly in the output.

@ritchie46
Copy link
Member

import polars as pl

df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})

df_gold = df_silver.select(
    [
        pl.when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold)
df_gold.write_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)

@coastalwhite can you take a look?

@Matthias-Warlop
Copy link
Author

Matthias-Warlop commented Jul 24, 2024

The lazy().sink_parquet does not seem to fix the issue. The bug is extemely specific as to when it occurs, but when it does, it occurs consistently.
Take the following code:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": [
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            None,
            "Hello",
            None,
            None,
            "Hello",
            None,
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
        ],
    }
)


df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold[10:])
df_gold.lazy().sink_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read[10:])
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ [""]            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘

when the "" value is placed one row higher during the creation of the dataframe:

import polars as pl

df_silver = pl.DataFrame(
    {
        "description": [
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            None,
            "Hello",
            None,
            None,
            "Hello",
            None,
            "Hello",
            "Hello",
            "Hello",
            "",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
            "Hello",
        ],
    }
)


df_gold = df_silver.select(
    [
        pl.when(pl.col("description").is_null())
        .then(pl.lit(None))
        .when(pl.col("description") == "")
        .then(pl.lit(None))
        .otherwise(
            pl.concat_list(
                pl.col("description").alias("translation"),
            )
        )
        .alias("description_new"),
    ]
)

print(df_gold[10:])
df_gold.lazy().sink_parquet(f"test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read[10:])
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘
shape: (10, 1)
┌─────────────────┐
│ description_new │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ null            │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
│ ["Hello"]       │
└─────────────────┘

There is no longer a bug. lazy().sink_parquet seems to alter the behaviour of the bug, but the bug is still very much there. Again when reading the parquet I can see the wrong values too, so the bug is not because of the pl.read_parquet

@Matthias-Warlop Matthias-Warlop changed the title Select expression with .when and .then statements gives incorrect results depending on preceding row. Using pl.write_parquet() gives wrong results for values inside lists. Jul 24, 2024
@cmdlineluser
Copy link
Contributor

cmdlineluser commented Jul 24, 2024

Trying to find out when this changed:

pl.__version__
df_gold[10:].equals(df_gold_read[10:])

It seems it happened after 0.20.15

'0.20.16'
False
'0.20.15'
True

Looking at the release notes for 0.20.16:

https://github.com/pola-rs/polars/releases/tag/py-0.20.16

  • add new when-then-otherwise kernels #15089

As the bug seems to depend on when-then, this may be a starting point for further investigation.

@coastalwhite coastalwhite self-assigned this Jul 24, 2024
@coastalwhite coastalwhite added P-high Priority: high A-io-parquet Area: reading/writing Parquet files and removed needs triage Awaiting prioritization by a maintainer labels Jul 24, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 24, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 24, 2024
Fixes pola-rs#17805.

This fixes an issue on the Parquet writer where values that would be valid in
the primitive array but invalid at a higher nesting level would still be
written. This could for example be true when do `x = (x == "") ? [ x ] : None`.
In this case, the empty string might still be valid but the above list is not
valid anymore.

This is solved by walking through the structure and propagating the nulls to
the lower levels in the parquet writer.
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 24, 2024
Fixes pola-rs#17805.

This fixes an issue on the Parquet writer where values that would be valid in
the primitive array but invalid at a higher nesting level would still be
written. This could for example be true when do `x = (x == "") ? [ x ] : None`.
In this case, the empty string might still be valid but the above list is not
valid anymore.

This is solved by walking through the structure and propagating the nulls to
the lower levels in the parquet writer.
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 24, 2024
Fixes pola-rs#17805.

This fixes an issue on the Parquet writer where values that would be valid in
the primitive array but invalid at a higher nesting level would still be
written. This could for example be true when do `x = (x == "") ? [ x ] : None`.
In this case, the empty string might still be valid but the above list is not
valid anymore.

This is solved by walking through the structure and propagating the nulls to
the lower levels in the parquet writer.
@c-peters c-peters added the accepted Ready for implementation label Jul 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io-parquet Area: reading/writing Parquet files accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants