Using pl.write_parquet() gives wrong results for values inside lists. #17805
Can you show an example of the problem? e.g.
df = pl.DataFrame({
"description": ["", None, "hi", ""],
"locale": ["a", "b", "c", "d"],
})
df.select(
pl.when(pl.col("description").is_null()).then(pl.lit(None))
.when(pl.col("description") == "")
.then(pl.lit(None))
.otherwise(
pl.concat_list(
pl.struct(
pl.lit("en").alias("locale"),
pl.col("description").alias("translation"),
)
)
)
.alias("description")
)
|
I have been investigating further. The issue seems to be more complex.
df_silver = pl.read_parquet("data/silver/locations.parquet")
df_gold = df_silver.select(
[
pl.when(pl.col("description").is_null())
.then(pl.lit(None))
.when(pl.col("description") == "")
.then(pl.lit(None))
.otherwise(
pl.concat_list(
pl.struct(
pl.lit("en").alias("locale"),
pl.col("description").alias("translation"),
)
)
)
.alias("description"),
]
)
# printing the range of rows where the issue occurs
print(df_gold[60:70])
As you can see, no issue.
df_gold.write_parquet("data/gold/locations.parquet")
df_gold_read = pl.read_parquet(f"data/gold/locations.parquet")
print(df_gold_read[60:70])
I see the empty translation field again, just like I saw when exploring the Parquet file in Data Wrangler.
df_gold.write_json("data/gold/eldrive_locations.csv")
df_gold_read = pl.read_json(f"data/gold/eldrive_locations.csv")
print(df_gold_read[60:70])
To show the complete values in the dataframe I ran:
for row in df_gold[60:70].iter_rows(named=True):
print(row["description"])
|
Can you update to latest Polars (1.2.1) and confirm it still occurs? |
Sorry, I thought that I was already on the latest version. The output of my tests is identical with the new version.
pl.show_versions()
I also tried the dataframe you used in your example. Saving as JSON and Parquet gives the same results here. Printing the read dataframe:
|
And have you got a minimal repro? I don't see a way to reproduce your query. Ideally on synthetic data in memory, otherwise from the file. Cut out everything that isn't involved. |
Minimal reproducible code:
import polars as pl
df_silver = pl.DataFrame(
{
"description": ["", "hello", "hi", ""],
}
)
df_gold = df_silver.select(
[
pl.when(pl.col("description") == "")
.then(pl.lit(None))
.otherwise(
pl.concat_list(
pl.struct(
pl.col("description").alias("translation"),
)
)
)
.alias("description"),
]
)
print(df_gold)
df_gold.write_parquet("test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)
I left out the None value and the check for None, because they do not seem to influence the test. I also removed the locale field. |
The struct also does not seem to influence the bug.
import polars as pl
df_silver = pl.DataFrame({"description": ["", "hello", "hi", ""],})
df_gold = df_silver.select(
[
pl.when(pl.col("description") == "")
.then(pl.lit(None))
.otherwise(
pl.concat_list(
pl.col("description").alias("translation"),
)
)
.alias("description_new"),
]
)
print(df_gold)
df_gold.write_parquet("test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read)
|
So it seems to end up being a problem with write_parquet. If we switch to
df_gold.lazy().sink_parquet("test.parquet")
pl.read_parquet(f"test.parquet")
# shape: (4, 1)
# ┌─────────────────┐
# │ description_new │
# │ --- │
# │ list[str] │
# ╞═════════════════╡
# │ null │
# │ ["hello"] │
# │ ["hi"] │
# │ null │
# └─────────────────┘ |
This indeed fixes the issue. Am I right to think this should be considered a major issue, as it negatively affects data quality?
df_silver = pl.DataFrame({"description": [None, ["hello"], ["hi"], None]})
print(df_silver)
df_silver.write_parquet("test.parquet")
df_silver_read = pl.read_parquet(f"test.parquet")
print(df_silver_read)
Both "hello" and "hi" appear correctly in the output. |
@coastalwhite can you take a look? |
The position of the "" value also seems to matter. The issue still occurs with:
import polars as pl
df_silver = pl.DataFrame(
{
"description": [
"Hello",
"Hello",
"Hello",
"Hello",
"Hello",
None,
"Hello",
None,
None,
"Hello",
None,
"Hello",
"Hello",
"Hello",
"Hello",
"",
"Hello",
"Hello",
"Hello",
"Hello",
],
}
)
df_gold = df_silver.select(
[
pl.when(pl.col("description").is_null())
.then(pl.lit(None))
.when(pl.col("description") == "")
.then(pl.lit(None))
.otherwise(
pl.concat_list(
pl.col("description").alias("translation"),
)
)
.alias("description_new"),
]
)
print(df_gold[10:])
df_gold.lazy().sink_parquet("test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read[10:])
But when the empty string is moved one row earlier:
import polars as pl
df_silver = pl.DataFrame(
{
"description": [
"Hello",
"Hello",
"Hello",
"Hello",
"Hello",
None,
"Hello",
None,
None,
"Hello",
None,
"Hello",
"Hello",
"Hello",
"",
"Hello",
"Hello",
"Hello",
"Hello",
"Hello",
],
}
)
df_gold = df_silver.select(
[
pl.when(pl.col("description").is_null())
.then(pl.lit(None))
.when(pl.col("description") == "")
.then(pl.lit(None))
.otherwise(
pl.concat_list(
pl.col("description").alias("translation"),
)
)
.alias("description_new"),
]
)
print(df_gold[10:])
df_gold.lazy().sink_parquet("test.parquet")
df_gold_read = pl.read_parquet(f"test.parquet")
print(df_gold_read[10:])
There is no longer a bug. |
Trying to find out when this changed:
pl.__version__
# '0.20.15'
df_gold[10:].equals(df_gold_read[10:])
# True
It seems it happened after 0.20.15. Looking at the release notes for https://github.com/pola-rs/polars/releases/tag/py-0.20.16
As the bug seems to depend on when-then, this may be a starting point for further investigation. |
Fixes pola-rs#17805. This fixes an issue in the Parquet writer where values that are valid in the primitive array but invalid at a higher nesting level would still be written. This can, for example, happen with `x = (x == "") ? [ x ] : None`: the empty string itself may still be valid, but the list above it is no longer valid. This is solved by walking through the structure and propagating the nulls to the lower levels in the Parquet writer.
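The null propagation the fix describes can be sketched in plain Python (a hypothetical helper for illustration, not the actual writer code): before primitive values are written, every child slot under a null list must itself be masked out:

```python
def propagate_list_nulls(offsets, list_validity, child_values):
    """Return child values with entries under a null list masked out.

    offsets: n+1 ints delimiting each list's slice of child_values.
    list_validity: n bools; False means the list slot is null.
    child_values: the flat child (primitive) array.
    """
    out = list(child_values)
    for i, valid in enumerate(list_validity):
        if not valid:
            for j in range(offsets[i], offsets[i + 1]):
                # The value may be valid at the primitive level, but its
                # parent list is null, so it must not be written as-is.
                out[j] = None
    return out

# Mirrors the thread's repro: rows 0 and 3 are null lists, yet their child
# slots still hold the original empty strings.
offsets = [0, 1, 2, 3, 4]
validity = [False, True, True, False]
children = ["", "hello", "hi", ""]
print(propagate_list_nulls(offsets, validity, children))
# → [None, 'hello', 'hi', None]
```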
Checks
Reproducible example
Log output
No response
Issue description
When the description in df_silver is equal to an empty string "", the resulting df_gold has a None value (this is expected). However, the subsequent row, regardless of its value, ends up in the gold description column as
[{'locale': 'en', 'translation': ''}]
Because the value originally was a sentence (a string), this is an unexpected result. If I then filter the dataframe so that the row with the empty string is no longer present, the same row that first came out as
[{'locale': 'en', 'translation': ''}]
now comes out as
[{'locale': 'en', 'translation': '{ORIGINALSENTENCE}'}]
which is the expected result.

Expected behavior
The preceding row in a dataframe should not have any influence on the result of the next row; this is a bug that produces unexpected and nonsensical results.
Installed versions