Segfault when sorting with nulls_last on all-null no-category lexical categorical since 0.20.22 #16004

Closed
2 tasks done
wenjuno opened this issue May 2, 2024 · 2 comments · Fixed by #16013
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

wenjuno commented May 2, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# Left join with no matching key, so column "b" comes out all-null.
df1 = pl.DataFrame({'a': [1]})
df2 = pl.DataFrame({'a': [-1], 'b': pl.Series(['A'], dtype=pl.Categorical('lexical'))})
df3 = df1.join(df2, on='a', how='left')
print(f"{df3=}")
print(f"{df3.dtypes=}")
print(f"{df3['b'].cat.get_categories()=}")
# df3["b"] still carries the category "A"; this sort succeeds.
print(f"{df3.sort(['a', 'b'], nulls_last=True)=}")

# Casting to the same dtype leaves "b" with an empty category list.
df4 = df3.cast({'b': pl.Categorical('lexical')})
print(f"{df4=}")
print(f"{df4.dtypes=}")
print(f"{df4['b'].cat.get_categories()=}")
# This sort segfaults (see log output).
print(f"{df4.sort(['a', 'b'], nulls_last=True)=}")

Log output

df3=shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ cat  │
╞═════╪══════╡
│ 1   ┆ null │
└─────┴──────┘
df3.dtypes=[Int64, Categorical(ordering='lexical')]
df3['b'].cat.get_categories()=shape: (1,)
Series: 'b' [str]
[
        "A"
]
df3.sort(['a', 'b'], nulls_last=True)=shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ cat  │
╞═════╪══════╡
│ 1   ┆ null │
└─────┴──────┘
df4=shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ cat  │
╞═════╪══════╡
│ 1   ┆ null │
└─────┴──────┘
df4.dtypes=[Int64, Categorical(ordering='lexical')]
df4['b'].cat.get_categories()=shape: (0,)
Series: 'b' [str]
[
]
zsh: segmentation fault (core dumped)

Issue description

To be clear, since 0.20.22 the bug seems to be triggered when both of these conditions hold:

  1. The sort is over multiple columns with nulls_last=True, AND
  2. One or more of those columns is a lexically ordered Categorical with no categories and all-null values.

It works fine with 0.20.21.
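
For anyone checking whether an install falls in the reported range, here is a minimal sketch using only the standard library. The helper names are made up for illustration and are not part of Polars. It assumes plain dotted versions such as "0.20.23" (no pre-release suffixes). It applies no upper bound, because this thread does not say which release contains the fix.

```python
# Hypothetical helper for illustration only; not a Polars API.
# First release reported as crashing in this issue.
FIRST_AFFECTED = (0, 20, 22)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as '0.20.23' into (0, 20, 23).

    Assumes purely numeric components; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in version.split("."))


def at_or_after_first_affected(version: str) -> bool:
    """Return True if `version` is 0.20.22 or later.

    No upper bound is checked: the release that ships the fix is not
    stated in this thread, so later versions are not ruled out here.
    """
    return parse_version(version) >= FIRST_AFFECTED


if __name__ == "__main__":
    for v in ("0.20.21", "0.20.22", "0.20.23"):
        print(v, at_or_after_first_affected(v))
```

In practice the string would come from pl.__version__; a True result only means the version is not older than the first affected release.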

Note that df2["b"] in the example does not have to be a lexical Categorical; it could just as well be a string column, and the sort on df4 would still segfault. I made df2["b"] a Categorical to show that an empty category list seems essential to triggering the problem: df3 still carries the category "A" and sorts fine, while df4 has no categories and crashes. The cast appears to recompute the categories even though the type does not actually change.

Expected behavior

Sort the dataset without crashing.

Installed versions

--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-6.8.7-arch1-1-x86_64-with-glibc2.39
Python:               3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@wenjuno wenjuno added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 2, 2024
@orlp orlp added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels May 2, 2024
orlp (Collaborator) commented May 2, 2024

Can reproduce.

@orlp orlp self-assigned this May 2, 2024
ritchie46 (Member) commented

Ah, contention. Never mind.
