Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix groupby behaviour on columns with missing values #3391

Merged
merged 4 commits into from Nov 29, 2022

Conversation

oleksiyskononenko
Copy link
Contributor

@oleksiyskononenko oleksiyskononenko commented Nov 29, 2022

It appears as though we never initialized na_position_ in the case of dt.by(), and this resulted in some random data corruption for columns, that contain missing values. As of this PR, we initialize na_position_ to NaPosition::FIRST to be consistent with what we declare in dt.by() documentation:

The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values.

Also, we switch to python 3.8 for testing debug wheels, so that we keep track of the status of the mentioned groupby tests.

Closes #3331

@oleksiyskononenko oleksiyskononenko added this to the Release 1.1.0 milestone Nov 29, 2022
@oleksiyskononenko oleksiyskononenko added the FIX Fix for an issue label Nov 29, 2022
@oleksiyskononenko oleksiyskononenko self-assigned this Nov 29, 2022
@oleksiyskononenko oleksiyskononenko added the bug Any bugs / errors in datatable; however for severe bugs use [segfault] label label Nov 29, 2022
@oleksiyskononenko oleksiyskononenko changed the title [WIP] Fix groupby tests Fix default groupby behaviour Nov 29, 2022
@oleksiyskononenko oleksiyskononenko changed the title Fix default groupby behaviour Fix groupby behaviour on columns with missing values Nov 29, 2022
@oleksiyskononenko oleksiyskononenko merged commit 7e8f19f into main Nov 29, 2022
@oleksiyskononenko oleksiyskononenko deleted the ok/gby-fix branch November 29, 2022 18:52
samukweku pushed a commit that referenced this pull request Nov 30, 2022
It appears as though we never initialized `na_position_` in the case of `dt.by()`, and this resulted in some random data corruption for columns, that contain missing values. As of this PR, we initialize `na_position_` to `NaPosition::FIRST` to be consistent with what we declare in `dt.by()` [documentation](https://datatable.readthedocs.io/en/latest/api/dt/by.html):

```
The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values.
```

Also, we switch to python 3.8 for testing debug wheels, so that we keep track of the status of the mentioned groupby tests.
 
Closes #3331
samukweku pushed a commit that referenced this pull request Jan 3, 2023
It appears as though we never initialized `na_position_` in the case of `dt.by()`, and this resulted in some random data corruption for columns, that contain missing values. As of this PR, we initialize `na_position_` to `NaPosition::FIRST` to be consistent with what we declare in `dt.by()` [documentation](https://datatable.readthedocs.io/en/latest/api/dt/by.html):

```
The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values.
```

Also, we switch to python 3.8 for testing debug wheels, so that we keep track of the status of the mentioned groupby tests.
 
Closes #3331
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Any bugs / errors in datatable; however for severe bugs use [segfault] label FIX Fix for an issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Several groupby tests are failing for debug build on Windows
2 participants