Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement na_position argument for sort() #793

Closed
st-pasha opened this issue Mar 1, 2018 · 4 comments · Fixed by #2659
Closed

Implement na_position argument for sort() #793

st-pasha opened this issue Mar 1, 2018 · 4 comments · Fixed by #2659
Assignees
Labels
improve Improvement of an existing functionality low priority Low priority tasks sort Issues related to sorting
Projects
Milestone

Comments

@st-pasha
Copy link
Contributor

st-pasha commented Mar 1, 2018

Possible values: "first" (default), "last", "remove".

@st-pasha st-pasha added improve Improvement of an existing functionality sort Issues related to sorting labels Mar 1, 2018
@st-pasha st-pasha self-assigned this Mar 1, 2018
@st-pasha st-pasha added this to To Do in Sorting Mar 1, 2018
@st-pasha st-pasha added the low priority Low priority tasks label Jun 2, 2018
@pradkrish
Copy link
Collaborator

pradkrish commented Sep 20, 2020

@st-pasha @oleksiyskononenko

I have been working on this issue for a couple of days now and I have a branch with a preliminary implementation.

What's on the branch:

  1. Possible values na_position can handle are first, last and remove.
  2. Add tests

Please feel free to pull the branch from here to test. Parts of the implementation is subjected to change as the task grows. I am happy to get any feedback.

Here is the sample script I use to test the branch

from datatable import dt, f

dt_int = dt.Frame(x=[-5,-8,None,11,2,8,None,4])
dt_bool = dt.Frame(x=[True,None,False,None,False,True])
dt_float = dt.Frame(x=[-5.9,-8.3,11.5576,2.2,8.9,None,4.1])

na_pos = "last" # default: "first"
rev = False

dt_int[:, :, dt.sort(0, reverse=rev, na_position=na_pos)]
dt_bool[:, :, dt.sort(0, reverse=rev, na_position=na_pos)]
dt_float[:, :, dt.sort(0, reverse=rev, na_position=na_pos)]

If no one is working it, I wouldn't mind to have this task assigned to me.

@pradkrish
Copy link
Collaborator

pradkrish commented Sep 21, 2020

@st-pasha @oleksiyskononenko

A general question regarding Buffer::acquire in src/core/sort.cc:625.

Say my container_o contains the following data [5, 1, 2, 0, 6]. If I would like to create a rowindex with the first 4 elements, I can write

auto data = static_cast<int32_t*>(container_o.release());
return RowIndex(Buffer::acquire(data, (n-1) * sizeof(int32_t)), RowIndex::ARR32); // n = 5

and that works. The output is column values at row ids [5, 1, 2, 0].

On the other hand, If I would like to create a rowindex from the last 4 elements starting from index 1, a simple pointer arithmatic like the following

auto data = static_cast<int32_t*>(container_o.release());
return RowIndex(Buffer::acquire(data + 1, (n-1) * sizeof(int32_t)), RowIndex::ARR32); // n = 5

throws a segmentation fault. I am posting part of the backtrace here

#0  musable (mem=0x9ea8c4) at malloc.c:4855
#1  __malloc_usable_size (m=0x9ea8c4) at malloc.c:4867
#2  0x00007ffff636747c in Memory_BufferImpl::verify_integrity (this=0xa32220) at src/core/buffer.cc:292
#3  0x00007ffff63660f7 in Buffer::verify_integrity (this=0xa32270) at src/core/buffer.cc:956
#4  0x00007ffff63080a4 in ArrayRowIndexImpl::verify_integrity (this=0xa32250) at src/core/rowindex_array.cc:410
#5  0x00007ffff6306340 in test (o=0xa32250) at src/core/rowindex_array.cc:38
#6  0x00007ffff630655a in ArrayRowIndexImpl::ArrayRowIndexImpl (this=0xa32250, buffer=..., flags=1) at src/core/rowindex_array.cc:60
#7  0x00007ffff63289d8 in RowIndex::RowIndex (this=0x7fffffffce50, buf=..., flags=1) at src/core/rowindex.cc:78
#8  0x00007ffff6333d11 in SortContext::get_result_rowindex (this=0x7fffffffce80, use_na_count=true) at src/core/sort.cc:628
#9  0x00007ffff632f679 in group (columns=std::vector of length 1, capacity 1 = {...}, flags=std::vector of length 1, capacity 1 = {...}, na_pos="first") at src/core/sort.cc:1481
....

The memory address of data[0] is 0x9ea8c0 and data[1] is 0x9ea8c4, which is where the segmentation fault is occuring. At least the pointer arithmetic is not wrong.

Any idea what's wrong with my usage of Buffer::acquire in the second case? Thanks.

@st-pasha
Copy link
Contributor Author

If you look at file buffer.h it explains what Buffer::acquire() does:

    // Buffer::acquire(ptr, n)
    //   Create Buffer from an existing pointer `ptr` to a memory buffer
    //   of size `n`. The ownership of `ptr` will be transferred to the
    //   Buffer object. In this case the `ptr` should have had been
    //   allocated using `dt::malloc`.

So basically when you are using Buffer::acquire, you are saying "I have a pointer ptr (to a memory region of size n), that I allocated, but I don't want to own it anymore. Please take this pointer and make it into a Buffer object". The Buffer object will own this pointer from now on, and will deallocate it when it dies.

Thus, Buffer::acquire(data+1, n-1) is basically as bad as trying to free(data+1).

In your case, your data is owned by the container_o object of type omem. You can convert it into a regular Buffer, and then use Buffer::view() to slice a piece out of it:

auto data = static_cast<int32_t*>(container_o.release());
auto buf = Buffer::acquire(data, n*sizeof(int32_t));
return RowIndex(Buffer::view(buf, (n-1)*sizeof(int32_t), sizeof(int32_t)), RowIndex::ARR32);

Also, container_o should have been a Buffer in the first place. I guess when we written that code the Buffer class was not flexible enough.

@pradkrish
Copy link
Collaborator

pradkrish commented Sep 23, 2020

@st-pasha

That is definitely a much safer way to slice the buffer, thanks.

@st-pasha st-pasha assigned pradkrish and unassigned st-pasha Sep 24, 2020
Sorting automation moved this from To Do to Completed Oct 23, 2020
@st-pasha st-pasha added this to the Release 1.0.0 milestone Dec 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improve Improvement of an existing functionality low priority Low priority tasks sort Issues related to sorting
Projects
No open projects
Sorting
  
Completed
Development

Successfully merging a pull request may close this issue.

2 participants