[BUG] reverse_index fails for some unicode strings #767

Closed
pmichaud opened this Issue May 8, 2012 · 5 comments

Comments

Projects
None yet
3 participants
Owner

pmichaud commented May 8, 2012

The 'reverse_index' method on the String PMC fails for some unicode-encoded strings. Here's a simple test demonstrating the bug:

pmichaud@kiwi:~/p6/parrot$ cat x.pir
.sub 'main'
    # works, outputs 1
    $P0 = box "hello"
    $I0 = $P0.'reverse_index'('e', 0)
    say $I0                

    # fails, outputs -1
    $P0 = box unicode:"hello"
    $I0 = $P0.'reverse_index'('e', 0)
    say $I0                
.end

pmichaud@kiwi:~/p6/parrot$ ./parrot x.pir
1
-1
pmichaud@kiwi:~/p6/parrot$ 
Contributor

Whiteknight commented May 8, 2012

The String.reverse_index method eventually leads back to the encoding_rindex() function in https://github.com/parrot/parrot/blob/master/src/string/encoding/shared.c. I wont have time to look at this issue until tonight.

If some other developer wants to set up a debugger with breakpoints at Parrot_String_nci_reverse_index and encoding_rindex and figure out where the operation is going wrong with the unicode strings, that would be a big help.

Owner

pmichaud commented May 8, 2012

See also Rakudo bug RT #112818.

Pm

Contributor

Whiteknight commented May 8, 2012

Actually, on further inspection I suspect the unicode case is actually working correctly, while the ascii case is the one that has the bug. The String PMC source code has this to say about reverse_index:

Find last occurrence of substring, but not after the start position.

In the example above, calling $I0 = $P0.'reverse_index'('e', 4) (notice changing the poorly-named "start" parameter to 4) gives us the correct expected answer. If that's the case, we have a few things to look at:

  1. The ASCII case needs to be fixed to return -1 in this case
  2. The reverse_index method probably needs a better interface where the parameter is renamed something less confusing, and is optional in the case where we just want to search the whole string.

Does this sound reasonable to do?

Member

Benabik commented May 8, 2012

The unicode version of reverse_index is encoding_rindex, but the ASCII version is fixed8_rindex which is a thin wrapper around Parrot_util_byte_rindex.

encoding_rindex says "oh, you wanted to start searching backwards at position 0" and returns -1.

Parrot_util_byte_rindex says "oh, you gave me false for start, I'll start at the end of the string". (src/util.c:642)

Based on the documentation for reverse_index, I'd also argue that the ASCII case is wrong. Making the parameter optional and defaulting to "search the whole string" does seem the saner direction to go.

Whiteknight was assigned May 8, 2012

Contributor

Whiteknight commented May 8, 2012

I've fixed the reverse_index method so the second parameter is optional (defaults to length(str)) and to return -1 on the ascii case as well as the unicode case. Also, on request from @moritz, I've added two new experimental rindex (i_s_s and i_s_s_i) opcodes, which perform exactly the same operation but do not require you to create a String PMC first, which should help with GC pressure.

Whiteknight removed their assignment Mar 7, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment