concat opcode throws exception when handed unicode constant #837

Closed
pmichaud opened this Issue Sep 19, 2012 · 6 comments

Projects

None yet

3 participants

@pmichaud
Member

The concat opcode doesn't seem to be able to handle two constant arguments when either of them is encoded other than ASCII. Here's the test program:

pmichaud@kiwi:~/p6/nqp/parrot$ cat z1.pir
.sub 'main' :main
    concat $S1, unicode:"\x{a2}", unicode:"\x{a2}"
    say $S1
.end

pmichaud@kiwi:~/p6/nqp/parrot$ ./parrot z1.pir
Invalid character in ASCII string

Expected output is "¢¢\n".

It likewise fails if only one of the arguments is marked with an encoding:

pmichaud@kiwi:~/p6/nqp/parrot$ cat z2.pir
.sub 'main' :main
    concat $S1, unicode:"\x{a2}", "c"
    say $S1
.end

pmichaud@kiwi:~/p6/nqp/parrot$ ./parrot z2.pir
Invalid character in ASCII string

If both strings are ASCII, then things work fine:

pmichaud@kiwi:~/p6/nqp/parrot$ cat z3.pir
.sub 'main' :main
    concat $S1, "c", "c"
    say $S1
.end

pmichaud@kiwi:~/p6/nqp/parrot$ ./parrot z3.pir
cc

Indeed, if the two strings are marked unicode: but contain only ASCII characters, it works:

pmichaud@kiwi:~/p6/nqp/parrot$ cat z4.pir
.sub 'main' :main
    concat $S1, unicode:"\x{62}", unicode:"\x{62}"
    say $S1
.end

pmichaud@kiwi:~/p6/nqp/parrot$ ./parrot z4.pir
bb

This appears to be the source of the problem for NQP issue #55 perl6/nqp#55 .

This problem appears to have existed in Parrot for a long time -- at least back through 3.0.0.

Thanks!

Pm

@leto
Member
leto commented Sep 19, 2012

Thanks for the great bug report! I am working on converting these into failing tests in t/pmc/string.t

@leto leto added a commit that referenced this issue Sep 20, 2012
@leto leto Revert "[t] Add some failing tests for #837 reported by pmichaud++"
This reverts commit 5d7aa2d.
c552eb6
@leto
Member
leto commented Sep 20, 2012

Here is a backtrace from ascii_scan in src/string/encoding/ascii.c

(gdb) bt
#0  ascii_scan (interp=0x100503ba0, src=0x10081e218) at ascii.c:171
#1  0x0000000100011f32 in Parrot_str_new_init (interp=0x100503ba0, buffer=0x10027b960 "ascii", len=5, encoding=0x10031b500, flags=12288) at api.c:676
#2  0x0000000100013229 in Parrot_str_new_constant (interp=0x100503ba0, buffer=0x10027b960 "ascii") at api.c:629
#3  0x0000000100074d80 in Parrot_str_internal_register_encoding_names (interp=0x100503ba0) at encoding.c:383
#4  0x00000001000750b6 in Parrot_encodings_init (interp=0x100503ba0) at encoding.c:456
#5  0x0000000100011d43 in Parrot_str_init (interp=0x100503ba0) at api.c:155
#6  0x000000010009df70 in Parrot_interp_initialize_interpreter (interp=0x100503ba0, args=0x7fff5fbff250) at api.c:262
#7  0x000000010006a090 in Parrot_api_make_interpreter (parent=0x0, flags=0, args=0x100503b70, interp=0x7fff5fbff368) at api.c:154
#8  0x0000000100000c1c in main (argc=2, argv=0x7fff5fbff3a8) at main.c:142

It seems like we have a bug in src/string/api.c on line 629. Parrot_str_new_constant always creates new strings with the default encoding, which seems wrong to me.

@leto
Member
leto commented Sep 20, 2012

This problem is actually deeper than the concat operator. This even simpler test program:

.sub 'main' :main
    $S1 = utf8:"\x{a2}"
    say $S1
.end

has a similar backtrace as above:

(gdb) bt
#0  ascii_scan (interp=0x100503b80, src=0x10081e218) at ascii.c:171
#1  0x0000000100011f32 in Parrot_str_new_init (interp=0x100503b80, buffer=0x10027b960 "ascii", len=5, encoding=0x10031b500, flags=12288) at api.c:676
#2  0x0000000100013229 in Parrot_str_new_constant (interp=0x100503b80, buffer=0x10027b960 "ascii") at api.c:629
...

It seems to me that string constants have a bug where they are internally marked as "ascii" even if they shouldn't be. This seemingly does not manifest as a bug until you call concat() on them. Other string related functions may have similar bugs because of this.

This bug could be related to the "fast path" here: https://github.com/parrot/parrot/blob/master/src/string/api.c#L665

@rurban rurban added a commit that referenced this issue Sep 20, 2012
@rurban rurban [GH #837] Honor encoding in imcc optimizer
Context registers unfortunately do not store type information,
just the values. We would need at least type 'U" here.
So recreate an proper const'ed encoding string from the register
value.
8a56a8e
@rurban
Member
rurban commented Sep 20, 2012

Fixed with 8a56a8e

But the real problem is that SREG registers nor constants cannot handle STRING values, so we loose the encoding information and have to escape and unescape the raw char * with encoding prefix multiple times internally.
I have no idea if the GC cannot handle STRING pointers, but it should.

@rurban
Member
rurban commented Sep 21, 2012

Fixed in branch gh837_concat
Ready to be merged

@rurban rurban added a commit that referenced this issue Sep 21, 2012
@rurban rurban [GH #837] Add benchmark to test encoding slowdown
stress_stringsu.pir tests encoding via sprintf and concat.
The fixed version is only a bit slower then the old broken version.
13.666s (broken) vs 13.506s (fixed) vs 3.022s (without encoding).

So fixing the encoding representation as STRING* in SREG and consts 'U'
is a worthwile goal.
e82482e
@rurban rurban added a commit that referenced this issue Sep 21, 2012
@rurban rurban [GH #837] New comparison benchmark against stress_stringsu.pir
Bad news:
There is almost no measurable speed advantage from ascii strings to encoded strings.
With the fix and without the fix, as sprintf handling slows down the test 4x times.

So converting a SREG and const to encoded STRING* will not gain much performance.
9c81593
@rurban
Member
rurban commented Sep 21, 2012

Merged with 81f3b68

@rurban rurban closed this Sep 21, 2012
@rurban rurban added a commit that referenced this issue Sep 21, 2012
@rurban rurban Merge branch 'gh837_concat'
Fixes [GH #837], a imcc optimizer problem with SREG's and const
not holding STRING*, thus no good encoding information.
81f3b68
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment