Incorrect UTF-8 invalid code point replacement handling #77

ktakashi · 2015-05-29T07:29:05Z

I think the following script should print #\g twice, but it prints #!eof twice.

(import (rnrs))

(define buf-size 10)
(define bv (make-bytevector buf-size (char->integer #\a)))
(define (bytevector-append . bvs)
  (let* ((len (fold-left (lambda (sum bv) 
                           (+ (bytevector-length bv) sum)) 0 bvs))
         (r (make-bytevector len)))
    (fold-left (lambda (off bv)
                 (let ((len (bytevector-length bv)))
                   (bytevector-copy! bv 0 r off len)
                   (+ off len)))
               0 bvs)
    r))

(let ((bv2 (bytevector-append bv #vu8(#xe0 #x67 #x0a))))
  (call-with-port (transcoded-port 
                   (open-bytevector-input-port bv2) 
                   (make-transcoder (utf-8-codec)
                                    (eol-style lf)
                                    (error-handling-mode replace)))
    (lambda (in)
      (get-string-n in (+ 1 buf-size)) ;; read until invalid code point
      (write (get-char in)) (newline)))
  (call-with-port (transcoded-port 
                   (open-bytevector-input-port #vu8(#xe0 #x67 #x0a))
                   (make-transcoder (utf-8-codec)
                                    (eol-style lf)
                                    (error-handling-mode replace)))
    (lambda (in)
      (get-char in)
      (write (get-char in)) (newline))))
(flush-output-port (current-output-port))

Version:

$ vicare -V
Vicare Scheme version 0.3d7, 64-bit
Revision devel/d64e00ea0dc3b4f63d72d97eb223103d94a2675b
Build 2015-02-09

Copyright (c) 2006-2010 Abdulaziz Ghuloum and contributors
Copyright (c) 2011-2013 Marco Maggi

This is free software; see the  source or use the '--license' option for
copying conditions.  There is NO warranty; not  even for MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.


marcomaggi · 2015-05-31T06:40:19Z

Thanks for the report. I will work on this ASAP.

marcomaggi · 2015-06-01T06:51:40Z

For Vicare's input/output functions: the Book of Truths about Unicode is the library (vicare unsafe unicode).

The offending triplet of octets is:

#xe0 #x67 #x0a

which in binary format is:

#b11100000
#b01100111
#b00001010

The first octet has the bit-pattern of a "first of three octets" UTF-8 sequence: its most significant bits are #b1110xxxx.

The second octet is not a "second of three octets": its most significant bits are #b01xxxxxx, but they should be #b10xxxxxx.

The third octet is not a "third of three octets": its most significant bits are #b00xxxxxx, but they should be #b10xxxxxx.
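
The bit-pattern checks above can be sketched in R6RS Scheme. This is only an illustration: the name utf8-octet-class is hypothetical and is not part of Vicare or of the (vicare unsafe unicode) library.

```scheme
(import (rnrs))

;; Hypothetical helper: classify one octet by its most significant bits.
(define (utf8-octet-class octet)
  (cond ((= (bitwise-and octet #x80) #x00) 'single)         ; #b0xxxxxxx
        ((= (bitwise-and octet #xC0) #x80) 'continuation)   ; #b10xxxxxx
        ((= (bitwise-and octet #xE0) #xC0) 'first-of-two)   ; #b110xxxxx
        ((= (bitwise-and octet #xF0) #xE0) 'first-of-three) ; #b1110xxxx
        ((= (bitwise-and octet #xF8) #xF0) 'first-of-four)  ; #b11110xxx
        (else 'invalid)))

;; The offending triplet: a "first of three" octet followed by two
;; octets that are not continuation octets.
(write (map utf8-octet-class '(#xe0 #x67 #x0a)))
(newline)
;; prints (first-of-three single single)
```

Both #x67 and #x0a classify as single-octet (ASCII) characters, which is why the reporter expected #\g to be readable after the replacement character.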

The standard, in the documentation of error-handling-mode, says:

If the error-handling mode is replace, the replacement character U+FFFD is injected into the data stream, an appropriate number of bytes are ignored, and decoding continues with the following bytes.

So it does not specify how many bytes it is "appropriate" to ignore. Since a valid "first of three" octet is found, Vicare decides to discard three octets and then goes on. This behaviour looks compliant with the standard to me.
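
To make the two possible readings of "an appropriate number of bytes" concrete, here is a minimal sketch of a replace-mode decoder with a selectable policy. It is not Vicare's implementation: the names decode-with-replace, skip-declared-length and skip-lead-only are hypothetical, and the sketch does not reject overlong encodings.

```scheme
(import (rnrs))

(define (continuation-octet? o)
  (= (bitwise-and o #xC0) #x80))

;; Sequence length announced by a lead octet, or #f if it cannot lead.
(define (declared-length o)
  (cond ((< o #x80) 1)
        ((= (bitwise-and o #xE0) #xC0) 2)
        ((= (bitwise-and o #xF0) #xE0) 3)
        ((= (bitwise-and o #xF8) #xF0) 4)
        (else #f)))

;; True if the LEN-1 octets after index I exist and are continuations.
(define (valid-tail? bv i len end)
  (let loop ((k 1))
    (cond ((= k len) #t)
          ((>= (+ i k) end) #f)
          ((continuation-octet? (bytevector-u8-ref bv (+ i k))) (loop (+ k 1)))
          (else #f))))

;; Decode a well-formed multi-octet sequence starting at index I.
(define (decode-sequence bv i len)
  (let loop ((k 1)
             (cp (bitwise-and (bytevector-u8-ref bv i)
                              (case len ((2) #x1F) ((3) #x0F) (else #x07)))))
    (if (= k len)
        (if (or (<= #xD800 cp #xDFFF) (> cp #x10FFFF))
            #\xFFFD                     ; not a Unicode scalar value
            (integer->char cp))
        (loop (+ k 1)
              (bitwise-ior (bitwise-arithmetic-shift-left cp 6)
                           (bitwise-and (bytevector-u8-ref bv (+ i k)) #x3F))))))

;; POLICY is 'skip-declared-length (drop as many octets as the lead
;; octet announced) or 'skip-lead-only (drop just the lead octet).
(define (decode-with-replace bv policy)
  (let ((end (bytevector-length bv)))
    (let loop ((i 0) (chars '()))
      (if (>= i end)
          (list->string (reverse chars))
          (let* ((o (bytevector-u8-ref bv i))
                 (len (declared-length o)))
            (cond ((not len) (loop (+ i 1) (cons #\xFFFD chars)))
                  ((= len 1) (loop (+ i 1) (cons (integer->char o) chars)))
                  ((valid-tail? bv i len end)
                   (loop (+ i len) (cons (decode-sequence bv i len) chars)))
                  (else
                   (loop (case policy
                           ((skip-declared-length) (min end (+ i len)))
                           (else (+ i 1)))
                         (cons #\xFFFD chars)))))))))

(define triplet #vu8(#xe0 #x67 #x0a))
(write (string-length (decode-with-replace triplet 'skip-declared-length)))
(newline)
(write (string-ref (decode-with-replace triplet 'skip-lead-only) 1))
(newline)
```

Under skip-declared-length the whole triplet collapses into a single U+FFFD, so nothing is left to read afterwards, which matches the #!eof in the report. Under skip-lead-only the second character is #\g, which is what the reporter expected.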

There is another problem: the behaviour of the input/output functions is inconsistent with the behaviour of utf8->string and similar functions. I will have to review the bytevector-to-string code.

marcomaggi · 2015-06-07T14:44:11Z

I did a code review. The current code in the master branch should be less wrong and more coherent.

marcomaggi closed this as completed Jun 7, 2015
