This repository has been archived by the owner on Jan 9, 2019. It is now read-only.

Incorrect UTF-8 invalid code point replacement handling #77

Closed
ktakashi opened this issue May 29, 2015 · 3 comments

Comments

@ktakashi

I think the following script should print #\g twice, but it prints #!eof twice instead.

(import (rnrs))

(define buf-size 10)
(define bv (make-bytevector buf-size (char->integer #\a)))
(define (bytevector-append . bvs)
  (let* ((len (fold-left (lambda (sum bv) 
                           (+ (bytevector-length bv) sum)) 0 bvs))
         (r (make-bytevector len)))
    (fold-left (lambda (off bv)
                 (let ((len (bytevector-length bv)))
                   (bytevector-copy! bv 0 r off len)
                   (+ off len)))
               0 bvs)
    r))

(let ((bv2 (bytevector-append bv #vu8(#xe0 #x67 #x0a))))
  (call-with-port (transcoded-port 
                   (open-bytevector-input-port bv2) 
                   (make-transcoder (utf-8-codec)
                                    (eol-style lf)
                                    (error-handling-mode replace)))
    (lambda (in)
      (get-string-n in (+ 1 buf-size)) ;; read until invalid code point
      (write (get-char in)) (newline)))
  (call-with-port (transcoded-port 
                   (open-bytevector-input-port #vu8(#xe0 #x67 #x0a))
                   (make-transcoder (utf-8-codec)
                                    (eol-style lf)
                                    (error-handling-mode replace)))
    (lambda (in)
      (get-char in)
      (write (get-char in)) (newline))))
(flush-output-port (current-output-port))

Version:

$ vicare -V
Vicare Scheme version 0.3d7, 64-bit
Revision devel/d64e00ea0dc3b4f63d72d97eb223103d94a2675b
Build 2015-02-09

Copyright (c) 2006-2010 Abdulaziz Ghuloum and contributors
Copyright (c) 2011-2013 Marco Maggi

This is free software; see the  source or use the '--license' option for
copying conditions.  There is NO warranty; not  even for MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.

@marcomaggi
Owner

Thanks for the report. I will work on this ASAP.

@marcomaggi
Owner

For Vicare's input/output functions: the Book of Truths about Unicode is the library (vicare unsafe unicode).

The offending triplet of octets is:

#xe0 #x67 #x0a

which in binary format is:

#b11100000
#b01100111
#b00001010

The first octet has the bit-pattern of a "first of three octets" UTF-8 sequence: its most significant bits are #b1110xxxx.

The second octet is not a "second of three octets": its most significant bits are #b01xxxxxx, but they should be #b10xxxxxx.

The third octet is not a "third of three octets": its most significant bits are #b00xxxxxx, but they should be #b10xxxxxx.
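The classification above can be checked mechanically. The following is a minimal sketch in plain R6RS; `utf8-octet-class` is a hypothetical helper written for this comment and is not a function from (vicare unsafe unicode).

```scheme
(import (rnrs))

;; Hypothetical helper, not part of Vicare: classify one octet by its
;; most significant bits.
(define (utf8-octet-class octet)
  (cond ((= (bitwise-and octet #b10000000) #b00000000) 'single)          ; 0xxxxxxx
        ((= (bitwise-and octet #b11000000) #b10000000) 'continuation)    ; 10xxxxxx
        ((= (bitwise-and octet #b11100000) #b11000000) 'first-of-two)    ; 110xxxxx
        ((= (bitwise-and octet #b11110000) #b11100000) 'first-of-three)  ; 1110xxxx
        ((= (bitwise-and octet #b11111000) #b11110000) 'first-of-four)   ; 11110xxx
        (else 'invalid)))

;; Classify the offending triplet.
(write (map utf8-octet-class '(#xe0 #x67 #x0a)))
(newline)
;; prints (first-of-three single single)
```

Neither #x67 nor #x0a is classified as `continuation`, which is why the three octets do not form a valid sequence.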

The standard, in the documentation of error-handling-mode, says:

If the error-handling mode is replace, the replacement character U+FFFD is injected into the data stream, an appropriate number of bytes are ignored, and decoding continues with the following bytes.

So it does not specify how many bytes are "appropriate" to ignore. Since a valid "first of three" octet is found, Vicare decides to discard three octets, then goes on. This behaviour looks compliant to me.
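To make the two readings of "appropriate" concrete, here is a rough sketch of how many octets each policy would discard. Everything in it is hypothetical: the helper names and the policy symbols `trust-lead` and `valid-prefix` are invented for illustration, and this is not Vicare's actual decoder.

```scheme
(import (rnrs))

;; Hypothetical helpers for illustration only.
(define (announced-length octet)
  ;; Sequence length announced by a lead octet; 1 for anything else.
  (cond ((= (bitwise-and octet #xe0) #xc0) 2)
        ((= (bitwise-and octet #xf0) #xe0) 3)
        ((= (bitwise-and octet #xf8) #xf0) 4)
        (else 1)))

(define (continuation? octet)
  (= (bitwise-and octet #xc0) #x80))

(define (octets-to-skip bv i policy)
  ;; Number of octets to discard when the sequence starting at index I
  ;; is malformed.
  (let ((remaining (- (bytevector-length bv) i))
        (announced (announced-length (bytevector-u8-ref bv i))))
    (case policy
      ;; Trust the lead octet: drop as many octets as it announces.
      ((trust-lead) (min remaining announced))
      ;; Drop the lead octet plus only the continuation octets that
      ;; actually follow it.
      ((valid-prefix)
       (let loop ((k 1))
         (if (and (< k remaining)
                  (< k announced)
                  (continuation? (bytevector-u8-ref bv (+ i k))))
             (loop (+ k 1))
             k)))
      (else (assertion-violation 'octets-to-skip "unknown policy" policy)))))

(define (next-char-after-error bv policy)
  ;; Character at which decoding would resume (meaningful for ASCII
  ;; octets only), or the eof object when nothing is left.
  (let ((i (octets-to-skip bv 0 policy)))
    (if (< i (bytevector-length bv))
        (integer->char (bytevector-u8-ref bv i))
        (eof-object))))

(define bad #vu8(#xe0 #x67 #x0a))
(write (next-char-after-error bad 'trust-lead)) (newline)   ; the eof object
(write (next-char-after-error bad 'valid-prefix)) (newline) ; #\g
```

Under `trust-lead` all three octets are consumed and the next read hits end of file, which matches the reported output. Under `valid-prefix` only #xe0 is consumed and the next character is #\g, which is what the report expected.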

There is another problem: the behaviour of the input/output functions is inconsistent with the behaviour of the utf8->string function and similar functions. I will have to review that bytevector-to-string code.
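One way to observe such an inconsistency is to decode the same octets through both paths and compare the results. This is only a sketch using standard R6RS procedures; `decode-via-port` is a hypothetical helper, and since R6RS does not spell out what utf8->string does with malformed input, the call is wrapped in `guard` in case an implementation raises an exception.

```scheme
(import (rnrs))

(define bad #vu8(#xe0 #x67 #x0a))

;; Hypothetical helper: decode BV through a transcoded port in
;; `replace' error-handling mode.
(define (decode-via-port bv)
  (call-with-port (transcoded-port
                   (open-bytevector-input-port bv)
                   (make-transcoder (utf-8-codec)
                                    (eol-style lf)
                                    (error-handling-mode replace)))
    (lambda (in)
      (let ((s (get-string-all in)))
        ;; get-string-all returns the eof object on empty input.
        (if (eof-object? s) "" s)))))

(define via-port (decode-via-port bad))

;; The symbol `raised' stands in for the result if utf8->string
;; signals an error on malformed input.
(define via-bytevector
  (guard (e (#t 'raised))
    (utf8->string bad)))

;; #t when both paths agree on the decoded string.
(write (equal? via-port via-bytevector))
(newline)
```

The result depends on the implementation and revision, so no expected output is given here.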

@marcomaggi
Owner

I did a code review. The current code in the master branch should be less wrong and more coherent.
