This repository was archived by the owner on Nov 5, 2022. It is now read-only.

UnicodeReader misdetects UTF-32LE as UTF-16LE #471

@tayloj

Description

UnicodeReader can't actually detect the UTF-32LE encoding. The constructor examines the first few bytes of the input stream in a long if/else-if chain. The blocks for detecting UTF-16LE and UTF-32LE are:

/* ... */
else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
  encoding = "UTF-16LE";
  unread = n - 2;
}
/* ...code for UTF-32BE ... */
else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
  encoding = "UTF-32LE";
  unread = n - 4;
} else /* ... */

The condition for the UTF-32LE case:

(bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
  && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)

can never be reached, because any input that satisfies it also satisfies the earlier UTF-16LE condition:

(bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)

So a stream that begins with a UTF-32LE BOM is always misdetected as UTF-16LE.
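A straightforward fix is to test the longer BOM prefix first, so the four-byte UTF-32LE BOM (FF FE 00 00) is checked before the two-byte UTF-16LE BOM (FF FE). The sketch below is a hypothetical standalone illustration of that ordering, not the actual UnicodeReader code; the class and method names are made up for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomDetector {
    // Hypothetical sketch: check the 4-byte UTF-32LE BOM before the
    // 2-byte UTF-16LE BOM, so the longer prefix wins.
    public static String detect(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 4);
        byte[] bom = new byte[4];
        int n = pb.read(bom, 0, 4);

        String encoding;
        int unread;
        if (n >= 4 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE
                && bom[2] == (byte) 0x00 && bom[3] == (byte) 0x00) {
            encoding = "UTF-32LE";   // tested first: longest match
            unread = n - 4;
        } else if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            encoding = null;         // no BOM recognized
            unread = n;
        }
        // Push back any bytes that were not part of a BOM.
        if (unread > 0) {
            pb.unread(bom, n - unread, unread);
        }
        return encoding;
    }

    public static void main(String[] args) throws IOException {
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00};
        System.out.println(detect(new ByteArrayInputStream(utf32le))); // UTF-32LE
        System.out.println(detect(new ByteArrayInputStream(utf16le))); // UTF-16LE
    }
}
```

Note the residual ambiguity: FF FE 00 00 could in principle also be UTF-16LE text whose first character is U+0000, so longest-prefix-first is the conventional heuristic rather than a perfect oracle.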
