encoding/xml: unable to handle utf-16-encoded file without manual manipulation of source bytes #38335
Labels: NeedsInvestigation
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

What did you do?
I have a file like the following, encoded in `utf-16` (BOM is little endian) on disk:

When I read the file from disk, the bytes (including the line containing `<?xml version="1.0" encoding="utf-16"?>`) are encoded in `utf-16`.

I wanted to parse the full file, with no modification, using the `encoding/xml` package.

What did you expect to see?
I expected to be able to either A: transform the file's bytes to `utf-8` and pass that reader to `xml.NewDecoder` to successfully parse the `utf-8` data as XML, or B: pass the `utf-16`-encoded bytes to `xml.NewDecoder` and provide a `CharsetReader` to successfully parse the `utf-16` data as XML.

What did you see instead?
There were a couple of error cases.

- Pass `resultOfOsOpen` directly to `xml.NewDecoder`, with or without setting `CharsetReader` to `charset.NewReaderLabel`: `XML syntax error on line 1: invalid UTF-8`
- Pass the `utf-8` reader returned by `charset.NewReader(resultOfOsOpen, "text/xml")` to `xml.NewDecoder`: `xml: encoding "utf-16" declared but Decoder.CharsetReader is nil`
- Pass the `utf-8` reader returned by `charset.NewReader(resultOfOsOpen, "text/xml")` to `xml.NewDecoder`, AND set `CharsetReader` to `charset.NewReaderLabel`: the (now `utf-8`-encoded) data is interpreted as `utf-16` and the decoder reads the file as gibberish.

It seems to me that the `encoding/xml` package expects the line containing `<?xml version="1.0" encoding="utf-16"?>` to be in some encoding that resembles valid `utf-8`-encoded text in order to read the encoding line and properly parse the rest of the file, OR for the line to be removed if manual transformation of the input is done beforehand (like with `utf-16`, which cannot be read as valid `utf-8` text).

Am I missing something? Is there a way to do this without modifying the input bytes?
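For illustration, here is a minimal sketch of one variation on option A. It assumes that, once the bytes have been transcoded to `utf-8` in memory, a `CharsetReader` that ignores the declared label and returns its input unchanged is enough to satisfy the decoder. The names `note`, `utf16ToUTF8`, `parseUTF16XML`, and `encodeUTF16LE` are hypothetical, the sample document is made up, and the sketch uses only the standard library (`unicode/utf16`) in place of the `charset` package mentioned above. It also assumes little endian when no BOM is present.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/xml"
	"fmt"
	"io"
	"unicode/utf16"
)

// note is a hypothetical document shape used only for this illustration.
type note struct {
	Body string `xml:"body"`
}

// utf16ToUTF8 transcodes UTF-16 bytes to UTF-8. It honors a leading BOM and
// falls back to little endian when none is present (an assumption of this sketch).
func utf16ToUTF8(b []byte) ([]byte, error) {
	if len(b)%2 != 0 {
		return nil, fmt.Errorf("odd byte count %d for UTF-16 input", len(b))
	}
	var order binary.ByteOrder = binary.LittleEndian
	switch {
	case bytes.HasPrefix(b, []byte{0xFF, 0xFE}): // little-endian BOM
		b = b[2:]
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}): // big-endian BOM
		order = binary.BigEndian
		b = b[2:]
	}
	units := make([]uint16, len(b)/2)
	for i := range units {
		units[i] = order.Uint16(b[2*i:])
	}
	return []byte(string(utf16.Decode(units))), nil
}

// parseUTF16XML decodes UTF-16 XML without editing the declaration line.
func parseUTF16XML(raw []byte) (note, error) {
	var n note
	utf8Data, err := utf16ToUTF8(raw)
	if err != nil {
		return n, err
	}
	dec := xml.NewDecoder(bytes.NewReader(utf8Data))
	// The data is already UTF-8 here, so the declared label is ignored and
	// the input is handed back unchanged instead of being converted again.
	dec.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
		return input, nil
	}
	err = dec.Decode(&n)
	return n, err
}

// encodeUTF16LE builds BOM-prefixed UTF-16LE bytes to stand in for the file on disk.
func encodeUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

func main() {
	raw := encodeUTF16LE(`<?xml version="1.0" encoding="utf-16"?><note><body>hi</body></note>`)
	n, err := parseUTF16XML(raw)
	fmt.Println(n.Body, err) // prints "hi <nil>"
}
```

The pass-through `CharsetReader` is the key design choice in this sketch: it avoids both the `CharsetReader is nil` error and the double conversion described in the third case, at the cost of trusting that the caller has already produced valid `utf-8`.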