BUG: pd.read_xml read chinese tag throw Syntax error #47902

fhopecc · 2022-07-30T11:06:52Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
xml = "<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>"
df = pd.read_xml(xml)

Issue Description

XML that contains Chinese tag names causes pd.read_xml to raise lxml.etree.XMLSyntaxError,
because pandas.io.xml includes the code:
self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
_parse_doc uses lxml to parse the XML file, then passes the result to lxml.etree.tostring, which returns bytes like the following:

b'<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>'

lxml.etree.XML then parses these bytes and reports the error.
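
To illustrate the kind of failure described above, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made for portability; lxml's exact output and exception type may differ). It shows that serializing a tree with the default ASCII encoding cannot represent a non-ASCII tag name, so re-parsing the bytes fails, while a UTF-8 round trip succeeds.

```python
import xml.etree.ElementTree as ET

xml = "<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>"
root = ET.fromstring(xml)

# The default serialization targets US-ASCII, so non-ASCII characters are
# written as character references, which are not legal inside a tag name.
ascii_bytes = ET.tostring(root)

try:
    ET.fromstring(ascii_bytes)
    reparse_failed = False
except ET.ParseError:
    reparse_failed = True

# Serializing as UTF-8 keeps the tag name intact, so the round trip works.
utf8_bytes = ET.tostring(root, encoding="utf-8")
roundtrip_tag = ET.fromstring(utf8_bytes).tag

print(reparse_failed, roundtrip_tag)
```

The names reparse_failed and roundtrip_tag are illustrative only and do not appear in pandas.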

Expected Behavior

Parse the Chinese tag correctly and return the correct df:

c1 c2
1 2

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : AMD64 Family 23 Model 104 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : zh_TW
LOCALE : Chinese (Traditional)_Taiwan.950

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.1.2
setuptools : 58.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.8.10
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None


ParfaitG · 2022-07-30T19:18:33Z

Thank you for this report! As you note, we unnecessarily convert the parsed document to bytes with tostring() for a subsequent XML() call which works for Western Latin based character sets. Avoiding this conversion resolves the issue and preserves original functionality. Fix forthcoming.
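
The idea of avoiding the bytes conversion can be sketched as follows. This is not the code from the pandas fix; it is a small standalone illustration using the standard library's xml.etree.ElementTree (an assumption, chosen so the snippet runs without lxml), and the row-to-dict step is hypothetical helper logic. The document is parsed once and the resulting tree is used directly, so no tostring() and second parse are involved.

```python
import io
import xml.etree.ElementTree as ET

data = "<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>".encode("utf-8")

# Parse once and keep the tree; there is no tostring() / re-parse round trip.
root = ET.parse(io.BytesIO(data)).getroot()

# Build one dict per <row>, keyed by child tag name.
rows = [{child.tag: child.text for child in row} for row in root.findall("row")]

print(rows)
```

Values come back as strings here; converting them to numeric types is left out of the sketch.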

fhopecc added the Bug and Needs Triage labels Jul 30, 2022

ParfaitG added the IO XML label and removed the Needs Triage label Jul 30, 2022

ParfaitG mentioned this issue Jul 31, 2022

BUG: Fix read_xml raising syntax error when reading XML with Chinese tags #47905

Merged

5 tasks

mroeschke closed this as completed in #47905 Aug 1, 2022

simonjayhawkins modified the milestones: 1.4.4, 1.5 Aug 2, 2022
