BUG: pd.read_xml read chinese tag throw Syntax error #47902

Closed
2 of 3 tasks
fhopecc opened this issue Jul 30, 2022 · 1 comment · Fixed by #47905
Labels
Bug IO XML read_xml, to_xml

@fhopecc

fhopecc commented Jul 30, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
xml = "<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>"
df = pd.read_xml(xml)

Issue Description

XML that includes a Chinese tag name throws lxml.etree.XMLSyntaxError,
because pandas.io.xml includes this code:
self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
_parse_doc uses lxml to parse the XML file, then passes the result to lxml.etree.tostring, which returns bytes like the following:

b'<&#20013;&#25991;&#27161;&#31844;><row><c1>1</c1><c2>2</c2></row></&#20013;&#25991;&#27161;&#31844;>'

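lxml.etree.XML then parses these bytes and reports the error, because the tag name has been replaced with numeric character references, which are not valid inside a tag name.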
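
The escaping described above can be reproduced outside pandas. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the snippet needs no third-party install); it is not the pandas code path itself, only an illustration of the same serialize-then-reparse round trip.

```python
import xml.etree.ElementTree as ET

xml = "<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>"
root = ET.fromstring(xml)

# Default serialization targets ASCII, so non-ASCII characters (including
# those in tag names) are written as numeric character references.
ascii_bytes = ET.tostring(root)

# Re-parsing those bytes fails: "&#20013;" is not allowed in a tag name.
try:
    ET.fromstring(ascii_bytes)
    reparse_failed = False
except ET.ParseError:
    reparse_failed = True

# Serializing as UTF-8 keeps the tag name intact, so the round trip works.
utf8_root = ET.fromstring(ET.tostring(root, encoding="utf-8"))

print(ascii_bytes)
print("re-parse failed:", reparse_failed)
print("UTF-8 round trip root tag:", utf8_root.tag)
```

The UTF-8 variant is shown only to isolate the cause; it still performs a redundant serialize and re-parse.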

Expected Behavior

Parse the Chinese tag correctly and return the correct df:

c1 c2
1 2

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : AMD64 Family 23 Model 104 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : zh_TW
LOCALE : Chinese (Traditional)_Taiwan.950

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.1.2
setuptools : 58.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.8.10
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None

@fhopecc added the Bug and Needs Triage labels Jul 30, 2022
@ParfaitG added the IO XML label and removed the Needs Triage label Jul 30, 2022
@ParfaitG
Contributor

Thank you for this report! As you note, we unnecessarily convert the parsed document to bytes with tostring() for a subsequent XML() call, which works for Western, Latin-based character sets. Avoiding this conversion resolves the issue and preserves original functionality. Fix forthcoming.
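
As a rough illustration of avoiding the conversion, the sketch below parses the document once and reads rows straight from the resulting element tree, with no tostring() and second parse in between. It uses the standard library's xml.etree.ElementTree rather than lxml, and the row-collection logic is hypothetical; it is not the change made in the actual fix.

```python
import io
import xml.etree.ElementTree as ET

xml = "<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>"

# Parse once and keep the element tree; there is no tostring()/XML() round trip.
root = ET.parse(io.BytesIO(xml.encode("utf-8"))).getroot()

# Collect each <row> into a dict of child tag -> text.
# (Illustrative only; pandas builds its rows differently.)
rows = [{child.tag: child.text for child in row} for row in root.findall("row")]

print(root.tag)
print(rows)
```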


3 participants