Thank you for this report! As you note, we unnecessarily convert the parsed document to bytes with tostring() for a subsequent XML() call, which works for Western Latin-based character sets. Avoiding this conversion resolves the issue and preserves the original functionality. A fix is forthcoming.
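For readers following along, here is a rough sketch of the idea of parsing once and keeping the element tree, rather than serializing to bytes and parsing again. It is not the actual pandas patch: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and parse_root is a hypothetical helper name introduced only for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# The document from the report, encoded as UTF-8 bytes.
XML_BYTES = "<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>".encode("utf-8")


def parse_root(source: bytes) -> ET.Element:
    """Hypothetical helper: parse once and return the root element.

    No tostring()/re-parse round trip happens here, so tag names are
    never re-encoded on the way to the caller.
    """
    return ET.parse(io.BytesIO(source)).getroot()


root = parse_root(XML_BYTES)

# Collect each <row> as a mapping of child tag name to text.
rows = [{child.tag: child.text for child in row} for row in root.findall("row")]

print(root.tag)
print(rows)
```

Because the tree is handed over directly, the serializer's output encoding never affects the tag names.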
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Reading XML that contains Chinese tag names raises lxml.etree.XMLSyntaxError.
The cause is this line in pandas.io.xml:
self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
_parse_doc uses lxml to parse the XML file, then passes the result to lxml's tostring, which returns bytes like the following:
b'<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>'
lxml's XML() then parses these bytes and reports the error.
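The mechanism described above can be illustrated with a small sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml, so the exact bytes and exception type differ from the lxml case in this report. It shows only the general pattern: serializing with the default (ASCII) encoding and re-parsing breaks on a non-ASCII tag name, while an explicit UTF-8 encoding survives the round trip.

```python
import xml.etree.ElementTree as ET

# Build the document from the report in memory (UTF-8 bytes in).
source = "<中文標籤><row><c1>1</c1><c2>2</c2></row></中文標籤>".encode("utf-8")
root = ET.fromstring(source)

# Default serialization targets US-ASCII, so the non-ASCII tag name
# cannot be written literally into the output bytes.
ascii_bytes = ET.tostring(root)

# Re-parsing those bytes fails because the start tag is no longer valid XML.
try:
    ET.fromstring(ascii_bytes)
    reparsed_ok = True
except ET.ParseError:
    reparsed_ok = False

# With an explicit UTF-8 encoding the tag name is preserved.
utf8_bytes = ET.tostring(root, encoding="utf-8")
roundtrip_tag = ET.fromstring(utf8_bytes).tag

print(reparsed_ok)
print(roundtrip_tag)
```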
Expected Behavior
Chinese tag names are parsed correctly and the correct DataFrame is returned:
c1 c2
1 2
Installed Versions
INSTALLED VERSIONS
commit : 4bfe3d0
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : AMD64 Family 23 Model 104 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : zh_TW
LOCALE : Chinese (Traditional)_Taiwan.950
pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.1.2
setuptools : 58.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.8.10
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None