UnicodeDecodeError when import sas7bdat with special characters in Chinese variables #20447

witwall · 2018-03-22T08:29:00Z

Code Sample, a copy-pastable example if possible

Here are the sas7bdat files for testing (the Chinese names end with special characters):

issue.sas7bdat has Chinese values and is imported correctly by pandas.

issue1.sas7bdat has Chinese variable names and fails to import.

import pandas as pd

df1 = pd.read_sas('issue1.sas7bdat', encoding='GBK')

Problem description

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-bcd7a0b4819c> in <module>()
----> 1 df1=pd.read_sas('issue1.sas7bdat',encoding='GBK')

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     59         reader = SAS7BDATReader(filepath_or_buffer, index=index,
     60                                 encoding=encoding,
---> 61                                 chunksize=chunksize)
     62     else:
     63         raise ValueError('unknown SAS format')

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in __init__(self, path_or_buf, index, convert_dates, blank_missing, chunksize, encoding, convert_text, convert_header_text)
     96 
     97         self._get_properties()
---> 98         self._parse_metadata()
     99 
    100     def close(self):

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _parse_metadata(self)
    276                 raise ValueError(
    277                     "Failed to read a meta data page from the SAS file.")
--> 278             done = self._process_page_meta()
    279 
    280     def _process_page_meta(self):

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_page_meta(self)
    282         pt = [const.page_meta_type, const.page_amd_type] + const.page_mix_types
    283         if self._current_page_type in pt:
--> 284             self._process_page_metadata()
    285         return ((self._current_page_type in [256] + const.page_mix_types) or
    286                 (self._current_page_data_subheader_pointers is not None))

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_page_metadata(self)
    312                 self._get_subheader_index(subheader_signature,
    313                                           pointer.compression, pointer.ptype))
--> 314             self._process_subheader(subheader_index, pointer)
    315 
    316     def _get_subheader_index(self, signature, compression, ptype):

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_subheader(self, subheader_index, pointer)
    382             raise ValueError("unknown subheader index")
    383 
--> 384         processor(offset, length)
    385 
    386     def _process_rowsize_subheader(self, offset, length):

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_columntext_subheader(self, offset, length)
    432 
    433         if self.convert_header_text:
--> 434             cname = cname.decode(self.encoding or self.default_encoding)#cname.decode(self.encoding or self.default_encoding,'ignore')
    435         self.column_names_strings.append(cname)
    436 

UnicodeDecodeError: 'gbk' codec can't decode byte 0x8c in position 0: illegal multibyte sequence
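
The failing frame is an ordinary `bytes.decode` call using the default `strict` error handler, so a single invalid byte in a column-name block aborts the whole read. A minimal sketch of that behaviour follows. The bytes in `raw_name` are made up for illustration (the actual bytes stored in issue1.sas7bdat are not shown in this report), and `try_decode` is a hypothetical helper, not part of pandas.

```python
# Hypothetical column-name bytes: valid GBK text with a stray 0x80 in the
# middle, which Python's gbk codec rejects. The real file's bytes may differ.
raw_name = "变量".encode("gbk") + b"\x80" + b"A"


def try_decode(data, encoding="gbk", errors="strict"):
    """Return the decoded text, or None if decoding raises UnicodeDecodeError."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError:
        return None


strict_result = try_decode(raw_name)              # strict handler gives up
ignored = try_decode(raw_name, errors="ignore")   # invalid byte is dropped
replaced = try_decode(raw_name, errors="replace") # invalid byte becomes U+FFFD
print(strict_result, ignored, replaced)
```

Both `ignore` and `replace` let the read continue, but `ignore` loses the information that a byte was discarded, which matters if the resulting names need to be checked afterwards.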

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 9.0.2
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The code that triggers the issue is

        if self.convert_header_text:
            cname = cname.decode(self.encoding or self.default_encoding)

If I change it to

        if self.convert_header_text:
            cname = cname.decode(self.encoding or self.default_encoding,'ignore')

got

If I instead comment it out,

       # if self.convert_header_text:
       #     cname = cname.decode(self.encoding or self.default_encoding)

got

and then decode the column names with the following code,

col = df1.columns.tolist()
col = [x.decode('GBK', 'ignore') for x in col]
df1.columns = pd.Index(col)

I got the correct result.
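
The column-decoding workaround can be wrapped in a small helper. This is only a sketch: `decode_names` is a hypothetical name, not a pandas function, and it assumes the column names come back as `bytes` when header decoding is skipped, as in the snippet above. The sample bytes are made up.

```python
def decode_names(names, encoding="GBK", errors="ignore"):
    """Decode bytes column names; leave names that are already str untouched."""
    decoded = []
    for name in names:
        if isinstance(name, bytes):
            # errors="ignore" silently drops undecodable bytes, so two
            # different raw names could collapse to the same string.
            name = name.decode(encoding, errors)
        decoded.append(name)
    return decoded


# Stand-in for df1.columns.tolist(); these bytes are invented for illustration.
raw_columns = ["姓名".encode("gbk") + b"\x80" + b"1", b"age", "already_text"]
print(decode_names(raw_columns))
```

With a real frame this would be applied as `df1.columns = decode_names(df1.columns.tolist())`. Passing `errors="replace"` instead keeps a visible U+FFFD marker wherever a byte could not be decoded.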

By the way, the sas7bdat package reads the file without problems.

TomAugspurger · 2018-03-22T11:14:10Z

Is GBK the correct encoding?

witwall · 2018-03-22T11:51:54Z

@TomAugspurger correct.

jbrockmendel added the Unicode (Unicode strings) and IO SAS (SAS: read_sas) labels Jul 25, 2018

jbrockmendel added this to Encodings in IO Method Robustness Dec 20, 2019

mroeschke added the Bug label Apr 5, 2020
