UnicodeDecodeError when importing sas7bdat with special characters in Chinese variable names #20447

Open
witwall opened this issue Mar 22, 2018 · 2 comments
Labels
Bug, IO SAS (SAS: read_sas), Unicode (Unicode strings)

witwall commented Mar 22, 2018

Code Sample, a copy-pastable example if possible

Here are the sas7bdat files for testing (the Chinese variable names end with special characters):

issue.zip

issue.sas7bdat contains Chinese values and is imported correctly by pandas.

[image]

issue1.sas7bdat contains Chinese variable names and fails to import.

[image]

import pandas as pd

df1 = pd.read_sas('issue1.sas7bdat', encoding='GBK')

Problem description

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-bcd7a0b4819c> in <module>()
----> 1 df1=pd.read_sas('issue1.sas7bdat',encoding='GBK')

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     59         reader = SAS7BDATReader(filepath_or_buffer, index=index,
     60                                 encoding=encoding,
---> 61                                 chunksize=chunksize)
     62     else:
     63         raise ValueError('unknown SAS format')

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in __init__(self, path_or_buf, index, convert_dates, blank_missing, chunksize, encoding, convert_text, convert_header_text)
     96 
     97         self._get_properties()
---> 98         self._parse_metadata()
     99 
    100     def close(self):

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _parse_metadata(self)
    276                 raise ValueError(
    277                     "Failed to read a meta data page from the SAS file.")
--> 278             done = self._process_page_meta()
    279 
    280     def _process_page_meta(self):

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_page_meta(self)
    282         pt = [const.page_meta_type, const.page_amd_type] + const.page_mix_types
    283         if self._current_page_type in pt:
--> 284             self._process_page_metadata()
    285         return ((self._current_page_type in [256] + const.page_mix_types) or
    286                 (self._current_page_data_subheader_pointers is not None))

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_page_metadata(self)
    312                 self._get_subheader_index(subheader_signature,
    313                                           pointer.compression, pointer.ptype))
--> 314             self._process_subheader(subheader_index, pointer)
    315 
    316     def _get_subheader_index(self, signature, compression, ptype):

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_subheader(self, subheader_index, pointer)
    382             raise ValueError("unknown subheader index")
    383 
--> 384         processor(offset, length)
    385 
    386     def _process_rowsize_subheader(self, offset, length):

~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_columntext_subheader(self, offset, length)
    432 
    433         if self.convert_header_text:
--> 434             cname = cname.decode(self.encoding or self.default_encoding)#cname.decode(self.encoding or self.default_encoding,'ignore')
    435         self.column_names_strings.append(cname)
    436 

UnicodeDecodeError: 'gbk' codec can't decode byte 0x8c in position 0: illegal multibyte sequence
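The same class of error can be reproduced outside pandas with plain `bytes.decode`. The sketch below uses made-up bytes (not the actual column name bytes from issue1.sas7bdat, which are not shown in this report) to compare the `strict`, `ignore`, and `replace` error handlers of the GBK codec.

```python
# Made-up bytes for illustration; they are NOT taken from issue1.sas7bdat.
good = "中文".encode("gbk")   # valid GBK bytes
bad = b"\xff" + good          # 0xff is not a valid GBK lead byte


def try_decode(raw, errors):
    """Decode raw as GBK with the given error handler.

    Returns the decoded text, or the UnicodeDecodeError if decoding fails.
    """
    try:
        return raw.decode("gbk", errors)
    except UnicodeDecodeError as exc:
        return exc


strict_result = try_decode(bad, "strict")    # a UnicodeDecodeError instance
ignore_result = try_decode(bad, "ignore")    # undecodable byte is dropped
replace_result = try_decode(bad, "replace")  # undecodable byte becomes U+FFFD

print(repr(strict_result))
print(repr(ignore_result))
print(repr(replace_result))
```

With `ignore` the bad byte disappears silently, while `replace` leaves a visible U+FFFD marker, which may be easier to spot when checking column names afterwards.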


Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 9.0.2
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The code that triggers the issue:

        if self.convert_header_text:
            cname = cname.decode(self.encoding or self.default_encoding)

If it is changed to

        if self.convert_header_text:
            cname = cname.decode(self.encoding or self.default_encoding,'ignore')

the result is:
[image]

If the decoding is commented out instead,

       # if self.convert_header_text:
       #     cname = cname.decode(self.encoding or self.default_encoding)

the result is:
[image]

and if the columns are then decoded with the following code,

col=df1.columns.tolist()
col = [x.decode('GBK', 'ignore') for x in col]
df1.columns=pd.Index(col)

the column names come out correctly:
[image]
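The workaround above can be wrapped in a small helper. This is only a sketch: `decode_names` is a hypothetical name, not a pandas function, and it assumes the column names arrive as raw `bytes` (as they do when header decoding is skipped). The sample names are made up.

```python
def decode_names(names, encoding="gbk", errors="replace"):
    """Return column names as str, decoding any that are still bytes.

    Hypothetical helper, not part of pandas. Names that are already str
    are passed through unchanged.
    """
    decoded = []
    for name in names:
        if isinstance(name, bytes):
            name = name.decode(encoding, errors)
        decoded.append(name)
    return decoded


# Made-up names for illustration: one str, one clean GBK name,
# and one GBK name followed by an undecodable byte.
raw_names = ["id", "姓名".encode("gbk"), "年龄".encode("gbk") + b"\xff"]
clean_names = decode_names(raw_names)
print(clean_names)
```

On a DataFrame this would be applied as `df1.columns = decode_names(df1.columns)`. Passing `errors="ignore"` reproduces the behaviour of the snippet above, at the cost of dropping the undecodable bytes silently.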

By the way, sas7bdat works well.
[image]

TomAugspurger (Contributor) commented

Is GBK the correct encoding?
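One rough way to check a candidate encoding is to strictly decode a sample of the raw bytes with each option. The sketch below uses made-up sample bytes and a hypothetical helper name, `encodings_that_decode`; the candidate list is an assumption, not something taken from the file.

```python
def encodings_that_decode(raw, candidates=("gbk", "gb18030", "utf-8")):
    """Return the candidate encodings that decode raw without error.

    Hypothetical helper for illustration only.
    """
    ok = []
    for enc in candidates:
        try:
            raw.decode(enc)  # strict decoding: raises on any invalid byte
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


# Made-up sample bytes, not taken from the attached files.
sample = "中文".encode("gbk")
working = encodings_that_decode(sample)
print(working)
```

A successful strict decode only shows that the bytes are valid in that encoding, not that the resulting text is the intended one, so the output still needs to be checked by eye.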

witwall commented Mar 22, 2018

@TomAugspurger Yes, GBK is the correct encoding.

jbrockmendel added the Unicode and IO SAS labels Jul 25, 2018
jbrockmendel added this to Encodings in IO Method Robustness Dec 20, 2019
mroeschke added the Bug label Apr 5, 2020