UnicodeDecodeError for Stata file #25960

Closed
hudcap opened this issue Apr 2, 2019 · 20 comments · Fixed by #25967
Labels
IO Stata read_stata, to_stata
Comments


hudcap commented Apr 2, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.read_stata('mwe.dta')

mwe.dta available here: mwe.zip
This file is a derivative of The Supreme Court Database

Problem description

The command raises

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 20: invalid start byte

I traced the error to a value label containing that byte.
This is a follow-up to #21244 and #23736.
Changing line 1334 of pandas.io.stata from

return s.decode('utf-8')

to

return s.decode('latin-1')

allows me to read in the file.

Expected Output

The file should be correctly read and parsed.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.6.2
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Member

WillAyd commented Apr 2, 2019

You could probably also leverage self._encoding in the class since I see that used elsewhere. PRs welcome

WillAyd added the IO Stata (read_stata, to_stata) label Apr 2, 2019
Author

hudcap commented Apr 2, 2019

def _set_encoding(self):
    """
    Set string encoding which depends on file version
    """
    if self.format_version < 118:
        self._encoding = 'latin-1'
    else:
        self._encoding = 'utf-8'

@WillAyd from the above method, it seems that self._encoding is 'utf-8' for version 118 files (like mine), so I don't think that will help.
There seems to be some conflict about which encoding these files use (see #21246), and I don't really understand what's going on...

Member

WillAyd commented Apr 2, 2019

Not familiar enough with Stata to guess, but from reading through the linked PR it looks like this was intentional, to make the encoding strict based on the file version.

@bashtage might have thoughts

Contributor

bashtage commented Apr 2, 2019

According to Stata these should be UTF-8.

    The value-label names associated with each variable are recorded as


              <value_label_names>lbllist</value_label_names>


    where lbllist is a sequence (array) of K 129-byte, \0-terminated, UTF-8
    label names.  Each name may be up to 32-characters in length.

Contributor

bashtage commented Apr 2, 2019

This file is a 118 format file.

Author

hudcap commented Apr 2, 2019

This character is actually in the value_label_table as described here
The byte in question maps to 'CURRENCY SIGN' (U+00A4) (according to fileformat.info). I'm not sure why it's not considered valid Unicode. I've been trying to educate myself over the past hour, but this is a really confusing topic.
I don't understand why

b'\xa4'.decode('utf-8')

does not work if it maps to 'CURRENCY SIGN' (U+00A4)
Please forgive my ignorance.

Contributor

bashtage commented Apr 2, 2019

It is a 2-byte encoding in utf-8.

The correctly formatted string should be

R.C.R.A. (42 U.S.C. \xc2\xa4 1978)

or just

'¤'.encode('utf-8')
b'\xc2\xa4'
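To see the mismatch concretely, here is a minimal demonstration of the encode/decode behavior above:

```python
# A bare 0xA4 byte is not valid UTF-8, but it is valid latin-1.
raw = b'\xa4'
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc.reason)   # invalid start byte

print(raw.decode('latin-1'))             # '¤' (CURRENCY SIGN, U+00A4)
print('¤'.encode('utf-8'))               # b'\xc2\xa4' -- the correct two-byte UTF-8 form
```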

Author

hudcap commented Apr 2, 2019

Thank you.
So does this mean that Stata messed it up?

Contributor

bashtage commented Apr 2, 2019

It is not obvious to me. Stata does some undocumented things w.r.t. latin-1 encodings.

This block would make it work.

        try:
            return s.decode('utf-8')
        except UnicodeDecodeError:
            return s.decode('unicode_escape')
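A self-contained sketch of that fallback (the function name is illustrative; in pandas this logic sits on the reader class). The sketch falls back to latin-1, which is what the eventual fix in #25967 does, rather than unicode_escape, since unicode_escape would also interpret backslash sequences in the data:

```python
def decode_stata_string(s: bytes) -> str:
    """Decode bytes from a Stata 118 file, falling back to latin-1
    when the bytes are not valid UTF-8 (an undocumented Stata quirk)."""
    try:
        return s.decode('utf-8')
    except UnicodeDecodeError:
        return s.decode('latin-1')

print(decode_stata_string(b'ok'))        # 'ok' -- plain ASCII passes through
print(decode_stata_string(b'\xa4'))      # '¤' via the latin-1 fallback
print(decode_stata_string(b'\xc2\xa4'))  # '¤' via the normal UTF-8 path
```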

Contributor

bashtage commented Apr 2, 2019

Further looking says yes, this is invalid Unicode. In particular, a lone 0xA4 is an invalid UTF-8 byte sequence, so this qualifies as an undocumented Stata "feature".

Author

hudcap commented Apr 2, 2019

It is not obvious to me. Stata does some undocumented things w.r.t. latin-1 encodings.

This block would make it work.

        try:
            return s.decode('utf-8')
        except UnicodeDecodeError:
            return s.decode('unicode_escape')

I can confirm that this works on my end too.
Thank you.

I think I'm starting to get a little more understanding of this.
In UTF-8, anything over 127 should jump to 2+ bytes, because the one-byte UTF-8 characters can't start with a binary 1. Did I get that right?

Contributor

bashtage commented Apr 2, 2019

Here is what I think happens for these values:

Correct Unicode:
11000010 10100100

This is read as 00010 100100 (the 110 prefix on the first byte marks a two-byte encoding and is discarded, and the 10 prefix on the second byte is dropped), which is 10100100. 0xA4 is also 10100100.
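The bit arithmetic can be checked directly (a quick sketch):

```python
b1, b2 = 0xC2, 0xA4
# Strip the 110 prefix from the lead byte and the 10 prefix from the
# continuation byte, then splice the remaining payload bits together.
code_point = ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)
print(hex(code_point))   # 0xa4
print(chr(code_point))   # '¤'
```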

Contributor

bashtage commented Apr 2, 2019

@hudcap That is correct. When a byte leads with a 1, the first byte of the sequence must be 11xxxxxx, followed by a sequence of 10yyyyyy bytes. The next byte of the form 0zzzzzzz (an ASCII point) or 11xxxxxx (a new Unicode point) ends the variable-length encoding.
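Those rules can be sketched as a small classifier (illustrative, not pandas code), which also shows why a bare 0xA4 can never start a character:

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a byte by the role it can play in a UTF-8 stream."""
    if b < 0x80:
        return 'ascii'          # 0zzzzzzz -- a complete one-byte character
    if b < 0xC0:
        return 'continuation'   # 10yyyyyy -- never valid as a lead byte
    return 'lead'               # 11xxxxxx -- starts a multi-byte sequence

print(utf8_byte_kind(0x41))  # 'ascii'
print(utf8_byte_kind(0xA4))  # 'continuation' -- hence the "invalid start byte" error
print(utf8_byte_kind(0xC2))  # 'lead'
```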

Contributor

bashtage commented Apr 2, 2019

I would support a patch using a try/except as above since this seems like a real Stata issue. I resaved the file using Stata 14.2 and got the same incorrect format.

Author

hudcap commented Apr 2, 2019

It is not obvious to me. Stata does some undocumented things w.r.t. latin-1 encodings.

This block would make it work.

        try:
            return s.decode('utf-8')
        except UnicodeDecodeError:
            return s.decode('unicode_escape')

What if a string has these invalid Unicode encodings as well as valid UTF-8?
Is there any way to handle that?
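The concern can be made concrete (an illustrative sketch, not from the thread): a whole-string fallback to latin-1 would mangle any valid multi-byte UTF-8 in the same string, while decoding with `errors='replace'` keeps the valid part and only marks the bad byte:

```python
# Valid UTF-8 'café' followed by a stray latin-1 0xA4 byte.
mixed = 'caf\u00e9'.encode('utf-8') + b' \xa4'

# Whole-string latin-1 fallback mangles the valid UTF-8 part:
print(mixed.decode('latin-1'))                  # 'cafÃ© ¤' -- 'é' becomes mojibake

# errors='replace' preserves the valid part and flags the bad byte:
print(mixed.decode('utf-8', errors='replace'))  # 'café \ufffd'
```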

Contributor

bashtage commented Apr 2, 2019 via email

Author

hudcap commented Apr 2, 2019

I would support a patch using a try/except as above since this seems like a real Stata issue. I resaved the file using Stata 14.2 and got the same incorrect format.

Should I make a PR?
I'm fine with that, but it's your fix, and I don't want to step on your toes :)

Contributor

bashtage commented Apr 2, 2019

When you open this file in Stata, the bad character shows up as � (U+FFFD), so Stata also doesn't correctly read it (although it does read it). I also cannot create a file that uses Latin-1 -- Stata correctly always writes unicode C2 A4 for the offending character, and pandas reads it correctly. I wonder if this dataset was dumped using some other program.

It seems that this file is not a valid Stata file.

Ultimately it probably doesn't make sense to provide ad hoc paths that might support damaged files when the current implementation seems to do a good job with correctly formatted Stata files.

Author

hudcap commented Apr 2, 2019

Good point -- I completely agree. Given that the file is offered in many different formats, it's reasonable to assume that some other program created it.
In fact, I think that the character is supposed to be C2 A7 (section sign).

I'll correct the file before importing it.
Is there an intelligent way to do this?
I can't simply replace every A4 with C2 A7, because there is at least one other instance of that byte that is not meant to be text, but rather part of a number.
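One possibility (an illustrative sketch, not something pandas provides): scan for the byte offsets where UTF-8 decoding fails, so each occurrence can be inspected in context and patched individually rather than replacing every A4 blindly:

```python
def find_invalid_utf8(data: bytes):
    """Yield the byte offsets at which UTF-8 decoding fails."""
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            return  # the rest of the buffer decodes cleanly
        except UnicodeDecodeError as exc:
            yield pos + exc.start      # absolute offset of the bad byte
            pos += exc.start + 1       # skip past it and keep scanning

print(list(find_invalid_utf8(b'abc\xa4def\xa4')))  # [3, 7]
```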

bashtage added a commit to bashtage/pandas that referenced this issue Apr 2, 2019
Add a fall back decode path that allows improperly formatted Stata
files written in 118 format but using latin-1 encoded strings to be
read

closes pandas-dev#25960
Contributor

bashtage commented Apr 2, 2019

Stata can produce these files, even though it is probably a bug. You can reproduce by opening a 117 file with latin-1 characters (ord > 127) and then saving it as 118. This is how the file above was produced. The original large dataset is 117 and can be read fine. I made an example file in Stata and have added a fix in #25967.

bashtage added a commit to bashtage/pandas that referenced this issue Apr 3, 2019
Add a fall back decode path that allows improperly formatted Stata
files written in 118 format but using latin-1 encoded strings to be
read

closes pandas-dev#25960
jreback added this to the 0.25.0 milestone Apr 4, 2019
jreback pushed a commit that referenced this issue Apr 4, 2019
* ENH: Allow poorly formatted stata files to be read

Add a fall back decode path that allows improperly formatted Stata
files written in 118 format but using latin-1 encoded strings to be
read

closes #25960

* MAINT: Refactor decode

Refactor decode and null terminate to use file encoding