UnicodeDecodeError for Stata file #25960

Closed
hudcap opened this issue Apr 2, 2019 · 20 comments · Fixed by #25967
Labels
IO Stata read_stata, to_stata
Comments


hudcap commented Apr 2, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.read_stata('mwe.dta')

mwe.dta available here: mwe.zip
This file is a derivative of The Supreme Court Database

Problem description

The command raises

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 20: invalid start byte

I traced the error to a value label containing that byte.
This is a follow-up to #21244 and #23736.
Changing line 1334 of pandas.io.stata from

return s.decode('utf-8')

to

return s.decode('latin-1')

allows me to read in the file.

Expected Output

The file should be correctly read and parsed.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.6.2
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Member

WillAyd commented Apr 2, 2019

You could probably also leverage self._encoding in the class since I see that used elsewhere. PRs welcome

WillAyd added the IO Stata (read_stata, to_stata) label Apr 2, 2019
Author

hudcap commented Apr 2, 2019

def _set_encoding(self):
    """
    Set string encoding which depends on file version
    """
    if self.format_version < 118:
        self._encoding = 'latin-1'
    else:
        self._encoding = 'utf-8'

@WillAyd from the above method, it seems that self._encoding is 'utf-8' for version 118 files (like mine), so I don't think that will help.
There seems to be some conflict about which encoding these files use (see #21246), and I don't really understand what's going on...

Member

WillAyd commented Apr 2, 2019

Not familiar enough with Stata to guess, but from reading through the linked PR it looks like this was intentional, to make the encoding strict based on the file version.

@bashtage might have thoughts

Contributor

bashtage commented Apr 2, 2019

According to Stata these should be UTF-8.

    The value-label names associated with each variable are recorded as


              <value_label_names>lbllist</value_label_names>


    where lbllist is a sequence (array) of K 129-byte, \0-terminated, UTF-8
    label names.  Each name may be up to 32-characters in length.

Contributor

bashtage commented Apr 2, 2019

This file is a 118 format file.

Author

hudcap commented Apr 2, 2019

This character is actually in the value_label_table as described here
The byte in question maps to 'CURRENCY SIGN' (U+00A4) (according to fileformat.info). I'm not sure why it's not considered valid Unicode. I've been trying to educate myself over the past hour, but this is a really confusing topic.
I don't understand why

b'\xa4'.decode('utf-8')

does not work if it maps to 'CURRENCY SIGN' (U+00A4)
Please forgive my ignorance.

Contributor

bashtage commented Apr 2, 2019

It is a 2-byte encoding in utf-8.

The correctly formatted string should be

R.C.R.A. (42 U.S.C. \xc2\xa4 1978)

or just

'¤'.encode('utf-8')
b'\xc2\xa4'
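To see the mismatch concretely, here is a minimal demonstration of the encode/decode behavior above:

```python
# A bare 0xA4 byte is not valid UTF-8, but it is valid latin-1.
raw = b'\xa4'
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc.reason)   # invalid start byte

print(raw.decode('latin-1'))             # '¤' (CURRENCY SIGN, U+00A4)
print('¤'.encode('utf-8'))               # b'\xc2\xa4' -- the correct two-byte UTF-8 form
```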

Author

hudcap commented Apr 2, 2019

Thank you.
So does this mean that Stata messed it up?

Contributor

bashtage commented Apr 2, 2019

It is not obvious to me. Stata does some undocumented things w.r.t. latin-1 encodings.

This block would make it work.

        try:
            return s.decode('utf-8')
        except UnicodeDecodeError:
            return s.decode('unicode_escape')
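A self-contained sketch of that fallback (the function name is illustrative; in pandas this logic sits on the reader class). The sketch falls back to latin-1, which is what the eventual fix in #25967 does, rather than unicode_escape, since unicode_escape would also interpret backslash sequences in the data:

```python
def decode_stata_string(s: bytes) -> str:
    """Decode bytes from a Stata 118 file, falling back to latin-1
    when the bytes are not valid UTF-8 (an undocumented Stata quirk)."""
    try:
        return s.decode('utf-8')
    except UnicodeDecodeError:
        return s.decode('latin-1')

print(decode_stata_string(b'ok'))        # 'ok' -- plain ASCII passes through
print(decode_stata_string(b'\xa4'))      # '¤' via the latin-1 fallback
print(decode_stata_string(b'\xc2\xa4'))  # '¤' via the normal UTF-8 path
```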

Contributor

bashtage commented Apr 2, 2019

Further looking says yes, this is invalid Unicode. In particular, a lone 0xA4 is an invalid UTF-8 byte sequence, so this qualifies as an undocumented Stata "feature".

Author

hudcap commented Apr 2, 2019

It is not obvious to me. Stata does some undocumented things w.r.t. latin-1 encodings.

This block would make it work.

        try:
            return s.decode('utf-8')
        except UnicodeDecodeError:
            return s.decode('unicode_escape')

I can confirm that this works on my end too.
Thank you.

I think I'm starting to get a little more understanding of this.
In UTF-8, anything over 127 should jump to 2+ bytes, because the one-byte UTF-8 characters can't start with a binary 1. Did I get that right?

Contributor

bashtage commented Apr 2, 2019

Here is what I think happens for these values:

Correct Unicode:
11000010 10100100

This is read as 00010 100100 (the 110 prefix on the first byte marks a two-byte encoding and is discarded, and the 10 prefix on the second byte is dropped), which is 10100100. 0xA4 is also 10100100.
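The bit arithmetic can be checked directly (a quick sketch):

```python
b1, b2 = 0xC2, 0xA4
# Strip the 110 prefix from the lead byte and the 10 prefix from the
# continuation byte, then splice the remaining payload bits together.
code_point = ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)
print(hex(code_point))   # 0xa4
print(chr(code_point))   # '¤'
```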

Contributor

bashtage commented Apr 2, 2019

@hudcap That is correct. When a byte leads with a 1, the first byte of the sequence must be 11xxxxxx, followed by a sequence of 10yyyyyy bytes. The next byte of the form 0zzzzzzz (an ASCII point) or 11xxxxxx (a new Unicode point) ends the variable-length encoding.
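Those rules can be sketched as a small classifier (illustrative, not pandas code), which also shows why a bare 0xA4 can never start a character:

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a byte by the role it can play in a UTF-8 stream."""
    if b < 0x80:
        return 'ascii'          # 0zzzzzzz -- a complete one-byte character
    if b < 0xC0:
        return 'continuation'   # 10yyyyyy -- never valid as a lead byte
    return 'lead'               # 11xxxxxx -- starts a multi-byte sequence

print(utf8_byte_kind(0x41))  # 'ascii'
print(utf8_byte_kind(0xA4))  # 'continuation' -- hence the "invalid start byte" error
print(utf8_byte_kind(0xC2))  # 'lead'
```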

Contributor

bashtage commented Apr 2, 2019

I would support a patch using a try/except as above since this seems like a real Stata issue. I resaved the file using Stata 14.2 and got the same incorrect format.

Author

hudcap commented Apr 2, 2019

It is not obvious to me. Stata does some undocumented things w.r.t. latin-1 encodings.

This block would make it work.

        try:
            return s.decode('utf-8')
        except UnicodeDecodeError:
            return s.decode('unicode_escape')

What if a string has these invalid Unicode encodings as well as valid UTF-8?
Is there any way to handle that?
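The concern can be made concrete (an illustrative sketch, not from the thread): a whole-string fallback to latin-1 would mangle any valid multi-byte UTF-8 in the same string, while decoding with `errors='replace'` keeps the valid part and only marks the bad byte:

```python
# Valid UTF-8 'café' followed by a stray latin-1 0xA4 byte.
mixed = 'caf\u00e9'.encode('utf-8') + b' \xa4'

# Whole-string latin-1 fallback mangles the valid UTF-8 part:
print(mixed.decode('latin-1'))                  # 'cafÃ© ¤' -- 'é' becomes mojibake

# errors='replace' preserves the valid part and flags the bad byte:
print(mixed.decode('utf-8', errors='replace'))  # 'café \ufffd'
```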

Contributor

bashtage commented Apr 2, 2019 via email

Author

hudcap commented Apr 2, 2019

I would support a patch using a try/except as above since this seems like a real Stata issue. I resaved the file using Stata 14.2 and got the same incorrect format.

Should I make a PR?
I'm fine with that, but it's your fix, and I don't want to step on your toes :)

Contributor

bashtage commented Apr 2, 2019

When you open this file in Stata, the bad character shows up as � (U+FFFD), so Stata also doesn't correctly read it (although it does read it). I also cannot create a file that uses Latin-1 -- Stata correctly always writes unicode C2 A4 for the offending character, and pandas reads it correctly. I wonder if this dataset was dumped using some other program.

It seems that this file is not a valid Stata file.

Ultimately it probably doesn't make sense to provide ad hoc paths that might support damaged files when the current implementation seems to do a good job with correctly formatted Stata files.

Author

hudcap commented Apr 2, 2019

Good point -- I completely agree. Given that the file is offered in many different formats, it's reasonable to assume that some other program created it.
In fact, I think that the character is supposed to be C2 A7 (section sign).

I'll correct the file before importing it.
Is there an intelligent way to do this?
I can't simply replace every A4 with C2 A7, because there is at least one other instance of that byte that is not meant to be text, but rather part of a number.
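One possibility (an illustrative sketch, not something pandas provides): scan for the byte offsets where UTF-8 decoding fails, so each occurrence can be inspected in context and patched individually rather than replacing every A4 blindly:

```python
def find_invalid_utf8(data: bytes):
    """Yield the byte offsets at which UTF-8 decoding fails."""
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            return  # the rest of the buffer decodes cleanly
        except UnicodeDecodeError as exc:
            yield pos + exc.start      # absolute offset of the bad byte
            pos += exc.start + 1       # skip past it and keep scanning

print(list(find_invalid_utf8(b'abc\xa4def\xa4')))  # [3, 7]
```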

bashtage added a commit to bashtage/pandas that referenced this issue Apr 2, 2019
Add a fall back decode path that allows improperly formatted Stata
files written in 118 format but using latin-1 encoded strings to be
read

closes pandas-dev#25960
Contributor

bashtage commented Apr 2, 2019

Stata can produce these files, even though it is probably a bug. You can reproduce by opening a 117 file with latin-1 characters (ord > 127) and then saving it as 118. This is how the file above was produced. The original large dataset is 117 and can be read fine. I made an example file in Stata and have added a fix in #25967.

bashtage added a commit to bashtage/pandas that referenced this issue Apr 3, 2019
Add a fall back decode path that allows improperly formatted Stata
files written in 118 format but using latin-1 encoded strings to be
read

closes pandas-dev#25960
jreback added this to the 0.25.0 milestone Apr 4, 2019
jreback pushed a commit that referenced this issue Apr 4, 2019
* ENH: Allow poorly formatted stata files to be read

Add a fall back decode path that allows improperly formatted Stata
files written in 118 format but using latin-1 encoded strings to be
read

closes #25960

* MAINT: Refactor decode

Refactor decode and null terminate to use file encoding