ERR: validate encoding on to_stata #15723

Closed
ozak opened this Issue Mar 17, 2017 · 13 comments


ozak commented Mar 17, 2017 edited

It seems pandas on Python 3.5 has an encoding problem when writing Stata files. For example, the following generates a corrupt output file:

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([1,2,3,4]), columns=['var1'])
df1.to_stata('corrupt.dta', write_index=False, encoding='utf8')

while

df1.to_stata('not-corrupt.dta', write_index=False)

generates a correct file. I imagine this is due to how the encoding option is treated differently in Python 2 and Python 3, which breaks compatibility of scripts across Python versions. It would be nice if the option were ignored on Python 3, unless the error is caused by something else.

Contributor

jreback commented Mar 17, 2017

In [8]: df1 = pd.DataFrame(np.array([1,2,3,4]), columns=['var1'])
   ...: df1.to_stata('corrupt.dta', write_index=False, encoding='latin1')

In [9]: pd.read_stata('corrupt.dta')
Out[9]: 
   var1
0     1
1     2
2     3
3     4

doc-string

Signature: df1.to_stata(fname, convert_dates=None, write_index=True, encoding='latin-1', byteorder=None, time_stamp=None, data_label=None, variable_labels=None)
Docstring:
A class for writing Stata binary dta files from array-like objects

Parameters
----------
fname : str or buffer
    String path or file-like object
convert_dates : dict
    Dictionary mapping columns containing datetime types to stata
    internal format to use when writing the dates. Options are 'tc',
    'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer
    or a name. Datetime columns that do not have a conversion type
    specified will be converted to 'tc'. Raises NotImplementedError if
    a datetime column has timezone information
write_index : bool
    Write the index to Stata dataset.
encoding : str
    Default is latin-1. Unicode is not supported
byteorder : str
    Can be ">", "<", "little", or "big". default is `sys.byteorder`
time_stamp : datetime
    A datetime to use as file creation date.  Default is the current
    time.
data_label : str
    A label for the data set.  Must be 80 characters or smaller.
variable_labels : dict
    Dictionary containing columns as keys and variable labels as
    values. Each label must be 80 characters or smaller.

I would say the current behavior is technically correct: passing a unicode encoding is invalid. But I think we should simply reject such encodings rather than actually write an invalid file. Want to do a PR to do this? (I am not sure which encodings Stata can actually support. Any idea?)

jreback added this to the Next Major Release milestone Mar 17, 2017

jreback changed the title from Pandas generates corrupt Stata files in python 3.5 on OSX to ERR: validate encoding on to_stata Mar 17, 2017

ozak commented Mar 17, 2017

I am a bit confused now. According to the docs it should fail in both Python 2 and 3, but in Python 2

df1.to_stata('corrupt.dta', write_index=False, encoding='utf8')

generates the correct file. So, I would think it may be better to just ignore the option in python 3, but keep it in python 2.

I think stata 14 now supports UTF8 as a default, but it actually may be more general, not 100% sure (see here). I'll try to find some time to write a PR. I'll see what the code actually does. I'll let you know.

Contributor

jreback commented Mar 17, 2017

cc @bashtage any ideas here?

Contributor

bashtage commented Mar 17, 2017

There is no UTF8 in Stata. Only ASCII and the simple 8-bit encoding Latin-1.

Contributor

bashtage commented Mar 17, 2017

By "in Stata" I mean in to_stata. Adding UTF8 is a major effort since the current format is all fixed width, and I don't see much of a case for making the effort.
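To illustrate the fixed-width point, here is a small sketch in plain Python 3 (not pandas code) comparing how many bytes the same strings occupy under each encoding. Latin-1 always uses one byte per character, so a column's byte width equals its character count; under UTF-8 that no longer holds. The helper name `byte_widths` is made up for this illustration.

```python
def byte_widths(values, encoding):
    """Return the encoded byte length of each string in `values`."""
    return [len(v.encode(encoding)) for v in values]


# 'a', 'á' (U+00E1) and 'Ö' (U+00D6), written as escapes to avoid
# depending on the source file's own encoding.
values = ["a", "\u00e1", "\u00d6"]

latin1_widths = byte_widths(values, "latin-1")
utf8_widths = byte_widths(values, "utf-8")

print(latin1_widths)  # [1, 1, 1]
print(utf8_widths)    # [1, 2, 2]
```

A writer that sizes string columns by character count would therefore under-allocate space for UTF-8 data whenever a non-ASCII character appears.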

Contributor

jreback commented Mar 17, 2017

@bashtage so we should then validate encoding='ascii'|'latin1'|None as the only allowed encodings at all.
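A minimal sketch of what such a check could look like, using only the standard library. The function name `validate_stata_encoding` and the error messages are hypothetical and are not taken from the pandas code base; the allowed set simply follows the suggestion above (ascii, latin-1, or None for the default).

```python
import codecs

# Assumption: only these codecs are acceptable, per the discussion above.
# codecs.lookup() normalizes aliases such as 'latin1' and 'latin-1'.
_DEFAULT_ENCODING = "latin-1"
_ALLOWED = {codecs.lookup(name).name for name in ("ascii", "latin-1")}


def validate_stata_encoding(encoding=None):
    """Return the normalized codec name, or raise ValueError.

    Hypothetical helper: None falls back to the default, latin-1.
    """
    if encoding is None:
        encoding = _DEFAULT_ENCODING
    try:
        normalized = codecs.lookup(encoding).name
    except LookupError:
        raise ValueError("Unknown encoding: {!r}".format(encoding))
    if normalized not in _ALLOWED:
        raise ValueError(
            "Only ascii and latin-1 are supported, got {!r}".format(encoding)
        )
    return normalized
```

Normalizing through `codecs.lookup` means spellings such as 'latin1', 'latin-1' and 'iso-8859-1' are all treated as the same codec, while 'utf8' and 'utf-8' are both rejected.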

ozak commented Mar 17, 2017

So why doesn't it generate a corrupt file in python 2, but does in python 3? I have the same pandas version in both, so it may not be pandas specific?

Contributor

jreback commented Mar 17, 2017

Because in PY3 it's actually encoding with the passed-in encoding (utf8), rather than the default of latin1.

ozak commented Mar 17, 2017

So in PY2 it is not encoded as UTF8 even when giving the option? How is it saved then? What in pandas is affecting the IO to Stata that corrupts the file? Given that Stata14 uses UTF8 as the default it should not have an issue opening UTF8 encoded files.

ozak commented Mar 17, 2017

Just noticed one more thing

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([u'á',u'Ö']), columns=['var1'])
df1.to_stata('corrupt3.dta', write_index=False, encoding='utf8')

df = pd.read_stata('corrupt3.dta')
df == df1

generates a usable file in both PY2 and PY3. Still, the data is wrong as seen in the example.
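One possible explanation for data that loads but does not match, sketched outside pandas: if the strings are written as UTF-8 bytes and then read back as latin-1 (an assumption about the reader's default, not something verified in this thread), every non-ASCII character comes back as two unrelated characters.

```python
# 'á' (U+00E1) and 'Ö' (U+00D6), as in the example above.
original = ["\u00e1", "\u00d6"]

# Write side: encode as UTF-8, which is what encoding='utf8' asks for.
raw = [s.encode("utf-8") for s in original]

# Read side (assumed): decode the same bytes as latin-1.
round_tripped = [b.decode("latin-1") for b in raw]

print(round_tripped == original)         # False
print([len(s) for s in round_tripped])   # [2, 2]
```

The decode step never fails, because every byte value is valid latin-1, which would be consistent with a file that opens fine but holds the wrong text.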

Contributor

jreback commented Mar 17, 2017

So in PY2 it is not encoded as UTF8 even when giving the option?

I'm not really sure it is actually encoding as utf8 internally. Lots of things in PY2 are wonky; it probably just happens to work.

Contributor

jreback commented Mar 17, 2017

In any event, it seems you have a bunch of test cases! It seems easy enough to simply validate the encoding that is passed and raise if it's not valid.

ozak commented Mar 17, 2017

Indeed.

I guess my example shows that PY2 is not encoding at all.

@bashtage bashtage added a commit to bashtage/pandas that referenced this issue Mar 21, 2017

@bashtage bashtage BIG: Enforce correc encoding in stata
Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes #15723
f549481

bashtage referenced this issue Mar 21, 2017

Closed

BUG: Enforce correct encoding in stata #15768

4 of 4 tasks complete

@bashtage bashtage added a commit to bashtage/pandas that referenced this issue Mar 21, 2017

@bashtage bashtage BUG: Enforce correct encoding in stata
Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes #15723
2f02697

@jreback jreback modified the milestone: 0.20.0, Next Major Release Mar 21, 2017

jreback closed this in 1c9d46a Mar 21, 2017

@mattip mattip added a commit to mattip/pandas that referenced this issue Apr 3, 2017

@bashtage @mattip bashtage + mattip BUG: Enforce correct encoding in stata
Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes #15723

Author: Kevin Sheppard <kevin.k.sheppard@gmail.com>

Closes #15768 from bashtage/limit-stata-encoding and squashes the following commits:

8278be7 [Kevin Sheppard] BUG: Fix limited key range on 32-bit platofrms
2f02697 [Kevin Sheppard] BUG: Enforce correct encoding in stata
e03ee35

@linebp linebp added a commit to linebp/pandas that referenced this issue Apr 17, 2017

@bashtage @linebp bashtage + linebp BUG: Enforce correct encoding in stata
Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes #15723

Author: Kevin Sheppard <kevin.k.sheppard@gmail.com>

Closes #15768 from bashtage/limit-stata-encoding and squashes the following commits:

8278be7 [Kevin Sheppard] BUG: Fix limited key range on 32-bit platofrms
2f02697 [Kevin Sheppard] BUG: Enforce correct encoding in stata
640e1cb