ERR: validate encoding on to_stata #15723
Comments
doc-string
I would say this is technically correct: passing a unicode encoding is invalid. But I think we should simply reject these, rather than actually write an invalid format. Want to do a PR to do this? (Now I am not sure which encodings Stata can actually support; any idea?)
jreback added Difficulty Intermediate, Effort Medium, Error Reporting, IO Stata, Unicode labels Mar 17, 2017
jreback added this to the Next Major Release milestone Mar 17, 2017
jreback changed the title from "Pandas generates corrupt Stata files in python 3.5 on OSX" to "ERR: validate encoding on to_stata" Mar 17, 2017
ozak
commented
Mar 17, 2017
|
I am a bit confused now. From the docs it should fail in both
generates the correct file. So, I would think it may be better to just ignore the option in I think |
|
cc @bashtage any ideas here? |
|
There is no UTF-8 in Stata, only ASCII and the simple 8-bit encoding Latin-1.
|
In Stata I mean in |
|
@bashtage so we should then validate |
ozak
commented
Mar 17, 2017
|
So why doesn't it generate a corrupt file in |
|
Because it's actually encoding it in PY3 with the passed-in encoding (utf8), rather than the default of latin1.
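To make the byte-level difference concrete, here is a minimal sketch in plain Python (not pandas internals) of what happens when the two characters from the example are encoded as UTF-8 but later read as Latin-1. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The two characters used in the example: U+00E1 (a with acute) and U+00D6 (O with diaeresis).
text = u"\u00e1\u00d6"

latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # two bytes per character for these code points

print(len(latin1_bytes), len(utf8_bytes))

# A reader that assumes Latin-1 maps every UTF-8 byte to its own character,
# so the string comes back longer and different from what was written.
misread = utf8_bytes.decode("latin-1")
print(repr(misread))
```

If the writer sizes fixed-width string fields by character count but emits UTF-8 bytes, the extra bytes would not fit, which is one plausible way a file could end up unreadable. That is an assumption about the mechanism, not something confirmed in this thread.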
ozak
commented
Mar 17, 2017
|
So in PY2 it is not encoded as UTF8 even when giving the option? How is it saved then? What in |
ozak
commented
Mar 17, 2017
|
Just noticed one more thing:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([u'á',u'Ö']), columns=['var1'])
df1.to_stata('not-corrupt.dta', write_index=False, encoding='utf8')
df = pd.read_stata('corrupt3.dta')
df == df1
This generates a usable file in both PY2 and PY3. Still, the data is wrong, as seen in the example.
Not really sure it is actually encoding as utf8 internally. Lots of things in PY2 are wonky. It probably happens to work.
|
In any event, it seems you have a bunch of test cases! It seems easy enough to simply validate the encoding that is passed and raise if it's not valid.
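A rough sketch of what such a check could look like, assuming that only ASCII and Latin-1 should be accepted (as stated earlier in this thread). The helper name `validate_stata_encoding` is hypothetical and is not part of pandas; the actual fix may differ.

```python
import codecs

# Assumption: only ASCII and Latin-1 are acceptable encodings here.
# codecs.lookup() resolves aliases such as 'latin1' and 'iso-8859-1'
# to a single canonical codec name.
_ALLOWED_CODECS = {
    codecs.lookup("ascii").name,
    codecs.lookup("latin-1").name,
}


def validate_stata_encoding(encoding):
    """Hypothetical helper: return the canonical codec name or raise ValueError."""
    try:
        canonical = codecs.lookup(encoding).name
    except LookupError:
        raise ValueError("Unknown encoding: %r" % (encoding,))
    if canonical not in _ALLOWED_CODECS:
        raise ValueError(
            "Encoding %r is not supported; use ASCII or Latin-1." % (encoding,)
        )
    return canonical
```

Called with `'utf8'`, this raises `ValueError` before anything is written, instead of producing a file that cannot be read back.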
ozak
commented
Mar 17, 2017
|
Indeed. I guess my example shows that PY2 is not encoding at all. |
bashtage added a commit to bashtage/pandas that referenced this issue Mar 21, 2017
bashtage f549481
bashtage added a commit to bashtage/pandas that referenced this issue Mar 21, 2017
bashtage 2f02697
jreback modified the milestone: 0.20.0, Next Major Release Mar 21, 2017
jreback closed this in 1c9d46a Mar 21, 2017
mattip added a commit to mattip/pandas that referenced this issue Apr 3, 2017
bashtage + mattip e03ee35
linebp added a commit to linebp/pandas that referenced this issue Apr 17, 2017
bashtage + linebp 640e1cb
ozak commented Mar 17, 2017
• edited
It seems pandas in python 3.5 causes issues due to encoding. For example the following generates a corrupt output file
while
generates a correct file. I imagine this may be due to use of encoding and the difference in the treatment between python 2 and python 3, which breaks compatibility of scripts across python versions. I guess it would be nice if it does not take this option into account on python 3, unless the error is caused by something else.