Support writing unicode characters in df.to_stata() #23573

Closed
kylebarron opened this issue Nov 8, 2018 · 13 comments · Fixed by #24337 or #30285

kylebarron (Contributor) commented Nov 8, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a': ['丆']})
df.to_stata('test.dta')
# UnicodeEncodeError: 'latin-1' codec can't encode character '\u4e06' in position 0: ordinal not in range(256)

I picked an arbitrary CJK character to test this with.

Problem description

It would be possible to write Unicode strings to a Stata file by implementing a writer according to version 118 of the dta format.

I'd be interested in trying to submit a PR for this. (Edit: I don't use Stata anymore)
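
A rough sketch of what the call could look like once a format 118 writer exists (the version=118 argument here is hypothetical at the time of writing and simply mirrors the existing version=117 option):

import pandas as pd

df = pd.DataFrame({'a': ['丆']})
# Hypothetical: a dta format 118 writer would store strings as UTF-8,
# so this call would succeed instead of raising UnicodeEncodeError.
df.to_stata('test.dta', version=118)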

Expected Output

Stata file written to disk.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.18.7.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 7.0.1
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
jbrockmendel added the Unicode (Unicode strings) and IO Stata (read_stata, to_stata) labels on Nov 10, 2018
adamrossnelson commented Nov 24, 2018

I think I can add to this issue with the following. The current scheme seems to be that pd.to_stata knows it does not support writing Unicode, so if it finds Unicode it helpfully throws an error.

However, I believe I've found some characters that pd.to_stata will write to a Stata data file without throwing an error, but that cause trouble for Stata's version 117 format. (Which is a long way of saying I vote in favor of an enhancement that supports version 118.)

Code Sample, a copy-pastable example if possible

import pandas as pd

# Make demonstration data. This data contains characters that should
# cause Pandas to throw an error when using df.to_stata().
bad_txt_sneaking_through = ''' Multiline text that sneaks by
Here is one __�__
Another one __·__   Another one __½__
Bad bad bad __Á__   Bad bad bad __¦__
Still more __é__    Still more __§__    Still more __®__ '''

data_list = []
data_list.append(['First Record', bad_txt_sneaking_through])
data_list.append(['Second Record', 'This one will be fine'])

# Make DataFrame from demonstration data.
df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])

# Write data frame to Stata data file. Shouldn't write but does.
# This file will not open in Stata.
df.to_stata('Demo_ShouldNotWork.dta', version=117, convert_strl=['Txt'])

# Write first record which has the offending characters.
# This file will not open in Stata.
df[0:1].to_stata('Demo_DoesNotWork.dta', version=117, convert_strl=['Txt'])

# Write second record which has no offending characters.
# This file will open in Stata.
df[1:2].to_stata('Demo_DoesWork.dta', version=117, convert_strl=['Txt'])

# Define a function that tests the diagnosis (bad character count):
# keep only characters whose UTF-8 encoding is a single byte.
def make_it_work(bad_text):
    ret_txt = ''
    for item in bad_text:
        ret_txt += item if len(item.encode(encoding='utf_8')) == 1 else ''
    return ret_txt

df['Txt'] = df['Txt'].apply(make_it_work)

# Write data frame to Stata data file. This time it should write and does.
# This file will open in Stata.
df.to_stata('Demo_ShouldWork.dta', version=117, convert_strl=['Txt'])

Problem description

Pandas' pd.to_stata does not throw an error for some characters that are problematic in Stata.

When writing Stata data files, pandas usually (and helpfully) throws an error if there are non-Latin-1 characters in a strL field. However, when I was working with a large dataset scraped from the web, I managed to write a data file without getting an error from pandas. All seemed well, but Stata was unable to read the file.

With some assistance from Stata technical support, I believe the correct diagnosis is that the total number of characters in the strL is undercounted. Stata technical support indicated that Unicode characters will throw off the count. At first I thought this shouldn't be a problem, because I assumed pd.to_stata would catch the issue by throwing an error; as the code sample above demonstrates, some characters manage to sneak through.
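
A small illustration of the character/byte mismatch (assuming the strL payload ends up written as multi-byte text while the length is counted in characters):

s = 'Bad bad bad __Á__'
len(s)                    # 17 characters
len(s.encode('utf-8'))    # 18 bytes: 'Á' takes two bytes in UTF-8
len(s.encode('latin-1'))  # 17 bytes: 'Á' is a single byte in Latin-1
# If the recorded length counts characters while the payload uses multi-byte
# text, the count comes up short, consistent with Stata refusing to read the file.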

In troubleshooting and documenting the issue, I believe the function make_it_work() above is at least a partial solution. It is crude, but it might help folks using pd.to_stata until there is a more integrated fix; or perhaps someone can point me to another approach. I would also put in a plug for enhancing pd.to_stata so that it supports writing Unicode.

Expected Output

I would vote in favor of a future change that throws an error for the characters that currently sneak through. Here is code and output showing the more helpful behavior. Alternatively, an enhancement like the one @kylebarron suggested, accommodating Unicode, would also be an appropriate solution.

bad_txt_not_sneaking_through = '''Bad text that does not sneak through...

Here you go ► '''

data_list = []
data_list.append(['First Record', bad_txt_not_sneaking_through])
data_list.append(['Second Record', 'This one will be fine'])

df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])

df.to_stata('Demo_ShouldNotWork.dta', version=117, convert_strl=['Txt'])

Output (abridged):

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-aa00fe80a653> in <module>()
      9 df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])
     10 
---> 11 df.to_stata('Demo_ShouldNotWork.dta', version=117, convert_strl=['Txt'])

. . .

UnicodeEncodeError: 'latin-1' codec can't encode character '\u25ba' in position 53: ordinal not in range(256)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.3.2
pip: 10.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Edit1: Elaborated intro comments.
Edit2: Added to the expected output section.

kylebarron (Contributor, Author) commented:

@adamrossnelson

All of those characters are included in the Latin-1 encoding, and seem to work fine in Stata. If you want to test it yourself, run this:

set obs 10
gen x = ""
replace x = "Here is one __�__" in 1
replace x = "Another one __·__" in 2
replace x = "Another one __½__" in 3
replace x = "Bad bad bad __Á__" in 4
replace x = "Bad bad bad __¦__" in 5
saveold test.dta, version(13)

use test.dta, clear

So those characters should work with version 117, and if they don't it's a bug.

bashtage (Contributor) commented:

@adamrossnelson You are actually experiencing a different bug. The files that don't work with Stata fail because they are only partially written (at least on master). A patch is needed either to clean up writes that fail (OK, though this can be difficult to get right even if easy to implement), or to reorder the steps so that all of the data checks happen before the file is created (a better solution, but it may need a lot of work).
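
A minimal sketch of the first approach (clean up on failure); the file name and DataFrame below are placeholders, not the actual patch:

import os
import pandas as pd

df = pd.DataFrame({'Txt': ['fine', 'not latin-1: \u25ba']})
path = 'demo.dta'
try:
    df.to_stata(path, version=117, convert_strl=['Txt'])
except UnicodeEncodeError:
    # Remove the partially written file so Stata never sees a truncated .dta.
    if os.path.exists(path):
        os.remove(path)
    raise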

adamrossnelson commented:

@bashtage ... would you suggest starting a new issue? I'd be happy to do that to help track.

bashtage (Contributor) commented:

Not worth it. I have the PR ready.

kylebarron (Contributor, Author) commented:

@bashtage I might be mistaken, but I thought the bug was that those characters should already be writable with the current pandas writer. The df.to_stata documentation says it supports the Latin-1 encoding, and all of the characters in @adamrossnelson's comment are present in that encoding, so they should be writable, right?

bashtage (Contributor) commented:

� copies and pastes as U+FFFD which is not supported in Latin-1.
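
A quick way to confirm that:

# U+FFFD (the replacement character '�') is above 255, so it has no Latin-1
# encoding and triggers the same error to_stata() reports.
'\ufffd'.encode('latin-1')  # UnicodeEncodeError: ordinal not in range(256)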

bashtage (Contributor) commented:

You can run this to sanitize it:

import pandas as pd

# Make demonstration data. This data contains characters that should
# cause Pandas to throw an error when using df.to_stata().
bad_txt_sneaking_through = ''' Multiline text that sneaks by
Here is one __�__
Another one __·__   Another one __½__
Bad bad bad __Á__   Bad bad bad __¦__
Still more __é__    Still more __§__    Still more __®__ '''

data_list = []
data_list.append(['First Record', bad_txt_sneaking_through])
data_list.append(['Second Record', 'This one will be fine'])

# Make DataFrame from demonstration data.
df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])

# Define a function that drops characters that cannot be encoded in Latin-1.
def make_it_work(bad_text):
    ret_txt = ''
    for item in bad_text:
        try:
            ret_txt += item.encode('latin-1').decode('latin-1')
        except UnicodeEncodeError:
            pass
    return ret_txt

df2 = df.copy()
df2['Txt'] = df['Txt'].apply(make_it_work)

df2.to_stata('if-it-doesnt-load-there-is-a-bug.dta', version=117, convert_strl=['Txt'])

It seems that this file is still not loadable, which suggests there is a bug in multiline strL encoding. I suppose the first question is whether strL supports multiline strings in format 117.
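
As an aside, the per-character loop could also be written with vectorized string methods (a sketch that assumes the df from the snippet above; characters that cannot be encoded as Latin-1 are silently dropped):

df2 = df.copy()
df2['Txt'] = (df['Txt']
              .str.encode('latin-1', errors='ignore')
              .str.decode('latin-1'))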

kylebarron (Contributor, Author) commented:

I don't believe the original issue of this thread (Unicode Stata file writing support) is actually resolved. @adamrossnelson posted an example which showed both 1) the need for Unicode write support and 2) a bug in the current code. #24337 fixed the bug in the current code; Unicode write support is a larger endeavor and is still incomplete.

bashtage (Contributor) commented:

Yeah, it should be reopened. The PR closed a bug that came up in this issue rather than the issue itself.

jreback reopened this on Dec 18, 2018
jreback modified the milestones: 0.24.0, Contributions Welcome on Dec 27, 2018
jbrockmendel (Member) commented:

@bashtage is this actionable?

bashtage (Contributor) commented:

Yes. Someone could write a format 118 or 119 writer that supports unicode.

bashtage (Contributor) commented:

The spec is available. It's non-trivial, since the strings need to be UTF-8 encoded, but we use NumPy arrays internally, which store UTF-32 blobs.
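
Roughly, the re-encoding step would look something like this (illustrative only, not actual writer code):

import numpy as np

# NumPy unicode arrays hold fixed-width UTF-32 code points; a format 118
# writer has to re-encode each element as UTF-8 and size the column by the
# maximum byte length rather than the character length.
arr = np.array(['abc', '丆丈'])
encoded = np.char.encode(arr, 'utf-8')    # array of UTF-8 byte strings
max_bytes = max(len(b) for b in encoded)  # 6: each CJK character is 3 bytes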

jreback modified the milestones: Contributions Welcome, 1.0 on Dec 26, 2019