Support writing unicode characters in df.to_stata() #23573

Closed
kylebarron opened this issue Nov 8, 2018 · 13 comments · Fixed by #24337 or #30285

kylebarron (Contributor) commented Nov 8, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a': ['丆']})
df.to_stata('test.dta')
# UnicodeEncodeError: 'latin-1' codec can't encode character '\u4e06' in position 0: ordinal not in range(256)

I picked an arbitrary CJK character to test this with.

Problem description

It would be possible to write Unicode strings to a Stata file by implementing a writer according to version 118 of the dta format.

I'd be interested in trying to submit a PR for this. (Edit: I don't use Stata anymore)
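
A rough sketch of what the call could look like once a format 118 writer exists (the version=118 argument here is hypothetical at the time of writing and simply mirrors the existing version=117 option):

import pandas as pd

df = pd.DataFrame({'a': ['丆']})
# Hypothetical: a dta format 118 writer would store strings as UTF-8,
# so this call would succeed instead of raising UnicodeEncodeError.
df.to_stata('test.dta', version=118)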

Expected Output

Stata file written to disk.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.18.7.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 7.0.1
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
jbrockmendel added the Unicode (Unicode strings) and IO Stata (read_stata, to_stata) labels on Nov 10, 2018
adamrossnelson commented Nov 24, 2018

I think I can add to this issue with the following. The current scheme seems to be that pd.to_stata knows it does not support writing Unicode, so if it finds Unicode it helpfully throws an error.

However, I believe I've found some characters that pd.to_stata will write to a Stata data file without throwing an error, but that cause trouble for Stata's version 117 format. (Which is a long way of saying I vote in favor of an enhancement that supports version 118.)

Code Sample, a copy-pastable example if possible

import pandas as pd

# Make demonstration data. This data contains characters that should
# cause Pandas to throw an error when using df.to_stata().
bad_txt_sneaking_through = ''' Multiline text that sneaks by
Here is one __�__
Another one __·__   Another one __½__
Bad bad bad __Á__   Bad bad bad __¦__
Still more __é__    Still more __§__    Still more __®__ '''

data_list = []
data_list.append(['First Record', bad_txt_sneaking_through])
data_list.append(['Second Record', 'This one will be fine'])

# Make DataFrame from demonstration data.
df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])

# Write data frame to Stata data file. Shouldn't write but does.
# This file will not open in Stata.
df.to_stata('Demo_ShouldNotWork.dta', version=117, convert_strl=['Txt'])

# Write first record which has the offending characters.
# This file will not open in Stata.
df[0:1].to_stata('Demo_DoesNotWork.dta', version=117, convert_strl=['Txt'])

# Write second record which has no offending characters.
# This file will open in Stata.
df[1:2].to_stata('Demo_DoesWork.dta', version=117, convert_strl=['Txt'])

# Define a function that tests the diagnosis (bad character count):
# keep only characters whose UTF-8 encoding is a single byte.
def make_it_work(bad_text):
    ret_txt = ''
    for item in bad_text:
        ret_txt += item if len(item.encode(encoding='utf_8')) == 1 else ''
    return ret_txt

df['Txt'] = df['Txt'].apply(make_it_work)

# Write data frame to Stata data file. This time it should write and does.
# This file will open in Stata.
df.to_stata('Demo_ShouldWork.dta', version=117, convert_strl=['Txt'])

Problem description

Pandas' pd.to_stata does not throw an error for some characters that are problematic in Stata.

When writing Stata data files, pandas usually (and helpfully) throws an error if there are non-Latin-1 characters in a strL field. However, when I was working with a large dataset scraped from the web, I managed to write a data file without getting an error from pandas. All seemed well, but Stata was unable to read the file.

With some assistance from Stata technical support, I believe the correct diagnosis is that the total number of characters in the strL is undercounted. Stata technical support indicated that Unicode characters will throw off the count. At first I thought this shouldn't be a problem, because I assumed pd.to_stata would catch the issue by throwing an error; as the code sample above demonstrates, some characters manage to sneak through.
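
A small illustration of the character/byte mismatch (assuming the strL payload ends up written as multi-byte text while the length is counted in characters):

s = 'Bad bad bad __Á__'
len(s)                    # 17 characters
len(s.encode('utf-8'))    # 18 bytes: 'Á' takes two bytes in UTF-8
len(s.encode('latin-1'))  # 17 bytes: 'Á' is a single byte in Latin-1
# If the recorded length counts characters while the payload uses multi-byte
# text, the count comes up short, consistent with Stata refusing to read the file.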

In troubleshooting and documenting the issue, I believe the function make_it_work() above is at least a partial solution. It is crude, but it might help folks using pd.to_stata until there is a more integrated fix; or perhaps someone can point me to another approach. I would also put in a plug for enhancing pd.to_stata so that it supports writing Unicode.

Expected Output

I would vote in favor of a future change that throws an error for the characters that currently sneak through. Here is code and output showing the more helpful behavior. Alternatively, an enhancement like the one @kylebarron suggested, accommodating Unicode, would also be an appropriate solution.

bad_txt_not_sneaking_through = '''Bad text that does not sneak through...

Here you go ► '''

data_list = []
data_list.append(['First Record', bad_txt_not_sneaking_through])
data_list.append(['Second Record', 'This one will be fine'])

df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])

df.to_stata('Demo_ShouldNotWork.dta', version=117, convert_strl=['Txt'])

Output (abridged):

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-aa00fe80a653> in <module>()
      9 df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])
     10 
---> 11 df.to_stata('Demo_ShouldNotWork.dta', version=117, convert_strl=['Txt'])

. . .

UnicodeEncodeError: 'latin-1' codec can't encode character '\u25ba' in position 53: ordinal not in range(256)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.3.2
pip: 10.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Edit1: Elaborated intro comments.
Edit2: Added to the expected output section.

kylebarron (Contributor, Author) commented:

@adamrossnelson

All of those characters are included in the Latin-1 encoding, and seem to work fine in Stata. If you want to test it yourself, run this:

set obs 10
gen x = ""
replace x = "Here is one __�__" in 1
replace x = "Another one __·__" in 2
replace x = "Another one __½__" in 3
replace x = "Bad bad bad __Á__" in 4
replace x = "Bad bad bad __¦__" in 5
saveold test.dta, version(13)

use test.dta, clear

So those characters should work with version 117, and if they don't it's a bug.

bashtage (Contributor) commented:

@adamrossnelson You are actually experiencing a different bug. The files that don't work with Stata fail because they are only partially written (at least on master). A patch is needed either to clean up writes that fail (OK, though this can be difficult to get right even if easy to implement), or to reorder the steps so that all of the data checks happen before the file is created (a better solution, but it may need a lot of work).
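
A minimal sketch of the first approach (clean up on failure); the file name and DataFrame below are placeholders, not the actual patch:

import os
import pandas as pd

df = pd.DataFrame({'Txt': ['fine', 'not latin-1: \u25ba']})
path = 'demo.dta'
try:
    df.to_stata(path, version=117, convert_strl=['Txt'])
except UnicodeEncodeError:
    # Remove the partially written file so Stata never sees a truncated .dta.
    if os.path.exists(path):
        os.remove(path)
    raise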

adamrossnelson commented:

@bashtage ... would you suggest starting a new issue? I'd be happy to do that to help track.

bashtage (Contributor) commented:

Not worth it. I have the PR ready.

kylebarron (Contributor, Author) commented:

@bashtage I might be mistaken, but I thought the bug was that those characters should already be writable with the current pandas writer. The df.to_stata documentation says it supports the Latin-1 encoding, and all of the characters in @adamrossnelson's comment are present in that encoding, so they should be writable, right?

bashtage (Contributor) commented:

� copies and pastes as U+FFFD which is not supported in Latin-1.
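
A quick way to confirm that:

# U+FFFD (the replacement character '�') is above 255, so it has no Latin-1
# encoding and triggers the same error to_stata() reports.
'\ufffd'.encode('latin-1')  # UnicodeEncodeError: ordinal not in range(256)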

bashtage (Contributor) commented:

You can run this to sanitize it:

import pandas as pd

# Make demonstration data. This data contains characters that should
# cause Pandas to throw an error when using df.to_stata().
bad_txt_sneaking_through = ''' Multiline text that sneaks by
Here is one __�__
Another one __·__   Another one __½__
Bad bad bad __Á__   Bad bad bad __¦__
Still more __é__    Still more __§__    Still more __®__ '''

data_list = []
data_list.append(['First Record', bad_txt_sneaking_through])
data_list.append(['Second Record', 'This one will be fine'])

# Make DataFrame from demonstration data.
df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])

# Define a function that drops characters that cannot be encoded in Latin-1.
def make_it_work(bad_text):
    ret_txt = ''
    for item in bad_text:
        try:
            ret_txt += item.encode('latin-1').decode('latin-1')
        except UnicodeEncodeError:
            pass
    return ret_txt

df2 = df.copy()
df2['Txt'] = df['Txt'].apply(make_it_work)

df2.to_stata('if-it-doesnt-load-there-is-a-bug.dta', version=117, convert_strl=['Txt'])

It seems that this file is still not loadable, which suggests there is a bug in multiline strL encoding. I suppose the first question is whether strL supports multiline strings in format 117.
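
As an aside, the per-character loop could also be written with vectorized string methods (a sketch that assumes the df from the snippet above; characters that cannot be encoded as Latin-1 are silently dropped):

df2 = df.copy()
df2['Txt'] = (df['Txt']
              .str.encode('latin-1', errors='ignore')
              .str.decode('latin-1'))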

kylebarron (Contributor, Author) commented:

I don't believe the original issue of this thread (Unicode Stata file writing support) is actually resolved. @adamrossnelson posted an example which showed both 1) the need for Unicode write support and 2) a bug in the current code. #24337 fixed the bug in the current code; Unicode write support is a larger endeavor and is still incomplete.

bashtage (Contributor) commented:

Yeah, it should be reopened. The PR closed a bug that came up in this issue rather than the issue itself.

jreback reopened this on Dec 18, 2018
jreback modified the milestones: 0.24.0, Contributions Welcome on Dec 27, 2018
jbrockmendel (Member) commented:

@bashtage is this actionable?

bashtage (Contributor) commented:

Yes. Someone could write a format 118 or 119 writer that supports unicode.

bashtage (Contributor) commented:

The spec is available. It's non-trivial, since the strings need to be UTF-8 encoded, but we use NumPy arrays internally, which store UTF-32 blobs.
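
Roughly, the re-encoding step would look something like this (illustrative only, not actual writer code):

import numpy as np

# NumPy unicode arrays hold fixed-width UTF-32 code points; a format 118
# writer has to re-encode each element as UTF-8 and size the column by the
# maximum byte length rather than the character length.
arr = np.array(['abc', '丆丈'])
encoded = np.char.encode(arr, 'utf-8')    # array of UTF-8 byte strings
max_bytes = max(len(b) for b in encoded)  # 6: each CJK character is 3 bytes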

jreback modified the milestones: Contributions Welcome, 1.0 on Dec 26, 2019