Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633

Closed
jtkiley opened this issue Nov 12, 2018 · 7 comments · Fixed by #23692
Closed
Labels
IO Data IO issues that don't fit into a more specific label IO Stata read_stata, to_stata
Milestone

Comments

@jtkiley
Copy link
Contributor

jtkiley commented Nov 12, 2018

The version 114 writer seems to handle columns of strings containing None just fine, but the 117 writer produces the AttributeError below.

Code to reproduce

import pandas as pd

sample_data1 = [
    {'str1': 'string' * 30,
     'number': 0},
    {'str1': None,
     'number': 1}
]

sample_data2 = [
    {'str1': 'string' * 500,
     'number': 0},
    {'str1': None,
     'number': 1}
]

data1 = pd.DataFrame(sample_data1)
data2 = pd.DataFrame(sample_data2)

data1.to_stata('./sample1_114.dta')
data1.to_stata('./sample1_117.dta', version=117)

# Will produce the ValueError for over 244 characters.
# data2.to_stata('./sample1_114.dta')

data2.to_stata('./sample2_117.dta', version=117)

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-f2347fb2dfc8> in <module>
     24 # data2.to_stata('./sample1_114.dta')
     25 
---> 26 data2.to_stata('./sample2_117.dta', version=117)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label, variable_labels, version, convert_strl)
   1875                              write_index=write_index,
   1876                              variable_labels=variable_labels, **kwargs)
-> 1877         writer.write_file()
   1878 
   1879     def to_feather(self, fname):

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in write_file(self)
   2214             self._write_expansion_fields()
   2215             self._write_characteristics()
-> 2216             self._prepare_data()
   2217             self._write_data()
   2218             self._write_strls()

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _prepare_data(self)
   2375                                                                self.fmtlist[i])
   2376         # 2. Convert strls
-> 2377         data = self._convert_strls(data)
   2378 
   2379         # 3. Convert bad string data to '' and pad to correct length

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _convert_strls(self, data)
   2963             tab, new_data = ssw.generate_table()
   2964             data = new_data
-> 2965             self._strl_blob = ssw.generate_blob(tab)
   2966         return data
   2967 

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in generate_blob(self, gso_table)
   2652 
   2653             # llll
-> 2654             encoded = self._encode(strl)
   2655             bio.write(struct.pack(len_type, len(encoded) + 1))
   2656 

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _encode(self, s)
   2592         """
   2593         if compat.PY3:
-> 2594             return s.encode(self._encoding)
   2595         else:
   2596             if isinstance(s, text_type):

AttributeError: 'NoneType' object has no attribute 'encode'

Problem description

This seems like a fairly straightforward regression in the Stata StrL part of the 117 writer compared to how the 114 writer does some more checking on regular strings before working with them.

Note: I'm not sure if other datatypes are handled differently and may have a similar issue, but this is the one I encountered and could reproduce.

If you're wondering why my actual data is so ugly that I encountered this, I'd like to blame something else, but I wrote the code that parses the data that ends up there.

Expected Output

I'd expect the 117 writer to clean up these StrL columns like the 114 writer cleans up standard strings (and so does the 117 writer, because it appears to use the same codepath for standard strings and only handles StrLs differently).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 7.1.1
sphinx: 1.8.1
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: 0.7.0

@jtkiley jtkiley changed the title StataWriter for version 117 fails on None in a string columns long enough to be a Stata StrL. StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. Nov 12, 2018
@TomAugspurger TomAugspurger added IO Data IO issues that don't fit into a more specific label IO Stata read_stata, to_stata labels Nov 12, 2018
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Nov 12, 2018
@kylebarron
Copy link
Contributor

kylebarron commented Nov 12, 2018

@jtkiley
The error is in a different place, but functionally this is a duplicate of #23572. The reason for your error is not the string length, but the fact that the object in your second row is None, and not a string. Your thoughts would definitely be appreciated on #23572.

See the difference in this code:

import pandas as pd
df1 = pd.DataFrame({'str1': ['string' * 500, '']})
df2 = pd.DataFrame({'str1': ['string' * 500, None]})

df1.to_stata('df1_117.dta', version=117)
# Works
df2.to_stata('df2_117.dta', version=117)
# Attribute error (can't encode None to a string)

@jtkiley
Copy link
Contributor Author

jtkiley commented Nov 12, 2018

@kylebarron I may be missing it (a cue for more coffee either way), but I think this is similar yet not quite the same.

In #23572, you seem to suggest that it fails because it's all None and that one string in that column will produce the coercing behavior.

Here, I have one string and one None in that column, so the coercing behavior works when the string is short. However, when it's long, the column is handed differently as a srtL by the 117 writer (a long string produces ValueError with the default 114 writer), and that same coercion behavior is not applied. So, the main issue is that it's inconsistent, and I'm suggesting that I like the coercing behavior.

Thanks for the pointer to #23572. I'll think about it a bit more and comment there, too.

@kylebarron
Copy link
Contributor

Yes, on second thought you're right that this isn't functionally the same.

I did want to point you to that issue anyways to get your thoughts on when None should be coerced to an empty string.

@jtkiley
Copy link
Contributor Author

jtkiley commented Nov 12, 2018

Just a note for anyone happening to end up here: there's some extended conversation on when/where to coerce over on #23572.

@jtkiley
Copy link
Contributor Author

jtkiley commented Nov 12, 2018

@bashtage I've been looking through stata.py, and I think I see what the issue is, but I'm not sure I understand all of the moving parts (and the consequences of changes) well enough to make a PR.

pandas/pandas/io/stata.py

Lines 2359 to 2383 in a08bf3d

# 2. Convert strls
data = self._convert_strls(data)
# 3. Convert bad string data to '' and pad to correct length
dtypes = []
data_cols = []
has_strings = False
native_byteorder = self._byteorder == _set_endianness(sys.byteorder)
for i, col in enumerate(data):
typ = typlist[i]
if typ <= self._max_string_length:
has_strings = True
data[col] = data[col].fillna('').apply(_pad_bytes, args=(typ,))
stype = 'S%d' % typ
dtypes.append(('c' + str(i), stype))
string = data[col].str.encode(self._encoding)
data_cols.append(string.values.astype(stype))
else:
values = data[col].values
dtype = data[col].dtype
if not native_byteorder:
dtype = dtype.newbyteorder(self._byteorder)
dtypes.append(('c' + str(i), dtype))
data_cols.append(values)
dtypes = np.dtype(dtypes)

In the 114 writer, line 2360 doesn't do anything, so we keep going and strings get filled by line 2371. In the 117 writer, strl columns are dealt with before filling (and, if I'm reading it right, their typ value wouldn't be true at line 2369 anyway), so they fail if they're None, but the ones <=2045 pass through and get fixed at 2371.

So, it seems like the fix is something like one of these two:

  1. Moving all of the filling before 2360, though that would require checking every column to see if it's a string (so, either <= _max_string_length or == 36768).
  2. Adding a similar filling operation for strls somewhere. It looks like StataStrLWriter gets the whole dataframe and a list of strl columns, so it might be as easy as adding for col in self.columns: self.df[col] = self.df[col].fillna('') to the constructor.

Thoughts? If option 2 is viable, I can make a PR for that, as I don't see as much concern for breaking things that I don't understand.

bashtage added a commit to bashtage/pandas that referenced this issue Nov 14, 2018
Enable export of large columns to Stata strls when the column
contains None as a null value

closes pandas-dev#23633
@bashtage
Copy link
Contributor

Simple fix in #23692

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 14, 2018
@jtkiley
Copy link
Contributor Author

jtkiley commented Nov 14, 2018

@bashtage Great, thanks. I figured there would be a better place, but I haven't read enough to the 117 docs to quite grasp what you were doing in that part of StataStrLWriter.

jreback pushed a commit that referenced this issue Nov 14, 2018
Enable export of large columns to Stata strls when the column
contains None as a null value

closes #23633
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
Enable export of large columns to Stata strls when the column
contains None as a null value

closes pandas-dev#23633
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Enable export of large columns to Stata strls when the column
contains None as a null value

closes pandas-dev#23633
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Enable export of large columns to Stata strls when the column
contains None as a null value

closes pandas-dev#23633
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants