StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633

jtkiley · 2018-11-12T00:07:52Z

The version 114 writer seems to handle columns of strings containing None just fine, but the 117 writer produces the AttributeError below.

Code to reproduce

import pandas as pd

sample_data1 = [
    {'str1': 'string' * 30,
     'number': 0},
    {'str1': None,
     'number': 1}
]

sample_data2 = [
    {'str1': 'string' * 500,
     'number': 0},
    {'str1': None,
     'number': 1}
]

data1 = pd.DataFrame(sample_data1)
data2 = pd.DataFrame(sample_data2)

data1.to_stata('./sample1_114.dta')
data1.to_stata('./sample1_117.dta', version=117)

# Will produce the ValueError for over 244 characters.
# data2.to_stata('./sample1_114.dta')

data2.to_stata('./sample2_117.dta', version=117)

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-f2347fb2dfc8> in <module>
     24 # data2.to_stata('./sample1_114.dta')
     25 
---> 26 data2.to_stata('./sample2_117.dta', version=117)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label, variable_labels, version, convert_strl)
   1875                              write_index=write_index,
   1876                              variable_labels=variable_labels, **kwargs)
-> 1877         writer.write_file()
   1878 
   1879     def to_feather(self, fname):

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in write_file(self)
   2214             self._write_expansion_fields()
   2215             self._write_characteristics()
-> 2216             self._prepare_data()
   2217             self._write_data()
   2218             self._write_strls()

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _prepare_data(self)
   2375                                                                self.fmtlist[i])
   2376         # 2. Convert strls
-> 2377         data = self._convert_strls(data)
   2378 
   2379         # 3. Convert bad string data to '' and pad to correct length

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _convert_strls(self, data)
   2963             tab, new_data = ssw.generate_table()
   2964             data = new_data
-> 2965             self._strl_blob = ssw.generate_blob(tab)
   2966         return data
   2967 

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in generate_blob(self, gso_table)
   2652 
   2653             # llll
-> 2654             encoded = self._encode(strl)
   2655             bio.write(struct.pack(len_type, len(encoded) + 1))
   2656 

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _encode(self, s)
   2592         """
   2593         if compat.PY3:
-> 2594             return s.encode(self._encoding)
   2595         else:
   2596             if isinstance(s, text_type):

AttributeError: 'NoneType' object has no attribute 'encode'

Problem description

This seems like a fairly straightforward regression in the Stata StrL part of the 117 writer compared to how the 114 writer does some more checking on regular strings before working with them.

Note: I'm not sure if other datatypes are handled differently and may have a similar issue, but this is the one I encountered and could reproduce.

If you're wondering why my actual data is so ugly that I encountered this, I'd like to blame something else, but I wrote the code that parses the data that ends up there.

Expected Output

I'd expect the 117 writer to clean up these StrL columns like the 114 writer cleans up standard strings (and so does the 117 writer, because it appears to use the same codepath for standard strings and only handles StrLs differently).

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 7.1.1
sphinx: 1.8.1
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: 0.7.0

The text was updated successfully, but these errors were encountered:

kylebarron · 2018-11-12T16:05:48Z

@jtkiley
The error is in a different place, but functionally this is a duplicate of #23572. The reason for your error is not the string length, but the fact that the object in your second row is None, and not a string. Your thoughts would definitely be appreciated on #23572.

See the difference in this code:

import pandas as pd
df1 = pd.DataFrame({'str1': ['string' * 500, '']})
df2 = pd.DataFrame({'str1': ['string' * 500, None]})

df1.to_stata('df1_117.dta', version=117)
# Works
df2.to_stata('df2_117.dta', version=117)
# Attribute error (can't encode None to a string)

jtkiley · 2018-11-12T16:23:44Z

@kylebarron I may be missing it (a cue for more coffee either way), but I think this is similar yet not quite the same.

In #23572, you seem to suggest that it fails because it's all None and that one string in that column will produce the coercing behavior.

Here, I have one string and one None in that column, so the coercing behavior works when the string is short. However, when it's long, the column is handed differently as a srtL by the 117 writer (a long string produces ValueError with the default 114 writer), and that same coercion behavior is not applied. So, the main issue is that it's inconsistent, and I'm suggesting that I like the coercing behavior.

Thanks for the pointer to #23572. I'll think about it a bit more and comment there, too.

kylebarron · 2018-11-12T16:34:14Z

Yes, on second thought you're right that this isn't functionally the same.

I did want to point you to that issue anyways to get your thoughts on when None should be coerced to an empty string.

jtkiley · 2018-11-12T21:09:16Z

Just a note for anyone happening to end up here: there's some extended conversation on when/where to coerce over on #23572.

jtkiley · 2018-11-12T23:06:09Z

@bashtage I've been looking through stata.py, and I think I see what the issue is, but I'm not sure I understand all of the moving parts (and the consequences of changes) well enough to make a PR.

pandas/pandas/io/stata.py

Lines 2359 to 2383 in a08bf3d

    
           # 2. Convert strls 
        
           data = self._convert_strls(data) 
        
           # 3. Convert bad string data to '' and pad to correct length 
        
           dtypes = [] 
        
           data_cols = [] 
        
           has_strings = False 
        
           native_byteorder = self._byteorder == _set_endianness(sys.byteorder) 
        
           for i, col in enumerate(data): 
        
               typ = typlist[i] 
        
               if typ <= self._max_string_length: 
        
                   has_strings = True 
        
                   data[col] = data[col].fillna('').apply(_pad_bytes, args=(typ,)) 
        
                   stype = 'S%d' % typ 
        
                   dtypes.append(('c' + str(i), stype)) 
        
                   string = data[col].str.encode(self._encoding) 
        
                   data_cols.append(string.values.astype(stype)) 
        
               else: 
        
                   values = data[col].values 
        
                   dtype = data[col].dtype 
        
                   if not native_byteorder: 
        
                       dtype = dtype.newbyteorder(self._byteorder) 
        
                   dtypes.append(('c' + str(i), dtype)) 
        
                   data_cols.append(values) 
        
           dtypes = np.dtype(dtypes)

In the 114 writer, line 2360 doesn't do anything, so we keep going and strings get filled by line 2371. In the 117 writer, strl columns are dealt with before filling (and, if I'm reading it right, their typ value wouldn't be true at line 2369 anyway), so they fail if they're None, but the ones <=2045 pass through and get fixed at 2371.

So, it seems like the fix is something like one of these two:

Moving all of the filling before 2360, though that would require checking every column to see if it's a string (so, either <= _max_string_length or == 36768).
Adding a similar filling operation for strls somewhere. It looks like StataStrLWriter gets the whole dataframe and a list of strl columns, so it might be as easy as adding for col in self.columns: self.df[col] = self.df[col].fillna('') to the constructor.

Thoughts? If option 2 is viable, I can make a PR for that, as I don't see as much concern for breaking things that I don't understand.

Enable export of large columns to Stata strls when the column contains None as a null value closes pandas-dev#23633

bashtage · 2018-11-14T12:27:06Z

Simple fix in #23692

jtkiley · 2018-11-14T13:49:19Z

@bashtage Great, thanks. I figured there would be a better place, but I haven't read enough to the 117 docs to quite grasp what you were doing in that part of StataStrLWriter.

Enable export of large columns to Stata strls when the column contains None as a null value closes #23633

Enable export of large columns to Stata strls when the column contains None as a null value closes pandas-dev#23633

jtkiley changed the title ~~StataWriter for version 117 fails on None in a string columns long enough to be a Stata StrL.~~ StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. Nov 12, 2018

TomAugspurger added IO Data IO issues that don't fit into a more specific label IO Stata read_stata, to_stata labels Nov 12, 2018

TomAugspurger added this to the Contributions Welcome milestone Nov 12, 2018

jtkiley mentioned this issue Nov 12, 2018

df.to_stata fails when a column of type object contains only None #23572

Closed

bashtage added a commit to bashtage/pandas that referenced this issue Nov 14, 2018

ENH: Allow export of mixed columns to Stata strl

75f9d80

Enable export of large columns to Stata strls when the column contains None as a null value closes pandas-dev#23633

bashtage mentioned this issue Nov 14, 2018

ENH: Allow export of mixed columns to Stata strl #23692

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 14, 2018

jreback closed this as completed in #23692 Nov 14, 2018

jreback pushed a commit that referenced this issue Nov 14, 2018

ENH: Allow export of mixed columns to Stata strl (#23692)

fcb8403

Enable export of large columns to Stata strls when the column contains None as a null value closes #23633

tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018

ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)

b93de32

Enable export of large columns to Stata strls when the column contains None as a null value closes pandas-dev#23633

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019

ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)

f09698e

Enable export of large columns to Stata strls when the column contains None as a null value closes pandas-dev#23633

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019

ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)

1f26fc2

Enable export of large columns to Stata strls when the column contains None as a null value closes pandas-dev#23633

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633

StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633

jtkiley commented Nov 12, 2018

INSTALLED VERSIONS

kylebarron commented Nov 12, 2018 •

edited

Loading

jtkiley commented Nov 12, 2018

kylebarron commented Nov 12, 2018

jtkiley commented Nov 12, 2018

jtkiley commented Nov 12, 2018

bashtage commented Nov 14, 2018

jtkiley commented Nov 14, 2018

StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633

StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633

Comments

jtkiley commented Nov 12, 2018

Code to reproduce

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

kylebarron commented Nov 12, 2018 • edited Loading

jtkiley commented Nov 12, 2018

kylebarron commented Nov 12, 2018

jtkiley commented Nov 12, 2018

jtkiley commented Nov 12, 2018

bashtage commented Nov 14, 2018

jtkiley commented Nov 14, 2018

Output of `pd.show_versions()`

kylebarron commented Nov 12, 2018 •

edited

Loading