Suggested improvement: allow ExcelWriter to automatically convert lists and dict to strings #8188

Closed
teese opened this Issue Sep 5, 2014 · 9 comments

Comments

Projects
None yet
5 participants

teese commented Sep 5, 2014

Problem description:
I have pandas dataframes with diverse data types and structures, including lists. The csv-writer saves the lists without trouble, but I prefer to also save in excel format for collaboration with my colleagues.
Currently the ExcelWriter returns a TypeError if the dataframe contains a list.
TypeError: float() argument must be a string or a number, not 'list'

It would be really great if ExcelWriter accepted lists and dictionaries and tuples.
Would it cause any problem to simply convert these data formats to strings before saving?
Sorry I’m not an experienced programmer, so I haven‘t made a pull request and attempted to improve the module myself.
Here is some code (Python 3.4, pandas 0.14.0) to describe the current situation.

import pandas as pd
#create a new dataframe that includes a list and a dict
title = pd.Series(['Toy Story (1995)',
 'Jumanji (1995)',
 'Grumpier Old Men (1995)',
 'Waiting to Exhale (1995)',
 'Father of the Bride Part II (1995)'], index = [0 ,1, 2, 3, 4], name='titles')
genre_as_list = pd.Series([['Animation', "Children's", 'Comedy'],
 ['Adventure', "Children's", 'Fantasy'],
 ['Comedy', 'Romance'],
 ['Comedy', 'Drama'],
 ['Comedy']], index = [0 ,1, 2, 3, 4], name='genre_as_list')
rating_as_dict = pd.Series([{'F':0.5,'M':0.6},
 {'F':0.1,'M':0.2},
 {'F':0.3,'M':0.3},
 {'F':0.8,'M':0.8},
 {'F':0.8,'M':0.6}], index = [0 ,1, 2, 3, 4], name='rating_as_dict')
df = pd.concat([title, genre_as_list, rating_as_dict], axis=1)

attempt to save the dataframe in excel

writer = pd.ExcelWriter('saved_df.xlsx')
df.to_excel(writer, sheet_name='Sheet1')

which will result in
TypeError: float() argument must be a string or a number, not 'list'
to save the data, it is necessary to convert the python lists or dicts to strings, separated by ", " or pipes "|"

df['genre_as_string'] = [str(i) for i in df['genre_as_list']]
df['rating_as_string'] = [str(i) for i in df['rating_as_dict']]
df['genre_pipes_string'] = ['|'.join(i) for i in df['genre_as_list']]
#in my ipython output, the difference is not actually visible between the lists and the stringlists
print('first item from list: %s\nfirst item from stringlist: %s' % (df['genre_as_list'][0][0],
                                                          df['genre_as_string'][0][0]))

After deleting any columns containing a python list or dict, I can now save without any problem.

df = df.drop(['genre_as_list','rating_as_dict'], axis=1)
writer = pd.ExcelWriter('saved_df.xlsx')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()

If I load from the saved excel sheet, I can confirm that the data is intact

df_from_excel = pd.read_excel('saved_df.xlsx', sheetname='Sheet1')
#convert back to original python lists and dict, if desired
df_from_excel['genre_as_list_again'] = [eval(i) for i in df_from_excel['genre_as_string']]
df_from_excel['rating_as_dict_again'] = [eval(i) for i in df_from_excel['rating_as_string']]
df_from_excel.loc[0,'genre_as_list_again'][0]
Contributor

jtratner commented Sep 7, 2014

This is really something that happens because of a quirk of how the CSV module is set up (converts everything to str) vs. how excel writers are set up (tries to do the right thing by value type). You can actually just do something like: df['myrow'] = df['myrow'].astype(str):

In [24]: df['rating_as_dict'] = df['rating_as_dict'].astype(str)

In [25]: df
Out[25]:
                               titles                     genre_as_list  \
0                    Toy Story (1995)   [Animation, Children's, Comedy]
1                      Jumanji (1995)  [Adventure, Children's, Fantasy]
2             Grumpier Old Men (1995)                 [Comedy, Romance]
3            Waiting to Exhale (1995)                   [Comedy, Drama]
4  Father of the Bride Part II (1995)                          [Comedy]

         rating_as_dict
0  {'M': 0.6, 'F': 0.5}
1  {'M': 0.2, 'F': 0.1}
2  {'M': 0.3, 'F': 0.3}
3  {'M': 0.8, 'F': 0.8}
4  {'M': 0.6, 'F': 0.8}

In [26]: df['rating_as_dict'][0]
Out[26]: "{'M': 0.6, 'F': 0.5}"

There isn't really a 'correct' answer here (I think I'd personally want an error to be raised).

Contributor

jtratner commented Sep 8, 2014

but it wouldn't be that difficult to incorporate that - do you want to take a stab at it? (the code isn't that complicated and I believe it's all contained within pandas/io/excel.py, really just adding another case to the checks - you can use core's pandas.core.common.is_list_like() for your check

Contributor

onesandzeroes commented Sep 8, 2014

I can have a go at this if @markinlabcoat doesn't want to. My thinking is that it should be a non-default behaviour, so we'd be adding a convert_to_str=False default arg to to_excel(). Would also need to add a note about it in the TypeError message so people know where to look if they want to enable.

Does that sound about right @jtratner ?

Is it necessary to have a specific keyword for this? The other possibility is to leave this up to the user to do themselves astype(str) on the needed columns instead of something like convert_to_str=True if they really want to store lists or dicts with to_excel?

But the error message could certainly be improved.

Contributor

jtratner commented Sep 8, 2014

@jorisvandenbossche @onesandzeroes - is there a reason why it's useful to not convert to string (given that to_csv does it already)? it's not a particularly large performance impact (a third if clause branch) and it just means that you can do to_excel() without worrying about the types in your DataFrame.

Yes, indeed, that is also good. I just wouldn't introduce a keyword for it. But converting it to string by default seems also good. Certainly if that is also the behaviour of to_csv

Contributor

onesandzeroes commented Sep 9, 2014

I guess I was thinking in terms of how easy it is to actually doing anything with the string-converted lists once you've got them in Excel, but now that I think about it, that's probably not a big concern. I'll start working on converting to string by default.

Contributor

jtratner commented Sep 9, 2014

You can do this pretty easily, all you need to do is slightly modify this function (in pandas/io/excel.py):

def _conv_value(val):
    # Convert numpy types to Python types for the Excel writers.
    if com.is_integer(val):
        val = int(val)
    elif com.is_float(val):
        val = float(val)
    elif com.is_bool(val):
        val = bool(val)
    elif isinstance(val, Period):
        val = "%s" % val

    return val

just make up a few test cases and we're good :)

teese commented Sep 9, 2014

Thanks for the comments. Please go ahead with your own solution. It's somewhat beyond my current ability, and I'm busy wearing my labcoat this week :)

@hunterowens hunterowens added a commit to hunterowens/pandas that referenced this issue Jan 2, 2015

@hunterowens hunterowens adding is_list_like check for excel, #8188 caf507e

@hunterowens hunterowens added a commit to hunterowens/pandas that referenced this issue Jan 2, 2015

@hunterowens hunterowens test is_list_like, #8188 bb468f8

jreback added this to the 0.17.0 milestone Jun 18, 2015

@bashtage bashtage pushed a commit to bashtage/pandas that referenced this issue Jun 20, 2015

Kevin Sheppard + Kevin Sheppard ENH: Enable ExcelWriter to construct in-memory sheets
Add support for StringIO/BytesIO to ExcelWriter
Add vbench support for writing excel files
Add support for serializing lists/dicts to strings
Fix bug when reading blank excel sheets
Added xlwt to Python 3.4 builds

closes #8188
closes #7074
closes #6403
closes #7171
closes #6947
9220309

jreback closed this in #10376 Jun 20, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment