Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

include_tailing_empty=False causes crash in get_as_df when last column is empty #423

Closed
aaroncooper opened this issue May 6, 2020 · 8 comments

Comments

@aaroncooper
Copy link

Note, if this is a usage question, please ask a question in stackoverflow with pygsheets tag.

Describe the bug
When using worksheet.get_as_df and setting include_tailing_empty=False, a completely empty final named column leads to a crash. It'd be nicer to gracefully return a df with

To Reproduce
Steps to reproduce the behavior: make a google sheet with a rightmost named column with empty cells below. attempt worksheet.get_as_df

code here ...

  File "/app/orf_optimize.py", line 286, in main
    df = wks.get_as_df(start='A1', empty_value=np.nan, include_tailing_empty=False)
  File "/usr/local/lib/python3.7/site-packages/pygsheets/worksheet.py", line 1412, in get_as_df
    df = pd.DataFrame(values, columns=keys)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/frame.py", line 474, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 461, in to_arrays
    return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 500, in _list_to_arrays
    raise ValueError(e) from e
ValueError: 6 columns passed, passed data had 5 columns

System Information

  • OS: OSX
  • pygsheets version : 2.0.3.1
  • pygsheets installed from (github or pypi): pypi
@nithinmurali
Copy link
Owner

Hi, it shouldn't throw an error but it does makes sense to forcefully enable include_tailing_empty=True when has_header=True.because values should match number of columns.

May i ask what is the behavior you are expecting when calling with has_header and include_tailing_empty=True?

@aaroncooper
Copy link
Author

With has_header=True, include_tailing_empty=True, it makes sense to return all the columns on the Google sheet, which is the behavior I see now. I was thinking that when `has_header=True, include_tailing_empty=False', it would be better to return as many columns as there are header rows. That would avoid the above error, wouldn't it?

@aaroncooper
Copy link
Author

aaroncooper commented May 9, 2020

Right now, I'm getting around this with this block of code, which seems a little convoluted:

df = wks.get_as_df(include_tailing_empty=True)
non_empty_cols = [col for col in df if col != '']
df = df[non_empty_cols]
df = df.fillna('')

I realize that non_empty_cols would be more properly named has_header_cols in the context of our discusson.

@nithinmurali
Copy link
Owner

The error is because when include_tailing_empty=False and some rows are empty then there will be a mismatch. the solution is either to auto fill the values according to the length of header or always set tailing_empty=True depending when header is True.

I think its better to go with 1st option.

@aaroncooper
Copy link
Author

I agree with you-- if the cells are empty in the Google sheet, it makes sense to auto fill with empty_value

@shenker
Copy link

shenker commented Oct 3, 2020

Just wanted to chime in to say it'd be great to get this fixed!

@RussianImperialScott
Copy link
Contributor

The error is because when include_tailing_empty=False and some rows are empty then there will be a mismatch. the solution is either to auto fill the values according to the length of header or always set tailing_empty=True depending when header is True.

I think its better to go with 1st option.

Looking at the commit that closed out this issue (and the behavior of .get_as_df() that caused me to look into this) it looks like the 2nd option was chosen instead. This results in the return of extra columns with empty strings as names as previously described. Is there a particular reason why?

If I were to submit a pull request to change this behavior to option 1 instead, would it be better for the the .get_values() call return a perfect rectangle, or for .get_as_df() to fill in the missing values for the empty columns that have header entries?

Just a side note that returning extra columns with empty strings as names is very hard to debug. Interactively, pandas printouts do not make it obvious that those columns exist. And by specifying include_tailing_empty=False, the user doesn't even expect those phantom columns to exist. I would very much like for this to not be the default behavior. Thanks!

RussianImperialScott added a commit to RussianImperialScott/pygsheets that referenced this issue Feb 1, 2021
When include_tailing_empty=False, now returns a DataFrame just as wide
as necessary to accommodate the header/data. Issues a warning if
has_header=True and >=1 column name is an empty string.
@nithinmurali
Copy link
Owner

Merged @RussianImperialScott 's PR hence the default behavior is changed to option 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants