Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: read_excel sheet_name argument #19842

Closed
trespitt opened this issue Feb 22, 2018 · 13 comments · Fixed by #42874
Closed

Feature Request: read_excel sheet_name argument #19842

trespitt opened this issue Feb 22, 2018 · 13 comments · Fixed by #42874
Labels
Enhancement IO Excel read_excel, to_excel

Comments

@trespitt
Copy link

I would love to be able to only read in un-hidden sheets. this could be a new possible value that can be passed to the sheet_name argument.

alternatively, what would also solve my problem is some way to filter the columns (again, probably for sheet_name). ideally be able to use regex or string comparison to choose which sheets to read.

also, what about being able to specify which sheets to be ignored?

@celsopneto
Copy link

Wouldn't usecols solve your problem?

usecols : int or list, default None

If None then parse all columns,
If int then indicates last column to be parsed
If list of ints then indicates list of column numbers to be parsed
If string then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.

then you can define a function to pass the list or string as you want.

@trespitt
Copy link
Author

i dont think that would solve my problem. the equivalent 'usesheets' would, however.

My problem is that one of the sheets, a hidden one, has different columns than all the others. (this prevents the use of names=[x,y,..] in the initial call to read_excel(), forcing me to iterate through the dict of DFs and rename the columns individually

let me know if there's something i'm missing. thanks for your suggestion

@WillAyd
Copy link
Member

WillAyd commented Feb 22, 2018

sheet_name does accept a list returning an OrderedDict of DataFrame objects - does that not cover your use case?

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

@trespitt
Copy link
Author

@WillAyd thanks for the suggestion

My problem is that one of the sheets, a hidden one, has different columns than all the others. (this prevents the use of names=[x,y,..] in the initial call to read_excel(), forcing me to iterate through the dict of DFs and rename the columns individually

@WillAyd
Copy link
Member

WillAyd commented Feb 22, 2018

I'm not entirely clear on what you are looking for, but I think you have two options:

Option 1
This will essentially union all column names encountered, filling in NaN where appropriate

# Change sheet_name as required
df_dict = pd.read_excel('your_file.xlsx', sheet_name=None)  
df = pd.concat(df_dict)

Option 2
This will ignore the first row of data altogether, allowing you to set the names you want.

df_dict = pd.read_excel('test.xlsx', sheet_name=None, header=None, skiprows=1, 
                        names=['foo', 'bar', 'baz'])
df = pd.concat(df_dict)

@gfyoung
Copy link
Member

gfyoung commented Feb 23, 2018

@tres-pitt : Thanks for the report! Let me dissect this a bit:

  1. Checking for whether a sheet is hidden would be tricky unless we were somehow able to surface an attribute for visibility. Not sure if that's something we have available, but open to investigation.

  2. Your suggestion of regex wouldn't be particularly hard to add, though I'm not sure how great the use-case would be for such an enhancement. Regex is not a very user-friendly option for most.

  3. I'm more a fan of your suggestion to skip sheets. We do this with columns, so I don't see why we couldn't do the same with Excel sheets.

So IMO 1 and 3 would be the best ways to go, with 3 being the easiest to implement.

@AlistairMills
Copy link

This is probably not the correct place to report what seems to me like a bug. Please direct me to the appropriate place.

I am having trouble with read_excel sheet_name. Pandas reads the spreadsheet, but it always reads sheet 0. When I try sheet_name=None, it reads the excel file too.

@WillAyd
Copy link
Member

WillAyd commented Apr 11, 2018

@AlistairMills if you are seeing what you think is a bug then you should open a separate issue for it. Below is documentation on the most effective way to do that:

https://github.com/pandas-dev/pandas/blob/master/.github/CONTRIBUTING.md#filing-issues

@AlistairMills
Copy link

Thank you for this advice. I shall do as you suggest.

@WillAyd WillAyd added this to the Contributions Welcome milestone Jan 21, 2019
@ahawryluk
Copy link
Contributor

I'd be interested in tackling the original issue by adding a new argument to ExcelFile and read_excel: ignore_hidden=False. I sometimes have to work with a large collection of spreadsheets with hidden sheets, so it's a feature I would absolutely use. (I'm already checking for hidden xlrd/openpyxl sheets to save time in my particular use case.)

Do you think this would be worth adding yet another argument? If so, is that a good name?

@ahawryluk
Copy link
Contributor

@rhshadrach curious for your opinion on this one. Would it be worth adding a default argument to pd.read_excel ignore_hidden=False? The benefit is more control over which sheets are loaded and (at least in my own case) no time wasted on unwanted sheets. The cost is another argument and a bit more code. The xlrd and openpyxl engines would get just a couple of lines extra, while the ods engine might be a little more involved. The xlsb engine wouldn't support this option without changes upstream.

@rhshadrach
Copy link
Member

@ahawryluk - I believe one can perform introspection into the contents of an excel workbook using a third party engine (e.g. openpyxl) to determine the sheets and which are hidden. Doing so, a user can specify which sheets they want to read using the existing API. If this is correct, then it seems to me it'd be more appropriate for the cookbook, maintaining a separation of what is in pandas vs third party excel-engine libraries. In particular, the line I think this is crossing is an introspection into sheets within a workbook and their state when determining what to read.

@ahawryluk
Copy link
Contributor

ahawryluk commented Aug 3, 2021

@rhshadrach yes, at least for .xls and .xlsx/.xlsm files, checking for hidden sheets is really easy ahead of time:

wb = xlrd.open_workbook('myfile.xls')
sheets = [sheet.name for sheet in wb.sheets() if sheet.visibility == 0]
dfs = pd.read_excel(wb, sheets)

wb = openpyxl.open('myfile.xlsx', read_only=True, data_only=True)
sheets = [sheet.title for sheet in wb if sheet.sheet_state == 'visible']
dfs = pd.read_excel(wb, sheets)
wb.close()

I like your cookbook suggestion; I'll make a very small PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants