
read_excel optimize nrows #32727

Closed
Zoynels opened this issue Mar 15, 2020 · 4 comments · Fixed by #35974 or #46894
Assignees
Labels
IO Excel (read_excel, to_excel) · Performance (memory or execution speed)
Milestone

Comments

@Zoynels

Zoynels commented Mar 15, 2020

Code Sample

pd.read_excel(fname, nrows=10)

Problem description

Pandas has an option to read only the first several rows of an Excel file.
But currently it always reads all rows and only afterwards cuts off the unwanted part.
For example, a file may have 100 columns and 50k rows, but a test needs only the first 10 rows.
Today pandas reads all 50k rows into a list, which wastes memory and takes far too long.


Expected Output

A better solution would be to read only the rows needed for the operation.

As I understand it, something like the following changes are needed:

pandas/io/excel/_base.py

    @abc.abstractmethod
    def get_sheet_data(self, sheet, convert_float, header, skiprows, nrows):
        pass

pandas/io/excel/_base.py

    data = self.get_sheet_data(sheet, convert_float, header, skiprows, nrows)

and in files _openpyxl.py, _odfreader.py, _xlrd.py
there should be something like

    def get_sheet_data(
        self, sheet, convert_float: bool, header: int, skiprows: int, nrows: int
    ) -> List[List[Scalar]]:
        data: List[List[Scalar]] = []
        skiprows = 0 if skiprows is None else skiprows
        header = 0 if header is None else header

        for row in sheet.rows:
            if nrows is not None:
                if header > 0:
                    # skip lines before the header (append placeholders)
                    header -= 1
                    data.append(["", ""])
                    continue

                if skiprows > 0:
                    # skip `skiprows` rows after the header
                    skiprows -= 1
                    data.append(["", ""])
                    continue

                if nrows >= 0:
                    # still within the requested row count
                    nrows -= 1
                else:
                    # all requested rows read: stop instead of scanning the whole sheet
                    break
            data.append([self._convert_cell(cell, convert_float) for cell in row])

        return data

With these changes, read_excel with engine='openpyxl' takes only 5 seconds instead of the current 50. And if the file contained 1,000,000 rows, it would still take around 5 seconds, whereas the current version would take tens of minutes.
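The early-stop idea can be exercised in isolation with a small, dependency-free sketch (plain Python lists stand in for openpyxl's `sheet.rows`; `read_limited` and its exact counting are illustrative, not pandas' actual implementation — here `nrows` counts data rows exactly, slightly simpler than the snippet above):

```python
from typing import List, Optional

def read_limited(rows, header: Optional[int], skiprows: Optional[int],
                 nrows: Optional[int]) -> List[list]:
    """Stop iterating once header + skiprows + nrows rows have been consumed."""
    data: List[list] = []
    header = 0 if header is None else header
    skiprows = 0 if skiprows is None else skiprows
    for row in rows:
        if nrows is not None:
            if header > 0:
                # rows before the header are still appended as-is
                header -= 1
                data.append(row)
                continue
            if skiprows > 0:
                # rows skipped after the header
                skiprows -= 1
                data.append(row)
                continue
            if nrows > 0:
                nrows -= 1
            else:
                # stop early instead of reading the rest of the sheet
                break
        data.append(row)
    return data

# 50k-row "sheet", but only the first 10 data rows are needed
sheet = [[i] for i in range(50_000)]
subset = read_limited(sheet, header=0, skiprows=0, nrows=10)
print(len(subset))  # → 10
```

The iteration exits as soon as the requested rows are collected, so the cost scales with `nrows` rather than with the sheet size — which is the whole point of the proposal.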

@MarcoGorelli MarcoGorelli added the Performance Memory or execution speed performance label Mar 15, 2020
@mproszewska
Contributor

take

@jreback
Contributor

jreback commented Sep 22, 2020

reverted

@jreback jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020
@MarcoGorelli MarcoGorelli removed their assignment Dec 20, 2020
@mroeschke mroeschke added the IO Excel read_excel, to_excel label Jul 30, 2021
@LiewShanWei

take

@LiewShanWei LiewShanWei removed their assignment Feb 14, 2022
@ahawryluk
Contributor

take

@jreback jreback modified the milestones: Contributions Welcome, 1.5 Apr 29, 2022