
read_excel optimize nrows #32727

Closed
Zoynels opened this issue Mar 15, 2020 · 4 comments · Fixed by #35974 or #46894
Assignees
Labels
IO Excel (read_excel, to_excel) · Performance (memory or execution speed)
Milestone

Comments

@Zoynels

Zoynels commented Mar 15, 2020

Code Sample

pd.read_excel(fname, nrows=10)

Problem description

Pandas has an option to read only the first several rows of an Excel file.
But currently it always reads all rows and only afterwards cuts off the unwanted part.
For example, a file may have 100 columns and 50k rows, but a test needs only the first 10 rows.
Today pandas reads all 50k rows into a list, which wastes memory and takes far too long.


Expected Output

A better solution would be to read only the rows needed for the operation.

As I understand it, something like the following changes are needed:

pandas/io/excel/_base.py

    @abc.abstractmethod
    def get_sheet_data(self, sheet, convert_float, header, skiprows, nrows):
        pass

pandas/io/excel/_base.py

    data = self.get_sheet_data(sheet, convert_float, header, skiprows, nrows)

and in files _openpyxl.py, _odfreader.py, _xlrd.py
there should be something like

    def get_sheet_data(
        self, sheet, convert_float: bool, header: int, skiprows: int, nrows: int
    ) -> List[List[Scalar]]:
        data: List[List[Scalar]] = []
        skiprows = 0 if skiprows is None else skiprows
        header = 0 if header is None else header

        for row in sheet.rows:
            if nrows is not None:
                if header > 0:
                    # skip lines before the header (append placeholders)
                    header -= 1
                    data.append(["", ""])
                    continue

                if skiprows > 0:
                    # skip `skiprows` rows after the header
                    skiprows -= 1
                    data.append(["", ""])
                    continue

                if nrows >= 0:
                    # still within the requested row count
                    nrows -= 1
                else:
                    # all requested rows read: stop instead of scanning the whole sheet
                    break
            data.append([self._convert_cell(cell, convert_float) for cell in row])

        return data

With these changes, read_excel with engine='openpyxl' takes only 5 seconds instead of the current 50. And if the file contained 1,000,000 rows, it would still take around 5 seconds, whereas the current version would take tens of minutes.
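The early-stop idea can be exercised in isolation with a small, dependency-free sketch (plain Python lists stand in for openpyxl's `sheet.rows`; `read_limited` and its exact counting are illustrative, not pandas' actual implementation — here `nrows` counts data rows exactly, slightly simpler than the snippet above):

```python
from typing import List, Optional

def read_limited(rows, header: Optional[int], skiprows: Optional[int],
                 nrows: Optional[int]) -> List[list]:
    """Stop iterating once header + skiprows + nrows rows have been consumed."""
    data: List[list] = []
    header = 0 if header is None else header
    skiprows = 0 if skiprows is None else skiprows
    for row in rows:
        if nrows is not None:
            if header > 0:
                # rows before the header are still appended as-is
                header -= 1
                data.append(row)
                continue
            if skiprows > 0:
                # rows skipped after the header
                skiprows -= 1
                data.append(row)
                continue
            if nrows > 0:
                nrows -= 1
            else:
                # stop early instead of reading the rest of the sheet
                break
        data.append(row)
    return data

# 50k-row "sheet", but only the first 10 data rows are needed
sheet = [[i] for i in range(50_000)]
subset = read_limited(sheet, header=0, skiprows=0, nrows=10)
print(len(subset))  # → 10
```

The iteration exits as soon as the requested rows are collected, so the cost scales with `nrows` rather than with the sheet size — which is the whole point of the proposal.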

@MarcoGorelli MarcoGorelli added the Performance Memory or execution speed performance label Mar 15, 2020
@mproszewska
Contributor

take

@jreback
Contributor

jreback commented Sep 22, 2020

reverted

@jreback jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020
@MarcoGorelli MarcoGorelli removed their assignment Dec 20, 2020
@mroeschke mroeschke added the IO Excel read_excel, to_excel label Jul 30, 2021
@LiewShanWei

take

@LiewShanWei LiewShanWei removed their assignment Feb 14, 2022
@ahawryluk
Contributor

take

@jreback jreback modified the milestones: Contributions Welcome, 1.5 Apr 29, 2022