
Reading a large file increases execution time & memory (e.g. a file with 500,000 records) #629

Closed
kumarkal opened this issue Aug 7, 2018 · 34 comments
Labels
enhancement reader/xlsx Reader for MS OfficeOpenXML-format (xlsx) spreadsheet files stale

Comments

@kumarkal

kumarkal commented Aug 7, 2018

This is: a feature request

- [ ] a bug report
- [x] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

  • To read large files quickly

What is the current behavior?

  • Reading takes much more time (and memory) as the file grows

What are the steps to reproduce?

  • Load a CSV file of 500,000 records and split it into smaller sets of about 100,000 records each using chunked reads (5 files in total).
  • As the file size decreases, performance improves.

Please provide a Minimal, Complete, and Verifiable example of code that exhibits the issue without relying on an external Excel file or a web server:

<?php

require __DIR__ . '/vendor/autoload.php';

// Create new Spreadsheet object
$spreadsheet = new \PhpOffice\PhpSpreadsheet\Spreadsheet();

// add code that show the issue here...

Which versions of PhpSpreadsheet and PHP are affected?

@bferrantino

I would like to add a bit of clarity to this as well, since we are working together on this project.

We are processing rather large files, upwards of 300k lines. In order to scale this and not load the whole file into memory at once, we are using the chunkFilter functionality as laid out under "Reading Only Specific Columns and Rows from a File (Read Filters)" here: https://phpspreadsheet.readthedocs.io/en/develop/topics/reading-files/#loading-a-spreadsheet-file

We did some baseline testing of our script and noticed that once we pass a threshold of about 250k records, the $reader->load($inputFileName); portion of the code takes longer as the file gets larger. No matter how large a chunk we use, each time it loads the file it takes about 20-30 seconds. Compare this to a file of 100k records, which only takes about 10 seconds on average to read each chunk. This number seems to increase/decrease significantly as we use a larger/smaller file, respectively.

What we're basically looking to determine here is how we can proceed. Is there a known limitation with reading files of a certain size? Are there any steps we can take to make the reader perform better?

Below is the code we're using for our Chunk IReadFilter:

<?php

use PhpOffice\PhpSpreadsheet\Reader\IReadFilter;

/**  Define a Read Filter class implementing IReadFilter  */
class Chunk implements IReadFilter
{
    private $startRow = 0;

    private $endRow = 0;

    /**
     * Set the list of rows that we want to read.
     *
     * @param mixed $startRow
     * @param mixed $chunkSize
     */
    public function setRows($startRow, $chunkSize)
    {
        $this->startRow = $startRow;
        $this->endRow = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '')
    {
        //  Only read the heading row, and the rows configured via $this->startRow and $this->endRow
        if (($row == 1) || ($row >= $this->startRow && $row < $this->endRow)) {
            return true;
        }

        return false;
    }
}

?>

As well as a trimmed down version of the code we're using to chunk the file:

		// Create a new Reader of the type defined in $inputFileType
		$reader = IOFactory::createReader($inputFileType);

		// Define how many rows we want to read for each "chunk"
		$chunkSize = 10000;
		// Create a new Instance of our Read Filter
		$chunkFilter = new Chunk();

		// Tell the Reader that we want to use the Read Filter that we've Instantiated
		$reader->setReadFilter($chunkFilter);
		// Loop to read our worksheet in "chunk size" blocks
		for ($startRow = 1; $startRow <= $rawRows; $startRow += $chunkSize) {
			// Tell the Read Filter, the limits on which rows we want to read this iteration
			$chunkFilter->setRows($startRow, $chunkSize);
			// Load only the rows that match our filter from $inputFileName to a PhpSpreadsheet Object
			$spreadsheet = $reader->load($inputFileName);
		        ...
		}

Thank you in advance for your assistance.

Brian F (& Hemanth K, the original poster)

@dkarlovi
Contributor

I'm currently also looking into why it takes so much memory to read a file.

What I suggest is to install Xdebug 2.6, which has memory usage profiling. I've already found some problematic areas in my own code which reduced the memory usage somewhat, but it's still way too high.

@bferrantino

Looking for some more guidance on this from the devs. Maybe they know something directly related that can help us resolve, or at least improve, our situation.

@pop-mihai

Tried using filesystem caching of cells... a 411K-row file still consumes over 1 GB of RAM... chunks or no chunks, the behavior is roughly the same. What are we missing?

@zarte

zarte commented Sep 17, 2018

0.0

@pop-mihai

We have been playing with this library for a few weeks now, benchmarking and profiling various read and write methods. We have hit several bottlenecks and managed to escape most of them, except one: READING LARGE XLSX FILES!

Before we discuss the XLSX problem in particular, I would like to share a few more things:

  1. Following the examples provided here: https://phpspreadsheet.readthedocs.io/en/develop/topics/memory_saving/ - we tried using a simple disk cache... and immediately failed because of the inode limit. By default each Excel cell would be saved as ONE file, eventually exhausting the server's inode limit. So we wrote our own adapter that stores several rows per file instead. This has significantly improved performance and memory usage.

  2. Following the examples provided here: https://phpspreadsheet.readthedocs.io/en/develop/topics/reading-files/ - we also made use of the readFilter to read large files in smaller chunks. This has also improved memory usage.

Obviously, both methods reduce memory usage but also increase processing time considerably.
By using a mix of both these changes, we managed to get a good overall result for everything except large XLSX files.

Here is our sample file: https://drive.google.com/file/d/1DbKD28u46BI761YjdsVelvjFux6YZJl1/view?usp=sharing

Whenever load() is called on this file, 8 GB of RAM is exhausted almost immediately regardless of chunk size... and it seems to want even more: after exhausting all available RAM it started using swap.

  • PHP 7.2 environment
  • a custom value binder that skips formula handling (not applicable in our use cases).
  • we use setReadDataOnly(true) & setReadEmptyCells(false); see the snippet after this list.
  • dump to disk cache every 500 rows.
  • reading in chunks of 4000 rows.
  • garbage collecting at end of chunks.
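
For reference, a minimal sketch of how those reader flags are set (the 'Xlsx' type is just an example):

$reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader('Xlsx');
$reader->setReadDataOnly(true);     // read cell values only, ignore formatting
$reader->setReadEmptyCells(false);  // do not create Cell objects for empty cells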

For now we ended up making use of listWorksheetInfo to detect the total number of rows and columns, and we abandon the whole process if that number is too big. The listWorksheetInfo method seems to handle the file quite well, but once you try to load the file, everything blows up.
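
A rough sketch of that guard (the threshold and the $inputFileName variable are placeholders):

$reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader('Xlsx');
$info = $reader->listWorksheetInfo($inputFileName);

// listWorksheetInfo() only scans the sheet dimensions; it does not build Cell objects.
$totalCells = $info[0]['totalRows'] * $info[0]['totalColumns'];
if ($totalCells > 2000000) {
    throw new RuntimeException('Spreadsheet too large to process safely.');
}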

In total we have: 161 columns * 100K rows = 16.1 million cells.
We get it, it's a big number... but after converting this file to CSV we are able to process the entire thing without any problems and without ever going above 500 MB.

After converting to XLS (which cannot go beyond 65536 rows) we are also able to process the 65K rows without exceeding 500 MB of RAM.

Any help on this matter would be highly appreciated.

@PowerKiKi
Member

Thanks @bferrantino and @pop-mihai for those detailed reports. Unfortunately I cannot give you a quick and easy solution, because I don't know PhpSpreadsheet's ugly details well enough, but we can attempt a few things.

First off, @pop-mihai, you mentioned that CSV is successful with 16 million cells. The first thing that comes to mind is that CSV does not have any styling at all. Have you tried commenting out all style-related code in the XLSX reader? Does it make any difference?

Were you able to do memory profiling with Xdebug as @dkarlovi suggested? Were you able to pinpoint something specific that consumes an especially large amount of memory?

There is #648 that might help you. We need somebody to confirm the memory improvement and create a proper PR so it can be merged. Would one of you be able to help?

@PowerKiKi PowerKiKi added enhancement reader/xlsx Reader for MS OfficeOpenXML-format (xlsx) spreadsheet files labels Oct 7, 2018
@pop-mihai

Hello @PowerKiKi, thank you for the input.
I looked over PR #648 but it seemed outdated compared to the latest PhpSpreadsheet code.
I also tried all of PhpSpreadsheet's configurable options; none of them make any difference when it comes to XLSX, we still run out of resources.

We did some memory profiling, yes, but at this volume it is somewhat hard to point fingers. The biggest memory consumer reported by Xdebug is the preg_match within the static coordinateFromString method, which is weird.

We also looked at other libraries, including box/spout, which is very promising due to its streaming mechanism, but it suffers from the same problem (memory leaks with XLSX), though not at this scale.

  • CSV does just fine with 2 MB of RAM!
  • XLSX ate up to ~500 MB of RAM (still better than 8 GB)
  • They offer no support for XLS, nor do they plan to ever offer such support.

We believe there is a memory leak in the bundled libxml version. With PHP 7.2, the compiled libxml version is 2.9.1, while on xmlsoft.org the latest version is 2.9.7, released Nov 02 2017. If you search the changelog between the two versions for the keyword "memory", you'll see over 25 fixes!

We will attempt to compile our PHP instance with the latest libxml version and see if this makes a notable difference. Then we'll have to choose the best tool for each need (CSV, XLS, XLSX).
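
For reference, a quick way to check which libxml version a PHP build uses:

<?php
// LIBXML_DOTTED_VERSION is defined by PHP's libxml extension, e.g. "2.9.1".
echo 'libxml version: ' . LIBXML_DOTTED_VERSION . PHP_EOL;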

@dkarlovi
Contributor

dkarlovi commented Oct 8, 2018

The biggest memory consumer reported by Xdebug is the preg_match within the static coordinateFromString method, which is weird.

This is not weird at all; that code runs for each cell as it's being read. I had the same problem.

@PowerKiKi
Member

If that preg_match can indeed be confirmed to be a bottleneck, it could perhaps be replaced by a very simple parsing process instead. Since the pattern is actually quite simple, the parsing code would be rather straightforward and could maybe save time and/or memory.

Anyway, as you know, PhpSpreadsheet is not known for its speed or low memory consumption, so every effort to improve the situation is very welcome. @pop-mihai, you seem to be in a good spot for that, with a strong use case and good knowledge of the overall issues. Don't hesitate to experiment with PhpSpreadsheet's own code and see if it could be merged back in a second step...
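
As an illustration only (not PhpSpreadsheet's actual code), such a hand-written parser could look roughly like this; absolute references ("$A$1") and the full validation that Coordinate::coordinateFromString() performs are deliberately left out, and later comments in this thread question whether this saves memory at all:

// Split a plain coordinate such as "AB12" into ['AB', '12'] without a regular expression.
function coordinateFromStringNoRegex(string $coordinate): array
{
    $i = 0;
    $length = strlen($coordinate);
    while ($i < $length && ctype_alpha($coordinate[$i])) {
        ++$i;
    }

    $column = substr($coordinate, 0, $i);
    $row = substr($coordinate, $i);

    if ($column === '' || $row === '' || !ctype_digit($row)) {
        throw new InvalidArgumentException('Invalid cell coordinate: ' . $coordinate);
    }

    return [$column, $row];
}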

@dkarlovi
Contributor

dkarlovi commented Oct 8, 2018

@PowerKiKi I actually started working on this a bit for my use case: I've switched from getCell() to getCellByColumnAndRow(), passing createCell = false for missing cells.

See sigwinhq/xezilaires@e69afc4#diff-e11287c474a458a580cd0b18582ae5a8R236
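
For illustration, that call looks roughly like this ($worksheet, $columnIndex and $row are placeholders; the exact signature of getCellByColumnAndRow() differs between PhpSpreadsheet versions, and newer releases deprecate it in favour of getCell()):

// With createCell = false the worksheet returns null for cells that do not exist,
// instead of instantiating an empty Cell object for each of them.
$cell = $worksheet->getCellByColumnAndRow($columnIndex, $row, false);
$value = $cell !== null ? $cell->getValue() : null;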

@ryzr

ryzr commented Oct 17, 2018

Does your spreadsheet have many calculated cells? I can import a sheet with around ~200k cells within a few seconds; however, a sheet with ~600k cells where many of them are calculated times out after 5 minutes. Here are the major culprits in my profile. Note that it did time out, so it probably could have gotten worse?

EDIT: sorry, not thinking straight. It timed out because of Xdebug - but my point was that there's definitely a high cost if your sheet has a lot of calculated cells.

castToFormula - [Xdebug profiler screenshot]

getCell - [Xdebug profiler screenshot]

coordinateFromString - [Xdebug profiler screenshot]

@stale

stale bot commented Dec 16, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If this is still an issue for you, please try to help by debugging it further and sharing your results.
Thank you for your contributions.

@stale stale bot added the stale label Dec 16, 2018
@matt-allan

(quoting @pop-mihai's note above about storing several rows per cache file instead of one file per cell)

This worked for me too. By storing all of the cells in a single cache file and keeping an in-memory index, only 25% of the memory is used and it isn't noticeably slower than keeping everything in memory (it's actually sometimes faster; I'm not sure why).

The length of the generated cache prefix doesn't help memory usage either, since it ends up requiring > 39 bytes per cache key.

@matt-allan

matt-allan commented Dec 19, 2018

If that preg_match can indeed be confirmed to be a bottleneck, it could perhaps be replaced by a very simple parsing process instead

After spending some time profiling this, I'm fairly certain it's not the preg_match itself. I'm not sure why, but profilers seem to blame coordinateFromString for the allocation of the coordinate string even though it only receives it as a parameter and the method doesn't store it anywhere.

If you look at the profile linked in #823 it shows the memory usage of coordinateFromString going down by 29%, even though nothing changed in that method or in how its returned value is used.

I tried removing the call to coordinateFromString in createNewCell entirely (the Xlsx reader already calls coordinateFromString, so you can pass the parsed row/column as arguments instead of calling the method a second time) and that didn't help the overall memory usage at all. While the memory usage of that method goes down, the memory usage of Cell::__construct, Cells::add, and Cells::storeCurrentCell goes up by the same amount. You can view the comparison here.

If you rewrite coordinateFromString with a simple hand-written parser, the memory usage does not noticeably change. If you inline the parsing logic into Worksheet::createNewCell you will see the same thing I observed when I removed the call entirely; the memory allocation just moves around.

A good percentage of the memory usage being blamed on coordinateFromString is from the cache keys (prefix + coordinate). If you remove the PSR cache entirely and use a simple array instead, the memory usage of coordinateFromString drops 39% and the overall usage drops 50%. I'm guessing the remaining 61% is from the coordinates themselves and the rows/columns, which we can't really do anything about.

@ArmanKoke

@ryzr, how did you track the time in Xdebug as shown in the screenshots? Can you please show me how?

(quoting @ryzr's comment and profiler screenshots above)

Also, about the issue: how can I speed up reading Excel files? It takes minutes to open a 4 MB file with no images in it.

@franksl

franksl commented Mar 7, 2019

Hi,
I often have a similar problem when opening an XLSX document created by Google Drive (exporting a Google spreadsheet to XLSX).
Those documents are not that large, but I noticed that it happens especially with documents that have many empty cells that still carry some leftover formatting. I can usually avoid the problem by deleting empty rows and columns.
There is a specific point where it fails:
PhpSpreadsheet/Collection/Memory.php on line 66

Hope this helps in finding the problem,
Thanks,
Frank

@stale

stale bot commented May 6, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If this is still an issue for you, please try to help by debugging it further and sharing your results.
Thank you for your contributions.

@stale stale bot added the stale label May 6, 2019
@stale stale bot closed this as completed May 13, 2019
@danielbachhuber

I'm finding https://github.com/box/spout to be a good way for reading XLSX files in a performant manner.
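
For anyone curious, a rough usage sketch with box/spout 3.x (the file path is a placeholder; check the library's docs for the exact API of the version you install):

<?php

use Box\Spout\Reader\Common\Creator\ReaderEntityFactory;

require __DIR__ . '/vendor/autoload.php';

$inputFileName = '/path/to/large.xlsx';

// Spout streams the file row by row instead of building a full cell collection in memory.
$reader = ReaderEntityFactory::createReaderFromFile($inputFileName);
$reader->open($inputFileName);

foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row) {
        $cells = $row->toArray(); // plain array of the row's cell values
        // ... process $cells ...
    }
}

$reader->close();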

@pop-mihai

@danielbachhuber - yes, box/spout is very efficient due to its streaming mechanism and use of generators, but it only works for XLSX and CSV.

For XLS we couldn't find any alternative (or at least none with a feature list as comprehensive as what we are looking for), so we continue to use PhpSpreadsheet in those cases. There we managed to lower the memory footprint only by reading in smaller chunks. All attempts to use the cell cache, force freeing of memory, etc. have either failed or just moved the problem around.

@Sharat-Ojha

Sharat-Ojha commented Nov 8, 2019

Hi,

I am still new to this, but I tried out a solution which helped us here:
We can read the spreadsheet in chunks as mentioned in the comments above, but to save memory we can create the reader inside the loop and release it at the end of the loop, as shown below:

    // Define how many rows we want to read for each "chunk"
    $chunkSize = 1000;		
    // Loop to read our worksheet in "chunk size" blocks
    for ($startRow = 1; $startRow <= $rawRows; $startRow += $chunkSize) {
        // Create a new Reader of the type defined in $inputFileType
        $reader = IOFactory::createReader($inputFileType);

        // Create a new Instance of our Read Filter
        $chunkFilter = new Chunk();

        // Tell the Reader that we want to use the Read Filter that we've Instantiated
        $reader->setReadFilter($chunkFilter);

        // Tell the Read Filter, the limits on which rows we want to read this iteration
        $chunkFilter->setRows($startRow, $chunkSize);
        // Load only the rows that match our filter from $inputFileName to a PhpSpreadsheet Object
        $spreadsheet = $reader->load($inputFileName);
        .....
        // process the file
        .....

        // then release the memory
        $spreadsheet->__destruct();
        $spreadsheet = null;
        unset($spreadsheet);
    
        $reader->__destruct();
        $reader = null;
        unset($reader);
    }

This lets large sheets use only one chunk's worth of memory at a time and never exceed the memory limit.
Please let me know if this is helpful.

@FreeWebStyler

FreeWebStyler commented Dec 5, 2019

@Sharat-Ojha

$spreadsheet->__destruct();

helped me, thanks!

@twittobal

(quoting @Sharat-Ojha's chunked-read example above)

What about $rawRows? How can I get it?

@wucdbm

wucdbm commented May 13, 2020

(quoting @Sharat-Ojha's chunked-read example and @twittobal's question above)

What about $rawRows? How can I get it?

$worksheetData = $reader->listWorksheetInfo($path);
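
To spell that out a little (a sketch; index 0 assumes the data is on the first worksheet):

$worksheetData = $reader->listWorksheetInfo($path);
$rawRows = $worksheetData[0]['totalRows'];          // total rows in the first sheet
$totalColumns = $worksheetData[0]['totalColumns'];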

@cuongngoz

(quoting @ryzr's comment and profiler screenshots above)

Hi there, how can I see this info like in those screenshots? Thanks.

@shubhamt619

@cuongngoz The screenshots are from Xdebug. Depending on your OS + editor, you can simply google "Setting up Xdebug on <your OS> with <your editor>".

Here is a sample link

https://www.codewall.co.uk/debug-php-in-vscode-with-xdebug/

@cuongngoz

(quoting @shubhamt619's reply above)

Thanks @shubhamt619, I have already set up Xdebug (my editor is PhpStorm, actually), but I just don't know how to see the timing analysis as in your screenshots. Looks useful, by the way.

@edeiller-gfi

(quoting @Sharat-Ojha's chunked-read example and @twittobal's question above)

$rawRows: what does this correspond to?

@RLOFLS

RLOFLS commented Mar 3, 2022

(quoting @Sharat-Ojha's chunked-read example above)

Thank you for solving my problem.
In addition, in my program I also need to call gc_collect_cycles(); the occupied memory is then reclaimed immediately:

        ...
         // then release the memory
         ....
         unset($reader);

        gc_collect_cycles(); 
        ...

@edeiller-gfi

// Create a new Reader of the type defined in $inputFileType
$reader = IOFactory::createReader($inputFileType);

    // Create a new Instance of our Read Filter
    $chunkFilter = new Chunk();

Why recreate a new $reader and $chunkFilter in each loop iteration?
Does it increase performance and reduce memory problems?

@RLOFLS

RLOFLS commented Mar 3, 2022

(quoting @edeiller-gfi's question above)

@edeiller-gfi I tested my program; the reader and filter can indeed be created outside the loop:

...
        $reader = IOFactory::createReader(ucfirst($this->fileExt));
        //Create a new Instance of our Read Filter
        $chunkFilter = new ChunkReadFilter();
        // Tell the Reader that we want to use the Read Filter that we've Instantiated
        $reader->setReadFilter($chunkFilter);

        for ($startRow = 1; $startRow <= $this->limitMaxRow; $startRow += $this->chunkSize) {
            var_dump("{$startRow}-row-start:" . memory_get_usage());
            
            $chunkFilter->setRows($startRow, $this->chunkSize, range('A', $this->endColumn));
            $spreadsheet = $reader->load($this->filePath);
            $workSheet = $spreadsheet->getActiveSheet();
             ...
    
            var_dump("{$startRow}-row-beforeClear:" . memory_get_usage());

            // then release the memory
            $spreadsheet->__destruct();
            $workSheet = null;
            $spreadsheet = null;
            //$reader = null;
            gc_collect_cycles();

            var_dump("{$startRow}-row-end:" . memory_get_usage());
            ...
       }
      ...

output:

string(20) "1-row-start:24360120"
string(26) "1-row-beforeClear:32823784"
string(18) "1-row-end:29116552"
string(22) "101-row-start:29116552"
string(28) "101-row-beforeClear:33353824"
string(20) "101-row-end:29116560"
string(22) "201-row-start:29116560"
string(28) "201-row-beforeClear:33068592"
string(20) "201-row-end:29116560"

@damienalexandre

In my opinion this issue should be re-opened.

  • https://github.com/box/spout is archived / does not exist anymore
  • with a 2,300-row XLSX:
    • on 1.29.0 it took 3 minutes and 35 MB
    • on 2.2.2 it takes more than 18 minutes and memory explodes past 130 MB
  • Rewriting my script to use https://github.com/shuchkin/simplexlsx for reading got 1.5 minutes and 11 MB; the comparison is not fair as I also removed the calls to "getCells", so don't read too much into it.

What needs to be addressed is the huge speed regression between 1.29 and 2.2.

I will still use PhpSpreadsheet for writing, but reading needs to be faster to be usable in production. The getCell call is definitely the culprit in that situation.

@franksl

franksl commented Sep 18, 2024

If you are looking for a different way to read documents, the spout project is now here: https://github.com/openspout/openspout

@oleibman
Collaborator

PR #4153 has made some significant speed improvements. If you can test against master, please do so. If not, I expect it to be part of a formal release within the next few weeks. There are also some useful suggestions about speeding things up in the discussion of that PR.
