Improve efficiency of the dependency graph (56% speed-up) #1293

sequba · 2023-07-26T09:34:14Z

Test case

A spreadsheet with 500k rows and 10 columns filled with string data. Total of 5M data cells, no formulas.

Script:

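import { AlwaysDense, HyperFormula } from 'hyperformula'

// Start from an empty engine; the row limit must exceed the 500k test rows.
const hf = HyperFormula.buildFromArray([], {
  licenseKey: 'gpl-v3',
  maxRows: 500100,
  useStats: true,
  chooseAddressMappingPolicy: new AlwaysDense(),
})

// 500k rows x 10 columns of plain string data (no formulas).
const data = Array(500000).fill(0).map(() => Array(10).fill(0).map(() => 'A500000'))

// Measure how long loading the data takes, in milliseconds.
const ty1 = (new Date()).getTime()
hf.setSheetContent(0, data)
const ty2 = (new Date()).getTime()
console.log(ty2 - ty1)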

Ideas for improvement

The function getTopSortedWithSccSubgraphFrom is an iterative implementation of Tarjan's algorithm: it topologically sorts the dependency graph and finds its cycles (SCCs). In this test case the dependency graph is trivial, containing only isolated nodes and no edges, so it seems the sort could be done more efficiently.
The function parseDateTimeFromConfigFormats tries to parse every string as a date/time value. This test case contains no date/time values, so it might be possible to save time by detecting that quickly and skipping the heavy parsing operations.
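
The second idea can be illustrated with a minimal sketch. This is not the code from this PR: the names `looksLikeDateTime` and `parseDateTimeOrSkip`, and the accepted character set, are assumptions made for illustration. A real check would have to follow the configured date/time formats (for example, this one rejects month names).

```typescript
// Hypothetical quick check: a candidate date/time string may contain only
// digits, common separators, spaces, and an optional trailing am/pm marker.
const DATE_TIME_CANDIDATE = /^[0-9\/.\-: ]+([ap]m)?$/i

function looksLikeDateTime(text: string): boolean {
  return DATE_TIME_CANDIDATE.test(text.trim())
}

// Runs the expensive parser only when the cheap check passes.
// `heavyParse` stands in for whatever full parser is in use.
function parseDateTimeOrSkip<T>(
  text: string,
  heavyParse: (s: string) => T | undefined
): T | undefined {
  return looksLikeDateTime(text) ? heavyParse(text) : undefined
}
```

With the test data above, every cell is 'A500000', so the regular expression fails on the first character and the heavy parser is never called.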

This PR focuses on optimizing topological sorting of the dependency graph

I made the Tarjan algorithm more efficient by changing the data structures it uses. Initially, the information about the graph nodes was stored in maps and sets with nodes as keys:

    const entranceTime: Map<T, number> = new Map()
    const low: Map<T, number> = new Map()
    const parent: Map<T, T> = new Map()
    const inSCC: Set<T> = new Set()

My approach was to use simple arrays indexed by integer ids and keep a single array of nodes as a mapping from id to node data.

    private nodes: T[] = []

    private entranceTime: number[] = []
    private low: number[] = []
    private parent: number[] = []
    private inSCC: boolean[] = []
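
To show why id-indexed arrays fit this algorithm, here is a self-contained sketch of an iterative Tarjan SCC search over integer ids. It is not the code from Graph.ts or TopSort.ts: the function name `tarjanScc`, the adjacency-list input, and the `onStack` and `nextEdge` arrays are assumptions made for illustration. Node payloads would live in a separate `nodes: T[]` array, as described above.

```typescript
// Hypothetical sketch: adjacency[v] lists the ids that node v points to.
// Returns strongly connected components, sinks first (reverse topological order).
function tarjanScc(adjacency: number[][]): number[][] {
  const n = adjacency.length
  const UNVISITED = -1
  const entranceTime = new Array<number>(n).fill(UNVISITED)
  const low = new Array<number>(n).fill(0)
  const onStack = new Array<boolean>(n).fill(false)
  const nextEdge = new Array<number>(n).fill(0) // next outgoing edge to explore per node
  const sccStack: number[] = []
  const components: number[][] = []
  let time = 0

  for (let root = 0; root < n; root++) {
    if (entranceTime[root] !== UNVISITED) continue
    const dfsStack: number[] = [root]
    entranceTime[root] = low[root] = time++
    sccStack.push(root)
    onStack[root] = true

    while (dfsStack.length > 0) {
      const v = dfsStack[dfsStack.length - 1]
      if (nextEdge[v] < adjacency[v].length) {
        const w = adjacency[v][nextEdge[v]++]
        if (entranceTime[w] === UNVISITED) {
          // Tree edge: descend into w.
          entranceTime[w] = low[w] = time++
          sccStack.push(w)
          onStack[w] = true
          dfsStack.push(w)
        } else if (onStack[w]) {
          // Back edge into the current component candidate.
          low[v] = Math.min(low[v], entranceTime[w])
        }
      } else {
        // All edges of v explored: propagate low to the parent, then close the SCC if v is its root.
        dfsStack.pop()
        if (dfsStack.length > 0) {
          const parent = dfsStack[dfsStack.length - 1]
          low[parent] = Math.min(low[parent], low[v])
        }
        if (low[v] === entranceTime[v]) {
          const component: number[] = []
          for (;;) {
            const w = sccStack.pop() as number
            onStack[w] = false
            component.push(w)
            if (w === v) break
          }
          components.push(component)
        }
      }
    }
  }
  return components
}
```

For the graph 0 -> 1, 1 -> 2, 2 -> 1 plus an isolated node 3, `tarjanScc([[1], [2], [1], []])` returns `[[2, 1], [0], [3]]`. Every per-node lookup is a plain array access by id, with no hashing of node objects, which is the kind of cost the Map- and Set-based version pays on each step.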

Results

Total time:
Before: 25923ms
After: 11160ms

Function getTopSortedWithSccSubgraphFrom:
Before: 51.6%
After: 23.1%

Profiler (Chrome DevTools):
Before: [screenshot not preserved]
After: [screenshot not preserved]

Overall, I achieved a 56.02% speed-up of HyperFormula for this use case.

How did you test your changes?

unit tests passed
new unit tests for the Graph class

Comparison on our existing performance benchmarks on my PC:

                                     testName | before |  after
-----------------------------------------------------------------------
                                      Sheet A | 417.77 | 401.12
                                      Sheet B | 133.09 | 129.18
                                      Sheet T | 119.23 | 114.51
                                Column ranges | 292.47 | 408.56   <-- 40% slow-down
Sheet A:  change value, add/remove row/column |     23 |   27.2
 Sheet B: change value, add/remove row/column |    169 |  194.0
                   Column ranges - add column |  136.8 |  137.4
                Column ranges - without batch |  339.2 |  316.8
                        Column ranges - batch |    124 |   80.0

Column ranges benchmark

The 40% slow-down is observed only in the Node environment. Running this benchmark in V8 yields similar results before and after the change.

Types of changes

Breaking change (a fix or a feature because of which an existing functionality doesn't work as expected anymore)
New feature or improvement (a non-breaking change that adds functionality)
Bug fix (a non-breaking change that fixes an issue)
Additional language file, or a change to an existing language file (translations)
Change to the documentation

Related issues:

Fixes #876

Checklist:

I have reviewed the guidelines about Contributing to HyperFormula and I confirm that my code follows the code style of this project.
I have signed the Contributor License Agreement.
My change is compliant with the OpenDocument standard.
My change is compatible with Microsoft Excel.
My change is compatible with Google Sheets.
I described my changes in the CHANGELOG.md file.
My changes require a documentation update.
My changes require a migration guide.

github-actions · 2023-07-26T09:36:23Z

Performance comparison of head (`69095fc`) vs base (`0e39178`)

                                     testName |   base |   head |  change
-------------------------------------------------------------------------
                                      Sheet A | 736.13 | 708.76 |  -3.72%
                                      Sheet B | 249.44 | 232.93 |  -6.62%
                                      Sheet T | 222.16 | 210.55 |  -5.23%
                                Column ranges | 567.02 | 815.93 | +43.90%
Sheet A:  change value, add/remove row/column |     36 |     48 | +33.33%
 Sheet B: change value, add/remove row/column |    344 |    385 | +11.92%
                   Column ranges - add column |    200 |    205 |  +2.50%
                Column ranges - without batch |    577 |    577 |   0.00%
                        Column ranges - batch |    166 |    156 |  -6.02%


budnix

Looks great! 🥇

I tested the changes using the demo provided in the PR's description, and here are my results:

On Chrome (browser), there is a 60% improvement (from ~25s to ~10s).
On Node (v20), there is a 58% improvement (from ~29s to ~12s).

sequba · 2023-09-07T12:11:17Z

@budnix Could you also run the Column ranges benchmark yourself? I'm curious if you'd confirm my results. Here's the code:

import { AlwaysDense, HyperFormula } from 'hyperformula';

function columnIndexToLabel(column) {
  let result = ''

  while (column >= 0) {
    result = String.fromCharCode((column % 26) + 97) + result
    column = Math.floor(column / 26) - 1
  }

  return result.toUpperCase()
}

function simpleCellAddressToString(address) {
  const column = columnIndexToLabel(address.col)
  return `${column}${address.row + 1}`
}

const cols = 50;
const data = [];
const firstRow = [1];

for (let i = 1; i < cols; ++i) {
  const adr = simpleCellAddressToString({sheet: 0, row: 0, col: i - 1});
  firstRow.push(`=${adr} + 1`);
}

data.push(firstRow);

for (let i = 1; i < cols; ++i) {
  const rowToPush = Array(i).fill(null);
  const startColumn = columnIndexToLabel(i - 1);

  for (let j = i - 1; j < cols - 1; ++j) {
    const endColumn = columnIndexToLabel(j);
    rowToPush.push(`=SUM(${startColumn}:${endColumn})`);
  }

  data.push(rowToPush);
}

const sheetId = 0;
const ty1 = (new Date()).getTime();

for (let i = 1; i < 200 ; ++i) {
  const hf = HyperFormula.buildFromArray([], {
    licenseKey: 'gpl-v3',
    maxRows: 500100,
    useStats: true,
    chooseAddressMappingPolicy: new AlwaysDense(),
  });

  hf.setSheetContent(sheetId, data);
}

const ty2 = (new Date()).getTime();
console.log(ty2 - ty1);
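
For orientation, the data generator above can be wrapped in a function and run at a small size to see what the benchmark sheet looks like. This is only a sketch: the helper name `buildColumnRangesData` and the `Cell` type are hypothetical, and the logic is copied from the script above with `cols` as a parameter.

```typescript
type Cell = number | string | null

// Same column labelling as in the benchmark: 0 -> A, 25 -> Z, 26 -> AA.
function columnLabel(column: number): string {
  let result = ''
  while (column >= 0) {
    result = String.fromCharCode((column % 26) + 97) + result
    column = Math.floor(column / 26) - 1
  }
  return result.toUpperCase()
}

// Hypothetical helper reproducing the benchmark data for a given column count.
function buildColumnRangesData(cols: number): Cell[][] {
  const data: Cell[][] = []
  const firstRow: Cell[] = [1]
  // First row: each cell adds 1 to its left neighbour.
  for (let i = 1; i < cols; ++i) {
    firstRow.push(`=${columnLabel(i - 1)}1 + 1`)
  }
  data.push(firstRow)
  // Following rows: sums over whole-column ranges of growing width.
  for (let i = 1; i < cols; ++i) {
    const row: Cell[] = new Array<Cell>(i).fill(null)
    const startColumn = columnLabel(i - 1)
    for (let j = i - 1; j < cols - 1; ++j) {
      row.push(`=SUM(${startColumn}:${columnLabel(j)})`)
    }
    data.push(row)
  }
  return data
}

console.log(JSON.stringify(buildColumnRangesData(3)))
// [[1,"=A1 + 1","=B1 + 1"],[null,"=SUM(A:A)","=SUM(A:B)"],[null,null,"=SUM(B:B)"]]
```

With `cols = 50`, as in the benchmark, the sheet is an upper triangle of whole-column SUM formulas, so the dependency graph is dense in column ranges rather than in plain cell nodes.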

budnix · 2023-09-07T12:27:01Z

> @budnix Could you also run the Column ranges benchmark yourself? I'm curious if you'd confirm my results.

Using your code, I observed a performance degradation. However, since this is a rare use case (hundreds of sheets created by buildFromArray), I think it's acceptable. I tested on Node 20: it took ~23s before the PR and ~25s after.

AMBudnik · 2023-09-19T08:56:10Z

Released in v2.6.0

izulin · 2023-09-20T07:21:38Z

awesome!

codecov · 2024-10-30T13:29:12Z

Codecov Report

Attention: Patch coverage is 99.71591% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.25%. Comparing base (0e39178) to head (69095fc).
Report is 79 commits behind head on develop.

Files with missing lines	Patch %	Lines
src/DependencyGraph/Graph.ts	99.31%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1293      +/-   ##
===========================================
+ Coverage    97.23%   97.25%   +0.01%     
===========================================
  Files          167      169       +2     
  Lines        14304    14408     +104     
  Branches      3065     3092      +27     
===========================================
+ Hits         13909    14012     +103     
- Misses         395      396       +1

Files with missing lines	Coverage Δ
src/BuildEngineFactory.ts	`100.00% <100.00%> (ø)`
src/DateTimeDefault.ts	`97.43% <100.00%> (ø)`
...c/DependencyGraph/AddressMapping/AddressMapping.ts	`88.00% <100.00%> (ø)`
src/DependencyGraph/DependencyGraph.ts	`97.94% <100.00%> (+0.02%)`	⬆️
src/DependencyGraph/ProcessableValue.ts	`100.00% <100.00%> (ø)`
src/DependencyGraph/TopSort.ts	`100.00% <100.00%> (ø)`
src/DependencyGraph/index.ts	`100.00% <100.00%> (ø)`
src/HyperFormula.ts	`99.59% <100.00%> (ø)`
src/Operations.ts	`99.25% <100.00%> (+<0.01%)`	⬆️
src/errors.ts	`100.00% <ø> (ø)`
... and 1 more

Refactor src/DateTimeDefault

69878c7

sequba added 8 commits July 26, 2023 11:43

Tmp changes to package.json

d38f047

Reduce the number of calls to adjacentNodes()

cfbc55a

Move topological sorting algorithm to separate class

4952638

Partition topsort function into smaller parts

f0f591c

Rewrite topological sorting to use ids and simple arrays instead of Maps and Sets of nodes - 15% speedup

7e9737b

Add a date/time parsing quick check to escape early if the string does not look like a date/time value

ca5cb70

Memoize parsing result of date and time formats for better performance

a1b8ffa

Improve documentation of the date/time parsing functions in DateTimeDefaults.ts

48a1d42

github-advanced-security bot found potential problems Aug 14, 2023

src/DateTimeDefault.ts: Fixed

sequba added 4 commits August 14, 2023 17:03

Fix regexp escape character

6f7625f

Remove unused method in Graph class

62d876a

Make Graph.nodes private

e6cc8ca

Fix removeEdge issue

0bc660d

github-advanced-security bot found potential problems Aug 25, 2023

src/DependencyGraph/Graph.ts: Fixed

sequba added 8 commits August 29, 2023 14:18

Merge branch 'develop' of github.com:handsontable/hyperformula into feature/issue-876

ed2aad8

Add changelog entry

39e770a

Remove unnecessary comments

fd9f3c0

Remove graph.nodesCount() function (YAGNI)

ea1f448

Remove graph.edgesCount() function (YAGNI)

294aaf4

Remove graph.getDependencies() function (YAGNI)

f42c368

Add typedocs to all methods of Graph.ts

30bdcb2

Refactor tests for Graph class

0db23ea

sequba marked this pull request as ready for review August 29, 2023 16:37

Improve code coverage for the Graph class

0fd2c24

sequba changed the title ~~Improve performance for spreadsheets with no formulas~~ Improve efficiency of the dependency graph Aug 29, 2023

Improve code formatting in Graph.ts and TopSort.ts

1a99297

sequba changed the title ~~Improve efficiency of the dependency graph~~ Improve efficiency of the dependency graph (51% speed-up) Aug 30, 2023

sequba mentioned this pull request Aug 30, 2023

Increase number of runs for the basic performance benchmarks #1303

Merged

14 tasks

sequba added 3 commits August 30, 2023 12:41

Convert recently changed nodes set to map in Graph class

e76fa43

Convert volatile nodes set to array of numbers in Graph class

3019ac6

Convert all sets of nodes to arrays of numbers in Graph class

c234ef0

sequba changed the title ~~Improve efficiency of the dependency graph (51% speed-up)~~ Improve efficiency of the dependency graph (??% speed-up) Aug 30, 2023

Improve documentation for Graph class

c7e3bdb

sequba changed the title ~~Improve efficiency of the dependency graph (??% speed-up)~~ Improve efficiency of the dependency graph (56% speed-up) Aug 30, 2023

sequba added 3 commits September 5, 2023 22:52

Introduce ProcessableValue class for better performance

004ac3d

Try a few more optimizations

ce5e10f

Change graph.infiniteRangeIds to be a Set of numbers (ids)

c135a48

sequba self-assigned this Sep 6, 2023

sequba linked an issue Sep 6, 2023 that may be closed by this pull request

Slow graph building with 500k of rows without formulas #876

Closed

sequba requested a review from budnix September 6, 2023 14:12

sequba added 2 commits September 6, 2023 16:31

Adjust test descriptions

65d4505

Fix browser tests

e3bc190

budnix approved these changes Sep 7, 2023

Merge branch 'develop' into feature/issue-876

69095fc

sequba merged commit 4fa4df3 into develop Sep 7, 2023
21 checks passed

sequba deleted the feature/issue-876 branch September 7, 2023 12:48

AMBudnik added the Released label Sep 19, 2023

sequba mentioned this pull request Nov 7, 2023

Nodes edges map graph #1329

Closed

13 tasks
