Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve efficiency of the dependency graph (56% speed-up) #1293

Merged
merged 33 commits into from
Sep 7, 2023

Conversation

sequba
Copy link
Contributor

@sequba sequba commented Jul 26, 2023

Test case

A spreadsheet with 500k rows and 10 columns filled with string data. Total of 5M data cells, no formulas.

Script:

const hf = HyperFormula.buildFromArray([], {
  licenseKey: 'gpl-v3',
  maxRows: 500100,
  useStats: true,
  chooseAddressMappingPolicy: new AlwaysDense(),
})

const data = Array(500000).fill(0).map(() => Array(10).fill(0).map(() => 'A500000'))

var ty1 = (new Date()).getTime()
hf.setSheetContent(0, data)
var ty2 = (new Date()).getTime()
console.log(ty2 - ty1)

Ideas for improvement

  • Function getTopSortedWithSccSubgraphFrom, which is the iterative implementation of the Tarjan algorithm that performs the topological sorting of the dependency graph and finds cycles (SCCs) in the graph. In this test case, the dependency graph is trivial; it contains only isolated nodes without any edges. It seems that it can be done more efficiently.
  • Function parseDateTimeFromConfigFormats, which tries to parse all string data as date/time values. This test case contains no date/time values, so there might be some way of saving time by determining it quickly and avoiding running the heavy parsing operations.

This PR focuses on optimizing topological sorting of the dependency graph

I made the Tarjan algorithm more efficient by changing the data structures it uses. Initially, the information about the graph nodes was stored in maps and sets with nodes as keys:

    const entranceTime: Map<T, number> = new Map()
    const low: Map<T, number> = new Map()
    const parent: Map<T, T> = new Map()
    const inSCC: Set<T> = new Set()

My approach was to use simple arrays indexed by integer ids and keep a single array of nodes as a mapping from id to node data.

  private nodes: T[] = []

  private entranceTime: number[] = []
  private low: number[] = []
  private parent: number[] = []
  private inSCC: boolean[] = []

Results

Total time:
Before: 25923ms
After: 11160ms

Function getTopSortedWithSccSubgraphFrom:
Before: 51.6%
After: 23.1%

Profiler:
Before:
image

After:
image

Profiler: Chrome Dev Tools

Overall, I achieved the 56,02% speed-up of HyperFormula for this use-case.

How did you test your changes?

  • unit tests passed
  • new unit tests for the Graph class

Comparison on our existing performance benchmarks on my PC:

                                     testName | before |  after
-----------------------------------------------------------------------
                                      Sheet A | 417.77 | 401.12
                                      Sheet B | 133.09 | 129.18
                                      Sheet T | 119.23 | 114.51
                                Column ranges | 292.47 | 408.56   <-- 40% slow-down
Sheet A:  change value, add/remove row/column |     23 |   27.2
 Sheet B: change value, add/remove row/column |    169 |  194.0
                   Column ranges - add column |  136.8 |  137.4
                Column ranges - without batch |  339.2 |  316.8
                        Column ranges - batch |    124 |   80.0

Column ranges benchmark

The 40% slow-down is observed only in Node environment. Running this benchmark in V8 yields similar results before and after applying the change.

Types of changes

  • Breaking change (a fix or a feature because of which an existing functionality doesn't work as expected anymore)
  • New feature or improvement (a non-breaking change that adds functionality)
  • Bug fix (a non-breaking change that fixes an issue)
  • Additional language file, or a change to an existing language file (translations)
  • Change to the documentation

Related issues:

Fixes #876

Checklist:

  • I have reviewed the guidelines about Contributing to HyperFormula and I confirm that my code follows the code style of this project.
  • I have signed the Contributor License Agreement.
  • My change is compliant with the OpenDocument standard.
  • My change is compatible with Microsoft Excel.
  • My change is compatible with Google Sheets.
  • I described my changes in the CHANGELOG.md file.
  • My changes require a documentation update.
  • My changes require a migration guide.

@github-actions
Copy link

github-actions bot commented Jul 26, 2023

Performance comparison of head (69095fc) vs base (0e39178)

                                     testName |   base |   head |  change
-------------------------------------------------------------------------
                                      Sheet A | 736.13 | 708.76 |  -3.72%
                                      Sheet B | 249.44 | 232.93 |  -6.62%
                                      Sheet T | 222.16 | 210.55 |  -5.23%
                                Column ranges | 567.02 | 815.93 | +43.90%
Sheet A:  change value, add/remove row/column |     36 |     48 | +33.33%
 Sheet B: change value, add/remove row/column |    344 |    385 | +11.92%
                   Column ranges - add column |    200 |    205 |  +2.50%
                Column ranges - without batch |    577 |    577 |   0.00%
                        Column ranges - batch |    166 |    156 |  -6.02%

src/DateTimeDefault.ts Fixed Show fixed Hide fixed
@sequba sequba marked this pull request as ready for review August 29, 2023 16:37
@sequba sequba changed the title Improve performance for spreadsheets with no formulas Improve efficiency of the dependency graph Aug 29, 2023
@sequba sequba changed the title Improve efficiency of the dependency graph Improve efficiency of the dependency graph (51% speed-up) Aug 30, 2023
@sequba sequba changed the title Improve efficiency of the dependency graph (51% speed-up) Improve efficiency of the dependency graph (??% speed-up) Aug 30, 2023
@sequba sequba changed the title Improve efficiency of the dependency graph (??% speed-up) Improve efficiency of the dependency graph (56% speed-up) Aug 30, 2023
@sequba sequba self-assigned this Sep 6, 2023
@sequba sequba linked an issue Sep 6, 2023 that may be closed by this pull request
@sequba sequba requested a review from budnix September 6, 2023 14:12
Copy link
Member

@budnix budnix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! 🥇

I tested the changes using the demo provided in the PR's description, and here are my results:

  • On Chrome (browser), there is a 60% improvement (from ~25s to ~10s);
  • Node (v20), there is a 58% improvement (from ~29s to ~12s);

@sequba
Copy link
Contributor Author

sequba commented Sep 7, 2023

@budnix Could you also run the Column ranges benchmark yourself? I'm curious if you'd confirm my results. Here's the code:

import { AlwaysDense, HyperFormula } from 'hyperformula';

function columnIndexToLabel(column) {
  let result = ''

  while (column >= 0) {
    result = String.fromCharCode((column % 26) + 97) + result
    column = Math.floor(column / 26) - 1
  }

  return result.toUpperCase()
}

function simpleCellAddressToString(address) {
  const column = columnIndexToLabel(address.col)
  return `${column}${address.row + 1}`
}

const cols = 50;
const data = [];
const firstRow = [1];

for (let i = 1; i < cols; ++i) {
  const adr = simpleCellAddressToString({sheet: 0, row: 0, col: i - 1});
  firstRow.push(`=${adr} + 1`);
}

data.push(firstRow);

for (let i = 1; i < cols; ++i) {
  const rowToPush = Array(i).fill(null);
  const startColumn = columnIndexToLabel(i - 1);

  for (let j = i - 1; j < cols - 1; ++j) {
    const endColumn = columnIndexToLabel(j);
    rowToPush.push(`=SUM(${startColumn}:${endColumn})`);
  }

  data.push(rowToPush);
}

const sheetId = 0;
const ty1 = (new Date()).getTime();

for (let i = 1; i < 200 ; ++i) {
  const hf = HyperFormula.buildFromArray([], {
    licenseKey: 'gpl-v3',
    maxRows: 500100,
    useStats: true,
    chooseAddressMappingPolicy: new AlwaysDense(),
  });

  hf.setSheetContent(sheetId, data);
}

const ty2 = (new Date()).getTime();
console.log(ty2 - ty1);

@budnix
Copy link
Member

budnix commented Sep 7, 2023

@budnix Could you also run the Column ranges benchmark yourself? I'm curious if you'd confirm my results.

Using your code, I spotted performance degradation. However, considering it's a rare use (hundreds of sheets created by buildFromArray), I think it's acceptable. I tested on Node 20 and it took: Before PR ~23s and after PR ~25s.

@sequba sequba merged commit 4fa4df3 into develop Sep 7, 2023
21 checks passed
@sequba sequba deleted the feature/issue-876 branch September 7, 2023 12:48
@AMBudnik
Copy link
Contributor

Released in v2.6.0

@izulin
Copy link
Collaborator

izulin commented Sep 20, 2023

awesome!

@sequba sequba mentioned this pull request Nov 7, 2023
13 tasks
Copy link

codecov bot commented Oct 30, 2024

Codecov Report

Attention: Patch coverage is 99.71591% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.25%. Comparing base (0e39178) to head (69095fc).
Report is 79 commits behind head on develop.

Files with missing lines Patch % Lines
src/DependencyGraph/Graph.ts 99.31% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1293      +/-   ##
===========================================
+ Coverage    97.23%   97.25%   +0.01%     
===========================================
  Files          167      169       +2     
  Lines        14304    14408     +104     
  Branches      3065     3092      +27     
===========================================
+ Hits         13909    14012     +103     
- Misses         395      396       +1     
Files with missing lines Coverage Δ
src/BuildEngineFactory.ts 100.00% <100.00%> (ø)
src/DateTimeDefault.ts 97.43% <100.00%> (ø)
...c/DependencyGraph/AddressMapping/AddressMapping.ts 88.00% <100.00%> (ø)
src/DependencyGraph/DependencyGraph.ts 97.94% <100.00%> (+0.02%) ⬆️
src/DependencyGraph/ProcessableValue.ts 100.00% <100.00%> (ø)
src/DependencyGraph/TopSort.ts 100.00% <100.00%> (ø)
src/DependencyGraph/index.ts 100.00% <100.00%> (ø)
src/HyperFormula.ts 99.59% <100.00%> (ø)
src/Operations.ts 99.25% <100.00%> (+<0.01%) ⬆️
src/errors.ts 100.00% <ø> (ø)
... and 1 more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow graph building with 500k of rows without formulas
4 participants