[PERF] evaluation: use custom data structures instead of Map and Set #3823
Conversation
Force-pushed from `ff7c51f` to `99ff33a`
src/helpers/position_map.ts
Outdated
```ts
@@ -0,0 +1,41 @@
import { CellPosition, UID } from "../types";

export class PositionMap<T> {
```
We could also implement this with an unrolled 1-D array, similar to the BinaryGrid.
I don't think it would be much faster (lookups in objects are fast), but we might save quite a bit of memory since we'd avoid storing all the object keys.
Do we want to go that far? (It would also mean the size needs to be fixed beforehand and known at instantiation.)
A quick and dirty POC (see commit history) didn't show any meaningful difference 🤷
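To illustrate the "unrolled 1-D array" idea discussed above, here is a minimal hypothetical sketch (the class name `FlatPositionMap` and its constructor are illustrative, not the repo's API). It assumes the grid dimensions are known at construction, which is exactly the trade-off mentioned in the comment:

```js
// Hypothetical sketch of a per-sheet position map backed by a flat array.
// A (col, row) pair is unrolled to a single index: row * cols + col.
class FlatPositionMap {
  constructor(cols, rows) {
    this.cols = cols;
    // One slot per cell; unset cells stay undefined.
    this.values = new Array(cols * rows);
  }
  set(col, row, value) {
    this.values[row * this.cols + col] = value;
  }
  get(col, row) {
    return this.values[row * this.cols + col];
  }
}
```

The lookup is a single multiply-add and an array access, with no per-entry key strings, which is where the hoped-for memory saving would come from.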
src/helpers/binary_grid.ts
Outdated
```js
function log2Ceil(value) {
  // A faster version of Math.ceil(Math.log2(value)).
```
This is called only a few times, so it won't actually make any difference in the overall picture, even if it's 10x faster.
49c8e01#r139640595
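For reference, one common way to compute an integer `Math.ceil(Math.log2(value))` without floating point is via `Math.clz32` (this is a sketch of the general technique, not necessarily the implementation in the commit; it is valid for integers `1 <= value <= 2**31`):

```js
// ceil(log2(value)) as the bit length of value - 1.
// Math.clz32(x) counts leading zero bits in the 32-bit representation of x,
// so 32 - Math.clz32(value - 1) is the number of bits needed for value - 1.
function log2Ceil(value) {
  return 32 - Math.clz32(value - 1);
}
```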
Force-pushed from `92c1558` to `89c25d1`
robodoo r+
This commit drastically improves the evaluation execution time (~43% faster with the large formula demo dataset).

`Map` and `Set` data structures are slow compared to objects and arrays (especially with `BigInt`, more on that later). Why use `Map` and `Set` then? We needed a set data structure to hold which positions should be evaluated, without duplicates (hence the set). However, we can't blindly add a position object `{ sheetId, col, row }` to a `Set`, because the same position could be added twice as two different objects that are not referentially equal. That's why we came up with a way to encode the position object as a single value that can be added safely to the set without worrying about object references. This encoded value is a `BigInt` (see `PositionBitsEncoder`); hence `Set<BigInt>`. Since we had the position encoded as a single value, we naturally used that value as a key in a `Map` (a regular `object` can only have `string` or `number` keys, not `BigInt`). Using this `BigInt` key also avoids having to decode the encoded value into nested `[sheetId][col][row]` objects.

In the evaluation process, there's a lot going on, to the point where lookups in the map are a bottleneck. This commit replaces the `Set` and `Map` in the evaluation with custom data structures `PositionSet` and `PositionMap<T>` that have faster addition and lookup. Besides being faster, this makes the evaluation code simpler, as we no longer need encoding/decoding everywhere. These data structures are also easier to debug: with the `BigInt` encoding, it was hard to figure out which position the number `28416511n` referred to.

Some measures
----------------------

*(all measures are averages over 8 runs)*

```js
console.time("eval");
o_spreadsheet.__DEBUG__.model.dispatch("EVALUATE_CELLS");
console.timeEnd("eval");
```

On the large formula demo data set (260k cells): -43%

|        | Time  | Allocated memory |
|--------|-------|------------------|
| Before | 512ms | 190MB            |
| After  | 291ms | 122MB            |

I also tried implementing `PositionSet` with a `PositionMap<boolean>` instead of `BinaryGrid`. It's ~10% slower and allocates ~20MB more (but it's 120 LoC less).

On RNG's Timesheet spreadsheet: -23%

|        | Time   | Allocated memory |
|--------|--------|------------------|
| Before | 1894ms | 580MB            |
| After  | 1469ms | 460MB            |

Basic tests on the raw data structures with 1M positions (10k rows and 100 columns):

|   |                  | add   | has   |
|---|------------------|-------|-------|
| 1 | `Set<BigInt>`    | 314ms | 262ms |
| 2 | `Set` + encoding | 440ms |       |
| 3 | `PositionSet`    | 43ms  | 46ms  |

|   |                       | set   | get   |
|---|-----------------------|-------|-------|
| 4 | `Map<BigInt, number>` | 365ms | 304ms |
| 5 | `PositionMap<number>` | 47ms  | 51ms  |

Setup code shared by all benchmarks:

```js
const sheetId = "1";
const positions = [];
for (let i = 0; i < 10000; i++) {
  for (let j = 0; j < 100; j++) {
    positions.push({ sheetId, row: i, col: j });
  }
}
```

1. `Set<BigInt>` 314ms 262ms

```js
const set = new Set();
const length = BigInt(positions.length);
performance.mark("add");
for (let i = 0n; i < length; i++) {
  set.add(i);
}
performance.measure("add", "add").duration;
performance.mark("has");
for (let i = 0n; i < length; i++) {
  set.has(i);
}
performance.measure("has", "has").duration;
```

2. `Set<BigInt>` + encoding 440ms

```js
const encoder = new PositionBitsEncoder();
const set = new Set();
performance.mark("add");
for (const position of positions) {
  set.add(encoder.encode(position));
}
performance.measure("add", "add").duration;
```

3. `PositionSet` 43ms 46ms

```js
const set = new PositionSet({
  [sheetId]: { rows: 10000, cols: 100 },
});
performance.mark("add");
for (const position of positions) {
  set.add(position);
}
performance.measure("add", "add").duration;
performance.mark("has");
let r;
for (const position of positions) {
  r = set.has(position);
}
performance.measure("has", "has").duration;
```

4. `Map<BigInt, number>` 365ms 304ms

```js
const map = new Map();
const length = BigInt(positions.length);
performance.mark("set");
for (let i = 0n; i < length; i++) {
  map.set(i, 4);
}
performance.measure("set", "set").duration;
performance.mark("get");
let r;
for (let i = 0n; i < length; i++) {
  r = map.get(i);
}
performance.measure("get", "get").duration;
```

5. `PositionMap` 47ms 51ms

```js
performance.mark("add");
for (const position of positions) {
  map.set(position, 4);
}
performance.measure("add", "add").duration;
performance.mark("get");
let r;
for (const position of positions) {
  r = map.get(position);
}
performance.measure("get", "get").duration;
```

Task: 3802246
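The referential-equality pitfall the commit message describes can be reproduced in a few lines: a JavaScript `Set` compares objects by reference, not by content, so two structurally identical positions count as distinct entries.

```js
// Two structurally identical position objects are different Set entries,
// because Set membership for objects is based on reference identity.
const positions = new Set();
positions.add({ sheetId: "1", col: 0, row: 0 });
positions.add({ sheetId: "1", col: 0, row: 0 });
console.log(positions.size); // 2 — the "same" position was added twice
```

This is why the positions had to be encoded to a primitive (`BigInt`) before the custom structures made the encoding unnecessary.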
Force-pushed from `89c25d1` to `f021d4f`
robodoo r+

closes #3823

Task: 3802246

Signed-off-by: Vincent Schippefilt (vsc) <vsc@odoo.com>