Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] evaluation: use custom data structures instead of Map and Set #3823

Closed
wants to merge 1 commit into from

Conversation

LucasLefevre
Copy link
Collaborator

@LucasLefevre LucasLefevre commented Mar 12, 2024

Description:

This commit drastically improves the evaluation execution time (~43% faster with the large formula demo dataset)

Map and Set data structures are slow compared to objects and arrays (especially with BigInt, more on that later). Why use Map and Set then ?

We needed a set data structure to hold which positions should be evaluated. We obviously don't want duplicates (hence the set). However, we can't blindly add a position object { sheetId, col, row } in a Set though because the same position could be added twice by adding the same position as two different objects that are not referentially equal.
That's why we came up with a way to encode the position object to a single value that can be added safely to the set without worrying about object references being equal. This encoded value is a BigInt (see PositionBitsEncoder). That's why we used Set<BigInt>. Since we had the position encoded as a single value, we naturally used that value as a key in a Map (regular object can only have string or number keys, not BigInt). Using this BigInt key also avoid having to decode the encoded value, then have a nested [sheetId][col][row] objects.

In the evaluation process, there's a lot going on. To the point were lookup in the map is a bottleneck.

This commit replaces the Set and Map in the evaluation by custom data structure PositionSet and PositionMap<T> with faster addition and lookup.

In addition to being faster, this makes the evaluation code simpler as we no longer need encoding/decoding everywhere. Those data structures are also easier to debug. With the BigInt encoding, it was hard to find back which position is the number 28416511n.

Some measures

(all measures are averages over 8 runs)

console.time("eval");
o_spreadsheet.__DEBUG__.model.dispatch("EVALUATE_CELLS")
console.timeEnd("eval");

On the large formula demo data set (260k cells): -43%

Time Allocated memory
Before 512ms 190MB
After 291ms 122MB

I also tried implementing PositionSet with a PositionMap<boolean> instead of BinaryGrid. It's ~10% slower and allocates ~20MB more (but it's 120 LoC less).

On RNG's Timesheet spreadsheet: -23%

Time Allocated memory
Before 1894ms 580MB
After 1469ms 460MB

Basic tests on the raw data structures with 1M positions (10k rows and 100 columns)

add has
1 Set<BigInt> 314ms 262ms
2 Set + encoding 440ms
3 PositionSet 43ms 46ms
set get
4 Map<BigInt, number> 365ms 304ms
5 PositionMap<number> 47ms 51ms

Setup code shared by all benchmarks:

const sheetId = "1";
const postions = [];
for (let i = 0; i < 10000; i++) {
  for (let j = 0; j < 100; j++) {
    postions.push({ sheetId, row: i, col: j });
  }
}
  1. Set<BigInt> 314ms 262ms
const set = new Set();
const length = BigInt(postions.length);
performance.mark("add");
for (let i = 0n; i < length; i++) {
  set.add(i);
}
performance.measure("add", "add").duration);

performance.mark("has");
for (let i = 0n; i < length; i++) {
  set.has(i);
}
performance.measure("has", "has").duration;
  1. Set<BigInt> + encoding 440ms
const encoder = new PositionBitsEncoder();
const set = new Set();
performance.mark("add");
for (const position of postions) {
  set.add(encoder.encode(position));
}
performance.measure("add", "add").duration;
  1. PositionSet 43ms 46ms
const set = new PositionSet({
    [sheetId]: { rows: 10000, cols: 100 },
});
performance.mark("add");
for (const position of postions) {
  set.add(position);
}
performance.measure("add", "add").duration;

performance.mark("has");
let r;
for (const position of postions) {
  r = set.has(position);
}
performance.measure("has", "has").duration;
  1. Map<BigInt, number> 365ms 304ms
const map = new Map();
const length = BigInt(postions.length);
performance.mark("set");
for (let i = 0n; i < length; i++) {
  map.set(i, 4);
}
performance.measure("set", "set").duration;

performance.mark("get");
let r;
for (let i = 0n; i < length; i++) {
  r = map.get(i);
}
performance.measure("get", "get").duration;
  1. PositionMap 47ms 51ms
performance.mark("add");
for (const position of postions) {
  map.set(position, 4);
}
performance.measure("add", "add").duration;

performance.mark("get");
let r;
for (const position of postions) {
  r = map.get(position);
}
performance.measure("get", "get").duration;

Task: : 3802246

review checklist

  • feature is organized in plugin, or UI components
  • support of duplicate sheet (deep copy)
  • in model/core: ranges are Range object, and can be adapted (adaptRanges)
  • in model/UI: ranges are strings (to show the user)
  • undo-able commands (uses this.history.update)
  • multiuser-able commands (has inverse commands and transformations where needed)
  • new/updated/removed commands are documented
  • exportable in excel
  • translations (_t("qmsdf %s", abc))
  • unit tested
  • clean commented code
  • track breaking changes
  • doc is rebuild (npm run doc)
  • status is correct in Odoo

@robodoo
Copy link
Collaborator

robodoo commented Mar 12, 2024

@LucasLefevre LucasLefevre force-pushed the master-no-more-map-and-set-lul branch 3 times, most recently from ff7c51f to 99ff33a Compare March 12, 2024 15:12
@@ -0,0 +1,41 @@
import { CellPosition, UID } from "../types";

export class PositionMap<T> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also implement this with an unrolled 1-D array similar to the BinaryGrid.
I don't think it would be much much faster (?) (lookups in objects are fast), but we might save quite a bit of memory since we save all the object keys.
Do we want to go that far? (It would also mean the size needs to be fixed beforehand and known at instantiation)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick and dirty POC (see commit history) didn't show any meaningful difference 🤷

Comment on lines 89 to 90
function log2Ceil(value) {
// A faster version of Math.ceil(Math.log2(value)).
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called only a few times so it won't actually make any difference in the overall picture, even if it's 10x faster
49c8e01#r139640595

@LucasLefevre LucasLefevre force-pushed the master-no-more-map-and-set-lul branch 2 times, most recently from 92c1558 to 89c25d1 Compare March 18, 2024 08:33
@VincentSchippefilt
Copy link
Collaborator

robodoo r+

This commit drastically improves the evaluation execution time (~43% faster
with the large formula demo dataset)

`Map` and `Set` data structures are slow compared to objects and arrays
(especially with `BigInt`, more on that later). Why use `Map` and `Set` then ?

We needed a set data structure to hold which positions should be evaluated. We
obviously don't want duplicates (hence the set). However, we can't blindly add
a position object `{ sheetId, col, row }` in a `Set` though because the same
position could be added twice by adding the same position as two different
objects that are not referentially equal.
That's why we came up with a way to encode the position object to a single
value that can be added safely to the set without worrying about object
references being equal. This encoded value is a `BigInt` (see
`PositionBitsEncoder`). That's why we used `Set<BigInt>`. Since we had the
position encoded as a single value, we naturally used that value as a key in a
`Map` (regular `object` can only have `string` or `number` keys, not `BigInt`).
Using this `BigInt` key also avoid having to decode the encoded value, then
have a nested `[sheetId][col][row]` objects.

In the evaluation process, there's a lot going on. To the point were lookup in
the map is a bottleneck.

This commit replaces the `Set` and `Map` in the evaluation by custom data
structure `PositionSet` and `PositionMap<T>` with faster addition and lookup.

In addition to being faster, this makes the evaluation code simpler as we no
longer need encoding/decoding everywhere. Those data structures are also easier
to debug. With the `BigInt` encoding, it was hard to find back which position
is the number `28416511n`.

Some measures
----------------------
*(all measures are averages over 8 runs)*

```js
console.time("eval");
o_spreadsheet.__DEBUG__.model.dispatch("EVALUATE_CELLS")
console.timeEnd("eval");
```
On the large formula demo data set (260k cells): -43%

|        	| Time  	| Allocated memory 	|
|--------	|-------	|------------------	|
| Before 	| 512ms 	| 190MB            	|
| After  	| 291ms 	| 122MB            	|

I also tried implementing `PositionSet` with a `PositionMap<boolean>` instead
of `BinaryGrid`. It's ~10% slower and allocates ~20MB more (but it's 120 LoC
less).

On RNG's Timesheet spreadsheet: -23%

|        	| Time  	| Allocated memory 	|
|--------	|-------	|------------------	|
| Before 	| 1894ms 	| 580MB            	|
| After  	| 1469ms 	| 460MB            	|

Basic tests on the raw data structures with 1M positions (10k rows and 100
columns)

|   	|                	| add   	| has   	|
|---	|----------------	|-------	|-------	|
| 1 	| `Set<BigInt>`    	| 314ms 	| 262ms 	|
| 2 	| `Set` + encoding 	| 440ms 	|       	|
| 3 	| `PositionSet`    	| 43ms  	| 46ms  	|

|   	|                     	| set   	| get   	|
|---	|---------------------	|-------	|-------	|
| 4 	| `Map<BigInt, number>` 	| 365ms 	| 304ms 	|
| 5 	| `PositionMap<number>` 	| 47ms  	| 51ms  	|

Setup code shared by all benchmarks:
```js
const sheetId = "1";
const postions = [];
for (let i = 0; i < 10000; i++) {
  for (let j = 0; j < 100; j++) {
    postions.push({ sheetId, row: i, col: j });
  }
}
```
1. `Set<BigInt>` 314ms	262ms
```js
const set = new Set();
const length = BigInt(postions.length);
performance.mark("add");
for (let i = 0n; i < length; i++) {
  set.add(i);
}
performance.measure("add", "add").duration);

performance.mark("has");
for (let i = 0n; i < length; i++) {
  set.has(i);
}
performance.measure("has", "has").duration;
```

2. `Set<BigInt> + encoding` 440ms
```js
const encoder = new PositionBitsEncoder();
const set = new Set();
performance.mark("add");
for (const position of postions) {
  set.add(encoder.encode(position));
}
performance.measure("add", "add").duration;
```

3. `PositionSet` 43ms	46ms
```js
const set = new PositionSet({
    [sheetId]: { rows: 10000, cols: 100 },
});
performance.mark("add");
for (const position of postions) {
  set.add(position);
}
performance.measure("add", "add").duration;

performance.mark("has");
let r;
for (const position of postions) {
  r = set.has(position);
}
performance.measure("has", "has").duration;
```

4. `Map<BigInt, number>` 365ms	304ms

```js
const map = new Map();
const length = BigInt(postions.length);
performance.mark("set");
for (let i = 0n; i < length; i++) {
  map.set(i, 4);
}
performance.measure("set", "set").duration;

performance.mark("get");
let r;
for (let i = 0n; i < length; i++) {
  r = map.get(i);
}
performance.measure("get", "get").duration;
```

5. `PositionMap` 47ms	51ms

```js
performance.mark("add");
for (const position of postions) {
  map.set(position, 4);
}
performance.measure("add", "add").duration;

performance.mark("get");
let r;
for (const position of postions) {
  r = map.get(position);
}
performance.measure("get", "get").duration;
```

Task: 3802246
@LucasLefevre LucasLefevre force-pushed the master-no-more-map-and-set-lul branch from 89c25d1 to f021d4f Compare March 18, 2024 08:41
@VincentSchippefilt
Copy link
Collaborator

robodoo r+

robodoo pushed a commit that referenced this pull request Mar 18, 2024
This commit drastically improves the evaluation execution time (~43% faster
with the large formula demo dataset)

`Map` and `Set` data structures are slow compared to objects and arrays
(especially with `BigInt`, more on that later). Why use `Map` and `Set` then ?

We needed a set data structure to hold which positions should be evaluated. We
obviously don't want duplicates (hence the set). However, we can't blindly add
a position object `{ sheetId, col, row }` in a `Set` though because the same
position could be added twice by adding the same position as two different
objects that are not referentially equal.
That's why we came up with a way to encode the position object to a single
value that can be added safely to the set without worrying about object
references being equal. This encoded value is a `BigInt` (see
`PositionBitsEncoder`). That's why we used `Set<BigInt>`. Since we had the
position encoded as a single value, we naturally used that value as a key in a
`Map` (regular `object` can only have `string` or `number` keys, not `BigInt`).
Using this `BigInt` key also avoid having to decode the encoded value, then
have a nested `[sheetId][col][row]` objects.

In the evaluation process, there's a lot going on. To the point were lookup in
the map is a bottleneck.

This commit replaces the `Set` and `Map` in the evaluation by custom data
structure `PositionSet` and `PositionMap<T>` with faster addition and lookup.

In addition to being faster, this makes the evaluation code simpler as we no
longer need encoding/decoding everywhere. Those data structures are also easier
to debug. With the `BigInt` encoding, it was hard to find back which position
is the number `28416511n`.

Some measures
----------------------
*(all measures are averages over 8 runs)*

```js
console.time("eval");
o_spreadsheet.__DEBUG__.model.dispatch("EVALUATE_CELLS")
console.timeEnd("eval");
```
On the large formula demo data set (260k cells): -43%

|        	| Time  	| Allocated memory 	|
|--------	|-------	|------------------	|
| Before 	| 512ms 	| 190MB            	|
| After  	| 291ms 	| 122MB            	|

I also tried implementing `PositionSet` with a `PositionMap<boolean>` instead
of `BinaryGrid`. It's ~10% slower and allocates ~20MB more (but it's 120 LoC
less).

On RNG's Timesheet spreadsheet: -23%

|        	| Time  	| Allocated memory 	|
|--------	|-------	|------------------	|
| Before 	| 1894ms 	| 580MB            	|
| After  	| 1469ms 	| 460MB            	|

Basic tests on the raw data structures with 1M positions (10k rows and 100
columns)

|   	|                	| add   	| has   	|
|---	|----------------	|-------	|-------	|
| 1 	| `Set<BigInt>`    	| 314ms 	| 262ms 	|
| 2 	| `Set` + encoding 	| 440ms 	|       	|
| 3 	| `PositionSet`    	| 43ms  	| 46ms  	|

|   	|                     	| set   	| get   	|
|---	|---------------------	|-------	|-------	|
| 4 	| `Map<BigInt, number>` 	| 365ms 	| 304ms 	|
| 5 	| `PositionMap<number>` 	| 47ms  	| 51ms  	|

Setup code shared by all benchmarks:
```js
const sheetId = "1";
const postions = [];
for (let i = 0; i < 10000; i++) {
  for (let j = 0; j < 100; j++) {
    postions.push({ sheetId, row: i, col: j });
  }
}
```
1. `Set<BigInt>` 314ms	262ms
```js
const set = new Set();
const length = BigInt(postions.length);
performance.mark("add");
for (let i = 0n; i < length; i++) {
  set.add(i);
}
performance.measure("add", "add").duration);

performance.mark("has");
for (let i = 0n; i < length; i++) {
  set.has(i);
}
performance.measure("has", "has").duration;
```

2. `Set<BigInt> + encoding` 440ms
```js
const encoder = new PositionBitsEncoder();
const set = new Set();
performance.mark("add");
for (const position of postions) {
  set.add(encoder.encode(position));
}
performance.measure("add", "add").duration;
```

3. `PositionSet` 43ms	46ms
```js
const set = new PositionSet({
    [sheetId]: { rows: 10000, cols: 100 },
});
performance.mark("add");
for (const position of postions) {
  set.add(position);
}
performance.measure("add", "add").duration;

performance.mark("has");
let r;
for (const position of postions) {
  r = set.has(position);
}
performance.measure("has", "has").duration;
```

4. `Map<BigInt, number>` 365ms	304ms

```js
const map = new Map();
const length = BigInt(postions.length);
performance.mark("set");
for (let i = 0n; i < length; i++) {
  map.set(i, 4);
}
performance.measure("set", "set").duration;

performance.mark("get");
let r;
for (let i = 0n; i < length; i++) {
  r = map.get(i);
}
performance.measure("get", "get").duration;
```

5. `PositionMap` 47ms	51ms

```js
performance.mark("add");
for (const position of postions) {
  map.set(position, 4);
}
performance.measure("add", "add").duration;

performance.mark("get");
let r;
for (const position of postions) {
  r = map.get(position);
}
performance.measure("get", "get").duration;
```

closes #3823

Task: 3802246
Signed-off-by: Vincent Schippefilt (vsc) <vsc@odoo.com>
@robodoo robodoo closed this Mar 18, 2024
@robodoo robodoo added the 17.2 label Mar 18, 2024
@fw-bot fw-bot deleted the master-no-more-map-and-set-lul branch April 1, 2024 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants