Memory efficient management of outputs #28

kampersanda · 2022-04-05T12:32:52Z

Daachorse represents each pattern's output as a value-length pair (see https://github.com/daac-tools/daachorse/blob/main/src/lib.rs#L289-L292).

In the current implementation, 31 bits and 32 bits are assigned to the length and the value, respectively. (The remaining 1 bit is used for a flag.)
In many cases, however, this allocation is larger than necessary.
For example, when the maximum length is 255, 1 byte is enough to represent it.

If we know the maximum length and value in advance, we can store the members compactly in byte-aligned memory.
For example, if a length is represented in 1 byte and a value (with the flag) in 3 bytes, we can interleave them in a byte array outputs as follows.

outputs[0] = length 1
outputs[1] = value 1
outputs[2] = value 1
outputs[3] = value 1
outputs[4] = length 2
outputs[5] = value 2
outputs[6] = value 2
outputs[7] = value 2
...
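The interleaved layout above could be sketched in Rust roughly as follows. This is a minimal illustration, not Daachorse code: the names `ENTRY_BYTES`, `push_output`, and `get_output` are hypothetical, the little-endian byte order of the value is an assumption, and the 3 value bytes are treated as an opaque 24-bit field (the flag would live somewhere inside it).

```rust
/// Assumed packed layout: 1 byte of length followed by 3 bytes of value
/// (little-endian), so each output takes 4 bytes instead of 8.
const ENTRY_BYTES: usize = 4;

/// Appends one (length, value) pair; returns None if either does not fit.
fn push_output(outputs: &mut Vec<u8>, length: u32, value: u32) -> Option<()> {
    if length > 0xff || value > 0x00ff_ffff {
        return None;
    }
    outputs.push(length as u8);
    // Keep only the three low-order bytes of the value.
    outputs.extend_from_slice(&value.to_le_bytes()[..3]);
    Some(())
}

/// Reads the i-th (length, value) pair back from the byte array.
fn get_output(outputs: &[u8], i: usize) -> Option<(u32, u32)> {
    let e = outputs.get(i * ENTRY_BYTES..(i + 1) * ENTRY_BYTES)?;
    let value = u32::from_le_bytes([e[1], e[2], e[3], 0]);
    Some((u32::from(e[0]), value))
}
```

Because every entry has the same width, the i-th output is still found by index arithmetic; the extra cost compared with a plain struct array is reassembling the value from individual bytes.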


kampersanda · 2022-04-06T04:20:32Z

However, this approach can slow down the extraction of members from the byte array.
We need to investigate whether the trade-off is acceptable.

vbkaisetsu · 2022-04-06T11:40:27Z

How about using a bincode-like varint encoding such as the following?
https://github.com/bincode-org/bincode/blob/trunk/docs/spec.md

With at most 4 bytes per value, that encoding cannot represent large values (>= 0x1000000), but I think it is enough to represent lengths.

For example

match x {
  0..=253 => [x],
  254..=0xffff => [254, x & 0xff, x >> 8],
  0x10000..=0xffffff => [255, x & 0xff, x >> 8 & 0xff, x >> 16],
}
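A runnable version of this scheme might look like the sketch below. The function names `encode` and `decode` are hypothetical, and the little-endian byte order follows the pseudocode above; values of 0x1000000 or more are rejected rather than encoded.

```rust
/// Encodes x (< 0x1000000) using a one-byte tag: values up to 253 are stored
/// directly, tag 254 is followed by 2 bytes, and tag 255 by 3 bytes.
fn encode(x: u32, out: &mut Vec<u8>) -> Option<()> {
    match x {
        0..=253 => out.push(x as u8),
        254..=0xffff => {
            out.push(254);
            out.extend_from_slice(&(x as u16).to_le_bytes());
        }
        0x1_0000..=0xff_ffff => {
            out.push(255);
            out.extend_from_slice(&x.to_le_bytes()[..3]);
        }
        _ => return None, // too large for this scheme
    }
    Some(())
}

/// Decodes one integer from the front of `bytes`.
/// Returns the value and the number of bytes consumed.
fn decode(bytes: &[u8]) -> Option<(u32, usize)> {
    match *bytes.first()? {
        254 => {
            let b = bytes.get(1..3)?;
            Some((u32::from(u16::from_le_bytes([b[0], b[1]])), 3))
        }
        255 => {
            let b = bytes.get(1..4)?;
            Some((u32::from_le_bytes([b[0], b[1], b[2], 0]), 4))
        }
        small => Some((u32::from(small), 1)),
    }
}
```

The first byte alone determines how many bytes follow, so decoding is a single branch on the tag rather than a loop over continuation bits.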

vbkaisetsu · 2022-04-06T11:51:38Z

If 1 bit of the first byte is reserved for a flag, only 7 bits remain for the tag, so the ranges become:

match x {
  0..=125 => [x],
  126..=0xffff => [126, x & 0xff, x >> 8],
  0x10000..=0xffffff => [127, x & 0xff, x >> 8 & 0xff, x >> 16],
}
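The flag-carrying variant could be sketched as follows. The thread does not say which bit holds the flag; placing it in the top bit of the first byte is an assumption made here, and `FLAG`, `encode_flagged`, and `decode_flagged` are hypothetical names.

```rust
/// Assumption: the flag occupies the top bit of the first byte.
const FLAG: u8 = 0x80;

/// Encodes x (< 0x1000000) together with a 1-bit flag.
fn encode_flagged(x: u32, flag: bool, out: &mut Vec<u8>) -> Option<()> {
    let f = if flag { FLAG } else { 0 };
    match x {
        0..=125 => out.push(f | x as u8),
        126..=0xffff => {
            out.push(f | 126);
            out.extend_from_slice(&(x as u16).to_le_bytes());
        }
        0x1_0000..=0xff_ffff => {
            out.push(f | 127);
            out.extend_from_slice(&x.to_le_bytes()[..3]);
        }
        _ => return None, // too large for this scheme
    }
    Some(())
}

/// Decodes one (value, flag) pair; also returns the number of bytes consumed.
fn decode_flagged(bytes: &[u8]) -> Option<(u32, bool, usize)> {
    let first = *bytes.first()?;
    let flag = (first & FLAG) != 0;
    match first & !FLAG {
        126 => {
            let b = bytes.get(1..3)?;
            Some((u32::from(u16::from_le_bytes([b[0], b[1]])), flag, 3))
        }
        127 => {
            let b = bytes.get(1..4)?;
            Some((u32::from_le_bytes([b[0], b[1], b[2], 0]), flag, 4))
        }
        small => Some((u32::from(small), flag, 1)),
    }
}
```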

kampersanda · 2022-04-07T01:20:56Z

@vbkaisetsu

Yeah, such a variable-length encoding is a good alternative. Compared to the well-known variable-byte encoding, the bincode scheme would be faster because it does not need a while loop to decode a value.

For the terminator flag, I think embedding it into 1 bit of the value is reasonable if we can limit the maximum value to 2**31.
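Embedding the terminator flag in the value could look like the sketch below. Using the top bit of a 32-bit word is an assumption (the thread only says "1 bit of value"), and `TERMINATOR`, `pack_value`, and `unpack_value` are hypothetical names.

```rust
/// Assumption: the terminator flag is stored in the top bit of the 32-bit word,
/// leaving 31 bits for the value itself.
const TERMINATOR: u32 = 1 << 31;

/// Packs a value below 2^31 together with the terminator flag.
fn pack_value(value: u32, is_last: bool) -> Option<u32> {
    if value >= TERMINATOR {
        return None; // value would collide with the flag bit
    }
    Some(if is_last { value | TERMINATOR } else { value })
}

/// Splits a packed word back into (value, is_last).
fn unpack_value(packed: u32) -> (u32, bool) {
    (packed & !TERMINATOR, (packed & TERMINATOR) != 0)
}
```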

kampersanda · 2022-05-16T16:11:40Z

A memory-efficient representation of output sets is achieved by another solution (#30).

kampersanda added the enhancement label Apr 5, 2022

kampersanda changed the title ~~Memory efficient serialization of outputs~~ Memory efficient management of outputs Apr 6, 2022

vbkaisetsu assigned kg86 Apr 12, 2022

kampersanda closed this as completed May 16, 2022
