Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory efficient management of outputs #28

Closed
kampersanda opened this issue Apr 5, 2022 · 5 comments
Closed

Memory efficient management of outputs #28

kampersanda opened this issue Apr 5, 2022 · 5 comments
Assignees
Labels
enhancement New feature or request

Comments

@kampersanda
Copy link
Member

kampersanda commented Apr 5, 2022

Daachorse handles a value-length pair as a pattern's output (see https://github.com/daac-tools/daachorse/blob/main/src/lib.rs#L289-L292)

In the current implementation, 31 bits and 32 bits are assigned to length and value, respectively. (1 bit is used for the flag,)
But, in many cases, the assignment is too rich.
For example, when the maximum length is 255, 1 byte is sufficient to represent.

If we know the maximum length and value, we can memory-efficiently store members on byte-aligned memory.
For example, if a length is represented in 1 byte and a value (with flag) is represented in 3 bytes, we can interleave them in a byte array outputs as follows.

outputs[0] = length 1
outputs[1] = value 1
outputs[2] = value 1
outputs[3] = value 1
outputs[4] = length 2
outputs[5] = value 2
outputs[6] = value 2
outputs[7] = value 2
...
@kampersanda kampersanda added the enhancement New feature or request label Apr 5, 2022
@kampersanda kampersanda changed the title Memory efficient serialization of outputs Memory efficient management of outputs Apr 6, 2022
@kampersanda
Copy link
Member Author

kampersanda commented Apr 6, 2022

But, the approach can degrade time efficiency to extract members from a byte array.
We need to investigate if the trade off is acceptable.

@vbkaisetsu
Copy link
Member

How about using the following bincode like variant encoding:
https://github.com/bincode-org/bincode/blob/trunk/docs/spec.md

That encoding cannot represent a large value (>= 0x1000000) in 4 byte sequence, but I think it is enough to represent lengths.

For example

match x {
  0..=253 => [x],
  254..=0xffff => [254, x & 0xff, x >> 8],
  0x10000..=0xffffff => [255, x & 0xff, x >> 8 & 0xff, x >> 16],
}

@vbkaisetsu
Copy link
Member

vbkaisetsu commented Apr 6, 2022

1 bit for a flag, so

match x {
  0..=125 => [x],
  126..=0xffff => [126, x & 0xff, x >> 8],
  0x10000..=0xffffff => [127, x & 0xff, x >> 8 & 0xff, x >> 16],
}

@kampersanda
Copy link
Member Author

@vbkaisetsu

Yeah, such variable encoding is a good alternative. Compared to a well-known variable byte encoding, the bincode scheme would be faster because it does not need while loop to extract.

For a terminator flag, I think embedding it into 1 bit of value is reasonable if we can limit the max of value to 2**31.

@kampersanda
Copy link
Member Author

A memory-efficient representation of output sets is achieved by another solution (#30).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants