seg compress: fix LCP initialization #9460

Merged
merged 2 commits into from
Feb 24, 2024

Conversation

battlmonstr
Collaborator

@battlmonstr battlmonstr commented Feb 16, 2024

lcp[inv[n-1]] was not initialized.
Its value was reused from a previously processed, unrelated superstring. This made the seg file dependent on the order in which parallel workers process superstrings.

Effects of the fix:

  • the torrent hashes for large files will change from the previous version
  • slightly more stable torrent hashes when files are generated in parallel mode
  • easier to implement in silkworm (without this fix, we'll have to introduce the same bug in silkworm to produce identical files)
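The order dependence described above can be reproduced with a small sketch. This is not erigon's processSuperstring: it is a textbook Kasai LCP pass over a naively built suffix array, and the names `suffixArray`, `kasaiLCP`, `process` and the `initLast` flag are hypothetical, introduced only for illustration. In this sketch the slot that the loop skips is the one for the last-ranked suffix; the exact indexing in erigon may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// suffixArray builds a suffix array by naive sorting (fine for a tiny demo).
func suffixArray(s []byte) []int32 {
	sa := make([]int32, len(s))
	for i := range sa {
		sa[i] = int32(i)
	}
	sort.Slice(sa, func(a, b int) bool {
		return string(s[sa[a]:]) < string(s[sa[b]:])
	})
	return sa
}

// kasaiLCP fills lcp[r] with the common-prefix length of the suffixes
// ranked r and r+1. lcp is a scratch buffer reused between calls.
// initLast is a hypothetical switch: when false, the slot of the
// last-ranked suffix is left untouched, as in the bug described above.
func kasaiLCP(s []byte, sa, lcp []int32, initLast bool) {
	n := len(s)
	inv := make([]int32, n)
	for r, pos := range sa {
		inv[pos] = int32(r)
	}
	k := 0
	for i := 0; i < n; i++ {
		if inv[i] == int32(n-1) {
			// The last-ranked suffix has no successor to compare with.
			if initLast {
				lcp[inv[i]] = 0
			}
			k = 0
			continue
		}
		j := int(sa[inv[i]+1])
		for i+k < n && j+k < n && s[i+k] == s[j+k] {
			k++
		}
		lcp[inv[i]] = int32(k)
		if k > 0 {
			k--
		}
	}
}

// process mimics one worker handling one superstring with a shared buffer.
func process(s string, lcp []int32, initLast bool) []int32 {
	b := []byte(s)
	kasaiLCP(b, suffixArray(b), lcp, initLast)
	return lcp[:len(b)]
}

func main() {
	for _, initLast := range []bool{false, true} {
		bufA := make([]int32, 16)
		process("aaaaaa", bufA, initLast) // predecessor with long repeats
		afterA := append([]int32(nil), process("abab", bufA, initLast)...)

		bufB := make([]int32, 16)
		process("abcdef", bufB, initLast) // predecessor with no repeats
		afterB := append([]int32(nil), process("abab", bufB, initLast)...)

		fmt.Println("initLast:", initLast, afterA, afterB)
	}
}
```

With `initLast` set to false, the LCP array for "abab" ends with whatever the previous input left in the buffer, so the two runs disagree in the last slot. With `initLast` set to true, both runs agree regardless of which input was processed first.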

@battlmonstr
Collaborator Author

@AskAlexSharov is it required to bump snapshot file versions for such a change?

@battlmonstr
Collaborator Author

battlmonstr commented Feb 16, 2024

Debugging details:

with v1-003500-004000-transactions.seg

the scores of a few patterns were different, because processSuperstring was producing fewer patterns for some words,
compared to the case where the lcp[] buffer was zero-initialized:

silkworm
score pattern_hex
10367780 ffffffffffffffffffffffffffffffffffffffff
83072 ffffffffffffffffffffffffffffffffffffffff16ff5b505b50565b6000610ba9336109a5565b15610f0557610bb7846116d7565b15610ca0577f92ca3a8085

erigon
score pattern_hex
506200 ffffffffffffffffffffffffffffffffffffffff
80960 ffffffffffffffffffffffffffffffffffffffff16ff5b505b50565b6000610ba9336109a5565b15610f0557610bb7846116d7565b15610ca0577f92ca3a8085

This led to a different pattern selection by optimiseCluster for some words, and then to a different pattern usage frequency and a different Huffman coding table.
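To illustrate the last step of that chain, here is a minimal sketch of how a shift in usage counts alone changes a Huffman table. It is a generic textbook Huffman construction, not erigon's table builder; `codeLengths` and the sample counts are hypothetical and are not taken from the seg file above.

```go
package main

import (
	"container/heap"
	"fmt"
)

// node is a Huffman tree node; sym is -1 for internal nodes.
type node struct {
	weight      uint64
	sym         int
	order       int // tie-breaker so the result is deterministic
	left, right *node
}

type nodeHeap []*node

func (h nodeHeap) Len() int { return len(h) }
func (h nodeHeap) Less(i, j int) bool {
	if h[i].weight != h[j].weight {
		return h[i].weight < h[j].weight
	}
	return h[i].order < h[j].order
}
func (h nodeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *nodeHeap) Push(x interface{}) { *h = append(*h, x.(*node)) }
func (h *nodeHeap) Pop() interface{} {
	old := *h
	n := old[len(old)-1]
	*h = old[:len(old)-1]
	return n
}

// codeLengths returns the Huffman code length for each symbol,
// given its usage count.
func codeLengths(freq []uint64) []int {
	h := &nodeHeap{}
	for i, f := range freq {
		*h = append(*h, &node{weight: f, sym: i, order: i})
	}
	heap.Init(h)
	next := len(freq)
	for h.Len() > 1 {
		// Merge the two lightest nodes into one internal node.
		a := heap.Pop(h).(*node)
		b := heap.Pop(h).(*node)
		heap.Push(h, &node{weight: a.weight + b.weight, sym: -1, order: next, left: a, right: b})
		next++
	}
	lengths := make([]int, len(freq))
	var walk func(n *node, depth int)
	walk = func(n *node, depth int) {
		if n.sym >= 0 {
			lengths[n.sym] = depth
			return
		}
		walk(n.left, depth+1)
		walk(n.right, depth+1)
	}
	if h.Len() == 1 {
		walk((*h)[0], 0)
	}
	return lengths
}

func main() {
	// Same four patterns and the same total usage, distributed differently.
	fmt.Println(codeLengths([]uint64{8, 4, 2, 2}))
	fmt.Println(codeLengths([]uint64{5, 4, 4, 3}))
}
```

The two calls use the same symbols and the same total count, yet yield different code lengths, so the encoded bytes differ even when the overall size stays nearly the same.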

The compression ratio is almost unaffected by this change (the output is 2717 bytes smaller, roughly 0.00014%):

  • before the fix: 1880400486 bytes
  • after the fix: 1880397769 bytes

@AskAlexSharov
Collaborator

@mholt-dv, do you wanna include this fix in v2 files? Like erigon decompress | erigon compress?

FYI: I don't mind merging an algo change that changes the compression logic.

Collaborator

@AskAlexSharov AskAlexSharov left a comment


Also tested: no effect on the ratio, and "erigon snapshots diff" says that the files are identical. Let's merge.

@AskAlexSharov AskAlexSharov merged commit 9006d9b into devel Feb 24, 2024
7 checks passed
@AskAlexSharov AskAlexSharov deleted the pr/lcp_fix branch February 24, 2024 09:11