centrifugation: calculate simhash(body) and simhash(text(body)) #124
It'll enable heuristics that rely on page similarity to mine more blockpages out of the existing OONI dataset.
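For readers unfamiliar with the technique, here is a minimal sketch of what simhash(body) and simhash(text(body)) could look like. It is not the code from this PR: the tokenization, the shingle size, the 64-bit width, the use of BLAKE2b and the helper names `simhash64`, `text_of` and `hamming` are all assumptions made for illustration.

```python
import hashlib
import re
from html.parser import HTMLParser


def simhash64(data: bytes, shingle_words: int = 3) -> int:
    """64-bit simhash over word shingles (illustrative, not the PR's code)."""
    words = re.findall(rb"\w+", data.lower())
    if not words:
        return 0
    n = max(1, len(words) - shingle_words + 1)
    shingles = [b" ".join(words[i:i + shingle_words]) for i in range(n)]
    counts = [0] * 64
    for sh in shingles:
        # Hash each shingle to 64 bits; every bit votes +1 or -1.
        h = int.from_bytes(hashlib.blake2b(sh, digest_size=8).digest(), "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit of the result is set when the votes for it are positive.
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def text_of(body: bytes) -> bytes:
    """Assumed stand-in for text(body): visible text with markup stripped."""
    parser = _TextExtractor()
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.chunks).encode("utf-8")


def hamming(a: int, b: int) -> int:
    """Number of differing bits; a small distance suggests similar pages."""
    return bin(a ^ b).count("1")
```

In this sketch two pages with the same visible text but different markup get the same `simhash64(text_of(body))`, while `simhash64(body)` also reflects the markup, which is the reason to keep both values.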
The draft implementation included a non-trivial caching scheme that tried to avoid recomputing simhashes for known body_sha256 values. It turned out that the complexity of the code did not bring much benefit for the current state of OONI data. About 30% of pages are still exact duplicates coming from different probes, but runtime de-duplication takes as much real time as computing the hashes. Moreover, the existing code needed a non-trivial amount of RAM (over 512M) for the computation.
Here are approximate relative numbers of wall clock time needed to ingest a bucket of OONI data:
No simhash — 1.0
Old simhash — 4.8
New simhash(body) — 4.7
Same + simhash(text) — 6.0
Bucket ingestion currently takes 40..80 minutes, so it will become 4..8 hours. That's still okay, as ingestion is single-CPU and can easily utilize SMP when needed.
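The estimate can be checked with a few lines of arithmetic. This sketch assumes that the current 40..80 minutes correspond to the "No simhash" baseline of 1.0; the dictionary and function names are made up for illustration.

```python
# Relative wall clock costs from the list above.
RELATIVE_COST = {
    "no simhash": 1.0,
    "old simhash": 4.8,
    "new simhash(body)": 4.7,
    "same + simhash(text)": 6.0,
}


def projected_hours(current_minutes: float, variant: str) -> float:
    """Scale the current ingestion time by the relative cost of a variant."""
    factor = RELATIVE_COST[variant] / RELATIVE_COST["no simhash"]
    return current_minutes * factor / 60


low = projected_hours(40, "same + simhash(text)")
high = projected_hours(80, "same + simhash(text)")
print(f"{low:g}..{high:g} hours")
```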
An open problem is re-ingestion of data already stored in the http_control and http_request tables, as it will recompute all the simhashes without any good reason.
So the de-duplication code is dropped and simhashes are stored together with body_sha256.
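A rough sketch of the resulting shape, with no cache lookup: every body gets its hashes computed directly and they travel together in one record. The `BodyRow` type, its field names other than body_sha256, and the `simhash` and `text` callables are hypothetical placeholders, not the PR's actual schema or functions.

```python
import hashlib
from typing import Callable, NamedTuple


class BodyRow(NamedTuple):
    # Hypothetical record; only body_sha256 is a name taken from the PR.
    body_sha256: bytes
    simhash_body: int
    simhash_text: int


def body_row(
    body: bytes,
    simhash: Callable[[bytes], int],
    text: Callable[[bytes], bytes],
) -> BodyRow:
    """Compute all hashes for one body unconditionally, without de-duplication."""
    return BodyRow(
        body_sha256=hashlib.sha256(body).digest(),
        simhash_body=simhash(body),
        simhash_text=simhash(text(body)),
    )
```

The trade-off is that identical bodies from different probes are hashed repeatedly, which the timings above suggest costs about as much as the de-duplication did.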