Reduce number of collisions in the ptrack map #5

Closed
ololobus opened this issue Apr 21, 2021 · 0 comments · Fixed by #6
ololobus commented Apr 21, 2021

Previously we thought that 1 MB of map can track changes page-by-page in 1 GB of data files. However, it has recently become evident that our ptrack map, a basic hash table, behaves more like a Bloom filter with the number of hash functions k = 1. See https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives. Such a filter naturally has more collisions.
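
Here m is the number of map slots, n the number of changed pages, and k the number of hash functions, giving a false-positive (collision) probability of p = (1 - e^(-k*n/m))^k, which is what col_proba() in the script below computes; our current map is the k = 1 case.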

For example, for a database of size 22 GB and ptrack.map_size = 32 MB we previously expected almost zero collisions. However, theory says that by tracking 1 GB of changes we will mark an extra ~640 MB:

import math

# Bloom filter false-positive probability for n inserted elements,
# m slots, and k hash functions
def col_proba(n, m, k=1):
    return (1 - math.exp(-k*n/m))**k

map_size = 1024*1024      # 1 MB in bytes
map_size = map_size*32    # ptrack.map_size = 32 MB, in bytes
map_pages = map_size // 8 # number of map slots (one 64-bit LSN, i.e. 8 bytes, per slot)

change_size = 1000*1024         # size of changes in data files, in KB (~1 GB)
change_pages = change_size // 8 # number of changed 8 KB pages in the DB

db_pages = 22*1024*1024 // 8 # 22 GB database in 8 KB pages

proba = col_proba(change_pages, map_pages, 1)

print("Collision probability: ", proba)
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)

Collision probability:  0.030056617868655988
Extra data (MB):  647.0588694764261

Unfortunately, this seems to be true: an experiment confirms that we mark 1.6 GB of pages instead of 1 GB. The same experiment with a 16 MB map, a 24 GB database, and a 1 GB insert: theory says there should be ~1375 MB of extra blocks, and in reality it turns out to be ~1275 MB, which is very close.
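
For reference, the second experiment can be estimated with the same script by swapping in its parameters (the exact byte sizes below are assumptions, so the result may differ slightly from the figure above):

# same calculation for the second experiment (assumed sizes):
# 16 MB map, 24 GB database, ~1 GB of changes
m2 = 16*1024*1024 // 8  # map slots (8 bytes each)
n2 = 1000*1024 // 8     # changed 8 KB pages
db2 = 24*1024*1024 // 8 # database size in 8 KB pages

p2 = col_proba(n2, m2, 1)
print("Collision probability: ", p2)
print("Extra data (MB): ", p2*(db2 - n2)*8/1024)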

Theoretically, we can substantially reduce the collision rate by storing each page's LSN in two cells (i.e. k = 2):

Collision probability:  0.003505804615154038
Extra data (MB):  75.47296175503612
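
These figures come from the same calculation as above, just passing k = 2 to col_proba():

proba = col_proba(change_pages, map_pages, 2)
print("Collision probability: ", proba)
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)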

Let's try to do that.
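
As a rough sketch of the k = 2 scheme (not the actual C implementation; the hashing scheme and names below are purely illustrative), each page would write its LSN into two independently hashed slots, and a page would be reported as changed only if both slots hold a newer LSN:

import hashlib

N_SLOTS = 32*1024*1024 // 8  # e.g. a 32 MB map of 8-byte slots

def two_slots(page_key, n_slots=N_SLOTS):
    # derive two independent slot indices for a page (hypothetical hashing)
    h = hashlib.blake2b(page_key.encode()).digest()
    return (int.from_bytes(h[:8], "little") % n_slots,
            int.from_bytes(h[8:16], "little") % n_slots)

def mark_page(ptrack_map, page_key, lsn):
    # write the page LSN into both slots (k = 2), keeping the newest LSN
    for s in two_slots(page_key):
        ptrack_map[s] = max(ptrack_map[s], lsn)

def page_changed_since(ptrack_map, page_key, start_lsn):
    # a page is reported as changed only if BOTH slots hold a newer LSN,
    # so a false positive now requires two independent collisions
    return all(ptrack_map[s] >= start_lsn for s in two_slots(page_key))

# usage sketch: ptrack_map = [0] * N_SLOTS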

@ololobus ololobus added the enhancement label Apr 21, 2021
@ololobus ololobus self-assigned this Apr 21, 2021
@ololobus ololobus added this to the Ptrack 2.2.0 milestone Apr 21, 2021
ololobus added a commit that referenced this issue May 16, 2021
Resolve issues #5 and #1: reduce number of collisions in the ptrack map