Previously we thought that 1 MB of map could track page-to-page changes in 1 GB of data files. However, it has recently become evident that our ptrack map, a basic hash table, behaves more like a Bloom filter with the number of hash functions k = 1. See more here: https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives. Such a filter naturally has more collisions.
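For a map with m slots and n inserted pages, the false-positive probability of such a filter with k hash functions is approximately p = (1 − e^(−kn/m))^k (see the link above); the `col_proba` function below computes exactly this.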
For example, for a 22 GB database with ptrack.map_size = 32 MB, we previously expected to get almost zero collisions. However, theory says that by tracking 1 GB of changed data we will mark an extra ~640 MB of unchanged pages:
```python
import math

def col_proba(n, m, k=1):
    return (1 - math.exp(-k * n / m)) ** k

map_size = 1024 * 1024            # 1 MB in bytes
map_size = map_size * 32          # ptrack.map_size in bytes
map_pages = map_size // 8         # 64-bit uint or 8 bytes per slot
change_size = 1000 * 1024         # size of changes in data files in KB
change_pages = change_size // 8   # number of changed pages in DB
db_pages = 22 * 1024 * 1024 // 8  # 22 GB in pages
proba = col_proba(change_pages, map_pages, 1)
print("Collision probability: ", proba)
print("Extra data (MB): ", proba * (db_pages - change_pages) * 8 / 1024)
```
```
Collision probability: 0.030056617868655988
Extra data (MB): 647.0588694764261
```
Unfortunately, this seems to be true: an experiment confirms that we mark 1.6 GB of pages instead of 1 GB, i.e. roughly the predicted ~647 MB of false positives on top of the real changes. The same experiment with a 16 MB map, a 24 GB database, and a 1 GB insert: theory says there should be 1375 MB of extra blocks; in reality it appears to be ~1275 MB, which is very close.
Theoretically, we can substantially reduce the collision rate by storing each page's LSN into two cells (i.e. k = 2):
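Re-running the script above with k = 2 and all other definitions unchanged:

```python
proba = col_proba(change_pages, map_pages, 2)
print("Collision probability: ", proba)
print("Extra data (MB): ", proba * (db_pages - change_pages) * 8 / 1024)
```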
```
Collision probability: 0.003505804615154038
Extra data (MB): 75.47296175503612
```
Let's try to do that.
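For a rough picture of the change, here is a minimal sketch in Python of the two-cell idea; the helper names (`slots`, `mark_page`, `page_changed_since`) and the hash choice are hypothetical illustrations, not ptrack's actual C implementation:

```python
import hashlib

def slots(relfile, blkno, map_pages, k=2):
    # Derive k slot indexes from the page identity (hash choice is illustrative).
    h = hashlib.sha256(f"{relfile}:{blkno}".encode()).digest()
    return [int.from_bytes(h[i * 8:(i + 1) * 8], "little") % map_pages
            for i in range(k)]

def mark_page(ptrack_map, relfile, blkno, lsn):
    # On page update, store the LSN into both cells (keep the newest value).
    for s in slots(relfile, blkno, len(ptrack_map)):
        ptrack_map[s] = max(ptrack_map[s], lsn)

def page_changed_since(ptrack_map, relfile, blkno, start_lsn):
    # A page counts as changed only if BOTH cells carry an LSN at or above
    # start_lsn; a false positive now needs collisions in both cells,
    # which is exactly the k = 2 term in col_proba.
    return all(ptrack_map[s] >= start_lsn
               for s in slots(relfile, blkno, len(ptrack_map)))

# Usage: a toy 16-slot map, mark one page, then query two pages.
pmap = [0] * 16
mark_page(pmap, "base/16384/16397", 42, lsn=1000)
print(page_changed_since(pmap, "base/16384/16397", 42, start_lsn=900))  # True
print(page_changed_since(pmap, "base/16384/16397", 7, start_lsn=900))   # False unless both slots collide
```

The read-side AND over both cells is what buys the roughly quadratic drop in false positives; the cost is one extra map write per page update.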