Reduce number of collisions in the ptrack map #5

Closed
ololobus opened this issue Apr 21, 2021 · 0 comments · Fixed by #6
ololobus commented Apr 21, 2021

Previously we thought that 1 MB of map can track changes page-by-page in 1 GB of data files. However, it has recently become evident that our ptrack map, a basic hash table, behaves more like a Bloom filter with the number of hash functions k = 1. See https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives. Such a filter naturally has more collisions.
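
Here m is the number of map slots, n the number of changed pages, and k the number of hash functions, giving a false-positive (collision) probability of p = (1 - e^(-k*n/m))^k, which is what col_proba() in the script below computes; our current map is the k = 1 case.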

For example, for a database of size 22 GB and ptrack.map_size = 32 MB we previously expected almost zero collisions. However, theory says that by tracking 1 GB of changes we will mark an extra ~640 MB:

import math

# Bloom filter false-positive probability for n inserted elements,
# m slots, and k hash functions
def col_proba(n, m, k=1):
    return (1 - math.exp(-k*n/m))**k

map_size = 1024*1024      # 1 MB in bytes
map_size = map_size*32    # ptrack.map_size = 32 MB, in bytes
map_pages = map_size // 8 # number of map slots (one 64-bit LSN, i.e. 8 bytes, per slot)

change_size = 1000*1024         # size of changes in data files, in KB (~1 GB)
change_pages = change_size // 8 # number of changed 8 KB pages in the DB

db_pages = 22*1024*1024 // 8 # 22 GB database in 8 KB pages

proba = col_proba(change_pages, map_pages, 1)

print("Collision probability: ", proba)
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)

Collision probability:  0.030056617868655988
Extra data (MB):  647.0588694764261

Unfortunately, this seems to be true: an experiment confirms that we mark 1.6 GB of pages instead of 1 GB. The same experiment with a 16 MB map, a 24 GB database, and a 1 GB insert: theory says there should be ~1375 MB of extra blocks, and in reality it turns out to be ~1275 MB, which is very close.
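
For reference, the second experiment can be estimated with the same script by swapping in its parameters (the exact byte sizes below are assumptions, so the result may differ slightly from the figure above):

# same calculation for the second experiment (assumed sizes):
# 16 MB map, 24 GB database, ~1 GB of changes
m2 = 16*1024*1024 // 8  # map slots (8 bytes each)
n2 = 1000*1024 // 8     # changed 8 KB pages
db2 = 24*1024*1024 // 8 # database size in 8 KB pages

p2 = col_proba(n2, m2, 1)
print("Collision probability: ", p2)
print("Extra data (MB): ", p2*(db2 - n2)*8/1024)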

Theoretically, we can substantially reduce the collision rate by storing each page's LSN in two cells (i.e. k = 2):

Collision probability:  0.003505804615154038
Extra data (MB):  75.47296175503612
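
These figures come from the same calculation as above, just passing k = 2 to col_proba():

proba = col_proba(change_pages, map_pages, 2)
print("Collision probability: ", proba)
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)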

Let's try to do that.
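
As a rough sketch of the k = 2 scheme (not the actual C implementation; the hashing scheme and names below are purely illustrative), each page would write its LSN into two independently hashed slots, and a page would be reported as changed only if both slots hold a newer LSN:

import hashlib

N_SLOTS = 32*1024*1024 // 8  # e.g. a 32 MB map of 8-byte slots

def two_slots(page_key, n_slots=N_SLOTS):
    # derive two independent slot indices for a page (hypothetical hashing)
    h = hashlib.blake2b(page_key.encode()).digest()
    return (int.from_bytes(h[:8], "little") % n_slots,
            int.from_bytes(h[8:16], "little") % n_slots)

def mark_page(ptrack_map, page_key, lsn):
    # write the page LSN into both slots (k = 2), keeping the newest LSN
    for s in two_slots(page_key):
        ptrack_map[s] = max(ptrack_map[s], lsn)

def page_changed_since(ptrack_map, page_key, start_lsn):
    # a page is reported as changed only if BOTH slots hold a newer LSN,
    # so a false positive now requires two independent collisions
    return all(ptrack_map[s] >= start_lsn for s in two_slots(page_key))

# usage sketch: ptrack_map = [0] * N_SLOTS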

@ololobus ololobus added the enhancement label Apr 21, 2021
@ololobus ololobus self-assigned this Apr 21, 2021
@ololobus ololobus added this to the Ptrack 2.2.0 milestone Apr 21, 2021
ololobus added a commit that referenced this issue May 16, 2021
Resolve issues #5 and #1: reduce number of collisions in the ptrack map