# Private blocking with SNC-Size
For Sorted Neighborhood Clustering (SNC) we define:

- $D^A$ as a set of $n_{D^A}$ records from database A;
- $D^B$ as a set of $n_{D^B}$ records from database B;
- $k$ as the minimum number of elements in a cluster;
- an ordered list $R$ of $n_R$ (public) reference values.

The value of $n_R$ can be determined using:

$$n_R = \frac{\min(n_{D^A}, n_{D^B})}{k}$$






In [1]:
Da = [
    # ID     Surname   Firstname
    ('RA1', 'millar', 'robert'),
    ('RA2', 'marten', 'amyas'),
    ('RA3', 'melar', 'gail'),
    ('RA4', 'miller', 'robart'),
    ('RA5', 'morley', 'william'),
    ('RA6', 'philips', 'colin'),
    ('RA7', 'smith', 'alisen'),
    ('RA8', 'sampson', 'taylor')
]

Db = [
    # ID     Surname   Firstname    
    ('RB1', 'millar', 'robert'),
    ('RB2', 'marris', 'roberto'),
    ('RB3', 'malar', 'gayle'),
    ('RB4', 'perris', 'charles'),
    ('RB5', 'robbins', 'william'),
    ('RB6', 'robertson', 'amy'),
    ('RB7', 'samuell', 'tailor'),
    ('RB8', 'smeeth', 'rupert'),
    ('RB9', 'smith', 'alison')
]

k = 3

R = {
    1: 'miller',
    2: 'robinson',
    3: 'smith'
}   
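
As a quick check of the formula for $n_R$ against this example, the sketch below plugs in the sizes of `Da` and `Db` and the value of `k`. The quotient is not an integer here; rounding it up is an assumption of this sketch (the text does not state a rounding rule), chosen because it matches the three reference values in `R`.

```python
import math

# Sizes of Da and Db and the minimum cluster size, as used in this example.
n_a, n_b, k = 8, 9, 3

n_r_exact = min(n_a, n_b) / k  # 8 / 3, not an integer
n_r = math.ceil(n_r_exact)     # assumption: round up to a whole number of reference values

print(n_r)  # 3
```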

Clusters are created by inserting records into $R$ based on their sorting key value (SKV). The SKV can be any value that positions records consistently relative to the values of $R$. In this example, the SKV is the concatenation of surname and first name.
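
To illustrate how an SKV is positioned relative to the reference values, here is a minimal sketch based on plain string comparison and a binary search. The helper `cluster_index` and the list `reference_values` are hypothetical names introduced only for this illustration. Records sorting after the last reference value are assumed to fall into the last cluster.

```python
from bisect import bisect_left

# Reference values from this example, in sorted order.
reference_values = ['miller', 'robinson', 'smith']

def cluster_index(skv, refs):
    """Return the 1-based cluster for an SKV (hypothetical helper)."""
    # bisect_left counts the reference values that sort strictly before the SKV;
    # SKVs beyond the last reference value are capped to the last cluster.
    return min(bisect_left(refs, skv), len(refs) - 1) + 1

for surname, firstname in [('millar', 'robert'), ('miller', 'robart'), ('smith', 'alisen')]:
    skv = surname + firstname
    print(skv, '->', cluster_index(skv, reference_values))
```

Note that `millarrobert` sorts before `miller` while `millerrobart` sorts after it, so these two similar records end up in neighboring clusters.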

In [2]:
def get_skv(record):
    """Composes a sorting key value for the provided record."""
    return record[1] + record[2]

def cluster(records):
    """Creates clusters based on the provided records."""
    records = sorted(records, key=lambda rec: get_skv(rec))
    clusters = {i:[] for i in R.keys()}
    
    cursor = 1
    for rec in records:
        while get_skv(rec) > R[cursor] and cursor < max(R.keys()):
            cursor += 1
            
        clusters[cursor].append(rec)
        
    return clusters

In [3]:
Da_clusters = cluster(Da)
Da_clusters

{1: [('RA2', 'marten', 'amyas'),
  ('RA3', 'melar', 'gail'),
  ('RA1', 'millar', 'robert')],
 2: [('RA4', 'miller', 'robart'),
  ('RA5', 'morley', 'william'),
  ('RA6', 'philips', 'colin')],
 3: [('RA8', 'sampson', 'taylor'), ('RA7', 'smith', 'alisen')]}

In [4]:
Db_clusters = cluster(Db)
Db_clusters

{1: [('RB3', 'malar', 'gayle'),
  ('RB2', 'marris', 'roberto'),
  ('RB1', 'millar', 'robert')],
 2: [('RB4', 'perris', 'charles'),
  ('RB5', 'robbins', 'william'),
  ('RB6', 'robertson', 'amy')],
 3: [('RB7', 'samuell', 'tailor'),
  ('RB8', 'smeeth', 'rupert'),
  ('RB9', 'smith', 'alison')]}

We can merge clusters together to make sure we only have clusters with at least $k$ elements. 

With SNC-Size we merge solely based on size:

- find the smallest cluster $c_i$ with fewer than $k$ elements;
- merge $c_i$ with the smaller of its two neighbors $c_{i-1}$ and $c_{i+1}$;
- repeat these steps until only clusters with at least $k$ elements remain.

Each resulting cluster is assigned an ID that indicates the range of $R$ values it covers.
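
The merge steps can be sketched compactly on a list of blocks, each stored as `(first_ref, last_ref, records)` in reference order. This is a simplified illustration, not the implementation used in this notebook: the function `merge_by_size` and the placeholder records are hypothetical, and ties between equally sized neighbors are resolved in favor of the upper neighbor as an assumption.

```python
def merge_by_size(blocks, k):
    """Merge undersized blocks with their smaller neighbor (illustrative sketch)."""
    blocks = [(lo, hi, list(recs)) for lo, hi, recs in blocks]  # work on a copy
    while len(blocks) > 1:
        # Position of the smallest block; the first one wins on ties.
        i = min(range(len(blocks)), key=lambda j: len(blocks[j][2]))
        if len(blocks[i][2]) >= k:
            break  # every block already holds at least k records
        # Existing neighbors, upper one first so that it wins on ties (assumption).
        neighbors = [j for j in (i + 1, i - 1) if 0 <= j < len(blocks)]
        j = min(neighbors, key=lambda n: len(blocks[n][2]))
        lower, upper = min(i, j), max(i, j)
        # The merged block covers the reference range of both parts.
        merged = (blocks[lower][0], blocks[upper][1],
                  blocks[lower][2] + blocks[upper][2])
        blocks[lower:upper + 1] = [merged]
    return blocks

# Cluster sizes 3, 3 and 2 with k = 3, using placeholder records.
example = [(1, 1, ['a', 'b', 'c']), (2, 2, ['d', 'e', 'f']), (3, 3, ['g', 'h'])]
result = merge_by_size(example, k=3)
print([(lo, hi, len(recs)) for lo, hi, recs in result])  # [(1, 1, 3), (2, 3, 5)]
```

The undersized third block is absorbed by its only neighbor, so the merged block receives the ID range `(2, 3)`.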

In [5]:
def above_cluster(i, clusters):
    """Returns the cluster above the indicated cluster."""
    i_max = max(clusters.keys())
    if i == i_max:
        return (i, None)

    cluster = None
    while cluster is None and i < i_max:
        i += 1
        cluster = clusters.get(i, None)

    return (i, cluster)

def below_cluster(i, clusters):
    """Returns the cluster below the indicated cluster."""
    i_min = min(clusters.keys())
    if i == i_min:
        return (i, None)

    cluster = None
    while cluster is None and i > i_min:
        i -= 1
        cluster = clusters.get(i, None)

    return (i, cluster)

def get_smallest(clusters):
    """Returns the smallest cluster."""
    return min(clusters.values(), key=len)

def has_too_small(k, clusters):
    """Indicates if there are clusters that are smaller than k."""
    return len(get_smallest(clusters)) < k

def get_cluster_ids(clusters):
    """Returns IDs for provided clusters."""
    keys = list(clusters.keys())
    keys_n = len(keys)
    
    for i in range(keys_n):
        if i < keys_n-1:
            to = keys[i+1] - 1
        else:
            to = max(list(R.keys()))

        yield (keys[i], to)         

def merge_clusters(clusters):
    """Merges clusters based on size."""
    while has_too_small(k, clusters):
        # Look the smallest cluster up by key: keys stop being contiguous after a merge,
        # so a position in the value list cannot be used as a key.
        smallest_i = min(clusters, key=lambda i: len(clusters[i]))
        
        above_i, above_c = above_cluster(smallest_i, clusters)
        below_i, below_c = below_cluster(smallest_i, clusters)

        if below_c is None and above_c is None:
            break
            
        if below_c is None:
            clusters[smallest_i] += clusters[above_i]
            clusters.pop(above_i)
        elif above_c is None or len(below_c) < len(above_c):
            clusters[below_i] += clusters[smallest_i]
            clusters.pop(smallest_i)
        else:
            clusters[smallest_i] += clusters[above_i]
            clusters.pop(above_i)
            
    ids = get_cluster_ids(clusters)
    return {c_id: c for (c_id, c) in zip(ids, clusters.values())}

In [6]:
Da_merged = merge_clusters(Da_clusters)
Da_merged

{(1, 1): [('RA2', 'marten', 'amyas'),
  ('RA3', 'melar', 'gail'),
  ('RA1', 'millar', 'robert')],
 (2, 3): [('RA4', 'miller', 'robart'),
  ('RA5', 'morley', 'william'),
  ('RA6', 'philips', 'colin'),
  ('RA8', 'sampson', 'taylor'),
  ('RA7', 'smith', 'alisen')]}

In [7]:
Db_merged = merge_clusters(Db_clusters)
Db_merged

{(1, 1): [('RB3', 'malar', 'gayle'),
  ('RB2', 'marris', 'roberto'),
  ('RB1', 'millar', 'robert')],
 (2, 2): [('RB4', 'perris', 'charles'),
  ('RB5', 'robbins', 'william'),
  ('RB6', 'robertson', 'amy')],
 (3, 3): [('RB7', 'samuell', 'tailor'),
  ('RB8', 'smeeth', 'rupert'),
  ('RB9', 'smith', 'alison')]}

After creating clusters, we can indicate candidate pairs for comparison by linking blocks with overlapping ID-ranges.

In [8]:
def has_overlap(a, b):
    """Indicates if there is overlap in ID-range."""
    # Two inclusive ranges overlap when each one starts no later than the other ends.
    # This also covers the case where b fully contains a.
    return a[0] <= b[1] and b[0] <= a[1]

a_keys = Da_merged.keys()
b_keys = Db_merged.keys()

candidates = {a:[b for b in b_keys if has_overlap(a, b)] for a in a_keys}
for a in candidates.keys():
    print('Compare A{} with {}'.format(a, ', '.join(['B{}'.format(b) for b in candidates[a]])))

Compare A(1, 1) with B(1, 1)
Compare A(2, 3) with B(2, 2), B(3, 3)
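
To make the effect of blocking concrete, the sketch below expands the overlapping block pairs into record pairs and compares their number with the full cross product. It uses only the record IDs from the merged clusters shown above. The names `blocks_a`, `blocks_b`, `ranges_overlap` and `candidate_pairs` are hypothetical and introduced for this illustration only.

```python
from itertools import product

# Record IDs per merged block, copied from the outputs above.
blocks_a = {
    (1, 1): ['RA2', 'RA3', 'RA1'],
    (2, 3): ['RA4', 'RA5', 'RA6', 'RA8', 'RA7'],
}
blocks_b = {
    (1, 1): ['RB3', 'RB2', 'RB1'],
    (2, 2): ['RB4', 'RB5', 'RB6'],
    (3, 3): ['RB7', 'RB8', 'RB9'],
}

def ranges_overlap(a, b):
    """True if the inclusive ID ranges a and b share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

def candidate_pairs(blocks_a, blocks_b):
    """Yield every record pair from blocks with overlapping ID ranges."""
    for a_range, a_ids in blocks_a.items():
        for b_range, b_ids in blocks_b.items():
            if ranges_overlap(a_range, b_range):
                yield from product(a_ids, b_ids)

pairs = list(candidate_pairs(blocks_a, blocks_b))
total = sum(map(len, blocks_a.values())) * sum(map(len, blocks_b.values()))
print('{} of {} possible pairs'.format(len(pairs), total))  # 39 of 72 possible pairs
```

In this small example, blocking leaves 39 of the 72 possible record pairs for comparison.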


## References
- Vatsalan and Christen, [Sorted Nearest Neighborhood Clustering for Efficient Private Blocking](https://www.semanticscholar.org/paper/Sorted-Nearest-Neighborhood-Clustering-for-Efficie-Vatsalan-Christen/db50cda21cc0ae68e65fd40f0bb472decad6e4ec)