# Linkage and Blocking Evaluation and Tuning Tool

The purpose of this tool is to evaluate the options for implementing blocking in [anonlink](https://anonlink-entity-service.readthedocs.io/en/stable/) for CODI datasets. Currently, anonlink uses the blocklib library, which supports two blocking methods:

- “p-sig”: Probabilistic signature
- “lambda-fold”: LSH-based lambda-fold

These methods were proposed in the following publications:

- [Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases](https://arxiv.org/abs/1712.09691)
- [An LSH-Based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage](https://www.computer.org/csdl/journal/tk/2015/04/06880802/13rRUxASubY)
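
To make the idea of blocking concrete, here is a minimal sketch of signature-style blocking in plain Python. It does not use blocklib's API or either published algorithm. The toy records, the `signature` function (first letter of the last name plus birth year), and the helper names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy records: (record_id, first_name, last_name, birth_year)
dataset_a = [
    ("a1", "Maria", "Lopez", 1990),
    ("a2", "John", "Smith", 1985),
    ("a3", "Joan", "Smyth", 1985),
]
dataset_b = [
    ("b1", "Maria", "Lopez", 1990),
    ("b2", "Jon", "Smith", 1985),
    ("b3", "Ana", "Kim", 2001),
]


def signature(record):
    """Illustrative blocking key: first letter of last name + birth year."""
    _, _, last_name, birth_year = record
    return f"{last_name[0].upper()}{birth_year}"


def build_blocks(records):
    """Group record ids by their blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[signature(record)].append(record[0])
    return blocks


def candidate_pairs(blocks_a, blocks_b):
    """Pair up only the records that share a blocking key."""
    pairs = set()
    for key in blocks_a.keys() & blocks_b.keys():
        pairs.update(product(blocks_a[key], blocks_b[key]))
    return pairs


pairs = candidate_pairs(build_blocks(dataset_a), build_blocks(dataset_b))
total = len(dataset_a) * len(dataset_b)
print(f"{len(pairs)} of {total} possible comparisons remain after blocking")
```

Records that share no blocking key (such as `b3` here) are never compared, which is where both the speed-up and the risk of missed matches come from.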

We will adjust the blocklib configuration (the type of blocking, the encoding, and the threshold) and evaluate multiple runs of our linkage tools against an example data set. The evaluation metrics include:

- Precision
- Recall
- Reduction Ratio
- Set completeness
- Performance based on average block size

## Useful Terminology

- Blocking – a technique that makes record linkage scalable. Datasets are partitioned into groups, called blocks, and only records in corresponding blocks are compared. This reduces the number of comparisons needed to find which pairs of records should be linked.
- Bloom filter – a probabilistic data structure used to test set membership. It reports that an element may be in a set, or that it definitely is not.
- Precision – the fraction of found matches that are actual matches (true positives : all found matches)
- Recall – the fraction of actual matches that were found (true positives : all actual matches)
- Reduction Ratio – the proportion of record comparisons avoided by using blocking
- Set Completeness – the proportion of true matches that are retained after blocking
- Feature hashing – a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix
- `p-sig` signature – a subrecord of an entity that can be used to link multiple records belonging to the same entity
- `p-sig` blocking keys – lower the cost of comparison between datasets by selecting partitions of the raw records (e.g. first name, last name, postal code). *It is assumed that records sharing no blocking keys do not match each other.*
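
The four metrics above can be sketched on sets of record pairs. This is an illustrative calculation under stated assumptions, not the scoring code used later in this notebook: the function name `linkage_metrics` and the toy pairs are hypothetical, and "Set Completeness" is interpreted here as the share of true pairs that survive blocking.

```python
def linkage_metrics(true_pairs, candidate_pairs, found_pairs, total_pairs):
    """Compute linkage metrics from sets of (id_a, id_b) pairs.

    true_pairs      -- pairs that really match (the answer key)
    candidate_pairs -- pairs left to compare after blocking
    found_pairs     -- pairs the linkage finally proposed as matches
    total_pairs     -- number of comparisons without any blocking
    """
    true_positives = len(found_pairs & true_pairs)
    return {
        "precision": true_positives / len(found_pairs) if found_pairs else 0.0,
        "recall": true_positives / len(true_pairs) if true_pairs else 0.0,
        "reduction_ratio": 1 - len(candidate_pairs) / total_pairs,
        "set_completeness": (
            len(candidate_pairs & true_pairs) / len(true_pairs) if true_pairs else 0.0
        ),
    }


# Hypothetical example: two datasets of 4 records each (16 possible comparisons)
true_pairs = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
candidate_pairs = {("a1", "b1"), ("a2", "b2"), ("a3", "b2")}
found_pairs = {("a1", "b1"), ("a3", "b2")}

metrics = linkage_metrics(true_pairs, candidate_pairs, found_pairs, total_pairs=16)
for name, value in metrics.items():
    print(f"{name}: {value:0.3f}")
```

In this toy case blocking removes 13 of 16 comparisons but also drops the true pair `("a4", "b4")`, so no threshold setting can recover it afterwards.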


## Setting up the Environment

- The basic requirement is that you have [data-owner-tools](https://github.com/mitre/data-owner-tools) set up, with all of the Synthetic Denver sites extracted via `extract.py` and named with the format pii_\*.csv (e.g. pii_ch.csv) for all 5 sites. The scripts can easily be adjusted to work with other data.

In [41]:
from __future__ import print_function
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import seaborn as sns
from IPython.display import FileLink, FileLinks
import qgrid
import csv
import json
from pathlib import Path
import dcctools.config
from itertools import combinations
import importlib
import time

In [45]:
# Only needed if edits are made to dcctools.config
importlib.reload(dcctools.config)

<module 'dcctools.config' from '/Users/apellitieri/Desktop/CDC/CODI/linkage-agent-tools/dcctools/config.py'>

### Setting up a new run:

Set an optional run identifier or description string for the row that this run will add to the table of run data:

In [42]:
run_description = 'File reorg run'

Set path to data-owner-tools project:

In [43]:
DATA_OWNER_TOOLS_DIR = '/Users/apellitieri/Desktop/CDC/CODI/data-owner-tools/'
SECRET_FILE = '/Users/apellitieri/Desktop/CDC/CODI/deidentification_secret.txt'

Ensure the blocking schema file to use for the run is set correctly in `config.json`:

In [48]:
CONFIG = dcctools.config.Configuration("config.json")
print('Config file in use:')
with open('config.json', 'r') as config:
    config_data = json.load(config)
    print(json.dumps(config_data, indent=2))
blocking_schema_file = ''
if CONFIG.blocked:
    blocking_schema_file = CONFIG.blocking_schema
    with open(blocking_schema_file, 'r') as blocking_schema:
        schema_data = json.load(blocking_schema)
        print('\nBlocking schema being used:')
        print(json.dumps(schema_data, indent=2))
else:
    blocking_schema_file = 'None'
    print('\nNo blocking set to be used on this run.')

Config file in use:
{
  "systems": [
    "site_a",
    "site_b",
    "site_c",
    "site_d",
    "site_e",
    "site_f"
  ],
  "projects": [
    "name-sex-dob-phone",
    "name-sex-dob-zip",
    "name-sex-dob-parents",
    "name-sex-dob-addr"
  ],
  "schema_folder": "/Users/apellitieri/Desktop/CDC/CODI/data-owner-tools/example-schema",
  "inbox_folder": "/Users/apellitieri/Desktop/CDC/CODI/inbox",
  "matching_results_folder": "/Users/apellitieri/Desktop/CDC/CODI/results",
  "output_folder": "/Users/apellitieri/Desktop/CDC/CODI/output",
  "entity_service_url": "http://localhost:8851/api/v1",
  "matching_threshold": 0.75,
  "mongo_uri": "localhost:27017",
  "blocked": false,
  "blocking_schema": "/Users/apellitieri/Desktop/CDC/CODI/data-owner-tools/example-blocking-schema/lambda.json",
  "household_match": true,
  "household_schema": "/Users/apellitieri/Desktop/CDC/CODI/data-owner-tools/example-schema/household-schema/fn-phone-addr-zip.json"
}

No blocking set to be used on this run.


## Garble and block with anonlink

The following cell runs the script that garbles the pii_\*.csv files, then blocks and packages the CLKs into the inbox folder.

[TODO]: Make the garble scripts not specific to the Synthetic Denver sites. Currently the name of the script has to be changed depending on whether Synthetic Denver or the new site is in use.

[TODO]: Figure out a way to record the blocking statistics from the run. The anonlink client blocking program prints statistics about the blocking run but does not provide the output in an easily digestible format. The important values to look at are the maximum and average block size. From the [anonlink-client documentation](https://anonlink-client.readthedocs.io/en/stable/tutorial/Blocking%20with%20Anonlink%20Entity%20Service.html#Blocking):
```
The record linkage run time will be largely dominated by the maximum block size, and the number of blocks. In general the smaller the average block size, the better.
```
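
Until the TODO above is resolved, the two statistics of interest are simple to compute if the block assignments are available as a mapping from block key to record ids. The sketch below assumes such a mapping exists; its shape and the function name `block_size_stats` are hypothetical and do not reflect anonlink-client's actual output format.

```python
from statistics import mean


def block_size_stats(blocks):
    """Summarize block sizes from a {block_key: [record_id, ...]} mapping."""
    sizes = [len(record_ids) for record_ids in blocks.values()]
    if not sizes:
        return {"num_blocks": 0, "max_block_size": 0, "avg_block_size": 0}
    return {
        "num_blocks": len(sizes),
        "max_block_size": max(sizes),
        "avg_block_size": mean(sizes),
    }


# Hypothetical block assignments for illustration only
example_blocks = {
    "L1990": ["a1"],
    "S1985": ["a2", "a3"],
    "K2001": ["a4", "a5", "a6"],
}

stats = block_size_stats(example_blocks)
print(stats)
```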

In [49]:
inbox_folder = CONFIG.inbox_folder
garble_start = time.perf_counter()
if CONFIG.blocked:
    shell_script = "{}testing-and-tuning/blocking-garble.sh".format(DATA_OWNER_TOOLS_DIR)
    !$shell_script {blocking_schema_file} {inbox_folder}
else:
    shell_script = "{}testing-and-tuning/garble.sh".format(DATA_OWNER_TOOLS_DIR)
    !$shell_script {inbox_folder} {SECRET_FILE}
garble_end = time.perf_counter()
garble_time = garble_end - garble_start

Cleaning inbox...
Running garble.py for site_a
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
Grouping individuals into households: 100%|███| 751/751 [01:38<00:00,  7.64it/s]
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
Running garble.py for site_b
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
Grouping individuals into households: 100%|███| 751/751 [01:33<00:00,  8.02it/s]
CLK data written to output/households/fn-phone-addr-zip.json
Zip file crea

In [50]:
print(f"Garble and block (if enabled) took {garble_time:0.2f} seconds")

Garble and block (if enabled) took 626.09 seconds
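
The `!` shell magic used in the garble cell only works inside IPython or Jupyter. If the same timing is needed from a plain Python script, one option is a small wrapper around `subprocess.run`. This is a sketch only: `timed_run` is a hypothetical helper, not part of this project, and the command shown is a stand-in so the example runs anywhere.

```python
import subprocess
import sys
import time


def timed_run(command):
    """Run a command, raise CalledProcessError if it fails, return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


# Stand-in command; in this notebook the list would instead hold the garble
# script path followed by its arguments.
elapsed = timed_run([sys.executable, "-c", "pass"])
print(f"Command took {elapsed:0.2f} seconds")
```

Using `check=True` means a failing script stops the run instead of silently recording a misleading time.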


In [51]:
!python validate.py

All necessary input is present


In [52]:
# Need to drop database collection here if previous run took place
!python drop.py

Database cleared.


In [54]:
match_start = time.perf_counter()
!python match.py
match_end = time.perf_counter()
match_time = match_end - match_start

{'current_stage': {'description': 'waiting for CLKs', 'number': 1, 'progress': {'absolute': 6, 'description': 'number of parties already contributed', 'relative': 1.0}}, 'stages': 3, 'state': 'created', 'time_added': '2021-09-12T23:16:40.243880+00:00', 'time_started': None}
{'current_stage': {'description': 'waiting for CLKs', 'number': 1, 'progress': {'absolute': 6, 'description': 'number of parties already contributed', 'relative': 1.0}}, 'stages': 3, 'state': 'created', 'time_added': '2021-09-12T23:16:40.243880+00:00', 'time_started': None}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'queued', 'time_added': '2021-09-12T23:16:40.243880+00:00', 'time_started': None}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'queued', 'time_added': '2021-09-12T23:16:40.243880+00:00', 'time_started': None}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3

{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'queued', 'time_added': '2021-09-12T23:17:02.563037+00:00', 'time_started': None}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:02.563037+00:00', 'time_started': '2021-09-12T23:17:10.049761+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:02.563037+00:00', 'time_started': '2021-09-12T23:17:10.049761+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:02.563037+00:00', 'time_started': '2021-09-12T23:17:10.049761+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:02.563037+00:00', 'time_started': '2021-0

{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:28.321381+00:00', 'time_started': '2021-09-12T23:17:35.970545+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:28.321381+00:00', 'time_started': '2021-09-12T23:17:35.970545+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:28.321381+00:00', 'time_started': '2021-09-12T23:17:35.970545+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:28.321381+00:00', 'time_started': '2021-09-12T23:17:35.970545+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:17:28.321381+

{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:00.221856+00:00', 'time_started': '2021-09-12T23:18:07.810075+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:00.221856+00:00', 'time_started': '2021-09-12T23:18:07.810075+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:00.221856+00:00', 'time_started': '2021-09-12T23:18:07.810075+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:00.221856+00:00', 'time_started': '2021-09-12T23:18:07.810075+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2, 'progress': {'absolute': 8452506, 'description': 'number of already computed

{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:29.115037+00:00', 'time_started': '2021-09-12T23:18:36.983225+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:29.115037+00:00', 'time_started': '2021-09-12T23:18:36.983225+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:29.115037+00:00', 'time_started': '2021-09-12T23:18:36.983225+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:29.115037+00:00', 'time_started': '2021-09-12T23:18:36.983225+00:00'}
{'current_stage': {'description': 'compute similarity scores', 'number': 2}, 'stages': 3, 'state': 'running', 'time_added': '2021-09-12T23:18:29.115037+

In [55]:
print(f"Matching took {match_time:0.2f} seconds")

Matching took 161.51 seconds


In [56]:
!python link_ids.py

/Users/apellitieri/Desktop/CDC/CODI/results/link_ids.csv created
Before deconflict: 856
After deconflict and before add singles: 856
Final linkage count: 1783
Exact individual links found in pprl household links: 1092
Number of individual links conflicting with pprl links: 1240
Number of individual links combined into PPRL links: 1119
Number of individual links skipped from previous conflict: 0
/Users/apellitieri/Desktop/CDC/CODI/results/household_link_ids.csv created


In [57]:
!python -m tuning-files-scripts.patid_translate --dotools {DATA_OWNER_TOOLS_DIR}

results/patid_link_ids.csv created


## Record Results

In [58]:
!python -m tuning-files-scripts.household_score --dotools {DATA_OWNER_TOOLS_DIR}

Pair-wise scoring:
Precision: 0.944079810974009 Recall: 0.7094101400670744 F-Score: 0.8100923631448524
Perfect scoring:
Precision: 0.5027932960893855 Recall: 0.5990016638935108 F-Score: 0.5466970387243737
Partial scoring:
Precision: 0.9040307101727447 Recall: 0.78369384359401 F-Score: 0.8395721925133689


In [59]:
RESULTS_PATH = '/Users/apellitieri/Desktop/CDC/CODI/results'
ANSWER_KEY_CSV = '/Users/apellitieri/Desktop/CDC/CODI/new_answer_key.csv'

In [60]:
run_precision = 0
run_recall = 0
run_f_score = 0
answer_key_length = 0
proposed_pairs_count = 0

systems = CONFIG.systems
threshold = CONFIG.matching_threshold

true_positives = 0
false_positives = 0

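# Load the answer key: keep only rows flagged as true matches (column 3 is '1').
# Pairs are stored as sorted tuples so that (a, b) and (b, a) compare equal.
answer_key = []

with open(ANSWER_KEY_CSV) as ak_csv:
    ak_reader = csv.reader(ak_csv)
    next(ak_reader)  # skip the header row
    for row in ak_reader:
        if row[3] == '1':
            answer_key.append(tuple(sorted([row[1], row[2]])))

answer_key_length = len(answer_key)
# Set membership avoids rescanning the whole list for every proposed pair.
answer_key_set = set(answer_key)

patid_csv_path = Path(RESULTS_PATH) / "patid_link_ids.csv"

with open(patid_csv_path) as patid_csv:
    pat_id_reader = csv.reader(patid_csv)
    next(pat_id_reader)  # skip the header row
    for row in pat_id_reader:
        # NOTE: row[1:6] reads five columns as patient IDs; adjust the slice
        # if the file has a different number of ID columns. Empty cells are dropped.
        patids = [patid for patid in row[1:6] if patid]
        # Every pair of IDs within a linked row counts as one proposed match.
        for a, b in combinations(patids, 2):
            pair = tuple(sorted([a, b]))
            if pair in answer_key_set:
                true_positives += 1
            else:
                false_positives += 1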

run_precision = true_positives / (true_positives + false_positives)
run_recall = true_positives / answer_key_length
run_f_score = 2 * ((run_precision * run_recall) / (run_precision + run_recall))
proposed_pairs_count = true_positives + false_positives

In [61]:
print(f"Precision: {run_precision:0.2f}\nRecall: {run_recall:0.2f}\nF-Score: {run_f_score:0.2f}")

Precision: 0.99
Recall: 0.60
F-Score: 0.74
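
The printed values follow directly from the counts computed in the scoring cell. Writing $TP$ for `true_positives`, $FP$ for `false_positives`, and $K$ for `answer_key_length`:

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{K} \\
F &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
   = \frac{2\,TP}{(TP + FP) + K}
\end{aligned}
```

For this run, the run table at the end of the notebook records 4221 true positives, 4267 proposed pairs, and an answer key of 7082 pairs, which gives $4221 / 4267 \approx 0.99$, $4221 / 7082 \approx 0.60$, and $F = 8442 / 11349 \approx 0.74$.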


In [62]:
with open('tuning-files-scripts/example_run_data.csv', 'a', newline='') as csvfile:
    fieldnames = ['Run Description', 'Blocking', 'Match Threshold', 'Precision',
                  'Recall', 'F-Score', 'Answer Key Size',
                  'Proposed Pairs', 'True Positives', 'Garble & Block Time (s)',
                  'Match Time (s)', 'Blocking Config']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # writer.writeheader() # Remove after initial run
    writer.writerow({
        'Run Description': run_description, 'Blocking': CONFIG.blocked,
        'Match Threshold': threshold,
        'Precision': run_precision, 'Recall': run_recall, 'F-Score': run_f_score,
        'Answer Key Size': answer_key_length, 'Proposed Pairs': proposed_pairs_count,
        'True Positives': true_positives,
        'Garble & Block Time (s)': garble_time,
        'Match Time (s)': match_time, 'Blocking Config': blocking_schema_file
    })
print("Successfully added run to example_run_data.csv!")

Successfully added run to example_run_data.csv!


In [63]:
pd.read_csv('tuning-files-scripts/example_run_data.csv')

| Unnamed: 0 | Run Description | Blocking | Match Threshold | Precision | Recall | F-Score | Answer Key Size | Proposed Pairs | True Positives | Garble & Block Time (s) | Match Time (s) | Blocking Config |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lambda-fold run 1 (Synthetic Denver) | True | 0.8 | 0.941725 | 0.83299 | 0.884026 | 970 | 858 | | 22.998019 | 81.662922 | /Users/apellitieri/Desktop/CDC/CODI/data-owner... |
| 1 | No Blocking Run 1 (Synthetic Denver) | False | 0.8 | 0.942263 | 0.841237 | 0.888889 | 970 | 866 | | 13.903272 | 76.427395 | |
| 2 | No Blocking Run 1 (New Data Set) | False | 0.8 | 0.996756 | 0.563965 | 0.720354 | 7082 | 4007 | | 15.183644 | 89.882657 | |
| 3 | No Blocking Run 2 (New Data Set) | False | 0.8 | 0.996756 | 0.563965 | 0.720354 | 7082 | 4007 | 3994.0 | 15.443283 | 100.035062 | |
| 4 | No Blocking Run 3 (New Data Set) | False | 0.75 | 0.988751 | 0.595736 | 0.743502 | 7082 | 4267 | 4219.0 | 15.563336 | 91.285207 | |
| 5 | No Blocking Run 4 (New Data Set) | False | 0.7 | 0.970907 | 0.607879 | 0.747655 | 7082 | 4434 | 4305.0 | 16.600817 | 97.703846 | |
| 6 | No Blocking Run 4 (New Data Set) | False | 0.65 | 0.955512 | 0.615645 | 0.748819 | 7082 | 4563 | 4360.0 | 15.385254 | 92.289256 | |
| 7 | No Blocking Run 4 (New Data Set) | False | 0.5 | 0.895652 | 0.450861 | 0.599793 | 7082 | 3565 | 3193.0 | 15.295187 | 134.370978 | |
| 8 | Lambda Run 1 (New Data Set) | True | 0.65 | 0.955566 | 0.601243 | 0.738083 | 7082 | 4456 | 4258.0 | 28.138004 | 103.387432 | /Users/apellitieri/Desktop/CDC/CODI/data-owner... |
| 9 | File reorg run | False | 0.75 | 0.98922 | 0.596018 | 0.743854 | 7082 | 4267 | 4221.0 | 626.093749 | 161.507163 | |
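
Once several runs are recorded, the table can be used to compare settings. The sketch below uses only the standard library and three rows copied from the table above (with a subset of its columns); in practice the rows would be read from `example_run_data.csv` instead of an inline string.

```python
import csv
import io

# Three rows copied from the run table above (subset of columns)
RUN_DATA = """Run Description,Match Threshold,Precision,Recall,F-Score
No Blocking Run 3 (New Data Set),0.75,0.988751,0.595736,0.743502
No Blocking Run 4 (New Data Set),0.65,0.955512,0.615645,0.748819
Lambda Run 1 (New Data Set),0.65,0.955566,0.601243,0.738083
"""

rows = list(csv.DictReader(io.StringIO(RUN_DATA)))

# Pick the run with the highest F-Score; values are parsed from text to float
best = max(rows, key=lambda row: float(row["F-Score"]))
print(
    f"Best F-Score: {best['Run Description']} "
    f"at threshold {best['Match Threshold']} "
    f"(F = {float(best['F-Score']):0.3f})"
)
```

Ranking by F-Score alone hides the trade-off between precision and recall, so it is worth reading the two columns side by side before settling on a threshold.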
