Merge pull request #35 from mitre/update-garble-default
Added default `sourcefile` behavior to `garble.py` and `households.py`
dehall committed Oct 12, 2022
2 parents a3f569e + 1573f3f commit ee996ea
Showing 3 changed files with 72 additions and 21 deletions.
42 changes: 29 additions & 13 deletions README.md
@@ -175,24 +175,22 @@ Any aberrant results should be investigated rectified within the data set before

anonlink will garble personally identifiable information (PII) in a way that allows it to be used for linkage later on. The CODI PPRL process garbles information in a number of different ways. The `garble.py` script manages executing anonlink multiple times and packages the information for transmission to the linkage agent.

`garble.py` requires 3 different inputs:
1. The location of a CSV file containing the PII to garble
1. The location of a directory of anonlink linkage schema files
1. The location of a secret file to use in the garbling process - this should be a text file containing a single hexadecimal string of at least 128 bits (32 characters); the `testing-and-tuning/generate_secret.py` script will create this for you if you require it, e.g.:
`garble.py` accepts the following positional inputs:
1. (optional) The location of a CSV file containing the PII to garble. If not provided, the script will look for the newest `pii-TIMESTAMP.csv` file in the `temp-data` directory.
1. (required) The location of a directory of anonlink linkage schema files
1. (required) The location of a secret file to use in the garbling process - this should be a text file containing a single hexadecimal string of at least 128 bits (32 characters); the `testing-and-tuning/generate_secret.py` script will create this for you if you require it, e.g.:
```
python testing-and-tuning/generate_secret.py
```
This should create a new file called deidentification_secret.txt in your root directory.
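
The secret format described above (a single hexadecimal string of at least 128 bits, i.e. 32 characters) can be illustrated with a minimal sketch. This is not the contents of `generate_secret.py`; the helper names `make_secret` and `is_valid_secret` are hypothetical and only show what a conforming value looks like.

```python
import re
import secrets


def make_secret(n_bytes: int = 16) -> str:
    """Return a random hex string; 16 bytes is 128 bits, i.e. 32 hex characters."""
    return secrets.token_hex(n_bytes)


def is_valid_secret(text: str) -> bool:
    """Check that the text is one hexadecimal string of at least 32 characters."""
    return re.fullmatch(r"[0-9a-fA-F]{32,}", text.strip()) is not None


secret = make_secret()
print(len(secret))              # 32
print(is_valid_secret(secret))  # True
```

A check like `is_valid_secret` could be run against the contents of a secret file before garbling, assuming the file holds nothing but the hex string.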

`garble.py` requires that the location of the PII file, schema directory, and secret file are provided via positional arguments.

The [anonlink schema files](https://anonlink-client.readthedocs.io/en/latest/schema.html) specify the fields that will be used in the hashing process as well as assigning weights to those fields. The `example-schema` directory contains a set of example schema that can be used to test the tools.

`garble.py`, and all other scripts in the repository, will provide usage information with the `-h` flag:

```
$ python garble.py -h
usage: garble.py [-h] [-o OUTPUTFILE] sourcefile schemadir secretfile
usage: garble.py [-h] [-z OUTPUTZIP] [-o OUTPUTDIR] [sourcefile] schemadir secretfile
Tool for garbling PII in for PPRL purposes in the CODI project
@@ -203,13 +201,15 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
-o OUTPUTFILE, --output OUTPUTFILE
Specify an output file. Default is output/garbled.zip
-z OUTPUTZIP, --outputzip OUTPUTZIP
Specify an name for the .zip file. Default is garbled.zip
-o OUTPUTDIR, --outputdir OUTPUTDIR
Specify an output directory. Default is output/
```

`garble.py` will package up the garbled PII files into a [zip file](https://en.wikipedia.org/wiki/Zip_(file_format)) called `garbled.zip` and place it in the `output/` folder by default; you can change this with an `--output` flag if desired.

Example execution of `garble.py` is shown below:
Two example executions of `garble.py` are shown below, first with the PII CSV specified via a positional argument:

```
$ python garble.py temp-data/pii-TIMESTAMP.csv example-schema ../deidentification_secret.txt
@@ -219,20 +219,36 @@ CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
```
And second without the PII CSV specified as a positional argument:
```
$ python garble.py example-schema ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
```
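
The two invocations above differ only in whether the first positional argument is present. A minimal sketch of how `argparse` resolves that, using the same `nargs="?"` and `default=None` settings that appear in this commit; the `build_parser` wrapper and the file names passed in are illustrative only, and the real parser defines additional options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="garble.py")
    # nargs="?" makes this positional optional; when omitted it takes default=None
    parser.add_argument("sourcefile", nargs="?", default=None)
    parser.add_argument("schemadir")
    parser.add_argument("secretfile")
    return parser


parser = build_parser()

# Three positionals: the first is taken as the source file
three = parser.parse_args(["temp-data/pii-TIMESTAMP.csv", "example-schema", "secret.txt"])
print(three.sourcefile)  # temp-data/pii-TIMESTAMP.csv

# Two positionals: argparse fills the required ones and leaves sourcefile as None
two = parser.parse_args(["example-schema", "secret.txt"])
print(two.sourcefile, two.schemadir, two.secretfile)  # None example-schema secret.txt
```

Because the required positionals must always be satisfied, a two-argument call is unambiguous: the optional slot is the one left empty.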
### [Optional] Household Extract and Garble

You may now run `households.py` with the same arguments as the `garble.py` script, with the only difference being specifying a specific schema file instead of a schema directory - if no schema is specified it will default to the `example-schema/household-schema/fn-phone-addr-zip.json` (use `-h` flag for more information). NOTE: If you want to generate the testing and tuning files for development on a synthetic dataset, you need to specify the `-t` or `--testrun` flags
You may now run `households.py` with the same arguments as the `garble.py` script, with the only difference being that you specify a single schema file instead of a schema directory. If no schema is specified, it defaults to `example-schema/household-schema/fn-phone-addr-zip.json`. A schema file must be preceded by the `--schemafile` flag (use the `-h` flag for more information). NOTE: If you want to generate the testing and tuning files for development on a synthetic dataset, you need to specify the `-t` or `--testrun` flag.

The households script will do the following:
1. Attempt to group individuals into households and store those records in a csv file in temp-data
1. Create a mapping file to be sent to the linkage agent, along with a zip file of household specific garbled information.

This information must be provided to the linkage agent if you would like to get a household linkages table as well.

Example run:
Example run with PII CSV specified:
```
$ python households.py temp-data/pii-TIMESTAMP.csv ../deidentification_secret.txt
Grouping individuals into households: 100%|███████████████████████| 819/819 [01:12<00:00, 11.37it/s]
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
```
and without PII CSV specified:
```
$ python households.py ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
```
20 changes: 18 additions & 2 deletions garble.py
@@ -7,6 +7,7 @@
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from zipfile import ZipFile

@@ -17,7 +18,9 @@ def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Tool for garbling PII in for PPRL purposes in the CODI project"
    )
    parser.add_argument("sourcefile", help="Source pii-TIMESTAMP.csv file")
    parser.add_argument(
        "sourcefile", default=None, nargs="?", help="Source pii-TIMESTAMP.csv file"
    )
    parser.add_argument("schemadir", help="Directory of linkage schema")
    parser.add_argument("secretfile", help="Location of de-identification secret file")
    parser.add_argument(
@@ -70,7 +73,20 @@ def validate_clks(clk_files, metadata_file):

def garble_pii(args):
    secret_file = Path(args.secretfile)
    source_file = Path(args.sourcefile)

    if args.sourcefile:
        source_file = Path(args.sourcefile)
    else:
        filenames = list(
            filter(lambda x: "pii" in x and len(x) == 23, os.listdir("temp-data"))
        )
        timestamps = [
            datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S") for filename in filenames
        ]
        newest_name = filenames[timestamps.index(max(timestamps))]
        source_file = Path("temp-data") / newest_name
        print(f"PII Source: {str(source_file)}")

    os.makedirs("output", exist_ok=True)

    source_file_name = os.path.basename(source_file)
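
The default lookup in this change selects any name in `temp-data` that contains `pii` and is 23 characters long, then parses characters 4 through -4 as a timestamp. As written, it raises if no file matches or if a matching name does not parse. A stricter variant might look like the sketch below. It is not part of the commit: the function name `find_newest_pii_csv`, the regular expression, and returning `None` when nothing matches are all assumptions made for illustration.

```python
import re
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed naming scheme: pii-YYYYMMDDTHHMMSS.csv (23 characters in total)
PII_NAME = re.compile(r"pii-(\d{8}T\d{6})\.csv")


def find_newest_pii_csv(dirname: str = "temp-data") -> Optional[Path]:
    """Return the pii-TIMESTAMP.csv with the latest timestamp, or None if none exist."""
    candidates = []
    for path in Path(dirname).iterdir():
        match = PII_NAME.fullmatch(path.name)
        if match is None:
            continue  # unrelated file
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")
        except ValueError:
            continue  # right shape, but not a real date or time
        candidates.append((stamp, path))
    if not candidates:
        return None
    # Compare on the timestamp only
    return max(candidates, key=lambda item: item[0])[1]


# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("pii-20221011T120000.csv", "pii-20221012T093000.csv", "notes.txt"):
        (Path(tmp) / name).touch()
    newest = find_newest_pii_csv(tmp)
    print(newest.name)  # pii-20221012T093000.csv
```

Returning `None` leaves the caller to decide how to report a missing file, for example through `parser.error`.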
31 changes: 25 additions & 6 deletions households.py
@@ -2,11 +2,11 @@

import argparse
import csv
import datetime
import json
import os
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from random import shuffle
from zipfile import ZipFile
@@ -32,7 +32,9 @@ def parse_arguments():
        description="Tool for garbling household PII for PPRL purposes"
        " in the CODI project"
    )
    parser.add_argument("sourcefile", help="Source PII CSV file")
    parser.add_argument(
        "sourcefile", default=None, nargs="?", help="Source pii-TIMESTAMP.csv file"
    )
    parser.add_argument("secretfile", help="Location of de-identification secret file")
    parser.add_argument(
        "-d",
@@ -66,7 +68,7 @@ def parse_arguments():
        help="Optional generate files used for testing against an answer key",
    )
    args = parser.parse_args()
    if not Path(args.sourcefile).exists():
    if args.sourcefile and not Path(args.sourcefile).exists():
        parser.error("Unable to find source file: " + args.sourcefile)
    if not Path(args.schemafile).exists():
        parser.error("Unable to find schema file: " + args.schemafile)
@@ -154,8 +156,22 @@ def bfs_traverse_matches(pos_to_pairs, position):
    return visited


def get_default_pii_csv(dirname="temp-data"):
    filenames = list(filter(lambda x: "pii" in x and len(x) == 23, os.listdir(dirname)))
    timestamps = [
        datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S") for filename in filenames
    ]
    newest_name = filenames[timestamps.index(max(timestamps))]
    source_file = Path(dirname) / newest_name
    return source_file


def write_mapping_file(pos_pid_rows, hid_pat_id_rows, args):
    source_file = Path(args.sourcefile)
    if args.sourcefile:
        source_file = Path(args.sourcefile)
    else:
        source_file = get_default_pii_csv()
        print(f"PII Source: {str(source_file)}")
    pii_lines = parse_source_file(source_file)
    output_rows = []
    mapping_file = Path(args.mappingfile)
@@ -269,7 +285,10 @@ def create_output_zip(args, n_households, household_time):

    timestamp = household_time.strftime("%Y%m%dT%H%M%S")

    source_file = Path(args.sourcefile)
    if args.sourcefile:
        source_file = Path(args.sourcefile)
    else:
        source_file = get_default_pii_csv()

    source_file_name = os.path.basename(source_file)
    source_dir_name = os.path.dirname(source_file)
@@ -313,7 +332,7 @@ def create_output_zip(args, n_households, household_time):

def main():
    args = parse_arguments()
    household_time = datetime.datetime.now()
    household_time = datetime.now()
    if not args.householddef:
        n_households = infer_households(args, household_time)
    else:
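
Switching from `import datetime` to `from datetime import datetime` binds the name `datetime` to the class rather than the module, which is why `datetime.datetime.now()` becomes `datetime.now()` in `main()` and why `datetime.strptime` can be called directly in `get_default_pii_csv`. The sketch below round-trips a timestamp through the `%Y%m%dT%H%M%S` format used in this diff. The `pii-<timestamp>.csv` name is inferred from the `filename[4:-4]` slice and the length check of 23, so treat it as an assumption.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%dT%H%M%S"

moment = datetime(2022, 10, 12, 9, 30, 0)
stamp = moment.strftime(STAMP_FORMAT)
name = f"pii-{stamp}.csv"  # assumed naming scheme

# Slicing off "pii-" (4 characters) and ".csv" (4 characters) recovers the timestamp
parsed = datetime.strptime(name[4:-4], STAMP_FORMAT)

print(stamp)             # 20221012T093000
print(len(name))         # 23
print(parsed == moment)  # True
```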
