Added default sourcefile behavior to garble.py and households.py #35

Merged · 18 commits · Oct 12, 2022
58 changes: 44 additions & 14 deletions README.md
@@ -175,24 +175,22 @@ Any aberrant results should be investigated and rectified within the data set before…

anonlink garbles personally identifiable information (PII) in a way that still allows it to be used for linkage later on. The CODI PPRL process garbles information in a number of different ways. The `garble.py` script manages executing anonlink multiple times and packages the information for transmission to the linkage agent.

`garble.py` accepts the following positional inputs:
1. (optional) The location of a CSV file containing the PII to garble. If not provided, the script will look for the newest `pii-TIMESTAMP.csv` file in the `temp-data` directory.
1. (required) The location of a directory of anonlink linkage schema files
1. (required) The location of a secret file to use in the garbling process - this should be a text file containing a single hexadecimal string of at least 128 bits (32 characters); the `testing-and-tuning/generate_secret.py` script will create this for you if you require it, e.g.:
```
python testing-and-tuning/generate_secret.py
```
This should create a new file called `deidentification_secret.txt` in your root directory.
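
For reference, such a secret is just cryptographically secure random hex, which needs only the Python standard library. The following is a hedged sketch of what a generator like `generate_secret.py` might do; the actual script's behavior may differ:

```
# Illustrative sketch only, not the contents of generate_secret.py:
# write 128 bits (32 hex characters) of cryptographically secure
# randomness to the secret file.
import secrets

with open("deidentification_secret.txt", "w") as secret_file:
    secret_file.write(secrets.token_hex(16))  # 16 bytes = 128 bits
```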


The [anonlink schema files](https://anonlink-client.readthedocs.io/en/latest/schema.html) specify the fields that will be used in the hashing process as well as assigning weights to those fields. The `example-schema` directory contains a set of example schema files that can be used to test the tools.

`garble.py`, and all other scripts in the repository, will provide usage information with the `-h` flag:

```
$ python garble.py -h
usage: garble.py [-h] [-z OUTPUTZIP] [-o OUTPUTDIR] [sourcefile] schemadir secretfile

Tool for garbling PII for PPRL purposes in the CODI project

positional arguments:
  sourcefile            Source pii-TIMESTAMP.csv file
  schemadir             Directory of linkage schema
  secretfile            Location of de-identification secret file

optional arguments:
  -h, --help            show this help message and exit
  -z OUTPUTZIP, --outputzip OUTPUTZIP
                        Specify a name for the .zip file. Default is garbled.zip
  -o OUTPUTDIR, --outputdir OUTPUTDIR
                        Specify an output directory. Default is output/
```

`garble.py` will package up the garbled PII files into a [zip file](https://en.wikipedia.org/wiki/Zip_(file_format)) called `garbled.zip` and place it in the `output/` folder by default; you can change the file name and output directory with the `--outputzip` and `--outputdir` flags if desired.
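
As an illustration of that packaging step, here is a minimal sketch, not the exact `garble.py` implementation; the `package_clks` helper name is invented, and the real script also packages run metadata alongside the CLK files:

```
# Illustrative sketch: collect the per-schema CLK JSON files written to
# the output directory into garbled.zip, as in the example runs below.
from pathlib import Path
from zipfile import ZipFile

def package_clks(output_dir="output", zip_name="garbled.zip"):
    out = Path(output_dir)
    zip_path = out / zip_name
    with ZipFile(zip_path, "w") as garbled_zip:
        for clk_file in sorted(out.glob("*.json")):
            garbled_zip.write(clk_file)
    print(f"Zip file created at: {zip_path}")
```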

Two example executions of `garble.py` are shown below. First, with the PII CSV specified via positional argument:

```
$ python garble.py temp-data/pii-TIMESTAMP.csv example-schema ../deidentification_secret.txt
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
```
And second, without the PII CSV specified as a positional argument:
```
$ python garble.py example-schema ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
```
### [Optional] Household Extract and Garble

You may now run `households.py` with the same arguments as the `garble.py` script, the only difference being that it takes a single schema file instead of a schema directory. If no schema is specified, it defaults to `example-schema/household-schema/fn-phone-addr-zip.json`; to specify one, precede it with the `--schemafile` flag (use the `-h` flag for more information). NOTE: If you want to generate the testing and tuning files for development on a synthetic dataset, you need to specify the `-t` or `--testrun` flag

The households script will do the following:
1. Attempt to group individuals into households and store those records in a CSV file in `temp-data` (a simplified sketch of the grouping idea follows below)
1. Create a mapping file to be sent to the linkage agent, along with a zip file of household specific garbled information.

This information must be provided to the linkage agent if you would like to get a household linkages table as well.
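
The grouping in step 1 treats matched record pairs as edges and collects the connected records into one household. The sketch below is a simplified illustration modeled on the `bfs_traverse_matches` helper that appears later in this PR's diff; the `pos_to_pairs` input shape and the `group_households` wrapper are assumptions, not the script's actual interface:

```
# Illustrative sketch: group record positions into households by
# breadth-first traversal over pairwise matches. pos_to_pairs maps a
# record position to the list of match pairs it participates in.
from collections import deque

def group_households(pos_to_pairs):
    visited = set()
    households = []
    for start in pos_to_pairs:
        if start in visited:
            continue
        members, queue = set(), deque([start])
        while queue:
            pos = queue.popleft()
            if pos in members:
                continue
            members.add(pos)
            visited.add(pos)
            for a, b in pos_to_pairs.get(pos, []):
                queue.extend(p for p in (a, b) if p not in members)
        households.append(sorted(members))
    return households

# Records 0 and 1 share a match; record 2 stands alone:
# group_households({0: [(0, 1)], 1: [(0, 1)], 2: []}) -> [[0, 1], [2]]
```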

Example run with the PII CSV specified:
```
$ python households.py temp-data/pii-TIMESTAMP.csv ../deidentification_secret.txt
Grouping individuals into households: 100%|███████████████████████| 819/819 [01:12<00:00, 11.37it/s]
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
```
And without the PII CSV specified:
```
$ python households.py ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
```
@@ -282,15 +298,29 @@ To map the LINK_IDs back to PATIDs, use the `linkid_to_patid.py` script. The script takes the following arguments:

1. The path to the pii-TIMESTAMP.csv file.
2. The path to the LINK_ID CSV file provided by the linkage agent
3. The path to the household PII CSV file, either provided by the data owner directly or inferred by the `households.py` script (which by default is named `households_pii-TIMESTAMP.csv`)
4. The path to the HOUSEHOLDID CSV file provided by the linkage agent if you provided household information

If both the pii-TIMESTAMP.csv and LINK_ID CSV files are provided as arguments, the script will create a file called `linkid_to_patid.csv` with the mapping of LINK_IDs to PATIDs, placed in the `output/` folder by default. If both the household PII CSV and HOUSEHOLDID CSV files are provided as arguments, it will also create a `householdid_to_patid.csv` file in the `output/` folder.
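
As a rough sketch of that mapping step (illustrative only; the `map_linkids` helper, the column names, and the positional row alignment are assumptions, not `linkid_to_patid.py`'s actual logic):

```
# Hypothetical sketch: join the LINK_ID file returned by the linkage
# agent back to the PATIDs in the local PII extract. Assumes both files
# carry the named columns and that rows correspond by position.
import csv

def map_linkids(pii_csv, linkid_csv, out_csv="output/linkid_to_patid.csv"):
    with open(pii_csv, newline="", encoding="utf-8") as f:
        pii_rows = list(csv.DictReader(f))
    with open(linkid_csv, newline="", encoding="utf-8") as f:
        link_rows = list(csv.DictReader(f))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["LINK_ID", "PATID"])
        for pii_row, link_row in zip(pii_rows, link_rows):
            writer.writerow([link_row["LINK_ID"], pii_row["PATID"]])
```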

### [Optional] Independently Validate Result Metadata

The metadata created by the garbling process is used within the `linkid_to_patid.py` script to validate the metadata returned by the linkage agent. That returned metadata can also be validated outside of `linkid_to_patid.py` using the `validate_metadata.py` script in the `utils` directory. The syntax, from the root directory, is
```
python utils/validate_metadata.py <path-to-garbled.zip> <path-to-result.zip>
```
So, assuming that the output of `garble.py` is a file `garbled.zip` located in the `output` directory, and that the results from the linkage agent are received as a zip archive named `results.zip` located in the `inbox` directory, the syntax would be
```
python utils/validate_metadata.py output/garbled.zip inbox/results.zip
```
By default, the script will only return the number of issues found during the validation process. Use the `-v` flag to print detailed information about each issue encountered during validation.
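
Conceptually, the validation compares the metadata you sent with the metadata that came back. Below is a minimal sketch of that comparison, assuming each zip contains a single metadata JSON file; the helper names are invented and the real script's checks are more involved:

```
# Illustrative sketch: count fields that differ between the metadata in
# the garbled zip and the metadata in the result zip.
import json
from zipfile import ZipFile

def load_metadata(zip_path):
    with ZipFile(zip_path) as zf:
        name = next(n for n in zf.namelist() if n.endswith(".json"))
        with zf.open(name) as fp:
            return json.load(fp)

def count_issues(garbled_zip, result_zip, verbose=False):
    sent = load_metadata(garbled_zip)
    received = load_metadata(result_zip)
    issues = [k for k in sent if received.get(k) != sent[k]]
    if verbose:
        for key in issues:
            print(f"Mismatch on {key!r}: sent {sent[key]!r}, got {received.get(key)!r}")
    return len(issues)
```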

## Cleanup

In between runs, it is advisable to run `rm temp-data/*` to clean up the temporary data files used for individual runs.



## Developer Testing

The documentation above outlines the approach for a single data owner to run these tools. A developer testing on a synthetic data set might want to run all of the above steps quickly and repeatedly for a list of artificial data owners.
20 changes: 18 additions & 2 deletions garble.py
@@ -7,6 +7,7 @@
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from zipfile import ZipFile

@@ -17,7 +18,9 @@ def parse_arguments():
parser = argparse.ArgumentParser(
        description="Tool for garbling PII for PPRL purposes in the CODI project"
)
parser.add_argument("sourcefile", help="Source pii-TIMESTAMP.csv file")
parser.add_argument(
"sourcefile", default=None, nargs="?", help="Source pii-TIMESTAMP.csv file"
)
parser.add_argument("schemadir", help="Directory of linkage schema")
parser.add_argument("secretfile", help="Location of de-identification secret file")
parser.add_argument(
@@ -70,7 +73,20 @@ def validate_clks(clk_files, metadata_file):

def garble_pii(args):
secret_file = Path(args.secretfile)

if args.sourcefile:
source_file = Path(args.sourcefile)
else:
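        # pii-YYYYMMDDTHHMMSS.csv filenames are exactly 23 characters;
        # filename[4:-4] strips the "pii-" prefix and ".csv" suffix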
filenames = list(
filter(lambda x: "pii" in x and len(x) == 23, os.listdir("temp-data"))
)
timestamps = [
datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S") for filename in filenames
]
newest_name = filenames[timestamps.index(max(timestamps))]
source_file = Path("temp-data") / newest_name
print(f"PII Source: {str(source_file)}")

os.makedirs("output", exist_ok=True)

source_file_name = os.path.basename(source_file)
88 changes: 66 additions & 22 deletions households.py
@@ -6,6 +6,7 @@
import os
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from random import shuffle
from zipfile import ZipFile
@@ -31,7 +32,9 @@ def parse_arguments():
description="Tool for garbling household PII for PPRL purposes"
" in the CODI project"
)
parser.add_argument("sourcefile", help="Source PII CSV file")
parser.add_argument(
"sourcefile", default=None, nargs="?", help="Source pii-TIMESTAMP.csv file"
)
parser.add_argument("secretfile", help="Location of de-identification secret file")
parser.add_argument(
"-d",
@@ -65,7 +68,7 @@ def parse_arguments():
help="Optional generate files used for testing against an answer key",
)
args = parser.parse_args()
if args.sourcefile and not Path(args.sourcefile).exists():
parser.error("Unable to find source file: " + args.secretfile)
if not Path(args.schemafile).exists():
parser.error("Unable to find schema file: " + args.secretfile)
@@ -116,10 +119,14 @@ def parse_source_file(source_file):
return df


def write_households_pii(output_rows, household_time):
shuffle(output_rows)
timestamp = household_time.strftime("%Y%m%dT%H%M%S")
with open(
"temp-data/households_pii.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / f"households_pii-{timestamp}.csv",
"w",
newline="",
encoding="utf-8",
) as house_csv:
writer = csv.writer(house_csv)
writer.writerow(HOUSEHOLD_PII_HEADERS)
@@ -149,8 +156,22 @@ def bfs_traverse_matches(pos_to_pairs, position):
return visited


def get_default_pii_csv(dirname="temp-data"):
    # pii-YYYYMMDDTHHMMSS.csv filenames are exactly 23 characters;
    # filename[4:-4] strips the "pii-" prefix and the ".csv" suffix
    filenames = list(filter(lambda x: "pii" in x and len(x) == 23, os.listdir(dirname)))
    timestamps = [
        datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S") for filename in filenames
    ]
    newest_name = filenames[timestamps.index(max(timestamps))]
    source_file = Path(dirname) / newest_name
    return source_file


def write_mapping_file(pos_pid_rows, hid_pat_id_rows, args):
source_file = Path(args.sourcefile)
if args.sourcefile:
source_file = Path(args.sourcefile)
else:
source_file = get_default_pii_csv()
print(f"PII Source: {str(source_file)}")
pii_lines = parse_source_file(source_file)
output_rows = []
mapping_file = Path(args.mappingfile)
@@ -199,7 +220,7 @@ def write_mapping_file(pos_pid_rows, hid_pat_id_rows, args):
def write_scoring_file(hid_pat_id_rows):
# Format is used for scoring
with open(
"temp-data/hh_pos_patids.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / "hh_pos_patids.csv", "w", newline="", encoding="utf-8"
) as hpos_pat_csv:
writer = csv.writer(hpos_pat_csv)
writer.writerow(HOUSEHOLD_POS_PID_HEADERS)
@@ -210,15 +231,16 @@ def write_hid_hh_pos_map(pos_pid_rows):
def write_hid_hh_pos_map(pos_pid_rows):
# Format is used for generating a hid to hh_pos for full answer key
with open(
"temp-data/household_pos_pid.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / "household_pos_pid.csv", "w", newline="", encoding="utf-8"
) as house_pos_csv:
writer = csv.writer(house_pos_csv)
writer.writerow(HOUSEHOLD_POS_PID_HEADERS)
for output_row in pos_pid_rows:
writer.writerow(output_row)


def hash_households(args, household_time):
timestamp = household_time.strftime("%Y%m%dT%H%M%S")
schema_file = Path(args.schemafile)
secret_file = Path(args.secretfile)
secret = validate_secret_file(secret_file)
@@ -230,8 +252,10 @@
"The following schema uses doubleHash, which is insecure: "
+ str(schema_file)
)
output_file = Path("output/households/fn-phone-addr-zip.json")
household_pii_file = args.householddef or "temp-data/households_pii.csv"
output_file = Path("output") / "households" / "fn-phone-addr-zip.json"
household_pii_file = (
args.householddef or Path("temp-data") / f"households_pii-{timestamp}.csv"
)
subprocess.run(
[
"anonlink",
@@ -244,22 +268,27 @@
)


def infer_households(args, household_time):
pos_pid_rows = []
hid_pat_id_rows = []
os.makedirs(Path("output") / "households", exist_ok=True)
os.makedirs("temp-data", exist_ok=True)
output_rows, n_households = write_mapping_file(pos_pid_rows, hid_pat_id_rows, args)
write_households_pii(output_rows, household_time)
if args.testrun:
write_scoring_file(hid_pat_id_rows)
write_hid_hh_pos_map(pos_pid_rows)
return n_households


def create_output_zip(args, n_households, household_time):

timestamp = household_time.strftime("%Y%m%dT%H%M%S")

if args.sourcefile:
source_file = Path(args.sourcefile)
else:
source_file = get_default_pii_csv()

source_file_name = os.path.basename(source_file)
source_dir_name = os.path.dirname(source_file)
@@ -271,33 +300,48 @@ def create_output_zip(args, n_households):
metadata_file = Path(source_dir_name) / metadata_file_name
with open(metadata_file, "r") as fp:
metadata = json.load(fp)

new_metadata_filename = f"households_metadata-{timestamp}.json"
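    # creation_date is ISO 8601, e.g. 2022-10-12T09:30:00.123456; stripping
    # "-" and ":" and dropping the 7-character ".123456" tail leaves
    # YYYYMMDDTHHMMSS, comparable to the pii file timestamp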
meta_timestamp = metadata["creation_date"].replace("-", "").replace(":", "")[:-7]
assert (
source_timestamp == meta_timestamp
), "Metadata creation date does not match pii file timestamp"

metadata["number_of_households"] = n_households

with open(Path("output") / metadata_file_name, "w") as metadata_file:
json.dump(metadata, metadata_file)
if not args.householddef:
metadata["household_inference_time"] = household_time.isoformat()
metadata["households_inferred"] = True
else:
metadata["households_inferred"] = False

with open(Path("temp-data") / new_metadata_filename, "w") as metadata_file:
json.dump(metadata, metadata_file, indent=2)

with open(Path("output") / new_metadata_filename, "w") as metadata_file:
json.dump(metadata, metadata_file, indent=2)

with ZipFile(Path(args.outputfile), "w") as garbled_zip:
garbled_zip.write(Path("output") / "households" / "fn-phone-addr-zip.json")
garbled_zip.write(Path("output") / metadata_file_name)
garbled_zip.write(Path("output") / new_metadata_filename)

os.remove(Path("output") / metadata_file_name)
os.remove(Path("output") / new_metadata_filename)

print("Zip file created at: " + str(Path(args.outputfile)))


def main():
args = parse_arguments()
household_time = datetime.now()
if not args.householddef:
n_households = infer_households(args, household_time)
else:
with open(args.householddef) as household_file:
households = household_file.read()
n_households = len(households.split()) - 1

hash_households(args, household_time)
create_output_zip(args, n_households, household_time)


if __name__ == "__main__":