Added default sourcefile behavior to garble.py and households.py #35

Merged: 18 commits, Oct 12, 2022
Changes from 15 commits
42 changes: 36 additions & 6 deletions README.md
@@ -175,7 +175,7 @@ Any aberrant results should be investigated and rectified within the data set before

anonlink will garble personally identifiable information (PII) in a way that allows it to be used for linkage later on. The CODI PPRL process garbles information in a number of different ways. The `garble.py` script manages executing anonlink multiple times and packages the information for transmission to the linkage agent.
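As a rough sketch of that loop (an editorial illustration, not code from this PR: the `anonlink hash` argument order is assumed from the anonlink-client CLI documentation, and `garble_all` is a hypothetical name):

```python
import subprocess
from pathlib import Path

def garble_all(pii_csv, schema_dir, secret, out_dir="output"):
    """Hash the PII once per linkage schema file, as garble.py is
    described as doing, producing one CLK JSON file per schema."""
    for schema in sorted(Path(schema_dir).glob("*.json")):
        clk_file = Path(out_dir) / schema.name
        subprocess.run(
            ["anonlink", "hash", str(pii_csv), secret, str(schema), str(clk_file)],
            check=True,
        )
        print(f"CLK data written to {clk_file}")
```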

`garble.py` requires 3 different inputs:
`garble.py` requires 2 different inputs:
1. [Optional] The location of a CSV file containing the PII to garble
1. The location of a directory of anonlink linkage schema files
1. The location of a secret file to use in the garbling process - this should be a text file containing a single hexadecimal string of at least 128 bits (32 characters); the `testing-and-tuning/generate_secret.py` script will create this for you if you require it, e.g.:
@@ -184,7 +184,7 @@
```
python testing-and-tuning/generate_secret.py
```
This should create a new file called deidentification_secret.txt in your root directory.
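For reference, a minimal sketch of what a generator like `generate_secret.py` could look like, assuming it simply writes 128 random bits as hex (the script's actual contents are not shown in this diff):

```python
import secrets

# 16 random bytes = 128 bits, rendered as 32 hexadecimal characters,
# matching the minimum secret size required by garble.py.
secret = secrets.token_hex(16)

with open("deidentification_secret.txt", "w") as secret_file:
    secret_file.write(secret)
```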

`garble.py` requires that the location of the PII file, schema directory, and secret file are provided via positional arguments.
`garble.py` requires that the location of the PII file, schema directory, and secret file are provided via positional arguments. If only two positional arguments are given, `garble.py` will use them as the schema directory and secret file locations, and will look for the newest `pii-TIMESTAMP.csv` file in the `temp-data` directory.

The [anonlink schema files](https://anonlink-client.readthedocs.io/en/latest/schema.html) specify the fields that will be used in the hashing process as well as assigning weights to those fields. The `example-schema` directory contains a set of example schema that can be used to test the tools.

@@ -209,7 +209,7 @@ optional arguments:

`garble.py` will package up the garbled PII files into a [zip file](https://en.wikipedia.org/wiki/Zip_(file_format)) called `garbled.zip` and place it in the `output/` folder by default; you can change this with the `--output` flag if desired.

Example execution of `garble.py` is shown below:
Two example executions of `garble.py` are shown below, first with the PII CSV specified via positional argument:

```
$ python garble.py temp-data/pii-TIMESTAMP.csv example-schema ../deidentification_secret.txt
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
```
And second without the PII CSV specified as a positional argument:
```
$ python garble.py example-schema ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
```
### [Optional] Household Extract and Garble

You may now run `households.py` with the same arguments as the `garble.py` script, with the only difference being that you specify a single schema file instead of a schema directory. If no schema is specified, it will default to `example-schema/household-schema/fn-phone-addr-zip.json` (use the `-h` flag for more information). NOTE: If you want to generate the testing and tuning files for development on a synthetic dataset, you need to specify the `-t` or `--testrun` flag.
@@ -229,10 +239,16 @@ The households script will do the following:

This information must be provided to the linkage agent if you would like to get a household linkages table as well.
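The diff below shows that households.py groups records with a breadth-first traversal over matching record pairs (`bfs_traverse_matches`). As an illustrative sketch of that idea, not the script's exact code:

```python
from collections import deque

def bfs_traverse_matches(pos_to_pairs, position):
    """Collect every record position reachable from `position` through
    chains of matching pairs; one connected component = one household."""
    visited = {position}
    queue = deque([position])
    while queue:
        current = queue.popleft()
        # pos_to_pairs maps a record position to the match pairs it appears in
        for pair in pos_to_pairs.get(current, []):
            for neighbor in pair:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
    return visited
```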

Example run:
Example run with PII CSV specified:
```
$ python households.py temp-data/pii-TIMESTAMP.csv ../deidentification_secret.txt
Grouping individuals into households: 100%|███████████████████████| 819/819 [01:12<00:00, 11.37it/s]
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
```
And without the PII CSV specified:
```
$ python households.py ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
```
@@ -282,15 +298,29 @@ To map the LINK_IDs back to PATIDs, use the `linkid_to_patid.py` script. The script takes the following inputs:

1. The path to the pii-timestamp.csv file.
2. The path to the LINK_ID CSV file provided by the linkage agent
3. The path to the Household pii-timestamp.csv file, either provided by the data owner directly or inferred by the `households.py` script
3. The path to the household pii CSV file, either provided by the data owner directly or inferred by the `households.py` script (which by default is named `households_pii-timestamp.csv`)
4. The path to the HOUSEHOLDID CSV file provided by the linkage agent if you provided household information

If both the pii-timestamp.csv and LINK_ID CSV file are provided as arguments, the script will create a file called `linkid_to_patid.csv` with the mapping of LINK_IDs to PATIDs in the `output/` folder by default. If both the household pii CSV and HOUSEHOLDID CSV file are provided as arguments, this will also create a `householdid_to_patid.csv` file in the `output/` folder.
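The body of `linkid_to_patid.py` is not part of this diff; as a hedged sketch of the kind of positional join it describes, assuming the linkage agent's file assigns a LINK_ID to each row position of the PII extract and that the PII file carries a PATID column (both assumptions, since neither format is shown here):

```python
import csv

# Hypothetical sketch: pair row N of the PII extract with row N of the
# linkage agent's LINK_ID file. File names and column names are assumed.
with open("temp-data/pii-TIMESTAMP.csv", newline="") as pii_csv, \
     open("inbox/link_ids.csv", newline="") as link_csv, \
     open("output/linkid_to_patid.csv", "w", newline="") as out_csv:
    writer = csv.writer(out_csv)
    writer.writerow(["LINK_ID", "PATID"])
    for pii_row, link_row in zip(csv.DictReader(pii_csv), csv.DictReader(link_csv)):
        writer.writerow([link_row["LINK_ID"], pii_row["PATID"]])
```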

### [Optional] Independently Validate Result Metadata

The metadata created by the garbling process is used to validate the metadata returned by the linkage agent within the `linkid_to_patid.py` script. Additionally, the metadata returned by the linkage agents can be validated outside of the `linkid_to_patid.py` script using the `validate_metadata.py` script in the `utils` directory. The syntax from the root directory is
```
python utils/validate_metadata.py <path-to-garbled.zip> <path-to-result.zip>
```
So, assuming that the output of `garble.py` is a file, `garbled.zip`, located in the `output` directory, and assuming that the results from the linkage agent are received as a zip archive named `results.zip` located in the `inbox` directory, the syntax would be
```
python utils/validate_metadata.py output/garbled.zip inbox/results.zip
```
By default, the script will only return the number of issues found during the validation process. Use the `-v` flag in order to print detailed information about each of the issues encountered during validation.
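`validate_metadata.py` itself is not part of this diff; purely as an illustrative sketch of the comparison it describes, assuming each zip archive carries a JSON metadata member (the real script's checks and metadata layout may differ):

```python
import json
import sys
from zipfile import ZipFile

def load_metadata(zip_path):
    # Assumption: the archive contains exactly one *.json metadata member
    with ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith(".json"))
        return json.loads(archive.read(name))

def validate(garbled_zip, result_zip, verbose=False):
    sent = load_metadata(garbled_zip)
    received = load_metadata(result_zip)
    issues = [key for key in sent if received.get(key) != sent[key]]
    if verbose:
        for key in issues:
            print(f"{key}: sent {sent[key]!r}, received {received.get(key)!r}")
    print(f"{len(issues)} issue(s) found")

if __name__ == "__main__":
    validate(sys.argv[1], sys.argv[2], verbose="-v" in sys.argv)
```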

## Cleanup

In between runs it is advisable to run `rm temp-data/*` to clean up temporary data files used for individual runs.



## Developer Testing

The documentation above outlines the approach for a single data owner to run these tools. A developer testing on a synthetic data set might want to run all of the above steps quickly and repeatedly for a list of artificial data owners.
22 changes: 20 additions & 2 deletions garble.py
@@ -7,6 +7,7 @@
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from zipfile import ZipFile

@@ -17,7 +18,9 @@ def parse_arguments():
parser = argparse.ArgumentParser(
description="Tool for garbling PII in for PPRL purposes in the CODI project"
)
parser.add_argument("sourcefile", help="Source pii-TIMESTAMP.csv file")
parser.add_argument(
"sourcefile", default=None, nargs="?", help="Source pii-TIMESTAMP.csv file"
)
parser.add_argument("schemadir", help="Directory of linkage schema")
parser.add_argument("secretfile", help="Location of de-identification secret file")
parser.add_argument(
@@ -70,7 +73,22 @@ def validate_clks(clk_files, metadata_file):

def garble_pii(args):
secret_file = Path(args.secretfile)
source_file = Path(args.sourcefile)

if args.sourcefile:
source_file = Path(args.sourcefile)
else:
oldest_ts = datetime.fromtimestamp(0)
[Review comment] Collaborator: I think this is supposed to be newest? The code looks right, just the variable name is backwards. Same thing in households.py.

[Reply] Collaborator (author): Yes, and I'm happy to change the name. I chose "oldest" because we're looking for the 'largest' timestamp (furthest from 0), but I understand the confusion.

oldest_name = ""
for filename in filter(
lambda x: "pii" in x and len(x) == 23, os.listdir("temp-data")
):
timestamp = datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S")
[Review comment] Collaborator: I'm not sure it's worth changing, but the YYYYMMDD format means you could just take the maximum. The benefit of parsing, I suppose, is that you crash if there's anything unexpected in there.

[Reply] Collaborator (author): Are you suggesting I do max() on the timestamps as strings? I see how/why this would work (and it would even work with the current format, because the %H%M%S format gives 24-hr time), but I would prefer at least parsing the time to make sure it's a time and that we're not accidentally comparing an errant file that ended up in temp-data. A hypothetical "pii-data-report-v2.json" would also be picked up by this code and would str-compare as newer than all of the timestamps because of how Python compares strings. That said, I could replace a few lines of this logic with a max() of the parsed timestamp objects, because datetime has comparators built into it. Does that make sense?

[Reply] Collaborator (author): If you'd prefer something that's more crash-resistant, I could filter on a regex to ensure the filename is exactly what we're looking for, but to me that felt like a bit of overkill; I'm happy to put it in if you think it's warranted.
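As a sketch of the max()-of-parsed-timestamps alternative discussed above (an editorial illustration, not code from this PR):

```python
from datetime import datetime
from pathlib import Path

def newest_pii_file(dirname="temp-data"):
    """Return the newest pii-TIMESTAMP.csv in dirname. Parsing each
    timestamp (rather than comparing raw strings) means a stray file
    that slips through the name filter still fails loudly."""
    candidates = [
        f for f in Path(dirname).iterdir()
        if f.name.startswith("pii-") and len(f.name) == 23
    ]
    return max(
        candidates,
        key=lambda f: datetime.strptime(f.name[4:-4], "%Y%m%dT%H%M%S"),
    )
```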

if timestamp > oldest_ts:
oldest_name = filename
oldest_ts = timestamp
source_file = Path("temp-data") / oldest_name
print(f"PII Source: {str(source_file)}")

os.makedirs("output", exist_ok=True)

source_file_name = os.path.basename(source_file)
90 changes: 68 additions & 22 deletions households.py
@@ -6,6 +6,7 @@
import os
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from random import shuffle
from zipfile import ZipFile
@@ -31,7 +32,9 @@ def parse_arguments():
description="Tool for garbling household PII for PPRL purposes"
" in the CODI project"
)
parser.add_argument("sourcefile", help="Source PII CSV file")
parser.add_argument(
"sourcefile", default=None, nargs="?", help="Source pii-TIMESTAMP.csv file"
)
parser.add_argument("secretfile", help="Location of de-identification secret file")
parser.add_argument(
"-d",
@@ -65,7 +68,7 @@ def parse_arguments():
help="Optional generate files used for testing against an answer key",
)
args = parser.parse_args()
if not Path(args.sourcefile).exists():
if args.sourcefile and not Path(args.sourcefile).exists():
parser.error("Unable to find source file: " + args.secretfile)
if not Path(args.schemafile).exists():
parser.error("Unable to find schema file: " + args.secretfile)
@@ -116,10 +119,14 @@ def parse_source_file(source_file):
return df


def write_households_pii(output_rows):
def write_households_pii(output_rows, household_time):
shuffle(output_rows)
timestamp = household_time.strftime("%Y%m%dT%H%M%S")
with open(
"temp-data/households_pii.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / f"households_pii-{timestamp}.csv",
"w",
newline="",
encoding="utf-8",
) as house_csv:
writer = csv.writer(house_csv)
writer.writerow(HOUSEHOLD_PII_HEADERS)
@@ -149,8 +156,24 @@ def bfs_traverse_matches(pos_to_pairs, position):
return visited


def get_default_pii_csv(dirname="temp-data"):
oldest_ts = datetime.fromtimestamp(0)
oldest_name = ""
for filename in filter(lambda x: "pii" in x and len(x) == 23, os.listdir(dirname)):
timestamp = datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S")
if timestamp > oldest_ts:
oldest_name = filename
oldest_ts = timestamp
source_file = Path("temp-data") / oldest_name
return source_file


def write_mapping_file(pos_pid_rows, hid_pat_id_rows, args):
source_file = Path(args.sourcefile)
if args.sourcefile:
source_file = Path(args.sourcefile)
else:
source_file = get_default_pii_csv()
print(f"PII Source: {str(source_file)}")
pii_lines = parse_source_file(source_file)
output_rows = []
mapping_file = Path(args.mappingfile)
@@ -199,7 +222,7 @@ def write_mapping_file(pos_pid_rows, hid_pat_id_rows, args):
def write_scoring_file(hid_pat_id_rows):
# Format is used for scoring
with open(
"temp-data/hh_pos_patids.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / "hh_pos_patids.csv", "w", newline="", encoding="utf-8"
) as hpos_pat_csv:
writer = csv.writer(hpos_pat_csv)
writer.writerow(HOUSEHOLD_POS_PID_HEADERS)
@@ -210,15 +233,16 @@ def write_hid_hh_pos_map(pos_pid_rows):
def write_hid_hh_pos_map(pos_pid_rows):
# Format is used for generating a hid to hh_pos for full answer key
with open(
"temp-data/household_pos_pid.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / "household_pos_pid.csv", "w", newline="", encoding="utf-8"
) as house_pos_csv:
writer = csv.writer(house_pos_csv)
writer.writerow(HOUSEHOLD_POS_PID_HEADERS)
for output_row in pos_pid_rows:
writer.writerow(output_row)


def hash_households(args):
def hash_households(args, household_time):
timestamp = household_time.strftime("%Y%m%dT%H%M%S")
schema_file = Path(args.schemafile)
secret_file = Path(args.secretfile)
secret = validate_secret_file(secret_file)
@@ -230,8 +254,10 @@
"The following schema uses doubleHash, which is insecure: "
+ str(schema_file)
)
output_file = Path("output/households/fn-phone-addr-zip.json")
household_pii_file = args.householddef or "temp-data/households_pii.csv"
output_file = Path("output") / "households" / "fn-phone-addr-zip.json"
household_pii_file = (
args.householddef or Path("temp-data") / f"households_pii-{timestamp}.csv"
)
subprocess.run(
[
"anonlink",
@@ -244,22 +270,27 @@
)


def infer_households(args):
def infer_households(args, household_time):
pos_pid_rows = []
hid_pat_id_rows = []
os.makedirs(Path("output") / "households", exist_ok=True)
os.makedirs("temp-data", exist_ok=True)
output_rows, n_households = write_mapping_file(pos_pid_rows, hid_pat_id_rows, args)
write_households_pii(output_rows)
write_households_pii(output_rows, household_time)
if args.testrun:
write_scoring_file(hid_pat_id_rows)
write_hid_hh_pos_map(pos_pid_rows)
return n_households


def create_output_zip(args, n_households):
def create_output_zip(args, n_households, household_time):

timestamp = household_time.strftime("%Y%m%dT%H%M%S")

source_file = Path(args.sourcefile)
if args.sourcefile:
source_file = Path(args.sourcefile)
else:
source_file = get_default_pii_csv()

source_file_name = os.path.basename(source_file)
source_dir_name = os.path.dirname(source_file)
@@ -271,33 +302,48 @@
metadata_file = Path(source_dir_name) / metadata_file_name
with open(metadata_file, "r") as fp:
metadata = json.load(fp)

new_metadata_filename = f"households_metadata-{timestamp}.json"
meta_timestamp = metadata["creation_date"].replace("-", "").replace(":", "")[:-7]
assert (
source_timestamp == meta_timestamp
), "Metadata creation date does not match pii file timestamp"

metadata["number_of_households"] = n_households

with open(Path("output") / metadata_file_name, "w") as metadata_file:
json.dump(metadata, metadata_file)
if not args.householddef:
metadata["household_inference_time"] = household_time.isoformat()
metadata["households_inferred"] = True
else:
metadata["households_inferred"] = False

with open(Path("temp-data") / new_metadata_filename, "w") as metadata_file:
json.dump(metadata, metadata_file, indent=2)

with open(Path("output") / new_metadata_filename, "w") as metadata_file:
json.dump(metadata, metadata_file, indent=2)

with ZipFile(Path(args.outputfile), "w") as garbled_zip:
garbled_zip.write(Path("output") / "households" / "fn-phone-addr-zip.json")
garbled_zip.write(Path("output") / metadata_file_name)
garbled_zip.write(Path("output") / new_metadata_filename)

os.remove(Path("output") / metadata_file_name)
os.remove(Path("output") / new_metadata_filename)

print("Zip file created at: " + str(Path(args.outputfile)))


def main():
args = parse_arguments()
household_time = datetime.now()
if not args.householddef:
n_households = infer_households(args)
n_households = infer_households(args, household_time)
else:
n_households = 0
hash_households(args)
create_output_zip(args, n_households)
with open(args.householddef) as household_file:
households = household_file.read()
n_households = len(households.splitlines()) - 1

hash_households(args, household_time)
create_output_zip(args, n_households, household_time)


if __name__ == "__main__":
    main()