Added default sourcefile behavior to garble.py and households.py #35

Merged · 18 commits · Oct 12, 2022
58 changes: 44 additions & 14 deletions README.md
@@ -175,24 +175,22 @@ Any aberrant results should be investigated and rectified within the data set before…

anonlink garbles personally identifiable information (PII) in a way that still allows it to be used for linkage later on. The CODI PPRL process garbles information in a number of different ways. The `garble.py` script manages executing anonlink multiple times and packages the information for transmission to the linkage agent.

`garble.py` accepts the following positional inputs:
1. (optional) The location of a CSV file containing the PII to garble. If not provided, the script will look for the newest `pii-TIMESTAMP.csv` file in the `temp-data` directory.
1. (required) The location of a directory of anonlink linkage schema files
1. (required) The location of a secret file to use in the garbling process - this should be a text file containing a single hexadecimal string of at least 128 bits (32 characters); the `testing-and-tuning/generate_secret.py` script will create this for you if you require it, e.g.:
```
python testing-and-tuning/generate_secret.py
```
This should create a new file called `deidentification_secret.txt` in your root directory.
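
For reference, such a secret is just cryptographically secure random hex, which needs only the Python standard library. The following is a hedged sketch of what a generator like `generate_secret.py` might do; the actual script's behavior may differ:

```
# Illustrative sketch only, not the contents of generate_secret.py:
# write 128 bits (32 hex characters) of cryptographically secure
# randomness to the secret file.
import secrets

with open("deidentification_secret.txt", "w") as secret_file:
    secret_file.write(secrets.token_hex(16))  # 16 bytes = 128 bits
```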


The [anonlink schema files](https://anonlink-client.readthedocs.io/en/latest/schema.html) specify the fields that will be used in the hashing process as well as assigning weights to those fields. The `example-schema` directory contains a set of example schema files that can be used to test the tools.

`garble.py`, and all other scripts in the repository, will provide usage information with the `-h` flag:

```
$ python garble.py -h
usage: garble.py [-h] [-z OUTPUTZIP] [-o OUTPUTDIR] [sourcefile] schemadir secretfile

Tool for garbling PII for PPRL purposes in the CODI project

positional arguments:
  sourcefile            Source pii-TIMESTAMP.csv file
  schemadir             Directory of linkage schema
  secretfile            Location of de-identification secret file

optional arguments:
  -h, --help            show this help message and exit
  -z OUTPUTZIP, --outputzip OUTPUTZIP
                        Specify a name for the .zip file. Default is garbled.zip
  -o OUTPUTDIR, --outputdir OUTPUTDIR
                        Specify an output directory. Default is output/
```

`garble.py` will package up the garbled PII files into a [zip file](https://en.wikipedia.org/wiki/Zip_(file_format)) called `garbled.zip` and place it in the `output/` folder by default; you can change the file name and output directory with the `--outputzip` and `--outputdir` flags if desired.
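
As an illustration of that packaging step, here is a minimal sketch, not the exact `garble.py` implementation; the `package_clks` helper name is invented, and the real script also packages run metadata alongside the CLK files:

```
# Illustrative sketch: collect the per-schema CLK JSON files written to
# the output directory into garbled.zip, as in the example runs below.
from pathlib import Path
from zipfile import ZipFile

def package_clks(output_dir="output", zip_name="garbled.zip"):
    out = Path(output_dir)
    zip_path = out / zip_name
    with ZipFile(zip_path, "w") as garbled_zip:
        for clk_file in sorted(out.glob("*.json")):
            garbled_zip.write(clk_file)
    print(f"Zip file created at: {zip_path}")
```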

Two example executions of `garble.py` are shown below. First, with the PII CSV specified via positional argument:

```
$ python garble.py temp-data/pii-TIMESTAMP.csv example-schema ../deidentification_secret.txt
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
```
And second, without the PII CSV specified as a positional argument:
```
$ python garble.py example-schema ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
```
### [Optional] Household Extract and Garble

You may now run `households.py` with the same arguments as the `garble.py` script, the only difference being that it takes a single schema file instead of a schema directory. If no schema is specified, it defaults to `example-schema/household-schema/fn-phone-addr-zip.json`; to specify one, precede it with the `--schemafile` flag (use the `-h` flag for more information). NOTE: If you want to generate the testing and tuning files for development on a synthetic dataset, you need to specify the `-t` or `--testrun` flag

The households script will do the following:
1. Attempt to group individuals into households and store those records in a CSV file in `temp-data` (a simplified sketch of the grouping idea follows below)
1. Create a mapping file to be sent to the linkage agent, along with a zip file of household specific garbled information.

This information must be provided to the linkage agent if you would like to get a household linkages table as well.
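
The grouping in step 1 treats matched record pairs as edges and collects the connected records into one household. The sketch below is a simplified illustration modeled on the `bfs_traverse_matches` helper that appears later in this PR's diff; the `pos_to_pairs` input shape and the `group_households` wrapper are assumptions, not the script's actual interface:

```
# Illustrative sketch: group record positions into households by
# breadth-first traversal over pairwise matches. pos_to_pairs maps a
# record position to the list of match pairs it participates in.
from collections import deque

def group_households(pos_to_pairs):
    visited = set()
    households = []
    for start in pos_to_pairs:
        if start in visited:
            continue
        members, queue = set(), deque([start])
        while queue:
            pos = queue.popleft()
            if pos in members:
                continue
            members.add(pos)
            visited.add(pos)
            for a, b in pos_to_pairs.get(pos, []):
                queue.extend(p for p in (a, b) if p not in members)
        households.append(sorted(members))
    return households

# Records 0 and 1 share a match; record 2 stands alone:
# group_households({0: [(0, 1)], 1: [(0, 1)], 2: []}) -> [[0, 1], [2]]
```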

Example run with the PII CSV specified:
```
$ python households.py temp-data/pii-TIMESTAMP.csv ../deidentification_secret.txt
Grouping individuals into households: 100%|███████████████████████| 819/819 [01:12<00:00, 11.37it/s]
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
```
And without the PII CSV specified:
```
$ python households.py ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
```
@@ -282,15 +298,29 @@ To map the LINK_IDs back to PATIDs, use the `linkid_to_patid.py` script. The script takes the following arguments:

1. The path to the pii-TIMESTAMP.csv file.
2. The path to the LINK_ID CSV file provided by the linkage agent
3. The path to the household PII CSV file, either provided by the data owner directly or inferred by the `households.py` script (which by default is named `households_pii-TIMESTAMP.csv`)
4. The path to the HOUSEHOLDID CSV file provided by the linkage agent if you provided household information

If both the pii-TIMESTAMP.csv and LINK_ID CSV files are provided as arguments, the script will create a file called `linkid_to_patid.csv` with the mapping of LINK_IDs to PATIDs, placed in the `output/` folder by default. If both the household PII CSV and HOUSEHOLDID CSV files are provided as arguments, it will also create a `householdid_to_patid.csv` file in the `output/` folder.
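
As a rough sketch of that mapping step (illustrative only; the `map_linkids` helper, the column names, and the positional row alignment are assumptions, not `linkid_to_patid.py`'s actual logic):

```
# Hypothetical sketch: join the LINK_ID file returned by the linkage
# agent back to the PATIDs in the local PII extract. Assumes both files
# carry the named columns and that rows correspond by position.
import csv

def map_linkids(pii_csv, linkid_csv, out_csv="output/linkid_to_patid.csv"):
    with open(pii_csv, newline="", encoding="utf-8") as f:
        pii_rows = list(csv.DictReader(f))
    with open(linkid_csv, newline="", encoding="utf-8") as f:
        link_rows = list(csv.DictReader(f))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["LINK_ID", "PATID"])
        for pii_row, link_row in zip(pii_rows, link_rows):
            writer.writerow([link_row["LINK_ID"], pii_row["PATID"]])
```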

### [Optional] Independently Validate Result Metadata

The metadata created by the garbling process is used within the `linkid_to_patid.py` script to validate the metadata returned by the linkage agent. That returned metadata can also be validated outside of `linkid_to_patid.py` using the `validate_metadata.py` script in the `utils` directory. The syntax, from the root directory, is
```
python utils/validate_metadata.py <path-to-garbled.zip> <path-to-result.zip>
```
So, assuming that the output of `garble.py` is a file `garbled.zip` located in the `output` directory, and that the results from the linkage agent are received as a zip archive named `results.zip` located in the `inbox` directory, the syntax would be
```
python utils/validate_metadata.py output/garbled.zip inbox/results.zip
```
By default, the script will only return the number of issues found during the validation process. Use the `-v` flag to print detailed information about each issue encountered during validation.
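
Conceptually, the validation compares the metadata you sent with the metadata that came back. Below is a minimal sketch of that comparison, assuming each zip contains a single metadata JSON file; the helper names are invented and the real script's checks are more involved:

```
# Illustrative sketch: count fields that differ between the metadata in
# the garbled zip and the metadata in the result zip.
import json
from zipfile import ZipFile

def load_metadata(zip_path):
    with ZipFile(zip_path) as zf:
        name = next(n for n in zf.namelist() if n.endswith(".json"))
        with zf.open(name) as fp:
            return json.load(fp)

def count_issues(garbled_zip, result_zip, verbose=False):
    sent = load_metadata(garbled_zip)
    received = load_metadata(result_zip)
    issues = [k for k in sent if received.get(k) != sent[k]]
    if verbose:
        for key in issues:
            print(f"Mismatch on {key!r}: sent {sent[key]!r}, got {received.get(key)!r}")
    return len(issues)
```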

## Cleanup

In between runs, it is advisable to run `rm temp-data/*` to clean up the temporary data files used for individual runs.



## Developer Testing

The documentation above outlines the approach for a single data owner to run these tools. A developer testing on a synthetic data set might want to run all of the above steps quickly and repeatedly for a list of artificial data owners.
20 changes: 18 additions & 2 deletions garble.py
@@ -7,6 +7,7 @@
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from zipfile import ZipFile

@@ -17,7 +18,9 @@ def parse_arguments():
parser = argparse.ArgumentParser(
        description="Tool for garbling PII for PPRL purposes in the CODI project"
)
parser.add_argument("sourcefile", help="Source pii-TIMESTAMP.csv file")
parser.add_argument(
"sourcefile", default=None, nargs="?", help="Source pii-TIMESTAMP.csv file"
)
parser.add_argument("schemadir", help="Directory of linkage schema")
parser.add_argument("secretfile", help="Location of de-identification secret file")
parser.add_argument(
@@ -70,7 +73,20 @@ def validate_clks(clk_files, metadata_file):

def garble_pii(args):
secret_file = Path(args.secretfile)

if args.sourcefile:
source_file = Path(args.sourcefile)
else:
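        # pii-YYYYMMDDTHHMMSS.csv filenames are exactly 23 characters;
        # filename[4:-4] strips the "pii-" prefix and ".csv" suffix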
filenames = list(
filter(lambda x: "pii" in x and len(x) == 23, os.listdir("temp-data"))
)
timestamps = [
datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S") for filename in filenames
]
newest_name = filenames[timestamps.index(max(timestamps))]
source_file = Path("temp-data") / newest_name
print(f"PII Source: {str(source_file)}")

os.makedirs("output", exist_ok=True)

source_file_name = os.path.basename(source_file)
88 changes: 66 additions & 22 deletions households.py
@@ -6,6 +6,7 @@
import os
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from random import shuffle
from zipfile import ZipFile
@@ -31,7 +32,9 @@ def parse_arguments():
description="Tool for garbling household PII for PPRL purposes"
" in the CODI project"
)
parser.add_argument("sourcefile", help="Source PII CSV file")
parser.add_argument(
"sourcefile", default=None, nargs="?", help="Source pii-TIMESTAMP.csv file"
)
parser.add_argument("secretfile", help="Location of de-identification secret file")
parser.add_argument(
"-d",
@@ -65,7 +68,7 @@ def parse_arguments():
help="Optional generate files used for testing against an answer key",
)
args = parser.parse_args()
if args.sourcefile and not Path(args.sourcefile).exists():
parser.error("Unable to find source file: " + args.secretfile)
if not Path(args.schemafile).exists():
parser.error("Unable to find schema file: " + args.secretfile)
@@ -116,10 +119,14 @@ def parse_source_file(source_file):
return df


def write_households_pii(output_rows, household_time):
shuffle(output_rows)
timestamp = household_time.strftime("%Y%m%dT%H%M%S")
with open(
"temp-data/households_pii.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / f"households_pii-{timestamp}.csv",
"w",
newline="",
encoding="utf-8",
) as house_csv:
writer = csv.writer(house_csv)
writer.writerow(HOUSEHOLD_PII_HEADERS)
@@ -149,8 +156,22 @@ def bfs_traverse_matches(pos_to_pairs, position):
return visited


def get_default_pii_csv(dirname="temp-data"):
    # pii-YYYYMMDDTHHMMSS.csv filenames are exactly 23 characters;
    # filename[4:-4] strips the "pii-" prefix and the ".csv" suffix
    filenames = list(filter(lambda x: "pii" in x and len(x) == 23, os.listdir(dirname)))
    timestamps = [
        datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S") for filename in filenames
    ]
    newest_name = filenames[timestamps.index(max(timestamps))]
    source_file = Path(dirname) / newest_name
    return source_file


def write_mapping_file(pos_pid_rows, hid_pat_id_rows, args):
source_file = Path(args.sourcefile)
if args.sourcefile:
source_file = Path(args.sourcefile)
else:
source_file = get_default_pii_csv()
print(f"PII Source: {str(source_file)}")
pii_lines = parse_source_file(source_file)
output_rows = []
mapping_file = Path(args.mappingfile)
@@ -199,7 +220,7 @@ def write_mapping_file(pos_pid_rows, hid_pat_id_rows, args):
def write_scoring_file(hid_pat_id_rows):
# Format is used for scoring
with open(
"temp-data/hh_pos_patids.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / "hh_pos_patids.csv", "w", newline="", encoding="utf-8"
) as hpos_pat_csv:
writer = csv.writer(hpos_pat_csv)
writer.writerow(HOUSEHOLD_POS_PID_HEADERS)
@@ -210,15 +231,16 @@ def write_hid_hh_pos_map(pos_pid_rows):
def write_hid_hh_pos_map(pos_pid_rows):
# Format is used for generating a hid to hh_pos for full answer key
with open(
"temp-data/household_pos_pid.csv", "w", newline="", encoding="utf-8"
Path("temp-data") / "household_pos_pid.csv", "w", newline="", encoding="utf-8"
) as house_pos_csv:
writer = csv.writer(house_pos_csv)
writer.writerow(HOUSEHOLD_POS_PID_HEADERS)
for output_row in pos_pid_rows:
writer.writerow(output_row)


def hash_households(args, household_time):
timestamp = household_time.strftime("%Y%m%dT%H%M%S")
schema_file = Path(args.schemafile)
secret_file = Path(args.secretfile)
secret = validate_secret_file(secret_file)
@@ -230,8 +252,10 @@
"The following schema uses doubleHash, which is insecure: "
+ str(schema_file)
)
output_file = Path("output/households/fn-phone-addr-zip.json")
household_pii_file = args.householddef or "temp-data/households_pii.csv"
output_file = Path("output") / "households" / "fn-phone-addr-zip.json"
household_pii_file = (
args.householddef or Path("temp-data") / f"households_pii-{timestamp}.csv"
)
subprocess.run(
[
"anonlink",
@@ -244,22 +268,27 @@
)


def infer_households(args, household_time):
pos_pid_rows = []
hid_pat_id_rows = []
os.makedirs(Path("output") / "households", exist_ok=True)
os.makedirs("temp-data", exist_ok=True)
output_rows, n_households = write_mapping_file(pos_pid_rows, hid_pat_id_rows, args)
write_households_pii(output_rows, household_time)
if args.testrun:
write_scoring_file(hid_pat_id_rows)
write_hid_hh_pos_map(pos_pid_rows)
return n_households


def create_output_zip(args, n_households, household_time):

timestamp = household_time.strftime("%Y%m%dT%H%M%S")

if args.sourcefile:
source_file = Path(args.sourcefile)
else:
source_file = get_default_pii_csv()

source_file_name = os.path.basename(source_file)
source_dir_name = os.path.dirname(source_file)
@@ -271,33 +300,48 @@ def create_output_zip(args, n_households):
metadata_file = Path(source_dir_name) / metadata_file_name
with open(metadata_file, "r") as fp:
metadata = json.load(fp)

new_metadata_filename = f"households_metadata-{timestamp}.json"
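    # creation_date is ISO 8601, e.g. 2022-10-12T09:30:00.123456; stripping
    # "-" and ":" and dropping the 7-character ".123456" tail leaves
    # YYYYMMDDTHHMMSS, comparable to the pii file timestamp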
meta_timestamp = metadata["creation_date"].replace("-", "").replace(":", "")[:-7]
assert (
source_timestamp == meta_timestamp
), "Metadata creation date does not match pii file timestamp"

metadata["number_of_households"] = n_households

with open(Path("output") / metadata_file_name, "w") as metadata_file:
json.dump(metadata, metadata_file)
if not args.householddef:
metadata["household_inference_time"] = household_time.isoformat()
metadata["households_inferred"] = True
else:
metadata["households_inferred"] = False

with open(Path("temp-data") / new_metadata_filename, "w") as metadata_file:
json.dump(metadata, metadata_file, indent=2)

with open(Path("output") / new_metadata_filename, "w") as metadata_file:
json.dump(metadata, metadata_file, indent=2)

with ZipFile(Path(args.outputfile), "w") as garbled_zip:
garbled_zip.write(Path("output") / "households" / "fn-phone-addr-zip.json")
garbled_zip.write(Path("output") / metadata_file_name)
garbled_zip.write(Path("output") / new_metadata_filename)

os.remove(Path("output") / metadata_file_name)
os.remove(Path("output") / new_metadata_filename)

print("Zip file created at: " + str(Path(args.outputfile)))


def main():
args = parse_arguments()
household_time = datetime.now()
if not args.householddef:
n_households = infer_households(args, household_time)
else:
with open(args.householddef) as household_file:
households = household_file.read()
n_households = len(households.split()) - 1

hash_households(args, household_time)
create_output_zip(args, n_households, household_time)


if __name__ == "__main__":