
rename piiTIMESTAMP to pii-TIMESTAMP
Also renames `metadataTIMESTAMP` to `metadata-TIMESTAMP`
radamson committed Sep 14, 2022
1 parent cf550ca commit 308bf57
Showing 5 changed files with 23 additions and 21 deletions.
22 changes: 11 additions & 11 deletions README.md
@@ -84,7 +84,7 @@ N.B. If the install fails during install of psycopg2 due to a clang error, you m

## Extract PII

-The CODI PPRL process depends on information pulled from a database or translated from a `.csv` file structured to match the CODI Data Model. `extract.py` either connects to a database and extracts information, or reads from a provided `.csv` file, cleaning and validating it to prepare it for the PPRL process. The script will output a `temp-data/pii.csv` file that contains the PII ready for garbling.
+The CODI PPRL process depends on information pulled from a database or translated from a `.csv` file structured to match the CODI Data Model. `extract.py` either connects to a database and extracts information, or reads from a provided `.csv` file, cleaning and validating it to prepare it for the PPRL process. The script will output a `temp-data/pii-TIMESTAMP.csv` file that contains the PII ready for garbling.

To extract from a database, `extract.py` requires a database connection string to connect. Consult the [SQLAlchemy documentation](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) to determine the exact string for the database in use.
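
For instance, a minimal SQLAlchemy sketch of what such a connection string is used for (illustrative only; this is not `extract.py`'s actual code, and every part of the URL is a placeholder):

```python
from sqlalchemy import create_engine

# Database URL format per the SQLAlchemy docs linked above;
# username, password, host, port, and database are placeholders.
engine = create_engine("postgresql://username:password@host:port/database")
with engine.connect() as conn:
    print(conn.execute("SELECT 1").scalar())
```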

@@ -159,11 +159,11 @@ See [`testing-and-tuning/sample_conf.json`](https://github.com/mitre/data-owner-

### Data Quality and Characterization

-A data characterization script is provided to assist in identifying data anomalies or quality issues. This script can be run against the `pii.csv` generated by `extract.py` or directly against the database used by `extract.py`.
-It is recommended that `data_analysis.py` at least be run against the generated `pii.csv` file to help ensure that extraction completed successfully.
+A data characterization script is provided to assist in identifying data anomalies or quality issues. This script can be run against the `pii-TIMESTAMP.csv` generated by `extract.py` or directly against the database used by `extract.py`.
+It is recommended that `data_analysis.py` at least be run against the generated `pii-TIMESTAMP.csv` file to help ensure that extraction completed successfully.

```shell
-python data_analysis.py --csv temp-data/pii.csv
+python data_analysis.py --csv temp-data/pii-TIMESTAMP.csv

python data_analysis.py --db postgresql://username:password@host:port/database
```
@@ -197,7 +197,7 @@ usage: garble.py [-h] [-o OUTPUTFILE] sourcefile schemadir secretfile
Tool for garbling PII for PPRL purposes in the CODI project
positional arguments:
-sourcefile Source PII CSV file
+sourcefile Source pii-TIMESTAMP.csv file
schemadir Directory of linkage schema
secretfile Location of de-identification secret file
@@ -212,7 +212,7 @@ optional arguments:
Example execution of `garble.py` is shown below:

```
-$ python garble.py temp-data/pii.csv example-schema ../deidentification_secret.txt
+$ python garble.py temp-data/pii-TIMESTAMP.csv example-schema ../deidentification_secret.txt
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
@@ -231,7 +231,7 @@ This information must be provided to the linkage agent if you would like to get

Example run:
```
-$ python households.py temp-data/pii.csv ../deidentification_secret.txt
+$ python households.py temp-data/pii-TIMESTAMP.csv ../deidentification_secret.txt
Grouping individuals into households: 100%|███████████████████████| 819/819 [01:12<00:00, 11.37it/s]
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
@@ -276,16 +276,16 @@ Statistics for the generated blocks:

## Mapping LINKIDs to PATIDs

-When anonlink matches across data owners / partners, it identifies records by their position in the file. It essentially uses the line number in the extracted PII file as the identifier for the record. When results are returned from the linkage agent, it will assign a LINK_ID to a line number in the PII CSV file.
+When anonlink matches across data owners / partners, it identifies records by their position in the file. It essentially uses the line number in the extracted PII file as the identifier for the record. When results are returned from the linkage agent, it will assign a LINK_ID to a line number in the `pii-TIMESTAMP.csv` file.

To map the LINK_IDs back to PATIDs, use the `linkid_to_patid.py` script. The script takes four arguments:

-1. The path to the PII CSV file.
+1. The path to the `pii-TIMESTAMP.csv` file.
2. The path to the LINK_ID CSV file provided by the linkage agent
-3. The path to the Household PII CSV file, either provided by the data owner directly or inferred by the `households.py` script
+3. The path to the Household `pii-TIMESTAMP.csv` file, either provided by the data owner directly or inferred by the `households.py` script
4. The path to the HOUSEHOLDID CSV file provided by the linkage agent if you provided household information

-If both the PII CSV and LINK_ID CSV file are provided as arguments, the script will create a file called `linkid_to_patid.csv` with the mapping of LINK_IDs to PATIDs in the `output/` folder by default. If both the household PII CSV and LINK_ID CSV file are provided as arguments, this will also create a `householdid_to_patid.csv` file in the `output/` folder.
+If both the `pii-TIMESTAMP.csv` and LINK_ID CSV file are provided as arguments, the script will create a file called `linkid_to_patid.csv` with the mapping of LINK_IDs to PATIDs in the `output/` folder by default. If both the household `pii-TIMESTAMP.csv` and LINK_ID CSV file are provided as arguments, this will also create a `householdid_to_patid.csv` file in the `output/` folder.
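
A sketch of a full four-argument run is below; `--sourcefile`, `--linksfile`, and `--hhsourcefile` appear in `linkid_to_patid.py` (diffed later on this page), while `--hhlinksfile` and the `linkage-output/` file names are assumptions for illustration:

```shell
# Hypothetical invocation; --hhlinksfile is an assumed name for the
# HOUSEHOLDID CSV argument, which this diff does not show.
python linkid_to_patid.py \
  --sourcefile temp-data/pii-TIMESTAMP.csv \
  --linksfile linkage-output/links.csv \
  --hhsourcefile temp-data/household-pii-TIMESTAMP.csv \
  --hhlinksfile linkage-output/household-links.csv
```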

## Cleanup

2 changes: 1 addition & 1 deletion data_analysis.py
@@ -19,7 +19,7 @@ def parse_args():
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        "--csv",
-        help="Location of pii.csv file to analyze",
+        help="Location of pii-TIMESTAMP.csv file to analyze",
    )
    group.add_argument(
        "--db",
4 changes: 2 additions & 2 deletions extract.py
@@ -260,7 +260,7 @@ def write_metadata(n_rows, creation_time):
        "uuid1": str(uuid.uuid1()),
    }
    timestamp = datetime.strftime(creation_time, "%Y%m%dT%H%M%S")
-    metaname = f"temp-data/metadata{timestamp}.json"
+    metaname = f"temp-data/metadata-{timestamp}.json"
    with open(metaname, "w", newline="", encoding="utf-8") as metafile:
        metafile.write(json.dumps(metadata))

@@ -269,7 +269,7 @@ def write_data(output_rows, args):
    creation_time = datetime.now()
    timestamp = datetime.strftime(creation_time, "%Y%m%dT%H%M%S")
    os.makedirs("temp-data", exist_ok=True)
-    csvname = f"temp-data/pii{timestamp}.csv"
+    csvname = f"temp-data/pii-{timestamp}.csv"
    with open(csvname, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(HEADER)
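
Both writers derive their filenames from a `creation_time` (presumably shared within one extraction run), so each run now yields a hyphen-matched file pair; a small sketch with a fixed, hypothetical time:

```python
from datetime import datetime

# Sketch: both filenames derive from one creation_time, so they always pair up.
creation_time = datetime(2022, 9, 14, 10, 30, 45)
timestamp = datetime.strftime(creation_time, "%Y%m%dT%H%M%S")
print(f"temp-data/pii-{timestamp}.csv")        # temp-data/pii-20220914T103045.csv
print(f"temp-data/metadata-{timestamp}.json")  # temp-data/metadata-20220914T103045.json
```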
14 changes: 8 additions & 6 deletions garble.py
@@ -17,7 +17,7 @@ def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Tool for garbling PII for PPRL purposes in the CODI project"
    )
-    parser.add_argument("sourcefile", help="Source PII CSV file")
+    parser.add_argument("sourcefile", help="Source pii-TIMESTAMP.csv file")
    parser.add_argument("schemadir", help="Directory of linkage schema")
    parser.add_argument("secretfile", help="Location of de-identification secret file")
    parser.add_argument(
@@ -70,19 +70,21 @@ def validate_clks(clk_files, metadata_file):

def garble_pii(args):
    secret_file = Path(args.secretfile)
-    source_file = args.sourcefile
+    source_file = Path(args.sourcefile)
    os.makedirs("output", exist_ok=True)

-    source_file_parts = source_file.split("/")
-    source_file_name = source_file_parts[-1]
-    source_timestamp = source_file_name.replace("pii", "").replace(".csv", "")
+    source_file_name = os.path.basename(source_file)
+    source_dir_name = os.path.dirname(source_file)
+
+    source_timestamp = os.path.splitext(source_file_name.replace("pii-", ""))[0]
    metadata_file_name = source_file_name.replace("pii", "metadata").replace(
        ".csv", ".json"
    )
-    metadata_file = Path("/".join(source_file_parts[:-1] + [metadata_file_name]))
+    metadata_file = Path(source_dir_name) / metadata_file_name
    with open(metadata_file, "r") as fp:
        metadata = json.load(fp)
    meta_timestamp = metadata["creation_date"].replace("-", "").replace(":", "")[:-7]
+    import pdb; pdb.set_trace()
    assert (
        source_timestamp == meta_timestamp
    ), "Metadata creation date does not match pii file timestamp"
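
The hyphen now matters on both sides of this logic; a worked sketch of the filename round-trip, using a hypothetical timestamp:

```python
import os

# With the hyphenated scheme, stripping the "pii-" prefix and the ".csv"
# suffix recovers exactly the timestamp embedded by extract.py, and the
# metadata name is derived by plain substring replacement.
source_file_name = "pii-20220914T103045.csv"
source_timestamp = os.path.splitext(source_file_name.replace("pii-", ""))[0]
metadata_file_name = source_file_name.replace("pii", "metadata").replace(".csv", ".json")

assert source_timestamp == "20220914T103045"
assert metadata_file_name == "metadata-20220914T103045.json"
```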
2 changes: 1 addition & 1 deletion linkid_to_patid.py
@@ -13,7 +13,7 @@ def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Tool for translating LINK_IDs back into PATIDs"
    )
-    parser.add_argument("--sourcefile", help="Source PII CSV file")
+    parser.add_argument("--sourcefile", help="Source pii-TIMESTAMP.csv file")
    parser.add_argument("--linksfile", help="LINK_ID CSV file from linkage agent")
    parser.add_argument(
        "--hhsourcefile",
