
rename piiTIMESTAMP to pii-TIMESTAMP
Also renames `metadataTIMESTAMP` to `metadata-TIMESTAMP`
radamson committed Sep 14, 2022
1 parent cf550ca commit 308bf57
Showing 5 changed files with 23 additions and 21 deletions.
22 changes: 11 additions & 11 deletions README.md
@@ -84,7 +84,7 @@ N.B. If the install fails during install of psycopg2 due to a clang error, you m

## Extract PII

-The CODI PPRL process depends on information pulled from a database or translated from a `.csv` file structured to match the CODI Data Model. `extract.py` either connects to a database and extracts information, or reads from a provided `.csv` file, cleaning and validating it to prepare it for the PPRL process. The script will output a `temp-data/pii.csv` file that contains the PII ready for garbling.
+The CODI PPRL process depends on information pulled from a database or translated from a `.csv` file structured to match the CODI Data Model. `extract.py` either connects to a database and extracts information, or reads from a provided `.csv` file, cleaning and validating it to prepare it for the PPRL process. The script will output a `temp-data/pii-TIMESTAMP.csv` file that contains the PII ready for garbling.

To extract from a database, `extract.py` requires a database connection string to connect. Consult the [SQLAlchemy documentation](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) to determine the exact string for the database in use.
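
For instance, a minimal SQLAlchemy sketch of what such a connection string is used for (illustrative only; this is not `extract.py`'s actual code, and every part of the URL is a placeholder):

```python
from sqlalchemy import create_engine

# Database URL format per the SQLAlchemy docs linked above;
# username, password, host, port, and database are placeholders.
engine = create_engine("postgresql://username:password@host:port/database")
with engine.connect() as conn:
    print(conn.execute("SELECT 1").scalar())
```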

@@ -159,11 +159,11 @@ See [`testing-and-tuning/sample_conf.json`](https://github.com/mitre/data-owner-

### Data Quality and Characterization

-A data characterization script is provided to assist in identifying data anomalies or quality issues. This script can be run against the `pii.csv` generated by `extract.py` or directly against the database used by `extract.py`.
-It is recommended that `data_analysis.py` at least be run against the generated `pii.csv` file to help ensure that extraction completed successfully.
+A data characterization script is provided to assist in identifying data anomalies or quality issues. This script can be run against the `pii-TIMESTAMP.csv` generated by `extract.py` or directly against the database used by `extract.py`.
+It is recommended that `data_analysis.py` at least be run against the generated `pii-TIMESTAMP.csv` file to help ensure that extraction completed successfully.

```shell
-python data_analysis.py --csv temp-data/pii.csv
+python data_analysis.py --csv temp-data/pii-TIMESTAMP.csv

python data_analysis.py --db postgresql://username:password@host:port/database
```
@@ -197,7 +197,7 @@ usage: garble.py [-h] [-o OUTPUTFILE] sourcefile schemadir secretfile
Tool for garbling PII for PPRL purposes in the CODI project
positional arguments:
-sourcefile Source PII CSV file
+sourcefile Source pii-TIMESTAMP.csv file
schemadir Directory of linkage schema
secretfile Location of de-identification secret file
@@ -212,7 +212,7 @@ optional arguments:
Example execution of `garble.py` is shown below:

```
-$ python garble.py temp-data/pii.csv example-schema ../deidentification_secret.txt
+$ python garble.py temp-data/pii-TIMESTAMP.csv example-schema ../deidentification_secret.txt
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-parents.json
@@ -231,7 +231,7 @@ This information must be provided to the linkage agent if you would like to get

Example run:
```
-$ python households.py temp-data/pii.csv ../deidentification_secret.txt
+$ python households.py temp-data/pii-TIMESTAMP.csv ../deidentification_secret.txt
Grouping individuals into households: 100%|███████████████████████| 819/819 [01:12<00:00, 11.37it/s]
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip
@@ -276,16 +276,16 @@ Statistics for the generated blocks:

## Mapping LINKIDs to PATIDs

-When anonlink matches across data owners / partners, it identifies records by their position in the file. It essentially uses the line number in the extracted PII file as the identifier for the record. When results are returned from the linkage agent, it will assign a LINK_ID to a line number in the PII CSV file.
+When anonlink matches across data owners / partners, it identifies records by their position in the file. It essentially uses the line number in the extracted PII file as the identifier for the record. When results are returned from the linkage agent, it will assign a LINK_ID to a line number in the `pii-TIMESTAMP.csv` file.

To map the LINK_IDs back to PATIDs, use the `linkid_to_patid.py` script. The script takes four arguments:

-1. The path to the PII CSV file.
+1. The path to the `pii-TIMESTAMP.csv` file.
2. The path to the LINK_ID CSV file provided by the linkage agent
-3. The path to the Household PII CSV file, either provided by the data owner directly or inferred by the `households.py` script
+3. The path to the Household `pii-TIMESTAMP.csv` file, either provided by the data owner directly or inferred by the `households.py` script
4. The path to the HOUSEHOLDID CSV file provided by the linkage agent if you provided household information

-If both the PII CSV and LINK_ID CSV file are provided as arguments, the script will create a file called `linkid_to_patid.csv` with the mapping of LINK_IDs to PATIDs in the `output/` folder by default. If both the household PII CSV and LINK_ID CSV file are provided as arguments, this will also create a `householdid_to_patid.csv` file in the `output/` folder.
+If both the `pii-TIMESTAMP.csv` and LINK_ID CSV file are provided as arguments, the script will create a file called `linkid_to_patid.csv` with the mapping of LINK_IDs to PATIDs in the `output/` folder by default. If both the household `pii-TIMESTAMP.csv` and LINK_ID CSV file are provided as arguments, this will also create a `householdid_to_patid.csv` file in the `output/` folder.
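
A sketch of a full four-argument run is below; `--sourcefile`, `--linksfile`, and `--hhsourcefile` appear in `linkid_to_patid.py` (diffed later on this page), while `--hhlinksfile` and the `linkage-output/` file names are assumptions for illustration:

```shell
# Hypothetical invocation; --hhlinksfile is an assumed name for the
# HOUSEHOLDID CSV argument, which this diff does not show.
python linkid_to_patid.py \
  --sourcefile temp-data/pii-TIMESTAMP.csv \
  --linksfile linkage-output/links.csv \
  --hhsourcefile temp-data/household-pii-TIMESTAMP.csv \
  --hhlinksfile linkage-output/household-links.csv
```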

## Cleanup

2 changes: 1 addition & 1 deletion data_analysis.py
@@ -19,7 +19,7 @@ def parse_args():
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        "--csv",
-        help="Location of pii.csv file to analyze",
+        help="Location of pii-TIMESTAMP.csv file to analyze",
    )
    group.add_argument(
        "--db",
4 changes: 2 additions & 2 deletions extract.py
@@ -260,7 +260,7 @@ def write_metadata(n_rows, creation_time):
        "uuid1": str(uuid.uuid1()),
    }
    timestamp = datetime.strftime(creation_time, "%Y%m%dT%H%M%S")
-    metaname = f"temp-data/metadata{timestamp}.json"
+    metaname = f"temp-data/metadata-{timestamp}.json"
    with open(metaname, "w", newline="", encoding="utf-8") as metafile:
        metafile.write(json.dumps(metadata))

@@ -269,7 +269,7 @@ def write_data(output_rows, args):
    creation_time = datetime.now()
    timestamp = datetime.strftime(creation_time, "%Y%m%dT%H%M%S")
    os.makedirs("temp-data", exist_ok=True)
-    csvname = f"temp-data/pii{timestamp}.csv"
+    csvname = f"temp-data/pii-{timestamp}.csv"
    with open(csvname, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(HEADER)
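
Both writers derive their filenames from a `creation_time` (presumably shared within one extraction run), so each run now yields a hyphen-matched file pair; a small sketch with a fixed, hypothetical time:

```python
from datetime import datetime

# Sketch: both filenames derive from one creation_time, so they always pair up.
creation_time = datetime(2022, 9, 14, 10, 30, 45)
timestamp = datetime.strftime(creation_time, "%Y%m%dT%H%M%S")
print(f"temp-data/pii-{timestamp}.csv")        # temp-data/pii-20220914T103045.csv
print(f"temp-data/metadata-{timestamp}.json")  # temp-data/metadata-20220914T103045.json
```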
14 changes: 8 additions & 6 deletions garble.py
@@ -17,7 +17,7 @@ def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Tool for garbling PII for PPRL purposes in the CODI project"
    )
-    parser.add_argument("sourcefile", help="Source PII CSV file")
+    parser.add_argument("sourcefile", help="Source pii-TIMESTAMP.csv file")
    parser.add_argument("schemadir", help="Directory of linkage schema")
    parser.add_argument("secretfile", help="Location of de-identification secret file")
    parser.add_argument(
@@ -70,19 +70,21 @@ def validate_clks(clk_files, metadata_file):

def garble_pii(args):
    secret_file = Path(args.secretfile)
-    source_file = args.sourcefile
+    source_file = Path(args.sourcefile)
    os.makedirs("output", exist_ok=True)

-    source_file_parts = source_file.split("/")
-    source_file_name = source_file_parts[-1]
-    source_timestamp = source_file_name.replace("pii", "").replace(".csv", "")
+    source_file_name = os.path.basename(source_file)
+    source_dir_name = os.path.dirname(source_file)
+
+    source_timestamp = os.path.splitext(source_file_name.replace("pii-", ""))[0]
    metadata_file_name = source_file_name.replace("pii", "metadata").replace(
        ".csv", ".json"
    )
-    metadata_file = Path("/".join(source_file_parts[:-1] + [metadata_file_name]))
+    metadata_file = Path(source_dir_name) / metadata_file_name
    with open(metadata_file, "r") as fp:
        metadata = json.load(fp)
    meta_timestamp = metadata["creation_date"].replace("-", "").replace(":", "")[:-7]
+    import pdb; pdb.set_trace()
    assert (
        source_timestamp == meta_timestamp
    ), "Metadata creation date does not match pii file timestamp"
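
The hyphen now matters on both sides of this logic; a worked sketch of the filename round-trip, using a hypothetical timestamp:

```python
import os

# With the hyphenated scheme, stripping the "pii-" prefix and the ".csv"
# suffix recovers exactly the timestamp embedded by extract.py, and the
# metadata name is derived by plain substring replacement.
source_file_name = "pii-20220914T103045.csv"
source_timestamp = os.path.splitext(source_file_name.replace("pii-", ""))[0]
metadata_file_name = source_file_name.replace("pii", "metadata").replace(".csv", ".json")

assert source_timestamp == "20220914T103045"
assert metadata_file_name == "metadata-20220914T103045.json"
```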
2 changes: 1 addition & 1 deletion linkid_to_patid.py
@@ -13,7 +13,7 @@ def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Tool for translating LINK_IDs back into PATIDs"
    )
-    parser.add_argument("--sourcefile", help="Source PII CSV file")
+    parser.add_argument("--sourcefile", help="Source pii-TIMESTAMP.csv file")
    parser.add_argument("--linksfile", help="LINK_ID CSV file from linkage agent")
    parser.add_argument(
        "--hhsourcefile",
