Added default sourcefile behavior to garble.py and households.py #35
Conversation
… output and results from linkage-agent-tools
…it to utils. Updated linkid_to_patid.py to validate metadata before translating linkids
…d updated linkid_to_patid.py to work with households
…a-owner-tools into verify-linkage-results stemming from rebase
…ata if no sourcefile argument is provided
garble.py
Outdated
    if args.sourcefile:
        source_file = Path(args.sourcefile)
    else:
        oldest_ts = datetime.fromtimestamp(0)
I think this is supposed to be newest? The code looks right, just the variable name is backwards. Same thing in households.py.
Yes, and I'm happy to change the name. I chose "oldest" because we're looking for the 'largest' timestamp (the one furthest from 0), but I understand the confusion.
garble.py
Outdated
    for filename in filter(
        lambda x: "pii" in x and len(x) == 23, os.listdir("temp-data")
    ):
        timestamp = datetime.strptime(filename[4:-4], "%Y%m%dT%H%M%S")
I'm not sure it's worth changing, but the YYYYMMDD format means the filenames already sort in chronological order as strings, so you could just take the maximum. The benefit of parsing, I suppose, is that you crash if there's anything unexpected in there.
Are you suggesting I do max() on the timestamps as strings? I see how and why this would work (it would even work with the current format, because the %H%M%S format gives 24-hr time), but I would prefer at least parsing the time to make sure it's a time and that we're not accidentally comparing an errant file that ended up in temp-data. A hypothetical "pii-data-report-v2.json" would also be picked up by this code and would str-compare as newer than all of the timestamps because of how Python compares strings. That said, I could replace a few lines of this logic with a max() of the parsed timestamp objects, because datetime has comparators built in. Does that make sense?
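To make the trade-off concrete, here is a minimal sketch (not the code from this PR) that compares a plain string max() with a max() over parsed datetime objects. The helper names parse_ts and newest_by_parsing, the sample filenames, and the .csv extension are assumptions for illustration; the length-23 filter, the [4:-4] slice, and the %Y%m%dT%H%M%S format come from the snippet under review.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%dT%H%M%S"


def parse_ts(filename):
    """Parse the timestamp embedded in a name like pii-20220101T120000.csv."""
    return datetime.strptime(filename[4:-4], TS_FORMAT)


def newest_by_parsing(filenames):
    """Return the candidate whose parsed timestamp is the latest.

    Raises ValueError if any candidate does not hold a valid timestamp.
    """
    candidates = [f for f in filenames if "pii" in f and len(f) == 23]
    return max(candidates, key=parse_ts)


# Hypothetical directory listings.
good = ["pii-20220101T120000.csv", "pii-20220315T093000.csv"]
errant = good + ["pii-data-report-v2.json"]  # also 23 characters long

# On well-formed names, string order and parsed order agree.
print(max(good))
print(newest_by_parsing(good))

# With an errant file, the string max silently picks the wrong file,
# while the parsed version fails loudly.
print(max(errant))
try:
    newest_by_parsing(errant)
except ValueError as exc:
    print("rejected:", exc)
```

The string comparison picks the errant name because "d" sorts after any digit, which is the failure mode described in the comment above.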
If you'd prefer something more crash-resistant, I could filter on a regex to ensure the filename is exactly what we're looking for. To me that felt like a bit of overkill, but I'm happy to put it in if you think it's warranted.
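For reference, the regex variant might look roughly like the sketch below. This is an illustration of the idea only, not what was merged: the pattern, the .csv extension, and the function name newest_pii_file are assumptions. Names that do not match are skipped instead of crashing; a name that matches the pattern but holds an impossible date (month 13, say) would still raise ValueError from strptime.

```python
import re
from datetime import datetime

# Assumed naming scheme: pii-<YYYYMMDD>T<HHMMSS>.csv
PII_NAME = re.compile(r"pii-(\d{8}T\d{6})\.csv")


def newest_pii_file(filenames):
    """Return the matching filename with the latest timestamp, or None."""
    newest_name, newest_ts = None, None
    for name in filenames:
        match = PII_NAME.fullmatch(name)
        if match is None:
            continue  # ignore anything that is not exactly a pii file
        ts = datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")
        if newest_ts is None or ts > newest_ts:
            newest_name, newest_ts = name, ts
    return newest_name


listing = [
    "pii-20220101T120000.csv",
    "pii-data-report-v2.json",  # skipped by the regex
    "pii-20220315T093000.csv",
]
print(newest_pii_file(listing))
```

The design difference from the parsing-only approach is that unexpected files are ignored quietly, which may or may not be desirable depending on whether a stray file in temp-data should be treated as an error.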
Co-authored-by: Dylan Hall <dehall@mitre.org>
… handle timestamp comparison in line with comments from @dehall. Also made a few updates to README
households.py and garble.py now both look for the newest file (based on filename timestamp) in the temp-data directory if no sourcefile argument is passed from the command line. For ticket #183028439
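A rough sketch of the described fallback, under stated assumptions, is shown below. The function name resolve_source_file, the --sourcefile flag spelling, the data_dir parameter, and the error raised on an empty directory are all hypothetical; only args.sourcefile, the temp-data directory, the length-23 filter, and the timestamp format appear in the reviewed snippets.

```python
import argparse
from datetime import datetime
from pathlib import Path


def resolve_source_file(argv, data_dir="temp-data"):
    """Return the explicit source file, or the newest pii file in data_dir."""
    parser = argparse.ArgumentParser()
    # Assumed flag spelling; the real scripts may declare it differently.
    parser.add_argument("--sourcefile", default=None)
    args = parser.parse_args(argv)

    if args.sourcefile:
        return Path(args.sourcefile)

    # Same coarse filter as the reviewed code: "pii" in the name, length 23.
    candidates = [
        p for p in Path(data_dir).iterdir()
        if "pii" in p.name and len(p.name) == 23
    ]
    if not candidates:
        raise FileNotFoundError(f"no pii files found in {data_dir}")

    # datetime objects are comparable, so max() finds the newest file;
    # an unparseable name raises ValueError here.
    return max(
        candidates,
        key=lambda p: datetime.strptime(p.name[4:-4], "%Y%m%dT%H%M%S"),
    )
```

Calling resolve_source_file([]) would fall back to the newest file in temp-data, while resolve_source_file(["--sourcefile", "my-pii.csv"]) would return the given path unchanged.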