-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add date expressions for first and last day of NHS financial year #723
Conversation
session.commit() | ||
study = StudyDefinition( | ||
index_date=index_date, | ||
population=patients.with_these_clinical_events( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you tested against the EMIS backend, you tested the expressions with the context of population
and event_date
; the former because it's special and the latter because it's not, I assume. Why not do the same here?
I'm sure there's a good reason 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the tpp tests first, and it made sense to just test against the patient ids that were returned. population
on its own just gets you a set of patient_id
in the results. I tried to do the same thing in the emis tests, but the emis backend generates patient ids (I confess I didn't look too closely at exactly what it does), so the patient ids in the results were always 0 and 1. The emis one is only that way to have some way to checking that the 2 patients it returns are the expected ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, of course! Yes, the EMIS backend handles patient IDs differently, making is much harder to test against just population
. More here:
cohort-extractor/cohortextractor/emis_backend.py
Lines 1585 to 1596 in 84dddec
def truncate_patient_id(patient_id): | |
""" | |
EMIS patient IDs are 128 byte strings which significantly bloat our output | |
files. However, because they are actually hex-encoded SHA-512 hashes we can | |
get away with throwing away all but the first 63 bits and storing them as | |
int64s (63 bits so we can safely store them as signed int64s without | |
worrying about negative patient IDs, which might upset expectations | |
elsewhere). 63 bits still gives us plenty of entropy as sqrt(2^63) is about | |
2 billion so well above the threshold where we'd need to worry about | |
collisions with a population of a few tens of millions. | |
""" | |
return int(patient_id[:16], 16) >> 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clear! I feel I've learnt a lot more about cohort-extractor, just from looking at this PR.
Fixes #719