Add disclosure control for measures#1746
Conversation
1d4f557 to
64bb61d
Compare
Deploying with
|
| Latest commit: |
6d2758a
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://392528df.databuilder.pages.dev |
| Branch Preview URL: | https://iaindillingham-add-disclosur.databuilder.pages.dev |
64bb61d to
0ff72ed
Compare
inglesp
left a comment
There was a problem hiding this comment.
The OpenSAFELY docs [2] suggest suppressing small numbers by replacing
them with"[REDACTED]". However, doing so adds complexity to the
implementation.
How much complexity?
ehrql/measures/measures.py
Outdated
|
|
||
| By default, the numerators and denominators of measures generated from real and | ||
| dummy tables, and by the dummy data generator, are subject to disclosure | ||
| control. First, values less than or equal to `7` are replaced with `0` |
There was a problem hiding this comment.
Took me a couple of goes to parse this. How about:
By default, numerators and denominators are subject to disclosure control (unless the user has provided their own dummy data file).
There was a problem hiding this comment.
I think Peter's rephrasing is clearer. I wonder though if we should just apply SDC to everything, including user-supplied dummy data. It's a better match for what's going to happen in production, and it gives us a simpler story here.
There was a problem hiding this comment.
I'm going to make the reasoning explicit.
The question: Should the measures framework always respect measures.disclosure_control_config.enabled?
- If the answer is "no", then the story is more complex: one route for these kinds of data, another route for this kind of data.
- If the answer is "yes", then the story is simpler. What are the drawbacks?
If a user intentionally supplied an uncontrolled dummy data file, but didn't call measures.configure_disclosure_control(enabled=False), then the measures framework would apply disclosure control to their dummy data. The drawback is that doing so might confuse the user. But as you say, a local run would match a production run: We would want to prompt the user either to change their dummy data or to configure disclosure control appropriately.
That's reasonable: I will update to always respect measures.disclosure_control_config.enabled.
ehrql/measures/measures.py
Outdated
| By default, the numerators and denominators of measures generated from real and | ||
| dummy tables, and by the dummy data generator, are subject to disclosure | ||
| control. First, values less than or equal to `7` are replaced with `0` | ||
| (suppressed); then, values are rounded to the nearest `5`. |
There was a problem hiding this comment.
Do we need to wrap numbers in backticks?
There was a problem hiding this comment.
We don't. No idea where those came from.
| births,2021-01-01,2021-12-31,,0,0,male | ||
| births,2021-01-01,2021-12-31,,0,0,female | ||
| """ | ||
| ) |
There was a problem hiding this comment.
We discussed having richer data here, so that numerators/denominators weren't all replaced with zero. That's a nice-to-have -- how much work would it be?
There was a problem hiding this comment.
We did, and I neglected to include them. I have done so now, for this test.
There was a problem hiding this comment.
We're missing a test about what happens when no configuration is set. Addressing this would also smooth out the weirdness of having to escape curly braces in MEASURE_DEFINITIONS (why are we trying to put a dict in a set?? oh...). Specifically, we could leave MEASURE_DEFINITIONS as-is, and then append to it as required in each test.
There was a problem hiding this comment.
I've retained MEASURE_DEFINITIONS, as you suggest, and created DISABLE_DISCLOSURE_CONTROL. I've used these like this
if disclosure_control_enabled:
measure_definitions.write_text(MEASURE_DEFINITIONS)
else:
measure_definitions.write_text(MEASURE_DEFINITIONS + DISABLE_DISCLOSURE_CONTROL)rather than creating a new variable and then calling write_text once because, I couldn't think of a name for the new variable.
| births,2021-01-01,2021-12-31,0.0,0,1,male | ||
| births,2021-01-01,2021-12-31,1.0,1,1,female | ||
| """ | ||
| ) |
There was a problem hiding this comment.
As these multiline strings are re-used, would it make sense to have them as module-level constants?
There was a problem hiding this comment.
They're not, now, because test_generate_measures has richer data.
ehrql/measures/measures.py
Outdated
| """ | ||
| # The following is covered by tests/integration/test_main.py, but coverage can't | ||
| # detect that. | ||
| self.disclosure_control_config.enabled = enabled # pragma: no cover |
There was a problem hiding this comment.
It will be because the definition file is evaluated in a different process (for sandboxing purposes).
I think it might make sense to have a unit test for this anyway in tests/unit/measures/test_measures.py. It's obviously pretty trivial, but if we start configuring more stuff here and needing some kind of validation logic then it would be good to have a pre-existing test to expand on. And I feel like having the test is simpler than explaining the presence of the pragma.
There was a problem hiding this comment.
That's correct.
I've added test_configure_disclosure_control. I've also updated test_define_measures and test_define_measures_with_default_group_by to test for the default value.
ehrql/measures/measures.py
Outdated
|
|
||
| By default, the numerators and denominators of measures generated from real and | ||
| dummy tables, and by the dummy data generator, are subject to disclosure | ||
| control. First, values less than or equal to `7` are replaced with `0` |
There was a problem hiding this comment.
I think Peter's rephrasing is clearer. I wonder though if we should just apply SDC to everything, including user-supplied dummy data. It's a better match for what's going to happen in production, and it gives us a simpler story here.
ehrql/measures/measures.py
Outdated
| """ | ||
| # The following is covered by tests/integration/test_main.py, but coverage can't | ||
| # detect that. | ||
| self.disclosure_control_config.enabled = enabled # pragma: no cover |
There was a problem hiding this comment.
It will be because the definition file is evaluated in a different process (for sandboxing purposes).
I think it might make sense to have a unit test for this anyway in tests/unit/measures/test_measures.py. It's obviously pretty trivial, but if we start configuring more stuff here and needing some kind of validation logic then it would be good to have a pre-existing test to expand on. And I feel like having the test is simpler than explaining the presence of the pragma.
ehrql/utils/sdc_utils.py
Outdated
| @@ -0,0 +1,17 @@ | |||
| """Statistical Disclosure Control (SDC) utilities. | |||
There was a problem hiding this comment.
This is more of an aesthetic preference than anything particularly concrete, but I'm not sure I'd break this out into a utils module like this. Utils generally speaking (although we're definitely not fully consistent here) generally contain code which isn't "core business logic" and either needs to be called from multiple places, or is just too long-winded and faffy to include in the module which needs it.
I wonder if a better structure would be to pull the code out of the calculate module and have a ehrql.measures.disclosure_control module which contains that code plus the stuff here in utils.
There was a problem hiding this comment.
That makes sense, Dave. Thanks for describing your rationale. I have added ehrql.measures.disclosure_control.
I originally wanted to keep apply_sdc_to_measure_results next to get_measure_results to make it clear that the former returns a tuple that matches the structure of the latter (we're using the tuple as a record, not just as an immutable list, in other words). However, a separate module keeps everything related to disclosure control together, which is clearer.
I guess that this might suggest we need a namedtuple to make the record's structure explicit. I'm going to leave that question for another day.
I assume the problem is that |
This adds disclosure control -- suppressing small numbers (less than or equal to seven) and then rounding numbers (to the nearest five) -- to the measures framework. The OpenSAFELY approach to disclosure control is documented in "Updated disclosure control guidance" [1]. Disclosure control is enabled by default. It is disabled by calling: ``` measures.configure_disclosure_control(enabled=False) ``` The OpenSAFELY docs [2] suggest suppressing small numbers by replacing them with `"[REDACTED]"` (i.e. a string). However, doing so adds complexity to the implementation, because some output formats have typed columns. I considered replacing small numbers with `None`, but this does the same: The complexity shifts from the output format to the calculation of the ratio. Replacing small numbers with zeros doesn't add complexity to the implementation. As well as being of the same type, the calculation of the ratio before disclosure control is the same as the calculation of the ratio after disclosure control. Consequently, small numbers are replaced with zeros. [1]: https://www.opensafely.org/updated-output-checking-processes/ [2]: https://docs.opensafely.org/releasing-files/#redacting-counts-less-than-or-equal-to-7
d90a2bd to
6d2758a
Compare
evansd
left a comment
There was a problem hiding this comment.
This is all remarkably neat and non-invasive! I think applying it to user-supplied dummy data was definitely the right call here, now I see how it simplifies the implementation.
This adds disclosure control -- suppressing small numbers (less than or equal to seven) and then rounding numbers (to the nearest five) -- to the measures framework. The OpenSAFELY approach to disclosure control is documented in "Updated disclosure control guidance".
Disclosure control is enabled by default. It is disabled by calling:
The OpenSAFELY docs suggest suppressing small numbers by replacing them with
"[REDACTED]"(i.e. a string). However, doing so adds complexity to the implementation, because some output formats have typed columns. I considered replacing small numbers withNone, but this does the same: The complexity shifts from the output format to the calculation of the ratio.Replacing small numbers with zeros doesn't add complexity to the implementation. As well as being of the same type, the calculation of the ratio before disclosure control is the same as the calculation of the ratio after disclosure control.
Consequently, small numbers are replaced with zeros.
Closes #1666.