Add disclosure control for measures #1746

Merged
iaindillingham merged 2 commits into main from iaindillingham/add-disclosure-control-for-measures
Nov 24, 2023

Member

@iaindillingham iaindillingham commented Nov 17, 2023

This adds disclosure control -- suppressing small numbers (less than or equal to seven) and then rounding numbers (to the nearest five) -- to the measures framework. The OpenSAFELY approach to disclosure control is documented in "Updated disclosure control guidance".

Disclosure control is enabled by default. It is disabled by calling:

measures.configure_disclosure_control(enabled=False)

The OpenSAFELY docs suggest suppressing small numbers by replacing them with "[REDACTED]" (i.e. a string). However, doing so adds complexity to the implementation, because some output formats have typed columns. I considered replacing small numbers with None, but this also adds complexity: it shifts from the output format to the calculation of the ratio.

Replacing small numbers with zeros doesn't add complexity to the implementation. Zeros are of the same type as the counts they replace, and the ratio is calculated in the same way before and after disclosure control.

Consequently, small numbers are replaced with zeros.
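
A minimal sketch of the behaviour described above, not the code in this PR. The constant and function names are assumptions for illustration. It suppresses counts less than or equal to seven by replacing them with zero, rounds the rest to the nearest five, and then calculates the ratio with the same zero-denominator guard that an uncontrolled calculation would need anyway.

```python
SUPPRESSION_THRESHOLD = 7  # assumed name: counts <= 7 are suppressed
ROUNDING_BASE = 5  # assumed name: counts are rounded to the nearest 5


def suppress_and_round(count):
    """Replace a small count with zero, otherwise round to the nearest five."""
    if count <= SUPPRESSION_THRESHOLD:
        return 0
    # Integer arithmetic avoids floating-point rounding surprises.
    return ROUNDING_BASE * ((count + ROUNDING_BASE // 2) // ROUNDING_BASE)


def calculate_ratio(numerator, denominator):
    """Return the ratio, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def apply_disclosure_control(numerator, denominator):
    """Control both counts, then recalculate the ratio from them."""
    numerator = suppress_and_round(numerator)
    denominator = suppress_and_round(denominator)
    return calculate_ratio(numerator, denominator), numerator, denominator


print(apply_disclosure_control(23, 48))  # (0.5, 25, 50)
print(apply_disclosure_control(6, 7))  # (None, 0, 0)
```

Because a suppressed count is still an integer, `calculate_ratio` needs no special case for suppressed values, which is the point made above about zeros.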

Closes #1666.

cloudflare-workers-and-pages bot commented Nov 17, 2023

Deploying with Cloudflare Pages

Latest commit: 6d2758a
Status: ✅  Deploy successful!
Preview URL: https://392528df.databuilder.pages.dev
Branch Preview URL: https://iaindillingham-add-disclosur.databuilder.pages.dev

Contributor

@inglesp inglesp left a comment

The OpenSAFELY docs [2] suggest suppressing small numbers by replacing
them with "[REDACTED]". However, doing so adds complexity to the
implementation.

How much complexity?


By default, the numerators and denominators of measures generated from real and
dummy tables, and by the dummy data generator, are subject to disclosure
control. First, values less than or equal to `7` are replaced with `0`

Contributor

Took me a couple of goes to parse this. How about:

By default, numerators and denominators are subject to disclosure control (unless the user has provided their own dummy data file).

Contributor

I think Peter's rephrasing is clearer. I wonder though if we should just apply SDC to everything, including user-supplied dummy data. It's a better match for what's going to happen in production, and it gives us a simpler story here.

Member Author

I'm going to make the reasoning explicit.

The question: Should the measures framework always respect measures.disclosure_control_config.enabled?

  • If the answer is "no", then the story is more complex: one route for these kinds of data, another route for this kind of data.
  • If the answer is "yes", then the story is simpler. What are the drawbacks?

If a user intentionally supplied an uncontrolled dummy data file, but didn't call measures.configure_disclosure_control(enabled=False), then the measures framework would apply disclosure control to their dummy data. The drawback is that doing so might confuse the user. But as you say, a local run would match a production run: We would want to prompt the user either to change their dummy data or to configure disclosure control appropriately.

That's reasonable: I will update to always respect measures.disclosure_control_config.enabled.
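
A rough sketch of the "always respect the setting" route, under stated assumptions: `DisclosureControlConfig`, `control`, and `finalise_counts` are hypothetical names, not the ones in this PR. The same function handles counts from any source, so user-supplied dummy data takes the same path as everything else.

```python
from dataclasses import dataclass


@dataclass
class DisclosureControlConfig:
    # Hypothetical stand-in for measures.disclosure_control_config.
    enabled: bool = True


def control(count):
    """Suppress counts <= 7, then round to the nearest 5."""
    return 0 if count <= 7 else 5 * ((count + 2) // 5)


def finalise_counts(counts, config):
    """Apply disclosure control to counts from any source when enabled."""
    if not config.enabled:
        return list(counts)
    return [control(count) for count in counts]


user_supplied_dummy_counts = [3, 12, 48]
print(finalise_counts(user_supplied_dummy_counts, DisclosureControlConfig()))
print(finalise_counts(user_supplied_dummy_counts, DisclosureControlConfig(enabled=False)))
```

With the default configuration the dummy counts come back as `[0, 10, 50]`; with `enabled=False` they come back unchanged.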

By default, the numerators and denominators of measures generated from real and
dummy tables, and by the dummy data generator, are subject to disclosure
control. First, values less than or equal to `7` are replaced with `0`
(suppressed); then, values are rounded to the nearest `5`.

Contributor

Do we need to wrap numbers in backticks?

Member Author

We don't. No idea where those came from.

births,2021-01-01,2021-12-31,,0,0,male
births,2021-01-01,2021-12-31,,0,0,female
"""
)

Contributor

We discussed having richer data here, so that numerators/denominators weren't all replaced with zero. That's a nice-to-have -- how much work would it be?

Member Author

We did, and I neglected to include them. I have done so now, for this test.

Contributor

We're missing a test about what happens when no configuration is set. Addressing this would also smooth out the weirdness of having to escape curly braces in MEASURE_DEFINITIONS (why are we trying to put a dict in a set?? oh...). Specifically, we could leave MEASURE_DEFINITIONS as-is, and then append to it as required in each test.

Member Author

I've retained MEASURE_DEFINITIONS, as you suggest, and created DISABLE_DISCLOSURE_CONTROL. I've used these like this:

if disclosure_control_enabled:
    measure_definitions.write_text(MEASURE_DEFINITIONS)
else:
    measure_definitions.write_text(MEASURE_DEFINITIONS + DISABLE_DISCLOSURE_CONTROL)

rather than creating a new variable and then calling write_text once, because I couldn't think of a name for the new variable.

births,2021-01-01,2021-12-31,0.0,0,1,male
births,2021-01-01,2021-12-31,1.0,1,1,female
"""
)

Contributor

As these multiline strings are re-used, would it make sense to have them as module-level constants?

Member Author

They're not, now, because test_generate_measures has richer data.

"""
# The following is covered by tests/integration/test_main.py, but coverage can't
# detect that.
self.disclosure_control_config.enabled = enabled # pragma: no cover

Contributor

Odd -- do we know why?

Contributor

It will be because the definition file is evaluated in a different process (for sandboxing purposes).

I think it might make sense to have a unit test for this anyway in tests/unit/measures/test_measures.py. It's obviously pretty trivial, but if we start configuring more stuff here and needing some kind of validation logic then it would be good to have a pre-existing test to expand on. And I feel like having the test is simpler than explaining the presence of the pragma.

Member Author

That's correct.

I've added test_configure_disclosure_control. I've also updated test_define_measures and test_define_measures_with_default_group_by to test for the default value.
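
For readers without the diff to hand, a unit test of this kind might look roughly like the sketch below. The `Measures` and `DisclosureControlConfig` classes here are simplified stand-ins written for illustration, not the ehrql classes, and the first test name is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DisclosureControlConfig:
    # Stand-in config object; enabled by default, as described in this PR.
    enabled: bool = True


@dataclass
class Measures:
    # Simplified stand-in for the measures collection (an assumption).
    disclosure_control_config: DisclosureControlConfig = field(
        default_factory=DisclosureControlConfig
    )

    def configure_disclosure_control(self, *, enabled=True):
        self.disclosure_control_config.enabled = enabled


def test_disclosure_control_is_enabled_by_default():
    assert Measures().disclosure_control_config.enabled is True


def test_configure_disclosure_control():
    measures = Measures()
    measures.configure_disclosure_control(enabled=False)
    assert measures.disclosure_control_config.enabled is False
```

Because both tests run in the same process as the code under test, coverage can see the assignment without a pragma.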


@@ -0,0 +1,17 @@
"""Statistical Disclosure Control (SDC) utilities.

Contributor

This is more of an aesthetic preference than anything particularly concrete, but I'm not sure I'd break this out into a utils module like this. Utils, generally speaking (although we're definitely not fully consistent here), contain code which isn't "core business logic" and either needs to be called from multiple places, or is just too long-winded and faffy to include in the module which needs it.

I wonder if a better structure would be to pull the code out of the calculate module and have an ehrql.measures.disclosure_control module which contains that code plus the stuff here in utils.

Member Author

That makes sense, Dave. Thanks for describing your rationale. I have added ehrql.measures.disclosure_control.

I originally wanted to keep apply_sdc_to_measure_results next to get_measure_results to make it clear that the former returns a tuple that matches the structure of the latter (we're using the tuple as a record, not just as an immutable list, in other words). However, a separate module keeps everything related to disclosure control together, which is clearer.

I guess that this might suggest we need a namedtuple to make the record's structure explicit. I'm going to leave that question for another day.
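
To make that idea concrete, here is one possible shape for such a record. It is a sketch only: the field names are guessed from the CSV columns in the tests above, and the real tuple returned by get_measure_results may be structured differently.

```python
from typing import NamedTuple, Optional


class MeasureResult(NamedTuple):
    # Field names are assumptions based on the CSV columns in the tests.
    measure: str
    interval_start: str
    interval_end: str
    ratio: Optional[float]
    numerator: int
    denominator: int
    group: str


def suppress(result):
    """Return a copy with the counts zeroed and the ratio cleared."""
    return result._replace(ratio=None, numerator=0, denominator=0)


result = MeasureResult("births", "2021-01-01", "2021-12-31", 1.0, 1, 1, "female")
print(suppress(result))
```

A NamedTuple is still a tuple, so code that unpacks or indexes the record positionally would keep working while new code could use the field names.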

Contributor

evansd commented Nov 23, 2023

The OpenSAFELY docs [2] suggest suppressing small numbers by replacing
them with "[REDACTED]". However, doing so adds complexity to the
implementation.

How much complexity?

I assume the problem is that "[REDACTED]" is not a number and therefore you can't store it in a column intended for numbers. At least, you certainly can't in an Arrow file, and even with CSV it would make parsing very annoying.
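
A small illustration of the CSV half of that point, using only the standard library. The column names and sample rows are made up for the example.

```python
import csv
import io

# Hypothetical measures output: one with numeric counts, one with a redacted count.
CLEAN = "ratio,numerator,denominator\n0.2,10,50\n"
REDACTED = "ratio,numerator,denominator\n,[REDACTED],50\n"


def read_numerators(text):
    """Parse the numerator column as integers."""
    return [int(row["numerator"]) for row in csv.DictReader(io.StringIO(text))]


print(read_numerators(CLEAN))

try:
    read_numerators(REDACTED)
except ValueError as error:
    # Every consumer would need its own special case for the placeholder.
    print(f"cannot parse: {error}")
```

The first call parses cleanly; the second raises ValueError, so every reader of the file would have to special-case the placeholder string.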

This adds disclosure control -- suppressing small numbers (less than or
equal to seven) and then rounding numbers (to the nearest five) -- to
the measures framework. The OpenSAFELY approach to disclosure control is
documented in "Updated disclosure control guidance" [1].

Disclosure control is enabled by default. It is disabled by calling:

```
measures.configure_disclosure_control(enabled=False)
```

The OpenSAFELY docs [2] suggest suppressing small numbers by replacing
them with `"[REDACTED]"` (i.e. a string). However, doing so adds
complexity to the implementation, because some output formats have typed
columns. I considered replacing small numbers with `None`, but this also
adds complexity: it shifts from the output format to the calculation of
the ratio.

Replacing small numbers with zeros doesn't add complexity to the
implementation. Zeros are of the same type as the counts they replace,
and the ratio is calculated in the same way before and after disclosure
control.

Consequently, small numbers are replaced with zeros.

[1]: https://www.opensafely.org/updated-output-checking-processes/
[2]: https://docs.opensafely.org/releasing-files/#redacting-counts-less-than-or-equal-to-7

Contributor

@evansd evansd left a comment

This is all remarkably neat and non-invasive! I think applying it to user-supplied dummy data was definitely the right call here, now I see how it simplifies the implementation.

@iaindillingham iaindillingham merged commit 6e25c3d into main Nov 24, 2023
@iaindillingham iaindillingham deleted the iaindillingham/add-disclosure-control-for-measures branch November 24, 2023 15:38

Development

Successfully merging this pull request may close these issues.

Support small number suppression and rounding of measures output

3 participants