Decide how to handle tables without a `patient_id` column #543

evansd · 2022-06-08T16:33:42Z

As a simple example, suppose we have a practices table

practice_id	stp
1234	abcd
5678	efgh

Which is linked to patients via a practice_registrations table:

patient_id	practice_id	start_date	end_date
1	1234	2000-01-01
2	5678	2010-01-01	2020-01-01

We want to write a query which fetches the STP of the practice a patient was registered with at a given time. How should this be done?

The simplest approach doesn't involve any new databuilder machinery: we just use the backends.QueryTable class to create a denormalised view on the practice_registrations table which joins in the practices table so it looks something like this:

patient_id	practice_id	start_date	end_date	practice_stp
1	1234	2000-01-01		abcd
2	5678	2010-01-01	2020-01-01	efgh

This has the obvious appeal of involving less work, but it has the significant disadvantage that it loses the logical relationship between practice_id and practice_stp i.e. that patients registered with the same practice should have the same practice STP. In order to generate sensible dummy data we'd therefore need some additional mechanism for specifying this logical relationship.

I think it also makes documenting the schema more awkward as you then need to make it clear that certain fields in fact belong to other objects.

Both the above problems are compounded in cases where there is more than one "route" from patient-associated data to some non-patient table. In these cases we'll end up having to duplicate column documentation.

So I think it would be better if these relationships were modelled explicitly in ehrQL/QM. In terms of the syntax I picture something like:

practice_stp = practice_registration_as_of(date).practice.stp

As long as we're only dealing with many-one relationships then I don't think this introduces any fundamental changes to the semantics of ehrQL — it's just making explicit what the "denormalised view" trick made hidden.

The thing I'm not yet clear on is how such relationships are defined in the schema. At the moment table schemas are nothing more than a dict mapping string column names to types. Obviously they'll need to be something more if they're to capture these foreign key relations. But I'm not quite sure what that looks like.

The text was updated successfully, but these errors were encountered:

benbc · 2022-06-09T10:52:30Z

I think we should handle this by denormalization in the contracts. I think that's similar (or maybe even identical) to Dave's denormalization suggestion here. There is some discussion of this in the Contracts domain modelling doc here and here.

I think this is simpler for data definition authors because it's aligned with our no-join philosophy elsewhere.

it has the significant disadvantage that it loses the logical relationship

Is this just a problem for dummy data generation? I don't think this is too big a deal as there will be all sorts of correlations that we need to enforce to get reasonable dummy data which need metadata to describe them.

it also makes documenting the schema more awkward as you then need to make it clear that certain fields in fact belong to other objects

I don't think I understand this objection. Is it obviated by specifying this all the way back in the contract?

compounded in cases where there is more than one "route" from patient-associated data to some non-patient table

Ditto, maybe?

evansd · 2022-06-09T12:45:54Z

Just making a quick note of where we got to following our discussion.

We've decided there are some potential problems with the denormalisation approach. An example that came up was the "Cluster RCT" data, which provides a whole load of new fields associated with each practice. We obviously don't want to clutter the standard practice_registrations table with all these linked fields, so we'd want to provide a specialised practice_rct (or whatever) table. However, in order to link this to a patient we need to know which practice the patient was registered with at a given time. This means that the logic for determining a patient's registered practice (which is not entirely trivial because we need to disambiguate overlapping registrations) needs to be duplicated somewhere. This could either be done in ehrQL, which would mean putting all the relevant registration fields on practice_rct, or it could be done in the SQL which defines the practice_rct view. But both options have downsides.

A similar problem occurs for any data associated with a patient's address, but it's worse here because the logic for disambiguating overlapping address registrations is more complicated.

We discussed whether adding an "interval" type (i.e. a type which represents a time range rather than just a single date) might mitigate some of these issues.

We also discussed supporting other kinds of implicit join rather than just those on patient_id.

However, we've also decided that this is too big a question to address at this stage in development and that the original "denormalised contract" approach will be sufficient for now and doesn't prevent us returning to the issue later.

benbc · 2022-12-02T14:21:54Z

@evansd Does Cohort Extractor currently expose any tables without patient ids?

evansd · 2022-12-02T14:42:00Z

Well, Cohort Extractor doesn't really "expose" tables in the same sense, because the table, and the logic involved in querying it are all bundled up together. But we do have tables without patients IDs that are queried by Cohort Extractor. The ones that immediately spring to mind are:
https://reports.opensafely.org/reports/opensafely-tpp-database-schema/#Household
https://reports.opensafely.org/reports/opensafely-tpp-database-schema/#Organisation

We have the various "cluster RCT" properties which are accessed via the registered_practice_as_of() function:
https://github.com/opensafely-core/cohort-extractor/blob/19b3e31b2b8d2247b3cc97fa6d88396dd07273ff/cohortextractor/patients.py#L1039-L1040

But I don't see the corresponding tables in the database report, so maybe these aren't actually used.

evansd mentioned this issue Jun 8, 2022

Decide how to handle non-patient tables in Comparative Booster study #544

Closed

benbc added the databuilder-3 label Dec 7, 2022

inglesp removed the databuilder-3 label Apr 6, 2023

inglesp added this to the P4 milestone Oct 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decide how to handle tables without a `patient_id` column #543

Decide how to handle tables without a `patient_id` column #543

evansd commented Jun 8, 2022

benbc commented Jun 9, 2022

evansd commented Jun 9, 2022 •

edited

benbc commented Dec 2, 2022

evansd commented Dec 2, 2022

Decide how to handle tables without a patient_id column #543

Decide how to handle tables without a patient_id column #543

Comments

evansd commented Jun 8, 2022

benbc commented Jun 9, 2022

evansd commented Jun 9, 2022 • edited

benbc commented Dec 2, 2022

evansd commented Dec 2, 2022

Decide how to handle tables without a `patient_id` column #543

Decide how to handle tables without a `patient_id` column #543

evansd commented Jun 9, 2022 •

edited