Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide how to handle tables without a patient_id column #543

Open
evansd opened this issue Jun 8, 2022 · 4 comments
Open

Decide how to handle tables without a patient_id column #543

evansd opened this issue Jun 8, 2022 · 4 comments
Milestone

Comments

@evansd
Copy link
Contributor

evansd commented Jun 8, 2022

As a simple example, suppose we have a practices table

practice_id stp
1234 abcd
5678 efgh

Which is linked to patients via a practice_registrations table:

patient_id practice_id start_date end_date
1 1234 2000-01-01
2 5678 2010-01-01 2020-01-01

We want to write a query which fetches the STP of the practice a patient was registered with at a given time. How should this be done?

The simplest approach doesn't involve any new databuilder machinery: we just use the backends.QueryTable class to create a denormalised view on the practice_registrations table which joins in the practices table so it looks something like this:

patient_id practice_id start_date end_date practice_stp
1 1234 2000-01-01 abcd
2 5678 2010-01-01 2020-01-01 efgh

This has the obvious appeal of involving less work, but it has the significant disadvantage that it loses the logical relationship between practice_id and practice_stp i.e. that patients registered with the same practice should have the same practice STP. In order to generate sensible dummy data we'd therefore need some additional mechanism for specifying this logical relationship.

I think it also makes documenting the schema more awkward as you then need to make it clear that certain fields in fact belong to other objects.

Both the above problems are compounded in cases where there is more than one "route" from patient-associated data to some non-patient table. In these cases we'll end up having to duplicate column documentation.

So I think it would be better if these relationships were modelled explicitly in ehrQL/QM. In terms of the syntax I picture something like:

practice_stp = practice_registration_as_of(date).practice.stp

As long as we're only dealing with many-one relationships then I don't think this introduces any fundamental changes to the semantics of ehrQL — it's just making explicit what the "denormalised view" trick made hidden.

The thing I'm not yet clear on is how such relationships are defined in the schema. At the moment table schemas are nothing more than a dict mapping string column names to types. Obviously they'll need to be something more if they're to capture these foreign key relations. But I'm not quite sure what that looks like.

@benbc
Copy link
Contributor

benbc commented Jun 9, 2022

I think we should handle this by denormalization in the contracts. I think that's similar (or maybe even identical) to Dave's denormalization suggestion here. There is some discussion of this in the Contracts domain modelling doc here and here.

I think this is simpler for data definition authors because it's aligned with our no-join philosophy elsewhere.

it has the significant disadvantage that it loses the logical relationship

Is this just a problem for dummy data generation? I don't think this is too big a deal as there will be all sorts of correlations that we need to enforce to get reasonable dummy data which need metadata to describe them.

it also makes documenting the schema more awkward as you then need to make it clear that certain fields in fact belong to other objects

I don't think I understand this objection. Is it obviated by specifying this all the way back in the contract?

compounded in cases where there is more than one "route" from patient-associated data to some non-patient table

Ditto, maybe?

@evansd
Copy link
Contributor Author

evansd commented Jun 9, 2022

Just making a quick note of where we got to following our discussion.

We've decided there are some potential problems with the denormalisation approach. An example that came up was the "Cluster RCT" data, which provides a whole load of new fields associated with each practice. We obviously don't want to clutter the standard practice_registrations table with all these linked fields, so we'd want to provide a specialised practice_rct (or whatever) table. However, in order to link this to a patient we need to know which practice the patient was registered with at a given time. This means that the logic for determining a patient's registered practice (which is not entirely trivial because we need to disambiguate overlapping registrations) needs to be duplicated somewhere. This could either be done in ehrQL, which would mean putting all the relevant registration fields on practice_rct, or it could be done in the SQL which defines the practice_rct view. But both options have downsides.

A similar problem occurs for any data associated with a patient's address, but it's worse here because the logic for disambiguating overlapping address registrations is more complicated.

We discussed whether adding an "interval" type (i.e. a type which represents a time range rather than just a single date) might mitigate some of these issues.

We also discussed supporting other kinds of implicit join rather than just those on patient_id.

However, we've also decided that this is too big a question to address at this stage in development and that the original "denormalised contract" approach will be sufficient for now and doesn't prevent us returning to the issue later.

@benbc
Copy link
Contributor

benbc commented Dec 2, 2022

@evansd Does Cohort Extractor currently expose any tables without patient ids?

@evansd
Copy link
Contributor Author

evansd commented Dec 2, 2022

Well, Cohort Extractor doesn't really "expose" tables in the same sense, because the table, and the logic involved in querying it are all bundled up together. But we do have tables without patients IDs that are queried by Cohort Extractor. The ones that immediately spring to mind are:
https://reports.opensafely.org/reports/opensafely-tpp-database-schema/#Household
https://reports.opensafely.org/reports/opensafely-tpp-database-schema/#Organisation

We have the various "cluster RCT" properties which are accessed via the registered_practice_as_of() function:
https://github.com/opensafely-core/cohort-extractor/blob/19b3e31b2b8d2247b3cc97fa6d88396dd07273ff/cohortextractor/patients.py#L1039-L1040

But I don't see the corresponding tables in the database report, so maybe these aren't actually used.

@inglesp inglesp added this to the P4 milestone Oct 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

3 participants