Decide how to handle tables without a patient_id column #543
Comments
I think we should handle this by denormalization in the contracts. I think that's similar (or maybe even identical) to Dave's denormalization suggestion here. There is some discussion of this in the Contracts domain modelling doc here and here. I think this is simpler for data definition authors because it's aligned with our no-join philosophy elsewhere.
Is this just a problem for dummy data generation? I don't think this is too big a deal as there will be all sorts of correlations that we need to enforce to get reasonable dummy data which need metadata to describe them.
I don't think I understand this objection. Is it obviated by specifying this all the way back in the contract?
Ditto, maybe?
Just making a quick note of where we got to following our discussion. We've decided there are some potential problems with the denormalisation approach. An example that came up was the "Cluster RCT" data, which provides a whole load of new fields associated with each practice. We obviously don't want to clutter the standard
A similar problem occurs for any data associated with a patient's address, but it's worse here because the logic for disambiguating overlapping address registrations is more complicated. We discussed whether adding an "interval" type (i.e. a type which represents a time range rather than just a single date) might mitigate some of these issues. We also discussed supporting other kinds of implicit join rather than just those on patient_id.
However, we've also decided that this is too big a question to address at this stage in development and that the original "denormalised contract" approach will be sufficient for now and doesn't prevent us returning to the issue later.
@evansd Does Cohort Extractor currently expose any tables without patient ids?
Well, Cohort Extractor doesn't really "expose" tables in the same sense, because the table and the logic involved in querying it are all bundled up together. But we do have tables without patient IDs that are queried by Cohort Extractor. The ones that immediately spring to mind are:
We have the various "cluster RCT" properties which are accessed via the
But I don't see the corresponding tables in the database report, so maybe these aren't actually used.
As a simple example, suppose we have a practices table which is linked to patients via a practice_registrations table. We want to write a query which fetches the STP of the practice a patient was registered with at a given time. How should this be done?
The simplest approach doesn't involve any new databuilder machinery: we just use the backends.QueryTable class to create a denormalised view on the practice_registrations table which joins in the practices table, so that each registration row also carries columns from its practice, such as practice_stp.
This has the obvious appeal of involving less work, but it has the significant disadvantage that it loses the logical relationship between practice_id and practice_stp, i.e. that patients registered with the same practice should have the same practice STP. In order to generate sensible dummy data we'd therefore need some additional mechanism for specifying this logical relationship.
I think it also makes documenting the schema more awkward as you then need to make it clear that certain fields in fact belong to other objects.
Both the above problems are compounded in cases where there is more than one "route" from patient-associated data to some non-patient table. In these cases we'll end up having to duplicate column documentation.
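
To make the trade-off concrete, here is a minimal sketch of the denormalised-view approach using Python's standard sqlite3 module rather than databuilder itself. The table names and the practice_id / practice_stp columns come from the discussion above; the date_start and date_end columns, the sample rows, and both helper functions are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE practices (
        practice_id INTEGER PRIMARY KEY,
        practice_stp TEXT
    );
    CREATE TABLE practice_registrations (
        patient_id INTEGER,
        practice_id INTEGER REFERENCES practices (practice_id),
        date_start TEXT,  -- assumed column, ISO dates
        date_end TEXT     -- assumed column, ISO dates
    );
    INSERT INTO practices VALUES (10, 'STP_A'), (20, 'STP_B');
    INSERT INTO practice_registrations VALUES
        (1, 10, '2015-01-01', '2019-12-31'),
        (1, 20, '2020-01-01', '9999-12-31'),
        (2, 10, '2018-06-01', '9999-12-31');

    -- The denormalised view: one row per registration, with the practice's
    -- STP copied alongside practice_id.
    CREATE VIEW practice_registrations_denormalised AS
    SELECT r.patient_id, r.practice_id, p.practice_stp, r.date_start, r.date_end
    FROM practice_registrations AS r
    JOIN practices AS p ON p.practice_id = r.practice_id;
    """
)


def stp_on(date):
    """Map patient_id to the STP of the practice registered on the given date."""
    rows = conn.execute(
        """
        SELECT patient_id, practice_stp
        FROM practice_registrations_denormalised
        WHERE date_start <= ? AND ? <= date_end
        ORDER BY patient_id
        """,
        (date, date),
    ).fetchall()
    return dict(rows)


def practices_with_conflicting_stp(rows):
    """Return practice_ids that appear with more than one STP in flat rows."""
    seen = {}
    for practice_id, practice_stp in rows:
        seen.setdefault(practice_id, set()).add(practice_stp)
    return sorted(pid for pid, stps in seen.items() if len(stps) > 1)


stp_2019 = stp_on("2019-06-01")
stp_2021 = stp_on("2021-01-01")

# Rows read from the real view are consistent because they come from a join.
view_rows = conn.execute(
    "SELECT practice_id, practice_stp FROM practice_registrations_denormalised"
).fetchall()

# Dummy rows generated column-by-column for the flat shape need not be.
naive_dummy_rows = [(10, "STP_A"), (10, "STP_B"), (20, "STP_B")]
```

Nothing in the flat shape of the view stops a dummy data generator from producing rows like naive_dummy_rows, where practice 10 has two different STPs. That is the extra metadata the denormalised approach would need to carry separately.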
So I think it would be better if these relationships were modelled explicitly in ehrQL/QM. In terms of the syntax I picture something like:
As long as we're only dealing with many-one relationships then I don't think this introduces any fundamental changes to the semantics of ehrQL — it's just making explicit what the "denormalised view" trick made hidden.
The thing I'm not yet clear on is how such relationships are defined in the schema. At the moment table schemas are nothing more than a dict mapping string column names to types. Obviously they'll need to be something more if they're to capture these foreign key relations. But I'm not quite sure what that looks like.
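
One possible shape for such a schema, offered only as a sketch and not as anything databuilder currently provides: keep the existing mapping of column names to types, and add a second mapping from a column to the table and key it refers to. The ForeignKey and TableSchema classes, the follow helper, and the date columns are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ForeignKey:
    """Hypothetical: a many-to-one link from a column to another table's key."""

    table: str
    column: str


@dataclass
class TableSchema:
    """Hypothetical schema: column types plus optional foreign keys."""

    columns: dict  # column name -> type, as in the current dict schemas
    foreign_keys: dict = field(default_factory=dict)  # column name -> ForeignKey


SCHEMAS = {
    "practices": TableSchema(columns={"practice_id": int, "practice_stp": str}),
    "practice_registrations": TableSchema(
        columns={
            "patient_id": int,
            "practice_id": int,
            "date_start": str,  # assumed column
            "date_end": str,  # assumed column
        },
        foreign_keys={"practice_id": ForeignKey("practices", "practice_id")},
    ),
}


def follow(schemas, data, table, row, fk_column, target_column):
    """Resolve target_column on the single row that fk_column points to."""
    fk = schemas[table].foreign_keys[fk_column]
    matches = [r for r in data[fk.table] if r[fk.column] == row[fk_column]]
    if len(matches) != 1:
        # A many-to-one relationship must resolve to exactly one target row.
        raise ValueError("expected exactly one matching row (many-to-one)")
    return matches[0][target_column]


DATA = {
    "practices": [
        {"practice_id": 10, "practice_stp": "STP_A"},
        {"practice_id": 20, "practice_stp": "STP_B"},
    ],
}
registration = {
    "patient_id": 1,
    "practice_id": 20,
    "date_start": "2020-01-01",
    "date_end": "9999-12-31",
}
stp = follow(
    SCHEMAS, DATA, "practice_registrations", registration, "practice_id", "practice_stp"
)
```

Because each registration resolves to exactly one practice, following the link keeps one value per registration row, which is why many-one relationships leave the row-level semantics unchanged. The same foreign_keys mapping could also tell a dummy data generator that practice_stp is determined by practice_id.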