Add Therapeutics dataset (Antibody and antivirals) #729
Conversation
```python
    f"coalesce(',' + NULLIF({part}, ''), '')" for part in parts
)
# use stuff to remove the first ','
column_definition = f"STUFF({coalesced_parts}, 1, 1, '')"
```
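To illustrate what the generated SQL does (this is not from the PR), the `COALESCE`/`NULLIF`/`STUFF` pattern can be emulated in plain Python; `join_nonempty` is a hypothetical helper name:

```python
def join_nonempty(parts):
    # Emulate COALESCE(',' + NULLIF(p, ''), ''): empty values contribute
    # nothing, non-empty values contribute "," + value
    coalesced = "".join("," + p if p else "" for p in parts)
    # Emulate STUFF(..., 1, 1, ''): strip the leading comma
    return coalesced[1:]

print(join_nonempty(["a", "", "b"]))  # -> "a,b"
```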
There was a request to also remove duplicates that may appear in this joined string. That would be fairly simple if TPP were using a modern version of SQL Server, but since we can't use STRING_SPLIT, it would involve a complicated XML parse which doesn't seem worth it, especially given that removing duplicates should be pretty simple after data extraction. Also, I don't think there are currently any patients with risk groups in more than one of these fields, so it's likely to be an infrequent occurrence anyway.
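For what it's worth, the post-extraction de-duplication mentioned above is indeed simple; a minimal sketch (hypothetical helper name, assuming a comma-joined string as produced by this query):

```python
def dedupe_joined(value):
    """Remove duplicate entries from a comma-joined string, preserving order."""
    # dict.fromkeys keeps first occurrence and preserves insertion order
    return ",".join(dict.fromkeys(v for v in value.split(",") if v))

print(dedupe_joined("groupA,groupB,groupA"))  # -> "groupA,groupB"
```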
evansd left a comment
This all looks good to me. An impressive understanding of all the weird idioms of the cohortextractor! Though also a sad reflection of the state of some of the data we have to work with.
```python
    All columns are included, including those that we don't use, so that when we remove
    duplicates, we only remove complete duplicate rows
    """
    if self._therapeutics_table_name is None:
```
Building a temp table for just this data, to get it into the format we want, is a neat approach.
```python
    filter_conditions.append(f"CurrentStatus IN ({', '.join(statuses)})")

# Data (Jan 2022) contains the following values:
# 'Casirivimab and imdevimab ' [note trailing space], 'Molnupiravir', 'Remdesivir',
# 'sarilumab', 'Sotrovimab', 'Tocilizumab'
```
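Since the raw values contain trailing spaces and inconsistent casing, a researcher might want to normalise them after extraction. This is a hypothetical post-processing sketch, not part of the PR:

```python
def normalise_therapeutic(value):
    # Strip stray whitespace and lower-case so that e.g.
    # 'Casirivimab and imdevimab ' and 'sarilumab' compare consistently
    return value.strip().lower()

raw = ["Casirivimab and imdevimab ", "sarilumab", "Sotrovimab"]
print([normalise_therapeutic(v) for v in raw])
# -> ['casirivimab and imdevimab', 'sarilumab', 'sotrovimab']
```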
```python
elif returning == "risk_group":
    # First remove any "Patients with a" and replace " and " with "," within individual risk group fields
    # Then join the 3 risk cohort fields with ","
    # Note that the last "s" in SOT02_risk_cohorts is correct
```
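The string transformations described in these comments can be sketched in Python (the field values below are made up for illustration, and the helper names are hypothetical; the actual implementation does this in SQL):

```python
def clean_risk_field(value):
    # Remove the "Patients with a " prefix and turn " and "-separated
    # lists into comma-separated ones
    return value.replace("Patients with a ", "").replace(" and ", ",")

def join_risk_cohorts(fields):
    # Join the (non-empty) risk cohort fields with ","
    return ",".join(clean_risk_field(f) for f in fields if f)

print(join_risk_cohorts(["Patients with a renal disease", "", "liver disease"]))
# -> "renal disease,liver disease"
```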
cohortextractor/tpp_backend.py (Outdated)
```python
# There can be duplicates per patient in the Therapeutics dataset
# These are likely to be invalid in some way, but we keep them in so users
# can identify them and deal with them as appropriate
# (We remove fully duplicate rows in the temp table only)
# Duplicate rows are sorted by all the fields that have been identified to
# contain duplicate values for a patient, to ensure a consistent return value
```
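The consistent-ordering behaviour described in the comment above can be sketched as: sort a patient's candidate rows by all fields, then take the first. The example data is hypothetical:

```python
# Two conflicting rows for the same patient; all fields except region agree
rows = [
    {"patient_id": 1, "region": "North East", "risk_group": "x"},
    {"patient_id": 1, "region": "London", "risk_group": "x"},
]
# Sorting by every field makes the choice deterministic:
# "London" sorts before "North East", so the same row is always returned
first = sorted(rows, key=lambda r: tuple(str(v) for v in r.values()))[0]
print(first["region"])  # -> "London"
```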
I misunderstood this comment at first, but I understand now, having seen the conversation on the original ticket. So we only ever return one result per patient (as we must), but when we count results we include duplicates, and this allows researchers to spot that duplicates are present. Is that right?
Yes, there can be duplicates per patient which differ on one or more fields, but the assumption (for now at least) is that those duplicates are errors; mostly the values are the same apart from one or two fields. We only ever return one row per patient, either the first or last by TreatmentStartDate, but where there are duplicates we want the sorting to be consistent. There's no sensible way to pick which e.g. region is the correct one if all other fields are the same, so this just makes sure that it always returns the same one (the first, alphabetically). The idea is that researchers can use the `number_of_matches_in_period` return value to check whether they have duplicates and exclude those patients from further analysis.
I'll see if I can make this comment more understandable :)
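A sketch of the suggested researcher workflow, using made-up data and the `number_of_matches_in_period` idea from this thread (column and variable names here are hypothetical):

```python
# Each extracted patient row carries a count of matching therapeutics records;
# a count greater than 1 signals duplicates to be excluded from analysis
patients = [
    {"patient_id": 1, "number_of_matches_in_period": 1},
    {"patient_id": 2, "number_of_matches_in_period": 3},  # duplicates present
]
clean = [p for p in patients if p["number_of_matches_in_period"] == 1]
print([p["patient_id"] for p in clean])  # -> [1]
```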
Fixes #713
Adds the new Therapeutics dataset with a `with_covid_therapeutics` study definition query function, as described in #713.

Note that there are some duplicate rows for patients, which are probably not valid, and some "bad" dates that are probably typos (e.g. year 5202). I'm not excluding or attempting to fix any of these; for now I'll leave that to the researcher to exclude, by setting a sensible date range to filter on, and by creating variables to target specific therapeutics of interest.