This repository was archived by the owner on Jun 26, 2025. It is now read-only.

Add Therapeutics dataset (Antibody and antivirals)#729

Merged: rebkwok merged 7 commits into main from therapeutics on Feb 2, 2022

Contributor

@rebkwok rebkwok commented Feb 2, 2022

Fixes #713

Adds the new Therapeutics dataset with a with_covid_therapeutics study definition query function, as described in #713

covid_therapeutics = patients.with_covid_therapeutics(
    returning="date",
    with_these_statuses=status_codelist,
    with_these_therapeutics=therapeutic_codelist,
    with_these_indications=indication_codelist,
    on_or_after="index_date",
    date_format="YYYY-MM-DD",
    return_first_match_in_period=True,
)

Note that there are some duplicate rows for patients, which are probably not valid, and some "bad" dates that are probably typos (e.g. year 5202). I'm not excluding or attempting to fix any of these; for now I'll leave it to the researcher to exclude them by setting a sensible date range to filter on, and by creating variables to target specific therapeutics of interest.
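To illustrate why a bounded date range is enough to drop the typo dates, here is a minimal standalone sketch in plain Python. It is not cohortextractor code; the window bounds and the sample dates are made up for illustration.

```python
from datetime import date

# Hypothetical analysis window; pick bounds that suit the study.
WINDOW_START = date(2021, 12, 1)
WINDOW_END = date(2022, 12, 31)


def in_window(treatment_start, start=WINDOW_START, end=WINDOW_END):
    """Return True if the date falls inside the inclusive window."""
    return start <= treatment_start <= end


# Made-up sample values, including a typo-style year like the one noted above.
sample_dates = [date(2021, 12, 20), date(5202, 1, 5), date(2022, 1, 10)]
plausible = [d for d in sample_dates if in_window(d)]
```

In a study definition the same effect would presumably come from passing a bounded range instead of an open-ended `on_or_after`; whether that is done with a `between` argument is an assumption here and should be checked against the function's documentation.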

f"coalesce(',' + NULLIF({part}, ''), '')" for part in parts
)
# use stuff to remove the first ','
column_definition = f"STUFF({coalesced_parts}, 1, 1, '')"
Contributor Author

There was a request to also remove duplicates that may appear in this joined string. That would be fairly simple if TPP were using a modern version of SQL Server, but since we can't use STRING_SPLIT, it would involve a complicated XML parse, which doesn't seem worth it, especially given that removing duplicates should be pretty simple after data extraction. Also, I don't think any patients currently have risk groups in more than one of these fields, so it's likely to be an infrequent occurrence anyway.
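As a rough idea of what post-extraction de-duplication could look like, here is a small Python sketch. The function name `dedupe_joined` and the sample risk group labels are hypothetical; only the comma-joined format comes from the code above.

```python
def dedupe_joined(value, sep=","):
    """Drop repeated items from a separator-joined string, keeping first-seen order."""
    seen = set()
    unique = []
    for item in value.split(sep):
        item = item.strip()
        # Skip empty fragments and anything already emitted.
        if item and item not in seen:
            seen.add(item)
            unique.append(item)
    return sep.join(unique)


# Made-up example value with one repeated risk group.
cleaned = dedupe_joined("solid cancer,renal disease,solid cancer")
```

Order is preserved so that the output stays comparable with the raw extracted column.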

Contributor

@evansd evansd left a comment

This all looks good to me. An impressive understanding of all the weird idioms of the cohortextractor! Though also a sad reflection of the state of some of the data we have to work with.

All columns are included, including those that we don't use, so that when we remove
duplicates, we only remove complete duplicate rows
"""
if self._therapeutics_table_name is None:
Contributor

Building a temp table for just this data to get it into the format we want is a neat approach.
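The docstring's point about selecting every column can be sketched with an in-memory SQLite database standing in for SQL Server. The table and column names below are hypothetical and the rows are invented; the real backend builds its temp table in T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE therapeutics (patient_id INTEGER, intervention TEXT, "
    "start_date TEXT, region TEXT)"
)
rows = [
    (1, "Sotrovimab", "2022-01-05", "London"),
    (1, "Sotrovimab", "2022-01-05", "London"),      # exact duplicate: dropped
    (1, "Sotrovimab", "2022-01-05", "North West"),  # differs on region: kept
    (2, "Molnupiravir", "2022-01-07", "London"),
]
conn.executemany("INSERT INTO therapeutics VALUES (?, ?, ?, ?)", rows)

# Selecting every column with DISTINCT removes only fully identical rows.
conn.execute(
    "CREATE TEMP TABLE therapeutics_dedup AS SELECT DISTINCT * FROM therapeutics"
)
deduped = conn.execute(
    "SELECT * FROM therapeutics_dedup ORDER BY patient_id, region"
).fetchall()
```

Had the temp table kept only a subset of columns, the third row could have collapsed into the first and the partial duplicate would no longer be visible to researchers.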

filter_conditions.append(f"CurrentStatus IN ({', '.join(statuses)})")

# Data (Jan 2022) contains the following values:
# 'Casirivimab and imdevimab '[note trailing space], 'Molnupiravir', 'Remdesivir', 'sarilumab', 'Sotrovimab' , 'Tocilizumab'
Contributor

[note trailing space]

Sigh
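For anyone comparing these values after extraction, a whitespace- and case-insensitive match sidesteps both the trailing space and the lowercase 'sarilumab'. This is only a sketch of that idea, not necessarily how the query itself matches; `normalise` is a hypothetical helper and the list is copied from the code comment above.

```python
def normalise(name):
    """Trim surrounding whitespace and ignore case for comparison."""
    return name.strip().casefold()


# Values as listed in the code comment, including the trailing space.
raw_values = [
    "Casirivimab and imdevimab ",
    "Molnupiravir",
    "Remdesivir",
    "sarilumab",
    "Sotrovimab",
    "Tocilizumab",
]

wanted = {normalise("casirivimab and imdevimab"), normalise("Sarilumab")}
matches = [value for value in raw_values if normalise(value) in wanted]
```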

elif returning == "risk_group":
# First remove any "Patients with a" and replace " and " with "," within individual risk group fields
# Then join the 3 risk cohort fields with ","
# Note that the last "s" in SOT02_risk_cohorts is correct
Contributor

Also sigh
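The clean-up described in that code comment can be mirrored in Python to see what the resulting string looks like. This is a sketch of the same steps, not the SQL the PR generates; the helper names and sample field values are hypothetical.

```python
def clean_field(value):
    """Remove 'Patients with a' and turn ' and ' separators into commas."""
    value = (value or "").replace("Patients with a", "")
    return value.replace(" and ", ",").strip()


def join_risk_groups(fields):
    """Join the non-empty cleaned fields with commas."""
    cleaned = [clean_field(field) for field in fields]
    return ",".join(part for part in cleaned if part)


# Made-up values for the three risk cohort fields; one is empty.
joined = join_risk_groups(
    ["Patients with a solid cancer", "", "liver disease and HIV or AIDS"]
)
```

Empty and missing fields are skipped, so the result has no leading or doubled commas, which is the job the coalesce/NULLIF/STUFF combination does on the SQL side.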

Comment on lines +2930 to +2935
# There can be duplicates per patient in the Therapeutics dataset
# These are likely to be invalid in some way, but we keep them in so users
# can identify them and deal with them as appropriate
# (We remove fully duplicate rows in the temp table only)
# Duplicate rows are sorted by all the fields that have been identified to
# contain duplicate values for a patient, to ensure a consistent return value
Contributor

I misunderstood this comment at first, but I understand now having seen the conversation on the original ticket. So we only ever return one result per patient (as we must), but when we count results we include duplicates, and so this allows researchers to spot that the duplicates are present. Is that right?

Contributor Author

Yes, so there can be duplicates per patient, which differ on one or more fields, but the assumption (for now at least) is that those duplicates are errors. Mostly the values are the same apart from one or two fields. We only ever return one row per patient, either the first or last by TreatmentStartDate, but in the case where there are duplicates, we want the sorting to be consistent. There's no sensible way to pick which region, for example, is the correct one if all other fields are the same, so this just makes sure that it always returns the same one (the first, alphabetically). The idea is that researchers can use the number_of_matches_in_period return value to check if they have duplicates and exclude those patients from further analysis.

I'll see if I can make this comment more understandable :)
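A small Python sketch of the selection rule as described in this reply, for one patient's rows. The field names `Region` and `Intervention`, the helper `pick_row`, and the sample data are hypothetical, and applying the alphabetical tie-break to the "last" case as well is an assumption.

```python
def pick_row(rows, first=True):
    """Return (chosen_row, number_of_matches) for one patient's rows.

    Dates are ISO strings, so string comparison orders them correctly.
    """
    # Drop fully identical rows, mirroring the temp-table step.
    unique_rows = [dict(items) for items in {tuple(sorted(r.items())) for r in rows}]
    if not unique_rows:
        return None, 0
    dates = [r["TreatmentStartDate"] for r in unique_rows]
    target = min(dates) if first else max(dates)
    candidates = [r for r in unique_rows if r["TreatmentStartDate"] == target]
    # Tie-break on the remaining fields so the result is always the same.
    chosen = min(candidates, key=lambda r: (r["Region"], r["Intervention"]))
    return chosen, len(unique_rows)


patient_rows = [
    {"TreatmentStartDate": "2022-01-05", "Region": "North West", "Intervention": "Sotrovimab"},
    {"TreatmentStartDate": "2022-01-05", "Region": "London", "Intervention": "Sotrovimab"},
    {"TreatmentStartDate": "2022-01-05", "Region": "London", "Intervention": "Sotrovimab"},
]
chosen, n_matches = pick_row(patient_rows)
```

A count above 1 for what should be a single treatment is the signal a researcher could use to exclude the patient from further analysis.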

@rebkwok rebkwok merged commit a0e5542 into main Feb 2, 2022
@rebkwok rebkwok deleted the therapeutics branch February 2, 2022 13:22

Development

Successfully merging this pull request may close these issues.

Antibody and antiviral deployment