# High cost drugs dataset
## Short data report

This short report describes how high cost drugs data can be identified in the OpenSAFELY-TPP database, and the strengths and weaknesses of the data. 

This is a living document that will be updated to reflect changes to the OpenSAFELY-TPP database and the patient records within.

## Introduction

The high cost drugs dataset is a patient level dataset that provides information on the use of drugs that are tariff-excluded. These are medicines that are not reimbursed through the national tariff and are instead commissioned directly by NHS England & Improvement (NHSEI) specialised commissioning or Clinical Commissioning Groups (CCGs).

The main purpose of collecting data on the use of high cost drugs is to support payments from NHSEI specialised commissioning or CCGs to hospitals once a drug has been supplied to a patient. Hospitals fill out a submission for each patient and each high cost drug and submit this to either a CCG or NHSEI specialised commissioning, based on which organisation is the commissioner, to receive payment.

As a secondary use these individual submissions can then be collected together to provide information on the total number of patients treated with high cost drugs, the volume of high cost drugs used and the cost of these medicines to commissioners. The dataset also provides information on clinical indications, the reason a patient was treated with the high cost drug, and the month the patient was treated.

## Creation of the national high cost drugs dataset

The high cost drug payment request submissions are routinely collected together to produce datasets at an individual CCG and NHSEI specialised commissioning level. However, despite clear use cases for national policy and clinical teams, academics and other stakeholders, this data had never been collected together to provide a national overview of the use of high cost drugs in England prior to the work of the North of England Commissioning Support Unit (NECS) and the DataLab in spring/summer 2020.

*To add information on how national data collection was proposed/approved*

To create the first national high cost drugs dataset the NECS collected submissions from all commissioners in England - this covers *135?* CCGs and NHSEI specialised commissioning. The scope of the original dataset was all high cost drugs submissions from FY 2018/19 and FY 2019/20.

The NECS faced a number of challenges in collecting and collating the submissions from this range of providers, particularly around data uniformity and validation. These are discussed further in the "Challenges" section of this data report.

## Variable overview

The national high cost drugs dataset is a patient level dataset and includes variables on patient characteristics, clinical indications and medicine prescribed.

The data collection of individual submissions is called the Drugs Patient Level Contract Monitoring Data Set and the national specification for submissions is published on the [NHS Data Model and Dictionary website](https://datadictionary.nhs.uk/data_sets/supporting_data_sets/drugs_patient_level_contract_monitoring_data_set.html).

The NECS collated the submissions from each CCG and NHSEI specialised commissioning to create a national dataset. The version of this dataset shared with OpenSAFELY and TPP includes a subset of the variables from the national specification and some derived variables. It only includes patients who are or were registered at a GP practice that uses the TPP EHRC systems. This data collection is therefore a sample of the full national high cost drugs dataset that the NECS produced.

*Don't know whether to include this bit as only saved on OpenSAFELY google drive at mo - would need to be published*
NECS provided a metadata file for the high cost drugs dataset; this is [saved here.](https://docs.google.com/spreadsheets/d/1i_Ux8UveZR8brMDO0-_TV_tYfWMLSSHY/edit?rtpof=true#gid=1394523185)

### Table one: Variables included in the OpenSafely-TPP high cost drugs dataset

|Variable Name | Variable Type | Specification details | Other information |
:--------------|:---------------|:---------------------|:-------------------|
|Patient_Id | n10 | Mandatory where relevant | Pseudonymised patient id, used to match dataset to other datasets within OpenSAFELY-TPP.|
| FinancialMonth|max an2| Mandatory  | Financial month the prescribed item was administered to patient.<br> 1 = April <br> ... <br> 12 = March|
|FinancialYear | an6 | Mandatory| Financial year the prescribed item was administered to patient.  <br> FY 2017/18 = 201718|
|PersonAge | n | Derived | Age of patient when prescribed item was administered to patient. <br> Some submissions included age at intervention. <br> Where missing this variable was derived using clinical intervention date and date of birth.|
|PersonGender | an1 |Mandatory where relevant | Gender as stated by the patient. <br> 1 = Male <br> 2 = Female <br> 9 = Indeterminate (unable to be classified as either male or female)|
|ActivityTreatmentFunctionCode | an3 |Mandatory where relevant <br> Full list of codes [here](https://datadictionary.nhs.uk/data_elements/activity_treatment_function_code.html) | Code to describe the clinical area that prescribing is taking place in, based on main speciality.|
|TherapeuticIndicationCode | min an6 <br> max an20| Mandatory where relevant <br> SNOMED CT Code| Code used to identify the reason for administering drug to the patient.|
|HighCostTariffExcludedDrugCode|min an6 <br> max an20 |Optional <br> SNOMED CT dm+d | dm+d description of medicine administered to patient. <br> Only populated when provider has dm+d enabled system.|
|DrugName|max an255 | Mandatory where relevant <br> Free text |The name of the prescribed item. <br> Should be the SNOMED CT name. <br> For drugs not listed in dm+d, this must be the valid name in UPPER CASE.|
|Route of Administration|min an6 <br> max an20 |Mandatory where relevant <br> SNOMED CT dm+d|What does this tell us? <br> To be populated at providers with an e-prescribing system.|
|DrugStrength|max an100 |Mandatory where relevant | The amount of ingredient substance in the prescribed item.|
|DrugVolume|max an100| Mandatory where relevant| The volume of the drug administered to a patient when given in liquid form.|
|DrugPackSize| max an100| Optional| The amount of product in a pack or container.|
|DrugQuanitityOrWeightProportion*|max n4.max n4|Mandatory where relevant | The quantity prescribed in terms of either the packsize or number of doses. <br> * To note, the variable name is misspelled.|
|UnitOfMeasurement | |Mandatory where relevant <br> SNOMED CT dm+d | Describes what the DrugQuantityOrWeightProportion variable is measuring.|
|DispensingRoute|an1| Mandatory where relevant|Describes where the prescription item was dispensed to the patient. <br> 1 = Inpatient (via Internal Pharmacy) <br> 2 = Outpatient (via Internal Pharmacy) <br> 3 = Outsourced Pharmacy <br> 4 = Homecare Delivery <br> 5 = Community Pharmacy (FP10) <br> 6 = Other (not listed)|
|HomeDeliveryCharge|max n18.max n8 | Mandatory| The amount charged for delivery of item to patient's home. |
|TotalCost| max n18.max n8| Mandatory| The total cost of the activity that includes any agreed adjustments.|
|DerivedSNOMEDFromName|max an255 | Derived by NECS | dm+d code derived from DrugName variable. <br> Majority of values are missing.|
|DerivedVTM|max an255| Derived by NECS | VTM code derived from ? <br> Many missing values.|
|DerivedVTMName| max an255| Derived by NECS |VTM name derived from VTM code. <br> Many missing values.|

## Variable discussion in detail

In [16]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [22]:
# import the libraries used in this report
import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt

### Patient ID, Financial Year and Financial Month

The patient ID in the national high cost drugs dataset is used to match the information from this dataset to other patient level data included in the OpenSAFELY-TPP environment. This ID allows OpenSafely-TPP users to include information from other data sources on the platform (e.g. hospital episodes or COVID-19 testing) in any analysis of high cost drugs use.

The financial year and financial month variables are stored separately, which makes time period analysis a little more complex than if this information were stored as one variable. The OpenSAFELY cohort extractor has been developed so that users can query dates in the routine way; the translation from routine dates to separate financial year and financial month filters is done in the background of the OpenSAFELY-TPP platform.
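
As a rough illustration of the translation involved, the sketch below converts a calendar date into the two stored values. It assumes the encodings listed in Table one (financial month 1 = April, financial year written in the form 201718). The function name `to_financial_period` is hypothetical and is not part of the OpenSAFELY cohort extractor.

```python
from datetime import date
from typing import Tuple


def to_financial_period(d: date) -> Tuple[str, int]:
    """Return (FinancialYear, FinancialMonth) for a calendar date.

    Assumes the financial year runs April to March, with
    FinancialMonth 1 = April and 12 = March.
    """
    # Dates in January to March belong to the financial year that started the previous April.
    start_year = d.year if d.month >= 4 else d.year - 1
    end_year_short = (start_year + 1) % 100
    financial_year = f"{start_year}{end_year_short:02d}"
    # Shift the calendar month so that April maps to 1 and March maps to 12.
    financial_month = (d.month - 4) % 12 + 1
    return financial_year, financial_month


print(to_financial_period(date(2019, 4, 15)))  # April 2019
print(to_financial_period(date(2020, 3, 1)))   # March 2020
```

The financial year is returned as text here; if the column is read in as an integer, the value would need converting before comparison.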

The high cost drugs dataset contains submissions from April 2018 to March 2021. However, there are only a small number of submissions for FY 2020/21 and these are prospective submissions - submitted before the patient has received the medicine. We recommend that these records are excluded from any analysis.

- In FY 2018/19 there are 2.8 million submissions for 1.1 million unique patient IDs. The average number of submissions per patient over the year is 2.63.

- In FY 2019/20 there are 4.0 million submissions for 1.3 million unique patient IDs. The average number of submissions per patient over the year is 3.10.

In [14]:
# build the file path for the csv file and read in the patient record summary
parentDirectory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
path = "/released-output/"
filename = parentDirectory + path + "record_summary_20210216.csv"

cols = [
    "Unique Patient IDs",
    "Number of Records",
    "Number of NULL Patient IDs",
    "Financial Year",
    "Financial Month"]

patient_record_summary = pd.read_csv(filename, header = None, names = cols, index_col = False)
patient_record_summary = patient_record_summary.sort_values(["Financial Year", "Financial Month" ])
patient_record_summary = patient_record_summary[["Financial Year", "Financial Month", "Number of NULL Patient IDs", 
                                                "Unique Patient IDs", "Number of Records"]]
patient_record_summary["Average num of records per patient"] = patient_record_summary["Number of Records"] / patient_record_summary["Unique Patient IDs"]
patient_record_summary_1819_1920 = patient_record_summary[(patient_record_summary["Financial Year"] != 202021)]

In [20]:
agg_list = ["Unique Patient IDs", "Number of Records"]
FY_Summary = patient_record_summary_1819_1920.groupby(["Financial Year"])[agg_list].sum()
FY_Summary["Average num of records per patient"] = FY_Summary["Number of Records"] / FY_Summary["Unique Patient IDs"]
FY_Summary

|Financial Year | Unique Patient IDs | Number of Records | Average num of records per patient |
|:--------------|:-------------------|:------------------|:-----------------------------------|
|201819 | 1064279 | 2799394 | 2.63032 |
|201920 | 1286804 | 3984198 | 3.096196 |
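
One point to keep in mind when reading these figures, assuming the summary file holds one row per financial month (as its column names suggest): a patient who appears in several months is counted once in each month's unique count, so summing the monthly counts gives patient-months rather than distinct patients in the year. The sketch below uses made-up records, not OpenSAFELY data, to show the difference.

```python
import pandas as pd

# Made-up records for illustration only: patient 1 appears in two different months.
records = pd.DataFrame({
    "Patient_Id": [1, 1, 1, 2, 3],
    "FinancialMonth": [1, 1, 2, 1, 2],
})

# Unique patients counted separately within each month, then summed.
monthly_unique = records.groupby("FinancialMonth")["Patient_Id"].nunique()
summed_monthly_unique = int(monthly_unique.sum())

# Unique patients counted once across the whole year.
yearly_unique = int(records["Patient_Id"].nunique())

print(summed_monthly_unique, yearly_unique)
```

If that assumption holds, the averages above are closer to records per patient per month than records per patient per year. Confirming this would need a distinct patient count per financial year taken from the underlying data.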


### Person Age and Person Gender
There are other sources of data for the patient's age and gender in the OpenSAFELY-TPP platform. The most common source for this information is *which is the recommended source for this information?*
A comparison of the patient age and gender from the high cost drugs dataset and patient age and gender from *other source* shows that *detail on how consistent information is with other sources.*

In [None]:
# build the file path for the csv file and read in the age summary
filename = parentDirectory + path + "age_summary_20210216.csv"

cols = [
    "Unique Patient IDs",
    "Number of Records",
    "Age group in FY 201920"]

age_summary = pd.read_csv(filename, header = None, names = cols, index_col = False)
age_summary

### Activity Treatment Function Code and Therapeutic Indication Code
Activity treatment function codes are unique identifiers used to describe the clinical treatment of a patient. A full list of codes mapped to descriptions is published on the NHS Data Model and Dictionary website, [link here](https://datadictionary.nhs.uk/data_elements/activity_treatment_function_code.html).

Therapeutic indication codes are unique SNOMED clinical terminology used to identify the reason for administering a drug to a patient. (*How can users find out what the SNOMED codes mean?*)

### High Cost Tariff Excluded Drug Code and Drug Name

The high cost tariff excluded drug code is an optional variable and will only be populated if the hospital systems use the dm+d drug definitions. (*Is this correct?*) When populated this should be a valid dm+d code. In the current national dataset this is only populated for x% of records.

The drug name variable is mandatory where relevant. Where hospital systems use the dm+d drug definitions this will be the dm+d drug name; otherwise it should be an upper case string and a valid name as listed in the specification (*what is the specification?*). This variable is populated for x% of records. Where the drug name is NULL the record is (*what are the records where null?*)

### Additional drug details

Route of administration is mandatory where relevant. This is a SNOMED CT code and is to be populated by all providers with an e-prescribing system. (*But what does it tell us?*)

Drug strength is mandatory where relevant and should be populated by providers who have not provided a dm+d high cost tariff excluded drug code. When populated this variable must contain the units of strength as well as the amount, e.g. 100mg.

Drug volume is mandatory where relevant and should be populated by providers who have not provided a dm+d high cost tariff excluded drug code. When populated this variable must contain the units of volume as well as the amount, e.g. 100ml.
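
Because strength and volume arrive as text that combines an amount and a unit, analysis will usually need them split apart. The following is a minimal sketch, assuming values shaped like `100mg` or `0.8 ml`. Real submissions may not follow this pattern, and the helper name `split_amount_and_unit` is hypothetical.

```python
import re
from typing import Optional, Tuple

# An amount (integer or decimal), optional whitespace, then a unit made of letters only.
_AMOUNT_UNIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def split_amount_and_unit(value: str) -> Optional[Tuple[float, str]]:
    """Split text such as '100mg' into (amount, unit); return None if it does not match."""
    match = _AMOUNT_UNIT.match(value)
    if match is None:
        return None
    # Lower-case the unit so that 'ML' and 'ml' are treated the same.
    return float(match.group(1)), match.group(2).lower()


print(split_amount_and_unit("100mg"))
print(split_amount_and_unit("0.8 ML"))
print(split_amount_and_unit("40mg/0.8ml"))  # compound strengths are not handled by this pattern
```

Values that return `None` would need a separate rule or manual review rather than being silently dropped.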

Drug pack size is an optional variable and should be populated by providers who have provided a dm+d high cost tariff excluded drug code but where this code does not include pack size information (e.g. is not a VMPP or AMPP code).

Drug quantity or weight proportion is mandatory where relevant and can be populated in two ways:
1. When the drug pack size variable has been populated then the drug quantity or weight proportion should express the quantity as a proportion of the drug pack size. 
2. Where the drug pack size is not relevant then the drug quantity or weight proportion expresses the quantity or weight of drug given i.e. quantity of dose units.

Unit of measurement is mandatory where relevant and is a SNOMED CT code to be populated in conjunction with the drug quantity or weight proportion variable. This describes what the drug quantity or weight proportion variable is measuring, e.g. packs or mg. (*To check - is this correct?*)

Dispensing route is mandatory where relevant and provides information on the setting in which the drug was dispensed.
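
As a small illustration, the single-character codes can be translated into the labels listed in Table one with a lookup. This sketch assumes the codes are held as text (the specification gives the type as an1) and uses made-up values; the column name `DispensingRouteLabel` is hypothetical.

```python
import pandas as pd

# Code-to-label mapping copied from Table one.
DISPENSING_ROUTE_LABELS = {
    "1": "Inpatient (via Internal Pharmacy)",
    "2": "Outpatient (via Internal Pharmacy)",
    "3": "Outsourced Pharmacy",
    "4": "Homecare Delivery",
    "5": "Community Pharmacy (FP10)",
    "6": "Other (not listed)",
}

# Made-up values for illustration; "7" is not a listed code and None is a missing value.
df = pd.DataFrame({"DispensingRoute": ["1", "4", "7", None]})

# Codes without a matching label (and missing codes) come back as NaN.
df["DispensingRouteLabel"] = df["DispensingRoute"].map(DISPENSING_ROUTE_LABELS)

print(df)
```

Leaving unmatched codes as missing, rather than mapping them to "Other", keeps invalid entries visible for data quality checks.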

### Financial information

Home delivery charge is a mandatory variable and provides information on the charge for home delivery. Where this is not relevant or there is no charge this variable should equal zero.

Total cost is a mandatory variable and provides information on the total cost of the drug, including VAT where relevant. The calculation of total cost is:

(unit price as set by the commissioner * drug quantity or weight proportion) + home delivery charge

In cases where VAT is charged this should be added to the above and included in the total cost.
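
A worked example of this calculation, using made-up figures, is sketched below. It treats VAT as a separate amount added at the end, which is one reading of the description above; which components VAT applies to is not stated here, so that part is an assumption. The function name `total_cost` is hypothetical.

```python
from decimal import Decimal


def total_cost(unit_price, quantity, home_delivery_charge=Decimal("0"), vat=Decimal("0")):
    """(unit price * quantity or weight proportion) + home delivery charge, plus VAT if charged."""
    return unit_price * quantity + home_delivery_charge + vat


# Made-up figures: two units at 350.00 each with a 12.50 home delivery charge and no VAT.
example = total_cost(
    unit_price=Decimal("350.00"),
    quantity=Decimal("2"),
    home_delivery_charge=Decimal("12.50"),
)
print(example)
```

`Decimal` is used rather than floating point so that monetary amounts do not pick up rounding error.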

### Derived variables

To bring some uniformity to the drug information included in the national high cost drugs dataset, the NECS attempted to derive some drug descriptor variables in the dm+d format.

Derived SNOMED from name brings in the SNOMED description (*which snomed description?*) based on the drug name variable. If there is no match between the drug name variable and a VMP/VTM (*what variable do they match on?*) then this returns a NULL value.

Derived VTM is the virtual therapeutic moiety code that can be matched to the derived SNOMED code from the drug name. If there is no match then this returns a NULL value.

Derived VTM name is the VTM name that can be matched to the derived VTM code. If there is no match then this returns a NULL value.

## Challenges

**Caveats**
* The NECS are unable to say with any certainty that the data received is a complete data set for each Provider and Commissioner. 
* The drug names and other text inputs are non-standardised; further work is required to standardise variables in the dataset.
* The NECS are unable to comment on the level of data validation that has been undertaken before it was submitted.

Further information on the NECS data collation and cleaning processes can be found in [this document.](https://docs.google.com/document/d/1JbUPp962KRNGsIC1wThexdMUrV0XOb-D/edit#heading=h.gjdgxs)

**Time period covered by the dataset**

The high cost drugs dataset covers data from April 2018 to March 2021. Most records appear between April 2018 and March 2020. We don't yet know why there are records from "the future" (beyond April 2020 at the time NECS collected the data submissions).

In [10]:
# bring in libraries
import pandas as pd
import numpy as np
import os

# steps to get the correct filepath for the csv file
os.getcwd()
parentDirectory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
path = "/released-output/"
filename = parentDirectory + path + "year_month_info.csv"

cols = [
    "Year",
    "Financial Month",
    "Number of records"]

month_summary = pd.read_csv(filename, header = None, names = cols, index_col = False)
month_summary

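| | Year | Financial Month | Number of records |
|:--|:-----|:----------------|:------------------|
|0 | 201819 | NaN | 885 |
|1 | 201819 | 1.0 | 214294 |
|2 | 201819 | 2.0 | 233206 |
|3 | 201819 | 3.0 | 223959 |
|4 | 201819 | 4.0 | 236186 |
|5 | 201819 | 5.0 | 240151 |
|6 | 201819 | 6.0 | 220098 |
|7 | 201819 | 7.0 | 248774 |
|8 | 201819 | 8.0 | 241963 |
|9 | 201819 | 9.0 | 226459 |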


**Non-standardised drug names**

The drug names in the high cost drugs dataset are not in the dm+d format. The examples examined suggest that drug names can be entered as free text during data input, so a single medicine can appear under many different variations.

Using adalimumab as an example - the code below lists all the unique drug names which contain the word "adalimumab". There are 427 unique drug names (with variations on upper and lower case characters).

Bespoke code lists will be required to pick up all relevant medicines use from the high cost drugs dataset in its current form - it is not going to be enough to search on dm+d values alone.

In [11]:
# steps to get the correct filepath for the csv file
os.getcwd()
parentDirectory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
path = "/released-output/"
filename = parentDirectory + path + "drug_name_summary.csv"

# reading in data summary and setting column names
cols = [
    "DrugName",
    "HighCostTariffExcludedDrugCode",
    "DerivedSNOMEDFromName",
    "DerivedVTM",
    "DerivedVTMName",
    "NumberOfAppearances"]

drug_name_summary = pd.read_csv(filename, header = None, names = cols, index_col = False)
searchfor = ['adalimumab', 'Adalimumab', "ADALIMUMAB"]

adalimumab = drug_name_summary[drug_name_summary['DrugName'].str.contains('|'.join(searchfor),na = False)] 

adalimumab_drugname_unique = adalimumab.DrugName.unique()
len(adalimumab_drugname_unique)

427

In [6]:
adalimumab_drugname_unique

array(['HC ADALIMUMAB IMRALDI 40 mg Injection Pre Filled Pen',
       'ADALIMUMAB', 'ADALIMUMAB (D2E7) - HOMECARE 40 mg Preloaded Pen',
       'ADALIMUMAB (IMRALDI) (HOMECARE)', 'ADALIMUMAB (IMRALDI) ',
       'HC ADALIMUMAB HUMIRA 40 mg Injection Pre Filled Pen',
       'Adalimumab 40mg/0.8ml solution for injection pre-filled disposable devices',
       'HOMECARE ADALIMUMAB (IMRALDI)', 'ADALIMUMAB REFERENCE PRICE',
       'HOMECARE IMRALDI (ADALIMUMAB)', 'ADALIMUMAB(AMGEVITA)',
       'HOMECARE - ADALIMUMAB (AMGEVITA) 40 mg in 0.8ml Auto Injector Pen',
       'ADALIMUMAB (HUMIRA)_(HOMECARE) 40 mg in 0.4mL Pre-filled Injection Pen',
       'ADALIMUMAB (HUMIRA) (HOMECARE)', 'ADALIMUMAB(HUMIRA)',
       'Adalimumab (Homecare)', 'Adalimumab',
       'HOMECARE AMGEVITA (ADALIMUMAB)',
       'HOMECARE ADALIMUMAB!40mg/0.8mL! PEN (HYRIMOZ)',
       'IMRALDI (HOMECARE PEN PACK) ADALIMUMAB',
       'ADALIMUMAB [IMRALDI] (HOMECARE) 40 mg/0.8mL Soln for Injection PF Pen',
       'ADALIMUMAB (IMRA
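
One possible first step towards a bespoke code list is sketched below: a case-insensitive search followed by a light normalisation of case and whitespace. It runs on made-up rows shaped like `drug_name_summary`, not on the real data. It is an alternative to the three-spelling search used above, so it may not reproduce the count of 427, and the column name `NormalisedName` is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration, in the same shape as drug_name_summary.
drug_names = pd.DataFrame({
    "DrugName": [
        "ADALIMUMAB (HUMIRA)",
        "Adalimumab (Homecare)",
        "HOMECARE IMRALDI (ADALIMUMAB)",
        "adalimumab  (humira)",
        "TACROLIMUS",
        None,
    ],
    "NumberOfAppearances": [10, 5, 3, 2, 7, 1],
})

# case=False matches any mix of upper and lower case; na=False skips missing names.
is_adalimumab = drug_names["DrugName"].str.contains("adalimumab", case=False, na=False)
matches = drug_names[is_adalimumab].copy()

# Collapse case and repeated whitespace so trivially different spellings group together.
matches["NormalisedName"] = (
    matches["DrugName"].str.upper().str.replace(r"\s+", " ", regex=True).str.strip()
)

print(len(matches), matches["NormalisedName"].nunique())
```

Normalisation of this kind only removes trivial differences. Brand names, homecare prefixes and dose information embedded in the text would still need to be mapped by hand or with a curated lookup.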

**Top 20 drug names**

The table below summarises the 20 unique drug names that appear most often in the dataset.

In [41]:
drug_name_summary["NumberOfAppearances - no NULLS"] = drug_name_summary.fillna(0)["NumberOfAppearances"]

drug_name_summary_top20 = drug_name_summary.groupby('DrugName', dropna = False).agg({"NumberOfAppearances - no NULLS": "sum"})
drug_name_summary_top20 = drug_name_summary_top20.sort_values(by = "NumberOfAppearances - no NULLS", ascending = False)
drug_name_summary_top20.head(20)

|DrugName | NumberOfAppearances - no NULLS |
|:--------|:-------------------------------|
|FLUOROURACIL | 154970.0 |
|CAPECITABINE | 98188.0 |
|FILGRASTIM | 97544.0 |
|CYCLOPHOSPHAMIDE | 97044.0 |
|TACROLIMUS | 82349.0 |
|NaN | 82302.0 |
|CARBOPLATIN | 80951.0 |
|PACLITAXEL | 75804.0 |
|DEXAMETHASONE | 55976.0 |
|OXALIPLATIN | 55448.0 |
