**High cost drugs dataset**

The high cost drugs dataset brings together patient-level high cost drugs data submissions from all regional Data Services for Commissioners Regional Offices (DSCROs). The North of England Commissioning Support Unit (NECS) collected submissions from all Acute Providers and Commissioners nationally. The current dataset contains 6.8 million records.

**Caveats**
* The NECS are unable to say with any certainty that the data received is a complete data set for each Provider and Commissioner. 
* The drug names and other text inputs are non-standardised; further work is required to standardise variables in the dataset.
* The NECS are unable to comment on the level of data validation that has been undertaken before it was submitted.
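
As a first step towards the standardisation mentioned above, free-text values could be trimmed, upper-cased and have repeated whitespace collapsed. The snippet below is a minimal sketch only: `normalise_text` and the sample values are hypothetical illustrations, not part of the NECS cleaning process.

```python
import re
import pandas as pd

# Hypothetical sample of free-text drug names, including a missing value
raw_names = pd.Series([
    "ADALIMUMAB (IMRALDI) ",
    "adalimumab (imraldi)",
    "Adalimumab  (Imraldi)",
    "ADALIMUMAB(HUMIRA)",
    None,
])

def normalise_text(value):
    """Upper-case, trim and collapse whitespace; leave missing values untouched."""
    if pd.isna(value):
        return value
    return re.sub(r"\s+", " ", str(value)).strip().upper()

normalised = raw_names.map(normalise_text)

# compare the number of distinct non-missing values before and after
print(raw_names.nunique(), "distinct raw values")
print(normalised.nunique(), "distinct values after normalisation")
```

This only removes differences in case and spacing. Differences in punctuation, brand names and dose descriptions would still need separate handling.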

Further information on the NECS data collation and cleaning processes can be found in [this document.](https://docs.google.com/document/d/1JbUPp962KRNGsIC1wThexdMUrV0XOb-D/edit#heading=h.gjdgxs)

**Metadata**

The NECS provided a metadata file for the high cost drugs dataset, which is [saved here.](https://docs.google.com/spreadsheets/d/1i_Ux8UveZR8brMDO0-_TV_tYfWMLSSHY/edit?rtpof=true#gid=1394523185)
The dataset includes the following variables:
* Patient_ID
* FinancialMonth
* FinancialYear
* PersonAge
* PersonGender
* ActivityTreatmentFunctionCode
* TherapeuticIndicationCode
* HighCostTariffExcludedDrugCode (2.54m records with NULL value)
* DrugName (82k records with NULL value)
* Route of Administration
* DrugStrength
* DrugVolume
* DrugPackSize
* DrugQuantityOrWeightProportion
* UnitOfMeasurement
* DispensingRoute
* HomeDeliveryCharge
* TotalCost
* DerivedSNOMEDFromName (6.2m records with NULL value - very low coverage)
* DerivedVTM (2.3m records with NULL value)
* DerivedVTMName (2.3m records with NULL value)
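
NULL counts like those listed above could be produced with a per-column summary. The sketch below uses a hypothetical three-row extract (the values, including the code `"X1"`, are made up for illustration) and is not the code that generated the figures in the list.

```python
import pandas as pd

# Hypothetical extract using three of the column names from the list above
sample = pd.DataFrame({
    "DrugName": ["ADALIMUMAB", None, "TACROLIMUS"],
    "HighCostTariffExcludedDrugCode": [None, None, "X1"],  # "X1" is a made-up code
    "DerivedVTM": ["a", None, None],
})

# count and percentage of missing values in each column
null_summary = pd.DataFrame({
    "null_records": sample.isna().sum(),
    "percent_null": (sample.isna().mean() * 100).round(1),
})
print(null_summary)
```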

**Time period covered by the dataset**

The high cost drugs dataset covers data from April 2018 to March 2021. Most records appear between April 2018 and March 2020. We do not yet know why some records are dated in "the future", that is, beyond April 2020 at the time NECS collected the data submissions.

In [10]:
# bring in libraries
import pandas as pd
import numpy as np
import os

# build the path to the csv file in the released-output folder,
# which sits one level above the current working directory
parentDirectory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
filename = os.path.join(parentDirectory, "released-output", "year_month_info.csv")

cols = [
    "Year",
    "Financial Month",
    "Number of records"]

month_summary = pd.read_csv(filename, header = None, names = cols, index_col = False)
month_summary

| | Year | Financial Month | Number of records |
|---|---|---|---|
| 0 | 201819 | NaN | 885 |
| 1 | 201819 | 1.0 | 214294 |
| 2 | 201819 | 2.0 | 233206 |
| 3 | 201819 | 3.0 | 223959 |
| 4 | 201819 | 4.0 | 236186 |
| 5 | 201819 | 5.0 | 240151 |
| 6 | 201819 | 6.0 | 220098 |
| 7 | 201819 | 7.0 | 248774 |
| 8 | 201819 | 8.0 | 241963 |
| 9 | 201819 | 9.0 | 226459 |
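
To look into the records dated in "the future", the financial year and month could be converted to a calendar month and compared with a cutoff. The sketch below rests on two assumptions that are not confirmed by the metadata: that financial month 1 is April, and that a year value such as 201819 starts in 2018. The function name, the record counts and the cutoff date are all hypothetical.

```python
import pandas as pd

def financial_to_calendar(financial_year, financial_month):
    """First day of the calendar month, ASSUMING financial month 1 is April."""
    start_year = int(str(financial_year)[:4])
    month = int(financial_month)
    # months 1-9 map to April-December, months 10-12 to January-March of the next year
    calendar_month = (month + 2) % 12 + 1
    calendar_year = start_year + (1 if month >= 10 else 0)
    return pd.Timestamp(year=calendar_year, month=calendar_month, day=1)

# Hypothetical summary rows; the record counts are made up
summary = pd.DataFrame({
    "Year": [201819, 201819, 201920, 202021],
    "Financial Month": [None, 1, 12, 12],
    "Number of records": [5, 100, 200, 3],
})

# rows with no financial month cannot be placed in time, so drop them first
summary = summary.dropna(subset=["Financial Month"]).copy()
summary["calendar_month"] = [
    financial_to_calendar(year, month)
    for year, month in zip(summary["Year"], summary["Financial Month"])
]

cutoff = pd.Timestamp("2020-04-01")  # assumed cutoff, adjust to the true collection date
after_cutoff = summary[summary["calendar_month"] > cutoff]
print(after_cutoff)
```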


**Non standardised drug names**

The drug names in the high cost drugs dataset are not in the dm+d (Dictionary of medicines and devices) format. The examples we have looked at suggest that drug names can be entered as free text at data input, so a single medicine can appear under many different variations.

Taking adalimumab as an example, the code below lists all the unique drug names that contain the word "adalimumab". There are 427 unique drug names, including variations in upper and lower case.

Bespoke code lists will be required to pick up all relevant medicines use from the high cost drugs dataset in its current form; searching on dm+d values alone will not be enough.

In [11]:
# build the path to the csv file in the released-output folder,
# which sits one level above the current working directory
parentDirectory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
filename = os.path.join(parentDirectory, "released-output", "drug_name_summary.csv")

# reading in data summary and setting column names
cols = [
    "DrugName",
    "HighCostTariffExcludedDrugCode",
    "DerivedSNOMEDFromName",
    "DerivedVTM",
    "DerivedVTMName",
    "NumberOfAppearances"]

drug_name_summary = pd.read_csv(filename, header = None, names = cols, index_col = False)
searchfor = ['adalimumab', 'Adalimumab', "ADALIMUMAB"]

adalimumab = drug_name_summary[drug_name_summary['DrugName'].str.contains('|'.join(searchfor),na = False)] 

adalimumab_drugname_unique = adalimumab.DrugName.unique()
len(adalimumab_drugname_unique)

427

In [6]:
adalimumab_drugname_unique

array(['HC ADALIMUMAB IMRALDI 40 mg Injection Pre Filled Pen',
       'ADALIMUMAB', 'ADALIMUMAB (D2E7) - HOMECARE 40 mg Preloaded Pen',
       'ADALIMUMAB (IMRALDI) (HOMECARE)', 'ADALIMUMAB (IMRALDI) ',
       'HC ADALIMUMAB HUMIRA 40 mg Injection Pre Filled Pen',
       'Adalimumab 40mg/0.8ml solution for injection pre-filled disposable devices',
       'HOMECARE ADALIMUMAB (IMRALDI)', 'ADALIMUMAB REFERENCE PRICE',
       'HOMECARE IMRALDI (ADALIMUMAB)', 'ADALIMUMAB(AMGEVITA)',
       'HOMECARE - ADALIMUMAB (AMGEVITA) 40 mg in 0.8ml Auto Injector Pen',
       'ADALIMUMAB (HUMIRA)_(HOMECARE) 40 mg in 0.4mL Pre-filled Injection Pen',
       'ADALIMUMAB (HUMIRA) (HOMECARE)', 'ADALIMUMAB(HUMIRA)',
       'Adalimumab (Homecare)', 'Adalimumab',
       'HOMECARE AMGEVITA (ADALIMUMAB)',
       'HOMECARE ADALIMUMAB!40mg/0.8mL! PEN (HYRIMOZ)',
       'IMRALDI (HOMECARE PEN PACK) ADALIMUMAB',
       'ADALIMUMAB [IMRALDI] (HOMECARE) 40 mg/0.8mL Soln for Injection PF Pen',
       'ADALIMUMAB (IMRA
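
The search above matches three fixed capitalisations of the generic name. A bespoke code list could instead match case-insensitively and include brand names that appear in the output, such as Humira, Imraldi, Amgevita and Hyrimoz. The sketch below is illustrative only: the term list is a hypothetical starting point rather than a validated code list, and one sample entry is made up to show a brand-only name.

```python
import re
import pandas as pd

# Sample names in the style of the output above, plus one unrelated drug
names = pd.Series([
    "HC ADALIMUMAB IMRALDI 40 mg Injection Pre Filled Pen",
    "Adalimumab (Homecare)",
    "ADALIMUMAB(AMGEVITA)",
    "HUMIRA 40 mg pen",  # hypothetical entry that gives only the brand name
    "TACROLIMUS",
    None,
])

# Hypothetical bespoke code list: generic name plus brand names
adalimumab_terms = ["adalimumab", "humira", "imraldi", "amgevita", "hyrimoz"]
pattern = "|".join(re.escape(term) for term in adalimumab_terms)

# case=False matches any capitalisation; na=False treats missing names as non-matches
is_adalimumab = names.str.contains(pattern, case=False, regex=True, na=False)
matched = names[is_adalimumab]
print(matched.tolist())
```

A case-insensitive search may return a different number of names from the 427 reported above, because it also matches mixed-case and brand-only entries.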

**Top 20 drug names**

The table below lists the 20 unique drug names that appear most often in the dataset.

In [41]:
# treat missing appearance counts as zero so they are included in the totals
drug_name_summary["NumberOfAppearances - no NULLS"] = drug_name_summary["NumberOfAppearances"].fillna(0)

# total appearances per drug name; dropna = False keeps the group of records with no drug name
drug_name_summary_top20 = drug_name_summary.groupby('DrugName', dropna = False).agg({"NumberOfAppearances - no NULLS": "sum"})
drug_name_summary_top20 = drug_name_summary_top20.sort_values(by = "NumberOfAppearances - no NULLS", ascending = False)
drug_name_summary_top20.head(20)

| DrugName | NumberOfAppearances - no NULLS |
|---|---|
| FLUOROURACIL | 154970.0 |
| CAPECITABINE | 98188.0 |
| FILGRASTIM | 97544.0 |
| CYCLOPHOSPHAMIDE | 97044.0 |
| TACROLIMUS | 82349.0 |
| NaN | 82302.0 |
| CARBOPLATIN | 80951.0 |
| PACLITAXEL | 75804.0 |
| DEXAMETHASONE | 55976.0 |
| OXALIPLATIN | 55448.0 |
