**High cost drugs dataset**

The high cost drugs dataset brings together patient-level high cost drugs data submissions from all regional Data Services for Commissioners Regional Offices (DSCROs). The North of England Commissioning Support Unit (NECS) collected submissions from all Acute Providers and Commissioners nationally.

**Caveats**
* The NECS are unable to say with any certainty that the data received is a complete dataset for each Provider and Commissioner.
* The drug names and other text inputs are non-standardised; further work is required to standardise variables in the dataset.
* The NECS are unable to comment on the level of data validation that was undertaken before the data was submitted.
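
As an illustration of the standardisation work mentioned in the caveats, a first pass over the free-text drug names might look like the sketch below. The function name `normalise_drug_name` and the specific rules (upper-casing, replacing punctuation with spaces, keeping `.`, `/`, `+` and `%` so that strengths such as `40mg/0.8ml` survive) are assumptions made for this example, not part of the NECS process.

```python
import re


def normalise_drug_name(raw):
    """Return a tidied drug name, or None when the input is not a string.

    Hypothetical rules: upper-case the text, replace runs of characters that
    are not letters, digits or . / + % with a single space, then trim.
    """
    if not isinstance(raw, str):
        return None
    text = raw.upper()
    text = re.sub(r"[^A-Z0-9./+%]+", " ", text)
    return text.strip()


# Example inputs in the style of the free-text names seen in the dataset
for name in ["ADALIMUMAB(HUMIRA)", "ADALIMUMAB (IMRALDI) ", None]:
    print(repr(normalise_drug_name(name)))
```

This only removes surface differences in case, spacing and punctuation. It does not map brand names or formulations to a common medicine, which would still need a curated code list.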

Further information on the NECS data collation and cleaning processes can be found in [this document](https://docs.google.com/document/d/1JbUPp962KRNGsIC1wThexdMUrV0XOb-D/edit#heading=h.gjdgxs).

**Steps to read in summary data**

In [3]:
# bring in libraries
import pandas as pd
import numpy as np
import os

In [15]:
# build the filepath for the csv file, which sits in released-output/
# under the parent of the current working directory
parentDirectory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
print(parentDirectory)
path = "/released-output/"
filename = parentDirectory + path + "drug_name_summary.csv"
print(filename)

# reading in data summary and setting column names
cols = [
    "DrugName",
    "HighCostTariffExcludedDrugCode",
    "DerivedSNOMEDFromName",
    "DerivedVTM",
    "DerivedVTMName",
    "NumberOfAppearances"]

drug_name_summary = pd.read_csv(filename, header = None, names = cols, index_col = False)
print(drug_name_summary.columns)

drug_name_summary.head(5)

C:\Users\arowan\Documents\GitHub\highcostdrugs-research
C:\Users\arowan\Documents\GitHub\highcostdrugs-research/released-output/drug_name_summary.csv
Index(['DrugName', 'HighCostTariffExcludedDrugCode', 'DerivedSNOMEDFromName',
       'DerivedVTM', 'DerivedVTMName', 'NumberOfAppearances'],
      dtype='object')


| | DrugName | HighCostTariffExcludedDrugCode | DerivedSNOMEDFromName | DerivedVTM | DerivedVTMName | NumberOfAppearances |
|---|---|---|---|---|---|---|
| 0 | FLUOROURACIL | | | 3127006.0 | Fluorouracil | 67450.0 |
| 1 | CYCLOPHOSPHAMIDE | | | 74470007.0 | Cyclophosphamide | 40754.0 |
| 2 | Aranesp | | | | | 40508.0 |
| 3 | FLUOROURACIL | 3127006.0 | | 3127006.0 | Fluorouracil | 39362.0 |
| 4 | CARBOPLATIN | | | 108759002.0 | Carboplatin | 35560.0 |


**Non-standardised drug names**

The drug names in the high cost drugs dataset are not in dm+d format. The examples suggest that drug names are entered as free text, so a single medicine can appear under many different variations.

Taking adalimumab as an example, the code below lists all the unique drug names that contain adalimumab. There are 427 such names (matching the lowercase, capitalised and uppercase spellings).

Bespoke code lists will be required to pick up all relevant medicine use from the high cost drugs dataset in its current form; searching on dm+d values alone will not be enough.

In [30]:
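# the three case variants of the drug name to search for
searchfor = ['adalimumab', 'Adalimumab', "ADALIMUMAB"]

# keep rows whose DrugName contains any variant; missing names count as no match
adalimumab = drug_name_summary[drug_name_summary['DrugName'].str.contains('|'.join(searchfor), na=False)]

# count the distinct drug name strings
adalimumab_drugname_unique = adalimumab.DrugName.unique()
len(adalimumab_drugname_unique)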

427
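
The bespoke code list idea could be sketched as below: a list of search terms (the generic name plus brand names) is turned into a single case-insensitive pattern. The term list, the helper names `build_pattern` and `filter_names`, and the sample entry marked hypothetical are assumptions for illustration. The brand names are taken from the adalimumab drug names in this dataset, and a real code list would need clinical review.

```python
import re

# Hypothetical code list: generic name plus brand names seen in the dataset
ADALIMUMAB_TERMS = ["adalimumab", "humira", "imraldi", "amgevita", "hyrimoz"]


def build_pattern(terms):
    """Join the terms into one regex string, escaping special characters."""
    return "|".join(re.escape(term) for term in terms)


def filter_names(names, terms):
    """Keep names containing any term, ignoring case and skipping non-strings."""
    regex = re.compile(build_pattern(terms), flags=re.IGNORECASE)
    return [n for n in names if isinstance(n, str) and regex.search(n)]


sample_names = [
    "HOMECARE IMRALDI (ADALIMUMAB)",
    "Adalimumab (Homecare)",
    "HUMIRA 40mg pen",  # hypothetical brand-only entry, not from the dataset
    "FLUOROURACIL",
    None,
]
matched = filter_names(sample_names, ADALIMUMAB_TERMS)
print(matched)
```

The same pattern string can be applied to the summary table with `drug_name_summary['DrugName'].str.contains(build_pattern(ADALIMUMAB_TERMS), case=False, na=False)`. Because this matches any casing and also brand-only entries, the count it returns may differ from the 427 above.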

In [31]:
adalimumab_drugname_unique

array(['HC ADALIMUMAB IMRALDI 40 mg Injection Pre Filled Pen',
       'ADALIMUMAB', 'ADALIMUMAB (D2E7) - HOMECARE 40 mg Preloaded Pen',
       'ADALIMUMAB (IMRALDI) (HOMECARE)', 'ADALIMUMAB (IMRALDI) ',
       'HC ADALIMUMAB HUMIRA 40 mg Injection Pre Filled Pen',
       'Adalimumab 40mg/0.8ml solution for injection pre-filled disposable devices',
       'HOMECARE ADALIMUMAB (IMRALDI)', 'ADALIMUMAB REFERENCE PRICE',
       'HOMECARE IMRALDI (ADALIMUMAB)', 'ADALIMUMAB(AMGEVITA)',
       'HOMECARE - ADALIMUMAB (AMGEVITA) 40 mg in 0.8ml Auto Injector Pen',
       'ADALIMUMAB (HUMIRA)_(HOMECARE) 40 mg in 0.4mL Pre-filled Injection Pen',
       'ADALIMUMAB (HUMIRA) (HOMECARE)', 'ADALIMUMAB(HUMIRA)',
       'Adalimumab (Homecare)', 'Adalimumab',
       'HOMECARE AMGEVITA (ADALIMUMAB)',
       'HOMECARE ADALIMUMAB!40mg/0.8mL! PEN (HYRIMOZ)',
       'IMRALDI (HOMECARE PEN PACK) ADALIMUMAB',
       'ADALIMUMAB [IMRALDI] (HOMECARE) 40 mg/0.8mL Soln for Injection PF Pen',
       'ADALIMUMAB (IMRA