# Using Python to get data from the English Prescribing Dataset (EPD)

The [English Prescribing Dataset (EPD)](https://opendata.nhsbsa.net/dataset/english-prescribing-data-epd/resource/b98f8fe1-43ee-44ee-817d-f45ba3d179b8?filters=PRACTICE_NAME%3AHAWTHORNS%20MEDICAL%20CENTRE) is too big to handle as a single file: even importing it directly into Colab causes a crash (the Pro version may be able to handle it, but we're always looking for a free alternative).

But it does have an **API**, so can we use that to fetch filtered aspects of the dataset that we want to work with?

Clicking on 'Data API' on that page brings up some example queries including "Query example (first 5 results)" and this URL: `https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_202109&limit=5`

We can try to fetch that to begin with.

In [1]:
#import the pandas library to fetch the data and store it
import pandas as pd

In [2]:
#store the url
apiurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_202109&limit=5"
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)

In [3]:
#show the first few rows
apidata.head()

Unnamed: 0,help,success,result
_links,https://opendata.nhsbsa.net/api/3/action/help_...,True,{'start': '/api/3/action/datastore_search?reso...
fields,https://opendata.nhsbsa.net/api/3/action/help_...,True,"[{'type': 'INTEGER', 'id': 'YEAR_MONTH'}, {'ty..."
include_total,https://opendata.nhsbsa.net/api/3/action/help_...,True,True
records,https://opendata.nhsbsa.net/api/3/action/help_...,True,"[{'BNF_CODE': '1310020N0AAAAAA', 'TOTAL_QUANTI..."
records_format,https://opendata.nhsbsa.net/api/3/action/help_...,True,objects
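One way to see why that table looks so odd is to parse a payload of the same shape with Python's built-in `json` module instead of pandas. The sketch below uses a hand-trimmed sample, not a live response: the values are copied from the outputs in this notebook, the `help` URL is left out, and the variable names are just for illustration.

```python
import json

# Hand-trimmed sample with the same nesting as the API response.
# This is NOT a live response; values are copied from this notebook's output.
sample_json = """
{
  "help": "(help URL omitted)",
  "success": true,
  "result": {
    "resource_id": "EPD_202109",
    "total": 17542094,
    "records_format": "objects",
    "records": [
      {"BNF_CODE": "1310020N0AAAAAA",
       "PRACTICE_NAME": "FORD MEDICAL PRACTICE",
       "ITEMS": 1}
    ]
  }
}
"""

payload = json.loads(sample_json)

# The top level only holds three keys...
top_level_keys = sorted(payload.keys())
# ...and the interesting material sits one level down, inside 'result'.
result_keys = sorted(payload["result"].keys())
# The rows themselves are a list of dictionaries under 'records'.
first_record = payload["result"]["records"][0]

print(top_level_keys)
print(result_keys)
print(first_record["PRACTICE_NAME"])
```

Seen this way, the three top-level keys are what pandas turned into columns, and the keys inside `result` are what ended up as row labels.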


## Check the keys in the JSON dictionary

We can see that at this level of the data we're too 'high up' and need to dive deeper to make sense of it. 

First, let's see what the keys are.

In [4]:
#show the keys
apidata.keys()

Index(['help', 'success', 'result'], dtype='object')

`result` looks promising, so let's drill into that.

In [5]:
#drill down into 'result'
apidata['result']

_links            {'start': '/api/3/action/datastore_search?reso...
fields            [{'type': 'INTEGER', 'id': 'YEAR_MONTH'}, {'ty...
include_total                                                  True
records           [{'BNF_CODE': '1310020N0AAAAAA', 'TOTAL_QUANTI...
records_format                                              objects
resource_id                                              EPD_202109
total                                                      17542094
Name: result, dtype: object

In [6]:
#drill further down into 'records'
apidata['result']['records']

[{'ACTUAL_COST': 1.7092,
  'ADDRESS_1': 'FORD MEDICAL PRACTICE',
  'ADDRESS_2': '91/93 GORSEY LANE',
  'ADDRESS_3': 'LITHERLAND',
  'ADDRESS_4': 'LIVERPOOL',
  'ADQUSAGE': 0.0,
  'BNF_CHAPTER_PLUS_CODE': '13: Skin',
  'BNF_CHEMICAL_SUBSTANCE': '1310020N0',
  'BNF_CODE': '1310020N0AAAAAA',
  'BNF_DESCRIPTION': 'Miconazole 2% cream',
  'CHEMICAL_SUBSTANCE_BNF_DESCR': 'Miconazole nitrate',
  'ITEMS': 1,
  'NIC': 1.82,
  'PCO_CODE': '01T00',
  'PCO_NAME': 'SOUTH SEFTON CCG',
  'POSTCODE': 'L21 0DF',
  'PRACTICE_CODE': 'N84029',
  'PRACTICE_NAME': 'FORD MEDICAL PRACTICE',
  'QUANTITY': 30.0,
  'REGIONAL_OFFICE_CODE': 'Y62',
  'REGIONAL_OFFICE_NAME': 'NORTH WEST',
  'STP_CODE': 'QYG',
  'STP_NAME': 'CHESHIRE & MERSEYSIDE STP',
  'TOTAL_QUANTITY': 30.0,
  'UNIDENTIFIED': False,
  'YEAR_MONTH': 202109},
 {'ACTUAL_COST': 9.142759999999999,
  'ADDRESS_1': 'GLOVERS LANE SURGERY',
  'ADDRESS_2': 'MAGDALEN SQUARE',
  'ADDRESS_3': 'NETHERTON',
  'ADDRESS_4': 'BOOTLE',
  'ADQUSAGE': 56.0,
  'BNF_CHAP

## Normalize the 'records' branch

Now we've got some recognisable data, we can normalise it and store it.

In [None]:
#turn that into a dataframe structure
pd.json_normalize(apidata['result']['records'])

Unnamed: 0,BNF_CODE,TOTAL_QUANTITY,POSTCODE,YEAR_MONTH,UNIDENTIFIED,STP_NAME,PRACTICE_NAME,BNF_CHAPTER_PLUS_CODE,ACTUAL_COST,QUANTITY,REGIONAL_OFFICE_CODE,ITEMS,ADDRESS_4,ADDRESS_1,ADDRESS_2,ADDRESS_3,BNF_CHEMICAL_SUBSTANCE,ADQUSAGE,PCO_CODE,REGIONAL_OFFICE_NAME,NIC,CHEMICAL_SUBSTANCE_BNF_DESCR,PRACTICE_CODE,PCO_NAME,STP_CODE,BNF_DESCRIPTION
0,20030900426,12.0,WA4 6HJ,202109,False,CHESHIRE & MERSEYSIDE STP,STOCKTON HEATH MED.CENTRE,20: Dressings,17.3533,12.0,Y62,1,"WARRINGTON,CHESHIRE",STOCKTON HEATH MED.CENTRE,"THE FORGE,LONDON ROAD",STOCKTON HEATH,2003,0.0,02E00,NORTH WEST,18.6,Wound Management & Other Dressings,N81075,WARRINGTON CCG,QYG,KerraFoam Simple Border dressing 10cm x 10cm
1,0402010ABAAACAC,140.0,L21 0DF,202109,False,CHESHIRE & MERSEYSIDE STP,FORD MEDICAL PRACTICE,04: Central Nervous System,7.20725,140.0,Y62,1,LIVERPOOL,FORD MEDICAL PRACTICE,91/93 GORSEY LANE,LITHERLAND,0402010AB,46.66667,01T00,NORTH WEST,7.61,Quetiapine,N84029,SOUTH SEFTON CCG,QYG,Quetiapine 100mg tablets
2,0402010ADAAAEAE,28.0,L30 5TA,202109,False,CHESHIRE & MERSEYSIDE STP,GLOVERS LANE SURGERY,04: Central Nervous System,33.65933,28.0,Y62,1,BOOTLE,GLOVERS LANE SURGERY,MAGDALEN SQUARE,NETHERTON,0402010AD,0.0,01T00,NORTH WEST,36.09,Aripiprazole,N84004,SOUTH SEFTON CCG,QYG,Aripiprazole 10mg orodispersible tablets sugar...
3,0206010K0BBADAF,56.0,WA4 6HJ,202109,False,CHESHIRE & MERSEYSIDE STP,STOCKTON HEATH MED.CENTRE,02: Cardiovascular System,6.89282,56.0,Y62,1,"WARRINGTON,CHESHIRE",STOCKTON HEATH MED.CENTRE,"THE FORGE,LONDON ROAD",STOCKTON HEATH,0206010K0,0.0,02E00,NORTH WEST,7.38,Isosorbide mononitrate,N81075,WARRINGTON CCG,QYG,Elantan LA50 capsules
4,0407020ADBSAFAH,70.0,L21 0DF,202109,False,CHESHIRE & MERSEYSIDE STP,FORD MEDICAL PRACTICE,04: Central Nervous System,29.74319,14.0,Y62,5,LIVERPOOL,FORD MEDICAL PRACTICE,91/93 GORSEY LANE,LITHERLAND,0407020AD,46.66667,01T00,NORTH WEST,31.3,Oxycodone hydrochloride,N84029,SOUTH SEFTON CCG,QYG,Oxypro 40mg modified-release tablets


In [7]:
#store that dataframe structure
records = pd.json_normalize(apidata['result']['records'])
#check it's 5 records
print(len(records))

5


## Forming a query

Now we have an idea of structure we can try to form a URL which asks for something more specific than just the first 5 records of all the data.

Let's start by changing the limit at the end of the URL from 5 to 500: `limit=500`.

In [8]:
#store the url
apiurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_202109&limit=500"
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
records = pd.json_normalize(apidata['result']['records'])
#check it's 500 records
print(len(records))

500


Well that works.
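Raising `limit` only goes so far, given that the `total` reported earlier was 17,542,094 rows. CKAN's `datastore_search` action also documents an `offset` parameter, so one possible approach is to page through the results. The sketch below only builds the URLs and fetches nothing. `page_urls` is a hypothetical helper, and it is an assumption that this particular API honours `offset` in the usual way.

```python
from urllib.parse import urlencode

BASE_URL = "https://opendata.nhsbsa.net/api/3/action/datastore_search"


def page_urls(resource_id, page_size, pages):
    """Build one URL per page of results (hypothetical helper).

    Assumes the API supports the standard CKAN 'offset' parameter.
    """
    urls = []
    for page in range(pages):
        params = {
            "resource_id": resource_id,
            "limit": page_size,
            "offset": page * page_size,  # skip the rows already fetched
        }
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = page_urls("EPD_202109", 500, 3)
for url in urls:
    print(url)
```

Each URL could then be passed to `pd.read_json()` in a loop, although for a dataset this size filtering on the server is likely to be the better route.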

In [13]:
records.head(6)

Unnamed: 0,BNF_CODE,TOTAL_QUANTITY,POSTCODE,YEAR_MONTH,UNIDENTIFIED,STP_NAME,PRACTICE_NAME,BNF_CHAPTER_PLUS_CODE,ACTUAL_COST,QUANTITY,REGIONAL_OFFICE_CODE,ITEMS,ADDRESS_4,ADDRESS_1,ADDRESS_2,ADDRESS_3,BNF_CHEMICAL_SUBSTANCE,ADQUSAGE,PCO_CODE,REGIONAL_OFFICE_NAME,NIC,CHEMICAL_SUBSTANCE_BNF_DESCR,PRACTICE_CODE,PCO_NAME,STP_CODE,BNF_DESCRIPTION
0,20030900426,12.0,WA4 6HJ,202109,False,CHESHIRE & MERSEYSIDE STP,STOCKTON HEATH MED.CENTRE,20: Dressings,17.3533,12.0,Y62,1,"WARRINGTON,CHESHIRE",STOCKTON HEATH MED.CENTRE,"THE FORGE,LONDON ROAD",STOCKTON HEATH,2003,0.0,02E00,NORTH WEST,18.6,Wound Management & Other Dressings,N81075,WARRINGTON CCG,QYG,KerraFoam Simple Border dressing 10cm x 10cm
1,0402010ABAAACAC,140.0,L21 0DF,202109,False,CHESHIRE & MERSEYSIDE STP,FORD MEDICAL PRACTICE,04: Central Nervous System,7.20725,140.0,Y62,1,LIVERPOOL,FORD MEDICAL PRACTICE,91/93 GORSEY LANE,LITHERLAND,0402010AB,46.66667,01T00,NORTH WEST,7.61,Quetiapine,N84029,SOUTH SEFTON CCG,QYG,Quetiapine 100mg tablets
2,0402010ADAAAEAE,28.0,L30 5TA,202109,False,CHESHIRE & MERSEYSIDE STP,GLOVERS LANE SURGERY,04: Central Nervous System,33.65933,28.0,Y62,1,BOOTLE,GLOVERS LANE SURGERY,MAGDALEN SQUARE,NETHERTON,0402010AD,0.0,01T00,NORTH WEST,36.09,Aripiprazole,N84004,SOUTH SEFTON CCG,QYG,Aripiprazole 10mg orodispersible tablets sugar...
3,0206010K0BBADAF,56.0,WA4 6HJ,202109,False,CHESHIRE & MERSEYSIDE STP,STOCKTON HEATH MED.CENTRE,02: Cardiovascular System,6.89282,56.0,Y62,1,"WARRINGTON,CHESHIRE",STOCKTON HEATH MED.CENTRE,"THE FORGE,LONDON ROAD",STOCKTON HEATH,0206010K0,0.0,02E00,NORTH WEST,7.38,Isosorbide mononitrate,N81075,WARRINGTON CCG,QYG,Elantan LA50 capsules
4,0407020ADBSAFAH,70.0,L21 0DF,202109,False,CHESHIRE & MERSEYSIDE STP,FORD MEDICAL PRACTICE,04: Central Nervous System,29.74319,14.0,Y62,5,LIVERPOOL,FORD MEDICAL PRACTICE,91/93 GORSEY LANE,LITHERLAND,0407020AD,46.66667,01T00,NORTH WEST,31.3,Oxycodone hydrochloride,N84029,SOUTH SEFTON CCG,QYG,Oxypro 40mg modified-release tablets
5,0404000V0AAAAAA,56.0,WA4 6HJ,202109,False,CHESHIRE & MERSEYSIDE STP,STOCKTON HEATH MED.CENTRE,04: Central Nervous System,122.18181,56.0,Y62,1,"WARRINGTON,CHESHIRE",STOCKTON HEATH MED.CENTRE,"THE FORGE,LONDON ROAD",STOCKTON HEATH,0404000V0,0.0,02E00,NORTH WEST,131.04,Guanfacine,N81075,WARRINGTON CCG,QYG,Guanfacine 3mg modified-release tablets


## Trying the SQL query

Next we can try one of the suggested example queries, which uses SQL. As written, it throws an error.

In [14]:
#store the url
apiurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_202109&&sql=SELECT * from `EPD_202109` limit 5"
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
records = pd.json_normalize(apidata['result']['records'])
#check it's 5 records
print(len(records))

InvalidURL: ignored

That's not surprising: the URL contains spaces. If we paste it directly into a browser it does actually work, because the browser replaces the spaces with `%20`.

We can copy the URL that the browser converts it to and see if that works.

In [15]:
apiurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&sql=SELECT%20*%20from%20`EPD_202109`%20limit%205"
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
records = pd.json_normalize(apidata['result']['records'])
#check it's 5 records
print(len(records))

KeyError: ignored

Now at least we get a different error - a `KeyError`. So let's backtrack and check the keys.

In [17]:
apidata.keys()

Index(['help', 'success', 'result'], dtype='object')

In [18]:
apidata['result']

help       https://demo.ckan.org/api/3/action/help_show?n...
result     {'records': [{'BNF_CODE': '0206020C0AABABA', '...
success                                                 true
Name: result, dtype: object

In [19]:
apidata['result']['result']

{'fields': [],
 'records': [{'ACTUAL_COST': 25.84968,
   'ADDRESS_1': 'STOCKTON HEATH MED.CENTRE',
   'ADDRESS_2': 'THE FORGE,LONDON ROAD',
   'ADDRESS_3': 'STOCKTON HEATH',
   'ADDRESS_4': 'WARRINGTON,CHESHIRE',
   'ADQUSAGE': 112.0,
   'BNF_CHAPTER_PLUS_CODE': '02: Cardiovascular System',
   'BNF_CHEMICAL_SUBSTANCE': '0206020C0',
   'BNF_CODE': '0206020C0AABABA',
   'BNF_DESCRIPTION': 'Diltiazem 360mg modified-release capsules',
   'CHEMICAL_SUBSTANCE_BNF_DESCR': 'Diltiazem hydrochloride',
   'ITEMS': 2,
   'NIC': 27.7,
   'PCO_CODE': '02E00',
   'PCO_NAME': 'WARRINGTON CCG',
   'POSTCODE': 'WA4 6HJ',
   'PRACTICE_CODE': 'N81075',
   'PRACTICE_NAME': 'STOCKTON HEATH MED.CENTRE',
   'QUANTITY': 28.0,
   'REGIONAL_OFFICE_CODE': 'Y62',
   'REGIONAL_OFFICE_NAME': 'NORTH WEST',
   'STP_CODE': 'QYG',
   'STP_NAME': 'CHESHIRE & MERSEYSIDE STP',
   'TOTAL_QUANTITY': 56.0,
   'UNIDENTIFIED': False,
   'YEAR_MONTH': 202109},
  {'ACTUAL_COST': 9.48463,
   'ADDRESS_1': 'FORD MEDICAL PRACTICE',
 

So do we just need to replace `'records'` with `'result'`?

In [20]:
#store that dataframe structure
records = pd.json_normalize(apidata['result']['result'])
#check it's 5 records
print(len(records))

1


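No, this only gives us one row, because we're now normalising the dictionary that *contains* the records rather than the list of records itself. We need to keep `['records']` and *insert* an extra `['result']` before it.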

In [21]:
#store that dataframe structure
records = pd.json_normalize(apidata['result']['result']['records'])
#check it's 5 records
print(len(records))

5
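The two response shapes seen so far differ by one level of nesting: `result → records` for the plain search and `result → result → records` for the SQL search. A small helper can hide that difference. This is only a sketch working on plain dictionaries (for example from `json.loads`), using hand-made samples; `extract_records` is a hypothetical name, and it assumes these are the only two shapes the API returns.

```python
def extract_records(payload):
    """Return the list of records from either response shape (hypothetical helper)."""
    result = payload["result"]
    if "records" in result:
        # Shape seen with datastore_search
        return result["records"]
    # Shape seen with datastore_search_sql: one extra 'result' level
    return result["result"]["records"]


# Hand-made samples mirroring the two shapes (not live responses)
search_payload = {
    "success": True,
    "result": {"records": [{"POSTCODE": "L21 0DF"}], "total": 17542094},
}
sql_payload = {
    "success": True,
    "result": {
        "success": "true",
        "result": {"fields": [], "records": [{"POSTCODE": "WA4 6HJ"}]},
    },
}

print(extract_records(search_payload))
print(extract_records(sql_payload))
```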


Let's try that with our previous code, this time replacing spaces with `%20` as a browser would.

In [28]:
#store the url
apiurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_202109&&sql=SELECT * from `EPD_202109` limit 5"
#replace the spaces
apiurl = apiurl.replace(' ','%20')
print(apiurl)
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
print(apidata['result'].keys())
records = pd.json_normalize(apidata['result']['result']['records'])
#check it's 5 records
print(len(records))

https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_202109&&sql=SELECT%20*%20from%20`EPD_202109`%20limit%205
Index(['_links', 'fields', 'include_total', 'records', 'records_format',
       'resource_id', 'total'],
      dtype='object')


KeyError: ignored

Another `KeyError`.

A closer look reveals that the URL copied from the browser differed in a second way, with `datastore_search` changed to `datastore_search_sql`:

https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&sql=SELECT%20*%20from%20`EPD_202109`%20limit%205

Let's adapt the code to use that form of URL.

In [29]:
#store the url
apiurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT * from `EPD_202109` limit 10"
#replace the spaces
apiurl = apiurl.replace(' ','%20')
print(apiurl)
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
print(apidata['result'].keys())
records = pd.json_normalize(apidata['result']['result']['records'])
#check it's 10 records
print(len(records))

https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT%20*%20from%20`EPD_202109`%20limit%2010
Index(['help', 'result', 'success'], dtype='object')
10
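Replacing spaces by hand works here, but it only deals with spaces. A more general option is to let Python's standard `urllib.parse` module do the encoding. The sketch below builds the same kind of URL; `build_sql_url` is a hypothetical helper name. It also encodes `*` and the backticks, which a server would normally decode back, though that has not been tested against this API here.

```python
from urllib.parse import quote, urlencode

SQL_ENDPOINT = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql"


def build_sql_url(resource_id, sql):
    """Return a fully encoded datastore_search_sql URL (hypothetical helper)."""
    params = {"resource_id": resource_id, "sql": sql}
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return SQL_ENDPOINT + "?" + urlencode(params, quote_via=quote)


apiurl = build_sql_url("EPD_202109", "SELECT * from `EPD_202109` limit 10")
print(apiurl)
```

Building the query string from a dictionary also avoids the stray doubled `&&` in the hand-written URLs.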


In [30]:
records.head()

Unnamed: 0,BNF_CODE,TOTAL_QUANTITY,POSTCODE,YEAR_MONTH,UNIDENTIFIED,STP_NAME,PRACTICE_NAME,BNF_CHAPTER_PLUS_CODE,ACTUAL_COST,QUANTITY,REGIONAL_OFFICE_CODE,ITEMS,ADDRESS_4,ADDRESS_1,ADDRESS_2,ADDRESS_3,BNF_CHEMICAL_SUBSTANCE,ADQUSAGE,PCO_CODE,REGIONAL_OFFICE_NAME,NIC,CHEMICAL_SUBSTANCE_BNF_DESCR,PRACTICE_CODE,PCO_NAME,STP_CODE,BNF_DESCRIPTION
0,20050400133,80.0,L21 0DF,202109,False,CHESHIRE & MERSEYSIDE STP,FORD MEDICAL PRACTICE,20: Dressings,26.925,80.0,Y62,1,LIVERPOOL,FORD MEDICAL PRACTICE,91/93 GORSEY LANE,LITHERLAND,2005,0.0,01T00,NORTH WEST,28.88,Tracheostomy & Laryngectomy Appliances,N84029,SOUTH SEFTON CCG,QYG,Provox cleaning towel
1,22301003068,60.0,WA4 6HJ,202109,False,CHESHIRE & MERSEYSIDE STP,STOCKTON HEATH MED.CENTRE,22: Incontinence Appliances,101.41627,60.0,Y62,1,"WARRINGTON,CHESHIRE",STOCKTON HEATH MED.CENTRE,"THE FORGE,LONDON ROAD",STOCKTON HEATH,2230,0.0,02E00,NORTH WEST,108.78,Incontinence Sheaths,N81075,WARRINGTON CCG,QYG,Conveen Optima latex free self-sealing Urishea...
2,0109040N0BDABAQ,900.0,WA4 6HJ,202109,False,CHESHIRE & MERSEYSIDE STP,STOCKTON HEATH MED.CENTRE,01: Gastro-Intestinal System,237.07606,300.0,Y62,3,"WARRINGTON,CHESHIRE",STOCKTON HEATH MED.CENTRE,"THE FORGE,LONDON ROAD",STOCKTON HEATH,0109040N0,0.0,02E00,NORTH WEST,254.25,Pancreatin,N81075,WARRINGTON CCG,QYG,Creon 25000 gastro-resistant capsules
3,0408010ADBBADAA,28.0,L21 0DF,202109,False,CHESHIRE & MERSEYSIDE STP,FORD MEDICAL PRACTICE,04: Central Nervous System,29.34952,28.0,Y62,1,LIVERPOOL,FORD MEDICAL PRACTICE,91/93 GORSEY LANE,LITHERLAND,0408010AD,0.0,01T00,NORTH WEST,31.36,Zonisamide,N84029,SOUTH SEFTON CCG,QYG,Zonegran 100mg capsules
4,0604011G0AAAUAU,480.0,WA4 6HJ,202109,False,CHESHIRE & MERSEYSIDE STP,STOCKTON HEATH MED.CENTRE,06: Endocrine System,26.88762,160.0,Y62,3,"WARRINGTON,CHESHIRE",STOCKTON HEATH MED.CENTRE,"THE FORGE,LONDON ROAD",STOCKTON HEATH,0604011G0,0.0,02E00,NORTH WEST,28.8,Estradiol,N81075,WARRINGTON CCG,QYG,Estradiol 0.06% gel (750microgram per actuation)


In [31]:
#store the url
apiurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT POSTCODE from `EPD_202109` limit 10"
#replace the spaces
apiurl = apiurl.replace(' ','%20')
print(apiurl)
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
print(apidata['result'].keys())
records = pd.json_normalize(apidata['result']['result']['records'])
#check it's 10 records
print(len(records))

https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT%20POSTCODE%20from%20`EPD_202109`%20limit%2010
Index(['help', 'result', 'success'], dtype='object')
10


In [32]:
records.head()

Unnamed: 0,POSTCODE
0,L30 5TA
1,L21 0DF
2,L21 0DF
3,L21 0DF
4,L21 0DF


In [33]:
query = "SELECT REGIONAL_OFFICE_NAME, COUNT(REGIONAL_OFFICE_NAME) from `EPD_202109` GROUP BY REGIONAL_OFFICE_NAME"
queryclean = query.replace(' ','%20')
baseurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql="
apiurl = baseurl+queryclean
print(apiurl)
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
print(apidata['result'].keys())
records = pd.json_normalize(apidata['result']['result']['records'])
#check how many rows came back (one per region)
print(len(records))

https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT%20REGIONAL_OFFICE_NAME,%20COUNT(REGIONAL_OFFICE_NAME)%20from%20`EPD_202109`%20GROUP%20BY%20REGIONAL_OFFICE_NAME
Index(['help', 'result', 'success'], dtype='object')
8


In [34]:
print(records)

       REGIONAL_OFFICE_NAME      f0_
0                NORTH WEST  2624133
1           EAST OF ENGLAND  1890971
2  NORTH EAST AND YORKSHIRE  2887226
3                  MIDLANDS  3393136
4                SOUTH WEST  1677217
5                    LONDON  2557024
6                SOUTH EAST  2507714
7              UNIDENTIFIED     4673


## Introducing SQL queries

SQL = Structured Query Language. A query in SQL looks something like this:

> `SELECT [columns] FROM [a table] WHERE [condition is true] GROUP BY [a column] ORDER BY [a column]`

It can be simpler (just selecting certain columns from a table) or more complex (performing calculations, joining tables, etc.) but that's the gist.
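To get a feel for how those clauses fit together without touching the API, here is a minimal sketch using Python's built-in `sqlite3` module on a tiny in-memory table. The five rows are the practice names and `TOTAL_QUANTITY` values from the first five records shown earlier; the table name `epd` is made up for the example, and SQLite's dialect is not necessarily the same as the one the EPD API uses.

```python
import sqlite3

# Practice name and TOTAL_QUANTITY from the first five records shown earlier
rows = [
    ("STOCKTON HEATH MED.CENTRE", 12.0),
    ("FORD MEDICAL PRACTICE", 140.0),
    ("GLOVERS LANE SURGERY", 28.0),
    ("STOCKTON HEATH MED.CENTRE", 56.0),
    ("FORD MEDICAL PRACTICE", 70.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE epd (PRACTICE_NAME TEXT, TOTAL_QUANTITY REAL)")
conn.executemany("INSERT INTO epd VALUES (?, ?)", rows)

# Clauses in the order SQL expects them:
# SELECT ... FROM ... WHERE ... GROUP BY ... ORDER BY ...
query = """
SELECT PRACTICE_NAME, SUM(TOTAL_QUANTITY) AS total
FROM epd
WHERE TOTAL_QUANTITY > 20
GROUP BY PRACTICE_NAME
ORDER BY total DESC
"""
result = conn.execute(query).fetchall()
conn.close()

for practice, total in result:
    print(practice, total)
```

The `WHERE` filter runs before the grouping, so the 12.0 row is dropped before Stockton Heath's total is calculated.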

Let's re-import the data so we have all the columns.

In [38]:
#store the url
apiurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT * from `EPD_202109` limit 10"
#replace the spaces
apiurl = apiurl.replace(' ','%20')
print(apiurl)
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
print(apidata['result'].keys())
records = pd.json_normalize(apidata['result']['result']['records'])
#check it's 10 records
print(len(records))

https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT%20*%20from%20`EPD_202109`%20limit%2010
Index(['help', 'result', 'success'], dtype='object')
10


And see what the column names are to help us form a query.

In [39]:
records.columns

Index(['BNF_CODE', 'TOTAL_QUANTITY', 'POSTCODE', 'YEAR_MONTH', 'UNIDENTIFIED',
       'STP_NAME', 'PRACTICE_NAME', 'BNF_CHAPTER_PLUS_CODE', 'ACTUAL_COST',
       'QUANTITY', 'REGIONAL_OFFICE_CODE', 'ITEMS', 'ADDRESS_4', 'ADDRESS_1',
       'ADDRESS_2', 'ADDRESS_3', 'BNF_CHEMICAL_SUBSTANCE', 'ADQUSAGE',
       'PCO_CODE', 'REGIONAL_OFFICE_NAME', 'NIC',
       'CHEMICAL_SUBSTANCE_BNF_DESCR', 'PRACTICE_CODE', 'PCO_NAME', 'STP_CODE',
       'BNF_DESCRIPTION'],
      dtype='object')

## Perform a calculation on the aggregated data

When you are selecting columns you can put `SUM()` around a column name to calculate a sum based on whatever dimension you are using to `GROUP BY`.

You can also use `COUNT()`, `AVG()`, `MIN()` and `MAX()`.
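As a quick offline illustration of those aggregate functions, the sketch below runs them side by side with `sqlite3` on five rows taken from the records shown earlier. The table name `epd` and the column aliases are made up for the example, and SQLite may differ in detail from the SQL dialect behind the API.

```python
import sqlite3

# Practice name and TOTAL_QUANTITY from the first five records shown earlier
rows = [
    ("STOCKTON HEATH MED.CENTRE", 12.0),
    ("FORD MEDICAL PRACTICE", 140.0),
    ("GLOVERS LANE SURGERY", 28.0),
    ("STOCKTON HEATH MED.CENTRE", 56.0),
    ("FORD MEDICAL PRACTICE", 70.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE epd (PRACTICE_NAME TEXT, TOTAL_QUANTITY REAL)")
conn.executemany("INSERT INTO epd VALUES (?, ?)", rows)

# One row per practice, with each aggregate calculated within the group
query = """
SELECT PRACTICE_NAME,
       COUNT(*)            AS n_rows,
       AVG(TOTAL_QUANTITY) AS mean_qty,
       MIN(TOTAL_QUANTITY) AS min_qty,
       MAX(TOTAL_QUANTITY) AS max_qty
FROM epd
GROUP BY PRACTICE_NAME
ORDER BY PRACTICE_NAME
"""
summary = conn.execute(query).fetchall()
conn.close()

for row in summary:
    print(row)
```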

In [47]:
#form the query
query = "SELECT STP_NAME, SUM(TOTAL_QUANTITY) from `EPD_202109` GROUP BY STP_NAME"
#replace spaces with %20
queryclean = query.replace(' ','%20')
#store the base url
baseurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql="
#combine the two
apiurl = baseurl+queryclean
print(apiurl)
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
print(apidata['result'].keys())
records = pd.json_normalize(apidata['result']['result']['records'])
#check it
print(records)

https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT%20STP_NAME,%20SUM(TOTAL_QUANTITY)%20from%20`EPD_202109`%20GROUP%20BY%20STP_NAME
Index(['help', 'result', 'success'], dtype='object')
             f0_                                  STP_NAME
0   4.460302e+08                 CHESHIRE & MERSEYSIDE STP
1   1.355829e+08            SUFFOLK & NORTH EAST ESSEX STP
2   2.448498e+08  HEALTHIER LANCASHIRE & SOUTH CUMBRIA STP
3   2.599582e+08                  HUMBER, COAST & VALE STP
4   2.032046e+08        STAFFORDSHIRE & STOKE ON TRENT STP
5   4.669498e+08                  CUMBRIA & NORTH EAST STP
6   4.905166e+08    GREATER MANCHESTER HSC PARTNERSHIP STP
7   2.235251e+08           SOUTH YORKSHIRE & BASSETLAW STP
8   9.771192e+07        HEREFORDSHIRE & WORCESTERSHIRE STP
9   1.016639e+08        BRISTOL, N SOMERSET & S GLOUCS STP
10  1.200309e+08                   MID AND SOUTH ESSEX STP
11  1.601712e+08                                 DEVON STP
12 

### Formatting numbers 

Those numbers are coming through in scientific notation. We could simply export the data and format the numbers in Excel...

In [48]:
records.to_csv("records.csv")

...or we can use `CAST()` around that `SUM()` and specify that we want to show it as a particular type of number (in this case, `AS int`)

In [46]:
#form the query
query = "SELECT STP_NAME, CAST(SUM(TOTAL_QUANTITY) AS int) from `EPD_202109` GROUP BY STP_NAME"
#replace spaces with %20
queryclean = query.replace(' ','%20')
#store the base url
baseurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql="
#combine the two
apiurl = baseurl+queryclean
print(apiurl)
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
print(apidata['result'].keys())
records = pd.json_normalize(apidata['result']['result']['records'])
#check it
print(records)

https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT%20STP_NAME,%20CAST(SUM(TOTAL_QUANTITY)%20AS%20int)%20from%20`EPD_202109`%20GROUP%20BY%20STP_NAME
Index(['help', 'result', 'success'], dtype='object')
          f0_                                  STP_NAME
0   446030163                 CHESHIRE & MERSEYSIDE STP
1   135582884            SUFFOLK & NORTH EAST ESSEX STP
2   223525076           SOUTH YORKSHIRE & BASSETLAW STP
3   259958232                  HUMBER, COAST & VALE STP
4   466949757                  CUMBRIA & NORTH EAST STP
5   490516617    GREATER MANCHESTER HSC PARTNERSHIP STP
6   244849759  HEALTHIER LANCASHIRE & SOUTH CUMBRIA STP
7   203204580        STAFFORDSHIRE & STOKE ON TRENT STP
8    97711924        HEREFORDSHIRE & WORCESTERSHIRE STP
9   101663916        BRISTOL, N SOMERSET & S GLOUCS STP
10  175230602          NORTH LONDON PARTNERS IN H&C STP
11  120030915                   MID AND SOUTH ESSEX STP
12  245460431            
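A third option is to leave the query alone and format the numbers on the Python side. The sketch below uses plain f-string formatting on two of the totals shown above, stored as floats the way an un-`CAST` sum would arrive; the dictionary is just a stand-in for the real results.

```python
# Two totals from the output above, as floats (stand-in for the real results)
totals = {
    "CHESHIRE & MERSEYSIDE STP": 446030163.0,
    "SUFFOLK & NORTH EAST ESSEX STP": 135582884.0,
}

# ',.0f' means: thousands separators, no decimal places
formatted = {name: f"{value:,.0f}" for name, value in totals.items()}

for name, text in formatted.items():
    print(name, text)
```

This changes only how the numbers are displayed; the underlying values stay as they were.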

## A query looking for particular text

Below is another query, looking for records `WHERE CHEMICAL_SUBSTANCE_BNF_DESCR LIKE '%Co-codamol%'`. `LIKE` performs a text match, and the `%` signs act as wildcards, so the query matches any record where that text appears anywhere in the column.
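One thing to watch with `LIKE` patterns is that `%` is also the escape character in URLs. The sketch below uses the standard `urllib.parse` module to show how the pattern could be encoded safely, and why leaving `%` raw is fragile. The `%Adrenaline%` pattern is a made-up example, and how any particular server treats a raw `%` is an assumption rather than something tested here.

```python
from urllib.parse import quote, unquote

pattern = "'%Co-codamol%'"

# Encode everything that is not URL-safe: ' becomes %27 and % becomes %25
encoded = quote(pattern, safe="")
print(encoded)

# Decoding gets the original pattern back
round_trip = unquote(encoded)
print(round_trip)

# Left raw, '%Co' is not a valid escape, so Python's decoder leaves it alone...
raw_ok = unquote("%Co-codamol%")
print(raw_ok)

# ...but a made-up pattern starting '%Ad' looks like the escape for byte 0xAD
raw_broken = unquote("%Adrenaline%")
print(raw_broken == "%Adrenaline%")
```

So replacing only the spaces happens to be enough for this search term, but encoding the whole query is the safer habit.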

In [70]:
#form the query
query = "SELECT TOTAL_QUANTITY, CHEMICAL_SUBSTANCE_BNF_DESCR, BNF_DESCRIPTION from `EPD_202109` WHERE CHEMICAL_SUBSTANCE_BNF_DESCR LIKE '%Co-codamol%' LIMIT 1000"
#replace spaces with %20
queryclean = query.replace(' ','%20')
#store the base url
baseurl = "https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql="
#combine the two
apiurl = baseurl+queryclean
print(apiurl)
#fetch the JSON at that URL
apidata = pd.read_json(apiurl)
#store that dataframe structure
print(apidata['result'].keys())
records = pd.json_normalize(apidata['result']['result']['records'])
#check it
print(records)

https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?resource_id=EPD_202109&&sql=SELECT%20TOTAL_QUANTITY,%20CHEMICAL_SUBSTANCE_BNF_DESCR,%20BNF_DESCRIPTION%20from%20`EPD_202109`%20WHERE%20CHEMICAL_SUBSTANCE_BNF_DESCR%20LIKE%20'%Co-codamol%'%20LIMIT%201000
Index(['help', 'result', 'success'], dtype='object')
     TOTAL_QUANTITY  ...                                    BNF_DESCRIPTION
0             600.0  ...          Co-codamol 8mg/500mg effervescent tablets
1             230.0  ...                      Co-codamol 30mg/500mg tablets
2             150.0  ...                      Co-codamol 30mg/500mg tablets
3             448.0  ...                     Co-codamol 30mg/500mg capsules
4             200.0  ...  Co-codamol 8mg/500mg effervescent tablets suga...
..              ...  ...                                                ...
995            50.0  ...                      Co-codamol 30mg/500mg tablets
996           240.0  ...                      Co-codamol 30mg/500mg tablet