# Pulling data from the NHS BSA Open Data Portal (ODP) using Python

In [None]:
# Import any packages
import requests
import pandas as pd
import matplotlib
import scipy

# Make the plots appear inline
%matplotlib inline

The ODP (https://opendata.nhsbsa.net/) offers two programmatic methods for accessing its data:

* `datastore_search` e.g. https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_201401&limit=5
* `datastore_search_sql` e.g. https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?sql=SELECT%20*%20FROM%20EPD_201401%20LIMIT%205
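
Both example URLs follow the same pattern: an action name followed by URL-encoded query parameters. The following is a minimal sketch of building them with the standard library rather than typing `%20` by hand. The helper name `build_url` is hypothetical, and nothing is sent over the network here.

```python
from urllib.parse import urlencode, quote

BASE_ENDPOINT = "https://opendata.nhsbsa.net/api/3/action"

def build_url(action, **params):
    """Return the full URL for an ODP action with URL-encoded parameters."""
    # quote_via=quote encodes spaces as %20 (the default would use +)
    return f"{BASE_ENDPOINT}/{action}?{urlencode(params, quote_via=quote)}"

# The two access methods listed above, built from plain Python values
search_url = build_url("datastore_search", resource_id="EPD_201401", limit=5)
sql_url = build_url("datastore_search_sql", sql="SELECT * FROM EPD_201401 LIMIT 5")
print(search_url)
print(sql_url)
```

Either URL can then be passed to `requests.get(...)` in the usual way.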

The following code demonstrates the process using the SQL-style query. It is the more flexible of the two methods and straightforward if you already know some SQL (if not, don't worry: the code is there for you to follow).

In [None]:
# Define the url for the API call
base_endpoint = "https://opendata.nhsbsa.net/api/3/action"
action_method = "/datastore_search_sql?sql=" # SQL

In [None]:
# Define the parameters for the SQL query
resource_name = "EPD_202001"
pco_code = "13T00" # Newcastle Gateshead CCG
bnf_chemical_substance = "0407010H0" # Paracetamol
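
The parameters above are later placed directly into the SQL text, so any stray character in them becomes part of the query. The following is a minimal sketch of checking values before they are used. The helper `check_sql_value` and its allowed-character rule are assumptions for illustration, not part of the ODP API.

```python
import re

# Assumption: resource names and codes only contain letters, digits and underscores
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_]+")

def check_sql_value(value):
    """Return value unchanged, or raise if it contains unexpected characters."""
    if not _SAFE_VALUE.fullmatch(value):
        raise ValueError(f"Unexpected characters in {value!r}")
    return value

# Same values as the cell above, validated before they reach the query
resource_name = check_sql_value("EPD_202001")
pco_code = check_sql_value("13T00")
bnf_chemical_substance = check_sql_value("0407010H0")
```

A value such as `"13T00' OR 1=1"` would raise a `ValueError` instead of silently changing the query.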

In [None]:
# Construct the SQL query
query = f"""
    SELECT 
        *
    FROM 
        {resource_name} 
    WHERE 
        1=1
        AND pco_code = '{pco_code}' 
        AND bnf_chemical_substance = '{bnf_chemical_substance}' 
"""

In [None]:
# Send API call and grab the response as a json
from urllib.parse import quote

response = requests.get(
    base_endpoint
    + action_method
    + quote(query) # Percent-encode spaces, newlines and quotes in the SQL
).json()

In [None]:
# Convert the records in the response to a pandas dataframe
result_df = pd.DataFrame(response['result']['result']['records'])
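
The line above assumes the request succeeded. CKAN-style action APIs normally report a `success` flag in the JSON body, so a more defensive sketch might check it before reaching for the records. The helper `extract_records` and the sample payload are made up for illustration; the nested `result` / `result` / `records` path mirrors the cell above.

```python
def extract_records(payload):
    """Return the list of records from an ODP SQL response, or raise if it failed."""
    # Assumption: the body carries a boolean "success" flag, as CKAN action APIs do
    if not payload.get("success", False):
        raise RuntimeError(f"API call failed: {payload.get('error')}")
    return payload["result"]["result"]["records"]

# Made-up payload with the same shape as the response used above
sample = {"success": True, "result": {"result": {"records": [{"QUANTITY": 100.0}]}}}
records = extract_records(sample)
```

With a real response, `pd.DataFrame(extract_records(response))` would replace the direct indexing.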

In [None]:
# View the first 5 rows of data
result_df.head()

Next up, we can use some of the built-in `pandas` plotting functionality (with a `matplotlib` backend) to create some quick and easy visualisations of the `QUANTITY` column.

Note that the `;` at the end of each plot hides the metadata:
https://stackoverflow.com/questions/38968404/hide-matplotlib-descriptions-in-jupyter-notebook

In [None]:
# Let's inspect the QUANTITY column
result_df.hist(column='QUANTITY');

In [None]:
# Remove the background grid lines
result_df.hist(column='QUANTITY', grid=False);

In [None]:
# How about using more bins
result_df.hist(column='QUANTITY', grid=False, bins=50);

In [None]:
# Roughly one bin per value of QUANTITY
max_quantity = int(result_df['QUANTITY'].max())
result_df.hist(
    column='QUANTITY', 
    grid=False, 
    bins=max_quantity
);
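
Passing `bins=max_quantity` splits the range between the smallest and largest value into that many equal-width bins, which is only approximately one bin per value. If exact unit-width bins are wanted, one option is to pass explicit bin edges instead. The following is a minimal sketch, assuming `QUANTITY` holds whole numbers; `integer_bin_edges` is a hypothetical helper and the input list is made up.

```python
def integer_bin_edges(values):
    """Return bin edges so that each whole number gets its own unit-width bin."""
    lowest, highest = int(min(values)), int(max(values))
    # Edges sit at k - 0.5, so each integer k falls in the middle of its bin
    return [k - 0.5 for k in range(lowest, highest + 2)]

# Made-up quantities, purely to show the shape of the result
edges = integer_bin_edges([28, 30, 30, 32])
```

The resulting list can be passed as `bins=edges` to `result_df.hist(...)`, since the `bins` argument accepts a sequence of edges as well as a count.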

In [None]:
# Let's see if QUANTITY varies by BNF_DESCRIPTION
result_df.hist(
    column='QUANTITY', 
    by='BNF_DESCRIPTION',
    grid=False, 
    bins=50,
    sharex=True, # All the rows share the same x axis
    layout=(18, 1), # 18 rows and one column
    figsize=(10, 20) # Make the graph big enough 
);

We can see that `BNF_DESCRIPTION` contains different forms of the drug, and that the distribution of `QUANTITY` differs between them (look at `BNF_DESCRIPTION == 'Paracetamol 250mg/5ml oral suspension sugar free'`).

In [None]:
# Subset the data to tablets
tablet_df = result_df[result_df['BNF_DESCRIPTION'].str.contains('tablet')]

# Let's see if QUANTITY varies by BNF_DESCRIPTION for tablets only
tablet_df.hist(
    column='QUANTITY', 
    by='BNF_DESCRIPTION',
    grid=False, 
    bins=50,
    sharex=True,
    sharey=True,
    layout=(5, 1),
    figsize=(10, 10) # Make the figure big enough for the plot
);

In [None]:
# We can see there are peaks for certain QUANTITY values, so let's examine the
# 10 most common
tablet_df['QUANTITY'].value_counts().head(10)

TASK

Create another subset called `oral_suspension_df` (containing only 'oral suspension' instead of 'tablet') and then for `QUANTITY`:

1) Produce an overall histogram
2) Produce one histogram per `BNF_DESCRIPTION`
3) Get the top 5 most common `QUANTITY`

In [None]:
# Do your work in here
