# Pulling data from the NHS BSA Open Data Portal (ODP) using Python

In [None]:
# Make the plots appear inline
%matplotlib inline

# Import any packages
import requests 
import pandas as pd

The ODP (https://opendata.nhsbsa.net/) offers two programmatic methods for accessing its data:

* `datastore_search` e.g. https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_201401&limit=5
* `datastore_search_sql` e.g. https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?sql=SELECT%20*%20FROM%20EPD_201401%20LIMIT%205
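As a minimal sketch, the two example URLs above can be built with the standard library's `urllib.parse` instead of being typed out by hand. The helper names `build_search_url` and `build_sql_url` are made up for this illustration and are not part of the ODP or any library. Nothing is sent over the network here; the code only constructs the strings.

```python
from urllib.parse import quote, urlencode

BASE_ENDPOINT = "https://opendata.nhsbsa.net/api/3/action"


def build_search_url(resource_id, limit=5):
    """Build a datastore_search URL for one resource (hypothetical helper)."""
    params = {"resource_id": resource_id, "limit": limit}
    return f"{BASE_ENDPOINT}/datastore_search?{urlencode(params)}"


def build_sql_url(sql):
    """Build a datastore_search_sql URL (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 (the default would use "+");
    # safe="*" leaves the asterisk readable, matching the example URL above.
    encoded = urlencode({"sql": sql}, safe="*", quote_via=quote)
    return f"{BASE_ENDPOINT}/datastore_search_sql?{encoded}"


search_url = build_search_url("EPD_201401", limit=5)
sql_url = build_sql_url("SELECT * FROM EPD_201401 LIMIT 5")

print(search_url)
print(sql_url)
```

Either string can then be passed to `requests.get()` in the same way as the hand-built URL used later in this notebook.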

The following code demonstrates the process using the SQL-style query. This method is more flexible, and it is easy to pick up if you already know some SQL (if not, don't worry: the code is there for you to follow).

In [None]:
# Define the url for the API call
base_endpoint = "https://opendata.nhsbsa.net/api/3/action"
action_method = "/datastore_search_sql?sql=" # SQL

# Define the parameters for the SQL query
resource_name = "EPD_202001"
pco_code = "13T00" # Newcastle Gateshead CCG
bnf_chemical_substance = "0407010H0" # Paracetamol

# Construct the SQL query
query = f"""
    SELECT 
        *
    FROM 
        {resource_name} 
    WHERE 
        1=1
        AND pco_code = '{pco_code}' 
        AND bnf_chemical_substance = '{bnf_chemical_substance}' 
"""

# Send the API call and parse the JSON response into a Python dictionary
response = requests.get(
    base_endpoint 
    + action_method 
    + query.replace(" ", "%20") # Encode spaces in the url
).json()
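The cell above only replaces spaces, so the newlines and quote characters in the multi-line query reach the URL unencoded and are left for `requests` to tidy up. One alternative, sketched here with the standard library only, is to collapse the whitespace first and then percent-encode the whole query. This assumes the server accepts a fully percent-encoded `sql` parameter, which is standard URL behaviour but has not been checked against the ODP here. The query is a shortened copy of the one above.

```python
from urllib.parse import quote

base_endpoint = "https://opendata.nhsbsa.net/api/3/action"
action_method = "/datastore_search_sql?sql="

query = """
    SELECT
        *
    FROM
        EPD_202001
    WHERE
        1=1
        AND pco_code = '13T00'
"""

# Collapse every run of whitespace (spaces, tabs, newlines) into one space
compact_query = " ".join(query.split())

# Percent-encode everything except letters, digits and the characters "_.-~"
url = base_endpoint + action_method + quote(compact_query, safe="")

print(compact_query)
print(url)
```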

The response from the API is held as a dictionary. You can view it using `print()` in the cell below:

In [None]:
# Try to print some of the data we have... e.g. print(response), print(query)
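Before printing a large response, it can help to look at its top-level keys and pull out just the records. The sketch below uses a small hand-written dictionary, not real ODP output: its values are made up, and its shape (a `success` flag plus records nested under `result` then `result`) is an assumption based on how this notebook reads the response and on the usual CKAN-style layout. The helper `extract_records` is hypothetical.

```python
# Hypothetical payload with made-up values, shaped the way this notebook
# assumes: a top-level "success" flag and records under result -> result.
sample_response = {
    "success": True,
    "result": {
        "result": {
            "records": [
                {"PCO_CODE": "13T00", "QUANTITY": 100},
                {"PCO_CODE": "13T00", "QUANTITY": 32},
            ]
        }
    },
}


def extract_records(payload):
    """Return the list of records, or raise if the call did not succeed."""
    if not payload.get("success", False):
        raise ValueError(f"API call failed: {payload.get('error')}")
    return payload["result"]["result"]["records"]


# Look at the top-level keys before printing everything
print(sample_response.keys())

records = extract_records(sample_response)
print(len(records), "records; first record:", records[0])
```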

Now we can use the `pandas` library to analyse the data in a tabular format. It is one of the most widely used Python packages for data manipulation and analysis.

In [None]:
# Convert the records in the response to a dataframe
result_df = pd.DataFrame(response['result']['result']['records'])

# View the first 5 rows of data (the default for head())
result_df.head()
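A few other quick checks are useful once the records are in a DataFrame. The sketch below uses made-up records in place of the API response, so the column values are illustrative only; the same calls work on `result_df`.

```python
import pandas as pd

# Made-up records standing in for response['result']['result']['records']
records = [
    {"BNF_DESCRIPTION": f"Example item {i}", "QUANTITY": 10 * i}
    for i in range(1, 9)
]

example_df = pd.DataFrame(records)

print(example_df.shape)    # (number of rows, number of columns)
print(example_df.dtypes)   # data type of each column
print(example_df.head())   # first 5 rows by default
print(example_df.head(6))  # ask for 6 rows explicitly
```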

Next, we can use some of the built-in `pandas` plotting functionality to create some quick and easy visualisations.

In [2]:
# Let's inspect the QUANTITY column
result_df.hist(column='QUANTITY')

# Can we try removing the grid lines?
result_df.hist(column='QUANTITY', grid=False)

# How about using more bins?
result_df.hist(column='QUANTITY', grid=False, bins=50)

# What about one bin per value of QUANTITY?
result_df.hist(
    column='QUANTITY', 
    grid=False, 
    bins=int(max(result_df['QUANTITY']))
)

# Let's see if QUANTITY varies by BNF_DESCRIPTION
result_df.hist(
    column='QUANTITY', 
    by='BNF_DESCRIPTION',
    grid=False, 
    bins=50,
    sharex=True, # All the rows share the same x axis
    layout=(18, 1), # 18 rows and one column
    figsize=(10, 20) # Make the graph big enough 
)

# We can see that BNF_DESCRIPTION contains different forms of the drug...
# why don't we limit this to 'tablet' and check again?
tablet_df = result_df[result_df['BNF_DESCRIPTION'].str.contains('tablet')]
tablet_df.hist(
    column='QUANTITY', 
    by='BNF_DESCRIPTION',
    grid=False, 
    bins=int(max(tablet_df['QUANTITY'])), # Bin by each value of QUANTITY
    sharex=True,
    layout=(5, 1),
    figsize=(5, 10)
)

# We can see there are peaks at certain values of QUANTITY, so let's examine
# the 10 most common values
tablet_df['QUANTITY'].value_counts().head(10)
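One thing to watch with `str.contains` is that it is case-sensitive by default, so a description spelled "Tablets" would not match `'tablet'`. The sketch below shows the difference on a tiny made-up DataFrame (these rows are not real prescribing data) and then counts the most common quantities in the same way as the cell above.

```python
import pandas as pd

# Made-up rows for illustration only; not real prescribing data
example_df = pd.DataFrame({
    "BNF_DESCRIPTION": [
        "Example 500mg tablets",
        "Example 500mg Tablets",
        "Example 250mg/5ml oral suspension",
        "Example 500mg tablets",
    ],
    "QUANTITY": [100, 32, 200, 100],
})

# The default match is case-sensitive, so "Tablets" is missed
strict = example_df[example_df["BNF_DESCRIPTION"].str.contains("tablet")]

# case=False also matches "Tablets"
relaxed = example_df[
    example_df["BNF_DESCRIPTION"].str.contains("tablet", case=False)
]

print(len(strict), len(relaxed))

# Most common quantities, most frequent first
print(relaxed["QUANTITY"].value_counts().head(10))
```

Whether `case=False` is needed depends on how the descriptions are capitalised in the data you pull back, so it is worth checking a few values of `BNF_DESCRIPTION` first.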


Now recreate the previous graph, but for 'oral suspension' instead of 'tablet'.

In [None]:
# Try to create a DataFrame called oral_suspension_df and then produce a histogram from it