# Pulling data from the NHS BSA Open Data Portal (ODP) using R

The ODP (https://opendata.nhsbsa.net/) offers two programmatic methods for accessing its data:

* `datastore_search` e.g. https://opendata.nhsbsa.net/api/3/action/datastore_search?resource_id=EPD_201401&limit=5
* `datastore_search_sql` e.g. https://opendata.nhsbsa.net/api/3/action/datastore_search_sql?sql=SELECT%20*%20FROM%20EPD_201401%20LIMIT%205
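
As a rough sketch, both example URLs above can be assembled in R from a base endpoint, an action name, and the query parameters. The variable names `search_url` and `sql_url` are illustrative only, and nothing is sent to the API here.

```r
# Base endpoint shared by both methods (taken from the example URLs above)
base_endpoint <- "https://opendata.nhsbsa.net/api/3/action"

# Method 1: datastore_search takes the resource and row limit as URL parameters
search_url <- paste0(
    base_endpoint,
    "/datastore_search?resource_id=", "EPD_201401",
    "&limit=", 5
)

# Method 2: datastore_search_sql takes a whole SQL statement, which must be
# percent-encoded because a URL cannot contain spaces
sql_url <- paste0(
    base_endpoint,
    "/datastore_search_sql?sql=",
    utils::URLencode("SELECT * FROM EPD_201401 LIMIT 5")
)

print(search_url)
print(sql_url)
```

Pasting either printed URL into a browser should show the raw JSON that the API returns.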

The following code demonstrates the process using the SQL-style query. It is the more flexible of the two methods, and it is easy to use if you already know some SQL (if not, don't worry: the code is there for you to follow).

In [None]:
# Define the url for the API call
base_endpoint <- "https://opendata.nhsbsa.net/api/3/action"
action_method <- "/datastore_search_sql?sql=" # SQL

In [None]:
# Define the parameters for the SQL query
resource_name <- "EPD_202001"
pco_code <- "13T00" # Newcastle Gateshead CCG
bnf_chemical_substance <- "0407010H0" # Paracetamol

In [None]:
# Construct the SQL query
query <- paste0(
    "
    SELECT 
        * 
    FROM ", 
        resource_name, " 
    WHERE 
        1=1 
    AND pco_code = '", pco_code, "' 
    AND bnf_chemical_substance = '", bnf_chemical_substance, "'"
)
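
The multi-line string above keeps its line breaks and indentation, which all end up percent-encoded in the final URL. As an optional alternative, the same query could be written on a single line with `sprintf()`; this is only a sketch, and `query_compact` is an illustrative name rather than something the rest of the notebook uses.

```r
# Same parameters as defined earlier in the notebook
resource_name <- "EPD_202001"
pco_code <- "13T00" # Newcastle Gateshead CCG
bnf_chemical_substance <- "0407010H0" # Paracetamol

# Each %s placeholder is filled, in order, by the arguments that follow
query_compact <- sprintf(
    "SELECT * FROM %s WHERE pco_code = '%s' AND bnf_chemical_substance = '%s'",
    resource_name, pco_code, bnf_chemical_substance
)

# Print the query to check it reads as intended before sending it
print(query_compact)
```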

In [None]:
# Send the API call and parse the JSON response into an R list
response <- jsonlite::fromJSON(paste0(
    base_endpoint,
    action_method, 
    URLencode(query) # Percent-encode spaces and other unsafe characters
))

In [None]:
# Extract the records in the response to a data frame
result_df <- response$result$result$records
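
Before extracting the records it can be worth confirming that the call worked. The sketch below assumes the parsed response carries the top-level `success` flag that CKAN-style APIs normally return. The helper `extract_records()` and the hand-built `mock_response` are hypothetical, used here so the example runs without calling the API.

```r
# Hypothetical helper: stop early if the API reports a failed call,
# otherwise return the records using the same path as the notebook
extract_records <- function(response) {
    if (!isTRUE(response$success)) {
        stop("The API call did not succeed; check the query and resource name")
    }
    response$result$result$records
}

# Hand-built stand-in for a parsed response (made-up values, not real data)
mock_response <- list(
    success = TRUE,
    result = list(result = list(records = data.frame(
        BNF_DESCRIPTION = c("Example product A", "Example product B"),
        QUANTITY = c(28, 100)
    )))
)

records_df <- extract_records(mock_response)
print(nrow(records_df))
```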

In [None]:
# View the first 6 rows of data
head(result_df)

Next, we can use some of base `R`'s plotting functionality to create quick and easy visualisations.

In [None]:
# Let's inspect the QUANTITY column
hist(x = result_df$QUANTITY)

In [None]:
# Use more bins
hist(
    x = result_df$QUANTITY, 
    xlab = NULL, 
    ylab = NULL,
    breaks = 50
)

In [None]:
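# Ask for roughly one bin per value of QUANTITY (hist() treats a single
# number passed to breaks as a suggestion, not an exact bin count)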
max_quantity <- max(result_df$QUANTITY)
hist(
    x = result_df$QUANTITY, 
    xlab = NULL, 
    ylab = NULL,
    breaks = max_quantity
)
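
When `breaks` is a single number, `hist()` only treats it as a suggestion and picks "pretty" break points itself. If exactly one bin per value is wanted, one option is to pass explicit break points centred on each whole number. The sketch below assumes `QUANTITY` holds whole numbers and uses a small made-up vector in place of `result_df$QUANTITY`.

```r
# Made-up quantities standing in for result_df$QUANTITY
quantity <- c(28, 28, 56, 100, 100, 100)

# Break points at 27.5, 28.5, ..., 100.5 so each whole number has its own bin
breaks <- seq(min(quantity) - 0.5, max(quantity) + 0.5, by = 1)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x = quantity, breaks = breaks, plot = FALSE)

# Show only the bins that contain data: bin centres, then counts
print(h$mids[h$counts > 0])
print(h$counts[h$counts > 0])
```

Dropping `plot = FALSE` (or calling `plot(h)`) draws the histogram with these exact bins.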

Now we can use the `ggplot2` package to make more complex visualisations.

In [None]:
# Make the figure big enough for the plot
options(repr.plot.width = 10, repr.plot.height = 20)

# Let's see if QUANTITY varies by BNF_DESCRIPTION
ggplot2::ggplot(data = result_df, mapping = ggplot2::aes(x = QUANTITY)) +
    ggplot2::geom_histogram(bins = 50) +
    ggplot2::facet_wrap(facets = . ~ BNF_DESCRIPTION, ncol = 1) # One row per BNF_DESCRIPTION

# Reset to default figure
options(repr.plot.width = NULL, repr.plot.height = NULL)

We can see that `BNF_DESCRIPTION` contains different forms of the drug, and that the distribution of `QUANTITY` differs between them (look at `BNF_DESCRIPTION == 'Paracetamol 250mg/5ml oral suspension sugar free'`).

In [None]:
# Subset the data to tablets
tablet_df <- subset(result_df, grepl("tablet", BNF_DESCRIPTION))

# Make the figure big enough for the plot
options(repr.plot.width = 10, repr.plot.height = 10)

# Let's see if QUANTITY varies by BNF_DESCRIPTION among the tablets
ggplot2::ggplot(data = tablet_df, mapping = ggplot2::aes(x = QUANTITY)) +
    ggplot2::geom_histogram(bins = 50) +
    ggplot2::facet_wrap(facets = . ~ BNF_DESCRIPTION, ncol = 1) # One row per BNF_DESCRIPTION

# Reset to default figure
options(repr.plot.width = NULL, repr.plot.height = NULL)
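
The subsetting above relies on `grepl()`, which by default is case-sensitive and matches the pattern anywhere in the string. The short sketch below shows the difference that `ignore.case = TRUE` makes. The first two descriptions are made up for illustration and are not taken from the data.

```r
descriptions <- c(
    "Paracetamol 500mg tablets",         # made-up example
    "Paracetamol 500mg soluble Tablets", # made-up example with a capital T
    "Paracetamol 250mg/5ml oral suspension sugar free"
)

# Default behaviour: a lower-case pattern misses "Tablets"
match_default <- grepl("tablet", descriptions)

# ignore.case = TRUE matches regardless of capitalisation
match_any_case <- grepl("tablet", descriptions, ignore.case = TRUE)

print(match_default)
print(match_any_case)
```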

In [None]:
# We can see there are peaks at certain values of QUANTITY, so let's examine
# the 10 most common values of QUANTITY
head(sort(x = table(tablet_df$QUANTITY), decreasing = TRUE), 10)
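
If the counts are easier to read as a table with proportions, the sorted `table()` output can be turned into a data frame. This is a minimal sketch using a made-up vector in place of `tablet_df$QUANTITY`; the names `top_df`, `n` and `share` are illustrative.

```r
# Made-up quantities standing in for tablet_df$QUANTITY
quantity <- c(28, 28, 56, 100, 100, 100, 32)

# Count each distinct value, sort by frequency, and keep the top 2
top_counts <- head(sort(x = table(quantity), decreasing = TRUE), 2)

# The table's names are the QUANTITY values (as text); its values are counts
top_df <- data.frame(
    QUANTITY = as.numeric(names(top_counts)),
    n = as.integer(top_counts),
    share = as.numeric(top_counts) / length(quantity)
)

print(top_df)
```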

TASK

Create another subset called `oral_suspension_df` (containing only rows that match 'oral suspension' instead of 'tablet') and then, for `QUANTITY`:

1) Produce an overall histogram
2) Produce one histogram per `BNF_DESCRIPTION`
3) Get the top 5 most common values of `QUANTITY`

In [None]:
# Do your work in here
