<a href="https://colab.research.google.com/github/p-stehlik/StudentNotebooks/blob/main/PBSTimeSeries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4013PHM PBS Data Wrangling Notebook
This notebook provides a step-by-step guide to data wrangling (cleaning, filtering, etc.) for your publicly available PBS data.

You can amend this notebook for MBS or other data but you should feel comfortable that you understand the code and what it means, and make any adjustments accordingly.

HINT: use co-pilot or Google AI to help you understand the code and how you might amend it to suit your needs.

If you do update the code, it is good practice to document your logic (i.e. what the code is doing and why) using comments.

Comments are written by putting # at the start of the line (Python ignores everything after the #). I have put lots of comments in the code below, which will hopefully help you understand what you are doing and why!

Broadly, we want to clean whatever data you have so you can create a nice visual - so ultimately you need to consider what you want at the end and create a dataframe that can be easily analysed.

For the purposes of this course, we will use a [pre-developed Shiny app](https://robin-visser.shinyapps.io/The_TIM/) that can do a few different kinds of time series analyses for you.

Have a look at the example on the app to see what structure your final data needs to be in.

## Before you start

Be sure to read about the data you will use: what each column means and what each category within a column represents.

You also need to consider:


*   How was the data generated?
*   When was it generated?
*   Is there any missing data you should be aware of?





# Load libraries

Libraries (also called packages) are pieces of ready-made software for Python that let you do things in your code without having to write everything from scratch.

There are LOTS of packages out there - we use a couple below.

In [1]:
#highly used python package for data wrangling and analysis
import pandas as pd

#package for datetime data - ie dealing with dates and time!
from datetime import datetime

# Load the data

In [None]:
#LOAD THE PBS DATA

#To get the file path go to the file explorer tab, press the "three dots" and click "copy path"
#paste the file path into the quotation marks below - NOTE: the quotation marks tell python that it is a string (text) and not a command
file_path = "Data/dos-jul-2021-to-nov-2025.xlsx"

#the code below reads an .xlsx file, loads every sheet, and then stacks them into one huge dataframe
#don't change the code below - keep as is.

sheets = pd.read_excel(file_path, sheet_name=None)


df_PBS = pd.concat(
    [d.assign(sheet_name=name) for name, d in sheets.items()],
    ignore_index=True
)

#this creates a "data frame" which we have named "df_PBS"
#honestly you can name the dataframe bananaMOOOMOO for all the program cares, but convention is to put df_NAME, and for your readability give it a sensible name, and you cannot use spaces!
#so if you are for example looking at MBS data, you can change to df_MBS or something
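
To see what the sheet-merging step does, here is a minimal sketch on made-up data. The dictionary `toy_sheets` is a hypothetical stand-in for what `pd.read_excel(file_path, sheet_name=None)` returns (one dataframe per sheet, keyed by sheet name); the sheet names and numbers are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
toy_sheets = {
    "FY2021_22": pd.DataFrame({"MONTH_OF_SUPPLY": [202107, 202108], "PRSCRPTN_CNT": [10, 12]}),
    "FY2022_23": pd.DataFrame({"MONTH_OF_SUPPLY": [202207], "PRSCRPTN_CNT": [9]}),
}

# Tag each sheet's rows with the sheet they came from, then stack them on top of each other
toy_merged = pd.concat(
    [d.assign(sheet_name=name) for name, d in toy_sheets.items()],
    ignore_index=True,  # renumber the rows 0, 1, 2, ... in the merged dataframe
)

print(toy_merged)
```

The `sheet_name` column lets you trace every row back to the sheet it came from, which is handy if one sheet turns out to have a problem.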

In [20]:
#have a look at the first few rows of df_PBS

df_PBS.head() # you can change the number of rows by putting a number in the brackets, e.g. df_PBS.head(10)

Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name
0,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,ABOVE CO-PAYMENT,771,0.0,30747.73,0.0,30747.73,18.8,DOS_FY2021_22
1,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,UNDER CO-PAYMENT,25,6.6,0.0,0.0,6.6,89.72,DOS_FY2021_22
2,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,ABOVE CO-PAYMENT,795,5233.8,41013.11,0.0,46246.91,2868.49,DOS_FY2021_22
3,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,UNDER CO-PAYMENT,2,12.3,0.0,0.0,12.3,12.3,DOS_FY2021_22
4,202107,00013Q,Z,EXTEMPORANEOUSLY PREPARED,C0,Section 85,ABOVE CO-PAYMENT,1959,0.0,83309.92,0.0,83309.92,23.09,DOS_FY2021_22


In [None]:
# dataFrame.info() function lets you explore your dataframe
#it tells you the column names, the number of columns, the number of rows (entries) and the data types
#object = text (strings), which is how categories are usually stored
#int64 = a type of number

df_PBS.info()

#notice that the MONTH_OF_SUPPLY column is an int, not a date. We will need to convert this to a date format for a time series analysis!

One of the most important steps to data cleaning is ensuring that data is in the correct format.

One of the most DIFFICULT data types to work with is date/time, and it is especially painful when the dates have been through Excel!

In [None]:
#There are a few different ways to do this but an example is provided below

df_PBS['MONTH_OF_SUPPLY_dt'] = pd.to_datetime(df_PBS['MONTH_OF_SUPPLY'],
                                              format = "%Y%m")

#NOTE: the format argument tells pandas what format the data arrived in so it can convert it accurately to a date/time. The PBS data came as 202107 - i.e. year and then month, without a day - so pandas sets the day to the 1st of the month


#have a look at what the data looks like now and check it worked
df_PBS.head()

The dataframe has LOTS of rows, and above you can see what each row contains.
Remember that it now holds ALL of the data, but you will only need the rows you are interested in.
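
The date conversion used earlier can be tried on a tiny made-up example. This is only a sketch: the dataframe `toy` and its values are invented, and turning the numbers into text with `.astype(str)` first is an extra, defensive step that the notebook's own code does not use.

```python
import pandas as pd

# Made-up year-month values in the same YYYYMM style as the PBS data
toy = pd.DataFrame({"MONTH_OF_SUPPLY": [202107, 202108, 202201]})

# Turn the numbers into text first, then read them as year (%Y) followed by month (%m)
toy["MONTH_OF_SUPPLY_dt"] = pd.to_datetime(
    toy["MONTH_OF_SUPPLY"].astype(str), format="%Y%m"
)

print(toy.dtypes)  # the new column should show a datetime type
print(toy)
```

Because no day is given, each date lands on the first of the month, which matches the `2021-07-01` style values in the PBS dataframe.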

# Create a filter for your item codes

You will need to create a .csv file with your item codes.

This will be used to create a list of item codes you are interested in!


In [None]:
#import your .csv file with all of your item codes.

#first tell the code where your .csv file is
file_path = "Data/ItemCodes.csv"

#import the file
#again you can call the df below whatever you want, but I thought this name might be intuitive here, just give it a sensible name.
dfItemCodes = pd.read_csv(file_path)

#check it worked
dfItemCodes

In [24]:
#now we want to create a FILTER using the item codes from our imported .csv file
#in this case it will be the ITEM_CODE column of the df we just created
PBSItems = dfItemCodes["ITEM_CODE"]

#now filter your large df_PBS and only get the rows of interest
#the code below looks into the df_PBS dataframe, and then looks through the column called "ITEM_CODE" and finds which rows match any of the item codes you have listed in the "PBSItems" list

df_PBSFiltered = df_PBS[df_PBS['ITEM_CODE'].isin(PBSItems)]

df_PBSFiltered.head()

Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt
1870,202107,01884E,J01CA04,AMOXICILLIN,C0,Section 85,ABOVE CO-PAYMENT,567,0.0,8345.18,2879.97,8345.18,112.64,DOS_FY2021_22,2021-07-01
1871,202107,01884E,J01CA04,AMOXICILLIN,C1,Section 85,ABOVE CO-PAYMENT,2829,17417.4,22829.25,13285.17,40246.65,17193.96,DOS_FY2021_22,2021-07-01
1872,202107,01884E,J01CA04,AMOXICILLIN,C1,Section 85,UNDER CO-PAYMENT,22,32.87,0.0,0.0,32.87,134.05,DOS_FY2021_22,2021-07-01
1873,202107,01884E,J01CA04,AMOXICILLIN,G1,Section 85,ABOVE CO-PAYMENT,17,105.6,145.86,86.02,251.46,103.6,DOS_FY2021_22,2021-07-01
1874,202107,01884E,J01CA04,AMOXICILLIN,G2,Section 85,ABOVE CO-PAYMENT,75,495.0,632.88,326.8,1127.88,490.77,DOS_FY2021_22,2021-07-01
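
The filtering step can be sketched on made-up data as follows. The dataframe `toy_pbs`, the list `codes_of_interest` and the code `99999X` are all invented for illustration. Two optional extras are shown that are not part of the notebook's code: `.copy()`, which makes the filtered result an independent dataframe, and a quick check for codes that matched nothing.

```python
import pandas as pd

# Made-up miniature version of the PBS data
toy_pbs = pd.DataFrame({
    "ITEM_CODE": ["01884E", "01884E", "00013Q", "03119E"],
    "PRSCRPTN_CNT": [567, 22, 1959, 75],
})

# Made-up list of codes; "99999X" is deliberately not in the data
codes_of_interest = pd.Series(["01884E", "03119E", "99999X"])

# Keep only the rows whose ITEM_CODE appears in the list
toy_filtered = toy_pbs[toy_pbs["ITEM_CODE"].isin(codes_of_interest)].copy()

# Codes that were asked for but never found in the data
not_found = sorted(set(codes_of_interest) - set(toy_pbs["ITEM_CODE"]))

print(toy_filtered)
print("Codes with no matching rows:", not_found)
```

A code that matches nothing is often a sign of a typo or a lost leading zero in the .csv file, so this check is worth running before moving on.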


In [25]:
#let's have a look at the new df structure
df_PBSFiltered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5832 entries, 1870 to 1199470
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   MONTH_OF_SUPPLY      5832 non-null   int64         
 1   ITEM_CODE            5832 non-null   object        
 2   ATC5_CODE            5832 non-null   object        
 3   DRUG_NAME            5832 non-null   object        
 4   PTNT_CTGRY_DRVD_CD   5832 non-null   object        
 5   DRG_TYP_CTGRY        5832 non-null   object        
 6   SCRIPT_TYPE          5832 non-null   object        
 7   PRSCRPTN_CNT         5832 non-null   int64         
 8   PATIENT_CONTRIB      5832 non-null   float64       
 9   GOVT_CONTRIB         5832 non-null   float64       
 10  RETAIL_MARKUP        5832 non-null   float64       
 11  TOTAL_COST           5832 non-null   float64       
 12  PATIENT_NET_CONTRIB  5832 non-null   float64       
 13  sheet_name           5832 non-nu

One of the things you might be interested in is changes in formulation type, especially if looking at supply shortages. However, the PBS data does not have a column for this and it is only in the PBS drug map.

So below we will MAP the formulation information from the PBS drug map onto our PBS data.

In [26]:
#first create a dictionary for our mapping process
#see here for more details: https://www.geeksforgeeks.org/python/python-mapping-key-values-to-dictionary/

#this will have a key (item code) and value (formulation)

mapping = dict(zip(dfItemCodes["ITEM_CODE"], #our item codes of interest
                   dfItemCodes["FORM/STRENGTH"])) #their associated formulation

#have a look to see if it worked
mapping

{'01884E': 'Capsule 250 mg (as trihydrate)',
 '01889K': 'Capsule 500 mg (as trihydrate)',
 '03300Q': 'Capsule 500 mg (as trihydrate)',
 '03301R': 'Capsule 250 mg (as trihydrate)',
 '11947T': 'Capsule 500 mg (as trihydrate)',
 '11998L': 'Capsule 250 mg (as trihydrate)',
 '02655R': 'Capsule 250 mg (as monohydrate)',
 '03058Y': 'Capsule 250 mg (as monohydrate)',
 '03119E': 'Capsule 500 mg (as monohydrate)',
 '03317N': 'Capsule 250 mg (as monohydrate)',
 '03318P': 'Capsule 500 mg (as monohydrate)',
 '10778G': 'Capsule 500 mg (as monohydrate)',
 '11934D': 'Capsule 500 mg (as monohydrate)',
 '11963P': 'Capsule 250 mg (as monohydrate)'}

In [27]:
#now we need to add a column to our PBS data and map the formulation to that column based on what the item code is
#the basic approach is df['new_column'] = df['existing_column'].map(your_dict)

#df_PBSFiltered was created by filtering df_PBS, so take an independent copy first
#without this, pandas raises a SettingWithCopyWarning when the new column is added
df_PBSFiltered = df_PBSFiltered.copy()

df_PBSFiltered["FORM/STRENGTH"] = df_PBSFiltered['ITEM_CODE'].map(mapping)

df_PBSFiltered.head()


Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt,FORM/STRENGTH
1870,202107,01884E,J01CA04,AMOXICILLIN,C0,Section 85,ABOVE CO-PAYMENT,567,0.0,8345.18,2879.97,8345.18,112.64,DOS_FY2021_22,2021-07-01,Capsule 250 mg (as trihydrate)
1871,202107,01884E,J01CA04,AMOXICILLIN,C1,Section 85,ABOVE CO-PAYMENT,2829,17417.4,22829.25,13285.17,40246.65,17193.96,DOS_FY2021_22,2021-07-01,Capsule 250 mg (as trihydrate)
1872,202107,01884E,J01CA04,AMOXICILLIN,C1,Section 85,UNDER CO-PAYMENT,22,32.87,0.0,0.0,32.87,134.05,DOS_FY2021_22,2021-07-01,Capsule 250 mg (as trihydrate)
1873,202107,01884E,J01CA04,AMOXICILLIN,G1,Section 85,ABOVE CO-PAYMENT,17,105.6,145.86,86.02,251.46,103.6,DOS_FY2021_22,2021-07-01,Capsule 250 mg (as trihydrate)
1874,202107,01884E,J01CA04,AMOXICILLIN,G2,Section 85,ABOVE CO-PAYMENT,75,495.0,632.88,326.8,1127.88,490.77,DOS_FY2021_22,2021-07-01,Capsule 250 mg (as trihydrate)
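
One thing to watch with `.map()` is that any item code missing from the dictionary ends up with an empty (NaN) value rather than an error. A minimal sketch on made-up data, where `toy` and `toy_mapping` are invented names and `00000B` is deliberately left out of the dictionary:

```python
import pandas as pd

toy = pd.DataFrame({"ITEM_CODE": ["01884E", "01889K", "00000B"]})

# Dictionary of item code -> formulation; "00000B" is left out on purpose
toy_mapping = {
    "01884E": "Capsule 250 mg (as trihydrate)",
    "01889K": "Capsule 500 mg (as trihydrate)",
}

# Look up each item code in the dictionary and store the result in a new column
toy["FORM/STRENGTH"] = toy["ITEM_CODE"].map(toy_mapping)

# Count the rows where no formulation was found
n_unmapped = toy["FORM/STRENGTH"].isna().sum()

print(toy)
print("Rows without a formulation:", n_unmapped)
```

If the count is not zero, check whether the missing codes should be added to your item code file.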


## Save the data

You can do additional cleaning and wrangling in Excel or [OpenRefine](https://openrefine.org/).

# Data exploration

In [29]:
#if you want to look at what the unique values are in a column you can use the code structure dataFrame["my_column"].unique()
#for example if I want to check which item codes I have filtered to make sure my code above worked

df_PBSFiltered["ITEM_CODE"].unique()

array(['01884E', '01889K', '02655R', '03058Y', '03119E', '03300Q',
       '03301R', '03317N', '03318P', '10778G', '11934D', '11947T',
       '11963P', '11998L'], dtype=object)

In [30]:
#similar to unique(), you can also see how many of each category you have with dataFrame["my_column"].value_counts()
df_PBSFiltered["ITEM_CODE"].value_counts()

ITEM_CODE
03119E    505
01889K    499
11947T    486
11934D    473
03058Y    433
10778G    422
11998L    421
02655R    419
03300Q    418
01884E    417
11963P    398
03318P    358
03301R    325
03317N    258
Name: count, dtype: int64

In [63]:
#explore some columns here and add additional code chunks as needed

## Pivot tables

Pivot tables are highly useful tools to summarise data. This can be done in [Excel](https://support.microsoft.com/en-us/office/overview-of-pivottables-and-pivotcharts-527c8fa3-02c0-445a-a2db-7794676bce96#:~:text=A%20PivotTable%20is%20an%20interactive,unanticipated%20questions%20about%20your%20data.) but can be done in pandas too and it is absolutely a favourite of mine when trying to understand my data.

Examples of how to create pivot tables with python can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html).
**NOTE**: scroll past the parameter documentation to get to the examples. The documentation itself gives you more detail on all the functionality available to you.


In [31]:
#Let's say we want to look at the number of scripts each month, irrespective of script type etc

scriptCount_table = pd.pivot_table( df_PBSFiltered, #the dataframe
                                    values="PRSCRPTN_CNT", #which column to summarise - in this case the number of scripts
                                    index=["MONTH_OF_SUPPLY_dt"], #group by column - in this case date!
                                    aggfunc="sum" #how to summarise - here we add the values up
)

#show the first few rows
scriptCount_table.head()

Unnamed: 0_level_0,PRSCRPTN_CNT
MONTH_OF_SUPPLY_dt,Unnamed: 1_level_1
2021-07-01,594486
2021-08-01,572839
2021-09-01,531048
2021-10-01,509598
2021-11-01,549714


In [32]:
#Let's say we want to look at the number of scripts each month, GROUPED by drug name
#(to group by something else, e.g. patient category, put a different column name in columns=)

scriptCount_RxType_table = pd.pivot_table( df_PBSFiltered, #the dataframe
                                    values="PRSCRPTN_CNT", #which column to summarise - in this case the number of scripts
                                    index=["MONTH_OF_SUPPLY_dt"], #group by column - in this case date!
                                    aggfunc="sum", #how to summarise - here we add the values up
                                    columns = ["DRUG_NAME"] #make one column for each drug
)

#show the first few rows
scriptCount_RxType_table.head()

DRUG_NAME,AMOXICILLIN,CEFALEXIN
MONTH_OF_SUPPLY_dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2021-07-01,256750,337736
2021-08-01,240831,332008
2021-09-01,211738,319310
2021-10-01,190724,318874
2021-11-01,207866,341848
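
To make the roles of `values`, `index`, `columns` and `aggfunc` concrete, here is a small sketch on made-up numbers. The dataframe `toy` is invented, and `fill_value=0` is an optional extra (not used in the notebook's code) that puts a 0 wherever a drug has no rows for a month instead of leaving the cell empty.

```python
import pandas as pd

# Made-up data: three rows in July, one row in August
toy = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": pd.to_datetime(
        ["2021-07-01", "2021-07-01", "2021-07-01", "2021-08-01"]
    ),
    "DRUG_NAME": ["AMOXICILLIN", "AMOXICILLIN", "CEFALEXIN", "CEFALEXIN"],
    "PRSCRPTN_CNT": [10, 5, 7, 3],
})

toy_pivot = pd.pivot_table(
    toy,
    values="PRSCRPTN_CNT",       # the numbers to summarise
    index="MONTH_OF_SUPPLY_dt",  # one row per month
    columns="DRUG_NAME",         # one column per drug
    aggfunc="sum",               # add the numbers up within each month/drug cell
    fill_value=0,                # show 0 where a month/drug combination has no rows
)

print(toy_pivot)
```

In this toy table the two July amoxicillin rows (10 and 5) are added into a single cell, and August amoxicillin shows 0 because there were no rows for it.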


In [14]:
#try your own pivot table here

## Adding columns to your pivot table

You might want to add a total column to your table to get a sense of total dispensings for a particular cluster of medications

In [33]:
#add totals column
#note axis = 1 tells it to sum horizontally (across the columns, one total per row), whereas axis = 0 will sum vertically (down the rows, one total per column)
scriptCount_RxType_table['TOTAL'] = scriptCount_RxType_table.sum(axis=1)

scriptCount_RxType_table.head()

DRUG_NAME,AMOXICILLIN,CEFALEXIN,TOTAL
MONTH_OF_SUPPLY_dt,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-07-01,256750,337736,594486
2021-08-01,240831,332008,572839
2021-09-01,211738,319310,531048
2021-10-01,190724,318874,509598
2021-11-01,207866,341848,549714
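
The difference between `axis=1` and `axis=0` is easiest to see on a tiny made-up table. In this sketch `toy_table` and `drug_columns` are invented names. Listing the columns to add up explicitly is an optional safeguard, not something the notebook's code does: it stops the TOTAL column from being included in its own sum if the cell is run a second time.

```python
import pandas as pd

# Made-up pivot-style table: one row per month, one column per drug
toy_table = pd.DataFrame(
    {"AMOXICILLIN": [10, 20], "CEFALEXIN": [1, 2]},
    index=pd.to_datetime(["2021-07-01", "2021-08-01"]),
)

# Name the columns to add up, so TOTAL is never counted twice on a re-run
drug_columns = ["AMOXICILLIN", "CEFALEXIN"]

# axis=1: add across the columns, giving one total per row (per month)
toy_table["TOTAL"] = toy_table[drug_columns].sum(axis=1)

# axis=0: add down the rows, giving one total per column
column_totals = toy_table.sum(axis=0)

print(toy_table)
print(column_totals)
```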


## Save the data

Let's load what we have so far into [RawGraphs](https://app.rawgraphs.io/) and think about what else we might need to add or change to make a nice graph!

In [34]:
# Save the DataFrame to a CSV file named '20260224PBS_AmoxCeph_Init.csv'
# Again, give your output name a sensible name
# Here I have put the date of my analysis at the front as YYYYMMDD format so I know WHEN I did the analysis, followed by some kind of descriptor

scriptCount_RxType_table.to_csv('20260224PBS_AmoxCeph_Init.csv')


<div style="border: 3px solid red; padding: 12px; background-color: #ffe5e5; color: red; font-weight: bold;font-size: 50px">
    STOP AND THINK
</div>

1. What do I want to visualise?

2. Is the data in the correct format?

3. How do I need to transform the data?


# Wide to Long format

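Sometimes we want to reshape our data from long to wide and vice versa.
In WIDE format each category has its own column (like the pivot tables above); in LONG format each row holds one date, one category and one value.
Graphing tools often expect the LONG format.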

Check out this webpage which explains what this means in more detail: https://towardsdatascience.com/reshaping-a-pandas-dataframe-long-to-wide-and-vice-versa-517c7f0995ad/


**WIDE FORMAT**

|Index| CAT-1| CAT-2 |  CAT-3|
|-----| ----| ---- | -----|
|Date-1| Num 1| Num 2 | Num 3|

## Check for missing dates

In [85]:
#earliest date
print("Earliest date:", scriptCount_table.index.min())


#most recent date
print("Most recent date:", scriptCount_table.index.max())




#create a series of every month from the min (earliest) to the max (latest) date
date_range = pd.date_range(start=scriptCount_table.index.min(), #earliest date
                           end=scriptCount_table.index.max(), #latest date
                           freq='MS') #start of month

#find the difference between our dates and what should be there
date_range.difference(scriptCount_table.index)



Earliest date: 2021-07-01 00:00:00
Most recent date: 2025-11-01 00:00:00


DatetimeIndex([], dtype='datetime64[ns]', freq='MS')
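
An empty result like the one above means no months are missing. To see what the check looks like when a month IS missing, here is a sketch on made-up data with September deliberately left out. The names `toy_counts`, `expected_months` and `toy_complete` are invented, and adding the missing months back with `reindex` is one possible follow-up rather than a required step.

```python
import pandas as pd

# Made-up monthly counts; September 2021 is left out on purpose
toy_counts = pd.DataFrame(
    {"PRSCRPTN_CNT": [100, 120, 90]},
    index=pd.to_datetime(["2021-07-01", "2021-08-01", "2021-10-01"]),
)

# Every month start ("MS") between the earliest and latest date
expected_months = pd.date_range(
    start=toy_counts.index.min(), end=toy_counts.index.max(), freq="MS"
)

# Months that should be there but are not
missing_months = expected_months.difference(toy_counts.index)
print(missing_months)

# Optional: add the missing months as rows with an empty (NaN) count
toy_complete = toy_counts.reindex(expected_months)
print(toy_complete)
```

Whether a missing month should stay empty or be treated as zero depends on how the data was generated, so check the data documentation before filling anything in.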

# Final touches

## Changing column names

One of the final things you want to do before saving is change your column names so that they come up nicely in your visual.

In [16]:
#change column names
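
A minimal sketch of renaming, using a made-up table. The names `toy_table` and `new_names`, and the new labels themselves, are only examples; swap in whatever reads well in your visual.

```python
import pandas as pd

# Made-up pivot-style table with the original column names
toy_table = pd.DataFrame(
    {"AMOXICILLIN": [10, 20], "CEFALEXIN": [1, 2], "TOTAL": [11, 22]},
    index=pd.to_datetime(["2021-07-01", "2021-08-01"]),
)
toy_table.index.name = "MONTH_OF_SUPPLY_dt"

# Dictionary of old name -> new name (the new names are only examples)
new_names = {"AMOXICILLIN": "Amoxicillin", "CEFALEXIN": "Cefalexin", "TOTAL": "Total"}

# rename() changes the column names; rename_axis() changes the name of the date index
toy_renamed = toy_table.rename(columns=new_names).rename_axis("Month")

print(toy_renamed)
```

Any column not listed in the dictionary keeps its original name, so you only need to list the ones you want to change.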


# Save your data

Save the data you want to visualise as a csv file.
You should now be able to import this into a "point and click" tool, or you can use the R notebook provided to you to visualise your data.

In [67]:
# Save the DataFrame to a CSV file named 'PBS_Wrangled_ALL.csv'
#again, give your output name a sensible name

scriptCount_table.to_csv('PBS_Wrangled_ALL.csv')
