<a href="https://colab.research.google.com/github/p-stehlik/StudentNotebooks/blob/main/PBSTimeSeries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4013PHM PBS Data Wrangling Notebook

This notebook provides a step-by-step guide to data wrangling (cleaning, filtering, etc.) for your publicly available PBS data.

You can amend this notebook for MBS or other data but you should feel comfortable that you understand the code and what it means, and make any adjustments accordingly.

HINT: use Copilot or Google AI to help you understand the code and how you might amend it to suit your needs.

If you do update the code, it is good practice to document your logic (i.e. what the code is doing and why) using comments.

Comments are written by putting # at the start of a line of code. I have put lots of comments in the code below; hopefully they help you understand what you are doing and why!

Broadly, we want to clean whatever data you have so you can create a nice visual - so ultimately you need to consider what you want at the end and create a dataframe that can be easily analysed.

For the purposes of this course, we will use a [pre-developed Shiny app](https://robin-visser.shinyapps.io/The_TIM/) that can do a few different kinds of time series analyses for you.

Have a look at the example on the app to see what structure your final data needs to be in.

## Before you start

Be sure to read about the data you will use, including what each column means and what each category within a column represents.

You also need to consider:


*   How was the data generated?
*   When was it generated?
*   Is there any missing data you should be aware of?

# Load libraries

Libraries are packages (mini pieces of software) within Python that allow you to do things in your code without having to write everything from scratch.

There are LOTS of packages out there - we use a few below.

In [11]:
#highly used python package for data wrangling and analysis
import pandas as pd

#package for datetime data - ie dealing with dates and time!
from datetime import datetime

#load the data

In [12]:
#LOAD THE PBS DATA

#To get the file path go to the file explorer tab, press the "three dots" and click "copy path"
#paste the file path into the quotation marks below - NOTE: the quotation marks tell python that it's a string (text) and not a command
file_path = "/content/PBS_Data/dos-jul-2021-to-nov-2025.xlsx"

#the code below reads an .xlsx file, looks at all the sheets, and then merges them into one huge table
#each row also gets a "sheet_name" column recording which sheet it came from
#don't change the code below - keep as is.

sheets = pd.read_excel(file_path, sheet_name=None)


df_PBS = pd.concat(
    [d.assign(sheet_name=name) for name, d in sheets.items()],
    ignore_index=True
)

#this creates a "data frame" which we have named "df_PBS"
#honestly you can name the dataframe bananaMOOOMOO for all the program cares, but convention is to put df_NAME, and for your readability give it a sensible name, and you cannot use spaces!
#so if you are for example looking at MBS data, you can change to df_MBS or something
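To see what the loading cell does without needing the real file, here is a minimal sketch on made-up data. It assumes, as in the cell above, that reading with `sheet_name=None` gives a dictionary of sheet name to dataframe; the sheet names (`FY_A`, `FY_B`) and values are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for the dictionary returned by pd.read_excel(file_path, sheet_name=None):
# each key is a sheet name, each value is that sheet's data as a DataFrame.
# These sheet names and numbers are made up for illustration.
toy_sheets = {
    "FY_A": pd.DataFrame({"ITEM_CODE": ["00001A", "00002B"], "PRSCRPTN_CNT": [10, 20]}),
    "FY_B": pd.DataFrame({"ITEM_CODE": ["00001A"], "PRSCRPTN_CNT": [30]}),
}

# Same pattern as the real cell: label each sheet's rows with the sheet name,
# then stack all the sheets on top of each other with a fresh row index.
df_toy = pd.concat(
    [d.assign(sheet_name=name) for name, d in toy_sheets.items()],
    ignore_index=True,
)

print(df_toy)
```

The `sheet_name` column is handy later if you ever need to trace a row back to the sheet it came from.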

In [13]:
#have a look at the first few rows of df_PBS

df_PBS.head() # you can change the number of rows by putting a number in the brackets, e.g. df_PBS.head(10)

Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name
0,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,ABOVE CO-PAYMENT,771,0.0,30747.73,0.0,30747.73,18.8,DOS_FY2021_22
1,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,UNDER CO-PAYMENT,25,6.6,0.0,0.0,6.6,89.72,DOS_FY2021_22
2,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,ABOVE CO-PAYMENT,795,5233.8,41013.11,0.0,46246.91,2868.49,DOS_FY2021_22
3,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,UNDER CO-PAYMENT,2,12.3,0.0,0.0,12.3,12.3,DOS_FY2021_22
4,202107,00013Q,Z,EXTEMPORANEOUSLY PREPARED,C0,Section 85,ABOVE CO-PAYMENT,1959,0.0,83309.92,0.0,83309.92,23.09,DOS_FY2021_22


In [None]:
#the dataFrame.info() function lets you explore your dataframe
#it tells you the column names, the number of columns, the number of rows (entries) and the data types
#object = text (strings), which is how categories such as ITEM_CODE are stored here
#int64 = whole numbers; float64 = numbers with decimals

df_PBS.info()

#notice that the MONTH_OF_SUPPLY column is an int, not a date. We will need to convert this to a date format for a time series analysis!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1210410 entries, 0 to 1210409
Data columns (total 14 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   MONTH_OF_SUPPLY      1210410 non-null  int64  
 1   ITEM_CODE            1210410 non-null  object 
 2   ATC5_CODE            1210410 non-null  object 
 3   DRUG_NAME            1210410 non-null  object 
 4   PTNT_CTGRY_DRVD_CD   1210410 non-null  object 
 5   DRG_TYP_CTGRY        1210410 non-null  object 
 6   SCRIPT_TYPE          1210410 non-null  object 
 7   PRSCRPTN_CNT         1210410 non-null  int64  
 8   PATIENT_CONTRIB      1210410 non-null  float64
 9   GOVT_CONTRIB         1210410 non-null  float64
 10  RETAIL_MARKUP        1210410 non-null  float64
 11  TOTAL_COST           1210410 non-null  float64
 12  PATIENT_NET_CONTRIB  1210410 non-null  float64
 13  sheet_name           1210410 non-null  object 
dtypes: float64(5), int64(2), object(7)
memory usage: 1
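The "Non-Null Count" column above is one way to check for missing data, but it only counts truly empty cells. The `head()` output earlier shows rows labelled `MISSING ITEM CODE`, which are not empty but still flag a gap. A small sketch of checking both, on made-up data (the toy dataframe and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up data: one placeholder label and one genuinely empty cell
df_toy = pd.DataFrame({
    "DRUG_NAME": ["MISSING ITEM CODE", "IVERMECTIN", None],
    "PRSCRPTN_CNT": [771, 179, 25],
})

# Count empty (null) cells in each column
null_counts = df_toy.isna().sum()

# Count rows that are filled in, but only with a placeholder label
placeholder_rows = int((df_toy["DRUG_NAME"] == "MISSING ITEM CODE").sum())

print(null_counts)
print(placeholder_rows)
```

On your real data you would replace `df_toy` with `df_PBS` and decide whether placeholder rows matter for your question.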

One of the most important steps to data cleaning is ensuring that data is in the correct format.

One of the most DIFFICULT data types to work with is date/time, and it is especially painful when you work with dates and Excel!

In [24]:
#There are a few different ways to do this but an example is provided below

df_PBS['MONTH_OF_SUPPLY_dt'] = pd.to_datetime(df_PBS['MONTH_OF_SUPPLY'],
                                              format = "%Y%m")

#NOTE: the format argument tells pandas what format the data has come in so it can accurately convert it to a date/time. The PBS data came as 202107 - i.e. year and then month without a day
#because there is no day, every date is set to the 1st of the month


#have a look at what the data looks like now and check it worked
df_PBS.head()

Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt
0,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,ABOVE CO-PAYMENT,771,0.0,30747.73,0.0,30747.73,18.8,DOS_FY2021_22,2021-07-01
1,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,UNDER CO-PAYMENT,25,6.6,0.0,0.0,6.6,89.72,DOS_FY2021_22,2021-07-01
2,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,ABOVE CO-PAYMENT,795,5233.8,41013.11,0.0,46246.91,2868.49,DOS_FY2021_22,2021-07-01
3,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,UNDER CO-PAYMENT,2,12.3,0.0,0.0,12.3,12.3,DOS_FY2021_22,2021-07-01
4,202107,00013Q,Z,EXTEMPORANEOUSLY PREPARED,C0,Section 85,ABOVE CO-PAYMENT,1959,0.0,83309.92,0.0,83309.92,23.09,DOS_FY2021_22,2021-07-01
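If you want to convince yourself how the conversion behaves before trusting it on the full dataset, you can try it on a few values. This is a small sketch on made-up month codes; converting to text first with `.astype(str)` is my own precaution here, not something the cell above needs.

```python
import pandas as pd

# A few made-up year-month codes in the same style as MONTH_OF_SUPPLY
toy_months = pd.Series([202107, 202112, 202201])

# "%Y%m" means: four-digit year followed directly by two-digit month
toy_dates = pd.to_datetime(toy_months.astype(str), format="%Y%m")

print(toy_dates)

# Once the column is a real date you can pull out parts of it
print(toy_dates.dt.year.tolist())
print(toy_dates.dt.month.tolist())
```

Having a real date column is what lets later steps sort and group by month correctly.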


You should be able to see above that there are LOTS of rows, but you can now see what each row contains.
It is important to remember that this dataframe has ALL of the data, but you will only need the rows you are interested in.

In [47]:
#create a list of item codes you want to search for
#the square brackets tell python you are making a list!
#each item code will need to be in quotation marks and then separated by a comma
#again you can call the list below whatever you want, but I thought PBSItems might be intuitive here, just give it a sensible name.


PBSItems = ["02868Y",
            "08172D",
            "08359Y"
            ]

#now filter your large df_PBS and only get the rows of interest
#the code below looks into the df_PBS dataframe, then looks through the column called "ITEM_CODE" and keeps the rows that match any of the item codes you have listed in the "PBSItems" list

df_PBSFiltered = df_PBS[df_PBS['ITEM_CODE'].isin(PBSItems)]

df_PBSFiltered.head()

Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt
4490,202107,02868Y,P02CF01,IVERMECTIN,C0,Section 85,ABOVE CO-PAYMENT,179,0.0,8124.54,723.01,8124.54,0.0,DOS_FY2021_22,2021-07-01
4491,202107,02868Y,P02CF01,IVERMECTIN,C1,Section 85,ABOVE CO-PAYMENT,636,3861.0,24458.4,2525.83,28319.4,3688.3,DOS_FY2021_22,2021-07-01
4492,202107,02868Y,P02CF01,IVERMECTIN,G1,Section 85,ABOVE CO-PAYMENT,1,6.6,40.64,4.3,47.24,6.6,DOS_FY2021_22,2021-07-01
4493,202107,02868Y,P02CF01,IVERMECTIN,G2,Section 85,ABOVE CO-PAYMENT,381,15076.0,2808.9,1636.16,17884.9,14938.4,DOS_FY2021_22,2021-07-01
4494,202107,02868Y,P02CF01,IVERMECTIN,G2,Section 85,UNDER CO-PAYMENT,31,903.46,0.0,52.77,903.46,1006.98,DOS_FY2021_22,2021-07-01


In [48]:
#let's have a look at the new df structure
df_PBSFiltered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 660 entries, 4490 to 1190241
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   MONTH_OF_SUPPLY      660 non-null    int64         
 1   ITEM_CODE            660 non-null    object        
 2   ATC5_CODE            660 non-null    object        
 3   DRUG_NAME            660 non-null    object        
 4   PTNT_CTGRY_DRVD_CD   660 non-null    object        
 5   DRG_TYP_CTGRY        660 non-null    object        
 6   SCRIPT_TYPE          660 non-null    object        
 7   PRSCRPTN_CNT         660 non-null    int64         
 8   PATIENT_CONTRIB      660 non-null    float64       
 9   GOVT_CONTRIB         660 non-null    float64       
 10  RETAIL_MARKUP        660 non-null    float64       
 11  TOTAL_COST           660 non-null    float64       
 12  PATIENT_NET_CONTRIB  660 non-null    float64       
 13  sheet_name           660 non-null

# Exploring your data

Now you might want to think about exploring your data a little and filtering it as needed.

I have provided some code snippets below that might be useful, but you should add your own and explore the data.

In [49]:
#if you want to look at what the unique values are in a column you can use the code structure dataFrame["my_column"].unique()
#for example if I want to check which item codes I have filtered to make sure my code above worked

df_PBSFiltered["ITEM_CODE"].unique()

array(['02868Y', '08359Y'], dtype=object)
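The output above shows only two of the three item codes in `PBSItems`; `08172D` does not appear, so no rows matched it. Filtering with `.isin()` does not warn you when a code is not found, so it is worth checking. A minimal sketch of one way to do that, using made-up data (the toy dataframe and the names `requested`, `found` and `not_found` are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-in for df_PBS
df_toy = pd.DataFrame({"ITEM_CODE": ["02868Y", "08359Y", "02868Y", "00013Q"]})

# The codes we asked for
requested = ["02868Y", "08172D", "08359Y"]

# Same filtering pattern as above
df_toy_filtered = df_toy[df_toy["ITEM_CODE"].isin(requested)]

# Compare what we asked for against what actually came back
found = set(df_toy_filtered["ITEM_CODE"].unique())
not_found = [code for code in requested if code not in found]

print(not_found)
```

If a code comes back as not found, check for typos in the code and confirm whether that item was actually supplied during the period your data covers.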

In [50]:
#similar to unique(), you can also see how many of each category you have with dataFrame["my_column"].value_counts()
df_PBSFiltered["ITEM_CODE"].value_counts()

ITEM_CODE,count
02868Y,348
08359Y,312


In [51]:
#explore some columns here and add additional code chunks as needed

## Pivot tables

Pivot tables are highly useful tools to summarise data. This can be done in [Excel](https://support.microsoft.com/en-us/office/overview-of-pivottables-and-pivotcharts-527c8fa3-02c0-445a-a2db-7794676bce96#:~:text=A%20PivotTable%20is%20an%20interactive,unanticipated%20questions%20about%20your%20data.) but can be done in pandas too and it is absolutely a favourite of mine when trying to understand my data.

Examples of how to create pivot tables with python can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html).
NOTE: you need to scroll past the parameter documentation to get to the examples. However, the documentation gives you more detail on all the functionality available to you.


In [52]:
#Let's say we want to look at the number of scripts each month, irrespective of script type etc

scriptCount_table = pd.pivot_table(
    df_PBSFiltered,                 #the dataframe
    values="PRSCRPTN_CNT",          #the column to summarise - in this case the number of scripts
    index=["MONTH_OF_SUPPLY_dt"],   #the column to group by - in this case date!
    aggfunc="sum",                  #what you want to do with the values - here, add them up
)

#show the first few rows
scriptCount_table.head()

MONTH_OF_SUPPLY_dt,PRSCRPTN_CNT
2021-07-01,1468
2021-08-01,1183
2021-09-01,1471
2021-10-01,1138
2021-11-01,1172


In [53]:
#Let's say we want to look at the number of scripts each month, GROUPED by patient category (whether the patient was concession etc)

scriptCount_RxType_table = pd.pivot_table(
    df_PBSFiltered,                    #the dataframe
    values="PRSCRPTN_CNT",             #the column to summarise - in this case the number of scripts
    index=["MONTH_OF_SUPPLY_dt"],      #the column to group by - in this case date!
    aggfunc="sum",                     #what you want to do with the values - here, add them up
    columns=["PTNT_CTGRY_DRVD_CD"],    #make one column per patient category
)

#show the first few rows
scriptCount_RxType_table.head()

MONTH_OF_SUPPLY_dt (rows) / PTNT_CTGRY_DRVD_CD (columns),C0,C1,G1,G2,R0,R1
2021-07-01,218.0,747.0,1.0,471.0,9.0,22.0
2021-08-01,178.0,490.0,9.0,474.0,11.0,21.0
2021-09-01,319.0,627.0,1.0,488.0,23.0,13.0
2021-10-01,210.0,460.0,32.0,404.0,21.0,11.0
2021-11-01,233.0,437.0,38.0,440.0,18.0,6.0


In [54]:
#try your own pivot table here

## Adding columns to your pivot table

In [None]:
#add totals column
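One possible way to add a totals column is to sum across each row of the pivot table. The sketch below uses a small made-up dataframe rather than the real PBS data; the column name `TOTAL` and the use of `fill_value=0` (so that months with no scripts in a category show 0 instead of a blank) are my own choices, not requirements.

```python
import pandas as pd

# Made-up data in the same shape as df_PBSFiltered
toy = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": pd.to_datetime(["2021-07-01", "2021-07-01", "2021-08-01"]),
    "PTNT_CTGRY_DRVD_CD": ["C0", "C1", "C0"],
    "PRSCRPTN_CNT": [5, 7, 3],
})

# One row per month, one column per patient category
toy_table = pd.pivot_table(
    toy,
    values="PRSCRPTN_CNT",
    index=["MONTH_OF_SUPPLY_dt"],
    columns=["PTNT_CTGRY_DRVD_CD"],
    aggfunc="sum",
    fill_value=0,  # show 0 where a category has no scripts that month
)

# axis=1 means "add up across the columns", giving one total per row (month)
toy_table["TOTAL"] = toy_table.sum(axis=1)

print(toy_table)
```

Create the totals column only once; if you re-run the summing line after `TOTAL` already exists, the old total gets added in as well.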

# Changing column names

One of the final things you want to do before saving is change your column names so that they come up nicely in your visual.

In [56]:
#change column names
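A common way to rename columns in pandas is `rename(columns=...)` with a dictionary of old name to new name. The sketch below uses a made-up table; the new labels (`Category C0`, `Category C1`, `Month`) are placeholders only, so replace them with labels that reflect what the categories actually mean in the PBS data documentation.

```python
import pandas as pd

# Made-up table in the same shape as the grouped pivot table above
toy_table = pd.DataFrame(
    {"C0": [218, 178], "C1": [747, 490]},
    index=pd.to_datetime(["2021-07-01", "2021-08-01"]),
)
toy_table.index.name = "MONTH_OF_SUPPLY_dt"

# Dictionary of old name -> new name (placeholder labels, choose your own)
toy_renamed = toy_table.rename(columns={"C0": "Category C0", "C1": "Category C1"})

# The date is the index of a pivot table, so it is renamed separately
toy_renamed = toy_renamed.rename_axis("Month")

print(toy_renamed)
```

Columns not listed in the dictionary keep their original names.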

# Save your data

Save the data you want to visualise as a csv file.
You should now be able to import this into a "point and click" tool, or you can use the R notebook provided to you to visualise your data

In [55]:
# Save the DataFrame to a CSV file named 'PBS_Wrangled_ALL.csv'
#again, give your output file a sensible name
scriptCount_table.to_csv('PBS_Wrangled_ALL.csv')
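It can be worth reading the saved file back in to confirm it has the structure you expect before uploading it anywhere. A minimal sketch on made-up data; the file name `toy_wrangled.csv` and the use of a temporary folder are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up table in the same shape as scriptCount_table
toy_table = pd.DataFrame(
    {"PRSCRPTN_CNT": [1468, 1183]},
    index=pd.to_datetime(["2021-07-01", "2021-08-01"]),
)
toy_table.index.name = "MONTH_OF_SUPPLY_dt"

# Save to a temporary folder so this example does not clutter your files
out_path = os.path.join(tempfile.gettempdir(), "toy_wrangled.csv")
toy_table.to_csv(out_path)  # the date index is written as the first column

# Read it back and ask pandas to treat the first column as dates again
check = pd.read_csv(out_path, parse_dates=["MONTH_OF_SUPPLY_dt"])

print(check.head())
```

The date index is saved as an ordinary column, so the file has one date column followed by your value column(s).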
