<a href="https://colab.research.google.com/github/p-stehlik/StudentNotebooks/blob/main/PBSTimeSeries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4013PHM PBS Data Wrangling Notebook
This notebook provides a step-by-step guide to data wrangling (cleaning, filtering, etc.) for publicly available PBS data.

You can adapt this notebook for other time series data, but you should be comfortable that you understand the code and what it means, and adjust it accordingly.

HINT: use co-pilot or Google AI to help you understand the code and how you might amend it to suit your needs.

If you do update the code, it is good practice to document your logic (i.e. what the code is doing and why) using comments.

Comments are written by putting # at the start of a line of code. I have added lots of comments below, which will hopefully help you understand what you are doing and why!

Broadly, we want to clean whatever data you have so you can create a nice visual - so ultimately you need to consider what you want at the end and create a dataframe that can be easily analysed.

For the purposes of this course, we will create a simple line graph using Python to visualise our data.

There are plenty of Shiny apps which can do more complex forecasting and ITS analysis; however, this is beyond the scope of this course.

While you are not prevented from using them, or from using AI to generate analytic code, you should do so **WITH CAUTION**.

**WORD OF WARNING**
Time series analysis is complex and should be guided or done by someone with the knowledge and skills to make informed decisions about the model itself. As a clinician, you should still be able to interpret the outputs of such an analysis and what they might mean for your patients.



## Before you start

Be sure to read about the data you will use, including what each column means and what each category within a column represents.

You also need to consider:


*   How was the data generated?
*   When was it generated?
*   Is there any missing data you should be aware of?





# Load libraries

Libraries (also called packages) are pieces of pre-written software for Python that let you do things in your code without having to write everything from scratch.

There are LOTS of packages out there - we use a few below.

In [1]:
#highly used python package for data wrangling and analysis
import pandas as pd

#package for datetime data - ie dealing with dates and time!
from datetime import datetime

#Package so Paulie can add some fancy stuff in the markdown boxes :)
from IPython.display import HTML

#packages needed to download data directly from a website
import requests
from io import BytesIO

#so you can work in Google Colab (uncomment the two lines below if you need to mount your Google Drive)
#from google.colab import drive
#drive.mount('/content')

# Download data from a website

Below you will learn how to download the [PBS dispensing data](https://www.pbs.gov.au/info/statistics/dos-and-dop/dos-and-dop) but the code can be adjusted to download data from ANY website.

As with any dataset, you need to familiarise yourself with the data and understand what each column means - **so be sure to read any explanatory notes**.

The PBS notes are in the link above.


In [2]:
#the code below will download the PBS dispensing data
#you can amend this with whatever you want if you want to download different datasets directly from a website
#just copy and paste the correct link in


# --- User input needed ---
# Replace with the actual URL of your Excel file
excel_url = "https://www.pbs.gov.au/statistics/dos-and-dop/files/dos-jul-2021-to-nov-2025.xlsx"


# Replace with the desired local file name (e.g., 'downloaded_data.xlsx')
output_filename = "dos-jul-2021-to-nov-2025.xlsx"
# -------------------------

try:
    response = requests.get(excel_url)
    response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

    # Save the downloaded file locally
    with open(output_filename, 'wb') as f:
        f.write(response.content)
    print(f"Excel file downloaded successfully as '{output_filename}'")

    # Read the Excel file into a pandas DataFrame directly from bytes content
    # Or, if you prefer to read from the saved file:
    # df_downloaded_excel = pd.read_excel(output_filename)
    # For this example, we'll read directly from the in-memory content
    df_downloaded_excel = pd.read_excel(BytesIO(response.content))

    print("First 5 rows of the downloaded data:")
    print(df_downloaded_excel.head())

except requests.exceptions.RequestException as e:
    print(f"Error downloading the file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")


Excel file downloaded successfully as 'dos-jul-2021-to-nov-2025.xlsx'
First 5 rows of the downloaded data:
   MONTH_OF_SUPPLY ITEM_CODE ATC5_CODE                  DRUG_NAME  \
0           202107    00000B         Z          MISSING ITEM CODE   
1           202107    00000B         Z          MISSING ITEM CODE   
2           202107    00000B         Z          MISSING ITEM CODE   
3           202107    00000B         Z          MISSING ITEM CODE   
4           202107    00013Q         Z  EXTEMPORANEOUSLY PREPARED   

  PTNT_CTGRY_DRVD_CD DRG_TYP_CTGRY       SCRIPT_TYPE  PRSCRPTN_CNT  \
0                 R0       Unknown  ABOVE CO-PAYMENT           771   
1                 R0       Unknown  UNDER CO-PAYMENT            25   
2                 R1       Unknown  ABOVE CO-PAYMENT           795   
3                 R1       Unknown  UNDER CO-PAYMENT             2   
4                 C0    Section 85  ABOVE CO-PAYMENT          1959   

   PATIENT_CONTRIB  GOVT_CONTRIB  RETAIL_MARKUP  TOTAL_CO

In [3]:
# let's also download the Item Code to Drug Mapping file as we will need it later


# --- User input needed ---
# Replace with the actual URL of your file (a .csv file in this case)
excel_url = "https://www.pbs.gov.au/statistics/dos-and-dop/files/pbs-item-drug-map.csv"


# Replace with the desired local file name (e.g., 'downloaded_data.csv')
output_filename = "pbs-item-drug-map.csv"
# -------------------------

try:
    response = requests.get(excel_url)
    response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

    # Save the downloaded file locally
    with open(output_filename, 'wb') as f:
        f.write(response.content)
    print(f"Excel file downloaded successfully as '{output_filename}'")

    # Read the CSV file into a pandas DataFrame directly from bytes content
    # Or, if you prefer to read from the saved file:
    # df_downloaded_csv = pd.read_csv(output_filename, encoding='latin1')
    # For this example, we'll read directly from the in-memory content
    df_downloaded_csv = pd.read_csv(BytesIO(response.content), encoding='latin1')

    print("First 5 rows of the downloaded data:")
    print(df_downloaded_csv.head())

except requests.exceptions.RequestException as e:
    print(f"Error downloading the file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Excel file downloaded successfully as 'pbs-item-drug-map.csv'
First 5 rows of the downloaded data:
  ITEM_CODE                  DRUG_NAME                     FORM/STRENGTH  \
0    00000A          MISSING ITEM CODE                 Missing Item Code   
1    00013Q  EXTEMPORANEOUSLY PREPARED                            Creams   
2    00015T  EXTEMPORANEOUSLY PREPARED                         Ear drops   
3    00016W                    ELIXIRS                      Generic term   
4    00019B  EXTEMPORANEOUSLY PREPARED  Eye drops containing cocaine hcl   

  ATC5_Code  
0         Z  
1         Z  
2         Z  
3         Z  
4         Z  
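The download cells in this notebook repeat the same pattern and only differ in whether the file is read with `pd.read_excel` or `pd.read_csv`. As a minimal sketch (the function name `read_downloaded_table` is made up for illustration and is not part of pandas or the original notebook), you could pick the reader from the file extension once you have the downloaded bytes, i.e. `response.content`:

```python
from io import BytesIO
from pathlib import Path

import pandas as pd


def read_downloaded_table(content: bytes, filename: str) -> pd.DataFrame:
    """Read downloaded bytes with the pandas reader that matches the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        # latin1 matches the encoding used for the CSV files in this notebook
        return pd.read_csv(BytesIO(content), encoding="latin1")
    if suffix in (".xlsx", ".xls"):
        # reads the first sheet only, like the download cell above
        return pd.read_excel(BytesIO(content))
    raise ValueError(f"Unsupported file type: {suffix!r}")


# Small in-memory example, so no internet connection is needed to try it
example_bytes = b"ITEM_CODE,DRUG_NAME\n02666H,TRIMETHOPRIM\n01693D,NITROFURANTOIN\n"
df_example = read_downloaded_table(example_bytes, "example.csv")
print(df_example)
```

In a real download cell you would call it as `read_downloaded_table(response.content, output_filename)` after `response.raise_for_status()`.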


## Thinking about the denominator!

In the case of PBS data, the dispensing data is the TOTAL number of dispensings.
In order to compare trends in a population over time, we also need POPULATION data as our denominator.

If you had local hospital data your denominator might be the total number of admissions, or in community, the total number of patients in a given time period.

This is because you may have more people at one time point than at another, so the raw number of dispensings can change even though the rate of dispensing per person stays the same.

**Note:** some datasets may have already done this by reporting values *per 1000 persons* or something similar.
Again this is why you need to familiarise yourself with the dataset FIRST
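
To see why the denominator matters, here is a small worked example. The numbers are made up purely for illustration (they are not real PBS or ABS figures): dispensings rise by 10% between the two years, but so does the population, so the rate per 1000 people does not change.

```python
import pandas as pd

# Made-up numbers purely for illustration (not real PBS or ABS figures)
df_example = pd.DataFrame({
    "Year": [2021, 2022],
    "Dispensings": [1000, 1100],
    "Population": [50000, 55000],
})

# Rate per 1000 people = dispensings * 1000 / population
df_example["RatePer1000"] = df_example["Dispensings"] * 1000 / df_example["Population"]

print(df_example)
```

Both years come out at 20 dispensings per 1000 people, even though the raw count went up by 100.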

In [4]:
#download denominator data
#in this case I am getting this from the ABS population projections


# --- User input needed ---
# Replace with the actual URL of your file (a .csv file in this case)
excel_url = "https://data.api.abs.gov.au/rest/data/ABS,POP_PROJ_2011,1.0.0/all?dimensionAtObservation=AllDimensions&format=csvfilewithlabels"


# Replace with the desired local file name (e.g., 'downloaded_data.csv')
output_filename = "ABSpop.csv"
# -------------------------

try:
    response = requests.get(excel_url)
    response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

    # Save the downloaded file locally
    with open(output_filename, 'wb') as f:
        f.write(response.content)
    print(f"Excel file downloaded successfully as '{output_filename}'")

    # Read the CSV file into a pandas DataFrame directly from bytes content
    # Or, if you prefer to read from the saved file:
    # df_downloaded_csv = pd.read_csv(output_filename, encoding='latin1')
    # For this example, we'll read directly from the in-memory content
    df_downloaded_csv = pd.read_csv(BytesIO(response.content), encoding='latin1')

    print("First 5 rows of the downloaded data:")
    print(df_downloaded_csv.head())

except requests.exceptions.RequestException as e:
    print(f"Error downloading the file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Excel file downloaded successfully as 'ABSpop.csv'
First 5 rows of the downloaded data:
  STRUCTURE              STRUCTURE_ID  \
0  DATAFLOW  ABS:POP_PROJ_2011(1.0.0)   
1  DATAFLOW  ABS:POP_PROJ_2011(1.0.0)   
2  DATAFLOW  ABS:POP_PROJ_2011(1.0.0)   
3  DATAFLOW  ABS:POP_PROJ_2011(1.0.0)   
4  DATAFLOW  ABS:POP_PROJ_2011(1.0.0)   

                                 STRUCTURE_NAME ACTION  ASGS_2011     Region  \
0  Population Projections, Australia, 2017-2066      I          0  Australia   
1  Population Projections, Australia, 2017-2066      I          0  Australia   
2  Population Projections, Australia, 2017-2066      I          0  Australia   
3  Population Projections, Australia, 2017-2066      I          0  Australia   
4  Population Projections, Australia, 2017-2066      I          0  Australia   

   SEX_ABS      Sex AGE Age  ...  TIME_PERIOD Time Period  OBS_VALUE  \
0        3  Persons  93  93  ...         2017         NaN      22612   
1        3  Persons  93  93  ...        

# Load the data

In [5]:
#LOAD THE PBS DATA

#To get the file path go to the file explorer tab, press the "three dots" and click "copy path"
#paste the file path into the quotation marks below - NOTE: the quotation marks tell Python that it's a string (text) and not a command
file_path = "/content/dos-jul-2021-to-nov-2025.xlsx"

#the code below reads an .xlsx file, looks at all the sheets, and then merges them into one huge sheet
#don't change the code below - keep as is.

sheets = pd.read_excel(file_path, sheet_name=None)


df_PBS = pd.concat(
    [d.assign(sheet_name=name) for name, d in sheets.items()],
    ignore_index=True
)

#this creates a "data frame" which we have named "df_PBS"
#honestly you can name the dataframe bananaMOOOMOO for all the program cares, but convention is to put df_NAME, and for your readability give it a sensible name, and you cannot use spaces!
#so if you are for example looking at MBS data, you can change to df_MBS or something
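
If you want to see what the merge step is doing without loading the full PBS file, here is a minimal sketch on two tiny made-up "sheets". The sheet names and numbers are assumptions for illustration only. With `sheet_name=None`, `pd.read_excel` returns a dictionary of `{sheet name: DataFrame}`, which is the shape mimicked here.

```python
import pandas as pd

# Two tiny made-up "sheets" in the same dictionary shape that
# pd.read_excel(..., sheet_name=None) returns: {sheet name: DataFrame}
sheets_example = {
    "FY2021_22": pd.DataFrame({"MONTH_OF_SUPPLY": [202107, 202108], "PRSCRPTN_CNT": [10, 12]}),
    "FY2022_23": pd.DataFrame({"MONTH_OF_SUPPLY": [202207], "PRSCRPTN_CNT": [15]}),
}

# For each sheet: add a column recording which sheet the rows came from,
# then stack all sheets on top of each other with a fresh 0, 1, 2... index
df_combined = pd.concat(
    [d.assign(sheet_name=name) for name, d in sheets_example.items()],
    ignore_index=True,
)
print(df_combined)
```

Keeping the `sheet_name` column means you can always trace a row back to the sheet it came from.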

In [6]:
#have a look at the first few rows of df_PBS

df_PBS.head() # you can change the number of rows by putting a number in the brackets, e.g. df_PBS.head(10)

Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name
0,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,ABOVE CO-PAYMENT,771,0.0,30747.73,0.0,30747.73,18.8,DOS_FY2021_22
1,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,UNDER CO-PAYMENT,25,6.6,0.0,0.0,6.6,89.72,DOS_FY2021_22
2,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,ABOVE CO-PAYMENT,795,5233.8,41013.11,0.0,46246.91,2868.49,DOS_FY2021_22
3,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,UNDER CO-PAYMENT,2,12.3,0.0,0.0,12.3,12.3,DOS_FY2021_22
4,202107,00013Q,Z,EXTEMPORANEOUSLY PREPARED,C0,Section 85,ABOVE CO-PAYMENT,1959,0.0,83309.92,0.0,83309.92,23.09,DOS_FY2021_22


In [7]:
# dataFrame.info() function lets you explore your dataframe
#it tells you the column names, the number of columns, the number of rows (entries) and the data types
#object = usually text (strings), which is how categories such as drug names are stored
#int64 = a whole number; float64 = a number with decimals

df_PBS.info()

#notice that the MONTH_OF_SUPPLY column is an int, not a date. We will need to convert this to a date format for a time series analysis!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1210410 entries, 0 to 1210409
Data columns (total 14 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   MONTH_OF_SUPPLY      1210410 non-null  int64  
 1   ITEM_CODE            1210410 non-null  object 
 2   ATC5_CODE            1210410 non-null  object 
 3   DRUG_NAME            1210410 non-null  object 
 4   PTNT_CTGRY_DRVD_CD   1210410 non-null  object 
 5   DRG_TYP_CTGRY        1210410 non-null  object 
 6   SCRIPT_TYPE          1210410 non-null  object 
 7   PRSCRPTN_CNT         1210410 non-null  int64  
 8   PATIENT_CONTRIB      1210410 non-null  float64
 9   GOVT_CONTRIB         1210410 non-null  float64
 10  RETAIL_MARKUP        1210410 non-null  float64
 11  TOTAL_COST           1210410 non-null  float64
 12  PATIENT_NET_CONTRIB  1210410 non-null  float64
 13  sheet_name           1210410 non-null  object 
dtypes: float64(5), int64(2), object(7)
memory usage: 1

One of the most important steps in data cleaning is ensuring that the data is in the correct format.

One of the most DIFFICULT data types to work with is dates and times. It is especially painful when you work with dates in Excel!

In [8]:
#There are a few different ways to do this but an example is provided below

df_PBS['MONTH_OF_SUPPLY_dt'] = pd.to_datetime(df_PBS['MONTH_OF_SUPPLY'],
                                              format = "%Y%m")

#NOTE: the format argument tells Python what format the data has come in so it can accurately convert it to a date/time. The PBS data came as 202107 - i.e. year and then month, without a day


#have a look at what the data looks like now and check it worked
df_PBS.head()

Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt
0,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,ABOVE CO-PAYMENT,771,0.0,30747.73,0.0,30747.73,18.8,DOS_FY2021_22,2021-07-01
1,202107,00000B,Z,MISSING ITEM CODE,R0,Unknown,UNDER CO-PAYMENT,25,6.6,0.0,0.0,6.6,89.72,DOS_FY2021_22,2021-07-01
2,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,ABOVE CO-PAYMENT,795,5233.8,41013.11,0.0,46246.91,2868.49,DOS_FY2021_22,2021-07-01
3,202107,00000B,Z,MISSING ITEM CODE,R1,Unknown,UNDER CO-PAYMENT,2,12.3,0.0,0.0,12.3,12.3,DOS_FY2021_22,2021-07-01
4,202107,00013Q,Z,EXTEMPORANEOUSLY PREPARED,C0,Section 85,ABOVE CO-PAYMENT,1959,0.0,83309.92,0.0,83309.92,23.09,DOS_FY2021_22,2021-07-01
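
The same conversion can be tried on a few made-up month codes. This is a minimal sketch, not the only way to do it: it converts the numbers to text first so the YYYYMM pattern is explicit, and it also shows the `errors="coerce"` option of `pd.to_datetime`, which turns values that do not match the format into `NaT` (a missing date) instead of stopping with an error.

```python
import pandas as pd

# Made-up month codes in the same YYYYMM layout as MONTH_OF_SUPPLY
df_dates = pd.DataFrame({"MONTH_OF_SUPPLY": [202107, 202112, 202201]})

# Converting to text first makes the YYYYMM pattern explicit before parsing
df_dates["MONTH_OF_SUPPLY_dt"] = pd.to_datetime(
    df_dates["MONTH_OF_SUPPLY"].astype(str), format="%Y%m"
)

# errors="coerce" turns anything that does not match the format into NaT
bad_values = pd.Series(["202107", "2021-13", "unknown"])
parsed = pd.to_datetime(bad_values, format="%Y%m", errors="coerce")

print(df_dates)
print(parsed)
```

Counting the `NaT` values afterwards (for example with `parsed.isna().sum()`) is a quick way to find out how many dates could not be read.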


You should be able to see above that there are LOTS of rows, and what each row contains.
It is important to remember that this dataframe now has ALL of the data, but you will only need the rows for whatever you are interested in.

# Create a filter for your item codes

You will need to create a dataframe (df) containing your item codes of interest.

This will be used to create a list of the item codes you are interested in!


In [9]:
#first, let's explore the item code list

#load the data - we need to tell the code where the .csv we downloaded is first
file_path = "/content/pbs-item-drug-map.csv"

#import the file
#again you can call the df below whatever you want, but I thought this name might be intuitive here, just give it a sensible name.
dfItemCodes = pd.read_csv(file_path, encoding='latin1')

#check it worked
dfItemCodes

#you can press the "filter button" to see which drug names you want to filter for!

Unnamed: 0,ITEM_CODE,DRUG_NAME,FORM/STRENGTH,ATC5_Code
0,00000A,MISSING ITEM CODE,Missing Item Code,Z
1,00013Q,EXTEMPORANEOUSLY PREPARED,Creams,Z
2,00015T,EXTEMPORANEOUSLY PREPARED,Ear drops,Z
3,00016W,ELIXIRS,Generic term,Z
4,00019B,EXTEMPORANEOUSLY PREPARED,Eye drops containing cocaine hcl,Z
...,...,...,...,...
11912,15215T,AFLIBERCEPT,Solution for intravitreal injection 6.6 mg in ...,S01LA05
11913,15216W,VANZACAFTOR + TEZACAFTOR + DEUTIVACAFTOR,Tablet containing 4 mg vanzacaftor (as calcium...,R07AX33
11914,15217X,AFLIBERCEPT,Solution for intravitreal injection 6.6 mg in ...,S01LA05
11915,15218Y,AFLIBERCEPT,Solution for intravitreal injection 6.6 mg in ...,S01LA05


In [10]:
#Now lets have a look at the structure of the file

dfItemCodes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11917 entries, 0 to 11916
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ITEM_CODE      11917 non-null  object
 1   DRUG_NAME      11917 non-null  object
 2   FORM/STRENGTH  11917 non-null  object
 3   ATC5_Code      11917 non-null  object
dtypes: object(4)
memory usage: 372.5+ KB


In [42]:
#Now let's try to create a new df with JUST the item codes of interest to us.

#create a list of drug names of interest
drugNames = ["TRIMETHOPRIM",
             "NITROFURANTOIN"
             ] #add "CEFALEXIN" to this list if you want cefalexin as a control

dfItemCodes = dfItemCodes[dfItemCodes["DRUG_NAME"].isin(drugNames)]

dfItemCodes

Unnamed: 0,ITEM_CODE,DRUG_NAME,FORM/STRENGTH,ATC5_Code
608,01691B,NITROFURANTOIN,"Oral suspension 25 mg per 5 mL, 200 mL",J01XE01
609,01692C,NITROFURANTOIN,Capsule 50 mg,J01XE01
610,01693D,NITROFURANTOIN,Capsule 100 mg,J01XE01
1418,02666H,TRIMETHOPRIM,Tablet 300 mg,J01EA01
1650,02922T,TRIMETHOPRIM,Tablet 300 mg,J01EA01
1653,02925Y,TRIMETHOPRIM,"Oral suspension 50 mg per 5 mL, 105 mL",J01EA01
7518,10785P,TRIMETHOPRIM,Tablet 300 mg,J01EA01
9393,12671X,NITROFURANTOIN,"Capsule 50 mg, USP",J01XE01


In [43]:
#we might ALSO want to filter by formulation!

# It might be easiest to get a list of formulations first...
# we will also sort the list of unique formulations in alphabetical order - much easier to read!

sorted(dfItemCodes["FORM/STRENGTH"].unique())

['Capsule 100 mg',
 'Capsule 50 mg',
 'Capsule 50 mg, USP',
 'Oral suspension 25 mg per 5 mL, 200 mL',
 'Oral suspension 50 mg per 5 mL, 105 mL',
 'Tablet 300 mg']

In [44]:
#create a list of drug names of interest
drugNames = ["TRIMETHOPRIM",
             "NITROFURANTOIN",
             "CEFALEXIN"] #cephalexin as a control


#create a list of formulations of interest
#let's say we are only interested in capsules and tablets (the other formulations are commented out with #)
formulations = ['Capsule 100 mg',
 'Capsule 250 mg (as monohydrate)',
 #'Capsule 50 mg',
 #'Capsule 50 mg, USP',
 'Capsule 500 mg (as monohydrate)',
 #'Granules for oral suspension 125 mg (as monohydrate) per 5 mL, 100 mL',
 #'Granules for oral suspension 250 mg (as monohydrate) per 5 mL, 100 mL',
 #'Granules for oral suspension 250 mg (as monohydrate) per 5 mL, 100 mL (s19A)',
 #'Oral suspension 25 mg per 5 mL, 200 mL',
 #'Oral suspension 50 mg per 5 mL, 105 mL',
 'Tablet 300 mg']


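#NOTE: this only displays the filtered rows - assign the result to a variable if you want to keep it
#also, dfItemCodes was already narrowed to trimethoprim and nitrofurantoin above, so no CEFALEXIN rows can appear here
dfItemCodes[dfItemCodes["DRUG_NAME"].isin(drugNames) & dfItemCodes["FORM/STRENGTH"].isin(formulations)]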

Unnamed: 0,ITEM_CODE,DRUG_NAME,FORM/STRENGTH,ATC5_Code
610,01693D,NITROFURANTOIN,Capsule 100 mg,J01XE01
1418,02666H,TRIMETHOPRIM,Tablet 300 mg,J01EA01
1650,02922T,TRIMETHOPRIM,Tablet 300 mg,J01EA01
7518,10785P,TRIMETHOPRIM,Tablet 300 mg,J01EA01
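
Typing out every formulation by hand can get tedious. As an alternative sketch (a different technique to the `isin` list used above), you can filter on the text itself with `.str.startswith`. The rows below are copied from the item-code table shown earlier, and naming each condition first keeps the combined filter readable.

```python
import pandas as pd

# A few rows copied from the item-code table shown earlier
df_items = pd.DataFrame({
    "ITEM_CODE": ["01691B", "01692C", "01693D", "02666H"],
    "DRUG_NAME": ["NITROFURANTOIN", "NITROFURANTOIN", "NITROFURANTOIN", "TRIMETHOPRIM"],
    "FORM/STRENGTH": [
        "Oral suspension 25 mg per 5 mL, 200 mL",
        "Capsule 50 mg",
        "Capsule 100 mg",
        "Tablet 300 mg",
    ],
})

# Each condition is a column of True/False values, one per row
is_drug = df_items["DRUG_NAME"].isin(["NITROFURANTOIN"])
is_capsule = df_items["FORM/STRENGTH"].str.startswith("Capsule")

# & means AND: keep rows where both conditions are True
df_capsules = df_items[is_drug & is_capsule]
print(df_capsules)
```

Text matching is case-sensitive, so check how the formulations are actually written (using the sorted list above) before relying on it.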


In [45]:
#now we want to create a FILTER using the item codes from our imported .csv file
#in this case it will be the ITEM_CODE column of the df we just created
PBSItems = dfItemCodes["ITEM_CODE"]

#now filter your large df_PBS and only get the rows of interest
#the code below looks into the df_PBS dataframe, and then looks through the column called "ITEM_CODE" and finds which rows match any of the item codes you have listed in the "PBSItems" list

df_PBSFiltered = df_PBS[df_PBS['ITEM_CODE'].isin(PBSItems)]

df_PBSFiltered.head()

Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt
1483,202107,01692C,J01XE01,NITROFURANTOIN,C0,Section 85,ABOVE CO-PAYMENT,2106,0.0,49343.55,9124.44,49343.55,0.0,DOS_FY2021_22,2021-07-01
1484,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,ABOVE CO-PAYMENT,4327,27990.6,72795.36,18599.03,100785.96,26697.17,DOS_FY2021_22,2021-07-01
1485,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,UNDER CO-PAYMENT,4,17.25,0.0,0.0,17.25,26.4,DOS_FY2021_22,2021-07-01
1486,202107,01692C,J01XE01,NITROFURANTOIN,G1,Section 85,ABOVE CO-PAYMENT,48,316.8,803.04,206.4,1119.84,306.8,DOS_FY2021_22,2021-07-01
1487,202107,01692C,J01XE01,NITROFURANTOIN,G2,Section 85,ABOVE CO-PAYMENT,39,326.8,695.99,189.2,1022.79,320.8,DOS_FY2021_22,2021-07-01


In [46]:
#let's have a look at the new df structure
df_PBSFiltered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2157 entries, 1483 to 1195966
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   MONTH_OF_SUPPLY      2157 non-null   int64         
 1   ITEM_CODE            2157 non-null   object        
 2   ATC5_CODE            2157 non-null   object        
 3   DRUG_NAME            2157 non-null   object        
 4   PTNT_CTGRY_DRVD_CD   2157 non-null   object        
 5   DRG_TYP_CTGRY        2157 non-null   object        
 6   SCRIPT_TYPE          2157 non-null   object        
 7   PRSCRPTN_CNT         2157 non-null   int64         
 8   PATIENT_CONTRIB      2157 non-null   float64       
 9   GOVT_CONTRIB         2157 non-null   float64       
 10  RETAIL_MARKUP        2157 non-null   float64       
 11  TOTAL_COST           2157 non-null   float64       
 12  PATIENT_NET_CONTRIB  2157 non-null   float64       
 13  sheet_name           2157 non-nu

# Mapping data from one df to another

One of the things you might be interested in is changes in formulation type, especially if looking at supply shortages. However, the PBS data does not have a column for this and it is only in the PBS drug map.

So below we will MAP the formulation column (FORM/STRENGTH) from the PBS drug map onto our PBS data, using the item code to match the rows.

In [47]:
#first create a dictionary for our mapping process
#see here for more details: https://www.geeksforgeeks.org/python/python-mapping-key-values-to-dictionary/

#this will have a key (item code) and value (formulation)

mapping = dict(zip(dfItemCodes["ITEM_CODE"], #our item codes of interest
                   dfItemCodes["FORM/STRENGTH"])) #their associated formulation

#have a look to see if it worked
mapping

{'01691B': 'Oral suspension 25 mg per 5 mL, 200 mL',
 '01692C': 'Capsule 50 mg',
 '01693D': 'Capsule 100 mg',
 '02666H': 'Tablet 300 mg',
 '02922T': 'Tablet 300 mg',
 '02925Y': 'Oral suspension 50 mg per 5 mL, 105 mL',
 '10785P': 'Tablet 300 mg',
 '12671X': 'Capsule 50 mg, USP'}

In [48]:
#now we need to add a column to our PBS data and map the formulation to that column based on what the item code is
#the basic approach is df['new_column'] = df['existing_column'].map(your_dict)

df_PBSFiltered["FORM/STRENGTH"] = df_PBSFiltered['ITEM_CODE'].map(mapping)

df_PBSFiltered.head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt,FORM/STRENGTH
1483,202107,01692C,J01XE01,NITROFURANTOIN,C0,Section 85,ABOVE CO-PAYMENT,2106,0.0,49343.55,9124.44,49343.55,0.0,DOS_FY2021_22,2021-07-01,Capsule 50 mg
1484,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,ABOVE CO-PAYMENT,4327,27990.6,72795.36,18599.03,100785.96,26697.17,DOS_FY2021_22,2021-07-01,Capsule 50 mg
1485,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,UNDER CO-PAYMENT,4,17.25,0.0,0.0,17.25,26.4,DOS_FY2021_22,2021-07-01,Capsule 50 mg
1486,202107,01692C,J01XE01,NITROFURANTOIN,G1,Section 85,ABOVE CO-PAYMENT,48,316.8,803.04,206.4,1119.84,306.8,DOS_FY2021_22,2021-07-01,Capsule 50 mg
1487,202107,01692C,J01XE01,NITROFURANTOIN,G2,Section 85,ABOVE CO-PAYMENT,39,326.8,695.99,189.2,1022.79,320.8,DOS_FY2021_22,2021-07-01,Capsule 50 mg
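
The message printed above ("A value is trying to be set on a copy of a slice from a DataFrame") is a warning from pandas, not an error: the new column was still added, as you can see in the table. It appears because `df_PBSFiltered` was created by filtering `df_PBS`, and pandas is unsure whether you meant to change the filtered rows or the original data. One common way to avoid it is to add `.copy()` when you create the filtered dataframe. A minimal sketch on made-up rows (the item code `99999X` and the counts are invented for illustration):

```python
import pandas as pd

# Made-up data: 99999X is an invented item code we are NOT interested in
df_all = pd.DataFrame({
    "ITEM_CODE": ["01692C", "02666H", "99999X"],
    "PRSCRPTN_CNT": [2106, 50, 7],
})
mapping_example = {"01692C": "Capsule 50 mg", "02666H": "Tablet 300 mg"}

# .copy() makes the filtered result its own independent dataframe,
# so adding a column to it is unambiguous and df_all is left untouched
df_filtered = df_all[df_all["ITEM_CODE"].isin(list(mapping_example))].copy()
df_filtered["FORM/STRENGTH"] = df_filtered["ITEM_CODE"].map(mapping_example)
print(df_filtered)
```

In this notebook, the equivalent change would be adding `.copy()` to the end of the line that creates `df_PBSFiltered`.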


## Save the data

If you like, you can copy the code below and run it in a cell to save the current data and do additional cleaning with Excel or [OpenRefine](https://openrefine.org/).



> df_PBSFiltered.to_csv('PBS_Merged.csv')



Otherwise, keep going with Python.

# Calculations

The final thing we need to do is add our population data and calculate the number of dispensings PER PERSON (or a similar metric, e.g. per 1000 people).

In [49]:
#first, let's load the population data we downloaded

#load the data - we need to tell the code where the .csv we downloaded is first
file_path = "/content/ABSpop.csv"

#import the file
#again you can call the df below whatever you want, but I thought this name might be intuitive here, just give it a sensible name.
dfPop = pd.read_csv(file_path, encoding='latin1')

#let's see which columns are available to us
dfPop.columns

Index(['STRUCTURE', 'STRUCTURE_ID', 'STRUCTURE_NAME', 'ACTION', 'ASGS_2011',
       'Region', 'SEX_ABS', 'Sex', 'AGE', 'Age', 'FERTILITY',
       'Fertility Assumption', 'MORTALITY', 'Mortality Assumption', 'NOM',
       'Net Overseas Migration', 'FREQUENCY', 'Frequency', 'TIME_PERIOD',
       'Time Period', 'OBS_VALUE', 'Observation Value', 'UNIT_MEASURE',
       'Unit of Measure', 'OBS_STATUS', 'Observation Status', 'OBS_COMMENT',
       'Observation Comment'],
      dtype='object')

In [50]:
#we don't really need all of these so let's just have a look at the relevant columns
dfPop = dfPop[['TIME_PERIOD', 'OBS_VALUE', 'UNIT_MEASURE']]

dfPop.head()

Unnamed: 0,TIME_PERIOD,OBS_VALUE,UNIT_MEASURE
0,2017,22612,PSNS
1,2018,23323,PSNS
2,2019,24377,PSNS
3,2020,24627,PSNS
4,2021,25227,PSNS
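
One thing to check before using a table like this as a lookup: the ABS file has columns for sex, age and several projection assumptions, and the first rows printed earlier were for a single age (93), so each year probably appears in many rows. A dictionary built with `dict(zip(...))` silently keeps only the LAST value it sees for each year. The sketch below uses made-up numbers and a hypothetical `"All ages"` label (check the real labels in your file) to show the problem and one way to guard against it.

```python
import pandas as pd

# Made-up rows: each year appears more than once (here, once per age group)
df_pop_example = pd.DataFrame({
    "Age": ["All ages", "93", "All ages", "93"],
    "TIME_PERIOD": [2021, 2021, 2022, 2022],
    "OBS_VALUE": [1000, 10, 1100, 11],
})

# dict(zip(...)) silently keeps the LAST value seen for each repeated year
naive_lookup = dict(zip(df_pop_example["TIME_PERIOD"], df_pop_example["OBS_VALUE"]))

# Filter to the series you want first, then confirm there is one row per year
df_total = df_pop_example[df_pop_example["Age"] == "All ages"]
assert df_total["TIME_PERIOD"].is_unique, "More than one population row per year"
pop_lookup = dict(zip(df_total["TIME_PERIOD"], df_total["OBS_VALUE"]))

print(naive_lookup)  # population of the last row for each year (age 93 here)
print(pop_lookup)    # population of the "All ages" rows
```

If the `is_unique` check fails on your real data, keep filtering (for example by region, sex and projection series) until only one row per year is left.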


In [51]:
#now we need to put the population data into our filtered PBS data by matching on year of supply!
#we can use our mapping skills to do this

#first let's create a column in our filtered PBS data to state the year of supply
df_PBSFiltered["YearSupply"] = df_PBSFiltered['MONTH_OF_SUPPLY_dt'].dt.year

#check it worked
df_PBSFiltered.head()




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt,FORM/STRENGTH,YearSupply
1483,202107,01692C,J01XE01,NITROFURANTOIN,C0,Section 85,ABOVE CO-PAYMENT,2106,0.0,49343.55,9124.44,49343.55,0.0,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021
1484,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,ABOVE CO-PAYMENT,4327,27990.6,72795.36,18599.03,100785.96,26697.17,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021
1485,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,UNDER CO-PAYMENT,4,17.25,0.0,0.0,17.25,26.4,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021
1486,202107,01692C,J01XE01,NITROFURANTOIN,G1,Section 85,ABOVE CO-PAYMENT,48,316.8,803.04,206.4,1119.84,306.8,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021
1487,202107,01692C,J01XE01,NITROFURANTOIN,G2,Section 85,ABOVE CO-PAYMENT,39,326.8,695.99,189.2,1022.79,320.8,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021


In [52]:
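#now let's create a dictionary from our population data (year -> population) and map it to our PBS data
#CAUTION: if a year appears in more than one row of dfPop, the dictionary keeps only the LAST value for that year,
#so check that dfPop has been filtered down to one row per year before trusting the Population column
mapping = dict(zip(dfPop["TIME_PERIOD"], #the year
                   dfPop["OBS_VALUE"])) #the population value for that year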

df_PBSFiltered["Population"] = df_PBSFiltered['YearSupply'].map(mapping)

df_PBSFiltered.head()




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt,FORM/STRENGTH,YearSupply,Population
1483,202107,01692C,J01XE01,NITROFURANTOIN,C0,Section 85,ABOVE CO-PAYMENT,2106,0.0,49343.55,9124.44,49343.55,0.0,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799
1484,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,ABOVE CO-PAYMENT,4327,27990.6,72795.36,18599.03,100785.96,26697.17,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799
1485,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,UNDER CO-PAYMENT,4,17.25,0.0,0.0,17.25,26.4,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799
1486,202107,01692C,J01XE01,NITROFURANTOIN,G1,Section 85,ABOVE CO-PAYMENT,48,316.8,803.04,206.4,1119.84,306.8,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799
1487,202107,01692C,J01XE01,NITROFURANTOIN,G2,Section 85,ABOVE CO-PAYMENT,39,326.8,695.99,189.2,1022.79,320.8,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799


In [53]:
#the final step is to calculate the total dispensing PER population
#let's create a new column to do this

#we can make it per 1000 people - but you can adjust this number as you see fit!

df_PBSFiltered["DispPerPop"] = df_PBSFiltered["PRSCRPTN_CNT"]/df_PBSFiltered["Population"]*1000

df_PBSFiltered.head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,MONTH_OF_SUPPLY,ITEM_CODE,ATC5_CODE,DRUG_NAME,PTNT_CTGRY_DRVD_CD,DRG_TYP_CTGRY,SCRIPT_TYPE,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,sheet_name,MONTH_OF_SUPPLY_dt,FORM/STRENGTH,YearSupply,Population,DispPerPop
1483,202107,01692C,J01XE01,NITROFURANTOIN,C0,Section 85,ABOVE CO-PAYMENT,2106,0.0,49343.55,9124.44,49343.55,0.0,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799,6.932215
1484,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,ABOVE CO-PAYMENT,4327,27990.6,72795.36,18599.03,100785.96,26697.17,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799,14.24297
1485,202107,01692C,J01XE01,NITROFURANTOIN,C1,Section 85,UNDER CO-PAYMENT,4,17.25,0.0,0.0,17.25,26.4,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799,0.013167
1486,202107,01692C,J01XE01,NITROFURANTOIN,G1,Section 85,ABOVE CO-PAYMENT,48,316.8,803.04,206.4,1119.84,306.8,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799,0.157999
1487,202107,01692C,J01XE01,NITROFURANTOIN,G2,Section 85,ABOVE CO-PAYMENT,39,326.8,695.99,189.2,1022.79,320.8,DOS_FY2021_22,2021-07-01,Capsule 50 mg,2021,303799,0.128374
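
Notice that the table above still has several rows for the same month and drug (one per patient category and script type), so `DispPerPop` is a rate for each ROW, not for each month. For a line graph you will usually want one value per month per drug. A minimal sketch of that extra step on made-up rows (all numbers are invented, and `DispPer1000` is just an illustrative column name): add up the counts first, then divide by the population.

```python
import pandas as pd

# Made-up rows in the same shape as df_PBSFiltered (several rows per month and drug)
df_rows = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": pd.to_datetime(["2021-07-01", "2021-07-01", "2021-08-01", "2021-07-01"]),
    "DRUG_NAME": ["NITROFURANTOIN", "NITROFURANTOIN", "NITROFURANTOIN", "TRIMETHOPRIM"],
    "PRSCRPTN_CNT": [100, 50, 120, 80],
    "Population": [10000, 10000, 10000, 10000],
})

# One row per month and drug: add up the counts, keep the (shared) population
df_monthly = (
    df_rows
    .groupby(["MONTH_OF_SUPPLY_dt", "DRUG_NAME"], as_index=False)
    .agg(PRSCRPTN_CNT=("PRSCRPTN_CNT", "sum"), Population=("Population", "first"))
)

# Rate per 1000 people for each month and drug
df_monthly["DispPer1000"] = df_monthly["PRSCRPTN_CNT"] * 1000 / df_monthly["Population"]
print(df_monthly)
```

Summing the counts before dividing gives the same answer as adding up the row-level rates within a month, but it leaves you with a much smaller dataframe that is ready to plot.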


# Data exploration

In [54]:
#you can use the describe function to generate some descriptive statistics for the numerical variables in your df

df_PBSFiltered.describe()

Unnamed: 0,MONTH_OF_SUPPLY,PRSCRPTN_CNT,PATIENT_CONTRIB,GOVT_CONTRIB,RETAIL_MARKUP,TOTAL_COST,PATIENT_NET_CONTRIB,MONTH_OF_SUPPLY_dt,YearSupply,Population,DispPerPop
count,2157.0,2157.0,2157.0,2157.0,2157.0,2157.0,2157.0,2157,2157.0,2157.0,2157.0
mean,202321.611497,2107.493741,21339.519824,14630.651191,9269.840079,35970.171015,20477.881748,2023-08-24 15:05:15.438108416,2023.146036,327748.643023,6.428746
min,202107.0,1.0,0.0,0.0,0.0,0.0,0.0,2021-07-01 00:00:00,2021.0,303799.0,0.002946
25%,202207.0,13.0,4.13,0.0,39.77,247.83,38.5,2022-07-01 00:00:00,2022.0,317883.0,0.040896
50%,202309.0,93.0,182.5,437.4,371.95,1986.4,315.18,2023-09-01 00:00:00,2023.0,331383.0,0.289695
75%,202410.0,817.0,4079.57,4584.95,3552.42,15022.14,5019.63,2024-10-01 00:00:00,2024.0,337108.0,2.459999
max,202511.0,39226.0,534928.65,303908.99,164258.4,534928.65,482065.83,2025-11-01 00:00:00,2025.0,339431.0,123.397602
std,130.727266,6206.225678,77303.673114,42012.243491,27179.060656,94449.691527,70374.152799,,1.314081,11941.297233,18.979729


In [55]:
# Get descriptive statistics for numeric columns
print("--- df.describe() ---")
print(df_PBSFiltered.describe())

# Get summary for all columns including categorical
print("\n--- df.describe(include='all') ---")
print(df_PBSFiltered.describe(include='all'))

# Get concise technical information
print("\n--- df.info() ---")
df_PBSFiltered.info()

# Get the number of missing values per column
print("\n--- Missing values count (df.isna().sum()) ---")
print(df_PBSFiltered.isna().sum())

--- df.describe() ---
       MONTH_OF_SUPPLY  PRSCRPTN_CNT  PATIENT_CONTRIB   GOVT_CONTRIB  \
count      2157.000000   2157.000000      2157.000000    2157.000000   
mean     202321.611497   2107.493741     21339.519824   14630.651191   
min      202107.000000      1.000000         0.000000       0.000000   
25%      202207.000000     13.000000         4.130000       0.000000   
50%      202309.000000     93.000000       182.500000     437.400000   
75%      202410.000000    817.000000      4079.570000    4584.950000   
max      202511.000000  39226.000000    534928.650000  303908.990000   
std         130.727266   6206.225678     77303.673114   42012.243491   

       RETAIL_MARKUP     TOTAL_COST  PATIENT_NET_CONTRIB  \
count    2157.000000    2157.000000          2157.000000   
mean     9269.840079   35970.171015         20477.881748   
min         0.000000       0.000000             0.000000   
25%        39.770000     247.830000            38.500000   
50%       371.950000    1986.

In [56]:
#if you want to look at what the unique values are in a column you can use the code structure dataFrame["my_column"].unique()
#for example if I want to check which item codes I have filtered to make sure my code above worked

df_PBSFiltered["ITEM_CODE"].unique()

array(['01692C', '01693D', '02666H', '02922T', '10785P', '12671X'],
      dtype=object)

In [57]:
#similar to unique(), you can also see how many of each category you have with dataFrame["my_column"].value_counts()
df_PBSFiltered["ITEM_CODE"].value_counts()

Unnamed: 0_level_0,count
ITEM_CODE,Unnamed: 1_level_1
02922T,479
01693D,413
01692C,409
10785P,404
02666H,400
12671X,52


In [58]:
#explore some columns here and add additional code chunks as needed

## Pivot tables

Pivot tables are a highly useful tool for summarising data. They can be built in [Excel](https://support.microsoft.com/en-us/office/overview-of-pivottables-and-pivotcharts-527c8fa3-02c0-445a-a2db-7794676bce96#:~:text=A%20PivotTable%20is%20an%20interactive,unanticipated%20questions%20about%20your%20data.), but pandas can create them too, and they are absolutely a favourite of mine when trying to understand my data.

Examples of how to create pivot tables with Python can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html).
**NOTE**: you need to scroll past the parameter documentation to get to the examples. However, the documentation gives you more detail on all the functionality available to you.
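
To see what each argument of `pd.pivot_table` does before applying it to the PBS data, here is a minimal sketch on a made-up table. The DataFrame `df_toy` and its column names are hypothetical and only for illustration.

```python
import pandas as pd

# Made-up numbers purely for illustration
df_toy = pd.DataFrame({
    "month": ["2021-07", "2021-07", "2021-07", "2021-08", "2021-08"],
    "drug": ["A", "A", "B", "A", "B"],
    "scripts": [10, 5, 20, 8, 22],
})

# One row per month, one column per drug, summing scripts within each group
toy_table = pd.pivot_table(
    df_toy,
    values="scripts",   # the column to aggregate
    index="month",      # becomes the rows
    columns="drug",     # becomes the columns
    aggfunc="sum",      # how to combine rows that fall in the same cell
)

print(toy_table)
```

The two rows for drug A in 2021-07 (10 and 5) are combined into a single cell, which is the "summarising" step a pivot table does for you.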


In [59]:
#Let's say we want to look at dispensing (scripts per 1000 people) each month, irrespective of script type etc

scriptCount_table = pd.pivot_table( df_PBSFiltered, #the dataframe
                                   values="DispPerPop", #which column to summarise - in this case scripts per 1000 people
                                    index=["MONTH_OF_SUPPLY_dt"], #group by column - in this case date!
                                    aggfunc="sum" #what you want to do
)

#show the first few rows
scriptCount_table.head()

Unnamed: 0_level_0,DispPerPop
MONTH_OF_SUPPLY_dt,Unnamed: 1_level_1
2021-07-01,296.027966
2021-08-01,285.98185
2021-09-01,271.956129
2021-10-01,273.368247
2021-11-01,284.803439


In [60]:
#Let's say we want to look at dispensing (scripts per 1000 people) each month, GROUPED by drug name

scriptCount_RxType_table = pd.pivot_table( df_PBSFiltered, #the dataframe
                                   values="DispPerPop", #which column to summarise - in this case scripts per 1000 people
                                    index=["MONTH_OF_SUPPLY_dt"], #group by column - in this case date!
                                    aggfunc="sum", #what you want to do
                                    columns = ["DRUG_NAME"] #one column per drug - swap this for another column (e.g. patient category) to group differently
)

#show the first few rows
scriptCount_RxType_table.head()

DRUG_NAME,NITROFURANTOIN,TRIMETHOPRIM
MONTH_OF_SUPPLY_dt,Unnamed: 1_level_1,Unnamed: 2_level_1
2021-07-01,48.55513,247.472836
2021-08-01,43.502447,242.479402
2021-09-01,41.3925,230.563629
2021-10-01,41.925747,231.4425
2021-11-01,40.539962,244.263477


In [61]:
#try your own pivot table here

## Adding columns to your pivot table

You might want to add a total column to your table to get a sense of total dispensing for a particular cluster of medications.

In [62]:
#add totals column
#note axis = 1 tells it to sum horizontally (across the columns in each row), whereas axis = 0 will sum things vertically (down each column)
scriptCount_RxType_table['TOTAL'] = scriptCount_RxType_table.sum(axis=1)

scriptCount_RxType_table.head()

DRUG_NAME,NITROFURANTOIN,TRIMETHOPRIM,TOTAL
MONTH_OF_SUPPLY_dt,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-07-01,48.55513,247.472836,296.027966
2021-08-01,43.502447,242.479402,285.98185
2021-09-01,41.3925,230.563629,271.956129
2021-10-01,41.925747,231.4425,273.368247
2021-11-01,40.539962,244.263477,284.803439


In [63]:
# First convert your table back to a df - it is just easier to do stuff with
# reset_index() turns the date index of the pivot table back into a regular column
df_scriptCount_RxType = scriptCount_RxType_table.reset_index()

df_scriptCount_RxType.head()

DRUG_NAME,MONTH_OF_SUPPLY_dt,NITROFURANTOIN,TRIMETHOPRIM,TOTAL
0,2021-07-01,48.55513,247.472836,296.027966
1,2021-08-01,43.502447,242.479402,285.98185
2,2021-09-01,41.3925,230.563629,271.956129
3,2021-10-01,41.925747,231.4425,273.368247
4,2021-11-01,40.539962,244.263477,284.803439


In [74]:
#let's also round the data to the nearest whole number
# we will also convert to integer type so we don't have any decimal places
# note: only the two drug columns are rounded here, the TOTAL column keeps its decimals
df_scriptCount_RxType[["TRIMETHOPRIM",
                       "NITROFURANTOIN"]] = df_scriptCount_RxType[ ["TRIMETHOPRIM",
                                                     "NITROFURANTOIN"
                                                     ]].round(0).astype(int)

df_scriptCount_RxType

DRUG_NAME,MONTH_OF_SUPPLY_dt,NITROFURANTOIN,TRIMETHOPRIM,TOTAL
0,2021-07-01,49,247,296.027966
1,2021-08-01,44,242,285.98185
2,2021-09-01,41,231,271.956129
3,2021-10-01,42,231,273.368247
4,2021-11-01,41,244,284.803439
5,2021-12-01,47,253,299.862738
6,2022-01-01,36,216,252.190271
7,2022-02-01,40,224,264.411749
8,2022-03-01,48,265,313.011391
9,2022-04-01,21,228,249.019293


# Wide to Long format

Sometimes we want to reshape our data from long to wide and vice versa.
Graphing tools often expect a LONG format.

Check out this webpage which explains what this means in more detail: https://towardsdatascience.com/reshaping-a-pandas-dataframe-long-to-wide-and-vice-versa-517c7f0995ad/


Going from long to wide is also known as pivoting, and going from wide to long as unpivoting. In pandas, unpivoting is called melting.

**WIDE FORMAT**

|Index| CAT-1| CAT-2 |  CAT-3|
|-----| ----| ---- | -----|
|Date-1| Num 1| Num 2 | Num 3|


**LONG FORMAT**

| Index  | Category | Value |
|--------|----------|-------|
| Date-1 | CAT-1    | Num 1 |
| Date-1 | CAT-2    | Num 2 |
| Date-1 | CAT-3    | Num 3 |
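
The two layouts above can be reproduced with a small made-up table. This is only a sketch: `df_wide`, `df_long` and `df_back` are hypothetical names, and the numbers are invented. It uses `melt` to go from wide to long and `pivot` to go back again.

```python
import pandas as pd

# Wide format: one row per date, one column per category (made-up values)
df_wide = pd.DataFrame({
    "Date": ["Date-1", "Date-2"],
    "CAT-1": [1, 4],
    "CAT-2": [2, 5],
    "CAT-3": [3, 6],
})

# Wide -> long: each (date, category) pair becomes its own row
df_long = df_wide.melt(id_vars="Date", var_name="Category", value_name="Value")

# Long -> wide: pivot reverses the melt
df_back = df_long.pivot(index="Date", columns="Category", values="Value").reset_index()

print(df_long)
print(df_back)
```

The wide table has 2 rows and 3 category columns, so the long table has 2 x 3 = 6 rows.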



In [75]:
# Convert to long format using melt

df_scriptCount_RxTypeLong = pd.melt(df_scriptCount_RxType, #which dataframe?
                                   id_vars=['MONTH_OF_SUPPLY_dt'], #what is the index?
                                   var_name='Antibiotic', #name for the new column holding the old column names
                                   value_name='Dispensing' #what are the numbers
                                   )

df_scriptCount_RxTypeLong.head()

Unnamed: 0,MONTH_OF_SUPPLY_dt,Antibiotic,Dispensing
0,2021-07-01,NITROFURANTOIN,49.0
1,2021-08-01,NITROFURANTOIN,44.0
2,2021-09-01,NITROFURANTOIN,41.0
3,2021-10-01,NITROFURANTOIN,42.0
4,2021-11-01,NITROFURANTOIN,41.0
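
Because `pd.melt` stacks every column that is not listed in `id_vars`, the `TOTAL` column added earlier also ends up as one of the "Antibiotic" categories in the long table, and so as its own line on a graph. If that is not what you want, one option is to list the columns to keep in `value_vars`. A minimal sketch, assuming a toy wide table (`df_wide_toy` is a hypothetical name; the values are the first two rounded rows from the output above):

```python
import pandas as pd

df_wide_toy = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": pd.to_datetime(["2021-07-01", "2021-08-01"]),
    "NITROFURANTOIN": [49, 44],
    "TRIMETHOPRIM": [247, 242],
    "TOTAL": [296, 286],
})

df_long_toy = pd.melt(
    df_wide_toy,
    id_vars=["MONTH_OF_SUPPLY_dt"],
    value_vars=["NITROFURANTOIN", "TRIMETHOPRIM"],  # TOTAL is left out on purpose
    var_name="Antibiotic",
    value_name="Dispensing",
)

print(df_long_toy["Antibiotic"].unique())
```

Keeping `TOTAL` in is also a valid choice if you want a total line on the graph; the point is to make that decision deliberately.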


## Check for missing dates

In [76]:
#earliest date
print("Earliest date:", df_scriptCount_RxTypeLong["MONTH_OF_SUPPLY_dt"].min())


#most recent date
print("Most recent date:", df_scriptCount_RxTypeLong["MONTH_OF_SUPPLY_dt"].max())




#create a series of month-start dates from the min (earliest) to the max (latest) date
date_range = pd.date_range(start=df_scriptCount_RxTypeLong["MONTH_OF_SUPPLY_dt"].min(), #earliest date
                           end=df_scriptCount_RxTypeLong["MONTH_OF_SUPPLY_dt"].max(), #latest date
                           freq='MS') #start of month

#find the difference between our dates and what should be there
date_range.difference(df_scriptCount_RxTypeLong["MONTH_OF_SUPPLY_dt"].unique())



Earliest date: 2021-07-01 00:00:00
Most recent date: 2025-11-01 00:00:00


DatetimeIndex([], dtype='datetime64[ns]', freq='MS')
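
An empty `DatetimeIndex` like the one above means no months are missing. To show what the same check returns when there IS a gap, here is a sketch on a toy table with one month deliberately left out (`df_gap`, `expected`, `missing` and `df_full` are hypothetical names). It also shows one possible way to make the gap visible as an empty row using `reindex`.

```python
import pandas as pd

# Toy monthly data with September 2021 deliberately left out
df_gap = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": pd.to_datetime(["2021-07-01", "2021-08-01", "2021-10-01"]),
    "Dispensing": [49, 44, 42],
})

# Every month that should be there, from the earliest to the latest date
expected = pd.date_range(
    start=df_gap["MONTH_OF_SUPPLY_dt"].min(),
    end=df_gap["MONTH_OF_SUPPLY_dt"].max(),
    freq="MS",  # month start
)

# Months that should be present but are not
missing = expected.difference(df_gap["MONTH_OF_SUPPLY_dt"].unique())
print(missing)

# Reindex on the full range so the missing month appears as a row with NaN
df_full = (
    df_gap.set_index("MONTH_OF_SUPPLY_dt")
          .reindex(expected)
          .rename_axis("MONTH_OF_SUPPLY_dt")
          .reset_index()
)
print(df_full)
```

Whether a missing month should stay empty or be treated as zero dispensing depends on what the data source says about missing records, so check the data documentation before filling anything in.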

## Save the data

You can load what you have done so far into [RawGraphs](https://app.rawgraphs.io/) and think about what else we might need to add or change to make a nice graph!

Otherwise keep going with Python!

In [77]:
# Save the DataFrame to a CSV file named '20260224PBS_AB_1.csv'
# Again, give your output a sensible name
# Here I have put the date of my analysis at the front in YYYYMMDD format so I know WHEN I did the analysis, followed by some kind of descriptor

df_scriptCount_RxType.to_csv('20260224PBS_AB_1.csv')
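
By default `to_csv` also writes the row index (0, 1, 2...) as an unnamed first column, which shows up as `Unnamed: 0` if the file is read back in. If you do not need it, you can pass `index=False`. The sketch below is an illustration only: the toy table, the file name `example_output.csv` and the temporary folder are all hypothetical, chosen so that nothing of yours gets overwritten.

```python
import os
import tempfile
import pandas as pd

df_toy = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": pd.to_datetime(["2021-07-01", "2021-08-01"]),
    "NITROFURANTOIN": [49, 44],
    "TRIMETHOPRIM": [247, 242],
})

# Hypothetical output location: a temporary folder, so nothing is overwritten
out_path = os.path.join(tempfile.mkdtemp(), "example_output.csv")

# index=False leaves out the 0, 1, 2... row numbers
df_toy.to_csv(out_path, index=False)

# Read the file back, asking pandas to parse the date column as dates
df_check = pd.read_csv(out_path, parse_dates=["MONTH_OF_SUPPLY_dt"])
print(df_check.dtypes)
```

Reading the file back like this is a quick way to confirm that what you saved is what you expected before loading it into another tool.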

In [78]:
#Alternatively, I used Google's Gemini AI tool to generate the following code for graphing

import plotly.express as px

fig = px.line(df_scriptCount_RxTypeLong,
              x="MONTH_OF_SUPPLY_dt",
              y="Dispensing",
              color="Antibiotic",
              title="Dispensing by Antibiotic Over Time")
fig.show()

In [79]:
from IPython.display import HTML, display

display(HTML("""
<div style="
  border: 8px solid red;
  padding: 40px;
  background-color: #ffcccc;
  color: red;
  font-weight: bold;
  font-size: 40px;
  text-align: center;
  border-radius: 12px;
  width: 100%;
  box-sizing: border-box;
">
  STOP AND THINK
</div>
"""))

1. What do I want to visualise?

2. Is the data in the correct format?

3. How do I need to transform the data?


# Adding Vertical and Horizontal Lines

This can be useful if you are trying to indicate a min threshold for something (horizontal line) or when something occurred (vertical line).


Some software lets you add these separately as x = a or y = b (e.g. R's ggplot function). However, with something like RawGraphs you might need to "hack" your way around things and embed what you need in the cleaned data itself.


## Horizontal Lines

Usually indicate some kind of threshold, e.g. the min or max amount of something required.


We want to have:

POINT ONE COORDINATES
- x1 = the FIRST date in our time series (the x axis)
- y1 = the threshold value

POINT TWO COORDINATES
- x2 = the LAST date in the time series
- y2 = the threshold value
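
The two points above can be built as a tiny DataFrame and stacked under the long-format data, so that a tool like RawGraphs plots them as one more series. This is a sketch under assumptions: the toy data, the threshold of 100 and the label `THRESHOLD` are all made up for illustration.

```python
import pandas as pd

# Toy long-format data (made-up values)
df_long_toy = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": pd.to_datetime(["2021-07-01", "2021-08-01", "2021-09-01"]),
    "Antibiotic": ["TRIMETHOPRIM"] * 3,
    "Dispensing": [247, 242, 231],
})

threshold = 100  # hypothetical threshold value

# Two points with the same y value: first date and last date of the series
df_hline = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": [df_long_toy["MONTH_OF_SUPPLY_dt"].min(),
                           df_long_toy["MONTH_OF_SUPPLY_dt"].max()],
    "Antibiotic": ["THRESHOLD", "THRESHOLD"],  # plotted as its own series
    "Dispensing": [threshold, threshold],
})

# Stack the threshold points under the real data
df_with_hline = pd.concat([df_long_toy, df_hline], ignore_index=True)
print(df_with_hline)
```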

## Vertical Lines

Usually indicate when something happened, e.g. a policy change, medication shortage or other event.


We want to have:

POINT ONE COORDINATES
- x1 = the date when the thing happened
- y1 = 0 (the bottom of the y axis)

POINT TWO COORDINATES
- x2 = the date when the thing happened
- y2 = the max value

NOTE: for some reason RawGraphs won't plot vertical lines, so your less-than-ideal workaround is to plot the two points and then draw the line on manually.
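
For a vertical line, both points share the same date and differ only in their y value. A minimal sketch, assuming toy data, a made-up event date and a hypothetical label `EVENT`:

```python
import pandas as pd

# Toy long-format data (made-up values)
df_long_toy = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": pd.to_datetime(["2021-07-01", "2021-08-01", "2021-09-01"]),
    "Antibiotic": ["TRIMETHOPRIM"] * 3,
    "Dispensing": [247, 242, 231],
})

event_date = pd.Timestamp("2021-08-01")  # hypothetical date of the event

# Same x for both points; y runs from 0 up to the largest value in the data
df_vline = pd.DataFrame({
    "MONTH_OF_SUPPLY_dt": [event_date, event_date],
    "Antibiotic": ["EVENT", "EVENT"],
    "Dispensing": [0, df_long_toy["Dispensing"].max()],
})

print(df_vline)
```

These two rows can be added to the cleaned data with `pd.concat` in the same way as any other extra rows.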

# Graphing

In [80]:
#Ok, let's plot it again

import plotly.express as px

fig = px.line(df_scriptCount_RxTypeLong,
              x="MONTH_OF_SUPPLY_dt",
              y="Dispensing",
              color="Antibiotic",
              title="Dispensing by Antibiotic Over Time with Intervention and Threshold")

#Let's add a horizontal line
fig.add_hline(
    y=100,  # <-- your value
    line_dash="dash",
    line_color="red",
    line_width=2,
    annotation_text="Threshold = 100", # keep this label in step with the y value above
    annotation_position="top right"
)


#add a vertical line
# Directly pass the date string 'YYYY-MM-DD' for better compatibility with Plotly's internal annotation handling
fig.add_vline(
    x="2023-01-01", #what date to draw the line
    line_color="blue", #the color
    line_width=2
)

# Add annotation for the vertical line separately
fig.add_annotation(
    x="2023-01-01", #location of the annotation at the intervention date (same date as the line)
    y=df_scriptCount_RxTypeLong["Dispensing"].max(), # Position the annotation at the top of the plot
    text="INTERVENTION", #what the label is
    showarrow=True, #makes a little arrow, you can remove by saying False
    arrowhead=1,
    yshift=10
)


fig.show()

If you are analysing PBS data, you can check whether the patterns in your visual are correct using this Shiny app:

Hall KA 2026, All PBS Dispensing Dashboard, Posit shinyapps.io, viewed (insert date viewed), https://kahall.shinyapps.io/all_dispensing.

# Finishing touches

Update the code above to make your graph more appealing,

e.g. change the colors, change the x and/or y labels, or change the background.

Have a look at the [Plotly](https://plotly.com/python/time-series/) package documentation to see what IS possible, and feel free to use Google's Gemini AI to help as well.

In [81]:
# Finishing touches - document what you have done and why



# Downloading your Notebook

You can download a PDF version of your notebook and add it as a supplementary file.

However BEFORE you do so:

1. Be sure that your code is commented so that anyone looking at it can understand what your group has done.

2. All of the text components should be updated and/or deleted so they make sense in the context of what your group has done. Remove any comments I have provided for teaching purposes unless they are needed.

3. Add your short summary (as required for your submission) at the TOP of your notebook.


To download a pdf version of your notebook:
Go to 'File' > 'Print' > 'Save as PDF' (in your browser's print dialog)

Ensure your notebook is clean and commented before downloading.