# MBS 2013-19, 2019-21 & 2021-22: Combine & Pre-Process Datasets

1. Estimated Resident Population (ERP) data is missing from the 2013-19 and 2019-21 datasets. The script first checks and cleans the respective SA3 ERP datasets and combines each with its corresponding MBS data.

2. The script then combines the three MBS datasets, pre-processes the data, and exports a single MBS dataset.

In [122]:
# import libraries
import pandas as pd
import numpy as np
import os

## Load and Combine 2013-19 MBS and ERP dataset

### Import MBS and SA3 ERP files

In [123]:
# import the transformed mbs file and assign to a dataframe

# setup path to original dataset
path = r"/Users/patel/Documents/CF-Data Anaylst Course/portfolio_projects/mbs_analysis/datasets/"

df_mbs_201319 = pd.read_csv(
    os.path.join(path, "clean_datasets/mbs_data/2013-19_phc_mbs.csv"),
    encoding="ISO-8859-1",
    index_col=[0],
)
df_mbs_201319.head(10)



Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($)
0,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),0-24,2576.0,5624.0,10879.0,17.27,33.41,838549.0,1026474.0
1,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),25-44,4004.0,7714.0,15870.0,24.75,50.93,1247656.0,1600846.0
2,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),45-64,4672.0,8998.0,15754.0,41.32,72.35,1017264.0,1197133.0
3,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),65+,5819.0,6397.0,12316.0,55.07,106.01,675946.0,761837.0
4,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),All persons,3892.0,28733.0,54818.0,29.59,56.45,3779415.0,4586290.0
5,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),Females,4902.0,17048.0,34198.0,34.9,70.01,2394671.0,2929743.0
6,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),Males,2869.0,11685.0,20620.0,24.21,42.72,1384744.0,1656547.0
7,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),0-24,2513.0,5594.0,8984.0,17.18,27.59,818149.0,1051718.0
8,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),25-44,6904.0,9500.0,19275.0,30.48,61.85,2151589.0,3173663.0
9,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),45-64,11869.0,8389.0,19105.0,38.53,87.74,2584397.0,3601039.0


In [124]:
df_mbs_201319.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 172308 entries, 0 to 172307
Data columns (total 16 columns):
 #   Column                                        Non-Null Count   Dtype 
---  ------                                        --------------   ----- 
 0   Year                                          172308 non-null  object
 1   StateTerritory                                172308 non-null  object
 2   GeographicUnit                                172308 non-null  object
 3   GeographicCode                                172308 non-null  object
 4   GeographicAreaName                            172308 non-null  object
 5   GeographicGroup                               172308 non-null  object
 6   ServiceLevel                                  172308 non-null  object
 7   Service                                       172308 non-null  object
 8   DemographicGroup                              172308 non-null  object
 9   Medicare benefits per 100 people ($)          172308 non-nu

#### Check Data Types

In [125]:
# checking columns for mixed types
mixed_type_columns = df_mbs_201319.applymap(type).nunique() > 1
mixed_type_columns

Year                                            False
StateTerritory                                  False
GeographicUnit                                  False
GeographicCode                                   True
GeographicAreaName                              False
GeographicGroup                                 False
ServiceLevel                                    False
Service                                         False
DemographicGroup                                False
Medicare benefits per 100 people ($)            False
No. of patients                                 False
No. of services                                 False
Percentage of people who had the service (%)    False
Services per 100 people                         False
Total Medicare benefits paid ($)                False
Total provider fees ($)                         False
dtype: bool

GeographicCode has mixed types; the other dimension columns are object. For consistency, all dimension columns are cast to string before joining with the SA3 and other MBS data.
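As a minimal sketch of the check above, the toy frame below (hypothetical data, not the MBS file) shows how a column holding both integers and strings gets flagged, and how casting to string removes the mix. The names `df_toy` and `dimension_cols` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame: GeographicCode mixes integers and strings,
# which can happen when a CSV column holds numeric and alphanumeric codes.
df_toy = pd.DataFrame(
    {
        "GeographicCode": [80101, "80102", "001NAT"],
        "Year": ["2013-14", "2013-14", "2013-14"],
    }
)

# Count the distinct Python types per column; more than one means mixed types.
mixed = df_toy.apply(lambda col: col.map(type).nunique() > 1)

# Cast the dimension columns to string so that join keys compare equal.
dimension_cols = ["GeographicCode", "Year"]  # illustrative subset
df_toy[dimension_cols] = df_toy[dimension_cols].astype("str")

# Re-run the same check after casting.
mixed_after = df_toy.apply(lambda col: col.map(type).nunique() > 1)
```

Casting matters for joins because the integer `80101` and the string `"80101"` are different keys and would not match in a merge.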

#### Enforcing Type on Dimension Values

In [126]:
# casting columns to string for consistency when joining with SA3
df_mbs_201319["Year"] = df_mbs_201319["Year"].astype("str")
df_mbs_201319["GeographicCode"] = df_mbs_201319["GeographicCode"].astype("str")
df_mbs_201319["DemographicGroup"] = df_mbs_201319["DemographicGroup"].astype("str")

df_mbs_201319["StateTerritory"] = df_mbs_201319["StateTerritory"].astype("str")
df_mbs_201319["GeographicAreaName"] = df_mbs_201319["GeographicAreaName"].astype("str")
df_mbs_201319["ServiceLevel"] = df_mbs_201319["ServiceLevel"].astype("str")
df_mbs_201319["Service"] = df_mbs_201319["Service"].astype("str")

#### Importing SA3 File & Data Type Checks

In [127]:
# import sa3 file containing estimated resident population for 2013-19

# import relevant columns for merging purposes
filter_cols = [
    "Year",
    "GeographicCode",
    "DemographicGroup",
    "EstimatedResidentPopulation",
]
df_sa3_erp_1319 = pd.read_csv(
    os.path.join(path, "original_datasets/mbs_data/phc-mbs-2013-2019/SA3_ERP_CSV.csv"),
    usecols=filter_cols,
    encoding="ISO-8859-1",
    index_col=None,
)
df_sa3_erp_1319.head(5)

Unnamed: 0,Year,GeographicCode,DemographicGroup,EstimatedResidentPopulation
0,2013-14,001NAT,0-14,4377926
1,2013-14,001NAT,0-24,7489910
2,2013-14,001NAT,0-64,19797751
3,2013-14,001NAT,15-24,3111984
4,2013-14,001NAT,25-44,6596790


In [128]:
# casting columns to string for consistency when joining with mbs
df_sa3_erp_1319["Year"] = df_sa3_erp_1319["Year"].astype("str")
df_sa3_erp_1319["GeographicCode"] = df_sa3_erp_1319["GeographicCode"].astype("str")
df_sa3_erp_1319["DemographicGroup"] = df_sa3_erp_1319["DemographicGroup"].astype("str")

### Compare SA3 geographical area list in mbs and sa3 datasets

In [129]:
# checking number of unique GeographicCode values. Expected 347
df_sa3_erp_1319.nunique()

Year                               6
GeographicCode                   347
DemographicGroup                  10
EstimatedResidentPopulation    17390
dtype: int64

347 unique SA3 values. The MBS data file is expected to have the same number or fewer.

In [130]:
df_mbs_201319.nunique()

Year                                                 6
StateTerritory                                      10
GeographicUnit                                       1
GeographicCode                                     346
GeographicAreaName                                 346
GeographicGroup                                      7
ServiceLevel                                         3
Service                                             53
DemographicGroup                                     7
Medicare benefits per 100 people ($)             29547
No. of patients                                  35793
No. of services                                  58679
Percentage of people who had the service (%)      9802
Services per 100 people                          33996
Total Medicare benefits paid ($)                137047
Total provider fees ($)                         138572
dtype: int64

In [131]:
# using outer merge to find any different GeographicCodes between mbs and sa3 dataset
diff_df = pd.merge(
    df_mbs_201319[["GeographicCode"]],
    df_sa3_erp_1319[["GeographicCode"]],
    how="outer",
    indicator=True,
).query('_merge != "both"')
diff_df["GeographicCode"].unique()

array(['001NAT'], dtype=object)

001NAT is in the SA3 ERP dataset but not in the MBS data. Checking the original file confirmed that worksheet SA3 has no GeographicCode 001NAT.
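The comparison can be sketched on toy data as follows. The frames `mbs_codes` and `erp_codes` are hypothetical stand-ins for the two GeographicCode columns; de-duplicating before the merge is an added assumption, not part of the cell above, and keeps the outer merge from multiplying repeated keys.

```python
import pandas as pd

# Hypothetical stand-ins for the GeographicCode columns of each dataset.
mbs_codes = pd.DataFrame({"GeographicCode": ["80101", "80101", "80103"]})
erp_codes = pd.DataFrame({"GeographicCode": ["80101", "80103", "001NAT"]})

# Outer merge on the unique codes; _merge records which side each code came from.
merged = pd.merge(
    mbs_codes.drop_duplicates(),
    erp_codes.drop_duplicates(),
    on="GeographicCode",
    how="outer",
    indicator=True,
)

# Keep only the codes that are missing from one of the two sides.
diff = merged[merged["_merge"] != "both"]
only_in_erp = diff.loc[diff["_merge"] == "right_only", "GeographicCode"].tolist()
only_in_mbs = diff.loc[diff["_merge"] == "left_only", "GeographicCode"].tolist()
```

Splitting the result by `left_only` and `right_only` shows which dataset is missing a code, which a plain list of differing codes does not.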

#### Check Estimated Resident Population Consistency

In [132]:
df_sa3_erp_1319["EstimatedResidentPopulation"].value_counts(dropna=False)
df_sa3_erp_1319["EstimatedResidentPopulation"].dtype

dtype('int64')

### Combine SA3 and MBS datasets

In [133]:
# left merge mbs data with SA3 to retrieve corresponding ERP
df_mbs_201319_sa3_combined = df_mbs_201319.merge(
    df_sa3_erp_1319,
    how="left",
    on=["Year", "GeographicCode", "DemographicGroup"],
    indicator=True,
)
df_mbs_201319_sa3_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 172308 entries, 0 to 172307
Data columns (total 18 columns):
 #   Column                                        Non-Null Count   Dtype   
---  ------                                        --------------   -----   
 0   Year                                          172308 non-null  object  
 1   StateTerritory                                172308 non-null  object  
 2   GeographicUnit                                172308 non-null  object  
 3   GeographicCode                                172308 non-null  object  
 4   GeographicAreaName                            172308 non-null  object  
 5   GeographicGroup                               172308 non-null  object  
 6   ServiceLevel                                  172308 non-null  object  
 7   Service                                       172308 non-null  object  
 8   DemographicGroup                              172308 non-null  object  
 9   Medicare benefits per 100 people ($) 

In [134]:
df_mbs_201319_sa3_combined["_merge"].value_counts()

both          172308
left_only          0
right_only         0
Name: _merge, dtype: int64
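The `_merge` counts confirm that every MBS row found a match. A complementary guard, sketched here on hypothetical toy frames, is the `validate` argument of `merge`: with `validate="many_to_one"` pandas raises a `MergeError` if the ERP side has duplicate keys, which would otherwise silently add rows. The frames `mbs` and `erp` and their values are illustrative assumptions.

```python
import pandas as pd

keys = ["Year", "GeographicCode", "DemographicGroup"]

# Hypothetical MBS rows: two services share the same area/year/demographic key.
mbs = pd.DataFrame(
    {
        "Year": ["2013-14", "2013-14"],
        "GeographicCode": ["80101", "80101"],
        "DemographicGroup": ["0-24", "0-24"],
        "Service": ["Allied Health attendances (total)", "Diagnostic Imaging (total)"],
    }
)

# Hypothetical ERP rows: exactly one population figure per key.
erp = pd.DataFrame(
    {
        "Year": ["2013-14"],
        "GeographicCode": ["80101"],
        "DemographicGroup": ["0-24"],
        "EstimatedResidentPopulation": [32558],
    }
)

# many_to_one: many MBS rows may share a key, but each key must be unique in erp.
combined = mbs.merge(
    erp, how="left", on=keys, validate="many_to_one", indicator=True
)

# A left merge against unique keys must preserve the MBS row count.
rows_preserved = len(combined) == len(mbs)
all_matched = bool((combined["_merge"] == "both").all())
```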

In [135]:
df_mbs_201319_sa3_combined.head()

Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),EstimatedResidentPopulation,_merge
0,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),0-24,2576.0,5624.0,10879.0,17.27,33.41,838549.0,1026474.0,32558,both
1,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),25-44,4004.0,7714.0,15870.0,24.75,50.93,1247656.0,1600846.0,31163,both
2,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),45-64,4672.0,8998.0,15754.0,41.32,72.35,1017264.0,1197133.0,21774,both
3,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),65+,5819.0,6397.0,12316.0,55.07,106.01,675946.0,761837.0,11617,both
4,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),All persons,3892.0,28733.0,54818.0,29.59,56.45,3779415.0,4586290.0,97112,both


In [136]:
# renaming columns for consistency and dropping the _merge column
df_mbs_201319_sa3_combined.rename(
    columns={"EstimatedResidentPopulation": "Estimated resident population"},
    inplace=True,
)
df_mbs_201319_sa3_combined.drop(["_merge"], axis=1, inplace=True)
df_mbs_201319_sa3_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 172308 entries, 0 to 172307
Data columns (total 17 columns):
 #   Column                                        Non-Null Count   Dtype 
---  ------                                        --------------   ----- 
 0   Year                                          172308 non-null  object
 1   StateTerritory                                172308 non-null  object
 2   GeographicUnit                                172308 non-null  object
 3   GeographicCode                                172308 non-null  object
 4   GeographicAreaName                            172308 non-null  object
 5   GeographicGroup                               172308 non-null  object
 6   ServiceLevel                                  172308 non-null  object
 7   Service                                       172308 non-null  object
 8   DemographicGroup                              172308 non-null  object
 9   Medicare benefits per 100 people ($)          172308 non-nu

## Load and Combine 2019-21 MBS and ERP dataset

### Import MBS and SA3 ERP files

In [137]:
# import the transformed mbs file and assign to a dataframe

df_mbs_201921 = pd.read_csv(
    os.path.join(path, "clean_datasets/mbs_data/2019-21_phc_mbs.csv"),
    encoding="ISO-8859-1",
    index_col=[0],
)
df_mbs_201921.head(10)

Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($)
0,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),0-24,3275.33,6314,14134,18.89,42.29,1094649,1612037
1,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),25-44,5036.77,8464,19537,27.12,62.6,1571825,2307710
2,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),45-64,5641.12,9429,18660,43.35,85.8,1226832,1605423
3,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),65+,7827.74,8973,19925,61.57,136.71,1140815,1421204
4,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),All persons,4986.75,33180,72256,32.87,71.58,5034121,6946375
5,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),F,6301.48,19402,44961,38.09,88.26,3210165,4451648
6,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),M,3647.4,13777,27294,27.55,54.58,1823956,2494728
7,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),0-24,2845.68,5308,8972,15.88,26.85,951054,1151880
8,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),25-44,8754.3,9702,21175,31.09,67.85,2731955,3711186
9,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),45-64,15762.76,8726,21008,40.12,96.6,3428086,4376262


In [138]:
df_mbs_201921.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57263 entries, 0 to 57262
Data columns (total 16 columns):
 #   Column                                        Non-Null Count  Dtype 
---  ------                                        --------------  ----- 
 0   Year                                          57263 non-null  object
 1   StateTerritory                                57263 non-null  object
 2   GeographicUnit                                57263 non-null  object
 3   GeographicCode                                57263 non-null  object
 4   GeographicAreaName                            57263 non-null  object
 5   GeographicGroup                               57263 non-null  object
 6   ServiceLevel                                  57263 non-null  object
 7   Service                                       57263 non-null  object
 8   DemographicGroup                              57263 non-null  object
 9   Medicare benefits per 100 people ($)          57263 non-null  object
 10

#### Check Data Types

In [139]:
# checking for mixed type columns
mixed_type_columns = df_mbs_201921.applymap(type).nunique() > 1
mixed_type_columns

Year                                            False
StateTerritory                                  False
GeographicUnit                                  False
GeographicCode                                  False
GeographicAreaName                              False
GeographicGroup                                 False
ServiceLevel                                    False
Service                                         False
DemographicGroup                                False
Medicare benefits per 100 people ($)            False
No. of patients                                 False
No. of services                                 False
Percentage of people who had the service (%)    False
Services per 100 people                         False
Total Medicare benefits paid ($)                False
Total provider fees ($)                         False
dtype: bool

#### Enforcing Type on Dimension Values

In [140]:
# casting columns to string for consistency when joining with SA3
df_mbs_201921["Year"] = df_mbs_201921["Year"].astype("str")
df_mbs_201921["GeographicCode"] = df_mbs_201921["GeographicCode"].astype("str")
df_mbs_201921["DemographicGroup"] = df_mbs_201921["DemographicGroup"].astype("str")
df_mbs_201921["StateTerritory"] = df_mbs_201921["StateTerritory"].astype("str")
df_mbs_201921["GeographicAreaName"] = df_mbs_201921["GeographicAreaName"].astype("str")
df_mbs_201921["ServiceLevel"] = df_mbs_201921["ServiceLevel"].astype("str")
df_mbs_201921["Service"] = df_mbs_201921["Service"].astype("str")

In [141]:
# import sa3 file containing estimated resident population for 2019-21

# import relevant columns for merging purposes
filter_cols = [
    "Year",
    "GeographicCode",
    "DemographicGroup",
    "EstimatedResidentPopulation",
]
df_sa3_erp_1921 = pd.read_csv(
    os.path.join(
        path, "original_datasets/mbs_data/phc-mbs-2019-2021/SA3_ERP_CSV_1920_2021.csv"
    ),
    usecols=filter_cols,
    encoding="ISO-8859-1",
    index_col=None,
)
df_sa3_erp_1921.tail(5)

Unnamed: 0,Year,GeographicCode,DemographicGroup,EstimatedResidentPopulation
4839,2020-21,90104,45-64,574
4840,2020-21,90104,65+,482
4841,2020-21,90104,All persons,1734
4842,2020-21,90104,F,901
4843,2020-21,90104,M,833


In [142]:
# casting columns to string for consistency when joining with mbs
df_sa3_erp_1921["Year"] = df_sa3_erp_1921["Year"].astype("str")
df_sa3_erp_1921["GeographicCode"] = df_sa3_erp_1921["GeographicCode"].astype("str")
df_sa3_erp_1921["DemographicGroup"] = df_sa3_erp_1921["DemographicGroup"].astype("str")

### Compare SA3 geographical area list in mbs and sa3 datasets

In [143]:
df_sa3_erp_1921.nunique()

Year                              2
GeographicCode                  346
DemographicGroup                  7
EstimatedResidentPopulation    4642
dtype: int64

346 unique SA3 values. Expected the same for mbs data file.

In [144]:
df_mbs_201921.nunique()

Year                                                2
StateTerritory                                     11
GeographicUnit                                      2
GeographicCode                                    346
GeographicAreaName                                346
GeographicGroup                                     8
ServiceLevel                                        3
Service                                            53
DemographicGroup                                    7
Medicare benefits per 100 people ($)            45311
No. of patients                                 21132
No. of services                                 29779
Percentage of people who had the service (%)     8618
Services per 100 people                         19058
Total Medicare benefits paid ($)                51273
Total provider fees ($)                         51548
dtype: int64

In [145]:
diff_df = pd.merge(
    df_mbs_201921[["GeographicCode"]],
    df_sa3_erp_1921[["GeographicCode"]],
    how="outer",
    indicator=True,
).query('_merge != "both"')
diff_df

Unnamed: 0,GeographicCode,_merge


#### Check Estimated Resident Population Consistency

In [146]:
df_sa3_erp_1921["EstimatedResidentPopulation"].value_counts(dropna=False)

.        4
6939     4
11365    3
6985     3
121      3
        ..
23469    1
20352    1
90960    1
47137    1
833      1
Name: EstimatedResidentPopulation, Length: 4642, dtype: int64

The SA3 dataset has 4 records whose population is suppressed with a full stop (.). These are converted to missing values so the column can be cast to the nullable Int64 type.

In [147]:
df_sa3_erp_1921["EstimatedResidentPopulation"] = pd.to_numeric(
    df_sa3_erp_1921["EstimatedResidentPopulation"], errors="coerce"
)

In [148]:
df_sa3_erp_1921["EstimatedResidentPopulation"] = df_sa3_erp_1921[
    "EstimatedResidentPopulation"
].astype("Int64")
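A small sketch of the same two-step conversion on a hypothetical series follows. It also lists which raw tokens were coerced to missing values, an added check (not in the cells above) to confirm that only the suppression marker was affected. The series `erp_raw` and its values are illustrative.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, with "." marking suppression.
erp_raw = pd.Series(["574", ".", "1734", "833"], name="EstimatedResidentPopulation")

# Step 1: anything that is not a number becomes NaN (dtype float64).
erp_num = pd.to_numeric(erp_raw, errors="coerce")

# Added check: which raw tokens were turned into NaN by the coercion?
coerced_tokens = erp_raw[erp_num.isna() & erp_raw.notna()].unique().tolist()

# Step 2: nullable integer dtype keeps whole numbers and represents gaps as <NA>.
erp_int = erp_num.astype("Int64")
```

Plain `int64` cannot hold missing values, which is why the nullable `Int64` extension dtype (capital I) is used here.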

In [149]:
df_sa3_erp_1921.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4844 entries, 0 to 4843
Data columns (total 4 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Year                         4844 non-null   object
 1   GeographicCode               4844 non-null   object
 2   DemographicGroup             4844 non-null   object
 3   EstimatedResidentPopulation  4840 non-null   Int64 
dtypes: Int64(1), object(3)
memory usage: 156.2+ KB


No differences in GeographicCodes between mbs and SA3 datasets

### Combine SA3 and MBS datasets

In [150]:
# left merge mbs data with SA3 to retrieve corresponding ERP
df_mbs_201921_sa3_combined = df_mbs_201921.merge(
    df_sa3_erp_1921,
    how="left",
    on=["Year", "GeographicCode", "DemographicGroup"],
    indicator=True,
)
df_mbs_201921_sa3_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57263 entries, 0 to 57262
Data columns (total 18 columns):
 #   Column                                        Non-Null Count  Dtype   
---  ------                                        --------------  -----   
 0   Year                                          57263 non-null  object  
 1   StateTerritory                                57263 non-null  object  
 2   GeographicUnit                                57263 non-null  object  
 3   GeographicCode                                57263 non-null  object  
 4   GeographicAreaName                            57263 non-null  object  
 5   GeographicGroup                               57263 non-null  object  
 6   ServiceLevel                                  57263 non-null  object  
 7   Service                                       57263 non-null  object  
 8   DemographicGroup                              57263 non-null  object  
 9   Medicare benefits per 100 people ($)          5726

In [151]:
df_mbs_201921_sa3_combined["_merge"].value_counts()

both          57263
left_only         0
right_only        0
Name: _merge, dtype: int64

In [152]:
df_mbs_201921_sa3_combined.head()

Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),EstimatedResidentPopulation,_merge
0,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),0-24,3275.33,6314,14134,18.89,42.29,1094649,1612037,33421,both
1,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),25-44,5036.77,8464,19537,27.12,62.6,1571825,2307710,31207,both
2,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),45-64,5641.12,9429,18660,43.35,85.8,1226832,1605423,21748,both
3,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),65+,7827.74,8973,19925,61.57,136.71,1140815,1421204,14574,both
4,2019-20,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),All persons,4986.75,33180,72256,32.87,71.58,5034121,6946375,100950,both


In [153]:
# renaming columns for consistency and dropping the _merge column
df_mbs_201921_sa3_combined.rename(
    columns={"EstimatedResidentPopulation": "Estimated resident population"},
    inplace=True,
)
df_mbs_201921_sa3_combined.drop(["_merge"], axis=1, inplace=True)
df_mbs_201921_sa3_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57263 entries, 0 to 57262
Data columns (total 17 columns):
 #   Column                                        Non-Null Count  Dtype 
---  ------                                        --------------  ----- 
 0   Year                                          57263 non-null  object
 1   StateTerritory                                57263 non-null  object
 2   GeographicUnit                                57263 non-null  object
 3   GeographicCode                                57263 non-null  object
 4   GeographicAreaName                            57263 non-null  object
 5   GeographicGroup                               57263 non-null  object
 6   ServiceLevel                                  57263 non-null  object
 7   Service                                       57263 non-null  object
 8   DemographicGroup                              57263 non-null  object
 9   Medicare benefits per 100 people ($)          57263 non-null  object
 10

## Combine MBS Datasets

### Import 2021-22 MBS file

Dataset already has corresponding Estimated Resident Population

In [154]:
# import the transformed mbs file and assign to a dataframe

df_mbs_202122 = pd.read_csv(
    os.path.join(path, "clean_datasets/mbs_data/2021-22_phc_mbs.csv"),
    encoding="ISO-8859-1",
    index_col=[0],
)
df_mbs_202122.head(10)

Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Estimated resident population,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($)
0,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),0-24,33439,4385.95,6371,16206,19.05%,48.46,"$1,466,617","$2,259,631"
1,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),25-44,34495,6177.09,8921,23460,25.86%,68.01,"$2,130,787","$3,248,640"
2,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),45-64,22397,6283.74,9848,19883,43.97%,88.78,"$1,407,369","$1,933,615"
3,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),65+,15541,7940.67,9794,21140,63.02%,136.03,"$1,234,059","$1,594,820"
4,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),All persons,105872,5892.81,34933,80690,33.00%,76.21,"$6,238,832","$9,036,706"
5,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),F,53393,7822.2,20755,51863,38.87%,97.13,"$4,176,508","$6,107,252"
6,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),M,52479,3929.81,14178,28827,27.02%,54.93,"$2,062,324","$2,929,454"
7,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),0-24,33439,2794.82,5074,8451,15.17%,25.27,"$934,559","$1,165,637"
8,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),25-44,34495,8519.92,9925,21552,28.77%,62.48,"$2,938,946","$4,179,261"
9,2021-22,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),45-64,22397,16671.21,8522,20770,38.05%,92.74,"$3,733,850","$4,775,628"


In [155]:
df_mbs_202122.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28964 entries, 0 to 28963
Data columns (total 17 columns):
 #   Column                                        Non-Null Count  Dtype 
---  ------                                        --------------  ----- 
 0   Year                                          28964 non-null  object
 1   StateTerritory                                28964 non-null  object
 2   GeographicUnit                                28964 non-null  object
 3   GeographicCode                                28964 non-null  object
 4   GeographicAreaName                            28964 non-null  object
 5   GeographicGroup                               28964 non-null  object
 6   ServiceLevel                                  28964 non-null  object
 7   Service                                       28964 non-null  object
 8   DemographicGroup                              28964 non-null  object
 9   Estimated resident population                 28964 non-null  object
 10

#### Enforcing Type on Dimension Values

In [156]:
# For consistency, converting the columns to string
df_mbs_202122["Year"] = df_mbs_202122["Year"].astype("str")
df_mbs_202122["GeographicCode"] = df_mbs_202122["GeographicCode"].astype("str")
df_mbs_202122["DemographicGroup"] = df_mbs_202122["DemographicGroup"].astype("str")
df_mbs_202122["StateTerritory"] = df_mbs_202122["StateTerritory"].astype("str")
df_mbs_202122["GeographicAreaName"] = df_mbs_202122["GeographicAreaName"].astype("str")
df_mbs_202122["ServiceLevel"] = df_mbs_202122["ServiceLevel"].astype("str")
df_mbs_202122["Service"] = df_mbs_202122["Service"].astype("str")

##### Check Estimated Resident Population Consistency

In [157]:
# 10 rows contain a full stop (.) instead of a population; these are set to np.nan in a later cell
df_mbs_202122["Estimated resident population"].value_counts(dropna=False)
df_diff = df_mbs_202122[
    df_mbs_202122["Estimated resident population"].str.contains(r"\.")
]
print(df_diff.shape)
df_diff

(10, 17)


Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Estimated resident population,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($)
6860,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,Allied Health attendances (total),45-64,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6861,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,Allied Health attendances (total),65+,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6867,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,Diagnostic Imaging (total),45-64,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6868,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,Diagnostic Imaging (total),65+,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6874,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,GP attendances (total),45-64,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6875,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,GP attendances (total),65+,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6881,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,Nursing and Aboriginal Health Workers (total),45-64,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6882,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,Nursing and Aboriginal Health Workers (total),65+,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6888,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,Specialist attendances (total),45-64,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.
6889,2021-22,NSW,SA3,10702,Illawarra Catchment Reserve,Ungrouped,Level 1,Specialist attendances (total),65+,.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.


The 2021-22 MBS dataset has 10 records whose Estimated resident population is suppressed with a full stop (.). These are converted to missing values so the column can be cast to the nullable Int64 type.

In [158]:
df_mbs_202122["Estimated resident population"] = pd.to_numeric(
    df_mbs_202122["Estimated resident population"], errors="coerce"
)

In [159]:
df_mbs_202122["Estimated resident population"] = (
    df_mbs_202122["Estimated resident population"].astype("float").astype("Int64")
)
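As a small illustration of the two cells above, on a toy Series rather than the real data: `pd.to_numeric` with `errors="coerce"` turns the "." placeholder into NaN, after which the column can be cast to pandas' nullable `Int64` dtype. The variable names here are made up for the example.

```python
import pandas as pd

# Toy stand-in for the "Estimated resident population" column
erp_raw = pd.Series(["57284", ".", "32558"])

# "." cannot be parsed as a number, so errors="coerce" turns it into NaN
erp_numeric = pd.to_numeric(erp_raw, errors="coerce")

# Nullable Int64 keeps whole numbers while still allowing missing values
erp_int = erp_numeric.astype("Int64")

print(erp_int)
print(erp_int.isna().sum())
```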

In [160]:
df_mbs_202122.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28964 entries, 0 to 28963
Data columns (total 17 columns):
 #   Column                                        Non-Null Count  Dtype 
---  ------                                        --------------  ----- 
 0   Year                                          28964 non-null  object
 1   StateTerritory                                28964 non-null  object
 2   GeographicUnit                                28964 non-null  object
 3   GeographicCode                                28964 non-null  object
 4   GeographicAreaName                            28964 non-null  object
 5   GeographicGroup                               28964 non-null  object
 6   ServiceLevel                                  28964 non-null  object
 7   Service                                       28964 non-null  object
 8   DemographicGroup                              28964 non-null  object
 9   Estimated resident population                 28954 non-null  Int64 
 10

### Combine MBS Datasets 2013-19, 2019-21 and 2021-22

Expecting 258550 rows of MBS data after vertically stacking the dataframes

In [161]:
# vertically stacking the dataframes to create a new dataset covering years 2013-22
df_mbs_combined_201322 = pd.concat(
    [df_mbs_201319_sa3_combined, df_mbs_201921_sa3_combined, df_mbs_202122],
    axis=0,
    ignore_index=True,
)
df_mbs_combined_201322.shape

(258535, 17)
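The stacked frame has 258535 rows, 15 fewer than the 258550 stated above. One way to confirm whether any rows were lost in the stack is to compare the result with the sum of the input lengths. A minimal sketch on toy frames (the frame names and values are placeholders, not the notebook's data):

```python
import pandas as pd

# Placeholder frames standing in for the three MBS datasets
part_a = pd.DataFrame({"Year": ["2013-14", "2014-15"], "value": [1, 2]})
part_b = pd.DataFrame({"Year": ["2019-20"], "value": [3]})
part_c = pd.DataFrame({"Year": ["2021-22", "2021-22"], "value": [4, 5]})

parts = [part_a, part_b, part_c]
expected_rows = sum(len(part) for part in parts)

combined = pd.concat(parts, axis=0, ignore_index=True)

# A vertical stack should neither drop nor duplicate rows
assert len(combined) == expected_rows
print(expected_rows, combined.shape)
```

If the check passes on the real frames, the difference lies in the expectation rather than in the concatenation.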

In [162]:
df_mbs_combined_201322.tail(5)

Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),Estimated resident population
258530,2021-22,WA,SA3,51104,Mid West,Remote (incl. very remote),Level 3,Physiotherapy,All persons,240.25,922,2485,1.61%,4.34,"$137,622","$154,790",57284
258531,2021-22,WA,SA3,51104,Mid West,Remote (incl. very remote),Level 3,Podiatry,All persons,469.51,1803,4856,3.15%,8.48,"$268,951","$310,628",57284
258532,2021-22,WA,SA3,51104,Mid West,Remote (incl. very remote),Level 3,Practice Nurse/Aboriginal Health Worker,All persons,233.88,5139,8167,8.97%,14.26,"$133,975","$133,983",57284
258533,2021-22,WA,SA3,51104,Mid West,Remote (incl. very remote),Level 3,Psychiatry,All persons,660.86,788,1757,1.38%,3.07,"$378,567","$567,161",57284
258534,2021-22,WA,SA3,51104,Mid West,Remote (incl. very remote),Level 3,Speech Pathology,All persons,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,n.p.,57284


In [163]:
df_mbs_combined_201322.head(5)

Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),Estimated resident population
0,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),0-24,2576.0,5624.0,10879.0,17.27,33.41,838549.0,1026474.0,32558
1,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),25-44,4004.0,7714.0,15870.0,24.75,50.93,1247656.0,1600846.0,31163
2,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),45-64,4672.0,8998.0,15754.0,41.32,72.35,1017264.0,1197133.0,21774
3,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),65+,5819.0,6397.0,12316.0,55.07,106.01,675946.0,761837.0,11617
4,2013-14,ACT,SA3,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),All persons,3892.0,28733.0,54818.0,29.59,56.45,3779415.0,4586290.0,97112


## Data PreProcessing

In [164]:
df_mbs_combined_201322.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258535 entries, 0 to 258534
Data columns (total 17 columns):
 #   Column                                        Non-Null Count   Dtype 
---  ------                                        --------------   ----- 
 0   Year                                          258535 non-null  object
 1   StateTerritory                                258535 non-null  object
 2   GeographicUnit                                258535 non-null  object
 3   GeographicCode                                258535 non-null  object
 4   GeographicAreaName                            258535 non-null  object
 5   GeographicGroup                               258535 non-null  object
 6   ServiceLevel                                  258535 non-null  object
 7   Service                                       258535 non-null  object
 8   DemographicGroup                              258535 non-null  object
 9   Medicare benefits per 100 people ($)          258535 non-nu

### Dimensions Values

#### Year

In [165]:
# checking for unique values of Year to identify any unusual or blank years
df_mbs_combined_201322["Year"].value_counts(dropna=False)

2021-22    28964
2013-14    28718
2014-15    28718
2015-16    28718
2016-17    28718
2017-18    28718
2018-19    28718
2020-21    28639
2019-20    28624
Name: Year, dtype: int64

Differences in the number of records per year relate to new SA3 levels being introduced.
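A quick way to check this explanation is to count the distinct SA3 codes reported in each year. A minimal sketch on invented rows, assuming the same `Year` and `GeographicCode` column names:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Year": ["2019-20", "2019-20", "2021-22", "2021-22", "2021-22"],
        "GeographicCode": ["80101", "80102", "80101", "80102", "80103"],
    }
)

# Number of distinct geographic codes reported in each year
codes_per_year = toy.groupby("Year")["GeographicCode"].nunique()
print(codes_per_year)
```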

#### State Territory

In [166]:
df_mbs_combined_201322["StateTerritory"].value_counts(dropna=False)

NSW                  68503
Qld                  61334
Vic                  49367
WA                   25414
SA                   20941
Tas                  11210
ACT                   7480
NT                    6721
National-SA3Group     4488
Other Territories     2827
National               250
Name: StateTerritory, dtype: int64

No blanks found. Missing values were identified and pre-populated with 'National-SA3Group' before pivoting the tables.

#### Geographic Unit

In [167]:
# checking whether GeographicUnit values are valid and relevant
df_mbs_combined_201322["GeographicUnit"].value_counts(dropna=False)

SA3    258285
NAT       250
Name: GeographicUnit, dtype: int64

In [168]:
# investigating 'NAT' to check if GeographicUnit can be dropped
df_mbs_combined_201322[["GeographicUnit", "GeographicCode"]].value_counts(dropna=False)

GeographicUnit  GeographicCode
SA3             30302             748
                31901             748
                31202             748
                31201             748
                31106             748
                                 ... 
                90101             691
                10803             691
                90102             668
                12402             498
NAT             001NAT            250
Length: 347, dtype: int64

The GeographicUnit column is not required. National figures can be identified using GeographicCode = 001NAT or StateTerritory = National, so the column will be dropped.
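The redundancy can be checked before dropping the column: if every `GeographicCode` maps to exactly one `GeographicUnit`, the unit carries no extra information. A sketch on toy rows, using codes that appear in the outputs above:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "GeographicUnit": ["SA3", "SA3", "SA3", "NAT"],
        "GeographicCode": ["80101", "80101", "51104", "001NAT"],
    }
)

# Count how many distinct units each code is associated with
units_per_code = toy.groupby("GeographicCode")["GeographicUnit"].nunique()

# True when the unit is fully determined by the code
unit_is_redundant = bool((units_per_code == 1).all())
print(unit_is_redundant)
```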

#### Geographic Area Name

In [169]:
# to view the full list of area names to ensure there are no blanks
geographicAreaName_list = df_mbs_combined_201322["GeographicAreaName"].value_counts(
    dropna=False
)
geographicAreaName_list

Belconnen                     748
Brighton                      748
Limestone Coast               748
Fleurieu - Kangaroo Island    748
Outback - North and East      748
                             ... 
Lord Howe Island              691
Christmas Island              691
Cocos (Keeling) Islands       668
Blue Mountains - South        498
National                      250
Name: GeographicAreaName, Length: 347, dtype: int64

No missing values found for GeographicAreaName

#### Geographic Group

In [170]:
df_mbs_combined_201322["GeographicGroup"].value_counts(dropna=False)

Major cities - medium SES     70305
Inner regional                62083
Major cities - higher SES     37399
Major cities - lower SES      36652
Outer regional                32897
Remote (incl. very remote)    17703
Ungrouped                      1246
National                        250
Name: GeographicGroup, dtype: int64

In [171]:
# Investigating 'ungrouped' areas.
gg_ungrouped = df_mbs_combined_201322[
    df_mbs_combined_201322["GeographicGroup"] == "Ungrouped"
]
gg_ungrouped[["GeographicAreaName", "Year"]].value_counts(dropna=False)

GeographicAreaName           Year   
Illawarra Catchment Reserve  2021-22    84
Blue Mountains - South       2013-14    83
                             2014-15    83
                             2015-16    83
                             2016-17    83
                             2017-18    83
                             2018-19    83
Illawarra Catchment Reserve  2013-14    83
                             2014-15    83
                             2015-16    83
                             2016-17    83
                             2017-18    83
                             2018-19    83
                             2019-20    83
                             2020-21    83
dtype: int64

No missing values in GeographicGroup. There are 2 geographic areas that are not assigned an SA3 group: 'Illawarra Catchment Reserve' and 'Blue Mountains - South'. These areas have fewer than 10 people and do not fall under any of the SA3 group definitions.

#### ServiceLevel

In [172]:
df_mbs_combined_201322["ServiceLevel"].value_counts(dropna=False)

Level 3    124642
Level 1    108990
Level 2     24903
Name: ServiceLevel, dtype: int64

No missing ServiceLevel values. Level values are as expected.

#### Service

In [173]:
df_mbs_combined_201322["Service"].value_counts(dropna=False)

Allied Health attendances (total)                                21798
GP attendances (total)                                           21798
Nursing and Aboriginal Health Workers (total)                    21798
Specialist attendances (total)                                   21798
Diagnostic Imaging (total)                                       21798
GP After-hours (non-urgent)                                       3114
Practice Nurse/Aboriginal Health Worker                           3114
Other Non-referred Medical Practitioner attendances               3114
GP Standard (Level B)                                             3114
GP Short (Level A)                                                3114
GP Prolonged (Level D)                                            3114
GP Mental Health                                                  3114
GP Long (Level C)                                                 3114
GP Chronic Disease Management Plan                                3114
GP sub

No missing values in Service

#### DemographicGroup

In [174]:
df_mbs_combined_201322["DemographicGroup"].value_counts(dropna=False)

All persons    165115
0-24            15570
25-44           15570
45-64           15570
65+             15570
Females         10380
Males           10380
F                5190
M                5190
Name: DemographicGroup, dtype: int64

No missing DemographicGroup values. The 2013-19 dataset has gender as 'Females' and 'Males', while the 2019-21 and 2021-22 datasets use 'F' and 'M'. In the standardization section, the gender abbreviations are converted to full names for analysis.

### Measure Values

Measure values contain Not Published data. To handle these, a function is created to set them to NaN so that pandas functions recognise them as unavailable.

In [175]:
# helper to convert not-published values (n.p or n.p.) to NaN for analysis.
# Not-published values are suppressed or blank values.
def set_np_to_NaN(dataframe, column_name):
    # flag every value that starts with "n.p"; non-string values count as False
    is_not_published = dataframe[column_name].str.startswith("n.p", na=False)

    # replace the flagged values with np.nan in a single step
    dataframe.loc[is_not_published, column_name] = np.nan
    return dataframe

#### Medicare benefits per 100 people

In [176]:
df_mbs_combined_201322["Medicare benefits per 100 people ($)"].value_counts(
    dropna=False
)

n.p             18362
n.p.             3545
n.p.             2327
10.0              831
14.0              831
                ...  
12066.44            1
13151.04            1
10850.78            1
3270.35             1
660.86              1
Name: Medicare benefits per 100 people ($), Length: 94833, dtype: int64

There is a total of 24,234 (9.37%) not published or blank values. To standardize them, the different variations of n.p will be set to NaN so that pandas functions detect and skip them.
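The count and share quoted above can be reproduced by building one boolean mask over all spellings of the marker. A minimal sketch on a toy Series; it strips whitespace first so that padded variants such as `'n.p. '` are also caught, which is an assumption about how the two `n.p.` entries in the output differ:

```python
import pandas as pd

toy = pd.Series(["n.p", "n.p.", "n.p. ", "10.0", "660.86", "n.p", "14.0", "3270.35"])

# Strip padding, then flag every value that starts with the marker
is_not_published = toy.str.strip().str.startswith("n.p", na=False)

n_missing = int(is_not_published.sum())
share_missing = round(100 * n_missing / len(toy), 2)
print(n_missing, share_missing)
```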

In [177]:
# investigating what years / locations are affected by n.p values and if consistent year on year.
df_mbs_combined_201322["Medicare benefits per 100 people ($)"] = df_mbs_combined_201322[
    "Medicare benefits per 100 people ($)"
].astype("str")
np_values = df_mbs_combined_201322[
    df_mbs_combined_201322["Medicare benefits per 100 people ($)"].str.startswith("n.p")
]

# dataframe counting np rows by year and geographic areas
year_gg_np_values = np_values[["Year", "GeographicAreaName"]].value_counts(dropna=False)

In [178]:
# replacing the n.p with NaN.
df_mbs_combined_201322 = set_np_to_NaN(
    df_mbs_combined_201322, "Medicare benefits per 100 people ($)"
)
df_mbs_combined_201322["Medicare benefits per 100 people ($)"].value_counts(
    dropna=False
)

NaN         24234
10.0          831
14.0          831
16.0          809
15.0          807
            ...  
12066.44        1
13151.04        1
10850.78        1
3270.35         1
660.86          1
Name: Medicare benefits per 100 people ($), Length: 94831, dtype: int64

In [179]:
# casting the value to float. Stripping whitespace and removing commas before casting
df_mbs_combined_201322["Medicare benefits per 100 people ($)"] = (
    df_mbs_combined_201322["Medicare benefits per 100 people ($)"]
    .str.strip()
    .str.replace(",", "")
    .astype("float")
)

In [180]:
df_mbs_combined_201322["Medicare benefits per 100 people ($)"].dtype

dtype('float64')

In [181]:
df_mbs_combined_201322["Medicare benefits per 100 people ($)"].describe()

count    234301.000000
mean       6546.570367
std       11229.591164
min           0.000000
25%          97.000000
50%        1146.110000
75%        7662.810000
max      110280.550000
Name: Medicare benefits per 100 people ($), dtype: float64

In [182]:
df_mbs_combined_201322["Medicare benefits per 100 people ($)"].value_counts(
    dropna=False
)

NaN        24234
10.00        833
14.00        832
16.00        813
15.00        810
           ...  
5849.47        1
1139.59        1
876.39         1
311.01         1
660.86         1
Name: Medicare benefits per 100 people ($), Length: 94134, dtype: int64

#### No. of patients

In [183]:
df_mbs_combined_201322["No. of patients"].value_counts(dropna=False)

n.p             18362
n.p.             3545
n.p.             2327
0.0               331
21.0              250
                ...  
25929.0             1
24667.0             1
22864.0             1
36211.0             1
       5,139        1
Name: No. of patients, Length: 69679, dtype: int64

There is a total of 24,234 (9.37%) not published or blank values. To standardize them, the different variations of n.p will be set to NaN so that pandas functions detect and skip them.

In [184]:
# replacing the n.p with NaN values first.
df_mbs_combined_201322 = set_np_to_NaN(df_mbs_combined_201322, "No. of patients")
df_mbs_combined_201322["No. of patients"].value_counts(dropna=False)

NaN             24234
0.0               331
21.0              250
24.0              248
26.0              237
                ...  
12798.0             1
25929.0             1
24667.0             1
22864.0             1
       5,139        1
Name: No. of patients, Length: 69677, dtype: int64

In [185]:
# casting the value to float. Stripping whitespace and removing commas before casting
df_mbs_combined_201322["No. of patients"] = (
    df_mbs_combined_201322["No. of patients"]
    .str.strip()
    .str.replace(",", "")
    .astype("float")
)

In [186]:
df_mbs_combined_201322["No. of patients"].describe()

count    2.343010e+05
mean     2.172532e+04
std      2.288474e+05
min      0.000000e+00
25%      5.210000e+02
50%      3.290000e+03
75%      1.110100e+04
max      2.309965e+07
Name: No. of patients, dtype: float64

In [187]:
# casting to a nullable integer, as patient counts are whole numbers
df_mbs_combined_201322["No. of patients"] = df_mbs_combined_201322[
    "No. of patients"
].astype("Int32")
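Before narrowing to `Int32`, it is worth confirming that the largest value fits: the maximum reported above (about 2.31e7) is well below the 32-bit limit of 2,147,483,647. A sketch of that check on toy values, using NumPy's `iinfo` for the bounds:

```python
import numpy as np
import pandas as pd

# Toy float column with a missing value (numbers are illustrative)
patients = pd.Series([5624.0, np.nan, 23099650.0])

int32_info = np.iinfo(np.int32)

# Only cast when every non-missing value is inside the Int32 range
if patients.max() <= int32_info.max and patients.min() >= int32_info.min:
    patients = patients.astype("Int32")

print(patients.dtype)
```

The nullable `Int32` dtype keeps the missing entry as `<NA>` instead of forcing the column back to float.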

In [188]:
df_mbs_combined_201322.dtypes

Year                                             object
StateTerritory                                   object
GeographicUnit                                   object
GeographicCode                                   object
GeographicAreaName                               object
GeographicGroup                                  object
ServiceLevel                                     object
Service                                          object
DemographicGroup                                 object
Medicare benefits per 100 people ($)            float64
No. of patients                                   Int32
No. of services                                  object
Percentage of people who had the service (%)     object
Services per 100 people                          object
Total Medicare benefits paid ($)                 object
Total provider fees ($)                          object
Estimated resident population                     Int64
dtype: object

#### No. of services

In [189]:
df_mbs_combined_201322["No. of services"].value_counts(dropna=False)

n.p             18362
n.p.             3545
n.p.             2327
0.0               331
21.0              169
                ...  
103721.0            1
25973.0             1
36077.0             1
14880.0             1
       8,167        1
Name: No. of services, Length: 105043, dtype: int64

There is a total of 24,234 (9.37%) not published or blank values. To standardize them, the different variations of n.p will be set to NaN so that pandas functions detect and skip them.

In [190]:
df_mbs_combined_201322 = set_np_to_NaN(df_mbs_combined_201322, "No. of services")
df_mbs_combined_201322["No. of services"].value_counts(dropna=False)

NaN             24234
0.0               331
21.0              169
28.0              167
26.0              163
                ...  
103721.0            1
25973.0             1
36077.0             1
14880.0             1
       8,167        1
Name: No. of services, Length: 105041, dtype: int64

In [191]:
# casting the value to float. Removing spaces and commas first
df_mbs_combined_201322["No. of services"] = (
    df_mbs_combined_201322["No. of services"]
    .str.strip()
    .str.replace(",", "")
    .astype("float")
)

In [192]:
df_mbs_combined_201322["No. of services"].describe()

count    2.343010e+05
mean     9.341582e+04
std      1.353283e+06
min     -1.020000e+02
25%      1.063000e+03
50%      7.721000e+03
75%      3.079400e+04
max      1.886940e+08
Name: No. of services, dtype: float64

In [193]:
# casting it to be integer as decimal points are not required to represent services.
df_mbs_combined_201322["No. of services"] = df_mbs_combined_201322[
    "No. of services"
].astype("Int64")

In [194]:
df_mbs_combined_201322.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258535 entries, 0 to 258534
Data columns (total 17 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   Year                                          258535 non-null  object 
 1   StateTerritory                                258535 non-null  object 
 2   GeographicUnit                                258535 non-null  object 
 3   GeographicCode                                258535 non-null  object 
 4   GeographicAreaName                            258535 non-null  object 
 5   GeographicGroup                               258535 non-null  object 
 6   ServiceLevel                                  258535 non-null  object 
 7   Service                                       258535 non-null  object 
 8   DemographicGroup                              258535 non-null  object 
 9   Medicare benefits per 100 people ($)          23

#### Percentage of people who had the service

In [195]:
df_mbs_combined_201322["Percentage of people who had the service (%)"].value_counts(
    dropna=False
)

n.p             18362
n.p.             3545
n.p.             2327
0.04             1702
0.06             1679
                ...  
     63.44%         1
     95.57%         1
     86.29%         1
     92.19%         1
     72.62%         1
Name: Percentage of people who had the service (%), Length: 16953, dtype: int64

In [196]:
df_mbs_combined_201322 = set_np_to_NaN(
    df_mbs_combined_201322, "Percentage of people who had the service (%)"
)
df_mbs_combined_201322["Percentage of people who had the service (%)"].value_counts(
    dropna=False
)

NaN             24234
0.04             1702
0.06             1679
0.05             1651
0.03             1549
                ...  
     42.08%         1
     61.97%         1
     16.93%         1
     95.74%         1
     72.62%         1
Name: Percentage of people who had the service (%), Length: 16951, dtype: int64

In [197]:
# casting the value to float. Removing spaces and % first
df_mbs_combined_201322["Percentage of people who had the service (%)"] = (
    df_mbs_combined_201322["Percentage of people who had the service (%)"]
    .str.strip()
    .str.replace("%", "")
    .astype("float")
)

In [198]:
df_mbs_combined_201322["Percentage of people who had the service (%)"].describe()

count    234301.000000
mean         23.257384
std          28.371166
min           0.000000
25%           1.110000
50%           8.990000
75%          35.700000
max         100.000000
Name: Percentage of people who had the service (%), dtype: float64

#### Services per 100 people

In [199]:
df_mbs_combined_201322["Services per 100 people"].value_counts(dropna=False)

n.p             18362
n.p.             3545
n.p.             2327
0.06             1152
0.04             1135
                ...  
481.51              1
57.61               1
334.92              1
484.65              1
471.04              1
Name: Services per 100 people, Length: 43007, dtype: int64

In [200]:
df_mbs_combined_201322 = set_np_to_NaN(
    df_mbs_combined_201322, "Services per 100 people"
)
df_mbs_combined_201322["Services per 100 people"].value_counts(dropna=False)

NaN       24234
0.06       1152
0.04       1135
0.05       1084
0.07       1033
          ...  
264.68        1
387.41        1
510.33        1
449.4         1
471.04        1
Name: Services per 100 people, Length: 43005, dtype: int64

In [201]:
# casting the value to float. Removing spaces and commas first
df_mbs_combined_201322["Services per 100 people"] = (
    df_mbs_combined_201322["Services per 100 people"]
    .str.strip()
    .str.replace(",", "")
    .astype("float")
)

In [202]:
df_mbs_combined_201322["Services per 100 people"].dtype

dtype('float64')

In [203]:
df_mbs_combined_201322["Services per 100 people"].describe()

count    234301.00000
mean        104.48140
std         208.22682
min          -0.34000
25%           2.17000
50%          19.43000
75%          91.07000
max        2041.20000
Name: Services per 100 people, dtype: float64

In [204]:
df_mbs_combined_201322[df_mbs_combined_201322["Services per 100 people"] < 0]

Unnamed: 0,Year,StateTerritory,GeographicUnit,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),Estimated resident population
159089,2018-19,Qld,SA3,31502,Outback - North,Remote (incl. very remote),Level 3,GP Multidisciplinary Case Conference,All persons,5.0,183,-102,0.61,-0.34,1548.0,1586.0,30139
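The row above also has a negative `No. of services` (-102), which matches the negative minimum seen earlier for that column. A minimal sketch for counting negative values across several measure columns at once, on toy data; how such rows should be treated is left open here:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "No. of patients": [183, 922, 1803],
        "No. of services": [-102, 2485, 4856],
        "Services per 100 people": [-0.34, 4.34, 8.48],
    }
)

# Count negative entries per column; missing values would compare as False
negative_counts = (toy < 0).sum()
print(negative_counts[negative_counts > 0])
```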


#### Total Medicare benefits paid

In [205]:
df_mbs_combined_201322["Total Medicare benefits paid ($)"].value_counts(dropna=False)

n.p                 18362
n.p.                 3545
n.p.                 2327
0.0                   331
6471.0                  8
                    ...  
510815.0                1
1246134.0               1
167406.0                1
202074.0                1
        $378,567        1
Name: Total Medicare benefits paid ($), Length: 214386, dtype: int64

In [206]:
df_mbs_combined_201322 = set_np_to_NaN(
    df_mbs_combined_201322, "Total Medicare benefits paid ($)"
)
df_mbs_combined_201322["Total Medicare benefits paid ($)"].value_counts(dropna=False)

NaN                 24234
0.0                   331
6471.0                  8
8569.0                  7
2480.0                  7
                    ...  
43812.0                 1
510815.0                1
1246134.0               1
167406.0                1
        $378,567        1
Name: Total Medicare benefits paid ($), Length: 214384, dtype: int64

In [207]:
# casting the value to float. Removing spaces, commas (,) and $ before casting
df_mbs_combined_201322["Total Medicare benefits paid ($)"] = (
    df_mbs_combined_201322["Total Medicare benefits paid ($)"]
    .str.strip()
    .str.replace("[$,]", "", regex=True)
    .astype("float")
)
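To illustrate what the cleaning step does to the 2021-22 currency strings, here is the same kind of chain applied to a toy Series. The character class `[$,]` removes both dollar signs and thousands separators in one pass. The sample values are taken from the outputs shown earlier.

```python
import numpy as np
import pandas as pd

toy = pd.Series(["        $378,567", "838549.0", np.nan])

cleaned = (
    toy.str.strip()                        # drop padding around the value
    .str.replace("[$,]", "", regex=True)   # remove "$" and "," together
    .astype("float")
)
print(cleaned.tolist())
```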

In [208]:
df_mbs_combined_201322["Total Medicare benefits paid ($)"].dtype

dtype('float64')

In [209]:
df_mbs_combined_201322["Total Medicare benefits paid ($)"].describe()

count    2.343010e+05
mean     5.642181e+06
std      6.963072e+07
min      0.000000e+00
25%      4.743900e+04
50%      5.376540e+05
75%      2.486454e+06
max      9.082284e+09
Name: Total Medicare benefits paid ($), dtype: float64

#### Total provider fees

In [210]:
df_mbs_combined_201322["Total provider fees ($)"].value_counts(dropna=False)

n.p                 18362
n.p.                 3545
n.p.                 2327
0.0                   331
21744.0                 8
                    ...  
1397507.0               1
6808915.0               1
3922965.0               1
2885950.0               1
        $567,161        1
Name: Total provider fees ($), Length: 216251, dtype: int64

In [211]:
df_mbs_combined_201322 = set_np_to_NaN(
    df_mbs_combined_201322, "Total provider fees ($)"
)
df_mbs_combined_201322["Total provider fees ($)"].value_counts(dropna=False)

NaN                 24234
0.0                   331
21744.0                 8
15853.0                 7
10991.0                 7
                    ...  
1397507.0               1
6808915.0               1
3922965.0               1
2885950.0               1
        $567,161        1
Name: Total provider fees ($), Length: 216249, dtype: int64

In [212]:
# casting the value to float. Removing spaces, commas (,) and $ before casting
df_mbs_combined_201322["Total provider fees ($)"] = (
    df_mbs_combined_201322["Total provider fees ($)"]
    .str.strip()
    .str.replace("[$,]", "", regex=True)
    .astype("float")
)

In [213]:
df_mbs_combined_201322["Total provider fees ($)"].dtype

dtype('float64')

In [214]:
df_mbs_combined_201322["Total provider fees ($)"].describe()

count    2.343010e+05
mean     6.530087e+06
std      7.848583e+07
min      0.000000e+00
25%      5.389700e+04
50%      6.216280e+05
75%      3.009783e+06
max      1.000562e+10
Name: Total provider fees ($), dtype: float64

In [215]:
df_mbs_combined_201322.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258535 entries, 0 to 258534
Data columns (total 17 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   Year                                          258535 non-null  object 
 1   StateTerritory                                258535 non-null  object 
 2   GeographicUnit                                258535 non-null  object 
 3   GeographicCode                                258535 non-null  object 
 4   GeographicAreaName                            258535 non-null  object 
 5   GeographicGroup                               258535 non-null  object 
 6   ServiceLevel                                  258535 non-null  object 
 7   Service                                       258535 non-null  object 
 8   DemographicGroup                              258535 non-null  object 
 9   Medicare benefits per 100 people ($)          23

#### Estimated resident population

In [216]:
df_mbs_combined_201322["Estimated resident population"].value_counts(dropna=False)

5        445
0        314
4        204
59201    160
1        120
        ... 
60579      5
22484      5
17623      5
10921      5
28992      5
Name: Estimated resident population, Length: 18284, dtype: Int64

In [217]:
df_mbs_combined_201322["Estimated resident population"].describe()

count         258505.0
mean     126692.421748
std      835320.657884
min                0.0
25%            20080.0
50%            43450.0
75%            77425.0
max         25697298.0
Name: Estimated resident population, dtype: Float64

## Drop Unwanted Columns

In [218]:
df_mbs_combined_201322.drop(["GeographicUnit"], axis=1, inplace=True)

In [219]:
df_mbs_combined_201322.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258535 entries, 0 to 258534
Data columns (total 16 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   Year                                          258535 non-null  object 
 1   StateTerritory                                258535 non-null  object 
 2   GeographicCode                                258535 non-null  object 
 3   GeographicAreaName                            258535 non-null  object 
 4   GeographicGroup                               258535 non-null  object 
 5   ServiceLevel                                  258535 non-null  object 
 6   Service                                       258535 non-null  object 
 7   DemographicGroup                              258535 non-null  object 
 8   Medicare benefits per 100 people ($)          234301 non-null  float64
 9   No. of patients                               23

## Data Standardization

### Year: Convert from Financial to Calendar Year 

In [220]:
df_mbs_combined_201322["Year"].value_counts()

2021-22    28964
2013-14    28718
2014-15    28718
2015-16    28718
2016-17    28718
2017-18    28718
2018-19    28718
2020-21    28639
2019-20    28624
Name: Year, dtype: int64

In [221]:
year_replacement = {
    "2021-22": "2022",
    "2020-21": "2021",
    "2019-20": "2020",
    "2018-19": "2019",
    "2017-18": "2018",
    "2016-17": "2017",
    "2015-16": "2016",
    "2014-15": "2015",
    "2013-14": "2014",
}

df_mbs_combined_201322["Year"] = df_mbs_combined_201322["Year"].replace(
    year_replacement
)
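The mapping above labels each financial year by the calendar year in which it ends. The same dictionary can be derived from the labels rather than typed out. A minimal sketch, where `financial_to_calendar_year` is a hypothetical helper and not part of the notebook:

```python
def financial_to_calendar_year(label: str) -> str:
    """Map a label such as '2013-14' to its ending calendar year, '2014'."""
    start, end_suffix = label.split("-")
    end_year = int(start) + 1
    # Guard against malformed labels such as '2013-15'
    if end_year % 100 != int(end_suffix):
        raise ValueError(f"Unexpected financial year label: {label}")
    return str(end_year)


# Build the labels '2013-14' through '2021-22' and map each one
labels = [f"{year}-{(year + 1) % 100:02d}" for year in range(2013, 2022)]
derived_replacement = {label: financial_to_calendar_year(label) for label in labels}
print(derived_replacement)
```

Deriving the mapping avoids a manual update when a new financial year is added, and the guard raises on labels that do not follow the expected pattern.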

In [222]:
df_mbs_combined_201322["Year"].value_counts()

2022    28964
2014    28718
2015    28718
2016    28718
2017    28718
2018    28718
2019    28718
2021    28639
2020    28624
Name: Year, dtype: int64

In [223]:
df_mbs_combined_201322["Year"] = df_mbs_combined_201322["Year"].astype("int64")

In [224]:
df_mbs_combined_201322.head(10)

Unnamed: 0,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),Estimated resident population
0,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),0-24,2576.0,5624,10879,17.27,33.41,838549.0,1026474.0,32558
1,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),25-44,4004.0,7714,15870,24.75,50.93,1247656.0,1600846.0,31163
2,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),45-64,4672.0,8998,15754,41.32,72.35,1017264.0,1197133.0,21774
3,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),65+,5819.0,6397,12316,55.07,106.01,675946.0,761837.0,11617
4,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),All persons,3892.0,28733,54818,29.59,56.45,3779415.0,4586290.0,97112
5,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),Females,4902.0,17048,34198,34.9,70.01,2394671.0,2929743.0,48846
6,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),Males,2869.0,11685,20620,24.21,42.72,1384744.0,1656547.0,48266
7,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),0-24,2513.0,5594,8984,17.18,27.59,818149.0,1051718.0,32558
8,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),25-44,6904.0,9500,19275,30.48,61.85,2151589.0,3173663.0,31163
9,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),45-64,11869.0,8389,19105,38.53,87.74,2584397.0,3601039.0,21774


### Gender: Convert Abbreviations

In [225]:
df_mbs_combined_201322["DemographicGroup"].value_counts()

All persons    165115
0-24            15570
25-44           15570
45-64           15570
65+             15570
Females         10380
Males           10380
F                5190
M                5190
Name: DemographicGroup, dtype: int64

In [226]:
# replace M with Males and F with Females so gender values are consistent in the DemographicGroup column
df_mbs_combined_201322["DemographicGroup"] = df_mbs_combined_201322[
    "DemographicGroup"
].replace({"M": "Males", "F": "Females"})
df_mbs_combined_201322["DemographicGroup"].value_counts()

All persons    165115
0-24            15570
25-44           15570
45-64           15570
65+             15570
Females         15570
Males           15570
Name: DemographicGroup, dtype: int64
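
A quick guard against other unexpected labels can follow the replacement. This is a sketch only: the `demo` series is hypothetical sample data, and the `expected` set is an assumption based on the seven categories shown in the output above.

```python
import pandas as pd

# Hypothetical sample mixing abbreviated and full gender labels.
demo = pd.Series(
    ["All persons", "0-24", "F", "M", "Males", "Females"],
    name="DemographicGroup",
)

# Same replacement as in the cell above.
demo = demo.replace({"M": "Males", "F": "Females"})

# Categories expected after standardization (assumed from the output above).
expected = {"All persons", "0-24", "25-44", "45-64", "65+", "Females", "Males"}

# Any label outside the expected set would show up here.
unexpected = set(demo.unique()) - expected
print(unexpected)
```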

## Data Consistency Checks

In [227]:
df_mbs_combined_201322.describe()

Unnamed: 0,Year,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),Estimated resident population
count,258535.0,234301.0,234301.0,234301.0,234301.0,234301.0,234301.0,234301.0,258505.0
mean,2018.002162,6546.570367,21725.315317,93415.819766,23.257384,104.4814,5642181.0,6530087.0,126692.421748
std,2.583762,11229.591164,228847.398569,1353282.511084,28.371166,208.22682,69630720.0,78485830.0,835320.657884
min,2014.0,0.0,0.0,-102.0,0.0,-0.34,0.0,0.0,0.0
25%,2016.0,97.0,521.0,1063.0,1.11,2.17,47439.0,53897.0,20080.0
50%,2018.0,1146.11,3290.0,7721.0,8.99,19.43,537654.0,621628.0,43450.0
75%,2020.0,7662.81,11101.0,30794.0,35.7,91.07,2486454.0,3009783.0,77425.0
max,2022.0,110280.55,23099650.0,188694030.0,100.0,2041.2,9082284000.0,10005620000.0,25697298.0


1. About 9% of values (24,234 of 258,535 rows) are missing in each numeric column except Year and ERP (which is missing only 30 values). These values were not published due to suppression and will not be populated.
2. No. of services has a minimum of -102. Investigation required.
3. Services per 100 people also has a negative minimum (-0.34).
4. Check the number of SA3 areas that have 0 residents.
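
The checks listed above can also be summarised per column rather than read off `describe()`. The following is a minimal sketch on a small hypothetical dataframe (`df` and its values are made up for illustration); it reports the share of missing values and the count of negative and zero values for each numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, one negative value and one zero.
df = pd.DataFrame(
    {
        "Year": [2014, 2015, 2016, 2017],
        "No. of services": [10.0, np.nan, -102.0, 7.0],
        "Estimated resident population": [100.0, 0.0, 250.0, 300.0],
    }
)

numeric = df.select_dtypes(include="number")

# One row per numeric column: percentage missing, and counts of
# negative and zero values (NaN compares as False in both checks).
summary = pd.DataFrame(
    {
        "missing_pct": numeric.isnull().mean() * 100,
        "negative_count": (numeric < 0).sum(),
        "zero_count": (numeric == 0).sum(),
    }
)
print(summary)
```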

### Address Negative Value

In [228]:
df_mbs_combined_201322[df_mbs_combined_201322["No. of services"] < 0]

Unnamed: 0,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),Estimated resident population
159089,2019,Qld,31502,Outback - North,Remote (incl. very remote),Level 3,GP Multidisciplinary Case Conference,All persons,5.0,183,-102,0.61,-0.34,1548.0,1586.0,30139


The negative value is present in the original data. The source gives no explanation for the negative 'No. of services' and 'Services per 100 people'.

The assumption made is that this is a sign error. Since patients were recorded, both values are set to positive.

In [229]:
df_mbs_combined_201322.loc[159089, "No. of services"] = abs(
    df_mbs_combined_201322.loc[159089, "No. of services"]
)
df_mbs_combined_201322.loc[159089, "Services per 100 people"] = abs(
    df_mbs_combined_201322.loc[159089, "Services per 100 people"]
)
df_mbs_combined_201322.loc[159089]

Year                                                                            2019
StateTerritory                                                                   Qld
GeographicCode                                                                 31502
GeographicAreaName                                                   Outback - North
GeographicGroup                                           Remote (incl. very remote)
ServiceLevel                                                                 Level 3
Service                                         GP Multidisciplinary Case Conference
DemographicGroup                                                         All persons
Medicare benefits per 100 people ($)                                             5.0
No. of patients                                                                  183
No. of services                                                                  102
Percentage of people who had the service (%)                     
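
The fix above targets the row by its index label (159089), which would break if the index changed. As a sketch under the same sign-error assumption, a boolean mask can select the affected rows instead. The dataframe below is hypothetical and only mirrors the two affected columns.

```python
import pandas as pd

# Hypothetical data: the first row mirrors the negative record found above.
df = pd.DataFrame(
    {
        "No. of patients": [183, 50],
        "No. of services": [-102.0, 75.0],
        "Services per 100 people": [-0.34, 1.2],
    }
)

cols = ["No. of services", "Services per 100 people"]

# Rows where either column is negative.
negative_rows = (df[cols] < 0).any(axis=1)

# Flip the sign only on the affected rows (assumes the sign is the error).
df.loc[negative_rows, cols] = df.loc[negative_rows, cols].abs()
print(df)
```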

#### Investigate Estimated Resident Population of 0

In [230]:
zero_erp = df_mbs_combined_201322[
    df_mbs_combined_201322["Estimated resident population"] == 0
]
zero_erp.shape

(314, 16)

In [231]:
zero_erp[["StateTerritory", "GeographicCode", "GeographicAreaName"]].value_counts(
    dropna=False
)

StateTerritory     GeographicCode  GeographicAreaName         
Other Territories  90104           Norfolk Island                 249
NSW                12402           Blue Mountains - South          45
                   10702           Illawarra Catchment Reserve     20
dtype: int64

From the previous investigation, we know that Blue Mountains - South and Illawarra Catchment Reserve have no population.

In [232]:
# investigating Norfolk Island
zero_erp[zero_erp["GeographicCode"] == "90104"]["GeographicGroup"].value_counts(
    dropna=False
)

Remote (incl. very remote)    249
Name: GeographicGroup, dtype: int64

In [233]:
norfolk_data = df_mbs_combined_201322[
    (df_mbs_combined_201322["GeographicCode"] == "90104")
]

Norfolk Island has a small population, so its data was either suppressed or the services were not used. It is also suspected that data collection for Norfolk Island took place after the 2016 census.
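
One way to probe the census-timing suspicion is to look at how the zero-population rows are distributed across years. The sketch below uses entirely hypothetical figures (the years and populations are invented for illustration, not taken from the dataset) and assumes `GeographicCode` is stored as a string, as in the filters above.

```python
import pandas as pd

# Hypothetical rows; population values are illustrative only.
df = pd.DataFrame(
    {
        "Year": [2014, 2014, 2015, 2018, 2018],
        "GeographicCode": ["90104", "90104", "90104", "90104", "80101"],
        "Estimated resident population": [0, 0, 0, 1750, 97112],
    }
)

norfolk = df[df["GeographicCode"] == "90104"]

# Share of Norfolk Island rows with a zero ERP, per year.
zero_share_by_year = (
    norfolk["Estimated resident population"].eq(0).groupby(norfolk["Year"]).mean()
)
print(zero_share_by_year)
```

If the zero rows cluster in the earlier years, that would be consistent with the suspicion, although it would not confirm the cause.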

In [234]:
df_mbs_combined_201322.dtypes.to_clipboard()

In [235]:
df_mbs_combined_201322.shape

(258535, 16)

### Drop Rows with Missing Medicare & Provider Values

In [236]:
df_mbs_combined_201322.isnull().sum()

Year                                                0
StateTerritory                                      0
GeographicCode                                      0
GeographicAreaName                                  0
GeographicGroup                                     0
ServiceLevel                                        0
Service                                             0
DemographicGroup                                    0
Medicare benefits per 100 people ($)            24234
No. of patients                                 24234
No. of services                                 24234
Percentage of people who had the service (%)    24234
Services per 100 people                         24234
Total Medicare benefits paid ($)                24234
Total provider fees ($)                         24234
Estimated resident population                      30
dtype: int64

In [237]:
df_mbs_combined_201322[
    df_mbs_combined_201322["Total Medicare benefits paid ($)"].isnull()
]

Unnamed: 0,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,Medicare benefits per 100 people ($),No. of patients,No. of services,Percentage of people who had the service (%),Services per 100 people,Total Medicare benefits paid ($),Total provider fees ($),Estimated resident population
48,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 3,Diabetes Education,All persons,,,,,,,,97112
64,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 3,GP Prolonged - Imminent danger of death,All persons,,,,,,,,97112
67,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 3,GP Telehealth (patient-end support),All persons,,,,,,,,97112
70,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 3,Midwifery,All persons,,,,,,,,97112
74,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 3,Other Allied Health,All persons,,,,,,,,97112
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
258503,2022,WA,51104,Mid West,Remote (incl. very remote),Level 3,Exercise Physiology,All persons,,,,,,,,57284
258504,2022,WA,51104,Mid West,Remote (incl. very remote),Level 3,GP Acupuncture,All persons,,,,,,,,57284
258508,2022,WA,51104,Mid West,Remote (incl. very remote),Level 3,GP Focussed Psychological Strategies and Famil...,All persons,,,,,,,,57284
258524,2022,WA,51104,Mid West,Remote (incl. very remote),Level 3,Osteopathy,All persons,,,,,,,,57284


In [238]:
# drop rows where the suppressed Medicare benefit and provider fee values are missing
df_mbs_combined_201322.dropna(
    subset=["Total Medicare benefits paid ($)", "Total provider fees ($)"], inplace=True
)
df_mbs_combined_201322.isnull().sum()

Year                                            0
StateTerritory                                  0
GeographicCode                                  0
GeographicAreaName                              0
GeographicGroup                                 0
ServiceLevel                                    0
Service                                         0
DemographicGroup                                0
Medicare benefits per 100 people ($)            0
No. of patients                                 0
No. of services                                 0
Percentage of people who had the service (%)    0
Services per 100 people                         0
Total Medicare benefits paid ($)                0
Total provider fees ($)                         0
Estimated resident population                   0
dtype: int64

In [241]:
df_mbs_combined_201322.shape

(234301, 16)
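
The row counts reconcile: 258,535 rows before the drop minus 24,234 rows with missing values leaves 234,301. A minimal sketch of the same reconciliation, on a small hypothetical dataframe, is shown below.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of four rows have suppressed (missing) values.
df = pd.DataFrame(
    {
        "Total Medicare benefits paid ($)": [100.0, np.nan, 50.0, np.nan],
        "Total provider fees ($)": [120.0, np.nan, 60.0, np.nan],
    }
)

subset = ["Total Medicare benefits paid ($)", "Total provider fees ($)"]

rows_before = len(df)
# Rows with a missing value in at least one of the subset columns.
rows_with_gaps = int(df[subset].isnull().any(axis=1).sum())

df_clean = df.dropna(subset=subset)

# The rows removed should equal the rows that had gaps.
print(rows_before, rows_with_gaps, len(df_clean))
```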

### Export to Pickle File

In [239]:
df_mbs_combined_201322.to_pickle(
    os.path.join(path, "clean_datasets/mbs_data/2014-22_phc_combined_mbs.pkl")
)
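
After exporting, a round-trip read can confirm that the pickle preserves values and dtypes. This is a sketch only: it writes a hypothetical two-row dataframe to a temporary directory rather than to the project path used above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe; string codes should survive the round trip unchanged.
df = pd.DataFrame({"Year": [2014, 2015], "GeographicCode": ["80101", "90104"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "example.pkl")  # illustrative file name
    df.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# DataFrame.equals compares shape, values and dtypes.
round_trip_ok = restored.equals(df)
print(round_trip_ok)
```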