# Census & MBS: Combine Datasets & Derive New Variables

1. MBS data was separated into 3 datasets: National, National-SA3Group and State level.
2. Census and MBS State datasets were combined on year and SA3 area to create a complete combined dataset.
3. The combined census and MBS State dataset contained a significant number of blank values, as demographic data for service levels 2 and 3 did not exist in the MBS dataset.
4. Created 2 additional datasets by subsetting the combined dataset into a service level 1 dataset and a service level 2+3 dataset.
5. Derived the following new variables: out of pocket costs, % of out of pocket costs, out of pocket cost per person, % of out of pocket cost by median income.

In [81]:
# import libraries
import pandas as pd
import numpy as np
import os

pd.set_option("display.max_rows", 120)

## Import & Ready MBS Combined Dataset

In [82]:
# import the transformed mbs file and assign to a dataframe

# setup path to original dataset
path = r"/Users/patel/Documents/CF-Data Anaylst Course/portfolio_projects/mbs_analysis/datasets/"

df_mbs_2014_23 = pd.read_pickle(
    os.path.join(path, "clean_datasets/mbs_data/2014-22_phc_combined_mbs.pkl")
)
df_mbs_2014_23.info(10)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234301 entries, 0 to 258533
Data columns (total 16 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   Year                                          234301 non-null  int64  
 1   StateTerritory                                234301 non-null  object 
 2   GeographicCode                                234301 non-null  object 
 3   GeographicAreaName                            234301 non-null  object 
 4   GeographicGroup                               234301 non-null  object 
 5   ServiceLevel                                  234301 non-null  object 
 6   Service                                       234301 non-null  object 
 7   DemographicGroup                              234301 non-null  object 
 8   Medicare benefits per 100 people ($)          234301 non-null  float64
 9   No. of patients                               23

In [83]:
# rename columns for easy references
df_mbs_2014_23.rename(
    columns={
        "Medicare benefits per 100 people ($)": "MBS_per_100",
        "No. of patients": "No_of_patients",
        "No. of services": "No_of_services",
        "Percentage of people who had the service (%)": "%_People_had_service",
        "Services per 100 people": "Services_100_people",
        "Total Medicare benefits paid ($)": "Total_mbs_paid_$",
        "Total provider fees ($)": "Total_provider_fees_$",
        "Estimated resident population": "ERP",
    },
    inplace=True,
)

In [84]:
# create key for reference and comparison
df_mbs_2014_23["key"] = (
    df_mbs_2014_23["Year"].astype("str") + "-" + df_mbs_2014_23["GeographicCode"]
)
df_mbs_2014_23.head(3)

Unnamed: 0,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,MBS_per_100,No_of_patients,No_of_services,%_People_had_service,Services_100_people,Total_mbs_paid_$,Total_provider_fees_$,ERP,key
0,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),0-24,2576.0,5624,10879,17.27,33.41,838549.0,1026474.0,32558,2014-80101
1,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),25-44,4004.0,7714,15870,24.75,50.93,1247656.0,1600846.0,31163,2014-80101
2,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),45-64,4672.0,8998,15754,41.32,72.35,1017264.0,1197133.0,21774,2014-80101


### Derive MBS Specific New Variables

#### Out of Pocket

In [85]:
df_mbs_2014_23["Out_of_Pocket"] = (
    df_mbs_2014_23["Total_provider_fees_$"] - df_mbs_2014_23["Total_mbs_paid_$"]
)
df_mbs_2014_23[
    ["Out_of_Pocket", "Total_provider_fees_$", "Total_mbs_paid_$"]
].value_counts(dropna=False)

Out_of_Pocket  Total_provider_fees_$  Total_mbs_paid_$
0.000000e+00   0.000000e+00           0.000000e+00        341
               2.067000e+03           2.067000e+03          5
               7.104000e+03           7.104000e+03          4
               1.737100e+04           1.737100e+04          4
               7.336000e+03           7.336000e+03          4
                                                         ... 
7.264000e+03   7.825800e+04           7.099400e+04          1
               9.949200e+04           9.222800e+04          1
7.265000e+03   2.443500e+04           1.717000e+04          1
7.266000e+03   1.874100e+04           1.147500e+04          1
1.429639e+09   3.750781e+09           2.321142e+09          1
Length: 232766, dtype: int64

In [86]:
df_mbs_2014_23[df_mbs_2014_23["Out_of_Pocket"].isnull()]

Unnamed: 0,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,MBS_per_100,No_of_patients,No_of_services,%_People_had_service,Services_100_people,Total_mbs_paid_$,Total_provider_fees_$,ERP,key,Out_of_Pocket


In [87]:
# check for rows where out of pocket cost is empty but the provider or MBS fee values are not
df_mbs_2014_23[
    (df_mbs_2014_23["Out_of_Pocket"].isnull())
    & (
        (~df_mbs_2014_23["Total_mbs_paid_$"].isnull())
        | (~df_mbs_2014_23["Total_provider_fees_$"].isnull())
    )
]

Unnamed: 0,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,DemographicGroup,MBS_per_100,No_of_patients,No_of_services,%_People_had_service,Services_100_people,Total_mbs_paid_$,Total_provider_fees_$,ERP,key,Out_of_Pocket


#### Percentage of Out of Pocket of Provider Fees

Calculate the out of pocket cost paid by patients as a percentage of total provider fees. Rows with zero provider fees are set to 0 to avoid dividing by zero.

In [88]:
# derive out of pocket cost % value

df_mbs_2014_23.loc[
    df_mbs_2014_23["Total_provider_fees_$"] == 0, "Out_of_pocket_cost_%"
] = 0
df_mbs_2014_23.loc[
    (df_mbs_2014_23["Total_provider_fees_$"] != 0), "Out_of_pocket_cost_%"
] = (df_mbs_2014_23["Out_of_Pocket"] / df_mbs_2014_23["Total_provider_fees_$"]) * 100

# unguarded alternative, kept for reference (divides by zero where provider fees are 0):
# df_mbs_2014_23["Out_of_pocket_cost_%"] = (
#     df_mbs_2014_23["Out_of_Pocket"] / df_mbs_2014_23["Total_provider_fees_$"]
# ) * 100

# display specific columns for checking
df_mbs_2014_23[
    [
        "Year",
        "GeographicCode",
        "Service",
        "Out_of_Pocket",
        "Total_provider_fees_$",
        "Out_of_pocket_cost_%",
    ]
]

Unnamed: 0,Year,GeographicCode,Service,Out_of_Pocket,Total_provider_fees_$,Out_of_pocket_cost_%
0,2014,80101,Allied Health attendances (total),187925.0,1026474.0,18.307819
1,2014,80101,Allied Health attendances (total),353190.0,1600846.0,22.062709
2,2014,80101,Allied Health attendances (total),179869.0,1197133.0,15.024981
3,2014,80101,Allied Health attendances (total),85891.0,761837.0,11.274196
4,2014,80101,Allied Health attendances (total),806875.0,4586290.0,17.593196
...,...,...,...,...,...,...
258529,2022,51104,Other Psychologist,197063.0,599064.0,32.895150
258530,2022,51104,Physiotherapy,17168.0,154790.0,11.091156
258531,2022,51104,Podiatry,41677.0,310628.0,13.417013
258532,2022,51104,Practice Nurse/Aboriginal Health Worker,8.0,133983.0,0.005971


In [89]:
# checking for nulls
df_mbs_2014_23[
    ["Out_of_Pocket", "Total_provider_fees_$", "Out_of_pocket_cost_%"]
].value_counts(dropna=False)

Out_of_Pocket  Total_provider_fees_$  Out_of_pocket_cost_%
0.000000e+00   0.000000e+00           0.000000                341
               2.067000e+03           0.000000                  5
               7.104000e+03           0.000000                  4
               1.737100e+04           0.000000                  4
               7.336000e+03           0.000000                  4
                                                             ... 
7.264000e+03   7.825800e+04           9.282118                  1
               9.949200e+04           7.301090                  1
7.265000e+03   2.443500e+04           29.731942                 1
7.266000e+03   1.874100e+04           38.770610                 1
1.429639e+09   3.750781e+09           38.115761                 1
Length: 232766, dtype: int64
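
The same zero-guarded division is repeated for each ratio derived in this notebook, so it could be wrapped in one helper. The following is a minimal sketch only: `safe_divide` is a hypothetical name that is not part of this notebook, and the toy frame reuses the first Belconnen row shown above plus one all-zero row.

```python
import pandas as pd


def safe_divide(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning 0 wherever the denominator is 0.

    Hypothetical helper for illustration; NaN inputs are left as NaN,
    matching the two-step .loc assignment used in the cell above.
    """
    # pandas returns inf/NaN for division by zero rather than raising,
    # so the zero-denominator rows can be overwritten afterwards
    return (numerator / denominator).where(denominator != 0, 0)


toy = pd.DataFrame(
    {
        "Out_of_Pocket": [187925.0, 0.0],
        "Total_provider_fees_$": [1026474.0, 0.0],
    }
)
toy["Out_of_pocket_cost_%"] = (
    safe_divide(toy["Out_of_Pocket"], toy["Total_provider_fees_$"]) * 100
)
print(toy)
```

The first row reproduces the 18.307819 shown in the notebook output, and the all-zero row resolves to 0 instead of NaN.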

#### Out of pocket per person

In [90]:
# derive out of pocket cost per person
df_mbs_2014_23.loc[
    (df_mbs_2014_23["No_of_patients"] == 0), "Out_of_pocket_cost_per_person"
] = 0

df_mbs_2014_23.loc[
    (df_mbs_2014_23["No_of_patients"] != 0), "Out_of_pocket_cost_per_person"
] = (df_mbs_2014_23["Out_of_Pocket"] / df_mbs_2014_23["No_of_patients"])

# display specific columns for checking
df_mbs_2014_23[
    [
        "Year",
        "GeographicCode",
        "Service",
        "Out_of_Pocket",
        "No_of_patients",
        "Out_of_pocket_cost_per_person",
    ]
]

Unnamed: 0,Year,GeographicCode,Service,Out_of_Pocket,No_of_patients,Out_of_pocket_cost_per_person
0,2014,80101,Allied Health attendances (total),187925.0,5624,33.414829
1,2014,80101,Allied Health attendances (total),353190.0,7714,45.785585
2,2014,80101,Allied Health attendances (total),179869.0,8998,19.989887
3,2014,80101,Allied Health attendances (total),85891.0,6397,13.426763
4,2014,80101,Allied Health attendances (total),806875.0,28733,28.081822
...,...,...,...,...,...,...
258529,2022,51104,Other Psychologist,197063.0,1104,178.499094
258530,2022,51104,Physiotherapy,17168.0,922,18.620390
258531,2022,51104,Podiatry,41677.0,1803,23.115363
258532,2022,51104,Practice Nurse/Aboriginal Health Worker,8.0,5139,0.001557


In [91]:
# display specific columns for checking
df_mbs_2014_23[
    [
        "Out_of_Pocket",
        "No_of_patients",
        "Out_of_pocket_cost_per_person",
    ]
].value_counts(dropna=False)

Out_of_Pocket  No_of_patients  Out_of_pocket_cost_per_person
0.000000e+00   0               0.000000                         341
               21              0.000000                          86
               23              0.000000                          73
               36              0.000000                          71
               34              0.000000                          68
                                                               ... 
1.015000e+04   966             10.507246                          1
               1029            9.863946                           1
               1696            5.984670                           1
1.015100e+04   6077            1.670397                           1
1.429639e+09   8059022         177.396047                         1
Length: 223305, dtype: int64

In [92]:
df_mbs_2014_23["Out_of_pocket_cost_per_person"].isnull().sum()

0

#### Number of Services per Person

In [93]:
# derive number of services per person

df_mbs_2014_23.loc[
    df_mbs_2014_23["No_of_patients"] == 0, "No_of_service_per_person"
] = 0
df_mbs_2014_23.loc[
    df_mbs_2014_23["No_of_patients"] != 0, "No_of_service_per_person"
] = (df_mbs_2014_23["No_of_services"] / df_mbs_2014_23["No_of_patients"])

df_mbs_2014_23[
    [
        "Year",
        "GeographicCode",
        "Service",
        "No_of_services",
        "No_of_patients",
        "No_of_service_per_person",
    ]
]

Unnamed: 0,Year,GeographicCode,Service,No_of_services,No_of_patients,No_of_service_per_person
0,2014,80101,Allied Health attendances (total),10879,5624,1.934388
1,2014,80101,Allied Health attendances (total),15870,7714,2.057298
2,2014,80101,Allied Health attendances (total),15754,8998,1.750834
3,2014,80101,Allied Health attendances (total),12316,6397,1.925277
4,2014,80101,Allied Health attendances (total),54818,28733,1.907841
...,...,...,...,...,...,...
258529,2022,51104,Other Psychologist,4432,1104,4.014493
258530,2022,51104,Physiotherapy,2485,922,2.695228
258531,2022,51104,Podiatry,4856,1803,2.693289
258532,2022,51104,Practice Nurse/Aboriginal Health Worker,8167,5139,1.589220


In [94]:
# checking for valid null
df_mbs_2014_23[
    [
        "No_of_services",
        "No_of_patients",
        "No_of_service_per_person",
    ]
].value_counts(dropna=False)

No_of_services  No_of_patients  No_of_service_per_person
0               0               0.000000                    341
21              21              1.000000                    106
28              28              1.000000                     99
26              26              1.000000                     96
31              31              1.000000                     96
                                                           ... 
4607            2430            1.895885                      1
                2711            1.699373                      1
                3061            1.505064                      1
                3102            1.485171                      1
188694030       23099650        8.168696                      1
Length: 211502, dtype: int64

#### Flag for Patients More than ERP

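This flag is False for rows where the number of patients exceeds the estimated resident population (ERP) and True otherwise. It is not used when pivoting tables and won't be included in any census_mbs combined datasets.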

In [95]:
df_mbs_2014_23["Patient_ERP_Flag"] = True

# set to False where the number of patients exceeds the ERP
df_mbs_2014_23.loc[
    df_mbs_2014_23["ERP"] < df_mbs_2014_23["No_of_patients"], "Patient_ERP_Flag"
] = False

In [96]:
df_mbs_2014_23["Patient_ERP_Flag"].value_counts(dropna=False)

True     232533
False      1768
Name: Patient_ERP_Flag, dtype: int64
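
The two-step assignment above can also be expressed as a single boolean comparison. This is a sketch on a hypothetical three-row toy frame (only the first row's values come from the data shown earlier), not a change to the notebook's logic.

```python
import pandas as pd

# toy rows: patients below, above, and equal to the ERP (last two are made up)
toy = pd.DataFrame(
    {
        "ERP": [32558, 100, 50],
        "No_of_patients": [5624, 120, 50],
    }
)

# negate the "patients exceed ERP" condition so that the flag is
# True by default and False only where ERP < No_of_patients
toy["Patient_ERP_Flag"] = ~(toy["ERP"] < toy["No_of_patients"])
print(toy["Patient_ERP_Flag"].value_counts(dropna=False))
```

Negating the original condition, rather than writing `No_of_patients <= ERP`, keeps the behaviour identical even if a null ever appeared in either column.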

In [97]:
df_mbs_2014_23.isnull().sum()

Year                             0
StateTerritory                   0
GeographicCode                   0
GeographicAreaName               0
GeographicGroup                  0
ServiceLevel                     0
Service                          0
DemographicGroup                 0
MBS_per_100                      0
No_of_patients                   0
No_of_services                   0
%_People_had_service             0
Services_100_people              0
Total_mbs_paid_$                 0
Total_provider_fees_$            0
ERP                              0
key                              0
Out_of_Pocket                    0
Out_of_pocket_cost_%             0
Out_of_pocket_cost_per_person    0
No_of_service_per_person         0
Patient_ERP_Flag                 0
dtype: int64

### MBS: Subset National & State Level Data

Subset the MBS data by separating State from National level data. This is done by creating 3 new datasets containing:
1. National-SA3Group
2. National
3. State

Rows with a StateTerritory value of National-SA3Group or National will be removed from the State MBS dataset.

In [98]:
df_mbs_2014_23["StateTerritory"].value_counts()

NSW                  62109
Qld                  57723
Vic                  46874
WA                   23005
SA                   19301
Tas                   9955
NT                    5568
ACT                   4897
National-SA3Group     4482
National               250
Other Territories      137
Name: StateTerritory, dtype: int64

#### Subset National Level Data

In [99]:
# create subset containing National level data. Note this only has data from 2020-2022
df_national_au = df_mbs_2014_23[df_mbs_2014_23["StateTerritory"].isin(["National"])]
print(df_national_au.shape)
print(df_national_au["Year"].value_counts(dropna=False))

(250, 22)
2022    84
2020    83
2021    83
Name: Year, dtype: int64


#### Subset SA3 Group Level Data

In [100]:
# subset data by National-SA3Group. These are rows whose State value was updated in script 1 before pivoting
df_sa3_group_au = df_mbs_2014_23[
    df_mbs_2014_23["StateTerritory"].isin(["National-SA3Group"])
]
print(df_sa3_group_au.shape)
print(df_sa3_group_au["Year"].value_counts(dropna=False))

(4482, 22)
2022    504
2014    498
2015    498
2016    498
2017    498
2018    498
2020    498
2021    498
2019    492
Name: Year, dtype: int64


There are 6 more rows in 2022 than in most other years (504 vs 498). The investigation below found that "Other GP Services" was added to the service level 3 list in 2022.

In [101]:
# check row count per year and geographic code; there appears to be an extra service row for each area in 2022
df_sa3_group_au[["Year", "GeographicCode"]].value_counts(dropna=False)

Year  GeographicCode
2022  004-06            84
      004-05            84
      004-04            84
      004-03            84
      004-02            84
      004-01            84
2020  004-03            83
2018  004-05            83
      004-06            83
2019  004-01            83
      004-02            83
      004-04            83
      004-05            83
2020  004-01            83
      004-02            83
      004-06            83
      004-04            83
      004-05            83
2018  004-03            83
2021  004-01            83
      004-02            83
      004-03            83
      004-04            83
      004-05            83
      004-06            83
2014  004-02            83
      004-01            83
2018  004-02            83
2016  004-02            83
2014  004-03            83
      004-04            83
      004-05            83
      004-06            83
2015  004-01            83
      004-02            83
      004-03            83
      0

In [102]:
# extract the unique list of services for 2021 and 2022 to determine which new service was added in 2022
df_service_level_3_2021 = (
    df_sa3_group_au[
        (df_sa3_group_au["Year"] == 2021)
        & (df_sa3_group_au["ServiceLevel"] == "Level 3")
    ]["Service"]
).unique()
df_service_level_3_2022 = (
    df_sa3_group_au[
        (df_sa3_group_au["Year"] == 2022)
        & (df_sa3_group_au["ServiceLevel"] == "Level 3")
    ]["Service"]
).unique()

# finding the difference in service list
np.setdiff1d(df_service_level_3_2022, df_service_level_3_2021)

array(['Other GP Services'], dtype=object)
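
The year-to-year comparison can be checked in both directions (added and removed services) by collecting the services for each year into sets. This is a sketch on a hypothetical toy frame: only 'Other GP Services' is taken from the result above, and the other service names are placeholders.

```python
import pandas as pd

# toy stand-in for the Level 3 rows of df_sa3_group_au (values are illustrative)
toy = pd.DataFrame(
    {
        "Year": [2021, 2021, 2022, 2022, 2022],
        "Service": [
            "Physiotherapy",
            "Podiatry",
            "Physiotherapy",
            "Podiatry",
            "Other GP Services",
        ],
    }
)

# one set of service names per year
services_by_year = toy.groupby("Year")["Service"].apply(set)

# set difference in each direction
added_2022 = services_by_year.loc[2022] - services_by_year.loc[2021]
removed_2022 = services_by_year.loc[2021] - services_by_year.loc[2022]
print(added_2022, removed_2022)
```

Checking the reverse direction as well guards against a service being dropped in the same year that another was added, which a row count alone would hide.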

### Subset State Level SA3 Data

In [103]:
# remove National and National-SA3Group data from the master dataset to create a state-only dataset
df_mbs_state_sa3_complete = df_mbs_2014_23[
    ~df_mbs_2014_23["StateTerritory"].isin(["National", "National-SA3Group"])
]
df_mbs_state_sa3_complete.shape

(229569, 22)
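
The three subsets should partition the master dataset, with no rows lost or duplicated. A quick arithmetic check using the row counts reported by the cells above (the variable names here are illustrative only):

```python
# row counts reported by the cells above
n_total = 234301      # df_mbs_2014_23
n_national = 250      # df_national_au
n_sa3_group = 4482    # df_sa3_group_au
n_state = 229569      # df_mbs_state_sa3_complete

# rows not accounted for by any of the three subsets
remainder = n_total - (n_national + n_sa3_group + n_state)
print(remainder)
```

A remainder of zero confirms that every row landed in exactly one of the three datasets.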

In [104]:
df_mbs_state_sa3_complete["StateTerritory"].value_counts(dropna=False)

NSW                  62109
Qld                  57723
Vic                  46874
WA                   23005
SA                   19301
Tas                   9955
NT                    5568
ACT                   4897
Other Territories      137
Name: StateTerritory, dtype: int64

In [105]:
df_mbs_state_sa3_complete["GeographicCode"].value_counts(dropna=False)

31003    742
31106    740
31103    739
30907    738
31004    738
        ... 
80110    202
90101     67
90104     67
10803      4
90102      3
Name: GeographicCode, Length: 336, dtype: int64

In [106]:
print(df_mbs_state_sa3_complete["Year"].min())
print(df_mbs_state_sa3_complete["Year"].max())

2014
2022


#### Export MBS State Level (Non Pivot)

In [107]:
df_mbs_state_sa3_complete.to_pickle(
    os.path.join(
        path, "clean_datasets/mbs_data/2014-22_mbs_state_complete_no_piovot.pkl"
    )
)

### Pivot State MBS by Demographic Data

In [108]:
# find unique demographic values
df_mbs_state_sa3_complete["DemographicGroup"].unique()

array(['0-24', '25-44', '45-64', '65+', 'All persons', 'Females', 'Males'],
      dtype=object)

In [109]:
df_mbs_state_sa3_complete.columns

Index(['Year', 'StateTerritory', 'GeographicCode', 'GeographicAreaName',
       'GeographicGroup', 'ServiceLevel', 'Service', 'DemographicGroup',
       'MBS_per_100', 'No_of_patients', 'No_of_services',
       '%_People_had_service', 'Services_100_people', 'Total_mbs_paid_$',
       'Total_provider_fees_$', 'ERP', 'key', 'Out_of_Pocket',
       'Out_of_pocket_cost_%', 'Out_of_pocket_cost_per_person',
       'No_of_service_per_person', 'Patient_ERP_Flag'],
      dtype='object')

In [110]:
# Use pivot_table to reshape the DataFrame
df_mbs_state_sa3_pivot = df_mbs_state_sa3_complete.pivot_table(
    index=[
        "key",
        "Year",
        "StateTerritory",
        "GeographicCode",
        "GeographicAreaName",
        "GeographicGroup",
        "ServiceLevel",
        "Service",
    ],
    columns="DemographicGroup",
    values=[
        "MBS_per_100",
        "No_of_patients",
        "No_of_services",
        "%_People_had_service",
        "Services_100_people",
        "Total_mbs_paid_$",
        "Total_provider_fees_$",
        "ERP",
        "Out_of_Pocket",
        "Out_of_pocket_cost_%",
        "No_of_service_per_person",
        "Out_of_pocket_cost_per_person",
    ],
    aggfunc="first",
)  # 'first' is used to pick the first value in case of duplicates
df_mbs_state_sa3_pivot.reset_index(inplace=True)

In [111]:
# Flatten the MultiIndex in columns and format new column names as 'measure_demographic'
df_mbs_state_sa3_pivot.columns = [
    "_".join(col).strip() if col[1] else col[0]
    for col in df_mbs_state_sa3_pivot.columns.values
]
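
To illustrate what the pivot and flattening steps do to the column names, here is a sketch on a two-row toy frame built from the Belconnen values shown earlier. The frame names `long_df` and `wide` are hypothetical.

```python
import pandas as pd

# two long-format rows: one per demographic group for the same key
long_df = pd.DataFrame(
    {
        "key": ["2014-80101", "2014-80101"],
        "DemographicGroup": ["0-24", "25-44"],
        "No_of_patients": [5624, 7714],
    }
)

# pivot demographic groups into columns; this yields MultiIndex columns
# of the form (measure, demographic)
wide = long_df.pivot_table(
    index=["key"],
    columns="DemographicGroup",
    values=["No_of_patients"],
    aggfunc="first",
)
wide.reset_index(inplace=True)

# join the two levels with "_"; index columns have an empty second level
wide.columns = [
    "_".join(col).strip() if col[1] else col[0] for col in wide.columns.values
]
print(wide.columns.tolist())
```

The two long rows collapse into one wide row whose columns are `key`, `No_of_patients_0-24` and `No_of_patients_25-44`.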

In [112]:
df_mbs_state_sa3_pivot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140823 entries, 0 to 140822
Data columns (total 92 columns):
 #   Column                                     Non-Null Count   Dtype  
---  ------                                     --------------   -----  
 0   key                                        140823 non-null  object 
 1   Year                                       140823 non-null  int64  
 2   StateTerritory                             140823 non-null  object 
 3   GeographicCode                             140823 non-null  object 
 4   GeographicAreaName                         140823 non-null  object 
 5   GeographicGroup                            140823 non-null  object 
 6   ServiceLevel                               140823 non-null  object 
 7   Service                                    140823 non-null  object 
 8   %_People_had_service_0-24                  14724 non-null   float64
 9   %_People_had_service_25-44                 14802 non-null   float64
 10  %_People

In [113]:
df_mbs_state_sa3_pivot.head(3)

Unnamed: 0,key,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,%_People_had_service_0-24,%_People_had_service_25-44,...,Total_mbs_paid_$_All persons,Total_mbs_paid_$_Females,Total_mbs_paid_$_Males,Total_provider_fees_$_0-24,Total_provider_fees_$_25-44,Total_provider_fees_$_45-64,Total_provider_fees_$_65+,Total_provider_fees_$_All persons,Total_provider_fees_$_Females,Total_provider_fees_$_Males
0,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,Allied Health attendances (total),18.8,20.65,...,1938341.0,1201578.0,736763.0,580491.0,633972.0,717555.0,360781.0,2292799.0,1428964.0,863835.0
1,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,Diagnostic Imaging (total),17.59,27.72,...,4702507.0,2761452.0,1941056.0,657658.0,1586180.0,2387889.0,1558873.0,6190599.0,3768680.0,2421920.0
2,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,GP attendances (total),76.02,74.84,...,10780077.0,6339946.0,4440131.0,3207438.0,3764233.0,4500909.0,3067848.0,14540427.0,8587817.0,5952610.0


In [114]:
# checking the pivot values
df_mbs_state_sa3_pivot[
    (df_mbs_state_sa3_pivot["Year"] == 2014)
    & (df_mbs_state_sa3_pivot["GeographicCode"] == "80101")
    & (df_mbs_state_sa3_pivot["ServiceLevel"] == "Level 1")
    & (
        df_mbs_state_sa3_pivot["Service"].isin(
            ["Allied Health attendances (total)", "Diagnostic Imaging (total)"]
        )
    )
]

Unnamed: 0,key,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,%_People_had_service_0-24,%_People_had_service_25-44,...,Total_mbs_paid_$_All persons,Total_mbs_paid_$_Females,Total_mbs_paid_$_Males,Total_provider_fees_$_0-24,Total_provider_fees_$_25-44,Total_provider_fees_$_45-64,Total_provider_fees_$_65+,Total_provider_fees_$_All persons,Total_provider_fees_$_Females,Total_provider_fees_$_Males
14599,2014-80101,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Allied Health attendances (total),17.27,24.75,...,3779415.0,2394671.0,1384744.0,1026474.0,1600846.0,1197133.0,761837.0,4586290.0,2929743.0,1656547.0
14600,2014-80101,2014,ACT,80101,Belconnen,Major cities - medium SES,Level 1,Diagnostic Imaging (total),17.18,30.48,...,8274560.0,5024192.0,3250368.0,1051718.0,3173663.0,3601039.0,3040462.0,10866882.0,6807431.0,4059451.0


Note: there is no demographic breakdown for service levels 2 and 3. As a result, a significant number of blank values exist in the per-demographic measure columns (e.g. `%_People_had_service_0-24`), except for "All persons".
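
One way to confirm where the blanks sit is to count nulls per service level. This is a sketch on a hypothetical three-row frame: the Level 1 values come from the Belconnen rows above, while the Level 2 and Level 3 rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy stand-in for df_mbs_state_sa3_pivot
toy = pd.DataFrame(
    {
        "ServiceLevel": ["Level 1", "Level 2", "Level 3"],
        "No_of_patients_0-24": [5624.0, np.nan, np.nan],
        "No_of_patients_All persons": [28733.0, 1104.0, 922.0],
    }
)

measure_cols = ["No_of_patients_0-24", "No_of_patients_All persons"]

# boolean null mask, summed within each service level
null_counts = toy[measure_cols].isnull().groupby(toy["ServiceLevel"]).sum()
print(null_counts)
```

On the real pivot, the same pattern would show whether the blanks are confined to levels 2 and 3, as the note above expects.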

In [115]:
# exporting dataset for analysis of the pivot
df_mbs_state_sa3_pivot.to_csv(
    os.path.join(path, "clean_datasets/mbs_data/mbs_state_pivot_datatset_complete.csv")
)

## Import & Ready Census Dataset

In [116]:
# import the transformed census file and assign to a dataframe

df_census_2011_22 = pd.read_pickle(
    os.path.join(path, "clean_datasets/census_data/2011_22_census_complete.pkl")
)
df_census_2011_22.info(10)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4296 entries, 0 to 4295
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   SA3                      4296 non-null   object 
 1   Year                     4296 non-null   int64  
 2   age_0-14                 4296 non-null   float64
 3   age_15-24                4296 non-null   float64
 4   age_25-44                4296 non-null   float64
 5   age_45-64                4296 non-null   float64
 6   age_65-79                4296 non-null   float64
 7   age_80+                  4296 non-null   float64
 8   negative_income          4296 non-null   float64
 9   no_income                4296 non-null   float64
 10  average_income_$5200     4296 non-null   float64
 11  average_income_$13000    4296 non-null   float64
 12  average_income_$18200    4296 non-null   float64
 13  average_income_$26000    4296 non-null   float64
 14  average_income_$36400   

### Extract data from 2014

In [117]:
# extracting data from 2014 onwards to match the MBS range
df_census_2014_22 = df_census_2011_22[df_census_2011_22["Year"] >= 2014]
df_census_2014_22["Year"].value_counts(dropna=False)

2014    358
2015    358
2016    358
2017    358
2018    358
2019    358
2020    358
2021    358
2022    358
Name: Year, dtype: int64

### Align Age Groups with MBS

1. 0-24
2. 25-44
3. 45-64
4. 65+

In [118]:
# make a copy of the dataset to avoid changing the original imported version
df_census_2014_22_new = df_census_2014_22.copy()

In [119]:
# combine population of age brackets 0-14 and 15-24 to create a 0-24 bracket
df_census_2014_22_new["age_0-24"] = (
    df_census_2014_22_new["age_0-14"] + df_census_2014_22_new["age_15-24"]
)
# combine population of age brackets 65-79 and 80+ to create a 65+ bracket
df_census_2014_22_new["age_65+"] = (
    df_census_2014_22_new["age_65-79"] + df_census_2014_22_new["age_80+"]
)

In [120]:
df_census_2014_22_new.head(3)

Unnamed: 0,SA3,Year,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,...,average_income_$91000,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+,age_0-24,age_65+
3,10102,2014,11151.0,6831.0,15725.0,15715.2,5045.8,1332.6,166.2,2839.0,...,5117.2,4088.6,3787.6,11151.0,28038.8,27765.2,55804.2,,17982.0,6378.4
4,10102,2015,11193.0,6854.0,15782.0,16033.6,5337.4,1363.8,163.6,2951.0,...,5324.6,4089.8,4047.8,11193.0,28434.4,28133.6,56568.6,,18047.0,6701.2
5,10102,2016,11235.0,6877.0,15839.0,16352.0,5629.0,1395.0,161.0,3063.0,...,5532.0,4091.0,4308.0,11235.0,28830.0,28502.0,57333.0,1812.0,18112.0,7024.0
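
As a worked check of the bracket alignment, the first row above (SA3 10102, 2014) can be recomputed by hand. The variable names are illustrative only.

```python
# source brackets for SA3 10102 in 2014, as shown in the output above
age_0_14, age_15_24 = 11151.0, 6831.0
age_65_79, age_80_plus = 5045.8, 1332.6

# aligned brackets matching the MBS demographic groups
age_0_24 = age_0_14 + age_15_24
age_65_plus = age_65_79 + age_80_plus
print(age_0_24, round(age_65_plus, 1))
```

Both sums agree with the `age_0-24` (17982.0) and `age_65+` (6378.4) values in the first row of the output.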


In [121]:
# Drop columns age_0-14, age_15-24, age_65-79, age_80+ as aggregate columns are formed

df_census_2014_22_new.drop(
    ["age_0-14", "age_15-24", "age_65-79", "age_80+"], axis=1, inplace=True
)
df_census_2014_22_new.columns

Index(['SA3', 'Year', 'age_25-44', 'age_45-64', 'negative_income', 'no_income',
       'average_income_$5200', 'average_income_$13000',
       'average_income_$18200', 'average_income_$26000',
       'average_income_$36400', 'average_income_$46800',
       'average_income_$58500', 'average_income_$71500',
       'average_income_$91000', 'average_income_$130000', 'not_stated',
       'not_applicable', 'male_pop', 'female_pop', 'total_population',
       'average_income_$169000+', 'age_0-24', 'age_65+'],
      dtype='object')

### Compare SA3 values

Some SA3 values exist in the census data but not in MBS. This is expected and is not an issue for the join.

Create a year-SA3 key on the census data to find any missing year-SA3 combinations.

In [122]:
# creating key for reference and investigations
df_census_2014_22_new["key"] = (
    df_census_2014_22_new["Year"].astype("str") + "-" + df_census_2014_22_new["SA3"]
)

In [123]:
df_mbs_state_sa3_pivot.head(3)

Unnamed: 0,key,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,%_People_had_service_0-24,%_People_had_service_25-44,...,Total_mbs_paid_$_All persons,Total_mbs_paid_$_Females,Total_mbs_paid_$_Males,Total_provider_fees_$_0-24,Total_provider_fees_$_25-44,Total_provider_fees_$_45-64,Total_provider_fees_$_65+,Total_provider_fees_$_All persons,Total_provider_fees_$_Females,Total_provider_fees_$_Males
0,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,Allied Health attendances (total),18.8,20.65,...,1938341.0,1201578.0,736763.0,580491.0,633972.0,717555.0,360781.0,2292799.0,1428964.0,863835.0
1,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,Diagnostic Imaging (total),17.59,27.72,...,4702507.0,2761452.0,1941056.0,657658.0,1586180.0,2387889.0,1558873.0,6190599.0,3768680.0,2421920.0
2,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,GP attendances (total),76.02,74.84,...,10780077.0,6339946.0,4440131.0,3207438.0,3764233.0,4500909.0,3067848.0,14540427.0,8587817.0,5952610.0


In [124]:
census_sa3_list = pd.Series(df_census_2014_22_new["key"].unique())
mbs_sa3_list = pd.Series(df_mbs_state_sa3_pivot["key"].unique())

census_sa3_list.to_clipboard()

In [125]:
# diff_values = np.setdiff1d(census_sa3_list, mbs_sa3_list)
diff_values = mbs_sa3_list[~mbs_sa3_list.isin(census_sa3_list)]
diff_values

Series([], dtype: object)

`diff_values` is empty, which shows that every key in MBS also exists in census.
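
The same `isin` pattern can be run in both directions to see which keys are exclusive to each side. This is a sketch with toy keys: "2014-99999" is a made-up key standing in for an SA3 area that appears only in the census.

```python
import pandas as pd

# toy key lists; the last census key is hypothetical
census_keys = pd.Series(["2014-10102", "2014-80101", "2014-99999"])
mbs_keys = pd.Series(["2014-10102", "2014-80101"])

# keys present on one side but missing from the other
only_in_mbs = mbs_keys[~mbs_keys.isin(census_keys)]
only_in_census = census_keys[~census_keys.isin(mbs_keys)]

print(only_in_mbs.tolist(), only_in_census.tolist())
```

For a left join from MBS onto census, only the first list needs to be empty. Keys found only in the census are simply not matched.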

In [126]:
# checking population data of SA3 codes added in 2016 and backfilled values
df_census_2014_22_new[
    (
        df_census_2014_22_new["Year"].isin([2014, 2015])
        & (df_census_2014_22_new["SA3"] == "90104")
    )
]

Unnamed: 0,SA3,Year,age_25-44,age_45-64,negative_income,no_income,average_income_$5200,average_income_$13000,average_income_$18200,average_income_$26000,...,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+,age_0-24,age_65+,key
4263,90104,2014,341.0,577.0,9.0,67.0,54.0,98.0,146.0,344.0,...,33.0,138.0,296.0,820.0,925.0,1748.0,,406.0,413.0,2014-90104
4264,90104,2015,341.0,577.0,9.0,67.0,54.0,98.0,146.0,344.0,...,33.0,138.0,296.0,820.0,925.0,1748.0,,406.0,413.0,2015-90104


#### Fill in average_income_$169000+ NA Values

In [127]:
# check for any empty values, expected for average_income_$169000+
df_census_2014_22_new[df_census_2014_22_new["average_income_$169000+"].isna()][
    "Year"
].value_counts()

2014    358
2015    358
Name: Year, dtype: int64

In [128]:
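# fill in na values with 0 (assign back rather than using inplace on a column selection)
df_census_2014_22_new["average_income_$169000+"] = df_census_2014_22_new[
    "average_income_$169000+"
].fillna(0)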

In [129]:
# double checking
df_census_2014_22_new[df_census_2014_22_new["average_income_$169000+"].isna()][
    "Year"
].value_counts()

Series([], Name: Year, dtype: int64)

#### Export Standardized Census Data

In [130]:
df_census_2014_22_new.to_pickle(
    os.path.join(
        path, "clean_datasets/census_data/2011_22_census_complelete_standardized.pkl"
    )
)

## Combine MBS and Census

Combine the pivoted MBS state dataset with census using a left join on SA3 (GeographicCode) and Year. The result is expected to keep all 140,823 MBS rows.

In [1243]:
df_mbs_state_sa3_pivot.shape

(140823, 92)

In [131]:
df_census_2014_22_new.shape

(3222, 25)

In [1245]:
df_census_mbs_combined = df_mbs_state_sa3_pivot.merge(
    df_census_2014_22_new,
    how="left",
    left_on=["Year", "GeographicCode"],
    right_on=["Year", "SA3"],
    indicator=True,
)

In [1246]:
df_census_mbs_combined.shape

(140823, 117)
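
The row count is the same before and after the merge, which is what a many-to-one left join should produce. A hedged sketch of making that expectation explicit with `validate` and the `_merge` indicator; `left` and `right` are small made-up frames, not the notebook's data:

```python
import pandas as pd

# Hypothetical MBS-like rows: several services per (Year, SA3)
left = pd.DataFrame({
    "Year": [2014, 2014, 2015],
    "GeographicCode": ["10102", "10102", "90104"],
    "Service": ["GP attendances (total)", "Diagnostic Imaging (total)",
                "GP attendances (total)"],
})
# Hypothetical census-like rows: one row per (Year, SA3)
right = pd.DataFrame({
    "Year": [2014, 2015],
    "SA3": ["10102", "90104"],
    "total_population": [1000.0, 1748.0],
})

merged = left.merge(
    right,
    how="left",
    left_on=["Year", "GeographicCode"],
    right_on=["Year", "SA3"],
    indicator=True,
    validate="many_to_one",  # raises MergeError if right has duplicate keys
)

# A left join must not add or lose rows when the right side is unique per key
assert len(merged) == len(left)

# Rows without a census match would be flagged as "left_only"
unmatched = (merged["_merge"] == "left_only").sum()
print(unmatched)
```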

In [1247]:
df_census_mbs_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140823 entries, 0 to 140822
Columns: 117 entries, key_x to _merge
dtypes: Int32(7), Int64(14), category(1), float64(85), int64(1), object(9)
memory usage: 124.9+ MB


In [1248]:
df_census_mbs_combined.isnull().sum()

key_x                                             0
Year                                              0
StateTerritory                                    0
GeographicCode                                    0
GeographicAreaName                                0
GeographicGroup                                   0
ServiceLevel                                      0
Service                                           0
%_People_had_service_0-24                    126099
%_People_had_service_25-44                   126021
%_People_had_service_45-64                   126056
%_People_had_service_65+                     126081
%_People_had_service_All persons                 12
%_People_had_service_Females                 125962
%_People_had_service_Males                   125961
ERP_0-24                                     126099
ERP_25-44                                    126021
ERP_45-64                                    126056
ERP_65+                                      126081
ERP_All pers

In [1249]:
df_census_mbs_combined[df_census_mbs_combined["MBS_per_100_All persons"].isnull()]

Unnamed: 0,key_x,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,%_People_had_service_0-24,%_People_had_service_25-44,...,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+,age_0-24,age_65+,key_y,_merge
93056,2020-10803,2020,NSW,10803,Lord Howe Island,Remote (incl. very remote),Level 1,Nursing and Aboriginal Health Workers (total),,,...,32.4,63.0,214.0,223.6,432.2,3.4,77.2,96.6,2020-10803,both
108261,2020-90104,2020,Other Territories,90104,Norfolk Island,Remote (incl. very remote),Level 1,Nursing and Aboriginal Health Workers (total),0.0,,...,202.0,340.0,1012.0,1089.0,2100.0,21.0,510.8,514.6,2020-90104,both
109483,2021-10803,2021,NSW,10803,Lord Howe Island,Remote (incl. very remote),Level 1,Nursing and Aboriginal Health Workers (total),0.0,,...,31.0,66.0,220.0,232.0,445.0,3.0,77.0,103.0,2021-10803,both
124605,2021-90101,2021,Other Territories,90101,Christmas Island,Remote (incl. very remote),Level 1,Nursing and Aboriginal Health Workers (total),0.0,,...,424.0,285.0,1006.0,685.0,1692.0,46.0,412.0,221.0,2021-90101,both
124606,2021-90102,2021,Other Territories,90102,Cocos (Keeling) Islands,Remote (incl. very remote),Level 1,Nursing and Aboriginal Health Workers (total),0.0,,...,84.0,130.0,302.0,292.0,593.0,13.0,179.0,101.0,2021-90102,both
125818,2022-10803,2022,NSW,10803,Lord Howe Island,Remote (incl. very remote),Level 1,Nursing and Aboriginal Health Workers (total),0.0,,...,31.0,66.0,220.0,232.0,445.0,3.0,77.0,103.0,2022-10803,both
128441,2022-12405,2022,NSW,12405,St Marys,Major cities - lower SES,Level 1,Nursing and Aboriginal Health Workers (total),2.34,4.37,...,3721.0,12063.0,28372.0,28419.0,56790.0,624.0,19310.0,7237.0,2022-12405,both
136828,2022-40201,2022,SA,40201,Gawler - Two Wells,Major cities - lower SES,Level 1,GP attendances (total),,,...,1704.0,6898.0,18564.0,19141.0,37709.0,573.0,11567.0,6926.0,2022-40201,both
138869,2022-50703,2022,WA,50703,Kwinana,Major cities - lower SES,Level 1,Allied Health attendances (total),19.26,27.21,...,3363.0,10410.0,23482.0,22379.0,45867.0,983.0,16052.0,4444.0,2022-50703,both
138870,2022-50703,2022,WA,50703,Kwinana,Major cities - lower SES,Level 1,Diagnostic Imaging (total),16.33,34.94,...,3363.0,10410.0,23482.0,22379.0,45867.0,983.0,16052.0,4444.0,2022-50703,both


#### Derive New Variables

For simplicity, the percentage of income spent out of pocket is calculated for the All persons group only, once for each median income bracket.

In [1250]:
# median income bracket label -> dollar value used as the divisor
income_brackets = {
    "$5200": 5200,
    "$13000": 13000,
    "$18200": 18200,
    "$26000": 26000,
    "$36400": 36400,
    "$46800": 46800,
    "$58500": 58500,
    "$71500": 71500,
    "$91000": 91000,
    "$130000": 130000,
    "$169000+": 169000,
}

# out-of-pocket cost per person as a percentage of each bracket's income
for label, income in income_brackets.items():
    df_census_mbs_combined[f"%_out_of_pocket_by_{label}"] = (
        df_census_mbs_combined["Out_of_pocket_cost_per_person_All persons"] / income
    ) * 100

In [1251]:
df_census_mbs_combined[
    [
        "Out_of_Pocket_All persons",
        "%_out_of_pocket_by_$5200",
        "%_out_of_pocket_by_$13000",
        "%_out_of_pocket_by_$18200",
        "%_out_of_pocket_by_$26000",
        "%_out_of_pocket_by_$36400",
        "%_out_of_pocket_by_$46800",
        "%_out_of_pocket_by_$58500",
        "%_out_of_pocket_by_$71500",
        "%_out_of_pocket_by_$91000",
        "%_out_of_pocket_by_$130000",
        "%_out_of_pocket_by_$169000+",
    ]
]

Unnamed: 0,Out_of_Pocket_All persons,%_out_of_pocket_by_$5200,%_out_of_pocket_by_$13000,%_out_of_pocket_by_$18200,%_out_of_pocket_by_$26000,%_out_of_pocket_by_$36400,%_out_of_pocket_by_$46800,%_out_of_pocket_by_$58500,%_out_of_pocket_by_$71500,%_out_of_pocket_by_$91000,%_out_of_pocket_by_$130000,%_out_of_pocket_by_$169000+
0,354458.0,0.437320,0.174928,0.124948,0.087464,0.062474,0.048591,0.038873,0.031805,0.024990,0.017493,0.013456
1,1488092.0,1.707366,0.682946,0.487819,0.341473,0.243909,0.189707,0.151766,0.124172,0.097564,0.068295,0.052534
2,3760350.0,1.594093,0.637637,0.455455,0.318819,0.227728,0.177121,0.141697,0.115934,0.091091,0.063764,0.049049
3,54.0,0.001194,0.000477,0.000341,0.000239,0.000171,0.000133,0.000106,0.000087,0.000068,0.000048,0.000037
4,1646217.0,2.386403,0.954561,0.681829,0.477281,0.340915,0.265156,0.212125,0.173557,0.136366,0.095456,0.073428
...,...,...,...,...,...,...,...,...,...,...,...,...
140818,7368.0,1.666968,0.666787,0.476277,0.333394,0.238138,0.185219,0.148175,0.121234,0.095255,0.066679,0.051291
140819,3.0,0.000502,0.000201,0.000143,0.000100,0.000072,0.000056,0.000045,0.000036,0.000029,0.000020,0.000015
140820,35715.0,7.804851,3.121941,2.229958,1.560970,1.114979,0.867206,0.693765,0.567626,0.445992,0.312194,0.240149
140821,,,,,,,,,,,,
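
As a quick sanity check on the table above: every column divides the same per-person cost by a different income, so the columns are fixed multiples of one another. A small sketch using the row 0 values; the per-person cost is back-calculated here from the `$5200` column, which is an assumption for illustration and not a value read from the dataset:

```python
# Row 0 of the table, %_out_of_pocket_by_$5200
pct_5200 = 0.437320

# Implied out-of-pocket cost per person in dollars (back-calculated)
cost_per_person = pct_5200 / 100 * 5200

# Recompute two other brackets from the same cost
pct_13000 = cost_per_person / 13000 * 100
pct_169000 = cost_per_person / 169000 * 100

print(round(pct_13000, 6), round(pct_169000, 6))
```

Both recomputed values agree with the row 0 entries shown for `$13000` (0.174928) and `$169000+` (0.013456).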


The large number of nulls is due to Service Level 2 and 3 not having gender and age bracket information. Because of this, the data is split into 3 datasets for the analysis:

1. 2014_22_mbs_census_combined : full combined dataset
2. 2014_22_mbs_census_service_level_1
3. 2014_22_mbs_census_service_level_2_3
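
A minimal sketch of this kind of split, assuming `ServiceLevel` only takes the values `Level 1`, `Level 2` and `Level 3`. The `combined` frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical combined dataset
combined = pd.DataFrame({
    "ServiceLevel": ["Level 1", "Level 2", "Level 3", "Level 1", "Level 2"],
    "No_of_services_All persons": [10, 4, 1, 7, 3],
})

# .copy() gives each subset its own data, so later edits
# do not act on a slice of the combined frame
level_1 = combined[combined["ServiceLevel"] == "Level 1"].copy()
level_2_3 = combined[combined["ServiceLevel"].isin(["Level 2", "Level 3"])].copy()

# Every row should land in exactly one of the two subsets
assert len(level_1) + len(level_2_3) == len(combined)
print(len(level_1), len(level_2_3))
```

The same check holds for the shapes reported in this notebook: 14,877 Level 1 rows plus 125,946 Level 2 and 3 rows give the 140,823 rows of the combined dataset.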

In [1252]:
# export complete census and mbs data combined.
df_census_mbs_combined.to_pickle(
    os.path.join(path, "clean_datasets/2014_22_mbs_census_combined.pkl")
)

#### Subset Service Level 1

In [1253]:
# subsetting service level 1 dataset
df_mbs_census_service_level_1 = df_census_mbs_combined[
    df_census_mbs_combined["ServiceLevel"] == "Level 1"
]
df_mbs_census_service_level_1.shape

(14877, 128)

In [1254]:
# checking on nulls
df_mbs_census_service_level_1.isnull().sum().to_clipboard()

In [1255]:
df_mbs_census_service_level_1[
    df_mbs_census_service_level_1["%_People_had_service_65+"].isnull()
][["Year", "GeographicAreaName", "Service", "%_People_had_service_65+"]]

Unnamed: 0,Year,GeographicAreaName,Service,%_People_had_service_65+
13108,2014,Kwinana,Allied Health attendances (total),
13109,2014,Kwinana,Diagnostic Imaging (total),
13110,2014,Kwinana,GP attendances (total),
13111,2014,Kwinana,Nursing and Aboriginal Health Workers (total),
13112,2014,Kwinana,Specialist attendances (total),
...,...,...,...,...
140789,2022,Molonglo,GP attendances (total),
140790,2022,Molonglo,Nursing and Aboriginal Health Workers (total),
140791,2022,Molonglo,Specialist attendances (total),
140821,2022,Christmas Island,Nursing and Aboriginal Health Workers (total),


Found 12 rows with nulls. These are filled with 0.

In [1256]:
# assign the filled frame back instead of using inplace on a filtered subset,
# which avoids pandas' SettingWithCopyWarning
df_mbs_census_service_level_1 = df_mbs_census_service_level_1.fillna(0)
df_mbs_census_service_level_1.isnull().sum()

key_x                          0
Year                           0
StateTerritory                 0
GeographicCode                 0
GeographicAreaName             0
                              ..
%_out_of_pocket_by_$58500      0
%_out_of_pocket_by_$71500      0
%_out_of_pocket_by_$91000      0
%_out_of_pocket_by_$130000     0
%_out_of_pocket_by_$169000+    0
Length: 128, dtype: int64
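
Filling nulls on a filtered subset behaves most predictably when the subset is an explicit copy and the filled result is assigned back. A minimal sketch with made-up values; the frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical combined dataset with missing demographic values
combined = pd.DataFrame({
    "ServiceLevel": ["Level 1", "Level 1", "Level 2"],
    "%_People_had_service_65+": [12.5, np.nan, np.nan],
})

# Take an independent copy of the subset, then assign the filled result back
level_1 = combined[combined["ServiceLevel"] == "Level 1"].copy()
level_1 = level_1.fillna(0)

# The subset has no nulls left; the combined frame is untouched
print(level_1.isnull().sum().sum(), combined.isnull().sum().sum())
```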

In [1257]:
df_mbs_census_service_level_1[df_mbs_census_service_level_1["key_x"] == "2020-90104"]

Unnamed: 0,key_x,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,%_People_had_service_0-24,%_People_had_service_25-44,...,%_out_of_pocket_by_$13000,%_out_of_pocket_by_$18200,%_out_of_pocket_by_$26000,%_out_of_pocket_by_$36400,%_out_of_pocket_by_$46800,%_out_of_pocket_by_$58500,%_out_of_pocket_by_$71500,%_out_of_pocket_by_$91000,%_out_of_pocket_by_$130000,%_out_of_pocket_by_$169000+
108261,2020-90104,2020,Other Territories,90104,Norfolk Island,Remote (incl. very remote),Level 1,Nursing and Aboriginal Health Workers (total),0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [1258]:
df_mbs_census_service_level_1[df_mbs_census_service_level_1["Year"] == 2014]

Unnamed: 0,key_x,Year,StateTerritory,GeographicCode,GeographicAreaName,GeographicGroup,ServiceLevel,Service,%_People_had_service_0-24,%_People_had_service_25-44,...,%_out_of_pocket_by_$13000,%_out_of_pocket_by_$18200,%_out_of_pocket_by_$26000,%_out_of_pocket_by_$36400,%_out_of_pocket_by_$46800,%_out_of_pocket_by_$58500,%_out_of_pocket_by_$71500,%_out_of_pocket_by_$91000,%_out_of_pocket_by_$130000,%_out_of_pocket_by_$169000+
0,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,Allied Health attendances (total),18.80,20.65,...,0.174928,0.124948,0.087464,0.062474,0.048591,0.038873,0.031805,0.024990,0.017493,0.013456
1,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,Diagnostic Imaging (total),17.59,27.72,...,0.682946,0.487819,0.341473,0.243909,0.189707,0.151766,0.124172,0.097564,0.068295,0.052534
2,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,GP attendances (total),76.02,74.84,...,0.637637,0.455455,0.318819,0.227728,0.177121,0.141697,0.115934,0.091091,0.063764,0.049049
3,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,Nursing and Aboriginal Health Workers (total),1.14,0.59,...,0.000477,0.000341,0.000239,0.000171,0.000133,0.000106,0.000087,0.000068,0.000048,0.000037
4,2014-10102,2014,NSW,10102,Queanbeyan,Major cities - higher SES,Level 1,Specialist attendances (total),14.60,17.92,...,0.954561,0.681829,0.477281,0.340915,0.265156,0.212125,0.173557,0.136366,0.095456,0.073428
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14866,2014-80109,2014,ACT,80109,Woden Valley,Major cities - higher SES,Level 1,Allied Health attendances (total),19.16,25.06,...,0.210063,0.150045,0.105032,0.075023,0.058351,0.046681,0.038193,0.030009,0.021006,0.016159
14867,2014-80109,2014,ACT,80109,Woden Valley,Major cities - higher SES,Level 1,Diagnostic Imaging (total),20.39,31.25,...,0.703136,0.502240,0.351568,0.251120,0.195316,0.156252,0.127843,0.100448,0.070314,0.054087
14868,2014-80109,2014,ACT,80109,Woden Valley,Major cities - higher SES,Level 1,GP attendances (total),80.63,80.73,...,0.719454,0.513896,0.359727,0.256948,0.199848,0.159879,0.130810,0.102779,0.071945,0.055343
14869,2014-80109,2014,ACT,80109,Woden Valley,Major cities - higher SES,Level 1,Nursing and Aboriginal Health Workers (total),0.57,0.60,...,0.013366,0.009547,0.006683,0.004774,0.003713,0.002970,0.002430,0.001909,0.001337,0.001028


In [1259]:
# checking 2016 new SA3 codes having backfilled population values
old_sa3_codes = df_mbs_census_service_level_1[
    (
        df_mbs_census_service_level_1["GeographicCode"].isin(
            ["80110", "80111", "10106", "90104", "30805", "31608", "21704", "51003"]
        )
    )
    & (df_mbs_census_service_level_1["total_population"].isnull())
]
old_sa3_codes["Year"].value_counts(dropna=False)

Series([], Name: Year, dtype: int64)

In [1260]:
df_mbs_census_service_level_1_new = df_mbs_census_service_level_1.drop(
    columns=["key_y", "_merge", "SA3"]
)

In [1261]:
# Exporting df_mbs_census_service_level_1 into pickle file
df_mbs_census_service_level_1_new.to_pickle(
    os.path.join(path, "clean_datasets/2014_22_mbs_census_service_level_1.pkl")
)

#### Subset Service Level 2 & 3

In [1262]:
# subsetting service level 2 and 3 dataset
df_mbs_census_service_level_2_3 = df_census_mbs_combined[
    df_census_mbs_combined["ServiceLevel"].isin(["Level 2", "Level 3"])
]
df_mbs_census_service_level_2_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 125946 entries, 5 to 140820
Columns: 128 entries, key_x to %_out_of_pocket_by_$169000+
dtypes: Int32(7), Int64(14), category(1), float64(96), int64(1), object(9)
memory usage: 122.3+ MB


In [1263]:
df_mbs_census_service_level_2_3.columns

Index(['key_x', 'Year', 'StateTerritory', 'GeographicCode',
       'GeographicAreaName', 'GeographicGroup', 'ServiceLevel', 'Service',
       '%_People_had_service_0-24', '%_People_had_service_25-44',
       ...
       '%_out_of_pocket_by_$13000', '%_out_of_pocket_by_$18200',
       '%_out_of_pocket_by_$26000', '%_out_of_pocket_by_$36400',
       '%_out_of_pocket_by_$46800', '%_out_of_pocket_by_$58500',
       '%_out_of_pocket_by_$71500', '%_out_of_pocket_by_$91000',
       '%_out_of_pocket_by_$130000', '%_out_of_pocket_by_$169000+'],
      dtype='object', length=128)

In [1264]:
df_mbs_census_service_level_2_3_new = df_mbs_census_service_level_2_3.copy()
df_mbs_census_service_level_2_3_new.drop(
    columns=[
        "Total_provider_fees_$_Males",
        "Total_provider_fees_$_Females",
        "Total_provider_fees_$_65+",
        "Total_provider_fees_$_45-64",
        "Total_provider_fees_$_25-44",
        "Total_provider_fees_$_0-24",
        "Services_100_people_Males",
        "Services_100_people_Females",
        "Services_100_people_65+",
        "Services_100_people_45-64",
        "Services_100_people_25-44",
        "Services_100_people_0-24",
        "No_of_services_Males",
        "No_of_services_Females",
        "No_of_services_0-24",
        "No_of_services_25-44",
        "No_of_services_45-64",
        "No_of_services_65+",
        "%_People_had_service_0-24",
        "%_People_had_service_25-44",
        "%_People_had_service_45-64",
        "%_People_had_service_65+",
        "%_People_had_service_Females",
        "%_People_had_service_Males",
        "ERP_0-24",
        "ERP_25-44",
        "ERP_45-64",
        "ERP_65+",
        "ERP_Females",
        "ERP_Males",
        "MBS_per_100_0-24",
        "MBS_per_100_25-44",
        "MBS_per_100_45-64",
        "MBS_per_100_65+",
        "MBS_per_100_Females",
        "MBS_per_100_Males",
        "No_of_patients_0-24",
        "No_of_patients_25-44",
        "No_of_patients_45-64",
        "No_of_patients_65+",
        "No_of_patients_Females",
        "No_of_patients_Males",
        "Total_mbs_paid_$_0-24",
        "Total_mbs_paid_$_25-44",
        "Total_mbs_paid_$_45-64",
        "Total_mbs_paid_$_65+",
        "Total_mbs_paid_$_Females",
        "Total_mbs_paid_$_Males",
        "_merge",
        "SA3",
        "key_y",
        "Out_of_Pocket_0-24",
        "Out_of_Pocket_25-44",
        "Out_of_Pocket_45-64",
        "Out_of_Pocket_65+",
        "Out_of_Pocket_Females",
        "Out_of_Pocket_Males",
        "Out_of_pocket_cost_%_0-24",
        "Out_of_pocket_cost_%_25-44",
        "Out_of_pocket_cost_%_45-64",
        "Out_of_pocket_cost_%_65+",
        "Out_of_pocket_cost_%_Females",
        "Out_of_pocket_cost_%_Males",
        "No_of_service_per_person_0-24",
        "No_of_service_per_person_25-44",
        "No_of_service_per_person_45-64",
        "No_of_service_per_person_65+",
        "No_of_service_per_person_Females",
        "No_of_service_per_person_Males",
        "Out_of_pocket_cost_per_person_0-24",
        "Out_of_pocket_cost_per_person_25-44",
        "Out_of_pocket_cost_per_person_45-64",
        "Out_of_pocket_cost_per_person_65+",
        "Out_of_pocket_cost_per_person_Females",
        "Out_of_pocket_cost_per_person_Males",
    ],
    inplace=True,
)
df_mbs_census_service_level_2_3_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 125946 entries, 5 to 140820
Data columns (total 53 columns):
 #   Column                                     Non-Null Count   Dtype  
---  ------                                     --------------   -----  
 0   key_x                                      125946 non-null  object 
 1   Year                                       125946 non-null  int64  
 2   StateTerritory                             125946 non-null  object 
 3   GeographicCode                             125946 non-null  object 
 4   GeographicAreaName                         125946 non-null  object 
 5   GeographicGroup                            125946 non-null  object 
 6   ServiceLevel                               125946 non-null  object 
 7   Service                                    125946 non-null  object 
 8   %_People_had_service_All persons           125946 non-null  float64
 9   ERP_All persons                            125946 non-null  Int64  
 10  MBS_per_
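
The drop list above follows a pattern: each MBS measure is repeated for the age and sex groups, and only the `All persons` version is kept. A hedged sketch of building such a list programmatically. The prefix list is an assumption taken from the column names shown above and would need extending to the other measures. Census columns such as `age_25-44` share the same suffixes, so matching on the suffix alone would drop too much:

```python
import pandas as pd

# Hypothetical frame with a handful of the notebook's column names
df = pd.DataFrame(columns=[
    "Year", "ERP_All persons", "ERP_Males", "ERP_65+",
    "No_of_services_0-24", "No_of_services_All persons",
    "age_25-44", "age_65+", "male_pop",
])

groups = ["0-24", "25-44", "45-64", "65+", "Females", "Males"]
mbs_prefixes = ["ERP_", "No_of_services_"]  # assumed; extend per measure

# Drop only MBS measure columns that end in a demographic group
to_drop = [
    col for col in df.columns
    if any(col.startswith(p) for p in mbs_prefixes)
    and any(col.endswith("_" + g) for g in groups)
]
slim = df.drop(columns=to_drop)
print(list(slim.columns))
```

Requiring both a known MBS prefix and a group suffix keeps census columns like `age_65+` and `male_pop` in place.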

In [1265]:
df_mbs_census_service_level_2_3_new.isna().sum()

key_x                                            0
Year                                             0
StateTerritory                                   0
GeographicCode                                   0
GeographicAreaName                               0
GeographicGroup                                  0
ServiceLevel                                     0
Service                                          0
%_People_had_service_All persons                 0
ERP_All persons                                  0
MBS_per_100_All persons                          0
No_of_patients_All persons                       0
No_of_service_per_person_All persons             0
No_of_services_All persons                       0
Out_of_Pocket_All persons                        0
Out_of_pocket_cost_%_All persons                 0
Out_of_pocket_cost_per_person_All persons        0
Services_100_people_All persons                  0
Total_mbs_paid_$_All persons                     0
Total_provider_fees_$_All perso

In [1266]:
df_mbs_census_service_level_2_3_new["average_income_$169000+"] = (
    df_mbs_census_service_level_2_3_new["average_income_$169000+"].fillna(0)
)

In [1267]:
df_mbs_census_service_level_2_3_new.isnull().sum()

key_x                                        0
Year                                         0
StateTerritory                               0
GeographicCode                               0
GeographicAreaName                           0
GeographicGroup                              0
ServiceLevel                                 0
Service                                      0
%_People_had_service_All persons             0
ERP_All persons                              0
MBS_per_100_All persons                      0
No_of_patients_All persons                   0
No_of_service_per_person_All persons         0
No_of_services_All persons                   0
Out_of_Pocket_All persons                    0
Out_of_pocket_cost_%_All persons             0
Out_of_pocket_cost_per_person_All persons    0
Services_100_people_All persons              0
Total_mbs_paid_$_All persons                 0
Total_provider_fees_$_All persons            0
age_25-44                                    0
age_45-64    

In [1268]:
# Exporting df_mbs_census_service_level_2_3_new into pickle file
df_mbs_census_service_level_2_3_new.to_pickle(
    os.path.join(path, "clean_datasets/2014_22_mbs_census_service_level_2_3.pkl")
)