# Census Datasets 2011, 2016 & 2021 : Combine & Pre-Process

The script imports the combined census data files for each year, checks and corrects consistency issues in columns (income brackets and SA3 level values), and combines the census data into a single dataset.

In [410]:
import pandas as pd
import numpy as np
import os

## Import Census Datasets

#### Census 2011

In [411]:
path = r"/Users/patel/Documents/CF-Data Anaylst Course/portfolio_projects/mbs_analysis/datasets/"

df_census_2011 = pd.read_pickle(
    os.path.join(path, "clean_datasets/census_data/2011_cenus_combined.pkl")
)
df_census_2011.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 351 entries, 0 to 350
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   SA3               351 non-null    object
 1   age_0-14          351 non-null    int64 
 2   age_15-24         351 non-null    int64 
 3   age_25-44         351 non-null    int64 
 4   age_45-64         351 non-null    int64 
 5   age_65-79         351 non-null    int64 
 6   age_80+           351 non-null    int64 
 7   Negative income   351 non-null    int64 
 8   Nil income        351 non-null    int64 
 9   $1-$10,399        351 non-null    int64 
 10  $10,400-$15,599   351 non-null    int64 
 11  $15,600-$20,799   351 non-null    int64 
 12  $20,800-$31,199   351 non-null    int64 
 13  $31,200-$41,599   351 non-null    int64 
 14  $41,600-$51,999   351 non-null    int64 
 15  $52,000-$64,999   351 non-null    int64 
 16  $65,000-$77,999   351 non-null    int64 
 17  $78,000-$103,999

#### Census 2016

In [412]:
df_census_2016 = pd.read_pickle(
    os.path.join(path, "clean_datasets/census_data/2016_cenus_combined.pkl")
)
df_census_2016.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 27 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   SA3                                358 non-null    object
 1   age_0-14                           358 non-null    int64 
 2   age_15-24                          358 non-null    int64 
 3   age_25-44                          358 non-null    int64 
 4   age_45-64                          358 non-null    int64 
 5   age_65-79                          358 non-null    int64 
 6   age_80+                            358 non-null    int64 
 7   Negative income                    358 non-null    int64 
 8   Nil income                         358 non-null    int64 
 9   $1-$149 ($1-$7,799)                358 non-null    int64 
 10  $150-$299 ($7,800-$15,599)         358 non-null    int64 
 11  $300-$399 ($15,600-$20,799)        358 non-null    int64 
 12  $400-$49

#### Census 2021

In [413]:
df_census_2021 = pd.read_pickle(
    os.path.join(path, "clean_datasets/census_data/2021_cenus_combined.pkl")
)
df_census_2021.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 28 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   SA3                                358 non-null    object
 1   age_0-14                           358 non-null    int64 
 2   age_15-24                          358 non-null    int64 
 3   age_25-44                          358 non-null    int64 
 4   age_45-64                          358 non-null    int64 
 5   age_65-79                          358 non-null    int64 
 6   age_80+                            358 non-null    int64 
 7   Negative income                    358 non-null    int64 
 8   Nil income                         358 non-null    int64 
 9   $1-$149 ($1-$7,799)                358 non-null    int64 
 10  $150-$299 ($7,800-$15,599)         358 non-null    int64 
 11  $300-$399 ($15,600-$20,799)        358 non-null    int64 
 12  $400-$49

## Data Consistency Across All Datasets

### Income

Income brackets are not consistent across 2011, 2016 and 2021. Before the census datasets can be stacked, the following steps are applied to each dataset:

1. Income brackets are standardized so that the number of income brackets is the same across all datasets.
2. Any blank values are set to 0.
3. Income brackets are renamed.

The table below shows the income mapping used. Average income is the midpoint of the lowest and highest annual income within the bracket. Note that the code below combines the two highest brackets into a single `average_income_$169000+` column, because the 2016 census reports only "$3,000 or more ($156,000 or more)".

| Income Brackets                   | Average Income | Average Income Column Name |
|-----------------------------------|----------------|----------------------------|
| $1-$199 ($1-$10,399)              | 5200           | average_income_$5200       |
| $200-$299 ($10,400-$15,599)       | 12999.5        | average_income_$13000      |
| $300-$399 ($15,600-$20,799)       | 18199.5        | average_income_$18200      |
| $400-$599 ($20,800-$31,199)       | 25999.5        | average_income_$26000      |
| $600-$799 ($31,200-$41,599)       | 36399.5        | average_income_$36400      |
| $800-$999 ($41,600-$51,999)       | 46799.5        | average_income_$46800      |
| $1,000-$1,249 ($52,000-$64,999)   | 58499.5        | average_income_$58500      |
| $1,250-$1,499 ($65,000-$77,999)   | 71499.5        | average_income_$71500      |
| $1,500-$1,999 ($78,000-$103,999)  | 90999.5        | average_income_$91000      |
| $2,000-$2,999 ($104,000-$155,999) | 129999.5       | average_income_$130000     |
| $3,000-$3,499 ($156,000-$181,999) | 168999.5       | average_income_$169000     |
| $3,500 or more ($182,000 or more) |                | average_income_$200000+    |
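
The midpoint rule behind the table can be sketched as follows. This is a minimal illustration, assuming each bracket is described by its annual lower and upper bound; `bracket_midpoint` and `column_name` are hypothetical helper names that do not appear in the notebook. The open-ended top bracket has no upper bound, so it has no midpoint and is left out.

```python
import math

# Annual (lower, upper) bounds for a few brackets from the table above.
brackets = [(1, 10_399), (10_400, 15_599), (78_000, 103_999)]


def bracket_midpoint(lower: int, upper: int) -> float:
    """Return the average of the lowest and highest income in a bracket."""
    return (lower + upper) / 2


def column_name(lower: int, upper: int) -> str:
    """Build the column name from the midpoint, rounded half up to whole dollars."""
    rounded = math.floor(bracket_midpoint(lower, upper) + 0.5)
    return f"average_income_${rounded}"


for lower, upper in brackets:
    print(bracket_midpoint(lower, upper), column_name(lower, upper))
```

Rounding half up is used here instead of Python's built-in `round`, which rounds halves to the nearest even number and is easy to misread for values ending in `.5`.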

#### Census 2011

In [414]:
df_census_2011.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 351 entries, 0 to 350
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   SA3               351 non-null    object
 1   age_0-14          351 non-null    int64 
 2   age_15-24         351 non-null    int64 
 3   age_25-44         351 non-null    int64 
 4   age_45-64         351 non-null    int64 
 5   age_65-79         351 non-null    int64 
 6   age_80+           351 non-null    int64 
 7   Negative income   351 non-null    int64 
 8   Nil income        351 non-null    int64 
 9   $1-$10,399        351 non-null    int64 
 10  $10,400-$15,599   351 non-null    int64 
 11  $15,600-$20,799   351 non-null    int64 
 12  $20,800-$31,199   351 non-null    int64 
 13  $31,200-$41,599   351 non-null    int64 
 14  $41,600-$51,999   351 non-null    int64 
 15  $52,000-$64,999   351 non-null    int64 
 16  $65,000-$77,999   351 non-null    int64 
 17  $78,000-$103,999

In [415]:
# rename the columns
new_column_names = {
    "$1-$10,399": "average_income_$5200",
    "$10,400-$15,599": "average_income_$13000",
    "$15,600-$20,799": "average_income_$18200",
    "$20,800-$31,199": "average_income_$26000",
    "$31,200-$41,599": "average_income_$36400",
    "$41,600-$51,999": "average_income_$46800",
    "$52,000-$64,999": "average_income_$58500",
    "$65,000-$77,999": "average_income_$71500",
    "$78,000-$103,999": "average_income_$91000",
    "$104,000+": "average_income_$130000",
    "Negative income": "negative_income",
    "Nil income": "no_income",
    "Not stated": "not_stated",
    "Not applicable": "not_applicable",
}
df_census_2011 = df_census_2011.rename(columns=new_column_names)
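
By default, `DataFrame.rename` skips mapping keys that do not match any column, so a misspelt bracket label would pass silently. The sketch below, on a toy frame with illustrative counts, shows how `errors="raise"` can surface such mismatches. The toy frame and variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with illustrative counts; only the column labels mirror the census data.
df_toy = pd.DataFrame({"SA3": ["10102"], "$1-$10,399": [2489], "Nil income": [2503]})

mapping = {
    "$1-$10,399": "average_income_$5200",
    "Nil income": "no_income",
}
df_renamed = df_toy.rename(columns=mapping, errors="raise")
print(list(df_renamed.columns))

# A key that is absent from the frame now raises instead of being skipped.
try:
    df_toy.rename(columns={"Not stated": "not_stated"}, errors="raise")
    missing_key_detected = False
except KeyError:
    missing_key_detected = True
print(missing_key_detected)
```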

In [416]:
# add high income brackets which are present in 2021 census but not in census 2011.
# This is so the data is consistent for merging

# df_census_2011["average_income_$200000+"] = np.nan -- commenting out average_income_$200000+

""" df_census_2011["average_income_$200000+"] = df_census_2011[
    "average_income_$200000+"
].astype("Int64") """

# casting to a nullable integer type to enforce integer values for consistency
df_census_2011["average_income_$169000+"] = np.nan
df_census_2011["average_income_$169000+"] = df_census_2011[
    "average_income_$169000+"
].astype("Int64")

# checking the shape and new columns
print(df_census_2011.shape)
df_census_2011.head()

(351, 25)


Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$58500,average_income_$71500,average_income_$91000,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+
0,10101,13648,7590,15585,18904,8708,3072,333,3126,3900,...,3934,2656,3222,2258,4338,13648,33658,33841,67500,
1,10102,11025,6762,15554,14760,4171,1239,174,2503,2489,...,4422,3528,4495,4085,3007,11025,26852,26660,53511,
2,10103,3641,2198,4485,5530,2264,818,83,799,1103,...,1134,695,787,530,1642,3641,9736,9193,18931,
3,10104,11507,6347,12033,21878,11980,4123,295,2846,4480,...,3191,1679,1965,1123,4432,11507,33368,34514,67880,
4,10201,30502,19699,37671,43996,20940,10297,640,8641,10113,...,9723,6587,8093,6307,10701,30502,78677,84435,163110,


In [417]:
df_census_2011.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 351 entries, 0 to 350
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   SA3                      351 non-null    object
 1   age_0-14                 351 non-null    int64 
 2   age_15-24                351 non-null    int64 
 3   age_25-44                351 non-null    int64 
 4   age_45-64                351 non-null    int64 
 5   age_65-79                351 non-null    int64 
 6   age_80+                  351 non-null    int64 
 7   negative_income          351 non-null    int64 
 8   no_income                351 non-null    int64 
 9   average_income_$5200     351 non-null    int64 
 10  average_income_$13000    351 non-null    int64 
 11  average_income_$18200    351 non-null    int64 
 12  average_income_$26000    351 non-null    int64 
 13  average_income_$36400    351 non-null    int64 
 14  average_income_$46800    351 non-null    i

#### Census 2016

In [418]:
df_census_2016.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 27 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   SA3                                358 non-null    object
 1   age_0-14                           358 non-null    int64 
 2   age_15-24                          358 non-null    int64 
 3   age_25-44                          358 non-null    int64 
 4   age_45-64                          358 non-null    int64 
 5   age_65-79                          358 non-null    int64 
 6   age_80+                            358 non-null    int64 
 7   Negative income                    358 non-null    int64 
 8   Nil income                         358 non-null    int64 
 9   $1-$149 ($1-$7,799)                358 non-null    int64 
 10  $150-$299 ($7,800-$15,599)         358 non-null    int64 
 11  $300-$399 ($15,600-$20,799)        358 non-null    int64 
 12  $400-$49

In [419]:
# combining the narrower brackets into a single bracket so it is consistent with the census 2011 dataset
# (the boundaries do not align exactly: 2016 splits at $33,799, 2011 at $31,199)
df_census_2016["average_income_$26000"] = (
    df_census_2016["$400-$499 ($20,800-$25,999)"]
    + df_census_2016["$500-$649 ($26,000-$33,799)"]
)

df_census_2016["average_income_$91000"] = (
    df_census_2016["$1,500-$1,749 ($78,000-$90,999)"]
    + df_census_2016["$1,750-$1,999 ($91,000-$103,999)"]
)

In [420]:
# drop the income columns that were combined into the new columns above
df_census_2016.drop(
    [
        "$400-$499 ($20,800-$25,999)",
        "$500-$649 ($26,000-$33,799)",
        "$1,500-$1,749 ($78,000-$90,999)",
        "$1,750-$1,999 ($91,000-$103,999)",
    ],
    inplace=True,
    axis=1,
)
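
The combine-then-drop pattern used above can be wrapped in a small helper. This is only a sketch on a toy frame; `combine_brackets` is a hypothetical name, not a function from the notebook, and the split of counts between the two source columns is made up.

```python
import pandas as pd


def combine_brackets(df: pd.DataFrame, sources: list[str], target: str) -> pd.DataFrame:
    """Sum the source bracket columns into `target` and drop the sources."""
    out = df.copy()
    out[target] = out[sources].sum(axis=1)
    return out.drop(columns=sources)


# Toy frame with illustrative counts for two narrow 2016 brackets.
df_toy = pd.DataFrame(
    {
        "SA3": ["10102", "10103"],
        "$400-$499 ($20,800-$25,999)": [2000, 1000],
        "$500-$649 ($26,000-$33,799)": [3389, 1899],
    }
)
df_combined = combine_brackets(
    df_toy,
    ["$400-$499 ($20,800-$25,999)", "$500-$649 ($26,000-$33,799)"],
    "average_income_$26000",
)
print(df_combined)
```

One difference from the `+` operator used in the notebook: `sum(axis=1)` treats missing values as 0, whereas `+` propagates them. With fully populated integer columns, as in the census data, both give the same result.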

In [421]:
# rename the columns
new_column_names = {
    "$1-$149 ($1-$7,799)": "average_income_$5200",
    "$150-$299 ($7,800-$15,599)": "average_income_$13000",
    "$300-$399 ($15,600-$20,799)": "average_income_$18200",
    "$650-$799 ($33,800-$41,599)": "average_income_$36400",
    "$800-$999 ($41,600-$51,999)": "average_income_$46800",
    "$1,000-$1,249 ($52,000-$64,999)": "average_income_$58500",
    "$1,250-$1,499 ($65,000-$77,999)": "average_income_$71500",
    "$2,000-$2,999 ($104,000-$155,999)": "average_income_$130000",
    "$3,000 or more ($156,000 or more)": "average_income_$169000+",
    "Negative income": "negative_income",
    "Nil income": "no_income",
    "Not stated": "not_stated",
    "Not applicable": "not_applicable",
}
df_census_2016.rename(columns=new_column_names, inplace=True)

In [422]:
# commenting out average_income_$200000+
""" df_census_2016["average_income_$200000+"] = np.nan
df_census_2016["average_income_$200000+"] = df_census_2016[
    "average_income_$200000+"
].astype("Int64") """
df_census_2016.head()

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$71500,average_income_$130000,average_income_$169000+,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$26000,average_income_$91000
0,10102,11235,6877,15839,16352,5629,1395,161,3063,1697,...,3639,4091,1812,4308,11235,28830,28502,57333,5389,5532
1,10103,3275,2164,4489,5677,2830,936,69,839,603,...,871,561,292,1973,3275,10013,9356,19365,2899,1128
2,10104,10610,6067,11644,22307,15245,4768,250,3168,2481,...,2405,1224,671,6536,10610,34606,36035,70642,13167,2755
3,10105,6398,3988,8133,9835,5391,1822,143,1895,1099,...,1429,1055,399,3640,6398,17991,17571,35561,5320,2098
4,10106,7255,3921,7649,10066,5398,1650,169,1757,1186,...,1465,1513,817,2803,7255,17677,18259,35939,5037,2354


#### Census 2021

The 2016 and 2021 datasets contain more SA3 values (358) than the 2011 dataset (351). These inconsistencies are examined in the SA3 Level section below.

In [423]:
df_census_2021.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 28 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   SA3                                358 non-null    object
 1   age_0-14                           358 non-null    int64 
 2   age_15-24                          358 non-null    int64 
 3   age_25-44                          358 non-null    int64 
 4   age_45-64                          358 non-null    int64 
 5   age_65-79                          358 non-null    int64 
 6   age_80+                            358 non-null    int64 
 7   Negative income                    358 non-null    int64 
 8   Nil income                         358 non-null    int64 
 9   $1-$149 ($1-$7,799)                358 non-null    int64 
 10  $150-$299 ($7,800-$15,599)         358 non-null    int64 
 11  $300-$399 ($15,600-$20,799)        358 non-null    int64 
 12  $400-$49

In [424]:
# combining the narrower brackets into single brackets so they are consistent with the census 2011 dataset
df_census_2021["average_income_$26000"] = (
    df_census_2021["$400-$499 ($20,800-$25,999)"]
    + df_census_2021["$500-$649 ($26,000-$33,799)"]
)

df_census_2021["average_income_$91000"] = (
    df_census_2021["$1,500-$1,749 ($78,000-$90,999)"]
    + df_census_2021["$1,750-$1,999 ($91,000-$103,999)"]
)

df_census_2021["average_income_$169000+"] = (
    df_census_2021["$3,000-$3,499 ($156,000-$181,999)"]
    + df_census_2021["$3,500 or more ($182,000 or more)"]
)

In [425]:
# drop the income columns that were combined into the new columns above
df_census_2021.drop(
    [
        "$400-$499 ($20,800-$25,999)",
        "$500-$649 ($26,000-$33,799)",
        "$1,500-$1,749 ($78,000-$90,999)",
        "$1,750-$1,999 ($91,000-$103,999)",
        "$3,000-$3,499 ($156,000-$181,999)",
        "$3,500 or more ($182,000 or more)"
    ],
    inplace=True,
    axis=1,
)

In [426]:
# rename the columns
new_column_names = {
    "$1-$149 ($1-$7,799)": "average_income_$5200",
    "$150-$299 ($7,800-$15,599)": "average_income_$13000",
    "$300-$399 ($15,600-$20,799)": "average_income_$18200",
    "$650-$799 ($33,800-$41,599)": "average_income_$36400",
    "$800-$999 ($41,600-$51,999)": "average_income_$46800",
    "$1,000-$1,249 ($52,000-$64,999)": "average_income_$58500",
    "$1,250-$1,499 ($65,000-$77,999)": "average_income_$71500",
    "$2,000-$2,999 ($104,000-$155,999)": "average_income_$130000",
    "Negative income": "negative_income",
    "Nil income": "no_income",
    "Not stated": "not_stated",
    "Not applicable": "not_applicable",
}
df_census_2021.rename(columns=new_column_names, inplace=True)

In [427]:
df_census_2021.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   SA3                      358 non-null    object
 1   age_0-14                 358 non-null    int64 
 2   age_15-24                358 non-null    int64 
 3   age_25-44                358 non-null    int64 
 4   age_45-64                358 non-null    int64 
 5   age_65-79                358 non-null    int64 
 6   age_80+                  358 non-null    int64 
 7   negative_income          358 non-null    int64 
 8   no_income                358 non-null    int64 
 9   average_income_$5200     358 non-null    int64 
 10  average_income_$13000    358 non-null    int64 
 11  average_income_$18200    358 non-null    int64 
 12  average_income_$36400    358 non-null    int64 
 13  average_income_$46800    358 non-null    int64 
 14  average_income_$58500    358 non-null    i

### SA3 Level

It is important to ensure that the Statistical Area Level 3 (SA3) codes come from the same edition (the 2016 release) before stacking the census data and merging it with the MBS and patient experience data.

The MBS 2013-19 data uses SA3 areas and codes from the edition released in 2016, whereas Census 2011 uses SA3 areas and codes from the 2011 edition. The SA3 codes in Census 2011 therefore need to be updated to the corresponding 2016 SA3 codes.

A mapping from Statistical Area Level 3 2011 to Statistical Area Level 3 2016 was provided by the ABS. This mapping file is imported to update the values in the 2011 census.

In [428]:
# extracting sa3 values in to a set. The set will remove duplicates.
sa3_2011 = set(df_census_2011["SA3"])
sa3_2016 = set(df_census_2016["SA3"])
sa3_2021 = set(df_census_2021["SA3"])

# 2016 and 2021 have the same number of unique SA3 values, so check whether the two sets are identical
sa3_match = sa3_2021 == sa3_2016

print(sa3_match)

True


There are no differences between the 2021 and 2016 SA3 lists. This makes sense, since the census data used the 2016 edition of SA3.

In [429]:
# finding SA3 codes that exist in 2011 but not in 2016 census.
sa3_diff_11 = sa3_2011 - sa3_2016
print(sa3_diff_11)

{'21702', '50803', '10101', '50801', '31604', '50802', '50804', '30802', '50806', '50805', '80102'}


Eleven SA3 codes are in the 2011 census but not in the 2016 census. External investigation found that these SA3 areas were either folded into other suburbs or, because of their population, separated into multiple suburbs. When an area was separated, a new SA3 code and name were assigned.

Examples of 2011 to 2016 SA3 code conversions:

| 2011_SA3 | Name (2016)         | 2016_SA3 |
|----------|---------------------|----------|
| 50804    | Kimberley           | 51001    |
| 50805    | Mid West            | 51104    |
| 10101    | Goulburn - Mulwaree | 10105    |
| 31604    | Nambour             | 31607    |
| 50806    | East Pilbara        | 51002    |
| 80102    | Belconnen           | 80101    |
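
The set difference above only lists 2011 codes that are absent from 2016. A small sketch, using a handful of codes taken from the outputs in this notebook, shows how the reverse difference lists the 2016 codes that have no 2011 counterpart, which is where the newly created areas appear. The variable names are illustrative.

```python
# A few SA3 codes taken from the notebook outputs; the real sets are much larger.
codes_2011 = {"10101", "10102", "80101", "80102"}
codes_2016 = {"10102", "10105", "10106", "80101"}

only_in_2011 = codes_2011 - codes_2016  # retired or re-coded areas
only_in_2016 = codes_2016 - codes_2011  # newly created codes
in_either_not_both = codes_2011 ^ codes_2016  # symmetric difference

print(sorted(only_in_2011))
print(sorted(only_in_2016))
print(sorted(in_either_not_both))
```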

#### Importing 2011 to 2016 Mapping File

In [430]:
# import file containing SA3 2011 to 2016 mapping
df_sa3_mapping = pd.read_csv(
    os.path.join(path, "original_datasets/state_sal3/sa3_2011_to_2016_mapping.csv"),
    index_col=None,
)
df_sa3_mapping.head(5)

Unnamed: 0,SA3_CODE_2011,SA3_NAME_2011,SA3_CODE_2016,SA3_NAME_2016
0,10101,Goulburn - Yass,10105,Goulburn - Mulwaree
1,10101,Goulburn - Yass,10106,Young - Yass
2,10102,Queanbeyan,10102,Queanbeyan
3,10103,Snowy Mountains,10103,Snowy Mountains
4,10104,South Coast,10104,South Coast


In [431]:
df_sa3_mapping.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 369 entries, 0 to 368
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   SA3_CODE_2011  369 non-null    int64 
 1   SA3_NAME_2011  369 non-null    object
 2   SA3_CODE_2016  369 non-null    int64 
 3   SA3_NAME_2016  369 non-null    object
dtypes: int64(2), object(2)
memory usage: 11.7+ KB


In [432]:
# converting codes to strings to ensure there is consistency between census and mapping file columns
df_sa3_mapping["SA3_CODE_2011"] = df_sa3_mapping["SA3_CODE_2011"].astype("str")
df_sa3_mapping["SA3_CODE_2011"] = df_sa3_mapping["SA3_CODE_2011"].str.strip()
df_sa3_mapping["SA3_CODE_2016"] = df_sa3_mapping["SA3_CODE_2016"].astype("str")
df_sa3_mapping["SA3_CODE_2016"] = df_sa3_mapping["SA3_CODE_2016"].str.strip()
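
As an alternative to casting after import, the codes can be read as strings directly. The sketch below assumes the CSV has the column names shown above; the inline CSV text is a stand-in for the real mapping file and contains only rows copied from the output above.

```python
import io

import pandas as pd

# Inline stand-in for the mapping file, using rows shown in the output above.
csv_text = """SA3_CODE_2011,SA3_NAME_2011,SA3_CODE_2016,SA3_NAME_2016
10101,Goulburn - Yass,10105,Goulburn - Mulwaree
10101,Goulburn - Yass,10106,Young - Yass
10102,Queanbeyan,10102,Queanbeyan
"""

code_columns = ["SA3_CODE_2011", "SA3_CODE_2016"]
df_map = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in code_columns})

# Strip stray whitespace so the codes compare cleanly with the census SA3 column.
for col in code_columns:
    df_map[col] = df_map[col].str.strip()

print(df_map[code_columns].head())
```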

#### Census 2011 : Find & Replace SA3 codes

In [433]:
# Convert the missing SA3 codes in a set to string
sa3_diff_11_str = set(map(str, sa3_diff_11))
sa3_diff_11_str

{'10101',
 '21702',
 '30802',
 '31604',
 '50801',
 '50802',
 '50803',
 '50804',
 '50805',
 '50806',
 '80102'}

In [434]:
# extract the mapping for missing SA3 codes in 2011
df_sa3_mapping_missing = df_sa3_mapping[
    df_sa3_mapping["SA3_CODE_2011"].isin(sa3_diff_11_str)
]
df_sa3_mapping_missing

Unnamed: 0,SA3_CODE_2011,SA3_NAME_2011,SA3_CODE_2016,SA3_NAME_2016
0,10101,Goulburn - Yass,10105,Goulburn - Mulwaree
1,10101,Goulburn - Yass,10106,Young - Yass
159,21702,Warrnambool - Otway Ranges,21703,Colac - Corangamite
160,21702,Warrnambool - Otway Ranges,21704,Warrnambool
197,30802,Gladstone - Biloela,30804,Biloela
198,30802,Gladstone - Biloela,30805,Gladstone
238,31604,Nambour - Pomona,31607,Nambour
239,31604,Nambour - Pomona,31608,Noosa Hinterland
308,50801,Esperance,51101,Esperance
309,50802,Gascoyne,51102,Gascoyne


In [435]:
# SA3_Code_2011 has multiple mappings for SA3 codes as areas have been separated to multiple suburbs.
# For the project we are going to pick the first instance of the duplicate.
df_sa3_mapping_missing = df_sa3_mapping_missing.drop_duplicates(
    subset=["SA3_CODE_2011"], keep="first", ignore_index=False
)
df_sa3_mapping_missing

Unnamed: 0,SA3_CODE_2011,SA3_NAME_2011,SA3_CODE_2016,SA3_NAME_2016
0,10101,Goulburn - Yass,10105,Goulburn - Mulwaree
159,21702,Warrnambool - Otway Ranges,21703,Colac - Corangamite
197,30802,Gladstone - Biloela,30804,Biloela
238,31604,Nambour - Pomona,31607,Nambour
308,50801,Esperance,51101,Esperance
309,50802,Gascoyne,51102,Gascoyne
310,50803,Goldfields,51103,Goldfields
311,50804,Kimberley,51001,Kimberley
312,50805,Mid West,51104,Mid West
313,50806,Pilbara,51002,East Pilbara
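
Keeping the first mapping assigns the whole 2011 population of a split area to just one of its 2016 successors. Before making that choice, it can help to list which 2011 codes are affected. The sketch below uses rows copied from the mapping output above; the variable names are illustrative.

```python
import pandas as pd

# Rows copied from the mapping output above.
df_map = pd.DataFrame(
    {
        "SA3_CODE_2011": ["10101", "10101", "21702", "21702", "50801"],
        "SA3_CODE_2016": ["10105", "10106", "21703", "21704", "51101"],
    }
)

# keep=False marks every row of a duplicated 2011 code, not just the repeats.
is_split = df_map["SA3_CODE_2011"].duplicated(keep=False)
split_codes = sorted(df_map.loc[is_split, "SA3_CODE_2011"].unique())
print(split_codes)

# Same choice as the notebook: keep the first 2016 code per 2011 code.
first_mapping = (
    df_map.drop_duplicates(subset=["SA3_CODE_2011"], keep="first")
    .set_index("SA3_CODE_2011")["SA3_CODE_2016"]
    .to_dict()
)
print(first_mapping)
```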


In [436]:
# set SA3_CODE_2011 as index and extract corresponding SA3_CODE_2016 value.
# convert to dictionary for replacement function
df_mapping_dict = df_sa3_mapping_missing.set_index("SA3_CODE_2011")[
    "SA3_CODE_2016"
].to_dict()
df_mapping_dict

{'10101': '10105',
 '21702': '21703',
 '30802': '30804',
 '31604': '31607',
 '50801': '51101',
 '50802': '51102',
 '50803': '51103',
 '50804': '51001',
 '50805': '51104',
 '50806': '51002',
 '80102': '80101'}

In [437]:
# replacing SA3 2011 with corresponding SA3 2016 codes for consistency
df_census_2011["SA3"] = df_census_2011["SA3"].replace(df_mapping_dict)

df_census_2011["SA3"].value_counts(dropna=False)

80101    2
10105    1
39797    1
31904    1
31903    1
        ..
20606    1
20605    1
20604    1
20603    1
99999    1
Name: SA3, Length: 350, dtype: int64

In [438]:
# investigating why SA3 code 80101 has 2 rows
df_census_2011[df_census_2011["SA3"] == "80101"]

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$58500,average_income_$71500,average_income_$91000,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+
335,80101,17048,14352,28926,21854,7881,2386,272,4853,4970,...,8108,6824,9143,6625,4771,17048,45670,46775,92444,
336,80101,117,92,170,126,34,4,0,19,37,...,45,35,43,46,22,117,296,255,547,


SA3 code 80101 is expected to have 2 rows, since Cotter - Namadgi (80102) was folded into the existing suburb Belconnen (80101). To ensure each SA3 is unique in the census data, the rows are combined.

In [439]:
# sum rows that share an SA3 code after the replacement, so each SA3 appears once in the census 2011 data
df_census_2011_new = df_census_2011.groupby(by=["SA3"], as_index=False).sum()

# checking SA3 80101 has 1 record and values were added.
df_census_2011_new[df_census_2011_new["SA3"] == "80101"]

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$58500,average_income_$71500,average_income_$91000,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+
335,80101,17165,14444,29096,21980,7915,2390,272,4872,5007,...,8153,6859,9186,6671,4793,17165,45966,47030,92991,0
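
The combine step can be illustrated on a toy frame. The sketch below uses the `age_80+` values of the two 80101 rows shown above, plus an all-missing nullable column to show why `average_income_$169000+` changes from blank to 0 after grouping: with the default `min_count=0`, a sum over only missing values yields 0.

```python
import pandas as pd

# Two 2011 rows that end up with the same 2016 code, plus one untouched row.
df_toy = pd.DataFrame(
    {
        "SA3": ["80101", "80102", "10102"],
        "age_80+": [2386, 4, 1239],
        "average_income_$169000+": pd.array([pd.NA, pd.NA, pd.NA], dtype="Int64"),
    }
)

# Re-code 80102 to 80101, then add up rows that now share an SA3 code.
df_toy["SA3"] = df_toy["SA3"].replace({"80102": "80101"})
df_grouped = df_toy.groupby("SA3", as_index=False).sum()
print(df_grouped)
```

If the distinction between "not collected" and "zero people" matters downstream, the filled-in zeros for 2011 should be kept in mind when interpreting the top bracket.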


In [440]:
df_census_2011_new[df_census_2011_new["SA3"] == "80102"]

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$58500,average_income_$71500,average_income_$91000,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+


In [441]:
# checking the counts
df_census_2011_new["SA3"].value_counts()

10102    1
31606    1
31904    1
31903    1
31902    1
        ..
20606    1
20605    1
20604    1
20603    1
99999    1
Name: SA3, Length: 350, dtype: int64

In [442]:
# checking difference in length in SA3 census 2011 and 2016. Expected the length to be different
sa3_2011_v1 = set(df_census_2011["SA3"])
sa3_2011_v1 = set(map(str.strip, sa3_2011_v1))
print(len(sa3_2011_v1))
print(len(sa3_2016))

350
358


In [443]:
# checking that the replacement has worked and all SA3 values that exist in 2011 also exist in 2016
# expected result is an empty set
sa3_diff_vals = sa3_2011_v1 - sa3_2016
print(sa3_diff_vals)

set()


### Year

A Year field needs to be added to each dataset to identify which census the data is for.

In [444]:
df_census_2011_new["Year"] = "2011"
df_census_2011_new["Year"] = df_census_2011_new["Year"].astype("str")
df_census_2011_new.head()

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$71500,average_income_$91000,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+,Year
0,10102,11025,6762,15554,14760,4171,1239,174,2503,2489,...,3528,4495,4085,3007,11025,26852,26660,53511,0,2011
1,10103,3641,2198,4485,5530,2264,818,83,799,1103,...,695,787,530,1642,3641,9736,9193,18931,0,2011
2,10104,11507,6347,12033,21878,11980,4123,295,2846,4480,...,1679,1965,1123,4432,11507,33368,34514,67880,0,2011
3,10105,13648,7590,15585,18904,8708,3072,333,3126,3900,...,2656,3222,2258,4338,13648,33658,33841,67500,0,2011
4,10201,30502,19699,37671,43996,20940,10297,640,8641,10113,...,6587,8093,6307,10701,30502,78677,84435,163110,0,2011


In [445]:
df_census_2016["Year"] = "2016"
df_census_2016["Year"] = df_census_2016["Year"].astype("str")
df_census_2016.head()

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$130000,average_income_$169000+,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$26000,average_income_$91000,Year
0,10102,11235,6877,15839,16352,5629,1395,161,3063,1697,...,4091,1812,4308,11235,28830,28502,57333,5389,5532,2016
1,10103,3275,2164,4489,5677,2830,936,69,839,603,...,561,292,1973,3275,10013,9356,19365,2899,1128,2016
2,10104,10610,6067,11644,22307,15245,4768,250,3168,2481,...,1224,671,6536,10610,34606,36035,70642,13167,2755,2016
3,10105,6398,3988,8133,9835,5391,1822,143,1895,1099,...,1055,399,3640,6398,17991,17571,35561,5320,2098,2016
4,10106,7255,3921,7649,10066,5398,1650,169,1757,1186,...,1513,817,2803,7255,17677,18259,35939,5037,2354,2016


In [446]:
df_census_2021["Year"] = "2021"
df_census_2021["Year"] = df_census_2021["Year"].astype("str")
df_census_2021.head()

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$26000,average_income_$91000,average_income_$169000+,Year
0,10102,12643,6973,18711,17687,7006,1767,258,3127,1472,...,6822,3193,12643,32664,32122,64793,5562,7841,3664,2021
1,10103,3354,2275,5029,5915,3148,1001,127,911,516,...,1099,1722,3354,10674,10046,20717,2660,1813,636,2021
2,10104,11075,6302,13133,22453,18433,5339,468,3564,2315,...,2465,5731,11075,37632,39101,76736,13640,4598,1315,2021
3,10105,6771,4065,8928,10245,6322,2071,270,2052,1014,...,1923,2962,6771,19335,19072,38403,5438,3020,835,2021
4,10106,7401,4045,8076,10522,6198,1911,331,1922,1021,...,2353,2544,7401,18787,19369,38159,4725,3248,1424,2021


## Combine Census Datasets

In [447]:
df_census_2011_new.shape

(350, 26)

In [448]:
df_census_2016.shape

(358, 26)

In [449]:
df_census_2021.shape

(358, 26)

All three datasets have 26 columns, so the final combined census dataset is expected to have 26 columns and 1,066 rows (350 + 358 + 358), with no blanks.

In [450]:
# checking for common columns. Expected 26 columns to be in common
df_census_2011_new.columns.intersection(df_census_2016.columns)

Index(['SA3', 'age_0-14', 'age_15-24', 'age_25-44', 'age_45-64', 'age_65-79',
       'age_80+', 'negative_income', 'no_income', 'average_income_$5200',
       'average_income_$13000', 'average_income_$18200',
       'average_income_$26000', 'average_income_$36400',
       'average_income_$46800', 'average_income_$58500',
       'average_income_$71500', 'average_income_$91000',
       'average_income_$130000', 'not_stated', 'not_applicable', 'male_pop',
       'female_pop', 'total_population', 'average_income_$169000+', 'Year'],
      dtype='object')

In [451]:
df_census_2021.columns.intersection(df_census_2016.columns)

Index(['SA3', 'age_0-14', 'age_15-24', 'age_25-44', 'age_45-64', 'age_65-79',
       'age_80+', 'negative_income', 'no_income', 'average_income_$5200',
       'average_income_$13000', 'average_income_$18200',
       'average_income_$36400', 'average_income_$46800',
       'average_income_$58500', 'average_income_$71500',
       'average_income_$130000', 'not_stated', 'not_applicable', 'male_pop',
       'female_pop', 'total_population', 'average_income_$26000',
       'average_income_$91000', 'average_income_$169000+', 'Year'],
      dtype='object')
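
Comparing intersections by eye is error-prone because the column order differs between the frames. The sketch below, on empty toy frames whose columns are in a different order, compares the column sets directly. The frame names are illustrative.

```python
import pandas as pd

# Toy frames: df_a and df_b share the same columns in a different order; df_c lacks one.
df_a = pd.DataFrame(columns=["SA3", "average_income_$26000", "average_income_$36400", "Year"])
df_b = pd.DataFrame(columns=["SA3", "average_income_$36400", "Year", "average_income_$26000"])
df_c = pd.DataFrame(columns=["SA3", "average_income_$36400", "Year"])

# An empty symmetric difference means both frames have exactly the same columns.
mismatch_ab = set(df_a.columns) ^ set(df_b.columns)
mismatch_ac = set(df_a.columns) ^ set(df_c.columns)
print(mismatch_ab)
print(mismatch_ac)
```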

In [452]:
# vertically stack the dataframes to create a single dataset covering census years 2011, 2016 and 2021
df_census_combined_2011_16_21 = pd.concat(
    [df_census_2011_new, df_census_2016, df_census_2021],
    axis=0,
    ignore_index=True,
)
df_census_combined_2011_16_21.shape

(1066, 26)
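
A few sanity checks after stacking can confirm the expectations stated above. This sketch uses small toy frames with values taken from the first rows of each year; the same checks could be applied to the real frames.

```python
import pandas as pd

# Toy stand-ins for the three yearly frames.
frames = [
    pd.DataFrame({"SA3": ["10102", "10103"], "total_population": [53511, 18931], "Year": "2011"}),
    pd.DataFrame({"SA3": ["10102", "10103"], "total_population": [57333, 19365], "Year": "2016"}),
    pd.DataFrame({"SA3": ["10102", "10103"], "total_population": [64793, 20717], "Year": "2021"}),
]
combined = pd.concat(frames, axis=0, ignore_index=True)

# Row count should equal the sum of the inputs.
rows_ok = len(combined) == sum(len(f) for f in frames)
# No blanks should appear if every frame has the same columns.
no_blanks = not combined.isna().any().any()
# Each SA3 should appear at most once per census year.
unique_keys = not combined.duplicated(subset=["SA3", "Year"]).any()
print(rows_ok, no_blanks, unique_keys)
```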

In [453]:
df_census_combined_2011_16_21

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+,negative_income,no_income,average_income_$5200,...,average_income_$71500,average_income_$91000,average_income_$130000,not_stated,not_applicable,male_pop,female_pop,total_population,average_income_$169000+,Year
0,10102,11025,6762,15554,14760,4171,1239,174,2503,2489,...,3528,4495,4085,3007,11025,26852,26660,53511,0,2011
1,10103,3641,2198,4485,5530,2264,818,83,799,1103,...,695,787,530,1642,3641,9736,9193,18931,0,2011
2,10104,11507,6347,12033,21878,11980,4123,295,2846,4480,...,1679,1965,1123,4432,11507,33368,34514,67880,0,2011
3,10105,13648,7590,15585,18904,8708,3072,333,3126,3900,...,2656,3222,2258,4338,13648,33658,33841,67500,0,2011
4,10201,30502,19699,37671,43996,20940,10297,640,8641,10113,...,6587,8093,6307,10701,30502,78677,84435,163110,0,2011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1061,90102,130,49,147,166,80,21,3,30,8,...,23,30,28,84,130,302,292,593,13,2021
1062,90103,65,42,69,84,38,10,0,9,3,...,10,11,14,89,65,149,158,310,6,2021
1063,90104,351,186,404,705,423,117,21,71,73,...,98,119,75,218,351,1060,1130,2188,22,2021
1064,99797,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2021


In [454]:
df_census_combined_2011_16_21.info()
df_census_combined_2011_16_21.to_clipboard()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1066 entries, 0 to 1065
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   SA3                      1066 non-null   object
 1   age_0-14                 1066 non-null   int64 
 2   age_15-24                1066 non-null   int64 
 3   age_25-44                1066 non-null   int64 
 4   age_45-64                1066 non-null   int64 
 5   age_65-79                1066 non-null   int64 
 6   age_80+                  1066 non-null   int64 
 7   negative_income          1066 non-null   int64 
 8   no_income                1066 non-null   int64 
 9   average_income_$5200     1066 non-null   int64 
 10  average_income_$13000    1066 non-null   int64 
 11  average_income_$18200    1066 non-null   int64 
 12  average_income_$26000    1066 non-null   int64 
 13  average_income_$36400    1066 non-null   int64 
 14  average_income_$46800    1066 non-null  

In [455]:
df_census_combined_2011_16_21.to_pickle(
    os.path.join(path, "clean_datasets/census_data/2011_16_21_census_combined.pkl")
)