# Clean & Transform Census Datasets

This notebook imports, cleans and combines the census datasets for 2011, 2016 and 2021. The census data for each year contains population counts by age, sex and personal income.

Each census year has three separate datasets: population by age, by gender and by income. These datasets are first cleaned individually and then combined to create one dataset per year.

In [30]:
import pandas as pd
import numpy as np
import os

## Import & Combine 2011 Census Datasets

### Import & Clean 2011 Population-Age Dataset

In [31]:
# set file path and import 2011 population by age dataset
path = r"/Users/patel/Documents/CF-Data Anaylst Course/portfolio_projects/mbs_analysis/datasets/"

df_census_age_2011 = pd.read_csv(
    os.path.join(path, "original_datasets/census_data/2011/2011_population_age.csv"),
    index_col=False,
    encoding="ISO-8859-1",
)
df_census_age_2011.head()

Unnamed: 0,SA3,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+
0,10101,13648,7590,15585,18904,8708,3072
1,10102,11025,6762,15554,14760,4171,1239
2,10103,3641,2198,4485,5530,2264,818
3,10104,11507,6347,12033,21878,11980,4123
4,10201,30502,19699,37671,43996,20940,10297


In [32]:
df_census_age_2011.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   SA3        351 non-null    int64
 1   age_0-14   351 non-null    int64
 2   age_15-24  351 non-null    int64
 3   age_25-44  351 non-null    int64
 4   age_45-64  351 non-null    int64
 5   age_65-79  351 non-null    int64
 6   age_80+    351 non-null    int64
dtypes: int64(7)
memory usage: 19.3 KB


In [33]:
# convert SA3 to string for merges with other datasets (census and MBS)
df_census_age_2011["SA3"] = df_census_age_2011["SA3"].astype("str")
df_census_age_2011.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   SA3        351 non-null    object
 1   age_0-14   351 non-null    int64 
 2   age_15-24  351 non-null    int64 
 3   age_25-44  351 non-null    int64 
 4   age_45-64  351 non-null    int64 
 5   age_65-79  351 non-null    int64 
 6   age_80+    351 non-null    int64 
dtypes: int64(6), object(1)
memory usage: 19.3+ KB
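
As an alternative to casting after import, the SA3 code can be read as text when the file is loaded. The sketch below uses a small in-memory sample (hypothetical, shaped like the 2011 population-by-age file) rather than the real CSV, and passes `dtype={"SA3": str}` to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as the 2011 population-by-age file
sample_csv = io.StringIO(
    "SA3,age_0-14,age_15-24\n"
    "10101,13648,7590\n"
    "10102,11025,6762\n"
)

# dtype={"SA3": str} keeps the area code as text from the start,
# so no separate astype() step is needed before merging
df_sample = pd.read_csv(sample_csv, dtype={"SA3": str})

# confirm every SA3 value is a Python string
sa3_is_text = df_sample["SA3"].map(type).eq(str).all()
print(sa3_is_text)
```

Reading the column as text from the start also avoids losing leading zeros, should any area code have them.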


In [34]:
df_census_age_2011.describe()

Unnamed: 0,age_0-14,age_15-24,age_25-44,age_45-64,age_65-79,age_80+
count,351.0,351.0,351.0,351.0,351.0,351.0
mean,11806.190883,8166.660969,17179.524217,15541.148148,6189.88604,2392.233618
std,8339.142607,6068.821251,13064.111914,10225.126635,4261.153964,1902.499656
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,6461.0,4083.5,8002.0,8539.5,3072.5,1019.5
50%,9796.0,6702.0,13932.0,13716.0,5595.0,1987.0
75%,15842.5,10846.0,23419.0,21342.5,8696.5,3215.0
max,39077.0,28164.0,87146.0,45569.0,22249.0,10297.0


The census population-by-age dataset contains 351 SA3 areas in total. This count is plausible given that the MBS 2013-19 data contains 346 unique SA3 areas. Further checks for missing SA3 areas will be conducted before merging the datasets.
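
One way to check for missing SA3 areas is with set differences. This is a minimal sketch: the codes below are hypothetical placeholders, and in the notebook they would come from `df_census_age_2011["SA3"]` and the SA3 column of the MBS dataset (which is not loaded here).

```python
# Hypothetical SA3 codes standing in for the real census and MBS columns
census_sa3 = {"10101", "10102", "10103", "99797"}
mbs_sa3 = {"10101", "10102", "10103"}

# codes present in one dataset but not the other
only_in_census = sorted(census_sa3 - mbs_sa3)
only_in_mbs = sorted(mbs_sa3 - census_sa3)

print("Only in census:", only_in_census)
print("Only in MBS:", only_in_mbs)
```

Codes that appear only in the MBS set are the ones that an inner merge with the census data would drop.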

### Import & Clean 2011 Population Personal Income Dataset

In [35]:
# set file path and import 2011 population by personal income dataset
path = r"/Users/patel/Documents/CF-Data Anaylst Course/portfolio_projects/mbs_analysis/datasets/"

df_census_personal_income_2011 = pd.read_csv(
    os.path.join(
        path, "original_datasets/census_data/2011/2011_population_personal_income.csv"
    ),
    index_col=False,
    encoding="ISO-8859-1",
)
df_census_personal_income_2011.head()

Unnamed: 0,SA3,Negative income,Nil income,"$1-$199 ($1-$10,399)","$200-$299 ($10,400-$15,599)","$300-$399 ($15,600-$20,799)","$400-$599 ($20,800-$31,199)","$600-$799 ($31,200-$41,599)","$800-$999 ($41,600-$51,999)","$1,000-$1,249 ($52,000-$64,999)","$1,250-$1,499 ($65,000-$77,999)","$1,500-$1,999 ($78,000-$103,999)","$2,000 or more ($104,000 or more)",Not stated,Not applicable,Total
0,10101,333,3126,3900,6157,6630,7080,5958,4256,3934,2656,3222,2258,4338,13648,67500
1,10102,174,2503,2489,2660,3051,3978,4239,3859,4422,3528,4495,4085,3007,11025,53511
2,10103,83,799,1103,1479,1746,2064,1878,1344,1134,695,787,530,1642,3641,18931
3,10104,295,2846,4480,8533,8105,9082,6388,4243,3191,1679,1965,1123,4432,11507,67880
4,10201,640,8641,10113,13944,15751,17655,13977,10481,9723,6587,8093,6307,10701,30502,163110


In [36]:
df_census_personal_income_2011.shape

(351, 16)

In [37]:
df_census_personal_income_2011.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   SA3                                351 non-null    int64
 1   Negative income                    351 non-null    int64
 2   Nil income                         351 non-null    int64
 3   $1-$199 ($1-$10,399)               351 non-null    int64
 4   $200-$299 ($10,400-$15,599)        351 non-null    int64
 5   $300-$399 ($15,600-$20,799)        351 non-null    int64
 6   $400-$599 ($20,800-$31,199)        351 non-null    int64
 7   $600-$799 ($31,200-$41,599)        351 non-null    int64
 8   $800-$999 ($41,600-$51,999)        351 non-null    int64
 9   $1,000-$1,249 ($52,000-$64,999)    351 non-null    int64
 10  $1,250-$1,499 ($65,000-$77,999)    351 non-null    int64
 11  $1,500-$1,999 ($78,000-$103,999)   351 non-null    int64
 12  $2,000 or more ($104,0

In [38]:
# convert SA3 to string for merges with other datasets (census and MBS)
# note: cast the income dataset's own SA3 column, not the age dataset's
df_census_personal_income_2011["SA3"] = df_census_personal_income_2011["SA3"].astype("str")
df_census_personal_income_2011.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   SA3                                351 non-null    object
 1   Negative income                    351 non-null    int64 
 2   Nil income                         351 non-null    int64 
 3   $1-$199 ($1-$10,399)               351 non-null    int64 
 4   $200-$299 ($10,400-$15,599)        351 non-null    int64 
 5   $300-$399 ($15,600-$20,799)        351 non-null    int64 
 6   $400-$599 ($20,800-$31,199)        351 non-null    int64 
 7   $600-$799 ($31,200-$41,599)        351 non-null    int64 
 8   $800-$999 ($41,600-$51,999)        351 non-null    int64 
 9   $1,000-$1,249 ($52,000-$64,999)    351 non-null    int64 
 10  $1,250-$1,499 ($65,000-$77,999)    351 non-null    int64 
 11  $1,500-$1,999 ($78,000-$103,999)   351 non-null    int64 
 12  $2,000 o

In [39]:
df_census_personal_income_2011.describe()

Unnamed: 0,Negative income,Nil income,"$1-$199 ($1-$10,399)","$200-$299 ($10,400-$15,599)","$300-$399 ($15,600-$20,799)","$400-$599 ($20,800-$31,199)","$600-$799 ($31,200-$41,599)","$800-$999 ($41,600-$51,999)","$1,000-$1,249 ($52,000-$64,999)","$1,250-$1,499 ($65,000-$77,999)","$1,500-$1,999 ($78,000-$103,999)","$2,000 or more ($104,000 or more)",Not stated,Not applicable,Total
count,351.0,351.0,351.0,351.0,351.0,351.0,351.0,351.0,351.0,351.0,351.0,351.0,351.0,351.0,351.0
mean,290.849003,3744.478632,3662.404558,5120.683761,4888.470085,5715.242165,5127.193732,4092.131054,3908.675214,2733.763533,3193.450142,3080.592593,3910.615385,11806.193732,61275.410256
std,215.583083,3165.936958,2641.206295,3776.026852,3382.369116,3862.770369,3568.099756,2917.502148,2842.593911,2061.148821,2553.227144,3521.87954,2887.13986,8339.139214,41775.190657
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,146.0,1678.0,1863.0,2451.0,2439.5,2989.0,2527.0,2109.5,1919.0,1292.5,1402.0,871.5,1997.5,6461.0,33165.5
50%,245.0,2844.0,3079.0,4415.0,4464.0,5119.0,4493.0,3610.0,3329.0,2247.0,2552.0,1936.0,3264.0,9796.0,53373.0
75%,377.0,4752.5,4871.0,7091.5,6687.0,7828.0,6700.5,5469.0,5286.5,3795.5,4480.0,4029.5,5182.0,15842.5,82641.0
max,1211.0,16505.0,13276.0,23608.0,16338.0,18110.0,16934.0,12938.0,13685.0,11634.0,15848.0,21922.0,22876.0,39077.0,178432.0


In [40]:
df_census_personal_income_2011.drop(["Total"], axis=1, inplace=True)
df_census_personal_income_2011.columns

Index(['SA3', 'Negative income', 'Nil income', '$1-$199 ($1-$10,399)',
       '$200-$299 ($10,400-$15,599)', '$300-$399 ($15,600-$20,799)',
       '$400-$599 ($20,800-$31,199)', '$600-$799 ($31,200-$41,599)',
       '$800-$999 ($41,600-$51,999)', '$1,000-$1,249 ($52,000-$64,999)',
       '$1,250-$1,499 ($65,000-$77,999)', '$1,500-$1,999 ($78,000-$103,999)',
       '$2,000 or more ($104,000 or more)', 'Not stated', 'Not applicable'],
      dtype='object')

In [41]:
rename_columns = {
    "$1-$199 ($1-$10,399)": "$1-$10,399",
    "$200-$299 ($10,400-$15,599)": "$10,400-$15,599",
    "$300-$399 ($15,600-$20,799)": "$15,600-$20,799",
    "$400-$599 ($20,800-$31,199)": "$20,800-$31,199",
    "$600-$799 ($31,200-$41,599)": "$31,200-$41,599",
    "$800-$999 ($41,600-$51,999)": "$41,600-$51,999",
    "$1,000-$1,249 ($52,000-$64,999)": "$52,000-$64,999",
    "$1,250-$1,499 ($65,000-$77,999)": "$65,000-$77,999",
    "$1,500-$1,999 ($78,000-$103,999)": "$78,000-$103,999",
    "$2,000 or more ($104,000 or more)": "$104,000+",
}

df_census_personal_income_2011.rename(columns=rename_columns, inplace=True)
df_census_personal_income_2011

Unnamed: 0,SA3,Negative income,Nil income,"$1-$10,399","$10,400-$15,599","$15,600-$20,799","$20,800-$31,199","$31,200-$41,599","$41,600-$51,999","$52,000-$64,999","$65,000-$77,999","$78,000-$103,999","$104,000+",Not stated,Not applicable
0,10101,333,3126,3900,6157,6630,7080,5958,4256,3934,2656,3222,2258,4338,13648
1,10102,174,2503,2489,2660,3051,3978,4239,3859,4422,3528,4495,4085,3007,11025
2,10103,83,799,1103,1479,1746,2064,1878,1344,1134,695,787,530,1642,3641
3,10104,295,2846,4480,8533,8105,9082,6388,4243,3191,1679,1965,1123,4432,11507
4,10201,640,8641,10113,13944,15751,17655,13977,10481,9723,6587,8093,6307,10701,30502
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
346,90101,3,68,27,39,38,50,109,128,181,110,116,88,834,263
347,90102,8,35,27,49,45,71,47,21,24,17,23,18,36,129
348,90103,0,20,15,44,30,27,22,22,15,22,20,10,26,103
349,99797,0,0,0,0,0,0,0,0,0,0,0,0,0,0
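
The rename mapping could also be generated from the column labels instead of typed by hand. The sketch below assumes every banded label follows the pattern `weekly range (annual range)`. The helper `annual_label` is a hypothetical name, and only three of the original labels are used for illustration.

```python
import re

# Three of the original 2011 income column labels
weekly_labels = [
    "$1-$199 ($1-$10,399)",
    "$1,000-$1,249 ($52,000-$64,999)",
    "$2,000 or more ($104,000 or more)",
]


def annual_label(label):
    """Return the bracketed annual range, or the label unchanged if it has none."""
    match = re.search(r"\((.+)\)$", label)
    if match is None:
        return label  # e.g. "Nil income" has no bracketed annual range
    return match.group(1).replace(" or more", "+")


generated_rename = {label: annual_label(label) for label in weekly_labels}
print(generated_rename)
```

The resulting dictionary can be passed to `DataFrame.rename(columns=...)` in the same way as the hand-written one.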


### Import & Clean 2011 Population-Gender Dataset

In [42]:
# import 2011 population by gender dataset

df_census_gender_2011 = pd.read_csv(
    os.path.join(path, "original_datasets/census_data/2011/2011_population_sex.csv"),
    index_col=False,
    encoding="ISO-8859-1",
)
df_census_gender_2011.head()

Unnamed: 0,SA3,male_pop,female_pop,total_population
0,10101,33658,33841,67500
1,10102,26852,26660,53511
2,10103,9736,9193,18931
3,10104,33368,34514,67880
4,10201,78677,84435,163110


In [43]:
df_census_gender_2011.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   SA3               351 non-null    int64
 1   male_pop          351 non-null    int64
 2   female_pop        351 non-null    int64
 3   total_population  351 non-null    int64
dtypes: int64(4)
memory usage: 11.1 KB


In [45]:
df_census_gender_2011["SA3"] = df_census_gender_2011["SA3"].astype("str")
df_census_gender_2011["SA3"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 351 entries, 0 to 350
Series name: SA3
Non-Null Count  Dtype 
--------------  ----- 
351 non-null    object
dtypes: object(1)
memory usage: 2.9+ KB


In [46]:
df_census_gender_2011.describe()

Unnamed: 0,male_pop,female_pop,total_population
count,351.0,351.0,351.0
mean,30296.333333,30979.34188,61275.410256
std,20556.012001,21249.844061,41775.190657
min,0.0,0.0,0.0
25%,16518.5,16347.5,33165.5
50%,26221.0,26755.0,53373.0
75%,40762.0,42135.0,82641.0
max,92088.0,90795.0,178432.0
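
In the `head()` output for this dataset, `male_pop` plus `female_pop` does not always equal `total_population` exactly (for SA3 10101, 33658 + 33841 = 67499 against a reported 67500). A tolerance-based consistency check might look like the sketch below, which uses only the five rows shown in that output. The threshold of 5 is an assumption, not a value from the source.

```python
import pandas as pd

# First five rows of the 2011 gender dataset, copied from the head() output
df_head = pd.DataFrame(
    {
        "SA3": ["10101", "10102", "10103", "10104", "10201"],
        "male_pop": [33658, 26852, 9736, 33368, 78677],
        "female_pop": [33841, 26660, 9193, 34514, 84435],
        "total_population": [67500, 53511, 18931, 67880, 163110],
    }
)

# absolute gap between male + female and the reported total
gap = (
    df_head["male_pop"] + df_head["female_pop"] - df_head["total_population"]
).abs()

tolerance = 5  # assumed threshold, not taken from the source
flagged = df_head.loc[gap > tolerance, "SA3"].tolist()
print(gap.tolist(), flagged)
```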


### Combine 2011 Age, Gender, Income Datasets

In [49]:
# check that all datasets have the same SA3 values

# extract SA3 values into sets (a set removes any duplicates)
sa3_age = set(df_census_age_2011["SA3"])
sa3_income = set(df_census_personal_income_2011["SA3"])
sa3_gender = set(df_census_gender_2011["SA3"])

# check whether the three sets are equal: diff_sa3 is True if they contain the same values, False otherwise
diff_sa3 = sa3_age == sa3_income == sa3_gender
print(diff_sa3)

True
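
Because a set collapses repeated values, the equality check above would not reveal an SA3 code that appears more than once within a single dataset. A small complementary check, sketched here on a hypothetical Series, is to test uniqueness directly.

```python
import pandas as pd

# Hypothetical SA3 column with one repeated code
sa3 = pd.Series(["10101", "10102", "10102"], name="SA3")

# a set would silently collapse the repeat, so test uniqueness on the Series itself
has_duplicates = not sa3.is_unique
repeated_codes = sa3[sa3.duplicated()].unique().tolist()
print(has_duplicates, repeated_codes)
```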


In [50]:
df_census_age_income_2011 = df_census_age_2011.merge(
    df_census_personal_income_2011, how="inner", on="SA3"
)
df_census_combined_2011 = df_census_age_income_2011.merge(
    df_census_gender_2011, how="inner", on="SA3"
)
df_census_combined_2011.shape

(351, 24)
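
The two-step merge can also be written as a single reduction over a list of frames. The sketch below uses hypothetical miniature frames in place of the three cleaned 2011 datasets. `merge_on_sa3` is an illustrative helper name, and `validate="one_to_one"` is added so that pandas raises an error if an SA3 code is repeated in either frame.

```python
from functools import reduce

import pandas as pd

# Hypothetical miniature versions of the three cleaned 2011 frames
age = pd.DataFrame({"SA3": ["10101", "10102"], "age_0-14": [13648, 11025]})
income = pd.DataFrame({"SA3": ["10101", "10102"], "Nil income": [3126, 2503]})
gender = pd.DataFrame({"SA3": ["10101", "10102"], "male_pop": [33658, 26852]})


def merge_on_sa3(left, right):
    # validate="one_to_one" raises pandas.errors.MergeError if SA3 repeats in either frame
    return left.merge(right, how="inner", on="SA3", validate="one_to_one")


combined = reduce(merge_on_sa3, [age, income, gender])
print(combined.shape)
```

This keeps the same inner-join behaviour as the cell above and extends to additional frames without new intermediate variables.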

In [51]:
df_census_combined_2011.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 351 entries, 0 to 350
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   SA3               351 non-null    object
 1   age_0-14          351 non-null    int64 
 2   age_15-24         351 non-null    int64 
 3   age_25-44         351 non-null    int64 
 4   age_45-64         351 non-null    int64 
 5   age_65-79         351 non-null    int64 
 6   age_80+           351 non-null    int64 
 7   Negative income   351 non-null    int64 
 8   Nil income        351 non-null    int64 
 9   $1-$10,399        351 non-null    int64 
 10  $10,400-$15,599   351 non-null    int64 
 11  $15,600-$20,799   351 non-null    int64 
 12  $20,800-$31,199   351 non-null    int64 
 13  $31,200-$41,599   351 non-null    int64 
 14  $41,600-$51,999   351 non-null    int64 
 15  $52,000-$64,999   351 non-null    int64 
 16  $65,000-$77,999   351 non-null    int64 
 17  $78,000-$103,999

In [53]:
df_census_combined_2011.describe().to_clipboard()