# General Community Profile Preprocessing

This section focuses on preprocessing and cleaning the GCP dataset. We incorporate this dataset because it provides detailed demographic and socioeconomic statistics at the SA2 level, including population counts, age distribution, household composition, education levels, and employment status. 

These variables enable us to build a richer understanding of the communities where our consumers and merchants are located. By linking GCP data to our transaction dataset through SA2 codes, we can capture how factors such as population density, workforce participation, and education levels correlate with spending patterns. This helps us improve merchant ranking by identifying areas with higher demand potential and enhance fraud detection by spotting transactions that deviate from the typical demographic and economic profile of their region.

## Import Libraries

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
from pathlib import Path

In [None]:
RAW_EXT = Path("../data/raw/external_dataset")
CLE = Path("../data/cleaned")
CUR = Path("../data/curated")

## Loading Data

Extract only the columns needed from the ABS tables:
- Population (G01) → total_population
- Income (G02) → median_weekly_personal_income, median_weekly_household_income
- Age distribution (G04A + G04B) → raw age counts, later binned.
- Labour force status (G46) → full-time employed, part-time employed, unemployed, labour force size, not in the labour force.

In [3]:
G01 = RAW_EXT / "2021_GCP_SA2_for_AUS_short-header/2021 Census GCP Statistical Area 2 for AUS/2021Census_G01_AUST_SA2.csv"
G02 = RAW_EXT / "2021_GCP_SA2_for_AUS_short-header/2021 Census GCP Statistical Area 2 for AUS/2021Census_G02_AUST_SA2.csv"
G04A = RAW_EXT / "2021_GCP_SA2_for_AUS_short-header/2021 Census GCP Statistical Area 2 for AUS/2021Census_G04A_AUST_SA2.csv"
G04B = RAW_EXT / "2021_GCP_SA2_for_AUS_short-header/2021 Census GCP Statistical Area 2 for AUS/2021Census_G04B_AUST_SA2.csv"
G46 = RAW_EXT / "2021_GCP_SA2_for_AUS_short-header/2021 Census GCP Statistical Area 2 for AUS/2021Census_G46B_AUST_SA2.csv"  # person totals are read from the G46B file

### Population

In [4]:
population = (
    pd.read_csv(G01, dtype={"SA2_CODE_2021": str}, usecols=["SA2_CODE_2021", "Tot_P_P"])
    .rename(columns={"SA2_CODE_2021": "SA2_code", "Tot_P_P": "total_population"})
)

# pad SA2 to 9 digits
population["SA2_code"] = (
    population["SA2_code"].astype(str).str.split(".").str[0].str.zfill(9)
)

print(population.head())

    SA2_code  total_population
0  101021007              4343
1  101021008              8517
2  101021009             11342
3  101021010              5085
4  101021012             12744
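The same normalisation (cast to string, drop any decimal suffix, left-pad to nine characters) is repeated for every table loaded below. A minimal sketch of a shared helper follows; the function name normalise_sa2 and the toy input values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd


def normalise_sa2(codes: pd.Series) -> pd.Series:
    """Return SA2 codes as 9-character, zero-padded strings."""
    # Cast to string, drop anything after a decimal point, then left-pad with zeros
    return codes.astype(str).str.split(".").str[0].str.zfill(9)


# Toy input: a float-like string, a short code, and an already clean code
sample = pd.Series(["101021007.0", "1021008", "101021009"])
cleaned = normalise_sa2(sample)
print(cleaned.tolist())
```

With a helper like this, each loading cell could call normalise_sa2 on its SA2_code column instead of repeating the chained string operations.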


### Median total personal income (weekly) and median total household income (weekly)

In [5]:
median_income = (
    pd.read_csv(
        G02,
        dtype={"SA2_CODE_2021": str},
        usecols=[
            "SA2_CODE_2021",
            "Median_tot_prsnl_inc_weekly",
            "Median_tot_hhd_inc_weekly",
        ],
    )
    .rename(
        columns={
            "SA2_CODE_2021": "SA2_code",
            "Median_tot_prsnl_inc_weekly": "median_weekly_personal_income",
            "Median_tot_hhd_inc_weekly": "median_weekly_household_income",
        }
    )
)
median_income["SA2_code"] = (
    median_income["SA2_code"].astype(str).str.split(".").str[0].str.zfill(9)
)
print(median_income.head())

    SA2_code  median_weekly_personal_income  median_weekly_household_income
0  101021007                            760                            1429
1  101021008                            975                            1989
2  101021009                            996                            1703
3  101021010                           1104                            1796
4  101021012                           1357                            3014


### Age by sex

In [6]:
age_cols_a = [
    "Age_yr_0_4_P",
    "Age_yr_5_9_P",
    "Age_yr_10_P",
    "Age_yr_11_P",
    "Age_yr_12_P",
    "Age_yr_13_P",
    "Age_yr_14_P",
    "Age_yr_15_19_P",
    "Age_yr_20_24_P",
    "Age_yr_25_29_P",
    "Age_yr_30_34_P",
    "Age_yr_35_39_P",
    "Age_yr_40_44_P",
    "Age_yr_45_49_P",
    "Age_yr_50_54_P",
]
age_cols_b = [
    "Age_yr_55_59_P",
    "Age_yr_60_64_P",
    "Age_yr_65_69_P",
    "Age_yr_70_74_P",
    "Age_yr_75_79_P",
    "Age_yr_80_84_P",
    "Age_yr_85_89_P",
    "Age_yr_90_94_P",
    "Age_yr_95_99_P",
    "Age_yr_100_yr_over_P",
]

age_by_sex_a = pd.read_csv(
    G04A, dtype={"SA2_CODE_2021": str}, usecols=["SA2_CODE_2021"] + age_cols_a
).rename(columns={"SA2_CODE_2021": "SA2_code"})
age_by_sex_a["SA2_code"] = (
    age_by_sex_a["SA2_code"].astype(str).str.split(".").str[0].str.zfill(9)
)

age_by_sex_b = pd.read_csv(
    G04B, dtype={"SA2_CODE_2021": str}, usecols=["SA2_CODE_2021"] + age_cols_b
).rename(columns={"SA2_CODE_2021": "SA2_code"})
age_by_sex_b["SA2_code"] = (
    age_by_sex_b["SA2_code"].astype(str).str.split(".").str[0].str.zfill(9)
)

age_by_sex = age_by_sex_a.merge(age_by_sex_b, on="SA2_code", how="outer")
print(age_by_sex.head())

    SA2_code  Age_yr_0_4_P  Age_yr_5_9_P  Age_yr_10_P  Age_yr_11_P  \
0  101021007           207           254           40           56   
1  101021008           526           530          103          121   
2  101021009           669           587          104          100   
3  101021010           319           252           52           38   
4  101021012           835           914          187          204   

   Age_yr_12_P  Age_yr_13_P  Age_yr_14_P  Age_yr_15_19_P  Age_yr_20_24_P  ...  \
0           47           39           45             173             122  ...   
1          125          105          108             490             515  ...   
2           92           95           89             433             689  ...   
3           51           35           43             184             343  ...   
4          191          180          198             927             746  ...   

   Age_yr_55_59_P  Age_yr_60_64_P  Age_yr_65_69_P  Age_yr_70_74_P  \
0             353      
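Because G04A and G04B are joined with an outer merge, any SA2 code present in only one file would produce NaN in the other file's columns. One way to check for this, sketched here on toy frames with made-up counts, is to use the indicator and validate options of merge. The frame names and values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two age tables (values are illustrative only)
left = pd.DataFrame(
    {"SA2_code": ["101021007", "101021008"], "Age_yr_0_4_P": [10, 20]}
)
right = pd.DataFrame(
    {"SA2_code": ["101021007", "101021009"], "Age_yr_55_59_P": [30, 40]}
)

merged = left.merge(
    right,
    on="SA2_code",
    how="outer",
    indicator=True,          # adds a _merge column: left_only / right_only / both
    validate="one_to_one",   # raises if SA2_code is duplicated on either side
)

# SA2 codes that appear in only one of the two tables
unmatched = sorted(merged.loc[merged["_merge"] != "both", "SA2_code"].tolist())
print(unmatched)
```

If the list is empty on the real data, the outer merge behaves like an inner merge and introduces no missing values.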

### Labour force status

- P_Emp_FullT_Tot → full-time employed persons.
- P_Emp_PartT_Tot → part-time employed persons.
- P_Tot_Unemp_Tot → unemployed persons.
- P_Tot_LF_Tot → total in the labour force (employed + unemployed).
- P_Not_in_LF_Tot → persons not in the labour force.


In [7]:
labour_force_status = (
    pd.read_csv(
        G46,
        dtype={"SA2_CODE_2021": str},
        usecols=[
            "SA2_CODE_2021",
            "P_Emp_FullT_Tot",
            "P_Emp_PartT_Tot",
            "P_Tot_Unemp_Tot",
            "P_Tot_LF_Tot",
            "P_Not_in_LF_Tot",
        ],
    )
    .rename(
        columns={
            "SA2_CODE_2021": "SA2_code",
            "P_Emp_FullT_Tot": "full_time_employee_total",
            "P_Emp_PartT_Tot": "part_time_employee_total",
            "P_Tot_Unemp_Tot": "unemployed_total",
            "P_Tot_LF_Tot": "labour_force_total",
            "P_Not_in_LF_Tot": "not_in_labour_force_total",
        }
    )
)
labour_force_status["SA2_code"] = (
    labour_force_status["SA2_code"].astype(str).str.split(".").str[0].str.zfill(9)
)
print(labour_force_status.head())

    SA2_code  full_time_employee_total  part_time_employee_total  \
0  101021007                      1168                       686   
1  101021008                      2922                      1121   
2  101021009                      4253                      1659   
3  101021010                      2052                       720   
4  101021012                      5066                      1942   

   unemployed_total  labour_force_total  not_in_labour_force_total  
0                68                2090                       1259  
1               198                4508                       1962  
2               270                6522                       2555  
3               108                3033                       1039  
4               170                7567                       2131  


## Data Transformation

### Age binning
We bin age into categories:
- Children = 0-12 years old
- Teenagers = 13-19 years old
- Youth = 20-24 years old
- Young adults = 25-34 years old
- Adults = 35-64 years old
- Senior adults = 65+ years old

In [8]:
age_value_cols = [c for c in age_by_sex.columns if c != "SA2_code"]
age_by_sex[age_value_cols] = age_by_sex[age_value_cols].fillna(0)

children_0_12 = (
    age_by_sex["Age_yr_0_4_P"]
    + age_by_sex["Age_yr_5_9_P"]
    + age_by_sex["Age_yr_10_P"]
    + age_by_sex["Age_yr_11_P"]
    + age_by_sex["Age_yr_12_P"]
)

teenagers_13_19 = (
    age_by_sex["Age_yr_13_P"] + age_by_sex["Age_yr_14_P"] + age_by_sex["Age_yr_15_19_P"]
)

youth_20_24 = age_by_sex["Age_yr_20_24_P"]

young_adults_25_34 = age_by_sex["Age_yr_25_29_P"] + age_by_sex["Age_yr_30_34_P"]

adults_35_64 = (
    age_by_sex["Age_yr_35_39_P"]
    + age_by_sex["Age_yr_40_44_P"]
    + age_by_sex["Age_yr_45_49_P"]
    + age_by_sex["Age_yr_50_54_P"]
    + age_by_sex["Age_yr_55_59_P"]
    + age_by_sex["Age_yr_60_64_P"]
)

senior_adults_65_plus = (
    age_by_sex["Age_yr_65_69_P"]
    + age_by_sex["Age_yr_70_74_P"]
    + age_by_sex["Age_yr_75_79_P"]
    + age_by_sex["Age_yr_80_84_P"]
    + age_by_sex["Age_yr_85_89_P"]
    + age_by_sex["Age_yr_90_94_P"]
    + age_by_sex["Age_yr_95_99_P"]
    + age_by_sex["Age_yr_100_yr_over_P"]
)

age_by_bins = pd.DataFrame(
    {
        "SA2_code": age_by_sex["SA2_code"],
        "children_0_12": children_0_12,
        "teenagers_13_19": teenagers_13_19,
        "youth_20_24": youth_20_24,
        "young_adults_25_34": young_adults_25_34,
        "adults_35_64": adults_35_64,
        "senior_adults_65_plus": senior_adults_65_plus,
    }
)
print(age_by_bins.head())

    SA2_code  children_0_12  teenagers_13_19  youth_20_24  young_adults_25_34  \
0  101021007            604              257          122                 364   
1  101021008           1405              703          515                1266   
2  101021009           1552              617          689                2368   
3  101021010            712              262          343                1094   
4  101021012           2331             1305          746                1483   

   adults_35_64  senior_adults_65_plus  
0          1878                   1101  
1          3377                   1239  
2          4299                   1825  
3          2032                    641  
4          5644                   1224  
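The binning above can also be written as a mapping from bin name to source columns, which keeps the bin definitions in one place. This is a sketch under that assumption; AGE_BINS and bin_ages are hypothetical names, only three of the six bins are shown, and the sample row reuses the counts printed earlier for SA2 101021007.

```python
import pandas as pd

# Hypothetical mapping: bin name -> ABS age columns that make up the bin
AGE_BINS = {
    "children_0_12": [
        "Age_yr_0_4_P", "Age_yr_5_9_P", "Age_yr_10_P", "Age_yr_11_P", "Age_yr_12_P",
    ],
    "teenagers_13_19": ["Age_yr_13_P", "Age_yr_14_P", "Age_yr_15_19_P"],
    "youth_20_24": ["Age_yr_20_24_P"],
}


def bin_ages(df: pd.DataFrame, bins: dict) -> pd.DataFrame:
    """Sum the listed age columns per bin, treating missing counts as 0."""
    out = pd.DataFrame({"SA2_code": df["SA2_code"]})
    for name, cols in bins.items():
        out[name] = df[cols].fillna(0).sum(axis=1)
    return out


# First SA2 from the printed age table
row = pd.DataFrame(
    {
        "SA2_code": ["101021007"],
        "Age_yr_0_4_P": [207], "Age_yr_5_9_P": [254], "Age_yr_10_P": [40],
        "Age_yr_11_P": [56], "Age_yr_12_P": [47], "Age_yr_13_P": [39],
        "Age_yr_14_P": [45], "Age_yr_15_19_P": [173], "Age_yr_20_24_P": [122],
    }
)
binned = bin_ages(row, AGE_BINS)
print(binned)
```

The sums reproduce the first row of the binned output (604 children, 257 teenagers, 122 youth).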


### Merging the datasets

In [9]:
gcp = (
    population.merge(median_income, on="SA2_code", how="outer")
    .merge(age_by_bins, on="SA2_code", how="outer")
    .merge(labour_force_status, on="SA2_code", how="outer")
)

In [10]:
print(gcp.head())

    SA2_code  total_population  median_weekly_personal_income  \
0  101021007              4343                            760   
1  101021008              8517                            975   
2  101021009             11342                            996   
3  101021010              5085                           1104   
4  101021012             12744                           1357   

   median_weekly_household_income  children_0_12  teenagers_13_19  \
0                            1429            604              257   
1                            1989           1405              703   
2                            1703           1552              617   
3                            1796            712              262   
4                            3014           2331             1305   

   youth_20_24  young_adults_25_34  adults_35_64  senior_adults_65_plus  \
0          122                 364          1878                   1101   
1          515                1266          

## Data Cleaning

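- Normalise SA2_code to a 9-character string.
- Cast all numeric columns to float64.
- Remove duplicates by SA2_code.
- Keep only rows with positive population and labour force and non-negative values elsewhere (negative counts and incomes are invalid). Rows with a null in any of these columns also fail this filter, because comparisons against NaN evaluate to False.
- Replace any remaining nulls with 0.
- Drop SA2s with population ≤ 0.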

In [11]:
# Cast to string for SA2 Code
gcp["SA2_code"] = gcp["SA2_code"].astype(str).str.split(".").str[0].str.zfill(9)

In [12]:
# Ensure all the columns are numeric
all_columns = [
    "total_population",
    "children_0_12",
    "teenagers_13_19",
    "youth_20_24",
    "young_adults_25_34",
    "adults_35_64",
    "senior_adults_65_plus",
    "median_weekly_personal_income",
    "median_weekly_household_income",
    "full_time_employee_total",
    "part_time_employee_total",
    "unemployed_total",
    "labour_force_total",
    "not_in_labour_force_total",
]
for c in all_columns:
    if c in gcp.columns:
        gcp[c] = pd.to_numeric(gcp[c], errors="coerce").astype("float64")

In [13]:
# Remove duplicate SA2 codes
gcp = gcp.drop_duplicates(subset=["SA2_code"])


In [14]:
# Remove negative values
# Keep rows with positive population and labour force and non-negative values in every other column
mask = (
    (gcp["total_population"] > 0)
    & (gcp["children_0_12"] >= 0)
    & (gcp["teenagers_13_19"] >= 0)
    & (gcp["youth_20_24"] >= 0)
    & (gcp["young_adults_25_34"] >= 0)
    & (gcp["adults_35_64"] >= 0)
    & (gcp["senior_adults_65_plus"] >= 0)
    & (gcp["median_weekly_personal_income"] >= 0)
    & (gcp["median_weekly_household_income"] >= 0)
    & (gcp["full_time_employee_total"] >= 0)
    & (gcp["part_time_employee_total"] >= 0)
    & (gcp["unemployed_total"] >= 0)
    & (gcp["labour_force_total"] > 0)
    & (gcp["not_in_labour_force_total"] >= 0)
)
gcp = gcp.loc[mask].copy()

In [15]:
# Replace null counts with 0 (a safeguard: rows with NaN in the filtered columns were already dropped by the mask)
for c in all_columns:
    gcp[c] = gcp[c].fillna(0.0)

In [16]:
# Drop SA2 areas with a population of 0 or less
if "total_population" in gcp.columns:
    gcp = gcp[gcp["total_population"] > 0].copy()

In [17]:
gcp.head()

| | SA2_code | total_population | median_weekly_personal_income | median_weekly_household_income | children_0_12 | teenagers_13_19 | youth_20_24 | young_adults_25_34 | adults_35_64 | senior_adults_65_plus | full_time_employee_total | part_time_employee_total | unemployed_total | labour_force_total | not_in_labour_force_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101021007 | 4343.0 | 760.0 | 1429.0 | 604.0 | 257.0 | 122.0 | 364.0 | 1878.0 | 1101.0 | 1168.0 | 686.0 | 68.0 | 2090.0 | 1259.0 |
| 1 | 101021008 | 8517.0 | 975.0 | 1989.0 | 1405.0 | 703.0 | 515.0 | 1266.0 | 3377.0 | 1239.0 | 2922.0 | 1121.0 | 198.0 | 4508.0 | 1962.0 |
| 2 | 101021009 | 11342.0 | 996.0 | 1703.0 | 1552.0 | 617.0 | 689.0 | 2368.0 | 4299.0 | 1825.0 | 4253.0 | 1659.0 | 270.0 | 6522.0 | 2555.0 |
| 3 | 101021010 | 5085.0 | 1104.0 | 1796.0 | 712.0 | 262.0 | 343.0 | 1094.0 | 2032.0 | 641.0 | 2052.0 | 720.0 | 108.0 | 3033.0 | 1039.0 |
| 4 | 101021012 | 12744.0 | 1357.0 | 3014.0 | 2331.0 | 1305.0 | 746.0 | 1483.0 | 5644.0 | 1224.0 | 5066.0 | 1942.0 | 170.0 | 7567.0 | 2131.0 |
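One detail of the cleaning order is worth illustrating: the boolean mask is applied before nulls are filled, and any comparison against NaN is False, so rows with a missing value are removed rather than filled. The sketch below shows this on a toy frame and, as an assumption about what might be wanted instead, a NaN-tolerant variant of the mask. The toy codes and values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "SA2_code": ["A", "B", "C"],
        "total_population": [100.0, 50.0, 80.0],
        "unemployed_total": [5.0, np.nan, -1.0],  # B is missing, C is invalid
    }
)

# Same pattern as the notebook: NaN >= 0 is False, so row B is dropped too
strict = (toy["total_population"] > 0) & (toy["unemployed_total"] >= 0)
kept_strict = toy.loc[strict, "SA2_code"].tolist()

# Hypothetical alternative: let NaN through so a later fillna(0) can handle it
col = toy["unemployed_total"]
tolerant = (toy["total_population"] > 0) & (col.isna() | (col >= 0))
kept_tolerant = toy.loc[tolerant, "SA2_code"].tolist()

print(kept_strict, kept_tolerant)
```

Which behaviour is preferable depends on whether a missing count should be read as zero or as an unusable record; the notebook as written takes the stricter route.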


## Feature Engineering

### Age bins percentage across whole SA2 area
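Express each age bin as a share of total population so that age composition can be compared across SA2 areas regardless of population size. Despite the pct_ prefix, the values are stored as fractions between 0 and 1.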

In [18]:
den = gcp["total_population"].astype("float64")

gcp["pct_children"]      = np.where(den > 0, gcp["children_0_12"] / den, 0.0).astype("float64")
gcp["pct_teenagers"]     = np.where(den > 0, gcp["teenagers_13_19"] / den, 0.0).astype("float64")
gcp["pct_youth"]         = np.where(den > 0, gcp["youth_20_24"] / den, 0.0).astype("float64")
gcp["pct_young_adults"]  = np.where(den > 0, gcp["young_adults_25_34"] / den, 0.0).astype("float64")
gcp["pct_adults"]        = np.where(den > 0, gcp["adults_35_64"] / den, 0.0).astype("float64")
gcp["pct_seniors"]       = np.where(den > 0, gcp["senior_adults_65_plus"] / den, 0.0).astype("float64")

### Employment rate across whole SA2 area
Labour force counts as a share of total population give context for the whole area, e.g., areas with lots of retirees will show a high pct_not_in_labour_force.

In [19]:
gcp["pct_full_time"] = np.where(den > 0, gcp["full_time_employee_total"] / den, 0.0).astype("float64")
gcp["pct_part_time"] = np.where(den > 0, gcp["part_time_employee_total"] / den, 0.0).astype("float64")
gcp["pct_unemployed"] = np.where(den > 0, gcp["unemployed_total"] / den, 0.0).astype("float64")
gcp["pct_labour_force"] = np.where(den > 0, gcp["labour_force_total"] / den, 0.0).astype("float64")
gcp["pct_not_in_labour_force"] = np.where(den > 0, gcp["not_in_labour_force_total"] / den, 0.0).astype("float64")
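The population-based shares in the two cells above all follow the same guarded-division pattern. A minimal sketch of a reusable helper is shown below; safe_ratio is a hypothetical name, and the second toy row (zero population) is invented to exercise the fallback, since such rows are filtered out of the real data.

```python
import pandas as pd


def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Return num / den where den > 0, and 0.0 everywhere else."""
    den = den.astype("float64")
    ratio = num / den.where(den > 0)   # non-positive denominators become NaN
    return ratio.where(den > 0, 0.0)   # replace those positions with 0.0


toy = pd.DataFrame(
    {
        "total_population": [4343.0, 0.0],
        "children_0_12": [604.0, 0.0],
        "senior_adults_65_plus": [1101.0, 0.0],
    }
)

# New column name -> source count column
shares = {"pct_children": "children_0_12", "pct_seniors": "senior_adults_65_plus"}
for new_col, src in shares.items():
    toy[new_col] = safe_ratio(toy[src], toy["total_population"])

print(toy[["pct_children", "pct_seniors"]])
```

Unlike the np.where version, the helper keeps the pandas index and returns a Series directly, which can make the feature cells shorter.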

### Dependency ratios
Captures financial dependency load: areas with many dependents (children 0-12 and seniors 65+) relative to the working-age population (ages 20-64). Teenagers (13-19) are counted in neither group. Higher ratio = more strain on earners → higher BNPL risk.

In [20]:
dep_num = gcp["children_0_12"].fillna(0.0) + gcp["senior_adults_65_plus"].fillna(0.0)
dep_den = (
    gcp["adults_35_64"].fillna(0.0)
    + gcp["young_adults_25_34"].fillna(0.0)
    + gcp["youth_20_24"].fillna(0.0)
)
gcp["dependency_ratio"] = np.where(dep_den > 0, dep_num / dep_den, 0.0).astype("float64")

print(gcp[["SA2_code", "dependency_ratio"]].head(10))

    SA2_code  dependency_ratio
0  101021007          0.721235
1  101021008          0.512602
2  101021009          0.459081
3  101021010          0.390026
4  101021012          0.451543
5  101021610          0.502068
6  101021611          0.535612
7  101031013          0.785546
8  101031014          0.669754
9  101031015          0.650728
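As a quick arithmetic check, the first value in the output can be reproduced by hand from the binned counts of SA2 101021007 shown earlier. The variable names below are for illustration only.

```python
# Binned counts for SA2 101021007, taken from the cleaned table
children, seniors = 604.0, 1101.0
youth, young_adults, adults = 122.0, 364.0, 1878.0

# Dependents over working-age population; teenagers are in neither group
dependency_ratio = (children + seniors) / (adults + young_adults + youth)
print(round(dependency_ratio, 6))  # 1705 / 2364 -> 0.721235
```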


### Income
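Captures income structure differences, not just raw income:
- income_per_worker → median weekly personal income divided by labour force size; a proxy for productivity/affordability.
- household_personal_gap → ratio of median household income to median personal income; higher values suggest multi-income households rather than single earners.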

In [21]:
# Income per worker
gcp["income_per_worker"] = np.where(
    gcp["labour_force_total"] > 0,
    gcp["median_weekly_personal_income"] / gcp["labour_force_total"],
    0.0,
).astype("float64")

# Household personal gap
gcp["household_personal_gap"] = np.where(
    gcp["median_weekly_personal_income"] > 0,
    gcp["median_weekly_household_income"] / gcp["median_weekly_personal_income"],
    0.0,
).astype("float64")
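To make the two income features concrete, the sketch below applies the same formulas to the first SA2 in the cleaned table, plus a hypothetical all-zero row that only exists to show the 0.0 fallback (such a row would not survive the cleaning filter).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "median_weekly_personal_income": [760.0, 0.0],
        "median_weekly_household_income": [1429.0, 0.0],
        "labour_force_total": [2090.0, 0.0],
    }
)

# Median personal income divided by labour force size (760 / 2090 ≈ 0.363636)
toy["income_per_worker"] = np.where(
    toy["labour_force_total"] > 0,
    toy["median_weekly_personal_income"] / toy["labour_force_total"],
    0.0,
)

# Household median over personal median (1429 / 760 ≈ 1.880263)
toy["household_personal_gap"] = np.where(
    toy["median_weekly_personal_income"] > 0,
    toy["median_weekly_household_income"] / toy["median_weekly_personal_income"],
    0.0,
)

print(toy[["income_per_worker", "household_personal_gap"]].round(6))
```

Note the different scales: household_personal_gap is a unitless ratio near 2, while income_per_worker is dollars per person in the labour force and is well below 1 for an SA2 of this size.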

### Labour market (divided by labour force in SA2 area)
Standard labour market measures, using the labour force rather than the population as the denominator:
- High unemployment_rate = financial stress.
- part_time_share vs. full_time_share = underemployment patterns.

In [22]:
gcp["unemployment_rate"] = np.where(
    gcp["labour_force_total"] > 0,
    gcp["unemployed_total"] / gcp["labour_force_total"],
    0.0,
).astype("float64")
gcp["full_time_share"] = np.where(
    gcp["labour_force_total"] > 0,
    gcp["full_time_employee_total"] / gcp["labour_force_total"],
    0.0,
).astype("float64")
gcp["part_time_share"] = np.where(
    gcp["labour_force_total"] > 0,
    gcp["part_time_employee_total"] / gcp["labour_force_total"],
    0.0,
).astype("float64")
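A small worked example on the first SA2 shows that the three shares need not sum to one. The residual is presumably made up of labour force categories that are not extracted in this notebook; that interpretation is an assumption, not something the source states.

```python
# Labour force counts for SA2 101021007, taken from the cleaned table
full_time, part_time, unemployed = 1168.0, 686.0, 68.0
labour_force = 2090.0

unemployment_rate = unemployed / labour_force
full_time_share = full_time / labour_force
part_time_share = part_time / labour_force

# People in the labour force not covered by the three extracted columns
residual = labour_force - (full_time + part_time + unemployed)
share_sum = unemployment_rate + full_time_share + part_time_share

print(round(unemployment_rate, 6), round(share_sum, 6), residual)
```

Keeping this in mind avoids treating 1 - full_time_share - part_time_share as if it were the unemployment rate.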

### BNPL relevant interaction
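Interaction terms intended to capture non-linear risks:
- youth_x_unemployment: young people + joblessness → heavy BNPL reliance.
- youngshare_x_income: share of 20-34 year olds × median weekly personal income.
- hhgap_x_seniors: household income gap × seniors → multi-earner families supporting older adults.
- dependency_x_unemployment: dependency × unemployment → systemic stress.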

In [23]:
gcp["youth_x_unemployment"] = gcp["pct_youth"] * gcp["unemployment_rate"]
gcp["youngshare_x_income"] = (
    (gcp["pct_young_adults"].fillna(0.0) + gcp["pct_youth"].fillna(0.0))
    * gcp["median_weekly_personal_income"].fillna(0.0)
)
gcp["dependency_x_unemployment"] = gcp["dependency_ratio"] * gcp["unemployment_rate"]
gcp["hhgap_x_seniors"] = gcp["household_personal_gap"] * gcp["pct_seniors"]

In [24]:
print(gcp.head())

    SA2_code  total_population  median_weekly_personal_income  \
0  101021007            4343.0                          760.0   
1  101021008            8517.0                          975.0   
2  101021009           11342.0                          996.0   
3  101021010            5085.0                         1104.0   
4  101021012           12744.0                         1357.0   

   median_weekly_household_income  children_0_12  teenagers_13_19  \
0                          1429.0          604.0            257.0   
1                          1989.0         1405.0            703.0   
2                          1703.0         1552.0            617.0   
3                          1796.0          712.0            262.0   
4                          3014.0         2331.0           1305.0   

   youth_20_24  young_adults_25_34  adults_35_64  senior_adults_65_plus  ...  \
0        122.0               364.0        1878.0                 1101.0  ...   
1        515.0              1266.0

## Outliers Analysis
Identify how many SA2 areas have extreme values for each feature. For each column in all_columns (the raw counts and medians, not the engineered features):
- Calculate Q1, Q3, IQR.
- Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] as outliers.
- Summarise counts of flagged outliers per feature.

In [25]:
gcp_iqr = gcp.copy()
iqr_bounds = {}

for c in all_columns:
    q1 = gcp[c].quantile(0.25)
    q3 = gcp[c].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    iqr_bounds[c] = (float(lower), float(upper))
    gcp_iqr[f"{c}_iqr_outlier"] = ((gcp[c] < lower) | (gcp[c] > upper)).astype(int)

# summary: count of IQR outliers per feature
iqr_summary = pd.DataFrame(
    {f"{c}_iqr_outliers": [int(gcp_iqr[f"{c}_iqr_outlier"].sum())] for c in all_columns}
)
print(iqr_summary.T)

                                              0
total_population_iqr_outliers                 0
children_0_12_iqr_outliers                   35
teenagers_13_19_iqr_outliers                 24
youth_20_24_iqr_outliers                     42
young_adults_25_34_iqr_outliers              60
adults_35_64_iqr_outliers                     0
senior_adults_65_plus_iqr_outliers           38
median_weekly_personal_income_iqr_outliers   91
median_weekly_household_income_iqr_outliers  46
full_time_employee_total_iqr_outliers        15
part_time_employee_total_iqr_outliers         4
unemployed_total_iqr_outliers                38
labour_force_total_iqr_outliers               1
not_in_labour_force_total_iqr_outliers       31
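To show how the IQR rule behaves, here is a small self-contained sketch on a toy series. The helper name iqr_bounds and the sample values are assumptions for illustration; the quantiles use the default linear interpolation of pandas.

```python
import pandas as pd


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Toy data: five similar values and one extreme value
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
lower, upper = iqr_bounds(values)

# True where a value falls outside the fences
flags = (values < lower) | (values > upper)
print(lower, upper, int(flags.sum()))
```

For this series Q1 is 11.25 and Q3 is 12.75, so the fences are 9.0 and 15.0 and only the value 100.0 is flagged.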


- Among the age bins, the young adult (25–34) and youth (20–24) bins have the most outliers (60 and 42) → student and young professional clustering drives big demographic differences.
- Income variables have the most outliers (91 for personal income), showing very uneven income distribution across SA2s (consistent with ABS socio-economic index patterns).
- Labour market totals (employment/unemployment) show fewer but meaningful outliers, highlighting regional economic specialisations (e.g., high unemployment pockets, mining towns with high employment).
- Outliers reflect real structural differences in Australian geography, not just noise. For modelling, they may distort regressions/clustering, so consider winsorisation, scaling, or robust methods, or preserve them to capture meaningful “extreme” regions.
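If capping is chosen, one simple option is to clip each feature at its IQR fences. This is a winsorisation-style treatment rather than classical percentile-based winsorisation, and it is only a sketch on toy values; the notebook itself does not apply it.

```python
import pandas as pd

# Toy data: five similar values and one extreme value
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values beyond the fences are pulled back to the fence; others are unchanged
capped = values.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Capping keeps every SA2 in the dataset while limiting the influence of extreme values, at the cost of hiding how extreme those regions really are.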

In [26]:
print(gcp.dtypes)

SA2_code                           object
total_population                  float64
median_weekly_personal_income     float64
median_weekly_household_income    float64
children_0_12                     float64
teenagers_13_19                   float64
youth_20_24                       float64
young_adults_25_34                float64
adults_35_64                      float64
senior_adults_65_plus             float64
full_time_employee_total          float64
part_time_employee_total          float64
unemployed_total                  float64
labour_force_total                float64
not_in_labour_force_total         float64
pct_children                      float64
pct_teenagers                     float64
pct_youth                         float64
pct_young_adults                  float64
pct_adults                        float64
pct_seniors                       float64
pct_full_time                     float64
pct_part_time                     float64
pct_unemployed                    

## Final NULL Value Check

In [27]:
# Count null values per column
null_counts = gcp.isnull().sum()

# Show only columns with at least 1 null
print(null_counts[null_counts > 0])

Series([], dtype: int64)


## Save to CSV

In [28]:
gcp.to_csv(CLE / "general_community_profile.csv", index=False, float_format="%.6f")
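When the saved file is read back for joining with transactions, SA2_code needs to be loaded as a string, otherwise pandas will parse it as an integer and the join keys will no longer match. The sketch below round-trips a toy frame through an in-memory buffer and performs a left join; the transactions frame and its columns order_id and dollar_value are hypothetical.

```python
import io

import pandas as pd

gcp_small = pd.DataFrame(
    {"SA2_code": ["101021007", "101021008"], "unemployment_rate": [0.032536, 0.043922]}
)

# Write and re-read via an in-memory buffer instead of a file on disk
buffer = io.StringIO()
gcp_small.to_csv(buffer, index=False, float_format="%.6f")
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={"SA2_code": str})

# Hypothetical transaction records keyed by the consumer's SA2 code
transactions = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "SA2_code": ["101021008", "101021007", "999999999"],
        "dollar_value": [25.0, 40.0, 12.5],
    }
)

# Left join keeps every transaction; unmatched SA2 codes get NaN features
enriched = transactions.merge(reloaded, on="SA2_code", how="left")
print(enriched)
```

A left join is used so that transactions from SA2 codes missing in the profile are kept and can be handled explicitly downstream.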

## Summary

1. Setup
- Imported key libraries (pandas, geopandas, numpy, pathlib).
- Defined the raw, cleaned, and curated data directories.
2. Data Loading
- Population (G01): extracted total population per SA2.
- Income (G02): extracted median weekly personal & household income.
- Age by sex (G04A + G04B): extracted person counts by age band (single years for ages 10 to 14, 5-year bands otherwise).
- Labour force (G46): extracted full-time, part-time, unemployed, labour force total, and not in labour force counts.
3. Data Transformation
- Age binning: grouped raw ABS age bands into broader categories:
    - children_0_12, teenagers_13_19, youth_20_24, young_adults_25_34, adults_35_64, senior_adults_65_plus.
- Merged all datasets on SA2_code (outer joins) into a single gcp DataFrame.
- Cleaning:
    - cast columns to numeric,
    - removed duplicates,
    - filtered out negative values,
    - replaced nulls with 0,
    - dropped SA2s with ≤ 0 population.
4. Feature Engineering
- Demographics
    - Computed age bin percentages (pct_children, pct_teenagers, pct_youth, pct_young_adults, pct_adults, pct_seniors).
    - Dependency ratio: (children + seniors) ÷ (adults + youth + young adults).
- Income
    - Income per worker: median personal ÷ labour force.
    - Household–personal gap: household median ÷ personal median.
- Employment
    - Labour-force based rates:
        - unemployment_rate = unemployed / labour_force
        - full_time_share = full_time / labour_force
        - part_time_share = part_time / labour_force
    - Population-based rates:
        - pct_full_time, pct_part_time, pct_unemployed (employment per capita)
        - pct_labour_force = labour force ÷ population
        - pct_not_in_labour_force = not in LF ÷ population
- BNPL-Relevant Interactions
    - youth_x_unemployment = pct_youth * unemployment_rate
    - youngshare_x_income = (pct_young_adults + pct_youth) * median_weekly_personal_income
    - dependency_x_unemployment = dependency_ratio * unemployment_rate
    - hhgap_x_seniors = household_personal_gap * pct_seniors
5. Outlier Analysis
- Applied the IQR method to the 14 base numeric columns in all_columns.
- Flagged extreme SA2s as outliers for review.
- Summarised counts of outliers per feature.