# SEIFA & APRA Preprocessing and Cleaning

## SEIFA

This section preprocesses and cleans the SEIFA (Socio-Economic Indexes for Areas) dataset. We use SEIFA because it measures socio-economic conditions across SA2 regions, which lets us incorporate relative advantage and disadvantage into our analysis. Its indexes indicate whether a region is generally advantaged or disadvantaged.

In particular, IRSAD (Index of Relative Socio-economic Advantage and Disadvantage) captures both advantage and disadvantage, giving a balanced view of overall socio-economic status, while IRSD (Index of Relative Socio-economic Disadvantage), IER (Index of Economic Resources), and IEO (Index of Education and Occupation) focus on disadvantage, financial resources, and skill/education levels respectively. Integrating these scores into our pipeline lets us account for consumer purchasing power when ranking merchants, and flag potential fraud risk when purchases are unexpectedly high in value for a region's socio-economic profile.

Note: Missing values are intentionally neither imputed nor dropped, because the ABS deliberately leaves some areas without SEIFA scores ([see the methodology](https://www.abs.gov.au/methodologies/socio-economic-indexes-areas-seifa-australia-methodology/2021#:~:text=Areas%20without%20SEIFA%20Scores)).
They should be handled in the downstream process rather than here.
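
As a rough illustration of what "handled downstream" could look like, the sketch below flags areas without scores instead of imputing or dropping them. The toy table, the `has_seifa` column name, and the SA2 code `199999499` are hypothetical and not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned SEIFA table (the last row is a made-up area)
seifa = pd.DataFrame({
    "sa2code": [101021007, 101021008, 199999499],
    "irsd": [1024.0, 994.0, np.nan],
    "irsad": [1001.0, 982.0, np.nan],
})

score_cols = ["irsd", "irsad"]

# Flag areas that have every score, leaving the NaN values untouched
seifa["has_seifa"] = seifa[score_cols].notna().all(axis=1)

# A downstream step that needs complete scores can filter explicitly
scored = seifa[seifa["has_seifa"]]
print(scored)
```

Keeping the flag separate from the scores means each downstream consumer decides for itself whether to exclude, impute, or treat unscored areas as their own category.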

<div>
<img src="seifa.png" height="500"/>
</div>

Image taken from https://www.abs.gov.au/statistics/detailed-methodology-information/concepts-sources-methods/socio-economic-indexes-areas-seifa-technical-paper/2021/data-underpinning-indexes#basic-exploratory-analysis-of-variables

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import folium
from shapely import set_precision

In [None]:
RAW_EXT = Path("../data/raw/external_dataset")
CLE = Path("../data/cleaned")
CUR = Path("../data/curated")

### Convert to CSV

In [3]:
SEIFA_DIR = RAW_EXT / "seifa/seifa2021.xlsx"

In [4]:
table1 = pd.read_excel(SEIFA_DIR, sheet_name="Table 1", skiprows=5)
table1

Unnamed: 0,2021 Statistical Area Level 2 (SA2) 9-Digit Code,2021 Statistical Area Level 2 (SA2) Name,Score,Decile,Score.1,Decile.1,Score.2,Decile.2,Score.3,Decile.3,Usual Resident Population
0,101021007,Braidwood,1024,6,1001,6,1027,7,1008.0,6.0,4343.0
1,101021008,Karabar,994,5,982,5,1000,5,967.0,5.0,8517.0
2,101021009,Queanbeyan,1010,5,998,6,945,3,1000.0,6.0,11342.0
3,101021010,Queanbeyan - East,1025,6,1015,6,969,4,1025.0,7.0,5085.0
4,101021012,Queanbeyan West - Jerrabomberra,1098,10,1107,9,1109,10,1080.0,8.0,12744.0
...,...,...,...,...,...,...,...,...,...,...,...
2363,901021002,Cocos (Keeling) Islands,850,1,903,2,980,4,896.0,2.0,593.0
2364,901031003,Jervis Bay,865,1,905,2,792,1,964.0,5.0,310.0
2365,901041004,Norfolk Island,1006,5,958,4,974,4,960.0,4.0,2188.0
2366,,,,,,,,,,,


In [5]:
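# drop the last 2 rows (trailing non-data rows at the end of the sheet);
# .copy() avoids SettingWithCopyWarning on later column assignments
table1 = table1.iloc[:-2].copy()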

# drop deciles
table1 = table1.drop(columns=["Decile", "Decile.1", "Decile.2", "Decile.3"])

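# rename columns
# (the source headers contain irregular whitespace: a double space in the
# code header and a trailing space in the name header, so keep the keys as written)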
table1 = table1.rename(columns={
    "2021 Statistical Area Level 2  (SA2) 9-Digit Code": "sa2code",
    "2021 Statistical Area Level 2 (SA2) Name ": "sa2name",
    "Score": "irsd",
    "Score.1": "irsad",
    "Score.2": "ier",
    "Score.3": "ieo",
    "Usual Resident Population": "population"
})

table1

Unnamed: 0,sa2code,sa2name,irsd,irsad,ier,ieo,population
0,101021007,Braidwood,1024,1001,1027,1008.0,4343.0
1,101021008,Karabar,994,982,1000,967.0,8517.0
2,101021009,Queanbeyan,1010,998,945,1000.0,11342.0
3,101021010,Queanbeyan - East,1025,1015,969,1025.0,5085.0
4,101021012,Queanbeyan West - Jerrabomberra,1098,1107,1109,1080.0,12744.0
...,...,...,...,...,...,...,...
2361,801111141,Namadgi,968,932,945,979.0,63.0
2362,901011001,Christmas Island,971,972,943,957.0,1692.0
2363,901021002,Cocos (Keeling) Islands,850,903,980,896.0,593.0
2364,901031003,Jervis Bay,865,905,792,964.0,310.0


In [6]:
# Convert SEIFA scores to numeric; '-' marks a missing score and becomes NaN

# SEIFA columns that may contain '-' values indicating missing data
seifa_columns = ["irsd", "irsad", "ier", "ieo"]

# errors="coerce" turns '-' (and any other non-numeric string) into NaN,
# so a separate replace step is not needed
for col in seifa_columns:
    table1[col] = pd.to_numeric(table1[col], errors="coerce")

# Also ensure population is numeric
table1['population'] = pd.to_numeric(table1['population'], errors='coerce')

# Check for missing values after cleaning
print("Missing values after cleaning:")
print(table1[seifa_columns + ['population']].isnull().sum())

print(f"\nDataset shape: {table1.shape}")
print(f"Areas with complete SEIFA data: {table1[seifa_columns].dropna().shape[0]}")
print(f"Areas with any missing SEIFA data: {table1[seifa_columns].isnull().any(axis=1).sum()}")

Missing values after cleaning:
irsd          13
irsad         13
ier           12
ieo            0
population     0
dtype: int64

Dataset shape: (2366, 7)
Areas with complete SEIFA data: 2353
Areas with any missing SEIFA data: 13




### Export

In [7]:
# export to csv
table1.to_csv(CLE / "seifa2021.csv", index=False)

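## APRA Points of Presence

This section cleans the APRA points of presence data, which counts banking service channels (ATMs, branches, EFTPOS, and others) by location, and aggregates the 30 June 2021 figures to SA2 level.
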
In [8]:
APRA_DIR = RAW_EXT / "apra_pop/pop1724.xlsx"
TABLE_NAME = "Table 1"
df = pd.read_excel(APRA_DIR, sheet_name=TABLE_NAME, skiprows=2)

In [9]:
sa2_path = RAW_EXT / "sa2shape/SA2_2021_AUST_GDA2020.shp"
gdf_sa2 = gpd.read_file(sa2_path)

### Convert to CSV

In [10]:
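# keep only the 30 June 2021 snapshot
df = df[df["Period"] == "2021-06-30"]

# keep the columns needed downstream; .copy() avoids SettingWithCopyWarning
# on the assignments that follow
df = df[[
    "Period",
    "States and Territories",
    "SA4 name",
    "SA3 name",
    "SA2 name",
    "Service Channel Type",
    "Remoteness",
    "Number"
]].copy()

df["Number"] = pd.to_numeric(df["Number"], errors="coerce")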

df.head()

Unnamed: 0,Period,States and Territories,SA4 name,SA3 name,SA2 name,Service Channel Type,Remoteness,Number
35445,2021-06-30,NSW,Sydney - Blacktown,Blacktown,Blacktown (East) - Kings Park,ATMs,Major Cities of Australia,1
35446,2021-06-30,NSW,Sydney - Blacktown,Blacktown,Blacktown (East) - Kings Park,ATMs,Major Cities of Australia,1
35447,2021-06-30,NSW,Sydney - Blacktown,Mount Druitt,Mount Druitt - Whalan,ATMs,Major Cities of Australia,1
35448,2021-06-30,NSW,Sydney - Inner South West,Bankstown,Bankstown - South,ATMs,Major Cities of Australia,1
35449,2021-06-30,SA,Adelaide - Central and Hills,Adelaide City,Adelaide,ATMs,Major Cities of Australia,2


### Diagnostics 1

In [11]:
df["Service Channel Type"].unique()

array(['ATMs', 'Bank_post', 'Branch', 'EFTPOS', 'Other face-to-face'],
      dtype=object)

In [12]:
df[df["Service Channel Type"] == "EFTPOS"].head()

Unnamed: 0,Period,States and Territories,SA4 name,SA3 name,SA2 name,Service Channel Type,Remoteness,Number
48759,2021-06-30,ACT,,,,EFTPOS,,1887
48760,2021-06-30,NSW,,,,EFTPOS,,34650
48761,2021-06-30,NT,,,,EFTPOS,,1635
48762,2021-06-30,QLD,,,,EFTPOS,,27695
48763,2021-06-30,SA,,,,EFTPOS,,10924


In [13]:
df[df["Service Channel Type"] == "EFTPOS"]["SA2 name"].unique()

array([nan], dtype=object)
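
EFTPOS is reported at state level only, so these rows have no SA2 key. The sketch below uses a toy frame (the column names `state`, `sa2`, `channel`, and `number` are illustrative) to show that `pivot_table` with default settings leaves out rows whose index key is NaN, and that a separate state-level `groupby` is one way to keep the EFTPOS counts if they are needed.

```python
import numpy as np
import pandas as pd

# Toy data: two SA2-level rows and one state-level EFTPOS row without an SA2
toy = pd.DataFrame({
    "state": ["ACT", "ACT", "ACT"],
    "sa2": ["Belconnen", "Belconnen", np.nan],
    "channel": ["ATMs", "Branch", "EFTPOS"],
    "number": [16, 9, 1887],
})

# With default settings, rows with a NaN index key are excluded,
# so the EFTPOS column does not appear in the wide table
wide = toy.pivot_table(
    index=["state", "sa2"],
    columns="channel",
    values="number",
    aggfunc="sum",
    fill_value=0,
)
print(list(wide.columns))

# State-level EFTPOS totals can be kept in a separate table
eftpos_by_state = (
    toy.loc[toy["channel"] == "EFTPOS"]
    .groupby("state")["number"]
    .sum()
)
print(eftpos_by_state)
```

This is why the SA2-level summary built in the aggregation step contains no EFTPOS column.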

### Aggregating

In [14]:
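# Pivot service types into wide format at SA2 level.
# Rows with NaN in any key (the state-level EFTPOS rows) are excluded by
# pivot_table's default settings, so EFTPOS does not appear in the result.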
keys = ["Period", "States and Territories", "SA4 name", "SA3 name", "SA2 name"]

sa2_summary = (
    df.pivot_table(
        index=keys,
        columns="Service Channel Type",
        values="Number",
        aggfunc="sum",
        fill_value=0
    )
    .reset_index()
)

sa2_summary["Total"] = sa2_summary.drop(columns=keys).sum(axis=1)

sa2_summary.columns.name = None
sa2_summary.head()

Unnamed: 0,Period,States and Territories,SA4 name,SA3 name,SA2 name,ATMs,Bank_post,Branch,Other face-to-face,Total
0,2021-06-30,ACT,Australian Capital Territory,Belconnen,Belconnen,16,1,9,1,27
1,2021-06-30,ACT,Australian Capital Territory,Belconnen,Bruce,0,1,0,0,1
2,2021-06-30,ACT,Australian Capital Territory,Belconnen,Charnwood,1,1,0,0,2
3,2021-06-30,ACT,Australian Capital Territory,Belconnen,Gooromon,0,0,1,0,1
4,2021-06-30,ACT,Australian Capital Territory,Belconnen,Hawker,1,1,0,0,2


In [15]:
# Share of each service channel in the SA2 total (0 where the total is 0)
prop_columns = {
    "ATMs": "ATM_prop",
    "Branch": "Branch_prop",
    "Bank_post": "Bank_post_prop",
    "Other face-to-face": "Other_prop",
}
total = sa2_summary["Total"]
for channel, prop_name in prop_columns.items():
    sa2_summary[prop_name] = (sa2_summary[channel] / total).where(total > 0, 0)

### Diagnostics 2

In [16]:
# check join keys: APRA SA2 names that have no match in the ASGS shapefile
set(sa2_summary["SA2 name"]).difference(set(gdf_sa2["SA2_NAME21"]))

{'Ferntree Gully (South) - Upper Ferntree Gul'}

In [17]:
# Find entries similar to 'Ferntree Gully (South) - Upper Ferntree Gul' in the SA2_NAME21 column
gdf_sa2[gdf_sa2["SA2_NAME21"].str.contains("Ferntree", na=False)]

Unnamed: 0,SA2_CODE21,SA2_NAME21,CHG_FLAG21,CHG_LBL21,SA3_CODE21,SA3_NAME21,SA4_CODE21,SA4_NAME21,GCC_CODE21,GCC_NAME21,STE_CODE21,STE_NAME21,AUS_CODE21,AUS_NAME21,AREASQKM21,LOCI_URI21,geometry
951,211011447,Ferntree Gully - North,3,Name change,21101,Knox,211,Melbourne - Outer East,2GMEL,Greater Melbourne,2,Victoria,AUS,Australia,6.9851,http://linked.data.gov.au/dataset/asgsed3/SA2/...,"POLYGON ((145.25869 -37.87434, 145.25906 -37.8..."
952,211011448,Ferntree Gully (South) - Upper Ferntree Gully,0,No change,21101,Knox,211,Melbourne - Outer East,2GMEL,Greater Melbourne,2,Victoria,AUS,Australia,9.1499,http://linked.data.gov.au/dataset/asgsed3/SA2/...,"POLYGON ((145.2576 -37.8892, 145.2576 -37.8887..."
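
The manual lookup above works for a single name. If more truncated names turn up, a prefix match is one way to automate the search. The helper `resolve_truncated` below is a hypothetical sketch, not part of the notebook: it assumes the mismatch is a pure truncation and only accepts a fix when exactly one reference name starts with the unmatched name.

```python
def resolve_truncated(names, reference):
    """Map each unmatched name to the single reference name it is a prefix of."""
    reference = list(reference)
    known = set(reference)
    fixes = {}
    for name in names:
        if name in known:
            continue  # already matches, nothing to fix
        candidates = [ref for ref in reference if ref.startswith(name)]
        if len(candidates) == 1:  # skip ambiguous or missing matches
            fixes[name] = candidates[0]
    return fixes


# Toy inputs taken from the names shown above
apra_names = ["Belconnen", "Ferntree Gully (South) - Upper Ferntree Gul"]
asgs_names = [
    "Belconnen",
    "Ferntree Gully - North",
    "Ferntree Gully (South) - Upper Ferntree Gully",
]

fixes = resolve_truncated(apra_names, asgs_names)
print(fixes)
```

Requiring a unique candidate keeps the helper conservative: an ambiguous prefix is left for manual review instead of being mapped to the wrong region.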


In [18]:
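# The APRA name is truncated: Ferntree Gully (South) - Upper Ferntree Gul -> Ferntree Gully (South) - Upper Ferntree Gully.
# Fix it in both df and sa2_summary: sa2_summary was built before this cell
# and is the table that gets exported.
truncated_name = "Ferntree Gully (South) - Upper Ferntree Gul"
full_name = "Ferntree Gully (South) - Upper Ferntree Gully"
for frame in (df, sa2_summary):
    frame.loc[frame["SA2 name"] == truncated_name, "SA2 name"] = full_name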

### Diagnostics 3

In [19]:
# count SA2 regions in the shapefile that have no APRA record for this period
len(set(gdf_sa2["SA2_NAME21"]).difference(set(df["SA2 name"])))

467
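
The SA2 regions counted above have no APRA record. One way a downstream join could make them explicit is a left merge from the full SA2 list with `indicator=True`. The sketch below uses toy frames with illustrative names and counts. Filling the gaps with 0 rests on the assumption that a missing record means no points of presence, which should be confirmed against the APRA documentation before use.

```python
import pandas as pd

# Toy stand-ins for the shapefile attribute table and the APRA summary
asgs = pd.DataFrame({"SA2_NAME21": ["Belconnen", "Bruce", "Namadgi"]})
apra = pd.DataFrame({"SA2 name": ["Belconnen", "Bruce"], "Total": [27, 1]})

# Left merge keeps every SA2 region; _merge records where each row came from
merged = asgs.merge(
    apra,
    how="left",
    left_on="SA2_NAME21",
    right_on="SA2 name",
    indicator=True,
)

# SA2 regions present in the shapefile but absent from the APRA table
no_record = merged.loc[merged["_merge"] == "left_only", "SA2_NAME21"].tolist()
print(no_record)

# Assumption: no APRA record means zero points of presence
merged["Total"] = merged["Total"].fillna(0).astype(int)
print(merged[["SA2_NAME21", "Total"]])
```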

### Export

In [20]:
sa2_summary.to_csv(CLE / "apra_pop2021.csv", index=False)