# Summary

This notebook provides an overview of our Buy Now Pay Later (BNPL) merchant ranking project.

It summarises the key steps, datasets, methods, and results from all notebooks.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Start Spark session
spark = (
    SparkSession.builder.appName("Summary")
    .config("spark.sql.repl.eagerEval.enabled", True)
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)



## 1. Data Preprocessing

We prepared and cleaned the **main datasets** (consumer, merchant, and transaction) and **external datasets** to ensure all were consistent and ready for analysis.

We checked for **missing values**, **duplicate IDs**, and **inconsistent formats**.

We also **engineered features** to summarise transaction behaviour for both merchants and consumers.
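
As an illustration of the merchant-level features, the sketch below aggregates a few made-up transactions in plain Python rather than Spark. It assumes `avg_order_value` is `total_sales / order_count` and `repurchase_rate` is `repeat_consumers / unique_consumers`, which agrees with the merchant table in this notebook but is not confirmed by the feature-engineering code; the function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def merchant_features(transactions):
    """Aggregate (merchant_abn, consumer_id, dollar_value) rows per merchant."""
    orders = defaultdict(Counter)  # merchant -> orders placed by each consumer
    sales = defaultdict(float)     # merchant -> total dollar value
    for merchant_abn, consumer_id, dollar_value in transactions:
        orders[merchant_abn][consumer_id] += 1
        sales[merchant_abn] += dollar_value

    features = {}
    for merchant_abn, per_consumer in orders.items():
        order_count = sum(per_consumer.values())
        unique_consumers = len(per_consumer)
        # A repeat consumer is assumed to be one with more than one order.
        repeat_consumers = sum(1 for n in per_consumer.values() if n > 1)
        features[merchant_abn] = {
            "order_count": order_count,
            "total_sales": sales[merchant_abn],
            "avg_order_value": sales[merchant_abn] / order_count,
            "unique_consumers": unique_consumers,
            "repeat_consumers": repeat_consumers,
            "repurchase_rate": repeat_consumers / unique_consumers,
        }
    return features


# Hypothetical sample: merchant 111 has one repeat consumer ("a").
sample = [(111, "a", 20.0), (111, "a", 30.0), (111, "b", 50.0), (222, "c", 10.0)]
feats = merchant_features(sample)
```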

For consumers, we **merged external datasets using SA2 codes** to include regional and demographic context.

**External datasets used:**
- **GCP** (General Community Profile): demographic features such as income and employment
- **WPP** (Working Population Profile): workforce and industry composition
- **APRA** (Australian Prudential Regulation Authority): financial inclusion and credit access indicators
- **SEIFA** (Socio-Economic Indexes for Areas): socioeconomic advantage and disadvantage scores

**What we found:**
- Many transactions had **no fraud probability** for the merchant or the consumer involved.
- Transactions linked to **missing merchant details** were removed.
- A few postcodes couldn’t be matched to SA2 codes and were marked as null.
- **Three high-value outlier transactions** were found across different merchants and removed.
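
The SA2 merge can be pictured as a left join that keeps every consumer and leaves `SA2_code` empty when the postcode has no match. The sketch below is plain Python rather than the project's Spark code. The function name and the one-to-one postcode lookup are assumptions (a real postcode can span several SA2 areas); the sample postcodes and SA2 codes are copied from the tables in this notebook.

```python
def attach_sa2(consumers, postcode_to_sa2):
    """Keep every consumer; SA2_code is None when the postcode is unmatched."""
    return [
        {**consumer, "SA2_code": postcode_to_sa2.get(consumer["postcode"])}
        for consumer in consumers
    ]


# Lookup values taken from the consumer table; 6935 has no SA2 match there.
postcode_to_sa2 = {3760: 209031212, 5106: 402041046}
consumers = [
    {"consumer_id": 1003027, "postcode": 3760},
    {"consumer_id": 1008991, "postcode": 5106},
    {"consumer_id": 1195503, "postcode": 6935},
]
with_sa2 = attach_sa2(consumers, postcode_to_sa2)
unmatched = [c["consumer_id"] for c in with_sa2 if c["SA2_code"] is None]
```

Once each consumer carries an `SA2_code`, the GCP, WPP, APRA, and SEIFA tables can be joined on that key in the same way.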

### Datasets after ETL

In [None]:
# Consumer features
consumer_features = spark.read.csv("../data/curated/full_consumer_df.csv", header=True, inferSchema=True)
display(consumer_features.limit(5))


SA2_code,consumer_id,transaction_count,total_spend,unique_merchants,avg_spend,fraud_prob_avg,name,address,state,postcode,gender,user_id,median_weekly_personal_income,household_personal_gap,dependency_ratio,pct_youth,pct_seniors,unemployment_rate,full_time_share,part_time_share,Incorporated,Unincorporated,Enterprises with no Employees,Enterprises with not stated Employees,high_skill_occ,med_skill_occ,low_skill_occ,irsad,irsd,ier,ieo,ATMs,Bank_post,Branch,Other face-to-face
209031212,1003027,587,86617.58739117667,354,147.55977409059057,0.1300147372040564,David Mckee,879 Owens Fords,VIC,3760,Male,654,876.0,1378.0,0.572066,0.06266,0.201336,0.031965,0.51798,0.361424,203,167,241,6,499,360,297,1076.0,1086.0,1111.0,1064.0,0.0,3.0,0.0,0.0
402041046,1008991,572,82703.00766697434,336,144.58567773946564,0.128272569513518,Vanessa Wilson,576 Janet Key Sui...,SA,5106,Female,22423,530.0,541.0,0.603458,0.066419,0.178767,0.099947,0.505981,0.337453,424,380,387,5,2462,2247,4502,827.0,803.0,876.0,851.0,21.0,4.0,9.0,0.0
402021030,1011771,567,83051.92781210352,332,146.4760631606764,,Teresa Simmons,0461 Bruce Fords,SA,5112,Undisclosed,2665,460.0,367.0,0.607354,0.072398,0.166036,0.182436,0.415435,0.325251,169,232,198,4,4689,1849,4671,717.0,646.0,758.0,773.0,11.0,3.0,4.0,0.0
206041504,1027927,545,99927.18048689168,334,183.35262474659024,0.2374823583434533,Cody Cox,25036 Peterson Gr...,VIC,3000,Male,17098,638.0,451.0,0.04782,0.327623,0.019087,0.114449,0.406008,0.407851,811,530,647,0,14675,6529,5420,1051.0,961.0,716.0,1139.0,22.0,1.0,7.0,0.0
402021030,1045351,569,87889.37180910617,336,154.46286785431664,0.1827535679858087,Nicole Macias,8761 Thomas Trail,SA,5112,Female,3615,460.0,367.0,0.607354,0.072398,0.166036,0.182436,0.415435,0.325251,169,232,198,4,4689,1849,4671,717.0,646.0,758.0,773.0,11.0,3.0,4.0,0.0


In [None]:
# Merchant features
merchant_features = spark.read.parquet("../data/curated/merchant_features.parquet")
display(merchant_features.limit(5))

merchant_abn,name,tags,categories,type,take_rate,order_count,total_sales,avg_order_value,unique_consumers,repeat_consumers,repurchase_rate
10023283211,Felis Limited,"((furniture, home...","[furniture, home ...",e,0.18,3261,703277.7114509277,215.66320498341847,3032,218,0.07189973614775726
10142254217,Arcu Ac Orci Corp...,"([cable, satellit...","[cable, satellite...",b,4.22,3036,118356.1460726035,38.98423783682592,2849,182,0.06388206388206388
10165489824,Nunc Sed Company,"([jewelry, watch,...","[jewelry, watch, ...",b,4.4,5,56180.47385703053,11236.094771406108,5,0,0.0
10187291046,Ultricies Digniss...,"([wAtch, clock, a...","[watch, clock, an...",b,3.29,336,39693.73038743404,118.13610234355367,335,1,0.002985074626865...
10192359162,Enim Condimentum PC,([music shops - m...,[music shops - mu...,a,6.33,385,177980.50545638183,462.2870271594333,383,2,0.005221932114882507


In [None]:
# Transaction data (merged with all features)
transactions = spark.read.parquet("../data/curated/all_given_data_us.parquet")
display(transactions.limit(5))

user_id,merchant_abn,dollar_value,order_id,order_datetime,consumer_id,transaction_count,total_spend,unique_merchants,avg_spend,consumer_fraud_prob_avg,consumer_name,address,state,postcode,gender,SA2_code,SA2_name,is_po_box,merchant_name,tags,categories,type,take_rate,order_count,total_sales,avg_order_value,unique_consumers,repeat_consumers,repurchase_rate
1,28000487688,133.22689421562643,0c37b3f7-c7f1-48c...,2021-02-28 00:00:00,1195503,553,79177.74505087877,324,143.17856247898513,0.0980543113652096,Yolanda Williams,413 Haney Gardens...,WA,6935,Female,,,True,Sed Nunc Industries,"((books, periodic...","[books, periodica...",b,4.24,3791,927005.5450187,244.52797283532053,3495,277,0.0792560801144492
18485,62191208634,79.13140006851712,9e18b913-0465-4fd...,2021-02-28 00:00:00,1212819,554,106704.06988486904,329,192.60662434091884,0.142110997778784,Samuel Haynes,9969 Catherine Vi...,VIC,3073,Male,209021523.0,Reservoir - North...,False,Cursus Non Egesta...,"[(furniture, home...","[furniture, home ...",c,2.17,16380,1423276.3410760397,86.89110751379974,11863,3578,0.3016100480485543
1,83690644458,30.441348317517228,40a2ff69-ea34-465...,2021-02-28 00:00:00,1195503,553,79177.74505087877,324,143.17856247898513,0.0980543113652096,Yolanda Williams,413 Haney Gardens...,WA,6935,Female,,,True,Id Erat Etiam Con...,"[(gift, card, nov...","[gift, card, nove...",b,3.15,35852,3183223.7621428915,88.78789920068313,18665,10550,0.5652290383069917
18488,39649557865,962.8133405407584,f4c1a5ae-5b76-40d...,2021-02-28 00:00:00,1302316,590,84946.03561612322,342,143.97633155275122,0.1991936303806636,Aaron Sawyer,362 Dixon Islands,WA,6646,Male,511041290.0,Meekatharra,False,Arcu Morbi Institute,([artist supply a...,[artist supply an...,c,1.47,21919,9857402.328111364,449.7195277207612,14462,5494,0.3798921311021989
2,80779820715,48.12397733548124,cd09bdd6-f56d-489...,2021-02-28 00:00:00,179208,567,116325.66530808884,345,205.1599035416029,0.0983468242510051,Mary Smith,3764 Amber Oval,NSW,2782,Female,124011455.0,Wentworth Falls,False,Euismod Enim LLC,"([watch, clock, a...","[watch, clock, an...",b,4.71,36438,1272197.0816010502,34.91402057195923,18807,10786,0.5735098633487531


In [None]:
# GCP: General Community Profile
gcp = spark.read.csv("../data/curated/gcp_merged_cleaned.csv", header=True, inferSchema=True)
display(gcp.limit(5))

SA2_code,SA2_name,state_code,state_name,geometry,median_weekly_personal_income,household_personal_gap,dependency_ratio,pct_youth,pct_seniors,unemployment_rate,full_time_share,part_time_share
101021007,Braidwood,1,New South Wales,POLYGON ((149.584...,760.0,669.0,0.721235,0.028091,0.253511,0.032536,0.558852,0.32823
101021008,Karabar,1,New South Wales,POLYGON ((149.218...,975.0,1014.0,0.512602,0.060467,0.145474,0.043922,0.648181,0.248669
101021009,Queanbeyan,1,New South Wales,POLYGON ((149.213...,996.0,707.0,0.459081,0.060748,0.160906,0.041398,0.652101,0.25437
101021010,Queanbeyan - East,1,New South Wales,POLYGON ((149.240...,1104.0,692.0,0.390026,0.067453,0.126057,0.035608,0.676558,0.237389
101021012,Queanbeyan West -...,1,New South Wales,POLYGON ((149.195...,1357.0,1657.0,0.451543,0.058537,0.096045,0.022466,0.669486,0.256641


In [None]:
# WPP: Working Population Profile
wpp = spark.read.csv("../data/curated/WPP_cleaned.csv", header=True, inferSchema=True)
wpp = wpp.drop("_c0")
display(wpp.limit(5))

SA2_code,Incorporated,Unincorporated,Enterprises with no Employees,Enterprises with not stated Employees,high_skill_occ,med_skill_occ,low_skill_occ
101021007,138,273,270,0,551,292,505
101021008,105,93,115,0,288,338,345
101021009,299,232,241,3,1946,1421,2112
101021010,293,143,151,5,675,1161,1253
101021012,332,203,251,0,618,947,1009


In [None]:
# APRA: Australian Prudential Regulation Authority 
# SEIFA: Socio-Economic Indexes for Areas
apra_seifa = spark.read.csv("../data/curated/apra_seifa_imputed.csv", header=True, inferSchema=True)
display(apra_seifa.limit(5))

SA2_code,irsad,irsd,ier,ieo,ATMs,Bank_post,Branch,Other face-to-face,Total,ATM_prop,Branch_prop,Bank_post_prop,Other_prop
101021007,1001.0,1024.0,1027.0,1008.0,2.0,1.0,1.0,0.0,4.0,0.5,0.25,0.25,0.0
101021008,982.0,994.0,1000.0,967.0,1.0,1.0,0.0,0.0,2.0,0.5,0.0,0.5,0.0
101021009,998.0,1010.0,945.0,1000.0,8.0,1.0,7.0,0.0,16.0,0.5,0.4375,0.0625,0.0
101021010,1015.0,1025.0,969.0,1025.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
101021012,1107.0,1098.0,1109.0,1080.0,1.0,1.0,1.0,0.0,3.0,0.3333333333333333,0.3333333333333333,0.3333333333333333,0.0


## 2. Create Delta File

We created a **delta file** that adds a new **is_fraud flag** to each transaction using **custom fraud detection rules**.

This file helped with downstream tasks such as **imputing missing merchant fraud probabilities**.

### Rules Used to Flag Fraudulent Transactions

| Rule                                                                                                  | Rationale                                                                            |
| ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| **Burst Transactions** <br> More than 5 transactions within 1 hour by the same consumer               | Rapid repeat purchases can indicate account takeover or bot abuse.                   |
| **Low-History, High-Spend** <br> `transaction_count <= 1` and `dollar_value > P90`                      | A new customer making a large first purchase is higher risk.                         |
| **High Weighted Fraud Probability** <br> `consumer_fraud_prob > 0.7` or `merchant_fraud_prob > 0.7`   | Use existing scores to flag risky actors.                                |
| **High-Risk Merchants** <br> `order_count < 10` and `total_sales > P90`   | A new merchant with very high revenue from very few orders is risky. |
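
A minimal sketch of how the four rules could be combined into a single `is_fraud` flag, written in plain Python rather than the project's Spark code. The function names, the dictionary layout, and the choice to pass the P90 thresholds in as arguments are assumptions (the notebook does not say how the percentiles were computed), and a missing fraud probability is assumed not to trigger the score rule.

```python
from datetime import datetime, timedelta


def burst_order_ids(transactions, max_in_window=5, window=timedelta(hours=1)):
    """Order IDs in any run of more than `max_in_window` transactions
    made by the same consumer within `window`."""
    by_consumer = {}
    for tx in transactions:
        by_consumer.setdefault(tx["consumer_id"], []).append(tx)
    flagged = set()
    for txs in by_consumer.values():
        txs.sort(key=lambda t: t["order_datetime"])
        start = 0
        for end in range(len(txs)):
            # Shrink the window until it spans at most `window`.
            while txs[end]["order_datetime"] - txs[start]["order_datetime"] > window:
                start += 1
            if end - start + 1 > max_in_window:
                flagged.update(t["order_id"] for t in txs[start:end + 1])
    return flagged


def is_fraud(tx, burst_ids, dollar_p90, sales_p90, prob_threshold=0.7):
    """Apply the four rules from the table; any single hit flags the transaction."""
    burst = tx["order_id"] in burst_ids
    low_history_high_spend = tx["transaction_count"] <= 1 and tx["dollar_value"] > dollar_p90
    high_score = ((tx["consumer_fraud_prob"] or 0.0) > prob_threshold
                  or (tx["merchant_fraud_prob"] or 0.0) > prob_threshold)
    risky_merchant = tx["order_count"] < 10 and tx["total_sales"] > sales_p90
    return burst or low_history_high_spend or high_score or risky_merchant


def make_tx(order_id, consumer_id, minutes, **overrides):
    """Build a hypothetical low-risk transaction, then apply overrides."""
    tx = {
        "order_id": order_id, "consumer_id": consumer_id,
        "order_datetime": datetime(2021, 2, 28, 12, 0) + timedelta(minutes=minutes),
        "dollar_value": 50.0, "transaction_count": 500,
        "consumer_fraud_prob": 0.1, "merchant_fraud_prob": None,
        "order_count": 3000, "total_sales": 100_000.0,
    }
    tx.update(overrides)
    return tx


# Six purchases in 50 minutes by consumer 1, plus two single purchases.
txs = [make_tx(f"burst-{i}", 1, 10 * i) for i in range(6)]
txs.append(make_tx("normal", 2, 0))
txs.append(make_tx("risky-score", 3, 0, merchant_fraud_prob=0.9))

burst_ids = burst_order_ids(txs)
labels = {
    tx["order_id"]: is_fraud(tx, burst_ids, dollar_p90=400.0, sales_p90=1_000_000.0)
    for tx in txs
}
```

The burst rule only has an effect when `order_datetime` carries a time of day; with date-only timestamps, every purchase by a consumer on the same day would fall inside one window.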

In [None]:
# Delta file with fraud labels
delta_file = spark.read.parquet("../data/curated/fraud_delta.parquet")
display(delta_file.limit(5))

order_id,is_fraud
0613d31e-3c0b-41c...,True
e8480b9f-53f3-40e...,True
d8d31f15-afad-4d3...,True
f9b1d4ff-a486-45d...,True
1c656626-1439-4c3...,True


## 3. Impute Missing Fraud Probability

After cleaning, many transactions were missing **fraud probability values** for both consumers and merchants. 

We used **predictive modelling** to fill these gaps.

**Consumer Fraud Probability**
- **3,327 consumers** had missing fraud probability.
- **Trained using:** 16,784 consumers with known fraud probability.
- **Imputed:** average fraud probability per consumer (later used in merchant fraud model).
- **Models tested:** Median Imputation (baseline), Linear Regression, Random Forest, Gradient Boosting, XGBoost, KNN Imputer.
- **Selected model: KNN Imputer** (MAE = 0.057).
- **Top predictors:** average spend, total spend, unique merchants, transaction count, gender.
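
To make the KNN idea concrete, the sketch below fills a missing value with the mean of the `k` most similar consumers, using Euclidean distance on standardised features. It is a plain-Python illustration under stated assumptions: the notebook does not say which implementation, distance metric, or `k` was used (scikit-learn's `KNNImputer` is one common choice), and the function name and sample rows are hypothetical.

```python
import math


def knn_impute(rows, feature_keys, target_key, k=2):
    """Fill missing `target_key` with the mean target of the k nearest known rows."""
    known = [r for r in rows if r[target_key] is not None]

    # Standardise each feature so large-scale columns do not dominate distances.
    stats = {}
    for key in feature_keys:
        values = [r[key] for r in rows]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
        stats[key] = (mean, std)

    def scaled(row):
        return [(row[key] - stats[key][0]) / stats[key][1] for key in feature_keys]

    filled = []
    for row in rows:
        if row[target_key] is not None:
            filled.append(dict(row))
            continue
        nearest = sorted(known, key=lambda other: math.dist(scaled(row), scaled(other)))[:k]
        estimate = sum(n[target_key] for n in nearest) / len(nearest)
        filled.append({**row, target_key: estimate})
    return filled


# Hypothetical consumers: the last one has no fraud probability.
rows = [
    {"avg_spend": 100.0, "transaction_count": 500, "fraud_prob_avg": 0.10},
    {"avg_spend": 105.0, "transaction_count": 510, "fraud_prob_avg": 0.12},
    {"avg_spend": 300.0, "transaction_count": 900, "fraud_prob_avg": 0.40},
    {"avg_spend": 102.0, "transaction_count": 505, "fraud_prob_avg": None},
]
filled = knn_impute(rows, ["avg_spend", "transaction_count"], "fraud_prob_avg", k=2)
```

Here the fourth consumer sits closest to the first two, so its imputed value is their average.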

**Merchant Fraud Probability**
- **13.26M transactions** had missing merchant fraud probability.
- **Trained using:** 3,890 transactions with known merchant fraud probability.
- **Imputed:** merchant fraud probability per transaction.
- **Models tested:** Random Forest, Gradient Boosting, XGBoost.
- **Selected model: XGBoost** (RMSE = 0.0114, R² = 0.96).
- **Top predictors:** total sales, repeat customers, order count, average order value, average consumer fraud probability.

In [None]:
# Average consumer fraud probability for each consumer
consumer_fraud_prob = spark.read.parquet("../data/curated/consumer_fraud_prob_imputed.parquet")
display(consumer_fraud_prob.limit(5))

consumer_id,consumer_fraud_prob_avg_filled
1003027,0.1300147372040564
1008991,0.128272569513518
1011771,0.1624036095620602
1027927,0.2374823583434533
1045351,0.1827535679858087


In [None]:
# Merchant fraud probability for every transaction
tx_merchant_fraud_prob = spark.read.parquet("../data/curated/tx_imputed.parquet")
display(
    tx_merchant_fraud_prob.select(
        "consumer_id",
        "merchant_abn",
        "order_id",
        "consumer_fraud_prob_avg",
        "tx_fraud_merchant"
    ).limit(5)
)

consumer_id,merchant_abn,order_id,consumer_fraud_prob_avg,tx_fraud_merchant
182208,78798828265,65595814-8db4-4ec...,0.1202723929150431,0.6222573518753052
1226530,60956456424,6d18b988-5613-4aa...,0.1431002243843158,0.3182905912399292
773039,94472466107,042f716a-24dd-4b9...,0.159330064094704,0.4926167130470276
157044,49891706470,743ed3e8-86c0-484...,0.1396205872297287,0.3179011940956116
557890,82812059627,76c6e2ad-1ff8-435...,0.0962803404885092,0.5633848309516907


## 4. Predict Future Monthly Revenue

We forecast merchant revenue for the **next three months (next quarter)** to estimate growth for our **ranking metric** and extend beyond current transaction data.

**Monthly Revenue Forecast**
- **Trained using:**
    - **Current values:** current month’s revenue, merchant type, take rate, and month name
    - **Previous month values (lag 1):** ABS retail turnover, CPI, RBA cash rate
    - **Two months prior values (lag 2):** ABS retail turnover, CPI, RBA cash rate
- **Train** Mar 2021-Feb 2022, **validate** Mar-Apr 2022, **test** Jun 2022 onwards with recursive forecasting.
- **Predicted:** next-month revenue per merchant, then rolled recursively to **1-3 months ahead** and a **next-quarter total**.
- **Selected model: XGBoost** (MAE = 13,285.16, mean next-quarter revenue = 91,199.96).
- **Top predictors:** current month’s revenue, merchant ABN, ABS retail turnover lags, month name.
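
The recursive step can be sketched as follows: a one-month-ahead model is applied three times, each prediction becoming the "current" revenue for the next step, and the three values are summed into a next-quarter total. This is a simplified illustration; `forecast_next_quarter` and `toy_model` are hypothetical names, and the stand-in model ignores the merchant and macroeconomic features that the XGBoost model uses.

```python
def forecast_next_quarter(current_revenue, predict_next_month, horizon=3):
    """Roll a one-step-ahead model forward `horizon` months.

    Returns the monthly forecasts and their sum (the next-quarter total).
    """
    forecasts = []
    revenue = current_revenue
    for _ in range(horizon):
        # Feed the previous prediction back in as the new "current" revenue.
        revenue = predict_next_month(revenue)
        forecasts.append(revenue)
    return forecasts, sum(forecasts)


def toy_model(revenue):
    """Stand-in for the trained model: assumes a flat +100 per month."""
    return revenue + 100.0


monthly, quarter_total = forecast_next_quarter(1000.0, toy_model)
```

Because each step consumes the previous prediction, errors can compound over the three months, which is worth keeping in mind when reading the next-quarter totals.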

In [None]:
# Next-quarter revenue for each merchant
next_quarter_revenue = spark.read.csv("../data/curated/next_quarter_revenue_forecast.csv", header=True, inferSchema=True)
display(next_quarter_revenue.limit(5))

merchant_abn,next_quarter_revenue
10023283211,145067.734375
10142254217,24538.4248046875
10165489824,16613.703369140625
10187291046,8188.8447265625
10192359162,33783.423828125


## 5. Build Ranking System

We created a **ranking system** to identify the best merchants for the BNPL firm to partner with, focusing only on those with **recent transaction activity**.

Each merchant was assigned a **final score**, which was then used to select the **top 100 merchants** overall.

**Ranking Metrics & Weights:**
- **BNPL Revenue (35%)**
    - Calculated as: take rate × average merchant revenue over the last 3 months.
    - Reflects profit potential and current market size.
- **Growth Rate (20%)**
    - Forecasted revenue for next quarter vs actual revenue in previous quarter.
    - Captures sales momentum or decline.
- **Weighted Merchant Fraud Probability (20%)**
    - Computed as: Σ(transaction dollar value × fraud probability).
    - Helps assess risk based on high-value fraud exposure.
- **Repurchase Rate (15%)**
    - Proportion of consumers who made repeat purchases with the merchant.
    - Indicates customer loyalty and long-term engagement.
- **Consumer Fraud Probability (10%)**
    - Spend-weighted average fraud risk of the merchant’s consumers.
    - Captures underlying customer base risk.

This approach balances **growth potential** with **risk** and **customer loyalty**, aligned with BNPL firm strategy.
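
As a worked check, the weights above reproduce the `final_score` of the top-ranked merchant when applied to the normalised columns in the ranking table, with the two fraud metrics entering as `1 - norm` so that lower risk scores higher. The sketch below is plain Python; the column names come from that table, while the `min_max` helper is an assumption about how the `*_norm` columns were produced.

```python
WEIGHTS = {
    "bnpl_revenue_norm": 0.35,
    "growth_rate_norm": 0.20,
    "fraud_sum_norm_inv": 0.20,       # 1 - normalised weighted merchant fraud
    "repurchase_norm": 0.15,
    "avg_cons_fraud_norm_inv": 0.10,  # 1 - normalised consumer fraud
}


def min_max(values):
    """Scale values to [0, 1] (assumed normalisation; not confirmed by the source)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def final_score(metrics):
    """Weighted sum of the normalised (and, for fraud, inverted) metrics."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


# Normalised metrics for merchant 86578477987, copied from the ranking table.
top_merchant = {
    "bnpl_revenue_norm": 0.9574043910346084,
    "growth_rate_norm": 0.0192663270478779,
    "fraud_sum_norm_inv": 0.1754865189243653,
    "repurchase_norm": 1.0,
    "avg_cons_fraud_norm_inv": 0.9942473309727652,
}
score = final_score(top_merchant)  # about 0.6235, matching final_score in the table
```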

### Top 100 Merchants Overall

In [None]:
# Metrics and final ranking score for each merchant
top_merchants = spark.read.parquet("../data/curated/merchant_ranking.parquet")
display(top_merchants.limit(5))

merchant_abn,bnpl_revenue,total_tx_dollar_fraud,repurchase_rate,avg_weighted_consumer_fraud,growth_rate_pct,bnpl_revenue_norm,fraud_sum_norm,repurchase_norm,avg_cons_fraud_norm,growth_rate_norm,fraud_sum_norm_inv,avg_cons_fraud_norm_inv,final_score
86578477987,3552932.653830079,2656427.267256392,0.9999169469706408,61.59539438446461,-2.2454070088854152,0.9574043910346084,0.8245134810756347,1.0,0.005752669027234802,0.0192663270478779,0.1754865189243653,0.9942473309727652,0.623466839153838
32361057556,3616489.51887428,2629309.483611885,0.8995554795691572,62.78210778993344,-2.6232613763519748,0.9745314280184088,0.8160833715327126,0.8996301965822865,0.005865855316470157,0.0191782180198633,0.1839166284672874,0.99413414468353,0.6160629130595692
45629217853,3393159.84899627,2570540.8429600066,0.9990448901623686,54.16833550848797,-0.1054426340193962,0.9143494868084304,0.7978139552382761,0.9991278707587524,0.005044291393485...,0.0197653293018929,0.2021860447617238,0.9949557086065144,0.6137773466701383
48534649627,3665949.663159541,2621835.4719931367,0.8144375553587245,64.82421822034144,-3.4890116907925703,0.9878597399137272,0.8137599243993815,0.8145052024832197,0.006060627618494824,0.0189763401926461,0.1862400756006185,0.9939393723815052,0.6083639097390909
21439773999,3328034.1752170995,2622254.155756892,0.967341306347746,61.267072350663945,-2.7063122344456634,0.8967996938374418,0.8138900806849603,0.9674216536467491,0.005721354345434759,0.0191588520127122,0.1861099193150397,0.9942786456545653,0.5994747597211239


## 6. Identifying Segments

To enable more detailed insights and recommendations, we **grouped merchants into industry segments**.

We tested two approaches:
- **Manual assignment**, based on **merchant product categories**
- **K-means clustering**, using **merchant features** (category, take rate, performance-based features) from Step 1

**Result:** Different top merchants surfaced, but **segment characteristics were consistent** across methods.

**Segments used:**
- Beauty
- Entertainment & Media
- Miscellaneous
- Office & Home Supplies
- Technology
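
The manual assignment can be thought of as a keyword lookup on each merchant's category text. The sketch below is illustrative only: the keywords are inferred from the (truncated) category strings shown in the manual top-10 tables and from the segment descriptions, so the project's actual mapping may differ, and `SEGMENT_KEYWORDS` and `assign_segment` are hypothetical names.

```python
# Keywords inferred from the category strings displayed in this notebook.
SEGMENT_KEYWORDS = {
    "Beauty": ("watch", "jewelry", "shoe"),
    "Entertainment & Media": ("music", "digital goods", "artist supply", "florists", "hobby"),
    "Office & Home Supplies": ("tent", "furniture", "lawn and garden"),
    "Technology": ("cable", "computer", "telecom"),
}


def assign_segment(category, default="Miscellaneous"):
    """Return the first segment whose keyword appears in the category text."""
    text = category.lower()
    for segment, keywords in SEGMENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return segment
    return default


segments = {
    category: assign_segment(category)
    for category in ("shoe shops", "telecom", "tent and awning", "unlisted category")
}
```

Anything that matches no keyword falls through to Miscellaneous, which is how gift, optician, and motor vehicle merchants would end up there under this mapping.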

This segmentation enabled us to:
- **Analyse merchant performance** and **identify top performers** within each segment
- Retain each merchant’s **original ranking score** from Step 5 to enable **fair, segment-level comparisons** using the same evaluation criteria
- Provide the BNPL firm with **segment-specific insights** to guide how it assesses merchants in each segment

### Top 10 Merchants: Beauty Segment

In [None]:
# Manual ranking scores for Beauty segment merchants
top_merchants_beauty = spark.read.parquet("../data/curated/top10_manual_beauty.parquet")
display(
    top_merchants_beauty
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

# KMeans ranking scores for Beauty segment merchants
top_merchants_kmeans = spark.read.csv("../data/curated/top10_kmeans.csv", header=True, inferSchema=True)
display(
    top_merchants_kmeans
        .filter(col("segment") == "Beauty")
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

merchant_abn,final_score,segment,categories,rank
86578477987,0.623466839153838,Beauty,"[watch, clock, an...",1
49322182190,0.5350555904242618,Beauty,"[watch, clock, an...",2
11439466003,0.4312342682641663,Beauty,[shoe shops],3
99976658299,0.4180919490241472,Beauty,[shoe shops],4
62224020443,0.4085293576642098,Beauty,"[watch, clock, an...",5
71528203369,0.3983042264245684,Beauty,"[watch, clock, an...",6
29616684420,0.3970334673404302,Beauty,"[watch, clock, an...",7
81761494572,0.3948807475344507,Beauty,"[watch, clock, an...",8
91720867026,0.3816507652829619,Beauty,"[watch, clock, an...",9
80779820715,0.3813575508457024,Beauty,"[watch, clock, an...",10


merchant_abn,final_score,segment,categories,rank
32361057556,0.6160629130595692,Beauty,"gift, card, novel...",1
45629217853,0.6137773466701383,Beauty,"gift, card, novel...",2
89726005175,0.5818098848867259,Beauty,tent and awning s...,3
94493496784,0.5587284444045288,Beauty,"gift, card, novel...",4
49891706470,0.5407112560787694,Beauty,tent and awning s...,5
79417999332,0.5217683871468151,Beauty,"gift, card, novel...",6
60956456424,0.4874935965180409,Beauty,"gift, card, novel...",7
81219314324,0.4769298339725689,Beauty,"gift, card, novel...",8
38700038932,0.4694323047013354,Beauty,tent and awning s...,9
57900494384,0.4339193918848676,Beauty,tent and awning s...,10


### Top 10 Merchants: Entertainment & Media Segment

In [None]:
# Manual ranking scores for Entertainment & Media segment merchants
top_merchants_entertainment_media = spark.read.parquet("../data/curated/top10_manual_entertainment_media.parquet")
display(
    top_merchants_entertainment_media
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

# KMeans ranking scores for Entertainment & Media segment merchants
top_merchants_kmeans = spark.read.csv("../data/curated/top10_kmeans.csv", header=True, inferSchema=True)
display(
    top_merchants_kmeans
        .filter(col("segment") == "Entertainment & Media")
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

merchant_abn,final_score,segment,categories,rank
64403598239,0.5936745843187259,Entertainment & M...,[music shops - mu...,1
72472909171,0.5515312229594163,Entertainment & M...,[digital goods: b...,2
63290521567,0.5217204032851467,Entertainment & M...,[artist supply an...,3
43186523025,0.50886665580817,Entertainment & M...,[florists supplie...,4
98973094975,0.5051059027603686,Entertainment & M...,"[hobby, toy and g...",5
88253903277,0.4979195890992499,Entertainment & M...,"[hobby, toy and g...",6
63123845164,0.4833965405773355,Entertainment & M...,[artist supply an...,7
49505931725,0.4706270527008095,Entertainment & M...,[digital goods: b...,8
21772962346,0.4663390210334396,Entertainment & M...,[florists supplie...,9
76314317957,0.4465099509024656,Entertainment & M...,[florists supplie...,10


merchant_abn,final_score,segment,categories,rank
43186523025,0.50886665580817,Entertainment & M...,florists supplies...,1
35909341340,0.4749561258841396,Entertainment & M...,computer programm...,2
94690988633,0.4711333223853016,Entertainment & M...,"computers, comput...",3
21772962346,0.4663390210334396,Entertainment & M...,florists supplies...,4
67400260923,0.4631538583449895,Entertainment & M...,computer programm...,5
58454491168,0.4541783451464877,Entertainment & M...,computer programm...,6
76314317957,0.4465099509024656,Entertainment & M...,florists supplies...,7
45433476494,0.4402290077901831,Entertainment & M...,"computers, comput...",8
68216911708,0.4295842770051103,Entertainment & M...,"computers, comput...",9
49212265466,0.4256736644037997,Entertainment & M...,florists supplies...,10


### Top 10 Merchants: Miscellaneous Segment

In [None]:
# Manual ranking scores for Miscellaneous segment merchants
top_merchants_miscellaneous = spark.read.parquet("../data/curated/top10_manual_miscellaneous.parquet")
display(
    top_merchants_miscellaneous
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

# KMeans ranking scores for Miscellaneous segment merchants
top_merchants_kmeans = spark.read.csv("../data/curated/top10_kmeans.csv", header=True, inferSchema=True)
display(
    top_merchants_kmeans
        .filter(col("segment") == "Miscellaneous")
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

merchant_abn,final_score,segment,categories,rank
32361057556,0.6160629130595692,Miscellaneous,"[gift, card, nove...",1
45629217853,0.6137773466701383,Miscellaneous,"[gift, card, nove...",2
48534649627,0.6083639097390909,Miscellaneous,"[opticians, optic...",3
94493496784,0.5587284444045288,Miscellaneous,"[gift, card, nove...",4
79417999332,0.5217683871468151,Miscellaneous,"[gift, card, nove...",5
96680767841,0.5209375703907587,Miscellaneous,[motor vehicle su...,6
60956456424,0.4874935965180409,Miscellaneous,"[gift, card, nove...",7
81219314324,0.4769298339725689,Miscellaneous,"[gift, card, nove...",8
13514558491,0.4472862487552262,Miscellaneous,[motor vehicle su...,9
46804135891,0.4261539396432031,Miscellaneous,"[opticians, optic...",10


merchant_abn,final_score,segment,categories,rank
86578477987,0.623466839153838,Miscellaneous,"watch, clock, and...",1
49322182190,0.5350555904242618,Miscellaneous,"watch, clock, and...",2
63290521567,0.5217204032851467,Miscellaneous,artist supply and...,3
63123845164,0.4833965405773355,Miscellaneous,artist supply and...,4
42355028515,0.4496397333059124,Miscellaneous,lawn and garden s...,5
24120162361,0.4400172552604369,Miscellaneous,artist supply and...,6
40515428545,0.4365614279190323,Miscellaneous,artist supply and...,7
67978471888,0.4162486583789473,Miscellaneous,artist supply and...,8
68559320474,0.4152860794706826,Miscellaneous,antique shops - s...,9
62224020443,0.4085293576642098,Miscellaneous,"watch, clock, and...",10


### Top 10 Merchants: Office & Home Supplies Segment

In [None]:
# Manual ranking scores for Office & Home Supplies segment merchants
top_merchants_office_home = spark.read.parquet("../data/curated/top10_manual_office_home_supplies.parquet")
display(
    top_merchants_office_home
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

# KMeans ranking scores for Office & Home Supplies segment merchants
top_merchants_kmeans = spark.read.csv("../data/curated/top10_kmeans.csv", header=True, inferSchema=True)
display(
    top_merchants_kmeans
        .filter(col("segment") == "Office & Home Supplies")
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

merchant_abn,final_score,segment,categories,rank
89726005175,0.5818098848867259,Office & Home Sup...,[tent and awning ...,1
49891706470,0.5407112560787694,Office & Home Sup...,[tent and awning ...,2
22267067774,0.4941974701235194,Office & Home Sup...,"[furniture, home ...",3
79827781481,0.4873318358483836,Office & Home Sup...,"[furniture, home ...",4
76767266140,0.4733079081637109,Office & Home Sup...,"[furniture, home ...",5
38700038932,0.4694323047013354,Office & Home Sup...,[tent and awning ...,6
42355028515,0.4496397333059124,Office & Home Sup...,[lawn and garden ...,7
57900494384,0.4339193918848676,Office & Home Sup...,[tent and awning ...,8
91923722701,0.4285920944089731,Office & Home Sup...,[tent and awning ...,9
82065156333,0.4281090967208397,Office & Home Sup...,[tent and awning ...,10


merchant_abn,final_score,segment,categories,rank
72472909171,0.5515312229594163,Office & Home Sup...,digital goods: bo...,1
22267067774,0.4941974701235194,Office & Home Sup...,"furniture, home f...",2
79827781481,0.4873318358483836,Office & Home Sup...,"furniture, home f...",3
76767266140,0.4733079081637109,Office & Home Sup...,"furniture, home f...",4
49505931725,0.4706270527008095,Office & Home Sup...,digital goods: bo...,5
35223308778,0.4214501000044393,Office & Home Sup...,"books, periodical...",6
46298404088,0.4078189471288243,Office & Home Sup...,"books, periodical...",7
38090089066,0.4029560121252256,Office & Home Sup...,"furniture, home f...",8
47103021057,0.4027783807041278,Office & Home Sup...,art dealers and g...,9
47086412084,0.4003886444191196,Office & Home Sup...,digital goods: bo...,10


### Top 10 Merchants: Technology Segment

In [None]:
# Manual ranking scores for Technology segment merchants
top_merchants_technology = spark.read.parquet("../data/curated/top10_manual_technology.parquet")
display(
    top_merchants_technology
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
        .limit(10)
)

# KMeans ranking scores for Technology segment merchants
top_merchants_kmeans = spark.read.csv("../data/curated/top10_kmeans.csv", header=True, inferSchema=True)
display(
    top_merchants_kmeans
        .filter(col("segment") == "Technology")
        .select("merchant_abn", "final_score", "segment", "categories", "rank")
)

merchant_abn,final_score,segment,categories,rank
21439773999,0.5994747597211239,Technology,"[cable, satellite...",1
35909341340,0.4749561258841396,Technology,[computer program...,2
94690988633,0.4711333223853016,Technology,"[computers, compu...",3
67400260923,0.4631538583449895,Technology,[computer program...,4
58454491168,0.4541783451464877,Technology,[computer program...,5
82368304209,0.4427622141514736,Technology,[telecom],6
45433476494,0.4402290077901831,Technology,"[computers, compu...",7
17488304283,0.432579145211736,Technology,"[cable, satellite...",8
68216911708,0.4295842770051103,Technology,"[computers, compu...",9
62694031334,0.4060815269017153,Technology,"[computers, compu...",10


merchant_abn,final_score,segment,categories,rank
48534649627,0.6083639097390909,Technology,"opticians, optica...",1
21439773999,0.5994747597211239,Technology,"cable, satellite,...",2
64403598239,0.5936745843187259,Technology,music shops - mus...,3
96680767841,0.5209375703907587,Technology,motor vehicle sup...,4
98973094975,0.5051059027603686,Technology,"hobby, toy and ga...",5
88253903277,0.4979195890992499,Technology,"hobby, toy and ga...",6
13514558491,0.4472862487552262,Technology,motor vehicle sup...,7
82368304209,0.4427622141514736,Technology,telecom,8
17488304283,0.432579145211736,Technology,"cable, satellite,...",9
27326652377,0.4279560611057949,Technology,music shops - mus...,10


## 7. Segment Insights & Recommendations

Each segment presents **distinct patterns** in customer behaviour, purchase frequency, and fraud risk. 

Recognising these differences helps BNPL firms design more effective strategies for **merchant onboarding**, **fraud control**, and **customer engagement**.

- **Beauty**
    - Moderate repurchase rates, **high-value products** (e.g. watches, jewellery), and **higher fraud sensitivity**. 
    - Promote BNPL as an “affordable luxury.”
- **Entertainment & Media**
    - **Very high repurchase rates** and **strong revenue potential**. Slightly higher fraud risk, especially for digital goods and music. 
    - Strengthen fraud checks.
- **Miscellaneous**
    - **High customer loyalty** and **low fraud risk**. Includes novelty and gift merchants. 
    - Plan for seasonal spikes.
- **Office & Home Supplies**
    - **Repurchase rates vary**. Purchases are **large and infrequent** (e.g. furniture, garden). 
    - Offer longer repayment plans and screen credit.
- **Technology**
    - **Highest revenue per merchant**, especially in electronics/software, and **elevated fraud risk**. 
    - Apply stricter vetting before onboarding.

## 8. Assumptions & Limitations

**Assumptions:**
- **Consumer behaviour** stays consistent across time and segments
- **Merchant behaviour and performance trends** are stable
- Provided **fraud probabilities** reflect actual fraud risk
- No major market or seasonal changes affect future trends
- Dataset captures all transactions and is representative for each merchant

**Limitations:**
- **Dataset period**: Feb 2021 to Oct 2022, so future trends may differ
- **ABS and SEIFA data** is regional and may miss small-scale patterns
- **Postcode-to-SA2 mapping** may not always be accurate
- **Rule-based fraud flags** are not verified outcomes
- **Fraud probability** is just an estimate
- **Small or new merchants** may be underrepresented
- Mostly reflects **Australian retail BNPL**, not other industries or countries