# Preprocess BNPL Datasets

This notebook preprocesses the given `consumer`, `merchant`, and `transaction` datasets. It standardises the raw data, validates identifiers, and checks for missing values, producing cleaned datasets for later feature engineering.

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa
import re

In [4]:
DATA = Path("..") / "data"
RAW  = DATA / "raw" / "tables"
RAW_EXT = DATA / "raw" / "external_dataset"
OUT  = DATA / "cleaned"
OUT.mkdir(parents=True, exist_ok=True)

## Read and Inspect BNPL Datasets

In [5]:
# Consumer details table
consumer_df = pd.read_csv(RAW / "tbl_consumer.csv", sep="|")
print("consumer_df:", consumer_df.shape)

# Merchant details table
merchant_df = pd.read_parquet(RAW / "tbl_merchants.parquet")
print("merchant_df:", merchant_df.shape)

# User - consumer lookup table
user_details_df = pd.read_parquet(RAW / "consumer_user_details.parquet")
print("user_details_df:", user_details_df.shape)

# Fraud probabilities of consumer
consumer_fraud_df = pd.read_csv(RAW / "consumer_fraud_probability.csv", low_memory=False)
print("consumer_fraud_df:", consumer_fraud_df.shape)

# Fraud probabilities of merchant
merchant_fraud_df = pd.read_csv(RAW / "merchant_fraud_probability.csv", low_memory=False)
print("merchant_fraud_df:", merchant_fraud_df.shape)

# Read each of the three transaction snapshots
t1 = pd.read_parquet(RAW / "transactions_20210228_20210827_snapshot")
t2 = pd.read_parquet(RAW / "transactions_20210828_20220227_snapshot")
t3 = pd.read_parquet(RAW / "transactions_20220228_20220828_snapshot")

# Combine
tx = pd.concat([t1,t2,t3], ignore_index=True)
print("transactions:", tx.shape)

consumer_df: (499999, 6)
merchant_df: (4026, 2)
user_details_df: (499999, 2)
consumer_fraud_df: (34864, 3)
merchant_fraud_df: (114, 3)
transactions: (14195505, 5)


In [None]:
# Helper function to print quick statistics
def quick_profile(df: pd.DataFrame, name: str):
    print(f"\n=== {name}: shape={df.shape} ===")
    display(df.head(3))
    display(df.sample(min(3, len(df))) if len(df) else df.head(0))
    display(df.dtypes.to_frame("dtype"))
    # nulls
    na = df.isna().mean().sort_values(ascending=False).to_frame("null_rate")
    display(na[na.null_rate>0])
    # basic uniques
    nunique = df.nunique(dropna=True).sort_values(ascending=False).to_frame("nunique")
    display(nunique.head(15))
    # date ranges (auto-detect)
    date_cols = [c for c in df.columns if df[c].dtype.kind == "M"]
    for c in date_cols:
        print(f"{c}: min={df[c].min()}  max={df[c].max()}")

## Consumer datasets

### Clean consumer details dataset

In [None]:
quick_profile(consumer_df, "tbl_consumer.csv")


=== tbl_consumer.csv: shape=(499999, 6) ===


Unnamed: 0,name,address,state,postcode,gender,consumer_id
0,Yolanda Williams,413 Haney Gardens Apt. 742,WA,6935,Female,1195503
1,Mary Smith,3764 Amber Oval,NSW,2782,Female,179208
2,Jill Jones MD,40693 Henry Greens,NT,862,Female,1194530


Unnamed: 0,name,address,state,postcode,gender,consumer_id
127680,Stephanie Small,811 Erica Crest,NSW,2163,Undisclosed,149040
367998,Victor Martinez,6635 Katherine Flat,NSW,2463,Male,940187
350845,Catherine Collins,622 Shaffer Forks Apt. 251,NSW,1163,Female,727723


Unnamed: 0,dtype
name,object
address,object
state,object
postcode,int64
gender,object
consumer_id,int64


Unnamed: 0,null_rate


Unnamed: 0,nunique
consumer_id,499999
address,499955
name,221377
postcode,3167
state,8
gender,3


In [None]:
# Cast every column to string and trim surrounding whitespace
for c in ["name", "address", "state", "gender", "postcode", "consumer_id"]:
    consumer_df[c] = consumer_df[c].astype("string").str.strip()

In [None]:
# Validate the values of State 
consumer_df["state"] = (consumer_df["state"]
                            .str.upper()
                            .str.replace(r"[^A-Z]", "", regex=True))

allowed_states = {"NSW","VIC","QLD","SA","WA","TAS","NT","ACT"}
bad_state_mask = ~consumer_df["state"].isin(allowed_states)
print("Invalid state rows:", int(bad_state_mask.sum()))

Invalid state rows: 0


In [None]:
# Ensure postcode is a 4-digit string
consumer_df["postcode"] = (consumer_df["postcode"]
                                .str.replace(r"[^0-9]", "", regex=True)
                                .str.zfill(4)  # "862" -> "0862"
                                .where(lambda s: s.str.fullmatch(r"\d{4}"), other=pd.NA))

bad_pc_mask = consumer_df["postcode"].isna()
print("Invalid/missing postcodes:", int(bad_pc_mask.sum()))

Invalid/missing postcodes: 0
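
To make the padding and validation rule concrete, the sketch below applies the same chain to a small made-up series. The values and the names `raw` and `cleaned` are illustrative assumptions, not taken from the dataset. A three-digit postcode is left-padded to four digits, while a five-digit value fails the `\d{4}` check and becomes missing.

```python
import pandas as pd

# Hypothetical postcodes: one 3-digit value, two valid ones, one too long
raw = pd.Series([862, 2782, 6935, 12345], dtype="int64")

cleaned = (
    raw.astype("string")
       .str.strip()
       .str.replace(r"[^0-9]", "", regex=True)  # keep digits only
       .str.zfill(4)                             # "862" -> "0862"
       .where(lambda s: s.str.fullmatch(r"\d{4}"), other=pd.NA)
)

print(cleaned.tolist())
print("Invalid/missing postcodes:", int(cleaned.isna().sum()))
```

Because `zfill` only pads, values longer than four digits are caught by the `fullmatch` step rather than silently truncated.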


In [None]:
# Sanitise consumer_id, then check for nulls and duplicates
consumer_df["consumer_id"] = (consumer_df["consumer_id"]
                                    .str.replace(r"[^0-9A-Za-z_-]", "", regex=True)
                                    .str.strip())

null_ids = int(consumer_df["consumer_id"].isna().sum())
dupes = int(len(consumer_df) - consumer_df["consumer_id"].nunique())
print({"null_consumer_id": null_ids, "duplicate_consumer_id_rows": dupes})

{'null_consumer_id': 0, 'duplicate_consumer_id_rows': 0}


In [None]:
# Validations summary
summary = {
    "rows": len(consumer_df),
    "unique_consumer_id": consumer_df["consumer_id"].nunique(),
    "gender_counts": consumer_df["gender"].value_counts(dropna=False).to_dict(),
    "state_counts": consumer_df["state"].value_counts(dropna=False).to_dict(),
    "postcode_null_rate": float(consumer_df["postcode"].isna().mean()),
}
summary

{'rows': 499999,
 'unique_consumer_id': 499999,
 'gender_counts': {'Male': 224979, 'Female': 224946, 'Undisclosed': 50074},
 'state_counts': {'NSW': 144188,
  'VIC': 117525,
  'WA': 79146,
  'QLD': 72861,
  'SA': 54973,
  'TAS': 18878,
  'NT': 7764,
  'ACT': 4664},
 'postcode_null_rate': 0.0}

### Clean consumer fraud probability dataset

In [None]:
quick_profile(consumer_fraud_df, "consumer_fraud_probability.csv")


=== consumer_fraud_probability.csv: shape=(34864, 3) ===


Unnamed: 0,user_id,order_datetime,fraud_probability
0,6228,2021-12-19,97.629808
1,21419,2021-12-10,99.24738
2,5606,2021-10-17,84.05825


Unnamed: 0,user_id,order_datetime,fraud_probability
2457,10387,2021-04-09,34.200358
12453,4709,2021-06-30,12.506258
33898,10483,2021-12-17,8.590592


Unnamed: 0,dtype
user_id,int64
order_datetime,object
fraud_probability,float64


Unnamed: 0,null_rate


Unnamed: 0,nunique
fraud_probability,34765
user_id,20128
order_datetime,365


In [None]:
# Cast to string for consistency
consumer_fraud_df["user_id"] = consumer_fraud_df["user_id"].astype("string").str.strip()

# Rename fraud_probability → tx_fraud_consumer
consumer_fraud_df.rename(columns={"fraud_probability": "tx_fraud_consumer"}, inplace=True)

# Convert to datetime
consumer_fraud_df["order_datetime"] = pd.to_datetime(consumer_fraud_df["order_datetime"], errors="coerce")
print("Invalid dates:", consumer_fraud_df["order_datetime"].isna().sum())
print("Date range:", consumer_fraud_df["order_datetime"].min(), consumer_fraud_df["order_datetime"].max())

Invalid dates: 0
Date range: 2021-02-28 00:00:00 2022-02-27 00:00:00
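
The "Invalid dates" count relies on `errors="coerce"` turning unparseable strings into `NaT` instead of raising. A minimal sketch of that behaviour, using hypothetical input values (the malformed entry is invented for illustration):

```python
import pandas as pd

# Two valid ISO dates and one malformed entry (hypothetical)
raw_dates = pd.Series(["2021-12-19", "2021-10-17", "not-a-date"])

# Unparseable values become NaT rather than raising an exception
parsed = pd.to_datetime(raw_dates, errors="coerce")
n_invalid = int(parsed.isna().sum())

print("Invalid dates:", n_invalid)
print("Date range:", parsed.min(), parsed.max())
```

A count of zero in the notebook output therefore means every `order_datetime` string parsed successfully.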


In [None]:
# Normalize probability values
consumer_fraud_df["tx_fraud_consumer"] = pd.to_numeric(consumer_fraud_df["tx_fraud_consumer"], errors="coerce")

# normalise 0–100 → 0–1
consumer_fraud_df["tx_fraud_consumer"] = consumer_fraud_df["tx_fraud_consumer"] / 100

consumer_fraud_df["tx_fraud_consumer"].describe()

count    34864.000000
mean         0.151201
std          0.099461
min          0.082871
25%          0.096344
50%          0.117356
75%          0.162162
max          0.992474
Name: tx_fraud_consumer, dtype: float64
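
One hazard with dividing in place is that re-running the cell divides by 100 a second time. The sketch below shows one defensive variant; the helper name `to_unit_interval` and the "max greater than 1" heuristic are assumptions for illustration, and the heuristic would misfire on a percentage column whose values all happen to be at or below 1.

```python
import pandas as pd

def to_unit_interval(s: pd.Series) -> pd.Series:
    """Rescale a 0-100 percentage column to 0-1, at most once (heuristic)."""
    s = pd.to_numeric(s, errors="coerce")
    if s.max() > 1:          # assume values above 1 mean a 0-100 scale
        s = s / 100
    if not s.dropna().between(0, 1).all():
        raise ValueError("values fall outside [0, 1] after rescaling")
    return s

pct = pd.Series([97.629808, 8.590592, 34.200358])  # sample rows shown above
once = to_unit_interval(pct)
twice = to_unit_interval(once)   # second call leaves the values unchanged

print(once.round(4).tolist())
```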

In [None]:
# Check whether any user has more than one distinct fraud probability
consumer_fraud_df.groupby("user_id").size().describe()
multi_prob_users = consumer_fraud_df.groupby("user_id")["tx_fraud_consumer"].nunique()
print("Users with >1 fraud prob:", (multi_prob_users > 1).sum())

sample_multi = consumer_fraud_df[consumer_fraud_df["user_id"].isin(multi_prob_users[multi_prob_users > 1].index)]
sample_multi.sort_values(["user_id","order_datetime"]).head(10)

Users with >1 fraud prob: 10773


Unnamed: 0,user_id,order_datetime,tx_fraud_consumer
25616,1000,2021-09-30,0.103278
30879,1000,2022-01-02,0.089588
8312,10000,2021-12-27,0.170692
23011,10000,2022-02-09,0.101045
16341,10004,2021-06-08,0.105277
5634,10004,2021-11-22,0.177575
9022,10006,2021-07-27,0.125405
14572,10006,2021-12-13,0.099248
2796,10006,2021-12-27,0.231379
17118,10013,2021-11-25,0.131518


### Clean user details dataset

In [None]:
quick_profile(user_details_df, "consumer_user_details.parquet")


=== consumer_user_details.parquet: shape=(499999, 2) ===


Unnamed: 0,user_id,consumer_id
0,1,1195503
1,2,179208
2,3,1194530


Unnamed: 0,user_id,consumer_id
284297,284298,14832
208437,208438,1403556
417238,417239,1406862


Unnamed: 0,dtype
user_id,int64
consumer_id,int64


Unnamed: 0,null_rate


Unnamed: 0,nunique
user_id,499999
consumer_id,499999


All `user_id` and `consumer_id` values are unique (499,999 each), so the lookup table contains no duplicates.

In [None]:
# Convert IDs to string
user_details_df["user_id"] = user_details_df["user_id"].astype("string").str.strip()
user_details_df["consumer_id"] = user_details_df["consumer_id"].astype("string").str.strip()

### Linking Consumer Data to Both User ID and Consumer ID

In [None]:
# Merge fraud with user details
consumer_fraud_df = consumer_fraud_df.merge(user_details_df, on="user_id", how="left")
print("consumer_fraud_df:", consumer_fraud_df.shape)

# Merge consumer with user details
consumer_details_df = consumer_df.merge(user_details_df, on="consumer_id", how="left")
print("consumer_details_df:", consumer_details_df.shape)

consumer_fraud_df: (34864, 4)
consumer_details_df: (499999, 7)
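
The unchanged row counts suggest the left joins did not duplicate rows, but they do not show whether every key found a match. A sketch of how pandas can check both, using the `validate` and `indicator` options of `merge` on toy frames (the frames `fraud` and `lookup` and their values are hypothetical):

```python
import pandas as pd

# Toy stand-ins for consumer_fraud_df and user_details_df
fraud = pd.DataFrame({
    "user_id": ["1", "1", "2", "9"],
    "tx_fraud_consumer": [0.10, 0.20, 0.30, 0.40],
})
lookup = pd.DataFrame({
    "user_id": ["1", "2", "3"],
    "consumer_id": ["100", "200", "300"],
})

merged = fraud.merge(
    lookup,
    on="user_id",
    how="left",
    validate="many_to_one",  # raises MergeError if lookup keys are not unique
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

unmatched = int((merged["_merge"] == "left_only").sum())
print("rows:", len(merged), "unmatched user_id rows:", unmatched)
```

In the toy data, user `"9"` has no lookup entry, so its `consumer_id` is missing after the join and it is counted as unmatched.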


In [None]:
# Save files as parquet
consumer_details_df.to_parquet(OUT / "consumer_details.parquet")
consumer_fraud_df.to_parquet(OUT / "consumer_fraud.parquet")

## Merchant datasets

### Clean merchant details dataset

In [None]:
# merchant_abn is stored as the index; promote it to a regular column
merchant_df = merchant_df.reset_index()
quick_profile(merchant_df, "tbl_merchants.parquet")


=== tbl_merchants.parquet: shape=(4026, 3) ===


Unnamed: 0,merchant_abn,name,tags
0,10023283211,Felis Limited,"((furniture, home furnishings and equipment sh..."
1,10142254217,Arcu Ac Orci Corporation,"([cable, satellite, and otHer pay television a..."
2,10165489824,Nunc Sed Company,"([jewelry, watch, clock, and silverware shops]..."


Unnamed: 0,merchant_abn,name,tags
1753,48220830699,Nec Diam Duis Corporation,"((art dealers and galleries), (c), (take rate:..."
1415,40916035557,Phasellus Vitae Mauris Limited,"([stationery, office supplies and printing and..."
744,26025787253,Felis Donec Limited,"[[furniture, home furnishings and equipment sh..."


Unnamed: 0,dtype
merchant_abn,int64
name,object
tags,object


Unnamed: 0,null_rate


Unnamed: 0,nunique
merchant_abn,4026
name,4026
tags,3954


In [None]:
# Clean the ABN column (ensure they are all digits with length 11)
merchant_df["merchant_abn"] = (merchant_df["merchant_abn"].astype("string")
                                                .str.replace(r"\D", "", regex=True)
                                                .str.zfill(11))

In [None]:
def parse_tag_entry(raw: str):
    """
    Parse a tags string into:
      - categories: list[str]  (lowercased, deduped, no 'take rate', subcats collapsed)
      - type: 'a'|'b'|'c'|'d'|'e'|<NA>
      - take_rate: float|<NA>

    Handles (), [], {}, doubled brackets, arbitrary order.
    """
    if pd.isna(raw):
        return [], pd.NA, pd.NA

    s = str(raw)

    # --- 0) normalise all brackets to parentheses & collapse repeats ---
    s = re.sub(r'[\(\[\{]+', '(', s)     # any opening bracket(s) -> (
    s = re.sub(r'[\)\]\}]+', ')', s)     # any closing bracket(s) -> )

    # --- 1) pull out every (...) segment (non-nested content) ---
    segments = re.findall(r'\(([^()]*)\)', s)

    cat_chunks = []
    type_code = pd.NA
    take_rate = pd.NA

    for seg in segments:
        seg_clean = ' '.join(seg.split()).strip()   # collapse whitespace
        low = seg_clean.lower()

        # classify
        m_tr = re.search(r'take[\s\-_]*rate\s*[:=]\s*([0-9]+(?:\.[0-9]+)?)', low, flags=re.I)
        if m_tr:
            try:
                take_rate = float(m_tr.group(1))
            except ValueError:
                pass
            continue

        if re.fullmatch(r'[abcde]', low):  # single-letter type code, a-e
            type_code = low
            continue

        # otherwise treat as category chunk
        cat_chunks.append(seg_clean)

    # --- 2) turn category chunks into list items ---
    # join all chunks and split by commas
    cats_raw = ','.join(cat_chunks)
    parts = [p.strip().lower() for p in cats_raw.split(',') if p.strip()]

    # drop noise tokens and any lingering 'take rate'
    noise = {'a','b','c','the','and','&','take rate'}
    parts = [p for p in parts if len(p) > 2 and p not in noise and 'take rate' not in p]

    # de-duplicate preserving order
    seen, categories = set(), []
    for p in parts:
        if p not in seen:
            seen.add(p)
            categories.append(p)

    return categories, type_code, take_rate
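

The first two steps of the parser (bracket normalisation and segment extraction) can be traced on a single string. The tag below is a hypothetical example written in the style of the sample rows, since the real values are truncated in the output; only the regular expressions are taken from `parse_tag_entry`.

```python
import re

# Hypothetical tag string with mixed bracket styles
raw = "([cable, satellite, and otHer pay television], [b], [take rate: 4.22])"

# Step 0: map every run of opening/closing brackets to a single ( or )
s = re.sub(r'[\(\[\{]+', '(', raw)
s = re.sub(r'[\)\]\}]+', ')', s)

# Step 1: pull out each non-nested (...) segment
segments = re.findall(r'\(([^()]*)\)', s)

# The take-rate segment is recognised by the same pattern as in the parser
m = re.search(r'take[\s\-_]*rate\s*[:=]\s*([0-9]+(?:\.[0-9]+)?)', segments[2].lower())
take_rate = float(m.group(1))

print(s)
print(segments)
print(take_rate)
```

After normalisation the doubled outer brackets disappear, which is why the same `findall` pattern handles `((...))`, `([...])`, and `[[...]]` inputs alike.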

In [None]:
# Apply to dataframe
parsed = merchant_df['tags'].apply(parse_tag_entry)
merchant_df['categories']       = parsed.apply(lambda t: t[0])
merchant_df['type']             = parsed.apply(lambda t: t[1])
merchant_df['take_rate']        = parsed.apply(lambda t: t[2])
merchant_df.head(10)

Unnamed: 0,merchant_abn,name,tags,categories,type,take_rate
0,10023283211,Felis Limited,"((furniture, home furnishings and equipment sh...","[furniture, home furnishings and equipment sho...",e,0.18
1,10142254217,Arcu Ac Orci Corporation,"([cable, satellite, and otHer pay television a...","[cable, satellite, and other pay television an...",b,4.22
2,10165489824,Nunc Sed Company,"([jewelry, watch, clock, and silverware shops]...","[jewelry, watch, clock, and silverware shops]",b,4.4
3,10187291046,Ultricies Dignissim Lacus Foundation,"([wAtch, clock, and jewelry repair shops], [b]...","[watch, clock, and jewelry repair shops]",b,3.29
4,10192359162,Enim Condimentum PC,"([music shops - musical instruments, pianos, a...","[music shops - musical instruments, pianos, an...",a,6.33
5,10206519221,Fusce Company,"[(gift, card, novelty, and souvenir shops), (a...","[gift, card, novelty, and souvenir shops]",a,6.34
6,10255988167,Aliquam Enim Incorporated,"[(computers, comPUter peripheral equipment, an...","[computers, computer peripheral equipment, and...",b,4.32
7,10264435225,Ipsum Primis Ltd,"[[watch, clock, and jewelry repair shops], [c]...","[watch, clock, and jewelry repair shops]",c,2.39
8,10279061213,Pede Ultrices Industries,"([computer programming , data processing, and ...","[computer programming, data processing, and in...",a,5.71
9,10323485998,Nunc Inc.,"[(furniture, home furnishings and equipment sh...","[furniture, home furnishings and equipment sho...",a,6.61


In [None]:
merchant_df.isna().sum()

merchant_abn    0
name            0
tags            0
categories      0
type            0
take_rate       0
dtype: int64

### Clean merchant fraud dataset

In [None]:
quick_profile(merchant_fraud_df, "merchant_fraud_probability.csv")


=== merchant_fraud_probability.csv: shape=(114, 3) ===


Unnamed: 0,merchant_abn,order_datetime,fraud_probability
0,19492220327,2021-11-28,44.403659
1,31334588839,2021-10-02,42.755301
2,19492220327,2021-12-22,38.86779


Unnamed: 0,merchant_abn,order_datetime,fraud_probability
52,57564805948,2021-11-23,31.268145
77,48534649627,2021-11-26,29.005907
103,15157368385,2021-12-13,64.277413


Unnamed: 0,dtype
merchant_abn,int64
order_datetime,object
fraud_probability,float64


Unnamed: 0,null_rate


Unnamed: 0,nunique
fraud_probability,113
order_datetime,64
merchant_abn,61


In [None]:
# Cast to string for consistency
merchant_fraud_df["merchant_abn"] = merchant_fraud_df["merchant_abn"].astype("string").str.strip()

# Rename fraud_probability → tx_fraud_merchant
merchant_fraud_df.rename(columns={"fraud_probability": "tx_fraud_merchant"}, inplace=True)

# Convert to datetime
merchant_fraud_df["order_datetime"] = pd.to_datetime(merchant_fraud_df["order_datetime"], errors="coerce")
print("Invalid dates:", merchant_fraud_df["order_datetime"].isna().sum())
print("Date range:", merchant_fraud_df["order_datetime"].min(), merchant_fraud_df["order_datetime"].max())

Invalid dates: 0
Date range: 2021-03-25 00:00:00 2022-02-27 00:00:00


In [None]:
# Normalize probability values
merchant_fraud_df["tx_fraud_merchant"] = pd.to_numeric(merchant_fraud_df["tx_fraud_merchant"], errors="coerce")

# normalise 0–100 → 0–1
merchant_fraud_df["tx_fraud_merchant"] = merchant_fraud_df["tx_fraud_merchant"] / 100

merchant_fraud_df["tx_fraud_merchant"].describe()

count    114.000000
mean       0.404193
std        0.171877
min        0.182109
25%        0.289928
50%        0.326920
75%        0.483953
max        0.941347
Name: tx_fraud_merchant, dtype: float64

In [None]:
# Check whether any merchant has more than one distinct fraud probability
merchant_fraud_df.groupby("merchant_abn").size().describe()
multi_prob_merchant = merchant_fraud_df.groupby("merchant_abn")["tx_fraud_merchant"].nunique()
print("Merchant with >1 fraud prob:", (multi_prob_merchant > 1).sum())

sample_multi = merchant_fraud_df[merchant_fraud_df["merchant_abn"].isin(multi_prob_merchant[multi_prob_merchant > 1].index)]
sample_multi.sort_values(["merchant_abn","order_datetime"]).head(10)

Merchant with >1 fraud prob: 17


Unnamed: 0,merchant_abn,order_datetime,tx_fraud_merchant
47,11149063370,2021-08-28,0.564376
69,11149063370,2021-11-14,0.524078
83,11149063370,2022-02-25,0.510154
7,14827550074,2021-11-26,0.464578
11,14827550074,2021-12-05,0.438552
26,14827550074,2021-12-11,0.394064
35,14827550074,2021-12-12,0.382829
46,15043504837,2021-08-29,0.597765
72,15043504837,2021-10-08,0.250544
51,15043504837,2021-12-14,0.261252


In [None]:
# Save files as parquet
merchant_df.to_parquet(OUT / "merchant_details.parquet")
merchant_fraud_df.to_parquet(OUT / "merchant_fraud.parquet")

## Transaction dataset

### Clean transactions dataset

In [None]:
tx.head(3)

Unnamed: 0,user_id,merchant_abn,dollar_value,order_id,order_datetime
0,1,28000487688,133.226894,0c37b3f7-c7f1-48cb-bcc7-0a58e76608ea,2021-02-28
1,18485,62191208634,79.1314,9e18b913-0465-4fd4-92fd-66d15e65d93c,2021-02-28
2,1,83690644458,30.441348,40a2ff69-ea34-4657-8429-df7ca957d6a1,2021-02-28


In [None]:
# Clean and standardize transaction data types
tx["order_datetime"] = pd.to_datetime(tx["order_datetime"], errors="coerce")
tx["order_id"]    = tx["order_id"].astype("string").str.strip()
tx["user_id"]     = tx["user_id"].astype("string").str.strip()
tx["merchant_abn"] = tx["merchant_abn"].astype("string").str.replace(r"\D","",regex=True)
tx["dollar_value"] = pd.to_numeric(tx["dollar_value"], errors="coerce")

In [None]:
# Check if all transaction merchants exist in merchant_details
missing_merchants = tx.loc[~tx["merchant_abn"].isin(merchant_df["merchant_abn"])]

if missing_merchants.empty:
    print("All transactions have matching merchants in merchant_details.")
else:
    print("Number of unique merchants in transactions:", tx["merchant_abn"].nunique())
    print("Number of unique merchants in merchant_details:", merchant_df["merchant_abn"].nunique())
    print("Number of missing merchants:", missing_merchants["merchant_abn"].nunique())

Number of unique merchants in transactions: 4422
Number of unique merchants in merchant_details: 4026
Number of missing merchants: 396


In [None]:
# Keep only transactions with merchants that exist in merchant_df
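# .copy() makes tx an independent frame, so adding columns later does not trigger SettingWithCopyWarning
tx = tx[tx["merchant_abn"].isin(merchant_df["merchant_abn"])].copy()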

print("Transactions after removing missing merchants:", tx.shape)
print("Unique merchants remaining in tx:", tx["merchant_abn"].nunique())

Transactions after removing missing merchants: (13614675, 5)
Unique merchants remaining in tx: 4026
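
For scale, the cost of this filter can be worked out from the two shapes printed in this notebook. The arithmetic below uses only those reported row counts; the variable names are illustrative.

```python
rows_before = 14_195_505   # combined snapshots
rows_after = 13_614_675    # after dropping unmatched merchant ABNs

dropped = rows_before - rows_after
share = 100 * dropped / rows_before

print(f"Dropped {dropped:,} rows ({share:.2f}% of all transactions)")
```

So the 396 unmatched merchant ABNs account for roughly 4% of the raw transactions.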


In [None]:
tx.isna().sum()

user_id           0
merchant_abn      0
dollar_value      0
order_id          0
order_datetime    0
dtype: int64

In [None]:
# Unique counts for key IDs
n_users    = tx["user_id"].nunique(dropna=True)
n_merchants = tx["merchant_abn"].nunique(dropna=True)
n_orders   = tx["order_id"].nunique(dropna=True)

print("Unique user_id:     ", n_users)
print("Unique merchant_abn:", n_merchants)
print("Unique order_id:    ", n_orders)

Unique user_id:      24081
Unique merchant_abn: 4026
Unique order_id:     13614675


## Outlier Analysis

In [None]:
import math

# Add a log-transformed column; log1p(x) = log(1 + x), so zero dollar values are safe
tx["log_dollar_value"] = np.log1p(tx["dollar_value"])

def adaptive_iqr_outliers(series):
    N = len(series.dropna())  # number of records
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    
    # Adaptive multiplier: √log(N) - 0.5
    multiplier = math.sqrt(math.log(N)) - 0.5
    
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    outliers = series[(series < lower) | (series > upper)]
    
    return {
        "Q1": round(Q1, 2),
        "Q3": round(Q3, 2),
        "IQR": round(IQR, 2),
        "lower_bound": round(lower, 2),
        "upper_bound": round(upper, 2),
        "num_outliers": int(outliers.shape[0]),
        "percent_outliers": round(100 * len(outliers) / N, 2),
        "multiplier": round(multiplier, 2)
    }

print("Log Dollar Value Outliers:", adaptive_iqr_outliers(tx["log_dollar_value"]))
print("Take Rate Outliers:", adaptive_iqr_outliers(merchant_df["take_rate"]))

Log Dollar Value Outliers: {'Q1': 3.28, 'Q3': 5.0, 'IQR': 1.72, 'lower_bound': -2.82, 'upper_bound': 11.11, 'num_outliers': 3, 'percent_outliers': 0.0, 'multiplier': 3.55}
Take Rate Outliers: {'Q1': 2.97, 'Q3': 6.03, 'IQR': 3.06, 'lower_bound': -4.32, 'upper_bound': 13.32, 'num_outliers': 0, 'percent_outliers': 0.0, 'multiplier': 2.38}
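
The two multipliers in the output follow directly from the formula √ln(N) − 0.5 with N equal to the number of non-null records. The sketch below reproduces them from the row counts reported earlier; `adaptive_multiplier` is an illustrative name, not a function defined in the notebook.

```python
import math

def adaptive_multiplier(n: int) -> float:
    """IQR multiplier that grows slowly with sample size: sqrt(ln n) - 0.5."""
    return math.sqrt(math.log(n)) - 0.5

m_tx = adaptive_multiplier(13_614_675)   # transactions after merchant filter
m_merchant = adaptive_multiplier(4_026)  # merchants with a take rate

# Sample size at which the rule equals the conventional 1.5:
# sqrt(ln n) - 0.5 = 1.5  ->  ln n = 4  ->  n = e^4
n_equal = math.exp(4)

print(round(m_tx, 2), round(m_merchant, 2), round(n_equal, 1))
```

Since e^4 is about 55, the adaptive rule is more permissive than the fixed 1.5 multiplier for any sample larger than that, which is why so few points are flagged here.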


In [None]:
# Calculate bounds
bounds = adaptive_iqr_outliers(tx["log_dollar_value"])
lower, upper = bounds["lower_bound"], bounds["upper_bound"]

# Find outlier rows (transactions outside the bounds)
outlier_tx = tx[(tx["log_dollar_value"] < lower) | (tx["log_dollar_value"] > upper)]

# Get unique ABNs for outlier merchants
outlier_abns = outlier_tx["merchant_abn"].unique()
print("Outlier merchant ABNs:", outlier_abns)

# Merchant details of outlier merchants
outlier_merchant_details = merchant_df[merchant_df["merchant_abn"].isin(outlier_abns)]
display(outlier_merchant_details)

# Fraud probability of outlier merchants
outlier_fraud_details = merchant_fraud_df[merchant_fraud_df["merchant_abn"].isin(outlier_abns)]

outlier_fraud_details = outlier_fraud_details.merge(
    merchant_df[["merchant_abn", "name"]],
    on="merchant_abn",
    how="left"
)

display(outlier_fraud_details)

Outlier merchant ABNs: <StringArray>
['91880575299', '53918538787', '83199298021']
Length: 3, dtype: string


Unnamed: 0,merchant_abn,name,tags,categories,type,take_rate
1995,53918538787,In Tempus Inc.,"[(antique shops - sales, repairs, and restorat...","[antique shops - sales, repairs, and restorati...",b,3.49
3315,83199298021,Ligula Elit Pretium Foundation,"[[antique shops - sales, repairs, and restorat...","[antique shops - sales, repairs, and restorati...",b,4.82
3669,91880575299,At Foundation,"((antique shops - sales, repairs, and restorat...","[antique shops - sales, repairs, and restorati...",b,3.4


Unnamed: 0,merchant_abn,order_datetime,tx_fraud_merchant,name
0,83199298021,2022-02-27,0.260252,Ligula Elit Pretium Foundation
1,83199298021,2022-02-17,0.2578,Ligula Elit Pretium Foundation
2,83199298021,2021-12-30,0.239986,Ligula Elit Pretium Foundation
3,91880575299,2021-04-17,0.32995,At Foundation
4,83199298021,2022-01-04,0.239203,Ligula Elit Pretium Foundation
5,83199298021,2021-12-14,0.227998,Ligula Elit Pretium Foundation
6,83199298021,2021-03-25,0.690856,Ligula Elit Pretium Foundation


In [None]:
bounds = adaptive_iqr_outliers(tx["log_dollar_value"])
lower, upper = bounds["lower_bound"], bounds["upper_bound"]

# Keep only rows within the bounds
tx = tx[(tx["log_dollar_value"] >= lower) & (tx["log_dollar_value"] <= upper)]

# Drop the helper column
tx = tx.drop(columns=["log_dollar_value"])

print("Transactions after outlier removal:", tx.shape)

Transactions after outlier removal: (13614672, 5)
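
The bounds are expressed on the log scale, so it can help to convert them back to dollars. The sketch below inverts log(1 + x) for the reported upper bound of 11.11; the variable names are illustrative.

```python
import math

upper_log = 11.11   # upper_bound reported for log_dollar_value
lower_log = -2.82   # lower_bound reported for log_dollar_value

# Invert log(1 + x): x = exp(y) - 1
upper_dollars = math.expm1(upper_log)

# log(1 + x) is never negative for x >= 0, so a negative lower bound cannot trigger
lower_can_trigger = lower_log >= 0

print(f"Upper bound in dollars: {upper_dollars:,.0f}")
print("Lower bound can flag rows:", lower_can_trigger)
```

Assuming dollar values are non-negative, all three removed rows are therefore high-value transactions above roughly 66.8 thousand dollars.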


In [None]:
# Save files as parquet
tx.to_parquet(OUT / "transaction.parquet")

## Summary

**Preprocessing Done**

- Renamed the fraud probability columns to tx_fraud_consumer and tx_fraud_merchant
- Rescaled fraud probability values from 0–100 to 0–1
- Confirmed that none of the datasets contain missing values
- Cast ID columns to strings and date columns to datetimes to ease merging later
- Parsed the merchant tags column into categories, type, and take_rate
- Removed transactions whose merchant ABN has no match in merchant_df (396 ABNs), so every remaining transaction has merchant details

**Outlier Analysis Done**

- Applied a log transformation, log(dollar_value + 1), to reduce skewness
- Used an adaptive IQR multiplier, √ln(N) – 0.5, instead of the fixed 1.5, which gives more realistic bounds for large datasets
- Checked continuous variables: take_rate and log_dollar_value
- Fraud probability values fall between 0 and 1 after rescaling, so no further checks were needed
- Found 3 outliers, all in log_dollar_value (none in take_rate) → removed from the transaction dataset
- The outlier transactions came from 3 different merchants, so there is no suspicious pattern from a single merchant

**Outlier Merchants:**
- In Tempus Inc. – one high-value transaction flagged; no entries in the merchant fraud dataset
- Ligula Elit Pretium Foundation – one high-value outlier transaction; six fraud probability entries (0.23–0.69)
- At Foundation – one high-value outlier transaction; a single fraud probability entry of 0.33

**consumer_fraud:**

- Columns = ['user_id', 'order_datetime', 'tx_fraud_consumer', 'consumer_id']
- Records = 34,864 orders
- Range of transactions = 28 Feb 2021 – 27 Feb 2022

**consumer_details:**

- Columns = ['name', 'address', 'state', 'postcode', 'gender', 'consumer_id', 'user_id']
- Records = 499,999 consumers

**merchant_fraud:**

- Columns = ['merchant_abn', 'order_datetime', 'tx_fraud_merchant']
- Records = 114 orders
- Range of orders = 25 Mar 2021 – 27 Feb 2022

**merchant_details:**

- Columns = ['merchant_abn', 'name', 'tags', 'categories', 'type', 'take_rate']
- Records = 4,026 merchants

**transaction:**

- Columns = ['user_id', 'merchant_abn', 'dollar_value', 'order_id', 'order_datetime']
- Initial size = 14,195,505 transactions
- Final size = 13,614,672 transactions (after removing rows with missing merchant details and 3 outliers)
- Range of transactions = 28 Feb 2021 – 26 Oct 2022 (wider than consumer and merchant fraud datasets)