# Segmentation: Manual Ranking

## Import Libraries and Start PySpark

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
import pandas as pd
import requests
import re
import numpy as np
from pyspark.sql.window import Window



In [2]:
# Create a spark session
spark = (
    SparkSession.builder.appName("Manual Ranking Segmentation")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "9g") 
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.network.timeout", "600s")
    .getOrCreate()
)



## Load Data

In [None]:
merchant = spark.read.parquet("../data/curated/merchant_features.parquet")
merchant_fraud = spark.read.parquet("../data/curated/merchant_imputed.parquet")

In [4]:
merchant_df = merchant.toPandas()

In [5]:
merchant_df.columns.tolist()

['merchant_abn',
 'name',
 'tags',
 'categories',
 'type',
 'take_rate',
 'order_count',
 'total_sales',
 'avg_order_value',
 'unique_consumers',
 'repeat_consumers',
 'repurchase_rate']

In [6]:
merchant_df.categories.value_counts()

categories
[digital goods: books, movies, music]                                                      195
[artist supply and craft shops]                                                            193
[computer programming, data processing, and integrated systems design services]            191
[shoe shops]                                                                               185
[gift, card, novelty, and souvenir shops]                                                  182
[furniture, home furnishings and equipment shops, and manufacturers, except appliances]    182
[computers, computer peripheral equipment, and software]                                   181
[florists supplies, nursery stock, and flowers]                                            180
[tent and awning shops]                                                                    178
[cable, satellite, and other pay television and radio services]                            175
[watch, clock, and jewelry repair shops

In [7]:
def clean_text_any(x):
    if x is None or (isinstance(x, float) and pd.isna(x)):
        s = ""
    # if it's already a list -> join
    elif isinstance(x, (list, tuple, set)):
        s = " ".join(map(str, x))
    else:
        s = str(x)

    s = s.lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)   # keep letters/numbers/spaces
    s = re.sub(r"\s+", " ", s).strip()
    return s

# clean → tokens → remove stopwords → join back
stop_words = set(["is","a","the","for","and","to","of","this","except","in","on","at","with","by"])
merchant_df["str_categories"]   = merchant_df["categories"].apply(clean_text_any)
merchant_df["tokens"]           = merchant_df["str_categories"].str.split()
merchant_df["clean_categories"] = merchant_df["tokens"].apply(lambda xs: [w for w in xs if w not in stop_words])
merchant_df["clean_categories_str"] = merchant_df["clean_categories"].apply(lambda xs: " ".join(xs))
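As a quick sanity check, the clean, tokenise, filter, and join steps above can be traced on a single category value. This is a minimal sketch: `normalise_category` is a hypothetical helper that folds the four steps into one function, and the sample value is assumed to arrive as a one-element list, as the `value_counts` output suggests.

```python
import re

# Same stop-word list as the cell above.
STOP_WORDS = {"is", "a", "the", "for", "and", "to", "of", "this",
              "except", "in", "on", "at", "with", "by"}

def normalise_category(raw):
    """Hypothetical one-shot version of clean -> tokenise -> drop stop words -> join."""
    if raw is None:
        return ""
    # Lists/tuples/sets are joined into one string first, as in clean_text_any.
    text = " ".join(map(str, raw)) if isinstance(raw, (list, tuple, set)) else str(raw)
    # Lowercase, then replace anything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # split() with no argument also collapses repeated whitespace.
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)

sample = ["furniture, home furnishings and equipment shops, and manufacturers, except appliances"]
cleaned = normalise_category(sample)
print(cleaned)
# furniture home furnishings equipment shops manufacturers appliances
```

The result matches the cleaned string that appears in the `value_counts` output of the next cell.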

In [8]:
merchant_df["clean_categories_str"].value_counts(dropna=False).head(30)

clean_categories_str
digital goods books movies music                                           195
artist supply craft shops                                                  193
computer programming data processing integrated systems design services    191
shoe shops                                                                 185
gift card novelty souvenir shops                                           182
furniture home furnishings equipment shops manufacturers appliances        182
computers computer peripheral equipment software                           181
florists supplies nursery stock flowers                                    180
tent awning shops                                                          178
cable satellite other pay television radio services                        175
watch clock jewelry repair shops                                           170
bicycle shops sales service                                                170
music shops musical instruments

In [9]:
# Manually categorise each category
segments = [
    "Entertainment & Media", "Entertainment & Media", "Technology", "Beauty", 
    "Miscellaneous", "Office & Home Supplies", "Technology", 
    "Entertainment & Media", "Office & Home Supplies", "Technology", 
    "Beauty", "Miscellaneous", "Entertainment & Media", 
    "Beauty", "Entertainment & Media", "Office & Home Supplies", 
    "Office & Home Supplies", "Miscellaneous", "Miscellaneous", 
    "Entertainment & Media", "Office & Home Supplies", 
    "Office & Home Supplies", "Technology", "Entertainment & Media", "Beauty"
]

In [10]:
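# Pair each cleaned category (in value_counts order) with its manual label.
# NOTE: this relies on `segments` being listed in exactly that order.
segments_dict = dict(zip(merchant_df["clean_categories_str"].value_counts().index, segments))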

In [11]:
segments_dict

{'digital goods books movies music': 'Entertainment & Media',
 'artist supply craft shops': 'Entertainment & Media',
 'computer programming data processing integrated systems design services': 'Technology',
 'shoe shops': 'Beauty',
 'gift card novelty souvenir shops': 'Miscellaneous',
 'furniture home furnishings equipment shops manufacturers appliances': 'Office & Home Supplies',
 'computers computer peripheral equipment software': 'Technology',
 'florists supplies nursery stock flowers': 'Entertainment & Media',
 'tent awning shops': 'Office & Home Supplies',
 'cable satellite other pay television radio services': 'Technology',
 'watch clock jewelry repair shops': 'Beauty',
 'bicycle shops sales service': 'Miscellaneous',
 'music shops musical instruments pianos sheet music': 'Entertainment & Media',
 'health beauty spas': 'Beauty',
 'books periodicals newspapers': 'Entertainment & Media',
 'stationery office supplies printing writing paper': 'Office & Home Supplies',
 'lawn garden s

In [12]:
merchant_df['segment']  = merchant_df['clean_categories_str'].map(lambda x: segments_dict[x])
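Because the labels are assigned by position, a category missing from the dictionary would raise a `KeyError` in the lookup above. One way to check coverage beforehand is sketched below. The frame, the dictionary contents, and the helper `unmapped_categories` are illustrative stand-ins, not part of the notebook's data.

```python
import pandas as pd

# Toy stand-ins for merchant_df and segments_dict (values are illustrative only).
toy_df = pd.DataFrame(
    {"clean_categories_str": ["shoe shops", "tent awning shops", "opticians"]}
)
toy_segments = {
    "shoe shops": "Beauty",
    "tent awning shops": "Office & Home Supplies",
}

def unmapped_categories(df, mapping, column="clean_categories_str"):
    """Return the sorted category strings in `column` that have no entry in `mapping`."""
    return sorted(set(df[column].dropna()) - set(mapping))

missing = unmapped_categories(toy_df, toy_segments)
print(missing)

# Series.map with a dict leaves unmapped keys as NaN instead of raising KeyError,
# so gaps can be inspected after the fact as well.
toy_df["segment"] = toy_df["clean_categories_str"].map(toy_segments)
print(toy_df["segment"].isna().sum())
```

Running the check on the real frame before the mapping step would show whether every cleaned category has a label.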

In [13]:
merchant_df.segment.value_counts()

segment
Entertainment & Media     1153
Office & Home Supplies     937
Technology                 672
Miscellaneous              654
Beauty                     610
Name: count, dtype: int64

In [None]:
merchant_df[['merchant_abn', 'segment', 'categories']].to_parquet("../data/curated/merchant_segment.parquet")

## Ranking based on each segment

In [None]:
initial_ranking = spark.read.parquet("../data/curated/merchant_ranking.parquet")
merchant_segment = spark.read.parquet("../data/curated/merchant_segment.parquet")

In [16]:
# Attach each merchant's segment to the initial ranking (left join on merchant_abn)
merchant_segment_ranking = initial_ranking.join(merchant_segment, how='left', on='merchant_abn')

In [17]:
segments = [
    "Entertainment & Media",
    "Office & Home Supplies",
    "Miscellaneous",
    "Beauty",
    "Technology"
]

for segment in segments:
    print(segment)
    segment_ranking = merchant_segment_ranking.filter(F.col('segment') == segment)
    segment_ranking.orderBy(F.col('final_score').desc()).show(10)

Entertainment & Media
+------------+------------------+---------------------+-------------------+---------------------------+-------------------+--------------------+--------------------+-------------------+--------------------+--------------------+-------------------+-----------------------+-------------------+--------------------+--------------------+
|merchant_abn|      bnpl_revenue|total_tx_dollar_fraud|    repurchase_rate|avg_weighted_consumer_fraud|    growth_rate_pct|   bnpl_revenue_norm|      fraud_sum_norm|    repurchase_norm| avg_cons_fraud_norm|    growth_rate_norm| fraud_sum_norm_inv|avg_cons_fraud_norm_inv|        final_score|             segment|          categories|
+------------+------------------+---------------------+-------------------+---------------------------+-------------------+--------------------+--------------------+-------------------+--------------------+--------------------+-------------------+-----------------------+-------------------+-------------------

## Save top 10 merchants to parquet

In [None]:
candidate = (
    merchant_segment_ranking
    .filter(F.col("segment").isin(segments))
    .filter(F.col("final_score").isNotNull())
)

w = Window.partitionBy("segment").orderBy(
    F.col("final_score").desc(),
    F.col("merchant_abn").asc()
)

top10_per_segment = (
    candidate
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") <= 10)
)

for seg in segments:
    (
        top10_per_segment
        .filter(F.col("segment") == seg)
        .orderBy("rank")
        .write.mode("overwrite")
        .parquet(f"../data/curated/top10_manual_{re.sub(r'[^A-Za-z0-9]+','_', seg.lower())}.parquet")
    )
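The output file names come from the `re.sub` call in the loop above, which lowercases the segment name and collapses every run of non-alphanumeric characters into a single underscore. The sketch below only reproduces that path construction so the resulting names are easy to see. `segment_slug` is a hypothetical helper name.

```python
import re

segments = [
    "Entertainment & Media",
    "Office & Home Supplies",
    "Miscellaneous",
    "Beauty",
    "Technology",
]

def segment_slug(segment):
    """Lowercase the name and replace each run of non-alphanumerics with '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", segment.lower())

paths = [f"../data/curated/top10_manual_{segment_slug(s)}.parquet" for s in segments]
for path in paths:
    print(path)
# ../data/curated/top10_manual_entertainment_media.parquet
# ../data/curated/top10_manual_office_home_supplies.parquet
# ../data/curated/top10_manual_miscellaneous.parquet
# ../data/curated/top10_manual_beauty.parquet
# ../data/curated/top10_manual_technology.parquet
```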


## Justification and Insights

A merchant ranks at the top when it combines high revenue, a high repurchase rate, and low fraud.
High fraud, whether measured as total fraud dollars or as average consumer fraud, pulls the score down even when revenue is large.
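The interaction between these terms can be illustrated with a toy calculation. The sketch below is not the notebook's scoring formula, which is computed upstream in `merchant_ranking.parquet`. It assumes the inverse columns are `1 - normalised value` and that all five components carry equal weight, and the two merchants are made-up rows.

```python
import pandas as pd

# Two made-up merchants with identical revenue, repeat, and growth; only fraud differs.
toy = pd.DataFrame({
    "merchant": ["A", "B"],
    "bnpl_revenue_norm": [0.9, 0.9],
    "repurchase_norm": [0.9, 0.9],
    "growth_rate_norm": [0.5, 0.5],
    "fraud_sum_norm": [0.2, 0.8],
    "avg_cons_fraud_norm": [0.1, 0.7],
})

# Assumption: an "inverse" term is 1 minus the normalised value, so low fraud scores high.
toy["fraud_sum_norm_inv"] = 1 - toy["fraud_sum_norm"]
toy["avg_cons_fraud_norm_inv"] = 1 - toy["avg_cons_fraud_norm"]

# Assumption: equal weights. The real weights are not stated in this notebook.
WEIGHTS = {
    "bnpl_revenue_norm": 0.2,
    "repurchase_norm": 0.2,
    "growth_rate_norm": 0.2,
    "fraud_sum_norm_inv": 0.2,
    "avg_cons_fraud_norm_inv": 0.2,
}

toy["sketch_score"] = sum(weight * toy[col] for col, weight in WEIGHTS.items())
print(toy[["merchant", "sketch_score"]])
```

Under these assumptions merchant A outscores merchant B purely because of its lower fraud terms, which is the effect described above.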

1) Entertainment & Media
- Pattern: high repurchase (0.94–1.00) and strong revenue_norm; fraud_sum_norm is often high (0.69–0.83), but the inverse terms partially offset it.
- Takeaway: This segment earns high ranks when repeat is strong. Where repeat collapses, even big revenue can’t carry them.
- Action:
    - Grow: high-repurchase, moderate-fraud merchants (e.g., music/digital goods).
    - Watch: accounts with low repeat despite high revenue; probe customer experience (CX), refunds, or category churn.
2) Office & Home Supplies
- Pattern: wide spread. Some have excellent repeat (~1.0) and big revenue, but others show extreme avg consumer fraud (e.g., 206–241), dragging scores.
- Takeaway: Fraud quality (avg_cons_fraud) is the decisive separator within this segment.
- Action:
    - De-risk: merchants with avg_cons_fraud > ~60 → tighten KYC/velocity rules, lower limits, step-up auth.
    - Upsell: the near-1.0 repurchase cohort (tents/furniture) with loyalty offers; they have sticky demand.
3) Miscellaneous
- Pattern: consistently high repurchase (~0.9–1.0) and high revenue_norm; fraud is generally mid-range, so the inverse fraud terms help and the top scores land around 0.70–0.76.
- Takeaway: Solid “core” segment for profitable growth; fraud stays in check.
- Action:
    - Scale: marketing credit lines/promos to top repeat merchants.
    - Monitor: sub-categories with avg_cons_fraud > ~70.
4) Beauty
- Pattern: high repurchase and revenue_norm at the top (scores ~0.78), but many mid-ranked merchants have moderate fraud and middling repeat (0.35–0.66).
- Takeaway: Beauty is attractive where loyalty is strong; otherwise, scores flatten.
- Action:
    - Grow: high-repeat beauty merchants (loyalty-driven SKUs).
    - Fix: low-repeat sub-categories (shoes/watches) with targeted incentives or better UX.
5) Technology
- Pattern: revenue often strong, but avg_cons_fraud spikes (e.g., 87–234) + variable repeat → scores cluster around 0.54–0.75.
- Takeaway: Fraud quality & low repeat are the blockers. Tech looks good only where repeat stays high.
- Action:
    - De-risk: telecom/computers with avg_cons_fraud > ~80 (hard flags).
    - Keep/grow: software/services with sticky subscriptions (high repeat).
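
The approximate fraud cut-offs mentioned in the actions above (60, 70, and 80) could be applied as a per-segment flag. The sketch below assumes `avg_cons_fraud` refers to the raw `avg_weighted_consumer_fraud` column and treats the approximate cut-offs as exact values. The rows are made-up examples, and `needs_review` is a hypothetical column name.

```python
import pandas as pd

# Cut-offs quoted in the notes above; segments without one are never flagged.
FRAUD_THRESHOLDS = {
    "Office & Home Supplies": 60,
    "Miscellaneous": 70,
    "Technology": 80,
}

# Made-up rows for illustration; merchant_abn values are placeholders.
toy = pd.DataFrame({
    "merchant_abn": [1, 2, 3, 4],
    "segment": ["Office & Home Supplies", "Technology", "Technology", "Beauty"],
    "avg_weighted_consumer_fraud": [206.0, 87.0, 40.0, 95.0],
})

# Look up each row's segment threshold; segments without a cut-off map to NaN.
limits = toy["segment"].map(FRAUD_THRESHOLDS)

# Comparisons against NaN evaluate to False, so unlisted segments stay unflagged.
toy["needs_review"] = toy["avg_weighted_consumer_fraud"] > limits
print(toy[["merchant_abn", "segment", "needs_review"]])
```

The same comparison could be written in PySpark with a small lookup frame joined on `segment`; pandas is used here only to keep the example short.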