<a href="https://colab.research.google.com/github/jamesemansfield2/Customer-Level-Financial-Services-Product-Upsell-Recommender-Model-System/blob/main/Copy_of_Upsales_recommender_model_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer-Level Financial Product Upsell Recommender

This notebook builds and runs a **recommender system** that suggests the next-best financial products for existing bank customers. The goal is to help relationship managers (RMs), branches, and digital channels **prioritize who to talk to, about what, and why**, using both customer attributes and behavior patterns.

The model outputs a **ranked list of product recommendations per customer**, each with:

1. `customer_id` – who we should target  
2. `product_id` / `product_type` – what we should offer  
3. `score` – how strong the recommendation is (higher = better)  
4. `reasons` – *why* this product was recommended (interpretable tags)

This makes the output directly usable for:
- RM lead lists,
- outbound / dialer campaigns,
- in-app cross-sell banners,
- A/B testing of offer strategies.

---

## 1. Problem Framing

Most retail / affluent banking portfolios have:
- **High product penetration in 1–2 products** (usually CASA + 1 credit product),
- **Lots of headroom** in unsecured loans, premium cards, and secured/HL products,
- **Fragmented signals** across transactions, income, demographics, and peer behavior.

This notebook simulates a **Next Best Offer (NBO)** / **Next Best Product (NBP)** engine that scores each (customer, product) pair and returns only the top ones.

---

## 2. Data & Signals (Conceptual)

The recommender can be fed from typical banking data domains:

- **Customer profile**: age, income/salary band, segment (mass / mass affluent / HNI), location  
- **Relationship**: existing products, vintage, balances  
- **Behavior**: spend spikes around salary date, card usage, repayment regularity  
- **Peer / collaborative signals**: “customers like you also took…”  
- **Product rules**: income thresholds, segment eligibility, HNI-only lines  

In the sample output, you can see these features show up as human-readable reasons such as:
- `Meets income requirement`
- `Spends spike near salary`
- `Income-product fit`
- `Similar users liked this product`
- `Not HNI-focused`

These reasons are generated from the feature checks that fired for that (customer, product) pair.
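
A minimal sketch of how such reason tags could be produced is shown below. The rule table, the thresholds, and the field names (`income`, `min_income`, `salary_seasonality`, `segment_bias`) are illustrative assumptions, not the notebook's actual reason logic.

```python
from typing import Callable

# Hypothetical rule table: reason tag -> check on a (customer, product) pair.
# Thresholds are made up for illustration.
REASON_CHECKS: dict[str, Callable[[dict, dict], bool]] = {
    "Meets income requirement": lambda c, p: c["income"] >= p["min_income"],
    "Spends spike near salary": lambda c, p: c["salary_seasonality"] > 0.25,
    "Not HNI-focused": lambda c, p: p["segment_bias"] != "HNI",
}

def reasons_for_pair(customer: dict, product: dict) -> list[str]:
    """Return the tag of every check that fired for this pair."""
    return [tag for tag, check in REASON_CHECKS.items() if check(customer, product)]

customer = {"income": 9.2, "salary_seasonality": 0.31}
product = {"min_income": 6.0, "segment_bias": "Mass"}
print(reasons_for_pair(customer, product))
```

Keeping the checks in one table means the same definitions can drive both scoring features and the displayed `reasons`, so the explanation cannot drift from the logic that produced it.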

---

## 3. Model / Approach (High Level)

The notebook follows a **hybrid recommender** pattern:

1. **Rule / eligibility filtering**  
   - Don’t recommend products the customer is clearly ineligible for (e.g. a premium card without sufficient income).  
   - Don’t recommend products the customer already holds (unless you allow “upgrade” scenarios).

2. **Scoring layer**  
   - For every eligible (customer, product), we create features and run them through a scoring function.  
   - This can be:  
     - a simple weighted rules engine,  
     - a similarity / collaborative score,  
     - or a learned model (e.g. XGBoost / LightGBM) if you have labeled “accepted offer” data.  
   - The output is a **continuous score** between 0 and 1 (in this sample, scores are ~0.11–0.17).

3. **Explainability**  
   - Alongside the score, the notebook collects all the **rules/conditions that fired** and stores them in `reasons`.  
   - This makes the model **auditable and business-friendly**.

4. **Ranking**  
   - For each customer, sort descending by score.  
   - Keep top *N* (e.g. top 3 per customer) for actual campaign use.
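
The filter-then-rank flow above can be sketched in a few lines of plain Python. The tuple layout and the sample values are hypothetical; the scores are assumed to come from whichever scoring layer is in use.

```python
from collections import defaultdict

# Hypothetical scored pairs: (customer_id, product_id, eligible, already_held, score)
pairs = [
    (1, 10, True, False, 0.12),
    (1, 11, True, False, 0.17),
    (1, 12, False, False, 0.30),  # ineligible: dropped despite the high score
    (1, 13, True, True, 0.25),    # already held: dropped
    (2, 10, True, False, 0.11),
    (2, 11, True, False, 0.15),
]

def top_n_per_customer(pairs, n=3):
    """Apply the eligibility filter, then keep the n best-scored products per customer."""
    by_customer = defaultdict(list)
    for cid, pid, eligible, held, score in pairs:
        if eligible and not held:              # step 1: rule / eligibility filter
            by_customer[cid].append((score, pid))
    # step 4: sort descending by score and keep the top n product ids
    return {
        cid: [pid for _, pid in sorted(items, reverse=True)[:n]]
        for cid, items in by_customer.items()
    }

print(top_n_per_customer(pairs, n=2))
```

Filtering before ranking matters: product 12 has the highest raw score for customer 1 but never reaches the ranked list because it fails eligibility.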

---

## 4. Product Codes in Output

You will see product types like:

- `CC` – Credit Card (core / mass)
- `PremiumCC` – Premium / high-limit credit card
- `EL` – Education / Personal / Employee Loan (depending on your mapping)
- `HL` – Home Loan / Housing Loan
- `GL` – General Loan / Gold Loan / Term Loan (placeholder for secured / non-card lending)

You can relabel these to your bank’s actual product catalogue.

---

## 5. Sample Output Explained

### Model Output Snapshot

```text
    customer_id  product_id product_type     score  reasons
0           636           6           EL  0.166882  [Spends spike near salary, Income-product fit,...
1           636           4    PremiumCC  0.150366  [Meets income requirement, Weak collaborative ...
2           636          10    PremiumCC  0.113097  [Meets income requirement, Spends spike near s...
3           382           1           HL  0.126540  [Meets income requirement, Spends spike near s...
4           382          11           GL  0.126540  [Meets income requirement, Spends spike near s...
..          ...         ...          ...       ...  ...
63          524           5           EL  0.123357  [Similar users liked this product, Not HNI-foc...
64          524           3           HL  0.121895  [Meets income requirement, Spends spike near s...
65          501           1           HL  0.171219  [Meets income requirement, Income-product fit,...
66          501          11           GL  0.171219  [Meets income requirement, Income-product fit,...
67          501           2           CC  0.164791  [Income-product fit, Spends spike near salary,...
[Time] Recommendations & artifacts done. 0.96s
```

## Interpretation of Results

### Customer 636

**Top recommendation:** `EL (0.1669)`  
→ Behavior and income align with loan eligibility (“spends spike near salary”, “income-product fit”).

**Secondary:** `PremiumCC` options with slightly lower scores (0.15, 0.11) – still good fits but lower predicted engagement.

---

### Customer 382

**Two tied recommendations:** (`HL` and `GL`, both 0.1265).  
→ Customer meets baseline income and spend pattern, but both secured loans look equally suitable — no strong behavioral differentiation yet.

---

### Customer 501

**Strong multi-product potential:** `HL`, `GL`, and `CC` all score ≥ 0.16.  
→ Suggests this customer is financially strong, a consistent spender, and open to additional products.  
→ For campaign prioritization, lead with the **home loan** (joint-highest score, and assumed here to be the most profitable product), followed by the **credit card** for bundling opportunities.

---

## 6. Score Interpretation

- The **score is relative**, not a probability.  
- Focus on **rank within customer** rather than the absolute number.  
- In practice, set a **cutoff** (e.g., `0.10`) to suppress weak or uncertain recommendations.
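
As an illustration of the cutoff, the sketch below takes customer 636's three scores from the sample output, drops anything under the threshold, and assigns a within-customer rank. The function name and the record layout are assumptions for this example.

```python
# Scores for customer 636, copied from the sample output above.
recs = [
    {"customer_id": 636, "product_id": 6, "score": 0.166882},
    {"customer_id": 636, "product_id": 4, "score": 0.150366},
    {"customer_id": 636, "product_id": 10, "score": 0.113097},
]

def rank_with_cutoff(recs, cutoff=0.10):
    """Drop recommendations below the cutoff, then rank the rest (1 = best).

    Assumes all records belong to a single customer.
    """
    kept = [r for r in recs if r["score"] >= cutoff]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return [{**r, "rank": i} for i, r in enumerate(kept, start=1)]

for row in rank_with_cutoff(recs, cutoff=0.12):
    print(row)
```

With the suggested `0.10` cutoff all three products survive; raising it to `0.12` removes the weaker `PremiumCC` offer (product 10).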

---

## 7. Business Explainability: `reasons` Field

Each recommendation includes an interpretable list of business-driven factors, for example:

- `Meets income requirement` → customer passes product thresholds  
- `Spends spike near salary` → salary-linked spend profile  
- `Income-product fit` → income in product target range  
- `Similar users liked this product` → collaborative similarity  
- `Not HNI-focused` → product is suitable for non-HNI segments  

These explanations are ideal for:

- **RM scripting:**  
  “You may want to consider a Gold Loan, given your regular income and peer pattern.”

- **In-app transparency:**  
  “We recommend this because your spending pattern matches eligible customers.”

- **Audit & compliance:**  
  Ensures traceable logic for recommendations, aiding regulatory interpretability and governance.
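
One way to turn reason tags into customer-facing text is a simple lookup from tag to phrase. The sketch below is illustrative only: the wording in `REASON_TEXT` and the `explain` helper are hypothetical and would need review by business and compliance teams.

```python
# Hypothetical mapping from reason tag to customer-facing phrase.
REASON_TEXT = {
    "Meets income requirement": "your income meets the product threshold",
    "Spends spike near salary": "your spending rises around salary day",
    "Similar users liked this product": "customers with a similar profile chose it",
}

def explain(product_name, reasons, max_reasons=2):
    """Build a one-sentence explanation from the tags that have approved wording."""
    phrases = [REASON_TEXT[r] for r in reasons if r in REASON_TEXT][:max_reasons]
    if not phrases:
        return f"We recommend {product_name}."
    return f"We recommend {product_name} because " + " and ".join(phrases) + "."

print(explain("Gold Loan", ["Meets income requirement", "Similar users liked this product"]))
```

Tags without approved wording are silently skipped, so a new internal reason code never leaks raw into a customer message.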

---

## 8. Next Steps / Extensions

- Add **historical acceptance labels** for supervised learning (e.g., LightGBM classifier).  
- Integrate **profitability, delinquency, and risk factors** for smarter prioritization.  
- Deploy within **CRM / campaign tools** with attached reason codes for explainability.  
- Conduct **uplift / A/B testing** to measure incremental conversion and ROI impact.

---

## 9. Summary

This recommender provides:

- **Customer-level next best product scores**  
- **Explainable “why” reasons per recommendation**  
- **A foundation for personalized, data-driven upsell campaigns**

> “For each customer, here are the top products to offer, in order, and the reasons why they’re a fit.”



In [1]:
##############
# 1) SETUP, IMPORTS, CONFIG, HELPERS
##############
import os, json, math, random, time, gc
from datetime import datetime, timedelta
from collections import defaultdict
import numpy as np
import pandas as pd

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

print("[Block 1] Imports loaded.")

# ---- Reproducibility ----
RNG = np.random.default_rng(7)
random.seed(7)

# ---- Paths / Output dir ----
OUT_DIR = "./bank_reco_case_study"
os.makedirs(OUT_DIR, exist_ok=True)
print(f"[Block 1] OUT_DIR set: {OUT_DIR}")

# ---- Config (tunable) ----
N_CUSTOMERS  = 1200
N_PRODUCTS   = 12
N_MERCHANTS  = 50
N_CAMPAIGNS  = 6
K_EMBED      = min(8, N_PRODUCTS-1)   # SVD components
K_RECO       = 10
CONTACT_CAP  = 3                      # weekly cap for recommendations

print(f"[Block 1] Config -> customers={N_CUSTOMERS}, products={N_PRODUCTS}, merchants={N_MERCHANTS}, campaigns={N_CAMPAIGNS}, k_embed={K_EMBED}")

# ---- Time helpers ----
def tic():
    return time.time()

def toc(t0, msg=""):
    print(f"[Time] {msg} {time.time()-t0:.2f}s")

# ---- UI helper fallback (for non-notebook environments) ----
try:
    from caas_jupyter_tools import display_dataframe_to_user  # may not exist on your machine
    print("[Block 1] display_dataframe_to_user available (notebook UI).")
except Exception:
    def display_dataframe_to_user(name: str, df: pd.DataFrame):
        safe = "".join(c if c.isalnum() or c in (" ","_","-") else "_" for c in name).strip().replace(" ","_")
        path = os.path.join(OUT_DIR, f"{safe}.csv")
        try:
            df.to_csv(path, index=False)
            print(f"[Block 1] [Saved CSV] {name} -> {path} (showing head below)")
        except Exception as e:
            print(f"[Block 1] Could not save CSV for {name} due to {e}. Printing head:")
        print(df.head())

print("[Block 1] Setup complete.\n")


[Block 1] Imports loaded.
[Block 1] OUT_DIR set: ./bank_reco_case_study
[Block 1] Config -> customers=1200, products=12, merchants=50, campaigns=6, k_embed=8
[Block 1] Setup complete.



In [2]:
##############
# 2) SYNTHETIC DATA GENERATION
##############
t0 = tic()

# Customers
cust = pd.DataFrame({
    "customer_id": np.arange(N_CUSTOMERS),
    "age": RNG.integers(21, 70, N_CUSTOMERS),
    "income": RNG.normal(8, 3, N_CUSTOMERS).clip(2, 25),
    "city_tier": RNG.choice([1,2,3], N_CUSTOMERS, p=[0.35,0.45,0.20]),
    "channel_pref": RNG.choice(["mobile","web","branch","phone"], N_CUSTOMERS, p=[0.5,0.25,0.2,0.05]),
    "segment": RNG.choice(["Mass","Affluent","HNI"], N_CUSTOMERS, p=[0.6,0.3,0.1]),
})
cust["salary_day"] = RNG.integers(25, 31, N_CUSTOMERS)

# Products
prod_types = ["CC","PL","FD","MF","HL","EL","GL","TravelCC","PremiumCC","Insurance"]
base_products = []
for i in range(N_PRODUCTS):
    pt = RNG.choice(prod_types)
    base_products.append({
        "product_id": i,
        "product_type": pt,
        "min_income": float(max(2, RNG.normal(6,2))),
        "segment_bias": RNG.choice(["Mass","Affluent","HNI","All"], p=[0.35,0.3,0.2,0.15]),
        "is_premium": int(pt in ["PremiumCC","MF","Insurance"] and RNG.random() < 0.6)
    })
prod = pd.DataFrame(base_products)

# Merchants
mcc = pd.DataFrame({
    "merchant_id": np.arange(N_MERCHANTS),
    "mcc": RNG.choice(["groceries","fuel","dining","online","travel","utilities","fashion","electronics"], N_MERCHANTS)
})

# Transactions for last 180 days
def synth_txn_for_customer(cid: int) -> pd.DataFrame:
    n = RNG.integers(15,60)
    days = RNG.integers(1,180,n)
    amounts = np.abs(RNG.normal(1.5, 1.0, n)) * 1000
    merchants = RNG.integers(0, N_MERCHANTS, n)
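    sal_day = cust.loc[cid,"salary_day"]
    # NOTE: `dom` is a synthetic day-of-month drawn independently of `days_ago`;
    # it only drives the +20% spend bump within 3 days of salary day.
    dom = RNG.integers(1,31,n)
    season_bump = 1.0 + 0.2 * (np.abs(dom - sal_day) <= 3)
    amounts *= season_bump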
    channel = RNG.choice(["mobile","web","branch","phone"], n, p=[0.5,0.25,0.2,0.05])
    return pd.DataFrame({
        "customer_id": cid,
        "days_ago": days,
        "amount": amounts,
        "merchant_id": merchants,
        "channel": channel
    })

txns = pd.concat([synth_txn_for_customer(i) for i in range(N_CUSTOMERS)], ignore_index=True)
txns = txns.merge(mcc, on="merchant_id", how="left")

print(f"[Block 2] Customers: {cust.shape}, Products: {prod.shape}, Merchants: {mcc.shape}, Txns: {txns.shape}")
display_dataframe_to_user("Customers_preview", cust.head(10))
display_dataframe_to_user("Products_preview", prod.head(10))
display_dataframe_to_user("Transactions_preview", txns.head(10))
toc(t0, "Data generation done.")
print()


[Block 2] Customers: (1200, 7), Products: (12, 5), Merchants: (50, 2), Txns: (44450, 6)
[Block 1] [Saved CSV] Customers_preview -> ./bank_reco_case_study/Customers_preview.csv (showing head below)
   customer_id  age     income  city_tier channel_pref   segment  salary_day
0            0   67   9.187192          2       branch      Mass          29
1            1   51   8.330715          3       mobile      Mass          26
2            2   54  10.984405          3        phone      Mass          25
3            3   64   5.682896          3       mobile  Affluent          26
4            4   49   7.831770          2       mobile  Affluent          28
[Block 1] [Saved CSV] Products_preview -> ./bank_reco_case_study/Products_preview.csv (showing head below)
   product_id product_type  min_income segment_bias  is_premium
0           0           PL   11.105766     Affluent           0
1           1           HL    5.917203         Mass           0
2           2           CC    6.966036    

In [3]:
##############
# 3) PRODUCT HOLDINGS
##############
t0 = tic()
counts = RNG.integers(1,4, N_CUSTOMERS)
holds_cids = np.repeat(np.arange(N_CUSTOMERS), counts)
holds_pids = RNG.integers(0, N_PRODUCTS, holds_cids.shape[0])
holdings = pd.DataFrame({"customer_id": holds_cids, "product_id": holds_pids}).drop_duplicates()

print(f"[Block 3] Holdings: {holdings.shape}, avg holdings per cust ~ {len(holdings)/N_CUSTOMERS:.2f}")
display_dataframe_to_user("Holdings_preview", holdings.head(10))
toc(t0, "Holdings generated.")
print()


[Block 3] Holdings: (2273, 2), avg holdings per cust ~ 1.89
[Block 1] [Saved CSV] Holdings_preview -> ./bank_reco_case_study/Holdings_preview.csv (showing head below)
   customer_id  product_id
0            0           0
1            0           1
2            0           4
3            1           3
4            1           2
[Time] Holdings generated. 0.01s



In [4]:
##############
# 4) FEATURE ENGINEERING
##############
t0 = tic()
now = datetime(2025,10,25)
txns["date"] = pd.Timestamp(now) - pd.to_timedelta(txns["days_ago"], unit="D")  # vectorized date offset

# RFM
rfm = txns.groupby("customer_id").agg(
    recency_days=("days_ago","min"),
    frequency=("customer_id","count"),
    monetary=("amount","mean")
).reset_index()

# Channel share
ch_share = txns.pivot_table(index="customer_id", columns="channel", values="amount", aggfunc="count", fill_value=0)
ch_share = ch_share.div(ch_share.sum(axis=1), axis=0).reset_index()

# Salary proximity
txns["day_of_month"] = txns["date"].dt.day
tmp = txns.merge(cust[["customer_id","salary_day"]], on="customer_id")
tmp["near_salary"] = (np.abs(tmp["day_of_month"] - tmp["salary_day"]) <= 3).astype(int)
sal_feat = tmp.groupby("customer_id")["near_salary"].mean().rename("salary_seasonality").reset_index()

# MCC distribution (top-3 one-hot)
mcc_counts = txns.groupby(["customer_id","mcc"]).size().reset_index(name="cnt")
mcc_top = mcc_counts.sort_values(["customer_id","cnt"], ascending=[True,False]).groupby("customer_id").head(3)
mcc_piv = pd.crosstab(mcc_top["customer_id"], mcc_top["mcc"]).reset_index()

# Merge features
feat = cust.merge(rfm, on="customer_id", how="left")\
           .merge(ch_share, on="customer_id", how="left")\
           .merge(sal_feat, on="customer_id", how="left")\
           .merge(mcc_piv, on="customer_id", how="left").fillna(0)

numeric_cols = ["age","income","recency_days","frequency","monetary","salary_seasonality"]
scaler = StandardScaler()
feat[numeric_cols] = scaler.fit_transform(feat[numeric_cols])

print(f"[Block 4] Features built: {feat.shape}, numeric_cols={numeric_cols}")
display_dataframe_to_user("Features_preview", feat.head(10))
toc(t0, "Feature engineering done.")
print()


[Block 4] Features built: (1200, 23), numeric_cols=['age', 'income', 'recency_days', 'frequency', 'monetary', 'salary_seasonality']
[Block 1] [Saved CSV] Features_preview -> ./bank_reco_case_study/Features_preview.csv (showing head below)
   customer_id       age    income  city_tier channel_pref   segment  \
0            0  1.529366  0.396287          2       branch      Mass   
1            1  0.395846  0.109841          3       mobile      Mass   
2            2  0.608381  0.997360          3        phone      Mass   
3            3  1.316831 -0.775715          3       mobile  Affluent   
4            4  0.254156 -0.057030          2       mobile  Affluent   

   salary_day  recency_days  frequency  monetary  ...       web  \
0          29      1.120365   0.904508  0.775519  ...  0.306122   
1          26      0.757738  -0.003152  0.156974  ...  0.270270   
2          25      0.032485   1.207061 -0.522615  ...  0.358491   
3          26     -0.330141  -1.213364 -1.096078  ...  0.142

In [5]:
##############
# 5) GROUND-TRUTH ADOPTION PROPENSITY (for labels and simulation)
##############
t0 = tic()
seg_map = {"Mass":0, "Affluent":0.5, "HNI":1.0}
chan_map = {"mobile":0.8, "web":0.4, "branch":0.2, "phone":0.1}
prod_bias = prod["is_premium"].map({0:-0.1, 1:0.2}).values

cust_vec = (
    0.2*cust["income"] + 0.1*cust["age"]/70 + 0.2*cust["segment"].map(seg_map).values +
    0.1*cust["channel_pref"].map(chan_map).values + 0.1*feat["salary_seasonality"].values -
    0.05*cust["city_tier"]/3
)
cust_vec = (cust_vec - np.mean(cust_vec))/np.std(cust_vec)

prod_vec = (
    0.2*prod["min_income"].values/25 +
    0.2*(prod["segment_bias"].map({"Mass":0,"Affluent":0.5,"HNI":1,"All":0.4}).values) +
    0.2*prod_bias
)
prod_vec = (prod_vec - np.mean(prod_vec))/np.std(prod_vec)

true_score = np.outer(cust_vec, prod_vec) + RNG.normal(0,0.5,(N_CUSTOMERS,N_PRODUCTS))
print(f"[Block 5] true_score shape={true_score.shape}, mean={true_score.mean():.3f}, std={true_score.std():.3f}")
toc(t0, "Ground-truth created.")
print()


[Block 5] true_score shape=(1200, 12), mean=-0.000, std=1.112
[Time] Ground-truth created. 0.01s



In [6]:
##############
# 6) ELIGIBILITY RULES
##############
t0 = tic()
hold_set = set(map(tuple, holdings[["customer_id","product_id"]].to_records(index=False)))
prod_min_income = prod.set_index("product_id")["min_income"].to_dict()

def eligible_products_for_customer(cid: int):
    inc = cust.loc[cid,"income"]
    cands = []
    for pid in range(N_PRODUCTS):
        if (cid,pid) in hold_set:
            continue
        if inc < prod_min_income[pid]*0.8:  # relaxed threshold
            continue
        cands.append(pid)
    return cands

eligible = {cid: eligible_products_for_customer(cid) for cid in range(N_CUSTOMERS)}
avg_cands = np.mean([len(v) for v in eligible.values()])
print(f"[Block 6] Eligibility computed. avg candidates per user ≈ {avg_cands:.2f}")
toc(t0, "Eligibility ready.")
print()


[Block 6] Eligibility computed. avg candidates per user ≈ 8.15
[Time] Eligibility ready. 0.09s



In [7]:
##############
# 7) MATRIX FACTORIZATION (implicit) via SVD
##############
t0 = tic()
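# Map each merchant category (MCC) to the product types it signals interest in.
# Values must be product types from `prod_types`, not MCC names.
mcc_to_prodtype = {
    "travel":["TravelCC","CC"],
    "fuel":["CC"],
    "groceries":["CC","Insurance"],
    "dining":["CC"],
    "online":["CC","EL"],
    "utilities":["PL","Insurance","FD"],
    "fashion":["CC","PL"],
    "electronics":["EL","CC"]
}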
ptype_to_pids = {pt: prod[prod["product_type"]==pt]["product_id"].tolist() for pt in prod_types}

interest = np.zeros((N_CUSTOMERS, N_PRODUCTS))
cust_mcc = txns.groupby(["customer_id","mcc"])["amount"].sum().reset_index()
for _, row in cust_mcc.iterrows():
    cid, cat, amt = int(row["customer_id"]), row["mcc"], float(row["amount"])
    for pt in mcc_to_prodtype.get(cat, []):
        for pid in ptype_to_pids.get(pt, []):
            interest[cid, pid] += math.log1p(amt)

svd = TruncatedSVD(n_components=K_EMBED, random_state=42)
U = svd.fit_transform(interest + 1e-6)  # user embeddings
V = svd.components_.T                   # item embeddings
mf_score = U @ V.T                      # dense score matrix

print(f"[Block 7] Interest shape={interest.shape}, SVD comps={K_EMBED}, explained_var≈{svd.explained_variance_ratio_.sum():.3f}")
toc(t0, "MF ready.")
print()


[Block 7] Interest shape=(1200, 12), SVD comps=8, explained_var≈1.000
[Time] MF ready. 1.12s
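
The `explained_var≈1.000` figure is what the construction of `interest` would lead one to expect. Each product column is a sum of per-category log-spend columns, so (ignoring the tiny `1e-6` offset) the matrix is a product of a customers × categories matrix and a categories × products link matrix. Its rank therefore cannot exceed the 8 MCC categories, which matches `K_EMBED = 8`. The sketch below checks this on random stand-in data; `spend` and `link` are hypothetical substitutes for the notebook's log-spend totals and `mcc_to_prodtype` mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers, n_categories, n_products = 50, 8, 12

spend = rng.random((n_customers, n_categories))         # stand-in for log1p(spend) per category
link = rng.integers(0, 2, (n_categories, n_products))   # stand-in for category -> product links
interest = spend @ link                                  # same additive construction as Block 7

rank = np.linalg.matrix_rank(interest)

# Rank-8 reconstruction from the top 8 singular triplets.
U, S, Vt = np.linalg.svd(interest, full_matrices=False)
approx = (U[:, :8] * S[:8]) @ Vt[:8]

print("rank:", rank)
print("max abs reconstruction error:", np.abs(interest - approx).max())
```

In other words, 8 components reproduce the matrix almost exactly, so the SVD here is close to a lossless re-encoding of the category-spend signal rather than a learned compression.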



In [8]:
##############
# 8) TRAINING DATA FROM SIMULATED CAMPAIGNS
##############
t0 = tic()
rows = []
campaign_dates = [datetime(2025,10,25) - timedelta(days=7*i) for i in range(N_CAMPAIGNS,0,-1)]
for ci, _ in enumerate(campaign_dates):
    for cid in range(N_CUSTOMERS):
        cands = eligible[cid]
        if not cands:
            continue
        # sample a small set per user/campaign (pressure cap * 2)
        size = min(CONTACT_CAP*2, len(cands))
        if size == 0:
            continue
        sampled = RNG.choice(cands, size=size, replace=False)
        for pid in sampled:
            y_prob = 1/(1+np.exp(-(true_score[cid,pid])))
            y = int(RNG.random() < y_prob*0.25)   # scaled response rate
            rows.append({
                "campaign_ix": ci,
                "customer_id": cid,
                "product_id": pid,
                "label": y,
                "mf_score": mf_score[cid,pid],
                "income": cust.loc[cid,"income"],
                "age": cust.loc[cid,"age"],
                "seg_hni": int(cust.loc[cid,"segment"]=="HNI"),
                "seg_affluent": int(cust.loc[cid,"segment"]=="Affluent"),
                "channel_mobile": int(cust.loc[cid,"channel_pref"]=="mobile"),
                # .iloc[0] extracts the scalar; float() on a one-element Series is deprecated
                "salary_seasonality": float(feat.loc[feat["customer_id"]==cid,"salary_seasonality"].iloc[0]),
                "prod_min_income": float(prod_min_income[pid]),
                "is_premium": int(prod.loc[pid,"is_premium"]),
            })
train_df = pd.DataFrame(rows)

print(f"[Block 8] Training samples: {train_df.shape}, positives={train_df['label'].sum()}")
display_dataframe_to_user("Train_samples_preview", train_df.head(10))
toc(t0, "Training data built.")
print()


[Block 8] Training samples: (38568, 13), positives=4769
[Block 1] [Saved CSV] Train_samples_preview -> ./bank_reco_case_study/Train_samples_preview.csv (showing head below)
   campaign_ix  customer_id  product_id  label   mf_score    income  age  \
0            0            0           7      0   9.608230  9.187192   67   
1            0            0          11      0   0.000001  9.187192   67   
2            0            0           3      0   0.000001  9.187192   67   
3            0            0           8      0  17.182877  9.187192   67   
4            0            0           2      0  61.393805  9.187192   67   

   seg_hni  seg_affluent  channel_mobile  salary_seasonality  prod_min_income  \
0        0             0               0            0.208527         6.166886   
1        0             0               0            0.208527         6.028343   
2        0             0               0            0.208527         6.255875   
3        0             0               0      

In [9]:
##############
# 9) TRAIN GBM RE-RANKER
##############
t0 = tic()
train_idx = train_df["campaign_ix"] < (N_CAMPAIGNS-2)
valid_idx = ~train_idx

X_train = train_df[train_idx].drop(columns=["label","customer_id","product_id","campaign_ix"])
y_train = train_df[train_idx]["label"].values
X_valid = train_df[valid_idx].drop(columns=["label","customer_id","product_id","campaign_ix"])
y_valid = train_df[valid_idx]["label"].values

gbm = GradientBoostingClassifier(random_state=42)
gbm.fit(X_train, y_train)
valid_pred = gbm.predict_proba(X_valid)[:,1]
valid_auc = roc_auc_score(y_valid, valid_pred)

print(f"[Block 9] GBM trained. Features={X_train.shape[1]}, AUC(valid)={valid_auc:.4f}")
toc(t0, "GBM training done.")
print()


[Block 9] GBM trained. Features=9, AUC(valid)=0.5723
[Time] GBM training done. 2.68s



In [10]:
##############
# 10) EVALUATION: MAP@5 / NDCG@5 vs baselines
##############
t0 = tic()

def dcg(rels):
    return sum((2**rel - 1)/math.log2(i+2) for i, rel in enumerate(rels))

def ndcg_at_k(y_true, y_score, k=10):
    order = np.argsort(-y_score)[:k]
    rels = y_true[order]
    ideal = np.sort(y_true)[::-1][:k]
    return dcg(rels)/max(dcg(ideal), 1e-9)

def ap_at_k(y_true, y_score, k=10):
    order = np.argsort(-y_score)[:k]
    hits, s = 0, 0.0
    for i, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            s += hits / i
    return s / max(min(k, int(y_true.sum())), 1) if y_true.sum() > 0 else 0.0

eval_df = train_df[valid_idx].copy()

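# Baselines
# NOTE: popularity uses labels from all campaigns, including the validation
# campaigns, so this baseline has seen the labels it is evaluated on.
popularity = train_df.groupby("product_id")["label"].mean()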
mcc_counts_all = txns.groupby(["customer_id","mcc"]).size().reset_index(name="cnt")
mcc_top1 = mcc_counts_all.sort_values(["customer_id","cnt"], ascending=[True,False]).groupby("customer_id").head(1)
mcc_top1 = dict(zip(mcc_top1["customer_id"], mcc_top1["mcc"]))

def heuristic_score(cid, pid):
    base = 0.4*popularity.get(pid, 0)
    cat = mcc_top1.get(cid, None)
    pt = prod.loc[pid,"product_type"]
    bonus = 0.4 if cat and pt in mcc_to_prodtype.get(cat, []) else 0.0
    inc = cust.loc[cid,"income"]
    mi  = prod_min_income[pid]
    bonus += 0.2 * (inc >= mi)
    return base + bonus

metrics_rows = []
users = eval_df["customer_id"].unique()
for cid in users:
    cands = eligible[cid]
    if not cands:
        continue
    # labels
    y_true = []
    for pid in cands:
        sub = eval_df[(eval_df["customer_id"]==cid)&(eval_df["product_id"]==pid)]
        if len(sub):
            y_true.append(int(sub["label"].max()))
        else:
            prob = 1/(1+np.exp(-(true_score[cid,pid])))
            y_true.append(int(prob>0.7 and RNG.random()<0.1))
    y_true = np.array(y_true)

    # scores
    s_mf = np.array([mf_score[cid,p] for p in cands])

    feats = []
    for pid in cands:
        feats.append([
            mf_score[cid,pid],
            cust.loc[cid,"income"], cust.loc[cid,"age"],
            int(cust.loc[cid,"segment"]=="HNI"),
            int(cust.loc[cid,"segment"]=="Affluent"),
            int(cust.loc[cid,"channel_pref"]=="mobile"),
            float(feat.loc[feat["customer_id"]==cid,"salary_seasonality"].iloc[0]),
            float(prod_min_income[pid]),
            int(prod.loc[pid,"is_premium"]),
        ])
    feats = np.array(feats)
    s_gbm = gbm.predict_proba(feats)[:,1]

    s_heur = np.array([heuristic_score(cid,p) for p in cands])
    s_pop  = np.array([popularity.get(p,0) for p in cands])

    for name, scores in [("MF", s_mf), ("GBM", s_gbm), ("Heuristic", s_heur), ("Popularity", s_pop)]:
        ndcg5 = ndcg_at_k(y_true, scores, k=5)
        map5  = ap_at_k  (y_true, scores, k=5)
        metrics_rows.append({"customer_id": cid, "model": name, "NDCG@5": ndcg5, "MAP@5": map5})

metrics = pd.DataFrame(metrics_rows)
summary = metrics.groupby("model").agg({"NDCG@5":"mean","MAP@5":"mean"}).reset_index()

print("[Block 10] Ranking metrics summary:")
print(summary)
display_dataframe_to_user("Ranking_Metrics_Summary", summary)
toc(t0, "Evaluation done.")
print()



[Block 10] Ranking metrics summary:
        model    NDCG@5     MAP@5
0         GBM  0.363024  0.287328
1   Heuristic  0.328008  0.251886
2          MF  0.310712  0.237620
3  Popularity  0.327668  0.255076
[Block 1] [Saved CSV] Ranking_Metrics_Summary -> ./bank_reco_case_study/Ranking_Metrics_Summary.csv (showing head below)
        model    NDCG@5     MAP@5
0         GBM  0.363024  0.287328
1   Heuristic  0.328008  0.251886
2          MF  0.310712  0.237620
3  Popularity  0.327668  0.255076
[Time] Evaluation done. 12.56s
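
As a sanity check on the metric definitions in Block 10, the sketch below re-implements them in plain Python and applies them to a hypothetical four-item example. It mirrors the formulas above (gain `2**rel - 1`, discount `log2(position + 1)`, AP normalized by `min(k, number of positives)`), but the input data is made up.

```python
import math

def dcg(rels):
    """Discounted cumulative gain for relevances listed in ranked order."""
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))

def top_k_order(y_score, k):
    """Indices of the k highest-scored items, best first."""
    return sorted(range(len(y_score)), key=lambda i: -y_score[i])[:k]

def ndcg_at_k(y_true, y_score, k):
    rels = [y_true[i] for i in top_k_order(y_score, k)]
    ideal = sorted(y_true, reverse=True)[:k]
    return dcg(rels) / max(dcg(ideal), 1e-9)

def ap_at_k(y_true, y_score, k):
    hits, total = 0, 0.0
    for rank, i in enumerate(top_k_order(y_score, k), start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank        # precision at each hit position
    n_pos = sum(y_true)
    return total / min(k, n_pos) if n_pos > 0 else 0.0

y_true = [0, 1, 0, 1]             # hypothetical acceptance labels
y_score = [0.9, 0.8, 0.1, 0.3]    # hypothetical model scores
print(round(ndcg_at_k(y_true, y_score, k=3), 3), round(ap_at_k(y_true, y_score, k=3), 3))
```

The top-ranked item is a miss and the two hits land at positions 2 and 3, giving NDCG@3 ≈ 0.693 and AP@3 ≈ 0.583. Both metrics penalize the wasted first slot, which is the behavior that matters when only a few offers per customer can be sent.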




In [11]:
##############
# 11) A/B SIMULATED LIFT
##############
t0 = tic()
ab_rows = []
assign = {cid: ("B" if RNG.random()<0.5 else "A") for cid in range(N_CUSTOMERS)}

def simulate_response(cid, pid):
    prob = 1/(1+np.exp(-(true_score[cid,pid])))
    return 1 if RNG.random() < prob*0.25 else 0

for cid in range(N_CUSTOMERS):
    cands = eligible[cid]
    if not cands:
        continue
    arm = assign[cid]
    if arm=="A":
        scores = [heuristic_score(cid,p) for p in cands]
    else:
        feats = []
        for pid in cands:
            feats.append([
                mf_score[cid,pid],
                cust.loc[cid,"income"], cust.loc[cid,"age"],
                int(cust.loc[cid,"segment"]=="HNI"),
                int(cust.loc[cid,"segment"]=="Affluent"),
                int(cust.loc[cid,"channel_pref"]=="mobile"),
                float(feat.loc[feat["customer_id"]==cid,"salary_seasonality"].iloc[0]),
                float(prod_min_income[pid]),
                int(prod.loc[pid,"is_premium"]),
            ])
        feats = np.array(feats)
        scores = gbm.predict_proba(feats)[:,1]
    top_ix = int(np.argmax(scores))
    pid = cands[top_ix]
    y = simulate_response(cid, pid)
    ab_rows.append({"customer_id": cid, "arm": arm, "response": y})

ab = pd.DataFrame(ab_rows)
rate_a = ab.loc[ab["arm"]=="A","response"].mean()   # control arm: heuristic
rate_b = ab.loc[ab["arm"]=="B","response"].mean()   # treatment arm: GBM
lift = (rate_b - rate_a) / max(rate_a, 1e-9)        # relative lift of B over A
lift_pct = 100*lift

print(f"[Block 11] Simulated lift (GBM vs Heuristic): {lift_pct:.2f}%")
toc(t0, "A/B simulation done.")
print()



[Block 11] Simulated lift (GBM vs Heuristic): 13.18%
[Time] A/B simulation done. 3.79s
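
The lift printed above is a single point estimate from one simulated assignment. A minimal sketch of how its uncertainty could be gauged with a percentile bootstrap follows. The helper name `bootstrap_lift_ci`, the resample count, and the synthetic response arrays are illustrative assumptions and are not part of the notebook.

```python
import numpy as np

def bootstrap_lift_ci(resp_a, resp_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (in %) for the relative lift of arm B over arm A.

    Hypothetical helper: mirrors the lift formula used in Block 11.
    """
    rng = np.random.default_rng(seed)
    resp_a = np.asarray(resp_a, dtype=float)
    resp_b = np.asarray(resp_b, dtype=float)
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each arm independently, with replacement.
        mean_a = rng.choice(resp_a, size=resp_a.size, replace=True).mean()
        mean_b = rng.choice(resp_b, size=resp_b.size, replace=True).mean()
        lifts[i] = (mean_b - mean_a) / max(mean_a, 1e-9)
    lo, hi = np.quantile(lifts, [alpha / 2, 1 - alpha / 2])
    return 100 * float(lo), 100 * float(hi)

# Synthetic 0/1 responses standing in for ab["response"] per arm (assumed data).
demo_rng = np.random.default_rng(42)
demo_a = demo_rng.binomial(1, 0.10, size=500)
demo_b = demo_rng.binomial(1, 0.12, size=500)
ci_lo, ci_hi = bootstrap_lift_ci(demo_a, demo_b)
print(f"95% bootstrap CI for lift: [{ci_lo:.1f}%, {ci_hi:.1f}%]")
```

With the notebook's own frame, the two inputs would be the `response` column of `ab` filtered to arm A and to arm B. An interval that spans zero would suggest that passing the +10% threshold on a single run is not conclusive.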




In [12]:
##############
# 12) REASON CODES, RECOMMEND(), SAVE ARTIFACTS
##############
t0 = tic()

feature_names = ["mf_score","income","age","seg_hni","seg_affluent","channel_mobile","salary_seasonality","prod_min_income","is_premium"]

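def reason_codes_for_pair(x_row, top_k=3):
    # Knock-out attribution: zero one feature at a time and measure how far the
    # predicted probability moves from the baseline score.
    base = gbm.predict_proba([x_row])[0,1]
    deltas = []
    for j, fname in enumerate(feature_names):
        x_mut = x_row.copy()
        x_mut[j] = 0.0            # simple "knock-out"
        prob = gbm.predict_proba([x_mut])[0,1]
        deltas.append((fname, base - prob))
    # Largest positive delta first: features whose removal lowers the score most.
    deltas.sort(key=lambda t: t[1], reverse=True)
    # A delta of zero or less (including features that were already 0) gets the ↓ tag.
    top = [f"{nm}↑" if dv>0 else f"{nm}↓" for nm,dv in deltas[:top_k]]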
    mapping = {
        "mf_score↑":"Similar users liked this product",
        "income↑":"Meets income requirement",
        "age↑":"Age profile matches adopters",
        "seg_hni↑":"HNI segment bias",
        "seg_affluent↑":"Affluent segment bias",
        "channel_mobile↑":"High mobile usage",
        "salary_seasonality↑":"Spends spike near salary",
        "prod_min_income↑":"Income-product fit",
        "is_premium↑":"Premium propensity detected",
        "mf_score↓":"Weak collaborative signal",
        "income↓":"Income constraint",
        "age↓":"Age profile less aligned",
        "seg_hni↓":"Not HNI-focused",
        "seg_affluent↓":"Not Affluent-focused",
        "channel_mobile↓":"Low mobile usage",
        "salary_seasonality↓":"No salary-linked spikes",
        "prod_min_income↓":"Below min income",
        "is_premium↓":"Premium not suitable"
    }
    return [mapping.get(k,k) for k in top]

def recommend_for_customer(cid, top_n=K_RECO):
    cands = eligible[cid]
    if not cands:
        return []
    feats = []
    for pid in cands:
        feats.append([
            mf_score[cid,pid],
            cust.loc[cid,"income"], cust.loc[cid,"age"],
            int(cust.loc[cid,"segment"]=="HNI"),
            int(cust.loc[cid,"segment"]=="Affluent"),
            int(cust.loc[cid,"channel_pref"]=="mobile"),
            float(feat.loc[feat["customer_id"]==cid,"salary_seasonality"].iloc[0]),
            float(prod_min_income[pid]),
            int(prod.loc[pid,"is_premium"]),
        ])
    feats = np.array(feats)
    scores = gbm.predict_proba(feats)[:,1]
    order = np.argsort(-scores)
    recs, contacted = [], 0
    for idx in order:
        if contacted >= CONTACT_CAP or len(recs) >= top_n:
            break
        pid = cands[idx]
        xrow = feats[idx].tolist()
        reasons = reason_codes_for_pair(xrow, top_k=3)
        recs.append({
            "customer_id": cid,
            "product_id": int(pid),
            "product_type": str(prod.loc[pid,"product_type"]),
            "score": float(scores[idx]),
            "reasons": reasons
        })
        contacted += 1
    return recs

# Sample some users
sample_customers = RNG.choice(np.arange(N_CUSTOMERS), 25, replace=False)
reco_rows = []
for cid in sample_customers:
    for r in recommend_for_customer(cid, top_n=K_RECO):
        reco_rows.append(r)
reco_df = pd.DataFrame(reco_rows)

# Save artifacts
summary_path = os.path.join(OUT_DIR, "ranking_metrics.csv")
reco_path    = os.path.join(OUT_DIR, "sample_recommendations.csv")
customers_csv = os.path.join(OUT_DIR, "customers.csv")
products_csv  = os.path.join(OUT_DIR, "products.csv")
transactions_csv = os.path.join(OUT_DIR, "transactions.csv")
holdings_csv  = os.path.join(OUT_DIR, "holdings.csv")

# Metrics summary re-used from Block 10
summary.to_csv(summary_path, index=False)
reco_df.to_csv(reco_path, index=False)
cust.to_csv(customers_csv, index=False)
prod.to_csv(products_csv, index=False)
txns.to_csv(transactions_csv, index=False)
holdings.to_csv(holdings_csv, index=False)

print(f"[Block 12] Saved:")
print("  -", summary_path)
print("  -", reco_path)
print("  -", customers_csv)
print("  -", products_csv)
print("  -", transactions_csv)
print("  -", holdings_csv)

# Acceptance criteria printout
ndcg5_gbm = float(summary.loc[summary["model"]=="GBM","NDCG@5"].iloc[0])
meets_ndcg = ndcg5_gbm >= 0.65
meets_lift = (lift_pct >= 10.0)

print(f"[Block 12] Acceptance check -> NDCG@5(GBM)={ndcg5_gbm:.4f} | meets ≥0.65? {meets_ndcg}")
print(f"[Block 12] Simulated Lift vs Heuristic (%)={lift_pct:.2f} | meets ≥+10%? {meets_lift}")
valid_df = train_df[valid_idx]
valid_X = valid_df.drop(columns=['label','customer_id','product_id','campaign_ix'])
valid_auc = roc_auc_score(valid_df['label'].values, gbm.predict_proba(valid_X)[:,1])
print(f"[Block 12] Valid AUC (holdout)={valid_auc:.4f}")

display_dataframe_to_user("Sample_TopN_Recommendations_25_users", reco_df.head(50))
toc(t0, "Recommendations & artifacts done.")
print()

# SECURITY/PRIVACY: In production, ensure user consent & purpose limitation before using PII/transactions.



[Block 12] Saved:
  - ./bank_reco_case_study/ranking_metrics.csv
  - ./bank_reco_case_study/sample_recommendations.csv
  - ./bank_reco_case_study/customers.csv
  - ./bank_reco_case_study/products.csv
  - ./bank_reco_case_study/transactions.csv
  - ./bank_reco_case_study/holdings.csv
[Block 12] Acceptance check -> NDCG@5(GBM)=0.3630 | meets ≥0.65? False
[Block 12] Simulated Lift vs Heuristic (%)=13.18 | meets ≥+10%? True
[Block 12] Valid AUC (holdout)=0.5723
[Block 1] [Saved CSV] Sample_TopN_Recommendations_25_users -> ./bank_reco_case_study/Sample_TopN_Recommendations_25_users.csv (showing head below)
   customer_id  product_id product_type     score  \
0          636           6           EL  0.166882   
1          636           4    PremiumCC  0.150366   
2          636          10    PremiumCC  0.113097   
3          382           1           HL  0.126540   
4          382          11           GL  0.126540   

                                             reasons  
0  [Spends spike 
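
The acceptance check above compares NDCG@5 against a 0.65 threshold. As a reminder of what that metric measures, here is a minimal sketch assuming binary relevance (1 if the customer took the product, 0 otherwise). The function name `ndcg_at_k` is illustrative, and Block 10 may compute the metric differently.

```python
import numpy as np

def ndcg_at_k(relevance_in_ranked_order, k=5):
    """NDCG@k for one customer, given relevance labels in model-ranked order.

    Hypothetical helper: uses the standard log2 position discount.
    """
    rel = np.asarray(relevance_in_ranked_order, dtype=float)
    top = rel[:k]
    # Position i (1-based) is discounted by 1 / log2(i + 1).
    discounts = 1.0 / np.log2(np.arange(2, top.size + 2))
    dcg = float((top * discounts).sum())
    # Ideal ordering: all relevant items ranked first.
    ideal = np.sort(rel)[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0, 0]))  # relevant product ranked first
print(ndcg_at_k([0, 0, 1, 0, 0]))  # relevant product ranked third
```

A customer whose only relevant product sits in third place scores 0.5, so a portfolio average near 0.36 is consistent with relevant products often landing below the top two or three slots.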



In [13]:
display(reco_df)

| | customer_id | product_id | product_type | score | reasons |
|---|---|---|---|---|---|
| 0 | 636 | 6 | EL | 0.166882 | [Spends spike near salary, Income-product fit,... |
| 1 | 636 | 4 | PremiumCC | 0.150366 | [Meets income requirement, Weak collaborative ... |
| 2 | 636 | 10 | PremiumCC | 0.113097 | [Meets income requirement, Spends spike near s... |
| 3 | 382 | 1 | HL | 0.126540 | [Meets income requirement, Spends spike near s... |
| 4 | 382 | 11 | GL | 0.126540 | [Meets income requirement, Spends spike near s... |
| ... | ... | ... | ... | ... | ... |
| 63 | 524 | 5 | EL | 0.123357 | [Similar users liked this product, Not HNI-foc... |
| 64 | 524 | 3 | HL | 0.121895 | [Meets income requirement, Spends spike near s... |
| 65 | 501 | 1 | HL | 0.171219 | [Meets income requirement, Income-product fit,... |
| 66 | 501 | 11 | GL | 0.171219 | [Meets income requirement, Income-product fit,... |
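
The `reasons` column comes from the knock-out routine in Block 12, which sorts signed deltas in descending order. A feature whose removal changes nothing can therefore still surface with a ↓ tag when fewer than `top_k` features have a positive delta. One alternative, sketched here as an assumption rather than the notebook's method, is to rank by absolute delta and skip zero-effect features. The `toy_predict` scorer and the `knockout_reasons` helper are hypothetical stand-ins for `gbm.predict_proba` and `reason_codes_for_pair`.

```python
import numpy as np

def knockout_reasons(predict, x_row, names, top_k=3, eps=1e-12):
    """Rank features by the absolute change in score when each is zeroed out.

    Hypothetical helper: `predict` maps a 1-D feature vector to a probability.
    """
    x = np.asarray(x_row, dtype=float)
    base = predict(x)
    deltas = []
    for j, name in enumerate(names):
        x_mut = x.copy()
        x_mut[j] = 0.0                      # knock the feature out
        delta = base - predict(x_mut)
        if abs(delta) > eps:                # skip features with no effect
            deltas.append((name, delta))
    deltas.sort(key=lambda t: abs(t[1]), reverse=True)
    # ↑ means the feature raised the score, ↓ means it lowered the score.
    return [(n + ("↑" if d > 0 else "↓"), d) for n, d in deltas[:top_k]]

def toy_predict(x):
    # Toy logistic scorer with fixed weights (assumed, for illustration only).
    w = np.array([2.0, -1.0, 0.5])
    return float(1.0 / (1.0 + np.exp(-(x @ w))))

demo = knockout_reasons(toy_predict, [1.0, 1.0, 0.0],
                        ["mf_score", "is_premium", "seg_hni"])
print([tag for tag, _ in demo])
```

In this toy case `seg_hni` is already zero, so it produces no reason code at all instead of a "Not HNI-focused" tag.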
