# **Feature Engineering**

## Build Customer-Level Base Table (Include RFM)

In this step, we convert the raw transaction-level dataset (invoice lines) into a customer-level feature table. This is critical for CLV because most models do not learn from individual line items; they learn from customer behavior summaries such as recency, frequency, monetary value, and lifecycle length. Getting this table right makes the downstream modeling more stable, easier to interpret, and better aligned with real business behavior.

In [155]:
# Define snapshot date
snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)

# Aggregation table
customer_base = (
    df.groupby("Customer ID")
      .agg(
          first_purchase_date=("InvoiceDate", "min"),
          last_purchase_date=("InvoiceDate", "max"),
          n_invoices=("Invoice", "nunique"),          # Frequency (purchase occasions)
          txn_lines=("Invoice", "size"),              # Total line items (activity intensity)
          total_quantity=("Quantity", "sum"),
          total_revenue=("Revenue", "sum"),           # Monetary
          n_unique_products=("StockCode", "nunique"),
      )
      .reset_index()
)

# Time-based features
customer_base["recency_days"] = (snapshot_date - customer_base["last_purchase_date"]).dt.days
customer_base["tenure_days"]  = (customer_base["last_purchase_date"] - customer_base["first_purchase_date"]).dt.days

# Basket features
customer_base["aov"] = customer_base["total_revenue"] / customer_base["n_invoices"]     # Average order value
customer_base["lines_per_invoice"] = customer_base["txn_lines"] / customer_base["n_invoices"]

# Avoid division surprises (inf values if a customer ever had zero invoices)
customer_base.replace([np.inf, -np.inf], np.nan, inplace=True)

display(customer_base.head())


Unnamed: 0,Customer ID,first_purchase_date,last_purchase_date,n_invoices,txn_lines,total_quantity,total_revenue,n_unique_products,recency_days,tenure_days,aov,lines_per_invoice
0,12346,2009-12-14 08:34:00,2011-01-18 10:01:00,12,34,74285,77556.46,27,326,400,6463.038333,2.833333
1,12347,2010-10-31 14:20:00,2011-12-07 15:52:00,8,222,2967,4921.53,126,2,402,615.19125,27.75
2,12348,2010-09-27 14:59:00,2011-09-25 13:13:00,5,51,2714,2019.4,25,75,362,403.88,10.2
3,12349,2010-04-29 13:20:00,2011-11-21 09:51:00,4,175,1624,4428.69,138,19,570,1107.1725,43.75
4,12350,2011-02-02 16:01:00,2011-02-02 16:01:00,1,17,197,334.4,17,310,0,334.4,17.0
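
To make the aggregation logic concrete, here is a minimal sketch on a hypothetical toy frame (the `toy`, `toy_snapshot`, and `toy_base` names and all values are made up for illustration and are not part of the real dataset). It mirrors the named aggregation and the recency/tenure arithmetic used above, and shows that `.dt.days` keeps only whole days.

```python
import pandas as pd

# Hypothetical toy invoice lines (illustrative only, not from the real dataset)
toy = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "B"],
    "Invoice": ["1001", "1001", "1002", "2001"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01 10:00", "2011-01-01 10:00",  # two lines on one invoice
        "2011-01-11 09:00",
        "2011-01-05 12:00",
    ]),
    "Revenue": [10.0, 5.0, 20.0, 7.5],
})

# Snapshot = one day after the last observed transaction
toy_snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_base = (
    toy.groupby("Customer ID")
       .agg(
           first_purchase_date=("InvoiceDate", "min"),
           last_purchase_date=("InvoiceDate", "max"),
           n_invoices=("Invoice", "nunique"),   # invoices, not lines
           total_revenue=("Revenue", "sum"),
       )
       .reset_index()
)

# .dt.days drops the hours component (9 days 23 hours becomes 9)
toy_base["recency_days"] = (toy_snapshot - toy_base["last_purchase_date"]).dt.days
toy_base["tenure_days"] = (
    toy_base["last_purchase_date"] - toy_base["first_purchase_date"]
).dt.days
toy_base["aov"] = toy_base["total_revenue"] / toy_base["n_invoices"]

print(toy_base)
```

Customer A has three lines but only two invoices, so `aov` divides by 2 rather than 3. Customer B bought once, so tenure is 0, which matters for the rate features built later.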


## Customer State Features

In this step, we convert raw behavioral signals (mainly recency, plus tenure) into simple customer states such as Active, Warm, Sleeping, At-risk, or Dormant. This matters because many CLV models benefit from knowing where a customer sits in the lifecycle, not just how much they spent. State features make the model easier to interpret, support segmentation, and can improve predictive stability because they compress noisy day-level variability into business-meaningful buckets.

In [161]:
# Activity state (recency-based buckets)
state_bins = [0, 30, 90, 180, 365, np.inf]
state_labels = ["Active (<=30d)", "Warm (31-90d)", "Sleeping (91-180d)", "At-risk (181-365d)", "Dormant (>365d)"]

customer_base["activity_state"] = pd.cut(
    customer_base["recency_days"],
    bins=state_bins,
    labels=state_labels,
    right=True,
    include_lowest=True
)

# Tenure-based buckets
tenure_bins = [0, 30, 90, 180, 365, np.inf]
tenure_labels = ["New (<=30d)", "Young (31-90d)", "Growing (91-180d)", "Established (181-365d)", "Long-term (>365d)"]

customer_base["tenure_group"] = pd.cut(
    customer_base["tenure_days"],
    bins=tenure_bins,
    labels=tenure_labels,
    right=True,
    include_lowest=True
)

# "Likely churned" flag
customer_base["likely_churned"] = (customer_base["recency_days"] > 365).astype(int)

# Preview
display(customer_base[["Customer ID", "recency_days", "tenure_days", "activity_state", "tenure_group", "likely_churned"]].head())

Unnamed: 0,Customer ID,recency_days,tenure_days,activity_state,tenure_group,likely_churned
0,12346,326,400,At-risk (181-365d),Long-term (>365d),0
1,12347,2,402,Active (<=30d),Long-term (>365d),0
2,12348,75,362,Warm (31-90d),Established (181-365d),0
3,12349,19,570,Active (<=30d),Long-term (>365d),0
4,12350,310,0,At-risk (181-365d),New (<=30d),0
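
The bucket edges are easy to get wrong, so the following small check (a sketch using hypothetical `edge_values`, not data from the notebook) shows where boundary values land when `pd.cut` is called with `right=True` and `include_lowest=True`, as above.

```python
import numpy as np
import pandas as pd

state_bins = [0, 30, 90, 180, 365, np.inf]
state_labels = [
    "Active (<=30d)", "Warm (31-90d)", "Sleeping (91-180d)",
    "At-risk (181-365d)", "Dormant (>365d)",
]

# Hypothetical recency values sitting exactly on or next to bin edges
edge_values = pd.Series([0, 30, 31, 365, 366])

edge_states = pd.cut(
    edge_values,
    bins=state_bins,
    labels=state_labels,
    right=True,           # intervals are closed on the right: (30, 90]
    include_lowest=True,  # the first interval also includes 0
)

print(pd.DataFrame({"recency_days": edge_values, "activity_state": edge_states}))
```

With these settings, 30 days is still Active, 31 days is Warm, and 365 days is At-risk. Only values above 365 become Dormant, so the `likely_churned` flag (`recency_days > 365`) marks exactly the Dormant bucket.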


## Rate and Intensity Features

In this step, we turn raw totals (invoices, revenue, line items, quantity) into rates and intensity metrics: how fast and how densely a customer buys. Two customers can have the same total revenue, but one might generate it quickly (high velocity) while the other takes a long time (low velocity). These features help CLV modeling because they capture purchase momentum, reduce bias from differing observation windows (tenure), and often separate “steady” customers from “burst” customers. Note that one-time buyers have a tenure of 0 days, which the code floors to 1 day, so their monthly rates are inflated by a factor of 30 (see customer 12350 in the preview).

In [164]:
# Safe denominators (avoid division surprises)
TENURE_MIN_DAYS = 1
tenure_days_safe = customer_base["tenure_days"].clip(lower=TENURE_MIN_DAYS)
tenure_months_safe = (tenure_days_safe / 30.0)

# Active window approximation (time covered by observation)
# "active_months_safe" here means the time span from first to last purchase (tenure), not "months with transactions"
active_months_safe = tenure_months_safe

# Velocity features
customer_base["invoices_per_month"] = customer_base["n_invoices"] / active_months_safe
customer_base["lines_per_month"] = customer_base["txn_lines"] / active_months_safe
customer_base["quantity_per_month"] = customer_base["total_quantity"] / active_months_safe
customer_base["revenue_per_month"] = customer_base["total_revenue"] / active_months_safe

# Intensity per purchase (order density)
INV_MIN = 1
n_invoices_safe = customer_base["n_invoices"].clip(lower=INV_MIN)

customer_base["qty_per_invoice"] = customer_base["total_quantity"] / n_invoices_safe
customer_base["revenue_per_line"] = customer_base["total_revenue"] / customer_base["txn_lines"].clip(lower=1)
customer_base["qty_per_line"] = customer_base["total_quantity"] / customer_base["txn_lines"].clip(lower=1)

# Momentum proxy using recency
customer_base["velocity_x_freshness"] = customer_base["revenue_per_month"] / (customer_base["recency_days"] + 1)

# Cleanup infinities if any
customer_base.replace([np.inf, -np.inf], np.nan, inplace=True)

# Preview
cols_preview = [
    "Customer ID",
    "n_invoices", "txn_lines", "total_quantity", "total_revenue",
    "tenure_days", "recency_days",
    "invoices_per_month", "lines_per_month", "quantity_per_month", "revenue_per_month",
    "qty_per_invoice", "revenue_per_line", "qty_per_line",
    "velocity_x_freshness"
]
display(customer_base[cols_preview].head())


Unnamed: 0,Customer ID,n_invoices,txn_lines,total_quantity,total_revenue,tenure_days,recency_days,invoices_per_month,lines_per_month,quantity_per_month,revenue_per_month,qty_per_invoice,revenue_per_line,qty_per_line,velocity_x_freshness
0,12346,12,34,74285,77556.46,400,326,0.9,2.55,5571.375,5816.7345,6190.416667,2281.072353,2184.852941,17.788179
1,12347,8,222,2967,4921.53,402,2,0.597015,16.567164,221.41791,367.278358,370.875,22.169054,13.364865,122.426119
2,12348,5,51,2714,2019.4,362,75,0.414365,4.226519,224.917127,167.353591,542.8,39.596078,53.215686,2.202021
3,12349,4,175,1624,4428.69,570,19,0.210526,9.210526,85.473684,233.088947,406.0,25.3068,9.28,11.654447
4,12350,1,17,197,334.4,0,310,30.0,510.0,5910.0,10032.0,197.0,19.670588,11.588235,32.257235
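
The preview shows customer 12350 (one invoice, tenure 0) with `revenue_per_month` of 10032.0, which comes purely from the 1-day floor on the denominator. The sketch below reproduces that number from two preview rows and contrasts it with one possible alternative denominator. The alternative (`revenue_per_observed_month`, based on time since first purchase) is an assumption for illustration, not something the notebook does, and `tenure_days + recency_days` only approximates the first-purchase-to-snapshot span.

```python
import pandas as pd

# Two rows copied from the preview above
sample = pd.DataFrame({
    "Customer ID": ["12347", "12350"],
    "total_revenue": [4921.53, 334.40],
    "tenure_days": [402, 0],
    "recency_days": [2, 310],
})

# Notebook's denominator: first-to-last purchase span, floored at 1 day
span_months = sample["tenure_days"].clip(lower=1) / 30.0
sample["revenue_per_month"] = sample["total_revenue"] / span_months

# Hypothetical alternative: months observed since the first purchase
# (approximated here as tenure + recency; an assumption, not the notebook's method)
observed_days = (sample["tenure_days"] + sample["recency_days"]).clip(lower=1)
sample["revenue_per_observed_month"] = sample["total_revenue"] / (observed_days / 30.0)

print(sample[["Customer ID", "revenue_per_month", "revenue_per_observed_month"]])
```

For the repeat buyer the two rates are nearly identical, while for the one-time buyer the observed-window rate drops from about 10,032 to about 32 per month. Whether that is preferable depends on the downstream model; tree-based models may tolerate the inflated values, while linear models are more likely to be affected.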


## Breadth & Diversity (SKU Spread & Engagement Depth)

In this step, we capture how broad and deep a customer’s engagement is with the product catalog. Instead of focusing only on how much or how often customers buy, breadth and diversity features describe how varied their purchases are. These signals often correlate with loyalty and stickiness: customers who explore more of the catalog may be less price-sensitive and more resilient over time. For CLV modeling, breadth features help differentiate customers with similar revenue but very different engagement patterns.

In [170]:
# Safe Denominators
INV_MIN = 1
LINES_MIN = 1

n_invoices_safe = customer_base["n_invoices"].clip(lower=INV_MIN)
txn_lines_safe = customer_base["txn_lines"].clip(lower=LINES_MIN)

# Active months: prefer active_months if available, else approximate from tenure
if "active_months" in customer_base.columns:
    active_months_safe = customer_base["active_months"].clip(lower=1)
else:
    active_months_safe = (customer_base["tenure_days"].clip(lower=1) / 30.0)

# Breadth intensity features
customer_base["unique_products_per_invoice"] = (
    customer_base["n_unique_products"] / n_invoices_safe
)

customer_base["unique_products_per_line"] = (
    customer_base["n_unique_products"] / txn_lines_safe
)

customer_base["unique_products_per_month"] = (
    customer_base["n_unique_products"] / active_months_safe
)

# Diversity ratio (engagement depth): same formula as unique_products_per_line, kept under a descriptive name
customer_base["product_diversity_ratio"] = (
    customer_base["n_unique_products"] / txn_lines_safe
)

# Optional: capped version as a safeguard (the ratio cannot exceed 1.0 because unique products <= line items)
customer_base["product_diversity_ratio_capped"] = (
    customer_base["product_diversity_ratio"].clip(upper=1.0)
)

# Cleanup infinities
customer_base.replace([np.inf, -np.inf], np.nan, inplace=True)

# Preview
cols_preview = [
    "Customer ID",
    "n_unique_products", "n_invoices", "txn_lines",
    "unique_products_per_invoice",
    "unique_products_per_line",
    "unique_products_per_month",
    "product_diversity_ratio",
    "product_diversity_ratio_capped"
]

display(customer_base[cols_preview].head())


Unnamed: 0,Customer ID,n_unique_products,n_invoices,txn_lines,unique_products_per_invoice,unique_products_per_line,unique_products_per_month,product_diversity_ratio,product_diversity_ratio_capped
0,12346,27,12,34,2.25,0.794118,2.025,0.794118,0.794118
1,12347,126,8,222,15.75,0.567568,9.402985,0.567568,0.567568
2,12348,25,5,51,5.0,0.490196,2.071823,0.490196,0.490196
3,12349,138,4,175,34.5,0.788571,7.263158,0.788571,0.788571
4,12350,17,1,17,17.0,1.0,510.0,1.0,1.0
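
As a quick illustration of how the diversity ratio behaves, here is a sketch on hypothetical invoice lines (`toy_lines` and `breadth` are made-up names). It assumes one `StockCode` per line, as in the dataset above, so the number of unique products can never exceed the number of lines.

```python
import pandas as pd

# Hypothetical invoice lines: A buys three different products, B rebuys one
toy_lines = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "A", "B", "B"],
    "StockCode":   ["X", "X", "Y", "Z", "X", "X"],
})

breadth = toy_lines.groupby("Customer ID").agg(
    n_unique_products=("StockCode", "nunique"),
    txn_lines=("StockCode", "size"),
)

# Share of lines that introduce a distinct product (1.0 = never repeats a product)
breadth["product_diversity_ratio"] = (
    breadth["n_unique_products"] / breadth["txn_lines"].clip(lower=1)
)

print(breadth)
```

Customer A scores 0.75 (three distinct products over four lines) and customer B scores 0.5 (one product bought twice). Because the ratio is bounded by 1.0, capping it at 1.0 leaves the values unchanged, which matches the identical last two columns in the preview.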


## Temporal Summary

In this step, we summarize each customer’s purchase timing pattern: how spread out their transactions are, how consistent they are, and how long they stay active. Two customers can have the same frequency, but one buys steadily while the other buys in bursts. Temporal summary features capture that difference and often improve CLV modeling because they add a “behavior rhythm” signal; consistency tends to correlate with retention and future purchasing.

In [172]:
# Build customer -> sorted purchase dates (unique invoices)
orders = (
    df.groupby(["Customer ID", "Invoice"])["InvoiceDate"]
      .max()  # one timestamp per invoice
      .reset_index()
      .sort_values(["Customer ID", "InvoiceDate"])
)

# Compute gaps (days between purchases) per customer
orders["prev_date"] = orders.groupby("Customer ID")["InvoiceDate"].shift(1)
orders["gap_days"] = (orders["InvoiceDate"] - orders["prev_date"]).dt.days

# Aggregate gap statistics per customer
gap_features = (
    orders.groupby("Customer ID")["gap_days"]
          .agg(
              avg_gap_days="mean",
              median_gap_days="median",
              std_gap_days="std",
              min_gap_days="min",
              max_gap_days="max",
          )
          .reset_index()
)

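# Customers with only 1 invoice have no gaps, so all gap statistics are NaN.
# Customers with 2 invoices have a single gap, so std_gap_days (and gap_cv, regularity_score) are NaN.
# We keep them as NaN for now (optional fill later depending on model choice)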

# Lower std relative to mean -> more regular
gap_features["gap_cv"] = gap_features["std_gap_days"] / gap_features["avg_gap_days"]

# Simple regularity score (bounded, higher = more regular)
gap_features["regularity_score"] = 1 / (1 + gap_features["gap_cv"])

# Merge into customer_base
customer_base = customer_base.merge(gap_features, on="Customer ID", how="left")

# Cleanup infinities if avg_gap_days is 0 (rare) -> handle safely
customer_base.replace([np.inf, -np.inf], np.nan, inplace=True)

cols_preview = [
    "Customer ID", "n_invoices",
    "avg_gap_days", "median_gap_days", "std_gap_days",
    "gap_cv", "regularity_score"
]
display(customer_base[cols_preview].head(10))

Unnamed: 0,Customer ID,n_invoices,avg_gap_days,median_gap_days,std_gap_days,gap_cv,regularity_score
0,12346,12,35.909091,7.0,65.426989,1.822017,0.354356
1,12347,8,57.0,53.0,19.035055,0.333948,0.749654
2,12348,5,90.5,75.0,57.703264,0.637605,0.610648
3,12349,4,189.666667,162.0,187.040994,0.986156,0.503485
4,12350,1,,,,,
5,12351,1,,,,,
6,12352,10,39.222222,16.0,58.25757,1.48532,0.402363
7,12353,2,204.0,204.0,,,
8,12354,1,,,,,
9,12355,2,353.0,353.0,,,
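
The NaN pattern in the preview follows directly from how the gap statistics are defined. The sketch below uses hypothetical order dates (`toy_orders` is made up for illustration) to show the three cases: a steady buyer, a customer with two orders, and a one-time buyer. It uses `groupby(...).diff()` as a shorthand for the `shift` and subtract steps above.

```python
import pandas as pd

# Hypothetical order dates, one row per invoice
toy_orders = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "A", "B", "B", "C"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01", "2011-01-11", "2011-01-21", "2011-01-31",  # A: every 10 days
        "2011-01-01", "2011-03-01",                              # B: two orders, one gap
        "2011-02-01",                                            # C: single order, no gap
    ]),
}).sort_values(["Customer ID", "InvoiceDate"])

# Days since the same customer's previous order (NaN for the first order)
toy_orders["gap_days"] = (
    toy_orders.groupby("Customer ID")["InvoiceDate"].diff().dt.days
)

toy_gaps = toy_orders.groupby("Customer ID")["gap_days"].agg(
    avg_gap_days="mean",
    std_gap_days="std",   # sample std (ddof=1): needs at least two gaps
)
toy_gaps["gap_cv"] = toy_gaps["std_gap_days"] / toy_gaps["avg_gap_days"]
toy_gaps["regularity_score"] = 1 / (1 + toy_gaps["gap_cv"])

print(toy_gaps)
```

Customer A has identical gaps, so `gap_cv` is 0 and `regularity_score` reaches its maximum of 1.0. Customer B has a mean gap but no standard deviation, and customer C has no gap statistics at all, so `regularity_score` is only defined for customers with at least three invoices.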


## Robustness/Stability Features

In this step, we make our features model-friendly under heavy skew. Retail customer value data is rarely “normal”: a small group of customers tends to generate a disproportionate share of revenue. If we feed raw revenue-based features directly into a model, the learning process can become unstable and overly driven by extreme customers. To prevent that, we add robust versions of key variables using winsorized (capped) features and log transforms of those capped values. This helps models generalize better and improves stability without discarding valuable signals.

In [173]:
# Define helper for winsorization (capping)
def cap_series(s, lower_q=0.01, upper_q=0.99):
    lo = s.quantile(lower_q)
    hi = s.quantile(upper_q)
    return s.clip(lower=lo, upper=hi)

# Choose features to stabilize (based on heavy-tail behavior)
skew_cols = [
    "total_revenue",
    "revenue_per_month",
    "aov",
    "quantity_per_month",
    "invoices_per_month"
]

# Keep only columns that actually exist
skew_cols = [c for c in skew_cols if c in customer_base.columns]

# Create capped (winsorized) versions
for c in skew_cols:
    customer_base[f"cap_{c}"] = cap_series(customer_base[c], lower_q=0.01, upper_q=0.99)

# Create log-transformed versions
# log1p(x) = log(1+x) handles zeros gracefully (values below -1 would yield NaN)
for c in skew_cols:
    # Use capped values for log transform to reduce extreme leverage further
    customer_base[f"log1p_cap_{c}"] = np.log1p(customer_base[f"cap_{c}"])

# Flag power customers (for interpretability / segmentation)
customer_base["is_top_1pct_revenue"] = (
    customer_base["total_revenue"] >= customer_base["total_revenue"].quantile(0.99)
).astype(int)

customer_base["is_top_5pct_revenue"] = (
    customer_base["total_revenue"] >= customer_base["total_revenue"].quantile(0.95)
).astype(int)

# Cleanup infinities just in case
customer_base.replace([np.inf, -np.inf], np.nan, inplace=True)

# Preview a few key columns
preview_cols = ["Customer ID"] + skew_cols + \
               [f"cap_{c}" for c in skew_cols] + \
               [f"log1p_cap_{c}" for c in skew_cols] + \
               ["is_top_1pct_revenue", "is_top_5pct_revenue"]

display(customer_base[preview_cols].head())

Unnamed: 0,Customer ID,total_revenue,revenue_per_month,aov,quantity_per_month,invoices_per_month,cap_total_revenue,cap_revenue_per_month,cap_aov,cap_quantity_per_month,cap_invoices_per_month,log1p_cap_total_revenue,log1p_cap_revenue_per_month,log1p_cap_aov,log1p_cap_quantity_per_month,log1p_cap_invoices_per_month,is_top_1pct_revenue,is_top_5pct_revenue
0,12346,77556.46,5816.7345,6463.038333,5571.375,0.9,29205.901,5816.7345,1963.8105,5571.375,0.9,10.28216,8.668666,7.583151,8.625577,0.641854,1,1
1,12347,4921.53,367.278358,615.19125,221.41791,0.597015,4921.53,367.278358,615.19125,221.41791,0.597015,8.501578,5.908839,6.423557,5.404558,0.468136,0,0
2,12348,2019.4,167.353591,403.88,224.917127,0.414365,2019.4,167.353591,403.88,224.917127,0.414365,7.611051,5.126066,6.003591,5.420168,0.34668,0,0
3,12349,4428.69,233.088947,1107.1725,85.473684,0.210526,4428.69,233.088947,1107.1725,85.473684,0.210526,8.396085,5.455701,7.010468,4.45984,0.191055,0,0
4,12350,334.4,10032.0,334.4,5910.0,30.0,334.4,10032.0,334.4,5910.0,30.0,5.815324,9.213635,5.815324,8.68457,3.433987,0,0
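
To see what capping followed by `log1p` does to a heavy tail, here is a minimal sketch on a hypothetical revenue series (99 ordinary values plus one extreme outlier; the numbers are made up). The `cap_series` helper is repeated from the cell above so the block runs on its own.

```python
import numpy as np
import pandas as pd

def cap_series(s, lower_q=0.01, upper_q=0.99):
    """Winsorize: clip values to the [lower_q, upper_q] quantile range."""
    lo = s.quantile(lower_q)
    hi = s.quantile(upper_q)
    return s.clip(lower=lo, upper=hi)

# Hypothetical revenue: 1..99 plus a single extreme customer
revenue = pd.Series(list(range(1, 100)) + [100_000], dtype="float64")

capped = cap_series(revenue, lower_q=0.01, upper_q=0.99)
logged = np.log1p(capped)

summary = pd.DataFrame({
    "raw": revenue.describe(),
    "capped": capped.describe(),
    "log1p_capped": logged.describe(),
})
print(summary.loc[["mean", "50%", "max"]])
```

The outlier is pulled down from 100,000 to roughly 1,098 (the interpolated 99th percentile), and the log transform compresses the remaining spread further. The customer ordering is preserved up to ties at the caps, which is why the notebook keeps separate `is_top_1pct_revenue` and `is_top_5pct_revenue` flags to retain the "extreme customer" signal.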


In [176]:
customer_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5878 entries, 0 to 5877
Data columns (total 47 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Customer ID                     5878 non-null   object        
 1   first_purchase_date             5878 non-null   datetime64[ns]
 2   last_purchase_date              5878 non-null   datetime64[ns]
 3   n_invoices                      5878 non-null   int64         
 4   txn_lines                       5878 non-null   int64         
 5   total_quantity                  5878 non-null   int64         
 6   total_revenue                   5878 non-null   float64       
 7   n_unique_products               5878 non-null   int64         
 8   recency_days                    5878 non-null   int64         
 9   tenure_days                     5878 non-null   int64         
 10  aov                             5878 non-null   float64       
 11  line