# TASK#3: Player Monthly Spending Prediction

Karena Thailand's finance team struggles with revenue forecasting and
consistently misses quarterly targets by 35-40%, which affects budgeting,
hiring decisions, and investor confidence. Without accurate player-level
spending predictions, multiple departments are operating blind. The 20-person
VIP support team provides white-glove service but doesn't know which players
actually warrant priority treatment: last quarter it spent significant
resources on 150 high-playtime players who collectively spent only ฿12,000,
while a whale who spent ฿255,000 waited 3 days for a ticket response and
subsequently left for a competitor. The marketing team sends promotional
discount codes at random, often giving 50% off to whales who would have paid
full price anyway, resulting in more than ฿6 million in lost revenue each
month. Customer service wastes time on players unlikely ever to monetize while
high-value customers receive generic treatment. Additionally, the company
needs to schedule its limited-time events and new content releases around the
times when high-spending players are most active. Accurate spending
predictions would enable proper resource allocation, targeted offers, revenue
forecasting, and VIP program optimization across the organization.


# STEP 1: Import and setup

In [None]:
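import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# LightGBM is optional: Step 7 falls back to RandomForest when it is missing,
# so guard this import instead of failing in the first cell.
try:
    from lightgbm import early_stopping, log_evaluation
except ImportError:
    early_stopping = log_evaluation = None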

In [None]:
TRAIN_PATH = "train.csv"
TEST_PATH = "test.csv"
OUTPUT_SUB_PATH = "submission_task3_baseline.csv"

TARGET_COL = "spending_30d"   # name of the target column in train.csv
USE_LOG_TARGET = False


def nmae(y_true, y_pred):
    """Normalized MAE = sum(|y-y_hat|) / sum(|y|)"""
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))
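
As a quick sanity check of the metric, the sketch below evaluates the same formula on a tiny hand-made array. The toy values are invented for illustration and are not taken from the competition data; the only change to the function is converting the inputs to float arrays first.

```python
import numpy as np


def nmae(y_true, y_pred):
    """Normalized MAE = sum(|y - y_hat|) / sum(|y|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))


# Toy values, invented for illustration only.
y_true_toy = np.array([0.0, 0.0, 100.0])
y_pred_toy = np.array([10.0, 0.0, 80.0])

# Absolute errors are 10, 0 and 20 (sum 30); sum(|y|) is 100.
toy_score = nmae(y_true_toy, y_pred_toy)
print(toy_score)  # 0.3
```

Because the denominator is the total absolute target, the metric is undefined when every true value is zero, and errors on non-spenders count just as much as errors on whales.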

# STEP 2: Load data


In [None]:
train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

print("Train shape:", train.shape)
print("Test shape :", test.shape)

Train shape: (104000, 35)
Test shape : (25889, 34)
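
The shapes show that train has one more column than test, presumably the target. A minimal sketch of how to confirm which columns differ, using small stand-in frames (the column values here are hypothetical; in the notebook the same two set differences would be taken on `train` and `test`):

```python
import pandas as pd

# Tiny stand-in frames; the real ones come from train.csv / test.csv.
train_demo = pd.DataFrame(
    {"id": [1, 2], "friend_count": [3, 5], "spending_30d": [0.0, 120.0]}
)
test_demo = pd.DataFrame({"id": [3], "friend_count": [7]})

# Columns present in one frame but not in the other.
only_in_train = sorted(set(train_demo.columns) - set(test_demo.columns))
only_in_test = sorted(set(test_demo.columns) - set(train_demo.columns))

print("Only in train:", only_in_train)
print("Only in test :", only_in_test)
```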


# STEP 3: Define feature columns

In [None]:
FEATURE_COLS = [
    "friend_count",
    "social_interactions",
    "guild_membership",
    "event_participation_rate",
    "daily_login_streak",
    "avg_session_length",
    "sessions_per_week",
    "total_playtime_hours",
    "days_since_last_login",
    "achievement_count",
    "achievement_completion_rate",
    "historical_spending",
    "prev_month_spending",
    "total_transactions",
    "avg_transaction_value",
    "account_age_days",
    "vip_status",
    "is_premium_member",
    "primary_game",
    "games_played",
    "cross_game_activity",
    "platform",
    "days_since_last_purchase",
    "purchase_frequency",
    "payment_methods_used",
    "purchases_on_discount",
    "discount_rate_used",
    "seasonal_spending_pattern",
    "owns_limited_edition",
    "competitive_rank",
    "tournament_participation",
    "segment",
]

missing_in_train = [c for c in FEATURE_COLS if c not in train.columns]
missing_in_test = [c for c in FEATURE_COLS if c not in test.columns]

if missing_in_train:
    print("⚠️ Missing in train:", missing_in_train)
if missing_in_test:
    print("⚠️ Missing in test:", missing_in_test)

FEATURE_COLS = [
    c for c in FEATURE_COLS
    if c in train.columns and c in test.columns
]
print("✅ Use features:", len(FEATURE_COLS))
print(FEATURE_COLS)

✅ Use features: 32
['friend_count', 'social_interactions', 'guild_membership', 'event_participation_rate', 'daily_login_streak', 'avg_session_length', 'sessions_per_week', 'total_playtime_hours', 'days_since_last_login', 'achievement_count', 'achievement_completion_rate', 'historical_spending', 'prev_month_spending', 'total_transactions', 'avg_transaction_value', 'account_age_days', 'vip_status', 'is_premium_member', 'primary_game', 'games_played', 'cross_game_activity', 'platform', 'days_since_last_purchase', 'purchase_frequency', 'payment_methods_used', 'purchases_on_discount', 'discount_rate_used', 'seasonal_spending_pattern', 'owns_limited_edition', 'competitive_rank', 'tournament_participation', 'segment']


# STEP 4: Prepare X, y and handle missing values

In [None]:
y = train[TARGET_COL].astype(float)
X = train[FEATURE_COLS].astype(float)
X_test = test[FEATURE_COLS].astype(float)

for col in FEATURE_COLS:
    med = X[col].median()
    X[col] = X[col].fillna(med)
    X_test[col] = X_test[col].fillna(med)
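
The loop above fills gaps in both frames with medians learned from the training data. A vectorized sketch of the same idea follows, on hypothetical toy frames. It also shows one possible variation, not what the notebook does: computing the medians from the training portion only, so that validation rows do not influence the fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for a train/validation split.
train_part = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 5.0], "b": [10.0, 20.0, np.nan, 40.0]}
)
valid_part = pd.DataFrame({"a": [np.nan, 2.0], "b": [np.nan, np.nan]})

# Medians are learned from the training portion only (NaN is skipped).
medians = train_part.median()

# fillna accepts a Series indexed by column name, so no Python loop is needed.
train_filled = train_part.fillna(medians)
valid_filled = valid_part.fillna(medians)

print(medians.to_dict())
print(valid_filled)
```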

# STEP 5: Feature Engineering

In [None]:
X["spend_per_session"] = X["historical_spending"] / (X["sessions_per_week"] + 1)
X_test["spend_per_session"] = X_test["historical_spending"] / (X_test["sessions_per_week"] + 1)

X["spend_per_hour"] = X["historical_spending"] / (X["total_playtime_hours"] + 1)
X_test["spend_per_hour"] = X_test["historical_spending"] / (X_test["total_playtime_hours"] + 1)

X["recency_score"] = 1 / (X["days_since_last_purchase"] + 1)
X_test["recency_score"] = 1 / (X_test["days_since_last_purchase"] + 1)

X["engagement_score"] = (
    0.3 * X["social_interactions"] +
    0.3 * X["event_participation_rate"] +
    0.2 * X["friend_count"] +
    0.2 * X["guild_membership"]
)

X_test["engagement_score"] = (
    0.3 * X_test["social_interactions"] +
    0.3 * X_test["event_participation_rate"] +
    0.2 * X_test["friend_count"] +
    0.2 * X_test["guild_membership"]
)
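
Each feature above is written twice, once for `X` and once for `X_test`, which makes it easy for the two copies to drift apart. One way to avoid that is a single helper applied to both frames. The sketch below assumes the same column names as the cell above; `add_features` and the toy frame are hypothetical names introduced here for illustration.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four Step 5 features added."""
    out = df.copy()
    out["spend_per_session"] = out["historical_spending"] / (out["sessions_per_week"] + 1)
    out["spend_per_hour"] = out["historical_spending"] / (out["total_playtime_hours"] + 1)
    out["recency_score"] = 1 / (out["days_since_last_purchase"] + 1)
    out["engagement_score"] = (
        0.3 * out["social_interactions"]
        + 0.3 * out["event_participation_rate"]
        + 0.2 * out["friend_count"]
        + 0.2 * out["guild_membership"]
    )
    return out


# Toy rows, invented for illustration.
demo = pd.DataFrame({
    "historical_spending": [100.0, 0.0],
    "sessions_per_week": [4.0, 0.0],
    "total_playtime_hours": [9.0, 0.0],
    "days_since_last_purchase": [1.0, 0.0],
    "social_interactions": [10.0, 0.0],
    "event_participation_rate": [0.5, 0.0],
    "friend_count": [20.0, 0.0],
    "guild_membership": [1.0, 0.0],
})
demo_fe = add_features(demo)
print(demo_fe[["spend_per_session", "spend_per_hour", "recency_score", "engagement_score"]])
```

In the notebook this would be used as `X = add_features(X)` and `X_test = add_features(X_test)`. The `+ 1` in each denominator keeps the ratios finite for players with zero sessions, hours, or days.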

# STEP 6: Train/Validation split + Baseline

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train size:", X_train.shape, "Valid size:", X_valid.shape)

mean_pred = y_train.mean()
y_valid_mean = np.full_like(y_valid, mean_pred, dtype=float)
baseline_nmae = nmae(y_valid, y_valid_mean)
print(f"Baseline NMAE (mean predictor): {baseline_nmae:.6f}")

Train size: (83200, 36) Valid size: (20800, 36)
Baseline NMAE (mean predictor): 1.603269
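
A baseline NMAE above 1 can look surprising, because predicting 0 for every player scores exactly 1 by construction. The sketch below reproduces the effect on synthetic data. The distribution (mostly non-spenders plus a long-tailed minority) is an assumption made for illustration, not a description of the competition data.

```python
import numpy as np


def nmae(y_true, y_pred):
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))


rng = np.random.default_rng(42)

# Synthetic stand-in: about 90% non-spenders, 10% spenders with a long tail.
n = 10_000
is_spender = rng.random(n) < 0.10
y_demo = np.where(is_spender, rng.lognormal(mean=8.0, sigma=1.0, size=n), 0.0)

mean_score = nmae(y_demo, np.full(n, y_demo.mean()))
median_score = nmae(y_demo, np.full(n, np.median(y_demo)))
zero_score = nmae(y_demo, np.zeros(n))

print(f"mean predictor  : {mean_score:.3f}")
print(f"median predictor: {median_score:.3f}")
print(f"zero predictor  : {zero_score:.3f}")
```

The mean minimizes squared error, while the median minimizes absolute error, so on skewed targets the median (or zero) is the harder constant baseline to beat for an MAE-style metric.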


# STEP 7: Define model (LightGBM / RandomForest)

In [None]:
try:
    from lightgbm import LGBMRegressor
    USE_LGBM = True
    print("🚀 Using LightGBMRegressor (tuned + early stopping ready)")

    model = LGBMRegressor(
        n_estimators=8000,       # increase model capacity
        learning_rate=0.015,     # slightly slower learning = smoother fit
        num_leaves=255,          # more flexible trees
        max_depth=-1,

        min_child_samples=40,    # guards against overfitting a little
        subsample=0.9,           # less row randomization than before (no effect unless subsample_freq > 0)
        colsample_bytree=0.7,    # lower column sampling so the trees don't go wild

        reg_alpha=0.3,           # relaxed regularization so the model can learn more
        reg_lambda=0.8,

        random_state=42,
        n_jobs=-1,
    )


except ImportError:
    from sklearn.ensemble import RandomForestRegressor
    USE_LGBM = False
    print("🚀 LightGBM not found, using RandomForestRegressor instead (tuned)")

    model = RandomForestRegressor(
        n_estimators=800,
        max_depth=20,
        min_samples_split=20,
        min_samples_leaf=10,
        max_features="sqrt",
        random_state=42,
        n_jobs=-1,
    )


🚀 Using LightGBMRegressor (tuned + early stopping ready)


# STEP 8: Train model with early stopping + Evaluate on validation

In [None]:
USE_LOG_TARGET = False  # or True, as desired

if USE_LGBM and USE_LOG_TARGET:
    print("📐 Training LGBM with log1p target + early stopping")

    y_train_log = np.log1p(y_train)
    y_valid_log = np.log1p(y_valid)

    model.fit(
        X_train, y_train_log,
        eval_set=[(X_valid, y_valid_log)],
        eval_metric="mae",
        callbacks=[
            early_stopping(stopping_rounds=200),
            log_evaluation(100)
        ]
    )

    y_valid_pred_log = model.predict(X_valid)
    y_valid_pred = np.expm1(y_valid_pred_log)

elif USE_LGBM and not USE_LOG_TARGET:
    print("📐 Training LGBM with RAW target + early stopping")

    model.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        eval_metric="mae",
        callbacks=[
            early_stopping(stopping_rounds=200),
            log_evaluation(100)
        ]
    )

    y_valid_pred = model.predict(X_valid)

else:
    # RandomForest case
    print("📐 Training RandomForest")
    if USE_LOG_TARGET:
        y_train_log = np.log1p(y_train)
        model.fit(X_train, y_train_log)
        y_valid_pred_log = model.predict(X_valid)
        y_valid_pred = np.expm1(y_valid_pred_log)
    else:
        model.fit(X_train, y_train)
        y_valid_pred = model.predict(X_valid)

# clip so predictions are never negative
y_valid_pred = np.maximum(y_valid_pred, 0)

val_mae = mean_absolute_error(y_valid, y_valid_pred)
val_nmae = nmae(y_valid, y_valid_pred)

print(f"Validation MAE : {val_mae:.4f}")
print(f"Validation NMAE: {val_nmae:.6f}")

📐 Training LGBM with RAW target + early stopping
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.046399 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5625
[LightGBM] [Info] Number of data points in the train set: 83200, number of used features: 36
[LightGBM] [Info] Start training from score 10408.760748
Training until validation scores don't improve for 200 rounds
[100]	valid_0's l1: 4928.94	valid_0's l2: 1.81876e+08
[200]	valid_0's l1: 3163.01	valid_0's l2: 1.45463e+08
[300]	valid_0's l1: 2987.39	valid_0's l2: 1.44032e+08
[400]	valid_0's l1: 2991.23	valid_0's l2: 1.45028e+08
Early stopping, best iteration is:
[295]	valid_0's l1: 2988.83	valid_0's l2: 1.43969e+08
Validation MAE : 2988.7675
Validation NMAE: 0.292648
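
The run above used the raw target. With `USE_LOG_TARGET = True` the model is instead trained on `log1p(y)` and its predictions are mapped back with `expm1`. The sketch below, on made-up amounts, shows that the two transforms are inverses and why the clip at 0 is still needed afterwards.

```python
import numpy as np

# Made-up spending amounts spanning several orders of magnitude.
y_demo = np.array([0.0, 50.0, 1_200.0, 255_000.0])

y_log = np.log1p(y_demo)   # what the model would be trained on
y_back = np.expm1(y_log)   # inverse transform applied to predictions

print(np.round(y_log, 3))
print(np.allclose(y_back, y_demo))

# A prediction slightly below zero in log space maps to a negative amount,
# so predictions are clipped at 0 after the inverse transform.
pred_log = np.array([-0.05, 3.9])
pred = np.maximum(np.expm1(pred_log), 0)
print(pred)
```

Note that minimizing MAE in log space is not the same objective as minimizing NMAE on the raw scale, so the two settings are best compared on the validation NMAE computed after the inverse transform.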


# STEP 9: Retrain on full train data & predict test set

In [None]:
print("📚 Retraining on full data...")

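# There is no validation set (and so no early stopping) in this refit, so reuse
# the tree count selected in Step 8 instead of growing all n_estimators trees.
if USE_LGBM and getattr(model, "best_iteration_", None):
    model.set_params(n_estimators=model.best_iteration_)

if USE_LGBM and USE_LOG_TARGET: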
    y_full_log = np.log1p(y)
    model.fit(X, y_full_log)
    y_test_pred_log = model.predict(X_test)
    y_test_pred = np.expm1(y_test_pred_log)

elif USE_LGBM:
    model.fit(X, y)
    y_test_pred = model.predict(X_test)

else:
    # RandomForest
    if USE_LOG_TARGET:
        y_full_log = np.log1p(y)
        model.fit(X, y_full_log)
        y_test_pred_log = model.predict(X_test)
        y_test_pred = np.expm1(y_test_pred_log)
    else:
        model.fit(X, y)
        y_test_pred = model.predict(X_test)

y_test_pred = np.maximum(y_test_pred, 0)

📚 Retraining on full data...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.048805 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5634
[LightGBM] [Info] Number of data points in the train set: 104000, number of used features: 36
[LightGBM] [Info] Start training from score 10369.578410


# STEP 10: Build submission file

In [None]:
submission = pd.DataFrame({
    "id": test["id"],        # must exist in test.csv
    "task1": 0,              # frozen / left at the default of 0
    "task2": 0,
    "task3": y_test_pred,    # ❤️ the values we predicted
    "task4": 0,
    "task5": 0
})

submission.to_csv(OUTPUT_SUB_PATH, index=False)
print(f"✅ Saved submission to: {OUTPUT_SUB_PATH}")

✅ Saved submission to: submission_task3_baseline.csv
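
Before uploading, it can help to run a few cheap checks on the frame. The sketch below uses stand-in ids and predictions (hypothetical values). The expected column list is taken from the cell above, while the uniqueness and non-negativity rules are assumptions about what a valid submission looks like.

```python
import numpy as np
import pandas as pd

# Stand-ins for test["id"] and y_test_pred.
test_ids = pd.Series([101, 102, 103], name="id")
preds = np.array([0.0, 153.2, 9870.5])

submission_demo = pd.DataFrame({
    "id": test_ids,
    "task1": 0,
    "task2": 0,
    "task3": preds,
    "task4": 0,
    "task5": 0,
})

expected_cols = ["id", "task1", "task2", "task3", "task4", "task5"]
assert list(submission_demo.columns) == expected_cols
assert len(submission_demo) == len(test_ids)      # one row per test player
assert submission_demo["id"].is_unique            # no duplicated ids
assert submission_demo["task3"].notna().all()     # no missing predictions
assert (submission_demo["task3"] >= 0).all()      # spending cannot be negative
print("submission checks passed")
```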
