# EstrelaBet - Feature Engineering (First Session Only)

## Objective
Create features for predicting **customer churn**, defined as customers who do NOT make a redeposit after their first session.

**IMPORTANT**: To avoid data leakage, we use ONLY first-session features.
Features from subsequent sessions would encode whether the user returned,
which is exactly what we're trying to predict.

## Target Variable
- **Churn = 1**: Customer did NOT make a redeposit (churned)
- **Churn = 0**: Customer made a redeposit (retained)

## Feature Categories (First Session Only)
1. **Temporal Features** - When did the first session occur
2. **Betting Features** - Betting behavior in first session
3. **Session Features** - Session characteristics
4. **User Profile Features** - Demographics and account info
5. **Campaign Features** - Marketing exposure

---

## 1. Setup and Data Loading

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-whitegrid')

In [3]:
# Preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler, RobustScaler
from scipy import stats

pd.set_option('display.max_columns', 50)

In [4]:
# Load the raw dataset
df = pd.read_csv('../data/test_dataset.csv')

print(f"Dataset Shape: {df.shape}")
print(f"Unique Users: {df['user_id'].nunique()}")

Dataset Shape: (10000, 29)
Unique Users: 1446


In [5]:
# Convert timestamps and sort
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['user_id', 'timestamp']).reset_index(drop=True)

---
## 2. Target Variable Construction

In [None]:
def create_target_variable(df):
    """
    Create target: Did user churn (NOT make a redeposit) after their first session?
    
    Target:
        - 1 = Churned (no redeposit after first session)
        - 0 = Retained (made a redeposit after first session)
    """
    df_sorted = df.sort_values(['user_id', 'timestamp']).copy()
    
    # Mark first session
    df_sorted['is_first_session'] = ~df_sorted.duplicated('user_id', keep='first')
    
    # Check for churn (no redeposit) after first session
    def check_churn(group):
        subsequent = group[~group['is_first_session']]
        if len(subsequent) == 0:
            return 1  # No subsequent sessions = churned
        return 0 if (subsequent['deposit_amount'].fillna(0) > 0).any() else 1
    
    target = df_sorted.groupby('user_id').apply(check_churn)
    target.name = 'churn'
    
    return target, df_sorted

In [None]:
# Create target variable
target, df_processed = create_target_variable(df)

print("Target Distribution:")
print(target.value_counts())
print(f"\nChurn Rate: {target.mean():.2%}")
print(f"Retention Rate: {1 - target.mean():.2%}")
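
To make the labelling rule concrete, here is a minimal sketch on a made-up three-user frame (the `toy` data and its values are hypothetical). It applies the same rule as `create_target_variable`, but with vectorized operations instead of `groupby().apply()`, and it assumes, as the function above does, that each row is one session.

```python
import pandas as pd

# Hypothetical toy data: one row per session.
toy = pd.DataFrame({
    "user_id": ["a", "a", "b", "c", "c"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-03 09:00",   # a: deposits again later
        "2024-01-02 12:00",                       # b: never comes back
        "2024-01-01 08:00", "2024-01-05 20:00",   # c: returns, but without a deposit
    ]),
    "deposit_amount": [100.0, 50.0, 80.0, 20.0, None],
})
toy = toy.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# First row per user is treated as the first session.
is_first = ~toy.duplicated("user_id", keep="first")

# A redeposit is any positive deposit in a later session.
later_deposit = toy["deposit_amount"].fillna(0).gt(0) & ~is_first
redeposited = later_deposit.groupby(toy["user_id"]).any()

churn = (~redeposited).astype(int).rename("churn")
print(churn.to_dict())
```

User `c` shows the edge case worth checking in real data: a returning user who never deposits again is still labelled as churned.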

---
## 3. First Session Features

Extract features from the user's first session - early indicators of user behavior.

In [8]:
# Get first session for each user
first_sessions = df_processed[df_processed['is_first_session']].copy()

In [9]:
# Initialize first session features
first_session_features = pd.DataFrame()
first_session_features['user_id'] = first_sessions['user_id']

In [10]:
# Temporal features
first_session_features['first_session_hour'] = first_sessions['hour'].values
first_session_features['first_session_day_of_week'] = first_sessions['day_of_week'].values
first_session_features['first_session_weekend'] = first_sessions['is_weekend'].values
first_session_features['first_session_holiday'] = first_sessions['is_holiday'].values
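
The dataset ships `hour`, `day_of_week`, and `is_weekend` as precomputed columns, and the notebook does not say how they were derived. The sketch below shows one way they could be cross-checked against `timestamp`. It assumes the pandas convention (Monday=0, Sunday=6) and a Saturday/Sunday weekend, neither of which is confirmed by the source; the two timestamps are made up.

```python
import pandas as pd

# Hypothetical timestamps: a Saturday night and a Monday early morning.
ts = pd.Series(pd.to_datetime(["2024-01-06 23:15", "2024-01-08 03:40"]))

derived = pd.DataFrame({
    "hour": ts.dt.hour,
    "day_of_week": ts.dt.dayofweek,                    # Monday=0 ... Sunday=6
    "is_weekend": (ts.dt.dayofweek >= 5).astype(int),  # Saturday or Sunday
})
print(derived)
```

Comparing `derived` with the shipped columns on the real data would reveal whether the dataset uses a different day numbering.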

In [11]:
# Game and device features
first_session_features['first_session_game_type'] = first_sessions['game_type'].values
first_session_features['first_session_is_sports'] = (first_sessions['game_type'] == 'sports_betting').astype(int).values
first_session_features['first_session_device'] = first_sessions['device_type'].values
first_session_features['first_session_country'] = first_sessions['country'].values

In [12]:
# Payment and account features
first_session_features['first_session_payment_method'] = first_sessions['payment_method'].values
first_session_features['first_session_account_age'] = first_sessions['account_age_days'].values
first_session_features['first_session_vip_tier'] = first_sessions['vip_tier'].values
first_session_features['first_session_user_age'] = first_sessions['user_age'].values

In [13]:
# Campaign and bonus features
first_session_features['first_session_campaign'] = first_sessions['campaign_type'].values
first_session_features['first_session_bonus_used'] = first_sessions['bonus_used'].values

In [14]:
# Betting behavior features
first_session_features['first_session_bet_amount'] = first_sessions['bet_amount'].values
first_session_features['first_session_win_amount'] = first_sessions['win_amount'].values
first_session_features['first_session_net_result'] = first_sessions['net_result'].values
first_session_features['first_session_won'] = (first_sessions['net_result'] > 0).astype(int).values
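
The monetary columns span several orders of magnitude (bet amounts in the sample rows run from 279.05 to 15451.79), which can dominate linear models. As an optional sketch, not part of this notebook's pipeline, a signed log transform compresses the scale while keeping losses negative. The helper name `signed_log1p` is hypothetical.

```python
import numpy as np
import pandas as pd

def signed_log1p(values):
    """Compress magnitude with log1p while preserving the sign of each value."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.log1p(np.abs(values))

# Net results taken from the sample rows, plus a break-even case.
net_result = pd.Series([5947.10, -11278.64, 186.09, 0.0])
transformed = signed_log1p(net_result)
print(np.round(transformed, 3))
```

Tree-based models are insensitive to monotonic transforms like this one, so it mainly matters for linear or distance-based models.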

In [15]:
# Session behavior features
first_session_features['first_session_length'] = first_sessions['session_length_minutes'].values
first_session_features['first_session_games_played'] = first_sessions['games_played'].values

In [16]:
# Deposit/Withdrawal features
first_session_features['first_session_deposited'] = (first_sessions['deposit_amount'].fillna(0) > 0).astype(int).values
first_session_features['first_session_deposit_amount'] = first_sessions['deposit_amount'].fillna(0).values
first_session_features['first_session_withdrew'] = (first_sessions['withdrawal_amount'].fillna(0) > 0).astype(int).values

In [17]:
print(f"First Session Features: {first_session_features.shape}")
first_session_features.head()

First Session Features: (1446, 24)


| | user_id | first_session_hour | first_session_day_of_week | first_session_weekend | first_session_holiday | first_session_game_type | first_session_is_sports | first_session_device | first_session_country | first_session_payment_method | first_session_account_age | first_session_vip_tier | first_session_user_age | first_session_campaign | first_session_bonus_used | first_session_bet_amount | first_session_win_amount | first_session_net_result | first_session_won | first_session_length | first_session_games_played | first_session_deposited | first_session_deposit_amount | first_session_withdrew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | user_000002 | 16 | 1 | 0 | 1 | blackjack | 0 | mobile | MX | e_wallet | 17 | gold | 43.0 | welcome_bonus | 0 | 6731.21 | 12678.31 | 5947.1 | 1 | 144 | 48 | 1 | 3802.83 | 0 |
| 7 | user_000003 | 19 | 3 | 0 | 0 | blackjack | 0 | tablet | BR | e_wallet | 3 | gold | 52.0 | welcome_bonus | 1 | 11278.64 | 0.0 | -11278.64 | 0 | 240 | 80 | 1 | 5584.4 | 0 |
| 19 | user_000004 | 22 | 4 | 0 | 0 | poker | 0 | mobile | FR | e_wallet | 6 | gold | 63.0 | none | 0 | 279.05 | 465.14 | 186.09 | 1 | 222 | 74 | 1 | 2565.21 | 0 |
| 30 | user_000005 | 20 | 6 | 1 | 0 | poker | 0 | tablet | MX | credit_card | 29 | bronze | 42.0 | free_spins | 0 | 15451.79 | 0.0 | -15451.79 | 0 | 376 | 125 | 0 | 0.0 | 0 |
| 38 | user_000006 | 3 | 3 | 0 | 0 | live_dealer | 0 | desktop | MX | bank_transfer | 13 | silver | 42.0 | reload_bonus | 0 | 387.12 | 660.91 | 273.79 | 1 | 311 | 103 | 1 | 1852.56 | 1 |


---
## 4. Merge Features and Add Target

**NOTE**: We skip Behavioral, Financial, Engagement, and Trend features as they would cause data leakage.
Those features require data from multiple sessions, but we're predicting based on first session only.

In [None]:
# Use only first session features (no data leakage)
features_df = first_session_features.copy()

# Add target (churn)
features_df = features_df.merge(target.reset_index(), on='user_id', how='left')

print(f"Features Dataset: {features_df.shape}")
print("Target column: 'churn' (1=churned, 0=retained)")
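
A left merge silently produces `NaN` targets if a user is missing from `target`, and duplicates rows if a key repeats. As a hedged sketch on hypothetical toy frames, `validate` and `indicator` (both standard `DataFrame.merge` parameters) can make those failures loud:

```python
import pandas as pd

# Hypothetical stand-ins for first_session_features and target.
features = pd.DataFrame({"user_id": ["a", "b", "c"], "first_session_hour": [16, 19, 22]})
target = pd.Series([0, 1, 1], index=pd.Index(["a", "b", "c"], name="user_id"), name="churn")

merged = features.merge(
    target.reset_index(),
    on="user_id",
    how="left",
    validate="one_to_one",   # raises MergeError if user_id repeats on either side
    indicator=True,          # adds a '_merge' column: 'both', 'left_only', ...
)

# Every feature row should have found its target.
assert (merged["_merge"] == "both").all(), "some users have no target"
merged = merged.drop(columns="_merge")
print(merged.shape)
```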

---
## 5. Handle Missing Values

In [63]:
def handle_missing_values(df):
    """
    Handle missing values based on feature type and business logic.
    """
    df = df.copy()
    
    # Numerical features - fill with median
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    for col in numerical_cols:
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].median())
    
    # Categorical features - fill with 'unknown'
    categorical_cols = df.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        if df[col].isnull().any():
            df[col] = df[col].fillna('unknown')
    
    return df

In [64]:
# Apply missing value handling
features_clean = handle_missing_values(features_df)

# Verify no missing values
print(f"Remaining missing values: {features_clean.isnull().sum().sum()}")

Remaining missing values: 0


---
## 6. Feature Encoding

In [65]:
# Identify categorical columns
categorical_cols = features_clean.select_dtypes(include=['object']).columns.tolist()
categorical_cols.remove('user_id')  # Keep user_id as is

print(f"Categorical columns to encode: {len(categorical_cols)}")
for col in categorical_cols:
    print(f"  - {col}: {features_clean[col].nunique()} unique values")

Categorical columns to encode: 6
  - first_session_game_type: 7 unique values
  - first_session_device: 3 unique values
  - first_session_country: 11 unique values
  - first_session_payment_method: 6 unique values
  - first_session_vip_tier: 5 unique values
  - first_session_campaign: 5 unique values


In [66]:
# VIP tier - ordinal encoding
vip_order = {'unknown': 0, 'bronze': 1, 'silver': 2, 'gold': 3, 'platinum': 4, 'diamond': 5}
features_encoded = features_clean.copy()

if 'first_session_vip_tier' in features_encoded.columns:
    features_encoded['first_session_vip_tier_encoded'] = features_encoded['first_session_vip_tier'].map(vip_order).fillna(0)

In [None]:
# One-hot encode low-cardinality features (first session only).
# dtype=int keeps the dummies numeric, so later select_dtypes(include=[np.number])
# calls pick them up (pandas 2.x returns bool dummies by default).
low_cardinality_cols = [
    'first_session_device', 'first_session_game_type', 'first_session_campaign'
]

for col in low_cardinality_cols:
    if col in features_encoded.columns:
        dummies = pd.get_dummies(features_encoded[col], prefix=col, drop_first=True, dtype=int)
        features_encoded = pd.concat([features_encoded, dummies], axis=1)
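
One practical consequence of one-hot encoding is that new data must end up with exactly the same dummy columns as the training data. The sketch below, on hypothetical toy frames, shows one way to align them with `reindex`. Note that `drop_first` is applied only when defining the training columns, because on a batch with a single category it would drop that category entirely.

```python
import pandas as pd

# Hypothetical training data and a later scoring batch with fewer categories.
train = pd.DataFrame({"first_session_device": ["desktop", "mobile", "tablet"]})
new = pd.DataFrame({"first_session_device": ["mobile", "mobile"]})

train_dummies = pd.get_dummies(
    train, columns=["first_session_device"], drop_first=True, dtype=int
)
new_dummies = pd.get_dummies(new, columns=["first_session_device"], dtype=int)

# Align to the training layout: absent categories become 0, extra columns are dropped.
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(new_aligned)
```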

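In [None]:
# Target encode higher-cardinality features (first session only).
# CAUTION: these means are computed from 'churn' over the full dataset, so each
# row's encoding includes its own label. Before modeling, refit the encoding on
# the training split (or training folds) only; otherwise the target leaks into
# the features.
high_cardinality_cols = [
    'first_session_country', 'first_session_payment_method'
]

global_churn_rate = features_encoded['churn'].mean()
for col in high_cardinality_cols:
    if col in features_encoded.columns:
        target_means = features_encoded.groupby(col)['churn'].mean()
        features_encoded[f'{col}_encoded'] = features_encoded[col].map(target_means).fillna(global_churn_rate)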
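
Because target encoding reads the label, it is a common source of leakage when fitted on all rows. One widely used mitigation is out-of-fold encoding, where each row is encoded using category means computed on the other folds. The sketch below is an illustration under stated assumptions, not this notebook's method: the helper `oof_target_encode`, the fold count, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target, n_splits=5, seed=42):
    """Out-of-fold mean encoding: no row is encoded using its own label."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    for enc_idx in folds:
        # Fit category means on every row outside the current fold.
        fit_mask = np.ones(len(df), dtype=bool)
        fit_mask[enc_idx] = False
        fit_part = df[fit_mask]
        means = fit_part.groupby(col)[target].mean()
        fallback = fit_part[target].mean()  # used for categories unseen in the fit rows
        values = df[col].iloc[enc_idx].map(means).fillna(fallback)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

toy = pd.DataFrame({
    "first_session_country": ["BR", "BR", "MX", "MX", "FR", "BR", "MX", "FR"],
    "churn": [1, 0, 1, 1, 0, 1, 0, 0],
})
toy["country_encoded"] = oof_target_encode(toy, "first_session_country", "churn", n_splits=4)
print(toy)
```

Even with out-of-fold encoding, a held-out test set should be encoded with means fitted on the training data alone.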

In [69]:
# Drop original categorical columns
cols_to_drop = [col for col in categorical_cols if col in features_encoded.columns]
features_encoded = features_encoded.drop(columns=cols_to_drop)

print(f"Encoded Features Dataset: {features_encoded.shape}")

Encoded Features Dataset: (1446, 105)


In [None]:
def add_cyclical_features(df):
    """Add sin/cos encoding for cyclical features"""
    df = df.copy()
    
    # Hour encoding (first session only)
    if 'first_session_hour' in df.columns:
        df['first_session_hour_sin'] = np.sin(2 * np.pi * df['first_session_hour'] / 24)
        df['first_session_hour_cos'] = np.cos(2 * np.pi * df['first_session_hour'] / 24)
    
    return df

In [None]:
def add_day_cyclical_features(df):
    """Add sin/cos encoding for day of week"""
    df = df.copy()
    
    # Day of week encoding (first session only)
    if 'first_session_day_of_week' in df.columns:
        df['first_session_dow_sin'] = np.sin(2 * np.pi * df['first_session_day_of_week'] / 7)
        df['first_session_dow_cos'] = np.cos(2 * np.pi * df['first_session_day_of_week'] / 7)
    
    return df

In [72]:
# Apply cyclical encodings
features_final = add_cyclical_features(features_encoded)
features_final = add_day_cyclical_features(features_final)

print(f"Final Features Dataset: {features_final.shape}")

Final Features Dataset: (1446, 113)
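
The point of the sin/cos encoding is that values at the two ends of a cycle end up close together. A small self-contained check (the helper `encode_cyclical` is hypothetical and uses the same formula as the functions above) shows that 23:00 and 00:00 are near neighbours, while 00:00 and 12:00 are as far apart as possible:

```python
import numpy as np

def encode_cyclical(value, period):
    """Map values on a cycle of the given period to (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

hours = encode_cyclical([23, 0, 12], period=24)

dist_23_to_0 = np.linalg.norm(hours[0] - hours[1])   # about 0.26
dist_0_to_12 = np.linalg.norm(hours[1] - hours[2])   # 2.0, opposite points on the circle
print(round(dist_23_to_0, 2), round(dist_0_to_12, 2))
```

With the raw integer hour, the same two pairs would be 23 and 12 units apart, which misrepresents how close late night and midnight are.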


---
## 7. Feature Analysis

In [None]:
# Get numeric columns for correlation
numeric_cols = features_final.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols = [col for col in numeric_cols if col not in ['user_id', 'churn']]

In [None]:
# Correlation with target (churn)
target_correlations = features_final[numeric_cols + ['churn']].corr()['churn'].drop('churn')
target_correlations = target_correlations.sort_values(key=abs, ascending=False)

print("Top 20 Features Correlated with Churn Target:")
print("="*50)
print("(Positive = higher value is associated with MORE churn)")
print("(Negative = higher value is associated with LESS churn)")
print("="*50)
for feature, corr in target_correlations.head(20).items():
    print(f"{feature:45} {corr:+.4f}")
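
`DataFrame.corr()` defaults to Pearson correlation, which a few extreme values in skewed features (such as bet amounts) can distort. As a hedged cross-check, a rank-based (Spearman-style) correlation can be computed by correlating ranks. The toy data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical feature: monotonic in the label but heavily right-skewed.
x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({
    "feature": np.exp(x),
    "churn": (x > 5).astype(int),
})

pearson = toy["feature"].corr(toy["churn"])
# Pearson correlation of the ranks is the Spearman rank correlation.
spearman = toy["feature"].rank().corr(toy["churn"].rank())

print(f"pearson={pearson:.3f} spearman={spearman:.3f}")
```

A large gap between the two values suggests that the Pearson figure is driven by outliers rather than by a consistent relationship.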

In [None]:
# Visualize top correlations with churn
fig, ax = plt.subplots(figsize=(10, 10))

top_features = target_correlations.head(20)
# For churn: negative correlation = associated with less churn (green), positive = associated with more churn (red)
colors = ['#e74c3c' if x > 0 else '#27ae60' for x in top_features.values]

ax.barh(range(len(top_features)), top_features.values, color=colors)
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features.index)
ax.set_xlabel('Correlation with Churn Target')
ax.set_title('Top 20 Features Correlated with Churn\n(Red = associated with more churn, Green = associated with less churn)', 
             fontsize=14, fontweight='bold')
ax.axvline(0, color='black', linewidth=0.5)
ax.invert_yaxis()

plt.tight_layout()
plt.show()

---
## 8. Save Final Dataset

In [76]:
# Save the final feature dataset
features_final.to_csv('../data/features_engineered.csv', index=False)
print(f"Saved engineered features: {features_final.shape}")

Saved engineered features: (1446, 113)


In [None]:
# Save feature names for reference
feature_names = [col for col in features_final.columns if col not in ['user_id', 'churn']]
with open('../data/feature_names.txt', 'w') as f:
    for name in feature_names:
        f.write(f"{name}\n")

print(f"Total features: {len(feature_names)}")

In [None]:
# Summary statistics
print("\nFeature Engineering Summary")
print("="*50)
print(f"Total Users: {len(features_final):,}")
print(f"Total Features: {len(feature_names)}")
print(f"Churn Rate: {features_final['churn'].mean():.2%}")
print(f"Retention Rate: {1 - features_final['churn'].mean():.2%}")
print("\nFeature Categories:")
print(f"  - Numerical: {features_final.select_dtypes(include=[np.number]).shape[1]}")
print(f"  - Boolean/Binary: {sum(features_final[feature_names].nunique() == 2)}")

---
## Next Steps

The next notebook (`03_modeling_and_evaluation.ipynb`) will:
1. Split data into train/validation/test sets
2. Train multiple models (Logistic Regression, Random Forest, XGBoost, LightGBM)
3. Perform hyperparameter tuning
4. Evaluate models with ML and business metrics
5. Analyze feature importance and SHAP values
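
As a rough preview of step 1, the split can be stratified on `churn` so that every subset keeps the same class balance. The sketch below uses only pandas. The 60/20/20 proportions, the 70% churn rate, and the toy frame are hypothetical, and the next notebook may well use a different approach.

```python
import pandas as pd

# Hypothetical stand-in for the engineered feature table: 100 users, 70% churn.
toy = pd.DataFrame({
    "user_id": [f"user_{i:03d}" for i in range(100)],
    "churn": [1] * 70 + [0] * 30,
})

# Draw 20% of each class for test, then 25% of the remainder (20% overall) for validation.
test = toy.groupby("churn").sample(frac=0.2, random_state=42)
rest = toy.drop(test.index)
valid = rest.groupby("churn").sample(frac=0.25, random_state=42)
train = rest.drop(valid.index)

for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, len(part), round(part["churn"].mean(), 2))
```

Any encoding that reads the target should then be fitted on `train` only and applied unchanged to `valid` and `test`.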