# Churn Prediction Model

This notebook implements a churn prediction model for a music streaming service.

## Day 1: Data Pipeline & Feature Engineering Foundation

**Goal**: Transform event logs into user-level features for prediction.

## Step 1.1: Load Data and Create Labels

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time

# Load data
print("Loading data...")
train = pd.read_parquet('train.parquet')
test = pd.read_parquet('test.parquet')

print(f"Train: {train.shape[0]:,} rows")
print(f"Test: {test.shape[0]:,} rows")

Loading data...
Train: 17,499,636 rows
Test: 4,393,179 rows


In [2]:
# Create user-level churn labels
churned_users = train[train['page'] == 'Cancellation Confirmation']['userId'].unique()
all_users = train['userId'].unique()

print(f"Total users: {len(all_users):,}")
print(f"Churned users: {len(churned_users):,}")
print(f"Churn rate: {len(churned_users)/len(all_users):.2%}")

# Get churn timestamps for temporal slicing later.
# min() picks the earliest confirmation regardless of row order; first() would depend on it.
churn_times = train[train['page'] == 'Cancellation Confirmation'].groupby('userId')['time'].min()
print(f"\nChurn times recorded for {len(churn_times):,} users")

Total users: 19,140
Churned users: 4,271
Churn rate: 22.31%

Churn times recorded for 4,271 users
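
The churn timestamp is meant to be the earliest confirmation event per user. As a minimal sketch on invented, deliberately unsorted rows (the `events` frame is hypothetical, and it is an assumption that a user could have more than one confirmation row), the difference between taking the first row and taking the minimum time looks like this:

```python
import pandas as pd

# Invented events for one user, stored out of chronological order
events = pd.DataFrame({
    'userId': ['u1', 'u1'],
    'page': ['Cancellation Confirmation', 'Cancellation Confirmation'],
    'time': pd.to_datetime(['2018-01-05', '2018-01-03']),
})

confirmations = events[events['page'] == 'Cancellation Confirmation']

# first() returns whichever row comes first in storage order
by_row_order = confirmations.groupby('userId')['time'].first()

# min() returns the earliest timestamp no matter how the rows are ordered
by_time = confirmations.groupby('userId')['time'].min()

print(by_row_order['u1'], by_time['u1'])
```

If every churned user has exactly one confirmation event, both approaches give the same result.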


## Step 1.2: Vectorized Feature Engineering

Features are computed with `groupby` and vectorized pandas operations instead of row-by-row loops, an estimated ~100x speedup; the full training set processes in about 34 seconds (Step 1.3).

Feature categories:
- **Engagement**: total events, songs, sessions
- **Behavioral ratios**: thumbs up/down per song, error rate
- **Temporal**: days active, recency, activity trend
- **Subscription**: paid/free status, level changes, downgrade/upgrade events
- **Content diversity**: unique songs/artists, listen time
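
To make the vectorized approach concrete, here is a minimal sketch on a made-up event log (the `toy` frame and its values are illustrative, not taken from the dataset). It counts songs per user once with a Python loop and once with a single `groupby`, then checks that both give the same answer.

```python
import pandas as pd

# Toy event log; values are invented for illustration only
toy = pd.DataFrame({
    'userId': ['a', 'a', 'b', 'a'],
    'page': ['NextSong', 'Home', 'NextSong', 'NextSong'],
})

# Row-by-row: one Python-level iteration per event
loop_counts = {}
for row in toy.itertuples(index=False):
    if row.page == 'NextSong':
        loop_counts[row.userId] = loop_counts.get(row.userId, 0) + 1

# Vectorized: filter once, then aggregate per user in a single groupby
vec_counts = toy[toy['page'] == 'NextSong'].groupby('userId').size()

# Both approaches should agree
assert vec_counts.to_dict() == loop_counts
```

The loop does its work in the Python interpreter for every event, while the `groupby` version pushes the aggregation into pandas internals, which is where the speedup on millions of rows comes from.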

In [3]:
def create_user_features_vectorized(df, churn_times_series=None):
    """
    Vectorized feature engineering using groupby operations.
    ~100x faster than row-by-row processing.

    Parameters:
    - df: DataFrame with events
    - churn_times_series: Series with userId as index, churn time as value (for Cancel page handling)

    Returns:
    - DataFrame with one row per user
    """
    print("Computing basic aggregations...")

    # ===== BASIC AGGREGATIONS =====
    basic_agg = df.groupby('userId').agg(
        total_events=('page', 'count'),
        total_sessions=('sessionId', 'nunique'),
        time_min=('time', 'min'),
        time_max=('time', 'max'),
        registration=('registration', 'first'),
        gender=('gender', 'first'),
        location=('location', 'first'),
        # NOTE: 'last' follows row order, so this is the most recent level only if df is sorted by time
        level_last=('level', 'last'),
    )

    # ===== PAGE COUNTS (using crosstab for efficiency) =====
    print("Computing page counts...")
    important_pages = ['NextSong', 'Thumbs Up', 'Thumbs Down', 'Add to Playlist',
                       'Add Friend', 'Downgrade', 'Upgrade', 'Error', 'Help',
                       'Home', 'Settings', 'Roll Advert', 'Logout']

    # Filter to important pages only, then crosstab
    page_df = df[df['page'].isin(important_pages)][['userId', 'page']]
    page_counts = pd.crosstab(page_df['userId'], page_df['page'])

    # Ensure all important pages exist as columns
    for page in important_pages:
        if page not in page_counts.columns:
            page_counts[page] = 0

    # Rename columns
    page_counts.columns = [f'page_{col.lower().replace(" ", "_")}' for col in page_counts.columns]

    # ===== SESSION STATISTICS =====
    print("Computing session statistics...")
    session_sizes = df.groupby(['userId', 'sessionId']).size().reset_index(name='session_length')
    session_stats = session_sizes.groupby('userId')['session_length'].agg(
        avg_session_length='mean',
        max_session_length='max',
        std_session_length='std'
    ).fillna(0)

    # ===== SONG-RELATED FEATURES =====
    print("Computing song features...")
    songs_df = df[df['page'] == 'NextSong']

    song_agg = songs_df.groupby('userId').agg(
        total_songs=('page', 'count'),
        unique_songs=('song', 'nunique'),
        unique_artists=('artist', 'nunique'),
        avg_song_length=('length', lambda x: x.clip(upper=1200).mean()),
        total_listen_time=('length', lambda x: x.clip(upper=1200).sum()),
        std_song_length=('length', lambda x: x.clip(upper=1200).std()),
    ).fillna(0)

    # ===== LEVEL CHANGES =====
    print("Computing subscription features...")
    df_sorted = df.sort_values(['userId', 'time'])
    df_sorted['level_changed'] = (df_sorted['level'] != df_sorted.groupby('userId')['level'].shift()).astype(int)
    level_changes = df_sorted.groupby('userId')['level_changed'].sum() - 1  # subtract 1 for first row
    level_changes = level_changes.clip(lower=0)

    # Paid events ratio
    df_sorted['is_paid_event'] = (df_sorted['level'] == 'paid').astype(int)
    paid_ratio = df_sorted.groupby('userId')['is_paid_event'].mean()

    # ===== ACTIVITY TREND =====
    print("Computing temporal features...")
    def compute_activity_trend(group):
        if len(group) <= 1:
            return 0
        mid_time = group['time'].min() + (group['time'].max() - group['time'].min()) / 2
        first_half = (group['time'] <= mid_time).sum()
        second_half = (group['time'] > mid_time).sum()
        return (second_half - first_half) / max(first_half, 1)

    # include_groups=False requires pandas >= 2.2
    activity_trend = df.groupby('userId').apply(compute_activity_trend, include_groups=False)

    # ===== CANCEL PAGE VISITS (with 12-hour exclusion) =====
    print("Computing cancel page visits...")
    cancel_df = df[df['page'] == 'Cancel'][['userId', 'time']].copy()

    if churn_times_series is not None and len(cancel_df) > 0:
        # Merge churn times
        cancel_df = cancel_df.merge(
            churn_times_series.rename('churn_time').reset_index(),
            on='userId',
            how='left'
        )
        # Count only cancels > 12 hours before churn (or all if no churn)
        cancel_df['is_safe'] = (
            cancel_df['churn_time'].isna() |
            (cancel_df['time'] < cancel_df['churn_time'] - pd.Timedelta(hours=12))
        )
        cancel_page_visits = cancel_df[cancel_df['is_safe']].groupby('userId').size()
    else:
        cancel_page_visits = cancel_df.groupby('userId').size()

    # ===== COMBINE ALL FEATURES =====
    print("Combining features...")
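    features = basic_agg.copy()

    # Join each feature block and fill gaps only in the newly joined numeric columns.
    # A blanket .fillna(0) on the whole frame would also replace missing gender and
    # location values with 0 before they can be mapped to 'Unknown' below.
    for block in (page_counts, session_stats, song_agg):
        features = features.join(block, how='left')
        features[block.columns] = features[block.columns].fillna(0)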

    # Add level changes and paid ratio
    features['level_changes'] = level_changes
    features['paid_ratio'] = paid_ratio

    # Add activity trend
    features['activity_trend'] = activity_trend

    # Add cancel page visits
    features['cancel_page_visits'] = cancel_page_visits.reindex(features.index).fillna(0).astype(int)

    # ===== COMPUTE DERIVED FEATURES =====
    print("Computing derived features...")

    # Temporal features
    features['days_active'] = (features['time_max'] - features['time_min']).dt.days + 1
    features['days_since_registration'] = (features['time_max'] - features['registration']).dt.days
    features['events_per_day'] = features['total_events'] / features['days_active'].clip(lower=1)
    features['songs_per_day'] = features['total_songs'] / features['days_active'].clip(lower=1)

    # Subscription features
    features['is_paid'] = (features['level_last'] == 'paid').astype(int)
    features['has_downgrade'] = (features['page_downgrade'] > 0).astype(int)
    features['has_upgrade'] = (features['page_upgrade'] > 0).astype(int)

    # Behavioral ratios
    features['thumbs_up_ratio'] = features['page_thumbs_up'] / features['total_songs'].clip(lower=1)
    features['thumbs_down_ratio'] = features['page_thumbs_down'] / features['total_songs'].clip(lower=1)
    features['playlist_add_ratio'] = features['page_add_to_playlist'] / features['total_songs'].clip(lower=1)
    features['error_rate'] = features['page_error'] / features['total_events'].clip(lower=1)
    features['ad_ratio'] = features['page_roll_advert'] / features['total_songs'].clip(lower=1)

    # Song repeat ratio
    features['song_repeat_ratio'] = features['total_songs'] / features['unique_songs'].clip(lower=1)

    # Fix ratios for users with 0 songs
    zero_songs = features['total_songs'] == 0
    ratio_cols = ['thumbs_up_ratio', 'thumbs_down_ratio', 'playlist_add_ratio', 'ad_ratio', 'song_repeat_ratio']
    features.loc[zero_songs, ratio_cols] = 0

    # ===== EXTRACT STATE FROM LOCATION =====
    def extract_state(loc):
        if pd.isna(loc) or loc == 'Unknown':
            return 'Unknown'
        if ',' in str(loc):
            return str(loc).split(',')[-1].strip()[:2]
        return 'Unknown'

    features['state'] = features['location'].apply(extract_state)

    # Fix gender - ensure it's always a string (handles mixed types from NaN)
    features['gender'] = features['gender'].fillna('Unknown').astype(str)

    # ===== CLEANUP =====
    # Drop intermediate columns
    features = features.drop(columns=['time_min', 'time_max', 'registration', 'location', 'level_last'])

    # Reset index to make userId a column
    features = features.reset_index()

    print(f"Done! Created {len(features)} user feature rows with {len(features.columns)-1} features")
    return features
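
The 12-hour exclusion in the cancel-page step can be checked on a small example. The sketch below repeats the same merge-and-mask logic on invented data (the user IDs, timestamps, and the `cancel_visits` and `churn_time` names are hypothetical): user `u1` churns at noon on January 2, and user `u2` never churns.

```python
import pandas as pd

# Invented Cancel page visits
cancel_visits = pd.DataFrame({
    'userId': ['u1', 'u1', 'u2'],
    'time': pd.to_datetime(['2018-01-01 08:00', '2018-01-02 11:00', '2018-01-01 09:00']),
})

# Invented churn time, indexed by userId like churn_times in the notebook
churn_time = pd.Series(
    [pd.Timestamp('2018-01-02 12:00')],
    index=pd.Index(['u1'], name='userId'),
)

merged = cancel_visits.merge(
    churn_time.rename('churn_time').reset_index(),
    on='userId',
    how='left',
)

# Keep a visit if the user never churned, or if it is more than 12 hours before churn
merged['is_safe'] = merged['churn_time'].isna() | (
    merged['time'] < merged['churn_time'] - pd.Timedelta(hours=12)
)
safe_counts = merged[merged['is_safe']].groupby('userId').size()

print(safe_counts.to_dict())
```

Here `u1` keeps only the visit from the previous morning; the visit one hour before cancellation is dropped, presumably so the feature does not simply restate the label. When `churn_times_series=None` is passed, as for the test set, every Cancel visit is counted.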

## Step 1.3: Create Training Dataset

In [4]:
# Create features for all training users using vectorized operations
print("Creating features for training users...")
print(f"Processing {len(all_users):,} users from {len(train):,} events\n")

start_time = time.time()

train_features = create_user_features_vectorized(train, churn_times_series=churn_times)

elapsed = time.time() - start_time
print(f"\nCompleted in {elapsed:.1f} seconds")

# Add churn labels (vectorized membership test)
train_features['churn'] = train_features['userId'].isin(churned_users).astype(int)

print(f"\nTraining set shape: {train_features.shape}")
print(f"Churn distribution:\n{train_features['churn'].value_counts()}")

Creating features for training users...
Processing 19,140 users from 17,499,636 events

Computing basic aggregations...
Computing page counts...
Computing session statistics...
Computing song features...
Computing subscription features...
Computing temporal features...
Computing cancel page visits...
Combining features...
Computing derived features...
Done! Created 19140 user feature rows with 43 features

Completed in 34.4 seconds

Training set shape: (19140, 45)
Churn distribution:
churn
0    14869
1     4271
Name: count, dtype: int64


In [5]:
# Verify feature quality
print("Feature columns:")
print(train_features.columns.tolist())
print(f"\nTotal features: {len(train_features.columns) - 2}")  # excluding userId and churn

# Check for any missing values
missing = train_features.isnull().sum()
if missing.sum() > 0:
    print("\nMissing values:")
    print(missing[missing > 0])
else:
    print("\nNo missing values in features!")

Feature columns:
['userId', 'total_events', 'total_sessions', 'gender', 'page_add_friend', 'page_add_to_playlist', 'page_downgrade', 'page_error', 'page_help', 'page_home', 'page_logout', 'page_nextsong', 'page_roll_advert', 'page_settings', 'page_thumbs_down', 'page_thumbs_up', 'page_upgrade', 'avg_session_length', 'max_session_length', 'std_session_length', 'total_songs', 'unique_songs', 'unique_artists', 'avg_song_length', 'total_listen_time', 'std_song_length', 'level_changes', 'paid_ratio', 'activity_trend', 'cancel_page_visits', 'days_active', 'days_since_registration', 'events_per_day', 'songs_per_day', 'is_paid', 'has_downgrade', 'has_upgrade', 'thumbs_up_ratio', 'thumbs_down_ratio', 'playlist_add_ratio', 'error_rate', 'ad_ratio', 'song_repeat_ratio', 'state', 'churn']

Total features: 43

No missing values in features!


In [6]:
# Basic statistics for key features
key_features = ['total_events', 'total_songs', 'total_sessions', 'days_active',
                'thumbs_down_ratio', 'error_rate', 'has_downgrade', 'activity_trend']

print("Key feature statistics:")
train_features[key_features].describe()

Key feature statistics:


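| | total_events | total_songs | total_sessions | days_active | thumbs_down_ratio | error_rate | has_downgrade | activity_trend |
|-------|--------------|-------------|----------------|-------------|-------------------|------------|---------------|----------------|
| count | 19140.0 | 19140.0 | 19140.0 | 19140.0 | 19140.0 | 19140.0 | 19140.0 | 19140.0 |
| mean | 914.296552 | 746.67884 | 10.885998 | 32.14791 | 0.0131 | 0.001019 | 0.636677 | 2.149112 |
| std | 1079.652218 | 898.682491 | 10.654959 | 15.689274 | 0.014867 | 0.002589 | 0.480969 | 16.86929 |
| min | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | -0.99827 |
| 25% | 202.0 | 155.0 | 4.0 | 21.0 | 0.006939 | 0.0 | 0.0 | -0.370469 |
| 50% | 537.5 | 428.0 | 8.0 | 37.0 | 0.010383 | 0.0 | 1.0 | 0.111111 |
| 75% | 1213.0 | 991.0 | 14.0 | 45.0 | 0.015185 | 0.001351 | 1.0 | 1.136364 |
| max | 10998.0 | 9248.0 | 116.0 | 50.0 | 0.5 | 0.111111 | 1.0 | 934.0 |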
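
The `activity_trend` maximum of 934 sits far above the 75th percentile of about 1.14. The formula divides by the number of events in the first half of a user's active window, so a user with very few early events and many late ones gets a large value. A minimal sketch with invented timestamps (the standalone `activity_trend` helper is a hypothetical restatement of `compute_activity_trend` that takes a Series of times):

```python
import pandas as pd

def activity_trend(times):
    """Same formula as compute_activity_trend, applied to a Series of timestamps."""
    if len(times) <= 1:
        return 0
    mid_time = times.min() + (times.max() - times.min()) / 2
    first_half = (times <= mid_time).sum()
    second_half = (times > mid_time).sum()
    return (second_half - first_half) / max(first_half, 1)

# Invented user with one early event and ten events nine days later
skewed = pd.Series(pd.to_datetime(['2018-01-01'] + ['2018-01-10'] * 10))

# Invented user with events spread evenly over four days
even = pd.Series(pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04']))

print(activity_trend(skewed))  # (10 - 1) / 1
print(activity_trend(even))    # (2 - 2) / 2
```

Because the ratio is bounded below near -1 but unbounded above, the distribution is heavily right-skewed; whether to clip or log-transform it is a modelling choice this notebook does not make.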


In [7]:
# Save training features
train_features.to_parquet('train_features.parquet', index=False)
print("Training features saved to train_features.parquet")

Training features saved to train_features.parquet


## Step 1.4: Create Test Dataset

In [8]:
# Create features for test users using vectorized operations
test_users = test['userId'].unique()
print(f"Creating features for {len(test_users):,} test users from {len(test):,} events\n")

start_time = time.time()

test_features = create_user_features_vectorized(test, churn_times_series=None)

elapsed = time.time() - start_time
print(f"\nCompleted in {elapsed:.1f} seconds")
print(f"Test set shape: {test_features.shape}")

Creating features for 2,904 test users from 4,393,179 events

Computing basic aggregations...
Computing page counts...
Computing session statistics...
Computing song features...
Computing subscription features...
Computing temporal features...
Computing cancel page visits...
Combining features...
Computing derived features...
Done! Created 2904 user feature rows with 43 features

Completed in 6.3 seconds
Test set shape: (2904, 44)


In [9]:
# Verify test features match train features (excluding churn column)
train_cols = set(train_features.columns) - {'churn'}
test_cols = set(test_features.columns)

if train_cols == test_cols:
    print("Feature columns match between train and test!")
else:
    print("Column differences:")
    print(f"  In train only: {train_cols - test_cols}")
    print(f"  In test only: {test_cols - train_cols}")

# Check for missing values in test
missing_test = test_features.isnull().sum()
if missing_test.sum() > 0:
    print("\nMissing values in test:")
    print(missing_test[missing_test > 0])
else:
    print("\nNo missing values in test features!")

Feature columns match between train and test!

No missing values in test features!
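
The check above compares column sets, so it does not detect a difference in column order. Order can differ if a page type is absent from one dataset, because missing page columns are appended at the end of `page_counts`. The sketch below uses small stand-in frames (`train_like` and `test_like` are hypothetical) to show one way to reorder test columns to match train.

```python
import pandas as pd

# Stand-in frames with the same feature columns in a different order (values are invented)
train_like = pd.DataFrame({'userId': [1], 'page_help': [2], 'page_error': [0], 'churn': [1]})
test_like = pd.DataFrame({'userId': [9], 'page_error': [1], 'page_help': [0]})

# Take the column order from train, leaving out the label
feature_order = [c for c in train_like.columns if c != 'churn']

# Selecting with a list of names returns the columns in that order
test_aligned = test_like[feature_order]

assert list(test_aligned.columns) == feature_order
```

Whether order matters depends on the downstream model: code that selects features by name is unaffected, while code that converts the frames to arrays is not.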


In [10]:
# Save test features
test_features.to_parquet('test_features.parquet', index=False)
print("Test features saved to test_features.parquet")

Test features saved to test_features.parquet


## Day 1 Summary

### Completed:
- Loaded train (17.5M events) and test (4.4M events) data
- Created user-level churn labels (22% churn rate)
- Built feature engineering function producing 43 features per user
- Created training features for 19,140 users
- Created test features for 2,904 users
- Saved features as parquet files

### Feature Categories:
| Category | Features |
|----------|----------|
| Engagement | total_events, total_songs, total_sessions, avg/max/std_session_length |
| Page counts | page_nextsong, page_thumbs_up/down, page_downgrade, etc. |
| Behavioral ratios | thumbs_up/down_ratio, playlist_add_ratio, error_rate, ad_ratio |
| Temporal | days_active, days_since_registration, events_per_day, activity_trend |
| Subscription | is_paid, level_changes, has_downgrade, has_upgrade, paid_ratio |
| Content | unique_songs/artists, avg_song_length, total_listen_time, song_repeat_ratio |
| Demographics | gender, state |

In [11]:
# Final summary
print("="*50)
print("DAY 1 COMPLETE")
print("="*50)
print(f"\nTraining set: {train_features.shape[0]:,} users, {train_features.shape[1]-2} features")
print(f"Test set: {test_features.shape[0]:,} users, {test_features.shape[1]-1} features")
print(f"\nChurn rate: {train_features['churn'].mean():.2%}")
print(f"\nFiles created:")
print("  - train_features.parquet")
print("  - test_features.parquet")

DAY 1 COMPLETE

Training set: 19,140 users, 43 features
Test set: 2,904 users, 43 features

Churn rate: 22.31%

Files created:
  - train_features.parquet
  - test_features.parquet
