# Player Engagement & Retention Analysis


### Introduction

This notebook documents the cleaning and exploratory analysis of a simulated dataset for a fictional live-service video game. Dashboards and final data visualizations for this project can be found at (LINK HERE).

##### Industry: Video Games/Live-Service Games
##### Company Type: Major Publisher
##### Primary Audience: Product Managers/Game Designers (responsible for content scope, cadence, and post-launch tuning)

##### Core Analytical Question
* How do player engagement and retention metrics change before and after a major content update, and which player segments are most affected?
##### Additional Stakeholder Questions
* Which features or content introduced in the update are most associated with increases or decreases in engagement among different player segments?
* Are there identifiable patterns of player behavior post-update that predict long-term retention or churn for high-value segments?


### Objective

The objective of this analysis is to enable stakeholders (e.g., product managers, live-ops teams, and game designers) to evaluate the impact of a major content update on player engagement and retention and to identify which player segments are positively or negatively affected.

By comparing pre- and post-update engagement scores and D1/D7/D30 retention outcomes across player segments, this analysis should allow stakeholders to:
* Assess whether the content update successfully increased short-term and long-term player retention
* Identify segments at risk of disengagement or churn following the update
* Understand how engagement behaviors (frequency, session depth, and social interaction) relate to post-update retention
* Inform future content, live-ops timing, and targeted interventions aimed at improving player retention and sustained engagement

Ultimately, this analysis is intended to support data-informed decisions about content design, update cadence, and post-update player targeting in a live-service game environment.


### Definitions

##### Engagement
For the sake of this analysis, a player can be considered engaged based on three criteria: session frequency (played at least 1 day per week or 4 days per month), session duration (played for at least 30 minutes per session), and social interaction (participated in at least one multiplayer event per session). I developed the following formulae to determine an "engagement score" for each player:

1. Frequency score $F=\frac{Days\:played\:in\:period}{Target\:days}$

2. Duration score $D=\frac{Average\:session\:duration}{Target\:duration}$

3. Social score $S=\frac{Average\:social\:interactions\:per\:session}{Target\:number}$

4. Engagement score $E=F \times D \times S$

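Frequency, duration, and social scores are weighted equally for the purpose of this analysis. An engagement score $\ge$ 1 indicates a player who, on balance, meets the combined target across the three behavioral dimensions (and can therefore be considered engaged). Because the score is a product, a high value in one dimension can offset a shortfall in another, so $E \ge 1$ does not by itself guarantee that each of $F$, $D$, and $S$ is at least 1.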
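
As a worked illustration of the formulas above, the sketch below computes $F$, $D$, $S$, and $E$ for one hypothetical player over a one-month period. The function name `engagement_score` and the constant names are placeholders introduced here; the target values (4 days per month, 30 minutes per session, 1 social interaction per session) come from the engagement criteria stated above.

```python
# Targets taken from the engagement criteria above (monthly period assumed).
TARGET_DAYS = 4          # days played per month
TARGET_DURATION = 30.0   # minutes per session
TARGET_SOCIAL = 1.0      # social interactions per session


def engagement_score(days_played, avg_duration_min, avg_social):
    """Return (F, D, S, E) for one player over one period."""
    f = days_played / TARGET_DAYS
    d = avg_duration_min / TARGET_DURATION
    s = avg_social / TARGET_SOCIAL
    return f, d, s, f * d * s


# Hypothetical player: 6 days played, 45-minute average sessions,
# and 0.5 social interactions per session on average.
f, d, s, e = engagement_score(6, 45.0, 0.5)
print(f"F={f:.2f}  D={d:.2f}  S={s:.2f}  E={e:.3f}")
```

For this hypothetical player the result is $F = 1.5$, $D = 1.5$, $S = 0.5$, and $E = 1.125$, so the player clears the threshold even though the social score is below target.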

##### Retention
For the sake of this analysis, a player is considered retained if they return to the game and demonstrate continued activity after the reference point (in this case, a major content update). Retention is defined using industry-standard benchmarks with window-based criteria:

D1 Retention: Player logged in and completed at least one gameplay session 1 day after the content update

D7 Retention: Player logged in and completed at least one gameplay session within 7 days of the content update

D30 Retention: Player logged in and completed at least one gameplay session within 30 days of the content update

A “gameplay session” is defined as a session meeting the minimum activity threshold (≥ 30 minutes of playtime), ensuring retention reflects meaningful return behavior rather than a trivial login. These retention metrics are cumulative window-based measures and are used to assess short-, mid-, and long-term player return behavior following the update.
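
The retention definitions above could be computed along the following lines. This is a minimal sketch on a toy session log, not the notebook's implementation. `UPDATE_DATE` is a hypothetical value (the update date is not stated here), and "within 7 days" and "within 30 days" are read as days 1 through 7 and days 1 through 30 after the update, which is an assumption. The column names match the dataset.

```python
import pandas as pd

UPDATE_DATE = pd.Timestamp("2024-06-01")  # hypothetical; not given in the source
MIN_SESSION_MIN = 30                      # minimum playtime for a qualifying session

sessions = pd.DataFrame({
    "player_id": ["P1", "P1", "P2", "P2", "P3"],
    "date": pd.to_datetime(
        ["2024-06-02", "2024-06-20", "2024-06-02", "2024-06-05", "2024-07-15"]
    ),
    "session_duration_min": [45.0, 60.0, 10.0, 35.0, 90.0],
})

# Whole days elapsed between the update and each session.
sessions["days_after"] = (sessions["date"] - UPDATE_DATE).dt.days

# Keep only post-update sessions that meet the minimum activity threshold.
qualifying = sessions[
    (sessions["session_duration_min"] >= MIN_SESSION_MIN)
    & (sessions["days_after"] >= 1)
]

# Day of each player's first qualifying return; NaN if they never returned.
first_return = (
    qualifying.groupby("player_id")["days_after"].min()
    .reindex(sessions["player_id"].unique())
)

retention = pd.DataFrame({
    "d1": first_return.eq(1),    # qualifying session exactly 1 day after
    "d7": first_return.le(7),    # qualifying session on days 1-7
    "d30": first_return.le(30),  # qualifying session on days 1-30
})
print(retention)
```

Because the windows are cumulative, the first qualifying return day is enough to derive all three flags. Players with no qualifying session get `NaN` for `first_return`, which compares as `False` in every window.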


### Assumptions and Limitations

This analysis uses a simulated dataset designed to approximate realistic player behavior in a live-service game. Results are illustrative and demonstrate analytical approach rather than real-world performance.

Engagement is measured using a composite score based on session frequency, duration, and social interaction. These metrics act as behavioral proxies and do not capture qualitative factors such as player sentiment or satisfaction.

Frequency, duration, and social interaction are weighted equally in the engagement score. This assumes comparable impact across dimensions, which may vary by genre or player segment and would require validation in a production environment.

Retention is defined as returning to the game and completing at least one session of ≥ 30 minutes. This prioritizes meaningful engagement but may exclude shorter, intentional player interactions.

Pre- vs. post-update comparisons assume the content update is the primary driver of observed changes. External influences (e.g., marketing efforts, seasonality, competing releases) are not explicitly controlled for, so findings should be interpreted as associative rather than causal.

### Data Loading and Audit

The code block below imports the dataset and displays the first five records to confirm the dataset loaded correctly.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('player_engagement_data.csv')
print(df.head())

  session_id  player_id    game_title        date  session_duration_min  \
0  S00000001  P00010000  Mythic Quest  2024-04-26                 101.0   
1  S00000002  P00010000  Mythic Quest  2024-05-01                 145.0   
2  S00000003  P00010000  Mythic Quest  2024-05-25                   NaN   
3  S00000004  P00010000  Mythic Quest  2024-05-27                  27.0   
4  S00000005  P00010000  Mythic Quest  2024-06-03                 134.0   

   in_game_purchases_usd level platform region player_type  \
0                    NaN     9     Xbox    NaN         NaN   
1                    0.0    10     Xbox    NaN         NaN   
2                    0.0    13     Xbox    NaN         NaN   
3                    0.0    15     Xbox    NaN         NaN   
4                    NaN    15     Xbox    NaN         NaN   

  account_age_category  achievement_count  social_interactions churn_flag  
0                  new               19.0                 31.0          0  
1                  new  


Next, I print each column's datatype to get a feel for the information in the dataset and to see whether the columns are typed correctly.

In [2]:
print("\n--- Datatypes ---")
print(df.dtypes)


--- Datatypes ---
session_id                object
player_id                 object
game_title                object
date                      object
session_duration_min     float64
in_game_purchases_usd    float64
level                     object
platform                  object
region                    object
player_type               object
account_age_category      object
achievement_count        float64
social_interactions      float64
churn_flag                object
dtype: object


Right away, I can see that the 'level' column is typed as object even though its values look numeric. Player level represents an ordinal progression rather than a continuous quantitative measure, so I keep it as an object datatype; the cast below makes that choice explicit.

In [3]:
df['level'] = df['level'].astype(object)

I also suspect that the 'churn_flag' column should be a boolean. After checking its unique values, I map "No" to 0 and "Yes" to 1, coerce the result to numeric, and convert the column to pandas' nullable boolean datatype so that any missing values stay missing rather than being forced to True or False.

In [4]:
df['churn_flag'] = (
    df['churn_flag']
      .replace({'Yes': 1, 'No': 0})
      .pipe(pd.to_numeric, errors='coerce')
      .astype('boolean')
)
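
The unique-value check mentioned above is not shown in the cell. A minimal sketch of that kind of audit follows, using a small hypothetical Series, since the real column's distinct values do not appear in the output above. It uses `Series.map` with an explicit label dictionary (`label_map`, a name introduced here) rather than the `replace` and `to_numeric` chain, so that any label outside the dictionary becomes missing and can be listed.

```python
import pandas as pd

# Hypothetical sample values; not taken from the real dataset.
raw = pd.Series(["0", "1", "Yes", "No", None, "maybe"],
                dtype=object, name="churn_flag")

# Inspect distinct values (including missing) before mapping anything.
print(raw.value_counts(dropna=False))

# Explicit mapping: labels not listed here become NaN.
label_map = {"Yes": 1, "No": 0, "1": 1, "0": 0}
flag = raw.map(label_map).astype("boolean")  # nullable boolean keeps <NA>

# Labels present in the raw data that the mapping did not cover.
unmapped = raw[raw.notna() & flag.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

Listing the unmapped labels makes silent coercion visible: an unexpected spelling such as "maybe" is reported rather than quietly turned into a missing value.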

The last datatype I want to update is for the 'date' column. I convert it to a datetime type.

In [5]:
df['date'] = pd.to_datetime(df['date'])

I check my work and make sure that all datatypes are now correct.

In [6]:
print("\n--- Datatypes R1 ---")
print(df.dtypes)


--- Datatypes R1 ---
session_id                       object
player_id                        object
game_title                       object
date                     datetime64[ns]
session_duration_min            float64
in_game_purchases_usd           float64
level                            object
platform                         object
region                           object
player_type                      object
account_age_category             object
achievement_count               float64
social_interactions             float64
churn_flag                      boolean
dtype: object


Now that my data is typed correctly, I print a summary of the dataset so I can look for anomalies.

In [7]:
print("\n--- Statistical Summary (All Columns) ---")
print(df.describe(include='all'))


--- Statistical Summary (All Columns) ---
       session_id  player_id    game_title                           date  \
count       45049      45049         45049                          45049   
unique      45049       9653             1                            NaN   
top     S00000001  P00019999  Mythic Quest                            NaN   
freq            1          8         45049                            NaN   
mean          NaN        NaN           NaN  2024-06-12 16:55:39.723412224   
min           NaN        NaN           NaN            2024-04-16 00:00:00   
25%           NaN        NaN           NaN            2024-05-13 00:00:00   
50%           NaN        NaN           NaN            2024-06-12 00:00:00   
75%           NaN        NaN           NaN            2024-07-13 00:00:00   
max           NaN        NaN           NaN            2024-08-14 00:00:00   
std           NaN        NaN           NaN                            NaN   

        session_duration_min  in

Before moving on to the data cleaning phase, I note the unusual minimum value for the 'achievement_count' column so I can investigate.

### Data Cleaning and Preparation

I start with column-by-column checks to make sure spelling, capitalization, and whitespace are all used consistently. I also check column-specific parameters, as shown below. The following code block checks that every value in 'session_id' is formatted correctly and there are no missing sessions.

In [10]:
import re
from typing import Dict, Any

def validate_id_format(df: pd.DataFrame, col: str, prefix: str = None, digits: int = 8) -> Dict[str, Any]:
    """
    Validate ID format for a column.
    - Expected format: '<PREFIX>' followed by exactly `digits` digits (default 8), e.g. 'S00000001' or 'P00000001'.
    - If `prefix` is None, infer from the first non-null value's first character (uppercased).
    Returns a dict with:
      - 'prefix': inferred/used prefix (single char)
      - 'pattern': the regex used
      - 'valid_mask': boolean Series (True where format matched)
      - 'digits_str': Series of captured digit strings (NaN where not matched)
      - 'digits_int': Series of ints for valid rows (index aligned, dtype Int64 for nullable ints)
      - 'invalid_rows': DataFrame of rows that did not match (includes original NaNs)
    """
    if col not in df.columns:
        raise KeyError(f"Column '{col}' not in dataframe")

    orig = df[col]
    sample = orig.dropna().astype(str).str.strip()
    if sample.empty:
        # nothing to validate
        return {
            'prefix': None,
            'pattern': None,
            'valid_mask': pd.Series([False] * len(df), index=df.index),
            'digits_str': pd.Series([pd.NA] * len(df), index=df.index, dtype="object"),
            'digits_int': pd.Series([pd.NA] * len(df), index=df.index, dtype="Int64"),
            'invalid_rows': df.copy()
        }

    used_prefix = (prefix.upper() if prefix is not None else sample.iloc[0][0].upper())
    pattern = rf'^{re.escape(used_prefix)}(\d{{{digits}}})$'

    s = orig.astype(str).str.strip()
    original_na_mask = orig.isna()

    digits_str = s.str.extract(pattern)[0]    # captured digits or NaN
    valid_mask = (~digits_str.isna()) & (~original_na_mask)

    # ints for valid rows, use pandas nullable Int64 so missingness is preserved if needed
    digits_int = pd.Series(pd.NA, index=df.index, dtype="Int64")
    if valid_mask.any():
        digits_int.loc[valid_mask] = digits_str[valid_mask].astype(int)

    invalid_rows = df[~valid_mask].copy()

    return {
        'prefix': used_prefix,
        'pattern': pattern,
        'valid_mask': valid_mask,
        'digits_str': digits_str,
        'digits_int': digits_int,
        'invalid_rows': invalid_rows
    }


def assess_sequence(digits_int: pd.Series, df: pd.DataFrame = None, prefix: str = "S", digits: int = 8) -> Dict[str, Any]:
    """
    Given a Series of integer ids (aligned to the original dataframe index) for VALID rows,
    check duplicates and missingness in the contiguous range min..max.

    Parameters:
      - digits_int: pd.Series of ints (nullable Int64) containing only valid numeric ids or pd.NA for invalid rows.
      - df: optional original DataFrame. If provided, duplicate_rows will be returned as rows from this df.
      - prefix: prefix to use when formatting missing ids (single char)
      - digits: number of digits for zero-padding when formatting missing ids

    Returns dict with:
      - 'duplicate_rows': DataFrame (if df provided) containing rows involved in duplicates; else list of duplicated numeric values
      - 'duplicate_values': list of numeric values that are duplicated (empty if none)
      - 'missing_ids': list of formatted missing ids (e.g. 'S00000005')
      - 'summary': dict with counts and min/max
    """
    # Filter to present numeric values
    present = digits_int.dropna().astype(int)
    if present.empty:
        summary = {
            'valid_count': 0,
            'duplicate_count': 0,
            'min_numeric_id': None,
            'max_numeric_id': None,
            'missing_count': None
        }
        return {
            'duplicate_rows': (df.iloc[0:0].copy() if df is not None else []),
            'duplicate_values': [],
            'missing_ids': [],
            'summary': summary
        }

    # Find duplicate numeric values
    dup_mask = present.duplicated(keep=False)
    duplicate_values = sorted(present[dup_mask].unique().tolist())

    if df is not None:
        # select rows in original df corresponding to duplicated numeric ids
        duplicate_rows = df.loc[present.index[dup_mask]].copy()
    else:
        duplicate_rows = duplicate_values  # fallback: list of duplicated numeric ids

    # Check missing numbers in the contiguous range min..max
    mn = int(present.min())
    mx = int(present.max())
    full_set = set(range(mn, mx + 1))
    present_set = set(present.tolist())
    missing_nums = sorted(full_set - present_set)
    missing_ids = [f"{prefix}{n:0{digits}d}" for n in missing_nums]

    summary = {
        'valid_count': int(present.size),
        'duplicate_count': int(duplicate_rows.shape[0]) if isinstance(duplicate_rows, pd.DataFrame) else len(duplicate_values),
        'min_numeric_id': mn,
        'max_numeric_id': mx,
        'missing_count': len(missing_ids)
    }

    return {
        'duplicate_rows': duplicate_rows,
        'duplicate_values': duplicate_values,
        'missing_ids': missing_ids,
        'summary': summary
    }

# Run the format check once and reuse the result below
fmt_res = validate_id_format(df, 'session_id')
print("Pattern used:", fmt_res['pattern'])
print("Invalid rows (first 10):")
display(fmt_res['invalid_rows'].head(10))
seq_res = assess_sequence(fmt_res['digits_int'], df=df, prefix=fmt_res['prefix'])
print("Sequence summary:", seq_res['summary'])
print("First 20 missing ids:", seq_res['missing_ids'][:20])
display(seq_res['duplicate_rows'].head(10))

Pattern used: ^S(\d{8})$
Invalid rows (first 10):


Unnamed: 0,session_id,player_id,game_title,date,session_duration_min,in_game_purchases_usd,level,platform,region,player_type,account_age_category,achievement_count,social_interactions,churn_flag


Sequence summary: {'valid_count': 45049, 'duplicate_count': 0, 'min_numeric_id': 1, 'max_numeric_id': 45049, 'missing_count': 0}
First 20 missing ids: []


Unnamed: 0,session_id,player_id,game_title,date,session_duration_min,in_game_purchases_usd,level,platform,region,player_type,account_age_category,achievement_count,social_interactions,churn_flag


'session_id' appears to be formatted correctly, and I can see from my summary above that there are no duplicate session IDs in this dataset, so I move on to the next column. For 'player_id', I just want to make sure that every ID is formatted correctly.

In [20]:
validate_id_format(df, 'player_id')

{'prefix': 'P',
 'pattern': '^P(\\d{8})$',
 'valid_mask': 0        True
 1        True
 2        True
 3        True
 4        True
          ... 
 45044    True
 45045    True
 45046    True
 45047    True
 45048    True
 Length: 45049, dtype: bool,
 'digits_str': 0        00010000
 1        00010000
 2        00010000
 3        00010000
 4        00010000
            ...   
 45044    00019999
 45045    00019999
 45046    00019999
 45047    00019999
 45048    00019999
 Name: 0, Length: 45049, dtype: object,
 'digits_int': 0        10000
 1        10000
 2        10000
 3        10000
 4        10000
          ...  
 45044    19999
 45045    19999
 45046    19999
 45047    19999
 45048    19999
 Length: 45049, dtype: Int64,
 'invalid_rows': Empty DataFrame
 Columns: [session_id, player_id, game_title, date, session_duration_min, in_game_purchases_usd, level, platform, region, player_type, account_age_category, achievement_count, social_interactions, churn_flag]
 Index: []}

Next I check the 'date' column for any null (NaT) values.

In [21]:
num_missing_date = df['date'].isna().sum()
print(f"date NaT count: {num_missing_date}")

date NaT count: 0


Then I check 'session_duration_min' for missing values.

In [23]:
num_missing_duration = df['session_duration_min'].isna().sum()
print(f"session duration NaN count: {num_missing_duration}")

session duration NaN count: 2231


There are 2,231 missing values in 'session_duration_min'. Further research is required to determine why these values are missing. I start with a quick sanity check.

In [24]:
# basic counts + percent
n_total = len(df)
n_missing = df['session_duration_min'].isna().sum()
print(n_missing, "missing of", n_total, f"({n_missing/n_total:.1%})")

# peek at some missing rows
df[df['session_duration_min'].isna()].sample(10)

2231 missing of 45049 (5.0%)


Unnamed: 0,session_id,player_id,game_title,date,session_duration_min,in_game_purchases_usd,level,platform,region,player_type,account_age_category,achievement_count,social_interactions,churn_flag
28656,S00028657,P00016336,Mythic Quest,2024-06-07,,13.55,31,PlayStation,ASIA,,New,36.0,13.0,False
9947,S00009948,P00012209,Mythic Quest,2024-04-18,,,21,PLAYSTATION,,Casual,new,20.0,31.0,False
13679,S00013680,P00013041,Mythic Quest,2024-07-22,,,24,,ASIA,Premium,New,43.0,,False
25760,S00025761,P00015700,Mythic Quest,2024-06-02,,0.0,6,pc,OCE,Premium,new,16.0,6.0,False
9341,S00009342,P00012087,Mythic Quest,2024-06-01,,0.0,54,,,free,veteran,26.0,31.0,False
21851,S00021852,P00014827,Mythic Quest,2024-04-19,,,30,PlayStation,SA,free,Veteran,2.0,28.0,False
22489,S00022490,P00014983,Mythic Quest,2024-06-26,,100.97,36,pc,,Whale,Veteran,25.0,11.0,False
13274,S00013275,P00012962,Mythic Quest,2024-05-21,,0.0,14,PLAYSTATION,,Free,New,26.0,11.0,False
44129,S00044130,P00019795,Mythic Quest,2024-07-31,,,36,Xbox,EU,Whale,,34.0,56.0,False
13808,S00013809,P00013070,Mythic Quest,2024-06-23,,0.0,48,pc,ASIA,Premium,New,49.0,6.0,False


The preview does not make the reason for missingness immediately clear, so I check to see if other related columns are missing for the same records.

In [31]:
cols = ['date', 'platform','region', 'player_type', 'account_age_category', 'social_interactions']
df_missing = df[df['session_duration_min'].isna()]
df_missing[cols].info()
df_missing[cols].head(20)

<class 'pandas.core.frame.DataFrame'>
Index: 2231 entries, 2 to 45038
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   date                  2231 non-null   datetime64[ns]
 1   platform              1582 non-null   object        
 2   region                1409 non-null   object        
 3   player_type           1892 non-null   object        
 4   account_age_category  1847 non-null   object        
 5   social_interactions   2090 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 87.1+ KB


Unnamed: 0,date,platform,region,player_type,account_age_category,social_interactions
2,2024-05-25,Xbox,,,new,45.0
46,2024-04-24,,,,,25.0
52,2024-05-20,PlayStation,na,Whale,new,
90,2024-08-14,PLAYSTATION,na,Free,New,56.0
112,2024-07-08,,ASIA,Premium,Intermediate,54.0
113,2024-04-20,PlayStation,SA,PREMIUM,,14.0
119,2024-08-04,PlayStation,SA,PREMIUM,,47.0
145,2024-05-06,pc,ASIA,Free,New,2.0
151,2024-08-01,PlayStation,na,Free,Veteran,13.0
155,2024-07-16,Xbox,na,Free,,5.0


About 37% of the records with a missing session duration (822 of 2,231) also have a missing region, but more research would be required to determine if the two are correlated. For now, I move on and compare value frequencies for the rows with missing session durations against the dataset overall.

In [33]:
# compare value frequencies: missing-duration rows vs. the full dataset
for c in cols:
    print("==", c, "==")
    print("missing rows:")
    print(df_missing[c].value_counts(dropna=False).head(10))
    print("overall:")
    print(df[c].value_counts(dropna=False).head(10))
    print()

== date ==
missing rows:
date
2024-04-29    31
2024-06-28    29
2024-04-26    28
2024-06-21    28
2024-06-10    28
2024-08-08    26
2024-07-06    26
2024-04-17    25
2024-07-02    25
2024-05-06    25
Name: count, dtype: int64
overall:
date
2024-05-03    451
2024-04-23    436
2024-06-06    435
2024-04-18    431
2024-05-02    420
2024-05-10    420
2024-04-25    418
2024-04-19    417
2024-05-04    415
2024-05-14    415
Name: count, dtype: int64

== platform ==
missing rows:
platform
NaN            649
pc             334
PC             325
PLAYSTATION    318
PlayStation    312
Xbox           293
Name: count, dtype: int64
overall:
platform
NaN            12858
pc              6686
PLAYSTATION     6468
PlayStation     6442
PC              6305
Xbox            6290
Name: count, dtype: int64

== region ==
missing rows:
region
NaN     822
SA      301
ASIA    300
EU      273
OCE     273
na      262
Name: count, dtype: int64
overall:
region
NaN     17143
ASIA     5835
OCE      5671
SA       5528


I can see that missing durations are spread across many dates rather than concentrated on a single outage day. That suggests missingness is not caused by a single-day instrumentation failure.
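
Raw counts per date also reflect how many sessions were played that day, so a per-day missing rate is easier to read. The sketch below shows one way to compute it on a toy frame. The data are hypothetical and only the column names match the dataset.

```python
import pandas as pd

# Hypothetical two-day sample with one missing duration per day.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01"] * 4 + ["2024-05-02"] * 4),
    "session_duration_min": [30.0, None, 45.0, 60.0, None, 20.0, 50.0, 75.0],
})

# Share of sessions with a missing duration on each date.
daily_missing_rate = (
    toy["session_duration_min"].isna()
       .groupby(toy["date"])
       .mean()
)
print(daily_missing_rate)
```

A single-day outage would show up as one date with a rate far above the rest, whereas a roughly flat rate across dates supports the reading above.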

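Many of the records missing session durations are also missing other values (e.g. platform is NaN for 649 of the 2,231 rows in the missing-duration subset, about 29%). However, platform is NaN at nearly the same rate in the full dataset (12,858 of 45,049 rows, also about 29%), and region shows the same pattern (about 37% in the subset versus 38% overall). Missing metadata is therefore common throughout the dataset rather than concentrated in rows with missing durations, which is more consistent with general upstream log loss or incomplete events than with a fault specific to duration tracking. In a real-world scenario, I would flag this to the appropriate team or investigate outside the dataset as needed.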
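
To judge whether missing durations co-occur with missing metadata, it helps to compare missing rates rather than counts. The helper below is a minimal sketch of that comparison. `compare_missing_rates` and the toy data are illustrative names and values introduced here, not part of the notebook.

```python
import pandas as pd


def compare_missing_rates(df, target, cols):
    """Share of NaN in each of `cols`: rows where `target` is missing vs. all rows."""
    subset = df[df[target].isna()]
    return pd.DataFrame({
        "rate_when_target_missing": subset[cols].isna().mean(),
        "rate_overall": df[cols].isna().mean(),
    })


# Hypothetical sample; only the column names match the dataset.
toy = pd.DataFrame({
    "session_duration_min": [None, None, 30.0, 45.0, 60.0, 20.0, 15.0, 90.0],
    "platform": [None, "PC", None, "Xbox", "PC", None, "Xbox", None],
    "region": ["EU", "SA", None, "EU", "SA", "EU", None, "SA"],
})

rates = compare_missing_rates(toy, "session_duration_min", ["platform", "region"])
print(rates)
```

If the two columns of the result are close for a given field, missingness in that field is probably not tied to missing durations. A large gap would be worth testing formally, for example with a contingency table.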