# Player Engagement & Retention Analysis


### Introduction

This notebook documents the cleaning and exploratory analysis of a simulated dataset for a fictional live-service video game. Dashboards and final data visualization for this project can be found at (LINK HERE).

##### Industry: Video Games/Live-Service Games
##### Company Type: Major Publisher
##### Primary Audience: Product Managers/Game Designers (responsible for content scope, cadence, and post-launch tuning)

##### Core Analytical Question
* How do player engagement and retention metrics change before and after a major content update, and which player segments are most affected?
##### Additional Stakeholder Questions
* Which features or content introduced in the update are most associated with increases or decreases in engagement among different player segments?
* Are there identifiable patterns of player behavior post-update that predict long-term retention or churn for high-value segments?


### Objective

The objective of this analysis is to enable stakeholders (e.g., product managers, live-ops teams, and game designers) to evaluate the impact of a major content update on player engagement and retention and to identify which player segments are positively or negatively affected.

By comparing pre- and post-update engagement scores and D1/D7/D30 retention outcomes across player segments, this analysis should allow stakeholders to:
* Assess whether the content update successfully increased short-term and long-term player retention
* Identify segments at risk of disengagement or churn following the update
* Understand how engagement behaviors (frequency, session depth, and social interaction) relate to post-update retention
* Inform future content, live-ops timing, and targeted interventions aimed at improving player retention and sustained engagement

Ultimately, this analysis is intended to support data-informed decisions about content design, update cadence, and post-update player targeting in a live-service game environment.


### Definitions

##### Engagement
A player can be considered engaged based on three criteria: session frequency (played at least 3 days per week or 10 days per month), session duration (played for at least 30 minutes per session), and social interaction (participated in at least one multiplayer event per session). I developed the following formulae to determine an "engagement score" for each player:

1. Frequency score $F=\frac{Days\:played\:in\:period}{Target\:days}$

2. Duration score $D=\frac{Average\:session\:duration}{Target\:duration}$

3. Social score $S=\frac{Average\:social\:interactions\:per\:session}{Target\:number}$

4. Engagement score $E=F \times D \times S$

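Frequency, duration, and social scores are weighted equally for the purpose of this analysis. A player who exactly meets all three targets scores $E = 1$, so an engagement score $\ge$ 1 is treated as the threshold for meaningful engagement (the player is considered engaged). Because $E$ is a product, a surplus on one dimension can compensate for a shortfall on another, so $E \ge 1$ does not by itself guarantee that each individual target was met.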
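The scoring formulas above can be sketched in pandas as follows. This is a minimal illustration rather than the project's final implementation: the function name `engagement_scores`, its default targets (10 days, 30 minutes, 1 social interaction per session), and the toy data are assumptions. The column names match the dataset loaded later in this notebook.

```python
import pandas as pd

def engagement_scores(sessions: pd.DataFrame,
                      target_days: int = 10,
                      target_minutes: float = 30.0,
                      target_social: float = 1.0) -> pd.DataFrame:
    """Compute F, D, S and E per player for a single analysis period.

    Assumes `sessions` holds one row per session, that `date` is already a
    datetime column, and that all rows fall inside the period of interest.
    """
    # Count distinct calendar days, not sessions
    with_day = sessions.assign(day=sessions["date"].dt.normalize())
    per_player = with_day.groupby("player_id").agg(
        days_played=("day", "nunique"),
        avg_duration=("session_duration_min", "mean"),   # mean() skips NaN
        avg_social=("social_interactions", "mean"),
    )
    scores = pd.DataFrame({
        "F": per_player["days_played"] / target_days,
        "D": per_player["avg_duration"] / target_minutes,
        "S": per_player["avg_social"] / target_social,
    })
    scores["E"] = scores["F"] * scores["D"] * scores["S"]
    scores["engaged"] = scores["E"] >= 1
    return scores

# Toy data (hypothetical, not taken from the project dataset)
toy = pd.DataFrame({
    "player_id": ["P1", "P1", "P2"],
    "date": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-01"]),
    "session_duration_min": [60.0, 60.0, 15.0],
    "social_interactions": [5.0, 5.0, 1.0],
})
toy_scores = engagement_scores(toy, target_days=2)
print(toy_scores)
```

In the toy example, P1 plays on both target days with long, social sessions and is flagged as engaged, while P2 falls short on frequency and duration and is not.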

##### Retention
A player is considered retained if they return to the game and demonstrate continued activity after the reference point (in this case, a major content update). Retention is defined using industry-standard benchmarks with window-based criteria:

D1 Retention: Player logged in and completed at least one gameplay session 1 day after the content update

D7 Retention: Player logged in and completed at least one gameplay session within 7 days of the content update

D30 Retention: Player logged in and completed at least one gameplay session within 30 days of the content update

A “gameplay session” is defined as a session meeting the minimum activity threshold (≥ 30 minutes of playtime), so that retention reflects meaningful return behavior rather than a trivial login. D1 looks at the single day after the update, while D7 and D30 are cumulative windows; together they are used to assess short-, mid-, and long-term player return behavior following the update.
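The retention definitions above could be turned into per-player flags roughly as follows. This is a sketch under stated assumptions: the function name `retention_flags` and the update date `2024-06-01` are hypothetical, the D7 and D30 windows are assumed to start on the day after the update, and every player appearing in the input is treated as part of the cohort (in practice the cohort would likely be limited to players active before the update).

```python
import pandas as pd

def retention_flags(sessions: pd.DataFrame,
                    update_date,
                    min_minutes: float = 30.0) -> pd.DataFrame:
    """Return one row per player with D1/D7/D30 retention flags."""
    update_day = pd.Timestamp(update_date).normalize()

    # Only sessions meeting the activity threshold count as a return;
    # rows with a missing duration fail the comparison and are dropped.
    qualifying = sessions[sessions["session_duration_min"] >= min_minutes]
    days_after = (qualifying["date"].dt.normalize() - update_day).dt.days

    # Assumption: the cohort is every player present in the input
    players = pd.Index(sessions["player_id"].unique(), name="player_id")

    def returned_between(first_day: int, last_day: int):
        in_window = days_after.between(first_day, last_day)  # inclusive
        returned_ids = qualifying.loc[in_window, "player_id"].unique()
        return players.isin(returned_ids)

    return pd.DataFrame({
        "d1_retained": returned_between(1, 1),
        "d7_retained": returned_between(1, 7),
        "d30_retained": returned_between(1, 30),
    }, index=players)

# Toy data (hypothetical); the update date is an assumed placeholder
toy = pd.DataFrame({
    "player_id": ["P1", "P1", "P2", "P3"],
    "date": pd.to_datetime(["2024-06-01", "2024-06-02",
                            "2024-06-05", "2024-06-02"]),
    "session_duration_min": [45.0, 40.0, 35.0, 10.0],
})
flags = retention_flags(toy, update_date="2024-06-01")
print(flags)
print(flags.mean())  # share of the cohort retained in each window
```

Here P3 logs in the day after the update but plays for only 10 minutes, so the session does not count toward any retention window.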


### Assumptions and Limitations

This analysis uses a simulated dataset designed to approximate realistic player behavior in a live-service game. Results are illustrative and demonstrate the analytical approach rather than real-world performance.

Engagement is measured using a composite score based on session frequency, duration, and social interaction. These metrics act as behavioral proxies and do not capture qualitative factors such as player sentiment or satisfaction.

Frequency, duration, and social interaction are weighted equally in the engagement score. This assumes comparable impact across dimensions, which may vary by genre or player segment and would require validation in a production environment.

Retention is defined as returning to the game and completing at least one session of ≥ 30 minutes. This prioritizes meaningful engagement but may exclude shorter, intentional player interactions.

Pre- vs. post-update comparisons assume the content update is the primary driver of observed changes. External influences (e.g., marketing efforts, seasonality, competing releases) are not explicitly controlled for, so findings should be interpreted as associative rather than causal.

### Data Loading and Audit

The code block below imports the dataset and displays the first five records to confirm the dataset loaded correctly.

In [3]:
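import pandas as pd
import numpy as np

# Load the session-level dataset (one row per session)
df = pd.read_csv('player_engagement_data.csv')
# Preview the first five records to confirm the load worked
print(df.head())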

  session_id  player_id    game_title        date  session_duration_min  \
0  S00000001  P00010000  Mythic Quest  2024-04-26                 101.0   
1  S00000002  P00010000  Mythic Quest  2024-05-01                 145.0   
2  S00000003  P00010000  Mythic Quest  2024-05-25                   NaN   
3  S00000004  P00010000  Mythic Quest  2024-05-27                  27.0   
4  S00000005  P00010000  Mythic Quest  2024-06-03                 134.0   

   in_game_purchases_usd level platform region player_type  \
0                    NaN     9     Xbox    NaN         NaN   
1                    0.0    10     Xbox    NaN         NaN   
2                    0.0    13     Xbox    NaN         NaN   
3                    0.0    15     Xbox    NaN         NaN   
4                    NaN    15     Xbox    NaN         NaN   

  account_age_category  achievement_count  social_interactions churn_flag  
0                  new               19.0                 31.0          0  
1                  new  


Next, I print each column's datatype to get a feel for the dataset's contents and to check whether the columns are typed correctly.

In [19]:
print("\n--- Datatypes ---")
print(df.dtypes)


--- Datatypes ---
session_id                       object
player_id                        object
game_title                       object
date                     datetime64[ns]
session_duration_min            float64
in_game_purchases_usd           float64
level                            object
platform                         object
region                           object
player_type                      object
account_age_category             object
achievement_count               float64
social_interactions             float64
churn_flag                         bool
dtype: object


Right away, I can see that the "level" column is mistyped. Although player level is stored numerically, it represents an ordinal progression rather than a continuous quantitative measure. So I convert the "level" column to an "object" datatype.

In [20]:
df['level'] = df['level'].astype(object)
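If the ordering of levels needs to survive later sorting or comparisons, one alternative to a plain object column is an ordered categorical. The sketch below runs on a small toy series rather than the project data, and the variable names are illustrative assumptions; it is not the approach used in the cell above.

```python
import pandas as pd

# Toy level values (hypothetical)
levels = pd.Series([9, 10, 13, 15, 15], name="level")

# Build an ordered categorical dtype from the observed levels
ordered_categories = sorted(levels.dropna().unique())
level_dtype = pd.CategoricalDtype(categories=ordered_categories, ordered=True)
levels_ordinal = levels.astype(level_dtype)

print(levels_ordinal.dtype)
# Order-aware operations still work, but arithmetic such as mean() does not
print(levels_ordinal.max())
print((levels_ordinal >= 13).sum())
```

An ordered categorical keeps the "progression" meaning of the column while preventing it from being averaged like a continuous measure.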

I also suspect that the "churn_flag" column should be a boolean. I check the unique values in the "churn_flag" column, convert the "No"s to 0 and the "Yes"s to 1, and convert the column's datatype to bool.

In [21]:
# Inspect the raw values before converting
print(df['churn_flag'].unique())
# Map the text labels to 1/0, then cast to numeric and finally to bool
df['churn_flag'] = df['churn_flag'].replace({'Yes': 1, 'No': 0})
df['churn_flag'] = pd.to_numeric(df['churn_flag'])
df['churn_flag'] = df['churn_flag'].astype(bool)
# Confirm that only boolean values remain
print(df['churn_flag'].unique())

[False  True]
[False  True]
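One caveat with casting to `bool`: `astype(bool)` turns missing values (`NaN`) into `True`, which would silently mark those players as churned. If the column can contain missing values, a variant that keeps them missing may be safer. The sketch below is an assumption-laden alternative, not the conversion used above: the helper name `normalize_flag`, the set of accepted labels, and the toy data are all hypothetical.

```python
import pandas as pd

def normalize_flag(values: pd.Series) -> pd.Series:
    """Map Yes/No and 1/0 encodings to pandas' nullable boolean dtype."""
    mapping = {"Yes": True, "No": False,
               "1": True, "0": False,
               1: True, 0: False}
    # Values not in the mapping (including missing ones) become <NA>
    return values.map(mapping).astype("boolean")

# Toy column mixing encodings and a missing value (hypothetical)
raw_flags = pd.Series(["Yes", "No", 1, 0, None], dtype=object)
clean_flags = normalize_flag(raw_flags)
print(clean_flags)
print(clean_flags.isna().sum())  # number of flags that could not be mapped
```

Counting the unmapped values makes unexpected labels visible instead of coercing them to `True`.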


The last datatype I want to update is for the 'date' column. I convert it to a datetime type.

In [22]:
df['date'] = pd.to_datetime(df['date'])

I check my work and make sure that all datatypes are now correct.

In [23]:
print("\n--- Datatypes R1 ---")
print(df.dtypes)


--- Datatypes R1 ---
session_id                       object
player_id                        object
game_title                       object
date                     datetime64[ns]
session_duration_min            float64
in_game_purchases_usd           float64
level                            object
platform                         object
region                           object
player_type                      object
account_age_category             object
achievement_count               float64
social_interactions             float64
churn_flag                         bool
dtype: object


Now that my data is typed correctly, I print a summary of the dataset so I can look for anomalies.

In [24]:
print("\n--- Statistical Summary (All Columns) ---")
print(df.describe(include='all'))


--- Statistical Summary (All Columns) ---
       session_id  player_id    game_title                           date  \
count       45049      45049         45049                          45049   
unique      45049       9653             1                            NaN   
top     S00000001  P00019999  Mythic Quest                            NaN   
freq            1          8         45049                            NaN   
mean          NaN        NaN           NaN  2024-06-12 16:55:39.723412224   
min           NaN        NaN           NaN            2024-04-16 00:00:00   
25%           NaN        NaN           NaN            2024-05-13 00:00:00   
50%           NaN        NaN           NaN            2024-06-12 00:00:00   
75%           NaN        NaN           NaN            2024-07-13 00:00:00   
max           NaN        NaN           NaN            2024-08-14 00:00:00   
std           NaN        NaN           NaN                            NaN   

        session_duration_min  in

Before moving on to the data cleaning phase, I note the unusual minimum value for the 'achievement_count' column so I can investigate it later. I also note that every record has a unique session ID (45,049 unique values across 45,049 rows), so no session ID appears more than once.
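Unique session IDs rule out repeated IDs, but the same session could in principle be logged twice under different IDs. A quick way to check for that is to look for duplicates across every column except the ID. The helper name `audit_duplicates` and the toy data below are assumptions for illustration only.

```python
import pandas as pd

def audit_duplicates(frame: pd.DataFrame, id_column: str = "session_id") -> dict:
    """Count repeated IDs and rows that are identical apart from the ID."""
    content_columns = [c for c in frame.columns if c != id_column]
    return {
        "duplicate_ids": int(frame[id_column].duplicated().sum()),
        "duplicate_content_rows": int(
            frame.duplicated(subset=content_columns).sum()
        ),
    }

# Toy data (hypothetical): S2 repeats S1 under a different ID
toy = pd.DataFrame({
    "session_id": ["S1", "S2", "S3"],
    "player_id": ["P1", "P1", "P2"],
    "date": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-03"]),
    "session_duration_min": [30.0, 30.0, 45.0],
})
report = audit_duplicates(toy)
print(report)
```

A non-zero `duplicate_content_rows` count does not prove an error (a player can have two similar sessions on one day), but it flags rows worth inspecting.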

### Data Cleaning and Preparation

I start with column-by-column checks to make sure spelling, capitalization, and whitespace are consistent.