### Introduction to the Game
+ The goal of this game is to align **three** identical cookies to feed a cat and complete each level.
+ In addition to collecting credits, players may earn keys that unlock gates at specific levels.


### Data Description
The data is from 90,189 players that installed the game while the AB-test was running. The variables are:
+ `userid` - a unique number that identifies each player.
+ `version` - whether the player was put in the control group (gate_30 - a gate at level 30) or the test group (gate_40 - a gate at level 40).
+ `sum_gamerounds` - the number of game rounds played by the player during the first 14 days after installation
+ `retention_1` - did the player come back and play 1 day after installing?(bool)
+ `retention_7` - did the player come back and play 7 days after installing?(bool) <br>
When a player installed the game, he or she was randomly assigned to either *gate_30* or *gate_40*. <br><br>
**Note**: An important fact to keep in mind is that in the game industry `retention_1` is a crucial metric, since it indicates whether the game generates a first engagement after the player's first log-in.

### Hypothesis
Tactile Entertainment is planning to move Cookie Cats' time gates from level 30 to level 40, but they don’t know how much this decision will affect user retention. <br><br>
A decision like this can affect not only user retention but also the expected revenue, which is why I am setting the initial hypotheses as:

+ Moving the Time Gate from Level 30 to Level 40 will decrease our user retention.
+ Moving the Time Gate from Level 30 to Level 40 will increase our user retention.


#### Data Preprocessing 

In [27]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

df = pd.read_csv('cookie_cats.csv')
df.head()

Unnamed: 0,userid,version,sum_gamerounds,retention_1,retention_7
0,116,gate_30,3,False,False
1,337,gate_30,38,True,False
2,377,gate_40,165,True,False
3,483,gate_40,1,False,False
4,488,gate_40,179,True,True


In [28]:
# Compare the number of unique players with the number of records
print("Number of players: \n", df.userid.nunique(), '\n',
        "Number of records: \n", len(df.userid),'\n')

Number of players: 
 90189 
 Number of records: 
 90189 



The number of unique user IDs matches the number of records, so each player appears exactly once and the data is cleanly sampled.

### Data Cleaning

The data doesn’t require any kind of transformation and the data types are aligned with their purpose.

In [29]:
print(df.dtypes)

userid             int64
version           object
sum_gamerounds     int64
retention_1         bool
retention_7         bool
dtype: object


There aren't any NULL values, or even duplicates.

In [30]:
# Data Consistency
# Check for missing values
print("Missing values: \n", df.isnull().sum(), '\n')

# Check for duplicates
print("Duplicate records: \n", df.duplicated().sum(), '\n')

# Check for unique values in 'version'
print("Unique values in 'version': \n", df['version'].unique(), '\n')

# Check for unique values in 'retention_1' and 'retention_7'
print("Unique values in 'retention_1': \n", df['retention_1'].unique(), '\n')
print("Unique values in 'retention_7': \n", df['retention_7'].unique(), '\n')

Missing values: 
 userid            0
version           0
sum_gamerounds    0
retention_1       0
retention_7       0
dtype: int64 

Duplicate records: 
 0 

Unique values in 'version': 
 ['gate_30' 'gate_40'] 

Unique values in 'retention_1': 
 [False  True] 

Unique values in 'retention_7': 
 [False  True] 



### Normalization
Looking at the quartiles, and keeping in mind that **sum_gamerounds** is the only numeric variable of interest for this study, we can see that the data is on a consistent scale and does not require changes.

In [31]:
df.describe()

Unnamed: 0,userid,sum_gamerounds
count,90189.0,90189.0
mean,4998412.0,51.872457
std,2883286.0,195.050858
min,116.0,0.0
25%,2512230.0,5.0
50%,4995815.0,16.0
75%,7496452.0,51.0
max,9999861.0,49854.0


### A/B Analysis
The most accurate way to test changes is to perform A/B testing by targeting a specific variable, in this case retention (for 1 and 7 days after installation). <br>
I will use the two groups in the `version` variable: 
+ Control group: The time gate is located at level 30. We are going to consider this one as a no-treatment group.
+ Treatment group: The company plans to move the time gate to level 40. We are going to use this as the subject of study, due to the change involved. <br>

In this section, I am going to use a **bootstrapping** technique to compare the retention probabilities between the groups. 

In [32]:
# Counting the number of players in each AB group.
players_g30 = df[df['version'] == 'gate_30']
players_g40 = df[df['version'] == 'gate_40']

print('Number of players tested at Gate 30:', str(players_g30.shape[0]), '\n',
     'Number of players tested at Gate 40:', str(players_g40.shape[0]))

Number of players tested at Gate 30: 44700 
 Number of players tested at Gate 40: 45489


### Game Round Proportion
As we can see, the players are split almost evenly between the two groups, so the next step is to explore the game rounds data.
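
One quick way to put a number on that balance is to turn the group sizes into shares. The sketch below is not part of the original analysis; it uses only the two counts printed above, and the names `group_sizes` and `group_shares` are introduced here for illustration.

```python
import pandas as pd

# Group sizes as printed in the cell output above
group_sizes = pd.Series({"gate_30": 44700, "gate_40": 45489})

# Share of all players that landed in each group
group_shares = group_sizes / group_sizes.sum()
print(group_shares.round(4))
```

In the notebook itself, `df['version'].value_counts(normalize=True)` should give the same shares directly from the data.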

In [38]:
# A/B Groups & Target Summary Stats
df.groupby("version").sum_gamerounds.agg(["count", "median", "mean", "std", "max"])

Unnamed: 0_level_0,count,median,mean,std,max
version,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
gate_30,44699,17.0,51.342111,102.057598,2961
gate_40,45489,16.0,51.298776,103.294416,2640


In [39]:
# Distribution Boxplot with outliers
box1 = px.box(
    df, 
    x="sum_gamerounds",
    title="Game Rounds Overall Distribution by player", 
    labels={"sum_gamerounds": "Game Rounds registered"}
)

# Customize the layout directly
box1.update_layout(
    height=720,
    width=1000,
    font=dict(size=15)
)

# Add an annotation if desired
box1.add_annotation(
    text="Game Rounds Distribution",
    xref="paper", yref="paper",
    x=0.5, y=1.08, showarrow=False,
    font=dict(size=16)
)

box1.show()

There is a clear outlier in the dataset: one user recorded 49,854 game rounds in less than 14 days, while the maximum among all other players is around 2,900 to 3,000. The only plausible explanations for this case are a "bot", a "bug" or a "glitch".
Since it affects only one record, it is preferable to clean it. Let’s prune it.

In [40]:
df = df[df['sum_gamerounds'] != 49854]

In [41]:
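# ECDF plot
# Assumption: the helper arrays below were defined in a cell that is not shown here.
# The percentile list is a guess, chosen so that ptiles[4] is the 95th percentile used later.
x_rounds = np.sort(df['sum_gamerounds'])
y_rounds = np.arange(1, len(x_rounds) + 1) / len(x_rounds)
percentiles = np.array([5, 25, 50, 75, 95])
ptiles = np.percentile(df['sum_gamerounds'], percentiles)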
ecdf = go.Figure()

# Add traces
ecdf.add_trace(go.Scatter(x=x_rounds, y=y_rounds,
                    mode='markers',
                    name='Game Rounds'))
ecdf.add_trace(go.Scatter(x=ptiles, y=percentiles/100,
                    mode='markers+text',
                    name='Percentiles', marker_line_width=2, marker_size=10,
                    text=percentiles, textposition="bottom right"))

# Adjust the layout directly
ecdf.update_layout(
    title='Game Rounds Cumulative Distribution Plot',
    yaxis_title="Cumulative Probability",
    height=720,
    width=1000,
    font=dict(size=15)
)

# Add an annotation if desired
ecdf.add_annotation(
    text="Game Rounds Distribution",
    xref="paper", yref="paper",
    x=0.5, y=1.08, showarrow=False,
    font=dict(size=16)
)

ecdf.show()

As the plot shows, 95% of our data is below 500 game rounds.

In [44]:
print("The 95 percentile of the data is at: ", ptiles[4], "Game Rounds","\n",
"This means there are ", df[df["sum_gamerounds"] <= ptiles[4]].shape[0], " players")

The 95 percentile of the data is at:  221.0 Game Rounds 
 This means there are  85706  players


221 game rounds is better than what I expected.

In [42]:
df.groupby("sum_gamerounds").userid.count().reset_index().head(20)

Unnamed: 0,sum_gamerounds,userid
0,0,3994
1,1,5538
2,2,4606
3,3,3958
4,4,3629
5,5,2992
6,6,2861
7,7,2379
8,8,2267
9,9,2013


So 3,994 users never played the game. There are several plausible reasons for that:
+ They downloaded the game but gave priority to other games.
+ They don't have time to play or enjoy the game.
+ The ads may differ from the actual game (graphics, configuration), so the customers lose interest in it.

The number of users also decreases as the levels progress (this needs further research), possibly because:
+ Difficulty rises.
+ The later the stage, the fewer new things there are to experience (UX/UI).

In most cases, players play only a couple of game rounds in their first two weeks. But we are looking for players who like the game and get hooked; that’s one of our interests.

As mentioned before, a common metric in the video gaming industry for how fun and engaging a game is, is 1-day retention.

#### Player Retention Model 
+ Retention is the percentage of players that come back and play the game one day after they have installed it. 
+ The higher 1-day retention is, the easier it is to retain players and build a large player base.
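
Because the retention columns are boolean, that percentage is simply the column mean times 100. The sketch below illustrates this on a tiny made-up table (the `toy` frame is hypothetical and is not taken from `cookie_cats.csv`); it is an alternative to the `len(...) / len(...)` pattern, not the notebook's own method.

```python
import pandas as pd

# Toy data for illustration only; not taken from cookie_cats.csv
toy = pd.DataFrame({
    "version": ["gate_30", "gate_30", "gate_40", "gate_40"],
    "retention_1": [True, False, True, True],
    "retention_7": [False, False, True, False],
})

# The mean of a boolean column is the share of True values
retention_by_group = (
    toy.groupby("version")[["retention_1", "retention_7"]].mean() * 100
)
print(retention_by_group)
```

Applied to the real data, the same `groupby` call on `df` would return both retention rates for both groups in one table.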

In [60]:
# The % of users that came back the day after they installed
prop = len(df[df['retention_1'] == True]) / len(df['retention_1']) * 100
print("The overall retention for 1 day is: ", str(round(prop,2)),"%")

The overall retention for 1 day is:  44.52 %


+ Less than half of the players come back one day after installing the game.
+ Now that we have a benchmark, let’s look at how 1-day retention differs between the two AB groups.

In [61]:
# Calculating 1-day retention for each AB-group

# control group
prop_gate30 = len(players_g30[players_g30['retention_1'] == True])/len(players_g30['retention_1']) * 100

# treatment group
prop_gate40 = len(players_g40[players_g40['retention_1'] == True])/len(players_g40['retention_1']) * 100

print('Group 30 at 1 day retention: ',str(round(prop_gate30,2)),"%","\n",
     'Group 40 at 1 day retention: ',str(round(prop_gate40,2)),"%")


Group 30 at 1 day retention:  44.82 % 
 Group 40 at 1 day retention:  44.23 %


+ It shows that moving the gate to level 40 (44.23%) resulted in a modest drop in 1-day retention compared to the control at level 30 (44.82%).

+ It's a small change, but even minor changes in retention may have a significant influence. While we are certain of the difference in the data, how confident can we be that a gate at level 40 will be worse in the future?


+ As a result, it is worth turning to bootstrapping, which is defined as "a sampling with replacement from observed data to estimate the variability in a statistic of interest". In this case the statistic is retention, and we'll create a function for it.

In [62]:
# Bootstrapping Function
def draw_bs_reps(data,func,iterations=1):
    boot_Xd = []
    for i in range(iterations):
        boot_Xd.append(func(data = np.random.choice(data, len(data))))
    return boot_Xd
# Retention Function
def retention(data):
    ret = len(data[data == True])/len(data)
    return ret
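
As a side note, the loop above can be sped up for a binary outcome like retention. Resampling n True/False values with replacement and counting the True values is equivalent to one draw from a Binomial(n, p̂) distribution, where p̂ is the observed retention rate. The sketch below relies on that equivalence; `bootstrap_retention` and its `seed` parameter are hypothetical names introduced here, and the input is toy data rather than the notebook's columns.

```python
import numpy as np

def bootstrap_retention(flags, iterations=1000, seed=0):
    """Bootstrap replicates of a retention rate for a boolean array.

    Resampling n True/False values with replacement and counting the True
    values is the same as one Binomial(n, p_hat) draw, so the Python loop
    can be replaced by a single vectorized call.
    """
    flags = np.asarray(flags, dtype=bool)
    n = flags.size
    p_hat = flags.mean()
    rng = np.random.default_rng(seed)  # seeded so results are reproducible
    return rng.binomial(n, p_hat, size=iterations) / n

# Toy input for illustration only: 450 retained players out of 1000
toy_flags = np.array([True] * 450 + [False] * 550)
reps = bootstrap_retention(toy_flags, iterations=2000)
print(reps.mean(), reps.std())
```

The replicates should centre on the observed rate, and their spread approximates the standard error of the retention estimate.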

In [64]:
# Bootstrapping for gate 30
btg30_1d = draw_bs_reps(players_g30['retention_1'], retention, iterations=1000)
# Bootstrapping for gate 40
btg40_1d = draw_bs_reps(players_g40['retention_1'], retention, iterations=1000)

mean_g40 = np.mean(btg40_1d)
mean_g30 = np.mean(btg30_1d)

# A Kernel Density Estimate plot of the bootstrap distributions
boot_1d = pd.DataFrame(data={'gate_30': btg30_1d, 'gate_40': btg40_1d}, index=range(1000))

# Plotting histogram
hist_1d = [boot_1d.gate_30, boot_1d.gate_40]
dist_1d = ff.create_distplot(hist_1d, group_labels=["Gate 30 (Control)", "Gate 40 (Treatment)"], show_rug=False, colors=['#3498DB', '#28B463'])
dist_1d.add_vline(x=mean_g40, line_width=3, line_dash="dash", line_color="#28B463")
dist_1d.add_vline(x=mean_g30, line_width=3, line_dash="dash", line_color="#3498DB")
dist_1d.add_vrect(x0=mean_g30, x1=mean_g40, line_width=0, fillcolor="#F1C40F", opacity=0.2)

# Adjust the layout directly
dist_1d.update_layout(
    xaxis_range=[0.43, 0.46],
    title='1-Day Retention Bootstrapping by A/B Group',
    xaxis_title="Retention",
    height=720,
    width=1000,
    font=dict(size=15)
)

# Add an annotation if desired
dist_1d.add_annotation(
    text="Retention Distribution",
    xref="paper", yref="paper",
    x=0.5, y=1.08, showarrow=False,
    font=dict(size=16)
)

dist_1d.show()

The two distributions still look close, so it is preferable to zoom in by plotting the difference as an individual measure.

In [65]:
# Adding a column with the % difference between the two AB-groups
boot_1d['diff'] = (
    ((boot_1d['gate_30'] - boot_1d['gate_40']) / boot_1d['gate_40']) * 100
)

# Plotting the bootstrap % difference
hist_1d_diff = [boot_1d['diff']]
dist_1d_diff = ff.create_distplot(
    hist_1d_diff,
    show_rug=False,
    colors=['#F1C40F'],
    group_labels=["Gate 30 - Gate 40"],
    show_hist=False
)
dist_1d_diff.add_vline(x=np.mean(boot_1d['diff']), line_width=3, line_dash="dash", line_color="black")

# Adjust the layout directly
dist_1d_diff.update_layout(
    xaxis_range=[-3, 6],
    title='Percentage of "1 day retention" difference between A/B Groups',
    xaxis_title="% Difference",
    height=720,
    width=1000,
    font=dict(size=15)
)

# Add an annotation if desired
dist_1d_diff.add_annotation(
    text="Retention Difference Distribution",
    xref="paper", yref="paper",
    x=0.5, y=1.08, showarrow=False,
    font=dict(size=16)
)

dist_1d_diff.show()

From this chart, we can see that the percentage difference is around 1% - 2%, and that most of the distribution is above 0%, in favor of a gate at level 30.

But, what is the probability that the difference is above 0%? Let’s calculate that as well.

In [66]:
# Calculating the probability that 1-day retention is greater when the gate is at level 30
prob = (boot_1d['diff'] > 0.0).sum() / len(boot_1d['diff'])

# Pretty printing the probability
print('The probability of Group 30 (Control) having a higher \n retention than Group 40 (Treatment) is: ', prob*100, '%')


The probability of Group 30 (Control) having a higher 
 retention than Group 40 (Treatment) is:  96.3 %
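
The same bootstrap replicates can also be summarized as a percentile confidence interval, which shows the plausible range of the difference rather than only its sign. This is a minimal sketch and not part of the original analysis: `percentile_ci` is a hypothetical helper, and the replicates are synthetic numbers generated for illustration, not the values in `boot_1d['diff']`.

```python
import numpy as np

def percentile_ci(replicates, level=0.95):
    """Percentile bootstrap confidence interval for a set of replicates."""
    alpha = (1.0 - level) / 2.0
    lower, upper = np.percentile(replicates, [100 * alpha, 100 * (1 - alpha)])
    return lower, upper

# Synthetic replicates for illustration only (not the notebook's values)
rng = np.random.default_rng(42)
toy_diff = rng.normal(loc=1.3, scale=0.7, size=1000)
low, high = percentile_ci(toy_diff)
print(f"95% interval: [{low:.2f}, {high:.2f}]")
```

In the notebook, calling `percentile_ci(boot_1d['diff'])` would give the interval for the 1-day retention difference; an interval that excludes 0 points the same way as a high probability above.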


### 7-day retention by A/B group
+ The bootstrap analysis tells us that there is a high probability that 1-day retention is better when the time gate is at level 30. However, since players have only been playing the game for one day, it is likely that most players haven’t reached level 30 yet. That is, many players won’t have been affected by the gate, even if it’s as early as level 30.
<br>
+ But after having played for a week, more players should have reached level 40, and therefore it makes sense to also look at 7-day retention. That is: What percentage of the people that installed the game also showed up a week later to play the game again?
<br>
+ So I calculate 7-day retention for the two AB groups.

In [67]:
# Calculating 7-day retention for both AB-groups
ret30_7d = len(players_g30[players_g30['retention_7'] == True])/len(players_g30['retention_7']) * 100
ret40_7d = len(players_g40[players_g40['retention_7'] == True])/len(players_g40['retention_7']) * 100

print('Group 30 at 7 day retention: ',str(round(ret30_7d,2)),"%","\n",
     'Group 40 at 7 day retention: ',str(round(ret40_7d,2)),"%")


Group 30 at 7 day retention:  19.02 % 
 Group 40 at 7 day retention:  18.2 %


As with 1-day retention, we see that 7-day retention is slightly lower (18.20%) when the gate is at level 40 than when the time gate is at level 30 (19.02%). This difference is also larger than for 1-day retention.
<br><br>
We also see that the overall 7-day retention is lower than the overall 1-day retention; fewer people play a game a week after installing than a day after installing.
<br><br>
But as before, let’s use bootstrap analysis to figure out how sure we can be of the difference between the AB-groups.

In [70]:
# Creating a list with bootstrapped means for each AB-group

# Bootstrapping for control group
btg30_7d = draw_bs_reps(players_g30['retention_7'], retention, iterations=500)

# Bootstrapping for treatment group
btg40_7d = draw_bs_reps(players_g40['retention_7'], retention, iterations=500)

boot_7d = pd.DataFrame(data={'gate_30': btg30_7d, 'gate_40': btg40_7d}, index=range(500))

# Adding a column with the % difference between the two AB-groups
boot_7d['diff'] = (boot_7d['gate_30'] - boot_7d['gate_40']) / boot_7d['gate_30'] * 100

# Plotting the bootstrap % difference
hist_7d_diff = [boot_7d['diff']]
dist_7d_diff = ff.create_distplot(
    hist_7d_diff,
    show_rug=False,
    colors=['#FF5733'],
    group_labels=["Gate 30 - Gate 40"],
    show_hist=False
)
dist_7d_diff.add_vline(x=np.mean(boot_7d['diff']), line_width=3, line_dash="dash", line_color="black")

# Adjust the layout directly
dist_7d_diff.update_layout(
    xaxis_range=[-4, 12],
    title='Percentage of "7 day retention" difference between A/B Groups',
    xaxis_title="% Difference",
    height=720,
    width=1000,
    font=dict(size=15)
)

# Add an annotation if desired
dist_7d_diff.add_annotation(
    text="Retention Difference Distribution",
    xref="paper", yref="paper",
    x=0.5, y=1.08, showarrow=False,
    font=dict(size=16)
)

dist_7d_diff.show()

# Calculating the probability that 7-day retention is greater when the gate is at level 30
prob = (boot_7d['diff'] > 0).sum() / len(boot_7d)

# Pretty printing the probability
print('The probability of Group 30 (Control) having a higher \n retention than Group 40 (Treatment) is: ', prob*100, '%')

The probability of Group 30 (Control) having a higher 
 retention than Group 40 (Treatment) is:  100.0 %


I reject H0 (the null hypothesis of no difference) and accept H1 (there is a difference: gate 30 retains players better than gate 40).
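
A classical cross-check for a decision like this is a two-proportion z-test, which tests the same null hypothesis of no difference between the groups. The sketch below swaps that technique in for the bootstrap and is not part of the original analysis; `two_proportion_ztest` is a hypothetical helper, and the counts passed to it are toy numbers, not the notebook's data.

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under H0 (no difference between the groups)
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Toy counts for illustration only (not the notebook's data)
z, p = two_proportion_ztest(success_a=200, n_a=1000, success_b=150, n_b=1000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With the real data, the inputs would be `players_g30['retention_7'].sum()` and `len(players_g30)` for the control group, and the matching values from `players_g40` for the treatment group.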

### Conclusion 
**So, why is retention higher when the gate is set earlier?**
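+ Normally, we would predict the opposite: the later the hurdle, the longer players stay engaged in the game. However, the evidence does not support this. One possible explanation is hedonic adaptation theory: enjoyment of an activity tends to fade with continuous play, so a gate that forces a break earlier may prolong that enjoyment.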

**What to do, as stakeholders?**
+ We now have enough statistical evidence to infer that 7-day retention is higher when the gate is at level 30 than when it is at level 40, just as we found for 1-day retention.  
+ If we want to keep player retention high, we should not move the gate from level 30 to level 40; that is, we should keep the control setup, the current gate system.