# Games - Analysis of advertising sources

# Materials
* [Presentation](https://drive.google.com/file/d/1_bskwJpbjnoEFxYbnqe9MtvLwI8bs-oq/view?usp=sharing)
* [Dashboard](https://public.tableau.com/views/Games_user_sources/Dashboard?:language=en-GB&publish=yes&:display_count=n&:origin=viz_share_link)

## Project goal

The team behind the mobile game "Space Brothers" has made the game popular and now plans to develop and implement a monetization system in it. This requires a preliminary analysis.

The product manager set the task of analyzing player behavior depending on the source from which the players came to the game:

- Conduct exploratory data analysis;
- Analyze the influence of the source of transition to the game on the user's behavior;
- Test statistical hypotheses.


## Loading data and studying general information

In [33]:
# import all packages
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats as st

import plotly.io as pio
pio.renderers


import warnings
warnings.filterwarnings('ignore')

In [34]:
# loading data

try: # local path
    game_actions = pd.read_csv('game_actions.csv')
    ad_costs = pd.read_csv('ad_costs.csv')
    user_source = pd.read_csv('user_source.csv')
except FileNotFoundError:  # server path
    game_actions = pd.read_csv('/datasets/game_actions.csv')
    ad_costs = pd.read_csv('/datasets/ad_costs.csv')
    user_source = pd.read_csv('/datasets/user_source.csv')
    
# function for study data
def study_data_info(data):
    display(data.head())
    display(data.describe())
    display(data.info())
    display('Missing values:', data.isna().sum()) 
    display('Duplicates:', data.duplicated().sum())
    

In [35]:
# study data
data_sets = [game_actions, ad_costs, user_source]
for data in data_sets:
    study_data_info(data)

Unnamed: 0,event_datetime,event,building_type,user_id,project_type
0,2020-05-04 00:00:01,building,assembly_shop,55e92310-cb8e-4754-b622-597e124b03de,
1,2020-05-04 00:00:03,building,assembly_shop,c07b1c10-f477-44dc-81dc-ec82254b1347,
2,2020-05-04 00:00:16,building,assembly_shop,6edd42cc-e753-4ff6-a947-2107cd560710,
3,2020-05-04 00:00:16,building,assembly_shop,92c69003-d60a-444a-827f-8cc51bf6bf4c,
4,2020-05-04 00:00:35,building,assembly_shop,cdc6bb92-0ccb-4490-9866-ef142f09139d,


Unnamed: 0,event_datetime,event,building_type,user_id,project_type
count,135640,135640,127957,135640,1866
unique,128790,3,3,13576,1
top,2020-05-09 12:35:56,building,spaceport,bf542075-e3a2-4e79-82d8-3838e86d2a25,satellite_orbital_assembly
freq,4,127957,59325,22,1866


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135640 entries, 0 to 135639
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   event_datetime  135640 non-null  object
 1   event           135640 non-null  object
 2   building_type   127957 non-null  object
 3   user_id         135640 non-null  object
 4   project_type    1866 non-null    object
dtypes: object(5)
memory usage: 5.2+ MB


None

'Missing values:'

event_datetime         0
event                  0
building_type       7683
user_id                0
project_type      133774
dtype: int64

'Duplicates:'

1

Unnamed: 0,source,day,cost
0,facebook_ads,2020-05-03,935.882786
1,facebook_ads,2020-05-04,548.35448
2,facebook_ads,2020-05-05,260.185754
3,facebook_ads,2020-05-06,177.9822
4,facebook_ads,2020-05-07,111.766796


Unnamed: 0,cost
count,28.0
mean,271.556321
std,286.86765
min,23.314669
25%,66.747365
50%,160.056443
75%,349.034473
max,969.139394


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   source  28 non-null     object 
 1   day     28 non-null     object 
 2   cost    28 non-null     float64
dtypes: float64(1), object(2)
memory usage: 800.0+ bytes


None

'Missing values:'

source    0
day       0
cost      0
dtype: int64

'Duplicates:'

0

Unnamed: 0,user_id,source
0,0001f83c-c6ac-4621-b7f0-8a28b283ac30,facebook_ads
1,00151b4f-ba38-44a8-a650-d7cf130a0105,yandex_direct
2,001aaea6-3d14-43f1-8ca8-7f48820f17aa,youtube_channel_reklama
3,001d39dc-366c-4021-9604-6a3b9ff01e25,instagram_new_adverts
4,002f508f-67b6-479f-814b-b05f00d4e995,facebook_ads


Unnamed: 0,user_id,source
count,13576,13576
unique,13576,4
top,0001f83c-c6ac-4621-b7f0-8a28b283ac30,yandex_direct
freq,1,4817


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13576 entries, 0 to 13575
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  13576 non-null  object
 1   source   13576 non-null  object
dtypes: object(2)
memory usage: 212.2+ KB


None

'Missing values:'

user_id    0
source     0
dtype: int64

'Duplicates:'

0

**Description of datasets:**

The main dataset contains data about events that took place in the Space Brothers mobile game. In it, users build their space program and try to succeed in the difficult task of colonizing the galaxy.

Monetization of the game is still only planned. The working assumption is that the application will display ads on the screen where the player chooses the type of object to build.

The dataset contains data on user gameplay at the first level. Completing the first level requires the player to fulfill one of two conditions:

- Defeat the first enemy
- Implement the project: develop an orbital assembly of satellites

The dataset contains the data of the first users of the application - a cohort of users who started using the application from May 4 to May 10 inclusive.

Dataset *game_actions.csv*:

- `event_datetime` — event time;
- `event` is one of three events:
    1. `building` - the object is built,
    2. `finished_stage_1` - the first level is completed,
    3. `project` - the project is completed;
- `building_type` — one of three building types:
    1. `assembly_shop` - assembly shop,
    2. `spaceport` - spaceport,
    3. `research_center` - research center;
- `user_id` - user ID;
- `project_type` - the type of the implemented project;

In addition to the main dataset, there are two datasets with information about advertising activities. They will also help in solving the problem.

Dataset *ad_costs.csv*:

- `day` - the day on which the ad was clicked
- `source` - traffic source
- `cost` - cost of clicks

Dataset *user_source.csv*:

- `user_id` - user ID
- `source` - sources from which the user who installed the application came

***Conclusion***

The data is presented in 3 tables and contains information about events performed by users in the mobile game "Space Brothers" (table `game_actions`), user acquisition sources (table `user_source`) and the costs of each acquisition source (table `ad_costs`).

Previewing the data showed that the `game_actions` table contains 1 fully duplicated row, as well as missing values in the `building_type` and `project_type` fields. In addition, the fields containing dates have the **object** data type, which can interfere with date handling.
In the next stages, we will preprocess the data and prepare it for analysis.

## Preprocessing Data

### Changing data types

For columns containing date and time information, we will change the data type to work correctly with dates in the future.

In [36]:
# Changing data types 
game_actions['event_datetime'] = pd.to_datetime(game_actions['event_datetime'])
ad_costs['day'] = pd.to_datetime(ad_costs['day'])

# check
game_actions.info()
ad_costs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135640 entries, 0 to 135639
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   event_datetime  135640 non-null  datetime64[ns]
 1   event           135640 non-null  object        
 2   building_type   127957 non-null  object        
 3   user_id         135640 non-null  object        
 4   project_type    1866 non-null    object        
dtypes: datetime64[ns](1), object(4)
memory usage: 5.2+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   source  28 non-null     object        
 1   day     28 non-null     datetime64[ns]
 2   cost    28 non-null     float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 800.0+ bytes


### Working with missing values and duplicates

While previewing the data, 1 duplicate was noticed (2 rows are complete duplicates). Let's display this line and calculate the share of duplicates in the total amount of data.

In [37]:
# rows-duplicates
game_actions_duplicated = game_actions[game_actions.duplicated()]
display(game_actions_duplicated)
# share of duplicates
print('Share of duplicates: {: .6f}'.format((game_actions_duplicated.shape[0] / game_actions.shape[0])))

Unnamed: 0,event_datetime,event,building_type,user_id,project_type
74891,2020-05-10 18:41:56,building,research_center,c9af55d2-b0ae-4bb4-b3d5-f32aa9ac03af,


Share of duplicates:  0.000007


Since there is only one duplicate row (far less than 1% of the data), we will delete it. The row was probably duplicated for technical reasons: both the event time and the user id are identical.
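
As a side note, `duplicated()` with its default settings flags only the later copy of a duplicated row. A minimal sketch on made-up data (the `toy_actions` frame and its values are assumptions for illustration, not the real dataset) shows how `keep=False` displays both copies before one of them is dropped:

```python
import pandas as pd

# Toy stand-in for `game_actions`; the values are invented for illustration.
toy_actions = pd.DataFrame({
    'event_datetime': ['2020-05-10 18:41:56', '2020-05-10 18:41:56', '2020-05-10 18:42:10'],
    'event': ['building', 'building', 'building'],
    'user_id': ['u1', 'u1', 'u2'],
})

# keep=False marks every member of a duplicate group, not only the later copies.
both_copies = toy_actions[toy_actions.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each group by default.
deduplicated = toy_actions.drop_duplicates()

# Share of rows that would be removed.
dup_share = toy_actions.duplicated().mean()
print(both_copies)
print('Share of duplicates: {:.6f}'.format(dup_share))
```

Viewing both copies side by side makes it easier to confirm that the rows really are identical in every column.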

In [38]:
# deleting rows-duplicates
actions = game_actions.drop_duplicates()
print('Rows removed:', game_actions.shape[0] - actions.shape[0]) # check

Rows removed: 1


__________
Next, let's see what unique values the fields with types of buildings in the game, types of projects and events contain. Let's display the information along with the missing values.

In [39]:
display(actions['building_type'].value_counts(dropna=False))
display(actions['project_type'].value_counts(dropna=False))
actions['event'].value_counts(dropna=False)

spaceport          59325
assembly_shop      54494
research_center    14137
NaN                 7683
Name: building_type, dtype: int64

NaN                           133773
satellite_orbital_assembly      1866
Name: project_type, dtype: int64

building            127956
finished_stage_1      5817
project               1866
Name: event, dtype: int64

There are 3 types of events in the game: `building`, `finished_stage_1` (level 1 completion) and `project` (project completion). The `event` column is fully populated. The `building` event probably corresponds to the 3 building types (spaceport, assembly_shop, research_center), so the other 2 events, `finished_stage_1` and `project`, should have no building type, which would explain the missing values.

There is only 1 project type in the game, `satellite_orbital_assembly` (completing this project completes level 1). It probably corresponds only to the `project` event, since both counts equal 1866.

Let's test these theories below.

In [40]:
print('`building_type`=NaN while `event`=', actions[actions['building_type'].isnull()]['event'].unique())
print('`project_type`=NaN while `event`=', actions[actions['project_type'].isnull()]['event'].unique())

`building_type`=NaN while `event`= ['finished_stage_1' 'project']
`project_type`=NaN while `event`= ['building' 'finished_stage_1']


Assumptions confirmed:
* the `building_type` column is filled only for the `building` event;
* the `project_type` column is filled only for the `project` event.

This means that the missing values are not data errors and should be neither filled nor removed, since doing so could distort the analysis results.
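
The same check can be expressed as a single table showing, for each event type, the share of rows where each column is filled. This is only a sketch on invented data (the `toy` frame is an assumption), not the notebook's own check:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `actions`; the rows are invented for illustration.
toy = pd.DataFrame({
    'event': ['building', 'building', 'project', 'finished_stage_1'],
    'building_type': ['spaceport', 'assembly_shop', np.nan, np.nan],
    'project_type': [np.nan, np.nan, 'satellite_orbital_assembly', np.nan],
})

# Share of non-missing values per column, grouped by event type.
filled_share = (
    toy[['building_type', 'project_type']]
    .notna()
    .groupby(toy['event'])
    .mean()
)
print(filled_share)
```

A share of exactly 1.0 or 0.0 in every cell would indicate that missingness is fully determined by the event type, which is the structural pattern described above.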

### Combining data into one dataset

For the convenience of further analysis of user behavior depending on the sources, let's combine the `actions` and `user_source` tables into a single dataframe on the unique user id.
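
A left join like this can silently multiply rows if a user id appears more than once in the lookup table, or leave gaps if a user has no source. The sketch below uses invented frames (`toy_actions`, `toy_sources`) and the `validate` and `indicator` parameters of `pd.merge` to guard against both cases; it is an optional safety check, not part of the original notebook:

```python
import pandas as pd

# Toy stand-ins for `actions` and `user_source`; the values are invented.
toy_actions = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'event': ['building', 'project', 'building'],
})
toy_sources = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'source': ['facebook_ads', 'yandex_direct'],
})

merged = pd.merge(
    toy_actions, toy_sources,
    on='user_id', how='left',
    validate='many_to_one',  # raises MergeError if a user_id repeats in toy_sources
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Number of events whose user has no known source.
unmatched = (merged['_merge'] != 'both').sum()
print('Rows without a source:', unmatched)
```

With `validate='many_to_one'` the row count of the result is guaranteed to equal the row count of the left table.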

In [41]:
# combining datasets
df = pd.merge(actions, user_source, on=['user_id'], how='left')
# check
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 135639 entries, 0 to 135638
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   event_datetime  135639 non-null  datetime64[ns]
 1   event           135639 non-null  object        
 2   building_type   127956 non-null  object        
 3   user_id         135639 non-null  object        
 4   project_type    1866 non-null    object        
 5   source          135639 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 7.2+ MB


Unnamed: 0,event_datetime,event,building_type,user_id,project_type,source
0,2020-05-04 00:00:01,building,assembly_shop,55e92310-cb8e-4754-b622-597e124b03de,,youtube_channel_reklama
1,2020-05-04 00:00:03,building,assembly_shop,c07b1c10-f477-44dc-81dc-ec82254b1347,,facebook_ads
2,2020-05-04 00:00:16,building,assembly_shop,6edd42cc-e753-4ff6-a947-2107cd560710,,instagram_new_adverts
3,2020-05-04 00:00:16,building,assembly_shop,92c69003-d60a-444a-827f-8cc51bf6bf4c,,facebook_ads
4,2020-05-04 00:00:35,building,assembly_shop,cdc6bb92-0ccb-4490-9866-ef142f09139d,,yandex_direct


Let's also add an additional `date` column for ease of analysis in the future.

In [42]:
# adding 'date' column
date = df['event_datetime'].dt.date
df.insert(1, 'date', date)
df.head() # check

Unnamed: 0,event_datetime,date,event,building_type,user_id,project_type,source
0,2020-05-04 00:00:01,2020-05-04,building,assembly_shop,55e92310-cb8e-4754-b622-597e124b03de,,youtube_channel_reklama
1,2020-05-04 00:00:03,2020-05-04,building,assembly_shop,c07b1c10-f477-44dc-81dc-ec82254b1347,,facebook_ads
2,2020-05-04 00:00:16,2020-05-04,building,assembly_shop,6edd42cc-e753-4ff6-a947-2107cd560710,,instagram_new_adverts
3,2020-05-04 00:00:16,2020-05-04,building,assembly_shop,92c69003-d60a-444a-827f-8cc51bf6bf4c,,facebook_ads
4,2020-05-04 00:00:35,2020-05-04,building,assembly_shop,cdc6bb92-0ccb-4490-9866-ef142f09139d,,yandex_direct


***Conclusion***

During data preprocessing:
- the date columns in the `game_actions` and `ad_costs` tables were converted to a datetime type so that they can be handled correctly later;
- one duplicate row (less than 1% of the total amount of data), probably the result of a technical error, was deleted;
- the missing values in the `building_type` and `project_type` columns were checked and found not to be data errors; they should be neither filled nor removed, since that could distort the analysis results;
- the `actions` and `user_source` tables were combined into a single dataframe for the convenience of further analysis of user behavior depending on the acquisition source;
- a `date` column was added for the convenience of further analysis.

Preprocessing is complete and the data is ready for further analysis.

## Exploratory data analysis

### Overview of user behavior in the game

The main task of the project is to analyze the behavior of players depending on the sources. Let's take a look at the overall user behavior in the game:
* number of unique users per day
* user activity: the number and types of events by day, as well as the number of buildings in the game by their types
* strategies that users choose to complete level 1.

#### Number of unique users per day

Consider the dynamics of the number of unique users by day.

In [43]:
dau_total = df.groupby('date')['user_id'].nunique().reset_index()
fig = px.line(dau_total, x='date', y='user_id', title="Number of unique users per day (DAU)")
fig.show()

In [None]:
dau_total.describe()

According to the graph, the number of users grows to 9219 by May 10, drops sharply to 5995 on May 11, and then declines more smoothly until the end of the month, reaching a minimum of 4 users. The drop on May 11 may reflect the end of the advertising campaign or the end of the weekend. It is also possible that later user activity is simply not visible, since we only have data for level 1. Let's return to this when analyzing the advertising sources.

#### User activity

Let's look at user activity by day and by type of events in the game.

In [None]:
data_events = df.groupby(['date', 'event'])['event_datetime'].count().reset_index()
fig = px.bar(data_events, x='date', y='event_datetime', color='event',\
             title="Number of events in the game per day")
fig.show()

In [None]:
# number of events per user
events_per_user = df.groupby('user_id')['event'].count().reset_index()
events_per_user.event.describe()

The graph of user activity by day follows the number of users by day and also shows a sharp decline on May 11 (from about 15k to 7k events). Building events predominate, since more of them are needed to complete the level than any other event. 'Level 1 completion' events peak on May 16 (649 events).

On average, there are 9.9 events per user, which is close to the median of 10 events. The standard deviation of 4 indicates a fairly wide spread in values.

Next, consider which buildings are the most popular.

In [None]:
data_building = df.groupby('building_type')['event'].count().reset_index().sort_values('event', ascending=False)
fig = px.bar(data_building, x='building_type', y='event', \
             title="Number of buildings in the game by type of building")
fig.show()

The `assembly_shop` and `spaceport` buildings are the most popular (about 54k and 59k respectively), while the `research_center` building is significantly less popular (about 14k). This should be studied in more detail when planning the monetization model, which is expected to be placed on the object construction screens.

#### User strategies in the game

For further analysis, we will collect information on users: the source from which they came and what strategy they chose to pass level 1 (in case the level was completed by the user).

In [None]:
user_profiles = df.groupby(['user_id', 'source'])['event'].nunique().reset_index()
user_profiles['level_1_strategy'] = user_profiles['event'].map({1: 'not finished', 2: 'fight', 3: 'project'})
user_profiles.drop('event', axis=1, inplace=True)
user_profiles.head()
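
The mapping above infers the strategy from the number of distinct event types per user. That works only if every user with a `project` event also has a `finished_stage_1` event, which is an assumption about the data. A more explicit alternative, sketched here on invented events (`toy_events` and `classify` are hypothetical names), derives the strategy from per-event flags:

```python
import pandas as pd

# Toy events; u4 has a project event but never finished the level.
toy_events = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u2', 'u2', 'u3', 'u4', 'u4'],
    'event': ['building', 'finished_stage_1',
              'building', 'project', 'finished_stage_1',
              'building',
              'building', 'project'],
})

# One boolean flag per user and event type.
flags = pd.crosstab(toy_events['user_id'], toy_events['event']) > 0

def classify(row):
    """Return the level 1 strategy for one user's event flags."""
    if not row.get('finished_stage_1', False):
        return 'not finished'
    return 'project' if row.get('project', False) else 'fight'

level_1_strategy = flags.apply(classify, axis=1)
print(level_1_strategy)
```

In this toy data, counting distinct events would label `u4` as `fight` (2 distinct events), while the flag-based version labels that user as `not finished`.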

Let's look at the share of strategies, including how many users completed Level 1.

In [None]:
# user strategies on Level 1
data=user_profiles.groupby('level_1_strategy')['user_id'].count().reset_index()
fig = px.pie(data, values='user_id', names='level_1_strategy', title='User strategies on Level 1')
fig.show()

The graph shows that 57% of users did not complete level 1. Of the 43% of players who completed the first level, the majority chose battle as a strategy; fewer than half as many chose the project to build an orbital station.

### Analysis of user acquisition sources

At this stage, we will consider the behavior of users in the context of the sources of their attraction.
First, let's see which source brings the most users.

In [None]:
data_source = df.groupby('source')['user_id'].nunique().reset_index()
fig = px.pie(data_source, values='user_id', names='source', title='Distribution of users by acquisition channels')
fig.show()

The largest number of users comes from yandex_direct (35.5%), followed by instagram (24.7%) and facebook (20.1%); the fewest users come from youtube (19.8% of the total number of users).

Let's also consider whether the strategies of users in level 1 differ depending on the source from which they came.

In [None]:
# user strategies by sources
data_strategy_by_sources = user_profiles.groupby(['source', 'level_1_strategy'])['user_id'].count().reset_index()\
                                                                         .sort_values('user_id', ascending=False)
fig = px.bar(data_strategy_by_sources, x='source', y='user_id', color='level_1_strategy', barmode='group', \
                             text_auto=True, title="User strategies in the game by sources of their attraction")
fig.show()

In [None]:
users_strategy = user_profiles.query('level_1_strategy != "not finished"').pivot_table(index='source', \
                columns='level_1_strategy', values='user_id', aggfunc='nunique').reset_index()

users_strategy['%_fight'] = users_strategy['fight'] / (users_strategy['fight'] + users_strategy['project'])
users_strategy['%_project'] = users_strategy['project'] / (users_strategy['fight'] + users_strategy['project'])

users_strategy.style.format({'%_fight': '{:.1%}', '%_project': '{:.1%}'}).background_gradient(cmap='Greens',\
                                                                                              axis=0)

The graph and table above show that the distribution of strategies chosen by users barely changes with the source from which they came. It can be concluded that user strategies in level 1 do not depend on the source from which the user came.

In [None]:
data_events = df.groupby(['date', 'source'])['event_datetime'].count().reset_index()
fig = px.line(data_events, x='date', y='event_datetime', color='source', \
              title="The number of events in the game by day, by source")
fig.show()

The number of events per source follows the overall event dynamics and the number of users attracted from each source. Most events come from yandex_direct users. All sources also show a sharp drop in activity on May 11, which is likely due to changes in the advertising campaign or the end of the weekend.

In [None]:
# number of events per user by sources of attraction
events_per_user_by_source = df.groupby(['source', 'user_id'])['event'].count().reset_index()\
                                                                         .sort_values('event', ascending=False)
fig = px.box(events_per_user_by_source, x='source', y='event', title='Number of events per user by sources')
fig.show()

The average number of events per user differs only slightly between sources; the median, minimum and maximum numbers of events per user are also similar.

### User acquisition cost analysis

To analyze user acquisition costs, let's combine the data on users and acquisition channels. To do this:
* Determine the date of the first entry into the game for each user.

In [None]:
# the date of the first entry
first_enter_by_user = df.groupby(['source', 'user_id'])['date'].min().reset_index()\
                       .rename(columns={'source': 'user_source', 'date': 'first_enter'})
first_enter_by_user['first_enter'] = pd.to_datetime(first_enter_by_user['first_enter'])
# the number of users by the dates of their first entry into the game
first_enter_by_user_grouped = first_enter_by_user.groupby(['first_enter', 'user_source'])['user_id'].nunique()\
                                    .reset_index().rename(columns={'user_id': 'number_of_users'})
fig = px.bar(first_enter_by_user_grouped, x='first_enter', y='number_of_users', color='user_source', \
                   barmode='group', title="New users per day")
fig.show()

* Let's compare the distribution by dates of clicks on ads in the context of the sources of user acquisition.

In [None]:
total_costs = ad_costs.groupby(['day', 'source'])['cost'].sum().reset_index()
fig = px.bar(total_costs, x='day', y='cost', color='source', barmode='group', \
                                 title="Ad spend by day-click on ad")
fig.show()

The graphs above show that the ad spend dates precede the first-login dates. To calculate the cost of attracting one user, let's align the dates: we treat the acquisition date of a user as the date of the first entry minus 1 day. Next, we join the tables to calculate the total costs.
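
The alignment described above can be sketched on invented numbers. The frames `toy_costs` and `toy_new_users` and the column `acquisition_day` are assumptions for illustration; the one-day shift is the rule stated in the text:

```python
import pandas as pd

# Toy ad spend and new-user counts; all values are invented.
toy_costs = pd.DataFrame({
    'day': pd.to_datetime(['2020-05-03', '2020-05-04']),
    'source': ['facebook_ads', 'facebook_ads'],
    'cost': [900.0, 500.0],
})
toy_new_users = pd.DataFrame({
    'first_enter': pd.to_datetime(['2020-05-04', '2020-05-05']),
    'user_source': ['facebook_ads', 'facebook_ads'],
    'number_of_users': [1000, 600],
})

# Shift the first-entry date back by one day to get the assumed acquisition day.
toy_new_users['acquisition_day'] = toy_new_users['first_enter'] - pd.Timedelta(days=1)

# An outer join keeps days that appear in only one of the two tables.
aligned = toy_costs.merge(
    toy_new_users, how='outer',
    left_on=['day', 'source'], right_on=['acquisition_day', 'user_source'],
)
aligned['cost_per_user'] = aligned['cost'] / aligned['number_of_users']
print(aligned[['day', 'source', 'cost', 'number_of_users', 'cost_per_user']])
```

If the shift were omitted, spend from May 3 would have no matching users and users from May 5 would have no matching spend, so the outer join would produce rows with missing values.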

In [None]:
# align the first-entry date with the user acquisition date
first_enter_by_user_grouped['first_enter'] = first_enter_by_user_grouped['first_enter'] - pd.Timedelta(days=1)
# joining tables to calculate total costs
total_costs = pd.merge(total_costs, first_enter_by_user_grouped, how='outer', \
                       left_on=['day','source'], right_on = ['first_enter', 'user_source'])
total_costs.head()

In [None]:
# grouping total costs and total number of users by source
total_costs_final = total_costs.pivot_table(index='source', margins=True, aggfunc=sum, margins_name='Total')\
                                                                                                .reset_index()
total_costs_final.style.format({'cost': '{:.2f}'}).background_gradient(cmap='Greens', axis=0)

In total, 7603.53 was spent on advertising and 13576 users were attracted. The most expensive source is yandex_direct, but it also attracted the largest number of users. Let's visualize the costs of attracting users by day, broken down by source.

In [None]:
# expenses by day by source
ad_costs_grouped = total_costs.groupby(['day', 'source'])['cost'].sum().reset_index()
fig = px.bar(ad_costs_grouped, x='day', y='cost', color='source', barmode='group', \
                         title="User acquisition costs by day/source")
fig.show()

According to the graph above, advertising costs across all sources roughly halved on May 4 compared to May 3, which was reflected in the number of new users. It is also worth noting that the costs for the three most popular sources differ only slightly, while the numbers of attracted users differ more noticeably. Let's check this by calculating the cost per user for each source.

In [None]:
# CAC (customer acquisition cost)
total_costs_final['CAC'] = total_costs_final['cost'] / total_costs_final['number_of_users']
total_costs_final.style.format({'cost': '{:.2f}'}).background_gradient(cmap='Greens', axis=0)

YouTube turned out to be the cheapest source per acquired user, and yandex_direct is the most cost-effective of the remaining sources, with a cost of 0.46 per user. Facebook is the least efficient, at 0.78 per user.
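
For reference, CAC is simply total spend divided by the number of acquired users. The sketch below uses invented per-source numbers (`toy_totals` is hypothetical) plus the two overall totals quoted in this report, and illustrates that the overall CAC is a ratio of sums rather than an average of the per-source values:

```python
import pandas as pd

# Invented per-source totals, for illustration only.
toy_totals = pd.DataFrame({
    'source': ['source_a', 'source_b'],
    'cost': [1000.0, 600.0],
    'number_of_users': [2000, 1000],
})
toy_totals['CAC'] = toy_totals['cost'] / toy_totals['number_of_users']

# Overall CAC: total cost over total users, not the mean of per-source CACs.
toy_overall_cac = toy_totals['cost'].sum() / toy_totals['number_of_users'].sum()
toy_mean_of_cacs = toy_totals['CAC'].mean()

# The same formula applied to the overall totals quoted in this report.
report_overall_cac = 7603.53 / 13576
print('Toy overall CAC: {:.3f}'.format(toy_overall_cac))
print('Toy mean of per-source CACs: {:.3f}'.format(toy_mean_of_cacs))
print('Report overall CAC: {:.2f}'.format(report_overall_cac))
```

The two toy figures differ because sources with more users carry more weight in the ratio of sums.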

***Conclusion***

During exploratory analysis of user behavior data in the game, as well as user acquisition sources:
1. The average number of daily users in the game was 2884. The number of users rose sharply up to May 10 and dropped sharply on May 11; the drop coincides with the decrease in advertising costs across all sources.
2. The graph of user activity by day follows the number of users by day and also shows a sharp decline on May 11. Events are dominated by buildings, which are needed in greater numbers than other events to complete the level.
3. The `assembly_shop` and `spaceport` buildings are the most popular, while the `research_center` building is much less popular. This should be studied in more detail when planning the monetization model, which is expected to be placed on the object construction screens.
4. Only 43% of users completed the first level. Most of them chose battle as a strategy; roughly half as many chose the orbital station construction project.
5. The largest number of users comes from yandex_direct (35.5%), followed by instagram (24.7%) and facebook (20.1%); the fewest users come from youtube (19.8% of the total number of users). User strategies in the game do not differ depending on the acquisition source.
6. In total, 7603.53 was spent on advertising and 13576 users were attracted. The most expensive source is yandex_direct (2233), but it also attracted the largest number of users. The lowest CAC is for youtube (0.39) and the highest is for facebook (0.78).

## Hypothesis testing

### Hypothesis test: the time to complete the level differs depending on the way of passing: through the implementation of the project or through the victory over the first player

To test the hypothesis that the average time to complete a level differs depending on the chosen strategy, we will:
- determine the time of each user's first entry into the game;
- join the first-entry data with the strategy chosen by each user;
- calculate the time taken to pass level 1 for the users who completed it.

In [None]:
# calculation of the time of the first entry into the game by users and its joining with the general dataset
first_enter_datetime = df.groupby(['source', 'user_id'])['event_datetime'].min().reset_index()\
                         .rename(columns={'event_datetime': 'first_enter_datetime'})
df = df.merge(first_enter_datetime[['user_id', 'first_enter_datetime']])

# adding information about the strategy by players to the one dataset
df = df.merge(user_profiles[['user_id', 'level_1_strategy']])

# calculation of the time spent on level 1 by each player who completed the level
df_level_timing = df.query('level_1_strategy != "not finished" and event == "finished_stage_1"').copy()
# whole days between the first event and the level completion event
df_level_timing['level_timing'] = (df_level_timing['event_datetime'] - df_level_timing['first_enter_datetime'])\
                                                                                     .dt.days
df_level_timing.head()

Next, let's look at the distribution of time for players to complete level 1, depending on the chosen strategy.

In [None]:
# distribution of time to pass the 1st level depending on the chosen strategies
fig = px.histogram(df_level_timing, x='level_timing', color='level_1_strategy', \
                           title='Distribution of time to complete level 1 depending on the player strategy')
fig.show()

Further, to test the hypothesis, we will use Student's t-test, because:

- judging by the graph, the data is approximately normally distributed;
- there are 2 independent samples (different strategies);
- we compare the means of 2 populations.

The significance criterion or the level of significance is defined as 0.05 (5%). Thus, when testing statistical hypotheses, the probability of making a Type I error, that is, rejecting a true null hypothesis, will be 5%.

We formulate and test a hypothesis.
* **H₀** (null hypothesis) - there are no differences in the average values of the passage time of level 1 between the sample of users who complete the level by building the project and the sample of users who complete the level by winning the battle.
* **H₁** (alternative hypothesis) - there are differences in the average values of the time to pass level 1 between samples of users, depending on the strategy they choose.

Next, we form the samples and check their variances.

In [None]:
# forming the samples
project_strategy = df_level_timing[df_level_timing['level_1_strategy'] == 'project']['level_timing']
fight_strategy = df_level_timing[df_level_timing['level_1_strategy'] == 'fight']['level_timing']
# checking variances
print('project_strategy: {:.2f}'.format(np.var(project_strategy)))
print('fight_strategy: {:.2f}'.format(np.var(fight_strategy)))

The samples have different variance - let's add the parameter equal_var = False when testing the hypothesis.
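
Comparing two variance values by eye is a judgment call. As a hedged sketch on synthetic samples (the `sample_a` and `sample_b` arrays and their parameters are invented), Levene's test from `scipy.stats` can support the decision, and `equal_var=False` switches `ttest_ind` to Welch's t-test, which does not assume equal variances:

```python
import numpy as np
from scipy import stats as st

# Synthetic samples standing in for the two strategies; parameters are invented.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=13.0, scale=3.5, size=300)
sample_b = rng.normal(loc=11.0, scale=4.0, size=600)

# Sample variances (ddof=1) for a quick look.
print('variance a: {:.2f}'.format(np.var(sample_a, ddof=1)))
print('variance b: {:.2f}'.format(np.var(sample_b, ddof=1)))

# Levene's test: H0 is that the two samples have equal variances.
levene_result = st.levene(sample_a, sample_b)
print('Levene p-value:', levene_result.pvalue)

# Welch's t-test: compares means without assuming equal variances.
welch_result = st.ttest_ind(sample_a, sample_b, equal_var=False)
print('Welch p-value:', welch_result.pvalue)
```

Using Welch's version by default is a common conservative choice, since it behaves much like the classic t-test when the variances happen to be equal.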

In [None]:
# t-test hypothesis test function
def ttest(sample1, sample2, alpha, equal_var):
    # alpha is the significance threshold; equal_var=False runs Welch's t-test
    results = st.ttest_ind(sample1, sample2, equal_var=equal_var)
    print('p-value:', results.pvalue)

    if results.pvalue < alpha:
        print("Rejecting the null hypothesis")
    else:
        print("Failed to reject the null hypothesis")

In [None]:
# testing hypothesis 
ttest(project_strategy, fight_strategy, 0.05, False)

Hypothesis testing showed that the average time to complete level 1 differs depending on the strategy chosen by the player. On average, players who complete the level through the project take longer than those who choose combat as a strategy.

### Hypothesis test: the frequency of visiting the game differs depending on the source from which the users came

To test whether the frequency of visiting the game differs by the source the users came from, we calculate the number of days each user spent in the game, broken down by acquisition source.

In [None]:
# number of days that users spent in the game (per user)
# note: count() tallies non-null rows, so this equals the number of days
# only if df holds one row per user per day
data = df.groupby(['source', 'user_id'])['date'].count().reset_index()
# histogram of the distribution of the frequency of visiting the game by sources of user acquisition
fig = px.histogram(data, x='date', color='source',
                   title='Distribution of the frequency of visiting the game by sources of user acquisition')
fig.show()

The distribution of the number of in-game days is not normal. Therefore, instead of a t-test we use the Mann-Whitney U-test, a rank-based test that does not assume normality.
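
Besides reading the histogram, normality can be checked with a formal test. The following is a minimal sketch using the Shapiro-Wilk test (`scipy.stats.shapiro`), which was not part of the original analysis; the `days` array is synthetic, right-skewed count data standing in for the per-user counts.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(1)
# synthetic right-skewed counts (not the project data)
days = rng.geometric(p=0.15, size=1000)

# Shapiro-Wilk test: H0 says the sample comes from a normal distribution
statistic, pvalue = st.shapiro(days)
is_normal = bool(pvalue >= 0.05)

print(f'W = {statistic:.3f}, p-value = {pvalue:.3g}, normal: {is_normal}')
```

For a sample this skewed the test rejects normality, which supports switching to a non-parametric test.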

Formulation of the second hypothesis:
* **H₀** (null hypothesis) - the average number of in-game days is equal across user samples from different acquisition sources.
* **H₁** (alternative hypothesis) - the average number of in-game days differs across user samples depending on the acquisition source.

In [None]:
# samples for the test: number of in-game days per user, one sample per source
facebook = (df.query('source == "facebook_ads"')
              .groupby(['source', 'user_id'])['date'].count().reset_index()['date'])
instagram = (df.query('source == "instagram_new_adverts"')
               .groupby(['source', 'user_id'])['date'].count().reset_index()['date'])
yandex = (df.query('source == "yandex_direct"')
            .groupby(['source', 'user_id'])['date'].count().reset_index()['date'])
youtube = (df.query('source == "youtube_channel_reklama"')
             .groupby(['source', 'user_id'])['date'].count().reset_index()['date'])

In [None]:
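# function for testing hypotheses using the Mann-Whitney U-test
def mann_whit(sample1, sample2, alpha):
    # request the two-sided alternative explicitly: older SciPy versions used a different default
    results = st.mannwhitneyu(sample1, sample2, alternative='two-sided')
    print('p-value: {:.4f}'.format(results.pvalue))

    if results.pvalue < alpha:
        print("Rejecting the null hypothesis")
    else:
        print("Failed to reject the null hypothesis")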

Let's test the hypothesis pairwise across user sources. Multiple testing raises the probability of at least one false positive (the family-wise error rate, FWER), so the significance level has to be adjusted. We use the Bonferroni correction: each of the m comparisons is tested at a significance level m times smaller than the one required for a single comparison. In other words, we divide alpha by the number of comparisons (6 in our case).
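
The arithmetic behind the correction can be sketched as follows. The source names are the ones used in this notebook; the uncorrected FWER formula assumes independent tests, which is a simplification made only for the illustration.

```python
from itertools import combinations

sources = ['facebook_ads', 'instagram_new_adverts',
           'yandex_direct', 'youtube_channel_reklama']

# every unordered pair of sources is one comparison
pairs = list(combinations(sources, 2))
m = len(pairs)

alpha = 0.05
# Bonferroni: each comparison is tested at alpha / m
alpha_corrected = alpha / m

# chance of at least one false positive without correction,
# assuming the m tests are independent
fwer_uncorrected = 1 - (1 - alpha) ** m

print(f'comparisons: {m}')
print(f'corrected alpha: {alpha_corrected:.4f}')
print(f'uncorrected FWER: {fwer_uncorrected:.3f}')
```

With four sources there are 6 pairs, so each test runs at roughly 0.0083 instead of 0.05; without the correction the chance of at least one false positive would be about 26% under the independence assumption.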

In [None]:
mann_whit(facebook, instagram, 0.05/6)

In [None]:
mann_whit(facebook, yandex, 0.05/6)

In [None]:
mann_whit(facebook, youtube, 0.05/6)

In [None]:
mann_whit(instagram, yandex, 0.05/6)

In [None]:
mann_whit(instagram, youtube, 0.05/6)

In [None]:
mann_whit(yandex, youtube, 0.05/6)

Based on the hypothesis tests, we can conclude:
- statistically significant differences in the number of in-game days were found only between users from *facebook* and *yandex*. For the other pairs of sources the null hypothesis was not rejected, so the data give no evidence that the number of in-game days differs.
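
The six pairwise tests can also be run in a single loop, which keeps the corrected threshold in one place. This is only a sketch: the `samples` dictionary holds synthetic Poisson counts (not the project data), and the `results` structure is a hypothetical way of collecting the outcomes.

```python
from itertools import combinations

import numpy as np
from scipy import stats as st

rng = np.random.default_rng(7)
# synthetic per-user counts for illustration only
samples = {
    'facebook': rng.poisson(lam=7.0, size=400),
    'instagram': rng.poisson(lam=6.8, size=500),
    'yandex': rng.poisson(lam=6.5, size=700),
    'youtube': rng.poisson(lam=6.7, size=400),
}

pairs = list(combinations(samples, 2))
alpha_corrected = 0.05 / len(pairs)  # Bonferroni correction

results = {}
for first, second in pairs:
    pvalue = st.mannwhitneyu(samples[first], samples[second],
                             alternative='two-sided').pvalue
    # True means H0 is rejected for this pair
    results[(first, second)] = bool(pvalue < alpha_corrected)
    print(f'{first} vs {second}: p-value = {pvalue:.4f}, '
          f'reject H0: {results[(first, second)]}')
```

In the notebook, the real `facebook`, `instagram`, `yandex` and `youtube` series would take the place of the synthetic arrays.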

### Hypothesis test: the number of buildings in the game differs depending on the source from which the users came

Next, we check whether the number of buildings users construct depends on the source of their acquisition.

In [None]:
# calculating the number of buildings per user and source
data_building = df.groupby(['source', 'user_id'])['building_type'].count().reset_index()
# distribution of the number of buildings by users and by sources of user acquisition
fig = px.histogram(data_building, x='building_type', color='source',
                   title='Distribution of the number of buildings by users and by sources of user acquisition')
fig.show()

The distribution of the number of buildings is not normal either, so the Mann-Whitney U-test is applicable here as well.

Formulation of the third hypothesis:
* **H₀** (null hypothesis) - the average number of buildings is equal across user samples from different acquisition sources.
* **H₁** (alternative hypothesis) - the average number of buildings differs across user samples depending on the acquisition source.

In [None]:
# samples: number of buildings per user, one sample per source
facebook = (df.query('source == "facebook_ads"')
              .groupby(['source', 'user_id'])['building_type'].count().reset_index()['building_type'])
instagram = (df.query('source == "instagram_new_adverts"')
               .groupby(['source', 'user_id'])['building_type'].count().reset_index()['building_type'])
yandex = (df.query('source == "yandex_direct"')
            .groupby(['source', 'user_id'])['building_type'].count().reset_index()['building_type'])
youtube = (df.query('source == "youtube_channel_reklama"')
             .groupby(['source', 'user_id'])['building_type'].count().reset_index()['building_type'])

Let's test the hypothesis pairwise by source. We again apply the Bonferroni correction to control FWER: we divide the significance level by the number of comparisons for these samples (6).

In [None]:
mann_whit(facebook, instagram, 0.05/6)

In [None]:
mann_whit(facebook, yandex, 0.05/6)

In [None]:
mann_whit(facebook, youtube, 0.05/6)

In [None]:
mann_whit(instagram, yandex, 0.05/6)

In [None]:
mann_whit(instagram, youtube, 0.05/6)

In [None]:
mann_whit(yandex, youtube, 0.05/6)

Based on the hypothesis tests, we can conclude:
- Statistically significant differences in the number of buildings were found for two pairs of sources: *facebook* and *yandex*, as well as *facebook* and *youtube*. For the other pairs the null hypothesis was not rejected, so the data give no evidence that the number of buildings differs between them.

***Conclusion***

Three hypotheses were tested:
1. The average time to complete level 1 differs depending on the strategy chosen by the player: players who complete the project to pass level 1 take longer than those who choose combat.
2. The frequency of entering the game (the average number of in-game days) differs for only one pair of acquisition sources, *facebook* and *yandex*; for the other pairs no significant differences were found.
3. The average number of buildings per user differs only for two pairs, *facebook* and *yandex* as well as *facebook* and *youtube*; for the other pairs no significant differences were found.

## General conclusion

In the course of the project, we studied data on the events performed by users in the mobile game "Space Brothers" (table `game_actions`), the sources of user acquisition (table `user_source`) and the costs for each acquisition source (table `ad_costs`). During data preparation and pre-processing:
- The date columns in the `game_actions` and `ad_costs` tables were converted to a datetime type so that they can be handled correctly later on.
- Duplicates were removed; the data gaps turned out not to be errors, so they require neither processing nor deletion.
- The `game_actions` and `user_source` tables were merged into a common dataframe to simplify the analysis of user behavior by acquisition source, and a `date` column was added.

**Exploratory analysis**
* The average number of unique users per day in the game is 2884; a sharp increase in users was observed on May 10 and a sharp decrease on May 11. The jump correlates with a decrease in advertising costs on the indicated dates across all sources.
* Each user built 9 buildings on average; `assembly_shop` and `spaceport` are the most popular building types, while `research_center` is significantly less popular. This can be useful for the monetization model, which is planned for the building selection pages.
* Only 43% of users completed the first level. Most of them chose 'battle' as a strategy; half as many chose the 'orbital station construction project'.
* The most productive traffic source is *yandex_direct* (35.5% of all users), followed by *instagram* (24.7%) and *facebook* (20.1%); *youtube* brings the fewest users (19.8%).
* User strategies in the game do not differ depending on the source of their attraction.
* In total, 7603.53 was spent on advertising and 13,576 users were acquired. The most expensive source was *yandex_direct* (2233), but it also brought in the largest number of users.
* The most cost-efficient traffic source is *youtube* (CAC = 0.39), and the most expensive in terms of the cost of acquiring 1 user is *facebook* (CAC = 0.78).
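
For reference, CAC (customer acquisition cost) is the ad spend divided by the number of users acquired. The sketch below applies it to the overall totals quoted above; the helper name `cac` is hypothetical, and per-source values would need the per-source spend and user counts from the earlier cells.

```python
def cac(ad_spend, users_acquired):
    """Customer acquisition cost: ad spend per acquired user."""
    if users_acquired <= 0:
        raise ValueError('users_acquired must be positive')
    return ad_spend / users_acquired


# overall totals reported in the exploratory analysis
total_spend = 7603.53
total_users = 13576

overall_cac = cac(total_spend, total_users)
print(f'Overall CAC: {overall_cac:.2f}')  # prints "Overall CAC: 0.56"
```

The overall value of about 0.56 sits between the per-source extremes of 0.39 (*youtube*) and 0.78 (*facebook*).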

**Hypothesis Testing**
1. Users who choose 'Combat' rather than 'Project' as their strategy complete the level faster on average.
2. The frequency of entering the game (average number of in-game days) differs only between the sources *facebook* and *yandex*.
3. The average number of buildings per user differs only for two pairs of sources: *facebook* and *yandex*, as well as *facebook* and *youtube*.

_______
**Result**

To answer the product manager's question about differences in user behavior across acquisition channels: the more effective and cost-efficient sources of user acquisition were identified, but users from different traffic sources showed no clear differences in behavior, activity or in-game strategy. One point deserves separate attention: strategies are clearly skewed towards 'combat', which requires fewer buildings and lets players finish the level faster. A monetization model built around the building screens may therefore underperform, and this requires further study.