# Exploratory Data Analysis

We are going to look for patterns among players who only played in the month of June. The purpose is to distinguish players who hold short positions from those who hold long positions, and to understand why they hold these positions.

In [55]:
# Import all necessary libraries for the project
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import warnings
warnings.filterwarnings('ignore')


In [56]:
# Define working directory
path = "/Users/mau/Library/CloudStorage/Dropbox/Mac/Documents/Dissertation/Chapter 2/Entire_Data/By month"
os.chdir(path)

# Columns to load (named `cols` to avoid shadowing the built-in `filter`)
cols = ['playerkey', 'playercashableamt', 'wageredamt', 'maxbet', 'grosswin',
       'currencyinamt', 'assetnumber', 'theoreticalpaybackpercent',
       'age', 'rank', 'gender', 'date', 'start_time', 'end_time', 'duration',
       'slotdenomination']
# Import data
dtf = pd.read_parquet('month_6_year_2015.parquet', columns=cols)

# Check data
dtf.columns

Index(['playerkey', 'playercashableamt', 'wageredamt', 'maxbet', 'grosswin',
       'currencyinamt', 'assetnumber', 'theoreticalpaybackpercent', 'age',
       'rank', 'gender', 'date', 'start_time', 'end_time', 'duration',
       'slotdenomination'],
      dtype='object')

In [51]:
# Sort data by playerkey, then date, then start_time
dtf = dtf.sort_values(by=['playerkey', 'date', 'start_time'])

## Calculate Fundamental Variables

The following variables were calculated using existing data:
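* _player_loss_: the negative of the casino's gross win (`grosswin`) on each gamble, i.e. the player's net result (negative when the player loses money).
* _player_wins_: the amount wagered plus the player's net result, i.e. how much the player gets back from each gamble.
* _percent_return_: the percentage return on the player's wager for each gamble played.

$$\text{percent return} = \left(\frac{\text{player\_wins} - \text{wageredamt}}{\text{wageredamt}}\right) \times 100$$

* _playercashableamt_pct_change_: the percent rate of change of the player's outstanding gambling amount between consecutive gambles, where $t$ indexes a player's gambles in chronological order.

$$\text{playercashableamt \% change} = \left(\frac{\text{playercashableamt}_{t} - \text{playercashableamt}_{t-1}}{\text{playercashableamt}_{t-1}}\right) \times 100$$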
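As a quick check of the definitions above, here is a minimal sketch on a toy DataFrame. The values are hypothetical and are not taken from the casino data; only the column names follow the notebook. Note that the percent return simplifies to `-grosswin / wageredamt * 100`.

```python
import pandas as pd

# Toy data (hypothetical values, not from the casino dataset)
toy = pd.DataFrame({
    "playerkey": [1, 1, 1, 2, 2],
    "wageredamt": [10.0, 10.0, 20.0, 5.0, 5.0],
    "grosswin": [10.0, -5.0, 20.0, 5.0, -10.0],  # casino's gross win per gamble
    "playercashableamt": [100.0, 90.0, 95.0, 50.0, 45.0],
})

# Player's net result is the negative of the casino's gross win
toy["player_loss"] = -toy["grosswin"]

# Amount the player gets back: wager plus net result
toy["player_wins"] = toy["wageredamt"] + toy["player_loss"]

# Percent return on each wager
toy["percent_return"] = (
    (toy["player_wins"] - toy["wageredamt"]) / toy["wageredamt"] * 100
)

# Percent change of the cashable amount between consecutive gambles,
# computed within each player (the first gamble of a player is NaN)
toy["playercashableamt_pct_change"] = (
    toy.groupby("playerkey")["playercashableamt"].pct_change() * 100
)

print(toy[["playerkey", "percent_return", "playercashableamt_pct_change"]])
```

A player who loses the whole wager has a percent return of -100, and the first gamble of every player has no percent change because there is no previous balance to compare against.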

In [90]:
# Create a new column, named "player_loss", that is the negative of the casino's grosswin
dtf["player_loss"] = dtf["grosswin"] * -1

# how much each player wins
dtf['player_wins'] = dtf['wageredamt'] + dtf['player_loss']

# Calculate percentage return for each gamble and add it as a new column
dtf["percent_return"] = (dtf["player_wins"] - dtf["wageredamt"]) / dtf["wageredamt"] * 100

# Calculate the percent rate of change of playercashableamt per playerkey
# (multiplied by 100 so the column is expressed in percent, as in the formula)
dtf["playercashableamt_pct_change"] = dtf.groupby("playerkey")["playercashableamt"].pct_change() * 100

# Create a time series variable for each player that starts at 1 and increases by 1 for each row
dtf["gambles"] = dtf.groupby("playerkey").cumcount() + 1


### Calculate Number of Visits

In [58]:
# Group the DataFrame by playerkey
groups = dtf.groupby('playerkey')

# Number each player's visits: a new visit starts when the date advances by at least one day
dtf['visit'] = groups['date'].transform(lambda x: (x.diff().dt.days >= 1).cumsum() + 1)

# Make sure the visit count starts at 1 for every player
# (transform keeps the result aligned with the original index)
dtf['visit'] = dtf.groupby('playerkey')['visit'].transform(lambda x: x - x.iloc[0] + 1)
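To illustrate the visit numbering, here is a small sketch on hypothetical data. It assumes `date` is a datetime column and that rows are already sorted by player and date; it uses `groupby(...).diff()` and `cumsum()` instead of the lambda, which should give the same numbering.

```python
import pandas as pd

# Toy data (hypothetical): player 1 comes on two different days, player 2 on one
toy = pd.DataFrame({
    "playerkey": [1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2015-06-01", "2015-06-01", "2015-06-03",
        "2015-06-02", "2015-06-02",
    ]),
})
toy = toy.sort_values(["playerkey", "date"])

# Flag rows whose date is at least one day after the player's previous row
# (the first row of each player has no previous row and is not flagged)
new_day = toy.groupby("playerkey")["date"].diff().dt.days.ge(1).astype(int)

# Running count of flagged rows per player, starting at 1
toy["visit"] = new_day.groupby(toy["playerkey"]).cumsum() + 1

print(toy)
```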


### Calculate Sessions based on Time

A pause of more than 30 minutes between consecutive gambles marks the end of a session and the start of a new one.

In [89]:
# Convert the start_time column to datetime
dtf["start_time"] = pd.to_datetime(dtf["start_time"])

# Sort data by playerkey, then date, then start_time
dtf = dtf.sort_values(by=['playerkey', 'date', 'start_time'])

# Compute the time difference between consecutive gambles for each player
dtf['time_diff'] = dtf.groupby('playerkey')['start_time'].diff()

# Start from a per-player offset (group number + 1); it is normalized away in the last step
dtf['session_time'] = dtf.groupby('playerkey').ngroup() + 1

# Increment the counter every time the gap to the player's previous gamble exceeds 30 minutes
dtf['session_time'] += (dtf['time_diff'] > pd.Timedelta(minutes=30)).cumsum()

# Restart the session_time count at 1 for each player and visit
# (transform keeps the result aligned with the original index)
dtf['session_time'] = dtf.groupby(['playerkey', 'visit'])['session_time'].transform(lambda x: x - x.iloc[0] + 1)

# Remove the temporary time_diff column
dtf = dtf.drop('time_diff', axis=1)
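The time-based session rule can be sketched on hypothetical data as follows. This is a simplified version that assumes a single visit per player, so it skips the per-visit reset used in the cell above.

```python
import pandas as pd

# Toy data (hypothetical timestamps)
toy = pd.DataFrame({
    "playerkey": [1, 1, 1, 1, 2, 2],
    "start_time": pd.to_datetime([
        "2015-06-01 10:00", "2015-06-01 10:05", "2015-06-01 10:50",
        "2015-06-01 10:55", "2015-06-01 09:00", "2015-06-01 09:30",
    ]),
})
toy = toy.sort_values(["playerkey", "start_time"])

# Gap to the same player's previous gamble (NaT for the first gamble)
gap = toy.groupby("playerkey")["start_time"].diff()

# A gap strictly longer than 30 minutes opens a new session
new_session = (gap > pd.Timedelta(minutes=30)).astype(int)

# Running count of new sessions per player, starting at 1
toy["session_time"] = new_session.groupby(toy["playerkey"]).cumsum() + 1

print(toy)
```

Player 2 has a gap of exactly 30 minutes, which does not open a new session because the comparison is strict.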

### Calculate Sessions based on Machine Change

Every time a player switches machines, a new session begins.

In [91]:
# Increment a running counter at each player's first gamble and at every machine change
dtf['session_machine'] = (dtf.groupby("playerkey")["assetnumber"].diff() != 0).cumsum()

# Restart the session_machine count at 1 for each player and visit
# (transform keeps the result aligned with the original index)
dtf['session_machine'] = dtf.groupby(['playerkey', 'visit'])['session_machine'].transform(lambda x: x - x.iloc[0] + 1)
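A minimal sketch of the machine-change rule on hypothetical data is shown below. It compares each `assetnumber` with the player's previous one using `shift()` rather than `diff()`, which is an alternative that would also work if machine identifiers were not numeric. As before, it assumes one visit per player.

```python
import pandas as pd

# Toy data (hypothetical machine numbers)
toy = pd.DataFrame({
    "playerkey": [1, 1, 1, 1, 2, 2],
    "assetnumber": [101, 101, 205, 101, 300, 300],
})

# Machine used on the same player's previous gamble (NaN for the first gamble)
previous_machine = toy.groupby("playerkey")["assetnumber"].shift()

# True on a player's first gamble and whenever the machine changes
switched = (toy["assetnumber"] != previous_machine).astype(int)

# Running count of machine sessions per player, starting at 1
toy["session_machine"] = switched.groupby(toy["playerkey"]).cumsum()

print(toy)
```

Returning to a machine that was used earlier (101 in the last row of player 1) still counts as a new session.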


### Calculate the number of gambles per Visit

In [93]:
# Create a column named "gambles_visit" that numbers each gamble within its visit
# (cumcount() already restarts at 0 in every (playerkey, visit) group, so no separate reset is needed)
dtf["gambles_visit"] = dtf.groupby(["playerkey", "visit"])["gambles"].cumcount() + 1


### Calculate the number of gambles per Session Time

In [94]:
# Create a column named "gambles_session" that numbers each gamble within its session
# (cumcount() already restarts at 0 in every (playerkey, session_time) group, so no separate reset is needed)
dtf["gambles_session"] = dtf.groupby(["playerkey", "session_time"])["gambles"].cumcount() + 1


### Calculate the number of gambles per Session Machine

In [95]:
# Create a column named "gambles_machine" that numbers each gamble within its session_machine
# (cumcount() already restarts at 0 in every (playerkey, session_machine) group, so no separate reset is needed)
dtf["gambles_machine"] = dtf.groupby(["playerkey", "session_machine"])["gambles"].cumcount() + 1
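The running counters built in the last three cells all follow the same pattern. The sketch below, on hypothetical data, shows how `cumcount()` restarts in every group and why the maximum of the counter in a group equals the group size.

```python
import pandas as pd

# Toy data (hypothetical): player 1 has two machine sessions, player 2 has one
toy = pd.DataFrame({
    "playerkey": [1, 1, 1, 2, 2],
    "session_machine": [1, 1, 2, 1, 1],
})

keys = ["playerkey", "session_machine"]

# Running counter that restarts at 1 in every (player, session) group
toy["gambles_machine"] = toy.groupby(keys).cumcount() + 1

# Number of gambles per group, computed two equivalent ways
sizes = toy.groupby(keys).size()
max_counter = toy.groupby(keys)["gambles_machine"].max()

print(toy)
print(sizes)
```

Taking `.max()` of the counter per group, as the summary cells do, gives the same numbers as `.size()`.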

## Frequencies of Gambles

### General

Let's see who plays the most and the least, along with the minimum, maximum, average, and median number of gambles, without distinguishing between visits, time-based sessions, or machine-based sessions.

In [103]:
# Total number of players
print("Total number of players:", dtf["playerkey"].nunique())

# Total number of gambles
print("Total number of gambles:", dtf.groupby('playerkey')['gambles'].max().sum())

print("--------------------------------------------")
# What is the maximum number of gambles played by a single player?
print("Maximum # of gambles of a single player:", dtf.groupby('playerkey')['gambles'].max().max())

# Who is the player with the maximum number of gambles?
print("Player who gambled the most:", dtf.groupby('playerkey')['gambles'].max().idxmax())

print("--------------------------------------------")

# What is the minimum number of periods played by each player?
print("Minimum # of gambles of a single player:", dtf.groupby('playerkey')['gambles'].max().min())

#Who is the player with the minimum number of periods played?
print("Player who gambled the least:", dtf.groupby('playerkey')['gambles'].max().idxmin())

print("--------------------------------------------")

# What is the median number of periods played by each player?
print("Median # of gambles of all players:", round(dtf.groupby('playerkey')['gambles'].max().median(), 2))

# How many players gambled exactly, fewer than, and more than 150 times (the median)?
print("Number of players who played exactly 150 times:", dtf.groupby('playerkey')['gambles'].max()[dtf.groupby('playerkey')['gambles'].max() == 150].count())
print("Number of players who gambled less than 150 times:", dtf.groupby('playerkey')['gambles'].max()[dtf.groupby('playerkey')['gambles'].max() < 150].count())
print("Number of players who gambled more than 150 times:", dtf.groupby('playerkey')['gambles'].max()[dtf.groupby('playerkey')['gambles'].max() > 150].count())
print("--------------------------------------------")

# Most common # of gambles per person
print("Most common # of gambles:", dtf.groupby('playerkey')['gambles'].max().mode().tolist())

# What is the average number of periods played by each player?
print("Average # of gambles of all players:", round(dtf.groupby('playerkey')['gambles'].max().mean(), 2))

# How many players played at least 278 gambles?
print("Number of players who gambled at least 278 times:", dtf.groupby('playerkey')['gambles'].max()[dtf.groupby('playerkey')['gambles'].max() >= 278].count())

Total number of players: 282
Total number of gambles: 78246
--------------------------------------------
Maximum # of gambles of a single player: 3107
Player who gambled the most: 33
--------------------------------------------
Minimum # of gambles of a single player: 1
Player who gambled the least: 15
--------------------------------------------
Median # of gambles of all players: 150.0
Number of players who played exactly 150 times: 3
Number of players who gambled less than 150 times: 140
Number of players who gambled more than 150 times: 139
--------------------------------------------
Most common # of gambles: [2, 15]
Average # of gambles of all players: 277.47
Number of players who gambled at least 278 times: 87
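The statistics above all derive from one per-player series, the number of gambles per player. As a possible refactor, that series can be computed once and summarized in a small helper. The function name `summarize_counts` and the toy data are hypothetical, and this is only a sketch of the idea, not the notebook's own code.

```python
import pandas as pd

def summarize_counts(counts: pd.Series) -> dict:
    """Summary statistics for a Series of per-player gamble counts."""
    return {
        "total": int(counts.sum()),
        "max": int(counts.max()),
        "player_max": counts.idxmax(),
        "min": int(counts.min()),
        "median": float(counts.median()),
        "mean": round(float(counts.mean()), 2),
        "mode": counts.mode().tolist(),
        # Players at or above the mean number of gambles
        "n_at_least_mean": int((counts >= counts.mean()).sum()),
    }

# Toy data (hypothetical): one row per gamble
toy = pd.DataFrame({"playerkey": [1, 1, 1, 2, 3, 3]})

# One pass over the data: number of gambles per player
gambles_per_player = toy.groupby("playerkey").size()

summary = summarize_counts(gambles_per_player)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In the notebook, `gambles_per_player` would correspond to `dtf.groupby('playerkey')['gambles'].max()`.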


### Visits and Sessions

In [106]:
# Calculate the average number of visits per player
print("Average # of visits per player:", round(dtf.groupby('playerkey')['visit'].max().mean(), 2))

# Calculate median number of visits per player
print("Median # of visits per player:", round(dtf.groupby('playerkey')['visit'].max().median(), 2))

# Calculate the most common number of visits per player
print("Most common # of visits per player:", dtf.groupby('playerkey')['visit'].max().mode().tolist())

# Separation line
print("--------------------------------------------")

# Calculate the average number of gambles per visit
print("Average # of gambles per visit:", round(dtf.groupby(['playerkey', 'visit'])['gambles_visit'].max().mean(), 2))

# Calculate the median number of gambles per visit
print("Median # of gambles per visit:", round(dtf.groupby(['playerkey', 'visit'])['gambles_visit'].max().median(), 2))

# Calculate the most common number of gambles per visit
print("Most common # of gambles per visit:", dtf.groupby(['playerkey', 'visit'])['gambles_visit'].max().mode().tolist())

# Separation line
print("--------------------------------------------")

# Calculate the average number of gambles per session
print("Average # of gambles per session:", round(dtf.groupby(['playerkey', 'session_time'])['gambles_session'].max().mean(), 2))

# Calculate the median number of gambles per session
print("Median # of gambles per session:", round(dtf.groupby(['playerkey', 'session_time'])['gambles_session'].max().median(), 2))

# Calculate the most common number of gambles per session
print("Most common # of gambles per session:", dtf.groupby(['playerkey', 'session_time'])['gambles_session'].max().mode().tolist())

# Separation line
print("--------------------------------------------")

# Calculate the average number of gambles per session_machine
print("Average # of gambles per session_machine:", round(dtf.groupby(['playerkey', 'session_machine'])['gambles_machine'].max().mean(), 2))

# Calculate the median number of gambles per session_machine
print("Median # of gambles per session_machine:", round(dtf.groupby(['playerkey', 'session_machine'])['gambles_machine'].max().median(), 2))

# Calculate the most common number of gambles per session_machine
print("Most common # of gambles per session_machine:", dtf.groupby(['playerkey', 'session_machine'])['gambles_machine'].max().mode().tolist())


Average # of visits per player: 1.0
Median # of visits per player: 1.0
Most common # of visits per player: [1]
--------------------------------------------
Average # of gambles per visit: 277.47
Median # of gambles per visit: 150.0
Most common # of gambles per visit: [2, 15]
--------------------------------------------
Average # of gambles per session: 190.84
Median # of gambles per session: 111.0
Most common # of gambles per session: [23]
--------------------------------------------
Average # of gambles per session_machine: 25.36
Median # of gambles per session_machine: 1.0
Most common # of gambles per session_machine: [1]


## Durations

### General

Let's calculate the total, average, and median time played per player.

In [121]:
# Calculate the total duration played
print('Total duration played:', dtf['duration'].sum())

# Calculate the average duration played per player
print('Average duration played per player:', dtf.groupby('playerkey')['duration'].sum().mean())

# How many players played for more than the average duration?
print('Number of players who played more than the average duration:', dtf.groupby('playerkey')['duration'].sum()[dtf.groupby('playerkey')['duration'].sum() > dtf.groupby('playerkey')['duration'].sum().mean()].count())

# How many players played for less than the average duration?
print('Number of players who played less than the average duration:', dtf.groupby('playerkey')['duration'].sum()[dtf.groupby('playerkey')['duration'].sum() < dtf.groupby('playerkey')['duration'].sum().mean()].count())

# Separation line
print("--------------------------------------------")

# Calculate the median duration played per player
print('Median duration played per player:', dtf.groupby('playerkey')['duration'].sum().median())

# How many players played for more than the median duration?
print('Number of players who played more than the median duration:', dtf.groupby('playerkey')['duration'].sum()[dtf.groupby('playerkey')['duration'].sum() > dtf.groupby('playerkey')['duration'].sum().median()].count())

# How many players played for less than the median duration?
print('Number of players who played less than the median duration:', dtf.groupby('playerkey')['duration'].sum()[dtf.groupby('playerkey')['duration'].sum() < dtf.groupby('playerkey')['duration'].sum().median()].count())

# Separation line
print("--------------------------------------------")

# Calculate the minimum duration played per player
print('Minimum duration played per player:', dtf.groupby('playerkey')['duration'].sum().min())
# Who is the player with the minimum duration played?
print('Player with the minimum duration played:', dtf.groupby('playerkey')['duration'].sum().idxmin())

# Calculate the maximum duration played per player
print('Maximum duration played per player:', dtf.groupby('playerkey')['duration'].sum().max())
# Who is the player with the maximum duration played?
print('Player with the maximum duration played:', dtf.groupby('playerkey')['duration'].sum().idxmax())

Total duration played: 4 days 10:37:28.306000
Average duration played per player: 0 days 00:22:41.164205673
Number of players who played more than the average duration: 92
Number of players who played less than the average duration: 190
--------------------------------------------
Median duration played per player: 0 days 00:14:13.024500
Number of players who played more than the median duration: 141
Number of players who played less than the median duration: 141
--------------------------------------------
Minimum duration played per player: 0 days 00:00:03.970000
Player with the minimum duration played: 465
Maximum duration played per player: 0 days 02:34:25.548000
Player with the maximum duration played: 33
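In the output above, far more players fall below the average duration (190) than above it (92), while the median splits them evenly, which suggests a right-skewed distribution. The sketch below reproduces that kind of comparison on hypothetical timedelta data, assuming `duration` is a timedelta column as the printed output indicates.

```python
import pandas as pd

# Toy data (hypothetical durations per gamble)
toy = pd.DataFrame({
    "playerkey": [1, 1, 2, 3],
    "duration": pd.to_timedelta(["00:05:00", "00:10:00", "00:01:00", "00:02:00"]),
})

# Total time played per player, computed once and reused
per_player = toy.groupby("playerkey")["duration"].sum()

mean_duration = per_player.mean()
median_duration = per_player.median()

# Players above and below the mean total duration
n_above_mean = int((per_player > mean_duration).sum())
n_below_mean = int((per_player < mean_duration).sum())

# Convert to minutes for easier reading or plotting
minutes = per_player.dt.total_seconds() / 60

print("Mean:", mean_duration, "| Median:", median_duration)
print("Above mean:", n_above_mean, "| Below mean:", n_below_mean)
```

One heavy player pulls the mean well above the median, so most players end up below the mean.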


### Visits and Sessions

In [123]:
# What is the average duration played per visit?
print('Average duration played per visit:', dtf.groupby(['playerkey', 'visit'])['duration'].sum().mean())

# What is the median duration played per visit?
print('Median duration played per visit:', dtf.groupby(['playerkey', 'visit'])['duration'].sum().median())

# Separation line
print("--------------------------------------------")

# What is the average duration played per session?
print('Average duration played per session:', dtf.groupby(['playerkey', 'session_time'])['duration'].sum().mean())

# What is the median duration played per session?
print('Median duration played per session:', dtf.groupby(['playerkey', 'session_time'])['duration'].sum().median())

# Separation line
print("--------------------------------------------")

# What is the average duration played per session_machine?
print('Average duration played per session_machine:', dtf.groupby(['playerkey', 'session_machine'])['duration'].sum().mean())

# What is the median duration played per session_machine?
print('Median duration played per session_machine:', dtf.groupby(['playerkey', 'session_machine'])['duration'].sum().median())

# Separation line
print("--------------------------------------------")

Average duration played per visit: 0 days 00:22:41.164205673
Median duration played per visit: 0 days 00:14:13.024500
--------------------------------------------
Average duration played per session: 0 days 00:15:36.215380487
Median duration played per session: 0 days 00:10:01.813000
--------------------------------------------
Average duration played per session_machine: 0 days 00:02:04.424086223
Median duration played per session_machine: 0 days 00:00:06.937000
--------------------------------------------


## EDA

Histograms