# Exploratory Data Analysis

We look for patterns among players who played only during the month of August. The goal is to distinguish players who hold short positions from those who hold long positions, and to understand why they hold these positions.

In [1]:
# Import all necessary libraries for the project
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import warnings
warnings.filterwarnings('ignore')


In [2]:
# Define working directory
path = "/Users/mau/Library/CloudStorage/Dropbox/Mac/Documents/Dissertation/Chapter 2/Entire_Data/By month"
os.chdir(path)

# Columns to load (named `columns` rather than `filter` to avoid shadowing the built-in)
columns = ['playerkey', 'playercashableamt', 'wageredamt', 'maxbet', 'grosswin',
           'currencyinamt', 'assetnumber', 'theoreticalpaybackpercent',
           'age', 'rank', 'gender', 'date', 'start_time', 'end_time', 'duration',
           'slotdenomination']
# Import data
dtf = pd.read_parquet('month_8_year_2015.parquet', columns=columns)

# Check data
dtf.columns

Index(['playerkey', 'playercashableamt', 'wageredamt', 'maxbet', 'grosswin',
       'currencyinamt', 'assetnumber', 'theoreticalpaybackpercent', 'age',
       'rank', 'gender', 'date', 'start_time', 'end_time', 'duration',
       'slotdenomination'],
      dtype='object')

In [3]:
# Sort data by playerkey first, then by start_time within each player
dtf = dtf.sort_values(by=['playerkey', 'start_time'])

## Calculate Fundamental Variables

The following variables were calculated using existing data:
* _player_loss_: the casino's gross win on each gamble with the sign flipped (`-grosswin`), so a negative value means the player lost money on that gamble.
* _player_wins_: the amount wagered plus _player_loss_ (`wageredamt + player_loss`).
* _percent_return_: the percentage return on the player's wager for each gamble played.

$$\text{percent\_return} = \frac{\text{player\_wins} - \text{wageredamt}}{\text{wageredamt}} \times 100$$

* _playercashableamt_pct_change_: the percentage rate of change of the player's outstanding cashable amount from one gamble to the next, where $t$ indexes a player's gambles in chronological order.

$$\text{playercashableamt \% change}_{t} = \frac{\text{playercashableamt}_{t} - \text{playercashableamt}_{t-1}}{\text{playercashableamt}_{t-1}} \times 100$$
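As a quick check of the arithmetic, the two formulas above can be applied to a tiny made-up table. This is only a sketch: the `toy` DataFrame and all of its values are hypothetical and are not taken from the casino data; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration): one player, three gambles.
toy = pd.DataFrame({
    "playerkey": [1, 1, 1],
    "wageredamt": [2.0, 2.0, 4.0],
    "grosswin": [2.0, -3.0, 1.0],            # casino's gross win per gamble
    "playercashableamt": [100.0, 98.0, 101.0],
})

# Sign-flipped casino gross win, then amount wagered plus that value.
toy["player_loss"] = -toy["grosswin"]
toy["player_wins"] = toy["wageredamt"] + toy["player_loss"]

# Percentage return per gamble, as in the first formula.
toy["percent_return"] = (
    (toy["player_wins"] - toy["wageredamt"]) / toy["wageredamt"] * 100
)

# Percentage change of the cashable amount, as in the second formula.
# pct_change() returns a fraction, so it is multiplied by 100 here.
toy["playercashableamt_pct_change"] = (
    toy.groupby("playerkey")["playercashableamt"].pct_change() * 100
)

print(toy[["percent_return", "playercashableamt_pct_change"]])
```

In this toy table the first gamble returns -100% (the whole wager is lost), the second +150%, and the third -25%. The first row of the percentage change is `NaN` because there is no previous gamble to compare against.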

In [4]:
# Create a new column that is the negative of the casino's grosswin, named "player_loss"
dtf["player_loss"] = dtf["grosswin"] * -1

# Amount wagered plus player_loss for each gamble
dtf['player_wins'] = dtf['wageredamt'] + dtf['player_loss']

# Calculate percentage return for each gamble and add it as a new column
dtf["percent_return"] = (dtf["player_wins"] - dtf["wageredamt"]) / dtf["wageredamt"] * 100

# Calculate the percent rate of change of playercashableamt per playerkey
# (pct_change returns a fraction, so multiply by 100 to match the formula above)
dtf["playercashableamt_pct_change"] = dtf.groupby("playerkey")["playercashableamt"].pct_change() * 100

# Create a running gamble counter for each player that starts at 1 and increases by 1 for each row
dtf["gambles"] = dtf.groupby("playerkey").cumcount() + 1
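Both grouped operations in the cell above restart at each player boundary, which is why the data is sorted by `playerkey` and `start_time` first. The following sketch illustrates this on a hypothetical two-player table; the `toy` DataFrame, its values, and the `first_gamble` column are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data: two players, rows already sorted by playerkey and time.
toy = pd.DataFrame({
    "playerkey": [1, 1, 1, 2, 2],
    "playercashableamt": [50.0, 40.0, 60.0, 10.0, 15.0],
})

# The counter restarts at 1 for every player.
toy["gambles"] = toy.groupby("playerkey").cumcount() + 1

# The rate of change is undefined (NaN) on each player's first gamble,
# because the previous row belongs to a different player (or does not exist).
change = toy.groupby("playerkey")["playercashableamt"].pct_change()
toy["first_gamble"] = change.isna()

# The last counter value per player equals that player's number of rows.
max_counter = toy.groupby("playerkey")["gambles"].max()
group_size = toy.groupby("playerkey").size()
counts_match = bool((max_counter == group_size).all())

print(toy)
print("Counter matches group size:", counts_match)
```

Because the last counter value equals the number of rows per player, taking the maximum of `gambles` per `playerkey` gives each player's total number of gambles.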

## Frequency of Gambles

Let's see which players gamble the most and the least, and compute the minimum, maximum, average, and median number of gambles per player.

In [6]:
# Number of gambles per player (the last value of each player's running counter)
gambles_per_player = dtf.groupby('playerkey')['gambles'].max()

# Total number of players
print("Total number of players:", dtf["playerkey"].nunique())

# Total number of gambles
print("Total number of gambles:", gambles_per_player.sum())

print("--------------------------------------------")
# What is the maximum number of gambles played by a single player?
print("Maximum # of gambles of a single player:", gambles_per_player.max())

# Who is the player with the maximum number of gambles?
print("Player who gambled the most:", gambles_per_player.idxmax())

print("--------------------------------------------")

# What is the minimum number of gambles played by a single player?
print("Minimum # of gambles of a single player:", gambles_per_player.min())

# Who is the player with the minimum number of gambles?
print("Player who gambled the least:", gambles_per_player.idxmin())

print("--------------------------------------------")

# What is the median number of gambles per player?
median_gambles = gambles_per_player.median()
print("Median # of gambles of all players:", round(median_gambles, 2))

# Compare each player against the computed median instead of a hard-coded value
print("Number of players who played exactly the median:", (gambles_per_player == median_gambles).sum())
print("Number of players who gambled less than the median:", (gambles_per_player < median_gambles).sum())
print("Number of players who gambled more than the median:", (gambles_per_player > median_gambles).sum())
print("--------------------------------------------")

# What is the average number of gambles per player (rounded to the nearest whole gamble)?
mean_gambles = round(gambles_per_player.mean(), 0)
print("Average # of gambles of all players:", mean_gambles)

# Compare each player against the rounded average instead of a hard-coded value
print("Number of players who gambled exactly the average:", (gambles_per_player == mean_gambles).sum())
print("Number of players who gambled less than the average:", (gambles_per_player < mean_gambles).sum())
print("Number of players who gambled more than the average:", (gambles_per_player > mean_gambles).sum())

Total number of players: 12854
Total number of gambles: 5413835
--------------------------------------------
Maximum # of gambles of a single player: 26134
Player who gambled the most: 18059
--------------------------------------------
Minimum # of gambles of a single player: 1
Player who gambled the least: 9603
--------------------------------------------
Median # of gambles of all players: 151.0
Number of players who played exactly the median: 21
Number of players who gambled less than the median: 6412
Number of players who gambled more than the median: 6421
--------------------------------------------
Average # of gambles of all players: 421.0
Number of players who gambled exactly the average: 8
Number of players who gambled less than the average: 9668
Number of players who gambled more than the average: 3178
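
In the output above the average (421) is well above the median (151), and roughly three quarters of the players (9668 of 12854) fall below the average, which points to a right-skewed distribution of gambles per player. As a possible shortcut, the same summary statistics can be obtained in one call with `describe()`. The sketch below uses a hypothetical `toy` table in place of `dtf`; its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for dtf: one row per gamble.
toy = pd.DataFrame({"playerkey": [1, 1, 1, 1, 1, 1, 2, 2, 3]})

# Number of gambles per player, counted directly from the rows.
gambles_per_player = toy.groupby("playerkey").size()

# count, mean, std, min, quartiles (50% is the median), and max in one call.
summary = gambles_per_player.describe()
print(summary[["count", "min", "50%", "mean", "max"]])

# Share of players below the mean; a value well above 0.5 suggests right skew.
share_below_mean = (gambles_per_player < gambles_per_player.mean()).mean()
print("Share of players below the mean:", round(share_below_mean, 2))
```

Here `groupby(...).size()` counts rows per player directly, so it does not depend on the `gambles` counter having been created beforehand.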
