# Exploratory Data Analysis
I am using this notebook to learn how gamblers' behaviors resemble those of investors.

## Define Libraries

In [2]:
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import ipywidgets as widgets


# Getting rid of the SettingWithCopyWarning: 
pd.options.mode.chained_assignment = None

## Upload Data

In [3]:
# Set working directory
path = '/Users/mau/Library/CloudStorage/Dropbox/Mac/Documents/Dissertation/Chapter 2/Data'
os.chdir(path)

# Load data into a DataFrame
dtf = pd.read_parquet("slot_data_sample.parquet")

### Filter Columns and Inspect

In [4]:


# Select only specific columns (the list is not named `filter`, which would shadow the built-in)
columns_to_keep = ['playercashableamt', 'wageredamt', 'casino_grosswin', 'playerkey',
                   'slotdenominationname', 'slotthemekey']

# Load only the selected columns into a DataFrame
df = pd.read_parquet("slot_data_sample.parquet", columns=columns_to_keep)

# Print column names
print(df.columns)

# Print general information about the DataFrame
print(df.info())

# Count unique players
print(df['playerkey'].nunique())

Index(['playercashableamt', 'wageredamt', 'casino_grosswin', 'playerkey',
       'slotdenominationname', 'slotthemekey'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90274 entries, 0 to 90273
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   playercashableamt     90274 non-null  float64
 1   wageredamt            90274 non-null  float64
 2   casino_grosswin       90274 non-null  float64
 3   playerkey             90274 non-null  int64  
 4   slotdenominationname  90274 non-null  object 
 5   slotthemekey          90274 non-null  int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 4.1+ MB
None
98


## Calculate Fundamental Variables

The following variables are calculated from the existing data:
* _player_loss_: the negative of _casino_grosswin_, i.e. the player's net result on each gamble (negative when the player loses money).
* _player_wins_: the amount returned to the player on each gamble, equal to the amount wagered plus the net result.
* _percent_return_: the player's return on each gamble, as a percentage of the amount wagered.

$$\text{percent return} = \left(\frac{df[player\_wins] - df[wageredamt]}{df[wageredamt]}\right) \times 100$$

* _playercashableamt_pct_change_: the rate of change of the player's outstanding gambling amount from one gamble to the next, computed separately for each player. It is stored as a fraction (for example, -0.125), not multiplied by 100.

$$\text{playercashableamt change} = \frac{df[playercashableamt_{t+1}] - df[playercashableamt_{t}]}{df[playercashableamt_{t}]}$$
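
As a quick check of these definitions, the arithmetic can be traced by hand for a few gambles. The sketch below uses plain Python and illustrative sample values only; the helper names `percent_return` and `fractional_change` are hypothetical and are not part of the notebook.

```python
def percent_return(wageredamt, casino_grosswin):
    """Percent return on one gamble, following the definitions above."""
    player_loss = -casino_grosswin          # negative of the casino's gross win
    player_wins = wageredamt + player_loss  # amount returned to the player
    return (player_wins - wageredamt) / wageredamt * 100


def fractional_change(previous, current):
    """Change in the cashable amount relative to the previous gamble (a fraction)."""
    return (current - previous) / previous


# Illustrative values: a lost 10.0 wager, then a 10.0 wager on which the casino loses 90.0
print(percent_return(10.0, 10.0))     # -100.0
print(percent_return(10.0, -90.0))    # 900.0

# Balance falling from 80.0 to 70.0; multiply by 100 to express it as a percentage
print(fractional_change(80.0, 70.0))  # -0.125
```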

In [5]:
# Create a new column that is the negative of casino_grosswin, named "player_loss"
df["player_loss"] = df["casino_grosswin"] * -1

# Amount returned to the player on each gamble: the wager plus the net result
df['player_wins'] = df['wageredamt'] + df['player_loss']

# Calculate percentage return for each gamble and add it as a new column
df["percent_return"] = (df["player_wins"] - df["wageredamt"]) / df["wageredamt"] * 100

# Calculate the rate of change of playercashableamt per playerkey (returned as a fraction)
df["playercashableamt_pct_change"] = df.groupby("playerkey")["playercashableamt"].pct_change()

# Create a time series variable for each player that starts at 1 and increases by 1 for each row
df["time"] = df.groupby("playerkey").cumcount() + 1

# Print the first 5 rows of the DataFrame
print(df.head())

   playercashableamt  wageredamt  casino_grosswin  playerkey  \
0               80.0        10.0             10.0          2   
1               70.0        10.0             10.0          2   
2               60.0        10.0             10.0          2   
3               50.0        10.0            -90.0          2   
4              140.0        10.0              0.0          2   

  slotdenominationname  slotthemekey  player_loss  player_wins  \
0               $5.00            390        -10.0          0.0   
1               $5.00            390        -10.0          0.0   
2               $5.00            390        -10.0          0.0   
3               $5.00            390         90.0        100.0   
4               $5.00            390         -0.0         10.0   

   percent_return  playercashableamt_pct_change  time  
0          -100.0                           NaN     1  
1          -100.0                     -0.125000     2  
2          -100.0                     -0.142857   

## Patterns in Return Stream

In this section, I look for return-stream patterns similar to the market returns given to subjects in the Safford et al. (2008) experiment. These market returns followed the historical returns of the DJIA from 1925 to 1964:

$$\text{\% return DJIA 1925-1964}: [30.0, 0.3, 28.8, 48.2, -17.2, -33.8, -52.7, -23.1, 66.7, 4.1, 38.5, 24.8, -32.8, 28.1, -2.9, -12.7, -15.4, 7.6, 13.8, 12.1, 26.6,\\ -8.1, 2.2, -2.1, 12.9, 17.6, 14.4, 8.4, -3.8, 44.0, 20.8, 2.3, -12.8, 34.0, 16.4, -9.3, 18.7, -10.8, 17.0, 14.0]$$

$$\text{pattern of returns: [1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, -1, -1, -1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, -1, 1, 1, -1, 1, -1, 1, 1]}$$
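
The sign pattern above can be reproduced directly from the listed returns. The following is a small sketch in plain Python; the variable names `djia_returns` and `derived_pattern` are only illustrative.

```python
# Percent returns of the DJIA, 1925-1964, as listed above
djia_returns = [
    30.0, 0.3, 28.8, 48.2, -17.2, -33.8, -52.7, -23.1, 66.7, 4.1,
    38.5, 24.8, -32.8, 28.1, -2.9, -12.7, -15.4, 7.6, 13.8, 12.1,
    26.6, -8.1, 2.2, -2.1, 12.9, 17.6, 14.4, 8.4, -3.8, 44.0,
    20.8, 2.3, -12.8, 34.0, 16.4, -9.3, 18.7, -10.8, 17.0, 14.0,
]

# 1 for a positive return, -1 otherwise
derived_pattern = [1 if r > 0 else -1 for r in djia_returns]

print(len(derived_pattern))  # 40
print(derived_pattern[:8])   # [1, 1, 1, 1, -1, -1, -1, -1]
```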

### Filter Data
We need to select players who have at least 40 gambles so that they can be compared with the 40 investing periods of Safford's subjects.

In [6]:
# Count how many gambles each player has (computed once and reused below)
gamble_counts = df["playerkey"].value_counts()

# Create a list of players that appear at least 40 times
players40 = gamble_counts[gamble_counts >= 40].index.tolist()
print(players40)
print(len(players40))

# Create a list of players that appear less than 40 times
players40less = gamble_counts[gamble_counts < 40].index.tolist()
print(players40less)
print(len(players40less))

# Create a new DataFrame with only the players that appear at least 40 times
df40 = df[df["playerkey"].isin(players40)]
print(df40.shape)

[33, 93, 95, 94, 20, 73, 48, 66, 44, 18, 38, 76, 29, 100, 62, 6, 99, 14, 89, 90, 8, 79, 13, 9, 69, 23, 40, 4, 61, 87, 19, 43, 72, 98, 16, 2, 36, 92, 17, 63, 12, 46, 97, 54, 91, 86, 84, 3, 85, 52, 30, 74, 35, 7, 96, 56, 65, 68, 83, 27, 71, 37, 70, 47, 11, 53, 77, 49, 39, 25, 88, 82, 42, 21, 22, 41, 80, 57, 51, 81, 78, 75, 10]
83
[55, 5, 50, 34, 64, 31, 45, 32, 58, 24, 59, 67, 26, 28, 15]
15
(90000, 11)


### Pattern Recognition
Now that we have filtered the data, we can proceed to look for a pattern similar to that of Safford's experiment.

* Define a function to look for our desired pattern.
* Define the variables _pattern_ and _sign_.
* Conduct a hard match (the full 40-period pattern).
* Conduct a soft match (shorter slices of the pattern).
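
To make the matching step concrete, here is a minimal sketch of a sliding-window match over one sequence of signs, written in plain Python. It is an illustration under assumptions, not the notebook's implementation: the function name `find_matches` and the toy `signs` list are hypothetical, and the function scans a single player's sequence at a time.

```python
def find_matches(signs, target):
    """Return the end positions of every window of `signs` equal to `target`."""
    size = len(target)
    ends = []
    for end in range(size, len(signs) + 1):
        if signs[end - size:end] == target:
            ends.append(end - 1)  # index of the last element of the matching window
    return ends


# Toy sequence for a single hypothetical player
signs = [1, 1, -1, -1, 1, 1, -1, -1, 1]
target = [1, 1, -1, -1]

print(find_matches(signs, target))  # [3, 7]
```

In this framing, a hard match passes the full 40-element pattern as `target`, while a soft match passes a shorter slice such as the first 8 elements.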

In [7]:
def match_pattern(df, pattern, initial, window_size):
    # The target is the slice pattern[initial:window_size], so window_size acts as the end index
    target = pattern[initial:window_size]
    col_name = f"match_{initial}_{window_size}"
    size = len(target)
    # Flag (1.0) every row that ends a run of `size` signs equal to the target.
    # Note: the window rolls over the whole frame, so it can span rows of different players.
    df.loc[:, col_name] = df["sign"].rolling(size).apply(lambda x: (x == target).all(), raw=True)
    num_occurrences = df[col_name].sum()
    matching_players = df[df[col_name] == 1]["playerkey"].unique().tolist()
    print("Pattern:", target)
    print(f"Number of occurrences (pattern size {size}):", num_occurrences)
    print(f"Players matching pattern (pattern size {size}):", matching_players)
    print(f"Count of players (pattern size {size}):", len(matching_players))
    return matching_players


In [8]:
# Define pattern to search for
pattern = [1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, -1, -1, -1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, -1, 1, 1, -1, 1, -1, 1, 1]

# Create a new column 'sign' that is 1 if 'percent_return' is positive and -1 otherwise (a zero return counts as -1)
df40.loc[:, "sign"] = df40["percent_return"].apply(lambda x: 1 if x > 0 else -1)

In [9]:
## Hard Match
# Find players that match the pattern exactly
players_match40 = match_pattern(df40, pattern, 0, 40)

Pattern: [1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, -1, -1, -1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, -1, 1, 1, -1, 1, -1, 1, 1]
Number of occurrences (pattern size 40): 0.0
Players matching pattern (pattern size 40): []
Count of players (pattern size 40): 0


In [10]:
## Soft Match
# Let's see how many times the first 8 elements of the pattern appear in the data
players_match8 = match_pattern(df40, pattern, 0, 8)

# Let's see how many times the first 10 elements of the pattern appear in the data
players_match10 = match_pattern(df40, pattern, 0, 10)

# Let's see how many times the first 20 elements of the pattern appear in the data
players_match20 = match_pattern(df40, pattern, 0, 20)

# Let's see how many times the slice pattern[4:11] (7 elements) appears in the data
players_match4_11 = match_pattern(df40, pattern, 4, 11)

# Let's see how many times the slice pattern[4:12] (8 elements) appears in the data
players_match4_12 = match_pattern(df40, pattern, 4, 12)

Pattern: [1, 1, 1, 1, -1, -1, -1, -1]
Number of occurrences (pattern size 8): 32.0
Players matching pattern (pattern size 8): [3, 6, 14, 20, 23, 44, 57, 66, 73, 76, 79, 90, 91, 92, 93, 94, 95]
Count of players (pattern size 8): 17
Pattern: [1, 1, 1, 1, -1, -1, -1, -1, 1, 1]
Number of occurrences (pattern size 10): 2.0
Players matching pattern (pattern size 10): [79, 92]
Count of players (pattern size 10): 2
Pattern: [1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, -1, -1, -1, 1, 1, 1]
Number of occurrences (pattern size 20): 0.0
Players matching pattern (pattern size 20): []
Count of players (pattern size 20): 0
Pattern: [-1, -1, -1, -1, 1, 1, 1]
Number of occurrences (pattern size 7): 149.0
Players matching pattern (pattern size 7): [3, 6, 8, 9, 14, 16, 18, 20, 21, 23, 27, 29, 33, 36, 38, 44, 46, 48, 62, 63, 65, 66, 69, 72, 73, 74, 76, 79, 84, 85, 87, 89, 90, 91, 92, 93, 94, 95, 99, 100]
Count of players (pattern size 7): 40
Pattern: [-1, -1, -1, -1, 1, 1, 1, 1]
Number of occurrences (

## Slicing DataFrames per Matched Players and Visualizing Outcomes

In this section, we inspect only the patterns that have the most subjects. The code creates a new column called "match_rolling" that is 1 for the rows up to 10 before and 10 after a row where the match column (for example, match_0_8) is True. A centered rolling window of size 21 (10 before + 10 after + 1 current row) covers those neighboring rows, and the apply method with the lambda function checks whether any value in the window is True. Finally, the fillna method replaces any NaN values with the fill value (False here), and the astype method converts the True/False values to 1/0.
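
The windowing logic described above can be sketched without pandas. The example below is a simplified illustration, assuming a plain list of 0/1 match flags and a small half-width of 2 (a window of 5) instead of 10; the function name `mark_neighbors` is hypothetical. Rows whose centered window would run past either end of the list are set to 0, which is meant to mirror how incomplete rolling windows produce NaN and are then filled with False.

```python
def mark_neighbors(flags, half_width):
    """Mark each row whose centered window contains at least one match."""
    n = len(flags)
    marked = []
    for i in range(n):
        start, stop = i - half_width, i + half_width + 1
        if start < 0 or stop > n:
            # Incomplete window at the edges: treated as "no match"
            marked.append(0)
        else:
            marked.append(int(any(flags[start:stop])))
    return marked


# One match at position 4 in a toy sequence of ten rows
flags = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(mark_neighbors(flags, half_width=2))  # [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
```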


In [11]:
def filter_match(df, players_match, match_column, rolling_window, fill_value):
    # Create a new DataFrame with only the players that appear in players_match
    df_match_all = df[df["playerkey"].isin(players_match)]

    # Create a new column called match_rolling that is 1 for the rows around those where match_column is True
    df_match_all.loc[:, "match_rolling"] = df_match_all[match_column].rolling(window=rolling_window, center=True).apply(lambda x: any(x)).fillna(fill_value).astype(int)
    
    # Slice the DataFrame to only include the rows where match_rolling is 1
    df_match_slice = df_match_all[df_match_all["match_rolling"] == 1]
    # Return both the full DataFrame and the slice
    return df_match_all, df_match_slice

In [12]:
# Create a new DataFrame with only the players that appear in players_match8
df40_match8_10_all, dtf40_match8_10_slice = filter_match(df=df40, players_match=players_match8, match_column="match_0_8", rolling_window=21, fill_value=False)

# Save the DataFrame to a parquet file
df40_match8_10_all.to_parquet("df40_match8_10_all.parquet")
print(df40_match8_10_all.shape)

# Save the DataFrame to a parquet file
dtf40_match8_10_slice.to_parquet("dtf40_match8_10_slice.parquet")
print(dtf40_match8_10_slice.shape)

(44566, 19)
(670, 19)


In [13]:

# Create a new DataFrame with only the players that appear in players_match4_11
df40_match4_10_all, dtf40_match4_10_slice = filter_match(df=df40, players_match=players_match4_11, match_column="match_4_11", rolling_window=21, fill_value=False)

# Save the DataFrame to a parquet file
df40_match4_10_all.to_parquet("df40_match4_10_all.parquet")
print(df40_match4_10_all.shape)

# Save the DataFrame to a parquet file
dtf40_match4_10_slice.to_parquet("dtf40_match4_10_slice.parquet")
print(dtf40_match4_10_slice.shape)

(78439, 19)
(3060, 19)


## Interactive Plots

The following section explores the data interactively. The widgets below let the user choose which DataFrame to use, which player to display, and which variables to plot on each axis. Switching between players and variables this way makes it easier to spot patterns around the matched sequences that a single static plot might hide.

In [18]:
# Make a list of all the dataframes that are match and slice
dtf_lists = [df40_match8_10_all, dtf40_match8_10_slice, df40_match4_10_all, dtf40_match4_10_slice]

# Create a scatter plot of y against x for a single player, chosen by position in the DataFrame's list of unique players
def plot_scatters(player_index=0, df_index=1, x="time", y="percent_return"):
    df = dtf_lists[df_index]
    players = dtf_lists[df_index]["playerkey"].unique().tolist()
    plt.scatter(x=df[df["playerkey"] == players[player_index]][x], y=df[df["playerkey"] == players[player_index]][y])
    plt.show()

# Create widgets for player_index (position in the list of players, not the playerkey itself), df_index, x, and y
playerkey_widget = widgets.Dropdown(options=list(range(10)), value=0)
df_index_widget = widgets.Dropdown(options=[(f"DataFrame {i}", i) for i in range(len(dtf_lists))], value=1)
x_widget = widgets.Dropdown(options=list(dtf_lists[0].columns), value="time")
y_widget = widgets.Dropdown(options=list(dtf_lists[0].columns), value="percent_return")

# Create a VBox layout widget for the playerkey, x, and y widgets
left_box = widgets.VBox([widgets.Label("Player Key"), playerkey_widget, widgets.Label("X"), x_widget, widgets.Label("Y"), y_widget])

# Create a VBox layout widget for the df_index widget
right_box = widgets.VBox([widgets.Label("Data Frame"), df_index_widget])

# Create a new figure and axis object for the plot
fig, ax = plt.subplots()

output = widgets.Output() # create an Output widget to display the plot
with output:
    # Display the initial empty plot
    ax.set_xlim(0, 100)
    ax.set_ylim(0, 100)
    ax.set_xlabel("X Axis")
    ax.set_ylabel("Y Axis")
    plt.show()

# Create an HBox layout widget for the plot and the two VBox widgets
widgets.HBox([output, left_box, right_box])

# Use interact to call the plot_scatters function with the current widget values
widgets.interact(plot_scatters, player_index=playerkey_widget, df_index=df_index_widget, x=x_widget, y=y_widget);

interactive(children=(Dropdown(description='player_index', options=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9), value=0), D…

In [15]:
# Make a list of all the dataframes that are match and slice
dtf_lists = [df40_match8_10_all, dtf40_match8_10_slice, df40_match4_10_all, dtf40_match4_10_slice]
players = dtf_lists[0]["playerkey"].unique().tolist()
print(players[0])

3


## Statistical Analysis
