# Data Exploration

## Import Libraries & Read In Data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyreadr
import itertools

In [3]:
data = pyreadr.read_r('../data/IGTdata.rdata')

**Overview of dataset**

In [4]:
print("{:12}| {:57}| {:30}".format('DF NAME','COLUMN NAMES','ROW NAMES'))
print('-'*110)
for key, value in data.items():
    if value.shape[1] > 5:
        column_names = ''.join([', '.join(value.columns[:3]), ',...,', ', '.join(value.columns[-2:])])
    else:
        column_names = ', '.join(value.columns)
    row_names = [str(v) for v in list(value.index.values)]
    row_names = ''.join([', '.join(row_names[:2]), ',...,',', '.join(row_names[-2:])])
    print("{:12}| {:57}| {:30}".format(key,column_names,row_names))

DF NAME     | COLUMN NAMES                                             | ROW NAMES                     
--------------------------------------------------------------------------------------------------------------
choice_95   | Choice_1, Choice_2, Choice_3,...,Choice_94, Choice_95    | Subj_1, Subj_2,...,Subj_14, Subj_15
wi_95       | Wins_1, Wins_2, Wins_3,...,Wins_94, Wins_95              | Subj_1, Subj_2,...,Subj_14, Subj_15
lo_95       | Losses_1, Losses_2, Losses_3,...,Losses_94, Losses_95    | Subj_1, Subj_2,...,Subj_14, Subj_15
index_95    | Subj, Study                                              | 0, 1,...,13, 14               
choice_100  | Choice_1, Choice_2, Choice_3,...,Choice_99, Choice_100   | Subj_1, Subj_2,...,Subj_503, Subj_504
wi_100      | Wins_1, Wins_2, Wins_3,...,Wins_99, Wins_100             | Subj_1, Subj_2,...,Subj_503, Subj_504
lo_100      | Losses_1, Losses_2, Losses_3,...,Losses_99, Losses_100   | Subj_1, Subj_2,...,Subj_503, Subj_504
index_100   | Subj, S

* The studies are grouped by the number of trials (t) completed; 95, 100 or 150. 
* For each group, there are 4 data frames, where each row corresponds to a subject (s):
  * choice_t : 
    * Entries are either 1, 2, 3 or 4 which correspond to deck A, B, C and D, respectively.
    * Dimensionality is s x t. 
    * The entry of the second row and third column indicates the choice made by the second subject on the third trial.
  * wi_t :
    * Contains the win achieved as a result of each choice.
    * Dimensionality is s x t. 
    * The entry of the second row and third column corresponds to the reward received by the second subject on the third trial.
  * lo_t :
    * Contains the loss incurred as a result of each choice.
    * Dimensionality is s x t. 
    * The entry of the second row and third column corresponds to the loss incurred by the second subject on the third trial.
  * index_t :
    * Entries are the name of the first author of the study that reports the data of the corresponding participant.
    * Dimensionality is s x 2. 
    * The entry of the second row indicates which study the second subject participated in.
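
The layout described above can be illustrated with a small sketch. The frames below are synthetic stand-ins (3 subjects, 4 trials, made-up values and study labels), not the real data; only the row/column naming scheme follows the overview table.

```python
import pandas as pd

# Synthetic stand-in for one trial group; all values are made up for illustration.
subjects = [f"Subj_{i}" for i in range(1, 4)]
choice = pd.DataFrame(
    [[1, 2, 3, 4], [4, 4, 2, 1], [3, 3, 3, 2]],
    index=subjects,
    columns=[f"Choice_{t}" for t in range(1, 5)],
)
# Hypothetical index frame with the same two columns as index_t (Subj, Study).
index = pd.DataFrame({"Subj": [1, 2, 3], "Study": ["StudyA", "StudyA", "StudyB"]})

# Second row, third column: the second subject's choice on the third trial.
second_subject_third_trial = choice.iloc[1, 2]
# The same entry via row and column labels.
same_entry = choice.loc["Subj_2", "Choice_3"]
# Second row of the index frame: the study the second subject took part in.
second_subject_study = index["Study"].iloc[1]

print(choice.shape, second_subject_third_trial, second_subject_study)
```

Positional access (`iloc`) mirrors the "second row, third column" wording, while label access (`loc`) is less error-prone once the row names are consistent across frames.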
    

## Data Cleaning & Validation

**Update index_t row names for conformity.**<br>
All other data frames have consistent row names of the form Subj_1, Subj_2, Subj_3 etc.

In [5]:
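for key, value in data.items():
    # Only the index_t data frames need relabelling.
    if not key.startswith('index'):
        continue
    # Drop the redundant subject column and label rows Subj_1, Subj_2, ...
    data[key] = value.drop(columns=['Subj'])
    data[key].index = ['Subj_' + str(i) for i in range(1, value.shape[0] + 1)]
data['index_150'].head()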

Unnamed: 0,Study
Subj_1,Steingroever2011
Subj_2,Steingroever2011
Subj_3,Steingroever2011
Subj_4,Steingroever2011
Subj_5,Steingroever2011


**Verify table 1 (include link to it).** <br> {cite:t}`Steingroever_Fridberg_Horstmann_Kjome_Kumari_Lane_Maia_McClelland_Pachur_Premkumar` speculates that the sample size may be less than 617 due to "missing data for one participant in {cite:t}`kjome_lane_schmitz_green_ma_prasla_swann_moeller_2010`, and for two participants in {cite:t}`Wood_Busemeyer_Koling_Cox_Davis_2005`". According to table 1 (link again), there should be 19 participants in {cite:t}`kjome_lane_schmitz_green_ma_prasla_swann_moeller_2010` and 153 in {cite:t}`Wood_Busemeyer_Koling_Cox_Davis_2005`.

In [26]:
print("Total number of subjects:", data['choice_95'].shape[0] + data['choice_100'].shape[0] + data['choice_150'].shape[0])

Total number of subjects: 617


The total is 617, as expected. Let's take a closer look at the two studies in question to be sure.

In [23]:
print("Subjects in Kjome study:", len(data['index_100'][data['index_100']['Study'] == 'Kjome']))
print("Subjects in Wood study:", len(data['index_100'][data['index_100']['Study'] == 'Wood']))

Subjects in Kjome study: 19
Subjects in Wood study: 153


Both counts match table 1, confirming that the reported number of subjects is correct.
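
As a hedged alternative to filtering one study at a time, the per-study counts can be computed in a single pass and compared against a dictionary of expected values. The frame and the counts below are illustrative stand-ins, not the real `index_100` contents, and `expected` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in for data['index_100']; labels and counts are illustrative only.
index_df = pd.DataFrame({"Study": ["Kjome"] * 3 + ["Wood"] * 5 + ["Other"] * 2})

# Count the subjects belonging to each study.
subjects_per_study = index_df["Study"].value_counts()
print(subjects_per_study)

# Expected counts (here simply the made-up ones used to build the frame above).
expected = {"Kjome": 3, "Wood": 5}

# Collect any study whose observed count differs from the expected one.
mismatches = {
    study: (n_expected, int(subjects_per_study.get(study, 0)))
    for study, n_expected in expected.items()
    if subjects_per_study.get(study, 0) != n_expected
}
print("Mismatches:", mismatches)
```

With the real data, `expected` would hold the counts from table 1, and an empty `mismatches` dictionary would indicate agreement.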

**Check for nulls/unexpected entries**

Sanity check the data frames for unusual entries, such as null values and unexpected data types. Additionally, confirm that the data frames are structured as expected, e.g. check that all entries in lo_t are non-positive (zero or negative) and that 1, 2, 3 and 4 are the only entries in choice_t.

In [219]:
for key, value in data.items():
    try:
        # Numeric frames: list each unique value with its relative frequency.
        uniq_entries = ', '.join([("{:.2f} ({:.2f}%)".format(entry, count*100)) for entry, count in value.stack().value_counts(normalize=True).sort_index().items()])
    except (ValueError, TypeError):
        # Non-numeric frames (index_t): list the unique values only.
        uniq_entries = ', '.join([("{:}".format(entry)) for entry, count in value.stack().value_counts().sort_index().items()])

    print("\033[1mUnique entries (and their frequency) in {:}:\033[0m \n{:}".format(key, uniq_entries))

Unique entries (and their frequency) in choice_95: 
1.00 (13.12%), 2.00 (29.82%), 3.00 (13.61%), 4.00 (43.44%)
Unique entries (and their frequency) in wi_95: 
50.00 (57.05%), 100.00 (42.95%)
Unique entries (and their frequency) in lo_95: 
-1250.00 (3.23%), -350.00 (1.61%), -300.00 (1.19%), -250.00 (5.54%), -200.00 (1.12%), -150.00 (1.19%), -75.00 (0.56%), -50.00 (4.91%), -25.00 (0.98%), 0.00 (79.65%)
Unique entries (and their frequency) in index_95: 
Fridberg
Unique entries (and their frequency) in choice_100: 
1.00 (15.21%), 2.00 (30.98%), 3.00 (23.66%), 4.00 (30.14%)
Unique entries (and their frequency) in wi_100: 
40.00 (3.60%), 45.00 (2.41%), 50.00 (34.28%), 55.00 (4.20%), 60.00 (3.41%), 65.00 (2.19%), 70.00 (1.71%), 75.00 (0.59%), 80.00 (2.58%), 85.00 (0.27%), 90.00 (2.74%), 95.00 (0.05%), 100.00 (31.80%), 110.00 (3.68%), 120.00 (3.62%), 130.00 (1.75%), 140.00 (0.79%), 150.00 (0.23%), 160.00 (0.08%), 170.00 (0.02%)
Unique entries

No unexpected entries.
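
The visual inspection above could also be backed by explicit checks. The following is a minimal sketch on synthetic frames (made-up values); `validate_group` is a hypothetical helper, not part of the original notebook, and it only encodes the expectations stated in this section.

```python
import pandas as pd

# Synthetic stand-ins for choice_t, wi_t and lo_t; values are made up.
choice = pd.DataFrame([[1, 2, 4], [3, 3, 2]])
wins = pd.DataFrame([[100, 100, 50], [50, 50, 100]])
losses = pd.DataFrame([[0, -1250, 0], [-50, 0, 0]])


def validate_group(choice, wins, losses):
    """Return named boolean checks for one trial group (hypothetical helper)."""
    has_nulls = (
        choice.isna().any().any()
        or wins.isna().any().any()
        or losses.isna().any().any()
    )
    return {
        # No missing values in any of the three frames.
        "no_nulls": not has_nulls,
        # Choices are restricted to the four decks.
        "choices_in_1_to_4": bool(choice.isin([1, 2, 3, 4]).all().all()),
        # Losses are recorded as zero or negative amounts.
        "losses_non_positive": bool((losses <= 0).all().all()),
        # All three frames share the same s x t shape.
        "shapes_match": choice.shape == wins.shape == losses.shape,
    }


checks = validate_group(choice, wins, losses)
print(checks)
```

With the real data, the helper would be called once per trial group, e.g. on `data['choice_95']`, `data['wi_95']` and `data['lo_95']`.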

In [None]:
# data preparation section
# df for win - loss (outcome of each trial)
# cumulative profit/loss
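
One possible way to build the two frames noted above is sketched here on synthetic data (made-up values). It assumes, as the lo_t output suggests, that losses are stored as non-positive numbers, so the net outcome is wins plus losses. The `Trial_t` column names are an assumption introduced for illustration.

```python
import pandas as pd

# Synthetic stand-ins for wi_t and lo_t; values are made up.
subjects = ["Subj_1", "Subj_2"]
wins = pd.DataFrame(
    [[100, 100, 50], [50, 50, 100]],
    index=subjects,
    columns=["Wins_1", "Wins_2", "Wins_3"],
)
losses = pd.DataFrame(
    [[0, -1250, 0], [-50, 0, 0]],
    index=subjects,
    columns=["Losses_1", "Losses_2", "Losses_3"],
)

# The column labels differ (Wins_t vs Losses_t), so add the underlying arrays
# instead of relying on pandas label alignment, which would yield NaN columns.
trial_cols = [f"Trial_{t}" for t in range(1, wins.shape[1] + 1)]
net = pd.DataFrame(
    wins.to_numpy() + losses.to_numpy(),
    index=wins.index,
    columns=trial_cols,
)

# Running total per subject across trials (cumulative profit/loss).
cumulative = net.cumsum(axis=1)
print(net)
print(cumulative)
```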

In [None]:
# maybe contrast payoff scheme 3 -> if so add payoff scheme to study data frame

In [None]:
# verify good/bad decks
# when do players develop a tendency/aversion to certain decks -> do they correctly identify them