# Data Exploration

The first step of the data exploration phase is to import the necessary packages.

In [1]:
import pandas as pd
import numpy as np

For this assignment there are 12 data sets to read in and explore: four types of data, each available for studies with 95, 100 and 150 trials.<br />
The four types are as follows:
* **Wins:** These data sets show how much money participants win during each part of the trial
* **Losses:** These data sets show how much money participants lose during each part of the trial
* **Choice:** These data sets indicate which deck participants chose during each part of the trial
* **Index:** These data sets contain the name of the first author of the study that reports the data of the corresponding participant.

### Importing data sets

In [2]:
win_95 = pd.read_csv('data/wi_95.csv')
win_100 = pd.read_csv('data/wi_100.csv')
win_150 = pd.read_csv('data/wi_150.csv')

In [3]:
loss_95 = pd.read_csv('data/lo_95.csv')
loss_100 = pd.read_csv('data/lo_100.csv')
loss_150 = pd.read_csv('data/lo_150.csv')

In [4]:
choice_95 = pd.read_csv('data/choice_95.csv')
choice_100 = pd.read_csv('data/choice_100.csv')
choice_150 = pd.read_csv('data/choice_150.csv')

In [5]:
index_95 = pd.read_csv('data/index_95.csv')
index_100 = pd.read_csv('data/index_100.csv')
index_150 = pd.read_csv('data/index_150.csv')
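The twelve `read_csv` calls above follow one naming pattern, so they could also be written as a loop. The sketch below is one possible way to do that; `load_group` and `TRIAL_COUNTS` are hypothetical names introduced here, and the file layout (`data/<prefix>_<n>.csv`) is taken from the cells above.

```python
from pathlib import Path

import pandas as pd

# Trial counts used in the file names above.
TRIAL_COUNTS = (95, 100, 150)


def load_group(prefix, data_dir="data", trial_counts=TRIAL_COUNTS):
    """Read '<data_dir>/<prefix>_<n>.csv' for every trial count n.

    Returns a dict mapping the trial count to its DataFrame.
    Hypothetical helper, not part of the original notebook.
    """
    return {
        n: pd.read_csv(Path(data_dir) / f"{prefix}_{n}.csv")
        for n in trial_counts
    }


# Example usage (requires the CSV files to exist):
# wins = load_group("wi")       # wins[95], wins[100], wins[150]
# losses = load_group("lo")
```

Keeping the frames in a dictionary keyed by trial count makes it easier to apply the same cleaning step to every data set later on.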

The next step is to clean the data loaded above.

Data cleaning is an important part of data exploration: it identifies and removes errors that would otherwise affect later machine learning steps.

### Data Cleaning

My first data cleaning step is to check for null values in the data sets.

In [6]:
win_95.isna().sum().sum() + win_100.isna().sum().sum() + win_150.isna().sum().sum()

0

There are no null values in the wins data sets.


In [7]:
loss_95.isna().sum().sum() + loss_100.isna().sum().sum() + loss_150.isna().sum().sum()

0

There are no null values in the losses data sets.

In [8]:
choice_95.isna().sum().sum() + choice_100.isna().sum().sum() + choice_150.isna().sum().sum()

0

There are no null values in the choices data sets.


In [9]:
index_95.isna().sum().sum() + index_100.isna().sum().sum() + index_150.isna().sum().sum()

0

There are no null values in the index data sets.
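The four null checks above repeat the same expression. As a rough sketch, the check could be wrapped in a small helper that reports the count per data set; `count_nulls` is a hypothetical name, and the toy frames below contain made-up values that only stand in for the real data.

```python
import numpy as np
import pandas as pd


def count_nulls(frames):
    """Return the number of missing cells in each named DataFrame."""
    return {name: int(df.isna().sum().sum()) for name, df in frames.items()}


# Toy frames standing in for the real data sets (values are made up).
toy = {
    "win_toy": pd.DataFrame({"t1": [100, 50], "t2": [100, np.nan]}),
    "loss_toy": pd.DataFrame({"t1": [0, -250], "t2": [-50, 0]}),
}

null_counts = count_nulls(toy)
print(null_counts)
```

Reporting one count per data set, rather than a single sum, shows which file needs attention if a null ever does appear.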

Because there are so many data sets, it may be simpler to combine the tables by number of trials.

:::{note}
Below I compute a new total for each number of trials. The total shows the net amount won or lost per person.
:::

In [55]:
total_win_95 = win_95.sum(axis=1)
total_loss_95 = loss_95.sum(axis=1)
total_95 = total_win_95 + total_loss_95

total_win_100 = win_100.sum(axis=1)
total_loss_100 = loss_100.sum(axis=1)
total_100 = total_win_100 + total_loss_100

total_win_150 = win_150.sum(axis=1)
total_loss_150 = loss_150.sum(axis=1)
total_150 = total_win_150 + total_loss_150

Converting the totals into pandas DataFrames for further analysis.

In [64]:
total_95 = pd.DataFrame(total_95)
total_95 = total_95.rename(columns={0: 'Total'})

total_100 = pd.DataFrame(total_100)
total_100 = total_100.rename(columns={0: 'Total'})

total_150 = pd.DataFrame(total_150)
total_150 = total_150.rename(columns={0: 'Total'})

In [81]:
total_95["Study_Type"] = index_95["Study"].values
total_100["Study_Type"] = index_100["Study"].values
total_150["Study_Type"] = index_150["Study"].values

Unnamed: 0,total,Study_Type
Subj_1,1150,Fridberg
Subj_2,-675,Fridberg
Subj_3,-750,Fridberg
Subj_4,-525,Fridberg
Subj_5,100,Fridberg
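The three steps above (summing wins and losses, converting to a DataFrame, attaching the study label) could be combined into one function, which also makes it harder to pair a totals table with the wrong index file. This is only a sketch: `net_totals` is a hypothetical helper, the toy values and the label `StudyA` are made up, and it assumes losses are stored as negative numbers, as the addition in the cells above implies.

```python
import pandas as pd


def net_totals(wins, losses, study):
    """Per-participant net outcome plus the study label.

    Assumes losses are negative, so wins + losses is the net total.
    """
    # Guard against attaching labels from a data set of a different size.
    if not (len(wins) == len(losses) == len(study)):
        raise ValueError("wins, losses and study need one row per participant")
    total = (wins.sum(axis=1) + losses.sum(axis=1)).to_frame("Total")
    total["Study_Type"] = list(study)
    return total


# Toy data with two participants and two trials (values are made up).
subjects = ["Subj_1", "Subj_2"]
toy_wins = pd.DataFrame({"t1": [100, 50], "t2": [100, 100]}, index=subjects)
toy_losses = pd.DataFrame({"t1": [0, -250], "t2": [-50, 0]}, index=subjects)
toy_study = pd.Series(["StudyA", "StudyA"])

toy_total = net_totals(toy_wins, toy_losses, toy_study)
print(toy_total)
```

With the real data the call would look like `net_totals(win_100, loss_100, index_100["Study"])`.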


In [79]:
index_95["Study"].dtype

dtype('O')

### Data Analysis/Exploration

Join all wins together, add a column for the number of trials they belong to, and add how many participants were in that study.

Join all losses together, add a column for the number of trials they belong to, and add how many participants were in that study.
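One possible way to carry out the plan above is sketched here. It stacks per-participant tables rather than the raw per-trial tables, because the raw tables have different numbers of columns. `stack_by_trials`, the `Trials` and `Study_Size` columns, and the toy values are all hypothetical; the sketch assumes each input table has a `Study_Type` column.

```python
import pandas as pd


def stack_by_trials(frames):
    """Stack tables keyed by trial count into one long table.

    frames: dict mapping trial count -> DataFrame with a 'Study_Type' column.
    """
    parts = [df.assign(Trials=n) for n, df in frames.items()]
    # Subject labels repeat across data sets, so move them into a column
    # and use a fresh unique index.
    combined = pd.concat(parts).rename_axis("Subject").reset_index()
    # Number of participants (rows) per study.
    study_sizes = combined["Study_Type"].value_counts()
    combined["Study_Size"] = combined["Study_Type"].map(study_sizes)
    return combined


# Toy per-participant tables (values and study names are made up).
toy_frames = {
    95: pd.DataFrame(
        {"Total": [150, -100], "Study_Type": ["StudyA", "StudyA"]},
        index=["Subj_1", "Subj_2"],
    ),
    100: pd.DataFrame(
        {"Total": [25], "Study_Type": ["StudyB"]}, index=["Subj_1"]
    ),
}

combined_toy = stack_by_trials(toy_frames)
print(combined_toy)
```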

Analysing the wins and losses for each number of trials

In [17]:
agg_choice_95 = choice_95.apply(pd.Series.value_counts, axis=1)

agg_choice_100 = choice_100.apply(pd.Series.value_counts, axis=1)

agg_choice_150 = choice_150.apply(pd.Series.value_counts, axis=1)

In [21]:
agg_choice_100

Unnamed: 0,1,2,3,4
Subj_1,21,42,15,22
Subj_2,14,35,18,33
Subj_3,21,42,7,30
Subj_4,13,24,28,35
Subj_5,15,31,28,26
...,...,...,...,...
Subj_500,17,29,28,26
Subj_501,14,15,44,27
Subj_502,27,32,17,24
Subj_503,27,25,23,25
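Applying `value_counts` row by row leaves `NaN` wherever a participant never picked a particular deck. The sketch below shows one way to turn those gaps into zeros and keep a fixed column order; `deck_counts` is a hypothetical helper, the deck labels 1 to 4 are taken from the output above, and the toy choices are made up.

```python
import pandas as pd


def deck_counts(choices, decks=(1, 2, 3, 4)):
    """Count how often each participant (row) chose each deck."""
    counts = choices.apply(pd.Series.value_counts, axis=1)
    # Decks a participant never chose come back as NaN; report them as 0.
    return counts.reindex(columns=list(decks)).fillna(0).astype(int)


# Toy choices for two participants over four trials (values are made up).
toy_choices = pd.DataFrame(
    [[1, 2, 2, 4], [3, 3, 3, 3]],
    index=["Subj_1", "Subj_2"],
    columns=["t1", "t2", "t3", "t4"],
)

toy_counts = deck_counts(toy_choices)
print(toy_counts)
```

Each row of the result sums to the number of trials, which gives a quick sanity check on the counts.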

