# Realign & Homogenize ED-1, ED-2, ED-3

`2.clean_realign_homogenize_all`

Realign and merge converted data from ED-3 into the concatenated data from ED-1 and ED-2.

Differentiate, import, and reassociate memory data into the main-task trialwise dataset.

### Configuration

In [1]:
from pathlib import Path

import pandas as pd
from datetime import datetime

from _utils import clean

In [2]:
date = datetime.today().strftime('%y%m%d')

In [3]:
derivs_dir = Path('..') / 'derivatives'
allsub_dir = derivs_dir / '00.allsub'

## Pull Concatenated Taskwise data

#### ED-1

In [4]:
main_fpath_1 = allsub_dir / ('econdec-1_task-main_beh_' + date + '.csv')
frac_fpath_1 = allsub_dir / ('econdec-1_task-frac_beh_' + date + '.csv')
face_fpath_1 = allsub_dir / ('econdec-1_task-face_beh_' + date + '.csv')

In [5]:
main_df_1 = clean.smooth_columns(pd.read_csv(main_fpath_1))
frac_df_1 = clean.smooth_columns(pd.read_csv(frac_fpath_1))
face_df_1 = clean.smooth_columns(pd.read_csv(face_fpath_1))

#### ED-2

In [6]:
main_fpath_2 = allsub_dir / ('econdec-2_task-main_beh_' + date + '.csv')
frac_fpath_2 = allsub_dir / ('econdec-2_task-frac_beh_' + date + '.csv')
face_fpath_2 = allsub_dir / ('econdec-2_task-face_beh_' + date + '.csv')

In [7]:
main_df_2 = clean.smooth_columns(pd.read_csv(main_fpath_2))
frac_df_2 = clean.smooth_columns(pd.read_csv(frac_fpath_2))
face_df_2 = clean.smooth_columns(pd.read_csv(face_fpath_2))

#### ED-3

In [8]:
main_fpath_3 = allsub_dir / ('econdec-3_task-main_beh_' + date + '.csv')
frac_fpath_3 = allsub_dir / ('econdec-3_task-frac_beh_' + date + '.csv')
face_fpath_3 = allsub_dir / ('econdec-3_task-face_beh_' + date + '.csv')

In [9]:
main_df_3 = clean.eye_cleanup(clean.smooth_columns(pd.read_csv(main_fpath_3)))
frac_df_3 = clean.smooth_columns(pd.read_csv(frac_fpath_3))
face_df_3 = clean.smooth_columns(pd.read_csv(face_fpath_3))

The ED-3 main-task data above is passed through `eye_cleanup`, a utility function written to correct the overall trial count:

In [10]:
print(clean.eye_cleanup.__doc__)


    Removes rows from a DataFrame based on predetermined indicators of extraneous data, or an indicator that the row represents unwanted practice data.
    


## Note

I'm unsure whether the above (repetitive), or one of the options below (unintuitive) is a cleaner way to represent the data corpus at this stage.

1. Still pretty repetitive here, but readable. Sets up better code efficiency later. (I'm leaning towards this option)

2. Harder to read, better code efficiency *now and later*.
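As one possible less repetitive layout (a sketch only, not necessarily either option meant above), the nine file paths could be built in a single dictionary keyed by study and task. The names `studies`, `tasks`, and `fpaths` are assumptions introduced for illustration; the file-name pattern is the one used in the cells above.

```python
from datetime import datetime
from pathlib import Path

date = datetime.today().strftime('%y%m%d')
allsub_dir = Path('..') / 'derivatives' / '00.allsub'

studies = (1, 2, 3)                 # ED-1, ED-2, ED-3
tasks = ('main', 'frac', 'face')    # main task, fractal memory, face memory

# One path per (study, task) pair, following the file-name pattern used above.
fpaths = {
    (study, task): allsub_dir / f'econdec-{study}_task-{task}_beh_{date}.csv'
    for study in studies
    for task in tasks
}
```

Each file could then be loaded in a loop over `fpaths.items()`, at the cost of replacing the explicit `main_df_1`-style names with dictionary lookups.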

# Homogenize main task column names

In [11]:
from config import new_columns

In [12]:
main_df_1 = main_df_1.rename(columns = new_columns).set_index(['subjnum','block','trial']).reset_index()
main_df_2 = main_df_2.rename(columns = new_columns).set_index(['subjnum','block','trial']).reset_index()
main_df_3 = main_df_3.rename(columns = new_columns).set_index(['subjnum','block','trial']).reset_index()

# Exclude bad subjects

In [13]:
from config import exclusions

In [14]:
main_df_1 = main_df_1[~main_df_1['subjnum'].isin(exclusions)]
main_df_2 = main_df_2[~main_df_2['subjnum'].isin(exclusions)]
main_df_3 = main_df_3[~main_df_3['subjnum'].isin(exclusions)]

In [15]:
print(
    len(main_df_1.subjnum.unique()),
    len(main_df_2.subjnum.unique()),
    len(main_df_3.subjnum.unique()),
)

88 101 72


# Trial Counts

In [16]:
main_df_1.groupby('subjnum').count().iloc[:,0].value_counts()

72    88
Name: block, dtype: int64

In [17]:
main_df_2.groupby('subjnum').count().iloc[:,0].value_counts()

72    101
Name: block, dtype: int64

In [18]:
main_df_3.groupby('subjnum').count().iloc[:,0].value_counts()

72    66
71     6
Name: block, dtype: int64

There are 2 problems with the raw behavioral data from ED-3 (before `eye_cleanup`):

1. Subjects should each have 72 trials, but we see 90 for most subjects. This is for two reasons:
    1. Practice blocks are not separate from the Main task in ED-3, adding 2 blocks of 2 trials (4 rows in total, bringing 72 up to 76).
    2. Each block in the Practice and Main tasks comes with an additional trial row that repeats the data from the last trial in that block. This adds 12 rows for the Main task and 2 for the Practice task (14 in total, bringing our 76 up to 90).
2. Some (8) subjects are missing exactly 1 of those 90 trials.
    1. We aren't exactly sure why this data was lost, but the EyeLink software seems to have failed to write it into the raw data during some instances of required recalibration of the eye-tracking sensor system.

Now we still have to deal with the subjects who are missing trials. Invoking `eye_cleanup` seems to have cleared up the trial count discrepancy for 2 of these 8 subjects, leaving us with 6. This is curious, as it implies that the missing trials for those 2 were either practice trials or the extraneous block-repeat trial rows. We need to work to clarify why this is.

In any case, the issue with the remaining 6 subjects can be remedied by adding a "dummy" trial in the position of the missing one.

Looks like the 6 missing trials are all in the 1st trial position within their block, judging by the 6 missing trial values at 1.

There are our 6 offenders, listed.
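A minimal sketch of how missing trials could be located and padded with "dummy" rows, assuming the data is keyed by `subjnum`, `block`, and `trial`. The toy frame, `full_grid`, `missing`, and `padded` are illustrative names, not the notebook's own code, and the toy uses 3 trials per block to stay small.

```python
import pandas as pd

keys = ['subjnum', 'block', 'trial']

# Toy stand-in for main_df_3: 2 subjects x 2 blocks x 3 trials,
# with block 2, trial 1 missing for subject 302.
rows = [(s, b, t) for s in (301, 302) for b in (1, 2) for t in (1, 2, 3)]
rows.remove((302, 2, 1))
toy = pd.DataFrame(rows, columns=keys)
toy['choicert'] = 1.5  # placeholder value for a data column

# Every (subject, block, trial) combination we expect to see.
full_grid = pd.MultiIndex.from_product(
    [sorted(toy[k].unique()) for k in keys], names=keys
)

indexed = toy.set_index(keys)

# Combinations in the grid that have no row in the data.
missing = full_grid.difference(indexed.index).to_frame(index=False)

# Reindexing against the grid inserts an all-NaN "dummy" row per missing trial.
padded = indexed.reindex(full_grid).reset_index()
```

The dummy rows carry `NaN` in every data column, so downstream summaries skip them while the trial and block positions stay aligned.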

# Cleaning

Create `['study']` label for each DataFrame

In [19]:
main_df_1['study'] = main_df_1.apply(clean.label_study, axis=1)
main_df_2['study'] = main_df_2.apply(clean.label_study, axis=1)
main_df_3['study'] = main_df_3.apply(clean.label_study, axis=1)

Put the ED-3 `choicert` and `outcomert` values in the same units as ED-1 and ED-2 by scaling them by 0.001

In [20]:
for col in ('choicert','outcomert'):
    main_df_3[col] = main_df_3[col].astype(float) * 0.001

# Cleaned Output with Exclusions

In [21]:
exclusions_dir = derivs_dir / '01.exclusions'
exclusions_dir.mkdir(exist_ok=True)

In [22]:
main_df_1.to_csv(exclusions_dir / ('econdec-1_task-main_beh_' + date + '.csv'))
main_df_2.to_csv(exclusions_dir / ('econdec-2_task-main_beh_' + date + '.csv'))
main_df_3.to_csv(exclusions_dir / ('econdec-3_task-main_beh_' + date + '.csv'))

# Main task

#### ED-1

In [23]:
main_df_1.head()

Unnamed: 0,subjnum,block,trial,agegroup,experimentername,date,time,trialnumbydomdist,domain,magnitude,...,confidence,confidencest,confidencert,stocknumber,bondnumber,genderjudgment,bankaccount,trueprob,estwithinrange?,study
0,100,1,1,1,kf,10_12,11:31:01.963000,1,LOSS,low,...,8,2141471.0,3.022637,16,9,1,-6,0.3,0,1
1,100,1,2,1,kf,10_12,11:31:01.963000,2,LOSS,low,...,8,2141525.0,3.695852,16,9,1,-12,0.155172,0,1
2,100,1,3,1,kf,10_12,11:31:01.963000,3,LOSS,low,...,8,2141546.0,3.121775,16,9,1,-18,0.3,1,1
3,100,1,4,1,kf,10_12,11:31:01.963000,4,LOSS,low,...,7,2141574.0,3.406241,16,9,1,-24,0.5,0,1
4,100,1,5,1,kf,10_12,11:31:01.963000,5,LOSS,low,...,8,2141602.0,4.553061,16,9,1,-26,0.7,0,1


#### ED-2

In [24]:
main_df_2.head()

Unnamed: 0,subjnum,block,trial,agegroup,experimentername,date,time,trialnumbydomdist,domain,magnitude,...,confidence,confidencest,confidencert,stocknumber,bondnumber,genderjudgment,bankaccount,trueprob,estwithinrange?,study
0,2001,1,1,1,ed,9_7,14:56:22.840000,1,GAIN,low,...,9.0,22151.347369,2.155163,18,1,0,6,0.7,0,2
1,2001,1,2,1,ed,9_7,14:56:22.840000,2,GAIN,low,...,9.0,22168.689525,0.90059,18,1,1,12,0.844828,0,2
2,2001,1,3,1,ed,9_7,14:56:22.840000,3,GAIN,low,...,8.0,22185.804354,1.129109,18,1,1,14,0.927027,0,2
3,2001,1,4,1,ed,9_7,14:56:22.840000,4,GAIN,low,...,8.0,22206.712546,1.060679,18,1,1,16,0.967365,0,2
4,2001,1,5,1,ed,9_7,14:56:22.840000,5,GAIN,low,...,8.0,22223.815899,1.148896,18,1,1,22,0.985748,0,2


#### ED-3

In [25]:
main_df_3.head()

Unnamed: 0,subjnum,block,trial,agegroup,genderjudgment,bankaccount,bypassed,confidence,date,emotionresponse,...,originaltrialnumber,practice,stockfractallocation,stockfractallocationtype,stockpic,stocktext,stocktextlocation,stockvalue,trueprob,study
0,301,5,1,1,1,-6.0,0,8.0,11041300,58,...,1.0,3,"(565, 540)",L,fractal12b.jpg,-$2 or -$10,"(640, 510)",-2,0.7,3
1,301,5,2,1,1,-12.0,0,8.0,11041300,58,...,1.0,3,"(1355, 540)",R,fractal12b.jpg,-$2 or -$10,"(1280, 510)",-10,0.5,3
2,301,5,3,1,1,-18.0,0,7.0,11041300,58,...,1.0,3,"(1355, 540)",R,fractal12b.jpg,-$2 or -$10,"(1280, 510)",-10,0.3,3
3,301,5,4,1,1,-20.0,0,6.0,11041300,58,...,1.0,3,"(1355, 540)",R,fractal12b.jpg,-$2 or -$10,"(1280, 510)",-2,0.5,3
4,301,5,5,1,1,-30.0,0,7.0,11041300,58,...,1.0,3,"(1355, 540)",R,fractal12b.jpg,-$2 or -$10,"(1280, 510)",-10,0.3,3


### Unified columns

In [26]:
# sort=True keeps the alphabetical column order and silences the FutureWarning
main_df_all = pd.concat([main_df_1, main_df_2, main_df_3], sort=True)


In [27]:
print(clean.clean_stockchosen.__doc__)


    Intended for use with DataFrame.apply()

    Composes a boolean 'stockchosen' column from atomic indicators:

    - Whether the stock was on the left or right side of the screen
    - Which button was pressed at selection (left or right)
    


In [28]:
print(clean.clean_bondpic.__doc__)


    Intended for use with DataFrame.apply()

    Calls the split function from os.path on the 'bondpic' element
    


In [29]:
main_df_all['stockchosen'] = main_df_all.apply(clean.clean_stockchosen, axis=1)
main_df_all['bondpic'] = main_df_all.apply(clean.clean_bondpic, axis=1)
main_df_all['stockpic'] = main_df_all.apply(clean.clean_stockpic, axis=1)
len(main_df_all)

18786
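Going only by its docstring, `clean_bondpic` reduces each `bondpic` entry to a bare file name with `os.path.split`. The function below is a hypothetical stand-in written for illustration, not the code in `_utils.clean`; the directory prefix in the toy data is also made up.

```python
import os.path

import pandas as pd


def strip_directory(row, column='bondpic'):
    """Hypothetical stand-in for clean.clean_bondpic: keep only the file name."""
    # os.path.split returns (head, tail); the tail is the final path component.
    return os.path.split(row[column])[1]


toy = pd.DataFrame({
    'bondpic': ['stimuli/fractals/fractal9b.jpg', 'fractal12b.jpg'],
})
toy['bondpic'] = toy.apply(strip_directory, axis=1)
```

Entries that are already bare file names pass through unchanged, which is what lets the same cleanup run over all three studies.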

In [30]:
main_df_all.head()

Unnamed: 0,agegroup,bankaccount,block,bondfractallocation,bondfractallocationtype,bondnumber,bondpic,bondtext,bondtextlocation,bondvalue,...,stockpic,stocktext,stocktextlocation,stockvalue,study,subjnum,time,trial,trialnumbydomdist,trueprob
0,1,-6.0,1,,,9.0,fractal9b.jpg,,,,...,fractal16b.jpg,,,-10,1,100,11:31:01.963000,1,1.0,0.3
1,1,-12.0,1,,,9.0,fractal9b.jpg,,,,...,fractal16b.jpg,,,-10,1,100,11:31:01.963000,2,2.0,0.155172
2,1,-18.0,1,,,9.0,fractal9b.jpg,,,,...,fractal16b.jpg,,,-2,1,100,11:31:01.963000,3,3.0,0.3
3,1,-24.0,1,,,9.0,fractal9b.jpg,,,,...,fractal16b.jpg,,,-2,1,100,11:31:01.963000,4,4.0,0.5
4,1,-26.0,1,,,9.0,fractal9b.jpg,,,,...,fractal16b.jpg,,,-2,1,100,11:31:01.963000,5,5.0,0.7


# Fractal Memory

#### ED-1

In [31]:
frac_df_1['oldfractal'] = frac_df_1.apply(clean.clean_paths, axis=1)

In [32]:
frac_lil_df_1 = frac_df_1[['subjectid','oldfractal','judgment']].sort_values(['subjectid','oldfractal'])

In [33]:
frac_lil_bond_df_1 = frac_lil_df_1.rename(columns={
    'subjectid':'subjnum','oldfractal':'bondpic','judgment':'bondmem'
})

In [34]:
frac_lil_stock_df_1 = frac_lil_df_1.rename(columns={
    'subjectid':'subjnum','oldfractal':'stockpic','judgment':'stockmem'
})

#### ED-2

In [35]:
frac_df_2['oldfractal'] = frac_df_2.apply(clean.clean_paths, axis=1)

In [36]:
frac_lil_df_2 = frac_df_2[['subjectid','oldfractal','judgment']].sort_values(['subjectid','oldfractal'])

In [37]:
frac_lil_bond_df_2 = frac_lil_df_2.rename(columns={
    'subjectid':'subjnum','oldfractal':'bondpic','judgment':'bondmem'
})

In [38]:
frac_lil_stock_df_2 = frac_lil_df_2.rename(columns={
    'subjectid':'subjnum','oldfractal':'stockpic','judgment':'stockmem'
})

#### ED-3

In [39]:
# .copy() makes an independent frame, so the assignment below does not raise SettingWithCopyWarning
frac_lil_df_3 = frac_df_3[['originalparticipant','correctfractal','selection','correctfractallocation']].copy()
frac_lil_df_3['selection'] = frac_lil_df_3.apply(clean.clean_selection, axis=1)


In [40]:
frac_lil_bond_df_3 = frac_lil_df_3.rename(columns={
    'originalparticipant':'subjnum',
    'correctfractal':'bondpic',
    'selection':'bondmem'
}).drop(columns='correctfractallocation')

frac_lil_stock_df_3 = frac_lil_df_3.rename(columns={
    'originalparticipant':'subjnum',
    'correctfractal':'stockpic',
    'selection':'stockmem'
}).drop(columns='correctfractallocation')

## Concatenate ED-1, ED-2, ED-3 Fractal Memory

In [41]:
frac_lil_bond_df = pd.concat([
    frac_lil_bond_df_1, frac_lil_bond_df_2, frac_lil_bond_df_3
])

frac_lil_stock_df = pd.concat([
    frac_lil_stock_df_1, frac_lil_stock_df_2, frac_lil_stock_df_3
])
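The bond and stock frames above are built with the same select-and-rename pattern for each study. One way to fold that into a single function is sketched here; `as_memory_frame` and its parameters are hypothetical names, with defaults matching the ED-1 and ED-2 column names.

```python
import pandas as pd


def as_memory_frame(frac_df, role, id_col='subjectid',
                    pic_col='oldfractal', resp_col='judgment'):
    """Hypothetical helper: relabel a fractal-memory frame for 'bond' or 'stock'."""
    return frac_df[[id_col, pic_col, resp_col]].rename(columns={
        id_col: 'subjnum',
        pic_col: role + 'pic',
        resp_col: role + 'mem',
    })


# Toy fractal-memory data in the ED-1 / ED-2 layout.
toy = pd.DataFrame({
    'subjectid': [100, 100],
    'oldfractal': ['fractal9b.jpg', 'fractal16b.jpg'],
    'judgment': [1, 0],
})
bond = as_memory_frame(toy, 'bond')
stock = as_memory_frame(toy, 'stock')
```

For ED-3 the same helper would be called with `id_col='originalparticipant'`, `pic_col='correctfractal'`, and `resp_col='selection'`.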

# Face Memory

#### ED-1

In [42]:
face_lil_df_1 = face_df_1[['subjectid','face','subjresp']]
face_lil_df_1 = face_lil_df_1.rename(columns={
    'subjectid':'subjnum','face':'facepic','subjresp':'facemem'
})

#### ED-2

In [43]:
face_lil_df_2 = face_df_2[['subjectid','face','subjresp']]
face_lil_df_2 = face_lil_df_2.rename(columns={
    'subjectid':'subjnum','face':'facepic','subjresp':'facemem'
})

#### ED-3

In [44]:
face_lil_df_3 = face_df_3[
    ['originalparticipant','facefile','selection']
].rename(columns={
    'originalparticipant':'subjnum',
    'facefile':'facepic',
    'selection':'facemem'
})

## Concatenate ED-1, ED-2, ED-3

In [45]:
face_lil_df = pd.concat([
    face_lil_df_1, face_lil_df_2, face_lil_df_3
])

# Reintroduce contextual memory data

In [46]:
main_df_all = main_df_all.merge(frac_lil_bond_df, how='left')
main_df_all = main_df_all.merge(frac_lil_stock_df, how='left')
main_df_all = main_df_all.merge(face_lil_df, how='left')
# unified_main_frame[['subjnum','stockpic','bondpic','stockmem','bondmem']]
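The merges above rely on pandas choosing the join keys from whatever columns the two frames share. A sketch of the same left merge with the keys spelled out and a guard against duplicated memory rows is shown below, on toy data; that `subjnum` and `bondpic` are the shared keys is an assumption based on the renames earlier in the notebook.

```python
import pandas as pd

# Toy main-task rows: the same fractal can appear on several trials.
main = pd.DataFrame({
    'subjnum': [100, 100, 101],
    'bondpic': ['fractal9b.jpg', 'fractal9b.jpg', 'fractal3b.jpg'],
})

# Toy memory data: one judgment per (subject, fractal).
bond_mem = pd.DataFrame({
    'subjnum': [100],
    'bondpic': ['fractal9b.jpg'],
    'bondmem': [1],
})

merged = main.merge(
    bond_mem,
    how='left',                  # keep every main-task row
    on=['subjnum', 'bondpic'],   # explicit join keys
    validate='many_to_one',      # raise if a key repeats in bond_mem
)
```

With `validate='many_to_one'`, a duplicated memory entry raises an error instead of silently adding rows to the main-task data.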

In [47]:
trials=[]
for s in range(len(main_df_all.subjnum.unique())):
    for t in range(1,73):
        trials.append(t)

In [48]:
blocks=[]
for s in range(len(main_df_all.subjnum.unique())):
    for b in range(1,13):
        for x in range(6):
            blocks.append(b)

## Check size
Final merged DataFrame compared to expected number of blocks & trials:

In [49]:
print(len(blocks))
print(len(trials))
print(len(main_df_all))

18792
18792
18786


6 ED-3 subjects are each missing a trial, so the merged DataFrame is 6 rows short of the expected 18792 (261 subjects × 72 trials), and the trial and block numbers won't line up here.
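The expected trial and block sequences can also be generated without nested loops. This is a sketch with NumPy, using the 261 subjects, 12 blocks, and 6 trials per block reported in this notebook; the variable names are illustrative.

```python
import numpy as np

n_subjects = 261       # 88 + 101 + 72 subjects after exclusions
n_blocks = 12
trials_per_block = 6

# Trial numbers 1..72, repeated once per subject.
trials = np.tile(np.arange(1, n_blocks * trials_per_block + 1), n_subjects)

# Block numbers 1..12, each held for 6 trials, repeated once per subject.
blocks = np.tile(np.repeat(np.arange(1, n_blocks + 1), trials_per_block), n_subjects)
```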

# Drop Unnecessary Columns?

In [50]:
df_all_drop_columns = [
    'agegroup','experimentername','date','time','trialnumbydomdist',
    'choicest','outcomest','esttaskst',
    'confidencest','stocknumber','bondnumber','genderjudgment',
    'fractalchosen','estwithinrange?','confidencert',
    'practice','bubblefile','bondvalue','stocktext','bondtext',
    'stocktextlocation','bondtextlocation','emotionresponse','bypassed',
    'correctfractallocation','incorrectfractallocation','paymentaccuracy','phase',
    'stockfractallocation','bondfractallocation','stockfractallocationtype','bondfractallocationtype',
    'showinstruction','gender','selection','cueonleft','cueonright',
    'correctfractal','incorectfractal','oldfaceequalstrue','facefile','facekeypressed',
    'originalsubjectnumber','originalparticipantnumber','originaltrialnumber','originaltrailnumber',
    'fracdomain','facedomain','fracmagnitude','facestockvalue',
]

We're not using any of the columns listed above. Nothing requires removing them, but dropping them makes the output cleaner and easier to read.

In [51]:
main_df_all = main_df_all.drop(df_all_drop_columns, axis=1).set_index([
    'subjnum','block','trial'
]).reset_index()
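Because the drop list mixes columns from three differently structured studies, a name that is absent from the merged frame would raise a `KeyError`. A tolerant variant is sketched below on a toy frame; whether silent skipping is wanted here is a judgment call, since it also hides misspelled names.

```python
import pandas as pd

toy = pd.DataFrame({
    'subjnum': [100],
    'block': [1],
    'trial': [1],
    'time': ['11:31:01'],
    'choicert': [1.2],
})

# 'bubblefile' is absent from this toy frame; errors='ignore' skips it.
unwanted = ['time', 'bubblefile']
slim = toy.drop(columns=unwanted, errors='ignore')
```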

# Output

Only write the output once all data is fully aligned and homogenized.

**ALL** cleaning steps should be done before this point.

In [52]:
homog_dir = derivs_dir / '02.homogenized'
if not Path.exists(homog_dir):
    Path.mkdir(homog_dir)

In [53]:
fpath = homog_dir / ('econdec-full_task-main_beh_' + date + '.csv')
main_df_all.to_csv(fpath, index=False)

In [54]:
len(main_df_all.subjnum.unique())

261

For reference:

```
final_columns=['study','subjnum','trial','block','domain','dom',
               'estimation','trueprob','estdiff','valestdiff','valestdiffvalid',
               'choicert','choicerta3sd','choicerti3sd','choicemed12v3','choicemed123',
               'esttaskrt','esttaskrta3sd','esttaskrti3sd',
               'outcomert','outcomerta3sd','outcomerti3sd','outcomemed12','outcomemed123',
               'stockchosen','waschoiceoptimal','optimalchoiceshouldhavebeen',
               'magnitude','stockvalue','absstockval','b4choiceprobability',
               'stockpic','bondpic','facepic','stockmemresp','bondmemresp',
               'studymedchoice','studysplitchoice','studymedoutcome','studysplitoutcome',
               'primemedchoice','primesplitchoice','primemedoutcome','primesplitoutcome']
```