# Realign & Homogenize ED-1, ED-2, ED-3

`2.clean_realign_homogenize_all`

Realign and merge converted data from ED-3 into the concatenated data from ED-1 and ED-2.

Differentiate, import, and reassociate memory data into the main-task trialwise dataset.

### Configuration

In [None]:
from pathlib import Path

import pandas as pd
from datetime import datetime

from _utils import clean

In [None]:
date = datetime.today().strftime('%y%m%d')

In [None]:
from config import derivatives_dir as derivs_dir
allsub_dir = derivs_dir / '00.allsub'

## Pull Concatenated Taskwise data

#### ED-1

In [None]:
main_fpath_1 = allsub_dir / ('econdec-1_task-main_beh_' + date + '.csv')
frac_fpath_1 = allsub_dir / ('econdec-1_task-frac_beh_' + date + '.csv')
face_fpath_1 = allsub_dir / ('econdec-1_task-face_beh_' + date + '.csv')

In [None]:
main_df_1 = clean.smooth_columns(pd.read_csv(main_fpath_1))
frac_df_1 = clean.smooth_columns(pd.read_csv(frac_fpath_1))
face_df_1 = clean.smooth_columns(pd.read_csv(face_fpath_1))

#### ED-2

In [None]:
main_fpath_2 = allsub_dir / ('econdec-2_task-main_beh_' + date + '.csv')
frac_fpath_2 = allsub_dir / ('econdec-2_task-frac_beh_' + date + '.csv')
face_fpath_2 = allsub_dir / ('econdec-2_task-face_beh_' + date + '.csv')

In [None]:
main_df_2 = clean.smooth_columns(pd.read_csv(main_fpath_2))
frac_df_2 = clean.smooth_columns(pd.read_csv(frac_fpath_2))
face_df_2 = clean.smooth_columns(pd.read_csv(face_fpath_2))

#### ED-3

In [None]:
main_fpath_3 = allsub_dir / ('econdec-3_task-main_beh_' + date + '.csv')
frac_fpath_3 = allsub_dir / ('econdec-3_task-frac_beh_' + date + '.csv')
face_fpath_3 = allsub_dir / ('econdec-3_task-face_beh_' + date + '.csv')

In [None]:
main_df_3 = clean.eye_cleanup(clean.smooth_columns(pd.read_csv(main_fpath_3)))
frac_df_3 = clean.smooth_columns(pd.read_csv(frac_fpath_3))
face_df_3 = clean.smooth_columns(pd.read_csv(face_fpath_3))

The overall trial count in ED-3 is handled by `eye_cleanup`, a utility function written for this purpose and already applied to `main_df_3` above:

In [None]:
print(clean.eye_cleanup.__doc__)

## Note

I'm unsure whether the above (repetitive), or one of the options below (unintuitive) is a cleaner way to represent the data corpus at this stage.

1. Still pretty repetitive here, but readable. Sets up better code efficiency later. (I'm leaning towards this option)

2. Harder to read, better code efficiency *now and later*.
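
For comparison, here is a minimal sketch of what a more compact, loop-based layout could look like. The dictionary layout, the `build_paths` helper, and the example directory and date stamp are illustrative assumptions, not part of the pipeline; only the filename pattern is taken from the cells above.

```python
from pathlib import Path

STUDIES = (1, 2, 3)
TASKS = ("main", "frac", "face")


def build_paths(allsub_dir, date):
    """Return {(study, task): path} following the filename pattern used above."""
    return {
        (study, task): Path(allsub_dir) / f"econdec-{study}_task-{task}_beh_{date}.csv"
        for study in STUDIES
        for task in TASKS
    }


# Hypothetical directory and date stamp, for illustration only.
paths = build_paths("derivatives/00.allsub", "240101")
print(paths[(3, "main")].name)
```

Each path could then be read with `pd.read_csv` and passed through `clean.smooth_columns`, with `clean.eye_cleanup` applied only to the ED-3 main-task frame.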

# Homogenize main task column names

In [None]:
from config import new_columns

In [None]:
main_df_1 = main_df_1.rename(columns = new_columns).set_index(['subjnum','block','trial']).reset_index()
main_df_2 = main_df_2.rename(columns = new_columns).set_index(['subjnum','block','trial']).reset_index()
main_df_3 = main_df_3.rename(columns = new_columns).set_index(['subjnum','block','trial']).reset_index()

# Exclude bad subjects

In [None]:
from config import exclusions

In [None]:
main_df_1 = main_df_1[~main_df_1['subjnum'].isin(exclusions)]
main_df_2 = main_df_2[~main_df_2['subjnum'].isin(exclusions)]
main_df_3 = main_df_3[~main_df_3['subjnum'].isin(exclusions)]

In [None]:
print(
    len(main_df_1.subjnum.unique()),
    len(main_df_2.subjnum.unique()),
    len(main_df_3.subjnum.unique()),
)

# Trial Counts

## Check size
Final merged DataFrame compared to expected number of blocks & trials:

Six ED-3 subjects are each missing a trial, so their trial and block numbers won't match up here.

In [None]:
main_df_1.groupby('subjnum').count().iloc[:,0].value_counts()

In [None]:
main_df_2.groupby('subjnum').count().iloc[:,0].value_counts()

In [None]:
main_df_3.groupby('subjnum').count().iloc[:,0].value_counts()

There are 2 problems with the behavioral data from ED-3 above:

1. Subjects should each have 72 trials, but we see 90 for most subjects. This is for two reasons:
    1. Practice blocks are not separate from the Main task in ED-3, adding an additional 2 blocks of 2 trials (total of 4, bringing 72 up to 76)
    2. Each block in the Practice and Main tasks comes with an additional trial row, repeating the data from the last trial in that block. This adds 12 rows for the Main task, and 2 for the Practice task (total of 14, bringing our 76 up to 90).
2. Some (8) subjects are missing exactly 1 of those 90 trials.
    1. We aren't exactly sure why this data was lost, but the EyeLink software seems to have failed to write it into the raw data during some instances of required recalibration of the eye-tracking sensor system.
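
The row arithmetic in the list above can be checked directly. The variable names are illustrative; the counts are the ones stated in the text (the number of Main blocks is taken from the 12 repeated rows, one per block).

```python
# Counts stated in the text above.
expected_main_trials = 72
main_blocks = 12               # one repeated row per Main block
practice_blocks = 2            # one repeated row per Practice block
practice_trials_per_block = 2

practice_trials = practice_blocks * practice_trials_per_block
repeated_rows = main_blocks + practice_blocks
rows_per_subject = expected_main_trials + practice_trials + repeated_rows
print(practice_trials, repeated_rows, rows_per_subject)  # 4 14 90
```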

Now we still have to deal with the subjects who are missing trials. Invoking `eye_cleanup` seems to have cleared up the trial count discrepancy for 2 of these 8 subjects, leaving us with 6. This is curious, as it implies that the missing trials for those 2 were either practice trials or the extraneous block-repeat trial rows. We need to work to clarify why this is.

In any case, the issue with the remaining 6 subjects can be remedied by adding a "dummy" trial in the position of the missing one.
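
One way such a "dummy" trial could be inserted is by reindexing against the full set of expected `(subjnum, block, trial)` combinations. This is a sketch on toy data, not the method used in this pipeline; the toy subject number, the 2 x 3 block layout, and the assumption that trial numbers restart within each block are all illustrative.

```python
import pandas as pd

# Toy data: one subject, 2 blocks x 3 trials, with block 2 / trial 1 missing.
toy = pd.DataFrame({
    "subjnum": [301] * 5,
    "block": [1, 1, 1, 2, 2],
    "trial": [1, 2, 3, 2, 3],
    "choicert": [0.8, 1.1, 0.9, 1.0, 1.2],
})

keys = ["subjnum", "block", "trial"]
full_index = pd.MultiIndex.from_product(
    [toy["subjnum"].unique(), [1, 2], [1, 2, 3]], names=keys
)

# Reindexing adds a row of NaN for every (subject, block, trial) not present.
padded = toy.set_index(keys).reindex(full_index).reset_index()
print(len(padded), int(padded["choicert"].isna().sum()))  # 6 1
```

The dummy row carries `NaN` in every data column, so downstream summaries that skip missing values are unaffected while the trial positions line up again.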

Looks like the 6 missing trials are all in the 1st trial position within their block, judging by the 6 missing trial values at 1.

There's our 6 offenders listed.
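
A check of this kind could be sketched as follows. The toy frame stands in for `main_df_3`; the subject numbers and block layout are made up for illustration.

```python
import pandas as pd

# Toy data: subject 302 lacks trial 1 of block 2; subject 301 is complete.
rows = [(s, b, t) for s in (301, 302) for b in (1, 2) for t in (1, 2, 3)]
rows.remove((302, 2, 1))
toy = pd.DataFrame(rows, columns=["subjnum", "block", "trial"])

# How often each within-block trial position occurs; a shortfall marks the gap.
print(toy["trial"].value_counts().sort_index())

# Subjects whose row count falls below the per-subject maximum.
n_rows = toy.groupby("subjnum").size()
offenders = n_rows[n_rows < n_rows.max()].index.tolist()
print(offenders)  # [302]
```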

# Cleaning

Create `['study']` label for each DataFrame

In [None]:
main_df_1['study'] = main_df_1.apply(clean.label_study, axis=1)
main_df_2['study'] = main_df_2.apply(clean.label_study, axis=1)
main_df_3['study'] = main_df_3.apply(clean.label_study, axis=1)

Put `choicert` and `outcomert` in the same units as ED-1 and ED-2 by scaling the ED-3 values by 0.001 (presumably milliseconds to seconds):

In [None]:
for col in ('choicert','outcomert'):
    main_df_3[col] = main_df_3[col].astype(float) * .001

# Cleaned Output with Exclusions

In [None]:
exclusions_dir = derivs_dir / '01.exclusions'
exclusions_dir.mkdir(exist_ok=True)

In [None]:
main_df_1.to_csv(exclusions_dir / ('econdec-1_task-main_beh_' + date + '.csv'))
main_df_2.to_csv(exclusions_dir / ('econdec-2_task-main_beh_' + date + '.csv'))
main_df_3.to_csv(exclusions_dir / ('econdec-3_task-main_beh_' + date + '.csv'))

# Main task

#### ED-1

In [None]:
main_df_1.head()

#### ED-2

In [None]:
main_df_2.head()

#### ED-3

In [None]:
main_df_3.head()

### Unified columns

In [None]:
main_df_all = pd.concat([main_df_1, main_df_2, main_df_3], sort=True)

In [None]:
main_df_all['stockchosen'] = main_df_all.apply(clean.clean_stockchosen, axis=1)
main_df_all['bondpic'] = main_df_all['bondpic'].map(clean.clean_fpath)
main_df_all['stockpic'] = main_df_all['stockpic'].map(clean.clean_fpath)
len(main_df_all)

In [None]:
main_df_all.head()

# Fractal Memory

#### ED-1

In [None]:
frac_df_1['oldfractal'] = frac_df_1['oldfractal'].map(lambda x : Path(x).name)
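
A note on the `Path(x).name` call: how the filename is extracted depends on which separators the stored paths use and on the platform running the notebook. The path strings below are hypothetical; the actual format of `oldfractal` in the raw files is not shown here.

```python
from pathlib import PurePosixPath, PureWindowsPath

posix_style = "stimuli/fractals/frac_01.png"       # hypothetical value
windows_style = "stimuli\\fractals\\frac_01.png"   # hypothetical value

print(PurePosixPath(posix_style).name)      # frac_01.png
print(PureWindowsPath(windows_style).name)  # frac_01.png
# On a POSIX system, backslashes are not treated as separators:
print(PurePosixPath(windows_style).name)    # stimuli\fractals\frac_01.png
# PureWindowsPath accepts both separators, so it is the safer choice
# if the raw files might contain either style.
print(PureWindowsPath(posix_style).name)    # frac_01.png
```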

In [None]:
frac_lil_df_1 = frac_df_1[['subjectid','oldfractal','judgment']].sort_values(['subjectid','oldfractal'])

In [None]:
frac_lil_bond_df_1 = frac_lil_df_1.rename(columns={
    'subjectid':'subjnum','oldfractal':'bondpic','judgment':'bondmem'
})

In [None]:
frac_lil_stock_df_1 = frac_lil_df_1.rename(columns={
    'subjectid':'subjnum','oldfractal':'stockpic','judgment':'stockmem'
})

#### ED-2

In [None]:
frac_df_2['oldfractal'] = frac_df_2['oldfractal'].map(lambda x : Path(x).name)

In [None]:
frac_lil_df_2 = frac_df_2[['subjectid','oldfractal','judgment']].sort_values(['subjectid','oldfractal'])

In [None]:
frac_lil_bond_df_2 = frac_lil_df_2.rename(columns={
    'subjectid':'subjnum','oldfractal':'bondpic','judgment':'bondmem'
})

In [None]:
frac_lil_stock_df_2 = frac_lil_df_2.rename(columns={
    'subjectid':'subjnum','oldfractal':'stockpic','judgment':'stockmem'
})

#### ED-3

In [None]:
# .copy() avoids assigning into a view of frac_df_3 (SettingWithCopyWarning)
frac_lil_df_3 = frac_df_3[['originalparticipant','correctfractal','selection','correctfractallocation']].copy()
frac_lil_df_3['selection'] = frac_lil_df_3.apply(clean.clean_selection, axis=1)

In [None]:
frac_lil_bond_df_3 = frac_lil_df_3.rename(columns={
    'originalparticipant':'subjnum',
    'correctfractal':'bondpic',
    'selection':'bondmem'
}).drop(columns='correctfractallocation')

frac_lil_stock_df_3 = frac_lil_df_3.rename(columns={
    'originalparticipant':'subjnum',
    'correctfractal':'stockpic',
    'selection':'stockmem'
}).drop(columns='correctfractallocation')

## Concatenate ED-1, ED-2, ED-3 Fractal Memory

In [None]:
frac_lil_bond_df = pd.concat([
    frac_lil_bond_df_1, frac_lil_bond_df_2, frac_lil_bond_df_3
])

frac_lil_stock_df = pd.concat([
    frac_lil_stock_df_1, frac_lil_stock_df_2, frac_lil_stock_df_3
])

# Face Memory

#### ED-1

In [None]:
face_lil_df_1 = face_df_1[['subjectid','face','subjresp']]
face_lil_df_1 = face_lil_df_1.rename(columns={
    'subjectid':'subjnum','face':'facepic','subjresp':'facemem'
})

#### ED-2

In [None]:
face_lil_df_2 = face_df_2[['subjectid','face','subjresp']]
face_lil_df_2 = face_lil_df_2.rename(columns={
    'subjectid':'subjnum','face':'facepic','subjresp':'facemem'
})

#### ED-3

In [None]:
face_lil_df_3 = face_df_3[
    ['originalparticipant','facefile','selection']
].rename(columns={
    'originalparticipant':'subjnum',
    'facefile':'facepic',
    'selection':'facemem'
})

## Concatenate ED-1, ED-2, ED-3

In [None]:
face_lil_df = pd.concat([
    face_lil_df_1, face_lil_df_2, face_lil_df_3
])

# Reintroduce contextual memory data

In [None]:
main_df_all = main_df_all.merge(frac_lil_bond_df, how='left')
main_df_all = main_df_all.merge(frac_lil_stock_df, how='left')
main_df_all = main_df_all.merge(face_lil_df, how='left')
# main_df_all[['subjnum','stockpic','bondpic','stockmem','bondmem']]
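
Because these merges do not pass `on=`, pandas joins on whatever columns the two frames share. A more defensive variant, sketched here on toy data, names the keys explicitly and uses the `validate` argument of `DataFrame.merge` so that duplicated keys in the memory table raise an error instead of silently multiplying trial rows. The toy values are illustrative, and whether each subject/picture pair is unique in the real memory tables is an assumption that this check would test.

```python
import pandas as pd

main = pd.DataFrame({
    "subjnum": [1, 1, 2],
    "bondpic": ["a.png", "a.png", "b.png"],
    "stockpic": ["x.png", "y.png", "z.png"],
})
bond_mem = pd.DataFrame({
    "subjnum": [1, 2],
    "bondpic": ["a.png", "b.png"],
    "bondmem": [1, 0],
})

merged = main.merge(
    bond_mem,
    how="left",
    on=["subjnum", "bondpic"],   # the keys the default merge would infer here
    validate="many_to_one",      # raises MergeError if bond_mem repeats a key
)
print(len(merged) == len(main), merged["bondmem"].tolist())  # True [1, 1, 0]
```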

# Drop Unnecessary Columns?

In [None]:
df_all_drop_columns = [
    'agegroup','experimentername','date','time','trialnumbydomdist',
    'choicest','outcomest','esttaskst',
    'confidencest','stocknumber','bondnumber','genderjudgment',
    'fractalchosen','estwithinrange?','confidencert',
    'practice','bubblefile','bondvalue','stocktext','bondtext',
    'stocktextlocation','bondtextlocation','emotionresponse','bypassed',
    'correctfractallocation','incorrectfractallocation','paymentaccuracy','phase',
    'stockfractallocation','bondfractallocation','stockfractallocationtype','bondfractallocationtype',
    'showinstruction','gender','selection','cueonleft','cueonright',
    'correctfractal','incorectfractal','oldfaceequalstrue','facefile','facekeypressed',
    'originalsubjectnumber','originalparticipantnumber','originaltrialnumber','originaltrailnumber',
    'fracdomain','facedomain','fracmagnitude','facestockvalue',
]

We're not using any of the columns listed above. There's no real reason to remove the data, but it makes the output cleaner and easier to look at without all the extraneous information.

In [None]:
main_df_all = main_df_all.drop(df_all_drop_columns, axis=1).set_index([
    'subjnum','block','trial'
]).reset_index()
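
By default, `DataFrame.drop` raises a `KeyError` if any listed label is absent. Whether every column in `df_all_drop_columns` exists in the merged frame is not verified here, so a more tolerant variant might pass `errors='ignore'`, as in this sketch on toy data (the toy frame and its column list are illustrative).

```python
import pandas as pd

toy = pd.DataFrame({"subjnum": [1], "block": [1], "trial": [1], "date": ["x"]})
drop_columns = ["date", "time"]   # 'time' is absent from this toy frame

# Default behaviour: a KeyError names the labels that were not found.
try:
    toy.drop(columns=drop_columns)
except KeyError as err:
    print("KeyError:", err)

# errors='ignore' drops only the labels that exist.
trimmed = toy.drop(columns=drop_columns, errors="ignore")
print(trimmed.columns.tolist())  # ['subjnum', 'block', 'trial']
```

The trade-off is that `errors='ignore'` also hides misspelled column names, so the strict default is preferable when the list is meant to match the data exactly.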

# Output

Only when all data is fully aligned and homogenized.

**ALL** cleaning steps should be done before this point.

In [None]:
homog_dir = derivs_dir / '02.homogenized'
homog_dir.mkdir(exist_ok=True)

In [None]:
fpath = homog_dir  / ('econdec-full_task-main_beh_' + date + '.csv')
main_df_all.to_csv(fpath, index=False)

In [None]:
len(main_df_all.subjnum.unique())

For reference:

```
final_columns=['study','subjnum','trial','block','domain','dom',
               'estimation','trueprob','estdiff','valestdiff','valestdiffvalid',
               'choicert','choicerta3sd','choicerti3sd','choicemed12v3','choicemed123',
               'esttaskrt','esttaskrta3sd','esttaskrti3sd',
               'outcomert','outcomerta3sd','outcomerti3sd','outcomemed12','outcomemed123',
               'stockchosen','waschoiceoptimal','optimalchoiceshouldhavebeen',
               'magnitude','stockvalue','absstockval','b4choiceprobability',
               'stockpic','bondpic','facepic','stockmemresp','bondmemresp',
               'studymedchoice','studysplitchoice','studymedoutcome','studysplitoutcome',
               'primemedchoice','primesplitchoice','primemedoutcome','primesplitoutcome']
```