# Data Science in Psychology & Neuroscience (DSPN): 

## Lecture 11. Data Wrangling (part 2, CONTINUED)

### Date: September 26, 2023

### To-Dos From Last Class:

* Download the dataset for today's wrangling session #1 from <a href="https://github.com/hogeveen-lab/DSPN_Fall2022_Git/tree/master/misc_exercises/imitation_inhibition_paradigm">GitHub</a>

### Today:

* Wrangle some real data

### Homework

* Download <a href="https://github.com/hogeveen-lab/DSPN_Fall2023_Git/tree/main/assignment_starters/assign3_starter">Assignment #3 starter kit</a>

# Data wrangling in Pandas

## (finishing from last class) 5. Combining Data Sets 

<img src="img/combining_data.png" width="600">

In [1]:
import pandas as pd
import numpy as np

# creating random integer lists to populate made up data frames
x = np.random.randint(10,20,3)
y = np.random.randint(20,30,3)

# putting together some example data frames
df1 = pd.DataFrame({'pid' : ['P1','P2','P3'],
                   'var1' : [x[0],x[1],x[2]]})
# display(df1)
df2 = pd.DataFrame({'pid' : ['P1','P2','P4'],
                   'var2' : [y[0],y[1],y[2]]})
# display(df2)

# Standard merge approach #1
print('Merge LEFT (right --> left; lose unique rows from the RIGHT df)')
merge_left = pd.merge(df1,df2,on='pid',how='left')
display(merge_left)

# Standard merge approach #2
print('Merge RIGHT (left --> right; lose unique rows from the LEFT df)')
merge_right = pd.merge(df1,df2,on='pid',how='right')
display(merge_right)

# Standard merge approach #3
print('Merge INNER (lose unique data from BOTH dfs)')
merge_inner = pd.merge(df1,df2,on='pid',how='inner')
display(merge_inner)

# Standard merge approach #4
print('Merge OUTER (retain ALL the data)')
merge_outer = pd.merge(df1,df2,on='pid',how='outer')
display(merge_outer)

Merge LEFT (right --> left; lose unique rows from the RIGHT df)


Unnamed: 0,pid,var1,var2
0,P1,12,28.0
1,P2,15,22.0
2,P3,10,


Merge RIGHT (left --> right; lose unique rows from the LEFT df)


Unnamed: 0,pid,var1,var2
0,P1,12.0,28
1,P2,15.0,22
2,P4,,20


Merge INNER (lose unique data from BOTH dfs)


Unnamed: 0,pid,var1,var2
0,P1,12,28
1,P2,15,22


Merge OUTER (retain ALL the data)


Unnamed: 0,pid,var1,var2
0,P1,12.0,28.0
1,P2,15.0,22.0
2,P3,10.0,
3,P4,,20.0
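
To check which rows each merge keeps or drops, one option is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column naming the source of every row. The sketch below uses fixed, made-up values (the names `left` and `right` are only for illustration) instead of the random integers above, so it prints the same thing on every run.

```python
import pandas as pd

# Toy frames mirroring df1/df2 above, with fixed (made-up) values
left = pd.DataFrame({'pid': ['P1', 'P2', 'P3'], 'var1': [12, 15, 10]})
right = pd.DataFrame({'pid': ['P1', 'P2', 'P4'], 'var2': [28, 22, 20]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = pd.merge(left, right, on='pid', how='outer', indicator=True)
print(merged)

# Count how many rows came from each source
merge_counts = merged['_merge'].value_counts().to_dict()
print(merge_counts)
```

Filtering on `_merge` is a quick way to list the participants that a left, right, or inner merge would silently drop.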


## Tangent: Because R.T. asked some great questions recently!

### 1. How to generate random lists withOUT replacement

In [2]:
import random

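# np.random.randint samples WITH replacement, so values can repeat
y = np.random.randint(20,30,3)
# random.sample draws WITHOUT replacement, so every value is unique
z = random.sample(range(20,30),3)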

print(y,z)

[23 26 20] [23, 27, 20]
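
If you would rather stay inside NumPy, a similar result can be sketched with a random generator's `choice` method and `replace=False`. The seed and the name `no_repeats` are assumptions added here only to make the example reproducible and readable.

```python
import numpy as np

# Seeded only so that this sketch is reproducible
rng = np.random.default_rng(seed=0)

# replace=False guarantees that no value is drawn twice
no_repeats = rng.choice(np.arange(20, 30), size=3, replace=False)
print(no_repeats)
```

Unlike `random.sample`, this returns a NumPy array, which can be convenient when the values go straight into a data frame.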


### 2. How to pivot wide with grouping variables as columns, NOT indices

In [3]:
# What if we have some variable(s) we DON'T want to lengthen?
df_with_group = pd.DataFrame({'pid': [1,2,3,4],
                              'grp' : [1, 1, 2, 2],
                              'var1' : [4, 5, 6, 7],
                              'var2' : [8, 9, 10, 11],
                              'var3' : [12, 13, 14, 15]})
df_long = pd.melt(df_with_group,id_vars=['pid','grp'])
display(df_long)

df_wide = df_long.pivot(index=['pid','grp'], columns='variable', values='value')
display(df_wide)
df_wide_reset = df_wide.reset_index().rename_axis(None,axis=1)
display(df_wide_reset)

Unnamed: 0,pid,grp,variable,value
0,1,1,var1,4
1,2,1,var1,5
2,3,2,var1,6
3,4,2,var1,7
4,1,1,var2,8
5,2,1,var2,9
6,3,2,var2,10
7,4,2,var2,11
8,1,1,var3,12
9,2,1,var3,13
10,3,2,var3,14
11,4,2,var3,15


,variable,var1,var2,var3
pid,grp,,,
1,1,4,8,12
2,1,5,9,13
3,2,6,10,14
4,2,7,11,15


Unnamed: 0,pid,grp,var1,var2,var3
0,1,1,4,8,12
1,2,1,5,9,13
2,3,2,6,10,14
3,4,2,7,11,15
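
One way to convince yourself that melting and pivoting are inverse operations is to run the round trip and compare the result with the starting data frame. This is a minimal sketch on a smaller made-up frame; `df_back` is a name introduced here, and passing a list to `index=` in `pivot` assumes a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

df_with_group = pd.DataFrame({'pid': [1, 2, 3, 4],
                              'grp': [1, 1, 2, 2],
                              'var1': [4, 5, 6, 7],
                              'var2': [8, 9, 10, 11]})

# wide -> long, keeping pid and grp as identifier columns
df_long = pd.melt(df_with_group, id_vars=['pid', 'grp'])

# long -> wide, then turn the (pid, grp) index back into ordinary columns
df_back = (df_long
           .pivot(index=['pid', 'grp'], columns='variable', values='value')
           .reset_index()
           .rename_axis(None, axis=1))

# Raises AssertionError if the two frames differ in values, dtypes, or labels
pd.testing.assert_frame_equal(df_back, df_with_group)
print('Round trip recovered the original wide data frame')
```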


# Wrangle some real data

<img src="img/imit_inhib_fileorg.png" width=500>

## Breaking into 8 code chunks
## 1. Import packages

In [1]:
### Part 1 --> Importing data wrangling packages I often use
import os
from glob import glob # only need the glob() function from the glob module
import numpy as np
import pandas as pd
# silence pandas' SettingWithCopyWarning when adding columns to filtered data frames
pd.options.mode.chained_assignment = None
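
Setting `chained_assignment` to `None` hides the `SettingWithCopyWarning` that pandas can raise when you add a column to a filtered data frame. An alternative, sketched here on made-up data with hypothetical column names (`block`, `rt`, `rt_sec`), is to call `.copy()` right after filtering so the subset is explicitly independent of the original.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for a participant file
df = pd.DataFrame({'block': ['practice', 'AI_Block', 'AI_Block'],
                   'rt': [700, 650, 600]})

# .copy() makes the subset an independent data frame, so adding a column
# afterwards does not touch (or warn about) the original
subset = df[df['block'] == 'AI_Block'].copy()
subset['rt_sec'] = subset['rt'] / 1000

print(subset)
```

Either approach works for this lecture; the explicit copy just makes it obvious which object is being modified.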

## 2. Setting paths to the first level data

In [11]:
### Part 2 --> setting paths to the first level data

# get current working directory
base_dir = os.getcwd()

# Build paths to the first- and second-level data folders inside the current working directory
first_dir = os.path.join(base_dir,'misc/imitation_inhibition_paradigm/data/first')
second_dir = os.path.join(base_dir,'misc/imitation_inhibition_paradigm/data/second')
questionnaire_file = os.path.join(second_dir,'ait_questionnaires.csv')
# print(os.path.join(first_dir,'P*.txt'))
# Using glob to find all participant data files
all_files = glob(os.path.join(first_dir,'P*.txt'))
# print(all_files)

In [14]:
# Same glob search with a hard-coded absolute path (machine-specific; point this at your own copy of the data)
all_files = glob('/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P*.txt')
# print(all_files)
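
`glob` returns files in no guaranteed order and returns an empty list, without complaint, when the path is wrong. The helper below is a hedged sketch of one way to guard against both; `find_participant_files` is a hypothetical name, and it assumes file names of the form `P<number>.txt`. It is demonstrated on a temporary folder of empty, made-up files, not on the real data.

```python
import os
import re
import tempfile
from glob import glob


def find_participant_files(data_dir, pattern='P*.txt'):
    """Return participant files in data_dir, sorted by participant number."""
    files = glob(os.path.join(data_dir, pattern))
    if not files:
        raise FileNotFoundError(f'No files matching {pattern} in {data_dir}')

    def participant_number(path):
        # Pull the digits out of a name such as 'P41.txt'
        match = re.search(r'\d+', os.path.basename(path))
        return int(match.group()) if match else -1

    return sorted(files, key=participant_number)


# Demonstration on a throwaway folder with empty, made-up files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['P8.txt', 'P41.txt', 'P10.txt']:
        open(os.path.join(tmp_dir, name), 'w').close()
    found = [os.path.basename(f) for f in find_participant_files(tmp_dir)]

print(found)
```

Sorting numerically means P8 comes before P10, which a plain alphabetical sort would not do.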

## 3. Load a test subject to make sense of things

In [31]:
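# load in an individual subject file
# (on_bad_lines='skip' needs pandas >= 1.3; it replaces error_bad_lines=False, which was removed in pandas 2.0)
df_test = pd.read_csv(all_files[0],on_bad_lines='skip',skiprows=5,sep='\t')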
# subset to only experimental rows
df_test = df_test[df_test['Name.1']=='AI_Block']
# subset to only key releases
df_test = df_test[df_test['Released']=='Released']

# How many key releases do we actually have?
print('How many rows in key release filtered data frame:',len(df_test))

# identify double responses
df_test['shift'] = df_test['Name.2'].shift()
df_test['double_response'] = np.where(df_test['Name.2']==df_test['shift'],1,0)
df_test = df_test[df_test['double_response']!=1]
print('How many rows in key release filtered data frame, after removing double responses?:',len(df_test))


How many rows in key release filtered data frame: 101
How many rows in key release filtered data frame, after removing double responses?: 100
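
The double-response filter works by comparing each row's trial label with the label on the row before it. Here is a minimal sketch of that logic on four made-up rows (the column names `trial`, `rt`, and `previous` are illustrative, not the real file's headers).

```python
import numpy as np
import pandas as pd

# Made-up trial labels; 'AI_Trial, 5' appears twice in a row
toy = pd.DataFrame({'trial': ['AI_Trial, 7', 'AI_Trial, 5', 'AI_Trial, 5', 'AI_Trial, 9'],
                    'rt': [756, 532, 40, 616]})

# shift() moves every value down one row, so each row sees the previous label
toy['previous'] = toy['trial'].shift()
toy['double_response'] = np.where(toy['trial'] == toy['previous'], 1, 0)
print(toy)

# Keep only the first release within each run of identical trial labels
toy_clean = toy[toy['double_response'] != 1]
print(len(toy), '->', len(toy_clean))
```

Note that this flags any two consecutive rows with the same label, so it relies on the same trial label never legitimately occurring twice in a row.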


In [34]:
df_test

Unnamed: 0,Group,Name,Name.1,Name.2,Name.3,Response,Key,Released,Response.1,Code,Time,(Trial Variable),Finger,Congruence,Repeated,Correct,shift,double_response
49,Main Group,P8,AI_Block,"AI_Trial, 7","AI_Blue (7, i5inc.bmp)",index,v,Released,(based on code value),C,756,,1,4,1,,,0
54,Main Group,P8,AI_Block,"AI_Trial, 10","AI_Blue (10, m5base.bmp)",middle,b,Released,(based on code value),C,663,,2,0,1,,"AI_Trial, 7",0
58,Main Group,P8,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,532,,1,3,1,,"AI_Trial, 10",0
63,Main Group,P8,AI_Block,"AI_Trial, 9","AI_Blue (9, i5base.bmp)",index,v,Released,(based on code value),C,616,,1,0,1,,"AI_Trial, 5",0
67,Main Group,P8,AI_Block,"AI_Trial, 3","AI_Final_Stage (3, i4baseinc.bmp)",index,v,Released,(based on code value),C,536,,1,2,1,,"AI_Trial, 9",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499,Main Group,P8,AI_Block,"AI_Trial, 3","AI_Blue (3, i5baseinc.bmp)",index,v,Released,(based on code value),C,702,,1,2,10,,"AI_Trial, 4",0
504,Main Group,P8,AI_Block,"AI_Trial, 9","AI_Blue (9, i5base.bmp)",index,v,Released,(based on code value),C,669,,1,0,10,,"AI_Trial, 3",0
509,Main Group,P8,AI_Block,"AI_Trial, 7","AI_Blue (7, i5inc.bmp)",index,v,Released,(based on code value),C,729,,1,4,10,,"AI_Trial, 9",0
514,Main Group,P8,AI_Block,"AI_Trial, 6","AI_Blue (6, m5con.bmp)",middle,b,Released,(based on code value),C,693,,2,3,10,,"AI_Trial, 7",0


## 4. Iterate through to load the first level data
### Concatenate everything together to create one data frame to rule them all

In [51]:
# declare an empty list of data frames
dfs_list = []

# iterate through the files
for cur_file in all_files:
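    # load in an individual subject file
    # (on_bad_lines='skip' needs pandas >= 1.3; it replaces error_bad_lines=False, which was removed in pandas 2.0)
    df_tmp = pd.read_csv(cur_file,on_bad_lines='skip',skiprows=5,sep='\t')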
    # subset to only experimental rows
    df_tmp = df_tmp[df_tmp['Name.1']=='AI_Block']
    # subset to only key releases
    df_tmp = df_tmp[df_tmp['Released']=='Released']
    # identify double responses
    df_tmp['shift'] = df_tmp['Name.2'].shift()
    df_tmp['double_response'] = np.where(df_tmp['Name.2']==df_tmp['shift'],1,0)
    df_tmp = df_tmp[df_tmp['double_response']!=1]
    # append to list of data frames
    dfs_list.append(df_tmp)

df_allsubjects = pd.concat(dfs_list,axis=0)
display(df_allsubjects)

Unnamed: 0,Group,Name,Name.1,Name.2,Name.3,Response,Key,Released,Response.1,Code,Time,(Trial Variable),Finger,Congruence,Repeated,Correct,shift,double_response
49,Main Group,P8,AI_Block,"AI_Trial, 7","AI_Blue (7, i5inc.bmp)",index,v,Released,(based on code value),C,756,,1,4,1,,,0
54,Main Group,P8,AI_Block,"AI_Trial, 10","AI_Blue (10, m5base.bmp)",middle,b,Released,(based on code value),C,663,,2,0,1,,"AI_Trial, 7",0
58,Main Group,P8,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,532,,1,3,1,,"AI_Trial, 10",0
63,Main Group,P8,AI_Block,"AI_Trial, 9","AI_Blue (9, i5base.bmp)",index,v,Released,(based on code value),C,616,,1,0,1,,"AI_Trial, 5",0
67,Main Group,P8,AI_Block,"AI_Trial, 3","AI_Final_Stage (3, i4baseinc.bmp)",index,v,Released,(based on code value),C,536,,1,2,1,,"AI_Trial, 9",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
467,Main Group,P41,AI_Block,"AI_Trial, 4","AI_Final_Stage (4, m4baseinc.bmp)",index,v,Released,(based on code value),E,513,,2,2,10,,"AI_Trial, 1",0
473,Main Group,P41,AI_Block,"AI_Trial, 8","AI_Final_Stage (8, m4inc.bmp)",middle,b,Released,(based on code value),C,517,,2,4,10,,"AI_Trial, 4",0
477,Main Group,P41,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,355,,1,3,10,,"AI_Trial, 8",0
481,Main Group,P41,AI_Block,"AI_Trial, 6","AI_Final_Stage (6, m4con.bmp)",middle,b,Released,(based on code value),C,440,,2,3,10,,"AI_Trial, 5",0
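
As the output above shows, `pd.concat` keeps each file's original row labels, so labels may repeat across participants. A small sketch with two made-up participants (`df_a` and `df_b` are hypothetical) shows the effect and how `ignore_index=True` gives a clean 0-based index instead.

```python
import pandas as pd

# Two made-up "participants" whose row labels overlap
df_a = pd.DataFrame({'pid': ['P8', 'P8'], 'rt': [756, 663]}, index=[49, 54])
df_b = pd.DataFrame({'pid': ['P41', 'P41'], 'rt': [513, 517]}, index=[49, 58])

# Default behaviour: row labels are carried over, so 49 appears twice
stacked = pd.concat([df_a, df_b], axis=0)
print(stacked.index.is_unique)

# ignore_index=True discards the old labels and renumbers from 0
stacked_clean = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(stacked_clean.index.tolist())
```

Repeated labels are harmless for the merge in the next step, but they can give surprising results if you later select rows with `.loc`.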


## 5. Merge with questionnaire data

In [52]:
# reading in questionnaire responses
questionnaire_file = os.path.join(second_dir,'ait_questionnaires.csv')
df_questionnaire = pd.read_csv(questionnaire_file)
# display(df_questionnaire)

# merge questionnaires with trial-level data
df_allsubjects = df_allsubjects.rename(columns={"Name": "pid"})
df_allsubjects = pd.merge(df_allsubjects,df_questionnaire,on="pid",how="outer")
df_allsubjects


Unnamed: 0,Group,pid,Name.1,Name.2,Name.3,Response,Key,Released,Response.1,Code,...,questionnaire_1,questionnaire_2,questionnaire_3,questionnaire_4,questionnaire_5,questionnaire_6,questionnaire_7,questionnaire_8,questionnaire_9,questionnaire_10
0,Main Group,P8,AI_Block,"AI_Trial, 7","AI_Blue (7, i5inc.bmp)",index,v,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
1,Main Group,P8,AI_Block,"AI_Trial, 10","AI_Blue (10, m5base.bmp)",middle,b,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
2,Main Group,P8,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
3,Main Group,P8,AI_Block,"AI_Trial, 9","AI_Blue (9, i5base.bmp)",index,v,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
4,Main Group,P8,AI_Block,"AI_Trial, 3","AI_Final_Stage (3, i4baseinc.bmp)",index,v,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4740,Main Group,P41,AI_Block,"AI_Trial, 4","AI_Final_Stage (4, m4baseinc.bmp)",index,v,Released,(based on code value),E,...,74,78,11,82,74,54,8,36,73,22
4741,Main Group,P41,AI_Block,"AI_Trial, 8","AI_Final_Stage (8, m4inc.bmp)",middle,b,Released,(based on code value),C,...,74,78,11,82,74,54,8,36,73,22
4742,Main Group,P41,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,...,74,78,11,82,74,54,8,36,73,22
4743,Main Group,P41,AI_Block,"AI_Trial, 6","AI_Final_Stage (6, m4con.bmp)",middle,b,Released,(based on code value),C,...,74,78,11,82,74,54,8,36,73,22
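
Merging trial-level rows with one questionnaire row per participant is a many-to-one merge. As a hedged sketch on made-up data (`trials`, `questionnaires`, and participant `P99` are invented for illustration), the `validate` argument of `pd.merge` can confirm that assumption, and a missing-value check reveals participants who have questionnaires but no trials.

```python
import pandas as pd

# Made-up trial-level and questionnaire data
trials = pd.DataFrame({'pid': ['P8', 'P8', 'P41'], 'rt': [756, 663, 513]})
questionnaires = pd.DataFrame({'pid': ['P8', 'P41', 'P99'],
                               'questionnaire_1': [62, 74, 50]})

# validate='m:1' raises MergeError if a pid appears more than once on the right
merged = pd.merge(trials, questionnaires, on='pid', how='outer', validate='m:1')
print(merged)

# An outer merge adds one row per questionnaire-only participant,
# with missing values in the trial columns
questionnaire_only = merged[merged['rt'].isna()]['pid'].tolist()
print(questionnaire_only)
```

If rows like these are unwanted in the trial-level file, `how='left'` would keep only participants that have task data.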


## 6. Write to trial-level allsubjects csv

In [53]:
out_filename = os.path.join(second_dir,'ait_trialwise.csv')
df_allsubjects.to_csv(out_filename,index=False)
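
A quick sanity check after writing a file is to read it straight back in. The sketch below does this in a temporary folder with two made-up rows and a hypothetical file name; it also shows why `index=False` matters.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the trial-level data
df = pd.DataFrame({'pid': ['P8', 'P41'], 'Time': [756, 513]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_filename = os.path.join(tmp_dir, 'example_trialwise.csv')  # hypothetical name
    df.to_csv(out_filename, index=False)
    df_check = pd.read_csv(out_filename)

# With index=False the file holds only the data columns, so reading it back
# does not produce an extra 'Unnamed: 0' column
print(df_check.columns.tolist())
print(df_check.shape == df.shape)
```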

#### Pick up next class...

7. Compute summary measures
8. Save to summary allsubjects csv