# Data Science in Psychology & Neuroscience (DSPN): 

## Lecture 9. Data Wrangling (part 2)

### Date: September 26, 2023

### To-Dos From Last Class:

* Download the dataset for today's wrangling session #1 from <a href="https://github.com/hogeveen-lab/DSPN_Fall2022_Git/tree/master/misc_exercises/imitation_inhibition_paradigm">GitHub</a>

### Today:

* Data wrangling in Pandas (stealing heavily from this <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Cheatsheet</a>)
    * Last part of last class: Combining data frames
* Wrangle some real data

### Homework

* Download <a href="https://github.com/hogeveen-lab/DSPN_Fall2022_Git/tree/master/assignment_starters/assign3_starter">Assignment #3 starter kit</a>

# Data wrangling in Pandas

## (finishing from last class) 5. Combining Data Sets 

<img src="img/combining_data.png" width="600">

In [17]:
import pandas as pd
import numpy as np

# function to generate random integers
def random_integer(x,n):
    for i in range(n):
        x.append(np.random.randint(25,75))
    return x

# run the function and check the output
x = []
random_integer(x,6)
# print(x)

# putting together some example data frames
df1 = pd.DataFrame({'pid' : ['P1','P2','P3'],
                   'var1' : [x[0],x[1],x[2]]})
# display(df1)
df2 = pd.DataFrame({'pid' : ['P1','P2','P4'],
                   'var2' : [x[3],x[4],x[5]]})
# display(df2)

# Standard merge approach #1
print('Merge LEFT (right --> left; lose unique rows from the RIGHT df)')
merge_left = pd.merge(df1,df2,on='pid',how='left')
display(merge_left)

# Standard merge approach #2
print('Merge RIGHT (left --> right; lose unique rows from the LEFT df)')
merge_right = pd.merge(df1,df2,on='pid',how='right')
display(merge_right)

# Standard merge approach #3
print('Merge INNER (lose unique data from BOTH dfs)')
merge_inner = pd.merge(df1,df2,on='pid',how='inner')
display(merge_inner)

# Standard merge approach #4
print('Merge OUTER (retain ALL the data)')
merge_outer = pd.merge(df1,df2,on='pid',how='outer')
display(merge_outer)

Merge LEFT (right --> left; lose unique rows from the RIGHT df)


Unnamed: 0,pid,var1,var2
0,P1,59,72.0
1,P2,40,72.0
2,P3,30,


Merge RIGHT (left --> right; lose unique rows from the LEFT df)


Unnamed: 0,pid,var1,var2
0,P1,59.0,72
1,P2,40.0,72
2,P4,,42


Merge INNER (lose unique data from BOTH dfs)


Unnamed: 0,pid,var1,var2
0,P1,59,72
1,P2,40,72


Merge OUTER (retain ALL the data)


Unnamed: 0,pid,var1,var2
0,P1,59.0,72.0
1,P2,40.0,72.0
2,P3,30.0,
3,P4,,42.0
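
When an outer merge leaves missing values, it can help to see which frame each row came from. The sketch below assumes two toy frames (`left` and `right`, with made-up values that mirror `df1` and `df2`) and uses the `indicator=True` argument of `pd.merge`, which adds a `_merge` column.

```python
import pandas as pd

# Toy frames mirroring df1/df2 above, with fixed (made-up) values
left = pd.DataFrame({'pid': ['P1', 'P2', 'P3'], 'var1': [10, 20, 30]})
right = pd.DataFrame({'pid': ['P1', 'P2', 'P4'], 'var2': [1, 2, 4]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = pd.merge(left, right, on='pid', how='outer', indicator=True)
print(merged)

# Rows that did not find a match in the other frame
unmatched = merged[merged['_merge'] != 'both']
print(unmatched['pid'].tolist())
```

Filtering on `_merge` is one quick way to list participants who are missing from one of the two data sets before deciding which `how=` option to use.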


# Wrangle some real data

<img src="img/imit_inhib_fileorg.png" width=500>

## Breaking into 8 code chunks
## 1. Import packages

In [18]:
### Part 1 --> Importing data wrangling packages I often use
import os
from glob import glob # only need the glob function from the glob module
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None # silences pandas' SettingWithCopyWarning

## 2. Setting paths to the first level data

In [19]:
### Part 2 --> setting paths to the first level data

# get current working directory
base_dir = os.getcwd()

# Build paths to the first- and second-level data folders inside the current working directory
first_dir = os.path.join(base_dir,'misc/imitation_inhibition_paradigm/data/first')
P_file_pattern = 'P*.txt'
second_dir = os.path.join(base_dir,'misc/imitation_inhibition_paradigm/data/second')
questionnaire_file = os.path.join(second_dir,'ait_questionnaires.csv')

# Using glob to find all participant data files
all_files = glob(os.path.join(first_dir,P_file_pattern))
print(all_files)

['/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P8.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P9.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P49.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P48.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P13.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P12.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibi

In [20]:
# Using glob to find all participant data files
all_files = glob('/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P*.txt')
print(all_files)

['/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P8.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P9.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P49.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P48.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P13.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibition_paradigm/data/first/P12.txt', '/Users/jeremyhogeveen/Dropbox/Fall_2023/teaching/PSY450_650/DSPN_Fall2023_workdir/lectures/misc/imitation_inhibi
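
The file list printed above is not in participant order (P8, P9, P49, ...), because `glob` makes no promise about ordering. If a predictable order matters, one option is to sort by the participant number. This is a minimal sketch using a temporary folder with empty stand-in files; `participant_number` is a hypothetical helper, not part of the lecture code.

```python
import os
import re
import tempfile
from glob import glob

def participant_number(path):
    """Pull the integer out of a file name like 'P12.txt' (hypothetical helper)."""
    match = re.search(r'P(\d+)\.txt$', os.path.basename(path))
    return int(match.group(1))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty stand-in files
    for name in ['P8.txt', 'P49.txt', 'P12.txt', 'P9.txt']:
        open(os.path.join(tmp_dir, name), 'w').close()

    found = glob(os.path.join(tmp_dir, 'P*.txt'))
    # Sort numerically so that P9 comes before P12
    ordered = sorted(found, key=participant_number)
    ordered_names = [os.path.basename(p) for p in ordered]

print(ordered_names)
```

A plain `sorted(found)` would sort the names as text, which puts `P12.txt` before `P8.txt`; sorting on the extracted integer avoids that.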

## 3. Load a test subject to make sense of things

In [21]:
### Part 3 --> Loading in a test subject to make sense of things

# Reading in the data (on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0)
sample_df = pd.read_csv(all_files[0], on_bad_lines='skip', skiprows=5, sep='\t')
print('How many rows in initial loaded data frame:',len(sample_df)) # What things might cause this to not == 100?
# Filtering the data down to just the experimental block rows
sample_df = sample_df[sample_df['Name.1']=="AI_Block"]
sample_df.loc[:, 'Finger':'Repeated']
sample_df.loc[50:300, :]

# Filtering the df down to just the key release responses
sample_df_releases = sample_df[sample_df['Released']=='Released'] 

# How many key release responses do we have?
print('How many rows in key release filtered data frame:',len(sample_df_releases)) # What might cause this to not == 100? For now, just worry about double responses. Error rates in this task are so low that missed-response trials are not really important.

# Identifying double responses
sample_df_releases['shift'] = sample_df_releases['Name.2'].shift() # creating a new column ('shift') holding the trial name from the previous row (shift() moves values down by one). This kind of assignment is what triggers a "SettingWithCopyWarning".
# display(sample_df_releases[['Name.2','shift']]) # checking that it worked, show them shift(-1) 
sample_df_releases['double_response'] = np.where(sample_df_releases['shift']==sample_df_releases['Name.2'], 1, 0) # using a numpy where conditional to identify double responses

# Checking that the double response thing worked
double_resp_df = sample_df_releases[sample_df_releases['double_response']==1] 
# display(double_resp_df[['Name.2','shift','double_response']]) # checking that it worked, show them shift(-1)

# Filtering out double response trials
sample_df_releases_nodouble = sample_df_releases[sample_df_releases['double_response']==0] 
print('How many rows in no-double-response filtered data frame:',len(sample_df_releases_nodouble)) # Seeing if we have the right # of rows now

How many rows in initial loaded data frame: 521
How many rows in key release filtered data frame: 101
How many rows in no-double-response filtered data frame: 100
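
The double-response check is easier to follow on a tiny example. The sketch below uses a made-up four-row frame (the column names `trial` and `prev_trial` are illustrative only) and the same `shift()` plus `np.where` logic as the cell above.

```python
import numpy as np
import pandas as pd

# Made-up trial names; the second and third rows repeat the same trial
trials = pd.DataFrame({'trial': ['AI_Trial, 7', 'AI_Trial, 10',
                                 'AI_Trial, 10', 'AI_Trial, 5']})

# shift() moves values down one row, so each row sees the PREVIOUS trial name
trials['prev_trial'] = trials['trial'].shift()

# Flag a row when its trial name matches the row just before it
trials['double_response'] = np.where(trials['prev_trial'] == trials['trial'], 1, 0)
print(trials)

# Keep only the rows that were not flagged
no_double = trials[trials['double_response'] == 0]
print(len(no_double))
```

The first row has no previous row, so `prev_trial` is missing there and the comparison is simply false.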


In [22]:
# Demonstrating that Pandas data frames are a collection of series, which are different ways of storing arrays

# display(sample_df_releases) # data frame
display(sample_df_releases['Finger']) # Series w/in the data frame
display(sample_df_releases['Finger'].values) # converting the Series to a numpy array
display(np.array([1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1,
       2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1,
       1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2,
       2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1,
       2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2]))


49     1
54     2
58     1
63     1
67     1
      ..
499    1
504    1
509    1
514    2
519    2
Name: Finger, Length: 101, dtype: int64

array([1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1,
       2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1,
       1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2,
       2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1,
       2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2])

array([1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1,
       2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1,
       1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2,
       2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1,
       2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2])
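
To make the data frame / Series / array distinction concrete, here is a small sketch on a made-up three-row frame. It uses `to_numpy()`, which returns the same kind of array as `.values` for a simple integer column like this one.

```python
import numpy as np
import pandas as pd

# Made-up frame that keeps non-consecutive row labels, like the filtered data above
df = pd.DataFrame({'Finger': [1, 2, 1]}, index=[49, 54, 58])

col = df['Finger']    # one column of a DataFrame is a Series
arr = col.to_numpy()  # the underlying values as a numpy array

print(type(df).__name__, type(col).__name__, type(arr).__name__)

# The Series is addressed by its index labels; the array only by position
by_label = col.loc[54]
by_position = arr[1]
print(by_label, by_position)
```

The practical difference is that the Series carries its row labels with it, while the numpy array keeps only the values.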

## 4. Iterate through to load the first level data
###    * Concatenate all together to create one data frame to rule them all

In [23]:
### Part 4 --> Iterating through to get all the first level data, concatenating into allsubs data frame

pid_counter = 1
dfs_list = [] # creating a list of pandas objects

for cur_file in all_files:
    # Copying same logic from our test subject
#     print(cur_file)
    cur_df = pd.read_csv(cur_file, on_bad_lines='skip', skiprows=5, sep='\t') # on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
    cur_df_releases = cur_df[(cur_df['Released']=='Released') & (cur_df['Name.1']=="AI_Block")] 
#     print('How many rows in key release filtered data frame:',cur_df_releases['Congruence'].count()) # What might cause this to not == 100? For now, just worry about double responses. Error rates in this task are so low that missed-response trials are not really important.
    cur_df_releases['double_response'] = np.where(cur_df_releases['Name.2'].shift()==cur_df_releases['Name.2'], 1, 0) # faster way to find double responses
    cur_df_releases_nodouble = cur_df_releases[cur_df_releases['double_response']==0] 
#     print('How many rows in no-double-response filtered data frame:',cur_df_releases_nodouble['Congruence'].count()) # Seeing if we have the right # of rows now
    # Appending all the data into a data frame
    dfs_list.append(cur_df_releases_nodouble)
    pid_counter+=1

# Concatenate all DFs together along the row axis
allsubs_df = pd.concat(dfs_list, axis=0)

# Checking what we got and making sure it makes sense given how many trials we should have
# display(dfs_list)
display(allsubs_df)
print('the participant counter from our loop:',pid_counter-1,'should be roughly equivalent to our # of rows / 100:',allsubs_df['Congruence'].count() / 100)

Unnamed: 0,Group,Name,Name.1,Name.2,Name.3,Response,Key,Released,Response.1,Code,Time,(Trial Variable),Finger,Congruence,Repeated,Correct,double_response
49,Main Group,P8,AI_Block,"AI_Trial, 7","AI_Blue (7, i5inc.bmp)",index,v,Released,(based on code value),C,756,,1,4,1,,0
54,Main Group,P8,AI_Block,"AI_Trial, 10","AI_Blue (10, m5base.bmp)",middle,b,Released,(based on code value),C,663,,2,0,1,,0
58,Main Group,P8,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,532,,1,3,1,,0
63,Main Group,P8,AI_Block,"AI_Trial, 9","AI_Blue (9, i5base.bmp)",index,v,Released,(based on code value),C,616,,1,0,1,,0
67,Main Group,P8,AI_Block,"AI_Trial, 3","AI_Final_Stage (3, i4baseinc.bmp)",index,v,Released,(based on code value),C,536,,1,2,1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
467,Main Group,P41,AI_Block,"AI_Trial, 4","AI_Final_Stage (4, m4baseinc.bmp)",index,v,Released,(based on code value),E,513,,2,2,10,,0
473,Main Group,P41,AI_Block,"AI_Trial, 8","AI_Final_Stage (8, m4inc.bmp)",middle,b,Released,(based on code value),C,517,,2,4,10,,0
477,Main Group,P41,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,355,,1,3,10,,0
481,Main Group,P41,AI_Block,"AI_Trial, 6","AI_Final_Stage (6, m4con.bmp)",middle,b,Released,(based on code value),C,440,,2,3,10,,0


the participant counter from our loop: 48 should be roughly equivalent to our # of rows / 100: 47.45
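
The overall count only tells us the average is close to 100 rows per participant. A per-participant count shows who is short. This is a minimal sketch with two made-up participants; the values and the variable names (`per_subject`, `trials_per_pid`) are illustrative, not from the real data.

```python
import pandas as pd

# Two made-up per-participant frames standing in for the entries of dfs_list
per_subject = [
    pd.DataFrame({'Name': ['P1'] * 3, 'Time': [500, 510, 520]}),
    pd.DataFrame({'Name': ['P2'] * 2, 'Time': [600, 610]}),
]

# Stack along the row axis; ignore_index=True gives a fresh 0..n-1 index
allsubs = pd.concat(per_subject, axis=0, ignore_index=True)

# Number of rows each participant contributed
trials_per_pid = allsubs.groupby('Name').size()
print(trials_per_pid)
```

Without `ignore_index=True` the row labels from each file are kept, which is why the concatenated output above still shows labels like 49 and 467.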


## 5. Merge with questionnaire data

In [24]:
### Part 5 --> Loading in the questionnaire data and merging it with behavioral data

# renaming pid column in data frame
# print(allsubs_df['Name'])
allsubs_df = allsubs_df.rename(columns={"Name": "pid"})

# Reading in the questionnaire data
questionnaire_df = pd.read_csv(questionnaire_file)
display(questionnaire_df)

# Merging the questionnaire data with the main df
allsubs_df = pd.merge(allsubs_df,questionnaire_df,how='outer',on='pid')
display(allsubs_df)

Unnamed: 0,pid,questionnaire_1,questionnaire_2,questionnaire_3,questionnaire_4,questionnaire_5,questionnaire_6,questionnaire_7,questionnaire_8,questionnaire_9,questionnaire_10
0,P8,62,22,74,37,62,32,46,56,43,78
1,P9,0,60,95,30,68,29,93,52,55,64
2,P49,0,36,88,13,14,20,36,5,78,12
3,P48,9,82,6,96,18,66,87,34,22,75
4,P13,60,69,15,62,88,29,37,74,7,40
5,P12,57,54,25,76,54,50,2,92,43,83
6,P38,66,9,63,35,97,42,44,61,58,34
7,P10,21,76,76,70,14,81,44,57,31,29
8,P11,22,20,92,57,36,34,90,30,72,57
9,P39,18,69,40,14,12,88,53,31,71,26


Unnamed: 0,Group,pid,Name.1,Name.2,Name.3,Response,Key,Released,Response.1,Code,...,questionnaire_1,questionnaire_2,questionnaire_3,questionnaire_4,questionnaire_5,questionnaire_6,questionnaire_7,questionnaire_8,questionnaire_9,questionnaire_10
0,Main Group,P8,AI_Block,"AI_Trial, 7","AI_Blue (7, i5inc.bmp)",index,v,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
1,Main Group,P8,AI_Block,"AI_Trial, 10","AI_Blue (10, m5base.bmp)",middle,b,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
2,Main Group,P8,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
3,Main Group,P8,AI_Block,"AI_Trial, 9","AI_Blue (9, i5base.bmp)",index,v,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
4,Main Group,P8,AI_Block,"AI_Trial, 3","AI_Final_Stage (3, i4baseinc.bmp)",index,v,Released,(based on code value),C,...,62,22,74,37,62,32,46,56,43,78
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4740,Main Group,P41,AI_Block,"AI_Trial, 4","AI_Final_Stage (4, m4baseinc.bmp)",index,v,Released,(based on code value),E,...,74,78,11,82,74,54,8,36,73,22
4741,Main Group,P41,AI_Block,"AI_Trial, 8","AI_Final_Stage (8, m4inc.bmp)",middle,b,Released,(based on code value),C,...,74,78,11,82,74,54,8,36,73,22
4742,Main Group,P41,AI_Block,"AI_Trial, 5","AI_Final_Stage (5, i4con.bmp)",index,v,Released,(based on code value),C,...,74,78,11,82,74,54,8,36,73,22
4743,Main Group,P41,AI_Block,"AI_Trial, 6","AI_Final_Stage (6, m4con.bmp)",middle,b,Released,(based on code value),C,...,74,78,11,82,74,54,8,36,73,22
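
This merge is many-to-one: many trial rows per participant, one questionnaire row per participant. If the questionnaire file accidentally held a duplicated `pid`, the merge would silently duplicate trials. The sketch below, on made-up frames, uses the `validate` argument of `pd.merge` to guard against that; it is one possible safeguard, not something the lecture code does.

```python
import pandas as pd

# Made-up trial-level and participant-level frames
trials = pd.DataFrame({'pid': ['P1', 'P1', 'P2'], 'Time': [500, 510, 600]})
quest = pd.DataFrame({'pid': ['P1', 'P2'], 'questionnaire_1': [62, 0]})

# validate='m:1' checks that 'pid' is unique in the right-hand frame
merged = pd.merge(trials, quest, on='pid', how='left', validate='m:1')
print(merged)

# A questionnaire frame with P1 entered twice should be rejected
quest_dup = pd.concat([quest, quest.iloc[[0]]], ignore_index=True)
try:
    pd.merge(trials, quest_dup, on='pid', how='left', validate='m:1')
    raised = False
except pd.errors.MergeError:
    raised = True
print('duplicate pid caught:', raised)
```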


## 6. Write to trial-level allsubjects csv

In [25]:
### Part 6 --> Write data to a csv

# Writing the data to a second-level data frame that we will eventually play with in R
out_filename = os.path.join(second_dir,'ait_trialwise.csv')
allsubs_df.to_csv(out_filename,index=False)
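
The `index=False` argument matters when the file is read back in. The sketch below writes a made-up two-row frame to in-memory text buffers (so no files are created) with and without the index, then compares the column names after reading each one back.

```python
import io
import pandas as pd

# Made-up frame standing in for allsubs_df
df = pd.DataFrame({'pid': ['P1', 'P2'], 'Time': [500, 600]})

# Write with the default index=True
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)

# Write with index=False, as in the cell above
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```

Leaving the index in produces an extra unnamed column on re-import, which is usually not wanted in a second-level file.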

#### Pick up next class..
7. Compute summary measures
8. Save to summary allsubjects csv