# Data Science in Psychology & Neuroscience (DSPN): 

## Lecture 9. Data Wrangling (part 2)

### Date: September 22, 2020

### To-Dos From Last Class:

* Download the dataset for today's wrangling session #1 from <a href="https://github.com/hogeveen-lab/DSPN_Fall2020_git">GitHub</a>

### Today:

* Data wrangling in Pandas (stealing heavily from this <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Cheatsheet</a>)
    * Last part of last class: Combining data frames
* Wrangle some real data

### Homework

* Download Assignment #3 start kit data
    * Beginner level
    * Advanced level

# Data wrangling in Pandas

## (finishing from last class) 5. Combining Data Sets 

<img src="img/combining_data.png" width="600">

In [25]:
import pandas as pd
import numpy as np

# function for generating random integers in the range 25-74 (the upper bound, 75, is exclusive)
def random_integer(x,n): # where x is a list that gets appended to in place, and n is the # of vals we need
    for i in range(n):
        x.append(np.random.randint(25,75))
    return x

# running the func and checking the output
x = []
random_integer(x,6)
print(x)

# putting together some test data frames
df1 = pd.DataFrame({'pid' : ['P1','P2','P3'],
                   'var1' : [x[0],x[1],x[2]]})
df2 = pd.DataFrame({'pid' : ['P1','P2','P4'],
                   'var2' : [x[3],x[4],x[5]]})

### Standard approach #1 -- join matching rows from df2 to df1
print('Merge LEFT (right --> left; lose unique data from right)')
print(pd.merge(df1,df2,how='left',on='pid'))

### Standard approach #2 -- join matching rows from df1 to df2
print('Merge RIGHT (left --> right; lose unique data from left)')
print(pd.merge(df1,df2,how='right',on='pid'))

### Standard approach #3 -- retain rows present in BOTH dfs
print('Merge INNER (lose unique data from either df)')
print(pd.merge(df1,df2,how='inner',on='pid'))

### Standard approach #4 -- retain rows present in ANY dfs
print('Merge OUTER (retain all data)')
print(pd.merge(df1,df2,how='outer',on='pid'))

Merge LEFT (right --> left; lose unique data from right)
  pid  var1  var2
0  P1    70  43.0
1  P2    43  49.0
2  P3    47   NaN
Merge RIGHT (left --> right; lose unique data from left)
  pid  var1  var2
0  P1  70.0    43
1  P2  43.0    49
2  P4   NaN    33
Merge INNER (lose unique data from either df)
  pid  var1  var2
0  P1    70    43
1  P2    43    49
Merge OUTER (retain all data)
  pid  var1  var2
0  P1  70.0  43.0
1  P2  43.0  49.0
2  P3  47.0   NaN
3  P4   NaN  33.0
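
A small sketch of how to check which rows matched in a merge, using the `indicator=True` option of `pd.merge`. The values are fixed rather than random so the result is reproducible; they are illustrative only. It also shows why `var1` and `var2` print as floats in the outputs above: any column that picks up a `NaN` is converted from integer to float.

```python
import pandas as pd

# Fixed values (instead of random ones) so the result is reproducible
df1 = pd.DataFrame({'pid': ['P1', 'P2', 'P3'], 'var1': [70, 43, 47]})
df2 = pd.DataFrame({'pid': ['P1', 'P2', 'P4'], 'var2': [43, 49, 33]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = pd.merge(df1, df2, how='outer', on='pid', indicator=True)
print(merged)

# Participants that appear in only one of the two data frames
unmatched = merged.loc[merged['_merge'] != 'both', 'pid'].tolist()
print(unmatched)

# Columns that received a NaN are now float, not int
print(merged[['var1', 'var2']].dtypes)
```

Checking `_merge` before dropping or keeping rows is a cheap way to see what a LEFT, RIGHT, or INNER merge would have thrown away.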


# Wrangle some real data

## Breaking into 8 code chunks

1. Import packages
2. Setting paths to the first level data
3. Load a test subject to make sense of things
4. Iterate through to load the first level data
    * Concatenate all together to create one data frame to rule them all
5. Merge with questionnaire data
6. Write to trial-level allsubjects csv
#### Pick up next class..
7. Compute summary measures
8. Save to summary allsubjects csv
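
Steps 7 and 8 are left for next class, but as a rough preview, a summary measure is usually a `groupby` followed by an aggregation. This is only a sketch on made-up data: the `rt` column, the `congruent`/`incongruent` labels, and the `interference` name are assumptions for illustration, not the names used in the real files.

```python
import pandas as pd

# Toy trial-level data; 'rt' and the Congruence labels are made up for illustration
trials = pd.DataFrame({
    'pid': ['P1', 'P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'P2'],
    'Congruence': ['congruent', 'incongruent'] * 4,
    'rt': [400, 460, 420, 480, 380, 450, 400, 430],
})

# Step 7 (sketch): one mean RT per participant and condition, conditions as columns
summary = (trials.groupby(['pid', 'Congruence'])['rt']
                 .mean()
                 .unstack('Congruence'))

# Hypothetical difference score: incongruent minus congruent mean RT
summary['interference'] = summary['incongruent'] - summary['congruent']
print(summary)
```

Step 8 would then be a single `summary.to_csv(...)` call, in the same way the trial-level data are written out in Part 6.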

In [29]:
### Part 1 --> Importing data wrangling packages I often use

# Packages that are key for interacting with the OS and matching filename patterns
import os
from glob import glob # only need the glob() function from the glob module

# Packages that are key for data wrangling
import numpy as np
import pandas as pd

In [30]:
### Part 2 --> setting paths to the first level data

# Get the current working directory and its parent directory
script_dir = os.getcwd()
base_dir = os.path.dirname(script_dir)

# Go above the current working directory and build the path to the first-level data
first_dir = os.path.join(base_dir,'misc_exercises/imitation_inhibition_paradigm/data/first') # misc_exercises/ for git
# first_dir = os.path.join(base_dir,'exercises/imitation_inhibition_paradigm/data/first') # alternative path without misc_
P_file_pattern = 'P*.txt'
second_dir = os.path.join(base_dir,'misc_exercises/imitation_inhibition_paradigm/data/second')
# second_dir = os.path.join(base_dir,'exercises/imitation_inhibition_paradigm/data/second')
questionnaire_file = os.path.join(second_dir,'ait_questionnaires.csv')

# Using glob to find all participant data files
all_files = glob(os.path.join(first_dir,P_file_pattern))
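
To see what the `'P*.txt'` pattern does and does not match, here is a self-contained sketch that builds a throwaway folder with a few empty files. The file names are invented for the demo. Note that `glob` makes no promise about the order of its results, so the sketch sorts them.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty files; only the 'P*.txt' ones count as participant files
    for name in ['P1.txt', 'P2.txt', 'P10.txt', 'notes.txt', 'P3.csv']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # glob returns matches in arbitrary order, so sort for reproducibility
    matches = sorted(glob(os.path.join(tmp_dir, 'P*.txt')))
    found = [os.path.basename(f) for f in matches]

print(found)
```

The sort is alphabetical, not numeric, so `P10.txt` lands before `P2.txt`. That is harmless here because the participant ID is read from inside each file rather than from its position in the list.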

In [31]:
### Part 3 --> Loading in a test subject to make sense of things

# Reading in the data
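# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
sample_df = pd.read_csv(all_files[0], on_bad_lines='skip', skiprows=5, sep='\t')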
print('How many rows in initial loaded data frame:',len(sample_df)) # What things might cause this to not == 100?

# Filtering the data down to just the experimental block rows
sample_df = sample_df[sample_df['Name.1']=="AI_Block"]
sample_df.loc[:, 'Finger':'Repeated']
sample_df.loc[50:300, :]

# Filtering the df down to just the key release responses
sample_df_releases = sample_df[sample_df['Released']=='Released'] 

# How many key release responses do we have?
print('How many rows in key release filtered data frame:',len(sample_df_releases)) # What things might cause this to not == 100? For now, just worry about double responses. For this task, error rates are so low that missed-response trials are not really important.
# print(sample_df_releases)

# Identifying double responses
sample_df_releases['shift'] = sample_df_releases['Name.2'].shift() # creating a new column ('shift') holding the PREVIOUS row's trial name (shift() moves values down one row). Assigning to a filtered slice triggers a "SettingWithCopyWarning".
# print(sample_df_releases[['Name.2','shift']]) # checking that it worked, show them shift(-1) 
sample_df_releases['double_response'] = np.where(sample_df_releases['shift']==sample_df_releases['Name.2'], 1, 0) # using a numpy where conditional to identify double responses

# Checking that the double response thing worked
test_df = sample_df_releases[sample_df_releases['Name.2']=='AI_Trial, 2']
# print(test_df[['Name.2','shift','double_response']]) # checking that it worked, show them shift(-1)

# Filtering our double response trials
sample_df_releases_nodouble = sample_df_releases[sample_df_releases['double_response']==0] 
print('How many rows in no-double-response filtered data frame:',len(sample_df_releases_nodouble)) # Seeing if we have the right # of rows now

How many rows in initial loaded data frame: 521
How many rows in key release filtered data frame: 101
How many rows in no-double-response filtered data frame: 100


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
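
The warning above appears because `sample_df_releases` was created by filtering another data frame, and pandas cannot tell whether the new column is meant for the original or for the filtered result. One common way to avoid it is to take an explicit `.copy()` after filtering. Below is a minimal sketch of the same double-response logic on a toy log; the rows and the `'Pressed'` value are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy key log; trial 2 has two releases in a row (a double response)
log = pd.DataFrame({
    'Name.2': ['AI_Trial, 1', 'AI_Trial, 2', 'AI_Trial, 2', 'AI_Trial, 3'],
    'Released': ['Released', 'Released', 'Released', 'Pressed'],
})

# .copy() gives an independent data frame, so adding columns raises no warning
releases = log[log['Released'] == 'Released'].copy()

# shift() moves values down one row: each row sees the PREVIOUS row's trial name
releases['shift'] = releases['Name.2'].shift()

# A release whose trial name equals the previous release's trial name is a double response
releases['double_response'] = np.where(releases['shift'] == releases['Name.2'], 1, 0)
print(releases[['Name.2', 'shift', 'double_response']])

# Keep only the first release of each trial
no_double = releases[releases['double_response'] == 0]
print(len(no_double))
```

Because the comparison is with the previous row, the first release of a trial is kept and the repeat is flagged. Using `shift(-1)` instead would compare with the next row and flag the first release rather than the second.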


In [37]:
# Demonstrating that a Pandas data frame is a collection of Series, each of which wraps a numpy array

sample_df_releases # data frame
sample_df_releases['Finger'] # Series w/in the data frame
sample_df_releases['Finger'].values # converting the Series to a numpy array
np.array([1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1,
       2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1,
       1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2,
       2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1,
       2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2])


          Group Name    Name.1        Name.2  \
49   Main Group   P8  AI_Block   AI_Trial, 7   
54   Main Group   P8  AI_Block  AI_Trial, 10   
58   Main Group   P8  AI_Block   AI_Trial, 5   
63   Main Group   P8  AI_Block   AI_Trial, 9   
67   Main Group   P8  AI_Block   AI_Trial, 3   
..          ...  ...       ...           ...   
499  Main Group   P8  AI_Block   AI_Trial, 3   
504  Main Group   P8  AI_Block   AI_Trial, 9   
509  Main Group   P8  AI_Block   AI_Trial, 7   
514  Main Group   P8  AI_Block   AI_Trial, 6   
519  Main Group   P8  AI_Block  AI_Trial, 10   

                                Name.3 Response Key  Released  \
49              AI_Blue (7, i5inc.bmp)    index   v  Released   
54            AI_Blue (10, m5base.bmp)   middle   b  Released   
58       AI_Final_Stage (5, i4con.bmp)    index   v  Released   
63             AI_Blue (9, i5base.bmp)    index   v  Released   
67   AI_Final_Stage (3, i4baseinc.bmp)    index   v  Released   
..                               
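
The same three levels (data frame, Series, array) can be checked directly with `type`. This is a small sketch on a toy data frame; the values are invented, and only the column names echo the real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Finger': [1, 2, 1], 'Response': ['index', 'middle', 'index']})

col = df['Finger']            # single brackets -> Series
sub = df[['Finger']]          # double brackets -> data frame with one column
arr = df['Finger'].values     # underlying numpy array (the row index is dropped)
arr2 = df['Finger'].to_numpy()  # same result; the spelling the pandas docs recommend

print(type(col).__name__, type(sub).__name__, type(arr).__name__)
```

The single versus double bracket distinction matters in practice: a Series has no columns, so code that expects a data frame (for example, a merge on a column name) needs the double-bracket form.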

In [24]:
### Part 4 --> Iterating through to get all the first level data, concatenating into allsubs data frame

pid_counter = 1
dfs_list = [] # creating a list of pandas objects

for cur_file in all_files:
    # Copying same logic from our test subject
#     print(cur_file)
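    # on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
    cur_df = pd.read_csv(cur_file, on_bad_lines='skip', skiprows=5, sep='\t')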
    cur_df_releases = cur_df[(cur_df['Released']=='Released') & (cur_df['Name.1']=="AI_Block")] 
#     print('How many rows in key release filtered data frame:',cur_df_releases['Congruence'].count()) # What things might cause this to not == 100? For now, just worry about double responses. For this task, error rates are so low that missed-response trials are not really important.
    cur_df_releases['double_response'] = np.where(cur_df_releases['Name.2'].shift()==cur_df_releases['Name.2'], 1, 0) # faster way to find double responses
    cur_df_releases_nodouble = cur_df_releases[cur_df_releases['double_response']==0] 
#     print('How many rows in no-double-response filtered data frame:',cur_df_releases_nodouble['Congruence'].count()) # Seeing if we have the right # of rows now
    # Appending all the data into a data frame
    dfs_list.append(cur_df_releases_nodouble)
    pid_counter+=1

# Concatenate all DFs together along the row axis
allsubs_df = pd.concat(dfs_list, axis=0)

# Checking what we got and making sure it makes sense given how many trials we should have
# print(dfs_list)
# print(allsubs_df)
print('the participant counter from our loop:',pid_counter-1,'should be roughly equivalent to our # of rows / 100:',allsubs_df['Congruence'].count() / 100)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':


the participant counter from our loop: 48 should be roughly equivalent to our # of rows / 100: 47.45
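
The overall ratio above says that some trials are missing, but not whose. A per-participant row count answers that. The sketch below uses two toy data frames in place of the real per-file results; the `Name` and `Congruence` column names follow the real data, and the values are invented.

```python
import pandas as pd

# Two toy per-participant data frames standing in for the loop's output
dfs_list = [
    pd.DataFrame({'Name': ['P1'] * 3, 'Congruence': [1, 2, 1]}),
    pd.DataFrame({'Name': ['P2'] * 2, 'Congruence': [2, 1]}),
]

# ignore_index=True renumbers rows 0..n-1 instead of repeating each file's own index
allsubs = pd.concat(dfs_list, axis=0, ignore_index=True)

# Rows per participant: a quick way to spot who has fewer trials than expected
trials_per_pid = allsubs.groupby('Name').size()
print(trials_per_pid)
```

On the real data, filtering `trials_per_pid` for values other than 100 would list the participants worth a closer look.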


In [27]:
### Part 5 --> Loading in the questionnaire data and merging it with behavioral data

# renaming pid column in data frame
# print(allsubs_df['Name'])
allsubs_df = allsubs_df.rename(columns={"Name": "pid"})

# Reading in the questionnaire data
questionnaire_df = pd.read_csv(questionnaire_file)
print(questionnaire_df)

# Merging the questionnaire data with the main df
allsubs_df = pd.merge(allsubs_df,questionnaire_df,how='outer',on='pid')
print(allsubs_df)

    pid  questionnaire_1  questionnaire_2  questionnaire_3  questionnaire_4  \
0    P8               62               22               74               37   
1    P9                0               60               95               30   
2   P49                0               36               88               13   
3   P48                9               82                6               96   
4   P13               60               69               15               62   
5   P12               57               54               25               76   
6   P38               66                9               63               35   
7   P10               21               76               76               70   
8   P11               22               20               92               57   
9   P39               18               69               40               14   
10  P15               71               75               19               59   
11  P29               51               71           
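
This merge is many-to-one: many trial rows per `pid` on the left, and (presumably) one questionnaire row per `pid` on the right. If a `pid` were duplicated in the questionnaire file, the merge would quietly duplicate that participant's trials. The `validate` argument of `pd.merge` can guard against this. A minimal sketch on invented data, where `rt` is a hypothetical column:

```python
import pandas as pd

trials = pd.DataFrame({'pid': ['P1', 'P1', 'P2', 'P2'], 'rt': [400, 420, 380, 410]})
questionnaires = pd.DataFrame({'pid': ['P1', 'P2'], 'questionnaire_1': [62, 0]})

# many_to_one: the right-hand table must have at most one row per pid
merged = pd.merge(trials, questionnaires, how='outer', on='pid', validate='many_to_one')
print(merged)

# Simulate a questionnaire file where P1 was accidentally entered twice
dup = pd.concat([questionnaires, questionnaires.iloc[[0]]], ignore_index=True)
try:
    pd.merge(trials, dup, how='outer', on='pid', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    # validate turns the silent row duplication into an explicit error
    raised = True
print(raised)
```

The check costs almost nothing and fails loudly at the merge, rather than later as an unexplained jump in the trial count.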

In [28]:
### Part 6 --> Write data to a csv

# Writing the data to a second-level csv file that we will eventually play with in R
out_filename = os.path.join(second_dir,'ait_trialwise.csv')
allsubs_df.to_csv(out_filename,index=False)
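
A quick way to confirm a csv was written as intended is to read it straight back. The sketch below does this round trip in a temporary folder with an invented two-row data frame; the file name `demo_trialwise.csv` is made up for the demo.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'pid': ['P1', 'P2'], 'rt': [400.5, 380.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_filename = os.path.join(tmp_dir, 'demo_trialwise.csv')
    # index=False keeps the row index out of the file
    df.to_csv(out_filename, index=False)
    # Read it back while the temporary folder still exists
    reloaded = pd.read_csv(out_filename)

print(reloaded.columns.tolist())
print(len(reloaded))
```

Without `index=False`, the row index is written as an unnamed first column, which then shows up as `Unnamed: 0` when the file is read back in pandas.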