# Data Science in Psychology & Neuroscience (DSPN): 

## Lecture 9. Data Wrangling (part 2)

### Date: September 22, 2020

### To-Dos From Last Class:

* Download the dataset for today's wrangling session #1 from <a href="https://github.com/hogeveen-lab/DSPN_Fall2020_git">GitHub</a>

### Today:

* Data wrangling in Pandas (stealing heavily from this <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Cheatsheet</a>)
    * Last part of last class: Combining data frames
* Wrangle some real data

### Homework

* Download Assignment #3 starter kit data
    * Beginner level
    * Advanced level

# Data wrangling in Pandas

## (finishing from last class) 5. Combining Data Sets 

<img src="img/combining_data.png" width="600">

In [6]:
import pandas as pd
import numpy as np

# helper that appends n random integers in the range [25, 75) to the list x
def random_integer(x, n):
    for _ in range(n):
        x.append(np.random.randint(25, 75))
    return x

# call the helper to generate six random values
x = []
random_integer(x, 6)
print(x)

# putting together the data in two separate data frames
df1 = pd.DataFrame({'pid' : ['P1','P2','P3'],
                   'var1' : [x[0],x[1],x[2]]})
df2 = pd.DataFrame({'pid' : ['P1','P2','P4'],
                   'var2' : [x[3],x[4],x[5]]})

# Standard approach #1: left merge keeps every row of df1
print('Merge LEFT (right -->left; lost unique data from right)')
print(pd.merge(df1,df2,how='left',on='pid'))

# Standard approach #2: right merge keeps every row of df2
print('Merge RIGHT (left --> right; lost unique data from left)')
print(pd.merge(df1,df2,how='right',on='pid'))

# Standard approach #3: inner merge keeps only the pids found in both
print('Merge INNER (lose unique data from EITHER data frame)')
print(pd.merge(df1,df2,how='inner',on='pid'))

# Standard approach #4: outer merge keeps every pid from both
print('Merge OUTER (retain all data)')
print(pd.merge(df1,df2,how='outer',on='pid'))

[29, 42, 63, 57, 50, 25]
Merge LEFT (right -->left; lost unique data from right)
  pid  var1  var2
0  P1    29  57.0
1  P2    42  50.0
2  P3    63   NaN
Merge RIGHT (left --> right; lost unique data from left)
  pid  var1  var2
0  P1  29.0    57
1  P2  42.0    50
2  P4   NaN    25
Merge INNER (lose unique data from EITHER data frame)
  pid  var1  var2
0  P1    29    57
1  P2    42    50
Merge OUTER (retain all data)
  pid  var1  var2
0  P1  29.0  57.0
1  P2  42.0  50.0
2  P3  63.0   NaN
3  P4   NaN  25.0
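
When rows go missing after a merge, it helps to see which data frame each row came from. The sketch below reuses the values printed above and passes `indicator=True` to `pd.merge`, which adds a `_merge` column marked `left_only`, `right_only`, or `both`. The variable names `merged` and `unmatched` are just illustrative.

```python
import pandas as pd

# same participants and values as in the output above
df1 = pd.DataFrame({'pid': ['P1', 'P2', 'P3'], 'var1': [29, 42, 63]})
df2 = pd.DataFrame({'pid': ['P1', 'P2', 'P4'], 'var2': [57, 50, 25]})

# indicator=True adds a '_merge' column recording each row's origin
merged = pd.merge(df1, df2, how='outer', on='pid', indicator=True)
print(merged)

# participants that appear in only one of the two data frames
unmatched = merged.loc[merged['_merge'] != 'both', 'pid'].tolist()
print(unmatched)
```

This is a quick sanity check to run before choosing between a left, right, inner, or outer merge on real data.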


# Wrangle some real data

## Breaking it into 8 code chunks

1. Import packages
2. Set paths to the first level data
3. Load a test subject to make sense of things
4. Iterate through to load the first level data
    * Concatenate all together to create one data frame to rule them all
5. Merge with questionnaire data
6. Write to trial-level allsubjects csv
#### Pick up next class...
7. Compute summary measures
8. Save to summary allsubjects csv
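
Steps 4 through 6 in the list above can be sketched on toy data before touching the real files. Everything here is hypothetical: the in-memory "files", the `trial`, `rt`, and `score` columns, and the values are made up for illustration and are not the structure of the actual task data.

```python
import io
import pandas as pd

# hypothetical stand-ins for per-participant files (the real ones live on disk)
fake_files = {
    'P1': "trial\trt\n1\t512\n2\t478\n",
    'P2': "trial\trt\n1\t530\n2\t495\n",
}

# step 4: load each participant, tag rows with the participant ID, then concatenate
frames = []
for pid, text in fake_files.items():
    df = pd.read_csv(io.StringIO(text), sep='\t')
    df['pid'] = pid
    frames.append(df)
all_trials = pd.concat(frames, ignore_index=True)

# step 5: merge with (hypothetical) questionnaire data, one row per participant
questionnaires = pd.DataFrame({'pid': ['P1', 'P2'], 'score': [12, 30]})
trial_level = pd.merge(all_trials, questionnaires, how='left', on='pid')

# step 6: serialize the trial-level table as CSV (to a string here, not a file)
csv_text = trial_level.to_csv(index=False)
print(trial_level)
```

Collecting the frames in a list and calling `pd.concat` once at the end is generally cheaper than growing a data frame inside the loop.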

In [7]:
# step 1 -- import packages

# data wrangling packages
import pandas as pd
import numpy as np

# packages that are key for interacting with the operating system
import os
from glob import glob

In [41]:
# step 2 -- set up our filepaths

# get current working directory
script_dir = os.getcwd()

# get data directory
data_dir = os.path.join(script_dir,'misc_exercises','imitation_inhibition_paradigm','data')
first_dir = os.path.join(data_dir,'first')
second_dir = os.path.join(data_dir,'second')
questionnaire_file = os.path.join(second_dir,'ait_questionnaires.csv')

# filename pattern: any file starting with 'P' and ending in '.txt'
P_file_pattern = 'P*.txt'

# use glob to identify all files that match our P_file_pattern
P_full_path = os.path.join(first_dir,P_file_pattern)
all_files = glob(P_full_path)
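
To see what a pattern like `'P*.txt'` does and does not match, here is a small sketch that runs `glob` on a throwaway directory. The file names are hypothetical. Because `glob` returns matches in no guaranteed order, the sketch sorts them, which is one way to make an expression like `all_files[0]` pick the same file on every machine.

```python
import os
import tempfile
from glob import glob

# build a throwaway directory containing hypothetical file names
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['P2.txt', 'P1.txt', 'P10.txt', 'notes.txt', 'P3.csv']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # '*' matches any run of characters, so only P-prefixed .txt files are kept
    matches = glob(os.path.join(tmp_dir, 'P*.txt'))

    # glob makes no promise about order; sorting makes the result reproducible
    file_names = sorted(os.path.basename(f) for f in matches)

print(file_names)
```

Note that the sort is alphabetical, so `P10.txt` lands before `P2.txt`.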

In [39]:
# step 3 -- load a test subject to make sense of things

# read in the data (tab-delimited; the first 5 rows of the file are skipped)
sample_df = pd.read_csv(all_files[0], skiprows=5, sep='\t')

# filter down to just the experimental block
sample_df = sample_df[sample_df['Name.1']=='AI_Block']

# filter down to just the key releases
sample_df_releases = sample_df[sample_df['Released']=='Released']

# print the number of rows in the key-release data frame
print('How many rows in key release filtered data frame:',len(sample_df_releases))

# identify double responses: rows whose 'Name.2' value equals the previous row's
sample_df_releases['shift'] = sample_df_releases['Name.2'].shift()
sample_df_releases['double_response'] = np.where(sample_df_releases['shift']==sample_df_releases['Name.2'], 1, 0)

# filter out double-response trials and check that we have the right number of rows
sample_df_releases_nodouble = sample_df_releases[sample_df_releases['double_response']==0]
print('How many rows in no-double-response filtered data frame:',len(sample_df_releases_nodouble))

How many rows in key release filtered data frame: 101
How many rows in no-double-response filtered data frame: 100


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
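
The message above is pandas' `SettingWithCopyWarning`. It appears because new columns are being assigned to a data frame that was itself produced by filtering another one, so pandas cannot be sure whether the assignment touches the original or a copy. A minimal sketch of one common way to avoid it is to call `.copy()` right after filtering. The toy event log below is hypothetical, and treating `Name.2` as a trial label is an assumption made for illustration. The same sketch also shows how the shift-and-compare step flags a second consecutive release on the same trial.

```python
import numpy as np
import pandas as pd

# hypothetical event log: trial 't2' has two consecutive key releases
events = pd.DataFrame({
    'Name.2':   ['t1', 't1', 't2', 't2', 't2', 't3', 't3'],
    'Released': ['Pressed', 'Released', 'Pressed', 'Released',
                 'Released', 'Pressed', 'Released'],
})

# .copy() makes the filtered frame independent, so adding columns is unambiguous
releases = events[events['Released'] == 'Released'].copy()

# shift() moves values down one row, pairing each row with its predecessor
releases['shift'] = releases['Name.2'].shift()
releases['double_response'] = np.where(releases['shift'] == releases['Name.2'], 1, 0)

# keep only the first release of each trial
releases_nodouble = releases[releases['double_response'] == 0]
print(releases[['Name.2', 'shift', 'double_response']])
print(len(releases), len(releases_nodouble))
```

The first row has nothing before it, so its shifted value is `NaN` and it is never flagged as a double response.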
