# Data Science in Psychology & Neuroscience (DSPN): 

## Lecture 10. Data Wrangling (part 2)

### Date: September 26, 2023

### To-Dos From Last Class:

* Download the wrangling session #1 dataset for today's class from <a href="https://github.com/hogeveen-lab/DSPN_Fall2022_Git/tree/master/misc_exercises/imitation_inhibition_paradigm">GitHub</a>

### Today:

* Data wrangling in Pandas (stealing heavily from this <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Cheatsheet</a>)
    * Last part of last class: Combining data frames
* Wrangle some real data

### Homework

* Download <a href="https://github.com/hogeveen-lab/DSPN_Fall2022_Git/tree/master/assignment_starters/assign3_starter">Assignment #3 starter kit</a>

# Data wrangling in Pandas

## (finishing from last class) 5. Combining Data Sets 

<img src="img/combining_data.png" width="600">

In [12]:
import pandas as pd
import numpy as np

# generate random values to enter into our data frames
x = np.random.randint(10,20,3)
y = np.random.randint(20,30,3)
print(x,y)

# put the random integers into two data frames that share a 'pid' key
df1 = pd.DataFrame({'pid' : ['P1','P2','P3'],
                   'var1' : [x[0],x[1],x[2]]})
display(df1)
df2 = pd.DataFrame({'pid' : ['P1','P2','P4'],
                   'var2' : [y[0],y[1],y[2]]})
display(df2)

# Standard merge approach #1
print('Merge LEFT (right --> left: lose unique rows from the RIGHT df)')
df_merge_left = pd.merge(df1,df2,on='pid',how='left')
display(df_merge_left)

# Standard merge approach #2
print('Merge RIGHT (left --> right: lose unique rows from the LEFT df)')
df_merge_right = pd.merge(df1,df2,on='pid',how='right')
display(df_merge_right)

# Standard merge approach #3
print('Merge INNER (lose unique data from BOTH dfs)')
df_merge_inner = pd.merge(df1,df2,on='pid',how='inner')
display(df_merge_inner)

# Standard merge approach #4
print('Merge OUTER (keep all data present in both data frames)')
df_merge_outer = pd.merge(df1,df2,on='pid',how='outer')
display(df_merge_outer)

[18 11 18] [22 24 28]


| | pid | var1 |
|---|---|---|
| 0 | P1 | 18 |
| 1 | P2 | 11 |
| 2 | P3 | 18 |


| | pid | var2 |
|---|---|---|
| 0 | P1 | 22 |
| 1 | P2 | 24 |
| 2 | P4 | 28 |


Merge LEFT (right --> left: lose unique rows from the RIGHT df)


| | pid | var1 | var2 |
|---|---|---|---|
| 0 | P1 | 18 | 22.0 |
| 1 | P2 | 11 | 24.0 |
| 2 | P3 | 18 | NaN |


Merge RIGHT (left --> right: lose unique rows from the LEFT df)


| | pid | var1 | var2 |
|---|---|---|---|
| 0 | P1 | 18.0 | 22 |
| 1 | P2 | 11.0 | 24 |
| 2 | P4 | NaN | 28 |


Merge INNER (lose unique data from BOTH dfs)


| | pid | var1 | var2 |
|---|---|---|---|
| 0 | P1 | 18 | 22 |
| 1 | P2 | 11 | 24 |


Merge OUTER (keep all data present in both data frames)


| | pid | var1 | var2 |
|---|---|---|---|
| 0 | P1 | 18.0 | 22.0 |
| 1 | P2 | 11.0 | 24.0 |
| 2 | P3 | 18.0 | NaN |
| 3 | P4 | NaN | 28.0 |
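
When it is not obvious which rows a merge kept or dropped, one option is the `indicator` argument of `pd.merge`, which adds a `_merge` column labelling each row as `left_only`, `right_only`, or `both`. The sketch below reuses the toy layout from the cell above but hard-codes the values (an assumption made only so the result is reproducible); `df_check` and `unmatched` are names introduced here for illustration.

```python
import pandas as pd

# Same toy layout as the cell above, with fixed values instead of random ones
df1 = pd.DataFrame({'pid': ['P1', 'P2', 'P3'], 'var1': [18, 11, 18]})
df2 = pd.DataFrame({'pid': ['P1', 'P2', 'P4'], 'var2': [22, 24, 28]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
df_check = pd.merge(df1, df2, on='pid', how='outer', indicator=True)
print(df_check)

# Participants that appear in only one of the two data frames
unmatched = df_check.loc[df_check['_merge'] != 'both', 'pid'].tolist()
print(unmatched)
```

Running an outer merge with the indicator first, and only then choosing `left`, `right`, or `inner`, is one way to see what each choice would discard.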


## Tangent: Because R.T. asked some great questions recently!

### 1. How to generate random lists withOUT replacement

In [19]:
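import random

# np.random.randint samples WITH replacement, so values can repeat
x = np.random.randint(10,20,3)

# random.sample draws WITHOUT replacement, so every value is unique
z = random.sample(range(10,20),3)

print(x,z)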

[19 19 15] [15, 19, 14]
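
The same thing can be done without leaving NumPy. This is a minimal sketch using NumPy's `Generator.choice` with `replace=False`; the seed value and the variable names `rng` and `w` are arbitrary choices made here, not part of the original example.

```python
import numpy as np

# A seeded generator makes the draw reproducible; the seed value is arbitrary
rng = np.random.default_rng(2023)

# replace=False guarantees that no value is drawn twice
w = rng.choice(np.arange(10, 20), size=3, replace=False)
print(w)
```

Unlike `random.sample`, this returns a NumPy array rather than a list, which can be convenient when the values go straight into a data frame.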


### 2. How to pivot wide with grouping variables as columns, NOT indices

In [27]:
# What if we have some variable(s) we DON'T want to lengthen?
df_with_group = pd.DataFrame({'pid': [1,2,3,4],
                              'grp' : [1, 1, 2, 2],
                              'var1' : [4, 5, 6, 7],
                              'var2' : [8, 9, 10, 11],
                              'var3' : [12, 13, 14, 15]})

df_long = df_with_group.melt(id_vars=['pid','grp'])

df_wide = df_long.pivot(index=['pid','grp'], columns='variable', values='value')
display(df_with_group)
display(df_wide)
df_wide_reset = df_wide.reset_index().rename_axis(None,axis=1)
display(df_wide_reset)

| | pid | grp | var1 | var2 | var3 |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 4 | 8 | 12 |
| 1 | 2 | 1 | 5 | 9 | 13 |
| 2 | 3 | 2 | 6 | 10 | 14 |
| 3 | 4 | 2 | 7 | 11 | 15 |


| | variable | var1 | var2 | var3 |
|---|---|---|---|---|
| **pid** | **grp** | | | |
| 1 | 1 | 4 | 8 | 12 |
| 2 | 1 | 5 | 9 | 13 |
| 3 | 2 | 6 | 10 | 14 |
| 4 | 2 | 7 | 11 | 15 |


| | pid | grp | var1 | var2 | var3 |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 4 | 8 | 12 |
| 1 | 2 | 1 | 5 | 9 | 13 |
| 2 | 3 | 2 | 6 | 10 | 14 |
| 3 | 4 | 2 | 7 | 11 | 15 |
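
If this pattern comes up often, the pivot-then-reset steps can be wrapped in a small helper. The function `pivot_wide_flat` below is a hypothetical convenience written for this note, not a pandas function; it simply chains the same `pivot`, `reset_index`, and `rename_axis` calls used above, then checks that the round trip recovers the original values.

```python
import pandas as pd

def pivot_wide_flat(df_long, id_vars, names_col='variable', values_col='value'):
    """Pivot long -> wide, then move the id variables from the index back into columns."""
    df_wide = df_long.pivot(index=id_vars, columns=names_col, values=values_col)
    # reset_index turns the index levels into columns;
    # rename_axis(None, axis=1) drops the leftover 'variable' label on the columns
    return df_wide.reset_index().rename_axis(None, axis=1)

df_with_group = pd.DataFrame({'pid': [1, 2, 3, 4],
                              'grp': [1, 1, 2, 2],
                              'var1': [4, 5, 6, 7],
                              'var2': [8, 9, 10, 11],
                              'var3': [12, 13, 14, 15]})

df_long = df_with_group.melt(id_vars=['pid', 'grp'])
df_roundtrip = pivot_wide_flat(df_long, ['pid', 'grp'])

print(list(df_roundtrip.columns))
# Compare values column by column with the original wide data frame
print(df_roundtrip.to_dict('list') == df_with_group.to_dict('list'))
```

Note that `pivot` raises an error if any combination of id variables and variable name appears more than once, so this sketch assumes one row per participant per variable.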


# Wrangle some real data

<img src="img/imit_inhib_fileorg.png" width=500>

## Breaking into 8 code chunks
## 1. Import packages

## 2. Setting paths to the first level data

## 3. Load a test subject to make sense of things

## 4. Iterate through to load the first level data
* Concatenate everything to create one data frame to rule them all

## 5. Merge with questionnaire data

## 6. Write to trial-level allsubjects csv
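
Steps 2 through 6 above can be sketched end to end on made-up data. Everything specific in this example is an assumption: the file naming scheme (`P1_trials.csv`), the column names (`trial`, `rt`, `quest_score`), and the temporary folder are all hypothetical stand-ins, and the real imitation-inhibition files will differ. The point is only the shape of the pipeline: loop over files, concatenate, merge, write.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Build a throwaway folder of fake first-level files (stand-in for the real data)
data_dir = Path(tempfile.mkdtemp())
for pid in ['P1', 'P2', 'P3']:
    fake_trials = pd.DataFrame({
        'trial': np.arange(1, 5),
        'rt': rng.integers(300, 700, size=4),  # hypothetical reaction times in ms
    })
    fake_trials.to_csv(data_dir / f'{pid}_trials.csv', index=False)

# Hypothetical questionnaire table, one row per participant
df_quest = pd.DataFrame({'pid': ['P1', 'P2', 'P3'],
                         'quest_score': [10, 14, 9]})

# Step 4: load each file, tag it with its participant ID, then concatenate
frames = []
for path in sorted(data_dir.glob('*_trials.csv')):
    df_sub = pd.read_csv(path)
    df_sub.insert(0, 'pid', path.stem.split('_')[0])  # 'P1_trials' -> 'P1'
    frames.append(df_sub)
df_trials = pd.concat(frames, ignore_index=True)

# Step 5: how='left' keeps every trial row, even without questionnaire data
df_all = pd.merge(df_trials, df_quest, on='pid', how='left')

# Step 6: write the trial-level file for all subjects
out_path = data_dir / 'allsubjects_trials.csv'
df_all.to_csv(out_path, index=False)
print(df_all.shape)
```

Collecting the per-subject frames in a list and calling `pd.concat` once at the end is generally faster than growing a data frame inside the loop.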

#### Pick up next class...
7. Compute summary measures
8. Save to summary allsubjects csv