# Data Science in Psychology & Neuroscience (DSPN)

## Lecture 9. Data Wrangling (part 1)

### Date: September 21, 2023

### To-Dos From Last Class:

* Submit Assignment #2: <a href="https://www.dropbox.com/request/gzkRwmYMySDWiUddKMCu">Integrate & Fire</a> (before 9/21, 23:00 MDT)

### Today:

* What is data wrangling?
* What is Pandas?
* Data wrangling in Pandas (stealing heavily from this <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Cheatsheet</a>)
    1. Creating a data frame
    2. Reshaping
    3. Subsetting
    4. Adding columns
    5. Combining data frames

### Homework

* Download the data for next class's wrangling session from <a href="https://github.com/hogeveen-lab/DSPN_Fall2023_Git">GitHub</a> --> misc/imitation_inhibition_paradigm



# What is data wrangling?

<img src="img/data_wrangling_schematic.png" width="600">

<img src="img/lotr.gif">

# What is Pandas?

* A data wrangling _package_ for Python
    * Brings much of what works well in R (e.g. data frames) into Python, a general-purpose programming environment

<img src="img/pandas.jpeg" width="650">

## What does this mean?

* Tidy data
    * A shared organizational structure for holding and manipulating data in R and Pandas: each variable is a column, each observation is a row
    
<img src="img/tidy_data.png" width="650">

* This enables you to perform __vectorized operations__ on your data (i.e. operate on a whole column at once instead of looping over rows)
    * Pandas (and tidyverse/dplyr in R) keep every value attached to its observation while those operations run
    
<img src="img/tidy_data_vectorized_operations.png" width="450">
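A minimal sketch of what a vectorized operation looks like in practice. The data frame `rt` and its columns (`pid`, `rt_ms`, `rt_s`) are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical reaction-time data: one row per observation, one column per variable
rt = pd.DataFrame({'pid': [1, 2, 3],
                   'rt_ms': [450, 520, 610]})

# Vectorized: the division is applied to the whole column at once, no explicit loop
rt['rt_s'] = rt['rt_ms'] / 1000

# Loop-based equivalent, shown only for comparison
rt_s_loop = [value / 1000 for value in rt['rt_ms']]

print(rt)
```

Each new value lands in the same row as the observation it was computed from, so there is no index bookkeeping to do by hand.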

# Data Wrangling in Pandas

## 1. Creating DataFrames

In [7]:
# import the package and give it the alias pd
import pandas as pd

# create a data frame from scratch (assign values by column)
df = pd.DataFrame({'var1' : [4,5,6],
                  'var2' : [7,8,9],
                  'var3' : [10,11,12]},
                 index=['obs1','obs2','obs3'])
print(df)

# creating a data frame from scratch (assign values by row)
column_names = ['var1','var2','var3']
df_byrow = pd.DataFrame([[4,7,10],
                  [5,8,11],
                  [6,9,12]],
                  index=['obs1','obs2','obs3'],
                  columns=column_names)
# print(df_byrow)

# # most often, you will read a data frame in from a file instead...
# filepath = '~/Desktop/filtdf.csv'
# df_real = pd.read_csv(filepath)
# print(df_real)

      var1  var2  var3
obs1     4     7    10
obs2     5     8    11
obs3     6     9    12
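Since reading from a file is the most common case, here is a small sketch of `pd.read_csv` that runs without needing a file on disk. The CSV text and the `obs` column name are assumptions made up for this example; with a real file you would pass its path instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_text = """obs,var1,var2,var3
obs1,4,7,10
obs2,5,8,11
obs3,6,9,12
"""

# read_csv accepts a file path or a file-like object;
# index_col says which column holds the row labels
df_from_csv = pd.read_csv(io.StringIO(csv_text), index_col='obs')
print(df_from_csv)
```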


## 2. Reshaping data

### Melt (i.e. go from wide to long)

<img src="img/melt.png" width="400">

In [16]:
# print(df)
df_long = pd.melt(df, var_name='Variables', value_name='Observations')
# print(df_long)

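# what if we have some variable(s) we DON'T want to lengthen (e.g. participant ID, group)?
# list them in id_vars so they are repeated on every row instead of being melted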
df_with_group = pd.DataFrame({'pid' : [1,2,3,4],
                              'grp' : [0,0,1,1],
                              'var1' : [4,5,6,7],
                              'var2' : [8,9,10,11],
                              'var3' : [12,13,14,15]})
print(df_with_group)
# print(pd.melt(df_with_group))
print(pd.melt(df_with_group,id_vars=['pid','grp']))

   pid  grp  var1  var2  var3
0    1    0     4     8    12
1    2    0     5     9    13
2    3    1     6    10    14
3    4    1     7    11    15
    pid  grp variable  value
0     1    0     var1      4
1     2    0     var1      5
2     3    1     var1      6
3     4    1     var1      7
4     1    0     var2      8
5     2    0     var2      9
6     3    1     var2     10
7     4    1     var2     11
8     1    0     var3     12
9     2    0     var3     13
10    3    1     var3     14
11    4    1     var3     15
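A short sketch combining the melt options seen so far, plus `value_vars`, which limits which columns get lengthened. The output names `condition` and `score` are arbitrary labels chosen for this example.

```python
import pandas as pd

df_with_group = pd.DataFrame({'pid': [1, 2, 3, 4],
                              'grp': [0, 0, 1, 1],
                              'var1': [4, 5, 6, 7],
                              'var2': [8, 9, 10, 11],
                              'var3': [12, 13, 14, 15]})

# id_vars: columns kept as identifiers (repeated on every row)
# value_vars: the only columns that get melted (var2 is dropped here)
# var_name / value_name: labels for the two new columns (hypothetical names)
df_long_named = pd.melt(df_with_group,
                        id_vars=['pid', 'grp'],
                        value_vars=['var1', 'var3'],
                        var_name='condition',
                        value_name='score')
print(df_long_named)
```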


### Pivot (i.e. go from long to wide)

<img src="img/pivot.png" width="600">

In [22]:
df_long = pd.melt(df_with_group,id_vars=['pid','grp'])
print(df_long)

# df_wide = pd.pivot(df_long...)
df_wide = df_long.pivot(index=['pid', 'grp'], columns='variable', values='value')
# for comparison, print the original wide data frame that df_long was melted from
print(df_with_group)

    pid  grp variable  value
0     1    0     var1      4
1     2    0     var1      5
2     3    1     var1      6
3     4    1     var1      7
4     1    0     var2      8
5     2    0     var2      9
6     3    1     var2     10
7     4    1     var2     11
8     1    0     var3     12
9     2    0     var3     13
10    3    1     var3     14
11    4    1     var3     15
   pid  grp  var1  var2  var3
0    1    0     4     8    12
1    2    0     5     9    13
2    3    1     6    10    14
3    4    1     7    11    15
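Note that `pivot` moves `pid` and `grp` into the row index rather than leaving them as ordinary columns. The sketch below shows one way to flatten the result back into the original layout; the name `df_flat` is made up for this example, and passing a list to `index` assumes a reasonably recent Pandas version.

```python
import pandas as pd

df_with_group = pd.DataFrame({'pid': [1, 2, 3, 4],
                              'grp': [0, 0, 1, 1],
                              'var1': [4, 5, 6, 7],
                              'var2': [8, 9, 10, 11],
                              'var3': [12, 13, 14, 15]})

# wide -> long -> wide round trip
df_long = pd.melt(df_with_group, id_vars=['pid', 'grp'])
df_wide = df_long.pivot(index=['pid', 'grp'], columns='variable', values='value')

# move pid and grp out of the index and back into regular columns
df_flat = df_wide.reset_index()
# drop the leftover 'variable' label on the column axis
df_flat.columns.name = None
print(df_flat)
```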


## 3. Subsetting Data

### Filter (i.e. subset rows)

<img src="img/filter.png" width="600">

In [32]:
print(df_with_group)

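# keep only the rows where var2 is at least 10 (note: >= includes 10 itself)
df_with_var2_above10 = df_with_group[df_with_group.var2>=10]

print(df_with_var2_above10)

# combine conditions with & (and) or | (or); each condition needs its own parentheses
df_with_var2above10_var3below15 = df_with_group[(df_with_group.var2>=10) & (df_with_group.var3<15)]

print(df_with_var2above10_var3below15)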

   pid  grp  var1  var2  var3
0    1    0     4     8    12
1    2    0     5     9    13
2    3    1     6    10    14
3    4    1     7    11    15
   pid  grp  var1  var2  var3
2    3    1     6    10    14
3    4    1     7    11    15
   pid  grp  var1  var2  var3
2    3    1     6    10    14
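Two other ways to filter rows that you may run into: `query`, which takes the condition as a string, and `isin`, which keeps rows whose value appears in a list. A minimal sketch; the variable names `subset_query`, `subset_mask`, and `subset_isin` are made up for this example.

```python
import pandas as pd

df_with_group = pd.DataFrame({'pid': [1, 2, 3, 4],
                              'grp': [0, 0, 1, 1],
                              'var1': [4, 5, 6, 7],
                              'var2': [8, 9, 10, 11],
                              'var3': [12, 13, 14, 15]})

# boolean-mask version, as in the cell above
subset_mask = df_with_group[(df_with_group.var2 >= 10) & (df_with_group.var3 < 15)]

# the same filter written as a query string
subset_query = df_with_group.query('var2 >= 10 and var3 < 15')

# keep rows whose pid is in a given list
subset_isin = df_with_group[df_with_group['pid'].isin([1, 4])]

print(subset_query)
print(subset_isin)
```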


### Select (i.e. subset columns)

<img src="img/select.png" width="600">

In [41]:
print(df_long)

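# select columns by name: pass a list of column labels
df_pid_value = df_long[['pid','value']]

print(df_pid_value)

# .loc selects by label; a label slice includes both endpoints
print(df_long.loc[:,'variable':'value'])

# .iloc selects by integer position instead (columns 2 and 3 here)
# print(df_long.iloc[:,[2,3]])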

    pid  grp variable  value
0     1    0     var1      4
1     2    0     var1      5
2     3    1     var1      6
3     4    1     var1      7
4     1    0     var2      8
5     2    0     var2      9
6     3    1     var2     10
7     4    1     var2     11
8     1    0     var3     12
9     2    0     var3     13
10    3    1     var3     14
11    4    1     var3     15
    pid  value
0     1      4
1     2      5
2     3      6
3     4      7
4     1      8
5     2      9
6     3     10
7     4     11
8     1     12
9     2     13
10    3     14
11    4     15
   variable  value
0      var1      4
1      var1      5
2      var1      6
3      var1      7
4      var2      8
5      var2      9
6      var2     10
7      var2     11
8      var3     12
9      var3     13
10     var3     14
11     var3     15
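The three selection styles side by side, as a rough sketch: by label (`loc`), by position (`iloc`), and by name pattern (`filter`). The variable names `by_label`, `by_position`, and `by_pattern` are made up for this example.

```python
import pandas as pd

df_with_group = pd.DataFrame({'pid': [1, 2, 3, 4],
                              'grp': [0, 0, 1, 1],
                              'var1': [4, 5, 6, 7],
                              'var2': [8, 9, 10, 11],
                              'var3': [12, 13, 14, 15]})

# label-based: the slice includes both 'var1' and 'var3'
by_label = df_with_group.loc[:, 'var1':'var3']

# position-based: the slice stops before position 5, like regular Python slicing
by_position = df_with_group.iloc[:, 2:5]

# pattern-based: keep columns whose name matches a regular expression
by_pattern = df_with_group.filter(regex='^var')

print(by_label)
```

All three return the same three columns here; which one reads best depends on whether you know the names, the positions, or only a naming pattern.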


## 4. Making New Columns

In [43]:
print(df_wide)
# new column = element-wise (vectorized) difference between two existing columns
df_wide['diff'] = df_wide['var3'] - df_wide['var1']
print(df_wide)


variable  var1  var2  var3
pid grp                   
1   0        4     8    12
2   0        5     9    13
3   1        6    10    14
4   1        7    11    15
variable  var1  var2  var3  diff
pid grp                         
1   0        4     8    12     8
2   0        5     9    13     8
3   1        6    10    14     8
4   1        7    11    15     8
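An alternative sketch using `assign`, which returns a new data frame with the extra columns and leaves the original untouched. The column name `var_mean` and the variable `df_new` are made up for this example.

```python
import pandas as pd

df_with_group = pd.DataFrame({'pid': [1, 2, 3, 4],
                              'grp': [0, 0, 1, 1],
                              'var1': [4, 5, 6, 7],
                              'var2': [8, 9, 10, 11],
                              'var3': [12, 13, 14, 15]})

# each keyword becomes a new column; the lambda receives the data frame itself
df_new = df_with_group.assign(
    diff=lambda d: d['var3'] - d['var1'],
    var_mean=lambda d: d[['var1', 'var2', 'var3']].mean(axis=1),  # row-wise mean
)
print(df_new)
```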


## 5. Combining Data Sets

<img src="img/combining_data.png" width="600">
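A minimal sketch of two common ways to combine data frames: stacking rows with `pd.concat` and joining columns on a shared key with `pd.merge`. The data frames `scores`, `more_scores`, and `groups` are made up for this example.

```python
import pandas as pd

# Hypothetical data: scores collected in two batches, plus a group lookup table
scores = pd.DataFrame({'pid': [1, 2, 3], 'var1': [4, 5, 6]})
more_scores = pd.DataFrame({'pid': [4], 'var1': [7]})
groups = pd.DataFrame({'pid': [1, 2, 3, 4], 'grp': [0, 0, 1, 1]})

# stack rows from data frames that share the same columns
all_scores = pd.concat([scores, more_scores], ignore_index=True)

# add columns by matching rows on a shared key column;
# how='left' keeps every row of all_scores
merged = pd.merge(all_scores, groups, on='pid', how='left')
print(merged)
```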