## Working with Pandas

Pandas is a Python library built on top of NumPy that lets you store and manipulate tabular data with labeled rows and columns. Essentially, it brings some Excel-like functionality to Python. The core of pandas is the <b>DataFrame</b>, which you can think of as a table (or NumPy recarray) that supports any kind of data and has explicit labels for rows and columns.

Before getting into DataFrames, let's first discuss the <b>Series</b> object: pandas' version of a one-dimensional array, with some unique properties of its own. For example, it can be indexed by non-integer labels (unlike NumPy arrays):

In [1]:
import pandas as pd
import numpy as np

#Create a new Series object with strings as indices. By default, Series will come with standard integer indices.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [2]:
#data can be retrieved via string indices (like a Python dictionary)
data['b']

0.5

In [3]:
#or even more dramatically:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [4]:
#Series objects can be sliced, much like NumPy arrays (but a label-based slice includes its end label)
population['California':'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64
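
To make the slicing behavior concrete, here is a small sketch (reusing the population data from above) that contrasts a label-based slice with a position-based one. The variable names `by_label` and `by_position` are just for illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: the end label ('New York') is included.
by_label = population['California':'New York']

# Position-based slice: the end position (2) is excluded, as in NumPy.
by_position = population.iloc[0:2]

print(list(by_label.index))     # ['California', 'Texas', 'New York']
print(list(by_position.index))  # ['California', 'Texas']
```

Keeping this difference in mind avoids off-by-one surprises when switching between labels and positions.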

A DataFrame is a collection of aligned Series objects that share one index, essentially turning indexed lists into indexed 2D arrays, or tables. For example, let's create a table of information about our states:

In [5]:
#Create a new Series object with info about state areas
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [6]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |
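
"Aligned" means that pandas matches values by index label, not by position. As a minimal sketch, here are two Series whose labels only partly overlap (the Alaska entry and the variable names are added purely for illustration). Labels missing from one Series are filled with NaN.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
# Illustrative second Series: different labels, in a different order.
area = pd.Series({'Texas': 695662, 'Alaska': 1723337})

df = pd.DataFrame({'population': population, 'area': area})

# The resulting index is the union of both sets of labels.
print(sorted(df.index))                       # ['Alaska', 'California', 'Texas']

# Values are matched by label: Texas gets its own area, not the first one listed.
print(df.loc['Texas', 'area'])

# Alaska has no population entry, so that cell is NaN.
print(pd.isna(df.loc['Alaska', 'population']))  # True
```

Because NaN is a floating-point value, columns with missing entries are stored as floats rather than integers.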


In [7]:
#See which indices we're using
print(states.index)

#See our column headers
print(states.columns)

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
Index(['population', 'area'], dtype='object')


Do note that indexing into DataFrames works a little differently than arrays. In many ways, pandas objects are more like dictionaries than arrays (but really, more of a dictionary-array hybrid):

In [8]:
#Try running states[0]
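
If you do run `states[0]`, pandas raises a `KeyError`: square brackets on a DataFrame look up a *column* label, and there is no column labeled 0. A small sketch on a two-row stand-in for the `states` table (the `found` flag is only there for illustration):

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193],
                       'area': [423967, 695662]},
                      index=['California', 'Texas'])

try:
    states[0]          # looks for a column labeled 0
    found = True
except KeyError:
    found = False      # no such column, so a KeyError is raised

print(found)  # False
```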

In [9]:
#But now try
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [10]:
#Also note that using a column name as an attribute will work
states.population

#though in general the dictionary-style indexing is safer and can work with non-strings.

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
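
Here is a sketch of why dictionary-style indexing is the safer habit. The table below is hypothetical: one column name collides with an existing DataFrame attribute (`size`), and another contains a space, so it cannot be written as an attribute at all.

```python
import pandas as pd

# Hypothetical table with awkward column names.
df = pd.DataFrame({'size': [10, 20, 30], 'my col': [1, 2, 3]})

by_key = df['size']   # the column, as a Series
by_attr = df.size     # NOT the column: the total number of cells (3 rows x 2 columns)

print(list(by_key))   # [10, 20, 30]
print(by_attr)        # 6

# A name with a space only works with brackets.
print(list(df['my col']))  # [1, 2, 3]
```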

Sometimes you really want to go back to the implicit integer index. There's an easy way to do this: the iloc indexer.

In [11]:
#Get the data for 'California' without using the explicit index
states.iloc[0]

population    38332521
area            423967
Name: California, dtype: int64

In [12]:
#Want to add a new column? It's easy! Just as if you were manipulating a Python dictionary
states['density'] = states['population'] / states['area']
states

| | population | area | density |
|---|---|---|---|
| California | 38332521 | 423967 | 90.413926 |
| Texas | 26448193 | 695662 | 38.01874 |
| New York | 19651127 | 141297 | 139.076746 |
| Florida | 19552860 | 170312 | 114.806121 |
| Illinois | 12882135 | 149995 | 85.883763 |


In [13]:
#Pandas indexers even support combinations of indexing/slicing/masking. The 'loc' indexer refers only to explicit index labels (the counterpart of iloc, which uses integer positions)
states.loc[states['density']>100, ['population', 'density']]

| | population | density |
|---|---|---|
| New York | 19651127 | 139.076746 |
| Florida | 19552860 | 114.806121 |
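
The difference between `loc` and `iloc` matters most when the index itself consists of integers. A minimal sketch with a made-up Series whose labels are 1, 3, and 5:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(s.loc[1])    # 'a' -> the entry whose label is 1
print(s.iloc[1])   # 'b' -> the entry at position 1 (the second one)

# Slices follow the same rule: labels include the end point, positions exclude it.
print(list(s.loc[1:3]))   # ['a', 'b']
print(list(s.iloc[1:3]))  # ['b', 'c']
```

Spelling out `loc` or `iloc` makes it obvious to the reader (and to pandas) which kind of index you mean.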


### Manipulating pandas DataFrames

Pandas has some fantastic string methods for finding and manipulating data in a DataFrame.

In [14]:
data = ['peter', 'Paul', 'MARY', 'gEORGE', 'mark']
names = pd.Series(data)
names

0     peter
1      Paul
2      MARY
3    gEORGE
4      mark
dtype: object

In [15]:
names.str.capitalize()

0     Peter
1      Paul
2      Mary
3    George
4      Mark
dtype: object

In [16]:
names.str.endswith('l')

0    False
1     True
2    False
3    False
4    False
dtype: bool

Particularly useful is the .str.contains() method, which flags the entries of a column that contain a given substring.

In [17]:
names.loc[:] = names.str.capitalize()  #first, let's standardize the capitalization
names.str.contains('ar')

0    False
1    False
2     True
3    False
4     True
dtype: bool
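
Two options of `.str.contains()` are worth knowing: `case=False` ignores capitalization, and the pattern is treated as a regular expression by default. A short sketch using the capitalized names from above (the variable names are illustrative):

```python
import pandas as pd

names = pd.Series(['Peter', 'Paul', 'Mary', 'George', 'Mark'])

# case=False ignores capitalization, so 'MAR' matches 'Mary' and 'Mark'.
mask = names.str.contains('MAR', case=False)

# The pattern is a regular expression by default: '^P' means "starts with P".
starts_with_p = names.str.contains('^P')

# A boolean mask can be used directly to pull out the matching entries.
print(names[mask].tolist())           # ['Mary', 'Mark']
print(names[starts_with_p].tolist())  # ['Peter', 'Paul']
```

Pass `regex=False` if the search string should be taken literally, for example when it contains characters such as `.` or `(`.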

Masking and the .query() method are your friends. You'll find yourself using these often to select subsets of a DataFrame. For example:

In [18]:
#Similar to an example we looked at before
states[states['density']>100]

| | population | area | density |
|---|---|---|---|
| New York | 19651127 | 141297 | 139.076746 |
| Florida | 19552860 | 170312 | 114.806121 |


In [19]:
#Which could also be written as:
states.query('density>100')

| | population | area | density |
|---|---|---|---|
| New York | 19651127 | 141297 | 139.076746 |
| Florida | 19552860 | 170312 | 114.806121 |
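
Both approaches extend to several conditions at once. The sketch below rebuilds the `states` table and applies two conditions; the thresholds and the `min_density` variable are made up for illustration.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127, 19552860, 12882135],
     'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
states['density'] = states['population'] / states['area']

min_density = 80  # hypothetical threshold

# Masking: wrap each condition in parentheses and combine with & (and) or | (or).
masked = states[(states['density'] > min_density) & (states['area'] > 150000)]

# query(): the same filter as a string; @ refers to a local Python variable.
queried = states.query('density > @min_density and area > 150000')

print(list(masked.index))      # ['California', 'Florida']
print(masked.equals(queried))  # True
```

Note that masks need `&` and `|` rather than the Python keywords `and` and `or`, which do not work element-wise on Series.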


This is just a taste of some of the most useful functionality in pandas. For a fuller overview, check out this tutorial: https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html

## Explore the RAM database

The CML's database of intracranial and scalp EEG comes in a pandas dataframe format. All the pertinent data about each experimental session is recorded in a row of a dataframe. The distributed example data consists of 20 intracranial EEG participants in FR1 (a non-stimulation free-recall experiment) and catFR1 (a non-stimulation categorized free-recall experiment).

Let's load the example database to get a better sense of this data format. We're going to use the **CMLLoad** class, a small helper class found in the CMLLoad.py file in this repository, to load the example data.

In [20]:
# First, our import statements.
# The CMLLoad class is your gateway to the experimental data, including channels, events, and eeg data.
from CMLLoad import CMLLoad

# We need to tell CMLLoad what directory contains the experimental data files.
# Point this to where they are on your system.
load = CMLLoad('./CMLExamples')

df = load.Index()

In [21]:
# This dataframe contains index information about the experimental sessions in the example dataset.
df[:10]

| | channels_file | eeg_file | events_file | experiment | session | subject |
|---|---|---|---|---|---|---|
| 0 | subjects/R1380D/R1380D_FR1_1_channels.csv | subjects/R1380D/R1380D_FR1_1_eeg.h5 | subjects/R1380D/R1380D_FR1_1_events.csv | FR1 | 1 | R1380D |
| 1 | subjects/R1380D/R1380D_catFR1_1_channels.csv | subjects/R1380D/R1380D_catFR1_1_eeg.h5 | subjects/R1380D/R1380D_catFR1_1_events.csv | catFR1 | 1 | R1380D |
| 2 | subjects/R1380D/R1380D_catFR1_2_channels.csv | subjects/R1380D/R1380D_catFR1_2_eeg.h5 | subjects/R1380D/R1380D_catFR1_2_events.csv | catFR1 | 2 | R1380D |
| 3 | subjects/R1380D/R1380D_catFR1_3_channels.csv | subjects/R1380D/R1380D_catFR1_3_eeg.h5 | subjects/R1380D/R1380D_catFR1_3_events.csv | catFR1 | 3 | R1380D |
| 4 | subjects/R1380D/R1380D_catFR1_4_channels.csv | subjects/R1380D/R1380D_catFR1_4_eeg.h5 | subjects/R1380D/R1380D_catFR1_4_events.csv | catFR1 | 4 | R1380D |
| 5 | subjects/R1111M/R1111M_FR1_0_channels.csv | subjects/R1111M/R1111M_FR1_0_eeg.h5 | subjects/R1111M/R1111M_FR1_0_events.csv | FR1 | 0 | R1111M |
| 6 | subjects/R1111M/R1111M_FR1_1_channels.csv | subjects/R1111M/R1111M_FR1_1_eeg.h5 | subjects/R1111M/R1111M_FR1_1_events.csv | FR1 | 1 | R1111M |
| 7 | subjects/R1111M/R1111M_FR1_2_channels.csv | subjects/R1111M/R1111M_FR1_2_eeg.h5 | subjects/R1111M/R1111M_FR1_2_events.csv | FR1 | 2 | R1111M |
| 8 | subjects/R1111M/R1111M_FR1_3_channels.csv | subjects/R1111M/R1111M_FR1_3_eeg.h5 | subjects/R1111M/R1111M_FR1_3_events.csv | FR1 | 3 | R1111M |
| 9 | subjects/R1111M/R1111M_catFR1_0_channels.csv | subjects/R1111M/R1111M_catFR1_0_eeg.h5 | subjects/R1111M/R1111M_catFR1_0_events.csv | catFR1 | 0 | R1111M |


In [22]:
# Let's see what experiments we have access to
df['experiment'].unique()

array(['FR1', 'catFR1'], dtype=object)

**Exercise: How many RAM sessions were run on Jefferson subjects?**

In [23]:
np.sum(df['subject'].str.endswith('J'))

26
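
Note that the one-liner above counts sessions (rows), not people. The sketch below shows the difference on a made-up index table. The subject codes are hypothetical, and we assume, as the exercise does, that the trailing letter of a subject code marks the collection site.

```python
import pandas as pd

# Toy stand-in for the index dataframe; these subject codes are made up.
toy = pd.DataFrame({'subject': ['R0001J', 'R0001J', 'R0002M', 'R0003J'],
                    'experiment': ['FR1', 'catFR1', 'FR1', 'FR1'],
                    'session': [0, 0, 0, 0]})

is_j = toy['subject'].str.endswith('J')

# Each row is one session, so summing the mask counts sessions.
n_sessions = int(is_j.sum())

# Counting distinct subject codes counts people instead.
n_subjects = toy.loc[is_j, 'subject'].nunique()

# groupby().size() gives the number of sessions per subject.
sessions_per_subject = toy.groupby('subject').size()

print(n_sessions, n_subjects)          # 3 2
print(sessions_per_subject['R0001J'])  # 2
```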

## Overview of the CML experiment types

### Verbal free-recall tasks (no-stim)
* FR1
* catFR1

### Paired-associates tasks
* PAL1
* PAL2 (open-loop stim)
* PAL3 (closed-loop stim)
* PAL5 (closed-loop stim)

### Spatial navigation tasks
* YC1
* TH1
* THR
* THR1
* YC2 (open-loop stim)
* TH3 (closed-loop stim)

### Verbal free-recall w/ stim
(Basically, any FR task with a number above 1 somewhere)
* FR2 (open-loop)
* catFR2
* FR3 (closed-loop)
* catFR3
* FR5 (closed-loop)
* catFR5
* PS4_FR (closed-loop)
* PS4_catFR (closed-loop)
* PS5_catFR (closed-loop)
* FR6 (multi-target stim)
* catFR6 (multi-target stim)
* TICL_FR (encoding/math/retrieval stim)

### No-task stimulation ("parameter search")
* PS1
* PS2/PS2.1
* PS3
* LocationSearch


In [24]:
# And now let's find all the subjects who did the FR1 task
fr1_df = df.query('experiment == "FR1"')
fr1_df['subject'].unique()

array(['R1380D', 'R1111M', 'R1332M', 'R1377M', 'R1065J', 'R1385E',
       'R1189M', 'R1390M', 'R1391T', 'R1401J', 'R1361C', 'R1060M',
       'R1350D', 'R1378T', 'R1375C', 'R1383J', 'R1354E', 'R1292E'],
      dtype=object)

### Load data from an example subject
Here, let's go through an example of loading experimental events and EEG from one subject.

In [25]:
from CMLLoad import CMLLoad
load = CMLLoad('./CMLExamples')
df = load.Index()

#Specify which subject and experiment we want
sub = 'R1111M'
exp = 'FR1'

#Find out the sessions for this subject
sessions = list(df[(df['subject']==sub) & (df['experiment']==exp)]['session'])


In [26]:
print(sub+' sessions: '+str(sessions))

R1111M sessions: [0, 1, 2, 3]


#### Load experimental events

This subject completed four sessions of FR1. Let's load data from the first session. First, we'll select the rows of the index dataframe for this subject and experiment, then pass one of those rows to the CMLLoad object we created.

In [27]:
df_select = df[(df['subject']==sub) & (df['experiment']==exp)]
print(df_select.iloc[0])

channels_file    subjects/R1111M/R1111M_FR1_0_channels.csv
eeg_file               subjects/R1111M/R1111M_FR1_0_eeg.h5
events_file        subjects/R1111M/R1111M_FR1_0_events.csv
experiment                                             FR1
session                                                  0
subject                                             R1111M
Name: 5, dtype: object


In [28]:
evs = load.Load(df_select.iloc[0], 'events')
print(evs[100:103])

     eegoffset  answer                    eegfile  exp_version experiment  \
100     230205    -999  R1111M_FR1_0_22Jan16_1638         1.05        FR1   
101     231472    -999  R1111M_FR1_0_22Jan16_1638         1.05        FR1   
102     232280    -999  R1111M_FR1_0_22Jan16_1638         1.05        FR1   

     intrusion  is_stim  iscorrect item_name  item_num  ...  protocol  \
100       -999        0       -999      DOLL        78  ...        r1   
101       -999        0       -999       BED        18  ...        r1   
102       -999        0       -999         X      -999  ...        r1   

     recalled  rectime  serialpos session  stim_list  stim_params  subject  \
100         0     -999         11       0          0           []   R1111M   
101         0     -999         12       0          0           []   R1111M   
102         0     -999       -999       0          0           []   R1111M   

          test            type  
100  [0, 0, 0]            WORD  
101  [0, 0, 0]     

The events dataframe contains information about everything that happened during an experimental session. It indicates the time at which every word appeared on the screen, and when those words were later recalled. It also contains information about events that you might not care about, such as when the countdown timer starts and ends.
<center>
<img src="https://github.com/esolomon/PythonBootcamp2019/blob/master/figures/task_design-01.jpg?raw=true" width=650>
</center>
Let's take a look at all the columns in this dataframe.

In [29]:
evs.columns

Index(['eegoffset', 'answer', 'eegfile', 'exp_version', 'experiment',
       'intrusion', 'is_stim', 'iscorrect', 'item_name', 'item_num', 'list',
       'montage', 'msoffset', 'mstime', 'protocol', 'recalled', 'rectime',
       'serialpos', 'session', 'stim_list', 'stim_params', 'subject', 'test',
       'type'],
      dtype='object')

* 'eegoffset' indicates where (in samples) in the EEG file this event occurred. CMLReaders needs this info, but usually you won't need to deal with it directly.
* 'eegfile' is the path to the corresponding file where raw EEG is saved.
* 'experiment' is the behavioral task we're looking at. 
* 'intrusion' is an indicator of intrusion events during the recall period. -1 indicates an extra-list intrusion; otherwise, it's the list number from which the word came.
* 'is_stim' flags whether stimulation occurred during this event. We won't be dealing with stimulation data in this bootcamp. 
* <b>'item_name'</b> is the word that was presented or recalled.
* 'item_num' is the index for this word in the word pool. 
* 'list' is the list number. 
* 'montage' is the subject montage, which you loaded earlier.
* 'mstime' is a time indicator, in ms. Good for comparing between events, but the absolute value is meaningless. 
* <b>'recalled'</b> is an indicator of whether an encoding word was later recalled successfully.
* 'rectime' is the time, in ms, when a word was recalled relative to the start of the recall period for that list.
* <b>'serialpos'</b> is the serial position of a presented/recalled word.
* 'stim_list' is an indicator of whether stimulation was active during this list. 
* 'stim_params' is a dictionary of stimulation parameters.
* 'subject' is the subject you're analyzing!
* <b>'type'</b> is the type of event, e.g. 'WORD' or 'REC_WORD'

Please see https://pennmem.github.io/cmlreaders/html/events.html for even more information!
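
One practical detail: several columns show -999 where a value does not seem to apply, such as 'rectime' for a word that was never recalled. Treating -999 as a "not applicable" placeholder is our reading of the output, not something documented here. Under that assumption, the sketch below uses a made-up events table to show why the placeholder should be masked out before computing statistics.

```python
import numpy as np
import pandas as pd

# Toy events table mimicking a few columns of the real one (values are made up).
toy_evs = pd.DataFrame({'type': ['WORD', 'WORD', 'REC_WORD'],
                        'item_name': ['BEAR', 'ROOT', 'BEAR'],
                        'recalled': [1, 0, 1],
                        'rectime': [5210, -999, 5210]})

# A naive mean is dragged down by the -999 placeholder.
naive_mean = toy_evs['rectime'].mean()

# Replace the placeholder with NaN; pandas skips NaN in mean() by default.
clean = toy_evs['rectime'].replace(-999, np.nan)
clean_mean = clean.mean()

print(naive_mean < clean_mean)  # True
print(clean_mean)               # 5210.0
```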

Say we're just interested in analyzing word encoding events. To filter by event type, use a boolean mask on the 'type' column:

In [30]:
pd.set_option('display.max_columns', 100)  #an optional command that lets us view the full dataframe within Jupyter notebooks

In [31]:
word_evs = evs[evs['type']=='WORD']
word_evs[:10]

| | eegoffset | answer | eegfile | exp_version | experiment | intrusion | is_stim | iscorrect | item_name | item_num | list | montage | msoffset | mstime | protocol | recalled | rectime | serialpos | session | stim_list | stim_params | subject | test | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 100520 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | BEAR | 17 | 1 | 0 | 1 | 1453499295325 | r1 | 1 | 5210 | 1 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 28 | 101829 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | WING | 294 | 1 | 0 | 1 | 1453499297942 | r1 | 1 | 5748 | 2 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 29 | 103113 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | DOOR | 79 | 1 | 0 | 1 | 1453499300510 | r1 | 1 | 7882 | 3 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 30 | 104329 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | PLANT | 188 | 1 | 0 | 1 | 1453499302943 | r1 | 1 | 6815 | 4 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 31 | 105638 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | ROOT | 204 | 1 | 0 | 1 | 1453499305561 | r1 | 0 | -999 | 5 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 32 | 106897 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | LEAF | 146 | 1 | 0 | 1 | 1453499308078 | r1 | 0 | -999 | 6 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 33 | 108105 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | SNOW | 236 | 1 | 0 | 1 | 1453499310495 | r1 | 0 | -999 | 7 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 34 | 109372 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | BLOOM | 22 | 1 | 0 | 1 | 1453499313029 | r1 | 0 | -999 | 8 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 35 | 110580 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | STRAW | 257 | 1 | 0 | 1 | 1453499315446 | r1 | 0 | -999 | 9 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |
| 36 | 111813 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | -999 | 0 | -999 | BUSH | 38 | 1 | 0 | 1 | 1453499317912 | r1 | 0 | -999 | 10 | 0 | 0 | [] | R1111M | [0, 0, 0] | WORD |


Applying this kind of filter is useful if you're only interested in analyzing one kind of event. For instance, we could also just find recall events:

In [32]:
rec_evs = evs[evs['type']=='REC_WORD']
rec_evs[:10]

| | eegoffset | answer | eegfile | exp_version | experiment | intrusion | is_stim | iscorrect | item_name | item_num | list | montage | msoffset | mstime | protocol | recalled | rectime | serialpos | session | stim_list | stim_params | subject | test | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | 130521 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | BEAR | 17 | 1 | 0 | 20 | 1453499355330 | r1 | 1 | 5210 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 48 | 130790 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | WING | 294 | 1 | 0 | 20 | 1453499355868 | r1 | 1 | 5748 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 49 | 131324 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | PLANT | 188 | 1 | 0 | 20 | 1453499356935 | r1 | 1 | 6815 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 50 | 131857 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | DOOR | 79 | 1 | 0 | 20 | 1453499358002 | r1 | 1 | 7882 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 51 | 135459 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | TOY | 277 | 1 | 0 | 20 | 1453499365206 | r1 | 1 | 15086 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 78 | 180505 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | DEER | 72 | 2 | 0 | 20 | 1453499455303 | r1 | 1 | 2863 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 79 | 181022 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | MULE | 163 | 2 | 0 | 20 | 1453499456338 | r1 | 1 | 3898 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 80 | 181875 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | SLUSH | 231 | 2 | 0 | 20 | 1453499458043 | r1 | 1 | 5603 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 81 | 182389 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | PIPE | 185 | 2 | 0 | 20 | 1453499459071 | r1 | 1 | 6631 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
| 82 | 184110 | -999 | R1111M_FR1_0_22Jan16_1638 | 1.05 | FR1 | 0 | 0 | -999 | SPRING | 244 | 2 | 0 | 20 | 1453499462514 | r1 | 1 | 10074 | -999 | 0 | 0 | [] | R1111M | [0, 0, 0] | REC_WORD |
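
Filters like these can be combined or widened. The sketch below uses a made-up events table (the 'OTHER' event type is a placeholder, not a real event name) to show three common moves: counting event types, keeping several types at once with `.isin()`, and combining a type filter with a list number.

```python
import pandas as pd

# Toy events table; values and the 'OTHER' type are made up for illustration.
toy_evs = pd.DataFrame({'type': ['OTHER', 'WORD', 'WORD', 'REC_WORD', 'WORD'],
                        'list': [1, 1, 1, 1, 2],
                        'item_name': ['X', 'BEAR', 'WING', 'BEAR', 'DEER']})

# How many events of each type are there?
type_counts = toy_evs['type'].value_counts()

# Keep several event types at once.
task_evs = toy_evs[toy_evs['type'].isin(['WORD', 'REC_WORD'])]

# Combine conditions: encoding events from list 1 only.
list1_words = toy_evs[(toy_evs['type'] == 'WORD') & (toy_evs['list'] == 1)]

print(type_counts['WORD'])              # 3
print(len(task_evs))                    # 4
print(list1_words['item_name'].tolist())  # ['BEAR', 'WING']
```

Filtered dataframes keep their original row labels, so use `.iloc` if you want rows by position after filtering.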


**Exercise: What is R1111M's overall recall percent correct?**

**Exercise: What is R1111M's percent correct at each serial position?**

**Exercise: Plot the distribution of inter-response times for R1111M's recalled words**