# Assignment 2: Behavioral Analysis of Memory Search
Please submit this assignment to Canvas as a Jupyter notebook (.ipynb).  In it you will carry out several behavioral analyses that illustrate fundamental dynamics of human memory.

In [1]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cmlreaders as cml

## Assignment Overview
In this assignment you will analyze behavioral data from a big-data study of human memory, specifically the ltpFR2 dataset.  ltpFR2 is a free recall experiment in which each session includes 24 lists, each consisting of 24 words presented one after another (we call this the encoding period).  Each list is followed by a distractor period, during which participants solve simple math problems, and then a recall period, during which participants attempt to recall as many of the just-learned words as they can, in any order (hence "free" recall).  Chapter 1 of Electrophysiology of Human Memory summarizes the principles concerning recall that you will evaluate in this assignment.

* A note on terminology: Subjects run the experiments in units of "sessions". They come into the lab, run the ltpFR2 free recall experiment over the 24 lists containing 24 words each, and then they go home. They then (typically) come back a different day and do it again (for ltpFR2, most subjects completed 24 sessions). Each of these "trips to the lab" to collect data constitutes a "session". There are several benefits to breaking up the total data collection into multiple sessions:

    * We can collect more data. It would be difficult, if not impossible, to collect 24 (or even 2 or 3) sessions from a person in a single day.

    * Subjects may for non-experimental reasons perform better and worse on one day vs. another. For instance, someone may sleep poorly the night before a session and do worse than they usually would. Collecting data over multiple days allows us to average out those shorter term variations in performance.

    * Collecting data over multiple sessions allows us to study practice effects and word familiarity effects.
    
* Some subjects did not complete all sessions.  Include only the subjects who completed all 24 sessions. Additionally, drop the 24th session (session number 23, since the sessions are zero-indexed), which used a different variant of the experiment.
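
The session filtering described above can be sketched on a toy table. This is only an illustration under stated assumptions: `toy_index` and its subjects `S1` and `S2` are made up, and only the `subject` and `session` column names are taken from the real data index.

```python
import pandas as pd

# Toy stand-in for the data index (hypothetical subjects, not real data):
# S1 completed sessions 0-23, S2 stopped after session 9.
toy_index = pd.DataFrame({
    "subject": ["S1"] * 24 + ["S2"] * 10,
    "session": list(range(24)) + list(range(10)),
})

N_SESSIONS = 24

# Count distinct sessions per subject and keep only the complete subjects.
sessions_per_subject = toy_index.groupby("subject")["session"].nunique()
complete = sessions_per_subject[sessions_per_subject == N_SESSIONS].index

# Keep complete subjects, then drop the final session (number 23, zero-indexed).
keep = toy_index["subject"].isin(complete) & (toy_index["session"] != N_SESSIONS - 1)
filtered = toy_index[keep]

print(filtered["subject"].unique(), filtered["session"].max())
```

The completeness check runs before the last session is dropped, so a subject still needs all 24 sessions to be included.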

In [6]:
# let's get all the ltpFR2 sessions from the database
exp = 'ltpFR2'
df = cml.get_data_index('ltp', rootdir='/').query("experiment == @exp")
df

Unnamed: 0,all_events,experiment,import_type,math_events,original_session,session,subject,subject_alias,task_events
487,protocols/ltp/subjects/LTP093/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP093/experiments/ltpF...,0,0,LTP093,LTP093,protocols/ltp/subjects/LTP093/experiments/ltpF...
488,protocols/ltp/subjects/LTP093/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP093/experiments/ltpF...,1,1,LTP093,LTP093,protocols/ltp/subjects/LTP093/experiments/ltpF...
489,protocols/ltp/subjects/LTP093/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP093/experiments/ltpF...,10,10,LTP093,LTP093,protocols/ltp/subjects/LTP093/experiments/ltpF...
490,protocols/ltp/subjects/LTP093/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP093/experiments/ltpF...,11,11,LTP093,LTP093,protocols/ltp/subjects/LTP093/experiments/ltpF...
491,protocols/ltp/subjects/LTP093/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP093/experiments/ltpF...,12,12,LTP093,LTP093,protocols/ltp/subjects/LTP093/experiments/ltpF...
...,...,...,...,...,...,...,...,...,...
6526,protocols/ltp/subjects/LTP393/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP393/experiments/ltpF...,5,5,LTP393,LTP393,protocols/ltp/subjects/LTP393/experiments/ltpF...
6527,protocols/ltp/subjects/LTP393/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP393/experiments/ltpF...,6,6,LTP393,LTP393,protocols/ltp/subjects/LTP393/experiments/ltpF...
6528,protocols/ltp/subjects/LTP393/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP393/experiments/ltpF...,7,7,LTP393,LTP393,protocols/ltp/subjects/LTP393/experiments/ltpF...
6529,protocols/ltp/subjects/LTP393/experiments/ltpF...,ltpFR2,build,protocols/ltp/subjects/LTP393/experiments/ltpF...,8,8,LTP393,LTP393,protocols/ltp/subjects/LTP393/experiments/ltpF...


A couple of important points about the ltpFR2 data set:

* It is a *scalp EEG* data set.  This means that we do not need to specify the localization and montage when instantiating a reader, since there are no electrodes implanted in the participants' brains.
* The list number is given by the "trial" column, not the "list" column.
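
As a small illustration of the second point, recall events can be grouped by list using `trial`. The table below is a made-up stand-in for an events dataframe; only the column names `type` and `trial`, the `REC_WORD` event type, and the `-999` placeholder are taken from the real events.

```python
import pandas as pd

# Toy events table (made-up values) using the same column names as the
# real events dataframe. Non-list events carry the placeholder trial -999.
toy_evs = pd.DataFrame({
    "type":  ["START", "REC_WORD", "REC_WORD", "REC_WORD", "REC_WORD", "REC_WORD"],
    "trial": [-999, 1, 1, 1, 2, 2],
})

# Keep recall events and count them per list, grouping on `trial`.
rec = toy_evs[toy_evs["type"] == "REC_WORD"]
recalls_per_list = rec.groupby("trial").size()

print(recalls_per_list.to_dict())
```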

In [7]:
# as an example, let's load behavioral events for an ltpFR2 session
df_sess = df.iloc[0]
reader = cml.CMLReader(subject=df_sess['subject'], 
                       experiment=df_sess['experiment'], 
                       session=df_sess['session'])
evs = reader.load('events')
evs

Unnamed: 0,eegoffset,answer,begin_distractor,begin_math_correct,eegfile,eogArtifact,experiment,final_distractor,final_math_correct,intruded,...,phase,protocol,recalled,rectime,serialpos,session,subject,test,trial,type
0,208336,-999,-999,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,-999,-999,0,...,,ltp,0,-999,-999,0,LTP093,"[0, 0, 0]",-999,SESS_START
1,249459,-999,-999,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,-999,-999,-999,...,,ltp,-999,-999,-999,0,LTP093,"[-999, -999, -999]",-999,START
2,249475,24,-999,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,-999,-999,-999,...,,ltp,-999,5611,-999,0,LTP093,"[7, 8, 9]",-999,PROB
3,252312,20,-999,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,-999,-999,-999,...,,ltp,-999,2849,-999,0,LTP093,"[3, 8, 9]",-999,PROB
4,253768,11,-999,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,-999,-999,-999,...,,ltp,-999,2169,-999,0,LTP093,"[2, 6, 3]",-999,PROB
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1336,2788365,-999,24000,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,24000,-999,0,...,,ltp,0,10186,15,0,LTP093,"[0, 0, 0]",24,REC_WORD
1337,2790926,-999,24000,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,24000,-999,0,...,,ltp,0,15308,18,0,LTP093,"[0, 0, 0]",24,REC_WORD
1338,2791475,-999,24000,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,24000,-999,0,...,,ltp,0,16406,19,0,LTP093,"[0, 0, 0]",24,REC_WORD
1339,2798827,-999,24000,-999,/protocols/ltp/subjects/LTP093/experiments/ltp...,-1,ltpFR2,24000,-999,0,...,,ltp,0,31109,1,0,LTP093,"[0, 0, 0]",24,REC_WORD


## Question 1: Inter-Response Time Curves
An inter-response time (IRT) is defined as the time between two successive recalls.  In the IRT curves, seen in FHM figure 6.11, we want to plot the IRT as a function of both transition position and total number of recalls on the list.  Essentially, on our plot, we want IRT on the y-axis, transition position on the x-axis, and a different curve for each number of total recalls on the list.  Each point on the graph should represent an average IRT for the given transition position and total number of recalls.

* Note that the event dataframe contains information on the time at which the recall occurred relative to the start of the recall period.


1) Compute and plot the IRT growth curves for LTP093's ltpFR2 sessions (all on one graph).  Make sure you include a legend on your graph indicating the number of recalls each curve corresponds to.
2) Compute and plot the between-subject average IRT growth curves.
3) Give a brief explanation of what you found.  Explain how you handled recall errors, such as repetitions and intrusions, in your analyses (this should make you think about how you define transition positions).
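
The IRT definition above can be sketched on toy data. This is a minimal sketch rather than a full solution: the recall times are made up, every row is treated as a correct recall (so the handling of repetitions and intrusions asked about in part 3 is left open), and the helper columns `irt`, `transition`, and `n_recalls` are hypothetical names.

```python
import pandas as pd

# Toy recall events (made-up times in ms from recall-period onset).
toy_rec = pd.DataFrame({
    "trial":   [1, 1, 1, 2, 2, 2, 3, 3],
    "rectime": [1000, 2500, 6000, 800, 2000, 5000, 1200, 4200],
})

toy_rec = toy_rec.sort_values(["trial", "rectime"])
by_list = toy_rec.groupby("trial")["rectime"]

# IRT: time between successive recalls within a list (NaN for the first recall).
toy_rec["irt"] = by_list.diff()
# Transition position: 1 for the first-to-second recall transition, and so on.
toy_rec["transition"] = toy_rec.groupby("trial").cumcount()
# Total number of recalls on the list.
toy_rec["n_recalls"] = by_list.transform("count")

# Mean IRT per (total recalls, transition position); one row per curve.
irt_curves = (toy_rec.dropna(subset=["irt"])
              .groupby(["n_recalls", "transition"])["irt"].mean()
              .unstack("transition"))

print(irt_curves)
```

Each row of `irt_curves` corresponds to one curve on the plot (one total number of recalls), and each column to one transition position.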

In [8]:
# Question 1.1
### YOUR ANSWER HERE

In [9]:
# Question 1.2
### YOUR ANSWER HERE

Question 1.3

**YOUR ANSWER HERE**

## Question 2: Lag-CRP
The Lag Conditional Response Probability (Lag-CRP) is a calculation measuring the temporal organization of memory.  When we transition freely from one recall to another, are we more likely to transition between items that were presented close together in time or far apart in time?  See http://memory.psych.upenn.edu/CRP_Tutorial for an example of the concept and FHM figure 6.8 for an example of the graph.  Essentially, at each lag, we divide the number of actual transitions by the number of possible transitions to get a conditional response probability.  Remember that repeats and intrusions (words not from the list) can appear in the recall events, and must be dealt with.
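
The "actual divided by possible" idea can be sketched for a single toy list. This is one possible way to count transitions, not a reference implementation: the function name `lag_crp_single_list` is hypothetical, and the sketch assumes the recall sequence has already been cleaned of repeats and intrusions, which is the part you still need to decide how to handle.

```python
import numpy as np

def lag_crp_single_list(recalls, list_length):
    """Actual and possible transition counts by lag for one list.

    `recalls` holds 1-based serial positions in output order. This sketch
    assumes repeats and intrusions have already been removed.
    """
    n_lags = 2 * list_length - 1          # lags from -(L-1) to +(L-1)
    offset = list_length - 1              # array index of lag 0
    actual = np.zeros(n_lags)
    possible = np.zeros(n_lags)
    recalled = set()
    for current, nxt in zip(recalls[:-1], recalls[1:]):
        recalled.add(current)
        # Any not-yet-recalled item is a possible next recall.
        for candidate in range(1, list_length + 1):
            if candidate not in recalled:
                possible[candidate - current + offset] += 1
        actual[nxt - current + offset] += 1
    return actual, possible

actual, possible = lag_crp_single_list([3, 4, 2], list_length=5)
with np.errstate(invalid="ignore"):
    crp = actual / possible               # NaN where a lag was never possible
lags = np.arange(-4, 5)

print(dict(zip(lags.tolist(), crp.tolist())))
```

For the toy sequence 3, 4, 2 on a 5-item list, lags +1 and -2 each come out at 0.5. When combining lists, summing the actual and possible counts before dividing is one common choice, but that is a design decision for you to make and justify.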

1) Using the first ltpFR2 session of LTP093, calculate and plot the Lag-CRP for lags ranging from -15 to 15.
2) Compute and plot the Lag-CRP for LTP093, averaged over all the subject's ltpFR2 sessions.
3) Compute the Lag-CRP for all subjects in the ltpFR2 data set, plotting the between-subject Lag-CRP curve (i.e. the average across subject-level lag-CRPs).

Again, think about the design of your code.  You should be able to reuse some of the same code for multiple parts of the question!

4) What effects do you see in these curves?
5) Clearly explain how you dealt with intrusions and repeated recalls.

In [14]:
# Question 2.1
### YOUR CODE HERE

In [15]:
# Question 2.2
### YOUR CODE HERE

In [16]:
# Question 2.3
### YOUR CODE HERE

Question 2.4

**YOUR ANSWER HERE**

Question 2.5

**YOUR ANSWER HERE**

## Question 3: PLI Recency
When subjects incorrectly recall an item that was not studied on the current list, it is often an item seen on a recent prior list.  We call these errors prior-list intrusions (PLIs) because they are intrusion errors that came from a prior list, as opposed to extra-list intrusions (ELIs), which are words that were not encountered during the experimental session.

The PLI recency curve plots the proportion of PLIs as a function of how many lists back the intruding word was studied, as seen in FHM figure 6.15.  For example, if a word was recalled on list 8 that was studied on list 5, that PLI would have a list recency of 3.
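
The list-recency arithmetic can be sketched as follows. The values are made up, the column names `recall_list` and `study_list` are hypothetical, and the toy deliberately ignores the restrictions to particular lists and recencies that the question asks for.

```python
import pandas as pd

# Toy PLIs (made-up): the list on which the word was recalled and the
# earlier list on which it was studied. Column names are hypothetical.
toy_plis = pd.DataFrame({
    "recall_list": [8, 10, 12, 12, 15],
    "study_list":  [5, 9, 11, 10, 14],
})

# List recency: how many lists back the intruding word was studied.
toy_plis["list_recency"] = toy_plis["recall_list"] - toy_plis["study_list"]

# Proportion of PLIs at each recency.
pli_recency = toy_plis["list_recency"].value_counts(normalize=True).sort_index()

print(pli_recency.to_dict())
```

The first row reproduces the example in the text: recalled on list 8, studied on list 5, list recency 3.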

1) Restricting the analysis to lists 10-24, compute the between-subject average proportion of PLIs coming from 1-9 lists back.  Then plot the resulting curve (aggregated across lists 10-24).  

2) What do these results show?  Why is it important to not include earlier lists in this analysis?

In [18]:
# Question 3.1
### YOUR CODE HERE

Question 3.2

**YOUR ANSWER HERE**