# Experimental data preprocessing - first stage

Download raw data for experiment *Brisbane* and process it to create two data-frames: one for recognition data and one for recall data. We also verify the integrity of the data files, primarily for the purposes of reproducibility. 

The downloading, processing, and saving should take just a matter of seconds. 

In [1]:
# Standard library imports
import os
import string
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# Local imports 
from utils import processing, utils, topicmodels

Set location of the cache directory.

In [2]:
cache_directory = '../cache'
cache_fullpath = lambda path: os.path.join(cache_directory, path)

Verify that two key files are in the cache.

In [3]:
filenames = {
    'experiment_cfg' : [('Brismo.cfg',
                         '909d9f8de483c4547f26fb4c34b91e12908ab5c144e065dc0fe6c1504b1f22c9')],
    'fake_subject_file' : [('fake_subject_uids.txt',
                            '04bfa8c11b999b371f24ca907c314d43064e42c23a1e0aa2025c797a4d454b66')]
}

utils.verify_cache_files(filenames['experiment_cfg'] + filenames['fake_subject_file'],
                         cache=cache_directory,
                         verbose=False)
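
The internals of `utils.verify_cache_files` are not shown in this notebook. As a minimal sketch of what such a check might look like, assuming the 64-character hex strings above are SHA-256 digests, the helper names `sha256_of_file` and `verify_files` below are hypothetical and the demonstration file is a throwaway, not one of the real cache files.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(name_checksum_pairs, cache):
    """Raise if any listed file is missing or its digest differs."""
    for name, expected in name_checksum_pairs:
        path = os.path.join(cache, name)
        if not os.path.isfile(path):
            raise IOError('Missing cache file: ' + path)
        if sha256_of_file(path) != expected:
            raise ValueError('Checksum mismatch: ' + path)
    return True


# Usage with a throwaway file (not one of the real cache files).
demo_cache = tempfile.mkdtemp()
with open(os.path.join(demo_cache, 'demo.txt'), 'wb') as f:
    f.write(b'hello')
demo_checksum = hashlib.sha256(b'hello').hexdigest()
print(verify_files([('demo.txt', demo_checksum)], cache=demo_cache))
```

Reading in chunks keeps memory use flat for large files, and raising on the first mismatch stops the notebook before any later cell can run on unverified inputs.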

Set the list of fake subject uids. So-called "fake subjects" are test accounts that were set up for automatic or manual testing of the experiment website, so removing them does not remove any real subjects.

In [4]:
processing.fake_subject_uids = processing.get_fake_subject_uids(cache_fullpath('fake_subject_uids.txt'))
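
How `processing` uses this list is not shown here. The sketch below illustrates the general idea under two assumptions: that the uid file holds one uid per line, and that each session record carries its subject uid under a `subject` key. The helper names and the uids are made up for illustration.

```python
def read_uids(lines):
    """Return the set of non-empty, whitespace-stripped uids."""
    return set(line.strip() for line in lines if line.strip())


def drop_fake_subjects(sessions, fake_uids, key='subject'):
    """Keep only the sessions whose subject uid is not a test account."""
    return [s for s in sessions if s[key] not in fake_uids]


# Made-up uids and sessions, purely for illustration.
fake_uids = read_uids(['aaa1111\n', '\n', 'bbb2222\n'])
sessions = [{'subject': 'aaa1111'}, {'subject': 'ccc3333'}]
real_sessions = drop_fake_subjects(sessions, fake_uids)
print(real_sessions)
```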

Obtain the experimental data from the permanent and world-readable URL. We could cache this data, but downloading it each time also confirms that the raw data is still publicly available.

In [5]:
data = processing.get_data('https://data.cognitionexperiments.org/06b643a')

Extract all the experiment sessions.

In [6]:
sessions = data['ExperimentVersions'][0]['Sessions']

Parse and process both the recognition data and the recall data, and then take a look at both. Note that the seed below is used to assign a unique random id to each slide. 

In [7]:
Df = {}
Df['recognition'] = processing.get_textrecognition_data(sessions, seed=4244)
Df['recall'] = processing.get_textrecall_data(sessions, seed=4312)

In [8]:
Df['recall'].head()

Unnamed: 0,session,subject,age,sex,slide,completed,text,readingtime,word
0,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,Apparently
1,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,There
2,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,is
3,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,no
4,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,case


In [9]:
Df['recognition'].head()

Unnamed: 0,session,subject,age,sex,slide,completed,text,readingtime,word,expected,order,hit,response,correct,rt
0,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,purple,True,0,True,True,True,1.002
1,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,tastefully,False,1,True,False,True,0.917
2,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,cataract,True,2,True,True,True,1.199
3,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,sack,True,3,True,True,True,0.71
4,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,relic,False,4,True,False,True,1.04


## Process the recognition data

* Filter out all the misses. A "miss" is a trial on which no recognition response was made within the permitted time interval.
* Verify that the *correct* variable agrees with the *expected* and actual responses.
* Create a new variable that uniquely identifies each "stimulus" as the text-word combination. For example, *45-purple* is a unique stimulus, and is the word *purple* in text 45.

In [10]:
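# Keep only the trials on which a response was made (drop the misses).
Df['recognition'] = Df['recognition'].query('hit == True')

# 'correct' must be True exactly when the response matches what was expected.
assert Df['recognition'][['expected', 'response', 'correct']].apply(lambda row: (row['expected'] == row['response']) == row['correct'], axis=1).all()

# Label each stimulus as '<text>-<word>', e.g. '45-purple'.
Df['recognition']['stimulus'] = Df['recognition'][['text', 'word']].apply(lambda x: str(x['text']) + '-' + x['word'], axis=1)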

In [11]:
Df['recognition'].head()

Unnamed: 0,session,subject,age,sex,slide,completed,text,readingtime,word,expected,order,hit,response,correct,rt,stimulus
0,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,purple,True,0,True,True,True,1.002,45-purple
1,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,tastefully,False,1,True,False,True,0.917,45-tastefully
2,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,cataract,True,2,True,True,True,1.199,45-cataract
3,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,sack,True,3,True,True,True,0.71,45-sack
4,186a069,4ba33f7,29,Male,d69884d,True,45,62.805,relic,False,4,True,False,True,1.04,45-relic
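
The three recognition steps can also be sketched on plain dictionaries, without pandas. The first two rows below are taken from the output above; the third is a hypothetical miss added only to show the filter, and `process_recognition` is an illustrative name rather than part of `processing`.

```python
rows = [
    {'text': 45, 'word': 'purple', 'expected': True,
     'hit': True, 'response': True, 'correct': True},
    {'text': 45, 'word': 'tastefully', 'expected': False,
     'hit': True, 'response': False, 'correct': True},
    # Hypothetical miss: no response within the permitted time.
    {'text': 45, 'word': 'example', 'expected': True,
     'hit': False, 'response': None, 'correct': False},
]


def process_recognition(rows):
    """Drop misses, check 'correct', and add the 'stimulus' label."""
    hits = [dict(r) for r in rows if r['hit']]
    for r in hits:
        # 'correct' is True exactly when response and expected agree.
        assert (r['expected'] == r['response']) == r['correct']
        r['stimulus'] = '%d-%s' % (r['text'], r['word'])
    return hits


processed = process_recognition(rows)
print([r['stimulus'] for r in processed])
```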


Confirm that we have data for all texts, except text 34. In data set '06b643a', there is no recognition memory data for text 34. On a few occasions, text 34 was recorded as having been presented as the recognition memory text, but on those occasions there were no responses from the subject (this may have occurred if they started but did not complete the test).

In [12]:
texts_1_to_50_set = set(range(1, 51))
try:
    texts_in_recognition_results = set(Df['recognition']['text'].unique())
    assert texts_in_recognition_results == texts_1_to_50_set
except AssertionError:
    # Text 34 is the only text allowed to be absent.
    assert texts_1_to_50_set.difference(texts_in_recognition_results) == set((34,))
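
The same coverage check can be expressed as a single set comparison. This is a sketch on simulated text IDs rather than the real data, and `missing_texts` is a hypothetical helper. Unlike the cell above, it only looks at which of the texts 1 to 50 are absent, and accepts either no missing texts or text 34 alone.

```python
def missing_texts(observed, n_texts=50):
    """Return the text IDs in 1..n_texts that never occur in `observed`."""
    return set(range(1, n_texts + 1)) - set(observed)


# Simulated column of text IDs in which text 34 never appears.
observed = [t for t in range(1, 51) if t != 34] * 2
allowed_missing = {34}

# Passes if nothing is missing, or if only text 34 is missing.
assert missing_texts(observed) <= allowed_missing
print(sorted(missing_texts(observed)))
```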

In [13]:
recognition_results_filename = 'brisbane_06b643a_recognition_results.pkl'
recognition_results_file_checksum\
     = 'e5680ff9853133af8f4d6d7d96382ee7d1698748289b0c77a2ca20fb123c71c3'

Df['recognition'].to_pickle(cache_fullpath(recognition_results_filename))

assert utils.checksum(cache_fullpath(recognition_results_filename)) == recognition_results_file_checksum

## Process the recall data

Make all recalled words lower case.

In [14]:
Df['recall']['word'] = Df['recall']['word'].str.lower()

Confirm that we have recall memory test data for all texts.

In [15]:
assert set(Df['recall']['text'].unique()) == texts_1_to_50_set

To determine whether each recalled word is correct, i.e. actually in the to-be-remembered text, first load and tokenize the experiment texts, keyed by text ID.

In [16]:
texts = topicmodels.get_experiment_texts('Brismo.cfg', cache=cache_directory)
text_contents = {}
for text in texts:
    _, text_id = text.split('_')
    text_contents[int(text_id) + 1] = utils.tokenize(texts[text])    
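
The behaviour of `utils.tokenize` and the exact key names returned by `topicmodels.get_experiment_texts` are not shown here. As a rough sketch only, the example below assumes keys of the form `text_0`, `text_1`, and so on (zero-based, hence the `+ 1`), and substitutes a simple lower-casing regular-expression tokenizer. Both the tokenizer and the two short texts are invented for illustration.

```python
import re


def simple_tokenize(text):
    """Lower-case the text and return its runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Invented stand-ins for the experiment texts.
texts = {
    'text_0': 'Apparently there is no case.',
    'text_10': 'A purple sack.',
}

text_contents = {}
for name, body in texts.items():
    _, text_id = name.split('_')
    # Shift the zero-based file index to the one-based text ID.
    text_contents[int(text_id) + 1] = simple_tokenize(body)

print(sorted(text_contents))
```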

Determine the accuracy of each recalled word. 

In [17]:
Df['recall']['accuracy'] = Df['recall'].apply(lambda row: row['word'] in text_contents[row['text']], axis=1)

As we did above with the *stimulus* variable in the recognition data, create a *response* variable that identifies the recalled word in the given text. Use the 'n-w' format, where 'n' is the text ID and 'w' is the word.

In [18]:
Df['recall']['response'] = Df['recall'].apply(lambda row: str(row['text']) + '-' + row['word'], axis=1)

In [19]:
Df['recall'].head()

Unnamed: 0,session,subject,age,sex,slide,completed,text,readingtime,word,accuracy,response
0,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,apparently,True,11-apparently
1,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,there,True,11-there
2,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,is,True,11-is
3,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,no,True,11-no
4,186a069,4ba33f7,29,Male,96a7502,True,11,60.165,case,True,11-case
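
The recall steps (lower-casing, accuracy, and the 'n-w' response label) can be sketched without pandas as follows. The vocabulary for text 11 contains only the five words visible in the output above, not the full text, and the recalled word 'banana' is a hypothetical intrusion added to show a `False` case. `score_recall` is an illustrative name.

```python
# Partial, illustrative vocabulary for text 11 (not the full text).
text_vocab = {11: {'apparently', 'there', 'is', 'no', 'case'}}

# (text ID, recalled word) pairs; 'banana' is a hypothetical intrusion.
recalled = [(11, 'Apparently'), (11, 'case'), (11, 'banana')]


def score_recall(recalled, text_vocab):
    """Lower-case each word, mark its accuracy, and label it 'n-w'."""
    scored = []
    for text_id, word in recalled:
        word = word.lower()
        scored.append({
            'text': text_id,
            'word': word,
            'accuracy': word in text_vocab[text_id],
            'response': '%d-%s' % (text_id, word),
        })
    return scored


scored = score_recall(recalled, text_vocab)
print([(r['response'], r['accuracy']) for r in scored])
```

Storing each text's vocabulary as a set makes the membership test constant-time per word, which may help when there are many recalled words per text.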


In [20]:
recall_results_filename = 'brisbane_06b643a_recall_results.pkl'
recall_results_file_checksum\
     = 'a94d812373123b9a8b1eac848276e8ffc6a563ebca71ff2bf5adc97c825cbc14'

Df['recall'].to_pickle(cache_fullpath(recall_results_filename))

assert utils.checksum(cache_fullpath(recall_results_filename)) == recall_results_file_checksum