# Loading data & Creating X and Y
This notebook contains examples of how to:
* Create X from stored XLS file info
* Create Y from stored variables
* Overlay results on makeshift brain
* Create an ad hoc ROI (which you will need for MVPA / RSA)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import os
import sys
import glob
import h5py
import pandas

# Add directory above to python path so we can find utility functions
sys.path.append(os.path.abspath('..'))
# Import local utility functions
import utils

from scipy.stats import zscore

%matplotlib inline

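Note: to import the `utils` functions in another notebook (which you will need to do), you have to make sure you add the directory that contains `utils` to the Python path, as the `sys.path.append` line in the cell above does.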

In [None]:
fdir = '/unrshare/LESCROARTSHARE/data_PSY763/SnowLabData/'
glob.glob(fdir + '*xls*')

In [None]:
# Load two excel files as data frames in pandas
df1 = pandas.read_excel(os.path.join(fdir, 'Sub03_Run_Breakdown.xls'))
df2 = pandas.read_excel(os.path.join(fdir, 'Subject 3 Runs with specific object info.xlsx'))

# Reasoning through how to extract the relevant info
The next sequence of cells shows my thought process in reasoning about how to get the relevant info out of these excel files. Not all of the code in them is strictly necessary; there is a more abbreviated version below. But I thought it would be helpful for you to see how to plod through this rather than jump straight to the final answer. There are lots of print statements (showing the size of arrays, and how many True values are in the logical indices that are created), which are meant to serve as sanity checks.

Here we go.

In [None]:
df1[:5]

In [None]:
df2[:4]

In this experiment, there are 127 * 7 = 889 time points (TRs). Thus, the X for this experiment must be 889 time points long. 

In [None]:
# extract a few useful variables
# For the first xls file
run = df1['Run'].values
onset = df1['BV Start'].values
offset = df1['BV Stop'].values
cat = df1['Run6'].values
# Separate indices for the second xls file
objects = df2['object'].values
# A little cleanup to get rid of extra quotes
print(objects[:3]) # before
objects = np.array([o.strip("'''") for o in objects])
print(objects[:3]) # after (no extra quotes)
conditions = df2['condition'].values
# Similar cleanup
conditions = np.array([c.strip("'") for c in conditions])
object_run = df2['run'].values

In [None]:
# Extract one run of data to figure out how to parse the rest
# Create a logical index for run 1
ri = run==1 
# (this index is over all rows in the xls file)
print(ri.shape)
# (73 of the 511 rows in the xls file are for run 1)
print(ri.sum())

In [None]:
# Do the same for run 1 in the second xls file
ri_objects = object_run==1
objects_r1 = objects[ri_objects]
conditions_r1 = conditions[ri_objects]
# (there are 24 objects shown in run 1)
print(len(objects_r1))
print(len(conditions_r1))

There is no category or variable stored that indicates TRIAL, so we have to get a bit tricky here. 

In [None]:
# Select the category labels for run 1
cat_r1 = cat[ri]
# Select each entry that is NOT fixation and is NOT response
fixation_r1 = cat_r1=='Fixation'
response_r1 = cat_r1=='Response'
# Word onsets were when it was NOT fixation and NOT response
word_on = ~(fixation_r1 | response_r1)
# This leaves us with a logical index over rows for when the words were on, 
# which we can use to select rows to give us the onset times (in TR indices)
# (this index is over rows in run 1 only - there are 73 rows relating to run 1 in the xls file)
print(word_on.shape)
# (there are 24 trials in run 1 - which is good, because it matches the 24 objects in the second xls file)
print(word_on.sum())
# Select onset & offset values for run 1
onsets_r1 = onset[ri]
offsets_r1 = offset[ri]
word_onsets_r1 = onsets_r1[word_on]
word_offsets_r1 = offsets_r1[word_on]
# This should match up with the xls row entries above.
print('Words on:', word_onsets_r1)
print('Words off:', word_offsets_r1)
# Technically, we can ignore offsets, because all of these are 1-TR conditions.
# Also, we have to set the indices to be zero-based, because python.
word_onsets_r1 -= 1

In [None]:
n_TRs_per_run = 127

# We can use these onsets to create a design matrix that we can use.
# Here, we create a simple design matrix - one column, just ones at image onset.
X_simple = np.zeros((n_TRs_per_run,1))
for on in word_onsets_r1:
    X_simple[on, 0] = 1.0

In [None]:
plt.imshow(X_simple.T, aspect='auto')

In [None]:
cond2number = {'real':0, 'photo':1, 'foil':2}

In [None]:
cond2number

In [None]:
cond2number['foil']

In [None]:
# Here, we create a slightly more complex design matrix, with separate columns for 
# the real, photo, and foil conditions
n_conditions = 3
X = np.zeros((n_TRs_per_run, n_conditions))
# Define a dictionary to map the condition to a number
cond2number = {'real':0, 'photo':1, 'foil':2}
# "enumerate" returns an index (0, 1, 2, etc) along with the values in word_onsets_r1,
# which we map to the "itrial" variable here
for itrial, on in enumerate(word_onsets_r1):
    print('---Trial %d---'%itrial) # simple formatting
    this_condition = conditions_r1[itrial]
    print(this_condition)
    cond_idx = cond2number[this_condition]
    print('... assigned to ', cond_idx)
    X[on, cond_idx] = 1

In [None]:
# Et voila.
plt.imshow(X.T, aspect='auto')
plt.ylabel('Condition')
plt.yticks([0, 1, 2], ['real', 'photo', 'foil'])
plt.xlabel('Time (TRs)')

# Real deal
OK, so now our task is to do that for every run. THIS is the cell you should keep and modify when creating your own models. ***The main thing you will have to change is the dictionary that maps condition to a number.*** You will probably want to define a dict that is called `word2feature` or some such, which takes all the words in the experiment and maps them to one of several different features (which can be indicator variables, or whatever`*`). 

`*` there is an example of "whatever" below.

In [None]:
# Start with a list
X_list = []
n_runs = 7
n_TRs_per_run = 127
n_conditions = 3 # could be: n_features. If so don't forget to change it below.
for this_run in range(1, n_runs+1):
    Xtmp = np.zeros((n_TRs_per_run, n_conditions))
    ri_xls1 = run==this_run
    ri_xls2 = object_run==this_run
    # xls1 is too long - not just trials, but 3 values per trial. thus, fancier selection
    # Word onsets were when it was NOT fixation and NOT response
    cat_thisrun = cat[ri_xls1]
    word_on = ~((cat_thisrun=='Fixation') | (cat_thisrun=='Response'))
    all_onsets_thisrun = onset[ri_xls1]
    # -1 because python wants zero-based indices
    word_onsets_thisrun = all_onsets_thisrun[word_on] - 1
    
    conds_thisrun = conditions[ri_xls2]
    objects_thisrun = objects[ri_xls2]
    # All of these variables should always be 24 long for this experiment
    #print(len(objects_thisrun))
    #print(len(conds_thisrun))
    #print(len(word_onsets_thisrun))
    for itrial in range(0,24):
        on = word_onsets_thisrun[itrial]
        o = objects_thisrun[itrial]
        cond = conds_thisrun[itrial]
        cond_idx = cond2number[cond]
        # OR: define object2feature (see below), and call this:
        #feature_idx = object2feature[o]
        # ... and then use feature_idx as the index for Xtmp in the next line 
        # instead of cond_idx
        Xtmp[on, cond_idx] = 1
    # For each run, add the X variable we have created to a list:
    X_list.append(Xtmp)
# ... and concatenate everything here:
X = np.vstack(X_list)

In [None]:
plt.imshow(X.T, aspect='auto')

Another way you could go about creating this X would be to create the full array of zeros (889 x 3) first, and then index into it. This would be a little more annoying, since the onset indices you have restart at 1 in each run, so the indices into that big array for run 2 would have to start at 128.

The way it's done above just keeps things a little simpler for bookkeeping.
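
For comparison, here is a minimal sketch of the "one big array" alternative described above. The onsets and condition indices are made-up stand-ins (`onsets_by_run` and `conds_by_run` are hypothetical names, not variables from this notebook); the point is only the bookkeeping for the per-run offset.

```python
import numpy as np

n_runs = 7
n_TRs_per_run = 127
n_conditions = 3

# Hypothetical stand-ins: 1-based onsets within each run, plus a condition
# index per trial. In the notebook these would come from the xls files.
onsets_by_run = [np.array([5, 20, 41]) for _ in range(n_runs)]
conds_by_run = [np.array([0, 1, 2]) for _ in range(n_runs)]

X_full = np.zeros((n_runs * n_TRs_per_run, n_conditions))
for irun in range(n_runs):
    # Row of this run's first TR within the concatenated array
    run_offset = irun * n_TRs_per_run
    # Convert 1-based within-run onsets to 0-based rows of the full array
    rows = run_offset + onsets_by_run[irun] - 1
    X_full[rows, conds_by_run[irun]] = 1

print(X_full.shape)
```

The extra `run_offset` term is exactly the bookkeeping that the list-and-`vstack` approach avoids.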

In [None]:
# Et voila.
print(X.shape)
plt.imshow(X.T, aspect='auto')

To define your own features, you will have to map words (or other variables from the excel file info) to your own conditions / features, using something like this: 

In [None]:
# Define a new model like this (note that every object maps to 1 here,
# so you will need to substitute your own feature values):
object2feature = {
    'Head phones': 1,
    'Ice cream scoop': 1,
    'Bandages': 1,
    'Ice scrapper': 1,
    'Baseball glove': 1,
    'Bow tie': 1,
    'Camera': 1,
    'Battery': 1,
    'Carrots': 1,
    'Beer mug': 1,
    'Ladle': 1,
    'Box knife': 1,
    'Book': 1,
    'Comb': 1,
    'Lemon': 1,
    'Apron': 1,
    'Bird': 1,
    'Acorn': 1,
    'Lock': 1,
    'Lollypop': 1,
    'Can opener': 1,
    'Banana': 1,
    'Belt': 1,
    'Magnifying glass': 1,
    'Bird house': 1,
    'Plate': 1,
    'Golf ball': 1,
    'Playing card': 1,
    'Saucepan': 1,
    'Razor': 1,
    'Salt shaker': 1,
    'Glass bottle': 1,
    'Hammer': 1,
    'Bowl': 1,
    'Jack-O-Lantern': 1,
    'Handbag': 1,
    'Lint roller': 1,
    'Bottle cap': 1,
    'Scissors': 1,
    'Bullet': 1,
    'Bronze sponge': 1,
    'Brick': 1,
    'Bath sponge': 1,
    'Lighter': 1,
    'Binoculars': 1,
    'Garden shovel': 1,
    'Butter knife': 1,
    'Bolt': 1,
    'Coffee filter': 1,
    'Dice': 1,
    'Cork': 1,
    'Dog': 1,
    'Flip flop': 1,
    'Dog bowl': 1,
    'Corkscrew': 1,
    'Extension cord': 1,
    'Dishbrush': 1,
    'Domino': 1,
    'Flower': 1,
    'Car lighter': 1,
    'Dust pan': 1,
    'Cactus': 1,
    'Cotton balls': 1,
    'Flashlight': 1,
    'CD': 1,
    'Cell phone': 1,
    'Baby bottle': 1,
    'Flask': 1,
    'Fork': 1,
    'Chalk': 1,
    'Funnel': 1,
    'Coat hook': 1,
    'Dish soap': 1,
    'Pine cone': 1,
    'Turkey baster': 1,
    'Sauce brush': 1,
    'Scale': 1,
    'Remote control': 1,
    'Wine glass': 1,
    'Game controller': 1,
    'Vase': 1,
    'Rubber duck': 1,
    'Electrical tape': 1,
    'Flower pot': 1,
    'Door stop': 1,
    'Tennis ball': 1,
    'Curling Iron': 1,
    'Pear': 1,
    'Pizza cutter': 1,
    'Swim goggles': 1,
    'Crayon': 1,
    'Door knob': 1,
    'Light bulb': 1,
    'Electrial outlet cover': 1,
    'Hole punch': 1,
    'Cow bell': 1,
    'Frying pan': 1,
    'Mouse': 1,
    'Paint roller': 1,
    'Pasta spoon': 1,
    'Mason jar': 1,
    'Matches': 1,
    'Flyswatter': 1,
    'Eye dropper': 1,
    'Mug': 1,
    'Gift bow': 1,
    'Nail polish': 1,
    'Paintbrush': 1,
    'Pacifier': 1,
    'Pencil': 1,
    'Napkin holder': 1,
    'Fuse': 1,
    'Eye patch': 1,
    'Nutcracker': 1,
    'Picture frame': 1,
    'Mitten': 1,
    'Frisbee': 1,
    'Piggy bank': 1,
    'Plastic bottle': 1,
    'Hair band': 1,
    'Measuring cup': 1,
    'Hand fan': 1,
    'Hair clip': 1,
    'Jewelry box': 1,
    'Toy truck': 1,
    'Toothbrush': 1,
    'Tennis shoe': 1,
    'High heel shoe': 1,
    'Stapler': 1,
    'Highlighter marker': 1,
    'Thread': 1,
    'Beanie': 1,
    'Soap': 1,
    'Ice tray': 1,
    'Tape': 1,
    'Dumbbell': 1,
    'Sponge': 1,
    'Tea bag': 1,
    'Grater': 1,
    'Timer': 1,
    'Hour glass': 1,
    'Tongs': 1,
    'Spatula': 1,
    'Handsaw': 1,
    'Ashtray': 1,
    'Basket': 1,
    'Egg slicer': 1,
    'Shot glass': 1,
    'Medicine bottle': 1,
    'Bell': 1,
    'Birdie': 1,
    'Oven mitt': 1,
    'Candle': 1,
    'Whistle': 1,
    'Wrench': 1,
    'Checkers': 1,
    'Phone': 1,
    'Drink shaker': 1,
    'Clothes hanger': 1,
    'Clothes pin': 1,
    'Straw': 1,
    'Wire cutters': 1,
    'Lipstick': 1,
    'MP3 Player': 1,
    'Snow goggles': 1,
    'Measuring tape': 1,
    'Hand blender': 1,
    'Butter dish': 1,
    }
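
As a rough illustration of how such a dictionary could feed into a design matrix, the sketch below uses a small, made-up mapping. The feature groupings (kitchen item / tool / other) and all the `*_demo` names are assumptions for illustration only; they are not part of the experiment or of `utils`.

```python
import numpy as np

# Hypothetical feature assignment: 0 = kitchen item, 1 = tool, 2 = other.
# These groupings are illustrative, not taken from the experiment.
object2feature_demo = {
    'Ladle': 0,
    'Fork': 0,
    'Hammer': 1,
    'Wrench': 1,
    'Book': 2,
}
n_features = len(set(object2feature_demo.values()))

# Stand-ins for one run's zero-based onsets and object names
n_TRs_demo = 20
onsets_demo = np.array([2, 6, 10, 14, 18])
objects_demo = np.array(['Ladle', 'Hammer', 'Book', 'Fork', 'Wrench'])

X_demo = np.zeros((n_TRs_demo, n_features))
for on, obj in zip(onsets_demo, objects_demo):
    # Look up the feature column for this object and mark its onset
    X_demo[on, object2feature_demo[obj]] = 1

# Number of trials assigned to each feature column
print(X_demo.sum(axis=0))
```

In the per-run loop above, the equivalent change would be to look up `object2feature[o]` and use the result as the column index, with `n_conditions` set to the number of features.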

# Load fMRI data

In [None]:
fdir = '/unrshare/LESCROARTSHARE/data_PSY763/SnowLabData/'
all_files = sorted(glob.glob(fdir + '*mat'))

In [None]:
unsmoothed_files = all_files[::2]
smoothed_files = all_files[1::2]
# Show what we've done with this indexing:
for f in unsmoothed_files:
    print(f)

In [None]:
# Create a list for data
data = []
# Load each file into list. NOTE that here we are choosing smoothed or unsmoothed data!
for file in smoothed_files:
    with h5py.File(file, 'r') as hf:
        # Read the full dataset into memory (`.value` was removed in h5py 3.0)
        d = hf['data'][()]
    print('Original size: ', d.shape)
    # Transpose data so time is first axis
    d = d.T
    # Map the 4 values returned by d.shape to separate variables
    t, z, y, x = d.shape
    print('Transposed size: ', d.shape)
    # Reshape data to be time x (all voxels)
    d = np.reshape(d, (t, -1)) # the -1 here means string all the voxels out into one vector per time point
    print('Reshaped size: ', d.shape)
    # standardize by run, because that makes many things easier
    data.append(zscore(d, axis=0))

In [None]:
# Time is now first dimension; stack everything up
Y = np.vstack(data)

In [None]:
# Check it out: X and Y.
X.shape, Y.shape

Don't forget to account for the hemodynamic response function (HRF) in your X! See `utils.fmri` for a utility function that helps with this.
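
To make the HRF step concrete, here is a minimal sketch of convolving a design matrix with a generic double-gamma HRF. It is a stand-in, not the function in `utils.fmri` (whose name and signature are not shown in this notebook). The helper names, the HRF parameters, and the TR of 2 seconds are all assumptions.

```python
import numpy as np
from scipy.stats import gamma

def double_gamma_hrf(tr, duration=32.0):
    """Sample a generic double-gamma HRF once per TR (illustrative parameters)."""
    t = np.arange(0, duration, tr)
    peak = gamma.pdf(t, 6)         # positive response, peaking around 5 s
    undershoot = gamma.pdf(t, 16)  # later, smaller undershoot
    hrf = peak - undershoot / 6.0
    return hrf / hrf.max()

def convolve_design(X, hrf):
    """Convolve each column of X with the HRF, trimming to the original length."""
    n_TRs = X.shape[0]
    return np.column_stack([np.convolve(col, hrf)[:n_TRs] for col in X.T])

tr_assumed = 2.0  # seconds; an assumption, check your acquisition parameters
X_toy = np.zeros((127, 3))
X_toy[[10, 50, 90], [0, 1, 2]] = 1  # one event per condition
X_hrf = convolve_design(X_toy, double_gamma_hrf(tr_assumed))
print(X_hrf.shape)
```

If you use something like this, it is probably safest to convolve each run's matrix separately before stacking, so that responses at the end of one run do not bleed into the start of the next.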

Other utility functions you will need are:

In [None]:
# `imp` is deprecated (and removed in Python 3.12); use importlib instead
import importlib
importlib.reload(utils.fmri)
importlib.reload(utils)

In [None]:
# Get a makeshift 3D brain on which to plot your data
brain = utils.fmri.get_brain(unsmoothed_files[0])
print(brain.shape)

In [None]:
# Define an arbitrary ROI
roi = np.zeros_like(brain)
roi[35:45, 5:15, 8:-8] = 1
# Flatten, so this will be like other statistical results derived
# from your X / Y matrices:
roi_flat = roi.flatten()

Use the utility function to display the ROI as you would data. 

### Some notes on the function utils.fmri.overlay_brain():
The `threshold` input crops out the zeros in the ROI (if set above zero); if you don't set threshold, you won't see the brain underneath the data at all. Same applies to your results. 

**WARNING**: NaNs in your data will mess up image plots (and many other kinds of plots, too). NaNs can come from dividing by zero (e.g. for voxels outside the brain). Shit happens. You can convert NaNs to zeros using the function `np.nan_to_num`.
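
A small sketch of how this can arise, and of the fix, using toy data: a constant "voxel" has zero standard deviation, so z-scoring it divides zero by zero. The array names here are made up for the example.

```python
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(0)
d_demo = rng.normal(size=(127, 4))
d_demo[:, 2] = 0.0  # a constant "voxel", e.g. one outside the brain

# Silence the divide-by-zero warnings for this demonstration
with np.errstate(invalid='ignore', divide='ignore'):
    z_demo = zscore(d_demo, axis=0)
# Which columns contain NaNs after z-scoring?
print(np.isnan(z_demo).any(axis=0))

# Replace NaNs with zeros before plotting or further analysis
z_clean = np.nan_to_num(z_demo)
print(np.isnan(z_clean).any())
```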

You will also want to experiment with the `vmin` / `vmax` arguments to this function.

In [None]:
# Show the ROI
utils.fmri.overlay_brain(roi_flat, brain, threshold=0.5, cmap='inferno',
                  vmin=0, vmax=1)

Note: to use `roi` (or `roi_flat`, for that matter) as a logical index, you have to convert the values in it to True / False values instead of 1s and 0s. How would you do this...? (There are examples in the class notebooks).