# Tutorial 1c: Tidy data

*This tutorial was generated from a Jupyter notebook.  You can download the notebook [here](t1b_tidy_data.ipynb).*

In [1]:
# Standard library imports
import collections
import gzip
import os
import shutil

# Our numerical workhorses
import numpy as np
import pandas as pd
import scipy.stats as st
import scipy.signal

# Import pyplot for plotting
import matplotlib.pyplot as plt

# Seaborn, useful for graphics
import seaborn as sns

# Import Bokeh modules for interactive plotting
import bokeh.io
import bokeh.models
import bokeh.plotting

# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline

# This enables high-resolution PNG graphics inline (only use with static (non-Bokeh) plots)
%config InlineBackend.figure_formats = {'png', 'retina'}

# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2, 
      'axes.labelsize': 18, 
      'axes.titlesize': 18, 
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style('darkgrid', rc=rc)

# Set up Bokeh for inline viewing
# bokeh.io.output_notebook()

## Tidying using Pandas
We can use Pandas's powerful tools to tidy our data by loading it into a Pandas `DataFrame` and then using some of Pandas's slick functions to make it tidy.

We start by loading in the metadata describing the regions for each of the grayordinates in the main data file.  We load them in as Pandas `Series` objects because they are easier to index (no need to specify a column name).  We do this by using the `pd.read_csv()` function to read into a `DataFrame`, and then just slice out the `Series` we want.
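As a small illustration of this read-then-slice pattern, here is a sketch on in-memory data. The file contents and the names `key_csv`, `regions_csv`, `s_keys`, and `s_regs` are hypothetical, and for clarity both columns of the key file are named explicitly, which differs slightly from the call used on the real data.

```python
import io

import pandas as pd

# Hypothetical key file: region ID, region name (no header row)
key_csv = io.StringIO("1,CORTEX_LEFT\n2,CORTEX_RIGHT\n5,AMYGDALA_LEFT\n")

# Hypothetical region file: one region ID per row of the main data set
regions_csv = io.StringIO("1\n1\n5\n2\n")

# Read into a DataFrame indexed by region ID, then slice out a Series
s_keys = pd.read_csv(key_csv, header=None,
                     names=['region_id', 'region_name'],
                     index_col='region_id')['region_name']

# Read the single-column file and slice out its only column
s_regs = pd.read_csv(regions_csv, header=None,
                     names=['region_id'])['region_id']

# Look up the name of each row's region by region ID
names = s_regs.map(s_keys)
print(names.tolist())
```

Because `s_keys` is a `Series` indexed by region ID, a lookup such as `s_keys.loc[5]` needs no column name.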

In [28]:
# Load in the key that relates brain region to identifiers
s_region_keys = pd.read_csv('../data/dubois_et_al/full_dataset/regionkey.csv', 
                            header=None, names=['region_name'], 
                            index_col=0)['region_name']

# Load in regions corresponding to each row of big data set
s_regions = pd.read_csv('../data/dubois_et_al/full_dataset/SUB01_region.csv', 
                        header=None, names=['region_id'])['region_id']

We have to be careful because the data set has some regions listed as `0`, which means that the brain region is undefined.  This key is not included in `s_region_keys`, so we need to add it.

Now, we can load in the main `DataFrame`.

In [29]:
# Load in big daddy
df = pd.read_csv('../data/dubois_et_al/full_dataset/SUB01_FIX_data.csv', header=None)

We will use `pd.melt()` to put it in tidy format.  Here is our strategy:

1. Compute the time points, knowing that each column in the `DataFrame` holds a single time point, and that the elapsed time between samples is one second.
2. Make the time points the column names.
3. Concatenate the region index onto the `DataFrame`.
4. Concatenate the region string onto the `DataFrame`.
5. Melt the `DataFrame`.
6. Write it out as a compressed CSV file.

We'll do steps 1-5 in the cell below.
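Before applying these steps to the full data set, it may help to see them on a toy example. The sketch below uses a made-up 2-by-3 array and hypothetical names (`df_wide`, `df_tidy`); the region IDs are arbitrary, and region names are left out for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical wide data: 2 grayordinates (rows) by 3 time points (columns)
df_wide = pd.DataFrame([[0.1, 0.2, 0.3],
                        [1.1, 1.2, 1.3]])

# Steps 1 and 2: compute time points and use them as column names
frame_rate = 1.0  # frames per second
n_frames = len(df_wide.columns)
df_wide.columns = np.arange(n_frames) / frame_rate

# Steps 3 and 4: attach the identifying columns
df_wide['region_id'] = [5, 6]
df_wide['grayordinate'] = df_wide.index

# Step 5: melt, so that each row holds a single observation
df_tidy = pd.melt(df_wide, id_vars=['region_id', 'grayordinate'],
                  var_name='time (s)', value_name='voxel_value')
print(df_tidy)
```

The two rows and three time columns become six rows, one per (grayordinate, time point) pair.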

In [30]:
# Frame rate
frame_rate = 1.0  # frames per second

# Number of frames (equal to number of columns)
n_frames = len(df.columns)

# Compute time points of images
t = np.linspace(0.0, (n_frames - 1) / frame_rate, n_frames)

# Make the time points into the column names
df.columns = t

# Put in the region index
df['region_id'] = s_regions

# Add region names (region 0 is undefined and absent from the key, so add it)
s_region_keys[0] = 'UNDEFINED'
reg_list = [s_region_keys[region] for region in s_regions]
df['region_name'] = reg_list

# The index is the grayordinate; keep it
df['grayordinate'] = df.index

# Melt the DataFrame
df = pd.melt(df, id_vars=['region_id', 'region_name', 'grayordinate'], 
             var_name='time (s)', value_name='voxel_value')

We now have a tidy `DataFrame`!  We can write it out to a CSV file so we have it for later.  One issue with tidy data is that storing it in raw form results in bloated file sizes because of the many repeated values.  However, having many repeated values does enable good compression.  We can therefore use the [gzip module](https://docs.python.org/3/library/gzip.html) from the standard Python library to make the new file.
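As an alternative to compressing in a separate step, `DataFrame.to_csv()` can write gzip-compressed output directly through its `compression` keyword argument. The sketch below is not what this tutorial does with the real data; it uses a small made-up `DataFrame` and a temporary directory to compare file sizes.

```python
import os
import tempfile

import pandas as pd

# Hypothetical tidy data with many repeated values
df_small = pd.DataFrame({'region_id': [5] * 1000,
                         'time (s)': [float(i) for i in range(1000)],
                         'voxel_value': [0.5] * 1000})

with tempfile.TemporaryDirectory() as tmpdir:
    raw_path = os.path.join(tmpdir, 'tidy.csv')
    gz_path = os.path.join(tmpdir, 'tidy.csv.gz')

    # Write an uncompressed and a gzip-compressed copy
    df_small.to_csv(raw_path, index=False, float_format='%.5f')
    df_small.to_csv(gz_path, index=False, float_format='%.5f',
                    compression='gzip')

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

    # pd.read_csv() infers gzip compression from the .gz extension
    df_back = pd.read_csv(gz_path)

print(raw_size, gz_size)
```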

In [31]:
# # Write CSV
# df.to_csv('../data/dubois_et_al/SUB01_data_tidy.csv', index=False, 
#           float_format='%.5f')

# # Compress it
# with open('../data/dubois_et_al/SUB01_data_tidy.csv', 'rb') as f_in:
#     with gzip.open('../data/dubois_et_al/SUB01_data_tidy.csv.gz', 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)
        
# # Delete uncompressed CSV
# os.remove('../data/dubois_et_al/SUB01_data_tidy.csv')

If we look at the relative size of the compressed original file and the compressed tidy file, the tidy file is only about twice as big.

In [32]:
# os.path.getsize('../data/dubois_et_al/SUB01_data_tidy.csv.gz') / \
#             os.path.getsize('../data/dubois_et_al/SUB01_data.csv.gz')

## Tidying by hand
Another way to generate a tidy `DataFrame` is to hand-build a CSV file while reading in the data.  The advantage of this approach is that we do not have to read in the entire original data set.  In fact, only one line at a time is read in and written out, so we will not have any issues with RAM.  The output is also a bit prettier, because we can control the ordering of the columns and rows.  However, this does not really matter when working with tidy data, since the indices and ordering are irrelevant.
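The idea can be sketched on in-memory text streams standing in for the real files. The input values, the `region_ids` list, and the stream names here are hypothetical, and region names are omitted for brevity.

```python
import io

# Hypothetical wide input: one grayordinate per line, one value per time point
f_in = io.StringIO("0.1,0.2,0.3\n1.1,1.2,1.3\n")
f_out = io.StringIO()

region_ids = [5, 6]  # hypothetical region ID for each input line
frame_rate = 1.0     # frames per second

# Write header
f_out.write('grayordinate,region_id,time (s),voxel_value\n')

# Only one input line is held in memory at a time
for g, line in enumerate(f_in):
    vals = [float(x) for x in line.strip().split(',')]

    # One output row per time point
    for i, val in enumerate(vals):
        f_out.write('%d,%d,%g,%.5f\n'
                    % (g, region_ids[g], i / frame_rate, val))

tidy_text = f_out.getvalue()
print(tidy_text)
```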

We will reload the `Series` containing the regions and region keys so that the code block below can stand alone without everything we did in the [previous section](#Tidying-using-Pandas).

In [33]:
# # Load in the key that relates brain region to identifiers
# s_region_keys = pd.read_csv('../data/dubois_et_al/regionkey.csv', 
#                             header=None, names=['region_name'], 
#                             index_col=0)['region_name']

# # Load in regions corresponding to each row of big data set
# s_regions = pd.read_csv('../data/dubois_et_al/SUB01_region.csv', 
#                         header=None, names=['region_id'])['region_id']

# # Append zero region key for undefined region
# s_region_keys[0] = 'UNDEFINED'

# # Frame rate
# frame_rate = 1.0  # frames per second

# infile = '../data/dubois_et_al/SUB01_data.csv.gz'
# outfile = '../data/dubois_et_al/SUB01_data_tidy.csv'
# with gzip.open(infile, 'r') as f_in, open(outfile, 'w') as f_out:
#     # Write header
#     f_out.write('grayordinate,region_id,region_name,time (s),voxel_value\n')
    
#     # Loop through time entries in data file  (each one is a grayordinate)
#     g = 0
#     line = f_in.readline()
#     while line != b'':
#         # Determine region ID, region name, and voxel values for time series
#         region_id = s_regions[g]
#         region_name = s_region_keys[region_id]
#         voxel_vals = np.fromstring(line, sep=',')

#         # Compute time points for time series
#         t = np.linspace(0.0, (len(voxel_vals) - 1) / frame_rate, len(voxel_vals))

#         # Write time series to CSV file in tidy format
#         for i, val in enumerate(voxel_vals):
#             f_out.write('%d,%d,%s,%g,%.5f\n' \
#                                 % (g, region_id, region_name, t[i], val))
        
#         # Go to the next grayordinate
#         g += 1
#         line = f_in.readline()

Now that we have made the file, we can compress it, as we did before.  We will overwrite the one we made in the [previous section](#Tidying-using-Pandas), because it does not really matter which one we use.

In [34]:
# # Compress it
# with open('../data/dubois_et_al/SUB01_data_tidy.csv', 'rb') as f_in:
#     with gzip.open('../data/dubois_et_al/SUB01_data_tidy.csv.gz', 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)
        
# # Delete uncompressed CSV
# os.remove('../data/dubois_et_al/SUB01_data_tidy.csv')

Of course, storing the data in tidy format is advantageous because now we don't have to re-tidy it after loading it in.  Another advantage is that we can now use a tool like [Dask](http://dask.pydata.org/en/latest/) to handle the large data set taking advantage of parallel processing.

## Mean activity of each region
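We compute the mean activity of each region at each time point by grouping the tidy `DataFrame` by region and time and averaging the voxel values within each group.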
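To make the split-apply-combine step concrete, here is a minimal sketch on a made-up tidy `DataFrame`. The values and the names `df_toy` and `df_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical tidy data: region 5 has two grayordinates, region 6 has one
df_toy = pd.DataFrame({
    'region_id':   [5, 5, 5, 5, 6, 6],
    'time (s)':    [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    'voxel_value': [1.0, 3.0, 2.0, 4.0, 10.0, 20.0]})

# Split by region and time, average within each group, then flatten the index
df_mean = (df_toy.groupby(['region_id', 'time (s)'])['voxel_value']
                 .mean()
                 .reset_index())
print(df_mean)
```

Each (region, time) pair collapses to a single row holding the mean over that region's grayordinates.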

In [35]:
gb_names = ['region_id', 'region_name', 'time (s)']
df_reg = df.groupby(gb_names)['voxel_value'].mean().reset_index()

In [36]:
# Regions to consider; drop the first (lowest) region ID, the undefined region 0
regs = list(df_reg['region_id'].unique())
regs.pop(0)

# Mean time series for each region, extracted once
traces = {r: df_reg.loc[df_reg['region_id']==r, 'voxel_value'].values 
          for r in regs}

# Compute Pearson correlation for each pair of regions
cols = ['region_1', 'region_2', 'region_1_name', 'region_2_name', 'pearson_r']
rows = []
for r1 in regs:
    for r2 in regs:
        r = st.pearsonr(traces[r1], traces[r2])[0]
        rows.append([r1, r2, s_region_keys[r1], s_region_keys[r2], r])

# Build the DataFrame in one go (DataFrame.append was removed in pandas 2.0)
df_corr = pd.DataFrame(rows, columns=cols)
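
The double loop above computes one correlation per pair of regions. A possible alternative, sketched here on made-up data, is to pivot the region means into a wide table with one column per region and call `DataFrame.corr()`, which computes Pearson correlations by default. The names `df_toy`, `wide`, and `corr` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical region means: region 2 tracks region 1, region 3 mirrors it
df_toy = pd.DataFrame({
    'region_id':   [1] * 4 + [2] * 4 + [3] * 4,
    'time (s)':    [0.0, 1.0, 2.0, 3.0] * 3,
    'voxel_value': [1.0, 2.0, 3.0, 4.0,
                    2.0, 4.0, 6.0, 8.0,
                    4.0, 3.0, 2.0, 1.0]})

# One row per time point, one column per region
wide = df_toy.pivot(index='time (s)', columns='region_id',
                    values='voxel_value')

# Pairwise Pearson correlation between columns
corr = wide.corr()
print(corr)
```

The result is a square matrix rather than a tidy table, so it would need to be reshaped before being passed to a function that expects columns such as `region_1` and `region_2`.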

In [37]:
def rgb_frac_to_hex(rgb_frac):
    """
    Convert a color given as a tuple of RGB fractions (each between
    0 and 1) to a hexadecimal color string.
    """
    return '#{0:02x}{1:02x}{2:02x}'.format(int(rgb_frac[0] * 255), 
                                           int(rgb_frac[1] * 255),
                                           int(rgb_frac[2] * 255))


def data_to_hex_color(data, palette, data_range=[-1, 1]):
    """
    Convert a data point to a color
    """
    if data > data_range[1] or data < data_range[0]:
        raise RuntimeError('data outside of range')
    elif data == data_range[1]:
        return rgb_frac_to_hex(palette[-1])
    
    f = (data - data_range[0]) / (data_range[1] - data_range[0])
    return rgb_frac_to_hex(palette[int(f * len(palette))])
    

def plot_mat(df, i_col, j_col, data_col, n_colors=21, colormap='RdBu_r'):
    """
    Plot matrix.
    """
    # Get colors
    palette = sns.color_palette(colormap, n_colors)
    
    # Compute colors for squares
    df['color'] = df[data_col].apply(data_to_hex_color, args=(palette,))
    
    # Data source
    source = bokeh.plotting.ColumnDataSource(df)
    
    # The 'resize' tool is no longer available in Bokeh, so it is omitted
    tools = 'reset,hover,save,pan,box_zoom,wheel_zoom'

    p = bokeh.plotting.figure(
               x_range=list(df[i_col].unique()),
               y_range=list(reversed(list(df[j_col].unique()))),
               x_axis_location='above', plot_width=1000, plot_height=1000,
               toolbar_location='left', tools=tools)

    p.rect(i_col, j_col, 1, 1, source=source, color='color', line_color=None)

    p.grid.grid_line_color = None
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = '8pt'
    p.axis.major_label_standoff = 0
    p.xaxis.major_label_orientation = np.pi/3

    hover = p.select(dict(type=bokeh.models.HoverTool))
    hover.tooltips = collections.OrderedDict([
    ('i', '  @' + i_col),
    ('j', '  @' + j_col),
    (data_col, '  @' + data_col)])

    return p

In [38]:
p = plot_mat(df_corr, 'region_1_name', 'region_2_name', 'pearson_r', n_colors=200)
bokeh.plotting.output_file('act_mat.html')
bokeh.io.show(p)

In [None]:
df_corr

In [None]:
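# Pivot the tidy correlations into a square matrix for plotting
corr_mat = df_corr.pivot(index='region_1', columns='region_2', values='pearson_r')
plt.matshow(corr_mat.values.astype(float), cmap=plt.cm.jet)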

## What is the amygdala doing?
Let's see which grayordinates comprise the amygdala.  To check, we look for any `region_name` that has `AMYGDALA` in it.
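As a quick illustration of this kind of string matching, the sketch below builds a Boolean mask with `Series.str.contains()` and counts the unique grayordinates it selects. The region names and values in `df_toy` are made up.

```python
import pandas as pd

# Hypothetical tidy data; grayordinate 0 appears at two time points
df_toy = pd.DataFrame({
    'region_name': ['AMYGDALA_LEFT', 'AMYGDALA_RIGHT',
                    'THALAMUS_LEFT', 'AMYGDALA_LEFT'],
    'grayordinate': [0, 1, 2, 0],
    'time (s)': [0.0, 0.0, 0.0, 1.0],
    'voxel_value': [1.0, 2.0, 3.0, 4.0]})

# Boolean mask: True wherever the region name contains 'AMYGDALA'
inds = df_toy['region_name'].str.contains('AMYGDALA')

# Count unique grayordinates among the selected rows
n_amyg = df_toy.loc[inds, 'grayordinate'].nunique()
print(n_amyg)
```

Counting unique grayordinates rather than rows matters in a tidy `DataFrame`, because each grayordinate appears once per time point.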

In [None]:
# Get indices associated with amygdala
inds = df['region_name'].str.contains('AMYGDALA')
inds_right = df['region_id'] == 6
inds_left = df['region_id'] == 5

# Count the number of unique grayordinates in there
n_amyg = df.loc[inds, 'grayordinate'].nunique()

# Slice out the left and right amygdala
amyg_left = df[inds_left]
amyg_right = df[inds_right]

# Compute mean for each time point
mean_amyg_left = amyg_left.groupby('time (s)')['voxel_value'].mean()
mean_amyg_right = amyg_right.groupby('time (s)')['voxel_value'].mean()

In [None]:
# Plot results
plt.plot(t, mean_amyg_left)
plt.plot(t, mean_amyg_right, color=sns.color_palette()[2])
plt.xlabel('time (s)')
plt.ylabel('mean voxel value')

# Interactive conversion is disabled; bokeh.mpl is no longer part of Bokeh
# bokeh.plotting.show(bokeh.mpl.to_bokeh())

In [None]:
df['region_name'].unique()

In [None]:
# Plot results
plt.plot(mean_amyg_left, mean_amyg_right, marker='o', linestyle='None')
plt.axis('equal')
plt.xlabel('mean voxel value, left amygdala')
plt.ylabel('mean voxel value, right amygdala')

# Make interactive with Bokeh
# bokeh.plotting.show(bokeh.mpl.to_bokeh())

So, we have 647 grayordinates comprising the amygdala.  Let's plot the average activity of the amygdala over time.

In [None]:
# Compute mean for each time point
mean_amyg = df[inds].groupby('time (s)')['voxel_value'].mean()

# Plot results
plt.plot(t, mean_amyg)
plt.xlabel('time (s)')
plt.ylabel('mean voxel value')

# Interactive conversion is disabled; bokeh.mpl is no longer part of Bokeh
# bokeh.plotting.show(bokeh.mpl.to_bokeh())

We should do more analysis, but, by eye, there appear to be two different periodicities.
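
One way to follow up on this impression is to look at the power spectrum of the mean trace. The sketch below does this for a synthetic signal with two known periods, not for the real amygdala data, using NumPy's FFT routines. The signal, its periods, and the variable names are all made up for illustration.

```python
import numpy as np

frame_rate = 1.0  # frames per second
t = np.arange(256) / frame_rate

# Synthetic signal with periods of 32 s and 8 s
signal = np.sin(2 * np.pi * t / 32) + 0.5 * np.sin(2 * np.pi * t / 8)

# Power spectrum of the mean-subtracted signal
power = np.abs(np.fft.rfft(signal - signal.mean()))**2
freqs = np.fft.rfftfreq(len(signal), d=1.0 / frame_rate)

# Frequencies of the two strongest peaks, converted to periods
top_two = np.sort(freqs[np.argsort(power)[-2:]])
periods = 1.0 / top_two
print(periods)
```

With real, noisy data the peaks would be broader, and distinguishing genuine periodicity from noise would take more care than this sketch suggests.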