### Problem 2.2 (Exploring fish sleep data, 65 pts) 

In [Tutorial 2](../tutorials/t2a_tidy_data.html), we used a data set dealing with zerbafish sleep to learn about tidy data and split-apply-combine. It was fun to work with the data and to make some plots of fish activity over time. In this problem, you will work with your group to come up with some good ways to parametrize sleep behavior and estimate the values of these parameters.

Choose two different ways to parametrize sleep behavior.  You can use sleep metrics from the [Prober, et al. paper](https://doi.org/10.1523/JNEUROSCI.4332-06.2006) or (for more fun) invent your own.  For each of the ways of parametrizing sleep, provide instructive plots and estimate the values of the parameters. Be sure to discuss the rationale behind choosing your parametrizations.

Note that there is a lot of debate among the community of scientists studying sleep how to best quantify the behavior. This is generally true in studies of behavior, and much of the process of understanding the measurements is deciding on what to use as metrics. This problem obviously has no right answer. What is important is that you can provide a clear rational for your choices.

As you work through this problem, much of what you will do is exploratory data analysis. You will work with data frames to compute the behavioral metrics of interest and make instructive plots. Again, this problem is intentionally open-ended. You are taking a data set and making plots that you might put in a presentation or in a paper to describe the behavior. As you do the analysis, provide text that discusses your choice and what conclusions you can draw from your analyses.

You do not need to do any data validation (we'll get to that next week). You can download and use the resampled data set you generated in Tutorial 2 [here](../data/130315_1A_aanat2_resampled.csv). If you feel that you need to use the original data set, you can get the activity file [here](../data/130315_1A_aanat2.csv) and the genotypes file [here](../data/130315_1A_genotypes.txt).

## Solution

In [7]:
import itertools

# Our numerical workhorses
import numpy as np
import pandas as pd
import scipy.integrate

# Import Altair for high level plotting
import altair as alt
import altair_catplot as altcat

# Import Bokeh modules for interactive plotting
import bokeh.io
import bokeh.plotting

# Set up Bokeh for inline viewing
bokeh.io.output_notebook()

# Pevent bulky altair plots
alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

We will begin by loading in the position data using pandas, and combining the file containing fish species with the position and activity data to synthesize a single dataframe. 

In [11]:
df = pd.read_csv('../data/130315_1A_aanat2.csv', comment='#')

# Load in the genotype file, call it df_gt for genotype DataFrame
df_gt = pd.read_csv('../data/130315_1A_genotypes.txt',
                    delimiter='\t',
                    comment='#',
                    header=[0, 1])

# Reset the columns to be the second level of indexing
df_gt.columns = df_gt.columns.get_level_values(1)

# Rename Columns to be more informative
df_gt.columns = ['wt', 'het', 'mut']

# Assign column names as attributes rather than columns themseves.
df_gt = pd.melt(df_gt, var_name='genotype', value_name='location')

# Take a look at the genotype dataframe
df_gt.head(5)

Unnamed: 0,genotype,location
0,wt,2.0
1,wt,14.0
2,wt,18.0
3,wt,24.0
4,wt,28.0


The overarching goal of our analysis is to determine the differences in sleeping patterns between different fish genotypes. Fish who were not assigned a genotype will thus not be considered, as they will not aid in our comparison. 

In [12]:
# Drop all rows that have a NaN in them. This will eliminate 
# all fish for which the genotype was not determined. 
df_gt = df_gt.dropna()
df_gt = df_gt.reset_index(drop=True)

# Location value is a float, but we would prefer an integer. 
df_gt.loc[:,'location'] = df_gt.loc[:, 'location'].astype(int)

# Combine genotype and activity data using location as a common column.
df = pd.merge(df, df_gt)

# Cast the time field as a datetime object. This will make manipulating
# time values much more convenient. 
df['time'] = pd.to_datetime(df['time'])

# Add a light/dark column that denotes the time of day for each activity level.
df['light'] = (  (df['time'].dt.time >= pd.to_datetime('9:00:00').time())
               & (df['time'].dt.time < pd.to_datetime('23:00:00').time()))

In [25]:
# Let's take a look at the processed data so far. 
df.head(3)

Unnamed: 0,location,activity,time,zeit,zeit_ind,day,genotype,light,InactivityBegun
869,1,0.0,2013-03-16 09:00:09,0.0025,0,5,het,True,4
870,1,0.0,2013-03-16 09:01:09,0.019167,1,5,het,True,0
871,1,0.0,2013-03-16 09:02:09,0.035833,2,5,het,True,0


The experiment did not actually start until zeit = 0, so we would like to eliminate this data from the dataframe. 

In [26]:
df = df[df['zeit'] >= 0]
df.head(3)

Unnamed: 0,location,activity,time,zeit,zeit_ind,day,genotype,light,InactivityBegun
869,1,0.0,2013-03-16 09:00:09,0.0025,0,5,het,True,4
870,1,0.0,2013-03-16 09:01:09,0.019167,1,5,het,True,0
871,1,0.0,2013-03-16 09:02:09,0.035833,2,5,het,True,0


We now want to split these dataframes by fish ID, so that we can plot each fish's activity separately. 

In [27]:
# Separate parent DF into separate dataframes for each fish
frames = []
for i in range(1, 97):
    # Isolate a single fish
    temp_df = df[df['location'] == i]
    
    # ensure there is data for this fish prior to adding it to frames
    if temp_df.shape[0] != 0:
        frames.append(temp_df)
print("We will analyze %i fish!" %len(frames))

We will analyze 73 fish!


In [28]:
# Get the activity level of all the fish together!

counter1 = 0
counter2 = 0
counter3 = 0
counter4 = 0
total_data = []
for i, d in enumerate(frames):
    data = []
    inactive_minutes_count = 0
    counter = 0
    for index, row in d.iterrows():
        if row['activity'] < 1: # our fishy is still inactive
            inactive_minutes_count += 1
            counter1 += 1
            
        else: # we are in a period of activity
            if inactive_minutes_count == 0:
                data.append(0)
                counter2 += 1

            else: # we are coming from a period of inactivity
                data += [inactive_minutes_count] + [0] * (inactive_minutes_count - 1) + [0]
                inactive_minutes_count = 0
                counter3 += 1

        
        if counter == d.shape[0] - 1: # we are at the end and are still inactive, manually add
            #print (i)
            data += [inactive_minutes_count] + [0] * (inactive_minutes_count - 1)
            counter4 += 1
            
        counter += 1
        
    total_data += data[:5363]
            
#     if len(data) % 5363 != 0:
#         data = data[:len(data) - 1]
        
#     print ('total iterations: ' + str(counter))
    
#     print ('length of calculated data array: ' + str(len(data)))
#     df_1 = df[df["location"] == i + 1]
#     print ('length of dataframe: ' + str(df_1.shape[0]))

#     df_1["InactivityBegun" + str(i)] = np.array(data[:df_1.shape[0]])

In [29]:
df["InactivityBegun" ] = np.array(total_data)

ValueError: Length of values does not match length of index

In [None]:
df.head()

In [None]:
print (len(total_data) - 391499)

In [None]:
df_inactive = df[df["InactivityBegun"] > 5]

In [None]:
df_inactive.head()

In [None]:
alt.Chart(df_inactive
    ).mark_tick(
    ).encode(
        x='InactivityBegun:Q',
        y=alt.Y('location:N', title='fish'),
        color=alt.Color('genotype:N', title="genotype")
    )

In [None]:
alt.Chart(df_inactive
    ).mark_tick(
    ).encode(
        x='InactivityBegun:Q',
        y=alt.Y('genotype:N', title='genotype'),
        color=alt.Color('genotype:N', title="genotype")
    )

In [None]:
alt.Chart(df
    ).mark_line(
    ).encode(
        x='time',
        y='InactivityBegun:Q',
        color=alt.Color('light:N', title="light")
    ).properties(title = "")