## 2 Selectivity

### Goals and General Instructions

The goal of this assignment is for you to use Python to calculate a *selectivity index* for some auditory neurons. Together with your class, you'll look at how selectivity changes at different stages of the auditory pathway.

As before, some of the cells are left blank or have placeholder text (in *italics*). *Your assignment is to complete those cells*.

**Very Important**: The server is currently **unable** to save your work. If you need to step away from your computer for any length of time, download the notebook as an 'ipynb' file (see the `File/Download as` menu). On your computer, give the downloaded file a unique name (e.g., with your computing ID). If you need to start up work again, use the upload button on the home screen to upload the file you saved. When you're done, make sure to download a fresh copy of the notebook and submit this to Collab for grading.

In [None]:
# These lines of code import some external libraries we'll use in this exercise. 
# You need to run this cell, or nothing below will work
%matplotlib inline
import numpy as np
import IPython
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['image.origin'] = 'lower'
mpl.rcParams['image.aspect'] = 'auto'
mpl.rcParams['image.cmap'] = 'jet'
mpl.rcParams['figure.figsize'] = (15.0, 6.0)

from tools import signal
from tools import pprox

---
### Background

Let's go through some important concepts. A neuron's **selectivity** is a measure of how sparsely it responds to a broad selection of stimuli. In other words, when tested with a large number of stimuli, a highly selective neuron would respond strongly to only a few stimuli and weakly to all the others. In contrast, a weakly selective neuron responds similarly to many stimuli.

To illustrate, here are two neurons responding to the same song:

In [None]:
# stimulus and unit names
stimname = "B8"
unit1 = "st489_1_5_1"
unit2 = "st489_2_5_3"

# load stimulus and calculate spectrogram
stim, sampling_rate = signal.load_stimulus(stimname)
spec, freqs, bins = signal.specgram(stim, 512, 10, sampling_rate, 1000)

ax1 = plt.subplot(311)
ax1.imshow(spec, extent=(bins[0], bins[-1], freqs[0], freqs[-1]))

# load and plot responses
ax2 = plt.subplot(312)
resp1 = pprox.load(unit1)
pprox.rasterplot(ax2, pprox.select_stimulus(resp1, stimname))

ax3 = plt.subplot(313)
resp2 = pprox.load(unit2)
pprox.rasterplot(ax3, pprox.select_stimulus(resp2, stimname))

for ax in (ax1, ax2, ax3):
    ax.set_xlim(bins[0], bins[-1])

**Question 1**: Based on the definition of selectivity above, which of the two units seems to be more selective? Why? (This is a qualitative question, so any answer is correct as long as you justify it.)

*Double click this text and enter your response*

How could we get at this question in a more quantitative way? There are two key methodological considerations we need to address.

First, what is a "stimulus"? Should we consider the whole 10-second song as the stimulus, or should we break the song up into smaller components? If you look at the spectrogram, you should be able to see that the bird is a bit repetitious. It repeats the same high-frequency (~ 8000 Hz) note three times in the first 2.5 seconds. Note how the structures around those notes are similar too. This pattern is so common in starling song that we call these repeated elements **motifs**. For this exercise, we'll be using motifs to analyze selectivity.

Second, how do we quantify the strength of a neuron's response? A simple metric that will do well for our purposes is the **average rate**, which is defined by $\langle n/t \rangle$, the number of spikes the neuron fires ($n$) divided by the duration of the stimulus ($t$), averaged across trials (indicated by the $\langle \rangle$ brackets).
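
As a rough illustration of this definition, the sketch below computes an average rate from made-up numbers. The spike counts and the 0.8 s duration are hypothetical and are not taken from the recordings used in this notebook.

```python
import numpy as np

# Hypothetical spike counts (n) on four trials of the same stimulus
spike_counts = np.array([3, 5, 4, 4])
# Hypothetical stimulus duration (t), in seconds
duration = 0.8

# Rate on each trial (n / t), then the average across trials
trial_rates = spike_counts / duration
average_rate_hz = trial_rates.mean()
print("Average rate: %.1f Hz" % average_rate_hz)
```

Because the duration is the same on every trial, averaging the per-trial rates gives the same number as dividing the average spike count by the duration.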

After making these methodological decisions, our task becomes fairly straightforward: divide the responses up into the time intervals corresponding to motifs, and calculate the average rate within each time interval. We'll then look at the distribution of response rates across motifs.

---
### Working with Tabular Data

To split up the data according to motifs, you'll need to know where the motifs start and stop. Fortunately, I've already done this for you and put the data in `stimuli/motifs.csv`. You can click [this link](/edit/stimuli/motifs.csv) to view the contents of the file. These data are stored in *comma-separated-value* (CSV) format, which is a simple way of representing tabular data. In a table, each record is stored in a row with a fixed number of fields (or columns). The names of the fields are given by the first row (i.e., the header).

We'll use a Python package called [pandas](https://pandas.pydata.org/) to read in this table. If you need some help using pandas, visit these tutorials from Software Carpentry:

- [Reading Tabular Data](http://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/)
- [Pandas DataFrames](http://swcarpentry.github.io/python-novice-gapminder/08-data-frames/)

Run the following cell to load and inspect `motifs.csv`:

In [None]:
import pandas as pd
motifs = pd.read_csv("stimuli/motifs.csv", index_col="song")
motifs

There are several things to notice here. First, the notebook doesn't display all the rows when the table is this long. Second, there are 5 columns in the data. We've assigned the `song` field to be an index. This makes it easier to select out the parts of the table that we need. In pandas, tables (which are called dataframes) are a lot like numpy arrays in that you can use indexing to select subsets of the table. Here, we want to get a list of the start and stop times for the motifs in song `B8`:

In [None]:
motifs.loc[stimname]

We don't really need the `motif` or `variant` fields, so we'll use slice notation to request just the `start` and `stop` columns. We'll also take this opportunity to convert the values from milliseconds to seconds.
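
If label-based selection is new to you, it may help to try it first on a tiny table. The sketch below builds a made-up table with the same column names as `motifs.csv`; the song names and times are hypothetical, and the column layout is an assumption based on the description above.

```python
import io
import pandas as pd

# A made-up table laid out like motifs.csv (all values are hypothetical)
csv_text = """song,motif,variant,start,stop
A1,a,0,0,500
A1,b,0,500,1200
B2,c,0,100,900
"""
toy = pd.read_csv(io.StringIO(csv_text), index_col="song")

# .loc[row_label, first_column:last_column] selects rows by index label and
# a range of columns by name. Unlike ordinary Python slices, label slices
# include the end label, so 'stop' is part of the result.
toy_A1 = toy.loc["A1", "start":"stop"] / 1000  # milliseconds -> seconds
print(toy_A1)
```

The result keeps both `A1` rows and only the `start` and `stop` columns, with the times expressed in seconds.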

In [None]:
motifs_B8 = motifs.loc[stimname, 'start':'stop'] / 1000
motifs_B8

Now let's annotate the plot with some vertical lines to indicate where each motif starts:

In [None]:
ax1 = plt.subplot(311)
ax1.imshow(spec, extent=(bins[0], bins[-1], freqs[0], freqs[-1]))

# note how we use some nested loops to avoid repeating ourselves:
for i, resp in enumerate((resp1, resp2)):
    ax = plt.subplot(3, 1, 2 + i)
    resp_B8 = pprox.select_stimulus(resp, stimname)
    pprox.rasterplot(ax, resp_B8)
    for t in motifs_B8.loc[:, 'start']:
        ax.vlines(t, 0, len(resp_B8), 'r')
    ax.set_xlim(bins[0], bins[-1])

---
### Splitting up the response

The plot above provides a useful visual summary of the response to the different motifs in the stimulus. You should now have a better sense of which neuron is more selective among motifs.

To quantify this intuition, you'll need to count up the number of spikes the neuron emitted during each of the motifs. This is easier than it sounds. To get started, let's think about how we'd solve the problem for the first motif, the one between 0.039 and 0.846 s. Here's the first trial for unit 1:

In [None]:
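resp_B8 = pprox.select_stimulus(resp1, stimname)  # the plotting loop above left resp_B8 set to unit 2
resp_B8[0]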

We want to count the number of events that occurred after or at 0.039 and before 0.846. An easy way to do this is to take advantage of the fact that numpy can operate on all the elements of an array at once. We're using the `>=` (greater than or equal to) and the `<` (less than) operators. Here's what one of those operators returns when you apply it to the array:

In [None]:
resp_B8[0] >= 0.039

We'll combine the two comparisons using Boolean logic. We're only interested in events that are at or after `0.036` AND before `0.846`. For numpy arrays, the element-wise AND operator is `&`:

In [None]:
between = (resp_B8[0] >= 0.036) & (resp_B8[0] < 0.846)
between

And now we just have to count up the number of `True` values, which is as simple as doing a sum:

In [None]:
print("In trial 0, unit 1 emitted %d spike(s) to the first motif in B8" % sum(between))

Obviously this would be very laborious to do for every motif and every trial. To avoid this tedium (which is also prone to error), we'll **abstract** the logic we used so that it can be applied using any motif boundaries. In computer code, the primary way we do abstraction is through **functions**. Consult the Software Carpentry tutorial on [Writing Functions](http://swcarpentry.github.io/python-novice-gapminder/14-writing-functions/) for more help.
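
Before wrapping the logic in a function, it can help to trace the compare, AND, and sum steps on a tiny array where the answer can be checked by eye. This is only a sketch: the event times and window below are made up and are not from the recordings.

```python
import numpy as np

# Made-up event times in seconds
events = np.array([0.01, 0.10, 0.25, 0.50, 0.90])
start, stop = 0.10, 0.50

after_start = events >= start          # [False, True, True, True, True]
before_stop = events < stop            # [True, True, True, False, False]
in_window = after_start & before_stop  # True only where both are True
n_events = int(in_window.sum())
print(n_events)  # 2: the events at 0.10 and 0.25
```

When the two comparisons are written in a single expression, each one needs its own parentheses, because `&` binds more tightly than `>=` and `<`.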

In [None]:
# it's good to give functions detailed names so that you can remember what they do
def count_events_between(arr, start, stop):
    """ Returns the number of values in arr that are between start and stop """
    # - The string in the line above this one is called a "docstring". 
    #   It's what gets shown when you run help(function_name) or type Shift-Tab
    # - arr, start, and stop are the *formal arguments* of the function. When you call the function,
    #   the values you supply get assigned to these variables.
    # - the 'return' statement defines what the value of the function will be
    return sum((arr >= start) & (arr < stop))

Now we can apply the function to any set of inputs. For example, to find out how many spikes were emitted in trial 2 to the first motif, we enter this expression:

In [None]:
print("In trial 2, unit 1 emitted %d spike(s) to the first motif in B8" % count_events_between(resp_B8[2], 0.036, 0.846))

**Question 2**: Edit the code below to print out the correct number of spikes in the response to the first presentation of the third motif:

In [None]:
print("In trial 0, unit 1 emitted %d spike(s) to the third motif in B8" % count_events_between())

**Question 3**: Using a *for loop*, calculate the AVERAGE number of spikes the neuron emitted to the first motif. I've provided you with some code to get started, so you just have to make it return the correct answer without any errors.

In [None]:
n_trials = len(resp_B8)
spikes = 0
for trial in resp_B8:
    spikes = spikes + ???
print("Unit 1 produced an AVERAGE of %3.2f spikes to motif 1" % ???)

Again, it's going to be rather tedious to do this for every motif and trial. Let's see if we can abstract further by writing another function that calls the `count_events_between` function.

In [None]:
def average_rate(trials, start, stop):
    """ Returns the average rate of events in trials that are between start and stop 
    
    trials - a list of arrays containing event times
    start - the start of the analysis window (inclusive)
    stop  - the end of the analysis window (exclusive)
    """
    # this expression uses "list comprehension" which is a fancy way of doing a for loop
    counts = [count_events_between(trial, start, stop) for trial in trials]
    return sum(counts) / len(counts) / (stop - start)

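Let's check the function against the result you got above. Note that `average_rate` divides the average spike count by the duration of the window (`stop - start`), so it returns a rate in spikes per second rather than a count. Your answer to Question 3, divided by the duration of the motif, should match.

In [None]:
print("Unit 1 fired at an AVERAGE rate of %3.2f spikes/s to motif 1" % average_rate(resp_B8, 0.036, 0.846))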

Now we can loop over all the motifs in the response using our `average_rate` function, then plot the averages.

In [None]:
# this is an empty list; we'll append the average for each motif to it
averages = []
# this loop iterates over the rows in the table
for song, bounds in motifs_B8.iterrows():
    # index by column name; positional access like bounds[0] is deprecated in recent pandas
    averages.append(average_rate(resp_B8, bounds['start'], bounds['stop']))
    
mpl.rcParams['figure.figsize'] = (6.0, 4.0)    
plt.plot(sorted(averages))

This plot is a form of **cumulative distribution** plot: the x-axis is the rank of each motif after sorting, and the y-axis is its average rate. It shows you that the majority of the motifs elicit fairly weak responses (< 10 Hz) while only a few produce strong responses.
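
To see how sorting relates to a cumulative distribution, here is a small sketch with made-up rates (the eight values are hypothetical). Dividing each motif's rank by the number of motifs gives the fraction of motifs with a rate at or below that value.

```python
import numpy as np

# Hypothetical average rates (Hz) for eight motifs
toy_rates = np.array([2.0, 30.0, 1.0, 4.0, 0.5, 12.0, 3.0, 6.0])

sorted_rates = np.sort(toy_rates)
# Fraction of motifs with a rate at or below each sorted value
fraction = np.arange(1, len(sorted_rates) + 1) / len(sorted_rates)

for rate, frac in zip(sorted_rates, fraction):
    print("%5.1f Hz  ->  %.3f of motifs at or below" % (rate, frac))

# Share of motifs that elicit less than 10 Hz
frac_below_10 = np.mean(toy_rates < 10)
print("Fraction below 10 Hz: %.2f" % frac_below_10)
```

In this toy example six of the eight motifs fall below 10 Hz, which is the kind of pattern a sorted-rate plot makes visible at a glance.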

Let's write one more function to make our lives easier, then apply it to the other unit:

In [None]:
def average_rates(trials, motifs):
    """ From a list of trials, returns an array of average rates of events for each motif """
    # we're using the list comprehension syntax again here
    avgs = [average_rate(trials, bounds['start'], bounds['stop']) for song, bounds in motifs.iterrows()]
    # convert to a numpy array on return
    return np.asarray(avgs)

In [None]:
u1_avgs = average_rates(pprox.select_stimulus(resp1, stimname), motifs_B8)
u2_avgs = average_rates(pprox.select_stimulus(resp2, stimname), motifs_B8)

**Question 4**: Normalize the distributions calculated in the cell above so that the numbers are relative to the maximum average rate. (Hint: look at `np.max`, and remember you can apply an operation to all of the elements of an array). Plot the normalized cumulative distributions for both units on the same axis.

In [None]:
## write code here that generates the plot without raising any errors

**Question 5**: In the cell below, write a function called `motif_boundaries` that will extract rows from the `motifs` table that correspond to a specific song. Your function needs to work such that the assertion at the end of the cell does NOT generate an error. For full credit, give your function a docstring.

In [None]:
# write your function here

assert (motif_boundaries(motifs, stimname).loc[:,'start'] == motifs_B8.loc[:,'start']).all()

---
### Measuring selectivity

The final step is to come up with a statistic that captures the difference between selective and unselective response distributions.

**Question 6**: What would the cumulative response distribution look like for a neuron that was perfectly selective? What would it look like for a neuron that was completely unselective (i.e., responded equally to all stimuli)?

*Double click this text and enter your response*

There are actually a number of different metrics that can be used to measure selectivity. One that has stood the test of time, in spite of having a few problems, is the **activity fraction**, which is defined as:

\begin{align}
A & = \frac{1 - \left(\sum_i r_i / N\right)^2 / \left(\sum_i r_i^2 / N\right)}{1 - 1 / N}
\end{align}

where $r_i$ is the average response rate to stimulus $i$ and $N$ is the number of stimuli. Here's a python function that will calculate this from an array of average response rates:

In [None]:
def activity_fraction(rates):
    """Returns the activity fraction statistic for an array of rates """
    N = len(rates)
    return (1 - np.sum(rates / N)**2 / np.sum(rates**2 / N)) / (1 - 1/N)

In [None]:
print("For unit 1, A = %3.3f" % activity_fraction(u1_avgs))
print("For unit 2, A = %3.3f" % activity_fraction(u2_avgs))

Does this result agree with your sense of which neuron is more selective?
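
One way to build intuition for the statistic is to apply it to made-up rate arrays at the two extremes. The sketch below repeats the function so that it runs on its own; all of the rates are hypothetical.

```python
import numpy as np

def activity_fraction(rates):
    """Same formula as above, repeated so this sketch is self-contained"""
    N = len(rates)
    return (1 - np.sum(rates / N)**2 / np.sum(rates**2 / N)) / (1 - 1/N)

# Hypothetical rates (Hz) for four stimuli
one_hot = np.array([0.0, 0.0, 0.0, 10.0])  # responds to a single stimulus
uniform = np.array([5.0, 5.0, 5.0, 5.0])   # responds equally to all stimuli
graded = np.array([1.0, 2.0, 3.0, 10.0])   # somewhere in between

for name, rates in [("one_hot", one_hot), ("uniform", uniform), ("graded", graded)]:
    print("%-8s A = %.3f" % (name, activity_fraction(rates)))
```

The array with a single non-zero rate gives A = 1, the uniform array gives A = 0, and the graded array falls in between. Note that the formula is undefined (division by zero) if every rate is zero.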

---
### Putting it together

As in our previous class, the final section of this notebook is a larger exercise that requires you to bring together what you've learned. You'll complete this in groups of 2-3.

Your assignment is to calculate activity fraction for a large sample of neurons in different regions of the songbird brain. You'll look at four different regions: L2b, L1, L3, and CLM.

Here are a few bits of code and data to get you started.

First, a table of all the units recorded in the database, organized by region.

In [None]:
units = pd.read_table("data/spikes/starling_units.tbl", index_col="area")

Next, note that each unit was presented with multiple songs. Here's an illustration of how to apply the `average_rates` function to ALL of the motifs presented to our friend `unit1`. It relies on the `motif_boundaries` function you wrote above, so make sure that's working!

In [None]:
songs = pprox.get_stimuli(resp1)
avgs = []
for song in songs:
    avgs.append(average_rates(pprox.select_stimulus(resp1, song), motif_boundaries(motifs, song)))
avgs = np.concatenate(avgs)

You might consider sticking that code in a function for ready re-use.

So to summarize your task: for each neuron in the table `units`, load the response and calculate the activity fraction. Then calculate the mean and standard error for the activity fraction in each of the four areas. Do any of the areas seem more or less selective than the others?
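
For the final summary step, here is a minimal sketch of computing a mean and standard error per group using only the standard library. The area labels are borrowed from the list above, but the activity fraction values are invented for illustration and are NOT real results; the variable names are also just placeholders.

```python
import math
import statistics

# Hypothetical activity fractions tagged by area (made-up numbers)
toy_results = [
    ("L2b", 0.20), ("L2b", 0.30), ("L2b", 0.25),
    ("CLM", 0.60), ("CLM", 0.70), ("CLM", 0.65),
]

# Group the values by area
by_area = {}
for area, value in toy_results:
    by_area.setdefault(area, []).append(value)

# Mean and standard error (sample standard deviation / sqrt(n)) for each area
summary = {}
for area, values in by_area.items():
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    summary[area] = (mean, sem)
    print("%s: mean = %.3f, SEM = %.3f (n = %d)" % (area, mean, sem, len(values)))
```

The standard error shrinks as the number of units in an area grows, so keep the sample size in mind when comparing areas.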

In [None]:
# your code here - good luck!