# 3A: Data and Models

## Data as random variables

Most statistics operates under the assumption that any observed data are actual samples drawn from some (to be learned) model of the world.

In the standard approach, these models are simply probability distributions, with parameters that govern the behavior of the model (i.e., what we can expect it to produce).

## Uncertainty

If the observed data samples are simply random draws from a probability distribution, then the level of uncertainty will decrease as we gain more data samples.
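
To make this concrete, here is a minimal sketch (not part of the original exercise) that draws repeated datasets from a normal distribution and measures how much the estimated mean varies from dataset to dataset. The "true" parameters, the seed, and the helper name `spread_of_mean` are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the numbers are reproducible
true_mean, true_std = 2.0, 1.5   # hypothetical "true" model parameters

def spread_of_mean(n, n_repeats=2000):
    """Standard deviation of the sample mean across repeated datasets of size n."""
    samples = rng.normal(true_mean, true_std, size=(n_repeats, n))
    return samples.mean(axis=1).std()

sizes = [10, 100, 1000]
spreads = [spread_of_mean(n) for n in sizes]
for n, s in zip(sizes, spreads):
    print(f"n = {n:4d}: spread of the estimated mean = {s:.3f} "
          f"(theory: {true_std / np.sqrt(n):.3f})")
```

The spread should shrink roughly as $1/\sqrt{n}$, which is one way to quantify how uncertainty falls as samples accumulate.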

Statistical inference involves figuring out what model (e.g., probability distribution, but we will be building more complicated models later in the course) and parameters generated the data. 

Let's spend some time trying to perform this task by hand...

In [None]:
# load matplotlib inline mode
%matplotlib inline

# import some useful libraries
import numpy as np                # numerical analysis linear algebra
import pandas as pd               # efficient tables
import matplotlib.pyplot as plt   # plotting
import ipywidgets as widgets      # interactive widgets
from ipywidgets.widgets.interaction import show_inline_matplotlib_plots
from IPython.display import display, clear_output
import pickle

# import the distributions wrapped from scipy
from tools import dists

In [None]:
# load in the data
with open('data/random_data.pickle', 'rb') as f:
    data = pickle.load(f)

# tell us something about the data
for i,d in enumerate(data):
    print('Dataset %d has %d samples' % (i, len(d)))

In [None]:
# mapper between dist name and object
dist_dict = {'Beta': dists.beta,
             'Exponential': dists.exp,
             'Gamma': dists.gamma,
             'Normal': dists.normal,
             'Uniform': dists.uniform}

# Dropdown widget to pick datasets
ds_ind = widgets.Dropdown(options=range(len(data)),
                          description="Dataset")

# Checkbox for whether to show log likelihood
like_check = widgets.Checkbox(description='Show Log Likelihood', 
                              value=False)

# Checkbox for whether to show stem plot
stem_check = widgets.Checkbox(description='Show Stem Plot', 
                              value=False)

# set up the distributions tab
dist_tab = widgets.Tab(allow_none=False)
children = []
titles = []

# beta params
beta_alpha = widgets.FloatText(value=.5,
                               description='alpha')
beta_beta = widgets.FloatText(value=.5,
                              description='beta')
children.append(widgets.HBox([beta_alpha, beta_beta]))
titles.append('Beta')

# exp params
exp_lam = widgets.FloatText(value=5.0,
                            description='lam')
children.append(widgets.HBox([exp_lam]))
titles.append('Exponential')

# gamma params
gamma_alpha = widgets.FloatText(value=.5,
                                description='alpha')
gamma_beta = widgets.FloatText(value=.5,
                              description='beta')
children.append(widgets.HBox([gamma_alpha, gamma_beta]))
titles.append('Gamma')

# normal params
normal_mean = widgets.FloatText(value=0.0,
                                description='mean')
normal_std = widgets.FloatText(value=1.0,
                               description='std')
children.append(widgets.HBox([normal_mean, normal_std]))
titles.append('Normal')

# uniform params
uniform_lower = widgets.FloatText(value=0.0,
                                  description='lower')
uniform_upper = widgets.FloatText(value=1.0,
                                  description='upper')
children.append(widgets.HBox([uniform_lower, uniform_upper]))
titles.append('Uniform')

# add all the children and set the tab titles
dist_tab.children = children
for i in range(len(titles)):
    dist_tab.set_title(i, titles[i])

# set the full user interface
ui = widgets.VBox([widgets.HBox([ds_ind, like_check, stem_check]), dist_tab])

# define plotting function
def plot_data_and_dist(*vals, **kwargs):
    # first plot the data
    dat = data[ds_ind.value]
    plt.hist(dat, bins='auto', density=True, alpha=.5);
    
    # now plot the pdf of the dist
    npoints = 100
    
    # add support for 10% of the data range on either side
    support = (dat.min() - np.ptp(dat)*.1,
               dat.max() + np.ptp(dat)*.1)
    x = np.linspace(support[0], support[1], npoints)
    
    # get the selected dist and params
    params = {c.description: c.value for c in 
              dist_tab.children[dist_tab.selected_index].children}
    dist = dist_dict[dist_tab.get_title(dist_tab.selected_index)](**params)
    
    # calculate the pdf
    pdf = dist.pdf(x)
            
    # plot the pdf and add labels
    plt.plot(x, pdf, lw=3)
    plt.xlabel('Value')
    plt.ylabel('Probability')
    if like_check.value:
        # calculate the log like
        log_like = np.log(dist.pdf(dat)).sum()

        # add it to the plot with some formatting
        plt.title('Log Like: {:3.4f}'.format(log_like))
    
    if stem_check.value:
        # include the stem plot
        plt.stem(dat, dist.pdf(dat), 'g')

        
# set up triggers for updating the plot
out = widgets.interactive_output(plot_data_and_dist, 
                                 {'ds_ind': ds_ind,
                                  'like_check': like_check,
                                  'stem_check': stem_check,
                                  'beta_alpha': beta_alpha,
                                  'beta_beta': beta_beta,
                                  'exp_lam': exp_lam,
                                  'gamma_alpha': gamma_alpha,
                                  'gamma_beta': gamma_beta,
                                  'normal_mean': normal_mean,
                                  'normal_std': normal_std,
                                  'uniform_lower': uniform_lower,
                                  'uniform_upper': uniform_upper,
                                 })

# wrapper for tab change
# required b/c selecting tabs can't trigger plots like other widgets
def tab_change(*args, **kwargs):
    with out:
        clear_output(wait=True)
        plot_data_and_dist()
        show_inline_matplotlib_plots()
dist_tab.observe(tab_change, 'selected_index')

## Interactive data fitting (part 2)

The widget below (h/t Per Sederberg) shows the histogram of a randomly generated distribution in blue, overlaid with the PDF from one of five different distribution types in orange. Select one of the datasets and try to figure out what distribution it was generated from. You'll likely need to adjust the parameters of the distribution.

Refer back to Notebook 2 and/or Wikipedia if you want to remind yourself of what the different distributions look like and how they change shape with their parameters.

In [None]:
# show everything
display(ui, out)

**Q**: What dataset did you pick? What distribution and parameters best fit the observations by eye?

## Likelihood calculation

We don't have to do this by eye. Because our model has a probability density function (PDF), we can calculate the likelihood of observing the data given the model and parameters.

For any given model and parameters, you can determine the probability density of any individual data point by evaluating the PDF at the value of that data point. (*Turn on the Stem Plot to see this.*)


## Maximum Likelihood Estimation

The goal then becomes to maximize the likelihood of observing the data given a model and parameters:

$$P(D \mid \theta, M)$$

As long as the data points are all independent, the likelihood of observing all of them is the product of all the probabilities.

$$P(D \mid \theta, M) = \prod_i p(d_i \mid \theta, M)$$

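Maximizing the log of $P(D \mid \theta, M)$ is more numerically stable and computationally tractable, because the product of many small probabilities quickly underflows to zero, so we typically convert the likelihood into a sum of log likelihoods. Because the logarithm is monotonic, the same parameters maximize both.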

$$ \log P(D \mid \theta, M) = \sum_i \log p(d_i \mid \theta, M)$$
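
The following sketch illustrates why the sum of logs is preferred in practice. It uses synthetic data and a hand-written normal density (both assumptions for illustration, not the `tools.dists` wrappers used elsewhere in this notebook) and compares the raw product of densities with the sum of log densities.

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.normal(0.0, 1.0, size=2000)   # synthetic data, for illustration only

def normal_pdf(x, mean, std):
    """Density of a normal distribution, written out to keep the example self-contained."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

likelihood = np.prod(normal_pdf(d, 0.0, 1.0))             # product of densities
log_likelihood = np.sum(np.log(normal_pdf(d, 0.0, 1.0)))  # sum of log densities

print("product of densities:", likelihood)
print("sum of log densities:", log_likelihood)
```

With a few thousand points the product is too small to represent as a floating-point number, while the log likelihood remains an ordinary finite value that an optimizer can work with.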

**Q** Turn on the Log Likelihood checkbox above and see if you can do better. Note that you'll get a numerical error if any of the data lie outside the support of the distribution, but this will go away once you fix that issue. In the cell below, enter your best estimates of the parameters.

## Automated optimization

Many approaches have been developed for searching parameter spaces to find the parameters that generate the maximum or minimum value of a function.

### Nelder--Mead Simplex

One very popular algorithm is the Nelder--Mead simplex.

It involves growing and shrinking a simplex (a generalization of a triangle to multiple dimensions) to search the parameter space efficiently to minimize a function.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Nelder-Mead_Himmelblau.gif/640px-Nelder-Mead_Himmelblau.gif)
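
As a small, self-contained sketch of the algorithm in use, the code below runs SciPy's Nelder-Mead implementation on Himmelblau's function, the test surface shown in the animation. The starting point and the variable name `res_demo` are arbitrary choices for illustration.

```python
import numpy as np
import scipy.optimize as opt

def himmelblau(p):
    """Himmelblau's function: a standard 2D test surface with four minima, all equal to 0."""
    x, y = p
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

# start from an arbitrary point and let the simplex search for a minimum
res_demo = opt.minimize(himmelblau, x0=[1.0, 1.0], method='Nelder-Mead')
print("converged:", res_demo.success)
print("minimum found at:", res_demo.x)
print("function value there:", res_demo.fun)
```

Because the surface has several minima, which one the search ends up in depends on where it starts.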

In [None]:
# scipy includes lots of optimization methods
import scipy.optimize as opt

In [None]:
# Define a likelihood function
def like_fun(params, *args):
    # pull the model and dat out of the args
    model = args[0]
    dat = args[1]
    
    # instantiate the model with the params
    dist = model(*params)
    
    # calc the log like
    log_like = np.log(dist.pdf(dat)).sum()
    if np.isnan(log_like):
        log_like = -np.inf
    
    # return the negative of it to minimize
    return -log_like

## Calculate the best-fitting params

Execute the following code block to run Nelder-Mead on your dataset, using your chosen model from the widget above. Hopefully, in some small amount of time, the algorithm will converge on a solution, and you'll get a printout. See if you can figure out what the best parameter estimates are from the output.

Notice that each time you run the code below, a new starting value for the optimization is generated. Try running the optimization a few times and see if you get similar answers. Why is it important to do this?
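
One way to make the "run it a few times" advice systematic is a multi-start loop that keeps the best result. The sketch below is an illustration only: it fits a hand-written normal model to synthetic data rather than using the widget's datasets, and the number of restarts and the starting ranges are arbitrary assumptions.

```python
import numpy as np
import scipy.optimize as opt

rng = np.random.default_rng(2)
toy = rng.normal(1.0, 2.0, size=200)    # synthetic data, for illustration only

def neg_log_like(params, dat):
    """Negative log likelihood of a normal model; an invalid std gives +inf."""
    mean, std = params
    if std <= 0:
        return np.inf
    z = (dat - mean) / std
    return np.sum(0.5 * z**2 + np.log(std) + 0.5 * np.log(2 * np.pi))

# run the optimizer from several random starting points
fits = []
for _ in range(5):
    start = [rng.uniform(-10, 10), rng.uniform(0.1, 10)]
    fits.append(opt.minimize(neg_log_like, start, args=(toy,), method='Nelder-Mead'))

# keep the fit with the lowest negative log likelihood
best = min(fits, key=lambda r: r.fun)
for r in fits:
    print("params:", np.round(r.x, 3), " negative log likelihood:", round(r.fun, 3))
print("best estimate (mean, std):", best.x)
```

If the restarts agree, that is some reassurance the optimizer found a global rather than a local optimum; if they disagree, the lowest negative log likelihood is the one to keep.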

If you're not sure if you've chosen the right PDF, go back to the widget, pick a different one, and rerun the cell below.

In [None]:
# grab the distribution and data from the UI above
dist_name = dist_tab.get_title(dist_tab.selected_index)
model = dist_dict[dist_name]
dat = data[ds_ind.value]

# set the bounds for the distribution
bound_dict = {'Beta': [(0, 10), (0, 10)],
              'Exponential': [(0, 20)],
              'Gamma': [(0, 10), (0, 10)],
              'Normal': [(-10, 10), (0, 10)],
              'Uniform': [(-10, 10), (-10, 10)]}
bounds = bound_dict[dist_name]

# generate a random starting point based on the bounds
# NB: it's possible to generate invalid starting points
x0 = [dists.uniform(*b).rvs() for b in bounds]

# print some information about the distribution and starting values
print('Dataset:', ds_ind.value)
print('Distribution:', dist_name)
print('Starting value:', x0)
print()

# run the optimizer
# NOTE: not all methods make use of the bounds argument
res = opt.minimize(like_fun, x0, args=(model, dat), 
                   #bounds=bounds,
                   #method='L-BFGS-B',
                   #method='BFGS',
                   method='Nelder-Mead',
                   #method='TNC'
                  )
print(res)

## Visualize the fit

Let's visualize the likelihood as a 2D function of the parameters and see how our fit looks. If you're far away from where you think you should be, try picking some different values in the widget and/or re-running the optimization.

Note that this will only work for PDFs that have two parameters.

In [None]:
# set the grid of points to evaluate
x = np.linspace(bounds[0][0], bounds[0][1], 100)
y = np.linspace(bounds[1][0], bounds[1][1], 100)
xx, yy = np.meshgrid(x, y, sparse=True)

# evaluate the likelihood
z = np.log(model(xx, yy).pdf(dat[:, np.newaxis, np.newaxis])).sum(0)
z = np.exp(z)

# plot the contour and the best-fit value
plt.contourf(x, y, z, 100)
plt.plot(res.x[0], res.x[1], 'x', markersize=12, color='red')
plt.colorbar()
#plt.xlabel()
#plt.ylabel()

**Q**: Is your estimate at the peak of the likelihood function? Are some parameters constrained more or less than others?

## Assessing model fit

An important question you need to address when using a statistical model is how well your model fits the data.

Your fit can be bad for any number of reasons, but one rather common one is that your model is wrong.

If your model is really wrong, then the parameter estimates are not worth much.

However, because all models are wrong (see Box's Dictum), you can only compare models to see which are less wrong.

How do we do this?

We could simply compare the maximum likelihoods between the models, but that doesn't take into account the complexity of each model and amount of data. 

One more principled approach is the Bayesian Information Criterion (BIC):

$$BIC = \ln(n)k - 2\ln(\hat{L}),$$

where $n$ is the number of data points, $k$ is the number of parameters, and $\hat{L}$ is the maximum likelihood value of the model $M$.

***SMALLER BIC values are better!!!***
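
To see how the complexity penalty plays out, here is a short worked sketch of the formula. The helper name `bic` and the log-likelihood values are made up for illustration; they do not come from any dataset in this notebook.

```python
import numpy as np

def bic(log_like, n, k):
    """BIC = ln(n) * k - 2 * ln(L-hat), with ln(L-hat) passed in as log_like."""
    return np.log(n) * k - 2.0 * log_like

# two hypothetical fits to the same 50 data points
bic_simple = bic(log_like=-70.0, n=50, k=1)    # fewer parameters, slightly worse fit
bic_complex = bic(log_like=-69.5, n=50, k=2)   # extra parameter buys little likelihood

print("BIC, 1-parameter model:", round(bic_simple, 2))
print("BIC, 2-parameter model:", round(bic_complex, 2))
```

In this made-up case the more complex model has the higher likelihood but also the higher (worse) BIC, because the small gain in fit does not pay for the extra parameter.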

In [None]:
# Calculate the model fit with BIC
# recall that the optimization returns the negative log likelihood
n = len(dat)
k = len(res.x)
L = -res.fun
bic = np.log(n)*k - 2*(L)
print('BIC:', bic)

## Model Comparison

But how do we decide if one model is better than the other? We can compare BIC values between models, turning them into a Bayes Factor!

$$BF_{01} = \exp\left((BIC_1 - BIC_0)/2\right)$$

This is interpreted with the help of the following guidelines:

| Bayes Factor | Evidence |
|--------------|----------|
| 1--3         | Weak     |
| 3--20        | Positive |
| 20--150      | Strong   |
| >150         | Very Strong | 
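
The formula and the table can be combined in a few lines. This is only a sketch: the function names are hypothetical, the BIC values are made up, and the handling of values exactly on a boundary (or below 1) is an assumption, since the table does not specify it.

```python
import numpy as np

def bayes_factor_01(bic_0, bic_1):
    """Approximate BF_01 = exp((BIC_1 - BIC_0) / 2): evidence for model 0 over model 1."""
    return np.exp((bic_1 - bic_0) / 2.0)

def evidence_label(bf):
    """Map a Bayes Factor onto the categories in the table above."""
    if bf < 1:
        return "Favors the other model"
    if bf < 3:
        return "Weak"
    if bf < 20:
        return "Positive"
    if bf < 150:
        return "Strong"
    return "Very Strong"

bf_example = bayes_factor_01(bic_0=100.0, bic_1=106.0)   # made-up BIC values
print(bf_example, evidence_label(bf_example))
```

A BIC difference of 6 corresponds to a Bayes Factor of $e^3 \approx 20.1$, which just crosses into the "Strong" category.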


**Q** Copy your first BIC value into the cell below, assigning it to `bic_0`. Now try fitting your data to another distribution function using the widget above. Assign the second BIC value to `bic_1` below and then execute the cell to calculate the Bayes Factor. Then, in the text cell below, describe which two PDFs you considered, the maximum likelihood *parameter estimates*, and which model fit the best.

In [None]:
# calc Bayes Factor
# (enter numbers from the BIC assessments above)
bic_0 = -33.93
bic_1 = -31.07
np.exp((bic_1 - bic_0)/2.)

## t-test example

We now have all the tools necessary to perform statistical inference (though we will improve on all these approaches in the coming weeks). You can:

- Use optimization techniques to identify the parameters that give rise to the maximum likelihood of observing the data given the model
- Assess model fit
- Compare models to guide model selection

Let's try a simple example of performing a t-test via model comparison approaches!

You're hopefully quite familiar with how you can use a t-test to determine if the mean of a sample is significantly different from zero, or to determine if two samples have different means. But what's going on under the hood?

Let's consider the one-sample t-test. You run a t-test and get a $p$ value, which tells you the probability of getting a test statistic at least as extreme as the one you observed if the data were from a normal distribution with mean $\mu = 0$ (the null hypothesis).

The model comparison standpoint is similar, but what you're testing is whether there's more evidence for a model where $\mu \neq 0$ than for one in which $\mu = 0$. We can make this comparison using Bayes Factors.

In [None]:
# generate some data that may or may not be significantly different from zero.
# you can play around with the sample size, mean, and std deviation later
A = dists.normal(.3, .5).rvs(10)

# plot it
plt.hist(A, bins='auto', density=True);

## Standard t-test

First we'll perform a standard one-sample t-test on our data.

In [None]:
# do a one-sample t-test
import scipy.stats as stats

stats.ttest_1samp(A, 0.0)

## Fit a Student's t model

Next we fit the full model to the data.

In [None]:
# Define a likelihood function
def students_like(params, *args):
    # pull the model and dat out of the args
    dat = args[0]
    df = len(dat) - 1
    
    # instantiate the model with the params, 
    # the df is determined from the data
    dist = dists.students_t(params[0], params[1], df)
    
    # calc the log like
    log_like = np.log(dist.pdf(dat)).sum()
    if np.isnan(log_like):
        log_like = -np.inf
    
    # return the negative of it to minimize
    return -log_like

# pick a central starting point
x0 = [0.0, 1.0]

# run the optimization
res = opt.minimize(students_like, x0, args=(A,), 
                   #bounds=bounds,
                   #method='L-BFGS-B',
                   #method='BFGS',
                   method='Nelder-Mead',
                   #method='TNC'
                  )
print(res)

# calculate the BIC for this model and save it
n = len(A)
k = len(res.x)
L = -res.fun
bic_1 = np.log(n)*k - 2*(L)
print('BIC:', bic_1)

## Fit a null hypothesis model

Now we fit a model representing the null hypothesis that the mean of the data is actually 0.0. 

Note how we simply fix the mean of the Student's t distribution to zero, but still fit the standard deviation.

In [None]:
# Define a likelihood function
def students_null_like(params, *args):
    # pull the model and dat out of the args
    dat = args[0]
    df = len(dat) - 1
    
    # instantiate the model with the params
    # mean is fixed at zero and the df is determined from the data
    dist = dists.students_t(0.0, params[0], df)
    
    # calc the log like
    log_like = np.log(dist.pdf(dat)).sum()
    if np.isnan(log_like):
        log_like = -np.inf
    
    # return the negative of it to minimize
    return -log_like

# start at same point (though mean is fixed at zero)
x0 = [1.0]

# run the optimization
res = opt.minimize(students_null_like, x0, args=(A,), 
                   #bounds=bounds,
                   #method='L-BFGS-B',
                   #method='BFGS',
                   method='Nelder-Mead',
                   #method='TNC'
                  )
print(res)

# calculate and print the BIC
n = len(A)
k = len(res.x)
L = -res.fun
bic_0 = np.log(n)*k - 2*(L)
print('BIC:', bic_0)

## Model Comparison

Now that we have the BIC values for each model, we can use the Bayes Factor to determine whether the full model is preferred to the null model. 

We want a big number here. If it's less than 1.0 then there is no evidence that the alternative/full model should be preferred to the null model (i.e., the mean of the distribution is not different from 0.0).

In [None]:
# calculate Bayes Factor
bf = np.exp((bic_0 - bic_1)/2.)
print('Bayes Factor:', bf)

**Q**: In the cell that generates the observations, try three different values for the mean, standard deviation OR sample size. Then run the model-fitting code for each case. In the cell below, compare and contrast how changing these values affects the Bayes Factor and the p-value in the standard t-test.

# 3B: Data Input/Output

Computational neuroscience often involves a lot of data. In this lesson, you'll learn about:

- different kinds of data, how they are stored on disk
- how to read data from the disk into your program
- how to store data from your program to the disk

We'll consider three kinds of data:

- [time series](#Time-series-data)
- [point processes](#Point-processes)
- [structured records](#Structured-records)

# Time series data

Typically represented as an **array** of measurements: 

$$\mathbf{x} = \{x_0, x_1, \ldots, x_N\}$$

Can be **multichannel** if more than one measurement is taken at a time. Each time point is now a **vector** ($\vec{x}$):

$$\mathbf{X} = \{\vec{x}_0, \vec{x}_1, \ldots, \vec{x}_N\}$$

Multichannel time series are represented as two-dimensional arrays. One dimension corresponds to time and the other to the component of the measurement vector.

Note that the "channels" can be repeated **trials** rather than simultaneous measurements.

## Time series in Python

We use `numpy` arrays to store single- and multichannel time series in Python. Let's look at an example using some Gaussian white noise.

Gaussian noise is drawn from a normal distribution, and it's called white noise because it has equal power at all frequencies.

In [None]:
np.random.seed(1)                  # set random seed so we all get the same results
x = np.random.randn(1000)          # generate 1000 random WN samples

plt.plot(x)

You can determine the number of elements in a 1D array using the `len()` function:

In [None]:
print("The number of points in x is:", len(x))

**Q**: Do you recall from the last exercise how to access subsets of a numpy array? In the code cell below, write an expression to evaluate the mean of the first 100 samples of `x`:

Numpy (and Python in general) supports **negative indexing**, which means that negative indices are interpreted as referencing elements from the **end** of the array. The following expression gives us the mean of the last 100 elements.

In [None]:
np.mean(x[-100:1000])
# you can leave out the second index in the slice if it refers to the end of the array, so this is equivalent:
np.mean(x[-100:])

## Multichannel time series

For multichannel data, the array has two dimensions. There is a (weak) convention that the first dimension of the array represents time. That means each column represents a separate channel.

Here is an example of a 3-channel array, again using Gaussian white noise. I've added some correlations between the channels to make things interesting.

In [None]:
y = np.random.multivariate_normal(mean=[0, 0, 0], cov=[[1.0, 0.2, 0.0],[0.2, 1.0, 0.1], [0.0, 0.1, 1.0]], size=1000)
plt.plot(y);

Notice how the plot now has three different-colored traces? Matplotlib assumes that time is the first dimension when you give it an array to plot.

The size along each dimension of the array is called its **shape**. You can get the shape (and therefore the number of dimensions of an array) using the `.shape` attribute:

In [None]:
print("The shape of y is:", y.shape)
# note that len returns the number of elements along the first dimension
print("The number of time points in y is:", len(y))
print("The total size of y is:", y.size)

For multichannel data, we need two indices or slices to access values in the array.

In [None]:
# the first time point in the first channel. Note the comma.
y[0, 0]

In [None]:
# use `:` to indicate all the values along one dimension. This gives all the values for the first channel
plt.plot(y[:, 0])

In [None]:
# to get all the channels at a time point:
print("y_0 =", y[0, :])
# you can leave out the trailing indices
print("y_0 =", y[0])

**Q**: Calculate and plot the mean of all three channels

In [None]:
## Enter code here

## Input/Output 

Presumably you'd like to look at more than just randomly generated noise. So how do you get data from a recording into your program?

Usually, data are stored long-term on your computer's drive or in the cloud. There are advantages to both approaches, which we'll discuss later. For now, we're going to retrieve some files that I prepared for you to your local machine. Once we do this, we'll see how to load the data from these files.

Executing the following cell will run a shell command to retrieve some data files.

In [None]:
!curl -s 'https://gracula.psyc.virginia.edu/public/courseware/comp_neurosci_data_022719.tgz' | tar zxv

### I/O for Time Series

Unfortunately, there is no agreed-upon standard for storing time series data, so you'll have to do some sleuthing.

There are three major kinds of storage formats: text, binary, and custom.

### Text

One way of storing numbers is how you would write them (i.e., as **text**). 

When reading a text file, the main thing you need to know is how the elements are separated. 

For single-channel data, usually each number goes on its own line. 

For multi-channel data, there will be multiple numbers per line, typically separated by white space (tabs and/or spaces) or by commas.

When you're storing data in text format, you also need to be mindful of the precision. 
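
The effect of precision is easy to demonstrate with a round trip through text. The sketch below writes a few made-up values with two different format strings and reads them back; it uses an in-memory buffer in place of a real file, and the helper name `text_round_trip` is hypothetical.

```python
import io
import numpy as np

values = np.array([0.1234567890123, 1.0 / 3.0, 2.0e-5])   # made-up measurements

def text_round_trip(arr, fmt):
    """Write arr as text with the given format, then read it back."""
    buffer = io.StringIO()          # in-memory stand-in for a file on disk
    np.savetxt(buffer, arr, fmt=fmt)
    buffer.seek(0)
    return np.loadtxt(buffer)

coarse = text_round_trip(values, "%.3f")    # three decimal places
full = text_round_trip(values, "%.18e")     # the default format of np.savetxt

print("largest error with %.3f :", np.abs(coarse - values).max())
print("largest error with %.18e:", np.abs(full - values).max())
```

With three decimal places the smallest value is rounded all the way to zero, whereas the longer format recovers every value but costs many more characters per number.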

#### Reading text files

One advantage of storing data as text is that it's human-readable. However, this isn't as much of an advantage as you might think. 

To see an example of an extracellular recording in text format, switch to the main Jupyter tab in your browser, navigate to the `data/io-examples` folder, and click on one of the files that ends in `.txt`.

Numpy can easily load single- and multi-channel data from text files using the `loadtxt` command:

In [None]:
d = np.loadtxt("data/io-examples/st11_1_2_A8.txt")
plt.plot(d)

### Binary

Storing numbers as text is very inefficient. Let's see why.

In [None]:
## text is read into Python first as a string:
s = open("data/io-examples/st11_1_2_A8.txt", "r").readline().strip()
print("the number as text:", s)
print("size of the text (in bytes):", len(s))

## to use it as a number, python has to parse the text
f = float(s)
print("the number as a float:", f)
print("size of a float (in bytes):", d.dtype.itemsize)

#### Reading binary data

Not only does text-formatted data take up a lot more space, it also requires additional work for Python to translate into a numerical representation that it can do math on (i.e., floats and ints).

This inefficiency becomes a consideration for large datasets. Thus, we often want to store the data on disk in a binary format, i.e., the same format as it would be in memory.

A very powerful method for reading and writing binary data is to use a **memory map**. This essentially takes the contents of a binary file and treats it as an array.

In [None]:
import os   # standard library module used to construct paths

d = np.memmap(os.path.join("data", "io-examples", "st11_1_2_A8.dat"), mode="r", dtype='d')
plt.plot(d)
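
To see what a raw binary file does and does not contain, here is a self-contained sketch that writes an array to a temporary file and maps it back. The file name is hypothetical and the temporary directory is deleted at the end; the point is that the reader must already know the data type, because the file itself stores only bytes.

```python
import os
import tempfile
import numpy as np

samples = np.arange(10, dtype='d') / 10.0     # ten double-precision values

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dat")   # hypothetical file name
    samples.tofile(path)                      # raw bytes, no header or metadata

    print("file size in bytes:", os.path.getsize(path))   # 10 values * 8 bytes each

    mapped = np.memmap(path, mode="r", dtype='d')
    restored = np.array(mapped)               # copy into ordinary memory
    # reading with the wrong dtype "works" but yields 20 meaningless values
    wrong = np.array(np.memmap(path, mode="r", dtype='f'))
    del mapped                                # release the map before the file is deleted

print("restored correctly:", np.array_equal(restored, samples))
print("number of values with the wrong dtype:", len(wrong))
```

Nothing in the file flags the wrong data type as an error, which is one reason raw binary files need accompanying metadata.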

## Structured formats

Text and raw binary formats both have shortcomings and tradeoffs.

A shortcoming they both have in common is that it can be difficult to store metadata.

Without metadata, it may be hard to know how to interpret the contents of the file.

Some critical metadata we need for time series include:

- sampling rate
- dimensions of the array
- ordering of the array (i.e., time first or last) and what's in each channel
- measurement units

**Q**: What other metadata do you think are important for time series data? Write a couple of ideas in the cell below.

Structured data formats can be text- or binary-based.

Some formats are used widely and are well-documented, like [Javascript Object Notation](http://json.org) or [HDF5](https://support.hdfgroup.org/HDF5/). These formats are likely to have well-supported Python packages for I/O.
    
Other formats are more obscure or proprietary, like Axon Binary Format (ABF) or [Elan](http://elan.lyon.inserm.fr/). It may be difficult to find Python support to read these files, though the situation is improving thanks to projects like [Neo IO](https://neo.readthedocs.io/).
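
As a rough illustration of a text-based structured format, the sketch below stores the kinds of metadata listed above as JSON using Python's standard `json` module. The field names and values are made up for illustration and do not follow any particular standard.

```python
import json

# hypothetical metadata for a single-channel recording
metadata = {
    "sampling_rate_hz": 20000,
    "shape": [1000, 1],
    "dimension_order": ["time", "channel"],
    "channels": ["extracellular voltage"],
    "units": "mV",
}

text = json.dumps(metadata, indent=2)   # what would be written to a sidecar file
print(text)

restored_meta = json.loads(text)        # what a reader would get back
duration_s = restored_meta["shape"][0] / restored_meta["sampling_rate_hz"]
print("recording duration (s):", duration_s)
```

With the sampling rate and shape stored alongside the samples, a reader can recover quantities such as the recording duration without guessing.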

# Point processes

Typically represented as an ordered sequence of times in some interval from 0 to $T$:

$$\{0 \leq t_0 < t_1 < \ldots < t_N \leq T\}$$

In contrast to time series, there is not a fixed relationship between the number of events and the duration of the analysis interval.

## Point process data in Python

Point processes are also typically stored in `numpy` arrays, but the elements of the array are event times, not measurements.

Because point processes vary in the number of events, multi-channel point-processes are represented by **lists of arrays**, not by 2D arrays.
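
A minimal sketch with made-up event times shows why a list of arrays is the natural container: each trial has a different number of events, so the trials cannot be stacked into a rectangular 2D array.

```python
import numpy as np

# made-up event times (in seconds) for three trials
toy_trials = [
    np.array([0.12, 0.48, 0.91]),
    np.array([0.30]),
    np.array([0.05, 0.22, 0.41, 0.77, 0.98]),
]

counts = [len(t) for t in toy_trials]        # number of events differs per trial
pooled = np.sort(np.concatenate(toy_trials)) # all events pooled across trials
print("events per trial:", counts)
print("pooled event times:", pooled)
```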

Let's look at some examples:

In [None]:
from tools import pprox
resp = pprox.load("data", "starling", "pprox", "st11_1_2_1")
resp_A8 = pprox.select_stimulus(resp, "A8")

The variable `resp_A8` points to a Python **list**. Lists are like arrays, but they can store heterogeneous data types. The syntax for accessing elements and slices is the same.

**Q:** Using what you know from previous exercises, complete the following code cell to print out some information about the data: 

In [None]:
print("The number of trials is:", ???)
print("The number of events in trial 0 is:", ???)
print("The time of the first event in trial 2 is:", ???)

## I/O for Point Processes

Just as there is no agreed-upon standard for storing time series data, there is also no standard format for point-process data.

Because point process data tend to be smaller than time series, text formats are more common than binary.

A very simple text format is to put each trial (or channel) on a separate line and separate the events on each line with a space. Take a look at `data/io-examples/st11_1_2_1_A8.txt` for an example.

The [PySpike](http://mariomulansky.github.io/PySpike/) library has a function for loading data from such files, but we're going to write our own so that we can learn a bit about basic I/O in Python and looping.

In [None]:
# create a list where we will store our trials
trials = []
# open the file for reading; the with statement closes it when the block ends
with open(os.path.join("data", "io-examples", "st11_1_2_1_A8.txt"), mode="r") as fp:
    # loop through the lines of the file with a for statement
    for line in fp:
        # read the line into an array
        arr = np.fromstring(line, sep=" ")
        # append the array to our list
        trials.append(arr)

If you learned how to program in Java or C or another low-level programming language, take a moment to appreciate how simple this task is in Python.
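
Going the other way is just as short. The sketch below writes made-up event times in the one-trial-per-line format and reads them back; it uses an in-memory buffer in place of a file, and it parses each line with `str.split` rather than `np.fromstring`, which is a substitution made here so that empty trials are handled cleanly.

```python
import io
import numpy as np

# made-up event times; the second trial has no events
to_save = [np.array([0.12, 0.48, 0.91]), np.array([]), np.array([0.3, 0.75])]

buffer = io.StringIO()                       # in-memory stand-in for a file on disk
for arr in to_save:
    # one trial per line, events separated by single spaces
    buffer.write(" ".join(repr(float(t)) for t in arr) + "\n")

buffer.seek(0)
# split each line on whitespace and convert the pieces to floats
read_back = [np.array(line.split(), dtype=float) for line in buffer]

for original, restored in zip(to_save, read_back):
    print(original, "->", restored)
```

Writing each value with `repr` keeps full floating-point precision, so the round trip is exact.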

### Checking our work

**Q:** The `trials` list we created in the last code cell should be the same as `resp_A8`. In the cell below, complete **three assert** statements to check that this is true. I've provided you with one to get started. If you complete your task correctly, the cell will not emit any errors.

In [None]:
assert len(trials) == len(resp_A8), "The number of trials is not the same"
assert True == False, "The total number of events is not the same"

### Point process metadata

As with time series data, it's important to keep track of metadata. Here are some important metadata that need to be associated with point process files:

- type of event (e.g., spike, behavioral action, stimulus start/stop)
- number of channels
- unit scaling (e.g., milliseconds or seconds?)
- start time
- other experimental variables

# Structured records

In both point process and time series data, the elements of the arrays have been **homogeneous** (i.e., all the same type). What if that's not the case?

The third (and final) kind of data we'll consider today consists of **records**. Each record in turn comprises **fields**, which may have different types.

This kind of data is also called **tabular data**. If you're coming from the R world, you might think of this kind of data as a `data.frame`.

It's common to encounter structured records when you have independent observations; for example, from different neurons or animals or populations. The fields in each record might include:

- a unique identifier for the observation
- group identifiers (e.g., cell, animal, population)
- independent variables (e.g., sex, treatment, age)
- dependent variable(s)

## Structured records in Python

Python used to lag pretty badly behind R for handling this kind of data, but we now have [pandas](http://pandas.pydata.org/), which is beginning to approach `numpy` in popularity and maturity.

As with numpy, there is a convention for importing pandas: 

```python
import pandas as pd
```

For a detailed introduction to pandas, take a look at [Chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) of the Python Data Science Handbook.

There are two main concepts to understand in using pandas: `Series` and `DataFrames`.

A `Series` is essentially a column of a table. Like a numpy array, all the elements of a series are the same type. Unlike a numpy array, the indices of a `Series` do not have to be sequential integers, but can be any label you like.

For example, here's a `Series` that might represent the ages of several subjects:

In [None]:
import pandas as pd
ages = pd.Series([391, 442, 183], index=['st11', 'st22', 'st231'])
ages

You can access the elements of a `Series` using the standard Python bracket syntax:

In [None]:
ages['st11']

A `DataFrame` is a collection of `Series`, i.e. a table of columns. Here's how we might represent the ages and sexes of a set of subjects:

In [None]:
sex = pd.Series(['M', 'F', 'M'], index=['st11', 'st22', 'st231'])
subjects = pd.DataFrame({'age': ages, 'sex': sex})
subjects

Notice how we now have a table, which Jupyter renders nicely for us with the row and column indices indicated in bold.

The bracket syntax for `DataFrames` accesses **columns**. It's important to remember that this is different from numpy arrays, where a single index gives you a row.

In [None]:
subjects['age']

To index by row and column, you have to use the `loc` attribute:

In [None]:
subjects.loc['st11', 'age']

Equivalently, you can use `iloc` and the numerical indices:

In [None]:
subjects.iloc[0, 0]

## I/O for Structured Records

Structured record data is usually stored on disk in text-based formats. This is because human readability is often quite important. There are two very common formats:

- In _comma-separated-value_ files, each record is on a separate line, and fields are separated by commas.
- In _whitespace-delimited-value_ files, each record is on a separate line, and fields are separated by white space (tabs or spaces)

In both kinds of files, it's common that the first line of the file is a header giving the name for each column.

Take a look at `data/starling/stimuli/motifs.csv` for an example of a comma-delimited file.

One really good reason to use pandas is that it provides some nice I/O functions for these kinds of files. It's trivial to load tabular data into Python:

In [None]:
motifs = pd.read_csv(os.path.join("data", "starling", "stimuli", "motifs.csv"))
motifs

You can tell pandas that certain columns should be used as indices.

This allows you to select a subset of rows using the `loc` syntax. For example, to see all the rows where `song` is equal to `A8`:

In [None]:
motifs = motifs.set_index(['song'])
motifs.loc['A8']
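
The same function can also read whitespace-delimited files and set the index while loading. The sketch below is an illustration using made-up records held in an in-memory string rather than a file from the course data.

```python
import io
import pandas as pd

# made-up records; the first line is the header
text = """subject age sex
st11    391 M
st22    442 F
st231   183 M
"""

# sep=r"\s+" treats any run of spaces or tabs as a field separator;
# index_col names the column to use as the row index
table = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col="subject")
print(table)
print("age of st22:", table.loc["st22", "age"])
```

Setting the index at load time is equivalent to calling `set_index` afterwards.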

### Exercise

Write code in the cell below to compute the following:

- the number of different songs
- the number of motifs for each song
- the average motif duration in each song