## Week 8 (Optional): Data Structures

This exercise builds on the lesson on Data I/O to introduce some ideas about how to organize your data as you read it into Python.

There are whole courses on data representation, but we will touch on only a few important topics:

- using Python **lists** and **dictionaries** and pandas **data frames** to represent hierarchical data
- using the principles of **tidy data** to standardize and simplify tabular data


### Readings

For more information:
 
- Wickham H. (2014) Tidy Data. J Stat Soft [doi:10.18637/jss.v059.i10](https://doi.org/10.18637/jss.v059.i10). This article is very R-centric, but the discussion of tidy data principles is generally applicable.

In [None]:
# setting up the notebook
%matplotlib inline

# import some useful libraries
import sys
import pprint
import numpy as np                # numerical analysis linear algebra
import pandas as pd
import matplotlib.pyplot as plt   # plotting
sys.path.insert(0,"/project/psyc5270-cdm8j/comp-neurosci")
from comp_neurosci_uva import data

## Review of Python data types

We've encountered a number of different Python data types so far. We've seen **atomic** data types like `int` and `float`, which are just single values, and we've seen **aggregate** data types like `list`, `tuple`, `str`, and NumPy arrays (`np.ndarray`, created with `np.array`), which can contain more than one value.

What all of the aggregate types we've covered so far have in common is that they are indexed by number. We can access individual elements by numerical index (e.g. `my_array[0]`) and subsets of the sequence by slicing (e.g. `my_array[2:20]`). We saw how arrays can have more than one dimension, which then means we can access rows, columns, and various other kinds of subsets based on a combination of indices and slices (e.g. `col = my_array[:,0]`).

We also saw that some kinds of aggregates support **nesting**. For example, point process data are usually represented by *lists of lists* or *lists of arrays*. Accessing elements of these data structures also requires multiple indices, but with some differences in syntax (e.g. `first_spike = trials[0][0]`).
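As a quick refresher, here is a minimal sketch of the three kinds of access described above. The values and the names `values`, `my_array`, and `trials` are made up for illustration.

```python
import numpy as np

# a flat list is indexed by position and sliced with start:stop
values = [10, 20, 30, 40, 50]
first = values[0]
middle = values[1:4]

# a 2-D array takes one index or slice per dimension
my_array = np.arange(12).reshape(3, 4)
col = my_array[:, 0]      # first column
row = my_array[1]         # second row

# nested lists need one pair of brackets per level (made-up spike times)
trials = [[12.5, 40.1, 77.3], [5.2, 61.0]]
first_spike = trials[0][0]

print(first, middle, col, row, first_spike)
```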

## Python dictionaries

We're now going to consider the case when the elements in an aggregate are not (necessarily) in a particular order, but where instead they are defined by **labels**. In other words, instead of accessing elements by index (`data[0]`), we want to be able to do something like `data["label"]`.

The standard data type in Python for this kind of aggregate (which is called various things in different languages) is the `dict` (**dictionary**). As with a physical dictionary, each entry in a `dict` has a label (**key**) and a definition (**value**).

There are two ways of creating dictionaries in Python:

In [None]:
# dictionary literal form
d1 = {"a": 1, "b": "something else", 4: "cow"}
d1

In [None]:
# functional form: keyword arguments become string keys, so keys must be valid identifiers
dict(a=1, b="something else")

The labels of a `dict` are called **keys**. In Python, keys can be strings, numbers, or any other *hashable* object, which in practice means most *immutable* types (such as tuples). The values in a `dict` can be anything.
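To make the restriction on keys concrete, here is a small sketch. The (stimulus, trial) keys and the counts are hypothetical. It also shows that `dict()` accepts a sequence of key/value pairs, which is one way to get non-string keys out of the functional form.

```python
# tuples are hashable, so they can serve as keys (hypothetical example data)
counts = {("A8", 0): 10, ("A8", 1): 22}
counts[("B8", 0)] = 2

# lists are mutable and therefore unhashable: using one as a key fails
try:
    counts[["B8", 1]] = 5
    list_key_ok = True
except TypeError:
    list_key_ok = False

# dict() also accepts a sequence of (key, value) pairs, which allows non-string keys
d2 = dict([("a", 1), (4, "cow")])
print(counts, list_key_ok, d2)
```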

You access elements of a `dict` by key using the familiar bracket syntax:

In [None]:
print("a ->", d1["a"])
print("b ->", d1["b"])

Trying to access a non-existent key raises a `KeyError`:

In [None]:
d1["no"]
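If a key might be missing, there are a few standard ways to avoid the error. The following is a sketch using a stand-in dict `d` rather than `d1`, so that it runs on its own.

```python
d = {"a": 1, "b": "something else"}   # stand-in for d1 above

# test for membership before indexing
has_no = "no" in d

# get() returns a default instead of raising KeyError
missing = d.get("no")            # None when no default is given
fallback = d.get("no", 0)

# or catch the exception explicitly
try:
    value = d["no"]
except KeyError:
    value = "not found"
print(has_no, missing, fallback, value)
```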

You can add new key/value pairs to a dict by simple assignment:

In [None]:
d1["new"] = 25
print(d1)

Keys are unique, so assigning a value to a key that already exists will replace the old value.

In [None]:
d1["new"] = 30
print(d1)

You can remove key/value pairs with the `pop` method, which also returns the removed value:

In [None]:
print("4 ->", d1.pop(4))
print(d1)

## Nested data structures

Because Python `dict` and `list` can contain other `dict` and `list` objects, it's easy to create complex hierarchical data structures.

For example, here's how we might represent some spike time data for a single neuron presented with two different stimuli:

In [None]:
spikes = {'A8': [np.random.uniform(0, 1000, 10), np.random.uniform(0, 1000, 15)],
          'B8': [np.random.uniform(0, 1000, 22), np.random.uniform(0, 1000, 17)]}
cell = {'cell': 'st231_11', 'date': '2019-01-22', 'spikes': spikes}
pprint.pprint(cell)

## Traversing nested data structures

We can access elements of a nested data structure with the bracket operators. Each level needs its own set of brackets.

In [None]:
cell['spikes']['A8'][0]

We can also iterate through the structure using for loops. Iterating through list-type aggregates and dicts is slightly different, though.

In [None]:
# iterating through a list/tuple/array yields the items
# example: print # of spikes in each trial
for trial in cell['spikes']['A8']:
    print(len(trial))

In [None]:
# when iterating through a dict, we usually want to know the keys and values. Use the `items` method to get both.
for stim, trials in cell['spikes'].items():
    print("%s: %d trials" % (stim, len(trials)))

In [None]:
# if you need to know the index while iterating through a sequence, use the `enumerate` function
for i, trial in enumerate(cell['spikes']['A8']):
    print("trial %d: %d spikes" % (i, len(trial)))

### Nested loops

If you need to traverse at multiple levels, you have to use **nested** loops. Notice how the outer `for` block contains an inner `for` block. The interpreter will loop through the arrays in the trial lists for each stimulus.

In [None]:
for stim, trials in cell['spikes'].items():
    print("%s:" % stim)
    for i, trial in enumerate(trials):
        print(" trial %d: %d spikes" % (i, len(trial)))

### List or dict?

It's possible to represent the same data in a variety of ways. Take the following examples:

In [None]:
prices     = {'a bear': 100, 'a dog': 20}
itemdict   = {'a bear': {'price': 100}, 'a dog': {'price': 20}}
itemlist   = [{'name': 'a bear', 'price': 100}, {'name': 'a dog', 'price': 20}]

Why choose one form over another? There are a few reasons you might want to go with the second or third option: ordering, clarity, and extensibility.

**Order**: In Python 3.7 and later, dicts preserve the order in which keys were inserted, but older versions made no guarantee, and a dict cannot be indexed by position or sorted in place. If order matters, use a list.
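A brief sketch of what a list offers when order matters, using copies of the `prices` and `itemlist` structures defined above:

```python
prices = {"a bear": 100, "a dog": 20}
itemlist = [{"name": "a bear", "price": 100}, {"name": "a dog", "price": 20}]

# a list can be indexed by position and sorted in place
itemlist.sort(key=lambda item: item["price"])
cheapest = itemlist[0]["name"]

# a dict has no positional index; sorting its items yields a new list of pairs
by_price = sorted(prices.items(), key=lambda kv: kv[1])
print(cheapest, by_price)
```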

**Clarity**: In `prices`, it's not necessarily clear what the keys or the values in the dict mean, aside from the name of the variable. In `itemdict` and `itemlist`, these meanings are *explicit* rather than *implicit*. Using a more explicit representation can help your data structures be clearer to other users and programmers.

**Extensibility**: What do you do with `prices` if you want to add more information about your items? You really only have the option of creating another object or changing how the data are represented. With `itemdict` and `itemlist`, because each element is a dict, you can easily add new key/value pairs as your understanding of the problem changes. This can be important if you want older code to continue to work on the new data structures. This is also called *backwards compatibility*.

Take-home: think about how you want to represent your data in your program in terms of your current and future needs. It often makes sense to choose more complex representations to preserve flexibility down the line. And the difference in your code may not be that much. Looping over these three structures is practically identical:

In [None]:
for name, price in prices.items():
    print(name, "costs", price)
for name, item in itemdict.items():
    print(name, "costs", item["price"])
for item in itemlist:
    print(item["name"], "costs", item["price"])
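The extensibility point can be illustrated with a short sketch. The function `total_price` and the `in_stock` field are hypothetical, introduced only to show that code written against the old records keeps working after a new key is added.

```python
itemlist = [{"name": "a bear", "price": 100}, {"name": "a dog", "price": 20}]

def total_price(items):
    """Older code: only knows about the 'price' field."""
    return sum(item["price"] for item in items)

before = total_price(itemlist)

# extend each record with a new (hypothetical) field
for item in itemlist:
    item["in_stock"] = True

after = total_price(itemlist)   # the older function is unaffected by the new key
print(before, after)
```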

## Nested data structures: a real example

Neuroscience experiments almost always have a hierarchical structure. 

Switch to the main tab for your Jupyter notebook and look in the `data/starling-example` folder. You should see two subdirectories. These are the names of two units from different animals. The animal name is the first part of the directory name, i.e., `st11` and `st49`. Within each subdirectory, there are six files, each holding the spike times in response to one of six different stimuli.

How do we load the data? And how should we represent them in Python?

Here's some code to traverse the data. We're now dealing with the filesystem, which has a hierarchical organization. The code demonstrates the use of the function `glob`, which gives us a list of files using **wildcards**, and the string method `split`, which divides a string up into parts.

In [None]:
import os
import glob

def load_spikes(fname):
    """Load spikes from a file in flat ascii format"""
    with open(fname, "rt") as fp:
        return [np.fromstring(line, sep=" ") for line in fp]

# The '*' character matches any name, so this `glob` call returns a list
# of all the entries in the data directory (here, one subdirectory per neuron)
data_path = os.path.join(data.data_path, "starling", "spikes", "*")
for dirname in glob.glob(data_path):
    neuron = os.path.basename(dirname)
    print("animal:", neuron.split("_")[0])
    print(" neuron:", neuron)
    for respfile in glob.glob(os.path.join(dirname, "*")):
        stim = os.path.splitext(os.path.basename(respfile))[0]
        print("  stim:", stim)
        for i, trial in enumerate(load_spikes(respfile)):
            print("   trial", i, ":", len(trial), "spikes")
        

The data are clearly nested, like so:

- animal
  - neuron
     - stimulus
        - trial

However, in deciding how to store the data we have some choices to make, because not everything needs to have its own level. 

As an example, the following structures hold the same information, but are not totally equivalent:

In [None]:
nested = {"animal_id": {"neuron_id": {"stimulus_id": {"trial_1": [1,2], "trial_2": [3,4]}}}}
flat = [{"animal": "animal_id", "neuron": "neuron_id", "stimulus": "stimulus_id", "trial": 1, "data": [1,2]},
        {"animal": "animal_id", "neuron": "neuron_id", "stimulus": "stimulus_id", "trial": 2, "data": [3,4]}]
pprint.pprint(nested)
pprint.pprint(flat)

In the *flattened* form, the levels are indicated by key/value pairs. In the *nested* form, the levels are indicated by hierarchical nesting. There may be many possible combinations of flattening and nesting.
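Converting between the two forms is mostly a matter of nested loops. The sketch below flattens the `nested` example into records like those in `flat`. It assumes the trial keys always look like `trial_<number>`, and the field name `values` is arbitrary.

```python
nested = {"animal_id": {"neuron_id": {"stimulus_id": {"trial_1": [1, 2], "trial_2": [3, 4]}}}}

records = []
for animal, neurons in nested.items():
    for neuron, stimuli in neurons.items():
        for stimulus, trials in stimuli.items():
            for trial_name, values in trials.items():
                # "trial_1" -> 1; assumes keys of the form "trial_<number>"
                trial = int(trial_name.split("_")[1])
                records.append({"animal": animal, "neuron": neuron,
                                "stimulus": stimulus, "trial": trial, "data": values})
print(records)
```

Each level of nesting becomes one loop, and each loop variable becomes a field in the flattened record.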

To decide what's best, start by thinking about what the *natural unit of analysis* is. In other words, what always goes together, and what can be set aside as an incidental property (for now)?

Clearly, individual trials don't have much meaning on their own. They're repetitions that allow us to better estimate the distribution of a neuron's responses to a given stimulus. Similarly, individual stimuli represent a sample of the universe of possible stimuli. 

In contrast, if we're interested in analyzing how different neurons respond to various stimuli, we may not care right now about which animal the neuron came from. It might therefore make the most sense to have `animal` be a property rather than a separate level.

Note that this doesn't prevent you from doing hierarchical models at a later stage; it just means that you're committing to doing the first step of your analysis with neurons as individual units. Also, there's not one right answer to this question.

Let's load the data into a nested dictionary, with `neuron` as the natural unit of analysis:

In [None]:
spike_data = {}
for dirname in glob.glob(data_path):
    neuron = os.path.basename(dirname)
    animal = neuron.split("_")[0]
    stims = []
    for respfile in glob.glob(os.path.join(dirname, "*")):
        stim = os.path.splitext(os.path.basename(respfile))[0]
        trials = load_spikes(respfile)
        stims.append({"stimulus": stim, "response": trials})
    ndata = {"animal": animal, "stimuli": stims}
    spike_data[neuron] = ndata

### Exercise

Plot the responses of both neurons to all 6 stimuli. The plot should have 2 columns and 6 rows. Use code from previous assignments to generate the rasters.

## Hierarchical data in pandas

Hierarchical data can also be represented in tables. 

Recall that a pandas `Series` is a lot like a dictionary in that the elements are indexed by labels. However, the values in a series are usually scalars rather than arbitrary objects.

In [None]:
ages = pd.Series([391, 442, 183], index=['st11', 'st22', 'st231'])
ages

In fact, you can create Series objects from dicts:

In [None]:
pd.Series({'st11': 391, 'st22': 442, 'st231': 183})

A pandas `DataFrame` is also like a dictionary, but each value is itself a labeled collection of values. This is equivalent to a dict of dicts, and this is one way you can create a DataFrame. Note that pandas treats the outer keys as columns, so the table may need to be transposed (with `.T`) for the outer nesting level to correspond to rows.

In [None]:
pd.DataFrame({'st11': {'age': 391, 'sex': 'M'}, 'st22': {'age': 442, 'sex': 'F'}}).T
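As an alternative to transposing, `pd.DataFrame.from_dict` with `orient="index"` builds the table with the outer keys as row labels. A minimal sketch, where the variable name `animals` is just for illustration:

```python
import pandas as pd

animals = {'st11': {'age': 391, 'sex': 'M'}, 'st22': {'age': 442, 'sex': 'F'}}

# orient="index" treats the outer keys as row labels
by_row = pd.DataFrame.from_dict(animals, orient="index")
print(by_row)
```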

### Hierarchical indices

Data can be nested in a pandas table through the use of multiple indices. Here's an example of how a Series of spike counts might look for trials nested under stimuli:

In [None]:
counts = pd.Series([10, 22, 2, 5], index=[['A8', 'A8', 'B8', 'B8'], [0, 1, 0, 1]])
counts

Single-index `Series` and `DataFrame` objects are accessed with a single index; multi-index objects require more than one index:

In [None]:
counts["A8", 0]

This is conceptually quite similar to how you access data in nested dicts and lists, with some small differences in syntax:

In [None]:
nested["animal_id"]["neuron_id"]["stimulus_id"]["trial_1"][0]

However, because the data are more structured, multi-indexed pandas objects give you the ability to select rows based on any of the indices. For example, this pulls out all the rows where the second index level (the trial) is 0.

In [None]:
counts[:, 0]

There is a LOT more you can do with indices, and it's worth giving some careful study to the section on [Hierarchical Indexing](https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html) in the Python Data Science Handbook.
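One thing that may make multi-index code easier to read is giving the index levels names, so that selections refer to a level by name rather than by position. The following sketch rebuilds the `counts` series from above; the level names `stimulus` and `trial` are our own choice.

```python
import pandas as pd

counts = pd.Series([10, 22, 2, 5], index=[['A8', 'A8', 'B8', 'B8'], [0, 1, 0, 1]])
counts.index.names = ["stimulus", "trial"]

# cross-section: all rows where the "trial" level equals 0
first_trials = counts.xs(0, level="trial")

# selecting on the outer level alone returns the trials for that stimulus
a8 = counts["A8"]
print(first_trials)
print(a8)
```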

## Demo

Let's illustrate how we might traverse our spike data and calculate spike counts. We'll use standard Python types first and then convert to a pandas DataFrame:

In [None]:
spike_counts = []
for neuron, ndata in spike_data.items():
    for stimresp in ndata['stimuli']:
        for i, trial in enumerate(stimresp['response']):
            d = {
                "neuron": neuron, 
                "animal": ndata["animal"], 
                "stimulus": stimresp["stimulus"], 
                "trial": i, 
                "count": len(trial)
                }
            spike_counts.append(d)
spike_counts

Notice how we're flattening the data structure by storing information about each level in individual records. This makes it easier to convert to a pandas DataFrame:

In [None]:
count_df = pd.DataFrame(spike_counts)
count_df

The final step is to convert some of the columns to indices.

In [None]:
count_df_idx = count_df.set_index(['animal', 'neuron', 'stimulus', 'trial'])
count_df_idx

This is what enables us to flexibly select subsets of rows:

In [None]:
count_df_idx.loc["st11",:,"C8"]
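Once the levels are in the index, summaries across trials take one line. The sketch below uses a small made-up table with the same layout as `count_df_idx` (the neuron names and counts are hypothetical) so that it runs without the course data.

```python
import pandas as pd

# made-up rows: (animal, neuron, stimulus, trial, count)
rows = [
    ("st11", "st11_1", "A8", 0, 10), ("st11", "st11_1", "A8", 1, 14),
    ("st11", "st11_1", "B8", 0, 2),  ("st11", "st11_1", "B8", 1, 6),
    ("st49", "st49_1", "A8", 0, 20), ("st49", "st49_1", "A8", 1, 22),
    ("st49", "st49_1", "B8", 0, 7),  ("st49", "st49_1", "B8", 1, 9),
]
demo = pd.DataFrame(rows, columns=["animal", "neuron", "stimulus", "trial", "count"])
demo = demo.set_index(["animal", "neuron", "stimulus", "trial"])

# average over trials within each neuron/stimulus combination
mean_counts = demo.groupby(level=["neuron", "stimulus"])["count"].mean()

# cross-section on two index levels at once, selected by name
st11_a8 = demo.xs(("st11", "A8"), level=["animal", "stimulus"])
print(mean_counts)
print(st11_a8)
```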

## Tidy data

What's the best way to organize your tables? Although there may be many good answers, a guiding set of principles that will generally make your life easier falls under the rubric of **tidy data**. As summarized in the Wickham article, tidy data obeys the following principles:

1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

These principles are fairly easily [applied](https://tomaugspurger.github.io/modern-5-tidy.html) to tables but might require some more creative thinking when dealing with non-tabular data. In general, I try to avoid a lot of nesting, and try to get data into tabular forms as soon as possible.
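As one illustration of the first two principles, here is a sketch that reshapes a "wide" table, where the trial number is hidden in the column headers, into tidy form with `melt`. The table and its column names are made up for this example.

```python
import pandas as pd

# "wide" layout: one column per trial, so the trial variable lives in the header
wide = pd.DataFrame({
    "stimulus": ["A8", "B8"],
    "trial_0": [10, 2],
    "trial_1": [22, 5],
})

# melt into tidy form: one row per (stimulus, trial) observation
tidy = wide.melt(id_vars="stimulus", var_name="trial", value_name="count")
tidy["trial"] = tidy["trial"].str.replace("trial_", "").astype(int)
tidy = tidy.sort_values(["stimulus", "trial"]).reset_index(drop=True)
print(tidy)
```

In the tidy version, `stimulus`, `trial`, and `count` are each a column, and each row is a single observation.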