# Python for (open) Neuroscience

_Lecture 1.3_ - More on `pandas`

Luigi Petrucco

Jean-Charles Mariani

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vigji/python-cimec/blob/main/lectures/Lecture1.3_More-pandas.ipynb)

## Announcements

- Next week we'll be setting up local Python installations, tutorial soon!
- There will be a second assignment, but not a third one - start thinking about a project though!
- Related: still looking for datasets!
- Questionnaire soon

### More `pandas`

In [3]:
import pandas as pd
import numpy as np

## Organize data in a dataframe

In [4]:
# Imagine we have 4 experimental subjects; to each one we show a stimulus 3 times; over each repetition 
# we measure 2 variables.
n_subjects = 4
n_repetitions = 3

# We could represent the data for each stimulus as a dictionary, 
# and the data for each subject as a list of dictionaries:
subject_data = [dict(var_1=np.random.rand(), var_2=np.random.rand()) for _ in range(n_repetitions)]
subject_data

[{'var_1': 0.009854337187272244, 'var_2': 0.012662097112300041},
 {'var_1': 0.865812355124264, 'var_2': 0.701025601200336},
 {'var_1': 0.8042320033587066, 'var_2': 0.860744421367863}]

In [5]:
# And the data for all subjects as a dictionary of lists of dictionaries:
all_subjects_data = dict()

for i in range(n_subjects):
    all_subjects_data[f"subj_{i}"] = \
        [dict(var_1=np.random.rand(), var_2=np.random.rand()) for _ in range(n_repetitions)]
all_subjects_data

{'subj_0': [{'var_1': 0.9888009612383392, 'var_2': 0.993568545602114},
  {'var_1': 0.3860111285065776, 'var_2': 0.5295175750547945},
  {'var_1': 0.5685538217395355, 'var_2': 0.3967766598475243}],
 'subj_1': [{'var_1': 0.5145549292148306, 'var_2': 0.341403939286263},
  {'var_1': 0.816293647337446, 'var_2': 0.4309273755939633},
  {'var_1': 0.1438166862638004, 'var_2': 0.6356644203363685}],
 'subj_2': [{'var_1': 0.9425411798967996, 'var_2': 0.9271632538579525},
  {'var_1': 0.46178590455623025, 'var_2': 0.3797246271800402},
  {'var_1': 0.2552954915331551, 'var_2': 0.007190466253909178}],
 'subj_3': [{'var_1': 0.36816981720526576, 'var_2': 0.5524012177436572},
  {'var_1': 0.3140434042152457, 'var_2': 0.2509696121100905},
  {'var_1': 0.8518898959553745, 'var_2': 0.5013453777576858}]}

This is now organized, but very nested! It is not easy to compute statistics on it.

In [6]:
# Imagine we want to average var_1 across all subjects and repetitions:
means = []
for subject_results in all_subjects_data.values():
    for result in subject_results:
        means.append(result["var_1"])
np.mean(means)

0.5509797389718833

Instead, we can represent the data in a dataframe, **keeping it as flat as possible**!

Remember!


    🪷 The Zen of Python 🪷
        
        Flat is better than nested

In [7]:
# We can turn the data into a dataframe (does not matter how we do it here! this is just an ugly example)
trials_df = pd.DataFrame([dict(subject=i, repetition=j, **all_subjects_data[i][j])
                             for i in all_subjects_data.keys()
                             for j in range(n_repetitions)])

trials_df

,subject,repetition,var_1,var_2
0,subj_0,0,0.988801,0.993569
1,subj_0,1,0.386011,0.529518
2,subj_0,2,0.568554,0.396777
3,subj_1,0,0.514555,0.341404
4,subj_1,1,0.816294,0.430927
5,subj_1,2,0.143817,0.635664
6,subj_2,0,0.942541,0.927163
7,subj_2,1,0.461786,0.379725
8,subj_2,2,0.255295,0.00719
9,subj_3,0,0.36817,0.552401


We can now easily perform statistics on the data:

In [8]:
var1_mean = trials_df["var_1"].mean()

You do not always need pandas dataframes!

They are not efficient with many columns.

Often your raw data (ephys, imaging...) can live in numpy arrays, and you put only the derived quantities in pandas.
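A minimal sketch of this split, assuming some made-up raw data (the array shape and the names `raw_traces` and `trial_stats_df` are just for illustration): the raw traces stay in a numpy array, and the dataframe holds one row per trial with a few derived numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: 12 trials x 500 time points (e.g., one trace per trial).
rng = np.random.default_rng(0)
n_trials, n_timepoints = 12, 500
raw_traces = rng.normal(size=(n_trials, n_timepoints))

# The raw traces stay in the numpy array; the dataframe only holds
# one row per trial with a few derived quantities.
trial_stats_df = pd.DataFrame(dict(
    trial=np.arange(n_trials),
    mean_response=raw_traces.mean(axis=1),
    peak_response=raw_traces.max(axis=1),
))
trial_stats_df.head()
```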

### Principles for organizing `pandas` dataframes

Keep in the same dataset all the data of the same type you have across groups (such as subjects). 

If you load a list of dataframes, concatenate them before working on them!
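A small sketch of the concatenation step, assuming we got one dataframe per subject (here they are generated with random numbers instead of being loaded from files; all names are made up for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical per-subject dataframes, as we might get by loading one file per subject:
per_subject_dfs = []
for i in range(3):
    subject_df = pd.DataFrame(dict(var_1=rng.random(2), var_2=rng.random(2)))
    subject_df["subject"] = f"subj_{i}"  # keep track of where each row comes from
    per_subject_dfs.append(subject_df)

# Stack all of them in one flat dataframe;
# ignore_index=True builds a fresh 0..n-1 index instead of repeating 0, 1 for each subject:
all_trials_df = pd.concat(per_subject_dfs, ignore_index=True)
all_trials_df
```

Adding the `subject` column before concatenating is what keeps the result flat and still lets us tell the subjects apart.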

Consider having multiple dataframes to describe different aspects of your experiment. For example:
- a `subject` dataset with the info on your subjects
- a `trials` dataset with the trial responses across subjects

And keep consistent ids / nomenclature to easily work over both!

Example:

In [9]:
# Let's build a subjects dataframe for the experiment above:
np.random.seed(42)
subjects_df = pd.DataFrame(dict(sex=np.random.choice(["F", "M"], size=n_subjects),
                                handedness=np.random.choice(["left", "right"], size=n_subjects),
                                age=np.random.randint(20, 40, size=n_subjects)),
                          index=[f"subj_{i}" for i in range(n_subjects)])
subjects_df

,sex,handedness,age
subj_0,F,left,26
subj_1,M,right,38
subj_2,F,left,30
subj_3,F,left,30


We can now easily filter the subjects we want to work on based on categories:

In [10]:
selected_subjects_df = subjects_df[(subjects_df["sex"] == "F") & (subjects_df["age"] >= 30)]
selected_subjects_df

,sex,handedness,age
subj_2,F,left,30
subj_3,F,left,30


In [11]:
selected_subjects_df.index

Index(['subj_2', 'subj_3'], dtype='object')

And restrict our analysis of `trials_df` to these subjects:

In [12]:
# Here, we'll use another handy pandas method: `isin()`:

selection = trials_df["subject"].isin(selected_subjects_df.index)
selection


0     False
1     False
2     False
3     False
4     False
5     False
6      True
7      True
8      True
9      True
10     True
11     True
Name: subject, dtype: bool

In [13]:
trials_df.loc[selection, "var_1"].mean()

0.5322876155603451
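As a possible alternative to `isin()`, consistent ids also let us attach the subject info to every trial with `.join()`. This is only a sketch on toy dataframes with made-up values (`toy_subjects_df`, `toy_trials_df`), not the data above:

```python
import pandas as pd

# Toy versions of the two dataframes (values are made up for illustration):
toy_subjects_df = pd.DataFrame(dict(sex=["F", "M"], age=[26, 38]),
                               index=["subj_0", "subj_1"])
toy_trials_df = pd.DataFrame(dict(subject=["subj_0", "subj_0", "subj_1", "subj_1"],
                                  var_1=[0.1, 0.3, 0.5, 0.7]))

# Attach to each trial the info of its subject, matching the "subject" column
# of the trials against the index of the subjects dataframe:
merged_df = toy_trials_df.join(toy_subjects_df, on="subject")

# Now we can filter trials directly on subject properties:
female_mean = merged_df.loc[merged_df["sex"] == "F", "var_1"].mean()
```

This only works because the values in the `subject` column and the index of the subjects dataframe use exactly the same ids.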

(Practicals 1.3.0)

## Aggregate statistics

It can be useful to aggregate statistics based on the values of a column.

Imagine we want to quickly compute the mean of the values across trials for each subject.



### `.groupby()`

We have a handy syntax to average within each category with `.groupby()`.

The syntax is:
```python
df.groupby("name_of_the_category_column").operation()
```

Now, we want to compute the average for every subject:

In [16]:
trials_df.head(5)

,subject,repetition,var_1,var_2
0,subj_0,0,0.988801,0.993569
1,subj_0,1,0.386011,0.529518
2,subj_0,2,0.568554,0.396777
3,subj_1,0,0.514555,0.341404
4,subj_1,1,0.816294,0.430927


In [18]:
# In this case, the operation is `mean()`.
# Note how the result will have the variable we group by as index:

subj_means_df = trials_df.groupby("subject").mean()
subj_means_df

subject,repetition,var_1,var_2
subj_0,1.0,0.647789,0.639954
subj_1,1.0,0.491555,0.469332
subj_2,1.0,0.553208,0.438026
subj_3,1.0,0.511368,0.434905


By the way, this is a reason why methods are better than functions in this case: they can be chained with a clearer syntax!
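To make the chaining point concrete, here is a sketch on a toy dataframe (made-up values; the extra sorting and rounding steps are only there to have something to chain):

```python
import pandas as pd

toy_trials_df = pd.DataFrame(dict(subject=["subj_0", "subj_0", "subj_1", "subj_1"],
                                  var_1=[0.1, 0.3, 0.5, 0.7]))

# Chained methods read top to bottom: group, average, sort, round.
chained = (
    toy_trials_df
    .groupby("subject")
    .mean()
    .sort_values("var_1", ascending=False)
    .round(2)
)

# The same steps with one intermediate variable per operation:
subject_means = toy_trials_df.groupby("subject").mean()
sorted_means = subject_means.sort_values("var_1", ascending=False)
stepwise = sorted_means.round(2)
```

Both versions give the same dataframe; the chained one just avoids naming every intermediate result.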

## Index broadcasting in `pandas`

Let's subtract from each trial the mean of its subject, for each variable.

In [19]:
trials_df.head(3)

,subject,repetition,var_1,var_2
0,subj_0,0,0.988801,0.993569
1,subj_0,1,0.386011,0.529518
2,subj_0,2,0.568554,0.396777


In [20]:
subj_means_df.head(3)

subject,repetition,var_1,var_2
subj_0,1.0,0.647789,0.639954
subj_1,1.0,0.491555,0.469332
subj_2,1.0,0.553208,0.438026


The shapes obviously don't match:

In [None]:
print(trials_df.shape)
print(subj_means_df.shape)

In [21]:
trials_df - subj_means_df  # all NaN: the row labels of the two dataframes do not match

,repetition,subject,var_1,var_2
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,
5,,,,
6,,,,
7,,,,
8,,,,
9,,,,


But pandas will broadcast values using indices if we make them consistent!

In [23]:
subj_means_df

subject,repetition,var_1,var_2
subj_0,1.0,0.647789,0.639954
subj_1,1.0,0.491555,0.469332
subj_2,1.0,0.553208,0.438026
subj_3,1.0,0.511368,0.434905


In [24]:
trials_df.set_index("subject") - subj_means_df

subject,repetition,var_1,var_2
subj_0,-1.0,0.341012,0.353614
subj_0,0.0,-0.261778,-0.110437
subj_0,1.0,-0.079235,-0.243178
subj_1,-1.0,0.023,-0.127928
subj_1,0.0,0.324739,-0.038405
subj_1,1.0,-0.347738,0.166333
subj_2,-1.0,0.389334,0.489137
subj_2,0.0,-0.091422,-0.058301
subj_2,1.0,-0.297912,-0.430836
subj_3,-1.0,-0.143198,0.117496


So now we can write:

In [None]:
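# set_index() returns a new dataframe, so trials_df itself is still indexed by row number:
normalized = trials_df.set_index("subject") - subj_means_df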
normalized.head()

This broadcasting is super powerful! It gives us a very expressive and concise syntax to work with aggregated data without using loops.
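A related option, shown here only as a sketch on a toy dataframe with made-up values, is `groupby(...).transform("mean")`. It returns the group mean repeated on every row, so the subtraction works without changing the index:

```python
import pandas as pd

toy_trials_df = pd.DataFrame(dict(subject=["subj_0", "subj_0", "subj_1", "subj_1"],
                                  var_1=[0.1, 0.3, 0.5, 0.7],
                                  var_2=[1.0, 3.0, 5.0, 7.0]))

value_columns = ["var_1", "var_2"]

# transform("mean") returns the group mean repeated for every row of the group,
# so the result has the same shape and index as the original columns:
group_means = toy_trials_df.groupby("subject")[value_columns].transform("mean")

# The subtraction is then a plain row-by-row operation:
centered_df = toy_trials_df.copy()
centered_df[value_columns] = toy_trials_df[value_columns] - group_means
```

After this, the mean of each variable within each subject is zero (up to floating point error), and the `subject` column is still there as a regular column.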

## Multi-indexing

Sometimes, we might want to average while keeping the data separated over multiple categories:

In [25]:
# Create our trials_df again, now with a trial_type column (how we do it is not relevant here):
trials_df = pd.DataFrame([dict(subject=i, trial_type=j % 2, **all_subjects_data[i][j])
                             for i in all_subjects_data.keys()
                             for j in range(n_repetitions)])

trials_df

,subject,trial_type,var_1,var_2
0,subj_0,0,0.988801,0.993569
1,subj_0,1,0.386011,0.529518
2,subj_0,0,0.568554,0.396777
3,subj_1,0,0.514555,0.341404
4,subj_1,1,0.816294,0.430927
5,subj_1,0,0.143817,0.635664
6,subj_2,0,0.942541,0.927163
7,subj_2,1,0.461786,0.379725
8,subj_2,0,0.255295,0.00719
9,subj_3,0,0.36817,0.552401


In [26]:
trial_subj_avg = trials_df.groupby(["subject", "trial_type"]).mean()
trial_subj_avg

subject,trial_type,var_1,var_2
subj_0,0,0.778677,0.695173
subj_0,1,0.386011,0.529518
subj_1,0,0.329186,0.488534
subj_1,1,0.816294,0.430927
subj_2,0,0.598918,0.467177
subj_2,1,0.461786,0.379725
subj_3,0,0.61003,0.526873
subj_3,1,0.314043,0.25097


In [28]:
trials_df.set_index(["subject", "trial_type"]) - trial_subj_avg

subject,trial_type,var_1,var_2
subj_0,0,0.210124,0.298396
subj_0,0,-0.210124,-0.298396
subj_0,1,0.0,0.0
subj_1,0,0.185369,-0.14713
subj_1,0,-0.185369,0.14713
subj_1,1,0.0,0.0
subj_2,0,0.343623,0.459986
subj_2,0,-0.343623,-0.459986
subj_2,1,0.0,0.0
subj_3,0,-0.24186,0.025528


(Practicals 1.3.1)

## (bonus) Rolling functions with `.rolling()`

Imagine we have a time series of data, and we want to compute the mean over a window of time (e.g., for smoothing).

In [None]:
# Let's create a time series:
time_series = pd.Series(np.random.rand(100))

In [None]:
# This will compute the mean in a rolling window, i.e., smooth the series!
rolling_wnd_size = 10
smoothed = time_series.rolling(rolling_wnd_size, center=True).mean()

In [None]:
time_series.plot()
smoothed.plot()

When used for averaging, this gives the same results as other smoothing tools (e.g., a moving average with a flat window).
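As a quick check of this equivalence, here is a sketch comparing the rolling mean with a convolution with a flat (boxcar) kernel in numpy. The odd window size is a choice made for this example, so that the centered window is symmetric:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy_series = pd.Series(rng.random(100))

# An odd window keeps the centered window symmetric around each point:
window = 11
rolling_mean = toy_series.rolling(window, center=True).mean()

# Same smoothing as a convolution with a flat (boxcar) kernel;
# mode="valid" keeps only the points where the window fully overlaps the data:
boxcar_mean = np.convolve(toy_series.to_numpy(), np.ones(window) / window, mode="valid")

# The rolling result has NaN at the edges (incomplete windows); elsewhere the two match:
same = np.allclose(rolling_mean.dropna().to_numpy(), boxcar_mean)
```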

But now we can use arbitrary functions! (standard deviation, significance tests, etc)
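A sketch of what this can look like: a built-in rolling statistic, and a custom function passed through `.apply()`. The function `peak_to_peak` is a hypothetical example, not something defined elsewhere in the lecture:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy_series = pd.Series(rng.random(100))
window = 10

# Built-in rolling statistics work like .mean():
rolling_std = toy_series.rolling(window, center=True).std()

# Any function mapping a window of values to one number can be used via .apply().
# Here a hypothetical peak-to-peak amplitude; raw=True passes numpy arrays to the function:
def peak_to_peak(values):
    return values.max() - values.min()

rolling_ptp = toy_series.rolling(window, center=True).apply(peak_to_peak, raw=True)
```

Custom functions passed to `.apply()` are generally slower than the built-in methods, so it is worth checking first whether a built-in one already does the job.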

(Practicals 1.3.2)