# Python for (open) Neuroscience

_Lecture 1.3_ - More on `pandas`

Luigi Petrucco

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vigji/python-cimec-2025/blob/main/lectures/Lecture1.3_More-pandas.ipynb)

## Overview
 - summary on `pandas` dataframes
 - creating and using dataframes
 - data organization principles

## Summary on `pandas`

In [1]:
import pandas as pd
import numpy as np

URL = "https://raw.githubusercontent.com/vigji/python-cimec-2024/main/lectures/files/stimulus_log.csv"
df = pd.read_csv(URL)

df.head()

Unnamed: 0,Radius,Theta,Direction,Timestamp
0,5,3.1415,1,2023-12-17T12:23:48.2339968+01:00
1,4,2.094395,1,2023-12-17T12:24:09.2596608+01:00
2,5,3.1415,2,2023-12-17T12:24:16.2907776+01:00
3,5,3.769911,2,2023-12-17T12:24:29.2541696+01:00
4,5,3.1415,1,2023-12-17T12:24:36.2617216+01:00


### Create `pd.DataFrames`

Typically, we create a dataframe from a dictionary of arrays (lists):

In [None]:
dict_array = dict(int_col=[1, 2, 3], 
                  float_col=[4., 5., .6],
                  a_constant_val=1,
                  str_col=["a", "b", "c"])

pd.DataFrame(dict_array)

 or from a list of dictionaries:

In [None]:
pd.DataFrame([dict(int_col=1, float_col=4., str_col="a"),
              dict(int_col=2, float_col=5., str_col="b"),
              dict(int_col=3, float_col=.6, str_col="c")])

### From `numpy` arrays

We can also create a dataframe from data stored as a numpy array:

In [None]:
twod_array = np.random.rand(4, 3)
twod_array

In [None]:
# We just have to specify the column names and (optionally) the index, if different from the default:
pd.DataFrame(twod_array, 
             columns=["a", "b", "c"], # column names
             #index=["row1", "row2", "row3", "row4"], # non numerical indexing
            )

### Reading from files

Many (most?) times we'll be reading directly from a file (a `.csv`, a `.xlsx`...)

In [None]:
# For .csv files, we use the pd.read_csv function.
# In this notebook we read from the web; for a file on your computer, you would pass the file path
# instead of the URL. read_csv accepts many optional arguments describing how your file is formatted

URL = "https://raw.githubusercontent.com/vigji/python-cimec-2024/main/lectures/files/stimulus_log.csv"
df = pd.read_csv(URL)

df.head()  # this will show only the first rows!
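
As a minimal sketch of what those formatting arguments look like, the example below reads a semicolon-separated table. The text content is hypothetical and kept in memory (through `io.StringIO`) only so that the example needs no file on disk; `sep` and `index_col` are two of the optional arguments of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical file content, kept in memory so the example needs no file on disk
csv_text = """;Radius;Theta
0;5;3.1415
1;4;2.094395
"""

# sep: the character separating the values (the default is ",")
# index_col: which column to use as the row index, instead of loading it
#            as a regular column named "Unnamed: 0"
example_df = pd.read_csv(io.StringIO(csv_text), sep=";", index_col=0)
print(example_df)
```

With a real file, the first argument would simply be the file path.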

## Extend existing dataframes

### Add new columns

We can add new columns to a dataframe:

In [None]:
df = pd.DataFrame(np.random.rand(3, 3), columns=["a", "b", "c"], index=["row1", "row2", "row3"])
df

To add data, we can use any multi-element variable: `list`s, `array`s...

In [None]:
# The length of the assignment has to match the length of the dataframe:
df["a_new_column"] = ["A", "A", "B"]
df

In [None]:
# We can also assign a single value to fill the whole column with the same content:
df["new_boring_column"] = 42
df
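
A short sketch of two related cases, using a hypothetical `example_df`: a new column computed from existing ones, and what happens when the length of the assigned values does not match the number of rows (pandas raises a `ValueError`).

```python
import numpy as np
import pandas as pd

example_df = pd.DataFrame(np.random.rand(3, 3), columns=["a", "b", "c"])

# A new column can be computed from existing ones (element-wise operation):
example_df["a_plus_b"] = example_df["a"] + example_df["b"]

# Assigning a list of the wrong length raises a ValueError:
try:
    example_df["too_short"] = ["A", "B"]  # 2 values for 3 rows
except ValueError as error:
    print("Assignment failed:", error)

print(example_df.columns.tolist())
```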

### Add new rows

We can also add new rows to a dataframe (this is less common). In this case we use concatenation:

In [None]:
df1 = pd.DataFrame(dict(col1=[99, 95, 92], col2=[95, 90, 99]))
df2 = pd.DataFrame(dict(col1=[100], col2=[101]))
print(df1)
print("========")
print(df2)

In [None]:
# Concat dataframes
pd.concat([df1, df2])

Note how the indexes match the indexes of the original dataframes! If we want, we can reset them:

In [None]:
pd.concat([df1, df2]).reset_index(drop=True)
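
As an alternative sketch, `pd.concat` also accepts an `ignore_index` argument, which should give the same result in a single step:

```python
import pandas as pd

df1 = pd.DataFrame(dict(col1=[99, 95, 92], col2=[95, 90, 99]))
df2 = pd.DataFrame(dict(col1=[100], col2=[101]))

# ignore_index=True discards the original indexes and renumbers the rows from 0
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```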

## `pandas` handles missing values

One convenient feature of dataframes is that we can concatenate mismatching dataframes (with different columns), and pandas fills the missing values with `NaN`s.

In [None]:
df1 = pd.DataFrame(dict(col1=[99, 95, 92], col2=[95, 90, 99]))
df2 = pd.DataFrame(dict(col1=[100], different_col=[101]))
# print(df1)
# print("========")
# print(df2)

In [None]:
pd.concat([df1, df2])
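
A small sketch to inspect the result of such a concatenation: counting the missing values per column, and checking the column types (columns that received `NaN`s are stored as floats, since `NaN` is a floating point value). The name `combined` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(dict(col1=[99, 95, 92], col2=[95, 90, 99]))
df2 = pd.DataFrame(dict(col1=[100], different_col=[101]))

combined = pd.concat([df1, df2], ignore_index=True)

# Number of missing values in each column
print(combined.isna().sum())

# Columns that received NaNs no longer hold integers
print(combined.dtypes)
```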

(Practicals 1.3.0)

### `pd.DataFrame`'s methods

`pd.DataFrame`s and `pd.Series` have many, many methods!

There are actually way too many to cover in a single lecture! It is more important to know that they exist, and to know how to find them (Google, Stack Overflow, the pandas documentation, ChatGPT...).

### Methods to change the df content

Methods to drop rows/columns:

In [None]:
import pandas as pd

dict_array = {"int_col": [3, 2, 1, 1], 
              "float_col": [4., 5., .6, 7.], 
              "str_col": ["a", "d", "c", "a"]}
df = pd.DataFrame(dict_array)

df

In [None]:
df.drop(columns=["int_col", "str_col"])  # drop columns (returns a new dataframe, df is unchanged)

In [None]:
df.drop(index=[0, 2])  # drop rows
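
One detail worth a quick sketch: by default, `drop()` returns a new dataframe and leaves the original untouched, so the result has to be assigned to a variable to be kept. The name `smaller_df` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"int_col": [3, 2, 1, 1],
                   "float_col": [4., 5., .6, 7.],
                   "str_col": ["a", "d", "c", "a"]})

# drop() returns a new dataframe; the original is left untouched
smaller_df = df.drop(columns=["int_col", "str_col"])
print(smaller_df.columns.tolist())
print(df.columns.tolist())

# To keep the change, assign the result back to a variable
df = df.drop(index=[0, 2])
print(df)
```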

Methods to sort rows/columns:

In [None]:
df.sort_values(by="str_col") # sort by a column

In [None]:
df.sort_values(by=["int_col", "float_col"])  # sort by multiple columns
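
The sorting direction can be controlled with the `ascending` argument of `sort_values`. A minimal sketch, sorting one column in ascending and the other in descending order:

```python
import pandas as pd

df = pd.DataFrame({"int_col": [3, 2, 1, 1],
                   "float_col": [4., 5., .6, 7.],
                   "str_col": ["a", "d", "c", "a"]})

# One flag per column: int_col ascending, ties broken by float_col descending
sorted_df = df.sort_values(by=["int_col", "float_col"], ascending=[True, False])
print(sorted_df)
```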

### Methods for statistics

In [None]:
# Get a sample weather dataset from the Open-Meteo API:
def get_meteo_dataset():
    URL = "https://api.open-meteo.com/v1/forecast?latitude=52.52&longitude=13.41&current=temperature_2m,wind_speed_10m&hourly=temperature_2m,relative_humidity_2m,precipitation,wind_speed_10m,winddirection_10m&start_date=2025-04-01&end_date=2025-04-20&format=csv"
    return pd.read_csv(URL, skiprows=5)

df = get_meteo_dataset()
df.head()

In [None]:
mean_temperature = df["temperature_2m (°C)"].mean()
mean_temperature

In [None]:
df["temperature_2m (°C)"].median()

In [None]:
df["temperature_2m (°C)"].std()

We can directly produce a whole summary for columns of the dataset:

In [None]:
df[["temperature_2m (°C)", "precipitation (mm)", "wind_speed_10m (km/h)"]].describe()

### `pd.DataFrame`'s plotting methods

`pd.DataFrame`s and `pd.Series` have several plotting methods!

In [None]:
df["temperature_2m (°C)"].plot()

In [None]:
df["temperature_2m (°C)"].plot(kind="box")

In [None]:
df["temperature_2m (°C)"].plot(kind="hist")

In [None]:
df.plot(kind="scatter", x="temperature_2m (°C)", y="precipitation (mm)")

### Methods to deal with missing data

As in numpy, we represent missing data by `NaN` (not a number).

In [None]:
df = pd.DataFrame({"column_a": [0, 3, 1, 2, np.nan, 4, 10], 
                   "column_b": [7, 6, np.nan, 4, 5, 7, 8]})
df

We can easily find out where the `NaN` values are using the `isna()` method:

In [None]:
df.isna()

To deal with missing data, we can use the `interpolate()` method of `pd.DataFrame`. By default, it uses linear interpolation:

In [None]:
df.plot()

In [None]:
?df.interpolate

In [None]:
df.interpolate().plot()
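
To make the effect concrete, here is a sketch that looks at the interpolated values instead of plotting them, together with two common alternatives, `dropna()` and `fillna()`. With the default linear method, a single isolated `NaN` is replaced by the midpoint of its two neighbours.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column_a": [0, 3, 1, 2, np.nan, 4, 10],
                   "column_b": [7, 6, np.nan, 4, 5, 7, 8]})

# Linear interpolation: each isolated NaN becomes the midpoint of its neighbours
interpolated = df.interpolate()
print(interpolated.loc[4, "column_a"])  # between 2 and 4
print(interpolated.loc[2, "column_b"])  # between 6 and 4

# Alternatives: drop every row containing a NaN, or fill NaNs with a fixed value
print(df.dropna())
print(df.fillna(0))
```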

### Methods for doing operations on columns

There are several; a particularly useful one is `isin()`:

In [None]:
df = pd.DataFrame({"int_col": [1, 2, 3], "str_col": ["a", "b", "c"]})
df

In [None]:
# Check which rows have int_col values in [1, 3]
mask = df["int_col"].isin([1, 3])  # or on strings: df["str_col"].isin(["a", "d"])
print(mask)
print("=======")
print(df[mask])

Another useful method is `between()`, which checks whether values fall within a range:


In [None]:
df = pd.DataFrame({
    "int_col": [1, 2, 3, 1],
    "float_col": [4.0, 5.0, 0.6, 7.0],
    "str_col": ["a", "d", "c", "a"]
})

# Check which rows have float_col values between 4 and 6
mask = df["float_col"].between(4, 6)  # inclusive by default
print("\nRows where float_col is between 4 and 6:")
print(df[mask])
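
Whether the boundaries of the range count as "inside" is controlled by the `inclusive` argument. A small sketch on a hypothetical `values` series; the string options used here assume a reasonably recent pandas version (1.3 or later):

```python
import pandas as pd

values = pd.Series([4.0, 5.0, 0.6, 7.0, 6.0])

# Both ends included (default)
print(values.between(4, 6).tolist())

# Both ends excluded
print(values.between(4, 6, inclusive="neither").tolist())

# Only the left end included
print(values.between(4, 6, inclusive="left").tolist())
```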


### Additional Useful Pandas Methods

Here are some additional useful pandas methods that weren't covered in the lecture:

#### Data Cleaning and Transformation
- `df.replace()`: Replace values in DataFrame
- `df.duplicated()`: Find duplicate rows
- `df.drop_duplicates()`: Remove duplicate rows

#### Advanced Indexing and Selection
- `df.query()`: Filter using a query expression
- `df.where()`: Replace values where condition is False
- `df.mask()`: Replace values where condition is True
- `df.nlargest()` and `df.nsmallest()`: Get n largest/smallest values

#### Time Series Operations
- `df.resample()`: Resample time series data
- `df.rolling()`: Rolling window calculations
- `df.diff()`: Calculate difference between consecutive elements

#### Data Aggregation and Grouping
- `df.agg()`: Aggregate using multiple operations
- `df.transform()`: Transform data using a function
- `df.pivot_table()`: Create a pivot table
- `pd.crosstab()`: Compute a cross-tabulation (a `pandas` function, not a dataframe method)
- `df.melt()`: Unpivot DataFrame from wide to long format

#### String Operations
- `df["col"].str.contains()`: Test if pattern is contained in string
- `df["col"].str.extract()`: Extract capture groups
- `df["col"].str.split()`: Split strings on delimiter
- `df["col"].str.cat()`: Concatenate strings
- `df["col"].str.replace()`: Replace occurrences of pattern

#### Memory and Performance
- `df.memory_usage()`: Get memory usage of columns
- `df.convert_dtypes()`: Convert columns to best possible dtypes
- `df.astype()`: Cast to specified dtype
- `df.select_dtypes()`: Select columns based on dtype

#### Advanced Operations
- `df.applymap()`: Apply function element-wise (renamed to `df.map()` in pandas 2.1, where `applymap` is deprecated)
- `df.update()`: Update values in place
- `df.merge()`: Merge DataFrames (more flexible than join)
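
As a quick illustration, here is a sketch of a few of the methods listed above applied to a small hypothetical dataframe. It is only meant to give a feeling for how they are called, not to cover their options.

```python
import pandas as pd

df = pd.DataFrame({"int_col": [3, 2, 1, 1],
                   "float_col": [4., 5., .6, 7.],
                   "str_col": ["a", "d", "c", "a"]})

# query(): filter rows with an expression written as a string
print(df.query("int_col == 1 and float_col > 1"))

# drop_duplicates(): keep the first row for each value of str_col
print(df.drop_duplicates(subset="str_col"))

# nlargest(): the 2 rows with the largest float_col values
print(df.nlargest(2, "float_col"))

# .str methods work on a single column (a Series) of strings
print(df["str_col"].str.contains("a"))
```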

(Practicals 1.3.1)

## Organize data in a dataframe

A very common question in data science is: how to organize our datasets?

In [None]:
# Imagine we have 4 experimental subjects; to each one we show a stimulus 3 times (trials); 
# during every trial, we measure 2 variables (eg, accuracy and speed)
n_subjects = 4
n_repetitions = 3

# We could represent the data entries for each stimulus as a dictionary, 
# and the data for all trials for a subject as a list of dictionaries:
subject_data = [{"accuracy":np.random.rand(), "speed":np.random.rand()} for _ in range(n_repetitions)]
subject_data

In [None]:
# We could then pool the data for all subjects as a dictionary of lists of dictionaries:
all_subjects_data = dict()

for i in range(n_subjects):
    all_subjects_data[f"subj_{i}"] = \
        [{"accuracy":np.random.rand(), "speed":np.random.rand()} for _ in range(n_repetitions)]
all_subjects_data

This is now organized, but very nested! It is not easy to perform statistics on it.

In [None]:
# Imagine we want to average the speed across all subjects and trials:
speeds = []
for subject_results in all_subjects_data.values():
    for result in subject_results:
        speeds.append(result["speed"])
print(speeds)
np.mean(speeds)

When we organize data in pandas dataframes, there is an important principle to keep in mind:

**keep them as flat as possible**

`flat` = opposite of nested

`nested` = lists of dictionaries of lists of dictionaries of dataframes of...



Remember!


    🪷 The Zen of Python 🪷
        
        Flat is better than nested

In [None]:
# We can turn the data into a dataframe (does not matter how we do it here! this is just an ugly example)
flat_list_of_dicts = []

for sub in all_subjects_data.keys():
    for n_rep in range(n_repetitions):
        # Copy the trial dictionary, so that we do not modify the original nested data
        trial_dict = dict(all_subjects_data[sub][n_rep])
        trial_dict.update({"subject": sub, "repetition": n_rep})
        flat_list_of_dicts.append(trial_dict)
                    
trials_df = pd.DataFrame(flat_list_of_dicts)
trials_df


We can now easily perform statistics on the data:

In [None]:
trials_df["speed"].mean()
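
A flat layout also makes per-group statistics short to write. The sketch below uses `groupby()`, a method not covered in this lecture, on a hypothetical, randomly generated dataframe with the same columns as the one built here, to compute the mean speed of each subject.

```python
import numpy as np
import pandas as pd

# Hypothetical flat trials dataframe: 2 subjects, 3 repetitions each
rng = np.random.default_rng(0)
trials_df = pd.DataFrame({
    "subject": np.repeat(["subj_0", "subj_1"], 3),
    "repetition": np.tile(np.arange(3), 2),
    "speed": rng.random(6),
    "accuracy": rng.random(6),
})

# Mean over all trials of all subjects
print(trials_df["speed"].mean())

# Mean of each subject: group the rows by subject, then average within groups
per_subject_means = trials_df.groupby("subject")["speed"].mean()
print(per_subject_means)
```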

### Principles for organizing `pandas` dataframes

Keep in the same dataset all the data of the same type you have across groups (such as subjects). 

If you load lists of dataframes, concatenate them before working on them!

Consider having multiple dataframes to describe different aspects of your experiment. For example:
- a `subject_dataframe` with the info on your subjects
- a `trials_dataframe` with the trial responses across subjects

And keep consistent ids / nomenclature to easily work over both!

Example:

In [None]:
# Let's build a subjects dataframe for the experiment above:
np.random.seed(42)
subjects_df = pd.DataFrame({"sex":np.random.choice(["F", "M"], size=n_subjects),
                            "handedness": np.random.choice(["left", "right"], size=n_subjects),
                            "age": np.random.randint(20, 40, size=n_subjects)})
subjects_df.index = [f"subj_{i}" for i in range(n_subjects)]
subjects_df

We can now easily filter the subjects we want to work on based on categories:

In [None]:
selected_subjects_df = subjects_df[(subjects_df["sex"] == "F") & (subjects_df["age"] >= 30)]
selected_subjects_df

In [None]:
selected_subjects_df.index

And restrict our analysis of the `trials_df` to these subjects:

In [None]:
selection = trials_df["subject"].isin(selected_subjects_df.index)
selection


In [None]:
trials_df.loc[selection, "speed"].mean()
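
Consistent ids also make it possible to combine the two dataframes. As a sketch under simplified assumptions (two small hypothetical dataframes with made-up values), `merge()` can attach the subject information to every trial:

```python
import pandas as pd

# Hypothetical small versions of the two dataframes above
subjects_df = pd.DataFrame({"sex": ["F", "M"], "age": [34, 25]},
                           index=["subj_0", "subj_1"])
trials_df = pd.DataFrame({"subject": ["subj_0", "subj_0", "subj_1", "subj_1"],
                          "speed": [0.2, 0.4, 0.6, 0.8]})

# Match the "subject" column of trials_df with the index of subjects_df
merged_df = trials_df.merge(subjects_df, left_on="subject", right_index=True)
print(merged_df)

# Trial data can now be filtered directly on subject properties
female_mean_speed = merged_df.loc[merged_df["sex"] == "F", "speed"].mean()
print(female_mean_speed)
```

Whether to merge or to filter with `isin()` is a matter of convenience: merging duplicates the subject information on every trial row, but keeps everything in one table.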

## When not to use `pandas`

You do not always need pandas dataframes! E.g., they are not efficient for data with a very large number of columns.

Many times your raw data (ephys, imaging...) can live in numpy arrays, and you work in `pandas` only with derived quantities.
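
A minimal sketch of this division of labour, assuming a hypothetical recording of 8 channels (random numbers stand in for real data): the raw traces stay in a numpy array, and only one summary row per channel goes into a dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical raw recording: 8 channels x 10000 time points
rng = np.random.default_rng(0)
raw_traces = rng.normal(size=(8, 10_000))

# One row per channel, one column per derived quantity
channels_df = pd.DataFrame({
    "channel": np.arange(raw_traces.shape[0]),
    "mean": raw_traces.mean(axis=1),
    "std": raw_traces.std(axis=1),
    "peak": raw_traces.max(axis=1),
})
print(channels_df)
```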

(Practicals 1.3.2)