# Python for (open) Neuroscience

_Lecture 1.2_ - Intro to `pandas`

Luigi Petrucco


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vigji/python-cimec-2025/blob/main/lectures/Lecture1.2_Intro-pandas.ipynb)

## Outline

 - some last bits of `numpy`
 - introducing `pandas`

## Search indexes

Some functions find indexes of the elements of an array that match some criterion.

### `np.argmin()` / `np.argmax()` 

These functions find the position of the minimum or the maximum of an array.

In [None]:
import numpy as np

arr = np.array([5, 0, 2, 9, 6])

np.argmin(arr)  # gives the index of the smallest element

In [None]:
np.argmax(arr)  # gives the index of the biggest element

### `np.nonzero()` / `np.argwhere()`

There are functions that allow us to get the index of all True elements in an array. In this way, we can set a condition (_e.g._, values above a threshold), and find the index of all elements satisfying it!

#### `np.argwhere()`

`np.argwhere()` finds `True` elements and gives us a `(n_pts, n_matrix_dims)` shaped array of indexes:

In [None]:
arr = np.array([[1, 2, 3, 4, 5], 
                [0, 5, 0, 3, 1],
                [0, 6, 7, 4, 1]])  # define an array
boolean_vals = arr > 2  # boolean condition
print("Original array:")
print(arr)
print("\nBoolean array:")
print(boolean_vals)

In [None]:
indexes = np.argwhere(boolean_vals)
print("\nTrue elements indexes:")
print(indexes)

The indexes matrix has one row for every `True` value, and each column gives the position of that value along one dimension of the boolean matrix (here: row index, then column index).
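As a small sketch of how to read the output of `np.argwhere()`, we can loop over its rows and use each one as a coordinate in the original array. The array is the same one defined above; the loop variable names are just illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5],
                [0, 5, 0, 3, 1],
                [0, 6, 7, 4, 1]])
indexes = np.argwhere(arr > 2)

# Each row of `indexes` is a (row, column) pair pointing at one True element:
for row_idx, col_idx in indexes:
    print(f"arr[{row_idx}, {col_idx}] = {arr[row_idx, col_idx]}")

# The shape is (number of True elements, number of array dimensions):
print(indexes.shape)
```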

#### `np.nonzero()`

`np.nonzero()` is very similar to `np.argwhere()`, but instead of a single matrix of indexes (with each column representing the indexes over one dimension for all points), it returns a tuple of arrays, one per dimension.

That is to say, each one of those arrays corresponds to a column of the indexes array returned by `np.argwhere()`.

In [None]:
# print("\nBoolean array:")
#print(boolean_vals)

indexes_tuple = np.nonzero(boolean_vals)
# print("\nTrue elements indexes:")
print(indexes_tuple)

`indexes_tuple` is a tuple of arrays; each one of these arrays corresponds to a column in the indexes matrix above!

In [None]:
indexes_tuple[0] == indexes[:, 0]

Why is it useful to return a tuple of arrays?

### Array indexes are tuples!

Whenever you are writing something like:

In [None]:
arr[0, 1]

You are actually passing a tuple into the square brackets! If you remember, in Python, comma-separated
values with no brackets are automatically put together in a tuple:

In [None]:
a = 1, 2
type(a)

So, writing `arr[0, 1]` is literally the same as writing `arr[(0, 1)]`:

In [None]:
arr[(0, 1)]

If you remember, you can pass arrays instead of single numbers for indexing:

In [None]:
# This code will retrieve three elements from arr:
#   - the element in (0, 2)
#   - the element in (1, 3)
#   - the element in (1, 1)

print(arr)
print("\nSelected items:")
arr[np.array([0, 1, 1]), np.array([2, 3, 1])]

Therefore, we can directly use the tuple of index arrays we get from `np.nonzero()` to retrieve all elements of the original matrix that matched the boolean condition!

In [None]:
arr = np.array([[1, 2, 3, 4, 5], 
                [0, 5, 0, 3, 1],
                [0, 6, 7, 4, 1]])
boolean_vals = arr > 2
print("Original array:")
print(arr)
print("\nBoolean array:")
print(boolean_vals)

indexes_tuple = np.nonzero(boolean_vals)
indexes_tuple

In [None]:
filtered_values = arr[indexes_tuple]  # here we directly use the tuple to index the array!
print("\nElements bigger than 2:")
print(filtered_values)
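
As a quick check, indexing with the tuple from `np.nonzero()` should give the same values as indexing directly with the boolean array. The sketch below compares the two on the same array; the names `via_nonzero` and `via_mask` are only illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5],
                [0, 5, 0, 3, 1],
                [0, 6, 7, 4, 1]])
boolean_vals = arr > 2

via_nonzero = arr[np.nonzero(boolean_vals)]  # index with the tuple of index arrays
via_mask = arr[boolean_vals]                 # index directly with the boolean array

print(via_nonzero)
print(np.array_equal(via_nonzero, via_mask))
```

The tuple form is mostly useful when we also need the positions themselves, not only the values.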

(Practicals 1.2.0)

### `np.argsort`

`argsort` is a powerful function that returns the array of indexes which, when used to index an array, sorts it in ascending order (reverse the index array to get descending order):

In [None]:
scores = np.array([5, 0, 2, 9, 6])
subjects = np.array(["Tom", "Judy", "Lara", "John", "Leah"])

# Get sorting indexes:
sorting_idxs = np.argsort(scores)
sorting_idxs

In [None]:
# applying those indexes we can sort the array:
scores[sorting_idxs]

In [None]:
# But also any other array of matching length, e.g., to sort subjects based on score:
subjects[sorting_idxs]
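
`np.argsort()` always sorts in ascending order. A minimal sketch of one way to rank from highest to lowest score is to reverse the index array with a `[::-1]` slice (the names `ascending_idxs` and `descending_idxs` are illustrative):

```python
import numpy as np

scores = np.array([5, 0, 2, 9, 6])
subjects = np.array(["Tom", "Judy", "Lara", "John", "Leah"])

ascending_idxs = np.argsort(scores)
descending_idxs = ascending_idxs[::-1]  # reverse the order of the sorting indexes

print(scores[descending_idxs])
print(subjects[descending_idxs])
```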

## `numpy` booleans

A final note on `numpy` boolean indexing & arrays

### Boolean operations with arrays

In [144]:
an_array = np.array([1, 2, 3, 4, 5])

In [None]:
condition_0 = an_array > 2
condition_1 = an_array < 5

print(condition_0)
print(condition_1)

### `and` with `&`

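If we try to use `and` with arrays, it won't work element-wise: Python raises a `ValueError`, because the truth value of an array with more than one element is ambiguous!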

In [None]:
condition_0 and condition_1  # DO NOT USE THIS!

To compute the `and` condition element-wise we use `&`:

In [None]:
condition_0 & condition_1

### `or` with `|`

To compute the `or` condition element-wise we use `|`:

In [None]:
condition_0 | condition_1

### `not` with `~`

To compute the `not` condition (over a single array) element-wise we use `~`:

In [None]:
~condition_0

Mind the execution order!

In [None]:
an_array > 0 & an_array < 5  # Without brackets & is applied to 0 and an_array!!

We get the error because `&` has higher precedence than the comparison operators `>` and `<`, so the array comparisons are not executed first!
This is how we can fix this:

In [None]:
# Correct:
(an_array > 0) & (an_array < 5)
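
To make the precedence issue concrete, the sketch below catches the error raised by the version without parentheses and compares the corrected expression with `np.logical_and()`, a function form that avoids the precedence question altogether. The variable names are illustrative.

```python
import numpy as np

an_array = np.array([1, 2, 3, 4, 5])

# Without parentheses, Python evaluates `0 & an_array` first and then chains the
# two comparisons with an implicit `and`, which is ambiguous for arrays:
try:
    an_array > 0 & an_array < 5
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

with_parentheses = (an_array > 0) & (an_array < 5)
with_function = np.logical_and(an_array > 0, an_array < 5)

print(with_parentheses)
print(np.array_equal(with_parentheses, with_function))
```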

(Practicals 1.2.1)

## `pandas`! 🐼

Finally, data sheets in Python!

Most of you are familiar with data sheets: `.xlsx`, `.csv`...

Some of you might have used `R` to do statistics and data aggregation on such tabular data.

The `pandas` library offers the same in Python!

In [146]:
import pandas as pd  # pandas import

In [None]:
# Real world example of stimulus log reading from a csv file:
df = pd.read_csv("https://raw.githubusercontent.com/vigji/python-cimec-2025/main/lectures/files/stimulus_log.csv")

df.head()  # show the first lines of the dataframe

In [None]:
# E.g., we can add columns to the datasheet: 
df["Theta degrees"] = 180 * df["Theta"] / np.pi  # convert angles from radians to degrees

df.head()

### Very convenient data structures!

- named columns
- mixed data types (integers, strings, booleans, timestamps...)

Also, `pandas` offers very powerful ways of aggregating data - we will have a look at them!

## The basic data structure: `pd.DataFrame`

In [149]:
import pandas as pd
import numpy as np

In [None]:
# Read a real example of experimental data with pandas:
df = pd.read_csv("https://raw.githubusercontent.com/vigji/python-cimec-2024/main/lectures/files/stimulus_log.csv")

df

In [None]:
type(df)

## `pd.DataFrame`

2D data structure with labelled **columns** and indexed **rows**

In [None]:
simple_df = pd.DataFrame({'positions': [10, 20, 30, 40, 50],
                          'contrast': [1., 0.5, 0, 1., .5],
                          'condition': ["A", "B", "C", "D", "E"]})
simple_df

### `pd.DataFrame` attributes

In [None]:
simple_df.columns  # returns the columns of the dataframe

In [None]:
simple_df.index  # returns the index of the dataframe

In [None]:
simple_df.shape  # returns the shape of the dataframe (rows, columns), as for arrays

In [None]:
simple_df.values  # returns the raw content in an array (to be used only on homogeneous data)
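
To see why `.values` is best used on homogeneous data, the sketch below compares the array we get from the full (mixed-type) dataframe with the one we get from its numerical columns only. The variable names are illustrative.

```python
import pandas as pd

simple_df = pd.DataFrame({"positions": [10, 20, 30, 40, 50],
                          "contrast": [1., 0.5, 0, 1., .5],
                          "condition": ["A", "B", "C", "D", "E"]})

# Mixing numbers and strings forces a generic `object` array:
mixed_values = simple_df.values
print(mixed_values.dtype)

# Selecting only numerical columns gives a proper numerical array:
numeric_values = simple_df[["positions", "contrast"]].values
print(numeric_values.dtype)
```

With an `object` array we lose most of the speed and convenience of `numpy` numerical operations.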

## `pd.DataFrame` column selection

Index dataframe over columns:

In [None]:
df.head()  # show the first lines of the dataframe

In [None]:
df["Radius"]  # We can index using the name of the column

In [None]:
# or multiple columns with a list of multiple column names:

df[["Radius", "Theta"]]

What is the type of the result when we select one column?

In [None]:
print(type(df["Radius"]))
df["Radius"]

Once we have selected a column, what we get is a `pd.Series`.

### `pd.Series`

`pd.Series` are 1-dimensional data structures - basically the columns of `pd.DataFrame`s:

In [None]:
a_series_from_df = df["Radius"]
a_series_from_df

`pd.Series` have indexed rows (the same as in the original dataframe), but no columns.

### Operations on Series

We can operate with the series as we would for a `numpy` array:

In [None]:
# The result of this operation is an identically indexed Series with new values:
a_series_from_df * 10

In [None]:
# The result of this operation is an identically indexed Series with boolean values:
a_series_from_df == 5

### Back to indexing dataframes...

DataFrames are smart structures that can understand whether we are indexing rows or columns by checking the values that we are using for indexing:

In [None]:
df[0:2]  # in this case, the indexing refers to rows

Indexing on rows can be slicing (as above) or boolean indexing.

Many times row selections are done based on some boolean conditions.

E.g. we just want to consider trials where the stimulus `Radius` was `5`:

In [None]:
boolean_selection = df["Radius"] == 5

# As for numpy arrays, the comparison on the Series above returns a boolean Series:
boolean_selection

We can use the resulting series to index the original DataFrame:

In [None]:
df[boolean_selection]

Notice how after filtering the indexes of each row are maintained as in the original dataframe!
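If we prefer the filtered rows to be numbered from zero again, one option is `reset_index(drop=True)`. The sketch below uses a small hypothetical dataframe (`trials_df`, with made-up values) as a stand-in for the stimulus log, so that it runs without downloading the file.

```python
import pandas as pd

# Hypothetical stand-in for the stimulus log (values are made up):
trials_df = pd.DataFrame({"Radius": [5, 10, 5, 20, 5],
                          "Theta": [0.0, 1.5, 3.0, 4.5, 6.0]})

filtered_df = trials_df[trials_df["Radius"] == 5]
print(filtered_df.index.tolist())  # original row labels are kept

renumbered_df = filtered_df.reset_index(drop=True)
print(renumbered_df.index.tolist())  # rows are numbered from 0 again
```

Keeping the original index is often what we want, as it lets us trace each row back to the full table.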

### Non-numerical row indexing

One potentially powerful feature of `pd.DataFrame` is the option of using non-numerical indexes:

In [None]:
non_num_idx_df = df.copy()
# Replace numerical index with string:
non_num_idx_df.index = [f"trial_{i}" for i in df.index]

non_num_idx_df.head()

We will see how this can be useful to build tables queryable by trial/subject/experiment ids.

### `.loc`

If we want to select values over both rows and columns we use `.loc` (highly recommended)

**This is not a method!** We need to use the square brackets:

In [None]:
df.loc[::2, ["Theta", "Timestamp"]]

Often, we use boolean indexing to select rows:

In [None]:
df.loc[boolean_selection, ["Theta", "Timestamp", "Radius"]]

It is very common to use multiple criteria to select rows:

In [43]:
selection_series = (df["Radius"] == 5) & (df["Theta"] > 4)

In [None]:
df[selection_series]  # note the boolean `&` operation, à la numpy

### `.iloc`

If we feel like using numpy-like indexing, we can use `.iloc` (usually discouraged, as it is less readable):

In [None]:
df.iloc[:5, :2]
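
One difference worth keeping in mind: `.loc` selects by label, while `.iloc` selects by integer position. A consequence is that label slices include their end point, whereas position slices exclude it, as in `numpy`. The sketch below shows this on a small hypothetical dataframe (`toy_df`, with made-up values and string row labels).

```python
import pandas as pd

# Hypothetical dataframe with string row labels (values are made up):
toy_df = pd.DataFrame({"Radius": [5, 10, 5, 20],
                       "Theta": [0.0, 1.5, 3.0, 4.5]},
                      index=["trial_0", "trial_1", "trial_2", "trial_3"])

# Label-based: the slice includes the end label "trial_2"
by_label = toy_df.loc["trial_0":"trial_2", "Theta"]

# Position-based: the slice excludes position 2, as in numpy
by_position = toy_df.iloc[0:2, 1]

print(by_label)
print(by_position)
```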

(Practicals 1.2.1)

### Create `pd.DataFrames`

Typically, we create a dataframe from a dictionary of arrays (lists):

In [None]:
dict_array = dict(int_col=[1, 2, 3], 
                  float_col=[4., 5., .6],
                  a_constant_val=1,
                  str_col=["a", "b", "c"])

pd.DataFrame(dict_array)

Or from a list of dictionaries:

In [None]:
pd.DataFrame([dict(int_col=1, float_col=4., str_col="a"),
              dict(int_col=2, float_col=5., str_col="b"),
              dict(int_col=3, float_col=.6, str_col="c")])

### From `numpy` arrays

We can also create a dataframe from data stored as a numpy array:

In [None]:
twod_array = np.random.rand(3, 3)
twod_array

In [None]:
# We just have to specify the column names and (optionally) the indexes, if different from default:

pd.DataFrame(twod_array, 
             columns=["a", "b", "c"], # column names
             index=["row1", "row2", "row3"], # non numerical indexing
            )

### Reading from files

Many (most?) times we'll be reading directly from a file (a `.csv`, a `.xlsx`...)

In [None]:
# For .csv files, we use the read_csv function.
# In this notebook we read from the web; if it were a file on your computer, you'd pass
# the file path instead of the URL. read_csv takes a bunch of inputs about how your file is formatted

URL = "https://raw.githubusercontent.com/vigji/python-cimec-2024/main/lectures/files/stimulus_log.csv"
df = pd.read_csv(URL)

df.head()  # this will show only the first rows!

## Extend existing dataframes

### Add new columns

We can add new columns to a dataframe:

In [None]:
df = pd.DataFrame(np.random.rand(3, 3), columns=["a", "b", "c"], index=["row1", "row2", "row3"])
df

To add data, we can use any multi-element variable: `list`s, `array`s...

In [None]:
# The length of the assignment has to match the length of the dataframe:
df["a_new_column"] = ["a", "b", "c"]
df

In [None]:
# We can also assign a single value to fill the whole column with the same content:
df["new_boring_column"] = 42
df

### Add new rows

We can add new rows to a dataframe (this is less common). In this case we use concatenation:

In [None]:
df1 = pd.DataFrame(dict(col1=[99, 95, 92],
                        col2=[95, 90, 99]))
df1

In [None]:
# Create another dataframe:
df2 = pd.DataFrame(dict(col1=[100],
                        col2=[101]))
df2

In [None]:
# Concatenate the dataframes:
pd.concat([df1, df2])

Note how the indexes match the indexes of the original dataframes! If we want, we can reset them (by default, `reset_index()` keeps the old index as a new column named `index`):

In [None]:
pd.concat([df1, df2]).reset_index()
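
If we do not need the old index at all, two common options are `reset_index(drop=True)` or passing `ignore_index=True` directly to `pd.concat()`. The sketch below compares the three variants on the same two dataframes; the result names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(dict(col1=[99, 95, 92],
                        col2=[95, 90, 99]))
df2 = pd.DataFrame(dict(col1=[100],
                        col2=[101]))

# The old index is kept as an extra column called "index":
with_old_index = pd.concat([df1, df2]).reset_index()
print(with_old_index.columns.tolist())

# The old index is discarded:
dropped = pd.concat([df1, df2]).reset_index(drop=True)

# Same result in a single step:
ignored = pd.concat([df1, df2], ignore_index=True)
print(ignored)
```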

(Practicals 1.2.2)

## [Bonus] index raveling / unraveling

For a multi-dimensional array:

In [None]:
arr = np.array([[5, 1, 2], 
                [3, 0, 4]])
print(arr)
np.argmin(arr)

What is this number 4?

## Flat indexing

When you have a multi-dimensional array, you can always index it in two ways:
 - the standard, multi-dimensional indexing (e.g., `my_array[3,4]`)
 - with **flat indexing**: we index the array after flattening it out in a single dimension

In [None]:
# Example:
my_arr = np.random.randint(0, 10, (4, 3))
my_arr

There is a flattened representation of this array that we can look at with the `.flatten()` method we saw in the last lecture:

In [None]:
import numpy as np
# When we flatten, we concatenate all values of the matrix in a single dimension.
# We keep the order of the dimensions of the matrix (the first 3 elements are the first row, that
# is, the first dimension):
print(my_arr)
min_idx = np.argmin(my_arr)
print(min_idx)
my_arr.flatten()[min_idx]


The number we get from `np.argmin()` (or `np.argmax()`) is the index we would need to use over the flattened representation of the array to get the minimum (or maximum) value!

In [None]:
max_idx = np.argmax(my_arr)  # get max index
print(max_idx)

flat_array = my_arr.flatten()  # create a flattened array
flat_array[max_idx]

One last thing: it is obviously annoying to have to create a new flattened array just to use the index. Also, we create a duplicated array in memory - not good.

The best way to use this indexing is through the `.flat` representation of the matrix:

In [None]:
my_arr.flat[max_idx]

### Index raveling / unraveling

To convert the flat index to a tuple of matrix indexes, we can use `np.unravel_index()`:

In [None]:
arr = np.array([[5, 1, 2], [3, 3, 0]])
idx = np.argmin(arr)

# unravel_index takes an index of the flattened array and the shape of the matrix,
# and gives us the tuple of ordinary indexes that corresponds to that flat index.
np.unravel_index(idx, arr.shape)  

This is an illustration of what happens. Flat indexes (_left_) are converted to tuple indexes (_right_):
(there is a bug in the image, last value should be 11!)

![unravel illustration](https://i.stack.imgur.com/sxwBU.png)

The converse operation, called ravel, can be done with `np.ravel_multi_index()`, and it goes from the right representation to the left one:

In [None]:
np.ravel_multi_index((1, 1), arr.shape)
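
As a closing sketch, we can check that the two functions are inverses of each other by going from a flat index to a tuple index and back. The array is the same as above; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[5, 1, 2], [3, 3, 0]])

flat_idx = np.argmin(arr)                                   # index in the flattened array
tuple_idx = np.unravel_index(flat_idx, arr.shape)           # flat index -> (row, column)
back_to_flat = np.ravel_multi_index(tuple_idx, arr.shape)   # (row, column) -> flat index

print(flat_idx, tuple_idx, back_to_flat)

# Both ways of indexing point at the same element:
print(arr[tuple_idx] == arr.flat[flat_idx])
```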