# Scientific Python and Data Manipulation
## Advanced Python for Life Sciences @ Physalia courses (Summer 2025)
### Marco Chierici, Fondazione Bruno Kessler

# NumPy

## Motivation

Suppose you have to work with matrices and perform a matrix multiplication. You could create matrices in pure Python by using nested lists.

In [None]:
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

This creates a 3x3 matrix that should look like this:

```
1 2 3
4 5 6
7 8 9
```

And then you can access the individual items by their indices.

In [None]:
matrix[0][1]

How could you multiply each item by 2? For example, you can use a nested `for` loop.

In [None]:
for row in matrix:
    for i in range(len(row)):
        row[i] = row[i] * 2

matrix

The output is correct, but everything here is done in pure Python: you would have to reimplement every operation you need, matrix multiplication included!
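
To make that cost concrete, here is a minimal sketch of the matrix multiplication mentioned at the start of this section, written for nested lists. The helper name `matmul` and the variable `m` are ours and not part of any library; a real implementation would need more input checks.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists (lists of rows)."""
    n_inner = len(b)
    if any(len(row) != n_inner for row in a):
        raise ValueError("the number of columns of a must match the number of rows of b")
    n_cols = len(b[0])
    # entry (i, j) is the dot product of row i of a and column j of b
    return [
        [sum(row[k] * b[k][j] for k in range(n_inner)) for j in range(n_cols)]
        for row in a
    ]


m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
product = matmul(m, m)
print(product)  # [[30, 36, 42], [66, 81, 96], [102, 126, 150]]
```

Every further operation (transpose, determinant, inverse, ...) would need similar hand-written code.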

**NumPy** brings N-dimensional arrays and linear algebra routines to Python.

More info and full documentation: https://numpy.org

## Overview

In [None]:
# canonical import
import numpy as np

In [None]:
# a two-dimensional array (3 rows x 3 columns)
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a)
print(type(a))

You can index a NumPy array like you would do with a nested list:

In [None]:
a[0][1]

Or with the NumPy-specific multidimensional indexing syntax:

In [None]:
a[0, 1]

Arrays contain items of a single type: this is one major difference with respect to lists.

In [None]:
np.array([[1, 2, 3], ["a", "b", "c"]])
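
When the items have different types, NumPy does not raise an error: it converts everything to a single common type. The short sketch below (variable names are ours) inspects the resulting `dtype` in a few cases.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])    # the integers are upcast to floats
mixed_with_text = np.array([1, 2, "a"])  # everything is converted to strings

print(ints.dtype)             # an integer dtype (the exact width depends on the platform)
print(mixed_numbers.dtype)    # float64
print(mixed_with_text.dtype)  # a Unicode string dtype
print(mixed_with_text)        # the numbers are now the strings '1' and '2'
```

This silent conversion is worth keeping in mind: arithmetic on `mixed_with_text` would no longer work as expected.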

You can create three-dimensional arrays:

In [None]:
matrix = np.array([
    [[1, 2, 3], [4, 5, 6]], 
    [[7, 8, 9], [10, 11, 12]],
    [[13, 14, 15], [16, 17, 18]]
])

In [None]:
matrix[0, 1, 2]

## Operations

In [None]:
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix * 2

In [None]:
second_matrix = np.array([[5, 4, 3], [7, 6, 5], [9, 8, 7]])
second_matrix

In [None]:
second_matrix - matrix

Mind the difference between core Python and Numpy:

In [None]:
# using regular or "core" Python
print([1, 2, 4] + [3, 5, 6])

# use numpy to add
np.array([1, 2, 4]) + np.array([3, 5, 6])

All arithmetic operators (`+`, `-`, `*`, `/`) operate *element by element*:

In [None]:
matrix = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]])
matrix * matrix

Matrix product is computed with the `@` operator:

In [None]:
matrix @ matrix

Other common array operations:

In [None]:
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix

In [None]:
# dimension or shape
print(matrix.shape)

Note that `.shape` is an *attribute* of `matrix`, not a function or method: a common error is to call it as `matrix.shape()`.

In [None]:
matrix.diagonal()

In [None]:
matrix.flatten()

In [None]:
matrix.transpose()

In [None]:
# equivalent
matrix.T

In [None]:
np.max(matrix)

In [None]:
# equivalent
matrix.max()

In [None]:
print(np.sum(matrix))
#or
matrix.sum()

In [None]:
matrix.mean()  # or np.mean(matrix)

In [None]:
np.sqrt(matrix)  # element by element

In [None]:
np.square(matrix)  # element by element

In [None]:
np.log2(matrix)  # element by element

## Ranges of values

`np.arange()` is the Numpy equivalent to `range()` and returns a numpy array:

In [None]:
np.arange(10)

In [None]:
np.arange(10, 20)

In [None]:
np.arange(10, 20, 2)

### Shaping

In [None]:
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(A)
print(A.shape)

In [None]:
A.reshape(9, 1)

Note that you can only reshape an array if the total size of the reshaped array matches the original size.

In [None]:
# this will raise an error:
# A.reshape(2, 6)
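
If you want to see the error without interrupting the notebook, you can catch it. This is a small sketch; the variable `reshape_error` is ours.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

reshape_error = None
try:
    A.reshape(2, 6)  # 2 * 6 = 12 items requested, but A only has 9
except ValueError as err:
    reshape_error = str(err)
    print("reshape failed:", reshape_error)

# any shape whose dimensions multiply to A.size is fine
print(A.size)
print(A.reshape(1, 9).shape)
```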

If one of the parameters of `reshape()` is -1, then Numpy will determine that value depending on the other parameter and the total array size.

For example, `arr.reshape(3, -1)` means "reshape arr to three rows and as many columns as it takes".

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)

In [None]:
arr.reshape(3, -1)

In [None]:
arr.reshape(1, -1)

## Randomness

The `np.random` module allows you to perform random sampling.

In [None]:
# set a seed before generating random data
np.random.seed(11)
# draw samples from a normal distribution with 0 mean and 1 sd
# and put the output in a 3x3 array
np.random.normal(size=(3, 3))
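
As a possible alternative, the NumPy documentation recommends the `Generator` interface for new code: you create a generator object with `np.random.default_rng(seed)` and call the sampling methods on it, instead of seeding a global state. The sketch below mirrors the cell above; note that the numbers will differ from those produced by `np.random.seed(11)`, because the two interfaces use different underlying generators.

```python
import numpy as np

# a generator object with its own seed (no global state involved)
rng = np.random.default_rng(11)
samples = rng.normal(loc=0, scale=1, size=(3, 3))
print(samples)

# a second generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(11)
same_samples = rng_again.normal(loc=0, scale=1, size=(3, 3))
print(np.array_equal(samples, same_samples))  # True
```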

## Resources

- **[Numpy cheat sheet](https://images.datacamp.com/image/upload/v1676302459/Marketing/Blog/Numpy_Cheat_Sheet.pdf)** (also in the `resources` folder)
- [Numpy tutorials and books](https://numpy.org/learn/)

---

# Scipy

- Scientific Python
- Built on top of NumPy
- Provides more complex mathematical, statistical, and scientific data analysis functions
  - Still, NumPy contains some linear algebra functions and Fourier transforms, even though these more properly belong in SciPy

More info and full documentation: https://scipy.org

Some particularly useful `scipy` sub-packages include:

- `scipy.stats`
  - randomness
  - statistical functions and tests
- `scipy.integrate`
  - numerical integration
 
Other useful sub-packages include `scipy.linalg` for linear algebra and `scipy.sparse` for sparse matrix problems (e.g. single-cell RNA-seq).

Here we'll briefly touch on how to do simple statistical testing with `scipy`.

## Statistical testing

`stats` contains functions for statistical hypothesis testing. For example, let's conduct a *paired t-test* to compare two sets of related measurements, such as the same biological parameter measured before and after a treatment in the same subjects.

In [None]:
from scipy import stats

# set a seed before generating random data
np.random.seed(12345)
# assume these are two sets of biological measurements
data_before = np.random.normal(loc=50, scale=10, size=100)  # before treatment
# we manually shift data_before to simulate the effect of a treatment
data_after = data_before + np.random.normal(loc=5, scale=5, size=100)  # after treatment

# paired t-test
t_statistic, p_value = stats.ttest_rel(data_before, data_after)
p_value

The p-value is far below the usual significance threshold of 0.05, so we reject the null hypothesis that there is no difference between the two sets of measurements.
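
To build some intuition about what "paired" means, it may help to know that a paired t-test is equivalent to a one-sample t-test on the per-subject differences, testing whether their mean is zero. The sketch below checks this on simulated data like the one above (the variable names `before`, `after`, `paired`, and `one_sample` are ours).

```python
import numpy as np
from scipy import stats

np.random.seed(12345)
before = np.random.normal(loc=50, scale=10, size=100)
after = before + np.random.normal(loc=5, scale=5, size=100)

# paired t-test on the two sets of measurements
paired = stats.ttest_rel(before, after)
# one-sample t-test on the differences, against a mean of 0
one_sample = stats.ttest_1samp(before - after, popmean=0)

print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```

The two tests report the same statistic and p-value, which is why the pairing (same subjects, same order) must be respected in the input arrays.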

The *nonparametric version* of the paired t-test is the Wilcoxon signed-rank test: let's conduct this kind of test on the same data.

In [None]:
# Wilcoxon signed-rank test
statistic, p_value = stats.wilcoxon(data_before, data_after)
p_value

Again, this p-value indicates a statistically significant difference in this example data.

The Wilcoxon signed-rank test is typically preferred over the paired t-test when the data are not normally distributed and therefore do not meet the assumptions of a t-test.

In [None]:
# set a seed before generating random data
np.random.seed(999)
# non-normally distributed data from a log-normal distribution
data_before_non_normal = np.random.lognormal(mean=1.5, sigma=0.4, size=30)
data_after_non_normal = data_before_non_normal * np.random.lognormal(mean=0.1, sigma=0.4, size=30)

# Performing the Wilcoxon signed-rank test
statistic, p_value = stats.wilcoxon(data_before_non_normal, data_after_non_normal)
p_value

If, instead, you want to assess statistical significance of *independent samples* (e.g., one group of patients vs. another group of different patients), you can use a t-test (`stats.ttest_ind`) or a Mann-Whitney U test (`stats.mannwhitneyu`).

In [None]:
np.random.seed(999)

# two independent non-normal samples from log-normal dists
group_a = np.random.lognormal(mean=1.5, sigma=0.4, size=30)
group_b = np.random.lognormal(mean=1.2, sigma=0.5, size=30)

# perform the Mann–Whitney U test
statistic, p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
p_value

### Adjust for multiple testing

When you conduct many tests, such as for differential gene expression analysis, you need to correct the p-values for multiple testing (for example, with the Benjamini-Hochberg procedure).

P-value correction is implemented in the third-party Python library [statsmodels](https://www.statsmodels.org/stable/index.html).

`!conda install -y -c conda-forge statsmodels`

In particular, p-value correction is in the function `statsmodels.stats.multitest.multipletests(pvals, method)`


In [None]:
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

In [None]:
np.random.seed(999)

# 10 independent Mann–Whitney U tests (we simulate testing on 10 genes)
p_vals = []
for i in range(10):
    group_a = np.random.lognormal(mean=1.5, sigma=0.4, size=30)
    group_b = np.random.lognormal(mean=1.2, sigma=0.5, size=30)

    _, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    p_vals.append(p)

np.array(p_vals)

In [None]:
# p-value adjustment
multipletests(np.array(p_vals), alpha=0.05, method="fdr_bh")[1]
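
To see what the Benjamini-Hochberg adjustment does under the hood, here is a minimal NumPy-only sketch. The function name `bh_adjust` and the example p-values are ours and purely illustrative; for real analyses, prefer the tested implementation in statsmodels.

```python
import numpy as np


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (illustrative sketch)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)  # positions of the p-values, smallest first
    # scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity: walk from the largest p-value down, keeping the running minimum
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)  # put the values back in the original order
    return adjusted


example_pvals = [0.01, 0.04, 0.03, 0.20]  # made-up values
adjusted_pvals = bh_adjust(example_pvals)
print(adjusted_pvals)
```

On the same input, `multipletests(example_pvals, method="fdr_bh")[1]` is expected to return the same adjusted values.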

## Resources

- **[Scipy cheat sheet](https://images.datacamp.com/image/upload/v1676303474/Marketing/Blog/SciPy_Cheat_Sheet.pdf)** (also in the `resources` folder)
- [Scipy documentation](https://docs.scipy.org/doc/scipy/)
- If you work a lot with statistical models and statistical tests, check out the [statsmodels](https://www.statsmodels.org/stable/index.html) library!

---

# Pandas

- "The" Python library for **data preprocessing and analysis**
- Built **on top of Numpy**
- Extremely versatile for manipulating datasets, mostly tabular data
- Think of Pandas as the **evolution of spreadsheets**, with more capabilities for coding, and queries on relational data such as joins and group-by
- Bonus: can be used for high quality **plots**
- Most important structure: the **Data Frame** (R users: yes, that one!)
- Trivia: the name derives from **pan**el **da**ta, an econometrics term for multidimensional structured datasets

More info and full documentation: https://pandas.pydata.org

A peek of what you can do with Pandas' toolbox:

- managing data and tables
  - selection
  - grouping
  - pivoting
- managing missing data
- preprocessing and data wrangling
- file I/O
- statistics on data

## Resources
We won't cover everything Pandas can do, so keep this **[cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)** handy! (it's also in the `resources` folder)

---

In [None]:
# canonical import
import pandas as pd

In [None]:
# we'll also need this to improve dealing with files, folders, and paths
from pathlib import Path
DATADIR = Path("data")

---

## Series

A Pandas Series is a 1D container, similar to the Python list. With respect to lists, however, a Series can only hold items of the same `dtype`.

- 1D labeled array: in essence, a column
- Homogeneous data type
- Size - immutable

In [None]:
# create a series using lists
# note: index defaults to 0, 1, 2, ... if not given.
genes = pd.Series([0.2, 1.4, 4, 5], index=["GeneA", "GeneB", "GeneC", "GeneD"])
genes

In [None]:
print(type(genes))

In [None]:
# get the shape: a Series is 1D, so this is a 1-tuple with the number of items
# (shape is an attribute)
genes.shape

In [None]:
# get the index values (another attribute)
genes.index

Note that the index has its own type, `Index`: it is not a Python list (even though it looks like one).

In [None]:
# get the type
genes.dtypes

In [None]:
# get more info
genes.info()

In [None]:
# create a series from a dictionary
genes2 = pd.Series(
    {"GeneA": 0.2, "GeneB": 1.4, "GeneC": 4.0, "GeneD": 5, "GeneE": np.nan}
)
print(genes2)

In [None]:
# give index column a name
genes2.rename_axis("Gene")

### Missing values

Checking for missing values is one of the main steps in your data preprocessing workflow. If you find missing data, you basically have three options:

- keep it (easiest; depends on whether your downstream analysis methods can deal with NaNs)
- remove it (easy; potential loss of data; limits trained models for future data)
- replace it (hardest, somewhat arbitrary; potential to save a lot of data for model training; potential to lead to false conclusions)

Keep in mind that no approach applies to all circumstances!

In [None]:
# check for NA values
print(genes2.isnull())

In [None]:
# drop NA values
genes2.dropna()

In [None]:
genes2

In [None]:
# drop NA values inplace
genes3 = genes2.copy()
genes3.dropna(inplace=True)
genes3

**Note:** many Pandas methods that alter Series and DataFrames support the parameter `inplace`, which defaults to `False`: the method then returns a new, modified object and leaves the original untouched. With `inplace=True`, instead, the method returns `None` and the original object is overwritten. In general, *you should avoid using `inplace=True`*. Moreover, this parameter could be deprecated in a future Pandas version (see [here](https://github.com/pandas-dev/pandas/issues/16529)).
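
The following sketch contrasts the two styles on a Series like `genes2` (redefined here so that the snippet runs on its own; the variable `returned` is ours).

```python
import numpy as np
import pandas as pd

genes2 = pd.Series(
    {"GeneA": 0.2, "GeneB": 1.4, "GeneC": 4.0, "GeneD": 5, "GeneE": np.nan}
)

# inplace=True modifies the object (here a temporary copy) and returns None
returned = genes2.copy().dropna(inplace=True)
print(returned)  # None

# preferred: keep the default and assign the result to a new name
genes3 = genes2.dropna()
print(len(genes2), len(genes3))  # 5 4
```

Assigning the result keeps the original data available and makes it possible to chain several methods in one statement.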

In [None]:
# replace NA values with a custom value
genes2.fillna(value=0)

In [None]:
# replace NA values with a summary statistic (here, the median)
genes2.fillna(value=genes2.median())

In [None]:
# replace NA values propagating the last valid observation to next valid
genes2.ffill()  # use this instead of genes2.fillna(method="ffill")

So what to do with missing values? Here are some rules of thumb.

- You should **drop values** when a lot of data is missing;
- You should **fill with the same value** if you know that NaN is just a placeholder (e.g., for 0);
- You should **fill with interpolated or estimated value** if there is a reasonable assumption to do that!
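
As a sketch of the last option, Pandas provides the `.interpolate()` method, which by default fills each gap linearly between its neighboring valid values, treating the items as equally spaced. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# made-up measurements taken at regular intervals, with two gaps
timecourse = pd.Series([1.0, np.nan, 3.0, np.nan, 9.0])

filled = timecourse.interpolate()  # default method: linear
print(filled.tolist())  # [1.0, 2.0, 3.0, 6.0, 9.0]
```

Whether a linear trend between neighbors is a reasonable assumption depends entirely on your data.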

### Selecting/Filtering Values in a series

1. What is the value of GeneC? There are multiple options to access GeneC.

In [None]:
# dot notation
genes.GeneC

In [None]:
# by row name
genes["GeneC"]

In [None]:
genes.loc["GeneC"]

In [None]:
# by row index
genes.iloc[2]

2. What is the value of GeneC and GeneD?

In [None]:
genes[["GeneC", "GeneD"]]

3. What genes are expressed with a value greater than 3?

In [None]:
genes[genes > 3]

In [None]:
# working with multiple conditions
genes[(genes > 3) | (genes < 1)]

- Multiple conditions can be combined using the symbols `|` (meaning "or") and `&` (meaning "and").
  - Be sure to wrap each condition in parentheses `( )`
  - Don't use Python's `or`, `and` Boolean operators
- In Pandas, the Boolean negation operator is the tilde `~`.
- Use the `.isin()` method to select data whose value "is in" a list of values (mostly used with categorical variables)

In [None]:
genes.isin([5.0])
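
The rules above can be sketched on the same `genes` Series (redefined here so that the snippet is self-contained; the result names are ours).

```python
import pandas as pd

genes = pd.Series([0.2, 1.4, 4, 5], index=["GeneA", "GeneB", "GeneC", "GeneD"])

# "and": both conditions must hold
in_between = genes[(genes > 1) & (genes < 5)]
print(in_between)  # GeneB and GeneC

# negation: everything that is NOT greater than 3
not_high = genes[~(genes > 3)]
print(not_high)  # GeneA and GeneB

# .isin() also works on the index, to select by a list of labels
selected = genes[genes.index.isin(["GeneA", "GeneD"])]
print(selected)
```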

4. What is the mean expression of the whole data?

In [None]:
genes.mean()

5. What is the largest value?

In [None]:
genes.max()

6. What gene has the largest value?

In [None]:
genes[genes == genes.max()].index[0]

In [None]:
# alternative
genes.idxmax()

### Sorting values

In [None]:
# sort from highest to lowest
# note: default is increasing order! check ?genes.sort_values
genes.sort_values(ascending=False)

In [None]:
?genes.sort_values

### Replacing values

If you have a categorical (factor) Series, it may be convenient to rename its levels. For example:

In [None]:
# create a dummy pandas series with a categorical variable (levels: "M", "F")
sex = pd.Series(["M", "F", "M", "F", "M", "F"], name="sex")
sex

In [None]:
# replace "M" and "F" with "Male" and "Female"
sex.replace(to_replace=["M", "F"], value=["Male", "Female"])

An alternative to `.replace()` is the method `.map()`, which accepts a Python dictionary like `{old_value1: new_value1, ...}`

In [None]:
sex.map({"M": "Male", "F": "Female"})

`.map()` is more flexible than `.replace()` and gracefully deals with `dtype` changes.

In [None]:
sex.replace(to_replace=["M", "F"], value=[0, 1])

In [None]:
sex.map({"M": 0, "F": 1})

In [None]:
sex.isin(['F'])
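
One practical difference between the two methods is how they treat values that are not listed: `.map()` turns them into missing values, while `.replace()` leaves them unchanged. A small sketch, with a made-up third level `"U"`:

```python
import pandas as pd

# "U" is a made-up level that is not covered by the dictionary below
sex_codes = pd.Series(["M", "F", "U"], name="sex")
labels = {"M": "Male", "F": "Female"}

mapped = sex_codes.map(labels)        # "U" becomes a missing value
replaced = sex_codes.replace(labels)  # "U" is left as it is

print(mapped)
print(replaced)
```

With `.map()`, checking `mapped.isnull().sum()` afterwards is a quick way to spot unexpected levels.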

---

## DataFrames

A Pandas DataFrame can be thought of as a collection (or a dictionary) of Series objects. The keys in this dictionary are the column names and the values are the Series.

- 2D labeled table made up of a collection of Series
- Potentially heterogeneous data types
- Size - mutable

To create a DataFrame from scratch, there are multiple possibilities: we decide to create a dictionary first, then convert it to a DataFrame.

In [None]:
data = {
    "EnsemblID": [
        "ENSG00000223972",
        "ENSG00000227232",
        "ENSG00000243485",
        "ENSG00000237613",
        "ENSG00000268020",
        "ENSG00000186092",
    ],
    "Gene": ["DDX11L1", "WASH7P", "MIR1302-11", "FAM138A", "OR4G4P", "OR4F5"],
    "GTEX-1117F": [0.1082, 21.4, 0.1602, 0.05045, 0, 0],
    "GTEX-111CU": [0.1158, 11.03, 0.06433, 0, 0, 0],
    "GTEX-111FC": [0.02104, 16.75, 0.04674, 0.02945, 0, 0],
}

In [None]:
# create dataframe
pd.DataFrame(data)

In [None]:
# create and re-order column names
df_gene = pd.DataFrame(
    data, columns=["Gene", "EnsemblID", "GTEX-1117F", "GTEX-111FC", "GTEX-111CU"]
)
df_gene

In [None]:
print(type(df_gene))

In [None]:
# get the number of rows and columns
# (shape is an attribute of df)
df_gene.shape

In [None]:
# get the column names (another attribute)
df_gene.columns

In [None]:
# get the index (another attribute)
df_gene.index

Note that column names and index have their own types, `Index` and `RangeIndex`: they are not Python lists (even though they look like lists).

In [None]:
# create a new index (original dataframe is not modified)
df_gene.set_index("EnsemblID")

In [None]:
# create a new index in-place (mind the above caveat about the use of inplace)
df_gene.set_index("EnsemblID", inplace=True)
df_gene

In [None]:
# get the type of each column
df_gene.dtypes

In [None]:
# get more info
df_gene.info()

To transpose, we can use the numpy syntax `.T`:

In [None]:
# transpose v1
df_gene.T

Or a more human readable syntax `.transpose()`:

In [None]:
# transpose v2
df_gene.transpose()

Quick summary:

In [None]:
df_gene.describe()

In [None]:
df_gene.describe().transpose()

To convert a dataframe to a NumPy array, just use the `.values` attribute (the Pandas documentation recommends the equivalent `.to_numpy()` method for new code):

In [None]:
df_gene.values

### Select columns: by name

In [None]:
# single column (returns a Series)
df_gene["Gene"]

In [None]:
# multicolumn (returns a DataFrame)
df_gene[["Gene", "GTEX-1117F"]]

### Select columns: by condition

Often you'll have to select (or filter out) columns based on a pattern: e.g., all those columns starting with, ending with, or containing something.

For this, you can manipulate column names using the methods of Python's built-in `str` type: for example, `str.startswith()`.

To get access to these string methods on a DataFrame value or column name, you need to use the `.str.` attribute.

In [None]:
# select all columns starting with GTEX
condition = df_gene.columns.str.startswith("GTEX")
condition

In [None]:
df_gene[df_gene.columns[condition]]

This `.str.` attribute is called an "accessor" because it gives access to string methods.
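
Here is a short sketch of a few more selections based on the `.str` accessor, on a reduced copy of the gene expression table (the dataframe `small` and the result names are ours). The last line shows `DataFrame.filter()`, which can be a compact alternative when the pattern is a regular expression.

```python
import pandas as pd

small = pd.DataFrame(
    {
        "Gene": ["DDX11L1", "WASH7P"],
        "GTEX-1117F": [0.1082, 21.4],
        "GTEX-111CU": [0.1158, 11.03],
    }
)

is_gtex = small.columns.str.startswith("GTEX")  # Boolean array, one item per column

gtex_only = small.loc[:, is_gtex]   # keep the matching columns
not_gtex = small.loc[:, ~is_gtex]   # filter them out with the negation
ends_with_cu = small.loc[:, small.columns.str.endswith("CU")]
with_filter = small.filter(regex="^GTEX")  # same selection as gtex_only

print(list(gtex_only.columns))
print(list(not_gtex.columns))
print(list(ends_with_cu.columns))
```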

### Select rows: by name

In [None]:
# get the 1st row
df_gene.loc["ENSG00000223972"]

In [None]:
# (attempt to) get the last row
# df_gene.loc[-1]
# won't work because Pandas will look for a row index label '-1'

# revised version
nrows = df_gene.shape[0]
df_gene.loc[df_gene.index[nrows - 1]]  # output: Series

In [None]:
df_gene.tail(1)  # output: DataFrame

In [None]:
# subset multiple rows by using a list
df_gene.loc[["ENSG00000223972", "ENSG00000243485"]]  # output: dataframe

### Select rows: by condition

In [None]:
cond = df_gene["GTEX-1117F"] >= 5.
df_gene.loc[cond]

In [None]:
df_gene.query("`GTEX-1117F` >= 5")  # protect the column name with `` (since it contains a -)

### Select rows: by index

In [None]:
df_gene.iloc[0]

In [None]:
df_gene.iloc[-1]

### Select rows & columns

You can use `.loc[]` and `.iloc[]` to select columns too.

In [None]:
df_gene.loc[:, ["GTEX-1117F", "GTEX-111CU"]]

In [None]:
# same as above but using iloc
df_gene.iloc[:, [1, -1]]

You can select contiguous ranges as well:

In [None]:
df_gene.loc[:, "GTEX-1117F":"GTEX-111CU"]

In [None]:
df_gene.iloc[:, range(1, 4)]

In [None]:
df_gene.loc["ENSG00000237613", "GTEX-1117F"]

In [None]:
df_gene.iloc[3, 1]

Hint: whenever possible, use the actual column names when you select/subset the data (so, prefer `.loc[]`). Using absolute indexes can lead to issues if the order of rows or columns is changed.
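
A minimal sketch of the issue, on a small made-up table: after reordering the columns, the label-based selection still returns the same data, while the position-based one silently returns a different column.

```python
import pandas as pd

small = pd.DataFrame(
    {
        "Gene": ["DDX11L1", "WASH7P"],
        "GTEX-1117F": [0.1082, 21.4],
        "GTEX-111CU": [0.1158, 11.03],
    }
)
# same data, different column order
reordered = small[["GTEX-111CU", "Gene", "GTEX-1117F"]]

# selection by label is not affected by the reordering
print(small.loc[:, "GTEX-1117F"].equals(reordered.loc[:, "GTEX-1117F"]))  # True

# selection by position now points to a different column
by_pos_before = small.iloc[:, 1]
by_pos_after = reordered.iloc[:, 1]
print(by_pos_before.name, by_pos_after.name)  # GTEX-1117F Gene
```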

### Multiple conditions

Multiple conditions can be combined using the same rules that apply to Series.

In [None]:
cond = (df_gene["GTEX-1117F"] > 5.) | (df_gene["GTEX-111FC"] > 0.03)
df_gene.loc[cond]

### Sort rows

You can apply the `.sort_values()` method on a dataframe to sort by a column, optionally selecting the sort order with the boolean `ascending` parameter.

In [None]:
# sort rows (genes) by decreasing expression in sample GTEX-1117F
df_gene.sort_values(by="GTEX-1117F", ascending=False)

A column is a Series, so as we did before we can apply `.sort_values()` to any column:

In [None]:
df_gene["GTEX-1117F"].sort_values(ascending=False)

### Add/remove/modify columns

You can add and remove columns as part of your initial data cleaning phase. Modifying columns is usually part of the feature engineering, where you create new, more effective features (columns) for downstream analysis (e.g. machine learning).

In [None]:
np.random.seed(42)
df_gene["dummy_col"] = np.random.normal(loc=10, size=df_gene.shape[0])
df_gene

In [None]:
# drop columns: method 1
df_gene.drop(["dummy_col", "GTEX-111CU"], axis=1)  # axis=1 -> operate on columns; axis=0 -> rows

In [None]:
# drop columns: method 2
df_gene = df_gene.drop(columns=["dummy_col", "GTEX-111CU"])
df_gene

We can assign and modify columns with the `.assign()` method.

In [None]:
df_gene.assign(
    delta=df_gene["GTEX-1117F"] - df_gene["GTEX-111FC"],
    logdelta=np.log2(1 + df_gene["GTEX-1117F"] - df_gene["GTEX-111FC"]),
)

If you are familiar with R's `dplyr`, at this point you may be wondering: in the above code, wouldn't it be much easier to write something like `logdelta=np.log2(1 + df_gene["delta"])`?

If we tried, we would get a `KeyError`: all the arguments of `.assign()` are evaluated *before* the method is called, when `df_gene` does not yet have a `delta` column.

To refer to the newly created column, we need to use a **lambda function** (more on this shortly).
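
The failure is easy to reproduce on a tiny made-up dataframe (the names `expr`, `s1`, `s2`, and `missing_column` are ours):

```python
import numpy as np
import pandas as pd

expr = pd.DataFrame({"s1": [21.4, 0.16], "s2": [16.75, 0.05]})

missing_column = None
try:
    expr.assign(
        delta=expr["s1"] - expr["s2"],
        # evaluated before .assign() runs: expr has no "delta" column yet
        logdelta=np.log2(1 + expr["delta"]),
    )
except KeyError as err:
    missing_column = err.args[0]
    print("KeyError:", missing_column)

print(list(expr.columns))  # expr itself is unchanged
```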

Let's see how lambda functions work on a simple dataframe.

In [None]:
toy = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})
toy

If I wanted to apply a function to the `a` column (say, the square), the usual way would involve the creation of the function, which I would apply to the column with the pandas `.apply()` method.

In [None]:
def my_sq(x):
    return x**2


toy["a_sq"] = toy["a"].apply(my_sq)
toy

Seems a lot of work for such a simple function. So I'm going to use a lambda function for that.

In [None]:
toy["a_sq_lambda"] = toy["a"].apply(lambda x: x**2)
toy

---

A **lambda function** is a function simple enough that we don't even need to give it a name: that's why lambda functions are also called "anonymous functions".

Typical use case is for one-liners, like the one above.

The `x` in `lambda x` is each individual value of `toy["a"]`, which is passed to the anonymous function. The result is automatically returned (no need to `return` anything).

R also has anonymous functions, often used in combination with `apply()`, `sapply()`, etc.: consider for example `sapply(toy$a, function(x) x**2)`.
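
A side note, as a sketch: for simple arithmetic like the square, you do not strictly need `.apply()` at all, because operators work element by element on a whole Series, just like on NumPy arrays. `.apply()` with a lambda remains useful when no such vectorized operation exists. The dataframe is redefined here so that the snippet runs on its own.

```python
import pandas as pd

toy = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})

with_apply = toy["a"].apply(lambda x: x**2)  # the function is called once per item
vectorized = toy["a"] ** 2                   # one operation on the whole column

print(with_apply.equals(vectorized))  # True
```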

---

Now we can rewrite the `.assign()` statement with a lambda function:

In [None]:
df_gene.assign(
    delta=df_gene["GTEX-1117F"] - df_gene["GTEX-111FC"],
    logdelta=lambda df_: np.log2(1 + df_["delta"]),
)

Columns are easily renamed with the `.rename()` method, accepting a dictionary in the form `{'old_name': 'new_name'}`

In [None]:
df_gene.rename(columns={'GTEX-1117F': 'Sample1', 'GTEX-111FC': 'Sample2'})

### Reading from files

Pandas has convenient `read_<format>` functions to read data in different formats (CSV, TSV, Excel, JSON, pickle, etc.).

Let's load a small toy dataset about how mouse weight responded to a particular treatment. This data contains 4 columns: 

- `Mouse`, the mouse label/number
- `Treated`, whether or not it was treated
- `Sex`
- `Weight`

In [None]:
df = pd.read_csv(DATADIR / "mouse_weight_data.csv")
df.head()

By default, the resulting dataframe is indexed from 0 to the number of rows minus one. We are not quite happy with this because we would like to use the Mouse ID instead. Let's fix this:

In [None]:
df = df.set_index("Mouse")
df.head()

Want to go back? Just use `reset_index`:

In [None]:
df.reset_index().head()

You can set the correct index straight from `pd.read_csv()`, by setting the argument `index_col` to the column name that you want to use as index!
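
A minimal sketch of `index_col`, reading from an in-memory string so that it runs without any file. The rows are invented for illustration and are not the contents of `mouse_weight_data.csv`; only the column names follow the description above.

```python
import io

import pandas as pd

# invented rows with the same column names as the mouse dataset
csv_text = """Mouse,Treated,Sex,Weight
m1,True,M,22
m2,False,F,25
m3,True,F,19
"""

mice = pd.read_csv(io.StringIO(csv_text), index_col="Mouse")
print(mice)
print(mice.index.name)  # Mouse
```

With a real file, you would pass its path instead of the `io.StringIO` object.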

### Exercise: Statistical testing revisited

1. (Re)load the `mouse_weight_data.csv` from the `data` folder using the column "Mouse" as index
2. Extract the weights of the treated and the untreated mice (hint: the column Treated is boolean: use it to select groups; in Pandas conditions, use `~` to negate)
3. Calculate the mean weight per group and print them
4. Test for statistical significance: there are two groups involved, so you'll need a two-sample t-test (`ttest_ind()` from scipy `stats`)

In [None]:
mouse_df = pd.read_csv(DATADIR / "mouse_weight_data.csv", index_col="Mouse")
mouse_df

In [None]:
treated_weights = mouse_df[mouse_df["Treated"]]["Weight"]
untreated_weights = mouse_df[~mouse_df["Treated"]]["Weight"]

We start by checking if the mean weight is different between the groups.

In [None]:
treated_mean = treated_weights.mean()
untreated_mean = untreated_weights.mean()
print(f"Treated mean weight: {treated_mean:0.0f}g\nUntreated mean weight: {untreated_mean:0.0f}g")

It is slightly different, but is this due to randomness/noise in the data?

Let's test for statistical significance: since there are two groups involved, we have to use a two-sample t-test.

In [None]:
stats.ttest_ind(treated_weights, untreated_weights)

---

## Missing values (DataFrames)

What we saw earlier on Pandas Series also applies to DataFrames.

On DataFrames, it may be more practical to use the `.info()` or the `.count()` methods to get an overview of the non-missing values.

In [None]:
ebola = pd.read_csv(DATADIR / "country_timeseries.csv")
ebola.head()

In [None]:
ebola.info()

In [None]:
# count the no. of non-missing values
ebola.count()

In [None]:
# derive the total no. of missing values
np.count_nonzero(ebola.isnull())

In [None]:
# for an individual column
ebola["Cases_Guinea"].value_counts(dropna=False)  # also: ebola["Cases_Guinea"].isnull().sum()

### Handle missing values

You can use the same methods that work on Series objects.

In [None]:
ebola.head()

In [None]:
# keep only complete cases
ebola.dropna().head()  # only 1 row!

In [None]:
# use axis=1 to remove columns with (any) missing values
ebola.dropna(axis=1).head()  # 2 columns left

In [None]:
ebola.fillna(0).head()

Here is the DataFrame edition of the rules of thumb we previously saw for Series:

- You should **drop rows** (samples) when a lot of data is missing;
- You should **drop columns** (features) if many rows are missing that particular feature;
- You should **fill with the same value** if you know that NaN is just a placeholder (e.g., for 0);
- You should **fill with interpolated or estimated value** if there is a reasonable assumption to do that!
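
These rules can be sketched on a small made-up table. The `thresh` parameter of `.dropna()` sets the minimum number of non-missing values needed to keep a row (or a column, with `axis=1`); the thresholds used here are arbitrary.

```python
import numpy as np
import pandas as pd

# made-up data: geneB is missing for most samples
table = pd.DataFrame(
    {
        "sample": ["s1", "s2", "s3", "s4"],
        "geneA": [1.0, np.nan, 3.0, 4.0],
        "geneB": [np.nan, np.nan, np.nan, 2.0],
    }
)

# how many values are missing in each column?
missing_per_column = table.isnull().sum()
print(missing_per_column)

# drop rows with fewer than 2 non-missing values
mostly_complete_rows = table.dropna(thresh=2)

# drop columns with fewer than 3 non-missing values
without_sparse_columns = table.dropna(axis=1, thresh=3)

# fill one column with an estimate (here, its median) and leave the others alone
filled = table.fillna({"geneA": table["geneA"].median()})

print(mostly_complete_rows)
print(without_sparse_columns)
print(filled)
```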

___

## Grouped operations

Pandas has the `.groupby()` method that allows you to compute grouped (or aggregated) calculations. For example, in our Mouse dataset:

- what is the average weight by sex?
- what is the average weight by sex, stratified by treatment?

In [None]:
mouse_df.groupby("Sex")

`.groupby()` alone does nothing visible: it *prepares* the data for downstream computations. 

In other words, it creates a "lazy" groupby object waiting to be evaluated by an *aggregate method* call, such as `.sum()`, `.mean()`, etc.

In [None]:
mouse_df.groupby("Sex")["Weight"].mean()

If you prefer, you can assign the grouped dataframe to its own variable.

In [None]:
grouped_df = mouse_df.groupby("Sex")
grouped_df["Weight"].mean()

It is easy to group by more than one variable: for example, let's compute the average `Weight` broken down by `Sex` and `Treated`.

In [None]:
multi_var_df = mouse_df.groupby(["Sex", "Treated"])["Weight"].mean()
multi_var_df

**Hint:** you can use `( )` to wrap long statements, writing each method on its own line ("method chaining")

In [None]:
multi_var_df = (
    mouse_df
    .groupby(["Sex", "Treated"])
    ["Weight"]
    .mean()
)
multi_var_df

Notice the hierarchical structure of the row indexes. If you prefer, you can "flatten" it out:

In [None]:
flat_df = multi_var_df.reset_index()
flat_df.head()

On a grouped dataframe you can also apply custom functions with `.apply()`. Here's an example displaying the first (or last) `Weight` for each unique `Sex` value.

In [None]:
mouse_df.groupby("Sex").apply(lambda df: df.Weight.iloc[0], include_groups=False)
# include_groups=False is to exclude the grouping columns from the operation
# it will become the default value in future versions of Pandas

Another `groupby()` method worth mentioning is `.agg()`, which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

In [None]:
mouse_df.groupby("Treated").Weight.agg(["min", "max", "mean", "median"])
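
`.agg()` also accepts keyword arguments, where each keyword becomes the name of an output column ("named aggregation"), and custom functions such as lambdas. A sketch on made-up data with the same column names as the mouse dataset:

```python
import pandas as pd

# invented values, for illustration only
mice = pd.DataFrame(
    {
        "Treated": [True, True, False, False],
        "Weight": [20.0, 24.0, 26.0, 30.0],
    }
)

summary = mice.groupby("Treated")["Weight"].agg(
    n="count",
    mean_weight="mean",
    weight_range=lambda w: w.max() - w.min(),  # custom function on each group
)
print(summary)
```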

## Exercise

The file "neuroblastoma.tsv", in the `data` folder, contains a few clinical parameters and the expression of 3 genes (MYCN, ALK, and TP53) for 20 neuroblastoma patients, randomly selected from a larger set.

Read the file with `pd.read_csv()`, setting the appropriate separator through the argument `sep`, and save it to `nb_df`.

In [None]:
nb_df = pd.read_csv(DATADIR / "neuroblastoma.tsv", sep="\t")
nb_df.head()

1. Create a Series `avg_surv` containing the average survival time (`os_years`) broken down by the neuroblastoma INSS staging `inss_stage`. What stage is associated with the worst prognosis, on average?

In [None]:
avg_surv = nb_df.groupby("inss_stage")["os_years"].mean().sort_values(ascending=True)
avg_surv

2. What are the minimum and maximum survival times for each `age_group`? (age at diagnosis) Create a DataFrame whose index is the age group category from the dataset and whose values are the min and max values thereof.

In [None]:
surv_extremes = nb_df.groupby("age_group")["os_years"].agg(["min", "max"])
surv_extremes

3. Create a `Series` whose index is the `high_risk` status and whose values are the average expression value of TP53 for each value of `high_risk`.

In [None]:
hr_mean_expr = nb_df.groupby("high_risk")["TP53"].mean()
hr_mean_expr

---

## Combining data

Combining data is another part of your typical data analysis workflow. For example, you have sample IDs and clinical data in one file, and gene expression value for those samples in another file, and you want to combine those two files in the most robust way.

Merging, or joining, data is more elaborate than a simple concatenation: it resembles a database join, where you combine one or more tables based on common data values.

Pandas has a `.merge()` method to perform this kind of operation.

The syntax is `left.merge(right)`, meaning that a `left` dataframe will be merged with the `right` dataframe.

In [None]:
sites = pd.read_csv(DATADIR / "survey_site.csv")
visited = pd.read_csv(DATADIR / "survey_visited.csv")

In [None]:
sites

In [None]:
visited

Let's merge `sites` ("left") and `visited` ("right"), using the common column `site`.

### One-to-one merge

This kind of merge works when there are no duplicate values in the joining columns.

In [None]:
sites.merge(visited, left_on="site", right_on="site")

See how the resulting dataframe has the first columns from the "left" dataframe.

The optional argument `how` determines the type of join:

- `how="inner"` (default), use intersection of keys from both dataframes
- `how="outer"`, use union of keys from both dataframes
- `how="left"` / `how="right"`, use only keys from left (right) dataframe
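
The effect of `how` is easiest to see when the two dataframes share only some keys. Below is a sketch on two made-up tables (they are not the survey files); the optional argument `indicator=True` adds a `_merge` column telling where each row comes from.

```python
import pandas as pd

# invented tables: site "A" is only in left_df, site "D" only in right_df
left_df = pd.DataFrame({"site": ["A", "B", "C"], "lat": [-49.85, -47.15, -48.87]})
right_df = pd.DataFrame({"site": ["B", "C", "D"], "visit_id": [619, 622, 734]})

inner = left_df.merge(right_df, on="site")  # default: how="inner"
outer = left_df.merge(right_df, on="site", how="outer", indicator=True)
left_join = left_df.merge(right_df, on="site", how="left")

print(inner)      # only B and C
print(outer)      # A, B, C, and D, with NaN where the information is missing
print(left_join)  # A, B, and C; visit_id is NaN for A
```

When the joining column has the same name in both dataframes, `on="site"` is a shorthand for `left_on="site", right_on="site"`.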

### Many-to-one

If the joining column contains duplicates in one of the two dataframes, you obtain a "many-to-one" merge: each row of the dataframe with unique keys is matched to all the corresponding rows of the other dataframe and replicated as needed.
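
Here is a sketch on made-up data, where each site appears once in the left dataframe and possibly several times in the right one. The optional `validate` argument of `.merge()` checks that the keys have the uniqueness you expect; in its vocabulary, unique keys on the left and repeated keys on the right is called `"one_to_many"`.

```python
import pandas as pd

# invented tables: site "A" was visited twice
site_info = pd.DataFrame({"site": ["A", "B"], "lat": [-49.85, -47.15]})
visits = pd.DataFrame({"visit_id": [1, 2, 3], "site": ["A", "A", "B"]})

merged = site_info.merge(visits, on="site", validate="one_to_many")
print(merged)  # the latitude of site A is repeated for both visits

# asking for a one-to-one merge on the same data raises an error
one_to_one_failed = False
try:
    site_info.merge(visits, on="site", validate="one_to_one")
except pd.errors.MergeError as err:
    one_to_one_failed = True
    print("MergeError:", err)
```

Using `validate` is a cheap safeguard against unexpected duplicates, which would otherwise silently inflate the number of rows.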

## Exercise

You have to merge clinical information (stored in the file `nb_clinical.txt`) with gene expression (provided in the file `nb_expr.txt`) into a single dataframe.

Start by reading the two files into dataframes, figuring out the file format and the separator.

Have a look at the data and check the sizes.

Merge the two dataframes. Clean up the merged dataframe by dropping redundant columns, if any.

Finally, impute missing values to 0.

In [None]:
df_clinical = pd.read_csv(DATADIR / "nb_clinical.txt")
df_expr = pd.read_csv(DATADIR / "nb_expr.txt", sep="\t")

In [None]:
df_clinical

In [None]:
df_expr

We don't have clinical information for all of the samples (real-world scenario). We should perform an inner join to keep only common samples.
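
If you want to see which samples lack a match before discarding them, one option is an outer join with `indicator=True`. The sketch below uses tiny stand-in tables with made-up values (`clin` and `expr` are hypothetical names, not the exercise files).

```python
import pandas as pd

# toy stand-ins for the clinical and expression tables (hypothetical values)
clin = pd.DataFrame({"sample_id": ["s1", "s2"], "age": [3, 5]})
expr = pd.DataFrame({"ID": ["s1", "s2", "s3"], "geneA": [0.1, 0.4, 0.9]})

# an outer join with indicator=True adds a "_merge" column that records
# where each row came from: "both", "left_only" or "right_only"
check = clin.merge(expr, how="outer", left_on="sample_id", right_on="ID", indicator=True)

# IDs of the rows that did not find a partner in the other table
unmatched = check.loc[check["_merge"] != "both", "ID"].tolist()
print(unmatched)
```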

In [None]:
df_merged = df_clinical.merge(df_expr, how="inner", left_on="sample_id", right_on="ID")

In [None]:
df_merged.head()

In [None]:
df_merged = df_merged.drop("ID", axis=1)
# fillna() returns a new dataframe: assign it to keep the imputed values
df_merged = df_merged.fillna(0)
df_merged

In [None]:
# alternate method
df_clinical = pd.read_csv(DATADIR / "nb_clinical.txt", index_col="sample_id")
df_expr = pd.read_csv(DATADIR / "nb_expr.txt", sep="\t", index_col="ID")
df_merged = df_clinical.merge(df_expr, how="inner", left_index=True, right_index=True)
df_merged = df_merged.fillna(0)
df_merged

---

## Breast Cancer Detection Data Set

source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

The data comprises features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. For now, there is no need to fully understand where the variables come from or their units: the focus here is to explore what we can do with Pandas and understand how easy it is to apply these tools to any kind of data set.

In [None]:
# (re)import libraries in case you start from scratch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# compose paths and filenames using pathlib, which is operating system-agnostic
# then load the data using Pandas
DATADIR = Path("data")
DATAFILE = DATADIR / "breast_cancer_diagnostic_data.csv"
bc_data = pd.read_csv(DATAFILE)
bc_data

Other parameters can be used inside `read_csv()` to deal with different types of data: for example, different separators, unwanted rows or columns, variable names, and more.

```python
# to skip the first 100 lines of the file (the header line included)
pd.read_csv(DATAFILE, skiprows=100)

# if the dataset has no header
pd.read_csv(DATAFILE, header=None)
```
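
A few more of these parameters can be tried on a small in-memory table, without touching any file. This is only a sketch: the table content is made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# a small semicolon-separated table held in memory (made-up content)
raw = "id;diagnosis;area\n1;M;1001.0\n2;B;566.3\n3;M;1203.0\n"

# sep sets the delimiter, usecols keeps only some columns,
# index_col picks the column to use as the row index
df = pd.read_csv(io.StringIO(raw), sep=";", usecols=["id", "area"], index_col="id")
print(df)

# header=None together with names=... supplies column names
# for a file that has no header line
no_header = pd.read_csv(
    io.StringIO("1;M\n2;B\n"), sep=";", header=None, names=["id", "diagnosis"]
)
print(no_header)
```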

Our data set contains 569 samples (patients), categorical data (diagnosis result) and numerical data (31 attributes concerning the tumor shape and size and the id of each patient). We are now ready to start exploring and manipulating the data as intended. The cells below show a few examples of simple operations that can be applied with just a single line of code.

In [None]:
# check the dimensions
bc_data.shape

In [None]:
# check the header
bc_data.columns

In [None]:
# check the index
bc_data.index

In [None]:
# check the last n rows
bc_data.tail(5)

## Review Exercise

Index this dataframe by sample ID, instead of the default indexing. You can do that in two ways: by using `set_index()` on the dataframe, or by re-reading it from file specifying which column should be used as the index. Then, use indexing/slicing methods to perform the operations in the cells below.

In [None]:
# use your preferred method to index the df by sample ID



In [None]:
# check the contents of the column "area_mean"
# (output: list/array containing column values for all rows)



In [None]:
# create a subset of the data with the columns "area_mean", "perimeter_mean", "texture_mean"
# (output: a dataframe containing selected columns)



In [None]:
# check the information of a given patient by row index, i.e. select row 100



In [None]:
# get a slice of the table by row index
# (output: a subtable containing row indexes from 9 (incl.) to 90 (incl.))



In [None]:
# drop columns from the table without overwriting the original data
# (output: table without column "compactness_mean")



In [None]:
# select slices of rows and columns simultaneously by their index:
# - select row indexes 50 to 70 (incl., excl.)
# - select column indexes 5 to 10 (incl., excl.)



In [None]:
# check for missingness



In [None]:
# drop columns containing any number of missing values,
# saving the result to a new dataframe



In [None]:
# create a dataframe named "bc_data_subset" containing all the rows and only the first 7 columns



In [None]:
# rename the columns with better/shorter names,
# i.e., without the _mean suffix



In [None]:
# select patients that showed tumors with an area greater than 1000



In [None]:
# of those tumors with an area greater than 1000, how many had a perimeter greater than 180?



In [None]:
# what is the mean tumor area of patients for each diagnosis (Benign/Malignant)?



In [None]:
# and the median?



In [None]:
# what is the median value of all parameters for each level of diagnosis?



---

## Export & import data

It is very common to export (save) data while we process them, either at the end of the processing workflow or as intermediate steps. It is perfectly fine to save intermediate files, especially while you are setting up and fine-tuning the preprocessing steps.

In any case, the data sets you save can be used as inputs for the downstream analysis (e.g. statistical modeling, machine learning, visualization).

We did a bit of preprocessing on our BC data: so, we have now reached a quintessential "data export" moment.

Pandas offers convenient methods to export data in different formats: you apply these methods directly on DataFrames (or Series). They are named after the output format, e.g. `.to_<format>()`.

### Pickle

Pickle is Python's serialized format. I would say it is the counterpart of R's RDS format (e.g. `saveRDS()`).

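Pickle files are binary: if you open them in a text editor, you'll see a lot of weird characters. The format is specific to Python and restores the dataframe exactly as it was saved, dtypes and index included. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.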

Common file extensions for pickle files are `.pkl` and `.pickle`.

In [None]:
# save processed dataframe to pickle
bc_data_subset.to_pickle(DATADIR / "bc_new_table.pickle")

In [None]:
# read back to a dataframe
bc_pickle = pd.read_pickle(DATADIR / "bc_new_table.pickle")
bc_pickle

### CSV/TSV

Comma-separated values, or tab-separated values, are textual file formats. They are the most flexible storage type: any text editor or program can open this kind of file. You can share them with everyone.

On the downside, CSVs are usually slower to read and write, and bigger, than binary formats.

Pandas Series and DataFrames have the `.to_csv()` method to write delimiter-separated values: it defaults to comma as the separator, but you can change this with the `sep` argument (e.g. `sep="\t"` for TSVs).

By default, the `.index` of a DataFrame is written to the CSV. If the DataFrame has a *named index*, like our `bc_data_subset` here, this is not a problem: the index is saved as the first column, with its name. If, however, the index has no name, the first column of the output file will have no name either, which may create problems when reading the CSV back into Pandas. To overcome this, you have two options:

- you set `index=False` when you save the file
- you save the file with `index=True` (default), and you read it back using `pd.read_csv(..., index_col=0)` to tell Pandas that the first column holds the index.
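
The two options can be compared on a tiny dataframe with an unnamed index. This is a minimal sketch with made-up values: calling `.to_csv()` without a path returns the CSV text as a string, and `io.StringIO` lets us read it back without writing a file.

```python
import io
import pandas as pd

# a small dataframe with an unnamed default index (made-up values)
df = pd.DataFrame({"area": [1001.0, 566.3], "diagnosis": ["M", "B"]})

# saving with the default index=True and reading back naively
# produces an extra column called "Unnamed: 0"
naive = pd.read_csv(io.StringIO(df.to_csv()))
print(naive.columns.tolist())

# option 1: do not write the index at all
opt1 = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# option 2: write the index, then declare it when reading
opt2 = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
print(opt2)
```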

For `bc_data_subset`, we'll just set the separator to tab and then use the other default values.

In [None]:
# save dataframe to TSV
bc_data_subset.to_csv(DATADIR / "bc_new_table.txt", sep="\t")

### Excel

Let's face it: Excel has a bad reputation in the data science community. Some of its obvious limitations include potential issues due to color-coding information, weird datetime conversions, and so on. 

Still, Excel is probably the most commonly used file format: so, we are not here to blame it but to learn how to export to or import from Excel format, in case you have to collaborate with people who use it - while you are learning a cool alternative tool for data analytics with Pandas and Python :)

Before reading and saving Excel files with Pandas, you need to install the `openpyxl` library. Copy-paste the following into the code cell below and run it:

```
!conda install -y -c conda-forge openpyxl
```

If you are a Windows user, replace `conda` with `conda.exe`.

In [None]:
bc_data_subset.to_excel(DATADIR / "bc_new_table.xlsx", sheet_name="BC data")

In [None]:
bc_excel = pd.read_excel(DATADIR / "bc_new_table.xlsx", sheet_name="BC data")

### Feather

[Feather](https://arrow.apache.org/docs/python/feather.html) is another binary format, similar to Pickle, with the advantage that it can be read by other languages, like R. It is also faster than CSV. Feather is part of the Apache Arrow project.

Again, you'll probably need to install a dependency:

```
!conda install -y -c conda-forge pyarrow
```

In [None]:
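# older Pandas versions require a default index here: if you get a ValueError, call .reset_index() first
bc_data_subset.to_feather(DATADIR / "bc_new_table.feather")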

In [None]:
bc_feather = pd.read_feather(DATADIR / "bc_new_table.feather")
bc_feather

### Dictionary

You can also convert a Series or DataFrame into a Python dictionary. This is convenient if you have to work on the data with Python, but outside Pandas.

In [None]:
# convert just the first rows
bc_dict = bc_data_subset.head().to_dict()
print(bc_dict)

Printing a raw dictionary is quite ugly, isn't it? 

What better opportunity to showcase Python's "pretty print" function, from the library `pprint`!

In [None]:
import pprint
pprint.pprint(bc_dict)
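
The layout of the resulting dictionary can be changed with the `orient` argument of `.to_dict()`. The sketch below shows three common layouts on a tiny made-up dataframe (the patient labels `p1` and `p2` are hypothetical).

```python
import pandas as pd

# a small dataframe with made-up values and a custom index
df = pd.DataFrame(
    {"area": [1001.0, 566.3], "diagnosis": ["M", "B"]},
    index=["p1", "p2"],
)

# default layout: {column: {index: value}}
by_column = df.to_dict()

# one entry per row: {index: {column: value}}
by_row = df.to_dict(orient="index")

# a list with one dictionary per row; the index is dropped
records = df.to_dict(orient="records")

print(by_column)
print(by_row)
print(records)
```

The `"records"` layout is often the handiest when you need to loop over rows in plain Python.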

---

# Credits

Partially abridged from great work by Paulo Caldas https://github.com/paulocaldas, Samraat Pawar (MIT license), Center for Computational Biomedicine (Harvard Medical School), and others.