# PyDESeq2 pipeline

This notebook gives a minimal example of how to perform differential expression analysis (DEA) using PyDESeq2.

It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository.

In [None]:
import os
import pickle as pkl

from pydeseq2.DeseqDataSet import DeseqDataSet
from pydeseq2.DeseqStats import DeseqStats
from pydeseq2.utils import load_example_data

In [None]:
SAVE = False  # whether to save the outputs of this notebook

## Data loading

In [None]:
OUTPUT_PATH = "../output_files/synthetic_example"  # Replace this with the path where you wish to save outputs
os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist

In [None]:
counts_df = load_example_data(
    modality="raw_counts",
    dataset="synthetic",
    debug=False,
)

In [None]:
clinical_df = load_example_data(
    modality="clinical",
    dataset="synthetic",
    debug=False,
)

In [None]:
counts_df

In [None]:
clinical_df

Filter out genes that have fewer than 10 read counts in total.
There shouldn't be any in the synthetic dataset, but pre-filtering genes is good practice in general.

In [None]:
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
len(genes_to_keep)

In [None]:
counts_df = counts_df[genes_to_keep]
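
To make the filtering step concrete, here is a small sketch on a made-up count matrix. The `toy_counts` DataFrame and its gene and sample names are hypothetical and are not part of the synthetic dataset. As in `counts_df`, rows are samples and columns are genes.

```python
import pandas as pd

# Hypothetical toy count matrix: rows are samples, columns are genes.
toy_counts = pd.DataFrame(
    {"gene1": [5, 8, 12], "gene2": [0, 1, 2], "gene3": [3, 3, 4]},
    index=["sample1", "sample2", "sample3"],
)

# Total read counts per gene (sum over samples, i.e. down each column).
totals = toy_counts.sum(axis=0)

# Keep only the genes whose total count is at least 10.
kept_genes = toy_counts.columns[totals >= 10]
filtered = toy_counts[kept_genes]

print(list(filtered.columns))
```

Here `gene2` is dropped because it has only 3 counts in total, while `gene3` is kept because its total is exactly 10.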

## 1 - Read counts modeling with the `DeseqDataSet` class

We start by creating a `DeseqDataSet` object from the count and clinical data. Here we use 8 threads; feel free to adapt this to your setup, or set `n_cpus` to `None` to use all available CPUs.

Another option of interest is the `refit_cooks` argument (set to `True` by default), which controls whether Cook's outliers should be refitted. Refitting is generally advised.

Note: in the case of the provided synthetic data, there won't be any Cook's outliers.

In [None]:
# Start by creating a DeseqDataSet
dds = DeseqDataSet(
    counts_df,
    clinical_df,
    design_factors="condition",  # compare samples based on the "condition" column ("B" vs "A")
    refit_cooks=True,
    n_cpus=8,
)
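
If you are unsure how many threads your machine offers, one possible way to pick a value for `n_cpus` is sketched below. The cap of 8 simply mirrors the value used in this notebook, and the variable names are illustrative.

```python
import os

# os.cpu_count() may return None when the count cannot be determined,
# so fall back to a single CPU in that case.
available_cpus = os.cpu_count() or 1

# Use at most 8 threads, and never more than the machine provides.
n_cpus = min(8, available_cpus)
print(n_cpus)
```

The resulting value could then be passed as `n_cpus=n_cpus` instead of the hard-coded 8.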

In [None]:
# Then, run DESeq2 on it
dds.deseq2()

In [None]:
if SAVE:
    with open(os.path.join(OUTPUT_PATH, "dds.pkl"), "wb") as f:
        pkl.dump(dds, f)
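
A saved object can be read back with `pkl.load`. The sketch below shows the save-and-reload round trip on a stand-in dictionary written to a temporary directory, so that it runs without a fitted `dds`. The `results` dictionary is hypothetical.

```python
import os
import pickle as pkl
import tempfile

# Stand-in for a fitted object; any picklable Python object works the same way.
results = {"gene1": 0.01, "gene2": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dds.pkl")

    # Save, as done in the cell above.
    with open(path, "wb") as f:
        pkl.dump(results, f)

    # Reload, e.g. in a later session.
    with open(path, "rb") as f:
        reloaded = pkl.load(f)

print(reloaded == results)
```

Two general caveats about pickle apply: the class of the saved object must be importable when loading, and pickle files should only be loaded from sources you trust.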

## 2 - Statistical analysis with the `DeseqStats` class

The `DeseqStats` class has a single mandatory argument, `dds`, which should be a *fitted* `DeseqDataSet` object, as well as a set of optional keyword arguments, among which:

- `alpha`: the p-value and adjusted p-value significance threshold (0.05 by default),
- `cooks_filter`: whether to filter p-values based on Cook's outliers (`True` by default),
- `independent_filter`: whether to perform independent filtering to correct p-value trends (`True` by default).

In [None]:
stat_res = DeseqStats(dds, n_cpus=8)

### Wald test

The `summary` method runs the statistical analysis (multiple testing adjustment included) and returns a summary DataFrame.

In [None]:
stat_res.summary()
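
To give an intuition for what this step computes, here is a simplified, standard-library-only sketch of a two-sided Wald test followed by Benjamini-Hochberg adjustment, the multiple testing procedure used by default in DESeq2. It is not PyDESeq2's implementation: the function names and input values are hypothetical, and Cook's filtering and independent filtering are left out.

```python
import math


def wald_pvalue(lfc, lfc_se):
    """Two-sided p-value of the Wald statistic z = lfc / lfc_se."""
    z = lfc / lfc_se
    # For a standard normal variable, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical log-fold changes and standard errors for four genes.
lfcs = [1.0, 0.2, -1.5, 0.05]
lfc_ses = [0.5, 0.4, 0.3, 0.5]

pvalues = [wald_pvalue(lfc, se) for lfc, se in zip(lfcs, lfc_ses)]
padj = bh_adjust(pvalues)

for p, q in zip(pvalues, padj):
    print(f"p-value = {p:.4g}, adjusted p-value = {q:.4g}")
```

A gene would then be called significant when its adjusted p-value falls below the `alpha` threshold.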

In [None]:
if SAVE:
    with open(os.path.join(OUTPUT_PATH, "stat_results.pkl"), "wb") as f:
        pkl.dump(stat_res, f)

### LFC shrinkage

For visualization or post-processing purposes, it can be useful to shrink the log-fold changes (LFC). This is implemented by the `lfc_shrink` method.

In [None]:
stat_res.lfc_shrink()

In [None]:
if SAVE:
    with open(os.path.join(OUTPUT_PATH, "shrunk_stat_results.pkl"), "wb") as f:
        pkl.dump(stat_res, f)