# PyDESeq2 pipeline

This notebook gives a minimal example of how to perform differential expression analysis (DEA) using PyDESeq2.

It allows you to run the PyDESeq2 pipeline either on synthetic data or on one of the following TCGA datasets:
- TCGA-BRCA
- TCGA-COAD
- TCGA-LUAD
- TCGA-LUSC
- TCGA-PAAD
- TCGA-PRAD
- TCGA-READ
- TCGA-SKCM

Running this pipeline takes roughly 5 to 10 minutes, depending on your setup and the chosen dataset.

In [None]:
import os
import pickle as pkl

import numpy as np
import pandas as pd

from pydeseq2.DeseqDataSet import DeseqDataSet
from pydeseq2.DeseqStats import DeseqStats
from pydeseq2.utils import load_data

In [None]:
SAVE = False  # whether to save the outputs of this notebook

## Data loading

See the `datasets` README for the required data organization.

In [None]:
CANCER = "synthetic"  # or 'TCGA-BRCA', 'TCGA-COAD', etc.

In [None]:
OUTPUT_PATH = f"../output_files/{CANCER}"
os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist

In [None]:
counts_df = load_data(
    modality="raw_counts",
    cancer_type=CANCER,
    debug=False,
)

In [None]:
clinical_df = load_data(
    modality="clinical",
    cancer_type=CANCER,
    debug=False,
)

In [None]:
counts_df

For TCGA datasets, remove samples for which `high_grade` is NaN.

In [None]:
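if CANCER != "synthetic":
    # Boolean mask over samples: True where `high_grade` is available
    samples_to_keep = ~clinical_df.high_grade.isna()
    print(f"Keeping {samples_to_keep.sum()} of {len(samples_to_keep)} samples")
    # Apply the same mask to both tables so that they stay aligned
    counts_df = counts_df.loc[samples_to_keep]
    clinical_df = clinical_df.loc[samples_to_keep]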
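
To make the effect of this filtering step concrete, here is a minimal sketch on toy data. The sample names, gene names and values are made up for illustration. The only assumption taken from the notebook is that both tables are indexed by sample, which is what lets a single boolean mask filter both of them.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 4 samples x 2 genes, one missing annotation.
toy_counts = pd.DataFrame(
    {"gene1": [5, 0, 12, 7], "gene2": [3, 9, 1, 4]},
    index=["s1", "s2", "s3", "s4"],
)
toy_clinical = pd.DataFrame(
    {"high_grade": [1.0, np.nan, 0.0, 1.0]},
    index=toy_counts.index,
)

# Boolean mask indexed by sample: True where the annotation is present.
toy_mask = ~toy_clinical["high_grade"].isna()

# Apply the same mask to both tables so that they stay aligned.
toy_counts = toy_counts.loc[toy_mask]
toy_clinical = toy_clinical.loc[toy_mask]

print(list(toy_counts.index))  # ['s1', 's3', 's4']
```

Because the mask is built from the clinical table and applied to both tables, the two indices remain identical after filtering.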

Filter out genes that have fewer than 10 read counts in total.

In [None]:
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
len(genes_to_keep)

In [None]:
counts_df = counts_df[genes_to_keep]
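
The same filter can be checked on a toy count matrix. This is only an illustrative sketch with made-up gene names and values. It shows that the threshold is inclusive: a gene with exactly 10 reads in total is kept.

```python
import pandas as pd

# Toy count matrix (hypothetical): 3 samples (rows) x 3 genes (columns).
toy_counts = pd.DataFrame(
    {"geneA": [0, 2, 1], "geneB": [4, 3, 3], "geneC": [50, 60, 70]},
    index=["s1", "s2", "s3"],
)

# Column sums give the total number of reads per gene across all samples.
total_per_gene = toy_counts.sum(axis=0)

# Keep genes with at least 10 reads in total (geneA has 3, geneB exactly 10).
kept_genes = toy_counts.columns[total_per_gene >= 10]
filtered_counts = toy_counts[kept_genes]

print(list(filtered_counts.columns))  # ['geneB', 'geneC']
```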

## 1 - Read counts modeling with the `DeseqDataSet` class

In [None]:
# Start by creating a DeseqDataSet
dds = DeseqDataSet(
    counts_df,
    clinical_df,
    design_factor="condition" if CANCER == "synthetic" else "high_grade",
    n_cpus=8,
)

In [None]:
# Then, run DESeq2 on it
dds.deseq2()
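
For intuition, the DESeq2 method normalizes samples for sequencing depth with median-of-ratios size factors before estimating dispersions and log-fold changes. The sketch below illustrates that normalization on toy data with plain NumPy. It is a simplified illustration, not PyDESeq2's own implementation, and all variable names are hypothetical.

```python
import numpy as np

# Toy counts (hypothetical): 3 samples (rows) x 4 genes (columns).
# Sample 2 is sample 1 sequenced twice as deep, sample 3 four times as deep.
toy_counts = np.array(
    [
        [10, 20, 30, 40],
        [20, 40, 60, 80],
        [40, 80, 120, 160],
    ],
    dtype=float,
)

# Per-gene log geometric mean across samples.
with np.errstate(divide="ignore"):
    log_counts = np.log(toy_counts)
log_geo_means = log_counts.mean(axis=0)

# Ignore genes with a zero count in any sample (log geometric mean is -inf).
usable = np.isfinite(log_geo_means)

# Size factor of a sample: median, over usable genes, of count / geometric mean.
log_ratios = log_counts[:, usable] - log_geo_means[usable]
size_factors = np.exp(np.median(log_ratios, axis=1))

# Dividing by the size factors puts all samples on a common scale.
normalized = toy_counts / size_factors[:, None]

print(size_factors)
```

In this toy example the three samples differ only in depth, so the normalized rows coincide.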

In [None]:
if SAVE:
    with open(os.path.join(OUTPUT_PATH, "dds.pkl"), "wb") as f:
        pkl.dump(dds, f)

## 2 - Statistical analysis with the `DeseqStats` class

### Wald test

In [None]:
# Build the statistical analysis object from the fitted DeseqDataSet
stat_res = DeseqStats(dds, n_cpus=8)

In [None]:
stat_res.summary()
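
As a rough illustration of what a Wald test computes for a single gene: the estimated log-fold change is divided by its standard error, and the resulting statistic is compared to a standard normal distribution. P-values are then typically adjusted for multiple testing with the Benjamini-Hochberg procedure. The sketch below uses only the standard library. The helper names `wald_p_value` and `benjamini_hochberg` are hypothetical and not part of PyDESeq2, which may apply additional filtering steps not shown here.

```python
import math


def wald_p_value(lfc, se):
    """Two-sided p-value of the Wald statistic lfc / se under N(0, 1)."""
    z = lfc / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log-fold changes and standard errors for three genes.
p_vals = [wald_p_value(lfc, se) for lfc, se in [(2.0, 0.5), (0.1, 0.4), (-1.2, 0.6)]]
padj = benjamini_hochberg(p_vals)
print(p_vals)
print(padj)
```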

In [None]:
if SAVE:
    with open(os.path.join(OUTPUT_PATH, "stat_results.pkl"), "wb") as f:
        pkl.dump(stat_res, f)

### LFC shrinkage

In [None]:
# Shrink the log-fold change estimates
stat_res.lfc_shrink()

In [None]:
if SAVE:
    with open(os.path.join(OUTPUT_PATH, "shrunk_stat_results.pkl"), "wb") as f:
        pkl.dump(stat_res, f)
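
When `SAVE` is enabled, the pickled objects can be read back in a later session with `pkl.load`. The following is a minimal round-trip sketch: a plain dictionary stands in for `dds` or `stat_res`, and a temporary directory stands in for `OUTPUT_PATH`, so the snippet runs without PyDESeq2 or any data files.

```python
import os
import pickle as pkl
import tempfile

# Stand-in for a results object (hypothetical content).
results = {"gene1": {"log2FoldChange": 1.5, "padj": 0.01}}

with tempfile.TemporaryDirectory() as output_path:
    file_path = os.path.join(output_path, "stat_results.pkl")

    # Save, as done in the notebook when SAVE is True.
    with open(file_path, "wb") as f:
        pkl.dump(results, f)

    # Reload in binary read mode.
    with open(file_path, "rb") as f:
        reloaded = pkl.load(f)

print(reloaded == results)  # True
```

Loading a pickled PyDESeq2 object requires the package to be importable in the loading session, and pickle files should only be loaded from trusted sources.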