# Compositional analysis

A central question in scRNA-seq experiments is whether cell-type proportions differ between
conditions. This seemingly simple question is technically challenging because single-cell data
are compositional: the number of cells profiled per sample is limited, so if the abundance of one
cell-type increases, the observed proportions of all other cell-types necessarily decrease.
On top of that, cell-type proportions are not represented in an unbiased manner
in scRNA-seq data because, depending on the protocol, different cell-types are captured with different
efficiencies {cite}`Lambrechts2018, salcherHighresolutionSinglecellAtlas2022a`.
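
The compositional effect described above can be illustrated with a toy calculation. The cell-type names and counts below are invented for illustration only; the sketch assumes that only the B cell population grows in the tissue and that we observe fractions rather than absolute numbers.

```python
# Toy illustration of compositionality: only the B cell population grows,
# yet every other observed fraction shrinks.
true_counts_before = {"T cell": 500, "B cell": 250, "Macrophage": 250}
true_counts_after = {**true_counts_before, "B cell": 750}


def observed_fractions(counts):
    """Return the fraction of each cell-type among all cells."""
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


before = observed_fractions(true_counts_before)
after = observed_fractions(true_counts_after)

for cell_type in true_counts_before:
    print(f"{cell_type}: {before[cell_type]:.2f} -> {after[cell_type]:.2f}")
```

Although the absolute numbers of T cells and macrophages are unchanged, their fractions drop. A naive per-cell-type test on fractions would therefore report spurious decreases.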

:::{note}
Several alternative methods are available for comparing compositional data. scCODA
{cite}`buttnerScCODABayesianModel2021` is a Bayesian model for compositional data analysis that models the uncertainty of cell-type fractions of each sample. 
It requires the definition of a reference cell-type that is assumed to
be unchanged between conditions. tascCODA {cite}`ostnerTascCODABayesianTreeAggregated2021` is an extension of the scCODA model that additionally takes the hierarchical relationships of cell lineages into account. Propeller {cite}`phipsonPropellerTestingDifferences2022a` uses a log-linear model to model cell-type proportions and was demonstrated to have high statistical power with few biological replicates. Finally, sccomp {cite}`mangiolaRobustDifferentialComposition2022` provides a highly flexible statistical framework that considers the presence of outliers and models group-specific variability of cell-type proportions.

Another group of tools works independently of discrete cell-types and is useful for finding more
subtle changes in functional states based on the cell × cell neighborhood graph. DA-seq {cite}`zhaoDetectionDifferentiallyAbundant2021`
computes a differential abundance (DA)-score for each cell, based on the prevalence of conditions
in neighborhoods of multiple sizes using a logistic regression classifier. Similarly, Milo {cite}`dannDifferentialAbundanceTesting2022` tests
whether cells from a certain condition are over-represented in certain parts of the neighborhood graph.
Because its statistics are based on a GLM with a negative binomial noise model, it allows for
flexible modeling of experimental designs and covariates.

There is a [dedicated chapter](https://www.sc-best-practices.org/conditions/compositional.html) in the [single-cell best practices book](https://www.sc-best-practices.org) that provides additional information on compositional analyses.
:::

In {cite:t}`salcherHighresolutionSinglecellAtlas2022a`, we used scCODA for comparing cell-type fractions. In this section, we demonstrate how to apply it to a single-cell atlas. In this example, we are going to compare cell-type fractions between the two tumor types, *LUAD* and *LUSC*.

## 1. Import the required libraries

In [1]:
import os

# Set tensorflow logging level to only display warnings and errors
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

import scanpy as sc
import altair as alt
import pandas as pd
import sccoda.util.cell_composition_data as scc_dat
import sccoda.util.comp_ana as scc_ana
import tensorflow as tf

tf.random.set_seed(0)



## 2. Load the input data

In [2]:
adata_path = "../../data/input_data_zenodo/atlas-integrated-annotated.h5ad"

In [3]:
adata = sc.read_h5ad(adata_path)

## 3. Compute cell-type count matrix

As a first step, we compute the number of cells per sample and cell-type. This matrix is the basis for both qualitative visualization and quantitative analysis using scCODA.

1. Subset the AnnData object to the samples of interest, in our case only primary tumor samples from patients with LUAD or LUSC.

In [4]:
adata_subset = adata[
    (adata.obs["origin"] == "tumor_primary")
    & adata.obs["condition"].isin(["LUAD", "LUSC"]),
    :,
]

2. Create a DataFrame with counts per cell-type using pandas

In [5]:
cells_per_patient = (
    adata_subset.obs.groupby(
        # groupby needs to include all covariates of interest, the column with
        # the biological replicate (patient) and the cell-type
        ["dataset", "condition", "tumor_stage", "patient", "cell_type_coarse"],
        observed=True,
    )
    .size()
    .unstack(fill_value=0)
)

In [6]:
cells_per_patient

| dataset | condition | tumor_stage | patient | B cell | T cell | Epithelial cell | Macrophage/Monocyte | Mast cell | Plasma cell | cDC | Stromal | NK cell | Endothelial cell | pDC | Neutrophils |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lambrechts_Thienpont_2018_6653 | LUSC | early | Lambrechts_Thienpont_2018_6653_7 | 405 | 759 | 718 | 191 | 125 | 160 | 56 | 22 | 127 | 14 | 4 | 1 |
| Lambrechts_Thienpont_2018_6653 | LUAD | advanced | Lambrechts_Thienpont_2018_6653_6 | 994 | 2620 | 303 | 274 | 68 | 251 | 39 | 27 | 202 | 27 | 6 | 0 |
| Maynard_Bivona_2020 | LUAD | early | Maynard_Bivona_2020_TH238 | 9 | 300 | 325 | 115 | 111 | 41 | 100 | 146 | 11 | 62 | 5 | 3 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH226 | 7 | 284 | 575 | 371 | 112 | 91 | 28 | 300 | 33 | 113 | 8 | 115 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH236 | 49 | 140 | 160 | 65 | 74 | 107 | 19 | 219 | 5 | 82 | 2 | 2 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH179 | 16 | 429 | 173 | 323 | 94 | 121 | 96 | 30 | 14 | 83 | 50 | 1 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH248 | 32 | 154 | 58 | 73 | 12 | 171 | 6 | 127 | 30 | 17 | 2 | 0 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH231 | 2 | 4 | 375 | 24 | 5 | 0 | 2 | 2 | 2 | 2 | 3 | 87 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH158 | 10 | 70 | 8 | 7 | 3 | 0 | 1 | 9 | 5 | 1 | 0 | 0 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH169 | 0 | 10 | 138 | 4 | 6 | 10 | 1 | 3 | 0 | 3 | 0 | 0 |
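
To make the structure of this count matrix explicit, here is a minimal plain-Python sketch of the same idea as the `groupby(...).size().unstack(fill_value=0)` step above. The patients `P1` and `P2` and their cells are hypothetical; the point is only that every patient × cell-type combination gets a count, with missing combinations filled with zero.

```python
from collections import Counter

# Hypothetical per-cell annotations: one (patient, cell type) pair per cell.
cells = [
    ("P1", "T cell"),
    ("P1", "T cell"),
    ("P1", "B cell"),
    ("P2", "T cell"),
    ("P2", "Mast cell"),
]

pair_counts = Counter(cells)
patients = sorted({patient for patient, _ in cells})
cell_types = sorted({cell_type for _, cell_type in cells})

# One row per patient, one column per cell type. Combinations that never
# occur become 0, which mirrors `.unstack(fill_value=0)`.
count_matrix = {
    patient: {ct: pair_counts.get((patient, ct), 0) for ct in cell_types}
    for patient in patients
}

for patient, row in count_matrix.items():
    print(patient, row)
```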


## 4. Visualization as bar chart

For a first (qualitative) impression, we can make a bar chart to compare cell-type fractions between conditions. 

1. Transform the count matrix from above into a table of average cell-type fractions per condition. We compute the mean cell-type fraction *per patient*, in order to give equal weight to each patient (this is particularly important if there are different numbers of cells per patient).

In [8]:
average_fractions_per_condition = (
    cells_per_patient.apply(lambda x: x / x.sum(), axis=1)
    .melt(ignore_index=False, value_name="frac")
    .reset_index()
    .groupby(["condition", "cell_type_coarse"], observed=True)
    .agg(mean_frac=pd.NamedAgg("frac", "mean"))
    .reset_index()
)

In [9]:
average_fractions_per_condition.head()

|   | condition | cell_type_coarse | mean_frac |
|---|---|---|---|
| 0 | LUSC | B cell | 0.085394 |
| 1 | LUSC | Endothelial cell | 0.003429 |
| 2 | LUSC | Epithelial cell | 0.296106 |
| 3 | LUSC | Macrophage/Monocyte | 0.083022 |
| 4 | LUSC | Mast cell | 0.028012 |
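
Why average per patient rather than pooling all cells? The following sketch uses two hypothetical patients with very different cell numbers (all values are invented) to show how the two summaries can diverge.

```python
# Hypothetical counts: patient A contributes ten times more cells than patient B.
counts = {
    "A": {"T cell": 900, "B cell": 100},
    "B": {"T cell": 10, "B cell": 90},
}


def per_patient_mean(counts, cell_type):
    """Average of the per-patient fractions: every patient has equal weight."""
    fractions = [c[cell_type] / sum(c.values()) for c in counts.values()]
    return sum(fractions) / len(fractions)


def pooled_fraction(counts, cell_type):
    """Fraction over all cells pooled: patients with more cells dominate."""
    numerator = sum(c[cell_type] for c in counts.values())
    denominator = sum(sum(c.values()) for c in counts.values())
    return numerator / denominator


mean_t = per_patient_mean(counts, "T cell")
pooled_t = pooled_fraction(counts, "T cell")
print(f"per-patient mean: {mean_t:.3f}, pooled: {pooled_t:.3f}")
```

The pooled estimate is pulled towards patient A simply because that patient has more cells, whereas the per-patient mean treats both biological replicates equally.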


2. Create a bar chart using `altair`: 

In [10]:
alt.Chart(average_fractions_per_condition).encode(
    y="condition", x="mean_frac", color="cell_type_coarse"
).mark_bar()

## 5. Quantitative analysis using scCODA

1.  Create an {class}`~anndata.AnnData` object for scCODA using {func}`~sccoda.util.cell_composition_data.from_pandas`

In [12]:
sccoda_data = scc_dat.from_pandas(
    cells_per_patient.reset_index(),
    # we need to specify all columns that do not contain cell-type counts as covariate columns
    covariate_columns=["patient", "dataset", "condition", "tumor_stage"],
)



2. Make "condition" a categorical column. The first category is considered the "base" category (i.e. the denominator of fold changes).

In [13]:
sccoda_data.obs["condition"] = pd.Categorical(
    sccoda_data.obs["condition"], categories=["LUSC", "LUAD"]
)
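
The role of the category order can be sketched with a small treatment (dummy) coding example. The helper `treatment_code` is hypothetical and written only for illustration; we assume that the formula interface used by scCODA follows this common convention, which is consistent with the coefficient name `condition[T.LUAD]` that appears in the results.

```python
def treatment_code(values, categories, name):
    """Build 0/1 indicator columns for every category except the first one.

    The first category is the base level and gets no column of its own.
    """
    base, *others = categories
    return {
        f"{name}[T.{level}]": [1 if value == level else 0 for value in values]
        for level in others
    }


conditions = ["LUAD", "LUSC", "LUAD"]

# LUSC listed first: LUSC is the base, the coefficient describes LUAD vs. LUSC.
design = treatment_code(conditions, categories=["LUSC", "LUAD"], name="condition")
print(design)

# Swapping the order flips the direction of the comparison.
design_swapped = treatment_code(
    conditions, categories=["LUAD", "LUSC"], name="condition"
)
print(design_swapped)
```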

3. Create the scCODA model. In `formula`, specify the condition and all covariates you would like to include in the model.

:::{important}
scCODA requires specifying a "reference cell-type" that is considered unchanged between the conditions. For studying the tumor microenvironment, we can set the reference cell-type to Epithelial cells/Tumor cells to capture changes in the tumor microenvironment.
:::

In [14]:
sccoda_mod = scc_ana.CompositionalAnalysis(
    sccoda_data,
    formula="condition + tumor_stage + dataset",
    # TODO consider changing reference cell-type to "tumor cells" in the final version
    reference_cell_type="Epithelial cell",
)

Zero counts encountered in data! Added a pseudocount of 0.5.


4. Perform Hamiltonian Monte Carlo (HMC) sampling to estimate coefficients

In [16]:
# TODO: suitable number of iterations
sccoda_res = sccoda_mod.sample_hmc(num_results=20000)



100%|██████████| 20000/20000 [04:04<00:00, 81.85it/s]


MCMC sampling finished. (312.484 sec)
Acceptance rate: 51.2%


5. Set the false-discovery-rate (FDR) level for the results.

:::{note}
A smaller FDR value will produce more conservative results, but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries. scCODA only computes a fold-change for cell-types that achieve an FDR smaller than the specified threshold. For more details, please refer to the [scCODA documentation](https://sccoda.readthedocs.io/en/latest/getting_started.html#Result-interpretation).
:::
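
To give an intuition for the trade-off described in the note, here is a minimal sketch of one common way to select effects from posterior inclusion probabilities at a given FDR level (the "direct posterior probability" approach). Whether scCODA's `set_fdr` follows exactly this procedure is an assumption on our part; the function `select_effects`, the cell-type names and the probabilities are invented for illustration.

```python
def select_effects(inclusion_probs, fdr):
    """Select the largest set of effects whose expected FDR stays below `fdr`.

    Effects are ranked by inclusion probability. For a selected set, the
    expected FDR is the mean of (1 - inclusion probability) over the set.
    """
    ranked = sorted(inclusion_probs.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    error_sum = 0.0
    for name, prob in ranked:
        candidate_error = error_sum + (1.0 - prob)
        # The running mean can only grow, so we can stop at the first violation.
        if candidate_error / (len(selected) + 1) > fdr:
            break
        error_sum = candidate_error
        selected.append(name)
    return selected


# Hypothetical posterior inclusion probabilities.
probs = {"B cell": 0.98, "Neutrophils": 0.90, "cDC": 0.60, "pDC": 0.30}

strict = select_effects(probs, fdr=0.05)
lenient = select_effects(probs, fdr=0.5)
print(strict)
print(lenient)
```

A strict threshold keeps only the most certain effect, while a lenient threshold such as 0.5 accepts all of them, including effects that are more likely absent than present.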

In [18]:
# TODO: suitable FDR based on final dataset
sccoda_res.set_fdr(0.5)

6. Inspect the results

In [19]:
sccoda_res.summary()

Compositional Analysis summary:

Data: 13 samples, 12 cell types
Reference index: 2
Formula: condition + tumor_stage + dataset

Intercepts:
                     Final Parameter  Expected Sample
Cell Type                                            
B cell                         0.059       150.580711
T cell                         1.408       580.273315
Epithelial cell                1.098       425.599697
Macrophage/Monocyte            0.378       207.161613
Mast cell                     -0.408        94.396050
Plasma cell                   -0.230       112.786790
cDC                           -0.508        85.413078
Stromal                       -0.561        81.004056
NK cell                       -0.368        98.248426
Endothelial cell              -0.810        63.149140
pDC                           -1.166        44.234243
Neutrophils                   -1.059        49.229805


Effects:
                                                    Final Parameter  \
Covariate             
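
The two columns of the intercept table are related by a softmax transformation: exponentiating the intercepts and normalizing them gives the expected cell-type proportions, which are then scaled by the total number of cells. This relationship is inferred here from the numbers printed above and is not quoted from the scCODA documentation. In the sketch below, `total_cells` is the sum of the "Expected Sample" column.

```python
import math

# "Final Parameter" values copied from the intercept table above.
intercepts = {
    "B cell": 0.059,
    "T cell": 1.408,
    "Epithelial cell": 1.098,
    "Macrophage/Monocyte": 0.378,
    "Mast cell": -0.408,
    "Plasma cell": -0.230,
    "cDC": -0.508,
    "Stromal": -0.561,
    "NK cell": -0.368,
    "Endothelial cell": -0.810,
    "pDC": -1.166,
    "Neutrophils": -1.059,
}

# Sum of the "Expected Sample" column above (rounded).
total_cells = 1992.08

# Softmax: exponentiate and normalize, then scale to the total cell number.
normalizer = sum(math.exp(value) for value in intercepts.values())
expected = {
    cell_type: total_cells * math.exp(value) / normalizer
    for cell_type, value in intercepts.items()
}

for cell_type, count in expected.items():
    print(f"{cell_type:<22}{count:10.2f}")
```

Up to rounding of the printed intercepts, the values should match the "Expected Sample" column.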

7. Compute "credible effects" based on the FDR. This will select cell-types that meet the FDR threshold. 

:::{note}
`condition[T.LUAD]` refers to the coefficient of the model that captures the differences in `LUAD` compared to `LUSC` (which we set as the reference). The name of the coefficient can be found in the summary output above. By choosing a different coefficient (e.g. `tumor_stage[T.early]`), we could test the effect of a different variable (in this case `tumor_stage`) on the cell-type composition.
:::

In [137]:
credible_effects_condition = sccoda_res.credible_effects()["condition[T.LUAD]"]

8. Make a plot of log2 fold changes between LUAD and LUSC using `altair`. 

In [144]:
alt.Chart(
    sccoda_res.effect_df.loc["condition[T.LUAD]"]
    .loc[credible_effects_condition]
    .reset_index(),
    title="condition",
).mark_bar().encode(
    x=alt.X("Cell Type", sort="y"),
    y="log2-fold change",
    color=alt.Color("Cell Type"),
)
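
For interpreting such a plot, it can help to see how a log2 fold change arises from model coefficients in a compositional setting. The sketch below uses three hypothetical cell-types with invented intercepts and a single invented effect. It assumes that expected proportions are obtained via a softmax of intercept plus effect; whether scCODA computes its `log2-fold change` column in exactly this way is an assumption.

```python
import math


def softmax(values):
    """Convert unconstrained parameters into proportions that sum to one."""
    exps = [math.exp(value) for value in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters: only "Type B" has a non-zero effect.
cell_types = ["Type A", "Type B", "Type C"]
intercepts = [1.0, 0.5, 0.0]
effects = [0.0, 0.8, 0.0]

baseline = softmax(intercepts)
shifted = softmax([a + b for a, b in zip(intercepts, effects)])

log2_fc = {
    cell_type: math.log2(after / before)
    for cell_type, after, before in zip(cell_types, shifted, baseline)
}

for cell_type, value in log2_fc.items():
    print(f"{cell_type}: {value:+.3f}")
```

Because proportions must sum to one, the increase of "Type B" comes with an equal negative log2 fold change for both cell-types without an effect. This is one reason to restrict the plot to credible effects.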