# Compositional analysis

A central question in scRNA-seq experiments is whether cell-type proportions differ between
conditions. This seemingly simple question is technically challenging due to the compositional
nature of single-cell data: if the abundance of one cell-type increases, the relative abundance
of all other cell-types necessarily decreases, because the overall number of cells profiled is
limited. On top of that, cell-type proportions are not represented in an unbiased manner
in scRNA-seq data because, depending on the protocol, different cell-types are captured with different
efficiencies {cite}`Lambrechts2018, salcherHighresolutionSinglecellAtlas2022a`.
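
The compositional effect described above can be illustrated with a small sketch. The cell-type names and counts below are made up for illustration and are not taken from the atlas: only the T cells expand in absolute terms, yet the fractions of the other cell-types drop.

```python
import pandas as pd

# Hypothetical absolute cell numbers in a tissue (illustrative values only).
absolute_before = pd.Series({"T cell": 1000, "B cell": 1000, "Macrophage": 1000})

# Only the T cells expand; the other populations are unchanged in absolute terms.
absolute_after = absolute_before.copy()
absolute_after["T cell"] = 4000


def to_fractions(counts: pd.Series) -> pd.Series:
    """Convert absolute counts to proportions that sum to 1."""
    return counts / counts.sum()


frac_before = to_fractions(absolute_before)
frac_after = to_fractions(absolute_after)

# B cells and macrophages appear "depleted" although their numbers did not change.
print(pd.DataFrame({"before": frac_before, "after": frac_after}))
```

This is why methods for compositional data do not test each cell-type in isolation, and why scCODA asks for a reference cell-type.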

:::{note}
Several alternative methods are available for comparing compositional data. scCODA
{cite}`buttnerScCODABayesianModel2021` is a Bayesian model for compositional data analysis that models the uncertainty of cell-type fractions of each sample. 
It requires the definition of a reference cell-type that is assumed to
be unchanged between conditions. tascCODA {cite}`ostnerTascCODABayesianTreeAggregated2021` is an extension of the scCODA model that additionally takes the hierarchical relationships of cell lineages into account. Propeller {cite}`phipsonPropellerTestingDifferences2022a` uses a log-linear model to model cell-type proportions and was demonstrated to have high statistical power with few biological replicates. Finally, sccomp {cite}`mangiolaRobustDifferentialComposition2022` provides a highly flexible statistical framework that considers the presence of outliers and models group-specific variability of cell-type proportions.

Another group of tools works independently of discrete cell-types and is useful for finding more
subtle changes in functional states based on the cell × cell neighborhood graph. DA-seq {cite}`zhaoDetectionDifferentiallyAbundant2021`
computes a differential abundance (DA) score for each cell, based on the prevalence of conditions
in neighborhoods of multiple sizes, using a logistic regression classifier. Similarly, Milo {cite}`dannDifferentialAbundanceTesting2022` tests
whether cells from a certain condition are over-represented in certain parts of the neighborhood graph.
Because its statistics are based on a GLM with a negative binomial noise model, it allows for
flexible modeling of experimental designs and covariates.

The [single-cell best practices book](https://www.sc-best-practices.org) has a [dedicated chapter](https://www.sc-best-practices.org/conditions/compositional.html) that provides additional information on compositional analyses.
:::

In {cite:t}`salcherHighresolutionSinglecellAtlas2022a`, we used scCODA for comparing cell-type fractions. In this section, we demonstrate how to apply it to a single-cell atlas. 


In [103]:
import scanpy as sc
import altair as alt
import pandas as pd
import sccoda.util.cell_composition_data as scc_dat
import sccoda.util.comp_ana as scc_ana
import tensorflow as tf
tf.random.set_seed(0)
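# `tf.logging` was removed in TensorFlow 2; the compat.v1 module provides the same call
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)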

In [3]:
input_adata = "../../data/input_data_zenodo/atlas-integrated-annotated.h5ad"

In [4]:
adata = sc.read_h5ad(input_adata)

## Prepare data

In this example, we are going to compare cell-type fractions between the two tumor types, *LUAD* and *LUSC*. 

As a first step, we compute the number of cells per cell-type and patient. We can use this table to qualitatively compare the cell-type fractions and then load it into scCODA for a quantitative analysis.

Before counting, we subset the data to primary tumor samples from patients with LUAD or LUSC.
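
The counting step can be tried out on a toy table before running it on the atlas. The sketch below uses a made-up stand-in for `adata.obs` (the patients `P1` to `P3` and their cells are hypothetical) and a reduced set of grouping columns. It shows how `observed=True` skips unused category combinations and how `fill_value=0` inserts zeros for cell-types a patient lacks.

```python
import pandas as pd

# Toy stand-in for `adata.obs`; all values are made up for illustration.
obs = pd.DataFrame(
    {
        "patient": ["P1", "P1", "P1", "P2", "P2", "P3"],
        "condition": ["LUAD", "LUAD", "LUAD", "LUSC", "LUSC", "LUAD"],
        "origin": ["tumor_primary"] * 5 + ["normal"],
        "cell_type_coarse": ["T cell", "T cell", "B cell", "T cell", "NK cell", "T cell"],
    }
).astype("category")  # obs columns in AnnData objects are typically categorical

toy_counts = (
    # keep primary tumor samples of the two tumor types only (drops P3)
    obs.loc[
        lambda x: (x["origin"] == "tumor_primary")
        & x["condition"].isin(["LUAD", "LUSC"])
    ]
    # count cells per patient and cell-type; observed=True ignores empty combinations
    .groupby(["condition", "patient", "cell_type_coarse"], observed=True)
    .size()
    # one column per cell-type; missing patient/cell-type pairs become 0
    .unstack(fill_value=0)
)
print(toy_counts)
```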

In [54]:
cells_per_patient = (
    adata.obs.loc[
        lambda x: (x["origin"] == "tumor_primary")
        & x["condition"].isin(["LUAD", "LUSC"])
    ]
    .groupby(
        # groupby needs to include all covariates of interest, the column with
        # the biological replicate (patient) and the cell-type
        ["dataset", "condition", "tumor_stage", "patient", "cell_type_coarse"],
        observed=True,
        group_keys=False,
    )
    .size()
    .unstack(fill_value=0)
)

In [55]:
cells_per_patient

| dataset | condition | tumor_stage | patient | B cell | T cell | Epithelial cell | Macrophage/Monocyte | Mast cell | Plasma cell | cDC | Stromal | NK cell | Endothelial cell | pDC | Neutrophils |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lambrechts_Thienpont_2018_6653 | LUSC | early | Lambrechts_Thienpont_2018_6653_7 | 405 | 759 | 718 | 191 | 125 | 160 | 56 | 22 | 127 | 14 | 4 | 1 |
| Lambrechts_Thienpont_2018_6653 | LUAD | advanced | Lambrechts_Thienpont_2018_6653_6 | 994 | 2620 | 303 | 274 | 68 | 251 | 39 | 27 | 202 | 27 | 6 | 0 |
| Maynard_Bivona_2020 | LUAD | early | Maynard_Bivona_2020_TH238 | 9 | 300 | 325 | 115 | 111 | 41 | 100 | 146 | 11 | 62 | 5 | 3 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH226 | 7 | 284 | 575 | 371 | 112 | 91 | 28 | 300 | 33 | 113 | 8 | 115 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH236 | 49 | 140 | 160 | 65 | 74 | 107 | 19 | 219 | 5 | 82 | 2 | 2 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH179 | 16 | 429 | 173 | 323 | 94 | 121 | 96 | 30 | 14 | 83 | 50 | 1 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH248 | 32 | 154 | 58 | 73 | 12 | 171 | 6 | 127 | 30 | 17 | 2 | 0 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH231 | 2 | 4 | 375 | 24 | 5 | 0 | 2 | 2 | 2 | 2 | 3 | 87 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH158 | 10 | 70 | 8 | 7 | 3 | 0 | 1 | 9 | 5 | 1 | 0 | 0 |
| Maynard_Bivona_2020 | LUAD | advanced | Maynard_Bivona_2020_TH169 | 0 | 10 | 138 | 4 | 6 | 10 | 1 | 3 | 0 | 3 | 0 | 0 |


## Qualitative visualization as bar chart

To get a first impression, we can make a bar chart to compare cell-type fractions between conditions. 
To this end, we transform the count matrix above into a table of average cell-type fractions per condition. 
We compute the mean cell-type fraction *per patient* in order to give equal weight to each patient (this is particularly important if the number of cells differs between patients).
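
To illustrate why the per-patient averaging matters, here is a minimal sketch with made-up counts for two hypothetical patients, one of whom contributes ten times more cells. Pooling all cells lets the larger patient dominate, whereas averaging per-patient fractions weights both equally.

```python
import pandas as pd

# Made-up counts: patient P1 contributes 10x more cells than P2.
counts = pd.DataFrame(
    {"T cell": [9000, 100], "B cell": [1000, 900]},
    index=pd.Index(["P1", "P2"], name="patient"),
)

# Pooled fraction: cells from all patients are summed first, so P1 dominates.
pooled_frac = counts.sum(axis=0) / counts.to_numpy().sum()

# Per-patient fractions, averaged afterwards: each patient has equal weight.
per_patient_frac = counts.div(counts.sum(axis=1), axis=0)
mean_frac = per_patient_frac.mean(axis=0)

print(pd.DataFrame({"pooled": pooled_frac, "mean_of_patients": mean_frac}))
```

In this toy case the pooled T cell fraction is about 0.83, while the mean of the per-patient fractions is 0.5.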

In [83]:
average_fractions_per_condition = (
    cells_per_patient.apply(lambda x: x / x.sum(), axis=1)
    .melt(ignore_index=False, value_name="frac")
    .reset_index()
    .groupby(["condition", "cell_type_coarse"], observed=True)
    .agg(mean_frac=pd.NamedAgg("frac", "mean"))
    .reset_index()
)

In [84]:
average_fractions_per_condition.head()

|   | condition | cell_type_coarse | mean_frac |
|---|---|---|---|
| 0 | LUSC | B cell | 0.085394 |
| 1 | LUSC | Endothelial cell | 0.003429 |
| 2 | LUSC | Epithelial cell | 0.296106 |
| 3 | LUSC | Macrophage/Monocyte | 0.083022 |
| 4 | LUSC | Mast cell | 0.028012 |


In [85]:
alt.Chart(average_fractions_per_condition).encode(
    y="condition", x="mean_frac", color="cell_type_coarse"
).mark_bar()

## Quantitative analysis using scCODA

In [95]:
sccoda_data = scc_dat.from_pandas(
    # turns the "index" columns into normal columns again
    cells_per_patient.reset_index(),
    # we need to specify all columns that do not contain cell-type counts as covariate columns
    covariate_columns=["patient", "dataset", "condition", "tumor_stage"],
)



By making "condition" a categorical column, we can define the order of the categories.
The first category is considered the "base" category (i.e. the denominator of the fold changes).

In [99]:
sccoda_data.obs["condition"] = pd.Categorical(
    sccoda_data.obs["condition"], categories=["LUSC", "LUAD"]
)
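
The effect of the category order can be checked with plain pandas. The sketch below is only an analogy: it uses `pd.get_dummies(..., drop_first=True)` to mimic treatment (dummy) coding, which the `condition[T.LUAD]` labels in the results suggest, and it makes no claim about how scCODA builds its design matrix internally.

```python
import pandas as pd

condition = pd.Series(["LUAD", "LUSC", "LUAD", "LUSC"], name="condition")

# Without explicit categories, the levels are sorted alphabetically,
# which would make LUAD the first (baseline) level.
default_order = pd.Categorical(condition)

# With explicit categories, LUSC comes first and acts as the baseline.
explicit_order = pd.Categorical(condition, categories=["LUSC", "LUAD"])

# Dummy coding that drops the first level: only the non-baseline level
# remains as a column, so effects are expressed relative to LUSC.
dummies = pd.get_dummies(pd.Series(explicit_order, name="condition"), drop_first=True)

print("default order: ", list(default_order.categories))
print("explicit order:", list(explicit_order.categories))
print("dummy columns: ", list(dummies.columns))
```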

In [102]:
sccoda_mod = scc_ana.CompositionalAnalysis(
    sccoda_data,
    formula="condition + tumor_stage + dataset",
    reference_cell_type="Epithelial cell",
)
sccoda_res = sccoda_mod.sample_hmc(num_results=20000)

Zero counts encountered in data! Added a pseudocount of 0.5.

MCMC sampling finished. (306.444 sec)
Acceptance rate: 48.6%


In [135]:
sccoda_res.set_fdr(0.5)

In [140]:
sccoda_res.effect_df

| Covariate | Cell Type | Final Parameter | HDI 3% | HDI 97% | SD | Inclusion probability | Expected Sample | log2-fold change |
|---|---|---|---|---|---|---|---|---|
| condition[T.LUAD] | B cell | 0.028187 | -0.642 | 0.795 | 0.265 | 0.551733 | 132.367591 | -0.077685 |
| condition[T.LUAD] | T cell | 0.252691 | -0.224 | 1.044 | 0.293 | 0.511533 | 671.212907 | 0.246206 |
| condition[T.LUAD] | Epithelial cell | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 395.201523 | -0.11835 |
| condition[T.LUAD] | Macrophage/Monocyte | -0.005937 | -0.518 | 0.52 | 0.192 | 0.545533 | 194.699768 | -0.126915 |
| condition[T.LUAD] | Mast cell | -0.072112 | -0.685 | 0.597 | 0.214 | 0.423933 | 84.460511 | -0.222385 |
| condition[T.LUAD] | Plasma cell | -0.041131 | -0.624 | 0.643 | 0.224 | 0.545267 | 103.675427 | -0.177689 |
| condition[T.LUAD] | cDC | -0.012872 | -0.59 | 0.582 | 0.208 | 0.4804 | 74.553993 | -0.136921 |
| condition[T.LUAD] | Stromal | 0.158768 | -0.478 | 0.882 | 0.291 | 0.587067 | 93.987664 | 0.110703 |
| condition[T.LUAD] | NK cell | 0.08631 | -0.48 | 0.731 | 0.205 | 0.415667 | 98.170499 | 0.006169 |
| condition[T.LUAD] | Endothelial cell | 0.010181 | -0.502 | 0.656 | 0.202 | 0.480933 | 60.375116 | -0.103662 |
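
The `log2-fold change` column is on a log2 scale, which can be hard to read at a glance. As a minimal sketch, three values copied from the table above are converted to multiplicative and percent changes using plain arithmetic (no scCODA function is involved); the changes are LUAD relative to the LUSC base category.

```python
import pandas as pd

# Values copied from the `log2-fold change` column of the table above.
log2_fc = pd.Series(
    {"T cell": 0.246206, "Mast cell": -0.222385, "Stromal": 0.110703},
    name="log2-fold change",
)

# A log2-fold change of x corresponds to a multiplicative change of 2**x.
fold_change = (2.0**log2_fc).rename("fold change")

# Express the same change as a percentage relative to the base category.
percent_change = ((fold_change - 1.0) * 100).rename("percent change")

print(pd.concat([log2_fc, fold_change, percent_change], axis=1).round(3))
```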


In [137]:
credible_effects_condition = sccoda_res.credible_effects()["condition[T.LUAD]"]
credible_effects_stage = sccoda_res.credible_effects()["tumor_stage[T.early]"]

In [143]:
(
    alt.Chart(
        sccoda_res.effect_df.loc["condition[T.LUAD]"]
        .loc[credible_effects_condition]
        .reset_index(),
        title="condition",
    )
    .mark_bar()
    .encode(
        x=alt.X("Cell Type", sort="y"),
        y="log2-fold change",
        color=alt.Color("Cell Type"),
    )
    | alt.Chart(
        sccoda_res.effect_df.loc["tumor_stage[T.early]"]
        .loc[credible_effects_stage]
        .reset_index(),
        title="tumor_stage",
    )
    .mark_bar()
    .encode(
        x=alt.X("Cell Type", sort="y"),
        y="log2-fold change",
        color=alt.Color("Cell Type"),
    )
).resolve_scale(y="shared", color="shared")