In [1]:
import anndata
import os
import requests

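save_path = "data/example_sce.h5ad"
if not os.path.exists(save_path):
    # Create the target directory first so that open() does not fail
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    response = requests.get("https://go.wisc.edu/69435h")
    response.raise_for_status()  # stop early on a failed download
    with open(save_path, "wb") as f:
        f.write(response.content)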

exper = anndata.read_h5ad(save_path)
exper

AnnData object with n_obs × n_vars = 2087 × 100
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'cell_type', 'sizeFactor', 'pseudotime'
    var: 'highly_variable_genes'
    uns: 'X_name', 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'PCA', 'UMAP', 'X_pca', 'X_umap'
    layers: 'counts', 'cpm', 'logcounts', 'spliced', 'unspliced'
    obsp: 'connectivities', 'distances'

### Composite Definition

Here is an example of how we can fit different formulas and models across different subsets of genes while keeping them all in the same simulator object. This could be accomplished through several calls to `.fit()`, but the `Composite` interface gives a more convenient shorthand.

In [2]:
from scdesigner.simulators import CompositeGLMSimulator, NegBinRegressionSimulator, NegBinCopulaSimulator

specification = {
    "group1": {"formula": "~ pseudotime", "simulator": NegBinCopulaSimulator(epochs=10), "var_names": exper.var_names[:50]},
    "group2": {"formula": "~ 1", "simulator": NegBinRegressionSimulator(epochs=10), "var_names": exper.var_names[50:]}
}

sim = CompositeGLMSimulator(specification)
sim.fit(exper)
sim

scDesigner simulator object with
    method: 'Composite'
    features: {'group1': '[Pyy,Iapp, ..., Serpina1c]', 'group2': '[Dbpht2,Krt18, ..., 1110012L19Rik]'}
    simulators: {'group1': scDesigner simulator object with
    method: 'Negtive Binomial Copula'
    formula: '~ pseudotime'
    copula formula: 'None'
    parameters: 'coefficient', 'dispersion', 'covariance', 'group2': scDesigner simulator object with
    method: 'Negtive Binomial Regression'
    formula: '~ 1'
    parameters: 'beta', 'gamma'}
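
To make the shorthand concrete, here is a minimal, library-independent sketch of the composite pattern: one object holds a sub-model per gene group and forwards `fit` and `predict` to each of them. `ToyMeanModel` and `ToyComposite` are hypothetical stand-ins, not `scdesigner` classes, and the "model" is only a per-gene mean, so this shows the dispatch logic and nothing about how the real simulators are estimated.

```python
class ToyMeanModel:
    """Hypothetical stand-in for a simulator: stores one mean per gene."""

    def __init__(self):
        self.params = None  # filled in by fit()

    def fit(self, counts, var_names):
        # counts: dict mapping gene name -> list of observed counts
        self.params = {g: sum(counts[g]) / len(counts[g]) for g in var_names}

    def predict(self, n_cells):
        # Repeat the fitted mean for every cell
        return {g: [m] * n_cells for g, m in self.params.items()}


class ToyComposite:
    """Hypothetical composite: one sub-model per named group of genes."""

    def __init__(self, specification):
        self.specification = specification

    def fit(self, counts):
        # Each sub-model only sees the genes assigned to its group
        for spec in self.specification.values():
            spec["simulator"].fit(counts, spec["var_names"])

    def predict(self, n_cells):
        # One call returns predictions for every group, keyed by group name
        return {name: spec["simulator"].predict(n_cells)
                for name, spec in self.specification.items()}


counts = {"a": [1, 3], "b": [2, 2], "c": [0, 4], "d": [5, 7]}
spec = {
    "group1": {"simulator": ToyMeanModel(), "var_names": ["a", "b"]},
    "group2": {"simulator": ToyMeanModel(), "var_names": ["c", "d"]},
}
composite = ToyComposite(spec)
composite.fit(counts)
predictions = composite.predict(n_cells=2)
print(predictions["group2"]["d"])  # [6.0, 6.0]
```

The specification dictionary mirrors the layout used in the cell above, minus the formula, which this toy model has no use for.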

Now that the simulators are tied together in this way, we can get local parameter predictions across all groups through a single `predict` call.

In [3]:
sim.predict(exper.obs)

{'group1': {'mean':                         Pyy       Iapp       Chgb       Rbp4      Spp1  \
  AAACCTGAGAGGGATA  20.443543  18.265802  19.472219  16.468190  4.819735   
  AAACCTGGTAAGTGGC  13.088421  11.861833  12.523801  10.761526  5.731384   
  AAACGGGCAAAGAATC  23.303660  20.734400  22.166605  18.659560  4.580692   
  AAACGGGGTACAGTTC  32.079277  28.252794  30.414010  25.311946  4.045852   
  AAACGGGGTGAAATCA  11.926304  10.840752  11.422722   9.847989  5.942194   
  ...                     ...        ...        ...        ...       ...   
  TTTGGTTTCACTTACT   7.215083   6.664412   6.946201   6.096910  7.223340   
  TTTGGTTTCCTTTCGG  25.222668  22.385221  23.972482  20.122851  4.442018   
  TTTGTCAAGAATGTGT  21.268062  18.978521  20.249335  17.101286  4.746268   
  TTTGTCAAGTGACATA  17.472379  15.689595  16.669094  14.176694  5.122937   
  TTTGTCAAGTGTGGCA  10.812408   9.859053  10.366296   8.968511  6.172905   
  
                         Chga        Cck       Ins1       Nnat     

Simulated `AnnData` objects are automatically concatenated across groups.

In [4]:
sim.sample(exper.obs)

AnnData object with n_obs × n_vars = 2087 × 100
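
The concatenation step can be pictured as merging per-group columns into one cells × genes table. The sketch below is an assumption about the general idea rather than the library's implementation: `concatenate_groups` is a hypothetical helper, plain lists stand in for count matrices, and the check for genes assigned to more than one group is an added safeguard that the source does not mention.

```python
def concatenate_groups(group_samples, var_order=None):
    """Merge per-group {gene: values} dicts into one cells x genes table."""
    merged = {}
    for sample in group_samples.values():
        overlap = merged.keys() & sample.keys()
        if overlap:
            raise ValueError(f"genes assigned to more than one group: {sorted(overlap)}")
        merged.update(sample)
    # Use the requested gene order if given, otherwise the order of arrival
    genes = list(var_order) if var_order is not None else list(merged)
    n_cells = len(next(iter(merged.values())))
    rows = [[merged[g][i] for g in genes] for i in range(n_cells)]
    return genes, rows


group_samples = {
    "group1": {"a": [1, 0, 2], "b": [3, 1, 0]},
    "group2": {"c": [0, 0, 1]},
}
genes, rows = concatenate_groups(group_samples, var_order=["a", "b", "c"])
print(len(rows), len(rows[0]))  # 3 3
```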

### Splitting a Simulator

One goal of `scDesigner` is to provide a grammar for manipulating already-trained simulators, so that we have fine-grained control over synthetic null/alternative generation without re-estimating the model from scratch. We already have the `transform` functions `nullify` and `amplify` for modifying parameters through string matching. But we may also want to re-estimate part of a model, switch the input variables, or change the distribution. In these cases, we can split a single simulator into a composite of several submodels. Note that only the submodels that don't yet have fitted parameters are re-fit. To illustrate, we first fit an ordinary simulator across all genes.

In [5]:
from scdesigner.transform import split_glm

sim = NegBinCopulaSimulator(epochs=10)
sim.fit(exper, "~ pseudotime")
sim.params["beta"]

Unnamed: 0,Pyy,Iapp,Chgb,Rbp4,Spp1,Chga,Cck,Ins1,Nnat,Ins2,...,Nkx6-1,Fxyd3,Hn1,Smarcd2,Pdia6,Ffar2,Hes6,Serpinh1,Npy,1110012L19Rik
Intercept,1.788339,1.714943,1.75229,1.628577,2.050287,1.569325,1.737367,1.581807,1.412596,1.395276,...,0.761357,0.295798,1.531831,1.599978,0.632051,0.34762,1.447186,1.959203,-2.074277,0.578318
pseudotime,1.815212,1.75727,1.796564,1.731823,-0.705172,1.681196,1.653157,1.66563,1.621958,1.602153,...,0.991532,0.701913,-0.640142,-2.208465,1.043249,0.545949,-1.020503,-3.097816,2.475784,-0.433089


Let's now refit the first 10 genes without pseudotime as a predictor. This is related to nullifying those genes, except that we also re-estimate the intercept terms. This matters when the nullified variable is correlated with the other terms in the regression formula. After refitting, we're left with a composite simulator rather than a single negative binomial (NB) simulator. By default, the new "split" is given the key "group2", but this can be changed through the `split_glm` arguments.

In [6]:
sim_split = split_glm(sim, {"var_names": exper.var_names[:10], "formula": "~ 1"})
sim_split.fit(exper)
sim_split.params["group2"]["beta"]

Unnamed: 0,Pyy,Iapp,Chgb,Rbp4,Spp1,Chga,Cck,Ins1,Nnat,Ins2
Intercept,2.197102,2.169116,2.159077,2.123474,2.077892,2.081271,2.075546,2.078535,2.017364,1.9903
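
The difference between nullifying and refitting can be seen in a small worked example. It assumes a log link, so that `log(mean) = b0 + b1 * pseudotime`, and uses the fact that the intercept-only maximum likelihood estimate under a log link is the log of the average count. The coefficient values and the pseudotime grid are illustrative assumptions, and the model's expected counts stand in for observed data.

```python
import math

# Illustrative values only (assumptions, not taken from the fit above)
b0, b1 = 1.8, 1.8
pseudotime = [0.0, 0.25, 0.5, 0.75, 1.0]

# Expected counts under the full model: log(mean) = b0 + b1 * pseudotime
full_means = [math.exp(b0 + b1 * t) for t in pseudotime]

# Nullifying: set the slope to zero but keep the old intercept
nullified_mean = math.exp(b0)

# Refitting "~ 1": the intercept becomes the log of the average count
refit_intercept = math.log(sum(full_means) / len(full_means))
refit_mean = math.exp(refit_intercept)

print(round(nullified_mean, 2), round(refit_mean, 2))
```

Because pseudotime is not centered at zero here, the nullified model's mean sits below the average expression level, while the refitted intercept recovers it. This is in line with the refitted intercepts above being larger than the original ones for genes with a positive pseudotime coefficient.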


Let's double-check that the ten refitted genes have been removed entirely from the initial model. With 100 genes in total, the covariance of the remaining group should be 90 × 90.

In [7]:
sim_split.params["group1"]["covariance"].shape

(90, 90)
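
One way such a shape could arise is by keeping only the rows and columns of the covariance that belong to the remaining genes. The sketch below illustrates that bookkeeping on a placeholder matrix. `drop_genes` is a hypothetical helper, and whether `split_glm` slices the existing covariance or handles it differently is not stated here, so treat this as an assumption.

```python
def drop_genes(covariance, var_names, removed):
    """Keep only the rows and columns of genes that are not in `removed`."""
    removed = set(removed)
    keep = [i for i, g in enumerate(var_names) if g not in removed]
    kept_names = [var_names[i] for i in keep]
    sub = [[covariance[i][j] for j in keep] for i in keep]
    return kept_names, sub


var_names = [f"gene{i}" for i in range(100)]
# Identity matrix as a placeholder covariance (hypothetical values)
covariance = [[1.0 if i == j else 0.0 for j in range(100)] for i in range(100)]

kept_names, sub = drop_genes(covariance, var_names, var_names[:10])
print(len(sub), len(sub[0]))  # 90 90
```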

Nonetheless, when we sample the composite simulator, it internally combines sampled output across the genes.

In [8]:
sim_split.sample(exper.obs)

AnnData object with n_obs × n_vars = 2087 × 100

By default, the split uses the same simulator type that we initially trained. Alternatively, we can keep the formula the same but change the model.

In [9]:
sim_split = split_glm(sim, {"var_names": exper.var_names[:10], "simulator": NegBinRegressionSimulator(epochs=4)})
sim_split.fit(exper)
sim_split.params["group2"]["beta"]

Unnamed: 0,Pyy,Iapp,Chgb,Rbp4,Spp1,Chga,Cck,Ins1,Nnat,Ins2
Intercept,1.350792,1.312179,1.354777,1.313173,1.454536,1.31021,1.355573,1.243647,1.244784,1.220354
pseudotime,1.356169,1.32115,1.36109,1.331967,0.69909,1.324415,1.337112,1.264062,1.29115,1.272105
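
The two defaults seen so far (keep the simulator when only the formula is given, keep the formula when only the simulator is given) amount to filling in missing fields from the parent model. Here is a minimal sketch of that idea. `resolve_split` and the dictionary layout are hypothetical and not part of `scdesigner`, and plain strings stand in for simulator objects.

```python
def resolve_split(parent, override):
    """Fill in whatever the override leaves out from the parent's settings."""
    return {
        "var_names": list(override["var_names"]),
        "formula": override.get("formula", parent["formula"]),
        "simulator": override.get("simulator", parent["simulator"]),
    }


# Plain strings stand in for simulator objects in this toy example
parent = {"formula": "~ pseudotime", "simulator": "NegBinCopula"}

same_model = resolve_split(parent, {"var_names": ["Pyy"], "formula": "~ 1"})
same_formula = resolve_split(
    parent, {"var_names": ["Pyy"], "simulator": "NegBinRegression"}
)

print(same_model["simulator"], same_formula["formula"])
```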


We can also modify both the formula and the model simultaneously.

In [10]:
sim_split = split_glm(sim, {"var_names": exper.var_names[:10], "formula": "~ 1", "simulator": NegBinRegressionSimulator(epochs=4)})
sim_split.fit(exper)
sim_split.params["group2"]["beta"]
