In [83]:
import anndata
import os
import requests
import pandas as pd

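save_path = "data/example_sce.h5ad"
if not os.path.exists(save_path):
    # Create the target directory first; open() fails if it is missing
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    response = requests.get("https://go.wisc.edu/69435h")
    response.raise_for_status()  # stop here rather than saving an error page
    with open(save_path, "wb") as f:
        f.write(response.content)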

example_sce = anndata.read_h5ad(save_path)
example_sce

AnnData object with n_obs × n_vars = 2087 × 100
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'cell_type', 'sizeFactor', 'pseudotime'
    var: 'highly_variable_genes'
    uns: 'X_name', 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'PCA', 'UMAP', 'X_pca', 'X_umap'
    layers: 'counts', 'cpm', 'logcounts', 'spliced', 'unspliced'
    obsp: 'connectivities', 'distances'

```
conda create -n scdesigner python=3.11
conda activate scdesigner
pip install scdesigner==0.0.6
```

# Creating and Fitting a Simulator

Here we present an example of how to create and fit a negative binomial copula simulator.

The simulators follow a `scikit-learn`-like API.

Users may specify different formulas for the mean, dispersion, and copula models when initializing the simulator. The simulator is then fitted to an `AnnData` object with the `fit` method.

In [84]:
# Import simulator object from scdesigner
from scdesigner.simulators import NegBinCopula

# Create a NegBinCopula model with specified formulas
sim = NegBinCopula(mean_formula="~ pseudotime", 
                   dispersion_formula="~ pseudotime", 
                   copula_formula="~ -1 + cell_type") 
sim.fit(example_sce, max_epochs=50)

Epoch 49/50, Loss: 269734.8626

Estimating copula covariance: 100%|██████████| 3/3 [00:00<00:00, 25.81it/s]


The simulator encapsulates two major components as objects:
- a marginal model `sim.marginal`, and 
- a copula model `sim.copula`.

To access the parameters of the marginal model:

In [85]:
display(sim.marginal.parameters['mean'])
display(sim.marginal.parameters['dispersion'])

Unnamed: 0,Pyy,Iapp,Chgb,Rbp4,Spp1,Chga,Cck,Ins1,Nnat,Ins2,...,Nkx6-1,Fxyd3,Hn1,Smarcd2,Pdia6,Ffar2,Hes6,Serpinh1,Npy,1110012L19Rik
Intercept,0.117899,0.117808,0.11959,0.118028,0.120859,0.11954,0.119778,0.117433,0.118041,0.117929,...,0.118584,0.114993,0.118789,0.116308,0.117845,0.116657,0.118791,0.117921,-0.024891,0.11198
pseudotime,0.117843,0.117767,0.119508,0.118158,0.119114,0.119647,0.119408,0.117645,0.118413,0.118381,...,0.118493,0.115209,0.117111,0.108777,0.117583,0.116923,0.116265,0.086848,0.115788,0.107746


Unnamed: 0,Pyy,Iapp,Chgb,Rbp4,Spp1,Chga,Cck,Ins1,Nnat,Ins2,...,Nkx6-1,Fxyd3,Hn1,Smarcd2,Pdia6,Ffar2,Hes6,Serpinh1,Npy,1110012L19Rik
Intercept,-0.117746,-0.117989,-0.119301,-0.117878,-0.121148,-0.119387,-0.119852,-0.117905,-0.118648,-0.118879,...,-0.107257,-0.091699,-0.112027,-0.10567,-0.111583,-0.086039,-0.110257,-0.110795,-0.121786,-0.111324
pseudotime,-0.117657,-0.117942,-0.119404,-0.117704,-0.123518,-0.118825,-0.119442,-0.117676,-0.11875,-0.118926,...,-0.107904,-0.089513,-0.099136,-0.110277,-0.113731,-0.075681,-0.08127,-0.106168,-0.12091,-0.112885


Because the copula formula is `~ -1 + cell_type`, a separate covariance matrix is fitted for each cell type.

The parameters are stored as a dictionary keyed by cell type, with one covariance matrix per key.

In [86]:
# Print the cell types (keys) in the copula model
print(sim.copula.parameters.keys())

dict_keys(['cell_type[T.Ngn3 low EP]', 'cell_type[T.Ngn3 high EP]', 'cell_type[T.Pre-endocrine]', 'cell_type[T.Beta]'])
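The keys follow a `factor[T.level]` naming pattern. As a small convenience sketch (the helper `level_name` is hypothetical, not part of scDesigner), the level can be pulled out with a regular expression so that a group can be looked up by its plain cell type name:

```python
import re

# Keys as printed above
keys = [
    "cell_type[T.Ngn3 low EP]",
    "cell_type[T.Ngn3 high EP]",
    "cell_type[T.Pre-endocrine]",
    "cell_type[T.Beta]",
]

def level_name(key):
    """Hypothetical helper: return the level inside `factor[T.level]`."""
    match = re.fullmatch(r"[^\[]+\[T\.(.+)\]", key)
    return match.group(1) if match else key

# Map plain cell type names back to the dictionary keys
key_for_level = {level_name(k): k for k in keys}
print(key_for_level["Beta"])
```

With this mapping, `sim.copula.parameters[key_for_level["Beta"]]` would select the matrix for the Beta group.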


In [87]:
# Show the covariance matrix for a specific cell type
sim.copula.parameters['cell_type[T.Ngn3 low EP]']

Unnamed: 0,Pyy,Iapp,Chgb,Rbp4,Spp1,Chga,Cck,Ins1,Nnat,Ins2,...,Nkx6-1,Fxyd3,Hn1,Smarcd2,Pdia6,Ffar2,Hes6,Serpinh1,Npy,1110012L19Rik
Pyy,0.715878,0.009592,0.028674,0.072201,0.003775,-0.009010,0.015381,0.044128,-0.038426,0.003910,...,0.069367,0.021873,0.092279,0.006783,0.019298,-0.040686,0.072489,0.044078,-0.020795,-0.029870
Iapp,0.009592,0.538755,0.022756,-0.024000,-0.043051,-0.021439,0.006056,-0.003368,-0.013139,0.026858,...,-0.017763,0.049617,-0.069061,0.037735,-0.002925,0.018396,0.018182,-0.005152,0.015690,0.003917
Chgb,0.028674,0.022756,0.395344,0.003503,-0.020925,0.013367,-0.003780,0.012739,0.019371,0.020829,...,-0.034462,0.006759,0.085094,-0.033779,0.027338,-0.020329,-0.020728,0.038782,-0.005892,0.035212
Rbp4,0.072201,-0.024000,0.003503,0.783377,0.001082,0.047506,0.041872,-0.042563,-0.045177,0.010706,...,0.098180,-0.044081,0.077480,0.021210,0.055486,0.033867,-0.112770,0.023404,0.009505,-0.011916
Spp1,0.003775,-0.043051,-0.020925,0.001082,0.141919,0.023082,-0.051654,0.028803,-0.022332,-0.032884,...,0.027381,-0.004628,-0.059421,-0.030956,0.025964,-0.009309,-0.033047,0.044060,0.005088,0.007494
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ffar2,-0.040686,0.018396,-0.020329,0.033867,-0.009309,0.007327,0.049384,0.036575,-0.052264,0.008413,...,0.007047,0.065742,0.107262,0.093525,-0.032267,0.686553,0.135509,0.010776,0.022526,-0.049293
Hes6,0.072489,0.018182,-0.020728,-0.112770,-0.033047,0.043001,0.033593,-0.039247,-0.017643,0.050039,...,0.028249,0.068685,0.360624,0.141143,0.042830,0.135509,0.924427,0.064851,-0.044139,-0.015543
Serpinh1,0.044078,-0.005152,0.038782,0.023404,0.044060,0.028510,0.025734,0.045459,-0.003332,0.054591,...,-0.014296,0.029943,0.079353,0.047069,0.202702,0.010776,0.064851,0.646676,-0.035362,0.096024
Npy,-0.020795,0.015690,-0.005892,0.009505,0.005088,-0.000294,0.028699,0.029730,0.008219,-0.029079,...,0.045020,-0.004752,0.000810,0.025204,0.017371,0.022526,-0.044139,-0.035362,0.337652,-0.015309
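
The diagonal entries above are not equal to 1, so the off-diagonal values are on the covariance scale rather than the correlation scale. To compare a pair of genes as a correlation, divide the covariance by the two standard deviations. The sketch below does this by hand for `Pyy` and `Rbp4`, using values copied from the printed matrix (plain arithmetic, not an scDesigner call):

```python
import math

# Entries copied from the printed matrix for 'cell_type[T.Ngn3 low EP]'
var_pyy = 0.715878
var_rbp4 = 0.783377
cov_pyy_rbp4 = 0.072201

# Correlation = covariance / (sd_i * sd_j)
corr_pyy_rbp4 = cov_pyy_rbp4 / math.sqrt(var_pyy * var_rbp4)
print(round(corr_pyy_rbp4, 3))
```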


# Predicting and Sampling from the Simulator

Use the `predict` and `sample` methods to predict from and sample from the fitted simulator, passing in the covariates of interest.

The `complexity` method returns the AIC and BIC scores of the fitted model.

In [88]:
# Predict the mean and dispersion based on given covariates
preds = sim.predict(example_sce.obs[:10])
pd.DataFrame(preds['mean'])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1.218605,1.218431,1.222044,1.219021,1.22327,1.222098,1.222191,1.217873,1.219249,1.219085,...,1.219977,1.212903,1.219085,1.20922,1.218324,1.216334,1.21839,1.193318,1.054983,1.203157
1,1.183832,1.183685,1.186688,1.184145,1.187993,1.1867,1.186859,1.183179,1.184292,1.184142,...,1.184976,1.179055,1.184511,1.177334,1.183635,1.181893,1.184082,1.168128,1.025397,1.171728
2,1.229008,1.228826,1.232625,1.229456,1.233826,1.232692,1.232764,1.228253,1.229708,1.229541,...,1.23045,1.223025,1.229427,1.218745,1.228702,1.226637,1.228652,1.200817,1.063832,1.212545
3,1.254774,1.254572,1.258836,1.255302,1.259975,1.258935,1.258956,1.253959,1.255615,1.255437,...,1.25639,1.248087,1.25504,1.242312,1.254405,1.252151,1.254062,1.21932,1.085743,1.235767
4,1.176707,1.176566,1.179446,1.176999,1.180766,1.179449,1.179622,1.17607,1.17713,1.176983,...,1.177805,1.172117,1.177426,1.170792,1.176527,1.174836,1.177051,1.162942,1.019333,1.165279
5,1.159635,1.159506,1.162093,1.159877,1.163452,1.162077,1.162282,1.159035,1.159969,1.159829,...,1.160623,1.155489,1.160449,1.155103,1.159495,1.157923,1.160201,1.150484,1.0048,1.149811
6,1.129035,1.128929,1.131001,1.12919,1.132424,1.130949,1.13121,1.128502,1.129214,1.129086,...,1.12983,1.125671,1.130015,1.126939,1.128965,1.127603,1.12999,1.128032,0.978742,1.122039
7,1.220214,1.220039,1.223681,1.220635,1.224903,1.223737,1.223827,1.219479,1.220867,1.220703,...,1.221597,1.214469,1.220684,1.210693,1.21993,1.217928,1.219977,1.194479,1.056352,1.20461
8,1.242069,1.241877,1.245911,1.242558,1.247081,1.245994,1.246041,1.241284,1.242841,1.242668,...,1.243599,1.235731,1.242412,1.230697,1.241732,1.239571,1.241534,1.21021,1.07494,1.224322
9,1.213017,1.212848,1.216362,1.213417,1.217601,1.21641,1.216514,1.212299,1.213632,1.213471,...,1.214353,1.207466,1.21353,1.204101,1.212751,1.210801,1.212878,1.189283,1.050231,1.198113
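
The printed predictions are consistent with a log link, that is, mean = exp(Intercept + slope × pseudotime). This is inferred from the numbers rather than stated by the package. As a sanity check under that assumption, the sketch below back-solves the pseudotime of the first cell from its `Pyy` prediction (column 0) and then rebuilds the `Npy` prediction (column 98) from the `Npy` coefficients:

```python
import math

# Coefficients from sim.marginal.parameters['mean'] as printed earlier
coef = {
    "Pyy": (0.117899, 0.117843),   # (Intercept, pseudotime)
    "Npy": (-0.024891, 0.115788),
}

# Predicted means for the first cell (row 0 of the table above)
pred_pyy = 1.218605
pred_npy = 1.054983

# Assumed log link: log(mean) = intercept + slope * pseudotime
b0, b1 = coef["Pyy"]
pseudotime = (math.log(pred_pyy) - b0) / b1

# Reuse the recovered pseudotime to rebuild the Npy prediction
b0, b1 = coef["Npy"]
rebuilt_npy = math.exp(b0 + b1 * pseudotime)

print(f"pseudotime ~ {pseudotime:.3f}")
print(f"rebuilt Npy mean = {rebuilt_npy:.6f} (printed: {pred_npy})")
```

If the rebuilt and printed values agree to the printed precision, the log-link reading is a reasonable working assumption for interpreting the coefficients.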


In [89]:
# Generating new samples from the simulator
samples = sim.sample(example_sce.obs[:10]) 
samples

AnnData object with n_obs × n_vars = 10 × 100
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'cell_type', 'sizeFactor', 'pseudotime'

In [90]:
sim.complexity()

Computing log-likelihood...: 100%|██████████| 3/3 [00:00<00:00, 23.66it/s]


{'aic': 1022523.7886652027, 'bic': 1134264.7502253312}
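
For reference, the textbook definitions are AIC = 2k − 2 log L and BIC = k log(n) − 2 log L, with k free parameters and n observations. Whether `complexity` uses exactly these definitions, and whether n is the number of cells, are assumptions here. Under them, the two scores above determine k and log L:

```python
import math

aic = 1022523.7886652027
bic = 1134264.7502253312
n_obs = 2087  # number of cells in example_sce (assumed to be n)

# BIC - AIC = k * (log(n) - 2), so k follows from the difference
k = (bic - aic) / (math.log(n_obs) - 2)

# AIC = 2k - 2 log L, so log L follows from k
log_lik = k - aic / 2

print(f"implied k = {k:.2f}")
print(f"implied log-likelihood = {log_lik:.2f}")
```

An implied k that lands very close to a whole number is a hint that the assumed definitions match what the method computes.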

# Manipulating the Simulator

scDesigner provides a set of tools to manipulate the simulator.

For example, we can decorrelate certain genes in the copula model.


In [91]:
# Covariance matrix for this cell type before decorrelation
sim.copula.parameters['cell_type[T.Ngn3 low EP]']

Unnamed: 0,Pyy,Iapp,Chgb,Rbp4,Spp1,Chga,Cck,Ins1,Nnat,Ins2,...,Nkx6-1,Fxyd3,Hn1,Smarcd2,Pdia6,Ffar2,Hes6,Serpinh1,Npy,1110012L19Rik
Pyy,0.715878,0.009592,0.028674,0.072201,0.003775,-0.009010,0.015381,0.044128,-0.038426,0.003910,...,0.069367,0.021873,0.092279,0.006783,0.019298,-0.040686,0.072489,0.044078,-0.020795,-0.029870
Iapp,0.009592,0.538755,0.022756,-0.024000,-0.043051,-0.021439,0.006056,-0.003368,-0.013139,0.026858,...,-0.017763,0.049617,-0.069061,0.037735,-0.002925,0.018396,0.018182,-0.005152,0.015690,0.003917
Chgb,0.028674,0.022756,0.395344,0.003503,-0.020925,0.013367,-0.003780,0.012739,0.019371,0.020829,...,-0.034462,0.006759,0.085094,-0.033779,0.027338,-0.020329,-0.020728,0.038782,-0.005892,0.035212
Rbp4,0.072201,-0.024000,0.003503,0.783377,0.001082,0.047506,0.041872,-0.042563,-0.045177,0.010706,...,0.098180,-0.044081,0.077480,0.021210,0.055486,0.033867,-0.112770,0.023404,0.009505,-0.011916
Spp1,0.003775,-0.043051,-0.020925,0.001082,0.141919,0.023082,-0.051654,0.028803,-0.022332,-0.032884,...,0.027381,-0.004628,-0.059421,-0.030956,0.025964,-0.009309,-0.033047,0.044060,0.005088,0.007494
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ffar2,-0.040686,0.018396,-0.020329,0.033867,-0.009309,0.007327,0.049384,0.036575,-0.052264,0.008413,...,0.007047,0.065742,0.107262,0.093525,-0.032267,0.686553,0.135509,0.010776,0.022526,-0.049293
Hes6,0.072489,0.018182,-0.020728,-0.112770,-0.033047,0.043001,0.033593,-0.039247,-0.017643,0.050039,...,0.028249,0.068685,0.360624,0.141143,0.042830,0.135509,0.924427,0.064851,-0.044139,-0.015543
Serpinh1,0.044078,-0.005152,0.038782,0.023404,0.044060,0.028510,0.025734,0.045459,-0.003332,0.054591,...,-0.014296,0.029943,0.079353,0.047069,0.202702,0.010776,0.064851,0.646676,-0.035362,0.096024
Npy,-0.020795,0.015690,-0.005892,0.009505,0.005088,-0.000294,0.028699,0.029730,0.008219,-0.029079,...,0.045020,-0.004752,0.000810,0.025204,0.017371,0.022526,-0.044139,-0.035362,0.337652,-0.015309


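Decorrelate `Pyy` and `Iapp`. In the output below, the off-diagonal entries in the `Pyy` and `Iapp` rows and columns become zero, while the diagonal entries are unchanged.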

In [92]:
# Arguments: row pattern, column pattern, and the group (cell type) to modify
sim.copula.decorrelate("Pyy|Iapp", "Pyy|Iapp", 'cell_type[T.Ngn3 low EP]')
sim.copula.parameters['cell_type[T.Ngn3 low EP]']

Unnamed: 0,Pyy,Iapp,Chgb,Rbp4,Spp1,Chga,Cck,Ins1,Nnat,Ins2,...,Nkx6-1,Fxyd3,Hn1,Smarcd2,Pdia6,Ffar2,Hes6,Serpinh1,Npy,1110012L19Rik
Pyy,0.715878,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
Iapp,0.000000,0.538755,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
Chgb,0.000000,0.000000,0.395344,0.003503,-0.020925,0.013367,-0.003780,0.012739,0.019371,0.020829,...,-0.034462,0.006759,0.085094,-0.033779,0.027338,-0.020329,-0.020728,0.038782,-0.005892,0.035212
Rbp4,0.000000,0.000000,0.003503,0.783377,0.001082,0.047506,0.041872,-0.042563,-0.045177,0.010706,...,0.098180,-0.044081,0.077480,0.021210,0.055486,0.033867,-0.112770,0.023404,0.009505,-0.011916
Spp1,0.000000,0.000000,-0.020925,0.001082,0.141919,0.023082,-0.051654,0.028803,-0.022332,-0.032884,...,0.027381,-0.004628,-0.059421,-0.030956,0.025964,-0.009309,-0.033047,0.044060,0.005088,0.007494
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ffar2,0.000000,0.000000,-0.020329,0.033867,-0.009309,0.007327,0.049384,0.036575,-0.052264,0.008413,...,0.007047,0.065742,0.107262,0.093525,-0.032267,0.686553,0.135509,0.010776,0.022526,-0.049293
Hes6,0.000000,0.000000,-0.020728,-0.112770,-0.033047,0.043001,0.033593,-0.039247,-0.017643,0.050039,...,0.028249,0.068685,0.360624,0.141143,0.042830,0.135509,0.924427,0.064851,-0.044139,-0.015543
Serpinh1,0.000000,0.000000,0.038782,0.023404,0.044060,0.028510,0.025734,0.045459,-0.003332,0.054591,...,-0.014296,0.029943,0.079353,0.047069,0.202702,0.010776,0.064851,0.646676,-0.035362,0.096024
Npy,0.000000,0.000000,-0.005892,0.009505,0.005088,-0.000294,0.028699,0.029730,0.008219,-0.029079,...,0.045020,-0.004752,0.000810,0.025204,0.017371,0.022526,-0.044139,-0.035362,0.337652,-0.015309
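
The pattern in this output can be mimicked on a small matrix. The sketch below uses the top-left 4 × 4 corner of the matrix printed before decorrelation and a hypothetical helper, `zero_offdiag`, which is not the scDesigner implementation. It zeroes every off-diagonal entry in a row or column whose name matches the regex pattern and leaves the diagonal alone:

```python
import numpy as np
import pandas as pd

genes = ["Pyy", "Iapp", "Chgb", "Rbp4"]
# Top-left 4 x 4 corner of the matrix printed before decorrelation
cov = pd.DataFrame(
    [
        [0.715878, 0.009592, 0.028674, 0.072201],
        [0.009592, 0.538755, 0.022756, -0.024000],
        [0.028674, 0.022756, 0.395344, 0.003503],
        [0.072201, -0.024000, 0.003503, 0.783377],
    ],
    index=genes,
    columns=genes,
)

def zero_offdiag(cov, row_pattern, col_pattern):
    """Hypothetical helper, not the scDesigner implementation."""
    rows = np.asarray(cov.index.str.contains(row_pattern, regex=True), dtype=bool)
    cols = np.asarray(cov.columns.str.contains(col_pattern, regex=True), dtype=bool)
    # Select every cell in a matching row or a matching column ...
    mask = rows[:, None] | cols[None, :]
    # ... but keep the variances on the diagonal
    np.fill_diagonal(mask, False)
    values = cov.to_numpy().copy()
    values[mask] = 0.0
    return pd.DataFrame(values, index=cov.index, columns=cov.columns)

decorrelated = zero_offdiag(cov, "Pyy|Iapp", "Pyy|Iapp")
print(decorrelated)
```

The helper returns a new DataFrame and leaves `cov` untouched, whereas `decorrelate` modifies the stored parameters in place.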


# Documentation Plan

We are working on NumPy-style docstrings for the scDesigner package, with simple examples included.


In [100]:
from scdesigner.base.copula import Copula
print(Copula.__doc__)

Abstract Copula Class
    
    The scDesign3 model is built from two components: a collection of marginal
    models, and a copula to tie them together. This class implements an abstract
    version of the copula. Within this class, we may define different subclasses
    that implement various types of regularization or dependencies on
    experimental and biological conditions. Despite these differences, the
    overall class must always provide utilities for fitting and sampling
    dependent uniform variables.
    
    Parameters
    ----------
    formula : str
        A string describing the dependence of the copula on experimental or
        biological conditions. We support predictors for categorical variables
        like cell type; this corresponds to estimating a different covariance
        for each category.
    Attributes
    ----------
    loader : torch.utils.data.DataLoader
        A data loader object is used to estimate the covariance one batch at a
        time. This

In [101]:
print(Copula.decorrelate.__doc__)


        Decorrelate the covariance matrix for the given row and column patterns.
        This method can be used to generate synthetic null data where particular
        pairs of features are forced to be uncorrelated with one another. Any
        indices of the covariance that lie in the intersection of the specified
        row and column patterns will be set to zero.
        
        Parameters
        ----------
        row_pattern : str
            The regex pattern for the row names to match.
        col_pattern : str
            The regex pattern for the column names to match.
        group : Union[str, list, None], optional
            The group or groups to apply the transformation to. If None, the
            transformation is applied to all groups.
            
        Returns
        -------
        None
            This method does not return anything but modifies self parameters as
            a side effect.
        
