This code executes the Multi-Omic Single-Cell Analysis using TensOr regression (MOSCATO) method, which uses multi-omic, single-cell data with two data types, X and G, and a univariate outcome that follows an exponential family distribution. In summary, it estimates a similarity matrix for each subject and performs tensor regression using those similarities as the predictor tensor. Features are selected by applying an elastic net constraint on the coefficient vectors, and any non-zero coefficients correspond to selected features. In addition to applying MOSCATO, code to simulate multi-omic, single-cell data is also provided.
To use MOSCATO, two data types and a univariate outcome are assumed for each subject. The data types may be supplied either with each subject having its own datasets or in the long format where the first column specifies the subject. Note that the data types must have their rows matched by the same cells (i.e., X and G must have the same number of rows, where the rows correspond to cells).
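As a toy illustration of the expected structure, using the object names returned by the simulation function described below (X, G, subjectIDs, y) and made-up dimensions:

#Toy illustration of the expected structure (made-up dimensions)
#X and G have one row per cell, with rows matched across the two data types
X <- matrix(rnorm(6*4), nrow=6, ncol=4)  #6 cells, 4 features for data type 1
G <- matrix(rnorm(6*3), nrow=6, ncol=3)  #same 6 cells, 3 features for data type 2
subjectIDs <- c(1,1,1,2,2,2)             #rows 1-3 belong to subject 1, rows 4-6 to subject 2
y <- c(0.5, -1.2)                        #one univariate outcome per subject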
To use simulated data, use the simSingleCellNetwork() function from the simSingleCellNetwork.R file. This uses the Splatter simulations proposed by Zappia et al in 2017, with extensions based on concepts from the DNSMI proposal by Zhang et al in 2020. To run the function, a series of inputs is specified corresponding to Splatter inputs for single-cell data:
- pi0: Probability of an expression value being an outlier
- mu0: Location parameter for outlier expression values
- sigma0: Scale parameter for outlier expression values
- mu_L: Location for library size
- sigma_L: Scale for library size
- phi: Common dispersion
- df0: BCV degrees of freedom
- x0: Midpoint for determining dropout (i.e., whether expression is truncated to 0)
- k_shape: Shape for determining dropout
- m_mean: Mean number of cells for a given subject
There are also parameters for specifying the network strength and size of the simulations, similar to those used in DNSMI:
- alpha: Helps specify signal strength. A number between 0 and 1
- Rsq: Helps specify signal strength. A number between 0 and 1
- rind: Helps specify signal strength. A number between 0 and 1
- N: Number of subjects
- nx: Number of network features in X (i.e., related to features in G and to the outcome)
- nxPrime: Number of features in X related to features within G but not the outcome
- ns: Number of features in X related to the outcome but not any features in G
- nxNoise: Number of features in X unrelated to G and the outcome
- ng: Number of network features in G (i.e., related to features in X and to the outcome)
- ngPrime: Number of features in G related to features within X but not the outcome
- nh: Number of features in G related to the outcome but not any features in X
- ngNoise: Number of features in G unrelated to X and the outcome
- noiseSD: Standard deviation to use for noise features
- noiseMean: Mean to use for the noise features
One other parameter, outLoc_, may be specified as a folder in which to save the data if the subject-level data should be saved as separate files. Note that this will save files such as X_1.rda and G_1.rda for subject 1, X_2.rda and G_2.rda for subject 2, and so on.
An example for generating multi-omic single-cell data with a network containing a subset of features is as follows.
OPTION 1: Save all subject-level data in separate files
source("supportingSimulationFunctions.R")
source("simSingleCellNetwork.R")
outputLocation <- "OPTIONAL: Put output location here for subject files"
testSim <- simSingleCellNetwork(
#Splatter inputs
pi0=0.002, #Outlier Probability
mu0=5, #outlier location
sigma0=0.4, #Outlier scale
mu_L=12, #Library size location
sigma_L=0.2, #Library size scale
phi=0.1, #common dispersion
df0=7, #BCV DoF
x0=1, #Dropout midpoint
k_shape=0.5, #Dropout shape
#Sample-level inputs
m_mean=500, #Mean number of cells per subject
#DNSMI Inputs
alpha=0.35,
Rsq=0.85,
N=50,
rind=0.85,
nx=15,
nxPrime=20,
ns=20,
nxNoise=1500,
ng=10,
ngPrime=15,
nh=15,
ngNoise=1400,
noiseSD=1,
noiseMean=0,
outLoc_=outputLocation
)
If outLoc_ was specified, this will create files for each subject's data types, along with a Ysim.rda file containing the vector of continuous outcomes.
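For example, a single subject's data can then be read back in as follows. This is a minimal sketch assuming the files were written with saveRDS() (consistent with the readRDS() call used for Ysim.rda later on).

#Read subject 1's data types and the outcome vector back in
#(assumes the simulation saved these with saveRDS, matching the readRDS usage below)
X_1 <- readRDS(paste0(outputLocation, "/X_1.rda"))
G_1 <- readRDS(paste0(outputLocation, "/G_1.rda"))
y <- readRDS(paste0(outputLocation, "/Ysim.rda"))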
OPTION 2: Stack the single-cell data:
source("supportingSimulationFunctions.R")
source("simSingleCellNetwork.R")
testSim <- simSingleCellNetwork(
#Splatter inputs
pi0=0.002, #Outlier Probability
mu0=5, #outlier location
sigma0=0.4, #Outlier scale
mu_L=12, #Library size location
sigma_L=0.2, #Library size scale
phi=0.1, #common dispersion
df0=7, #BCV DoF
x0=1, #Dropout midpoint
k_shape=0.5, #Dropout shape
#Sample-level inputs
m_mean=500, #Mean number of cells per subject
#DNSMI Inputs
alpha=0.35,
Rsq=0.85,
N=50,
rind=0.85,
nx=15,
nxPrime=20,
ns=20,
nxNoise=1500,
ng=10,
ngPrime=15,
nh=15,
ngNoise=1400,
noiseSD=1,
noiseMean=0
)
This returns objects X (stacked data type 1), G (stacked data type 2), subjectIDs (vector specifying which rows correspond to which subjects), and y (vector of outcomes).
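As a quick sanity check, the returned objects can be inspected; the checks below are illustrative only, and their values depend on the inputs above.

#Inspect the simulated objects
dim(testSim$X)            #stacked cells x features for data type 1
dim(testSim$G)            #stacked cells x features for data type 2
table(testSim$subjectIDs) #number of cells per subject
length(testSim$y)         #one outcome per subject (should equal N)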
In order to perform the tensor regression, a similarity matrix must be estimated for each subject to be used as the predictor tensor. This is done using the estimateSimilarity() function from the estimateSimilarity.R file.
OPTION 1: If outLoc_ was specified in STEP1 to save subject-level data in separate files:
library(rTensor)
source("estimateSimilarity.R")
y <- readRDS(paste0(outputLocation, "/Ysim.rda"))
Z_tensor <- estimateSimilarity(y=y,
outLoc_=outputLocation,
zeroAsMissing=FALSE)
OPTION 2: Otherwise, if outLoc_ was NOT specified in STEP1, then estimate the predictor tensor from the generated objects:
library(rTensor)
source("estimateSimilarity.R")
y <- testSim$y
Z_tensor <- estimateSimilarity(X=testSim$X,
G=testSim$G,
y=y,
subjectIDs=testSim$subjectIDs,
zeroAsMissing=FALSE)
Note that zeroAsMissing was set to FALSE in the above examples, but if one wanted to treat zeros in the data as missing so that they are ignored in the similarity estimation, it may be set to TRUE.
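For example, with the stacked objects from OPTION 2, only the flag changes:

#Same call as above, but zeros are treated as missing in the similarity estimation
Z_tensor <- estimateSimilarity(X=testSim$X,
G=testSim$G,
y=y,
subjectIDs=testSim$subjectIDs,
zeroAsMissing=TRUE)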
MOSCATO applies elastic net constraints for each of the data types X and G. This requires a total of 4 hyperparameters to be tuned:
- alphaX: The weight to put on the l1-norm for X
- alphaG: The weight to put on the l1-norm for G
- maxX: The maximum number of features to be selected for X
- maxG: The maximum number of features to be selected for G
The parameters are tuned using a variation of the StARS tuning method proposed by Liu et al in 2010. numIterForStability specifies the number of subsamples to use, sizeOfSubsampleForStability specifies the number of subjects to use per subsample, and phi specifies the instability threshold. minSelect specifies the minimum number of features to select in X and G (a vector in that order), and alphaGrid specifies the candidate alphas to consider when tuning alphaX and alphaG.
Note that the tuning takes a long time to process, possibly days.
source("tuneMOSCATO.R")
source("supportingMOSCATOFunctions.R")
library(rTensor)
library(parallel)
tuned <- tuneMOSCATO(Z_tensor=Z_tensor,
y=y,
distY="gaussian",
numIterForStability=20,
sizeOfSubsampleForStability=20,
phi=0.01,
minSelect=c(5,5),
alphaGrid=c(0.3,0.7))
#Tuned values?
alphaX <- tuned$tunedAlpha[1]
alphaG <- tuned$tunedAlpha[2]
maxX <- tuned$tunedMax[1]
maxG <- tuned$tunedMax[2]
Using the tuned alpha and max values from STEP3, the final feature selections may be made.
source("supportingMOSCATOFunctions.R")
library(rTensor)
MOSCATOFinal <- tensorElasticNet(Z_tensor=Z_tensor,
y=y,
glmnetAlpha=tuned$tunedAlpha,
maxSelect=tuned$tunedMax
)
#What features were selected?
selectX <- which(MOSCATOFinal$returnX==1)
selectG <- which(MOSCATOFinal$returnG==1)
The tensorElasticNet function may print either "Did not converge!" or "Converged!" to the log. If "Did not converge!" was returned, the Block Relaxation Algorithm did not converge to a consistent solution when estimating the coefficients from the tensor regression model. Because the initializations are random, re-running the method sometimes resolves this.
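If non-convergence occurs, one option is simply to re-fit and compare the selections across runs. The sketch below uses only the objects already created above (Z_tensor, y, tuned, MOSCATOFinal) and does not assume any additional arguments to tensorElasticNet.

#Re-run the final fit after a non-convergence message; the random
#initialization differs across runs, so the algorithm may converge this time
MOSCATOFinal2 <- tensorElasticNet(Z_tensor=Z_tensor,
y=y,
glmnetAlpha=tuned$tunedAlpha,
maxSelect=tuned$tunedMax
)
#Compare the feature selections between the two runs
identical(which(MOSCATOFinal$returnX==1), which(MOSCATOFinal2$returnX==1))
identical(which(MOSCATOFinal$returnG==1), which(MOSCATOFinal2$returnG==1))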