We expect T-cell and NK-cell populations to be affected by the STAT4 mutation, so we will look specifically at those subpopulations. 

We also want to look at disease effects (Patient 2, off treatment, vs. control) and treatment effects (Patient 1, on treatment, vs. Patient 2, off treatment). 

In [1]:
suppressPackageStartupMessages({
    suppressWarnings({
        library(Seurat, quietly = T)
        library(openxlsx, quietly = T)
        library(ggpubr, quietly = T)
    })
})

data_path = '/data3/hratch/STAT4_v2/'

In [2]:
pbmc.integrated<-readRDS(paste0(data_path, 'processed/pbmc_integrated.RDS'))
md<-pbmc.integrated@meta.data

Specify the cell types and context comparisons to test:

In [3]:
cell.types<-c('Naive CD8+ T cells', 'Natural killer  cells', 
              'Naive CD4+ T cells', 'Effector CD4+ T cells', 'Memory CD4+ T cells')
comparisons<-list(disease.effect = c('Patient.2', 'Control'), 
                treatment.effect = c('Patient.1', 'Patient.2'))

# Wilcoxon

We will first try the Wilcoxon test, since it worked well for marker detection and is the most commonly used type of test. 
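As a rough illustration of what a per-gene Wilcoxon rank-sum comparison between two samples looks like, here is a minimal base-R sketch. The matrix `expr.toy`, the gene names, and the sample labels are hypothetical toy data, not the PBMC object, and the Bonferroni step is only meant to mirror how we understand `p_val_adj` to be computed; it is not a reimplementation of `FindMarkers`.

```r
set.seed(1)

# Hypothetical toy expression values: 3 genes x 60 cells, 30 cells per sample
n.cells <- 30
expr.toy <- rbind(
    geneA = c(rnorm(n.cells, mean = 1), rnorm(n.cells, mean = 2)), # shifted between samples
    geneB = rnorm(2 * n.cells, mean = 1),                          # no shift
    geneC = rnorm(2 * n.cells, mean = 1)                           # no shift
)
sample.id <- rep(c('base', 'treat'), each = n.cells)

# One rank-sum test per gene, comparing the two samples
p.val <- apply(expr.toy, 1, function(x){
    wilcox.test(x[sample.id == 'treat'], x[sample.id == 'base'])$p.value
})

# Multiple-testing correction across the tested genes (assumption: Bonferroni)
p.val.adj <- p.adjust(p.val, method = 'bonferroni')
print(data.frame(p_val = p.val, p_val_adj = p.val.adj))
```

Note that this test uses only the ranks of the expression values in the two samples and has no way to account for additional covariates.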

In [28]:
wilcoxon.de<-function(cell.type, context.treat, context.base){
    # subset to one cell type; selecting cells by name ensures that the function
    # argument (rather than a global variable) determines the subset
    cells.use<-colnames(pbmc.integrated)[which(pbmc.integrated$Cell.Type == cell.type)]
    pbmc.subset<-subset(x = pbmc.integrated, cells = cells.use)
    Idents(pbmc.subset)<-'orig.ident' # compare samples within the cell type

    de.res<-FindMarkers(object = pbmc.subset, 
                        ident.1 = context.treat, ident.2 = context.base,
                        assay = 'RNA', only.pos = FALSE, 
                        slot = 'data', test.use = 'wilcox', 
                        min.pct = 0.1, # default
                        logfc.threshold = 0.5 # stricter than the Seurat default
                        )
    de.res[['gene']]<-rownames(de.res)
    de.res[['Cell.Type']]<-cell.type
    de.res[['Comparison']]<-paste0(context.treat, '_vs_', context.base)
    
    return(de.res)
}

In [5]:
wilcoxon.de.res<-list()
for (comparison in comparisons){
    for (ct in cell.types){
        context.treat<-comparison[[1]]
        context.base<-comparison[[2]]
        cond.name<-paste0(ct, '_', paste0(comparison, collapse = 'vs'))
        # pass the loop variable explicitly as the cell type
        wilcoxon.de.res[[cond.name]]<-wilcoxon.de(ct, context.treat, context.base)
    }
}
saveRDS(wilcoxon.de.res, paste0(data_path, 'processed/wilcoxon_condition-specific_DE.RDS'))

In [6]:
de.res<-do.call("rbind", wilcoxon.de.res)
de.res<-de.res[de.res$p_val_adj <= 0.1,] # threshold on p_adj

print('# of DE genes prior to filtering:')
table(de.res$Cell.Type, de.res$Comparison)

[1] "# of DE genes prior to filtering:"


                       
                        Patient.1_vs_Patient.2 Patient.2_vs_Control
  Effector CD4+ T cells                    141                 1135
  Memory CD4+ T cells                       11                  180
  Naive CD4+ T cells                        85                  961
  Naive CD8+ T cells                        56                  819
  Natural killer  cells                     27                  366

In [9]:
pos.only<-de.res[de.res$avg_log2FC > 0,]
table(pos.only$Cell.Type, pos.only$Comparison)

                       
                        Patient.1_vs_Patient.2 Patient.2_vs_Control
  Effector CD4+ T cells                     17                  173
  Memory CD4+ T cells                        2                   16
  Naive CD4+ T cells                        10                   91
  Naive CD8+ T cells                         8                   60
  Natural killer  cells                      1                   51

We anticipated a general upregulation of genes in Patient 2 vs. the control, since the STAT4 mutation is gain-of-function. Instead, only a minority of the DE genes in that comparison have a positive LFC (e.g., 173 of 1135 in effector CD4+ T cells), which is unexpected. We suspect that this reflects technical rather than biological variability. 
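To make the imbalance explicit, the fraction of upregulated DE genes can be computed directly from the two Wilcoxon tables above. This is a small sketch in which the counts for the `Patient.2_vs_Control` comparison are copied by hand from those tables; the name `frac.up` is introduced here only for illustration.

```r
# Counts copied from the Wilcoxon tables above (Patient.2_vs_Control)
cell.type.labels <- c('Effector CD4+ T cells', 'Memory CD4+ T cells', 'Naive CD4+ T cells',
                      'Naive CD8+ T cells', 'Natural killer  cells')
n.de <- c(1135, 180, 961, 819, 366) # all DE genes (p_val_adj <= 0.1)
n.up <- c(173, 16, 91, 60, 51)      # DE genes with avg_log2FC > 0

# Fraction of DE genes that are upregulated in Patient 2 relative to the control
frac.up <- data.frame(Cell.Type = cell.type.labels,
                      n.de = n.de,
                      n.up = n.up,
                      frac.up = round(n.up / n.de, 3))
print(frac.up)
```

In every cell type, fewer than one in six DE genes is upregulated in Patient 2.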

Thus, we move to tests that can better control for technical effects. Latent variables that account for technical effects have been [shown](https://www.biorxiv.org/content/10.1101/2022.03.15.484475v1) to be effective for DE across samples. Logistic regression (see 02Ci for details) allows for control of latent variables. We will first use the CDR (cellular detection rate), which has been [shown](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0844-5) to be an effective latent variable for technical effects.
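The idea of controlling for a latent variable in a logistic regression test can be sketched in base R as a likelihood ratio test between two nested models, where the null model already contains the latent variable. This is a simplified illustration on simulated data, not the Seurat implementation; the helper `lr.test` and the variables `group`, `cdr`, and `gene` are hypothetical names.

```r
set.seed(1)
n <- 400

# Hypothetical toy data: group 0 = baseline sample, group 1 = comparison sample
group <- rep(c(0, 1), each = n / 2)
cdr <- rnorm(n, mean = group) # technical covariate that differs between the samples
gene <- cdr + rnorm(n)        # expression driven by the covariate only
dat <- data.frame(group = group, gene = gene, cdr = cdr)

# Likelihood ratio test between two nested logistic regression models
lr.test <- function(full, null){
    stat <- deviance(null) - deviance(full)
    df <- df.residual(null) - df.residual(full)
    pchisq(stat, df = df, lower.tail = FALSE)
}

# Without a latent variable: does the gene predict sample membership at all?
p.no.latent <- lr.test(glm(group ~ gene, family = binomial, data = dat),
                       glm(group ~ 1, family = binomial, data = dat))

# With the latent variable: does the gene add information beyond the covariate?
p.with.latent <- lr.test(glm(group ~ gene + cdr, family = binomial, data = dat),
                         glm(group ~ cdr, family = binomial, data = dat))

print(c(no.latent = p.no.latent, with.latent = p.with.latent))
```

Because the toy gene depends on the sample only through `cdr`, the first test is expected to pick up the technical difference, while the second asks whether the gene carries information beyond it. The exact p-values depend on the simulated draw.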

Note that we do expect downregulation of genes in Patient 1 relative to Patient 2, since this is the treatment effect.

Skipped: since there were fewer differences between the patients than vs. the control, we apply more stringent LFC thresholding to the control comparison than to the one between patients:

In [190]:
# de.control<-de.res[de.res$Comparison == 'Patient.2_vs_Control', ]
# de.control<-de.control[abs(de.control$avg_log2FC) > 1.5,] # further threshold on LFC

# de.patient<-de.res[de.res$Comparison == 'Patient.1_vs_Patient.2', ] # already thresholded on LFC = 0.5
# # sort and merge
# de.control<-de.control[with(de.control, order(Cell.Type, -avg_log2FC)), ] # sort by effect size
# de.patient<-de.patient[with(de.patient, order(Cell.Type, -avg_log2FC)), ] # sort by effect size
# de.res<-rbind(de.control, de.patient)

# print('# of DE genes after filtering:')
# table(de.res$Cell.Type, de.res$Comparison)

[1] "# of DE genes after filtering:"


                       
                        Patient.1_vs_Patient.2 Patient.2_vs_Control
  Effector CD4+ T cells                    141                  236
  Memory CD4+ T cells                       11                  128
  Naive CD4+ T cells                        85                  201
  Naive CD8+ T cells                        56                  205
  Natural killer  cells                     27                   93

In [176]:
# # save to excel
# counter<-1
# markers_workbook<-createWorkbook()
# for (comparison in unique(de.res$Comparison)){
#     for (cell.type in  unique(de.res$Cell.Type)){
#         de.res.cl<-de.res[(de.res$Comparison == comparison) & (de.res$Cell.Type == cell.type), ]
#         if (dim(de.res.cl)[[1]] > 0){rownames(de.res.cl)<-1:dim(de.res.cl)[[1]]}
#         addWorksheet(markers_workbook, paste0(counter))
#         writeData(markers_workbook, sheet = paste0(counter), x = de.res.cl)
#         counter<-counter + 1
#     }
# }
# saveWorkbook(markers_workbook, overwrite = T, 
#                  paste0(data_path, 'processed/', 'wilcoxon_condition-specific_DE.xlsx'))

# Logistic Regression

In [15]:
LR.de<-function(cell.type, context.treat, context.base, latent.vars){
    # subset to one cell type; selecting cells by name ensures that the function
    # argument (rather than a global variable) determines the subset
    cells.use<-colnames(pbmc.integrated)[which(pbmc.integrated$Cell.Type == cell.type)]
    pbmc.subset<-subset(x = pbmc.integrated, cells = cells.use)
    Idents(pbmc.subset)<-'orig.ident' # compare samples within the cell type
    
    suppressWarnings({
        suppressMessages({
            de.res<-FindMarkers(object = pbmc.subset, 
                                ident.1 = context.treat, ident.2 = context.base,
                                assay = 'RNA', only.pos = FALSE, 
                                slot = 'data', test.use = 'LR', 
                                latent.vars = latent.vars, # metadata column(s) to control for
                                min.pct = 0.1, # default
                                logfc.threshold = 0.5 # stricter than the Seurat default
                                )
        })
    })
    de.res[['gene']]<-rownames(de.res)
    de.res[['Cell.Type']]<-cell.type
    de.res[['Comparison']]<-paste0(context.treat, '_vs_', context.base)
    
    return(de.res)
}

## CDR 

First, we calculate the CDR from the LogNormalized expression matrix:

In [5]:
freq<-function(expr){
    nonzero.counts<-rowSums(expr != 0) # number of cells with nonzero expression, per gene
    return(nonzero.counts/dim(expr)[[2]]) # fraction of cells expressing each gene
}

In [6]:
expr = pbmc.integrated@assays$RNA@data # log-normalized matrix
expr<-expr[which(freq(expr)>0),] # remove genes that are not detected in any cell

In [7]:
thresh = 0 # calculate CDR on non-zero frequency (NOTE: code will need to be changed if setting higher thresh)
if (thresh == 0){
    # CDR as in the MAST tutorial: centered and scaled number of detected genes per cell
    # (https://www.bioconductor.org/packages/release/bioc/vignettes/MAST/inst/doc/MAITAnalysis.html)
    cdr<-unlist(unname(scale(colSums(expr!=thresh))[, 1]))
    # CDR as in the MAST manuscript: fraction of genes detected per cell
    cdr.2<-unlist(unname(colSums(expr > thresh)/dim(expr)[[1]]))
}else{
    stop('Need to implement this if using')
}

Note that although the two methods of calculating the CDR give different absolute values, they are perfectly correlated, since the first is a centered and scaled version of the second. We will proceed with the tutorial-recommended CDR calculation:

In [8]:
identical(cdr, cdr.2) # the absolute values differ
cor(cdr, cdr.2,  method = "spearman", use = "complete.obs") # the rank correlation is perfect

In [9]:
pbmc.integrated@meta.data[['cellular.detection.rate']]<-cdr # add cdr to object

In [16]:
LR.de.res<-list()
for (comparison in comparisons){
    for (ct in cell.types){
        context.treat<-comparison[[1]]
        context.base<-comparison[[2]]
        cond.name<-paste0(ct, '_', paste0(comparison, collapse = 'vs'))
        # pass the loop variable explicitly as the cell type
        LR.de.res[[cond.name]]<-LR.de(ct, context.treat, context.base, 
                                      latent.vars = 'cellular.detection.rate')
    }
}
saveRDS(LR.de.res, paste0(data_path, 'processed/LR_condition-specific_DE.RDS'))
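As a sanity check on the two CDR definitions compared earlier in this section, the following sketch reproduces the comparison on a small simulated matrix. `expr.toy` is hypothetical toy data (genes in rows, cells in columns), not the PBMC object, and the names `cdr.scaled` and `cdr.frac` are introduced here only for illustration.

```r
set.seed(1)

# Hypothetical toy genes-by-cells matrix with many zeros
expr.toy <- matrix(rpois(50 * 20, lambda = 0.7), nrow = 50, ncol = 20)

# Tutorial-style CDR: centered and scaled number of detected genes per cell
cdr.scaled <- scale(colSums(expr.toy != 0))[, 1]

# Manuscript-style CDR: fraction of genes detected per cell
cdr.frac <- colSums(expr.toy > 0) / nrow(expr.toy)

# The values differ, but one is a linear transformation of the other
print(identical(cdr.scaled, cdr.frac))
print(cor(cdr.scaled, cdr.frac, method = 'spearman'))
print(cor(cdr.scaled, cdr.frac, method = 'pearson'))
```

Since the two quantities differ only by a shift and a positive rescaling, either one should behave equivalently as a latent variable in a regression that includes an intercept.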

In [11]:
de.res<-do.call("rbind", LR.de.res)
de.res<-de.res[de.res$p_val_adj <= 0.1,] # threshold on p_adj

print('# of DE genes prior to filtering:')
table(de.res$Cell.Type, de.res$Comparison)

[1] "# of DE genes prior to filtering:"


                       
                        Patient.1_vs_Patient.2 Patient.2_vs_Control
  Effector CD4+ T cells                     86                  435
  Memory CD4+ T cells                        5                   59
  Naive CD4+ T cells                        42                  265
  Naive CD8+ T cells                        36                  172
  Natural killer  cells                     16                   67

In [12]:
pos.only<-de.res[de.res$avg_log2FC > 0,]
table(pos.only$Cell.Type, pos.only$Comparison)

                       
                        Patient.1_vs_Patient.2 Patient.2_vs_Control
  Effector CD4+ T cells                     33                  195
  Memory CD4+ T cells                        2                   11
  Naive CD4+ T cells                        17                   84
  Naive CD8+ T cells                        11                   44
  Natural killer  cells                      2                   19

# CACAO

We still see the same unexpected balance of positive vs. negative LFCs. Instead of controlling for the CDR, we next use a DE testing method explicitly developed for comparing the same cell type between two samples (method [here](https://www.biorxiv.org/content/10.1101/2022.03.15.484475v1.full)).