Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

range of normalized values #9

Closed
bio-la opened this issue Jul 17, 2020 · 5 comments
Closed

range of normalized values #9

bio-la opened this issue Jul 17, 2020 · 5 comments

Comments

@bio-la
Copy link

bio-la commented Jul 17, 2020

Hi,
i wonder what a range of values would be considered unusual for the DSB normalized protein signal.
The plots in your bioarxiv paper go from -2 to +30 for most markers, and you seem to suggest there's no transformation applied on the values, but after normalizing my data I have values ranging between -200 and +200. if i compare the DSB distribution to the CLR it looks fine, and the markers are expressed where i would expect them, but the visualization and interpretation of such a big range is difficult, and also, there's a lot of false positives.
i am normalizing using control isotypes. any suggestions?
thanks

@MattPM
Copy link
Collaborator

MattPM commented Jul 20, 2020

Hi @bio-la , thanks very much for posting this great question.

A range of values like that can occur when you use too many empty drops for the background. It happens more often when sequencing depth of the protein library is low, but it is very easy to address it as I outline below. The interpretation of the dsb values is always the number of standard deviations from the mean in the background, so if you have a cell with a protein with dsb value of 100 that cells' protein value was 100 s.d. from the mean of the background. If you use a ton of droplets to define background that have no reads at all, then the noise mean will be artificially low. By the way, the range of the values can also be driven by a couple outlier cells. you may want to check apply(mtx,1,mean), apply(mtx,1,quantile) on the dsb normalized values to see the number of drops that might have really high values like that.

Lets download a 10X dataset with really low sequencing saturation to illustrate how to address this. https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3. This dataset has below 40% ADT saturation–that is really low. First lets take a look at the mRNA counts in the droplets to define the background – this is slightly less biased than artificially thresholding using the ADT reads directly but you can do that as well.

I use Seurat V2 below:

`

suppressMessages(library(Seurat)) 
suppressMessages(library(tidyverse))
suppressMessages(library(magrittr))
library(here)
# savepaths 
figpath = here("V2/figures_10x_example/"); dir.create(figpath)
source("functions/analysis_functions")

raw = Read10X_MPM("data/10x_data/raw_feature_bc_matrix/")
s = CreateSeuratObject(raw.data = raw, min.cells = 10, min.genes = 5)

# split protein and rna data into separate matrix 
prot = s@raw.data[grep(rownames(s@raw.data), pattern = "TotalSeq"), ]
rna = s@raw.data[rownames(s@raw.data)[rownames(s@raw.data) %ni% rownames(prot)], ]
dim(rna)

# make new object with separate assaya
s = CreateSeuratObject(raw.data = rna)

# define neg cells 
# Plot
p1 = ggplot(s@meta.data, aes(x = log10(nUMI + 1 ) )) +
  geom_density() + 
  geom_vline(xintercept = c(2.8, 1.4 ),   linetype = "dashed") + 
  ggtitle("131729 droplets 10X PBMC 5k \n 39.6% sequencing saturation") + 
  annotate("text", x = 1, y=2, label = "region 1: \n no data") + 
  annotate("text", x = 2.2, y=2, label = " region 2: \n the background\n define 'empty_drop_matrix' \n with these drops ") + 
  annotate("text", x = 4, y=2, label = " region 3, n= 5332 cell \n containing droplets \n zomed in on next plot") + 
  theme_bw() 
p2 = ggplot(s@meta.data %>% filter(log10(nUMI) > 2.5), aes(x = log10(nUMI + 1 ) )) +
  geom_density() + 
  ggtitle("5332 cell containing droplets 10X PBMC 5k \n 39.6% sequencing saturation") + 
  theme_bw()
p3 = plot_grid( p1 , p2 ) 

`

10x5k_pbmc_background

As highlighted in the plots, the second set of peaks is what you should use to define the background. The first peak are drops with between 0 to a few mRNA reads. This 0 / 1 inflated peak has will have not enough ADT to be informative (especially since sequencing depth is so low in this experiment). In the pipeline that follows I will compare using region 2 vs using region 1 + 2 for the background. When I use region 2 alone I get values in a reasonable range. When I use region 1 + 2 for background I reproduce your observations and get values in the range of -100 to 100 (but it is however driven by some outlier cells and the range of the means of proteins is actually pretty reasonable). If I use region 2 only I do get better results with the cells that are negative for each protein centered at 0.

`


md = s@meta.data %>% rownames_to_column("bc") %>%  mutate(log10umi = log10(nUMI)) %>% select(bc, log10umi) %>% column_to_rownames("bc")
s = AddMetaData(s,metadata = md)

# define negative drops 
neg_drops = WhichCells(s, subset.name =  "log10umi", accept.high = 1.5)
neg_drops2 = WhichCells(s, subset.name =  "log10umi",  accept.low = 1.5 , accept.high = 2.49)

# subset protein by negative drops   
neg_prot = prot[ , neg_drops ]
neg_prot = as.matrix(neg_prot)

neg_prot2 = prot[ , neg_drops2 ]
neg_prot2 = as.matrix(neg_prot2)


# Define the cell containing droplets
positive_cells = WhichCells(s, subset.name =  "log10umi", accept.low = 2.5, accept.high = 4.6)
positive_prot = prot[ , positive_cells] 
pos_prot = as.matrix(positive_prot)
hist(log(Matrix::colSums(positive_prot)), breaks = 1000)

# run DSB normalization 
library(dsb)
isotypes = rownames(pos_prot)[30:32]

# threshold 1 normalization -- region 1+ 2 used as background 
mtx = DSBNormalizeProtein(cell_protein_matrix = pos_prot,
                          empty_drop_matrix = neg_prot,
                          denoise.counts = TRUE,
                          use.isotype.control = TRUE, 
                          isotype.control.name.vec = isotypes)

# threshold 2 normalization  -- region 1+ 2 used a s background only
mtx2 = DSBNormalizeProtein(cell_protein_matrix = pos_prot,
                           empty_drop_matrix = neg_prot2,
                           denoise.counts = TRUE,
                           use.isotype.control = TRUE, 
                           isotype.control.name.vec = isotypes) 

# add protein data to Seurat object. 
s1 = s %>% SubsetData(cells.use = colnames(mtx), subset.raw = TRUE)
pos_prot = pos_prot[ ,s1@cell.names]
s1 = SetAssayData(s1, assay.type = "CITE", slot = "raw.data", new.data = pos_prot)
s1 = SetAssayData(s1, assay.type = "CITE", slot = "data",new.data = mtx)
s1 = SubsetData(s1, subset.name = "nGene", accept.low = 200, subset.raw = TRUE)

# threshold2 
s2 = s %>% SubsetData(cells.use = colnames(mtx2), subset.raw = TRUE)
pos_prot2 = pos_prot[ ,s2@cell.names]
s2 = SetAssayData(s2, assay.type = "CITE", slot = "raw.data", new.data = pos_prot2)
s2 = SetAssayData(s2, assay.type = "CITE", slot = "data",new.data = mtx2)
s2 = SubsetData(s2, subset.name = "nGene", accept.low = 200, subset.raw = TRUE)

# For visualization only:  
pmtx = t(mtx) %>% as.data.frame()
prots = colnames(pmtx)
index1 = prots[1]; index2 = prots[length(prots)]
pmtxl = pmtx %>% gather(prot, dsb_normalized_value, index1:index2)
#tester = pmtxl %>% filter(dsb_normalized_value < -10 | dsb_normalized_value > 50 )

ridge_aes = list(
  ggridges::geom_density_ridges2(show.legend = F, inherit.aes = T, scale = 2.5) ,
  ggridges::theme_ridges(font_size = 10, font_family = "Helvetica",center_axis_labels = T,  grid = TRUE, line_size = 0.2), 
  geom_vline(xintercept = 0, linetype="dashed", color = "black", size=1) ,
  theme(strip.background =element_rect(fill="white")),
  theme(strip.text = element_text(colour = 'black', size = 10, face = "bold")) ,
  theme(axis.text.x = element_text(size = 14, face = "bold")) ,
  theme(axis.text.y = element_text(size = 10, face = "bold")) 
  )


p3 = ggplot(pmtxl, aes(x = dsb_normalized_value , y = prot,  alpha = 1)) + ridge_aes + ggtitle("normalized with region 1 + 2 \n as background")

pmtx2 = t(mtx2) %>% as.data.frame()
prots = colnames(pmtx2)
index1 = prots[1]; index2 = prots[length(prots)]
pmtxl2 = pmtx2 %>% gather(prot, dsb_normalized_value, index1:index2) 

p4 = ggplot(pmtxl2, aes(x = dsb_normalized_value , y = prot, alpha = 1)) + ridge_aes + ggtitle("normalized with region 2 \n as background")

# remove some outliers 
p5 = ggplot(pmtxl2 %>% filter(dsb_normalized_value > -10 & dsb_normalized_value < 40), aes(x = dsb_normalized_value , y = prot, alpha = 1)) + ridge_aes + ggtitle("normalized with region 2 \n as background \n ~ 120 outlier drops removed ")
p6 = plot_grid(p3, p4, p5)
ggsave(p6, filename = paste0(figpath, "dsb_norm.png"), width = 12, height = 4)


`

dsb_norm

Keep in mind this is an extremely quick analysis and in reality i would have clustered the cells and removed doublets, done a ton of QC on the cells etc which would likely remove the outlier cells that have a really high count.

@bio-la
Copy link
Author

bio-la commented Jul 24, 2020

Hi @MattPM , thanks for the explanation and the analysis you shared.
Just a quick comment, you may want to clarify on the main vignette how this approach is different from the suggested selection of negatives (NegativeObject = subset(NegativeObject, subset = nFeature_RNA < 40)) as this would prob still include some very low UMI counts.
Also, I understand you remove outliers just for plotting, not for normalization in the vignette you produced here.
In a normal scenario, you would cluster on GEX and spend some time in cleaning up the ADT outliers in the RAW ADT matrices (and hence removing them from the GEX analysis too) before actually normalizing the ADT values?
It makes perfect sense, however I've found that cells can be outliers for one AB only while still being good for other features. i will try to keep track of those mischievous cells. thanks for your help.

@MattPM
Copy link
Collaborator

MattPM commented Jul 26, 2020

Correct, I removed outliers above to show the range of values for all but 120 cells out of >5000 was between -10 and 40 and I would do more QC first on the gene expression counts before normalizing the proteins. You can then normalize with dsb and cluster on the ADT data directly and you will definitely find some obvious additional doublets if you have PBMC. Depending on the size of the panel, we have found protein based clustering on a euclidean distance matrix, (exactly as they do in the Seurat vignette) is more informative for immune cells than mRNA, particularly for separating memory / naive T cell subsets which are obvious if you have proteins for them but can have pretty overlapping mRNA PCs. Thanks for the note on the vignette, I plan on switching it to some public data for simplicity.

@bio-la
Copy link
Author

bio-la commented Jul 27, 2020

thanks @MattPM the info about clustering is very helpful, as i am attempting some analysis right now.
how do you assign the markers to the cluster (and therefore assign identities?) do you do a Wilcox text on DSB values in cluster vs all cells and assign a pvalue to the difference in marker expression, do you just visually compare the distributions of the marker expression, or do you rely on the FACS-like scatterplots?

UMAPcoord_CD33
UMAPcoord_CD3

also, I've found that after dsb normalization there are markers that I can clearly see on a umap, but others that have a lot of negative values (i.e a lot of cells for that marker will have expression below the background) therefore these are difficult to visually inspect and clearly assign to a cluster. I am not sure if I need to apply some visual tricks (like plotting all negative values as 0, or attempting to scale) on some of these markers specifically, and i do realize that having lots of negatives tells a lot about how confident i can be about the marker staining.

thanks !

@MattPM
Copy link
Collaborator

MattPM commented Jul 28, 2020

The outlier cells are making it hard to see the true distribution on that umap. For annotation, it will be easier to just make a heatmap of the mean expression in each cluster:

adt_data = cbind(as.data.frame(t(SeuratObject@assays$CITE@data)), SeuratObject@meta.data)
prots = rownames(SeuratObject@assays$CITE@data)
adt_plot = adt_data %>% 
    group_by(cluster) %>% 
    summarize_at(.vars = prots, .funs = mean) %>% 
    column_to_rownames("cluster") %>% 
    t %>% 
    as.data.frame

pheatmap::pheatmap(adt_plot)

Usually with this strategy the doublet clusters really pop out so you can find and remove them.

You could also (just for the purpose of annotating the clusters) try subsetting the data so you only retain cells with a dsb normalized value between -10 and 30 or so if you want to just look at the umap but in my opinion that will be harder than looking at the mean. You can make histograms like I made above and do + facet_wrap(~cluster) and save a really wide plot. Sometimes I do that for annotation. Check out supplementary information > supplemental figure 2 https://www.nature.com/articles/s41591-020-0769-8#Sec49

Edit Sept 1 2020 added more clear pipeline with 10X data to main readme Vignette based on this thread

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants