
Running scDblFinder before and after removing low QC cells gives different results #79

Closed
yeroslaviz opened this issue Jun 16, 2023 · 1 comment

Comments

@yeroslaviz

Hi, thanks for a really great and easy to use tool for identifying doublets in my data set.

I was wondering how well the tool should work on a SMART-Seq 2 data set with "only" <100 cells.

I'm getting the warning that this might cause a problem, but my question is a different one.

I ran the scDblFinder command on my sce object after removing low-QC cells identified via addPerCellQCMetrics, and only two cells were identified as doublets.

For some reason I needed to repeat the analysis, and this time I ran the doublet filtering first, removing the low-QC cells only afterwards. This time it identified 9 cells as doublets. I know that's not many, but it's still >10% of my data set.

I'm mainly interested in understanding whether I can trust these results for such a small data set, and if so, why there is such a big difference depending on how (or when) one runs the search.

thanks
Assa

@plger
Owner

plger commented Jun 19, 2023

scDblFinder is not deterministic, so running it twice with different seeds will also give slightly different results. The difference you're describing, however, seems bigger than that. I guess a first question is whether the additional putative doublets were among the cells that were removed by scater's QC.
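As a quick sanity check on the seed effect, here is a minimal sketch comparing two seeded runs (assuming `sce` is your SingleCellExperiment of raw counts; the column names are the package's standard output):

```r
library(scDblFinder)

# with the same seed, two runs should agree; with different seeds they may differ slightly
# (if parallelizing via BPPARAM, the seed should be set there instead)
set.seed(123)
run1 <- scDblFinder(sce)
set.seed(456)
run2 <- scDblFinder(sce)

# compare the calls between the two runs
table(run1$scDblFinder.class, run2$scDblFinder.class)
```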

Whether to run before or after QC is already discussed elsewhere: it's preferable to get rid of droplets with very little coverage (e.g. <500 reads), but otherwise to run scDblFinder before further filtering.
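As an illustration of that ordering (a sketch only; the 500-read cutoff and the object name `sce` are placeholders, not package defaults):

```r
library(SingleCellExperiment)
library(scDblFinder)

# 1) drop only droplets with very little coverage (cutoff here is illustrative)
sce <- sce[, colSums(counts(sce)) >= 500]

# 2) run doublet detection before any further filtering
sce <- scDblFinder(sce)

# 3) only then apply the stricter per-cell QC (e.g. based on addPerCellQCMetrics)
#    and remove low-QC cells from the already-annotated object
```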

scDblFinder on very small datasets

The deeper issue is more complex -- you might consider renaming the issue to whether one can use scDblFinder on very small datasets. I've never done so. What I can do, though, is take an existing dataset with a ground truth, keep only cells above a certain library size (2000 reads) to be more similar to smart-seq data, downsample it (keeping 10% doublets), run scDblFinder, and evaluate (the code can be found here). The area under the precision-recall curve doesn't really change, meaning that doublets are consistently ranked higher than singlets even in small datasets. So in principle you should be able to use scDblFinder on such small datasets.
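For illustration, a rough sketch of that kind of evaluation (this is not the linked benchmark code; the ground-truth column `is_doublet`, the PRROC package, and the exact numbers are assumptions):

```r
library(SingleCellExperiment)
library(scDblFinder)
library(PRROC)  # for the precision-recall AUC

# keep only well-covered cells, to be more similar to smart-seq data
sce <- sce[, colSums(counts(sce)) >= 2000]

# downsample to ~100 cells while keeping ~10% doublets (ground truth assumed in colData)
dbl <- which(sce$is_doublet)
sng <- which(!sce$is_doublet)
small <- sce[, c(sample(dbl, 10), sample(sng, 90))]

small <- scDblFinder(small)

# area under the precision-recall curve of the doublet scores
pr <- pr.curve(scores.class0 = small$scDblFinder.score[small$is_doublet],
               scores.class1 = small$scDblFinder.score[!small$is_doublet])
pr$auc.integral
```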

[Figure "downsampling": area under the precision-recall curve across downsampled dataset sizes]

This evaluates the ranking of the doublet scores, but not the thresholding (i.e. at which point to make the call). With small datasets, an overabundance of artificial doublets is generated to increase power, which skews the doublet scores towards the higher side, and the scores also tend to be less polarized. You can see this in the figure below: with many cells (bottom row), most cells get a very low score, a few (the doublets) get a very high score, and there's a huge gap between the two. In such cases, setting a threshold is easy. However, you can see that as we decrease the number of cells, the gap gradually disappears, and it becomes very difficult to place the threshold.

[Figure "downsampling_scoreHist": doublet score histograms as the number of cells decreases]

In such circumstances, it's very hard to establish, from the data alone, how many doublets you really have. Since this is SMART-seq, you probably don't even have a clear prior expectation. So while you can trust the doublet score (i.e. scDblFinder.score) to be a good ranking of doublets, you can't really trust the call, i.e. scDblFinder.class.

This problem is in large part due to the number of artificial doublets created. Normally, roughly as many artificial doublets are generated as there are real cells. For small datasets, however, this would mean very few doublets, which can be insufficient to capture the possible mixings, so a hard minimum was set (originally 5000). This makes the thresholding harder in very small datasets, as can be seen by varying the artificialDoublets parameter (rows here):

[Figure "downsampling_hist_nAd": score histograms when varying the artificialDoublets parameter (rows)]

I've now changed this hard minimum to 1500 (currently only on GitHub), which should still sample the mixing space while improving the separation in very small datasets such as yours. In your case, I'd recommend simply setting the artificialDoublets parameter yourself (e.g. to, say, 500 or 1000).
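For example (a sketch; `sce` is your object and 1000 is just one of the values suggested above):

```r
library(scDblFinder)

# cap the number of artificial doublets generated for a very small dataset
sce <- scDblFinder(sce, artificialDoublets = 1000)
table(sce$scDblFinder.class)
```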

Finally, what I would suggest is to first look at your score histogram: if you see a clear peak close to 1, your job is easier. If you don't, then you probably want to visualize the putative doublets. I recommend doing so in PCA space, because it's linear, i.e. a doublet should fall somewhere between the two cell types it's composed of.
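A minimal sketch of those two checks (assuming `sce` already carries the scDblFinder columns; the scuttle/scater calls are one possible way to get a PCA, not the only one):

```r
library(scuttle)
library(scater)

# 1) score histogram: look for a clear peak close to 1
hist(sce$scDblFinder.score, breaks = 50,
     xlab = "scDblFinder.score", main = "Doublet score distribution")

# 2) visualize putative doublets in PCA space
sce <- logNormCounts(sce)
sce <- runPCA(sce)
plotPCA(sce, colour_by = "scDblFinder.class")
```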

Hope this helps,
plger

@plger plger closed this as completed Jul 7, 2023