## Notebook to identify potential doublets using Scrublet

- [Github repo](https://github.com/AllonKleinLab/scrublet)
- [repo example notebook](https://github.com/AllonKleinLab/scrublet/blob/master/examples/scrublet_basics.ipynb)
- [Cell Systems paper](https://www.sciencedirect.com/science/article/pii/S2405471218304745)

In [None]:
!date

#### import libraries

In [None]:
import scanpy as sc
import scrublet as scr
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context

%matplotlib inline
# for white background of figures (only for docs rendering)
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

#### set notebook variables

In [None]:
# naming
proj_name = 'aging_phase2'

# directories
wrk_dir = '/labshare/raph/datasets/adrd_neuro/brain_aging/phase2'
quants_dir = f'{wrk_dir}/quants'

# in files
anndata_file = f'{quants_dir}/{proj_name}.raw.h5ad'

# out files
scores_file = f'{quants_dir}/{proj_name}.scrublet_scores.csv'

# variables
DEBUG = True
dpi_value = 50
use_gene_only = False
expected_rate = 0.08

### load the anndata file

In [None]:
%%time
adata = sc.read(anndata_file)

if DEBUG:
    print(adata)
    
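# run doublet detection using just the Gene Expression features, i.e., don't include the peak features
if use_gene_only:
    # copy so that adata is a standalone object rather than a view; obs columns are added to it later
    adata = adata[:, adata.var.modality == 'Gene Expression'].copy()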

    if DEBUG:
        print(adata)

### Initialize Scrublet object
The relevant parameters are:

- `expected_doublet_rate`: the expected fraction of transcriptomes that are doublets, typically 0.05-0.1. Results are not particularly sensitive to this parameter. For this example, the expected doublet rate comes from the Chromium User Guide: https://support.10xgenomics.com/permalink/3vzDu3zQjY0o2AqkkkI4CC
- `sim_doublet_ratio`: the number of doublets to simulate, relative to the number of observed transcriptomes. This should be high enough that all doublet states are well-represented by simulated doublets. Setting it too high is computationally expensive. The default value is 2, though values as low as 0.5 give very similar results for the datasets that have been tested.
- `n_neighbors`: the number of neighbors used to construct the KNN classifier of observed transcriptomes and simulated doublets. The default value of `round(0.5*sqrt(n_cells))` generally works well.
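
To give a feel for the defaults described above, here is a small sketch in plain Python that computes the default `n_neighbors` and the number of simulated doublets for a given dataset size. The helper names and the cell count are hypothetical and exist only for illustration; the formulas are the ones quoted in the parameter list.

```python
import math


def default_n_neighbors(n_cells: int) -> int:
    """Default quoted above: round(0.5 * sqrt(n_cells))."""
    return int(round(0.5 * math.sqrt(n_cells)))


def n_simulated_doublets(n_cells: int, sim_doublet_ratio: float = 2.0) -> int:
    """Number of simulated doublets, relative to the observed transcriptomes."""
    return int(n_cells * sim_doublet_ratio)


# hypothetical dataset size, not taken from this project
n_cells = 10_000
print(f'default n_neighbors: {default_n_neighbors(n_cells)}')
print(f'simulated doublets at ratio 2.0: {n_simulated_doublets(n_cells)}')
print(f'simulated doublets at ratio 0.5: {n_simulated_doublets(n_cells, 0.5)}')
```

If the defaults need changing, `sim_doublet_ratio` and `n_neighbors` can be passed as keyword arguments to `scr.Scrublet(...)` alongside `expected_doublet_rate`.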

In [None]:
%%time
scrub = scr.Scrublet(adata.X, expected_doublet_rate=expected_rate)

### Run the default pipeline, which includes:
1. Doublet simulation
2. Normalization, gene filtering, rescaling, PCA
3. Doublet score calculation
4. Doublet score threshold detection and doublet calling
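
The first step, doublet simulation, can be illustrated with a toy sketch: each simulated doublet is the sum of the counts of two randomly chosen observed cells. This is a dependency-free illustration of the idea, not Scrublet's own code; the function name `simulate_doublets_sketch` and the toy counts are made up for the example.

```python
import random


def simulate_doublets_sketch(counts, sim_doublet_ratio=2.0, seed=0):
    """Sum the counts of randomly chosen cell pairs to mimic doublets.

    counts: list of per-cell count vectors (one list of gene counts per cell).
    Pairs are drawn with replacement, so a cell can be paired with itself.
    """
    rng = random.Random(seed)
    n_cells = len(counts)
    n_sim = int(n_cells * sim_doublet_ratio)
    doublets = []
    for _ in range(n_sim):
        i = rng.randrange(n_cells)
        j = rng.randrange(n_cells)
        # a doublet's expression profile is the element-wise sum of both cells
        doublets.append([a + b for a, b in zip(counts[i], counts[j])])
    return doublets


# toy data: 4 cells x 3 genes (hypothetical values)
toy_counts = [[5, 0, 1], [0, 7, 2], [3, 3, 0], [0, 0, 9]]
sim = simulate_doublets_sketch(toy_counts)
print(f'{len(sim)} simulated doublets from {len(toy_counts)} observed cells')
```

The simulated profiles are then processed together with the observed cells, so that each observed cell can be scored by how many simulated doublets appear among its nearest neighbors.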

In [None]:
%%time
doublet_scores, predicted_doublets = scrub.scrub_doublets()

### Plot doublet score histograms for observed transcriptomes and simulated doublets
The simulated doublet histogram is typically bimodal. The left mode corresponds to "embedded" doublets generated by two cells with similar gene expression. The right mode corresponds to "neotypic" doublets, which are generated by cells with distinct gene expression (e.g., different cell types) and are expected to introduce more artifacts in downstream analyses. Scrublet can only detect neotypic doublets.

To call doublets vs. singlets, we must set a threshold doublet score, ideally at the minimum between the two modes of the simulated doublet histogram. `scrub_doublets()` attempts to identify this point automatically; check the histogram below to confirm that the chosen threshold sits between the two modes. If automatic threshold detection doesn't work well, you can adjust the threshold with the `call_doublets()` function. For example:

`scrub.call_doublets(threshold=0.25)`
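
Conceptually, calling doublets at a fixed threshold amounts to comparing each cell's doublet score against that threshold. The sketch below shows this on made-up scores; `call_by_threshold` is a hypothetical helper for illustration, not part of Scrublet.

```python
def call_by_threshold(scores, threshold):
    """Flag a cell as a predicted doublet when its score exceeds the threshold."""
    return [score > threshold for score in scores]


# made-up doublet scores for five cells
toy_scores = [0.02, 0.10, 0.31, 0.05, 0.48]
calls = call_by_threshold(toy_scores, threshold=0.25)

print(calls)
print(f'fraction called as doublets: {sum(calls) / len(calls):.2f}')
```

Raising the threshold makes the calls more conservative (fewer cells flagged), while lowering it flags more cells at the cost of more false positives.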

In [None]:
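with rc_context({'figure.figsize': (9, 9), 'figure.dpi': dpi_value}):
    # 'seaborn-bright' was renamed to 'seaborn-v0_8-bright' in matplotlib 3.6
    style = 'seaborn-v0_8-bright' if 'seaborn-v0_8-bright' in plt.style.available else 'seaborn-bright'
    plt.style.use(style)
    scrub.plot_histogram()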

### Get 2-D embedding to visualize the results

In [None]:
print('Running UMAP...')
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))

print('Done.')

### Plot doublet predictions on 2-D embedding
Predicted doublets should co-localize in distinct states.

In [None]:
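with rc_context({'figure.figsize': (9, 9), 'figure.dpi': dpi_value}):
    # 'seaborn-bright' was renamed to 'seaborn-v0_8-bright' in matplotlib 3.6
    style = 'seaborn-v0_8-bright' if 'seaborn-v0_8-bright' in plt.style.available else 'seaborn-bright'
    plt.style.use(style)
    scrub.plot_embedding('UMAP', order_points=True)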

### add the scores to the cell observations

In [None]:
adata.obs['doublet_score'] = doublet_scores
adata.obs['predicted_doublet'] = predicted_doublets

In [None]:
display(adata.obs.predicted_doublet.value_counts())
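
As a quick sanity check on the counts above, the fraction of cells called as doublets can be compared with the expected rate given to Scrublet. The sketch below uses made-up calls; `doublet_fraction` is a hypothetical helper, and the 0.08 value mirrors the `expected_rate` set at the top of this notebook.

```python
def doublet_fraction(predicted):
    """Fraction of cells flagged as doublets in a sequence of boolean calls."""
    predicted = list(predicted)
    return sum(bool(p) for p in predicted) / len(predicted)


# made-up calls: 4 predicted doublets among 50 cells
toy_predicted = [False] * 46 + [True] * 4
toy_expected_rate = 0.08

frac = doublet_fraction(toy_predicted)
print(f'detected doublet rate: {frac:.2%} (expected: {toy_expected_rate:.2%})')
```

In the notebook itself, the same helper could be applied to `adata.obs['predicted_doublet']`. Because Scrublet can only detect neotypic doublets, a detected rate below the expected rate is not necessarily a sign of a problem.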

### save the scores

In [None]:
adata.obs.to_csv(scores_file)

In [None]:
!date