# Doublet Detection and Removal

## Overview
This notebook identifies and removes doublets (droplets containing two or more cells) from scRNA-seq data.

### Why Remove Doublets?
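- Doublets can masquerade as intermediate cell states or spurious rare populations
- They can confound clustering and differential expression
- Expected doublet rate: ~0.8% per 1,000 cells recovered, a common rule of thumb for droplet platforms such as 10x Chromium (about 8% at 10,000 cells)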
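
The rule of thumb in the last bullet scales roughly linearly with cell number, so the expected rate for a given run can be estimated with simple arithmetic. The sketch below assumes the ~0.8% per 1,000 cells figure quoted above; the function name `estimate_doublet_rate` and the 50% cap are illustrative and not part of this pipeline.

```python
def estimate_doublet_rate(n_cells, rate_per_1000=0.008, max_rate=0.5):
    """Linear rule-of-thumb estimate of the expected doublet fraction.

    rate_per_1000 and max_rate are assumptions for illustration only.
    """
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    # Scale the per-1,000-cell rate linearly, then cap it at max_rate
    return min(max_rate, rate_per_1000 * n_cells / 1000)


for n in (1_000, 5_000, 10_000):
    rate = estimate_doublet_rate(n)
    print(f"{n:>6} cells: {100 * rate:.1f}% expected doublets")
```

An estimate like this is only a starting value for `expected_doublet_rate`; the value actually used in this notebook is read from the configuration file.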

### Methods
- **Scrublet**: Simulates doublets and scores cells based on similarity
- **DoubletDetection**: Alternative method using boosted classifiers (listed for reference; not run in this notebook)

---

## 1. Setup

In [None]:
import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import yaml
import warnings

# Doublet detection
import scrublet as scr

warnings.filterwarnings('ignore')

# Project paths
PROJECT_ROOT = Path("../..").resolve()
DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed' / 'scrna'
FIGURES = PROJECT_ROOT / 'results' / 'figures'
CONFIG_PATH = PROJECT_ROOT / 'config' / 'analysis_params.yaml'

# Make sure the figure directory exists and that scanpy's `save=` writes there too
FIGURES.mkdir(parents=True, exist_ok=True)
sc.settings.figdir = FIGURES

# Load configuration
with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

SEED = config['random_seed']
np.random.seed(SEED)

## 2. Load Processed Data

In [None]:
# Load processed data
geo_id = "GSE115978"
input_path = DATA_PROCESSED / f'{geo_id}_processed.h5ad'

if not input_path.exists():
    # Fail early: every later cell needs `adata` to be defined
    raise FileNotFoundError(
        f"File not found: {input_path}. "
        "Please run 02b_normalization_hvg.ipynb first"
    )

adata = sc.read_h5ad(input_path)
print(f"Loaded: {input_path}")
print(f"Cells: {adata.n_obs}")
print(f"Genes: {adata.n_vars}")

## 3. Run Scrublet

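Scrublet simulates artificial doublets by summing the raw counts of randomly paired observed cells. It then embeds observed cells and simulated doublets together and scores each observed cell by how many simulated doublets sit among its nearest neighbors.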
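
The idea can be illustrated on toy data. The sketch below is a simplified stand-in, not Scrublet's implementation: it skips gene filtering, PCA, and Scrublet's likelihood-based score adjustment, and all names (`observed`, `simulated`, `toy_scores`) and parameter values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy raw counts: two distinct "cell types", 50 cells each, 20 genes
type_a = rng.poisson(lam=[10] * 10 + [1] * 10, size=(50, 20))
type_b = rng.poisson(lam=[1] * 10 + [10] * 10, size=(50, 20))
observed = np.vstack([type_a, type_b])
n_obs = observed.shape[0]

# Simulate doublets by summing the counts of randomly paired observed cells
n_sim = 2 * n_obs
pairs = rng.integers(0, n_obs, size=(n_sim, 2))
simulated = observed[pairs[:, 0]] + observed[pairs[:, 1]]


def normalize(counts):
    """Scale each cell to unit total so library size does not dominate."""
    return counts / counts.sum(axis=1, keepdims=True)


# Put observed cells first, simulated doublets second
combined = normalize(np.vstack([observed, simulated]))
is_simulated = np.arange(combined.shape[0]) >= n_obs

# Brute-force Euclidean distances from each observed cell to every cell
dists = np.linalg.norm(combined[:n_obs, None, :] - combined[None, :, :], axis=2)
dists[np.arange(n_obs), np.arange(n_obs)] = np.inf  # exclude self-matches

# Toy score: fraction of simulated doublets among the k nearest neighbors
k = 15
neighbors = np.argsort(dists, axis=1)[:, :k]
toy_scores = is_simulated[neighbors].mean(axis=1)

print(f"Mean toy doublet score: {toy_scores.mean():.2f}")
```

Cells whose neighborhoods are dominated by simulated doublets receive scores close to 1, which is the intuition behind the `doublet_score` column computed in this section.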

In [None]:
def run_scrublet(adata, expected_doublet_rate=0.08):
    """
    Run Scrublet doublet detection.
    
    Parameters
    ----------
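    adata : AnnData
        Annotated data matrix. Raw counts are read from
        ``adata.layers['counts']`` if present, otherwise from ``adata.X``.
    expected_doublet_rate : float
        Expected fraction of doublets (e.g. 0.08 for 8%)
    
    Returns
    -------
    adata : AnnData
        Input object with ``doublet_score`` and ``predicted_doublet`` added to ``.obs``
    scrub : scr.Scrublet
        Fitted Scrublet object, kept for plotting
    """
    # Use raw counts for Scrublet
    if 'counts' in adata.layers:
        counts = adata.layers['counts']
    else:
        # Fallback: adata.X must still hold raw (unnormalized) counts
        counts = adata.X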
    
    # Initialize Scrublet
    scrub = scr.Scrublet(
        counts,
        expected_doublet_rate=expected_doublet_rate,
        random_state=SEED
    )
    
    # Run detection
    doublet_scores, predicted_doublets = scrub.scrub_doublets(
        min_counts=2,
        min_cells=3,
        min_gene_variability_pctl=85,
        n_prin_comps=30
    )
    
    # Add to adata
    adata.obs['doublet_score'] = doublet_scores
    adata.obs['predicted_doublet'] = predicted_doublets
    
    # Return the Scrublet object as well, for visualization
    return adata, scrub

print("Scrublet function defined")

In [None]:
# Run Scrublet
expected_rate = config['qc']['doublet_rate']
print(f"Expected doublet rate: {expected_rate}")

adata, scrub = run_scrublet(adata, expected_doublet_rate=expected_rate)

# Summary
n_doublets = adata.obs['predicted_doublet'].sum()
pct_doublets = 100 * n_doublets / adata.n_obs

print(f"\nDoublet detection results:")
print(f"  Predicted doublets: {n_doublets} ({pct_doublets:.2f}%)")
print(f"  Threshold: {scrub.threshold_}")

## 4. Visualize Doublet Scores

In [None]:
# Histogram of doublet scores
scrub.plot_histogram()
plt.savefig(FIGURES / f'{geo_id}_doublet_histogram.png', dpi=150, bbox_inches='tight')
plt.show()
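
If the threshold marked on the histogram does not sit between the two modes of the simulated-doublet scores, or Scrublet fails to find one automatically, a manual cutoff may be more appropriate. Scrublet provides `call_doublets(threshold=...)` on the fitted object for this purpose. The standalone sketch below only illustrates the underlying comparison on made-up scores; the helper name `call_doublets_manually`, the example scores, and the 0.25 cutoff are all hypothetical.

```python
import numpy as np


def call_doublets_manually(scores, threshold):
    """Return boolean doublet calls for scores above a hand-picked threshold."""
    scores = np.asarray(scores, dtype=float)
    return scores > threshold


# Made-up scores for five cells (illustration only)
example_scores = [0.02, 0.05, 0.31, 0.08, 0.62]
calls = call_doublets_manually(example_scores, threshold=0.25)

print(f"{calls.sum()} of {calls.size} cells called as doublets")
```

Any manual cutoff should be read off the histogram of the dataset at hand rather than copied from this example.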

In [None]:
# UMAP colored by doublet score
sc.pl.umap(
    adata,
    color=['doublet_score', 'predicted_doublet'],
    ncols=2,
    save=f'_{geo_id}_doublets.png'
)

In [None]:
# Doublets per cluster
default_res = config['clustering']['default_resolution']
cluster_key = f'leiden_{default_res}'

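if cluster_key in adata.obs.columns:
    # observed=True skips empty categories when the cluster column is categorical
    doublet_by_cluster = adata.obs.groupby(cluster_key, observed=True)['predicted_doublet'].agg(['sum', 'count'])
    doublet_by_cluster['pct'] = 100 * doublet_by_cluster['sum'] / doublet_by_cluster['count']
    
    print("Doublet percentage by cluster:")
    print(doublet_by_cluster.sort_values('pct', ascending=False))
else:
    print(f"Cluster key '{cluster_key}' not found in adata.obs; skipping per-cluster summary")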
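
A per-cluster table like this is mainly useful for spotting clusters that consist mostly of predicted doublets, since such clusters are candidates for being doublet artifacts rather than real populations. The sketch below shows one way to flag them on a toy table; the column names, the toy data, and the 50% cutoff are assumptions for illustration, not values used by this pipeline.

```python
import pandas as pd

# Toy per-cell table standing in for adata.obs (illustration only)
toy_obs = pd.DataFrame({
    "cluster": ["0", "0", "0", "0", "1", "1", "1", "1", "2", "2"],
    "predicted_doublet": [False, False, False, True,
                          False, False, False, False,
                          True, True],
})

# Count doublets and cells per cluster, then compute the percentage
summary = toy_obs.groupby("cluster")["predicted_doublet"].agg(
    n_doublets="sum", n_cells="count"
)
summary["pct"] = 100 * summary["n_doublets"] / summary["n_cells"]

# Hypothetical cutoff: flag clusters where more than half the cells are doublets
flagged = summary[summary["pct"] > 50]
print(flagged)
```

Flagged clusters are worth inspecting by marker expression before deciding whether to drop them wholesale.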

## 5. Remove Doublets

In [None]:
# Filter out doublets
n_before = adata.n_obs

adata_singlets = adata[~adata.obs['predicted_doublet']].copy()

n_after = adata_singlets.n_obs
n_removed = n_before - n_after

print(f"Removed {n_removed} doublets ({100*n_removed/n_before:.2f}%)")
print(f"Cells remaining: {n_after}")

In [None]:
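# Recompute the neighbor graph and UMAP on the remaining cells
# (the PCA stored in the object, if present, is reused rather than recomputed)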
sc.pp.neighbors(adata_singlets, n_pcs=config['dim_reduction']['n_pcs'], random_state=SEED)
sc.tl.umap(adata_singlets, random_state=SEED)

# Visualize
sc.pl.umap(
    adata_singlets,
    color=[cluster_key],
    title='After doublet removal',
    save=f'_{geo_id}_after_doublet_removal.png'
)

## 6. Save Final Preprocessed Data

In [None]:
# Save data with doublets removed
output_path = DATA_PROCESSED / f'{geo_id}_final.h5ad'
adata_singlets.write(output_path)

print(f"Saved final preprocessed data to: {output_path}")
print(f"\nFinal data summary:")
print(f"  Cells: {adata_singlets.n_obs}")
print(f"  Genes: {adata_singlets.n_vars}")

## 7. Summary

### Preprocessing Complete
This concludes the preprocessing phase for this dataset.

### Data Status
- Quality control: Complete
- Normalization: Complete
- HVG selection: Complete
- Dimensionality reduction: Complete
- Doublet removal: Complete

### Next Steps
1. Repeat preprocessing for all datasets
2. Proceed to `03_integration/` for batch correction across datasets