# Negative Binomial: Overdispersed Poisson

The **Negative Binomial distribution** is a discrete probability distribution for count data. It is particularly important in genomics because it models **overdispersed** counts, where the variance exceeds the mean.

## The Key Insight

**Negative Binomial = Poisson with Gamma-distributed rate**

This is why RNA-seq count models (DESeq2 and edgeR for bulk data, scVI for single-cell data) use Negative Binomial instead of Poisson:
- Poisson assumes Variance = Mean
- Real gene expression data has Variance > Mean (overdispersion)
- Negative Binomial naturally captures this extra variability

## Table of Contents

1. [Poisson vs Negative Binomial](#1-poisson-vs-negative-binomial)
2. [The Poisson-Gamma Mixture](#2-the-poisson-gamma-mixture)
3. [Understanding Overdispersion](#3-understanding-overdispersion)
4. [Parameterizations](#4-parameterizations)
5. [Application: Gene Expression](#5-application-gene-expression)
6. [Quick Reference](#6-quick-reference)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['font.size'] = 12

## 1. Poisson vs Negative Binomial

### The Poisson Limitation

Poisson distribution has a strict constraint: **Variance = Mean**

This is often violated in real data, especially biological count data.
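
A quick way to check this constraint on any sample is the variance-to-mean ratio (sometimes called the index of dispersion), which should sit near 1 for Poisson data. The sketch below is illustrative only; the seed, rate, and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
samples = rng.poisson(lam=10, size=100_000)

# Variance-to-mean ratio: close to 1 for Poisson, above 1 when overdispersed
dispersion_index = samples.var() / samples.mean()
print(f"mean = {samples.mean():.2f}, var = {samples.var():.2f}, var/mean = {dispersion_index:.3f}")
```

A ratio well above 1 on real count data is the usual first sign that a Poisson model is too restrictive.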

In [None]:
# Compare Poisson and Negative Binomial with same mean
mean = 10

# Poisson: variance = mean
poisson_var = mean

# Negative Binomial: variance > mean (overdispersed)
# We'll set variance = 2 × mean for illustration
nb_var = 2 * mean

# NB parameterization: mean = r(1-p)/p, var = r(1-p)/p²
# Given mean and var, solve for r and p:
# p = mean / var
# r = mean² / (var - mean)
p = mean / nb_var
r = mean**2 / (nb_var - mean)

x = np.arange(0, 35)

fig, ax = plt.subplots(figsize=(12, 5))

# Poisson PMF
poisson_pmf = stats.poisson.pmf(x, mean)
ax.bar(x - 0.2, poisson_pmf, width=0.4, alpha=0.7, label=f'Poisson(λ={mean})\nVar = {poisson_var}', color='steelblue')

# Negative Binomial PMF
nb_pmf = stats.nbinom.pmf(x, r, p)
ax.bar(x + 0.2, nb_pmf, width=0.4, alpha=0.7, label=f'NegBinom(r={r:.1f}, p={p:.2f})\nVar = {nb_var}', color='coral')

ax.set_xlabel('Count')
ax.set_ylabel('Probability')
ax.set_title(f'Poisson vs Negative Binomial (Same Mean = {mean})')
ax.legend()

plt.show()

print("Key observation:")
print(f"  Both have mean = {mean}")
print(f"  Poisson variance = {poisson_var}")
print(f"  NegBinom variance = {nb_var}")
print(f"  NegBinom has heavier tails (more extreme values)")

### Visualizing the Difference in Tails
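
Tail probabilities can also be compared exactly with the survival function `sf` in `scipy.stats`, without any simulation. The sketch below assumes a mean of 10 and a Negative Binomial variance of 3 × mean; the threshold of 20 is an arbitrary choice for illustration.

```python
from scipy import stats

mean = 10
nb_var = 3 * mean
p = mean / nb_var                 # scipy success probability
r = mean**2 / (nb_var - mean)     # scipy n (number of successes)

threshold = 20
poisson_tail = stats.poisson.sf(threshold, mean)   # P(X > 20) under Poisson
nb_tail = stats.nbinom.sf(threshold, r, p)         # P(X > 20) under Negative Binomial

print(f"P(X > {threshold}): Poisson = {poisson_tail:.4f}, NegBinom = {nb_tail:.4f}")
```

With the same mean, the Negative Binomial puts noticeably more mass beyond the threshold than the Poisson does.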

In [None]:
# Simulate many samples and compare distributions
np.random.seed(42)
n_samples = 10000
mean = 10

# Poisson samples
poisson_samples = np.random.poisson(mean, n_samples)

# Negative Binomial samples (variance = 3 × mean)
nb_var = 3 * mean
p = mean / nb_var
r = mean**2 / (nb_var - mean)
nb_samples = np.random.negative_binomial(r, p, n_samples)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histograms
bins = np.arange(0, 50)
axes[0].hist(poisson_samples, bins=bins, density=True, alpha=0.6, label='Poisson', color='steelblue')
axes[0].hist(nb_samples, bins=bins, density=True, alpha=0.6, label='Neg Binomial', color='coral')
axes[0].set_xlabel('Count')
axes[0].set_ylabel('Density')
axes[0].set_title('Distribution Comparison')
axes[0].legend()

# Mean-Variance relationship
axes[1].scatter([poisson_samples.mean()], [poisson_samples.var()], s=200, 
                label=f'Poisson: μ={poisson_samples.mean():.1f}, σ²={poisson_samples.var():.1f}', 
                color='steelblue', marker='o')
axes[1].scatter([nb_samples.mean()], [nb_samples.var()], s=200, 
                label=f'NegBinom: μ={nb_samples.mean():.1f}, σ²={nb_samples.var():.1f}', 
                color='coral', marker='s')

# Reference line: variance = mean (Poisson)
x_line = np.linspace(0, 40, 100)
axes[1].plot(x_line, x_line, 'k--', label='Var = Mean (Poisson)')
axes[1].plot(x_line, 3*x_line, 'r--', alpha=0.5, label='Var = 3×Mean')

axes[1].set_xlabel('Mean')
axes[1].set_ylabel('Variance')
axes[1].set_title('Mean-Variance Relationship')
axes[1].legend()
axes[1].set_xlim(0, 20)
axes[1].set_ylim(0, 50)

plt.tight_layout()
plt.show()

## 2. The Poisson-Gamma Mixture

Here's the beautiful mathematical result:

**If:**
- $\lambda \sim \text{Gamma}(r, \beta)$ with shape $r$ and rate $\beta$ (the Poisson rate is itself random)
- $X | \lambda \sim \text{Poisson}(\lambda)$ (counts given rate)

**Then:**
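- $X \sim \text{NegativeBinomial}(r, p)$ where $p = \frac{\beta}{1 + \beta}$ (scipy convention: $p$ is the success probability and $X$ counts failures before the $r$-th success)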
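
One way to see why the mixture has this form is to integrate the rate out of the joint distribution. The sketch below assumes the Gamma is parameterized by shape $r$ and rate $\beta$, and that $p$ is the success probability in the convention where $X$ counts failures:

$$P(X = k) = \int_0^\infty \frac{\lambda^k e^{-\lambda}}{k!} \cdot \frac{\beta^r}{\Gamma(r)} \lambda^{r-1} e^{-\beta\lambda} \, d\lambda = \frac{\beta^r}{k!\,\Gamma(r)} \cdot \frac{\Gamma(k + r)}{(1 + \beta)^{k + r}}$$

$$= \frac{\Gamma(k + r)}{k!\,\Gamma(r)} \left(\frac{\beta}{1 + \beta}\right)^r \left(\frac{1}{1 + \beta}\right)^k = \frac{\Gamma(k + r)}{k!\,\Gamma(r)} \, p^r (1 - p)^k$$

The last expression is the Negative Binomial PMF with $p = \frac{\beta}{1 + \beta}$.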

### Intuition

- In Poisson, every observation has the **same rate** λ
- In Negative Binomial, each observation has its **own rate** drawn from a Gamma
- This extra variability in rates creates overdispersion
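
The last point can be made precise with the law of total variance. As a sketch, assume $\lambda \sim \text{Gamma}(r, \beta)$ with shape $r$ and rate $\beta$, so that $E[\lambda] = r/\beta$ and $\text{Var}(\lambda) = r/\beta^2$:

$$\text{Var}(X) = E[\text{Var}(X \mid \lambda)] + \text{Var}(E[X \mid \lambda]) = E[\lambda] + \text{Var}(\lambda) = \frac{r}{\beta} + \frac{r}{\beta^2}$$

Writing $\mu = r/\beta$ for the mean, this is $\mu + \mu^2/r$: the first term is the Poisson variance, and the second comes entirely from the variability in rates.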

In [None]:
# Demonstrate the Poisson-Gamma mixture = Negative Binomial
np.random.seed(42)
n_samples = 50000

# Gamma parameters
gamma_shape = 5  # This becomes 'r' in NegBinom
gamma_scale = 2  # scale = 1/rate

# Step 1: Draw rates from Gamma
rates = np.random.gamma(gamma_shape, gamma_scale, n_samples)

# Step 2: Draw counts from Poisson with those rates
poisson_gamma_counts = np.random.poisson(rates)  # vectorized: one Poisson draw per rate

# Direct Negative Binomial with equivalent parameters
# NB parameterization in scipy: r (n), p
# When rate ~ Gamma(shape, scale), the marginal is NB with:
# r = shape
# p = 1 / (1 + scale)
r = gamma_shape
p = 1 / (1 + gamma_scale)
nb_counts = np.random.negative_binomial(r, p, n_samples)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 1. The Gamma distribution of rates
x_gamma = np.linspace(0, 30, 200)
gamma_pdf = stats.gamma.pdf(x_gamma, a=gamma_shape, scale=gamma_scale)
axes[0].hist(rates, bins=50, density=True, alpha=0.7, color='steelblue')
axes[0].plot(x_gamma, gamma_pdf, 'r-', lw=2, label='Gamma PDF')
axes[0].set_xlabel('Rate (λ)')
axes[0].set_ylabel('Density')
axes[0].set_title(f'Step 1: Draw rates from Gamma({gamma_shape}, {gamma_scale})')
axes[0].legend()

# 2. Poisson-Gamma mixture vs direct NB
bins = np.arange(0, 50)
axes[1].hist(poisson_gamma_counts, bins=bins, density=True, alpha=0.5, 
             label='Poisson-Gamma mixture', color='steelblue')
axes[1].hist(nb_counts, bins=bins, density=True, alpha=0.5, 
             label=f'NegBinom(r={r}, p={p:.3f})', color='coral')

# Add NB PMF
x_nb = np.arange(0, 50)
nb_pmf = stats.nbinom.pmf(x_nb, r, p)
axes[1].plot(x_nb, nb_pmf, 'k-', lw=2, label='NB PMF')

axes[1].set_xlabel('Count')
axes[1].set_ylabel('Density')
axes[1].set_title('Step 2: Poisson(λ) for each rate\n= Negative Binomial')
axes[1].legend()

# 3. Q-Q plot to show they're the same distribution
sorted_pg = np.sort(poisson_gamma_counts)
sorted_nb = np.sort(nb_counts)
axes[2].scatter(sorted_pg[::100], sorted_nb[::100], alpha=0.5, s=20)
max_val = max(sorted_pg.max(), sorted_nb.max())
axes[2].plot([0, max_val], [0, max_val], 'r--', lw=2, label='y = x')
axes[2].set_xlabel('Poisson-Gamma quantiles')
axes[2].set_ylabel('Negative Binomial quantiles')
axes[2].set_title('Q-Q Plot: Same Distribution!')
axes[2].legend()

plt.tight_layout()
plt.show()

# Statistics comparison
print("Statistics Comparison:")
print(f"                      Poisson-Gamma    Neg Binomial    Theory")
print(f"  Mean:               {poisson_gamma_counts.mean():12.2f}    {nb_counts.mean():12.2f}    {gamma_shape * gamma_scale:8.2f}")
print(f"  Variance:           {poisson_gamma_counts.var():12.2f}    {nb_counts.var():12.2f}    {gamma_shape * gamma_scale * (1 + gamma_scale):8.2f}")
print(f"  Var/Mean ratio:     {poisson_gamma_counts.var()/poisson_gamma_counts.mean():12.2f}    {nb_counts.var()/nb_counts.mean():12.2f}    {1 + gamma_scale:8.2f}")

## 3. Understanding Overdispersion

### What is Overdispersion?

**Overdispersion** = Variance > Mean

For Negative Binomial:
$$\text{Var}(X) = \mu + \frac{\mu^2}{r}$$

Where:
- μ = mean
- r = dispersion parameter (larger r → less overdispersion); note that DESeq2 and edgeR use its inverse, α = 1/r, as their dispersion
- As r → ∞, NB → Poisson
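
The limiting behavior can be checked numerically by measuring how far the Negative Binomial PMF is from the Poisson PMF as r grows. This is a minimal sketch; the mean of 10, the grid of r values, and the support cutoff of 200 are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

mean = 10
x = np.arange(0, 200)                      # support wide enough to cover the bulk of the mass
poisson_pmf = stats.poisson.pmf(x, mean)

max_diffs = []
for r in [1, 10, 100, 1000, 10000]:
    p = r / (r + mean)                     # scipy success probability for this mean and r
    nb_pmf = stats.nbinom.pmf(x, r, p)
    max_diffs.append(np.max(np.abs(nb_pmf - poisson_pmf)))
    print(f"r = {r:>6}: max |NB - Poisson| = {max_diffs[-1]:.2e}")
```

The largest pointwise gap shrinks steadily as r increases, which is the convergence to Poisson stated above.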

In [None]:
# Effect of dispersion parameter r
mean = 10
r_values = [0.5, 1, 2, 5, 10, 100]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

x = np.arange(0, 50)

for ax, r in zip(axes, r_values):
    # NB parameterization: mean = r(1-p)/p, so p = r/(r + mean)
    p = r / (r + mean)
    
    # Variance = mean + mean²/r
    var = mean + mean**2 / r
    
    nb_pmf = stats.nbinom.pmf(x, r, p)
    poisson_pmf = stats.poisson.pmf(x, mean)
    
    ax.bar(x, nb_pmf, alpha=0.7, label=f'NB(r={r})', color='coral')
    ax.plot(x, poisson_pmf, 'b-', lw=2, label='Poisson', alpha=0.7)
    
    ax.set_title(f'r = {r}\nVar = {var:.1f} (Var/Mean = {var/mean:.2f})')
    ax.set_xlabel('Count')
    ax.set_ylabel('Probability')
    ax.legend(loc='upper right')
    ax.set_xlim(0, 40)

plt.suptitle(f'Negative Binomial: Effect of Dispersion Parameter r (Mean = {mean})\nAs r → ∞, NB → Poisson', 
             fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Mean-Variance relationship for different r values
means = np.linspace(1, 50, 100)
r_values = [0.5, 1, 2, 5, 10, 100]

fig, ax = plt.subplots(figsize=(10, 6))

# Poisson line (Var = Mean)
ax.plot(means, means, 'k-', lw=3, label='Poisson (Var = Mean)')

# NB lines for different r
colors = plt.cm.Reds(np.linspace(0.3, 0.9, len(r_values)))
for r, color in zip(r_values, colors):
    var = means + means**2 / r
    ax.plot(means, var, lw=2, color=color, label=f'NB (r={r})')

ax.set_xlabel('Mean')
ax.set_ylabel('Variance')
ax.set_title('Mean-Variance Relationship\nNegative Binomial: Var = μ + μ²/r')
ax.legend(loc='upper left')
ax.set_xlim(0, 50)
ax.set_ylim(0, 200)

# Add annotation
ax.annotate('Overdispersion\n(Var > Mean)', xy=(30, 100), fontsize=12,
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.show()

print("Key insight:")
print("  - Small r → high overdispersion (Var >> Mean)")
print("  - Large r → approaches Poisson (Var ≈ Mean)")
print("  - As a rough guide, gene expression often has r between about 0.1 and 10")

## 4. Parameterizations

The Negative Binomial has multiple parameterizations, which can be confusing:

### Common Parameterizations

| Name | Parameters | Mean | Variance |
|------|------------|------|----------|
| scipy/numpy | n (r), p | r(1-p)/p | r(1-p)/p² |
| Mean-dispersion | μ, r | μ | μ + μ²/r |
| DESeq2/edgeR | μ, α (1/r) | μ | μ + αμ² |
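
The last row of the table can be turned into a conversion as well. The helper below, `nb_alpha_to_scipy`, is a hypothetical name introduced here for illustration; it relies only on the relation α = 1/r from the table and does not reproduce anything internal to DESeq2 or edgeR.

```python
from scipy import stats

def nb_alpha_to_scipy(mean, alpha):
    """Convert (mean, alpha) with Var = mean + alpha * mean**2 to scipy (n, p)."""
    n = 1.0 / alpha                  # r = 1 / alpha
    p = n / (n + mean)               # same formula as p = r / (r + mean)
    return n, p

mean, alpha = 10.0, 0.5              # illustrative values; alpha = 0.5 means r = 2
n, p = nb_alpha_to_scipy(mean, alpha)
print(f"n = {n}, p = {p:.4f}")
print(f"scipy variance = {stats.nbinom.var(n, p):.2f}, mu + alpha*mu^2 = {mean + alpha * mean**2:.2f}")
```

Both variance expressions should agree, which confirms that the (μ, α) and (n, p) forms describe the same distribution.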

In [None]:
# Conversion functions between parameterizations

def nb_scipy_to_mean_disp(n, p):
    """Convert scipy (n, p) to (mean, r)."""
    mean = n * (1 - p) / p
    r = n
    return mean, r

def nb_mean_disp_to_scipy(mean, r):
    """Convert (mean, r) to scipy (n, p)."""
    n = r
    p = r / (r + mean)
    return n, p

def nb_variance(mean, r):
    """Compute NB variance from mean and dispersion."""
    return mean + mean**2 / r

# Example
mean = 10
r = 2

n, p = nb_mean_disp_to_scipy(mean, r)
print(f"Mean-dispersion: μ={mean}, r={r}")
print(f"Scipy params:    n={n}, p={p:.4f}")
print(f"Variance:        {nb_variance(mean, r):.2f}")
print(f"Var/Mean:        {nb_variance(mean, r)/mean:.2f}")

# Verify with scipy
print(f"\nVerification with scipy.stats.nbinom:")
print(f"  Mean:     {stats.nbinom.mean(n, p):.2f}")
print(f"  Variance: {stats.nbinom.var(n, p):.2f}")

In [None]:
# Sampling with mean-dispersion parameterization
def sample_nb_mean_disp(mean, r, size=1):
    """Sample from NB using mean-dispersion parameterization."""
    n, p = nb_mean_disp_to_scipy(mean, r)
    return np.random.negative_binomial(n, p, size)

# Example: different genes with same mean but different dispersion
np.random.seed(42)
n_samples = 5000
mean = 20

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, r in zip(axes, [0.5, 2, 10]):
    samples = sample_nb_mean_disp(mean, r, n_samples)
    
    ax.hist(samples, bins=50, density=True, alpha=0.7, edgecolor='black')
    ax.axvline(samples.mean(), color='red', linestyle='--', lw=2, label=f'Mean={samples.mean():.1f}')
    ax.set_xlabel('Count')
    ax.set_ylabel('Density')
    ax.set_title(f'NB(μ={mean}, r={r})\nVar={samples.var():.1f}, Var/Mean={samples.var()/samples.mean():.2f}')
    ax.legend()

plt.suptitle('Same Mean, Different Dispersion', fontsize=14)
plt.tight_layout()
plt.show()

## 5. Application: Gene Expression

### Why scRNA-seq Uses Negative Binomial

Gene expression counts are overdispersed due to:
1. **Biological variability**: Cells differ in their transcriptional state
2. **Technical noise**: Capture efficiency, amplification bias
3. **Bursty transcription**: Genes transcribe in bursts, not continuously

In [None]:
# Simulate gene expression with biological variability
np.random.seed(42)
n_cells = 1000

# Gene with mean expression = 50
base_mean = 50

# Scenario 1: No biological variability (Poisson)
poisson_counts = np.random.poisson(base_mean, n_cells)

# Scenario 2: Biological variability (each cell has different rate)
# Rate varies according to Gamma distribution
r = 2  # dispersion parameter
cell_rates = np.random.gamma(r, base_mean/r, n_cells)  # mean = base_mean
nb_counts = np.random.poisson(cell_rates)  # vectorized: one Poisson draw per cell rate

# Alternative: direct NB sampling
nb_direct = sample_nb_mean_disp(base_mean, r, n_cells)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Cell-specific rates
axes[0].hist(cell_rates, bins=40, density=True, alpha=0.7, color='steelblue')
axes[0].axvline(base_mean, color='red', linestyle='--', lw=2, label=f'Mean rate = {base_mean}')
axes[0].set_xlabel('Cell-specific rate (λ)')
axes[0].set_ylabel('Density')
axes[0].set_title('Biological Variability:\nEach cell has different rate')
axes[0].legend()

# Poisson vs NB counts
bins = np.arange(0, 150)
axes[1].hist(poisson_counts, bins=bins, density=True, alpha=0.5, label='Poisson (no bio var)', color='steelblue')
axes[1].hist(nb_counts, bins=bins, density=True, alpha=0.5, label='NB (with bio var)', color='coral')
axes[1].set_xlabel('Count')
axes[1].set_ylabel('Density')
axes[1].set_title('Resulting Count Distributions')
axes[1].legend()

# Mean-variance plot
axes[2].scatter([poisson_counts.mean()], [poisson_counts.var()], s=200,
                label='Poisson', color='steelblue', marker='o')
axes[2].scatter([nb_counts.mean()], [nb_counts.var()], s=200,
                label='NB (Poisson-Gamma)', color='coral', marker='s')

# Reference lines
x_line = np.linspace(0, 100, 100)
axes[2].plot(x_line, x_line, 'k--', label='Var = Mean')
axes[2].plot(x_line, x_line + x_line**2/r, 'r--', alpha=0.5, label=f'Var = μ + μ²/{r}')

axes[2].set_xlabel('Mean')
axes[2].set_ylabel('Variance')
axes[2].set_title('Mean-Variance Relationship')
axes[2].legend()
axes[2].set_xlim(0, 100)
axes[2].set_ylim(0, 2000)

plt.tight_layout()
plt.show()

print("Statistics:")
print(f"  Poisson:  mean={poisson_counts.mean():.1f}, var={poisson_counts.var():.1f}, var/mean={poisson_counts.var()/poisson_counts.mean():.2f}")
print(f"  NB:       mean={nb_counts.mean():.1f}, var={nb_counts.var():.1f}, var/mean={nb_counts.var()/nb_counts.mean():.2f}")

In [None]:
# Simulate multiple genes with different expression levels and dispersions
np.random.seed(42)
n_cells = 500
n_genes = 100

# Gene-specific parameters
gene_means = np.random.exponential(50, n_genes)  # Different mean expression
gene_dispersions = np.random.uniform(0.5, 5, n_genes)  # Different dispersion

# Simulate counts
counts = np.zeros((n_cells, n_genes))
for j in range(n_genes):
    counts[:, j] = sample_nb_mean_disp(gene_means[j], gene_dispersions[j], n_cells)

# Compute mean and variance for each gene
gene_observed_means = counts.mean(axis=0)
gene_observed_vars = counts.var(axis=0)

fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(gene_observed_means, gene_observed_vars, alpha=0.6, s=50, c=gene_dispersions, cmap='coolwarm')

# Reference lines
x_line = np.linspace(1, gene_observed_means.max(), 100)
ax.plot(x_line, x_line, 'k--', lw=2, label='Var = Mean (Poisson)')
ax.plot(x_line, x_line + x_line**2/1, 'r-', alpha=0.3, lw=2, label='r = 1')
ax.plot(x_line, x_line + x_line**2/5, 'b-', alpha=0.3, lw=2, label='r = 5')

ax.set_xlabel('Mean Expression')
ax.set_ylabel('Variance')
ax.set_title('Mean-Variance Relationship Across Genes\n(Color = dispersion parameter r)')
ax.legend()
ax.set_xscale('log')
ax.set_yscale('log')

cbar = plt.colorbar(ax.collections[0], ax=ax, label='Dispersion (r)')

plt.show()

print("Real scRNA-seq data shows a similar mean-variance pattern.")
print("Points above the Poisson line indicate overdispersion.")

## 6. Quick Reference

### Key Formulas

**Mean-dispersion parameterization (most intuitive):**
```
Mean = μ
Variance = μ + μ²/r
Var/Mean = 1 + μ/r  (overdispersion factor)
```
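
Rearranging the variance formula gives a simple method-of-moments estimate of r from data: r ≈ μ² / (Var − μ). The sketch below is a rough moment-based estimate, not a likelihood-based fit; the seed, true parameters, and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
true_mean, true_r = 20.0, 2.0             # illustrative values
p = true_r / (true_r + true_mean)
samples = rng.negative_binomial(true_r, p, size=200_000)

mu_hat = samples.mean()
var_hat = samples.var(ddof=1)

# Invert Var = mu + mu^2 / r; only meaningful when the sample is overdispersed
if var_hat > mu_hat:
    r_hat = mu_hat**2 / (var_hat - mu_hat)
else:
    r_hat = np.inf                        # no detectable overdispersion: Poisson-like
print(f"mean = {mu_hat:.2f}, var = {var_hat:.2f}, estimated r = {r_hat:.2f}")
```

With small samples this estimator is noisy, and it breaks down when the sample variance does not exceed the sample mean, hence the guard.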

**Scipy parameterization:**
```
n = r (dispersion)
p = r / (r + μ)
```

**Poisson-Gamma mixture:**
```
λ ~ Gamma(r, μ/r)  [shape, scale]
X | λ ~ Poisson(λ)
X ~ NegBinom(μ, r)
```

### When to Use What

| Situation | Distribution |
|-----------|-------------|
| Var ≈ Mean | Poisson |
| Var > Mean | Negative Binomial |
| Gene expression | Negative Binomial |
| Technical replicates only | Poisson (often adequate) |
| Biological replicates | Negative Binomial |

In [None]:
# Helper functions for future use

def nb_summary(mean, r):
    """Print summary statistics for Negative Binomial."""
    var = mean + mean**2 / r
    n, p = nb_mean_disp_to_scipy(mean, r)
    
    print(f"Negative Binomial Summary:")
    print(f"  Mean (μ):        {mean:.4f}")
    print(f"  Dispersion (r):  {r:.4f}")
    print(f"  Variance:        {var:.4f}  (= μ + μ²/r)")
    print(f"  Var/Mean:        {var/mean:.4f}  (overdispersion factor)")
    print(f"  Scipy params:    n={n:.4f}, p={p:.4f}")
    
    if r > 10:
        print(f"  Note: r > 10, close to Poisson")
    elif r < 1:
        print(f"  Note: r < 1, highly overdispersed")

# Example
nb_summary(mean=50, r=2)

---

## Summary

### The Key Insight

**Negative Binomial = Poisson with Gamma-distributed rate**

This explains overdispersion:
- Poisson: every observation has the same rate
- NB: each observation has its own rate (from a Gamma)
- Extra variability in rates → extra variability in counts

### Why This Matters for Biology

1. **Gene expression is overdispersed** due to biological variability
2. **Poisson underestimates variance** → inflated false positives
3. **NB is the standard** count model in DESeq2, edgeR, scVI, etc.
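
The second point can be illustrated with a small simulation. This is only a sketch under stated assumptions: two groups drawn from the same Negative Binomial (so every rejection is a false positive), compared with a simple Wald z-test that assumes Poisson variance. The test, group size, and parameters are illustrative choices, not the procedure used by any of the tools above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # arbitrary seed
n_tests, n_per_group = 2000, 10
mean, r = 50.0, 2.0
p = r / (r + mean)

# Both groups come from the same NB distribution: the null hypothesis is true
a = rng.negative_binomial(r, p, size=(n_tests, n_per_group))
b = rng.negative_binomial(r, p, size=(n_tests, n_per_group))

mean_a, mean_b = a.mean(axis=1), b.mean(axis=1)
pooled = (mean_a + mean_b) / 2

# Standard error of the difference in means if Var = Mean (Poisson assumption)
se_poisson = np.sqrt(2 * pooled / n_per_group)
z = (mean_a - mean_b) / se_poisson
pvals = 2 * stats.norm.sf(np.abs(z))

false_positive_rate = (pvals < 0.05).mean()
print(f"False positive rate at nominal 0.05: {false_positive_rate:.2f}")
```

Because the Poisson standard error is far too small for these data, the rejection rate lands well above the nominal 5%.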

### Remember

- **Small r** → high overdispersion (Var >> Mean)
- **Large r** → approaches Poisson (Var ≈ Mean)
- **r → ∞** → exactly Poisson