<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Calculation-of-multiplet-frequency-from-cell-type-mixing-in-Python" data-toc-modified-id="Calculation-of-multiplet-frequency-from-cell-type-mixing-in-Python-1">Calculation of multiplet frequency from cell-type mixing in <code>Python</code></a></span><ul class="toc-item"><li><span><a href="#Function-to-compute-multiplet-frequency" data-toc-modified-id="Function-to-compute-multiplet-frequency-1.1">Function to compute multiplet frequency</a></span></li><li><span><a href="#Python-function" data-toc-modified-id="Python-function-1.2">Python function</a></span></li><li><span><a href="#Example-when-cell-types-mixed-in-equal-proportion" data-toc-modified-id="Example-when-cell-types-mixed-in-equal-proportion-1.3">Example when cell types mixed in equal proportion</a></span></li><li><span><a href="#Example-when-cell-types-are-mixed-unequally" data-toc-modified-id="Example-when-cell-types-are-mixed-unequally-1.4">Example when cell types are mixed unequally</a></span></li><li><span><a href="#Write-results-to-LaTex-tables-for-paper" data-toc-modified-id="Write-results-to-LaTex-tables-for-paper-1.5">Write results to LaTex tables for paper</a></span></li></ul></li></ul></div>

# Calculation of multiplet frequency from cell-type mixing in `Python`
Here we implement a simple function to calculate the multiplet frequency in single-cell RNA sequencing experiments in which cells of two types (e.g., human and mouse) have been mixed and the number of observed droplets containing cells of each type is known.

## Function to compute multiplet frequency
The multiplet frequency is computed in terms of the following three experimental observables:
  - the number of droplets that contain at least one cell of type 1, which we denote as $N_1$
  - the number of droplets that contain at least one cell of type 2, which we denote as $N_2$
  - the number of droplets containing cells of both type 1 and type 2, which we denote as $N_{12}$

The multiplet frequency $M$ is
$$M = 1 - \frac{\left(\mu_1 + \mu_2\right)e^{-\mu_1 - \mu_2}}{1 - e^{-\mu_1 - \mu_2}}$$
where
$$\mu_1 = -\ln\left(\frac{N - N_1}{N}\right)$$
and
$$\mu_2 = -\ln\left(\frac{N - N_2}{N}\right),$$
and where
$$N = \frac{N_1 N_2}{N_{12}}.$$
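
As a brief sketch of where these expressions come from, assume (this is our reading of the Poisson model, not a full derivation) that cells of each type load into droplets independently, with Poisson-distributed counts of mean $\mu_1$ and $\mu_2$ per droplet, and that $N$ is the total number of droplets including the unobserved empty ones. Independence of the two cell types gives
$$\frac{N_{12}}{N} = \frac{N_1}{N} \times \frac{N_2}{N} \quad\Longrightarrow\quad N = \frac{N_1 N_2}{N_{12}},$$
the probability that a droplet has no cell of type $i$ gives
$$e^{-\mu_i} = \frac{N - N_i}{N} \quad\Longrightarrow\quad \mu_i = -\ln\left(\frac{N - N_i}{N}\right),$$
and, because the total cell count per droplet is then Poisson with mean $\mu_1 + \mu_2$, the multiplet frequency is one minus the probability that a non-empty droplet holds exactly one cell:
$$M = 1 - \frac{\Pr\left(\text{exactly one cell}\right)}{\Pr\left(\text{at least one cell}\right)} = 1 - \frac{\left(\mu_1 + \mu_2\right)e^{-\mu_1 - \mu_2}}{1 - e^{-\mu_1 - \mu_2}}.$$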

## Python function
Here is the calculation implemented as a Python function:

In [1]:
import numpy

def multipletFreq(n1, n2, n12):
    """Estimated multiplet frequency from cell-type mixing experiment.

    `n1`, `n2`, `n12` (`int` or `numpy.ndarray` of integers)
        Number of droplets with at least one cell of type 1,
        at least one cell of type 2, or cells of both types.
    """
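    # total number of droplets, including empty ones: N = N1 * N2 / N12
    n = numpy.array(n1 * n2 / n12).astype('float')
    # mean number of cells of each type per droplet
    mu1 = -numpy.log((n - n1) / n)
    mu2 = -numpy.log((n - n2) / n)
    mu = mu1 + mu2
    # one minus the fraction of non-empty droplets with exactly one cell
    return 1 - mu * numpy.exp(-mu) / (1 - numpy.exp(-mu))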
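
One way to sanity-check the function is to simulate droplet loading directly. The sketch below is not part of the original analysis: it assumes that each cell type loads independently with a Poisson-distributed number of cells per droplet, and the rates `mu1_true` and `mu2_true`, the droplet count, and the random seed are arbitrary illustrative choices. The function is repeated so that the block runs on its own.

```python
import numpy

def multipletFreq(n1, n2, n12):
    """Same calculation as above, repeated so this block is self-contained."""
    n = numpy.array(n1 * n2 / n12).astype('float')
    mu1 = -numpy.log((n - n1) / n)
    mu2 = -numpy.log((n - n2) / n)
    mu = mu1 + mu2
    return 1 - mu * numpy.exp(-mu) / (1 - numpy.exp(-mu))

# Illustrative (assumed) simulation settings
rng = numpy.random.default_rng(0)
n_droplets = 200000
mu1_true, mu2_true = 0.3, 0.1

# Number of cells of each type in every droplet
cells1 = rng.poisson(mu1_true, size=n_droplets)
cells2 = rng.poisson(mu2_true, size=n_droplets)

# The three observables used by multipletFreq
n1 = int((cells1 > 0).sum())
n2 = int((cells2 > 0).sum())
n12 = int(((cells1 > 0) & (cells2 > 0)).sum())

# Multiplet frequency counted directly from the simulated droplets
total = cells1 + cells2
true_freq = (total > 1).sum() / (total > 0).sum()
est_freq = float(multipletFreq(n1, n2, n12))

print("counted directly: {0:.3f}, estimated: {1:.3f}".format(true_freq, est_freq))
```

With this many simulated droplets the two numbers should agree closely; with far fewer droplets the estimate becomes noisy because $N_{12}$ is small.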

## Example when cell types mixed in equal proportion
First we demonstrate the calculations in the case when the cell types are mixed in equal proportions.

Let's create some hypothetical data.
We'll imagine that these data come from the [10X cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) software analysis of a multi-species experiment that mixed mouse and human cells equally.
The current version (2.1.1) of the [10X cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) pipeline for this type of study would return the *human Estimated Number of Cell Partitions*, the *mouse Estimated Number of Cell Partitions*, and the number of *GEMs with > 0 Cells* (this last quantity is also reported as the *Estimated Number of Cells*).
These statistics give the numbers of non-empty GEMs with cells of each type (GEMs is the term that 10X uses to refer to their droplets).

Here is a data frame of some hypothetical data from three different experiments, each with 4000 non-empty droplets in total but with different numbers of cross-celltype droplets:

In [2]:
import pandas

df_equal = (pandas.DataFrame({
                'human_droplets':[2005, 2050, 2500],
                'mouse_droplets':[2005, 2050, 2500],
                'nonempty_droplets':[4000, 4000, 4000]
                },
                index=numpy.arange(3) + 1)
            .rename_axis('experiment')
            )

df_equal


| experiment | human_droplets | mouse_droplets | nonempty_droplets |
|---|---|---|---|
| 1 | 2005 | 2005 | 4000 |
| 2 | 2050 | 2050 | 4000 |
| 3 | 2500 | 2500 | 4000 |


We calculate the number of droplets with cells of **both** types (human and mouse) simply as the sum of the human and mouse droplets minus the total number of non-empty droplets, since these cross-celltype droplets are double counted in the tally of human and mouse droplets.
As is apparent after this calculation, in this hypothetical example the cross-celltype droplets represent 0.25%, 2.5%, and 25% of the total non-empty droplets in the three examples.

In [3]:
df_equal = df_equal.assign(human_and_mouse_droplets=lambda x:
        x.human_droplets + x.mouse_droplets - x.nonempty_droplets)

df_equal

| experiment | human_droplets | mouse_droplets | nonempty_droplets | human_and_mouse_droplets |
|---|---|---|---|---|
| 1 | 2005 | 2005 | 4000 | 10 |
| 2 | 2050 | 2050 | 4000 | 100 |
| 3 | 2500 | 2500 | 4000 | 1000 |


We now calculate the multiplet frequency in two ways.
The first way is to precisely calculate the multiplet frequency using the exact Poisson derivation as implemented in the `multipletFreq` function above.

The second way is the simple calculation that has commonly been used in papers that do equal-proportion mixing.
This method estimates the multiplet frequency as twice the frequency of cross-celltype droplets among all non-empty droplets (that is, as $\frac{2 \times N_{12}}{N_1 + N_2 - N_{12}}$, since the number of non-empty droplets is $N_1 + N_2 - N_{12}$).

In [4]:
df_equal = (df_equal
            .assign(multiplet_freq=lambda x: 
                multipletFreq(x.human_droplets,
                              x.mouse_droplets,
                              x.human_and_mouse_droplets))
            .assign(twice_cross_celltype_freq=lambda x:
                2 * x.human_and_mouse_droplets / x.nonempty_droplets)
            )

df_equal.round(3)

| experiment | human_droplets | mouse_droplets | nonempty_droplets | human_and_mouse_droplets | multiplet_freq | twice_cross_celltype_freq |
|---|---|---|---|---|---|---|
| 1 | 2005 | 2005 | 4000 | 10 | 0.005 | 0.005 |
| 2 | 2050 | 2050 | 4000 | 100 | 0.049 | 0.05 |
| 3 | 2500 | 2500 | 4000 | 1000 | 0.425 | 0.5 |


As can be seen above, the two methods give virtually identical results as long as the number of cross-celltype droplets is small relative to the total number of non-empty droplets.
The two methods are not identical because simply calculating the multiplet frequency as twice the cross-celltype frequency neglects droplets that contain more than two cells.
However, this difference only becomes appreciable when the multiplet frequency is high.
In the examples above, it only really matters when the true multiplet frequency is $\approx$0.425; in that case, taking twice the cross-celltype frequency gives 0.5 and so overestimates the true multiplet frequency.
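
To see how the gap between the two methods grows with the loading rate, the following sketch evaluates both quantities from the Poisson model rather than from observed counts. It assumes equal mixing, so that each cell type loads at rate `mu / 2` per droplet; the helper names `exact_multiplet_freq` and `twice_cross_freq` and the chosen values of `mu` are illustrative and not part of the original analysis.

```python
import numpy

def exact_multiplet_freq(mu):
    """Multiplet frequency for a total Poisson loading rate mu per droplet."""
    return 1 - mu * numpy.exp(-mu) / (1 - numpy.exp(-mu))

def twice_cross_freq(mu):
    """Twice the expected cross-celltype fraction, assuming equal mixing."""
    p_type = 1 - numpy.exp(-mu / 2)   # P(droplet has >= 1 cell of a given type)
    p_nonempty = 1 - numpy.exp(-mu)   # P(droplet has >= 1 cell of any type)
    return 2 * p_type ** 2 / p_nonempty

# Illustrative (assumed) total loading rates per droplet
mus = numpy.array([0.01, 0.1, 0.5, 1.0, 2.0])
exact = exact_multiplet_freq(mus)
approx = twice_cross_freq(mus)

for mu, e, a in zip(mus, exact, approx):
    print("mu = {0:4.2f}: exact = {1:.4f}, twice cross-celltype = {2:.4f}".format(mu, e, a))
```

Under these assumptions the simple estimate stays close to the exact value at low loading rates and drifts above it as the rate increases.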

## Example when cell types are mixed unequally
Now we repeat the example above, but for experiments where the cell types are mixed unequally.

Below we give some example calculations for a range of mixing ratios.
An interesting (and initially non-intuitive) aspect of the results is that when the cells are mixed highly unequally and multiplets are common, the multiplet frequency is actually substantially **less** than the fraction of droplets with the rare cell type that are multiplets.
The reason is that multiplets are more likely than singlets to have a cell of the rarer type, and become progressively more likely to have a cell of the rare type as the number of cells in the multiplet increases.

In [5]:
df_unequal = (pandas.DataFrame({
                  'human_droplets':[2050, 3050, 3550, 3850, 3950],
                  'mouse_droplets':[2050, 1050, 550, 250, 150],
                  'nonempty_droplets':[4000, 4000, 4000, 4000, 4000]
                  },
                  index=numpy.arange(5) + 1)
              .rename_axis('experiment')
              .assign(human_and_mouse_droplets=lambda x:
                  x.human_droplets + x.mouse_droplets - x.nonempty_droplets)
              .assign(multiplet_freq=lambda x:
                  multipletFreq(x.human_droplets,
                                x.mouse_droplets,
                                x.human_and_mouse_droplets))
              )

df_unequal.round(3)

| experiment | human_droplets | mouse_droplets | nonempty_droplets | human_and_mouse_droplets | multiplet_freq |
|---|---|---|---|---|---|
| 1 | 2050 | 2050 | 4000 | 100 | 0.049 |
| 2 | 3050 | 1050 | 4000 | 100 | 0.065 |
| 3 | 3550 | 550 | 4000 | 100 | 0.11 |
| 4 | 3850 | 250 | 4000 | 100 | 0.245 |
| 5 | 3950 | 150 | 4000 | 100 | 0.459 |
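
The contrast between the overall multiplet frequency and the multiplet fraction among rare-type droplets can be checked for the most unequal case (3950 human droplets, 150 mouse droplets, 100 cross-celltype droplets). The sketch below is an illustration under the same Poisson assumptions as `multipletFreq`; the expression for `mouse_multiplet` is our own rearrangement of that model and does not appear in the original analysis.

```python
import numpy

# Counts for the most unequal hypothetical experiment
n1, n2, n12 = 3950, 150, 100   # human, mouse, and human-and-mouse droplets

# Same intermediate quantities as in multipletFreq
n = n1 * n2 / n12
mu1 = -numpy.log((n - n1) / n)
mu2 = -numpy.log((n - n2) / n)
mu = mu1 + mu2

# Multiplet frequency among all non-empty droplets
overall = 1 - mu * numpy.exp(-mu) / (1 - numpy.exp(-mu))

# Multiplet frequency among droplets with >= 1 mouse cell:
# 1 - P(exactly one mouse cell and no human cell) / P(>= 1 mouse cell)
mouse_multiplet = 1 - mu2 * numpy.exp(-mu2) * numpy.exp(-mu1) / (1 - numpy.exp(-mu2))

# Directly observed fraction of mouse droplets that also contain a human cell;
# a lower bound on mouse_multiplet because it ignores mouse-only multiplets
cross_fraction = n12 / n2

print("overall multiplet frequency:        {0:.3f}".format(overall))
print("multiplets among mouse droplets:    {0:.3f}".format(mouse_multiplet))
print("mouse droplets with a human cell:   {0:.3f}".format(cross_fraction))
```

In this example roughly two thirds of the droplets containing the rare (mouse) cell type are multiplets, even though fewer than half of all non-empty droplets are.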


## Write results to LaTex tables for paper
Finally, we write the example results to LaTeX tables to be included in the paper:

In [6]:
for (df, dfname) in [(df_equal, 'equal'), (df_unequal, 'unequal')]:
    f = '{0}_example.tex'.format(dfname)
    print("Writing {0} example data frame to {1}".format(dfname, f))
    ncol = len(df.columns) + 1
    column_format = 'C{0.66in}' * ncol
    (df.round(3)
       .rename(columns={col:col.replace('_', ' ') for col in df.columns})
       .reset_index()
       .to_latex(f, index=False, column_format=column_format)
       )

Writing equal example data frame to equal_example.tex
Writing unequal example data frame to unequal_example.tex
