# Summarize 10X `STARsolo` alignments
This Python Jupyter notebook summarizes the results of the `STARsolo` alignments.
It does this by aggregating the statistics for all samples and then making some summary plots.

## Parameters for notebook
First, set the parameters for the notebook in the next cell.
That cell is tagged as a `parameters` cell to enable [papermill parameterization](https://papermill.readthedocs.io/en/latest/usage-parameterize.html):

In [None]:
# parameters cell; for the notebook to run, this cell must define:
#  - samples_10x: list of samples that were processed with `STARsolo`
#  - input_summary: list of `STARsolo` summary CSV files, one per sample
#  - input_umi_per_cell: list of `STARsolo` UMI-counts-per-cell files, one per sample
# all three lists must be in the same sample order, since later cells pair them with zip()
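
As an illustration only, a filled-in parameters cell might look like the sketch below. The sample names and file paths are hypothetical placeholders, not values from this pipeline, and the length check is an addition that is not part of the original notebook. It guards against the three lists falling out of step, because `zip` silently truncates to the shortest list.

```python
# Hypothetical example values; papermill would normally inject the real ones.
samples_10x = ['sample_A', 'sample_B']
input_summary = ['results/sample_A/Summary.csv',
                 'results/sample_B/Summary.csv']
input_umi_per_cell = ['results/sample_A/UMIperCellSorted.txt',
                      'results/sample_B/UMIperCellSorted.txt']

# zip() stops at the shortest list, so check the lengths up front
n_samples = len(samples_10x)
assert len(input_summary) == n_samples, 'need one summary file per sample'
assert len(input_umi_per_cell) == n_samples, 'need one UMI file per sample'
```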

## Import Python modules
We use [plotnine](https://plotnine.readthedocs.io/) for ggplot2-style plotting:

In [None]:
from IPython.display import display, HTML
import pandas as pd
from plotnine import *

Set the [plotnine theme](https://plotnine.readthedocs.io/en/stable/api.html#themes):

In [None]:
_ = theme_set(theme_classic)

## Aggregate `STARsolo` stats
Read in the `STARsolo` stats for each sample:

In [None]:
print('Reading STARsolo stats from:\n\t' + '\n\t'.join(input_summary))
stats = pd.concat([(pd.read_csv(f, names=['statistic', 'value'])
                    .assign(sample=sample)
                    )
                   for f, sample in zip(input_summary, samples_10x)
                   ])

display(HTML(stats
             .pivot_table(index='statistic', values='value', columns='sample')
             .to_html()
             ))
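
To make the read, tag, and pivot pattern concrete, here is a minimal sketch on toy data. The two in-memory CSVs and their values are invented for illustration and stand in for real `STARsolo` summary files. The `toy_` names are hypothetical.

```python
import io
import pandas as pd

# Toy stand-ins for two samples' summary files (values invented)
toy_csvs = {
    'sample_A': 'Estimated Number of Cells,100\nMean Reads per Cell,5000\n',
    'sample_B': 'Estimated Number of Cells,250\nMean Reads per Cell,4000\n',
}

# Same pattern as the cell above: read headerless CSVs, tag each by sample, stack
toy_stats = pd.concat(
    [pd.read_csv(io.StringIO(text), names=['statistic', 'value'])
       .assign(sample=sample)
     for sample, text in toy_csvs.items()],
    ignore_index=True,
)

# Long to wide: one row per statistic, one column per sample
toy_wide = toy_stats.pivot_table(index='statistic', values='value',
                                 columns='sample')
print(toy_wide)
```

The long format (one row per sample and statistic) is what the plotting cells use. The wide table is only for display.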

## Plot cells per sample
The number of cell barcodes called as cells for each sample:

In [None]:
# get cells per sample
cells_per_sample = (
    stats
    .query('statistic == "Estimated Number of Cells"')
    .rename(columns={'value': 'cells'})
    [['sample', 'cells']]
    .assign(cells=lambda x: x['cells'].astype(int),
            name=lambda x: x['sample'] + ' (' + x['cells'].astype(str) + ' cells)')
    )

p = (ggplot(cells_per_sample, aes('sample', 'cells')) +
     geom_bar(stat='identity') +
     geom_text(aes(label='cells'), va='bottom') +
     scale_y_continuous(limits=(0, 1.07 * cells_per_sample['cells'].max())) +
     theme(figure_size=(0.4 * (1 + len(cells_per_sample)), 2.5),
           axis_text_x=element_text(angle=90))
     )
_ = p.draw()

## Knee plot of calling valid cells
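Make a [knee plot](https://liorpachter.wordpress.com/tag/knee-plot) of the number of UMIs per cell barcode against barcode rank. Barcodes ranked up to the estimated number of cells (dashed line) are colored as called cells, and the drop in UMI counts is meant to distinguish true cells from empty droplets. The rank is taken from row order, so this assumes each input file lists UMI counts in descending order: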

In [None]:
print('Reading UMIs per cell barcode from:\n\t' +
      '\n\t'.join(input_umi_per_cell))
umis = pd.concat([(pd.read_csv(f, names=['number of UMIs'])
                   .assign(cell_barcode_rank=lambda x: x.index + 1,
                           sample=sample)
                   )
                  for f, sample in zip(input_umi_per_cell, samples_10x)
                  ])

# annotate cell barcodes by whether they are cells, and add sample name with n cells
umis = (umis
        .merge(cells_per_sample, on='sample')
        .assign(is_cell=lambda x: x['cell_barcode_rank'] <= x['cells'])
        )

p = (ggplot(umis, aes('cell_barcode_rank', 'number of UMIs',
                      color='is_cell')) +
     geom_path() +
     facet_wrap('~ name', nrow=1) +
     theme(figure_size=(3 * umis['name'].nunique(), 2.5)) +
     scale_x_log10() +
     scale_y_log10() +
     geom_vline(aes(xintercept='cells'), data=cells_per_sample,
                linetype='dashed', color='#56B4E9') +
     scale_color_manual(values=['#000000', '#E69F00'])
     )
_ = p.draw()
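
The annotation step can be checked on a small example. This is a sketch with invented UMI counts and an assumed estimate of 3 cells. The `toy_` names are hypothetical, and the logic mirrors the merge and rank comparison used above.

```python
import pandas as pd

# Toy UMI counts, already sorted in descending order (values invented)
toy_umis = pd.DataFrame({'number of UMIs': [900, 850, 800, 40, 12, 3]})
toy_umis = toy_umis.assign(cell_barcode_rank=lambda x: x.index + 1,
                           sample='sample_A')

# Assume the summary reported 3 estimated cells for this sample
toy_cells = pd.DataFrame({'sample': ['sample_A'], 'cells': [3]})

# A barcode is flagged as a cell if its rank is within the top `cells` barcodes
toy_annotated = (toy_umis
                 .merge(toy_cells, on='sample')
                 .assign(is_cell=lambda x: x['cell_barcode_rank'] <= x['cells'])
                 )
print(toy_annotated[['cell_barcode_rank', 'number of UMIs', 'is_cell']])
```

Because the flag depends only on rank, an input file that is not sorted by UMI count would mislabel barcodes without raising an error.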

## Plot average genes, reads, UMIs per cell
Plot the average number of genes, reads, and UMIs per cell among the called cells:

In [None]:
p = (ggplot(stats.loc[stats['statistic'].str.contains('per Cell')],
            aes('statistic', 'value')) +
     geom_bar(stat='identity') +
     facet_wrap('~ sample', nrow=1) +
     ylab('count per cell') +
     theme(figure_size=(2.5 * stats['sample'].nunique(), 2),
           axis_text_x=element_text(angle=90))
     )
_ = p.draw()

## Plot mapping rates
Plot the read mapping rates, along with the fraction of reads with valid barcodes, as reported by `STARsolo`:

In [None]:
p = (ggplot(stats.loc[stats['statistic'].str.contains('Reads Mapped to|Reads With Valid')],
            aes('statistic', 'value')) +
     geom_bar(stat='identity') +
     facet_wrap('~ sample', nrow=1) +
     ylab('fraction of reads') +
     theme(figure_size=(2 * stats['sample'].nunique(), 2),
           axis_text_x=element_text(angle=90)) +
     expand_limits(y=(0, 1))
     )
_ = p.draw()
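
The row filters in the last two plots rely on `str.contains` treating its pattern as a regular expression by default, so `|` means "either". A minimal sketch follows. The statistic names are illustrative examples modeled on the patterns above, not a complete list of what `STARsolo` writes.

```python
import pandas as pd

# Illustrative statistic names (not an exhaustive STARsolo list)
toy_names = pd.Series(['Reads With Valid Barcodes',
                       'Reads Mapped to Genome: Unique',
                       'Estimated Number of Cells',
                       'Mean Reads per Cell'])

# Regex alternation: keep names matching either phrase
is_mapping_stat = toy_names.str.contains('Reads Mapped to|Reads With Valid')

# Plain substring match selects the per-cell averages
is_per_cell_stat = toy_names.str.contains('per Cell')

print(toy_names[is_mapping_stat].tolist())
print(toy_names[is_per_cell_stat].tolist())
```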