# Single-cell virus sequencing
This example shows how to use [alignparse](https://jbloomlab.github.io/alignparse/) to process full-length PacBio CCSs of viral cDNAs generated in single-cell virus sequencing of influenza-infected cells.
Specifically, it processes a snippet of data taken from [Russell et al, 2019](https://jvi.asm.org/content/93/14/e00500-19), which was generated by PacBio sequencing of viral cDNAs with unique molecular identifiers (UMIs) and cell barcodes added using a 10X Chromium (v2 reagents).
This notebook aligns these CCSs to the influenza transcripts and parses the barcodes / UMIs and viral mutations.

## Set up for analysis
Import necessary Python modules.
We use [alignparse](https://jbloomlab.github.io/alignparse/) for most of the operations and [plotnine](https://plotnine.readthedocs.io) for ggplot2-like plotting:

In [None]:
import os
import warnings

import numpy

import pandas as pd

from plotnine import *

import alignparse.ccs
import alignparse.consensus
import alignparse.minimap2
import alignparse.targets
from alignparse.constants import CBPALETTE

Suppress warnings that clutter output:

In [None]:
warnings.simplefilter("ignore")

Directory for output:

In [None]:
outdir = "./output_files/"
os.makedirs(outdir, exist_ok=True)

## Target amplicons
The alignment targets consist of PCR amplicons covering the influenza virus mRNAs (WSN strain), with a 5' terminus corresponding to the primer binding site, the 3' polyA tail used for reverse transcription, the UMI and cell barcode, and the common 3' terminus.
There are also variant tags distinguishing the synonymously tagged viral variants (see [Russell et al, 2019](https://jvi.asm.org/content/93/14/e00500-19)).

First, let's look at the Genbank file holding the targets.
The targets are defined in [Genbank Flat File format](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/).
Below we show the first 85 lines (the actual file is quite large as it defines all 10 influenza mRNAs):

In [None]:
targetfile = "input_files/flu_WSN_amplicons.gb"

nlines_to_show = 85
with open(targetfile) as f:
    print("".join(next(f) for _ in range(nlines_to_show)))
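
The `next(f)` approach above assumes the file has at least `nlines_to_show` lines; on a shorter file the exhausted iterator would raise an error rather than print what is there. A minimal sketch of a more forgiving alternative using `itertools.islice` is below. The helper name `head` and the throwaway file are hypothetical and only serve the illustration:

```python
import itertools
import os
import tempfile


def head(path, nlines):
    """Return the first `nlines` lines of the file at `path` as one string.

    If the file has fewer lines, all of them are returned.
    """
    with open(path) as f:
        return "".join(itertools.islice(f, nlines))


# Demonstrate on a throwaway 3-line file (not one of the notebook inputs).
with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, "toy.txt")
    with open(toy_path, "w") as f:
        f.write("line 1\nline 2\nline 3\n")
    first_two = head(toy_path, 2)
    all_lines = head(toy_path, 85)  # asks for more lines than exist

print(first_two)
```

For the target file here, `print(head(targetfile, nlines_to_show))` should show the same lines as the cell above.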

We also have the YAML file specifying how to parse the features in the alignments.
Note how this file uses the YAML syntax for defaults to avoid repeating the same specifications for each target:

In [None]:
feature_parse_specs_file = "input_files/flu_WSN_feature_parse_specs.yaml"
with open(feature_parse_specs_file) as f:
    print(f.read())
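
The "YAML syntax for defaults" refers to anchors (`&name`) and merge keys (`<<: *name`). The sketch below shows how that mechanism behaves when parsed with PyYAML. The key names (`default_spec`, `geneA`, `geneB`, `max_mutations`) are hypothetical and are not taken from the actual specs file:

```python
import yaml  # PyYAML

# Toy YAML: `default_spec` is defined once and merged into each gene.
toy_yaml = """
default_spec: &default_spec
  filter:
    max_mutations: 2
  return: [sequence]

geneA:
  <<: *default_spec

geneB:
  <<: *default_spec
  filter:
    max_mutations: 5
"""

specs = yaml.safe_load(toy_yaml)

# geneA inherits everything; geneB overrides the `filter` entry.
print(specs["geneA"])
print(specs["geneB"])
# The key holding the defaults is still present at the top level.
print(sorted(specs))
```

Two points are worth noting. The merge is shallow, so an overriding key replaces the whole inherited value rather than combining with it. And the key that holds the defaults remains in the parsed mapping alongside the real entries, which is presumably why keys that only define defaults need to be ignored when the specs are read.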

We now read the targets into an [alignparse.targets.Targets](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets) object with the feature-parsing specs (ignoring the keys that define the defaults):

In [None]:
targets = alignparse.targets.Targets(
    seqsfile=targetfile,
    feature_parse_specs=feature_parse_specs_file,
    ignore_feature_parse_specs_keys=["default_2tags", "default_4tags"],
    allow_extra_features=True,
)

Plot the [Targets](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets):

In [None]:
_ = targets.plot(ax_width=10)

Several things are worth noting about how these targets are defined.
They all relate to the fact that the most difficult region to parse is the polyA tail and the UMI / cell barcode. The polyA tail is a homopolymer, so it is not sequenced very accurately, and it also appears to be somewhat impure in the actual primers, so it is difficult to align correctly.
This is compounded by the fact that we don't know the actual UMI sequence, and so don't want indels in the polyA tail to "bleed" into the UMI or gene alignment.
Therefore:

 - We define a separate feature for the final nucleotides (last 3) of the mRNA upstream of the sequenced mRNA, so polyA indels don't get assigned to the sequenced mRNA too often.
 - We define the polyA to be just 28 nucleotides in the alignment even though the length in the primer is 30, so that deletions don't affect the UMI.
 
These choices empirically help with the alignments.

## PacBio CCSs
Now let's look at the PacBio CCSs.
First, define a data frame with the information on the PacBio runs (here we just have one):

In [None]:
pacbio_runs = pd.DataFrame(
    {"name": ["flu_WSN"], "fastq": ["input_files/flu_WSN_ccs.fastq"]}
)

pacbio_runs

Create an [alignparse.ccs.Summaries](https://jbloomlab.github.io/alignparse/alignparse.ccs.html#alignparse.ccs.Summaries) object:

In [None]:
ccs_summaries = alignparse.ccs.Summaries(pacbio_runs, report_col=None)

Statistics on the CCSs (length, number of subread passes, quality):

In [None]:
# NBVAL_IGNORE_OUTPUT
for stat in ["length", "passes", "accuracy"]:
    if ccs_summaries.has_stat(stat):
        p = ccs_summaries.plot_ccs_stats(stat)
        p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
        _ = p.draw()
    else:
        print(f"No {stat} statistics available.")
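
For readers unfamiliar with how base qualities relate to an accuracy, the sketch below converts a FASTQ quality string into a mean per-base accuracy using the standard Phred relationship (error probability `10 ** (-Q / 10)`, Sanger offset of 33). Whether `alignparse` computes its accuracy statistic in exactly this way is an assumption, and the function name is hypothetical:

```python
def phred_string_to_accuracy(qual, offset=33):
    """Mean per-base accuracy implied by a FASTQ quality string.

    Each character encodes Q = ord(char) - offset, and the error
    probability for that base is 10 ** (-Q / 10).
    """
    if not qual:
        raise ValueError("empty quality string")
    error_probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return 1 - sum(error_probs) / len(error_probs)


acc_q40 = phred_string_to_accuracy("IIII")  # 'I' encodes Q40
acc_mixed = phred_string_to_accuracy("I5")  # Q40 and Q20

print(acc_q40, acc_mixed)
```

Averaging error probabilities rather than Q-values matters: a single low-quality base pulls the accuracy down much more than averaging the Q-values would suggest.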

## Align and parse
First, we create an [alignparse.minimap2.Mapper](https://jbloomlab.github.io/alignparse/alignparse.minimap2.html#alignparse.minimap2.Mapper) to run [minimap2](https://github.com/lh3/minimap2), which is used for the alignments. We use [minimap2](https://github.com/lh3/minimap2) options that are tailored for viruses with potentially large internal deletions (which are handled like spliced introns); these options are specified by [alignparse.minimap2.OPTIONS_VIRUS_W_DEL](https://jbloomlab.github.io/alignparse/alignparse.minimap2.html#alignparse.minimap2.OPTIONS_VIRUS_W_DEL):

In [None]:
# NBVAL_IGNORE_OUTPUT

mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_VIRUS_W_DEL)

print(
    f"Using `minimap2` {mapper.version} with these options:\n"
    + " ".join(mapper.options)
)
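
To make "handled like spliced introns" concrete: in SAM output, a skipped region of the reference is conventionally written as an `N` operation in the CIGAR string, whereas an ordinary deletion is written as `D`. The sketch below scans a made-up CIGAR string for long reference gaps. It is only an illustration of the encoding, not of how `alignparse` parses alignments internally, and the function name and threshold are hypothetical:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance the reference position


def large_ref_gaps(cigar, min_len=10):
    """List (ref_start, length, op) for each D or N operation of at least `min_len`.

    `ref_start` is a 0-based offset from the start of the alignment.
    """
    gaps = []
    ref_pos = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "DN" and length >= min_len:
            gaps.append((ref_pos, length, op))
        if op in REF_CONSUMING:
            ref_pos += length
    return gaps


# Made-up alignment: a 500-nt intron-like gap and a small 2-nt deletion.
toy_gaps = large_ref_gaps("100M500N200M2D50M")
print(toy_gaps)
```

Here only the 500-nucleotide `N` gap is reported; the 2-nucleotide deletion falls below the threshold.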

Now use [Targets.align_and_parse](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets.align_and_parse) to align and parse the CCSs.
First, define the path to the output directory:

In [None]:
align_and_parse_outdir = os.path.join(outdir, "flu_WSN_align_and_parse")

Now do the alignments and parsing:

In [None]:
readstats, aligned, filtered = targets.align_and_parse(
    df=pacbio_runs,
    mapper=mapper,
    outdir=align_and_parse_outdir,
    name_col="name",
    queryfile_col="fastq",
    overwrite=True,  # overwrite any existing output
    ncpus=-1,  # use all available CPUs
)

First, let's look at the read alignment stats:

In [None]:
readstats

Plot these stats:

In [None]:
# NBVAL_IGNORE_OUTPUT
p = (
    ggplot(
        readstats.assign(
            category=lambda x: pd.Categorical(
                x["category"], x["category"].unique(), ordered=True
            ),
            is_aligned=lambda x: x["category"].str.contains("aligned"),
        ),
        aes("category", "count", fill="is_aligned"),
    )
    + geom_bar(stat="identity")
    + facet_wrap("~ name", nrow=1)
    + theme(
        axis_text_x=element_text(angle=90),
        figure_size=(0.3 * len(readstats), 2.5),
        panel_grid_major_x=element_blank(),  # no vertical grid lines
    )
    + scale_fill_manual(values=CBPALETTE)
)
_ = p.draw()
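
It can also help to reduce these stats to a single number per run. The sketch below computes the fraction of reads in an "aligned" category, using a toy data frame with the same `name`, `category`, and `count` columns used in the plot above. The run names, category labels, and counts are made up for illustration:

```python
import pandas as pd

# Toy stand-in for `readstats`; all values here are invented.
toy_readstats = pd.DataFrame(
    {
        "name": ["run1", "run1", "run1", "run2", "run2", "run2"],
        "category": [
            "aligned geneA",
            "filtered geneA",
            "unmapped",
            "aligned geneA",
            "aligned geneB",
            "unmapped",
        ],
        "count": [60, 30, 10, 20, 20, 10],
    }
)

# Same test as in the plot: a category counts as aligned if it contains "aligned".
is_aligned = toy_readstats["category"].str.contains("aligned")
aligned_counts = toy_readstats[is_aligned].groupby("name")["count"].sum()
total_counts = toy_readstats.groupby("name")["count"].sum()
frac_aligned = (aligned_counts / total_counts).rename("frac_aligned")

print(frac_aligned)
```

Replacing `toy_readstats` with `readstats` should give the corresponding fraction for the actual run, assuming its category labels follow the same pattern.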

Now let's look at the reasons that aligned reads were filtered.
Recall that `filtered` holds a data frame for each target:

In [None]:
for target in targets.target_names[:1]:
    print(f"First few lines of `filtered` for {target}:")
    display(filtered[target].head())

Concatenate the information for all of the targets and plot reasons why mapped reads were filtered:

In [None]:
# NBVAL_IGNORE_OUTPUT
p = (
    ggplot(
        pd.concat([df.assign(gene=gene) for gene, df in filtered.items()]).assign(
            gene=lambda x: pd.Categorical(x["gene"], x["gene"].unique(), ordered=True)
        ),
        aes("filter_reason"),
    )
    + geom_bar()
    + facet_wrap("~ gene", ncol=5)
    + theme(
        axis_text_x=element_text(angle=90),
        figure_size=(9, 4),
        panel_grid_major_x=element_blank(),  # no vertical grid lines
    )
)

_ = p.draw()
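
The same concatenation can be summarized as a table rather than a plot. The sketch below tallies filter reasons per gene from a toy dictionary of data frames shaped like `filtered`. Only the `filter_reason` column name comes from the code above; the gene names, read names, and reason labels are hypothetical:

```python
import pandas as pd

# Toy stand-in for `filtered`: one data frame per target.
toy_filtered = {
    "geneA": pd.DataFrame(
        {
            "query_name": ["r1", "r2", "r3"],
            "filter_reason": ["reason_x", "reason_x", "reason_y"],
        }
    ),
    "geneB": pd.DataFrame(
        {"query_name": ["r4"], "filter_reason": ["reason_y"]}
    ),
}

# Rows are genes, columns are filter reasons, values are read counts.
reason_counts = (
    pd.concat(
        [df.assign(gene=gene) for gene, df in toy_filtered.items()],
        ignore_index=True,
    )
    .groupby(["gene", "filter_reason"])
    .size()
    .unstack(fill_value=0)
)

print(reason_counts)
```

A table like this makes it easier to spot whether one filter dominates for a particular gene.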

Now let's look at the first few aligned reads for the first few genes:

In [None]:
for target in targets.target_names[:4]:
    print(f"\nFirst few entries in `aligned` for {target}:")
    display(aligned[target].head())