
This Python Jupyter notebook processes the PacBio circular consensus sequences (CCSs) to extract barcodes and call mutations in the gene. It then builds consensus sequences for barcoded variants from these mutations, yielding a barcode-variant lookup table.

Process CCSs

First, process the PacBio CCSs to extract barcodes and call mutations in the gene.

Define the background for CCS mapping

background = "BA1"

Setup

Import Python modules

Plotting is done with plotnine, which uses ggplot2-like syntax.

The analysis uses the Bloom lab's alignparse and dms_variants packages.

import collections
import math
import os
import re
import time
import warnings

import alignparse
import alignparse.ccs
from alignparse.constants import CBPALETTE
import alignparse.minimap2
import alignparse.targets
import alignparse.consensus

import dms_variants
import dms_variants.codonvarianttable
import dms_variants.plotnine_themes
import dms_variants.utils

from IPython.display import display, HTML

import numpy

import pandas as pd

from plotnine import *

import yaml

Set plotnine theme to the one defined in dms_variants:

theme_set(dms_variants.plotnine_themes.theme_graygrid())

Versions of key software:

print(f"Using alignparse version {alignparse.__version__}")
print(f"Using dms_variants version {dms_variants.__version__}")
Using alignparse version 0.2.4
Using dms_variants version 0.8.9

Ignore warnings that clutter output:

warnings.simplefilter('ignore')

Read the configuration file:

with open('config.yaml') as f:
    config = yaml.safe_load(f)

Make output directory for figures:

os.makedirs(config['process_ccs_dir'], exist_ok=True)

PacBio reads

Read the data frame of PacBio sequencing runs, construct the name and CCS FASTQ file for each run, and keep only the runs for this background:

pacbio_runs = (
    pd.read_csv(config['pacbio_runs'], dtype=str)
    .drop(columns=['ccs'])
    .assign(name=lambda x: x['library'] + '_' + x['bg'] + '_' + x['run'],
            fastq=lambda x: config['ccs_dir'] + '/' + x['name'] + '_ccs.fastq.gz'
            )
    )
 
pacbio_runs.drop(pacbio_runs[pacbio_runs['bg'] != background].index, inplace=True)

display(HTML(pacbio_runs.to_html(index=False)))
| library | bg | run | name | fastq |
| --- | --- | --- | --- | --- |
| pool1 | BA1 | 220524_A | pool1_BA1_220524_A | results/ccs/pool1_BA1_220524_A_ccs.fastq.gz |
| pool2 | BA1 | 220524_A | pool2_BA1_220524_A | results/ccs/pool2_BA1_220524_A_ccs.fastq.gz |

PacBio amplicons

Get the amplicons sequenced by PacBio as the alignment target along with the specs on how to parse the features:

print(f"Reading amplicons from {config['amplicons_' + background]}")
print(f"Reading feature parse specs from {config['feature_parse_specs_' + background]}")

targets = alignparse.targets.Targets(
                seqsfile=config['amplicons_' + background],
                feature_parse_specs=config['feature_parse_specs_' + background])
Reading amplicons from data/PacBio_amplicon_BA1.gb
Reading feature parse specs from data/feature_parse_specs_BA1.yaml

Draw the target amplicons:

fig = targets.plot(ax_width=7,
                   plots_indexing='biopython',  # numbering starts at 0
                   ax_height=2,  # height of each plot
                   hspace=1.2,  # vertical space between plots
                   )

plotfile = os.path.join(config['process_ccs_dir'], 'amplicons_'+background+'.pdf')
print(f"Saving plot to {plotfile}")
fig.savefig(plotfile, bbox_inches='tight')
Saving plot to results/process_ccs/amplicons_BA1.pdf

png

Write out the specs used to parse the features (these are the same specs provided as feature_parse_specs when initializing targets, but with defaults filled in):

print(targets.feature_parse_specs('yaml'))
BA1:
  query_clip5: 4
  query_clip3: 4
  termini5:
    filter:
      clip5: 4
      mutation_nt_count: 1
      mutation_op_count: null
      clip3: 0
    return: []
  gene:
    filter:
      mutation_nt_count: 90
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return:
    - mutations
    - accuracy
  spacer:
    filter:
      mutation_nt_count: 1
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return: []
  barcode:
    filter:
      mutation_nt_count: 0
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return:
    - sequence
    - accuracy
  termini3:
    filter:
      clip3: 4
      mutation_nt_count: 1
      mutation_op_count: null
      clip5: 0
    return: []
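
The parse specs can also be retrieved as a Python dictionary for programmatic checks. As a small sketch (assuming the 'dict' return type that alignparse offers alongside 'yaml'), we can confirm the mutation tolerance for the gene feature shown above:

specs_dict = targets.feature_parse_specs('dict')

# per the YAML above, the gene feature tolerates up to 90 mutated nucleotides
assert specs_dict[background]['gene']['filter']['mutation_nt_count'] == 90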

Align CCSs to amplicons

We now align the CCSs to the amplicon and parse features from the resulting alignments using the specs above.

First, we initialize an alignparse.minimap2.Mapper to align the reads to SAM files:

mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_CODON_DMS)

print(f"Using `minimap2` {mapper.version} with these options:\n" +
      ' '.join(mapper.options))
Using `minimap2` 2.18-r1015 with these options:
-A2 -B4 -O12 -E2 --end-bonus=13 --secondary=no --cs

Next, we use Targets.align_and_parse to create the alignments and parse them:

readstats, aligned, filtered = targets.align_and_parse(
        df=pacbio_runs,
        mapper=mapper,
        outdir=config['process_ccs_dir'],
        name_col='run',
        group_cols=['name', 'library'],
        queryfile_col='fastq',
        overwrite=True,
        ncpus=config['max_cpus'],
        )

First, examine the read stats from the alignment / parsing, extracting the alignment target name from each category and flagging which CCSs aligned validly:

readstats = (
    readstats
    .assign(category_all_targets=lambda x: x['category'].str.split().str[0],
            target=lambda x: x['category'].str.split(None, 1).str[1],
            valid=lambda x: x['category_all_targets'] == 'aligned')
    )
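
To see what this transformation does, here is a small illustration on hypothetical category strings of the form alignparse produces (an aligned or filtered CCS is labeled with its target, while an unmapped CCS has none):

demo = pd.DataFrame({'category': ['aligned BA1', 'filtered BA1', 'unmapped'],
                     'count': [3, 2, 1]})

# the first whitespace-delimited token is the overall category; the remainder
# (if any) is the target, so 'unmapped' gets a target of NaN
demo.assign(category_all_targets=lambda x: x['category'].str.split().str[0],
            target=lambda x: x['category'].str.split(None, 1).str[1],
            valid=lambda x: x['category_all_targets'] == 'aligned')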

Now plot the read stats by run (combining all targets and libraries within a run):

ncol = 2
p = (
    ggplot(readstats
           .groupby(['name', 'category_all_targets', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index(),
           aes('category_all_targets', 'count', fill='valid')) +
    geom_bar(stat='identity') +
    facet_wrap('~ name', ncol=ncol) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(1.85 * min(ncol, len(pacbio_runs)),
                       2 * math.ceil(len(pacbio_runs) / ncol)),
          panel_grid_major_x=element_blank(),
          legend_position='none',
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()

png

And the read stats by library (combining all targets and runs within a library):

p = (
    ggplot(readstats
           .groupby(['library', 'category_all_targets', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index(), 
           aes('category_all_targets', 'count', fill='valid')) +
    geom_bar(stat='identity') +
    facet_wrap('~ library', nrow=1) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(1.5 * pacbio_runs['library'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          legend_position='none',
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()

png

And the number of reads by target (combining all libraries and runs for a target):

p = (
    ggplot(readstats
           .groupby(['target'])
           .aggregate({'count': 'sum'})
           .reset_index(), 
           aes('target', 'count')) +
    geom_point(stat='identity', size=3) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.3 * readstats['target'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          ) +
    scale_y_log10(name='number of reads')
    )
_ = p.draw()

png

And read stats by target (combining all libraries and runs for a target):

p = (
    ggplot(readstats
           .groupby(['target', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index()
           .assign(total=lambda x: x.groupby('target')['count'].transform('sum'),
                   frac=lambda x: x['count'] / x['total'],
                   ), 
           aes('target', 'frac', fill='valid')) +
    geom_bar(stat='identity') +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.5 * readstats['target'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()

png

Now let's see why we filtered the reads. First, we do some transformations on the filtered dict returned by Targets.align_and_parse. Then we count up the number of CCSs filtered for each reason, and group together "unusual" reasons that represent less than some fraction of all filtering. For now, we group together all targets so the stats represent all targets combined:

other_cutoff = 0.02  # group as "other" reasons with <= this frac

filtered_df = (
    pd.concat(df.assign(target=target) for target, df in filtered.items())
    .groupby(['library', 'name', 'run', 'filter_reason'])
    .size()
    .rename('count')
    .reset_index()
    .assign(tot_reason_frac=lambda x: (x.groupby('filter_reason')['count']
                                       .transform('sum')) / x['count'].sum(),
            filter_reason=lambda x: numpy.where(x['tot_reason_frac'] > other_cutoff,
                                                x['filter_reason'],
                                                'other')
            )
    )
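
As a tiny worked example of the relabeling logic (hypothetical filter reasons and counts), a reason accounting for at most 2% of all filtered CCSs gets lumped into 'other':

demo = pd.DataFrame({'filter_reason': ['gene filter', 'spacer filter'],
                     'count': [990, 10]})

# 'gene filter' has tot_reason_frac 0.99 and keeps its name;
# 'spacer filter' has tot_reason_frac 0.01 <= other_cutoff and becomes 'other'
demo.assign(tot_reason_frac=lambda x: (x.groupby('filter_reason')['count']
                                       .transform('sum')) / x['count'].sum(),
            filter_reason=lambda x: numpy.where(x['tot_reason_frac'] > other_cutoff,
                                                x['filter_reason'],
                                                'other'))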

Now plot the filtering reason for all runs:

ncol = 7
nreasons = filtered_df['filter_reason'].nunique()

p = (
    ggplot(filtered_df, aes('filter_reason', 'count')) +
    geom_bar(stat='identity') +
    facet_wrap('~ name', ncol=ncol) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.25 * nreasons * min(ncol, len(pacbio_runs)),
                       2 * math.ceil(len(pacbio_runs) / ncol)),
          panel_grid_major_x=element_blank(),
          )
    )
_ = p.draw()

png

Finally, we take the successfully parsed alignments and read them into a data frame, keeping track of the target that each CCS aligns to. We also drop the pieces of information we won't use going forward, and rename a few columns:

aligned_df = (
    pd.concat(df.assign(target=target) for target, df in aligned.items())
    .drop(columns=['query_clip5', 'query_clip3', 'run','name'])
    .rename(columns={'barcode_sequence': 'barcode'})
    )

print(f"First few lines of information on the parsed alignments:")
display(HTML(aligned_df.head().to_html(index=False)))
First few lines of information on the parsed alignments:
| library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
| --- | --- | --- | --- | --- | --- | --- |
| pool1 | m64272e_220525_133554/1/ccs | G146A T490A C491T A492G | 1.0 | GAGATATTTATCGTAA | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/3/ccs | T184A C186T C327T | 1.0 | CTAAAGCGAATCCAAG | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/14/ccs | A7T C8T A9G | 1.0 | GTACATATGAACAGCA | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/15/ccs | G403T A404C A405T | 1.0 | ATTTACACGATACCCG | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/28/ccs | G92T C93G | 1.0 | GCACCTATCTAATACG | 1.0 | BA1 |

Write valid CCSs

Write the processed CCSs to a file:

aligned_df.to_csv(config['processed_ccs_file' + '_' + background], index=False)

print("Barcodes and mutations for valid processed CCSs "
      f"have been written to {config['processed_ccs_file' + '_' + background]}.")
Barcodes and mutations for valid processed CCSs have been written to results/process_ccs/processed_ccs_BA1.csv.

Next, we analyze these processed CCSs to build the variants.

Build barcode variant table

This part of the notebook builds consensus sequences for barcoded variants from the mutations called in the processed PacBio CCSs, then uses these consensus sequences to build a codon variant table.

Make output directories if needed:

os.makedirs(config['variants_dir'], exist_ok=True)
os.makedirs(config['figs_dir'], exist_ok=True)

Read the CSV file with the processed CCSs into a data frame, display first few lines:

processed_ccs = pd.read_csv(config['processed_ccs_file' + '_' + background], na_filter=None)

nlibs = processed_ccs['library'].nunique()  # number of unique libraries

ntargets = processed_ccs['target'].nunique()  # number of unique targets

print(f"Read {len(processed_ccs)} CCSs from {nlibs} libraries and {ntargets} targets.")

display(HTML(processed_ccs.head().to_html(index=False)))
Read 2319075 CCSs from 2 libraries and 1 targets.
| library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
| --- | --- | --- | --- | --- | --- | --- |
| pool1 | m64272e_220525_133554/1/ccs | G146A T490A C491T A492G | 1.0 | GAGATATTTATCGTAA | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/3/ccs | T184A C186T C327T | 1.0 | CTAAAGCGAATCCAAG | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/14/ccs | A7T C8T A9G | 1.0 | GTACATATGAACAGCA | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/15/ccs | G403T A404C A405T | 1.0 | ATTTACACGATACCCG | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/28/ccs | G92T C93G | 1.0 | GCACCTATCTAATACG | 1.0 | BA1 |

Overall statistics on number of total CCSs and number of unique barcodes:

display(HTML(
    processed_ccs
    .groupby(['target', 'library'])
    .aggregate(total_CCSs=('barcode', 'size'),
               unique_barcodes=('barcode', 'nunique'))
    .assign(avg_CCSs_per_barcode=lambda x: x['total_CCSs'] / x['unique_barcodes'])
    .round(2)
    .to_html()
    ))
| target | library | total_CCSs | unique_barcodes | avg_CCSs_per_barcode |
| --- | --- | --- | --- | --- |
| BA1 | pool1 | 1007940 | 139442 | 7.23 |
| BA1 | pool2 | 1311135 | 155336 | 8.44 |

Filter processed CCSs

We have the PacBio ccs program's estimated "accuracy" for both the barcode and the gene sequence for each processed CCS. We will filter the CCSs to only keep ones of sufficiently high accuracy.

First, we want to plot the accuracies. It is actually visually easier to look at the error rate, which is one minus the accuracy. Because we want to plot on a log scale (which can't show error rates of zero), we define an error_rate_floor and set all error rates less than this to that value:

error_rate_floor = 1e-7  # error rates < this set to this
if error_rate_floor >= config['max_error_rate']:
    raise ValueError('error_rate_floor must be < max_error_rate')

processed_ccs = (
    processed_ccs
    .assign(barcode_error=lambda x: numpy.clip(1 - x['barcode_accuracy'],
                                               error_rate_floor, None),
            gene_error=lambda x: numpy.clip(1 - x['gene_accuracy'],
                                            error_rate_floor, None)
            )
    )
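
For instance, a CCS with a reported accuracy of exactly 1 has a nominal error rate of 0, which cannot be shown on a log scale; the clipping raises it to the floor:

# an error rate of 0 is clipped up to the 1e-7 floor for log-scale plotting
print(numpy.clip(1 - 1.0, error_rate_floor, None))  # 1e-07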

Now plot the error rates, drawing a dashed vertical line at the threshold separating the CCSs we retain for consensus building versus those that we discard:

_ = (
 ggplot(processed_ccs
        .melt(value_vars=['barcode_error', 'gene_error'],
              var_name='feature_type', value_name='error rate'),
        aes('error rate')) +
 geom_histogram(bins=25) +
 geom_vline(xintercept=config['max_error_rate'],
            linetype='dashed',
            color=CBPALETTE[1]) +
 facet_wrap('~ feature_type') +
 theme(figure_size=(4.5, 2)) +
 ylab('number of CCSs') +
 scale_x_log10()
 ).draw()

png

Flag the CCSs to retain, and indicate how many we are retaining and purging due to the accuracy filter:

processed_ccs = (
    processed_ccs
    .assign(retained=lambda x: ((x['gene_error'] < config['max_error_rate']) &
                                (x['barcode_error'] < config['max_error_rate'])))
    )

Here are the numbers of retained CCSs:

_ = (
 ggplot(processed_ccs.assign(xlabel=lambda x: x['target'] + ', ' + x['library'])
                     .groupby(['xlabel', 'retained'])
                     .size()
                     .rename('count')
                     .reset_index(),
        aes('xlabel', 'count', color='retained', label='count')) +
 geom_point(size=3) +
 geom_text(va='bottom', size=7, ha='center',format_string='{:.3g}', nudge_y=0.2) +
 theme(figure_size=(0.5 * nlibs * ntargets, 3),
       panel_grid_major_x=element_blank(),
       axis_text_x=element_text(angle=90),
       ) +
 scale_y_log10(name='number of CCSs') +
 xlab('') +
 scale_color_manual(values=CBPALETTE[1:])
 ).draw()

png

Sequences per barcode

How many times is each barcode sequenced? This is useful to know for thinking about building the barcode consensus.

Plot the distribution of the number of times each barcode is observed among the retained CCSs:

max_count = 8 # in plot, group all barcodes with >= this many counts

p = (
 ggplot(
    processed_ccs
     .query('retained')
     .groupby(['library', 'barcode'])
     .size()
     .rename('nseqs')
     .reset_index()
     .assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_count)),
    aes('nseqs')) +
 geom_bar() +
 facet_wrap('~ library', nrow=1) +
 theme(figure_size=(1.75 * nlibs, 2),
       panel_grid_major_x=element_blank(),
       ) +
 ylab('number of barcodes') +
 xlab('CCSs for barcode')
 )

_ = p.draw()

png

Empirical accuracy of CCSs

We want to directly estimate the accuracy of the gene-barcode link rather than relying on the PacBio ccs accuracy, which doesn't include inaccuracies due to things like strand exchange or the same barcode on different sequences.

One way to do this is to examine instances when we have multiple sequences for the same barcode. We can calculate the empirical accuracy of the sequences by looking at all instances of multiple sequences of the same barcode and determining how often they are identical. This calculation is performed by alignparse.consensus.empirical_accuracy using the equations described in the docs for that function.
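
As a minimal toy sketch of the idea (hypothetical data, relying on the function's default grouping by library and barcode): the first barcode below has two concordant CCSs while the second has two discordant ones, so the estimated per-CCS accuracy comes out between 0 and 1:

toy = pd.DataFrame({'library': ['lib1'] * 4,
                    'barcode': ['AAAA', 'AAAA', 'CCCC', 'CCCC'],
                    'gene_mutations': ['A1T', 'A1T', 'C2G', 'G3A']})

# returns one row per library with the estimated accuracy of a single CCS
alignparse.consensus.empirical_accuracy(toy, mutation_col='gene_mutations')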

We will do this for four sets of sequences:

  1. All of the CCSs retained above.
  2. CCSs retained by applying a PacBio ccs accuracy filter 10-fold more stringent than the one above. The rationale is that if this improves the concordance (real accuracy) of the CCSs substantially then maybe we should make the accuracy filter more stringent.
  3. Like (1) but excluding all CCSs with indels. The rationale is that we only really care about substitutions, and will exclude sequences with indels anyway.
  4. Like (2) but excluding all CCSs with indels.

First, we annotate the sequences with the number of indels and whether they have an indel to enable categorization into the aforementioned sets:

processed_ccs = processed_ccs.reset_index(drop=True)

processed_ccs = alignparse.consensus.add_mut_info_cols(processed_ccs,
                                                       mutation_col='gene_mutations',
                                                       n_indel_col='n_indels')

processed_ccs = processed_ccs.assign(has_indel=lambda x: x['n_indels'] > 0)
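
As a quick check on this annotation, here is the same call on a single mutation string drawn from the parsed CCSs above, which contains only substitutions:

demo = pd.DataFrame({'gene_mutations': ['A7T C8T A9G']})

# n_indels is 0 since 'A7T C8T A9G' contains only substitutions
alignparse.consensus.add_mut_info_cols(demo,
                                       mutation_col='gene_mutations',
                                       n_indel_col='n_indels')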

Plot how many sequences have indels:

_ = (
 ggplot(processed_ccs,
        aes('retained', fill='has_indel')) +
 geom_bar(position='dodge') +
 geom_text(aes(label='..count..'), stat='count', va='bottom', size=7,
           position=position_dodge(width=0.9), format_string='{:.2g}') +
 theme(figure_size=(2.5 * nlibs, 3),
       panel_grid_major_x=element_blank(),
       ) +
 ylab('number of CCSs') +
 scale_fill_manual(values=CBPALETTE[1:]) +
 facet_wrap('~ library', nrow=1)
 ).draw()

png

Now get the empirical accuracy for each of the CCS groups mentioned above:

high_acc = config['max_error_rate'] / 10
empirical_acc = []

for desc, query_str in [
        ('retained', 'retained'),
        ('retained, no indel', 'retained and not has_indel'),
        ('10X accuracy',
         f"(gene_error < {high_acc}) and (barcode_error < {high_acc})"),
        ('10X accuracy, no indel',
         f"(gene_error < {high_acc}) and (barcode_error < {high_acc}) and not has_indel")
        ]:
    # get just CCSs in that category
    df = processed_ccs.query(query_str)
    
    # compute empirical accuracy
    empirical_acc.append(
        alignparse.consensus.empirical_accuracy(df,
                                                mutation_col='gene_mutations')
        .assign(description=desc)
        .merge(df
               .groupby('library')
               .size()
               .rename('number_CCSs')
               .reset_index()
               )
        )

# make description categorical to preserve order, and annotate as "actual"
# the category ("retained, no indel") that we will use for building variants.
empirical_acc = (
    pd.concat(empirical_acc, ignore_index=True, sort=False)
    .assign(description=lambda x: pd.Categorical(x['description'],
                                                 x['description'].unique(),
                                                 ordered=True),
            actual=lambda x: numpy.where(x['description'] == 'retained, no indel',
                                         True, False),
            )
    )

Display table of the empirical accuracies:

display(HTML(empirical_acc.to_html(index=False)))
| library | accuracy | description | number_CCSs | actual |
| --- | --- | --- | --- | --- |
| pool1 | 0.972176 | retained | 876801 | False |
| pool2 | 0.972453 | retained | 1141200 | False |
| pool1 | 0.995369 | retained, no indel | 850498 | True |
| pool2 | 0.995680 | retained, no indel | 1107023 | True |
| pool1 | 0.980105 | 10X accuracy | 827942 | False |
| pool2 | 0.980418 | 10X accuracy | 1077447 | False |
| pool1 | 0.995455 | 10X accuracy, no indel | 809705 | False |
| pool2 | 0.995706 | 10X accuracy, no indel | 1053879 | False |

Plot the empirical accuracies, using a different color to show the category that we will actually use:

p = (
    ggplot(empirical_acc,
           aes('description', 'accuracy', color='actual', label='accuracy')
           ) +
    geom_point(size=3) +
    geom_text(va='bottom', size=9, format_string='{:.3g}', nudge_y=0.003) +
    facet_wrap('~ library', ncol=8) +
    theme(figure_size=(1.75 * nlibs, 2.25),
          axis_text_x=element_text(angle=90),
          panel_grid_major_x=element_blank(),
          ) +
    xlab('') +
    scale_y_continuous(name='empirical accuracy', limits=(0.95, 1.005)) +
    scale_color_manual(values=CBPALETTE, guide=False)
    )

plotfile = os.path.join(config['figs_dir'], 'empirical_CCS_accuracy.pdf')
print(f"Saving plot to {plotfile}")
_ = p.draw()
p.save(plotfile)
Saving plot to results/figures/empirical_CCS_accuracy.pdf

png

The above analysis shows that if we exclude sequences with indels (which we plan to do among our consensus sequences), then the accuracy of each CCS is around 99.5%. We do not get notably higher empirical accuracy by imposing a more stringent filter from the PacBio ccs program, indicating that the major sources of error arise from processes not modeled in that program's accuracy estimate (perhaps strand exchange or barcode sharing).

Note that this empirical accuracy is for a single CCS. When we build the consensus sequences for each barcode below, we will take the consensus of CCSs within a barcode. So for barcodes with multiple CCSs, the actual accuracy of the consensus sequences will be higher than the empirical accuracy above due to capturing information from multiple CCSs.
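
A back-of-envelope calculation illustrates why (a sketch assuming independent errors across CCSs): with a per-CCS accuracy of roughly 0.995, the chance that two CCSs for the same barcode both contain an error is tiny, and they would additionally have to err in the same way to corrupt the consensus:

p_ccs = 0.995  # approximate empirical per-CCS accuracy from above

# probability that two independent CCSs are both wrong
print((1 - p_ccs) ** 2)  # 2.5e-05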

Consensus sequences for barcodes

We call the consensus sequence for each barcode using the simple method implemented in alignparse.consensus.simple_mutconsensus. The documentation for that function explains the method in detail, but basically it works like this:

  1. When there is just one CCS per barcode, the consensus is just that sequence.
  2. When there are multiple CCSs per barcode, they are used to build a consensus; however, the entire barcode is discarded if there are many differences between the CCSs for that barcode, or high-frequency non-consensus mutations. Barcodes are discarded in such cases because these patterns suggest errors such as barcode collisions or strand exchange.
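
Before running this on the real data, here is a minimal sketch on hypothetical data: three concordant CCSs for one barcode yield a consensus supported by all three, whereas strongly discordant CCSs for a barcode would land it in the returned dropped data frame:

toy = pd.DataFrame({'library': ['lib1'] * 3,
                    'barcode': ['AAAA'] * 3,
                    'gene_mutations': ['G5A', 'G5A', 'G5A']})

toy_consensus, toy_dropped = alignparse.consensus.simple_mutconsensus(
        toy,
        group_cols=('library', 'barcode'),
        mutation_col='gene_mutations')
# toy_consensus has one row with gene_mutations 'G5A' and a
# variant_call_support of 3; toy_dropped is empty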

First, call the consensus for each barcode including all retained sequences, even those with indels:

consensus, dropped = alignparse.consensus.simple_mutconsensus(
                        processed_ccs.query('retained'),
                        group_cols=('library', 'barcode', 'target'),
                        mutation_col='gene_mutations',
                        )

Here are the first few lines of the data frame of consensus sequences for each barcode. In addition to giving the library, barcode, target, and mutations, it also has a column indicating how many CCSs support the variant call:

display(HTML(consensus.head().to_html(index=False)))
| library | barcode | target | gene_mutations | variant_call_support |
| --- | --- | --- | --- | --- |
| pool1 | AAAAAAAAAACAAAAA | BA1 | G124A C125A C126T | 2 |
| pool1 | AAAAAAAAAAGAATTA | BA1 | T91A G92T C93G | 3 |
| pool1 | AAAAAAAAAAGTAATT | BA1 | C127G C128A T129A | 9 |
| pool1 | AAAAAAAAAATACGTA | BA1 | C484G T485C A486T | 5 |
| pool1 | AAAAAAAAACAGAATC | BA1 | G454T C456T | 3 |

Since we retain variants with substitutions but ignore those with indels, add information about substitution mutations and number of indels:

consensus = alignparse.consensus.add_mut_info_cols(
                    consensus,
                    mutation_col='gene_mutations',
                    sub_str_col='substitutions',
                    n_indel_col='number_of_indels',
                    overwrite_cols=True)

display(HTML(consensus.head().to_html(index=False)))
| library | barcode | target | gene_mutations | variant_call_support | substitutions | number_of_indels |
| --- | --- | --- | --- | --- | --- | --- |
| pool1 | AAAAAAAAAACAAAAA | BA1 | G124A C125A C126T | 2 | G124A C125A C126T | 0 |
| pool1 | AAAAAAAAAAGAATTA | BA1 | T91A G92T C93G | 3 | T91A G92T C93G | 0 |
| pool1 | AAAAAAAAAAGTAATT | BA1 | C127G C128A T129A | 9 | C127G C128A T129A | 0 |
| pool1 | AAAAAAAAAATACGTA | BA1 | C484G T485C A486T | 5 | C484G T485C A486T | 0 |
| pool1 | AAAAAAAAACAGAATC | BA1 | G454T C456T | 3 | G454T C456T | 0 |

Plot distribution of number of CCSs supporting each variant call (consensus), indicating whether or not there is an indel:

max_variant_call_support = 6  # group variants with >= this much support

_ = (
 ggplot(consensus
        .assign(variant_call_support=lambda x: numpy.clip(x['variant_call_support'],
                                                          None,
                                                          max_variant_call_support),
                indel_state=lambda x: numpy.where(x['number_of_indels'] > 0,
                                                  'has indel', 'no indel')
                ),
        aes('variant_call_support')) +
 geom_bar() +
 ylab('number of variants') +
 facet_grid('indel_state ~ library') +
 theme(figure_size=(1.75 * nlibs, 3.5),
       panel_grid_major_x=element_blank(),
       ) 
 ).draw()

png

We see that most variant consensus sequences do not have indels, especially if we limit to the more "accurate" ones that have multiple CCSs supporting them.

We will ignore all consensus sequences with indels in the variant-barcode lookup table. We do this for two reasons:

  1. When there is just one CCS supporting a consensus, it is less likely to be accurate as indels are the main mode of PacBio error.
  2. For the purposes of our studies, we are interested in point mutations rather than indels anyway.

Here are the numbers of valid consensus sequences (no indels) for each library and target:

consensus = consensus.query('number_of_indels == 0')

lib_target_counts = (
    consensus
    .groupby(['library', 'target'])
    .size()
    .rename('consensus sequences')
    .reset_index()
    )

display(HTML(lib_target_counts.to_html(index=False)))

p = (ggplot(lib_target_counts.assign(xlabel=lambda x: x['target'] + ', ' + x['library']),
            aes('xlabel', 'consensus sequences')) +
     geom_point(size=3) +
     theme(figure_size=(0.5 * nlibs * ntargets, 1.75),
           axis_text_x=element_text(angle=90)) +
     xlab('') +
     scale_y_log10()
     )

_ = p.draw()
| library | target | consensus sequences |
| --- | --- | --- |
| pool1 | BA1 | 128827 |
| pool2 | BA1 | 143717 |

png

Below we write the retained consensus sequences to a CSV file that links the nucleotide mutations to the barcodes. (The next section analyzes this variant table in detail and provides more precise information on the number of variants and relevant statistics):

print(f"Writing nucleotide variants to {config['nt_variant_table_file' + '_' + background]}")
      
(consensus
 [['target', 'library', 'barcode', 'substitutions', 'variant_call_support']]
 .to_csv(config['nt_variant_table_file' + '_' + background], index=False)
 )
      
print('Here are the first few lines of this file:')
display(HTML(
    pd.read_csv(config['nt_variant_table_file' + '_' + background], na_filter=None)
    .head()
    .to_html(index=False)
    ))
Writing nucleotide variants to results/variants/nucleotide_variant_table_BA1.csv
Here are the first few lines of this file:
| target | library | barcode | substitutions | variant_call_support |
| --- | --- | --- | --- | --- |
| BA1 | pool1 | AAAAAAAAAACAAAAA | G124A C125A C126T | 2 |
| BA1 | pool1 | AAAAAAAAAAGAATTA | T91A G92T C93G | 3 |
| BA1 | pool1 | AAAAAAAAAAGTAATT | C127G C128A T129A | 9 |
| BA1 | pool1 | AAAAAAAAAATACGTA | C484G T485C A486T | 5 |
| BA1 | pool1 | AAAAAAAAACAGAATC | G454T C456T | 3 |

What happened to the barcodes that we "dropped" because we could not construct a reliable consensus? The dropped data frame from alignparse.consensus.simple_mutconsensus has this information:

display(HTML(dropped.head().to_html(index=False)))
| library | barcode | target | drop_reason | nseqs |
| --- | --- | --- | --- | --- |
| pool1 | AAAAAAACAACTTAGC | BA1 | subs diff too large | 2 |
| pool1 | AAAAAAATACCCACGC | BA1 | subs diff too large | 25 |
| pool1 | AAAAAACAAATAGACA | BA1 | minor subs too frequent | 31 |
| pool1 | AAAAAACAAATAGGAC | BA1 | minor subs too frequent | 16 |
| pool1 | AAAAAACACTGCCGTT | BA1 | subs diff too large | 21 |

Summarize the information in this data frame on dropped barcodes with the plot below. This plot shows several things. First, the total number of barcodes dropped is modest (just a few thousand per library) relative to the total number of barcodes per library (seen above to be on the order of hundreds of thousands). Second, the main reason barcodes are dropped is that CCSs within the same barcode have suspiciously large numbers of mutations relative to the consensus; we use this as a filter to discard the entire barcode, as it could indicate strand exchange or some other issue. In any case, the modest number of dropped barcodes indicates that there probably isn't much need to worry:

max_nseqs = 8  # plot together all barcodes with >= this many sequences

_ = (
 ggplot(
    dropped.assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_nseqs)),
    aes('nseqs')) + 
 geom_bar() + 
 scale_x_continuous(limits=(1, None)) +
 xlab('number of sequences for barcode') +
 ylab('number of barcodes') +
 facet_grid('library ~ drop_reason') +
 theme(figure_size=(10, 1.5 * nlibs),
       panel_grid_major_x=element_blank(),
       )
 ).draw()

png
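
The drop reasons can also be tallied directly from the dropped data frame (a quick sketch; these are the same counts underlying the facets above, summed over nseqs):

# number of dropped barcodes for each library and drop reason
dropped.groupby(['library', 'drop_reason']).size().rename('n_barcodes').reset_index()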

Create barcode-variant table

We now create a CodonVariantTable that stores and processes all the information about the variant consensus sequences. Below we initialize such a table, and then analyze information about its composition.

Initialize codon variant table

In order to initialize the codon variant table, we need two pieces of information:

  1. The wildtype gene sequence.
  2. The list of nucleotide mutations for each variant as determined in the consensus calling above.

Read "wildtype" gene sequence to which we made the alignments (in order to do this, initialize an alignparse.Targets and get the gene sequence from it):

targets = alignparse.targets.Targets(seqsfile=config['amplicons_' + background],
                                     feature_parse_specs=config['feature_parse_specs_' + background])
geneseq = targets.get_target(background).get_feature('gene').seq

print(f"Read gene of {len(geneseq)} nts for {background} from {config['amplicons_' + background]}")
Read gene of 603 nts for BA1 from data/PacBio_amplicon_BA1.gb

Now initialize the codon variant table using this wildtype sequence and our list of nucleotide mutations for each variant:

variants = dms_variants.codonvarianttable.CodonVariantTable(
                barcode_variant_file=config['nt_variant_table_file_' + background],
                geneseq=geneseq,
                primary_target=background,
                )

Basic stats on variants

We will now analyze the variants. In this call and in the plots below, we set samples=None since we aren't looking at variant counts in specific samples, but simply at properties of the variants in the table.

Here are the number of variants for each target:

display(HTML(
    variants
    .n_variants_df(samples=None)
    .pivot_table(index=['target'],
                 columns='library',
                 values='count')
    .to_html()
    ))
| target | pool1 | pool2 | all libraries |
| --- | --- | --- | --- |
| BA1 | 128827 | 143717 | 272544 |

Plot the number of variants supported by each number of CCSs:

max_support = 10  # group variants with >= this much support

p = variants.plotVariantSupportHistogram(max_support=max_support,
                                         widthscale=1.1,
                                         heightscale=0.9)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()

png

Mutations per variant

Plot the number of barcoded variants with each number of amino-acid and codon mutations. This is for the primary target only, and doesn't include the spiked-in secondary targets:

max_muts = 7  # group all variants with >= this many mutations

for mut_type in ['aa', 'codon']:
    p = variants.plotNumMutsHistogram(mut_type, samples=None, max_muts=max_muts,
                                      widthscale=1.1,
                                      heightscale=0.9)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw()
    plotfile = os.path.join(config['figs_dir'], f"n_{mut_type}_muts_per_variant_"+background+".pdf")
    print(f"Saving plot to {plotfile}")
    p.save(plotfile)
Saving plot to results/figures/n_aa_muts_per_variant_BA1.pdf
Saving plot to results/figures/n_codon_muts_per_variant_BA1.pdf

png

png

Plot the frequencies of different codon mutation types among all variants (any number of mutations), again only for primary target:

p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
                                    ylabel='mutations per variant',
                                    heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()
plotfile = os.path.join(config['figs_dir'], f"avg_muts_per_variant_"+background+".pdf")
print(f"Saving plot to {plotfile}")
p.save(plotfile)
Saving plot to results/figures/avg_muts_per_variant_BA1.pdf

png

Variants supported by multiple PacBio CCSs should have fewer spurious mutations, since sequencing errors are very unlikely to occur at the same position in two CCSs. Below we plot the number of codon mutations per variant among variants with at least two CCSs supporting their call. The difference between the mutation rates here and in the plot above (which does not apply the min_support=2 filter) gives some estimate of the frequency of spurious mutations in our variants. In fact, the numbers are very similar, indicating that few of the mutations are spurious:

p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
                                    ylabel='mutations per variant', 
                                    min_support=2, heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()

png

Completeness of mutation sampling

We examine how completely amino-acid mutations are sampled by the variants for the primary target, looking at single-mutant variants only and all variants. The plot below shows that virtually every mutation is found in a variant in each library, even if we just look among the single mutants. Things look especially good if we aggregate across libraries:

for variant_type in ['all', 'single']:
    p = variants.plotCumulMutCoverage(variant_type, mut_type='aa', samples=None)
    _ = p.draw()
    plotfile = os.path.join(config['figs_dir'],
                            f"variant_cumul_{variant_type}_mut_coverage_"+background+".pdf")
    print(f"Saving plot to {plotfile}")
    p.save(plotfile)
Saving plot to results/figures/variant_cumul_all_mut_coverage_BA1.pdf
Saving plot to results/figures/variant_cumul_single_mut_coverage_BA1.pdf

png

png

To get more quantitative information like that plotted above, we determine how many mutations are found 0, 1, or >1 times both among single and all mutants for the primary target:

count_dfs = []
for variant_type in ['all', 'single']:
    i_counts = (variants.mutCounts(variant_type, mut_type='aa', samples=None)
                .assign(variant_type=variant_type)
                )
    count_dfs += [i_counts.assign(include_stops=True),
                  i_counts
                  .query(r'not mutation.str.contains("\*")', engine='python')
                  .assign(include_stops=False)
                  ]
    
display(HTML(
    pd.concat(count_dfs)
    .assign(count=lambda x: (numpy.clip(x['count'], None, 2)
                             .map({0: '0', 1: '1', 2:'>1'}))
            )
    .groupby(['variant_type', 'include_stops', 'library', 'count'])
    .aggregate(number_of_mutations=pd.NamedAgg(column='mutation', aggfunc='count'))
    .to_html()
    ))
| variant_type | include_stops | library | found 0 times | found 1 time | found >1 times |
| --- | --- | --- | --- | --- | --- |
| all | False | pool1 | 0 | 0 | 3819 |
| all | False | pool2 | 0 | 1 | 3818 |
| all | False | all libraries | 0 | 0 | 3819 |
| all | True | pool1 | 121 | 13 | 3886 |
| all | True | pool2 | 119 | 10 | 3891 |
| all | True | all libraries | 109 | 14 | 3897 |
| single | False | pool1 | 0 | 1 | 3818 |
| single | False | pool2 | 0 | 1 | 3818 |
| single | False | all libraries | 0 | 0 | 3819 |
| single | True | pool1 | 146 | 14 | 3860 |
| single | True | pool2 | 150 | 8 | 3862 |
| single | True | all libraries | 137 | 19 | 3864 |

Mutation frequencies along gene

We plot the frequencies of mutations along the gene among the variants for the primary target. Ideally, this would be uniform. We make the plot for both all variants and single-mutant / wildtype variants:

for variant_type in ['all', 'single']:
    p = variants.plotMutFreqs(variant_type, mut_type='codon', samples=None)
    p.draw()

png

png

We can also use heat maps to examine the extent to which specific amino-acid or codon mutations are over-represented. These heat maps are large, so we make them just for all variants and the merge of all libraries:

for mut_type in ['aa', 'codon']:
    p = variants.plotMutHeatmap('all', mut_type, samples=None, #libraries='all_only',
                                widthscale=2)
    p.draw()

png

png

Write codon-variant table

We write the codon variant table to a CSV file. This table looks like this:

display(HTML(
    variants.barcode_variant_df
    .head()
    .to_html(index=False)
    ))
| target | library | barcode | variant_call_support | codon_substitutions | aa_substitutions | n_codon_substitutions | n_aa_substitutions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BA1 | pool1 | AAAAAAAAAACAAAAA | 2 | GCC42AAT | A42N | 1 | 1 |
| BA1 | pool1 | AAAAAAAAAAGAATTA | 3 | TGC31ATG | C31M | 1 | 1 |
| BA1 | pool1 | AAAAAAAAAAGTAATT | 9 | CCT43GAA | P43E | 1 | 1 |
| BA1 | pool1 | AAAAAAAAAATACGTA | 5 | CTA162GCT | L162A | 1 | 1 |
| BA1 | pool1 | AAAAAAAAACAGAATC | 3 | GGC152TGT | G152C | 1 | 1 |

Note how this table differs from the nucleotide variant table we generated above and used to initialize the CodonVariantTable in that it gives codon substitutions and associated amino-acid substitutions.
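
As a worked check of the first row above (a sketch using the gene's 1-based nucleotide numbering): the nucleotide substitutions G124A, C125A, and C126T all fall in codon 42, which spans gene nucleotides 124 through 126, and they change wildtype GCC (Ala) to AAT (Asn), i.e. the amino-acid substitution A42N:

# codon 42 spans 1-based nucleotides 3*(42 - 1) + 1 = 124 through 3*42 = 126
wt_codon = geneseq[3 * (42 - 1): 3 * 42]
assert wt_codon == 'GCC'  # matches the GCC42AAT entry above

# applying G124A, C125A, C126T to GCC gives AAT, so Ala42 -> Asn42 (A42N)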

Write it to CSV file:

print(f"Writing codon-variant table to {config['codon_variant_table_file_'+background]}")

variants.barcode_variant_df.to_csv(config['codon_variant_table_file_'+background], index=False)
Writing codon-variant table to results/variants/codon_variant_table_BA1.csv