
This Python Jupyter notebook processes the PacBio circular consensus sequences (CCSs) to extract barcodes and call mutations in the gene. It then builds consensus sequences for barcoded variants from these mutations, yielding a barcode-variant lookup table.

Process CCSs

First, process the PacBio CCSs to extract barcodes and call mutations in the gene.

Define the background for CCS mapping

background = "BA1"

Setup

Import Python modules

Plotting is done with plotnine, which uses ggplot2-like syntax.

The analysis uses the Bloom lab's alignparse and dms_variants packages.

import collections
import math
import os
import re
import time
import warnings

import alignparse
import alignparse.ccs
from alignparse.constants import CBPALETTE
import alignparse.minimap2
import alignparse.targets
import alignparse.consensus

import dms_variants
import dms_variants.codonvarianttable
import dms_variants.plotnine_themes
import dms_variants.utils

from IPython.display import display, HTML

import numpy

import pandas as pd

from plotnine import *

import yaml

Set plotnine theme to the one defined in dms_variants:

theme_set(dms_variants.plotnine_themes.theme_graygrid())

Versions of key software:

print(f"Using alignparse version {alignparse.__version__}")
print(f"Using dms_variants version {dms_variants.__version__}")
Using alignparse version 0.2.4
Using dms_variants version 0.8.9

Ignore warnings that clutter output:

warnings.simplefilter('ignore')

Read the configuration file:

with open('config.yaml') as f:
    config = yaml.safe_load(f)

Make output directory for figures:

os.makedirs(config['process_ccs_dir'], exist_ok=True)

PacBio reads

Read the data frame of PacBio sequencing runs, construct the name and CCS FASTQ file for each run, and keep only the runs for this background:

pacbio_runs = (
    pd.read_csv(config['pacbio_runs'], dtype=str)
    .drop(columns=['ccs'])
    .assign(name=lambda x: x['library'] + '_' + x['bg'] + '_' + x['run'],
            fastq=lambda x: config['ccs_dir'] + '/' + x['name'] + '_ccs.fastq.gz'
            )
    )
 
pacbio_runs.drop(pacbio_runs[pacbio_runs['bg'] != background].index, inplace=True)

display(HTML(pacbio_runs.to_html(index=False)))
| library | bg | run | name | fastq |
| --- | --- | --- | --- | --- |
| pool1 | BA1 | 220524_A | pool1_BA1_220524_A | results/ccs/pool1_BA1_220524_A_ccs.fastq.gz |
| pool2 | BA1 | 220524_A | pool2_BA1_220524_A | results/ccs/pool2_BA1_220524_A_ccs.fastq.gz |

PacBio amplicons

Get the amplicons sequenced by PacBio as the alignment target along with the specs on how to parse the features:

print(f"Reading amplicons from {config['amplicons_' + background]}")
print(f"Reading feature parse specs from {config['feature_parse_specs_' + background]}")

targets = alignparse.targets.Targets(
                seqsfile=config['amplicons_' + background],
                feature_parse_specs=config['feature_parse_specs_' + background])
Reading amplicons from data/PacBio_amplicon_BA1.gb
Reading feature parse specs from data/feature_parse_specs_BA1.yaml

Draw the target amplicons:

fig = targets.plot(ax_width=7,
                   plots_indexing='biopython',  # numbering starts at 0
                   ax_height=2,  # height of each plot
                   hspace=1.2,  # vertical space between plots
                   )

plotfile = os.path.join(config['process_ccs_dir'], 'amplicons_'+background+'.pdf')
print(f"Saving plot to {plotfile}")
fig.savefig(plotfile, bbox_inches='tight')
Saving plot to results/process_ccs/amplicons_BA1.pdf

png

Write out the specs used to parse the features (these are the same specs provided as feature_parse_specs when initializing targets, but with defaults filled in):

print(targets.feature_parse_specs('yaml'))
BA1:
  query_clip5: 4
  query_clip3: 4
  termini5:
    filter:
      clip5: 4
      mutation_nt_count: 1
      mutation_op_count: null
      clip3: 0
    return: []
  gene:
    filter:
      mutation_nt_count: 90
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return:
    - mutations
    - accuracy
  spacer:
    filter:
      mutation_nt_count: 1
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return: []
  barcode:
    filter:
      mutation_nt_count: 0
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return:
    - sequence
    - accuracy
  termini3:
    filter:
      clip3: 4
      mutation_nt_count: 1
      mutation_op_count: null
      clip5: 0
    return: []
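
The parse specs can also be retrieved as a Python dictionary for programmatic checks. As a small sketch (assuming the 'dict' return type that alignparse offers alongside 'yaml'), we can confirm the mutation tolerance for the gene feature shown above:

specs_dict = targets.feature_parse_specs('dict')

# per the YAML above, the gene feature tolerates up to 90 mutated nucleotides
assert specs_dict[background]['gene']['filter']['mutation_nt_count'] == 90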

Align CCSs to amplicons

We now align the CCSs to the amplicon and parse features from the resulting alignments using the specs above.

First, we initialize an alignparse.minimap2.Mapper to align the reads to SAM files:

mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_CODON_DMS)

print(f"Using `minimap2` {mapper.version} with these options:\n" +
      ' '.join(mapper.options))
Using `minimap2` 2.18-r1015 with these options:
-A2 -B4 -O12 -E2 --end-bonus=13 --secondary=no --cs

Next, we use Targets.align_and_parse to create the alignments and parse them:

readstats, aligned, filtered = targets.align_and_parse(
        df=pacbio_runs,
        mapper=mapper,
        outdir=config['process_ccs_dir'],
        name_col='run',
        group_cols=['name', 'library'],
        queryfile_col='fastq',
        overwrite=True,
        ncpus=config['max_cpus'],
        )

First, examine the read stats from the alignment / parsing, extracting the alignment target name from each category and flagging which CCSs aligned validly:

readstats = (
    readstats
    .assign(category_all_targets=lambda x: x['category'].str.split().str[0],
            target=lambda x: x['category'].str.split(None, 1).str[1],
            valid=lambda x: x['category_all_targets'] == 'aligned')
    )
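
To see what this transformation does, here is a small illustration on hypothetical category strings of the form alignparse produces (an aligned or filtered CCS is labeled with its target, while an unmapped CCS has none):

demo = pd.DataFrame({'category': ['aligned BA1', 'filtered BA1', 'unmapped'],
                     'count': [3, 2, 1]})

# the first whitespace-delimited token is the overall category; the remainder
# (if any) is the target, so 'unmapped' gets a target of NaN
demo.assign(category_all_targets=lambda x: x['category'].str.split().str[0],
            target=lambda x: x['category'].str.split(None, 1).str[1],
            valid=lambda x: x['category_all_targets'] == 'aligned')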

Now plot the read stats by run (combining all targets and libraries within a run):

ncol = 2
p = (
    ggplot(readstats
           .groupby(['name', 'category_all_targets', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index(),
           aes('category_all_targets', 'count', fill='valid')) +
    geom_bar(stat='identity') +
    facet_wrap('~ name', ncol=ncol) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(1.85 * min(ncol, len(pacbio_runs)),
                       2 * math.ceil(len(pacbio_runs) / ncol)),
          panel_grid_major_x=element_blank(),
          legend_position='none',
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()

png

And the read stats by library (combining all targets and runs within a library):

p = (
    ggplot(readstats
           .groupby(['library', 'category_all_targets', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index(), 
           aes('category_all_targets', 'count', fill='valid')) +
    geom_bar(stat='identity') +
    facet_wrap('~ library', nrow=1) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(1.5 * pacbio_runs['library'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          legend_position='none',
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()

png

And the number of reads by target (combining all libraries and runs for a target):

p = (
    ggplot(readstats
           .groupby(['target'])
           .aggregate({'count': 'sum'})
           .reset_index(), 
           aes('target', 'count')) +
    geom_point(stat='identity', size=3) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.3 * readstats['target'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          ) +
    scale_y_log10(name='number of reads')
    )
_ = p.draw()

png

And read stats by target (combining all libraries and runs for a target):

p = (
    ggplot(readstats
           .groupby(['target', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index()
           .assign(total=lambda x: x.groupby('target')['count'].transform('sum'),
                   frac=lambda x: x['count'] / x['total'],
                   ), 
           aes('target', 'frac', fill='valid')) +
    geom_bar(stat='identity') +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.5 * readstats['target'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()

png

Now let's see why we filtered the reads. First, we do some transformations on the filtered dict returned by Targets.align_and_parse. Then we count up the number of CCSs filtered for each reason, and group together "unusual" reasons that represent less than some fraction of all filtering. For now, we group together all targets so the stats represent all targets combined:

other_cutoff = 0.02  # group as "other" reasons with <= this frac

filtered_df = (
    pd.concat(df.assign(target=target) for target, df in filtered.items())
    .groupby(['library', 'name', 'run', 'filter_reason'])
    .size()
    .rename('count')
    .reset_index()
    .assign(tot_reason_frac=lambda x: (x.groupby('filter_reason')['count']
                                       .transform('sum')) / x['count'].sum(),
            filter_reason=lambda x: numpy.where(x['tot_reason_frac'] > other_cutoff,
                                                x['filter_reason'],
                                                'other')
            )
    )
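
As a tiny worked example of the relabeling logic (hypothetical filter reasons and counts), a reason accounting for at most 2% of all filtered CCSs gets lumped into 'other':

demo = pd.DataFrame({'filter_reason': ['gene filter', 'spacer filter'],
                     'count': [990, 10]})

# 'gene filter' has tot_reason_frac 0.99 and keeps its name;
# 'spacer filter' has tot_reason_frac 0.01 <= other_cutoff and becomes 'other'
demo.assign(tot_reason_frac=lambda x: (x.groupby('filter_reason')['count']
                                       .transform('sum')) / x['count'].sum(),
            filter_reason=lambda x: numpy.where(x['tot_reason_frac'] > other_cutoff,
                                                x['filter_reason'],
                                                'other'))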

Now plot the filtering reason for all runs:

ncol = 7
nreasons = filtered_df['filter_reason'].nunique()

p = (
    ggplot(filtered_df, aes('filter_reason', 'count')) +
    geom_bar(stat='identity') +
    facet_wrap('~ name', ncol=ncol) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.25 * nreasons * min(ncol, len(pacbio_runs)),
                       2 * math.ceil(len(pacbio_runs) / ncol)),
          panel_grid_major_x=element_blank(),
          )
    )
_ = p.draw()

png

Finally, we take the successfully parsed alignments and read them into a data frame, keeping track of the target that each CCS aligns to. We also drop the pieces of information we won't use going forward, and rename a few columns:

aligned_df = (
    pd.concat(df.assign(target=target) for target, df in aligned.items())
    .drop(columns=['query_clip5', 'query_clip3', 'run','name'])
    .rename(columns={'barcode_sequence': 'barcode'})
    )

print(f"First few lines of information on the parsed alignments:")
display(HTML(aligned_df.head().to_html(index=False)))
First few lines of information on the parsed alignments:
| library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
| --- | --- | --- | --- | --- | --- | --- |
| pool1 | m64272e_220525_133554/1/ccs | G146A T490A C491T A492G | 1.0 | GAGATATTTATCGTAA | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/3/ccs | T184A C186T C327T | 1.0 | CTAAAGCGAATCCAAG | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/14/ccs | A7T C8T A9G | 1.0 | GTACATATGAACAGCA | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/15/ccs | G403T A404C A405T | 1.0 | ATTTACACGATACCCG | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/28/ccs | G92T C93G | 1.0 | GCACCTATCTAATACG | 1.0 | BA1 |

Write valid CCSs

Write the processed CCSs to a file:

aligned_df.to_csv(config['processed_ccs_file' + '_' + background], index=False)

print("Barcodes and mutations for valid processed CCSs "
      f"have been written to {config['processed_ccs_file' + '_' + background]}.")
Barcodes and mutations for valid processed CCSs have been written to results/process_ccs/processed_ccs_BA1.csv.

Next, we analyze these processed CCSs to build the variants.

Build barcode variant table

This part of the notebook builds consensus sequences for barcoded variants from the mutations called in the processed PacBio CCSs, then uses these consensus sequences to build a codon variant table.

Make output directories if needed:

os.makedirs(config['variants_dir'], exist_ok=True)
os.makedirs(config['figs_dir'], exist_ok=True)

Read the CSV file with the processed CCSs into a data frame, display first few lines:

processed_ccs = pd.read_csv(config['processed_ccs_file' + '_' + background], na_filter=None)

nlibs = processed_ccs['library'].nunique()  # number of unique libraries

ntargets = processed_ccs['target'].nunique()  # number of unique targets

print(f"Read {len(processed_ccs)} CCSs from {nlibs} libraries and {ntargets} targets.")

display(HTML(processed_ccs.head().to_html(index=False)))
Read 2319075 CCSs from 2 libraries and 1 targets.
| library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
| --- | --- | --- | --- | --- | --- | --- |
| pool1 | m64272e_220525_133554/1/ccs | G146A T490A C491T A492G | 1.0 | GAGATATTTATCGTAA | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/3/ccs | T184A C186T C327T | 1.0 | CTAAAGCGAATCCAAG | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/14/ccs | A7T C8T A9G | 1.0 | GTACATATGAACAGCA | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/15/ccs | G403T A404C A405T | 1.0 | ATTTACACGATACCCG | 1.0 | BA1 |
| pool1 | m64272e_220525_133554/28/ccs | G92T C93G | 1.0 | GCACCTATCTAATACG | 1.0 | BA1 |

Overall statistics on number of total CCSs and number of unique barcodes:

display(HTML(
    processed_ccs
    .groupby(['target', 'library'])
    .aggregate(total_CCSs=('barcode', 'size'),
               unique_barcodes=('barcode', 'nunique'))
    .assign(avg_CCSs_per_barcode=lambda x: x['total_CCSs'] / x['unique_barcodes'])
    .round(2)
    .to_html()
    ))
| target | library | total_CCSs | unique_barcodes | avg_CCSs_per_barcode |
| --- | --- | --- | --- | --- |
| BA1 | pool1 | 1007940 | 139442 | 7.23 |
| BA1 | pool2 | 1311135 | 155336 | 8.44 |

Filter processed CCSs

We have the PacBio ccs program's estimated "accuracy" for both the barcode and the gene sequence for each processed CCS. We will filter the CCSs to only keep ones of sufficiently high accuracy.

First, we want to plot the accuracies. It is actually visually easier to look at the error rate, which is one minus the accuracy. Because we want to plot on a log scale (which can't show error rates of zero), we define an error_rate_floor and set all error rates less than this to that value:

error_rate_floor = 1e-7  # error rates < this set to this
if error_rate_floor >= config['max_error_rate']:
    raise ValueError('error_rate_floor must be < max_error_rate')

processed_ccs = (
    processed_ccs
    .assign(barcode_error=lambda x: numpy.clip(1 - x['barcode_accuracy'],
                                               error_rate_floor, None),
            gene_error=lambda x: numpy.clip(1 - x['gene_accuracy'],
                                            error_rate_floor, None)
            )
    )
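
For instance, a CCS with a reported accuracy of exactly 1 has a nominal error rate of 0, which cannot be shown on a log scale; the clipping raises it to the floor:

# an error rate of 0 is clipped up to the 1e-7 floor for log-scale plotting
print(numpy.clip(1 - 1.0, error_rate_floor, None))  # 1e-07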

Now plot the error rates, drawing a dashed vertical line at the threshold separating the CCSs we retain for consensus building versus those that we discard:

_ = (
 ggplot(processed_ccs
        .melt(value_vars=['barcode_error', 'gene_error'],
              var_name='feature_type', value_name='error rate'),
        aes('error rate')) +
 geom_histogram(bins=25) +
 geom_vline(xintercept=config['max_error_rate'],
            linetype='dashed',
            color=CBPALETTE[1]) +
 facet_wrap('~ feature_type') +
 theme(figure_size=(4.5, 2)) +
 ylab('number of CCSs') +
 scale_x_log10()
 ).draw()

png

Flag the CCSs to retain, and indicate how many we are retaining and purging due to the accuracy filter:

processed_ccs = (
    processed_ccs
    .assign(retained=lambda x: ((x['gene_error'] < config['max_error_rate']) &
                                (x['barcode_error'] < config['max_error_rate'])))
    )

Here are the numbers of retained CCSs:

_ = (
 ggplot(processed_ccs.assign(xlabel=lambda x: x['target'] + ', ' + x['library'])
                     .groupby(['xlabel', 'retained'])
                     .size()
                     .rename('count')
                     .reset_index(),
        aes('xlabel', 'count', color='retained', label='count')) +
 geom_point(size=3) +
 geom_text(va='bottom', size=7, ha='center',format_string='{:.3g}', nudge_y=0.2) +
 theme(figure_size=(0.5 * nlibs * ntargets, 3),
       panel_grid_major_x=element_blank(),
       axis_text_x=element_text(angle=90),
       ) +
 scale_y_log10(name='number of CCSs') +
 xlab('') +
 scale_color_manual(values=CBPALETTE[1:])
 ).draw()

png

Sequences per barcode

How many times is each barcode sequenced? This is useful to know for thinking about building the barcode consensus.

Plot the distribution of the number of times each barcode is observed among the retained CCSs:

max_count = 8 # in plot, group all barcodes with >= this many counts

p = (
 ggplot(
    processed_ccs
     .query('retained')
     .groupby(['library', 'barcode'])
     .size()
     .rename('nseqs')
     .reset_index()
     .assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_count)),
    aes('nseqs')) +
 geom_bar() +
 facet_wrap('~ library', nrow=1) +
 theme(figure_size=(1.75 * nlibs, 2),
       panel_grid_major_x=element_blank(),
       ) +
 ylab('number of barcodes') +
 xlab('CCSs for barcode')
 )

_ = p.draw()

png

Empirical accuracy of CCSs

We want to directly estimate the accuracy of the gene-barcode link rather than relying on the PacBio ccs accuracy, which doesn't include inaccuracies due to things like strand exchange or the same barcode on different sequences.

One way to do this is to examine instances when we have multiple sequences for the same barcode. We can calculate the empirical accuracy of the sequences by looking at all instances of multiple sequences of the same barcode and determining how often they are identical. This calculation is performed by alignparse.consensus.empirical_accuracy using the equations described in the docs for that function.
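
As a minimal toy sketch of the idea (hypothetical data, relying on the function's default grouping by library and barcode): the first barcode below has two concordant CCSs while the second has two discordant ones, so the estimated per-CCS accuracy comes out between 0 and 1:

toy = pd.DataFrame({'library': ['lib1'] * 4,
                    'barcode': ['AAAA', 'AAAA', 'CCCC', 'CCCC'],
                    'gene_mutations': ['A1T', 'A1T', 'C2G', 'G3A']})

# returns one row per library with the estimated accuracy of a single CCS
alignparse.consensus.empirical_accuracy(toy, mutation_col='gene_mutations')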

We will do this for four sets of sequences:

  1. All of the CCSs retained above.
  2. CCSs retained by applying a PacBio ccs accuracy filter 10-fold more stringent than the one above. The rationale is that if this improves the concordance (real accuracy) of the CCSs substantially then maybe we should make the accuracy filter more stringent.
  3. Like (1) but excluding all CCSs with indels. The rationale is that we only really care about substitutions, and will exclude sequences with indels anyway.
  4. Like (2) but excluding all CCSs with indels.

First, we annotate the sequences with the number of indels and whether they have an indel to enable categorization into the aforementioned sets:

processed_ccs = processed_ccs.reset_index(drop=True)

processed_ccs = alignparse.consensus.add_mut_info_cols(processed_ccs,
                                                       mutation_col='gene_mutations',
                                                       n_indel_col='n_indels')

processed_ccs = processed_ccs.assign(has_indel=lambda x: x['n_indels'] > 0)
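
As a quick check on this annotation, here is the same call on a single mutation string drawn from the parsed CCSs above, which contains only substitutions:

demo = pd.DataFrame({'gene_mutations': ['A7T C8T A9G']})

# n_indels is 0 since 'A7T C8T A9G' contains only substitutions
alignparse.consensus.add_mut_info_cols(demo,
                                       mutation_col='gene_mutations',
                                       n_indel_col='n_indels')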

Plot how many sequences have indels:

_ = (
 ggplot(processed_ccs,
        aes('retained', fill='has_indel')) +
 geom_bar(position='dodge') +
 geom_text(aes(label='..count..'), stat='count', va='bottom', size=7,
           position=position_dodge(width=0.9), format_string='{:.2g}') +
 theme(figure_size=(2.5 * nlibs, 3),
       panel_grid_major_x=element_blank(),
       ) +
 ylab('number of CCSs') +
 scale_fill_manual(values=CBPALETTE[1:]) +
 facet_wrap('~ library', nrow=1)
 ).draw()

png

Now get the empirical accuracy for each of the CCS groups mentioned above:

high_acc = config['max_error_rate'] / 10
empirical_acc = []

for desc, query_str in [
        ('retained', 'retained'),
        ('retained, no indel', 'retained and not has_indel'),
        ('10X accuracy',
         f"(gene_error < {high_acc}) and (barcode_error < {high_acc})"),
        ('10X accuracy, no indel',
         f"(gene_error < {high_acc}) and (barcode_error < {high_acc}) and not has_indel")
        ]:
    # get just CCSs in that category
    df = processed_ccs.query(query_str)
    
    # compute empirical accuracy
    empirical_acc.append(
        alignparse.consensus.empirical_accuracy(df,
                                                mutation_col='gene_mutations')
        .assign(description=desc)
        .merge(df
               .groupby('library')
               .size()
               .rename('number_CCSs')
               .reset_index()
               )
        )

# make description categorical to preserve order, and annotate as "actual"
# the category ("retained, no indel") that we will use for building variants.
empirical_acc = (
    pd.concat(empirical_acc, ignore_index=True, sort=False)
    .assign(description=lambda x: pd.Categorical(x['description'],
                                                 x['description'].unique(),
                                                 ordered=True),
            actual=lambda x: numpy.where(x['description'] == 'retained, no indel',
                                         True, False),
            )
    )

Display table of the empirical accuracies:

display(HTML(empirical_acc.to_html(index=False)))
| library | accuracy | description | number_CCSs | actual |
| --- | --- | --- | --- | --- |
| pool1 | 0.972176 | retained | 876801 | False |
| pool2 | 0.972453 | retained | 1141200 | False |
| pool1 | 0.995369 | retained, no indel | 850498 | True |
| pool2 | 0.995680 | retained, no indel | 1107023 | True |
| pool1 | 0.980105 | 10X accuracy | 827942 | False |
| pool2 | 0.980418 | 10X accuracy | 1077447 | False |
| pool1 | 0.995455 | 10X accuracy, no indel | 809705 | False |
| pool2 | 0.995706 | 10X accuracy, no indel | 1053879 | False |

Plot the empirical accuracies, using a different color to show the category that we will actually use:

p = (
    ggplot(empirical_acc,
           aes('description', 'accuracy', color='actual', label='accuracy')
           ) +
    geom_point(size=3) +
    geom_text(va='bottom', size=9, format_string='{:.3g}', nudge_y=0.003) +
    facet_wrap('~ library', ncol=8) +
    theme(figure_size=(1.75 * nlibs, 2.25),
          axis_text_x=element_text(angle=90),
          panel_grid_major_x=element_blank(),
          ) +
    xlab('') +
    scale_y_continuous(name='empirical accuracy', limits=(0.95, 1.005)) +
    scale_color_manual(values=CBPALETTE, guide=False)
    )

plotfile = os.path.join(config['figs_dir'], 'empirical_CCS_accuracy.pdf')
print(f"Saving plot to {plotfile}")
_ = p.draw()
p.save(plotfile)
Saving plot to results/figures/empirical_CCS_accuracy.pdf

png

The above analysis shows that if we exclude sequences with indels (which we plan to do among our consensus sequences), then the accuracy of each CCS is around 99.5%. We do not get notably higher empirical accuracy by imposing a more stringent filter from the PacBio ccs program, indicating that the major sources of error arise from processes not modeled in that program's accuracy estimate (perhaps strand exchange or barcode sharing).

Note that this empirical accuracy is for a single CCS. When we build the consensus sequences for each barcode below, we will take the consensus of CCSs within a barcode. So for barcodes with multiple CCSs, the actual accuracy of the consensus sequences will be higher than the empirical accuracy above due to capturing information from multiple CCSs.
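
A back-of-envelope calculation illustrates why (a sketch assuming independent errors across CCSs): with a per-CCS accuracy of roughly 0.995, the chance that two CCSs for the same barcode both contain an error is tiny, and they would additionally have to err in the same way to corrupt the consensus:

p_ccs = 0.995  # approximate empirical per-CCS accuracy from above

# probability that two independent CCSs are both wrong
print((1 - p_ccs) ** 2)  # 2.5e-05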

Consensus sequences for barcodes

We call the consensus sequence for each barcode using the simple method implemented in alignparse.consensus.simple_mutconsensus. The documentation for that function explains the method in detail, but basically it works like this:

  1. When there is just one CCS per barcode, the consensus is just that sequence.
  2. When there are multiple CCSs per barcode, they are used to build a consensus; however, the entire barcode is discarded if there are many differences between the CCSs for that barcode, or high-frequency non-consensus mutations. Barcodes are discarded in such cases because these patterns suggest errors such as barcode collisions or strand exchange.
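
Before running this on the real data, here is a minimal sketch on hypothetical data: three concordant CCSs for one barcode yield a consensus supported by all three, whereas strongly discordant CCSs for a barcode would land it in the returned dropped data frame:

toy = pd.DataFrame({'library': ['lib1'] * 3,
                    'barcode': ['AAAA'] * 3,
                    'gene_mutations': ['G5A', 'G5A', 'G5A']})

toy_consensus, toy_dropped = alignparse.consensus.simple_mutconsensus(
        toy,
        group_cols=('library', 'barcode'),
        mutation_col='gene_mutations')
# toy_consensus has one row with gene_mutations 'G5A' and a
# variant_call_support of 3; toy_dropped is empty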

First, call the consensus for each barcode including all retained sequences, even those with indels:

consensus, dropped = alignparse.consensus.simple_mutconsensus(
                        processed_ccs.query('retained'),
                        group_cols=('library', 'barcode', 'target'),
                        mutation_col='gene_mutations',
                        )

Here are the first few lines of the data frame of consensus sequences for each barcode. In addition to giving the library, barcode, target, and mutations, it also has a column indicating how many CCSs support the variant call:

display(HTML(consensus.head().to_html(index=False)))
| library | barcode | target | gene_mutations | variant_call_support |
| --- | --- | --- | --- | --- |
| pool1 | AAAAAAAAAACAAAAA | BA1 | G124A C125A C126T | 2 |
| pool1 | AAAAAAAAAAGAATTA | BA1 | T91A G92T C93G | 3 |
| pool1 | AAAAAAAAAAGTAATT | BA1 | C127G C128A T129A | 9 |
| pool1 | AAAAAAAAAATACGTA | BA1 | C484G T485C A486T | 5 |
| pool1 | AAAAAAAAACAGAATC | BA1 | G454T C456T | 3 |

Since we retain variants with substitutions but ignore those with indels, add information about substitution mutations and number of indels:

consensus = alignparse.consensus.add_mut_info_cols(
                    consensus,
                    mutation_col='gene_mutations',
                    sub_str_col='substitutions',
                    n_indel_col='number_of_indels',
                    overwrite_cols=True)

display(HTML(consensus.head().to_html(index=False)))
| library | barcode | target | gene_mutations | variant_call_support | substitutions | number_of_indels |
| --- | --- | --- | --- | --- | --- | --- |
| pool1 | AAAAAAAAAACAAAAA | BA1 | G124A C125A C126T | 2 | G124A C125A C126T | 0 |
| pool1 | AAAAAAAAAAGAATTA | BA1 | T91A G92T C93G | 3 | T91A G92T C93G | 0 |
| pool1 | AAAAAAAAAAGTAATT | BA1 | C127G C128A T129A | 9 | C127G C128A T129A | 0 |
| pool1 | AAAAAAAAAATACGTA | BA1 | C484G T485C A486T | 5 | C484G T485C A486T | 0 |
| pool1 | AAAAAAAAACAGAATC | BA1 | G454T C456T | 3 | G454T C456T | 0 |

Plot distribution of number of CCSs supporting each variant call (consensus), indicating whether or not there is an indel:

max_variant_call_support = 6  # group variants with >= this much support

_ = (
 ggplot(consensus
        .assign(variant_call_support=lambda x: numpy.clip(x['variant_call_support'],
                                                          None,
                                                          max_variant_call_support),
                indel_state=lambda x: numpy.where(x['number_of_indels'] > 0,
                                                  'has indel', 'no indel')
                ),
        aes('variant_call_support')) +
 geom_bar() +
 ylab('number of variants') +
 facet_grid('indel_state ~ library') +
 theme(figure_size=(1.75 * nlibs, 3.5),
       panel_grid_major_x=element_blank(),
       ) 
 ).draw()

png

We see that most variant consensus sequences do not have indels, especially if we limit to the more "accurate" ones that have multiple CCSs supporting them.

We will ignore all consensus sequences with indels in the variant-barcode lookup table. We do this for two reasons:

  1. When there is just one CCS supporting a consensus, it is less likely to be accurate as indels are the main mode of PacBio error.
  2. For the purposes of our studies, we are interested in point mutations rather than indels anyway.

Here are the numbers of valid consensus sequences (no indels) for each library and target:

consensus = consensus.query('number_of_indels == 0')

lib_target_counts = (
    consensus
    .groupby(['library', 'target'])
    .size()
    .rename('consensus sequences')
    .reset_index()
    )

display(HTML(lib_target_counts.to_html(index=False)))

p = (ggplot(lib_target_counts.assign(xlabel=lambda x: x['target'] + ', ' + x['library']),
            aes('xlabel', 'consensus sequences')) +
     geom_point(size=3) +
     theme(figure_size=(0.5 * nlibs * ntargets, 1.75),
           axis_text_x=element_text(angle=90)) +
     xlab('') +
     scale_y_log10()
     )

_ = p.draw()
| library | target | consensus sequences |
| --- | --- | --- |
| pool1 | BA1 | 128827 |
| pool2 | BA1 | 143717 |

png

Below we write the retained consensus sequences to a CSV file that links the nucleotide mutations to the barcodes. (The next section analyzes this variant table in detail and provides more precise information on the number of variants and relevant statistics):

print(f"Writing nucleotide variants to {config['nt_variant_table_file' + '_' + background]}")
      
(consensus
 [['target', 'library', 'barcode', 'substitutions', 'variant_call_support']]
 .to_csv(config['nt_variant_table_file' + '_' + background], index=False)
 )
      
print('Here are the first few lines of this file:')
display(HTML(
    pd.read_csv(config['nt_variant_table_file' + '_' + background], na_filter=None)
    .head()
    .to_html(index=False)
    ))
Writing nucleotide variants to results/variants/nucleotide_variant_table_BA1.csv
Here are the first few lines of this file:
| target | library | barcode | substitutions | variant_call_support |
| --- | --- | --- | --- | --- |
| BA1 | pool1 | AAAAAAAAAACAAAAA | G124A C125A C126T | 2 |
| BA1 | pool1 | AAAAAAAAAAGAATTA | T91A G92T C93G | 3 |
| BA1 | pool1 | AAAAAAAAAAGTAATT | C127G C128A T129A | 9 |
| BA1 | pool1 | AAAAAAAAAATACGTA | C484G T485C A486T | 5 |
| BA1 | pool1 | AAAAAAAAACAGAATC | G454T C456T | 3 |

What happened to the barcodes that we "dropped" because we could not construct a reliable consensus? The dropped data frame from alignparse.consensus.simple_mutconsensus has this information:

display(HTML(dropped.head().to_html(index=False)))
| library | barcode | target | drop_reason | nseqs |
| --- | --- | --- | --- | --- |
| pool1 | AAAAAAACAACTTAGC | BA1 | subs diff too large | 2 |
| pool1 | AAAAAAATACCCACGC | BA1 | subs diff too large | 25 |
| pool1 | AAAAAACAAATAGACA | BA1 | minor subs too frequent | 31 |
| pool1 | AAAAAACAAATAGGAC | BA1 | minor subs too frequent | 16 |
| pool1 | AAAAAACACTGCCGTT | BA1 | subs diff too large | 21 |

Summarize the information in this data frame on dropped barcodes with the plot below. This plot shows several things. First, the total number of barcodes dropped is modest (just a few thousand per library) relative to the total number of barcodes per library (seen above to be on the order of hundreds of thousands). Second, the main reason barcodes are dropped is that CCSs within the same barcode have suspiciously large numbers of mutations relative to the consensus; we use this as a filter to discard the entire barcode, as it could indicate strand exchange or some other issue. In any case, the modest number of dropped barcodes indicates that there probably isn't much need to worry:

max_nseqs = 8  # plot together all barcodes with >= this many sequences

_ = (
 ggplot(
    dropped.assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_nseqs)),
    aes('nseqs')) + 
 geom_bar() + 
 scale_x_continuous(limits=(1, None)) +
 xlab('number of sequences for barcode') +
 ylab('number of barcodes') +
 facet_grid('library ~ drop_reason') +
 theme(figure_size=(10, 1.5 * nlibs),
       panel_grid_major_x=element_blank(),
       )
 ).draw()

png
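
The drop reasons can also be tallied directly from the dropped data frame (a quick sketch; these are the same counts underlying the facets above, summed over nseqs):

# number of dropped barcodes for each library and drop reason
dropped.groupby(['library', 'drop_reason']).size().rename('n_barcodes').reset_index()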

Create barcode-variant table

We now create a CodonVariantTable that stores and processes all the information about the variant consensus sequences. Below we initialize such a table, and then analyze information about its composition.

Initialize codon variant table

In order to initialize the codon variant table, we need two pieces of information:

  1. The wildtype gene sequence.
  2. The list of nucleotide mutations for each variant as determined in the consensus calling above.

Read "wildtype" gene sequence to which we made the alignments (in order to do this, initialize an alignparse.Targets and get the gene sequence from it):

targets = alignparse.targets.Targets(seqsfile=config['amplicons_' + background],
                                     feature_parse_specs=config['feature_parse_specs_' + background])
geneseq = targets.get_target(background).get_feature('gene').seq

print(f"Read gene of {len(geneseq)} nts for {background} from {config['amplicons_' + background]}")
Read gene of 603 nts for BA1 from data/PacBio_amplicon_BA1.gb

Now initialize the codon variant table using this wildtype sequence and our list of nucleotide mutations for each variant:

variants = dms_variants.codonvarianttable.CodonVariantTable(
                barcode_variant_file=config['nt_variant_table_file_' + background],
                geneseq=geneseq,
                primary_target=background,
                )

Basic stats on variants

We will now analyze the variants. In this call and in the plots below, we set samples=None since we aren't looking at variant counts in specific samples, but simply at properties of the variants in the table.

Here are the number of variants for each target:

display(HTML(
    variants
    .n_variants_df(samples=None)
    .pivot_table(index=['target'],
                 columns='library',
                 values='count')
    .to_html()
    ))
| target | pool1 | pool2 | all libraries |
| --- | --- | --- | --- |
| BA1 | 128827 | 143717 | 272544 |

Plot the number of variants supported by each number of CCSs:

max_support = 10  # group variants with >= this much support

p = variants.plotVariantSupportHistogram(max_support=max_support,
                                         widthscale=1.1,
                                         heightscale=0.9)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()

png

Mutations per variant

Plot the number of barcoded variants with each number of amino-acid and codon mutations. This is for the primary target only, and doesn't include the spiked-in secondary targets:

max_muts = 7  # group all variants with >= this many mutations

for mut_type in ['aa', 'codon']:
    p = variants.plotNumMutsHistogram(mut_type, samples=None, max_muts=max_muts,
                                      widthscale=1.1,
                                      heightscale=0.9)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw()
    plotfile = os.path.join(config['figs_dir'], f"n_{mut_type}_muts_per_variant_"+background+".pdf")
    print(f"Saving plot to {plotfile}")
    p.save(plotfile)
Saving plot to results/figures/n_aa_muts_per_variant_BA1.pdf
Saving plot to results/figures/n_codon_muts_per_variant_BA1.pdf

png

png

Plot the frequencies of different codon mutation types among all variants (any number of mutations), again only for primary target:

p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
                                    ylabel='mutations per variant',
                                    heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()
plotfile = os.path.join(config['figs_dir'], f"avg_muts_per_variant_"+background+".pdf")
print(f"Saving plot to {plotfile}")
p.save(plotfile)
Saving plot to results/figures/avg_muts_per_variant_BA1.pdf

png

Variants supported by multiple PacBio CCSs should have fewer spurious mutations, since sequencing errors are very unlikely to occur at the same position in two CCSs. Below we plot the number of codon mutations per variant among variants with at least two CCSs supporting their call. The difference between the mutation rates here and in the plot above (which does not apply the min_support=2 filter) gives some estimate of the frequency of spurious mutations in our variants. In fact, the numbers are very similar, indicating that few of the mutations are spurious:

p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
                                    ylabel='mutations per variant', 
                                    min_support=2, heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()

png

Completeness of mutation sampling

We examine how completely amino-acid mutations are sampled by the variants for the primary target, looking at single-mutant variants only and all variants. The plot below shows that virtually every mutation is found in a variant in each library, even if we just look among the single mutants. Things look especially good if we aggregate across libraries:

for variant_type in ['all', 'single']:
    p = variants.plotCumulMutCoverage(variant_type, mut_type='aa', samples=None)
    _ = p.draw()
    plotfile = os.path.join(config['figs_dir'],
                            f"variant_cumul_{variant_type}_mut_coverage_"+background+".pdf")
    print(f"Saving plot to {plotfile}")
    p.save(plotfile)
Saving plot to results/figures/variant_cumul_all_mut_coverage_BA1.pdf
Saving plot to results/figures/variant_cumul_single_mut_coverage_BA1.pdf

png

png

To get more quantitative information like that plotted above, we determine how many mutations are found 0, 1, or >1 times both among single and all mutants for the primary target:

count_dfs = []
for variant_type in ['all', 'single']:
    i_counts = (variants.mutCounts(variant_type, mut_type='aa', samples=None)
                .assign(variant_type=variant_type)
                )
    count_dfs += [i_counts.assign(include_stops=True),
                  i_counts
                  .query(r'not mutation.str.contains("\*")', engine='python')
                  .assign(include_stops=False)
                  ]
    
display(HTML(
    pd.concat(count_dfs)
    .assign(count=lambda x: (numpy.clip(x['count'], None, 2)
                             .map({0: '0', 1: '1', 2:'>1'}))
            )
    .groupby(['variant_type', 'include_stops', 'library', 'count'])
    .aggregate(number_of_mutations=pd.NamedAgg(column='mutation', aggfunc='count'))
    .to_html()
    ))
| variant_type | include_stops | library | found 0 times | found 1 time | found >1 times |
| --- | --- | --- | --- | --- | --- |
| all | False | pool1 | 0 | 0 | 3819 |
| all | False | pool2 | 0 | 1 | 3818 |
| all | False | all libraries | 0 | 0 | 3819 |
| all | True | pool1 | 121 | 13 | 3886 |
| all | True | pool2 | 119 | 10 | 3891 |
| all | True | all libraries | 109 | 14 | 3897 |
| single | False | pool1 | 0 | 1 | 3818 |
| single | False | pool2 | 0 | 1 | 3818 |
| single | False | all libraries | 0 | 0 | 3819 |
| single | True | pool1 | 146 | 14 | 3860 |
| single | True | pool2 | 150 | 8 | 3862 |
| single | True | all libraries | 137 | 19 | 3864 |

Mutation frequencies along gene

We plot the frequencies of mutations along the gene among the variants for the primary target. Ideally, this would be uniform. We make the plot for both all variants and single-mutant / wildtype variants:

for variant_type in ['all', 'single']:
    p = variants.plotMutFreqs(variant_type, mut_type='codon', samples=None)
    p.draw()

png

png

We can also use heat maps to examine the extent to which specific amino-acid or codon mutations are over-represented. These heat maps are large, so we make them just for all variants and the merge of all libraries:

for mut_type in ['aa', 'codon']:
    p = variants.plotMutHeatmap('all', mut_type, samples=None, #libraries='all_only',
                                widthscale=2)
    p.draw()

png

png

Write codon-variant table

We write the codon variant table to a CSV file. This table looks like this:

display(HTML(
    variants.barcode_variant_df
    .head()
    .to_html(index=False)
    ))
| target | library | barcode | variant_call_support | codon_substitutions | aa_substitutions | n_codon_substitutions | n_aa_substitutions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BA1 | pool1 | AAAAAAAAAACAAAAA | 2 | GCC42AAT | A42N | 1 | 1 |
| BA1 | pool1 | AAAAAAAAAAGAATTA | 3 | TGC31ATG | C31M | 1 | 1 |
| BA1 | pool1 | AAAAAAAAAAGTAATT | 9 | CCT43GAA | P43E | 1 | 1 |
| BA1 | pool1 | AAAAAAAAAATACGTA | 5 | CTA162GCT | L162A | 1 | 1 |
| BA1 | pool1 | AAAAAAAAACAGAATC | 3 | GGC152TGT | G152C | 1 | 1 |

Note how this table differs from the nucleotide variant table we generated above and used to initialize the CodonVariantTable in that it gives codon substitutions and associated amino-acid substitutions.
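
As a worked check of the first row above (a sketch using the gene's 1-based nucleotide numbering): the nucleotide substitutions G124A, C125A, and C126T all fall in codon 42, which spans gene nucleotides 124 through 126, and they change wildtype GCC (Ala) to AAT (Asn), i.e. the amino-acid substitution A42N:

# codon 42 spans 1-based nucleotides 3*(42 - 1) + 1 = 124 through 3*42 = 126
wt_codon = geneseq[3 * (42 - 1): 3 * 42]
assert wt_codon == 'GCC'  # matches the GCC42AAT entry above

# applying G124A, C125A, C126T to GCC gives AAT, so Ala42 -> Asn42 (A42N)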

Write it to CSV file:

print(f"Writing codon-variant table to {config['codon_variant_table_file_'+background]}")

variants.barcode_variant_df.to_csv(config['codon_variant_table_file_'+background], index=False)
Writing codon-variant table to results/variants/codon_variant_table_BA1.csv