# The utility of VAPOR
## Looking project-wide at the number of ambiguous base calls
### April 23rd, 2025

In this notebook we compare how many ambiguous bases (Ns) are called across the entire project under three reference strategies:

- using the single, complete reference drafted by our lab
- using VAPOR on a database of complete references
- using VAPOR on a database of all references

First, some imports...

In [1]:
import glob
import re
from collections import Counter

import pandas as pd
import numpy as np
from Bio import SeqIO
import matplotlib.pyplot as plt
import altair as alt

%matplotlib inline

We cloned the repository three times and adjusted the config in each clone. For the VAPOR strategies, we set `use_vapor` to true. To use the complete subset of segments from our lab database, we made the following change to the `vapor_segment` rule:

```
 rule vapor_segment:
     input:
         fastq=rules.trimmomatic.output.concat,
-        reference_db='data/reference/{segment}/all.fasta',
+        reference_db='data/reference/{segment}/complete.fasta',
         mlip_reference='data/reference/{segment}/sequence.fasta'
```

For the single reference, we set `use_vapor` to false and placed our lab reference in `data/reference`.

We ran each configuration on the entire PA bird dataset and moved the respective `data` folders output by the pipeline to `vapor-all`, `vapor-complete`, and `single-reference`. We then glob the consensus files as before and pull out metadata for analysis.

In [2]:
files = glob.glob('*/*/replicate-*/*/segments/*/consensus.fasta')
# match strategy, sample, replicate, remapping stage, segment
pattern = re.compile(r'(.*)/(.*)/replicate-(.*)/(.*)/segments/(.*)/consensus.fasta')

# example pattern match
match = pattern.match(files[0])
print(match.groups())

('vapor-all', 'be_w3', '2', 'initial', 'ns')
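
The positional groups above are later read by index (`groups[1]`, `groups[4]`, and so on), which is easy to get wrong. As a minimal sketch, the same pattern could be written with named groups. The group names and the helper `parse_consensus_path` are illustrative assumptions, not part of the pipeline, and the example path is reconstructed from the tuple printed above.

```python
import re

# Named groups mirror the positional pattern above; the group names are
# illustrative choices, not identifiers used by the pipeline.
NAMED_PATTERN = re.compile(
    r'(?P<strategy>[^/]+)/(?P<sample>[^/]+)/replicate-(?P<replicate>[^/]+)/'
    r'(?P<remapping>[^/]+)/segments/(?P<segment>[^/]+)/consensus\.fasta'
)

def parse_consensus_path(path):
    """Return the path's metadata as a dict, or None if it does not match."""
    match = NAMED_PATTERN.fullmatch(path)
    return match.groupdict() if match else None

example = parse_consensus_path(
    'vapor-all/be_w3/replicate-2/initial/segments/ns/consensus.fasta'
)
print(example)
```

Returning `None` for non-matching paths makes it possible to skip unexpected files explicitly instead of failing on `match.groups()`.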


In [3]:
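# count consensus files per key across strategies and remapping stages
c = Counter()
for file in files:
    match = pattern.match(file)
    groups = match.groups()
    # key: (sample, replicate, segment)
    key = (groups[1], groups[2], groups[4])
    c[key] += 1
# keep only keys that have exactly 6 consensus files
complete = {key for key, value in c.items() if value == 6}
print(len(complete), 'out of total', len(files), 'files')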

512 out of total 3504 files
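
The completeness filter can be illustrated on toy data. This sketch assumes that the six files per key correspond to three strategies times two remapping stages (`initial` and `remapping-1`), which the notebook does not state explicitly. The records are hypothetical.

```python
from collections import Counter
from itertools import product

strategies = ['single-reference', 'vapor-all', 'vapor-complete']
stages = ['initial', 'remapping-1']  # assumed to be the only two stages

# Hypothetical records: (strategy, sample, replicate, stage, segment).
records = [(s, 'be_w3', '1', st, 'ha') for s, st in product(strategies, stages)]
# A second key that is missing from one strategy, so only 4 files exist.
records += [(s, 'be_w3', '2', st, 'ha')
            for s, st in product(strategies[:2], stages)]

# Count files per (sample, replicate, segment), ignoring strategy and stage.
counts = Counter((sample, rep, seg) for _, sample, rep, _, seg in records)
expected = len(strategies) * len(stages)
complete_keys = {key for key, n in counts.items() if n == expected}
print(complete_keys)
```

Only keys present under every strategy and stage survive, so the strategies are later compared on the same set of sample, replicate, and segment combinations.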


We extract data from each consensus call: the number of Ns called and the fraction of the sequence they make up.

In [5]:
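def extract_N_percentage(fasta_path):
    """Return (number of Ns, fraction of Ns) for a single-record FASTA file."""
    record = SeqIO.read(fasta_path, 'fasta')
    Ns = record.seq.count('N')
    total_bases = len(record)
    # an empty consensus is treated as fully ambiguous
    N_percentage = 1 if total_bases == 0 else Ns / total_bases
    return (Ns, N_percentage)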

data = []
for file in files:
    match = pattern.match(file)
    groups = match.groups()
    is_complete = (groups[1], groups[2], groups[4]) in complete
    is_remapped = groups[3] == 'remapping-1'
    should_keep = is_complete and is_remapped
    if not should_keep:
        continue
    ns, n_percentage = extract_N_percentage(file)
    data.append({
        'strategy':  groups[0],
        'sample': groups[1],
        'replicate': groups[2],
        'remapping': groups[3],
        'segment': groups[4],
        'n_percentage': n_percentage,
        'n_s': ns
    })
df = pd.DataFrame(data)
df.head()

|   | strategy | sample | replicate | remapping | segment | n_percentage | n_s |
|---|---|---|---|---|---|---|---|
| 0 | vapor-all | be_w3 | 2 | remapping-1 | ns | 0.0 | 0 |
| 1 | vapor-all | be_w3 | 2 | remapping-1 | na | 0.0 | 0 |
| 2 | vapor-all | be_w3 | 2 | remapping-1 | pb2 | 0.0 | 0 |
| 3 | vapor-all | be_w3 | 2 | remapping-1 | pa | 0.0 | 0 |
| 4 | vapor-all | be_w3 | 2 | remapping-1 | ha | 0.0 | 0 |
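
To make the counting rule in `extract_N_percentage` concrete, here is a minimal sketch that applies the same logic to a plain string instead of a FASTA record. The helper name `n_stats` and the example sequence are made up for illustration.

```python
def n_stats(sequence):
    """Return (count of 'N', fraction of 'N') for a plain sequence string."""
    n_count = sequence.count('N')
    total = len(sequence)
    # Same convention as the notebook: an empty consensus counts as
    # fully ambiguous (fraction 1) rather than raising ZeroDivisionError.
    fraction = 1 if total == 0 else n_count / total
    return n_count, fraction

print(n_stats('ACGTNNNNAC'))  # 4 Ns out of 10 bases
print(n_stats(''))
```

Note that despite its name, `n_percentage` is a fraction between 0 and 1, not a percentage.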


We wish to compare how many ambiguous bases are called by the various strategies. To keep it simple, we'll sum over samples.

In [6]:
# sum the numeric columns over samples
df.groupby(['strategy', 'segment', 'replicate'])[['n_percentage', 'n_s']].sum()

| strategy | segment | replicate | n_percentage | n_s |
|---|---|---|---|---|
| single-reference | ha | 1 | 0.23705 | 421 |
| single-reference | ha | 2 | 0.0 | 0 |
| single-reference | mp | 1 | 4.160515 | 4254 |
| single-reference | mp | 2 | 3.37804 | 3466 |
| single-reference | na | 1 | 0.043925 | 64 |
| single-reference | na | 2 | 0.001371 | 2 |
| single-reference | np | 1 | 1.102875 | 1656 |
| single-reference | np | 2 | 0.021725 | 34 |
| single-reference | ns | 1 | 0.0 | 0 |
| single-reference | ns | 2 | 0.001124 | 1 |
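
The long table is hard to scan across strategies. One possible way to compare them side by side is to pivot the strategies into columns. The sketch below uses a toy frame with made-up counts, shaped like `df`. The column name `vapor_all_minus_single` is an illustrative choice.

```python
import pandas as pd

# Toy rows with made-up counts, shaped like `df` above.
toy = pd.DataFrame({
    'strategy': ['single-reference', 'vapor-all',
                 'single-reference', 'vapor-all'],
    'segment': ['ha', 'ha', 'mp', 'mp'],
    'replicate': ['1', '1', '1', '1'],
    'n_s': [10, 4, 7, 7],
})

# One row per (segment, replicate), one column per strategy.
wide = toy.pivot_table(
    index=['segment', 'replicate'],
    columns='strategy',
    values='n_s',
    aggfunc='sum',
)
# Negative values mean vapor-all called fewer Ns than the single reference.
wide['vapor_all_minus_single'] = wide['vapor-all'] - wide['single-reference']
print(wide)
```

On the real data, the same call on `df` would put `single-reference`, `vapor-all`, and `vapor-complete` next to each other for each segment and replicate.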


Finally, we'll make a grouped bar chart, showing per segment how many ambiguous bases are called by each strategy for each replicate.

In [7]:
df_agg = df.groupby(['segment','strategy','replicate'], as_index=False)['n_s'].sum()
df_agg['strat_rep'] = df_agg['strategy'] + '_rep' + df_agg['replicate'].astype(str)
color_domain = [
    'single-reference_rep1', 'single-reference_rep2',
    'vapor-all_rep1',        'vapor-all_rep2',
    'vapor-complete_rep1',    'vapor-complete_rep2',
]
color_range = [
    'lightblue', 'darkblue',
    'pink', 'red',
    'lightgrey', 'black'
]

(
    alt.Chart(df_agg)
    .mark_bar()
    .encode(
        x=alt.X('segment:N', title='Segment'),
        y=alt.Y('n_s:Q', title='Total n_s'),
        xOffset=alt.XOffset('strat_rep:N'),
        color=alt.Color(
            'strat_rep:N',
            title='Strategy × Replicate',
            scale=alt.Scale(domain=color_domain, range=color_range)
        ),
        tooltip=[
            alt.Tooltip('segment:N', title='Segment'),
            alt.Tooltip('strategy:N', title='Strategy'),
            alt.Tooltip('replicate:N', title='Replicate'),
            alt.Tooltip('n_s:Q', title='Total n_s'),
        ]
    )
    .properties(
        width=600,
        height=400,
        title='Total n_s by Segment, Strategy & Replicate'
    )
    .configure_axis(
        labelFontSize=14,
        titleFontSize=16
    )
    .configure_legend(
        labelFontSize=14,
        titleFontSize=16
    )
    .configure_title(
        fontSize=18
    )
)

At least at the level of the entire project, this shows only marginal utility for adding VAPOR. It may recover a little more of `pb1`, but otherwise it performs comparably to the single reference. Future work may involve exploring the benefits on a per-sample basis.