# Process plate counts to get ratios of variants in a re-pooled variant library -- repeat

The initial re-pool that I tested probably had some technical errors in setup, leading to noisy data. Here, I repeated that dilution series and infection in duplicate (starting with 12.5 uL virus [4x less than previous replicates] and performing 2-fold dilutions) with both my re-pooled H3N2 2023-2024 library and Andrea's pdmH1N1 2021-2022 library.

The plots generated by this notebook are interactive, so you can mouseover points for details, use the mouse-scroll to zoom and pan, and use interactive dropdowns at the bottom of the plots.

## Setup
Import Python modules:

In [2]:
import pickle
import sys

import altair as alt

import matplotlib.pyplot as plt

import numpy

import pandas as pd
from os.path import join
import os
import ruamel.yaml as yaml

_ = alt.data_transformers.disable_max_rows()

## Add input data locations
Some of these files are provided as input data, and others are generated by running the library pooling samples as `miscellaneous_plates` through the `seqneut-pipeline`. For details on how these files are generated, see the `README.md` in [https://github.com/jbloomlab/seqneut-pipeline](https://github.com/jbloomlab/seqneut-pipeline).

In [3]:
# Define file path prefix
filepath_prefix = '/fh/fast/bloom_j/computational_notebooks/ckikawa/2024/flu_seqneut_H3N2_2023-2024'

# Viral library contents and barcode IDs
viral_H3N2_library_csv = filepath_prefix + '/data/viral_libraries/2023_H3N2_Kikawa.csv'
viral_H1N1_library_csv = filepath_prefix + '/data/viral_libraries/pdmH1N1_lib2023_loes.csv'
# Neutralization standard set of barcode IDs
neut_standard_set_csv = filepath_prefix + '/data/neut_standard_sets/loes2023_neut_standards.csv'
# All samples included in library pooling sequencing run
# Contains information on library, dilution factor, R1 location
samplesfile_H3N2 = filepath_prefix + '/data/miscellaneous_plates/2024-02-07_H3N2_sampleData_rePool_MOItest.csv'
samplesfile_H1N1 = filepath_prefix + '/data/miscellaneous_plates/2024-02-07_H1N1_sampleData_rePool_MOItest.csv'

# Counts and fates files output by running library pooling samples as miscellaneous plates
platedir_H3N2 = filepath_prefix + '/results/miscellaneous_plates/240207_repool_H3N2/'
platedir_H1N1 = filepath_prefix + '/results/miscellaneous_plates/240207_repool_H1N1/'

# Identify all counts and fates CSVs
count_csvs = []
fate_csvs = []
file_list_H3N2 = os.listdir(platedir_H3N2)
file_list_H1N1 = os.listdir(platedir_H1N1)

for f in file_list_H3N2:
    location = platedir_H3N2 + f
    if "_counts" in f:
        count_csvs.append(location)
    elif "_fates" in f:
        fate_csvs.append(location)

for f in file_list_H1N1:
    location = platedir_H1N1 + f
    if "_counts" in f:
        count_csvs.append(location)
    elif "_fates" in f:
        fate_csvs.append(location)

In [4]:
# Define a samples dataframe using the samples file
samples_df = pd.concat([pd.read_csv(samplesfile_H3N2), pd.read_csv(samplesfile_H1N1)]).reset_index(drop=True)
samples_df.drop(columns=['fastq'], inplace=True)
samples_df['sample'] = samples_df.apply(
    lambda x: '-'.join(x.astype(str)), axis=1
)

samples = samples_df["sample"].unique().tolist()
print(f"There are {len(samples)} barcode runs.")

samples_df

There are 48 barcode runs.


| | well | serum | dilution_factor | replicate | sample |
|---|---|---|---|---|---|
| 0 | A1 | 2023rePoolcol1_h3n2 | 1 | 1 | A1-2023rePoolcol1_h3n2-1-1 |
| 1 | B1 | 2023rePoolcol2_h3n2 | 2 | 1 | B1-2023rePoolcol2_h3n2-2-1 |
| 2 | C1 | 2023rePoolcol3_h3n2 | 4 | 1 | C1-2023rePoolcol3_h3n2-4-1 |
| 3 | D1 | 2023rePoolcol4_h3n2 | 8 | 1 | D1-2023rePoolcol4_h3n2-8-1 |
| 4 | E1 | 2023rePoolcol5_h3n2 | 16 | 1 | E1-2023rePoolcol5_h3n2-16-1 |
| 5 | F1 | 2023rePoolcol6_h3n2 | 32 | 1 | F1-2023rePoolcol6_h3n2-32-1 |
| 6 | G1 | 2023rePoolcol7_h3n2 | 64 | 1 | G1-2023rePoolcol7_h3n2-64-1 |
| 7 | H1 | 2023rePoolcol8_h3n2 | 128 | 1 | H1-2023rePoolcol8_h3n2-128-1 |
| 8 | A2 | 2023rePoolcol9_h3n2 | 256 | 1 | A2-2023rePoolcol9_h3n2-256-1 |
| 9 | B2 | 2023rePoolcol10_h3n2 | 512 | 1 | B2-2023rePoolcol10_h3n2-512-1 |


## Statistics on barcode-parsing for each sample
Make interactive chart of the "fates" of the sequencing reads parsed for each sample on the plate.

If most sequencing reads are not "valid barcodes", this could potentially indicate some problem in the sequencing or barcode set you are parsing.

Potential fates are:
 - *valid barcode*: barcode that matches a known virus or neutralization standard; we hope most reads are this.
 - *invalid barcode*: a barcode with proper flanking sequences that does not match a known virus or neutralization standard. If you have a lot of reads of this type, it is probably a good idea to look at the invalid barcode CSVs (in the `./results/barcode_invalid/` subdirectory created by the pipeline) to see what these invalid barcodes are.
 - *unparseable barcode*: could not parse a barcode from this read as there was not a sequence of the correct length with the appropriate flanking sequence.
 - *low quality barcode*: low-quality or `N` nucleotides in barcode, could indicate problem with sequencing.
 - *failed chastity filter*: reads that failed the Illumina chastity filter, if these are reported in the FASTQ (they may not be).

Also, if the number of reads per sample is very uneven, that could indicate that you did not do a good job of balancing the different samples in the Illumina sequencing.
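
As a numeric complement to the chart, the fraction of reads with each fate can be tabulated per well. The sketch below is only an illustration on made-up numbers: it assumes a long-format dataframe with `well`, `fate`, and `count` columns, and the helper name `fate_fractions` and the 0.5 cutoff are hypothetical, not part of the pipeline.

```python
import pandas as pd


def fate_fractions(fates_df):
    """Return a well x fate table with the fraction of reads having each fate."""
    wide = fates_df.pivot_table(
        index="well", columns="fate", values="count", aggfunc="sum", fill_value=0
    )
    # divide each row by its total so the fractions within a well sum to 1
    return wide.div(wide.sum(axis=1), axis=0)


# toy data (illustrative values only, not real results)
toy_fates = pd.DataFrame(
    {
        "well": ["A1", "A1", "A1", "B1", "B1", "B1"],
        "fate": ["valid barcode", "invalid barcode", "unparseable barcode"] * 2,
        "count": [900, 50, 50, 400, 500, 100],
    }
)
toy_fracs = fate_fractions(toy_fates)

# wells where fewer than half of the reads are valid barcodes (arbitrary cutoff)
low_valid_wells = toy_fracs.index[toy_fracs["valid barcode"] < 0.5].tolist()
print(toy_fracs)
print(low_valid_wells)
```

A table like this makes it easy to list the wells that deserve a closer look before reading them off the bar chart.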

In [5]:
fates = (
    pd.concat([pd.read_csv(f)
               # well name = file name without the "_fates.csv" suffix
               # (str.removesuffix requires Python >= 3.9)
               .assign(well=os.path.basename(f).removesuffix('_fates.csv'))
               for f in fate_csvs])
    .merge(samples_df, on="well")
    .assign(
        fate_counts=lambda x: x.groupby("fate")["count"].transform("sum"),
        sample_well=lambda x: x["sample"] + " (" + x["well"] + ")",
    )
    .query("fate_counts > 0") # only keep fates with at least one count
    [['serum', "fate", "count", "well", "sample_well", "dilution_factor"]]
)

assert len(fates) == len(fates.drop_duplicates())


sample_wells = list(
    fates.sort_values(['serum', "dilution_factor"], ascending=True)["sample_well"]
)



fates_chart = (
    alt.Chart(fates)
    .encode(
        alt.X("count", scale=alt.Scale(nice=False, padding=3)),
        alt.Y(
            "sample_well",
            title=None,
            sort=sample_wells,
        ),
        alt.Color("fate", sort=sorted(fates["fate"].unique(), reverse=True)),
        alt.Order("fate", sort="descending"),
        tooltip=fates.columns.tolist(),
    )
    .mark_bar(height={"band": 0.85})
    .properties(
        height=alt.Step(10),
        width=200,
        title=f"Barcode parsing for initial titering plate",
    )
    .configure_axis(grid=False)
)

fates_chart

Looks like we got good read coverage across all wells. The previous issues with missing wells at the cDNA synthesis step appear to be resolved.

## Read barcode counts
Read the counts per barcode:

In [6]:
# get barcode counts
counts = (
    pd.concat([pd.read_csv(c)
               # well name = file name without the "_counts.csv" suffix
               # (str.removesuffix requires Python >= 3.9)
               .assign(well=os.path.basename(c).removesuffix('_counts.csv'))
               for c in count_csvs])
    .merge(samples_df, validate="many_to_one", on="well")
    .drop(columns=["replicate"])
    .assign(sample_well=lambda x: x["sample"] + " (" + x["well"] + ")")
)

# classify barcodes as viral or neut standard
barcode_class = pd.concat(
    [
        pd.read_csv(viral_H3N2_library_csv)[["barcode", "strain"]].assign(
            neut_standard=False,
        ),
        pd.read_csv(viral_H1N1_library_csv)[["barcode", "strain"]].assign(
            neut_standard=False,
        ),
        pd.read_csv(neut_standard_set_csv)[["barcode"]].assign(
            neut_standard=True,
            strain=pd.NA,
        ),
    ],
    ignore_index=True,
)

# merge counts and classification of barcodes
assert set(counts["barcode"]) == set(barcode_class["barcode"])
counts = (counts
          .merge(barcode_class, on="barcode")
          # .assign(virus_library = lambda x: x['serum'].str.split('_').str[1][0])
         )

assert set(sample_wells) == set(counts["sample_well"])

# define virus library from the suffix of the serum name ("h3n2" or "h1n1")
counts['virus_library'] = counts['serum'].str.split('_').str[1]

## Average counts per barcode in each well

Plot average counts per barcode.
If a sample has inadequate barcode counts, it may not have good enough statistics for accurate analysis, so a QC threshold is applied (here, wells averaging fewer than 500 counts per barcode fail):

In [7]:
avg_barcode_counts = (
    counts.groupby(
        ["well", "sample_well"],
        dropna=False,
        as_index=False,
    )
    .aggregate(avg_count=pd.NamedAgg("count", "mean"))
    .assign(
        fails_qc=lambda x: (
            x["avg_count"] < 500
        ),
    )
)

avg_barcode_counts_chart = (
    alt.Chart(avg_barcode_counts)
    .encode(
        alt.X(
            "avg_count",
            title="average barcode counts per well",
            scale=alt.Scale(nice=False, padding=3),
        ),
        alt.Y("sample_well", sort=sample_wells),
        alt.Color(
            "fails_qc",
            title="fails min barcode count threshold (avg_count < 500)",
            legend=alt.Legend(titleLimit=500),
        ),
        tooltip=[
            alt.Tooltip(c, format=".3g") if avg_barcode_counts[c].dtype == float else c
            for c in avg_barcode_counts.columns
        ],
    )
    .mark_bar(height={"band": 0.85})
    .properties(
        height=alt.Step(10),
        width=250,
        title=f"Average barcode counts per well for titering plate",
    )
    .configure_axis(grid=False)
)

display(avg_barcode_counts_chart)

# drop wells failing QC
avg_barcode_counts_per_well_drops = list(avg_barcode_counts.query("fails_qc")["well"])

## Fraction of counts from neutralization standard
Determine the fraction of counts from the neutralization standard in each sample, and make sure this fraction passes the QC threshold (at least 0.001 here).

In [8]:
# Define dataframe from counts with the fraction of neutralization standard calculated
neut_standard_fracs = (
    counts.assign(
        neut_standard_count=lambda x: x["count"] * x["neut_standard"].astype(int)
    )
    .groupby(
        ['virus_library', "well", "sample_well", 'dilution_factor'],
        dropna=False,
        as_index=False,
    )
    .aggregate(
        total_count=pd.NamedAgg("count", "sum"),
        neut_standard_count=pd.NamedAgg("neut_standard_count", "sum"),
    )
    .assign(
        neut_standard_frac=lambda x: x["neut_standard_count"] / x["total_count"],
        fails_qc=lambda x: (
            x["neut_standard_frac"] < 0.001
        ),
    )
)

# Plot as bar chart
chart_h3n2 = (
    alt.Chart(neut_standard_fracs.query('virus_library == "h3n2"'))
    .encode(
        alt.X(
            "neut_standard_frac",
            title="frac counts from neutralization standard per well",
            scale=alt.Scale(nice=False, padding=3),
        ),
        alt.Y("sample_well", sort=sample_wells),
        alt.Color(
            "fails_qc",
            legend=alt.Legend(titleLimit=500),
        ),
        tooltip=[
            alt.Tooltip(c, format=".3g") if neut_standard_fracs[c].dtype == float else c
            for c in neut_standard_fracs.columns
        ],
    )
    .mark_bar(height={"band": 0.85})
    .properties(
        height=alt.Step(10),
        width=250,
        title="Neutralization-standard fracs per well for titering plate, H3N2 re-pool",
    )
    # .configure_axis(grid=False)
    # .configure_legend(titleLimit=1000)
)

chart_h1n1 = (
    alt.Chart(neut_standard_fracs.query('virus_library == "h1n1"'))
    .encode(
        alt.X(
            "neut_standard_frac",
            title="frac counts from neutralization standard per well",
            scale=alt.Scale(nice=False, padding=3),
        ),
        alt.Y("sample_well", sort=sample_wells),
        alt.Color(
            "fails_qc",
            legend=alt.Legend(titleLimit=500),
        ),
        tooltip=[
            alt.Tooltip(c, format=".3g") if neut_standard_fracs[c].dtype == float else c
            for c in neut_standard_fracs.columns
        ],
    )
    .mark_bar(height={"band": 0.85})
    .properties(
        height=alt.Step(10),
        width=250,
        title="Neutralization-standard fracs per well for titering plate, pdmH1N1 pool",
    )
    # .configure_axis(grid=False)
    # .configure_legend(titleLimit=1000)
)

display(chart_h3n2 | chart_h1n1)

# drop wells failing QC
min_neut_standard_frac_per_well_drops = list(
    neut_standard_fracs.query("fails_qc")["well"]
)

In [16]:
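# Rescale dilution factors by 4 (this series started with 4x less virus than previous replicates).
# Run this cell only once: re-running it multiplies the factors by 4 again.
neut_standard_fracs['dilution_factor'] = neut_standard_fracs['dilution_factor'] * 4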
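
The factor of 4 applied above can be sanity-checked by writing out the dilution series. This is a minimal sketch under stated assumptions: 12 two-fold dilution steps (columns 1 to 12 of the samples file) and a previous starting volume of 50 uL, which is inferred from the "4x less" statement in the introduction rather than recorded anywhere in this notebook. All variable names are illustrative.

```python
start_volume_ul = 12.5    # starting virus volume in this experiment
previous_start_ul = 50.0  # assumed: 4x the current starting volume
n_dilutions = 12          # assumed: columns 1-12 of the samples file

# dilution factors as recorded in the samples file: 1, 2, 4, ..., 2048
recorded = [2**i for i in range(n_dilutions)]

# reciprocal dilution relative to the assumed previous starting volume
rescale = previous_start_ul / start_volume_ul
relative = [int(d * rescale) for d in recorded]

# virus volume per well implied by each two-fold step
volumes_ul = [start_volume_ul / d for d in recorded]

for d, r, v in zip(recorded, relative, volumes_ul):
    print(f"recorded {d:>4} -> rescaled {r:>4} ({v:.4g} uL virus)")
```

Under these assumptions, a recorded dilution factor of 16 corresponds to a rescaled reciprocal dilution factor of 64.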

In [17]:
# Scatterplot of the same data as above, plotted by dilution factor
scatter_h3n2 = alt.Chart(neut_standard_fracs.query('virus_library == "h3n2"')
                        ).mark_circle(size=60).encode(
    alt.X('dilution_factor:Q', 
          scale=alt.Scale(type='log'),
          title='library pool reciprocal dilution factor'),
    alt.Y('neut_standard_frac:Q', 
          scale=alt.Scale(type='log'),
          title='fraction of reads = neutralization standard'),
    color='fails_qc',
    tooltip=['well', 'dilution_factor', 'neut_standard_frac', 'total_count']
).properties(
        height=alt.Step(10),
        width=250,
        title=f"Neutralization-standard fracs per well for H3N2 re-pool",
    ).interactive()

scatter_h1n1 = alt.Chart(neut_standard_fracs.query('virus_library == "h1n1"')
                        ).mark_circle(size=60).encode(
    alt.X('dilution_factor:Q', 
          scale=alt.Scale(type='log'),
          title='library pool reciprocal dilution factor'),
    alt.Y('neut_standard_frac:Q', 
          scale=alt.Scale(type='log'),
          title='fraction of reads = neutralization standard'),
    color='fails_qc',
    tooltip=['well', 'dilution_factor', 'neut_standard_frac', 'total_count']
).properties(
        height=alt.Step(10),
        width=250,
        title=f"Neutralization-standard fracs per well for pdmH1N1 pool",
    ).interactive()

scatter_h3n2 | scatter_h1n1

## Assessing balancing of strains contained in the library
Viruses were rescued and blind passaged individually. Based on data from my initial equal-volume pool, I re-pooled viruses to try to achieve more equal representation. I will now assess that re-pooling. 

The 3 viral barcodes associated with each strain were pooled prior to rescue, so they cannot be balanced relative to one another.

In [11]:
# Get summed barcode counts for each strain in each well (H3N2 library only);
# neutralization-standard barcodes have no strain, so groupby drops them
straincounts_allbarcodes = (
    counts.query('virus_library == "h3n2"')
    .groupby(['sample', 'strain', 'dilution_factor', 'well'], as_index=False)['count']
    .sum()
)

# Get sum of all viral barcode counts per well
sumperwell = (
    straincounts_allbarcodes
    .groupby(['sample', 'dilution_factor', 'well'], as_index=False)['count']
    .sum()
    .rename(columns={'count': 'counts_perwell'})
)

# Merge dataframes and calculate fraction of each well devoted to each strain
# (note the extra division by 2: these fractions sum to 0.5 within a well)
merged_df = straincounts_allbarcodes.merge(sumperwell, on=['sample', 'dilution_factor', 'well'])
merged_df['fraction_strain'] = merged_df['count'] / merged_df['counts_perwell'] / 2
merged_df

| | sample | strain | dilution_factor | well | count | counts_perwell | fraction_strain |
|---|---|---|---|---|---|---|---|
| 0 | A1-2023rePoolcol1_h3n2-1-1 | A/Abu_Dhabi/6753/2023 | 1 | A1 | 27085 | 2285326 | 0.005926 |
| 1 | A1-2023rePoolcol1_h3n2-1-1 | A/Bangkok/P3599/2023 | 1 | A1 | 28678 | 2285326 | 0.006274 |
| 2 | A1-2023rePoolcol1_h3n2-1-1 | A/Bangkok/P3755/2023 | 1 | A1 | 35873 | 2285326 | 0.007849 |
| 3 | A1-2023rePoolcol1_h3n2-1-1 | A/Bhutan/0006/2023 | 1 | A1 | 41467 | 2285326 | 0.009072 |
| 4 | A1-2023rePoolcol1_h3n2-1-1 | A/Bhutan/0845/2023 | 1 | A1 | 36796 | 2285326 | 0.008050 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1891 | H3-2023rePoolcol12_h3n2-2048-2 | A/Thailand/8/2022 | 2048 | H3 | 0 | 333107 | 0.000000 |
| 1892 | H3-2023rePoolcol12_h3n2-2048-2 | A/Townsville/68/2023 | 2048 | H3 | 1971 | 333107 | 0.002959 |
| 1893 | H3-2023rePoolcol12_h3n2-2048-2 | A/Victoria/1033/2023 | 2048 | H3 | 2 | 333107 | 0.000003 |
| 1894 | H3-2023rePoolcol12_h3n2-2048-2 | A/Wisconsin/27/2023 | 2048 | H3 | 1 | 333107 | 0.000002 |


We have now calculated the fraction of reads from each strain in every well. However, we should focus on the wells containing dilutions that we would use for actual neutralization assays: a set of replicate wells where the fraction of neutralization standard reads begins to increase linearly with the reciprocal dilution factor. See the scatterplots above for choosing these wells.
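
One way to make this choice less subjective is to look at the local slope of log(fraction) against log(dilution factor): where the fraction grows linearly with the reciprocal dilution factor, that slope is close to 1. The sketch below is only an illustration and was not used for the analysis in this notebook; the function name `local_loglog_slopes`, the tolerance of 0.25, and the toy values are all hypothetical.

```python
import math


def local_loglog_slopes(dilution_factors, fracs):
    """Slope of log(frac) vs log(dilution) between each pair of adjacent points."""
    slopes = []
    for i in range(len(fracs) - 1):
        dx = math.log(dilution_factors[i + 1]) - math.log(dilution_factors[i])
        dy = math.log(fracs[i + 1]) - math.log(fracs[i])
        slopes.append(dy / dx)
    return slopes


# toy data: nearly flat at low dilution, then doubling with each two-fold dilution
toy_dilutions = [4, 8, 16, 32, 64, 128, 256]
toy_fracs = [0.010, 0.010, 0.011, 0.014, 0.028, 0.056, 0.112]

slopes = local_loglog_slopes(toy_dilutions, toy_fracs)
tolerance = 0.25  # hypothetical cutoff for "close to 1"

# first dilution from which the fraction scales linearly with dilution
linear_start = next(
    d for d, s in zip(toy_dilutions, slopes) if abs(s - 1) <= tolerance
)
print([round(s, 2) for s in slopes])
print(linear_start)
```

Fractions of exactly zero would need to be filtered out before taking logarithms, and real data are noisy enough that the plots should still be inspected by eye.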

In [12]:
# Using E1 and A3 (dilution_factor 16 in the samples file), corresponding to
# reciprocal dilution factor = 64 after the 4x rescaling used for the scatterplots
single_well = merged_df.loc[merged_df['well'].isin(['E1', 'A3'])]

In [13]:
# Calculate mean fraction strain across both wells
mean_df = (
    single_well.groupby('strain')['fraction_strain']
    .mean()
    .rename('mean_fraction_strains')
    .reset_index()
)
mean_single_well = single_well.merge(mean_df, on='strain', how='left')

# calculate ratios to add for equal pool
num_strains = 75
mean_single_well['ratio_to_add'] = (1/num_strains)/mean_single_well['fraction_strain']
mean_single_well['mean_ratio_to_add'] = (1/num_strains)/mean_single_well['mean_fraction_strains']

mean_single_well['est_tcid50'] = (mean_single_well['mean_fraction_strains']*25000)*num_strains

mean_single_well.head(5)

| | sample | strain | dilution_factor | well | count | counts_perwell | fraction_strain | mean_fraction_strains | ratio_to_add | mean_ratio_to_add | est_tcid50 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A3-2023rePoolcol5_h3n2-16-2 | A/Abu_Dhabi/6753/2023 | 16 | A3 | 32744 | 2354749 | 0.006953 | 0.006948 | 1.917704 | 1.918881 | 13028.424411 |
| 1 | A3-2023rePoolcol5_h3n2-16-2 | A/Bangkok/P3599/2023 | 16 | A3 | 21540 | 2354749 | 0.004574 | 0.004002 | 2.915195 | 3.331531 | 7504.057985 |
| 2 | A3-2023rePoolcol5_h3n2-16-2 | A/Bangkok/P3755/2023 | 16 | A3 | 41678 | 2354749 | 0.00885 | 0.009778 | 1.50663 | 1.363579 | 18334.099351 |
| 3 | A3-2023rePoolcol5_h3n2-16-2 | A/Bhutan/0006/2023 | 16 | A3 | 34296 | 2354749 | 0.007282 | 0.009496 | 1.830922 | 1.404135 | 17804.559139 |
| 4 | A3-2023rePoolcol5_h3n2-16-2 | A/Bhutan/0845/2023 | 16 | A3 | 35376 | 2354749 | 0.007512 | 0.007424 | 1.775026 | 1.796097 | 13919.067887 |
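
The `ratio_to_add` columns follow from simple arithmetic: a strain at fraction f in a pool of N strains needs its input scaled by (1/N)/f to reach equal representation. A minimal sketch with a hypothetical helper `ratio_to_add` and made-up fractions (not values from this experiment):

```python
def ratio_to_add(fraction, num_strains):
    """Fold-change in input needed to bring a strain to 1/num_strains of the pool."""
    if fraction <= 0:
        # a strain with no reads has no finite ratio
        return float("inf")
    return (1 / num_strains) / fraction


num_strains = 75  # same value as in the cell above

examples = [
    ("at target", 1 / 75),
    ("half of target", 1 / 150),
    ("twice target", 2 / 75),
]
for label, frac in examples:
    print(f"{label}: {ratio_to_add(frac, num_strains):.2f}")
```

Ratios above 1 mark under-represented strains that need more input, and ratios below 1 mark over-represented strains that need less.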


## Visualize barcode- and strain-level balancing in the current pool

In [14]:
# Plot the current fraction of each strain in the pool
strains_chart = (
    alt.Chart(mean_single_well)
    .encode(
        alt.X(
            "fraction_strain",
            scale=alt.Scale(nice=False, padding=3),
        ),
        alt.Y("strain"),
        
        tooltip = ['strain', 'fraction_strain', 'est_tcid50'],
    )
).mark_bar(height={"band": 0.85}).properties(
        height=alt.Step(10),
        width=200,
        title="Strain representation, repool",
    )

# add vertical line where we would expect equal representation of all strains in pool
expected_line = alt.Chart(
    pd.DataFrame({'x': [1/num_strains]})
).mark_rule(strokeDash = [2,2], strokeWidth = 2).encode(x = 'x')

# plot both barcode counts and expected line
strains_chart + expected_line

In [15]:
# Each barcode fraction across strains
all_barcode_counts = (counts
                      .query('virus_library == "h3n2"')
                      [['strain', 'barcode', 'count', 'well']].dropna()
                     )
single_well_all_barcode_counts = all_barcode_counts[all_barcode_counts['well'].isin(['E1','A3'])]

# Get tidy single well means (mean count per barcode across the two wells)
tidy_single_well = (
    single_well_all_barcode_counts
    .groupby(['strain', 'barcode'], as_index=False)['count']
    .mean()
)
# Get sums for each strain
strain_sums_df = (
    tidy_single_well
    .groupby('strain', as_index=False)['count']
    .sum()
    .rename(columns={'count': 'strain_count_sum'})
)
# Merge and calculate per strain the fraction represented by each barcode
tidy_single_well = tidy_single_well.merge(strain_sums_df, on='strain', validate='many_to_one')
tidy_single_well['per_strain_fraction_barcode'] = tidy_single_well['count'] / tidy_single_well['strain_count_sum']
# Label the barcodes within each strain A, B, C (assumes at most 3 barcodes per strain)
tidy_single_well['barcode_letter'] = tidy_single_well.groupby('strain').cumcount().map(dict(enumerate('ABC')))

# Plot as colored bar chart
bar_chart = alt.Chart(tidy_single_well).mark_bar(height={"band": 0.85}).encode(
    x = 'per_strain_fraction_barcode',
    y = 'strain',
    color=alt.Color('barcode_letter', legend=None),
    tooltip = ['strain', 'per_strain_fraction_barcode', 'barcode'],
).configure_axis(grid=False).properties(
        height = alt.Step(10),
        width = 200,
        title = "Barcode fraction for each strain, repool")

bar_chart

Apart from a few strains with lower representation in this experiment (mostly egg-based vaccine strains), this pool looks fairly well-balanced.