# Process plate counts to get ratios of variants in a re-pooled variant library

I repooled the library on 24-Sept-2024 and assess barcode balancing in this notebook.

The plots generated by this notebook are interactive, so you can mouseover points for details, use the mouse-scroll to zoom and pan, and use interactive dropdowns at the bottom of the plots.

## Setup
Import Python modules:

In [1]:
import pickle
import sys

import altair as alt

import matplotlib.pyplot as plt

import numpy

import pandas as pd
from os.path import join
import os
import ruamel.yaml as yaml

_ = alt.data_transformers.disable_max_rows()

## Add input data locations
Some of these files are input data, and others are generated by running the library pooling samples as `miscellaneous_plates` through the `seqneut-pipeline`. For details on how these files are generated, see the `README.md` in [https://github.com/jbloomlab/seqneut-pipeline](https://github.com/jbloomlab/seqneut-pipeline)

In [2]:
# Define file path prefix
filepath_prefix = '/fh/fast/bloom_j/computational_notebooks/ckikawa/2024/flu_seqneut_H3N2_2023-2024'

# Viral library contents and barcode IDs
viral_library_csv = filepath_prefix + '/data/viral_libraries/2023_H3N2_Kikawa.csv'
# Neutralization standard set of barcode IDs
neut_standard_set_csv = filepath_prefix + '/data/neut_standard_sets/loes2023_neut_standards.csv'
# All samples included in library pooling sequencing run
# Contains information on library, dilution factor, R1 location
samplesfile = filepath_prefix + '/data/miscellaneous_plates/2024-09-24_repool_H3N2_balancing.csv'

# Counts and fates files output by running library pooling samples as miscellaneous plates
platedir = filepath_prefix + '/results/miscellaneous_plates/240924_repool_H3N2_balancing/'

# Identify all counts and fates CSVs
count_csvs = []
fate_csvs = []
file_list = os.listdir(platedir)
for f in file_list:
    location = platedir + f
    if "_counts" in f:
        count_csvs.append(location)
    elif "_fates" in f:
        fate_csvs.append(location)

In [3]:
# Define a samples dataframe using the samples file
samples_df = pd.read_csv(samplesfile)
samples_df.drop(columns=['fastq'], inplace=True)
samples_df['sample'] = samples_df.apply(
    lambda x: '-'.join(x.astype(str)), axis=1
)

samples = samples_df["sample"].unique().tolist()
print(f"There are {len(samples)} barcode runs.")

samples_df

There are 32 barcode runs.


Unnamed: 0,well,serum,dilution_factor,replicate,sample
0,A1,H3N2_100uL_libID39,4,1,A1-H3N2_100uL_libID39-4-1
1,B1,H3N2_100uL_libID39,8,1,B1-H3N2_100uL_libID39-8-1
2,C1,H3N2_100uL_libID39,16,1,C1-H3N2_100uL_libID39-16-1
3,D1,H3N2_100uL_libID39,32,1,D1-H3N2_100uL_libID39-32-1
4,E1,H3N2_100uL_libID39,64,1,E1-H3N2_100uL_libID39-64-1
5,F1,H3N2_100uL_libID39,128,1,F1-H3N2_100uL_libID39-128-1
6,G1,H3N2_100uL_libID39,256,1,G1-H3N2_100uL_libID39-256-1
7,H1,H3N2_100uL_libID39,512,1,H1-H3N2_100uL_libID39-512-1
8,A2,H3N2_100uL_libID39,4,2,A2-H3N2_100uL_libID39-4-2
9,B2,H3N2_100uL_libID39,8,2,B2-H3N2_100uL_libID39-8-2


## Statistics on barcode-parsing for each sample
Make interactive chart of the "fates" of the sequencing reads parsed for each sample on the plate.

If most sequencing reads are not "valid barcodes", this could potentially indicate some problem in the sequencing or barcode set you are parsing.

Potential fates are:
 - *valid barcode*: barcode that matches a known virus or neutralization standard, we hope most reads are this.
 - *invalid barcode*: a barcode with proper flanking sequences that does not match a known virus or neutralization standard. If you have a lot of reads of this type, it is probably a good idea to look at the invalid barcode CSVs (in the `./results/barcode_invalid/` subdirectory created by the pipeline) to see what these invalid barcodes are.
 - *unparseable barcode*: could not parse a barcode from this read as there was not a sequence of the correct length with the appropriate flanking sequence.
 - *low quality barcode*: low-quality or `N` nucleotides in barcode, could indicate problem with sequencing.
 - *failed chastity filter*: reads that failed the Illumina chastity filter, if these are reported in the FASTQ (they may not be).

Also, if the number of reads per sample is very uneven, that could indicate that you did not do a good job of balancing the different samples in the Illumina sequencing.
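
A minimal sketch of the same checks in numbers rather than bars: the fraction of reads per well that are valid barcodes, and each well's read total relative to the median well. The toy dataframe and the helper name `summarize_fates` are hypothetical, and the fate label `"valid barcode"` is assumed to match the label in the fates CSVs.

```python
import pandas as pd


def summarize_fates(fates_df):
    """Per-well read totals, valid-barcode fraction, and evenness.

    Assumes columns 'well', 'fate', 'count' (hypothetical toy layout).
    """
    totals = fates_df.groupby("well")["count"].sum().rename("total_reads")
    valid = (
        fates_df[fates_df["fate"] == "valid barcode"]
        .groupby("well")["count"]
        .sum()
        .rename("valid_reads")
    )
    summary = pd.concat([totals, valid], axis=1).fillna({"valid_reads": 0})
    summary["frac_valid"] = summary["valid_reads"] / summary["total_reads"]
    # evenness: each well's total reads relative to the median well
    summary["reads_rel_to_median"] = (
        summary["total_reads"] / summary["total_reads"].median()
    )
    return summary.reset_index()


# toy data, not real sequencing results
toy_fates = pd.DataFrame(
    {
        "well": ["A1", "A1", "B1", "B1"],
        "fate": [
            "valid barcode",
            "invalid barcode",
            "valid barcode",
            "unparseable barcode",
        ],
        "count": [900, 100, 400, 100],
    }
)
fates_summary = summarize_fates(toy_fates)
print(fates_summary)
```

A well with a low `frac_valid`, or a `reads_rel_to_median` far from 1, would be worth a closer look before trusting its barcode counts.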

In [4]:
fates = (
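    # str.strip() removes a *set of characters*, not a prefix/suffix, so get the
    # well from the file name instead (assumes files are named <well>_fates.csv)
    pd.concat([pd.read_csv(f).assign(well=os.path.basename(f).replace('_fates.csv', '')) for f in fate_csvs])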
    .merge(samples_df, validate="many_to_one", on="well")
    .assign(
        fate_counts=lambda x: x.groupby("fate")["count"].transform("sum"),
        sample_well=lambda x: x["sample"] + " (" + x["well"] + ")",
    )
    .query("fate_counts > 0")[  # only keep fates with at least one count
        ["fate", "count", "well", "sample_well", "dilution_factor"]
    ]
)

assert len(fates) == len(fates.drop_duplicates())


sample_wells = list(
    fates.sort_values(["dilution_factor"])["sample_well"]
)



fates_chart = (
    alt.Chart(fates)
    .encode(
        alt.X("count", scale=alt.Scale(nice=False, padding=3)),
        alt.Y(
            "sample_well",
            title=None,
            sort=sample_wells,
        ),
        alt.Color("fate", sort=sorted(fates["fate"].unique(), reverse=True)),
        alt.Order("fate", sort="descending"),
        tooltip=fates.columns.tolist(),
    )
    .mark_bar(height={"band": 0.85})
    .properties(
        height=alt.Step(10),
        width=200,
        title=f"Barcode parsing for initial titering plate",
    )
    .configure_axis(grid=False)
)

fates_chart

Nice counts for all wells. 

## Read barcode counts
Read the counts per barcode:

In [5]:
# get barcode counts
counts = (
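    # str.strip() removes a *set of characters*, not a prefix/suffix, so get the
    # well from the file name instead (assumes files are named <well>_counts.csv)
    pd.concat([pd.read_csv(c).assign(well=os.path.basename(c).replace('_counts.csv', '')) for c in count_csvs])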
    .merge(samples_df, validate="many_to_one", on="well")
    .drop(columns=["replicate"])
    .assign(sample_well=lambda x: x["sample"] + " (" + x["well"] + ")")
)

# classify barcodes as viral or neut standard
barcode_class = pd.concat(
    [
        pd.read_csv(viral_library_csv)[["barcode", "strain"]].assign(
            neut_standard=False,
        ),
        pd.read_csv(neut_standard_set_csv)[["barcode"]].assign(
            neut_standard=True,
            strain=pd.NA,
        ),
    ],
    ignore_index=True,
)

# merge counts and classification of barcodes
assert set(counts["barcode"]) == set(barcode_class["barcode"])
counts = counts.merge(barcode_class, on="barcode", validate="many_to_one")
assert set(sample_wells) == set(counts["sample_well"])

## Average counts per barcode in each well

Plot average counts per barcode.
If a sample has inadequate barcode counts, it may not have good enough statistics for accurate analysis, so wells averaging fewer than 500 counts per barcode are flagged as failing QC:

In [6]:
avg_barcode_counts = (
    counts.groupby(
        ["well", "sample_well"],
        dropna=False,
        as_index=False,
    )
    .aggregate(avg_count=pd.NamedAgg("count", "mean"))
    .assign(
        fails_qc=lambda x: (
            x["avg_count"] < 500
        ),
    )
)

avg_barcode_counts_chart = (
    alt.Chart(avg_barcode_counts)
    .encode(
        alt.X(
            "avg_count",
            title="average barcode counts per well",
            scale=alt.Scale(nice=False, padding=3),
        ),
        alt.Y("sample_well", sort=sample_wells),
        alt.Color(
            "fails_qc",
            title="fails min average barcode count threshold (< 500)",
            legend=alt.Legend(titleLimit=500),
        ),
        tooltip=[
            alt.Tooltip(c, format=".3g") if avg_barcode_counts[c].dtype == float else c
            for c in avg_barcode_counts.columns
        ],
    )
    .mark_bar(height={"band": 0.85})
    .properties(
        height=alt.Step(10),
        width=250,
        title=f"Average barcode counts per well for titering plate",
    )
    .configure_axis(grid=False)
)

display(avg_barcode_counts_chart)

# drop wells failing QC
avg_barcode_counts_per_well_drops = list(avg_barcode_counts.query("fails_qc")["well"])

As I saw above, there are no barcodes sequenced in well A1. 

## Fraction of counts from neutralization standard
Determine the fraction of counts from the neutralization standard in each sample, and make sure this fraction passes the QC threshold.

In [7]:
neut_standard_fracs = (
    counts.assign(
        neut_standard_count=lambda x: x["count"] * x["neut_standard"].astype(int)
    )
    .groupby(
        ["well", "sample_well", 'dilution_factor'],
        dropna=False,
        as_index=False,
    )
    .aggregate(
        total_count=pd.NamedAgg("count", "sum"),
        neut_standard_count=pd.NamedAgg("neut_standard_count", "sum"),
    )
    .assign(
        neut_standard_frac=lambda x: x["neut_standard_count"] / x["total_count"],
        fails_qc=lambda x: (
            x["neut_standard_frac"] < 0.001
        ),
    )
)

neut_standard_fracs_chart = (
    alt.Chart(neut_standard_fracs)
    .encode(
        alt.X(
            "neut_standard_frac",
            title="frac counts from neutralization standard per well",
            scale=alt.Scale(nice=False, padding=3),
        ),
        alt.Y("sample_well", sort=sample_wells),
        alt.Color(
            "fails_qc",
            title="fails min_neut_standard_frac_per_well (< 0.001)",
            legend=alt.Legend(titleLimit=500),
        ),
        tooltip=[
            alt.Tooltip(c, format=".3g") if neut_standard_fracs[c].dtype == float else c
            for c in neut_standard_fracs.columns
        ],
    )
    .mark_bar(height={"band": 0.85})
    .properties(
        height=alt.Step(10),
        width=250,
        title=f"Neutralization-standard fracs per well for titering plate, initial pool",
    )
    .configure_axis(grid=False)
    .configure_legend(titleLimit=1000)
)

display(neut_standard_fracs_chart)

# drop wells failing QC
min_neut_standard_frac_per_well_drops = list(
    neut_standard_fracs.query("fails_qc")["well"]
)

I titrated across columns, so this looks appropriate, with the fraction of counts devoted to neutralization standard increasing as you go down the plate (towards row G).

In [8]:
# Scatterplot of the same data as above, plotted by dilution factor
alt.Chart(neut_standard_fracs).mark_circle(size=60).encode(
    alt.X('dilution_factor:Q', 
          scale=alt.Scale(type='log'),
          title='library pool reciprocal dilution factor'),
    alt.Y('neut_standard_frac:Q', 
          scale=alt.Scale(type='log'),
          title='fraction of reads = neutralization standard'),
    color='fails_qc',
    tooltip=['well', 'dilution_factor', 'neut_standard_frac', 'total_count']
).interactive()

Again, this is looking pretty consistent with other library dilutions, where the linear range starts somewhere in the 32-64 reciprocal dilution factor range.

## Assessing balancing of strains contained in the library
Viruses were rescued and blind passaged individually. The 3 viral barcodes associated with each strain were pooled prior to rescue, so they cannot be balanced.

In [51]:
# Get summed barcode counts for all strains across all wells
straincounts_allbarcodes = (counts.groupby(['sample','sample_well','strain','dilution_factor','serum','well'])
                          .sum()
                          .reset_index()
                          .drop(columns = ['sample_well', 'neut_standard', 'barcode'])
                         )

# Get sum of all virus/barcode counts per well
sumperwell = (straincounts_allbarcodes.groupby(['sample','dilution_factor','serum','well'])
              .sum()
              .drop(columns=['strain'])
              .reset_index()
              .rename(columns={'count':'counts_perwell'})
             )

# Merge dataframes and calculate fraction of each well devoted to each strain
merged_df = straincounts_allbarcodes.merge(sumperwell, on=['sample','dilution_factor','serum','well'])
merged_df['fraction_strain'] = merged_df['count'] /merged_df['counts_perwell'] 
merged_df

Unnamed: 0,sample,strain,dilution_factor,serum,well,count,counts_perwell,fraction_strain
0,A1-H3N2_100uL_libID39-4-1,A/Abu_Dhabi/6753/2023,4,H3N2_100uL_libID39,A1,11401,1416474,0.008049
1,A1-H3N2_100uL_libID39-4-1,A/Bangkok/P3599/2023,4,H3N2_100uL_libID39,A1,19582,1416474,0.013824
2,A1-H3N2_100uL_libID39-4-1,A/Bangkok/P3755/2023,4,H3N2_100uL_libID39,A1,17458,1416474,0.012325
3,A1-H3N2_100uL_libID39-4-1,A/Bhutan/0006/2023,4,H3N2_100uL_libID39,A1,20132,1416474,0.014213
4,A1-H3N2_100uL_libID39-4-1,A/Bhutan/0845/2023,4,H3N2_100uL_libID39,A1,15311,1416474,0.010809
...,...,...,...,...,...,...,...,...
2179,G4-H3N2_50uL_libID39-256-2,A/Thailand/8/2022,256,H3N2_50uL_libID39,G4,6149,800436,0.007682
2180,G4-H3N2_50uL_libID39-256-2,A/Townsville/68/2023,256,H3N2_50uL_libID39,G4,8652,800436,0.010809
2181,G4-H3N2_50uL_libID39-256-2,A/Victoria/1033/2023,256,H3N2_50uL_libID39,G4,9763,800436,0.012197
2182,G4-H3N2_50uL_libID39-256-2,A/Wisconsin/27/2023,256,H3N2_50uL_libID39,G4,7594,800436,0.009487


We now have this fraction of reads devoted to all strains calculated for all wells. However, ideally we should just focus on those wells containing dilutions that we would use for actual neutralization assays. We should choose a set of replicate wells where the fraction of neutralization standard reads begins to increase linearly with the increasing reciprocal dilution factor. See plots above for choosing these wells. 
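
One way to make "begins to increase linearly" a bit more concrete is to look at the local slope on a log-log scale: in the linear range, each 2-fold dilution should roughly double the neutralization-standard fraction, giving a slope near 1. The sketch below is only an illustration on toy numbers; the helper name `local_loglog_slopes` and the tolerance of 0.25 are hypothetical choices, not values used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def local_loglog_slopes(fracs_df, tol=0.25):
    """Slope of log2(neut_standard_frac) vs log2(dilution_factor) per dilution step.

    Assumes columns 'dilution_factor' and 'neut_standard_frac'; replicate wells
    at the same dilution are averaged first. `tol` is a hypothetical tolerance.
    """
    mean_fracs = (
        fracs_df.groupby("dilution_factor", as_index=False)["neut_standard_frac"]
        .mean()
        .sort_values("dilution_factor")
        .reset_index(drop=True)
    )
    dilutions = mean_fracs["dilution_factor"].to_numpy(dtype=float)
    log_d = np.log2(dilutions)
    log_f = np.log2(mean_fracs["neut_standard_frac"].to_numpy(dtype=float))
    slopes = np.diff(log_f) / np.diff(log_d)
    return pd.DataFrame(
        {
            "from_dilution": dilutions[:-1],
            "to_dilution": dilutions[1:],
            "slope": slopes,
            # a slope near 1 means the fraction scales with dilution
            "approx_linear": np.abs(slopes - 1) <= tol,
        }
    )


# toy data (two identical replicates), not values from this plate
toy_fracs = pd.DataFrame(
    {
        "dilution_factor": [4, 8, 16, 32, 64, 128] * 2,
        "neut_standard_frac": [0.010, 0.011, 0.013, 0.020, 0.040, 0.080] * 2,
    }
)
slopes_table = local_loglog_slopes(toy_fracs)
print(slopes_table)
```

In this toy example only the 32-to-64 and 64-to-128 steps are flagged as approximately linear. On real data the scatterplot is still the primary guide; the slopes just give a number to go with it.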

In [77]:
SELECT_WELLS = [
    'E1',
    'E2'
]

In [78]:
# Using the wells in SELECT_WELLS (E1 and E2), corresponding to reciprocal dilution factor = 64
single_well = merged_df.loc[merged_df['well'].isin(SELECT_WELLS)]

In [79]:
# Calculate mean fraction strain across both wells
mean_df = single_well.groupby(['strain'])['fraction_strain'].mean().to_frame().rename(columns = {'fraction_strain': 'mean_fraction_strains'}).reset_index()
mean_single_well = single_well.merge(mean_df, on = 'strain', how = 'left')

# calculate ratios to add for equal pool
num_strains = 76
mean_single_well['ratio_to_add'] = (1/num_strains)/mean_single_well['fraction_strain']
mean_single_well['mean_ratio_to_add'] = (1/num_strains)/mean_single_well['mean_fraction_strains']

mean_single_well['est_tcid50'] = (mean_single_well['mean_fraction_strains']*25000)*num_strains

mean_single_well.head()

Unnamed: 0,sample,strain,dilution_factor,serum,well,count,counts_perwell,fraction_strain,mean_fraction_strains,ratio_to_add,mean_ratio_to_add,est_tcid50
0,E1-H3N2_100uL_libID39-64-1,A/Abu_Dhabi/6753/2023,64,H3N2_100uL_libID39,E1,8830,894746,0.009869,0.009538,1.333293,1.379595,18121.263896
1,E1-H3N2_100uL_libID39-64-1,A/Bangkok/P3599/2023,64,H3N2_100uL_libID39,E1,10124,894746,0.011315,0.012776,1.162878,1.02987,24274.908572
2,E1-H3N2_100uL_libID39-64-1,A/Bangkok/P3755/2023,64,H3N2_100uL_libID39,E1,11934,894746,0.013338,0.01454,0.986507,0.904919,27626.796138
3,E1-H3N2_100uL_libID39-64-1,A/Bhutan/0006/2023,64,H3N2_100uL_libID39,E1,14023,894746,0.015673,0.015011,0.839547,0.876532,28521.48095
4,E1-H3N2_100uL_libID39-64-1,A/Bhutan/0845/2023,64,H3N2_100uL_libID39,E1,8713,894746,0.009738,0.011824,1.351196,1.112855,22464.749225
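
As a worked check of the ratio calculation: the target fraction for an equal pool is `1/num_strains`, and the ratio to add is that target divided by the observed fraction. The sketch below plugs in the rounded `mean_fraction_strains` for A/Abu_Dhabi/6753/2023 from the table above; the helper name `ratio_to_add` is only for illustration.

```python
def ratio_to_add(fraction_strain, num_strains):
    """Fold-change needed to bring a strain to equal representation."""
    target_fraction = 1 / num_strains
    return target_fraction / fraction_strain


num_strains = 76
# rounded mean_fraction_strains for A/Abu_Dhabi/6753/2023 from the table above
abu_dhabi_mean_fraction = 0.009538

ratio = ratio_to_add(abu_dhabi_mean_fraction, num_strains)
print(f"{ratio:.2f}")  # 1.38, matching mean_ratio_to_add in the table

# a strain sitting at one quarter of the target fraction needs 4x more
print(ratio_to_add((1 / num_strains) / 4, num_strains))  # 4.0
```

The small difference from the table's 1.379595 comes from using the rounded fraction shown in the output rather than the full-precision value.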


## Visualize barcode- and strain-level balancing in the current pool

In [80]:
# Plot the current fraction of each strain in the pool
strains_chart = (
    alt.Chart(mean_single_well)
    .encode(
        alt.X(
            "fraction_strain",
            scale=alt.Scale(nice=False, padding=3),
        ),
        alt.Y("strain"),
        
        tooltip = ['strain', 'fraction_strain', 'est_tcid50'],
    )
).mark_bar(height={"band": 0.85}).properties(
        height = alt.Step(10),
        width = 200,
        title = "Strain representation, initial pool")

# add vertical line where we would expect equal representation of all strains in pool
expected_line = alt.Chart(
    pd.DataFrame({'x': [1/num_strains]})
).mark_rule(strokeDash = [2,2], strokeWidth = 2).encode(x = 'x')

# plot both barcode counts and expected line
strains_chart + expected_line

Everything looks *fairly* well balanced with the exception of **A/South_Africa/R05510/2023** which should be re-added to this pool at ~4x concentration to get a bit closer to other strains. I'm happy enough with all other strain balancing. 

In [60]:
# Each barcode fraction across strains
all_barcode_counts = counts[['strain', 'barcode', 'count', 'well']].dropna()
single_well_all_barcode_counts = all_barcode_counts[all_barcode_counts['well'].isin([f'{SELECT_WELLS[0]}',f'{SELECT_WELLS[1]}'])]

# Get tidy single well means
tidy_single_well = single_well_all_barcode_counts[['strain','barcode','count']].groupby(['strain', 'barcode']).mean().reset_index()
# Get sums for each strain
strain_sums_df = tidy_single_well.groupby('strain').sum().rename(columns = {'count': 'strain_count_sum'}).reset_index()
# Merge and calculate per strain the fraction represented by each barcode
tidy_single_well = tidy_single_well.merge(strain_sums_df[['strain', 'strain_count_sum']], 
                       on = ['strain'],
                       validate="many_to_one",
                      )
tidy_single_well['per_strain_fraction_barcode'] = tidy_single_well['count'] / tidy_single_well['strain_count_sum']
tidy_single_well['barcode_letter'] = (['A', 'B', 'C'] * len(strain_sums_df))

# Plot as colored bar chart
bar_chart = alt.Chart(tidy_single_well).mark_bar(height={"band": 0.85}).encode(
    x = 'per_strain_fraction_barcode',
    y = 'strain',
    color=alt.Color('barcode_letter', legend=None),
    tooltip = ['strain', 'per_strain_fraction_barcode', 'barcode'],
).configure_axis(grid=False).properties(
        height = alt.Step(10),
        width = 200,
        title = "Barcode fraction for each strain, initial pool")

bar_chart

Again, all strains look *OK* except for **A/South_Africa/R05510/2023** which has terrible barcode balancing. I expect this is due to just really low relative concentration and that better balance will be achieved with a re-pooled pool containing this strain at 4x concentration. 

Other strains (**A/Wisconsin/27/2023**, **A/YAMAGATA/98/2023**) have sort of questionable balancing, but since these are the same rescues that have been performing well in the prior library pool for **many** plates of neuts, I'm not exactly sure we should be worried. I'm thinking this could be as simple as a sampling issue or a mixing issue. I'll be careful to really mix the pool well before I re-add **A/South_Africa/R05510/2023** and aliquot. 
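
To put a number on "questionable balancing", one option is to flag strains where any single barcode falls below some minimum share of that strain's counts. This is only a sketch on toy data: the helper name `barcode_balance`, the toy strain and barcode names, and the 10% cutoff are all hypothetical. It also assigns barcode letters with `cumcount` within each strain, which does not depend on every strain having exactly 3 barcodes.

```python
import pandas as pd


def barcode_balance(barcode_counts, min_frac=0.1):
    """Per-strain barcode fractions, plus strains with a poorly represented barcode.

    Assumes columns 'strain', 'barcode', 'count'. `min_frac` is a hypothetical cutoff.
    """
    df = barcode_counts.sort_values(["strain", "barcode"]).reset_index(drop=True)
    strain_totals = df.groupby("strain")["count"].transform("sum")
    df["per_strain_fraction_barcode"] = df["count"] / strain_totals
    # letter barcodes A, B, C, ... in sorted order within each strain
    df["barcode_letter"] = (
        df.groupby("strain").cumcount().map(lambda i: chr(ord("A") + i))
    )
    min_fracs = df.groupby("strain")["per_strain_fraction_barcode"].min()
    flagged = sorted(min_fracs[min_fracs < min_frac].index)
    return df, flagged


# toy data, not real strains or barcodes
toy_counts = pd.DataFrame(
    {
        "strain": ["strain_X"] * 3 + ["strain_Y"] * 3,
        "barcode": ["AAA", "CCC", "GGG", "AAT", "CCT", "GGT"],
        "count": [100, 100, 100, 5, 500, 495],
    }
)
balance_df, flagged_strains = barcode_balance(toy_counts)
print(flagged_strains)
```

Here only `strain_Y` is flagged, because one of its barcodes carries 0.5% of that strain's counts.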

In summary, I'll go ahead and make the new pool with 4x more **A/South_Africa/R05510/2023** and just start using it.