This notebook loads the pipeline output from the extract_segments step and exports an Eaton-formatted codebook.

In [None]:
import operator
from functools import partial
from glob import glob
from pathlib import Path

import gfapy
import holoviews as hv
import hvplot.pandas
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import polars as pl
from tqdm.auto import tqdm, trange

%load_ext autoreload
%autoreload 2

import paulssonlab.sequencing.processing as processing
from paulssonlab.util.sequence import reverse_complement

hv.extension("matplotlib")
# Important: without the string cache, the variants_path and grouping_path columns may appear corrupted.
pl.enable_string_cache()

# Functions

In [None]:
def concat_glob(filename):
    return pl.concat([pl.scan_ipc(f) for f in glob(filename)], how="diagonal")


def load_sequencing(filename, filter=True):
    df = concat_glob(filename)
    if "is_primary_alignment" not in df.collect_schema().names():
        df = df.with_columns(is_primary_alignment=pl.col("name").is_first_distinct())
    df = df.with_columns(
        dup=pl.col("name").is_duplicated(),
        e2e=pl.col("variants_path")
        .list.set_intersection(["<UNS9", ">UNS9", "<UNS3", ">UNS3"])
        .list.len()
        == 2,
        bc_e2e=pl.col("variants_path")
        .list.set_intersection(
            ["<BC:T7_prom", ">BC:T7_prom", "<BC:spacer2", ">BC:spacer2"]
        )
        .list.len()
        == 2,
    )
    if filter:
        df = df.filter(pl.col("is_primary_alignment"), pl.col("e2e"))
    return df


def path_to_barcode_string(path_col, bits=list(range(30))):
    if isinstance(path_col, str):
        path_col = pl.col(path_col)
    return pl.concat_str(
        [
            pl.when(
                path_col.list.contains(f">BC:bit{bit}=1").or_(
                    path_col.list.contains(f"<BC:bit{bit}=1")
                )
            )
            .then(pl.lit("1"))
            .otherwise(pl.lit("0"))
            for bit in bits
        ]
    )
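
To make the encoding concrete, here is a plain-Python sketch of what `path_to_barcode_string` computes for a single alignment path. The helper name `barcode_string_for_path` and the example path are illustrative assumptions, not part of the pipeline; the Polars version above applies the same rule to a whole list column at once.

```python
def barcode_string_for_path(path, bits=range(30)):
    """Return one character per bit: "1" if the path visits the bit's `=1`
    segment in either orientation (">" or "<"), otherwise "0"."""
    segments = set(path)
    return "".join(
        "1" if f">BC:bit{bit}=1" in segments or f"<BC:bit{bit}=1" in segments else "0"
        for bit in bits
    )


# Hypothetical path: bit 0 is set (forward), bit 1 is unset, bit 2 is set (reverse),
# and bit 3 does not appear in the path at all.
example_path = [">BC:bit0=1", ">BC:bit1=0", "<BC:bit2=1"]
example_barcode = barcode_string_for_path(example_path, bits=range(4))
print(example_barcode)  # 1010
```

Note that a bit whose segment is missing from the path is encoded as "0", exactly like a bit that aligned to its `=0` segment.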

# Load

Load the appropriate variants GFA and extract_segments Arrow output.

In [None]:
gfa = gfapy.Gfa.from_file(
    "/home/jqs1/scratch/sequencing/sequencing_references/pLIB502-503.gfa"
)

In [None]:
%%time
df = load_sequencing(
    "/home/jqs1/scratch/sequencing/241007_pLIB502-503/output/max_divergence=0.3/extract_segments/*.arrow"
)
df = processing.compute_divergences(
    df,
    list(dict.fromkeys(([s.split("=")[0] for s in gfa.segment_names]))),
    struct_name="variants_segments",
)

Examine the columns:
- `e2e`: Whether the alignment path covers the variants GFA end-to-end.
- `bc_e2e`: Whether the alignment path covers the grouping GFA end-to-end.
- `is_primary_alignment`: GraphAligner may output more than one alignment for each consensus sequence aligned against the variants GFA; you typically want only the primary (best) alignment.
- `dup`: Flags all consensus sequences that have more than one alignment. Secondary alignments are neither unusual nor a sign that the primary alignment is poor, so it usually does not make sense to filter on this column.
- `name`: A unique ID for each consensus. These IDs may appear multiple times if the same consensus has multiple alignments; they should be unique after filtering for `is_primary_alignment`.
- `consensus_depth`: The number of sequences that were used to compute this consensus.
- `grouping_depth`: The number of sequences that were grouped together during the `PREPARE_CONSENSUS` step. Typically this is the same as `consensus_depth`, but may be different depending on the filtering arguments passed to `consensus.py` during the `PREPARE_CONSENSUS` and `CONSENSUS_PREPARED` steps.
- `consensus_seq`: The raw consensus sequence.
- `grouping_path`: The alignment path used for grouping reads during the `PREPARE_CONSENSUS` step.
- `variants_path`: The alignment path produced by GraphAligner aligning `consensus_seq` against the variants GFA.
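
The flag columns above come from simple set and duplicate checks in `load_sequencing`. The following plain-Python sketch mirrors that logic on a few made-up records. The function name `annotate_alignments`, the record names, and the `sigma:promoter=A`/`=B` segments are assumptions for illustration; the real pipeline computes these flags as Polars expressions.

```python
VARIANT_ENDS = {"<UNS9", ">UNS9", "<UNS3", ">UNS3"}


def annotate_alignments(records):
    """Add `dup`, `e2e`, and a fallback `is_primary_alignment` to each record.

    `records` is a list of dicts with `name` and `variants_path` keys.
    """
    name_counts = {}
    for rec in records:
        name_counts[rec["name"]] = name_counts.get(rec["name"], 0) + 1
    seen = set()
    annotated = []
    for rec in records:
        annotated.append(
            {
                **rec,
                # True for every alignment of a consensus that has more than one alignment.
                "dup": name_counts[rec["name"]] > 1,
                # End-to-end: exactly two of the four oriented end-segment names appear.
                "e2e": len(set(rec["variants_path"]) & VARIANT_ENDS) == 2,
                # Fallback used when the column is absent: first occurrence of each name.
                "is_primary_alignment": rec["name"] not in seen,
            }
        )
        seen.add(rec["name"])
    return annotated


# Hypothetical records: consensus "c1" has two alignments, "c2" has one.
toy = [
    {"name": "c1", "variants_path": [">UNS9", ">sigma:promoter=A", ">UNS3"]},
    {"name": "c1", "variants_path": [">UNS9", ">sigma:promoter=B"]},
    {"name": "c2", "variants_path": ["<UNS3", "<sigma:promoter=A", "<UNS9"]},
]
annotated_toy = annotate_alignments(toy)
for row in annotated_toy:
    print(row["name"], row["dup"], row["e2e"], row["is_primary_alignment"])
```

Both `c1` rows are flagged `dup`, but only the first is treated as primary, and the second is not end-to-end because it never reaches `UNS3`.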

In [None]:
# df is still a LazyFrame here, so resolve the schema explicitly
df.collect_schema().names()

Additionally, there is a `variants_segments` struct column containing hundreds of fields (subcolumns). For each segment `seg`, it has fields for:
- `seg|seq`: The slice of the consensus sequence that aligns to segment `seg`.
- `seg|cigar`: The slice of the CIGAR string that corresponds to the alignment across segment `seg`.
- `seg|variant` (if applicable): If there are mutually exclusive segments `seg=variant1`, `seg=variant2`, and so forth, this specifies which of those variants (`variant1`, `variant2`, etc.) appeared in the alignment.
- `seg|matches`: The number of matches in the CIGAR string for this segment.
- `seg|mismatches`: The number of mismatches in the CIGAR string for this segment.
- `seg|insertions`: The number of insertions in the CIGAR string for this segment.
- `seg|deletions`: The number of deletions in the CIGAR string for this segment.
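
As a sketch of the naming scheme, the snippet below derives base segment names the same way the loading cell does (splitting on `=` and de-duplicating while preserving order) and lists the per-segment fields described above. The segment names and the helper `expected_fields` are hypothetical, and the rule that only segments with `=` variants get a `|variant` field is an assumption based on the description; the authoritative list is whatever the schema actually contains.

```python
FIELD_SUFFIXES = ["seq", "cigar", "matches", "mismatches", "insertions", "deletions"]


def expected_fields(segment_names):
    """Map GFA segment names to the `seg|field` names described in the text."""
    # "seg=variant1" and "seg=variant2" share the base name "seg".
    bases = list(dict.fromkeys(name.split("=")[0] for name in segment_names))
    # Assumption: only segments with mutually exclusive variants get `|variant`.
    has_variants = {name.split("=")[0] for name in segment_names if "=" in name}
    fields = []
    for base in bases:
        fields.extend(f"{base}|{suffix}" for suffix in FIELD_SUFFIXES)
        if base in has_variants:
            fields.append(f"{base}|variant")
    return fields


# Hypothetical segment names, not taken from the real GFA.
toy_segments = ["UNS9", "sigma:promoter=A", "sigma:promoter=B"]
toy_fields = expected_fields(toy_segments)
print(toy_fields)
```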

Examine the list of `variants_segments` field names and choose which segments to load by setting `segment_columns`. Loading every column is not recommended: a typical extract_segments output with all columns is tens of GB, which is inconvenient to work with (e.g., you need to request a lot of memory for your JupyterLab job).

In [None]:
[f.name for f in df.collect_schema()["variants_segments"].fields]

In [None]:
segment_columns = [
    "sigma:promoter|variant",
    "sigma:promoter|divergence",
    "antisigma:promoter|variant",
    "antisigma:promoter|divergence",
    "reporter:promoter|variant",
    "reporter:promoter|divergence",
    "sigma:RBS:RiboJ|divergence",
    "sigma:RBS:BCD_leader|divergence",
    "antisigma:RBS:RiboJ|divergence",
    "antisigma:RBS:BCD_leader|divergence",
    "reporter:RBS:RiboJ|divergence",
    "reporter:RBS:BCD_leader|divergence",
    "sigma:RBS|seq",
    "antisigma:RBS|seq",
    "reporter:RBS|seq",
]

We've been working with a polars LazyFrame up until this point. Once we select only the columns we want to load in, we call `df.collect()` to execute the query and load the dataframe into memory.

In [None]:
%%time
df = df.select(
    pl.col(
        "grouping_path",
        # uncomment if you want to export full consensus sequences, this increases memory usage/file size
        # if so, you probably want 64GB of memory
        # "consensus_seq",
        "name",
        "grouping_path_hash",
        "grouping_depth",
        "consensus_depth",
        "strand",
        "variants_path",
        "is_primary_alignment",
        "dup",
        "e2e",
        "bc_e2e",
    ),
    *[pl.col("variants_segments").struct[f] for f in segment_columns]
)
df = df.collect()

Check the size of the resulting dataframe:

In [None]:
df.estimated_size("gb")

And that it has the columns you expect:

In [None]:
df.columns

# Diagnostics

## Depth

Here we plot grouping depth in descending order (barcode index on x-axis). Our informal heuristic is that a properly sampled sequencing run will show a steep cliff on the right-hand side of this plot.

In [None]:
df["grouping_depth"].sort(descending=True).to_pandas().hvplot.step(
    logy=True, height=800
)

Here we plot the cumulative fractions of barcodes and reads (y-axis) for the subset of the dataset with at most a given grouping depth (x-axis). The steep part of each curve indicates the depth at which we spend most of our sequencing capacity; it should be roughly centered on our target depth. The left and right extremes of both curves show how much barcode space/sequencing capacity we're “wasting” on low-depth (low accuracy) or high-depth (diminishing returns) barcodes.

In [None]:
df.sort("grouping_depth").select(
    pl.col("grouping_depth"),
    frac_barcodes=pl.int_range(1, pl.len() + 1, dtype=pl.UInt32) / pl.len(),
    frac_reads=pl.col("grouping_depth").cum_sum() / pl.col("grouping_depth").sum(),
).to_pandas().hvplot.step("grouping_depth", logx=True, logy=False, where="pre")
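
To see how the two curves are built, here is a plain-Python sketch of the same computation on made-up depths. The function name `cumulative_fractions` and the toy values are assumptions for illustration; each entry stands for one row (barcode) of the dataframe.

```python
from itertools import accumulate


def cumulative_fractions(depths):
    """Return (sorted depths, cumulative barcode fraction, cumulative read fraction)."""
    ordered = sorted(depths)
    n = len(ordered)
    total_reads = sum(ordered)
    # Fraction of barcodes with depth at or below each sorted position.
    frac_barcodes = [(i + 1) / n for i in range(n)]
    # Fraction of all reads contributed by those barcodes.
    frac_reads = [running / total_reads for running in accumulate(ordered)]
    return ordered, frac_barcodes, frac_reads


# Made-up grouping depths: three shallow barcodes and one deep one.
toy_depths = [10, 1, 2, 1]
depths_sorted, frac_barcodes, frac_reads = cumulative_fractions(toy_depths)
for d, fb, fr in zip(depths_sorted, frac_barcodes, frac_reads):
    print(d, round(fb, 2), round(fr, 2))
```

In this toy case, 75% of the barcodes have depth 2 or less, yet they account for only 4 of 14 reads (about 29%), which is the kind of gap between the two curves the plot is meant to expose.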

## Variants

Here we can check for balance among our promoter variants.

In [None]:
df["sigma:promoter|variant"].value_counts()

In [None]:
df["antisigma:promoter|variant"].value_counts()

In [None]:
df["reporter:promoter|variant"].value_counts()

And their pairwise/three-way frequencies:

In [None]:
df.group_by(pl.col("sigma:promoter|variant", "antisigma:promoter|variant")).agg(
    pl.len()
).sort("len", descending=True)

In [None]:
df.group_by(
    pl.col(
        "sigma:promoter|variant",
        "antisigma:promoter|variant",
        "reporter:promoter|variant",
    )
).agg(pl.len()).sort("len", descending=True)

Here we plot the frequency distribution of six-tuples, including all promoter variants and RBS sequences.

In [None]:
counts = df.select(
    pl.struct(
        "sigma:promoter|variant",
        "antisigma:promoter|variant",
        "sigma:RBS|seq",
        "antisigma:RBS|seq",
        "reporter:promoter|variant",
        "reporter:RBS|seq",
    ).alias("foo")
)["foo"].value_counts(sort=True)

In [None]:
counts["count"].to_pandas().hvplot.step(logy=True, logx=True)
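
The plot above is a rank-frequency curve: each distinct combination is counted, the counts are sorted in descending order, and the count is plotted against its rank on log-log axes. A minimal plain-Python sketch of that counting step, using made-up three-element combinations rather than the real six-tuples, might look like this:

```python
from collections import Counter

# Made-up (sigma promoter, antisigma promoter, reporter promoter) combinations.
toy_combos = [
    ("P1", "P2", "P3"),
    ("P1", "P2", "P3"),
    ("P1", "P1", "P3"),
    ("P1", "P2", "P3"),
    ("P2", "P2", "P1"),
    ("P1", "P1", "P3"),
]

combo_counts = Counter(toy_combos)
# Descending counts; the list position (0, 1, 2, ...) is the rank on the x-axis.
ranked_counts = sorted(combo_counts.values(), reverse=True)
print(ranked_counts)  # [3, 2, 1]
```

A well-balanced library would give a nearly flat curve, whereas a few dominant combinations show up as a sharp drop at low rank.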

# Export to Eaton format

Now we convert the barcode into the Eaton-style “0100110...” string format and add some dummy columns that Eaton's pipeline expects.

In [None]:
%%time
df_eaton = df.with_columns(
    barcode=path_to_barcode_string("variants_path"),
    reference=pl.lit(""),
    alignmentstart=1,
    cigar=pl.lit(""),
    subsample=pl.lit(""),
)
if "consensus_seq" not in df_eaton.columns:
    # consensus_seq was not selected above, so add an empty placeholder column
    df_eaton = df_eaton.with_columns(consensus_seq=pl.lit(""))
df_eaton = (
    df_eaton.rename({"consensus_seq": "consensus"})
    .sort("barcode")
    .with_row_index(name="barcodeid")
    .with_row_index(name="")
)

In [None]:
df_eaton
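
As a sketch of how `barcodeid` is assigned, the snippet below sorts barcode strings and numbers the rows from 0, which is what sorting by `barcode` and then adding a row index does. The helper name `assign_barcode_ids` and the toy barcodes are assumptions for illustration.

```python
def assign_barcode_ids(barcodes):
    """Sort barcode strings lexicographically and number the rows from 0."""
    return [
        {"barcodeid": i, "barcode": bc} for i, bc in enumerate(sorted(barcodes))
    ]


# Made-up 4-bit barcodes in arbitrary input order.
toy_rows = assign_barcode_ids(["1010", "0001", "0110"])
for row in toy_rows:
    print(row["barcodeid"], row["barcode"])
```

Because the ID is a row index rather than a rank, two rows that share the same barcode string would receive different `barcodeid` values, so it is worth checking that barcodes are unique before relying on it.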

We then write this to Parquet (which results in much smaller file sizes than CSV).

In [None]:
df_eaton.write_parquet("eaton_export.parquet")