# Reproduce the results presented in "CRISPR-HAWK: Haplotype- and Variant-Aware Guide Design Toolkit for CRISPR-Cas"

In [None]:
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.lines import Line2D

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import pandas as pd
import numpy as np

import warnings
import random
import os

warnings.filterwarnings("ignore")

## Introduction

CRISPR-HAWK is a comprehensive and scalable framework for designing guide RNAs 
(gRNAs) and evaluating the impact of genetic variation on CRISPR-Cas on-target 
activity. Developed as an offline, user-friendly command-line tool, CRISPR-HAWK 
integrates large-scale human variation datasets, including the 1000 Genomes Project, 
the Human Genome Diversity Project (HGDP), and gnomAD, with orthogonal 
genomic annotations to systematically prioritize gRNAs targeting regions of interest.

The framework is Cas-agnostic and supports a broad range of nucleases, such as 
Cas9, SaCas9, and Cpf1 (Cas12a), while also allowing full customization of PAM 
sequences and guide lengths. This flexibility ensures compatibility with emerging 
CRISPR technologies and enables users to tailor gRNA design to specific experimental 
needs.

CRISPR-HAWK incorporates both single-nucleotide variants (SNVs) and small 
insertions and deletions (indels), and it natively handles individual- and 
population-specific haplotypes. This makes it particularly suitable for personalized 
genome editing as well as population-scale analyses. The workflow, from variant-aware 
preprocessing to gRNA discovery, is fully automated, generating ranked candidate 
gRNAs, annotated target sequences, and publication-ready visualizations.

Thanks to its modular architecture, CRISPR-HAWK can be seamlessly integrated with 
downstream tools such as CRISPRme or CRISPRitz for comprehensive off-target prediction 
and follow-up analysis of prioritized guides.

This notebook generates the figures presented in |||**add paper-link**|||.

In [None]:
# define static variables
CRISPRHAWKDIR = "crisprhawk-data"
RESULTSDIR = os.path.join(CRISPRHAWKDIR, "results_nsamples")

## Results Analysis and Visualization

In this section, we analyze and visualize the impact of genetic variation on gRNA 
design and activity across a curated set of clinically and experimentally relevant 
CRISPR targets. The analysis focuses on how population-level and individual-specific 
variants affect both gRNA sequence composition and expected on-target efficiency, 
highlighting differences between reference-designed guides and their alternative, 
haplotype-derived counterparts.

For each target region, we first quantify how genetic variation alters the landscape 
of candidate gRNAs. Specifically, we classify retrieved guides into four categories 
and summarize them using pie charts:
- gRNAs matching the reference sequence
- gRNAs with variants in the spacer region
- gRNAs with variants affecting only the PAM
- gRNAs with variants in both the spacer and the PAM

This provides an immediate overview of how frequently genetic variants modify 
targetable sequences across different loci. 

Next, we assess the functional consequences of these sequence differences. Using 
dot plots, we compare:
- The predicted on-target efficiency of reference gRNAs versus their alternative 
  versions found on variant-defined haplotypes
- The residual on-target activity of reference gRNAs when applied to alternative 
  haplotypes carrying sequence mismatches

These analyses capture both gain and loss of activity induced by genetic variation 
and enable a fine-grained comparison between reference and variant-aware gRNA designs.

All analyses are performed independently for the following target regions:
- BCL11A +58 Erythroid enhancer
- EMX1
- CCR5 (two independent target sites)
- TRBC1
- TRBC2
- FANCF
- HBB (two independent target sites)
- HBG1 and HBG2 (Cas9)
- HBG1 (Cpf1/Cas12a)

For Cpf1-based targets, residual on-target activity is not evaluated, as the 
analysis is specific to Cas9-mediated spacer–PAM interactions.

The analyses integrate variation from 1000 Genomes, HGDP, and gnomAD datasets. 
In particular, for the sg1617 guide targeting the BCL11A erythroid enhancer, we 
perform an in-depth follow-up analysis: for each alternative gRNA sequence generated 
by gnomAD variants, we run CRISPRme (using 1000G + HGDP genetic variants) to 
evaluate guide specificity genome-wide. This allows us to compare how genetic 
variation simultaneously affects on-target activity and off-target risk, 
providing a comprehensive assessment of guide performance in a population-aware 
context.

In the following cell, we load the candidate gRNAs that CRISPR-HAWK found on the 
BCL11A +58 erythroid enhancer. Each guide has been annotated with the number of 
samples in each dataset that carry that gRNA alternative.
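
Before plotting, it can help to confirm that a report exposes the columns used later in this notebook. The sketch below is only an illustration: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, the column list is collected from the cells of this notebook rather than from the CRISPR-HAWK documentation, and the TSV content is toy data.

```python
import io

import pandas as pd

# Columns referenced by later cells of this notebook (assumed, not an official schema)
REQUIRED_COLUMNS = {
    "chr", "start", "stop", "strand", "sgRNA_sequence", "pam",
    "origin", "variant_id",
}


def missing_columns(df):
    """Return the required columns absent from a report, sorted by name."""
    return sorted(REQUIRED_COLUMNS - set(df.columns))


# Toy TSV standing in for a report; real reports contain more columns
toy_tsv = (
    "chr\tstart\tstop\tstrand\tsgRNA_sequence\tpam\n"
    "chrN\t100\t123\t+\tACGTACGTACGTACGTACGT\tAGG\n"
)
toy_report = pd.read_csv(io.StringIO(toy_tsv), sep="\t")
print(missing_columns(toy_report))
```

An empty list would mean that every column in the assumed set is present.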

In [None]:
# load BCL11A reports
report_fname = "crisprhawk_guides__chr2_60495215_60496479_NGG_20_nsamples.tsv"

bcl11a_1000G = pd.read_csv(
    os.path.join(RESULTSDIR, f"CAS9/1000G/{report_fname}"), sep="\t"
)
bcl11a_HGDP = pd.read_csv(
    os.path.join(RESULTSDIR, f"CAS9/HGDP/{report_fname}"), sep="\t"
)
bcl11a_GNOMAD = pd.read_csv(
    os.path.join(RESULTSDIR, f"CAS9/GNOMAD/{report_fname}"), sep="\t"
)

In the following cell, we load the candidate gRNAs that CRISPR-HAWK found on the
HBG1 region. Each guide has been annotated with the number of samples in each
dataset that carry that gRNA alternative.

In [None]:
# load HBG1 reports
report_fname = "crisprhawk_guides__chr11_5248950_5250050_TTTV_23_nsamples.tsv"

hbg1_1000G = pd.read_csv(
    os.path.join(RESULTSDIR, f"CPF1/1000G/{report_fname}"), sep="\t"
)
hbg1_HGDP = pd.read_csv(os.path.join(RESULTSDIR, f"CPF1/HGDP/{report_fname}"), sep="\t")
hbg1_GNOMAD = pd.read_csv(
    os.path.join(RESULTSDIR, f"CPF1/GNOMAD/{report_fname}"), sep="\t"
)

In the following cell, we load the candidate gRNAs that CRISPR-HAWK found on the
HBG2 region. Each guide has been annotated with the number of samples in each
dataset that carry that gRNA alternative.

In [None]:
# load HBG2 reports
report_fname = "crisprhawk_guides__chr11_5253874_5255874_TTTV_23_nsamples.tsv"

hbg2_1000G = pd.read_csv(
    os.path.join(RESULTSDIR, f"CPF1/1000G/{report_fname}"), sep="\t"
)
hbg2_HGDP = pd.read_csv(os.path.join(RESULTSDIR, f"CPF1/HGDP/{report_fname}"), sep="\t")
hbg2_GNOMAD = pd.read_csv(
    os.path.join(RESULTSDIR, f"CPF1/GNOMAD/{report_fname}"), sep="\t"
)

## Types of Retrieved Guide Candidates

Here we classify all retrieved guide candidates according to their relationship 
with genetic variation. Each guide is assigned to one of the following categories:

- Reference Guides: guides fully supported by the reference genome and unaffected 
    by observed variants.

- PAM and Spacer Alternative Guides: guides that arise from genetic variants 
    altering both the protospacer and the PAM sequence.

- PAM Alternative Guides: guides that arise from genetic variants altering 
    only the PAM sequence.

- Spacer Alternative Guides: guides that arise from genetic variants altering 
    only the protospacer sequence.

The pie charts below summarize the relative proportion of each guide type among 
all retrieved candidates. This visualization provides an intuitive overview of 
how much of the guide space would be missed or mischaracterized by a reference-only 
approach and highlights the contribution of variant- and haplotype-aware guide 
discovery.
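
As a rough illustration of these categories, the sketch below classifies a handful of toy guides. It assumes that lowercase letters mark variant positions in the spacer and PAM (as the `has_lowercase` helper in the next cell suggests); the sequences and the `has_variant` name are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy guides (hypothetical sequences); lowercase letters are assumed to flag variants
toy = pd.DataFrame(
    {
        "origin": ["ref", "alt", "alt", "alt"],
        "sgRNA_sequence": [
            "ACGTACGTACGTACGTACGT",
            "ACGTACGTaCGTACGTACGT",
            "ACGTACGTACGTACGTACGT",
            "ACGTACGTACGTACGTACgT",
        ],
        "pam": ["AGG", "AGG", "AgG", "aGG"],
    }
)


def has_variant(seq):
    """True when the sequence contains at least one lowercase character."""
    return any(c.islower() for c in str(seq))


spacer_var = toy["sgRNA_sequence"].map(has_variant)
pam_var = toy["pam"].map(has_variant)
is_alt = toy["origin"] != "ref"

# np.select keeps the first matching condition, so the combined case is tested first
toy["category"] = np.select(
    [~is_alt, is_alt & spacer_var & pam_var, is_alt & pam_var, is_alt & spacer_var],
    [
        "Reference Guides",
        "PAM and Spacer Alternative Guides",
        "PAM Alternative Guides",
        "Spacer Alternative Guides",
    ],
    default="Unclassified",
)
print(toy[["origin", "pam", "category"]])
```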

In [None]:
LABELS = [
    "Reference Guides",
    "Spacer Alternative Guides",
    "PAM Alternative Guides",
    "PAM and Spacer Alternative Guides",
]


def compute_guide_id_piechart(chrom, start, stop, strand, sgRNA_sequence, pam):
    return f"{chrom}_{start}_{stop}_{strand}_{sgRNA_sequence}_{pam}"


def has_lowercase(s):
    return any(c.islower() for c in str(s))


def retrieve_pie_chart_data(df):
    df["guide_id"] = df.apply(
        lambda x: compute_guide_id_piechart(
            x["chr"], x["start"], x["stop"], x["strand"], x["sgRNA_sequence"], x["pam"]
        ),
        axis=1,
    )
    df["category"] = "Unclassified"  # default

    # remove duplicate rows (same site and spacer+pam, different haplotype and scores)
    df = df.drop_duplicates(subset="guide_id")

    # category 1: Reference Guides
    df.loc[df["origin"] == "ref", "category"] = "Reference Guides"

    # category 2: PAM and Spacer Alternative Guides
    mask_non_ref = df["origin"] != "ref"
    condition2 = (
        mask_non_ref
        & df["sgRNA_sequence"].apply(has_lowercase)
        & df["pam"].apply(has_lowercase)
    )
    df.loc[condition2, "category"] = "PAM and Spacer Alternative Guides"

    # category 3: PAM Alternative Guides
    condition3 = (
        mask_non_ref
        & df["pam"].apply(has_lowercase)
        & ~df["sgRNA_sequence"].apply(has_lowercase)
    )
    df.loc[condition3, "category"] = "PAM Alternative Guides"

    # Category 4: Spacer Alternative Guides
    condition4 = (
        mask_non_ref
        & df["sgRNA_sequence"].apply(has_lowercase)
        & ~df["pam"].apply(has_lowercase)
    )
    df.loc[condition4, "category"] = "Spacer Alternative Guides"

    unclassified_count = (df["category"] == "Unclassified").sum()
    assert unclassified_count == 0, f"{unclassified_count} guides remain unclassified!"

    return [df["category"].value_counts().get(cat, 0) for cat in LABELS]


def piechart(ax, data, title):
    # colors for each category
    colors = ["#aaa1cdff", "#92c1deff", "#ef86bdff", "#b8d8c8"]
    explode = (0, 0, 0.05, 0.1)

    wedges, texts, autotexts = ax.pie(
        data,
        explode=explode,
        colors=colors,
        autopct="%1.2f%%",
        shadow=False,
        startangle=140,
        textprops={"fontsize": 16},
        pctdistance=1.1,
    )

    ax.set_title(title, fontsize=20)
    ax.axis("equal")

With the pie chart functions defined, we display the proportion of each gRNA 
category in the three datasets for the BCL11A enhancer region.

In [None]:
datasets = ["1000G", "HGDP", "GNOMAD"]
data_report_map = {"1000G": bcl11a_1000G, "HGDP": bcl11a_HGDP, "GNOMAD": bcl11a_GNOMAD}

f, axes = plt.subplots(1, 3, figsize=(30, 10))

for ax, dataset in zip(axes, datasets):
    title = f"BCL11A +58 Erythroid Enhancer - {dataset}"

    # retrieve dataset-specific dataframe
    pie_data = retrieve_pie_chart_data(data_report_map[dataset])
    piechart(ax, pie_data, title)

# single shared legend
f.legend(LABELS, loc="lower center", ncol=4, fontsize=16, frameon=False)

plt.tight_layout()
plt.show()

We repeat the analysis for HBG1.

In [None]:
datasets = ["1000G", "HGDP", "GNOMAD"]
data_report_map = {"1000G": hbg1_1000G, "HGDP": hbg1_HGDP, "GNOMAD": hbg1_GNOMAD}

f, axes = plt.subplots(1, 3, figsize=(30, 10))

for ax, dataset in zip(axes, datasets):
    title = f"HBG1 - {dataset}"

    # retrieve dataset-specific dataframe
    pie_data = retrieve_pie_chart_data(data_report_map[dataset])
    piechart(ax, pie_data, title)

# single shared legend
f.legend(LABELS, loc="lower center", ncol=4, fontsize=16, frameon=False)

plt.tight_layout()
plt.show()

We repeat the analysis for HBG2.

In [None]:
datasets = ["1000G", "HGDP", "GNOMAD"]
data_report_map = {"1000G": hbg2_1000G, "HGDP": hbg2_HGDP, "GNOMAD": hbg2_GNOMAD}

f, axes = plt.subplots(1, 3, figsize=(30, 10))

for ax, dataset in zip(axes, datasets):
    title = f"HBG2 - {dataset}"

    # retrieve dataset-specific dataframe
    pie_data = retrieve_pie_chart_data(data_report_map[dataset])
    piechart(ax, pie_data, title)

# single shared legend
f.legend(LABELS, loc="lower center", ncol=4, fontsize=16, frameon=False)

plt.tight_layout()
plt.show()

## Guides On-Target Efficiency and Variant Effect on Alternative On-Targets 

Here we assess the predicted on-target efficiency of the retrieved guide candidates
and investigate how genetic variation influences the presence and scoring of 
alternative on-targets.

For each guide, CRISPR-HAWK computes an on-target efficiency score based on its 
spacer and PAM sequence. When variants alter either the spacer, the PAM, or both, 
alternative guide configurations may emerge at the same genomic locus. These 
alternatives can differ substantially in predicted efficiency compared to the 
reference guide.

The following analyses focus on:

- Comparing the efficiency of reference guides against their alternatives

- Assessing whether genetic variants may modulate the expected likelihood that
    a guide designed on the reference works properly on haplotype-specific
    on-target sites
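
The first comparison boils down to a per-site score difference between each alternative and its reference guide. A minimal sketch on toy data, assuming a generic `score` column and made-up guide identifiers (the real reports use columns such as `score_azimuth`):

```python
import pandas as pd

# Toy scores: one reference row and its alternatives for each guide site
toy = pd.DataFrame(
    {
        "guide_id": ["g1", "g1", "g1", "g2", "g2"],
        "origin": ["ref", "alt", "alt", "ref", "alt"],
        "score": [0.60, 0.45, 0.70, 0.50, 0.50],
    }
)

# Reference score of each site, broadcast to every row of that site
ref_by_site = toy.loc[toy["origin"] == "ref"].set_index("guide_id")["score"]
toy["delta"] = toy["score"] - toy["guide_id"].map(ref_by_site)
toy["abs_delta"] = toy["delta"].abs()

# Largest absolute change per site, used here to rank the sites
worst = (
    toy[toy["origin"] == "alt"]
    .groupby("guide_id")["abs_delta"]
    .max()
    .sort_values(ascending=False)
)
print(worst)
```

Sites whose alternatives deviate most from the reference come first, in either direction.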


Define global constants and the identifiers of the special guides

In [None]:
SG1617 = "chr2_60495261_-"
HBG1 = "chr11_5249951_+"
HBG2 = "chr11_5254875_+"
SAMPLE_COL = "n_samples"


def compute_guide_id(chrom, start, strand):
    return f"{chrom}_{start}_{strand}"

Calculate score differences between reference and alternative guides

In [None]:
def calculate_deltas(df, score_col):
    df = df.copy()
    use_abs = score_col != "score_cfdon"

    def add_deltas(group):
        ref = group[group["origin"] == "ref"]
        if ref.empty:
            return group

        ref_score = ref[score_col].iloc[0]
        group["delta"] = group[score_col] - ref_score
        if use_abs:
            group["abs_delta"] = group["delta"].abs()
        return group

    df["delta"] = 0.0
    if use_abs:
        df["abs_delta"] = 0.0

    return df.groupby("guide_id", group_keys=False).apply(add_deltas)

Prepare top-ranked guides based on variant impact

In [None]:
def prepare_data_by_delta(df, score_col, top_n, dataset_type):
    df = df.copy()
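    # positional lookup: columns 0, 1, and 6 are expected to hold chromosome,
    # start, and strand (.iloc avoids deprecated integer indexing on labeled rows)
    df["guide_id"] = df.apply(
        lambda x: compute_guide_id(x.iloc[0], x.iloc[1], x.iloc[6]), axis=1
    )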
    df = calculate_deltas(df, score_col)

    # Group guides and extract ref/alt information
    guide_data = {}
    for guide_id, group in df.groupby("guide_id", sort=False):
        ref = group[group["origin"] == "ref"]
        if ref.empty:
            continue

        ref_row = ref.iloc[0]
        alts = group[group["origin"] == "alt"]

        if score_col == "score_cfdon":
            alts = alts[alts[score_col] < ref_row[score_col]]

        alt_entries = [
            {
                "alt_sgRNA": alt["sgRNA_sequence"],
                "pam": alt["pam"],
                "alt_score": alt[score_col],
                "delta": alt["delta"],
                "abs_delta": alt.get("abs_delta", abs(alt["delta"])),
                SAMPLE_COL: alt[SAMPLE_COL],
                "variant_id": alt["variant_id"],
            }
            for _, alt in alts.iterrows()
        ]

        guide_data[guide_id] = {
            "ref_sgRNA": ref_row["sgRNA_sequence"],
            "pam": ref_row["pam"],
            "ref_score": ref_row[score_col],
            "ref_samples": ref_row[SAMPLE_COL],
            "alts": alt_entries,
        }

    # Rank guides by worst variant effect
    rankings = []
    for gid, data in guide_data.items():
        if not data["alts"]:
            worst = 0.0
        elif score_col == "score_cfdon":
            worst = min(a["delta"] for a in data["alts"])
        else:
            worst = max(a["abs_delta"] for a in data["alts"])
        rankings.append((gid, worst))

    worst_df = (
        pd.DataFrame(rankings, columns=["guide_id", "delta"])
        .sort_values("delta", ascending=(score_col == "score_cfdon"))
        .reset_index(drop=True)
    )

    # Include special guide if present
    if SG1617 in worst_df["guide_id"].values:
        special = worst_df[worst_df["guide_id"] == SG1617]
        others = worst_df[worst_df["guide_id"] != SG1617].head(top_n - 1)
        final_guides = pd.concat([special, others], ignore_index=True)
    elif HBG1 in worst_df["guide_id"].values:
        special = worst_df[worst_df["guide_id"] == HBG1]
        others = worst_df[worst_df["guide_id"] != HBG1].head(top_n - 1)
        final_guides = pd.concat([special, others], ignore_index=True)
    elif HBG2 in worst_df["guide_id"].values:
        special = worst_df[worst_df["guide_id"] == HBG2]
        others = worst_df[worst_df["guide_id"] != HBG2].head(top_n - 1)
        final_guides = pd.concat([special, others], ignore_index=True)
    else:
        final_guides = worst_df.head(top_n)

    final_guides["Rank"] = final_guides.index + 1

    # Build wide-format dataframe
    max_alts = max(len(guide_data[gid]["alts"]) for gid in final_guides["guide_id"])
    rows = []

    for _, row in final_guides.iterrows():
        gid = row["guide_id"]
        data = guide_data[gid]

        out = {
            "guide_id": gid,
            "Rank": row["Rank"],
            "ref_sgRNA": data["ref_sgRNA"],
            "pam": data["pam"],
            "ref_score": data["ref_score"],
            "ref_n_samples": data["ref_samples"],
        }

        for i, alt in enumerate(data["alts"]):
            prefix = f"alt{i+1}_"
            out.update(
                {
                    f"{prefix}sgRNA": alt["alt_sgRNA"],
                    f"{prefix}pam": alt["pam"],
                    f"{prefix}score": alt["alt_score"],
                    f"{prefix}delta": alt["delta"],
                    f"{prefix}abs_delta": alt["abs_delta"],
                    f"{prefix}{SAMPLE_COL}": alt[SAMPLE_COL],
                    f"{prefix}variant_id": alt["variant_id"],
                }
            )

        # Fill missing alt columns with NaN
        for i in range(len(data["alts"]), max_alts):
            prefix = f"alt{i+1}_"
            for k in [
                "sgRNA",
                "pam",
                "score",
                "delta",
                "abs_delta",
                SAMPLE_COL,
                "variant_id",
            ]:
                out[f"{prefix}{k}"] = np.nan

        rows.append(out)

    return pd.DataFrame(rows)

Utilities for plot sizing and styling
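
The size helper in the next cell maps a sample count to one of six marker sizes. As a rough illustration of that binning, the sketch below assigns counts to size classes with `np.searchsorted`. The thresholds mirror the non-gnomAD values used in this notebook; `size_class` and `SIZE_LABELS` are hypothetical names introduced only for this example.

```python
import numpy as np

# Upper bounds of each size class (non-gnomAD thresholds from this notebook)
THRESHOLDS = [1, 20, 50, 100, 200, np.inf]
SIZE_LABELS = ["1", "2-20", "21-50", "51-100", "101-200", ">200"]


def size_class(n_samples):
    """Return the label of the first class whose upper bound is >= n_samples."""
    idx = int(np.searchsorted(THRESHOLDS, n_samples, side="left"))
    return SIZE_LABELS[idx]


for n in (1, 20, 21, 150, 1000):
    print(n, size_class(n))
```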

In [None]:
def get_legend_size(n_samples, dataset_type):
    if dataset_type == "GNOMAD":
        thresholds = [1, 20, 50, 150, 500, np.inf]
    else:
        thresholds = [1, 20, 50, 100, 200, np.inf]

    bases = [1, 10, 35, 75, 150, 300]
    for limit, base in zip(thresholds, bases):
        if n_samples <= limit:
            return 150 * np.sqrt(base)

Create a dot plot showing variant effects across guides

In [None]:
def dotplot_delta(df, score_col, top_n, dataset_type):
    top_df = df.sort_values("Rank").head(top_n).copy()
    top_df["rank_chr_start"] = top_df.apply(
        lambda row: f"Rank {row['Rank']}, {row['guide_id'].split('_')[0]}:{row['guide_id'].split('_')[1]}",
        axis=1,
    )

    plt.figure(figsize=(22, 11))

    # Generate color palette for variants
    alt_cols = [
        c for c in df.columns if c.startswith("alt") and c.endswith(f"_{SAMPLE_COL}")
    ]
    variant_keys = list(
        {
            row["rank_chr_start"]
            for _, row in top_df.iterrows()
            for col in alt_cols
            if pd.notna(row[col]) and row[col] != "REF"
        }
    )

    base_cmaps = [
        "Purples",
        "Blues",
        "Greens",
        "Oranges",
        "Reds",
        "PuRd",
        "RdPu",
        "BuPu",
        "GnBu",
        "PuBuGn",
        "BuGn",
        "Spectral",
        "coolwarm",
    ]
    palette = [sns.color_palette(cmap, 9)[7] for cmap in base_cmaps]

    if len(variant_keys) > len(palette):
        palette += [
            sns.color_palette(cmap, 9)[i] for i in range(6, 9) for cmap in base_cmaps
        ][: len(variant_keys) - len(palette)]

    random.shuffle(palette)
    variant_colors = dict(zip(variant_keys, palette))

    # Plot alternative alleles
    for _, row in top_df.iterrows():
        for i in range(1, 1000):
            score_col_alt = f"alt{i}_score"
            samples_col_alt = f"alt{i}_{SAMPLE_COL}"

            if score_col_alt not in row or pd.isna(row[score_col_alt]):
                break

            n_samples = (
                int(row[samples_col_alt]) if pd.notna(row[samples_col_alt]) else 1
            )
            plt.scatter(
                row["Rank"],
                row[score_col_alt],
                color=variant_colors.get(row["rank_chr_start"], "gray"),
                s=get_legend_size(n_samples, dataset_type),
                alpha=0.6,
                edgecolors="white",
                linewidth=0.5,
                marker="D" if n_samples == 1 else "o",
                zorder=3,
            )

    # Plot reference alleles
    if score_col == "score_cfdon":
        plt.axhline(y=1, color="gray", linestyle="-", alpha=0.6, linewidth=3, zorder=5)
        ref_legend = plt.Line2D(
            [0], [0], color="gray", linewidth=3, label="Reference", alpha=0.6
        )
    else:
        plt.scatter(
            top_df["Rank"],
            top_df["ref_score"],
            color="black",
            s=600,
            zorder=5,
            edgecolors="white",
            linewidth=0.7,
            label="Reference",
            alpha=0.6,
        )
        ref_legend = plt.Line2D(
            [0],
            [0],
            marker="o",
            color="w",
            label="Reference",
            markerfacecolor="dimgray",
            alpha=0.6,
            markersize=np.sqrt(600),
            markeredgecolor="white",
            linewidth=0.7,
        )

    # Configure axes and labels
    ax = plt.gca()
    ax.set_xticks(top_df["Rank"])
    ax.set_xticklabels(
        [row["ref_sgRNA"] for _, row in top_df.iterrows()],
        rotation=45,
        ha="right",
        fontsize=15,
    )

    # Bold special guides
    for label, gid in zip(ax.get_xticklabels(), top_df["guide_id"].values):
        if gid in [SG1617, HBG1, HBG2]:
            label.set_weight("bold")

    plt.yticks(fontsize=15)
    plt.xlabel("Guide", fontsize=18)

    ylabel = (
        "Variant Effect (CFD)"
        if score_col == "score_cfdon"
        else f'On-Target Efficiency ({score_col.split("_")[1].upper()})'
    )
    title = (
        "Variant Effect on Alternative On-Targets"
        if score_col == "score_cfdon"
        else "Guides On-Target Efficiency"
    )

    plt.ylabel(ylabel, fontsize=19)
    plt.title(title, fontsize=28)

    # Create legend
    legend_labels = ["1", "2–20", "21–50", "51–100", "101–200", ">200"]
    sample_counts = [1, 10, 35, 75, 150, 300]
    markers = ["D"] + ["o"] * 5
    scaled_sizes = [150 * np.sqrt(n) for n in sample_counts]

    size_legend_handles = [
        plt.Line2D(
            [0],
            [0],
            marker=m,
            color="w",
            label=l,
            markerfacecolor="gray",
            markersize=np.sqrt(s),
            alpha=0.6,
        )
        for l, s, m in zip(legend_labels, scaled_sizes, markers)
    ]
    size_legend_handles.insert(0, ref_legend)

    legend = plt.legend(
        handles=size_legend_handles,
        title="Dot Size = #Samples",
        frameon=False,
        bbox_to_anchor=(0.5, -0.55),
        loc="lower center",
        ncol=7,
        fontsize=11,
        handletextpad=1,
        columnspacing=5,
        labelspacing=2,
    )
    legend.get_title().set_fontsize(24)

    plt.ylim((-10, 110) if score_col == "score_deepcpf1" else (-0.05, 1.05))
    plt.grid(True, alpha=0.3, linestyle="--")
    sns.despine()
    plt.subplots_adjust(bottom=0.25)
    plt.show()

With the functions for plotting variant impact on guide efficiency and on-target
activity defined, we focus again on the BCL11A +58 erythroid enhancer. For these
analyses, we combine the candidate guides obtained using variants from the 1000G,
HGDP, and gnomAD datasets.

In [None]:
# combine data from different datasets
for df in [bcl11a_1000G, bcl11a_HGDP, bcl11a_GNOMAD]:
    if "n_samples" not in df.columns:
        df["n_samples"] = df["n_ref"] if "n_ref" in df.columns else np.nan

bcl11a_ALL = pd.concat(
    [
        bcl11a_1000G.assign(dataset="1000G"),
        bcl11a_HGDP.assign(dataset="HGDP"),
        bcl11a_GNOMAD.assign(dataset="GNOMAD"),
    ],
    ignore_index=True,
)

dotplot_delta(
    prepare_data_by_delta(bcl11a_ALL, "score_azimuth", 25, "ALL"),
    "score_azimuth",
    25,
    "ALL",
)
dotplot_delta(
    prepare_data_by_delta(bcl11a_ALL, "score_cfdon", 25, "ALL"),
    "score_cfdon",
    25,
    "ALL",
)

We further analyze the impact of genetic variants on gRNAs predicted on-target 
efficiency by performing a dataset-wise breakdown:

In [None]:
dotplot_delta(
    prepare_data_by_delta(bcl11a_1000G, "score_azimuth", 25, "1000G"),
    "score_azimuth",
    25,
    "1000G",
)
dotplot_delta(
    prepare_data_by_delta(bcl11a_HGDP, "score_azimuth", 25, "HGDP"),
    "score_azimuth",
    25,
    "HGDP",
)
dotplot_delta(
    prepare_data_by_delta(bcl11a_GNOMAD, "score_azimuth", 25, "GNOMAD"),
    "score_azimuth",
    25,
    "GNOMAD",
)

We further analyze the impact of genetic variants on gRNAs residual on-target 
activity by performing a dataset-wise breakdown:

In [None]:
dotplot_delta(
    prepare_data_by_delta(bcl11a_1000G, "score_cfdon", 25, "1000G"),
    "score_cfdon",
    25,
    "1000G",
)
dotplot_delta(
    prepare_data_by_delta(bcl11a_HGDP, "score_cfdon", 25, "HGDP"),
    "score_cfdon",
    25,
    "HGDP",
)
dotplot_delta(
    prepare_data_by_delta(bcl11a_GNOMAD, "score_cfdon", 25, "GNOMAD"),
    "score_cfdon",
    25,
    "GNOMAD",
)

We repeat the aggregated plot generation for HBG1.

In [None]:
# combine data from different datasets
for df in [hbg1_1000G, hbg1_HGDP, hbg1_GNOMAD]:
    if "n_samples" not in df.columns:
        df["n_samples"] = df["n_ref"] if "n_ref" in df.columns else np.nan

hbg1_ALL = pd.concat(
    [
        hbg1_1000G.assign(dataset="1000G"),
        hbg1_HGDP.assign(dataset="HGDP"),
        hbg1_GNOMAD.assign(dataset="GNOMAD"),
    ],
    ignore_index=True,
)

dotplot_delta(
    prepare_data_by_delta(hbg1_ALL, "score_deepcpf1", 25, "ALL"),
    "score_deepcpf1",
    25,
    "ALL",
)

We repeat the aggregated plot generation for HBG2.

In [None]:
# combine data from different datasets
for df in [hbg2_1000G, hbg2_HGDP, hbg2_GNOMAD]:
    if "n_samples" not in df.columns:
        df["n_samples"] = df["n_ref"] if "n_ref" in df.columns else np.nan

hbg2_ALL = pd.concat(
    [
        hbg2_1000G.assign(dataset="1000G"),
        hbg2_HGDP.assign(dataset="HGDP"),
        hbg2_GNOMAD.assign(dataset="GNOMAD"),
    ],
    ignore_index=True,
)

dotplot_delta(
    prepare_data_by_delta(hbg2_ALL, "score_deepcpf1", 25, "ALL"),
    "score_deepcpf1",
    25,
    "ALL",
)

## Residual on-target activity vs guide specificity

In this section we explore the relationship between residual on-target activity 
and guide specificity for the sg1617 reference guide and all variant-containing 
guide sequences derived from the same genomic site.

Residual on-target activity measures how strongly a variant-derived guide is 
expected to cut at its (potentially altered) on-target site compared to the 
reference guide. Guide specificity instead reflects the likelihood that a guide 
may cut unintended off-target sites in the genome: higher values indicate better 
predicted specificity.

To perform this analysis:

- We collected reference and variant-derived guide sequences associated with 
sg1617.

- For each sequence, CRISPR-HAWK computed its predicted on-target activity and 
derived the residual value with respect to the reference guide.

- We retrieved guide specificity scores from the CRISPRme website reports.

- Off-target nomination was performed with CRISPRme v2.1.7, using:
    - NGG PAM
    - 1000 Genomes Project + HGDP variant datasets
    - up to 6 mismatches
    - up to 2 DNA/RNA bulges

The resulting scatter plot visualizes each guide configuration as a point in the 
activity–specificity space. 

In [None]:
def CFD_ONvsOFF(df):
    alt_colors = ["#1f77b4", "#2ca02c", "#ff7f0e", "#c212b0"]
    color_map = {"REF": "grey"}

    alt_ids = [v for v in df["variant_id"].unique() if v != "REF"]
    for i, alt_id in enumerate(alt_ids):
        color_map[alt_id] = alt_colors[i % len(alt_colors)]

    for gene in df["gene"].unique():
        subset = df[(df["gene"] == gene)]

        fig, ax = plt.subplots(figsize=(6, 6))

        x = np.linspace(-0.1, 1.1, 256)
        y = np.linspace(-0.1, 1.1, 256)
        X, Y = np.meshgrid(x, y)
        diagonal_distance = (X + Y) / 2.2
        colors = ["#B30000", "#FF4444", "#FFAAAA", "white"]
        n_bins = 256
        cmap = LinearSegmentedColormap.from_list("custom_gradient", colors, N=n_bins)
        im = plt.imshow(
            diagonal_distance,
            extent=[-0.4, 1.1, -0.4, 1.1],
            origin="lower",
            cmap=cmap,
            aspect="auto",
            alpha=0.3,
            zorder=0,
        )

        variant_handles = {}
        for _, row in subset.iterrows():
            marker = "D" if row["n_samples"] == 1 else "o"
            color = color_map[row["variant_id"]]
            size = 150 * np.sqrt(row["n_samples"])
            plt.scatter(
                row["off_target"],
                row["on_target"],
                color=color,
                s=size,
                alpha=0.6,
                marker=marker,
                edgecolors="white",
                linewidth=0.5,
                zorder=2,
            )
            if row["variant_id"] not in variant_handles:
                variant_marker = (
                    "D"
                    if (
                        subset[
                            (subset["variant_id"] == row["variant_id"])
                            & (subset["n_samples"] == 1)
                        ].shape[0]
                        > 0
                    )
                    else "o"
                )
                handle = Line2D(
                    [0],
                    [0],
                    marker=variant_marker,
                    color="w",
                    markerfacecolor=color,
                    markeredgecolor="white",
                    markersize=8,
                    linewidth=0,
                    label=(
                        "Reference" if row["variant_id"] == "REF" else row["variant_id"]
                    ),
                    alpha=0.8,
                )
                variant_handles[row["variant_id"]] = handle

        plt.xlabel("Guide specificity", fontsize=10)
        plt.ylabel("Residual on-target activity", fontsize=10)
        plt.xticks(fontsize=10)
        plt.yticks(fontsize=10)
        plt.xlim(-0.05, 1.1)
        plt.ylim(-0.05, 1.1)
        plt.grid(True, alpha=0.3, linestyle="--", zorder=1, color="black")
        plt.gca().set_axisbelow(True)

        cbar = fig.colorbar(im, ax=ax, fraction=0.025, pad=0.08, shrink=0.5)
        cbar.set_label(
            "Guide Penalty\n(Low → High)", rotation=270, labelpad=25, fontsize=8
        )
        cbar.set_ticks([])
        plt.gca().set_aspect("equal", adjustable="datalim")

        legend_labels = ["1 sample", "2-20 samples", ">20 samples"]
        sample_counts = [1, 10, 35]  # representative sample count for each size class
        markers = ["D", "o", "o"]  # marker for each size class
        scaled_sizes = [150 * np.sqrt(n) for n in sample_counts]  # same scaling as the scatter markers

        size_handles = [
            Line2D(
                [0],
                [0],
                marker=marker,
                color="w",
                label=label,
                markerfacecolor="gray",  # same gray for every size class
                markersize=np.sqrt(size),
                alpha=0.6,
            )
            for i, (label, size, marker) in enumerate(
                zip(legend_labels, scaled_sizes, markers)
            )
        ]

        l1 = plt.legend(
            handles=list(variant_handles.values()),
            fontsize=8,
            title_fontsize=9,
            loc="lower right",
            frameon=False,
            bbox_to_anchor=(1.2, 0),
        )
        plt.gca().add_artist(l1)
        plt.legend(
            handles=size_handles,
            frameon=False,
            labelspacing=1.2,
            ncol=len(size_handles),
            loc="lower center",
            bbox_to_anchor=(0.5, -0.25),
        )
        sns.despine()
        plt.show()

With the plotting function defined, we generate the plot.

In [None]:
# gRNA specificities retrieved from the CRISPRme website; residual on-target
# activity retrieved from CRISPR-HAWK reports (score_cfdon column)
sg1617_combined = pd.DataFrame(
    [
        ("sg1617", "ref", 1.0, 0.63, "REF", 30),
        ("sg1617", "alt1", 0, 0.38, "chr2-60495268-T/G", 3),
        ("sg1617", "alt2", 0.33, 0.466, "chr2-60495273-G/A", 1),
        ("sg1617", "alt3", 0, 0.307, "chr2-60495268-T/G,chr2-60495273-G/A", 4),
        ("sg1617", "alt4", 0.9, 0.656, "chr2-60495283-G/C", 1),
    ],
    columns=["gene", "type", "on_target", "off_target", "variant_id", "n_samples"],
)

CFD_ONvsOFF(sg1617_combined)

## Residual on-target activity of therapeutic and benchmark gRNAs across variant-containing target sites

In this section, we evaluate how genetic variation influences the predicted on-target activity of both therapeutic and benchmark gRNAs.

To quantify the effect of sequence variation on cleavage efficiency, we computed 
CFD scores for each reference-designed gRNA against all variant-containing 
target sequences overlapping its intended on-target site. Each point in the plot 
corresponds to a distinct haplotype or allele configuration carrying one or more 
variants. Point size reflects the number of individuals harboring that sequence, 
thereby linking functional impact with population frequency.
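
The size encoding described above can be sketched in isolation. The plotting cells in this notebook draw each point with a scatter area of `150 * sqrt(n_samples)` and use a diamond for sequences observed in a single sample. The helper name `marker_style` is hypothetical and is not part of CRISPR-HAWK; it only isolates that mapping so it can be inspected on its own.

```python
import numpy as np


def marker_style(n_samples, base=150):
    """Return (marker, scatter_area, legend_markersize) for a sample count.

    Hypothetical helper: mirrors the scaling used by the plotting cells,
    where the scatter area grows with the square root of the sample count.
    """
    # missing sample counts are treated as a single sample
    n = 1 if n_samples is None or np.isnan(n_samples) else n_samples
    marker = "D" if n == 1 else "o"  # singletons are drawn as diamonds
    area = base * np.sqrt(n)  # value passed to plt.scatter(s=...), in points^2
    legend_size = np.sqrt(area)  # Line2D markersize is a length in points
    return marker, float(area), float(legend_size)


for n in (1, 10, 35, 300):
    print(n, marker_style(n))
```

Because the area scales with the square root of the count, a haplotype shared by 300 samples is drawn larger than a singleton without dominating the panel.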

In [None]:
# define therapeutic guides data
th_guides = {
    "CCR5_1": (
        "crisprhawk_guides__chr3_46372162_46374162_NGG_20_nsamples.tsv",
        46373163,
    ),
    "CCR5_2": (
        "crisprhawk_guides__chr3_46372138_46374138_NGG_20_nsamples.tsv",
        46373139,
    ),
    "TRBC1": (
        "crisprhawk_guides__chr7_142791004_142793004_NGG_20_nsamples.tsv",
        142792004,
    ),
    "TRBC2": (
        "crisprhawk_guides__chr7_142800351_142802350_NGG_20_nsamples.tsv",
        142801351,
    ),
    "HBB_1": ("crisprhawk_guides__chr11_5225967_5227967_NGG_20_nsamples.tsv", 5226968),
    "HBB_2": ("crisprhawk_guides__chr11_5225803_5227803_NGG_20_nsamples.tsv", 5226804),
    "FANCF": (
        "crisprhawk_guides__chr11_22624785_22626785_NGG_20_nsamples.tsv",
        22625786,
    ),
    "HBG1": ("crisprhawk_guides__chr11_5248955_5250955_NGG_20_nsamples.tsv", 5249956),
    "EMX1": ("crisprhawk_guides__chr2_72932853_72934853_NGG_20_nsamples.tsv", 72933853),
    "HBG2": ("crisprhawk_guides__chr11_5253879_5255879_NGG_20_nsamples.tsv", 5254880),
    "BCL11A": (
        "crisprhawk_guides__chr2_60495215_60496479_NGG_20_nsamples.tsv",
        60495261,
    ),
}

The next cell defines the functions that process the reports and generate the plot.

In [None]:
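def preprocess_report(report_fname, pos, target):
    # load the report and keep only the guides starting at the requested position
    df = pd.read_csv(report_fname, sep="\t").sort_values(by=["start", "origin"])
    # copy the slice so that adding a column does not write to a view of df
    df_th = df[df["start"] == pos].copy()
    df_th["label"] = f"{target}:{pos}"
    return df_th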


def TG_Dotplot_CFDOn(df):
    plt.figure(figsize=(28, 9))
    df = df.sort_values(by="label", ascending=True).reset_index(drop=True)
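    df = df[df["origin"] != "ref"].copy()  # plot alternative haplotypes only

    labels = df["label"].unique()
    # plt.get_cmap replaces cm.get_cmap, which recent matplotlib releases removed
    cmap = plt.get_cmap("tab20", len(labels))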
    label_colors = {label: cmap(i) for i, label in enumerate(labels)}

    df["n_samples"] = df["n_samples"].fillna(1)
    df["dot_size"] = df["n_samples"].apply(lambda x: 150 * np.sqrt(x))

    n_labels = len(labels)
    x = np.linspace(-0.5, n_labels - 0.5, 256)
    y = np.linspace(-0.1, 1.1, 256)
    X, Y = np.meshgrid(x, y)

    vertical_distance = (Y - (-0.1)) / (1.1 - (-0.1))

    colors = ["#B30000", "#FF4444", "#FFAAAA", "white"]
    gradient_cmap = LinearSegmentedColormap.from_list("custom_gradient", colors, N=256)

    im = plt.imshow(
        vertical_distance,
        extent=[-0.5, n_labels - 0.5, -0.1, 1.1],
        origin="lower",
        cmap=gradient_cmap,
        aspect="auto",
        alpha=0.3,
        zorder=0,
    )

    for i, row in df.iterrows():
        marker = "D" if (row["n_samples"] == 1) else "o"
        alpha_val = 0.6
        color_val = label_colors[row["label"]]
        x_pos = np.where(labels == row["label"])[0][0]

        plt.scatter(
            x_pos,
            row["score_cfdon"],
            alpha=alpha_val,
            c=[color_val],
            s=row["dot_size"],
            marker=marker,
            edgecolors="white",
            linewidth=0.5,
            zorder=2,
        )

    legend_labels = [
        "1 sample",
        "2-20 samples",
        "21-50 samples",
        "51-100 samples",
        "101-200 samples",
        ">200 samples",
    ]
    sample_counts = [1, 10, 35, 75, 150, 300]
    markers = ["D", "o", "o", "o", "o", "o"]
    scaled_sizes = [150 * np.sqrt(n) for n in sample_counts]
    ref_handle = Line2D(
        [0], [0], color="gray", linestyle="-", linewidth=3, label="Reference", alpha=0.6
    )
    size_handles = [
        Line2D(
            [0],
            [0],
            marker=marker,
            color="w",
            label=label,
            markerfacecolor="gray",
            markersize=np.sqrt(size),
            alpha=0.6,
        )
        for label, size, marker in zip(legend_labels, scaled_sizes, markers)
    ]
    all_handles = [ref_handle] + size_handles
    plt.legend(
        handles=all_handles,
        frameon=False,
        labelspacing=1.5,
        ncol=len(all_handles),
        loc="lower center",
        bbox_to_anchor=(0.5, -0.45),
        prop={"size": 14},
    )

    plt.title(
        "Residual activity of the on-target reference on alternative haplotypes of therapeutic guides",
        fontsize=18,
    )
    plt.xlabel("Guide:Start", fontsize=15)
    plt.ylabel("Residual on-target activity", fontsize=15)

    ax = plt.gca()
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    plt.xticks(range(n_labels), labels, fontsize=15, rotation=45, ha="right")
    plt.yticks(fontsize=15)
    plt.ylim(-0.1, 1.1)
    plt.xlim(-0.5, n_labels - 0.5)
    plt.grid(True, alpha=0.3, linestyle="--", zorder=1, color="black")
    plt.axhline(y=1, color="gray", linestyle="-", linewidth=3, alpha=0.6, zorder=1)

    cbar = plt.colorbar(im, ax=ax, fraction=0.015, pad=0.02, shrink=0.6)
    cbar.set_label(
        "Guide Penalty\n(Low → High)", rotation=270, labelpad=25, fontsize=12
    )
    cbar.set_ticks([])

    plt.show()
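
To illustrate what the report preprocessing selects, the following sketch applies the same position filter to a small in-memory table. The rows and scores are made-up placeholders, not CRISPR-HAWK output; only the column names (`start`, `origin`, `score_cfdon`, `n_samples`) follow those used in this notebook.

```python
import io

import pandas as pd

# Hypothetical three-row report; values are placeholders for illustration only
toy_tsv = (
    "start\torigin\tscore_cfdon\tn_samples\n"
    "100\tref\t1.0\t\n"
    "100\talt\t0.5\t12\n"
    "250\talt\t0.9\t3\n"
)
report = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# Keep only the guides starting at the position of interest and tag them
pos, target = 100, "TOY"
selected = report[report["start"] == pos].copy()
selected["label"] = f"{target}:{pos}"

# Drop the reference row and treat missing sample counts as 1, as the plot does
alternatives = selected[selected["origin"] != "ref"].copy()
alternatives["n_samples"] = alternatives["n_samples"].fillna(1)
print(alternatives)
```

Only the alternative row at the requested start position survives, which is the subset the dot plot draws for each guide.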

The next cell loads the reports for the three datasets (1000G, HGDP, and gnomAD) and generates the plot.

In [None]:
th_reports_1000G = pd.concat(
    [
        preprocess_report(
            os.path.join(RESULTSDIR, "CAS9/1000G", report_fname), pos, target
        )
        for target, (report_fname, pos) in th_guides.items()
    ],
    ignore_index=True,
)
th_reports_1000G["dataset"] = "1000G"
th_reports_HGDP = pd.concat(
    [
        preprocess_report(
            os.path.join(RESULTSDIR, "CAS9/HGDP", report_fname), pos, target
        )
        for target, (report_fname, pos) in th_guides.items()
    ],
    ignore_index=True,
)
th_reports_HGDP["dataset"] = "HGDP"
th_reports_GNOMAD = pd.concat(
    [
        preprocess_report(
            os.path.join(RESULTSDIR, "CAS9/GNOMAD", report_fname), pos, target
        )
        for target, (report_fname, pos) in th_guides.items()
    ],
    ignore_index=True,
)
th_reports_GNOMAD["dataset"] = "GNOMAD"
th_report = pd.concat(
    [th_reports_1000G, th_reports_HGDP, th_reports_GNOMAD], ignore_index=True
)
th_report["sgRNA_sequence"] = th_report["sgRNA_sequence"].str.upper()
th_report["label"] = th_report["label"].replace({"BCL11A:60495261": "sg1617"})
all_labels = th_report["label"].unique().tolist()
ordered_labels = ["sg1617"] + sorted([lab for lab in all_labels if lab != "sg1617"])
th_report["label"] = pd.Categorical(
    th_report["label"], categories=ordered_labels, ordered=True
)
combined_all = th_report.sort_values(by="label").reset_index(drop=True)
TG_Dotplot_CFDOn(combined_all)