# Phase 5 – Publication-Ready Figures & Tables (BRCA2 South Asian Bias Project)

This notebook generates **publication-ready figures and tables** for:

- Manuscripts
- Posters
- Regeneron / ISEF submissions

**Inputs:**

- `data/processed/brca2_merged.csv`
- `results/tables/table1_acmg_reclassifications.csv` (Phase 3)
- (Optionally) tables generated in Phase 4

**Outputs (this notebook):**

Tables:
- `results/tables/table5_variant_counts_by_significance.csv`
- `results/tables/table6_reclassification_counts.csv`
- `results/tables/supplementary_variants.csv`

Figures:
- `results/figures/figure1_population_af_scatter_sas_vs_eur.png`
- `results/figures/figure2_frequency_thresholds.png` (generated in Phase 3; only its path is documented here)
- `results/figures/figure6_sas_eur_ratio_violin.png`
- `results/figures/figure7_reclassification_barplot.png`


## 0. Environment setup


In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print("✅ Imports done. pandas version:", pd.__version__)


✅ Imports done. pandas version: 2.2.2


In [2]:
# If you are running in Google Colab, mount your Drive.
try:
    from google.colab import drive  # type: ignore
    drive.mount("/content/drive")
    print("✅ Google Drive mounted.")
except ModuleNotFoundError:
    print("ℹ️ Not running in Colab; skipping Drive mount.")


Mounted at /content/drive
✅ Google Drive mounted.


In [3]:
PROJECT_ROOT = "/content/drive/MyDrive/BRCA2-database-bias"

DATA_PROCESSED = os.path.join(PROJECT_ROOT, "data", "processed")
RESULTS_TABLES = os.path.join(PROJECT_ROOT, "results", "tables")
RESULTS_FIGURES = os.path.join(PROJECT_ROOT, "results", "figures")

os.makedirs(RESULTS_TABLES, exist_ok=True)
os.makedirs(RESULTS_FIGURES, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_PROCESSED:", DATA_PROCESSED)
print("RESULTS_TABLES:", RESULTS_TABLES)
print("RESULTS_FIGURES:", RESULTS_FIGURES)


PROJECT_ROOT: /content/drive/MyDrive/BRCA2-database-bias
DATA_PROCESSED: /content/drive/MyDrive/BRCA2-database-bias/data/processed
RESULTS_TABLES: /content/drive/MyDrive/BRCA2-database-bias/results/tables
RESULTS_FIGURES: /content/drive/MyDrive/BRCA2-database-bias/results/figures
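
The cell above hard-codes the Colab Drive path, so the same paths are used even when the Drive mount is skipped. A minimal sketch of a fallback is shown below. Both `resolve_project_root` and the `BRCA2_PROJECT_ROOT` environment variable are hypothetical names introduced for illustration; neither is part of the original notebook.

```python
import os

def resolve_project_root(
    colab_root="/content/drive/MyDrive/BRCA2-database-bias",
    env_var="BRCA2_PROJECT_ROOT",  # hypothetical override variable
):
    """Pick a project root: explicit override, then the Colab path, then the working directory."""
    override = os.environ.get(env_var)
    if override:
        return override
    if os.path.isdir(colab_root):
        return colab_root
    return os.getcwd()

print("Resolved project root:", resolve_project_root())
```

Outside Colab, this resolves to the current working directory unless the override variable is set.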


## 1. Load core datasets

We load:

- The full merged BRCA2 dataset (`brca2_merged.csv`)
- The ACMG-based reclassification table (`table1_acmg_reclassifications.csv`)


In [4]:
merged_path = os.path.join(DATA_PROCESSED, "brca2_merged.csv")
table1_path = os.path.join(RESULTS_TABLES, "table1_acmg_reclassifications.csv")

df = pd.read_csv(merged_path)
table1 = pd.read_csv(table1_path)

print("✅ Loaded datasets:")
print("  merged:", df.shape)
print("  Table 1:", table1.shape)

df.head()


✅ Loaded datasets:
  merged: (20614, 21)
  Table 1: (1, 10)


Unnamed: 0,Chromosome,Start,ReferenceAllele,AlternateAllele,ClinicalSignificance,ReviewStatus,variant_key,variant_id,chrom,pos,...,alt,consequence,sas_af,eur_af,afr_af,eas_af,amr_af,log10_sas_af,log10_eur_af,sas_eur_ratio
0,13,32315212,G,A,"('Benign',)","('criteria_provided', '_single_submitter')",13-32315212-G-A,13-32315212-G-A,13.0,32315212.0,...,A,intron_variant,0.0,1.5e-05,0.002792,0.0,0.000262,-12.0,-4.832522,6.8002e-08
1,13,32315226,G,A,"('Benign',)","('reviewed_by_expert_panel',)",13-32315226-G-A,13-32315226-G-A,13.0,32315226.0,...,A,intron_variant,0.25,0.177842,0.5,0.149612,0.201544,-0.60206,-0.749965,1.405741
2,13,32315300,G,A,"('Benign',)","('criteria_provided', '_single_submitter')",13-32315300-G-A,13-32315300-G-A,13.0,32315300.0,...,A,intron_variant,0.000415,0.005453,0.00065,0.000193,0.001961,-3.382377,-2.263352,0.07602821
3,13,32315355,A,G,"('Uncertain_significance',)","('criteria_provided', '_single_submitter')",13-32315355-A-G,13-32315355-A-G,13.0,32315355.0,...,G,5_prime_UTR_variant,0.0,0.0,0.0,0.000965,0.0,-12.0,-12.0,1.0
4,13,32315355,ATGCCTGACAAGGAATTTCCTTTCGCCACACTGAGAAATACCCGCA...,A,"('Pathogenic',)","('no_assertion_criteria_provided',)",13-32315355-ATGCCTGACAAGGAATTTCCTTTCGCCACACTGA...,,,,...,,,0.0,0.0,0.0,0.0,0.0,-12.0,-12.0,1.0


In [5]:
# Ensure key columns exist
pop_cols = ["sas_af", "eur_af", "afr_af", "eas_af", "amr_af"]
for col in pop_cols:
    if col not in df.columns:
        df[col] = np.nan

if "sas_eur_ratio" not in df.columns:
    df["sas_eur_ratio"] = (df["sas_af"].fillna(0) + 1e-12) / (df["eur_af"].fillna(0) + 1e-12)

if "ClinicalSignificance" not in df.columns:
    df["ClinicalSignificance"] = "Unknown"

df[["ClinicalSignificance"] + pop_cols].head()


Unnamed: 0,ClinicalSignificance,sas_af,eur_af,afr_af,eas_af,amr_af
0,"('Benign',)",0.0,1.5e-05,0.002792,0.0,0.000262
1,"('Benign',)",0.25,0.177842,0.5,0.149612,0.201544
2,"('Benign',)",0.000415,0.005453,0.00065,0.000193,0.001961
3,"('Uncertain_significance',)",0.0,0.0,0.0,0.000965,0.0
4,"('Pathogenic',)",0.0,0.0,0.0,0.0,0.0
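
The `ClinicalSignificance` and `ReviewStatus` values shown above are stringified Python tuples such as `('Benign',)`, and they would appear in that form in any table written from this DataFrame. The sketch below shows one possible way to turn them into plain labels. `clean_label` is a hypothetical helper, and joining multi-part tuples with a comma assumes they came from splitting the ClinVar string on commas, which this notebook does not confirm.

```python
import ast
import pandas as pd

def clean_label(value):
    """Convert "('Benign',)" to "Benign"; leave anything that is not a tuple string unchanged."""
    if not isinstance(value, str):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # not a Python literal, e.g. the plain string "Unknown"
    if isinstance(parsed, tuple):
        # Assumption: tuple parts were produced by splitting on commas
        return ",".join(str(part) for part in parsed).replace("_", " ")
    return value

demo = pd.Series(["('Benign',)", "('criteria_provided', '_single_submitter')", "Unknown"])
print(demo.map(clean_label).tolist())
# ['Benign', 'criteria provided, single submitter', 'Unknown']
```

Applying it to a copy of the column (for example `df["ClinicalSignificance"].map(clean_label)`) keeps the original values available for joins against earlier phases.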


## 2. Figure 1 – Population AF scatter (SAS vs EUR)

This figure compares South Asian (SAS) and European (EUR) allele frequencies for each BRCA2 variant. Points away from the diagonal are variants whose frequency differs between the two populations.

- x-axis: EUR AF (log10)
- y-axis: SAS AF (log10)
- diagonal line: y = x (no difference)

Output: `results/figures/figure1_population_af_scatter_sas_vs_eur.png`


In [6]:
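# Prepare scatter data; the 1e-12 pseudocount keeps log10 defined when AF is 0 or missing
sas = df["sas_af"].fillna(0) + 1e-12
eur = df["eur_af"].fillna(0) + 1e-12

x = np.log10(eur)
y = np.log10(sas)

plt.figure(figsize=(6, 6))
plt.scatter(x, y, s=6, alpha=0.4)
plt.plot([-6, 0], [-6, 0], linestyle="--")  # y = x line
# Axes stop at log10 = -6, so variants with AF = 0 (placed at log10 = -12) fall outside the view
plt.xlim(-6, 0)
plt.ylim(-6, 0)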
plt.xlabel("log10(EUR AF)")
plt.ylabel("log10(SAS AF)")
plt.title("Figure 1: South Asian vs European allele frequencies (BRCA2 variants)")

fig1_path = os.path.join(RESULTS_FIGURES, "figure1_population_af_scatter_sas_vs_eur.png")
plt.savefig(fig1_path, dpi=200, bbox_inches="tight")
plt.close()

print("✅ Saved Figure 1 to:", fig1_path)


✅ Saved Figure 1 to: /content/drive/MyDrive/BRCA2-database-bias/results/figures/figure1_population_af_scatter_sas_vs_eur.png
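
Because the axes in the cell above are limited to log10 values between -6 and 0, any variant with an allele frequency of 0 in either population (placed at -12 by the pseudocount) is not drawn. A figure legend may need to report how many points are affected. The sketch below counts them on toy values copied from the first rows of the preview in Section 1; `count_hidden_points` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def count_hidden_points(sas_af, eur_af, floor=-6.0, pseudocount=1e-12):
    """Count variants whose log10 AF is below the axis floor on either axis."""
    x = np.log10(pd.Series(eur_af, dtype=float).fillna(0) + pseudocount)
    y = np.log10(pd.Series(sas_af, dtype=float).fillna(0) + pseudocount)
    return int(((x < floor) | (y < floor)).sum())

# Toy values: the first four rows of the preview table
sas_demo = [0.0, 0.25, 0.000415, 0.0]
eur_demo = [1.5e-05, 0.177842, 0.005453, 0.0]
print(count_hidden_points(sas_demo, eur_demo))  # 2
```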


## 3. Figure 2 – Frequency thresholds (from Phase 3)

Figure 2 (`figure2_frequency_thresholds.png`) was generated in Phase 3.

Here we just **document the path** so it's easy to reference in papers/posters:

- `results/figures/figure2_frequency_thresholds.png`


In [7]:
fig2_path = os.path.join(RESULTS_FIGURES, "figure2_frequency_thresholds.png")
print("Figure 2 should exist at:", fig2_path)


Figure 2 should exist at: /content/drive/MyDrive/BRCA2-database-bias/results/figures/figure2_frequency_thresholds.png
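
The cell above only prints the expected path. A short check along the following lines would confirm that the Phase 3 file is actually present. `describe_figure` is a hypothetical helper, and the relative path in the example is a placeholder for `fig2_path`.

```python
import os

def describe_figure(path):
    """Return a short status string for a figure file."""
    if os.path.isfile(path):
        size_kb = os.path.getsize(path) / 1024
        return f"found ({size_kb:.1f} KB)"
    return "missing"

example_path = os.path.join("results", "figures", "figure2_frequency_thresholds.png")
print("Figure 2:", describe_figure(example_path))
```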


## 4. Figure 6 – SAS/EUR ratio distribution

We visualize the **distribution of SAS/EUR ratios** as a histogram of log10(SAS AF / EUR AF).

Output: `results/figures/figure6_sas_eur_ratio_violin.png` (the filename says `violin`, but the plot is a histogram)


In [8]:
ratio = df["sas_eur_ratio"].replace([np.inf, -np.inf], np.nan).dropna()

if len(ratio) > 0:
    plt.figure(figsize=(6, 4))
    plt.hist(np.log10(ratio), bins=40)
    plt.xlabel("log10(SAS AF / EUR AF)")
    plt.ylabel("Count")
    plt.title("Figure 6: Distribution of SAS/EUR allele frequency ratios (BRCA2 variants)")

    fig6_path = os.path.join(RESULTS_FIGURES, "figure6_sas_eur_ratio_violin.png")
    plt.savefig(fig6_path, dpi=200, bbox_inches="tight")
    plt.close()

    print("✅ Saved Figure 6 to:", fig6_path)
else:
    print("⚠️ No valid SAS/EUR ratios found; Figure 6 was not created.")


✅ Saved Figure 6 to: /content/drive/MyDrive/BRCA2-database-bias/results/figures/figure6_sas_eur_ratio_violin.png
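
In the preview in Section 1, variants with an allele frequency of 0 in both populations have `sas_eur_ratio` equal to 1.0, so they all land at log10 = 0 in this histogram. One option, sketched below under that observation, is to restrict the distribution to variants seen in at least one of the two populations. `informative_log_ratio` is a hypothetical helper, and whether to exclude these variants is an analysis choice the original notebook does not make.

```python
import numpy as np
import pandas as pd

def informative_log_ratio(sas_af, eur_af, pseudocount=1e-12):
    """log10(SAS/EUR) for variants observed in at least one of the two populations."""
    sas = pd.Series(sas_af, dtype=float).fillna(0)
    eur = pd.Series(eur_af, dtype=float).fillna(0)
    observed = (sas > 0) | (eur > 0)
    ratio = (sas[observed] + pseudocount) / (eur[observed] + pseudocount)
    return np.log10(ratio)

# Third variant is absent in both populations and is dropped
demo = informative_log_ratio([0.0, 0.25, 0.0], [1.5e-05, 0.177842, 0.0])
print(len(demo))  # 2
```

The returned Series can be passed to `plt.hist` in place of `np.log10(ratio)`.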


## 5. Figure 7 – Reclassification bar plot

We create a bar plot showing **how many variants** were reclassified into:

- Benign (BA1)
- Likely benign (BS1)

stratified by **original ClinVar clinical significance**.

Output: `results/figures/figure7_reclassification_barplot.png`
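
The stratification described above amounts to a two-way count table. As a minimal sketch, `pd.crosstab` builds that table in one step. The rows and label strings below are made up for illustration and are not project results.

```python
import pandas as pd

# Toy data with hypothetical labels
toy = pd.DataFrame({
    "ClinicalSignificance": ["Uncertain significance", "Uncertain significance", "Likely benign"],
    "ProposedReclassification": ["Benign (BA1)", "Likely benign (BS1)", "Benign (BA1)"],
})

# Rows: original category; columns: proposed category; cells: number of variants
counts = pd.crosstab(toy["ClinicalSignificance"], toy["ProposedReclassification"])
print(counts)
```

Combinations that never occur are filled with 0, so no separate `fillna(0)` step is needed.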


In [9]:
if "ProposedReclassification" not in table1.columns:
    print("⚠️ ProposedReclassification not found in Table 1; cannot create reclassification bar plot.")
else:
    counts = (
        table1.groupby(["ClinicalSignificance", "ProposedReclassification"])
        .size()
        .reset_index(name="count")
    )

    # Pivot for easier plotting
    pivot = counts.pivot(
        index="ClinicalSignificance",
        columns="ProposedReclassification",
        values="count"
    ).fillna(0)

    # Create the axes explicitly and pass them to pandas. Without ax=..., DataFrame.plot
    # opens its own figure, the 8x5 figure stays empty, and the requested size is ignored.
    fig, ax = plt.subplots(figsize=(8, 5))
    pivot.plot(kind="bar", stacked=False, ax=ax)
    ax.set_ylabel("Number of variants")
    ax.set_title("Figure 7: ACMG frequency-based reclassifications by original ClinVar category")
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
    fig.tight_layout()

    fig7_path = os.path.join(RESULTS_FIGURES, "figure7_reclassification_barplot.png")
    fig.savefig(fig7_path, dpi=200, bbox_inches="tight")
    plt.close(fig)

    print("✅ Saved Figure 7 to:", fig7_path)


✅ Saved Figure 7 to: /content/drive/MyDrive/BRCA2-database-bias/results/figures/figure7_reclassification_barplot.png


## 6. Tables 5 & 6 – Counts for manuscript

We now generate:

- **Table 5**: variant counts by `ClinicalSignificance`
- **Table 6**: reclassification counts (original vs proposed)


In [10]:
# Table 5: variant counts by ClinicalSignificance
table5 = (
    df["ClinicalSignificance"]
    .value_counts()
    .rename_axis("ClinicalSignificance")
    .reset_index(name="count")
)

table5_path = os.path.join(RESULTS_TABLES, "table5_variant_counts_by_significance.csv")
table5.to_csv(table5_path, index=False)
print("✅ Saved Table 5 to:", table5_path)

table5.head()


✅ Saved Table 5 to: /content/drive/MyDrive/BRCA2-database-bias/results/tables/table5_variant_counts_by_significance.csv


Unnamed: 0,ClinicalSignificance,count
0,"('Conflicting_classifications_of_pathogenicity',)",5629
1,"('Pathogenic',)",4741
2,"('Likely_benign',)",4691
3,"('Uncertain_significance',)",3958
4,"('Benign',)",778
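
Table 5 as saved holds raw counts only. If a percentage column is wanted for the manuscript, it can be derived from the same `value_counts` result, as in the sketch below. `counts_with_percent` is a hypothetical helper and the labels are toy values.

```python
import pandas as pd

def counts_with_percent(series):
    """Count categories and add each category's share of the total, in percent."""
    out = (
        series.value_counts(dropna=False)
        .rename_axis("category")
        .reset_index(name="count")
    )
    out["percent"] = (100 * out["count"] / out["count"].sum()).round(1)
    return out

toy_labels = pd.Series(["A", "A", "B", "C"])
print(counts_with_percent(toy_labels))
```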


In [11]:
# Table 6: reclassification counts (original vs proposed)
if "ProposedReclassification" not in table1.columns:
    print("⚠️ ProposedReclassification not found in Table 1; cannot create Table 6.")
else:
    table6 = (
        table1.groupby(["ClinicalSignificance", "ProposedReclassification"])
        .size()
        .reset_index(name="count")
    )

    table6_path = os.path.join(RESULTS_TABLES, "table6_reclassification_counts.csv")
    table6.to_csv(table6_path, index=False)
    print("✅ Saved Table 6 to:", table6_path)

    # A bare expression inside an if/else block is not auto-displayed by Jupyter, so print it
    print(table6.head())


✅ Saved Table 6 to: /content/drive/MyDrive/BRCA2-database-bias/results/tables/table6_reclassification_counts.csv


## 7. Supplementary variants table

We now build a **clean supplementary table** with key fields suitable for:

- Manuscript supplementary files
- Data sharing for Regeneron / ISEF

Output: `results/tables/supplementary_variants.csv`


In [12]:
# Merge Table 1 back onto full DF when possible, based on variant_key
if "variant_key" in df.columns and "variant_key" in table1.columns:
    sup = pd.merge(
        df,
        table1[["variant_key", "ProposedReclassification"]],
        on="variant_key",
        how="left",
        suffixes=("", "_phase3")
    )
else:
    print("⚠️ variant_key not found in both df and table1; using df only for supplementary table.")
    sup = df.copy()
    if "ProposedReclassification" not in sup.columns:
        sup["ProposedReclassification"] = np.nan

# Choose a subset of columns (only those that actually exist)
candidate_cols = [
    "variant_key",
    "HGVS_coding",
    "HGVS_protein",
    "ClinicalSignificance",
    "sas_af",
    "eur_af",
    "afr_af",
    "eas_af",
    "amr_af",
    "sas_eur_ratio",
    "ReviewStatus",
    "ProposedReclassification",
]

sup_cols = [c for c in candidate_cols if c in sup.columns]
supplementary = sup[sup_cols].copy()

supp_path = os.path.join(RESULTS_TABLES, "supplementary_variants.csv")
supplementary.to_csv(supp_path, index=False)
print("✅ Saved supplementary variants table to:", supp_path)

supplementary.head()


✅ Saved supplementary variants table to: /content/drive/MyDrive/BRCA2-database-bias/results/tables/supplementary_variants.csv


Unnamed: 0,variant_key,ClinicalSignificance,sas_af,eur_af,afr_af,eas_af,amr_af,sas_eur_ratio,ReviewStatus,ProposedReclassification
0,13-32315212-G-A,"('Benign',)",0.0,1.5e-05,0.002792,0.0,0.000262,6.8002e-08,"('criteria_provided', '_single_submitter')",
1,13-32315226-G-A,"('Benign',)",0.25,0.177842,0.5,0.149612,0.201544,1.405741,"('reviewed_by_expert_panel',)",
2,13-32315300-G-A,"('Benign',)",0.000415,0.005453,0.00065,0.000193,0.001961,0.07602821,"('criteria_provided', '_single_submitter')",
3,13-32315355-A-G,"('Uncertain_significance',)",0.0,0.0,0.0,0.000965,0.0,1.0,"('criteria_provided', '_single_submitter')",
4,13-32315355-ATGCCTGACAAGGAATTTCCTTTCGCCACACTGA...,"('Pathogenic',)",0.0,0.0,0.0,0.0,0.0,1.0,"('no_assertion_criteria_provided',)",
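
A left merge on `variant_key` duplicates rows from the left table whenever the right table contains the same key more than once. The sketch below uses the `validate` and `indicator` parameters of `pd.merge` to guard against that and to report how many rows found a match. The frames and the `example_label` value are toy data, not project results.

```python
import pandas as pd

left = pd.DataFrame({"variant_key": ["k1", "k2", "k3"], "sas_af": [0.1, 0.0, 0.02]})
right = pd.DataFrame({"variant_key": ["k2"], "ProposedReclassification": ["example_label"]})

# validate="m:1" raises pandas.errors.MergeError if the right-hand keys are not unique;
# indicator=True adds a "_merge" column recording where each row was found
merged = pd.merge(left, right, on="variant_key", how="left", validate="m:1", indicator=True)

n_matched = int((merged["_merge"] == "both").sum())
print(f"{n_matched} of {len(merged)} rows matched a reclassification")
```

The `_merge` column would need to be dropped before writing the supplementary CSV.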


## 8. Summary

In this Phase 5 notebook you created:

**Figures**
- Figure 1 – SAS vs EUR AF scatter (`figure1_population_af_scatter_sas_vs_eur.png`)
- Figure 2 – Frequency thresholds (from Phase 3, documented here)
- Figure 6 – SAS/EUR ratio distribution (`figure6_sas_eur_ratio_violin.png`)
- Figure 7 – Reclassification bar plot (`figure7_reclassification_barplot.png`)

**Tables**
- Table 5 – Variant counts by `ClinicalSignificance`
- Table 6 – Reclassification counts (original vs proposed)
- Supplementary variants table (`supplementary_variants.csv`)

These files can be used directly in manuscripts, posters, and competition submissions.
