# Phase 3 – ACMG Frequency-Based Reclassification (BRCA2 South Asian Bias Project)

This notebook performs **ACMG frequency-based reclassification** on the merged BRCA2 dataset produced in Phase 1.

- **Input**:  
  `/content/drive/MyDrive/BRCA2-database-bias/data/processed/brca2_merged.csv`
- **Output tables**:  
  - `brca2_reclass_candidates.csv` (all candidates)  
  - `table1_acmg_reclassifications.csv` (clean table for manuscript / poster)
- **Output figure**:  
  - `figure2_frequency_thresholds.png` (ACMG BA1/BS1 thresholds)

The goal is to:

1. Load the merged dataset.
2. Apply **ACMG BA1 / BS1** rules using **South Asian (SAS) allele frequency**.
3. Identify VUS / Pathogenic variants that are **too common in SAS** to be truly pathogenic.
4. Save a clean candidate list + a scatter plot for Figure 2.


## 0. Environment setup

Run this section once per session.  
If you're using **Google Colab**, this will:

1. Import Python packages.
2. Mount Google Drive.
3. Set the project folder and output paths.


In [1]:
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print("✅ Imports done. pandas version:", pd.__version__)


✅ Imports done. pandas version: 2.2.2


In [2]:
# If you are running in Google Colab, mount your Drive.
# If you're running locally (VS Code / Jupyter), you can skip this cell.

try:
    from google.colab import drive  # type: ignore
    drive.mount("/content/drive")
    print("✅ Google Drive mounted.")
except ModuleNotFoundError:
    print("ℹ️ Not running in Colab; skipping Drive mount.")


Mounted at /content/drive
✅ Google Drive mounted.


In [3]:
# Set your project root in Google Drive
PROJECT_ROOT = "/content/drive/MyDrive/BRCA2-database-bias"

DATA_PROCESSED = os.path.join(PROJECT_ROOT, "data", "processed")
RESULTS_TABLES = os.path.join(PROJECT_ROOT, "results", "tables")
RESULTS_FIGURES = os.path.join(PROJECT_ROOT, "results", "figures")

os.makedirs(RESULTS_TABLES, exist_ok=True)
os.makedirs(RESULTS_FIGURES, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_PROCESSED:", DATA_PROCESSED)
print("RESULTS_TABLES:", RESULTS_TABLES)
print("RESULTS_FIGURES:", RESULTS_FIGURES)


PROJECT_ROOT: /content/drive/MyDrive/BRCA2-database-bias
DATA_PROCESSED: /content/drive/MyDrive/BRCA2-database-bias/data/processed
RESULTS_TABLES: /content/drive/MyDrive/BRCA2-database-bias/results/tables
RESULTS_FIGURES: /content/drive/MyDrive/BRCA2-database-bias/results/figures


## 1. Load the merged BRCA2 dataset

We now load the **Phase 1 master table** that already integrates:

- ClinVar labels (Pathogenic / Benign / VUS)
- gnomAD population allele frequencies (EUR, SAS, AFR, EAS, AMR)
- Engineered features from previous phases

Expected file:

`/content/drive/MyDrive/BRCA2-database-bias/data/processed/brca2_merged.csv`


In [4]:
merged_path = os.path.join(DATA_PROCESSED, "brca2_merged.csv")

df = pd.read_csv(merged_path)
print("✅ Loaded merged dataset from:", merged_path)
print("Shape (rows, columns):", df.shape)

df.head()


✅ Loaded merged dataset from: /content/drive/MyDrive/BRCA2-database-bias/data/processed/brca2_merged.csv
Shape (rows, columns): (20614, 21)


Unnamed: 0,Chromosome,Start,ReferenceAllele,AlternateAllele,ClinicalSignificance,ReviewStatus,variant_key,variant_id,chrom,pos,...,alt,consequence,sas_af,eur_af,afr_af,eas_af,amr_af,log10_sas_af,log10_eur_af,sas_eur_ratio
0,13,32315212,G,A,"('Benign',)","('criteria_provided', '_single_submitter')",13-32315212-G-A,13-32315212-G-A,13.0,32315212.0,...,A,intron_variant,0.0,1.5e-05,0.002792,0.0,0.000262,-12.0,-4.832522,6.8002e-08
1,13,32315226,G,A,"('Benign',)","('reviewed_by_expert_panel',)",13-32315226-G-A,13-32315226-G-A,13.0,32315226.0,...,A,intron_variant,0.25,0.177842,0.5,0.149612,0.201544,-0.60206,-0.749965,1.405741
2,13,32315300,G,A,"('Benign',)","('criteria_provided', '_single_submitter')",13-32315300-G-A,13-32315300-G-A,13.0,32315300.0,...,A,intron_variant,0.000415,0.005453,0.00065,0.000193,0.001961,-3.382377,-2.263352,0.07602821
3,13,32315355,A,G,"('Uncertain_significance',)","('criteria_provided', '_single_submitter')",13-32315355-A-G,13-32315355-A-G,13.0,32315355.0,...,G,5_prime_UTR_variant,0.0,0.0,0.0,0.000965,0.0,-12.0,-12.0,1.0
4,13,32315355,ATGCCTGACAAGGAATTTCCTTTCGCCACACTGAGAAATACCCGCA...,A,"('Pathogenic',)","('no_assertion_criteria_provided',)",13-32315355-ATGCCTGACAAGGAATTTCCTTTCGCCACACTGA...,,,,...,,,0.0,0.0,0.0,0.0,0.0,-12.0,-12.0,1.0
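
The `ClinicalSignificance` and `ReviewStatus` columns above hold stringified tuples such as `"('Benign',)"`. The sketch below shows one way to turn them into plain labels. The helper name `clean_clinvar_field` and the comma separator are assumptions made for illustration; they are not part of the Phase 1 pipeline.

```python
import ast

import pandas as pd


def clean_clinvar_field(value, sep=","):
    """Turn a stringified tuple such as "('Benign',)" into "Benign".

    Missing values and strings that do not parse are returned unchanged.
    (Hypothetical helper, not part of the original notebook.)
    """
    if not isinstance(value, str):
        return value  # NaN / None pass through untouched
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # leave anything unexpected as-is
    if isinstance(parsed, (tuple, list)):
        return sep.join(str(part) for part in parsed)
    return str(parsed)


demo = pd.Series([
    "('Benign',)",
    "('criteria_provided', '_single_submitter')",
    None,
])
cleaned = demo.map(clean_clinvar_field)
print(cleaned.tolist())
```

Applying it to the real table would look like `df["ClinicalSignificance"].map(clean_clinvar_field)`; the later substring checks in this notebook work on either form.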


In [7]:
# Quick sanity checks on key columns we will need in this notebook

required_cols = [
    "variant_key",
    "ClinicalSignificance",
    "sas_af",
    "eur_af",
    "HGVS_coding",
    "HGVS_protein",
    "ReviewStatus",
    "DateLastUpdated",
]

# Record which columns are missing BEFORE creating them, so the message is accurate
missing = [c for c in required_cols if c not in df.columns]

# Create missing columns as empty (NaN) placeholders
for c in missing:
    df[c] = np.nan

if missing:
    print("⚠️ Added missing columns (empty):", missing)
else:
    print("✅ All required columns present.")

df[required_cols].head()

⚠️ Added missing columns (empty): ['HGVS_coding', 'HGVS_protein', 'DateLastUpdated']


Unnamed: 0,variant_key,ClinicalSignificance,sas_af,eur_af,HGVS_coding,HGVS_protein,ReviewStatus,DateLastUpdated
0,13-32315212-G-A,"('Benign',)",0.0,1.5e-05,,,"('criteria_provided', '_single_submitter')",
1,13-32315226-G-A,"('Benign',)",0.25,0.177842,,,"('reviewed_by_expert_panel',)",
2,13-32315300-G-A,"('Benign',)",0.000415,0.005453,,,"('criteria_provided', '_single_submitter')",
3,13-32315355-A-G,"('Uncertain_significance',)",0.0,0.0,,,"('criteria_provided', '_single_submitter')",
4,13-32315355-ATGCCTGACAAGGAATTTCCTTTCGCCACACTGA...,"('Pathogenic',)",0.0,0.0,,,"('no_assertion_criteria_provided',)",
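
Several of the required columns in the preview above contain only `NaN`. A quick way to see how much of each column is actually populated is to compute the non-null fraction per column. This is a minimal sketch; `column_coverage` and the toy frame are hypothetical names and data used only for illustration.

```python
import numpy as np
import pandas as pd


def column_coverage(frame, columns):
    """Return the fraction of non-null values per column.

    Columns that are absent from the frame are reported as 0.0.
    (Hypothetical helper, not part of the original notebook.)
    """
    return pd.Series(
        {c: frame[c].notna().mean() if c in frame.columns else 0.0 for c in columns},
        name="non_null_fraction",
    )


# Toy frame mimicking the structure shown above
toy = pd.DataFrame({
    "variant_key": ["13-32315212-G-A", "13-32315226-G-A"],
    "sas_af": [0.0, 0.25],
    "HGVS_coding": [np.nan, np.nan],
})
coverage = column_coverage(
    toy, ["variant_key", "sas_af", "HGVS_coding", "DateLastUpdated"]
)
print(coverage)
```

On the real data, `column_coverage(df, required_cols)` would show at a glance which Table 1 columns will be empty in the final output.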


## 2. Define ACMG BA1 / BS1 rules (frequency-based)

We apply **ACMG frequency criteria** using **South Asian (SAS) allele frequency**:

- **BA1 (Benign, stand-alone)**:  
  - SAS AF ≥ 5% (0.05)
  - Too common to be a high-penetrance pathogenic BRCA2 variant.

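- **BS1 (Strong benign evidence)**:  
  - SAS AF ≥ 1% (0.01) but \< 5%
  - ACMG defines BS1 as an allele frequency greater than expected for the disorder and does not fix a numeric cutoff; 1% is the working threshold used in this notebook. The code labels these variants "Likely benign (BS1)".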

We will:
1. Compute BA1 and BS1 flags using `sas_af`.
2. (Optional) Compute SAS/EUR ratio to highlight large population differences.
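
The two rules above can be sketched as a small scalar function before applying them to the whole table. The function name `frequency_criterion` and its default arguments are assumptions for illustration; the cutoffs mirror the 5% and 1% values stated in this section.

```python
import math


def frequency_criterion(af, ba1=0.05, bs1=0.01):
    """Return "BA1", "BS1", or None for a single allele frequency.

    Missing frequencies (None / NaN) never trigger a criterion.
    (Hypothetical helper, not part of the original notebook.)
    """
    if af is None or (isinstance(af, float) and math.isnan(af)):
        return None
    if af >= ba1:
        return "BA1"
    if af >= bs1:
        return "BS1"
    return None


# 0.25 and 0.000415 are SAS frequencies from the preview table; 0.02 is made up
for af in (0.25, 0.02, 0.000415, float("nan")):
    print(af, "->", frequency_criterion(af))
```

Checking BA1 first means a variant at or above 5% is never also reported as BS1, which matches the "not already BA1" condition used in the flags computed in this section.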


In [8]:
# ACMG frequency thresholds (you can tweak later if needed)
BA1_THRESHOLD = 0.05  # >= 5%  -> Benign
BS1_THRESHOLD = 0.01  # >= 1%  -> Likely benign

print("BA1 threshold (SAS AF):", BA1_THRESHOLD)
print("BS1 threshold (SAS AF):", BS1_THRESHOLD)


BA1 threshold (SAS AF): 0.05
BS1 threshold (SAS AF): 0.01


In [9]:
# Make sure we have SAS and EUR AF columns
for col in ["sas_af", "eur_af"]:
    if col not in df.columns:
        df[col] = np.nan

# Helpful: SAS / EUR frequency ratio (if not already present)
if "sas_eur_ratio" not in df.columns:
    df["sas_eur_ratio"] = (df["sas_af"].fillna(0) + 1e-12) / (df["eur_af"].fillna(0) + 1e-12)

# BA1: SAS AF >= 5%
df["BA1_flag"] = df["sas_af"].fillna(0) >= BA1_THRESHOLD

# BS1: SAS AF >= 1% but < 5%, and not already BA1
df["BS1_flag"] = (df["sas_af"].fillna(0) >= BS1_THRESHOLD) & (~df["BA1_flag"])

df[["sas_af", "eur_af", "sas_eur_ratio", "BA1_flag", "BS1_flag"]].head()


Unnamed: 0,sas_af,eur_af,sas_eur_ratio,BA1_flag,BS1_flag
0,0.0,1.5e-05,6.8002e-08,False,False
1,0.25,0.177842,1.405741,True,False
2,0.000415,0.005453,0.07602821,False,False
3,0.0,0.0,1.0,False,False
4,0.0,0.0,1.0,False,False
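
With a `1e-12` pseudocount, two zero frequencies give a ratio of exactly 1.0 (rows 3 and 4 above), and a zero SAS frequency gives a tiny but nonzero ratio (row 0). If those values would be misleading in a table, one alternative is to leave the ratio undefined whenever either frequency is zero or missing. The sketch below assumes that convention; `af_ratio` is a hypothetical helper, not the method used to build `sas_eur_ratio` in Phase 1.

```python
import numpy as np
import pandas as pd


def af_ratio(numerator_af, denominator_af):
    """Ratio of two allele-frequency Series, NaN unless both are > 0.

    (Hypothetical helper, not part of the original notebook.)
    """
    num = pd.to_numeric(numerator_af, errors="coerce")
    den = pd.to_numeric(denominator_af, errors="coerce")
    ratio = num / den.where(den > 0)   # zero or missing denominator -> NaN
    return ratio.where(num > 0)        # zero or missing numerator -> NaN


# Values taken from the first rows of the preview table
sas = pd.Series([0.0, 0.25, 0.0])
eur = pd.Series([1.5e-05, 0.177842, 0.0])
ratios = af_ratio(sas, eur)
print(ratios)
```

Only the second row yields a number here; the other two stay `NaN`, so they drop out of any sort or filter on the ratio instead of appearing as 1.0 or near zero.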


## 3. Select VUS / Pathogenic variants for reclassification

We focus on variants that are currently labeled as:

- **VUS (Variant of Uncertain Significance)**, or  
- **Pathogenic / Likely pathogenic**

If those variants are **very common in South Asians**, they become candidates for being
**Benign (BA1)** or **Likely Benign (BS1)** based on ACMG frequency rules.


In [10]:
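# Boolean masks for VUS and pathogenic-like ClinicalSignificance labels
cs = df["ClinicalSignificance"].astype(str).str.lower()

is_vus = cs.str.contains("uncertain", na=False)
# Note: the substring "pathogenic" also matches
# "Conflicting_classifications_of_pathogenicity", so conflicting records
# are counted as pathogenic-like here.
is_pathogenic_like = cs.str.contains("pathogenic", na=False)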

print("Total variants:", len(df))
print("VUS count:", is_vus.sum())
print("Pathogenic-like count:", is_pathogenic_like.sum())

df["is_vus"] = is_vus
df["is_pathogenic_like"] = is_pathogenic_like


Total variants: 20614
VUS count: 3958
Pathogenic-like count: 11043
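
Because the label `Conflicting_classifications_of_pathogenicity` contains the substring "pathogenic", a plain substring match counts conflicting records as pathogenic-like. If a stricter definition is wanted, a regular expression with word-style boundaries is one option. This is a sketch under that assumption; the pattern and the name `is_strictly_pathogenic` are illustrative and would change the counts reported above.

```python
import pandas as pd

# "pathogenic", optionally prefixed by "likely_", not embedded in a longer token
PATHOGENIC_PATTERN = r"(?<![a-z_])(?:likely_)?pathogenic(?![a-z_])"


def is_strictly_pathogenic(labels):
    """Boolean mask for Pathogenic / Likely_pathogenic labels only.

    (Hypothetical helper, not part of the original notebook.)
    """
    text = labels.astype(str).str.lower()
    return text.str.contains(PATHOGENIC_PATTERN, regex=True, na=False)


demo_labels = pd.Series([
    "('Pathogenic',)",
    "('Likely_pathogenic',)",
    "('Pathogenic/Likely_pathogenic',)",
    "('Conflicting_classifications_of_pathogenicity',)",
    "('Uncertain_significance',)",
])
strict_mask = is_strictly_pathogenic(demo_labels)
print(strict_mask.tolist())
```

The lookbehind and lookahead reject matches that are preceded by `_` or followed by further letters, which is what excludes `..._of_pathogenicity`.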


In [11]:
# Candidate set: VUS or Pathogenic-like AND meets BA1 or BS1 thresholds
cands = df.loc[(df["is_vus"] | df["is_pathogenic_like"]) & (df["BA1_flag"] | df["BS1_flag"])].copy()

print("Total reclassification candidates (VUS/Path + BA1/BS1):", len(cands))

# Proposed reclassification label (vectorized equivalent of a row-wise apply)
cands["ProposedReclassification"] = np.select(
    [cands["BA1_flag"], cands["BS1_flag"]],
    ["Benign (BA1)", "Likely benign (BS1)"],
    default="None",
)

# Simple summary
print("\nProposed reclassification counts:")
print(cands["ProposedReclassification"].value_counts())

cands.head()


Total reclassification candidates (VUS/Path + BA1/BS1): 1

Proposed reclassification counts:
ProposedReclassification
Benign (BA1)    1
Name: count, dtype: int64


Unnamed: 0,Chromosome,Start,ReferenceAllele,AlternateAllele,ClinicalSignificance,ReviewStatus,variant_key,variant_id,chrom,pos,...,log10_eur_af,sas_eur_ratio,HGVS_coding,HGVS_protein,DateLastUpdated,BA1_flag,BS1_flag,is_vus,is_pathogenic_like,ProposedReclassification
4105,13,32333398,CT,C,"('Conflicting_classifications_of_pathogenicity',)","('criteria_provided', '_conflicting_classifica...",13-32333398-CT-C,13-32333398-CT-C,13.0,32333398.0,...,-1.361813,1.434714,,,,True,False,False,True,Benign (BA1)
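
To put the candidate count in context, it can help to tabulate every ClinVar label against the frequency criterion it meets, not just the VUS and pathogenic-like rows. The sketch below assumes a frame with `ClinicalSignificance`, `BA1_flag`, and `BS1_flag` columns as created earlier; `criterion_by_label` and the toy data are hypothetical.

```python
import pandas as pd


def criterion_by_label(frame, label_col="ClinicalSignificance"):
    """Cross-tabulate clinical labels against the BA1 / BS1 / none criterion.

    (Hypothetical helper, not part of the original notebook.)
    """
    criterion = pd.Series("none", index=frame.index, name="criterion")
    criterion[frame["BS1_flag"]] = "BS1"
    criterion[frame["BA1_flag"]] = "BA1"  # BA1 takes precedence over BS1
    return pd.crosstab(frame[label_col], criterion)


# Toy data only; these rows are not results from the real dataset
toy_flags = pd.DataFrame({
    "ClinicalSignificance": [
        "('Benign',)",
        "('Benign',)",
        "('Uncertain_significance',)",
    ],
    "BA1_flag": [True, False, False],
    "BS1_flag": [False, False, False],
})
label_table = criterion_by_label(toy_flags)
print(label_table)
```

Running `criterion_by_label(df)` on the full table would show how many BA1 or BS1 variants are already labeled benign versus those that still carry a VUS, conflicting, or pathogenic label.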


## 4. Save outputs (candidates + Table 1)

We now save:

1. **Full candidate list** (all columns) for your own analysis:
   - `data/processed/brca2_reclass_candidates.csv`

2. **Clean Table 1** (subset of columns) ready for paper / poster:
   - `results/tables/table1_acmg_reclassifications.csv`


In [12]:
# 4.1 Full candidate CSV (all columns)
cands_path = os.path.join(DATA_PROCESSED, "brca2_reclass_candidates.csv")
cands.to_csv(cands_path, index=False)
print("✅ Saved full candidate list to:", cands_path)

# 4.2 Clean "Table 1" with selected columns
table_cols = [
    "variant_key",
    "HGVS_coding",
    "HGVS_protein",
    "ClinicalSignificance",
    "sas_af",
    "eur_af",
    "sas_eur_ratio",
    "ProposedReclassification",
    "ReviewStatus",
    "DateLastUpdated",
]

# Keep only columns that actually exist
table_cols = [c for c in table_cols if c in cands.columns]

table1 = cands[table_cols].copy()

table1_path = os.path.join(RESULTS_TABLES, "table1_acmg_reclassifications.csv")
table1.to_csv(table1_path, index=False)
print("✅ Saved Table 1 to:", table1_path)

table1.head()


✅ Saved full candidate list to: /content/drive/MyDrive/BRCA2-database-bias/data/processed/brca2_reclass_candidates.csv
✅ Saved Table 1 to: /content/drive/MyDrive/BRCA2-database-bias/results/tables/table1_acmg_reclassifications.csv


Unnamed: 0,variant_key,HGVS_coding,HGVS_protein,ClinicalSignificance,sas_af,eur_af,sas_eur_ratio,ProposedReclassification,ReviewStatus,DateLastUpdated
4105,13-32333398-CT-C,,,"('Conflicting_classifications_of_pathogenicity',)",0.062367,0.04347,1.434714,Benign (BA1),"('criteria_provided', '_conflicting_classifica...",
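
For a poster or manuscript, raw allele frequencies are often easier to read as percentages. The following is a minimal formatting sketch, assuming the Table 1 column names used above; `format_for_manuscript` is a hypothetical helper and the rounding choices are only suggestions.

```python
import pandas as pd


def format_for_manuscript(table, af_cols=("sas_af", "eur_af"), ratio_col="sas_eur_ratio"):
    """Return a copy with AFs as percent strings and the ratio rounded to 2 d.p.

    (Hypothetical helper, not part of the original notebook.)
    """
    out = table.copy()
    for col in af_cols:
        if col in out.columns:
            out[col] = out[col].map(lambda af: f"{af:.2%}" if pd.notna(af) else "")
    if ratio_col in out.columns:
        out[ratio_col] = out[ratio_col].round(2)
    return out


# Single row copied from the Table 1 output above
row = pd.DataFrame({
    "variant_key": ["13-32333398-CT-C"],
    "sas_af": [0.062367],
    "eur_af": [0.04347],
    "sas_eur_ratio": [1.434714],
    "ProposedReclassification": ["Benign (BA1)"],
})
pretty = format_for_manuscript(row)
print(pretty.to_string(index=False))
```

The function works on a copy, so the numeric `table1` stays available for further analysis.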


## 5. Figure 2 – Frequency thresholds plot

We will create a simple one-dimensional **scatter plot** (all points drawn on a single horizontal line) showing:

- **X-axis**: South Asian allele frequency (`sas_af`)  
- **Points**: VUS variants (to illustrate how many cross BS1 / BA1 lines)  
- **Vertical lines** at **1% (BS1)** and **5% (BA1)**

Output file:

- `results/figures/figure2_frequency_thresholds.png`


In [13]:
# Focus on VUS variants only for the scatter plot
vus_df = df[df["is_vus"]].copy()

plt.figure(figsize=(8, 4))

plt.scatter(vus_df["sas_af"], np.zeros_like(vus_df["sas_af"]), s=8, alpha=0.6, label="VUS (SAS AF)")

# Vertical lines for thresholds (distinct colors so the legend entries can be told apart)
plt.axvline(BS1_THRESHOLD, linestyle="--", color="tab:orange", label=f"BS1 ({BS1_THRESHOLD:.2%})")
plt.axvline(BA1_THRESHOLD, linestyle="--", color="tab:red", label=f"BA1 ({BA1_THRESHOLD:.2%})")

plt.xlabel("South Asian allele frequency (sas_af)")
plt.yticks([])
plt.title("Figure 2: Frequency thresholds for ACMG BA1 / BS1 (South Asians)")
plt.legend()

fig2_path = os.path.join(RESULTS_FIGURES, "figure2_frequency_thresholds.png")
plt.savefig(fig2_path, dpi=200, bbox_inches="tight")
plt.close()

print("✅ Saved Figure 2 to:", fig2_path)


✅ Saved Figure 2 to: /content/drive/MyDrive/BRCA2-database-bias/results/figures/figure2_frequency_thresholds.png
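
Most allele frequencies in this dataset are at or near zero, so on a linear axis the points pile up at the left edge. One possible variant is a log-scaled x-axis with vertical jitter, plotting only variants with AF > 0 and reporting the zero-frequency count separately. This is a sketch, not the figure saved above; `plot_af_strip_log`, the jitter range, and the colors are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_af_strip_log(af, bs1=0.01, ba1=0.05, ax=None, seed=0):
    """Jittered strip plot of allele frequencies on a log x-axis.

    Returns the Axes and the number of zero-frequency values that
    cannot be drawn on a log scale.
    (Hypothetical helper, not part of the original notebook.)
    """
    af = pd.to_numeric(pd.Series(af), errors="coerce")
    positive = af[af > 0]
    n_zero = int((af == 0).sum())

    if ax is None:
        _, ax = plt.subplots(figsize=(8, 3))

    # Random vertical offsets so overlapping points stay visible
    rng = np.random.default_rng(seed)
    jitter = rng.uniform(-0.4, 0.4, size=len(positive))

    ax.scatter(positive, jitter, s=8, alpha=0.6, label=f"AF > 0 (n={len(positive)})")
    ax.axvline(bs1, linestyle="--", color="tab:orange", label=f"BS1 ({bs1:.0%})")
    ax.axvline(ba1, linestyle="--", color="tab:red", label=f"BA1 ({ba1:.0%})")
    ax.set_xscale("log")
    ax.set_ylim(-1, 1)
    ax.set_yticks([])
    ax.set_xlabel("South Asian allele frequency (log scale)")
    ax.legend()
    return ax, n_zero


# Toy frequencies; two of them are zero and are only counted, not drawn
ax, n_zero = plot_af_strip_log([0.0, 0.000415, 0.25, 0.0])
print("Zero-frequency variants not shown:", n_zero)
```

With the real data this could be called as `plot_af_strip_log(vus_df["sas_af"])`, followed by `ax.figure.savefig(...)` to a separate file name so the original Figure 2 is not overwritten.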


## 6. Quick inspection of top candidates

Finally, let's look at a **sorted view** of the strongest candidates,
starting from the **highest SAS allele frequency**.


In [14]:
cands_sorted = cands.sort_values("sas_af", ascending=False)

# Show top 20 by SAS AF
cols_to_show = [
    "variant_key",
    "HGVS_coding",
    "ClinicalSignificance",
    "sas_af",
    "eur_af",
    "sas_eur_ratio",
    "ProposedReclassification",
]

cols_to_show = [c for c in cols_to_show if c in cands_sorted.columns]

cands_sorted[cols_to_show].head(20)


Unnamed: 0,variant_key,HGVS_coding,ClinicalSignificance,sas_af,eur_af,sas_eur_ratio,ProposedReclassification
4105,13-32333398-CT-C,,"('Conflicting_classifications_of_pathogenicity',)",0.062367,0.04347,1.434714,Benign (BA1)


## 7. Summary

In this Phase 3 notebook, you:

1. Loaded the **merged BRCA2 dataset** from Phase 1.  
2. Applied **ACMG BA1 / BS1** frequency rules using **South Asian allele frequencies**.  
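3. Identified **VUS / Pathogenic-like variants** that are **too common in South Asians**. In this run, one variant met the criteria: `13-32333398-CT-C` (SAS AF 0.062, BA1), currently labeled `Conflicting_classifications_of_pathogenicity` in ClinVar.  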
4. Saved:
   - `data/processed/brca2_reclass_candidates.csv` (full list)  
   - `results/tables/table1_acmg_reclassifications.csv` (clean table for reporting)  
   - `results/figures/figure2_frequency_thresholds.png` (plot for your poster/manuscript)

You can now:

- Use **Table 1** as a key result in your Regeneron / ISEF materials.  
- Refer to **Figure 2** when explaining how frequency thresholds support benign reclassification.
