# Golden Dataset Construction  
**Extraction → Label Expansion → Cleaning**

This notebook consolidates the Golden Dataset workflow used in the capstone project: *System Risk in Policy-Driven AI Systems*.

---

## Overview
The Golden Dataset is a governance-aligned evaluation subset designed to expose
system-level risks that are hidden by legacy labels.

This notebook includes two stages:

1. **Extraction / Sampling**  
   Selection of a small, governance-focused subset from the Civil Comments dataset
   (high-risk, borderline, and clean tiers).

2. **Label Expansion & Cleaning**  
   Transformation of human annotations into analysis-ready columns, including:
   - binary harm labels  
   - ambiguity codes  
   - cleaned export schema  

---

## Reproducibility & Ethics
To keep this repository ethical and lightweight, **raw datasets are not included**.

You must provide the Civil Comments source file locally (or via Drive) and specify
your own input and output paths below.



In [None]:
# Core libraries
import pandas as pd
import numpy as np

# Display tweaks (optional)
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)


## Setup and Paths

This notebook assumes access to the Civil Comments dataset under its original
licensing terms.

Set the input and output paths in the cell below before running anything else.


In [None]:
# Civil Comments source for sampling/extraction
CIVIL_COMMENTS_PATH = "<PATH_TO_CIVIL_COMMENTS_READY_CSV>"  # e.g., "data/Civil_Comments_TFDS.csv"

# Pre-clean Golden Dataset spreadsheet, produced after human labeling
GOLD_PRE_CLEAN_PATH = "<PATH_TO_GOLDEN_PRECLEAN_XLSX>"      # e.g., "outputs/golden_dataset_preclean.xlsx"
GOLD_PRE_CLEAN_SHEET = "golden_dataset"

# Outputs
OUTPUT_DIR = "outputs"
EXTRACTED_OUTPUT_PATH = f"{OUTPUT_DIR}/golden_subset_extracted.csv"
CLEANED_OUTPUT_PATH  = f"{OUTPUT_DIR}/golden_dataset_clean.xlsx"

print("Configured paths. Update the <...> placeholders before running.")

In [None]:
import os

def _is_placeholder(p: str) -> bool:
    return isinstance(p, str) and p.strip().startswith("<") and p.strip().endswith(">")

# Create output directory early
os.makedirs(OUTPUT_DIR, exist_ok=True)

if _is_placeholder(CIVIL_COMMENTS_PATH) or _is_placeholder(GOLD_PRE_CLEAN_PATH):
    print("Update CIVIL_COMMENTS_PATH and GOLD_PRE_CLEAN_PATH before running the notebook.")
else:
    print("Configured paths:")
    print(" - CIVIL_COMMENTS_PATH =", CIVIL_COMMENTS_PATH)
    print(" - GOLD_PRE_CLEAN_PATH =", GOLD_PRE_CLEAN_PATH)
    print(" - OUTPUT_DIR          =", OUTPUT_DIR)


## 1) Load Civil Comments source

In [None]:
# Load the dataset used for extraction

df = pd.read_csv(CIVIL_COMMENTS_PATH)

print("Rows:", len(df))
print("Columns:", len(df.columns))
df.head()

## 2) Golden subset extraction / sampling (risk tiers + per-label minimums)

This section constructs a small, governance-focused subset of the Civil Comments
dataset.

Sampling emphasizes:
- borderline toxicity regions (high ambiguity)
- clearly harmful cases
- clearly non-harmful cases

This design maximizes diagnostic value while keeping human labeling effort
tractable.
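
The sampling cell reads a frame named `tiered_df` with `risk_tier` and `dominant_label` columns, but the step that builds it is not shown in this notebook. The following is a minimal sketch of one way to derive those columns. It assumes the tier is set by the highest score across the seven label columns, using the 0.50 and 0.30 cut-offs that appear in the sampling comments. The helper name `add_risk_tiers` and the max-score rule are assumptions, not the project's confirmed method.

```python
import numpy as np
import pandas as pd

LABEL_COLS = [
    "toxicity", "severe_toxicity", "obscene",
    "insult", "threat", "identity_attack", "sexual_explicit",
]

def add_risk_tiers(frame: pd.DataFrame, high: float = 0.50, low: float = 0.30) -> pd.DataFrame:
    """Hypothetical helper: tier each row by its highest label score."""
    out = frame.copy()
    scores = out[LABEL_COLS]
    max_score = scores.max(axis=1)
    # Label with the highest score in each row (the first column wins on ties)
    out["dominant_label"] = scores.idxmax(axis=1)
    # Assumed rule: >= high is High-Risk, [low, high) is Borderline, below low is Clean
    out["risk_tier"] = np.select(
        [max_score >= high, max_score >= low],
        ["High-Risk", "Borderline"],
        default="Clean",
    )
    return out

# Tiny synthetic example (not real Civil Comments rows)
demo = pd.DataFrame(0.0, index=range(3), columns=LABEL_COLS)
demo.loc[0, "insult"] = 0.80
demo.loc[1, "threat"] = 0.35
tiered_demo = add_risk_tiers(demo)
print(tiered_demo[["dominant_label", "risk_tier"]])
```

With real data, the same helper could be called as `tiered_df = add_risk_tiers(df)` before the sampling cell runs, provided the assumed tier rule matches the one used in the project.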


In [None]:
# Golden Dataset sampling with per-label minimums (≥ 15)

# Global seed for any NumPy-based randomness; the sampling calls below also pass random_state=42
np.random.seed(42)

label_cols = [
    'toxicity', 'severe_toxicity', 'obscene',
    'insult', 'threat', 'identity_attack', 'sexual_explicit'
]

# Targets by risk tier
tier_targets = {'High-Risk': 186, 'Borderline': 62, 'Clean': 62}

# Minimum presence per label (for analysis power):
min_per_label = 15

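# `tiered_df` is the Civil Comments frame with `risk_tier` and `dominant_label` columns added.
# It is not built in this notebook and must be defined before this cell runs.
# Copy to avoid side effects
pool = tiered_df.copy()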

# Track selections
selected_idx = set()

# Track how many we have taken from each tier so far
tier_counts = {'High-Risk': 0, 'Borderline': 0, 'Clean': 0}

# Sample rows safely from a candidate subset, honoring tier quotas and avoiding duplicates
def take_from_pool(candidates, needed, tier_name):
    """Return a list of indices selected from 'candidates' up to 'needed',
    without exceeding the remaining tier quota or duplicating rows."""
    remaining_quota = max(0, tier_targets[tier_name] - tier_counts[tier_name])
    k = min(needed, remaining_quota, len(candidates))
    if k <= 0:
        return []
    taken = candidates.sample(n=k, random_state=42).index.tolist()
    return taken

# Ensure ≥ 15 rows per label, drawing from High-Risk first (≥ 0.50), then Borderline (0.30–0.49)
for L in label_cols:
    # The count restarts at 0 for every label: rows already selected for an earlier
    # label are not credited here, even if they also score ≥ 0.30 on this one.
    current_presence = 0

    # High-Risk pool for this label (presence AND risk_tier == 'High-Risk')
    hr_pool = pool[
        ~pool.index.isin(selected_idx) &
        (pool['risk_tier'] == 'High-Risk') &
        (pool[L] >= 0.50)
    ]

    # Take as many as possible toward the min requirement from High-Risk
    need = max(0, min_per_label - current_presence)
    hr_take = take_from_pool(hr_pool, need, 'High-Risk')
    selected_idx.update(hr_take)
    tier_counts['High-Risk'] += len(hr_take)
    current_presence += len(hr_take)

    # If still short, top up from Borderline presence (0.30–0.49) where risk_tier == 'Borderline'
    if current_presence < min_per_label:
        bl_pool = pool[
            ~pool.index.isin(selected_idx) &
            (pool['risk_tier'] == 'Borderline') &
            (pool[L] >= 0.30) & (pool[L] < 0.50)
        ]
        need = max(0, min_per_label - current_presence)
        bl_take = take_from_pool(bl_pool, need, 'Borderline')
        selected_idx.update(bl_take)
        tier_counts['Borderline'] += len(bl_take)
        current_presence += len(bl_take)

    # If still short (extremely rare labels), allow spillover from remaining High-Risk presence again,
    # even if not dominant, as long as L >= 0.30 and risk_tier == 'High-Risk' (covers cases with another label slightly higher)
    if current_presence < min_per_label:
        hr_any_pool = pool[
            ~pool.index.isin(selected_idx) &
            (pool['risk_tier'] == 'High-Risk') &
            (pool[L] >= 0.30)
        ]
        need = max(0, min_per_label - current_presence)
        hr_any_take = take_from_pool(hr_any_pool, need, 'High-Risk')
        selected_idx.update(hr_any_take)
        tier_counts['High-Risk'] += len(hr_any_take)
        current_presence += len(hr_any_take)

    # NOTE: We do NOT pull presence from Clean for per-label minima,
    # because presence ≥ 0.30 by definition belongs to High-Risk/Borderline thresholds.

# After guaranteeing per-label minima, fill remaining quotas per tier at random (stratified only by tier).
# This preserves organic composition while completing 186 / 62 / 62.
for tier_name in ['High-Risk', 'Borderline', 'Clean']:
    remaining = max(0, tier_targets[tier_name] - tier_counts[tier_name])
    if remaining == 0:
        continue

    tier_pool = pool[
        ~pool.index.isin(selected_idx) &
        (pool['risk_tier'] == tier_name)
    ]

    # If the tier has fewer rows than needed, take all; otherwise sample
    if len(tier_pool) <= remaining:
        take_idx = tier_pool.index.tolist()
    else:
        take_idx = tier_pool.sample(n=remaining, random_state=42).index.tolist()

    selected_idx.update(take_idx)
    tier_counts[tier_name] += len(take_idx)

# Build the final DataFrame (sorted indices keep the shuffle reproducible), shuffle, and double-check totals
golden_df = pool.loc[sorted(selected_idx)].sample(frac=1, random_state=42).reset_index(drop=True)

# Safety trim if the subset overshoots the combined tier targets (should not happen)
max_rows = sum(tier_targets.values())  # 186 + 62 + 62 = 310
if len(golden_df) > max_rows:
    golden_df = golden_df.sample(n=max_rows, random_state=42).reset_index(drop=True)

# Quick verification
print(f"Extracted Golden subset size: {len(golden_df)}")
print("Tier counts:\n", golden_df['risk_tier'].value_counts().to_string(), "\n")

# Per-label presence counts (rows where that label's score is ≥ 0.30)
presence_summary = {}
for L in label_cols:
    presence_summary[L] = int((golden_df[L] >= 0.30).sum())
print("Per-label presence (≥ 0.30) counts:\n", pd.Series(presence_summary).to_string(), "\n")

# Show the dominant_label distribution for context (not the enforcement metric)
print("Dominant label distribution (context only):\n")
print(golden_df['dominant_label'].value_counts().to_string())

# Save final dataset
golden_df.to_csv(EXTRACTED_OUTPUT_PATH, index=False)
print(f"Saved extracted Golden subset to: {EXTRACTED_OUTPUT_PATH}")
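
The printed summaries can also be turned into explicit checks. The sketch below uses a hypothetical helper, `check_golden_subset`, to report which tiers miss their target and which labels fall below the per-label minimum. It assumes the same column names and the 0.30 presence threshold used in the sampling cell. The toy rows are synthetic.

```python
import pandas as pd

def check_golden_subset(frame, tier_targets, label_cols, min_per_label=15, presence=0.30):
    """Hypothetical helper: return tier and per-label shortfalls as two dicts."""
    tier_actual = frame["risk_tier"].value_counts()
    # Tiers whose actual row count differs from the target, mapped to the actual count
    tier_gaps = {
        tier: int(tier_actual.get(tier, 0))
        for tier, target in tier_targets.items()
        if int(tier_actual.get(tier, 0)) != target
    }
    # Labels with fewer than min_per_label rows at or above the presence threshold
    label_gaps = {}
    for col in label_cols:
        n_present = int((frame[col] >= presence).sum())
        if n_present < min_per_label:
            label_gaps[col] = n_present
    return tier_gaps, label_gaps

# Synthetic example
toy = pd.DataFrame({
    "risk_tier": ["High-Risk", "High-Risk", "Clean"],
    "insult": [0.9, 0.4, 0.0],
    "threat": [0.1, 0.0, 0.0],
})
tier_gaps, label_gaps = check_golden_subset(
    toy,
    {"High-Risk": 2, "Borderline": 1, "Clean": 1},
    ["insult", "threat"],
    min_per_label=2,
)
print(tier_gaps)   # tiers whose actual count differs from the target
print(label_gaps)  # labels with fewer than min_per_label present rows
```

On the real subset, a call such as `check_golden_subset(golden_df, tier_targets, label_cols, min_per_label)` would return two empty dicts when every quota and minimum is met.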



## 3) Load pre-clean Golden Dataset (after human labeling)

This section begins the second stage (label expansion and cleaning), which:
- parses label list columns (e.g., `first_label`, `final_label`)
- expands them into binary indicator columns
- converts ambiguity text into numeric codes
- drops helper/duplicate columns


In [None]:
# Load the labeled spreadsheet (pre-clean)
# Make sure the sheet contains columns like: first_label, final_label, ambiguity, final_ambiguity
df_gold = pd.read_excel(GOLD_PRE_CLEAN_PATH, sheet_name=GOLD_PRE_CLEAN_SHEET)

print("Rows:", len(df_gold))
df_gold.head()

In [None]:
# If labels are stored as strings like "['toxicity','insult']" convert them to Python lists.

import ast

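def to_list(x):
    """Coerce a cell value to a list of label strings."""
    if isinstance(x, (list, tuple, set)):
        return list(x)
    if pd.isna(x):
        return []
    if isinstance(x, str):
        try:
            parsed = ast.literal_eval(x)
        except Exception:
            # Fallback: strip brackets and quotes, then split on commas
            cleaned = x.strip().strip("[]").replace("'", "").replace('"', "")
            return [t.strip() for t in cleaned.split(",") if t.strip()]
        # A lone quoted label parses to a str; wrap it so `label in lst` is not a substring test
        if isinstance(parsed, (list, tuple, set)):
            return [str(t) for t in parsed]
        return [] if parsed is None else [str(parsed)]
    return []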

for col in ["first_label", "final_label"]:
    if col in df_gold.columns:
        df_gold[col] = df_gold[col].apply(to_list)

df_gold[["first_label","final_label"]].head()

## 4) Label Expansion

Human annotations are expanded into analysis-ready columns, including:
- binary harm indicators
- ambiguity codes
- standardized column names

Explicit ambiguity handling is critical for governance-aligned evaluation and
prevents treating contested cases as ground truth.
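
As a small illustration of the expansion, the sketch below applies the list-to-indicator idea to two made-up rows. The helper name `expand_labels` and the sample values are hypothetical. The label names are the seven used throughout this notebook.

```python
import pandas as pd

ALL_LABELS = [
    "toxicity", "severe_toxicity", "obscene",
    "insult", "threat", "identity_attack", "sexual_explicit",
]

def expand_labels(frame, list_col, prefix, labels=ALL_LABELS):
    """Hypothetical helper: add one 0/1 column per label for a list-valued column."""
    out = frame.copy()
    for label in labels:
        # Bind the current label as a default argument so each lambda keeps its own value
        out[f"{prefix}_{label}"] = out[list_col].apply(lambda lst, lab=label: int(lab in lst))
    return out

# Two synthetic rows: one with two labels, one with none
toy = pd.DataFrame({"final_label": [["toxicity", "insult"], []]})
expanded = expand_labels(toy, "final_label", "final")
indicator_cols = [f"final_{label}" for label in ALL_LABELS]
print(expanded[indicator_cols].to_string())
```

An empty label list produces zeros in every indicator column, so a comment with no assigned harm label stays in the frame as an all-zero row.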


In [None]:
# Expand label lists into binary columns
ALL_LABELS = [
    "toxicity", "severe_toxicity", "obscene",
    "insult", "threat", "identity_attack", "sexual_explicit"
]

for label in ALL_LABELS:
    df_gold[f"first_{label}"] = df_gold["first_label"].apply(lambda lst: 1 if label in lst else 0)
    df_gold[f"final_{label}"] = df_gold["final_label"].apply(lambda lst: 1 if label in lst else 0)

df_gold[[c for c in df_gold.columns if c.startswith("first_") or c.startswith("final_")]].head()

In [None]:
# Convert ambiguity text into numeric codes
ambiguity_map = {
    "no_violation": 0,
    "gray_area": 1,
    "clear_violation": 2
}

if "ambiguity" in df_gold.columns:
    df_gold["ambiguity_code"] = df_gold["ambiguity"].map(ambiguity_map)

if "final_ambiguity" in df_gold.columns:
    df_gold["final_ambiguity_code"] = df_gold["final_ambiguity"].map(ambiguity_map)

df_gold[[c for c in ["ambiguity","final_ambiguity","ambiguity_code","final_ambiguity_code"] if c in df_gold.columns]].head()
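
`Series.map` returns `NaN` for any value that is not a key in the mapping, so a typo or stray whitespace in the spreadsheet would silently lose its code. The sketch below shows one possible guard. It assumes light normalization (trim, lowercase, spaces to underscores) is acceptable for this project. The helper name `encode_ambiguity` is hypothetical.

```python
import pandas as pd

AMBIGUITY_MAP = {"no_violation": 0, "gray_area": 1, "clear_violation": 2}

def encode_ambiguity(series: pd.Series, mapping=AMBIGUITY_MAP):
    """Hypothetical helper: normalize text, map to codes, and list unmapped values."""
    # Work only on non-missing cells so genuine blanks are not reported as typos
    text = (
        series.dropna().astype(str)
        .str.strip().str.lower()
        .str.replace(" ", "_", regex=False)
    )
    codes = text.map(mapping)  # NaN where the normalized value is not a key
    unmapped = sorted(text[codes.isna()].unique())
    # Restore the original index and use a nullable integer dtype for the codes
    codes = codes.reindex(series.index).astype("Int64")
    return codes, unmapped

# Synthetic example: one clean value, one messy value, one unknown value, one blank
raw = pd.Series(["gray_area", " Clear Violation ", "unsure", None])
codes, unmapped = encode_ambiguity(raw)
print(codes.tolist())
print(unmapped)
```

A non-empty `unmapped` list points at spreadsheet values that need correcting or a new entry in the mapping before the codes are used in analysis.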

## 5) Cleaning and Finalization

This step standardizes column formats and prepares the dataset for evaluation.
No content-level filtering is applied, so ambiguous cases are preserved.


In [None]:
# Drop columns not needed in the final clean export
COLUMNS_TO_DROP = [] # add any helper columns here if needed

df_gold_clean = df_gold.drop(columns=COLUMNS_TO_DROP, errors="ignore").copy()

print("Clean columns:", len(df_gold_clean.columns))
df_gold_clean.head()

## 6) Outputs

Exported files are intentionally excluded from version control.
This repository provides construction logic only and does not redistribute
labeled datasets or raw comment text.


In [None]:
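# Save cleaned Golden Dataset (writing .xlsx requires an Excel engine such as openpyxl)
import os
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Excel cells cannot store Python lists, so serialize the list-valued label columns first
export_df = df_gold_clean.copy()
for col in ["first_label", "final_label"]:
    if col in export_df.columns:
        export_df[col] = export_df[col].apply(lambda v: ", ".join(map(str, v)) if isinstance(v, list) else v)

export_df.to_excel(CLEANED_OUTPUT_PATH, index=False)
print("Saved:", CLEANED_OUTPUT_PATH)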

## Summary

This notebook documents the full construction of a governance-aligned Golden
Dataset used to evaluate system risk in policy-driven AI systems.

The dataset is designed for diagnostic evaluation rather than large-scale
training and enables analysis of ambiguity, bias propagation, outcome disparity,
and governance drift.
