## 9.0 Build 50% Real + 50% CTGAN Synthetic Training Table (for CTGAN v2)

In this new experiment, we investigate whether CTGAN can benefit from being trained on a *mixture* of real and synthetic data.

- We reload the merged training data (`train_merged.parquet`) and rebuild the **Safe Top-27 + label** table.
- We recreate an 80/20 stratified split at the **row level**:
  - `df_train_real`: 80% real rows (for CTGAN training and sampling)
  - `df_val_real`: 20% held-out real rows (for downstream evaluation)
- We then load the previous **CTGAN synthetic 200k** table:
  - `synthetic/ctgan_safe_top27_200k.parquet`
- Finally, we construct a **mixed CTGAN v2 training table**:
  - 100,000 rows sampled from the **real 80% train pool** (`df_train_real`)
  - 100,000 rows sampled from the **CTGAN synthetic 200k** table
  - Total = 200,000 rows, the same size as the first CTGAN training table

This mixed table (`df_ctgan_mix_200k`) will be used in Section 9.1 to train a **second CTGAN model (CTGAN v2)** and study how “second-generation” synthetic data behaves in terms of utility, fidelity, and privacy.

In [2]:
# === 9.0 Build 50% Real + 50% CTGAN Synthetic Training Table ===

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# 1) Path to the merged training data (same as previous notebooks)
train_path = "outputs_merged/train_merged.parquet"
df = pd.read_parquet(train_path)

print("[9.0] Loaded merged train:", df.shape)
print("[9.0] Overall CTR (label=1):", df["label"].mean())

# 2) Safe Top-27 feature list (must match previous CTGAN / VAE experiments)
safe_top_feats = [
    "creat_type_cd", "f_cat_uniq", "f_refresh_sum", "slot_id", "f_rows",
    "f_up_sum", "f_dislike_sum", "f_refresh_mean", "u_refreshTimes",
    "u_newsCatInterestsST_len", "f_up_mean", "u_feedLifeCycle",
    "u_newsCatInterestsST_uniq", "f_entities_len_mean", "f_dislike_mean",
    "f_browser_life", "adv_prim_id", "device_size", "adv_id", "task_id",
    "inter_type_cd", "hispace_app_tags", "spread_app_id", "app_second_class",
    "ad_click_list_v002_uniq", "ad_click_list_v002_len", "f_hour_cos",
]

# 3) Build the clean table: Safe Top-27 + label
df_clean = df[safe_top_feats + ["label"]].copy()

print("[9.0] Clean dataset:", df_clean.shape)
print("[9.0] Columns (first 10):", df_clean.columns.tolist()[:10], "...")

# 4) Row-level 80/20 split to define:
#    - df_train_real: used for CTGAN training / sampling
#    - df_val_real  : held-out real validation set (for utility evaluation)

all_idx = df_clean.index.to_numpy()
train_idx, val_idx = train_test_split(
    all_idx,
    test_size=0.20,
    random_state=42,
    stratify=df_clean.loc[all_idx, "label"]
)

df_train_real = df_clean.loc[train_idx].reset_index(drop=True)
df_val_real   = df_clean.loc[val_idx].reset_index(drop=True)

print("[9.0] df_train_real:", df_train_real.shape)
print("[9.0] df_val_real  :", df_val_real.shape)
print("[9.0] CTR (train_real):", df_train_real["label"].mean(),
      "CTR (val_real):", df_val_real["label"].mean())

# 5) Build X_val_real / y_val_real for later LightGBM evaluation
X_val_real = df_val_real[safe_top_feats].copy()
y_val_real = df_val_real["label"].astype(int).copy()

print("[9.0] X_val_real:", X_val_real.shape)
print("[9.0] y_val_real:", y_val_real.shape)

# 6) Load the first-generation CTGAN synthetic 200k table
ctgan_syn_path = "synthetic/ctgan_safe_top27_200k.parquet"
df_ctgan_syn_200k = pd.read_parquet(ctgan_syn_path)

print("[9.0] Loaded df_ctgan_syn_200k:", df_ctgan_syn_200k.shape)
print("[9.0] CTGAN synthetic CTR:", df_ctgan_syn_200k["label"].mean())

# 7) Construct a mixed training table: 50% real + 50% CTGAN synthetic
TOTAL_MIX = 200_000
N_REAL = TOTAL_MIX // 2
N_SYN  = TOTAL_MIX - N_REAL

real_sample = df_train_real.sample(n=N_REAL, random_state=123).reset_index(drop=True)
syn_sample  = df_ctgan_syn_200k.sample(n=N_SYN, random_state=123).reset_index(drop=True)

df_ctgan_mix_200k = pd.concat([real_sample, syn_sample], axis=0).reset_index(drop=True)

print("\n[9.0] Mixed CTGAN v2 training table created.")
print("[9.0] df_ctgan_mix_200k shape:", df_ctgan_mix_200k.shape)
print("[9.0] Mixed CTR:", df_ctgan_mix_200k["label"].mean())

# Optional: quick sanity check on source proportions
print("[9.0] Real part size:", len(real_sample), 
      "| Synthetic part size:", len(syn_sample))

[9.0] Loaded merged train: (7675517, 68)
[9.0] Overall CTR (label=1): 0.01552156030662169
[9.0] Clean dataset: (7675517, 28)
[9.0] Columns (first 10): ['creat_type_cd', 'f_cat_uniq', 'f_refresh_sum', 'slot_id', 'f_rows', 'f_up_sum', 'f_dislike_sum', 'f_refresh_mean', 'u_refreshTimes', 'u_newsCatInterestsST_len'] ...
[9.0] df_train_real: (6140413, 28)
[9.0] df_val_real  : (1535104, 28)
[9.0] CTR (train_real): 0.01552159439438357 CTR (val_real): 0.01552142395564079
[9.0] X_val_real: (1535104, 27)
[9.0] y_val_real: (1535104,)
[9.0] Loaded df_ctgan_syn_200k: (200000, 28)
[9.0] CTGAN synthetic CTR: 0.046655

[9.0] Mixed CTGAN v2 training table created.
[9.0] df_ctgan_mix_200k shape: (200000, 28)
[9.0] Mixed CTR: 0.03091
[9.0] Real part size: 100000 | Synthetic part size: 100000
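
The mixed CTR printed above is close to what an equal blend of the two pools predicts. The check below uses only the rates from the log; the helper name `blended_rate` is illustrative and not part of the notebook.

```python
def blended_rate(rate_a: float, n_a: int, rate_b: float, n_b: int) -> float:
    """Positive rate of a table built from n_a rows at rate_a and n_b rows at rate_b."""
    return (rate_a * n_a + rate_b * n_b) / (n_a + n_b)

# Rates copied from the log above.
ctr_real_pool = 0.01552159439438357   # df_train_real
ctr_syn_pool = 0.046655               # df_ctgan_syn_200k
observed_mix = 0.03091                # df_ctgan_mix_200k

expected_mix = blended_rate(ctr_real_pool, 100_000, ctr_syn_pool, 100_000)
print(f"expected mixed CTR: {expected_mix:.5f}")
print(f"observed mixed CTR: {observed_mix:.5f}")
print(f"difference        : {expected_mix - observed_mix:+.5f}")
```

The small gap is expected because both halves are random subsamples of their pools. Note that the mixed table is already about twice as click-heavy as the real data (0.03091 vs. 0.01552), since the first-generation synthetic table overstates the CTR.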


## 9.1 Train CTGAN v2 on 50% Real + 50% Synthetic (CTGAN-Mix-200k)

In this section we train a **second CTGAN model (CTGAN v2)** using the mixed table
`df_ctgan_mix_200k` built in Section 9.0:

- Total rows: **200,000**
- Composition:
  - 100,000 rows sampled from the real 80% training pool (`df_train_real`)
  - 100,000 rows sampled from the first-generation CTGAN synthetic 200k table

Training setup:

- Inputs: Safe Top-27 features + binary label
- Discrete columns:
  - Only **`label`** is declared as a discrete (binary) column; the 27 features are not listed, so CTGAN models them as continuous
- Hyperparameters (kept consistent with the first CTGAN for comparability):
  - `epochs = 10`
  - `batch_size = 1024`
  - `pac = 1`
  - `embedding_dim = 128`, `generator_dim = (256, 256)`, `discriminator_dim = (256, 256)`
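
Because only `label` is passed as discrete, ID-like features such as `slot_id` or `adv_id` are modeled as continuous values. A minimal sketch for listing columns that might instead be treated as discrete is shown below. The helper `candidate_discrete_columns`, the `max_unique` threshold, and the toy frame are all hypothetical; the notebook keeps the first CTGAN's setting for comparability, so this is a diagnostic rather than a change to the experiment.

```python
import pandas as pd

def candidate_discrete_columns(df: pd.DataFrame, max_unique: int = 50) -> list[str]:
    """Return integer/bool/object columns with at most `max_unique` distinct values."""
    candidates = []
    for col in df.columns:
        is_countable = (
            pd.api.types.is_integer_dtype(df[col])
            or pd.api.types.is_bool_dtype(df[col])
            or pd.api.types.is_object_dtype(df[col])
        )
        if is_countable and df[col].nunique(dropna=True) <= max_unique:
            candidates.append(col)
    return candidates

# Toy data for illustration only (not taken from the real table).
toy = pd.DataFrame({
    "slot_id": [44, 46, 38, 16, 47, 44],
    "f_refresh_mean": [0.1, 6.9, 0.0, 0.2, 4.9, 1.3],
    "label": [1, 0, 1, 0, 1, 0],
})
print(candidate_discrete_columns(toy, max_unique=5))
```

A cardinality cap matters here because CTGAN one-hot encodes discrete columns, so declaring a column with many thousands of distinct IDs as discrete can make training considerably more expensive.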

The trained model is saved as:

- `ctgan_safe_top27_mix200k.pkl`

In later sections, we will:

- **9.2** Generate a new 200k synthetic table from CTGAN v2
- **9.3** Compare utility when training LightGBM on:
  - 200k CTGAN-v2 synthetic only
  - 100k real + 100k CTGAN-v2 synthetic

In [11]:
# === 9.1 Train CTGAN v2 on 50% Real + 50% Synthetic (CTGAN-Mix-200k) ===

import os
import pickle

# --- 1) Robustly locate CTGAN class regardless of package version ---
import ctgan

CTGANClass = None

# Try new-style API first
try:
    from ctgan.synthesizers import CTGAN as _CTGAN
    CTGANClass = _CTGAN
    print("[9.1] Using ctgan.synthesizers.CTGAN")
except Exception as e:
    print("[9.1] ctgan.synthesizers.CTGAN not available, fallback to ctgan module:", e)

# Fallbacks for older versions
if CTGANClass is None:
    if hasattr(ctgan, "CTGANSynthesizer"):
        CTGANClass = ctgan.CTGANSynthesizer
        print("[9.1] Using ctgan.CTGANSynthesizer")
    elif hasattr(ctgan, "CTGAN"):
        CTGANClass = ctgan.CTGAN
        print("[9.1] Using ctgan.CTGAN")

if CTGANClass is None:
    raise ImportError(
        "Cannot find CTGAN class in your ctgan installation. "
        "Tried ctgan.synthesizers.CTGAN, ctgan.CTGANSynthesizer, ctgan.CTGAN."
    )

# --- 2) Safety checks: need mixed training table and feature list ---
needed_vars = ["df_ctgan_mix_200k", "safe_top_feats"]
for v in needed_vars:
    if v not in globals():
        raise RuntimeError(f"{v} not found. Please run Section 9.0 first.")

print("[9.1] Mixed CTGAN training table:", df_ctgan_mix_200k.shape)

# Use Safe Top-27 + label for training
ctgan_cols = safe_top_feats + ["label"]
df_ctgan_train_v2 = df_ctgan_mix_200k[ctgan_cols].copy()

print("[9.1] CTGAN v2 training view:", df_ctgan_train_v2.shape)
print("[9.1] Columns:", df_ctgan_train_v2.columns.tolist()[:10], "...")

# Declare only the label as discrete; all feature columns are modeled as continuous
discrete_columns_v2 = ["label"]
print("[9.1] Discrete columns for CTGAN v2:", discrete_columns_v2)

# --- 3) Instantiate CTGAN v2 ---
ctgan_v2 = CTGANClass(
    embedding_dim=128,
    generator_dim=(256, 256),
    discriminator_dim=(256, 256),
    batch_size=1024,
    epochs=10,
    pac=1,
    verbose=True,
)

# --- 4) Train ---
print("[9.1] Start training CTGAN v2 on df_ctgan_mix_200k (200k rows)...")
ctgan_v2.fit(
    df_ctgan_train_v2,
    discrete_columns=discrete_columns_v2
)
print("[9.1] CTGAN v2 training finished.")

# --- 5) Save model ---
os.makedirs("models", exist_ok=True)
ctgan_v2_path = "models/ctgan_safe_top27_mix200k.pkl"

with open(ctgan_v2_path, "wb") as f:
    pickle.dump(ctgan_v2, f)

print("[9.1] Saved CTGAN v2 model ->", ctgan_v2_path)

[9.1] Using ctgan.synthesizers.CTGAN
[9.1] Mixed CTGAN training table: (200000, 28)
[9.1] CTGAN v2 training view: (200000, 28)
[9.1] Columns: ['creat_type_cd', 'f_cat_uniq', 'f_refresh_sum', 'slot_id', 'f_rows', 'f_up_sum', 'f_dislike_sum', 'f_refresh_mean', 'u_refreshTimes', 'u_newsCatInterestsST_len'] ...
[9.1] Discrete columns for CTGAN v2: ['label']
[9.1] Start training CTGAN v2 on df_ctgan_mix_200k (200k rows)...


Gen. (0.22) | Discrim. (-0.19): 100%|███████████| 10/10 [02:03<00:00, 12.35s/it]

[9.1] CTGAN v2 training finished.
[9.1] Saved CTGAN v2 model -> models/ctgan_safe_top27_mix200k.pkl





## 9.2 Generate 200k Synthetic Samples from CTGAN v2 (Mixed-Trained)

In this section, we use the mixed-trained CTGAN v2 model (trained on 100k real + 100k CTGAN synthetic) to generate a fresh batch of 200,000 synthetic samples.

Steps:

1. Load or reuse the trained `ctgan_v2` model from Section 9.1.
2. Call the model's `sample` method to draw 200k rows with the full Safe Top-27 schema plus the label.
3. Compute the synthetic CTR and check basic statistics.
4. Save the new synthetic table for later utility and fidelity analysis:

   - `synthetic/ctgan_v2_safe_top27_200k.parquet`
   - `synthetic/ctgan_v2_safe_top27_200k.csv`

This second synthetic batch ("CTGAN v2 synthetic") will be compared against:
- the original CTGAN synthetic 200k, and
- the mixed real+synthetic setups in subsequent utility experiments.
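
The "basic statistics" in step 3 can include a range check, since columns that CTGAN models as continuous may produce values outside anything seen in the real data (the preview at the end of this section contains `f_refresh_sum = -4`, for example). The sketch below runs on toy data; `out_of_range_share` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def out_of_range_share(real: pd.DataFrame, syn: pd.DataFrame) -> pd.Series:
    """Per-column share of synthetic values outside the real [min, max] range."""
    shared = [c for c in real.columns if c in syn.columns]
    lo, hi = real[shared].min(), real[shared].max()
    outside = syn[shared].lt(lo, axis=1) | syn[shared].gt(hi, axis=1)
    return outside.mean().sort_values(ascending=False)

# Toy frames for illustration only.
real_toy = pd.DataFrame({"f_refresh_sum": [0, 3, 120, 45], "f_rows": [1, 2, 12, 4]})
syn_toy = pd.DataFrame({"f_refresh_sum": [1, 387, -4, 112], "f_rows": [1, 12, 2, 4]})

violations = out_of_range_share(real_toy, syn_toy)
print(violations)
```

In the notebook, the same call could be applied to `df_train_real[safe_top_feats]` and the sampled table, assuming both are in memory.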

In [31]:
# === 9.2 Generate 200k Synthetic Samples from CTGAN v2 ===

import os
import pickle
import pandas as pd

# 1) Ensure we have the trained ctgan_v2 object
if "ctgan_v2" not in globals():
    # If notebook was restarted, reload from disk
    model_path = "models/ctgan_safe_top27_mix200k.pkl"
    if not os.path.exists(model_path):
        raise FileNotFoundError(
            f"{model_path} not found. Please re-run Section 9.1 to train and save ctgan_v2."
        )
    with open(model_path, "rb") as f:
        ctgan_v2 = pickle.load(f)
    print("[9.2] Reloaded ctgan_v2 from disk:", model_path)
else:
    print("[9.2] Using ctgan_v2 from memory.")

# 2) Number of synthetic rows to generate
N_SYN_V2 = 200_000
print(f"[9.2] Generating {N_SYN_V2:,} synthetic rows from CTGAN v2...")

# 3) Sample from CTGAN v2
df_ctgan_v2_syn = ctgan_v2.sample(N_SYN_V2)

print("[9.2] Raw CTGAN v2 synthetic shape:", df_ctgan_v2_syn.shape)
print("[9.2] Columns:", df_ctgan_v2_syn.columns.tolist()[:10], "...")

# 4) Basic sanity checks: ensure Safe Top-27 + label all exist
needed_cols = safe_top_feats + ["label"]
missing = [c for c in needed_cols if c not in df_ctgan_v2_syn.columns]
if missing:
    raise RuntimeError(f"[9.2] Missing expected columns in CTGAN v2 output: {missing}")

# 5) Restrict to Safe Top-27 + label in a stable column order
df_ctgan_v2_syn = df_ctgan_v2_syn[needed_cols].copy()

# 6) Compute synthetic CTR
ctr_v2 = float(df_ctgan_v2_syn["label"].mean())
print("[9.2] CTGAN v2 synthetic CTR (label=1):", round(ctr_v2, 5))

# 7) Create output folder
os.makedirs("synthetic", exist_ok=True)

# 8) Save to Parquet and CSV
parquet_path_v2 = "synthetic/ctgan_v2_safe_top27_200k.parquet"
csv_path_v2     = "synthetic/ctgan_v2_safe_top27_200k.csv"

df_ctgan_v2_syn.to_parquet(parquet_path_v2, index=False)
df_ctgan_v2_syn.to_csv(csv_path_v2, index=False)

print("[9.2] Saved CTGAN v2 synthetic ->", parquet_path_v2)
print("[9.2] Saved CTGAN v2 synthetic ->", csv_path_v2)

# 9) Quick preview
df_ctgan_v2_syn.head()

[9.2] Using ctgan_v2 from memory.
[9.2] Generating 200,000 synthetic rows from CTGAN v2...
[9.2] Raw CTGAN v2 synthetic shape: (200000, 28)
[9.2] Columns: ['creat_type_cd', 'f_cat_uniq', 'f_refresh_sum', 'slot_id', 'f_rows', 'f_up_sum', 'f_dislike_sum', 'f_refresh_mean', 'u_refreshTimes', 'u_newsCatInterestsST_len'] ...
[9.2] CTGAN v2 synthetic CTR (label=1): 0.4185
[9.2] Saved CTGAN v2 synthetic -> synthetic/ctgan_v2_safe_top27_200k.parquet
[9.2] Saved CTGAN v2 synthetic -> synthetic/ctgan_v2_safe_top27_200k.csv


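| | creat_type_cd | f_cat_uniq | f_refresh_sum | slot_id | f_rows | f_up_sum | f_dislike_sum | f_refresh_mean | u_refreshTimes | u_newsCatInterestsST_len | ... | adv_id | task_id | inter_type_cd | hispace_app_tags | spread_app_id | app_second_class | ad_click_list_v002_uniq | ad_click_list_v002_len | f_hour_cos | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 6 | 1 | 44 | 1 | 12 | 40 | -0.026543 | 0 | 0 | ... | 18404 | 19264 | 4 | 50 | 260 | 15 | 5 | 5 | -0.089975 | 1 |
| 1 | 3 | 72 | 387 | 46 | 12 | 242 | 350 | 6.980751 | 5 | 5 | ... | 17883 | 13882 | 5 | 47 | 168 | 23 | 5 | 5 | -0.885994 | 0 |
| 2 | 3 | 18 | -4 | 38 | 2 | 106 | 46 | -0.049542 | 0 | 5 | ... | 12691 | 23042 | 5 | 39 | 250 | 15 | 5 | 5 | -0.739881 | 1 |
| 3 | 10 | 6 | 1 | 16 | 0 | 5 | 52 | -0.035839 | 2 | 1 | ... | 11363 | 21103 | 4 | 16 | 156 | 14 | 1 | 1 | -0.893324 | 0 |
| 4 | 8 | 14 | 112 | 47 | 4 | 83 | 47 | 4.977116 | 5 | 0 | ... | 12849 | 18984 | 4 | 41 | 261 | 15 | 5 | 5 | -0.685563 | 1 |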
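
The label rate shifts markedly from one generation to the next. The figures logged in Sections 9.0 and 9.2 can be lined up as follows; this is plain arithmetic on the printed values, with no new measurements.

```python
# CTRs copied from the logs in Sections 9.0 and 9.2.
ctr_by_table = {
    "real train pool": 0.01552159439438357,
    "CTGAN v1 synthetic": 0.046655,
    "CTGAN v2 training mix": 0.03091,
    "CTGAN v2 synthetic": 0.4185,
}

real_ctr = ctr_by_table["real train pool"]
ratios = {name: ctr / real_ctr for name, ctr in ctr_by_table.items()}

for name, ctr in ctr_by_table.items():
    print(f"{name:<22s} CTR={ctr:.5f}  ratio_to_real={ratios[name]:.1f}x")
```

CTGAN v2's output is roughly 27 times as click-heavy as the real pool and about 13.5 times as click-heavy as its own training table (0.4185 vs. 0.03091). The notebook does not diagnose the cause, but the shift is worth keeping in mind when reading Section 9.3, where `scale_pos_weight` is derived from the training CTR.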


## 9.3 Utility Evaluation for CTGAN v2

In this section, we evaluate the downstream predictive utility of the
second CTGAN model (CTGAN v2), which was trained on a 50% real + 50%
synthetic mixture of 200k samples.

We consider two training setups, both evaluated on the held-out real
validation set (`X_val_real`, `y_val_real`):

1. **CTGAN v2 synthetic only (200k)**  
   - Train LightGBM on 200,000 rows generated by CTGAN v2  
   - Test on the real validation set

2. **Mixed 100k real + 100k CTGAN v2 synthetic**  
   - Train LightGBM on 100,000 real rows (from the original training split)
     plus 100,000 CTGAN v2 synthetic rows  
   - Test on the same real validation set

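These experiments indicate whether the second-round CTGAN training
(“bootstrapping” on synthetic data) improves or degrades predictive
performance compared with using real data alone or first-round CTGAN
samples. Note that this section trains only the two setups above; the
real-only and first-round CTGAN baselines are not retrained here.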
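
The evaluation function in the next cell sets `scale_pos_weight` to the negative-to-positive ratio of the training labels, so the weight depends heavily on the CTR of whichever table is used for training. A small sketch of that formula, using CTRs logged earlier (the helper name `neg_pos_ratio` and its input check are illustrative additions):

```python
def neg_pos_ratio(pos_rate: float) -> float:
    """Negatives per positive, the quantity passed as scale_pos_weight."""
    if not 0.0 < pos_rate < 1.0:
        raise ValueError("pos_rate must be strictly between 0 and 1")
    return (1.0 - pos_rate) / pos_rate

spw_syn_v2 = neg_pos_ratio(0.4185)                 # CTGAN v2 synthetic CTR
spw_real = neg_pos_ratio(0.01552159439438357)      # real train-pool CTR

print(f"CTGAN v2 synthetic: {spw_syn_v2:.2f}")
print(f"real train pool   : {spw_real:.2f}")
```

The first value reproduces the `scale_pos_weight = 1.39` reported in the training log, while the real pool would call for a weight of roughly 63. A model trained on the v2 synthetic table is therefore tuned to a far more balanced label distribution than the real validation set it is scored on.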

In [34]:
# === 9.3 Utility: CTGAN v2 synthetic only vs Mixed 100k real + 100k v2 ===

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import (
    roc_auc_score, average_precision_score, log_loss,
    accuracy_score, precision_score, recall_score, f1_score
)

# Safety checks
needed = ["safe_top_feats", "df_train_real", "X_val_real", "y_val_real"]
for v in needed:
    if v not in globals():
        raise RuntimeError(f"{v} not found. Please run Section 9.0 first.")


def evaluate_lgb(train_X, train_y, valid_X, valid_y, title):
    """
    Train a LightGBM model and evaluate on the real validation set.
    """
    pos_rate = train_y.mean()
    scale_pos_weight = (1.0 - pos_rate) / pos_rate

    params = {
        "objective": "binary",
        "metric": ["auc", "average_precision"],
        "learning_rate": 0.08,
        "num_leaves": 127,
        "max_depth": -1,
        "min_child_samples": 100,
        "subsample": 0.8,
        "subsample_freq": 1,
        "colsample_bytree": 0.8,
        "lambda_l2": 1.0,
        "scale_pos_weight": scale_pos_weight,
        "n_jobs": -1,
        "seed": 42,
    }

    dtrain = lgb.Dataset(train_X, label=train_y)
    dvalid = lgb.Dataset(valid_X, label=valid_y, reference=dtrain)

    print(f"\n[{title}] Training LightGBM...")
    print(f"[{title}] Train shape: {train_X.shape}, CTR: {train_y.mean():.5f}")
    print(f"[{title}] scale_pos_weight = {scale_pos_weight:.2f}")

    model = lgb.train(
        params=params,
        train_set=dtrain,
        num_boost_round=2000,
        valid_sets=[dtrain, dvalid],
        valid_names=["train", "valid"],
        callbacks=[
            lgb.early_stopping(stopping_rounds=100),
            lgb.log_evaluation(period=100),
        ],
    )

    # Predict on the real validation set
    y_proba = model.predict(valid_X, num_iteration=model.best_iteration)
    y_pred = (y_proba >= 0.5).astype(int)

    # Metrics (zero_division=0 keeps threshold metrics defined, without
    # warnings, when no positives are predicted)
    metrics = {
        "ROC-AUC": roc_auc_score(valid_y, y_proba),
        "PR-AUC": average_precision_score(valid_y, y_proba),
        "LogLoss": log_loss(valid_y, y_proba),
        "Accuracy": accuracy_score(valid_y, y_pred),
        "Precision": precision_score(valid_y, y_pred, zero_division=0),
        "Recall": recall_score(valid_y, y_pred, zero_division=0),
        "F1": f1_score(valid_y, y_pred, zero_division=0),
    }

    print(f"\n=== {title} @ threshold=0.50 ===")
    for k, v in metrics.items():
        print(f"{k:<10s}: {v:.4f}")

    return metrics


# ---------------------------------------------------------
# 1) CTGAN v2 synthetic only (200k)
# ---------------------------------------------------------
df_ctgan_v2 = pd.read_parquet("synthetic/ctgan_v2_safe_top27_200k.parquet")
X_syn_v2 = df_ctgan_v2[safe_top_feats].copy()
y_syn_v2 = df_ctgan_v2["label"].astype(int).copy()

metrics_ctgan_v2_only = evaluate_lgb(
    train_X=X_syn_v2,
    train_y=y_syn_v2,
    valid_X=X_val_real[safe_top_feats],
    valid_y=y_val_real.astype(int),
    title="CTGAN v2 synthetic only (200k → real valid)",
)


# ---------------------------------------------------------
# 2) Mixed 100k real + 100k CTGAN v2 synthetic
# ---------------------------------------------------------
real_100k = df_train_real.sample(n=100_000, random_state=42)
syn_100k  = df_ctgan_v2.sample(n=100_000, random_state=42)

X_mix = pd.concat(
    [real_100k[safe_top_feats], syn_100k[safe_top_feats]],
    axis=0
).reset_index(drop=True)

y_mix = pd.concat(
    [real_100k["label"].astype(int), syn_100k["label"].astype(int)],
    axis=0
).reset_index(drop=True)

metrics_ctgan_v2_mix = evaluate_lgb(
    train_X=X_mix,
    train_y=y_mix,
    valid_X=X_val_real[safe_top_feats],
    valid_y=y_val_real.astype(int),
    title="Mixed 100k real + 100k CTGAN v2 (→ real valid)",
)


# ---------------------------------------------------------
# 3) Put results into a small summary table
# ---------------------------------------------------------
summary_9_3 = pd.DataFrame([
    {"Model": "CTGAN v2 synthetic only (200k)", **metrics_ctgan_v2_only},
    {"Model": "Mixed 100k real + 100k CTGAN v2", **metrics_ctgan_v2_mix},
])

summary_9_3


[CTGAN v2 synthetic only (200k → real valid)] Training LightGBM...
[CTGAN v2 synthetic only (200k → real valid)] Train shape: (200000, 27), CTR: 0.41850
[CTGAN v2 synthetic only (200k → real valid)] scale_pos_weight = 1.39
[LightGBM] [Info] Number of positive: 83700, number of negative: 116300
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007950 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4089
[LightGBM] [Info] Number of data points in the train set: 200000, number of used features: 27
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.418500 -> initscore=-0.328934
[LightGBM] [Info] Start training from score -0.328934
Training until validation scores don't improve for 100 rounds
[100]	train's auc: 0.981531	train's average_precision: 0.976421	valid's auc: 0.677047	valid's average_precision: 0.033924
Early stopping, best iteration is:
[19]	train's auc: 0.953017	train's average_precision: 0.94234

| | Model | ROC-AUC | PR-AUC | LogLoss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|
| 0 | CTGAN v2 synthetic only (200k) | 0.694414 | 0.036664 | 0.461192 | 0.79291 | 0.033674 | 0.445629 | 0.062617 |
| 1 | Mixed 100k real + 100k CTGAN v2 | 0.770494 | 0.075537 | 0.099023 | 0.977219 | 0.166107 | 0.116339 | 0.136838 |
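
Two reference points help when reading the summary table on data this imbalanced: a random ranker scores a PR-AUC close to the positive rate, and a classifier that always predicts "no click" reaches an accuracy of one minus the positive rate. The sketch below applies both to the values reported above; it performs no new training.

```python
# Values copied from the Section 9.0 log and the summary table above.
val_ctr = 0.01552142395564079  # CTR of the real validation set
results = {
    "CTGAN v2 synthetic only (200k)": {"PR-AUC": 0.036664, "Accuracy": 0.79291},
    "Mixed 100k real + 100k CTGAN v2": {"PR-AUC": 0.075537, "Accuracy": 0.977219},
}

# Baselines: always-negative accuracy and PR-AUC lift over a random ranker.
majority_accuracy = 1.0 - val_ctr
pr_lift = {name: m["PR-AUC"] / val_ctr for name, m in results.items()}

print(f"always-negative accuracy: {majority_accuracy:.4f}")
for name, m in results.items():
    print(f"{name:<32s} PR-AUC lift={pr_lift[name]:.2f}x  "
          f"accuracy gap={m['Accuracy'] - majority_accuracy:+.4f}")
```

Both setups rank better than chance on PR-AUC, with the mixed setup at roughly twice the lift of the synthetic-only one, yet both fall below the always-negative accuracy of about 0.9845. Accuracy at a 0.5 threshold is therefore a weak indicator here, and ROC-AUC and PR-AUC are the more informative columns.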
