# Phase 1 — Hybrid Data Factory

**Strategy D: Hybrid Emitter + nvcc Validation**

| Source | Speed | Purpose |
|--------|-------|---------|
| **Pure-Python PTX Emitter** (90%) | ~50k pairs/min | Bulk training data — no GPU/nvcc needed |
| **nvcc compilation** (10%) | ~5 pairs/sec | Real compiler patterns + validation |
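
As a back-of-the-envelope check, the rates in the table can be converted into wall-clock estimates for a 100,000-pair run with a 90/10 split. The sketch below only does that arithmetic; the two rates are the approximate figures quoted in the table, not measurements, and `estimate_minutes` is a hypothetical helper introduced here for illustration.

```python
# Approximate throughput figures quoted in the table (not measured here).
EMITTER_PAIRS_PER_MIN = 50_000
NVCC_PAIRS_PER_SEC = 5

def estimate_minutes(total_pairs, emitter_frac):
    """Return (emitter_minutes, nvcc_minutes) for a given emitter/nvcc split."""
    emitter_pairs = int(total_pairs * emitter_frac)
    nvcc_pairs = total_pairs - emitter_pairs
    emitter_min = emitter_pairs / EMITTER_PAIRS_PER_MIN
    nvcc_min = nvcc_pairs / NVCC_PAIRS_PER_SEC / 60
    return emitter_min, nvcc_min

emitter_min, nvcc_min = estimate_minutes(100_000, 0.90)
print(f"emitter: ~{emitter_min:.1f} min, nvcc: ~{nvcc_min:.1f} min")
```

Under these assumed rates, the nvcc share dominates the run time even though it contributes only a tenth of the pairs.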

The emitter generates deterministic PTX instruction sequences from AST nodes
using a mini-compiler approach (register allocation, load/store generation,
expression codegen). Output goes through the same `normalize_ptx()` pipeline
as real nvcc output, so the model sees a unified token format.
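
To make the idea of a unified token format concrete, here is a minimal sketch of what a PTX normalizer could look like. It is an assumption for illustration only: `toy_normalize` is a hypothetical stand-in, not the project's `normalize_ptx()`, whose actual rules are not shown in this notebook.

```python
def toy_normalize(ptx_text):
    """Hypothetical stand-in for normalize_ptx(); NOT the project's implementation."""
    tokens = []
    for line in ptx_text.splitlines():
        line = line.split("//", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("."):   # skip blank lines and directives
            continue
        line = line.rstrip(";")                # drop the statement terminator
        line = line.replace(",", " , ")        # make commas standalone tokens
        tokens.extend(line.split())
    return " ".join(tokens)

raw = """
// generated by a compiler
.version 7.0
mov.u32 %r0, %tid.x;   // thread index
add.s32 %r1, %r0, 1;
"""
print(toy_normalize(raw))
```

Passing both emitter output and compiler output through one function like this is what keeps the two sources in the same whitespace-separated token stream.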

**Runtime**: A CPU-only runtime is sufficient. Neither the emitter nor `nvcc -ptx` compilation needs a GPU.

In [1]:
# --- Run this cell first on Google Colab to clone the repo ---
import os
if os.path.exists("/content"):
    %cd /content
    !rm -rf /content/DeepPTX
    !git clone https://github.com/ns-1456/DeepPTX.git /content/DeepPTX
    %cd /content/DeepPTX

fatal: could not create leading directories of '/content/DeepPTX': Read-only file system
[Errno 2] No such file or directory: '/content/DeepPTX'
/Users/ns/Projects for Resume/Neural PTX Decompiler/notebooks


## Setup

In [2]:
!pip install -q pyarrow tqdm

# Optional: mount Google Drive for persistent storage
# from google.colab import drive
# drive.mount("/content/drive")
# OUTPUT_DIR = "/content/drive/MyDrive/NeuralPTX"

OUTPUT_DIR = "."  # saves to repo root; uncomment above for Drive



In [3]:
# ======================= Configuration =======================
TARGET_PAIRS   = 100_000   # total pairs to generate
EMITTER_FRAC   = 0.90      # 90% from pure-Python emitter
NVCC_FRAC      = 0.10      # 10% from real nvcc (set to 0 if no nvcc)
NVCC_BATCH     = 200       # .cu files per nvcc round
# =============================================================

EMITTER_TARGET = int(TARGET_PAIRS * EMITTER_FRAC)
NVCC_TARGET    = TARGET_PAIRS - EMITTER_TARGET

print(f"Target: {TARGET_PAIRS:,} total")
print(f"  Emitter: {EMITTER_TARGET:,} pairs (instant)")
print(f"  nvcc:    {NVCC_TARGET:,} pairs (slower, real compiler)")

Target: 100,000 total
  Emitter: 90,000 pairs (instant)
  nvcc:    10,000 pairs (slower, real compiler)


## Imports

In [4]:
import sys, os, random, time

REPO_ROOT = "/content/DeepPTX" if os.path.exists("/content/DeepPTX") else os.path.abspath("..")
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

import pandas as pd
from pathlib import Path
from tqdm.auto import tqdm

from ptx_decompiler.data import (
    parse_sexp,
    ast_to_cuda,
    normalize_ptx,
    PTXEmitter,
)
from ptx_decompiler.data.grammar import TIER_CLASSES, sample_tier

print("Imports OK")

Imports OK


## Part A — Pure-Python Emitter (90% of data)

The `PTXEmitter` translates AST nodes directly into PTX instructions
without calling `nvcc`. This is a deterministic mini-compiler that:

1. Emits thread-index prologue (`mov.u32 %r0, %ctaid.x; ...`)
2. Loads array variables via `ld.param.u64` + `ld.global.f32`
3. Recursively compiles expression nodes to PTX arithmetic
4. Stores result via `st.global.f32`
5. Normalizes output through the same pipeline as nvcc
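
The recursive compilation in steps 2 to 4 can be sketched with a toy emitter. This is a simplified illustration, not `PTXEmitter` itself: the `ToyEmitter` class, the opcode tables, and the `[A_ptr]`-style address placeholders are all assumptions made for this example, and the prologue, bounds check, and address arithmetic are omitted.

```python
# Hypothetical opcode tables for this sketch; the real PTXEmitter may differ.
UNARY = {"SIN": "sin.approx.f32"}
BINARY = {"ADD": "add.f32", "MUL": "mul.f32", "MIN": "min.f32"}

class ToyEmitter:
    def __init__(self):
        self.lines = []
        self.next_reg = 0

    def fresh(self):
        """Allocate the next unused float register (no reuse, no spilling)."""
        reg = f"%f{self.next_reg}"
        self.next_reg += 1
        return reg

    def compile(self, node):
        """Emit code for node and return the register holding its value."""
        if isinstance(node, str):  # leaf: an array variable such as "A"
            dst = self.fresh()
            self.lines.append(f"ld.global.f32 {dst}, [{node}_ptr];")
            return dst
        op, *args = node
        srcs = [self.compile(arg) for arg in args]  # operands first
        dst = self.fresh()
        opcode = UNARY[op] if len(args) == 1 else BINARY[op]
        self.lines.append(f"{opcode} {dst}, {', '.join(srcs)};")
        return dst

toy = ToyEmitter()
result = toy.compile(("ADD", ("SIN", "A"), "X"))
toy.lines.append(f"st.global.f32 [O_ptr], {result};")
print("\n".join(toy.lines))
```

Because registers are handed out in a fixed order during a depth-first walk, the same AST always yields the same instruction sequence, which is the sense in which such an emitter is deterministic.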

In [5]:
random.seed(42)
emitter = PTXEmitter()

emitter_data = []
emitter_fails = 0

pbar = tqdm(total=EMITTER_TARGET, desc="Emitter", unit="pair")

while len(emitter_data) < EMITTER_TARGET:
    tier_id, gen = sample_tier()
    ast_node = gen.generate()
    ast_sexp = ast_node.to_sexp()
    cuda_source = ast_to_cuda(ast_sexp)

    try:
        ptx_norm = emitter.emit_normalized(ast_node)
        if not ptx_norm.strip():
            emitter_fails += 1
            continue
        emitter_data.append({
            "ptx_normalized": ptx_norm,
            "ast_sexp": ast_sexp,
            "cuda_source": cuda_source,
            "tier": tier_id,
            "complexity_score": gen.complexity_score,
            "source": "emitter",
        })
        pbar.update(1)
    except Exception:
        # Count any emitter error as a failure and sample a new AST.
        emitter_fails += 1

pbar.close()
print(f"\nEmitter: {len(emitter_data):,} pairs | {emitter_fails} failures")



Emitter: 90,000 pairs | 0 failures


## Part B — nvcc Compilation (10% of data)

For real compiler diversity: generate CUDA, compile with `nvcc -ptx -O0`,
normalize the output. If nvcc is not available (e.g., on a Mac), this section
is skipped and the emitter provides 100% of the data.

In [6]:
import subprocess, tempfile, shutil

# ---- Detect nvcc ----
HAS_NVCC = shutil.which("nvcc") is not None

if HAS_NVCC:
    WORK_DIR = tempfile.mkdtemp(prefix="ptx_batch_")
    NUM_PARALLEL = max(os.cpu_count() or 2, 2) * 3
    
    # Sanity check
    test_cu = Path(WORK_DIR) / "test.cu"
    test_ptx = Path(WORK_DIR) / "test.ptx"
    test_cu.write_text('extern "C" __global__ void k(float* a) { a[0] = 1.0f; }')
    r = subprocess.run(["nvcc", "-ptx", "-O0", str(test_cu), "-o", str(test_ptx)],
                       capture_output=True, text=True, timeout=30)
    if r.returncode != 0:
        print(f"WARNING: nvcc sanity check failed — falling back to emitter-only")
        print(f"stderr: {r.stderr[:200]}")
        HAS_NVCC = False
    else:
        print(f"nvcc OK ({test_ptx.stat().st_size} bytes)")
    test_cu.unlink(missing_ok=True)
    test_ptx.unlink(missing_ok=True)
    
    # Detect arch
    NVCC_ARCH = []
    if HAS_NVCC:
        test_cu.write_text('extern "C" __global__ void k(float* a) { a[0] = 1.0f; }')
        r = subprocess.run(["nvcc", "-ptx", "-O0", "-arch=sm_75", str(test_cu), "-o", str(test_ptx)],
                           capture_output=True, text=True, timeout=30)
        if r.returncode == 0:
            NVCC_ARCH = ["-arch=sm_75"]
            print("Using -arch=sm_75 (T4)")
        else:
            print("Using default arch")
        test_cu.unlink(missing_ok=True)
        test_ptx.unlink(missing_ok=True)
else:
    print("nvcc not found — using emitter-only mode (100% synthetic PTX)")
    print("This is fine! The emitter produces realistic, training-ready PTX.")

nvcc not found — using emitter-only mode (100% synthetic PTX)
This is fine! The emitter produces realistic, training-ready PTX.


In [7]:
nvcc_data = []

if HAS_NVCC and NVCC_TARGET > 0:
    nvcc_fails = 0

    def compile_batch_parallel(batch, work_dir, max_concurrent):
        """Write .cu files, fire parallel nvcc processes, read .ptx results."""
        n = len(batch)
        cu_paths, ptx_paths = [], []
        for i, (_, cuda_src, _, _) in enumerate(batch):
            cu = Path(work_dir) / f"{i}.cu"
            ptx = Path(work_dir) / f"{i}.ptx"
            cu.write_text(cuda_src, encoding="utf-8")
            cu_paths.append(cu)
            ptx_paths.append(ptx)

        # Fire in waves
        for start in range(0, n, max_concurrent):
            end = min(start + max_concurrent, n)
            procs = []
            for i in range(start, end):
                cmd = ["nvcc", "-ptx", "-O0"] + NVCC_ARCH + [str(cu_paths[i]), "-o", str(ptx_paths[i])]
                p = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                procs.append((i, p))
            for i, p in procs:
                try:
                    p.wait(timeout=30)
                except subprocess.TimeoutExpired:
                    p.kill()

        results = []
        for i, (ast_sexp, cuda_src, tier_id, score) in enumerate(batch):
            if ptx_paths[i].exists() and ptx_paths[i].stat().st_size > 0:
                ptx_raw = ptx_paths[i].read_text(encoding="utf-8")
                ptx_norm = normalize_ptx(ptx_raw)
                if ptx_norm.strip():
                    results.append({
                        "ptx_normalized": ptx_norm,
                        "ast_sexp": ast_sexp,
                        "cuda_source": cuda_src,
                        "tier": tier_id,
                        "complexity_score": score,
                        "source": "nvcc",
                    })
            cu_paths[i].unlink(missing_ok=True)
            ptx_paths[i].unlink(missing_ok=True)
        return results

    pbar = tqdm(total=NVCC_TARGET, desc="nvcc", unit="pair")

    while len(nvcc_data) < NVCC_TARGET:
        need = min(NVCC_BATCH, int((NVCC_TARGET - len(nvcc_data)) * 1.15) + 10)
        batch = []
        for _ in range(need):
            tier_id, gen = sample_tier()
            ast = gen.generate()
            ast_sexp = ast.to_sexp()
            cuda_source = ast_to_cuda(ast_sexp)
            batch.append((ast_sexp, cuda_source, tier_id, gen.complexity_score))

        results = compile_batch_parallel(batch, WORK_DIR, NUM_PARALLEL)
        nvcc_fails += len(batch) - len(results)

        for row in results:
            if len(nvcc_data) >= NVCC_TARGET:
                break
            nvcc_data.append(row)
        pbar.n = len(nvcc_data)
        pbar.refresh()

    pbar.close()
    print(f"\nnvcc: {len(nvcc_data):,} pairs | {nvcc_fails} failures")
else:
    print("Skipping nvcc — emitter provides all data.")

Skipping nvcc — emitter provides all data.
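
The cell above launches compiler processes in fixed waves, so each wave waits for its slowest process before the next one starts. One possible alternative, sketched below, is a bounded worker pool that starts a new process as soon as a slot frees up. `run_all` is a hypothetical helper, and the commands are Python stand-ins so the sketch runs without nvcc; in practice each command would be an `nvcc -ptx` invocation like the one built in the cell above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_all(commands, max_workers=4, timeout=30):
    """Run commands with at most max_workers in flight.

    Returns one exit code per command, or None if that command timed out.
    """
    def run_one(cmd):
        try:
            done = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL, timeout=timeout)
            return done.returncode
        except subprocess.TimeoutExpired:
            return None  # subprocess.run kills the child on timeout

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so codes[i] belongs to commands[i].
        return list(pool.map(run_one, commands))

# Stand-in commands (assumption): exit with code 0 or 1 instead of compiling.
commands = [[sys.executable, "-c", f"raise SystemExit({i % 2})"] for i in range(6)]
codes = run_all(commands)
print(codes)
```

Threads are sufficient here because each worker spends its time blocked on a child process rather than running Python code.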


## Merge & Shuffle

In [8]:
data = emitter_data + nvcc_data
random.shuffle(data)

df = pd.DataFrame(data)
print(f"\nTotal dataset: {len(df):,} pairs")
print(f"\nSource breakdown:")
print(df["source"].value_counts())
print(f"\nTier distribution:")
print(df["tier"].value_counts().sort_index())


Total dataset: 90,000 pairs

Source breakdown:
source
emitter    90000
Name: count, dtype: int64

Tier distribution:
tier
1    22307
2    22516
3    18105
4    13544
5     6178
6     4615
7     2735
Name: count, dtype: int64
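
The raw counts are easier to read as shares of the dataset. The sketch below recomputes them from the numbers printed above; the dictionary is copied by hand from that output rather than read from `df`, so it runs on its own.

```python
# Tier counts copied from the value_counts() output above.
tier_counts = {1: 22307, 2: 22516, 3: 18105, 4: 13544, 5: 6178, 6: 4615, 7: 2735}

total = sum(tier_counts.values())
shares = {tier: 100 * count / total for tier, count in tier_counts.items()}

for tier, share in shares.items():
    print(f"tier {tier}: {share:5.1f}%")

# Combined share of the four lowest tiers.
low_tier_share = sum(shares[t] for t in (1, 2, 3, 4))
print(f"tiers 1-4 combined: {low_tier_share:.1f}%")
```

Tiers 1 to 4 account for roughly 85% of the pairs, so the three highest tiers are comparatively rare in this run.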


## Save to Parquet

In [9]:
out_path = Path(OUTPUT_DIR) / "dataset_100k.parquet"
df.to_parquet(out_path, index=False)
print(f"Saved to {out_path} ({out_path.stat().st_size / 1e6:.1f} MB)")
df.head(3)

Saved to dataset_100k.parquet (7.8 MB)


|   | ptx_normalized | ast_sexp | cuda_source | tier | complexity_score | source |
|---|----------------|----------|-------------|------|------------------|--------|
| 0 | `mov.u32 %r0 , %s0.x mov.u32 %r1 , %s1.x mov.u3...` | `(SIN (EXP A))` | `extern "C" __global__ void k(float* A, float* ...` | 5 | 5.0 | emitter |
| 1 | `mov.u32 %r0 , %s0.x mov.u32 %r1 , %s1.x mov.u3...` | `(ADD Y X)` | `extern "C" __global__ void k(float* A, float* ...` | 1 | 1.0 | emitter |
| 2 | `mov.u32 %r0 , %s0.x mov.u32 %r1 , %s1.x mov.u3...` | `(MIN Y X)` | `extern "C" __global__ void k(float* A, float* ...` | 1 | 1.0 | emitter |


## Validation

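1. **Round-trip check**: Parse each stored `ast_sexp`, re-render it to CUDA, and confirm the result matches the stored `cuda_source`
2. **Emitter sanity**: Confirm that no row has empty PTX and inspect the distribution of whitespace-separated token counts
3. **Cross-check** (if nvcc available): Compare emitter and nvcc PTX for the same AST, by exact match and by token count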
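
The round-trip idea in step 1 can be illustrated with a toy parser and renderer. Everything here is a sketch under stated assumptions: `toy_parse`, `toy_render`, and `toy_to_sexp` are hypothetical stand-ins for `parse_sexp` and `CUDARenderer`, and the infix rendering of `ADD` is a guess at the style rather than the project's actual output.

```python
def toy_parse(sexp):
    """Parse '(SIN (EXP A))'-style text into nested tuples with string leaves."""
    tokens = sexp.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            items = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return tuple(items), pos + 1  # skip the closing parenthesis
        return tokens[pos], pos + 1

    tree, _ = read(0)
    return tree

def toy_to_sexp(tree):
    """Inverse of toy_parse: turn nested tuples back into S-expression text."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(toy_to_sexp(t) for t in tree) + ")"

FUNCS = {"SIN": "sinf", "EXP": "expf"}  # function-call operators
INFIX = {"ADD": "+", "MUL": "*"}        # assumed infix rendering

def toy_render(tree):
    """Render a parsed tree as a C expression indexed by i."""
    if isinstance(tree, str):
        return f"{tree}[i]"
    op, *args = tree
    rendered = [toy_render(arg) for arg in args]
    if op in INFIX:
        return f"({rendered[0]} {INFIX[op]} {rendered[1]})"
    return f"{FUNCS[op]}({', '.join(rendered)})"

sexp = "(SIN (EXP A))"
tree = toy_parse(sexp)
print(toy_to_sexp(tree) == sexp)  # text -> tree -> text is lossless
print(toy_render(tree))
```

A mismatch in either direction would mean the stored S-expression and the stored CUDA source no longer describe the same program, which is what the check is meant to catch.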

In [10]:
from ptx_decompiler.data.renderer import CUDARenderer

# --- Round-trip test ---
n_check = min(500, len(df))
rt_ok = 0
for idx in tqdm(range(n_check), desc="Round-trip", unit="pair"):
    row = df.iloc[idx]
    tree = parse_sexp(row["ast_sexp"])
    rendered = CUDARenderer().kernel_source(tree)
    if row["cuda_source"].strip() == rendered.strip():
        rt_ok += 1
print(f"\nRound-trip: {rt_ok}/{n_check} OK ({rt_ok/n_check*100:.1f}%)")

# --- PTX sanity ---
ptx_lengths = df["ptx_normalized"].str.split().str.len()
print(f"\nPTX token counts: min={ptx_lengths.min()}, median={ptx_lengths.median():.0f}, "
      f"max={ptx_lengths.max()}, mean={ptx_lengths.mean():.1f}")
print(f"Empty PTX rows: {(ptx_lengths == 0).sum()}")



Round-trip: 500/500 OK (100.0%)

PTX token counts: min=83, median=113, max=461, mean=114.1
Empty PTX rows: 0


In [11]:
# --- Cross-check: emitter vs nvcc on same ASTs (if nvcc available) ---
if HAS_NVCC:
    from ptx_decompiler.data import compile_cuda_to_ptx_silent
    
    N_CROSS = 50
    # Sample from the emitter rows only, capped at the number of such rows.
    emitter_rows = df[df["source"] == "emitter"]
    sample_rows = emitter_rows.sample(min(N_CROSS, len(emitter_rows)), random_state=42)
    match_count = 0
    similar_count = 0
    compared = 0  # samples that nvcc compiled successfully

    emitter_check = PTXEmitter()
    for _, row in tqdm(sample_rows.iterrows(), total=len(sample_rows), desc="Cross-check"):
        tree = parse_sexp(row["ast_sexp"])
        cuda_src = row["cuda_source"]

        # Emitter PTX
        emu_ptx = emitter_check.emit_normalized(tree)

        # nvcc PTX
        ptx_raw = compile_cuda_to_ptx_silent(cuda_src, opt_level="-O0")
        if ptx_raw is None:
            continue
        nvcc_ptx = normalize_ptx(ptx_raw)
        compared += 1

        # Compare
        emu_tokens = len(emu_ptx.split())
        nvcc_tokens = len(nvcc_ptx.split())

        if emu_ptx == nvcc_ptx:
            match_count += 1
            similar_count += 1
        elif abs(emu_tokens - nvcc_tokens) / max(emu_tokens, nvcc_tokens, 1) < 0.3:
            similar_count += 1

    denom = max(compared, 1)  # avoid division by zero if nothing compiled
    print(f"\nCross-check ({compared}/{len(sample_rows)} samples compiled):")
    print(f"  Exact match: {match_count} ({match_count/denom*100:.1f}%)")
    print(f"  Similar (±30% tokens): {similar_count} ({similar_count/denom*100:.1f}%)")
    print("  Note: differences are expected. The emitter uses deterministic codegen,")
    print("  while nvcc may reorder instructions or use different register allocation.")
else:
    print("nvcc not available — skipping cross-check.")
    print("The emitter produces self-consistent PTX that the model can learn from.")

nvcc not available — skipping cross-check.
The emitter produces self-consistent PTX that the model can learn from.
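
The comparison rule used in the cross-check cell can be isolated into a small function, which makes its behavior easier to see without nvcc. The sketch below mirrors that rule (exact string match, otherwise a relative token-count difference under 30%); `compare_ptx` and the sample strings are hypothetical and exist only for illustration.

```python
def compare_ptx(a, b, tol=0.30):
    """Classify two normalized PTX strings as 'exact', 'similar', or 'different'."""
    if a == b:
        return "exact"
    n_a, n_b = len(a.split()), len(b.split())
    rel_diff = abs(n_a - n_b) / max(n_a, n_b, 1)  # guard against empty inputs
    return "similar" if rel_diff < tol else "different"

one_op = "add.f32 %f2 , %f0 , %f1"
renamed = "add.f32 %f5 , %f3 , %f4"
two_ops = "mul.f32 %f2 , %f0 , %f1 add.f32 %f3 , %f2 , %f1"

print(compare_ptx(one_op, one_op))   # identical strings
print(compare_ptx(one_op, renamed))  # same length, different registers
print(compare_ptx(one_op, two_ops))  # 6 vs 12 tokens
```

Note that the "similar" bucket looks only at sequence length: two outputs with the same number of tokens but entirely different opcodes would still count as similar, so it is a coarse signal rather than a semantic comparison.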


In [12]:
# --- Quick look at a sample ---
print("=" * 60)
row = df.iloc[0]
print(f"AST:  {row['ast_sexp']}")
print(f"Tier: {row['tier']} | Source: {row['source']}")
print(f"\nCUDA:")
print(row['cuda_source'])
print(f"\nNormalized PTX ({len(row['ptx_normalized'].split())} tokens):")
print(row['ptx_normalized'][:500])

AST:  (SIN (EXP A))
Tier: 5 | Source: emitter

CUDA:
extern "C" __global__ void k(float* A, float* B, float* C, float* X, float* Y, float* O, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        O[i] = sinf(expf(A[i]));
    }
}

Normalized PTX (91 tokens):
mov.u32 %r0 , %s0.x mov.u32 %r1 , %s1.x mov.u32 %r2 , %s2.x mad.lo.s32 %r3 , %r0 , %r1 , %r2 ld.param.u32 %r4 , [k_param_6] setp.ge.s32 %p0 , %r3 , %r4 @%p0 bra EXIT ld.param.u64 %rd0 , [k_param_0] cvta.to.global.u64 %rd0 , %rd0 mul.wide.s32 %rd1 , %r3 , 4 add.s64 %rd1 , %rd0 , %rd1 ld.global.f32 %f0 , [%rd1] ex2.approx.f32 %f1 , %f0 sin.approx.f32 %f2 , %f1 ld.param.u64 %rd2 , [k_param_5] cvta.to.global.u64 %rd2 , %rd2 mul.wide.s32 %rd3 , %r3 , 4 add.s64 %rd3 , %rd2 , %rd3 st.global.f32 [%rd3] 
