# 🚀 ZeroLang Full Pipeline

**Automatic data collection + training in one notebook**

1. Collect 2000+ C→WAT training pairs
2. Train the Qwen2.5-Coder-14B model
3. Test and export

**Total time: ~3-4 hours**
- Data collection: ~2 hours
- Training: ~1-2 hours

**Requirements:**
- H100 GPU (or an A100 for the 7B model)
- Colab Pro+ recommended
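
The GPU requirement follows largely from model size. As a back-of-the-envelope sketch, the snippet below estimates memory for the weights alone, assuming 2 bytes per parameter (the test cell in this notebook loads the base model in float16) and nominal parameter counts of 7, 14, and 32 billion. Activations, adapter weights, and optimizer state need additional memory, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


# Nominal sizes only; the real checkpoints have slightly more parameters.
for name, size in [("qwen-coder-7b", 7), ("qwen-coder-14b", 14), ("qwen-coder-32b", 32)]:
    print(f"{name}: ~{weight_memory_gb(size):.0f} GB of weights")
```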

## ⚙️ Configuration

In [None]:
#@title Configuration { display-mode: "form" }

#@markdown ### Data Collection
TARGET_PAIRS = 2000  #@param {type:"integer"}
MAX_REPOS = 50  #@param {type:"integer"}

#@markdown ### Training
MODEL = "qwen-coder-14b"  #@param ["qwen-coder-7b", "qwen-coder-14b", "qwen-coder-32b"]
EPOCHS = 10  #@param {type:"integer"}
BATCH_SIZE = 8  #@param {type:"integer"}
MAX_LENGTH = 2048  #@param {type:"integer"}

#@markdown ### Output
SAVE_TO_DRIVE = True  #@param {type:"boolean"}

print(f"Target: {TARGET_PAIRS} pairs from {MAX_REPOS} repos")
print(f"Model: {MODEL}, Epochs: {EPOCHS}")

## 1️⃣ Setup Environment

In [None]:
# Check GPU
!nvidia-smi --query-gpu=name,memory.total --format=csv

import torch
print(f"\nPyTorch CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Install dependencies
!pip install -q transformers datasets peft accelerate bitsandbytes
!apt-get install -qq clang lld  # For C compilation

# Install wasm-tools (published by the Bytecode Alliance)
!curl -fLO https://github.com/bytecodealliance/wasm-tools/releases/download/v1.230.0/wasm-tools-1.230.0-x86_64-linux.tar.gz
!tar -xzf wasm-tools-1.230.0-x86_64-linux.tar.gz
!mv wasm-tools-1.230.0-x86_64-linux/wasm-tools /usr/local/bin/
!wasm-tools --version

In [None]:
# Clone repository
!git clone https://github.com/project-zero-git/zerolang.git
%cd zerolang

In [None]:
# Mount Google Drive (optional - for saving model)
if SAVE_TO_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    DRIVE_OUTPUT = '/content/drive/MyDrive/zerolang_models'
    !mkdir -p {DRIVE_OUTPUT}
    print(f"Models will be saved to: {DRIVE_OUTPUT}")

## 2️⃣ Data Collection

Collects C functions from the GitHub repositories listed below and compiles each one to WAT.
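
As a rough picture of what a single collection step involves, the sketch below compiles one C file to a WebAssembly module with clang and prints it as WAT with `wasm-tools print`. The helper names (`build_commands`, `compile_c_to_wat`) and the chosen flags are assumptions for illustration; `pipeline/generator.py` may work differently.

```python
import subprocess
from pathlib import Path


def build_commands(c_file: str, wasm_file: str) -> list[list[str]]:
    """Return the two commands of this sketch: C -> wasm, then wasm -> WAT."""
    compile_cmd = [
        "clang", "--target=wasm32", "-O2", "-nostdlib",
        "-Wl,--no-entry",    # library-style module, no main() required
        "-Wl,--export-all",  # keep every function visible in the output
        c_file, "-o", wasm_file,
    ]
    print_cmd = ["wasm-tools", "print", wasm_file]
    return [compile_cmd, print_cmd]


def compile_c_to_wat(c_file: str) -> str:
    """Compile one C file and return its WAT text.

    Assumes clang, a WebAssembly linker (wasm-ld), and wasm-tools are on PATH.
    """
    wasm_file = str(Path(c_file).with_suffix(".wasm"))
    compile_cmd, print_cmd = build_commands(c_file, wasm_file)
    subprocess.run(compile_cmd, check=True)
    result = subprocess.run(print_cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

With the toolchain from the setup section installed, `print(compile_c_to_wat("/tmp/test.c"))` would show the WAT for a file such as the `add` test written in the clang check cell.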

In [None]:
# Extended repository list for more data
REPOS = '''
# Algorithms & Data Structures
https://github.com/TheAlgorithms/C
https://github.com/fragglet/c-algorithms
https://github.com/attractivechaos/klib
https://github.com/srdja/Collections-C
https://github.com/troydhanson/uthash

# Cryptography
https://github.com/B-Con/crypto-algorithms
https://github.com/kokke/tiny-AES-c
https://github.com/ctz/cifra
https://github.com/983/SHA-256
https://github.com/983/Num

# String & Text
https://github.com/sheredom/utf8.h
https://github.com/antirez/sds
https://github.com/jwerle/murmurhash.c
https://github.com/skeeto/branchless-utf8

# JSON & Parsing
https://github.com/DaveGamble/cJSON
https://github.com/zserge/jsmn
https://github.com/kgabis/parson
https://github.com/cesanta/frozen

# Compression
https://github.com/lz4/lz4
https://github.com/richgel999/miniz
https://github.com/ebiggers/libdeflate

# Math & Numerical
https://github.com/nothings/stb
https://github.com/983/fft
https://github.com/skeeto/hash-prospector
https://github.com/lemire/clhash

# Utilities
https://github.com/antirez/linenoise
https://github.com/rxi/vec
https://github.com/rxi/map
https://github.com/rxi/log.c
https://github.com/skeeto/optparse
https://github.com/gingerBill/gb
https://github.com/mackron/dr_libs

# Embedded
https://github.com/cesanta/mongoose
https://github.com/nodejs/http-parser

# Additional algorithm repos
https://github.com/tezc/sc
https://github.com/tidwall/hashmap.c
https://github.com/sheredom/hashmap.h
https://github.com/tidwall/btree.c
https://github.com/antirez/rax
https://github.com/orangeduck/mpc
https://github.com/pervognsen/bitwise
https://github.com/clibs/buffer
https://github.com/clibs/list
https://github.com/michaelrsweet/mxml
'''

# Save to file
with open('pipeline/repos_extended.txt', 'w') as f:
    f.write(REPOS)

# Count non-empty lines that are not comments
repo_count = sum(1 for l in REPOS.splitlines() if l.strip() and not l.strip().startswith('#'))
print(f"Total repos: {repo_count}")

In [None]:
# Update generator.py clang path for Colab
!sed -i 's|/opt/homebrew/opt/llvm/bin/clang|clang|g' pipeline/generator.py

# Verify clang works with WASM target
!echo 'int add(int a, int b) { return a + b; }' > /tmp/test.c
!clang --target=wasm32 -c /tmp/test.c -o /tmp/test.o 2>&1 && echo "Clang WASM target OK" || echo "Clang WASM target check FAILED"

In [None]:
%%time
# Run data collection
print(f"Collecting data from repos (target: {TARGET_PAIRS} pairs)...")
print("This will take ~1-2 hours...\n")

!python pipeline/generator.py \
    -l pipeline/repos_extended.txt \
    -o data/colab_training.jsonl \
    2>&1 | tee data/collection.log | grep -E '(SUCCESS|Processing|pairs_generated)'

In [None]:
# Check collected data
!wc -l data/colab_training.jsonl

import json
with open('data/colab_training.jsonl') as f:
    pairs = [json.loads(l) for l in f if l.strip()]

print(f"\nCollected {len(pairs)} training pairs")
if pairs:  # avoid dividing by zero when collection produced nothing
    print(f"Avg instruction length: {sum(len(p['instruction']) for p in pairs)/len(pairs):.0f} chars")
    print(f"Avg WAT length: {sum(len(p['output']) for p in pairs)/len(pairs):.0f} chars")

In [None]:
# Split into train/val and convert to ChatML
!python pipeline/postprocess.py split data/colab_training.jsonl \
    --train data/train_colab.jsonl \
    --val data/val_colab.jsonl \
    --val-ratio 0.1

!python training/prepare_data.py data/train_colab.jsonl -o data/train_chatml_colab.jsonl -f chatml
!python training/prepare_data.py data/val_colab.jsonl -o data/val_chatml_colab.jsonl -f chatml

!wc -l data/*_colab.jsonl
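
The split and ChatML conversion performed by the cell above can be pictured with a small sketch. It assumes each record carries the `instruction` and `output` fields read earlier in this notebook and borrows the system prompt used in the test section; the function names and the exact ChatML layout are illustrative, and `pipeline/postprocess.py` and `training/prepare_data.py` may behave differently.

```python
import random

SYSTEM_PROMPT = (
    "You are ZeroLang, an AI that generates optimized WebAssembly (WAT) code. "
    "Output only valid WAT code."
)


def split_pairs(pairs, val_ratio=0.1, seed=0):
    """Shuffle a copy of the pairs and hold out val_ratio of them for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_ratio)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]


def to_chatml(pair):
    """Render one instruction/output pair as a ChatML-formatted string."""
    turns = [
        ("system", SYSTEM_PROMPT),
        ("user", pair["instruction"]),
        ("assistant", pair["output"]),
    ]
    return "".join(f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in turns)
```

A fixed seed keeps the held-out set stable across reruns, so validation numbers from different training runs remain comparable.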

## 3️⃣ Model Training

In [None]:
%%time
# Train model
print(f"Training {MODEL} for {EPOCHS} epochs...")
print("This will take ~1-2 hours...\n")

!python training/train_cloud.py \
    --model {MODEL} \
    --data data \
    --epochs {EPOCHS} \
    --batch-size {BATCH_SIZE} \
    --max-length {MAX_LENGTH} \
    --output models/zerolang-{MODEL}-colab

## 4️⃣ Test Model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Model mapping
BASE_MODELS = {
    "qwen-coder-7b": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "qwen-coder-14b": "Qwen/Qwen2.5-Coder-14B-Instruct",
    "qwen-coder-32b": "Qwen/Qwen2.5-Coder-32B-Instruct",
}

model_path = f"models/zerolang-{MODEL}-colab"
base_model_name = BASE_MODELS[MODEL]

print(f"Loading {model_path}...")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, model_path)
model.eval()
print("Model loaded!")

In [None]:
def generate_wat(instruction, max_tokens=1024):
    messages = [
        {"role": "system", "content": "You are ZeroLang, an AI that generates optimized WebAssembly (WAT) code. Output only valid WAT code."},
        {"role": "user", "content": instruction},
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.2,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )
    
    # Decode only the newly generated tokens, skipping the echoed prompt
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    return response.strip()

In [None]:
# Test with various prompts
test_prompts = [
    "Implement: int add(int a, int b)",
    "Implement: int factorial(int n)",
    "Implement: void swap(int *a, int *b)",
    "Implement: int max(int a, int b)",
    "Implement: int fibonacci(int n)",
]

for prompt in test_prompts:
    print(f"\n{'='*60}")
    print(f"Input: {prompt}")
    print('='*60)
    wat = generate_wat(prompt)
    print(wat[:800])  # slicing already handles outputs shorter than 800 chars
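
Printed output alone does not show whether a response is well-formed. One lightweight check, sketched here as an assumption rather than as part of the ZeroLang repository, is to pull the first balanced `(module ...)` expression out of a response so it can be handed to a validator. The helper name `extract_module` is hypothetical, and the simple parenthesis counting ignores parentheses inside string literals and comments.

```python
def extract_module(text: str):
    """Return the first balanced `(module ...)` span in text, or None."""
    start = text.find("(module")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # parentheses never closed, e.g. a truncated generation
```

A `None` result is a cheap signal that a generation was cut off by `max_tokens` or did not produce a module at all.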

## 5️⃣ Save & Export

In [None]:
# Save to Google Drive
if SAVE_TO_DRIVE:
    import shutil
    output_name = f"zerolang-{MODEL}-colab"
    
    # Copy model
    shutil.copytree(f"models/{output_name}", f"{DRIVE_OUTPUT}/{output_name}", dirs_exist_ok=True)
    
    # Copy data
    shutil.copy("data/colab_training.jsonl", f"{DRIVE_OUTPUT}/training_data.jsonl")
    
    print(f"✅ Saved to Google Drive: {DRIVE_OUTPUT}")
    !ls -la {DRIVE_OUTPUT}

In [None]:
# Or download as zip
!zip -r zerolang-model.zip models/zerolang-{MODEL}-colab data/colab_training.jsonl

from google.colab import files
files.download('zerolang-model.zip')

## 📊 Summary

In [None]:
print("="*60)
print("🎉 ZeroLang Training Complete!")
print("="*60)
print(f"\nData collected: {len(pairs)} pairs")
print(f"Model: {MODEL}")
print(f"Epochs: {EPOCHS}")
print(f"\nOutput: models/zerolang-{MODEL}-colab")
if SAVE_TO_DRIVE:
    print(f"Google Drive: {DRIVE_OUTPUT}")