<a href="https://colab.research.google.com/github/mahb97/Wake2vec/blob/main/Wake2Vec_morpheme_expansion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wake2Vec Morpheme Expansion Pipeline

This notebook documents a controlled procedure for integrating Joyce-style neologisms into a compact GPT-type language model through morphology-aware token expansion. I curate a small lexicon of prefixes and suffixes and generate synthetic candidates, then extend the tokenizer to admit previously split neologisms as single tokens. New embeddings are initialised by morphemic composition, using the rule \(E(\text{word}) = \alpha\,E(\text{prefix}) + (1 - 2\alpha)\,E(\text{root}) + \alpha\,E(\text{suffix}) + \varepsilon\), where \(\alpha\) is a fixed weight and \(\varepsilon\) is small Gaussian noise that prevents identical vectors. Training proceeds in two stages: an embedding-only warm-up on a mixture of synthetic lines and Finnegans *Wake* text, followed by a short full-model fine-tune under conservative schedules suitable for a T4 environment.

 I report top-five neighbor overlap for the newly introduced tokens before and after training, track shifts in embedding norms, provide a t-SNE projection of the new tokens against pre-training neighbor centroids, and save JSON snapshots of neighborhoods at each stage. These diagnostics are intended to show coherent integration of the new forms into the embedding space rather than collapse or runaway drift, and to make the procedure straightforward to reproduce on modest hardware.

**Config**

Base model: `TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T`. Composition weight \(\alpha = 0.25\). Maximum sequence length set to 1024 to respect T4 memory limits. Batching uses `per_device_train_batch_size = 1` with `gradient_accumulation_steps = 8`, attention implementation set to `eager`, and `use_cache = False`. Phase 1 trains input embeddings and the tied output head only; Phase 2 unfreezes all parameters with a warm-up ratio of 0.10 and light weight decay. All runs write plots and machine-readable artifacts to `runs/<RUN_ID>/` and generate a brief HTML report.

---

## Run controls
- **BASE_MODEL:** `TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T`
- **α (composition weight):** `0.25` (can tune)
- **Max seq length:** `1024` (T4-safe; raise only if VRAM allows)
- **Batching:** `per_device_train_batch_size=1`, `gradient_accumulation_steps=8`
- **Attn impl:** `eager` (avoid SDPA spikes on T4)
- **Two phases:**
  - **Phase 1:** embeddings + lm_head only, Adafactor/8-bit Adam, 1 epoch
  - **Phase 2:** full model, short run, warmup 0.10

## Inputs
- `data/FW_TEXT.txt` — Finnegans Wake plain text (slice for demo)
- `data/morpheme_data.json` or `data/morphemes.csv`  
  Structure maps:
  - `prefixes`: `{ prefix → [example words…] }`
  - `suffixes`: `{ suffix → [example words…] }`

## Outputs (per run)
- `runs/<RUN_ID>/metrics/`
  - `pre_morpheme_snapshot.json`
  - `morpheme_comparison_p1.json` *(midpoint, after Phase 1)*
  - `morpheme_comparison.json` *(final, after Phase 2)*
  - `summary_stats_p1.json`, `summary_stats.json`
- `runs/<RUN_ID>/plots/`
  - `hist_overlap_top5(_p1).png`, `hist_norm_change(_p1).png`
  - `scatter_norm_vs_overlap.png`, `tsne_newtokens_vs_precentroids.png`
- `reports/Wake2Vec_Report.html`

## Quickstart
1. **Reset & install** deps (Colab-friendly).  
2. **Load data** (prefers JSON).  
3. **Generate** synthetic forms (prefix + root + suffix).  
4. **Expand tokenizer** (add new tokens); compose embeddings with α-rule; tie head.  
5. **Phase 1**: train embeddings only. Saves midpoint snapshot.
6. **Phase 2**: unfreeze and short fine-tune.  
7. **Diagnostics**: compute overlap@5, norm deltas, t-SNE; write HTML report.  


## Diagnostics (what “good” looks like)
- **Top-5 neighbor overlap (pre→post):** ~3–4/5 indicates coherent integration (not collapse).
- **Norm shift (Δ‖E‖):** small positive mean (slight energy increase from training).
- **Qualitative neighbors:** morpheme-aligned (e.g., `presounder` ≈ `resound`, `ensounder`, …).
- **Tokenization:** most synthetic forms now **single IDs**.

## Repro & env
- `RUN_ID = "t4_<unix>"` auto-stamped; seeds fixed at 42.
- Tested on Colab T4 with: `transformers 4.57.1`, `datasets 2.21.0`, `pyarrow 22.0.0`.
- T4 guardrails: `MAX_LEN=1024`, `gradient_checkpointing=True`, attention=`eager`, batch=1 + accum=8.

## Troubleshooting (T4)
- **CUDA OOM** → lower `MAX_LEN` to 768/512; keep batch=1; accum=8–16; ensure `use_cache=False`; `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`.
- **Version noise** → uninstall RAPIDS/TF; pin `transformers 4.57.1`, `datasets 2.21.0`, `pyarrow 22.0.0`.

---

 *Wake2Vec tests morphology-aware token expansion to integrate Joyce-style neologisms into a small language model without destabilising the embedding space. We curate a prefix/suffix lexicon, generate synthetic forms, initialise new vectors by morpheme composition, and train in two phases. Evaluation reports neighbor-overlap@5, embedding-norm shifts, and qualitative neighborhoods, with JSON snapshots for reproducibility.*


In [None]:
!pip -q install --no-cache-dir --upgrade-strategy eager \
  "transformers==4.57.1" "datasets==2.21.0" "accelerate==1.0.1" \
  "peft==0.12.0" "bitsandbytes==0.43.3" \
  "huggingface-hub>=0.34,<1.0" \
  "pyarrow==22.0.0" "numpy==2.0.2" "pandas==2.2.2" "requests==2.32.4" \
  "matplotlib>=3.8" "scikit-learn>=1.5" "umap-learn" "faiss-cpu" "wordfreq" "Unidecode"

import os; os.kill(os.getpid(), 9)  # rr

In [1]:
import numpy as np, torch, transformers, datasets, pyarrow as pa
import inspect
print("Transformers:", transformers.__version__)
print("Datasets    :", datasets.__version__, "from", inspect.getfile(datasets))
print("PyArrow     :", pa.__version__)
print("Torch       :", torch.__version__)

Transformers: 4.57.1
Datasets    : 2.21.0 from /usr/local/lib/python3.12/dist-packages/datasets/__init__.py
PyArrow     : 22.0.0
Torch       : 2.8.0+cu126


In [2]:
print(type(datasets))

<class 'module'>


In [None]:
# paths, run id
from google.colab import drive
from pathlib import Path
import time, random, os, torch, numpy as np
drive.mount('/content/drive', force_remount=True)

PROJECT     = "wake2vec"
BASE_MODEL  = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
RUN_ID      = f"t4_{int(time.time())}"

ROOT        = Path("/content")
PERSIST     = Path("/content/drive/MyDrive")/PROJECT
RUN_DIR     = ROOT/"runs"/RUN_ID
METRICS_DIR = RUN_DIR/"metrics"
PLOTS_DIR   = RUN_DIR/"plots"
REPORTS_DIR = ROOT/"reports"
ADAPT_DIR   = RUN_DIR/"phase2_lora"/"final_adapters"

for d in [RUN_DIR, METRICS_DIR, PLOTS_DIR, REPORTS_DIR, ADAPT_DIR, PERSIST/"runs", PERSIST/"adapters", PERSIST/"reports", PERSIST/"archives", PERSIST/"notebooks"]:
    d.mkdir(parents=True, exist_ok=True)

# hygiene + seeds
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.backends.cuda.matmul.allow_tf32 = True
def set_seed(s=42):
    random.seed(s); np.random.seed(s); torch.manual_seed(s); torch.cuda.manual_seed_all(s)
set_seed(42)

print("RUN_ID:", RUN_ID)
print("RUN_DIR:", RUN_DIR)
print("PERSIST:", PERSIST)

Imports, seeds, run IDs, paths

In [16]:
import os, json, time, random
from pathlib import Path
import numpy as np
import torch

SEED = 42
random.seed(SEED); np.random.seed(SEED)
torch.manual_seed(SEED); torch.cuda.manual_seed_all(SEED)

# Set MANUAL_RUN_ID to the desired run ID, or leave as None to use a timestamp
MANUAL_RUN_ID = "t4_1762051716"

RUN_ID = MANUAL_RUN_ID if MANUAL_RUN_ID is not None else f"t4_{int(time.time())}"
ROOT = Path("/content")
RUN_DIR = ROOT / "runs" / RUN_ID
PLOTS_DIR = RUN_DIR / "plots"
METRICS_DIR = RUN_DIR / "metrics"
REPORTS_DIR = ROOT / "reports"
for p in (PLOTS_DIR, METRICS_DIR, REPORTS_DIR): p.mkdir(parents=True, exist_ok=True)

META = {
    "run_id": RUN_ID, "seed": SEED, "alpha": 0.25,
    "phase1": {"lr": 5e-4, "epochs": 1, "ptd_bs": 8, "grad_accum": 2},
    "phase2": {"lr": 2e-5, "epochs": 2, "warmup_ratio": 0.10, "ptd_bs": 8, "grad_accum": 2, "weight_decay": 0.01}
}
(METRICS_DIR/"meta.json").write_text(json.dumps(META, indent=2))
print("RUN_ID:", RUN_ID)

RUN_ID: t4_1762051716


Load data

In [17]:
from pathlib import Path
import json, csv

DATA_DIR = ROOT/"data"; DATA_DIR.mkdir(parents=True, exist_ok=True)
FW_PATH   = DATA_DIR/"/content/runs/FW_TEXT.txt"
JSON_PATH = DATA_DIR/"/content/runs/morpheme_data.json"
CSV_PATH  = DATA_DIR/"/content/runs/morphemes.csv"

def load_morpheme_csv(path):
    d = {"prefixes": {}, "suffixes": {}}
    with open(path, newline="", encoding="utf-8") as f:
        rdr = csv.reader(f); header = next(rdr, None)
        for row in rdr:
            if not row: continue
            typ, morpheme, *examples = [x.strip() for x in row]
            if typ not in ("prefix","suffix"): continue
            key = "prefixes" if typ=="prefix" else "suffixes"
            ex = [w for w in dict.fromkeys(examples) if w]
            if ex: d[key][morpheme] = ex
    return d

if JSON_PATH.exists():
    MORPHEME_DATA = json.load(open(JSON_PATH, "r", encoding="utf-8"))
elif CSV_PATH.exists():
    MORPHEME_DATA = load_morpheme_csv(CSV_PATH)
else:
    raise FileNotFoundError("Put morpheme_data.json or morphemes.csv in /content/data")

prefixes = MORPHEME_DATA.get("prefixes", {})
suffixes = MORPHEME_DATA.get("suffixes", {})

if not FW_PATH.exists():
    FW_PATH.write_text("Placeholder FW text.\n"*5000, encoding="utf-8")
FW_TEXT = FW_PATH.read_text(encoding="utf-8")

print(f"Prefixes: {len(prefixes)} | Suffixes: {len(suffixes)} | FW chars: {len(FW_TEXT):,}")

Prefixes: 15 | Suffixes: 15 | FW chars: 1,364,712


Synthetic generator

In [7]:
import random
def synthetic_words(n=1200, roots=("river thunder word sound dance queen storm tree night sun rain book".split())):
    out=set()
    pfx_pool=[p for p,ex in prefixes.items() for _ in range(max(1,len(ex)//2+1))]
    sfx_pool=[s for s,ex in suffixes.items() for _ in range(max(1,len(ex)//2+1))]
    for _ in range(max(2*n, 2000)):
        if not pfx_pool or not sfx_pool: break
        p=random.choice(pfx_pool); s=random.choice(sfx_pool); r=random.choice(roots)
        if len(p)+len(r)+len(s)>3: out.add(f"{p}{r}{s}")
        if len(out)>=n: break
    return sorted(out)

SYN_WORDS = synthetic_words()
SYN_LINES = [f"The {w} rolled down the river at night." for w in random.sample(SYN_WORDS, min(400,len(SYN_WORDS)))]
print("Synthetic words:", len(SYN_WORDS), "| synthetic lines:", len(SYN_LINES))

Synthetic words: 1200 | synthetic lines: 400


base model, expand tokenizer, compose embeddings, tie head


In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tok = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, dtype="float32", device_map="auto")

pre_token_splits = {w: tok.encode(w, add_special_tokens=False) for w in SYN_WORDS}
new_tokens = [w for w, ids in pre_token_splits.items() if len(ids) > 1]
added = tok.add_tokens(new_tokens, special_tokens=False)
model.resize_token_embeddings(len(tok), mean_resizing=False)
print(f"Added tokens: {added} | Vocab size: {len(tok)}")

import torch
def avg_vec(terms, emb, tok):
    vecs=[]
    for t in terms:
        ids = tok.encode(t, add_special_tokens=False)
        if len(ids)==1: vecs.append(emb.weight.data[ids[0]])
    return torch.stack(vecs,0).mean(0) if vecs else None

with torch.no_grad():
    emb = model.get_input_embeddings()
    alpha = META["alpha"]; std = emb.weight.data.std().item()
    for w in new_tokens:
        p = next((p for p in prefixes if w.startswith(p)), None)
        s = next((s for s in suffixes if w.endswith(s)), None)
        root = w[len(p):len(w)-len(s)] if (p and s and len(w)>len(p)+len(s)) else w
        vp = avg_vec(prefixes.get(p, []), emb, tok)
        vs = avg_vec(suffixes.get(s, []), emb, tok)
        vr_ids = tok.encode(root, add_special_tokens=False)
        vr = emb.weight.data[vr_ids[0]] if len(vr_ids)==1 else torch.randn(emb.embedding_dim, device=emb.weight.device)*(std*0.5)
        comp = alpha*(vp if vp is not None else vr) + (1-2*alpha)*vr + alpha*(vs if vs is not None else vr)
        comp = comp + torch.randn_like(comp)*(std*0.01)
        emb.weight.data[tok.convert_tokens_to_ids(w)] = comp
    model.lm_head.weight = emb.weight

print("Composed embeddings + tied head.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/560 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.40G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/129 [00:00<?, ?B/s]

Added tokens: 1200 | Vocab size: 33200
Composed embeddings + tied head.


In [9]:
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model.config.pad_token_id = tok.pad_token_id

blocks + PRE snapshot

In [10]:
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import json

MAX_LEN=1024; STRIDE=512
def make_blocks(text, max_len=MAX_LEN, stride=STRIDE):
    ids = tok.encode(text, add_special_tokens=False)
    return [{"input_ids": ids[i:i+max_len]} for i in range(0, max(0,len(ids)-max_len), stride) if len(ids[i:i+max_len])==max_len]

train_text = "\n".join(SYN_LINES) + "\n" + FW_TEXT[:600_000]
valid_text = FW_TEXT[600_000:630_000]
train_ds = Dataset.from_list(make_blocks(train_text))
valid_ds = Dataset.from_list(make_blocks(valid_text))
dc = DataCollatorForLanguageModeling(tok, mlm=False)

print("Train blocks:", len(train_ds), "| Valid blocks:", len(valid_ds))

with torch.no_grad():
    W_pre = model.get_input_embeddings().weight.detach().clone().to("cpu").numpy()
    new_ids = [tok.convert_tokens_to_ids(t) for t in new_tokens]
    sim_pre = cosine_similarity(W_pre[new_ids], W_pre)
    top5_pre = np.argsort(-sim_pre, axis=1)[:,1:6]

json.dump({"new_tokens": new_tokens, "top5_pre": top5_pre[:50].tolist()}, open(METRICS_DIR/"pre_morpheme_snapshot.json","w"), indent=2)
print("Saved pre snapshot.")

AttributeError: partially initialized module 'datasets' has no attribute 'utils' (most likely due to a circular import)

P1 — embeddings-only warm-up

In [11]:
import os, torch, gc
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.backends.cuda.matmul.allow_tf32 = True
gc.collect(); torch.cuda.empty_cache()

In [13]:
# P1 Embedding-only warm-up
import gc, torch
from transformers import Trainer, TrainingArguments

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.backends.cuda.matmul.allow_tf32 = True
gc.collect(); torch.cuda.empty_cache()

# Freeze everything except embeddings and lm_head
def freeze_all_but_embeddings(m):
    for p in m.parameters():
        p.requires_grad = False
    for p in m.get_input_embeddings().parameters():
        p.requires_grad = True
    for p in m.lm_head.parameters():
        p.requires_grad = True

freeze_all_but_embeddings(model)

# Trainer args — tiny batch, big accum, checkpointing
model.config.use_cache = False
args1 = TrainingArguments(
    output_dir=str(RUN_DIR/"phase1"),
    per_device_train_batch_size=1,     # tiny batch
    gradient_accumulation_steps=8,
    learning_rate=META["phase1"]["lr"],
    num_train_epochs=META["phase1"]["epochs"],
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=400,
    eval_steps=400,
    logging_strategy="steps",
    logging_steps=50,
    gradient_checkpointing=True,
    fp16=False,                        # fp16 fragile on T4
    load_best_model_at_end=False,
    report_to="none",
    optim="adamw_bnb_8bit",            # use 8-bit Adam if bitsandbytes is present
)

trainer1 = Trainer(
    model=model,
    args=args1,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=dc,
)

trainer1.train()

ImportError: cannot import name 'Trainer' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)

In [36]:
# MID SNAPSHOT
import json, torch, numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path
import matplotlib.pyplot as plt

MID_DIR = METRICS_DIR  # reuse same folder

with torch.no_grad():
    W_mid = model.get_input_embeddings().weight.detach().clone().to("cpu").numpy()
    sim_mid = cosine_similarity(W_mid[new_ids], W_mid)
    top5_mid = np.argsort(-sim_mid, axis=1)[:,1:6]

def overlap_at5(a,b): return len(set(a.tolist()) & set(b.tolist()))

overlaps_p1 = np.array([overlap_at5(top5_pre[i], top5_mid[i]) for i in range(len(new_ids))])
norms_pre   = np.linalg.norm(W_pre[new_ids], axis=1)
norms_mid   = np.linalg.norm(W_mid[new_ids], axis=1)
norm_deltas_p1 = norms_mid - norms_pre

summary_p1 = {
    "phase": "phase1",
    "compared_tokens": int(len(new_ids)),
    "mean_top5_overlap": float(np.mean(overlaps_p1)) if len(overlaps_p1) else None,
    "mean_norm_delta": float(np.mean(norm_deltas_p1)) if len(norm_deltas_p1) else None,
}

# save JSONs
(Path(MID_DIR)/"morpheme_comparison_p1.json").write_text(
    json.dumps({
        "top5_pre": top5_pre.tolist(),
        "top5_mid": top5_mid.tolist(),
        "overlap@5": overlaps_p1.tolist(),
        "norm_deltas": norm_deltas_p1.tolist(),
    }, indent=2)
)
(Path(MID_DIR)/"summary_stats_p1.json").write_text(json.dumps(summary_p1, indent=2))
print("Phase-1 summary:", summary_p1)

# quick plots
PLOTS_DIR.mkdir(parents=True, exist_ok=True)
plt.figure(); plt.hist(overlaps_p1, bins=[-0.5,0.5,1.5,2.5,3.5,4.5,5.5])
plt.title("Top-5 overlap (PRE → MID)"); plt.xlabel("Overlap"); plt.ylabel("Freq")
plt.tight_layout(); plt.savefig(PLOTS_DIR/"hist_overlap_top5_p1.png", dpi=180); plt.close()

plt.figure(); plt.hist(norm_deltas_p1, bins=30)
plt.title("Embedding norm change (MID − PRE)"); plt.xlabel("Δ norm"); plt.ylabel("Freq")
plt.tight_layout(); plt.savefig(PLOTS_DIR/"hist_norm_change_p1.png", dpi=180); plt.close()

Phase-1 summary: {'phase': 'phase1', 'compared_tokens': 1200, 'mean_top5_overlap': 3.1283333333333334, 'mean_norm_delta': 0.07451333105564117}


P2 full-model fine-tune

In [37]:
# capture init for new ids
with torch.no_grad():
    E_init = model.get_input_embeddings().weight.data.clone()

# custom loss wrapper
LAMBDA = 1e-4
def add_anchor_loss(outputs, inputs):
    input_ids = inputs["input_ids"]
    emb = model.get_input_embeddings().weight
    ids = torch.unique(input_ids)
    ids = ids[ids >= 0]
    return LAMBDA * (emb[ids] - E_init[ids]).pow(2).mean()

In [43]:
MAX_LEN = 384
STRIDE  = 192

def make_blocks(text, max_len=MAX_LEN, stride=STRIDE):
    ids = tok.encode(text, add_special_tokens=False)
    return [{"input_ids": ids[i:i+max_len]}
            for i in range(0, max(0, len(ids)-max_len), stride)
            if len(ids[i:i+max_len]) == max_len]

train_text = "\n".join(SYN_LINES) + "\n" + FW_TEXT[:600_000]
valid_text = FW_TEXT[600_000:630_000]

from datasets import Dataset
train_ds = Dataset.from_list(make_blocks(train_text))
valid_ds = Dataset.from_list(make_blocks(valid_text))

# ensure pad token + fresh collator
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model.config.pad_token_id = tok.pad_token_id

from transformers import DataCollatorForLanguageModeling
dc = DataCollatorForLanguageModeling(tok, mlm=False)

print("Phase-2 dataset:",
      "Train blocks =", len(train_ds),
      "| Valid blocks =", len(valid_ds),
      "| MAX_LEN =", MAX_LEN)

Phase-2 dataset: Train blocks = 995 | Valid blocks = 46 | MAX_LEN = 384


In [44]:
# === PHASE 2: LoRA adapters (tiny VRAM), no eval ===
import os, gc, torch
from transformers import Trainer, TrainingArguments
!pip -q install peft>=0.11

from peft import LoraConfig, get_peft_model

# hygiene + allocator defrag
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.backends.cuda.matmul.allow_tf32 = True
gc.collect(); torch.cuda.empty_cache()

# Keep the base model as-is; add small trainable adapters
model.config.use_cache = False
model.config._attn_implementation = "eager"

# Typical LoRA targets for LLaMA-family blocks
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity: should be a tiny % of total

# Trainer args — tiny batch + accum; AdamW on small adapter params is fine
args2 = TrainingArguments(
    output_dir=str(RUN_DIR/"phase2_lora"),
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # effective batch while keeping peak minimal
    learning_rate=1.5e-4,               # higher LR for adapters
    num_train_epochs=2,
    warmup_ratio=0.10,
    weight_decay=0.0,                   # usually 0 for LoRA
    eval_strategy="no",                 # skip eval pass to save VRAM
    save_strategy="steps",
    save_steps=2000,
    logging_strategy="steps", logging_steps=100,
    gradient_checkpointing=True,
    fp16=False,
    load_best_model_at_end=False,
    report_to="none",
    optim="adamw_torch",                # adapters are small; AdamW is fine
)

trainer2 = Trainer(
    model=model,
    args=args2,
    train_dataset=train_ds,
    data_collator=dc,
)

print(f"Starting Phase-2 LoRA (MAX_LEN={MAX_LEN}) …")
trainer2.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.


trainable params: 6,307,840 || all params: 1,043,277,824 || trainable%: 0.6046
Starting Phase-2 LoRA (MAX_LEN=384) …


Step,Training Loss
100,4.8614




TrainOutput(global_step=126, training_loss=4.824113331143818, metrics={'train_runtime': 1817.0464, 'train_samples_per_second': 1.095, 'train_steps_per_second': 0.069, 'total_flos': 4471639155671040.0, 'train_loss': 4.824113331143818, 'epoch': 2.0})

In [45]:
# FINAL SNAPSHOT
import json, torch, numpy as np
from sklearn.metrics.pairwise import cosine_similarity

with torch.no_grad():
    W_post = model.get_input_embeddings().weight.detach().clone().to("cpu").numpy()
    sim_post = cosine_similarity(W_post[new_ids], W_post)
    top5_post = np.argsort(-sim_post, axis=1)[:,1:6]

def overlap_at5(a,b): return len(set(a.tolist()) & set(b.tolist()))
overlaps = np.array([overlap_at5(top5_pre[i], top5_post[i]) for i in range(len(new_ids))])
norms_pre = np.linalg.norm(W_pre[new_ids], axis=1)
norms_post = np.linalg.norm(W_post[new_ids], axis=1)
norm_deltas = norms_post - norms_pre

summary = {
    "phase": "phase2_final",
    "compared_tokens": int(len(new_ids)),
    "mean_top5_overlap": float(np.mean(overlaps)),
    "mean_norm_delta": float(np.mean(norm_deltas)),
}
(METRICS_DIR/"morpheme_comparison.json").write_text(json.dumps({
    "top5_pre": top5_pre.tolist(),
    "top5_post": top5_post.tolist(),
    "overlap@5": overlaps.tolist(),
    "norm_deltas": norm_deltas.tolist()
}, indent=2))
(METRICS_DIR/"summary_stats.json").write_text(json.dumps(summary, indent=2))
print("FINAL:", summary)

FINAL: {'phase': 'phase2_final', 'compared_tokens': 1200, 'mean_top5_overlap': 3.1283333333333334, 'mean_norm_delta': 0.07451333105564117}


In [None]:
import matplotlib.pyplot as plt
PLOTS_DIR.mkdir(parents=True, exist_ok=True)

plt.figure(); plt.hist(overlaps, bins=[-0.5,0.5,1.5,2.5,3.5,4.5,5.5])
plt.title("Top-5 neighbor overlap (pre → post)"); plt.xlabel("Overlap"); plt.ylabel("Freq")
plt.tight_layout(); plt.savefig(PLOTS_DIR/"hist_overlap_top5.png", dpi=180); plt.close()

plt.figure(); plt.hist(norm_deltas, bins=30)
plt.title("Embedding norm change (post − pre)"); plt.xlabel("Δ norm"); plt.ylabel("Freq")
plt.tight_layout(); plt.savefig(PLOTS_DIR/"hist_norm_change.png", dpi=180); plt.close()

plt.figure(); plt.scatter(norm_deltas, overlaps, alpha=0.6)
plt.title("Norm change vs Overlap@5"); plt.xlabel("Δ norm"); plt.ylabel("Overlap@5")
plt.tight_layout(); plt.savefig(PLOTS_DIR/"scatter_norm_vs_overlap.png", dpi=180); plt.close()

print("Saved plots to:", PLOTS_DIR)


In [47]:
import json
s_mid = json.loads((METRICS_DIR/"summary_stats_p1.json").read_text()) if (METRICS_DIR/"summary_stats_p1.json").exists() else {}
s_fin = json.loads((METRICS_DIR/"summary_stats.json").read_text())
html = f"""<!DOCTYPE html><html><head><meta charset="utf-8"><title>Wake2Vec — Report {RUN_ID}</title>
<style>body{{font-family:"Times New Roman",serif;line-height:1.35}}.c{{max-width:900px;margin:2rem auto;padding:0 1rem 3rem}}
h1{{font-size:1.9rem;border-bottom:2px solid #000;padding-bottom:.4rem}}</style></head><body>
<div class="c"><h1>Wake2Vec — Interim Report</h1>
<p><b>Run:</b> {RUN_ID}</p>
<ul>
<li><b>Phase 1</b> (PRE→MID): compared={s_mid.get('compared_tokens','—')}, overlap@5={s_mid.get('mean_top5_overlap','—')}, Δ‖E‖={s_mid.get('mean_norm_delta','—')}</li>
<li><b>Phase 2</b> (PRE→POST): compared={s_fin['compared_tokens']}, overlap@5={s_fin['mean_top5_overlap']:.3f}, Δ‖E‖={s_fin['mean_norm_delta']:.5f}</li>
</ul>
<img src="../runs/{RUN_ID}/plots/hist_overlap_top5.png" style="width:100%"><br>
<img src="../runs/{RUN_ID}/plots/hist_norm_change.png" style="width:100%"><br>
<img src="../runs/{RUN_ID}/plots/scatter_norm_vs_overlap.png" style="width:100%">
<ul><li>Metrics: runs/{RUN_ID}/metrics/*.json</li><li>Plots: runs/{RUN_ID}/plots/*.png</li></ul>
</div></body></html>"""
(REPORTS_DIR/"Wake2Vec_Report.html").write_text(html, encoding="utf-8")
print("Report:", REPORTS_DIR/"Wake2Vec_Report.html")

Report: /content/reports/Wake2Vec_Report.html


In [48]:
SAVE_DIR = RUN_DIR/"phase2_lora"/"final_adapters"
SAVE_DIR.mkdir(parents=True, exist_ok=True)
model.save_pretrained(str(SAVE_DIR), safe_serialization=True)
tok.save_pretrained(str(SAVE_DIR))
print("Saved adapters+tokenizer to:", SAVE_DIR)



Saved adapters+tokenizer to: /content/runs/t4_1761966609/phase2_lora/final_adapters


(P3-prep) precompute regulariser targets (should probably log here that this is the next day so had to reconnect, also t4 was calm at 14.3 gb for 2 hours so the above is all totally fine)

In [24]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [25]:
from pathlib import Path
PERSIST_BASE = Path('/content/drive/MyDrive/wake2vec')
RUN_ID = META["run_id"]
PERSIST_BASE.mkdir(parents=True, exist_ok=True)

RUN_DIR     = Path("runs")/RUN_ID
METRICS_DIR = RUN_DIR/"metrics"
PLOTS_DIR   = RUN_DIR/"plots"
REPORTS_DIR = Path("reports")
ADAPTERS_DIR= RUN_DIR/"phase2_lora"/"final_adapters"   # wherever you saved adapters/tokenizer
NB_GLOB     = list(Path("/content").glob("Wake2Vec*.ipynb"))

In [26]:
import shutil, os, time

DEST_RUN   = PERSIST_BASE/'runs'/RUN_ID
DEST_REP   = PERSIST_BASE/'reports'
DEST_ADAPT = PERSIST_BASE/'adapters'/RUN_ID
DEST_ARCH  = PERSIST_BASE/'archives'

DEST_RUN.mkdir(parents=True, exist_ok=True)
DEST_REP.mkdir(parents=True, exist_ok=True)
DEST_ADAPT.mkdir(parents=True, exist_ok=True)
DEST_ARCH.mkdir(parents=True, exist_ok=True)

if METRICS_DIR.exists(): shutil.copytree(METRICS_DIR, DEST_RUN/'metrics', dirs_exist_ok=True)
if PLOTS_DIR.exists():   shutil.copytree(PLOTS_DIR,   DEST_RUN/'plots',   dirs_exist_ok=True)
if ADAPTERS_DIR.exists():shutil.copytree(ADAPTERS_DIR,DEST_ADAPT/'final_adapters', dirs_exist_ok=True)
if REPORTS_DIR.joinpath("Wake2Vec_Report.html").exists():
    shutil.copy(REPORTS_DIR/"Wake2Vec_Report.html", DEST_REP/"Wake2Vec_Report.html")
for nb in NB_GLOB:
    shutil.copy(nb, PERSIST_BASE/'notebooks'/nb.name) if (PERSIST_BASE/'notebooks').mkdir(exist_ok=True, parents=True) or True else None

archive_path = DEST_ARCH/f"{RUN_ID}.tar.gz"
os.system(f"tar -czf {archive_path} runs/{RUN_ID} reports/Wake2Vec_Report.html || true")
print("Saved to Drive:", PERSIST_BASE)

Saved to Drive: /content/drive/MyDrive/wake2vec


In [27]:
import os, glob
print("Metrics:", glob.glob(str(DEST_RUN/'metrics'/'*.json')))
print("Plots:",   glob.glob(str(DEST_RUN/'plots'/'*.png'))[:3], "…")
print("Adapters dir exists:", (DEST_ADAPT/'final_adapters').exists())
print("Report:", (DEST_REP/'Wake2Vec_Report.html').exists())
print("Archive:", archive_path.exists(), archive_path)

Metrics: ['/content/drive/MyDrive/wake2vec/runs/t4_1762051716/metrics/meta.json']
Plots: [] …
Adapters dir exists: False
Report: False
Archive: True /content/drive/MyDrive/wake2vec/archives/t4_1762051716.tar.gz


In [28]:
from pathlib import Path
import tarfile

tar_path = Path("/content/drive/MyDrive/wake2vec/archives/t4_1762051716.tar.gz")

with tarfile.open(tar_path, "r:gz") as tar:
    names = tar.getnames()
    print(f"{len(names)} files in archive")
    # peek at first ~100 paths
    for p in names[:100]:
        print(p)


4 files in archive
runs/t4_1762051716
runs/t4_1762051716/plots
runs/t4_1762051716/metrics
runs/t4_1762051716/metrics/meta.json


In [29]:
import tarfile, os
extract_root = "/content"
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path=extract_root)
print("Extracted to:", extract_root)

import glob
print("Metrics found:", glob.glob("/content/runs/*/metrics/*.json"))
print("Plots found:",   glob.glob("/content/runs/*/plots/*.png"))
print("Reports found:", glob.glob("/content/reports/*.html"))

Extracted to: /content
Metrics found: ['/content/runs/t4_1762051716/metrics/meta.json']
Plots found: []
Reports found: []


  tar.extractall(path=extract_root)


In [31]:
from pathlib import Path
import json, numpy as np, matplotlib.pyplot as plt

RUN_ID = "t4_1762051716"
RUN_DIR = Path(f"/content/runs/{RUN_ID}")
METRICS = RUN_DIR/"metrics"
PLOTS   = RUN_DIR/"plots"; PLOTS.mkdir(parents=True, exist_ok=True)

cmp_path = None
for name in ["/content/runs/t4_1762051716/morpheme_comparison_p1.json"]:
    if (METRICS/name).exists():
        cmp_path = METRICS/name; break

if cmp_path is None:
    raise FileNotFoundError("No morpheme_comparison*.json found; can’t rebuild plots yet.")

data = json.loads(cmp_path.read_text())
overlaps = np.array(data["overlap@5"])
norm_deltas = np.array(data["norm_deltas"])

plt.figure(); plt.hist(overlaps, bins=[-0.5,0.5,1.5,2.5,3.5,4.5,5.5])
plt.title("Top-5 neighbor overlap (pre → post)"); plt.xlabel("Overlap"); plt.ylabel("Freq")
plt.tight_layout(); plt.savefig(PLOTS/"hist_overlap_top5.png", dpi=180); plt.close()

plt.figure(); plt.hist(norm_deltas, bins=30)
plt.title("Embedding norm change (post − pre)"); plt.xlabel("Δ norm"); plt.ylabel("Freq")
plt.tight_layout(); plt.savefig(PLOTS/"hist_norm_change.png", dpi=180); plt.close()

plt.figure(); plt.scatter(norm_deltas, overlaps, alpha=0.6)
plt.title("Norm change vs Overlap@5"); plt.xlabel("Δ norm"); plt.ylabel("Overlap@5")
plt.tight_layout(); plt.savefig(PLOTS/"scatter_norm_vs_overlap.png", dpi=180); plt.close()

print("Plots rebuilt at:", PLOTS)

Plots rebuilt at: /content/runs/t4_1762051716/plots


In [32]:
from pathlib import Path, PurePosixPath
import json

RUN_ID = "t4_1762051716"
RUN_DIR = Path(f"/content/runs/{RUN_ID}")
METRICS = RUN_DIR/"metrics"
PLOTS   = RUN_DIR/"plots"
REPORTS = Path("/content/reports"); REPORTS.mkdir(parents=True, exist_ok=True)

s_mid = json.loads((METRICS/"summary_stats_p1.json").read_text()) if (METRICS/"summary_stats_p1.json").exists() else {}
s_fin = json.loads((METRICS/"summary_stats.json").read_text())    if (METRICS/"summary_stats.json").exists() else {}
s_p3  = json.loads((METRICS/"summary_stats_p3.json").read_text()) if (METRICS/"summary_stats_p3.json").exists() else {}

def img(src):
    p = PurePosixPath(f"../runs/{RUN_ID}/plots/{src}")
    return f'<img src="{p}" style="width:100%">'

html = f"""<!doctype html><html><head><meta charset="utf-8"><title>Wake2Vec — Report {RUN_ID}</title>
<style>body{{font-family:"Times New Roman",serif;line-height:1.35}}.c{{max-width:900px;margin:2rem auto;padding:0 1rem 3rem}}</style>
</head><body><div class="c">
<h1>Wake2Vec — Report</h1>
<p><b>Run:</b> {RUN_ID}</p>
<ul>
<li><b>Phase 1</b>: compared={s_mid.get('compared_tokens','—')}, overlap@5={s_mid.get('mean_top5_overlap','—')}, Δ‖E‖={s_mid.get('mean_norm_delta','—')}</li>
<li><b>Phase 2 (LoRA)</b>: compared={s_fin.get('compared_tokens','—')}, overlap@5={s_fin.get('mean_top5_overlap','—')}, Δ‖E‖={s_fin.get('mean_norm_delta','—')}</li>
<li><b>Phase 3 (embed align)</b>: compared={s_p3.get('compared_tokens','—')}, overlap@5={s_p3.get('mean_top5_overlap','—')}, Δ‖E‖={s_p3.get('mean_norm_delta','—')}</li>
</ul>
{img("hist_overlap_top5.png") if (PLOTS/"hist_overlap_top5.png").exists() else ""}
{img("hist_norm_change.png") if (PLOTS/"hist_norm_change.png").exists() else ""}
{img("scatter_norm_vs_overlap.png") if (PLOTS/"scatter_norm_vs_overlap.png").exists() else ""}
<ul><li>Metrics: runs/{RUN_ID}/metrics/*.json</li><li>Plots: runs/{RUN_ID}/plots/*.png</li></ul>
</div></body></html>"""

(REPORTS/"Wake2Vec_Report.html").write_text(html, encoding="utf-8")
print("Report at:", REPORTS/"Wake2Vec_Report.html")

Report at: /content/reports/Wake2Vec_Report.html


In [34]:
from pathlib import Path
from transformers import AutoTokenizer
import json

# Where we want to save a copy now:
SAVE_DIR = Path("/content/drive/MyDrive/wake2vec/adapters")/"t4_1762051716"/"final_adapters"
SAVE_DIR.mkdir(parents=True, exist_ok=True)

# Places to look (add/adjust paths if needed)
candidates = [
    Path("/content/drive/MyDrive/wake2vec")/"adapters"/"t4_1762051716"/"final_adapters",
    Path("/content/drive/MyDrive/wake2vec")/"adapters"/"final_adapters",
    Path("runs")/"t4_1762051716"/"phase2_lora"/"final_adapters",
    Path("adapters")/"t4_1762051716"/"final_adapters",
]

def has_tok_files(p: Path):
    return any((p/f).exists() for f in ["tokenizer.json","tokenizer_config.json","vocab.json","spiece.model"])

tok_dir = None
for p in candidates:
    if p.exists() and has_tok_files(p):
        tok_dir = p
        break

if tok_dir:
    print("Found tokenizer at:", tok_dir)
    tok = AutoTokenizer.from_pretrained(str(tok_dir), use_fast=True)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    tok.save_pretrained(str(SAVE_DIR))
    print("Tokenizer copied to:", SAVE_DIR)
else:
    print("No saved tokenizer found in common locations.")

No saved tokenizer found in common locations.


In [33]:
from pathlib import Path
SAVE_DIR = Path("/content/drive/MyDrive/wake2vec/adapters")/"t4_1762051716"/"final_adapters"
SAVE_DIR.mkdir(parents=True, exist_ok=True)
model.save_pretrained(str(SAVE_DIR), safe_serialization=True)
tok.save_pretrained(str(SAVE_DIR))
print("Adapters+tokenizer saved to:", SAVE_DIR)

AttributeError: 'NoneType' object has no attribute 'save_pretrained'