## Anchored Diffusion Inference Notebook
Use this notebook to load a trained ADLM checkpoint and generate text samples or compute evaluation metrics on a small number of samples.

### Prerequisites
- The working directory is set to the repository root.
- A trained checkpoint is available on disk.

Adjust the parameters in the next cell to point to your checkpoint and sampling configuration.
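Before changing `sampling.steps` or the number of batches, it can help to estimate how long a run will take. The helper below is a minimal sketch, not part of the repository: `estimate_sampling_seconds` and its arguments are hypothetical names, and it assumes a batch size of 1 and a throughput that stays roughly constant across batches. The throughput value should come from your own hardware; the progress bars later in this notebook report roughly 54 to 55 it/s.

```python
def estimate_sampling_seconds(steps, num_batches, its_per_sec):
    """Rough wall-clock estimate for sampling, in seconds.

    Assumes every batch runs `steps` denoising iterations at a constant
    rate of `its_per_sec` iterations per second (an assumption, not a
    guarantee of the sampler).
    """
    if its_per_sec <= 0:
        raise ValueError("its_per_sec must be positive")
    return steps * num_batches / its_per_sec


# Example with the settings used in this notebook and an assumed 54.7 it/s
seconds = estimate_sampling_seconds(steps=4096, num_batches=10, its_per_sec=54.7)
print(f"Estimated sampling time: {seconds / 60:.1f} minutes")
```

The estimate covers only the sampling loop; model loading and metric computation add to it.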

In [1]:
from pathlib import Path
import os

# --- User configuration ----------------------------------------------------
PROJECT_ROOT = Path('/hdd1/lr/discrete/remdm')  # Update if running elsewhere
CHECKPOINT_PATH = PROJECT_ROOT / 'ckpts/adlm-large.ckpt'
# Supported checkpoints: adlm-medium.ckpt (262B tokens), adlm-large.ckpt (524B tokens)
NUM_SAMPLE_BATCHES = 10  # Number of batches to sample in this notebook
OUTPUT_JSON = PROJECT_ROOT / 'outputs/notebook_adlm_samples.json'

# Hydra/Lightning overrides mirroring the command-line evaluation settings
OVERRIDES = [
    'mode=sample_eval',
    'loader.batch_size=1',
    'loader.eval_batch_size=1',
    'eval.perplexity_batch_size=1',
    'data=openwebtext-split',
    'model=small',
    'parameterization=subs',
    'backbone=dit',
    'model.length=1024',
    'time_conditioning=false',
    '+wandb.offline=true',
    'T=0',
    'sampling.steps=4096',
    'sampling.nucleus_p=0.9',
    'sampling.sampler=remdm-loop',
    'sampling.eta=0.02',
    'sampling.t_on=0.55',
    'sampling.t_off=0.05',
    'sampling.alpha_on=0.9',
    f'sampling.num_sample_batches={NUM_SAMPLE_BATCHES}',
    f'sampling.generated_seqs_path={OUTPUT_JSON.as_posix()}',
    f'eval.checkpoint_path={CHECKPOINT_PATH.as_posix()}',
]

# Ensure we operate from the project root so Hydra finds configs
os.chdir(PROJECT_ROOT)
os.environ.setdefault('HYDRA_FULL_ERROR', '1')
print('Working directory set to', Path.cwd())
print('Checkpoint path:', CHECKPOINT_PATH)

Working directory set to /hdd1/lr/discrete/remdm
Checkpoint path: /hdd1/lr/discrete/remdm/ckpts/adlm-large.ckpt
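
Cell 1 prints the configured paths but does not verify them. A small check like the following can catch a wrong `PROJECT_ROOT` or a missing checkpoint before any model code runs. This is a sketch under stated assumptions: `missing_paths` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path


def missing_paths(checkpoint_path, config_dir):
    """Return a list of human-readable problems; an empty list means both paths exist."""
    problems = []
    if not Path(checkpoint_path).is_file():
        problems.append(f"checkpoint not found: {checkpoint_path}")
    if not Path(config_dir).is_dir():
        problems.append(f"config directory not found: {config_dir}")
    return problems
```

In this notebook it could be called as `missing_paths(CHECKPOINT_PATH, PROJECT_ROOT / 'configs')`, raising or printing if the returned list is not empty.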


In [2]:
import logging
from hydra import compose, initialize
from hydra.core.global_hydra import GlobalHydra
from omegaconf import OmegaConf

import dataloader
import utils
from adlm_main import _load_from_checkpoint, generate_samples

# Reset Hydra to allow multiple executions in the same notebook
GlobalHydra.instance().clear()

config_dir = PROJECT_ROOT / 'configs'
if not config_dir.is_dir():
    raise FileNotFoundError(
        f"Hydra config directory not found at {config_dir}. "
        "Run cell 1 to set PROJECT_ROOT correctly or update the path above.")

notebook_dir = PROJECT_ROOT / 'notebooks'
config_path_rel = os.path.relpath(config_dir, start=notebook_dir)
with initialize(version_base=None, config_path=config_path_rel):
    cfg = compose(config_name='config_adlm', overrides=OVERRIDES)

print('Mode:', cfg.mode)
print('Sampling batches:', cfg.sampling.num_sample_batches)
print('Checkpoint:', cfg.eval.checkpoint_path)
print('Generated sequences will be saved to:', cfg.sampling.generated_seqs_path)

# Optional: inspect a compact view of the config
display_dict = OmegaConf.to_container(cfg, resolve=True)
{k: display_dict[k] for k in ['mode', 'data', 'model', 'parameterization', 'sampling']}

Mode: sample_eval
Sampling batches: 10
Checkpoint: /hdd1/lr/discrete/remdm/ckpts/adlm-large.ckpt
Generated sequences will be saved to: /hdd1/lr/discrete/remdm/outputs/notebook_adlm_samples.json


{'mode': 'sample_eval',
 'data': {'train': 'openwebtext-train',
  'valid': 'openwebtext-valid',
  'tokenizer_name_or_path': 'gpt2',
  'cache_dir': '/hdd1/lr/data',
  'wrap': True,
  'streaming': False},
 'model': {'name': 'small',
  'type': 'ddit',
  'hidden_size': 768,
  'cond_dim': 128,
  'length': 1024,
  'n_blocks': 12,
  'n_heads': 12,
  'scale_by_sigma': True,
  'dropout': 0.1,
  'tie_word_embeddings': False},
 'parameterization': 'subs',
 'sampling': {'predictor': 'ddpm_cache',
  'steps': 4096,
  'noise_removal': True,
  'num_sample_batches': 10,
  'num_sample_log': 2,
  'semi_ar': False,
  'stride_length': 1,
  'num_strides': 1,
  'generated_seqs_path': '/hdd1/lr/discrete/remdm/outputs/notebook_adlm_samples.json',
  'nucleus_p': 0.9,
  'eta': 0.02,
  'sampler': 'remdm-loop',
  't_on': 0.55,
  't_off': 0.05,
  'alpha_on': 0.9,
  'dfm': False}}
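
The `OVERRIDES` list in cell 1 is a flat list of `key=value` strings. When sweeping over a few sampling settings it can be easier to keep them in a dictionary and generate the strings. The sketch below assumes nothing about Hydra beyond the `key=value` form already used in this notebook; `build_overrides` is a hypothetical helper name.

```python
def build_overrides(params):
    """Turn a flat {key: value} mapping into 'key=value' override strings.

    Booleans are written in lowercase (true/false) to match the style of
    the overrides used in this notebook.
    """
    overrides = []
    for key, value in params.items():
        if isinstance(value, bool):
            value = str(value).lower()
        overrides.append(f"{key}={value}")
    return overrides


# Example: a subset of the sampling settings from cell 1
print(build_overrides({
    'sampling.steps': 4096,
    'sampling.eta': 0.02,
    'time_conditioning': False,
}))
```

Keys that need Hydra's `+` prefix (such as `+wandb.offline`) can be passed with the prefix included in the key.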

In [3]:
# Prepare tokenizer, logger, and model
import torch
logger = logging.getLogger('adlm_inference')
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setLevel(logging.INFO)
    logger.addHandler(handler)
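logger.setLevel(logging.INFO)
# Avoid duplicate log lines if a handler on the root logger is also configured
logger.propagate = False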

tokenizer = dataloader.get_tokenizer(cfg)
logger.info('Tokenizer loaded. Vocabulary size: %d', tokenizer.vocab_size)

device = 'cuda' if torch.cuda.is_available() and torch.cuda.device_count() > 0 else 'cpu'
logger.info('Using device: %s', device)

model = _load_from_checkpoint(config=cfg, tokenizer=tokenizer)
model = model.to(device)
model.eval()
logger.info('Model loaded and moved to %s', device)

Tokenizer loaded. Vocabulary size: 50257
Using device: cuda
Model loaded and moved to cuda
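
A named logger with its own handler also forwards each record to the handlers of the root logger, so a message can be emitted twice when some other library has attached a root handler. Whether that happens in this environment is an assumption; the standalone sketch below only demonstrates the mechanism with the standard `logging` module. `count_emitted_lines` and the logger name `adlm_inference_demo` are hypothetical.

```python
import io
import logging


def count_emitted_lines(propagate):
    """Log one INFO message and count how many lines reach any handler."""
    root_stream, own_stream = io.StringIO(), io.StringIO()
    root = logging.getLogger()
    root_handler = logging.StreamHandler(root_stream)
    root.addHandler(root_handler)

    demo = logging.getLogger('adlm_inference_demo')
    own_handler = logging.StreamHandler(own_stream)
    demo.addHandler(own_handler)
    demo.setLevel(logging.INFO)
    demo.propagate = propagate
    try:
        demo.info('Using device: %s', 'cpu')
    finally:
        # Restore global logging state so the demo has no side effects
        demo.removeHandler(own_handler)
        root.removeHandler(root_handler)
        demo.propagate = True
    lines = root_stream.getvalue().splitlines() + own_stream.getvalue().splitlines()
    return len(lines)


print('propagate=True :', count_emitted_lines(True))
print('propagate=False:', count_emitted_lines(False))
```

Setting `propagate = False` on the notebook's logger is one way to keep a single copy of each message; removing the extra handler is another.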


In [4]:
# Run sampling/inference
if 'cfg' not in globals():
    raise RuntimeError("Configuration not initialized. Run cell 2 to compose the Hydra config before sampling.")
if 'logger' not in globals():
    raise RuntimeError("Logger not initialized. Run cell 3 before sampling.")
if 'tokenizer' not in globals():
    raise RuntimeError("Tokenizer not initialized. Run cell 3 before sampling.")

# Create the output directory before sampling, in case generate_samples writes the JSON itself
OUTPUT_JSON.parent.mkdir(parents=True, exist_ok=True)

samples, gen_ppl, entropies = generate_samples(cfg, logger, tokenizer)

logger.info('Generated %d samples', len(samples))
logger.info('Average entropy: %.4f', sum(entropies) / max(len(entropies), 1))
logger.info('Generative perplexity: %.3f', gen_ppl)

# Preview a few sample strings
for idx, text in enumerate(samples[: min(5, len(samples))]):
    print('=' * 80)
    print(f'Sample {idx + 1}')
    print(text)

print('Results JSON saved to:', cfg.sampling.generated_seqs_path)

Generating samples.

Starting Batch  0
100%|██████████| 4096/4096 [01:14<00:00, 54.62it/s]
Starting Batch  1
100%|██████████| 4096/4096 [01:14<00:00, 54.68it/s]
Starting Batch  2
100%|██████████| 4096/4096 [01:14<00:00, 54.77it/s]
Starting Batch  3
100%|██████████| 4096/4096 [01:14<00:00, 54.95it/s]
Starting Batch  4
100%|██████████| 4096/4096 [01:15<00:00, 54.54it/s]
Starting Batch  5
100%|██████████| 4096/4096 [01:14<00:00, 55.04it/s]
Starting Batch  6
100%|██████████| 4096/4096 [01:14<00:00, 55.21it/s]
Starting Batch  7
100%|██████████| 4096/4096 [01:14<00:00, 54.96it/s]
Starting Batch  8
100%|██████████| 4096/4096 [01:15<00:00, 54.12it/s]
Starting Batch  9
100%|██████████| 4096/4096 [01:14<00:00, 54.71it/s]

Generated 10 samples
Average entropy: 5.2613
Generative perplexity: 17.095
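
The notebook reports an average entropy and a generative perplexity but does not show how `generate_samples` computes them. As a reading aid, the sketch below shows one common pair of definitions: the Shannon entropy (in nats) of the token frequencies within a sample, and perplexity as the exponential of the mean per-token negative log-likelihood. Both definitions are assumptions about what the numbers mean here, and the function names are hypothetical; the repository's implementation may differ.

```python
import math
from collections import Counter


def token_entropy(token_ids):
    """Shannon entropy in nats of the empirical token distribution of one sample."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def perplexity(token_nlls):
    """exp(mean negative log-likelihood); expects a non-empty list of NLLs in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy inputs, not taken from the run above
print(token_entropy([1, 1, 2, 2]))     # two equally frequent tokens
print(perplexity([math.log(4)] * 3))   # uniform NLL of log(4) per token
```

Under these definitions, a sample that repeats a few tokens has low entropy even if its perplexity under an evaluation model is also low, which is why the two numbers are usually read together.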


Sample 1
<|endoftext|> cost," Trump told CNBC. "But the problem is we need the money, and if we cannot make the money, then it’s not that expensive."

Trump won't tell me his wall" will cost up to 9%. Study of 3 policy budgets says opposite. — Ashley Smith (@_AshleySmith) June 18, 2016

_____________________________________________________________________________

According to one expert, the wall will probably cost $12 billion, but according to another expert, the cost will be far higher than Trump's estimate by at least $20 billion.

So why is Trump always saying such a number? It’s not because he’s lying. On the contrary, he can’t really tell us much at all.

June 16, 2016

Trump has touted his support for a wall to stop the flow of illegal immigrants entering the country, but hasn’t yet offered to reveal the details of its size. John Raoux/Reuters

In an interview with CNBC prior to Trump’s call for a border wall, he explained, in one of his favorite lines, the problem of illegal i