H-6504: vec2slug: URL slug generation from text embeddings #142
indietyp wants to merge 35 commits
PR Summary
Low Risk
Overview
Pipeline & tooling: Adds workspace-based data prep (FineWeb URL slugs and a smaller Haiku-distilled corpus), embedding (OpenRouter, local Harrier, OpenAI Batch), cluster splits, BPE and KMeans vocab paths, …
Repo wiring: …
Reviewed by Cursor Bugbot for commit f13e4a3.


vec2slug: URL slug generation from text embeddings
Adds libs/vec2slug, a research project that generates URL slugs directly from pooled sentence embeddings using a tiny transformer decoder, without re-feeding source text through a language model.

The core claim is that embeddings are a reusable substrate for cheap auxiliary outputs. Slug generation is the proof of concept: if a system already has embeddings for search or deduplication, it can produce human-readable slugs for ~$0 marginal cost (CPU time only) instead of making a Haiku-class LLM call ($0.001/slug).
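
To make the cost comparison concrete, here is a rough sketch of the arithmetic. It assumes one LLM call per slug at the $0.001 figure quoted above; the function name and the batch sizes are illustrative and not part of this PR.

```python
# Illustrative arithmetic only: the per-slug price is the figure quoted in
# this description, and "one call per slug" is an assumption.
LLM_COST_PER_SLUG_USD = 0.001


def llm_slug_cost(n_slugs: int, cost_per_slug: float = LLM_COST_PER_SLUG_USD) -> float:
    """Return the API spend in USD for generating n_slugs with one call each."""
    if n_slugs < 0:
        raise ValueError("n_slugs must be non-negative")
    return n_slugs * cost_per_slug


for n in (1_000, 1_000_000):
    print(f"{n:>9,} slugs: ${llm_slug_cost(n):,.2f} in LLM calls")
```

The embedding-based path avoids this spend only when the embeddings already exist for another purpose; computing them solely for slugs would add its own cost.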
What's in the PR
Full training pipeline (69 files, ~15.5k lines):
Two trained models:
Doubling the parameter count adds only 0.008 Token F1, which is within the ±0.008 confidence interval, so the smaller model is recommended for deployment.
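
For readers unfamiliar with the metric, the sketch below shows one common way to compute a token-level F1 between a predicted and a reference slug, plus the check behind the "within the interval" reading. The exact Token F1 definition used in the eval is not stated here, so the hyphen-split, multiset-overlap version is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 over hyphen-separated slug tokens (assumed definition)."""
    pred = [t for t in predicted.split("-") if t]
    ref = [t for t in reference.split("-") if t]
    if not pred or not ref:
        # Two empty slugs match perfectly; one empty slug scores zero.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both slugs.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def within_interval(delta: float, half_width: float) -> bool:
    """True when a score difference does not exceed the interval half-width."""
    return abs(delta) <= half_width


print(token_f1("rust-guide", "rust-async-guide"))
print(within_interval(0.008, 0.008))
```

Under this reading, a gain of 0.008 against a ±0.008 interval cannot be distinguished from noise, which is the basis for preferring the smaller model.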
HuggingFace publishing pipeline: model card template, eval extraction, ONNX bundling, and upload script targeting hashintel/vec2slug-v1-openai-{small,large}.
Standalone inference script (hf/inference.py): zero-dependency ONNX inference with beam search, also supports PyTorch backend. Runs with uv run directly.
Key findings
Companion
Blog post at hash.dev/blog/vec2slug (separate).