Runs spaCy's full English pipeline — tokenizer · POS tagging · dependency parser · sentence segmentation · lemmatizer · named-entity recognition — in the browser (or any WASM/native host) with no Python, loading spaCy's actual trained weights. Output matches Python spaCy to within floating-point noise. A Pyodide backstop covers everything the native engine doesn't (other languages, transformer models, the full Python API).
| result (held-out 1000-sentence corpus, vs full-pipeline spaCy) | |
|---|---|
| Token boundaries | 99.995 % |
| POS tag (tag_) / UPOS (pos_) | 100.0 % / 100.0 % |
| Dependency parse (UAS / LAS) | 99.92 % / 99.83–99.92 % |
| Sentence boundaries (F1) | 1.0000 |
| Lemma (lemma_) | 100.0 % |
| Named-entity F1 | 1.0000 |
| WASM size | 1.43 MB (518 KB gzipped) |
| Models | en_core_web_sm, en_core_web_md, en_core_web_lg |
spacy-rusty ships as source: you build two artifacts once — the WASM module
(the engine, ~1.4 MB) and a model bundle (spaCy's weights, exported to
model.json + model.safetensors) — then load them from JavaScript or Python.
There is no Python spaCy at runtime.
- Rust (stable) + wasm-pack and the WASM target: rustup target add wasm32-unknown-unknown
- Python 3.13 + spaCy 3.8, used only to export the model bundle, not at runtime.
git clone https://github.com/maymaywa/spacy-rusty
cd spacy-rusty
# a) Export a model bundle (weights + config). Run once per model.
python -m venv .venv && ./.venv/bin/pip install -r export/requirements.txt
./.venv/bin/python -m spacy download en_core_web_sm
./.venv/bin/python export/export_model.py en_core_web_sm bundles/en_core_web_sm
# -> bundles/en_core_web_sm/{model.json, model.safetensors}
# (md/lg also emit vectors_key2row.json + vectors_row2word.json)
# b) Compile the engine to WASM.
cd crates/spacy-rusty
wasm-pack build --target web --release --out-dir pkg
# -> crates/spacy-rusty/pkg/{spacy_rusty.js, spacy_rusty_bg.wasm, ...}

Copy crates/spacy-rusty/pkg/ and your bundles/en_core_web_sm/ into your app (or
serve them from a CDN), then use the bundled web/loader.js helper:
import { loadModel } from './loader.js';
// baseUrl is a directory served with model.json + model.safetensors
const nlp = await loadModel('/models/en_core_web_sm');
const doc = nlp.process('Apple is looking at buying a U.K. startup for $1 billion.');
for (const t of doc.tokens) console.log(t.text, t.pos, t.dep, t.head);
console.log(doc.ents); // [{ start, end, label, text }]
console.log(doc.sents); // [{ start, end, start_token, end_token }]

loader.js imports the engine via the bare specifier spacy-rusty; for a plain
local setup change that import to './pkg/spacy_rusty.js'. If you prefer no helper,
construct the model directly:
import init, { SpacyModel } from './pkg/spacy_rusty.js';
await init();
const model = new SpacyModel(manifestJson, safetensorsBytes, key2rowJson /* md/lg only */);
const doc = model.process('…');

To use the engine from Python, build the C-ABI variant of the module, install
wasmtime, and load it with web/spacy_rusty.py:
cd crates/spacy-rusty
cargo build --target wasm32-unknown-unknown --release --no-default-features --features capi
pip install wasmtime

from spacy_rusty import load
nlp = load("bundles/en_core_web_sm",
"crates/spacy-rusty/target/wasm32-unknown-unknown/release/spacy_rusty.wasm")
doc = nlp("Apple is looking at buying a U.K. startup.")
for t in doc:
print(t.text, t.pos_, t.dep_, t.head.i)
print(doc.ents, list(doc.noun_chunks))

nlp.process(text) (JS) / nlp(text) (Python) yields spaCy-shaped fields:
doc.tokens[].{ text, idx, ws, tag, pos, morph, lemma, head, dep,
is_sent_start, ent_iob, ent_type }
doc.ents[].{ start, end, label, text }
doc.sents[].{ start, end, start_token, end_token }
Vectors/similarity (md/lg): has_vectors, word_vector, doc_vector,
span_vector, similarity, most_similar. Rule matching: matcher,
phrase_matcher, entity_ruler. Serialization: process_json → spaCy's
to_json structure.
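As a sketch of how the per-token fields relate to doc.ents, the function below rebuilds entity spans from ent_iob and ent_type. It assumes tokens are plain dicts using the field names listed above, that ent_iob is one of the strings "B", "I", "O" (spaCy's ent_iob_ convention; the encoding this runtime actually emits may differ), and that spans are token indices with an exclusive end. The sample tokens are made up for illustration.

```python
def ents_from_tokens(tokens):
    """Group B/I-tagged tokens into (start, end, label) token spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tok in enumerate(tokens):
        iob, ent_type = tok["ent_iob"], tok["ent_type"]
        if iob == "I" and start is not None and ent_type == label:
            continue  # still inside the currently open entity
        if start is not None:
            spans.append((start, i, label))  # any other tag closes the open entity
            start, label = None, None
        if iob in ("B", "I"):
            # Open a new entity; a stray "I" with no open entity is treated like "B".
            start, label = i, ent_type
    if start is not None:
        spans.append((start, len(tokens), label))
    return spans


# Hypothetical tokens, shaped like doc.tokens[] but reduced to the fields used here.
sample_tokens = [
    {"text": "Apple", "ent_iob": "B", "ent_type": "ORG"},
    {"text": "bought", "ent_iob": "O", "ent_type": ""},
    {"text": "New", "ent_iob": "B", "ent_type": "GPE"},
    {"text": "York", "ent_iob": "I", "ent_type": "GPE"},
]

for start, end, label in ents_from_tokens(sample_tokens):
    text = " ".join(t["text"] for t in sample_tokens[start:end])
    print(start, end, label, text)
```

In practice doc.ents already carries these spans; the sketch only shows that the token-level and span-level views describe the same annotation.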
web/router.js serves English en_core_web_sm/md from the
fast native engine and routes everything else (other languages, _trf
models, training, the full Python API) to a Pyodide-hosted real spaCy:
import { SpacyRouter } from './router.js';
import { PyodideBackend } from './pyodide_backend.js';
const router = new SpacyRouter({ native, pyodide: new PyodideBackend({ wheels: [/* see WHEELS.md */] }) });
await router.process('en_core_web_sm', text); // -> native (instant)
await router.process('de_core_news_sm', text); // -> Pyodide real spaCy

The router + Pyodide loader are verified in headless Chrome
(web/run_hybrid.mjs boots real CPython 3.12). Loading
real spaCy needs emscripten-built WASM wheels (spaCy's compiled deps aren't in
the Pyodide index); see web/WHEELS.md.
Beyond the core annotation pipeline, the runtime now also covers:
- Word vectors + similarity (md/lg): token.vector / doc.vector / span.vector, .similarity(), and most_similar() (faithful to spaCy's Vectors.most_similar). Verified to <1e-4 vs spaCy. One shared vector table (Rc), so even lg's 411 MB table is held once.
- doc.noun_chunks: spaCy's English syntactic iterator over the parse (exact match on the golden; F1 0.998 held-out, precision 1.0).
- en_core_web_lg: same pipeline, bigger vectors; meets all the fidelity targets above.
- ~2× faster: the dense kernels accumulate in 4-lane f32 (matching Thinc's sgemm); ~66→136 docs/s with byte-identical annotations (a digest over every output field across 1000 sentences × 3 models is unchanged).
- Rule matching: a public Matcher (attrs + IN/NOT_IN/REGEX, numeric LENGTH, and the ?/*/+ quantifiers), PhraseMatcher, and EntityRuler (→ filter_spans). All match spaCy's Matcher output exactly.
- Serialization + Span: doc.to_json() (spaCy structure, exact) and span slicing with the real Span.root algorithm.
- Python binding: web/spacy_rusty.py loads the WASM via wasmtime into a Doc/Token/Span model (no Python spaCy needed); output matches spaCy field-for-field.
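For orientation, spaCy's default Doc.vector is the average of the token vectors, and .similarity() is the cosine similarity between two vectors. A minimal pure-Python sketch of both follows. The helper names and the tiny 3-dimensional vectors are made up for illustration; the runtime's own Rust implementation is not shown here.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors (how a doc/span vector is formed)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical token vectors for two short documents.
doc_a = mean_vector([[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]])
doc_b = mean_vector([[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]])
print(doc_a, doc_b, cosine(doc_a, doc_b))
```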
Fidelity is measured against the full spaCy pipeline (parser enabled).
Adding the parser closed the v1 pos_ gap: spaCy derives some POS values from
the dependency parse (its attribute_ruler has DEP-conditioned rules), so with
the parser wired in, pos_ matches the full pipeline at 100 %.
Two halves connected by a portable model bundle:
- Offline export (export/, Python): loads a spaCy pipeline and extracts everything the runtime needs into model.safetensors (weights) + model.json (config, tokenizer rules, labels, NER moves, attribute_ruler patterns, symbol table, norm tables). Run once per model.
- Runtime (crates/spacy-rusty/, Rust → WASM): reimplements the inference pipeline and loads that bundle, running the components in spaCy's order: text → tokenizer → tok2vec → tagger → parser → attribute_ruler → lemmatizer → NER → Doc.
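To inspect what an exported bundle contains, the safetensors header can be read with the standard library alone: the file starts with an 8-byte little-endian length, followed by that many bytes of JSON mapping tensor names to dtype, shape, and byte offsets. The sketch below parses such a header from an in-memory buffer. The tensor name "tagger.W" and the demo buffer are hypothetical; the names the exporter actually writes are not documented here.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Return the tensor table from a safetensors buffer (header only, no weights)."""
    (header_len,) = struct.unpack("<Q", data[:8])  # u64 little-endian header size
    header = json.loads(data[8:8 + header_len].decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def make_demo_buffer() -> bytes:
    """Build a tiny safetensors buffer holding one 2x2 float32 tensor (hypothetical name)."""
    payload = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    header = {
        "tagger.W": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, len(payload)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


for name, info in read_safetensors_header(make_demo_buffer()).items():
    print(name, info["dtype"], info["shape"], info["data_offsets"])
```

For a real bundle, pass the bytes of bundles/en_core_web_sm/model.safetensors instead of the demo buffer.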
Everything is a faithful port of spaCy/Thinc internals, verified bit-for-bit where possible (hashes, feature keys) and to <1e-4 for the neural tensors:
- Tokenizer: spaCy's explain() algorithm + whitespace rule + special-case matcher (uses fancy-regex for the lookbehind punctuation patterns).
- Hashing: StringStore MurmurHash64A (symbol-aware) + Thinc HashEmbed MurmurHash3_x86_128_uint64.
- tok2vec: MultiHashEmbed (6 tables, + static vectors for _md) + MaxoutWindowEncoder (with the pad=receptive_field boundary handling).
- tagger: affine + argmax over Penn tags.
- parser: transition-based arc-eager (TransitionBasedParser.v2), greedy decode; shares the main tok2vec (Tok2VecListener), 8 state features, moves S/D/L-*/R-*/B. Yields head/dep_, and doc.sents (sentence starts derived from the parse tree's l_edge-of-roots, as spaCy does). The shared transition-scorer (transition.rs) is reused by both the parser and NER.
- attribute_ruler: TAG/LOWER/IS_SPACE/DEP Matcher (incl. IN/NOT_IN/REGEX operators, multiple alternatives per entry) → POS/MORPH/LEMMA. Runs after the parser, so its DEP-conditioned rules fire.
- lemmatizer: rule-based (rule_lemmatize + English is_base_form over morph); the attribute_ruler sets irregular/pronoun lemmas first, the rule lemmatizer fills the rest (overwrite=false).
- NER: transition-based BiluoPushDown, greedy decode (own 4-attr tok2vec, PrecomputableAffine lower, upper linear, 3 state features).
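To make the parser's move set concrete, here is a textbook arc-eager sketch that applies a given sequence of S (shift), D (reduce), L-label (left-arc), and R-label (right-arc) moves and returns heads and labels. It is a simplification under stated assumptions: there is no scoring model, no B (sentence-break) move, and no validity checks, and the exact semantics of spaCy's transition system may differ in details. Roots point at themselves, following spaCy's convention. The function name and the example sentence are made up.

```python
def apply_moves(n_tokens, moves):
    """Run arc-eager moves over token indices 0..n_tokens-1; return (heads, deps)."""
    stack, buffer = [], list(range(n_tokens))
    heads = list(range(n_tokens))  # every token starts as its own head (root)
    deps = ["ROOT"] * n_tokens
    for move in moves:
        name, _, label = move.partition("-")
        if name == "S":  # shift: move the next buffer token onto the stack
            stack.append(buffer.pop(0))
        elif name == "D":  # reduce: pop a token that already has its head
            stack.pop()
        elif name == "L":  # left-arc: stack top becomes a child of the buffer front
            child = stack.pop()
            heads[child], deps[child] = buffer[0], label
        elif name == "R":  # right-arc: buffer front becomes a child of the stack top
            child = buffer.pop(0)
            heads[child], deps[child] = stack[-1], label
            stack.append(child)
        else:
            raise ValueError(f"unsupported move: {move}")
    return heads, deps


# "She eats fish": a hand-written move sequence, not model output.
heads, deps = apply_moves(3, ["S", "L-nsubj", "S", "R-dobj"])
print(heads, deps)
```

A greedy decoder differs only in where the moves come from: at each step it scores the valid moves from the current state and applies the best one.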
The dense kernels (src/ml.rs) have explicit
f32x4 SIMD paths (register-blocked maxout: one x load feeds 8 weight rows),
enabled by default via .cargo/config.toml (target-feature=+simd128). On Moby
Dick (5,000 sentences / 137 K tokens, en_core_web_sm, full pipeline, Node 24):
| engine | tokens/s | wall |
|---|---|---|
| WASM scalar (pre-SIMD) | 1,940 | 70.8 s |
| WASM +simd128 | 10,232 | 13.4 s |
| spaCy (Python, per-doc loop) | 6,925 | 19.8 s |
| spaCy (Python, batched nlp.pipe, BLAS threads) | 10,591 | 13.0 s |
That is 5.3× over the scalar WASM — single-threaded, in-browser, no Python —
landing on par with spaCy's multi-threaded batched fast path and ~1.5× over its
per-doc loop. The SIMD f32x4 accumulator reproduces the scalar 4-lane reduction
lane-for-lane, so every annotation is byte-identical
(tests/perf_identity.rs digests
unchanged; scalar-vs-SIMD WASM digests match exactly).
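Why a 4-lane accumulator can be reproduced exactly by scalar code is easiest to see in a small emulation. The sketch below rounds every intermediate to f32 and keeps four independent partial sums, one per lane, then reduces them in a fixed order. Because float addition is not associative, this generally differs in the last bits from a single sequential accumulator, but any implementation that follows the same lane assignment and reduction order gets the same bits. The final reduction order and the separate multiply-then-add rounding are assumptions made for the example; the actual kernels in src/ml.rs are not shown here.

```python
import struct


def f32(x):
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_4lane(xs, ws):
    """Dot product with four f32 accumulators (element i goes to lane i % 4)."""
    lanes = [0.0, 0.0, 0.0, 0.0]
    for i, (x, w) in enumerate(zip(xs, ws)):
        lane = i % 4
        lanes[lane] = f32(lanes[lane] + f32(x * w))
    # Assumed reduction order: pairwise, (lane0 + lane1) + (lane2 + lane3).
    return f32(f32(lanes[0] + lanes[1]) + f32(lanes[2] + lanes[3]))


def dot_sequential(xs, ws):
    """Dot product with one f32 accumulator, strictly left to right."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = f32(acc + f32(x * w))
    return acc


# Made-up inputs, pre-rounded to f32 so every product is an f32 x f32 product.
xs = [f32(0.1 * i) for i in range(1, 17)]
ws = [f32(1.0 / i) for i in range(1, 17)]
print(dot_4lane(xs, ws), dot_sequential(xs, ws))
```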
In scope (native, English en_core_web_sm/md/lg): tokenizer,
tag_/pos_/morph, dependency parse (head/dep_), sentence
segmentation (doc.sents), lemmas (lemma_), named entities, word
vectors + similarity, noun chunks, and rule matching. Measured against the full
spaCy pipeline (above).
Via the Pyodide backstop (browser, when wheels are vendored): other
languages, transformer (_trf) models, training, and the full Python
Doc/Token/Span API / displacy — i.e. anything the native engine doesn't
do.
Known edge: 1 token in ~20,930 (0.005 %) had a tokenizer boundary difference on the held-out corpus — a rare punctuation/Unicode edge, candidate for follow-up. UAS/LAS are ~99.9 % (the ~0.1 % is float-precision argmax flips in the transition scorer, the same class as tagger near-ties).
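For reference, UAS is the fraction of tokens whose predicted head matches the reference, and LAS additionally requires the dependency label to match. A minimal sketch of that computation follows, assuming both parses share the same tokenization and counting every token. Whether the figures above exclude punctuation or handle tokenization mismatches differently is not stated, so treat this as the plain definition rather than the project's scorer. The example parses are made up.

```python
def attachment_scores(reference, predicted):
    """reference/predicted: aligned lists of (head_index, dep_label). Returns (UAS, LAS)."""
    if len(reference) != len(predicted):
        raise ValueError("parses must cover the same tokens")
    if not reference:
        raise ValueError("cannot score an empty parse")
    total = len(reference)
    head_hits = sum(ref[0] == pred[0] for ref, pred in zip(reference, predicted))
    full_hits = sum(ref == pred for ref, pred in zip(reference, predicted))
    return head_hits / total, full_hits / total


# Hypothetical three-token parse: all heads agree, one label differs.
reference = [(1, "nsubj"), (1, "ROOT"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (1, "obj")]
uas, las = attachment_scores(reference, predicted)
print(f"UAS={uas:.4f} LAS={las:.4f}")
```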
To build, run the full fidelity suite, regenerate golden fixtures, or add a model/component, see CONTRIBUTING.md. The end-to-end verification gate (native tests + headless-Chrome proofs + the wasmtime Python binding) is a single script:
./verify_v3.sh

CI (.github/workflows/ci.yml) compiles the native
and WASM targets on every push and runs the sm+md fidelity suite.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you shall be dual-licensed as above, without any additional terms or conditions.
spacy-rusty is an independent reimplementation of the inference pipeline of spaCy and Thinc by Explosion AI (both MIT-licensed), and loads model artifacts exported from spaCy's trained pipelines. It is not affiliated with or endorsed by Explosion AI. See NOTICE for details.