spacy-rusty — a native Rust→WebAssembly runtime for spaCy

License: MIT OR Apache-2.0

Runs spaCy's full English pipeline — tokenizer · POS tagging · dependency parser · sentence segmentation · lemmatizer · named-entity recognition — in the browser (or any WASM/native host) with no Python, loading spaCy's actual trained weights. Output matches Python spaCy to within floating-point noise. A Pyodide backstop covers everything the native engine doesn't (other languages, transformer models, the full Python API).

| | Result (held-out 1000-sentence corpus, vs full-pipeline spaCy) |
|---|---|
| Token boundaries | 99.995 % |
| POS tag (tag_) / UPOS (pos_) | 100.0 % / 100.0 % |
| Dependency parse (UAS / LAS) | 99.92 % / 99.83–99.92 % |
| Sentence boundaries (F1) | 1.0000 |
| Lemma (lemma_) | 100.0 % |
| Named-entity F1 | 1.0000 |
| WASM size | 1.43 MB (518 KB gzipped) |
| Models | en_core_web_sm, en_core_web_md, en_core_web_lg |
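
For readers unfamiliar with the parse metrics above, here is a minimal sketch of the standard UAS/LAS definitions, with Python spaCy's output playing the role of the reference. The function name and argument layout are assumptions for illustration; this is not the repository's scoring script.

```python
from typing import Sequence, Tuple


def attachment_scores(
    ref_heads: Sequence[int],
    ref_deps: Sequence[str],
    pred_heads: Sequence[int],
    pred_deps: Sequence[str],
) -> Tuple[float, float]:
    """Return (UAS, LAS) for one token-aligned sequence.

    UAS counts tokens whose head matches the reference.
    LAS additionally requires the dependency label to match.
    """
    n = len(ref_heads)
    if not (n == len(ref_deps) == len(pred_heads) == len(pred_deps)):
        raise ValueError("all four sequences must have the same length")
    if n == 0:
        return (0.0, 0.0)

    head_ok = 0
    both_ok = 0
    for rh, rd, ph, pd in zip(ref_heads, ref_deps, pred_heads, pred_deps):
        if rh == ph:
            head_ok += 1
            if rd == pd:
                both_ok += 1
    return (head_ok / n, both_ok / n)


# Three tokens, all heads agree, one label differs.
uas, las = attachment_scores(
    [1, 1, 1], ["nsubj", "ROOT", "dobj"],
    [1, 1, 1], ["nsubj", "ROOT", "obj"],
)
```

In this toy case every head matches but one label does not, so UAS is 1.0 while LAS is 2/3.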

Add spacy-rusty to your project

spacy-rusty ships as source: you build two artifacts once — the WASM module (the engine, ~1.4 MB) and a model bundle (spaCy's weights, exported to model.json + model.safetensors) — then load them from JavaScript or Python. There is no Python spaCy at runtime.

Prerequisites (one time)

  • Rust (stable) + wasm-pack and the WASM target: rustup target add wasm32-unknown-unknown
  • Python 3.13 + spaCy 3.8 — used only to export the model bundle, not at runtime.

1. Clone and build the two artifacts

git clone https://github.com/maymaywa/spacy-rusty
cd spacy-rusty

# a) Export a model bundle (weights + config). Run once per model.
python -m venv .venv && ./.venv/bin/pip install -r export/requirements.txt
./.venv/bin/python -m spacy download en_core_web_sm
./.venv/bin/python export/export_model.py en_core_web_sm bundles/en_core_web_sm
# -> bundles/en_core_web_sm/{model.json, model.safetensors}
#    (md/lg also emit vectors_key2row.json + vectors_row2word.json)

# b) Compile the engine to WASM.
cd crates/spacy-rusty
wasm-pack build --target web --release --out-dir pkg
# -> crates/spacy-rusty/pkg/{spacy_rusty.js, spacy_rusty_bg.wasm, ...}
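
Before wiring the artifacts into an app, it can help to confirm that an exported bundle directory is complete. The following is a small sketch based only on the file names listed in the comments above; the helper name and its `has_vectors` flag are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List

# File names taken from the export comments above.
REQUIRED_FILES = ("model.json", "model.safetensors")
VECTOR_FILES = ("vectors_key2row.json", "vectors_row2word.json")  # md/lg only


def missing_bundle_files(bundle_dir: str, has_vectors: bool = False) -> List[str]:
    """Return the expected bundle files that are absent from bundle_dir.

    has_vectors is a hypothetical switch: pass True for md/lg bundles,
    which also emit the two vector lookup files.
    """
    root = Path(bundle_dir)
    expected = REQUIRED_FILES + (VECTOR_FILES if has_vectors else ())
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means every expected file is present, for example `missing_bundle_files("bundles/en_core_web_sm")` after a successful export.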

2a. Use it in a web app (JavaScript / TypeScript)

Copy crates/spacy-rusty/pkg/ and your bundles/en_core_web_sm/ into your app (or serve them from a CDN), then use the bundled web/loader.js helper:

import { loadModel } from './loader.js';

// baseUrl is a directory served with model.json + model.safetensors
const nlp = await loadModel('/models/en_core_web_sm');
const doc = nlp.process('Apple is looking at buying a U.K. startup for $1 billion.');

for (const t of doc.tokens) console.log(t.text, t.pos, t.dep, t.head);
console.log(doc.ents);   // [{ start, end, label, text }]
console.log(doc.sents);  // [{ start, end, start_token, end_token }]

loader.js imports the engine via the bare specifier spacy-rusty; for a plain local setup change that import to './pkg/spacy_rusty.js'. If you prefer no helper, construct the model directly:

import init, { SpacyModel } from './pkg/spacy_rusty.js';
await init();
const model = new SpacyModel(manifestJson, safetensorsBytes, key2rowJson /* md/lg only */);
const doc = model.process('…');

2b. Use it from Python (via wasmtime — still no spaCy)

Build the C-ABI variant of the module, install wasmtime, and load with web/spacy_rusty.py:

cd crates/spacy-rusty
cargo build --target wasm32-unknown-unknown --release --no-default-features --features capi
pip install wasmtime

Then, from Python:

from spacy_rusty import load

nlp = load("bundles/en_core_web_sm",
           "crates/spacy-rusty/target/wasm32-unknown-unknown/release/spacy_rusty.wasm")
doc = nlp("Apple is looking at buying a U.K. startup.")
for t in doc:
    print(t.text, t.pos_, t.dep_, t.head.i)
print(doc.ents, list(doc.noun_chunks))

Output shape

nlp.process(text) (JS) / nlp(text) (Python) yields spaCy-shaped fields:

doc.tokens[].{ text, idx, ws, tag, pos, morph, lemma, head, dep,
               is_sent_start, ent_iob, ent_type }
doc.ents[].{ start, end, label, text }
doc.sents[].{ start, end, start_token, end_token }
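
The per-token `ent_iob` / `ent_type` fields and the `doc.ents` list carry the same information in two forms. As a rough sketch of how one maps to the other, the function below groups tokens into entity spans. It assumes that `ent_iob` holds the strings "B", "I" and "O" (as spaCy's `ent_iob_` does) and that tokens are plain dicts; the engine's actual encoding may differ.

```python
from typing import Dict, List, Tuple


def ents_from_iob(tokens: List[Dict[str, str]]) -> List[Tuple[int, int, str]]:
    """Group IOB-tagged tokens into (start_token, end_token, label) spans.

    end_token is exclusive. A stray "I" with no open entity is ignored.
    """
    spans: List[Tuple[int, int, str]] = []
    start = None
    label = ""
    for i, tok in enumerate(tokens):
        iob = tok["ent_iob"]
        # Anything other than "I" closes the entity that is currently open.
        if start is not None and iob != "I":
            spans.append((start, i, label))
            start = None
        if iob == "B":
            start, label = i, tok["ent_type"]
    if start is not None:
        spans.append((start, len(tokens), label))
    return spans


example = [
    {"text": "Apple", "ent_iob": "B", "ent_type": "ORG"},
    {"text": "is", "ent_iob": "O", "ent_type": ""},
    {"text": "U.K.", "ent_iob": "B", "ent_type": "GPE"},
    {"text": "$", "ent_iob": "B", "ent_type": "MONEY"},
    {"text": "1", "ent_iob": "I", "ent_type": "MONEY"},
    {"text": "billion", "ent_iob": "I", "ent_type": "MONEY"},
]
spans = ents_from_iob(example)
```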

Vectors/similarity (md/lg): has_vectors, word_vector, doc_vector, span_vector, similarity, most_similar. Rule matching: matcher, phrase_matcher, entity_ruler. Serialization: process_json → spaCy's to_json structure.
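
For orientation, spaCy's default similarity is the cosine of two vectors, and its default doc or span vector is the average of the token vectors. The sketch below illustrates those two definitions in plain Python; the function names are illustrative and it does not call the engine.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors component by component."""
    if not vectors:
        return []
    width = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(width)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


doc_vec = mean_vector([[1.0, 3.0], [3.0, 5.0]])  # average of two token vectors
same = cosine([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```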

All of spaCy in the browser (hybrid native + Pyodide)

web/router.js serves English en_core_web_sm/md from the fast native engine and routes everything else (other languages, _trf models, training, the full Python API) to a Pyodide-hosted real spaCy:

import { SpacyRouter } from './router.js';
import { PyodideBackend } from './pyodide_backend.js';

const router = new SpacyRouter({ native, pyodide: new PyodideBackend({ wheels: [/* see WHEELS.md */] }) });
await router.process('en_core_web_sm', text);   // -> native (instant)
await router.process('de_core_news_sm', text);  // -> Pyodide real spaCy

The router + Pyodide loader are verified in headless Chrome (web/run_hybrid.mjs boots real CPython 3.12). Loading real spaCy needs emscripten-built WASM wheels (spaCy's compiled deps aren't in the Pyodide index) — see web/WHEELS.md.
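
The dispatch rule itself is simple: use the native engine when it has the requested model, otherwise hand off to the fallback. The sketch below shows that rule in Python for illustration only. web/router.js is JavaScript, and the class and parameter names here are hypothetical rather than its actual API.

```python
from typing import Callable, Dict


class Router:
    """Route a request to a native handler if one exists, else to a fallback."""

    def __init__(
        self,
        native: Dict[str, Callable[[str], dict]],
        fallback: Callable[[str, str], dict],
    ) -> None:
        self.native = native      # model name -> native handler (hypothetical shape)
        self.fallback = fallback  # called with (model name, text) for everything else

    def process(self, model_name: str, text: str) -> dict:
        handler = self.native.get(model_name)
        if handler is not None:
            return handler(text)
        return self.fallback(model_name, text)


# Stub backends that only report which path handled the request.
router = Router(
    native={"en_core_web_sm": lambda text: {"backend": "native", "text": text}},
    fallback=lambda name, text: {"backend": "pyodide", "model": name, "text": text},
)
native_result = router.process("en_core_web_sm", "Hello")
fallback_result = router.process("de_core_news_sm", "Hallo")
```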


v3 additions

Beyond the core annotation pipeline, the runtime now also covers:

  • Word vectors + similarity (md/lg) — token.vector / doc.vector / span.vector, .similarity(), and most_similar() (faithful to spaCy's Vectors.most_similar). Verified to <1e-4 vs spaCy. One shared vector table (Rc), so even lg's 411 MB table is held once.
  • doc.noun_chunks — spaCy's English syntactic iterator over the parse (exact match on the golden; F1 0.998 held-out, precision 1.0).
  • en_core_web_lg — same pipeline, bigger vectors; meets all the fidelity targets above.
  • ~2× faster — the dense kernels accumulate in 4-lane f32 (matching Thinc's sgemm); ~66→136 docs/s with byte-identical annotations (a digest over every output field across 1000 sentences × 3 models is unchanged).
  • Rule matching — a public Matcher (attrs + IN/NOT_IN/REGEX, numeric LENGTH, and the ?/*/+ quantifiers), PhraseMatcher, and EntityRuler (→ filter_spans). All match spaCy's Matcher output exactly.
  • Serialization + Span: doc.to_json() (spaCy structure, exact) and span slicing with the real Span.root algorithm.
  • Python binding: web/spacy_rusty.py loads the WASM via wasmtime into a Doc/Token/Span model (no Python spaCy needed); output matches spaCy field-for-field.
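
The EntityRuler bullet above mentions filter_spans. As a reference point, spaCy's `filter_spans` utility keeps the longest span when candidates overlap and prefers the earlier one on ties. The sketch below follows that rule using plain `(start, end, label)` tuples with an exclusive end; the tuple representation is an assumption for illustration, not the engine's data structure.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)


def filter_spans(spans: List[Span]) -> List[Span]:
    """Drop overlapping spans, preferring longer ones, then earlier ones."""
    # Longest first; among equal lengths, the smaller start comes first.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[Span] = []
    seen_tokens = set()
    for start, end, label in ordered:
        # Checking both boundary tokens is enough because longer spans go first.
        if start not in seen_tokens and (end - 1) not in seen_tokens:
            kept.append((start, end, label))
            seen_tokens.update(range(start, end))
    return sorted(kept, key=lambda s: s[0])


candidates = [(0, 2, "A"), (1, 4, "B"), (4, 5, "C"), (0, 1, "D")]
filtered = filter_spans(candidates)
```

Here the three-token span wins over the two-token span it overlaps, and the one-token spans that do not collide with it survive.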

Fidelity is measured against the full spaCy pipeline (parser enabled). Adding the parser closed the v1 pos_ gap: spaCy derives some POS values from the dependency parse (its attribute_ruler has DEP-conditioned rules), so with the parser wired in, pos_ matches the full pipeline at 100 %.
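
To make the dependency on the parser concrete, here is a toy sketch of a single-token rule matcher in which one rule is conditioned on DEP. The two rules are hypothetical examples written for this illustration and are not copied from spaCy's shipped attribute_ruler patterns; only exact values and the IN / NOT_IN operators are handled.

```python
from typing import Any, Dict, List, Tuple

Rule = Tuple[Dict[str, Any], Dict[str, str]]  # (pattern, attributes to set)


def token_matches(token: Dict[str, str], pattern: Dict[str, Any]) -> bool:
    """Check one token against one single-token pattern."""
    for attr, want in pattern.items():
        have = token.get(attr)  # None when the attribute was never set
        if isinstance(want, dict):
            if "IN" in want and have not in want["IN"]:
                return False
            if "NOT_IN" in want and have in want["NOT_IN"]:
                return False
        elif have != want:
            return False
    return True


def apply_rules(token: Dict[str, str], rules: List[Rule]) -> Dict[str, str]:
    """Return a copy of token with the attributes of every matching rule applied."""
    out = dict(token)
    for pattern, attrs in rules:
        if token_matches(token, pattern):
            out.update(attrs)
    return out


# Hypothetical rules: tag IN becomes SCONJ when attached as "mark", ADP otherwise.
RULES: List[Rule] = [
    ({"TAG": "IN", "DEP": "mark"}, {"POS": "SCONJ"}),
    ({"TAG": "IN", "DEP": {"NOT_IN": ["mark"]}}, {"POS": "ADP"}),
]

with_parser = apply_rules({"TAG": "IN", "DEP": "mark"}, RULES)
without_parser = apply_rules({"TAG": "IN"}, RULES)  # DEP never set
```

With DEP available the first rule fires; without it only the second rule can match, which is the kind of divergence a parser-less pipeline would show.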

How it works

Two halves connected by a portable model bundle:

  1. Offline export (export/, Python) — loads a spaCy pipeline and extracts everything the runtime needs into model.safetensors (weights) + model.json (config, tokenizer rules, labels, NER moves, attribute_ruler patterns, symbol table, norm tables). Run once per model.
  2. Runtime (crates/spacy-rusty/, Rust → WASM) — reimplements the inference pipeline and loads that bundle (spaCy's order): text → tokenizer → tok2vec → tagger → parser → attribute_ruler → lemmatizer → NER → Doc.

Everything is a faithful port of spaCy/Thinc internals, verified bit-for-bit where possible (hashes, feature keys) and to <1e-4 for the neural tensors:

  • Tokenizer — spaCy's explain() algorithm + whitespace rule + special-case matcher (uses fancy-regex for the lookbehind punctuation patterns).
  • Hashing — StringStore MurmurHash64A (symbol-aware) + Thinc HashEmbed MurmurHash3_x86_128_uint64.
  • tok2vec — MultiHashEmbed (6 tables, +static vectors for _md) + MaxoutWindowEncoder (with the pad=receptive_field boundary handling).
  • tagger — affine + argmax over Penn tags.
  • parser — transition-based arc-eager (TransitionBasedParser.v2), greedy decode; shares the main tok2vec (Tok2VecListener), 8 state features, moves S/D/L-*/R-*/B. Yields head/dep_, and doc.sents (sentence starts derived from the parse tree's l_edge-of-roots, as spaCy does). The shared transition-scorer (transition.rs) is reused by both the parser and NER.
  • attribute_ruler — TAG/LOWER/IS_SPACE/DEP Matcher (incl. IN/NOT_IN/REGEX operators, multiple alternatives per entry) → POS/MORPH/LEMMA. Runs after the parser, so its DEP-conditioned rules fire.
  • lemmatizer — rule-based (rule_lemmatize + English is_base_form over morph); the attribute_ruler sets irregular/pronoun lemmas first, the rule lemmatizer fills the rest (overwrite=false).
  • NER — transition-based BiluoPushDown, greedy decode (own 4-attr tok2vec, PrecomputableAffine lower, upper linear, 3 state features).
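
The parser bullet above says sentence starts are derived from the left edge of each root's subtree. A minimal sketch of that idea follows. It assumes heads are absolute token indices and that a root is its own head (spaCy's `Token.head` convention); the engine's internal representation may differ.

```python
from typing import List


def sentence_starts(heads: List[int]) -> List[bool]:
    """Mark the leftmost token of every root's subtree as a sentence start."""
    n = len(heads)
    # l_edge[i] = smallest token index inside the subtree rooted at i.
    l_edge = list(range(n))
    for i in range(n):
        node = i
        steps = 0
        # Walk up to the root, lowering each ancestor's left edge to i if needed.
        # The step cap guards against malformed (cyclic) head arrays.
        while heads[node] != node and steps < n:
            node = heads[node]
            if i < l_edge[node]:
                l_edge[node] = i
            steps += 1

    starts = [False] * n
    for i, head in enumerate(heads):
        if head == i:  # i is a root
            starts[l_edge[i]] = True
    return starts


# Tokens 0-2 hang off root 1; tokens 3-4 hang off root 4.
flags = sentence_starts([1, 1, 1, 4, 4])
```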

Performance

The dense kernels (src/ml.rs) have explicit f32x4 SIMD paths (register-blocked maxout: one x load feeds 8 weight rows), enabled by default via .cargo/config.toml (target-feature=+simd128). On Moby Dick (5,000 sentences / 137 K tokens, en_core_web_sm, full pipeline, Node 24):

| engine | tokens/s | wall |
|---|---|---|
| WASM scalar (pre-SIMD) | 1,940 | 70.8 s |
| WASM +simd128 | 10,232 | 13.4 s |
| spaCy (Python, per-doc loop) | 6,925 | 19.8 s |
| spaCy (Python, batched nlp.pipe, BLAS threads) | 10,591 | 13.0 s |

That is 5.3× over the scalar WASM — single-threaded, in-browser, no Python — landing on par with spaCy's multi-threaded batched fast path and ~1.5× over its per-doc loop. The SIMD f32x4 accumulator reproduces the scalar 4-lane reduction lane-for-lane, so every annotation is byte-identical (tests/perf_identity.rs digests unchanged; scalar-vs-SIMD WASM digests match exactly).
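
Why lane order matters for byte-identical output: float addition is not associative, so a 4-lane accumulator only reproduces a scalar result if the scalar code sums in the same lane pattern. The sketch below emulates f32 rounding in Python to show a 4-lane dot product next to a strictly sequential one. The final lane reduction order and the tail handling are assumptions for illustration; this is not a transcription of src/ml.rs.

```python
import struct
from typing import Sequence


def f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_sequential(a: Sequence[float], b: Sequence[float]) -> float:
    """One accumulator, elements added strictly left to right."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(f32(x) * f32(y)))
    return acc


def dot_four_lane(a: Sequence[float], b: Sequence[float]) -> float:
    """Four accumulators: element i goes to lane i % 4, lanes are reduced at the end."""
    lanes = [0.0, 0.0, 0.0, 0.0]
    n = min(len(a), len(b))
    main = n - n % 4
    for i in range(0, main, 4):
        for lane in range(4):
            prod = f32(f32(a[i + lane]) * f32(b[i + lane]))
            lanes[lane] = f32(lanes[lane] + prod)
    # Assumed reduction order: (lane0 + lane1) + (lane2 + lane3).
    total = f32(f32(lanes[0] + lanes[1]) + f32(lanes[2] + lanes[3]))
    # Leftover elements are added one by one (assumed tail handling).
    for i in range(main, n):
        total = f32(total + f32(f32(a[i]) * f32(b[i])))
    return total


small = dot_four_lane([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0])
```

On small integers both functions agree exactly; on general inputs they can differ in the last bits, which is why a SIMD path and its scalar counterpart need the same summation pattern to stay byte-identical.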

Scope

In (native, English en_core_web_sm/md/lg): tokenizer, tag_/pos_/morph, dependency parse (head/dep_), sentence segmentation (doc.sents), lemmas (lemma_), named entities, word vectors + similarity, noun chunks, and rule matching. Measured against the full spaCy pipeline (above).

Via the Pyodide backstop (browser, when wheels are vendored): other languages, transformer (_trf) models, training, and the full Python Doc/Token/Span API / displacy — i.e. anything the native engine doesn't do.

Known edge: 1 token in ~20,930 (0.005 %) had a tokenizer boundary difference on the held-out corpus — a rare punctuation/Unicode edge, candidate for follow-up. UAS/LAS are ~99.9 % (the ~0.1 % is float-precision argmax flips in the transition scorer, the same class as tagger near-ties).

Development

To build, run the full fidelity suite, regenerate golden fixtures, or add a model/component, see CONTRIBUTING.md. The end-to-end verification gate (native tests + headless-Chrome proofs + the wasmtime Python binding) is a single script:

./verify_v3.sh

CI (.github/workflows/ci.yml) compiles the native and WASM targets on every push and runs the sm+md fidelity suite.

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you shall be dual-licensed as above, without any additional terms or conditions.

Acknowledgements

spacy-rusty is an independent reimplementation of the inference pipeline of spaCy and Thinc by Explosion AI (both MIT-licensed), and loads model artifacts exported from spaCy's trained pipelines. It is not affiliated with or endorsed by Explosion AI. See NOTICE for details.
