Embedding model and Latin coverage: notes on paraphrase-multilingual-MiniLM-L12-v2 for NVBSE and VGCL #107
Replies: 5 comments 2 replies
-
Empirical Latin evaluation results
Now that all 8 versions on staging have full embeddings (259,998 verses total), here's the cross-lingual semantic search quality matrix the original post asked us to gather. Same query:
Modern languages (control)
DRB (English):
NABRE (English):
CEI2008 (Italian):
Excellent canonical recall in English, decent thematic recall via cross-lingual transfer to Italian.
Latin
NVBSE (Nova Vulgata):
Caveat: this is a separate, unrelated bug:
VGCL (Vulgata Clementina):
These results return the wrong content, and at materially lower similarity (0.46-0.50 vs the 0.60-0.76 we see for modern languages). This confirms the prediction in the original post: cross-lingual recall to classical Latin is the weak spot.
Confirmation of the predicted pattern
So the model behaves exactly as the model card would predict: it's not actively broken on Latin, and the embeddings are valid 384-dim vectors with rough thematic clustering, but the aligned cross-lingual semantic space wasn't trained to include Latin, so cross-lingual recall is poor.
Recommendation
Standing by option B from the original post for VGCL: keep the embeddings (they were free as part of the bigger run), but consider gating
Option C (switch to LaBSE for 109-language coverage) stays on the shelf as a future-improvement track. The cost is real: 768-dim embeddings mean re-running the whole pipeline and altering the column types. But if user feedback says Latin search matters, it's the right destination.
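One way to gate semantic search per version while leaving the stored embeddings untouched is sketched below. This is a minimal illustration, not the project's code: the `GATED_VERSIONS` set, the `GateResult` type, and the `check_semantic_search` function are hypothetical names, and which versions end up gated is an assumption. The message text follows the wording suggested in the original post.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical: versions whose embeddings stay in the database but are not served.
GATED_VERSIONS: FrozenSet[str] = frozenset({"VGCL"})


@dataclass(frozen=True)
class GateResult:
    """Outcome of the gate: whether to run the search, and why not if refused."""
    allowed: bool
    message: Optional[str] = None


def check_semantic_search(version_code: str,
                          gated: FrozenSet[str] = GATED_VERSIONS) -> GateResult:
    """Decide whether a semantic query should run for the given Bible version."""
    code = version_code.strip().upper()
    if code in gated:
        # Refuse early so the caller can return a clear error instead of weak hits.
        return GateResult(False, f"embeddings unavailable for this version ({code})")
    return GateResult(True)
```

Because the gate is a lookup rather than a schema change, ungating a version later only means removing it from the set; the embeddings already computed remain usable.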
-
Empirical comparison: paraphrase-multilingual-MiniLM-L12-v2 vs LaBSE
Setup
Headline — top-1 accuracy
Score: LaBSE 15 / 21 vs MiniLM 14 / 21. Top-1 counts are nearly tied; the shape of the wins and losses is what matters.
Concrete examples
Last Supper, English: query
Same shape in Italian: MiniLM returns Psalm 63:9; LaBSE returns 1 Cor 11:24 + Luke 22:19.
Shepherd Psalm, Latin: query
Beatitudes top-K coherence in Latin: this was the original motivating concern from this discussion. Query
This is the kind of qualitative improvement the discussion was asking about: not just better top-1, but a thematically coherent top-K.
LaBSE regression (Word made flesh): For
Score-scale differences
LaBSE compresses the high end. Any existing
Read
For Latin specifically, the issue this discussion was opened about, LaBSE is better in kind, not just degree. The improvement is most visible in top-K coherence (Beatitudes example) and in resisting surface-token traps (Shepherd Psalm Latin). For English and Italian, the two models are roughly even on this probe set, with LaBSE winning the harder cases (Last Supper in both languages) and losing one (Word made flesh in both languages).
Suggested next steps
Full raw output
All 21 probes × top-5, side by side
-
Follow-up: expanded to 20 probes (60 total queries)
The original probe set was 7 concepts × 3 languages = 21 queries: enough to be suggestive, not enough to be decisive. In particular, the LaBSE regression on "Word made flesh" (both English and Italian) was a single data point. So the probe set was expanded to 20 concepts × 3 languages = 60 queries, weighted toward theological / metaphorical / liturgical language where the first run hinted at LaBSE weakness. Probe set lives in
Headline: per-language top-1 (across all 20 concepts)
The original 7-probe result (15 / 21 vs 14 / 21) was nearly tied. With 20, LaBSE is clearly ahead, driven entirely by Latin.
The Latin story is now unambiguous
Of the 13 newly added probes, MiniLM Latin gets 5 / 13 right; LaBSE Latin gets 12 / 13. The MiniLM misses are systematic, with surface-token noise dragging in unrelated verses:
This is exactly the failure mode this discussion was opened about. The LaBSE column doesn't just rank the right verse first; it consistently provides a thematically coherent top-5, even when the query is terse.
LaBSE regressions on the English/Italian side
Three new English regressions, plus the original
Pattern: the regressions cluster on liturgical / Christological metaphor. LaBSE may have weaker representation of these in its training. None of them pushes the correct verse out of the top-5 entirely; usually it's there at
Score-scale shift, refined
With more probes, the score distribution gap is clearer:
LaBSE genuinely uses more of
Decision-strength read
The Latin numbers more than cancel the English regression, and Italian is comfortably up. Most importantly, the original concern from this discussion (Latin token-noise hits) is unambiguously fixed by LaBSE. Two caveats remain before flipping the switch on production:
#71 (commit-pinning) gets unblocked by this analysis: once the cutover model is settled, pin its commit hash.
Raw output (Latin probes only; the full set is 90KB and exceeds the GitHub comment cap)
Latin probes side-by-side
The English and Italian probe outputs are produced by
-
Investigation: are the LaBSE English regressions a systematic weakness?
The previous follow-up identified four cases where LaBSE missed verses MiniLM caught:
To test this, a focused supplementary probe set of 10 additional English-only Christological / liturgical verses was run: I-Am sayings, eucharistic narrative, Annunciation, Isaian messianic prophecies, Pauline Christological compression, the Gloria. Probe file:
Result: the hypothesis does not survive
Tally: LaBSE 4 wins, MiniLM 1 win, 5 ties. On a dedicated set of 10 English Christological / liturgical probes, LaBSE outperforms MiniLM. The MiniLM
What the original four regressions actually were
Looking back at the misses with this updated context, they read as isolated per-verse artifacts rather than a class:
Of the five, two share a property: highly compressed Pauline/Johannine theological metaphor (
Updated read
The case for LaBSE strengthens. Across the now 30 probes × 3 languages = 90 queries (plus 10 EN-only = 100 queries total):
The previously flagged "liturgical regression" caveat from the prior follow-up was overstated. The actual residual issue is narrower: ~1–2 known verses with abstract Pauline / Johannine theological compression. Worth tracking post-cutover but not blocking.
Updated recommended next steps
-
LaBSE is highly effective for Biblical and Ecclesiastical Latin because its training corpus includes large-scale parallel data from sources like the Latin Vulgate and subsequent liturgical translations. [1, 2, 3]
Why LaBSE works for Church Latin
Performance vs. Alternatives
Tips for your Implementation
Are you building a tool for scholarly research (like tracking how a specific verse is used across different Church Fathers) or a more general liturgical search engine?
[1] [https://iris.unimore.it](https://iris.unimore.it/handle/11380/1371269)
-
While provisioning the embedding deploy lane (#95) and starting to populate verse embeddings on staging, the question came up: does our chosen model handle Latin? We have two Latin Bible versions in the catalog — NVBSE (Nova Vulgata) and VGCL (Vulgata Clementina) — so this matters operationally.
TL;DR
paraphrase-multilingual-MiniLM-L12-v2 does not officially support Latin; it's not in the supported-languages list on the model card.
Why "not supported" doesn't mean "broken"
The model is a distilled sentence-encoder built on top of XLM-R. The base XLM-R was pre-trained on CommonCrawl-100, which does include some Latin. The distillation step that produced paraphrase-multilingual-MiniLM-L12-v2 then aligned the embedding space using parallel sentence pairs across ~50 modern languages, Latin not among them. So:
Net expectation:
Three concrete options
/search (keyword via Postgres FTS) already works for these versions; just don't populate embedding for them, and have /search/semantic and /search/similar reject Latin versions or return a clear "embeddings unavailable for this version" message.
Recommendation
Go with A for now. The empirical test costs nothing — the embeddings are being computed as part of the bigger run anyway. Once done, run a small evaluation:
Compare to the same queries against DRB/CEI2008 to gauge the relative quality. If NVBSE/VGCL recall is materially worse, fall back to B for those two versions specifically. C stays on the shelf as a future-improvement track if there's enough user demand for high-quality cross-lingual Latin search.
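The comparison against the control versions could be scored along these lines. This is a sketch under stated assumptions, not the evaluation actually run: `hit_rate` and `versions_below_control` are hypothetical helpers, results are assumed to arrive as ranked lists of verse references per probe, and the `max_gap` default of 0.2 is an arbitrary stand-in because the post does not quantify "materially worse".

```python
from typing import Dict, Iterable, List, Set

# probe name -> ranked verse references returned by semantic search
Results = Dict[str, List[str]]


def hit_rate(results: Results, expected: Dict[str, Set[str]], k: int = 5) -> float:
    """Fraction of probes whose top-k results contain an expected verse.

    `expected` maps each probe to a set so that alternate references
    (for example, differing psalm numbering between versions) can all count.
    """
    if not expected:
        raise ValueError("no probes supplied")
    hits = 0
    for probe, wanted in expected.items():
        top_k = results.get(probe, [])[:k]
        if any(ref in wanted for ref in top_k):
            hits += 1
    return hits / len(expected)


def versions_below_control(per_version: Dict[str, Results],
                           expected: Dict[str, Set[str]],
                           controls: Iterable[str],
                           k: int = 5,
                           max_gap: float = 0.2) -> List[str]:
    """Non-control versions whose hit rate trails the control mean by more than max_gap."""
    control_list = list(controls)
    rates = {name: hit_rate(res, expected, k) for name, res in per_version.items()}
    baseline = sum(rates[c] for c in control_list) / len(control_list)
    return sorted(name for name, rate in rates.items()
                  if name not in control_list and baseline - rate > max_gap)
```

Called with DRB and CEI2008 as `controls`, the returned list would name the versions that are candidates for the fallback to B. Setting `k=1` gives a strict top-1 comparison instead of top-5.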
Why this is worth pinning
This question will recur every time someone asks "why don't I get good Latin results?" — and the answer is intentional, not a bug. Pinning the analysis here means the next person can read this, run the same evaluation curls, and decide whether the recall is acceptable for their use case. Also useful when we eventually evaluate replacement models — we'll want to A/B test against the LaBSE option.
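When that A/B test happens, a per-probe win/tie/loss tally is one simple way to compare two models on the same probe set. The sketch below is illustrative only: `rank_of` and `tally` are hypothetical names, each probe is assumed to have a single expected verse, and a model "wins" a probe when it ranks that verse strictly higher than the other model does. Other definitions (for example, top-1 only) are equally plausible.

```python
from typing import Dict, List, Optional


def rank_of(expected: str, ranked: List[str]) -> Optional[int]:
    """1-based rank of the expected verse, or None if it is absent."""
    try:
        return ranked.index(expected) + 1
    except ValueError:
        return None


def tally(expected: Dict[str, str],
          model_a: Dict[str, List[str]],
          model_b: Dict[str, List[str]]) -> Dict[str, int]:
    """Count probes where model A ranks the expected verse better, worse, or equally."""
    counts = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for probe, verse in expected.items():
        rank_a = rank_of(verse, model_a.get(probe, []))
        rank_b = rank_of(verse, model_b.get(probe, []))
        # A missing verse sorts after every real rank; two misses count as a tie.
        key_a = rank_a if rank_a is not None else float("inf")
        key_b = rank_b if rank_b is not None else float("inf")
        if key_a < key_b:
            counts["a_wins"] += 1
        elif key_b < key_a:
            counts["b_wins"] += 1
        else:
            counts["ties"] += 1
    return counts
```

Comparing ranks rather than raw scores sidesteps the fact that two models can use different similarity scales, so a score from one is not directly comparable to a score from the other.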
Filed as part of the staging buildout (#95) but kept as its own discussion because future model decisions don't belong buried in a deploy thread.