feat: implement feature extraction evaluation #190
timenick
left a comment
Review comments posted inline.
Issue 1: `prepare_pipeline` bypasses `super()`

`prepare_pipeline` constructs the HF pipeline from scratch instead of calling `super().prepare_pipeline()`. Every other evaluator (`WinMLTextClassificationEvaluator`, `WinMLImageSegmentationEvaluator`, `WinMLTokenClassificationEvaluator`) follows the pattern of delegating to `super()` first, then customizing. The bypass is documented (sentence-similarity isn't a valid HF pipeline task), but it means:

The task override can be achieved without bypassing the base class, e.g. by temporarily patching `self.config.task` before calling `super()`, or by extracting the shared padding setup into a helper both evaluators call.

Issue 2: `forward()` input parameter names hardcode BERT-specific tensor names

The signature hardcodes `input_ids`, `attention_mask`, `token_type_ids` (BERT-family input names), violating CLAUDE.md Cardinal Rule #1 ("Never hardcode... input/output tensor names. All solutions must be universal and architecture-agnostic."). The body then builds the inputs dict from these specific names; `**kwargs` is silently dropped. Any model whose tokenizer emits different key names (e.g. `pixel_values` for CLIP-style feature extractors) would have those inputs silently lost before reaching `_format_inputs`. The output side (line 67) already uses the shape-based `next(iter(outputs.values()))`, and the same approach works for inputs: accept `**kwargs` only and pass them directly to `_format_inputs(**kwargs)`.

🤖 Generated with Claude Code
TLDR
Spearman's rank correlation is used to evaluate feature extraction and sentence similarity models.

Feature Extraction / Sentence Similarity Evaluator: Design
Version: 1.0
Date: 2026-03-31
Status: Implemented
1. Overview
The feature extraction evaluator measures how well an ONNX sentence embedding model preserves semantic similarity after quantization and deployment. Given a pair of sentences, the model encodes each into a fixed-length embedding vector, and the evaluator computes how well the cosine similarity between embeddings correlates with human-judged similarity scores.
The primary metric is Spearman's rank correlation (`cosine_spearman`), which measures whether the model ranks sentence pairs in the same order as humans, without requiring the similarity scores to match exactly. This is the standard metric used by MTEB (Massive Text Embedding Benchmark) and sentence-transformers for evaluating embedding models.
Both `feature-extraction` and `sentence-similarity` tasks use this evaluator. The two task names are aliases: they share the same inference model class (`WinMLModelForFeatureExtraction`), the same HF pipeline (`feature-extraction`), and the same evaluation logic. The distinction exists because HuggingFace Hub tags models under different task labels, but ModelKit treats them identically.

This evaluator extends the existing `wmk eval` framework (see 3_design.md) with embedding-specific logic for mean pooling, cosine similarity computation, and Spearman correlation via `torchmetrics.regression.SpearmanCorrCoef`.

2. Schemas
2.1 I/O Schema
Input: Column Mapping
Sentence similarity datasets have paired sentences with a continuous similarity score. The evaluator uses `columns_mapping` to locate these fields.

| Mapping key | Dataset column |
|---|---|
| `input_column_1` | `"sentence1"` |
| `input_column_2` | `"sentence2"` |
| `score_column` | `"score"` |

CLI usage:
Output: Evaluation Result

    {
      "model_id": "sentence-transformers/all-MiniLM-L6-v2",
      "model_path": "~/.cache/winml/artifacts/.../feat_..._model.onnx",
      "task": "feature-extraction",
      "device": "npu",
      "dataset": {
        "path": "mteb/stsbenchmark-sts",
        "split": "test",
        "samples": 1000,
        "shuffle": true,
        "seed": 42,
        "columns_mapping": {
          "input_column_1": "sentence1",
          "input_column_2": "sentence2",
          "score_column": "score"
        }
      },
      "metrics": {
        "cosine_spearman": 82.05
      }
    }

Interpreting `cosine_spearman`: in practice, a well-trained sentence embedding model scores 75-90 on STS-B. A score below 50 typically indicates the model was not trained for semantic similarity (e.g., a classification-only BERT). Negative scores are extremely rare and would indicate a fundamentally broken model.
2.2 Dataset Ground Truth Schema
The default dataset is STS-B (`mteb/stsbenchmark-sts`), a widely used benchmark for sentence similarity. Each row contains a sentence pair with a human-annotated similarity score:

| Column | Type |
|---|---|
| `sentence1` | `str` |
| `sentence2` | `str` |
| `score` | `float` |

The score scale varies by dataset (STS-B uses 0-5). Since Spearman correlation operates on ranks, the evaluator is scale-agnostic: it works with any monotonic score range.
2.3 Model Output Schema
The HuggingFace `feature-extraction` pipeline returns a nested list of token-level embeddings with shape `[1, seq_len, hidden_dim]`.

This output is then mean-pooled (section 3.1) into a single sentence embedding of shape `[hidden_dim]`.

3. Spearman Correlation Metric
3.1 From Model Output to Sentence Embedding
A sentence embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`) takes a text string as input and produces token-level embeddings, one vector per token in the sequence:

$$\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_L] \in \mathbb{R}^{L \times d}$$

where $L$ is the sequence length (padded to a fixed size, e.g., 512) and $d$ is the hidden dimension (e.g., 384).
To obtain a single sentence embedding $\mathbf{s} \in \mathbb{R}^{d}$, we apply attention-mask-weighted mean pooling. This averages only over real tokens, excluding padding positions:

$$\mathbf{s} = \frac{\sum_{i=1}^{L} m_i \, \mathbf{h}_i}{\sum_{i=1}^{L} m_i}$$

where $m_i \in \{0, 1\}$ is the attention mask (1 for real tokens, 0 for padding) and $\mathbf{h}_i$ is the token embedding at position $i$.
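The pooling step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the evaluator's code: the function name `masked_mean_pool` and the toy values are hypothetical, and a real implementation would more likely operate on tensors.

```python
from typing import List, Sequence


def masked_mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average token vectors over positions where the mask is 1."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("token_embeddings and attention_mask must have the same length")
    real_tokens = sum(attention_mask)
    if real_tokens == 0:
        raise ValueError("attention_mask selects no tokens")

    hidden_dim = len(token_embeddings[0])
    pooled = [0.0] * hidden_dim
    for vector, mask in zip(token_embeddings, attention_mask):
        if mask:  # padding positions (m_i = 0) contribute nothing
            for j, value in enumerate(vector):
                pooled[j] += value
    # Divide by the number of real tokens, not by the padded length.
    return [total / real_tokens for total in pooled]


# Hypothetical toy input: two real tokens followed by two padding positions (L = 4, d = 2).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]
sentence_embedding = masked_mean_pool(tokens, mask)
```

The padding rows are deliberately given large values to show that they do not leak into the pooled vector.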
3.2 Cosine Similarity Between Sentence Pairs
Given two sentences encoded as embeddings $\mathbf{s}_1$ and $\mathbf{s}_2$, their cosine similarity measures the angle between the vectors:

$$\cos(\mathbf{s}_1, \mathbf{s}_2) = \frac{\mathbf{s}_1 \cdot \mathbf{s}_2}{\|\mathbf{s}_1\| \, \|\mathbf{s}_2\|}$$

The result is in $[-1, 1]$: 1 means identical direction (semantically identical), 0 means orthogonal (unrelated), -1 means opposite.
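A small plain-Python sketch of this computation, assuming embeddings are given as lists of floats (the helper name `cosine_similarity` is illustrative and not taken from the PR):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions
```

Note that the value depends only on direction: scaling either vector leaves the result unchanged, which is why the first example pairs `[3, 4]` with `[6, 8]`.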
3.3 Spearman's Rank Correlation
Spearman's correlation measures whether two variables have the same rank order, regardless of their actual magnitudes. This is crucial for evaluating embedding models because:
Worked example:
Consider 5 sentence pairs with human similarity scores (0-5 scale) and model-predicted cosine similarities:
In this example, the model ranks all pairs in perfect agreement with human judgment, so the Spearman correlation is 1.0. Note that the actual cosine values (0.18-0.91) don't need to match the human scores (0.5-4.8); only the rank order matters.
If quantization caused the model to swap the ranking of two pairs (e.g., it ranked "Cars on road" above "A cat sleeps"), the Spearman correlation would drop, indicating quality degradation from quantization.
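To get a feel for the size of that drop, here is a rough worked calculation. It assumes (these are assumptions, not facts from the example) that there are no tied ranks and that the two swapped pairs sit at adjacent ranks, so exactly two rank differences $d_i$ equal $\pm 1$ and the rest are 0. With no ties, Spearman's correlation reduces to the rank-difference shortcut:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \cdot (1^2 + 1^2)}{5 \cdot (5^2 - 1)} = 1 - \frac{12}{120} = 0.9$$

Under these assumptions a single adjacent swap among 5 pairs lowers the correlation from 1.0 to 0.9. A swap between pairs that are further apart in rank would lower it more.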
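Formally, Spearman's correlation is computed as Pearson's correlation on the rank-transformed data:

$$\rho = \frac{\operatorname{cov}(R_X, R_Y)}{\sigma_{R_X} \, \sigma_{R_Y}}$$

where $R_X$ and $R_Y$ are the rank orderings of the predicted cosine similarities and the ground truth scores respectively, and $\sigma_{R_X}$, $\sigma_{R_Y}$ are their standard deviations.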
The metric is reported as `cosine_spearman` in the range $[-100, 100]$, following the MTEB convention (Spearman $\rho \times 100$). A value of 80+ is typical for well-performing sentence embedding models on STS-B.
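The rank-then-Pearson definition can be sketched in plain Python as below. This is an illustrative re-implementation, not the `torchmetrics` code the evaluator relies on. The function names are hypothetical, tied values are assumed to share the average of their ranks, and the sample numbers are made up.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def cosine_spearman(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Spearman's rho on the two sequences, scaled by 100."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return 100.0 * pearson(average_ranks(predicted), average_ranks(reference))


# Hypothetical scores: human judgments and two sets of model cosine similarities.
human = [0.5, 1.5, 2.5, 3.5, 4.5]
same_order = cosine_spearman([0.20, 0.35, 0.50, 0.70, 0.90], human)
one_swap = cosine_spearman([0.20, 0.50, 0.35, 0.70, 0.90], human)
```

With these made-up inputs, the identically ordered predictions reach the maximum score, while swapping two neighbouring pairs yields a lower one, even though no cosine value matches a human score numerically.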
4. Design Details
4.1 Evaluation Flow
The evaluator follows a four-step flow:
Step 3 happens per sample in a single loop. After iterating through all pairs, the collected cosine similarities and ground truth scores are passed to `SpearmanCorrelationMetric.compute()`.

4.2 Prepare Pipeline
The evaluator creates a HuggingFace `feature-extraction` pipeline and configures it for fixed-shape ONNX inference:
- Pipeline task: `"feature-extraction"`. `"sentence-similarity"` is not a valid HF pipeline task, so both the `feature-extraction` and `sentence-similarity` evaluator tasks use the same pipeline.
- Fixed-shape padding: reads the sequence length from `io_config` (e.g., `[1, 512]`) and sets `padding="max_length"`, `max_length=512`, `truncation=True` on the pipeline's preprocessing parameters. This ensures every input is padded/truncated to the model's fixed sequence length.

4.3 Encode & Compare
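For each sentence pair in the dataset:
1. Encode: `pipe(sentence1)` -> token embeddings `[seq_len, hidden_dim]` (and likewise for the second sentence)
2. Pool: mask-weighted mean pooling (section 3.1) -> sentence embedding `[hidden_dim]`
3. Compare: `cos_sim(emb1, emb2)` -> scalar in [-1, 1]

The cosine similarity and ground truth score are collected for all pairs.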
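The shape of this per-pair loop can be sketched as follows. Everything here is a hypothetical stand-in: `collect_similarities`, `toy_encode`, and `dot` are illustrative names, and the encoder is a toy function rather than the ONNX pipeline with pooling. Only the `columns_mapping` keys come from the design above.

```python
from typing import Callable, Iterable, List, Mapping, Sequence, Tuple

Vector = Sequence[float]


def collect_similarities(
    rows: Iterable[Mapping[str, object]],
    encode: Callable[[str], Vector],
    similarity: Callable[[Vector, Vector], float],
    columns_mapping: Mapping[str, str],
) -> Tuple[List[float], List[float]]:
    """Encode both sentences of each row and collect (prediction, reference) lists."""
    predictions: List[float] = []
    references: List[float] = []
    for row in rows:
        emb1 = encode(str(row[columns_mapping["input_column_1"]]))
        emb2 = encode(str(row[columns_mapping["input_column_2"]]))
        predictions.append(similarity(emb1, emb2))
        references.append(float(row[columns_mapping["score_column"]]))
    return predictions, references


def toy_encode(text: str) -> List[float]:
    """Toy stand-in for 'pipeline + pooling': length and space count as a 2-d vector."""
    return [float(len(text)), float(text.count(" "))]


def dot(a: Vector, b: Vector) -> float:
    """Toy stand-in for the similarity function."""
    return sum(x * y for x, y in zip(a, b))


rows = [
    {"sentence1": "a b", "sentence2": "c d", "score": 4.0},
    {"sentence1": "a", "sentence2": "b c d", "score": 1.0},
]
mapping = {
    "input_column_1": "sentence1",
    "input_column_2": "sentence2",
    "score_column": "score",
}
predictions, references = collect_similarities(rows, toy_encode, dot, mapping)
```

Passing the encoder and similarity function in as callables keeps the loop independent of any particular model, which makes the collection logic easy to exercise in isolation.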
4.4 Compute Metric
After processing all pairs, the evaluator computes Spearman's rank correlation:
Internally, this uses `torchmetrics.regression.SpearmanCorrCoef`, which computes Pearson's correlation on the rank-transformed inputs. The result is multiplied by 100 to match the MTEB reporting convention.

5. Design Decisions
DD-001: Use Spearman Rank Correlation as Primary Metric
Decision: Use Spearman's rank correlation (`cosine_spearman`) as the primary evaluation metric, following the MTEB standard.
Rationale: Sentence embedding evaluation cares about relative ranking, not absolute similarity values. Spearman correlation captures this: if a quantized model shifts all cosine similarities by a constant offset but preserves the ranking, its Spearman score is unchanged, correctly indicating no quality loss. This is exactly the property we need when comparing FP32 PyTorch baselines against quantized ONNX models on NPU.
Alternatives considered: Pearson correlation (sensitive to non-linear distortions in similarity scores), cosine similarity MAE (penalizes constant offsets that don't affect usability).
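The difference between a rank-based metric and an error-based one can be illustrated with a few lines of Python. The similarity values and the 0.05 offset below are hypothetical, chosen only to show the effect:

```python
from typing import List, Sequence


def rank_order(values: Sequence[float]) -> List[int]:
    """Indices of the values sorted ascending (ties broken by position)."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical cosine similarities from a baseline model.
baseline = [0.18, 0.42, 0.55, 0.73, 0.91]
# Hypothetical quantized model: every similarity shifted down by a constant.
shifted = [value - 0.05 for value in baseline]

# The ordering of pairs is identical, so any rank-based metric is unaffected.
same_ranking = rank_order(baseline) == rank_order(shifted)

# A mean-absolute-error metric, by contrast, reports the full offset as error.
mean_abs_error = sum(abs(a - b) for a, b in zip(baseline, shifted)) / len(baseline)
```

Here the ranking check passes while the mean absolute error equals the offset, which is the behaviour the rationale above relies on.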
DD-002: Attention-Mask-Weighted Mean Pooling
Decision: Use attention-mask-weighted mean pooling to convert token embeddings into sentence embeddings, rather than simple mean or CLS token extraction.
Rationale: ONNX models have fixed input shapes (e.g., `[1, 512]`). A sentence with 8 real tokens is padded with 504 padding tokens. Without masking:

Mask-weighted mean pooling is the standard approach used by sentence-transformers and matches the training-time pooling for these models. It produces embeddings consistent with what the model was optimized for.
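To put numbers on the padding example (8 real tokens in a 512-position input), consider what an unmasked simple mean would compute. Every position receives equal weight $1/512$:

$$\bar{\mathbf{h}} = \frac{1}{512} \sum_{i=1}^{512} \mathbf{h}_i, \qquad \underbrace{\frac{8}{512} \approx 1.6\%}_{\text{weight on real tokens}}, \qquad \underbrace{\frac{504}{512} \approx 98.4\%}_{\text{weight on padding}}$$

So in this case almost all of the unmasked average would come from padding positions, whereas the mask-weighted form divides by $\sum_i m_i = 8$ and uses only the real tokens.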
DD-003: Use `feature-extraction` Pipeline for Both Tasks
Decision: Both `feature-extraction` and `sentence-similarity` tasks use the HuggingFace `feature-extraction` pipeline.
Rationale: `"sentence-similarity"` is not a valid HuggingFace pipeline task name. The HF `feature-extraction` pipeline returns raw token embeddings, which is exactly what we need for mean pooling. The `sentence-similarity` task is an alias that routes to the same evaluator and model class (`WinMLModelForFeatureExtraction`), differing only in the HuggingFace model ID resolution at export time.
DD-004: Default Dataset - STS-B (mteb/stsbenchmark-sts)
Decision: Use STS-B as the default evaluation dataset for feature-extraction and sentence-similarity tasks.
Rationale: STS-B (Semantic Textual Similarity Benchmark) is the most widely used benchmark for sentence embedding evaluation. It appears in MTEB, sentence-transformers, and virtually all embedding model papers. Using STS-B as default means:
DD-005: Report Metric in 0-100 Scale
Decision: Report `cosine_spearman` as Spearman's $\rho \times 100$ (e.g., 82.05 instead of 0.8205).
Rationale: This follows the MTEB convention used in leaderboards, model cards, and papers. Values in the 0-100 range are easier to compare and discuss (e.g., "the model scores 82 on STS-B" vs. "the model scores 0.82").