Problem
HashingEmbedder embeds each element independently, so two identically named elements in different page regions produce identical embeddings. Yet "Search" next to a text input means something different from "Search" next to a results heading.
Proposal
Include neighboring elements as weak signals:
func (h *HashingEmbedder) vectorizeWithContext(
	el ElementDescriptor,
	prev, next *ElementDescriptor,
) []float32 {
	// Start from the element's own embedding.
	vec := h.vectorize(el.Composite())
	// Blend in each available neighbor at 10% weight so the element's own signal dominates.
	if prev != nil {
		nVec := h.vectorize(prev.Composite())
		for i := range vec {
			vec[i] += nVec[i] * 0.1
		}
	}
	if next != nil {
		nVec := h.vectorize(next.Composite())
		for i := range vec {
			vec[i] += nVec[i] * 0.1
		}
	}
	// Normalize the blended vector before returning it.
	h.normalize(vec)
	return vec
}
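To make the effect concrete, here is a minimal, self-contained sketch of the proposal. Everything outside vectorizeWithContext is an assumption made for illustration: the ElementDescriptor fields (Role, Name), the Composite() format, the Dim field, the FNV-based token hashing in vectorize, the L2 normalize, the neighborWeight constant, and the cosine helper are hypothetical stand-ins, not the project's actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"strings"
)

// ElementDescriptor is a hypothetical stand-in for the real type.
type ElementDescriptor struct {
	Role string
	Name string
}

// Composite joins the fields into the text that gets embedded (assumed format).
func (e ElementDescriptor) Composite() string {
	return e.Role + " " + e.Name
}

// HashingEmbedder is a toy feature-hashing embedder; Dim is an assumed field.
type HashingEmbedder struct {
	Dim int
}

// vectorize hashes each lowercased token into one of Dim buckets.
func (h *HashingEmbedder) vectorize(text string) []float32 {
	vec := make([]float32, h.Dim)
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		hasher := fnv.New32a()
		hasher.Write([]byte(tok))
		vec[hasher.Sum32()%uint32(h.Dim)]++
	}
	return vec
}

// normalize scales vec to unit L2 length in place; zero vectors are left alone.
func (h *HashingEmbedder) normalize(vec []float32) {
	var sum float64
	for _, v := range vec {
		sum += float64(v) * float64(v)
	}
	if sum == 0 {
		return
	}
	norm := float32(math.Sqrt(sum))
	for i := range vec {
		vec[i] /= norm
	}
}

// neighborWeight is the 0.1 blend factor from the proposal, named for clarity.
const neighborWeight = 0.1

// vectorizeWithContext adds each non-nil neighbor's vector at neighborWeight.
func (h *HashingEmbedder) vectorizeWithContext(el ElementDescriptor, prev, next *ElementDescriptor) []float32 {
	vec := h.vectorize(el.Composite())
	for _, n := range []*ElementDescriptor{prev, next} {
		if n == nil {
			continue
		}
		nVec := h.vectorize(n.Composite())
		for i := range vec {
			vec[i] += nVec[i] * neighborWeight
		}
	}
	h.normalize(vec)
	return vec
}

// cosine returns the dot product, which equals cosine similarity for unit vectors.
func cosine(a, b []float32) float64 {
	var dot float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
	}
	return dot
}

func main() {
	h := &HashingEmbedder{Dim: 256}
	search := ElementDescriptor{Role: "button", Name: "Search"}
	input := ElementDescriptor{Role: "textbox", Name: "Query"}
	heading := ElementDescriptor{Role: "heading", Name: "Results"}

	plain := h.vectorizeWithContext(search, nil, nil)
	nearInput := h.vectorizeWithContext(search, &input, nil)
	nearHeading := h.vectorizeWithContext(search, nil, &heading)

	fmt.Printf("no context vs. near input:   %.4f\n", cosine(plain, nearInput))
	fmt.Printf("near input vs. near heading: %.4f\n", cosine(nearInput, nearHeading))
}
```

With a small weight such as 0.1, the two "Search" embeddings should stay close to each other but no longer be identical, which is the tie-breaking signal the proposal is after. The right weight is a tuning question; 0.1 is simply the value from the snippet above.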
Inspired By
- mgrep: context-aware parsing where chunks carry surrounding context
- Word2Vec/CBOW: words defined by their context
Acceptance Criteria
- EmbeddingMatcher.Find() accepts or computes neighbor context