Why is cosine similarity failing in modern embedding systems (RAG, LLMs, search)? #45793
ahsanshaokat
started this conversation in
Ideas & Feature requests
Replies: 1 comment
- "Don't think this really works, because the pretrained model already handles the difference."
Cosine similarity has been the default similarity metric for almost 20 years.
And yes — it worked beautifully back when embeddings were:
• small (300 dimensions)
• from clean text
• from one domain
• short and consistent in length
But 2025 embeddings are totally different.
They are:
• multi-domain
• noisy
• multi-scale (10 words → 400 words)
• multi-modal
• uneven in magnitude
• generated by different models
Cosine similarity was never designed for this world.
❌ The Core Problem
Cosine similarity ignores magnitude completely.
It throws away information about:
• whether a chunk is long or short
• whether a vector is confident or noisy
• whether a document is rich or empty
Cosine only cares about the direction of vectors.
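This property is easy to demonstrate with toy vectors in plain NumPy (no particular embedding model assumed): scaling a vector by any positive constant changes its magnitude but not its cosine similarity.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of L2 norms,
    # i.e. the dot product of the two direction (unit) vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([0.3, 1.2, -0.5])

# All three pairs share a direction, so cosine cannot tell them apart,
# no matter how different their magnitudes are.
print(cosine(v, v))        # → 1.0 (up to float rounding)
print(cosine(v, 10 * v))   # → 1.0 as well
print(cosine(v, 0.1 * v))  # → 1.0 as well
```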
This is why cosine behaves badly in Retrieval-Augmented Generation (RAG):
Example
Query: “What is the meaning of balance in the Quran?”
Chunk A (long, meaningful paragraph):
“The Quran emphasizes balance (Mizan) as a universal moral principle…”
Chunk B (short, noisy):
“Mizan = balance.”
Cosine says: “Both point the same way → same similarity!”
So the system often picks the noisy chunk and ignores the good one.
This leads to hallucination, unstable retrieval, and wrong context selection.
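The failure mode above can be sketched with synthetic vectors (these are hand-picked stand-ins, not outputs of any real embedding model): both chunks point in almost the same direction as the query, so their cosine scores are nearly identical even though their magnitudes differ by a factor of eight.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for embeddings (NOT real model outputs): the query
# and both chunks point in nearly the same direction, but the long
# paragraph carries far more magnitude than the two-word gloss.
query   = np.array([1.00, 0.50])
chunk_a = 16.0 * np.array([0.98, 0.52])  # long, meaningful paragraph
chunk_b =  2.0 * np.array([1.00, 0.49])  # short, noisy "Mizan = balance."

print(cosine(query, chunk_a))
print(cosine(query, chunk_b))
# Both scores come out above 0.999, so a top-1 retriever can pick
# either chunk; magnitude played no role in the ranking.
```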
❌ Multi-Scale Collapse
Real embedding magnitudes vary widely:
• short chunk → 3.1
• long paragraph → 16.4
• OCR text → 0.8
• technical explanation → 22.7
Cosine erases this information.
The result:
• short noisy text wins over long meaningful text
• RAG quality drops
• retrieval becomes unstable
• cross-domain systems fail
This is the hidden crisis of similarity in modern AI.
✔ The Solution: The Mizan Balance Function
Instead of asking:
“Do these vectors point in the same direction?”
Mizan asks:
“Are these vectors balanced relative to their scale?”
It measures:
• direction
• proportional magnitude
• relative confidence
• balance
Mizan fixes cosine's biggest blind spot.
Short noisy vectors no longer outrank long informative vectors.
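The post does not spell out Mizan's actual formula, so the following is only a hypothetical sketch of a magnitude-aware similarity in this spirit: cosine similarity damped by a balance factor that equals 1 when the two norms match and shrinks as they diverge. It will not reproduce the exact numbers quoted elsewhere in this post.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def balanced_similarity(a, b):
    """Hypothetical magnitude-aware similarity (NOT the official Mizan
    formula, which the post does not define): cosine similarity times a
    magnitude-balance factor 2*|a|*|b| / (|a|^2 + |b|^2), which is 1 when
    the norms are equal and tends to 0 as they diverge."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    balance = 2 * na * nb / (na**2 + nb**2)
    return cosine(a, b) * balance

a = np.array([10.0, 0.0])
b = np.array([10.0, 0.0])
c = np.array([2.0, 0.0])

print(balanced_similarity(a, b))  # → 1.0: same direction, same magnitude
print(balanced_similarity(a, c))  # ≈ 0.38: same direction, unbalanced magnitude
```

Under this formulation, two vectors must agree in both direction and scale to score highly, which is the behavior the post attributes to Mizan.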
✔ Real Example
Suppose three vectors point in nearly the same direction, with magnitudes:
• ‖A‖ = 10
• ‖B‖ = 10
• ‖C‖ = 2
Cosine:
cos(A, B) = 0.98
cos(A, C) = 0.97 → almost identical
Mizan:
M(A, B) ≈ 0.97
M(A, C) ≈ 0.61 → correctly penalized
This is exactly what RAG systems need.
✔ When to switch to Mizan
Use Mizan if your system contains:
✔ variable-length text
✔ OCR / noisy data
✔ multi-domain mixed corpora
✔ multi-model embeddings
✔ lengthy documents
✔ paragraph + sentence mixtures
✔ hallucination issues in RAG
Cosine is fine only for academic datasets and clean single-domain text.
✔ Final takeaway
Cosine was the right tool for 2015.
It is the wrong tool for 2025.
Mizan restores:
• scale awareness
• proportional balance
• retrieval stability
• semantic fairness
Cosine measures direction.
Mizan measures meaning.
This shift is essential for next-generation AI search and retrieval.