The recent paper from Cornell University "Harnessing the Universal Geometry of Embeddings" essentially lays the foundation for universal translation between embedding models.
Abstract:
We introduce the first method for translating text embeddings from one vector
space to another without any paired data, encoders, or predefined sets of matches.
Our unsupervised approach translates any embedding to and from a universal latent
representation (i.e., a universal semantic structure conjectured by the Platonic
Representation Hypothesis). Our translations achieve high cosine similarity across
model pairs with different architectures, parameter counts, and training datasets.
The ability to translate unknown embeddings into a different space while preserving
their geometry has serious implications for the security of vector databases. An
adversary with access only to embedding vectors can extract sensitive information
about the underlying documents, sufficient for classification and attribute inference.
As far as LanceDB is concerned, there are two topics worth focusing on imo.
What does the ability to translate between embeddings imply for embedding maintenance and search? (Particularly across models, or in LanceDB terms, across embedding functions.)
What security features should be explored to help safeguard embedding vectors, given the implication that reading embeddings doesn't require access to the encoder?
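To make the first question concrete, here is a toy sketch of cross-model search. It is not the paper's method and not LanceDB API: the two "models" are just 2-D spaces that differ by a known rotation, and `translate_b_to_a` is a hypothetical stand-in for a learned vec2vec-style translator. It only shows why a query embedded by one model ranks documents wrongly against a table embedded by another, and how a geometry-preserving translation would fix that.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotate(vec, theta):
    """Rotate a 2-D vector by theta radians."""
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Assumption: "model B" space is "model A" space rotated by a fixed angle.
# Real embedding spaces are not related this simply.
THETA = 0.7

def translate_b_to_a(vec_b):
    """Hypothetical translator from space B to space A (here: undo the rotation)."""
    return rotate(vec_b, -THETA)

def search(table, query, k=1):
    """Return the ids of the k rows most similar to the query by cosine similarity."""
    ranked = sorted(table.items(), key=lambda kv: cosine(kv[1], query), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Table whose vectors were produced by "model A".
table_a = {
    "doc-cats": (1.0, 0.1),
    "doc-dogs": (0.1, 1.0),
    "doc-cars": (-1.0, 0.2),
}

# The same query text, but embedded by "model B".
query_b = rotate((0.9, 0.2), THETA)

naive = search(table_a, query_b)                         # mixes spaces
translated = search(table_a, translate_b_to_a(query_b))  # translate first

print("naive:", naive)
print("translated:", translated)
```

In this toy setup the naive search returns `doc-dogs`, while translating the query first returns `doc-cats`, which is the nearest document in model A's space. Whether a learned translator is accurate enough for this in practice is exactly the open question.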
There may well be many other features worth considering given the implications of the paper, and I created this thread to provide a means for this
Paper: https://arxiv.org/pdf/2505.12540
Repo: https://github.com/rjha18/vec2vec